In the thick of it: Tencent's road to making chips

In hindsight, chip verification engineer Lynda felt that joining Tencent was a bit "hasty".

As a senior engineer with many years in the semiconductor industry, Lynda was a little surprised the first time she saw Tencent's job posting for chip engineers. In January 2019, she joined the Internet giant out of curiosity, ready to roll up her sleeves and do something big.

During the interview, Henry, who led the chip design work, gave her fair warning: "We are making chips from scratch." Lynda tried to read this sentence through the usual low-key tone of the "Goose Factory" (Tencent's nickname), but on her first day of work she was still shocked by the conversation with her colleagues:

- "Where is our simulation tool?" - "We don't have one. It is still being negotiated."

- "What about the verification environment?" - "Not yet."

- "What about the verification process?"

For a chip verification engineer, simulation tools, a verification environment, and a verification process are essential productivity tools. Lynda wanted to take part in the entire chip R&D effort and was not afraid to start from scratch, but she had not expected that even these essentials would all be missing: a case of the "three nos".

When an Internet company moves into semiconductors, the lack of tools is not the biggest problem. "Chip making" is not a simple extension of the business; it often means a more complex industrial chain, a longer accumulation of talent, and a very different ecosystem culture and technical philosophy.

For example, unlike software development, where bugs can keep being fixed later, a chip design problem that is not caught during early verification cannot be repaired: once the chip is taped out, it is reduced to a "brick". The verification engineer, the role Lynda holds, is the gatekeeper who keeps the earlier effort from being wasted.

The importance of this position is self-evident: at many chip companies the ratio of design engineers to verification engineers reaches 1:3. But after Lynda joined the company, she looked around and found that she had only one colleague to work with, and that not a single line of verification code existed.

Only then did Lynda begin to understand what Henry meant by "starting from scratch" and what kind of difficult battle she faced.

01

The pass is as hard as iron, and the first campaign falls short

According to Xie Ming, vice president of Tencent Cloud and general manager of the Cloud Architecture Platform Department, there are more twists and turns behind the story of "starting from scratch".

Xie Ming's Cloud Architecture Platform Department stands behind Tencent's front-end applications, on the front line where Tencent's massive business data pours in. It supports one nationally used application after another, such as QQ, email, WeChat, Weiyun, and streaming video.

By 2013, QQ Photo Album had grown into Tencent's largest storage business. Letting users access their albums faster and more smoothly became an urgent need, which translated into technical questions: can images be transcoded faster? Can they be compressed without losing quality?

Asked these questions again and again, the team came to understand deeply how much innovation in the underlying technology amplifies the value of upper-layer applications. Although they could never stop improving the software architecture, they were keenly aware that only innovation in hardware could deliver a deeper breakthrough.

The question is: How can a team with a background in software develop hardware?

After a round of research, they decided to test the waters with FPGAs (field-programmable gate arrays) first. Compared with the general-purpose chips in computers and mobile phones, an FPGA is a semi-custom circuit in the application-specific integrated circuit (ASIC) field that allows flexible, "semi-customized" development.

Compared with fixed-function chips, an FPGA is more forgiving of errors, and it strikes a good balance among throughput, latency, power consumption, and flexibility. Especially when processing massive amounts of data, an FPGA has a significant advantage of ultra-low latency compared with a GPU, which makes it well suited to specific business scenarios.

The facts bore out this judgment. In 2015, the team concentrated on developing an image-encoding FPGA, which achieved a higher compression rate and lower latency than CPU encoding and software encoding. It also substantially reduced storage costs for QQ Photo Album. The team saw that the FPGA direction was worth exploring further.

Around 2016, the AI craze triggered by AlphaGo brought FPGAs into the mainstream view. After the team used an FPGA to accelerate the CNN algorithm of a deep learning model, processing performance reached 4 times that of a general-purpose CPU, while the unit cost was only 1/3.

Although FPGAs perform well, the technical barrier to using them is relatively high. "If FPGAs were moved to the cloud, could that broaden their application?"

With these expectations, on January 20, 2017, Tencent Cloud launched the first FPGA cloud server in China, hoping to bring FPGA capabilities to more enterprises through cloud computing.

In terms of results, companies that did FPGA hardware programming on FPGA cloud servers could indeed raise performance to more than 30 times that of general-purpose CPU servers, while paying only about 40% of what a general-purpose CPU would cost. Take a well-known genetic testing company as an example: sequencing a genome traditionally took a week on CPUs, but an FPGA could compress that to a few hours.

However, cloud-based FPGAs failed to take over the entire industry as quickly as expected.

On the one hand, an FPGA is still a "semi-customized" circuit, and many companies remained unable to develop for FPGAs on their own and needed higher-level services. On the other hand, the rapid decline in the cost of general-purpose chips gradually eroded the FPGA's cost-effectiveness advantage.

The setback in cloud commercialization poured cold water on the team, whose enthusiasm fell from its peak to rock bottom. It also put two questions starkly before the whole team: how much value do FPGAs really bring to the business? Is the FPGA path still viable?

After this blow, the team almost fell apart in 2018, and people began leaving in quick succession. Tencent's first exploration in "chip making" closed with a regrettable comma.

02

Light after the dark: "Penglai" is born

After the setback of the FPGA cloud server, Tencent had to rethink how to continue down the hardware road.

In 2018, just as the team was almost disbanded, China's chip industry entered a warm spring: Sino-US trade friction made the whole country aware of the importance of chips, the establishment of the Science and Technology Innovation Board opened the door for semiconductor companies to go public, and the influx of capital made the sector boom across the country.

For Internet companies, however, making chips is like building cloud computing, databases, or storage systems: it must be supported by specific business scenarios and cannot be done "just for the sake of doing it." After one unsuccessful exploration, Tencent had to wait for the next opportunity, one driven by real demand.

Then came 2019. It was the first year of large-scale application of artificial intelligence, and both internal and external businesses voiced strong demand for AI chips. Should Tencent make AI chips?

When the question was raised, Tencent's management had objections, worrying that the technical staff were simply hot-headed and chasing a trend. At the same time, management left enough gray area and did not explicitly prohibit "exploration" at the small-team level.

Everyone came to agree on testing the waters first, at small scale and low cost, in one specific application scenario.

The Cloud Architecture Platform Department settled on AI inference as the direction for its first chip and named it "Penglai", hoping the chip would stand firm amid turbulent waves like the fairy mountain across the sea in ancient Chinese mythology.

This hardware breakout team was also officially named "Penglai Laboratory".

With the experience accumulated during its FPGA exploration, Penglai Laboratory was already quite proficient in hardware programming languages and had built up some platform-level designs for standard interfaces and buses. However, the R&D requirements of FPGAs and chips are not the same.

If FPGA development is like building with ready-made blocks, then making a chip is like starting from felling the trees to make the blocks. If there is a problem with an FPGA, it can be reprogrammed, but a chip gets only one chance at tape-out. Once something goes wrong, all the effort is in vain.

In addition, the resources of an FPGA are ready-made and fixed, whereas the resources of a chip are defined by the designers themselves. In a word, it means being "stingy": doing the most with the fewest resources.

Chip architecture engineer Rick describes the entire Penglai project as a shift from "renovation" to "reconstruction". At first, the team thought they could easily port the earlier FPGA work to a chip. Only in doing it did they discover that very little of the FPGA architecture could be reused directly. The team had to dismantle the original architecture and rewrite as much as 85% of the code.

For critical components such as DDR memory, chip makers usually have dedicated verification staff. Penglai Laboratory, just starting out, had no such luxury and could only race against the clock to catch up. Lynda later recalled: "I wish I had 48 hours in a day."

In January 2020, the Penglai chip completed tape-out, and the partner shipped the chips to Shenzhen by courier. COVID-19 had just broken out across the country, and companies had begun working remotely en masse.

Henry, the project leader, wore gloves to pick up the package. After carefully disinfecting it with alcohol, he took it to an empty office building with the windows wide open and the fans running. Amid the smell of disinfectant, he and several colleagues began the crucial "lighting" operation together.

"Lighting" means powering up the chip for the first time: first checking for short circuits or smoke, then testing some basic functions. Whether the result is a chip or a "brick" is decided here.

But the chip's clock signal did not come out. The clock is the chip's "metronome": without it, the chip's modules are not aligned and cannot work together.

Was it a problem with this particular chip? The engineers swapped in another chip, but there was still no signal output.

They swapped in yet another; still nothing. There was total silence.

The experimenters no longer dared to do anything. Some people couldn't help but joke that it was time to go home and revise their resumes.

But beyond frustration, everyone was mostly puzzled. Although the project had started almost from scratch with few people and few resources, the Penglai team, from designers to verifiers, was confident that every step had been done well. What had gone wrong?

In an extremely solemn atmosphere, they went on mounting chips on the board, powering up, and reading signals...

The fourth chip lit up. All the remaining chips were fine.

The truth was actually very simple. The chip defect rate of the 28nm process is only 3%, but the first three chips picked at random for testing all happened to be bad. They had simply run into a low-probability event, which let them experience the full tension of "giving birth to a child".
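
As a rough illustration of just how unlikely that run of bad luck was, the sketch below assumes a per-chip defect rate of 3% and treats each chip's defect as independent of the others. Both are simplifying assumptions made here for the arithmetic, and the function name is hypothetical.

```python
def prob_first_k_defective(defect_rate: float, k: int) -> float:
    """Probability that k randomly picked chips are ALL defective,
    assuming each chip fails independently with the same rate."""
    if not 0.0 <= defect_rate <= 1.0:
        raise ValueError("defect_rate must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    return defect_rate ** k


# Assumed inputs: 3% defect rate, three bad chips in a row.
p = prob_first_k_defective(0.03, 3)
print(f"probability: {p:.7f}")
print(f"about 1 in {round(1 / p):,}")
```

Under these assumptions the chance is on the order of a few in 100,000, which is why the team's first suspicion fell on the design rather than on the samples.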

Amid the applause and celebration that followed the false alarm, Tencent's first chip announced its arrival.

03

Going one step further: "Zixiao" soars to the clouds

After mass production, the Penglai chip lived up to expectations in real use. It helped Tencent launch China's first smart microscope approved for clinical use in hospitals, which can automatically recognize medical images, count cells, and display the results directly in the field of view. Its performance fully met the design requirements.

This dispelled the shadow of the earlier FPGA cloud server project and showed that Tencent could follow this path of making application-oriented chips with excellent performance.

The arrival of the terminal-side chip Penglai completed only the task of going from 0 to 1. The team could not wait to go from 1 to N and move on to large cloud chips. Alex, the head of Penglai Laboratory, jokingly called the application to build a large chip "Series A financing."

After the initial trial, the team had to explain to the company why it should invest more in large chips, whether it could stay ahead in the short and long term, and how to create value together with internal and external businesses.

The decision Tencent faced this time was much easier to make.

The first reason was the maturity of Penglai Laboratory. Growing as it marched, the lab had gone through one transformation after another and established a complete, rigorous, and standardized chip R&D system and process. It was now a "regular army" with hard-core credentials.

More importantly, the team proved Tencent’s advantages and position in making chips.

Xie Ming explained that, from an industry perspective, beyond technology and process, the biggest difficulty in making chips lies in their "definition". Traditional chip makers are strong in technology and process, but when chips are built first and matched to needs afterward, much of their real-world performance is lost in many scenarios. The advantage of technology companies such as Google and Tencent is that they are the demand side and have the deepest and most thorough understanding of that demand.

With direction, technology, and process all in order, Lu Shan, Tencent's senior executive vice president and president of TEG (Technology Engineering Division), gave the project full support, and the team obtained more headcount and funding through the General Office.

With the backing of company strategy, the team, full of ambition, charged onto a larger battlefield. Austin, deputy director of Penglai Lab, decided to split the team into two groups and advance AI inference and video encoding and decoding in parallel.

The AI team continued on to Penglai's 2.0 version, "Zixiao". This is the name of the palace where Patriarch Hongjun lives in the novel "Investiture of the Gods". Building "Zixiao" firmly atop the stable fairy mountain represented a new ambition:

This time, they directly set their goal to be the first in the industry.

Zixiao's entire architecture is built around effective computing power. The team optimized the on-chip cache design, abandoned the GDDR6 memory commonly used by competing products, and adopted advanced 2.5D packaging technology to package HBM2e memory together with the AI chip, increasing memory bandwidth by nearly 40%.

Technology iterates fast. After the Zixiao project was launched, competing products set a new industry performance record. Although Zixiao's design performance was still "safe" relative to that record, the team planned to raise it further.

After research, they added a computer vision (CV) accelerator and a video encoding and decoding accelerator inside the chip, a novel design that significantly reduces the interaction and waiting between the AI chip and the x86 CPU.

Even though two complex self-developed modules were added as a result, the team still completed the entire process from architecture determination to verification and tape-out within the planned 6 months.

On September 10, 2021, Zixiao was successfully lit.

In application scenarios such as image and video processing, natural language processing, and search recommendation, this chip breaks the bottleneck that limits how much computing power can actually be used, and ultimately delivers 2 times the performance of industry-standard products in real business scenarios.

04

Fully self-developed: "Canghai" smiles

The AI team named its chip "Zixiao", and the video codec team named its chip "Canghai" (the vast sea), evoking the sea meeting the sky.

Unlike Penglai and Zixiao, which focus on AI, Canghai is a video transcoding chip. If transcoding the pictures in QQ Photo Album was the earliest occasion for the Penglai team to develop hardware, then the video encoding and decoding team's continued exploration in this direction is a true return to that original purpose.

The difference is that the application scenarios of "Canghai" have far exceeded the scope of those years.

As the multimedia business evolved from the picture era to the era of live audio and video, a huge volume of 4K/8K ultra-high-definition content kept hitting the cloud computing infrastructure like a tide. Each additional bit of data brings corresponding costs in transcoding computing power and CDN bandwidth.

This is a stark and simple math problem, and the Canghai team's goal in solving it was equally clear: build the industry's most powerful video transcoding chip and push the compression rate as far as possible.

Fortunately, Tencent's rich multimedia application scenarios and the many leading live-streaming and interactive customers served by Tencent Cloud provided unique conditions for analysis and verification during Canghai's development.

The team first launched Canghai's core self-developed module - the hardware video encoder "Yaochi", and decided to give Yaochi a big test before Canghai completed the research and development.

This big test was the 2020 MSU World Codec Competition, hosted by Moscow State University (MSU). For more than ten years it has been the most influential event in the global video compression field, attracting well-known technology companies from China and abroad, including Intel, Nvidia, Google, Huawei, Alibaba, and Tencent.

In the end, Yaochi achieved real-time 1080P@60Hz video encoding and beat the competition. It took first place in objective metrics including SSIM (structural similarity), PSNR (peak signal-to-noise ratio), and VMAF (Video Multi-Method Assessment Fusion), as well as first place in subjective evaluation by the human eye, finishing a full length ahead of the runner-up.
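
Of the objective metrics named above, PSNR is the simplest to compute. The sketch below uses the standard textbook definition, PSNR = 10 * log10(MAX^2 / MSE), on flat lists of 8-bit pixel values. It is only an illustration of the metric, not the competition's evaluation code; the function name and the sample pixel values are made up for the example.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[int], distorted: Sequence[int],
         max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized
    pixel sequences. Higher means the distorted image is closer to
    the reference; identical inputs give infinity."""
    if len(reference) != len(distorted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel example: every pixel is off by exactly 1.
original = [100, 100, 100, 100]
encoded = [101, 99, 101, 99]
print(f"PSNR: {psnr(original, encoded):.2f} dB")
```

An encoder wins on this metric by keeping PSNR high while spending fewer bits; SSIM and VMAF follow the same idea but model perceived quality more closely.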

After this tough battle, Canghai's technology had been thoroughly put to the test.

On March 5, 2022, Derick and the video encoding and decoding team he led received the "Canghai" chips back from tape-out, just as Shenzhen went fully remote because of the epidemic.

They applied for special permission to enter the empty office building. This scene is very similar to when Penglai was lit up two years ago.

No one expected that the twists and turns of lighting up Penglai would repeat themselves. After overcoming several surprises during debugging, Canghai, Tencent's third chip and its first fully independently developed one, was successfully lit up amid cheers.

The ocean was condensed into a drop. Canghai finally achieved the goal of delivering video of the same quality with less data and less bandwidth, improving the compression rate by more than 30% over the industry's best.

The journey ran from Penglai to Zixiao to Canghai, from a 28nm process to a 12nm process, from 8 people to more than 100, from having no simulation tools to the official completion of the "Tianjian Verification Platform", and from struggling to keep pace with partners to independently building a complete SoC.

The two groups had successfully joined up again. The Penglai team had completed one stage in the evolution of its "chip" road.

05

The "100G" era: two towering trees

The Cloud Architecture Platform Department is not the only one to have joined the chip-making wave.

While multimedia and AI processing were actively seeking change, the underlying cloud servers faced a similar problem: when the gains from software optimization can no longer set a product clearly apart from its competitors, how can performance break through the existing ceiling?

In 2019, Tencent's cloud computing business reached a milestone: the number of cloud servers exceeded 1 million. Zou Xianneng, vice president of Tencent Cloud and general manager of Tencent's Network Platform Department, keenly observed that as server access bandwidth kept increasing, servers were spending more and more CPU resources on network processing.

Can server network processing be implemented in a more cost-effective way while still providing higher network performance? Tencent’s Network Platform Department has also set its sights on software and hardware collaboration and hardware acceleration.

Faced with a challenge that demanded both lower cost and higher performance, Zou Xianneng decided to do some subtraction on the server: "offload the burden of network data processing from the CPU."

The idea of the "smart network card" was born.

A smart network card, on the one hand, handles the server's external network access like an ordinary network card and provides network interconnection between servers and data centers. On the other hand, it has additional intelligent units such as CPU, FPGA, and memory, which can take over some of the server's virtualization computing tasks and accelerate the server's overall network and storage performance.

In other words, what the Network Platform Department had to do was build a new server inside the network card.

Initially, the team hoped to find an off-the-shelf commercial board to reduce the workload.

Hayden, who was in charge of network card hardware, took the lead on the feasibility study and solution research. However, the acceleration engines of commercial chips did not support private protocols, which became the first challenge and the biggest obstacle at that time. Some well-known network card vendors shook their heads after hearing Tencent's request:

"The functions of network cards are very simple now, but your request is too complicated and difficult to achieve."

Others asked bluntly: "With so many network cards and such high reliability requirements, can you handle it yourselves?"

Would the smart network card project die just as it began?

Zou Xianneng pointed out the direction for the team: "Since smart network cards are a key component for cloud data centers to pursue ultimate performance and cost, if there is no product on the market that meets Tencent's needs, then we will build one ourselves."

After the direction was clear, the route quickly became clear: start with self-developed smart network cards based on FPGA, and then develop smart network card chips.

In September 2020, Tencent's first-generation self-developed smart network card, based on FPGA, was officially launched. It was named "Metasequoia", embodying the team's hope that the product would be as adaptable and fast-growing as this rare tree.

During the epidemic, sudden demands kept coming, but the newborn Metasequoia did not bend under the pressure.

Hayden recalled that one large customer used a UDP-based audio and video protocol. UDP is "unreliable" by nature and tolerates packet loss, so the service depended heavily on network throughput and stability, yet the customer demanded high concurrency and high-quality audio and video transmission.

The Metasequoia smart network card rose to the challenge. By greatly improving the server's network performance, it helped the customer complete a 24-hour extreme stress test with zero packet loss, then ran stably in production and delivered an excellent result.

After Metasequoia was put into use, development of the second-generation smart network card, "Yinshan", went into full swing, and it was officially launched in October 2021.

The network ports of this generation of smart network cards have doubled to 2*100G.

With the support of another towering tree, Tencent Cloud launched the industry's first self-developed sixth-generation 100G cloud server. Its computing performance is improved by up to 220%, and storage performance by up to 100%. Compared with the previous generation, single-node access network bandwidth is increased by up to 4 times, and latency is reduced by 50%.

The team is excited about the huge gains "Two Trees" has made in network hardware offloading.

When the FPGA route gradually approached the bottleneck of performance and power consumption, the Network Platform Department decided to once again take the initiative in its own hands. Tencent's fourth chip and the first smart network card chip came into being. It also has a "fairy" name - "Xuanling".

06

"Xuanling" emerges, and the chip story is not over yet

According to the plan, this 7-nanometer chip will be taped out by the end of 2022.

Hayden was tasked with quickly building a Xuanling chip R&D team, which keeps taking on one "mission impossible" after another.

In terms of performance indicators, the number of devices Xuanling supports will rise to more than 10K, 6 times that of commercial chips. At the same time, its performance can be 4 times that of commercial chips. By offloading virtualization, network and storage I/O, and other functions that originally ran on the host CPU to the chip, it can bring host CPU usage down to zero.

This small, compact chip lives up to its name: "Xuan" (mystery) for ultimate performance aimed at the future, and "Ling" (spirit) for flexible acceleration across varied business needs.

Currently, the Xuanling project is intensively verifying and testing the smart network card ahead of tape-out, in order to build Tencent Cloud's next-generation high-performance network infrastructure;

Penglai Lab's AI inference chip Zixiao and video transcoding chip Canghai will enter mass production and be deeply integrated with Tencent's business;

Some new chip projects are also taking shape, and the teams will keep exploring the technical directions needed to enrich this "Book of Mountains and Seas".

The new challenges faced by Tencent's massive business and the inevitable demands of cloud computing's rapid development have "forced" Tencent onto this chip-making path. These chips, rooted in business needs, will have to prove their value deep in practical applications.

"We are not just making chips out of thin air. We knew from the beginning that Tencent's demand was big enough for us to do this," Lu Shan said.

Since 2010, Tencent has been opening up its digital technology and connection capabilities to the outside world in the form of cloud services, plunging into the era of industrial digital transformation and upgrading. Having stepped into the arena itself, Tencent sees deep digital-real integration leading the technology trend of the "Quanzhen Internet".

Beyond Tencent, China's technology companies are advancing into the deep waters of innovation, and efforts to break through bottlenecks have become increasingly important. Whether in digital-real integration or in upstream innovation, hundreds of companies are competing in the sea of hard technology, all riding the wave of history.

Caught up in this trend, Tencent's chip story will surely find its echo in that sea of stars.