Origin and evolution of the SHARC processor

1 Introduction

Any discussion of cutting-edge applications that demand very high performance has to include the SHARC processor from Analog Devices (ADI). As market pressure grows for higher dynamic range, higher performance and lower cost, demand for floating-point processors is increasing across many applications. This paper introduces the history behind the first SHARC processor and discusses the architectural innovations that have kept it in a leading position throughout its 18 years in digital signal processing.

2 History of the SHARC processor

"SHARC" is short for Super Harvard ARChitecture, the name ADI gave to its floating-point processors. The SHARC processor improves on the standard Harvard architecture: adding an instruction cache frees the PM (program memory) bus to carry data and raises throughput in tight computational loops. The processor can fetch a data word and a coefficient simultaneously while executing an instruction supplied by the instruction cache, which gives it an efficient three-bus mode of operation.
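The benefit of the instruction cache in a tight loop can be illustrated with a toy cycle count. This is a simplified sketch, not a model of the real pipeline: it assumes a one-instruction loop body that reads one data word over the DM bus and one coefficient over the PM bus, and that an instruction fetch competing with a PM data access costs one extra cycle unless the instruction is already cached. The function name `loop_cycles` and the cycle costs are illustrative assumptions.

```python
def loop_cycles(taps, instruction_cache):
    """Count cycles for a one-instruction multiply-accumulate loop (toy model)."""
    cycles = 0
    cached = False
    for _ in range(taps):
        if instruction_cache and cached:
            # Instruction comes from the cache, so the PM bus is free
            # to deliver the coefficient in the same cycle.
            cycles += 1
        else:
            # Instruction fetch and coefficient fetch both need the PM bus.
            cycles += 2
            # With a cache, the conflicting instruction is stored for next time.
            cached = instruction_cache
    return cycles


print(loop_cycles(64, instruction_cache=False))  # 128
print(loop_cycles(64, instruction_cache=True))   # 65
```

Under these assumptions only the first pass through the loop pays the bus conflict, so a 64-tap loop drops from 128 cycles to 65.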

The SHARC processor originated with the ADSP-21020. This floating-point single instruction, single data (SISD) DSP is essentially a standalone computing core with no embedded memory or peripherals. It accesses the PM and DM (data memory) spaces over external buses connected to SRAM chips, and it is programmed and debugged through a JTAG interface.

The ADSP-21020 runs at a clock frequency of 33 MHz and executes each instruction in a single cycle. It performs 32-bit or 40-bit floating-point and 32-bit fixed-point operations with an 80-bit accumulator. ADI introduced this breakthrough product to the market in 1991, and its core technology was the starting point of ADI's commitment to floating-point performance and innovation.

3 Integration and innovation: the birth of SHARC

The first true SHARC processor was the ADSP-21060. Building on the ADSP-21020 core, ADI developed a fully integrated processor with on-chip SRAM and an I/O processor that controls the DMA streams of the integrated peripherals.

The ADSP-21060 floating-point processor entered the market in 1994 and was regarded at the time as the pinnacle of DSP performance and innovation.

The SHARC core runs at up to 40 MHz with single-cycle execution. The added I/O processor transfers data at high speed between the peripherals and the dual-port 4 Mb SRAM without adding any core overhead.

To further improve system performance and scalability for end users, ADI's design team set out to create a mechanism that lets multiprocessor systems share data with very low overhead. A cluster bus controller was added to the external port logic to provide seamless parallel data communication between processors, with up to six processors per cluster. This allows system architects to transfer large amounts of data directly from the master processor into the memory of a designated slave processor at up to 240 MBps, or to send data to all slave devices in the cluster at once using broadcast mode.

High-speed interprocessor communication is also possible with ADI's patented link port technology. Each ADSP-21060 integrates six independent link ports for point-to-point communication, providing an additional 240 MBps of I/O bandwidth.

With this truly balanced architecture and extended feature set, SHARC processors came to be widely used in compute-intensive applications such as medical imaging, military radar and video game machines.

It may be hard to believe that a processor with these capabilities went on sale 15 years ago; more surprising still, it remains in use today. This is the best evidence of the scalability of the SHARC architecture and of ADI's commitment to quality and customer satisfaction.

4 The second-generation SHARC processor

The second generation of SHARC processors raised processing performance to a new level. It extended the core architecture to a single instruction, multiple data (SIMD) design and increased the core clock frequency to 100 MHz. ADSP-2116x series processors are fully source-code compatible with the ADSP-2106x SISD processors. With only minor code changes, users can exploit the newly added parallel processing element (register file, multiplier, ALU and barrel shifter), which doubles per-cycle performance compared with the previous generation of SHARC.
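The doubling of per-cycle performance can be sketched with a dot product. The example below is an illustration only: it assumes that in SIMD mode two processing elements apply the same multiply-accumulate to adjacent samples in one cycle, and that one extra cycle combines the two partial sums. The function names and the cycle accounting are assumptions made for this sketch, not ADI's instruction timing.

```python
def dot_sisd(x, h):
    """One multiply-accumulate per cycle on a single processing element."""
    acc, cycles = 0.0, 0
    for a, b in zip(x, h):
        acc += a * b
        cycles += 1
    return acc, cycles


def dot_simd(x, h):
    """Two processing elements run the same MAC on adjacent samples."""
    acc_x, acc_y, cycles = 0.0, 0.0, 0
    n = min(len(x), len(h))
    for i in range(0, n, 2):
        acc_x += x[i] * h[i]              # first element: even index
        if i + 1 < n:
            acc_y += x[i + 1] * h[i + 1]  # second element: odd index, same cycle
        cycles += 1
    cycles += 1                           # one cycle to combine the partial sums
    return acc_x + acc_y, cycles


samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
coeffs = [1.0] * 8
print(dot_sisd(samples, coeffs))  # (36.0, 8)
print(dot_simd(samples, coeffs))  # (36.0, 5)
```

Both versions return the same sum; for long vectors the SIMD cycle count approaches half of the SISD count, since the fixed cost of combining the partial sums is paid once.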

To feed data to this added processing element without reducing per-cycle performance, the internal PM and DM data buses were widened to 64 bits. In addition, a 48-bit-wide SDRAM controller was integrated on the ADSP-21161 to increase I/O bandwidth, enabling data transfers at up to 600 MBps.

Like the previous-generation SISD SHARC, the second generation retains seamless multiprocessor connection through the cluster bus architecture and point-to-point connection through the link ports, which makes the performance upgrade path simple and clear.

Like their SISD predecessors, second-generation SHARC devices are widely used in medical, industrial and military applications. Their added serial ports (SPORTs), which support time-division multiplexing (TDM) and I2S formats, also let professional audio, high-end consumer electronics and car audio equipment readily exploit the wide dynamic range of floating-point processing.

5 The third-generation SHARC processor

The third generation of SHARC processors began to move beyond the multiprocessor application space to take on new challenges. Because floating-point processing has clear advantages in audio applications, SHARC development shifted its focus to adding on-chip functionality at the lowest possible system cost.

The first processors developed and marketed with this goal were the ADSP-2126x series. Like the ADSP-2116x, the ADSP-2126x uses a SIMD architecture to maximize computing performance. Besides doubling the core clock to 200 MHz, the ADSP-21266 is also the first SHARC product with built-in mask ROM. Integrating 4 Mb of ROM reduces system complexity and cost, and it brought floating-point DSPs, once seen as "high cost", into consumer audio.

To further reduce the complexity of hardware system design, ADI developed an innovative peripheral called the Digital Application Interface (DAI). Earlier SHARC processors and comparable competing products have fixed pin functions; the DAI instead lets users assign any peripheral function to any external pin they choose. For an audio system, this means that when the system's input and output requirements change, an audio clock domain can be assigned to a pin and routed to a serial port at any time in software. This flexibility can significantly reduce the number of external pins needed to meet a particular system specification, simplifies hardware design and helps users cut costs further.

The ADSP-2136x inherits the cost advantages of the ADSP-2126x and adds advanced integration of the audio signal chain. Core performance rises by more than 60%, to 333 MHz, and internal SRAM grows to as much as 3 Mb. Many audio peripherals are also integrated, such as a high-performance asynchronous sample rate converter (ASRC), an S/PDIF transceiver and a DTCP encryption engine, which further improve the programmability and BOM cost of the audio system and consolidate ADI's leading position in the audio market. The higher-performance members of this series also integrate a 32-bit SDRAM interface running at up to 166 MHz, which increases I/O bandwidth and allows commodity memory to be used in data-intensive applications.

Thanks to this breakthrough in audio system integration and its leading price-performance ratio, the third-generation SHARC series is widely used not only in professional audio but also in consumer audio applications such as home theater systems and AV amplifiers, and it has played an important role in bringing the new generation of high-definition audio standards (DTS-HD Master Audio and Dolby TrueHD) to market.

6 The fourth-generation SHARC series: ADSP-2146x

The third generation's success in optimizing price-performance pushed floating-point processors into cost-sensitive consumer applications that were once thought unable to afford them.

ADI then faced an interesting challenge: how to improve the price-performance of floating-point processors even further.

In defining the fourth-generation processor, the product development team focused on the core values that keep SHARC at the forefront of floating-point DSP technology:

● Market leading performance

● Architectural balance

● Performance scalability

● Intelligent integration

Each of these key aspects will be described in detail below.

6.1 ADSP-2146x performance enhancement

Building on the improved ADSP-2136x series core, ADI's SHARC development team set higher performance targets and adopted TSMC's 65 nm silicon process to keep optimizing performance while balancing cost. After careful engineering design and planning, ADI officially released the ADSP-2146x series in June 2008. Its core runs at up to 450 MHz, almost 30% faster than the nearest competing product. ADI's design team was not content with raising clock speed alone, however, and looked for innovative ways to increase computing performance substantially while minimizing the impact on power consumption and cost.

Many engineers use the wide dynamic range of floating-point processors to implement algorithms such as pattern detection, data compression and decompression, encryption and decryption, and adaptive filtering. Many of these compute-intensive algorithms rely on a few basic signal processing building blocks, such as the FFT, the FIR filter and the IIR filter, which underlie most digital signal processing applications. ADI focused on these core building blocks and integrated them as accelerators into the ADSP-2146x DMA architecture to extend the performance of the 450 MHz SHARC core.

The programming model is simple: DSP engineers can treat each of these accelerators as an ordinary peripheral. Each accelerator has its own local memory for data and coefficient storage, so it adds no overhead to the core processor. A set of dedicated accelerator registers configures the accelerator with information such as the start address of the coefficients in main memory and the counter values. Once configuration is complete, the accelerator runs through its processing sequence, and the user simply waits for the interrupt that signals the end of processing.
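The configure, start and wait-for-interrupt flow described above can be sketched as a small software model. Everything here is hypothetical and chosen for illustration: the class `FirAcceleratorModel`, its field names and the completion callback standing in for the interrupt are not ADI's register map or driver API, and main memory is modeled as a plain Python list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FirAcceleratorModel:
    """Toy model of a peripheral-style FIR accelerator (hypothetical names)."""
    # "Registers" written by the core before starting the accelerator.
    coeff_start: int = 0    # start address of the coefficients in main memory
    tap_count: int = 0      # number of coefficients
    input_start: int = 0    # start address of the input samples
    sample_count: int = 0   # number of samples to process
    # Local memories, filled by the accelerator itself rather than the core.
    local_coeffs: List[float] = field(default_factory=list)
    delay_line: List[float] = field(default_factory=list)
    # Stand-in for the end-of-processing interrupt handler.
    on_done: Optional[Callable[[List[float]], None]] = None

    def start(self, main_memory: List[float]) -> None:
        # Copy the coefficients into local memory (models the DMA transfer).
        end = self.coeff_start + self.tap_count
        self.local_coeffs = list(main_memory[self.coeff_start:end])
        self.delay_line = [0.0] * self.tap_count
        outputs = []
        for n in range(self.sample_count):
            # Shift the newest sample into the delay line.
            sample = main_memory[self.input_start + n]
            self.delay_line = [sample] + self.delay_line[:-1]
            # Multiply-accumulate over all taps.
            outputs.append(sum(c * d for c, d in
                               zip(self.local_coeffs, self.delay_line)))
        if self.on_done is not None:
            self.on_done(outputs)  # models raising the completion interrupt


# Main memory holds two coefficients followed by four input samples.
memory = [0.5, 0.5, 1.0, 2.0, 3.0, 4.0]
results: List[float] = []
fir = FirAcceleratorModel(coeff_start=0, tap_count=2, input_start=2,
                          sample_count=4, on_done=results.extend)
fir.start(memory)
print(results)  # [0.5, 1.5, 2.5, 3.5]
```

In this model the caller only writes the configuration fields and supplies a handler; all data movement and arithmetic happen inside `start`, which mirrors the idea that the core does no work while the accelerator runs.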

The FIR accelerator includes a 1K-word local memory for coefficients and another 1K-word memory for delay line data. The FIR arithmetic unit contains four parallel MAC (multiply-accumulate) units, each running at half the core clock frequency. Using an 80-bit accumulator, the arithmetic unit can perform 32-bit floating-point or 32-bit fixed-point processing. In theory, this engine provides 1.8 GFLOPS of processing power on top of the 2.7 GFLOPS delivered by the core. Compared with third-generation products, the available floating-point performance of the fourth generation is therefore roughly doubled.
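The peak figures quoted above can be reproduced with simple arithmetic. The sketch below rests on two assumptions that the text does not state: each MAC is counted as two floating-point operations (one multiply and one add), and the core figure corresponds to six floating-point operations per cycle. The 333 MHz third-generation clock comes from the earlier section, and applying the same per-cycle count to it is also an assumption.

```python
def peak_gflops(clock_hz, flops_per_cycle):
    """Theoretical peak throughput in GFLOPS."""
    return clock_hz * flops_per_cycle / 1e9


CORE_CLOCK_HZ = 450_000_000          # fourth-generation core clock
CORE_FLOPS_PER_CYCLE = 6             # assumption, chosen to match 2.7 GFLOPS
MAC_UNITS = 4                        # parallel MAC units in the FIR accelerator
FLOPS_PER_MAC = 2                    # assumption: multiply + add

core = peak_gflops(CORE_CLOCK_HZ, CORE_FLOPS_PER_CYCLE)
# Each MAC unit runs at half the core clock frequency.
fir = peak_gflops(CORE_CLOCK_HZ // 2, MAC_UNITS * FLOPS_PER_MAC)
third_gen = peak_gflops(333_000_000, CORE_FLOPS_PER_CYCLE)

print(round(core, 3))                      # 2.7
print(round(fir, 3))                       # 1.8
print(round((core + fir) / third_gen, 2))  # 2.25
```

Under these assumptions the core and the FIR accelerator together reach 4.5 GFLOPS, a little more than twice the third-generation estimate.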

The FIR accelerator can be used in single iteration mode, which means that the complete filter implementation can be put into local memory (filter length