CPUs fall into two broad categories, general-purpose CPUs and embedded CPUs, distinguished mainly by how they are applied. General-purpose CPU chips are usually powerful and can run complex operating systems and large application software. Embedded CPUs span a wide range of functions and performance levels. As integration has improved, embedded designs tend to put the CPU, memory and some peripheral circuits onto a single chip, forming a so-called system-on-chip (SoC); the CPU on an SoC is then referred to as a CPU core.
At present, instruction system design has developed in two opposite directions. One is to enhance the power of individual instructions, providing instructions with complex functions so that common operations originally implemented in software are realized directly by the hardware's instruction system. A computer of this kind is called a complex instruction set computer (CISC). The early Intel x86 instruction system is a CISC architecture.
RISC is the abbreviation of Reduced Instruction Set Computer, a direction that emerged in the 1980s. It simplifies instruction functions as far as possible, keeping only simple instructions that can be executed in one clock cycle; more complicated functions are implemented with subroutines. A computer of this kind is called a reduced instruction set computer. Manufacturers of RISC-architecture processors include SUN, SGI, IBM with its PowerPC series, DEC with its Alpha series, and Motorola with its Dragon Ball and PowerPC lines.
A brief introduction to the MIPS system:
MIPS is a very popular RISC processor family. The name stands for "Microprocessor without Interlocked Pipeline Stages", reflecting its approach of relying on software as much as possible to avoid data hazards in the pipeline. It was originally developed in the early 1980s by a research team led by Professor Hennessy at Stanford University. The R series from MIPS company is the commercial RISC microprocessor line developed on this basis, and these products have been used by many computer companies to build various workstations and computer systems.
First, the instruction system
To talk about the CPU, we must first talk about the instruction system. The instruction system refers to the complete set of instructions that a CPU can process.
The instruction set is a fundamental attribute of a CPU. For example, the CPUs we use today all adopt the x86 instruction set, so they are of the same type, whether PIII, Athlon or Joshua. There are also CPUs in the world far faster than the PIII and Athlon, such as the Alpha, but they do not use the x86 instruction set and cannot run the large body of software written for it, such as Windows 98. The instruction system is the fundamental attribute of a CPU precisely because it determines what programs the CPU can run.
All programs written in high-level languages must be translated (compiled or interpreted) into machine language before they can run, and this machine language is made up of instructions.
1. Instruction format
An instruction generally includes two parts: the operation code (opcode) and the address code. The opcode is essentially the instruction's identifying number, telling the CPU which instruction to execute. The address code is more complex and mainly includes the addresses of the source operands, the destination address, and sometimes the address of the next instruction. In some instructions the address code can be partially or completely omitted; a no-op instruction, for example, has only an opcode and no address code.
For example, suppose an instruction system has 32-bit instructions with an 8-bit opcode and 8-bit address fields, so that one instruction holds an opcode and three addresses. Let opcode 00000001 be addition and 00000010 be subtraction. When the CPU receives the instruction 00000010 00000100 00000001 00000110, it first takes out the first eight bits, the opcode 00000010, and knows it must perform a subtraction. It then fetches the minuend from memory address 00000100 and the subtrahend from address 00000001, sends them to the ALU to subtract, and writes the result to address 00000110.
This is just a rather simplistic example, and the actual situation is much more complicated.
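To make the format above concrete, here is a minimal C sketch; the 32-bit layout and opcode values are just the hypothetical ones from this example, not any real instruction set. It unpacks such an instruction into its opcode and three address fields:

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical 32-bit format: [opcode:8][addr1:8][addr2:8][addr3:8] */
    int main(void) {
        uint32_t instr = 0x02040106u;            /* 00000010 00000100 00000001 00000110 */
        unsigned opcode = (instr >> 24) & 0xFF;  /* 2 -> subtraction in this example */
        unsigned addr1  = (instr >> 16) & 0xFF;  /* minuend address (4)              */
        unsigned addr2  = (instr >> 8)  & 0xFF;  /* subtrahend address (1)           */
        unsigned addr3  =  instr        & 0xFF;  /* result address (6)               */
        printf("opcode=%u addr1=%u addr2=%u addr3=%u\n", opcode, addr1, addr2, addr3);
        return 0;
    }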
2. Classification and addressing of instructions
Generally speaking, the current instruction system has the following types of instructions:
(1) arithmetic logic operation instruction
Arithmetic and logic operation instructions include arithmetic instructions such as addition, subtraction, multiplication and division, and logical instructions such as AND, OR, NOT and XOR. Modern instruction systems have also added decimal arithmetic instructions and string operation instructions.
(2) floating-point operation instruction
Used to operate on floating-point numbers. Floating-point operation is much more complicated than integer operation, so there is usually a floating-point operation unit in CPU responsible for floating-point operation. At present, vector instructions are generally added to floating-point instructions to directly manipulate matrices, which is very useful for multimedia and 3D processing.
(3) bit operation instruction
Anyone who has studied C knows that the language provides a set of bit operation statements, and correspondingly the instruction system contains a set of bit operation instructions, such as shifting left or right by one bit. For data represented in binary inside the computer, these operations are very simple and fast.
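For reference, here is a minimal C sketch of the corresponding language-level operators, which typically map onto single shift, AND, OR and XOR instructions (the values chosen are arbitrary):

    #include <stdio.h>

    int main(void) {
        unsigned x = 0x5A;            /* 0101 1010 */
        printf("%X\n", x << 1);       /* shift left one bit   -> 0xB4 */
        printf("%X\n", x >> 1);       /* shift right one bit  -> 0x2D */
        printf("%X\n", x & 0x0F);     /* AND: keep low 4 bits -> 0xA  */
        printf("%X\n", x ^ 0xFF);     /* XOR: flip low 8 bits -> 0xA5 */
        return 0;
    }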
(4) Other instructions
The three kinds of instructions above all perform operations; many other instructions do not. These include data transfer instructions, stack operation instructions, branch (transfer) instructions, input/output instructions, and some special instructions such as privileged instructions, multiprocessor control instructions, and wait, halt and no-op instructions.
For the address code in an instruction there are many different addressing modes, mainly direct addressing, indirect addressing, register addressing, base addressing, indexed addressing and so on. Some complex instruction systems have dozens of addressing modes or even more.
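As a rough illustration, the toy C sketch below computes an effective address for a few common addressing modes; the enum and function are invented for this example and do not model any particular CPU:

    #include <stdint.h>

    enum mode { DIRECT, REGISTER_INDIRECT, BASE_PLUS_OFFSET, INDEXED };

    /* Toy effective-address calculation for a hypothetical machine. */
    uint32_t effective_address(enum mode m, uint32_t field,
                               uint32_t base_reg, uint32_t index_reg) {
        switch (m) {
        case DIRECT:            return field;                /* address given in the instruction */
        case REGISTER_INDIRECT: return base_reg;             /* address held in a register       */
        case BASE_PLUS_OFFSET:  return base_reg + field;     /* base register + displacement     */
        case INDEXED:           return base_reg + index_reg; /* base + index register            */
        }
        return 0;
    }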
3. CISC and RISC
CISC stands for Complex Instruction Set Computer, and RISC stands for Reduced Instruction Set Computer. Although both terms refer to whole computers, below we will look only at the instruction sets.
(1) The emergence, development and present situation of CISC
In the beginning, a computer's instruction system had only a few basic instructions, and more complex functions were realized by combining simple instructions when software was compiled. Take the simplest example: multiplying a by b can be turned into adding a to itself b times, so in principle no multiplication instruction is needed. Of course, even the earliest instruction systems had a multiplication instruction. Why? Because implementing multiplication in hardware is much faster than building it out of additions.
Because computer components at that time were expensive and slow, more and more complex instructions were added to the instruction system in order to improve speed. However, another problem soon appeared: the number of instructions in an instruction system is limited by the length of the opcode. If the opcode is 8 bits, there can be at most 256 instructions (2 to the 8th power).
So what could be done? Widening the instruction is difficult, so the clever designers came up with a scheme: opcode expansion. As mentioned earlier, the opcode is followed by the address code, and some instructions do not need an address code at all or use only a short one; the opcode can be extended into those positions.
For a simple example, if an instruction system has a 2-bit opcode, there can be four different instructions: 00, 01, 10 and 11. Now reserve 11 as an escape code and expand the opcode to 4 bits for those instructions: besides 00, 01 and 10, the system can also have 1100, 1101, 1110 and 1111, seven instructions in total.
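A minimal C sketch of decoding this toy expanding-opcode scheme (purely illustrative; the operation numbering is made up):

    #include <stdint.h>

    /* Toy expanding-opcode decode: in a 4-bit field, the short opcodes
       00, 01, 10 use only the top two bits, while the prefix 11 escapes
       to a full 4-bit opcode (1100..1111). Returns an operation id 0..6. */
    int decode(uint8_t top4) {
        uint8_t short_op = (top4 >> 2) & 0x3;
        if (short_op != 0x3)
            return short_op;          /* opcodes 0, 1, 2: address bits follow  */
        return 3 + (top4 & 0x3);      /* opcodes 3..6 from the expanded field  */
    }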
Then, since the prerequisite for opcode expansion is shortening the address code, designers racked their brains and invented various addressing modes such as base addressing and relative addressing, minimizing the length of the address code to leave room for the opcode.
In this way, the CISC instruction system gradually came into being. A large number of complex instructions, variable instruction length and many addressing modes are the characteristics of CISC, and also its shortcomings: they greatly increase the difficulty of decoding, and with today's high-speed hardware, the speed gained from complex instructions no longer makes up for the time wasted decoding them. Apart from the x86 instruction set, which is still used in the PC market, CISC is rarely seen in servers and larger systems. The only reason x86 survives is compatibility with the huge amount of software on the x86 platform.
(2) The emergence, development and present situation of RISC
In 1975, John Cocke of IBM studied the IBM 370 CISC system of the time and found that the simple instructions, which made up only 20% of the instruction set, accounted for 80% of program usage, while the complex instructions, making up 80% of the instruction set, were used only 20% of the time. From this he put forward the concept of RISC.
RISC turned out to be a success. At the end of the 1980s, RISC CPUs from various companies sprang up like mushrooms after rain and took a large share of the market. In the 1990s, x86 CPUs such as the Pentium and K5 also began to use advanced RISC-style cores.
RISC is characterized by fixed instruction length, few instruction formats and few addressing modes; most instructions are simple ones that complete in one clock cycle; superscalar and pipelined designs are easy to build; there are many registers, and most operations take place between registers. Since most of the CPU cores discussed below are RISC cores, I will not expand on RISC itself here; the design of RISC cores is covered in detail later.
RISC is now in full swing, and Intel's Itanium will eventually abandon x86 and turn to a RISC-style architecture.
Second, the structure of the CPU core
OK, now let's look at the CPU itself. The CPU core is mainly divided into two parts: the arithmetic unit and the controller.
(1) arithmetic unit
1. Arithmetic logic unit (ALU)
The ALU mainly performs fixed-point arithmetic operations (addition, subtraction, multiplication, division), logical operations (AND, OR, NOT, XOR) and shift operations on binary data. Some CPUs also have a shifter dedicated to shift operations.
Usually an ALU has two inputs and one output. The integer unit is sometimes called the IEU (Integer Execution Unit). When we say a "CPU is XX-bit", we are referring to the number of data bits the ALU can handle.
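As a toy model of that two-input, one-output interface (illustrative C only; a real ALU is combinational hardware, not software):

    #include <stdint.h>

    enum alu_op { ALU_ADD, ALU_SUB, ALU_AND, ALU_OR, ALU_XOR, ALU_SHL };

    /* Two inputs, one output: the classic ALU interface, here 32 bits wide. */
    uint32_t alu(enum alu_op op, uint32_t a, uint32_t b) {
        switch (op) {
        case ALU_ADD: return a + b;
        case ALU_SUB: return a - b;
        case ALU_AND: return a & b;
        case ALU_OR:  return a | b;
        case ALU_XOR: return a ^ b;
        case ALU_SHL: return a << (b & 31);  /* shift amount limited to word width */
        }
        return 0;
    }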
2. Floating-point unit
The FPU is mainly responsible for floating-point operations and high-precision integer operations. Some FPUs also provide vector operations, while other designs have a dedicated vector processing unit.
3. General Register Group
General register group is a group of fastest memories, which are used to store operands and intermediate results involved in operations.
RISC and CISC differ considerably in the design of general-purpose registers. CISC CPUs usually have few registers, mainly because of hardware costs at the time; the x86 instruction set, for example, has only eight general-purpose registers. As a result, a CISC CPU spends most of its execution accessing data in memory rather than in registers, which slows the whole system down. RISC systems often have many general-purpose registers and use techniques such as overlapping register windows and register files to make full use of register resources.
To work around the x86 instruction set's limit of eight general-purpose registers, the latest CPUs from Intel and AMD both adopt a technique called "register renaming", which lets an x86 CPU go beyond the 8-register limit and use 32 or more registers internally. Compared with RISC, however, these register operations need an extra clock cycle to rename the register.
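A highly simplified sketch of the renaming idea follows; the table sizes and allocation policy are invented for illustration and do not reflect how any actual x86 core implements it:

    #include <stdint.h>

    #define ARCH_REGS 8     /* architectural registers visible to x86 code   */
    #define PHYS_REGS 32    /* larger physical register file inside the core */

    static int rename_map[ARCH_REGS]; /* architectural -> physical mapping (toy: all start at 0) */
    static int next_free = ARCH_REGS; /* naive free pointer: hand out registers in order         */

    /* When an instruction writes architectural register r, give it a fresh
       physical register so later writers of r do not conflict with earlier uses. */
    int rename_dest(int r) {
        rename_map[r] = next_free;
        next_free = (next_free + 1) % PHYS_REGS;  /* toy allocation policy, no reclamation */
        return rename_map[r];
    }

    /* A source operand reads whichever physical register currently holds
       the architectural register's latest value. */
    int rename_src(int r) {
        return rename_map[r];
    }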
4. Dedicated registers
Special registers are usually some state registers, which cannot be changed by programs, and are controlled by CPU itself to indicate a certain state.
(2) Controller
The arithmetic unit can only complete the operation, while the controller is used to control the whole CPU.
1. Instruction controller
The instruction controller is a very important part of the controller. It needs to fetch and analyze instructions, and then hand them over to the execution unit (ALU or FPU) for execution. At the same time, the address of the next instruction needs to be formed.
2. Timing controller
The function of the timing controller is to provide control signals for each instruction in the proper time sequence. The timing controller includes a clock generator and a frequency multiplier definition unit. The clock generator uses a quartz crystal oscillator to produce a very stable pulse signal, which sets the CPU's main frequency; the frequency multiplier definition unit defines the CPU's main frequency as a multiple of the memory (bus) frequency.
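For instance, assuming a bus frequency of 100 MHz and a multiplier of 5 (numbers chosen purely for illustration), the main frequency works out to 100 MHz × 5 = 500 MHz:

    #include <stdio.h>

    int main(void) {
        double bus_mhz = 100.0;   /* assumed external bus (memory) frequency */
        double multiplier = 5.0;  /* assumed frequency multiplier            */
        printf("core frequency = %.0f MHz\n", bus_mhz * multiplier); /* 500 MHz */
        return 0;
    }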
3. Bus controller
The bus controller is mainly used to control the internal and external buses of CPU, including address bus, data bus and control bus.
4. Interrupt controller
The interrupt controller is used to control all kinds of interrupt requests, and queue the interrupt requests according to priority and give them to the CPU for processing one by one.
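A minimal sketch of the priority idea (a toy software model; a real interrupt controller, such as an 8259-style PIC, resolves priority in hardware):

    #include <stdint.h>

    /* Toy model: one bit per pending interrupt line, and a lower line number
       means a higher priority. Returns the line to service next, or -1 if none. */
    int highest_pending(uint8_t pending_mask) {
        for (int line = 0; line < 8; line++) {
            if (pending_mask & (1u << line))
                return line;   /* highest-priority pending request */
        }
        return -1;             /* no interrupt pending */
    }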
(3) The design of the CPU core
What determines the performance of CPU? The speed of a single ALU does not play a decisive role in a CPU, because the speed of ALUs is similar. The decisive factor of CPU performance lies in the design of CPU core.
1. Superscalar
Since the speed of a single ALU cannot be greatly improved, what is the alternative? Parallel processing once again plays a powerful role. A so-called superscalar CPU is simply one that integrates multiple ALUs, multiple FPUs, multiple decoders and multiple pipelines, and improves performance through parallel processing.
Superscalar design is easy enough to understand, but one thing should be noted: do not read too much into the number in front of "superscalar", as in "9-way superscalar". Different manufacturers define this number differently, and more often than not it is just a marketing device.
2. Pipeline
Pipeline is an important design of modern RISC core, which greatly improves the performance.
The execution of a single instruction can usually be divided into five stages: instruction fetch, instruction decode, operand fetch, ALU operation, and result write-back. The first three steps are generally handled by the instruction controller, the last two by the arithmetic unit. In the traditional approach all instructions execute strictly in sequence: the instruction controller completes the first three steps of the first instruction, then the arithmetic unit completes the last two; only then does the instruction controller start the first three steps of the second instruction, after which the arithmetic unit finishes its last two steps, and so on. Obviously, while the instruction controller is working the arithmetic unit is essentially idle, and while the arithmetic unit is working the instruction controller is resting, which wastes a great deal of resources. The solution is easy to see: as soon as the instruction controller has finished the first three steps of the first instruction, it moves straight on to the second instruction, and the arithmetic unit does likewise. This forms a pipeline, in this case a two-stage pipeline.
If the stages are handled by separate units, say three instruction-control sub-units and two arithmetic sub-units, then as soon as the fetch of the first instruction is finished, the fetch of the second can begin while the first is being decoded; next the third instruction is being fetched while the second is decoded and the first is fetching its operands, and so on. This is a five-stage pipeline, whose theoretical peak throughput is up to five times that of the non-pipelined design.
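The following small C sketch (illustrative only) prints which instruction occupies each of the five stages in each clock cycle of an ideal, hazard-free pipeline:

    #include <stdio.h>

    int main(void) {
        /* fetch, decode, operand fetch, execute, write back */
        const char *stage[5] = {"IF", "ID", "OF", "EX", "WB"};
        int n_instr = 4;                       /* instructions i1..i4 */
        for (int cycle = 1; cycle <= n_instr + 4; cycle++) {
            printf("cycle %2d:", cycle);
            for (int s = 0; s < 5; s++) {
                int instr = cycle - s;         /* instruction sitting in stage s this cycle */
                if (instr >= 1 && instr <= n_instr)
                    printf("  %s:i%d", stage[s], instr);
            }
            printf("\n");
        }
        return 0;
    }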
The pipeline makes maximum use of CPU resources, keeping every component busy in every clock cycle and greatly improving efficiency. However, pipelines bring two very big problems: data dependencies and branches.
In a pipelined system, if the second instruction needs the result of the first, the situation is called a dependency. Take the five-stage pipeline above: when the second instruction is ready to fetch its operands, the first instruction's operation has not yet completed; if the second instruction fetched the operand at that point it would get a wrong value. So the whole pipeline has to stall and wait for the first instruction to complete. This is a very annoying problem, especially for long pipelines, say 20 stages, where such a stall typically costs more than a dozen clock cycles. The current remedy is out-of-order execution: insert unrelated instructions between the two dependent ones to keep the pipeline flowing. In the example above, after the first instruction issues, the third instruction (assuming it is independent) is executed next, and only then the second; by the time the second instruction needs its operand, the first has just finished and the third is nearly done, so the pipeline never stops. Pipeline stalls cannot be avoided entirely, however, especially when dependent instructions are numerous.
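As a toy illustration of spotting such a read-after-write dependency between two instructions (the struct and register numbering are invented for this sketch):

    #include <stdbool.h>

    /* Minimal instruction description: one destination and two source registers. */
    struct instr {
        int dest;        /* register written */
        int src1, src2;  /* registers read   */
    };

    /* True if 'later' reads a register that 'earlier' writes (a RAW hazard):
       a simple in-order pipeline would stall here unless an independent
       instruction can be scheduled in between. */
    bool raw_hazard(struct instr earlier, struct instr later) {
        return later.src1 == earlier.dest || later.src2 == earlier.dest;
    }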
The other big problem is the conditional branch. In the example above, if the first instruction is a conditional branch, the system does not know which instruction should execute next; it must wait for the first instruction's result before issuing the second. The pipeline stall caused by a conditional branch is even worse than that caused by a dependency. Branch prediction is therefore used to deal with it. Although programs are full of branches and any outcome is possible, in most cases one outcome dominates: at the branch at the end of a loop, for example, we always continue the loop except the final time, when we jump out. Exploiting such patterns, branch prediction guesses the next instruction before the branch result is known and executes it speculatively. Current branch predictors can reach more than 90% accuracy, but when a prediction is wrong the CPU still has to flush the whole pipeline and return to the branch point, losing many clock cycles. Further improving prediction accuracy therefore remains an active subject of study.
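One classic simple predictor is the 2-bit saturating counter; the sketch below is a minimal version, with the table size and indexing chosen arbitrarily for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_SIZE 1024
    static uint8_t counters[TABLE_SIZE];   /* 0,1 = predict not taken; 2,3 = predict taken */

    /* Predict the branch at this program counter. */
    bool predict(uint32_t branch_pc) {
        return counters[branch_pc % TABLE_SIZE] >= 2;
    }

    /* After the branch resolves, nudge the counter toward the actual outcome. */
    void update(uint32_t branch_pc, bool taken) {
        uint8_t *c = &counters[branch_pc % TABLE_SIZE];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
    }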
The longer the pipeline, the more serious the dependency and branch problems become. So a longer pipeline is not necessarily better; beyond a certain point it becomes counterproductive, and finding the right balance between clock speed and efficiency is what matters most.
3. Decoding unit
This is unique to x86 CPUs; its function is to convert variable-length x86 instructions into fixed-length, RISC-like internal instructions and hand them to the RISC core. Decoding is divided into hardware decoding and microcode decoding. Simple x86 instructions only need hardware decoding, which is fast; complex x86 instructions must be decoded by microcode into several simple operations, which is slow and complicated. Fortunately, these complex instructions are rarely used.
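As a rough illustration of the idea (the operation names and structure are invented for this sketch, not Intel's or AMD's internal format), a memory-destination add such as add [mem], reg can be thought of as splitting into load, add and store micro-operations:

    /* Invented micro-op representation, purely to illustrate the decode idea. */
    enum uop_kind { UOP_LOAD, UOP_ALU_ADD, UOP_STORE };

    struct uop {
        enum uop_kind kind;
        int dest, src1, src2;   /* register/temporary numbers (made up) */
    };

    /* "add [mem], reg" conceptually becomes three fixed-length micro-ops:
       tmp = load(mem); tmp = tmp + reg; store(mem, tmp);                 */
    static const struct uop add_mem_reg[3] = {
        { UOP_LOAD,    /*dest*/ 100, /*src*/ 0,   0 },
        { UOP_ALU_ADD, /*dest*/ 100, /*src*/ 100, 1 },
        { UOP_STORE,   /*dest*/ 0,   /*src*/ 100, 0 },
    };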
Whether it is the Athlon or the PIII, the old CISC x86 instruction set severely limits their performance.
4. Level 1 cache and Level 2 cache
The Level 1 and Level 2 caches exist to ease the mismatch between the ever faster CPU and the much slower memory. The Level 1 cache is usually integrated into the CPU core, while the Level 2 cache, whether on-die or on-board, runs faster than main memory. For workloads involving a lot of data exchange, the CPU's caches are particularly important.
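A minimal sketch of the lookup logic in a direct-mapped cache, with the line count and line size chosen arbitrarily for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINES      256          /* number of cache lines (arbitrary) */
    #define LINE_BYTES 64           /* bytes per line (arbitrary)        */

    struct line { bool valid; uint32_t tag; };
    static struct line cache[LINES];

    /* Returns true on a hit: the line that would hold this address is
       present and its tag matches. */
    bool cache_hit(uint32_t addr) {
        uint32_t index = (addr / LINE_BYTES) % LINES;
        uint32_t tag   = addr / (LINE_BYTES * LINES);
        return cache[index].valid && cache[index].tag == tag;
    }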