Some knowledge about CPU
1. Main frequency
Main frequency, also called clock frequency, is measured in MHz and indicates the computing speed of the CPU. CPU main frequency = external frequency × multiplier. Many people think the main frequency determines how fast a CPU runs. This view is not just one-sided; for servers it is even more misleading. So far there is no definite formula relating main frequency to actual computing speed, and even the two major processor manufacturers, Intel and AMD, dispute the point. Looking at Intel's product development trends, it is clear that Intel places great emphasis on raising its own main frequency. Other processor manufacturers take a different path: someone once used a 1GHz Transmeta processor for comparison, and its operating efficiency was equivalent to that of a 2GHz Intel processor.
Therefore, the main frequency of the CPU is not directly related to its actual computing power. The main frequency only indicates how fast the digital pulse signal oscillates inside the CPU. Examples can be found even in Intel's own products: a 1GHz Itanium chip performs almost as fast as a 2.66GHz Xeon/Opteron, and a 1.5GHz Itanium 2 is about as fast as a 4GHz Xeon/Opteron. The CPU's computing speed also depends on the performance of every aspect of the CPU's pipeline.
Of course, main frequency is related to actual computing speed, but it is only one aspect of CPU performance and does not represent the CPU's overall performance.
2. External frequency (base clock)
The external frequency is the CPU's base clock, also measured in MHz. It determines the running speed of the entire motherboard. To put it plainly, what we call overclocking on desktop computers is overclocking the CPU's external frequency (under normal circumstances the CPU multiplier is locked). I believe this is easy to understand. For server CPUs, however, overclocking is absolutely not allowed. As mentioned earlier, the CPU's external frequency determines how fast the motherboard runs, and the two run synchronously. If a server CPU is overclocked and its external frequency is changed, asynchronous operation results (many desktop motherboards do support asynchronous operation), and the entire server system becomes unstable.
In most current computer systems, the external frequency is also the speed at which the memory and the motherboard run synchronously. In this sense, the CPU's external frequency connects directly to the memory and keeps the two in a synchronized operating state. External frequency and front-side bus (FSB) frequency are easy to confuse; the difference between the two is explained in the FSB section below.
3. Front-side bus (FSB) frequency
Front-side bus (FSB) frequency (i.e., bus frequency) directly affects the speed of data exchange between the CPU and memory. There is a formula: data bandwidth = (bus frequency × data bus width) / 8. The maximum data transmission bandwidth depends on the width of the data bus and its transmission frequency. For example, the current 64-bit Xeon Nocona has an 800MHz front-side bus; according to the formula, its maximum data transmission bandwidth is 6.4GB/s.
The difference between external frequency and front-side bus frequency: the front-side bus speed refers to the speed of data transmission, while the external frequency is the speed at which the CPU and motherboard run synchronously. In other words, a 100MHz external frequency means the digital pulse signal oscillates one hundred million times per second, while a 100MHz front-side bus means the amount of data the CPU can accept per second is 100MHz × 64bit ÷ 8 = 800MB/s.
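As a concrete check on this formula, here is a minimal C sketch of the bandwidth calculation; the 100MHz and 800MHz figures are simply the example numbers used above.

    #include <stdio.h>

    /* Peak bandwidth = bus frequency (Hz) x bus width (bits) / 8, giving bytes/s.
       Example numbers taken from the text: a 64-bit bus at 100 MHz and at 800 MHz. */
    static double peak_bandwidth_mb(double freq_mhz, int bus_width_bits) {
        return freq_mhz * 1e6 * bus_width_bits / 8.0 / 1e6;   /* result in MB/s */
    }

    int main(void) {
        printf("100 MHz x 64 bit: %.0f MB/s\n", peak_bandwidth_mb(100.0, 64));  /* 800 MB/s */
        printf("800 MHz x 64 bit: %.0f MB/s\n", peak_bandwidth_mb(800.0, 64));  /* 6400 MB/s, i.e. 6.4 GB/s */
        return 0;
    }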
In fact, the emergence of the "HyperTransport" architecture changed what the front-side bus (FSB) frequency actually means. We previously knew that the IA-32 architecture must have three important components: the Memory Controller Hub (MCH), the I/O Controller Hub, and the PCI Hub, as in Intel's typical chipsets Intel 7501 and Intel 7505, which are tailor-made for dual Xeon processors. The MCH they contain provides the CPU with a front-side bus frequency of 533MHz; with DDR memory, the front-side bus bandwidth can reach 4.3GB/s.
However, as processor performance continues to improve, it also brings many problems to the system architecture. The "HyperTransport" architecture not only solves these problems but also improves bus bandwidth more effectively, as in the AMD Opteron processor. Its flexible HyperTransport I/O bus architecture allows it to integrate the memory controller, so the processor exchanges data with memory directly rather than going through the system bus to the chipset. In that case, it is hard to say what the front-side bus (FSB) frequency of an AMD Opteron processor even refers to.
4. CPU bits and word length
Bits: digital circuits and computer technology use binary, whose codes are only "0" and "1"; each "0" or "1" is one "bit" in the CPU.
Word length: in computer technology, the number of binary digits the CPU can process at one time (in one unit of time) is called the word length. A CPU that can process data with an 8-bit word length is therefore usually called an 8-bit CPU; likewise, a 32-bit CPU can process 32-bit binary data at a time. The difference between byte and word length: since common English characters can be represented in 8 binary bits, 8 bits are usually called a byte. The word length is not fixed and differs from CPU to CPU. An 8-bit CPU can only process one byte at a time, a 32-bit CPU can process 4 bytes at a time, and a 64-bit CPU can process 8 bytes at a time.
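To make the bits/bytes relationship concrete, here is a small C sketch covering just the 8-, 32-, and 64-bit cases mentioned above:

    #include <stdio.h>
    #include <stdint.h>

    /* Relates word length (bits) to the bytes handled per operation
       and to the largest unsigned value one word can hold. */
    int main(void) {
        int word_bits[] = {8, 32, 64};
        for (int i = 0; i < 3; i++) {
            int bits  = word_bits[i];
            int bytes = bits / 8;                     /* 8 bits = 1 byte */
            /* Largest unsigned value: 2^bits - 1 (64-bit case handled separately to avoid overflow). */
            uint64_t max = (bits == 64) ? UINT64_MAX : ((uint64_t)1 << bits) - 1;
            printf("%2d-bit word: %d byte(s) per operation, max unsigned value %llu\n",
                   bits, bytes, (unsigned long long)max);
        }
        return 0;
    }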
5. Multiplier coefficient
The multiplier refers to the ratio between the CPU main frequency and the external frequency. At the same external frequency, the higher the multiplier, the higher the CPU frequency. In fact, though, with the same external frequency a high-multiplier CPU by itself means little, because the data transfer speed between the CPU and the system is limited. A CPU that blindly pursues a high multiplier to obtain a high main frequency shows an obvious "bottleneck" effect: the maximum speed at which the CPU can obtain data from the system cannot keep up with the CPU's computing speed. Generally speaking, except for engineering samples, Intel's CPUs have locked multipliers, while AMD did not lock them in the past.
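A minimal C sketch of the main frequency = external frequency × multiplier relationship; the 200MHz and 16× figures are purely illustrative, not values from the text:

    #include <stdio.h>

    /* Main frequency = external frequency (base clock) x multiplier.
       The 200 MHz x 16 numbers below are illustrative only. */
    int main(void) {
        double external_mhz = 200.0;   /* base clock, assumed for illustration */
        double multiplier   = 16.0;    /* locked on most retail Intel CPUs */
        printf("Main frequency: %.0f MHz\n", external_mhz * multiplier);   /* 3200 MHz */
        return 0;
    }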
6. Cache
Cache size is also one of the important indicators of a CPU, and the cache's structure and size have a great impact on CPU speed. The cache inside the CPU runs at an extremely high frequency, generally the same frequency as the processor, and works far more efficiently than system memory or the hard disk. In actual work, the CPU often needs to read the same blocks of data repeatedly, and a larger cache greatly improves the hit rate of data reads inside the CPU, so the data need not be fetched from memory or the hard disk, which improves system performance. However, because of factors such as CPU die area and cost, the cache is very small.
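To see why the hit rate matters so much, here is a rough average-access-time sketch in C; the 1-cycle cache latency and 200-cycle memory latency are assumed example values, not figures from the text:

    #include <stdio.h>

    /* Average access time = hit_rate * cache_latency + (1 - hit_rate) * memory_latency.
       The latencies below are assumed example values. */
    static double avg_access_cycles(double hit_rate, double cache_cyc, double mem_cyc) {
        return hit_rate * cache_cyc + (1.0 - hit_rate) * mem_cyc;
    }

    int main(void) {
        printf("90%% hits: %.1f cycles on average\n", avg_access_cycles(0.90, 1.0, 200.0)); /* 20.9 */
        printf("99%% hits: %.1f cycles on average\n", avg_access_cycles(0.99, 1.0, 200.0)); /*  3.0 */
        return 0;
    }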
L1 Cache (level-one cache) is the CPU's first-level cache, divided into a data cache and an instruction cache. The capacity and structure of the built-in L1 cache have a considerable impact on CPU performance. However, cache memory is built from static RAM and has a complicated structure, so when the CPU die area cannot be too large, the L1 cache cannot be made very big. The L1 cache of a typical server CPU is usually 32-256KB.
L2 Cache (level-two cache) is the CPU's second-level cache, which may be on-chip or off-chip. The on-chip L2 cache runs at the same speed as the core clock, while the off-chip L2 cache runs at only half the core clock. L2 cache capacity also affects CPU performance, and the principle is the bigger the better. The largest capacity in current home CPUs is 512KB, while the L2 cache of server and workstation CPUs ranges from 256KB to 1MB, with some as high as 2MB or 3MB.
L3 Cache (level-three cache) comes in two kinds: early implementations were external, while current ones are built in. Its actual effect is that an L3 cache further reduces memory latency and improves processor performance when computing on large amounts of data. Reducing memory latency and improving large-data computing capability also help games. In the server field, adding an L3 cache still brings a noticeable performance improvement; for example, a configuration with a larger L3 cache uses physical memory more efficiently, so, relative to a slower disk I/O subsystem, it can handle more data requests.
Processors with larger L3 caches provide more efficient file system cache behavior and shorter message and processor queue lengths.
In fact, the earliest L3 cache appeared on the K6-III processor released by AMD. At that time the L3 cache was limited by the manufacturing process and was not integrated into the chip but placed on the motherboard. An L3 cache that could only run synchronously with the system bus frequency was not much different from main memory. Later, the L3 cache was adopted by Intel's Itanium processor for the server market, followed by the P4EE and Xeon MP. Intel also plans to launch an Itanium 2 processor with a 9MB L3 cache, and later a dual-core Itanium 2 processor with a 24MB L3 cache.
But basically the L3 cache is not that important for improving processor performance; for example, the Xeon MP processor equipped with a 1MB L3 cache is still no match for the Opteron. This shows that increasing the front-side bus brings a more effective performance improvement than increasing the cache.
7. CPU extended instruction set
The CPU relies on instructions to compute and to control the system, and each CPU is designed with a series of instruction systems matched to its hardware circuits. The strength of its instructions is also an important indicator of a CPU, and the instruction set is one of the most effective tools for improving microprocessor efficiency. In terms of mainstream architectures, instruction sets can be divided into complex instruction sets and reduced instruction sets. In terms of specific applications, Intel's MMX (MultiMedia Extensions), SSE, SSE2 (Streaming SIMD Extensions 2), SSE3, and AMD's 3DNow! are all CPU extended instruction sets, which respectively enhance the CPU's multimedia, graphics, and Internet processing capabilities. We usually call the CPU's extended instruction sets the "CPU instruction set". The SSE3 instruction set is also the smallest one so far: MMX contained 57 instructions, SSE contained 50, SSE2 contained 144, and SSE3 contains 13. SSE3 is currently also the most advanced instruction set; Intel's Prescott processors already support it, AMD will add SSE3 support to future dual-core processors, and Transmeta processors will also support it.
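As a small illustration of what one SSE instruction does, the following C sketch uses the compiler intrinsics that map onto it (compile with SSE enabled, e.g. gcc -msse); the vector contents are arbitrary:

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    /* One SSE instruction (addps) adds four single-precision floats at once,
       which is the kind of data-level parallelism these extensions provide. */
    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float r[4];

        __m128 va = _mm_loadu_ps(a);        /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vr = _mm_add_ps(va, vb);     /* 4 additions in one instruction */
        _mm_storeu_ps(r, vr);

        printf("%.1f %.1f %.1f %.1f\n", r[0], r[1], r[2], r[3]);  /* 11 22 33 44 */
        return 0;
    }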
8. CPU core and I/O working voltage
Starting from 586 CPUs, the working voltage of the CPU is divided into core voltage and I/O voltage. Usually the core voltage is less than or equal to the I/O voltage. The core voltage is determined by the CPU's production process: generally, the smaller the process, the lower the core operating voltage. I/O voltage is generally 1.6-5V. Low voltage helps solve the problems of excessive power consumption and excessive heat.
9. Manufacturing process
The manufacturing process figure (in microns or nanometers) refers to the distance between circuits within the IC. The trend in manufacturing processes is towards higher density. Higher-density IC designs mean that an IC of the same size can contain circuits with higher density and more complex functions. The main processes now are 180nm, 130nm, and 90nm, and it has recently been officially stated that a 65nm manufacturing process exists.
10. Instruction set
(1) CISC instruction set
The CISC instruction set, also known as the complex instruction set, is short for Complex Instruction Set Computer. In a CISC microprocessor, the instructions of a program are executed serially in order, and the operations within each instruction are also executed serially in order. The advantage of sequential execution is simple control, but the utilization of the computer's various parts is low and execution is slow. CISC is, in fact, the x86 series (i.e., IA-32 architecture) CPUs produced by Intel and the compatible CPUs from AMD, VIA, and others. Even the new X86-64 (also called AMD64) belongs to the CISC category.
To understand what an instruction set is, we have to start with today's X86 architecture CPUs.
The X86 instruction set was developed by Intel specifically for its first 16-bit CPU (the i8086). The CPU in the world's first PC, the i8088 (a simplified version of the i8086) launched by IBM in 1981, also used X86 instructions. At the same time, an X87 chip was added to the computer to improve floating-point processing. From then on, the X86 instruction set and the X87 instruction set have been collectively referred to as the X86 instruction set.
Although, as CPU technology developed, Intel successively launched the newer i80386 and i80486, then the PII Xeon, PIII Xeon, and Pentium 3, and finally today's Pentium 4 series and Xeon (not including Xeon Nocona), in order to ensure that computers could continue to run the various applications developed in the past and to protect and inherit the rich base of software, all CPUs produced by Intel continue to use the X86 instruction set, so they still belong to the X86 series. Since the Intel X86 series and its compatible CPUs (such as the AMD Athlon MP) all use the X86 instruction set, they form today's huge lineup of X86 and compatible CPUs; x86 CPUs currently include mainly Intel's server CPUs and AMD's server CPUs.
(2) RISC instruction set
RISC is the abbreviation of "Reduced Instruction Set Computing". It was developed on the basis of the CISC instruction system. Tests on CISC machines showed that the frequency of use of different instructions varies greatly: the most commonly used instructions are relatively simple ones that make up only 20% of the instruction set but account for 80% of the occurrences in programs. A complex instruction system inevitably increases the complexity of the microprocessor, making processor development long and costly, and complex instructions require complex operations, which inevitably slow the computer down. For these reasons, RISC CPUs were born in the 1980s. Compared with CISC CPUs, RISC CPUs not only streamlined the instruction system but also adopted so-called superscalar and superpipelined structures, greatly increasing parallel processing capability. The RISC instruction set is the development direction of high-performance CPUs and stands in contrast to traditional CISC (complex instruction set). By comparison, RISC has a unified instruction format, fewer instruction types, and fewer addressing modes than a complex instruction set, so processing speed is much higher. At present, CPUs with this instruction system are commonly used in mid-to-high-end servers; high-end servers in particular all use RISC CPUs. The RISC instruction system is better suited to UNIX, the operating system of high-end servers, and Linux is also a UNIX-like operating system. RISC CPUs are not compatible with Intel and AMD CPUs in either software or hardware.
At present, mid-to-high-end server CPUs using RISC instructions mainly fall into the following categories: PowerPC processors, SPARC processors, PA-RISC processors, MIPS processors, and Alpha processors.
(3) IA-64
There has been much debate about whether EPIC (Explicitly Parallel Instruction Computing) is the successor to the RISC and CISC systems. Taken on its own, the EPIC system looks more like an important step for Intel's processors toward the RISC system. Theoretically, under the same host configuration, a CPU designed around EPIC handles Windows application software much better than Unix-based application software.
Intel's server CPU using EPIC technology is the Itanium (development codename Merced). It is a 64-bit processor and the first in the IA-64 series. Microsoft has also developed an operating system codenamed Win64 to support it on the software side.
After Intel adopted EPIC technology, the IA-64 architecture using the EPIC instruction set was born. IA-64 is a great step forward from x86 in many respects: it breaks through many limitations of the traditional IA-32 architecture and achieves breakthrough improvements in data processing capability, system stability, security, usability, and manageability.
The biggest flaw of IA-64 microprocessors is their lack of compatibility with x86. In order for IA-64 processors to better run software from both eras, Intel introduced an x86-to-IA-64 decoder on the Itanium and Itanium 2 so that x86 instructions can be translated into IA-64 instructions. This decoder is not the most efficient decoder, nor is it the best way to run x86 code (the best way is to run x86 code directly on an x86 processor), so the performance of the Itanium and Itanium 2 when running x86 applications is very poor. This became the fundamental reason for the emergence of X86-64.
(4) X86-64 (AMD64/EM64T)
Designed by AMD, X86-64 can handle 64-bit integer operations and is compatible with the X86-32 architecture. It supports 64-bit logical addressing and provides an option to convert to 32-bit addressing; data operation instructions default to 32-bit and 8-bit, with an option to convert to 64-bit and 16-bit; it supports the general-purpose registers, and a 32-bit operation must extend its result to a full 64 bits. Thus instructions distinguish between "direct execution" and "conversion execution"; the instruction field is 8 or 32 bits, which keeps the field from becoming too long.
The emergence of x86-64 (also called AMD64) did not come out of nowhere. The 32-bit addressing space of x86 processors is limited to 4GB of memory, and IA-64 processors are not compatible with x86. AMD fully considered customers' needs and enhanced the x86 instruction set so that it could also support a 64-bit computing mode; AMD therefore calls this structure x86-64. Technically, in order to perform 64-bit operations in the x86-64 architecture, AMD introduced new R8-R15 general-purpose registers as an extension of the original x86 registers, though these registers are not fully used in 32-bit environments. The original registers such as EAX and EBX were also extended from 32 bits to 64 bits, and 8 new registers were added to the SSE unit to provide support for SSE2. The increase in the number of registers brings a performance improvement. At the same time, in order to support both 32-bit and 64-bit code and registers, the x86-64 architecture allows the processor to work in two modes: Long Mode and Legacy Mode, with Long Mode divided into two sub-modes (64-bit mode and Compatibility mode). The standard has been introduced in AMD's Opteron server processors.
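One practical way to see the 32-bit vs 64-bit difference from a program's point of view is to print type and pointer sizes; the sketch below assumes a typical LP64 Unix/Linux build (e.g. gcc -m64), so the sizes in the comments are assumptions about that environment:

    #include <stdio.h>

    /* On a typical LP64 build (64-bit Linux/Unix), pointers and long are 8 bytes;
       on a 32-bit (ILP32) build they are 4 bytes. int stays 4 bytes in both. */
    int main(void) {
        printf("sizeof(int)    = %zu\n", sizeof(int));     /* 4 on both              */
        printf("sizeof(long)   = %zu\n", sizeof(long));    /* 8 on LP64, 4 on ILP32  */
        printf("sizeof(void *) = %zu\n", sizeof(void *));  /* 8 on LP64, 4 on ILP32  */
        return 0;
    }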
This year Intel also launched EM64T, its technology supporting 64-bit computing. Before it was officially named EM64T it was called IA32E, the name of Intel's 64-bit extension technology, used to distinguish it from the X86 instruction set. Intel's EM64T supports a 64-bit sub-mode similar to AMD's X86-64 technology: it uses 64-bit linear flat addressing, adds 8 new general-purpose registers (GPRs), and adds 8 registers to support SSE instructions. Like AMD, Intel's 64-bit technology is compatible with IA-32 and IA-32E; IA-32E is used only when running a 64-bit operating system and consists of two sub-modes, a 64-bit sub-mode and a 32-bit sub-mode, which, as with AMD64, are backward compatible. Intel's EM64T will be fully compatible with AMD's X86-64 technology.
The Nocona processor has now added some 64-bit technology, and Intel's Pentium 4E processor also supports 64-bit technology.
It should be said that both are 64-bit microprocessor architectures compatible with the x86 instruction set, but there are still some differences between EM64T and AMD64: for example, the NX bit found in AMD64 processors will not be provided in Intel's server processors.
11. Superpipeline and superscalar
Before explaining superpipelining and superscalar, let's first understand the pipeline. The pipeline was first used by Intel in the 486 chip. It works like an assembly line in industrial production: in the CPU, an instruction-processing pipeline is built from 5-6 circuit units with different functions, an X86 instruction is divided into 5-6 steps, and each step is executed by one of these units, so that an instruction can be completed in one CPU clock cycle, increasing the CPU's computing speed. Each integer pipeline of the classic Pentium has four stages, namely instruction prefetch, decode, execute, and write-back; the floating-point pipeline has eight stages.
Superscalar means using multiple built-in pipelines to execute multiple instructions at the same time; its essence is to trade space for time. Superpipelining means completing one or more operations per machine cycle by subdividing the pipeline and raising the main frequency; its essence is to trade time for space. For example, the Pentium 4's pipeline is as long as 20 stages. The more stages the pipeline is divided into, the less work each stage does, so it can support CPUs with higher operating frequencies. However, an over-long pipeline also brings side effects: the actual computing speed of a higher-frequency CPU may well turn out lower. This is the case with Intel's Pentium 4: although its main frequency can reach 1.4GHz or more, its computing performance is far inferior to AMD's 1.2GHz Athlon, or even to the Pentium III.
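A back-of-the-envelope model helps show the trade-off: with k pipeline stages and no stalls, n instructions take roughly k + n - 1 cycles instead of n × k. The C sketch below uses the 4-stage figure from the classic Pentium description above and a 20-stage Pentium 4-style pipe; it ignores stalls and branch penalties entirely:

    #include <stdio.h>

    /* Idealized pipeline model: k stages, n instructions, no stalls.
       Unpipelined cost: n * k cycles.  Pipelined cost: k + n - 1 cycles. */
    int main(void) {
        long n = 1000;            /* number of instructions, illustrative */
        int  stages[] = {4, 20};  /* classic Pentium integer pipe vs. a Pentium 4-style deep pipe */

        for (int i = 0; i < 2; i++) {
            int  k = stages[i];
            long unpiped = n * k;
            long piped   = k + n - 1;
            printf("%2d stages: %ld cycles unpipelined, %ld pipelined\n", k, unpiped, piped);
        }
        return 0;
    }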
12. Packaging form
CPU packaging is a protective measure that encapsulates the CPU chip or CPU module in specific materials to prevent damage; generally the CPU must be packaged before it can be delivered to users. The packaging method depends on the CPU's installation form and device integration design. Broadly speaking, CPUs installed in Socket sockets usually use PGA (Pin Grid Array) packaging, while CPUs installed in Slot x slots use the SEC (Single Edge Contact) cartridge form of packaging. There are also packaging technologies such as PLGA (Plastic Land Grid Array) and OLGA (Organic Land Grid Array). Because of increasingly fierce market competition, the current development direction of CPU packaging technology is mainly cost saving.
13. Multithreading
Simultaneous multithreading, abbreviated SMT, lets the processor replicate its architectural state so that multiple threads on the same processor execute simultaneously and fully share the processor's execution resources. It maximizes wide-issue, out-of-order superscalar processing, improves the utilization of the processor's execution units, and alleviates memory access delays caused by data dependencies or cache misses. When multiple threads are not available, an SMT processor is almost the same as a traditional wide-issue superscalar processor. The most attractive thing about SMT is that it requires only a small change to the processor core design and can significantly improve performance at almost no extra cost. Multithreading technology can prepare more data for the high-speed computing core to process and reduce the core's idle time, which is undoubtedly very attractive even for low-end desktop systems. Starting from the 3.06GHz Pentium 4, all Intel processors will support SMT technology.
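From the software side, SMT is exploited through ordinary multithreading; a minimal POSIX-threads sketch in C (compile with -pthread) creates two threads that the operating system could schedule onto the two logical processors of one SMT core:

    #include <stdio.h>
    #include <pthread.h>

    /* Two ordinary POSIX threads; on an SMT-capable CPU the OS can schedule
       both onto the logical processors of a single physical core. */
    static void *worker(void *arg) {
        long id = (long)arg;
        long sum = 0;
        for (long i = 0; i < 1000000; i++)   /* some independent work */
            sum += i;
        printf("thread %ld done, sum=%ld\n", id, sum);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }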
14. Multi-core
Multi-core also refers to chip multiprocessors (CMP). CMP was proposed by Stanford University in the United States; the idea is to integrate SMP (symmetric multiprocessing), as used in large-scale parallel processors, onto the same chip, with each processor executing different processes in parallel. Compared with CMP, the SMT processor structure is notably more flexible.
However, when the semiconductor process reached 0.18 micron, wire delay already exceeded gate delay, which requires microprocessor design to be divided into many basic unit structures of smaller scale and better locality. By contrast, because the CMP structure is already divided into multiple processor cores, each core is relatively simple and easier to optimize, so CMP has better development prospects. Currently, IBM's Power4 chip and Sun's MAJC5200 chip both use the CMP structure. Multi-core processors can share the cache inside the processor, improving cache utilization and simplifying the design of multiprocessor systems.
In the second half of 2005, new processors from Intel and AMD will also adopt the CMP structure. The development code of the new Itanium processor is Montecito; it uses a dual-core design, has at least 18MB of on-chip cache, and is manufactured on a 90nm process. Its design is certainly a challenge to today's chip industry: each of its cores has independent L1, L2, and L3 caches, and the chip contains about 1 billion transistors.
15. SMP
SMP (Symmetric Multi-Processing), short for symmetric multiprocessing structure, refers to a group of processors (multiple CPUs) assembled in one computer, with the memory subsystem and bus structure shared among the CPUs. With the support of this technology, a server system can run multiple processors at the same time and share memory and other host resources. Dual Xeon, which is what we call two-way, is the most common type of symmetric processor system (Xeon MP can support up to four-way, and AMD Opteron supports 1- to 8-way); there are also a few 16-way systems. Generally speaking, though, machines with the SMP structure scale poorly, and it is difficult to go beyond 100 processors; the usual configuration is 8 to 16, which is enough for most users. SMP is most common in high-performance server and workstation-class motherboard architectures, and some UNIX servers can support systems with up to 256 CPUs.
The necessary conditions for building an SMP system are: hardware that supports SMP, including the motherboard and CPUs; a system platform that supports SMP; and application software that supports SMP.
For the SMP system to perform efficiently, the operating system must support SMP, as WINNT, LINUX, UNIX, and other 32-bit operating systems do — that is, it must be capable of multitasking and multithreading. Multitasking means the operating system can have different CPUs complete different tasks at the same time; multithreading means the operating system can have different CPUs complete the same task in parallel.
Building an SMP system places very high demands on the CPUs selected. First, an APIC (Advanced Programmable Interrupt Controller) unit must be built into the CPU — the core of the Intel multiprocessing specification is the use of Advanced Programmable Interrupt Controllers. Second, the CPUs must be the same product model, with the same type of core and exactly the same operating frequency. Finally, keep the product serial numbers as close as possible, because when CPUs from two different production batches run as a dual-processor pair, one CPU may be overloaded while the other carries very little load, so maximum performance cannot be reached; worse, it may cause a crash.
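On an SMP machine the operating system reports how many logical processors are online; here is a small C sketch using the sysconf call common on Linux/UNIX (the _SC_NPROCESSORS_ONLN constant is a widespread Unix extension, not guaranteed on every platform):

    #include <stdio.h>
    #include <unistd.h>

    /* Query the number of logical processors the OS has brought online.
       On an SMP (or SMT/multi-core) machine this is normally > 1. */
    int main(void) {
        long online = sysconf(_SC_NPROCESSORS_ONLN);
        if (online < 1) {
            perror("sysconf");
            return 1;
        }
        printf("online logical processors: %ld\n", online);
        return 0;
    }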
16. NUMA technology
NUMA is non-uniform memory access, a distributed shared-memory technology. A NUMA system is composed of several independent nodes connected through a high-speed dedicated network, where each node can be a single CPU or an SMP system. In NUMA there are multiple approaches to cache consistency, requiring support from the operating system and special software. Figure 2 is an example of Sequent's NUMA system: three SMP modules are connected by a high-speed dedicated network to form a node, and each node can have 12 CPUs. A system like Sequent's can scale up to 64 or even 256 CPUs. Obviously, this takes SMP as the base and expands it with NUMA technology; it is a combination of the two.
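The practical cost of NUMA is that remote memory is slower than local memory; the toy C calculation below shows how the average latency degrades as more accesses go remote. The 80ns local and 140ns remote figures are assumed for illustration, not taken from the text:

    #include <stdio.h>

    /* Effective latency = local_fraction * local_ns + remote_fraction * remote_ns.
       The latency figures below are assumed example values. */
    int main(void) {
        double local_ns = 80.0, remote_ns = 140.0;   /* illustrative */
        for (double remote_frac = 0.0; remote_frac <= 0.5; remote_frac += 0.25) {
            double avg = (1.0 - remote_frac) * local_ns + remote_frac * remote_ns;
            printf("%.0f%% remote accesses -> %.0f ns average\n", remote_frac * 100.0, avg);
        }
        return 0;
    }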
17. Out-of-order execution technology
Out-of-order execution means the CPU is allowed to send multiple instructions, not in the order specified by the program, to the corresponding circuit units for processing. After analyzing the state of each circuit unit and whether each instruction can be executed early, the instructions that can be executed early are sent immediately to the corresponding circuit units; during this period the instructions are not executed in the specified order, and a reorder unit then rearranges the results of each execution unit back into instruction order. The purpose of out-of-order execution is to keep the CPU's internal circuits fully busy and correspondingly increase the speed at which the CPU runs programs. Branch technology: branch instructions must wait for a result before operations can proceed. In general, unconditional branches only need to be executed in instruction order, while conditional branches must decide whether to continue in the original order based on the processed result.
18. Memory controller inside the CPU
Many applications have more complex read patterns (almost random, especially when cache hits are unpredictable) and do not use bandwidth efficiently. Business-processing software is a typical example: even with CPU features such as out-of-order execution, it is still limited by memory latency. The CPU must wait until the data required by an operation has been loaded before it can execute the instruction (whether the data comes from the CPU cache or from the main memory system). The memory latency of current low-end systems is about 120-150ns, while CPU speeds have reached more than 3GHz, so a single memory request may waste 200-300 CPU cycles. Even with a 99% cache hit rate, the CPU may spend 50% of its time waiting for memory requests to complete, simply because of memory latency.
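The cycle figures above come from multiplying latency by clock rate; here is a tiny C sketch of that arithmetic, using assumed effective latencies of 70-100ns at 3GHz to land in the 200-300 cycle range quoted:

    #include <stdio.h>

    /* Cycles stalled per memory request ~= latency (ns) x clock frequency (GHz).
       The 70-100 ns effective latencies below are assumed for illustration. */
    int main(void) {
        double clock_ghz = 3.0;
        double latencies_ns[] = {70.0, 100.0};
        for (int i = 0; i < 2; i++)
            printf("%.0f ns latency -> ~%.0f cycles stalled\n",
                   latencies_ns[i], latencies_ns[i] * clock_ghz);   /* 210 and 300 cycles */
        return 0;
    }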
You can see that the latency of the Opteron's integrated memory controller is much lower than that of a chipset-based dual-channel DDR memory controller. Intel also plans to integrate the memory controller into the processor, which will make the northbridge chip less important. It changes the way the processor accesses main memory, helping to increase bandwidth, reduce memory latency, and improve processor performance.