Rapid acquisition of the genetic information of living organisms is of great significance to life science research. Figure 1 above traces the development of sequencing technology since Watson and Crick established the double-helix structure of DNA in 1953.
First generation sequencing technology
The first generation of DNA sequencing technology used either the chain termination method pioneered by Sanger and Coulson in 1975 or the chemical (chain degradation) method invented by Maxam and Gilbert in 1976-1977; both were complete by 1977. From then on, humans gained the ability to look into the genetic basis of differences between living things, and the era of genomics began. Researchers kept improving Sanger's method in practice over many years, and in 2001 the first human genome map was completed using an improved Sanger method. The core principle of the Sanger method is that a ddNTP has no hydroxyl group at either the 2' or the 3' position, so once it is incorporated no further phosphodiester bond can form and DNA synthesis stops. Four DNA synthesis reactions are set up, and a certain proportion of one radioisotope-labeled ddNTP (ddATP, ddCTP, ddGTP or ddTTP) is added to each. After gel electrophoresis and autoradiography, the DNA sequence of the molecule under test can be read from the positions of the electrophoresis bands (Figure 2). This website has made a vivid short film about Sanger sequencing.
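To make the band-reading step concrete, here is a minimal toy model in Python. It is only a sketch under simplifying assumptions: every possible termination product is present, the gel resolves fragments that differ by one base, and the template and all function names are made up for illustration.

```python
# Toy model of Sanger chain termination (illustrative only; names are hypothetical).

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def synthesized_strand(template: str) -> str:
    """Return the strand the polymerase builds on the template, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(template))


def termination_lengths(template: str) -> dict[str, list[int]]:
    """For each of the four ddNTP reactions, list the fragment lengths it can produce."""
    new_strand = synthesized_strand(template)
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        # A ddNTP of this base can be incorporated here and end the chain.
        lanes[base].append(position)
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read bands from the shortest fragment to the longest to recover the sequence."""
    bands = sorted((length, base) for base, lengths in lanes.items() for length in lengths)
    return "".join(base for _, base in bands)


template = "TACGGATC"  # made-up 5'->3' template
lanes = termination_lengths(template)
print(lanes)            # lane for each ddNTP -> fragment lengths
print(read_gel(lanes))  # GATCCGTA, the strand complementary to the template
```

Each lane corresponds to one of the four reactions, and sorting all bands by length plays the role of reading the autoradiograph from bottom to top.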
It is worth noting that, besides the Sanger method, some other sequencing techniques appeared in the early stage of sequencing technology development, such as pyrosequencing and sequencing by ligation. Pyrosequencing was later used by Roche's 454 technology [2-4], and sequencing by ligation was later used by ABI's SOLiD technology [2-4]; what they share with the Sanger method [1] is the core idea of using dNTPs that can interrupt the DNA synthesis reaction.
Second generation sequencing technology
Generally speaking, the main features of first generation sequencing are a read length of up to 1000 bp and an accuracy as high as 99.999%. However, its high cost and low throughput seriously limit truly large-scale application, so first generation sequencing is not the ideal sequencing method. After continuous technical development and improvement, second generation sequencing was born, marked by Roche's 454 technology, Illumina's Solexa and HiSeq technology, and ABI's SOLiD technology. Second generation sequencing greatly reduces cost and greatly increases speed while maintaining high accuracy: sequencing a human genome used to take three years, whereas second generation sequencing needs only about one week. Its read length, however, is much shorter than that of first generation sequencing. Table 1 and Figure 3 give a simple comparison of the characteristics of the first and second generation technologies and of sequencing cost [5]. Below I briefly introduce the main principles and characteristics of these three main second generation technologies.
Illumina
Illumina's Solexa and HiSeq are probably the most widely used second generation sequencers in the world at present, and the two series share the same core principle. They use sequencing by synthesis, and the sequencing process has four main steps, as shown in Figure 4.
(1) Construction of the DNA library to be sequenced
At present, apart from assembly and other special requirements, the DNA sample is mainly broken into fragments 200-500 bp long, and different adapters are added to the two ends of these small fragments to construct a single-stranded DNA library.
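As a rough illustration of why a size range such as 200-500 bp is selected after shearing, the sketch below cuts one long molecule at random points and keeps only fragments in that window. The uniform random cutting model, the 100 kb molecule, the mean fragment size and the function names are all assumptions made for this example; real fragmentation methods behave differently.

```python
import random


def shear(length: int, mean_fragment: int, rng: random.Random) -> list[int]:
    """Cut a molecule of `length` bp at random points and return the fragment lengths."""
    n_cuts = max(length // mean_fragment - 1, 0)
    cuts = sorted(rng.sample(range(1, length), n_cuts))
    edges = [0] + cuts + [length]
    return [right - left for left, right in zip(edges, edges[1:])]


def size_select(fragments: list[int], low: int = 200, high: int = 500) -> list[int]:
    """Keep only fragments inside the library size window (inclusive)."""
    return [f for f in fragments if low <= f <= high]


rng = random.Random(0)                # fixed seed so the run is repeatable
fragments = shear(100_000, 350, rng)  # hypothetical 100 kb molecule
selected = size_select(fragments)
print(len(fragments), "fragments,", len(selected), "inside 200-500 bp")
```

Under random cutting, a large share of fragments falls outside the window, which is one reason a size-selection step is needed before adapters are added.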
(2) Flow cell
The flow cell is the channel that captures the flowing DNA fragments. Once the library has been constructed, its DNA attaches at random to the lanes on the flow cell surface as it passes through. Each flow cell has 8 lanes, and the surface of each lane carries many attached adapters that pair with the adapters added to both ends of the DNA fragments during library construction (which is why the flow cell can capture the DNA after library construction) and that support bridge PCR amplification of the DNA on the surface.
(3) Bridge PCR amplification and denaturation
Bridge PCR performs bridge amplification using the adapters fixed on the flow cell surface, as shown in Figure 4a. After repeated cycles of amplification and denaturation, each DNA fragment ends up as a cluster at its own position, and each cluster contains many copies of a single DNA template. The purpose of this step is to amplify the signal intensity of each base to the level that sequencing requires.
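The growth of a cluster over repeated cycles can be sketched with simple arithmetic. The code below assumes idealized exponential amplification, where each cycle multiplies the copy number by (1 + efficiency); the target of 1000 copies per cluster and the 80% efficiency are hypothetical figures chosen for illustration, not values from any platform documentation.

```python
import math


def copies_after(cycles: int, efficiency: float = 1.0) -> float:
    """Copies of one template after `cycles` rounds; efficiency 1.0 means perfect doubling."""
    return (1.0 + efficiency) ** cycles


def cycles_needed(target_copies: float, efficiency: float = 1.0) -> int:
    """Smallest number of cycles that reaches at least `target_copies`."""
    return math.ceil(math.log(target_copies) / math.log(1.0 + efficiency))


print(copies_after(10))          # 1024.0 copies with perfect doubling
print(cycles_needed(1000))       # 10 cycles for a hypothetical 1000-copy cluster
print(cycles_needed(1000, 0.8))  # more cycles are needed at 80% efficiency
```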
(4) Sequencing
Sequencing uses the sequencing-by-synthesis method. DNA polymerase, adapter primers and four dNTPs carrying base-specific fluorescent labels are added to the reaction system at the same time (as in Sanger sequencing). The 3'-OH of these dNTPs is chemically protected, so only one dNTP can be added per cycle. After a dNTP has been added to the growing strand, all unused free dNTPs and DNA polymerase are washed away. Then the buffer needed for fluorescence excitation is added, the fluorescence is excited by laser and recorded by optical equipment, and finally computer analysis converts the optical signal into the sequenced base. Once the fluorescence signal has been recorded, chemical reagents are added to quench the fluorescence and remove the 3'-OH protecting group of the dNTP so that the next sequencing cycle can proceed. Because only one dNTP is added per cycle, Illumina's technology handles the accurate measurement of homopolymer length well, and the main source of sequencing error is base substitution. At present, the sequencing error rate is between 1% and 1.5%. Taking human genome resequencing as an example, sequencing to 30x depth takes about one week.
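The conversion of optical signal into bases can be sketched as picking the brightest of four channels in each cycle. This is only a toy version: the intensity values are invented, the channel order is an assumption, and real base callers also correct for effects that are ignored here.

```python
CHANNELS = "ACGT"  # assumed order of the four fluorescence channels


def call_bases(intensities: list[tuple[float, float, float, float]]) -> str:
    """Call one base per cycle by taking the channel with the strongest signal."""
    read = []
    for cycle in intensities:
        brightest = max(range(4), key=lambda i: cycle[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


# One tuple per sequencing cycle: (A, C, G, T) intensities, made up for illustration.
cycles = [
    (0.9, 0.1, 0.0, 0.1),  # A
    (0.1, 0.0, 0.1, 0.8),  # T
    (0.0, 0.1, 0.1, 0.9),  # T: the second T of a homopolymer gets its own cycle
    (0.1, 0.7, 0.2, 0.0),  # C
    (0.2, 0.1, 0.6, 0.1),  # G
]
print(call_bases(cycles))  # ATTCG
```

Because every cycle adds exactly one base, a run of identical bases is simply several cycles with the same brightest channel, so homopolymer length is counted rather than estimated. A miscall happens when the wrong channel is brightest, which shows up as a substitution.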
Roche 454
The Roche 454 sequencing system was the first commercially operated second generation sequencing platform. Its main sequencing principle is as follows (Figure 5a-c) [2]:
(1) Preparation of the DNA library
The 454 system builds its library differently from Illumina. It breaks the DNA to be sequenced into small fragments 300-800 bp long by nebulization and adds different adapters to the two ends of the fragments, or it denatures the DNA to be sequenced, amplifies it by PCR and ligates it into a vector, to construct a single-stranded DNA library (Figure 5a).
(2) Emulsion PCR (emulsion PCR is in essence a distinctive water-in-oil process)
The DNA amplification process is also very different from Illumina's. The single-stranded DNA is combined with magnetic beads about 28 um in diameter inside water-in-oil droplets, then incubated and annealed.
The biggest feature of emulsion PCR is that it creates a huge number of independent reaction compartments for DNA amplification. The key technique is "water-in-oil" emulsification. The basic procedure is to inject, before the PCR reaction, an aqueous solution containing all the PCR reaction components onto the surface of mineral oil that is being spun at high speed; the aqueous solution instantly breaks into countless tiny water droplets wrapped in mineral oil. Each droplet forms an independent PCR reaction compartment. Ideally, each droplet contains exactly one DNA template and one magnetic bead.
The surface of the beads enclosed in the droplets carries DNA sequences complementary to the adapter, so the single-stranded DNA fragments bind specifically to the beads. The incubation system also contains the PCR reagents, which ensures that each small fragment bound to a bead is amplified independently by PCR and that the amplification products also stay bound to the bead. When the reaction is complete, the emulsion can be broken and the DNA-carrying beads enriched. After amplification, each small fragment has been amplified roughly one million-fold, which yields the amount of DNA needed for the subsequent sequencing.
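The ideal of one template per droplet can only be approached statistically. A common way to reason about it, used here as an assumption rather than something stated in the text above, is to treat the number of templates per droplet as Poisson distributed. The sketch below shows how the dilution (mean templates per droplet) trades empty droplets against droplets with mixed templates; the function names are hypothetical.

```python
import math


def poisson(k: int, mean: float) -> float:
    """Poisson probability of exactly k templates in a droplet."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


def droplet_fractions(mean_templates: float) -> dict[str, float]:
    """Fractions of droplets that are empty, hold one template, or hold several."""
    empty = poisson(0, mean_templates)
    single = poisson(1, mean_templates)
    return {"empty": empty, "single": single, "multiple": 1.0 - empty - single}


for mean in (0.1, 0.5, 1.0):  # hypothetical dilutions
    fractions = droplet_fractions(mean)
    clonal = fractions["single"] / (1.0 - fractions["empty"])
    print(f"mean={mean}: {fractions}, clonal share of occupied droplets={clonal:.3f}")
```

Under this model, diluting the library raises the share of occupied droplets that are clonal, at the price of leaving most droplets empty, which is why the DNA-carrying beads have to be enriched afterwards.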
(3) Pyrosequencing
Before sequencing, the DNA-carrying beads are treated with polymerase and single-strand binding protein and then placed on a PTP plate. This plate is specially made with many small wells about 44 um in diameter, and each well can hold only one bead. In this way the position of each bead is fixed so that the subsequent sequencing reactions can be monitored.
Sequencing uses pyrosequencing: the beads, whose diameter is smaller than the wells on the PTP plate, are loaded into the wells and the sequencing reaction begins. The reaction uses the many single-stranded DNA copies amplified on each bead as the template, and one kind of dNTP is added in each reaction for synthesis. If the dNTP pairs with the sequence to be read, a pyrophosphate group is released after incorporation. The released pyrophosphate reacts with ATP sulfurylase in the reaction system to generate ATP. The ATP then drives luciferase to oxidize luciferin, which emits light; the light is recorded by the CCD camera on the other side of the PTP plate, and the final sequencing result is obtained by computer processing of the optical signal. Because only one kind of dNTP is added in each reaction, the base at each position can be determined from which dNTP produced a light signal. After each reaction, the free dNTPs and the ATP are degraded by apyrase, which quenches the light and lets the sequencing reaction enter the next cycle. In 454 sequencing, each sequencing reaction takes place in its own well on the PTP plate, which greatly reduces cross-talk and sequencing bias. The biggest advantage of 454 technology is its long read length. At present the average read length of 454 technology reaches 400 bp, which sets it apart from Illumina's Solexa and HiSeq technology. Its main disadvantage is that it cannot measure homopolymer length accurately. For example, when the sequence contains a stretch such as poly(A), several T's are incorporated in a single sequencing reaction, and the number of T's added can only be estimated from the signal intensity, which may give inaccurate results.
It is also for this reason that 454 technology will introduce insertion and deletion sequencing errors in the sequencing process.
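The homopolymer problem is easy to see in a toy decoder. The sketch below assumes a fixed cyclic flow order of T, A, C, G and a light signal roughly proportional to the number of bases incorporated in each flow; both the flow order and the signal values are assumptions made for this illustration.

```python
FLOW_ORDER = "TACG"  # assumed cyclic order in which the four dNTPs are flowed


def decode_flowgram(signals: list[float], flow_order: str = FLOW_ORDER) -> str:
    """Turn per-flow light intensities into bases by rounding each to a whole count."""
    read = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        count = int(signal + 0.5)  # nearest whole number of incorporated bases
        read.append(base * count)
    return "".join(read)


clean = [1.02, 0.05, 0.98, 2.03, 0.0, 1.1, 0.0, 0.0, 4.1]
noisy = [1.02, 0.05, 0.98, 2.03, 0.0, 1.1, 0.0, 0.0, 4.6]  # last flow drifts upward

print(decode_flowgram(clean))  # TCGGATTTT
print(decode_flowgram(noisy))  # TCGGATTTTT: one extra T, an insertion error
```

A drift of 0.5 in signal is proportionally small for a run of four bases, yet it is enough to change the called length, so errors show up as insertions and deletions rather than substitutions.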
SOLiD technology
SOLiD sequencing is a technology that ABI put into commercial sequencing use in 2007. It is based on ligation, that is, DNA ligase is used and the sequence is read during the ligation process (Figure 6) [2,4]. Its principle is:
(1) Construction of the DNA library
The DNA is fragmented, sequencing adapters are added to both ends of the fragments, and the fragments are ligated into vectors to construct a single-stranded DNA library.
(2) Emulsion PCR
SOLiD's PCR step is similar to that of 454 and likewise uses emulsion PCR, but its beads are much smaller than those of the 454 system, only 1 um. During amplification the 3' end of the amplified product is modified in preparation for the subsequent sequencing step, and the 3'-modified beads are deposited on a glass slide. While the beads are being loaded, the deposition chamber divides each slide into 1, 4 or 8 sequencing regions (Figure 6-a). The biggest advantage of the SOLiD system is that each slide holds beads at a higher density than 454, so higher throughput is easy to achieve on the same system.
(3) Sequencing by ligation
This step is unique to SOLiD sequencing. It does not use the DNA polymerase common in earlier sequencing methods but a ligase. The substrate of the SOLiD ligation reaction is a mixture of 8-base single-stranded fluorescent probes, written here simply as 3'-XXnnnzzz-5'. In the ligation reaction, these probes pair with the single-stranded DNA template according to base complementarity. The 5' end of each probe is labeled with one of four fluorescent dyes: CY5, Texas Red, CY3 and 6-FAM (Figure 6-a). In the 8-base probe, the first and second bases (XX) are defined, and positions 6-8 (zzz) carry a different fluorescent label depending on which pair XX is. This is SOLiD's distinctive sequencing scheme: two bases determine one fluorescence signal, which amounts to interrogating two bases at a time, so the method is also called two-base encoding. When a fluorescent probe is ligated to the DNA template, it emits the fluorescence signal that represents bases 1 and 2. The color swatches in Figure 6-a and Figure 6-b show the relationship between the different combinations of bases 1 and 2 and the fluorescence colors. After the fluorescence signal has been recorded, the probe is cleaved chemically between bases 5 and 6, which removes the fluorescent label and allows the next position to be sequenced. Note, however, that with this scheme successive interrogated positions are 5 bases apart: positions 1 and 2 the first time, positions 6 and 7 the second time, and so on. When the end has been reached, the newly synthesized strand is denatured and washed off. Next, primer n-1 is used for the second round of sequencing. Primer n-1 differs from primer n in that its pairing position on the adapter is shifted by one base (Figure 6-a, 8). That is, primer n-1 shifts the sequencing position by one base toward the 3' end relative to primer n, so that positions 0 and 1, positions 5 and 6, and so on can be determined in the second round. This continues until the fifth round, after which every position has been sequenced and the base at each position has been interrogated twice. The read length of this technology is 2×50 bp, and the subsequent sequence assembly is correspondingly complicated. Thanks to the double interrogation, the raw sequencing accuracy is as high as 99.94%, and the accuracy at 15x coverage reaches 99.999%, which is probably the highest among second generation sequencing technologies at present. However, in the fluorescence decoding stage, because each fluorescence signal is determined by two bases, a single error easily produces a chain of decoding errors.
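Two-base encoding and the chain of decoding errors it can cause are easier to follow in code. The sketch below uses the commonly described color scheme in which the bases are numbered A=0, C=1, G=2, T=3 and the color of a pair is the XOR of the two numbers; the actual dye-to-color assignment is the one shown in Figure 6, so treat the numbering here as an assumption. Decoding needs one known starting base, which in practice comes from the adapter.

```python
BASES = "ACGT"  # assumed numbering: A=0, C=1, G=2, T=3


def encode_colors(sequence: str) -> list[int]:
    """Encode each adjacent pair of bases as one color (0-3)."""
    codes = [BASES.index(base) for base in sequence]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def decode_colors(first_base: str, colors: list[int]) -> str:
    """Rebuild the base sequence from a known first base and the color calls."""
    current = BASES.index(first_base)
    decoded = [first_base]
    for color in colors:
        current ^= color  # each color tells how to step from one base to the next
        decoded.append(BASES[current])
    return "".join(decoded)


sequence = "ATGGCA"                 # made-up example
colors = encode_colors(sequence)
print(colors)                       # [3, 1, 0, 3, 1]
print(decode_colors("A", colors))   # ATGGCA

wrong = colors.copy()
wrong[1] = 2                        # a single miscalled color
print(decode_colors("A", wrong))    # ATCCGT: every base after the error changes
```

In this scheme a true single-base variant changes two adjacent colors, while a single miscalled color corrupts everything downstream when decoded naively, which is the linked decoding error described above.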
Third generation sequencing technology
Sequencing technology has reached a new milestone in the last two or three years. PacBio's SMRT technology and Oxford Nanopore Technologies' nanopore single-molecule sequencing are called third generation sequencing technologies. Compared with the previous two generations, their biggest feature is single-molecule sequencing: no PCR amplification is needed in the sequencing process.
PacBio SMRT technology also applies the idea of sequencing by synthesis, with the SMRT chip as the sequencing carrier. The basic principle is that DNA polymerase binds to the template and the four bases (that is, the dNTPs) are labeled with four fluorescent colors. During base pairing, incorporation of different bases emits different light, and the type of the incorporated base can be determined from the wavelength and peak of the light. The DNA polymerase is one of the keys to achieving ultra-long read length: read length depends mainly on how long the enzyme stays active, and enzyme activity is affected mainly by laser damage. Another key to PacBio SMRT technology is how to distinguish the reaction signal from the strong fluorescent background of the surrounding free bases. For this it uses the ZMW (zero-mode waveguide) principle, the same principle behind the many dense small holes that can be seen in the wall of a microwave oven. The diameter of those holes is chosen carefully. If the diameter were larger than the microwave wavelength, the energy would pass through the panel by diffraction and leak out, interfering with the neighboring holes. If the diameter is smaller than the wavelength, the energy does not radiate to the surroundings but stays confined (the principle of light diffraction), so the panel acts as a shield. Similarly, a reaction tube (SMRT Cell: single molecule real-time reaction cell) contains many circular nanoscale holes, the ZMWs, whose outer diameter is a little over 100 nm, smaller than the wavelength of the detection laser (several hundred nm). When the laser shines in from the bottom, it cannot pass through the hole into the solution above; its energy is confined to a small region (a volume of about 20 × 10^-21 L) that just covers the part to be detected. The signal therefore comes only from this small reaction region, and the excess free nucleotide monomers outside the hole remain in the dark, which keeps the background to a minimum.
In addition, some base modifications can be detected from the time between two adjacent incorporations: if a base is modified, the polymerase slows down as it passes and the distance between the two adjacent peaks increases, so information such as methylation can be detected (Figure 7). SMRT sequencing is very fast, about 10 dNTPs per second. At the same time, its sequencing error rate is relatively high, reaching 15% (this is at present a problem common to almost all single-molecule sequencing technologies). Fortunately, its errors are random and, unlike those of second generation sequencing, show no systematic bias, so they can be corrected effectively by sequencing multiple times.
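Why random errors can be corrected by repeated sequencing can be shown with a rough calculation. The sketch below assumes that each pass over a position is wrong independently with probability 15% (the figure quoted above) and that the consensus fails only if more than half of the passes are wrong. This binary right-or-wrong model is a simplification of my own for illustration; it ignores indels and the fact that wrong reads need not agree with each other.

```python
from math import comb


def majority_error(per_read_error: float, passes: int) -> float:
    """Probability that more than half of `passes` independent reads are wrong."""
    p = per_read_error
    return sum(
        comb(passes, k) * p ** k * (1.0 - p) ** (passes - k)
        for k in range(passes // 2 + 1, passes + 1)
    )


for passes in (1, 3, 5, 9, 15):  # odd numbers of passes avoid ties
    print(passes, f"{majority_error(0.15, passes):.6f}")
```

With a single pass the error is the raw 15%, and it falls quickly as passes are added. The same calculation would not help with a systematic bias, because every pass would then make the same mistake.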
The nanopore single-molecule sequencing technology developed by Oxford Nanopore Technologies differs from all the previous sequencing technologies in that it is based on electrical signals rather than optical signals. A key point of the technology is a specially designed nanopore with a molecular adapter covalently bound inside it. When DNA bases pass through the nanopore, they change the charge and thus briefly affect the strength of the current flowing through the pore (each base changes the current by a different amplitude); sensitive electronics detect these changes and so identify the base passing through (Figure 8).
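The idea of identifying bases from current amplitude can be sketched as a nearest-level lookup. This is a deliberately simplified illustration: the picoampere levels below are invented, one measurement per base is assumed, and a real signal is much harder to segment and interpret than this.

```python
# Hypothetical mean current level (pA) while each base occupies the pore.
LEVELS_PA = {"A": 50.0, "C": 42.0, "G": 58.0, "T": 35.0}


def call_base(current_pa: float) -> str:
    """Pick the base whose reference level is closest to the measured current."""
    return min(LEVELS_PA, key=lambda base: abs(LEVELS_PA[base] - current_pa))


def call_read(trace: list[float]) -> str:
    """Call one base per current measurement in the trace."""
    return "".join(call_base(current) for current in trace)


trace = [49.2, 36.1, 41.0, 57.5, 50.9]  # made-up measurements, one per base
print(call_read(trace))  # ATCGA
```

The closer together the reference levels are relative to the measurement noise, the more often the nearest level is the wrong one, which gives some intuition for the error rates of single-molecule electrical readout.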
Last year the company launched the first commercial nanopore sequencer at the Advances in Genome Biology and Technology (AGBT) annual meeting, which attracted great attention from the scientific community. Nanopore sequencing (and other third generation sequencing technologies) is expected to overcome the shortcomings of current sequencing platforms. The main characteristics of nanopore sequencing are: very long reads, about tens of kb and even 100 kb; an error rate currently between 1% and 4%, with errors that are random rather than clustered at the ends of reads; data that can be read in real time; high throughput (a 30x human genome is expected to be completed in one day); starting DNA that is not destroyed during sequencing; and simple, cheap sample preparation. In theory, RNA can also be sequenced directly.
Another major feature of nanopore single-molecule sequencing is that methylated cytosine can be read directly, without the bisulfite treatment of the genome that traditional methods require. This is a great help for studying epigenetic phenomena directly at the genome level. Moreover, the sequencing accuracy of the improved method can reach 99.8%, and sequencing errors, once found, are easy to correct. However, there seem to be no reports yet on applications of this technology.
Other sequencing techniques
At present there is also a revolutionary new generation of sequencing technology based on semiconductor chips: Ion Torrent [6]. It uses a high-density semiconductor chip covered with tiny wells, each of which is a sequencing reaction cell. When DNA polymerase adds a nucleotide to the growing DNA strand, a hydrogen ion is released and the pH in the reaction cell changes; the ion sensor beneath the cell senses the H+ signal and converts it directly into a digital signal, from which the DNA sequence is read (Figure 9). Jonathan Rothberg, the inventor of this technology, is also one of the inventors of 454 sequencing. Its library and sample preparation are very similar to those of 454 technology, and it could even be called a copy of 454, except that during sequencing the base information is obtained by detecting changes in the H+ signal rather than by detecting the light produced from released pyrophosphate. Compared with other sequencing technologies, Ion Torrent needs no expensive imaging equipment, so it is relatively cheap, relatively small, simpler to operate and quite fast. Excluding the two days of library preparation, the whole on-instrument sequencing run finishes in 2-3.5 hours. The throughput of a whole chip is not high, about 10 G at present, but it is well suited to sequencing small genomes and exons.
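A quick calculation shows why a chip output of about 10 G suits small genomes better than large ones. The genome sizes below are round, assumed figures for illustration (a 5 Mb bacterial genome and a human genome of roughly 3 Gb), and 10 G is read as 10 billion bases.

```python
import math


def coverage(throughput_bases: float, genome_size_bases: float) -> float:
    """Average depth obtained when one run's output is spread over one genome."""
    return throughput_bases / genome_size_bases


def runs_needed(target_depth: float, genome_size_bases: float, throughput_bases: float) -> int:
    """Number of whole runs required to reach the target average depth."""
    return math.ceil(target_depth * genome_size_bases / throughput_bases)


CHIP_OUTPUT = 10e9      # about 10 G bases per chip, as quoted above
BACTERIAL_GENOME = 5e6  # assumed 5 Mb genome
HUMAN_GENOME = 3e9      # assumed round figure of 3 Gb

print(coverage(CHIP_OUTPUT, BACTERIAL_GENOME))   # 2000.0x on a small genome
print(coverage(CHIP_OUTPUT, HUMAN_GENOME))       # only about 3.3x on a human genome
print(runs_needed(30, HUMAN_GENOME, CHIP_OUTPUT))  # 9 runs for 30x human coverage
```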
Summary
The principles of each generation of sequencing technology have been briefly described above, and Tables 1 and 2 below summarize the characteristics of these three generations. Sequencing cost, read length and throughput are three important criteria for judging how advanced a sequencing technology is. Apart from the differences in throughput and cost, the first and second generation technologies share the same core principle, sequencing by synthesis (the exception is SOLiD, which sequences by ligation). The advantage of second generation sequencing is that cost is greatly reduced and throughput greatly increased compared with the first generation; its disadvantages are that the PCR step it introduces increases the sequencing error rate to some extent, that it has systematic bias, and that its reads are short. Third generation sequencing was developed to address the shortcomings of the second generation. Its fundamental feature is single-molecule sequencing without any PCR step, which effectively avoids the systematic error caused by PCR bias, increases read length, and keeps the second generation's advantages of high throughput and low cost.
Table 1: Comparison of sequencing technologies
Table 2: Sequencing cost comparison of mainstream sequencers
Figure 10 below shows the current global distribution of sequencers. The hot spots in the figure are mainly in Shenzhen, China (mainly BGI), southern Europe, western Europe and the United States.
Reference
Original link: http://www.huangshujia.me/2013/08/02/2013-08-02-an-introduction-of-ngs-sequence.html.