The storage unit of life's information is what we call the gene, and its carrier is deoxyribonucleic acid (DNA). In multicellular organisms, information flows between different cells, between tissues composed of different cells, and between organs formed from different tissues. This is what we call the central dogma of genetics. The genome is the sum of an organism's genetic information. The discovery of the DNA double helix may be the greatest breakthrough in life science in the 20th century. The four bases A, T, C and G constitute a complex genetic language.
In fact, most human diseases are controlled by multiple genes. The Human Genome Project was officially launched in 1990, with a plan to determine the complete DNA sequence of the human genome within 15 years, by 2005. To this day we have no technology that can sequence a chromosome directly. So the whole Human Genome Project is really a process of going from complex to simple, and then from simple back to complex. When human genome sequencing began, DNA sequence analysis relied mainly on gel electrophoresis, which was basically a manual operation. In the 1990s a new sequencing technology, capillary electrophoresis, came into being and greatly accelerated sequencing: about 1 million base pairs can be read in one day. China also joined the sequencing project and undertook 1% of the task. In April 2000, the draft sequence of chromosome 21 was completed.
Now we can spot the human genome onto a fingernail-sized biochip, with all the genes spotted on it. When you see a doctor in the future, you should bring not only your registration card but also your own chip. Before making a diagnosis, the doctor can use the chip to see which diseases you may be prone to. By comparing the genomes of organisms at different evolutionary stages, we can find the rules governing the structure and functional regulation of the genome. Genes related to human disease are also important information about the structural and functional integrity of the human genome. In fact, over the past few years the study of disease has long been an important part of human genome research.
In 1997 two projects were proposed: one is the Cancer Genome Anatomy Project, and the other is the Environmental Genome Project. Both are really about health. The Human Genome Project contributes to medicine in two ways: one is diagnosis and the other is gene therapy. For a developing country like ours, we should pay more attention to prevention.
China's genome project started in 1994, beginning from the perspective of functional genomics. Paying equal attention to structure and function, establishing interdisciplinary key technologies, and studying genome diversity and disease genes: this was the initial strategy. We can proudly say that every chromosome except the Y chromosome now carries genes discovered and named by scientists in China. Recently we have carried out a large-scale SNP study in China. This work has shifted from population genetics to studying the genetic information behind the occurrence and development of diseases that are characteristic of the Chinese population. If we can now build a systematic catalogue and database of the variation in these elements of life in the Chinese population, we can secure intellectual property rights for technological innovation in China's biomedical industry, benefit future generations and contribute to all mankind.
full text
Of course, Tsinghua is one of the top universities in our country, so it is a little daunting to be here today. I am mainly here to seek your advice. I want to introduce the study of the human genome, which can be said to be the first time the concept of "big science" has been realized in the life sciences: analyzing genetic information as a whole and studying the function of the genome. So I say that biology has reached a new platform. In the 1970s and 1980s it focused mainly on analysis, with ever finer disciplines and ever finer division of labor. The new platform is one of grand synthesis. In fact, science in China has emphasized grand synthesis from the very beginning. Our art is the same: freehand brushwork is a grand synthesis. This integration of East and West is very important. If we combine the rigorous analysis of the West with the holistic thinking China developed thousands of years ago, I think it may bring new opportunities for breakthroughs. Now, this picture is, I believe, very familiar not only to those working in the life sciences but also to our students outside the life sciences: the central dogma of genetics.
As we all know, the essence of life activity is the flow of information. We all say we work in the life sciences, but when someone suddenly asks, "What is life?", it makes people think. My personal understanding is that one important characteristic of life is that its information has a memory function, a storage unit. That storage unit is what we call the gene, and in most organisms its carrier is deoxyribonucleic acid (DNA). Its executing unit, however, is mainly protein. The two use different information languages: one is the language of nucleic acids, the other is the language of amino acids. So this flow of information needs regulatory mechanisms. As we all know, the first step is transcription. At this stage the language of life's information has not changed; it is still the language of nucleic acids, only passing from DNA to mRNA. Then the language changes, and that requires translation: from the language on the mRNA to the language of protein. Many proteins have metabolic activity, and an important difference between living and non-living things is metabolism. Proteins can also fold into higher-order spatial conformations. Within the cell, different parts interact, and the nucleus and the cytoplasm interact. In a multicellular organism, information flows between different cells, between tissues composed of different cells, and between organs formed from different tissues. I think this is what we call the central dogma of genetics. The concept of the gene is clear to everyone, or at least the basic concept is, although an exact definition may still be lacking today.
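The two steps described above, transcription (same nucleic acid language) and translation (nucleic acid language to amino acid language), can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not part of the lecture: the input is taken to be the coding strand read 5' to 3', the standard genetic code is used, and splicing and regulation are ignored. The function names `transcribe` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """DNA -> mRNA: the language stays nucleic acid, T is written as U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """mRNA -> protein: read codons of three bases until a stop codon."""
    dna_like = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna_like) - 2, 3):
        amino_acid = CODON_TABLE[dna_like[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTGGTAA")
print(mrna)             # AUGGCCUGGUAA
print(translate(mrna))  # MAW
```

The point of the sketch is the change of alphabet: `transcribe` only rewrites one letter, while `translate` maps three-letter words in one language onto single letters of another.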
So what does "genome" mean? The genome is the sum of an organism's genetic information: not a single gene, but all the genes. It encodes all the amino acid sequences and the relationships among them, so the meaning is completely different. The discovery of the DNA double helix may be the biggest breakthrough in life science in the 20th century. Four different bases, A, T, C and G, constitute a complex genetic language and are the most basic symbols of life's information. These symbols really do strike us as simple. Yet with these four simple letters, nature has produced the countless forms of life whose diversity amazes us. In most organisms, as I just said, genetic information resides in the DNA molecule. The arrangement and combination of bases there determines, or determines to a considerable extent, the activities of life in the human body: what we call birth, aging, illness and death. When we talk about the double helix, we all know about base pairs, and that DNA is a biological macromolecule. Generally we do not express its size in units of mass but in length. One bp is one base pair. A gene, however, often spans thousands of base pairs, so we introduced the unit kb, a thousand base pairs. And because the genome is on a very large scale, we invented further units, such as Mb, a million base pairs.
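As a small illustration of the length units just introduced (bp, kb, Mb), the helper below converts a count of base pairs into the largest convenient unit. The function name `format_length` is hypothetical and the rounding behavior is just one reasonable choice.

```python
def format_length(base_pairs: int) -> str:
    """Express a DNA length in bp, kb (thousand bp) or Mb (million bp)."""
    if base_pairs >= 1_000_000:
        return f"{base_pairs / 1_000_000:g} Mb"
    if base_pairs >= 1_000:
        return f"{base_pairs / 1_000:g} kb"
    return f"{base_pairs} bp"


print(format_length(500))            # 500 bp
print(format_length(100_000))        # 100 kb
print(format_length(5_000_000))      # 5 Mb
```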
This was our understanding of the human genome before the genome project. We knew its length: the haploid genome is about 3 billion base pairs. Textbooks generally say that the coding sequence, that is, the part that is transcribed and expressed and can be called gene sequence (strictly speaking, the sequence in mature, processed mRNA), makes up less than 5%. In other words, non-coding sequences account for the vast majority. In the human nucleus, genetic information is organized into chromosomes: 22 autosomes and 2 sex chromosomes. Biology in the past, as we all know, was basically a workshop-style operation of master and apprentice. By the mid-1980s there had been, on the one hand, a great expansion of scientific thinking in the life sciences and, on the other, technological progress. Genetic engineering was by then very mature, DNA sequencing was relatively mature, and PCR was beginning to appear. As a result, the ambitions of life scientists took root: they were determined to break out of the workshop-style mode of operation, which physicists and even chemists looked down on, and to do something that could be called big science.
Of course, the state of research and scientific thinking is one side of it. But if we look back at the history of science, many important events were driven by demand. Some of our scientists criticize the idea that basic research should be combined with major social needs. I think that view is a little biased. There are various types of research: some is free exploration, which can be very fine-grained, and everyone doing it can carry the dream of a Nobel Prize in mind. But there is also research that genuinely aims to benefit mankind, and the challenges posed by this kind of research may in fact give rise to Nobel-worthy ideas that nobody anticipated. The Human Genome Project is a typical example.
Let us look first at what can be considered the formal call for proposals. Generally, a project like this begins with a call. The first call for the Human Genome Project can be taken to be a short article published in Science in 1986 by the Nobel laureate Dulbecco. What was its title? "A turning point in cancer research: sequencing the human genome." As we know, an ambitious young president, Kennedy, came to power in the United States. At that time he had two big plans in science: one was to send a man to the moon, and the other was to conquer cancer. With the smooth implementation of the Apollo program, humans landed on the moon in 1969. But the plan to conquer cancer failed. Why? It turned out that scientists had thought the problem too simple, assuming that a tumor was a matter of one or two genes. In fact, most tumors are polygenic; the problem involves the whole genome and is a disorder of the whole genetic information. As I said just now, do not think that one fusion gene is enough to cause leukemia in mice, because if that were so, leukemia would appear as soon as the gene was introduced. In our Lola leukemia work, after the fusion gene is injected into the fertilized egg, it takes a year for leukemia to develop, and it does not develop in every mouse. This shows that other decisive factors are involved. We now know that when several genes are introduced together, the onset of leukemia is greatly accelerated.
Dulbecco's article said that if we want to know more about tumors, we must from now on focus on the genome of the cell. Which species should we start with? If we want to understand human tumors, we must start with humans. A detailed knowledge of DNA will greatly advance the study of human tumors. In fact, most human diseases are polygenic. The commonly cited date for the official launch of the Human Genome Project is 1990, because that is when the US Congress formally launched the plan. This ambitious plan was to complete the DNA sequencing in 15 years, by 2005. How large was the investment? Three billion dollars, calculated at the time on the basis that sequencing one base pair cost about one dollar. The plan discussed here is the plan in the narrow sense: it is really a sequencing plan. Sequencing only gets us the gobbledygook; it is the first step toward understanding ourselves, and the most important thing is learning to read that gobbledygook. But even obtaining it involves many difficulties and hardships. To this day we have no technology that can sequence a chromosome directly from one end to the other. So the whole Human Genome Project can be summed up as going from complex to simple, from simple back to complex, and finally to simple again. That is, a chromosome that cannot be sequenced directly is broken down into smaller, workable units. How do we break it down? By mapping. Maps can be built by genetic methods or by physical methods. Genetic mapping, as we know, uses genetic markers to determine the relative distances between DNA markers. The other approach is to build so-called DNA contigs, sets of fragments that overlap one another and cover the whole chromosome from one end to the other.
In this way, a unit that cannot be sequenced directly is broken into relatively small, workable units. Finally, these are reassembled so that they faithfully reproduce the arrangement of life's information on the original chromosome, and along the way all human genes are identified. So the Human Genome Project in the narrow sense is a mapping plan: the genetic map, the physical map, the sequence map and then the gene map.
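The idea of covering a chromosome with overlapping fragments and then recombining them can be illustrated with a toy example. The sketch below repeatedly merges the two fragments with the longest suffix-to-prefix overlap. It is a deliberately naive greedy procedure chosen for readability, not the method used by the genome project; real assemblers must cope with sequencing errors, repeats and both DNA strands. The names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Merge overlapping fragments until no usable overlaps remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # remaining fragments do not overlap: a gap in the map
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three overlapping pieces of the made-up sequence ATGCGTACGTTAGC.
pieces = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(pieces))  # ['ATGCGTACGTTAGC']
```

If two fragments share no overlap, the function returns more than one sequence, which corresponds to a gap between contigs.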
The Human Genome Project used two large-scale sequencing strategies. One is the idea I just mentioned, which is called clone by clone. As I said, you build a contig of DNA clones covering the whole chromosome and then work through the clones one by one. The most commonly used clone is the BAC, the bacterial artificial chromosome, which is about 100 kb long. The clones are picked one at a time and subcloned; the subclones can be sequenced, and the reads are then assembled to restore the original sequence. This is the strategy used by the international public-sector sequencing project. It is really a product of historical evolution: it grew out of genetic and physical mapping. We all know the American company Celera, and we also know Craig Venter. Celera developed a method called the whole-genome shotgun, which, on the basis of some mapping information, breaks the genome directly into small fragments for random sequencing, builds contigs around large fragments, and then assembles everything with a supercomputer. Two major factors allowed the human genome, after preliminary mapping, to move quickly into sequencing, especially large-scale sequencing, and to advance toward what people had hoped for, and we have to admit that industry's contribution was very great. For example, at the start of the human genome work, DNA sequence analysis relied mainly on gel electrophoresis, which was basically a manual operation. In the first half of the 1990s a new sequencing technology, capillary electrophoresis, appeared. Automation and systems of industrial-style management were also introduced, so sequencing speed increased greatly. Here is one such instrument, called the MegaBACE. What does it do?
It is a capillary electrophoresis instrument that can read a sequence of several hundred bases in about two hours, so it can do ten runs a day. It has 96 channels, so that makes 960 lanes a day. According to the manufacturer's advertising, each lane can reach 1 kb, although in practice that is very hard to achieve; it is the ideal case. So one machine can produce about 1 million base pairs a day. Another problem once troubled the academic community, however: if we live in an era of knowledge explosion, then the explosion of biological information is the most striking of all.
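The throughput figure quoted above follows from simple arithmetic, sketched here with the lecture's numbers: 96 channels, ten runs a day, and an ideal read of 1 kb per lane. The function name and parameter names are hypothetical, and the 500-base case is only an illustrative stand-in for "hundreds of bases".

```python
def daily_throughput(channels: int = 96,
                     runs_per_day: int = 10,
                     read_length_bp: int = 1_000):
    """Return (lanes per day, bases per day) for one capillary sequencer."""
    lanes = channels * runs_per_day
    return lanes, lanes * read_length_bp


lanes, bases = daily_throughput()
print(lanes)  # 960
print(bases)  # 960000, i.e. roughly 1 million base pairs a day

# With more realistic reads of a few hundred bases the yield drops in proportion.
print(daily_throughput(read_length_bp=500)[1])  # 480000
```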
We can see that before the genome project started, DNA sequences in the public databases grew very slowly. After 1990 came a period of exponential growth. I made this tally for last year, 2000, when the international public-sector sequencing project and Celera each announced completion of the so-called working draft. That was the situation then, and it is roughly the situation now. Back in 1999 the public project faced a pressing challenge from Celera, which was founded in 1998 and claimed it would finish the human genome in three years, and the International Human Genome Project decided to meet the challenge. Worldwide, 16 groups shared the task of sequencing the human genome, and China joined this sequencing project as well. We undertook 1% of the task, and that 1% is still very important, because it is not easy for a developing country to squeeze into a club that belongs to developed countries. Some clubs we would like to squeeze into we cannot, such as the space station program, where people are still guarded toward us.
Here I want to explain what a working draft is, since everyone is talking about it. It is really a rough sketch. What does that mean? More than 90% of the sequence in the genome is obtained by sequencing, at 4 to 5 times coverage, BAC (bacterial artificial chromosome) contigs with known chromosomal positions, and the error rate should be less than 1%. In other words, first, coverage should exceed 90% of the genome; second, the error rate should be below 1%, that is, fewer than one wrong base pair in every 100. Although this is only a draft, it is already very useful: for a basic understanding of genome structure, for identifying and analyzing genes, for mapping and cloning disease genes, for discovering single nucleotide polymorphisms, and so on.
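To give a feel for why 4 to 5 times coverage is enough to reach "more than 90%", the sketch below uses an idealized model that is an assumption of this example and not something stated in the lecture: if reads land uniformly at random, the chance that a given base is hit at least once at average depth c is 1 - e^(-c) (the Poisson approximation often associated with Lander and Waterman). It also encodes the two working-draft criteria given above as a simple check. All function names are hypothetical.

```python
import math


def expected_fraction_covered(depth: float) -> float:
    """Idealized fraction of bases read at least once at a given depth."""
    return 1.0 - math.exp(-depth)


def meets_working_draft(coverage_fraction: float, error_rate: float) -> bool:
    """Working-draft criteria from the talk: >90% coverage, <1% error."""
    return coverage_fraction > 0.90 and error_rate < 0.01


for depth in (1, 4, 5):
    print(depth, round(expected_fraction_covered(depth), 3))
# 1 0.632
# 4 0.982
# 5 0.993

print(meets_working_draft(0.95, 0.005))   # True
print(meets_working_draft(0.868, 0.005))  # False: coverage is too low
```

Real projects fall short of the idealized figure because cloning and sequencing are not uniformly random, which is one reason gaps remain to be filled after the draft stage.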
Where there is a draft, there must be a finished version. The definition of the finished sequence requires that the clones used for sequencing faithfully represent the genomic structure of euchromatin, with coverage above 99.9% and a sequence error rate below one in ten thousand. Its relationship to the working draft is that, starting from the draft, one increases sequencing coverage, fills the gaps and raises accuracy until these standards are met. In other words, it is the next step after the draft. What was the state of sequencing on June 25, 2000? At that time the public-sector project, that is, the project supported by the governments of the United States, Britain, Germany, Japan, France and China, covered about 86.8% of the human genome. Part of that was finished, meaning it met the finished-sequence standard just mentioned: a little over 20%. About 66% of the sequence was at the so-called working draft stage. Strictly speaking, then, the draft was not complete, because we said coverage should exceed 90%. At the same time, however, Celera claimed that its coverage had exceeded 95%. Of course, its coverage included the public-sector contribution plus its own, so the two add up. I think we can accept that more than 90% of the sequence was covered at working-draft quality or better. Let us look at the public-sector sequencing project at that time and its distribution across the 24 chromosomes. In December 1999 the complete sequence of chromosome 22, one of the smallest human chromosomes, was determined, or rather the complete sequence of its euchromatic part. Note its short arm, which is a heterochromatic region and very hard to sequence, because it contains many empty sequences and few genes.
In April 2000 the complete sequence of chromosome 21 was finished, by the same definition, that is, the euchromatic part. On this slide, finished sequence is shown in deep red, and that chromosome was almost done. Yellow represents the working draft just mentioned, and most chromosome regions were partly at the working draft stage. In fact, when we now talk about completing the whole human genome sequence, we mean the euchromatic part, which is why some people say the human genome sequence may never be truly finished.
On February 15, 2001, the public-sector sequence appeared in Nature; it felt like soldier against soldier and general against general. On February 16, the Celera sequence was published. Clearly, after this new round of competition, the quality of the completed sequence was much higher than in June 2000. So we should regard the information assembled by the two sides as going beyond the working-draft definition I just gave. There is thus an intermediate state between the working draft and the finished sequence, which is called a high-quality draft. Even with such a high-quality draft, we basically know how much wealth our family has in terms of life's information. We found that our wealth is rather modest, smaller than we expected: our gene number is only about twice that of the nematode, an organism with just over 900 cells. We are only twice a worm. From lower to higher organisms, the complexity of the genome is determined not so much by the number of genes as by the length of genes. We recently finished sequencing a bacterium called Leptospira, which can cause infectious disease. It averages one gene per kb: such a small thing, with a genome of 5 million base pairs and 5,000 genes. We have 3 billion base pairs, but only 30,000 genes, at most close to 40,000. In yeast, where we come to eukaryotic cells, the average is about 5 to 10 kb per gene. Then there is the fruit fly, which seems to have fewer genes than the nematode, but its gene length reaches more than 100 kb, and in mammals, as in humans, a gene now spans more than 100 kb. So the possibilities for alternative splicing increase greatly. In addition, regulation in time and space, that is, developmental stage-specific and tissue-specific expression, greatly increases the complexity of these sequences.
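The comparison of gene density above is a simple division of genome length by gene count. The sketch below redoes it with the figures quoted in the talk; the function name `kb_per_gene` is hypothetical, and the result is the average genomic span per gene, which includes non-coding sequence between and within genes.

```python
def kb_per_gene(genome_bp: int, gene_count: int) -> float:
    """Average stretch of genome per gene, in kb."""
    return genome_bp / gene_count / 1_000


# Leptospira: 5 million base pairs, 5,000 genes.
print(kb_per_gene(5_000_000, 5_000))        # 1.0

# Human: 3 billion base pairs, 30,000 to 40,000 genes.
print(kb_per_gene(3_000_000_000, 30_000))   # 100.0
print(kb_per_gene(3_000_000_000, 40_000))   # 75.0
```

So the human genome is 600 times longer than the bacterial one but holds only 6 to 8 times as many genes, which is the sense in which gene length rather than gene number grows with complexity.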
Although the number of genes in higher organisms can reach the order of ten to the fifth, from tens of thousands to hundreds of thousands, the number of protein domains is limited: if the genome is compared to a building, the number of kinds of prefabricated parts that make up the building is in fact limited. Higher organisms do have richer combinations of domains, and genes for neural function, tissue-specific development and regulation, hemostasis and the immune system have expanded greatly in vertebrates. Hundreds of human genes originated from the lateral transfer of bacterial genes at some point in vertebrate evolution. There are also substantial differences between the genomes of different individuals: single nucleotide polymorphisms. Two haploid genomes differ at about 1 base in 1,250, and fewer than 1% of these differences cause protein variation.
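To make the SNP figures concrete, the sketch below applies the rate of 1 difference per 1,250 bases to a haploid genome of about 3 billion base pairs, and then takes 1% of that as an upper bound on protein-altering differences. The function name is hypothetical and the result is only an order-of-magnitude estimate.

```python
def expected_snp_differences(genome_bp: int = 3_000_000_000,
                             bases_per_difference: int = 1_250,
                             protein_altering_fraction: float = 0.01):
    """Return (total differences, upper bound on protein-altering ones)."""
    total = genome_bp // bases_per_difference
    protein_altering_max = int(total * protein_altering_fraction)
    return total, protein_altering_max


total, altering = expected_snp_differences()
print(total)     # 2400000 differences between two haploid genomes
print(altering)  # 24000, and the talk says the true share is below 1%
```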
This genetic gobbledygook has now been placed in front of us, and the next step is to understand it. To understand it, we must interpret it as a large-scale system: the information in the genome interacts with the external environment. Moreover, this genomic information did not fall from the sky; it developed through billions of years of evolution, so it should be interpreted comparatively. We should also consider that there is variation between individuals and between populations, and that this variation too is shaped by the external environment. So although functional genomics has no strict definition at present, in my personal view it includes at least these aspects. First, the study of DNA sequence variation in the human genome, whose core is the SNP, because it is the most common type of variation, though there are of course many others. Second, the study of the regulation of genome expression, that is, how expression varies across tissues, organs and developmental stages. Third, the study of model organisms, both for their evolutionary significance and for functional research. And for all of these studies, just as for sequencing, bioinformatics is not only a basic tool but also a new discipline, because in the end, integrating all this information into what we call systems biology must rely on theoretical approaches and large-scale information processing.