What are the applications of bioinformatics?

1. Sequencing and sequence alignment

Sequencing is the foundation and main data source of bioinformatics; the data may come from humans or from other organisms. The basic problem of sequence alignment is to compare two or more symbol sequences for similarity or dissimilarity. From a biological point of view, this problem covers the following tasks: reconstructing a complete DNA sequence from overlapping sequence fragments; determining physical and genetic maps from probe data under various experimental conditions; storing, traversing and comparing DNA sequences in databases; comparing two or more sequences for similarity; searching databases for related sequences and subsequences; finding recurring patterns of nucleotides; and finding the informative components of protein and DNA sequences. Alignment takes the biological characteristics of DNA sequences into account, such as local insertions and deletions (together called indels) and substitutions. The objective function is the minimum weighted sum of distances, or the maximum sum of similarities, over the set of mutations between the sequences. Alignment methods include global alignment, local alignment and gap penalties.

Dynamic programming is often used to compare two sequences. It is suitable for short sequences but not for massive genomic sequences (human DNA, for example, is on the order of 10^9 bp), where even an algorithm of linear complexity is hard to run. Heuristic methods are therefore unavoidable, and the well-known BLAST and FASTA algorithms and their improved variants are based on this premise.

2. Protein structure comparison and prediction

The basic problem is to compare the similarities or differences of the spatial structures of two or more protein molecules. The structure and function of a protein are closely related.
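As a concrete illustration of the two-sequence dynamic programming comparison described under item 1, here is a minimal sketch of a global alignment with a linear gap penalty (the Needleman-Wunsch scheme). The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration only; real tools use substitution matrices and more elaborate gap models.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = global_alignment("ACGT", "AGT")
print(score)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two lengths, which is why this approach is limited to short sequences.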
It is generally believed that proteins with similar functions generally have similar structures. A protein is a long chain of amino acids, with lengths ranging from about 50 to 1000-3000 amino acids (aa). Proteins have many functions: they act as enzymes, store and transport substances, transmit signals, serve as antibodies, and so on. It is generally believed that the amino acid sequence inherently determines the three-dimensional structure of a protein. Protein structure is described at four levels. The reasons for studying and predicting protein structure are these: in medicine, to understand how organisms function and to find docking targets for drugs; in agriculture, to achieve better genetic engineering of crops; and in industry, to make use of enzymes for synthesis.

Protein structures are compared directly because the three-dimensional structure of a protein is more stable in evolution than its primary structure, and it also carries more information than the amino acid sequence. Research on three-dimensional structure rests on the premise that amino acid sequences and three-dimensional structures correspond one to one (which is not necessarily true); physically, this can be explained by energy minimization. The structure of an unknown protein is predicted by observing and summarizing the regularities of proteins whose structures are known. Homology modeling and threading both fall into this category. Homology modeling is used to find protein structures with high similarity (more than 30% identical amino acids), whereas threading is used to compare different protein structures within an evolutionary family. However, the current state of protein structure prediction research is still far from meeting practical needs.

3. Gene identification and non-coding region analysis

The basic problem of gene identification is, given a genome sequence, to correctly identify the extent and exact position of each gene within it.
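One very simple way to locate candidate coding regions in a sequence is to scan for open reading frames (ORFs): stretches that begin with a start codon and run, in the same reading frame, to the first stop codon. The sketch below is only an illustration under simplifying assumptions: it looks at the forward strand only, treats ATG as the sole start codon, ignores introns, and uses a hypothetical function name and a made-up test sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ORFs on the forward strand.

    start is inclusive and end is exclusive; the stop codon is included.
    min_codons counts the codons from ATG up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk through the sequence one codon at a time in this frame.
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None
    return sorted(orfs)


for start, end, seq in find_orfs("CCATGAAATTTTAGGG"):
    print(start, end, seq)
```

On eukaryotic genomes an ORF scan alone is not sufficient, because genes are interrupted by introns; this is one reason statistical models are used as well.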
Non-coding regions include introns, which are spliced out of the transcript before the protein is made. Experiments indicate, however, that non-coding regions cannot simply be removed, so DNA as a genetic language evidently carries information in both coding regions and non-coding sequences. At present there is no general guiding method for analyzing the DNA sequences of non-coding regions. In the human genome, not all sequence is coding, that is, a template for some protein; the coding part accounts for only 3-5% of the total human genome sequence. Searching such a large sequence by hand is obviously out of the question. Methods for detecting coding regions include measuring codon frequencies in coding regions, first-order and second-order Markov chains, open reading frames (ORFs), promoter recognition, hidden Markov models (HMMs) and GENSCAN, spliced alignment, and so on.

4. Molecular evolution and comparative genomics

Molecular evolution studies the evolution of organisms by using the similarities and differences of the same gene sequence in different species, and constructs an evolutionary tree from them. This can be done with the coding DNA sequences, with the amino acid sequences they encode, or even by comparing the structures of the related proteins. The premise is that similar species are genetically similar. By comparison we can find what different species have in common and where they differ. Early research often used external traits, such as size, skin color and number of limbs, as the basis for inferring evolution. In recent years, with the completion of genome sequencing for many model organisms, molecular evolution can be studied from the perspective of the whole genome.

When matching genes across species, three situations usually have to be handled: orthologs, which are homologous genes in different species that generally keep the same function; paralogs, which are homologous genes with different functions; and xenologs, which are genes transferred between organisms by other means, such as genes injected by a virus.
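One simple way to build such an evolutionary tree from pairwise distances is UPGMA, an average-linkage clustering method that repeatedly merges the two closest clusters. The sketch below is illustrative only: the function name is hypothetical, the species labels and distances are made up rather than measured, and branch lengths are left out to keep it short.

```python
def upgma(names, dist):
    """Cluster taxa with UPGMA and return the tree as nested tuples.

    names: list of taxon labels
    dist:  symmetric matrix (list of lists) of pairwise distances
    """
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # d[(i, j)] with i < j is the current distance between clusters i and j.
    d = {
        (i, j): dist[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop distances that involve the two merged clusters.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        for c, v in new_d.items():
            d[(c, next_id)] = v  # c is always smaller than next_id
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Made-up distances for illustration, in the order of `names`.
names = ["human", "chimp", "mouse", "chicken"]
dist = [
    [0, 2, 8, 16],
    [2, 0, 8, 16],
    [8, 8, 0, 16],
    [16, 16, 16, 0],
]
print(upgma(names, dist))
```

UPGMA assumes that all lineages evolve at the same rate, which is often not true, so it is best read as a baseline rather than a method of choice.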
The common approach in this field is to construct a phylogenetic tree, using character-based methods (the specific bases or amino acids at particular positions in DNA or protein sequences), distance-based methods (alignment scores), and some traditional clustering methods (such as UPGMA).

5. Contig assembly

With current sequencing technology, each reaction can read only about 500 base pairs or slightly more. When the shotgun method is used to sequence the human genome, for example, a large number of short sequences have to be combined into contigs. The process of splicing them step by step into longer contigs, until a complete sequence is obtained, is called contig assembly. Contig assembly is an NP-complete problem.

6. The origin of the genetic code

Research on the genetic code generally holds that the relationship between codons and amino acids was caused by an accidental event in the history of biological evolution, and has been fixed ever since in the common ancestor of modern organisms. In contrast to this "frozen" theory, others have proposed three theories to explain the genetic code: selective optimization, chemical, and historical. The completion of genome sequencing for many organisms provides new material for studying the origin of the genetic code and for testing the validity of these theories.

7. Structure-based drug design

One of the purposes of the human genome project is to understand the structure, function and interactions of the roughly 100,000 proteins in the human body, and their relationship to various human diseases, and from there to seek methods of treatment and prevention, including drug treatment. Drug design based on the structures of biological macromolecules and small molecules is an extremely important research field in bioinformatics.
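Returning to the contig assembly described under item 5, the step-by-step splicing of overlapping reads can be illustrated with a greedy heuristic: repeatedly merge the two reads with the longest suffix-prefix overlap. This is only a sketch under simplifying assumptions (error-free reads, a single strand, no repeats); the function names, the `min_len` threshold and the reads are hypothetical, and a greedy merge does not guarantee the shortest possible result.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by longest overlap; return the resulting contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return reads  # no remaining pair overlaps enough
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [
            r for idx, r in enumerate(reads) if idx not in (best_i, best_j)
        ] + [merged]


print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
```

Each round compares every pair of reads, so this sketch is far too slow for real data; it is meant to show the idea of overlap-driven merging, not how production assemblers work.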
To inhibit the activity of certain enzymes or proteins, molecular alignment algorithms can be used to design inhibitor molecules on the computer as candidate drugs, based on the known tertiary structure of the protein. The purpose of this field is to find new gene-based drugs, which promises great economic benefits.

8. Modeling and simulation of biological systems

With the development of large-scale experimental technology and the accumulation of data, studying and analyzing biological systems at the global, system level and revealing the laws of their development has become another research hotspot of the post-genome era: systems biology. At present its research topics include the simulation of biological systems (Curr Opin Rheumatol, 2007, 463-70), system stability analysis (Nonlinear Dynamics Psychol Life Sci, 2007, 413-33) and system robustness analysis (Ernst Schering Res Found Workshop, 2007, 69-88). Modeling languages, represented by SBML (Bioinformatics, 2007, 1297-8), have developed rapidly, and methods including Boolean networks (PLoS Comput Biol, 2007, e163), differential equations (Mol Biol Cell, 2004, 3841-. 2007, 3262-92) and discrete dynamic event systems (Bioinformatics, 2007, 336-43) have been applied to system analysis. Many models borrow the modeling methods of physical systems such as circuits, and many studies try to tackle the complexity of the system through macroscopic ideas such as information flow, entropy and energy flow (Anal Quant Cytol Histol, 2007, 296-308). Of course, establishing a theoretical model of biological systems will take a long time. Although experimental observation data are increasing greatly, the data needed to identify biological system models far exceed what current experiments can produce.
For example, in time-series microarray data the number of sampling points is too small for traditional time-series modeling methods, and the huge cost of experiments is currently the main difficulty in system modeling. Methods of system description and modeling also need pioneering development.

9. Research on the technical methods of bioinformatics

Bioinformatics is not merely a simple collation of biological knowledge, nor a simple application of knowledge from mathematics, physics, information science and other disciplines. Massive data and complex backgrounds have driven the rapid development of machine learning, statistical data analysis and system description in the context of bioinformatics. The huge amount of computation, complex noise patterns and massive time-varying data pose great difficulties for traditional statistical analysis, and more flexible data analysis techniques are needed, such as nonparametric statistics (BMC Bioinformatics, 2007, 339) and cluster analysis (Qual Life Res, 2007, 1655-63). The analysis of high-dimensional data requires techniques that compress the feature space, such as partial least squares (PLS). In developing computer algorithms, the time and space complexity of the algorithm must be fully considered, and technologies such as parallel computing and grid computing should be used to extend what is feasible.

10. Biological images

Why do people who are not related by blood look so alike? Appearance is made up of points; the more points overlap, the more alike two people look. Why do the points of two unrelated portraits overlap? What is the biological basis? Are the genes similar? I don't know, and I hope experts can answer.

11. Others

Gene expression profile analysis, metabolic network analysis, gene chip design and proteomics data analysis have gradually become important new research fields in bioinformatics.
In terms of disciplines, the fields derived from bioinformatics, including structural genomics, functional genomics, comparative genomics, proteomics, pharmacogenomics, traditional Chinese medicine genomics, cancer genomics, molecular epidemiology and environmental genomics, have become important research approaches in systems biology. It is not difficult to see from current developments that genetic engineering has entered the post-genome era. We should also keep a clear view of the ways in which machine learning and mathematics, both closely tied to bioinformatics, can mislead, and be prepared to deal with them.