Is the era of big data a double-edged sword?

Application Research and Practice of Big Data and Bioinformatics

On February 20th, Li Jinhua, professor, Ph.D. and vice dean of the School of Data Science and Software Engineering at Qingdao University, gave a talk entitled "Application Research and Practice of Big Data and Bioinformatics" in the APP Micro-lecture column of CIO era. He covered the background of big data and his related research (teaching and research on big data in bioinformatics).

Key words: CIO era APP, mini-lecture

I. Relevant background

(A) Background of bioinformatics

As is well known, bioinformatics is a highly interdisciplinary subject that emerged with the start of the Human Genome Project in the late 1980s. It seeks to explain the biological significance of experimental data through the acquisition, processing, storage, retrieval and analysis of that data. At present, the main driving force for the development of bioinformatics comes from molecular biology, and research focuses mainly on the storage, classification, retrieval and analysis of nucleotide and amino acid sequences. In this narrow sense, bioinformatics can be defined as an interdisciplinary subject that applies computer science and mathematics to the acquisition, processing, storage, classification, retrieval and analysis of biomacromolecular information in order to understand its biological significance. It is essentially a subject that pays equal attention to theoretical concepts and practical applications.

Bioinformatics has been developing for more than 30 years. The U.S. Human Genome Project defines genomic informatics as a discipline covering all aspects of the acquisition, processing, storage, distribution, analysis and interpretation of genomic information. Since the Human Genome Project was launched in the United States in 1990, genome sequencing of humans and model organisms has developed rapidly, and whole-genome sequencing of about 40 organisms has been completed ahead of schedule. To date, the DNA sequences registered in GenBank in the United States alone total more than 7 billion base pairs. In addition, the spatial structures of more than 10,000 proteins have been determined at various resolutions. EST databases holding more than one million entries have been established from cDNA sequencing, and more than 5,000 databases have been derived and curated from these data.

All of this constitutes an ocean of biological data. Such rapid and massive accumulation of scientific data is unprecedented in the history of science, but data is not the same as information or knowledge. It is the source of both, and the key lies in how to mine it. Compared with the exponential growth of biological data, human knowledge grows very slowly. On one side there is a huge volume of data; on the other, an eager demand for new knowledge in medicine, pharmaceuticals, agriculture and the environment to help people improve their living environment and quality of life. This is a sharp contradiction, and it gave birth to a new interdisciplinary subject: bioinformatics.

Research on bioinformatics big data mainly aims at analyzing massive multivariate data. It brings unprecedented opportunities for the life sciences and is of great significance for studying gene function, disease mechanisms, precision medicine and more. The scale, diversity and velocity of big data also bring new challenges to bioinformatics. In data computing, the elastic demand of small and medium-sized laboratories for computing resources urgently needs to be met. In data analysis, multi-omics analysis systems urgently need to be integrated to solve biological problems. The lack of corresponding biological analysis tools is the main bottleneck for the life sciences in the era of big data.

(B) Research background of bioinformatics in Qingdao University

1. In 2009, the State Key Laboratory of Software Engineering, located at Wuhan University, held a summer school in Qingdao. It was there that Western scholars first raised interdisciplinary research in computational biology, mainly gene sequencing and the visualization of biological big data.

2. Since 2011, Qingdao University and the Shenzhen Huada Gene Research Institute have jointly run the Huada Gene Innovation Class of Qingdao University to cultivate top-notch innovative talent in biogenetics and bioinformatics for the era of big data. Within one month of enrollment, 30 students are selected from more than 9,000 students in different majors. In line with the requirements of a solid foundation, broad scope, comprehensiveness and internationalization, the basic and professional course stages offer two elective modules: medical laboratory testing and information processing.

3. In 2016, in cooperation with professors of the Qingdao University School of Medicine, we obtained master's degree programs in bioinformatics under two disciplines. Research interests: sequence and genomics analysis, drug research and development, biological network integration, data mining and data analysis (mainly for biological applications), and bioinformatics software methodology.

II. Main contents, main problems and key technologies of bioinformatics research

(A) Main content of bioinformatics research

1. Genomics research

The genome contains the basic information necessary for the formation and maintenance of a living organism, and this information is transformed into real life phenomena through various molecular biological reactions in cells. One part of the genome encodes proteins and RNA, and another part regulates the expression of these macromolecules. The expressed proteins and RNA fold into highly specific three-dimensional structures and perform their functions at specific positions in the body. Many details of these processes have been revealed in molecular biology laboratories, and the resulting large volumes of data are stored in databases. Bioinformatics attempts to extract new biological information and knowledge from these data; it is a theoretical biology rooted in comprehensive and in-depth experimental facts and data.

2. Collection, storage, management and provision of biological information.

This includes establishing an international basic biological information base and an international biological information transmission network; establishing a quality evaluation and testing system for bioinformatics data; providing online biological information services; and developing visualization and expert systems for biological information.

3. Extraction and analysis of genome sequence information.

This includes the discovery and identification of genes, for example finding new genes, new SNPs and various functional sites through large-scale parallel computation on data from international EST databases and from each laboratory's own measurements; analyzing the information structure of non-coding regions of the genome, proposing theoretical models and clarifying the important biological functions of these regions; analyzing and comparing the information structure of the whole genomes of model organisms; and using biological information to study the origin of the genetic code, the evolution of genome structure, the relationship between genome spatial structure and DNA folding, and the relationship between genome information and biological evolution.
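As a small illustration of the SNP discovery mentioned above, the following Python sketch reports positions where a sample sequence differs from a reference by a single base. It is a simplified example under stated assumptions, not the method used in the talk: the function name `find_snps` is hypothetical, both sequences are assumed to be already aligned and of equal length, and a real SNP call would also require evidence across many reads or individuals.

```python
from typing import List, Tuple


def find_snps(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for each mismatch.

    Assumes the two sequences are already aligned and of equal length.
    Positions are 1-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")

    snps = []
    pairs = zip(reference.upper(), sample.upper())
    for pos, (ref_base, alt_base) in enumerate(pairs, start=1):
        # Ambiguous bases (N) and gaps (-) are skipped: they are not
        # evidence of a single-nucleotide difference.
        if ref_base in "N-" or alt_base in "N-":
            continue
        if ref_base != alt_base:
            snps.append((pos, ref_base, alt_base))
    return snps


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    print(find_snps("ACGTACGT", "ACGAACGT"))
```

Each reported mismatch is only a candidate site; the large-scale parallel computation described in the text would run comparisons like this over very many sequences.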

4. Research on bioinformatics analysis technology and method.

This includes developing effective software, databases and database tools that can support large-scale mapping and sequencing, such as electronic networks and other remote communication tools; improving existing theoretical analysis methods, such as statistical methods, pattern recognition, hidden Markov models, neural networks, complexity analysis, cryptographic methods and multiple sequence alignment; and creating entirely new methods and technologies suitable for genome analysis, including the introduction of complex system analysis and information system analysis techniques.
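Sequence comparison is one of the methods listed above. As a minimal sketch of its simplest building block, the Python function below scores a global alignment of two sequences with the standard Needleman-Wunsch dynamic programming recurrence. This is the pairwise case only, whereas the text refers to multi-sequence comparison; the function name and the scoring values (match 1, mismatch -1, gap -2) are illustrative assumptions.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Score of the best global alignment of a and b (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per base.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACT"))
```

The table has (len(a)+1) x (len(b)+1) cells, so time and memory grow with the product of the sequence lengths, which is one reason large-scale comparison needs the parallel and distributed tools the text calls for.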

5. Application development research.

Collect human genetic information related to diseases, develop techniques for detecting patient sample sequence information and selecting expression vectors and primers based on the sequence information, and establish databases related to animal and plant breeding, macromolecular design and drug design.

(B) Research questions

1. Storage and management of biological big data

This covers the storage structure, storage standards and management technology of biological big data. Biological big data is huge in volume, complex in structure and diverse in storage standards, and it includes unstructured, semi-structured and structured data. How to choose among distributed file systems, distributed data combination and distributed parallel database systems is also one of the main problems of biological big data storage and management.
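To make the difference between semi-structured and structured data concrete, here is a minimal Python sketch that turns FASTA-style text (a header line starting with ">" followed by sequence lines) into structured records. FASTA is not named in the talk and is used here only as a familiar example of a semi-structured sequence format; the function name `parse_fasta` and the record field names are assumptions.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Convert FASTA-style lines into a list of structured records."""
    records: List[Dict[str, str]] = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header: first token is the identifier, the rest is free text.
            identifier, _, description = line[1:].partition(" ")
            current = {"id": identifier,
                       "description": description,
                       "sequence": ""}
            records.append(current)
        elif current is not None:
            # Sequence data may be wrapped over several lines.
            current["sequence"] += line
    return records


if __name__ == "__main__":
    # Hypothetical toy input, for illustration only.
    text = ">seq1 example record\nACGT\nTTGA\n>seq2\nGGCC\n"
    for record in parse_fasta(text.splitlines()):
        print(record)
```

Once records have a fixed set of fields like this, they can be loaded into whichever distributed file system or database is chosen.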

2. Visualization of biological big data

Because of its sheer volume, biological big data carries general biological significance. Well-designed visualization can help biologists understand and analyze biological data quickly.

3. Analysis and processing of biological big data

Multiple omics data sets are integrated for computation and analysis in order to solve practical biological problems.
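One simple form of such integration is joining measurements from different omics layers on a shared sample identifier. The Python sketch below is an illustration under stated assumptions rather than a description of any system from the talk: the function name `integrate_omics`, the layer names and the field names are all hypothetical, and only samples present in every layer are kept.

```python
from typing import Dict, Mapping


def integrate_omics(layers: Mapping[str, Mapping[str, dict]]) -> Dict[str, dict]:
    """Join per-sample records from several omics layers.

    `layers` maps a layer name (e.g. "genome") to {sample_id: record}.
    Only samples found in every layer appear in the result.
    """
    if not layers:
        return {}
    # Sample IDs common to all layers (an inner join on sample ID).
    shared = set.intersection(*(set(samples) for samples in layers.values()))
    return {
        sample_id: {name: dict(samples[sample_id])
                    for name, samples in layers.items()}
        for sample_id in sorted(shared)
    }


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    genome = {"S1": {"variant_count": 3}, "S2": {"variant_count": 5}}
    proteome = {"S1": {"protein_level": 0.8}, "S3": {"protein_level": 1.2}}
    print(integrate_omics({"genome": genome, "proteome": proteome}))
```

An inner join is the most conservative choice; keeping samples that are missing from some layers would require an explicit policy for the missing values.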

(C) Key technologies

Key technologies in the field of biological big data are:

1. Standardization, integration and fusion technology for biological big data

Research the key technologies for integrating and fusing omics data, medical data and health data; develop information models and an integration engine for omics, medical and health data; and study interface technology for messages and documents based on domestic and international standards and norms, network security technology based on next-generation Internet technology, and high-throughput transmission technology.

2. Representation, indexing, search and storage access technology for biological big data

Focus on breakthroughs in the description of, and parallel access to, biological big data resources; build an efficient indexing system and a reliable, scalable storage management system for biological big data; and establish a search and acquisition service for biological big data resources based on key technologies such as semantic retrieval of biological big data resources and associative search over biomedical data.
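The core data structure behind this kind of resource search is an index from terms to records. The Python sketch below builds a plain keyword inverted index and answers conjunctive (AND) queries. It is deliberately simpler than the semantic retrieval named in the text and is not a description of any particular system; the function names, record IDs and descriptions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(records: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercase token to the set of record IDs containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for record_id, text in records.items():
        for token in text.lower().split():
            index[token].add(record_id)
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return sorted IDs of records that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return []
    # Intersect the posting sets of all query tokens (AND semantics).
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical toy records, for illustration only.
    records = {
        "R1": "human genome sequence",
        "R2": "mouse genome annotation",
        "R3": "human protein structure",
    }
    index = build_index(records)
    print(search(index, "human genome"))
```

A production system would add tokenization suited to biomedical terms, ranking and distribution across machines, which is where the scalability concerns in the text come in.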

3. Analysis and application of big data processing for cardiovascular diseases and tumor diseases.

For cardiovascular diseases and tumor diseases, we will integrate electronic medical records, images, clinical test data and other data (covering more than 500,000 individuals, with a total data volume of 50TB), and carry out the processing, storage, analysis and application research of medical big data to provide big data support for improving the diagnosis and treatment level of major diseases.

4. Analysis and application of regional medical and health big data processing.

Select regional medical and health data covering more than 10,000,000 individuals, with a total data volume of no less than 100TB. Through processing, storage, analysis and integration, we will build a health service knowledge base and support platform to provide application services.

5. Construction and service technology for an omics big data center and knowledge base.

Integrate omics data, including genomic and proteomic data, with a total volume of no less than 100TB, of which at least 60% will be open to external access. The focus is on breakthroughs in personal genome visualization, omics annotation and disease risk assessment technology, and on establishing an omics big data knowledge base, search engine, and data mining and visual analysis platform.