"-Informatics" in bioinformatics refers to the process of mining knowledge from massive data, as shown in the following figure. In this process, it will involve data management, data operation, data mining and modeling and simulation. The data management part is mainly database, and the data operation part mainly refers to various software tools of biological information. These two parts are very important resources for bioinformatics research, and also the basic knowledge that students need to know when they get started. The following is a brief introduction to these resources. (This article is based on the video of Peking University Bioinformatics Open Class, and the picture is from the video screenshot.)
According to different characteristics, these resources can be divided into different categories. For example, according to the nature of data, databases can be divided into primary data databases and secondary data databases. For example, software tools can be divided into stand-alone programs and web servers according to whether the software is a stand-alone tool or a web server.
According to the category of publishers, it can be divided into centralized resources and personal resources. The relatively large centralized resources mainly include NCBI (National Biotechnology Information Center), EBI (European Institute of Bioinformatics) and UCSC (University of California, Santa Cruz) genome browser. The following will introduce these three largest databases and other bioinformatics data resources respectively.
1. Introduction to NCBI
NCBI Genome database:
Stores most of the genomes that have been sequenced; more than 1,000 genomes have been sequenced so far.
NCBI Nucleotide/Protein (RefSeq):
Reference sequences obtained by integrating different versions of a sequence. NM_* denotes a nucleic acid (mRNA) sequence and NP_* denotes a protein sequence. A nucleic acid record gives the ID, name, species, features, coding region, and sequence. A protein record also gives information on functional regions.
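The prefix convention can be illustrated with a minimal Python sketch. The function name `refseq_molecule_type` is hypothetical, only the two prefixes mentioned here are covered, and the accessions are used purely as sample strings.

```python
# Map the two RefSeq accession prefixes discussed in the text to molecule types.
# Other RefSeq prefixes exist but are deliberately left out of this sketch.
REFSEQ_PREFIXES = {
    "NM_": "nucleic acid (mRNA)",
    "NP_": "protein",
}


def refseq_molecule_type(accession: str) -> str:
    """Return the molecule type implied by a RefSeq accession prefix."""
    normalized = accession.strip().upper()
    for prefix, molecule in REFSEQ_PREFIXES.items():
        if normalized.startswith(prefix):
            return molecule
    return "unknown"


# Sample accession strings, used only to exercise the prefix check.
for acc in ("NM_000518", "NP_000509", "XYZ_123"):
    print(acc, "->", refseq_molecule_type(acc))
```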
NCBI Gene:
Organized by gene; integrates pathway, variation, and phenotype information.
For human genes, GeneCards provides better annotation (expression, interactions, homologous proteins, function, genetic variation, etc.) than NCBI, for both human genes and proteins.
NCBI SRA
The Sequence Read Archive, a database of short reads from next-generation sequencing; its data volume doubles roughly every five months.
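To make the doubling claim concrete, the sketch below computes the growth factor implied by a fixed doubling time. It assumes idealized exponential growth with the five-month figure quoted above; the function name `growth_factor` is hypothetical.

```python
def growth_factor(months: float, doubling_time_months: float = 5.0) -> float:
    """Factor by which the data volume grows after `months`,
    assuming it doubles every `doubling_time_months` (idealized model)."""
    return 2.0 ** (months / doubling_time_months)


# After one and two doubling periods, and after a full year.
for months in (5, 10, 12):
    print(f"{months} months: x{growth_factor(months):.2f}")
```

Under this assumption the archive would grow a little more than fivefold in one year.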
NCBI Taxonomy
Taxonomic tree of all species for which at least one gene has been sequenced; about 10% of all described species have been sequenced.
NCBI PubMed
Used for searching the literature.
NCBI MeSH
(Medical Subject Headings) A controlled vocabulary, organized as a structured thesaurus, used to index published articles.
NCBI My NCBI
After you register keywords of interest in My NCBI, relevant literature is pushed to you every week, which is very useful for tracking the literature during a project.
NCBI BLAST
BLAST is NCBI's best-known tool; the two articles describing BLAST have been cited more than 42,000 times. The different versions of BLAST include:
Online: NCBI BLAST
Stand-alone version: BLAST+
Embedded in a web page: wwwblast
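For the stand-alone version, a search is usually launched from the command line. The following is a minimal sketch, assuming BLAST+ is installed and a nucleotide database has already been built; the helper name `build_blastn_command` and the file and database names are hypothetical placeholders. The code only assembles the command and does not run it.

```python
import shlex


def build_blastn_command(query_fasta, database, output_path, evalue=1e-5):
    """Assemble a blastn (BLAST+) command line as an argument list."""
    return [
        "blastn",
        "-query", query_fasta,    # FASTA file with the query sequence(s)
        "-db", database,          # name of a pre-built nucleotide database
        "-out", output_path,      # file that will receive the hits
        "-evalue", str(evalue),   # E-value threshold for reported hits
        "-outfmt", "6",           # tabular output format
    ]


# Placeholder names; replace them with real files before running.
cmd = build_blastn_command("query.fa", "my_nt_db", "hits.tsv")
print(shlex.join(cmd))
```

If BLAST+ is available, the list can be passed to `subprocess.run(cmd, check=True)` to execute the search.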
2. Introduction to EBI
The following table lists some resources of EBI:
EBI Ensembl:
Ensembl sits between NCBI and UCSC and integrates various resources for many species. The data types in Ensembl include:
EBI-UniProtKB
UniProt is a comprehensive resource of protein sequence and annotation data.
The UniProt Knowledgebase (UniProtKB) is the central hub for collecting functional information on proteins, with accurate, consistent, and rich annotation.
UniProtKB/Swiss-Prot (manually reviewed)
UniProtKB/TrEMBL (not manually reviewed)
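In UniProtKB FASTA files the two sections can be told apart by the header prefix: `sp` for Swiss-Prot and `tr` for TrEMBL. The sketch below parses such a header; the function name `parse_uniprot_header` is hypothetical, and the TrEMBL header in the example is made up for illustration.

```python
SECTIONS = {
    "sp": "Swiss-Prot (manually reviewed)",
    "tr": "TrEMBL (not manually reviewed)",
}


def parse_uniprot_header(header: str) -> dict:
    """Split a UniProtKB FASTA header of the form '>db|ACCESSION|ENTRY_NAME description'."""
    fields = header.lstrip(">").split("|", 2)
    if len(fields) != 3 or fields[0] not in SECTIONS:
        raise ValueError(f"not a UniProtKB FASTA header: {header!r}")
    db, accession, rest = fields
    entry_name = rest.split(" ", 1)[0]
    return {
        "section": SECTIONS[db],
        "accession": accession,
        "entry_name": entry_name,
    }


reviewed = parse_uniprot_header(">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens")
# Made-up TrEMBL-style header, used only to show the 'tr' prefix.
unreviewed = parse_uniprot_header(">tr|X0X0X0|X0X0X0_EXAMPLE Uncharacterized protein")
print(reviewed)
print(unreviewed)
```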
EBI IntAct
Molecular interactions
EBI Clustal Omega
Multiple sequence alignment
EBI InterProScan
Enter a sequence to check whether it contains any protein region with a currently known function.
3. Introduction to UCSC
Uses the genome as the coordinate system. It contains many tracks, including SNPs, mRNAs, spliced ESTs, unspliced ESTs, high-throughput data, and others.
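Because every track is anchored to genome coordinates, asking "what lies in this region?" reduces to an interval overlap test. The sketch below is a minimal illustration, not the browser's actual implementation; it assumes 0-based, half-open intervals in the style of BED files, and the `Feature` type, track names, and coordinates are all invented for the example.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    track: str   # which track the feature belongs to
    chrom: str   # chromosome name
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str


def features_in_region(features: List[Feature], chrom: str, start: int, end: int) -> List[Feature]:
    """Return the features that overlap the half-open region [start, end) on chrom."""
    return [
        f for f in features
        if f.chrom == chrom and f.start < end and f.end > start
    ]


# Invented features on three hypothetical tracks.
features = [
    Feature("SNP", "chr1", 200, 201, "snp_example"),
    Feature("mRNA", "chr1", 100, 180, "mrna_example"),
    Feature("EST", "chr1", 300, 400, "est_example"),
]

hits = features_in_region(features, "chr1", 150, 250)
print([f.name for f in hits])
```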