Population Structure: Constructing a Phylogenetic Tree

I have been swamped with chores lately, and I finally have time to post an update...

The previous article covered the basic concepts of the evolutionary tree, so how do we obtain a credible one?

In population genetic analysis, the phylogenetic tree is usually built from population SNP data. So, in what follows, I mainly use SNP data as the example to introduce how an evolutionary tree is constructed.
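As a rough illustration of how SNP data become an alignment, the sketch below concatenates per-site genotype calls into one pseudo-sequence per sample and renders them as PHYLIP text. The sample names, the genotype strings and the helper functions are all hypothetical, and collapsing heterozygous calls to IUPAC ambiguity codes is only one common convention, not necessarily the one used for the data in this post.

```python
# Hypothetical toy data and helpers, for illustration only.
# Heterozygous calls are collapsed to IUPAC ambiguity codes; missing calls become "N".
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_to_base(genotype):
    """Convert a call such as 'A/G' into a single alignment character."""
    alleles = genotype.replace("|", "/").split("/")
    if any(a not in ("A", "C", "G", "T") for a in alleles):
        return "N"  # missing call or non-SNP allele
    unique = frozenset(alleles)
    if len(unique) == 1:
        return alleles[0]
    return IUPAC.get(unique, "N")  # three or more alleles: treat as unknown


def snps_to_alignment(genotypes):
    """Concatenate per-site calls into one pseudo-sequence per sample."""
    return {sample: "".join(call_to_base(g) for g in calls)
            for sample, calls in genotypes.items()}


def to_phylip(alignment):
    """Render the alignment as relaxed PHYLIP text (name, space, sequence)."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    lines = [f"{len(alignment)} {lengths.pop()}"]
    lines += [f"{name} {seq}" for name, seq in alignment.items()]
    return "\n".join(lines) + "\n"


genotypes = {
    "sample1": ["A/A", "C/T", "G/G", "./."],
    "sample2": ["A/G", "C/C", "G/G", "T/T"],
    "sample3": ["G/G", "T/T", "G/T", "T/T"],
}
alignment = snps_to_alignment(genotypes)
print(to_phylip(alignment))
```

Because every sample contributes exactly one character per SNP site, the sequences are already the same length and column-aligned.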

Sequence alignment -> choice of tree-building method -> selection of the best substitution model -> construction of the phylogenetic tree -> tree beautification

Common sequence alignment programs are Clustal and MUSCLE.

Clustal is available as standalone software (for various operating systems) and is also often integrated into other commonly used programs, such as BioEdit and MEGA.

MUSCLE also supports multiple operating systems.

Both programs are frequently cited, so neither is clearly better than the other; use whichever is more convenient.

1. Distance-based methods:

Distance-based methods first compare the sequences of the taxa pairwise, derive the evolutionary distance between each pair under certain assumptions (an evolutionary distance model), and assemble these into an evolutionary distance matrix. The tree is then built from the distance relationships in this matrix.
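To make the distance-matrix step concrete, here is a minimal sketch that computes pairwise p-distances and applies the Jukes-Cantor (JC69) correction. JC69 is chosen only because it is the simplest distance model; the toy sequences and function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    valid = set("ACGT")
    compared = differences = 0
    for a, b in zip(seq_a, seq_b):
        if a in valid and b in valid:
            compared += 1
            differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jc69_distance(p):
    """Jukes-Cantor correction: d = -3/4 * ln(1 - 4p/3)."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("p-distance too large for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment):
    """Symmetric matrix of JC69 distances, stored as a dict of dicts."""
    names = list(alignment)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = jc69_distance(p_distance(alignment[a], alignment[b]))
            matrix[a][b] = matrix[b][a] = d
    return matrix


# Hypothetical aligned sequences
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACTTAT",
}
matrix = distance_matrix(alignment)
for name, row in matrix.items():
    print(name, [round(row[other], 4) for other in alignment])
```

A method such as NJ or UPGMA would then take this matrix as its only input; the sequences themselves are no longer consulted.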

2. Character-based methods:

Character-based methods do not calculate distances between sequences. Instead, each site in the sequence is treated as an independent character, and the tree is built from these characters.
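As one example of a character-based criterion, the sketch below scores two candidate trees by maximum parsimony using the Fitch algorithm, treating each site independently. The nested-tuple tree encoding, the toy alignment and the function names are assumptions made for illustration; real programs also search tree space rather than comparing two fixed trees.

```python
def fitch_site(tree, states):
    """Return (possible_states, changes) for one site on a rooted binary tree.

    tree:   a leaf name (str) or a tuple (left_subtree, right_subtree)
    states: dict mapping each leaf name to its character at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # Children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all sites."""
    length = len(next(iter(alignment.values())))
    total = 0
    for site in range(length):
        states = {name: seq[site] for name, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical aligned sequences and two candidate topologies
alignment = {"A": "AAGT", "B": "AAGC", "C": "GAGC", "D": "GATC"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

for label, tree in (("tree_1", tree_1), ("tree_2", tree_2)):
    print(label, parsimony_score(tree, alignment))
```

Under parsimony, the tree needing fewer changes is preferred; here that is `tree_1`, which groups A with B and C with D.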

The basis for choosing a method is as follows:

UPGMA is now rarely used. Generally speaking, ML performs better when the substitution model is appropriate. For closely related sequences, some people prefer MP because it makes the fewest assumptions. MP is generally not used for distantly related sequences, where NJ or ML is used instead. For sequences with very low similarity, NJ often suffers from long-branch attraction (LBA), which can seriously interfere with tree construction. Bayesian methods are too slow. On the accuracy of the various methods for building molecular phylogenetic trees, one review (Hall BG, 2005) concluded that the Bayesian method is the best, followed by ML and then MP. In fact, if sequence similarity is high, all methods give good results and the differences between models are small. That said, NJ and ML are the methods most widely used in published articles.

In phylogenetic analysis, maximum likelihood (ML) and Bayesian inference (BI) are two methods that are very sensitive to the substitution model. Selecting a substitution model is therefore an essential step before reconstructing a phylogenetic tree with ML or BI.

For using jModelTest on Windows, please refer to this article: "An illustrated guide to choosing a nucleotide substitution model with jModelTest" by Raindy.

For using ProtTest, please refer to this article: "Selecting the best amino acid substitution model by ProtTest".

I mostly use the Linux version of jModelTest myself, which is extremely simple to use. The command is as follows:

Parameter description:

-d: input file. Note: the input must be in .phy format, not .fasta format.

-f: include models with unequal base frequencies.

-g: include models with rate variation among sites, and set the number of rate categories.

-i: include models with a proportion of invariable sites.

-s: number of substitution schemes.

-v: do model averaging and compute parameter importances.

-a: estimate the model-averaged phylogeny for each active criterion.

-BIC: calculate the Bayesian Information Criterion.

-AIC: calculate the Akaike Information Criterion.
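The options described above can be combined into a single call. The sketch below only assembles and prints the argument list; it is not necessarily the exact command used for this post. The jar location, the input file name `snps.phy`, and the values 4 (rate categories) and 11 (substitution schemes) are assumptions chosen for illustration.

```python
import shlex


def jmodeltest_command(alignment_path, jar_path="jModelTest.jar",
                       rate_categories=4, substitution_schemes=11):
    """Assemble a jModelTest call from the options listed above.

    jar_path, rate_categories and substitution_schemes are assumed defaults.
    """
    return [
        "java", "-jar", jar_path,
        "-d", alignment_path,             # input alignment (.phy)
        "-f",                             # unequal base frequencies
        "-g", str(rate_categories),       # rate variation among sites
        "-i",                             # proportion of invariable sites
        "-s", str(substitution_schemes),  # number of substitution schemes
        "-v",                             # model averaging, parameter importances
        "-a",                             # model-averaged phylogeny
        "-BIC", "-AIC",                   # information criteria to compute
    ]


cmd = jmodeltest_command("snps.phy")
print(shlex.join(cmd))
# To actually run it, pass the list to subprocess.run(cmd, check=True).
```

Building the command as a list rather than a single string avoids quoting problems when file paths contain spaces.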

At the bottom of the output there is a list, as shown in the figure, giving the best-ranked model.
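To show how such a ranking comes about, here is a minimal sketch of the two criteria, AIC = 2k - 2lnL and BIC = k ln(n) - 2lnL, where k is the number of free parameters, n the number of sites and lnL the log-likelihood. The model names, log-likelihoods and parameter counts are invented for illustration and are not taken from any real run.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical (log-likelihood, number of free parameters) per candidate model
candidates = {
    "JC": (-2050.0, 5),
    "HKY": (-2010.0, 9),
    "GTR+G": (-2000.0, 14),
}
n_sites = 1000  # assumed alignment length

aic_scores = {m: aic(lnl, k) for m, (lnl, k) in candidates.items()}
bic_scores = {m: bic(lnl, k, n_sites) for m, (lnl, k) in candidates.items()}

best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)
print("best by AIC:", best_aic)
print("best by BIC:", best_bic)
```

With these made-up numbers the two criteria disagree, because BIC penalizes extra parameters more heavily than AIC. It is worth checking which criterion a reported "best model" refers to.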

Once the best model has been determined, we can start building the tree. For constructing an ML tree, I recommend the new generation of RAxML: RAxML-NG.

RAxML has long been the classic tool for building ML trees. It was developed by Alexandros Stamatakis of the Heidelberg Institute for Theoretical Studies in Germany. In recent years its leading position has been challenged by other software, especially IQ-TREE. The article by Zhou et al., "Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets", systematically compares the practical results and performance of RAxML, IQ-TREE, FastTree and PhyML, and one of its conclusions is that IQ-TREE is slightly better in accuracy.

Recently, an upgraded version, RAxML-NG, was released!

Compared with its predecessor, RAxML-NG has the following advantages:

Without further ado, here is the command:

Parameter description:

--all: run the all-in-one analysis (ML tree search + non-parametric bootstrap)

--msa: the sequence alignment file to analyse

--model: the substitution model; enter the best model obtained in the previous step directly

--bs-trees: number of bootstrap replicates used to test the robustness of the tree; here, 1000 bootstrap samples

--threads: number of threads to use
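Putting the parameters above together, a call might look like the following sketch, which only assembles and prints the argument list. The file name `snps.phy`, the model string `GTR+G` and the thread count are placeholders; in practice the model should be the one selected in the previous step.

```python
import shlex


def raxml_ng_command(alignment_path, model, bootstrap_trees=1000, threads=8):
    """Assemble a raxml-ng all-in-one call (ML search plus bootstrap)."""
    return [
        "raxml-ng",
        "--all",                             # ML tree search + bootstrap
        "--msa", alignment_path,             # alignment file
        "--model", model,                    # substitution model
        "--bs-trees", str(bootstrap_trees),  # number of bootstrap replicates
        "--threads", str(threads),           # number of threads
    ]


# Placeholder inputs, for illustration only
cmd = raxml_ng_command("snps.phy", "GTR+G", bootstrap_trees=1000, threads=8)
print(shlex.join(cmd))
# To actually run it, pass the list to subprocess.run(cmd, check=True).
```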

The output of the run is shown in the figure below. The .bestTree file is the tree file we want; just import it into a tree visualization tool (I usually use MEGA and iTOL). I will write about how to beautify the evolutionary tree next time.

Anyone who does evolutionary analysis probably knows the feeling: many analyses take several days to produce results (everyone who has done it knows the pain), and sometimes a sample is suddenly added and everything has to start over. A powerful server is therefore an essential tool. Take the SNP phylogenetic tree mentioned above: I only work on closely related species with a very small genome (9 Mb) and 40,000 SNP sites. Even so, if I ran it in MEGA on the 8-core CPU of my own computer with a bootstrap value of 1000, it might keep running until I graduate.

Coming from a biology background with rather poor computer knowledge, I did a lot of homework when our research group bought its server. Of course, I mainly followed the advice of the company's technicians, and over a very, very long period of testing I ran the commonly used bioinformatics analysis software many times (I mainly work on parasite genomes, host transcriptomes, 16S metagenomics and so on). In the end I found a cost-effective server configuration; the specific configuration is as follows:

I sincerely thank the technical staff at Fengwei for answering all kinds of basic questions. If you need anything, you can contact their technical team, who seem quite reliable to me. Official website: Fengwei Technology.

I am putting their logo here as a thank-you.

This article is my study notes, and I hope it helps you. It draws on a large number of online articles, whose sources are listed at the end of the text.

References:

Read the phylogenetic tree in an article

Selecting the best amino acid substitution model by ProtTest

RAxML-NG, a new generation of RAxML for phylogenetic tree construction