Before introducing GS models, we first need to understand the mixed linear model (MLM). The mixed linear model is a variance-component model. Because it is a linear model, the relationships among the quantities involved are linear and the superposition principle applies: the response of the system to several inputs acting simultaneously equals the sum of the responses to each input acting alone (Equation 1):

$$y = X\beta + \varepsilon \quad \text{(Equation 1)}$$

In the formula, y is the vector of observations, X is the design matrix of the fixed effects, β is the fixed-effect parameter vector corresponding to X, and ε is the random error vector. The following conditions must also hold: E(y) = Xβ, Var(y) = σ²I, and y follows a normal distribution.
Since it is a mixed-effects model, it contains both fixed and random effects. A fixed effect is one whose possible levels are all known and observable, such as sex, age, or breed. A random effect refers to levels drawn at random from a population, and therefore uncertain, such as individual additive genetic effects or maternal effects (Equation 2).
$$y = X\beta + Z\mu + \varepsilon \quad \text{(Equation 2)}$$
In the formula, y is the observation vector; β is the fixed-effect vector with design matrix X; μ is the random-effect vector with design matrix Z, assumed normally distributed with covariance matrix G, i.e. μ ~ N(0, G); and ε is the residual vector with ε ~ N(0, R). It is further assumed that Cov(μ, ε) = 0, i.e. the random effects and residuals are uncorrelated, so the variance-covariance matrix of y becomes Var(y) = ZGZ′ + R. If Zμ is absent, the model reduces to a fixed-effects model; if Xβ is absent, it is a random-effects model.
In the traditional linear model, besides linearity, the response variable must satisfy the assumptions of normality, independence, and homogeneity of variances. The mixed linear model retains the normality assumption but does not require independence or homogeneity of variances, which broadens its scope of application; it has been widely used in genomic selection.
C.R. Henderson proposed the statistical framework of Best Linear Unbiased Prediction (BLUP) as early as the 1950s, but its application was limited by the computing technology of the time. It was not until the mid-1970s that advances in computing made the use of BLUP in breeding practical. BLUP combines the advantages of least squares: when the covariance matrix is known, it is an ideal method for analyzing target traits in animal and plant breeding. The meaning of its name is as follows: Best, it minimizes the prediction error variance; Linear, the predictor is a linear function of the observations; Unbiased, the expectation of the prediction equals the expectation of the true value; Prediction, it predicts the random effects (breeding values).
In the mixed linear model, BLUP is the prediction of the random factors (random effects), while BLUE (Best Linear Unbiased Estimation) is the estimation of the fixed factors (fixed effects). Both fixed effects and random genetic effects can be obtained from the same system of equations.
The BLUP method was first used in animal breeding. The traditional animal model solves the mixed model equations (MME) based on the kinship matrix constructed from pedigree information (also known as the A matrix), so this approach is called ABLUP.
The MME proposed by Henderson are as follows:

$$\begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{\mu} \end{bmatrix} = \begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix}$$
In the formula, X is the design matrix of the fixed effects, Z is the design matrix of the random effects, and y is the vector of observations. R and G are the residual and additive (co)variance matrices:

$$R = I\sigma_e^2, \qquad G = A\sigma_a^2$$
where A is the pedigree-based kinship matrix. Substituting R = Iσ²ₑ and G = Aσ²ₐ and multiplying through by σ²ₑ, the equations can be further transformed, with λ = σ²ₑ/σ²ₐ, into:

$$\begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z + \lambda A^{-1} \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{\mu} \end{bmatrix} = \begin{bmatrix} X'y \\ Z'y \end{bmatrix}$$
Once the residual and additive variance components have been estimated, solving this system of equations yields the fixed-effect estimates (BLUE) and the random-effect predictions (BLUP).
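To make the solution step concrete, here is a minimal NumPy sketch of solving these equations for one trait. The function name `solve_mme` and the assumption that λ (and hence the variance components) is already known, e.g. from REML, are ours for illustration, not part of the original text.

```python
import numpy as np

def solve_mme(y, X, Z, A_inv, lam):
    """Solve Henderson's MME for one trait (illustrative sketch).

    lam = sigma2_e / sigma2_a is assumed known (e.g. from REML).
    Returns the BLUE of fixed effects and the BLUP of additive effects.
    """
    lhs = np.block([
        [X.T @ X,  X.T @ Z],
        [Z.T @ X,  Z.T @ Z + lam * A_inv],
    ])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    n_fixed = X.shape[1]
    return sol[:n_fixed], sol[n_fixed:]   # (BLUE, BLUP)
```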
As the traditional BLUP method, ABLUP builds the kinship matrix entirely from pedigree information and then calculates breeding values. It was widely used in early animal breeding but is now rarely used on its own.
VanRaden proposed the GBLUP (Genomic Best Linear Unbiased Prediction) method based on the G matrix in 2008. The G matrix is constructed from all SNP markers as follows:

$$G = \frac{ZZ'}{2\sum_i p_i(1-p_i)}$$

where Z = M − 2P is the centered genotype matrix (genotype codes minus twice the allele frequency) and pᵢ is the frequency of the counted allele at marker i.
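For illustration, a minimal NumPy sketch of VanRaden's method-1 G matrix, assuming genotypes coded 0/1/2; the function name `vanraden_g` is a hypothetical choice of ours.

```python
import numpy as np

def vanraden_g(M012):
    """VanRaden (2008) method-1 genomic relationship matrix (sketch).

    M012 : (n, m) genotype matrix coded 0/1/2 (count of one allele).
    """
    p = M012.mean(axis=0) / 2.0            # allele frequencies per marker
    Z = M012 - 2.0 * p                     # center each marker column
    denom = 2.0 * np.sum(p * (1.0 - p))    # scales G to be analogous to A
    return (Z @ Z.T) / denom
```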
GBLUP replaces the pedigree-based kinship matrix A with the genomic relationship matrix G and thus estimates individual breeding values directly from marker data.
The GBLUP solution process is the same as in traditional BLUP; only the construction of the relationship matrix differs. Besides VanRaden's G matrix, other constructions exist, although VanRaden's remains the most commonly used. For example, Yang et al. proposed a marker-weighted G matrix whose off-diagonal elements are

$$G_{jk} = \frac{1}{m}\sum_{i=1}^{m} \frac{(x_{ij} - 2p_i)(x_{ik} - 2p_i)}{2p_i(1-p_i)}$$

with an adjusted formula for the diagonal elements.
Goddard et al. proposed adjusting the G matrix toward the pedigree-based A matrix (regressing G on A), which shrinks the sampling error of the marker-based estimates.
GBLUP is now widely used in animal and plant breeding, and thanks to its efficiency and robustness it remains popular today. However, GBLUP assumes that all markers contribute equally to the genetic variance, whereas across the genome only a small number of markers have major effects and most marker effects are small, so GBLUP still has considerable room for improvement.
In animal breeding, for various reasons, many individuals with pedigree records and phenotypes have not been genotyped. The single-step GBLUP (ssGBLUP) method was developed to estimate genomic breeding values for genotyped and non-genotyped individuals in a breeding population simultaneously.
ssGBLUP combines traditional BLUP and GBLUP: it integrates the pedigree-based kinship matrix A and the genomic relationship matrix G into a new relationship matrix H, which is used to estimate the breeding values of genotyped and non-genotyped individuals at the same time.
The H matrix is constructed (in its inverse form, with the genotyped individuals as the second block) as:

$$H^{-1} = A^{-1} + \begin{bmatrix} 0 & 0 \\ 0 & G_w^{-1} - A_{22}^{-1} \end{bmatrix}, \qquad G_w = (1-w)G + wA_{22}$$

In the formula, w is a weighting factor, i.e. the proportion of polygenic genetic effects, and A₂₂ is the pedigree relationship block for the genotyped individuals.
After the H matrix is constructed, the MME are solved exactly as in traditional BLUP, with H⁻¹ taking the place of A⁻¹.
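A sketch of building H⁻¹ under the assumptions above (genotyped individuals ordered last in A); the function name and the default blending weight w = 0.05 are our illustrative choices.

```python
import numpy as np

def h_inverse(A_inv, A22_inv, G, A22, w=0.05):
    """Sketch of the ssGBLUP H-inverse (genotyped block ordered last).

    A_inv   : inverse of the full pedigree relationship matrix A
    A22_inv : inverse of the A-block for genotyped individuals
    G, A22  : genomic and pedigree relationships of genotyped individuals
    w       : blending weight (residual polygenic proportion)
    """
    Gw = (1.0 - w) * G + w * A22           # blended, invertible G
    n_geno = G.shape[0]
    H_inv = A_inv.copy()
    # add the genomic correction only to the genotyped block
    H_inv[-n_geno:, -n_geno:] += np.linalg.inv(Gw) - A22_inv
    return H_inv
```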
Because ssGBLUP uses the pedigree and phenotype information of non-genotyped individuals in addition to genomic data, it usually achieves higher accuracy than GBLUP, and it has become one of the most commonly used animal models in animal breeding. In plant breeding, comprehensive pedigree records are often lacking and the individuals in a population can easily be genotyped directly, so ssGBLUP has not been widely adopted there.
If, instead of fitting individuals through a kinship matrix as in GBLUP, we fit the SNP markers themselves as random covariates and then predict individuals from the estimated marker effects, we arrive at RRBLUP (Ridge Regression Best Linear Unbiased Prediction).
Why not simply use least squares? Least squares treats marker effects as fixed, regresses on the SNPs segment by segment, and then sums the significant SNP effects within each segment to obtain the individual's genomic breeding value. Because it considers only a few significant SNPs, it is prone to multicollinearity and overfitting.
RRBLUP is an improved (penalized) least squares method that can estimate the effect sizes of all SNPs at once. It assumes that marker effects are random and normally distributed, estimates each marker's effect with a linear mixed model, and then sums the marker effects to obtain the individual's estimated breeding value.
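A closed-form sketch of RRBLUP marker-effect estimation, assuming a single known shrinkage parameter lam = σ²ₑ/σ²_g; the n × n formulation shown is a standard algebraic identity that is cheaper to compute when p >> n.

```python
import numpy as np

def rrblup(y, M, lam):
    """Closed-form ridge (RRBLUP) marker effects, a minimal sketch.

    lam = sigma2_e / sigma2_g; all p marker effects share one variance.
    Uses the n x n system, efficient when p >> n.
    """
    n = M.shape[0]
    yc = y - y.mean()
    # g_hat = M' (M M' + lam I)^-1 (y - mean)
    g_hat = M.T @ np.linalg.solve(M @ M.T + lam * np.eye(n), yc)
    gebv = y.mean() + M @ g_hat            # breeding value = sum of effects
    return g_hat, gebv
```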
Generally speaking, the number of markers in genotype data far exceeds the number of samples (p >> n). Because RRBLUP computes marker by marker, its running time is longer than that of GBLUP, while its accuracy is comparable.
GBLUP is the representative of the direct methods: it treats the individual as the random effect, uses the kinship matrix constructed from the genomic information of the reference and prediction populations as the variance-covariance structure, estimates the variance components iteratively, and then solves the mixed model to obtain the estimated breeding values of the individuals to be predicted. RRBLUP is the representative of the indirect methods: it first estimates the effect of each marker and then accumulates those effects to obtain the breeding value. The figure below compares the similarities and differences between the two approaches:
The direct method estimates the breeding values u directly, while the indirect method estimates the marker effects g and sums them as Mg. When K = MM′ (with M the individual-by-marker genotype matrix) and the marker effects g follow independent normal distributions (as in the figure above), the breeding values estimated by the two methods are identical, i.e. û = Mĝ.
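The equivalence can be checked numerically. In this short sketch (all names and sizes are arbitrary), the direct (GBLUP-style) and indirect (RRBLUP-style) predictions coincide when K = MM′ and the two methods share the same shrinkage parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 500, 10.0
M = rng.integers(0, 3, size=(n, p)).astype(float)
M -= M.mean(axis=0)                        # center the markers
y = rng.normal(size=n)

K = M @ M.T
u_direct = K @ np.linalg.solve(K + lam * np.eye(n), y)   # GBLUP: K(K+lam I)^-1 y
g = M.T @ np.linalg.solve(K + lam * np.eye(n), y)        # RRBLUP marker effects
u_indirect = M @ g                                       # sum of marker effects
print(np.allclose(u_direct, u_indirect))   # True
```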
Genomic selection methods based on BLUP theory assume that all markers explain the same genetic variance. In reality, only a few SNPs across the genome have an effect and are linked to QTLs that influence the trait; most SNPs have no effect. When we instead place a prior distribution on the variance of marker effects, the model becomes a Bayesian one. The common Bayesian methods were also proposed by Meuwissen (who proposed GS itself) and mainly include BayesA, BayesB, BayesC, and Bayesian LASSO.
BayesA assumes that every SNP has an effect drawn from a normal distribution and that the effect variance follows a scaled inverse chi-square distribution with two pre-specified hyperparameters, the degrees of freedom v and the scale parameter S. Marker effects are estimated with Gibbs sampling, a Markov chain Monte Carlo (MCMC) method.
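Below is a deliberately simplified, illustrative Gibbs sampler for BayesA. The function name, default hyperparameters, and the flat prior used for the residual variance are our assumptions for the sketch, not prescriptions from the literature.

```python
import numpy as np

def bayes_a(y, M, n_iter=2000, burn_in=500, nu=4.0, S2=0.01, seed=42):
    """Minimal BayesA Gibbs sampler sketch (illustrative, not optimized).

    y : (n,) phenotypes; M : (n, p) marker matrix (e.g. coded 0/1/2).
    Each marker effect g_j ~ N(0, sigma2_j), with a scaled inverse
    chi-square prior on sigma2_j (degrees of freedom nu, scale S2).
    """
    rng = np.random.default_rng(seed)
    n, p = M.shape
    mu = y.mean()
    g = np.zeros(p)
    sigma2_j = np.full(p, S2)
    sigma2_e = np.var(y) / 2.0
    mtm = (M ** 2).sum(axis=0)             # precomputed m_j' m_j
    e = y - mu - M @ g                     # current residuals
    g_post = np.zeros(p)

    for it in range(n_iter):
        # 1) sample the overall mean
        e += mu
        mu = rng.normal(e.mean(), np.sqrt(sigma2_e / n))
        e -= mu
        for j in range(p):
            # 2) sample marker effect j from its full conditional
            e += M[:, j] * g[j]            # remove marker j from residual
            c = mtm[j] + sigma2_e / sigma2_j[j]
            g[j] = rng.normal((M[:, j] @ e) / c, np.sqrt(sigma2_e / c))
            e -= M[:, j] * g[j]
            # 3) sample its variance: scaled inverse chi-square posterior
            sigma2_j[j] = (nu * S2 + g[j] ** 2) / rng.chisquare(nu + 1)
        # 4) sample the residual variance (flat prior for simplicity)
        sigma2_e = (e @ e) / rng.chisquare(n - 2)
        if it >= burn_in:
            g_post += g
    # returns the last mu sample and posterior-mean marker effects
    return mu, g_post / (n_iter - burn_in)
```

Given new genotypes `M_new`, the genomic breeding values would then be `mu + M_new @ g_post`.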
BayesB assumes that only a few SNPs have effects, with effect variances following a scaled inverse chi-square distribution, while most SNPs have no effect (which matches the actual genomic situation). Because the prior on marker-effect variances is a mixture distribution, it is difficult to construct the full conditional posterior of each marker effect and its variance, so BayesB samples the marker effect and variance jointly using Gibbs together with Metropolis-Hastings (MH) sampling.
BayesB introduces a parameter π: the marker-effect variance is 0 with probability π and follows the scaled inverse chi-square distribution with probability 1 − π. When π = 0, every SNP has an effect and the model is equivalent to BayesA. BayesB is more accurate when the genetic variation is controlled by a few QTLs of large effect.
The parameter π in BayesB is set by hand and thus injects subjectivity into the results. BayesB has been refined into BayesC, BayesCπ, BayesDπ, and other methods. BayesC treats π as an unknown parameter with a uniform U(0, 1) prior and allows the effective SNPs to have different effect variances. BayesCπ further assumes, on top of BayesC, that the SNP effect variances are all equal, and solves the model with Gibbs sampling. BayesDπ estimates both the unknown parameter π and the scale parameter S; the prior and posterior of S are assumed to follow a Gamma(1, 1) distribution, so S can be sampled directly from its posterior.
The figure below illustrates the prior distributions of marker-effect variances assumed by the different methods:
Bayesian LASSO (Least Absolute Shrinkage and Selection Operator) assumes that marker effects follow a Laplace (double-exponential) distribution, which can be written as a normal distribution whose variance is exponentially distributed. The difference from BayesA is thus the distribution of the marker effects: BayesA assumes a normal distribution, while the Laplace distribution allows extreme values to occur with higher probability.
As the methods above show, the focus, and the difficulty, of the Bayesian approach lies in making reasonable prior assumptions about the hyperparameters.
Compared with BLUP methods, Bayesian models usually have more parameters to estimate, which can improve prediction accuracy but also increases the computational burden. MCMC requires tens of thousands of iterations, each of which re-samples all marker effects; the process is inherently sequential and cannot be parallelized, consuming a great deal of computing time, which limits its application in time-sensitive animal and plant breeding programs.
To improve computing speed and accuracy, many researchers have optimized the prior assumptions and parameters of Bayesian methods, proposing fastBayesA, BayesSSVS, fBayesB, emBayesR, EBL, BayesRS, BayesTA, and others. Nevertheless, the methods described above remain the most commonly used.
The prediction accuracy of any model depends largely on how well its assumptions match the genetic architecture of the phenotype being predicted. In general, a well-tuned Bayesian method is slightly more accurate than BLUP, but its speed and robustness are inferior, so one should weigh the trade-offs and choose according to one's needs.
Besides the parametric methods based on BLUP and Bayesian theory, genomic selection also includes semi-parametric methods (such as RKHS, discussed below) and non-parametric methods such as machine learning (ML).
Machine learning is a branch of artificial intelligence that predicts outcomes for unobserved individuals (unlabeled data) by applying highly flexible algorithms to the known features and outcomes of observed individuals (labeled data). Outcomes may be continuous, categorical, or binary. In animal and plant breeding, the labeled data correspond to a training population with both genotypes and phenotypes, the unlabeled data correspond to the test population, and the features used for prediction are the SNP genotypes.
Compared with traditional statistical methods, machine learning methods have many advantages:
Support Vector Machine (SVM) is a typical non-parametric, supervised learning method that can handle both classification and regression. SVM rests on the principle of structural risk minimization, balancing model complexity against fit to the training samples. Especially when little is known about the structure of one's population data, SVM can be a useful alternative for genomic prediction.
The basic idea of SVM is to find the separating hyperplane that correctly divides the training data with the largest geometric margin. In Support Vector Regression (SVR), an approximation error is used in place of the margin between the optimal separating hyperplane and the support vectors. With an ε-insensitive linear loss function, any deviation between measured and predicted values smaller than ε counts as zero error. The goal of SVR is to minimize the empirical risk and the squared norm of the weights simultaneously, i.e. the hyperplane is estimated by minimizing a regularized empirical risk.
Figure 1 below contrasts regression (panel A) and classification (panel B) in SVM. In the formulas, ξ and ξ* are slack variables, C is a user-defined constant, ‖W‖ is the norm of the weight vector, and φ denotes the feature-space mapping.
When SVM is used for prediction, large high-dimensional data sets make the computation very complex. Kernel functions greatly simplify the inner-product computation and thereby sidestep the curse of dimensionality, so the choice of kernel (which should take the distribution of the training samples into account) is the key to SVM prediction. The most commonly used kernels are the linear kernel, the Gaussian kernel (RBF), and polynomial kernels. Among these, the RBF kernel adapts widely and can handle training samples of any distribution given an appropriate width parameter; although it sometimes overfits, it remains the most widely used kernel.
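As an illustration, an RBF-kernel SVR for genomic prediction using scikit-learn; the simulated `geno`/`pheno` arrays and all parameter values are placeholders of ours, not values from the original text.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# Hypothetical data: an n x p SNP matrix coded 0/1/2 and a phenotype vector
rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(200, 1000)).astype(float)
pheno = geno[:, :10] @ rng.normal(size=10) + rng.normal(size=200)

# RBF-kernel SVR; epsilon sets the insensitive region, C the penalty
model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma="scale")
scores = cross_val_score(model, geno, pheno, cv=5, scoring="r2")
print(scores.mean())                       # cross-validated accuracy
```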
Ensemble learning is another common family of machine learning algorithms. It trains a series of learners and combines their outputs by some rule to achieve better results than any single learner; put plainly, a bunch of weak learners are combined into a strong learner. In the GS field, Random Forest (RF) and the Gradient Boosting Machine (GBM) are two widely used ensemble algorithms.
RF is an ensemble method based on decision trees, i.e. a predictor containing many decision trees. In genomic prediction, RF, like SVM, can serve as either a classification or a regression model; when used for classification, note that individuals must first be grouped according to their phenotypic values. The RF algorithm proceeds roughly as follows: (1) draw a bootstrap sample of individuals from the training set; (2) grow a decision tree on that sample, at each node choosing the best split among a random subset of markers; (3) repeat to build ntree trees.
Finally, RF combines the outputs of the classification or regression trees for prediction: in classification, the votes are counted (usually one vote per tree) and the class with the most votes is assigned; in regression, the ntree outputs are averaged.
Two factors chiefly influence RF results. The first is the number of covariates (SNPs) randomly sampled at each node, mtry. When building a regression tree, mtry defaults to p/3 (p being the number of predictor variables); when building a classification tree, it defaults to √p. The second is the number of decision trees, ntree. Many studies have shown that more trees are not always better, and building them is time-consuming; in GS applications to plant breeding, ntree is usually set between 500 and 1000.
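In scikit-learn terms, ntree corresponds to `n_estimators` and mtry to `max_features`; a small sketch with simulated data (all sizes and settings are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(200, 1000)).astype(float)   # hypothetical SNPs
pheno = geno[:, :10] @ rng.normal(size=10) + rng.normal(size=200)

rf = RandomForestRegressor(
    n_estimators=500,        # ntree: 500-1000 is typical in GS studies
    max_features=1/3,        # mtry ~ p/3, the regression default noted above
    n_jobs=-1,
    random_state=0,
)
rf.fit(geno, pheno)
gebv_rf = rf.predict(geno)   # in practice, predict a held-out test population
```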
When GBM is built on decision trees, it is the Gradient Boosting Decision Tree (GBDT), which, like RF, contains multiple decision trees. The two differ in many ways, the biggest being that RF relies on bagging, voting over multiple results or simply averaging them, whereas GBDT relies on boosting: at each iteration it builds a new weak learner to compensate for the shortcomings of the current model. By choosing different loss functions, GBM can handle a variety of learning tasks.
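A matching GBDT sketch with scikit-learn's `GradientBoostingRegressor`; the loss and tuning values are illustrative choices only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(200, 1000)).astype(float)
pheno = geno[:, :10] @ rng.normal(size=10) + rng.normal(size=200)

# Boosting: each new shallow tree fits the gradient of the loss,
# correcting the residual errors of the current ensemble.
gbm = GradientBoostingRegressor(
    loss="squared_error",    # swap the loss to change the learning task
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
)
gbm.fit(geno, pheno)
gebv_gbm = gbm.predict(geno)
```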
Although many studies have applied a variety of classic machine learning algorithms to genomic prediction, the gains in accuracy have been limited while computation remains time-consuming. Among the countless algorithms, no single method improves predictive ability universally; the best method and parameters vary from application to application. Compared with classic machine learning, deep learning (DL) may be the better choice for genomic prediction in the future.
Traditional machine learning algorithms such as SVM are generally shallow models. Deep learning, in contrast, contains multiple hidden layers between the input and output layers; the depth of the model structure is what gives the approach its name. The essence of DL is to learn more useful features by building models with many hidden layers and training them on massive data, ultimately improving classification or prediction accuracy. The DL modeling process can be divided roughly into three steps:
In the GS field, the DL algorithms studied most include the multilayer perceptron (MLP), convolutional neural networks (CNN), and recurrent neural networks (RNN).
MLP is a feed-forward artificial neural network (ANN) that maps a set of inputs to a set of outputs. An MLP includes at least one hidden layer; in Figure 2 below, besides the input and output layers, there are four hidden layers. Each layer is connected to the nodes of the previous layer with individual weights (w), and an activation function transforms the weighted sums so that inputs are mapped to outputs.
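A minimal MLP regression sketch with scikit-learn; the layer sizes and iteration count are arbitrary choices of ours, and standardizing the SNP inputs usually helps optimization.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(200, 500)).astype(float)
pheno = geno[:, :10] @ rng.normal(size=10) + rng.normal(size=200)

# Two hidden layers; each applies weights w and a ReLU activation
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), activation="relu",
                 max_iter=1000, random_state=0),
)
mlp.fit(geno, pheno)
gebv_mlp = mlp.predict(geno)
```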
CNN is a feed-forward neural network with convolutional computations and a deep structure. It has representation-learning ability and can classify input information in a translation-invariant way according to its hierarchical structure. CNN hidden layers come in three types with different functions: convolutional layers, whose main role is to extract features from the input data; pooling layers, which perform feature selection and information filtering on the feature maps produced by the convolutional layers; and fully connected layers, which resemble the hidden layers of an ordinary ANN, sit at the end of the CNN's hidden layers, and receive the signals passed forward from the preceding layers. The CNN structure is shown in Figure 3 below.
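A small illustrative 1-D CNN treating the SNP sequence as a one-channel signal, assuming TensorFlow/Keras is available; the layer sizes, kernel width, and epoch count are arbitrary.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# SNPs as a 1-D sequence with one channel: (samples, p, 1)
n, p = 200, 1000
X = np.random.default_rng(2).integers(0, 3, size=(n, p, 1)).astype("float32")
y = np.random.default_rng(3).normal(size=n).astype("float32")

model = tf.keras.Sequential([
    layers.Conv1D(16, kernel_size=8, activation="relu",
                  input_shape=(p, 1)),     # convolution: feature extraction
    layers.MaxPooling1D(pool_size=4),      # pooling: selection/filtering
    layers.Flatten(),
    layers.Dense(32, activation="relu"),   # fully connected layer
    layers.Dense(1),                       # continuous GEBV output
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```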
It should be noted that deep learning is not a panacea.
The prerequisite for using DL is a sufficiently large, high-quality training data set, and in GS studies on animals and plants some DL algorithms have shown no clear advantage over traditional genomic prediction methods. There is, however, consistent evidence that DL algorithms capture nonlinear patterns more effectively, so DL can assist breeding by integrating data from different sources with traditional GS models. In short, facing the massive breeding data of the future, the application of DL will become increasingly important.
The models above are the common prediction models in GS, though different authors may classify them differently. Below is a brief introduction to other important methods not covered above, some of which are extensions of the three categories of methods already described.
Reproducing Kernel Hilbert Space (RKHS) regression is a typical semi-parametric method. It uses a Gaussian kernel to fit a model of the form

$$y = 1\mu + u + \varepsilon, \qquad u \sim N(0, K\sigma_u^2)$$

where K is the kernel matrix with entries K(xᵢ, xⱼ) = exp(−d_{ij}/h), d_{ij} being the squared Euclidean distance between the marker genotypes of individuals i and j, and h a bandwidth parameter.
The RKHS model can be solved with a Gibbs sampler in the Bayesian framework or as a mixed linear model.
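A sketch of the Gaussian kernel construction typically used in RKHS regression; scaling the distances by their median is one common heuristic of ours here, not the only choice.

```python
import numpy as np

def gaussian_kernel(M, h=1.0):
    """Gaussian kernel over genotypes for RKHS regression (sketch).

    K_ij = exp(-d_ij / (h * median(d))), where d_ij is the squared
    Euclidean distance between marker profiles; h is a bandwidth to tune.
    """
    sq = (M ** 2).sum(axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * (M @ M.T)   # squared distances
    return np.exp(-d / (h * np.median(d[d > 0])))
```

The resulting K simply takes the place of G in the mixed-model machinery described earlier.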
GBLUP remains widely used in animal and plant breeding under the assumption that all markers have equal effects. In practice, however, any marker unrelated to the target trait that enters the genomic relationship matrix dilutes the signal of the QTLs. Many studies have therefore sought to improve it, along several main lines:
Following these ideas, the sBLUP method (SUPER BLUP, from Settlement of Kinship Under Progressively Exclusive Relationship) refines TABLUP further for traits controlled by a few genes, so that the genomic relationship matrix is constructed using only trait-associated markers.
To account for population structure in the kinship matrix, individuals can be clustered by the similarity of their genetic relationships, and the compressed groups then replace the original individuals as covariates, with individuals within a group assumed to share the same kinship. When constructing the genomic relationship matrix, the group's genetic effect value replaces the individual values, and each individual is predicted through its corresponding group. This is cBLUP (Compressed BLUP).
The ideas above all mention integrating validated and newly discovered loci into the model. Where do these loci come from? The most common source is, naturally, genome-wide association studies (GWAS). GS and GWAS are naturally connected: incorporating significant GWAS loci into GS has the direct benefit of maintaining predictive ability across multiple generations and the indirect benefit of increasing the number of validated variants.
The figure below compares various strategies for GWAS-assisted genomic prediction: (a) marker-assisted selection (MAS), which uses only a few major loci; (b) classic GS, which uses all markers with equal marker effects; (c) weighting markers differentially; (d) treating significantly associated markers as fixed effects; (e) treating significantly associated markers as a separate random effect (with its own kernel); (f) dividing each chromosome into segments and assigning the G matrix built from each segment to a different random effect.
The results of GWAS-assisted genomic prediction are more complicated: simply adding association signals to the model does not necessarily improve accuracy, and the actual performance is likely tied to the genetic architecture of the trait.
GS has two different strategies for estimating genetic effects. The first focuses on estimating breeding values, i.e. the additive effects transmitted from parents to offspring. Non-additive effects (such as dominance and epistasis), by contrast, are tied to a particular genotype and cannot be transmitted directly.
In this strategy, when variance components are estimated, non-additive effects are treated as noise together with random environmental effects. The second strategy models additive and non-additive effects jointly and is often used in the study of heterosis. Heterosis is generally thought to arise from dominance and epistatic effects, so if non-additive effects are substantial and are ignored, the genetic estimates will be biased.
The utilization of hybrid vigor is an important research topic in plant breeding, especially in staple food crops such as rice and corn. Taking non-additive genetic effects into the GS model for hybrid prediction is also one of the current hot topics in genome prediction in crop breeding.
Of course, the composition of heterotic effects varies with the trait, and genomic prediction of different traits must be combined with the identification of heterotic QTLs. Since general combining ability (GCA, reflecting additive effects) and specific combining ability (SCA, reflecting non-additive effects) may arise from different genetic effects, GCA and SCA should be modeled separately when predicting hybrid F1s. The GCA model can be based on GBLUP, with emphasis on the construction of the genomic kinship matrix. For the SCA model there are two approaches: integrating a panel of heterosis-related SNP loci into the GBLUP model as fixed effects, or using nonlinear models such as Bayesian and machine learning methods. It has been reported that for traits of medium to low heritability, machine learning and standard statistical models perform similarly under additive models, but under non-additive models machine learning methods perform better.
Traditional GS models usually address a single trait in a single environment, ignoring the relationships among multiple traits or multiple environments that exist in practice. Some studies have improved genomic prediction accuracy by modeling several traits or environments jointly. Taking the multi-trait (MT) case as an example, a multivariate (MV) model for two traits can be written as:

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix}\begin{bmatrix} b_1 \\ b_2 \end{bmatrix} + \begin{bmatrix} Z_1 & 0 \\ 0 & Z_2 \end{bmatrix}\begin{bmatrix} u_1 \\ u_2 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}$$

where the genetic effects u₁ and u₂ share a covariance structure, e.g. Var(u) = G₀ ⊗ K, with G₀ the genetic covariance between traits and K the relationship matrix.
Multi-trait selection is generally used for traits that share part of their genetic architecture, i.e. are genetically correlated. It is particularly useful for traits with low heritability (when correlated with a high-heritability trait) or traits that are difficult to measure.
Crop growing conditions are not as easy to control as those of animals, and most crop traits are quantitative and easily influenced by the environment. Multi-environment (ME) trials therefore play an important role, and genotype-by-environment interaction (G × E) is also a current focus of genomic selection.
Besides GBLUP, multivariate models can also be built on Bayesian linear regression frameworks or on nonlinear machine learning methods.
Genes are ultimately reflected in phenotypes only after transcription, translation, and a series of regulatory processes, so genomic data alone capture the potential for phenotypic variation only to a certain extent. With the development of multi-omics technologies, integrating multi-omics data into genomic prediction has become an important direction in current GS research.
In plant breeding, beyond the genome, the transcriptome and the metabolome are the two omics layers most frequently studied in GS. The transcriptome relates gene expression levels to traits for prediction, while the metabolome relates the abundance of phenotype-regulating small molecules to traits; for certain traits, either may improve predictive ability. Ideally the data from all omics layers would enter the model simultaneously, but this greatly increases model complexity.
The accuracy of phenotyping directly affects model construction. For complex traits, recording phenotypes by visual observation alone is clearly no longer adequate, and phenotypic surveys are time-consuming, labor-intensive, and costly. High-throughput phenotyping is therefore also an important direction for the development of GS.
The scope of "phenotype" is very broad. When a trait cannot be measured easily on individuals, multi-omics data such as the proteome and metabolome can serve as intermediate phenotypes instead.
Considering cost-effectiveness, multi-omics technologies are still at the research stage in animal and plant breeding, but they represent a future direction of application.