Longitudinal analysis

Page contents

Introduction
Longitudinal analysis in ASReml-R
Random regressions

Introduction

Now it is the time to start with problems that breeders face when dealing with multiple assessments of a trait. Longitudinal data arise when an individual is repeatedly assessed at different times. For example, tree height at ages 2, 5, 8 and 12 years, or data coming from increment cores.

Longitudinal data can be considered as a special case of multivariate analysis, because the ‘same trait’ is measured each time. However, longitudinal analysis differs from typical multivariate because there is an underlying continuum (‘time’) and the sequential nature of assessments (height at age 5 comes first than height at age 8) creates patterns of variation.

Longitudinal data allow us to model changes of heritability and genetic correlation with age. Furthermore, data from several assessments can be pooled to improve the prediction of breeding values at a given age. This type of datasets is frequent in breeding programs (especially in the ones where labour and other assessment costs are low). Despite of this, its analysis has mostly been reduced to univariate analysis by age or, in special cases, to bivariate analysis to obtain age-age correlations.

The steps for prediction of breeding values (or family effects) for longitudinal data are almost identical to those for univariate and multivariate analysis. Nevertheless, we can use the existence of patterns to simplify the analysis. The use of an unstructured (us) covariance matrix is not necessarily the best option. If there are m measurements, the covariance matrix involves m(m+1)/2 components. In other words, increasing the number of assessments by m increases the number of estimates by m(m+1)/2. Therefore, it is easy to see that the amount of information to estimate parameters is somewhat reduced (and also the ‘quality’ of the estimates). It is possible to use other structures to reduce the number of parameters to estimate and still explain the observed patterns of correlation.

The above description is very simple and brief. If you require more detail on the description of the different models, their interrelationships and examples of application in forest genetics see my references 7 and 8. There are PDF versions available for download; yes, this is an example of shameless self-promotion.

Longitudinal analysis in ASReml

There are basically two ways of setting up the datasets for longitudinal analysis, either as multiple traits for each record or as multiple records for a single trait. In any case, you will need to setup observations for all times for all individuals. Thus, if you do not have observations for some times it is necessary to specify the missing values (see below):

 # Data as a single trait
 tree fixed_effects random_effects assessment_No age trait
 1001      1              25              1       5    4.2
 1001      1              25              2       8    7.9
 1001      1              25              3      12   14.3
 .
 .
 .

 Data as multiple trait
 tree fixed_effects random_effects  trait1 trait2 trait3
 1001      1              25         4.2    7.9    14.3
 .
 .
 .

You can have multiple fixed and multiple random effects in any of the setups. In the case of the single trait setup it is necessary to use the !repeat command in the pedigree file line, because the same tree ID is used several times (as many times as assessments). The example below assumes a single trait setup for the dataset and considers three assessments for 1526 trees in a single site with 8 reps (just to make it small). Replicates are considered a fixed effect.

Analysis of longitudinal data under different models  
  tree   !P  
  mum    50 !I  
  dad     1  
  rep     8  
  assess  3 !I  
  age  
  height !m0  
RRlongit.csv !repeat 
RRlongit.csv !csv !maxit 20 !dopart $A !STEP 0.1

!part 1  # Plain vanilla us
height !sigmapar ~ assess assess.rep !r,
          us(assess !INIT 0.13
                          0.20 0.34
                          0.24 0.41 0.51 !GP).nrm(tree) !f mv 
residual id(tree).us(assess !INIT 0.46
                                  0.39 0.97
                                  0.48 0.83 2.36 !GP)

!part 2  # Same as before but using corgh
height !sigmapar ~ assess assess.rep !r,
          corgh(assess !INIT 0.95
                              0.78 0.92
                              0.13 0.34 0.51 !GP).nrm(tree) !f mv 
residual id(tree).us(assess !INIT 0.46
                                  0.39 0.97
                                  0.48 0.83 2.36 !GP)

!part 3  # Exponential tree structure (using exph)
 height !sigmapar ~ assess assess.rep !r,
          exph(assess !INIT 0.96
                            0.13 0.34 0.51
                            !COORD 5 8 12 !GP).nrm(tree) !f mv
residual id(tree).us(assess !INIT 0.46
                                  0.39 0.97
                                  0.48 0.83 2.36 !GP)

!part 4  # Random regression, full fit
height ~ assess assess.rep !r,          
         us(pol(age,2) !INIT 5.20
                             1.80 1.10
                             0.05 0.05 0.22 !GP).nrm(tree) !f mv 
residual id(tree).us(assess !INIT 0.46
                                  0.39 0.97
                                  0.48 0.83 2.36 !GP)

!part 5 # Random regression, reduced fit
height ~ assess assess.rep !r,
          us(pol(age,1) !INIT 5.20
                              1.80 1.10 !GP).nrm(tree) !f mv 
residual id(tree).us(assess !INIT 0.46
                                  0.39 0.97
                                  0.48 0.83 2.36 !GP)

For the us, corgh and exph structures the factor assess is used in the same way as site for the multiple site or Trait for the multivariate case. Trait is not used here because there is only a single trait. Parts 1 and 2 are equivalent to a full multivariate case when traits are different, and are included here to start from the most generic case. However, given that we are assessing the same trait at different times, it is expected to have a large degree of autocorrelation, which can be used to simplify the model. In all the cases for the residual portion we are using the term id(tree) to identify (and count) the individual experimental units (in this case tree) that present multiple observations.

Part 3 presents the use of an exponential structure (exph), where we can specify the ages of assessment, allowing for irregular lags between measurements. In some case we can improve fitting of the model specifying a different time scale, for example a square root or logarithmic transformation. Then, rather than using times 5 8 12, we would use 2.236 2.828 3.464 (square root) or 1.609 2.079 2.485. In the exponential case we often specify a single correlation (0.96) and different variances for each assessment (0.13 0.34 0.51), although it would be possible to specify a single variance when we have homogeneous variances (usually after some transformation to stabilize variances).

Parts 4 and 5 show examples of using full fit and reduced fit random regression models, which is explained in the next section.

Random regressions

The phenotypic trajectory of a trait (dependent on time) can be expressed through a mathematical function tractable in a mixed linear model framework, for example, using polynomial regression, growth models, or cubic splines. Each part of the model (e.g. replicate, plot, additive effects, residuals) can be modelled using a different function. For example, the additive genetic effect of tree i on time j (aij) as a polynomial regression.

When using random regressions for modelling factors, rather than obtaining a covariance matrix of dimensions (numbers of observations)*(number of observations), we get a covariance matrix of the random regression parameters, that is of dimensions (number of coefficients)*(number of coefficients). As an example, for tree height assessed at 10 ages, the covariance matrix for the additive genetic matrix would be 10*10. However, if we fit a polynomial of third degree for the additive genetic values, the covariance matrix for the random regression parameters would be 4*4 (fitting intercept, x, x^2 and x^3).

Once we have the covariance matrix for the random parameters, it is very easy to obtain the full covariance matrix using: G = Z Q Z’, where Q refers to the covariance matrix of the random regression parameters and Z to the design matrices for times.

If we use a polynomial of degree (n-1), where n is the number of assessments, we will get a perfect (full) fit for the matrix under modelling. However, the interesting part is to find a good reduced fit (degree lower than n-1) more parsimonious model. Part 4 correspond to a full fit random regression, where for 3 assessments we are fitting a quadratic polynomial. The pol(x,d) ASReml function fits an orthogonal polynomial of degree d, and here is interacted with tree, so there is a different polynomial for each tree. Given that tree is a random effect, the output will include the variances for each of the coefficients, as well as the covariances between coefficients. Part 5 represents a reduced fit random regressions model, where we are fitting a linear regression, reducing the number of coefficients needed in the model.

The following example shows how to calculate the full covariance matrix of additive genetic effects for 3 assessments, based on the output from a full fit quadratic polynomial (Part 4) and a reduced fit linear polynomial.

 Example of modelling using Random regressions
 # Full (quadratic) fit from
 # Part 4 (6 parameters)
 Design matrix for random regression coefficients
 (in .res file)
     | 1.000  -1.000   1.000 |
 Z = | 1.000  -0.137  -0.503 |
     | 1.000   1.013   0.981 |

 Covariance matrix for the random regression coefficients 
 (in .asr file)
     |  1.843860     0.734105   -0.040207 |
 Q = |  0.734105     0.501793    0.061487 |
     | -0.040207     0.061487    0.037533 |

 Reconstructed additive genetic covariance matrix (G, 
 using any matrix algebra package)
 G = Z Q Z'
     | 0.7115880    1.0615705    1.3042244 |
 G = | 1.0615705    1.7105521    2.3399636 |
     | 1.3042244    2.3399636    3.9255211 |

 # Reduced (linear) fit from 
 # Part 5 (3 parameters)
 Design matrix for random regression coefficients 
 (in .res file)
     |  1.000  -1.000 |
 Z = |  1.000  -0.137 |
     |  1.000   1.013 |

 Covariance matrix for the random regression coefficients 
 (in .asr file)
 Q = | 1.787070    0.757768 |
     | 0.757768    0.446248 |

 Reconstructed additive genetic covariance matrix (G, 
 using any matrix algebra package)
 G = Z Q Z'     
     | 0.7177820    0.9866238    1.3448718 | 
 G = | 0.9866238    1.5878172    2.3889440 |
     | 1.3448718    2.3889440    3.7802338 |

 # Compare to unstructured (US) result from 
 # Part 1 (6 parameters)
     | 0.71157   1.06150   1.30423 |
 G = | 1.06150   1.71025   2.33960 |
     | 1.30423   2.33960   3.92577 |

In some cases (like in this example), it will be possible to have substantial reductions on the number of parameters to estimate, without big differences on the estimated covariance matrices.

Updated on December 1, 2021

Was this article helpful?

Yes No