Multivariate analysis

Single site, multiple traits

By now, ASReml’s univariate notation should start making sense. After enjoying the speed and flexibility of the program for the univariate case, you are probably thinking of moving to more complex analyses. The next step will be to run two or more traits in one site.

Multiple traits one site
 tree_ID    !P
 mother    100 !I
 subrace     8 !A
 locality   30 !A
 rep         5 !I
 iblock     23 !I
 plot      100 !I
 BD
 PPY
 FL
 MFA
 CC
woodped.csv !csv
woodtraits.csv !csv !skip 1 !dopart $A

!part 1 # Basic Density
BD ~ mu subrace rep !r nrmv(tree_ID)

!part 2 # Cellulose Content
CC ~ mu subrace rep !r nrmv(tree_ID)

!part 3 # Bivariate analysis 
BD CC ~ Trait Trait.subrace Trait.rep !r,
        corgh(Trait !INIT 0.8 516 1.51).nrm(tree_ID) !f mv
residual id(units).us(Trait !INIT 645 1 0.32)

Parts 1 and 2 of the previous example are just the univariate runs that will provide starting values for the multivariate analysis. Part 3 introduces some new notation. The left-hand side (LHS) of the model equation now has two variables: BD and CC. Rather than mu we use the keyword Trait. Trait tells ASReml to hold a vector of results (rather than one overall mean -mu- we will have as many overall means as traits there are on the LHS. We also add an interaction between Trait and all other terms of the equation, so we get results for, say subrace, for each of the traits.

The 1 2 1 specifies (from left to right) that there is ONE independent error structure, which consists of the direct product of TWO matrices and that there is ONE additional covariance structure to define (which corresponds to Trait.tree_ID). The latter has to be defined because it is more complex than just the product of a design matrix by a scalar. The definition of the error structure comprises two matrices: an identity matrix of size number of observations (the 0 that tells ASReml to count the records) and a 2×2 covariance matrix (the size is specified by Trait; you could also use 2 rather than Trait here). This covariance matrix is unstructured (US) and has error variance of 645 for BD, 0.32 for CC (both values coming from the univariate analyses) and 1 for the error covariance between the traits (we are just guessing this number). The 0 between Trait and US is a place holder that we will use when dealing with spatial analysis.

For the interaction term Trait.tree_ID we are defining this structure as the direct product of two matrices (one for each term). The first matrix is of dimension 2×2 (the size is indicated by the levels of Trait) and is specified as a general correlation matrix (corgh) in this case. You can use an unstructured (us) matrix as well, but correlation matrices tend to behave better and make convergence easier. The initial values (specified by using the !INIT statement) are first the correlation between traits (just a guess) and the additive variances for BD (516) and for CC (1.51). The second matrix is the numerator relationship matrix (nrm), specified by the factor tree_ID (which was defined as pedigree when using !P).

The residual command is used to define the error structure, which consists of the direct product of two matrices: an identity matrix of size number of observations (where the reserved word units tells ASReml to count the records) and a 2×2 covariance matrix (specified by Trait as in the case of the other structure). This covariance matrix is unstructured (us) and has error variance of 645 for BD, 0.32 for CC (both values coming from the univariate analyses) and 1 for the error covariance between the traits (we are just guessing this number).

To make things a bit clearer concerning the us and corgh matrices, they are read as a triangular matrix row by row from left to right (us) and as a lower triangular matrix row by row from left to right, followed by one or more variances on the diagonal (corgh). Thus, if you have a covariance matrix A and a correlation matrix B, where v is variance, c is covariance and r is correlation, and the indices refer to the traits:

     v11 c12 c13 c14           v11 r12 r13 r14
     c21 v22 c23 c24           r21 v22 r23 r24
A =  c31 c32 v33 c34      B =  r31 r32 v33 r34
     c41 c42 c43 v44           r41 r42 r43 v44

Then the starting values of the us matrix would be specified in the following order: v11 c21 v22 c31 c32 v33 c41 c42 c43 v44; while the values for the corgh matrix would be in this order: r21 r31 r32 r41 r42 r43 v11 v22 v33 v44.

Another example is the analysis of a CP progeny trial using a parental model:

Bivariate analysis for CP trial
# don’t use !P because is not a tree model
 treeID         
 mom   12 !I  
 dad   12 !I
 fam  105 !I
 block 10
 BD
 PY
# Skip first line of file
# because it contains columns names
woodprops.txt !skip 1 !dopart $A
!part 1
# GCA and SCA model
# mom and(dad) create only one factor
BD ~ mu block !r mom and(dad) fam

!part 2
PY ~ mu block !r mom and(dad) fam

!part 3
# Bivariate GCA and SCA model
BD PY  ~ Trait Trait.block !r,
         corgh(Trait !INIT 0.9 0.0003 0.003 !GP).mom,
         and(dad),
         corgh(Trait !INIT 0.2 0.0002 0.003 !GP).fam
residual id(units).us(Trait !INIT 0.0038 0.0005 0.0027 !GP)

There are several elements in common with the previous example, so I will mention the few new issues. In the case of the Controlled Pollinated trial we know both parents, so we can get an estimate of dominance using the fam term, which refers to the family code. Because we are using a parental model rather than an individual model, we use mom and(dad) to overlay the design matrices and create a unique prediction for each parent, rather than one as male and one as male. This overlay creates a unique factor, something that you can see in part 3, where the structure for Trait.mom refers to both mom and(dad). The use of !GP to enforce positive definite covariance matrices because, in this specific dataset, some of the parameters tended to be outside the parametric space (e.g. correlation over 1).

Two sites single trait treated as multivariate

In the univariate analysis page you can find the code for multiple site analysis considering either unique variances for additive variance and error variance or single additive variance and multiple error variances (one for each site). Another option is to consider the expressions of a trait in each site as a different trait (this goes back at least to 1952 in a paper by Doug Falconer). For example, diameter in site 1 would be considered as a different trait from diameter at site 2; thus, each trait is going to have its own additive and residual variances. The following example shows code for two sites, but it is easy to extend to any number of traits.

Multiple sites as multivariate analysis
 tree      !P       #V1
 site       2       #V2
 subrace   20       #V3
 rep        5       #V4
 iblock    20       #V5
 dbh_04             #V6
# These two new variables are created only
# to show an alternative way of setting up
# the problem
 dbh_site1 !=V2 !=1 !*V6 !M0    
 dbh_site2 !=V2 !=2 !*V6 !M0
multivarped.csv !csv
multivardat.csv !csv !dopart $A

# Univariate analysis of site 1 to get
# starting values
# We filter data to keep only site 1
!part 1          
!filter site !select 1
dbh_04 ~ mu subrace rep !r rep.iblock nrmv(tree)

# We now filter data to keep only site 2
!part 2     
!filter site !select 2
dbh_04 ~ mu subrace rep !r rep.iblock nrmv(tree)

# Multivariate analysis considering both sites
# No need to filter now
!part 3 dbh_04 ~ mu site site.subrace site.rep !r,
         diag(site !INIT 62 50).rep.iblock,
         corgh(site !INIT 0.85 752 600).nrm(tree)
residual at(site).idv(units) 

# Incomplete block effects are independent
# between sites, so a diagonal (diag) structure is 
# ideal. Alternatively, we can replace diag(site)
# by at(site). Both factors rep and iblock are not
# specified but they are assumed to be id().

# The additive performance on both sites is
# correlated, and we use corgh to get a direct
# estimate of that correlation.

# The residual part specifies the errors as 
# independent between sites, so 
# the covariance will be 0.

# Equivalent multivariate analysis using
# the new traits created
!part 4  
dbh_site1 dbh_site2 ~ Trait Trait.subrace Trait.rep !r,                      
                     diag(Trait !INIT 62 50).rep.iblock,                     
                     corgh(Trait !INIT 0.85 752 600).nrm(tree)
residual id(units).diag(Trait)

A few explanations now. The previous example uses two alternative ways to code the problem. The first one (!part 3) uses site interacting with all the terms of the model equation, including tree. This creates results for each factor in both sites. The model term site.rep.iblock is the variance of incomplete blocks in each site, and, because they are independent, we use a diagonal structure with the variances for each site as starting values. site.tree represents the tree (or additive genetic) variance in each site. Because the trees are related there is a correlation between the effects in both sites. We can estimate the correlation directly if we use corgh (and it makes convergence easier). We could still use a us structure as, for example, us(site !INIT 752 1 600).

The second approach (!part 4) creates a trait for diameter in each site, so for observations in site 1 dbh_site1 = dbh_04, while dbh_site2 is a missing value. Note the use of Trait rather than site as well as the use of a comma at the end of the first line of the model equation, to indicate that continues in the next line. In the case of the additional structures, diag(Trait !INIT 62 50) represents a 2×2 diagonal matrix with diagonal elements 62 and 50, while rep.iblock represents the incomplete block design matrix. Also, corgh(Trait !INIT 0.85 752 600) specifies a 2×2 correlation matrix, while tree is associated with the numerator relationship matrix.

Updated on November 30, 2021

Was this article helpful?

Yes No