Data tips

Page contents

Truncating the dataset
Univariate to multivariate format
Starting values: the apocryphal Dutch method

ASReml has not really been designed to perform data management (better use a database system), but still allows you to perform a few variable conversions, subsetting, etc. Below you will find some (hopefully) useful examples.

Truncating the dataset

The presence of runts (individual with extremely poor growth) is common in progeny trials. Lets say that you want to test the effect of dropping these trees from the analysis, but you are not sure where is the limit of size to drop the trees.

Dropping runts
 tree !P        # Tree IDs,          V1
 rep      5     # replicates         V2
 iblock  20     # incomplete block   V3
 dbh04          # Diameter at age 4  V4
 bd
 newdbh  !=V4 !≥$A !*V4 !M0
crctest1ped.txt
crctest1dat.txt !dopart $B   

!part 1
newdbh ~ mu rep !r rep.iblock tree

!part 2
bd ~ mu rep !r rep.iblock tree

# … etc

To run the previous example, assuming the file is called testrunts.as, you would need to specify to specify two numbers in the command line: the first one referring to $A (minimum diameter to include in the analysis) and the second referring to $B (part of the analysis to be run), e.g.

> asreml testrunts 5 1

would keep all trees with more than 5 cm of diameter and run part 1. So, how does it work? First, variable number 4 (dbh04) is copied to newdbh (!=V4); then, ASReml compares its value with the value taken by $A (!≥$A, 5 cm in this case). If newdbh is greater than or equal to $A, it takes a value of 1 (TRUE), 0 otherwise (FALSE). Now newdbh contains either 1s or 0s. Later newdbh is multiplied by dbh04 (!*V4), taken the value of dbh04 for the 1s and 0 for the 0s. Finally, the 0s are converted to missing values (!M0).

Univariate to multivariate format

Sometimes the dataset is setup in a univariate way, but we are interested in using some multivariate structures that are easier to fit when the data appear as several traits. For example, we are interested in dbh in several sites, so for each record we have only one dbh in one site. It is possible to expand the dataset to have as many variables for dbh as there are sites.

Creating multiple variables from one variable
 tree !P        # Tree IDs,                         V1
 site     3     # Three sites with codes 1, 2 and 3 V2
 rep      5     # replicates within sites           V3
 iblock  20     # incomplete blocks                 V4
 dbh            # Diameter                          V5
 bd
 dbhS1   !=V2 !==1 !*V5 !M0   # dbh at site 1
 dbhS2   !=V2 !==2 !*V5 !M0   # dbh at site 2
 dbhS3   !=V2 !==3 !*V5 !M0   # dbh at site 3

crctest2ped.txt
crctest2dat.txt !dopart $A

!part 1
…
…
etc

In this case, dbhS1, dbhS2 and dbhS3 correspond to dbhs measures at sites 1, 2 and 3 respectively. How does it work? First the site codes are copied to, say, dbhS1 (!=V2). Then the value is compared to the site code (e.g. 1, !==1). If the result of the comparison is TRUE, the variable takes a value of 1, 0 otherwise. Now dbhS1, which contains 1s or 0s, is multiplied by the dbhs (!*V5) and the 0s are treated as missing values (!M0). Same method is applied for other sites.

Starting values: the apocryphal Dutch method

How to get starting values is the second question (after how to define covariance structures) when starting to run multivariate analyses. It is usually framed as ‘Where do I get the bloody values from?’ In lots of cases we do not have an idea of the covariance, so a value like 0.1 is used, and in most bivariate cases will work well, thank you very much. In other cases, we have some knowledge of the correlation (for example, between growth and wood density, ~−0.1 to −0.3 is a good assumption. We can use that cov = r * sqrt(v1 * v2), where the variances v1 and v2 come from the univariate analyses and r from our assumption. Even better, we can use a coru or corg so it is possible to use the assumed correlation values directly.

The previous explanation is all good if we are putting together a bivariate analysis, but: what happens when we want to run more than 2 traits? Let’s say that we have 4 traits, we would need to run 6 bivariate analyses to get decent starting values to run the full multivariate (4 traits analysis). If you are doing this often, it becomes a real pain (wherever you prefer). Yes, there is hope and here is where the apocryphal Dutch method comes in.

Some time ago Greg Dutkowski and myself were involved with the Southern Tree Breeding Association people calculating genetic parameters to run Radiata Pine and Blue Gum national genetic evaluations. It was quite tedious, but Greg showed me this neat way of getting starting values which worked at least 50% of the time. He called it the ‘slash method’, but it was the end of a long day so I kept talking about the Dutch method.

The apocryphal Dutch method makes heavy use of two ASReml features: the ability to pick up results from previous runs (using the -c flag at the command line) and the posibility to fit level specific variances for a factor (or a trait) using the slash (/) qualifier. The system works as follows:

1. First, one runs univariate analyses for each of the traits (best using !dopart for each trait), so we can determine the appropriate model (in terms of significant fixed and random effects) for each trait. These analyses would be processed using something like: > asreml CPtrial 5 1 (assuming that the command file is called CPtrial.as, the minimum dbh is 5 cm and we are running part 1).

Analysis of E. globulus CP trial
 tree !P
 family  57 !I
 rep      5 !I
 inblk   20 !I
 plot   264 !I
 dbh
 height
 pilo
 # dropping runts (trees with dbh less than A)
 # from the analysis
 newdbh !=V6 !≥$A !*V6 !M0 
 newhei !=V6 !≥$A !*V7 !M0 
 newpil !=V6 !≥$A !*V8 !M0
CPtrial.ped !alpha !groups 10
CPtrial.dat !dopart $B

# Only the significant terms have been left in
# the models subraces are fitted as genetics 
# groups (see !groups 10)
!part 1
newdbh ~ mu rep !r rep.inblk plot nrmv(tree) family

!part 2
newhei ~ mu rep !r rep.inblk nrmv(tree) family

!part 3
newpil ~ mu rep !r rep.inblk nrmv(tree) family

2. Second, we set a multivariate model including all traits BUT we define a covariance structure only for the residuals and, maybe, for the additive genetic factor (either tree or mother). All the other factors (e.g. replicate, plot, etc) are set using slash rather than dot, so ASReml is in effect fitting only variances (without any covariance) for those factors, without the need for defining covariance structures. Check also the at(Trait,1).plot that is fitting the plot factor only for trait 1 (newdbh). This is run as: > asreml CPtrial 5 4.

# Multivariate analysis
!part 4
newdbh newhei newpilo ~ Trait Trait.rep !r Trait/nrm(tree),
                        Trait/rep.inblk Trait/family,
                        at(Trait,1).plot
residual id(units).us(Trait !INIT 236
                                  0.1 0.87
                                  0.1 0.1  3.16 !GP)

3. Once we run the multivariate run (and if we are lucky) the new model converges and we have a model with a full structure for the residuals and, maybe, a full structure for the additive genetic component. Now we can extend the model to another factor where we define the structure and use -c to continue running the model from the previous iterations (without the need to put the results from the previous step). Repeat this for each additional factor and cross your fingers. At the end of this process you should have the full model running. This is run as: > asreml -c CPtrial 5 4.

# Multivariate analysis
# We are adding an additional structure for Trait.tree
newdbh newhie newpilo ~ Trait Trait.rep! !r
                      corgh(Trait !INIT 0.9
                                        0.4 0.3
                                        34.7 0.31 2.82).nrm(tree),
                      Trait/rep.inblk Trait/family,
                      at(Trait,1).plot
residual id(units).us(Trait !INIT 236
                                  0.1 0.87
                                  0.1 0.1  3.16 !GP)
# When using corgh we first list the 
# correlations (first 3 numbers) followed by
# the variances (next three).

and then, again, as: >asreml -c CPtrial 5 4 for the following code:

# Multivariate analysis
# We added an additional structure for Trait.rep.inblk
!part 4
newdbh newhei newpilo ~ Trait Trait.rep !r
                        corgh(Trait !INIT 0.9
                                          0.4 0.3
                                          34.7 0.31 2.82).nrm(tree),
                        corgh(Trait !INIT 0.4
                                          0.2 0.1
                                          58.14 0.45 0.60).rep.inblk,
                        Trait/family,
                        at(Trait,1).plot

and keep going for each additional structure (Trait.family in this case).

4. A variant of this process (which is a good idea when there are some convergence problems) is to get the covariances and variances from the .asr file at each step and stick them in !part for the multivariate model in the .as file. This will allow you to tweak the model by parts and not lose any estimates when adding further terms in the model.5. There is neither explicit or implicit guarantee for the Dutch method, but if you are lucky it can save you a lot of time. If it doesn’t work you will need to run the bivariate analyses (sorry).

Updated on December 1, 2021

Was this article helpful?

Yes No