Joint genomic & pedigree evaluation

Download Report

Transcript Joint genomic & pedigree evaluation

Joint records & genomic &
pedigree evaluation
Andrés Legarra#, Ignacio Aguilar * †
#UR
631, SAGA, Castanet-Tolosan 31326 France
*Animal and Dairy Science Department, University of Georgia, Athens 30602
†Instituto Nacional de Investigación Agropecuaria, Las Brujas 90200, Uruguay
[email protected]
with help from, I. Misztal, D.L. Johnson, T. Lawlor, M. Toro, JM Elsen, O. Christensen, L.
Varona, P. Van Raden, G. de los Campos and others
Zaragoza, 10/11/2009
1
Financing
• Eadgene
• AMASGEN
• Holstein Association of America
2
Plan
1. Short review of models for genomic
selection
2. The two-(three) step procedure
3. The genomic relationship matrix
4. …and its extensions to include pedigree
5. Performance of one-step evaluation
3
Plan
1. Short review of models for genomic
selection
2. The two-(three) step procedure
3. The genomic relationship matrix
4. …and its extensions to include pedigree
5. Performance of one-step evaluation
4
Single marker
• Assume there is a marker in complete LD
with a QTL
• For example, the polymorphism in DGAT1 which
increases fat yield
• Use a linear model to estimate its effect
• yi= marker effect in animal i + noise
5
Base model
• y= Za + e
– Z= incidence matrix
of marker effects
– a= marker effect
– e=residuals
3 individuals, 1 marker with 4 alleles
 a1 
0 1 1 0  
a2 



Ζa   2 0 0 0 
 a3 
 0 1 0 1   
 a4 
• This can be solved, for example, by least
squares
6
Single marker
• This is fine if we know what markers are good
predictor of what genes
• But this is rarely the case
7
Whole genome
• The simpler is to do an extension of single
marker association analysis
• Do multiple marker regression
– Why multiple? To account for all genes
(markers) simultaneously
• Works well only with dense markers!
– Because to trace correctly QTLs we need
some markers in LD with them
8
Multiple marker additive model
• y= Za + e
4 individuals, 2 markers each
– Z= incidence matrix of
marker effects
– a= marker effect
– e=residuals
2 alleles in 1st marker
1
2
Ζa  
0

0
1 0 1 1
0 2 0 0
2 1 0 0
2 1 1 0
 a1,1 
 
0   a1,2 
0   a2,1 
 
1   a2,2 

0   a2,3 
 
 a2,4 
4 alleles in 2nd marker
9
A priori Distributions for marker
effects
• Several distributions have been proposed
–
–
–
–
Normal (Meuwissen et al., Genetics 2001; Van Raden JDS 2008)
Mixture of normal (Van Raden JDS 2008)
BayesA, BayesB (Meuwissen et al. 2001)
Lasso (YI, N. and S. XU, 2008 Genetics 179: 1045; De los Campos, Genetics, 2009; Park
and Casella, Journal of the American Statistical Association)
• No clear proof (from data) that any one is
superior
2
• I will use normal: Var a  I a
10
Useful parameterizations
• value of « 1 » allele = 0
• value of « 2 » allele = ai, where ai is the
effect of the SNP at that locus
– « 11 » = 0
– « 12 » = ai
– « 22 » = 2ai
11
Useful parameterizations
• value of « 1 » allele = -0.5 ai
• value of « 2 » allele = +0.5 ai, where ai is
the effect of the SNP at that locus
– « 11 » = -ai
– « 12 » = 0
– « 22 » = +ai
12
Useful parameterizations
• value of « 1 » allele = -pi ai
• value of « 2 » allele = (1-pi) ai, where ai is the effect of
the SNP at that locus, and pi is the frequence of the allele 2
• Thus results in centered Z matrix (E(Za)=0 for
any a)
– « 11 » = -2piai
– « 12 » = ai-2piai
– « 22 » = 2ai-2piai
• How do we choose p?
13
Useful parameterizations
• Different parameterizations do not give the
same result
• This is different from quantitative genetics
theory
– « Old » Falconerian genes are fixed and
constant terms are absorbed by the mean
– But now SNP are random effects
14
Plan
1. Short review of models for genomic
selection
2. The two-(three) step procedure
3. The genomic relationship matrix
4. …and its extensions to include pedigree
5. Performance of one-step evaluation
15
Why 2-step procedure
• BayesB and A and mixtures and Lasso are
fine, but only some animals are genotyped
– Do they have data?
• This limits practical applications
• Need to get pseudo-data for genotyped
animals
16
Inferring genotypes
•
•
Genotypes in some individuals can be
inferred, only to some extent
Peeling
1. Peeling unilocus
2. Pseudo-peeling multi-locus
3. Gengler’s gene content prediction
•
They work well only a few generations back (forward)
unless we genotype more individuals with low-density
SNPs and then use (2) (Habier et al., Genetics 182: 343)
17
Pseudo-data
• So we need pseudo-data
• EBV’s
• DYD’s
18
Pseudo-data
• EBV’s
• The problem with EBV’s is that they
already share information among
individuals
• e.g., a dam EBV is = own yield + parent
average + progeny contribution
• But then we are including genetic
contribution of parents, and thus the SNP
effects that we want to estimate
19
Pseudo-data
• Also, EBV’s are correlated
u | y ~ N  uˆ , Cuu 
• The correlation depends on the amount of data
and distribution across fixed effects and families
• EBVs of two cows are correlated, for example, if
they belong to the same herd, even if they are
not related
20
Pseudo-data
• DYD’s avoid part of these problems (Van Raden Wiggans 1991)
• DYD = daughter yield deviation
• Record of the daughter, corrected by
environmental effects and dam’s EBV
• Thus DYD = 0.5 BV sire + mendelian sampling
• E(DYD)=0.5 BV sire
• YD’s exist for cows
– YD = record –environmental effects
21
Pseudo-data
Problems of DYD’s / YD’s
• YD’s little reliable and subject to preferential treatment
• YD’s and DYD’s are less, yet still, correlated, and their
variances (=accuracies) are very hard to estimate. This
leads to serious problems (Neuner et al., 2008, 2009)
• Hard to define for some species/traits
• We accept regular BLUP with pedigree that we don’t like
22
2-3-step procedure
1. Get pseudo-data from pedigree-BLUP
• USA (Van Raden et al., JDS 92:16)
2. Run genomic evaluations with DYDs
3. Combine with pedigree-BLUP
•
FR (Guillaume et al. JDS 91:2520; also
•
http://www.inst-elevage.asso.fr/html1/IMG/pdf_CR_0972128-JT_13_oct_2009.pdf
)
2. Run joint « QTLs – additive infinitesimal »
BLUP evaluation with DYDs
–
Need variance component estimates (difficult to compute
with 20-35 QTLs they’re using)
23
Real problem of Pseudo-data
•
•
•
•
Extremely complex procedure
Loss of generality
We analyse a subset of the population.
Thus, ungenotyped dams (or daughters) of a
bull do not benefit from its improved accuracy.
24
Plan
1. Short review of models for genomic
selection
2. The two-(three) step procedure
3. The genomic relationship matrix
4. …and its extensions to include pedigree
5. Performance of one-step evaluation
25
The genomic relationship matrix
• Remember y = Za + noise;
(phenotype = sum of SNP effects).
• But we can say g = Za
(genetic value = sum of SNP effects). This is a Breeding Value
(=2EPD) in the literal sense.
• Then it follows that
– Var(g)=ZZ’2a
• Standardizing
– Var(g)=ZZ’2a/k = G2u
– Where 2u is « the » additive variance
26
The genomic relationship matrix
• G : matrix of pseudo-relationship or
« genomic relationships »
– Also, a « molecular relationship matrix »
• ZZ’ : « looks like » number of shared SNP
alleles among two individuals
27
• Assume all possible combinations in a
locus, a parameterization of 012; the
covariance at that locus is
11 12 22
11 0 0 0
12 0 1 2
22 0 2 4
28
The genomic relationship matrix
• ZZ’ is different depending on how do we
parameterize Z
• Parameterizations are
– -1,0,1
– 0,1,2
– -2p, 1-2p, 2-2p
29
The genomic relationship matrix
• For example, assume two individuals are
– (11,12,11,22,11)
– (22,11,12,22,12)
– zz’ (its covariance) is
• 4 with 0 1 2
• 0 with -1 0 1
• -1.75 with 2p, 1-2p, 2-2p
– all are correct (yet different) since they
represent a valid linear model!
30
The genomic relationship matrix
• How do we get the variance of SNP effects from a
polygenic variance?
• The formula assumes HW, linkage equilibrium of SNPs
(which is false) Gianola et al. (2009)
• This formula is (in HW) equal to trace(ZZ’)/ number of
individuals in data
k 2

pi 1  pi 
all SNPs
• k is not the number of SNPs
31
The genomic relationship matrix
• Elements in G are related to « true » (IBD)
relationships
• Why?
• Two guys share the same allele at the
marker because they have a common
ancestor (perhaps beyond pedigree
founders)
32
The genomic relationship matrix
• E(G)=A (relationship matrix) + a constant
matrix (Habier et al. 2007 Genetics 177:1389)
• If we use the parameterization of -2p, 12p, 2-2p, then the constant is 0
(Van Raden, 2008 91:4414)
– « If » p is the frequency at the founders
– Otherwise the genotyped animals « are »the
base population (Oliehoek et al. 173:483)
33
The genomic relationship matrix
• E(G)=A (relationship matrix)
• A is an average relationship, deviations of
which do exist
• G more informative than A.
– Two fullsibs might have a correlation of 0.6 or
0.4
• You need many markers to get these
« fine relationships »
34
Example
This is the chromosome of a sire
These are sons
In the infinitesimal model, each son
receives exactly half the sire.
35
Example
This is the chromosome of a sire
These are FOUR sons
•In reality, two sons are identical and other two
are very different from the first two but alike
among them.
36
The genomic relationship matrix
• G can be used for evaluation:
1
1
uˆ   I  G   y
• Same results as fitting marker effects a
• Some nice properties
– Pseudo-reliabilities from inverse
– Smaller set of equations
– Can use old programs 
37
Plan
1. Short review of models for genomic
selection
2. The two-(three) step procedure
3. The genomic relationship matrix
4. …and its extensions to include
pedigree
5. Performance of one-step evaluation
38
Proposals for overall relationship matrix
(Legarra et al., 2009 JDS 92:4656; Christensen & Lund, EAAP 2009)
• Not big loss in assuming normality for SNP
effects (Cole et al., Van Raden et al.)
• G easy to be constructed then
• Can we include G in the relationship
matrix?
• If we construct an overall relationship
matrix with good properties, then we can
just do BLUP with all data and animals
39
Proposals for overall relationship matrix
(Legarra et al., 2009 JDS 92:4656; Christensen & Lund, EAAP 2009; also Misztal et al.,
92:4648)
1. Naif
2. Modification for progeny
3. Overall modification
40
Naif proposal
• Let
(Legarra et al., 2009; Gianola & De los Campos, 2008)
 A11
– 1 : ungenotyped animals
A= 
– 2 : genotyped animals
 A 21
A12 

A 22 
41
Naif proposal
• Modification
– 1 : do not touch
– 2 : plug G (=K-1 in G&dlC)
• negative definite
• Incoherent
 A11
A= 
 A 21
A12 

G 
– Sons (parents) of two animals correlated in G might
not be correlated themselves in A11
42
Naif proposal
• Does not work
 A11 A12 
• Assume 2 are bulls and 1 are cows A g = 

A
G
 21

• Then bulls’ EBV can be computed as
uˆ 1   ZZ  A   Zy
1
11
1
11 1
uˆ 2  A21A uˆ
1
(from data)
(selection index)
this Z =
incidence matrix
of animals (not
of SNPs!)
• and G serves to nothing…
• This is because Ag is not a valid covariance matrix, as
assumed by selection index or BLUP
• Ignoring G to compute covariances among 1 and 2 or
individuals in 1 is wrong
43
Modification for progeny
• Assume all parents (« 1 ») are genotyped
and use G
• Use Quaas’ 1988 tracing of BV’s
• Use average transmissions of 0.5
• Each son is half parents + mendelian
samplings
44
Proposals for overall relationship matrix
• Matrix becomes
GP32 T33
 G

Ap= 

T
P
G
T33  P32GP32  D3  T33 
 33 32
45
What are these T’s and P’s
GP32 T33
 G

Ap= 

T
P
G
T33  P32GP32  D3  T33 
 33 32
• P: you are half your parents. One row of the
pedigree file
• T: you have ½ genes of your parents, ¼ of your
grandparents, 1/8 of your grandgrandparents
and so on.
– Can be computed from pedigree file through
recursion
• D: mendelian samplings
46
Proposals for overall relationship matrix
• Matrix becomes
GP32 T33
 G

Ap= 

T
P
G
T33  P32GP32  D3  T33 
 33 32
• Correct, positive definite if all founders
genotyped
• Otherwise incoherent « backwards »
– Animals correlated in G might have uncorrelated
ancestors
• Not very practical because it is complicated and does not
account for ancestors
47
Overall modification
• What would we do « in the old times »?
• Compute breeding values for whatever
animals
• Then use the « classical » selection index
based on pedigree
48
Overall modification
• So then:
Genotyped
Selection
index
p  u 2   N  0, G  and
Variance of the
selection index
(under normality)
1
1
p  u1 u 2   N  Α12 A 22
u 2 , Α11  Α12 A 22
Α 21 
Ungenotyped
• and we can construct
p  u1 , u 2   p  u 2  p  u1 u 2 
49
Overall modification
• This leads to:
 H11 H12 
H


 H 21 H 22 
1
 A12 A 22
GA 221 A 21  A11  A12 A 221A 21

1
GA

22 A 21
A12 A 221G 

G 
• (Semi) positive definite (by construction)
• No obvious incoherences
• Identical to Ap if all founders genotyped
50
Overall modification: example
51
Overall modification: example
This is the regular relationship matrix. Assume now that
animals 9 to 12 have a genomic relationship of 0.7
52
Overall modification: example
This
parents now
are related
G
This guy
now is
inbred
53
Overall modification: inverse
• The surprising thing is that H has a rather
sparse (and simple) inverse
H 1  A 1   0
0

 0 G 1  A 1 

22 
• Which simplifies computations
• Note that
 A11 A12 
1
22
A 1   21
and
A

A
22
22 
A 
A
54
BLUP equations for a young
animal
• Conditional equations from H-1:
 u  ud 
PA   s

 2 
GP 
PP 
  g ij u j
j , j i
g
Parent average
Genomic prediction
These are the same
sources used by USDA –
AIPL but here they are
computed simultaneously
ii
ij
uj
  a22
j , j i
« Genotyped subset » pedigree prediction
ii
a22
ii
wPP;
EBVi  2wPA  g ii wGP  a22
ii
w  1  2  g ii  a22

55
Overall modification
• Christensen & Lund (EAAP 2009) arrived to the
same result from a different perspective
– Predict markers using Gengler’s method
  

qˆ x  1 A xy A 

 q y  1 

1
y

– Predict both expectation and variance (the last
accounts for uncertainty)
– Same, identical, result!!
– Submitted to GSE
56
Plan
1. Short review of models for genomic
selection
2. The two-(three) step procedure
3. The genomic relationship matrix
4. …and its extensions to include pedigree
5. Performance of one-step evaluation
57
Advances
Genomic evaluations using
combined genomic pedigrees
matrices
I. Aguilar, I. Misztal, A. Legarra,
D. L. Johnson, S. Tsuruta, and T. J. Lawlor
Paper submitted to JDS; also,
http://www-interbull.slu.se/bulletins/bulletin40/Pre/ITB_Misztal.pdf
October 21,2009
58
Inverse of joint relationship matrix
 A12 A 221  G  A 22  A 221A 21
H  A
1
G

A
A


22
22 A 21

A12 A 221  G  A 22  

G  A 22

0
0

H A 
1
1 
0
G

A

22 
1
1
Using of single-step with regular MME equations
 X'X
 Z'X

X'Z
  βˆ   X'y 

-1   
Z'Z +  H   uˆ   Z'y 
59
Creation of matrices

Implementation requires
0
0

H A 
1
1 
0
G

A

22 
1

Inverses



1
Relationships based on pedigree between
genotyped animals
Genomic relationships
Currently by direct inversion
60
Expected Relationship Matrix of
Genotyped Animals - A22

A22 created explicitly using Colleau’s algorithm
It does not require A explicitly
Multiplication is done with two scans of the pedigree
v = Ar2 = (I - P)-1 D(I - P)-1'r2

For each genotyped animal in A22
A
A22
0
0
* 1
0
0
=
Ai.
A
61
Tabular method vs. Colleau algorithm

Testing
 6,500 genotyped Holstein
 57,000 pedigrees
CPU Time
Memory
Tabular*
Colleau method
311 s
45 s
12.1GB
322MB
*P.VanRaden (2009)
62
Implementation for large scale genetic
evaluation

US Holstein Final Score – May 2009
 10,466,066 records (1955-2009)
 6,232,548 cows
 9,100,106 pedigrees

Genotypes from AIPL
 Illumina BovineSNP50 chip
 6,508 bulls
 38,416 SNPs

Out of 6,508 bulls
 3,933 bulls have daughter up to 2004 (OLD)
 2,575 bulls no daughter in 2004 but with daughter in 2009
(YOUNG)
63
Computations

Creation of matrices



Inversion of matrices



G 10 min
A22 < 1 min
G and A22 2.5 min
Time per PCG round 2% greater than regular
evaluation
Complete evaluation ~ 2 hours
64
Analyses

Repeatability animal model

Records up to 2004



Ped04
PedGen104
Records up to 2009


Pedigree based relationships
Joint (pedigree & genomic) relationship
Pedigree based relationships
Ped09
Multiple step approach

Records up to 2004

Use predictions from Ped04
PedGenM04
65
Comparisons

Involve predictions from 2009 of YOUNG bulls
 Daughter deviation (deregressed proofs DD)
 Estimated breeding value (EBV)

with predictions from 2004 (Ped04,PedGen104,PedGenM04)
« Reliability »

R2, coefficient of regression (both should be 1)
 DD = μ + δEBV04 + e
« Bias »
 EBV09 = μ + δEBV04 + e
66
Computations
• A key step is the standardisation of G=ZZ’/k so
that we can use the value of the additive
genetic variance σ2u. What is k ?
2
2
k


/

• k is exactly
a
u
• But we did not compute it
67
Computations
• k is approximately k  2 pi 1 pi 
• The formula is valid only in LE, HW
(Gianola et al., Genetics
183:347)
• A reasonable guess – but not always
• An alternative: random « p » from beta distribution
(Gianola et al., id.)
68
Computations
• Also, how to compute p?
– « Current » observed frequencies
– Frequencies at the base population (e.g., by
Gengler’s procedure): usually a bad estimate
– 0.5 (why not?)
69
R2 and regression coefficients for
Young bulls
DD
Prediction method
R2, %
Parent average (Ped04) 24
Multistep (PedM04)
40
Single-step1 (Ped104)
G5
41
GB
38
GC
37
GG – G5
41
GG – GB
38
GG – GC
39
EBV09
δ
0.76
0.86
R2, %
36
50
δ
0.79
0.82
0.76
0.68
0.71
0.79
0.77
0.79
49
45
45
50
46
46
0.70
0.63
0.66
0.73
0.71
0.73
1Assumed
allele frequency of 0.5 (G5), base population (GB), current population (GC),
or calculated as in [30] of Gianola et al (2009) (GG).
70
Mixing genomic and pedigree information


EBV from final scores inflated by 20-40%,
and based on different G
Inflation was reduced (with little loss of
accuracy) using
0
H =A 
0
1

1

1
1 
(G  Α22 ) 
0
This implies that the prior for u2 should
include A as well
71
Multiple priors
• There seems to be two forms of priors for
genotyped animals:

p  u2   N 0, Gg2  A22a2
 
p  u2   N  0,  Gg2
 



  A
1
Regular mean

22
2
a

1


1



Harmonic mean
• One is a (very rough) approximation of the
These two
other
variances
could (must?)
• Work in progress…
be estimated
from data
72
Harmonic mean = Multiple priors
Assume two independent priors for u2
(A. Legarra, L. Varona)
p(u 2 | A22, G)  p(u 2 | A22 )p(u 2 | G)
where
u 2 | A22 ~ N(0, A22u2 )
u 2 | G ~ N(0, Gg2 )
p(u 2 | A22 )p(u 2 | G)~ e

u ' g2G-1u
e
2
1u
u ' u2A22

2
var(u2 )    G   A 
2
g
1
2
u
1
22
e

1  u
u '  g2G-1 u2A22

2
1
73
How to validate?
• Cross validation is fine, but..
• Even validation bulls are selected
• Selection produces bias in the regression
coefficients δ
• Also, cross-validation « kills the discussion »
because we just check whether any BayesX is
good or not but we don’t understand what’s
going on
• At some point genomic predictions will be simply
accepted and no more cross-validation will be
done.
74
Summary
• Genomic relationship matrix useful and informative;
extensions to include pedigree not very complex
• BLUP including pedigree & genomic information feasible
through joint relationship matrix
• (Quite) simple to implement
• Optimal priors (variance-covariance matrix) for
genotyped animals u2 yet to be determined but the
procedure seems to be quite robust
– More theory needed
• Straightforward for other models (multiple trait, maternal,
etc.) and populations and data structure (beef, chicken,
swine, sheep, whatever)
75