Talk - University of Arizona

Download Report

Transcript Talk - University of Arizona

Bayesian Estimators of
Time to Most Recent
Common Ancestry
Bruce Walsh
[email protected]
Ecology and Evolutionary Biology
BIO 5
Adjunct Appointments
Molecular and Cellular Biology
Plant Sciences
Epidemiology & Biostatistics
Animal Sciences
Definitions
MRCA - Most Recent Common Ancestor
TMRCA - Time to Most Recent Common Ancestor
IBD - identical by descent
Question: Given molecular marker information from
a pair of individuals, what is the estimated time back
to their most recent common ancestor?
Application in Genealogy -- how closely are (say)
Bill Walsh (former football coach) and I related?
Basic concept: The more closely related, the more
genetic markers in common we should share
With even a small number of highly polymorphic
autosomal markers, trivial to assess zero (subject/
biological sample) and one (parent-offspring) MRCA
For forensic applications: 13 highly polymorphic
autosomal marker loci sufficient for individualization
Thus, if issue of paternity ( Bill’s my dad) or a sibling,
Easy.
Problems with Autosomal Markers
Often we are very interested in MRCAs that are modest
(5-10 generations) or large (100’s to 10,000’s of
generations)
Unlinked autosomal markers simply don’t work over these
time scales.
Reason: IBD probabilities for individuals sharing a MRCA
5 or more generations ago are extremely small and hence
very hard to estimate (need VERY large number of
markers).
mtDNA and Y Chromosomes
So how can we accurately estimate TMRCA for modest
to large number of generations?
Answer: Use a set of completely linked markers
With autosomes, unlinked markers assort each generation
leaving only a small amount of IBD information on each
marker, which we must then multiply together. IBD
information decays on the order of 1/2 each generation.
With completely linked marker loci, information
on IBD does not assort away via recombination.
IBD information decay is on the order of the
mutation rate.
Y chromosome microsatellite
mutation rates- I
Estimates of human mutation rate in microsatellites
are fairly consistent over both the Y and the autosomes
Estimate of u
Source
Reference
0.0028
Y chromosome
Kayser et al. 2000
0.0021
Y chromosome
Heyer et al. 1997
0.001 - 0.0021
Autosomal
chromosomes
Wong & Weber 1993
Brinkmann 1998
Y chromosome
• Since Y = male, sons get their Y
chromosome only from their father
• Hence, Y chromosomes follow shared
paternal lineages
• No recombination on the Y, hence all
markers passed along as a group
• Thus, only way for father and son to differ
in their Y genotype is via mutation
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Basic Structure of the Problem
What is the probability that the two marker
alleles at a haploid locus from two related
individuals agree given that their MRCA was
t generation ago?
Phrased another way, what is their probability
of identity in state (IBS), given a TMRCA of t
generations
Infinite Alleles Model
The first step in estimating TMRCA requires us to
assume a particular mutational model
Our (initial) assumption will be the infinite alleles
model (IAM)
The key assumption of this model (originally due
to Kimura and Crow, 1964) is that each new
mutation gives rise to a new allele.
Key: Under the infinite alleles, two alleles that are
identical in state that are also ibd have not
experienced any mutations since their MRCA.
Let q(t) = Probability two alleles with a MRCA
t generations ago are identical in state
If u = per generation mutation rate, then
q(t) = (1-u)2t
Pr(No mutation from MRCA->A) = (1-u)t
A
B
(1-u)t
(1-u) t
Pr(No mutation from MRCA->B) = (1-u)t
MRCA
Building the Likelihood Function for n Loci
For any single marker locus, the probability of IBS
given a TMRCA of t generations is
q(t) = (1-u)2t ≈ e-2ut = e-t, t = 2ut
The probability that k of n marker loci are IBS is just
n!
Pr(k) =
q(t ) k [1 ° q(t )]n °
(n ° k)! k!
k
a Binomial distribution with success parameter q(t)
Likelihood function for t given k of n matches
n!
L(t j n; k) =
e°
(n ° k)! k!
kø
°
(1 °
e
¢
° ø n° k
)
ML Analysis of TMRCA
It would seem that we now have all the pieces in
hand for a likelihood analysis of TMRCA given
the marker data (k of n matches)
Likelihood function (t = 2ut)
n!
L(t j n; k) =
e°
(n ° k)! k!
kø
°
(1 °
e
¢
° ø n° k
)
MLE for t is solution of ∂ L/∂t = 0
µ Ž
•n ¥
1
^
^
ø = 2tš = ln
= ln
= ° ln(p)
k
p
p = fraction of matches
In particular, the MLE for t becomes
•n ¥
1
1
b
t=
ln
= °
ln(p)
2š
k
2š
(
Likewise, the precision of this estimator follows
for the (negative) inverse of the 2nd derivative
of the log-likelihood function evaluated at the
MLE,
Var( t^ ) =° -
µ
@2
Ø Ž °- 1
µ
Ž
ln L(t j n; k) Ø
1 1 1° p
Ø
=
Ø
2
2 n
@t
4u
p
^
t= t
What about one-LOD support intervals (95% CI) ?
With n=k, the value of the likelihood function is
L(t) = (1-u)2tn ≈ e-2tun
L has a maximum value of one under the MLE
Hence, any value of t that gives a likelihood value of
0.1 or larger is in the one-LOD support interval
Solving, the one-LOD support interval is from t=0 to
t = (1/2n) [ -Ln(10)/Ln(1-u) ] ≈ (1/n) [ Ln(10)/(2u) ]
For u = 0.002, CI is (0, 575/n)
What about Hypothesis testing?
Again recall that for k =n that the likelihood at t = t0 is
L(t0) ≈ Exp(-2t0un)
Hence, the LR test statistic for Ho: t = t0 is just
LR = -2 ln [ L(t0)/ L(0) ]
= -2 ln [ Exp(-2t0un) / 1 ]
= 4t0un
Thus the probability for the test
that TMRCA = t0 is just Pr( c12 > 4t0un)
Trouble in Paradise
The ML machinery has seem to have done its job,
giving us an estimate, its approximate sampling
error, approximate confidence intervals, and a scheme
for hypothesis testing.
Hence, all seems well.
Problem: Look at k=n (= complete match at all markers).
MLE (TMRCA) = 0 (independent of n)
Var(MLE) = 0 (ouch!)
One-LOD support interval for t is (0, 575/n)
The problem(s) with ML
The expressions developed for the sampling variance,
approximate confidence intervals, and hypothesis
testing are all large-sample approximations
Problem 1: Here our sample size is the number of
markers scored in the two individuals. Not likely to
be large (typically 10-60).
Problem 2: These expressions are obtained by taking
appropriate limits of the likelihood function. If the
ML is exactly at the boundary of the admissible space
on the likelihood surface, this limit may not formally
exist, and hence the above approximations are incorrect.
The solution?
“Ain’t Too Proud to Bayes”
-- Brad Carlin
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Why go Bayesian?
Instead of simply estimating a point estimate (e.g., the
MLE), the goal is the estimate the entire distribution
for the unknown parameter q given the data x
p(q | x) = C * l(x | q) p(q)
Why Bayesian?
• Exact for any sample size
• Marginal posteriors
• Efficient use of any prior information
The Prior on TMRCA
The first step in any Bayesian analysis is choice of
an appropriate prior distribution p(t) -- our thoughts on
the distribution of TMRCA in the absence of any of
the marker data
Standard approach: Use a flat or uninformative prior,
with p(t) = a constant over the admissible range of the
parameter. Can cause problems if the likelihood function
is unbounded (integrates to infinity)
In our case, population-genetic theory provides the
prior: under very general settings, the time to MRCA
for a pair of individuals follows a geometric distribution
In particular, for a haploid gene, TMRCA follows
a geometric distribution with mean 1/Ne.
Hence, our prior is just
p(t) = l(1-l)t ≈ l e-lt, where l = 1/Ne
Hence, we can use an exponential prior with
hyperparameter (the parameter fully
The posterior
thus becomes
characterizing
the distribution)
l = 1/Ne. = 1/Ne
Prior hyperparameter
-k
p(t j k) / L(t j n; k) p(t) = exp[°- (2š k + •)t ] (1 °- exp[°- (2š t ) ]) n °
Previous likelihood
Prior function (ignoring constants
that cancel when we compute the normalizing factor
C)
The Normalizing constant
exp[ ° (2š k + •)t ] (1 ° exp[ ° (2š t) ])n °
p(t j k) =
I (š ; k; n; •)
k
where
Z
I (š ; k; n; •) =
1
exp[° (2š k + •)t ](1 ° exp[° (2š t ) ]) n °
k
dt
0
I ensures that the posterior distribution integrates
to one, and hence is formally a probability distribution
What is the effect of the
hyperparameter?
exp[ ° (2š k + •)t ] (1 ° exp[ ° (2š t) ])n °
p(t j k) =
I (š ; k; n; •)
k
If 2uk >> l, then essentially no dependence on the
actual value of l chosen.
Hence, if 2Neuk >> 1, essentially no dependence on
(hyperparameter) assumptions of the prior.
For a typical microsatellite rate of u = 0.002, this is just
Nek >> 250, which is a very weak assumption. For example,
with k =10 matches, Ne >> 25. Even with only 1 match (k=1),
just require Ne >> 250.
Closed-form Solutions for the
Posterior Distribution
Complete analytic solutions to the prior can be obtained
by using a series expansion (of the (1-ex)n term) to give
exp[ ° (2š k + •)t ](1 ° exp[ ° (2š t) ])n °
= exp[ °- (2š k + •)t ]
¦
nX
° k
i= 0
n °- k
X
=
i= 0
(°- 1) i
k
!
(n °- k)!
exp[ °- (2š ti )]
i!(n ° k ° i)!
(n ° k)!
(° 1)
exp[ °- (2š (k + i) + •)t ]
i!(n ° k °- i)!
i
Each term is just a * ebt, which is easily integrated
nX°- k
I (š ; k; n; •) =
i= 0
(n ° k)!
(° 1)
i!(n ° k ° i)!
i
n °- k
X
=
Z
(° 1) i
i= 0
n °- k
1
exp[°- (2š (k + i) + •)t ]dt
0
(n ° k)!
1
i!(n ° k ° i)! 2š (k + i) + •
2
(n ° k)! š n °- k
= Q n °- k
i = 0 [ •+ 2š (n ° i) ]
With the assumption of a flat prior, l = 0, this reduces to
(n ° k)!(k ° 1)!
I (š ; k; n; 0) =
( 2š ) n!
Hence, the complete analytic solution of the posterior is
¦ Q n °- k
p(t j k; •) =
!
n°
[
•+
2š
(n
°
i)
]
(1
°
exp[°
2š
t
])
i= 0
2n ° k (n °- k)!š n ° k
exp[ t(2š k + •) ]
Suppose k = n (no mismatches)
In this case, the prior is simply an exponential
distribution with mean 2un + l,
p(t j k = n) = (•+ 2nš ) exp[ ° (2š n + •)t ]
k
Analysis of n = k case (complete
match at n markers)
Mean TMRCA and its variance:
1
1
š t = æt =
'
•+ 2nš
2nš
Cumulative probability:
Z
P r(t • T ) =
T
p(t j k = n) dt = 1 °- exp (°- (2š n + •)T )
0
In particular, the time Ta satisfying P(t < Ta) = a is
° ln(1 °- Æ)
TÆ =
2š n + •
For a flat prior (l = 0), the 95% (one-side) confidence
interval is thus given by -ln(0.5)/(2nu) ≈ 1.50/(nu)
Hence, under a Bayesian analysis for u = 0.002, the
95% upper confidence interval is given by ≈ 749/n
Recall that the one-LOD support interval (approximate
95% CI) under an ML analysis is ≈ 575/n
The ML solution’s asymptotic approximation significantly
underestimates the true interval relative to the
exact analysis under a full Bayesian model
Sample Posteriors for u = 0.002
Posteriors
for n for
=for
==20
93Posteriors
Posteriors
n10=nn100
50
n = 92
100
0.08
0.30
0.060
0.25
0.050
0.06
2094
91
100
p( t | k0.040
)0.20
)
p(p(
t |tk| k
)0.04
0.15
0.030
90
99
19
98
97
18
0.10
0.020
96
17
0.02
89
16
95
15
0.05
0.010
0.00
0.000
0.00
0
25
01
10
3 10 4
20
515
30
6 20 7
825
40
50
60
9 3010 11
35 12 4013
Time
t to
MRCA
Time t to MRCA
Time
t to
MRCA
70
80
90
14
45 15 5016 17
55 18 6019
100
2065
Key points
• By using markers on a non recombining chromosomal
section, we can estimate TMRCA over much, much
longer time scales than with unlinked autosomal markers
• By using the appropriate number of markers we
can get accurate estimates for TMRCA for even
just a few generations. 20-50 markers will do.
• Hence, we have a fairly large window of resolution
for TMRCA when using a modest number of completely
linked markers.
Extensions. I: Different Mutation
Rates
Let marker locus k have mutation rate uk.
Code the observations as xk = 1 if a match,
otherwise code xk = 0
The posterior becomes:
"
p(t j x) / exp ° t
¦
! #
n
X
•+ 2
k= 1
š k xk
Yn £
k= 1
1° e
° 2t š i
§1 °
xk
Stepwise Mutation Model (SMM)
Microsatellite
are scored
by their number
The Infinite allelic
alleles variants
model (IAM)
is not especially
ofrealistic
repeat units.
Hence, two “matching”
alleles
can actually
with microsatellite
data, unless
the fraction
hide
multiple mutations
(and hence more time to the MRCA)
of matches
is very high.
Mutation 1
Mutation 2
Under IAM,
In reality,
score asthere
a match,
are two
and hence
mutations
no mutations
Y chromosome microsatellite
mutation rates- II
The SMM model is an attempt to correct for
multiple hits by fully accounting for the mutational
structure.
Good fit to array sizes in natural populations when
assuming the symmetric single-step model
• Equal probability of a one-step move up or down
In direct studies of (Y chromosome) microsatellites
35 of 37 dectected mutations in pedigrees were
single step, other 2 were two-step
Formally, the SMM model assumes the following
transition probabilities
š
P r( X (t + 1) = i ° 1 j X (t ) = i ) = P r( X (t + 1) = i + 1j X (t ) = i ) =
2
P r( X (t + 1) = i j X (t) = i ) = 1 °- š
P r( j X (t + 1) ° X (t) j • 2 j X (t ) = i ) = 0
Note that two alleles can match only if they have
experienced an even number of mutations in total between
them. In such cases, the match probabilities become
P r(mat ch j 2M moves ) =
1
22 M
µ
2M
M
Ž
=
1
22 M
(2M )!
(M !) 2
SMM0 model -- match/no
match under SMM
The simplest implementation of the SMM model is
to simply replace the match probabilities q(t) under
the IAM model with those for the SMM model.
This simply codes the marker loci as match / no match
We refer to this as the SMMO model
P r(mat ch j 2M moves ) =
X1
q(t ) =
M =0
1
X
=
M =0
1
22 M
µ
2M
M
Ž
=
1
22 M
(2M )!
(M !) 2
of
P r(match jNumber
2M moves
) PProb(Match)
r( 2M moves j t )
mutations
µ
1
22 M
= exp(°- 2tš )
Žµ
2 (2š t )2 M
4 (2M )!
62 M !
(š t)
(M8!) 2
10
(2M )!
(M !) 2
¦
X1
M=0
Ž
0.500
exp(°- 2tš )
0.375
0.313
0.273
0.246
X1 (x) 2k
Hence, = I ( 2x ) The zero-order modifed
0
2
Type I Bessel Function
(k!)
k
=0
q(ø) = exp(° ø) I 0 ( ø )
Under the SMM model, the prior hyperparameter
can now become important.
This is the case when the number n of markers is
small and/or k/n is not very close to one
Why? Under the prior, TMRCA is forced by a
geometric with 1/Ne. Under the IAM model for
most values this is still much more time than
the likelihood function predicts from the marker data
Under the SMM model, the likelihood alone predicts
a much longer value so that the forcing effect of the
initial geometric can come into play
n =5, k = 3, u = 0.002
IAM, both flat prior and Ne = 5000
Pr(TMRCA < t)
SSMO, Ne = 5000
SSMO, flat prior
Prior with Ne =5000
Time, t
An Exact Treatment: SMME
With a little work we can show that the probability
two sites differ by j steps is just
q( j ) (ø) = 2 exp(° ø) I j ( ø )
for j • 1
The resulting likelihood thus becomes
The jth-order modifed
Type I Bessel Function
k
i nj
Y h
n!
L( t j n 0 ; ¢¢¢; n k ) =
q( j ) (2š t )
n 0 ! n 1 ! ¢¢¢n k ! j
=0
Where nj is the number of sites that differ
by k (observed) steps
With this likelihood, the resulting prior becomes
Yk h
p( t j n 0 ; ¢¢¢; n k ) /
(j )
q
(2š t)
i nj
- •t
e°
j=0
This rearranges to give the general posterior
under the Exact SMM model (SMME) as
p( t j n 0 ; ¢¢¢; n k ) = R1
0
e
° ( •+ 2 š n ) t
Qk
[I j ( 2š t ) ]n j
j= 0
Qk
°
•
š
n
t
(
+
2
)
e
j= 0
ofmatches
k steps differences
NumberNumber
of exact
[ Ij ( 2š t ) ]n j dt
Example
Consider comparing haplotypes 1 and 3 from Thomas
et al.’s (2000) study on the Lemba and Cohen Y
chromosome modal haplotypes. Here six markers used,
four exactly match, one differs by one repeat, the other
by two repeats
Hence, n = 6, k = 4 for IAM and SMM0 models
n0 = 4, n1 = 1, n2 = 1, n = 6 under SMME model
Assume Hammer’s value of Ne=5000 for the prior
TMRCA for Lemba and Cohen Y
P(t | markers)
IAM
Model used
Mean
Medium
2.5%
97.5%
IAM
152.3
135.4
31.1
371
SMM0
454.7
233.7
40.4
2389
SMME
422.3
286.2
65.1
1631
SMM0
SMME
Time to MRCA, t
IAM
Pr(TMRCA < t)
SMM0
SMME
Time to MRCA, t
Summary
• Sets of non-recombining markers are
optimal for estimation of TMRCA
– Y (paternal lineage), mtDNA (maternal)
• Simple bayesian estimators are easily
developed and can be extended to allow for
more realistic mutation models.
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
MRCA-I vs. MRCA-G
We need to distinguish between the MRCA for a pair
of individuals (MRCA-I) and the MRCA for a particular
genetic marker G (MRCA-G).
MRCA-G varies between any two individuals over
recombination units.
For example, we could easily have for a pair of relatives
MRCA (mtDNA ) = 180 generations
MRCA (Y ) = 350 generations
MRCA (one a-globulin allele ) = 90 generations
MRCA (other a -globulin allele ) = 400 generations
MRCA-G > MRCA-I
MRCA-G(
)
MRCA-G( )
MRCA-I
lost
Why the difference?
Pr(TMRCA < t)
n = 20, t0.95 = 38
n = 10, t0.95 = 75
Under a Bayesian analysis, we look
ML, probability
we plot the
at theUnder
posterior
likelihood
functionadjusted
and look
distribution
(likelihood
for theto0.1
value
to integrate
one)
and find the
values that give an area of 0.95
n = 20, area to
left of t=38 = 0.95
t
n = 10, area to
left of t=75 = 0.95