Talk - University of Arizona

Download Report

Transcript Talk - University of Arizona

Bayesian Estimators of
Time to Most Recent
Common Ancestry
Bruce Walsh
[email protected]
Ecology and Evolutionary Biology
Adjunct Appointments
Molecular and Cellular Biology
Plant Sciences
Epidemiology & Biostatistics
Animal Sciences
Definitions
MRCA - Most Recent Common Ancestor
TMRCA - Time to Most Recent Common Ancestor
Question: Given molecular marker information from
a pair of individuals, what is the estimated time back
to their most recent common ancestor?
With even a small number of highly polymorphic
autosomal markers, trivial to assess zero (subject/
biological sample) and one (parent-offspring) MRCA
Problems with Autosomal Markers
Often we are very interested in MRCAs that are modest
(5-10 generations) or large (100’s to 10,000’s of
generations)
Unlinked autosomal markers simply don’t work over these
time scales.
Reason: IBD probabilities for individuals sharing a MRCA
5 or more generations ago are extremely small and hence
very hard to estimate (need VERY large number of
markers).
MRCA-I vs. MRCA-G
We need to distinguish between the MRCA for a pair
of individuals (MRCA-I) and the MRCA for a particular
genetic marker G (MRCA-G).
MRCA-G varies between any two individuals over
recombination units.
For example, we could easily have for a pair of relatives
MRCA (mtDNA ) = 180 generations
MRCA (Y ) = 350 generations
MRCA (one a-globulin allele ) = 90 generations
MRCA (other a -globulin allele ) = 400 generations
MRCA-G > MRCA-I
MRCA-G(
)
MRCA-G( )
MRCA-I
lost
mtDNA and Y Chromosomes
So how can we accurately estimate TMRCA for modest
to large number of generations?
Answer: Use a set of completely linked markers
With autosomes, unlinked markers assort each generation
leaving only a small amount of IBD information on each
marker, which we must then multiply together. IBD
information decays on the order of 1/2 each generation.
With completely linked marker loci, information
on IBD does not assort away via recombination.
IBD information decay is on the order of the
mutation rate.
Y chromosome microsatellite
mutation rates- I
Estimates of human mutation rate in microsatellites
are fairly consistent over both the Y and the autosomes
Estimate of u
Source
Reference
0.0028
Y chromosome
Kayser et al. 2000
0.0021
Y chromosome
Heyer et al. 1997
0.001 - 0.0021
Autosomal
chromosomes
Wong & Weber 1993
Brinkmann 1998
Basic Structure of the Problem
What is the probability that the two marker
alleles at a haploid locus from two related
individuals agree given that their MRCA was
t generation ago?
Phrased another way, what is their probability
of identity in state (IBS), given they are identical
by descent (IBD) when their TMRCA is t
generations
Infinite Alleles Model
The first step in answering this question is to
assume a particular mutational model
Our (initial) assumption will be the infinite alleles
model (IAM)
The key assumption of this model (originally due
to Kimura and Crow, 1964) is that each new mutation
gives rise to a new allele.
The IAM was the first population-genetics model to
attempt to formally incorporate the structure of DNA
into a model
Key: Under the infinite alleles, two alleles that are
identical in state that are also ibd have not
experienced any mutations since their MRCA.
Let q(t) = Probability two alleles with a MRCA
t generations ago are identical in state
If u = per generation mutation rate, then
q(t) = (1-u)2t
Pr(No mutation from MRCA->A) = (1-u)t
A
B
(1-u)t
(1-u) t
Pr(No mutation from MRCA->B) = (1-u)t
MRCA
Building the Likelihood Function for n Loci
For any single marker locus, the probability of IBS
given a TMRCA of t generations is
q(t) = (1-u)2t ≈ e-2ut = e-t, t = 2ut
The probability that k of n marker loci are IBS is just
n!
k
n °- k
Pr(k) =
q(t ) [1 ° q(t )]
(n ° k)! k!
a Binomial distribution with success parameter q(t)
Likelihood function for t given k of n matches
°
¢n °- k
n!
°- kø
°
ø
L(t j n; k) =
e
1 °- e )
(
(n °- k)! k!
ML Analysis of TMRCA
It would seem that we now have all the pieces in
hand for a likelihood analysis of TMRCA given
the marker data (k of n matches)
Likelihood function (t = 2ut)
°
¢n °- k
n!
°- kø
°
ø
L(t j n; k) =
e
1 °- e )
(
(n °- k)! k!
MLE for t is solution of ∂ L/∂t = 0
µ Ž
•n ¥
1
^
^
ø = 2tš = ln
= ln
= °- ln(p)
k
p
() ()
p = fraction of matches
In particular, the MLE for t becomes
•n ¥
1
1
b
t=
ln
= °
ln(p)
2š
k
2š
()
Likewise, the precision of this estimator follows
for the (negative) inverse of the 2nd derivative
of the log-likelihood function evaluated at the
MLE,
Var( t^ ) =° -
µ
(
@2
Ø Ž °- 1
µ
Ž
- p
ln L(t j n; k) Ø
1
1
1
°
Ø
=
Ø
2
2 n
@t
4u
p
^
t= t
)
Likewise, we can (numerically) easily find 1-LOD
support intervals for t and hence construct
approximate 95% confidence intervals to TMRCA
Finally, hypothesis testing, say Ho: MRCA = t0, is
easily accomplished by comparing -2* the natural
log of the ratio of the value of the likelihood function
at t = t0 over the value of the likelihood function at the
MLE t = t
^
The resulting log likelihood ratio LR is (asymptotically)
distributed as a chi-square distribution with one degree of
freedom
Trouble in Paradise
The ML machinery has seem to have done its job,
giving us an estimate, its approximate sampling
error, approximate confidence intervals, and a scheme
for hypothesis testing.
Hence, all seems well.
Problem: Look at k=n (= complete match at all markers).
MLE (TMRCA) = 0 (independent of n)
Var(MLE) = 0 (ouch!)
What about one-LOD support intervals (95% CI) ?
With n=k, the value of the likelihood function is
L(t) = (1-u)2tn ≈ e-2tun
L has a maximum value of one under the MLE
Hence, any value of t that gives a likelihood value of
0.1 or larger is in the one-LOD support interval
Solving, the one-LOD support interval is from t=0 to
t = (1/2n) [ -Ln(10)/Ln(1-u) ] ≈ (1/n) [ Ln(10)/(2u) ]
For u = 0.002, CI is (0, 575/n)
With n=k, likelihood function reduces to
L(t) = (1-u)2tn ≈ e-2tun
MLE(t) = 0 for all values on n
1 LOD1≈LOD
t = 29
≈ t = 58
n=5
1 LOD
= 115value (1) of
0.1≈oft max
n=10
L(t)
likelihood function
n=20
t
(Plots for u = 0.002)
What about Hypothesis testing?
Again recall that for k =n that the likelihood at t = t0 is
L(t0) ≈ Exp(-2t0un)
Hence, the LR test statistic for Ho: t = t0 is just
LR = -2 ln [ L(t0)/ L(0) ]
= -2 ln [ Exp(-2t0un) / 1 ]
= 4t0un
Thus the probability for the test
that TMRCA = t0 is just Pr( c12 > 4t0un)
The problem(s) with ML
The expressions developed for the sampling variance,
approximate confidence intervals, and hypothesis
testing are all large-sample approximations
Problem 1: Here our sample size is the number of
markers scored in the two individuals. Not likely to
be large.
Problem 2: These expressions are obtained by taking
appropriate limits of the likelihood function. If the
ML is exactly at the boundary of the admissible space
on the likelihood surface, this limit may not formally
exist, and hence the above approximations are incorrect.
The solution?
“Ain’t Too Proud to Bayes”
-- Brad Carlin
Why Go Bayesian
An
extension
of likelihood
is Bayesian
statistics(e.g., the
Instead
of simply
estimating
a point estimate
MLE), the goal is the estimate the entire distribution
for the unknown parameter q given the data x
p(q | x) = C * l(x | q) p(q)
posterior distribution
of
Likelihood
function
prior distribution
for
q the for
q
Why Bayesian?
The
appropriate
constant
so
that
posterior
q given x
Given
the data x
integrates
to one.
• Exact for
any sample
size
• Marginal posteriors
• Efficient use of any prior information
• MCMC (such as Gibbs sampling) methods
The Prior on TMRCA
The first step in any Bayesian analysis is choice of
an appropriate prior distribution p(t) -- our thoughts on
the distribution of TMRCA in the absence of any of
the marker data
Standard approach: Use a flat or uninformative prior,
with p(t) = a constant over the admissible range of the
parameter. Can cause problems if the likelihood function
is unbounded (integrates to infinity)
In our case, population-genetic theory provides the
prior: under very general settings, the time to MRCA
for a pair of individuals follows a geometric distribution
In particular, for a haploid gene, TMRCA follows
a geometric distribution with mean 1/Ne.
Hence, our prior is just
p(t) = l(1-l)t ≈ l e-lt, where l = 1/Ne
Hence, we can use an exponential prior with
hyperparameter (the parameter fully
The posterior
thus becomes
characterizing
the distribution)
l = 1/Ne. = 1/Ne
Prior hyperparameter
-k
p(t j k) / L(t j n; k) p(t) = exp[°- (2š k + •)t ] (1 °- exp[°- (2š t ) ]) n °
Previous likelihood
Prior function (ignoring constants
that cancel when we compute the normalizing factor
C)
The Normalizing constant
exp[ °- (2š k + •)t ] (1 °- exp[ °- (2š t) ])n °- k
p(t j k) =
I (š ; k; n; •)
where
Z
I (š ; k; n; •) =
1
-k
exp[°- (2š k + •)t ](1 °- exp[°- (2š t ) ]) n °
dt
0
I ensures that the posterior distribution integrates
to one, and hence is formally a probability distribution
What is the effect of the
hyperparameter?
exp[ °- (2š k + •)t ] (1 °- exp[ °- (2š t) ])n °- k
p(t j k) =
I (š ; k; n; •)
If 2uk >> l, then essentially no dependence on the
actual value of l chosen.
Hence, if 2Neuk >> 1, essentially no dependence on
(hyperparameter) assumptions of the prior.
For a typical microsatellite rate of u = 0.002, this is just
Nek >> 250, which is a very weak assumption. For example,
with k =10 matches, Ne >> 25. Even with only 1 match (k=1),
just require Ne >> 250.
Closed-form Solutions for the
Posterior Distribution
Complete analytic solutions to the prior can be obtained
by using a series expansion (of the (1-ex)n term) to give
n °- k
exp[ °- (2š k + •)t ](1 °- exp[ °- (2š t) ])
= exp[ °- (2š k + •)t ]
n °- k
X
=
i= 0
(°- 1) i
¦
nX
°- k
(
i= 0
(°- 1) i
!
(n °- k)!
exp[ °- (2š ti )]
i!(n ° k ° i)!
(n °- k)!
exp[ °- (2š (k + i) + •)t ]
i!(n °- k °- i)!
Each term is just a * ebt, which is easily integrated
nX°- k
I (š ; k; n; •) =
i= 0
n °- k
X
=
(n °- k)!
(°- 1)
i!(n °- k °- i)!
Z
i
1
exp[°- (2š (k + i) + •)t ]dt
0
(n °- k)!
1
(°- 1)
i!(n °- k °- i)! 2š (k + i) + •
i
i= 0
n °- k
2
(n °- k)! š n °- k
= Q n °- k
i = 0 [ •+ 2š (n °- i) ]
With the assumption of a flat prior, l = 0, this reduces to
(n °- k)!(k °- 1)!
I (š ; k; n; 0) =
( 2š ) n!
Hence, the complete analytic solution of the posterior is
¦ Q n °- k
p(t j k; •) =
(
!
n °- k
[
•+
2š
(n
°
i)
]
(1
°
exp[°
2š
t
])
i= 0
2n °- k (n °- k)!š n °- k
exp[ t(2š k + •) ]
Suppose k = n (no mismatches)
In this case, the prior is simply an exponential
distribution with mean 2un + l,
p(t j k = n) = (•+ 2nš ) exp[ °- (2š n + •)t ]
Analysis of n = k case
Mean TMRCA and its variance:
1
1
š t = æt =
'
•+ 2nš
2nš
Cumulative probability:
< T) =
P r(t •
Z
T
p(t j k = n) dt = 1 °- exp (°- (2š n + •)T )
0
In particular, the time Ta satisfying P(t < Ta) = a is
TÆ =
-° ln(1 °- Æ)
2š n + •
For a flat prior (l = 0), the 95% (one-side) credible
interval is thus given by -ln(0.5)/(2nu) ≈ 1.50/(nu)
Hence, under a Bayesian analysis for u = 0.02, the
95% upper credible interval is given by ≈ 749/n
Recall that the one-LOD support interval (approximate
95% CI) under an ML analysis is ≈ 575/n
The ML solution’s asymptotic approximation significantly
underestimates the true interval relative to the
exact analysis under a full Bayesian model
Why the difference?
Pr(TMRCA < t)
n = 20, t0.95 = 38
n = 10, t0.95 = 75
Under a Bayesian analysis, we look
ML, probability
we plot the
at theUnder
posterior
likelihood
functionadjusted
and look
distribution
(likelihood
for theto0.1
value
to integrate
one)
and find the
values that give an area of 0.95
n = 20, area to
left of t=38 = 0.95
t
n = 10, area to
left of t=75 = 0.95
Sample Posteriors for u = 0.002
Posteriors
for n for
=for
==20
93Posteriors
Posteriors
n10=nn100
50
n = 92
100
0.08
0.30
0.060
0.25
0.050
0.06
2094
91
100
p( t | k0.040
)0.20
)
p(p(
t |tk| k
)0.04
0.15
0.030
90
99
19
98
97
18
0.10
0.020
96
17
0.02
89
16
95
15
0.05
0.010
0.00
0.000
0.00
0
25
01
10
3 10 4
20
515
30
6 20 7
825
40
50
60
9 3010 11
35 12 4013
Time
t to
MRCA
Time t to MRCA
Time
t to
MRCA
70
80
90
14
45 15 5016 17
55 18 6019
100
2065
Key points
• By using markers on a non-recombining chromosomal
section, we can estimate TMRCA over much, much
longer time scales than with unlinked autosomal markers
• By using the appropriate number of markers we
can get accurate estimates for TMRCA for even
just a few generations. 20-50 markers will do.
• Hence, we have a fairly large window of resolution
for TMRCA when using a modest number of completely
linked markers.
Extensions I: Different Mutation
Rates
Let marker locus k have mutation rate uk.
Code the observations as xk = 1 if a match,
otherwise code xk = 0
The posterior becomes:
"
p(t j x) / exp °- t
¦
(
! #
n
X
•+ 2
k= 1
š k xk
Yn £
[1 °- e
k= 1
°- 2t š i
§1 °- x k
]
Stepwise Mutation Model (SMM)
Microsatellite
are scored
by their number
The Infinite allelic
alleles variants
model (IAM)
is not especially
ofrealistic
repeat units.
Hence, two “matching”
alleles
can actually
with microsatellite
data, unless
the fraction
hide
multiple mutations
(and hence more time to the MRCA)
of matches
is very high.
Mutation 1
Mutation 2
Under IAM,
In reality,
score asthere
a match,
are two
and hence
mutations
no mutations
Y chromosome microsatellite
mutation rates- II
The SMM model is an attempt to correct for
multiple hits by fully accounting for the mutational
structure.
Good fit to array sizes in natural populations when
assuming the symmetric single-step model
• Equal probability of a one-step move up or down
In direct studies of (Y chromosome) microsatellites
35 of 37 dectected mutations in pedigrees were
single step, other 2 were two-step
SMM0 model -- match/no
match under SMM
The simplest implementation of the SMM model is
to simply replace the match probabilities q(t) under
the IAM model with those for the SMM model.
This simply codes the marker loci as match / no match
We refer to this as the SMMO model
Formally, the SMM model assumes the following
transition probabilities
š
P r( X (t + 1) = i ° 1 j X (t ) = i ) = P r( X (t + 1) = i + 1j X (t ) = i ) =
2
P r( X (t + 1) = i j X (t) = i ) = 1 °- š
P r( j X (t + 1) °- X (t) j >• 2 j X (t ) = i ) = 0
Note that two alleles can match only if they have
experienced an even number of mutations in total between
them. In such cases, the match probabilities become
P r(mat ch j 2M moves ) =
1
22 M
µ
Ž
()
2M
M
=
1
22 M
(2M )!
(M !) 2
P r(mat ch j 2M moves ) =
X1
q(t ) =
M =0
1
X
=
M =0
1
22 M
µ
Ž
()
2M
M
=
1
22 M
(2M )!
(M !) 2
of
P r(match jNumber
2M moves
) PProb(Match)
r( 2M moves j t )
mutations
µ
1
22 M
= exp(°- 2tš )
Žµ
2 (2š t )2 M
4 (2M )!
62 M !
(š t)
(M8!) 2
10
(2M )!
(M !) 2
¦
(
X1
M=0
Ž
0.500
exp(°- 2tš )
0.375
0.313
0.273
0.246
X1 (x) 2k
Hence, = I ( 2x ) The zero-order modifed
0
2
Type I Bessel Function
(k!)
k
=0
q(ø) = exp(°- ø) I 0 ( ø )
Under the SMM model, the prior hyperparameter
can now become important.
This is the case when the number n of markers is
small and/or k/n is not very close to one
Why? Under the prior, TMRCA is forced by a
geometric with 1/Ne. Under the IAM model for
most values this is still much more time than
the likelihood function predicts from the marker data
Under the SMM model, the likelihood alone predicts
a much longer value so that the forcing effect of the
initial geometric can come into play
n =5, k = 3, u = 0.02
IAM, both flat prior and Ne = 5000
Pr(TMRCA < t)
SSMO, Ne = 5000
SSMO, flat prior
Prior with Ne =5000
Time, t
An Exact Treatment: SMME
With a little work we can show that the probability
two sites differ by j steps is just
q( j ) (ø) = 2 exp(°- ø) I j ( ø )
>1
for j •
The resulting likelihood thus becomes
The jth-order modifed
Type I Bessel Function
k h
i nj
Y
n!
… nk ) =
( j ) (2š t )
L( t j n 0 ; ¢¢¢;
q
… k!
n 0 ! n 1 ! ¢¢¢n
j= 0
Where nj is the number of sites that differ
by k (observed) steps
With this likelihood, the resulting posterior becomes
… nk ) /
p( t j n 0 ; ¢¢¢;
Yk h
(j )
q
(2š t)
i nj
- •t
e°
j=0
This rearranges to give the general posterior
under the Exact SMM model (SMME) as
… nk ) = R
p( t j n 0 ; ¢¢¢;
1
0
e
°-( •+ 2 š n ) t
Qk
[I j ( 2š t ) ]n j
j= 0
Qk
°
•
š
n
t
(
+
2
)
e
j= 0
ofmatches
k steps differences
NumberNumber
of exact
[ Ij ( 2š t ) ]n j dt
Example
Consider comparing haplotypes 1 and 3 from Thomas
et al.’s (2000) study on the Lemba and Cohen Y
chromosome modal haplotypes. Here six markers used,
four exactly match, one differs by one repeat, the other
by two repeats
Hence, n = 6, k = 4 for IAM and SMM0 models
n0 = 4, n1 = 1, n2 = 1, n = 6 under SMME model
Assume Hammer’s value of Ne=5000 for the prior
TMRCA for Lemba and Cohen Y
P(t | markers)
IAM
Model used
Mean
Medium
2.5%
97.5%
IAM
152.3
135.4
31.1
371
SMM0
454.7
233.7
40.4
2389
SMME
422.3
286.2
65.1
1631
SMM0
SMME
Time to MRCA, t
IAM
Pr(TMRCA < t)
SMM0
SMME
Time to MRCA, t
Technology Transfer
The expressions developed above have direct
commercial applications
Family Tree DNA (ftDNA) -- provides Y
chromosome marker kits for genealogical studies
So far, ftDNA has processed over 80,000 such
kits
This amounts to a rough gross of around 8 million
dollars.
Forensic applications of the Y
• A not uncommon situation is the only DNA
is from fingernail scrappings.
• The result is a mixture wherein the
victim's DNA often overwhelms the DNA
of the perpetrator.
• Result: only modest match probability as
many autosomal markers cannot be
detected
• One solution: use Y chromosome markers.
Easily amplified over (female) background
Problem: How do we combine Y
match with autosomal match?
NRC 1996 recommendations (autosomal loci)
Product rule across markers
Population substructure correction within markers
Prob(Y match)*Prob(autosomal match)
Problem: Y markers may provide information
about population substructure membership
For example, a particular haplotype may be restricted
to a certain subpopulation, e.g., Native Americans
Correcting for Y substructure
Let y denote the observed Y haplotype
A the multilocus autosomal marker genotype
P(y,A) = P(A | y)*P(y)
Simple approach: knowledge of y indicates membership
in a particular subpopulation, P(A) computed using
allele frequencies for that subpopulation.
Suggestion: Multiply freq(y)* max Freq(A over subgroups)
A more precise accounting
Suppose two individuals share the same y haplotype.
What is there average coancestry, q?
Balding and Nichols give expressions for autosomal
single-locus genotype frequencies given that the
population shows structure with coancestry q.
Second approach: Compute q from haplotype matching.
Using this q value in Balding - Nichols expressions
to compute (single-locus) autosomal frequencies.
Posterior Distribution for a match at all
n markers with a prior of l = 1/Ne
2n ø
ø° 1
(1
°
u)
¢(1
°
•)
> k) = P
P (tjt •
1
2n ø ¢(1 ° •) ø °
(1
°
u)
ø= k
1
With a MRCA of t generations, q = (1/2)2t+1
> k] =
E [µj t •
X1 µ
t= k
1
22t + 1
Ž
> k)
P (tjt •
Typical situation is where we can exclude
father-son and paternal half-sibs, k > 2
E [µj t
>
•
2] '
•+ 2nš
nš
24 ° 8(•+
+ 2nš )
Typical values, n = 11, m = 1/500
•
For l = 1/5000, E [ q ] = 0.00186
•
For l = 1/500, E [ q ] = 0.00194
•
For l = 1/50, E [ q ] = 0.00272
For these values, unless pi < 0.01, Balding-Nichols
expression are essentially HW.
Formal procedure
Estimate P(y) from a database (counting methods,
Bayesian estimators)
Compute mutlilocus autosomal frequencies by
each major ethnic group using the product of
the single-locus genotypes computed using
group-specific allele frequencies and q = 0.002
correction.
Conservative P(y,A) = P(y)*max P(A)