Talk - University of Arizona

Download Report

Transcript Talk - University of Arizona

Bayesian Estimators of Time to Most Recent Common Ancestry

Bruce Walsh [email protected]

Ecology and Evolutionary Biology

Adjunct Appointments

Molecular and Cellular Biology Plant Sciences Epidemiology & Biostatistics Animal Sciences

Problem: Given some genetic marker information for a pair of individuals, what can we say about how recently they are related?

Application in Genealogy -- how closely are (say) Bill Walsh (former football coach) and I related?

Basic concept: The more closely related, the more genetic markers in common we should share For forensic applications: 11 highly polymorphic Automosomal marker loci sufficient for individualization Measure relatedness by TMRCA , the time to the most Recent common ancestor ( MRCA )

Often we are very interested in TMRCAs that are modest (5-10 generations) or large (100’s to 10,000’s of generations) Unlinked autosomal markers simply don’t work over these time scales.

Reason: With autosomes, unlinked markers assort each generation leaving only a small amount of IBD information on each marker, which we must then multiply together. IBD information decays on the order of 1/2 each generation. To extract the signal for 5-10 generations would require a very large number of automosomal markers (order of 100’s)

Y Chromosomes

How then can we accurately estimate TMRCA for modest to large number of generations?

Answer: Use a set of completely linked markers, such as those on the nonrecombiming arm of the Y chromosome.

With completely linked marker loci, information on IBD does not assort away via recombination. IBD information decay is on the order of the mutation rate, roughly 1/250 - 1/500 per generation .

Infinite Alleles Model The first step in estimating TMRCA requires us to assume a particular mutational model

Our (initial) assumption will be the infinite alleles model ( IAM ) The key assumption of this model (originally due to Kimura and Crow, 1964) is that each new mutation gives rise to a new allele.

Key: Under the infinite alleles, two alleles that are identical in state that are also ibd have not experienced any mutations since their MRCA. Let q(t) = Probability two alleles with a MRCA t generations ago are identical in state If u = per generation mutation rate, then A q(t) = (1-u) 2t (1-u) t Pr(No mutation from MRCA->A) = (1-u) t Pr(No mutation from MRCA->B) = (1-u) t B (1-u) t

Building the Likelihood Function for n Loci

For any single marker locus, the probability of IBS given a TMRCA of t generations is

q(t) = (1-u)

2t

≈ e

-2ut

= e

t

,

t

= 2ut

The probability that k of n marker loci are IBS is just

P r(k) =

-

n!

(n ° k)! k!

q(t )

k

[1 ° q(t )]

n ° k a Binomial distribution with success parameter q(t) Likelihood function for t given k of n matches

L(t j n ; k) = n!

e

° k ø

°

(

1 ° e

° ø

¢

) n ° k

ML Analysis of TMRCA

It would seem that we now have all the pieces in hand for a likelihood analysis of TMRCA given the marker data (k of n matches) Likelihood function ( t = 2ut)

L(t j n ; k) = n!

e

° k ø

°

(

1 ° e

° ø

¢

) n ° k MLE for t is solution of ∂ L/∂t = 0

Ž •n ¥ µ 1

( ) ( )

p = fraction of matches

In particular, the MLE for t becomes 1 2š ln ( •n k ¥ ) = ° 2š 1 ln(p) Likewise, the precision of this estimator follows for the (negative) inverse of the 2nd derivative of the log-likelihood function evaluated at the MLE,

°

-

µ

(

@

2

ln L(t j n ; k) @t

2

Ø Ø Ø Ø

t = t

Ž

) ° 1

= 1 4u

2

1 n µ p Ž

Trouble in Paradise

The ML machinery has seem to have done its job, giving us an estimate, its approximate sampling error, approximate confidence intervals, and a scheme for hypothesis testing.

Hence, all seems well.

Problem : Look at k=n (= complete match at all markers).

MLE (TMRCA) = 0 (independent of n) Var(MLE) = 0 (ouch!) The one-LOD support interval is from t=0 to t = (1/2n) [ -Ln(10)/Ln(1-u) ] = (0, 575/n)

With n=k, likelihood function reduces to

L(t) = (1-u)

2tn

≈ e

-2tun L(t) n=20 n=5 n=10 MLE(t) = 0 for all values on n 1 LOD ≈ ≈ t = 58 ≈ t = 115 likelihood function t (Plots for u = 0.002)

The problem(s) with ML

The expressions developed for the sampling variance, approximate confidence intervals, and hypothesis testing are all large-sample approximations Problem 1: Here our sample size is the

number of markers

scored in the two individuals. Not likely to be large (typically 10-30).

Problem 2: These expressions are obtained by taking appropriate limits of the likelihood function. If the ML is

exactly

at the boundary of the admissible space on the likelihood surface, this limit may not formally exist , and hence the above approximations are incorrect .

The Solution? Go Bayesian

Instead of simply estimating a point estimate (e.g., the MLE), the goal is the estimate the entire distribution for the unknown parameter q given the data x p( q | x) = C * l(x | q ) p( q ) • Exact for any sample size of q q • Marginal posteriors • Efficient use of any prior information

The Prior on TMRCA

The first step in any Bayesian analysis is choice of an appropriate prior distribution the marker data p(t) -- our thoughts on the distribution of TMRCA in the absence of any of Standard approach : Use a flat or uninformative prior, with p(t) = a constant over the admissible range of the parameter. Can cause problems if the likelihood function is unbounded (integrates to infinity) In our case, population-genetic theory provides the prior : under very general settings, the time to MRCA for a pair of individuals follows a geometric distribution

In particular, for a haploid gene, TMRCA follows a geometric distribution with mean 1/N e . Hence, our prior is just p(t) = l (1 l ) t

l e l t , where l = 1/N e Hence, we can use an hyperparameter The posterior thus becomes exponential prior (the parameter fully characterizing the distribution) l = 1/N with e .

e n ° k Previous likelihood function (ignoring constants that cancel when we compute the normalizing factor C)

The Normalizing constant

p(t j k) = exp [ ° (2š k + •)t ] (1 ° exp[ ° (2š t) ]) n ° k I (š ; k; n ; •) where I (š ; k; n ; •) = Z 1 0 n ° k dt I ensures that the posterior distribution integrates to one, and hence is formally a probability distribution

What is the effect of the hyperparameter?

exp [ ° (2š k + •)t ] (1 ° exp[ ° (2š t) ]) n ° k p(t j k) = I (š ; k; n ; •) If 2uk >> l , then essentially no dependence on the actual value of l chosen.

Hence, if 2N e uk >> 1 , essentially no dependence on (hyperparameter) assumptions of the prior.

For a typical microsatellite rate of u = 0.002, this is just N e k >> 250 , which is a very weak assumption. For example, with k =10 matches, N e >> 25. Even with only 1 match (k=1), just require N e >> 250.

Closed-form Solutions for the Posterior Distribution

Complete analytic solutions to the prior can be obtained by using a series expansion (of the (1-e x ) n term) to give

exp [ ° (2š k + •)t ](1 ° exp[ ° (2š t) ])

n ° k = ¦ ( X i = 0 i - !

= X i (n ° k)!

i = 0 Each term is just a * e bt , which is easily integrated

I (š ; k; n ; •) = = = X i - i = 0 X i = 0 Q 2 n ° k i - i = 0 Z 0 1 1 2š (k + i) + • With the assumption of a flat prior, l = 0, this reduces to

I (š ; k; n; 0) = ( 2š ) n!

Hence, the complete analytic solution of the posterior is

p(t j k; •) =

(

¦ Q 2

i = 0 n ° k Suppose k = n (no mismatches)

!

(1 ° exp [° 2š t ])

n ° k

exp[ t(2š k + •) ]

In this case, the prior is simply an exponential distribution with mean 2un + l ,

Analysis of n = k case

Mean TMRCA and its variance:

š

t

= æ

t

= 1 •+ 2n š '

Cumulative probability: Z T

1 2n š

0 In particular, the time T a satisfying P(t < T a ) = a is T Æ = 2š n + •

For a flat prior ( l = 0), the 95% (one-side) confidence interval is thus given by -ln(0.5)/(2nu)

≈ 1.50/(nu)

Hence, under a Bayesian analysis for u = 0.002, the 95% upper confidence interval is given by

749/n Recall that the one-LOD support interval (approximate 95% CI) under an ML analysis is

575/n The ML solution’s asymptotic approximation significantly

underestimates

the true interval relative to the exact analysis under a full Bayesian model

Sample Posteriors for u = 0.002

0.08

0.06

p( t | k )

p( t | k ) 0.04

100 20 94 99 93

Posteriors for n = 20

92 91 90 16 95 89 0.02

15 0.00

0 1 0 2 5 3 10 10 4 15 6 20 30 10 35 12 60 40 13 Time t to MRCA 70 15 50 16 80 18 90 65

Key points

• By using markers on a non recombining chromosomal section, we can estimate TMRCA over much, much longer time scales than with unlinked autosomal markers • By using the appropriate number of markers we can get accurate estimates for TMRCA for even just a few generations. 20-50 markers will do.

• Hence, we have a fairly large window of resolution for TMRCA when using a modest number of completely linked markers.

Stepwise Mutation Model (SMM)

The Infinite alleles model (IAM) is not especially realistic with microsatellite data, unless the fraction of matches is very high.

Mutation 1 Mutation 2

Formally, the SMM model assumes the following transition probabilities š 2 Note that two alleles can match only if they have experienced an even number of mutations in total between them. In such cases, the match probabilities become P r(mat ch j 2M moves ) = 1 2 2 M µ 2M ( ) Ž = 1 2 2 M (2M )!

(M !) 2

P r(mat ch j 2M moves ) = 1 2 2 M µ 2M ( ) Ž = 1 2 2 M (2M )!

(M !) 2 q(t ) = = k Hence, 2k (k!) 2 = 0 M = 0 M = 0 P r(match j 2M moves ) P r( 2M moves j t ) mutations Prob(Match) µ 1 2 2 M ¦ ( (2M )!

(M !) 2 (2š t ) 2 M 4 6 (š t) 2 M (2M )!

!

M = 0 (M !) 2 10 0.313

0.273

0.246

= I 0 ( 2x ) The zero-order modifed Type I Bessel Function

q(ø) = exp(° ø) I

0

( ø )

Under the SMM model, the prior hyperparameter can now become important .

This is the case when the number small n of markers is and/or k/n is not very close to one Why? Under the prior, TMRCA is forced by a geometric with 1/Ne. Under the IAM model for most values this is still much more time than the likelihood function predicts from the marker data Under the SMM model, the likelihood alone predicts a much longer value so that the forcing effect of the initial geometric can come into play

n =5, k = 3, u = 0.02

IAM, both flat prior and Ne = 5000 SSMO, N e = 5000 SSMO, flat prior Time, t Prior with N e =5000

An Exact Treatment: SMME

With a little work we can show that the probability two sites differ by j steps is just

q

( j )

(ø) = 2 exp(° ø) I

j

( ø )

The resulting likelihood thus becomes Type I Bessel Function L( t j n 0 … k ) = Y k h q ( j ) (2š t ) i n j n 0 ! n 1 n !

… k !

j = 0 Where n j is the number of sites that differ by k (observed) steps

With this likelihood, the resulting prior becomes p( t j n 0 … k ) / Y k j = 0 h q ( j ) (2š t) i n j e ° •t This rearranges to give the general posterior under the Exact SMM model (SMME) as p( t j n 0 ; ¢¢¢; n k ) = R 1 0 e ° ( •+ 2 š n )t e ° ( •+ 2 š n )t Q k Q j = 0 k j = 0 [I j [ I j ( 2š t ) ] n j ( 2š t ) ] n j dt

Example

Consider comparing haplotypes 1 and 3 from Thomas et al.’s (2000) study on the Lemba and Cohen Y chromosome modal haplotypes. Here six markers used, four exactly match, one differs by one repeat, the other by two repeats Hence, n = 6, k = 4 for IAM and SMM0 models n 0 = 4, n 1 = 1, n 2 = 1, n = 6 under SMME model Assume Hammer’s value of N e =5000 for the prior

TMRCA for Lemba and Cohen Y IAM

Model used IAM SMM0 SMME Mean 152.3

454.7

422.3

Medium 135.4

233.7

286.2

2.5% 31.1

40.4

65.1

SMM0

97.5% 371 2389 1631

SMME Time to MRCA, t

IAM SMM0 SMME Time to MRCA, t