
Coalescent Theory in Biology

www.coalescent.dk

Fixed Parameters: Population Structure, Mutation, Selection, Recombination,...

Reproductive Structure
Genealogies of non-sequenced data
Genealogies of sequenced data (e.g. TGTTGT, CGTTAT, CATAGT)
Parameter Estimation
Model Testing

Wright-Fisher Model of Population Reproduction

Haploid Model

i. Individuals are made by sampling with replacement in the previous generation.

ii. The probability that 2 alleles have the same ancestor in the previous generation is 1/2N.

Diploid Model

Individuals are made by sampling, with replacement, one chromosome from the females and one from the males of the previous generation.

Assumptions

1. Constant population size 2. No geography 3. No selection 4. No recombination

P(k) := P{k alleles had k distinct parents}

Ancestor choices:

k -> any: $(2N)^k$

k -> k: $2N(2N-1)\cdots(2N-(k-1)) =: (2N)_{[k]}$

k -> k-1: $\binom{k}{2}(2N)_{[k-1]}$

k -> j: $S_{k,j}\,(2N)_{[j]}$

$S_{k,j}$ is the number of ways to group k labelled objects into j groups (Stirling numbers of the second kind).

For k << 2N:

$P(k) = \frac{(2N)_{[k]}}{(2N)^k} \approx 1 - \binom{k}{2}/2N \approx e^{-\binom{k}{2}/2N}$
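A minimal Python sketch comparing the exact probability P(k) with its two approximations for k << 2N. The function names are illustrative and not part of the slides; `two_n` stands for 2N.

```python
from math import comb, exp

def prob_distinct_parents(k, two_n):
    """Exact P(k): k alleles pick k distinct parents among two_n."""
    p = 1.0
    for i in range(k):
        p *= (two_n - i) / two_n  # the i-th allele avoids the i parents already taken
    return p

def linear_approx(k, two_n):
    """First-order approximation 1 - C(k,2)/2N."""
    return 1.0 - comb(k, 2) / two_n

def exp_approx(k, two_n):
    """Exponential approximation exp(-C(k,2)/2N)."""
    return exp(-comb(k, 2) / two_n)

if __name__ == "__main__":
    for k in (2, 5, 10):
        print(k, prob_distinct_parents(k, 20000),
              linear_approx(k, 20000), exp_approx(k, 20000))
```

For k = 2 the exact value and the linear approximation coincide at 1 - 1/2N; the gap grows slowly as k approaches the order of sqrt(2N).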



Waiting for most recent common ancestor - MRCA

Distribution of the time until 2 alleles had a common ancestor, $X_2$:

$P(X_2 > 1) = (2N-1)/2N = 1 - 1/(2N)$

$P(X_2 > j) = (1 - 1/(2N))^{j}$

$P(X_2 = j) = (1 - 1/(2N))^{j-1}\,(1/(2N))$

Mean, $E(X_2) = 2N$.

Ex.: 2N = 20,000, generation time 30 years, $E(X_2)$ = 600,000 years.
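The geometric waiting time can be checked with a short simulation. This is a sketch under the slide's assumptions (each generation the two alleles share a parent with probability 1/2N); the function names, the small 2N used for the simulation, and the seed are illustrative choices.

```python
import random

def mean_pairwise_coalescence_time(two_n):
    """Mean of a geometric distribution with success probability 1/two_n."""
    return float(two_n)

def simulate_x2(two_n, rng):
    """Generations back until two alleles pick the same parent."""
    j = 1
    while rng.random() >= 1.0 / two_n:
        j += 1
    return j

rng = random.Random(1)
two_n = 200  # kept small so the simulation runs quickly
draws = [simulate_x2(two_n, rng) for _ in range(10000)]
sim_mean = sum(draws) / len(draws)

# The slide's example: 2N = 20,000 alleles, 30 years per generation.
years = mean_pairwise_coalescence_time(20000) * 30

if __name__ == "__main__":
    print("simulated mean:", sim_mean, "expected:", two_n)
    print("expected years to MRCA of two alleles:", years)
```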

10 Alleles’ Ancestry for 15 generations

Multiple and Simultaneous Coalescents

1. Simultaneous Events

2. Multifurcations

3. Underestimation of Coalescent Rates



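Discrete -> Continuous Time

$t_c := t_d / 2N_e$

$X_k$ is $\exp\left[\binom{k}{2}\right]$ distributed, so $E(X_k) = 1/\binom{k}{2}$.

1.0 corresponds to 2N generations.

[Figure: genealogy of 6 alleles with two time axes, 0 to $2N_e$ generations (discrete) and 0.0 to 1.0 (continuous); 6 generations correspond to $6/2N_e$.]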

The Standard Coalescent

Two independent processes:
Continuous: exponential waiting times.
Discrete: choosing pairs to coalesce.

State (present at the top)    Waiting              Coalescing
{1}{2}{3}{4}{5}               Exp$\binom{5}{2}$    4 - 5
{1}{2}{3}{4,5}                Exp$\binom{4}{2}$    3 - (4,5)
{1}{2}{3,4,5}                 Exp$\binom{3}{2}$    1 - 2
{1,2}{3,4,5}                  Exp$\binom{2}{2}$    (1,2) - (3,(4,5))
{1,2,3,4,5}
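The two processes can be sketched as a short simulation: an exponential waiting time with rate C(k,2) while there are k lineages, followed by a uniformly chosen pair that coalesces. The function name and the event representation are illustrative, not taken from the slides.

```python
import random
from math import comb

def simulate_coalescent(n, rng):
    """Simulate one standard coalescent for n sampled alleles.

    Returns a list of (waiting_time, merged_block, blocks_after) tuples,
    going backwards in time, in units of 2N generations.
    """
    blocks = [frozenset([i]) for i in range(1, n + 1)]
    events = []
    while len(blocks) > 1:
        k = len(blocks)
        wait = rng.expovariate(comb(k, 2))   # continuous part: Exp(C(k,2))
        a, b = rng.sample(range(k), 2)       # discrete part: uniform pair
        merged = blocks[a] | blocks[b]
        blocks = [blk for i, blk in enumerate(blocks) if i not in (a, b)]
        blocks.append(merged)
        events.append((wait, merged, list(blocks)))
    return events

if __name__ == "__main__":
    for wait, merged, blocks in simulate_coalescent(5, random.Random(0)):
        print(f"{wait:.3f}", sorted(merged), [sorted(b) for b in blocks])
```

Because the waiting times and the pair choices are drawn independently, the tree shape and the branch lengths can also be simulated separately.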



Expected Height and Total Branch Length

Lineages    Time Epoch            Branch Lengths
2           1                     2
3           1/3                   1
k           $1/\binom{k}{2}$      $k/\binom{k}{2} = 2/(k-1)$

Expected total height of tree: $H_k = 2(1 - 1/k)$

i. Infinitely many alleles find 1 ancestor in finite time.

ii. It takes less than twice as long for k alleles to find 1 ancestor as it does for 2 alleles.

Expected total branch length in tree, $L_k$: $2(1 + 1/2 + 1/3 + \dots + 1/(k-1)) \approx 2\ln(k-1)$ to leading order for large k.
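A small sketch that sums the expected epoch durations directly, so the closed forms for the height and the total branch length can be checked numerically. Function names are illustrative.

```python
from math import comb, log

def expected_height(k):
    """Sum of expected epoch durations 1/C(j,2) for j = 2..k."""
    return sum(1.0 / comb(j, 2) for j in range(2, k + 1))

def expected_total_length(k):
    """Each epoch with j lineages contributes j * 1/C(j,2) = 2/(j-1)."""
    return sum(j / comb(j, 2) for j in range(2, k + 1))

if __name__ == "__main__":
    for k in (2, 3, 10, 100):
        print(k,
              expected_height(k), 2 * (1 - 1 / k),
              expected_total_length(k),
              2 * log(k - 1) if k > 2 else None)
```

The height stays below 2 for every k, while the total length keeps growing, but only logarithmically in the sample size.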

Effective Population Size, $N_e$.

In an idealised Wright-Fisher model:

i. Variation is reduced by a factor 1 - 1/(2N) per generation.

ii. The expected waiting time for two random alleles to find a common ancestor is 2N.

Factors that influence $N_e$:

i. Variance in offspring number (WF: 1). If the variance is higher, the effective population size is smaller.

ii. Population size variation. Example, k-cycle: $N_1, N_2, \dots, N_k$ with $k/N_e = 1/N_1 + \dots + 1/N_k$. $N_1 = 10$, $N_2 = 1000$ => $N_e \approx 19.8$.

iii. Two sexes: $N_e = 4 N_f N_m/(N_f + N_m)$. E.g. $N_f = 10$, $N_m = 1000$ gives $N_e \approx 40$.
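The two formulas for the effective population size are easy to evaluate directly. A minimal sketch, with illustrative function names:

```python
def ne_cycle(sizes):
    """Harmonic-mean effective size for a cycle N_1..N_k: k/N_e = sum(1/N_i)."""
    return len(sizes) / sum(1.0 / n for n in sizes)

def ne_two_sexes(n_f, n_m):
    """Effective size with n_f females and n_m males: 4*N_f*N_m/(N_f+N_m)."""
    return 4.0 * n_f * n_m / (n_f + n_m)

if __name__ == "__main__":
    print("cycle (10, 1000):", ne_cycle([10, 1000]))
    print("two sexes (10, 1000):", ne_two_sexes(10, 1000))
```

In both cases the smaller number dominates: the effective size stays close to a small multiple of the bottleneck size or of the rarer sex.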

6 Realisations with 25 leaves

Observations: Variation great close to root.

Trees are unbalanced.



Sampling more sequences

The probability that the ancestor of the sample of size n is in a sub-sample of size k is

$\frac{(n+1)(k-1)}{(n-1)(k+1)}$

Letting n go to infinity gives (k-1)/(k+1), i.e. even for quite small samples it is quite large.
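A short sketch evaluating this probability, assuming the formula reads (n+1)(k-1)/((n-1)(k+1)), which equals 1 when k = n and tends to the stated limit (k-1)/(k+1) as n grows. The function name is illustrative.

```python
from fractions import Fraction

def prob_shared_mrca(n, k):
    """Probability that a sub-sample of size k contains the sample's MRCA.

    Assumes the form (n+1)(k-1) / ((n-1)(k+1)) for 2 <= k <= n.
    """
    return Fraction((k - 1) * (n + 1), (k + 1) * (n - 1))

if __name__ == "__main__":
    for k in (2, 5, 10, 20):
        print(k, float(prob_shared_mrca(1000, k)), (k - 1) / (k + 1))
```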

Adding Mutations

m: mutations per nucleotide per generation.
L: sequence length.
µ = m*L: mutations per allele per generation.
$2N_e$: number of alleles.
Θ := 4N*µ: mutation intensity in the scaled process.

Discrete time, discrete sequence (time step $1/(2N_e)$, site spacing 1/L) -> continuous time, continuous sequence.

Rates for two sequences in the scaled process: mutation Θ/2 on each lineage, coalescence 1.

Probability for two genes being identical: P(Coalescence < Mutation) = 1/(1+Θ).

Note: Mutation rate and population size usually appear together as a product, making separate estimation difficult.
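The identity probability follows from a race between two exponential clocks: coalescence at rate 1 and mutation at the combined rate of the two lineages. The sketch below checks this by simulation; `theta` stands for the scaled mutation intensity, and the function names and seed are illustrative.

```python
import random

def prob_identical(theta):
    """P(coalescence before any mutation) for two lineages."""
    return 1.0 / (1.0 + theta)

def simulate_identity(theta, reps, rng):
    """Fraction of replicates where coalescence beats mutation (theta > 0)."""
    hits = 0
    for _ in range(reps):
        t_coal = rng.expovariate(1.0)    # coalescence at rate 1
        t_mut = rng.expovariate(theta)   # two lineages, rate theta/2 each
        hits += t_coal < t_mut
    return hits / reps

if __name__ == "__main__":
    rng = random.Random(2)
    for theta in (0.1, 1.0, 5.0):
        print(theta, prob_identical(theta), simulate_identity(theta, 20000, rng))
```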

Three Models of Alleles and Mutations.

Infinite Allele

i. Only identity, non-identity is determinable.

ii. A mutation creates a new type.

Infinite Site

i. Allele is represented by a line.

ii. A mutation always hits a new position.

Finite Site

i. Allele is represented by a sequence.

ii. A mutation changes nucleotide at chosen position.

[Figure: example alleles acgtgctt, acgtgcgt, acctgcat, tcctgcat, tcctgcat shown for each model, with mutation intensity Θ; one panel shows tcctggct as the fourth sequence.]

Infinite Allele Model

[Figure: a genealogy of 5 sequences built up step by step, with the allele partition at each step, e.g. {(1)}, {(1,2)}, {(1),(2)}, {(1),(2,3)}, {(1,2),(3),(4,5)}, {(1),(2),(3),(4,5)}.]

Infinite Site Model

Final Aligned Data Set:

Labelling and unlabelling: positions and sequences

[Figure: sequences 1 2 3 4 5 shown twice, once ignoring mutation position and once ignoring sequence label.]

The forward-backward argument

[Figure: ancestral configurations with transition probabilities depending on Θ.]

4 classes of mutation events incompatible with data.

9 coalescence events incompatible with data.

Infinite Site Model: An example

Theta=2.12

[Figure: example genealogy with numeric labels 2, 5, 10, 3, 19, 5, 2, 4, 3, 9, 14, 5, 33.]

Impossible Ancestral States

Final Aligned Data Set:

acgtgctt
acgtgcgt
acctgcat
tcctgcat
tcctgcat

Finite Site Model

Diploid Model with Recombination

An individual is made by:

1. Picking a random father.

2. Making that father's chromosomes recombine to create the individual's paternal chromosome.

The maternal chromosome is made similarly.
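The construction of one individual can be sketched as follows. This is an illustration only: chromosomes are modelled as equal-length lists of site labels, at most one crossover occurs with a hypothetical per-gamete probability `r`, and all function names are invented for the example.

```python
import random

def make_gamete(parent, r, rng):
    """Build one chromosome from a parent's pair of chromosomes.

    With probability r a single crossover joins the start of one parental
    chromosome to the end of the other; otherwise one is copied whole.
    Chromosomes must have at least 2 sites.
    """
    first, second = parent if rng.random() < 0.5 else parent[::-1]
    if rng.random() < r:
        cut = rng.randrange(1, len(first))  # crossover point between sites
        return first[:cut] + second[cut:]
    return list(first)

def make_individual(fathers, mothers, r, rng):
    """Pick a random father and mother and recombine within each."""
    paternal = make_gamete(rng.choice(fathers), r, rng)
    maternal = make_gamete(rng.choice(mothers), r, rng)
    return (paternal, maternal)

if __name__ == "__main__":
    fathers = [(["a"] * 8, ["b"] * 8)]
    mothers = [(["c"] * 8, ["d"] * 8)]
    print(make_individual(fathers, mothers, 0.5, random.Random(3)))
```

Read backwards in time, a recombined chromosome traces back to both chromosomes of one parent, which is why its ancestry splits into two lineages.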

The Diploid Model Back in Time.

A recombinant sequence will have two different ancestor sequences in the grandparent generation.

1-recombination histories I: Branch length change

[Figure: trees on leaves 1 2 3 4.]

1-recombination histories II: Topology change

[Figure: trees on leaves 1 2 3 4.]

1-recombination histories III: Same tree

[Figure: trees on leaves 1 2 3 4.]

1-recombination histories IV: Coalescent time must be further back in time than recombination time.

[Figure: leaves 1 2 3 4 with recombination point r and coalescence point c.]

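Recombination-Coalescence Illustration (copied from Hudson 1991)

[Figure: ancestral graph annotated with coalescence and recombination intensities per time interval; labels shown: 0, 1, (1+b), b, 3, (2+b), 6, 2, 3, 2, 1, 2.]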

Age to oldest most recent common ancestor

[Figure: age plotted along the sequence from 0 kb to 250 kb, with the scaled recombination rate indicated.]

Number of genetic ancestors to the Human Genome

$S_\rho$: number of segments. $E(S_\rho) = 1 + \rho$.

[Figure: a sequence cut into segments by coalescence (C) and recombination (R) events.]

Simulations: statements about the number of ancestors are much harder to make.

Applications to Human Genome (Wiuf and Hein, 97)

Parameters used: $4N_e$ = 20,000. Chromosome 1: 263 Mb, 263 cM.

Chromosome 1: Segments 52,000; Ancestors 6,800.

All chromosomes: Ancestors 86,000.

Physical population: 1.3-5.0 million.

A randomly picked ancestor (ancestral material comes in batteries!):

[Figure: chromosome 1 at three zoom levels: 0 to 260 Mb with up to 52,000 segments, a 7.5 Mb window (x35), and a 30 kb window (x250); other labels shown: 6890, 8360.]
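The segment count for chromosome 1 can be reproduced from the quoted parameters. The sketch assumes the scaled recombination rate is ρ = 4N_e times the genetic length in Morgans and that the expected number of segments is 1 + ρ; both function names are illustrative.

```python
def scaled_recombination_rate(four_ne, centimorgans):
    """rho = 4*N_e * r, with the genetic length r converted from cM to Morgans."""
    return four_ne * centimorgans / 100.0

def expected_segments(rho):
    """Expected number of segments, assuming E(S) = 1 + rho."""
    return 1.0 + rho

if __name__ == "__main__":
    rho = scaled_recombination_rate(20000, 263)
    print("rho for chromosome 1:", rho)
    print("expected segments:", expected_segments(rho))
```

With 4N_e = 20,000 and 263 cM this gives roughly 52,600, in line with the 52,000 segments quoted for chromosome 1.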

Ignoring recombination in phylogenetic analysis

General practice in analysis of viral evolution!

[Figure: history with recombination (leaves 1 2 3 4) next to the tree inferred assuming no recombination (leaves 1 2 4 3).]

Mimics decelerations/accelerations of evolutionary rates.

No recombination and infinite recombination both imply a molecular clock.

Simulated Example

Genotype and Phenotype Covariation: Gene Mapping

Sampling Genotypes and Phenotypes

Decay of local dependency

Genotype --> Phenotype Function

Dominant/Recessive.

Penetrance.

Spurious Occurrence.

Heterogeneity.

A set of characters.

Binary decision (0,1).

Quantitative Character.

[Figure: genotype and phenotype panels, after Reich et al. (2001).]

Result: The Mapping Function