Coalescent Theory in Biology
www.coalescent.dk
Fixed Parameters: Population Structure, Mutation, Selection, Recombination, ...
Reproductive Structure
Genealogies of non-sequenced data
Genealogies of sequenced data (TGTTGT, CGTTAT, CATAGT)
Parameter Estimation
Model Testing
Wright-Fisher Model of Population Reproduction
Haploid Model
i. Individuals are made by sampling with replacement from the previous generation.
ii. The probability that 2 alleles have the same ancestor in the previous generation is 1/2N.
Diploid Model
Individuals are made by sampling, with replacement, one chromosome from the females and one from the males of the previous generation.
Assumptions
1. Constant population size
2. No geography
3. No selection
4. No recombination
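The claim that 2 alleles share a parent with probability 1/2N can be checked with a small simulation. This is only a sketch: the function names, the seed and the choice 2N = 20 are assumptions made for illustration and do not come from the slides.

```python
import random

def sample_parents(k, two_n, rng):
    """Each of k alleles picks a parent uniformly, with replacement, among 2N."""
    return [rng.randrange(two_n) for _ in range(k)]

def share_parent_frequency(two_n, reps, rng):
    """Fraction of replicates in which 2 alleles pick the same parent."""
    hits = 0
    for _ in range(reps):
        first, second = sample_parents(2, two_n, rng)
        if first == second:
            hits += 1
    return hits / reps

if __name__ == "__main__":
    rng = random.Random(1)  # arbitrary fixed seed (assumption)
    two_n = 20              # illustrative population of 2N alleles
    print("simulated:", share_parent_frequency(two_n, 100_000, rng))
    print("expected 1/2N:", 1 / two_n)
```

The simulated frequency should lie close to 1/2N, up to sampling noise.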
P(k) := P{k alleles had k distinct parents}
Ancestor choices:
k -> any: $(2N)^k$
k -> k: $2N(2N-1)\cdots(2N-(k-1)) =: (2N)_{[k]}$
k -> k-1: $\binom{k}{2}(2N)_{[k-1]}$
k -> j: $S_{k,j}(2N)_{[j]}$
$S_{k,j}$: the number of ways to group k labelled objects into j groups (Stirling numbers of the second kind).
For k << 2N:
$P(k) = \frac{(2N)_{[k]}}{(2N)^k} \approx 1 - \binom{k}{2}\frac{1}{2N} \approx e^{-\binom{k}{2}/2N}$
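To see how good the two approximations for k << 2N are, the exact product can be compared with them numerically. A minimal sketch; the function names and the value 2N = 20,000 used in the loop are illustrative assumptions.

```python
from math import comb, exp

def prob_distinct_parents(k, two_n):
    """Exact P(k) = (2N)_[k] / (2N)^k: k alleles pick k distinct parents."""
    p = 1.0
    for i in range(k):
        p *= (two_n - i) / two_n
    return p

def linear_approx(k, two_n):
    """First-order approximation 1 - C(k,2)/2N."""
    return 1.0 - comb(k, 2) / two_n

def exp_approx(k, two_n):
    """Exponential approximation exp(-C(k,2)/2N)."""
    return exp(-comb(k, 2) / two_n)

if __name__ == "__main__":
    two_n = 20_000  # illustrative value (assumption)
    for k in (2, 5, 10, 50):
        print(k, prob_distinct_parents(k, two_n),
              linear_approx(k, two_n), exp_approx(k, two_n))
```

For k = 2 the linear form is exact; the gap to both approximations grows as k approaches 2N.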
Waiting for most recent common ancestor - MRCA
Distribution of the time, $X_2$, until 2 alleles had a common ancestor:
$P(X_2 > 1) = (2N-1)/2N = 1 - 1/2N$
$P(X_2 > j) = (1 - 1/2N)^j$
$P(X_2 = j) = (1 - 1/2N)^{j-1}(1/2N)$
Mean: $E(X_2) = 2N$.
Ex.: 2N = 20,000 and a generation time of 30 years give $E(X_2)$ = 20,000 generations = 600,000 years.
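The geometric waiting time and the worked example can be reproduced in a few lines. A sketch only; the function names are assumptions introduced for illustration.

```python
def prob_wait_exceeds(j, two_n):
    """P(X_2 > j): no common ancestor in the first j generations back."""
    return (1 - 1 / two_n) ** j

def prob_wait_equals(j, two_n):
    """P(X_2 = j): first common ancestor exactly j generations back."""
    return (1 - 1 / two_n) ** (j - 1) * (1 / two_n)

def expected_wait_years(two_n, generation_years):
    """E(X_2) = 2N generations, converted to years."""
    return two_n * generation_years

if __name__ == "__main__":
    print(prob_wait_exceeds(1, 20_000))
    print(expected_wait_years(20_000, 30))
```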
[Figure: 10 alleles' ancestry for 15 generations in a population of 2N alleles]
Multiple and Simultaneous Coalescents
1. Simultaneous Events
2. Multifurcations
3. Underestimation of Coalescent Rates
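Discrete -> Continuous Time
$t_c := t_d / 2N_e$
$X_k$ is $\exp\left[\binom{k}{2}\right]$ distributed, $E(X_k) = 1/\binom{k}{2}$.
1.0 on the continuous scale corresponds to 2N generations.
[Figure: the same genealogy on the discrete scale (0 to 2N generations) and on the continuous scale (0.0 to 1.0); 6 generations correspond to $6/2N_e$]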
The Standard Coalescent
Two independent Processes
Continuous: Exponential Waiting Times
Discrete: Choosing Pairs to Coalesce

State                 Waiting                  Coalescing
{1}{2}{3}{4}{5}       Exp[$\binom{5}{2}$]      4 - 5
{1}{2}{3}{4,5}        Exp[$\binom{4}{2}$]      3 - (4,5)
{1}{2}{3,4,5}         Exp[$\binom{3}{2}$]      1 - 2
{1,2}{3,4,5}          Exp[$\binom{2}{2}$]      (1,2) - (3,(4,5))
{1,2,3,4,5}
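The two independent processes translate directly into a simulation: draw an exponential waiting time with rate C(k,2), then pick a uniformly random pair of lineages to merge. The sketch below is a minimal illustration; the function names and the nested-tuple tree representation are assumptions, not something taken from the slides.

```python
import random
from math import comb

def simulate_coalescent(n, rng):
    """Simulate one standard coalescent tree for n sampled alleles.

    Returns (tree, waits): tree is a nested tuple over leaves 1..n and
    waits[i] is the waiting time spent with n - i lineages.
    """
    lineages = list(range(1, n + 1))
    waits = []
    while len(lineages) > 1:
        k = len(lineages)
        # Continuous part: waiting time is exponential with rate C(k,2).
        waits.append(rng.expovariate(comb(k, 2)))
        # Discrete part: a uniformly chosen pair coalesces.
        i, j = sorted(rng.sample(range(k), 2))
        right = lineages.pop(j)  # pop the larger index first
        left = lineages.pop(i)
        lineages.append((left, right))
    return lineages[0], waits

def leaves(tree):
    """Flatten a nested-tuple tree into its list of leaf labels."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

if __name__ == "__main__":
    tree, waits = simulate_coalescent(5, random.Random(1))
    print(tree)
    print("tree height:", sum(waits))
```

Averaging `sum(waits)` over many replicates gives an empirical tree height that can be compared with the sum of the expected waiting times 1/C(k,2).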
Expected Height and Total Branch Length

Epoch (lineages)     2      3      ...    k
Time in epoch        1      1/3    ...    $1/\binom{k}{2}$
Branch lengths       2      1      ...    $k/\binom{k}{2} = 2/(k-1)$

Expected total height of tree: $H_k = 2(1 - 1/k)$
i. Infinitely many alleles find 1 ancestor in finite time.
ii. It takes less than twice as long for k alleles to find 1 ancestor as it does for 2 alleles.
Expected total branch length in tree: $L_k = 2(1 + 1/2 + 1/3 + \dots + 1/(k-1)) \approx 2\ln(k-1)$ to leading order.
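Both expectations follow from summing the per-epoch values: 1/C(j,2) for the height and j/C(j,2) for the branch length. A short sketch with illustrative function names (an assumption) checks the closed forms.

```python
from math import comb, log

def expected_height(k):
    """H_k: sum of expected epoch times 1/C(j,2) for j = 2..k."""
    return sum(1 / comb(j, 2) for j in range(2, k + 1))

def expected_total_length(k):
    """L_k: sum of j lineages times the expected epoch time 1/C(j,2)."""
    return sum(j / comb(j, 2) for j in range(2, k + 1))

if __name__ == "__main__":
    for k in (2, 5, 25, 1000):
        print(k, expected_height(k), 2 * (1 - 1 / k),
              expected_total_length(k), 2 * log(k - 1) if k > 2 else 0.0)
```

The height stays below 2 for every k, while the total branch length keeps growing logarithmically.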
Effective Population Size, $N_e$
In an idealised Wright-Fisher model:
i. Variation decays by a factor 1 - 1/(2N) per generation.
ii. The expected waiting time for two random alleles to find a common ancestor is 2N generations.
Factors that influence $N_e$:
i. Variance in offspring number. In the WF model it is 1. If the variance is higher, then the effective population size is smaller.
ii. Population size variation. Example, a k-cycle $N_1, N_2, \dots, N_k$: $k/N_e = 1/N_1 + \dots + 1/N_k$. With $N_1 = 10$, $N_2 = 1000$: $N_e \approx 19.8$.
iii. Two sexes: $N_e = 4N_fN_m/(N_f + N_m)$. E.g. $N_f = 10$, $N_m = 1000$: $N_e \approx 40$.
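The two formulas for the effective population size are easy to evaluate directly. A minimal sketch, with function names chosen here for illustration (an assumption); for $N_1 = 10$, $N_2 = 1000$ the cycle formula k/N_e = 1/N_1 + ... + 1/N_k evaluates to about 19.8.

```python
def ne_cycle(sizes):
    """Harmonic-mean N_e for a cycle of sizes: k/N_e = sum(1/N_i)."""
    return len(sizes) / sum(1 / n for n in sizes)

def ne_two_sexes(n_females, n_males):
    """N_e = 4 N_f N_m / (N_f + N_m) for a population with two sexes."""
    return 4 * n_females * n_males / (n_females + n_males)

if __name__ == "__main__":
    print(ne_cycle([10, 1000]))
    print(ne_two_sexes(10, 1000))
```

In both cases the smaller number dominates: a single bottleneck generation or a rare sex pulls the effective size far below the census size.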
6 Realisations with 25 leaves
Observations:
i. Variation is great close to the root.
ii. Trees are unbalanced.
Sampling more sequences
The probability that the most recent common ancestor of a sample of size n is also the most recent common ancestor of a sub-sample of size k is
$\frac{(n+1)(k-1)}{(n-1)(k+1)}$
Letting n go to infinity gives (k-1)/(k+1), i.e. even for quite small samples it is quite large.
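Taking the probability to be (n+1)(k-1)/((n-1)(k+1)), the form consistent with the stated limit (k-1)/(k+1), a few values show how quickly it approaches 1. The function name below is an assumption made for this sketch.

```python
from fractions import Fraction

def prob_shared_mrca(k, n):
    """Probability that a sub-sample of size k contains the MRCA of a sample of size n."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    return Fraction((n + 1) * (k - 1), (n - 1) * (k + 1))

if __name__ == "__main__":
    for k in (2, 5, 10, 20):
        print(k, float(prob_shared_mrca(k, 1000)), (k - 1) / (k + 1))
```

With k = n the probability is exactly 1, as it must be.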
Adding Mutations
m: mutations per nucleotide per generation.
L: sequence length.
µ = m*L: mutations per allele per generation.
$2N_e$: number of alleles.
Θ (Theta) := 4N*µ: mutation intensity in the scaled process.
Discrete time, discrete sequence -> continuous time, continuous sequence: the time unit is $1/(2N_e)$ and the sequence unit is 1/L.
Rates for two genes in the scaled process: mutation Θ/2 on each lineage, coalescence 1.
Probability for two genes being identical: P(Coalescence < Mutation) = 1/(1+Θ).
Note: Mutation rate and population size usually appear together as a product, making separate estimation difficult.
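The identity probability 1/(1+Theta) is a race between two exponential clocks: coalescence at rate 1 and mutation at total rate Theta (Theta/2 on each of the two lineages). The Monte Carlo sketch below illustrates this; the function names, the seed and the value Theta = 2.12 are assumptions chosen for illustration.

```python
import random

def prob_identical(theta):
    """Analytic P(coalescence before mutation) for two genes."""
    return 1 / (1 + theta)

def simulate_identical(theta, reps, rng):
    """Monte Carlo estimate of the same probability (requires theta > 0)."""
    hits = 0
    for _ in range(reps):
        t_coal = rng.expovariate(1.0)   # coalescence of the two lineages
        t_mut = rng.expovariate(theta)  # first mutation on either lineage
        if t_coal < t_mut:
            hits += 1
    return hits / reps

if __name__ == "__main__":
    theta = 2.12  # illustrative value (assumption)
    print("analytic:", prob_identical(theta))
    print("simulated:", simulate_identical(theta, 100_000, random.Random(0)))
```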
Three Models of Alleles and Mutations.
Infinite Allele
i. Only identity, non-identity is determinable.
ii. A mutation creates a new type.
Infinite Site
i. Allele is represented by a line.
ii. A mutation always hits a new position.
Finite Site
i. Allele is represented by a sequence.
ii. A mutation changes nucleotide at chosen position.
[Figure: the three models illustrated on example sequences: acgtgctt, acgtgcgt, acctgcat, tcctgcat, tcctgcat and acgtgctt, acgtgcgt, acctgcat, tcctggct, tcctgcat]
Infinite Allele Model
[Figure: genealogy of 5 alleles with mutations, showing the allele partitions step by step, from {(1)} through {(1), (2,3)} and {(1,2), (3), (4,5)} to {(1), (2), (3), (4,5)}]
Infinite Site Model
[Figure: final aligned data set]
Labelling and unlabelling: positions and sequences
[Figure: genealogy of sequences 1-5, shown ignoring mutation position and ignoring sequence label]
The forward-backward argument
[Figure: a data configuration with 4 classes of mutation events incompatible with data and 9 coalescence events incompatible with data]
Infinite Site Model: An example
Theta = 2.12
[Figure: worked example for the infinite site model]
Impossible Ancestral States
Final Aligned Data Set:
acgtgctt
acgtgcgt
acctgcat
tcctgcat
tcctgcat
s s   s
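Which columns of this alignment vary, and how many distinct nucleotides each one carries, can be read off mechanically. The sketch below assumes that the "s" marks denote variable (segregating) columns; the function and variable names are illustrative.

```python
ALIGNMENT = [
    "acgtgctt",
    "acgtgcgt",
    "acctgcat",
    "tcctgcat",
    "tcctgcat",
]

def segregating_sites(alignment):
    """Return (position, number of states) for every variable column, 1-based."""
    sites = []
    for pos, column in enumerate(zip(*alignment), start=1):
        states = set(column)
        if len(states) > 1:
            sites.append((pos, len(states)))
    return sites

if __name__ == "__main__":
    for pos, n_states in segregating_sites(ALIGNMENT):
        print(f"position {pos}: {n_states} different nucleotides")
```

A column with more than two nucleotides needs at least two mutations at the same position, which the infinite site model does not allow and the finite site model does.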
Finite Site Model
Diploid Model with Recombination
An individual is made by:
1. Picking a random father.
2. Making that father's chromosomes recombine to create the individual's paternal chromosome.
Similarly for the maternal chromosome.
The Diploid Model Back in Time.
A recombinant sequence will have two different ancestor sequences in the grandparent generation.
1-recombination histories I: Branch length change
[Figure: three trees on leaves 1 2 3 4]
1-recombination histories II: Topology change
[Figure: three trees on leaves 1 2 3 4]
1-recombination histories III: Same tree
[Figure: three trees on leaves 1 2 3 4]
1-recombination histories IV: Coalescent time must be further back in time than recombination time.
[Figure: leaves 1 2 3 4 with recombination time r and coalescence time c marked]
Recombination-Coalescence Illustration (copied from Hudson 1991)
[Figure: genealogy with recombination; for each epoch the coalescence and recombination intensities are listed]
Age to oldest most recent common ancestor
[Figure: age of the most recent common ancestor along a sequence from 0 kb to 250 kb, for a given scaled recombination rate]
Number of genetic ancestors to the Human Genome
$S_\rho$: number of segments, $E(S_\rho) = 1 + \rho$
[Figure: a sequence cut into segments by recombination (R) and coalescence (C) events]
Simulations: statements about the number of ancestors are much harder to make.
Applications to Human Genome (Wiuf and Hein, 97)
Parameters used: $4N_e$ = 20,000; Chromosome 1: 263 Mb, 263 cM.
Chromosome 1: Segments 52,000; Ancestors 6,800.
All chromosomes: Ancestors 86,000.
Physical population: 1.3-5.0 million.
A randomly picked ancestor (ancestral material comes in batteries!):
[Figure: ancestral material along chromosome 1 at successive magnifications: 0 to 260 Mb, x35 to 7.5 Mb, x250 to 30 kb; figure labels 6890, 52,000 and 8360]
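The segment count for chromosome 1 can be reproduced from the stated parameters under two assumptions that are labelled here because the transcript does not spell them out: the scaled recombination rate is rho = 4N_e * r with r the map length in Morgans, and the expected number of segments is 1 + rho. Function names are illustrative.

```python
def scaled_recombination_rate(four_ne, centimorgans):
    """rho = 4N_e * r, with the map length r converted from cM to Morgans (assumption)."""
    return four_ne * centimorgans / 100

def expected_segments(rho):
    """Expected number of segments, assumed to be 1 + rho."""
    return 1 + rho

if __name__ == "__main__":
    rho = scaled_recombination_rate(20_000, 263)
    print("rho:", rho)
    print("expected segments:", expected_segments(rho))
```

The result is of the same size as the 52,000 segments quoted for chromosome 1.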
Ignoring recombination in phylogenetic analysis
General practice in analysis of viral evolution!
[Figure: history with recombination compared with the tree inferred assuming no recombination; leaf order 1 2 3 4 versus 1 2 4 3]
Ignoring recombination mimics decelerations/accelerations of evolutionary rates.
Both no recombination and infinite recombination imply a molecular clock.
Simulated Example
Genotype and Phenotype Covariation: Gene Mapping
Sampling Genotypes and Phenotypes
Decay of local dependency
Reich et al. (2001)
Genotype --> Phenotype Function: Dominant/Recessive, Penetrance, Spurious Occurrence, Heterogeneity
Genotype: A set of characters.
Phenotype: Binary decision (0,1) or quantitative character.
Result: The Mapping Function