A simple model for a complex world Simon Whelan University of Manchester Isaac Newton Institute.

Download Report

Transcript A simple model for a complex world Simon Whelan University of Manchester Isaac Newton Institute.

A simple model for a complex world
Simon Whelan
University of Manchester
Isaac
Newton
Institute
Modelling sequence evolution
Seq1
Seq2
Seq3
Seq4
Seq5
Seq6
Seq7
Seq8
TCTTTATTGACGTGTATGGACAATTC...
TCTTTGTTAACGTGCATGGACAATTC...
TCCTTGCTAACATGCATGGACAATTC...
TCTTTGCTAACGTGCATGGATAATTC...
TCTT---TAACGTGCATAGATAACTC...
TCAC---TAACATGTATAGATAACTC...
TCTCTTCTAACGTGCATTGTGAAGTC...
TCTCTTTTGACATGTATTGAAAAATC...
A C G T
A
C
G
T
Simple models assume:
• All sites evolving to the same process
• All parts of the tree evolve to the same process
Spatial heterogeneity in sequence evolution
Also known as pattern heterogeneity
Seq1
Seq2
Seq3
Seq4
Seq5
Seq6
Seq7
Seq8
TCTTTATTGACGTGTATGGACAATTC...
TCTTTGTTAACGTGCATGGACAATTC...
TCCTTGCTAACATGCATGGACAATTC...
TCTTTGCTAACGTGCATGGATAATTC...
TCTT---TAACGTGCATAGATAACTC...
TCAC---TAACATGTATAGATAACTC...
TCTCTTCTAACGTGCATTGTGAAGTC...
TCTCTTTTGACATGTATTGAAAAATC...
A C G T
A C G T
A
C
G
T
A
C
G
T
Rate = 0.5
A C G T
A
C
G
T
Rate = 1.0
Rate = 2.0
Spatial heterogeneity in sequence evolution
Also known as pattern heterogeneity
Seq1
Seq2
Seq3
Seq4
Seq5
Seq6
Seq7
Seq8
TCTTTATTGACGTGTATGGACAATTC...
TCTTTGTTAACGTGCATGGACAATTC...
TCCTTGCTAACATGCATGGACAATTC...
TCTTTGCTAACGTGCATGGATAATTC...
TCTT---TAACGTGCATAGATAACTC...
TCAC---TAACATGTATAGATAACTC...
TCTCTTCTAACGTGCATTGTGAAGTC...
TCTCTTTTGACATGTATTGAAAAATC...
A C G T
A C G T
A
C
G
T
A
C
G
T
Rate = 0.5
A C G T
A
C
G
T
Rate = 1.0
Rate = 2.0
Spatial heterogeneity in sequence evolution
Also known as pattern heterogeneity
Seq1
Seq2
Seq3
Seq4
Seq5
Seq6
Seq7
Seq8
TCTTTATTGACGTGTATGGACAATTC...
TCTTTGTTAACGTGCATGGACAATTC...
TCCTTGCTAACATGCATGGACAATTC...
TCTTTGCTAACGTGCATGGATAATTC...
TCTT---TAACGTGCATAGATAACTC...
TCAC---TAACATGTATAGATAACTC...
TCTCTTCTAACGTGCATTGTGAAGTC...
TCTCTTTTGACATGTATTGAAAAATC...
A C G T
A C G T
A
C
G
T
A
C
G
T
Rate = 0.5
A C G T
A
C
G
T
Rate = 1.0
Rate = 2.0
Temporal heterogeneity in sequence evolution
Seq1
Seq2
Seq3
Seq4
Seq5
Seq6
Seq7
Seq8
TCTTTATTGACGTGTATGGACAATTC...
TCTTTGTTAACGTGCATGGACAATTC...
TCCTTGCTAACATGCATGGACAATTC...
TCTTTGCTAACGTGCATGGATAATTC...
TCTT---TAACGTGCATAGATAACTC...
TCAC---TAACATGTATAGATAACTC...
TCTCTTCTAACGTGCATTGTGAAGTC...
TCTCTTTTGACATGTATTGAAAAATC...
A C G T
A C G T
A
C
G
T
A
C
G
T
Rate = 0.5
A C G T
A
C
G
T
Rate = 1.0
Rate = 2.0
Temporal heterogeneity in sequence evolution
Seq1
Seq2
Seq3
Seq4
Seq5
Seq6
Seq7
Seq8
TCTTTATTGACGTGTATGGACAATTC...
TCTTTGTTAACGTGCATGGACAATTC...
TCCTTGCTAACATGCATGGACAATTC...
TCTTTGCTAACGTGCATGGATAATTC...
TCTT---TAACGTGCATAGATAACTC...
TCAC---TAACATGTATAGATAACTC...
TCTCTTCTAACGTGCATTGTGAAGTC...
TCTCTTTTGACATGTATTGAAAAATC...
A C G T
A C G T
A
C
G
T
A
C
G
T
Rate = 0.5
A C G T
A
C
G
T
Rate = 1.0
Rate = 2.0
Motivation
Biological
Not including heterogeneity leads to inaccurate inferences (Naylor; Lockhart)
Form of heterogeneity is poorly characterised
Understanding heterogeneity may lead to biological insights
Modelling
Needs to describe general heterogeneity (Warnow)
Must be identifiable (Rhodes; Allman)
Should be computationally efficient:
• Few parameters
• Small(-ish) number of states
• Applicable to tree search
Temporal hidden Markov models (THMMs)
General type of model
Describes temporal and spatial heterogeneity
Allows simple likelihood computation (reversible; stationary; i.i.d.)
Previous incarnations
Mostly examine temporal and spatial rate variation
Covarion model of Tuffley and Steel and its progeny
Other names include:
• Markov modulated Markov processes (models)
• Switching processes
• Covarion-like
Substitution processes
There are 1,…,g separate HKY substitution processes, each
representing a hidden state in a HMM
The kth hidden state is defined by rate matrix Mk:
 
 ~k

M k   k  k A~ k
  A
 ~k
  A
~ ,~ ,~ ,~ 
k
A
k
C
k
G
k
T
~Ck

~ k
C
 k ~Ck
 k ~Gk
~Tk 
k
k ~k 
~
G
 T 

~Tk 

~Gk
 
= nucleotide distribution of hidden state k
 k = rate of hidden state k
 k = transition/transversion rate ratio of hidden state k
Note: Subscripts refer to observable states. Superscripts refer to hidden states
Temporal heterogeneity: a switching model
A reversible Markov model describing the switching rate between
hidden states
This process defined by g x g rate matrix C
 
 1, 2 ~1
 
C
 
 1, g ~1
  
 k ,l
~1 , ~ 2 ,, ~ g
 1, 2~ 2   1, g~ g 
2, g ~ g 

  

 2, g ~ 2




= exchangeability between hidden states k and l
= probability of a hidden state
Note: Subscripts refer to observable states. Superscripts refer to hidden states
Defining a THMM
The 4g x 4g instantaneous rate matrix is:
Qik, ,jl
M ik, j
 ~ l k ,l
  j C
0

for all i  j and k  l
for all i  j and k  l
for all i  j and k  l
Qik, ,jl = changes between observable states i, j and hidden states k, l
Equilibrium distribution is  ik  ~ k ~ik
Hidden states and observable states do not change simultaneously
Note: Subscripts refer to observable states. Superscripts refer to hidden states
THMMs for spatial and temporal heterogeneity
State 1
State 2
State 3
A C G T A C G T A C G T
A
C
State 1
G
T
A
C
State 2
G
T
A
C
State 3
G
T
C=
Rate of transitions
between hidden states
relative to substitution rate
0.07
Note: Value proportional to bubble area
Mixture models for spatial heterogeneity
State 1
State 2
State 3
A C G T A C G T A C G T
A
C
State 1
G
T
A
C
State 2
G
T
A
C
State 3
G
T
Restricting all  to
zero results in a
mixture model
k ,l
Probability of different
hidden states accounted
for by the equilibrium
distribution at the root
Mixture models for spatial heterogeneity
Pr(
)
Pr(
)
Pr(
)
Investigating heterogeneity in groEL
Data
Herbeck et al. (2005) examined groEL
sequences to investigate origins of
primary endosymbionts
Variability of GC content demonstrated to
affect tree estimate
There are 23 sequences of length 1572
nucleotides (all 3 codon positions)
Investigating heterogeneity in groEL
Investigating spatial heterogeneity
Use mixture model ( 
k ,l
set to 0)
Examine 2 and 3 hidden states
Relative importance of rate (  k to vary), nucleotide frequencies ( ~ k to vary),
and Ts/Tv bias (  k to vary)
Importance of all HKY parameters varying between classes
Investigating simple temporal heterogeneity
Single extra degree of freedom over mixture models
Use ‘simple’ THMM (
 k ,l
set to equal)
Relative importance of allowing different HKY parameters to vary temporally
Investigating simple temporal heterogeneity
GTR switching allows all  k ,l to vary
Results: groEL (no Γ- distribution)
Number of hidden states
Mixture model
THMM
2
3
2
3
Rates
2225.9
(4353.7)
2442.1
(4784.3)
2227.7
(4355.5)
2443.8
(4785.6)
Nucleotides
796.5
(1491.0)
1330.8
(2551.5)
796.5
(1489.0)
1339.0
(2566.1)
κ
299.1
(500.3)
316.6
(533.2)
300.0
(499.9)
316.7
(531.4)
All
2353.3
(4600.7)
2672.1
(5226.1)
2353.3
(4598.7)
2673.2
(5226.3)
All with GTR
switching
-
-
-
2694.0
(5266.0)
lnL(HKY) = -18579.6
Improvement in ln Lˆ over HKY
lnL(HKY+dG) = -16209.9
(Improvement in AIC over HKY)
It’s all about rate: frequencies
A C G T A C G T A C G T
A
C
G
T
A
C
G
T
A
C
G
T
C=
Rate of transitions
between hidden states
relative to substitution rate
0.04
Results: groEL (with Γ- distribution)
Number of hidden states
Mixture model
THMM
2
3
2
3
Rates
69.5
(135.1)
79.1
(152.2)
84.1
(162.3)
106.9
(205.9)
Nucleotides
196.6
(385.1)
282.7
(549.3)
234.7
(459.5)
338.1
(658.1)
κ
261.0
(518.0)
286.6
(567.2)
291.2
(576.3)
310.0
(612.0)
All
278.6
(545.3)
433.3
(842.5)
302.8
(591.6)
442.8
(859.6)
All with GTR
switching
-
-
-
456.4
(884.9)
lnL(HKY) = -18579.6
Improvement in ln Lˆ over HKY+Γ
lnL(HKY+dG) = -16209.9
(Improvement in AIC over HKY+Γ)
THMM+Γ Frequencies
State 1
State 2
State 3
A C G T A C G T A C G T
A
C
State 1
G
T
A
C
State 2
G
T
A
C
State 3
G
T
C=
Rate of transitions
between hidden states
relative to substitution rate
0.14
THMM+Γ All+H
State 1
State 2
State 3
A C G T A C G T A C G T
A
C
State 1
G
T
A
C
State 2
G
T
A
C
State 3
G
T
C=
Rate of transitions
between hidden states
relative to substitution rate
0.07
More results: data from PANDIT
Mixture models
Number of
GTR
parameter
switching
2
3
2
3
1338.9
3360.0
5060.7
5264.8
(2613.9)
(1728.0)
(10025.3)
(10401.7)
4134.3
6293.6
6594.5
9629.7
(8140.7)
(12331.2)
(13029.1)
(18971.3)
4186.0
4795.0
5705.7
6303.6
(8307.9)
(9494.1)
(11315.5)
(12479.2)
6379.5
9299.0
9265.1
14001.8
14409.5
(12567.0)
(18214.0)
(18306.1)
(27587.6)
(28371.0)
hidden states
Rates
Frequencies
Kappa
All
Single switching
3
-
-
-
Results extracted from 16 families (including groEL) with a total of 847
sequences and 24441 aligned columns of data.
Σ lnL(HKY) =
-1 053 026.8
Σ Improvement in ln Lˆ over HKY+Γ
Σ lnL(HKY+dG) =
-1 017 588.4
(Σ Improvement in AIC over HKY+Γ)
Improvement =
35 438.4
More evolution = more heterogeneity?
All+Γ with GTR switching
DAIC per site relative to HKY+dG
3
81
51
2
38
83
70
52
60
64
24
1
23
23
74
77
48
33
46
0
0
25
50
Tree length
(Looks similar for dN/dS and dN)
75
Γ-distributed RAS
No RAS
Rates
8
Rates
3
6
2
4
1
0
0
10
20
30
40
50
Frequencies
8
6
4
2
0
0
10
20
30
40
50
Kappa
8
6
4
2
0
0
10
20
30
40
50
All
8
6
Increase in AIC per site relative to HKY+Γ model
Increase in AIC per site relative to HKY model
2
0
0
10
20
30
40
50
60
70
50
60
70
40
50
60
70
40
50
60
70
Frequencies
3
2
1
0
0
10
20
30
40
Kappa
3
2
1
0
0
10
20
30
All
3
2
4
1
2
0
0
0
10
20
30
Tree length
40
50
0
10
20
30
Tree length
More evolution = more heterogeneity?
Potential cause 1: Something wrong with the statistics
ΔAIC per site relative to HKY(+Γ) is not correcting properly for
improvements given by additional branches or something else…
Some kind of systematic error as tree length grows, such as tree
estimate accuracy
Potential cause 2: Something biologically interesting
As tree length grows the substitution process tends to appear more
heterogeneous
A C G T
T
T
C
G
T
A
A
C
G
T
Time
Rate = 0.5
A C G T
A
C
G
T
Rate = 1.0
A C G T
A
C
G
T
Rate = 2.0
A C G T
T
T
C
G
T
A
A
C
G
T
Time
Rate = 0.5
A C G T
A
C
G
T
Rate = 1.0
A C G T
A
C
G
T
Rate = 2.0
A C G T
T
T
C
G
T
A
A
C
G
T
Time
Rate = 0.5
A C G T
A
C
G
T
Rate = 1.0
A C G T
A
C
G
T
Rate = 2.0
A C G T
T
T
C
G
T
A
A
C
G
T
Time
Rate = 0.5
A C G T
A
C
G
T
Rate = 1.0
A C G T
A
C
G
T
Rate = 2.0
A simple model for describing complexity
Degeneracy of the genetic code
The degeneracy of the genetic code can
leads to staccato patterns of
evolution, particularly at the 3rd codon
Present in nearly all analyses of nucleotide
coding data
Other types of complexity
Any sequence where biological function
places restrictions on how sites
change and those restrictions have
the potential to vary over time
Hidden biological
process
4-fold
degeneracy
2-fold
degeneracy
1-fold
degeneracy
Conclusions
Temporal and spatial heterogeneity
Spatial variation in rate masks other effects
Most complex model provides best description of data in all cases
Progression to 4 hidden state models provides further improvement,
but runs into numerical optimisation problems
Biological causes of heterogeneity
May occur whenever there is biological function in sequence data
Long evolutionary times may require (even) more sophisticated models
THMMs could provide a simple framework for describing and drawing
inferences from heterogeneity induced by complex dependencies