A simple model for a complex world Simon Whelan University of Manchester Isaac Newton Institute.
Download ReportTranscript A simple model for a complex world Simon Whelan University of Manchester Isaac Newton Institute.
A simple model for a complex world Simon Whelan University of Manchester Isaac Newton Institute Modelling sequence evolution Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 TCTTTATTGACGTGTATGGACAATTC... TCTTTGTTAACGTGCATGGACAATTC... TCCTTGCTAACATGCATGGACAATTC... TCTTTGCTAACGTGCATGGATAATTC... TCTT---TAACGTGCATAGATAACTC... TCAC---TAACATGTATAGATAACTC... TCTCTTCTAACGTGCATTGTGAAGTC... TCTCTTTTGACATGTATTGAAAAATC... A C G T A C G T Simple models assume: • All sites evolving to the same process • All parts of the tree evolve to the same process Spatial heterogeneity in sequence evolution Also known as pattern heterogeneity Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 TCTTTATTGACGTGTATGGACAATTC... TCTTTGTTAACGTGCATGGACAATTC... TCCTTGCTAACATGCATGGACAATTC... TCTTTGCTAACGTGCATGGATAATTC... TCTT---TAACGTGCATAGATAACTC... TCAC---TAACATGTATAGATAACTC... TCTCTTCTAACGTGCATTGTGAAGTC... TCTCTTTTGACATGTATTGAAAAATC... A C G T A C G T A C G T A C G T Rate = 0.5 A C G T A C G T Rate = 1.0 Rate = 2.0 Spatial heterogeneity in sequence evolution Also known as pattern heterogeneity Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 TCTTTATTGACGTGTATGGACAATTC... TCTTTGTTAACGTGCATGGACAATTC... TCCTTGCTAACATGCATGGACAATTC... TCTTTGCTAACGTGCATGGATAATTC... TCTT---TAACGTGCATAGATAACTC... TCAC---TAACATGTATAGATAACTC... TCTCTTCTAACGTGCATTGTGAAGTC... TCTCTTTTGACATGTATTGAAAAATC... A C G T A C G T A C G T A C G T Rate = 0.5 A C G T A C G T Rate = 1.0 Rate = 2.0 Spatial heterogeneity in sequence evolution Also known as pattern heterogeneity Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 TCTTTATTGACGTGTATGGACAATTC... TCTTTGTTAACGTGCATGGACAATTC... TCCTTGCTAACATGCATGGACAATTC... TCTTTGCTAACGTGCATGGATAATTC... TCTT---TAACGTGCATAGATAACTC... TCAC---TAACATGTATAGATAACTC... TCTCTTCTAACGTGCATTGTGAAGTC... TCTCTTTTGACATGTATTGAAAAATC... A C G T A C G T A C G T A C G T Rate = 0.5 A C G T A C G T Rate = 1.0 Rate = 2.0 Temporal heterogeneity in sequence evolution Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 TCTTTATTGACGTGTATGGACAATTC... TCTTTGTTAACGTGCATGGACAATTC... TCCTTGCTAACATGCATGGACAATTC... TCTTTGCTAACGTGCATGGATAATTC... TCTT---TAACGTGCATAGATAACTC... TCAC---TAACATGTATAGATAACTC... TCTCTTCTAACGTGCATTGTGAAGTC... TCTCTTTTGACATGTATTGAAAAATC... A C G T A C G T A C G T A C G T Rate = 0.5 A C G T A C G T Rate = 1.0 Rate = 2.0 Temporal heterogeneity in sequence evolution Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 Seq8 TCTTTATTGACGTGTATGGACAATTC... TCTTTGTTAACGTGCATGGACAATTC... TCCTTGCTAACATGCATGGACAATTC... TCTTTGCTAACGTGCATGGATAATTC... TCTT---TAACGTGCATAGATAACTC... TCAC---TAACATGTATAGATAACTC... TCTCTTCTAACGTGCATTGTGAAGTC... TCTCTTTTGACATGTATTGAAAAATC... A C G T A C G T A C G T A C G T Rate = 0.5 A C G T A C G T Rate = 1.0 Rate = 2.0 Motivation Biological Not including heterogeneity leads to inaccurate inferences (Naylor; Lockhart) Form of heterogeneity is poorly characterised Understanding heterogeneity may lead to biological insights Modelling Needs to describe general heterogeneity (Warnow) Must be identifiable (Rhodes; Allman) Should be computationally efficient: • Few parameters • Small(-ish) number of states • Applicable to tree search Temporal hidden Markov models (THMMs) General type of model Describes temporal and spatial heterogeneity Allows simple likelihood computation (reversible; stationary; i.i.d.) Previous incarnations Mostly examine temporal and spatial rate variation Covarion model of Tuffley and Steel and its progeny Other names include: • Markov modulated Markov processes (models) • Switching processes • Covarion-like Substitution processes There are 1,…,g separate HKY substitution processes, each representing a hidden state in a HMM The kth hidden state is defined by rate matrix Mk: ~k M k k k A~ k A ~k A ~ ,~ ,~ ,~ k A k C k G k T ~Ck ~ k C k ~Ck k ~Gk ~Tk k k ~k ~ G T ~Tk ~Gk = nucleotide distribution of hidden state k k = rate of hidden state k k = transition/transversion rate ratio of hidden state k Note: Subscripts refer to observable states. Superscripts refer to hidden states Temporal heterogeneity: a switching model A reversible Markov model describing the switching rate between hidden states This process defined by g x g rate matrix C 1, 2 ~1 C 1, g ~1 k ,l ~1 , ~ 2 ,, ~ g 1, 2~ 2 1, g~ g 2, g ~ g 2, g ~ 2 = exchangeability between hidden states k and l = probability of a hidden state Note: Subscripts refer to observable states. Superscripts refer to hidden states Defining a THMM The 4g x 4g instantaneous rate matrix is: Qik, ,jl M ik, j ~ l k ,l j C 0 for all i j and k l for all i j and k l for all i j and k l Qik, ,jl = changes between observable states i, j and hidden states k, l Equilibrium distribution is ik ~ k ~ik Hidden states and observable states do not change simultaneously Note: Subscripts refer to observable states. Superscripts refer to hidden states THMMs for spatial and temporal heterogeneity State 1 State 2 State 3 A C G T A C G T A C G T A C State 1 G T A C State 2 G T A C State 3 G T C= Rate of transitions between hidden states relative to substitution rate 0.07 Note: Value proportional to bubble area Mixture models for spatial heterogeneity State 1 State 2 State 3 A C G T A C G T A C G T A C State 1 G T A C State 2 G T A C State 3 G T Restricting all to zero results in a mixture model k ,l Probability of different hidden states accounted for by the equilibrium distribution at the root Mixture models for spatial heterogeneity Pr( ) Pr( ) Pr( ) Investigating heterogeneity in groEL Data Herbeck et al. (2005) examined groEL sequences to investigate origins of primary endosymbionts Variability of GC content demonstrated to affect tree estimate There are 23 sequences of length 1572 nucleotides (all 3 codon positions) Investigating heterogeneity in groEL Investigating spatial heterogeneity Use mixture model ( k ,l set to 0) Examine 2 and 3 hidden states Relative importance of rate ( k to vary), nucleotide frequencies ( ~ k to vary), and Ts/Tv bias ( k to vary) Importance of all HKY parameters varying between classes Investigating simple temporal heterogeneity Single extra degree of freedom over mixture models Use ‘simple’ THMM ( k ,l set to equal) Relative importance of allowing different HKY parameters to vary temporally Investigating simple temporal heterogeneity GTR switching allows all k ,l to vary Results: groEL (no Γ- distribution) Number of hidden states Mixture model THMM 2 3 2 3 Rates 2225.9 (4353.7) 2442.1 (4784.3) 2227.7 (4355.5) 2443.8 (4785.6) Nucleotides 796.5 (1491.0) 1330.8 (2551.5) 796.5 (1489.0) 1339.0 (2566.1) κ 299.1 (500.3) 316.6 (533.2) 300.0 (499.9) 316.7 (531.4) All 2353.3 (4600.7) 2672.1 (5226.1) 2353.3 (4598.7) 2673.2 (5226.3) All with GTR switching - - - 2694.0 (5266.0) lnL(HKY) = -18579.6 Improvement in ln Lˆ over HKY lnL(HKY+dG) = -16209.9 (Improvement in AIC over HKY) It’s all about rate: frequencies A C G T A C G T A C G T A C G T A C G T A C G T C= Rate of transitions between hidden states relative to substitution rate 0.04 Results: groEL (with Γ- distribution) Number of hidden states Mixture model THMM 2 3 2 3 Rates 69.5 (135.1) 79.1 (152.2) 84.1 (162.3) 106.9 (205.9) Nucleotides 196.6 (385.1) 282.7 (549.3) 234.7 (459.5) 338.1 (658.1) κ 261.0 (518.0) 286.6 (567.2) 291.2 (576.3) 310.0 (612.0) All 278.6 (545.3) 433.3 (842.5) 302.8 (591.6) 442.8 (859.6) All with GTR switching - - - 456.4 (884.9) lnL(HKY) = -18579.6 Improvement in ln Lˆ over HKY+Γ lnL(HKY+dG) = -16209.9 (Improvement in AIC over HKY+Γ) THMM+Γ Frequencies State 1 State 2 State 3 A C G T A C G T A C G T A C State 1 G T A C State 2 G T A C State 3 G T C= Rate of transitions between hidden states relative to substitution rate 0.14 THMM+Γ All+H State 1 State 2 State 3 A C G T A C G T A C G T A C State 1 G T A C State 2 G T A C State 3 G T C= Rate of transitions between hidden states relative to substitution rate 0.07 More results: data from PANDIT Mixture models Number of GTR parameter switching 2 3 2 3 1338.9 3360.0 5060.7 5264.8 (2613.9) (1728.0) (10025.3) (10401.7) 4134.3 6293.6 6594.5 9629.7 (8140.7) (12331.2) (13029.1) (18971.3) 4186.0 4795.0 5705.7 6303.6 (8307.9) (9494.1) (11315.5) (12479.2) 6379.5 9299.0 9265.1 14001.8 14409.5 (12567.0) (18214.0) (18306.1) (27587.6) (28371.0) hidden states Rates Frequencies Kappa All Single switching 3 - - - Results extracted from 16 families (including groEL) with a total of 847 sequences and 24441 aligned columns of data. Σ lnL(HKY) = -1 053 026.8 Σ Improvement in ln Lˆ over HKY+Γ Σ lnL(HKY+dG) = -1 017 588.4 (Σ Improvement in AIC over HKY+Γ) Improvement = 35 438.4 More evolution = more heterogeneity? All+Γ with GTR switching DAIC per site relative to HKY+dG 3 81 51 2 38 83 70 52 60 64 24 1 23 23 74 77 48 33 46 0 0 25 50 Tree length (Looks similar for dN/dS and dN) 75 Γ-distributed RAS No RAS Rates 8 Rates 3 6 2 4 1 0 0 10 20 30 40 50 Frequencies 8 6 4 2 0 0 10 20 30 40 50 Kappa 8 6 4 2 0 0 10 20 30 40 50 All 8 6 Increase in AIC per site relative to HKY+Γ model Increase in AIC per site relative to HKY model 2 0 0 10 20 30 40 50 60 70 50 60 70 40 50 60 70 40 50 60 70 Frequencies 3 2 1 0 0 10 20 30 40 Kappa 3 2 1 0 0 10 20 30 All 3 2 4 1 2 0 0 0 10 20 30 Tree length 40 50 0 10 20 30 Tree length More evolution = more heterogeneity? Potential cause 1: Something wrong with the statistics ΔAIC per site relative to HKY(+Γ) is not correcting properly for improvements given by additional branches or something else… Some kind of systematic error as tree length grows, such as tree estimate accuracy Potential cause 2: Something biologically interesting As tree length grows the substitution process tends to appear more heterogeneous A C G T T T C G T A A C G T Time Rate = 0.5 A C G T A C G T Rate = 1.0 A C G T A C G T Rate = 2.0 A C G T T T C G T A A C G T Time Rate = 0.5 A C G T A C G T Rate = 1.0 A C G T A C G T Rate = 2.0 A C G T T T C G T A A C G T Time Rate = 0.5 A C G T A C G T Rate = 1.0 A C G T A C G T Rate = 2.0 A C G T T T C G T A A C G T Time Rate = 0.5 A C G T A C G T Rate = 1.0 A C G T A C G T Rate = 2.0 A simple model for describing complexity Degeneracy of the genetic code The degeneracy of the genetic code can leads to staccato patterns of evolution, particularly at the 3rd codon Present in nearly all analyses of nucleotide coding data Other types of complexity Any sequence where biological function places restrictions on how sites change and those restrictions have the potential to vary over time Hidden biological process 4-fold degeneracy 2-fold degeneracy 1-fold degeneracy Conclusions Temporal and spatial heterogeneity Spatial variation in rate masks other effects Most complex model provides best description of data in all cases Progression to 4 hidden state models provides further improvement, but runs into numerical optimisation problems Biological causes of heterogeneity May occur whenever there is biological function in sequence data Long evolutionary times may require (even) more sophisticated models THMMs could provide a simple framework for describing and drawing inferences from heterogeneity induced by complex dependencies