Inferring Evolutionary History with Network Models in Population Genomics: Challenges and Progress
Yufeng Wu
Dept. of Computer Science and Engineering, University of Connecticut, USA
Dagstuhl Seminar, 2010

Recombination
• One of the principal genetic forces shaping sequence variation within species
• Two equal-length sequences generate a third, new sequence of the same length in the genealogy
• Spatial order is important: different parts of the genome are inherited from different ancestors
[Figure: the recombinant 110000000001111 is formed from a prefix of 110001111111001 and a suffix of 000110000001111, joined at a breakpoint]

Ancestral Recombination Graph (ARG)
• Network model: goes beyond the tree model
• Assumption: at most one mutation per site
[Figure: two example genealogies with mutations and a recombination marked; the leaf sequences are S1 = 00, S2 = 01, S3 = 10, S4 = 10 in one and S1 = 00, S2 = 01, S3 = 10, S4 = 11 in the other]

Reconstruction of Network-based Evolutionary History
Different formulations
• Input: DNA sequences (haplotypes) or phylogenetic trees
• Biology: meiotic recombination in populations, or reticulate evolutionary processes such as horizontal gene transfer or hybrid speciation
Same objective
• Reconstruct the network-based evolutionary history (and solve related problems)
• Efficiency
• Accuracy

Reconstructing ARGs by Parsimony
[Figure: Kreitman's data for the Adh locus of D. melanogaster (1983)]
• Input: a set of binary sequences M
• Goal: reconstruct ARGs deriving M
• Parsimony formulation
  – minARG: minimize the number of recombination events
  – NP-complete (Wang, et al)

The minARG Problem
• Structurally constrained ARGs, e.g. galled trees (Wang, et al; Gusfield, et al), which simplify the ARG topology
• Heuristic methods, e.g. the program MARGARITA (Durbin, et al.), Song, et al., Parida, et al.
• Exact minARG by branch and bound (Lyngso, Song and Hein)
• Uniform sampling of minARGs, treating each minARG as equally likely (Wu)
• Estimating the range of minARGs: lower and upper bounds

minARG for Kreitman's data
• Rmin(M): minimum number of recombinations for M
• L(M): lower bound on Rmin(M)
• U(M): upper bound on Rmin(M)
• Several lower bounds give L(M) = 7.
• U(M) = 7 for Kreitman's data (Song, Wu and Gusfield). The bounds meet, so Rmin(M) = 7.
• Challenge: accurate inference of ARGs

ARG Induces Local Trees
• Local tree: the evolutionary history at a single genomic position
• To obtain it, trace backwards in time; at a recombination node, pick the branch that passes alleles to the recombinant at this position
[Figure: an ARG with mutations and a recombination marked, and the local tree near site 3]

Local Trees Change Across the Genome
• Local trees change when moving across recombination breakpoints.
• Spatial property: nearby local trees tend to be more similar.
• How good are the inferred ARGs? Compare the inferred local tree topologies with the simulated trees.
[Figure: the same ARG with the local tree near site 2]

Inferring Local Trees
• Problem: given binary sequences, infer local tree topologies (one tree for each site, ignoring branch lengths)
• Key: local trees have different topologies because of recombination
• Trees or network? The full network is not reconstructed; local trees are already very informative
• Parsimony-based approaches
  – Hein (1990, 1993), Song and Hein (2005)
  – Wu (2010): shared topological features in nearby trees
• Accuracy: Robinson-Foulds distance between each inferred tree and the simulated tree
• Challenge: how to improve the accuracy?

RENT: REfining Neighboring Trees
• Maintain for each SNP site a (possibly nonbinary) tree topology
  – Initialize to a tree containing the split induced by the SNP
• Gradually refine the trees by adding new splits
  – Splits are found by a set of rules (described later)
  – Splits added early may be more reliable
• Stop when the trees are binary or enough information is recovered

A Little Background: Compatibility
M   A B C
a   0 0 0
b   1 0 0
c   0 0 1
d   1 0 1
e   0 1 1
• Sites A and B are compatible, but A and C are incompatible.
• Two sites (columns) p, q are incompatible if columns p and q together contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.
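The pairwise test above (often called the four-gamete test) can be sketched in a few lines. This is only an illustration on the five-row matrix M from the slide; the function name `sites_compatible` and the string encoding of rows are assumptions made here, not part of RENT.

```python
def sites_compatible(matrix, p, q):
    """Four-gamete test for binary data.

    Columns p and q are incompatible only if all four ordered pairs
    00, 01, 10, 11 occur among the rows; otherwise they are compatible.
    """
    gametes = {(row[p], row[q]) for row in matrix}
    return len(gametes) < 4


# Rows a..e over sites A, B, C, as in the compatibility slide.
M = ["000", "100", "001", "101", "011"]
A, B, C = 0, 1, 2

print(sites_compatible(M, A, B))  # True: only 00, 10, 01 occur
print(sites_compatible(M, A, C))  # False: 00, 10, 01, 11 all occur
```

The check costs one pass over the rows per pair of sites, so testing all pairs of m sites over n sequences takes O(n m^2) time.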
• Compatibility is easily extended from sites to splits.

Fully-Compatible Region: Simple Case
• A region of consecutive SNP sites in which the SNPs are pairwise compatible
  – May indicate that no topology-altering recombination occurred within the region
• Rule: for site s, add any such split (from a site in the same fully-compatible region) to the tree at s
  – Compatibility is a very strong property and is unlikely to arise by chance
[Figure: sites A, B, C]

Split Propagation: More General Rule
• Three consecutive sites A, B and C; sites A and B are incompatible. Does site C matter for the tree at site A?
  – The trees at sites A and B are different.
  – Suppose site C is compatible with both A and B. Then site C may indicate a subtree shared by the trees at sites A and B.
• Rule: a split propagates in both directions until it reaches an incompatible tree.
[Figure: sites A, B, C]

Reticulate Networks
• Gene trees: phylogenetic trees built from gene sequences
  – Assumed binary and rooted
  – Different genes can have different topologies
[Figure: sequences of taxa 1 to 4 at Gene A (000, 001, 110, 100) and Gene B (000, 101, 010, 001), with the rooted gene trees T and T']
• Reticulate evolution is one explanation: hybrid speciation, horizontal gene transfer
• Reticulate network: a directed acyclic graph displaying each of the gene trees
• Hybridization event: a node with in-degree two or more
[Figure: a reticulate network on taxa 1 to 4, with the trees obtained by keeping the two red edges and by keeping the two black edges]

The Minimum Reticulation Problem
• Given: a set G of K gene trees
• Problem: reconstruct a reticulate network that displays each gene tree using Rmin(G) reticulation events, the minimum number possible
• NP-complete, even for K = 2
[Figure: gene trees T1, T2, T3 on taxa 1 to 4 and a network N]
• Challenge: efficient and accurate reconstruction of reticulate networks for multiple trees
Current approaches:
• Exact methods for the K = 2 case (see Semple, et al)
• Imposing topological constraints (e.g. galled networks, see Huson, et al.)
• Close lower and upper bounds for an arbitrary number of trees (Wu, 2010)

Performance of PIRN: Optimal Solution
[Figure: horizontal axis: number of taxa; vertical axis: % of data sets with LB = UB; K: number of trees; r: level of reticulation]
• The lower and upper bounds often match

Performance of PIRN: Gap of Bounds
[Figure: horizontal axis: number of taxa; vertical axis: gap between lower and upper bounds; K: number of trees; r: level of reticulation]
• The gap between the lower and upper bounds is often small

Reticulate Network for Five Poaceae Trees
• Gene trees: ndhF, phyB, rbcL, rpoC2, ITS
• Lower bound: 11; upper bound: 13
[Figure: a reticulate network for the five trees that uses 13 reticulation events, the upper bound]

Acknowledgement
• More information available at: http://www.engr.uconn.edu/~ywu
• Research supported by the National Science Foundation and the UConn Research Foundation

Coalescent with Recombination
• Coalescent theory defines a probability distribution over genealogies
• Likelihood computation for the coalescent with recombination
  – Likelihood: the sum of the probabilities of all ARGs under the given parameters
  – Challenging: too many ARGs (Lyngso, Song and Hein)
• Importance sampling approach: draw samples (ARGs) with respect to some probability distribution
  – Works well with no recombination
  – Does not work well with recombination

Coalescent-based ARG Sampling
• Uniform sampling of minARGs (Wu, 2007)
  – Treat each minARG as equally likely
  – Algorithm for generating a minARG uniformly at random (exponential time for setup, but polynomial time per sample)
[Figure: probability of ARGs under certain parameters]
• Challenge: develop a more general ARG sampling method that can efficiently sample ARGs approximately according to coalescent probabilities
• A related problem: compute the coalescent likelihood with recombination efficiently
• Recent work: exact computation of the coalescent likelihood under the infinite sites model with no recombination (Wu, 2009)

The Mosaic Model
• Assumption: the input sequences are descendants of K founder sequences (unknown)
• Extant sequences: concatenations of exact copies of founder segments (no shift of position)
• Coloring: assign to each position of a sequence the founder (color) it comes from; the assignment must be consistent
[Figure: input sequences M = 0000, 0101, 0111, 1111, 1110 with K = 2; 5 breakpoints in total]

The Minimum Mosaic Problem
• Problem: given a set of binary sequences and the number of founders K, find a K-coloring of these sequences that minimizes the number of color changes (recombination breakpoints)
• Also find the K founder sequences (not part of the input)
[Figure: inferred founders for data from Rastas and Ukkonen: 20 sequences, 40 sites; the minimum number of breakpoints is 55]
• Introduced by Ukkonen (2002)
• Simple and easy to visualize
• Main known results
  – An exponential-time algorithm that runs in polynomial time for K = 2 (Ukkonen 2002)
  – An exact method that works for relatively small K and modest-sized data (Wu and Gusfield, 2007)
  – Haplovisual program and other extensions by Rastas and Ukkonen (2007)
  – Heuristic algorithm by Roli and Blum (2009)
  – Lower bounds on the minimum number of breakpoints needed (Wu, 2010)
• Challenges
  – Polynomial-time algorithm for K ≥ 3?
  – Concrete applications in biology?
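To make the mosaic objective concrete, the sketch below solves tiny instances by brute force. It is not Ukkonen's algorithm or any of the published methods listed above: it simply enumerates every choice of K founders (exponential in the number of sites) and, for each choice, uses a small dynamic program to count the fewest founder switches needed to spell each input sequence. The names `min_breakpoints` and `minimum_mosaic_bruteforce` are hypothetical, and the input is the five 4-site sequences as they appear in the transcript of the mosaic model slide.

```python
from itertools import combinations_with_replacement, product

INF = float("inf")


def min_breakpoints(seq, founders):
    """Fewest founder switches needed to copy seq from the founders.

    cost[f] is the fewest switches so far if the current position is
    copied from founder f. Returns INF if some position of seq matches
    no founder.
    """
    cost = [0 if f[0] == seq[0] else INF for f in founders]
    for i in range(1, len(seq)):
        cheapest = min(cost)
        # Either stay on the same founder, or switch at a cost of 1.
        cost = [
            min(cost[j], cheapest + 1) if f[i] == seq[i] else INF
            for j, f in enumerate(founders)
        ]
    return min(cost)


def minimum_mosaic_bruteforce(seqs, k):
    """Try every multiset of k binary founders; only viable for tiny inputs."""
    m = len(seqs[0])
    candidates = ["".join(bits) for bits in product("01", repeat=m)]
    best_total, best_founders = INF, None
    for founders in combinations_with_replacement(candidates, k):
        total = sum(min_breakpoints(s, founders) for s in seqs)
        if total < best_total:
            best_total, best_founders = total, founders
    return best_total, best_founders


seqs = ["0000", "0101", "0111", "1111", "1110"]
total, founders = minimum_mosaic_bruteforce(seqs, 2)
print(total, founders)  # 5 ('0000', '1111')
```

On this input the search returns 5 breakpoints, which agrees with the total shown on the mosaic model slide. Whether the slide used the founders 0000 and 1111 cannot be recovered from the transcript.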