Transcript (PPT)
CS 394C
Algorithms for Computational Biology
Tandy Warnow
Fall 2009

Biology: 21st century’s Science
“We can safely say that the 20th century has been the century of physics….And, according to the prominent Dutch physicist, Frans Saris, the hegemony of physics in the scientific world is over. Saris proclaims the 21st century the century of biology.”
http://www.freedomlab.org/2009/01/27/the-hegemony-ended-already/

Biology: 21st Century Science!
“When the human genome was sequenced seven years ago, scientists knew that most of the major scientific discoveries of the 21st century would be in biology.”
January 1, 2008, guardian.co.uk

Computational Biology: Just about anything goes!
Biological topics:
• Function and structure prediction
• Gene clustering
• Metagenomics
• Microarray analysis
• Molecular modelling
• Multiple sequence alignment
• Neuroscience
• Phylogenetic estimation
• Protein docking
• Protein-protein interactions
• Sequence assembly
Computer science topics:
• Approximation algorithms
• Combinatorial optimization
• Computational analysis
• Computational complexity
• Computational geometry
• Computational image processing
• Computational topology
• Databases
• Data mining
• Graph theory and algorithms
• Machine learning
• Neural networks
• Probability theory
• Scientific visualization

Genome Sequencing Projects: Started with the Human Genome Project

Whole Genome Shotgun Sequencing: Graph Algorithms and Combinatorial Optimization!

Where did humans come from, and how did they move throughout the globe?
The 1000 Genomes Project: using human genetic variation to better treat diseases

Other Genome Projects! (Neandertals, Woolly Mammoths, and more ordinary creatures…)

Metagenomics:
C. Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

How did life evolve on earth?
Current methods often use months to estimate trees on 1000 DNA sequences.
Our objective: More accurate trees and alignments on 500,000 sequences in under a week.
We prove theorems using graph theory and probability theory, and our algorithms are studied on real and simulated data.
Courtesy of the Tree of Life project
Warnow and Linder Lab

This course
• Fundamental mathematics of phylogeny and alignment estimation
• Applied research problems:
  – Metagenomics
  – Supertrees
  – Simultaneous estimation of alignments and trees
  – Reticulate evolution
  – Historical linguistics

Phylogenetic trees can be based upon morphology

But some estimations need DNA!
[Figure: tree on Orangutan, Gorilla, Chimpanzee, Human]

Phylogenetic reconstruction methods
1. Polynomial time distance-based methods: Neighbor Joining, FastME, Weighbor, etc.
2. Hill-climbing heuristics for NP-hard optimization criteria (Maximum Parsimony and Maximum Likelihood)
3. Bayesian methods
[Figure: Cost over the space of phylogenetic trees, showing a local optimum and the global optimum]

But solving this problem exactly is … unlikely

# of Taxa    # of Unrooted Trees
4            3
5            15
6            105
7            945
8            10395
9            135135
10           2027025
20           2.2 x 10^20
100          4.5 x 10^190
1000         2.7 x 10^2900

“Boosting” MP heuristics
• We use “Disk-covering methods” (DCMs) to improve heuristic searches for MP and ML
[Figure: Base method M, combined with DCM, gives DCM-M]

Rec-I-DCM3 significantly improves performance (Roshan et al.)
[Figure: Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset. x-axis: Hours (0 to 24). y-axis: Average MP score above optimal, shown as a percentage of the optimal (0 to 0.2). Curves: Current best techniques; DCM boosted version of best techniques]

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]
[Figure: Error Rate (0 to 0.8) against No. Taxa (0 to 1600). Curves: NJ; DCM1-NJ]
• Theorem: DCM1-NJ converges to the true tree from polynomial length sequences

Indels and substitutions at the DNA level
[Figure: …ACGGTGCAGTTACCA… evolves by a Deletion and a Mutation into …ACCAGTCACCA…]
The true pairwise alignment is:
…ACGGTGCAGTTACCA…
…AC----CAGTCACCA…

The true multiple alignment on a set of homologous sequences is obtained by tracing their evolutionary history, and extending the pairwise alignments on the edges to a multiple alignment on the leaf sequences.
[Figure: sequences labelled U, V, W, X, Y (AGTGGAT, TATGCCCA, TATGACTT, AGCCCTA, AGCCCGCTT) and a tree with leaves X, U, Y, V, W]

SATé Algorithm (Liu et al., Warnow-Linder lab)
SATé = Simultaneous Alignment and Tree Estimation, Science 2009
• Obtain initial alignment and estimated ML tree T
• Use new tree (T) to compute new alignment (A)
• Estimate ML tree on new alignment (A)
[Diagram: the tree T feeds the alignment step and the alignment A feeds the tree step, in a loop]

SATé Results on 1000 taxon datasets
• 24 hour SATé analysis
• Other simultaneous estimation methods cannot run on large datasets

Reconstructing the Tree of Life
Challenges:
- millions of species
- lots of missing data
Two possible approaches:
- Combined Analysis
- Supertree Methods

Combined Analysis Methods

gene 1            gene 2            gene 3
S1 TCTAATGGAA     S4 GGTAACCCTC     S1 TATTGATACA
S2 GCTAAGGGAA     S5 GCTAAACCTC     S3 TCTTGATACC
S3 TCTAAGGGAA     S6 GGTGACCATC     S4 TAGTGATGCA
S4 TCTAACGGAA     S7 GCTAAACCTC     S7 TAGTGATGCA
S7 TCTAATGGAC                       S8 CATTCATACC
S8 TATAACGGAA

Combined Analysis

     gene 1       gene 2       gene 3
S1   TCTAATGGAA   ??????????   TATTGATACA
S2   GCTAAGGGAA   ??????????   ??????????
S3   TCTAAGGGAA   ??????????   TCTTGATACC
S4   TCTAACGGAA   GGTAACCCTC   TAGTGATGCA
S5   ??????????   GCTAAACCTC   ??????????
S6   ??????????   GGTGACCATC   ??????????
S7   TCTAATGGAC   GCTAAACCTC   TAGTGATGCA
S8   TATAACGGAA   ??????????   CATTCATACC

Two competing approaches
[Figure: a Species by gene matrix (gene 1, gene 2, ..., gene k). One path is Combined Analysis; the other is to analyze each gene separately and then apply a Supertree Method]

Many Supertree Methods
• MRP (Matrix Representation with Parsimony; most commonly used)
• weighted MRP
• Min-Cut
• Modified Min-Cut
• Semi-strict Supertree
• MRF
• MRD
• QILI
• SDM
• Q-imputation
• PhySIC
• Majority-Rule Supertrees
• Maximum Likelihood Supertrees
• and many more ...

SuperFine: a new supertree method (Swenson, et al., Warnow-Linder Lab)
Input: unrooted source trees
Output: unrooted tree on the entire set of taxa
Not yet submitted for publication

Algorithmic strategy of SuperFine
• First, construct a supertree with low false positives (The Strict Consensus Merger)
• Then, refine the tree to reduce false negatives by resolving each polytomy (Quartet Max Cut)

[Figures: two plots of False Negative Rate against Scaffold Density (%)]

Running Time: SuperFine vs. MRP
[Figures: three plots of running time against Scaffold Density (%)]
MRP 8-12 sec.
SuperFine 2-3 sec.

An open problem in supertree estimation
• Input: a collection of source trees on subsets of a set S of taxa
• Output: tree T on the full set of taxa that minimizes the sum of the topological distances to the source trees
This is NP-hard. How can we solve this effectively?

Historical linguistics
• Languages evolve, just like biological species.
• How can we determine how languages evolve?
• How can we use information on language evolution to determine how human populations moved across the globe?
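
Note on the open supertree problem stated above: the following is a minimal sketch of how a candidate supertree could be scored, not the method of any paper cited in the slides. It assumes that the "topological distance" is the Robinson-Foulds (bipartition) distance between a source tree and the candidate restricted to that source tree's taxa; the slides do not say which distance is meant. Trees are written as nested tuples of taxon names, and all function names are hypothetical. The sketch only evaluates the objective; it does not search for the best tree, which is the NP-hard part.

```python
from itertools import chain


def leaves(tree):
    """Return the set of taxon names at the leaves of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset(chain.from_iterable(leaves(child) for child in tree))


def restrict(tree, keep):
    """Prune the tree to the taxa in `keep`, suppressing nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(kids)


def bipartitions(tree):
    """Return the nontrivial bipartitions of the tree, each stored as the
    side that does not contain the alphabetically smallest taxon."""
    all_taxa = leaves(tree)
    anchor = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if anchor in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 2 <= len(side) <= len(all_taxa) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance between two trees on the same taxon set."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


def supertree_score(candidate, source_trees):
    """Sum of RF distances from each source tree to the restricted candidate."""
    return sum(
        rf_distance(restrict(candidate, leaves(src)), src) for src in source_trees
    )


# Hypothetical source trees on overlapping subsets of the taxa A..F.
sources = [(("A", "B"), ("C", "D")), (("C", "D"), ("E", "F"))]
good = ((("A", "B"), ("C", "D")), ("E", "F"))
bad = ((("A", "C"), ("B", "D")), ("E", "F"))
print(supertree_score(good, sources), supertree_score(bad, sources))
```

The first candidate agrees with both source trees, so it scores lower than the second, which breaks up the A, B grouping of the first source tree.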

Questions about Indo-European (IE)
• How did the IE family of languages evolve?
• Where is the IE homeland?
• When did Proto-IE “end”?
• What was life like for the speakers of proto-Indo-European (PIE)?

The Anatolian hypothesis (from wikipedia.org)
Date for PIE ~7000 BCE

The Kurgan Expansion
• Date of PIE ~4000 BCE.
• Map of Indo-European migrations from ca. 4000 to 1000 BC according to the Kurgan model
• From http://indo-european.eu/wiki

Estimating the date and homeland of the proto-Indo-Europeans
• Step 1: Estimate the phylogeny
• Step 2: Reconstruct words for proto-Indo-European (and for intermediate proto-languages)
• Step 3: Use archaeological evidence to constrain dates and geographic locations of the proto-languages

“Perfect Phylogenetic Network” (Nakhleh et al., Language)

Reticulate evolution
• Not all evolution is tree-like:
  – Horizontal gene transfer
  – Hybrid speciation
• How can we detect reticulate evolution?

Metagenomics
• Input: set of sequences
• Output: a tree on the set of sequences, indicating the species identification of each sequence
• Issue: the sequences are not globally alignable, and there are often thousands (or more) of the sequences

Course Details
• Phylogeny and multiple sequence alignment are the basis of almost everything in the course
• The first 1/3 of the class will provide the basics of the material
• The next 2/3 will go into depth on selected topics

Course details
• There is no textbook
• I will provide notes (online)
• We will read papers from the scientific literature
Your course project will most likely be a literature survey. If you prefer, you can engage in research!

Grading
• Homeworks 50%
• Final exam 25%
• Class participation 10%
• Class project 15%
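
Note on the table of unrooted tree counts from the "solving this problem exactly is … unlikely" slide: the small rows can be reproduced with the standard formula (2n-5)!! for the number of unrooted binary trees on n labelled taxa. The formula is a well-known result and is not stated on the slide, and the function name below is hypothetical. This is a minimal sketch for checking the exact rows (4 to 10 taxa) and the order of magnitude for 20 taxa.

```python
def num_unrooted_binary_trees(n_taxa):
    """Count unrooted binary trees on n_taxa labelled leaves as (2n-5)!!,
    the product of the odd numbers 1, 3, ..., 2n-5."""
    if n_taxa < 3:
        raise ValueError("the formula applies to 3 or more taxa")
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (4, 5, 6, 7, 8, 9, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Python integers are arbitrary precision, so the same function can be called for larger numbers of taxa without overflow.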