Transcript (PPT)

CS 394C
Algorithms for Computational Biology
Tandy Warnow
Fall 2009
Biology: 21st century’s Science
“We can safely say that the 20th century
has been the century of physics…. And,
according to the prominent Dutch
physicist, Frans Saris, the hegemony of
physics in the scientific world is over.
Saris proclaims the 21st century the
century of biology.”
http://www.freedomlab.org/2009/01/27/the-hegemony-ended-already/
Biology: 21st Century Science!
“When the human genome was
sequenced seven years ago, scientists
knew that most of the major scientific
discoveries of the 21st century would be
in biology.”
January 1, 2008, guardian.co.uk
Computational Biology:
Just about anything goes!
Biological topics:
• Function and structure prediction
• Gene clustering
• Metagenomics
• Microarray analysis
• Molecular modelling
• Multiple sequence alignment
• Neuroscience
• Phylogenetic estimation
• Protein docking
• Protein-protein interactions
• Sequence assembly
Computer science topics:
• Approximation algorithms
• Combinatorial optimization
• Computational analysis
• Computational complexity
• Computational geometry
• Computational image processing
• Computational topology
• Databases
• Data mining
• Graph theory and algorithms
• Machine learning
• Neural networks
• Probability theory
• Scientific visualization
Genome Sequencing Projects:
Started with the Human Genome Project
Whole Genome Shotgun Sequencing:
Graph Algorithms and Combinatorial Optimization!
Where did humans come from,
and how did they move
throughout the globe?
The 1000 Genomes Project: using
human genetic variation to better
treat diseases
Other Genome Projects! (Neandertals, Woolly
Mammoths, and more ordinary creatures…)
Metagenomics:
C. Venter et al., Exploring the Sargasso Sea:
Scientists Discover One Million New Genes in
Ocean Microbes
How did life evolve on earth?
Current methods often take months to
estimate trees on 1000 DNA sequences
Our objective:
More accurate trees and alignments
on 500,000 sequences in under a week
We prove theorems using graph theory
and probability theory, and our
algorithms are studied on real and
simulated data.
Courtesy of the Tree of Life project
Warnow and Linder Lab
This course
• Fundamental mathematics of phylogeny and
alignment estimation
• Applied research problems:
– Metagenomics
– Supertrees
– Simultaneous estimation of alignments and trees
– Reticulate evolution
– Historical linguistics
Phylogenetic trees can be
based upon morphology
But some estimations need DNA!
[Figure: tree on Orangutan, Gorilla, Chimpanzee, and Human]
Phylogenetic reconstruction methods
1. Polynomial time distance-based methods: Neighbor Joining, FastME,
Weighbor, etc.
2. Hill-climbing heuristics for NP-hard optimization criteria (Maximum
Parsimony and Maximum Likelihood)
3. Bayesian methods
[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]
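To make the Maximum Parsimony criterion named above concrete, here is a minimal sketch of Fitch's algorithm, which scores one fixed tree; a hill-climbing heuristic would score many candidate trees this way and keep the cheapest. The taxon names are borrowed from the earlier slide, but the four-site sequences and both tree shapes are invented for illustration only.

```python
def fitch_site(tree, states):
    """Return (possible_states, cost) for one character on a rooted binary tree.

    `tree` is a nested 2-tuple whose leaves are taxon names;
    `states` maps each taxon name to its state at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed.
        return common, left_cost + right_cost
    # The children disagree: one change is charged at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch cost over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical toy data (not taken from the slides).
alignment = {
    "Orangutan": "ACGT",
    "Gorilla": "ACGA",
    "Chimpanzee": "ATGA",
    "Human": "ATGA",
}
tree_a = ("Orangutan", ("Gorilla", ("Chimpanzee", "Human")))
tree_b = (("Orangutan", "Chimpanzee"), ("Gorilla", "Human"))

print(parsimony_score(tree_a, alignment))
print(parsimony_score(tree_b, alignment))
```

On this toy input tree_a needs fewer changes than tree_b, so a parsimony search would prefer it.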
But solving this problem exactly is …
unlikely
# of Taxa    # of Unrooted Trees
4            3
5            15
6            105
7            945
8            10395
9            135135
10           2027025
20           2.2 x 10^20
100          4.5 x 10^190
1000         2.7 x 10^2900
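The counts for small numbers of taxa follow the standard formula for unrooted binary trees on n labeled leaves, (2n-5)!! = 3 x 5 x ... x (2n-5). The sketch below evaluates that formula exactly; the function name is chosen here for illustration.

```python
def num_unrooted_binary_trees(n_taxa):
    """Number of unrooted binary trees on n labeled leaves: (2n-5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_binary_trees(n))
```

Because Python integers have arbitrary precision, the same function can be evaluated for 20, 100, or 1000 taxa to check the order of magnitude of the larger rows.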
“Boosting” MP heuristics
• We use “Disk-covering methods”
(DCMs) to improve heuristic searches
for MP and ML
[Diagram: base method M combined with a DCM gives DCM-M]
Rec-I-DCM3 significantly improves performance
(Roshan et al.)
[Figure: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2), versus hours (0 to 24). Curves: current best techniques; DCM boosted version of best techniques]
Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset
DCM1-boosting distance-based methods
[Nakhleh et al. ISMB 2001]
[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM1-NJ]
• Theorem: DCM1-NJ converges to the true tree from polynomial
length sequences
Indels and substitutions at the
DNA level
Deletion    Mutation
…ACGGTGCAGTTACCA…
…ACCAGTCACCA…
The true pairwise alignment is:
…ACGGTGCAGTTACCA…
…AC----CAGTCACCA…
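The example above can be replayed in a few lines: apply one deletion and one substitution to the ancestral sequence, and record a gap wherever a base was deleted. This is only a sketch; the deleted positions (the four bases GGTG) and the substituted position (T to C) are read off the slide's alignment, and the function name is made up here.

```python
def evolve_with_alignment(ancestor, deleted, substitutions):
    """Apply deletions and substitutions; return (descendant, aligned_row).

    deleted: set of 0-based ancestor positions that are removed.
    substitutions: dict mapping a 0-based ancestor position to its new base.
    """
    aligned = []
    for pos, base in enumerate(ancestor):
        if pos in deleted:
            aligned.append("-")  # a deleted base becomes a gap in the true alignment
        else:
            aligned.append(substitutions.get(pos, base))
    aligned_row = "".join(aligned)
    descendant = aligned_row.replace("-", "")  # the sequence we actually observe
    return descendant, aligned_row


ancestor = "ACGGTGCAGTTACCA"
descendant, aligned_row = evolve_with_alignment(
    ancestor, deleted={2, 3, 4, 5}, substitutions={10: "C"}
)
print(ancestor)
print(aligned_row)
print(descendant)
```

The true alignment is a record of the events themselves, which is why it cannot be read directly from the two observed sequences.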
The true multiple alignment on a set of
homologous sequences is obtained by tracing
their evolutionary history, and extending the
pairwise alignments on the edges to a
multiple alignment on the leaf sequences.
[Figure: leaf labels U, V, W, X, Y; sequences AGTGGAT, TATGCCCA, TATGACTT, AGCCCTA, AGCCCGCTT; tree drawn with leaves in the order X, U, Y, V, W]
SATé Algorithm
(Liu et al., Warnow-Linder lab)
SATé = Simultaneous Alignment and Tree Estimation, Science 2009
1. Obtain initial alignment and estimated ML tree T
2. Use new tree (T) to compute new alignment (A)
3. Estimate ML tree on new alignment (A), then return to step 2
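The cycle on this slide can be written as a short loop. The sketch below shows only the control flow: the alignment and ML tree estimators are passed in as callables, and the toy stand-ins used here (padding with gaps, a sorted tuple of names) are placeholders invented for illustration, not the methods SATé actually uses.

```python
def alternate(sequences, initial_alignment, ml_tree, realign, rounds=3):
    """Alternate tree estimation and re-alignment for a fixed number of rounds."""
    alignment = initial_alignment(sequences)   # initial alignment
    tree = ml_tree(alignment)                  # estimated ML tree T
    history = [(alignment, tree)]
    for _ in range(rounds):
        alignment = realign(sequences, tree)   # new alignment A computed from T
        tree = ml_tree(alignment)              # new tree T estimated on A
        history.append((alignment, tree))
    return alignment, tree, history


# Toy stand-ins so the loop can run (hypothetical, not real estimators).
def pad_alignment(seqs):
    width = max(len(s) for s in seqs.values())
    return {name: s.ljust(width, "-") for name, s in seqs.items()}


def toy_tree(alignment):
    return tuple(sorted(alignment))


def toy_realign(seqs, tree):
    return pad_alignment(seqs)


seqs = {"s1": "ACGT", "s2": "ACG", "s3": "AGT"}
alignment, tree, history = alternate(seqs, pad_alignment, toy_tree, toy_realign)
print(len(history), tree)
```

A real run would also need a stopping rule, for example a time limit such as the 24 hour analysis mentioned on the results slide.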
SATé Results on 1000 taxon
datasets
• 24 hour SATé analysis
• Other simultaneous estimation methods cannot run on
large datasets
Reconstructing the Tree of
Life
Challenges:
- millions of species
- lots of missing data
Two possible approaches:
- Combined Analysis
- Supertree Methods
Combined Analysis Methods
gene 1
S1    TCTAATGGAA
S2    GCTAAGGGAA
S3    TCTAAGGGAA
S4    TCTAACGGAA
S7    TCTAATGGAC
S8    TATAACGGAA
gene 2
S4    GGTAACCCTC
S5    GCTAAACCTC
S6    GGTGACCATC
S7    GCTAAACCTC
gene 3
S1    TATTGATACA
S3    TCTTGATACC
S4    TAGTGATGCA
S7    TAGTGATGCA
S8    CATTCATACC
Combined Analysis
      gene 1      gene 2      gene 3
S1    TCTAATGGAA  ??????????  TATTGATACA
S2    GCTAAGGGAA  ??????????  ??????????
S3    TCTAAGGGAA  ??????????  TCTTGATACC
S4    TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5    ??????????  GCTAAACCTC  ??????????
S6    ??????????  GGTGACCATC  ??????????
S7    TCTAATGGAC  GCTAAACCTC  TAGTGATGCA
S8    TATAACGGAA  ??????????  CATTCATACC
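The combined matrix above can be built mechanically: concatenate the gene alignments taxon by taxon, and fill in question marks wherever a taxon has no sequence for a gene. This is a minimal sketch using the three gene alignments from the slides; the function name and the dictionary layout are assumptions made for illustration.

```python
def concatenate_genes(gene_alignments, missing="?"):
    """Concatenate per-gene alignments into one matrix, padding absent taxa."""
    taxa = sorted({taxon for gene in gene_alignments for taxon in gene})
    rows = {taxon: [] for taxon in taxa}
    for gene in gene_alignments:
        length = len(next(iter(gene.values())))  # all rows of a gene share one length
        for taxon in taxa:
            rows[taxon].append(gene.get(taxon, missing * length))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}


gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
gene2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC",
         "S6": "GGTGACCATC", "S7": "GCTAAACCTC"}
gene3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "TAGTGATGCA", "S8": "CATTCATACC"}

combined = concatenate_genes([gene1, gene2, gene3])
for taxon, row in combined.items():
    print(taxon, row)
```

Only S4 and S7 have data for all three genes, which illustrates the "lots of missing data" challenge listed earlier.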
Two competing approaches
[Diagram: species with data for gene 1, gene 2, ..., gene k. One path: Combined Analysis. Other path: analyze each gene separately, then apply a Supertree Method]
Many Supertree Methods
Matrix Representation with Parsimony (MRP) is the most commonly used
• MRP
• weighted MRP
• Min-Cut
• Modified Min-Cut
• Semi-strict Supertree
• MRF
• MRD
• QILI
• SDM
• Q-imputation
• PhySIC
• Majority-Rule Supertrees
• Maximum Likelihood Supertrees
• and many more ...
SuperFine: a new supertree method
(Swenson, et al., Warnow-Linder Lab)
Input: unrooted source trees
Output: unrooted tree on the entire set
of taxa
Not yet submitted for publication
Algorithmic strategy of
SuperFine
• First, construct a supertree with low
false positives (the Strict Consensus
Merger)
• Then, refine the tree to reduce false
negatives by resolving each
polytomy (Quartet Max Cut)
[Figures: two plots of False Negative Rate versus Scaffold Density (%)]
Running Time
SuperFine vs. MRP
[Figures: three plots of running time versus Scaffold Density (%)]
MRP 8-12 sec.
SuperFine 2-3 sec.
An open problem in supertree
estimation
• Input: a collection of source trees on
subsets of a set S of taxa
• Output: tree T on the full set of taxa that
minimizes the sum of the topological
distances to the source trees
This is NP-hard.
How can we solve this effectively?
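To make the objective concrete, the sketch below scores a candidate supertree against a set of source trees. It assumes the topological distance is the Robinson-Foulds distance (the number of bipartitions found in one tree but not the other), computed after restricting the supertree to each source tree's taxa; the slide does not name a specific distance. The trees and taxon names are invented toy data.

```python
def leaf_set(tree):
    """All taxon names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree, keep):
    """Nontrivial bipartitions of `tree` after restricting it to the taxa in `keep`."""
    keep = frozenset(keep)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            below = frozenset([node]) & keep
        else:
            below = frozenset().union(*(visit(child) for child in node))
        other = keep - below
        # Keep only splits with at least two taxa on each side.
        if len(below) >= 2 and len(other) >= 2:
            splits.add(frozenset([below, other]))
        return below

    visit(tree)
    return splits


def rf_distance(supertree, source_tree):
    """Robinson-Foulds distance between source_tree and the restricted supertree."""
    taxa = leaf_set(source_tree)
    return len(bipartitions(supertree, taxa) ^ bipartitions(source_tree, taxa))


def total_distance(supertree, source_trees):
    """The objective: sum of distances from the supertree to every source tree."""
    return sum(rf_distance(supertree, t) for t in source_trees)


# Hypothetical source trees on overlapping taxon subsets.
sources = [(("a", "b"), ("c", "d")), (("c", "d"), ("e", "f"))]
candidate_1 = (("a", "b"), (("c", "d"), ("e", "f")))
candidate_2 = (("a", "c"), (("b", "d"), ("e", "f")))

print(total_distance(candidate_1, sources))
print(total_distance(candidate_2, sources))
```

Scoring one candidate is fast; the hard part is that the number of candidate supertrees grows super-exponentially with the number of taxa.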
Historical linguistics
• Languages evolve, just like biological
species.
• How can we determine how languages
evolve?
• How can we use information on
language evolution to determine how
human populations moved across the
globe?
Questions about
Indo-European (IE)
• How did the IE family of languages evolve?
• Where is the IE homeland?
• When did Proto-IE “end”?
• What was life like for the speakers of proto-Indo-European (PIE)?
The Anatolian hypothesis
(from wikipedia.org)
Date for PIE ~7000 BCE
The Kurgan Expansion
• Date of PIE ~4000 BCE.
• Map of Indo-European migrations from ca. 4000 to 1000 BC
according to the Kurgan model
• From http://indo-european.eu/wiki
Estimating the date and homeland of the
proto-Indo-Europeans
• Step 1: Estimate the phylogeny
• Step 2: Reconstruct words for proto-Indo-European (and for intermediate
proto-languages)
• Step 3: Use archaeological evidence to
constrain dates and geographic
locations of the proto-languages
“Perfect Phylogenetic Network”
(Nakhleh et al., Language)
Reticulate evolution
• Not all evolution is tree-like:
– Horizontal gene transfer
– Hybrid speciation
• How can we detect reticulate evolution?
Metagenomics
• Input: set of sequences
• Output: a tree on the set of sequences,
indicating the species identification of
each sequence
• Issue: the sequences are not globally
alignable, and there are often
thousands (or more) of the sequences
Course Details
• Phylogeny and multiple sequence
alignment are the basis of almost
everything in the course
• The first 1/3 of the class will provide the
basics of the material
• The next 2/3 will go into depth on
selected topics
Course details
• There is no textbook
• I will provide notes (online)
• We will read papers from the scientific
literature
Your course project will most likely be a
literature survey.
If you prefer, you can engage in research!
Grading
• Homeworks 50%
• Final exam 25%
• Class participation 10%
• Class project 15%