Phylogenetic Trees - Parsimony
Tutorial #12
Next semester:
Project in advanced algorithms for phylogenetic reconstruction (236512)
Initial details in: http://www.cs.technion.ac.il/~moran/lab06.htm
- Come to me for more details -
Phylogenetic Reconstruction
We’d like to study the evolutionary history of species
Distance-based approach:
• Calculate (ML) pairwise (evolutionary) distances between species
• Find the edge-weighted tree best describing this metric
Major drawback:
• Loss of information when reducing data to pairwise distances
Character-based approach:
• Consider the character vector of each species:
– morphological characters
– bio-molecular characters
• Optimization criteria:
– parsimony
– likelihood / posterior-probability
Most Parsimonious Tree
Parsimony-score:
Number of character-changes (mutations) along the evolutionary tree
(tree containing labels on internal vertices)
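Example:
[Figure: the same tree with two different labelings of its internal vertices; each edge is annotated with the number of changes along it (0, 1 or 2). One labeling gives Score = 4, the other Score = 3. Sequences appearing in the figure: AAA, AAG, AGA, GGA.]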
Most parsimonious tree:
→ Tree with minimal parsimony score
Minimal Evolution Principle
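The parsimony score of a fully labeled tree can be sketched in a few lines of Python. This is only an illustration under assumptions: the edge-list representation and the names `hamming`, `parsimony_score` and `toy_edges` are hypothetical, and the toy tree is made up rather than taken from the figure.

```python
# Hypothetical representation: a labeled tree is a list of edges,
# each edge a pair (parent_sequence, child_sequence).

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def parsimony_score(edges) -> int:
    """Total number of character changes over all edges of a labeled tree."""
    return sum(hamming(parent, child) for parent, child in edges)


# Toy tree (an assumption, not the tree from the slide):
# root AAA with two internal vertices and four leaves.
toy_edges = [
    ("AAA", "AAA"),  # root -> left internal vertex, 0 changes
    ("AAA", "AGA"),  # root -> right internal vertex, 1 change
    ("AAA", "AAA"),  # left internal vertex -> leaf, 0 changes
    ("AAA", "AAG"),  # left internal vertex -> leaf, 1 change
    ("AGA", "AGA"),  # right internal vertex -> leaf, 0 changes
    ("AGA", "GGA"),  # right internal vertex -> leaf, 1 change
]

print(parsimony_score(toy_edges))  # 3
```

Relabeling the internal vertices changes the edge list and hence the score, which is why the choice of internal labels matters.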
Small vs. Large Parsimony
We break the problem into two:
1. Small parsimony: Given the topology, find the best assignment to internal nodes
2. Large parsimony: Find the topology which gives the best score
→ Large parsimony is NP-hard
→ We’ll show solution to small parsimony (Fitch and Sankoff’s algorithms)
Input to small parsimony:
tree with character-state assignments to leaves
Example:
A (Aardvark): CAGGTA
B (Bison):    CAGACA
C (Chimp):    CGGGTA
D (Dog):      TGCACT
E (Elephant): TGCGTA
Fitch’s Algorithm
Execute independently for each character:
1. Bottom-up phase: Determine set of possible states for each internal node
2. Top-down phase: Pick states for each internal node
Dynamic Programming framework
[Figure: the example tree with leaves Aardvark, Bison, Chimp, Dog and Elephant and their sequences (CAGGTA, CAGACA, CGGGTA, TGCACT, TGCGTA).]
Fitch’s Algorithm
Bottom-up phase
Determine set of possible states for each internal node
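• Initialization: Ri = {si}
• Do a post-order (from leaves to root) traversal of tree
– Determine Ri of internal node i with children j, k:
\[
R_i =
\begin{cases}
R_j \cap R_k & \text{if } R_j \cap R_k \neq \emptyset \\
R_j \cup R_k & \text{otherwise}
\end{cases}
\]
Parsimony-score = # union operations
[Figure: example tree with leaf states C, T, G, T, A, T; the sets computed at the internal nodes are CT, GT, AGT, T and T; score = 3.]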
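The bottom-up phase for a single character might be sketched as follows. This is a minimal sketch under assumptions: the nested-tuple tree encoding, the function name `fitch_bottom_up` and the leaf names `l1`..`l6` are hypothetical, and the topology is only a guess chosen so that the computed sets are CT, GT, AGT, T and T.

```python
def fitch_bottom_up(tree, leaf_state):
    """Return (state sets, number of union operations) for one character.

    tree: nested 2-tuples for internal nodes, strings for leaf names.
    leaf_state: dict mapping leaf name -> observed state.
    """
    sets = {}    # node -> candidate set R_i
    unions = 0   # every union operation adds one to the parsimony score

    def visit(node):
        nonlocal unions
        if isinstance(node, str):               # leaf: R_i = {s_i}
            sets[node] = {leaf_state[node]}
            return sets[node]
        left, right = node
        r_j, r_k = visit(left), visit(right)    # children first (post-order)
        common = r_j & r_k
        if common:                              # non-empty intersection
            sets[node] = common
        else:                                   # empty intersection: take the union
            sets[node] = r_j | r_k
            unions += 1
        return sets[node]

    visit(tree)
    return sets, unions


# Hypothetical six-leaf tree with leaf states C, T, T, A, G, T.
tree = ((("l1", "l2"), "l3"), ("l4", ("l5", "l6")))
leaf_state = {"l1": "C", "l2": "T", "l3": "T", "l4": "A", "l5": "G", "l6": "T"}

sets, score = fitch_bottom_up(tree, leaf_state)
print(score)       # 3
print(sets[tree])  # {'T'}
```

For a full sequence, the same function would be called once per character and the scores summed.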
Fitch’s Algorithm
Top-down phase
Pick states for each internal node
• Pick arbitrary state in Rroot for the root
• Do pre-order (from root to leaves) traversal of tree
– Determine sj of internal node j with parent i:
\[
s_j =
\begin{cases}
s_i & \text{if } s_i \in R_j \\
\text{arbitrary state} \in R_j & \text{otherwise}
\end{cases}
\]
Complexity: O(mnk), where m = #characters, n = #taxa/nodes, k = #states
[Figure: example tree with leaf states C, T, G, T, A, T and internal sets CT, GT, AGT, T, T; score = 3.]
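Both phases together might look like the sketch below, again for one character. The tree encoding, the names `fitch` and `count_changes`, and the example tree are assumptions made for illustration; `min()` stands in for the "arbitrary" choice only to make the result deterministic.

```python
def fitch(tree, leaf_state):
    """Fitch's algorithm for one character on a rooted binary tree.

    tree: nested 2-tuples (internal nodes) with strings as leaf names.
    Returns (assignment dict node -> state, parsimony score).
    """
    sets, score = {}, 0

    def bottom_up(node):
        nonlocal score
        if isinstance(node, str):
            sets[node] = {leaf_state[node]}
        else:
            r_j, r_k = bottom_up(node[0]), bottom_up(node[1])
            if r_j & r_k:
                sets[node] = r_j & r_k
            else:
                sets[node] = r_j | r_k
                score += 1
        return sets[node]

    state = {}

    def top_down(node, parent_state):
        # Keep the parent's state when it is allowed, otherwise pick any
        # state from R_j (min() makes the arbitrary choice reproducible).
        if parent_state in sets[node]:
            state[node] = parent_state
        else:
            state[node] = min(sets[node])
        if not isinstance(node, str):
            for child in node:
                top_down(child, state[node])

    bottom_up(tree)
    top_down(tree, None)   # the root has no parent, so it picks from R_root
    return state, score


def count_changes(tree, state):
    """Count edges whose endpoints carry different states."""
    if isinstance(tree, str):
        return 0
    return sum((state[c] != state[tree]) + count_changes(c, state) for c in tree)


# Hypothetical six-leaf tree with leaf states C, T, T, A, G, T.
tree = ((("l1", "l2"), "l3"), ("l4", ("l5", "l6")))
leaf_state = {"l1": "C", "l2": "T", "l3": "T", "l4": "A", "l5": "G", "l6": "T"}

state, score = fitch(tree, leaf_state)
print(score, count_changes(tree, state))  # 3 3
```

Counting the changes along the edges of the reconstructed assignment gives the same value as the number of union operations, as expected.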
Weighted Parsimony
Sankoff’s algorithm
• Each mutation a↔b costs differently - S(a,b).
1. Bottom-up phase: Determine Ri(s) - cost of optimal state-assignment for subtree of i, when it is assigned state s.
2. Top-down phase: Pick optimal states for each internal node
Fitch’s algorithm as special case:
• Ri - set of states which yield minimal-cost subtree of i
Same as algorithm for optimal lifted tree alignment (Tutorial #4)
Sankoff’s Algorithm
Bottom-up phase
Determine Ri(s) for each internal node
• Initialization:
\[
R_i(s) =
\begin{cases}
0 & \text{if } s_i = s \\
\infty & \text{otherwise}
\end{cases}
\]
• Do a post-order (from leaves to root) traversal of tree
– Determine Ri of internal node i with children j, k:
\[
R_i(s) = \min_{s'} \{ R_j(s') + S(s', s) \} + \min_{s'} \{ R_k(s') + S(s', s) \}
\]
Natural generalization for non-binary trees
Remember pointers s→s’
[Figure: example tree with leaf states C, T, G, T, A, T.]
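The bottom-up recursion might be sketched like this for one character. Everything beyond the recursion itself is an assumption: the nested-tuple tree encoding, the names `sankoff_bottom_up`, `unit_cost` and `ti_tv_cost`, the example tree, and in particular the transition/transversion weights, which are made-up values for illustration.

```python
import math


def sankoff_bottom_up(tree, leaf_state, states, cost):
    """Return a dict node -> {state: R_i(state)} for one character.

    tree: nested tuples (any number of children), strings as leaf names.
    cost(a, b): substitution cost S(a, b).
    """
    table = {}

    def visit(node):
        if isinstance(node, str):
            # Leaf: 0 for the observed state, infinity for all others.
            table[node] = {
                s: (0 if s == leaf_state[node] else math.inf) for s in states
            }
        else:
            child_tables = [visit(child) for child in node]
            # One min-term per child; summing them covers non-binary nodes too.
            table[node] = {
                s: sum(min(r[t] + cost(t, s) for t in states) for r in child_tables)
                for s in states
            }
        return table[node]

    visit(tree)
    return table


def unit_cost(a, b):
    """Unweighted parsimony: every change costs 1."""
    return 0 if a == b else 1


def ti_tv_cost(a, b):
    """Hypothetical weights: transitions (A<->G, C<->T) cost 1, transversions 2."""
    if a == b:
        return 0
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else 2


# Hypothetical six-leaf tree with leaf states C, T, T, A, G, T.
tree = ((("l1", "l2"), "l3"), ("l4", ("l5", "l6")))
leaf_state = {"l1": "C", "l2": "T", "l3": "T", "l4": "A", "l5": "G", "l6": "T"}

unit_table = sankoff_bottom_up(tree, leaf_state, "ACGT", unit_cost)
weighted_table = sankoff_bottom_up(tree, leaf_state, "ACGT", ti_tv_cost)
print(min(unit_table[tree].values()))  # 3
print(weighted_table[tree])            # R_root(s) for each state s
```

With unit costs the minimum at the root coincides with the unweighted parsimony score of the same tree.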
Sankoff’s Algorithm
Top-down phase
Pick states for each internal node
• Select minimal cost character for root (s minimizing Rroot(s))
• Do pre-order (from root to leaves) traversal of tree:
– For internal node j, with parent i, select state that produced minimal cost at i (use pointers kept in 1st stage)
\[
R_i(s) = \min_{s'} \{ R_j(s') + S(s', s) \} + \min_{s'} \{ R_k(s') + S(s', s) \}
\]
Complexity: O(mnk²), where m = #characters, n = #taxa/nodes, k = #states
[Figure: example tree with leaf states C, T, G, T, A, T.]
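A sketch of both phases with explicit pointers, for one character, is given below. The names `sankoff`, `assignment_cost`, `unit_cost` and `ti_tv_cost`, the tree encoding, the example tree and the weights are all assumptions made for illustration.

```python
import math


def sankoff(tree, leaf_state, states, cost):
    """Sankoff's algorithm for one character.

    tree: nested tuples (internal nodes) with strings as leaf names.
    Returns (assignment dict node -> state, total cost).
    """
    table = {}
    pointer = {}   # pointer[(child, s)] = best child state when the parent has s

    def bottom_up(node):
        if isinstance(node, str):
            table[node] = {
                s: (0 if s == leaf_state[node] else math.inf) for s in states
            }
            return
        table[node] = dict.fromkeys(states, 0)
        for child in node:
            bottom_up(child)
            for s in states:
                best = min(states, key=lambda t: table[child][t] + cost(t, s))
                pointer[(child, s)] = best   # remember which s' achieved the minimum
                table[node][s] += table[child][best] + cost(best, s)

    state = {}

    def top_down(node):
        if isinstance(node, str):
            return
        for child in node:
            state[child] = pointer[(child, state[node])]   # follow the pointer
            top_down(child)

    bottom_up(tree)
    state[tree] = min(states, key=lambda s: table[tree][s])
    top_down(tree)
    return state, table[tree][state[tree]]


def assignment_cost(tree, state, cost):
    """Recompute the cost of an assignment by summing S over all edges."""
    if isinstance(tree, str):
        return 0
    return sum(
        cost(state[c], state[tree]) + assignment_cost(c, state, cost) for c in tree
    )


def unit_cost(a, b):
    return 0 if a == b else 1


def ti_tv_cost(a, b):
    """Hypothetical weights: transitions (A<->G, C<->T) cost 1, transversions 2."""
    if a == b:
        return 0
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else 2


# Hypothetical six-leaf tree with leaf states C, T, T, A, G, T.
tree = ((("l1", "l2"), "l3"), ("l4", ("l5", "l6")))
leaf_state = {"l1": "C", "l2": "T", "l3": "T", "l4": "A", "l5": "G", "l6": "T"}

state, total = sankoff(tree, leaf_state, "ACGT", ti_tv_cost)
print(total, assignment_cost(tree, state, ti_tv_cost))
```

Storing the pointers during the bottom-up phase means the top-down phase does no minimization of its own, apart from the choice at the root.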
Fitch’s Algorithm
as special case of Sankoff’s algorithm
Unweighted parsimony:
\[
S(a, b) =
\begin{cases}
0 & \text{if } a = b \\
1 & \text{otherwise}
\end{cases}
\]
Sankoff’s algorithm:
• Ri(s) - cost of optimal subtree of i, when it is assigned state s
Fitch’s algorithm:
• Score(i) - cost of optimal state-assignment for subtree of i
• Ri - set of states at i in optimal state-assignments for subtree of i
We need to show that:
1. Optimal tree assigns node i with state from Ri.
2. Fitch’s bottom-up recursive formula for Ri is correct:
\[
R_i =
\begin{cases}
R_j \cap R_k & \text{if } R_j \cap R_k \neq \emptyset \\
R_j \cup R_k & \text{otherwise}
\end{cases}
\]
→ Check for yourselves
Fitch’s Algorithm
as special case of Sankoff’s algorithm
Unweighted parsimony:
\[
S(a, b) =
\begin{cases}
0 & \text{if } a = b \\
1 & \text{otherwise}
\end{cases}
\]
• Score(i) - cost of optimal state-assignment for subtree of i
• Ri - set of states at i in optimal state-assignments for subtree of i
We need to show that:
1. Optimal tree assigns node i with state from Ri.
• Trivially true for the root
• Assume (to the contrary) that in an optimal assignment, some node j (with parent i) is assigned sj∉Rj
• Parsimony-score is integer, so sj∉Rj ⇒ Rj(sj) ≥ Score(j)+1
• Switching from sj to some s∊Rj (and re-optimizing the subtree of j) lowers the cost of the subtree of j by at least 1 and raises the cost of the edge (i, j) by at most 1, so we do not raise the parsimony-score
Why is this not the case for the weighted version?
[Figure: path from the root down to node i and its child j.]
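The second claim can also be checked numerically on small trees. The sketch below is an illustration, not a proof: it compares Fitch's sets with the minimizers of Sankoff's Ri(s) under unit costs at every node, for every leaf labeling of one hypothetical four-leaf tree. All names (`fitch_sets`, `sankoff_table`, `fitch_matches_sankoff`, the leaf names) are assumptions.

```python
import itertools
import math

STATES = "ACGT"


def fitch_sets(node, leaf_state, out):
    """Fill out[node] with Fitch's set R_i and return it."""
    if isinstance(node, str):
        out[node] = {leaf_state[node]}
    else:
        r_j = fitch_sets(node[0], leaf_state, out)
        r_k = fitch_sets(node[1], leaf_state, out)
        # An empty intersection is falsy, so the union is used instead.
        out[node] = (r_j & r_k) or (r_j | r_k)
    return out[node]


def sankoff_table(node, leaf_state, out):
    """Fill out[node][s] with Sankoff's R_i(s) under unit costs and return it."""
    if isinstance(node, str):
        out[node] = {s: (0 if s == leaf_state[node] else math.inf) for s in STATES}
    else:
        children = [sankoff_table(c, leaf_state, out) for c in node]
        out[node] = {
            s: sum(min(r[t] + (t != s) for t in STATES) for r in children)
            for s in STATES
        }
    return out[node]


def fitch_matches_sankoff(tree, leaf_state):
    """True if, at every node, Fitch's set equals the minimizers of R_i(s)."""
    sets, table = {}, {}
    fitch_sets(tree, leaf_state, sets)
    sankoff_table(tree, leaf_state, table)
    for node, costs in table.items():
        best = min(costs.values())
        if {s for s, c in costs.items() if c == best} != sets[node]:
            return False
    return True


# Hypothetical four-leaf tree; try all 4^4 labelings of its leaves.
tree = (("a", "b"), ("c", "d"))
leaves = ["a", "b", "c", "d"]
all_ok = all(
    fitch_matches_sankoff(tree, dict(zip(leaves, labeling)))
    for labeling in itertools.product(STATES, repeat=len(leaves))
)
print(all_ok)
```

Replacing the unit cost `(t != s)` with unequal weights is a quick way to explore the question of why the argument does not carry over to the weighted version.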
Exploring the Space of Trees
• We saw how to find optimal state-assignment for a given tree topology
• We need to explore space of topologies
• Given n sequences there are (2n-3)!! possible rooted trees and (2n-5)!! possible unrooted trees
\[
n!! = 1 \cdot 3 \cdot 5 \cdots n \quad (n \text{ odd}),
\]
which grows on the order of \(2^{n/2} (n/2)!\)
taxa (n)   # rooted trees   # unrooted trees
3          3                1
4          15               3
5          105              15
6          945              105
8          135,135          10,395
10         34,459,425       2,027,025
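The counts in the table follow directly from the double-factorial formulas. A minimal sketch that reproduces them (the function names `double_factorial`, `rooted_trees` and `unrooted_trees` are hypothetical):

```python
def double_factorial(n: int) -> int:
    """Product n * (n-2) * (n-4) * ... down to 1 or 2; equals 1 for n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def rooted_trees(n: int) -> int:
    """Number of rooted binary trees on n labeled taxa: (2n-3)!!"""
    return double_factorial(2 * n - 3)


def unrooted_trees(n: int) -> int:
    """Number of unrooted binary trees on n labeled taxa: (2n-5)!!"""
    return double_factorial(2 * n - 5)


for n in (3, 4, 5, 6, 8, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Already at n = 10 there are more than 34 million rooted topologies, which is why exhaustive search is limited to very small inputs.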
Exploring the Space of Trees
Possible solutions:
1. Heuristic solutions for “traveling” through “topology-space”
2. Find (basic) topology using distance-based methods (NJ)
Notice another problem:
• We obtain state-assignments to taxa using multiple alignment
• We obtain optimal MA using topology of phylogenetic tree (e.g. CLUSTAL)
Solution:
• Again, use some initial topology (via NJ)
[Figure: a gapped multiple alignment whose columns are the characters C1, C2, …, Cm.]