Building Phylogenetic Trees
Yaw-Ling Lin (林耀鈴)
Dept Computer Sci and Info Management
Providence University, Taiwan
E-mail: [email protected]
WWW: http://www.cs.pu.edu.tw/~yawlin
1
Phylogenetic Tree
[Figure: a tree with a branch, an internal node, and a leaf labeled]
• Topology: bifurcating
– Leaves: 1…N
– Internal nodes: N+1…2N-2
2
Orthologues / Paralogues
3
Rooted / Unrooted Tree
4
Counting Trees
5
Counting Trees
[Figure: the single unrooted tree for three taxa (A, B, C), the three unrooted trees for four taxa (A-D), and example trees for five (A-E) and six (A-F) taxa]

# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
...           ...
30            ≈8.69 × 10^36

(2N - 5)!! = # unrooted trees for N taxa
(2N - 3)!! = # rooted trees for N taxa
6
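A quick sketch to reproduce the counts above (Python assumed; the helper name double_factorial is illustrative, not from the slides):

def double_factorial(n):
    """n!! = n * (n-2) * (n-4) * ... down to 1 (for odd n)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def unrooted_trees(N):
    return double_factorial(2 * N - 5)    # (2N - 5)!!

def rooted_trees(N):
    return double_factorial(2 * N - 3)    # (2N - 3)!!

for N in (3, 4, 5, 10, 30):
    print(N, unrooted_trees(N), rooted_trees(N))
# 10 taxa -> 2,027,025 unrooted trees; 30 taxa -> about 8.7 × 10^36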
Rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.

[Figure: an unrooted tree over taxa A, B, C, D with the root position marked, and the corresponding rooted tree obtained by pulling the string up at the root]

Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.
7
UPGMA -- Unweighted Pair Group
Method with Arithmetic mean
Simplest method - uses a sequential clustering algorithm
(assumption of rate constancy among lineages - often violated)

Step 1: from the distance matrix over A, B, C (entries dAB, dAC, dBC), join the closest pair, here A and B, placing their common node at height dAB / 2 in the tree.
Step 2: in the reduced matrix, d(AB)C = (dAC + dBC) / 2; join (AB) with C at height d(AB)C / 2.

[Figure: the two clustering steps and the resulting tree]
8
UPGMA Step 1
Combine B and C.

     A    B    C    D    E
A    0
B   10    0
C   12    4    0
D   10    4    6    0
E    7   13   15   13    0

[Figure: the tree under construction, with branches labeled a-e]
9
UPGMA step 2
Combine BC and D.

      A   BC    D    E
A     0
BC   11    0                d(A,BC) = (10+12)/2 = 11
D    10    5    0           d(BC,D) = (4+6)/2 = 5
E     7   14   13    0

[Figure: partial tree with B and C joined at height 2]
10
UPGMA step 3
Combine A and E.

       A   BCD    E
A      0
BCD  10.5    0
E      7   13.5   0

[Figure: partial tree with B, C at height 2 and D joined at height 2.5]
11
UPGMA step 4
Combine AE and BCD.

      AE   BCD
AE     0
BCD   12    0

[Figure: partial tree with A and E joined at height 3.5]
12
UPGMA Result
     A    B    C    D    E
A    0
B   10    0
C   12    4    0
D   10    4    6    0
E    7   13   15   13    0

[Figure: the final rooted tree; leaf branches B = 2, C = 2, D = 2.5, A = 3.5, E = 3.5; internal branches 0.5 (from BC up to BCD) and 3.5 / 2.5 up to the root at height 6]
13
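The steps above can be reproduced with a short sketch (Python assumed; the function name upgma is illustrative; the update rule is the pair-averaging d((AB),C) = (d(A,C) + d(B,C))/2 shown on these slides, which coincides with size-weighted UPGMA whenever single sequences are merged):

def upgma(names, D):
    """names: taxon labels; D: dict with D[a, b] = distance between taxa a and b."""
    d = {frozenset(p): v for p, v in D.items()}
    clusters = {n: (n, 0.0) for n in names}          # label -> (subtree, height)
    while len(clusters) > 1:
        # pick the closest pair of current clusters
        a, b = min(((x, y) for x in clusters for y in clusters if x < y),
                   key=lambda p: d[frozenset(p)])
        h = d[frozenset((a, b))] / 2                 # new node sits at half the distance
        new = a + b
        clusters[new] = ((clusters[a][0], clusters[b][0], h), h)
        # distance from the new cluster to every other one: average the two merged rows
        for c in clusters:
            if c not in (a, b, new):
                d[frozenset((new, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        del clusters[a], clusters[b]
    return next(iter(clusters.values()))

# The five-taxon example of the preceding slides:
dist = {('A','B'): 10, ('A','C'): 12, ('A','D'): 10, ('A','E'): 7,
        ('B','C'): 4,  ('B','D'): 4,  ('B','E'): 13,
        ('C','D'): 6,  ('C','E'): 15, ('D','E'): 13}
print(upgma(list('ABCDE'), dist))
# joins B,C at height 2, then D at 2.5, then A,E at 3.5; root at height 6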
UPGMA Result
[Figure: two drawings of the resulting tree with branch lengths marked]
14
UPGMA(1)
15
UPGMA(2)
16
UPGMA -- Illustrations
17
When UPGMA fails …
18
Neighbor Joining
• Very popular method
• Does not make the molecular clock assumption: a
modified distance matrix is constructed to adjust for
differences in the evolution rate of each taxon
• Produces unrooted tree
• Assumes additivity: distance between pairs of
leaves = sum of lengths of edges connecting them
• Like UPGMA, constructs tree by sequentially
joining subtrees
19
Additivity
20
Naïve NJ by Additivity?
• O(n^2) (i,j) pairs
• O(n^2) (k,l) pairs
• (k,l) “rejects” (i,j) whenever additivity fails (see the sketch below)
• O(n^4) to pick an (i,j) neighbor pair!
• So in total O(n^5) time suffices
21
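The rejection test can be made concrete under the four-point condition of additive (tree) metrics (a sketch, assuming Python, an additive distance matrix and positive internal branch lengths; the function names are illustrative): if (i, j) are neighbors, then for every other pair (k, l) the sum D[i][j] + D[k][l] does not exceed the two "crossing" sums.

def rejects(D, i, j, k, l):
    """True if the quartet (k, l) rules out (i, j) as a neighbor pair."""
    return D[i][j] + D[k][l] > min(D[i][k] + D[j][l], D[i][l] + D[j][k])

def naive_neighbor_pair(D):
    """O(n^4) scan: return the first pair (i, j) that no quartet rejects."""
    n = len(D)
    for i in range(n):
        for j in range(i + 1, n):
            others = [m for m in range(n) if m not in (i, j)]
            if not any(rejects(D, i, j, k, l)
                       for a, k in enumerate(others) for l in others[a + 1:]):
                return i, j
    return None

On the five-taxon matrix of the UPGMA example (which is additive), this scan accepts the pair (A, E), which are indeed neighbors in a tree realizing those distances.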
Neighbor Joining: Once we know
the correct (i,j) pair
22
Neighbour Joining: why not pick
the smallest (i,j) pair?
23
Neighbor-Joining: Algorithm
(A). For each pair i, j of OTUs, compute
     Sij = (N - 2) Dij - (Ri + Rj),   where Ri = Σk Dik

(B). Pick a pair i, j for which Sij is smallest. Create a new node k and infer its distance to every remaining OTU m:
     Dkm = (Dim + Djm - Dij) / 2   for m ≠ i, j

(C). The branch lengths from the new node are
     Dik = [(N - 2) Dij + Ri - Rj] / (2(N - 2))
     Djk = [(N - 2) Dij + Rj - Ri] / (2(N - 2))

After deleting i and j from D and adding the new node k, the process is repeated until the tree is complete.

[Figure: OTUs i and j joined at the new node k, with a remaining OTU m]
26
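One joining step of this algorithm can be sketched as follows (Python with numpy assumed; nj_step and the Newick-style labels are illustrative, not the authors' code):

import numpy as np

def nj_step(D, labels):
    """Join the pair minimizing Sij = (N-2)*Dij - (Ri + Rj); return the reduced matrix."""
    N = len(labels)
    R = D.sum(axis=1)                                  # Ri = sum_k Dik
    S = (N - 2) * D - R[:, None] - R[None, :]          # criterion matrix
    np.fill_diagonal(S, np.inf)                        # ignore i == j
    i, j = divmod(S.argmin(), N)
    # branch lengths from i and j to the new node k
    d_ik = 0.5 * D[i, j] + (R[i] - R[j]) / (2 * (N - 2))
    d_jk = D[i, j] - d_ik
    # distances from the new node k to every remaining OTU m
    d_km = 0.5 * (D[i, :] + D[j, :] - D[i, j])
    keep = [m for m in range(N) if m not in (i, j)]
    newD = np.zeros((len(keep) + 1, len(keep) + 1))
    newD[:-1, :-1] = D[np.ix_(keep, keep)]
    newD[-1, :-1] = newD[:-1, -1] = d_km[keep]
    new_label = f"({labels[i]}:{d_ik:.2f},{labels[j]}:{d_jk:.2f})"
    return newD, [labels[m] for m in keep] + [new_label]

# Repeating nj_step until three nodes remain (and then joining them at a single
# internal node) gives the usual O(n^3) neighbor-joining run.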
Neighbor-Joining: Complexity
• The method performs the pair search in O(n^2) time and
updates the distance matrix in O(n^2) time per iteration.
• This gives a total time complexity of O(n^3) and
a space complexity of O(n^2).
27
Reasoning the NJ Method
• Where do the ideas of Sij and Ri come from?
• How correct is the algorithm?
• Heuristic or exact solution?
28
The “1-star” Sum of the Branch
Lengths
(All N OTUs are joined to a single central node X.)

So = Σi LiX = [1/(N-1)] Σ_{i<j} Dij = T/(N-1)

• D and L denote the distance between OTUs and the branch length between nodes
• Each branch is counted N-1 times when all pairwise distances are added
29
The “paired-2-star” Sum of the
Branch Lengths
(OTUs 1 and 2 are joined to node X, OTUs 3..N are joined to node Y, and X and Y are connected by the branch of length LXY.)

LXY = [1/(2(N-2))] [ Σ_{k=3..N} (D1k + D2k) - (N-2)(L1X + L2X) - 2 Σ_{i=3..N} LiY ]

L1X + L2X = D12

Σ_{i=3..N} LiY = [1/(N-3)] Σ_{3≤i<j} Dij
30
The “paired-2-star” Tree Size
N
LXY = [1/(2(N-2))] [ Σ_{k=3..N} (D1k + D2k) - (N-2)(L1X + L2X) - 2 Σ_{i=3..N} LiY ]

L1X + L2X = D12

Σ_{i=3..N} LiY = [1/(N-3)] Σ_{3≤i<j} Dij

S12 = LXY + (L1X + L2X) + Σ_{i=3..N} LiY
    = [1/(2(N-2))] Σ_{k=3..N} (D1k + D2k) + D12/2 + [1/(N-2)] Σ_{3≤i<j} Dij
31
Before the proof
N
S 12  LXY  ( L1 X  L 2 X )   LiY
i 3
N
1
1
1

( D1k  D 2 k )  D12 
Dij


2( N  2) k 3
2
N  2 3i  j
N
N
k 1
k 1
if R1   D1k and R 2   D 2 k then
2T  R1  R 2 D12
S 12 

2( N  2)
2
Since T is the same for all pairs of i and j, Sij can be replaced by
Sij  ( N  2) Dij  Ri  Rj
for the purpose of computing the relative value.
33
Before the proof (Cont.)
From the definition of Sij one gets

Sik - Sij = [1/(2(N-2))] Σ_{m ≠ i,j,k} (Dik + Djm - Dij - Dkm)        (a)

and

Skl - Sij = [1/(2(N-2))] Σ_{m ≠ i,j,k,l} [(Dim + Djm - Dij) - (Dkm + Dlm - Dkl)]   (b)

where i, j, k, l are distinct OTUs.
34
Neighbor-Joining: The proof
Assume that D is generated by a tree in which all branches are positive.

Lemma: If i and j are neighbors, then Sij is strictly the least element in its row and column.
Proof: Each summand in
   Sik - Sij = [1/(2(N-2))] Σ_{m ≠ i,j,k} (Dik + Djm - Dij - Dkm)
is positive.

Theorem: If i and j are chosen so that Sij is minimum, then i and j are neighbors.
35
Lemma
[Figure: four-leaf tree with leaves 1 and 2 joined at one internal node and leaves 3 and 4 at the other]
36
Proof
[Figure: the same four-leaf tree, leaves 1, 2 on one side and 3, 4 on the other]

Proof: Sik - Sij = [1/(2(N-2))] Σ_{m ≠ i,j,k} (Dik + Djm - Dij - Dkm) is positive, because for each such m
   Dij + Dkm < Dik + Djm
and therefore Sik > Sij.
38
Proof of the Theorem:
by contradiction
[Figure: a tree in which i and j are not neighbors; u and w are nodes on the path from i to j, k and l are a pair of neighbors attached at v, r and s are another pair of neighbors, and a further OTU m joins the tree at a point x]

Type 1 summand: A = -2Dux - 2Duv
Type 2 summand: B = -4Dvx + 2Duv
For the sum in formula (b) to be nonnegative, there must be at least as many Type 2 terms as Type 1 terms.

Skl - Sij = [1/(2(N-2))] Σ_{m ≠ i,j,k,l} [(Dim + Djm - Dij) - (Dkm + Dlm - Dkl)]   (b)

Suppose that i and j are not neighbors. Let k and l be any pair of neighbors, so that i, j, k, and l are distinct and are represented in the tree. Consider the sum in formula (b), which is nonnegative. If m is a fifth OTU, then it joins the tree at a point x along one of the indicated arcs. Say that m is of type 1 if it joins the path from i to j at any node different from u, and that m is of type 2 if it joins the path from i to j at node u.
39
Proof of the theorem (Cont.)
If m is of type 1, then the corresponding summand in formula (b)
is -2Dux - 2Duv. If m is of type 2, then the corresponding summand in
formula (b) is -4Dvx + 2Duv. For the sum in formula (b) to be nonnegative,
there must be at least as many terms corresponding to OTUs m of type 2 as
there are terms corresponding to OTUs m of type 1. It follows that there
are more OTUs that join the path from i to j at u than there are OTUs that
join that path at all other nodes combined.
Because neither i nor j has a neighbor, there must be a pair r, s of
neighbors that joins the path from i to j at a node w different from u. By the
above argument applied to w, there are more OTUs that join the path from i to j
at w than there are OTUs that join that path at all other nodes combined. The
conclusions about u and w contradict each other, and the theorem follows.
40
Speeding up Neighbor-Joining Tree
Construction
• In this paper, the authors present several heuristics
for speeding up the NJ method.
• The heuristics attempt to reduce the search time by
using a quad-tree.
• The worst case time complexity remains O(n^3) and
the space complexity after adding the quad-tree is
still O(n^2).
• The authors have implemented a tool, QuickJoin.
41
Previous Work
• The neighbor-joining method was introduced by
Saitou and Nei.
• The algorithm was later amended by Studier
and Keppler, with a running time of O(n^3).
• BIONJ -- Gascuel et al. produced an O(n^3)
implementation of a variant of the NJ
algorithm that produces more accurate trees in
many cases.
• QuickTree -- Durbin et al. produced a code-optimized
implementation of the NJ algorithm.
42
+/- of distance methods
• Advantages:
– easy to perform
– quick calculation
– suitable for sequences with high similarity
• Disadvantages:
– the sequences themselves are not considered as such (loss of
information)
– all sites are generally treated equally (differences in
substitution rates are not taken into account)
– not applicable to highly divergent sequences.
44
Parsimony
45
Maximum Parsimony Method
Principle - search for the tree that requires the smallest number
of character-state changes between the OTUs.

Informative sites - those that favor some trees over others;
operationally, sites with at least two different kinds of residues,
each of which is found in at least two of the OTU sequences.

            Site
OTU    1  2  3  4  5  6  7  8  9  10
 1     T  C  A  G  A  T  C  T  A  G
 2     T  T  A  G  A  A  C  T  A  G
 3     T  T  C  G  A  T  C  G  A  G
 4     T  T  C  T  A  A  G  G  A  C

(Sites 3, 6 and 8 are informative.)
46
Evaluating Parsimony Scores
• How do we compute the Parsimony score
for a given tree?
• Traditional Parsimony
– Each base change has a cost of 1
• Weighted Parsimony
– Each change is weighted by the score c(a,b)
47
Traditional Parsimony
Par(s1, ..., sn; T) = min over labelings of the internal nodes of Σ_{(u,v) ∈ E} 1{xu ≠ xv}

• Solved independently for each position
• Linear time solution

[Figure: three leaves carrying states a, g, a; the internal nodes carry the candidate sets {a, g} and {a}]
48
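The linear-time, per-position computation can be illustrated with a short sketch of Fitch's set-based procedure for unit costs (Python assumed; the tree layout and leaf names x1-x3 are illustrative, chosen to match the a/g/a example in the figure):

def fitch(tree, leaf_state):
    """tree: leaf name or (left, right) tuple; returns (state set, number of changes)."""
    if isinstance(tree, str):                     # leaf: its observed residue
        return {leaf_state[tree]}, 0
    left, right = tree
    sL, cL = fitch(left, leaf_state)
    sR, cR = fitch(right, leaf_state)
    common = sL & sR
    if common:                                    # intersection: no extra change needed
        return common, cL + cR
    return sL | sR, cL + cR + 1                   # union: count one additional change

states = {'x1': 'a', 'x2': 'g', 'x3': 'a'}
print(fitch((('x1', 'x2'), 'x3'), states))        # ({'a'}, 1): one substitution suffices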
Traditional Parsimony
49
Evaluating Weighted
Parsimony
Dynamic programming on the tree
Initialization:
• For each leaf i set S(i,a) = 0 if i is labeled by a, otherwise
S(i,a) = ∞
Iteration:
• if k is a node with children i and j, then
S(k,a) = minb(S(i,b)+c(a,b)) + minb(S(j,b)+c(a,b))
Termination:
• cost of the tree is mina S(r,a), where r is the root
50
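A compact sketch of this dynamic program for one position (Python assumed; the DNA alphabet, the unit cost function and the toy tree are illustrative assumptions, and any weight matrix c(a,b) can be substituted):

ALPHABET = "ACGT"
INF = float("inf")

def cost(a, b):
    return 0 if a == b else 1                  # unit costs; replace with any c(a, b)

def sankoff(tree, leaf_state):
    """Return {a: S(k, a)} for the subtree rooted at k; tree = leaf name or (left, right)."""
    if isinstance(tree, str):                  # initialization at the leaves
        return {a: (0 if a == leaf_state[tree] else INF) for a in ALPHABET}
    left, right = (sankoff(child, leaf_state) for child in tree)
    return {a: min(left[b] + cost(a, b) for b in ALPHABET) +
               min(right[b] + cost(a, b) for b in ALPHABET)
            for a in ALPHABET}

states = {'x1': 'A', 'x2': 'G', 'x3': 'A'}
S_root = sankoff((('x1', 'x2'), 'x3'), states)
print(min(S_root.values()))                    # cost of the tree at this position: 1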
Cost of Evaluating Parsimony
• The score is evaluated on each position independently.
Scores are then summed over all positions.
• If there are n nodes, m characters, and k possible
values for each character, then complexity is
O(nmk)
• By keeping traceback information, we can
reconstruct most parsimonious values at each
ancestor node
52
Weighted Parsimony
53
Traditional Parsimony is not
“complete”
54
Parsimony: Searching over all
trees by Branch and Bound
55
Inferring trees –
Maximum Likelihood method
• Maximum likelihood supposes a model of
evolution along tree branches.
• Strategy:
Find the parameters (tree, branch lengths, substitution rates) that
maximize the likelihood assigned to the data.
• Note: Model of evolution does not include indels!
• In Phylip package: program PROTML
59
Probabilistic Methods
• The phylogenetic tree represents a generative
probabilistic model (like HMMs) for the observed
sequences.
• Background probabilities: q(a)
• Mutation probabilities: P(a|b, t)
• Models for evolutionary mutations
– Jukes Cantor
– Kimura 2-parameter model
• Such models are used to derive the probabilities
60
Jukes Cantor model
• A model for mutation rates
• Mutation occurs at a
constant rate
• Each nucleotide is equally likely to mutate
into any other nucleotide, with rate α.
61
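For reference, the rate matrix implied by this description can be written out (a standard Jukes-Cantor form, given here as an assumption since the slide's own matrix figure is not reproduced; rows and columns are ordered A, C, G, T and α is the per-target substitution rate):

R = \begin{pmatrix}
 -3\alpha & \alpha   & \alpha   & \alpha   \\
 \alpha   & -3\alpha & \alpha   & \alpha   \\
 \alpha   & \alpha   & -3\alpha & \alpha   \\
 \alpha   & \alpha   & \alpha   & -3\alpha
\end{pmatrix}

Each row sums to zero, and exponentiating tR yields the e^(-4αt) transition probabilities on the Mutation Probabilities slide.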
Kimura 2-parameter model
• Allows a different rate for transitions and
transversions.
62
Mutation Probabilities
• The rate matrix R is used to derive the mutation
probability matrix S:
S(t + ε) = S(t)(I + Rε),   i.e. S(ε) ≈ I + Rε for small ε

• S is obtained by integration. For Jukes-Cantor:

P(a | a, t) = Saa(t) = (1/4)(1 + 3e^(-4αt))

P(g | a, t) = Sag(t) = (1/4)(1 - e^(-4αt))

• q can be obtained by setting t to infinity
63
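A small numerical check of these transition probabilities (Python assumed; the value of alpha and the sample times are arbitrary):

import math

def jc_prob(a, b, t, alpha):
    """P(b | a, t) under the Jukes-Cantor model."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * decay) if a == b else 0.25 * (1 - decay)

alpha = 1.0
print(jc_prob('a', 'a', 0.0, alpha))     # 1.0: no time, no change
print(jc_prob('a', 'g', 0.1, alpha))     # small transition probability
print(jc_prob('a', 'g', 1e6, alpha))     # -> 0.25, the stationary q(a)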
Mutation Probabilities
• Both models satisfy the following properties:
• Lack of memory (Chapman-Kolmogorov):
   Pa→c(t + t') = Σb Pa→b(t) Pb→c(t')
• Reversibility:
– there exist stationary probabilities {Pa} such that
   Pa Pab(t) = Pb Pba(t)
64
Probabilistic Approach
• Given P, q, the tree topology and the branch lengths, we can
compute:

P(x1, x2, x3, x4, x5 | T, t) = q(x5) p(x4 | x5, t4) p(x3 | x5, t3) p(x1 | x4, t1) p(x2 | x4, t2)

[Figure: tree with leaves x1, x2, x3, internal node x4, root x5, and branch lengths t1-t4]
65
Computing the Tree Likelihood
• We are interested in the probability of the observed
data given the tree and branch “lengths”:

P(x1, x2, x3 | T, t) = Σ_{x4,x5} P(x1, x2, x3, x4, x5 | T, t)

• Computed by summing over the internal nodes
• This can be done efficiently using a tree upward
traversal pass.
66
Tree Likelihood Computation
• Define P(Lk | a) = prob. of the leaves below node k,
given that xk = a
• Init: for leaves, P(Lk | a) = 1 if xk = a; 0 otherwise
• Iteration: if k is a node with children i and j, then
P(Lk | a) = Σ_{b,c} P(b | a, ti) P(Li | b) P(c | a, tj) P(Lj | c)
• Termination: the likelihood is
P(x1, x2, x3 | T, t) = Σa P(Lroot | a) q(a)
67
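A sketch of this upward pass for a single site (Python assumed; it reuses the Jukes-Cantor probabilities, takes q(a) = 1/4, and encodes the tree of the Probabilistic Approach slide with arbitrary branch lengths):

import math

BASES = "acgt"

def p_sub(a, b, t, alpha=1.0):                   # Jukes-Cantor P(b | a, t)
    decay = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * decay) if a == b else 0.25 * (1 - decay)

def cond_lik(node, leaf_state):
    """Return {a: P(leaves below node | node state a)}."""
    if isinstance(node, str):                    # leaf: 1 if xk = a, else 0
        return {a: 1.0 if a == leaf_state[node] else 0.0 for a in BASES}
    (child_i, t_i), (child_j, t_j) = node
    Li = cond_lik(child_i, leaf_state)
    Lj = cond_lik(child_j, leaf_state)
    return {a: sum(p_sub(a, b, t_i) * Li[b] for b in BASES) *
               sum(p_sub(a, c, t_j) * Lj[c] for c in BASES)
            for a in BASES}

x4 = (('x1', 0.1), ('x2', 0.2))                  # internal node x4 joins leaves x1, x2
root = ((x4, 0.3), ('x3', 0.4))                  # root x5 joins x4 and leaf x3
leaves = {'x1': 'a', 'x2': 'g', 'x3': 'a'}
L_root = cond_lik(root, leaves)
print(sum(0.25 * L_root[a] for a in BASES))      # site likelihood with q(a) = 1/4

Summing the logarithm of this quantity over all positions gives the full tree likelihood used on the next slide.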
Maximum Likelihood (ML)
• Score each tree by
P(X1, ..., Xn | T, t) = Πm P(x1[m], ..., xn[m] | T, t)
– Assumption of independent positions
• Branch lengths t can be optimized
– Gradient ascent
– EM
• We look for the highest scoring tree
– Exhaustive
– Sampling methods (Metropolis)
68
Optimal Tree Search
• Perform search over possible topologies
[Figure: candidate topologies T1, T2, T3, T4, ..., Tn; for each topology, parametric optimization (EM) over the parameter space, which may be trapped in local maxima]
69
Computational Problem
• Such procedures are computationally expensive!
• Computation of optimal parameters, per candidate,
requires a non-trivial optimization step.
• Non-negligible computation is spent on each candidate,
even if it is a low-scoring one.
• In practice, such learning procedures can only
consider small sets of candidate structures
70
Max Likelihood versus Parsimony
[Figure: the generating tree T with unequal branch lengths (0.3, 0.1, 0.09, 0.1, 0.3) and the three candidate topologies T1, T2, T3 for taxa 1-4]
(Example from BSA p. 225)

Choose tree T, with unequal branch lengths.
Generate 1000 sequences of length N according to the probabilistic model.

(A) Reconstruction by ML
  N      T1    T2    T3
  20     419   339   242
  100    638   204   158
  500    904    61    35
  2000   997     3     0

(B) Reconstruction by Parsimony
  N      T1    T2    T3
  20     396   378   224
  100    405   515    79
  500    404   594     2
  2000   353   646     0

Conclusion: ML infers the right tree as N gets larger; Parsimony does not necessarily.
71
Max Likelihood versus NJ
[Figure: the same generating tree T (branch lengths 0.3, 0.1, 0.09, 0.1, 0.3) and candidate topologies T1, T2, T3]
(Example from BSA p. 225)

Choose tree T, with unequal branch lengths.
Generate 1000 sequences of length N according to the probabilistic model.

(A) Reconstruction by ML
  N      T1    T2    T3
  20     419   339   242
  100    638   204   158
  500    904    61    35
  2000   997     3     0

(B) Reconstruction by NJ

Conclusion: ML infers the right tree as N gets larger. If the probabilistic model is
correct, the ML distances will be very close to additive, and therefore the NJ method
predicts the correct tree.
72
Phylip - practicalities
• Menu-driven, no command line
• Input file format:
– First line: <number of sequences> <number of letters per sequence>
– Next lines: Sequences:
• First ten characters are the sequence name
• Then the sequence follows. Spaces and newlines are allowed.
• Dashes (-) signify gaps
• Example:
4 46
hba1      MV-LSPADKTNVKAAWGKVG
AHAGEYGAEALERMFLSFPTTKTYFP
beta      MVHLTPEEKSAVTALWGKVN
VDEVGGEALGRLLVVYPWTQRFFESF
Myoglobin -MGLSDGEWQLVLNVWGKVE
ADIPGHGQEVLIRLFKGHPETLEKFD
Leghemogl MGAFSEKQESLVKSSWEAFK
QNVPHHSAVFYTLILEKAPAAQNMFS
73
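A tiny helper along these lines can generate such a file (Python assumed; write_phylip is a hypothetical name and is not part of the Phylip package):

def write_phylip(seqs, path):
    """seqs: list of (name, sequence) pairs; all sequences must have equal length."""
    with open(path, "w") as fh:
        fh.write(f"{len(seqs)} {len(seqs[0][1])}\n")
        for name, seq in seqs:
            fh.write(f"{name[:10]:<10}{seq}\n")    # name padded/truncated to 10 characters

write_phylip([("hba1", "MV-LSPADKTNVKAAWGKVG"),
              ("beta", "MVHLTPEEKSAVTALWGKVN")], "example.phy")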
The End
74