Transcript Slide 1

Molecular Evolution
and Phylogenetic Tree
Reconstruction
Phylogenetic Trees
• Nodes: species
• Edges: time of independent evolution
• Edge length represents evolution time
 AKA genetic distance
 Not necessarily chronological time
Inferring Phylogenetic Trees
Trees can be inferred by several criteria:
 Morphology of the organisms
• Can lead to mistakes!
 Sequence comparison
Example:
Mouse:   ACAGTGACGCCCCAAACGT
Rat:     ACAGTGACGCTACAAACGT
Baboon:  CCTGTGACGTAACAAACGA
Chimp:   CCTGTGACGTAGCAAACGA
Human:   CCTGTGACGTAGCAAACGA
Inferring Phylogenetic Trees
• Sequence-based methods
 Deterministic (Parsimony)
 Probabilistic (SEMPHY)
• Distance-based methods
 UPGMA
 Neighbor-Joining
• Can compute distances from sequences
Distance Between Two Sequences
Basic principles:
• Degree of sequence difference is proportional to length of
independent sequence evolution
• Only use positions where alignment is certain – avoid
areas with (too many) gaps
Distance Between Two Sequences
Given sequences xi, xj,
Define
dij = distance between the two sequences
One possible definition:
dij = fraction f of sites u where xi[u] ≠ xj[u]
Better scores are derived by modeling evolution as a
continuous change process
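The simple definition above can be sketched in a few lines of Python. The function name `p_distance` and the choice to skip gapped columns (marked `-`) are assumptions made for illustration; the sequences are the mouse, rat and human rows from the earlier example.

```python
def p_distance(xi, xj):
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    if len(xi) != len(xj):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: '-' marks a gap; columns with a gap in either sequence are skipped.
    sites = [(a, b) for a, b in zip(xi, xj) if a != "-" and b != "-"]
    if not sites:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in sites) / len(sites)


mouse = "ACAGTGACGCCCCAAACGT"
rat = "ACAGTGACGCTACAAACGT"
human = "CCTGTGACGTAGCAAACGA"

print(p_distance(mouse, rat))    # 2 differing sites out of 19
print(p_distance(mouse, human))  # 6 differing sites out of 19
```

Mouse and rat come out closer to each other than either is to human, which is the signal a distance-based method builds on.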
Outline
• Molecular Evolution
• Distance Methods
 UPGMA / Average Linkage
 Neighbor-Joining
• Sequence Methods
 Deterministic (Parsimony)
 Probabilistic (SEMPHY)
Molecular Evolution
Q: How can we model evolution at the nucleotide level? (ignore gaps, focus on substitutions)
A: Consider what happens at a specific position for a small time interval Δt
• P(t) = vector of probabilities of {A,C,G,T} at time t
• μAC = rate of transition from A to C per unit time
• μA = μAC + μAG + μAT = rate of transition out of A
• pA(t+Δt) = pA(t) – pA(t) μA Δt + pC(t) μCA Δt + …
Molecular Evolution
In matrix/vector notation, we get
P(t+Δt) = P(t) + Q P(t) Δt
where Q is the substitution rate matrix
Molecular Evolution
• This is a differential equation:
P’(t) = Q P(t)
• A substitution rate matrix Q implies a probability
distribution over {A,C,G,T} at each position,
including stationary (equilibrium) frequencies πA,
πC, πG, πT
• Each Q is an evolutionary model (some work
better than others)
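As a rough illustration, the update P(t+Δt) = P(t) + Q P(t) Δt can be iterated numerically. The sketch below assumes the Jukes-Cantor model with a single rate α for every substitution and compares the result against the closed form ¼(1 + 3e^(–4αt)) for staying in the starting state. The function names, the value α = 0.1 and the step count are illustrative choices, not part of the slides.

```python
import math

BASES = "ACGT"


def jukes_cantor_q(alpha):
    """Rate matrix with every substitution at rate alpha; each column sums to zero."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def evolve(p, q, t, steps=1000):
    """Integrate P'(t) = Q P(t) by repeating P(t+dt) = P(t) + Q P(t) dt."""
    dt = t / steps
    for _ in range(steps):
        flow = [sum(q[i][j] * p[j] for j in range(4)) for i in range(4)]
        p = [p[i] + flow[i] * dt for i in range(4)]
    return p


alpha, t = 0.1, 1.0  # illustrative values
p_t = evolve([1.0, 0.0, 0.0, 0.0], jukes_cantor_q(alpha), t)  # start in state A
closed_form_a = 0.25 * (1 + 3 * math.exp(-4 * alpha * t))

print(dict(zip(BASES, p_t)))
print(closed_form_a)
```

As t grows, every entry of P(t) approaches ¼, the stationary frequencies of this model.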
Evolutionary Models
• Jukes-Cantor
• Kimura
• Felsenstein
• HKY
Estimating Distances
• Solve the differential equation and compute
expected evolutionary time given sequences
• Jukes-Cantor
• Kimura
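The distance formulas for this slide are not in the transcript. As a sketch, the standard Jukes-Cantor correction d = –¾ ln(1 – 4f/3) converts the observed fraction f of differing sites into an expected number of substitutions per site. The function name is an assumption.

```python
import math


def jukes_cantor_distance(f):
    """Expected substitutions per site given the observed fraction f of differing sites."""
    if not 0 <= f < 0.75:
        # At f >= 3/4 the sequences look random under this model and the log diverges.
        raise ValueError("Jukes-Cantor distance is defined only for 0 <= f < 3/4")
    return -0.75 * math.log(1 - 4 * f / 3)


for f in (0.0, 0.1, 0.3, 0.6):
    print(f, round(jukes_cantor_distance(f), 4))
```

The corrected distance is always at least f, and the gap widens as f grows, because repeated substitutions at the same site hide part of the evolutionary time.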
UPGMA (unweighted pair group method using arithmetic averages), or the Average Linkage Method
A simple clustering method for building a tree
Given two disjoint clusters Ci, Cj of sequences,
$d_{ij} = \frac{1}{|C_i| \cdot |C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq}$
Claim that if Ck = Ci ∪ Cj, then the distance to another cluster Cl is:
$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$
Algorithm: Average Linkage
Initialization:
Assign each xi into its own cluster Ci
Define one leaf per sequence, height 0
Iteration:
Find two clusters Ci, Cj s.t. dij is min
Let Ck = Ci ∪ Cj
Define node connecting Ci, Cj, and place it at height dij/2
Delete Ci, Cj
Termination:
When two clusters i, j remain, place root at height dij/2
[Figure: example tree over leaves 1–5]
Average Linkage Example
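Initial distances:
      v   w   x   y   z
v     0   6   8   8   8
w         0   8   8   8
x             0   4   4
y                 0   2
z                     0

After merging y, z:
      v   w   x   yz
v     0   6   8   8
w         0   8   8
x             0   4
yz                0

After merging x, yz:
      v   w   xyz
v     0   6   8
w         0   8
xyz           0

After merging v, w:
      vw  xyz
vw    0   8
xyz       0

[Figure: resulting tree over leaves v, w, x, y, z with internal nodes at heights 1, 2, 3, 4]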
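A minimal sketch of the average linkage algorithm, run on the example distances as read from the slide (v, w, x, y, z). Clusters are represented as nested tuples and distances are keyed by `frozenset` pairs; both are illustrative choices.

```python
from itertools import combinations


def upgma(names, dist):
    """Average linkage clustering.

    names: leaf labels; dist: dict mapping frozenset({a, b}) -> distance.
    Returns (tree, root_height), where tree is a nested tuple of labels.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    height = 0.0
    while len(clusters) > 1:
        # Find the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        dab = d[frozenset((a, b))]
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Size-weighted average gives the distance to every remaining cluster.
        for c in clusters:
            d[frozenset((merged, c))] = (
                d[frozenset((a, c))] * size_a + d[frozenset((b, c))] * size_b
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
        height = dab / 2  # the new node is placed at height d_ab / 2
    (tree,) = clusters
    return tree, height


names = ["v", "w", "x", "y", "z"]
values = [6, 8, 8, 8, 8, 8, 8, 4, 4, 2]  # upper triangle, row by row
dist = {frozenset(p): val for p, val in zip(combinations(names, 2), values)}

tree, root_height = upgma(names, dist)
print(tree, root_height)
```

The merges happen in the order (y, z), (x, yz), (v, w), and the root lands at height 4, matching the heights 1, 2, 3, 4 in the example.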
Ultrametric Distances and Molecular Clock
Definition:
A distance function d(.,.) is ultrametric if for any three distances dij ≤ dik ≤ djk, it is true that
dij ≤ dik = djk
The Molecular Clock:
The evolutionary distance between species x and y is 2 × the Earth time to reach the nearest common ancestor
That is, the molecular clock has constant rate in all species
The molecular clock results in ultrametric distances
[Figure: tree over leaves 1–5 drawn against a time axis in years]
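The ultrametric condition can be checked directly: in every triple of species, the two largest pairwise distances must be equal. The sketch below applies that test to the two example matrices used in these slides, one clock-like and one merely additive. The helper names are assumptions.

```python
from itertools import combinations


def upper_triangle(names, values):
    """Build a symmetric lookup from upper-triangle values listed row by row."""
    return {frozenset(p): v for p, v in zip(combinations(names, 2), values)}


def is_ultrametric(names, dist, tol=1e-9):
    """Three-point condition: in every triple the two largest distances are equal."""
    for i, j, k in combinations(names, 3):
        _, middle, largest = sorted(
            (dist[frozenset((i, j))], dist[frozenset((i, k))], dist[frozenset((j, k))])
        )
        if abs(largest - middle) > tol:
            return False
    return True


names = "vwxyz"
clock_like = upper_triangle(names, [6, 8, 8, 8, 8, 8, 8, 4, 4, 2])
additive_only = upper_triangle(names, [10, 17, 16, 16, 15, 14, 14, 9, 15, 14])

print(is_ultrametric(names, clock_like))     # distances from the average linkage example
print(is_ultrametric(names, additive_only))  # distances from the additive example
```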
Ultrametric Distances & Average Linkage
Average Linkage is guaranteed to reconstruct correctly a binary tree with
ultrametric distances
Proof: Exercise
Weakness of Average Linkage
Molecular clock: all species evolve at the same rate (Earth time)
However, certain species (e.g., mouse, rat) evolve much faster
Example where UPGMA messes up:
[Figure: two trees on leaves 1–4, labeled "AL tree" and "Correct tree"]
Additive Distances
[Figure: tree with nodes numbered 1–13; the path between leaves 1 and 4 is labeled d1,4]
Given a tree, a distance measure is additive if the distance between any pair of
leaves is the sum of lengths of edges connecting them
Given a tree T & additive distances dij, can uniquely reconstruct edge lengths:
• Find two neighboring leaves i, j, with common parent k
• Place parent node k at distance dkm = ½ (dim + djm – dij) from any node m ≠ i, j
Reconstructing Additive Distances Given T
D:
      v   w   x   y   z
v     0  10  17  16  16
w         0  15  14  14
x             0   9  15
y                 0  14
z                     0

[Figure: tree T with leaves v, w, x, y, z and edge lengths 5 (x), 4 (y), 3, 7 (z), 3, 4 (w), 6 (v)]
If we know T and D, but do not know the length of each edge, we can reconstruct those lengths
Reconstructing Additive Distances Given T
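D:
      v   w   x   y   z
v     0  10  17  16  16
w         0  15  14  14
x             0   9  15
y                 0  14
z                     0

Replace the neighboring leaves v, w by their parent a:
dax = ½ (dvx + dwx – dvw)
day = ½ (dvy + dwy – dvw)
daz = ½ (dvz + dwz – dvw)

D1:
      a   x   y   z
a     0  11  10  10
x         0   9  15
y             0  14
z                 0

[Figure: tree T with the new internal node a joining v and w]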
Reconstructing Additive Distances Given T
D1:
      a   x   y   z
a     0  11  10  10
x         0   9  15
y             0  14
z                 0

D2:
      a   b   z
a     0   6  10
b         0  10
z             0

D3:
      a   c
a     0   3
c         0

[Figure: tree T with internal nodes a, b, c and edge lengths 5 (x), 4 (y), 3, 7 (z), 3, 4 (w), 6 (v)]
d(a, c) = 3
d(b, c) = d(a, b) – d(a, c) = 3
d(c, z) = d(a, z) – d(a, c) = 7
d(b, x) = d(a, x) – d(a, b) = 5
d(b, y) = d(a, y) – d(a, b) = 4
d(a, w) = d(z, w) – d(a, z) = 4
d(a, v) = d(z, v) – d(a, z) = 6
Correct!!!
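The reduction D → D1 → D2 → D3 and the back-substitution for the edge lengths can be replayed in code. This is a sketch that assumes the merge order is read off the known tree T (v, w into a; x, y into b; b, z into c). The helper names `pair` and `merge_neighbors` are illustrative.

```python
from itertools import combinations


def pair(a, b):
    return frozenset((a, b))


def merge_neighbors(d, i, j, k):
    """Replace neighboring leaves i, j by their parent k: d_km = (d_im + d_jm - d_ij) / 2."""
    others = {m for p in d for m in p} - {i, j}
    reduced = {p: v for p, v in d.items() if i not in p and j not in p}
    for m in others:
        reduced[pair(k, m)] = 0.5 * (d[pair(i, m)] + d[pair(j, m)] - d[pair(i, j)])
    return reduced


names = "vwxyz"
values = [10, 17, 16, 16, 15, 14, 14, 9, 15, 14]  # upper triangle of D, row by row
D = {pair(a, b): val for (a, b), val in zip(combinations(names, 2), values)}

D1 = merge_neighbors(D, "v", "w", "a")
D2 = merge_neighbors(D1, "x", "y", "b")
D3 = merge_neighbors(D2, "b", "z", "c")

# Edge lengths follow by subtracting along paths, as on the slide.
edges = {
    ("a", "c"): D3[pair("a", "c")],
    ("b", "c"): D2[pair("a", "b")] - D3[pair("a", "c")],
    ("c", "z"): D2[pair("a", "z")] - D3[pair("a", "c")],
    ("b", "x"): D1[pair("a", "x")] - D2[pair("a", "b")],
    ("b", "y"): D1[pair("a", "y")] - D2[pair("a", "b")],
    ("a", "w"): D[pair("z", "w")] - D1[pair("a", "z")],
    ("a", "v"): D[pair("z", "v")] - D1[pair("a", "z")],
}
for edge, length in edges.items():
    print(edge, length)
```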
Neighbor-Joining
• Guaranteed to produce the correct tree if distance is additive
• May produce a good tree even when distance is not additive
Step 1: Finding neighboring leaves
Define
$D_{ij} = (N - 2)\, d_{ij} - \sum_{k \neq i} d_{ik} - \sum_{k \neq j} d_{jk}$
Claim: The above “magic trick” ensures that Dij is minimal iff i, j are neighbors
[Figure: tree with leaves 1–4 and edge lengths 0.1, 0.1, 0.1, 0.4, 0.4]
Algorithm: Neighbor-Joining
Initialization:
Define T to be the set of leaf nodes, one per sequence
Let L = T
Iteration:
Pick i, j s.t. Dij is minimal
Define a new node k, and set dkm = ½ (dim + djm – dij) for all m ∈ L
Add k to T, with edges of lengths dik = ½ (dij + ri – rj), djk = dij – dik
where $r_i = (N - 2)^{-1} \sum_{k \neq i} d_{ik}$ (N is the current number of nodes in L)
Remove i, j from L;
Add k to L
Termination:
When L consists of two nodes i, j, add the edge between them with length dij
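A compact sketch of the neighbor-joining loop, assuming N is taken to be the current size of L at each iteration. It is run on the additive example matrix over v, w, x, y, z. The internal node names `n0`, `n1`, … are made up for illustration and must not collide with leaf names.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """names: leaf labels; dist: dict frozenset({a, b}) -> distance.

    Returns a list of edges (node, node, length).
    """
    d = dict(dist)
    active = list(names)  # the set L
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        total = {i: sum(d[frozenset((i, k))] for k in active if k != i) for i in active}
        # Pick the pair minimizing D_ij = (n - 2) d_ij - sum_k d_ik - sum_k d_jk.
        i, j = min(
            combinations(active, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        dij = d[frozenset((i, j))]
        ri, rj = total[i] / (n - 2), total[j] / (n - 2)
        k = f"n{next_id}"  # hypothetical name for the new internal node
        next_id += 1
        dik = 0.5 * (dij + ri - rj)
        edges += [(i, k, dik), (j, k, dij - dik)]
        active = [m for m in active if m not in (i, j)]
        for m in active:
            d[frozenset((k, m))] = 0.5 * (d[frozenset((i, m))] + d[frozenset((j, m))] - dij)
        active.append(k)
    i, j = active
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


names = ["v", "w", "x", "y", "z"]
values = [10, 17, 16, 16, 15, 14, 14, 9, 15, 14]  # upper triangle, row by row
dist = {frozenset(p): val for p, val in zip(combinations(names, 2), values)}

edges = neighbor_joining(names, dist)
for edge in edges:
    print(edge)
```

On this additive input the recovered leaf edges are 6 (v), 4 (w), 5 (x), 4 (y), 7 (z), with two internal edges of length 3, the same lengths as in the reconstruction example.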
Parsimony
• One of the most popular methods:
 GIVEN multiple alignment
 FIND tree & history of substitutions explaining alignment
Idea:
Find the tree that explains the observed sequences with a minimal
number of substitutions
Two computational subproblems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)
Example: Parsimony Cost of One Column
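[Figure: four-leaf tree with leaf characters A, B, A, A and leaf sets {A}, {B}, {A}, {A}; one internal node gets {A, B} (C++), the other internal nodes get {A}; final cost C = 1]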
Parsimony Scoring
Given a tree, and an alignment column u
Label internal nodes to minimize the number of required substitutions
Initialization:
Set cost C = 0; node k = 2N – 1 (the root)
Iteration:
If k is a leaf, set Rk = { xk[u] }
// Rk is simply the character of kth species
If k is not a leaf,
Let i, j be the daughter nodes;
Set Rk = Ri ∩ Rj if intersection is nonempty
Set Rk = Ri ∪ Rj, and increment C if intersection is empty
Termination:
Minimal cost of tree for column u, = C
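The scoring pass can be written as a short recursion. In this sketch a leaf is a one-character string and an internal node is a pair of subtrees. The four-leaf tree shape is an assumption chosen to match the one-column example (leaves A, B, A, A).

```python
def fitch(node):
    """Return (R, C) for a subtree: the candidate set and the substitution count."""
    if isinstance(node, str):
        return {node}, 0  # leaf: R is just the observed character
    left, right = node
    r_i, c_i = fitch(left)
    r_j, c_j = fitch(right)
    common = r_i & r_j
    if common:
        return common, c_i + c_j
    # Empty intersection: take the union and pay one substitution.
    return r_i | r_j, c_i + c_j + 1


tree = (("A", "B"), ("A", "A"))  # assumed topology for the example column
root_set, cost = fitch(tree)
print(root_set, cost)
```

For a full alignment, the cost of a tree is the sum of this column cost over all columns.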
Example
[Figure: eight-leaf tree with leaf characters A, A, A, A, B, B, A, B and leaf sets {A}, {A}, {A}, {A}, {B}, {B}, {A}, {B}; internal sets {B}, {A,B}, {A}, {B}, {A}, {A,B}, {A}]
Parsimony Traceback
Traceback:
1. Choose an arbitrary nucleotide from R2N – 1 for the root
2. Having chosen nucleotide r for parent k,
If r ∈ Ri choose r for daughter i
Else, choose arbitrary nucleotide from Ri
Easy to see that this traceback produces some assignment of cost C
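A sketch of the two passes together: a bottom-up pass that records the sets R, then the traceback that labels every node. The balanced eight-leaf topology is an assumption that happens to reproduce the internal sets of the example, and `min()` stands in for "choose arbitrary".

```python
def annotate(node):
    """Bottom-up pass: return (R, C, children), with children annotated the same way."""
    if isinstance(node, str):
        return {node}, 0, ()
    left, right = annotate(node[0]), annotate(node[1])
    common = left[0] & right[0]
    cost = left[1] + right[1] + (0 if common else 1)
    return (common or left[0] | right[0]), cost, (left, right)


def traceback(annotated, parent=None):
    """Top-down pass: keep the parent's character when it is in R, else pick one from R."""
    r, _, children = annotated
    choice = parent if parent in r else min(r)  # min() is an arbitrary but repeatable pick
    return choice, tuple(traceback(child, choice) for child in children)


def count_substitutions(labeled):
    """Number of edges whose two endpoints carry different characters."""
    label, children = labeled
    return sum((label != child[0]) + count_substitutions(child) for child in children)


tree = ((("A", "A"), ("A", "A")), (("B", "B"), ("A", "B")))  # assumed topology
annotated = annotate(tree)
cost = annotated[1]
labeled = traceback(annotated)
print(cost, labeled[0], count_substitutions(labeled))
```

The labeling produced by the traceback uses exactly as many substitutions as the cost C from the scoring pass.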
Another Parsimony Algorithm
Let C(v) be cost for subtree rooted at node v
Let C(v,x) be cost for subtree rooted at v if we force v to have value x
Initialization:
For each leaf v
C(v) = 0
C(v,x) = 0 if x is the input character that labels v; C(v,x) = ∞ otherwise
Iteration:
Let u, w be children of v
C(v,x) = min(C(u) + 1, C(u,x)) + min(C(w) + 1, C(w,x))
C(v) = minx C(v,x)
Termination:
Minimal cost is C(root)
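A sketch of this second formulation, with one term per child (u and w) in the recursion. Trees are nested pairs with one-character leaves; the nucleotide alphabet and the example topologies are assumptions.

```python
import math

ALPHABET = "ACGT"


def cost_table(node):
    """Return {x: C(v, x)} for the subtree rooted at node v."""
    if isinstance(node, str):
        return {x: (0 if x == node else math.inf) for x in ALPHABET}
    u, w = (cost_table(child) for child in node)
    c_u, c_w = min(u.values()), min(w.values())  # C(u), C(w)
    # Either the child keeps x, or it takes its best value at the price of one substitution.
    return {x: min(c_u + 1, u[x]) + min(c_w + 1, w[x]) for x in ALPHABET}


def parsimony_cost(tree):
    """C(root) = min over x of C(root, x)."""
    return min(cost_table(tree).values())


small = (("A", "C"), ("A", "A"))
large = ((("A", "A"), ("A", "A")), (("C", "C"), ("A", "C")))
print(parsimony_cost(small), parsimony_cost(large))
```

Both trees get the same costs as the set-based scoring gives on the same shapes.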
Probabilistic Methods
[Figure: root xroot with two children x1, x2 on branches of length t1, t2]
A more refined measure of evolution along a tree than parsimony
P(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot)
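If we use Jukes-Cantor, for example, and x1 = xroot = A, x2 = C, t1 = t2 = 1, then
$P(x_1, x_2, x_{root} \mid t_1, t_2) = p_A \cdot \tfrac{1}{4}(1 + 3e^{-4\alpha}) \cdot \tfrac{1}{4}(1 - e^{-4\alpha}) = (\tfrac{1}{4})^3 (1 + 3e^{-4\alpha})(1 - e^{-4\alpha})$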
Probabilistic Methods
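[Figure: tree with root xroot, internal nodes xu, and leaves x1, x2, …, xN]
If we know all internal labels xu,
$P(x_1, x_2, \ldots, x_N, x_{N+1}, \ldots, x_{2N-1} \mid T, t) = P(x_{root}) \prod_{j \neq root} P(x_j \mid x_{parent(j)}, t_{j,\,parent(j)})$
Usually we don’t know the internal labels, therefore
$P(x_1, x_2, \ldots, x_N \mid T, t) = \sum_{x_{N+1}} \sum_{x_{N+2}} \cdots \sum_{x_{2N-1}} P(x_1, x_2, \ldots, x_{2N-1} \mid T, t)$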
Felsenstein’s Likelihood Algorithm
Define:
and recursively compute:
Felsenstein’s Likelihood Algorithm
Now using u and U we can compute:
and
Probabilistic Methods
Given M (ungapped) alignment columns of N sequences,
• Define likelihood of a tree:
$L(T, t) = P(\mathrm{Data} \mid T, t) = \prod_{m=1}^{M} P(x_{1m}, \ldots, x_{Nm} \mid T, t)$
Maximum Likelihood Reconstruction:
• Given data X = (xij), find a topology T and length vector t that
maximize likelihood L(T, t)
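A sketch of how L(T, t) can be evaluated for a fixed tree without enumerating internal labels, using the standard bottom-up (pruning) recursion under Jukes-Cantor. The slide's own definitions for Felsenstein's algorithm are not in the transcript, so the tree encoding, function names and α = 0.1 are all assumptions.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t, alpha):
    """Jukes-Cantor probability that nucleotide a has become b after time t."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * decay) if a == b else 0.25 * (1 - decay)


def partial(node, column, alpha):
    """Return {a: P(leaves below node | node has nucleotide a)}.

    A leaf is an int index into the column; an internal node is a pair
    ((left, t_left), (right, t_right)).
    """
    if isinstance(node, int):
        return {a: 1.0 if a == column[node] else 0.0 for a in BASES}
    result = {a: 1.0 for a in BASES}
    for child, t in node:
        below = partial(child, column, alpha)
        for a in BASES:
            result[a] *= sum(jc_prob(a, b, t, alpha) * below[b] for b in BASES)
    return result


def log_likelihood(tree, columns, alpha):
    """log L(T, t): the product over columns becomes a sum of logs."""
    total = 0.0
    for column in columns:
        root = partial(tree, column, alpha)
        total += math.log(sum(0.25 * root[a] for a in BASES))  # uniform root frequencies
    return total


alpha = 0.1                    # illustrative rate
tree = ((0, 1.0), (1, 1.0))    # hypothetical two-leaf tree, t1 = t2 = 1
columns = ["AC", "AA"]         # two alignment columns of N = 2 sequences
ll = log_likelihood(tree, columns, alpha)
print(ll)
```

Maximum likelihood reconstruction would wrap a search over topologies T and branch lengths t around this evaluation; that search is not shown here.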
Current popular methods
HUNDREDS of programs available!
http://evolution.genetics.washington.edu/phylip/software.html#methods
Some recommended programs:
• Discrete, parsimony-based
 Rec-1-DCM3
http://www.cs.utexas.edu/users/tandy/mp.html
Tandy Warnow and colleagues
• Probabilistic
 SEMPHY
http://www.cs.huji.ac.il/labs/compbio/semphy/
Nir Friedman and colleagues