Phylogeny Tree
Reconstruction
[Figure: example trees with leaves labeled 1 to 5]
Final Exam
• 24-hour, take-home exam
• More straightforward questions than in the homeworks
• Please email Michael and Serafim by Friday with your preferred day to take the exam
• Exam starts Sunday, …, Thursday noon; ends Monday, …, Friday noon
Number of labeled unrooted tree topologies
[Figure: unrooted tree on leaves 1, 2, 3 showing the candidate positions for leaf 4]
• How many possibilities are there for leaf 4?
For the 4th leaf, there are 3 possibilities
• How many possibilities are there for leaf 5?
For the 5th leaf, there are 5 possibilities
• How many possibilities are there for leaf 6?
For the 6th leaf, there are 7 possibilities
• How many possibilities are there for leaf n?
For the nth leaf, there are 2n – 5 possibilities
N = 10: #unrooted: 2,027,025; #rooted: 34,459,425
N = 30: #unrooted: $8.7 \times 10^{36}$; #rooted: $4.95 \times 10^{38}$
• #unrooted trees for n taxa: $(2n-5)(2n-7)\cdots 3 \cdot 1 = (2n-5)! \,/\, [2^{n-3}(n-3)!]$
• #rooted trees for n taxa: $(2n-3)(2n-5)(2n-7)\cdots 3 = (2n-3)! \,/\, [2^{n-2}(n-2)!]$
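The counting argument can be checked numerically with a short sketch. The function names `count_unrooted` and `count_rooted` are illustrative and not part of the slides; the only facts used are that the k-th leaf has 2k - 5 attachment points and that a rooted tree arises by placing the root on one of the 2n - 3 edges.

```python
def count_unrooted(n):
    """Number of labeled unrooted binary topologies on n >= 3 leaves."""
    count = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5  # the k-th leaf can attach to any of 2k-5 edges
    return count


def count_rooted(n):
    """Number of labeled rooted binary topologies on n >= 3 leaves."""
    # Rooting an unrooted tree means choosing one of its 2n-3 edges.
    return count_unrooted(n) * (2 * n - 3)


for n in (10, 30):
    print(n, count_unrooted(n), count_rooted(n))
```

For n = 10 this reproduces the two exact counts quoted on the slide.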
Search through tree topologies:
Branch and Bound
Observation: adding an edge to an existing tree can only increase the parsimony cost (or leave it unchanged); it never decreases it
Enumerate all unrooted trees with at most N leaves:
$[i_3][i_5][i_7] \cdots [i_{2N-5}]$
where each $i_k$ can take values from 0 (no edge) to k
At each point keep C = smallest cost so far for a complete tree
Start B&B with tree $[1][0][0] \cdots [0]$
Whenever the cost of the current tree T is > C, then:
  • T is not optimal
  • Any tree extending T with more edges is not optimal either, so skip all of them: increment by 1 the rightmost nonzero counter
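The pruning idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the slides' exact procedure: it uses recursion instead of the counter vector, it scores trees with Fitch's small-parsimony count (the slides do not say how the cost is computed), and it represents an unrooted tree as a rooted tree on leaves 1, 2, … with leaf 0 attached at the root. All function names are illustrative.

```python
def insertions(tree, leaf):
    """Yield each tree made by attaching `leaf` to one edge of `tree`.

    A tree is a leaf index (int) or a pair (left, right). The edge above
    the current subtree counts as one of the insertion points.
    """
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for sub in insertions(left, leaf):
            yield (sub, right)
        for sub in insertions(right, leaf):
            yield (left, sub)


def fitch(tree, seqs):
    """Return (per-site state sets, number of changes) for a rooted tree."""
    if not isinstance(tree, tuple):
        return [{ch} for ch in seqs[tree]], 0
    left_sets, left_cost = fitch(tree[0], seqs)
    right_sets, right_cost = fitch(tree[1], seqs)
    sets, cost = [], left_cost + right_cost
    for a, b in zip(left_sets, right_sets):
        common = a & b
        if common:
            sets.append(common)
        else:
            sets.append(a | b)
            cost += 1  # disjoint sets force one change at this site
    return sets, cost


def branch_and_bound(seqs):
    """Return (best cost, best tree) over all unrooted topologies (n >= 3)."""
    n = len(seqs)
    best_cost, best_tree = float("inf"), None

    def extend(tree, next_leaf):
        nonlocal best_cost, best_tree
        # Leaf 0 hangs off the root edge, so (0, tree) scores the unrooted tree.
        cost = fitch((0, tree), seqs)[1]
        if cost > best_cost:
            return  # bound: no extension of this partial tree can beat C
        if next_leaf == n:
            if cost < best_cost:
                best_cost, best_tree = cost, tree
            return
        for bigger in insertions(tree, next_leaf):
            extend(bigger, next_leaf + 1)

    extend((1, 2), 3)  # the single topology on leaves 0, 1, 2
    return best_cost, best_tree


print(branch_and_bound(["AA", "AA", "TT", "TT"]))
```

Here `best_cost` plays the role of C. The bound is safe because the cost of a partial tree never exceeds the cost of any tree obtained by adding leaves to it.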
Bootstrapping to get the best trees
Main outline of algorithm
1. Select random columns from a multiple alignment, sampling with replacement, so one column can appear several times
2. Build a phylogenetic tree based on the random sample from (1)
3. Repeat (1), (2) many (say, 1000) times
4. Output the tree that is constructed most frequently
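The outline above can be sketched as follows. Two assumptions are not stated on the slide: each replicate draws as many columns as the original alignment has, and the tree builder is passed in as a hypothetical callable `build_tree` that returns a hashable description of the tree (for example a canonical Newick string).

```python
import random
from collections import Counter


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement (same width as input)."""
    width = len(alignment[0])
    picks = [rng.randrange(width) for _ in range(width)]
    return ["".join(row[c] for c in picks) for row in alignment]


def most_frequent_tree(alignment, build_tree, replicates=1000, seed=0):
    """Return (tree, count) for the tree built most often across replicates.

    `build_tree` is a placeholder for any tree-construction method; it must
    return a hashable value so that identical trees can be counted.
    """
    rng = random.Random(seed)
    counts = Counter(
        build_tree(bootstrap_columns(alignment, rng)) for _ in range(replicates)
    )
    return counts.most_common(1)[0]
```

Comparing trees by a canonical string is a design choice of this sketch: two replicates count as the same tree only if `build_tree` writes them identically.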
Probabilistic Methods
[Figure: root x_root with children x_1 and x_2 on branches of length t_1 and t_2]
A more refined measure of evolution along a tree than parsimony
P(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot)
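If we use Jukes-Cantor, for example, and $x_1 = x_{root} = A$, $x_2 = C$, $t_1 = t_2 = 1$, this gives
$= p_A \cdot \tfrac{1}{4}(1 + 3e^{-4\alpha}) \cdot \tfrac{1}{4}(1 - e^{-4\alpha}) = (\tfrac{1}{4})^3 (1 + 3e^{-4\alpha})(1 - e^{-4\alpha})$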
Probabilistic Methods
[Figure: tree with root x_root = x_{2N-1}, internal nodes x_u, and leaves x_1, x_2, …, x_N]
If we know all internal labels $x_u$,
$P(x_1, x_2, \ldots, x_N, x_{N+1}, \ldots, x_{2N-1} \mid T, t) = P(x_{root}) \prod_{j \neq root} P(x_j \mid x_{parent(j)}, t_{j,\,parent(j)})$
Usually we don't know the internal labels, therefore
$P(x_1, x_2, \ldots, x_N \mid T, t) = \sum_{x_{N+1}} \sum_{x_{N+2}} \cdots \sum_{x_{2N-1}} P(x_1, x_2, \ldots, x_{2N-1} \mid T, t)$
Computing the Likelihood of a Tree
[Figure: node x_k with daughter nodes x_i and x_j on branches of length t_ki and t_kj]
• Define $P(L_k \mid a)$: probability of the leaves in the subtree rooted at $x_k$, given that $x_k = a$
• Then, $P(L_k \mid a) = \left(\sum_b P(L_i \mid b)\, P(b \mid a, t_{ki})\right)\left(\sum_c P(L_j \mid c)\, P(c \mid a, t_{kj})\right)$
Felsenstein’s Likelihood Algorithm
To calculate P(x1, x2, …, xN | T, t)
Initialization:
Set k = 2N – 1
Recursion: Compute P(Lk | a) for all a ∈ Σ
If k is a leaf node:
Set P(Lk | a) = 1(a = xk)
If k is not a leaf node:
1. Compute P(Li | b), P(Lj | b) for all b, for daughter nodes i, j
2. Set $P(L_k \mid a) = \sum_{b,c} P(b \mid a, t_{ki})\, P(L_i \mid b)\, P(c \mid a, t_{kj})\, P(L_j \mid c)$
Termination:
Likelihood at this column = $P(x_1, x_2, \ldots, x_N \mid T, t) = \sum_a P(L_{2N-1} \mid a)\, P(a)$
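The recursion can be written as a short post-order traversal. This is a minimal sketch under assumptions that go beyond the slide: substitution probabilities follow Jukes-Cantor, the root distribution P(a) is uniform, and a tree is encoded as nested pairs `((child_i, t_ki), (child_j, t_kj))` with a single-character string for each leaf. The names are illustrative.

```python
import math

ALPHABET = "ACGT"


def jc_prob(a, b, t, alpha=1.0):
    """Jukes-Cantor P(b | a, t); alpha is the assumed substitution rate."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * e) if a == b else 0.25 * (1.0 - e)


def conditional(node):
    """Return {a: P(L_k | a)} for the subtree rooted at `node`."""
    if isinstance(node, str):
        # Leaf: indicator that the state equals the observed base.
        return {a: 1.0 if a == node else 0.0 for a in ALPHABET}
    (child_i, t_ki), (child_j, t_kj) = node
    li, lj = conditional(child_i), conditional(child_j)
    return {
        a: sum(jc_prob(a, b, t_ki) * li[b] for b in ALPHABET)
        * sum(jc_prob(a, c, t_kj) * lj[c] for c in ALPHABET)
        for a in ALPHABET
    }


def column_likelihood(root):
    """P(x_1, ..., x_N | T, t) for one column, with uniform P(a) = 1/4."""
    at_root = conditional(root)
    return sum(0.25 * at_root[a] for a in ALPHABET)


# Example: leaves A and A under one internal node, sister to leaf G.
tree = (((("A", 0.5), ("A", 0.5)), 0.3), ("G", 0.2))
print(column_likelihood(tree))
```

Each node is visited once and does work proportional to the square of the alphabet size, which is what makes this cheaper than summing over all internal labelings directly.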
Probabilistic Methods
Given M (ungapped) alignment columns of N sequences,
• Define likelihood of a tree:
$L(T, t) = P(\text{Data} \mid T, t) = \prod_{m=1}^{M} P(x_{1m}, \ldots, x_{Nm} \mid T, t)$
Maximum Likelihood Reconstruction:
• Given data X = (xij), find a topology T and length vector t that
maximize likelihood L(T, t)
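In practice the product over columns is usually evaluated as a sum of logarithms, since multiplying many small probabilities underflows. A minimal sketch, where `column_likelihood` is a hypothetical callable returning P(x_1m, …, x_Nm | T, t) for one column:

```python
import math


def log_likelihood(columns, column_likelihood):
    """Return log L(T, t) as the sum of per-column log-likelihoods."""
    return sum(math.log(column_likelihood(col)) for col in columns)
```

Maximizing this sum over (T, t) gives the same tree as maximizing L(T, t), because the logarithm is monotonic.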
Some new sequencing technologies
Molecular Inversion Probes
Single Molecule Array for Genotyping—Solexa
Nanopore Sequencing
http://www.mcb.harvard.edu/branton/index.htm
Nanopore Sequencing—Assembly
• Resulting reads are likely to look different from Sanger reads:
  – Long (perhaps 10,000 bp to 1,000,000 bp)
  – High error rate (perhaps 10% to 30%)
  – Two colors?
    • A / CTG
    • AT / CG
    • AG / CT
• How can we assemble under such conditions?
Pyrosequencing
Pyrosequencing on a chip
Mostafa Ronaghi, Stanford
Genome Technologies Center
454 Life Sciences
Pyrosequencing Signal
Pyrosequencing—Assembly
• Resulting reads are likely to look different from Sanger reads:
  – Short (currently 100 to 200 bp)
  – Low error rates, except in homopolymeric runs (AAA…, CCC…, etc.)
  – It is not currently known how to do paired reads on a chip
Polony Sequencing