Some new sequencing technologies CS262 Lecture 12, Win07, Batzoglou Molecular Inversion Probes CS262 Lecture 12, Win07, Batzoglou.

Download Report

Transcript Some new sequencing technologies CS262 Lecture 12, Win07, Batzoglou Molecular Inversion Probes CS262 Lecture 12, Win07, Batzoglou.

Some new sequencing technologies
CS262 Lecture 12, Win07, Batzoglou
Molecular Inversion Probes
CS262 Lecture 12, Win07, Batzoglou
Single Molecule Array for Genotyping—Solexa
CS262 Lecture 12, Win07, Batzoglou
Nanopore Sequencing
http://www.mcb.harvard.edu/branton/index.htm
CS262 Lecture 12, Win07, Batzoglou
Pyrosequencing on a chip
Mostafa Ronaghi, Stanford Genome
Technologies Center
CS262 Lecture 12, Win07, Batzoglou
454 Life Sciences
Polony Sequencing
CS262 Lecture 12, Win07, Batzoglou
Some future directions for sequencing
1.
Personalized genome sequencing
•
•
Find your ~3,000,000 single nucleotide polymorphisms (SNPs)
Find your rearrangements
•
Goals:
•
•
•
•
Link genome with phenotype
Provide personalized diet and medicine
(???) designer babies, big-brother insurance companies
Timeline:
•
•
•
Inexpensive sequencing:
Genotype–phenotype association:
Personalized drugs:
CS262 Lecture 12, Win07, Batzoglou
2010-2015
2010-???
2015-???
Some future directions for sequencing
2.
Environmental sequencing
•
Find your flora:
•
•
•
•
•
External organs: skin, mucous membranes
Gut, mouth, etc.
Normal flora: >200 species, >trillions of individuals
Flora–disease, flora–non-optimal health associations
Timeline:
•
•
•
•
organisms living in your body
Inexpensive research sequencing:
Research & associations
Personalized sequencing
today
within next 10 years
2015+
Find diversity of organisms living in different environments
•
•
Hard to isolate
Assembly of all organisms at once
CS262 Lecture 12, Win07, Batzoglou
Some future directions for sequencing
3.
Organism sequencing
•
•
Sequence a large fraction of all organisms
Deduce ancestors
•
•
•
•
Reconstruct ancestral genomes
Synthesize ancestral genomes
Clone—Jurassic park!
Study evolution of function
•
•
•
Find functional elements within a genome
How those evolved in different organisms
Find how modules/machines composed of many genes evolved
CS262 Lecture 12, Win07, Batzoglou
1
Phylogeny Tree
Reconstruction
4
3
1
4
5
2
2
3
5
Phylogenetic Trees
• Nodes: species
• Edges: time of independent
evolution
• Edge length represents
evolution time
 AKA genetic distance
 Not necessarily
chronological time
CS262 Lecture 12, Win07, Batzoglou
Inferring Phylogenetic Trees
Trees can be inferred by several criteria:
 Morphology of the organisms
• Can lead to mistakes!
 Sequence comparison
Example:
Orc:
Elf:
Dwarf:
Hobbit:
Human:
CS262 Lecture 12, Win07, Batzoglou
ACAGTGACGCCCCAAACGT
ACAGTGACGCTACAAACGT
CCTGTGACGTAACAAACGA
CCTGTGACGTAGCAAACGA
CCTGTGACGTAGCAAACGA
Modeling Evolution
During infinitesimal time t, there is not enough time for two substitutions
to happen on the same nucleotide
So we can estimate P(x | y, t), for x, y  {A, C, G, T}
Then let
S(t) =
P(A|A, t) ……
…
…
P(T|A, t) ……
P(A|T, t)
…
…
P(T|T, t)
x
t
y
CS262 Lecture 12, Win07, Batzoglou
Modeling Evolution
Reasonable assumption: multiplicative
(implying a stationary Markov process)
S(t+t’) = S(t)S(t’)
A
C
T
G
That is to say, P(x | y, t+t’) = z P(x | z, t) P(z | y, t’)
Jukes-Cantor: constant rate of evolution
For short time , S() = I+R =
CS262 Lecture 12, Win07, Batzoglou
1 - 3




1 - 3




1 - 3




1 - 3
Modeling Evolution
Jukes-Cantor:
S(t+) = S(t)S() = S(t)(I + R)
For longer times,
S(t) =
r(t)
s(t)
s(t)
s(t)
s(t)
r(t)
s(t)
s(t)
Therefore,
(S(t+) – S(t))/ = S(t) R
s(t)
s(t)
r(t)
s(t)
s(t)
s(t)
s(t)
r(t)
At the limit of   0,
S’(t) = S(t) R
Equivalently,
r’ = -3r + 3s
s’ = -s + r
Where we can derive:
r(t) = ¼ (1 + 3 e-4t)
s(t) = ¼ (1 – e-4t)
CS262 Lecture 12, Win07, Batzoglou
Those diff. equations lead to:
r(t) = ¼ (1 + 3 e-4t)
s(t) = ¼ (1 – e-4t)
Modeling Evolution
Kimura:
Transitions: A/G, C/T
Transversions: A/T, A/C, G/T, C/G
Transitions (rate ) are much more likely than transversions (rate )
S(t) =
r(t)
s(t)
u(t)
s(t)
Where
CS262 Lecture 12, Win07, Batzoglou
s(t)
r(t)
s(t)
u(t)
u(t)
s(t)
r(t)
s(t)
s(t)
u(t)
s(t)
r(t)
s(t) = ¼ (1 – e-4t)
u(t) = ¼ (1 + e-4t – e-2(+)t)
r(t) = 1 – 2s(t) – u(t)
Phylogeny and sequence comparison
Basic principles:
• Degree of sequence difference is proportional to length of
independent sequence evolution
• Only use positions where alignment is pretty certain –
avoid areas with (too many) gaps
CS262 Lecture 12, Win07, Batzoglou
Distance between two sequences
Given sequences xi, xj,
Define
dij = distance between the two sequences
One possible definition:
dij = fraction f of sites u where xi[u]  xj[u]
Better model (Jukes-Cantor):
f = 3 s(t) = ¾ (1 – e-4t) 
¾ e-4t = ¾ – f  log (e-4t) = log (1 – 4/3 f)
 -4t = log(1 – 4/3 f)
dij = t = - ¼ -1 log(1 – 4/3 f)
CS262 Lecture 12, Win07, Batzoglou
A simple clustering method for building tree
UPGMA (unweighted pair group method using arithmetic averages)
Or the Average Linkage Method
Given two disjoint clusters Ci, Proof
Cj of sequences,
Ci,Cl dpq + Cj,Cl dpq
1
dij = ––––––––– {p Ci, q Cj}ddklpq= ––––––––––––––––
(|Ci| + |Cj|) |Cl|
|Ci|  |Cj|
Claim that if Ck = Ci  Cj, then distance
to |)another
Cl||C
is:|) 
|C |/(|C ||C

d cluster
+ |C |/(|C
dpq
= ––––––––––––––––––––––––––––––––––––
(|Ci| + |Cj|)
i
dil |Ci| + djl |Cj|
dkl = ––––––––––––––
|Ci| + |Cj|
CS262 Lecture 12, Win07, Batzoglou
i
l
Ci,Cl
|Ci| dil + |Cj| djl
= –––––––––––––
(|Ci| + |Cj|)
pq
j
j
l
Cj,Cl
Algorithm: Average Linkage
1
Initialization:
4
Assign each xi into its own cluster Ci
Define one leaf per sequence, height 0
3
Iteration:
5
2
Find two clusters Ci, Cj s.t. dij is min
Let Ck = Ci  Cj
Define node connecting Ci, Cj,
& place it at height dij/2
Delete Ci, Cj
Termination:
When two clusters i, j remain,
place root at height dij/2
CS262 Lecture 12, Win07, Batzoglou
1
4
2
3
5
Example
v
v
w
0
x
6
y
8
z
8
0
x
8
8
8
0
4
4
0
2
y
w
xyz
0
6
8
0
8
vw xyz
8
v
w
v
z
w
xyz
vw
0
8
xyz
0
0
0
v
w
v
w
x
yz
0
6
8
8
0
8
8
0
4
x
yz
CS262 Lecture 12, Win07, Batzoglou
0
4
3
2
1
v
w
x
y
z
Ultrametric Distances and Molecular Clock
Definition:
A distance function d(.,.) is ultrametric if for any three distances dij  dik 
dij, it is true that
dij  dik = dij
The Molecular Clock:
The evolutionary distance between species x and y is 2 the Earth time
to reach the nearest common ancestor
That is, the molecular clock has constant rate in all species
The molecular clock
results in ultrametric
distances
years
1
4
CS262 Lecture 12, Win07, Batzoglou
2
3
5
Ultrametric Distances & Average Linkage
1
4
2
3
5
Average Linkage is guaranteed to reconstruct correctly a binary tree with
ultrametric distances
Proof: Exercise
CS262 Lecture 12, Win07, Batzoglou
Weakness of Average Linkage
Molecular clock: all species evolve at the same rate (Earth time)
However, certain species (e.g., mouse, rat) evolve much faster
Example where UPGMA messes up:
AL tree
Correct tree
3
2
1
CS262 Lecture 12, Win07, Batzoglou
4
1
4
2
3
Additive Distances
1
8
d1,4
3
13
7
9
5
11
10
2
4
12
6
Given a tree, a distance measure is additive if the distance between any pair of
leaves is the sum of lengths of edges connecting them
Given a tree T & additive distances dij, can uniquely reconstruct edge lengths:
•
•
Find two neighboring leaves i, j, with common parent k
Place parent node k at distance dkm = ½ (dim + djm – dij) from any node m  i, j
CS262 Lecture 12, Win07, Batzoglou
Reconstructing Additive Distances Given T
x
y
D
v
v
w
0
w
4
x
y
z
10
17
16
16
0
15
14
14
0
9
15
0
14
x
y
z
CS262 Lecture 12, Win07, Batzoglou
T
5
0
3
z
7
3
4
w
6
v
If we know T and D, but do not
know the length of each leaf, we
can reconstruct those lengths
Reconstructing Additive Distances Given T
x
y
D
v
v
w
0
w
x
y
z
10
17
16
16
0
15
14
14
0
9
15
0
14
x
y
z
CS262 Lecture 12, Win07, Batzoglou
T
0
z
w
v
Reconstructing Additive Distances Given T
D
v
v
w
0
10 17 16 16
w
0
x
y
x
z
T
y
15 14 14
x
0
y
9
15
0
14
z
z
a
0
D1
a
x
a
x
0
11 10 10
0
y
z
CS262 Lecture 12, Win07, Batzoglou
y
z
9
15
0
14
0
dax = ½ (dvx + dwx – dvw)
day = ½ (dvy + dwy – dvw)
daz = ½ (dvz + dwz – dvw)
w
v
Reconstructing Additive Distances Given T
D1
a
x
a
x
0
11 10 10
x
y
0
z
9
y
y
4
15
0
b
14
z
z
0
7
D2
a
b
a
b
z
0
6
10
0
z
D3
10
0
a
c
CS262 Lecture 12, Win07, Batzoglou
T
5
a
c
0
3
0
3
c
3 a
4
w
6
d(a, c) = 3
d(b, c) = d(a, b) – d(a, c) = 3
d(c, z) = d(a, z) – d(a, c) = 7
d(b, x) = d(a, x) – d(a, b) = 5
d(b, y) = d(a, y) – d(a, b) = 4
d(a, w) = d(z, w) – d(a, z) = 4
d(a, v) = d(z, v) – d(a, z) = 6
Correct!!!
v
Neighbor-Joining
• Guaranteed to produce the correct tree if distance is additive
• May produce a good tree even when distance is not additive
1
Step 1: Finding neighboring leaves
3
0.1
0.1
0.1
Define
Dij = (N – 2) dij – ki dik – kj djk
0.4
2
Claim: The above “magic trick” ensures that Dij is minimal iff i, j are neighbors
CS262 Lecture 12, Win07, Batzoglou
0.4
4
Algorithm: Neighbor-joining
Initialization:
Define T to be the set of leaf nodes, one per sequence
Let L = T
Iteration:
Pick i, j s.t. Dij is minimal
Define a new node k, and set dkm = ½ (dim + djm – dij) for all m  L
Add k to T, with edges of lengths dik = ½ (dij + ri – rj), djk = dij – dik
Remove i, j from L;
Add k to L
Termination:
When L consists of two nodes, i, j, and the edge between them of length dij
CS262 Lecture 12, Win07, Batzoglou
Parsimony – direct method not using distances
•
One of the most popular methods:


GIVEN multiple alignment
FIND tree & history of substitutions explaining alignment
Idea:
Find the tree that explains the observed sequences with a minimal
number of substitutions
Two computational subproblems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)
CS262 Lecture 12, Win07, Batzoglou
Example: Parsimony cost of one column
{A}
Final cost C = 1
{A}
{A, B}
Cost
C+=1
A
B
A
A
A
{A}
CS262 Lecture 12, Win07, Batzoglou
B
{B}
A
{A}
A
{A}
Parsimony Scoring
Given a tree, and an alignment column u
Label internal nodes to minimize the number of required substitutions
Initialization:
Set cost C = 0; node k = 2N – 1 (last leaf)
Iteration:
If k is a leaf, set Rk = { xk[u] }
// Rk is simply the character of kth species
If k is not a leaf,
Let i, j be the daughter nodes;
Set Rk = Ri  Rj if intersection is nonempty
Set Rk = Ri  Rj, and C += 1, if intersection is empty
Termination:
Minimal cost of tree for column u, = C
CS262 Lecture 12, Win07, Batzoglou
Example
{B}
{A,B}
{A}
{B}
{A}
{A,B}
{A}
A
A
A
A
B
B
A
B
{A}
{A}
{A}
{A}
{B}
{B}
{A}
{B}
CS262 Lecture 12, Win07, Batzoglou
Traceback to find ancestral nucleotides
Traceback:
1. Choose an arbitrary nucleotide from R2N – 1 for the root
2. Having chosen nucleotide r for parent k,
If r  Ri choose r for daughter i
Else, choose arbitrary nucleotide from Ri
Easy to see that this traceback produces some assignment of cost C
CS262 Lecture 12, Win07, Batzoglou
Example
Admissible with Traceback
x
B
Still optimal, but
inadmissible with Traceback
A
{A, B}
B
A
x
{A}
A
{A, B}
B
B
A
B
x
B
x
A
{A}
B A
{B} {A}
B
{B}
A
A
A
x
A
x
A
CS262 Lecture 12, Win07, Batzoglou
B
A
B
B
A
B
Probabilistic Methods
xroot
t1
t2
x1
x2
A more refined measure of evolution along a tree than parsimony
P(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot)
If we use Jukes-Cantor, for example, and x1 = xroot = A, x2 = C, t1 = t2 = 1,
= pA¼(1 + 3e-4α) ¼(1 – e-4α) = (¼)3(1 + 3e-4α)(1 – e-4α)
CS262 Lecture 12, Win07, Batzoglou
Probabilistic Methods
xroot
xu
x2
x1
•
xN
If we know all internal labels xu,
P(x1, x2, …, xN, xN+1, …, x2N-1 | T, t) = P(xroot)
•

jrootP(xj
| xparent(j), tj, parent(j))
Usually we don’t know the internal labels, therefore
P(x1, x2, …, xN | T, t) =
CS262 Lecture 12, Win07, Batzoglou
x x
N+1
N+2
…
x
2N-1
P(x1, x2, …, x2N-1 | T, t)
Felsenstein’s Likelihood Algorithm
To calculate P(x1, x2, …, xN | T, t)
Let P(Lk | a) denote the prob.
of all the leaves below node k,
given that the residue at k is a
Initialization:
Set k = 2N – 1
Iteration: Compute P(Lk | a) for all a  
If k is a leaf node:
Set P(Lk | a) = 1(a = xk)
If k is not a leaf node:
1. Compute P(Li | b), P(Lj | b) for all b, for daughter nodes i, j
2. Set P(Lk | a) =
b,c P(b | a, t ) P(L | b) P(c | a, t ) P(L | c)
i
i
j
j
Termination:
Likelihood at this column = P(x1, x2, …, xN | T, t) =
CS262 Lecture 12, Win07, Batzoglou
aP(L
2N-1
| a)P(a)
Probabilistic Methods
Given M (ungapped) alignment columns of N sequences,
• Define likelihood of a tree:
L(T, t) = P(Data | T, t) =

m=1…M
P(x1m, …, xnm, T, t)
Maximum Likelihood Reconstruction:
• Given data X = (xij), find a topology T and length vector t that
maximize likelihood L(T, t)
CS262 Lecture 12, Win07, Batzoglou
Current popular methods
HUNDREDS of programs available!
http://evolution.genetics.washington.edu/phylip/software.html#methods
Some recommended programs:
•
Discrete—Parsimony-based
 Rec-1-DCM3
http://www.cs.utexas.edu/users/tandy/mp.html
Tandy Warnow and colleagues
•
Probabilistic
 SEMPHY
http://www.cs.huji.ac.il/labs/compbio/semphy/
Nir Friedman and colleagues
CS262 Lecture 12, Win07, Batzoglou
Multiple Sequence
Alignments
CS262 Lecture 12, Win07, Batzoglou
Protein Phylogenies
• Proteins evolve by both duplication and species
divergence
CS262 Lecture 12, Win07, Batzoglou
Protein Phylogenies – Example
CS262 Lecture 12, Win07, Batzoglou
CS262 Lecture 12, Win07, Batzoglou
CS262 Lecture 12, Win07, Batzoglou
Definition
• Given N sequences x1, x2,…, xN:
 Insert gaps (-) in each sequence xi, such that
• All sequences have the same length L
• Score of the global map is maximum
• A faint similarity between two sequences becomes significant if
present in many
• Multiple alignments can point to elements that are conserved among
a class of and therefore important in the biology of these organisms
• The patterns of conservation can help us tell function of the element
CS262 Lecture 12, Win07, Batzoglou
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x:
y:
z:
AC-GCGG-C
AC-GC-GAG
GCCGC-GAG
Induces:
x: ACGCGG-C;
y: ACGC-GAC;
CS262 Lecture 12, Win07, Batzoglou
x: AC-GCGG-C;
z: GCCGC-GAG;
y: AC-GCGAG
z: GCCGCGAG
Sum Of Pairs (cont’d)
• Heuristic way to incorporate evolution tree:
Human
Mouse
Duck
Chicken
• Weighted SOP:
S(m) = k<l wkl s(mk, ml)
CS262 Lecture 12, Win07, Batzoglou
A Profile Representation
T
C
C
C
A
C
G
T
-
A
A
A
A
A
G
G
G
G
G
G
–
–
–
–
C
C
C
C
C
T
T
T
T
T
1
A
A
A
A
A
T
C
C
T
T
1
.6
1
A
A
A
A
G
.4
1
C
–
–
T
G
G
G
G
G
G
G
.4
.2
.4 .8 .4
1
.6 .2
.2
1
C
C
C
.8
1 .2
.2
.2
C
C
C
C
C
.6
.8
• Given a multiple alignment M = m1…mn
 Replace each column mi with profile entry pi
• Frequency of each letter in 
• # gaps
• Optional: # gap openings, extensions, closings
 Can think of this as a “likelihood” of each letter in each position
CS262 Lecture 12, Win07, Batzoglou
Multiple Sequence Alignments
Algorithms
CS262 Lecture 12, Win07, Batzoglou
Multidimensional DP
Generalization of Needleman-Wunsh:
S(m) = i S(mi)
(sum of column scores)
F(i1,i2,…,iN):
Optimal alignment up to (i1, …, iN)
F(i1,i2,…,iN)
= max(all neighbors of cube)(F(nbr)+S(nbr))
CS262 Lecture 12, Win07, Batzoglou
Multidimensional DP
• Example: in 3D (three
sequences):
• 7 neighbors/cell
F(i,j,k)
CS262 Lecture 12, Win07, Batzoglou
= max{ F(i – 1, j – 1, k – 1) + S(xi, xj, xk),
F(i – 1, j – 1, k ) + S(xi, xj, - ),
F(i – 1, j
, k – 1) + S(xi, -, xk),
F(i – 1, j
,k
) + S(xi, -, - ),
F(i , j – 1, k – 1) + S( -, xj, xk),
F(i , j – 1, k
) + S( -, xj, - ),
F(i , j
, k – 1) + S( -, -, xk) }
Multidimensional DP
Running Time:
1. Size of matrix: LN;
Where L = length of each sequence
N = number of sequences
2. Neighbors/cell: 2N – 1
Therefore………………………… O(2N LN)
CS262 Lecture 12, Win07, Batzoglou
Multidimensional DP
How do gap states generalize?
Running• Time:
badly!
1. Size• ofVERY
matrix:
LN;
 Require 2N – 1 states, one per combination of
gapped/ungapped sequences
Where L
lengthtime:
of each
 =Running
O(2N sequence
 2N  LN) = O(4N LN)
N = number of sequences
Y
YZ
2. Neighbors/cell: 2N – 1
XY
XYZ
Z
Therefore………………………… O(2N LN)
X
CS262 Lecture 12, Win07, Batzoglou
XZ
Progressive Alignment
pxy
pxyzw
x
y
z
pzw
w
•
When evolutionary tree is known:
 Align closest first, in the order of the tree
 In each step, align two sequences x, y, or profiles px, py, to generate a new
alignment with associated profile presult
Weighted version:
 Tree edges have weights, proportional to the divergence in that edge
 New profile is a weighted average of two old profiles
CS262 Lecture 12, Win07, Batzoglou
Progressive Alignment
x
y
Example
z
Profile: (A, C, G, T, -)
px = (0.8, 0.2, 0, 0, 0)
w
py = (0.6, 0, 0, 0, 0.4)
•
When evolutionary tree is known:
s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A)
+ 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
 Align closest first, in the order of the tree
 In each step, align two sequencesResult:
x, y, or profiles
px, py0.1,
, to generate
a new
pxy = (0.7,
0, 0, 0.2)
alignment with associated profile presult
s(p , -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
x
Weighted version:
 Tree edges have weights, proportional to the divergence in that edge
Result: p = (0.4, 0.1, 0, 0, 0.5)
 New profile is a weighted average of two old x-profiles
CS262 Lecture 12, Win07, Batzoglou
Progressive Alignment
x
y
?
z
w
•
When evolutionary tree is unknown:
 Perform all pairwise alignments
 Define distance matrix D, where D(x, y) is a measure of evolutionary
distance, based on pairwise alignment
 Construct a tree (UPGMA / Neighbor Joining / Other methods)
 Align on the tree
CS262 Lecture 12, Win07, Batzoglou
Heuristics to improve alignments
• Iterative refinement schemes
• A*-based search
• Consistency
• Simulated Annealing
• …
CS262 Lecture 12, Win07, Batzoglou
Iterative Refinement
One problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence comes
Example:
x:
y:
GAAGTT
GAC-TT
Frozen!
z:
w:
GAACTG
GTACTG
Now clear correct y = GA-CTT
CS262 Lecture 12, Win07, Batzoglou
Iterative Refinement
Algorithm (Barton-Stenberg):
1.
For j = 1 to N,
Remove xj, and realign to
x1…xj-1xj+1…xN
2.
Repeat 4 until convergence
allow y to vary
x,z fixed projection
CS262 Lecture 12, Win07, Batzoglou
z
y
x
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x:
y:
z:
w:
GAAGTTA
GAC-TTA
GAACTGA
GTACTGA
After realigning y:
x:
y:
z:
w:
CS262 Lecture 12, Win07, Batzoglou
GAAGTTA
G-ACTTA
GAACTGA
GTACTGA
+ 3 matches
Iterative Refinement
Example not handled well:
CS262 Lecture 12, Win07, Batzoglou
x:
y1:
y2:
y3:
GAAGTTA
GAC-TTA
GAC-TTA
GAC-TTA
z:
w:
GAACTGA
GTACTGA
Realigning any single yi
changes nothing
A* for Multiple Alignments
Review of the A* algorithm
v
GOAL
START
•
•
•
Say that we have a gigantic graph G
START: start node
GOAL: we want to reach this node with the minimum path
Dijkstra: O(VlogV + E) – too slow if the number of edges is huge
A*: a way of finding the optimal solution faster in practice
CS262 Lecture 12, Win07, Batzoglou
A* for Multiple Alignments
Review of the A* algorithm
g(v)
v
h(v)
GOAL
Lemma
Given sequences x, y, z, …
The sum-of pairs score of multiple alignment
M is lower (worse) than the sum of the optimal
g(v) is the cost so far
pairwise alignments
START
•
•
•
1.
2.
h(v) is an estimate of the minimum cost from v to GOAL
f(v) ≥ g(v) + h(v) is the
minimum cost of a path passing by v
Proof
M induces projected pairwise alignments axy,
ayz, axz, …, and Score(M) = d(axy) + d(axz) + d(ayz) +…
Expand v with the smallest f(v)
Never expand v, if f(v)
≥ shortest
path tothan
the the
goaloptimal
foundedit
so far
Each
of d(.) is smaller
distance
CS262 Lecture 12, Win07, Batzoglou
A* for Multiple Alignments
g(v)
v
h(v)
GOAL
START
• Nodes: Cells in the DPTo
matrix
compute h(v)
• g(v): alignment cost so far
For each pair of sequences x, y,
• h(v): sum-of-pairs of individual pairwise
alignments
R
Compute F (x, y), the DP matrix of scores of aligning
a suffix of x to a suffix of y
• Initial minimum alignment cost estimate: sum-of-pairs of global
pairwise alignments Then, at position (i1, i2, …, iN), h(v) becomes the
sum of (N choose 2) FR scores
CS262 Lecture 12, Win07, Batzoglou
Consistency
zk
z
xi
x
y
yj
CS262 Lecture 12, Win07, Batzoglou
yj’
Consistency
zk
z
xi
x
y
yj
yj’
Basic method for applying consistency
•
Compute all pairs of alignments xy, xz, yz, …
•
When aligning x, y during progressive alignment,
 For each (xi, yj), let s(xi, yj) = function_of(xi, yj, axz, ayz)
 Align x and y with DP using the modified s(.,.) function
CS262 Lecture 12, Win07, Batzoglou
Some Resources
Genome Resources
Annotation and alignment genome browser at UCSC
http://genome.ucsc.edu/cgi-bin/hgGateway
Specialized VISTA alignment browser at LBNL
http://pipeline.lbl.gov/cgi-bin/gateway2
ABC—Nice Stanford tool for browsing alignments
http://encode.stanford.edu/~asimenos/ABC/
Protein Multiple Aligners
http://www.ebi.ac.uk/clustalw/
CLUSTALW – most widely used
http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
MUSCLE – most scalable
http://probcons.stanford.edu/
PROBCONS – most accurate
CS262 Lecture 12, Win07, Batzoglou