Multiple Sequence Alignments CS262 Lecture 14, Win07, Batzoglou Progressive Alignment pxy pxyzw x y z pzw w • When evolutionary tree is known: Align closest first, in the order of the.
Download ReportTranscript Multiple Sequence Alignments CS262 Lecture 14, Win07, Batzoglou Progressive Alignment pxy pxyzw x y z pzw w • When evolutionary tree is known: Align closest first, in the order of the.
CS262 Lecture 14, Win07, Batzoglou
Multiple Sequence Alignments
Progressive Alignment
p xyzw p zw p xy x y z w • When evolutionary tree is known: Align closest first, in the order of the tree In each step, align two sequences x, y, or profiles p x , p y , to generate a new alignment with associated profile p result Weighted version: Tree edges have weights, proportional to the divergence in that edge New profile is a weighted average of two old profiles CS262 Lecture 14, Win07, Batzoglou
Progressive Alignment
?
x y z w • When evolutionary tree is unknown: Perform all pairwise alignments Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment Construct a tree
(UPGMA / Neighbor Joining / Other methods)
Align on the tree CS262 Lecture 14, Win07, Batzoglou
Some Resources
Genome Resources
Annotation and alignment genome browser at UCSC http://genome.ucsc.edu/cgi-bin/hgGateway Specialized VISTA alignment browser at LBNL http://pipeline.lbl.gov/cgi-bin/gateway2
Protein Multiple Aligners
http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
MUSCLE – most scalable http://probcons.stanford.edu/ PROBCONS – most accurate CS262 Lecture 14, Win07, Batzoglou
Real-world protein aligners
•
MUSCLE
High throughput One of the best in accuracy •
ProbCons
High accuracy Reasonable speed CS262 Lecture 14, Win07, Batzoglou
MUSCLE at a glance
1.
• Fast measurement of all pairwise distances between sequences D DRAFT (x, y) defined in terms of # common k-mers (k~3) – O(N 2 L logL) time 2.
Build tree T DRAFT based on those distances, with UPGMA 3.
• Progressive alignment over T DRAFT , resulting in multiple alignment M DRAFT Only perform alignment steps for the parts of the tree that have changed 4.
5.
Measure new Kimura-based distances D(x, y) based on M DRAFT Build tree T based on D 6.
Progressive alignment over T, to build M 7.
• • Iterative refinement; for many rounds, do:
Tree Partitioning:
Split M on one branch and realign the two resulting profiles If new alignment M’ has better sum-of-pairs score than previous one, accept CS262 Lecture 14, Win07, Batzoglou
PROBCONS at a glance
1.
2.
Computation of all posterior matrices M xy : M xy (i, j) = Prob(x i ~ y j ), using a HMM • Re estimation of posterior matrices M’ xy M’ xy (i, j) = 1/N sequence z k M xz (i, k) with
probabilistic consistency
M yz (j, k); M’ xy = Avg z (M xz M zy ) 3.
4.
• • Compute for every pair x, y, the maximum expected accuracy alignment A xy : alignment that maximizes aligned (i, j) in A Define E(x, y) = aligned (i, j) in Axy M’ xy (i, j) M’ xy (i, j) Build tree T with hierarchical clustering using similarity measure E(x, y) 5.
Progressive alignment on T to maximize E(.,.) 6.
• Iterative refinement; for many rounds, do:
Randomized Partitioning:
Split sequences in M in two subsets by flipping a coin for each sequence and realign the two resulting profiles CS262 Lecture 14, Win07, Batzoglou
Rapid Global Alignments How to align genomic sequences in (more or less) linear time
CS262 Lecture 14, Win07, Batzoglou
CS262 Lecture 14, Win07, Batzoglou
Motivation
• Genomic sequences are very long: Human genome = 3 x 10 9 Mouse genome = 2.7 x 10 9 –long –long • Aligning genomic regions is useful for revealing common gene structure It is useful to compare regions > 1,000,000-long CS262 Lecture 14, Win07, Batzoglou
The UCSC Browser
•
http://genome.ucsc.edu/cgi-bin/hgGateway
CS262 Lecture 14, Win07, Batzoglou
Main Idea
Genomic regions of interest contain islands of similarity, such as genes 1.
2.
3.
Find local alignments Chain an optimal subset of them Refine/complete the alignment Systems that use this idea to various degrees: MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, & others CS262 Lecture 14, Win07, Batzoglou
Saving cells in DP
1.
Find local alignments 2.
Chain -O(NlogN) L.I.S.
3.
Restricted DP CS262 Lecture 14, Win07, Batzoglou
Methods to CHAIN Local Alignments
Sparse Dynamic Programming O(N log N)
CS262 Lecture 14, Win07, Batzoglou
The Problem: Find a Chain of Local Alignments (x,y) (x’,y’) requires x < x’ y < y’ Each local alignment has a weight FIND the chain with highest total weight CS262 Lecture 14, Win07, Batzoglou
Quadratic Time Solution
• Build Directed Acyclic Graph (DAG): Nodes: local alignments [(x a ,x b ) (y a ,y b )] & score Directed edges: local alignments that can be chained • • • edge ( (x a , x b , y a , y b ) , (x c , x d , y c , y d ) ) x a < x b < x c < x d y a < y b < y c < y d Each local alignment is a node
v i
with alignment score
s i
CS262 Lecture 14, Win07, Batzoglou
Quadratic Time Solution
Initialization:
Find each node v a s.t. there is no edge (u, v a ) Set score of V(a) to be s a
Iteration:
For each v i , optimal path ending in v i has total score: V(i) = ma x j s.t. there is edge (v j , v i ) ( weight(v j , v i ) + V(j) )
Termination:
Optimal global chain: j = argmax ( V(j) ); trace chain from v j Worst case time: quadratic CS262 Lecture 14, Win07, Batzoglou
Sparse Dynamic Programming
Back to the LCS problem: • Given two sequences x = x 1 , …, x m y = y 1 , …, y n • Find the longest common subsequence Quadratic solution with DP • How about when “hits” x i = y j are sparse?
CS262 Lecture 14, Win07, Batzoglou
Sparse Dynamic Programming
15 3 24 16 20 4 24 3 11 18 4 20 24 3 11 15 11 4 18 20 • Imagine a situation where the number of hits is much smaller than O(nm) – maybe O(n) instead CS262 Lecture 14, Win07, Batzoglou
Sparse Dynamic Programming – L.I.S.
• Longest Increasing Subsequence • Given a sequence over an ordered alphabet x = x 1 , …, x m • Find a subsequence s = s 1 , …, s k s 1 < s 2 < … < s k CS262 Lecture 14, Win07, Batzoglou
Sparse Dynamic Programming – L.I.S.
Let input be w: w 1 ,…, w n INITIALIZATION: L: last LIS elt. array L[0] = -inf P: array of backpointers // L[j]: smallest j th L[1] = w element w i 1 L[2…n] = +inf B: array holding LIS elts; B[1] = 0 of j-long LIS seen so far ALGORITHM for i = 2 to n { Find j such that L[j – 1] < w[i] ≤ L[j] L[j]
w[i] B[j]
i P[i]
B[j – 1] }
•
That’s it!!!
Running time?
CS262 Lecture 14, Win07, Batzoglou
Sparse LCS expressed as LIS
Create a sequence
w
• Every matching point (i, j), is inserted into
w
as follows: • For each column j = 1…m, insert in
w
the points (i, j), in decreasing row i order • The 11 example points are inserted in the order given •
a = (y, x)
, chained iff
b = (y’, x’)
can be 4 20 y 24 3 11 15 11 4 18 20 15 3 24 x 16 20 4 24 3 11 18
6 4 2 7 1 8 10 3 5 9 11
a
is before
b
in w, and
y < y’
CS262 Lecture 14, Win07, Batzoglou
Sparse LCS expressed as LIS
Create a sequence
w
•
w
= (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10) Consider now
w
’s elements as ordered lexicographically, where
(y, x) < (y’, x’)
if
y < y’ Claim:
of
w
An increasing subsequence is a common subsequence of x and y 4 20 y 24 3 11 15 11 4 18 20 15 3 24 x 16 20 4 24 3 11 18
6 4 2 7 1 8 10 3 5 9 11
CS262 Lecture 14, Win07, Batzoglou
Sparse Dynamic Programming for LIS
Example:
w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10) 4 4.
5.
6.
7.
L = [L1] [L2] [L3] [L4] [L5] … 1.
(4,2) 2.
3.
(3,3) (3,3) (10,5) (2,5) (10,5) (2,5) (8,6) (1,6) (8,6) (1,6) (3,7) 8.
9.
10.
11.
(1,6) (3,7) (4,8) (1,6) (3,7) (4,8) (7,9) (1,6) (3,7) (4,8) (5,9) (1,6) (3,7) (4,8) (5,9) (9,10) 20 y 24 3 11 15 11 4 18 20 15 3 24 x 16 20 4 24 3 11 18
1 2 4 3 6 5 7 8 10 9 Longest common subsequence: s = 4, 24, 3, 11, 18 11
CS262 Lecture 14, Win07, Batzoglou
Sparse DP for rectangle chaining
• 1,…, N: rectangles h • (h j , l j ): • w(j): • V(j): y-coordinates of rectangle j weight of rectangle j optimal score of chain ending in j • L: list of triplets (l j , V(j), j) L is sorted by l j : smallest (North) to largest (South) value L is implemented as a balanced binary tree CS262 Lecture 14, Win07, Batzoglou l y
Sparse DP for rectangle chaining
Main idea: • Sweep through x coordinates • To the right of
b
, anything chainable to
a
is chainable to b • Therefore, if
V(b) > V(a)
subsequent chaining , rectangle a is “useless” for • In L, keep rectangles j sorted with increasing l coordinates j sorted with increasing V(j) score CS262 Lecture 14, Win07, Batzoglou V(a) V(b)
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from lowest to highest: 1.
When on the leftmost end of rectangle i: j a.
b.
j: rectangle in L, with largest l j V(i) = w(i) + V(j) < h i 2.
When on the rightmost end of i: a.
b.
i.
ii.
k: rectangle in L, with largest l k If V(i) > V(k): i l
INSERT REMOVE
(l i , V(i), i) in L all (l j , V(j), j) with V(j) V(i) & l j i l k Is k ever removed?
i CS262 Lecture 14, Win07, Batzoglou
Example
2 5 6 9 10 11 12 14 15 16 y a: 5 c: 3 b: 6 d: 4 e: 2 x
V a 5 b 11 c 8 d 12 e 13 L
l i V(i) i
5 5 a 15 12 d 16 13 3 1.
2.
a.
b.
When on the leftmost end of rectangle i: j: rectangle in L, with largest l V(i) = w(i) + V(j) j < h i a.
b.
When on the rightmost end of i: k: rectangle in L, with largest l k If V(i) > V(k): l i
i.
ii.
INSERT REMOVE
(l i , V(i), i) in L all (l j , V(j), j) with V(j) V(i) & l j l i CS262 Lecture 14, Win07, Batzoglou
Time Analysis
1.
Sorting the x-coords takes O(N log N) 2.
Going through x-coords: N steps 3.
Each of N steps requires O(log N) time: • • • • • Searching L takes log N Inserting to L takes log N All deletions are consecutive, so logN per deletion Each element is deleted at most once: < N logN for all deletions Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree CS262 Lecture 14, Win07, Batzoglou
Whole-genome Alignment Pipelines
Given N species, phylogenetic tree: 1.
2.
Local Alignment between all pairs – BLAST In the order of the tree: 1.
2.
Synteny mapping: find long regions with lots of collinear alignments In each synteny region, 1.
2.
Chaining Global alignment Alternatively, all species are mapped to one reference (e.g., human) Then, in each unbroken synteny region between multiple species, perform chaining & progressive multiple alignment CS262 Lecture 14, Win07, Batzoglou
Examples
Human Genome Browser ABC CS262 Lecture 14, Win07, Batzoglou
Whole-genome alignment Rat —Mouse—Human CS262 Lecture 14, Win07, Batzoglou
Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned
CS262 Lecture 14, Win07, Batzoglou