Slides: Multiple sequence alignment


CS 5263 & CS 4593
Bioinformatics
Multiple Sequence Alignment
Multiple Sequence Alignment
• Motivation:
– A faint similarity between two sequences becomes
very significant if present in many sequences
• Definition
– Given N sequences x1, x2,…, xN: Insert gaps (-) in
each sequence xi, such that
• All sequences have the same length L
• Score of the alignment is maximum
• Two issues
– How to score an alignment?
– How to find a (nearly) optimal alignment?
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple
alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C;   x: AC-GCGG-C;   y: AC-GCGAG
y: ACGC-GAG;   z: GCCGC-GAG;   z: GCCGCGAG
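A minimal sketch of how an induced pairwise alignment can be read off a multiple alignment: keep the two rows and drop every column where both rows have a gap. The function name `induced_pairwise` is an illustrative choice, not something from the slides.

```python
def induced_pairwise(msa, k, l):
    """Project rows k and l out of a multiple alignment.

    Columns in which both rows contain a gap carry no information
    about the pair, so they are removed.
    """
    kept = [(a, b) for a, b in zip(msa[k], msa[l])
            if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


msa = ["AC-GCGG-C",   # x
       "AC-GC-GAG",   # y
       "GCCGC-GAG"]   # z

xy = induced_pairwise(msa, 0, 1)
xz = induced_pairwise(msa, 0, 2)
yz = induced_pairwise(msa, 1, 2)
```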
Sum Of Pairs (cont’d)
• The sum-of-pairs (SP) score of an
alignment is the sum of the scores of all
induced pairwise alignments
$S(m) = \sum_{k<l} s(m_k, m_l)$
$s(m_k, m_l)$: score of induced alignment (k,l)
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Scoring matrix:
     A   C   G   T   -
A    1  -1  -1  -1  -1
C   -1   1  -1  -1  -1
G   -1  -1   1  -1  -1
T   -1  -1  -1   1  -1
-   -1  -1  -1  -1   0

Per-column scores (selected columns):
First column (A,A,G): (A,A) + (A,G) x 2 = -1
A (G,G,G) column: (G,G) x 3 = 3
Eighth column (-,A,A): (-,A) x 2 + (A,A) = -1
Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5
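The sum-of-pairs score of the example can be checked with a short sketch. It assumes the scoring matrix shown on the slide (match 1, mismatch -1, residue against gap -1, gap against gap 0); the function names are illustrative.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols under the slide's scoring matrix."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_scores(alignment):
    """Sum-of-pairs score of every column of the alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    return [sum(pair_score(a, b) for a, b in combinations(col, 2))
            for col in zip(*alignment)]


def sp_score(alignment):
    """Sum-of-pairs score of the whole alignment."""
    return sum(column_scores(alignment))


msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
per_column = column_scores(msa)
total = sp_score(msa)
```

Summing per column gives the same result as summing the scores of the three induced pairwise alignments, because gap-gap columns contribute 0.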
Multiple Sequence Alignments: Algorithms
• Can also be global or local
– We only talk about global for now
• A simple method
– Do pairwise alignment between all pairs
– Combine the pairwise alignments into a single
multiple alignment
– Is this going to work?
Compatible pairwise alignments
Sequences: AAAATTTT, TTTTGGGG, AAAAGGGG
Pairwise alignments:
AAAATTTT----
----TTTTGGGG

AAAATTTT----
AAAA----GGGG

----TTTTGGGG
AAAA----GGGG
Combined multiple alignment:
AAAATTTT----
----TTTTGGGG
AAAA----GGGG
Incompatible pairwise alignments
Sequences: AAAATTTT, TTTTGGGG, GGGGAAAA
Pairwise alignments:
AAAATTTT----
----TTTTGGGG

----AAAATTTT
GGGGAAAA----

TTTTGGGG----
----GGGGAAAA
Combined multiple alignment: ?
Multidimensional Dynamic Programming (MDP)
Generalization of Needleman-Wunsch:
• Find the longest path in a high-dimensional cube
– As opposed to a two-dimensional grid
• Uses an N-dimensional matrix
– As opposed to a two-dimensional array
• Entry F(i1, …, iN) represents the score of the optimal alignment for s1[1..i1], …, sN[1..iN]
F(i1, i2, …, iN) = max over all neighbors nbr of the cell of (F(nbr) + S(current))
Multidimensional Dynamic Programming (MDP)
• Example: in 3D (three sequences):
• 2^3 – 1 = 7 neighbors/cell
F(i,j,k) = max of:
  F(i-1,j-1,k-1) + S(xi, xj, xk),
  F(i-1,j-1,k  ) + S(xi, xj, -),
  F(i-1,j  ,k-1) + S(xi, -, xk),
  F(i  ,j-1,k-1) + S(-, xj, xk),
  F(i-1,j  ,k  ) + S(xi, -, -),
  F(i  ,j-1,k  ) + S(-, xj, -),
  F(i  ,j  ,k-1) + S(-, -, xk)
(Figure: cube with corners (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i-1,j,k), (i,j-1,k-1), (i,j-1,k), (i,j,k-1), (i,j,k))
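The seven-neighbor recurrence can be sketched for three short sequences as follows. This is only an illustration: it assumes S is the sum-of-pairs column score with match 1, mismatch -1, residue-gap -1 and gap-gap 0, and the name `align3` is made up for the example.

```python
from itertools import combinations, product


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_score(col):
    """Sum-of-pairs score of one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


def align3(x, y, z):
    """Optimal sum-of-pairs alignment of three sequences by 3D DP."""
    seqs = (x, y, z)
    # The 2^3 - 1 = 7 ways to step back: 1 = consume a residue, 0 = gap.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    back = {}
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == (0, 0, 0):
            continue
        best, best_move = None, None
        for move in moves:
            prev = tuple(c - d for c, d in zip(cell, move))
            if min(prev) < 0:
                continue  # neighbor lies outside the cube
            col = tuple(s[p] if d else "-"
                        for s, p, d in zip(seqs, prev, move))
            cand = F[prev] + column_score(col)
            if best is None or cand > best:
                best, best_move = cand, move
        F[cell], back[cell] = best, best_move
    # Trace back from the far corner to recover the alignment rows.
    rows = [[], [], []]
    cell = tuple(len(s) for s in seqs)
    score = F[cell]
    while cell != (0, 0, 0):
        move = back[cell]
        cell = tuple(c - d for c, d in zip(cell, move))
        for row, s, p, d in zip(rows, seqs, cell, move):
            row.append(s[p] if d else "-")
    return score, ["".join(reversed(r)) for r in rows]


inputs = ["ACGCGGC", "ACGCGAG", "GCCGCGAG"]
best_score, best_rows = align3(*inputs)
```

Because the DP is exact, `best_score` cannot be lower than the score 5 of the hand-made alignment in the sum-of-pairs example.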
Multidimensional Dynamic Programming (MDP)
Running Time:
1. Size of matrix: $L^N$,
   where L = length of each sequence and N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
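To get a feel for how quickly this grows, the operation count can be tabulated with a few lines of Python. The helper name and the choice of L = 250 are illustrative assumptions, and the count ignores constant factors.

```python
def mdp_cell_updates(num_seqs, length):
    """Rough operation count: L^N cells times (2^N - 1) neighbors per cell."""
    return (2 ** num_seqs - 1) * length ** num_seqs


# Growth with the number of sequences for a fixed length of 250.
for n in (2, 3, 4, 6):
    print(n, mdp_cell_updates(n, 250))
```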
Faster MDP
• Carrillo & Lipman, 1988
– Branch and bound
– Other heuristics
• Implemented in a tool called MSA
• Practical for about 6 sequences of length
about 200-300.
Faster MDP
• Basic idea: bounds of the optimal score of a multiple alignment can be pre-computed
– Upper bound: sum of optimal pair-wise alignment scores, i.e.
$S(m) = \sum_{k<l} s(m_k, m_l) \le \sum_{k<l} s(k, l)$
where m is the optimal MSA, $s(m_k, m_l)$ is the score of the alignment between k and l induced by m, and $s(k, l)$ is the score of the optimal alignment between k and l
– Lower bound: score computed by any approximate algorithm (such as the ones we’ll talk about next)
– For any partial path, if S_current + S_prospective < lower bound, that path can be given up (S_prospective: the best score still attainable on the rest of the path)
– Guarantees optimality
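A small sketch of the two bounds, using the three sequences of the sum-of-pairs example. It assumes the same scoring as before (match 1, mismatch -1, gap -1, gap-gap 0), takes the hand-made alignment from that example as the "approximate algorithm", and uses illustrative function names. It shows only the bounds and the pruning test, not the full branch-and-bound search.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_score(alignment):
    """Sum-of-pairs score of a multiple alignment."""
    return sum(pair_score(a, b)
               for col in zip(*alignment)
               for a, b in combinations(col, 2))


def nw_score(a, b, gap=-1):
    """Optimal global pairwise score (Needleman-Wunsch, score only)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(prev[j - 1] + pair_score(a[i - 1], b[j - 1]),
                           prev[j] + gap,
                           cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def can_prune(score_so_far, best_possible_rest, lower_bound):
    """A partial path is hopeless if even its best completion is too low."""
    return score_so_far + best_possible_rest < lower_bound


seqs = ["ACGCGGC", "ACGCGAG", "GCCGCGAG"]
heuristic_msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]

lower = sp_score(heuristic_msa)                      # any alignment's score
upper = sum(nw_score(a, b) for a, b in combinations(seqs, 2))
```

The optimal sum-of-pairs score lies between `lower` and `upper`; the tighter the two bounds, the more of the cube can be skipped.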
Progressive Alignment
• Multiple Alignment is NP-hard
• Most used heuristic: Progressive Alignment
Algorithm:
1. Align two of the sequences xi, xj
2. Fix that alignment
3. Align a third sequence xk to the alignment xi,xj
4. Repeat until all sequences are aligned
Running Time: $O(NL^2)$
Each alignment takes $O(L^2)$
Repeat N times
Progressive Alignment
• When evolutionary tree is known:
– Align closest first, in the order of the tree
Example (tree with leaves x, y, z, w):
Order of alignments:
1. (x,y)
2. (z,w)
3. (xy, zw)
Progressive Alignment: CLUSTALW
CLUSTALW: most popular multiple protein alignment
Algorithm:
1. Find all dij: alignment dist (xi, xj)
• High alignment score => short distance
2. Construct a tree
(similar to hierarchical clustering. Will discuss in future)
3. Align nodes in order of decreasing similarity
+ a large number of heuristics
CLUSTALW example
• S1 ALSK
• S2 TNSD
• S3 NASK
• S4 NTSD
Distance matrix:
     s1  s2  s3  s4
s1    0   9   4   7
s2        0   8   3
s3            0   7
s4                0
CLUSTALW example (cont’d)
Guide tree built from the distance matrix: ((s1, s3), (s2, s4))
Alignment of s1 and s3:
-ALSK
NA-SK
Alignment of s2 and s4:
-TNSD
NT-SD
Alignment of the two pairs (final multiple alignment):
-ALSK
NA-SK
-TNSD
NT-SD
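The grouping of the four sequences can be reproduced from the distance matrix with a simple clustering sketch. It uses average-linkage hierarchical clustering as a stand-in, in line with the "similar to hierarchical clustering" remark; this is an assumption made for illustration (the actual CLUSTALW program builds its guide tree by neighbor-joining), and the function names are made up.

```python
from itertools import combinations


def average_distance(c1, c2, dist):
    """Mean leaf-to-leaf distance between two clusters."""
    total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def guide_tree(names, dist):
    """Greedy average-linkage clustering; returns (tree, merge order)."""
    # Each cluster is (subtree, list of leaf names).
    clusters = [(name, [name]) for name in names]
    merges = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_distance(clusters[p[0]][1],
                                                  clusters[p[1]][1], dist))
        (t1, m1), (t2, m2) = clusters[i], clusters[j]
        merges.append((t1, t2))
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [((t1, t2), m1 + m2)]
    return clusters[0][0], merges


# Distance matrix from the slide.
dist = {frozenset(pair): d for pair, d in {
    ("s1", "s2"): 9, ("s1", "s3"): 4, ("s1", "s4"): 7,
    ("s2", "s3"): 8, ("s2", "s4"): 3, ("s3", "s4"): 7,
}.items()}

tree, merges = guide_tree(["s1", "s2", "s3", "s4"], dist)
```

With these distances the closest pair (s2, s4) is merged first, then (s1, s3), and finally the two pairs, which gives the same grouping as the tree on the slide.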
Iterative Refinement
Problems with progressive alignment:
• Depends on pair-wise alignments
• If sequences are very distantly related, much higher likelihood of errors
• Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT   Frozen!
z: GAACTG
w: GTACTG
Now clear: correct y should be GA-CTT
Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align most similar xi, xj
2. Align xk most similar to (xixj)
3. Repeat 2 until (x1…xN) are aligned
4. For j = 1 to N,
   remove xj, and realign to x1…xj-1xj+1…xN
5. Repeat 4 until convergence
(Steps 1 to 3: progressive alignment)
Iterative Refinement (cont’d)
For each sequence y
1. Remove y
2. Realign y (while rest fixed)
(Figure: x, z fixed projection; allow y to vary)
Note: Guaranteed to converge (why?)
Running time: $O(kNL^2)$, k: number of iterations
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA   + 3 matches
z: GAACTGA
w: GTACTGA
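One remove-and-realign step can be sketched as a dynamic program that aligns the removed sequence against the columns of the alignment that stays fixed. The sketch assumes sum-of-pairs scoring with match 1, mismatch -1 and gap -1; the names `realign` and `count_matches` are illustrative.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def realign(fixed, seq, gap=-1):
    """Align seq to the fixed rows; returns (score, fixed rows, new row)."""
    cols = list(zip(*fixed))
    n, m = len(cols), len(seq)

    def col_score(col, ch):
        return sum(pair_score(c, ch) for c in col)

    new_col = len(fixed) * gap  # seq residue against an all-gap column
    F = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], back[i][0] = F[i - 1][0] + col_score(cols[i - 1], "-"), "col"
    for j in range(1, m + 1):
        F[0][j], back[0][j] = F[0][j - 1] + new_col, "seq"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (F[i - 1][j - 1] + col_score(cols[i - 1], seq[j - 1]), "both"),
                (F[i - 1][j] + col_score(cols[i - 1], "-"), "col"),
                (F[i][j - 1] + new_col, "seq"),
            ]
            F[i][j], back[i][j] = max(options, key=lambda o: o[0])
    rows, new = [[] for _ in fixed], []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move in ("both", "col"):
            i -= 1
        if move in ("both", "seq"):
            j -= 1
        for row, c in zip(rows, cols[i] if move != "seq" else "-" * len(rows)):
            row.append(c)
        new.append(seq[j] if move != "col" else "-")
    return F[n][m], ["".join(reversed(r)) for r in rows], "".join(reversed(new))


def count_matches(alignment):
    """Number of identical residue pairs, summed over all columns."""
    return sum(1 for col in zip(*alignment)
               for a, b in combinations(col, 2) if a == b and a != "-")


x, y_frozen, z, w = "GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"
score, fixed_rows, y_new = realign([x, z, w], y_frozen.replace("-", ""))
gain = count_matches([x, y_new, z, w]) - count_matches([x, y_frozen, z, w])
```

On this example the realigned row comes out as `G-ACTTA`, and the match count rises by 3, in line with the slide.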
Iterative Refinement
• Example not handled well:
x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z:  GAACTGA
w:  GTACTGA
Realigning any single yi changes nothing
Restricted MDP
• Similar to bounded DP in pair-wise alignment
1. Construct progressive multiple alignment m
2. Run MDP, restricted to radius R from m
Running Time: $O(2^N R^{N-1} L)$
(Figure axes: x, y, z)
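For comparison, here is a sketch of the pairwise analogue, bounded (banded) DP, where only cells within R of the main diagonal are filled. Restricted MDP applies the same idea in N dimensions around the alignment m rather than around the diagonal. The scoring values and the function name are assumptions made for the example.

```python
def banded_nw_score(a, b, radius, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only cells with |i - j| <= radius."""
    n, m = len(a), len(b)
    if abs(n - m) > radius:
        raise ValueError("end cell lies outside the band")
    neg_inf = float("-inf")
    # Cells outside the band keep the value -inf and are never chosen.
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - radius), min(m, i + radius) + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, F[i - 1][j - 1] + sub)
            if i > 0:
                best = max(best, F[i - 1][j] + gap)
            if j > 0:
                best = max(best, F[i][j - 1] + gap)
            F[i][j] = best
    return F[n][m]


same = banded_nw_score("ACGT", "ACGT", 0)
one_gap = banded_nw_score("ACGT", "AGT", 1)
```

Only about (2R + 1) cells per row are visited, which is where the factor R replaces a factor L in the running time.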
Restricted MDP
x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z:  GAACTGA
w:  GTACTGA
• Within radius 1 of the optimal
  => Restricted MDP will fix it.
Other approaches
• Statistical learning methods
– Profile Hidden Markov Models
• Consistency-based methods
– Still rely on pairwise alignment
• But consider a third seq when aligning two seqs
• If block A in seq x aligns to block B in seq y, and both align to block C in seq z, we have higher confidence that the alignment between A and B is reliable
• Essentially: change scoring system according to consistency
• Then apply DP as in other approaches
– Pioneered by a tool called T-Coffee
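The idea of rescoring by consistency can be sketched on a toy library of aligned residue pairs. Residues are written as (sequence name, position), and a pair gains weight whenever both residues are aligned to the same residue of a third sequence. The data layout, the use of `min` to combine the two supporting weights, and the function name are assumptions in the spirit of this approach, not a description of T-Coffee's actual code.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(library):
    """Add third-sequence support to every residue pair.

    library maps (residue_a, residue_b) -> weight, where a residue is a
    (sequence name, position) tuple taken from some pairwise alignment.
    """
    neighbors = defaultdict(dict)
    for (a, b), weight in library.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight
    extended = defaultdict(int)
    for (a, b), weight in library.items():
        extended[frozenset((a, b))] += weight
    # Every residue c vouches for each pair of residues aligned to it.
    for c, around in neighbors.items():
        for (a, wa), (b, wb) in combinations(list(around.items()), 2):
            if a[0] != b[0]:  # skip two residues of the same sequence
                extended[frozenset((a, b))] += min(wa, wb)
    return dict(extended)


library = {
    (("x", 0), ("y", 0)): 1,   # supported through z below
    (("x", 0), ("z", 0)): 1,
    (("y", 0), ("z", 0)): 1,
    (("x", 0), ("y", 1)): 1,   # competing pair with no support
}
weights = extend_library(library)
supported = weights[frozenset((("x", 0), ("y", 0)))]
unsupported = weights[frozenset((("x", 0), ("y", 1)))]
```

The supported pair ends up with a higher weight than the competing one, so a later DP step that uses these weights as scores will prefer it.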
Multiple alignment tools
• Clustal W (Thompson, 1994)
– Most popular
• T-Coffee (Notredame, 2000)
– Another popular tool
– Consistency-based
– Slower than clustalW, but generally more accurate for more distantly related sequences
• PROBCONS (Do, 2004)
– Probabilistic consistency-based
– Best accuracy on benchmarks
• Align-m (Walle, 2004)
– “local”
• DIALIGN (Morgenstern, 1998, 1999, 2005)
– “local”
• MUSCLE (Edgar, 2004)
– Iterative refinement
– More efficient than most others
• ProDA (Phuong, 2006)
– Allow repeated and shuffled regions
In summary
• Multiple alignment scoring functions
– Sum of pairs
– Other funcs exist, but less used
• Multiple alignment algorithms:
– MDP
• Optimal
• too slow
• Branch & Bound doesn’t solve the problem entirely
– Progressive alignment: clustalW
– Iterative refinement
– Restricted MDP
– Consistency-based
(The last four are heuristic)