Multiple Sequence Alignment
Dynamic Programming
Multiple Sequence Alignment
VTISCTGSSSNIGAGNHVKWYQQLPG
VTISCTGTSSNIGSITVNWYQQLPG
LRLSCSSSGFIFSSYAMYWVRQAPG
LSLTCTVSGTSFDDYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG
ATLVCLISDFYPGAVTVAWKADS
ATLVCLISDFYPGAVTVAWKADS
AALGCLVKDYFPEPVTVSWNSG-
VSLTCLVKGFYPSDIAVEWESNG-
• Goal: Bring the greatest number of similar characters into the same
column of the alignment
• Similar to alignment of two sequences.
CLUSTALW MSA
MSA of four oxidoreductase NAD binding domain protein sequences.
Red: AVFPMILW. Blue: DE. Magenta: RHK. Green: STYHCNGQ. Grey:
all others. Residue ranges are shown after sequence names.
Chenna et al. Nucleic Acids Research, 2003, Vol. 31, No. 13 3497-3500
Multiple Sequence Alignment:
Motivation
• Correspondence. Find out which parts “do the same thing”
– Similar genes are conserved across widely divergent species, often
performing similar functions
• Structure prediction
– Use knowledge of structure of one or more members of a protein
MSA to predict structure of other members
– Structure is more conserved than sequence
• Create “profiles” for protein families
– Allow us to search for other members of the family
• Genome assembly: Automated reconstruction of “contig” maps of
genomic fragments such as ESTs
• MSA is the starting point for phylogenetic analysis
Multiple Sequence Alignment:
Approaches
• Optimal Global Alignments - Dynamic programming
– Generalization of Needleman-Wunsch
– Find alignment that maximizes a score function
– Computationally expensive: Time grows as product of
sequence lengths
• Global Progressive Alignments - Match closely related sequences first using a guide tree
• Global Iterative Alignments - Multiple re-building
attempts to find best alignment
• Local alignments
– Profiles, Blocks, Patterns
Scoring a multiple alignment
[Figure: three ways to score an alignment column, drawn as graphs over the residues A and C. Panels: Sum of pairs, Star, Tree]
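Sum of Pairs
Alignment of five sequences (α per matching pair, β per mismatching pair):
AAA
AAA
AAA
AAC
ACC
Column 1 (A A A A A): 10α
Column 2 (A A A A C): 6α - 4β
Column 3 (A A A C C): 4α - 6β
Total: 10α + (6α - 4β) + (4α - 6β) = 20α - 10β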
Sum-of-Pairs Scoring Function
Score of multiple alignment = ∑ {score(Si, Sj) : i < j}
where score(Si, Sj) = score of the induced pairwise alignment of Si and Sj
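The sum-of-pairs definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `pair_score` and `sp_score` are hypothetical, and the treatment of gaps (a fixed penalty for residue against gap, zero for gap against gap) is an assumption the slides do not state.

```python
from itertools import combinations


def pair_score(a, b, alpha=1.0, beta=1.0, gap=1.0):
    """Score one pair of aligned characters.

    alpha: reward for a match, beta: penalty for a mismatch.
    Gap handling is an assumption: residue/gap costs `gap`, gap/gap costs 0.
    """
    if a == "-" and b == "-":
        return 0.0
    if a == "-" or b == "-":
        return -gap
    return alpha if a == b else -beta


def sp_score(rows, alpha=1.0, beta=1.0, gap=1.0):
    """Sum-of-pairs score: add pair_score over every column and every pair of rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        pair_score(a, b, alpha, beta, gap)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


# Five-sequence example: the total should equal 20*alpha - 10*beta.
rows = ["AAA", "AAA", "AAA", "AAC", "ACC"]
print(sp_score(rows, alpha=3.0, beta=1.0))
```

Summing column by column gives the same total as summing the scores of all induced pairwise alignments, because gap/gap columns contribute zero under the assumption above.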
Induced Pairwise Alignment
S1  S - T I S C T G - S - N I
S2  L - T I - C N G S S - N I
S3  L R T I S C S G F S Q N I
Induced pairwise alignment of S1, S2 (columns where both rows have a gap are dropped):
S1  S T I S C T G - S N I
S2  L T I - C N G S S N I
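Extracting an induced pairwise alignment is a simple column filter. The sketch below assumes the alignment is stored as equal-length strings with "-" as the gap character; the function name `induced_pairwise` is hypothetical.

```python
def induced_pairwise(rows, i, j):
    """Return rows i and j of a multiple alignment with all-gap columns removed."""
    kept = [
        (a, b)
        for a, b in zip(rows[i], rows[j])
        if not (a == "-" and b == "-")  # drop columns where both rows are gaps
    ]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


msa = [
    "S-TISCTG-S-NI",  # S1
    "L-TI-CNGSS-NI",  # S2
    "LRTISCSGFSQNI",  # S3
]
s1, s2 = induced_pairwise(msa, 0, 1)
print(s1)
print(s2)
```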
MSA: Dynamic Programming
• The two-sequence alignment algorithm can be
generalized to any number of sequences.
• E.g., for three sequences X, Y, W define
C[i,j,k] = score of optimum alignment
among X[1..i], Y[1..j], W[1..k]
• As for two sequences, divide possible
alignments into different classes, depending
on how they end.
– Use these classes to devise recurrence relations for C[i,j,k]
– C[i,j,k] is the maximum out of all possibilities
MSA: 7 ways alignment can end
for 3 sequences
An alignment of X1 . . . Xi, Y1 . . . Yj, W1 . . . Wk ends in one of seven possible last columns:
     1    2    3    4    5    6    7
X    Xi   Xi   Xi   -    Xi   -    -
Y    Yj   Yj   -    Yj   -    Yj   -
W    Wk   -    Wk   Wk   -    -    Wk
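Dynamic programming for three
sequences
Each alignment is a path through the
dynamic programming matrix
[Figure: path from Start through a three-dimensional matrix whose axes are labeled with the sequences VSNS, SNA, and AS, corresponding to the alignment below]
V S N - S
- S N A -
- - - A S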
Dynamic Programming for Three
Sequences
There are 7 ways to get to C[i,j,k], one from each neighboring cell:
C[i-1,j-1,k-1]
C[i-1,j-1,k], C[i-1,j,k-1], C[i,j-1,k-1]
C[i-1,j,k], C[i,j-1,k], C[i,j,k-1]
Enumerate all possibilities and choose the best one
For 3 sequences of length n, time is proportional to n^3
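The three-sequence recurrence can be sketched as follows. This is an illustrative version under stated assumptions: columns are scored by sum of pairs with hypothetical values (match 1, mismatch -1, residue/gap -2, gap/gap 0), the names `pair_score`, `column_score`, `MOVES`, and `align3_score` are invented for the example, and only the optimal score is returned, not the alignment itself.

```python
from itertools import combinations, product


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Assumed pair scores; gap against gap contributes nothing."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_score(column):
    """Sum-of-pairs score of a single alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


# The 7 ways a column can end: each sequence contributes a character (1)
# or a gap (0), excluding the all-gap column.
MOVES = [m for m in product((0, 1), repeat=3) if any(m)]


def align3_score(x, y, w):
    """Optimal score C[len(x), len(y), len(w)] for aligning three sequences."""
    C = {(0, 0, 0): 0}
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(w) + 1):
                if (i, j, k) == (0, 0, 0):
                    continue
                candidates = []
                for di, dj, dk in MOVES:
                    if di > i or dj > j or dk > k:
                        continue  # predecessor would fall outside the table
                    column = (
                        x[i - 1] if di else "-",
                        y[j - 1] if dj else "-",
                        w[k - 1] if dk else "-",
                    )
                    candidates.append(
                        C[i - di, j - dj, k - dk] + column_score(column)
                    )
                C[i, j, k] = max(candidates)
    return C[len(x), len(y), len(w)]


print(align3_score("VSNS", "SNA", "AS"))
```

The table has one cell per (i, j, k) triple and each cell inspects at most 7 predecessors, which is where the cubic running time comes from.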
Dynamic Programming MSA:
General Case
• For k sequences of length n, dynamic
programming algorithm does (2^k - 1) n^k
operations
– Example: 6 sequences of length 100 require
about 6.3 × 10^13 calculations
• Space for table is n^k
• Implementations (e.g., WashU MSA 2.1) use
tricks and only search subset of dynamic
programming table
– Even this is expensive. E.g., Baylor CM Search
launcher limits MSA to 8 sequences of 800
characters and 10 minutes processing time
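The growth of the operation count is easy to check numerically. The helper below is a hypothetical sketch that evaluates the formula for k sequences of length n, namely 2^k - 1 predecessor cells for each of the n^k table cells.

```python
def dp_operations(k, n):
    """Number of cell updates: (2**k - 1) predecessors for each of n**k cells."""
    return (2**k - 1) * n**k


def dp_table_cells(k, n):
    """Number of cells in the full dynamic programming table."""
    return n**k


for k in range(2, 7):
    print(k, dp_operations(k, 100), dp_table_cells(k, 100))
```

For six sequences of length 100 the formula gives 63 × 10^12 operations, and each additional sequence of that length multiplies the table size by 100.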
Problems with SP scoring
• Pair-wise comparisons can over-score
evolutionarily distant pairs.
• Reason: For 3 or more sequences, SP
scoring does not correspond to any
evolutionary tree
Overcoming problems with SP
scoring
• Use weights to incorporate evolution in sum of
pairs scoring:
– Some pair-wise alignments are more important
than others
• E.g., more important to have a good alignment between
mouse and human sequences than mouse and bird
– Assign different weights to different pair-wise
alignments.
• Weight decreases with evolutionary distance.
• Use a star tree approach
– one sequence is designated as the ancestor and all
others are compared against it.
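Weighted sum-of-pairs scoring can be sketched as below. The weights here are made-up numbers chosen only for illustration; in practice they would be derived from evolutionary distances. The pair scores (match 1, mismatch -1, residue/gap -2) and the names `pairwise_score` and `weighted_sp_score` are likewise assumptions.

```python
from itertools import combinations


def pairwise_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two rows of an alignment column by column (assumed scoring values)."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # gap/gap columns are ignored
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total


def weighted_sp_score(rows, weights):
    """Sum of pair scores, each multiplied by weights[(i, j)] for i < j."""
    return sum(
        weights[(i, j)] * pairwise_score(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    )


rows = ["AA", "AA", "CC"]
# Hypothetical weights: the close pair (0, 1) counts fully, distant pairs count less.
weights = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 0.25}
print(weighted_sp_score(rows, weights))
```

With all weights set to 1.0 this reduces to the plain sum-of-pairs score.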
Star Alignments
• Construct multiple alignments using pair-wise
alignment relative to a fixed sequence
• Out of a set S = {S1, S2, . . . , Sr} of
sequences, pick sequence Sc that maximizes
star_score(c) = ∑ {sim(Sc, Si) : 1 ≤ i ≤ r, i ≠ c}
where sim(Si, Sj) is the optimal score of a
pair-wise alignment between Si and Sj
Algorithm
1. Compute sim(Si, Sj) for every pair (i,j)
2. Compute star_score(i) for every i
3. Choose the index c that maximizes
star_score(c) and make it the center of the
star
4. Produce a multiple alignment M such that,
for every i, the induced pairwise alignment
of Sc and Si is the same as the optimum
alignment of Sc and Si.
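Steps 1 to 3 can be sketched as follows. Here sim is taken to be a Needleman-Wunsch global alignment score with assumed parameters (match 1, mismatch -1, linear gap -2); the slides do not fix a scoring scheme, and the names `nw_score` and `choose_center` are hypothetical. Because sim is a similarity, the center is the sequence with the largest star_score.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (score only, linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def choose_center(seqs, sim=nw_score):
    """Return (index of center, list of star scores)."""
    r = len(seqs)
    # Step 1: sim(Si, Sj) for every pair (computed once, sim is symmetric here).
    table = [[0] * r for _ in range(r)]
    for i in range(r):
        for j in range(i + 1, r):
            table[i][j] = table[j][i] = sim(seqs[i], seqs[j])
    # Step 2: star_score(i) = sum of sim(Si, Sj) over all j != i.
    star = [sum(table[i][j] for j in range(r) if j != i) for i in range(r)]
    # Step 3: the center has the highest star score (first index wins ties).
    center = max(range(r), key=star.__getitem__)
    return center, star


print(choose_center(["AAAA", "AAAT", "AATT"]))
```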
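Step 4: Detail
Optimal pairwise alignment of Sc and S1:
Sc  AA--CCTT
S1  AATGCC--
Optimal pairwise alignment of Sc and S2:
Sc  A-ACC-TT
S2  AGACCGT-
Merged multiple alignment:
Sc  A-A--CC-TT
S1  A-ATGCC---
S2  AGA--CCGT-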
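One way to carry out the merge is to record, for every pairwise alignment, which characters fall into the gap slot before each center position, then widen every slot to the largest insertion seen. The sketch below assumes the pairwise alignments are already given as gapped strings; the name `merge_star` is hypothetical, and left-justifying insertions within a slot is a design choice, not something the slides prescribe.

```python
def merge_star(center, pairwise):
    """Merge pairwise alignments (center_row, other_row) into one alignment.

    Returns the gapped center row followed by one row per pairwise alignment.
    """
    n = len(center)
    parsed = []
    for c_row, o_row in pairwise:
        if c_row.replace("-", "") != center:
            raise ValueError("center row does not spell the center sequence")
        inserts = [""] * (n + 1)  # inserts[p]: characters placed before center[p]
        matched = []              # matched[p]: character aligned with center[p]
        p = 0
        for c, o in zip(c_row, o_row):
            if c == "-":
                inserts[p] += o
            else:
                matched.append(o)
                p += 1
        parsed.append((inserts, matched))

    # Each gap slot must be wide enough for the longest insertion in any alignment.
    width = [max(len(ins[p]) for ins, _ in parsed) for p in range(n + 1)]

    def build(inserts, matched):
        parts = []
        for p in range(n):
            parts.append(inserts[p].ljust(width[p], "-"))
            parts.append(matched[p])
        parts.append(inserts[n].ljust(width[n], "-"))
        return "".join(parts)

    center_row = build([""] * (n + 1), list(center))
    return [center_row] + [build(ins, m) for ins, m in parsed]


rows = merge_star("AACCTT", [("AA--CCTT", "AATGCC--"), ("A-ACC-TT", "AGACCGT-")])
print("\n".join(rows))
```

Since gaps are only ever added to the center row and never removed, dropping the all-gap columns from the center row and any other row recovers the original pairwise alignment.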