Multiple sequence alignment algorithms
Elya Flax & Inbar Matarasso
Seminar in Structural Bioinformatics - Multiple sequence alignment algorithms.
Outline

The importance of multiple string alignments in molecular biology.
CLUSTAL W.
Family representation.
How to score multiple alignments.
The center star method for SP alignment.
Consensus strings.
Approximating the optimal consensus multiple alignment.
Iterative pairwise alignment.
Progressive alignment and contemporary improvements.
Repeated-motif methods.
Motivation

Why multiple string comparison?

Because many important commonalities are faint or widely dispersed, they might not be apparent when comparing two strings alone but may become clear, or even obvious, when comparing a set of related strings.
Definition

Definition: A global multiple alignment of k > 2 strings S = {S1, S2, …, Sk} is a natural generalization of alignment for two strings. Chosen spaces are inserted into each of the k strings so that the resulting strings have the same length, defined to be l. Then the strings are arrayed in k rows of l columns each, so that each character and space of each string is in a unique column.
Biological basis for multiple string comparison

The second fact of biological sequence comparison: Evolutionarily and functionally related molecular strings can differ significantly throughout much of the string and yet preserve the same three-dimensional structure(s), or the same two-dimensional substructure(s) (motifs, domains), or the same active sites, or the same or related dispersed residues (DNA or amino acid).
Three “big-picture” biological uses for multiple string comparison

The representation of protein families and superfamilies.
The identification and representation of conserved sequence features of DNA or protein that correlate with structure and/or function.
The deduction of evolutionary history from DNA or protein sequences.
CLUSTAL W

Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
http://www.ebi.ac.uk/clustalw/

[Figure: CLUSTAL W input sequences and alignment results.]
Family and superfamily representation

Often a set of strings (a family) is defined by biological similarity, and one wants to find subsequence commonalities that characterize or represent the family.
There are three common kinds of family representations that come from multiple string comparison:
I. Profile representation
II. Consensus sequence representation
III. Signature representation
Family representation and alignment
with profiles

Definition: Given a multiple alignment of a
set of strings, a profile for that multiple
alignment specifies for each column the
frequency that each character appears in the
column. A profile is sometimes also called a
weight matrix in the biological literature.
Family representation and alignment with profiles

A multiple alignment of four strings:

a b c _ a
a b a b a
a c c b _
c b _ b c

Its profile:

     C1   C2   C3   C4   C5
a   .75       .25       .50
b        .75       .75
c   .25  .25  .50       .25
_             .25  .25  .25
Often the values in the profile are converted to log-odds ratios: if p(y,j) is the frequency that character y appears in column j, and p(y) is the frequency that character y appears anywhere in the multiply aligned sequences, then log( p(y,j)/p(y) ) is commonly used as the y,j profile entry.
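As a minimal sketch of the two definitions above, the following Python computes column frequencies and log-odds entries for the four-row example alignment. The function names are illustrative, and counting the space character "_" in the background frequency p(y) is an assumption of this sketch, not something the slides specify.

```python
import math
from collections import Counter

def build_profile(alignment):
    """Return one {character: frequency} dict per column of the alignment."""
    k = len(alignment)
    return [{ch: n / k for ch, n in Counter(col).items()} for col in zip(*alignment)]

def log_odds_profile(alignment):
    """Replace each frequency p(y,j) by log(p(y,j) / p(y))."""
    total = sum(len(row) for row in alignment)
    # Background frequency p(y); spaces are counted too (assumption).
    background = {ch: n / total for ch, n in Counter("".join(alignment)).items()}
    return [
        {ch: math.log(freq / background[ch]) for ch, freq in column.items()}
        for column in build_profile(alignment)
    ]

rows = ["abc_a", "ababa", "accb_", "cb_bc"]  # the example alignment, one string per row
profile = build_profile(rows)
scores = log_odds_profile(rows)
print(profile[0])                 # frequencies in column C1
print(round(scores[0]["a"], 3))   # log-odds entry for 'a' in column C1
```

Characters that never occur in a column are simply absent from that column's dictionary, which avoids taking the logarithm of zero.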
Aligning a string to a profile

Given a profile P and a new string S, we want to answer the question: “How well does S, or some substring of S, fit the profile P?”
Since a space is a legal character of a profile, a fit of S to P should also allow the insertion of spaces into S, and hence the question is naturally formalized as an easy generalization of pure string alignment.

a a b _ b c
1 _ 2 3 4 5

An alignment of string aabbc to the column positions of the previous alignment.
How to optimally align a string to a profile

Recall that for two characters x and y, s(x,y) denotes the alphabet-weight value assigned to aligning x with y in the pure string alignment problem.

Definition: For character y and column j, let p(y,j) be the frequency that character y appears in column j of the profile, and let S(x,j) denote Σ_y [s(x,y) × p(y,j)], the score for aligning x with column j.
How to optimally align a string to a profile

Definition: Let V(i,j) denote the value of the optimal alignment of substring S[1..i] with the first j columns of the profile.

The base conditions:
V(i,0) = Σ_{k≤i} s(S(k),_)
V(0,j) = Σ_{k≤j} S(_,k)

For i and j both strictly positive, the general recurrence is:
V(i,j) = max [
  V(i-1,j-1) + S(S(i),j),
  V(i-1,j) + s(S(i),_),
  V(i,j-1) + S(_,j)
].

Time analysis: O(σnm), where n is the length of S, m is the number of columns in the profile, and σ is the size of the alphabet.
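The recurrence for V(i,j) can be sketched in Python as follows. The scoring function `s` (match +2, mismatch −1, character against space −1, space against space 0) is a hypothetical choice made only for illustration; the slides leave the alphabet weights open.

```python
def s(x, y):
    """Hypothetical alphabet-weight scores; '_' is the space character."""
    if x == "_" and y == "_":
        return 0.0
    if x == "_" or y == "_":
        return -1.0
    return 2.0 if x == y else -1.0

def column_score(x, column):
    """S(x,j): sum over y of s(x,y) * p(y,j) for one profile column."""
    return sum(s(x, y) * p for y, p in column.items())

def align_to_profile(string, profile):
    """Return V(n,m), the best score for aligning the string to the profile."""
    n, m = len(string), len(profile)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # string characters against spaces
        V[i][0] = V[i - 1][0] + s(string[i - 1], "_")
    for j in range(1, m + 1):                      # spaces against profile columns
        V[0][j] = V[0][j - 1] + column_score("_", profile[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + column_score(string[i - 1], profile[j - 1]),
                V[i - 1][j] + s(string[i - 1], "_"),
                V[i][j - 1] + column_score("_", profile[j - 1]),
            )
    return V[n][m]

# The five-column example profile, one {character: frequency} dict per column.
example_profile = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "_": 0.25},
    {"b": 0.75, "_": 0.25},
    {"a": 0.50, "c": 0.25, "_": 0.25},
]
print(align_to_profile("aabbc", example_profile))
```

Each call to `column_score` touches at most σ entries, which is where the σ factor in the running time comes from.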
Profile to profile alignment

Another way that profiles are used is to
compare one protein set to another. In that
case, the profile for one set is compared to
the profile of the other.
Introduction to computing multiple string alignments

Definition: Given a set of k > 2 strings S = {S1, S2, ..., Sk}, a local multiple alignment of S is obtained by selecting one substring Si’ from each string Si ∈ S and then globally aligning those substrings.
How to score multiple alignments

To date, there is no objective function that has been as well accepted for multiple alignment as edit distance or similarity has been for two-string alignment.
We will discuss three types of objective functions:
I. Sum-of-pairs functions
II. Consensus functions
III. Tree functions
How to score multiple alignments

Definition: Given a multiple alignment M, the induced pairwise alignment of two strings Si and Sj is obtained from M by removing all rows except the two rows for Si and Sj. That is, the induced alignment is the multiple alignment M restricted to Si and Sj. Any two opposing spaces in that induced alignment can be removed if desired.
How to score multiple alignments

Definition: The score of an induced pairwise alignment is determined using any chosen scoring scheme for two-string alignment in the standard manner.

[Figure: an example multiple alignment whose induced pairwise alignments score 4, 5 and 5, for an SP score of 14.]
Multiple alignment with the sum-of-pairs (SP) objective function

Definition: The sum of pairs (SP) score of a multiple alignment M is the sum of the scores of the pairwise global alignments induced by M.

The SP alignment problem: Compute a global multiple alignment M with minimum sum-of-pairs score.
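To make the SP definition concrete, here is a small Python sketch that scores every induced pairwise alignment and sums the results. The cost scheme (match 0, mismatch 1, space 1, two opposing spaces 0) and the three-row alignment are illustrative assumptions.

```python
from itertools import combinations

def pair_score(row_a, row_b, mismatch=1, space=1):
    """Score of the pairwise alignment induced by two rows of a multiple alignment."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == y:
            continue  # a match, or two opposing spaces, costs nothing
        score += space if "_" in (x, y) else mismatch
    return score

def sp_score(alignment):
    """Sum-of-pairs score: add up the scores of all induced pairwise alignments."""
    return sum(pair_score(a, b) for a, b in combinations(alignment, 2))

M = ["AAATG_A", "AAA_TAG", "CTG_G_G"]  # illustrative alignment, one row per string
pair_scores = [pair_score(a, b) for a, b in combinations(M, 2)]
print(pair_scores, sp_score(M))  # [4, 5, 5] 14
```

Because opposing spaces cost nothing, leaving them in an induced alignment or removing them gives the same pairwise score.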
An exact solution to the SP alignment problem

Via dynamic programming: for k strings of length n, it takes Θ(n^k) time.
We will develop the dynamic programming recurrence only for the case of three strings.
We will develop an accelerant to the basic dynamic programming solution that somewhat increases the number of strings that can be optimally aligned.
An exact solution to the SP alignment problem

Definition: Let S1, S2 and S3 denote three strings of length n1, n2 and n3, respectively, and let D(i,j,k) be the optimal SP score for aligning S1[1..i], S2[1..j] and S3[1..k]. The score for a match, mismatch, or space is specified by the variables smatch, smis and sspace, respectively.
Recurrences for a nonboundary cell (i,j,k)

for i := 1 to n1 do
  for j := 1 to n2 do
    for k := 1 to n3 do
    begin
      if (S1(i) = S2(j)) then cij := smatch else cij := smis;
      if (S1(i) = S3(k)) then cik := smatch else cik := smis;
      if (S2(j) = S3(k)) then cjk := smatch else cjk := smis;
      d1 := D(i-1,j-1,k-1) + cij + cik + cjk;
      d2 := D(i-1,j-1,k) + cij + 2*sspace;
      d3 := D(i-1,j,k-1) + cik + 2*sspace;
      d4 := D(i,j-1,k-1) + cjk + 2*sspace;
      d5 := D(i-1,j,k) + 2*sspace;
      d6 := D(i,j-1,k) + 2*sspace;
      d7 := D(i,j,k-1) + 2*sspace;
      D(i,j,k) := min[d1,d2,d3,d4,d5,d6,d7];
    end;
D values for boundary cells

Let D1,2(i,j) denote the familiar pairwise distance between substrings S1[1..i] and S2[1..j], and let D1,3(i,k) and D2,3(j,k) denote the analogous pairwise distances. Then,
I.   D(i,j,0) = D1,2(i,j) + (i+j)*sspace
II.  D(i,0,k) = D1,3(i,k) + (i+k)*sspace
III. D(0,j,k) = D2,3(j,k) + (j+k)*sspace
IV.  D(0,0,0) = 0
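Putting the nonboundary recurrence and the boundary values together, a runnable Python sketch of the three-string dynamic program might look like this. It assumes smatch = 0 and takes smis and sspace as parameters with default value 1; the function names are illustrative.

```python
def edit_distance_table(a, b, mismatch=1, space=1):
    """D[i][j] = weighted edit distance between the prefixes a[:i] and b[:j]."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = i * space
    for j in range(1, len(b) + 1):
        D[0][j] = j * space
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            c = 0 if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + c, D[i - 1][j] + space, D[i][j - 1] + space)
    return D

def sp_three(s1, s2, s3, mismatch=1, space=1):
    """Optimal SP score for three strings (match cost 0 is an assumption)."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    D12 = edit_distance_table(s1, s2, mismatch, space)
    D13 = edit_distance_table(s1, s3, mismatch, space)
    D23 = edit_distance_table(s2, s3, mismatch, space)
    D = [[[0] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    # Boundary cells: one string is empty, so every character of the other
    # two strings also faces a space in the empty string's row.
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            D[i][j][0] = D12[i][j] + (i + j) * space
    for i in range(n1 + 1):
        for k in range(n3 + 1):
            D[i][0][k] = D13[i][k] + (i + k) * space
    for j in range(n2 + 1):
        for k in range(n3 + 1):
            D[0][j][k] = D23[j][k] + (j + k) * space
    # Nonboundary cells: the seven possible last columns of the alignment.
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            for k in range(1, n3 + 1):
                cij = 0 if s1[i - 1] == s2[j - 1] else mismatch
                cik = 0 if s1[i - 1] == s3[k - 1] else mismatch
                cjk = 0 if s2[j - 1] == s3[k - 1] else mismatch
                D[i][j][k] = min(
                    D[i - 1][j - 1][k - 1] + cij + cik + cjk,
                    D[i - 1][j - 1][k] + cij + 2 * space,
                    D[i - 1][j][k - 1] + cik + 2 * space,
                    D[i][j - 1][k - 1] + cjk + 2 * space,
                    D[i - 1][j][k] + 2 * space,
                    D[i][j - 1][k] + 2 * space,
                    D[i][j][k - 1] + 2 * space,
                )
    return D[n1][n2][n3]

print(sp_three("AC", "A", "A"))  # 2
```

In the usage line, the C of the first string faces a space in each of the other two rows, which costs two spaces in total.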
A speed up for the exact solution

The program for multiple alignment that was shown uses recurrences in the backward direction.
In forward dynamic programming, when D(i,j,k) is set, D(i,j,k) is sent forward to the seven cells that can be influenced by it.
A speed up for the exact solution

Definition: Let d1,2(i,j) be the edit distance between suffixes S1[i..n] and S2[j..n] of strings S1 and S2. Define d1,3(i,k) and d2,3(j,k) analogously.

All these d values can be computed in O(n^2) time by reversing the strings and computing three pairwise distances.
A speed up for the exact solution

Suppose that some multiple alignment of S1, S2, and S3 is known and that the alignment has SP score z.

Key idea of the heuristic speed up: Recall that D(i,j,k) is the optimal SP score for aligning S1[1..i], S2[1..j], and S3[1..k]. If D(i,j,k) + d1,2(i,j) + d1,3(i,k) + d2,3(j,k) is greater than z, then node (i,j,k) cannot be on any optimal path and so D(i,j,k) need not be sent forward to any cell.
A bounded-error approximation method for SP alignment

The method is provably fast (it runs in polynomial worst-case time) and yet produces alignments whose SP score is guaranteed to be less than twice the score of the optimal SP alignment.
Recall that for two strings, D(Si,Sj) is the (optimal) weighted edit distance between Si and Sj.
An initial key idea: alignments
consistent with a tree

Definition: Let S be a set of strings, and let
T be a tree where each node is labeled with
a distinct string from S. Then, a multiple
alignment M of S is called consistent with T
if the induced pairwise alignment of Si and Sj
has score D(Si,Sj) for each pair of strings
(Si,Sj) that label adjacent nodes in T.
A bounded-error approximation method for SP alignment

a) A tree whose nodes are labeled with strings:
1: AXZ
2: AXZ
3: AXXZ
4: AYZ
5: AYXYZ

b) A multiple alignment of those strings (row labels are node numbers):
3  A X X _ Z
1  A X _ _ Z
2  A _ X _ Z
4  A Y _ _ Z
5  A Y X Y Z
An initial key idea: alignments
consistent with a tree

Theorem: For any set of strings S and for
any tree T whose nodes are labeled by
distinct strings of S, we can efficiently find a
multiple alignment M(T) of S that is
consistent with T
The center star method for SP alignment

We will describe the method in terms of an alphabet-weighted scoring scheme for two-string alignment, and let s(x,y) be the score contributed when a character x is aligned opposite a character y.

Definition: A scoring scheme satisfies the triangle inequality if for any three characters x, y and z, s(x,z) ≤ s(x,y) + s(y,z).
The center star method for SP alignment

Definition: Given a set of k strings S, define a center string Sc ∈ S as a string in S that minimizes Σ_{Sj∈S} D(Sc,Sj), and let M denote the minimum sum. Define the center star to be a star tree of k nodes, with the center node labeled Sc and with each of the k-1 remaining nodes labeled by a distinct string in S − {Sc}.
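Choosing the center string is a direct computation over all pairwise distances. The sketch below assumes unit-cost edit distance for D (mismatch and space both cost 1), which satisfies the triangle inequality; the function names and the four input strings are illustrative.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def center_string(strings):
    """Return (Sc, M): the string minimizing the summed distance to all others."""
    sums = [sum(edit_distance(s, t) for t in strings) for s in strings]
    best = min(sums)
    return strings[sums.index(best)], best

S = ["AXZ", "AXXZ", "AYZ", "AYXYZ"]
print(center_string(S))  # ('AXZ', 4)
```

This step needs all k(k-1)/2 pairwise distances, so for k strings of length about n it costs O(k^2 n^2) time with the quadratic edit-distance routine used here.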
The center star method for SP alignment

[Figure: a generic center star for six strings, where the center string Sc is S3.]
The center star method for SP alignment

Definition: Define the multiple alignment Mc of the set of strings S to be the multiple alignment consistent with the center star.
Definition: Define d(Si,Sj) as the score of the pairwise alignment of strings Si and Sj induced by Mc. Denote the score of an alignment M as d(M).

d(Si,Sj) ≥ D(Si,Sj),  d(Mc) = Σ_{i<j} d(Si,Sj),  d(Si,Sc) = D(Si,Sc)
The center star method for SP alignment

Lemma: Assume that the two-string scoring scheme satisfies the triangle inequality. Then for any strings Si and Sj in S, d(Si,Sj) ≤ d(Si,Sc) + d(Sc,Sj) = D(Si,Sc) + D(Sc,Sj).
The center star method for SP alignment

Definition: Let M* be the optimal multiple alignment of the k strings of S. Let d*(Si,Sj) be the score of the pairwise alignment of strings Si and Sj induced by M*. Then d(M*) = Σ_{i<j} d*(Si,Sj).
The center star method for SP alignment

Theorem: d(Mc)/d(M*) ≤ 2(k-1)/k < 2.
Corollary: (k/2)·M ≤ Σ_{i<j} D(Si,Sj) ≤ d(M*) ≤ d(Mc) ≤ [2(k-1)/k] Σ_{i<j} D(Si,Sj).
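A sketch of why the factor-of-two bound holds, written out in LaTeX. The helper quantity v(·), which sums over ordered pairs so that v = 2d, is notation introduced only for this derivation; M is the minimum sum from the center string definition, and symmetry of D with D(S,S) = 0 is assumed.

```latex
% v(\cdot) sums over ordered pairs, so v(M_c) = 2\,d(M_c) and v(M^*) = 2\,d(M^*).
\begin{align*}
v(M_c) &= \sum_{i \ne j} d(S_i,S_j)
        \le \sum_{i \ne j} \bigl[ D(S_i,S_c) + D(S_c,S_j) \bigr]
        = 2(k-1) \sum_{j} D(S_c,S_j) = 2(k-1)\,M, \\
v(M^*) &= \sum_{i \ne j} d^*(S_i,S_j)
        \ge \sum_{i \ne j} D(S_i,S_j)
        = \sum_{i} \sum_{j} D(S_i,S_j) \ge k\,M, \\
\frac{d(M_c)}{d(M^*)} &= \frac{v(M_c)}{v(M^*)}
        \le \frac{2(k-1)\,M}{k\,M} = \frac{2(k-1)}{k} < 2.
\end{align*}
```

The first line uses the lemma (triangle inequality through the center), and the second uses the fact that no string in S has a smaller summed distance than Sc.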
Steiner consensus strings

Definition: Given a set of strings S, and given another string S’, the consensus error of S’ relative to S is E(S’) = Σ_{Si∈S} D(S’,Si).
Note that S’ need not be from S.
Steiner consensus strings

Definition: Given a set of strings S, an
optimal Steiner string S* for S is a string
that minimizes the consensus error E(S*)
over all possible strings.
Steiner consensus strings

Lemma: Let S have k strings, and assume that the two-string scoring scheme satisfies the triangle inequality. Then there exists a string S̄ ∈ S such that E(S̄) / E(S*) ≤ 2 – 2/k < 2.
Steiner consensus strings

Recall that Sc is a string that minimizes Σ_{Si∈S} D(Sc,Si) over all strings in S.

Theorem: Assuming that the scoring scheme satisfies the triangle inequality, E(Sc) / E(S*) ≤ 2 – 2/k < 2.
Consensus strings from multiple alignment

Definition: Given a multiple alignment M of a set of strings S, the consensus character of column i of M is the character that minimizes the summed distance to it from all the characters in column i. Let d(i) denote the minimum sum in column i.
Consensus strings from multiple
alignment

Definition: The consensus string SM
derived from alignment M is the
concatenation of the consensus characters
for each column of M.
Consensus strings from multiple alignment

Definition: Let M be a multiple alignment of a set of strings S, and let SM be its consensus string containing q characters. Then the alignment error of SM equals Σ_{i=1}^{q} d(i), and the alignment error of M is defined as the alignment error of SM.
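The consensus character, consensus string and alignment error can be computed column by column. This Python sketch assumes a unit cost between any two different characters (including the space "_"), under which the consensus character is simply a most frequent character of the column; the alignment used is illustrative.

```python
def cost(x, y):
    """Hypothetical character distance: 0 for identical characters, else 1."""
    return 0 if x == y else 1

def consensus(alignment):
    """Return (consensus string S_M, alignment error) for a multiple alignment."""
    alphabet = sorted(set("".join(alignment)))   # candidate consensus characters
    chars, error = [], 0
    for column in zip(*alignment):
        # d(i): the smallest summed distance from one character to the column.
        sums = {c: sum(cost(c, x) for x in column) for c in alphabet}
        best = min(alphabet, key=sums.get)       # ties go to the first in sorted order
        chars.append(best)
        error += sums[best]
    return "".join(chars), error

M = ["AAATG_A", "AAA_TAG", "CTG_G_G"]
print(consensus(M))  # ('AAA_G_G', 7)
```

Note that the consensus string may contain spaces; removing them gives a candidate string over the original alphabet.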
Consensus strings from multiple
alignment

Definition: The optimal consensus
multiple alignment is a multiple alignment M
for input set S whose consensus string SM
has smallest alignment error over all possible
multiple alignments of S
Consensus strings from multiple alignment

Definition: Given a set S of k strings, let T be the star tree with Steiner string S* at the root and each of the k strings at distinct leaves of T. Then the multiple alignment of S ∪ {S*} consistent with T is said to be consistent with S*.
Consensus strings from multiple alignment

Theorem: Let S’ denote the consensus string of the optimal consensus multiple alignment. Then removal of the spaces from S’ creates the optimal Steiner string S*. Conversely, removal of the row for S* from the multiple alignment consistent with S* creates the optimal consensus multiple alignment of S.
Approximating the optimal consensus
multiple alignment

Theorem: Assuming the triangle inequality,
the multiple alignment Mc created by the
center star method has an SP score that is
never more than 2 – 2/k times the SP score
of the optimal SP alignment, and it has a
(consensus) alignment error that is never
more than 2 – 2/k times the alignment error
of the optimal consensus multiple alignment.
Multiple alignment to a (phylogenetic)
tree

Definition: Given an input tree T with a
distinct string (from a set of strings S) written
at each leaf, a phylogenetic alignment for T
is an assignment of one string to each
internal node of T. Note that the strings
assigned to internal nodes need not be
distinct and need not be from the input
strings S.
Multiple alignment to a (phylogenetic) tree

Definition: If strings S and S’ are assigned to the endpoints of an edge (i,j), then (i,j) has edge distance D(S,S’). The distance along a path is the sum of the distances on the edges in the path. The distance of a phylogenetic alignment is the total of all the edge distances in the tree.
Multiple alignment to a (phylogenetic) tree

The phylogenetic alignment problem for T: Find an assignment of strings to internal nodes of T (one string to each node) that minimizes the distance of the alignment.

The consensus alignment problem is a special case of the phylogenetic alignment problem (i.e., when tree T is a star).
A heuristic for phylogenetic alignment

Definition: A phylogenetic alignment is
called a lifted alignment if for every internal
node V, the string assigned to V is also
assigned to one of V’s children.

We will show that the best lifted alignment in T has a
total distance less than twice that of the optimal
phylogenetic alignment.
A heuristic for phylogenetic alignment

[Figure: an example of a lifted alignment on a tree with leaf strings S1 to S8; each internal node carries the string of one of its children.]
The transformation creating T^L

We will construct the lifted alignment T^L from T*, the optimal phylogenetic alignment.

Definition: We say a node has been lifted after it has been labeled by a string in the leaf set S.

Let Sv* be the string labeling internal node V in T*, and let S1, S2, …, Sk be the lifted strings labeling V’s children. We lift Sj to V if D(Sv*,Sj) ≤ D(Sv*,Si) for every i from 1 to k.
The transformation creating T^L

[Figure: the lifting operation at node V. The numbers on the edges are the distances from Sv* to the lifted strings labeling its children. Note that after the lift, one edge will have zero distance.]
The error analysis

Theorem: The lifted alignment T^L has total distance less than or equal to twice that of the optimal phylogenetic alignment T* of T.
Computing the minimum distance lifted alignment

The best lifted alignment is computed by dynamic programming.

Definition: Let Tv be the subtree of T rooted at node V. Let d(V,S) denote the distance of the best lifted alignment of Tv under the requirement that string S is assigned to node V (assuming of course that S is a string at a leaf of Tv).
Computing the minimum distance lifted alignment

We start with the assumption that all the leaves have already been processed.
S’ denotes a string written at a leaf; V’ denotes a child of V.

If V is a node all of whose children are leaves:
d(V,S) = Σ_{S’} D(S,S’).

For a general internal node V, the dynamic programming recurrence is
d(V,S) = Σ_{V’} min_{S’} [ D(S,S’) + d(V’,S’) ],
where S’ ranges over the strings at the leaves of the subtree rooted at V’.
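A minimal Python sketch of this dynamic program follows. The tree encoding (a leaf is a string, an internal node is a list of subtrees) is an assumption of this sketch, as is the use of unit-cost edit distance for D. For the child whose subtree supplies S, the sketch pins S’ = S, so that the result is a lifted alignment in the strict sense of the definition; leaf strings are assumed distinct.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def best_lifted(tree, dist):
    """Return {S: d(V,S)} for every leaf string S below the root V of `tree`."""
    if isinstance(tree, str):
        return {tree: 0}                       # a leaf is already labeled
    tables = [best_lifted(child, dist) for child in tree]
    result = {}
    for own_index, own in enumerate(tables):
        for S, d_own in own.items():
            total = d_own                      # S is lifted from this child: edge cost 0
            for other_index, table in enumerate(tables):
                if other_index != own_index:   # other children pick their best label
                    total += min(dist(S, S2) + d for S2, d in table.items())
            result[S] = total
    return result

tree = [["AXZ", "AXXZ"], ["AYZ", "AYXYZ"]]     # illustrative binary tree with four leaves
table = best_lifted(tree, edit_distance)
print(min(table.values()))  # 4
```

In practice the pairwise distances would be computed once and cached, since the same pair of leaf strings is compared at several nodes.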
Computing the minimum distance
lifted alignment

Theorem: The optimal lifted alignment
can be computed in polynomial time as
a function of size of the tree and the
lengths of the input strings.
Iterative pairwise alignment

The target is to iteratively merge two multiple alignments of two subsets of strings into a single multiple alignment of the union of those subsets.

As an example we will explain the average linkage method, also known as UPGMA, for “Unweighted Pair-Group Method using arithmetic Averages”. At each merge step, the new multiple alignment could be created by aligning some representation of the two smaller alignments (for example, by aligning profiles or consensus sequences).
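The merge order chosen by average linkage can be sketched in a few lines of Python. The sketch only decides which clusters to merge and returns the merge history as nested tuples; building the alignment at each merge is left out. The function name and input format are illustrative, and the four globin distances are used purely as sample data.

```python
from itertools import combinations

def upgma(names, dist):
    """Average-linkage clustering; returns the merge history as nested tuples.

    names: list of labels; dist: {(a, b): distance} for every pair of labels.
    """
    size = {name: 1 for name in names}                      # cluster -> number of leaves
    d = {frozenset(pair): v for pair, v in dist.items()}    # unordered cluster pairs
    while len(size) > 1:
        # Merge the two clusters with the smallest average distance.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for c in size:
            if c not in (a, b):
                # Size-weighted mean keeps every leaf pair equally weighted.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))

names = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse"]
dist = {
    ("Hbb_Human", "Hbb_Horse"): 0.17, ("Hbb_Human", "Hba_Human"): 0.59,
    ("Hbb_Horse", "Hba_Human"): 0.60, ("Hbb_Human", "Hba_Horse"): 0.59,
    ("Hbb_Horse", "Hba_Horse"): 0.59, ("Hba_Human", "Hba_Horse"): 0.13,
}
tree = upgma(names, dist)
print(tree)
```

The nested tuples describe a binary tree whose internal nodes are the merges, which is the tree T discussed for iterative methods.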
Iterative pairwise alignment

Multiple alignments serve the purpose of characterizing protein families and identifying important molecular structures, but…

Doolittle: “…what we’re really interested in is a historical alignment. The historical alignment ought to reflect, as accurately as possible, the series of divergences that led to the contemporary sequences…”
Iterative pairwise alignment

Iterative alignment methods determine a sequence
of merges of disjoint subsets of strings. Hence the
history of those merges can be described by a binary
tree T. Each leaf of T represents a single string from
the input set, and each node of T specifies a merge
of the strings found at the leaves of its subtree. Each
node also represents a multiple alignment created by
the merge at that node.
Progressive alignment

A pair of strings with minimum edit distance (or greatest similarity) is likely obtained from the pair of taxa that has most recently diverged.

Any spaces (gaps) that appear in the optimal pairwise alignment of those two strings are preserved throughout the entire sequence of successive merges.
Progressive alignment

The progressive alignment method is
explicitly aimed at building an evolutionary
tree from molecular data while
simultaneously constructing an evolutionarily
informative multiple alignment.
Improvements to progressive alignment

Sequence weighting – the weights are normalized such that the biggest one is set to 1. Closely related sequences receive lowered weights; highly divergent sequences receive high weights.
Initial gap penalties – a gap opening penalty (GOP) is charged for every gap, and a gap extension penalty (GEP) gives the cost of every space in the gap.
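As a small illustration of the GOP/GEP scheme, the sketch below charges one opening penalty per maximal run of spaces plus one extension penalty per space in the run. The default penalty values are placeholders chosen for the example, not CLUSTAL W's actual defaults.

```python
from itertools import groupby

def gap_cost(row, gop=10.0, gep=0.2):
    """Total gap cost of one aligned row: GOP per gap plus GEP per space in it."""
    total = 0.0
    for is_gap, run in groupby(row, key=lambda ch: ch == "_"):
        if is_gap:
            total += gop + gep * len(list(run))
    return total

print(gap_cost("AC__G_T", gop=10, gep=1))  # 23.0
```

In the usage line the row has two gaps, one of length 2 and one of length 1, giving (10 + 2) + (10 + 1) = 23.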
Improvements to progressive alignment

Weight matrices – two main series of weight matrices are offered to the user: Dayhoff PAM and BLOSUM.
Divergent sequences – the most divergent sequences are usually the most difficult to align correctly. It is sometimes better to delay the incorporation of these sequences until all of the more easily aligned sequences are merged first.
Progressive alignment

Pairwise alignment: calculate distance matrix

Hbb_Human   1    -
Hbb_Horse   2   .17   -
Hba_Human   3   .59  .60   -
Hba_Horse   4   .59  .59  .13   -
Myg_Phyca   5   .77  .77  .75  .75   -
Glb5_Petma  6   .81  .82  .73  .74  .80   -
Lgb2_Luplu  7   .87  .86  .86  .88  .93  .90   -
                 1    2    3    4    5    6    7
Progressive alignment

[Figure: unrooted neighbor-joining tree with leaves Myg_Phyca, Hba_Human, Hba_Horse, Glb5_Petma, Hbb_Human, Hbb_Horse and Lgb2_Luplu.]
Progressive alignment

[Figure: rooted NJ tree (guide tree) and sequence weights for Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Phyca, Glb5_Petma and Lgb2_Luplu.]

Progressive alignment: align following the guide tree.
Repeated-motif methods

The second major approach used in multiple alignment methods.
Definition: A motif is a substring or a small subsequence that is common to many of the strings in the set.
“Width” refers to the length of the motif, and “multiplicity” refers to the number of strings that it appears in.
Repeated-motif methods

Repeated-motif method general algorithm:
1. Find a “good” motif (wide and with high multiplicity).
2. The strings containing it are shifted so that the occurrences of the motif are aligned with each other.
3. The problem divides into two subproblems, one for the substrings on each side of the motif.
4. Continue this recursion until no sufficiently wide motif with sufficiently high multiplicity is found.
5. The remaining subproblems can be solved by iterative alignment methods.
6. Strings that did not contain the first good motif are aligned separately.
7. Finally, the two alignments are merged.
Summary

The importance of multiple string alignments in molecular biology.
CLUSTAL W.
Family representation.
How to score multiple alignments.
The center star method for SP alignment.
Consensus strings.
Approximating the optimal consensus multiple alignment.
Iterative pairwise alignment.
Progressive alignment and contemporary improvements.
Repeated-motif methods.
Bibliography

Gusfield, Dan. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press, 1997.

Thompson, J. D., Higgins, D. G., and Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 1994, Vol. 22, No. 22. Oxford University Press.