Transcript Lecture 7

Multiple sequence alignment
Why?
It is the most important means to assess the relatedness of a set of sequences
Gain information about the structure/function of a query sequence (conservation patterns)
Construct a phylogenetic tree
Put together a set of sequenced fragments (fragment assembly)
Recognise alternative splice sites
Many bioinformatics methods depend on it (e.g. secondary/tertiary structure prediction)
Multiple sequence alignment (MSA) of 12 * Flavodoxin + cheY
Pairwise alignment
Now we know how to do it.
How do we get a multiple alignment (three or more sequences)?
Multiple alignment: much greater combinatorial explosion than with pairwise alignment.
Multi-dimensional dynamic programming (Murata et al. 1985)
Simultaneous multiple alignment
Multi-dimensional dynamic programming
MSA (Lipman et al., 1989, PNAS 86, 4412)
• extremely slow and memory intensive
• up to 8-9 sequences of ~250 residues
DCA (Stoye et al., 1997, CABIOS 13, 625)
• still very slow
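To get a feel for why multi-dimensional dynamic programming is so slow and memory intensive, the sketch below counts the cells of the search lattice. It assumes the plain, unpruned formulation in which k sequences of length n need a lattice of (n + 1)^k cells and every cell has 2^k - 1 predecessor cells. The function names are illustrative only.

```python
def lattice_cells(n_seqs: int, length: int) -> int:
    """Cells in a full DP lattice for n_seqs sequences of equal length."""
    return (length + 1) ** n_seqs


def predecessors_per_cell(n_seqs: int) -> int:
    """Each cell is reached from 2**k - 1 neighbours: every non-empty
    subset of the sequences may advance by one residue."""
    return 2 ** n_seqs - 1


# Pairwise, three-way and nine-way alignment of ~250-residue sequences
for k in (2, 3, 9):
    print(k, lattice_cells(k, 250), predecessors_per_cell(k))
```

Programs such as MSA avoid visiting most of this lattice, but the growth with the number of sequences remains the limiting factor.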
Alternative multiple alignment methods
Biopat (Hogeweg Hesper 1984, first method ever)
MULTAL (Taylor 1987)
DIALIGN (Morgenstern 1996)
PRRP (Gotoh 1996)
Clustal (Thompson Higgins Gibson 1994)
Praline (Heringa 1999)
T-Coffee (Notredame Higgins Heringa 2000)
HMMER (Eddy 1998) [Hidden Markov Model]
SAGA (Notredame Higgins 1996) [Genetic algorithm]
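Progressive multiple alignment: general principles
1. Align all pairs of sequences (1-2, 1-3, ..., 4-5) and record the pairwise scores
2. Collect the scores in a similarity matrix (5×5 for five sequences)
3. Convert the scores to distances and build a guide tree
4. Build the multiple alignment by following the guide tree (iteration possibilities)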
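General progressive multiple alignment technique (follow generated tree)
[Figure: guide tree with distance axis d. The aligned blocks grow along the tree: (1, 3), then (1, 3) and (2, 5), then (1, 3, 2, 5), and at the root (1, 3, 2, 5, 4)]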
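The route from pairwise scores to a guide tree can be sketched as follows. This is a minimal illustration, not the method of any particular program: similarities are assumed to lie between 0 and 1, distances are taken as d = 1 - s, and the tree is built with simple average linkage (UPGMA) rather than Neighbour Joining. The toy similarities are invented so that the merge order matches a tree in which 1 and 3 join first and 4 joins last.

```python
from itertools import combinations


def scores_to_distances(sim):
    """Convert similarities in [0, 1] to distances as d = 1 - s (assumed choice)."""
    return {pair: 1.0 - s for pair, s in sim.items()}


def upgma_order(names, dist):
    """Return the order in which clusters are merged (average linkage).

    dist maps frozenset({a, b}) -> distance between two single sequences.
    """
    clusters = [(name,) for name in names]
    merges = []

    def avg(c1, c2):
        # Average distance over all sequence pairs between the two clusters
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Invented similarities for five sequences
names = [1, 2, 3, 4, 5]
sim = {frozenset(p): 0.2 for p in combinations(names, 2)}
sim.update({frozenset({1, 3}): 0.9, frozenset({2, 5}): 0.8})
for a in (1, 3):
    for b in (2, 5):
        sim[frozenset({a, b})] = 0.5

merges = upgma_order(names, scores_to_distances(sim))
for left, right in merges:
    print(left, "+", right)
```

Each merge corresponds to one alignment step: sequence with sequence at the leaves, and profile with profile (or profile with sequence) higher up the tree.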
Progressive multiple alignment
Problem:
Accuracy is very important
Errors are propagated into the progressive steps
“Once a gap, always a gap” (Feng & Doolittle, 1987)
Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 816-831, 1995)
Multiple alignment profiles (Gribskov et al. 1987)
Profile column i (one row per residue type, plus gap penalties):
A 0.3
C 0.1
D 0
...
W 0.3
Y 0.3
Gap penalties 1.0 0.5
Position dependent gap penalties
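A profile column like the one above can be derived from an alignment column by counting residues. The sketch below is a simplified illustration: it uses plain, unweighted frequencies and ignores gaps, whereas real profile methods also apply sequence weights and an exchange matrix. Position dependent gap penalties are not modelled here. The toy alignment is invented.

```python
from collections import Counter


def build_profile(alignment, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Return one dict per column mapping residue -> relative frequency.

    Gap characters ('-') are ignored when computing frequencies.
    """
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in alphabet}
        )
    return profile


toy_alignment = ["ACD", "ACE", "A-D", "WCD"]
profile = build_profile(toy_alignment)
print(profile[0]["A"], profile[0]["W"])
```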
Profile-sequence alignment
[Figure: a sequence (ACD……VWY) aligned against a profile]
Profile-profile alignment
[Figure: a profile (rows A, C, D, ..., Y) aligned against a second profile (ACD……VWY)]
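In both cases the dynamic programming itself is unchanged; only the score of a matrix cell differs. One common way to score a cell, shown here as a sketch, is the frequency-weighted average of exchange values. The exchange function below is a toy identity score standing in for a real matrix such as BLOSUM62, and the column frequencies are invented.

```python
def column_score(profile_column, residue, exchange):
    """Expected exchange score of one residue against one profile column."""
    return sum(freq * exchange(aa, residue) for aa, freq in profile_column.items())


def profile_profile_score(col_p, col_q, exchange):
    """Expected exchange score between two profile columns."""
    return sum(
        fp * fq * exchange(a, b)
        for a, fp in col_p.items()
        for b, fq in col_q.items()
    )


def identity_exchange(a, b):
    """Toy exchange function: +1 for identical residues, -1 otherwise."""
    return 1.0 if a == b else -1.0


column = {"A": 0.3, "C": 0.1, "W": 0.3, "Y": 0.3}
print(column_score(column, "A", identity_exchange))
print(profile_profile_score(column, {"A": 1.0}, identity_exchange))
```

A profile column that contains a single residue with frequency 1.0 behaves exactly like a sequence position, so profile-sequence alignment is a special case of profile-profile alignment.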
Clustal, ClustalW, ClustalX
CLUSTAL W/X (Thompson et al., 1994) uses the Neighbour Joining (NJ) algorithm (Saitou and Nei, 1987), widely used in phylogenetic analysis, to construct the guide tree.
Sequence blocks are represented by profiles, in which the individual sequences are additionally weighted according to the branch lengths in the NJ tree.
Further carefully crafted heuristics include:
(i) local gap penalties
(ii) automatic selection of the amino acid substitution matrix
(iii) automatic gap penalty adjustment
(iv) a mechanism to delay alignment of sequences that appear to be distant at the time they are considered
CLUSTAL (W/X) does not allow iteration, unlike the methods of Hogeweg and Hesper (1984), Corpet (1988), Gotoh (1996) and Heringa (1999, 2002).
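The branch-length weighting can be illustrated with a small sketch. It assumes a rooted guide tree and gives each sequence the sum of the branch lengths on its path to the root, with every branch length shared equally among the sequences below it. This follows the general idea of the Clustal W weighting but is not the program's actual code; the tree encoding (nested lists of (subtree, branch length) pairs) and the example tree are hypothetical.

```python
def leaves(node):
    """All leaf names below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    return [leaf for child, _ in node for leaf in leaves(child)]


def branch_weights(node, weights=None):
    """Sum, for each leaf, the branch lengths on its path to the root,
    each branch length being shared equally among the leaves below it."""
    if weights is None:
        weights = {}
    if isinstance(node, str):
        weights.setdefault(node, 0.0)
        return weights
    for child, length in node:
        below = leaves(child)
        for leaf in below:
            weights[leaf] = weights.get(leaf, 0.0) + length / len(below)
        branch_weights(child, weights)
    return weights


# Hypothetical rooted guide tree: ((A:0.1, B:0.1):0.3, C:0.4)
tree = [([("A", 0.1), ("B", 0.1)], 0.3), ("C", 0.4)]
seq_weights = branch_weights(tree)
print(seq_weights)
```

The two close relatives A and B share their common branch and so receive lower weights than the more isolated sequence C, which keeps a cluster of near-identical sequences from dominating the profile.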
Strategies for multiple sequence alignment
Profile pre-processing
Secondary structure-induced alignment
Globalised local alignment
Matrix extension
Objective: try to avoid (early) errors
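Pre-profile generation
1. Align all pairs of sequences (1-2, 1-3, ..., 4-5) and score them
2. For each sequence (1, 2, ..., 5), build a pre-alignment of that sequence with the other sequences that pass the score cut-off
3. Convert each pre-alignment into a pre-profile (rows A, C, D, ..., Y)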
Pre-profile alignment
[Figure: the pre-profiles of sequences 1-5 (rows A, C, D, ..., Y) are aligned progressively to give the final alignment. Each pre-profile holds its own sequence plus the others, e.g. 1 with 2, 3, 4, 5 and 2 with 1, 3, 4, 5, while the final alignment contains each sequence once (1, 2, 3, 4, 5)]
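Which sequences end up in each pre-alignment is decided by the cut-off on the pairwise scores. A minimal sketch of that selection step is given below; the sequence names, scores and cut-off are invented, and treating "score at least equal to the cut-off" as passing is an assumption.

```python
def pre_alignment_members(names, scores, cutoff):
    """For each master sequence, list the other sequences whose pairwise
    score with the master reaches the cut-off."""
    members = {}
    for master in names:
        members[master] = [
            other
            for other in names
            if other != master and scores[frozenset({master, other})] >= cutoff
        ]
    return members


names = ["s1", "s2", "s3", "s4"]
scores = {
    frozenset({"s1", "s2"}): 120,
    frozenset({"s1", "s3"}): 40,
    frozenset({"s1", "s4"}): 95,
    frozenset({"s2", "s3"}): 30,
    frozenset({"s2", "s4"}): 110,
    frozenset({"s3", "s4"}): 20,
}
members = pre_alignment_members(names, scores, cutoff=90)
print(members)
```

A sequence with no partner above the cut-off (s3 here) simply yields a pre-profile built from itself alone. Lowering the cut-off pulls more sequences into every pre-profile.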
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE (oligomers)
One of the dogmas of molecular biology: “Structure more conserved than sequence”
Secondary structure-induced alignment
Using secondary structure for alignment
[Figure: dynamic programming search matrix for two sequences (MDAASTILCGS and MDAGSTVILCFV) annotated with secondary structure, e.g. MDAGSTVILCFV with HHHCCCEEEEEE. Regions of the matrix are scored with different amino acid exchange weights matrices: H, C, E or Default]
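The choice of exchange matrix per cell can be sketched as below. The rule used here is an assumption: a structure-specific matrix when both residues share the same state (H, E or C) and the default matrix otherwise. The toy matrices are simple match/mismatch scores with invented values, and the secondary structure string of the second sequence is hypothetical.

```python
def make_toy_matrix(match, mismatch):
    """Stand-in for a real exchange weights matrix."""
    return lambda a, b: match if a == b else mismatch


def pick_matrix(ss_a, ss_b, matrices):
    """Choose the exchange matrix for one cell of the search matrix."""
    if ss_a == ss_b and ss_a in matrices:
        return matrices[ss_a]
    return matrices["default"]


def fill_search_matrix(seq_a, ss_a, seq_b, ss_b, matrices):
    """Score every residue pair with the matrix selected by secondary structure."""
    return [
        [pick_matrix(sa, sb, matrices)(a, b) for b, sb in zip(seq_b, ss_b)]
        for a, sa in zip(seq_a, ss_a)
    ]


matrices = {
    "H": make_toy_matrix(3, -1),
    "E": make_toy_matrix(4, -2),
    "C": make_toy_matrix(2, 0),
    "default": make_toy_matrix(1, -1),
}
search = fill_search_matrix(
    "MDAGSTVILCFV", "HHHCCCEEEEEE",
    "MDAASTILCGS", "HHHHCCEEECC",  # hypothetical secondary structure
    matrices,
)
print(search[0][0], search[7][6], search[3][3])
```

The filled matrix is then passed to the usual dynamic programming routine, so the secondary structure only changes the scores, not the alignment algorithm.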
Flavodoxin-cheY
Using predicted secondary structure
[Figure: alignment shown in two blocks; the secondary structure strings printed under each sequence on the slide (h = helix, e = strand, ...) are not reproduced here]
1fx1        -PK-ALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACF
FLAV_DESVH  MPK-ALIVYGSTTGNTEYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACf
FLAV_DESGI  MPK-ALIVYGSTTGNTEGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLYED-LDRAGLKDKKVGVf
FLAV_DESSA  MSK-SLIVYGSTTGNTETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLYDS-LENADLKGKKVSVf
FLAV_DESDE  MSK-VLIVFGSSTGNTESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLFEE-FNRFGLAGRKVAAf
2fcr        --K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKDLPVAIF
FLAV_ANASP  SKK-IGLFYGTQTGKTESVaEIIRDEFGND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLYSE-LDDVDFNGKLVAYf
FLAV_ECOLI  -AI-TGIFFGSDTGNTENIaKMIQKQLGKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA--------QCDWDDFFPT-LEEIDFNGKLVALf
FLAV_AZOVI  -AK-IGLFFGSNTGKTRKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFLPK-IEGLDFSGKTVALf
FLAV_ENTAG  MAT-IGIFFGSDTGQTRKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFTNT-LSEADLTGKTVALf
4fxn        ----MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVNIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KISGKKVALF
FLAV_MEGEL  M---VEIVYWSGTGNTEAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSVVEPFFTD-LAP-KLKGKKVGLf
FLAV_CLOAB  M-K-ISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI--------SWEMKKWIDE-SSEFNLEGKLGAAf
3chy        ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DALNKLQAGGYGFVISD---WNMPNM----------DGLELLKTIRADGAMSALPVLMV

1fx1        GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI-------
FLAV_DESVH  GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI-------
FLAV_DESGI  GCGDS-SY-TYFCGAVDVIEKKAEELgATLVAS---------------------SLKIDGE--P--DSAEVLDwAREVLARV-------
FLAV_DESSA  GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD---------------------SLKIDGD--P--ERDEIVSwGSGIADKI-------
FLAV_DESDE  ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL-------
2fcr        GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV-----
FLAV_ANASP  GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL-----
FLAV_ECOLI  GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
FLAV_AZOVI  GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L-
FLAV_ENTAG  GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L-----
4fxn        G-----SYGWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI--------
FLAV_MEGEL  G-----SYGWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNAPE-CKElGEAAAKA--------
FLAV_CLOAB  STANSIA-GGSDIALLTILNHLMVK-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfGERiANkV--KQIF-
3chy        -----------TAEAKKENIIAAAQAGASGY-------------------------VVK----P-FTAATLEEKLNKIFEKLGM-----
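Globalised local alignment
1. Local (SW) alignment (M + Po,e)
2. Global (NW) alignment (no M or Po,e)
(M: exchange matrix; Po, Pe: gap-opening and gap-extension penalties)
Double dynamic programming
[Figure: local alignment matrices are added (+) to give (=) the matrix used for the global alignment. Panels shown for M = BLOSUM62 with Po = 0, Pe = 0; Po = 12, Pe = 1; and Po = 60, Pe = 5]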
Matrix extension
T-Coffee: Tree-based Consistency Objective Function For alignmEnt Evaluation
Cedric Notredame, Des Higgins, Jaap Heringa
J. Mol. Biol., 302, 205-217; 2000
Matrix extension – T-COFFEE
[Figure: all pairwise alignments of four sequences (1-2, 1-3, 1-4, 2-3, 2-4, 3-4)]
Integrating alignment methods and alignment information with T-Coffee
• Integrating different pair-wise alignment techniques (NW, SW, ..)
• Combining different multiple alignment methods (consensus multiple alignment)
• Combining sequence alignment methods with structural alignment techniques
• Plug in user knowledge
Using different sources of alignment information
[Figure: Clustal (shown twice), structure alignments, Dialign, Lalign and manual alignments feed into T-Coffee (search matrix extension)]
T-Coffee
• Combine different alignment techniques by adding scores:
$W(A(x), B(y)) = \sum S(A(x), B(y))$
– A(x) is residue x in sequence A
– summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y))
– S is the sequence identity percentage of the associated alignment
• Combine the direct alignment seqA-seqB with each seqA-seqI-seqB:
$W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min(W(A(x), I(z)), W(I(z), B(y)))$
– Summation over all third sequences I other than A or B
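The extension step can be written down quite compactly. The sketch below is an illustration of the formula above, not T-Coffee's own code: residues are represented as (sequence name, position) tuples, the primary weights W are stored per residue pair, and for every intermediate residue I(z) the minimum of its two links is added to the pair it connects. The three weights in the example are invented.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(primary):
    """primary maps frozenset({(seqA, x), (seqB, y)}) -> W(A(x), B(y)).

    Returns W' with the contribution of every third-sequence residue added.
    """
    neighbours = defaultdict(dict)
    for pair, weight in primary.items():
        a, b = tuple(pair)
        neighbours[a][b] = weight
        neighbours[b][a] = weight

    extended = defaultdict(float, primary)
    for via, linked in neighbours.items():       # via plays the role of I(z)
        for a, b in combinations(linked, 2):     # a = A(x), b = B(y)
            if a[0] != b[0]:                     # residues of two different sequences
                extended[frozenset({a, b})] += min(linked[a], linked[b])
    return dict(extended)


def pair(a, b):
    return frozenset({a, b})


A1, B1, C1 = ("A", 1), ("B", 1), ("C", 1)
primary = {pair(A1, B1): 88, pair(A1, C1): 77, pair(C1, B1): 100}
extended = extend_library(primary)
print(extended[pair(A1, B1)])
```

Here the direct weight of 88 for A(1)-B(1) is increased by min(77, 100) = 77 because sequence C supports the same pairing. Residue pairs that are never aligned directly but are linked through a third sequence also receive a weight.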
T-Coffee
[Figure: search matrix extension, combining the direct alignment with the alignments via other sequences]
Evaluating multiple alignments
Conflicting standards of truth
• evolution
• structure
• function
With orphan sequences, no additional information is available
Benchmarks depend on reference alignments
Quality issue of available reference alignment databases
Different ways to quantify agreement with reference alignment (sum-of-pairs, column score)
“Charlie Chaplin” problem
Evaluating multiple alignments
As a standard of truth, a reference alignment based on structural superpositioning is often taken
Evaluation measures (query alignment versus reference alignment):
• Column score
• Sum-of-Pairs score
Evaluating multiple alignments
[Figure: SP score versus BAliBASE alignment size (nseq * len)]
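Both measures can be computed directly from a query and a reference alignment of the same sequences. The sketch below uses common definitions, taken here as an assumption: the sum-of-pairs score is the fraction of residue pairs aligned in the reference that are also aligned in the query, and the column score is the fraction of reference columns reproduced exactly in the query. The two toy alignments are invented.

```python
from itertools import combinations


def columns(alignment):
    """Describe each column by the residue index of every row (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for column in zip(*alignment):
        col = []
        for row, char in enumerate(column):
            if char == "-":
                col.append(None)
            else:
                col.append(counters[row])
                counters[row] += 1
        cols.append(tuple(col))
    return cols


def residue_pairs(alignment):
    """All pairs of residues placed in the same column."""
    pairs = set()
    for col in columns(alignment):
        for (i, x), (j, y) in combinations(enumerate(col), 2):
            if x is not None and y is not None:
                pairs.add((i, x, j, y))
    return pairs


def sum_of_pairs_score(query, reference):
    ref = residue_pairs(reference)
    return len(ref & residue_pairs(query)) / len(ref)


def column_score(query, reference):
    ref_cols = columns(reference)
    query_cols = set(columns(query))
    return sum(col in query_cols for col in ref_cols) / len(ref_cols)


reference = ["ACDE", "AC-E", "A-DE"]
query = ["ACDE", "A-CE", "A-DE"]
print(sum_of_pairs_score(query, reference), column_score(query, reference))
```

The column score is the stricter of the two: a single misplaced residue spoils a whole column, while the sum-of-pairs score still credits the pairs in that column that remain correct.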
Summary
Weighting schemes simulating simultaneous multiple alignment
• Profile pre-processing (global/local)
• Matrix extension (well balanced scheme)
Smoothing alignment signals
• globalised local alignment
Using additional information
• secondary structure driven alignment
Schemes strike a balance between speed and sensitivity
References
Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364.
Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.
Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem., 26(5), 459-477.
Where to find this….
http://www.ibivu.cs.vu.nl/teaching