Multiple Alignments and Multivariate Analysis
Clustal: 1988-2006
Multiple Alignments
Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
Phylogenetic Analysis
Homology Detection
Homology Modeling
Secondary Str. Prediction
Profile Analysis
Human beta   VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP
Human alpha  -VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA

Human beta   KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF
Human alpha  QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL

Human beta   GKEFTPPVQAAYQKVVAGVANALAHKYH
Human alpha  PAEFTPAVHASLDKFLASVSTVLTSKYR
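The pairwise alignment on this slide marks identical columns with `*`. As a minimal sketch of how such a marker line can be derived from two aligned rows, the function below compares the rows column by column. The function name `identity_line` is an illustrative assumption, not part of Clustal.

```python
def identity_line(row_a: str, row_b: str) -> str:
    """Return a marker line with '*' under columns holding the same residue.

    Gap characters ('-') never count as identical.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return "".join(
        "*" if a == b and a != "-" else " "
        for a, b in zip(row_a, row_b)
    )


# First 60 columns of the human beta / human alpha alignment from the slide.
beta = "VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP"
alpha = "-VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA"
marks = identity_line(beta, alpha)

print(beta)
print(marks)
print(alpha)
```

Clustal's multiple-alignment output also uses `:` and `.` for conservative substitutions; that would need an amino acid grouping table, which this sketch leaves out.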
Dynamic Programming
•Needleman and Wunsch, 1970
•O(L^2) algorithm
Maximise score (or minimise distance)
•Gap penalties
•Amino acid weight matrix
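The dynamic programming recurrence behind these bullets can be sketched as follows. This is a simplified illustration, not the Clustal implementation: it uses a flat match/mismatch score in place of an amino acid weight matrix and a single linear gap penalty, and the parameter values are arbitrary assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b).

    Fills an (len(a)+1) x (len(b)+1) table, hence O(L^2) time and memory.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("FAT", "FAST"))
```

Swapping `sub` for a lookup in a substitution matrix, and the linear penalty for separate gap-opening and gap-extension terms, gives the form usually used for proteins.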
Weighted Sums of Pairs: WSP
WSP = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}
Time O(L^N)
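A minimal sketch of the weighted sum of pairs for a given multiple alignment follows, summing a weighted pairwise distance over every pair of rows. The distance function is an assumption made for illustration (unit cost per mismatch and per residue-against-gap column); real implementations take D from a weight matrix and gap penalties.

```python
from itertools import combinations


def pair_distance(row_i, row_j, mismatch=1.0, gap=1.0):
    """Distance D_ij between two rows of one multiple alignment (assumed costs)."""
    d = 0.0
    for a, b in zip(row_i, row_j):
        if a == "-" and b == "-":
            continue                  # column is empty for this pair
        if a == "-" or b == "-":
            d += gap
        elif a != b:
            d += mismatch
    return d


def wsp(rows, weights=None):
    """Sum of W_ij * D_ij over all pairs; weights default to 1 for every pair."""
    total = 0.0
    for i, j in combinations(range(len(rows)), 2):
        w = 1.0 if weights is None else weights[i][j]
        total += w * pair_distance(rows[i], rows[j])
    return total


rows = ["FA-T", "FAST", "CA-T"]
print(wsp(rows))
```

Scoring a fixed alignment this way is cheap; the O(L^N) cost on the slide is that of searching for the alignment that optimises the sum.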
Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
7           2404 years
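The growth in this table can be reproduced with a short calculation. The factor of 150 per added sequence is an assumption read off the first two rows (1 second to 150 seconds), not a figure stated on the slide.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400
GROWTH_PER_SEQUENCE = 150  # assumed: ratio between the 2- and 3-sequence rows


def projected_seconds(n_sequences, base_sequences=2, base_seconds=1.0):
    """Run time if every extra sequence multiplies the cost by the same factor."""
    return base_seconds * GROWTH_PER_SEQUENCE ** (n_sequences - base_sequences)


for n in range(2, 8):
    s = projected_seconds(n)
    print(f"{n} sequences: {s:.3g} s = {s / SECONDS_PER_HOUR:.3g} h "
          f"= {s / SECONDS_PER_DAY:.3g} days")
```

Under that assumption the projection matches the table to within rounding, which shows why exact O(L^N) alignment is impractical beyond a handful of sequences.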
Progressive Alignment:
Horse beta
Human beta
Horse alpha
Human alpha
Whale myoglobin
Lamprey cyanohaemoglobin
Lupin leghaemoglobin
Feng and Doolittle, 1987
Barton and Sternberg, 1987
Willie Taylor, 1987, 1988
Hogeweg and Hesper, 1984
Clustal
• 35000 citations
• Clustal1-Clustal4 1988
– Paul Sharp, Dublin
• Clustal V 1992
– EMBL Heidelberg
• Rainer Fuchs
• Alan Bleasby
• Clustal W 1994-2006, Clustal X 1997-2006
– Toby Gibson, EMBL, Heidelberg
– Julie Thompson, ICGEB, Strasbourg
• Clustal W and Clustal X 2.0 early 2007
– University College Dublin
Since 1994?
Benchmarks
Protein structure alignments and superpositions
• Barton and Sternberg; Fitch and McClure
• Dali
• BaliBase
• Homstrad
• Oxbench
• Prefab etc. etc.
Protein structure analysis
•APDB
O'Sullivan O, Zehnder M, Higgins D, Bucher P, Grosdidier A, Notredame C. (2003)
APDB: a novel measure for benchmarking sequence alignment methods without reference alignments.
Bioinformatics. 2003;19 Suppl 1:i215-21.
RNA alignments
•Bralibase (Gardner PP, Wilm A & Washietl S (2005) NAR)
Which Method is Best?
• Clustal W????
• MSA (Lipman, Altschul, Kececioglu)
• DCA (Stoye), PRRP (Gotoh) , SAGA (Notredame)
• T-Coffee (Notredame)
• 3D-Coffee, M-Coffee
• MAFFT (Katoh) and MUSCLE (Edgar)
• Probcons (Do, Brudno, Batzoglou)
For Global Protein alignments!!!
Clustal W and X 2.0?
• Jan 2007
• Re-engineered in C++
• Aim to increase accuracy
– Iteration (Wallace, I. M., O'Sullivan, O. and Higgins, D. G., 2005
Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics
21:1408.)
• Reduce run times
Multivariate Analysis?
ADE-4
http://pbil.univ-lyon1.fr/ADE-4/
Thioulouse J., Chessel D., Dolédec S., & Olivier J.M.
(1997) ADE-4: a multivariate analysis and graphical
display software. Statistics and Computing, 7, 1, 75-83.
Between Group Analysis BGA
Dolédec, S. & Chessel, D. (1987)
Acta Oecologica, Oecologica Generalis, 8, 3, 403-426.
Supervised Correspondence Analysis or PCA
CO-Inertia Analysis CIA
Dolédec, S. & Chessel, D. (1994) Freshwater Biology, 31, 277-294.
Thioulouse, J. & Lobry, J.R. (1995) CABIOS, 11, 321-329
2 datasets; Simultaneous CA or PCA
• MADE4
– Culhane, A., Thioulouse, J., Perriere, G., Higgins, D.G. (2005)
MADE4: an R package for multivariate analysis of gene expression
data. Bioinformatics. 21(11):2789-2790.
Use CA, PCA for Sequences?
PCOORD on sequence distances:
Higgins, D.G. (1992) Sequence ordinations: a
multivariate analysis approach to analysing large
sequence data sets. CABIOS, 8, 15-22.
PCA on dipeptide composition:
Van Heel, M. (1991) A new family of powerful
multivariate statistical sequence analysis techniques.
J. Mol Biol. 220(4): 877-887.
PCA on alignment columns:
Casari G, Sander C, Valencia A. (1995) A method
to predict functional residues in proteins. Nat Struct
Biol. 2(2):171-8.
Supervised PCA or CA?
Malate Dehydrogenases
Lactate Dehydrogenases
Between Group Analysis
[Figure: Between Group Analysis schematic; labels: samples, genes, GSVD, N]
Trypsin-like serine proteases
[Figure: sequence plot, grid d = 0.05; 15 Chymotrypsins, 10 Elastases, 31 Trypsins; points labelled EC_1_*, EC_4_*, EC_36_* and X5PTP_EC_4]
[Figure: residue plot, grid d = 0.1, shown in three builds; group labels Chymotrypsin, Trypsin and Elastase; axis ticks e+00, 4 e-04, 8 e-04]
Residue labels: X54V, X265S, X232M, X154T, X95N, X3N, X93F, X137C, X243Q, X82E, X87L, X7A, X180Q, X155T, X14W, X165N, X229S, X183L, X181A, X98W, X66T, X98Y, X155S, X93I, X275G, X154V, X228K, X162S, X70R, X229D, X204S, X132Y, X273K, X16S, X18I, X10N, X92I, X196Y, X82G, X232Q
BGA With CA or PCA?
• CA:
– Pretty pictures
– Sequences/residues plots
– Finds any clear/simple patterns
• Binary aa variables
• PCA:
– Use continuous variables
• e.g. aa properties: size, charge, hydrophobicity etc.
BGA with PCA using 5 amino acid properties (A-E)
[Figure: sequences plot, grid d = 10; 15 Chymotrypsins, 31 Trypsins, 10 Elastases; points labelled EC_1_*, EC_4_*, EC_36_* and 5PTP_EC_4]
[Figure: residue weights plot, grid d = 0.5, with eigenvalues inset (ticks 0 to 40); variables labelled e.g. X227B, X260C, X260E, X47D, X229A, X275B, X95C, X136A]
BGA on Alignments
• Focus on any split in the data
• Binary or Property coding
– CA or PCA
• Sequence Weighting
• Pseudocounts
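As a rough illustration of the binary coding mentioned above, the sketch below turns each alignment column into one 0/1 variable per residue observed there, which is the kind of table CA could then be run on. The naming scheme (`X` + column number + residue, as in `X54V`) is an assumption based on the labels in the residue plots; sequence weighting and pseudocounts are not shown.

```python
def binary_coding(rows):
    """Map an alignment to {variable_name: [0 or 1 for each sequence]}.

    Columns showing fewer than two distinct residues are skipped, since they
    cannot separate groups of sequences. Gaps are coded as 0 for every residue.
    """
    table = {}
    for col in range(len(rows[0])):
        residues = sorted({row[col] for row in rows if row[col] != "-"})
        if len(residues) < 2:
            continue
        for aa in residues:
            name = f"X{col + 1}{aa}"          # hypothetical label format
            table[name] = [1 if row[col] == aa else 0 for row in rows]
    return table


alignment = ["VHLT", "VQLS", "-VLS"]
table = binary_coding(alignment)
for name, values in table.items():
    print(name, values)
```

Property coding for PCA would replace each 0/1 indicator with continuous values such as size, charge or hydrophobicity of the residue in that column.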
Clustal
Toby Gibson, EMBL
Julie Thompson, ICGEB, Strasbourg
BGA, CIA, MADE4
Aedín Culhane
Guy Perriere
Jean Thioulouse
Ian Jeffery
Ailís Fagan
Iteration
Benchmarking
Clustal W 2.0
Gordon Blackshields
Mark Larkin
Paul McGettigan
Iain Wallace
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT
SeqA  GARFIELD THE LAST FA-T CAT
SeqB  GARFIELD THE FAST CA-T ---
SeqC  GARFIELD THE VERY FAST CAT
SeqD  -------- THE ---- FA-T CAT
Weighted Sums of Pairs
WSP = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}
MSA
Branch and Bound
Lipman, Altschul and Kececioglu, 1989
FastMSA
Tweaked MSA
Gupta, Kececioglu and Schaeffer, 1995
DCA
Divide and Conquer
Stoye, Moulton and Dress, 1997
SAGA
Genetic Algorithm
Notredame and Higgins, 1996
PRRP
Iteration
Gotoh, 1996
Genetic Algorithm
Selection (WSP)
Mutation
Recombination (cross-overs)
SAGA
• Cedric Notredame
• Sequence Alignment by Genetic Algorithm
• Optimise any objective function
• Notredame, C. and Higgins, D.G. (1996)
SAGA: Sequence alignment by genetic algorithm. Nucleic Acids
Research, 24:1515-1524.
Structure Test Cases

                             MSA                                     SAGA
Test case    N seqs  Length  Score WSP  Structure match %  CPU-time  Score WSP  Structure match %  CPU-time
Cytc         6       129     1051257    74                 7         1051257    74                 960
GCR          8       60      371875     75                 3         371650     82                 75
Ac Protease  5       183     379997     80                 13        379997     80                 331
S Protease   6       280     574884     91                 184       574884     91                 3500
Chtp         6       247     111924     -                  4525      111579     -                  3542
Dfr secstr   4       189     171979     82.03              5         171975     82.50              411
Sbt          4       296     271747     80                 7         271747     80                 210
Globin       7       167     659036     94                 7         659036     94                 330
Plasto       5       132     236343     54.03              22        236195     54.05              510
Which method is best?
• Best score?
• Empirical tests?
– Sets of test cases
• Fitch and McClure
• BaliBase
• Homstrad
• Oxbench
• Prefab etc. etc.
– APDB
O'Sullivan O, Zehnder M, Higgins D, Bucher P, Grosdidier A, Notredame C. (2003)
APDB: a novel measure for benchmarking sequence alignment methods without reference
alignments. Bioinformatics. 2003;19 Suppl 1:i215-21.
COFFEE
• Consistency based Objective Function For
Evaluation of Ehhhh things
• Maximum Weight Trace (John Kececioglu)
• Maximise similarity to a LIBRARY of
residue pairs
• Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An
objective function for multiple sequence alignments. Bioinformatics
14: 407-422.
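The idea of scoring an alignment by its agreement with a library of residue pairs can be sketched as below. This is a simplified stand-in, not the published COFFEE function: it reports the fraction of total library weight that the alignment reproduces, and the library format (pairs of `(sequence index, residue index)` tuples with the lower sequence index first, both 0-based) is an assumption.

```python
from itertools import combinations


def aligned_pairs(rows):
    """Set of ((seq_n, res_i), (seq_m, res_j)) for residues sharing a column."""
    counters = [0] * len(rows)        # next residue index for each sequence
    pairs = set()
    for col in range(len(rows[0])):
        present = []
        for n, row in enumerate(rows):
            if row[col] != "-":
                present.append((n, counters[n]))
                counters[n] += 1
        pairs.update(combinations(present, 2))
    return pairs


def library_score(rows, library):
    """Fraction of the library weight carried by pairs found in the alignment."""
    found = aligned_pairs(rows)
    total = sum(library.values())
    return sum(w for pair, w in library.items() if pair in found) / total


# Hypothetical library for two sequences: FAT (index 0) and FAST (index 1).
library = {
    ((0, 0), (1, 0)): 10,   # F with F
    ((0, 2), (1, 3)): 10,   # T with T
    ((0, 2), (1, 2)): 5,    # T with S, a conflicting suggestion
}
score = library_score(["FA-T", "FAST"], library)
print(score)
```

An optimiser such as SAGA can then search for the alignment that maximises this kind of score instead of the weighted sum of pairs.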
Pairs of Residues
Human beta   VHLTPEEKSAVTALWGKVN--VDEVGGEAL
Horse beta   VQLSGEEKAAVLALWDKVN--EEEVGGEAL
Human alpha  -VLSPADKTNVKAAWGKVGAHAGEYGAEAL
Horse alpha  -VLSAADKTNVKAAWSKVGGHAGEYGAEAL
e.g.
Seq N, Residue I
Seq M, Residue J
Weight = w
% Match

Test Case    Avg % ID  N seq  COFFEE SAGA  PRRP  MSA SAGA  ClustalW  PILEUP  SAM HMM
Ac prot      21        14     50.2         48.8  51.2      39.2      40.9    27.9
Binding      31        7      64.5         76.2  64.2      50.0      66.6    36.9
Cytc         42        6      90.7         89.4  67.3      89.1      94.6    67.3
Fniii        17        9      47.0         36.3  45.2      42.0      37.8    16.2
Gcr          36        8      83.1         92.8  80.8      80.8      80.8    85.7
Globin       24        17     85.2         87.0  78.0      86.4      72.6    67.8
Igb          24        37     78.1         74.9  70.1      74.8      52.4    67.2
Lzm          39        6      72.3         71.1  72.3      72.2      72.3    55.3
Phenyldiox   22        8      64.7         49.9  55.6      58.5      37.4    45.7
Sbt          61        7      96.9         96.7  96.0      96.9      97.4    90.6
sprot        27        15     66.6         64.3  68.5      62.5      57.9    61.7
Average                       72.6         71.5  68.1      65.5      64.5    56.4
T-Coffee
• Heuristic approximation to COFFEE
– Uses progressive alignment (Trees)
• Heterogeneous data
– Sequences
– Structures
– Genomes
– ESTs
• Notredame, C, Higgins, DG and Heringa, J. (2000)
T-Coffee: A novel method for fast and accurate multiple sequence
alignment. J.Mol.Biol., 302: 205-217.
T-Coffee
• Mixed data sources
– Primary library from
• Lalign (SIM):
– 10 best local alignments
• Clustalw
Default
– All pairwise alignments
• SAP (Willie Taylor, Structure Superposition)
• Multiple alignments
• Check library for CONSISTENCY
– Upweight pairs of residues that agree with other pairs
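The consistency check can be sketched as a library extension step: a residue pair gains weight whenever both residues are also paired with the same residue of a third sequence. The sketch assumes the library maps pairs of `(sequence name, residue index)` tuples to weights and only rescores pairs already in the library; it is an illustration of the idea, not the T-Coffee code.

```python
from collections import defaultdict


def extend_library(library):
    """Add to each pair the support it receives through third sequences.

    For a pair (x, y), every residue z from another sequence that is paired
    with both x and y contributes min(W(x, z), W(z, y)).
    """
    neighbours = defaultdict(dict)    # residue -> {paired residue: weight}
    for (x, y), w in library.items():
        neighbours[x][y] = w
        neighbours[y][x] = w

    extended = {}
    for (x, y), w in library.items():
        bonus = 0
        for z, w_xz in neighbours[x].items():
            if z[0] in (x[0], y[0]):
                continue              # z must come from a third sequence
            w_zy = neighbours[z].get(y)
            if w_zy is not None:
                bonus += min(w_xz, w_zy)
        extended[(x, y)] = w + bonus
    return extended


# Hypothetical primary library over sequences A, B and C.
primary = {
    (("A", 0), ("B", 0)): 88,
    (("A", 0), ("C", 0)): 77,
    (("C", 0), ("B", 0)): 100,
    (("A", 1), ("B", 2)): 50,   # no third sequence supports this pair
}
extended = extend_library(primary)
print(extended)
```

Pairs supported by several sequences end up with much higher weights than isolated ones, which is what steers the progressive alignment away from inconsistent pairings.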
Mixing Heterogeneous Information
Local Alignment
Global Alignment
Multiple Alignment
Specialist
Structural
T-Coffee
Multiple Sequence Alignment
Copyright Cédric Notredame, 2000, all rights reserved
Structure Superposition (e.g. SAP, Taylor and Orengo)
Weighted Residue Pairs
Increasing Structure Numbers
[Figure: % accuracy (axis 0 to 100) against no of Structures (0 to 5) for tRNA-synt_2b, 19% ID]
Including Structures in an Alignment
[Figure: bar chart of % accuracy (axis 0 to 80): clustalw 35.24; T_Coffee Default 38.39; T_Coffee plus all structures 66.49]
3D-Coffee
O'Sullivan, O., Suhre, K., Abergel, C., Higgins, DG and Notredame, C (2004) J.Mol.Biol.
Recent Developments
• 20-30 new programs in past 2 years
• MUSCLE
– Bob Edgar, ISMB, 2004
– Iteration/progressive alignment
• FAST
• Big Alignments
• PROBCONS
– Tom Do, Michael Brudno, Serafim Batzoglou
– ISMB 2004
– “P-Coffee”
• VERY accurate
Iteration Revisited
--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Iteration strategies:
• Remove EACH Sequence
• Remove BEST Sequence
• Random
• Tree based
[Figure labels: Iterate; RF, RB, Random, Tree]
Iteration on HomStrad 184

Method           Default  Remove Each  Remove Best  Random
ProbCons         64.88    64.69        65.00        64.27
Muscle v3.3      63.12    63.77***     63.70**      63.39
T-Coffee         62.87    63.24        63.38        62.70
Muscle v3.2      61.76    63.58***     63.57***     62.75**
ClustalW         59.87    61.54***     61.44***     60.99**
FFT-NSI (Mafft)  59.65    62.10***     62.05***     61.55***
Average          59.32    60.70***     60.88***     60.57***

Tree Based
Quick Tree       63.45**  63.69***  62.47**
Slow Tree        63.10*   63.27**   61.74

Wallace, O'Sullivan and Higgins, 2005, Bioinformatics, 21:1408
Combining Multiple Alignment Methods
[Figure: Clustal W, T-Coffee, Probcons, Muscle and Specialist methods as inputs to T-Coffee, giving a Multiple Sequence Alignment]
Combining Multiple Alignment methods with T-Coffee
[Figure: bar chart, shown in two builds; y-axis 65.80 to 67.80; x-axis labels of the form +method, including +clustalw, +fftns2, +fftns1, +dialignt, +dialign, +poa-global and +poa-local]
The Wisdom of Crowds
James Surowiecki
Crowds are surprisingly good at accurate decisions
Better than “experts”
Only if they do not form a “mob”
M-Coffee: combining 8 methods

Method added  Combined  Default
Poa-global    51.96     51.90
+Dialign-T    58.32     57.92
+ClustalW     62.75     61.15
+PCMA         65.15     63.73
+FINSI        65.94     64.22
+T-Coffee     66.73     65.37
+Muscle v6    67.38     66.04
+ProbCons     67.75     66.41
BaliBASE
Thompson, JD, Plewniak, F. and Poch, O. (1999)
NAR and Bioinformatics
•ICGEB Strasbourg
•141 manual alignments using structures
•5 sections
•core alignment regions marked
1. Equidistant (82)
2. Orphan (23)
3. Two groups (12)
4. Long internal gaps (13)
5. Long terminal gaps (11)
Compare Methods
• SAM: HMM (Hughey and Krogh, 1996)
• Dialign: local multiple alignments (Morgenstern, 1999)
• ClustalW: progressive alignment (Thompson, Higgins and Gibson, 1994)
• Prrp: iterative WSP (Gotoh, 1996)
• T-Coffee: pairwise library (Notredame, Higgins and Heringa, 2000)
% alignment columns correct
Core alignment blocks only

             Balibase section (number of alignments)
Method     1 (82)  2 (23)  3 (12)  4 (13)  5 (11)  Total
SAM        46.8    20.0    13.9    42.7    43.9    39.8
Dialign    71.0    25.2    35.1    80.4    74.7    61.5
ClustalW   78.5    32.2    42.5    74.3    65.7    66.4
PRRP       78.6    32.5    50.2    82.7    51.1    66.4
T-Coffee   80.7    37.3    52.9    88.7    83.2    72.1
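A column score of this kind can be sketched as the percentage of reference columns that a test alignment reproduces exactly. The sketch below is an assumption about the general approach, not the BaliBASE scoring program: it scores every reference column, whereas the table above restricts scoring to core blocks.

```python
def columns(rows):
    """Describe each column by the residue index it holds in every sequence.

    Gaps are recorded as None, so a column is identified by which residues
    it aligns, independent of its position in the alignment.
    """
    counters = [0] * len(rows)
    cols = []
    for c in range(len(rows[0])):
        col = []
        for n, row in enumerate(rows):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[n])
                counters[n] += 1
        cols.append(tuple(col))
    return cols


def column_score(test_rows, reference_rows):
    """Percentage of reference columns present unchanged in the test alignment."""
    ref_cols = columns(reference_rows)
    test_cols = set(columns(test_rows))
    return 100.0 * sum(col in test_cols for col in ref_cols) / len(ref_cols)


reference = ["FA-T", "FAST"]
print(column_score(["FA-T", "FAST"], reference))
print(column_score(["F-AT", "FAST"], reference))
```

Restricting `ref_cols` to the columns marked as core in the reference would give the "core alignment blocks only" variant.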
Clustal
• Clustal, Clustal1-4 (TCD)
– Higgins DG, Sharp PM. (1988)
CLUSTAL: a package for performing multiple sequence alignment on a microcomputer.
Gene. 73(1):237-44.
– Higgins DG, Sharp PM. (1989)
Fast and sensitive multiple sequence alignments on a microcomputer.
Comput Appl Biosci. 5(2):151-3.
• ClustalV (Heidelberg)
– Higgins DG, Bleasby AJ, Fuchs R. (1992)
CLUSTAL V: improved software for multiple sequence alignment.
Comput Appl Biosci. 8(2):189-91.
• ClustalW (Hinxton)
– Thompson JD, Higgins DG, Gibson TJ. (1994)
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through
sequence weighting, position-specific gap penalties and weight matrix choice.
Nucleic Acids Res. 22(22):4673-80.
• ClustalX (UCC)
– Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG. (1997)
The CLUSTAL_X windows interface: flexible strategies for multiple sequence
alignment aided by quality analysis tools.
Nucleic Acids Res. 25(24):4876-82.
Clustal re-engineering in C++
• Problems:
  • Code has become very complex.
  • 18 code files (up to 5229 lines).
  • 400 global variables.
  • 500 functions.
• Wish to:
  • Simplify the code.
  • Improve structure of code (modularisation).
  • Make it easier to make functional changes.
  • Make it easier to understand the code.
  • Improve portability
    – Qt cross-platform C++ GUI toolbox.
The Local Minimum Problem: Clustal is “Greedy”
[Figure: curve of Energy against Location, with a local minimum and the global minimum marked]