Multiple Alignments and Multivariate Analysis

Clustal: 1988-2006

Multiple Alignments

Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---

Phylogenetic Analysis
Homology Detection
Homology Modeling
Secondary Str. Prediction
Profile Analysis

VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP
-VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSA
  * *  *  * * ****     * * *** *     * *   *  * ***      *

KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF
QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
 ** *****  *     ** *        ** **  ** *** ** **   *   ** *

GKEFTPPVQAAYQKVVAGVANALAHKYH
PAEFTPAVHASLDKFLASVSTVLTSKYR
  **** * *   *  * *   *  **

Dynamic Programming
• Needleman and Wunsch, 1970
• O(L^2) algorithm
• Maximise score (or minimise distance)
• Gap penalties
• Amino acid weight matrix
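The dynamic programming slide can be made concrete with a minimal sketch of Needleman and Wunsch style global alignment. The scoring here is an assumption made for brevity: a flat match/mismatch score stands in for the amino acid weight matrix, and a single linear gap penalty stands in for whatever gap scheme a real program uses. The function name and parameters are illustrative only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    Assumed scoring: flat match/mismatch instead of a weight matrix,
    and a linear gap penalty. Time and memory are O(len(a) * len(b)).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# N-terminal fragments of human beta and human alpha globin from the slides.
s, top, bottom = needleman_wunsch("VHLTPEEKSAVTALWGKVN", "VLSPADKTNVKAAWGKVGAHAG")
print(s)
print(top)
print(bottom)
```

The two nested loops over the sequence lengths are where the O(L^2) cost on the slide comes from.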
Weighted Sums of Pairs: WSP

$$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$$

Time O(L^N)

Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
7           2404 years
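As a rough illustration of the WSP sum, the sketch below scores an existing alignment by adding a weighted distance over every pair of rows. The pairwise distance used here (the number of columns where two rows differ) and the default weight of 1 are assumptions chosen to keep the example short; the slides do not say which distance or weights were used. The second function only reproduces the arithmetic behind the timing table: the figures are close to what you get by multiplying by about 150 for each extra sequence.

```python
def pair_distance(row_a, row_b):
    """Illustrative D_ij: number of columns where two aligned rows differ."""
    return sum(x != y for x, y in zip(row_a, row_b))


def wsp(alignment, weights=None):
    """Weighted sum of pairs over all rows of an alignment.

    weights[i][j] is W_ij; if omitted, every pair gets weight 1 (assumption).
    """
    total = 0.0
    for i in range(1, len(alignment)):      # i = 2..N in the slide's notation
        for j in range(i):                  # j = 1..i-1
            w = 1.0 if weights is None else weights[i][j]
            total += w * pair_distance(alignment[i], alignment[j])
    return total


def estimated_seconds(n_seqs, factor=150):
    """Timing model implied by the table: 1 second for 2 sequences,
    then roughly `factor` times longer per added sequence (assumed factor)."""
    return factor ** (n_seqs - 2)


toy = ["AC-T", "ACGT", "A--T"]              # hypothetical three-row alignment
print(wsp(toy))
for n in range(2, 8):
    print(n, estimated_seconds(n), "seconds")
```

With this model, 4 sequences take 150^2 = 22500 seconds (6.25 hours) and 5 sequences take 150^3 seconds (about 39 days), in line with the table.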
Progressive Alignment

[Figure: tree with leaves Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin]

• Feng and Doolittle, 1987
• Barton and Sternberg, 1987
• Willie Taylor, 1987, 1988
• Hogeweg and Hesper, 1984
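The tree on the progressive alignment slide sets the order in which sequences are joined. The sketch below builds such an order by UPGMA clustering of pairwise distances. UPGMA is an assumption here, chosen because it is short to write; the slides do not state which tree method is meant. The distance (fraction of differing residues over columns where neither row has a gap) and all function names are likewise illustrative. The input rows are the first alignment block of four globins from the slides.

```python
def p_distance(a, b):
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(names, dist):
    """Average-linkage clustering. dist[i][j] is a symmetric distance matrix.

    Returns (merges, tree): the pairs joined, in order, and a nested label.
    """
    labels = {i: name for i, name in enumerate(names)}   # active clusters
    sizes = {i: 1 for i in labels}
    d = {(i, j): dist[i][j] for i in labels for j in labels if i < j}
    merges = []
    next_id = len(names)
    while len(labels) > 1:
        i, j = min(d, key=d.get)                          # closest pair
        merges.append((labels[i], labels[j]))
        for k in labels:
            if k in (i, j):
                continue
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            # size-weighted average distance to the new cluster
            d[(k, next_id)] = (sizes[i] * dik + sizes[j] * djk) / (sizes[i] + sizes[j])
        d = {key: v for key, v in d.items() if i not in key and j not in key}
        labels[next_id] = "(" + labels[i] + "," + labels[j] + ")"
        sizes[next_id] = sizes[i] + sizes[j]
        del labels[i], labels[j], sizes[i], sizes[j]
        next_id += 1
    return merges, next(iter(labels.values()))


names = ["Human beta", "Horse beta", "Human alpha", "Horse alpha"]
rows = [
    "--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST",
    "--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN",
    "---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-",
    "---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-",
]
dist = [[p_distance(a, b) for b in rows] for a in rows]
merges, tree = upgma(names, dist)
print(merges)
print(tree)
```

A progressive aligner would then align the closest pair first and work its way up the tree, aligning alignments to each other at the internal nodes.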
Clustal
• 35000 citations
• Clustal1-Clustal4, 1988
  – Paul Sharp, Dublin
• Clustal V, 1992
  – EMBL Heidelberg
    • Rainer Fuchs
    • Alan Bleasby
• Clustal W 1994-2006, Clustal X 1997-2006
  – Toby Gibson, EMBL, Heidelberg
  – Julie Thompson, IGBMC, Strasbourg
• Clustal W and Clustal X 2.0, early 2007
  – University College Dublin

Since 1994?

Benchmarks
Protein structure alignments and superpositions
• Barton and Sternberg; Fitch and McClure
• Dali
• BaliBase
• Homstrad
• Oxbench
• Prefab etc. etc.
Protein structure analysis
• APDB: O'Sullivan O, Zehnder M, Higgins D, Bucher P, Grosdidier A, Notredame C. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics. 19 Suppl 1:i215-21.
RNA alignments
• Bralibase (Gardner PP, Wilm A & Washietl S (2005) NAR.)

Which Method is Best?
• Clustal W????
• MSA (Lipman, Altschul, Kececioglu)
• DCA (Stoye), PRRP (Gotoh), SAGA (Notredame)
• T-Coffee (Notredame)
• 3-D Coffee, M-Coffee
• MAFFT (Katoh) and MUSCLE (Edgar)
• Probcons (Do, Brudno, Batzoglou)
For Global Protein alignments!!!

Clustal W and X 2.0?
• Jan 2007
• Re-engineered in C++
• Aim to increase accuracy
  – Iteration (Wallace, I. M., O'Sullivan, O. and Higgins, D. G., 2005. Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics 21:1408.)
• Reduce run times

Multivariate Analysis?
ADE-4 http://pbil.univ-lyon1.fr/ADE-4/
Thioulouse J., Chessel D., Dolédec S., & Olivier J.M. (1997) ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7, 1, 75-83.

Between Group Analysis (BGA)
Dolédec, S. & Chessel, D. (1987) Acta Oecologica, Oecologica Generalis, 8, 3, 403-426.
Supervised Correspondence Analysis or PCA

CO-Inertia Analysis (CIA)
Dolédec, S. & Chessel, D. (1994) Freshwater Biology, 31, 277-294.
Thioulouse, J. & Lobry, J.R.
(1995) CABIOS, 11, 321-329.
2 datasets; Simultaneous CA or PCA
• MADE4
  – Culhane, A., Thioulouse, J., Perriere, G., Higgins, D.G. (2005) MADE4: an R package for multivariate analysis of gene expression data. Bioinformatics. 21(11):2789-2790.

Use CA, PCA for Sequences?
PCOORD on sequence distances:
  Higgins, D.G. (1992) Sequence ordinations: a multivariate analysis approach to analysing large sequence data sets. CABIOS, 8, 15-22.
PCA on dipeptide composition:
  Van Heel, M. (1991) A new family of powerful multivariate statistical sequence analysis techniques. J. Mol Biol. 220(4): 877-887.
PCA on alignment columns:
  Casari G, Sander C, Valencia A. (1995) A method to predict functional residues in proteins. Nat Struct Biol. 2(2):171-8.

Supervised PCA or CA?
[Figure: Malate Dehydrogenases and Lactate Dehydrogenases]

Between Group Analysis
[Figure labels: samples, genes, GSVD, N]

Trypsin-like serine proteases
[Figure: BGA sequence plot, d = 0.05. Group labels: 15 Chymotrypsins, 31 Trypsins, 10 Elastases. Points are labelled EC_1_*, EC_4_* (including X5PTP_EC_4) and EC_36_*.]

[Figure: BGA residue plot, d = 0.1, shown in three builds; the later builds add a scale with ticks at 0, 4e-04 and 8e-04. Group positions: Chymotrypsin, Trypsin, Elastase. Labelled residues: X54V X265S X232M X154T X95N X3N X93F X137C X243Q X82E X87L X7A X180Q X155T X14W X165N X229S X183L X181A X98W X66T X98Y X155S X93I X275G X154V X228K X162S X70R X229D X204S X132Y X273K X16S X18I X10N X92I X196Y X82G X232Q]

BGA With CA or PCA?
• CA:
  – Pretty pictures
  – Sequences/residues plots
  – Finds any clear/simple patterns
    • Binary aa variables
• PCA:
  – Use continuous variables
    • e.g. aa properties: size, charge, hydrophobicity etc.
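To show the idea behind Between Group Analysis on an alignment, here is a minimal sketch: code each (position, residue) combination as a 0/1 variable, take the group means, run a PCA on those means, and project the individual sequences onto the resulting axes. This is a PCA-flavoured simplification written with NumPy, not the ADE-4 or MADE4 implementation; the variable names such as X2C merely mimic the residue labels in the plots, and the four-sequence alignment is a made-up toy.

```python
import numpy as np


def binary_code(alignment):
    """One 0/1 column per (position, residue) seen in the alignment; gaps skipped."""
    keys = sorted({(p, r) for row in alignment
                   for p, r in enumerate(row, start=1) if r != "-"})
    index = {key: k for k, key in enumerate(keys)}
    table = np.zeros((len(alignment), len(keys)))
    for s, row in enumerate(alignment):
        for p, r in enumerate(row, start=1):
            if r != "-":
                table[s, index[(p, r)]] = 1.0
    return table, [f"X{p}{r}" for p, r in keys]


def between_group_pca(table, groups):
    """PCA of group means (weighted by group size), then project all rows.

    Returns (scores, loadings, eigenvalues); at most n_groups - 1 axes.
    """
    labels = sorted(set(groups))
    groups = np.asarray(groups)
    centred = table - table.mean(axis=0)
    means = np.vstack([centred[groups == g].mean(axis=0) for g in labels])
    weights = np.array([(groups == g).mean() for g in labels])
    _, sing, vt = np.linalg.svd(np.sqrt(weights)[:, None] * means,
                                full_matrices=False)
    keep = len(labels) - 1
    loadings = vt[:keep].T
    return centred @ loadings, loadings, sing[:keep] ** 2


# Hypothetical toy alignment: position 2 separates the groups, position 4 does not.
toy = ["ACDK", "ACDR", "AWDK", "AWDR"]
toy_groups = ["a", "a", "b", "b"]
table, names = binary_code(toy)
scores, loadings, eig = between_group_pca(table, toy_groups)
top = sorted(names[k] for k in np.argsort(np.abs(loadings[:, 0]))[-2:])
print(scores[:, 0])
print(top)
```

In the toy case the first axis separates the two groups, and the variables with the largest loadings are the two residues at the position that distinguishes them, which is the kind of reading the residue plots invite.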
BGA with PCA using 5 amino acid properties (A-E)
[Figure: sequences plot, d = 10, with the same groups (15 Chymotrypsins, 31 Trypsins, 10 Elastases) and points labelled EC_1_*, EC_4_*, EC_36_*.]
[Figure: residue weights plot, d = 0.5, with an eigenvalues panel (axis 0 to 40). Labelled variables: X227B X260C X260E X47D X260D X227A X240D X1E X243A X1C X277B X229A X7B X275B X229B X275A X232C X267D X95C X95E X95B X229E X216A X229D X165A X275C X275E X196C X232A X82B X272A X255C X273A X255E X47E X185B X255A X1D X136A X136B]

BGA on Alignments
• Focus on any split in the data
• Binary or Property coding
  – CA or PCA
• Sequence Weighting
• Pseudocounts

Clustal: Toby Gibson, EMBL; Julie Thompson, IGBMC, Strasbourg
BGA, CIA, MADE4: Aedín Culhane, Guy Perriere, Jean Thioulouse, Ian Jeffery, Ailís Fagan
Iteration, Benchmarking, Clustal W 2.0: Gordon Blackshields, Mark Larkin, Paul McGettigan, Iain Wallace

SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT

SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT

Weighted Sums of Pairs

$$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$$

Program   Approach             Reference
MSA       Branch and Bound     Lipman, Altschul and Kececioglu, 1989
FastMSA   Tweaked MSA          Gupta, Kececioglu and Schaeffer, 1995
DCA       Divide and Conquer   Stoye, Moulton and Dress, 1997
SAGA      Genetic Algorithm    Notredame and Higgins, 1996
PRRP      Iteration            Gotoh, 1996

Genetic Algorithm
Selection (WSP)
Mutation
Recombination (cross-overs)

SAGA
• Cedric Notredame
• Sequence Alignment by Genetic Algorithm
• Optimise any objective function
• Notredame, C. and Higgins, D.G. (1996) SAGA: Sequence alignment by genetic algorithm. Nucleic Acids Research, 24:1515-1524.

Structure Test Cases
                               MSA                                   SAGA
Test case    N seqs  Length    Score WSP  Structure  CPU-time       Score WSP  Structure  CPU-time
                                          match %                              match %
Cytc         6       129       1051257    74         7              1051257    74         960
GCR          8       60        371875     75         3              371650     82         75
Ac Protease  5       183       379997     80         13             379997     80         331
S Protease   6       280       574884     91         184            574884     91         3500
Chtp         6       247       111924     -          4525           111579     -          3542
Dfr secstr   4       189       171979     82.03      5              171975     82.50      411
Sbt          4       296       271747     80         7              271747     80         210
Globin       7       167       659036     94         7              659036     94         330
Plasto       5       132       236343     54.03      22             236195     54.05      510

Which method is best?
• Best score?
• Empirical tests?
  – Sets of test cases
    • Fitch and McClure
    • BaliBase
    • Homstrad
    • Oxbench
    • Prefab etc. etc.
  – APDB: O'Sullivan O, Zehnder M, Higgins D, Bucher P, Grosdidier A, Notredame C. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics. 19 Suppl 1:i215-21.

COFFEE
• Consistency based Objective Function For Evaluation of Ehhhh things
• Maximum Weight Trace (John Kececioglu)
• Maximise similarity to a LIBRARY of residue pairs
• Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.

Pairs of Residues
Human beta   VHLTPEEKSAVTALWGKVN--VDEVGGEAL
Horse beta   VQLSGEEKAAVLALWDKVN--EEEVGGEAL
Human alpha  -VLSPADKTNVKAAWGKVGAHAGEYGAEAL
Horse alpha  -VLSAADKTNVKAAWSKVGGHAGEYGAEAL
e.g. Seq N, Residue I; Seq M, Residue J; Weight = w

% Match
Test Case    Avg % ID  N seq  COFFEE  PRRP   MSA    ClustalW  PILEUP  SAM
                              SAGA           SAGA                     HMM
Ac prot      21        14     50.2    48.8   51.2   39.2      40.9    27.9
Binding      31        7      64.5    76.2   64.2   50.0      66.6    36.9
Cytc         42        6      90.7    89.4   67.3   89.1      94.6    67.3
Fniii        17        9      47.0    36.3   45.2   42.0      37.8    16.2
Gcr          36        8      83.1    92.8   80.8   80.8      80.8    85.7
Globin       24        17     85.2    87.0   78.0   86.4      72.6    67.8
Igb          24        37     78.1    74.9   70.1   74.8      52.4    67.2
Lzm          39        6      72.3    71.1   72.3   72.2      72.3    55.3
Phenyldiox   22        8      64.7    49.9   55.6   58.5      37.4    45.7
Sbt          61        7      96.9    96.7   96.0   96.9      97.4    90.6
sprot        27        15     66.6    64.3   68.5   62.5      57.9    61.7
Average                       72.6    71.5   68.1   65.5      64.5    56.4

T-Coffee
• Heuristic approximation to COFFEE
  – Uses progressive alignment (Trees)
• Heterogeneous data
  – Sequences
  – Structures
  – Genomes
  – ESTs
• Notredame, C, Higgins, DG and Heringa, J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J.Mol.Biol., 302: 205-217.

T-Coffee
• Mixed data sources
  – Primary library from
    • Lalign (SIM): 10 best local alignments
    • Clustalw Default: all pairwise alignments
    • SAP (Willie Taylor, Structure Superposition)
    • Multiple alignments
• Check library for CONSISTENCY
  – Upweight pairs of residues that agree with other pairs

Mixing Heterogeneous Information
[Figure: Local Alignment, Global Alignment, Multiple Alignment, Specialist and Structural sources feed into T-Coffee, which produces a Multiple Sequence Alignment. Copyright Cédric Notredame, 2000, all rights reserved]

Mixing Heterogeneous Information
[Figure: Structure Superposition (e.g. SAP, Taylor and Orengo) converted to Weighted Residue Pairs]

Increasing Structure Numbers
[Figure: % accuracy (0 to 100) against number of structures (0 to 5) for tRNA-synt_2b, 19% ID]

Including Structures in an Alignment
[Figure: bar chart of % accuracy (axis 0 to 80): clustalw 35.24; T_Coffee Default 38.39; T_Coffee plus all structures 66.49]

3D-Coffee
O'Sullivan, O., Suhre, K., Abergel, C., Higgins, DG and Notredame, C (2004) J.Mol.Biol.
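The library idea behind COFFEE and T-Coffee (weighted residue pairs, with pairs upweighted when other pairs agree with them) can be sketched as follows. Two assumptions are made for illustration: each pairwise alignment contributes its residue pairs with a weight equal to the percent identity of that alignment, and consistency is checked through triplets, adding the smaller of the two weights whenever a third residue links a pair. The function names and the three short sequences are hypothetical, and this is not the T-Coffee source code.

```python
from collections import defaultdict


def primary_library(pairwise_alignments):
    """Build {(residue_x, residue_y): weight} from gapped pairwise alignments.

    Each item is (name_a, row_a, name_b, row_b); a residue is (name, 0-based index).
    Assumed weight: percent identity of the pairwise alignment it came from.
    """
    lib = defaultdict(float)
    for name_a, row_a, name_b, row_b in pairwise_alignments:
        pairs, ia, ib, same = [], 0, 0, 0
        for x, y in zip(row_a, row_b):
            if x != "-" and y != "-":
                pairs.append(((name_a, ia), (name_b, ib)))
                same += x == y
            ia += x != "-"
            ib += y != "-"
        weight = 100.0 * same / len(pairs)
        for x, y in pairs:
            lib[tuple(sorted((x, y)))] += weight
    return dict(lib)


def extend_library(lib):
    """Upweight pairs (x, y) supported through a third residue z:
    add min(W(x, z), W(z, y)) for every z linked to both."""
    linked_to = defaultdict(dict)
    for (x, y), w in lib.items():
        linked_to[x][y] = w
        linked_to[y][x] = w
    extended = dict(lib)
    for z, linked in linked_to.items():
        residues = sorted(linked)
        for a in range(len(residues)):
            for b in range(a + 1, len(residues)):
                x, y = residues[a], residues[b]
                if x[0] == y[0]:
                    continue                 # two residues of the same sequence
                extended[(x, y)] = extended.get((x, y), 0.0) + min(linked[x], linked[y])
    return extended


# Hypothetical sequences A = GAT, B = GAT, C = GT with three pairwise alignments.
lib = primary_library([
    ("A", "GAT", "B", "GAT"),
    ("B", "GAT", "C", "G-T"),
    ("A", "GAT", "C", "GT-"),   # a poor alignment: pairs A's second residue with C's T
])
ext = extend_library(lib)
print(ext[(("A", 1), ("C", 1))], ext[(("A", 2), ("C", 1))])
```

In this toy case the misleading pair from the poor A/C alignment keeps its low weight, while the pair supported through sequence B gains weight even though no pairwise alignment contained it directly.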
Recent Developments
• 20-30 new programs in past 2 years
• MUSCLE
  – Bob Edgar, ISMB, 2004
  – Iteration/progressive alignment
    • FAST
    • Big Alignments
• PROBCONS
  – Tom Do, Michael Brudno, Serafim Batzoglou
  – ISMB 2004
  – “P-Coffee”
    • VERY accurate

Iteration Revisited
[Figure, in three builds: the seven-sequence globin alignment; then the same alignment with the Horse alpha row taken out and shown separately beneath the other six, ready to be realigned]
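The remove-and-realign step pictured on the Iteration Revisited slides can be sketched as below: take each sequence out in turn, align it back against the rest, and keep the result only if the alignment score improves. The scoring is an assumption (unweighted sum of pairs with flat match, mismatch and gap scores), and the realignment is a plain sequence-to-alignment dynamic programme. Real programs use weight matrices and profile scoring, so treat this as an illustration of the control flow only.

```python
MATCH, MISMATCH, GAP_COST, GAP = 1, -1, -2, "-"


def pair_score(x, y):
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return GAP_COST
    return MATCH if x == y else MISMATCH


def sp_score(rows):
    """Unweighted sum-of-pairs score of an alignment (higher is better)."""
    return sum(pair_score(col[i], col[j])
               for col in zip(*rows)
               for i in range(len(col)) for j in range(i))


def realign(rows, seq):
    """Best placement of ungapped seq against the fixed alignment rows."""
    cols = list(zip(*rows))
    n, m = len(cols), len(seq)
    skip = [sum(pair_score(x, GAP) for x in c) for c in cols]  # gap in seq
    insert = len(rows) * GAP_COST                              # new gap column
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] + skip[i - 1]
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] + insert
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            hit = sum(pair_score(x, seq[j - 1]) for x in cols[i - 1])
            best[i][j] = max(best[i - 1][j - 1] + hit,
                             best[i - 1][j] + skip[i - 1],
                             best[i][j - 1] + insert)
    out_cols, new_row, i, j = [], [], n, m
    while i > 0 or j > 0:
        hit = sum(pair_score(x, seq[j - 1]) for x in cols[i - 1]) if i and j else 0
        if i > 0 and j > 0 and best[i][j] == best[i - 1][j - 1] + hit:
            out_cols.append(cols[i - 1]); new_row.append(seq[j - 1]); i -= 1; j -= 1
        elif i > 0 and best[i][j] == best[i - 1][j] + skip[i - 1]:
            out_cols.append(cols[i - 1]); new_row.append(GAP); i -= 1
        else:
            out_cols.append((GAP,) * len(rows)); new_row.append(seq[j - 1]); j -= 1
    out_cols.reverse(); new_row.reverse()
    return ["".join(r) for r in zip(*out_cols)] + ["".join(new_row)]


def remove_each(rows, rounds=1):
    """Remove each sequence in turn, realign it, keep strict improvements."""
    rows = list(rows)
    for _ in range(rounds):
        for k in range(len(rows)):
            seq = rows[k].replace(GAP, "")
            rest = rows[:k] + rows[k + 1:]
            keep = [c for c in zip(*rest) if any(x != GAP for x in c)]
            rest = ["".join(r) for r in zip(*keep)]
            cand = realign(rest, seq)
            cand = cand[:k] + [cand[-1]] + cand[k:-1]   # restore row order
            if sp_score(cand) > sp_score(rows):
                rows = cand
    return rows


start = ["ACGT", "ACGT", "A-CG"]      # hypothetical, deliberately misaligned
print(remove_each(start))
```

Because a candidate is accepted only when it scores higher, the score can never get worse from one step to the next under this scheme.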
Iteration Revisited
[Figure: iteration schemes: Remove EACH Sequence, Remove BEST Sequence, Random, Tree based; each followed by "Iterate". Panel labels: RF, RB, Random, Tree]

Iteration on HomStrad 184
Method            Default  Remove Each  Remove Best  Random
ProbCons          64.88    64.69        65.00        64.27
Muscle v3.3       63.12    63.77***     63.70**      63.39
T-Coffee          62.87    63.24        63.38        62.70
Muscle v3.2       61.76    63.58***     63.57***     62.75**
ClustalW          59.87    61.54***     61.44***     60.99**
FFT-NSI (Mafft)   59.65    62.10***     62.05***     61.55***
Average           59.32    60.70***     60.88***     60.57***

Tree Based
Quick Tree        63.45**  63.69***  62.47**
Slow Tree         63.10*   63.27**   61.74

Wallace, O’Sullivan and Higgins, 2005, Bioinformatics, 21:1408

Combining Multiple Alignment Methods
[Figure: Clustal W, T-Coffee, Probcons, Muscle and Specialist methods feed into T-Coffee, which produces a Multiple Sequence Alignment. Copyright Cédric Notredame, 2000, all rights reserved]

Combining Multiple Alignment methods with T-Coffee
[Figure: bar chart, y axis 65.80 to 67.80. The rotated x-axis labels are only partly legible; they name the methods added, e.g. probcons, +muscle, +tcoffee, +finsi, +pcma, +ginsi, +clustalw, +fftns1, +fftns2, +dialign, +poa-global, +poa-local]
The Wisdom of Crowds
James Surowiecki
Crowds are surprisingly good at accurate decisions
Better than “experts”
Only if they do not form a “mob”

M-Coffee: combine 8 methods
[Bar chart, y axis 50.00 to 70.00]
Method added   Combined  Default
Poa-global     51.96     51.90
+Dialign-T     58.32     57.92
+ClustalW      62.75     61.15
+PCMA          65.15     63.73
+FINSI         65.94     64.22
+T-Coffee      66.73     65.37
+Muscle v6     67.38     66.04
+ProbCons      67.75     66.41

BaliBASE
Thompson, JD, Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics
• IGBMC Strasbourg
• 141 manual alignments using structures
• 5 sections
• core alignment regions marked
1. Equidistant (82)
2. Orphan (23)
3. Two groups (12)
4. Long internal gaps (13)
5. Long terminal gaps (11)

Compare Methods
• Sam: HMM. Hughey and Krogh, 1996
• Dialign: Local multiple alignments. Morgenstern, 1999
• ClustalW: Progressive alignment. Thompson, Higgins and Gibson, 1994
• Prrp: Iterative WSP. Gotoh, 1996
• T-Coffee: Pairwise library. Notredame, Higgins and Heringa, 2000

% alignment columns correct
Core alignment blocks only
                  Balibase section
Method     1 (82)  2 (23)  3 (12)  4 (13)  5 (11)  Total
SAM        46.8    20.0    13.9    42.7    43.9    39.8
Dialign    71.0    25.2    35.1    80.4    74.7    61.5
ClustalW   78.5    32.2    42.5    74.3    65.7    66.4
PRRP       78.6    32.5    50.2    82.7    51.1    66.4
T-Coffee   80.7    37.3    52.9    88.7    83.2    72.1

Clustal
• Clustal, Clustal1-4 (TCD)
  – Higgins DG, Sharp PM. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene. 73(1):237-44.
  – Higgins DG, Sharp PM. (1989) Fast and sensitive multiple sequence alignments on a microcomputer. Comput Appl Biosci. 5(2):151-3.
• ClustalV (Heidelberg)
  – Higgins DG, Bleasby AJ, Fuchs R. (1992) CLUSTAL V: improved software for multiple sequence alignment. Comput Appl Biosci. 8(2):189-91.
• ClustalW (Hinxton)
  – Thompson JD, Higgins DG, Gibson TJ. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22):4673-80.
• ClustalX (UCC)
  – Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG. (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25(24):4876-82.

Clustal re-engineering in C++
• Problems:
  • Code has become very complex.
  • 18 code files (up to 5229 lines).
  • 400 global variables.
  • 500 functions.
• Wish to:
  • Simplify the code.
  • Improve structure of code (modularisation).
  • Make it easier to make functional changes.
  • Make the code easier to understand.
  • Improve portability
    – Qt: cross-platform C++ GUI toolbox.

The Local Minimum Problem: Clustal is “Greedy”
[Figure: energy plotted against location, with a local minimum and the global minimum marked]