Large scale proteome comparisons Genome trees Fredj Tekaia Institut Pasteur [email protected] Complete genomes Tree of life • 1387 projects 261 published (01-03-05) • 654 prokaryotes • 472 eukaryotes http://www.genomesonline.org/
[Figure: tree of life; the 261 published genomes are split into groups of 207, 21 and 33.]

Cumulated number of available completely sequenced genomes
[Figure: bar chart of the cumulated number of completely sequenced genomes per year, from 1995 to 03-2005 (261 genomes).]
Completely sequenced genomes that span the three domains of life are accumulating at a rapid rate.

List and references
GOLD Genome sequencing projects
Several web-based resources document the progress of completely sequenced genomes and their reference publications, including:
• GOLD: Genomes Online Database, http://wit.integratedgenomics.com/GOLD/
• GNN: Genome News Network, http://www.genomenewsnetwork.org/index.php

Resources for genomes
There are two main resources for genomes:
• EBI: European Bioinformatics Institute, http://www.ebi.ac.uk/genomes/
• NCBI: National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov
Many other resources are maintained by sequencing institutions:
• Sanger: The Wellcome Trust Sanger Institute, http://www.sanger.ac.uk/
• TIGR: The Institute for Genomic Research, http://www.tigr.org
• Genolevures: http://cbi.labri.fr/Genolevures/index.php

Definitions
• Genome: the genome of a cell is the collection of the DNA it contains. The genome size is the total number of its DNA bases.
• Gene: a particular DNA sequence situated at a specific position on a chromosome and coding for a specific function.
• Protein: a sequence of amino acids ordered according to the DNA sequence of the gene that codes for it.
• Proteome: the set of proteins in an organism.
• Genomics: the exhaustive study of genomes: genetic material, genes, their functions, their organization...

Chronology of completely sequenced genomes
• 1977: first viral genome (5386 base pairs; encoding 11 genes). Sanger et al. sequence bacteriophage φX174.
• 1981: human mitochondrial genome: 16,500 base pairs (encodes 13 proteins, 2 rRNAs, 22 tRNAs).
• 1986: chloroplast genome: 156,000 base pairs (most are 120 kb to 200 kb).
• 1995: first genome of a free-living organism, the bacterium Haemophilus influenzae, by TIGR: 1830 kb, 1713 genes.
• 1996: first archaeal genome, Methanococcus jannaschii DSM 2661, by TIGR: 1664 kb, 1773 genes.
• 1997: first eukaryotic genome, Saccharomyces cerevisiae S288C; international collaboration; 16 chromosomes; 12,057 kb, ~6000 genes.
• 1998: first multicellular organism, the nematode Caenorhabditis elegans: 97 Mb, ~19,000 genes.
• 1999: first human chromosome, chromosome 22 (49 Mb, 673 genes).
• 2000: fruit fly Drosophila melanogaster (137 Mb; ~13,000 genes).
• 2000: first plant genome, Arabidopsis thaliana (115,428 kb; 22670 genes).
• 2001: draft sequence of the human genome (x Mb; ~28000 genes).
• 2002: Plasmodium falciparum (22.9 Mb; 5334 genes).
• 2002: mouse genome (x Mb; ~28000 genes).
• 2004: draft genome of the fish Tetraodon nigroviridis (x Mb; ~28000 genes).

How big are genome sizes?
• Viral genomes: 1 kb to 350 kb (Mimivirus: 1.2 Mb);
• Bacterial genomes: 0.5 Mb to 13 Mb;
• Eukaryotic genomes: 8 Mb to 670 Gb.
DOGS: http://www.cbs.dtu.dk/databases/DOGS/abbr_table.bysize.txt

Comparative genomics
Analyses of the genetic material of different species help in understanding the similarities and differences between genomes, their evolution and the evolution of their genes.
• Intra-genomic comparisons help in understanding the degree of duplication (genome regions; genes) and the organization of genes.
• Inter-genomic comparisons help in understanding the degree of similarity between genomes, the degree of conservation between genes, and gene and genome evolution.

Evolution
[Diagram of evolutionary processes. Labels: Ancestor; Expansion*; Phylogeny*; genesis; duplication; HGT; Exchange*; species genome; HGT; loss; Deletion*; selection.]
Gene duplications are traditionally considered to be a major evolutionary source of new protein functions.
Understanding how duplications happened, and how important this evolutionary process is, is a key goal of genome analysis.

Some examples
[Figure: S. cerevisiae genome; colours reveal duplications. Kellis et al. Nature, 2004.]
[Figure: duplication, speciation and deletion; current content of the two copies; reconstruction of the ancestral organization. Kellis et al. Nature, 2004.]
[Figure: Splitting pairs: the diverging fates of duplicated genes. Nature Reviews Genetics 3; 827-837 (2002).]
[Figure: original version and current version. Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.]

Genome duplication.
a, Distribution of Ks values of duplicated genes in the Tetraodon (left) and Takifugu (right) genomes. Duplicated genes broadly belong to two categories, depending on whether their Ks value is below or above 0.35 substitutions per site since the divergence between the two puffer fish (arrows).
b, Global distribution of ancient duplicated genes (Ks > 0.35) in the Tetraodon genome. The 21 Tetraodon chromosomes are represented in a circle in numerical order and each line joins duplicated genes at their respective positions on a given pair of chromosomes.
Jaillon et al. Nature 431, 946-957. 2004.
Inter-genomic comparisons
• Compositional comparisons between species (nucleotide and amino-acid compositions);
• Gene and protein conservation between species (rate of conservation);
• Orthologs; families of orthologs;
• Specific and non-specific genes;
• Genes exclusively conserved in one species or in a subset of species (or in domains);
• Gene dictionary;
• Gene conservation profiles;
• Genome tree construction;
• Genome multiple alignments.

Methodology
[Figure: data matrix T (entries kij > 0, rows i = 1..n, columns j = 1..p, plus supplementary "sup" elements) analysed by Correspondence Analysis (factorial axes F1 ... Fp) and Classification.]
• orthogonal system;
• use of Euclidean distance.

Amino acid composition (%)
Eukaryotes:
org   sc    ce    dm    ca    sp    ath   hs
Ala   5.5   6.2   7.5   5.0   6.3   6.2   7.0
Arg   4.4   5.2   5.6   3.7   4.9   5.5   5.6
Asn   6.1   4.9   4.7   6.7   5.2   4.4   3.7
Asp   5.8   5.2   5.2   5.9   5.4   5.4   4.9
Cys   1.3   2.1   1.9   1.1   1.5   1.9   2.2
Gln   3.9   4.1   5.2   4.5   3.8   3.5   4.7
Glu   6.5   6.4   6.4   6.4   6.6   6.7   7.0
Gly   5.0   5.3   6.2   5.1   5.0   6.3   6.6
His   2.1   2.3   2.7   2.1   2.3   2.3   2.5
Ile   6.6   6.2   4.9   7.1   6.1   5.4   4.4
Leu   9.6   8.7   9.2   9.2   9.9   9.5   9.8
Lys   7.3   6.5   5.6   7.3   6.5   6.4   5.7
Met   2.1   2.6   2.4   1.8   2.1   2.4   2.2
Phe   4.6   5.0   3.6   4.4   4.6   4.3   3.7
Pro   4.3   4.9   5.5   4.5   4.8   4.7   6.1
Ser   9.0   8.0   8.3   9.0   9.4   9.0   8.0
Thr   5.8   5.8   5.6   6.2   5.4   5.1   5.3
Trp   1.1   1.1   1.0   1.0   1.1   1.3   1.2
Tyr   3.3   3.2   3.0   3.5   3.4   2.9   2.8

Archaea:
org   mj    mth   af    ph    pa    ape   ssp2  pfu   sto   pyae  ta    tv    h
Ala   5.5   7.3   7.8   6.4   6.7   9.7   5.6   6.6   5.6   9.9   7.0   6.4   13.1
Arg   3.9   6.8   5.8   5.5   5.7   7.8   4.7   5.3   4.2   6.5   5.5   4.7   6.5
Asn   5.3   3.3   3.2   3.5   3.3   2.0   5.0   3.5   4.9   2.6   4.3   4.8   2.1
Asp   5.5   5.9   4.9   4.3   4.6   4.2   4.7   4.4   4.6   4.3   5.7   5.5   9.0
Cys   1.3   1.2   1.2   0.6   0.6   0.8   0.6   0.6   0.7   0.9   0.6   0.6   0.7
Gln   1.5   1.9   1.8   1.6   1.7   1.8   2.1   1.8   2.1   2.1   2.2   2.1   2.6
Glu   8.7   8.1   8.9   8.3   8.8   7.3   6.8   8.9   7.0   7.0   6.0   6.4   6.7
Gly   6.3   8.0   7.2   7.0   7.3   8.8   6.4   7.1   6.3   7.7   7.3   7.0   8.5
His   1.4   1.9   1.5   1.5   1.5   1.6   1.3   1.5   1.3   1.5   1.6   1.5   2.2
Ile   10.4  7.7   7.2   8.8   8.5   5.5   9.4   8.7   9.9   6.3   9.0   9.2   3.6
Leu   9.5   9.5   9.5   10.3  10.2  11.0  10.3  10.1  10.3  10.5  8.4   8.8   8.3
Lys   10.4  4.6   6.9   7.7   7.8   3.9   7.7   8.1   8.0   5.7   5.6   6.9   1.6
Met   2.2   2.9   2.6   2.4   2.4   2.2   2.2   2.2   2.1   1.9   3.2   2.7   1.7
Phe   4.2   3.6   4.6   4.6   4.4   2.9   4.4   4.4   4.5   3.6   4.7   4.7   3.1
Pro   3.4   4.3   3.9   4.5   4.3   5.5   3.8   4.3   3.9   5.0   4.0   3.8   4.7
Ser   4.5   6.1   5.5   5.9   5.0   6.7   6.7   4.9   6.7   4.9   7.6   7.5   5.2
Thr   4.0   5.0   4.2   4.5   4.2   4.3   4.7   4.4   4.8   4.4   4.8   4.8   6.8
Trp   0.7   0.8   1.0   1.2   1.2   1.3   1.1   1.2   1.0   1.5   0.9   0.8   1.1
Tyr   4.4   3.2   3.6   3.8   3.8   3.5   4.8   4.0   4.9   4.3   4.6   4.8   2.5

[Figure: amino-acid composition against growth temperature and GC%; labels Glu, Lys, Arg, Gln; r = 0.83, p < 1.e-4.]

org    Glu   Gln   Lys+Arg
mj     8.7   1.5   14.3
mth    8.1   1.9   11.4
af     8.9   1.8   12.7
ph     8.3   1.6   13.2
pa     8.8   1.7   13.5
apem   7.3   1.8   11.7
ssp2   6.8   2.1   12.4
pfu    8.9   1.8   13.4
sto    7.0   2.1   12.2
pyae   7.0   2.1   12.2
ta     6.0   2.2   11.1
tv     6.4   2.1   11.6
ae     9.6   2.0   14.3
tm     8.9   2.0   13.1
Tekaia, F., Yeramian, E. and Dujon B. (2002) Gene. 297 pp. 51-60.

[Figure (2005): growth temperature and GC%; PE, PPE families.]

Protein size statistics
Dom   n org   mean    std     n prot   min   max
E     38      443.1   403.6   364538   10    9638
A     19      279.9   199.6   42499    10    4436
B     53      311.2   233.2   155538   11    7463

Proteome comparisons: Methodology
Species-specific comparisons (blastp, pam250, SEG filter) of a new proteome (NP) against each available proteome P1 ... Pn, in both directions:
• P1 vs NP: bestp1np, allp1np, segmatchp1np
• NP vs P1: bestnpp1, allnpp1, segmatchnpp1
• Pn vs NP: bestpnnp, allpnnp, segmatchpnnp
• NP vs Pn: bestnppn, allnppn, segmatchnppn
SPECSO
Record layout of the bestnppi and allnppi files: np1, size, pij, e-value, HS/IS/NS.
100 species: E: 28, A: 19, B: 53
• Paralogs
• Orthologs
The expected number of HSPs with score at least S is given by $E = K\,m\,n\,e^{-\lambda S}$, where m and n are the sequence and database lengths, and K and λ are the statistical parameters of the scoring system.

Dom  Code    Size   Organism                                        Taxonomic class
E    SC      5829   S. cerevisiae                                   Ascomycota
E    SP      4962   S. pombe                                        Ascomycota
E    NCU     10082  Neurospora crassa                               Ascomycota
E    CALBI   6165   C. albicans                                     Ascomycota
E    MGR     11109  Magnaporthe grisea                              Ascomycota
E    FG      11640  Fusarium graminearum                            Ascomycota
E    AN      9541   Aspergillus nidulans                            Ascomycota
E    ECUN    1996   E. cuniculi                                     Microsporidia
E    CE      20844  C. elegans                                      Eumetazoa
E    CBR     14713  Caenorhabditis briggsae                         Eumetazoa
E    DM      17878  D. melanogaster                                 Arthropoda
E    AG      16112  Anopheles gambiae                               Arthropoda
E    ATH     22671  A. thaliana                                     Streptophyta
E    HS      27625  Homo sapiens                                    Eumetazoa
E    MUS     28097  Mus musculus                                    Chordata
E    FR      33609  Fugu rubripes                                   Eumetazoa
E    PF      5334   P. falciparum                                   Apicomplexa
E    CI      15851  Ciona intestinalis                              Eumetazoa
E    RN      21205  Rattus norvegicus                               Eumetazoa
A    MJ      1773   M. jannaschii                                   Methanococci
A    MTH     1871   M. thermoautotrophicum                          Euryarchaeota
A    AF      2409   A. fulgidus                                     Euryarchaeota
A    PH      2061   P. horikoshii OT3                               Euryarchaeota
A    PA      1765   P. abyssi                                       Euryarchaeota
A    APEM    1865   A. pernix K1                                    Crenarchaeota
A    TA      1478   Thermoplasma acidophilum                        Euryarchaeota
A    TV      1526   Thermoplasma volcanium                          Euryarchaeota
A    H       2058   Halobacterium sp. NRC-1                         Euryarchaeota
A    SSP2    2977   Sulfolobus solfataricus P2                      Crenarchaeota
A    PFU     2208   P. furiosus                                     Euryarchaeota
A    STO     2826   Sulfolobus tokodaii                             Crenarchaeota
A    PYAE    2605   Pyrobaculum aerophilum                          Crenarchaeota
A    MA      4528   Methanosarcina acetivorans (C2A)                Euryarchaeota
A    MK      1687   Methanopyrus kandleri AV19                      Euryarchaeota
A    MMA     3371   Methanosarcina mazei strain Goe1                Euryarchaeota
A    MBUR    2676   Methanococcoides burtonii                       Euryarchaeota
A    MFR     2911   Methanogenium frigidum                          Euryarchaeota
B    HI      1713   H. influenzae                                   Gammaproteobacteria
B    MG      479    M. genitalium                                   Mycoplasmatales
B    MP      677    M. pneumoniae                                   Mycoplasmatales
B    Ssp     3168   Synechocystis sp.                               Cyanobacteria
B    EC      4290   E. coli                                         Gammaproteobacteria
B    HP      1577   H. pylori                                       Epsilonproteobacteria
B    BS      4100   B. subtilis                                     Bacillus
B    BH      4066   Bacillus halodurans                             Bacillus
B    BB      1639   B. burgdorferi                                  Spirochaetes
B    AE      1522   A. aeolicus                                     Aquificales
B    MT      3996   M. tuberculosis H37Rv                           Actinobacteria
B    MTC     4203   M. tuberculosis CDC 1551                        Actinobacteria
B    ML      1604   Mycobacterium leprae                            Actinobacteria
B    TP      1031   T. pallidum                                     Spirochaetes
B    CT      877    C. trachomatis                                  Chlamydiae
B    RP      837    R. prowazekii                                   Alphaproteobacteria
B    CJ      1634   C. jejuni                                       Epsilonproteobacteria
B    CP      1052   C. pneumoniae                                   Chlamydiae
B    TM      1849   T. maritima                                     Thermotogae
B    DR      3117   D. radiodurans                                  Deinococcus-Thermus
B    NM      2081   N. meningitidis                                 Betaproteobacteria
B    XF      2830   Xylella fastidiosa                              Gammaproteobacteria
B    VC      3837   Vibrio cholerae                                 Gammaproteobacteria
B    PAE     5570   Pseudomonas aeruginosa                          Gammaproteobacteria
B    B       575    Buchnera sp.                                    Gammaproteobacteria
B    LMO     2846   Listeria monocytogenes                          Bacilli
B    LIN     2968   Listeria innocua                                Bacilli
B    STY     4395   Salmonella Typhi                                Gammaproteobacteria
B    YP      3895   Yersinia pestis                                 Gammaproteobacteria
B    SAMU50  2714   Staphylococcus aureus Mu50                      Bacilli
B    SAN315  2594   Staphylococcus aureus N315                      Bacilli
B    SPY     1696   Streptococcus pyogenes M1                       Bacilli
B    MM      7275   Mesorhizobium loti                              Alphaproteobacteria
B    SM      6205   Sinorhizobium meliloti                          Alphaproteobacteria
B    AGRT    5299   Agrobacterium tumefaciens                       Alphaproteobacteria
B    MB      3953   Mycobacterium bovis                             Actinobacteria
B    SCO     7810   Streptomyces coelicolor                         Actinobacteria
B    UU      614    Ureaplasma urealyticum                          Mycoplasmatales
B    SHFL    4068   Shigella flexneri                               Gammaproteobacteria
B    LL      2321   Lactococcus lactis subsp. lactis                Bacilli
B    RCO     1374   Rickettsia conorii Malish 7                     Alphaproteobacteria
B    CCR     3737   Caulobacter crescentus CB15                     Alphaproteobacteria
B    NOS     5366   Nostoc sp.                                      Cyanobacteria
B    TSE     2475   Thermosynechococcus elongatus BP-1              Cyanobacteria
B    TTE     2588   Thermoanaerobacter tengcongensis strain MB4T    Clostridia
B    BFL     583    Candidatus Blochmannia floridanus               Gammaproteobacteria
B    PRO     1882   Prochlorococcus marinus subsp. marinus str.     Cyanobacteria
B    PMT     2265   Prochlorococcus marinus str. MIT 9313           Cyanobacteria
B    PMM     1712   Prochlorococcus marinus subsp. pastoris str.    Cyanobacteria
B    WS      2044   Wolinella succinogenes                          Epsilonproteobacteria
B    PL      4683   Photorhabdus luminescens subsp. laumondii       Gammaproteobacteria

Homolog - Paralog - Ortholog
[Diagram labels: O; A; B; Species-1: A1, B1; Species-2: A2, B2.]
• Homologs: A1, B1, A2, B2
• Paralogs: A1 vs B1 and A2 vs B2
• Orthologs: A1 vs A2 and B1 vs B2

Sequence analysis
[Diagram labels: S1, S2, a, b.]

Example
Comparing the S. cerevisiae (SC) genome with the C. elegans (CE) genome

SC vs SC
BLASTP 2.2.1 [Apr-13-2001]
............................
Query= YAL005c SSA1 heat shock protein of HSP70 family, cytosolic
         (642 letters)
Database: S. cerevisiae proteome version 22/05/2002
           5829 sequences; 2,798,770 total letters
................................................
                                                        Score     E
Sequences producing significant alignments:             (bits)  Value
YAL005c SSA1 heat shock protein of HSP70 family, cyt...   674   0.0
YLL024c SSA2 heat shock protein of HSP70 family, cyt...   663   0.0
YER103w SSA4 heat shock protein of HSP70 family, cyt...   589   e-169
YBL075c SSA3 heat shock protein of HSP70 family, cyt...   588   e-169
YJL034w KAR2 nuclear fusion protein                       480   e-136
YDL229w SSB1 heat shock protein of HSP70 family           428   e-120
YNL209w SSB2 heat shock protein of HSP70 family, cyt...   427   e-120
YJR045c SSC1 mitochondrial heat shock protein 70-rel...   336   5e-93
YEL030w heat shock protein of HSP70 family                324   2e-89
YLR369w SSQ1 mitochondrial heat shock protein 70          296   4e-81
YBR169c SSE2 heat shock protein of the HSP70 family       173   7e-44
YPL106c SSE1 heat shock protein of HSP70 family           172   1e-43
YHR064c regulator protein involved in pleiotro...         143   6e-35
YKL073w LHS1 chaperone of the ER lumen                    100   4e-22
YLR135w subunit of SLX1P/Ybr228p-SLX4P complex...          33   0.13
...................
bestscsc ( SC / SC )
YAL002w  1176  -        NS
YAL003w  206   -        NS
YAL004w  215   -        NS
YAL005c  642   YLL024c  HS  0.0
YAL007c  215   YOR016c  HS  1e-44

allscsc ( SC / SC )
YAL002w  1176  -        NS
YAL003w  206   -        NS
YAL004w  215   -        NS
YAL005c  642   YLL024c  HS  0.0
YAL005c  642   YER103w  HS  0.0
YAL005c  642   YBL075c  HS  0.0
YAL005c  642   YJL034w  HS  e-147
YAL005c  642   YDL229w  HS  e-130
YAL005c  642   YNL209w  HS  e-130
YAL005c  642   YJR045c  HS  e-100
YAL005c  642   YEL030w  HS  2e-96
YAL005c  642   YLR369w  HS  1e-87
YAL005c  642   YBR169c  HS  2e-47
YAL005c  642   YPL106c  HS  4e-47
YAL005c  642   YHR064c  HS  7e-38
YAL005c  642   YKL073w  HS  5e-24
YAL007c  215   YOR016c  HS  1e-44
YAL007c  215   YGL200c  IS  5e-05
YAL007c  215   YHR110w  IS  0.017
YAL007c  215   YDL018c  IS  0.021

- Paralogs
- multiple matches
- Partitions/clustering

Multiple matches of sc in sc
ORF      matches in sc
YAL005c  13
YAL007c  1
YDR214w  1
YDR216w  2
YDR399w  1
YDR406w  9
YDR409w  1
YCR040w  1
YKL218c  1
YKL219w  14
YKL220c  6
YKL221w  2
YKL222c  3
YKL223w  5
YKL224c  22
YKR001c  2
YKR003w  5
YBR104w  6
YBR105c  1
YKR013w  2
YKR014c  13
....................................
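The two derived quantities listed above, the number of matches per ORF and the grouping of ORFs into partitions, can be sketched in a few lines. This is a minimal illustration, not the author's SPECSO code: the `hits` rows are a hand-picked subset in the style of the allscsc table, the reciprocal rows are assumed for the example, and a partition is taken here to be a connected component of the match graph. The function names `count_matches` and `partitions` are hypothetical.

```python
from collections import defaultdict

# (query, hit, e_value) rows in the style of the allscsc table.
# Hand-picked subset; the reciprocal rows are assumed for illustration.
hits = [
    ("YAL005c", "YLL024c", 0.0),
    ("YAL005c", "YER103w", 0.0),
    ("YLL024c", "YAL005c", 0.0),
    ("YAL007c", "YOR016c", 1e-44),
    ("YOR016c", "YAL007c", 1e-44),
]


def count_matches(hits):
    """Number of significant matches recorded for each query ORF."""
    counts = defaultdict(int)
    for query, _hit, _e_value in hits:
        counts[query] += 1
    return dict(counts)


def partitions(hits):
    """Group ORFs into connected components of the match graph (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, hit, _e_value in hits:
        parent[find(query)] = find(hit)  # merge the two components

    groups = defaultdict(set)
    for orf in list(parent):
        groups[find(orf)].add(orf)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


if __name__ == "__main__":
    print(count_matches(hits))
    print(partitions(hits))
```

Union-find keeps the grouping close to linear in the number of hit rows, which matters when the same idea is applied to all-against-all comparisons of whole proteomes.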
Max: YDR477w 77

SC/CE

bestscce ( SC / CE )
YAL002w  1176  C42C1.4   HS  2e-15
YAL003w  206   F54H12.6  HS  4e-22
YAL004w  215   -         NS
YAL005c  642   F26D10.3  HS  e-172
YAL007c  215   F57B10.5  HS  9e-08
YAL009w  259   F16D3.7   IS  0.013
YAL019w  1131  M03C11.8  HS  7e-92
YAL020c  333   F07C3.4   IS  7e-04
YAL021c  837   ZC518.3   HS  5e-47

allscce ( SC / CE )
YAL002w  1176  C42C1.4   HS  2e-15
YAL003w  206   F54H12.6  HS  4e-22
YAL003w  206   Y41E3.10  HS  2e-17
YAL004w  215   -         NS
YAL005c  642   F26D10.3  HS  e-172
YAL005c  642   F44E5.4   HS  e-153
YAL005c  642   F44E5.5   HS  e-153
YAL005c  642   C12C8.1   HS  e-152
YAL005c  642   C15H9.6   HS  e-148
YAL005c  642   F43E2.8   HS  e-144
YAL005c  642   C37H5.8   HS  e-104
YAL005c  642   F11F1.1   HS  1e-77
YAL005c  642   F54C9.2   HS  4e-51
YAL005c  642   K09C4.3   HS  4e-47
YAL005c  642   T28F3.2   HS  2e-45
YAL005c  642   C30C11.4  HS  7e-43
YAL005c  642   T24H7.2   HS  2e-34
YAL005c  642   T14G8.3   HS  8e-33

CE/SC
Orthologs

bestcesc ( CE / SC )
C42C1.4   1259  YAL002w  HS  8e-16
F54H12.6  213   YAL003w  HS  4e-20
F26D10.3  640   YER103w  HS  e-174
F26D10.3  640   YER103w  HS  e-174
F57B10.5  203   YAL007c  HS  7e-13
F16D3.7   516   YHL003c  IS  9e-04
M03C11.8  1038  YAL019w  HS  2e-87
AC3.1     356   -        NS
AC3.2     949   YLR189c  IS  0.038
AC3.3     425   -        NS
AC3.4     600   YNL326c  HS  1e-12

allcesc ( CE / SC )
C42C1.4   1259  YAL002w  HS  8e-16
F54H12.6  213   YAL003w  HS  4e-20
F26D10.3  640   YER103w  HS  e-174
F26D10.3  640   YBL075c  HS  e-174
F26D10.3  640   YLL024c  HS  e-172
F26D10.3  640   YAL005c  HS  e-171
F26D10.3  640   YJL034w  HS  e-141
F26D10.3  640   YDL229w  HS  e-129
F26D10.3  640   YNL209w  HS  e-129
F26D10.3  640   YJR045c  HS  e-100
F26D10.3  640   YEL030w  HS  2e-97
F26D10.3  640   YLR369w  HS  1e-83
F26D10.3  640   YPL106c  HS  2e-45
F26D10.3  640   YBR169c  HS  5e-45
F26D10.3  640   YHR064c  HS  8e-36
F26D10.3  640   YKL073w  HS  3e-22

segmatchSCCE
Test     siz   Hit       siz   e-val   %id  %sim  gap  Ssiz  dT   eT    dH   eH
YAL002w  1176  C42C1.4   1259  5e-14   16   44    7    674   438  1111  547  1196
YAL005c  642   F26D10.3  640   1e-159  73   84    0    605   3    607   5    613
YAL005c  642   F44E5.5   645   1e-142  63   79    0    605   3    607   5    611
YAL005c  642   F44E5.4   645   1e-142  63   79    0    605   3    607   5    611
YAL005c  642   C12C8.1   643   1e-141  62   79    0    605   3    607   5    611
YAL005c  642   C15H9.6   661   1e-137  60   78    1    603   5    607   36   641
YAL005c  642   F43E2.8   657   1e-134  58   76    1    606   1    606   29   637
YAL005c  642   C37H5.8   657   1e-96   46   67    2    606   2    607   31   632
YAL005c  642   F11F1.1b  607   1e-73   36   60    0    599   4    602   2    600
YAL005c  642   F11F1.1a  614   8e-72   36   60    2    599   4    602   2    607
YAL005c  642   F54C9.2   469   3e-47   38   66    2    379   2    380   52   433
YAL005c  642   K09C4.3   310   2e-43   71   88    0    186   4    189   6    192
YAL005c  642   K09C4.3   310   1e-04   54   70    …    61    327  387   189  249
YAL005c  642   C30C11.4  776   1e-39   26   50    8    600   5    604   4    647
YAL005c  642   T24H7.2   925   1e-31   24   50    3    506   4    509   26   548
YAL005c  642   T14G8.3   926   3e-30   24   51    6    510   4    513   28   560

[Figure: graph of partitions and clusters; labels P7.1, P7.1.C4.1, P7.1.C3.1, P4.1, P4.2.C3.1.]
A set of genes defines a "partition" if and only if:
a) each member of the set has at least one significant match with another member of the set;
b) no member of the set has significant matches with members not included in the set;
c) the set is minimal.

MCL: Markov Cluster algorithm
• Partitions/MCL clustering
• Stijn van Dongen: A cluster algorithm for graphs. http://micans.org/mcl/
• Each gene is identified by its partition and its MCL cluster

Markov Cluster (MCL) algorithm
http://micans.org/mcl/
• Traditionally, most methods deal with similarity relationships in a pairwise manner, while graph theory allows classification of proteins into families based on a global treatment of all relationships in similarity space simultaneously.
• Similarities between proteins are arranged in a matrix that represents a connection graph.
• Nodes of the graph represent proteins, and edges represent the sequence similarity that connects such proteins.
• A weight is assigned to each edge by taking -log10(E-value) obtained from a BLAST comparison.
• These weights are transformed into probabilities associated with a transition from one protein to another within this graph.
• This matrix is passed through iterative rounds of matrix multiplication and matrix inflation until there is little or no net change in the matrix.
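The iteration described above (edge weights from -log10(E-value), column normalization into transition probabilities, then alternating matrix multiplication and inflation) can be sketched in pure Python. This is a toy illustration under stated assumptions, not the mcl program from micans.org: the cap applied to E-values reported as 0.0, the self-loop weights, the convergence tolerance and the protein names p1 to p5 are all choices made for this example, and every E-value is assumed to be below 1 so that weights are positive.

```python
import math


def edge_weight(e_value, cap=200.0):
    """-log10(E-value); E-values reported as 0.0 get the cap (an assumption)."""
    if e_value <= 0.0:
        return cap
    return min(-math.log10(e_value), cap)


def normalize_columns(m):
    """Scale every column to sum to 1 (transition probabilities), in place."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total
    return m


def mcl(edges, inflation=2.0, max_iter=100, tol=1e-9):
    """Toy Markov clustering of {(protein_a, protein_b): e_value} edges."""
    nodes = sorted({p for pair in edges for p in pair})
    pos = {p: i for i, p in enumerate(nodes)}
    n = len(nodes)
    m = [[0.0] * n for _ in range(n)]
    for (a, b), e_value in edges.items():
        m[pos[a]][pos[b]] = m[pos[b]][pos[a]] = edge_weight(e_value)
    for i in range(n):
        m[i][i] = max(m[i])  # self-loop as heavy as the best edge (assumption)
    normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: ordinary matrix product m x m.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then column re-normalization.
        inflated = normalize_columns([[x ** inflation for x in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Rows of attractors (non-zero diagonal) list the members of each cluster.
    clusters = {frozenset(nodes[j] for j in range(n) if m[i][j] > 1e-6)
                for i in range(n) if m[i][i] > 1e-6}
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Hypothetical proteins: one triangle of strong hits and one separate pair.
    toy_edges = {("p1", "p2"): 1e-50, ("p1", "p3"): 1e-50,
                 ("p2", "p3"): 1e-50, ("p4", "p5"): 1e-30}
    print(mcl(toy_edges))
```

Raising `inflation` tends to give finer clusters and lowering it coarser ones; the toy input is deliberately simple, so it only shows the mechanics of one run.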
• The final matrix is then interpreted as a protein family clustering.
• The inflation value parameter of the MCL algorithm is used to control the granularity of these clusters.
[Figure: MCL workflow, from blastp proteome-specific comparisons (all significant protein hits) to clusters. Adapted from Enright et al. NAR 2002.]

Example of Partition/MCL clustering
P6 19
Total number of distinct ORFs = 6
--------------------
YKL212w  623   YIL002c  4e-11
YIL002c  946   YOR109w  7e-94
YIL002c  946   YNL106c  5e-90
YIL002c  946   YOL065c  3e-10
YNL106c  1183  YIL002c  1e-89
YOR109w  1107  YIL002c  1e-90
YKL212w  623   YOR109w  2e-34
YKL212w  623   YNL106c  3e-34
YKL212w  623   YNL325c  8e-29
YNL106c  1183  YKL212w  1e-33
YNL325c  879   YKL212w  6e-25
YOR109w  1107  YKL212w  2e-30
YNL106c  1183  YOR109w  0.0
YNL106c  1183  YNL325c  2e-22
YNL325c  879   YNL106c  1e-22
YOL065c  384   YNL106c  4e-10
YOR109w  1107  YNL106c  0.0
YNL325c  879   YOR109w  4e-20
YOR109w  1107  YNL325c  2e-16

YOL065c  P6.9.C6.48  6  6
YIL002c  P6.9.C6.48  6  6
YNL325c  P6.9.C6.48  6  6
YKL212w  P6.9.C6.48  6  6
YOR109w  P6.9.C6.48  6  6
YNL106c  P6.9.C6.48  6  6

Example of Partition/MCL clustering
P6 22
Total number of distinct ORFs = 6
--------------------
YBR208c  1835  YBR218c  2e-53
YBR208c  1835  YGL062w  1e-52
YBR208c  1835  YMR207c  6e-34
YBR208c  1835  YNR016c  5e-33
YBR208c  1835  YMR293c  3e-10
YBR218c  1180  YBR208c  5e-51
YGL062w  1178  YBR208c  1e-47
YMR293c  464   YBR208c  1e-11
YNR016c  2233  YBR208c  6e-34
YBR218c  1180  YGL062w  0.0
YGL062w  1178  YBR218c  0.0
YBR218c  1180  YNR016c  6e-30
YBR218c  1180  YMR207c  4e-29
YMR207c  2123  YBR218c  4e-35
YNR016c  2233  YBR218c  4e-36
YGL062w  1178  YMR207c  2e-27
YGL062w  1178  YNR016c  3e-27
YMR207c  2123  YGL062w  1e-35
YNR016c  2233  YGL062w  3e-35
YMR207c  2123  YBR208c  1e-34
YMR207c  2123  YNR016c  0.0
YNR016c  2233  YMR207c  0.0

YMR293c  P6.8.C4.88   6  4
YBR218c  P6.8.C4.88   6  4
YGL062w  P6.8.C4.88   6  4
YBR208c  P6.8.C4.88   6  4
YMR207c  P6.8.C2.782  6  2
YNR016c  P6.8.C2.782  6  2

Large scale predicted proteome comparisons
Gene Dictionary
[Table excerpt: one row per ORF, with columns ORF, size, match, partition, partition size and gene, followed by the best match and e-value in each species (S. cerevisiae, C. elegans, etc.). Recoverable entries: YAL015c, 399, HS, P2.140, with matches /YOL043c 9e-71 and /R10E4.5 6e-26; thrA, 820, HS, P3.46, with match /YJR139c 4e-31; Rv0006 (gyrA), 838, IS, singleton, with match /K12D12.1 6e-09.]
Table: 541880 predicted proteins x 100 species

Protein conservation profiles (phylogenetic profiles)
        E                A                B
        S1..............I.............I................Sn
G1,1    100000000000000000000000000000000000000000000000
G2,1    111111111111111111111111111111111111111111111111
G3,1    111111111111111111111111111111111111111111111111
.......................................................
Gn1,1   000001110001000000000000000000000000000000000000
G1,2    000000000000000000010100000000000000000000000000
G2,2    000000000000000000000000000000000111000011100011
........................................................
Gn2,2   111111110011111111111111011101110101111111111111
........................................................
G1,n    011110100000000000000000001000000000000000000000
G2,n    011111100000000000000000000000000000000000000000
G3,n    011111100011111111100011011011110100111111101111
........................................................
Gnp,n   100110000000000000000000000000000000000000000000
Table: 541880 predicted proteins x 100 species

Ancestral weight matrix
• Wii: weight of ancestral duplication;
• Wij: weight of ancestral conservation of i in j;
• nsi: nonspecific genes in species i.

Ancestral duplication and ancestral conservation (Wij; partial)
org   SC    SP    CE    DM    AG    CA    ATH   HS    MUS   FR    PF    ECUN  MJ    MTH   AF    PH    PA    APEM  TA    TV    H     SSP2  PFU   STO   PYAE  MA    MK    MMA   HI    …..  tnsp
SC    40.5  58.4  38.1  40.5  40.9  71.8  40.3  43.0  41.7  42.0  25.9  19.5  11.5  13.6  14.4  16.3  14.3  15.5  15.2  15.4  14.8  16.7  17.0  18.6  15.6  16.0  13.0  14.8  13.0  …..  74.4
SP    63.9  37.4  46.6  50.2  50.2  65.5  47.8  53.3  52.5  52.6  31.2  23.4  13.3  16.2  16.5  18.7  15.2  20.1  17.5  17.8  17.7  19.4  22.8  23.1  19.5  18.9  14.6  17.4  14.3  …..  79.2
CE    17.5  18.8  65.2  39.2  39.8  18.4  21.7  40.0  39.5  40.0  13.1  8.9   4.9   4.6   5.9   5.0   5.4   4.8   5.9   6.2   5.8   7.1   6.5   6.8   5.3   7.1   4.0   6.4   4.8   …..  49.7
DM    27.1  29.3  51.9  65.8  73.1  27.7  31.5  61.3  62.1  60.7  19.3  13.1  6.7   7.4   8.2   7.1   7.5   7.3   8.3   8.3   8.3   9.1   9.3   8.6   8.2   10.8  6.2   9.2   7.3   …..  76.4
AG    22.3  26.3  50.6  69.9  59.5  25.7  30.3  54.5  54.7  59.9  15.9  10.8  6.0   7.6   8.7   9.2   7.3   10.6  8.3   8.7   9.8   9.4   11.1  11.4  9.9   12.5  6.1   9.5   8.5   …..  81.0
CA    65.9  54.3  35.5  37.5  38.0  35.8  37.0  39.7  39.1  39.5  22.2  16.2  10.2  11.2  11.8  11.1  11.9  10.3  12.7  13.3  12.0  14.2  13.3  13.7  11.8  14.7  10.7  13.5  11.1  …..  72.6
ATH   23.4  25.0  27.5  29.5  30.6  24.3  83.6  32.1  31.5  32.7  16.3  11.4  6.0   8.0   8.7   9.7   7.4   9.4   8.2   8.3   10.2  9.5   12.3  11.1  9.5   9.7   6.9   8.1   8.7   …..  58.8
HS    22.9  25.0  44.6  50.3  50.2  23.2  25.6  66.7  76.8  68.7  17.2  12.0  4.8   5.1   5.6   5.2   5.5   5.2   5.3   5.6   5.5   6.2   7.0   5.9   5.8   7.4   4.6   6.6   4.4   …..  78.7
MUS   27.3  29.6  54.4  62.7  60.3  27.8  29.7  90.8  77.8  81.8  21.0  15.2  5.6   6.1   6.6   6.0   6.4   5.9   6.3   6.8   6.6   7.4   8.0   7.1   6.9   8.7   5.4   7.9   5.4   …..  93.7
FR    18.0  20.0  42.4  47.9  48.7  18.5  21.9  68.8  67.7  63.4  13.2  9.0   3.7   4.0   4.5   4.1   4.3   3.9   4.2   4.4   4.5   4.9   5.6   4.5   4.5   6.4   3.5   5.3   4.0   …..  72.8
PF    22.5  24.6  24.8  26.5  26.5  22.3  26.2  28.2  27.6  27.6  28.3  13.6  8.7   8.3   8.6   7.9   8.3   7.2   8.6   8.7   8.0   9.5   9.1   9.1   8.1   9.8   7.3   9.7   8.2   …..  42.3
ECUN  35.8  38.4  34.8  36.3  36.0  35.7  33.4  37.7  37.2  37.4  28.9  26.1  15.4  15.2  15.4  15.3  15.9  14.9  14.8  15.0  13.9  15.9  17.1  15.7  15.0  17.0  14.1  15.8  8.7   …..  48.1

[Figure: Wij per species.]
[Figure: percent ancestral (intra-species) duplication per species, by domain (legend: E, A, B); mean = 52.1, 30., 38.4; std = 17.8, 11.7, 11.2.]

Specific and nonspecific proteins
Large scale proteome comparisons allow estimation of:
• Specific proteins (genes): proteins that have no match outside their own proteome (no homolog in other species).
• Non-specific proteins (genes): proteins that are conserved in at least one other species (have homologs outside their own proteome).
[Figure: specific and nonspecific proportions per species; mean %: E 76.2, A 84.3, B 87.6.]
[Figure: species-specific genes (0 to 100%), split into genes matching a different phylum and genes matching the same phylum.]

Orthologs
100 species ==> 367143 orthologs
[Diagram labels: Si, a, Sj, b.]
Structural orthologs according to the 3 domains of life (100 species: 367143 orthologous genes)
[Pie chart: E 69%, B 21%, A 5%; the remaining shares (EB, AB, EA, all, ...) are between 0% and 2% each.]
Total partitions: 37826. Note: ~6% include genes from at least 2 domains of life.

Evolution by Module
[Figure: evolution by module (A. gambiae paralogs); GST: orthologs.]

Genome trees
• The three-domain proposal based on the ribosomal RNA tree. Woese et al. PNAS 87:4576-4579. (1990)
• The three-domain proposal, with continuous lateral gene transfer among domains. Doolittle, Science 284:2124-2128. (1999)
• The two-empire proposal, separating eukaryotes from prokaryotes and eubacteria from archaebacteria. Mayr, E. PNAS 95:9720-23. (1998)
(Figure from Martin & Embley, Nature 431:134-137. 2004)
• The ring of life, incorporating lateral gene transfer but preserving the prokaryote–eukaryote divide. Rivera MC and Lake JA. Nature 431:152-155.
(2004)

The 1.2-Megabase Genome Sequence of Mimivirus
Didier Raoult, Stéphane Audic, Catherine Robert, Chantal Abergel, Patricia Renesto, Hiroyuki Ogata, Bernard La Scola, Marie Suzan, Jean-Michel Claverie. Science, 306:1344-1350. (2004)
The tree was inferred with a maximum likelihood method based on the concatenated sequences of seven universally conserved proteins: arginyl-tRNA synthetase, methionyl-tRNA synthetase, tyrosyl-tRNA synthetase, RNA polymerase II largest subunit, RNA polymerase II second largest subunit, PCNA, and 5'-3' exonuclease. The alignment contains 3164 sites without insertions and deletions. Bootstrap percentages are shown along the branches.

Evolutionary biology: Early evolution comes full circle. Martin W, Embley TM. Nature, 431:134-137. (2004)
The ring of life provides evidence for a genome fusion origin of eukaryotes. Rivera, M.C. & Lake, J.A. Nature, 431:152-155. (2004)
“Our analyses indicate that the eukaryotic genome resulted from a fusion of two diverse prokaryotic genomes, and therefore at the deepest levels linking prokaryotes and eukaryotes, the tree of life is actually a ring of life.”

Genomic Databases and the Tree of Life. Keith A. Crandall and Jennifer E. Buhay. Science, 306:1144-1145. (2004)
Prospects for Building the Tree of Life from Large Sequence Databases. Amy C. Driskell, Cécile Ané, J. Gordon Burleigh, Michelle M. McMahon, Brian C. O'Meara, Michael J. Sanderson. Science, 306:1172-1174. (2004)

Species tree
• 16S/18S rRNA tree (Woese 1990);
• main difficulties include extensive incongruence between alternative phylogenies generated from single-gene data sets.
Alternative solutions: integrative methods
• “supertree” (consensus tree from a set of individual gene phylogenetic trees);
• “phylogenomic tree” based on the concatenation of a gene sample common to the considered species;
Sn
• (these methods suffer from difficulties related to phylogenetic tree construction: global sequence alignment difficulties; substitution variations between species; ...)

Genome trees
• The concept of a genome tree is based on overall gene content similarity;
• Genome trees consider more than single-gene information.

Gene tree - Species tree
[Figure: a species tree and a gene tree for species A, B and C drawn along a time axis, with duplication and speciation events marked.]

Evolutionary processes include:
[Diagram: from an ancestor to a species genome. Labels: Phylogeny*; Expansion* (genesis, duplication, HGT); Exchange* (HGT); Deletion* (loss); and selection.]

Universal tree (Woese 1990):
• 16S rRNA (most conserved sequences)
• main difficulties include extensive incongruence between alternative phylogenies generated from single-gene data sets
• a tree that takes into account the whole make-up of the species genomes?

Genome trees: data matrices
T = {Tij; i=1,n; j=1,n}, where n is the number of surveyed species and Tij is the overall similarity score between species j and i.
• Ancestral duplication and ancestral conservation (541880 total proteins):
  T = {Tij = wij = (number of proteins in j conserved in i)/size(j); i=1,n; j=1,n}
• Shared orthologous genes (442460 non-specific proteins):
  sij = (shared orthologs between i and j)
  T = {Tij = sij/size(j); i=1,n; j=1,n}
• Distinct shared conservation profiles (28365 / 184130 distinct conservation profiles):
  sij = (distinct shared conservation profiles between i and j)
  T = {Tij = sij/sjj; i=1,n; j=1,n}

Ancestral duplication and ancestral conservation: Wij
org/tnsp   SC   SP   CE   DM   AG   CA  ATH   HS  MUS   FR   PF ECUN   MJ  MTH   AF   PH   PA APEM   TA   TV    H SSP2  PFU  STO PYAE   MA   MK  MMA   HI  ...
SC       40.5 58.4 38.1 40.5 40.9 71.8 40.3 43.0 41.7 42.0 25.9 19.5 11.5 13.6 14.4 16.3 14.3 15.5 15.2 15.4 14.8 16.7 17.0 18.6 15.6 16.0 13.0 14.8 13.0
SP       63.9 37.4 46.6 50.2 50.2 65.5 47.8 53.3 52.5 52.6 31.2 23.4 13.3 16.2 16.5 18.7 15.2 20.1 17.5 17.8 17.7 19.4 22.8 23.1 19.5 18.9 14.6 17.4 14.3
CE       17.5 18.8 65.2 39.2 39.8 18.4 21.7 40.0 39.5 40.0 13.1  8.9  4.9  4.6  5.9  5.0  5.4  4.8  5.9  6.2  5.8  7.1  6.5  6.8  5.3  7.1  4.0  6.4  4.8
DM       27.1 29.3 51.9 65.8 73.1 27.7 31.5 61.3 62.1 60.7 19.3 13.1  6.7  7.4  8.2  7.1  7.5  7.3  8.3  8.3  8.3  9.1  9.3  8.6  8.2 10.8  6.2  9.2  7.3
AG       22.3 26.3 50.6 69.9 59.5 25.7 30.3 54.5 54.7 59.9 15.9 10.8  6.0  7.6  8.7  9.2  7.3 10.6  8.3  8.7  9.8  9.4 11.1 11.4  9.9 12.5  6.1  9.5  8.5
CA       65.9 54.3 35.5 37.5 38.0 35.8 37.0 39.7 39.1 39.5 22.2 16.2 10.2 11.2 11.8 11.1 11.9 10.3 12.7 13.3 12.0 14.2 13.3 13.7 11.8 14.7 10.7 13.5 11.1
ATH      23.4 25.0 27.5 29.5 30.6 24.3 83.6 32.1 31.5 32.7 16.3 11.4  6.0  8.0  8.7  9.7  7.4  9.4  8.2  8.3 10.2  9.5 12.3 11.1  9.5  9.7  6.9  8.1  8.7
HS       22.9 25.0 44.6 50.3 50.2 23.2 25.6 66.7 76.8 68.7 17.2 12.0  4.8  5.1  5.6  5.2  5.5  5.2  5.3  5.6  5.5  6.2  7.0  5.9  5.8  7.4  4.6  6.6  4.4
MUS      27.3 29.6 54.4 62.7 60.3 27.8 29.7 90.8 77.8 81.8 21.0 15.2  5.6  6.1  6.6  6.0  6.4  5.9  6.3  6.8  6.6  7.4  8.0  7.1  6.9  8.7  5.4  7.9  5.4
FR       18.0 20.0 42.4 47.9 48.7 18.5 21.9 68.8 67.7 63.4 13.2  9.0  3.7  4.0  4.5  4.1  4.3  3.9  4.2  4.4  4.5  4.9  5.6  4.5  4.5  6.4  3.5  5.3  4.0
PF       22.5 24.6 24.8 26.5 26.5 22.3 26.2 28.2 27.6 27.6 28.3 13.6  8.7  8.3  8.6  7.9  8.3  7.2  8.6  8.7  8.0  9.5  9.1  9.1  8.1  9.8  7.3  9.7  8.2
ECUN     35.8 38.4 34.8 36.3 36.0 35.7 33.4 37.7 37.2 37.4 28.9 26.1 15.4 15.2 15.4 15.3 15.9 14.9 14.8 15.0 13.9 15.9 17.1 15.7 15.0 17.0 14.1 15.8  8.7
Additional values following the table (label not preserved): 74.4 79.2 49.7 76.4 81.0 72.6 58.8 78.7 93.7 72.8 42.3 48.1

Genome tree: Ancestral duplication and conservation
[Figure: genome tree with the three domains marked B, A and E. Species labels as listed on the figure:]
B: Agrobacterium tumefaciens, Sinorhizobium meliloti, Mesorhizobium loti, Mycobacterium leprae, Deinococcus radiodurans, Synechocystis sp., Yersinia pestis, Vibrio cholerae, Salmonella Typhi, Pseudomonas aeruginosa, Escherichia coli, Haemophilus influenzae, Xylella fastidiosa, Neisseria meningitidis, Campylobacter jejuni, Aquifex aeolicus, Thermotoga maritima, Bacillus halodurans, Bacillus subtilis, Listeria monocytogenes EGD, Staphylococcus aureus N315, Staphylococcus aureus Mu50, Listeria innocua, Streptococcus pyogenes M1, Mycobacterium tuberculosis CDC1551, Mycobacterium tuberculosis, Chlamydia trachomatis, Helicobacter pylori, Rickettsia prowazekii, Treponema pallidum, Buchnera sp., Borrelia burgdorferi, Chlamydia pneumoniae, Mycoplasma pneumoniae, Mycoplasma genitalium
A: Pyrococcus furiosus, Pyrococcus horikoshii, Pyrococcus abyssi, Methanopyrus kandleri AV19, Archaeoglobus fulgidus, Methanobacterium thermoautotrophicum, Methanococcus jannaschii, Methanosarcina acetivorans (C2A), Methanosarcina mazei strain Goe1, Sulfolobus solfataricus P2, Sulfolobus tokodaii, Thermoplasma acidophilum, Halobacterium sp. NRC-1, Aeropyrum pernix K1, Pyrobaculum aerophilum, Thermoplasma volcanium
E: Homo sapiens, Fugu rubripes, Mus musculus, Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans, Arabidopsis thaliana, Plasmodium falciparum, E. cuniculi, Candida albicans, Schizosaccharomyces pombe, Saccharomyces cerevisiae
• Tekaia, F., Lazcano, A. and B. Dujon (1999). Genome Res. 12:17-25.
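A minimal sketch of the size-normalised matrices defined under "Genome trees: data matrices" (Tij = wij and Tij = sij/size(j)), using the factor 100 that appears in the tree captions so that values read as percentages. The function name normalize_by_size, the species X, Y, Z and every number below are hypothetical toy values chosen for illustration; they are not data from the survey.

```python
def normalize_by_size(counts, sizes):
    """Return T with T[i][j] = 100 * counts[i][j] / sizes[j].

    counts[i][j] is assumed to hold, for species i and j, either the number
    of proteins of j conserved in i (wij case) or the number of shared
    orthologs (sij case). sizes[j] is the proteome size of species j.
    """
    species = list(sizes)
    return {
        i: {j: 100.0 * counts[i][j] / sizes[j] for j in species}
        for i in species
    }


# Hypothetical toy data: three species and made-up counts.
sizes = {"X": 1000, "Y": 2000, "Z": 500}
counts = {
    "X": {"X": 400, "Y": 600, "Z": 100},
    "Y": {"X": 500, "Y": 900, "Z": 150},
    "Z": {"X": 200, "Y": 300, "Z": 50},
}

T = normalize_by_size(counts, sizes)

for i in sizes:
    print(i, [round(T[i][j], 1) for j in sizes])
```

Because each column j is divided by size(j), T is not symmetric in general: in the toy data T["X"]["Y"] and T["Y"]["X"] differ.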
• “whole genome” species clustering tree;
• species are clustered into 3 phylogenetic domains;
• bacterial species cluster with archaeal species;
• similar species cluster together;
• low resolution of deep clustering;
• evolutionary side effects are taken into account.

Shared orthologous genes (partial): sij
org     SC    SP    CE    DM    AG    CA   ATH    HS   MUS    FR    PF  ECUN
SC       0  2532  1533  1660  1671  3371  1582  1789  1733  1731   890   600
SP    2532     0  1753  1917  1907  2588  1754  2060  2032  2024  1008   645
CE    1533  1753     0  3910  3869  1611  1902  4036  3994  4047  1015   580
DM    1660  1917  3910     0  7018  1728  2094  5057  5147  5035  1106   616
AG    1671  1907  3869  7018     0  1738  2160  5016  5013  5059  1085   617
CA    3371  2588  1611  1728  1738     0  1590  1850  1824  1827   873   595
ATH   1582  1754  1902  2094  2160  1590     0  2404  2406  2399  1067   539
HS    1789  2060  4036  5057  5016  1850  2404     0 14053 10286  1185   638
MUS   1733  2032  3994  5147  5013  1824  2406 14053     0 10304  1169   632
FR    1731  2024  4047  5035  5059  1827  2399 10286 10304     0  1146   626
PF     890  1008  1015  1106  1085   873  1067  1185  1169  1146     0   453
ECUN   600   645   580   616   617   595   539   638   632   626   453     0
MJ     238   233   214   216   242   230   279   223   216   217   169   142
MTH    254   247   237   247   278   245   306   251   248   249   171   141
AF     261   255   254   260   303   248   310   260   263   265   182   151
PH     251   245   250   259   297   237   281   273   258   271   187   155
PA     267   261   255   268   311   256   312   276   273   278   189   156
APEM   212   233   228   228   251   215   242   248   237   230   165   136
TA     264   260   252   254   279   261   298   268   264   261   182   141
TV     263   255   256   249   276   258   296   260   258   270   184   138
H      255   264   258   249   284   248   318   271   267   272   173   140
SSP2   302   317   293   292   326   300   360   310   309   311   200   155
PFU    264   284   256   275   324   286   316   292   274   280   195   150
STO    281   291   273   263   313   278   329   293   282   298   196   143
PYAE   245   258   236   249   285   238   278   258   246   256   170   143
MA     303   316   298   293   368   301   369   329   326   326   200   161
MK     210   214   195   204   216   211   244   205   202   195   160   125
MMA    289   298   276   280   338   280   349   305   299   297   194   160
HI     268   273   231   243   388   268   382   259   259   267   181    86

Genome tree: shared orthologs: Tij = 100*sij/size(j)
[Figure: genome tree with the three domains marked B, A and E. Species labels as listed on the figure:]
B: Salmonella Typhi, Escherichia coli, Yersinia pestis, Vibrio cholerae, Pseudomonas aeruginosa, Synechocystis sp., Thermotoga maritima, Aquifex aeolicus, Deinococcus radiodurans, Xylella fastidiosa, Neisseria meningitidis, Haemophilus influenzae, Rickettsia prowazekii, Buchnera sp., Helicobacter pylori, Campylobacter jejuni, Sinorhizobium meliloti, Mesorhizobium loti, Agrobacterium tumefaciens, Borrelia burgdorferi, Treponema pallidum, Chlamydia pneumoniae, Chlamydia trachomatis, Mycoplasma pneumoniae, Mycoplasma genitalium, Listeria innocua, Listeria monocytogenes EGD, Streptococcus pyogenes M1, Bacillus subtilis, Bacillus halodurans, Staphylococcus aureus Mu50, Staphylococcus aureus N315, Mycobacterium tuberculosis CDC1551, Mycobacterium tuberculosis, Mycobacterium leprae
A: Methanopyrus kandleri AV19, Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Archaeoglobus fulgidus, Halobacterium sp. NRC-1, Methanosarcina mazei strain Goe1, Methanosarcina acetivorans (C2A), Pyrococcus furiosus, Pyrococcus abyssi, Pyrococcus horikoshii, Sulfolobus solfataricus P2, Sulfolobus tokodaii, Pyrobaculum aerophilum, Aeropyrum pernix K1, Thermoplasma acidophilum, Thermoplasma volcanium
E: Mus musculus, Homo sapiens, Fugu rubripes, Caenorhabditis elegans, Drosophila melanogaster, Anopheles gambiae, Arabidopsis thaliana, Plasmodium falciparum, E. cuniculi, Schizosaccharomyces pombe, Candida albicans, Saccharomyces cerevisiae
• 3 phylogenetic domains;
• bacterial species cluster with archaeal species;
• similar species cluster together;
• better resolution of deep species clustering;
• evolutionary side effects (HGT, duplication, loss) are not completely eliminated.

Conservation profiles
p = 011111100011111111100011011011110100111111101111
• a “conservation profile” is an n-component vector describing a protein conservation pattern across n species. Components are 0 and 1, following absence or presence of homologs.
• A conservation profile is the trace of protein evolutionary histories jointly captured in a set of species (multidimensional feature);
• Conservation profiles are signatures of evolutionary relationships;
• Considering distinct conservation profiles reduces the effects of noisy evolutionary processes (less noisy phylogenetic signals);
• Each conservation profile brings an equal amount of information, regardless of the size of the set of genes that have identical conservation profiles;
• => they give evidence of evolutionary history in a set of species.

Distinct conservation profiles
[Figure: construction of distinct conservation profiles in four steps (Step 1 to Step 4). Each protein gi,1 ... gi,p of a species Si has a 0/1 profile across S1 ... Sn; within Si the profiles carry weights wi,1 ... wi,k; across all species the profiles carry weights W1 ... Wt.]
Wl = {wi,m; i=1,n; m=1,n} is the weight of the conservation profile l.

Distinct conservation profiles
• 100 species ===> 541880 proteins
• 442460 non-specific proteins, i.e. conservation profiles
• drastic reduction: 184130 distinct conservation profiles
• 28365 distinct conservation profiles associated with at least 2 proteins from distinct species

Distribution of distinct conservation profiles according to the three phylogenetic domains
[Pie chart: E 1%, A 2%, B 11%, EA 3%, EAB 39%, EB 11%, AB 33%]

Occurrences of shared conservation profiles
• Tij = sij, where sij is the number of occurrences of distinct shared conservation profiles between species i and j;
• Tij = sij/sjj.
Example profiles across species S1 ... Sn, grouped by domain (E | A | B):
100000000000000000000000000000000000000000000000
111111111111111111111111111111111111111111111111
000001110001000000000000000000000000000000000000
000000000000000000000000000000000111000011100011
................................................
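The profile-based matrix can be sketched in a few lines of Python. The code collapses per-protein profiles into distinct profiles with weights, counts for each species pair the distinct profiles in which both species are present, and applies Tij = sij/sjj. Reading sij as "the number of distinct profiles with a 1 for both i and j" is an assumption about the definition above, and the function names and the five three-species profiles are hypothetical toy data.

```python
from collections import Counter


def distinct_profiles(profiles):
    """Collapse identical 0/1 profiles; the count is the profile's weight."""
    return Counter(profiles)


def shared_counts(distinct, n):
    """s[i][j] = number of distinct profiles with a 1 for both i and j.

    Each distinct profile is counted once, whatever its weight.
    """
    s = [[0] * n for _ in range(n)]
    for profile in distinct:
        present = [k for k, bit in enumerate(profile) if bit == "1"]
        for i in present:
            for j in present:
                s[i][j] += 1
    return s


def normalize_by_diagonal(s):
    """T[i][j] = s[i][j] / s[j][j] (0.0 when species j has no profile)."""
    n = len(s)
    return [
        [s[i][j] / s[j][j] if s[j][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: five proteins, profiles over three species.
profiles = ["110", "110", "011", "111", "100"]

distinct = distinct_profiles(profiles)   # four distinct profiles
s = shared_counts(distinct, n=3)
T = normalize_by_diagonal(s)

for row in T:
    print([round(x, 2) for x in row])
```

Counting each distinct profile once, rather than once per protein, is what gives every profile equal weight regardless of how many genes share it.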
Occurrences of shared distinct conservation profiles: sij
spec    SC    SP    CE    DM    AG    CA   ATH    HS   MUS    FR    PF  ECUN
SC    2328   387   239   262   274   400   338   285   299   288   146    96
SP     387  2208   267   301   317   351   377   318   334   320   152   102
CE     239   267  3153   575   506   284   364   642   656   670   188   116
DM     262   301   575  2747   653   305   416   718   729   725   203   124
AG     274   317   506   653  4052   269   477   612   657   650   165   107
CA     400   351   284   305   269  1906   315   345   362   338   171   107
ATH    338   377   364   416   477   315  5762   451   477   469   190   110
HS     285   318   642   718   612   345   451  3813  1511  1134   231   127
MUS    299   334   656   729   657   362   477  1511  4134  1140   229   133
FR     288   320   670   725   650   338   469  1134  1140  4280   215   132
PF     146   152   188   203   165   171   190   231   229   215  1251    95
ECUN    96   102   116   124   107   107   110   127   133   132    95   572
MJ      41    46    32    39    48    45    60    39    41    39    21    13
MTH     54    56    40    53    63    53    73    51    54    50    30    21
AF      56    52    57    62    78    54    74    64    66    65    31    19
PH      41    41    46    45    58    44    59    47    51    47    24    14
PA      49    47    51    48    56    53    72    51    52    50    25    16
APEM    51    51    48    51    65    51    63    57    60    54    29    17
TA      55    59    63    61    72    57    83    66    68    65    31    19
TV      58    56    65    59    68    52    82    61    66    65    29    18
H       65    68    64    65    77    61   101    71    73    71    34    23
SSP2    71    75    73    72    87    70    95    80    87    76    32    20
PFU     52    57    57    51    64    57    73    56    62    56    28    18
STO     59    59    65    67    71    56    75    65    66    64    28    17
PYAE    59    56    48    53    73    53    81    60    67    62    24    15
MA      71    75    76    83   102    84   113    85    93    85    44    33
MK      43    45    33    40    48    44    56    38    41    36    21    12
MMA     77    72    65    73    89    76   105    74    81    66    41    28
HI      71    76    70    67   101    79   116    74    74    78    46    23

[Figure with three panels titled Profiles, Conservation and Orthologs.]

• Tekaia, F. and B. Dujon (1999). Pervasiveness of gene conservation and persistence of duplicates in cellular genomes. Journal of Molecular Evolution, 49:591-600.
• Tekaia, F., Lazcano, A. and B. Dujon (1999). Genome tree as revealed from whole proteome comparisons. Genome Res. 12:17-25.
• Tekaia, F., Gordon, S.V., Garnier, T., Brosch, R., Barrell, B.G. and S.T. Cole (1999). Analysis of the proteome of Mycobacterium tuberculosis in silico. Tubercle and Lung Disease, 79:329-342.
• Genolevures program:
  - F. Tekaia, G. Blandin, A. Malpertuy, et al. (2000). Methods and strategies used for sequence analysis and annotation. FEBS 487,1:17-30.
  - A. Malpertuy, F. Tekaia, S. Casaregola, et al. (2000). «Yeast specific» genes. FEBS 487,1:113-121.
  - G. Blandin, P. Durrens, F. Tekaia, et al. (2000). The genome of Saccharomyces cerevisiae revisited. FEBS 487,1:31-36.
• Tekaia, F., Yeramian, E. and B. Dujon (2002). Amino acid composition of genomes, lifestyle of organisms and evolutionary trends: a global picture with correspondence analysis. Gene, 297:51-60.
• Tekaia, F. and E. Yeramian (in prep). Genome tree based on conservation profiles.

Systematic analysis of completely sequenced organisms: http://www.pasteur.fr/~tekaia/sacso.html