Large scale proteome comparisons Genome trees Fredj Tekaia Institut Pasteur [email protected] Complete genomes Tree of life • 1387 projects 261 published (01-03-05) • 654 prokaryotes • 472 eukaryotes http://www.genomesonline.org/


Large scale proteome comparisons
Genome trees
Fredj Tekaia
Institut Pasteur
[email protected]
207
21
Complete genomes
Tree of life
• 1387 projects
261 published (01-03-05)
• 654 prokaryotes
33
• 472 eukaryotes
http://www.genomesonline.org/
Cumulated number of available completely sequenced genomes
[Figure: bar chart, one bar per year from 1995 to March 2005. Cumulated totals: 1995: 2; 1996: 5; 1997: 12; 1998: 19; 1999: 24; 2000: 42; 2001: 71; 2002: 116; 2003: 165; 2004: 224; 03-05: 261.]
The number of completely sequenced genomes spanning the three
domains of life is growing at a rapid rate
List and references
GOLD
Genome sequencing projects
There are several web-based resources that document the
progress of completely sequenced genomes and their
reference publication, including:
GOLD
Genomes Online Database
http://wit.integratedgenomics.com/GOLD/
GNN
Genome News Network
http://www.genomenewsnetwork.org/index.php
Resources for genomes
There are two main resources for genomes:
EBI
European Bioinformatics Institute
http://www.ebi.ac.uk/genomes/
NCBI
National Center for Biotechnology Information
http://www.ncbi.nlm.nih.gov
Many other resources are provided by sequencing institutions:
Sanger
The Wellcome Trust Sanger Institute
http://www.sanger.ac.uk/
TIGR
The Institute for Genomic Research
http://www.tigr.org
Genolevures
http://cbi.labri.fr/Genolevures/index.php
Definitions
Genome
The genome of a cell is formed by the collection of the DNA it comprises.
The genome size is the total of its DNA bases.
Gene
Is a particular DNA sequence situated in a specific position on a chromosome and
that codes for a specific function.
Protein
Is a sequence of amino acids ordered according to the DNA sequence
of the gene that encodes it.
Proteome
Is the set of proteins in an organism.
Genomics
Is the exhaustive study of genomes: genetic material, genes; their functions, their
organization....
Chronology of completely sequenced genomes
• 1977: first viral genome, bacteriophage φX174, sequenced by Sanger et al. (5386 base pairs; encoding 11 genes).
• 1981: human mitochondrial genome; 16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA).
• 1986: chloroplast genome; 156,000 base pairs (most are 120 kb to 200 kb).
• 1995: first genome of a free-living organism, the bacterium Haemophilus influenzae, by TIGR; 1830 Kb, 1713 genes.
• 1996: first archaeal genome, Methanococcus jannaschii DSM 2661, by TIGR; 1664 Kb, 1773 genes.
• 1997: first eukaryotic genome, Saccharomyces cerevisiae S288C; international collaboration; 16 chromosomes; 12,057 Kb, ~6000 genes.
• 1998: first multicellular organism, the nematode Caenorhabditis elegans; 97 Mb; ~19,000 genes.
• 1999: first human chromosome, chromosome 22 (49 Mb; 673 genes).
• 2000: fruit fly Drosophila melanogaster (137 Mb; ~13,000 genes).
• 2000: first plant genome, Arabidopsis thaliana (115,428 Kb; 22670 genes).
• 2001: draft sequence of the human genome (x Mb; ~28000 genes).
• 2002: Plasmodium falciparum (22.9 Mb; 5334 genes).
• 2002: mouse genome (x Mb; ~28000 genes).
• 2004: draft genome of the fish Tetraodon nigroviridis (x Mb; ~28000 genes).
How big are genome sizes?
Viral genomes: 1 kb to 350 kb (Mimivirus: 1.2 Mb)
Bacterial genomes: 0.5 Mb to 13 Mb;
Eukaryotic genomes: 8 Mb to 670 Gb;
DOGS: http://www.cbs.dtu.dk/databases/DOGS/abbr_table.bysize.txt
Comparative genomics
Analyses of the genetic material of different species help us to
understand the similarities and differences between genomes,
their evolution and the evolution of their genes.
• Intra-genomic comparisons help us to understand the degree of
duplication (genome regions; genes) and gene organization, ...
• Inter-genomic comparisons help us to understand the degree of
similarity between genomes and the degree of conservation between genes;
• Both contribute to understanding gene and genome evolution.
Evolution
Evolutionary processes include:
• Phylogeny* (descent from an ancestor);
• Expansion* (gene genesis, duplication, HGT);
• Exchange* (HGT);
• Deletion* (gene loss);
and selection, all acting on the species genome.
Gene duplications are traditionally considered to be a
major evolutionary source of new protein functions.
Understanding how duplications happened and how important this
evolutionary process is remains a key goal of genome analysis.
> Some examples
S. cerevisiae genome
Colours reveal Duplications
Kellis et al. Nature, 2004
Duplication
Speciation
Deletion
Actual content of the 2 copies
Reconstruction of the ancestral
organization
Kellis et al. Nature, 2004
Kellis et al. Nature, 2004
Nature Reviews Genetics 3; 827-837 (2002);
SPLITTING PAIRS: THE DIVERGING FATES OF DUPLICATED GENES
Original version
Actual version
Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.
Genome duplication.
a, Distribution of Ks values of
duplicated genes in Tetraodon (left)
and Takifugu (right) genomes.
Duplicated genes broadly belong to
two categories, depending on their
Ks value being below or higher than
0.35 substitutions per site since the
divergence between the two puffer
fish (arrows).
b, Global distribution of ancient
duplicated genes (Ks > 0.35) in the
Tetraodon genome. The 21
Tetraodon chromosomes are
represented in a circle in numerical
order and each line joins duplicated
genes at their respective position on
a given pair of chromosomes.
Jaillon et al. Nature 431, 946-957. 2004.
Inter-genomic comparisons
• Compositional comparisons between species (nuc and aa
compositions);
• Gene, protein conservation between species (rate of
conservation);
• Orthologs; families of orthologs;
• Specific and non-specific genes;
• Genes exclusively conserved in one or in a subset of
species (or in domains);
• Gene Dictionary;
• Gene conservation profiles;
• Genome tree construction;
• Genome multiple alignments.
Methodology
[Diagram: data matrix T with n rows (i) and p columns (j), entries kij > 0, plus supplementary elements (sup); rows and columns are projected onto factorial axes F1 ... Fp.]
Correspondence
Analysis
Classification
• orthogonal system;
• use of euclidean distance;
Amino Acid composition
org    Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr
sc     5.5  4.4  6.1  5.8  1.3  3.9  6.5  5.0  2.1  6.6  9.6  7.3  2.1  4.6  4.3  9.0  5.8  1.1  3.3
ce     6.2  5.2  4.9  5.2  2.1  4.1  6.4  5.3  2.3  6.2  8.7  6.5  2.6  5.0  4.9  8.0  5.8  1.1  3.2
dm     7.5  5.6  4.7  5.2  1.9  5.2  6.4  6.2  2.7  4.9  9.2  5.6  2.4  3.6  5.5  8.3  5.6  1.0  3.0
ca     5.0  3.7  6.7  5.9  1.1  4.5  6.4  5.1  2.1  7.1  9.2  7.3  1.8  4.4  4.5  9.0  6.2  1.0  3.5
sp     6.3  4.9  5.2  5.4  1.5  3.8  6.6  5.0  2.3  6.1  9.9  6.5  2.1  4.6  4.8  9.4  5.4  1.1  3.4
ath    6.2  5.5  4.4  5.4  1.9  3.5  6.7  6.3  2.3  5.4  9.5  6.4  2.4  4.3  4.7  9.0  5.1  1.3  2.9
hs     7.0  5.6  3.7  4.9  2.2  4.7  7.0  6.6  2.5  4.4  9.8  5.7  2.2  3.7  6.1  8.0  5.3  1.2  2.8

org    Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr
mj     5.5  3.9  5.3  5.5  1.3  1.5  8.7  6.3  1.4 10.4  9.5 10.4  2.2  4.2  3.4  4.5  4.0  0.7  4.4
mth    7.3  6.8  3.3  5.9  1.2  1.9  8.1  8.0  1.9  7.7  9.5  4.6  2.9  3.6  4.3  6.1  5.0  0.8  3.2
af     7.8  5.8  3.2  4.9  1.2  1.8  8.9  7.2  1.5  7.2  9.5  6.9  2.6  4.6  3.9  5.5  4.2  1.0  3.6
ph     6.4  5.5  3.5  4.3  0.6  1.6  8.3  7.0  1.5  8.8 10.3  7.7  2.4  4.6  4.5  5.9  4.5  1.2  3.8
pa     6.7  5.7  3.3  4.6  0.6  1.7  8.8  7.3  1.5  8.5 10.2  7.8  2.4  4.4  4.3  5.0  4.2  1.2  3.8
ape    9.7  7.8  2.0  4.2  0.8  1.8  7.3  8.8  1.6  5.5 11.0  3.9  2.2  2.9  5.5  6.7  4.3  1.3  3.5
ssp2   5.6  4.7  5.0  4.7  0.6  2.1  6.8  6.4  1.3  9.4 10.3  7.7  2.2  4.4  3.8  6.7  4.7  1.1  4.8
pfu    6.6  5.3  3.5  4.4  0.6  1.8  8.9  7.1  1.5  8.7 10.1  8.1  2.2  4.4  4.3  4.9  4.4  1.2  4.0
sto    5.6  4.2  4.9  4.6  0.7  2.1  7.0  6.3  1.3  9.9 10.3  8.0  2.1  4.5  3.9  6.7  4.8  1.0  4.9
pyae   9.9  6.5  2.6  4.3  0.9  2.1  7.0  7.7  1.5  6.3 10.5  5.7  1.9  3.6  5.0  4.9  4.4  1.5  4.3
ta     7.0  5.5  4.3  5.7  0.6  2.2  6.0  7.3  1.6  9.0  8.4  5.6  3.2  4.7  4.0  7.6  4.8  0.9  4.6
tv     6.4  4.7  4.8  5.5  0.6  2.1  6.4  7.0  1.5  9.2  8.8  6.9  2.7  4.7  3.8  7.5  4.8  0.8  4.8
h     13.1  6.5  2.1  9.0  0.7  2.6  6.7  8.5  2.2  3.6  8.3  1.6  1.7  3.1  4.7  5.2  6.8  1.1  2.5
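A composition table of this kind can be recomputed from any set of protein sequences. The following is a minimal sketch, not the author's pipeline: the function name `composition` and the two toy peptides are hypothetical, and the code simply counts the 20 standard residues and converts the counts to percentages.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def composition(sequences):
    """Return the percentage of each standard amino acid over all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore ambiguous symbols such as X or * that may occur in predicted proteins.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical toy "proteome" made of two short peptides.
toy = ["MKKLAE", "MERK"]
comp = composition(toy)
lys_plus_arg = comp["K"] + comp["R"]  # the kind of combined Lys+Arg value used in the slides
```

On a real proteome the input would be the full list of predicted protein sequences for one species.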
[Figure: amino acid frequencies (Glu, Lys, Arg, Gln) plotted against growth temperature and against GC%; r=0.83, p<1.e-4]
org    Glu  Gln  Lys+Arg
mj     8.7  1.5  14.3
mth    8.1  1.9  11.4
af     8.9  1.8  12.7
ph     8.3  1.6  13.2
pa     8.8  1.7  13.5
apem   7.3  1.8  11.7
ssp2   6.8  2.1  12.4
pfu    8.9  1.8  13.4
sto    7.0  2.1  12.2
pyae   7.0  2.1  12.2
ta     6.0  2.2  11.1
tv     6.4  2.1  11.6
ae     9.6  2.0  14.3
tm     8.9  2.0  13.1
Tekaia, F., Yeramian, E. and Dujon B. (2002) Gene. 297 pp. 51-60.
PE, PPE families
Protein size statistics
Dom  n org  mean   std    n prot  min  max
E    38     443.1  403.6  364538  10   9638
A    19     279.9  199.6  42499   10   4436
B    53     311.2  233.2  155538  11   7463
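Proteome comparisons: Methodology
Species specific comparisons (blastp, pam250, SEG filter)
Each new proteome (NP) is compared with every proteome in the set (proteome1 ... proteomen, P1 ... Pn), in both directions. Each comparison yields three result files:
• P1 against NP: bestp1np, allp1np, segmatchp1np
• NP against P1: bestnpp1, allnpp1, segmatchnpp1
• ...
• Pn against NP: bestpnnp, allpnnp, segmatchpnnp
• NP against Pn: bestnppn, allnppn, segmatchnppn
(diagram label: SPECSO)
Record layout:
bestnppi: np1 size pij e-value1 HS/IS/NS
allnppi:  np1 size pij e-value1 HS/IS/NS
          np1 size pik e-value HS/IS/NS
100 species: E: 28, A: 19, B: 53
The comparisons identify:
• Paralogs
• Orthologs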
The expected number of HSPs with score at least S is given by: $E = K\,m\,n\,e^{-\lambda S}$.
m and n are the sequence and database lengths; K and λ are parameters of the scoring system.
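The formula can be evaluated directly. The sketch below is only an illustration: the function name is hypothetical, the values of K and λ are placeholders chosen for the example (BLAST reports the actual values for each scoring system), and S is taken as a raw score. The lengths 642 and 2,798,770 are those of YAL005c and of the S. cerevisiae proteome used in the example later in the slides.

```python
import math


def expected_hsps(score, query_len, db_len, k, lam):
    """E = K * m * n * exp(-lambda * S) for a raw score S."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder statistical parameters, for illustration only.
K, LAMBDA = 0.041, 0.267

e_low_score = expected_hsps(50, 642, 2_798_770, K, LAMBDA)
e_high_score = expected_hsps(100, 642, 2_798_770, K, LAMBDA)
```

E decreases exponentially with the score and grows linearly with the size of the search space, which is why the same alignment gets a larger E-value against a larger database.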
Dom  Code    Size   Organism                                       Taxonomic class
E    SC      5829   S. cerevisiae                                  Ascomycota
E    SP      4962   S. pombe                                       Ascomycota
E    NCU     10082  Neurospora crassa                              Ascomycota
E    CALBI   6165   C. albicans                                    Ascomycota
E    MGR     11109  Magnaporthe grisea                             Ascomycota
E    FG      11640  Fusarium graminearum                           Ascomycota
E    AN      9541   Aspergillus nidulans                           Ascomycota
E    ECUN    1996   E. cuniculi                                    Microsporidia
E    CE      20844  C. elegans                                     Eumetazoa
E    CBR     14713  Caenorhabditis briggsae                        Eumetazoa
E    DM      17878  D. melanogaster                                Arthropoda
E    AG      16112  Anopheles gambiae                              Arthropoda
E    ATH     22671  A. thaliana                                    Streptophyta
E    HS      27625  Homo sapiens                                   Eumetazoa
E    MUS     28097  Mus musculus                                   Chordata
E    FR      33609  Fugu rubripes                                  Eumetazoa
E    PF      5334   P. falciparum                                  Apicomplexa
E    CI      15851  Ciona intestinalis                             Eumetazoa
E    RN      21205  Rattus norvegicus                              Eumetazoa
A    MJ      1773   M. jannaschii                                  Methanococci
A    MTH     1871   M. thermoautotrophicum                         Euryarchaeota
A    AF      2409   A. fulgidus                                    Euryarchaeota
A    PH      2061   P. horikoshii OT3                              Euryarchaeota
A    PA      1765   P. abyssi                                      Euryarchaeota
A    APEM    1865   A. pernix K1                                   Crenarchaeota
A    TA      1478   Thermoplasma acidophilum                       Euryarchaeota
A    TV      1526   Thermoplasma volcanium                         Euryarchaeota
A    H       2058   Halobacterium sp. NRC-1                        Euryarchaeota
A    SSP2    2977   Sulfolobus solfataricus P2                     Crenarchaeota
A    PFU     2208   P. furiosus                                    Euryarchaeota
A    STO     2826   Sulfolobus tokodaii                            Crenarchaeota
A    PYAE    2605   Pyrobaculum aerophilum                         Crenarchaeota
A    MA      4528   Methanosarcina acetivorans (C2A)               Euryarchaeota
A    MK      1687   Methanopyrus kandleri AV19                     Euryarchaeota
A    MMA     3371   Methanosarcina mazei strain Goe1               Euryarchaeota
A    MBUR    2676   Methanococcoides burtonii                      Euryarchaeota
A    MFR     2911   Methanogenium frigidum                         Euryarchaeota
B    HI      1713   H. influenzae                                  Gammaproteobacteria
B    MG      479    M. genitalium                                  Mycoplasmatales
B    MP      677    M. pneumoniae                                  Mycoplasmatales
B    Ssp     3168   Synechocystis sp.                              Cyanobacteria
B    EC      4290   E. coli                                        Gammaproteobacteria
B    HP      1577   H. pylori                                      Epsilonproteobacteria
B    BS      4100   B. subtilis                                    Bacillus
B    BH      4066   Bacillus halodurans                            Bacillus
B    BB      1639   B. burgdorferi                                 Spirochaetes
B    AE      1522   A. aeolicus                                    Aquificales
B    MT      3996   M. tuberculosis H37Rv                          Actinobacteria
B    MTC     4203   M. tuberculosis CDC 1551                       Actinobacteria
B    ML      1604   Mycobacterium leprae                           Actinobacteria
B    TP      1031   T. pallidum                                    Spirochaetes
B    CT      877    C. trachomatis                                 Chlamydiae
B    RP      837    R. prowazekii                                  Alphaproteobacteria
B    CJ      1634   C. jejuni                                      Epsilonproteobacteria
B    CP      1052   C. pneumoniae                                  Chlamydiae
B    TM      1849   T. maritima                                    Thermotogae
B    DR      3117   D. radiodurans                                 Deinococcus-Thermus
B    NM      2081   N. meningitidis                                Betaproteobacteria
B    XF      2830   Xylella fastidiosa                             Gammaproteobacteria
B    VC      3837   Vibrio cholerae                                Gammaproteobacteria
B    PAE     5570   Pseudomonas aeruginosa                         Gammaproteobacteria
B    B       575    Buchnera sp.                                   Gammaproteobacteria
B    LMO     2846   Listeria monocytogenes                         Bacilli
B    LIN     2968   Listeria innocua                               Bacilli
B    STY     4395   Salmonella Typhi                               Gammaproteobacteria
B    YP      3895   Yersinia pestis                                Gammaproteobacteria
B    SAMU50  2714   Staphylococcus aureus Mu50                     Bacilli
B    SAN315  2594   Staphylococcus aureus N315                     Bacilli
B    SPY     1696   Streptococcus pyogenes M1                      Bacilli
B    MM      7275   Mesorhizobium loti                             Alphaproteobacteria
B    SM      6205   Sinorhizobium meliloti                         Alphaproteobacteria
B    AGRT    5299   Agrobacterium tumefaciens                      Alphaproteobacteria
B    MB      3953   Mycobacterium bovis                            Actinobacteria
B    SCO     7810   Streptomyces coelicolor                        Actinobacteria
B    UU      614    Ureaplasma urealyticum                         Mycoplasmatales
B    SHFL    4068   Shigella flexneri                              Gammaproteobacteria
B    LL      2321   Lactococcus lactis subsp. lactis               Bacilli
B    RCO     1374   Rickettsia conorii Malish 7                    Alphaproteobacteria
B    CCR     3737   Caulobacter crescentus CB15                    Alphaproteobacteria
B    NOS     5366   Nostoc sp                                      Cyanobacteria
B    TSE     2475   Thermosynechococcus elongatus BP-1             Cyanobacteria
B    TTE     2588   Thermoanaerobacter tengcongensis strain MB4T   Clostridia
B    BFL     583    Candidatus Blochmannia floridanus              Gammaproteobacteria
B    PRO     1882   Prochlorococcus marinus subsp. marinus str.    Cyanobacteria
B    PMT     2265   Prochlorococcus marinus str. MIT 9313          Cyanobacteria
B    PMM     1712   Prochlorococcus marinus subsp. pastoris str.   Cyanobacteria
B    WS      2044   Wolinella succinogenes                         Epsilonproteobacteria
B    PL      4683   Photorhabdus luminescens subsp. laumondii      Gammaproteobacteria
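Homolog - Paralog - Ortholog
[Diagram: an ancestral gene O duplicates into genes A and B; after speciation, Species-1 carries A1 and B1 and Species-2 carries A2 and B2.]
Homologs: A1, B1, A2, B2
Paralogs: A1 vs B1 and A2 vs B2
Orthologs: A1 vs A2 and B1 vs B2
Sequence analysis
[Diagram: gene a in species S1 compared with gene b in species S2.]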
Example
Comparing S. cerevisiae (SC) genome with C. elegans (CE) genome
SC vs SC
BLASTP 2.2.1 [Apr-13-2001]
............................
Query= YAL005c
SSA1 heat shock protein of HSP70 family,
cytosolic
(642 letters)
Database:
S. cerevisiae proteome version 22/05/2002
5829 sequences; 2,798,770 total letters
................................................
Sequences producing significant alignments:                  Score     E
                                                             (bits)  Value
YAL005c SSA1 heat shock protein of HSP70 family, cyt...       674    0.0
YLL024c SSA2 heat shock protein of HSP70 family, cyt...       663    0.0
YER103w SSA4 heat shock protein of HSP70 family, cyt...       589    e-169
YBL075c SSA3 heat shock protein of HSP70 family, cyt...       588    e-169
YJL034w KAR2 nuclear fusion protein                           480    e-136
YDL229w SSB1 heat shock protein of HSP70 family               428    e-120
YNL209w SSB2 heat shock protein of HSP70 family, cyt...       427    e-120
YJR045c SSC1 mitochondrial heat shock protein 70-rel...       336    5e-93
YEL030w heat shock protein of HSP70 family                    324    2e-89
YLR369w SSQ1 mitochondrial heat shock protein 70              296    4e-81
YBR169c SSE2 heat shock protein of the HSP70 family           173    7e-44
YPL106c SSE1 heat shock protein of HSP70 family               172    1e-43
YHR064c regulator protein involved in pleiotro...             143    6e-35
YKL073w LHS1 chaperone of the ER lumen                        100    4e-22
YLR135w subunit of SLX1P/Ybr228p-SLX4P complex...              33    0.13
...................
bestscsc ( SC / SC )
YAL002w  1176  -        NS
YAL003w  206   -        NS
YAL004w  215   -        NS
YAL005c  642   YLL024c  HS  0.0
YAL007c  215   YOR016c  HS  1e-44

allscsc ( SC / SC )
YAL002w  1176  -        NS
YAL003w  206   -        NS
YAL004w  215   -        NS
YAL005c  642   YLL024c  HS  0.0
YAL005c  642   YER103w  HS  0.0
YAL005c  642   YBL075c  HS  0.0
YAL005c  642   YJL034w  HS  e-147
YAL005c  642   YDL229w  HS  e-130
YAL005c  642   YNL209w  HS  e-130
YAL005c  642   YJR045c  HS  e-100
YAL005c  642   YEL030w  HS  2e-96
YAL005c  642   YLR369w  HS  1e-87
YAL005c  642   YBR169c  HS  2e-47
YAL005c  642   YPL106c  HS  4e-47
YAL005c  642   YHR064c  HS  7e-38
YAL005c  642   YKL073w  HS  5e-24
YAL007c  215   YOR016c  HS  1e-44
YAL007c  215   YGL200c  IS  5e-05
YAL007c  215   YHR110w  IS  0.017
YAL007c  215   YDL018c  IS  0.021
- Paralogs - multiple matches
- Partitions/clustering
Multiple matches of sc in sc
ORF      matches in sc
YAL005c  13
YAL007c  1
YDR214w  1
YDR216w  2
YDR399w  1
YDR406w  9
YDR409w  1
YCR040w  1
YKL218c  1
YKL219w  14
YKL220c  6
YKL221w  2
YKL222c  3
YKL223w  5
YKL224c  22
YKR001c  2
YKR003w  5
YBR104w  6
YBR105c  1
YKR013w  2
YKR014c  13
....................................
Max : YDR477w  77
SC/CE
bestscce (SC / CE)
YAL002w  1176  C42C1.4   HS  2e-15
YAL003w  206   F54H12.6  HS  4e-22
YAL004w  215   -         NS
YAL005c  642   F26D10.3  HS  e-172
YAL007c  215   F57B10.5  HS  9e-08
YAL009w  259   F16D3.7   IS  0.013
YAL019w  1131  M03C11.8  HS  7e-92
YAL020c  333   F07C3.4   IS  7e-04
YAL021c  837   ZC518.3   HS  5e-47

allscce (SC / CE)
YAL002w  1176  C42C1.4   HS  2e-15
YAL003w  206   F54H12.6  HS  4e-22
YAL003w  206   Y41E3.10  HS  2e-17
YAL004w  215   -         NS
YAL005c  642   F26D10.3  HS  e-172
YAL005c  642   F44E5.4   HS  e-153
YAL005c  642   F44E5.5   HS  e-153
YAL005c  642   C12C8.1   HS  e-152
YAL005c  642   C15H9.6   HS  e-148
YAL005c  642   F43E2.8   HS  e-144
YAL005c  642   C37H5.8   HS  e-104
YAL005c  642   F11F1.1   HS  1e-77
YAL005c  642   F54C9.2   HS  4e-51
YAL005c  642   K09C4.3   HS  4e-47
YAL005c  642   T28F3.2   HS  2e-45
YAL005c  642   C30C11.4  HS  7e-43
YAL005c  642   T24H7.2   HS  2e-34
YAL005c  642   T14G8.3   HS  8e-33

CE/SC
bestcesc ( CE / SC)
C42C1.4   1259  YAL002w  HS  8e-16
F54H12.6  213   YAL003w  HS  4e-20
F26D10.3  640   YER103w  HS  e-174
F26D10.3  640   YER103w  HS  e-174
F57B10.5  203   YAL007c  HS  7e-13
F16D3.7   516   YHL003c  IS  9e-04
M03C11.8  1038  YAL019w  HS  2e-87
AC3.1     356   -        NS
AC3.2     949   YLR189c  IS  0.038
AC3.3     425   -        NS
AC3.4     600   YNL326c  HS  1e-12

allcesc (CE / SC )
C42C1.4   1259  YAL002w  HS  8e-16
F54H12.6  213   YAL003w  HS  4e-20
F26D10.3  640   YER103w  HS  e-174
F26D10.3  640   YBL075c  HS  e-174
F26D10.3  640   YLL024c  HS  e-172
F26D10.3  640   YAL005c  HS  e-171
F26D10.3  640   YJL034w  HS  e-141
F26D10.3  640   YDL229w  HS  e-129
F26D10.3  640   YNL209w  HS  e-129
F26D10.3  640   YJR045c  HS  e-100
F26D10.3  640   YEL030w  HS  2e-97
F26D10.3  640   YLR369w  HS  1e-83
F26D10.3  640   YPL106c  HS  2e-45
F26D10.3  640   YBR169c  HS  5e-45
F26D10.3  640   YHR064c  HS  8e-36
F26D10.3  640   YKL073w  HS  3e-22

Orthologs
segmatchSCCE
Test     siz   Hit       siz   e-val   %id  %sim  gap  Ssiz  dT   eT    dH   eH
YAL002w  1176  C42C1.4   1259  5e-14   16   44    7    674   438  1111  547  1196
YAL005c  642   F26D10.3  640   1e-159  73   84    0    605   3    607   5    613
YAL005c  642   F44E5.5   645   1e-142  63   79    0    605   3    607   5    611
YAL005c  642   F44E5.4   645   1e-142  63   79    0    605   3    607   5    611
YAL005c  642   C12C8.1   643   1e-141  62   79    0    605   3    607   5    611
YAL005c  642   C15H9.6   661   1e-137  60   78    1    603   5    607   36   641
YAL005c  642   F43E2.8   657   1e-134  58   76    1    606   1    606   29   637
YAL005c  642   C37H5.8   657   1e-96   46   67    2    606   2    607   31   632
YAL005c  642   F11F1.1b  607   1e-73   36   60    0    599   4    602   2    600
YAL005c  642   F11F1.1a  614   8e-72   36   60    2    599   4    602   2    607
YAL005c  642   F54C9.2   469   3e-47   38   66    2    379   2    380   52   433
YAL005c  642   K09C4.3   310   2e-43   71   88    0    186   4    189   6    192
YAL005c  642   K09C4.3   310   1e-04   54   70         61    327  387   189  249
YAL005c  642   C30C11.4  776   1e-39   26   50    8    600   5    604   4    647
YAL005c  642   T24H7.2   925   1e-31   24   50    3    506   4    509   26   548
YAL005c  642   T14G8.3   926   3e-30   24   51    6    510   4    513   28   560
dT, eT: start and end of the matching segment in the test protein; dH, eH: start and end in the hit; Ssiz: segment length in the test protein (eT - dT + 1).
[Figure: genes grouped into partitions (P7.1, P4.1, P4.2) and, within partitions, into MCL clusters (P7.1.C4.1, P7.1.C3.1, P4.2.C3.1).]
A set of genes defines a "partition"
if and only if
a) each member of the set has at
least one significant match with
another member of the set;
b) no member of the set has
significant matches with members
not included in the set;
c) the set is minimal.
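Read as graph theory, conditions a) to c) describe the connected components of the graph whose nodes are genes and whose edges are significant matches. The sketch below illustrates that reading; it is an assumption, not the author's code, and the function name `partitions` is hypothetical. The match pairs are taken from the examples in these slides.

```python
from collections import defaultdict


def partitions(matches):
    """Group genes into partitions: connected components of the match graph."""
    graph = defaultdict(set)
    for a, b in matches:
        # A significant match links two genes regardless of direction.
        graph[a].add(b)
        graph[b].add(a)
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            gene = stack.pop()
            if gene in component:
                continue
            component.add(gene)
            stack.extend(graph[gene] - component)
        seen |= component
        result.append(sorted(component))
    return result


# Significant S. cerevisiae matches quoted in the slides (subset).
matches = [
    ("YKL212w", "YIL002c"), ("YIL002c", "YOR109w"), ("YIL002c", "YNL106c"),
    ("YIL002c", "YOL065c"), ("YKL212w", "YNL325c"), ("YAL007c", "YOR016c"),
]
parts = partitions(matches)
```

Genes with no significant match never enter the graph; they correspond to the entries marked "singleton" in the gene dictionary.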
Partitions/MCL Clustering
• MCL: Markov Cluster algorithm. Stijn van Dongen: A cluster algorithm for graphs. http://micans.org/mcl/
• Each gene is identified by its partition and its MCL cluster
Markov Cluster (MCL) algorithm
http://micans.org/mcl/
• Traditionally, most methods deal with similarity relationships in
a pairwise manner, while graph theory allows classification of proteins
into families based on a global treatment of all relationships in
similarity space simultaneously.
• Similarities between proteins are arranged in a matrix that represents a
connection graph.
• Nodes of the graph represent proteins, and edges represent the sequence
similarity that connects such proteins.
• A weight is assigned to each edge by taking -log10(E-value) obtained
by a BLAST comparison.
• These weights are transformed into probabilities associated with a
transition from one protein to another within this graph.
• This matrix is passed through iterative rounds of matrix multiplication
and matrix inflation until there is little or no net change in the matrix.
The final matrix is then interpreted as a protein family clustering.
• The inflation value parameter of the MCL algorithm is used to
control the granularity of these clusters.
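The steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the mcl program itself: the function names are hypothetical, each self-loop is set to the largest weight in its column (one common convention), E-values of 0.0 are capped at an arbitrary weight of 200, and clusters are read from the non-zero rows of the converged matrix.

```python
import math


def edge_weight(evalue, cap=200.0):
    """-log10(E-value), clamped to [0, cap] so that E = 0.0 stays finite."""
    if evalue <= 0.0:
        return cap
    return max(0.0, min(-math.log10(evalue), cap))


def normalize_columns(m):
    """Scale every column to sum to 1 (transition probabilities)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mcl(n, evalues, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster proteins 0..n-1 given {(i, j): E-value} for significant hits."""
    w = [[0.0] * n for _ in range(n)]
    for (i, j), e in evalues.items():
        w[i][j] = w[j][i] = edge_weight(e)
    for j in range(n):
        # Assumed convention: self-loop equal to the largest weight in the column.
        w[j][j] = max(w[i][j] for i in range(n)) or 1.0
    m = normalize_columns(w)
    for _ in range(max_iter):
        # Expansion: matrix multiplication (two-step random walks).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then column re-normalization.
        inflated = normalize_columns([[x ** inflation for x in row]
                                      for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Each non-empty row of the converged matrix lists the members of a cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


# Two strongly similar pairs joined by one weak hit (hypothetical E-values).
clusters = mcl(4, {(0, 1): 1e-50, (2, 3): 1e-50, (1, 2): 1e-5})
```

Raising `inflation` tends to split families into finer clusters, which is the granularity control mentioned in the last bullet.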
blastp proteome
specific comparisons
all protein
significant
hits
Adapted from
Enright et al. NAR 2002.
Example of Partition/MCL clustering
P6: 19 matches
Total number of distinct ORFs= 6
-------------------
ORF      size  match    e-value
YKL212w  623   YIL002c  4e-11
YIL002c  946   YOR109w  7e-94
YIL002c  946   YNL106c  5e-90
YIL002c  946   YOL065c  3e-10
YNL106c  1183  YIL002c  1e-89
YOR109w  1107  YIL002c  1e-90
YKL212w  623   YOR109w  2e-34
YKL212w  623   YNL106c  3e-34
YKL212w  623   YNL325c  8e-29
YNL106c  1183  YKL212w  1e-33
YNL325c  879   YKL212w  6e-25
YOR109w  1107  YKL212w  2e-30
YNL106c  1183  YOR109w  0.0
YNL106c  1183  YNL325c  2e-22
YNL325c  879   YNL106c  1e-22
YOL065c  384   YNL106c  4e-10
YOR109w  1107  YNL106c  0.0
YNL325c  879   YOR109w  4e-20
YOR109w  1107  YNL325c  2e-16

ORF      Partition.Cluster  partition size  cluster size
YOL065c  P6.9.C6.48         6               6
YIL002c  P6.9.C6.48         6               6
YNL325c  P6.9.C6.48         6               6
YKL212w  P6.9.C6.48         6               6
YOR109w  P6.9.C6.48         6               6
YNL106c  P6.9.C6.48         6               6
Example of Partition/MCL clustering
P6: 22 matches
Total number of distinct ORFs= 6
-------------------
ORF      size  match    e-value
YBR208c  1835  YBR218c  2e-53
YBR208c  1835  YGL062w  1e-52
YBR208c  1835  YMR207c  6e-34
YBR208c  1835  YNR016c  5e-33
YBR208c  1835  YMR293c  3e-10
YBR218c  1180  YBR208c  5e-51
YGL062w  1178  YBR208c  1e-47
YMR293c  464   YBR208c  1e-11
YNR016c  2233  YBR208c  6e-34
YBR218c  1180  YGL062w  0.0
YGL062w  1178  YBR218c  0.0
YBR218c  1180  YNR016c  6e-30
YBR218c  1180  YMR207c  4e-29
YMR207c  2123  YBR218c  4e-35
YNR016c  2233  YBR218c  4e-36
YGL062w  1178  YMR207c  2e-27
YGL062w  1178  YNR016c  3e-27
YMR207c  2123  YGL062w  1e-35
YNR016c  2233  YGL062w  3e-35
YMR207c  2123  YBR208c  1e-34
YMR207c  2123  YNR016c  0.0
YNR016c  2233  YMR207c  0.0

ORF      Partition.Cluster  partition size  cluster size
YMR293c  P6.8.C4.88         6               4
YBR218c  P6.8.C4.88         6               4
YGL062w  P6.8.C4.88         6               4
YBR208c  P6.8.C4.88         6               4
YMR207c  P6.8.C2.782        6               2
YNR016c  P6.8.C2.782        6               2
Large scale predicted proteome comparisons
Gene Dictionary
Columns: ORF, size, match, Partition, size, gene, then one column per species (S. cerevisiae, C. elegans, etc.) giving the hit and its e-value.
Entries recoverable from the slide:
YAL015c  399  HS  P2.140  2  /YOL043c 9e-71  /R10E4.5 6e-26
...      820  HS  P3.46
Rv0006   838  IS  singleton
thrA  /YJR139c 4e-31
gyrA  /
/K12D12.1 6e-09
Table : 541880 predicted proteins x 100 species
Protein conservation profiles (phylogenetic profiles)
Each gene has one 0/1 entry per species S1 ... Sn, with species grouped by domain (E, A, B).
Gene     S1 ... Sn
G1,1     100000000000000000000000000000000000000000000000
G2,1     111111111111111111111111111111111111111111111111
G3,1     111111111111111111111111111111111111111111111111
...
Gn1,1    000001110001000000000000000000000000000000000000
G1,2     000000000000000000010100000000000000000000000000
G2,2     000000000000000000000000000000000111000011100011
...
Gn2,2    111111110011111111111111011101110101111111111111
...
G1,n     011110100000000000000000001000000000000000000000
G2,n     011111100000000000000000000000000000000000000000
G3,n     011111100011111111100011011011110100111111101111
...
Gnp,n    100110000000000000000000000000000000000000000000
Table : 541880 predicted proteins x 100 species
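A profile of this kind is a string with one character per species. The sketch below shows one way to build such strings and to count the distinct ones; the function name, the gene names and the hit sets are hypothetical, and a gene is assumed to be marked as present in its own species.

```python
def conservation_profiles(hits, species):
    """Map each protein to a 0/1 string, one position per species in order."""
    return {
        protein: "".join("1" if s in found else "0" for s in species)
        for protein, found in hits.items()
    }


species = ["SC", "CE", "MJ", "HI"]  # codes from the species table
# Hypothetical S. cerevisiae genes and the species where they have a significant match.
hits = {
    "geneA": {"SC"},                    # specific to its own proteome
    "geneB": {"SC", "CE", "MJ", "HI"},  # conserved everywhere
    "geneC": {"SC", "CE"},
    "geneD": {"SC", "CE"},              # same profile as geneC
}
profiles = conservation_profiles(hits, species)
distinct_profiles = set(profiles.values())
```

Counting `distinct_profiles` per pair of species gives the quantity used for the "distinct shared conservation profiles" data matrix.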
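Ancestral weight matrix
[Diagram: square matrix W indexed by species i and j, with diagonal entries Wii and Wjj, off-diagonal entry Wij, and the nonspecific gene counts nsi and nsj.]
• Wii: weight of ancestral duplication;
• Wij: weight of ancestral conservation of i in j;
• nsi: nonspecific genes in species i.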
Ancestral duplication and ancestral conservation (Wij)
org    SC    SP    CE    DM    AG    CA    ATH   HS    MUS   FR    PF    ECUN
SC     40.5  63.9  17.5  27.1  22.3  65.9  23.4  22.9  27.3  18.0  22.5  35.8
SP     58.4  37.4  18.8  29.3  26.3  54.3  25.0  25.0  29.6  20.0  24.6  38.4
CE     38.1  46.6  65.2  51.9  50.6  35.5  27.5  44.6  54.4  42.4  24.8  34.8
DM     40.5  50.2  39.2  65.8  69.9  37.5  29.5  50.3  62.7  47.9  26.5  36.3
AG     40.9  50.2  39.8  73.1  59.5  38.0  30.6  50.2  60.3  48.7  26.5  36.0
CA     71.8  65.5  18.4  27.7  25.7  35.8  24.3  23.2  27.8  18.5  22.3  35.7
ATH    40.3  47.8  21.7  31.5  30.3  37.0  83.6  25.6  29.7  21.9  26.2  33.4
HS     43.0  53.3  40.0  61.3  54.5  39.7  32.1  66.7  90.8  68.8  28.2  37.7
MUS    41.7  52.5  39.5  62.1  54.7  39.1  31.5  76.8  77.8  67.7  27.6  37.2
FR     42.0  52.6  40.0  60.7  59.9  39.5  32.7  68.7  81.8  63.4  27.6  37.4
PF     25.9  31.2  13.1  19.3  15.9  22.2  16.3  17.2  21.0  13.2  28.3  28.9
ECUN   19.5  23.4  8.9   13.1  10.8  16.2  11.4  12.0  15.2  9.0   13.6  26.1
MJ     11.5  13.3  4.9   6.7   6.0   10.2  6.0   4.8   5.6   3.7   8.7   15.4
MTH    13.6  16.2  4.6   7.4   7.6   11.2  8.0   5.1   6.1   4.0   8.3   15.2
AF     14.4  16.5  5.9   8.2   8.7   11.8  8.7   5.6   6.6   4.5   8.6   15.4
PH     16.3  18.7  5.0   7.1   9.2   11.1  9.7   5.2   6.0   4.1   7.9   15.3
PA     14.3  15.2  5.4   7.5   7.3   11.9  7.4   5.5   6.4   4.3   8.3   15.9
APEM   15.5  20.1  4.8   7.3   10.6  10.3  9.4   5.2   5.9   3.9   7.2   14.9
TA     15.2  17.5  5.9   8.3   8.3   12.7  8.2   5.3   6.3   4.2   8.6   14.8
TV     15.4  17.8  6.2   8.3   8.7   13.3  8.3   5.6   6.8   4.4   8.7   15.0
H      14.8  17.7  5.8   8.3   9.8   12.0  10.2  5.5   6.6   4.5   8.0   13.9
SSP2   16.7  19.4  7.1   9.1   9.4   14.2  9.5   6.2   7.4   4.9   9.5   15.9
PFU    17.0  22.8  6.5   9.3   11.1  13.3  12.3  7.0   8.0   5.6   9.1   17.1
STO    18.6  23.1  6.8   8.6   11.4  13.7  11.1  5.9   7.1   4.5   9.1   15.7
PYAE   15.6  19.5  5.3   8.2   9.9   11.8  9.5   5.8   6.9   4.5   8.1   15.0
MA     16.0  18.9  7.1   10.8  12.5  14.7  9.7   7.4   8.7   6.4   9.8   17.0
MK     13.0  14.6  4.0   6.2   6.1   10.7  6.9   4.6   5.4   3.5   7.3   14.1
MMA    14.8  17.4  6.4   9.2   9.5   13.5  8.1   6.6   7.9   5.3   9.7   15.8
HI     13.0  14.3  4.8   7.3   8.5   11.1  8.7   4.4   5.4   4.0   8.2   8.7
…..
tnsp   74.4  79.2  49.7  76.4  81.0  72.6  58.8  78.7  93.7  72.8  42.3  48.1
[Figure: percent duplication per species (ancestral duplication and intra-species duplication), with species grouped by domain (E, A, B); y axis from 0 to 85. Summary values shown on the slide: mean = 52.1, 30.; std = 17.8, 11.7; Species: 38.4, 11.2.]
Specific and nonspecific proteins
Large scale proteome comparisons allow estimation of the numbers of:
• Specific proteins (genes): proteins that have no match outside
their own proteome (no homolog in other species).
• Non-specific proteins (genes): proteins that are conserved in at
least one other species (they have homologs outside their own proteome).
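Given, for each protein, the set of other species in which it has a significant match, the two classes follow directly from the definitions. The sketch below uses hypothetical protein names and hit sets; only the species codes come from the slides.

```python
def classify(proteome, external_hits):
    """Split a proteome into specific and non-specific proteins.

    external_hits maps a protein to the set of OTHER species in which it
    has at least one significant match; missing proteins have none.
    """
    specific = sorted(p for p in proteome if not external_hits.get(p))
    nonspecific = sorted(p for p in proteome if external_hits.get(p))
    return specific, nonspecific


# Hypothetical four-protein proteome.
proteome = ["p1", "p2", "p3", "p4"]
external_hits = {"p1": {"CE", "HS"}, "p2": set(), "p4": {"MJ"}}

specific, nonspecific = classify(proteome, external_hits)
percent_nonspecific = 100.0 * len(nonspecific) / len(proteome)
```

Note that a specific protein may still have paralogs inside its own proteome; only matches in other species are considered here.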
[Figure: specific and nonspecific proportions (0 to 100%) per species, with species grouped by domain. mean%: E 76.2; A 84.3; B 87.6.]
Species specific genes
[Figure: proportion of genes (0 to 100%) by category: same phylum, different phylum.]
Orthologs
100 species ==> 367143 orthologs
Si
a
Sj
b
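The slides do not spell out the ortholog criterion, so the sketch below assumes the widely used reciprocal best hit rule: gene a of species Si and gene b of species Sj are called orthologs when each is the other's best hit. The function name is hypothetical; the best hits are those listed in the bestscce and bestcesc tables of these slides.

```python
def reciprocal_best_hits(best_ab, best_ba):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b) for a, b in best_ab.items()
        if b is not None and best_ba.get(b) == a
    )


# Best hits SC -> CE and CE -> SC (None means no significant match, NS).
best_sc_ce = {
    "YAL002w": "C42C1.4", "YAL003w": "F54H12.6", "YAL004w": None,
    "YAL005c": "F26D10.3", "YAL007c": "F57B10.5", "YAL019w": "M03C11.8",
}
best_ce_sc = {
    "C42C1.4": "YAL002w", "F54H12.6": "YAL003w", "F26D10.3": "YER103w",
    "F57B10.5": "YAL007c", "M03C11.8": "YAL019w", "AC3.1": None,
}
orthologs = reciprocal_best_hits(best_sc_ce, best_ce_sc)
```

YAL005c is not paired with F26D10.3 under this rule, because the best yeast hit of F26D10.3 is YER103w, another member of the same HSP70 family.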
Structural orthologs according to the 3 domains
of life (100 species: 367143 orthologous genes)
[Pie chart: E 69%; B 21%; A 5%; the classes spanning several domains (EB, AB, EA, all) share the remaining 5%.]
Total Partitions: 37826
note: ~6% include genes from at least 2 domains of life.
Evolution by Module
(A. gambiae paralogs)
GST:
orthologs
Genome trees
The three-domain proposal based on the ribosomal
RNA tree. Woese et al. PNAS. 87:4576-4579. (1990)
The three-domain proposal, with continuous lateral gene transfer
among domains.
Doolittle Science 284:2124-2128. (1999)
Martin & Embley, Nature 431:134-137. (2004)
The two-empire proposal, separating
eukaryotes from prokaryotes and eubacteria
from archaebacteria. Mayr, E. PNAS
95:9720-9723. (1998).
The ring of life, incorporating lateral gene transfer but
preserving the prokaryote–eukaryote divide.
Rivera MC and Lake JA. Nature 431: 152-155. (2004)
The 1.2-Megabase Genome
Sequence of Mimivirus
Didier Raoult, Stéphane Audic, Catherine Robert,
Chantal Abergel, Patricia Renesto, Hiroyuki Ogata,
Bernard La Scola, Marie Suzan, Jean-Michel
Claverie.
Science, 306:1344-1350. (2004)
The tree was inferred with the use of a
maximum likelihood method based on
the concatenated sequences of seven
universally conserved protein
sequences: arginyl-tRNA synthetase,
methionyl-tRNA synthetase, tyrosyl-tRNA synthetase, RNA polymerase II
largest subunit, RNA polymerase II
second largest subunit, PCNA, and 5'-3' exonuclease.
The alignment contains 3164 sites
without insertions and deletions.
Bootstrap percentages are shown
along the branches.
Evolutionary biology: Early
evolution comes full circle.
Martin W, Embley TM.
Nature, 431; 134-137. (2004)
The ring of life provides evidence for a
genome fusion origin of eukaryotes
Rivera, M.C. & Lake, J.A. Nature, 431; 152-155. (2004)
“Our analyses indicate that the eukaryotic genome resulted from
a fusion of two diverse prokaryotic genomes, and therefore at the
deepest levels linking prokaryotes and eukaryotes, the tree of life
is actually a ring of life.”
Genomic Databases and the Tree of Life
Keith A. Crandall and Jennifer E. Buhay
Science, 306; 1144-1145. (2004)
Prospects for Building the Tree of Life
from Large Sequence Databases
Amy C. Driskell, Cécile Ané, J. Gordon Burleigh, Michelle M.
McMahon, Brian C. O'Meara, Michael J. Sanderson.
Science, 306; 1172-1174. (2004)
Species tree
• 16S/18S rRNA tree (Woese 1990);
• main difficulties include extensive incongruence between alternative
phylogenies generated from single-gene data sets.
Alternative solutions: integrative methods
• “supertree” (consensus tree from a set of individual gene
phylogenetic trees);
• “phylogenomic tree” based on concatenation of a gene sample
common to the considered species (S1 ... Sn);
• these methods suffer from difficulties related to phylogenetic tree construction:
global sequence alignment difficulties; substitution variations between species; ...
Genome trees
• The concept of genome tree is based on overall gene content similarity;
• Genome trees consider more than single gene information;
Gene tree - Species tree
[Diagram: a species tree and a gene tree for species A, B and C drawn along a time axis, with duplication and speciation events marked.]
Evolutionary processes include:
• Phylogeny* (descent from an ancestor);
• Expansion* (gene genesis, duplication, HGT);
• Exchange* (HGT);
• Deletion* (gene loss);
and selection, all acting on the species genome.
Universal tree (Woese 1990):
• 16S rRNA (most conserved sequences);
• main difficulties include extensive incongruence between alternative
phylogenies generated from single-gene data sets;
• a tree that takes into account the whole make-up of the species genomes?
Genome trees: data matrices
T = {Tij; i=1,n; j=1,n}, where n is the number of surveyed species
and Tij is the overall similarity score between species j and i.
• Ancestral duplication and ancestral conservation (541880 total proteins)
T = {Tij = wij = (number of proteins in j conserved in i)/size(j); i=1,n; j=1,n}
• Shared orthologous genes (442460 non-specific proteins)
sij = number of orthologs shared between i and j
T = {Tij = sij/size(j); i=1,n; j=1,n}
• Distinct shared conservation profiles (28365 / 184130 distinct conservation profiles)
sij = number of distinct conservation profiles shared between i and j
T = {Tij = sij/sjj; i=1,n; j=1,n}
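The sketch below illustrates the normalization Tij = sij/size(j) and one simple way to compare species from such a matrix. It is a simplification: the slides rely on correspondence analysis followed by classification, whereas this example uses a plain Euclidean distance between the columns of the matrix. The function names and the toy counts for species X and Y are hypothetical; the 4 x 4 percentage matrix is an excerpt of the Wij table given earlier in the slides.

```python
import math


def similarity_matrix(shared, sizes, species):
    """T[i][j] = 100 * shared[i][j] / size(j), i.e. a percentage of proteome j."""
    return {i: {j: 100.0 * shared[i][j] / sizes[j] for j in species}
            for i in species}


def column_distance(t, a, b, species):
    """Euclidean distance between the columns of species a and b."""
    return math.sqrt(sum((t[i][a] - t[i][b]) ** 2 for i in species))


def nearest(t, a, species):
    """Species whose column is closest to that of species a."""
    return min((s for s in species if s != a),
               key=lambda s: column_distance(t, a, s, species))


# Toy counts for two hypothetical species X and Y.
toy = similarity_matrix({"X": {"X": 50, "Y": 20}, "Y": {"X": 30, "Y": 40}},
                        {"X": 100, "Y": 80}, ["X", "Y"])

# Excerpt of the Wij table (rows i, columns j), in percent.
species = ["SC", "CA", "HS", "MUS"]
W = {
    "SC":  {"SC": 40.5, "CA": 65.9, "HS": 22.9, "MUS": 27.3},
    "CA":  {"SC": 71.8, "CA": 35.8, "HS": 23.2, "MUS": 27.8},
    "HS":  {"SC": 43.0, "CA": 39.7, "HS": 66.7, "MUS": 90.8},
    "MUS": {"SC": 41.7, "CA": 39.1, "HS": 76.8, "MUS": 77.8},
}
closest_to_sc = nearest(W, "SC", species)
closest_to_hs = nearest(W, "HS", species)
```

A full genome tree would apply a clustering or tree-building step to the complete distance matrix over all species.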
Genome tree: Ancestral duplication and conservation
Tekaia, F., Lazcano, A. and B. Dujon (1999). Genome Res. 12:17-25.
[Figure: genome tree; B, A and E mark the bacterial, archaeal and eukaryotic clusters. Leaf labels, in figure order:]
Agrobacterium tumefaciens
Sinorhizobium meliloti
Mesorhizobium loti
Mycobacterium leprae
Deinococcus radiodurans
Synechocystis sp.
Yersinia pestis
Vibrio cholerae
Salmonella Typhi
Pseudomonas aeruginosa
Escherichia coli
Haemophilus influenzae
Xylella fastidiosa
Neisseria meningitidis
Campylobacter jejuni
Aquifex aeolicus
Thermotoga maritima
Bacillus halodurans
Bacillus subtilis
Listeria monocytogenes EGD
Staphylococcus aureus N315
Staphylococcus aureus Mu50
Listeria innocua
Streptococcus pyogenes M1
Mycobacterium tuberculosis CDC1551
Mycobacterium tuberculosis
Chlamydia trachomatis
Helicobacter pylori
Rickettsia prowazekii
Treponema pallidum
Buchnera sp.
Borrelia burgdorferi
Chlamydia pneumoniae
Mycoplasma pneumoniae
Mycoplasma genitalium
Pyrococcus furiosus
Pyrococcus horikoshii
Pyrococcus abyssi
Methanopyrus kandleri AV19
Archaeoglobus fulgidus
Methanobacterium thermoautotrophicum
Methanococcus jannaschii
Methanosarcina acetivorans (C2A)
Methanosarcina mazei strain Goe1
Sulfolobus solfataricus P2
Sulfolobus tokodaii
Thermoplasma acidophilum
Halobacterium sp. NRC-1
Aeropyrum pernix K1
Pyrobaculum aerophilum
Thermoplasma volcanium
Homo sapiens
Fugu rubripes
Mus musculus
Drosophila melanogaster
Anopheles gambiae
Caenorhabditis elegans
Arabidopsis thaliana
Plasmodium falciparum
E. cuniculi
Candida albicans
Schizosaccharomyces pombe
Saccharomyces cerevisiae
• “whole genome” species clustering tree;
• species are clustered into 3 phylogenetic domains;
• bacterial species cluster with archaeal species;
• similar species cluster together;
• low resolution of deep clustering;
• evolutionary side effects are taken into account.
Shared orthologous genes (partial)
sij: number of shared orthologs between species i (rows) and j (columns).
org      SC    SP    CE    DM    AG    CA   ATH    HS   MUS    FR    PF  ECUN
SC        0  2532  1533  1660  1671  3371  1582  1789  1733  1731   890   600
SP     2532     0  1753  1917  1907  2588  1754  2060  2032  2024  1008   645
CE     1533  1753     0  3910  3869  1611  1902  4036  3994  4047  1015   580
DM     1660  1917  3910     0  7018  1728  2094  5057  5147  5035  1106   616
AG     1671  1907  3869  7018     0  1738  2160  5016  5013  5059  1085   617
CA     3371  2588  1611  1728  1738     0  1590  1850  1824  1827   873   595
ATH    1582  1754  1902  2094  2160  1590     0  2404  2406  2399  1067   539
HS     1789  2060  4036  5057  5016  1850  2404     0 14053 10286  1185   638
MUS    1733  2032  3994  5147  5013  1824  2406 14053     0 10304  1169   632
FR     1731  2024  4047  5035  5059  1827  2399 10286 10304     0  1146   626
PF      890  1008  1015  1106  1085   873  1067  1185  1169  1146     0   453
ECUN    600   645   580   616   617   595   539   638   632   626   453     0
MJ      238   233   214   216   242   230   279   223   216   217   169   142
MTH     254   247   237   247   278   245   306   251   248   249   171   141
AF      261   255   254   260   303   248   310   260   263   265   182   151
PH      251   245   250   259   297   237   281   273   258   271   187   155
PA      267   261   255   268   311   256   312   276   273   278   189   156
APEM    212   233   228   228   251   215   242   248   237   230   165   136
TA      264   260   252   254   279   261   298   268   264   261   182   141
TV      263   255   256   249   276   258   296   260   258   270   184   138
H       255   264   258   249   284   248   318   271   267   272   173   140
SSP2    302   317   293   292   326   300   360   310   309   311   200   155
PFU     264   284   256   275   324   286   316   292   274   280   195   150
STO     281   291   273   263   313   278   329   293   282   298   196   143
PYAE    245   258   236   249   285   238   278   258   246   256   170   143
MA      303   316   298   293   368   301   369   329   326   326   200   161
MK      210   214   195   204   216   211   244   205   202   195   160   125
MMA     289   298   276   280   338   280   349   305   299   297   194   160
HI      268   273   231   243   388   268   382   259   259   267   181    86
Genome tree: shared orthologs
Tij = 100*sij/size(j)
[Figure: genome tree; B, A and E mark the bacterial, archaeal and eukaryotic clusters. Leaf labels, in figure order:]
Salmonella Typhi
Escherichia coli
Yersinia pestis
Vibrio cholerae
Pseudomonas aeruginosa
Synechocystis sp.
Thermotoga maritima
Aquifex aeolicus
Deinococcus radiodurans
Xylella fastidiosa
Neisseria meningitidis
Haemophilus influenzae
Rickettsia prowazekii
Buchnera sp.
Helicobacter pylori
Campylobacter jejuni
Sinorhizobium meliloti
Mesorhizobium loti
Agrobacterium tumefaciens
Borrelia burgdorferi
Treponema pallidum
Chlamydia pneumoniae
Chlamydia trachomatis
Mycoplasma pneumoniae
Mycoplasma genitalium
Listeria innocua
Listeria monocytogenes EGD
Streptococcus pyogenes M1
Bacillus subtilis
Bacillus halodurans
Staphylococcus aureus Mu50
Staphylococcus aureus N315
Mycobacterium tuberculosis CDC1551
Mycobacterium tuberculosis
Mycobacterium leprae
Methanopyrus kandleri AV19
Methanococcus jannaschii
Methanobacterium thermoautotrophicum
Archaeoglobus fulgidus
Halobacterium sp. NRC-1
Methanosarcina mazei strain Goe1
Methanosarcina acetivorans (C2A)
Pyrococcus furiosus
Pyrococcus abyssi
Pyrococcus horikoshii
Sulfolobus solfataricus P2
Sulfolobus tokodaii
Pyrobaculum aerophilum
Aeropyrum pernix K1
Thermoplasma acidophilum
Thermoplasma volcanium
Mus musculus
Homo sapiens
Fugu rubripes
Caenorhabditis elegans
Drosophila melanogaster
Anopheles gambiae
Arabidopsis thaliana
Plasmodium falciparum
E. cuniculi
Schizosaccharomyces pombe
Candida albicans
Saccharomyces cerevisiae
• 3 phylogenetic domains;
• bacterial species cluster with archaeal species;
• similar species cluster together;
• better resolution of deep species clustering;
• evolutionary side effects (HGT, duplication, loss) are not completely eliminated.
Conservation profiles
p 011111100011111111100011011011110100111111101111
• A “conservation profile” is an n-component vector describing the conservation pattern of a protein across n species. Each component is 0 or 1, according to the absence or presence of a homolog in the corresponding species.
• A conservation profile is the trace of protein evolutionary histories jointly captured in a set of species (a multidimensional feature);
• Conservation profiles are signatures of evolutionary relationships;
• Considering only distinct conservation profiles reduces the effects of noisy evolutionary processes (less noisy phylogenetic signals);
• Each distinct conservation profile brings an equal amount of information, regardless of the number of genes that share that profile;
• => conservation profiles give evidence of evolutionary history in a set of species
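As a minimal sketch of the definition above, a conservation profile can be built from the set of species in which a protein has a homolog. The five-species panel and the helper name `conservation_profile` are assumptions made for this illustration; how homologs are detected is outside the scope of the sketch.

```python
# Build a 0/1 conservation profile for one protein.
# The species panel and the homolog set are hypothetical examples.

def conservation_profile(homolog_species, species_order):
    """Return a 0/1 string with a 1 wherever the protein has a homolog."""
    return "".join("1" if s in homolog_species else "0" for s in species_order)

species_order = ["SC", "SP", "CE", "DM", "HS"]   # fixed column order S1..Sn

# Suppose a protein has homologs in SP, DM and HS only.
profile = conservation_profile({"SP", "DM", "HS"}, species_order)
print(profile)
```

The column order must be the same for every protein, otherwise profiles from different proteins cannot be compared.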
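Distinct conservation profiles
[Figure: four-step construction of distinct conservation profiles for a species Si among S1, ..., Sn]
Step 1: species Si against S1, ..., Sj, ..., Sn.
Step 2: conservation profiles of the proteins g(i,1), ..., g(i,p) of Si
         S1 ........................ Sn
g(i,1)   01 000 000 0000 000 000 000 00
g(i,2)   10 000 001 0101 000 000 000 00
...
g(i,p)   01 000 000 0000 000 010 000 00
Step 3: distinct profiles of Si, with their weights
S1 ........................ Sn   weight
01 000 000 0000 000 000 000 00   w(i,1)
10 000 001 0101 000 000 000 00   w(i,2)
...
01 000 000 0000 000 010 000 00   w(i,k)
Step 4: distinct profiles over all species, with their weights
S1 ........................ Sn   weight
01 000 000 0000 000 000 000 00   W1
10 000 001 0101 000 000 000 00   W2
...
01 000 000 0000 000 010 000 00   Wt
Wl = {wi,m; i=1,n; m=1,n} is the weight of the conservation profile l.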
Distinct conservation profiles
100 species ===> 541880 proteins
• 442460 non-specific proteins, i.e. conservation profiles
• drastic reduction: 184130 distinct conservation profiles
• 28365 distinct conservation profiles associated with at least 2 proteins from distinct species
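The reduction from per-protein profiles to distinct profiles can be sketched as follows. The protein list is invented, and two points are assumptions of this sketch rather than statements from the slides: the overall weight of a profile is taken as the total number of proteins carrying it, and "at least 2 proteins from distinct species" is read as "carried by proteins of at least two different species".

```python
from collections import Counter, defaultdict

# (species, conservation profile) for each protein; toy data only.
proteins = [
    ("SC", "110"), ("SC", "110"), ("SC", "101"),
    ("SP", "110"),
    ("CE", "011"),
]

# Per-species weights: how many proteins of a species share a given profile.
w = Counter(proteins)

# Overall weights: how many proteins, over all species, share a given profile.
W = Counter(profile for _, profile in proteins)

# Species contributing at least one protein to each distinct profile.
species_per_profile = defaultdict(set)
for species, profile in proteins:
    species_per_profile[profile].add(species)

# Distinct profiles supported by proteins from at least two species.
shared = {p for p, sp in species_per_profile.items() if len(sp) >= 2}

print(len(proteins), "proteins ->", len(W), "distinct profiles ->", len(shared), "shared")
```

Counting each distinct profile once, whatever its weight, is what gives every profile the same influence regardless of gene family size.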
Distribution of distinct conservation profiles according to the three phylogenetic domains
[Pie chart; E = Eukarya, A = Archaea, B = Bacteria]
E    1%
A    2%
B    11%
EA   3%
EB   11%
AB   33%
EAB  39%
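Assigning a profile to one of these seven classes only requires knowing which domain each column belongs to. The sketch below assumes a hypothetical six-species panel (two species per domain) and invented profiles; `domain_class` is an illustrative helper name.

```python
from collections import Counter

def domain_class(profile, domain_of_column):
    """Return the domains (in the order E, A, B) in which the profile has a 1."""
    present = {d for bit, d in zip(profile, domain_of_column) if bit == "1"}
    return "".join(d for d in "EAB" if d in present)

# Hypothetical panel: columns 0-1 eukaryotes, 2-3 archaea, 4-5 bacteria.
domain_of_column = ["E", "E", "A", "A", "B", "B"]

distinct_profiles = ["100010", "111111", "001100", "110000", "011011"]
distribution = Counter(domain_class(p, domain_of_column) for p in distinct_profiles)

for cls, count in sorted(distribution.items()):
    print(cls, f"{100 * count / len(distinct_profiles):.0f}%")
```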
Occurrences of shared conservation profiles
• Tij = sij, where sij is the number of occurrences of distinct shared conservation profiles between species i and j;
• Tij = sij/sjj.
Example profiles, with the species columns S1 ... Sn grouped by domain (E, A, B):
       E               A               B
S1..............I.............I................Sn
100000000000000000000000000000000000000000000000
111111111111111111111111111111111111111111111111
000001110001000000000000000000000000000000000000
000000000000000000000000000000000111000011100011
................................................
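One plausible reading of sij, used here only as an assumption, is the number of distinct profiles in which both species i and species j are marked present; sjj is then the number of distinct profiles involving species j. The profiles below are invented and `shared_profile_counts` is a hypothetical helper.

```python
def shared_profile_counts(profiles):
    """Count, for each species pair, the distinct profiles where both are present."""
    n = len(profiles[0])
    s = [[0] * n for _ in range(n)]
    for profile in set(profiles):            # each distinct profile counts once
        present = [k for k, bit in enumerate(profile) if bit == "1"]
        for i in present:
            for j in present:
                s[i][j] += 1
    return s

# Toy profiles over three species; "110" appears twice but is counted once.
profiles = ["110", "111", "011", "110"]
s = shared_profile_counts(profiles)

for row in s:
    print(row)
```

With this reading the count matrix is symmetric and its diagonal holds the per-species totals, which is the shape of the table of occurrences shown on the next slide.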
Occurrences of shared distinct conservation profiles
sij: number of distinct conservation profiles shared by species i (rows) and j (columns).
spec     SC    SP    CE    DM    AG    CA   ATH    HS   MUS    FR    PF  ECUN
SC     2328   387   239   262   274   400   338   285   299   288   146    96
SP      387  2208   267   301   317   351   377   318   334   320   152   102
CE      239   267  3153   575   506   284   364   642   656   670   188   116
DM      262   301   575  2747   653   305   416   718   729   725   203   124
AG      274   317   506   653  4052   269   477   612   657   650   165   107
CA      400   351   284   305   269  1906   315   345   362   338   171   107
ATH     338   377   364   416   477   315  5762   451   477   469   190   110
HS      285   318   642   718   612   345   451  3813  1511  1134   231   127
MUS     299   334   656   729   657   362   477  1511  4134  1140   229   133
FR      288   320   670   725   650   338   469  1134  1140  4280   215   132
PF      146   152   188   203   165   171   190   231   229   215  1251    95
ECUN     96   102   116   124   107   107   110   127   133   132    95   572
MJ       41    46    32    39    48    45    60    39    41    39    21    13
MTH      54    56    40    53    63    53    73    51    54    50    30    21
AF       56    52    57    62    78    54    74    64    66    65    31    19
PH       41    41    46    45    58    44    59    47    51    47    24    14
PA       49    47    51    48    56    53    72    51    52    50    25    16
APEM     51    51    48    51    65    51    63    57    60    54    29    17
TA       55    59    63    61    72    57    83    66    68    65    31    19
TV       58    56    65    59    68    52    82    61    66    65    29    18
H        65    68    64    65    77    61   101    71    73    71    34    23
SSP2     71    75    73    72    87    70    95    80    87    76    32    20
PFU      52    57    57    51    64    57    73    56    62    56    28    18
STO      59    59    65    67    71    56    75    65    66    64    28    17
PYAE     59    56    48    53    73    53    81    60    67    62    24    15
MA       71    75    76    83   102    84   113    85    93    85    44    33
MK       43    45    33    40    48    44    56    38    41    36    21    12
MMA      77    72    65    73    89    76   105    74    81    66    41    28
HI       71    76    70    67   101    79   116    74    74    78    46    23
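As a worked example of the normalization Tij = sij/sjj, the sketch below takes the HS, MUS and FR entries of the table of shared distinct conservation profiles (row and column labels as given there) and divides each column by its diagonal value. Only the three-species subset and the variable names are choices made for this illustration.

```python
labels = ["HS", "MUS", "FR"]

# s[i][j] copied from the HS / MUS / FR rows and columns of the table.
s = [
    [3813, 1511, 1134],
    [1511, 4134, 1140],
    [1134, 1140, 4280],
]

n = len(labels)
# Column j is divided by s[j][j], so every diagonal entry becomes 1.
T = [[s[i][j] / s[j][j] for j in range(n)] for i in range(n)]

for label, row in zip(labels, T):
    print(f"{label:4s}", " ".join(f"{value:.3f}" for value in row))
```

Although sij is symmetric, Tij is not: the 1511 profiles shared by HS and MUS amount to roughly 0.37 of the MUS total (1511/4134) but roughly 0.40 of the HS total (1511/3813).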
[Figure panels: Profiles, Conservation, Orthologs]
• Tekaia, F. and B. Dujon (1999). Pervasiveness of gene conservation and persistence of duplicates in cellular genomes. Journal of Molecular Evolution, 49:591-600.
• Tekaia, F., Lazcano, A. and B. Dujon (1999). Genome tree as revealed from whole proteome comparisons. Genome Res. 12:17-25.
• Tekaia, F., Gordon, S.V., Garnier, T., Brosch, R., Barrel, B.G. and S.T. Cole (1999). Analysis of the proteome of Mycobacterium tuberculosis in silico. Tubercle and Lung Disease, 79:329-342.
• Genolevures program:
- F. Tekaia, G. Blandin, A. Malpertuy, et al. (2000). Methods and strategies used for sequence analysis and annotation. FEBS Lett. 487(1):17-30.
- A. Malpertuy, F. Tekaia, S. Casaregola, et al. (2000). «Yeast specific» genes. FEBS Lett. 487(1):113-121.
- G. Blandin, P. Durrens, F. Tekaia, et al. (2000). The genome of Saccharomyces cerevisiae revisited. FEBS Lett. 487(1):31-36.
• Tekaia, F., Yeramian, E. and B. Dujon (2002). Amino acid composition of genomes, lifestyle of organisms and evolutionary trends: a global picture with correspondence analysis. Gene, 297:51-60.
• Tekaia, F. and E. Yeramian (in prep.). Genome tree based on conservation profiles.
Systematic analysis of completely sequenced organisms:
http://www.pasteur.fr/~tekaia/sacso.html