Construction of Genome Trees from Conservation Profiles of Proteins Fredj Tekaia Edouard Yeramian Institut Pasteur [email protected].

Construction of Genome Trees from
Conservation Profiles of Proteins
Fredj Tekaia
Edouard Yeramian
Institut Pasteur
[email protected]
Outline
• Species tree construction and difficulties;
• Post genome era species tree construction;
• Conservation profiles;
• Genome tree construction based on conservation
profiles;
• Conclusions;
• References.
Species tree - Tree Of Life
• 16S/18S rRNA tree (Woese 1990);
Woese and others have used rRNA comparisons to
construct a “Tree Of Life” showing the evolutionary
relationships of a wide variety of organisms.
The « Tree Of Life » has long served as a useful tool for describing
the history and relationships of organisms over evolutionary time.
One species is represented as a branching point, or node, on the tree, and
the branches represent paths of descent from a parental node.
Martin & Embley. Nature 431:134-7. (2004)
The three-domain proposal based on the ribosomal
RNA tree. Woese et al. PNAS. 87:4576-4579. (1990)
The three-domain proposal, with continuous
lateral gene transfer among domains.
Doolittle. Science 284:2124-8. (1999)
The two-empire proposal, separating
eukaryotes from prokaryotes and
eubacteria from archaebacteria.
Mayr, E. PNAS 95:9720-23. (1998)
The ring of life, incorporating lateral gene
transfer but preserving the prokaryote
eukaryote divide.
Rivera & Lake JA. Nature 431: 152-5. (2004)
Genomic Databases and the Tree of Life
Keith A. Crandall and Jennifer E. Buhay. Science 306:1144-1145. (2004)
Prospects for Building the Tree of Life from Large Sequence Databases
Driskell et al. Science 306:1172-1174. (2004)
The 1.2-Megabase Genome Sequence of Mimivirus
Raoult et al. Science 306:1344-1350. (2004)
Pennisi, E. (1998). Genome data shake tree of life.
Science 280:672-4.
New genome sequences are mystifying evolutionary
biologists by revealing unexpected connections between
microbes thought to have diverged hundreds of millions
of years ago.
This suggests constructing species trees from their whole gene content.
Genome phylogeny based on gene content
Snel, Bork, Huynen. Nature Genetics 21:108-110. (1999)
Tekaia, Lazcano & Dujon. Genome Research 9:550-7. (1999)
[Figure: tree panels with groups labeled B, A and E.]
Complete genomes (14-11-2006)
• 2208 projects
• 460 published (B: 387; A: 29; E: 44)
• 1054 prokaryotes
• 631 eukaryotes
http://www.genomesonline.org/
Gene tree - Species tree
[Figure: a gene tree for A, B and C, with duplication and speciation events along a time axis, shown next to the species tree for A, B and C. From Genomes, 2nd edition, T.A. Brown (2002).]
Problems with species tree construction
• The main difficulty in species tree construction is the extensive incongruence between alternative phylogenies generated from single-gene data sets:
- genes don't evolve at the same rate nor in the same way;
- the evolutionary history inferred from one gene may differ from what another gene appears to show.
Alternative solutions: integrative methods
• “supertree”
The supertree approach estimates phylogenies for subsets
of genes with good overlap, then combines these subtree
estimates into a supertree.
• Depends on the ability to distinguish between orthologs and paralogs;
• Supertree approaches are controversial, in part because the methodology results in a degree of disconnection between the underlying genetic data and the final tree produced.
Bininda-Emonds et al. 2002
• “phylogenomic tree”
(based on concatenation of a gene sample common to the considered species S1, ..., Sn);
• genes don't evolve at the same rate nor in the same way;
• a limited number of genes are shared among all species;
The tree of one percent.
Dagan and Martin. Genome Biology 7:118. (2006)
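The limit on shared genes can be illustrated with a minimal sketch. The gene sets are invented toy data and `core_genes` is a hypothetical helper, not part of any cited method; it only shows that the pool of genes usable for concatenation is the intersection of all gene sets, which can be small next to the total gene content.

```python
from functools import reduce

def core_genes(gene_sets):
    """Return the genes present in every species (candidates for concatenation)."""
    return reduce(set.intersection, (set(g) for g in gene_sets.values()))

# Toy gene content per species (invented for illustration only).
genomes = {
    "S1": {"g1", "g2", "g3", "g4"},
    "S2": {"g1", "g2", "g4", "g5"},
    "S3": {"g1", "g3", "g4", "g6"},
}

shared = core_genes(genomes)
all_genes = set().union(*genomes.values())

# Fraction of the total gene content that is common to all species.
shared_fraction = len(shared) / len(all_genes)
print(sorted(shared), shared_fraction)
```

In this toy case only two of the six genes are common to all three species; adding a species can only keep the shared set the same size or shrink it.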
More generally, these methods suffer from difficulties inherent to phylogenetic tree construction:
• global sequence alignment (quality, gaps, ...);
• different evolutionary histories of genes;
• substitution saturation; ...
and, more seriously, from gene sampling difficulties.
Gene tree - Species tree: The gene sampling problem
(Adapted from: Linder, Moret, Nakhleh, Warnow.)
[Figure: true species tree for A, B and C. Blue is lost in A and B; red is lost in C. Gene tree ≠ species tree.]
Gene tree - Species tree: The gene sampling problem
[Figure: trees for A, B and C.]
All red orthologs have been lost in the 3 species.
Luckily, sampling gives the blue orthologs: the true species tree is reconstructed.
Gene tree - Species tree: The gene sampling problem
[Figure: trees for A, B and C.]
All versions of the gene are in the 3 species.
Gene trees are the same as the species tree.
A genome tree is another alternative for constructing a species tree.
• The concept of genome tree is based on overall gene content similarity (it considers more than single-gene information).
Methodology
[Figure: data matrix T (rows i = 1, ..., n; columns j = 1, ..., p; entries kij > 0; supplementary elements "sup") submitted to Correspondence Analysis (factorial axes F1, ..., Fp), followed by Classification.]
• orthogonal system;
• use of Euclidean distance;
Systematic Analysis of Completely Sequenced Organisms
• In silico species specific comparisons (Tekaia & Dujon. J. Mol. Evol. 1999)
(27 eucaryal, 19 archaeal and 33 bacterial species: 541880 proteins)
[Figure: comparisons between Proteome 1, ..., Proteome n (blastp, pam250, SEG filter).]
• 99 species (B: 33; A: 19; E: 27)
• total of 541880 proteins
• Degree of ancestral duplication and of ancestral
conservation between pairs of species;
• Families of paralogs (Partition-MCL);
• Families of orthologs (Partition-MCL);
• Distribution of orthologous families according to the three domains of life;
• Determination of the protein dictionary (orthologs);
• Determination of protein conservation profiles;
Note on: Homologs - Paralogs - Orthologs
[Figure: over time, an ancestral gene undergoes a duplication giving A and B; after speciation, Species-1 carries A1 and B1 and Species-2 carries A2 and B2.]
Homologs: A1, B1, A2, B2
Paralogs: A1 vs B1 and A2 vs B2
Orthologs: A1 vs A2 and B1 vs B2
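The three definitions can be restated as a small sketch. The `(copy, species)` encoding and the `relation` function are hypothetical, introduced only for illustration; pairs such as A1 vs B2, which this note lists only among the homologs, are labeled plain "homologs" here.

```python
def relation(gene_a, gene_b):
    """Classify a pair of homologous genes.

    Each gene is a (copy, species) tuple, e.g. ("A", 1) stands for A1.
    """
    copy_a, species_a = gene_a
    copy_b, species_b = gene_b
    if gene_a == gene_b:
        return "identical"
    if species_a == species_b:
        return "paralogs"   # same species: the copies arose by duplication
    if copy_a == copy_b:
        return "orthologs"  # same copy in two species: separated by speciation
    return "homologs"       # e.g. A1 vs B2: related, but neither of the above

A1, B1, A2, B2 = ("A", 1), ("B", 1), ("A", 2), ("B", 2)
print(relation(A1, B1), relation(A1, A2), relation(A1, B2))
```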
Sequence analysis
[Figure: sequences a and b compared between species S1 and S2.]
• Large scale comparative analysis of predicted proteomes revealed significant evolutionary processes.
Evolutionary processes include
[Figure: processes linking the ancestor to a species genome. Labels: Phylogeny*, Expansion*, Exchange*, Deletion*, selection*; genesis, duplication, HGT, loss.]
Expansion, Exchange and Deletion are noise. They should be
eliminated or at least reduced.
To overcome some of these limitations, we consider genome tree construction from “Protein Conservation Profiles” and attempt to reduce noisy evolutionary processes.
Conservation profiles
• 99 species (B: 33; A: 19; E:27); 541880 proteins
p 0111111000111111111000110110111101001111101111
• A “conservation profile” is an n-component binary vector describing a protein conservation pattern across n species.
Each component is 0 or 1, according to the absence or presence of a homolog in the corresponding species.
Main interesting properties of conservation profiles:
• Conservation profiles are signatures of evolutionary relationships;
• A conservation profile is the trace of protein evolutionary histories
jointly captured in a set of n species (multidimensional feature);
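As a minimal sketch of the definition, a profile can be built from the set of species in which a protein has a homolog. How homologs are detected is not covered by this sketch; the species names, the `hits` set and the function name are assumptions made for illustration.

```python
def conservation_profile(species_with_homolog, species_order):
    """Return the binary profile of one protein as a string of 0s and 1s.

    species_with_homolog: set of species in which a homolog was found.
    species_order: fixed list of the n species defining the components.
    """
    return "".join("1" if s in species_with_homolog else "0" for s in species_order)

# Toy example with n = 5 species (invented for illustration).
species_order = ["S1", "S2", "S3", "S4", "S5"]
hits = {"S1", "S2", "S5"}
profile = conservation_profile(hits, species_order)
print(profile)  # 11001
```

Keeping the species order fixed for all proteins is what makes profiles comparable component by component.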
Protein conservation profiles
(species columns S1 ... Sn, grouped by domain: E, A, B)
Protein   S1..............I.............I................Sn
G1,1      100000000000000000000000000000000000000000000000
G2,1      111111111111111111111111111111111111111111111111
G3,1      111111110011111111111111011101110101111111101111
...
Gn1,1     100001110001000000000000000000000000000000000000
G1,2      010000000000000000010100000000000111000011100011
G2,2      010000000000000000010100000000000111000011100011
...
Gn2,2     111111110011111111111111011101110101111111101111
...
G1,n      011110100000000000000000001000000000000000000001
G2,n      111111110011111111100011011101110101111111101111
G3,n      111111110011111111100011011101110101111111101111
...
Gnp,n     100110000000000000000000000000000000000000000001
Table: 541880 proteins x 99 species
• Different conservation profiles represent different evolutionary
histories
Distinct conservation profiles
541880 original total proteins (99 species)
442460 non-specific proteins, i.e. conservation profiles (82% of the total proteins)
184130 distinct conservation profiles (42% of the conservation profiles)
100000000000000000000000000000000000000000000000
111111111111111111111111111111111111111111111111
111111110011111111111111011101110101111111101111
010000000000000000010100000000000111000011100011
100110000000000000000000000000000000000000000001
................................................
(one representative from each set of identical conservation profiles)
• Effect of the duplication process is reduced
• This set is indicative of the various observed
evolutionary histories.
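The reduction to distinct profiles amounts to keeping one representative per set of identical profiles. The sketch below assumes profiles are stored as strings; the toy list and the function name are hypothetical. It also recomputes the two percentages from the counts given above.

```python
def distinct_profiles(profiles):
    """Keep one representative of each set of identical profiles (first-seen order)."""
    return list(dict.fromkeys(profiles))

# Toy profiles (invented); duplicates mimic proteins sharing the same history.
profiles = ["1100", "1111", "1100", "0110", "1111", "1100"]
unique = distinct_profiles(profiles)
print(unique)  # ['1100', '1111', '0110']

# Percentages recomputed from the counts quoted in the text.
total, non_specific, distinct = 541880, 442460, 184130
pct_non_specific = round(100 * non_specific / total)
pct_distinct = round(100 * distinct / non_specific)
print(pct_non_specific, pct_distinct)  # 82 42
```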
[Figure: fractions (×10000) of distinct conservation profiles (axis from 0 to 250) for categories c01 to c99; axis label: conservation weights (sum of "1": presence).]
Presence in the 184130 distinct conservation profiles:
Mean=32.2; SD=23.3; min=1; Max=99.
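A conservation weight is simply the number of "1" components in a profile. The sketch below computes weights and summary statistics on invented toy profiles; whether the SD reported above is a population or a sample standard deviation is not stated, so the use of `pstdev` here is an assumption.

```python
from statistics import mean, pstdev

def conservation_weight(profile):
    """Number of species in which the protein is present (count of '1')."""
    return profile.count("1")

# Toy profiles over 4 species (invented for illustration).
profiles = ["1000", "1111", "1100", "0111"]
weights = [conservation_weight(p) for p in profiles]

# Summary statistics of the same kind as those quoted in the text.
summary = (mean(weights), pstdev(weights), min(weights), max(weights))
print(weights, summary)
```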
Genome tree construction: data matrices
• 184130 distinct conservation profiles (d.c.prof): various evolutionary histories; i and j denote two species (columns)
100000000000000000000000000000000000000000000000
111111111111111111111111111111111111111111111111
111111110011111111111111011101110101111111101111
010000000000000000010100000000000111000011100011
100110000000000000000000000000000000000000000001
................................................
• Jaccard similarity scores between species:
sij = N11 / (N11 + N01 + N10),
where N11, N01 and N10 are respectively the total occurrences of (1,1), (0,1) and (1,0) between species i and j.
T = { Tij = sij ; i = 1, ..., n; j = 1, ..., n }
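The similarity matrix T can be sketched directly from the formula. The toy profiles and the function names are hypothetical; returning 0.0 when two species share no presence at all is a convention chosen for this sketch, not something stated in the source, and how T is then turned into a tree is not shown here.

```python
def jaccard(profiles, i, j):
    """Jaccard similarity between species (columns) i and j over all profiles."""
    n11 = n10 = n01 = 0
    for p in profiles:
        a, b = p[i], p[j]
        if a == "1" and b == "1":
            n11 += 1          # present in both species
        elif a == "1":
            n10 += 1          # present in i only
        elif b == "1":
            n01 += 1          # present in j only
    denom = n11 + n01 + n10
    return n11 / denom if denom else 0.0  # convention for empty columns (assumption)

def similarity_matrix(profiles):
    """Build T = {Tij = sij} for all pairs of species."""
    n = len(profiles[0])
    return [[jaccard(profiles, i, j) for j in range(n)] for i in range(n)]

# Toy distinct profiles over 4 species (invented for illustration).
profiles = ["1100", "1110", "0111", "1001"]
T = similarity_matrix(profiles)
print(T[0][1])  # 0.5
```

Joint absences (0,0) do not enter the score, so species are compared only on profiles in which at least one of them is present.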
[Figure: profiles tree. Tekaia F, Yeramian E. (2005). PLoS Comput Biol. 1(7):e75]
Conclusions: Methodology
• Species classification is not an easy task!
• Species tree construction should take into account the
whole information included in the genomes;
• Methods that take whole-genome information into account are still needed;
• The correspondence analysis method might help reveal evolutionary trends embedded in the multidimensional relationships obtained from large-scale genome comparisons;
Conclusions...
• Conservation profiles represent most conserved and
meaningful evolutionary signals jointly captured in a set
of species;
• Thus they should correspond to the most accurate type
of markers for species classification;
• In principle, the profiles tree derived from distinct conservation profiles should considerably minimize genome acquisition effects and should reflect less noisy phylogenetic signals;
• The profiles tree presents evidence of conservation of stable phylogenetic relationships and reveals unconventional species clustering;
• The profiles tree corresponds to the classification of the evolutionary scenarios.
Acknowledgments:
The support of:
• The Institut Pasteur (Strategic Horizontal Programme on
Anopheles gambiae)
• The Ministère de la Recherche Scientifique (France):
ACI-IMPBIO-2004–98-GENEPHYS program.
• Bernard Dujon (Institut Pasteur).
References:
• Tekaia, F. and Dujon, B. (1999).
Pervasiveness of gene conservation and persistence of duplicates in cellular
genomes. Journal of Molecular Evolution, 49:591-600.
• Tekaia, F., Lazcano, A. and Dujon, B. (1999). The genomic tree as revealed from whole proteome comparisons. Genome Res. 9:550-7.
• Tekaia, F., Yeramian, E. and Dujon, B. (2002).
Amino acid composition of genomes, lifestyles of organisms, and evolutionary
trends: a global picture with correspondence analysis. Gene 297: 51-60.
• Tekaia, F. and Yeramian, E. (2005).
Genome Trees from Conservation Profiles. PLoS Comput Biol.1(7):e75.
• Tekaia, F. and Yeramian, E. (2006).
Evolution of Proteomes: Fundamental signatures and global trends in amino acid
composition. BMC Genomics. 7:307.
• Tekaia F, Latgé JP. (2005). Aspergillus fumigatus: saprophyte or pathogen?
Curr Opin Microbiol. 8:385-92. Review.
• Systematic analysis of completely sequenced organisms:
http://www.pasteur.fr/~tekaia/sacso.html
References:
• Bininda-Emonds, O.R.P. (2005). Supertree Construction in the Genomic Age. Methods in Enzymology 395:745-757.
• Bininda-Emonds, O.R.P., Gittleman, J.L. and Steel, M.A. (2002). The (super)Tree Of Life: Procedures, Problems, and Prospects. Annual Review of Ecology and Systematics 33:265-289.
• Crandall, K.A. and Buhay, J.E. (2004). Genomic Databases and the Tree of Life. Science 306:1144-1145.
• Dagan, T. and Martin, W. (2006). The tree of one percent. Genome Biology 7:118.
• Delsuc, F., Brinkmann, H. and Philippe, H. (2005). Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 6:361-75. Review.
• Doolittle (1999). Science 284:2124-8.
• Driskell et al. (2004). Prospects for Building the Tree of Life from Large Sequence Databases. Science 306:1172-1174.
• http://www.genomesonline.org/gold.cgi (list of genome projects)
• Linder, Moret, Nakhleh and Warnow: http://compbio.unm.edu/networks1.ppt
• Martin & Embley (2004). Nature 431:134-7.
• MCL: a cluster algorithm for graphs: http://micans.org/mcl/
• Pennisi, E. (1998). Genome data shake tree of life. Science 280:672-4.
• Raoult et al. (2004). The 1.2-Megabase Genome Sequence of Mimivirus. Science 306:1344-1350.
• Rivera & Lake (2004). Nature 431:152-5.
• Snel, Bork, Huynen (1999). Genome phylogeny based on gene content. Nature Genetics 21:108-110.
• Snel, B., Huynen, M.A. and Dutilh, B.E. (2005). Genome trees and the nature of genome evolution. Annu Rev Microbiol. 59:191-209. Review.
• Woese et al. (1990). PNAS 87:4576-4579.