Transcript Slide 1

MultiParanoid
Automatic Clustering of Orthologs and Inparalogs Shared by Multiple Proteomes
Andrey Alexeyenko
Ivica Tamas
Gang Liu
Erik L.L. Sonnhammer
Stockholm
Homologs: orthologs and paralogs
Homologs: genes that have descended
from a common ancestral gene.
Manifested by a sequence similarity.
We do not believe in sequence similarity
without a shared ancestry.
[Figure: an ancestral gene gives rise to Gene 1 and Gene 2, which are linked by a BLAST hit with a low e-value.]
Orthologs are related via a speciation (S).
Paralogs are related via a gene duplication (D). They may or may not be in the same species.
Homologs: orthologs and paralogs
Inparalogs ~ co-orthologs: paralogs that were duplicated after the speciation and hence are orthologs to the other species’ genes
Outparalogs = not co-orthologs: paralogs that were duplicated before the speciation
Orthology, paralogy and proposed classification for paralog subtypes
Sonnhammer ELL and Koonin EV
Trends in Genetics
Volume 18, Issue 12, 1 December 2002, Pages 619-620
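The inparalog/outparalog distinction above depends only on whether the duplication happened after or before the speciation. A minimal sketch of that rule follows, assuming the ages of both events are already known; the `ParalogPair` type, the `duplication_mya` field and the example ages are hypothetical and not part of the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParalogPair:
    """Two paralogous genes plus the (hypothetical) age of their duplication."""
    gene_a: str
    gene_b: str
    duplication_mya: float  # assumed input: duplication age, million years ago


def classify_paralogs(pair: ParalogPair, speciation_mya: float) -> str:
    """Classify a paralog pair relative to one speciation event.

    A duplication younger than the speciation (smaller age) gives inparalogs;
    a duplication older than the speciation gives outparalogs. Equal ages are
    treated as outparalogs here, which is an arbitrary choice for the sketch.
    """
    if pair.duplication_mya < speciation_mya:
        return "inparalogs"
    return "outparalogs"


if __name__ == "__main__":
    recent = ParalogPair("geneA1", "geneA2", duplication_mya=50.0)
    ancient = ParalogPair("geneA1", "geneA3", duplication_mya=300.0)
    print(classify_paralogs(recent, speciation_mya=100.0))
    print(classify_paralogs(ancient, speciation_mya=100.0))
```

In practice event ages are rarely known directly, so real methods infer the order of events from sequence similarity or gene trees; the sketch only illustrates the definition.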
Orthologs for functional genomics
Orthologs are more likely than outparalogs to have identical/similar biochemical functions and biological roles
Orthologs are optimal for discovering gene function via model organism counterparts
Benchmarking ortholog identification methods using functional genomics data.
Hulsen T, Huynen MA, de Vlieg J, Groenen PM.
Genome Biol. 2006;7(4):R31. Epub 2006 Apr 13.
“…the InParanoid program is the best ortholog identification method in terms of
identifying functionally equivalent proteins.”
Outline
1. InParanoid
2. The world of ortholog resources
3. Why MultiParanoid
4. Limitations
5. Future development
Homologs: orthologs and paralogs
[Figure: gene tree with speciation (S) and duplication (D) nodes, marking inparalogs, orthologs and outparalogs.]
InParanoid
Proteome A
Proteome B
Reciprocally best hits ~ seed orthologs
Inparalogs
Automatic clustering of orthologs and in-paralogs from pairwise species comparisons
Maido Remm, Christian E. V. Storm and Erik L. L. Sonnhammer
Journal of Molecular Biology 314(5), 14 December 2001, Pages 1041-1052
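The "reciprocally best hits ~ seed orthologs" idea on this slide can be sketched as follows. This is only an illustration, assuming pairwise similarity scores (for example BLAST bit scores) are available as dictionaries; the function names and the toy scores are hypothetical, and the full InParanoid procedure described in the cited paper does more than this (for instance, adding inparalogs around each seed pair).

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring target.

    scores: dict {(query, target): score}, higher score = more similar.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Toy scores between proteome A (a1, a2) and proteome B (b1, b2).
    a_vs_b = {("a1", "b1"): 900, ("a1", "b2"): 400,
              ("a2", "b1"): 700, ("a2", "b2"): 300}
    b_vs_a = {("b1", "a1"): 900, ("b1", "a2"): 700,
              ("b2", "a1"): 400, ("b2", "a2"): 300}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here only a1 and b1 pick each other, so they would form the seed pair; a2 and b2 each prefer a protein that prefers someone else.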
Resources using InParanoid
Eukaryotic Ortholog Groups
3409 diseases
Multi-species ortholog resources
“Massive download” friendly: Clusters of Orthologous Groups
Tree-based, best for detailed analysis: HOVERGEN release 47
Any cluster containing genes from more than two species is controversial in terms of orthology, because a speciation gives rise to a pair of species.
[Figure: gene tree with speciation (S) and duplication (D) nodes.]
MultiParanoid algorithm
1. Take >2 species with maximally close speciation points
2. Generate 2-species InParanoid clusters (A-B, B-C, A-C)
3. Find protein counterparts across the clusters
[Figure: InParanoid clusters A-B, B-C and A-C, with a question mark at cluster A-C.]
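Step 3 above can be illustrated with a small sketch that joins pairwise clusters whenever they share a protein. This single-linkage merge (implemented with union-find) is an assumption made for illustration; the slides do not spell out how conflicting memberships are resolved, so it should not be read as the authors' exact procedure. The protein identifiers are hypothetical.

```python
def merge_pairwise_clusters(clusters):
    """Merge pairwise clusters that share at least one protein.

    clusters: iterable of sets of protein ids, e.g. from the A-B, B-C and
    A-C comparisons. Returns a sorted list of sorted member lists.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    for cluster in clusters:
        members = list(cluster)
        for member in members:
            find(member)  # register every protein, including singletons
        for other in members[1:]:
            union(members[0], other)

    groups = {}
    for protein in list(parent):
        groups.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    ab = [{"A:1", "B:1"}, {"A:2", "B:2"}]
    bc = [{"B:1", "C:1"}]
    ac = [{"A:1", "C:1"}, {"A:2", "C:2"}]
    for group in merge_pairwise_clusters(ab + bc + ac):
        print(group)
```

A pure single-linkage merge accepts every link, so it cannot flag the disagreement hinted at by the question mark in the figure; a real implementation would need a rule for such cases.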
MultiParanoid validation
The MultiParanoid output was benchmarked on a manually
curated set of 221 human-fly-worm clusters:
- 214 MultiParanoid clusters found
- 177 (almost) identical
- The rest controversial, mainly due to:
  - differences between pairwise and multiple alignments
  - the curator’s perception and InParanoid settings
However:
[Figure: tree conflicts vs. InParanoid cluster membership for fly, worm and human genes.]
MultiParanoid vs. Clusters of Orthologous Groups (KOG) and OrthoMCL
[Figure: two charts of disagreements between resources, "MP not KOG" vs. "KOG not MP" and "MP not OrthoMCL" vs. "OrthoMCL not MP"; axis 0 to 14000; categories: tree conflict, short match, outparalog, weak homolog, other.]
Current MultiParanoid release
???
C. elegans
D. melanogaster
C. intestinalis
H. sapiens
40451 protein sequences classified into 7695 clusters
http://multiparanoid.cgb.ki.se/
A solution: expansion of MultiParanoid clusters
1. Process all the possible 3-species combinations:
2. Merge respective cluster members across the clades:
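The two steps above can be sketched as follows: enumerate every 3-species combination, then merge clusters from different combinations. The merge rule used here (join clusters that share a member) is an assumption for illustration, since the slide does not state how members are matched across clades; the gene identifiers are hypothetical.

```python
from itertools import combinations


def species_triples(species):
    """Step 1: list all possible 3-species combinations."""
    return list(combinations(sorted(species), 3))


def merge_overlapping(clusters):
    """Step 2 (assumed rule): merge clusters that share at least one member."""
    merged = []  # invariant: sets in this list are pairwise disjoint
    for cluster in clusters:
        current = set(cluster)
        untouched = []
        for existing in merged:
            if existing & current:
                current |= existing  # absorb the overlapping cluster
            else:
                untouched.append(existing)
        merged = untouched + [current]
    return merged


if __name__ == "__main__":
    species = ["C. elegans", "D. melanogaster", "C. intestinalis", "H. sapiens"]
    for triple in species_triples(species):
        print(triple)

    # Hypothetical clusters obtained from different 3-species runs.
    runs = [{"hs1", "dm1", "ce1"}, {"hs1", "dm1", "ci1"}, {"hs2", "ce2", "ci2"}]
    for cluster in merge_overlapping(runs):
        print(sorted(cluster))
```

The number of combinations grows as n choose 3, so four species give four runs while twenty species would already give 1140.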
But still, orthology is a pairwise concept!
The speciation gives rise to a pair of species.
Post-processing (bootstrap, synteny, tree manual curation etc.)
How do the ortholog resources cope with it?
Clusters of Orthologous Groups
HOVERGEN release 47
Cluster size ~ outparalogs/orthologs ratio
Overview and comparison of ortholog databases
Alexeyenko A, Lindberg J, Pérez-Bercoff Å, Sonnhammer ELL
Drug Discovery Today: Technologies (2006) v. 3; 2, 137-143
• EGO
• COG/KOG
• HomoloGene
• InParanoid/MultiParanoid
• HOPS
• KEGG
• OrthoMCL
• ENSEMBL Compara
• PhiGs
• MGD
• HOGENOM
• HOVERGEN
• INVHOGEN
• TreeFam
• OrthologID
How to reconcile…
…the demand for multi-species clusters and pair-wise gene relations?
The common feature is a single ancestor gene at the root point:
[Figure: gene tree with speciation (S) and duplication (D) nodes descending from a single root.]
2 new terms:
Pseudo-proteome: a union of proteomes of the same clade
Cluster of pseudo-inparalogs: a within-clade gene family
Pseudo-proteome A (reptiles)
Pseudo-proteome B (mammals)
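A pseudo-proteome as defined above is simply the union of several species' proteomes within one clade. A minimal sketch follows, assuming each proteome is given as a list of protein identifiers; the species names, identifiers and the choice to prefix each identifier with its species are hypothetical.

```python
def build_pseudo_proteome(proteomes):
    """Union the proteomes of one clade into a single pseudo-proteome.

    proteomes: dict {species_name: iterable of protein ids}.
    Ids are prefixed with the species name so that identical ids from
    different species stay distinct (an assumption of this sketch).
    """
    return {
        f"{species}:{protein}"
        for species, proteins in proteomes.items()
        for protein in proteins
    }


if __name__ == "__main__":
    reptiles = {"lizard": ["p1", "p2"], "snake": ["p1"]}
    mammals = {"mouse": ["p1"], "human": ["p1", "p2"]}
    pseudo_a = build_pseudo_proteome(reptiles)
    pseudo_b = build_pseudo_proteome(mammals)
    print(sorted(pseudo_a))
    print(sorted(pseudo_b))
```

The two resulting sets could then be compared like an ordinary pair of proteomes, with within-clade gene families playing the role of inparalogs.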
Another view, “gene-family”-wise:
[Figure: gene tree rooted at the LCA, with speciation (S) and duplication (D) nodes.]
… and all the members of the same cluster descend from a single gene in the last common ancestor (LCA) of the two major clades
The clustering can be done at different levels
[Figure: gene tree with speciation (S) and duplication (D) nodes; orthologs marked.]
For example:
Fungi vs. animals
Insects vs. mammals
Rodents vs. primates
Having more than one species in a pseudo-proteome reduces misassignments in case of gene loss.
Closer pseudo-proteomes increase resolution.
Lineage (~pseudo-proteome)-specific expansions should also be available
Conclusions
• Most of the ortholog resources may build clusters in the form of gene trees, but only InParanoid seems to correctly delineate ortholog/inparalog groups
• The MultiParanoid algorithm has relieved the problem of “hidden outparalogs”, but the number/content of species remains limited
• The “LCA-Paranoid” concept: the long-awaited solution?
– Each of the two clade-specific cluster parts may be regarded as a multispecies cluster
– When (in future) all possible “clade<->clade” clustering solutions are found, each gene would receive a complete set of orthologs at a desired level of LCA
– With a sufficient number of complete proteomes, it would be possible to date each gene pair’s point of divergence