Aucun titre de diapositive - Institut national de la

The use of the concepts of evolutionary biology in
genome (biological) annotation.
Pierre Pontarotti
EA 3781 Evolution Biologique
[email protected]
http://www.up.univ-mrs.fr/evol/
• Some concepts in evolutionary biology
• Use of the concepts for gene structural and functional annotation
• Computerization (informatisation)
• Other concepts
Metazoan phylogeny (Adoutte et al. 2000)
[Figure: phylogenetic tree of the metazoan phyla; the ancestral node is labeled "Urbilateria ??"]
• Non-bilaterian phyla: Ctenophorans, Cnidarians, Poriferans
• BILATERIA
  - DEUTEROSTOMES: Vertebrates, Cephalochordates, Urochordates, Hemichordates, Echinoderms
  - PROTOSTOMES, LOPHOTROCHOZOANS: Bryozoans, Entoprocts, Platyhelminthes, Pogonophorans, Brachiopods, Phoronids, Nemerteans, Annelids, Molluscs, Sipunculans, Gnathostomulids, Rotifers
  - PROTOSTOMES, ECDYSOZOANS: Gastrotrichs, Nematodes, Priapulids, Kinorhynchs, Onychophorans, Tardigrades, Arthropods
URBILATERIA: the hypothetical ancestor of the bilaterian animals
An idea that goes back to Geoffroy Saint-Hilaire (19th century)
The URBILATERIA genome evolved by the fixation of:
• Nucleotide substitutions
• Gene losses
• Genic duplications
  - Gene duplication
  - Genome region duplication
  - Whole genome duplication
  - Chromosomal rearrangement
Large-scale gene duplication in the vertebrate lineage
[Figure: phylogeny of the lineages listed below, with the labels T1, T2, Pikaia and "20 000 genes". Node labels: 360, 450, 528, 564, 751, >751, 833-993, <833-993.]
Lineages shown: Amniota (human), Lissamphibia, Actinopterygii (zebrafish), Chondrichthyes (shark), Cephalaspidomorphi (lamprey), Myxini (hagfish), Cephalochordata (amphioxus), Urochordata (Ciona), Echinodermata, insects (Drosophila), Nematoda (C. elegans)
From alleles to orthologs (I)
Population POP 1 carries alleles A, B, C and D, produced by point mutations.
POP 1 splits into 2 autonomous populations:
• POP 1A: fixation of allele A and accumulation of new mutations, giving alleles A1 and A2
• POP 1B: fixation of allele B and accumulation of new mutations, giving alleles B1 and B2
From alleles to orthologs (II)
POP 1A splits into 2 autonomous populations:
• POP 1A1: fixation of allele A1 and accumulation of new mutations, giving alleles A11 and A12
• POP 1A2: fixation of allele A2 and accumulation of new mutations, giving alleles A21 and A22
POP 1B splits into 2 autonomous populations:
• POP 1B1: fixation of allele B1 and accumulation of new mutations, giving alleles B11 and B12
• POP 1B2: fixation of allele B2 and accumulation of new mutations, giving alleles B21 and B22
From alleles to orthologs (III)
• Within one population the variants are alleles: A.1.1 and A.1.2; A.2.1 and A.2.2; B.1.1 and B.1.2; B.2.1 and B.2.2
• Between the separated populations the corresponding genes are orthologs
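Orthologs and paralogs
[Figure: gene tree. In URBILATERIA, gene A duplicates into A1/2 and A3, then A1/2 duplicates into A1 and A2: A1, A2 and A3 are paralogs. After speciation, the DROSOPHILA multigenic family contains A1, A2 and A3, and the HUMAN multigenic family contains A1, A2, A3’ and A3”.]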
Orthology / Paralogy
Orthologs: 2 genes in different species that derive from a common ancestral gene and were separated by a speciation event (A1 HUMAN and A1 DROSO; A2 HUMAN and A2 DROSO).
Paralogs: 2 genes resulting from a duplication event in a genome (A1 and A2, which derive from A1/2).
Co-orthologs: A3’ HUMAN and A3” HUMAN are both orthologous to A3 DROSO.
[Figure: the tree from A to A1/2 and A3, with each node labeled as a speciation or a duplication]
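The two definitions above can be applied mechanically once each node of a gene tree is labeled as a speciation or a duplication: a pair of genes is orthologous when their last common ancestor is a speciation node, and paralogous when it is a duplication node. The sketch below is only an illustration of that rule. The nested-tuple encoding, the function names and the leaf names are assumptions made for the example, and the tree is a transcription of the slide's figure.

```python
# Gene tree from the slide as nested tuples (event, left, right).
# "D" = duplication node, "S" = speciation node; leaves are strings.
# The encoding and the leaf names are illustrative assumptions.
TREE = ("D",
        ("D",
         ("S", "A1_HUMAN", "A1_DROSO"),
         ("S", "A2_HUMAN", "A2_DROSO")),
        ("S",
         ("D", "A3'_HUMAN", "A3''_HUMAN"),
         "A3_DROSO"))


def leaves(node):
    """Return the leaf names found under a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label each pair of leaves by the event at their last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    kind = "orthologs" if event == "S" else "paralogs"
    # Leaves on opposite sides of this node have this node as their LCA.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = kind
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


PAIRS = classify_pairs(TREE)


def relation(a, b):
    """Look up the relationship between two genes of the tree."""
    return PAIRS[frozenset((a, b))]
```

With this encoding, `relation("A3'_HUMAN", "A3_DROSO")` and `relation("A3''_HUMAN", "A3_DROSO")` both give "orthologs", which is the co-ortholog situation of the slide, while `relation("A1_HUMAN", "A2_DROSO")` gives "paralogs" because the two genes were separated by the older duplication.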
From Gene History
To Gene Function
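Orthologs under purifying selection
[Figure: URBILATERIA gene A, under purifying selection, is inherited through speciation by HUMAN (A) and DROSOPHILA (A). Both orthologs stay under purifying selection and keep the ancestral function.]
Ortholog functional switch
[Figure: URBILATERIA gene A, under purifying selection, is inherited through speciation by HUMAN and DROSOPHILA. One ortholog (A) keeps the ancestral function; the other (A2) evolves under positive or relaxed selection: new function?]
Co-ortholog sub-functionalization
[Figure: URBILATERIA gene A, under purifying selection; after speciation, DROSOPHILA A keeps the ancestral function. In HUMAN, a duplication gives A’ and A”, each keeping a sub-function.]
Co-ortholog neo-functionalization
[Figure: URBILATERIA gene A, under purifying selection; after speciation, DROSOPHILA A keeps the ancestral function. In HUMAN, a duplication gives A, which stays under purifying selection and keeps the ancestral function, and A2, which evolves under positive or relaxed selection and acquires a new function.]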
• Orthology/paralogy information is important for functional inference
• (it cannot be relied on for species with a high level of horizontal gene transfer)
A warning that other speakers will also discuss
Many scientists use the best BLAST hit to infer orthologous relationships
… BUT!
• Several co-orthologs can be present
• It fails with genomes that are not fully sequenced
• It fails when gene loss has occurred
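The co-ortholog problem can be made concrete with a reciprocal best hit search, a common refinement of the best-hit approach. The sketch below is a toy illustration: the gene names and the similarity scores are invented for the example and are not BLAST output, and the two functions are hypothetical helpers, not part of any BLAST tool.

```python
# Hypothetical similarity scores between Drosophila and human genes
# (higher = more similar). Invented numbers, not real BLAST results.
SCORES = {
    ("dro_A1", "hum_A1"): 400,
    ("dro_A1", "hum_A3p"): 80,
    ("dro_A1", "hum_A3pp"): 70,
    ("dro_A3", "hum_A1"): 75,
    ("dro_A3", "hum_A3p"): 310,
    ("dro_A3", "hum_A3pp"): 295,
}


def best_hits(scores, axis):
    """Map each gene on one side (axis 0 or 1) to its best-scoring partner."""
    best = {}
    for pair, score in scores.items():
        query, target = pair[axis], pair[1 - axis]
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(scores):
    """Keep the pairs in which each gene is the other's best hit."""
    forward = best_hits(scores, 0)   # Drosophila -> human
    backward = best_hits(scores, 1)  # human -> Drosophila
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}


RBH = reciprocal_best_hits(SCORES)
```

Here `RBH` contains `("dro_A3", "hum_A3p")` but not `("dro_A3", "hum_A3pp")`: the second human co-ortholog is silently dropped, although a tree would show that both human genes are orthologous to the Drosophila gene.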
AND
Even with phylogenetic analysis:
• Biases must be corrected.
• A phylogenetic tree is a hypothesis.
• An evolutionary shift (due to positive or relaxed selection) could be linked to a functional shift.
See the talks by N. Galtier and A. Levasseur.
• Detection of Positive selection and
functional shift
• Detection of Evolutionary constraint
relaxation and functional shift
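Co-ortholog neo-functionalization
[Figure: URBILATERIA gene A, under purifying selection; after speciation, DROSOPHILA A keeps the ancestral function. In HUMAN, a duplication gives A, which stays under purifying selection and keeps the ancestral function, and A2, which acquires a new function.]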
Replacement of the constitutive proteasome β-subunits after interferon-γ stimulation
Paralogue replacement (paralogue = duplicated gene):
• PSMB5 is replaced by PSMB8 (LMP 7)
• PSMB6 is replaced by PSMB9 (LMP 2)
• PSMB7 is replaced by PSMB10 (LMP Z)
Constitutive proteasome
• Ancestral function: protein degradation
• Present in all Metazoans, therefore present in Urbilateria (Metazoan ancestor)
Immunoproteasome
• New function (specialization): specific-size protein or peptide degradation, used by the MHC system
• Only found in vertebrates
Large-scale gene duplication in the vertebrate lineage
[Figure: the same phylogeny as before, from Amniota (human) to Nematoda (C. elegans), with node labels 360, 450, 528, 564, 751, >751, 833-993 and <833-993, now carrying the label PROTEASOME.]
[Figure: phylogenetic tree of the PSMB7 and PSMB10 proteins, with a node labeled "Duplication". One group contains the vertebrate PSMB7 sequences (Mus, Ratt, Bos, Homo, Gall, Xeno, Zebra, Fugu) and the other the vertebrate PSMB10 sequences (Zebra, Fugu, Bos, Mus, Homo). The remaining sequences are single PSMB7/10 genes: Bran, Ci-zeta (Ciona), Bombyx, Prosbeta2 and CG18341 (Drosophila). Node support values and asterisks are not reproduced; scale bar 0.1.]
The study of gene and genome HISTORY
helps to find evidence for gene FUNCTION.
Concepts in evolutionary biology
• Use of the concepts for structural and functional annotation:
  - Structural annotation (deciphering of gene structure).
  - Functional annotation (especially the use of phylogeny to decipher protein function).
Functional annotation
Biochemical and biological process:
• Experimental approach:
  - RNA interference
  - Tandem affinity purification and mass spectrometry
• In silico
Functional annotation
• Functional annotation based on phylogeny, starting from experimentally annotated genes…
INTERLUDE
• FUNCTION????
• A complex concept;
Function prediction
• Using orthology information (done)
• Using the evolutionary shift information (in progress)
• By integrative phylogenomics (Engelhardt et al., PLoS Computational Biology 2005) (in progress)
Functional annotation
Homologs with experimentally known function: where the information can be found.
• Gene Ontology
• SwissProt
• GenBank
• MedLine
Textual information analysis
G.O. standard
Functional annotation
Gene Ontology classification
• Biological process: the biological process to which the gene or gene product contributes.
  - Cell growth and maintenance; pyrimidine metabolism; …
• Molecular function: the biochemical activity of a gene product, including specific binding to ligands or structures.
  - Enzyme, transporter; Toll receptor ligand, …
• Cellular component: the place in the cell where a gene product is active.
  - Cytoplasm, ribosome, …
Plus other classifications to develop, in particular an evolution-based ontology.
Only a small fraction of genes corresponds to known, well-characterized proteins.
If the function is unknown: phylogenetic analysis.
Functional prediction:
• Using orthology information
• Using the evolutionary shift information
• By integrative phylogenomics
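The first option, prediction from orthology information, can be sketched as a transfer of experimentally supported terms from annotated orthologs to the unannotated gene. This is a minimal illustration under stated assumptions: the gene names, the term labels and the function `transfer_by_orthology` are hypothetical, and ranking terms by the number of supporting orthologs is a choice made for the example, not the method of any tool named in this talk.

```python
from collections import Counter


def transfer_by_orthology(gene, orthologs, annotations):
    """Rank functional terms for `gene` by the number of orthologs carrying them.

    orthologs:   dict mapping a gene to the list of its orthologs
    annotations: dict mapping a gene to its set of experimentally supported terms
    """
    votes = Counter()
    for ortholog in orthologs.get(gene, ()):
        # Orthologs without experimental annotation contribute nothing.
        votes.update(annotations.get(ortholog, ()))
    # Most supported terms first; ties broken alphabetically.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical data: one query gene with three orthologs, two of them annotated.
ORTHOLOGS = {"query_gene": ["ortholog_sp1", "ortholog_sp2", "ortholog_sp3"]}
ANNOTATIONS = {
    "ortholog_sp1": {"protein degradation"},
    "ortholog_sp2": {"protein degradation", "peptide processing"},
}

PREDICTION = transfer_by_orthology("query_gene", ORTHOLOGS, ANNOTATIONS)
```

Such a transfer assumes that the orthologs stayed under purifying selection; it says nothing about the functional switch and neo-functionalization cases, which is why the evolutionary shift information is listed as a separate source of evidence.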
Tumor necrosis factor family phylogenetic tree: ortholog identification
[Figure: phylogenetic tree of TNF family ligands (TNFSF1, 2, 5, 6, 7, 8, 9, 10, 11, 13, 14 and 15) from several species (sequence prefixes Hsa, Mmu, Rno, Gga, Dre, Xla, Pol, Bbo), with the Drosophila gene EIGER (DmeTNF). Labels DF1, DF2 and DF3 appear on the tree. Annotated phenotypes: atherosclerotic plaque formation (TNFSF5 group); ALPS and LPR/GLD lymphoproliferative syndrome (TNFSF6 group). Node support values are not reproduced; scale bar 0.2.]
Trends in Immunology (July 2003)
Human TNF family phylogenetic tree: molecular function and biological process
Search for the closest paralog
[Figure: tree pairing each human TNF ligand with its receptor(s) and with the biological processes involved. The three columns are listed below in figure order.]
Ligands: TNFSF3, TNFSF1, TNFSF2, TNFSF15, TNFSF14, TNFSF6, TNFSF18, TNFSF4, TNFSF5, TNFSF10, TNFSF11, TNFSF13B, TNFSF13, TNFSF12?, TNFSF7, TNFSF9, TNFSF8, EDA-A1, EDA-A2
Receptors: TNFRSF3, TNFRSF1A, TNFRSF1B, TNFRSF12, TNFRSF14, TNFRSF6B, TNFRSF6, TNFRSF18, TNFRSF4, TNFRSF5, TNFRSF10B, TNFRSF10A, TNFRSF10C, TNFRSF10D, TNFRSF11B, TNFRSF11A, BR3, TNFRSF17, TACI, TNFRSF7, TNFRSF9, TNFRSF8, TNFRSF19, EDAR, XEDAR, TNFRSF21, RELT (TNFRSF19, TNFRSF21 and RELT are each shown with a "?")
Biological process:
• LN, PP, GC, tumoricidal activity
• PP, GC, T cell homeostasis (death)
• T cell homeostasis (death)
• T cell costimulation, negative selection?
• T cell homeostasis (survival?), CTL activation, peripheral tolerance?
• T cell homeostasis (death), CTL function, peripheral tolerance, T cell costimulation, chemotaxis
• T cell transmigration and homeostasis (survival)?
• T cell homeostasis (survival), peripheral tolerance
• GC, B cell function, peripheral tolerance, T cell priming
• Tumoricidal activity, T cell function?
• Tumoricidal activity, T cell function?
• LN, bone homeostasis, mammary gland development
• B cell homeostasis
• B cell homeostasis?
• B cell homeostasis
• T cell activation?
• Negative selection, autoimmunity
• Tooth, hair, sweat gland formation
• T cell activation and survival, CTL activity, tumoricidal activity?
• Tooth, hair, skin formation?
Trends in Immunology (July 2003)
Only a small fraction of genes corresponds to known, well-characterized proteins.
If the function is unknown: phylogenetic analysis.
Gene function prediction:
• Using orthology information
• Using the evolutionary shift information (see the talk by A. Levasseur)
• By integrative phylogenomics
evolutionary biology concepts for
genome annotation
Further reading
Concepts
• Levasseur A, Danchin E, Orlando L, Bailly X, Pontarotti P. Conceptual bases for quantifying the role of the environment on genomes evolution: the participation of positive selection and neutral evolution. Biological Reviews, in press.
• Danchin E.G.J, et al. The Major Histocompatibility Complex origin. Immunological Reviews. 2004;198(1):216-232.
Concepts for applied evolution
• Danchin E.G.J, Levasseur A, Lopez-Rascol V, Gouret P, Pontarotti P. The use of evolutionary biology concepts for genome annotation. J. Exp. Zoology Part B: Mol. and Dev. Evol. 2006.
Computerization of concepts and knowledge
• Phylogeny
• Detection of orthologous and paralogous genes
• Detection of evolutionary shifts (in progress)
• Function prediction
FIGENIX is a multi-user software platform dedicated to structural and functional annotation tasks:
- Gene prediction on large DNA sequences
- Construction of robust phylogenetic trees
- Automatic detection of orthologs and paralogs
- Automatic retrieval of the functional data on genes that are available from Web databases
- Filtering and construction of protein databases (EST contig assembly)
- Chained processes (e.g. gene prediction followed by a phylogenetic study of each predicted gene)
STEPS OF THE PHYLOGENY PIPELINE (1)
Input: protein sequence encoded by a putative gene
• BLAST + filtering (databases: Ensembl, NR…)
• CLUSTAL W + purification + bias correction, giving a multiple alignment
• Domain search with HmmPFAM (PFAM), then domain enumeration
• Do "repeats" exist?
  - Yes: keep the monophyletic "repeats"; align the merged "repeats"
  - No: keep the complete alignment
• Creation of the "FIGENIX" domain (correctDomains)
• Composition test with TREE-PUZZLE to eliminate sequences that are too divergent
Side input: construction of the Tree of Life, giving the reference tree
STEPS OF THE PHYLOGENY PIPELINE (2)
• Detection of "paralogy groups" + elimination of the sites that evolve too fast ("Gu test")
• Elimination of the sequences with more than 30% gaps
• Elimination of the most incongruent domains, detected with HomPart (PAUP)
• Saturation test
• Tree construction with three methods: NJ, parsimony, maximum likelihood
• Comparison of the topologies with Templeton-Hasegawa tests. Are the topologies congruent?
  - Yes: consensus tree
  - No: NJ tree
• Ortholog detection
• Search for functions
Side input: construction of the Tree of Life, giving the reference tree
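One of the filtering steps of the pipeline, the elimination of sequences with more than 30% gaps, is simple enough to sketch. The code below is an illustration only and is not FIGENIX code: the function names, the gap characters and the toy alignment are assumptions, and the threshold is the only value taken from the slide.

```python
def gap_fraction(aligned_seq, gap_chars="-."):
    """Fraction of alignment columns that are gaps in one aligned sequence."""
    if not aligned_seq:
        raise ValueError("empty aligned sequence")
    return sum(char in gap_chars for char in aligned_seq) / len(aligned_seq)


def drop_gappy_sequences(alignment, max_gap_fraction=0.30):
    """Keep the sequences whose gap fraction does not exceed the threshold.

    alignment: dict mapping a sequence name to its aligned sequence.
    A sequence with exactly 30% gaps is kept, since only sequences
    with more than 30% gaps are eliminated.
    """
    return {
        name: seq
        for name, seq in alignment.items()
        if gap_fraction(seq) <= max_gap_fraction
    }


# Toy alignment of 10 columns (hypothetical sequences).
ALIGNMENT = {
    "seqA": "MKTAYIAKQR",  # 0% gaps
    "seqB": "MK-AYI-KQR",  # 20% gaps
    "seqC": "M----IAK--",  # 60% gaps, eliminated
    "seqD": "MKT---AKQR",  # 30% gaps, kept
}

KEPT = drop_gappy_sequences(ALIGNMENT)
```

Removing such sequences before tree construction limits the number of alignment columns that carry little information for the affected sequences.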
FIGENIX architecture
[Figure: architecture diagram. Components: EST Agent, MGI Agent, GO Agent, Functional Collector Agent, Archiver, RDBMS, Expert System, Genomic Data, Annotation Engine, Persistence Layer, Repository, Web Server (load balancing, security, ...). Arrows: request, data exchange.]
- Intranet/Extranet platform
- 3-tier architecture (web interface / business-logic servers / database)
Further reading on the computerization of concepts
• Gouret et al. FIGENIX: intelligent automation of genomic annotation: expertise integration in a new software platform. BMC Bioinformatics. 2005 Aug 5;6:198.
• Balandraud et al. A rigorous method for multigenic families' functional annotation: the peptidyl arginine deiminase (PADs) proteins family example. BMC Genomics 2005, 6:153.
Further reading on FIGENIX utilization
• Danchin et al. Eleven ancestral gene families lost in mammals and vertebrates while otherwise universally conserved in animals. BMC Evolutionary Biology 2006, 6:5.
• Paillisson et al. Bromodomain testis-specific protein is expressed in mouse oocyte and evolves faster than its ubiquitously expressed paralogs BRD2, -3 and -4. Genomics. 2006.
• Levasseur et al. Tracking the evolutionary and functional shifts connection: the lipase-esterase example. BMC Evolutionary Biology 2006.
Structural annotation
Genome nucleotide-level annotation:
• Mapping
• Finding genomic landmarks
• Gene finding and protein prediction
• Non-coding RNAs and regulatory regions
• Identifying repetitive elements
• Mapping segmental duplications
• Mapping variations (SNPs, microsatellites, …)
Available tools
Ab initio:
• Genscan
• Fgenesh
• Genie
• Etc.
Similarity-based:
• Genewise
• Sim4
• Est2genome
• Figenix
State of the art
Structural annotation
Ab initio programs: based on statistical signals within the DNA.
• Coding propensity (hexamer signals).
• Splice site signals.
Strengths:
  - Easy and quick to run.
  - Only need DNA as input.
Weakness: high false positive rate.
Similarity-based programs: alignment programs that know about gene structure.
Very accurate with strong sequence similarities.
Strengths: accurate.
Weakness: need strong similarities, slow to run.
« FIGENIX SOFTWARE PLATFORM » annotation method
• Structural annotation combining a statistical approach and a homology-based approach (similarities with known proteins). Automating the process resulted in an expert system based on biological inference rules that use gene history and ab initio programs. It is not yet completely based on evolutionary biology.
Structural annotation
[Figure: a DNA segment with two regions. The best hit of region 1 is protein A, aligned through HSPs A1, A2 and A3; the best hit of region 2 is protein B, aligned through HSPs B1 and B2. Below, a track of candidate site labels (M, A, D, S).]
Validation of structural annotation
Gene = nucleotide sequence, then transcription: mRNA = nucleotide sequence, then translation: protein = amino acid sequence
Correct protein predictions: Genscan 31%, HMMGene 38%, Figenix 87%
The performance of the platform was validated on a standard dataset (HMR195); see Guigó et al., 2000; Rogic et al., 2001.
Structural annotation
Accuracy versus exon type and prediction

PROGRAM    Initial exons (55)   Internal exons (186)   Terminal exons (55)   Over-prediction   Correct protein prediction
Genscan    0.55                 0.80                   0.65                  0.22              0.31
HMMGene    0.75                 0.81                   0.78                  0.15              0.38
Figenix    0.91                 0.92                   0.95                  0.05              0.87

The mouse and rat sequences from the HMR195 dataset were used against the human division of SwissProt.
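The slide does not define how the exon-level figures were computed. As a hedged illustration, the sketch below assumes two common definitions: accuracy as the fraction of annotated exons predicted with exactly the right boundaries, and over-prediction as the fraction of predicted exons that overlap no annotated exon. The function name and the coordinates are hypothetical.

```python
def exon_scores(annotated, predicted):
    """Return (exact-match fraction, over-prediction fraction).

    Exons are (start, end) tuples with inclusive coordinates.
    Both definitions are assumptions made for this example.
    """
    annotated, predicted = set(annotated), set(predicted)
    if not annotated or not predicted:
        raise ValueError("both exon sets must be non-empty")

    def overlaps(a, b):
        # Two closed intervals overlap when neither ends before the other starts.
        return a[0] <= b[1] and b[0] <= a[1]

    exact = annotated & predicted
    wrong = [p for p in predicted
             if not any(overlaps(p, a) for a in annotated)]
    return len(exact) / len(annotated), len(wrong) / len(predicted)


# Hypothetical gene: 4 annotated exons, 5 predicted exons.
ANNOTATED = [(100, 200), (300, 400), (500, 650), (800, 900)]
PREDICTED = [(100, 200), (300, 410), (500, 650), (800, 900), (1000, 1100)]

EXACT, OVER = exon_scores(ANNOTATED, PREDICTED)
```

In this toy case three of the four annotated exons are predicted exactly and one of the five predicted exons overlaps nothing; the exon predicted as (300, 410) counts neither as exact nor as over-predicted.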
• The next step for structural annotation is to take into account the evolutionary history of the gene
• Concepts, modeling, computerization, bio-analysis
  - Structural annotation (deciphering of gene structure).
  - Functional annotation (especially the use of phylogeny to decipher protein function).
Next
• Phylogenomics (genome Evolution)
• Phylopostgenomics
• - phylotranscriptomics
• - phylointeractomics
• ………..
Knowledge/concepts
Observation: regions of conserved synteny exist between species.
Explanation/concept: these regions derive from an ancestral region that evolved independently in each lineage after speciation, but not enough to lose every trace of conservation. From this knowledge and this prediction follows a line of reasoning showing that the analysis of conserved syntenies and the reconstruction of ancestral regions are of interest both from an applied point of view (assistance to positional cloning) and from a conceptual point of view (understanding genome evolution).
Formalization of the biological question
How can conserved syntenies be detected?
This is also the point where conceptualization comes fully into play.
If conserved syntenies really derive from an ancestral region, the genes in these regions must satisfy two conditions:
1/ they have orthology relationships;
2/ the clustering of the orthologous genes must be improbable under the hypothesis of chance (the clustering must be significant).
Programs are therefore needed that can detect orthology relationships and find significant clusters.
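The second condition, significance of the clustering, can be illustrated with a naive hypergeometric test: if the orthologs were scattered at random in the genome, how likely is it that a window of genes contains at least the observed number of them? This is a baseline sketch under simplifying assumptions: the function name and the numbers are hypothetical, and the test ignores gene family sizes.

```python
from math import comb


def cluster_p_value(genome_genes, orthologs_in_genome,
                    window_genes, orthologs_in_window):
    """P(X >= orthologs_in_window) under a hypergeometric null model.

    The window is treated as a random draw of `window_genes` genes, without
    replacement, from a genome of `genome_genes` genes that contains
    `orthologs_in_genome` orthologs of the compared region.
    """
    upper = min(orthologs_in_genome, window_genes)
    favourable = sum(
        comb(orthologs_in_genome, i)
        * comb(genome_genes - orthologs_in_genome, window_genes - i)
        for i in range(orthologs_in_window, upper + 1)
    )
    return favourable / comb(genome_genes, window_genes)


# Hypothetical case: 6 of the 40 orthologs fall in a window of 50 genes
# taken from a genome of 20 000 genes.
P_CLUSTER = cluster_p_value(20_000, 40, 50, 6)
```

A small `P_CLUSTER` means that the clustering is unlikely to be due to chance under this null model. A test meant for real use would need a more careful formalism than this baseline.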
Genome reconstruction (translocation, fusion, inversion…; weighting of these events)
Mathematical modeling
Modeling is needed where computational tools do not exist, or where their biological formalism is not correct. This is the case for the statistical tests of clustering (in particular because of the size of the in-paralog families).
Modeling of genome reconstruction
Computational formalization
1) Algorithms
• Statistical tests
• Modeling of ancestral genome reconstruction
2) Integration with the other software tools in the computing system (CASSIOPE)
• Bio-analysis
  - Automatic search for conserved syntenies.
  - Reconstruction and evolution of genomic regions
• New knowledge and new concepts
  - Direct application: assistance to positional cloning
  - Concepts/knowledge: detection of functional clustering
C.A.S.S.I.O.P.E
• C.A.S.S.I.O.P.E: Clever Agent System for Synteny Inheritance and Other Phenomena in Evolution
• Finds conserved regions between genomes
• For more information, see Virginie Lopez Rascol
C.A.S.S.I.O.P.E
• Toward ancestral genome reconstruction
Collaborators
MEG project* (Modélisation Evolution Génome: genome evolution modeling)
Nathalie Balandraud
Philippe Gouret
Etienne Danchin
Vérane Vitiello
Math/bio
Julien Berestycki*
Stéphanie Léocard*
Laure Rigal*
Info/bio
Olivier Chabrol*
Cedric Notredame*
Concepts and bio-analysis
Roxane Barthelemy*
Jean-Paul Casanova*
Elodie Darbo*
Anthony Levasseur*
Eric Faure*
Pierre Pontarotti*

Simona Grusea*
Valda Limic*
Etienne Pardoux*
Virginie Lopez*
http://www.up.univ-mrs.fr/evol/
Open Discussion
Phylo postgenomic