Bioinformatics in the 90’s



Origins: data storage needs related to the sequencing effort...
…but storage was hardly enough. Additional needs:
• Assembly, comparison and annotation of sequences
• Prediction of genes
• Reconstruction of evolutionary trees
• Modeling and prediction of 3D structures
• ...
IT: on-line databases and software tools
Science: modeling, computational representations, algorithms
February 2002 – Journée PréARC « Algèbres de processus et processus biologiques » (PréARC workshop day on process algebras and biological processes)
The post-genomic phase transition
• Availability of complete genome sequences
• High-throughput experimental techniques yield new types of results:
  • SNPs (Single Nucleotide Polymorphisms)
  • mRNA expression levels ("DNA chips")
  • Systematic determination of 3D structure
  • Protein expression levels
  • Protein-protein interactions
  • Systematic mutagenesis
  • ...
New needs & opportunities:
• Processing and analysis of each type of data
• Integration of heterogeneous data
• Reconstruction and simulation of cellular mechanisms…
Corporate Information
Founded: December 1997
Headquarters and laboratories: Central Paris
Employees: 60 as of end 2001
Intellectual property: 57 patents on technology, interactions and targets
Equity raised: c. €30 million
Ownership: Advent (B), Alafi (US), Apax (F), Auriga (F), IMH (D), Health Cap (S), Lombard-Odier (CH), Medicis (D), Rendex (B)
Hybrigenics’ business strategy

• Own drug discovery programs
  • in the fields of infectious diseases, cancer and metabolic disorders
  • the resulting novel validated targets being exploited for the Company's own product pipeline
• Collaboration and licensing agreements with biopharmaceutical companies
  • in any disease field
  • for out-licensing
Hybrigenics’ discovery programs
Cancer
• Proteins involved in basic cellular functions
• Proteins involved in apoptosis
• Proteins involved in cell cycle regulation
Metabolic disorders / Obesity
• Proteins involved in adipogenesis
Infectious diseases
• Antibacterial: essential proteins of the pathogens
• HIV, HCV: protein-protein interactions between the host cell and the pathogen
The Helicobacter pylori Genome
From Tomb et al. (1997), Nature 388:539-47
1,667,867 base pairs
1,590 predicted ORFs
Less than 20% with assigned biological functions
(500 with no database match; 250 with structural homology but totally unknown function)
The protein-protein interaction map of Helicobacter pylori
Nature (2001) 409:211-215.
• 285 baits
• 261 proteins
• 2 million prey fragments
• 20 million interactions/bait
• PBS® filtering (identification of false positives)
• Over 1,200 interactions
• Over 1,500 SID®
• Connectivity: 46.6% of proteome, 3.36 interactions/bait
• Reproducibility: >95%
Target Identification
Hybrigenics' PIM Technology Platform
• New generation of reliable, high-throughput 2-hybrid in yeast & E. coli
• PIMBuilder®: in-house production management system
• VirtualPIM: prediction
• PBS® scoring technology
• PIMRider® platform
Hybrigenics Target Discovery Process
Starting point: selected pathology and mechanism of action
1. Target identification: identify target proteins & interactions through high-throughput functional proteomics
2. Target pre-validation: select relevant targets through bioinformatics analyses
3. Target validation: validate targets in cellular model & animal model through functional assays
Result: validated targets… in context
In-silico Target Validation Platform
Goals
• Validate protein interactions and SIDs
• Evaluate "target potential" and druggability
• Provide functional context for target candidates
• Prioritize "promising" candidates for biological validation
Means
• Integrate PIMs with functional clues of different origins
• Predict novel biological information
• Computer-aided decision process:
  • Provide comprehensive "decision-oriented" view of functional clues
  • Automated filtering
Output
• Prevalidated targets + functional context
The Genostar platform
A modular software platform for exploratory genomics
The Geno*™ Consortium:
• Pasteur Institute (Paris)
• National Institute for Research in Computer Science (INRIA, Grenoble)
• Genome Express (Grenoble)
• Hybrigenics
Genostar technology:
• Rich object-based knowledge representation system (objects, relations, tasks and strategies)
• Modular architecture
• Domain-specific biological modeling
Plug-in modules, each with specific objects, relations, strategies and graphical interfaces:
• GenoAnnot™: procaryotic and eucaryotic genome annotation
• GenoLink™: functional annotation using neighborhood relationships
• GenoBool™: statistical exploration of genomic data
• ?
GenoCore: central data management & knowledge representation
GenoViews: common graphical interfaces
External Information Integration Pipeline: expression data, sub-cellular location, genomic data, pathway data, 3D structure
Genolink: viewing biological data as a graph of relations
Genolink Composite Graph
• Vertices: biological entities
• Edges: similarity, interaction or association links
Link types:
• Sequence similarity links
• Protein interaction links
• Domain inclusion links
• Subcellular location links
• Profile similarity links
• Tissue expression links
Preprocessing of input data: genomic data, interaction data, domain data, sub-cellular location, mRNA expression data
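The composite graph described on this slide can be pictured as a typed multigraph. The following is a minimal sketch only, not Genolink's actual data structure: the class name `CompositeGraph`, the link-type labels and the choice to store every link symmetrically are assumptions made for illustration.

```python
from collections import defaultdict

# Link types named on the slide; the string labels are illustrative.
LINK_TYPES = {
    "sequence_similarity", "protein_interaction", "domain_inclusion",
    "subcell_location", "profile_similarity", "tissue_expression",
}


class CompositeGraph:
    """Typed multigraph: vertices are biological entities, edges carry a link type."""

    def __init__(self):
        # adjacency[vertex][link_type] -> set of neighbouring vertices
        self.adjacency = defaultdict(lambda: defaultdict(set))

    def add_link(self, a, b, link_type):
        if link_type not in LINK_TYPES:
            raise ValueError(f"unknown link type: {link_type}")
        # Assumption: all links are stored as undirected, which is a
        # simplification for directed notions such as domain inclusion.
        self.adjacency[a][link_type].add(b)
        self.adjacency[b][link_type].add(a)

    def neighbours(self, vertex, link_types=None):
        """Neighbours of `vertex`, optionally restricted to some link types."""
        wanted = LINK_TYPES if link_types is None else set(link_types)
        found = set()
        for link_type, targets in self.adjacency.get(vertex, {}).items():
            if link_type in wanted:
                found |= targets
        return found


g = CompositeGraph()
g.add_link("P1", "P2", "protein_interaction")
g.add_link("P1", "P3", "sequence_similarity")
g.add_link("P2", "nucleus", "subcell_location")
print(g.neighbours("P1", ["protein_interaction"]))
```

Keeping the link type on each edge lets a query combine or separate evidence sources, which is the point of viewing heterogeneous data as one graph.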
From PIMs to Pathways
Combine PIMs and external data to reconstruct biological pathways
PIM: network of interaction links
Steps from PIM to pathways: PIM annotation, pathways expansion, context-dependent homology
Pathways: network of functional links
• Metabolic reactions
• Signaling reactions
• Polymerization reactions
• Regulatory interactions
+ context (organism, tissue)
Common Data Model: PIMs, pathways databases, functional classifications
The BioPathways Consortium

Mission:
• Foster development of pathways informatics & systems biology
Goals:
• Scientific community buildup, standards recommendation, public outreach, industry-academia collaboration support, coordination with other groups
Means:
• Forum open to interested participants (academics, pharmas, biotechs, software vendors)
Achievements:
• Launched June 2000 by 3rd Millennium (Boston) and Hybrigenics (Paris)
• 1st meeting at ISMB 2000 -> work groups
• 2nd meeting at PSB 2001 -> first results on evaluation of pathways representations
• 3rd meeting, satellite meeting of ISMB 2001, Copenhagen -> focus on ontologies and pathways reconstruction (>150 attendees), new workgroups
• Several sponsors (pharmas, biotechs, IT companies)
• Over 200 participants from academia & industry
Functional annotation
• Objective: assign one or more "functions" to a gene or protein of known sequence
• Traditional methods:
  • Experimental results
  • Variations on a theme: propagation of experimentally derived annotations via sequence similarity
• Function?
  • Local and precise (e.g., protein P is an enzyme catalyzing reaction R)
  • Global and vague: membership in a high-level biological process (e.g., P is involved in glucose degradation)
  • What gets propagated: keywords, or a node of a functional classification tree
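Propagation of experimentally derived annotations via sequence similarity, the traditional method named on this slide, can be sketched as follows. This is an illustrative sketch only: the function name, the input layout and the `min_identity` cutoff are hypothetical, and the cutoff value is not a recommendation.

```python
def propagate_annotations(annotated, similarity_hits, min_identity=40.0):
    """Transfer keywords from annotated proteins to similar unannotated ones.

    annotated:       dict protein -> set of experimentally derived keywords
    similarity_hits: iterable of (query, subject, percent_identity) tuples
    min_identity:    hypothetical cutoff, chosen only for this example
    Returns dict query -> {keyword -> set of source proteins}.
    """
    inferred = {}
    for query, subject, identity in similarity_hits:
        if query in annotated:
            continue  # never overwrite an experimental annotation
        if identity >= min_identity and subject in annotated:
            for keyword in annotated[subject]:
                # Record the source so inferred annotations stay traceable.
                inferred.setdefault(query, {}).setdefault(keyword, set()).add(subject)
    return inferred


annotated = {"P_known": {"glucose degradation"}}
hits = [
    ("P_new", "P_known", 62.0),   # similar enough: keyword is propagated
    ("P_far", "P_known", 25.0),   # below the cutoff: nothing is propagated
]
result = propagate_annotations(annotated, hits)
print(result)
```

Only experimentally annotated proteins act as sources here, so an inferred keyword is never propagated a second time through a chain of similarities.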
An effort toward consensus: Gene Ontology
The Gene Ontology Consortium (2000), Nature Genet. 25:25-29
The dogma…
Sequence → Structure → Function
…and the experiments
[Diagram: perturbation technologies and observation technologies are applied within a cellular context; the links between sequence, structure, function and phenotype are marked "?"]
Perturbation-observation pairs: false positives, false negatives, statistical processing, formalization of the conclusion…
Integration of heterogeneous data

• Joint use of functional clues from a variety of experimental approaches to:
  • Validate the biological relevance of interactions
  • Determine the function of proteins
  • Validate targets in silico
• Examples:
  • Interaction + expression
  • Interaction + 3D structure
  • Location + expression
  • Phylogenetic profiles + domain fusion
• Recent problem, a bottleneck in drug discovery efforts
• Frontier for the bioinformatics community
  • Technology: normalization, formats, ontologies
  • Science: automate (some) biological reasoning?
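The "interaction + expression" example can be made concrete with a small sketch: each reported interaction is annotated with the correlation of the two partners' expression profiles, as one possible clue to its biological relevance. The function names, the data layout and the use of Pearson correlation are assumptions for illustration, not the method used in the talk.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def score_interactions(interactions, expression):
    """Attach a co-expression score to each interaction.

    The score is None when a partner has no expression profile, so that
    missing data is not confused with a lack of correlation.
    """
    scored = []
    for a, b in interactions:
        if a in expression and b in expression:
            scored.append((a, b, pearson(expression[a], expression[b])))
        else:
            scored.append((a, b, None))
    return scored


# Toy data: profiles over four hypothetical conditions.
expression = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
interactions = [("A", "B"), ("A", "C"), ("A", "D")]
scored = score_interactions(interactions, expression)
for a, b, r in scored:
    print(a, b, r)
```

Returning `None` for missing profiles keeps the incomplete-information case explicit, which matters when several data sources with different coverage are combined.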
Evaluating pathways representations
Vincent Schächter, Hybrigenics, Paris
Aviv Regev, Tel-Aviv University
BioPathways Formalisms Workgroup
Evaluation scope : untangling the web...

• Large body of literature, focusing on different biological phenomena and different theoretical issues
• A typical article on pathways may include one or more of the following:
  • A data-model, describing (a fraction of) the pathway "universe of discourse"
  • A formalism, used to describe the data-model and to express algorithms / functions
  • Description of algorithms based on characteristics of both the formalism and the data model
  • Description of implementations of data-storage functionalities and/or of some of the above algorithms
Excerpt from target evaluation list: "non DE" formalisms
• Petri nets (basic, hybrid, self-modifying, time-dependent, hierarchical, mobile)
• Process algebra (basic and stochastic pi-calculus)
• Markup languages (CellML and SBML)
• Biocalculus
• Regulatory grammars (Collado-Vides)
• Semiotes (Kazic)
• Statecharts (Kam, Holcombe)
• Boolean networks (basic, multi-level)
• Hierarchical networks (Bodnar)
• Neural networks (Mjolsness)
• Molecular graph reaction networks (McCaskill)
• Molecular interaction maps (Kohn)
• Electrical circuits (Keane)
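To give a feel for one of these formalisms, here is a minimal sketch of a basic Petri net (places, transitions, tokens) applied to a toy enzymatic reaction. The dictionary layout with `consume` and `produce` keys and the reaction itself are assumptions made for illustration; none of the extensions listed above (hybrid, self-modifying, timed) are covered.

```python
def enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n
               for place, n in transition["consume"].items())


def fire(marking, transition):
    """Return the marking reached by firing `transition` (input is not modified)."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, n in transition["consume"].items():
        new_marking[place] = new_marking.get(place, 0) - n
    for place, n in transition["produce"].items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking


# Toy reaction: E + S -> ES -> E + P (places hold molecule counts).
bind = {"consume": {"E": 1, "S": 1}, "produce": {"ES": 1}}
catalyse = {"consume": {"ES": 1}, "produce": {"E": 1, "P": 1}}

m0 = {"E": 1, "S": 2, "ES": 0, "P": 0}
m1 = fire(m0, bind)       # the enzyme is now bound, so `bind` is disabled
m2 = fire(m1, catalyse)   # the enzyme is released together with one product
print(m1, m2)
```

The marking plays the role of the system state, and questions such as "can a product ever be made?" become reachability questions on markings.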
Some examples of "discrete" representations
• Object-oriented models:
  • Queries on all types of networks
  • Reconstruction, but with the problem of incomplete information
• Boolean networks:
  • Qualitative simulation, reconstruction from expression data
  • Applied to "regulatory networks"
• Petri nets:
  • Finer-grained qualitative simulation, formal analysis of behavior
  • Applied to regulatory networks
  • Possible application to metabolic and signaling networks with extensions (self-modifying PN, hybrid PN…)
• Process algebras:
  • Simulation, formal analysis, reconstruction
  • Applied to signaling and regulatory networks (metabolism with stochastic extensions)
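Qualitative simulation with a Boolean network, one of the discrete representations listed on this slide, can be sketched in a few lines. The three-gene negative feedback loop and the function names below are assumptions for illustration, not a model taken from the talk.

```python
def step(state, rules):
    """Synchronous update: every node is recomputed from the same previous state."""
    return {node: rule(state) for node, rule in rules.items()}


def run_until_repeat(state, rules, max_steps=100):
    """Iterate until a state recurs or `max_steps` states have been visited.

    Returns the list of visited states and the state that came next, which is
    a repeated state unless the step limit was hit first.
    """
    seen = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        state = step(state, rules)
    return seen, state


# Toy regulatory loop: A activates B, B activates C, C inhibits A.
rules = {
    "A": lambda s: not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["B"],
}
start = {"A": True, "B": False, "C": False}
trajectory, repeated = run_until_repeat(start, rules)
for state in trajectory:
    print(state)
```

With these rules the network never settles: it cycles through six states and returns to the starting one, the kind of qualitative behavior (oscillation versus steady state) that such a simulation is meant to expose.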
The position of formalisms in the context of pathways informatics
[Diagram: three functional areas, each supported by its own formalism + data-model]
• Pathway construction (pathway generation, pathway selection): supported by a construction-oriented formalism + data-model
• Dynamics (simulation, analysis): supported by a dynamics-oriented formalism + data-model
• Data storage & retrieval (query language): supported by a database-oriented formalism + data-model
Each formalism + data-model is tied by an "Expresses" link to the core representation / ontology (biological scope, formal expressiveness).
Evaluate and compare: a modular approach
• Evaluate expressiveness/ease of use of a representation relative to specific goals/functionalities
• Compare representations in the categories for which they were designed
• Reduce each category to a set of evaluation items that can be rated and compared as objectively as possible
Core representation / “Ontology”


• Conceptual structure of the universe of discourse (abstract and concrete entities, relations, hierarchies...)
• Constrains the scope of phenomena that can be described, and thus queried, analyzed and reconstructed
• Often implicit in a given pathway representation: need to extract...
Possible evaluation schemes:
1. Compare "features" of ontology
2. Expressiveness benchmark: set of biological "situations"
3. Translation of data models into common formalisms + comparison
How do you represent "gene A inhibits gene B" in your data model?
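One possible answer to the benchmark question, sketched as a small object model. Everything here is hypothetical (class names, fields, the `regulators_of` helper); the point is only that the statement fixes an effect while leaving mechanism, context and evidence undetermined, and that a data model has to say how it stores those gaps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # e.g. "gene" or "protein"


@dataclass(frozen=True)
class Relation:
    source: Entity
    target: Entity
    effect: str                      # e.g. "inhibits", "activates"
    mechanism: Optional[str] = None  # None means "not determined"
    context: Optional[str] = None    # organism / tissue, when known
    evidence: Optional[str] = None   # experimental support, when known


def regulators_of(target, relations, effect):
    """Entities that exert `effect` on `target` according to `relations`."""
    return [r.source for r in relations if r.target == target and r.effect == effect]


gene_a = Entity("A", "gene")
gene_b = Entity("B", "gene")
# "Gene A inhibits gene B": the effect is known, the mechanism is not.
inhibition = Relation(gene_a, gene_b, "inhibits")
print(regulators_of(gene_b, [inhibition], "inhibits"))
```

A representation that forced a mechanism (transcriptional repression, protein degradation, and so on) could not store this statement without inventing information, which is what an expressiveness benchmark of this kind is designed to reveal.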
Conceptual Model: Biological Scope
Evaluation items
• Pathway type
  • Criterion: expressiveness
• Biological objects
  • Criterion: expressiveness
  • Associated secondary data (examples): location, expression, sequence, structure; experimental evidence
• Biochemical relations
  • Criteria: expressiveness, efficiency (combinatorial explosion)
  • Associated secondary data (examples): location, rate, delays, reaction mechanism...; experimental evidence
• Conceptual relations
  • Criteria: expressiveness, efficiency
  • Associated secondary data (examples): genetic data; experimental evidence
Criteria for associated secondary data
• Existence (y/n)
• Representation mode
• Context dependency (biological context constructed from secondary data)
Core Representation: Formal Expressiveness
Evaluation items (issue, then criteria)
• Explicit representation of incomplete information:
  • Constraints on attributes of objects/relations
  • Explicit rep. of undetermined objects
  • Explicit rep. of undetermined relations
  • Global constraints
  • Query language expressiveness
• Hierarchy, modularity and multilevel representation:
  • Existence of entities/relations at different scales
  • Existence and nature of encapsulation mechanism
  • Multi-scale queries
  • Inter-scale mapping
Data Storage and Retrieval
Storage and retrieval of data: "database-related" functionalities
• Extremes: relational or OO models vs., e.g., most simulation-oriented formalisms...
• A data-retrieval-oriented formalism can be used "below" other formalisms
• Query language:
  • Retrieve information within a structured, homogeneous, compositional framework
  • Shifting boundary with analysis and reconstruction algorithms
Evaluation items / sub-categories:
• Robust database: implementation issue
• Query language ease of use
• Query language expressiveness
  • Limited by formalism and ontology expressiveness
Pathway reconstruction
Construction/prediction of pathways in a given biological environment (organism, tissue, condition, location…) from a combination of:
• experimental data
• fully instantiated pathway information
• partially instantiated (or "incomplete") pathway data, such as interaction data
Special cases: reverse engineering, pathway inference
Evaluation items / sub-categories:
• Input data types
• Pathway generation algorithm
• Pathway selection algorithm:
  • Pathway "fitness" function
  • Pathway similarity/homology measure
• Interactive validation?
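The split between a pathway generation algorithm and a pathway selection algorithm can be illustrated with a deliberately simple sketch: candidates are simple paths through a directed, confidence-weighted interaction graph, and selection keeps the candidate with the highest fitness. The graph, the scores and the product-of-confidences fitness function are all assumptions made for the example.

```python
def generate_paths(graph, start, goal, max_len=4):
    """Generation step: enumerate simple paths with at most `max_len` edges."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        if len(path) - 1 >= max_len:
            continue
        for neighbour in graph.get(node, {}):
            if neighbour not in path:  # keep paths simple (no repeated node)
                stack.append((neighbour, path + [neighbour]))


def fitness(path, graph):
    """Hypothetical fitness: product of the edge confidence scores."""
    score = 1.0
    for a, b in zip(path, path[1:]):
        score *= graph[a][b]
    return score


def select_best(graph, start, goal, max_len=4):
    """Selection step: keep the fittest candidate, or None if there is none."""
    candidates = list(generate_paths(graph, start, goal, max_len))
    return max(candidates, key=lambda p: fitness(p, graph), default=None)


# Toy interaction data: receptor R, kinases K1/K2, transcription factor TF.
graph = {
    "R": {"K1": 0.9, "K2": 0.5},
    "K1": {"TF": 0.8},
    "K2": {"TF": 0.9},
    "TF": {},
}
best = select_best(graph, "R", "TF")
print(best)
```

Separating the two steps mirrors the evaluation items above: the generator determines which pathways can be proposed at all, while the fitness function encodes what counts as a plausible one.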
Dynamics
Study of network dynamics (regulatory networks, signal transduction, metabolic pathways):
• Simulation runs
• Analysis of dynamic behavior
Evaluation items / sub-categories:
• States: nature, expressiveness, level of detail vs. available data
• Evolution rules / reaction model: rule, implementation
• Time: continuous/discrete, synchronous/asynchronous updates
• Space: continuous/discrete, topology, resolution
• Analysis:
  • Scope: state reachability, liveness of transitions, substance flow...
  • Formal methods available
  • Comparative power
  • Limited to steady state?
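Two of the items above, the choice between synchronous and asynchronous updates and state reachability analysis, interact in a way that a small sketch makes visible. The two-gene mutual-inhibition network and the function names are assumptions for illustration.

```python
from collections import deque


def successors(state, rules, mode):
    """States reachable in one step; a state is a tuple of booleans."""
    targets = tuple(rule(state) for rule in rules)
    if mode == "synchronous":
        return {targets}  # all nodes are updated at once
    # Asynchronous: update a single node at a time, wherever it would change.
    result = set()
    for i, value in enumerate(targets):
        if value != state[i]:
            result.add(state[:i] + (value,) + state[i + 1:])
    return result  # empty for a steady state


def reachable(start, rules, mode):
    """Breadth-first exploration of the state transition graph."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in successors(queue.popleft(), rules, mode):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Toy switch: gene 0 and gene 1 inhibit each other.
rules = [lambda s: not s[1], lambda s: not s[0]]
start = (False, False)
sync_states = reachable(start, rules, "synchronous")
async_states = reachable(start, rules, "asynchronous")
print(sorted(sync_states), sorted(async_states))
```

Under synchronous updates the network oscillates between "both off" and "both on" and never reaches either steady state, whereas under asynchronous updates both steady states are reachable. The update scheme is therefore part of the model, not an implementation detail.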
Methodology: what do we evaluate?
[Diagram]
• Evaluation targets: formalism, data-model, ontology
• The formalism supports queries, reconstruction and simulation
• The formalism describes the data-model and ontology
• Translation into a common ontology description language?