Computational Exploration of Metabolic Networks with Pathway Tools Part 1: Overview & Representations
Suzanne Paley
Bioinformatics Research Group, SRI International
http://BioCyc.org/
Bioinformatics
Function Too Large for One Mind to Grasp
Example: E. coli metabolic network
160 pathways involving 744 reactions and 791 substrates
Example: E. coli genetic network
Control by 97 transcription factors of 1174 genes in 630 transcription units
Past solutions:
Partition theories across multiple minds
Encode theories in natural-language text
We cannot compute with theories in those forms
Evaluate theories for consistency with new data: microarrays
Refine theories with respect to new data
Compare theories describing different organisms
Solution: Biological Knowledge Bases
SRI International Bioinformatics
Store biological knowledge and theories in computers in a declarative form
Amenable to computational analysis and generative user interfaces
Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
A high quality comprehensive knowledge base enables us to ask and answer important new questions
Terminology
Model Organism Database (MOD) – DB describing genome and other information about an organism
Pathway/Genome Database (PGDB) – MOD that combines information about
Pathways, reactions, substrates
Enzymes, transporters
Genes, replicons
Transcription factors, promoters, operons, DNA binding sites
BioCyc – Collection of 15 PGDBs at BioCyc.org
EcoCyc, AgroCyc, HumanCyc
Pathway Tools Software
PathoLogic
Prediction of metabolic network from genome
Computational creation of new Pathway/Genome Databases
Pathway/Genome Editors
Distributed curation of genome annotations
Distributed object database system
Interactive editing tools
Pathway/Genome Navigator
WWW publishing of PGDBs
Graphic depictions of pathways, chromosomes, operons
Analysis operations
Pathway visualization of gene-expression data
Global comparisons of metabolic networks
Pathway Tools Software
[Figure: Pathway Tools Software components: PathoLogic Pathway Predictor, Pathway/Genome Databases, Pathway/Genome Editors, Pathway/Genome Navigator]
Pathway/Genome Database
[Figure: a Pathway/Genome Database describes the cell at several levels: Pathways; Reactions; Compounds; Proteins; Genes; Operons, Promoters, DNA Binding Sites; Chromosomes, Plasmids]
Pathway Tools Algorithms
Visualization and editing tools for the following datatypes
Full Metabolic Map
Paint gene expression data on metabolic network; compare metabolic networks
Pathways
Pathway prediction
Reactions
Balance checker
Compounds
Chemical substructure comparison
Enzymes, Transporters, Transcription Factors
Genes
Chromosomes
Operons
Operon prediction; visualize genetic network
Definitions
Chemical reactions interconvert chemical compounds
A + B → C + D
An enzyme is a protein that accelerates chemical reactions
A pathway is a linked set of reactions
A → C → E
Often regulated as a unit
A conceptual unit of the cell's biochemical machine
Operations of the Metabolic Overview
Find pathways, compounds
Find reactions
By enzyme name, EC number, substrates, modulation
All with isozymes
All occurring in multiple pathways
By EC class, pathway class
Find genes
By name, gene class
All regulated by transcriptional regulator protein
Metabolic Overview Queries
Species comparison
Highlight reactions that are shared or not shared with any one or all of a specified set of species
Overlay expression data
Colors reflect expression level and are user-configurable
Can show a single experiment or an animated time series
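The species comparison described above reduces to set operations over each organism's reactions. The following is a minimal Python sketch, assuming each PGDB can be summarized as a set of reaction identifiers. The reaction IDs and the helper name `highlight` are made up for illustration and are not part of Pathway Tools.

```python
def highlight(reference, others, shared=True, quantifier="any"):
    """Return the reactions of `reference` that are shared (or not shared)
    with any one / all of the reaction sets in `others`."""
    if quantifier not in ("any", "all"):
        raise ValueError("quantifier must be 'any' or 'all'")
    combine = any if quantifier == "any" else all
    result = set()
    for rxn in reference:
        # One flag per comparison species: does it match the requested status?
        flags = [(rxn in other) == shared for other in others]
        if combine(flags):
            result.add(rxn)
    return result


# Hypothetical reaction sets for a reference organism and two others.
reference = {"R1", "R2", "R3", "R4"}
species_a = {"R1", "R2"}
species_b = {"R1", "R3"}

shared_with_all = highlight(reference, [species_a, species_b], True, "all")
missing_from_all = highlight(reference, [species_a, species_b], False, "all")
```

Here `shared_with_all` holds the reactions present in every selected species, and `missing_from_all` holds the reactions unique to the reference organism.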
EcoCyc Project
E. coli Encyclopedia
Model-Organism Database for E. coli
Began in 1992 as collaboration between Karp and Riley
Over 3500 literature citations
Collaborative development via Internet
Karp (SRI) -- Bioinformatics architect
John Ingraham -- Advisor (SRI) Metabolic pathways
Saier (UCSD) and Paulsen (TIGR) -- Transport
Collado (UNAM) -- Regulation of gene expression
Ontology: 1000 biological classes
Database content: 17,700 instances
EcoCyc = E. coli Dataset + Pathway/Genome Navigator
Pathways: 165
Reactions: 2,760
Compounds: 774
Enzymes: 914
Transporters: 162
Promoters: 812
TransFac Sites: 956
Citations: 3,508
Proteins: 4,273
Genes: 4,393
Transcription Units: 724
Factors: 110
http://BioCyc.org/
MetaCyc: Metabolic Encyclopedia
Nonredundant metabolic pathway database
Describe a representative sample of every experimentally determined metabolic pathway
Literature-based DB with extensive references and commentary
Pathways, reactions, enzymes, substrates
460 pathways, 1267 enzymes, 4294 reactions
172 E. coli pathways, 2735 citations
Nucleic Acids Research 30:59-61 2002.
Jointly developed by SRI and Carnegie Institution
New focus on plant pathways
MetaCyc Data
MetaCyc contains one DB object for each distinct pathway
Distinct in terms of reaction steps
Each pathway labeled with species it occurs in
MetaCyc pathways are experimentally determined
4218 reactions in MetaCyc
401 lack EC numbers
MetaCyc Enzyme Data
Reaction(s) catalyzed
Alternative substrates
Cofactors / prosthetic groups
Activators and inhibitors
Subunit structure
Molecular weight, pI
Comment, literature citations
Species
MetaCyc Frequent Organisms
Escherichia coli, Arabidopsis thaliana, Homo sapiens, Pseudomonas, Bacillus subtilis, Salmonella typhimurium, Sulfolobus solfataricus, Pseudomonas putida, Saccharomyces cerevisiae, Haemophilus influenzae, Glycine max, Deinococcus radiodurans
20 20 18 14 14 13 156 47 30 21 11 10
EcoCyc and MetaCyc
Review level databases
Data derived primarily from biomedical literature
Manual entry by staff curators
Updates by staff curators only
Data validation
Consistency constraints
Lisp programs that verify other semantic relationships
Unbalanced chemical reactions
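A check for unbalanced reactions can be pictured as comparing element totals on the two sides of a reaction. The sketch below is in Python rather than the Lisp used by the project, and it assumes each compound carries an elemental formula stored as a dictionary. The function names are hypothetical, and the formulas are for the neutral forms of the compounds.

```python
from collections import Counter

# Elemental formulas (neutral forms); assumed to be stored with each compound.
FORMULAS = {
    "succinate": {"C": 4, "H": 6, "O": 4},
    "fumarate": {"C": 4, "H": 4, "O": 4},
    "FAD": {"C": 27, "H": 33, "N": 9, "O": 15, "P": 2},
    "FADH2": {"C": 27, "H": 35, "N": 9, "O": 15, "P": 2},
}


def side_totals(side, formulas):
    """Sum element counts over (compound, coefficient) pairs."""
    totals = Counter()
    for compound, coeff in side:
        for element, count in formulas[compound].items():
            totals[element] += coeff * count
    return totals


def imbalance(left, right, formulas):
    """Return {element: left - right} for every element that does not balance."""
    lhs = side_totals(left, formulas)
    rhs = side_totals(right, formulas)
    return {e: lhs[e] - rhs[e] for e in set(lhs) | set(rhs) if lhs[e] != rhs[e]}


balanced = imbalance(
    [("succinate", 1), ("FAD", 1)], [("fumarate", 1), ("FADH2", 1)], FORMULAS
)
# Dropping FADH2 from the right side leaves its atoms unaccounted for.
unbalanced = imbalance([("succinate", 1), ("FAD", 1)], [("fumarate", 1)], FORMULAS)
```

An empty result means the reaction balances. A fuller checker would also compare net charge, which this sketch ignores.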
Computationally-Derived PGDBs
[Figure: PathoLogic Software takes an Annotated Genomic Sequence (Gene Products, Genes/ORFs, DNA Sequences) and a Multi-organism Pathway Database (MetaCyc: Pathways, Reactions, Compounds) and produces a Pathway/Genome Database (Pathways, Reactions, Compounds, Gene Products, Genes, Genomic Map)]
PathoLogic integrates genome and pathway data to identify putative metabolic networks
PathoLogic Input/Output
Inputs:
File listing genetic elements
http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
Files containing DNA sequence for each genetic element
Files containing annotation for each genetic element
MetaCyc database
Output:
Pathway/genome database for the subject organism
Directory tree for the subject organism
Reports that summarize:
Evidence contained in the input genome for the presence of reference pathways
Reactions missing from inferred pathways
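The two reports listed above can be illustrated with a small Python sketch. It assumes reference pathways are available as lists of reaction identifiers and that the genome annotation yields a set of reactions with at least one assigned enzyme. The pathway and reaction names are invented, and the simple fraction used here is only an illustration, not PathoLogic's actual scoring.

```python
def pathway_report(reference_pathways, catalyzed_reactions):
    """For each reference pathway, list the reactions that have enzyme
    evidence in the genome and the reactions that are missing."""
    report = {}
    for name, reactions in reference_pathways.items():
        present = [r for r in reactions if r in catalyzed_reactions]
        missing = [r for r in reactions if r not in catalyzed_reactions]
        report[name] = {
            "present": present,
            "missing": missing,
            "fraction": len(present) / len(reactions) if reactions else 0.0,
        }
    return report


# Hypothetical reference pathways and annotated reactions.
reference_pathways = {
    "pathway-A": ["rxn1", "rxn2", "rxn3", "rxn4"],
    "pathway-B": ["rxn5", "rxn6"],
}
catalyzed = {"rxn1", "rxn2", "rxn4"}

report = pathway_report(reference_pathways, catalyzed)
```

In this toy input, pathway-A has evidence for three of its four reactions and one missing reaction, while pathway-B has no evidence at all and would be a candidate for rejection.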
PathoLogic Functionality
Initialize schema for new PGDB
Transform existing genome to PGDB form
Infer metabolic pathways and store in PGDB
Infer operons and store in PGDB
Assist user with manual tasks
Assign enzymes to reactions they catalyze
Identify false-positive pathway predictions
Build protein complexes from monomers
Assemble Overview diagram
BioCyc Collection of Pathway/Genome DBs
Literature-based Datasets:
Escherichia coli (EcoCyc)
MetaCyc
PGDBs at other sites:
Arabidopsis thaliana (TAIR)
Methanococcus jannaschii (EBI)
Saccharomyces cerevisiae (SGD)
Synechocystis PCC6803
Computationally-derived datasets:
Agrobacterium tumefaciens
Caulobacter crescentus
Chlamydia trachomatis
Bacillus subtilis
Helicobacter pylori
Haemophilus influenzae
Homo sapiens
Mycobacterium tuberculosis H37Rv
Mycobacterium tuberculosis CDC1551
Mycoplasma pneumoniae
Pseudomonas aeruginosa
Treponema pallidum
Vibrio cholerae
http://BioCyc.org/
Yellow = Open Database
HumanCyc: Human Metabolic Pathway Database
PGDB of human metabolic pathways built using PathoLogic
Contains information on 28,700 genes, their products, and the metabolic reactions and pathways they catalyze (no signalling pathways)
Chromosome and contigs from Ensembl
Human genetic loci from LocusLink
Mitochondrion data from GenBank
Ensembl and LocusLink gene entries were merged to eliminate redundancies where possible.
Contains links to human genome web sites
Plan to hire one curator to refine and curate with respect to the literature over a 2-year period
Remove false-positive predictions
Insert known pathways missed by PathoLogic
Add comments and citations from pathways and enzymes to the literature
Add enzyme activators, inhibitors, cofactors, tissue information
Funded by commercial consortium
BioCyc and Pathway Tools Availability
WWW BioCyc freely available to all
BioCyc.org
Six BioCyc DBs openly available to all
BioCyc DBs freely available to non-profits
Flatfiles downloadable from BioCyc.org
Binary executable:
Sun UltraSparc-170 w/ 64MB memory
PC, 400MHz CPU, 64MB memory, Windows-98 or newer
PerlCyc API
Pathway Tools freely available to non-profits
Information Sources
Pathway Tools User’s Guide
aic-export/ecocyc/genopath/released/doc/userguide1.pdf
Pathway/Genome Navigator
Appendix A: Guide to the Pathway Tools Schema
aic-export/ecocyc/genopath/released/doc/userguide2.pdf
PathoLogic, Editing Tools
Pathway Tools Web Site
http://bioinformatics.ai.sri.com/ptools/
Publications, programming examples, etc.
Pathway Tools Tutorial
http://bioinformatics.ai.sri.com/ptools/tutorial/
Pathway Tools Implementation Details
Allegro Common Lisp
Sun and PC platforms
Ocelot object database
250,000 lines of code
Lisp-based WWW server at BioCyc.org
Manages 15 PGDBs
Frame Data Model
Frame Data Model -- organizational structure for a PGDB
Knowledge base (KB, Database, DB)
Frames
Slots
Knowledge Base
Collection of frames and their associated slots, values, facets, and annotations
AKA: Database, PGDB
Can be stored within
An Oracle DB
A disk file
A Pathway Tools binary program
Frames
Entities with which facts are associated
Kinds of frames:
Classes: Genes, Pathways, Biosynthetic Pathways
Instances (objects): trpA, TCA cycle
Classes:
Superclass(es)
Subclass(es)
Instance(s)
A symbolic frame name (id, key) uniquely identifies each frame
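The relationship between class frames, instance frames, and unique frame names can be sketched in a few lines of Python. This is an illustration of the idea only, not the Ocelot implementation. The class `KnowledgeBase` and its methods are hypothetical, and placing the TCA cycle directly under Pathways is a simplification.

```python
class KnowledgeBase:
    """Toy frame store: every frame has a unique name and parent classes."""

    def __init__(self):
        self._parents = {}   # frame name -> list of direct parent class names
        self._is_class = {}  # frame name -> True for classes, False for instances

    def _add(self, name, parents, is_class):
        if name in self._parents:
            raise ValueError(f"frame name already in use: {name}")
        for parent in parents:
            if not self._is_class.get(parent, False):
                raise KeyError(f"unknown class: {parent}")
        self._parents[name] = list(parents)
        self._is_class[name] = is_class

    def add_class(self, name, superclasses=()):
        self._add(name, superclasses, True)

    def add_instance(self, name, classes):
        self._add(name, classes, False)

    def is_a(self, frame, cls):
        """True if `frame` is `cls` or reaches it through parent links."""
        stack = [frame]
        while stack:
            current = stack.pop()
            if current == cls:
                return True
            stack.extend(self._parents[current])
        return False

    def instances_of(self, cls):
        """All instances of `cls`, including instances of its subclasses."""
        return sorted(
            name
            for name, is_class in self._is_class.items()
            if not is_class and self.is_a(name, cls)
        )


kb = KnowledgeBase()
kb.add_class("Genes")
kb.add_class("Pathways")
kb.add_class("Biosynthetic-Pathways", superclasses=["Pathways"])
kb.add_instance("trpA", classes=["Genes"])
kb.add_instance("TCA-cycle", classes=["Pathways"])
```

Because classes and instances share one table keyed by name, a second frame called "Genes" is rejected regardless of its kind.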
Slots
Encode attributes/properties of a frame
Integer, real number, string
Represent relationships between frames
The value of a slot is the identifier of another frame
Every slot is described by a “slot frame” in a KB that defines meta information about that slot
Properties of Slots
Number of values
Single valued
Multivalued: sets, bags
Slot values
Any LISP object: Integer, real, string, symbol (frame name)
Slotunits define properties of slots: datatypes, classes, constraints
Two slots are inverses if they encode opposite relationships
Slot Product in class Genes
Slot Gene in class Polypeptides
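Inverse slots can be kept consistent by writing the reverse link whenever a value is added. The Python sketch below assumes a simple in-memory store. `SlotStore` is a hypothetical name, and the polypeptide frame name `TrpA-polypeptide` is invented for the example.

```python
from collections import defaultdict

# Pairs of slots that encode opposite relationships.
INVERSES = {"Product": "Gene", "Gene": "Product"}


class SlotStore:
    """Toy slot storage with automatic maintenance of inverse slots."""

    def __init__(self, inverses):
        self.inverses = inverses
        self.values = defaultdict(list)  # (frame, slot) -> list of values

    def add_value(self, frame, slot, value):
        current = self.values[(frame, slot)]
        if value not in current:  # set semantics: no duplicate values
            current.append(value)
        inverse = self.inverses.get(slot)
        if inverse is not None:
            back = self.values[(value, inverse)]
            if frame not in back:
                back.append(frame)

    def get(self, frame, slot):
        return list(self.values.get((frame, slot), []))


store = SlotStore(INVERSES)
store.add_value("trpA", "Product", "TrpA-polypeptide")
store.add_value("trpA", "Common-Name", "trpA")  # plain attribute, no inverse
```

After the first call, the Gene slot of the polypeptide frame already points back at trpA without being set explicitly.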
Pathway Tools Ontology
1064 classes
Main classes such as: Pathways, Reactions, Compounds, Macromolecules, Proteins, Replicons, DNA-Segments (Genes, Operons, Promoters)
Taxonomies for Pathways, Reactions, Compounds
205 slots
Meta-data: Creator, Creation-Date
Comment, Citations, Common-Name, Synonyms
Attributes: Molecular-Weight, DNA-Footprint-Size
Relationships: Catalyzes, Component-Of, Product
Classes, instances, slots all stored side by side in DBMS, share a single namespace
Slot Links from Gene to Pathway Frame
[Figure: slot links from gene to pathway. Genes sdhA, sdhB, sdhC, sdhD link through product to the polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; these link through component-of to Succinate dehydrogenase; the enzyme links through catalyzes to an Enzymatic-reaction frame, which links to the reaction succinate + FAD = fumarate + FADH2 (left: succinate, FAD; right: fumarate, FADH2); the reaction links through in-pathway to TCA Cycle]
The Enzymatic-reaction frame holds properties of the pairing between enzyme and reaction
[Figure: the same chain annotated with example slots: EC#, Keq, Cofactors, Inhibitors, Molecular wt, pI, Left-end-position]
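The chain of slot links in the succinate dehydrogenase example can be walked programmatically. The Python sketch below encodes the frames from the figure as nested dictionaries and follows a fixed sequence of slots from a gene to its pathways. The frame ID `enzrxn-sdh` and the helper `follow` are hypothetical, and a real query would also have to handle enzymes that are not part of a complex.

```python
REACTION = "succinate + FAD = fumarate + FADH2"

# Frames and slot values transcribed from the succinate dehydrogenase example.
KB = {
    "sdhA": {"product": ["Sdh-flavo"]},
    "sdhB": {"product": ["Sdh-Fe-S"]},
    "sdhC": {"product": ["Sdh-membrane-1"]},
    "sdhD": {"product": ["Sdh-membrane-2"]},
    "Sdh-flavo": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-Fe-S": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-1": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-2": {"component-of": ["Succinate dehydrogenase"]},
    "Succinate dehydrogenase": {"catalyzes": ["enzrxn-sdh"]},
    "enzrxn-sdh": {"reaction": [REACTION]},  # hypothetical frame ID
    REACTION: {
        "left": ["succinate", "FAD"],
        "right": ["fumarate", "FADH2"],
        "in-pathway": ["TCA Cycle"],
    },
    "TCA Cycle": {},
}

GENE_TO_PATHWAY = ["product", "component-of", "catalyzes", "reaction", "in-pathway"]


def follow(kb, start, slots):
    """Follow a sequence of slots from `start`; return the frames reached."""
    frontier = {start}
    for slot in slots:
        frontier = {
            value for frame in frontier for value in kb.get(frame, {}).get(slot, [])
        }
    return frontier


pathways_of_sdhA = follow(KB, "sdhA", GENE_TO_PATHWAY)
```

Stopping the walk after two slots returns the enzyme complex instead of the pathway, which shows how the same links answer different questions.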
Monofunctional Monomer
[Figure: one chain of frames: Pathway, Reaction, Enzymatic-reaction, Monomer, Gene]
Bifunctional Monomer
[Figure: one Gene and one Monomer linked to two Enzymatic-reactions, each linked to its own Reaction in a Pathway]
Monofunctional Multimer
[Figure: Pathway, Reaction, Enzymatic-reaction, and one Multimer built from four Monomers, each Monomer linked to its own Gene]
Pathway and Substrates
[Figure: a Reaction links through left to Reactant-1 and Reactant-2 and through right to Product-1 and Product-2; Reactions link through in-pathway to a Pathway]
Genetic Network Representation
Describe biological entities involved in control of transcription initiation
Promoters, operators, transcription factors, operons, terminators
Describe molecular interactions among these entities
Modulation of transcription factor activity
Binding of transcription factors to DNA binding sites
Effects on transcription initiation
Ontology for Transcriptional Regulation
One DB object defined for each biological entity and for each molecular interaction
[Figure: trp and apoTrpR form TrpR*trp in a complexation reaction. Int001 is the binding of TrpR*trp to site001. Int002 is the binding of RpoSig70 to promoter pro001 of transcription unit trpLEDCBA (genes trpL, trpE, trpD, trpC, trpB, trpA)]
Int001 (binding of TrpR*trp to site001) inhibits Int002 (binding of RNA polymerase to the promoter) and consequently prevents transcription of the genes in the transcription unit.
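Treating each molecular interaction as its own object makes the regulatory logic easy to query. The following Python sketch models the two interactions of the trp example and asks whether the transcription unit is transcribed. The `Interaction` class, the `inhibits` field, and the single-pass inhibition rule are simplifying assumptions made for this illustration, not the schema itself.

```python
from dataclasses import dataclass, field


@dataclass
class Interaction:
    name: str
    description: str
    inhibits: list = field(default_factory=list)  # names of blocked interactions


INTERACTIONS = [
    Interaction("Int001", "binding of TrpR*trp to site001", inhibits=["Int002"]),
    Interaction("Int002", "binding of RNA polymerase to promoter pro001"),
]


def effective_interactions(interactions, occurring):
    """Names in `occurring` that are not inhibited by another occurring interaction."""
    blocked = {
        target
        for interaction in interactions
        if interaction.name in occurring
        for target in interaction.inhibits
    }
    return {name for name in occurring if name not in blocked}


def trp_unit_transcribed(occurring):
    """trpLEDCBA is transcribed when polymerase binding (Int002) is effective."""
    return "Int002" in effective_interactions(INTERACTIONS, occurring)


without_repressor = trp_unit_transcribed({"Int002"})
with_repressor = trp_unit_transcribed({"Int001", "Int002"})
```

With the repressor complex bound, Int002 is blocked and the function reports no transcription, matching the description above.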
Principal Classes
Class names are capitalized, plural
Genetic-Elements, with subclasses:
Chromosomes
Plasmids
Genes
Transcription-Units
RNAs
Proteins, with subclasses:
Polypeptides
Protein-Complexes
Principal Classes
Reactions, with subclasses:
Transport-Reactions
Enzymatic-Reactions
Pathways
Compounds-And-Elements
Slots in Multiple Classes
Common-Name
Synonyms
Names (computed as union of Common-Name, Synonyms)
Comment
Citations
DB-Links
Genes Slots
Chromosome
Left-End-Position
Right-End-Position
Centisome-Position
Transcription-Direction
Product
Proteins Slots
Molecular-Weight-Seq
Molecular-Weight-Exp
pI
Locations
Modified-Form
Unmodified-Form
Component-Of
Polypeptides Slots
Gene
Protein-Complexes Slots
Components
Reactions Slots
EC-Number
Left, Right
Substrates (computed as union of Left, Right)
Enzymatic-Reaction
DeltaG0
Spontaneous?
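A computed slot such as Substrates is derived from other slots rather than stored. One way to picture this in Python is a read-only property; the `Reaction` dataclass below is a hypothetical stand-in for a reaction frame and carries only the two slots needed for the example.

```python
from dataclasses import dataclass


@dataclass
class Reaction:
    left: list
    right: list

    @property
    def substrates(self):
        """Union of Left and Right, keeping first-seen order without duplicates."""
        return list(dict.fromkeys(self.left + self.right))


rxn = Reaction(left=["succinate", "FAD"], right=["fumarate", "FADH2"])
all_substrates = rxn.substrates
```

Because the value is recomputed on each access, it cannot drift out of date when Left or Right is edited.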
Enzymatic-Reactions Slots
Enzyme
Reaction
Activators
Inhibitors
Physiologically-Relevant
Cofactors
Prosthetic-Groups
Alternative-Substrates
Alternative-Cofactors
Reaction-direction
Pathways Slots
Reaction-List
Predecessors
Primaries
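Reaction-List and Predecessors together describe the order of steps in a pathway. The Python sketch below assumes each Predecessors value is a (reaction, predecessor) pair and uses the standard-library `graphlib` module (Python 3.9 or later) to produce one valid ordering. The reaction names are placeholders, and the helper `reaction_order` is not part of Pathway Tools.

```python
from graphlib import TopologicalSorter


def reaction_order(reaction_list, predecessors):
    """Order reactions so that every predecessor comes before its successor."""
    graph = {rxn: set() for rxn in reaction_list}
    for rxn, pred in predecessors:
        graph.setdefault(rxn, set()).add(pred)
    return list(TopologicalSorter(graph).static_order())


# Hypothetical linear pathway: rxn-A, then rxn-B, then rxn-C.
reaction_list = ["rxn-C", "rxn-A", "rxn-B"]
predecessors = [("rxn-B", "rxn-A"), ("rxn-C", "rxn-B")]

ordered = reaction_order(reaction_list, predecessors)
```

Cyclic pathways such as the TCA cycle have no topological order, so `static_order` raises `graphlib.CycleError` for them and a display routine would need to break the cycle at a chosen reaction first.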