The EcoCyc and MetaCyc Pathway/Genome Databases
Peter D. Karp, Ph.D.
Bioinformatics Research Group SRI International [email protected]
http://www.ai.sri.com/pkarp/ http://EcoCyc.org/
SRI International Bioinformatics
Overview
Motivations and terminology
Pathway/genome databases
BioCyc collection EcoCyc, MetaCyc
Pathway Tools software
Bioinformatics Database Warehouse project
What to do When Theories Become Larger than Minds can Grasp?
Example: E. coli metabolic network
160 pathways involving 744 reactions and 791 substrates
Example: E. coli genetic network
Control by 97 transcription factors of 1174 genes in 630 transcription units
Past solutions:
Partition theories across multiple minds
Encode theories in natural-language text
We cannot compute with theories in those forms
Evaluate theories for consistency with new data: microarrays
Refine theories with respect to new data
Compare theories describing different organisms
Solution: Biological Knowledge Bases
Store biological knowledge and theories in computers in a declarative form
Amenable to computational analysis and generative user interfaces
Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
It is accepted practice to store data in computers, but not knowledge
Such knowledge bases are an integral part of the scientific enterprise
Pathway Definition
Chemical reactions interconvert chemical compounds
A + B → C + D
An enzyme is a protein that accelerates chemical reactions
A pathway is a linked set of reactions
Often regulated as a unit
A → C → E
A conceptual unit of the cell’s biochemical machinery
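The definitions on this slide can be made concrete with a small sketch. It assumes a pathway is an ordered list of reactions in which each reaction consumes at least one product of the one before it. The class name `Reaction`, the helper `is_linked`, and the enzyme labels are hypothetical and are not taken from Pathway Tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """A reaction interconverts compounds; an enzyme may accelerate it."""
    name: str
    reactants: frozenset
    products: frozenset
    enzyme: str = ""  # hypothetical label, not a real database identifier


def is_linked(reactions):
    """True if every reaction uses at least one product of its predecessor."""
    return all(prev.products & nxt.reactants
               for prev, nxt in zip(reactions, reactions[1:]))


# The slide's example: A + B gives C + D, and C is then converted to E.
r1 = Reaction("r1", frozenset({"A", "B"}), frozenset({"C", "D"}), "enzyme-1")
r2 = Reaction("r2", frozenset({"C"}), frozenset({"E"}), "enzyme-2")
pathway = [r1, r2]
print(is_linked(pathway))
```

Reversing the order of the two reactions breaks the link, which is one simple way to see why a pathway is more than an unordered bag of reactions.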
Terminology
Model Organism Database (MOD) – DB describing genome and other information about an organism
Pathway/Genome Database (PGDB) – MOD that combines information about
Pathways, reactions, substrates
Enzymes, transporters
Genes, replicons
Transcription factors, promoters, operons, DNA binding sites
BioCyc – Collection of 15 PGDBs at BioCyc.org
EcoCyc, AgroCyc, YeastCyc
BioCyc Collection of Pathway/Genome DBs
http://BioCyc.org/
Literature-based Datasets:
MetaCyc
Escherichia coli (EcoCyc)
Computationally Derived Datasets:
Agrobacterium tumefaciens
Caulobacter crescentus
Chlamydia trachomatis
Bacillus subtilis
Helicobacter pylori
Haemophilus influenzae
Mycobacterium tuberculosis H37Rv
Mycobacterium tuberculosis CDC1551
Mycoplasma pneumoniae
Pseudomonas aeruginosa
Saccharomyces cerevisiae
Treponema pallidum
Vibrio cholerae
Yellow = Open Database
Terminology – Pathway Tools Software
PathoLogic
Prediction of metabolic network from genome
Computational creation of new Pathway/Genome Databases
Pathway/Genome Editors
Distributed curation of PGDBs
Distributed object database system, interactive editing tools
Pathway/Genome Navigator
WWW publishing of PGDBs
Querying, visualization of pathways, chromosomes, operons
Analysis operations
Pathway visualization of gene-expression data
Global comparisons of metabolic networks
Bioinformatics 18:S225 2002
Pathway Tools Algorithms
Query, visualization and editing tools for these datatypes:
Full Metabolic Map
Paint gene expression data on metabolic network; compare metabolic networks
Pathways
Pathway prediction
Reactions
Balance checker
Compounds
Chemical substructure comparison
Enzymes, Transporters, Transcription Factors
Genes:
Blast search
Chromosomes
Operons
Operon prediction
Model Organism Databases
DBs that describe the genome and other information about an organism
Every sequenced organism with an active experimental community requires a MOD
Integrate genome data with information about the biochemical and genetic network of the organism
MODs are platforms for global analyses of an organism
Interpret gene expression data in a pathway context
Characterize systems properties of metabolic and genetic networks
Determine consistency of metabolic and transport networks
In silico prediction of essential genes
EcoCyc Project – EcoCyc.org
E. coli Encyclopedia
Model-Organism Database for E. coli
Computational symbolic theory of E. coli
Electronic review article for E. coli: over 3500 literature citations
Tracks the evolving annotation of the E. coli genome
Collaborative development via Internet
Karp (SRI) -- Bioinformatics architect
John Ingraham -- Advisor
(SRI) Metabolic pathways
Saier (UCSD) and Paulsen (TIGR) -- Transport
Collado (UNAM) -- Regulation of gene expression
Database content: 18,000 objects
EcoCyc = E. coli Dataset + Pathway/Genome Navigator
Pathways: 165
Reactions: 2,760
Compounds: 774
Enzymes: 914
Transporters: 162
Promoters: 812
TransFac Sites: 956
Citations: 3,508
Proteins: 4,273
Genes: 4,393
Transcription Units: 724
Factors: 110
http://EcoCyc.org/
EcoCyc Procedures
All DB updates by 5 staff curators
Information gathered from biomedical literature
Corrections solicited from E. coli researchers
Review-level database
Four releases per year
Available through WWW site, as data files, as downloadable application
Quality assurance of data and software:
Evaluate database consistency constraints
Perform element balancing of reactions
Run other checking programs
Display every DB object
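Element balancing of a reaction can be illustrated as follows. This is a minimal sketch, not the checker used by EcoCyc: it assumes each compound carries a simple molecular formula (no parentheses, no charges) and reports every element whose atom count differs between the two sides. The function names are hypothetical.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6'."""
    counts = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return counts


def side_total(side):
    """Sum atom counts over (coefficient, formula) pairs."""
    total = Counter()
    for coeff, formula in side:
        for element, n in parse_formula(formula).items():
            total[element] += coeff * n
    return total


def imbalance(reactants, products):
    """Return {element: reactant atoms - product atoms} for unbalanced elements."""
    left, right = side_total(reactants), side_total(products)
    return {e: left[e] - right[e]
            for e in set(left) | set(right) if left[e] != right[e]}


# Glucose oxidation, written correctly and then with one water missing.
balanced = imbalance([(1, "C6H12O6"), (6, "O2")], [(6, "CO2"), (6, "H2O")])
broken = imbalance([(1, "C6H12O6"), (6, "O2")], [(6, "CO2"), (5, "H2O")])
print(balanced, broken)
```

An empty result means the reaction is balanced. A non-empty result names the elements a curator would need to inspect.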
MetaCyc: Metabolic Encyclopedia
Nonredundant metabolic pathway database
Describe a representative sample of every experimentally determined metabolic pathway
Literature-based DB with extensive references and commentary
Pathways, reactions, enzymes, substrates
460 pathways, 1267 enzymes, 4294 reactions
172 E. coli pathways, 2735 citations
Nucleic Acids Research 30:59-61 2002.
Jointly developed by SRI and Carnegie Institution
New focus on plant pathways
Family of Pathway/Genome Databases
MetaCyc
Pathway Tools Implementation Details
Allegro Common Lisp
Sun and PC platforms
Ocelot object database
250,000 lines of code
Lisp-based WWW server at BioCyc.org
Manages 15 PGDBs
Pathway Tools Architecture
Diagram components: WWW Server, Pathway/Genome Navigator, X-Windows Graphics, GFP API, Object Editor, Pathway Editor, Reaction Editor, Object DBMS, Oracle
Ocelot Knowledge Server Architecture
Frame data model
Classes, instances, inheritance
Frames have slots that define their properties, attributes, relationships
A slot has one or more values
Each value can be any Lisp datatype
Slotunits define metadata about slots:
Domain, range, inverse
Collection type, number of values, value constraints
Transaction logging facility
Schema evolution
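The frame data model can be illustrated with a short sketch. Ocelot itself is written in Common Lisp, so the Python below is only an analogy under stated assumptions: the `Frame` class, the `SLOT_UNITS` table, the `link` helper, and the slot names are all hypothetical and do not reproduce Ocelot's API.

```python
class Frame:
    """A frame (class or instance) whose slots each hold a list of values."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)
        self.slots = {}

    def put(self, slot, *values):
        self.slots.setdefault(slot, []).extend(values)

    def get(self, slot):
        """Return local values, else inherit from the first parent that has any."""
        if slot in self.slots:
            return list(self.slots[slot])
        for parent in self.parents:
            values = parent.get(slot)
            if values:
                return values
        return []


# Slot metadata in the spirit of slotunits: domain, range, inverse.
SLOT_UNITS = {
    "catalyzes": {"domain": "Proteins", "range": "Reactions",
                  "inverse": "catalyzed-by"},
}


def link(source, slot, target):
    """Store a relationship and maintain its inverse slot, if one is declared."""
    source.put(slot, target)
    inverse = SLOT_UNITS.get(slot, {}).get("inverse")
    if inverse:
        target.put(inverse, source)


proteins = Frame("Proteins")
proteins.put("compartment", "cytoplasm")      # default inherited by instances
enzyme = Frame("enzyme-1", parents=[proteins])
reaction = Frame("reaction-1")
link(enzyme, "catalyzes", reaction)
print(enzyme.get("compartment"), reaction.get("catalyzed-by")[0].name)
```

Declaring the inverse once in the slot metadata keeps both directions of a relationship consistent without requiring editors to update two frames by hand.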
Ocelot Storage System Architecture
Persistent storage via disk files, Oracle DBMS
Concurrent development: Oracle
Single-user development: disk files
Read-only delivery: bundle data into binary program
Oracle storage
DBMS is submerged within Ocelot, invisible to users
Relational schema is domain independent, supports multiple KBs simultaneously
Frames transferred from DBMS to Ocelot
On demand
By background prefetcher
Memory cache
Persistent disk cache to speed performance via Internet
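The on-demand transfer and memory cache described above follow a common pattern, sketched here under simplifying assumptions. The `FrameStore` class is hypothetical, the backing store is an ordinary dictionary standing in for the DBMS, and the prefetcher runs synchronously rather than in the background.

```python
class FrameStore:
    """Fetch frames from a backing store on demand and cache them in memory."""

    def __init__(self, backend):
        self.backend = backend      # mapping: frame id -> frame data
        self.cache = {}
        self.backend_reads = 0      # counts trips to the backing store

    def get(self, frame_id):
        if frame_id not in self.cache:
            self.cache[frame_id] = self.backend[frame_id]
            self.backend_reads += 1
        return self.cache[frame_id]

    def prefetch(self, frame_ids):
        """Warm the cache; a real prefetcher would do this in the background."""
        for frame_id in frame_ids:
            self.get(frame_id)


backend = {"frame-1": {"name": "first"}, "frame-2": {"name": "second"}}
store = FrameStore(backend)
store.get("frame-1")
store.get("frame-1")                # second access is served from the cache
reads_after_repeat = store.backend_reads
store.prefetch(["frame-1", "frame-2"])
print(reads_after_repeat, store.backend_reads)
```

Because callers only ever see `get`, the storage layer stays invisible to them, which is the property the slide attributes to the submerged DBMS.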
The Common Lisp Programming Environment
Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000)
EcoCyc WWW Server
Pathway/Genome DBs Created by External Users
Plasmodium falciparum, Stanford University
plasmocyc.stanford.edu
Mycobacterium tuberculosis, Stanford University
BioCyc.org
Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington
Arabidopsis.org:1555
Methanococcus jannaschii, EBI
Maine.ebi.ac.uk:1555
Other PGDBs in progress by 24 other users
Software freely available
Each PGDB owned by its creator
Global Consistency Checking of Biochemical Network
Given:
A PGDB for an organism
A set of initial metabolites
Infer:
What set of products can be synthesized by the small molecule metabolism of the organism
Can known growth medium yield known essential compounds?
Pacific Symposium on Biocomputing p471 2001
Algorithm: Forward Propagation
Diagram: the nutrient set seeds the metabolite set
Reactions in the PGDB reaction pool “fire” when their reactants are in the metabolite set
Products of fired reactions are added to the metabolite set
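Forward propagation over a reaction pool can be sketched as a fixed-point loop. This is an illustration under simplifying assumptions, not the published implementation: reactions are treated as irreversible (reactants, products) pairs of compound names, stoichiometry is ignored, and the compound names are placeholders.

```python
def forward_propagate(nutrients, reactions):
    """Fire every reaction whose reactants are all available, until nothing changes.

    reactions: list of (reactants, products) pairs, each a set of compound names.
    Returns (metabolites, fired), where fired holds indices into reactions.
    """
    metabolites = set(nutrients)
    fired = set()
    changed = True
    while changed:
        changed = False
        for i, (reactants, products) in enumerate(reactions):
            if i not in fired and set(reactants) <= metabolites:
                fired.add(i)
                metabolites |= set(products)
                changed = True
    return metabolites, fired


reaction_pool = [
    ({"A", "B"}, {"C", "D"}),
    ({"C"}, {"E"}),
    ({"X"}, {"Y"}),   # never fires: X is neither a nutrient nor a product
]
metabolites, fired = forward_propagate({"A", "B"}, reaction_pool)
essential = {"E", "Y"}
missing = essential - metabolites
print(sorted(metabolites), sorted(fired), sorted(missing))
```

Essential compounds that remain in `missing` point either to gaps in the database or to nutrients absent from the initial set, which is the diagnostic use the following slides report.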
Results
Phase I: Forward propagation
21 initial compounds yielded only half of 38 essential compounds for E. coli
Phase II: Manually identify
Bugs in EcoCyc (e.g., two objects for tryptophan)
Missing initial protein substrates (e.g., ACP)
Missing pathways in EcoCyc
Phase III: Forward propagation with 11 more initial metabolites
Yielded all 38 essential compounds
Nutrient-Related Analysis: Validation of the EcoCyc Database
Results on EcoCyc:
Phase I:
Essential compounds
Produced: 19
Not produced: 19
Total compounds
Produced: 28%
Reactions
Fired: 31%
Missing Essential Compounds Due To
Bugs in EcoCyc
Narrow conceptualization of the problem
Protein substrates
Incomplete biochemical knowledge
Nutrient-Related Analysis: Validation of the EcoCyc Database
Results on EcoCyc:
Phase II (After adding 11 extra metabolites):
Essential compounds
Produced: 38
Not produced: 0
Total compounds
Produced: 49%
Not produced: 51%
Reactions
Fired: 58%
Not fired: 42%
Pathway Tools Misconceptions
PathoLogic
Does not re-annotate genomes
Pathway Tools does not handle quantitative information
Pathway/Genome Editors do not work through the web
HumanCyc: Human Metabolic Pathway Database Consortium
Construct DB of human metabolic pathways using PathoLogic
Link to human genome web sites
Hire one curator to refine and curate with respect to literature over a 2 year period
Remove false-positive predictions
Insert known pathways missed by PathoLogic
Add comments and citations from pathways and enzymes to the literature
Add enzyme activators, inhibitors, cofactors, tissue information
Available as flatfiles and with Pathway/Genome Navigator
New versions to be released every 6 months
Summary
Pathway/Genome Databases
MetaCyc: non-redundant DB of literature-derived pathways
14 organism-specific PGDBs available through SRI at BioCyc.org
Computational theories of biochemical machinery
Pathway Tools software
Extract pathways from genomes
Morph annotated genome into structured ontology
Distributed curation tools for MODs
Query, visualization, WWW publishing
BioCyc and Pathway Tools Availability
WWW BioCyc freely available to all
BioCyc.org
Six BioCyc DBs openly available to all
BioCyc DBs freely available to non-profits
Flatfiles downloadable from BioCyc.org
Binary executable:
Sun UltraSparc-170 w/ 64MB memory
PC, 400MHz CPU, 64MB memory, Windows-98 or newer
PerlCyc API
Pathway Tools freely available to non-profits
Acknowledgements
SRI
Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud
EcoCyc Project
Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier
MetaCyc Project
Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville
Stanford
Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh
Funding sources:
NIH National Center for Research Resources
NIH National Institute of General Medical Sciences
NIH National Human Genome Research Institute
Department of Energy Microbial Cell Project
DARPA BioSpice, UPC