
The EcoCyc and MetaCyc Pathway/Genome Databases

Peter D. Karp, Ph.D.

Bioinformatics Research Group, SRI International

http://www.ai.sri.com/pkarp/ http://EcoCyc.org/

SRI International Bioinformatics

Overview

Motivations and terminology

Pathway/genome databases

 BioCyc collection
 EcoCyc, MetaCyc

Pathway Tools software

Bioinformatics Database Warehouse project


What to do When Theories Become Larger than Minds can Grasp?

Example: E. coli metabolic network

 160 pathways involving 744 reactions and 791 substrates 

Example: E. coli genetic network

 Control by 97 transcription factors of 1174 genes in 630 transcription units 

Past solutions:

 Partition theories across multiple minds
 Encode theories in natural-language text

We cannot compute with theories in those forms

 Evaluate theories for consistency with new data: microarrays
 Refine theories with respect to new data
 Compare theories describing different organisms

Solution: Biological Knowledge Bases

Store biological knowledge and theories in computers in a declarative form

 Amenable to computational analysis and generative user interfaces 

Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases

It is accepted practice to store data in computers, but not knowledge

Such knowledge bases are an integral part of the scientific enterprise


Pathway Definition

Chemical reactions interconvert chemical compounds

 A + B → C + D

An enzyme is a protein that accelerates chemical reactions

A pathway is a linked set of reactions

Often regulated as a unit

 A → C → E

A conceptual unit of cell’s biochemical machine
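
As a rough illustration of these definitions, the sketch below models reactions and a pathway as plain Python data structures. The class and field names (`Reaction`, `Pathway`, `is_linked`) and the compound and enzyme labels are hypothetical; they are not taken from the Pathway Tools schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reaction:
    """A reaction converts a set of reactants into a set of products."""
    name: str
    reactants: frozenset
    products: frozenset
    enzyme: str = ""  # protein that accelerates the reaction, if known


@dataclass
class Pathway:
    """A pathway is an ordered, linked set of reactions."""
    name: str
    reactions: list = field(default_factory=list)

    def is_linked(self) -> bool:
        # Each reaction must consume at least one product of its predecessor.
        return all(
            prev.products & nxt.reactants
            for prev, nxt in zip(self.reactions, self.reactions[1:])
        )


# Hypothetical example: A + B -> C + D, followed by C -> E.
r1 = Reaction("r1", frozenset({"A", "B"}), frozenset({"C", "D"}), enzyme="enzyme-1")
r2 = Reaction("r2", frozenset({"C"}), frozenset({"E"}), enzyme="enzyme-2")
pathway = Pathway("example-pathway", [r1, r2])
```

Here `pathway.is_linked()` holds because `r2` consumes compound C, which `r1` produces; reversing the two reactions would break the link.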

Terminology

Model Organism Database (MOD) – DB describing genome and other information about an organism

Pathway/Genome Database (PGDB) – MOD that combines information about

 Pathways, reactions, substrates
 Enzymes, transporters
 Genes, replicons
 Transcription factors, promoters, operons, DNA binding sites

BioCyc – Collection of 15 PGDBs at BioCyc.org

 EcoCyc, AgroCyc, YeastCyc


BioCyc Collection of Pathway/Genome DBs

http://BioCyc.org/

Literature-based Datasets:

 MetaCyc
 Escherichia coli (EcoCyc)

Computationally Derived Datasets:

 Agrobacterium tumefaciens
 Caulobacter crescentus
 Chlamydia trachomatis
 Bacillus subtilis
 Helicobacter pylori
 Haemophilus influenzae
 Mycobacterium tuberculosis H37Rv
 Mycobacterium tuberculosis CDC1551
 Mycoplasma pneumoniae
 Pseudomonas aeruginosa
 Saccharomyces cerevisiae
 Treponema pallidum
 Vibrio cholerae

Yellow = Open Database

Terminology – Pathway Tools Software

PathoLogic

 Prediction of metabolic network from genome
 Computational creation of new Pathway/Genome Databases

Pathway/Genome Editors

 Distributed curation of PGDBs
 Distributed object database system, interactive editing tools

Pathway/Genome Navigator

 WWW publishing of PGDBs
 Querying, visualization of pathways, chromosomes, operons
 Analysis operations
  Pathway visualization of gene-expression data
  Global comparisons of metabolic networks

Bioinformatics 18:S225 2002

Pathway Tools Algorithms

Query, visualization and editing tools for these datatypes:

Full Metabolic Map

 Paint gene expression data on metabolic network; compare metabolic networks 

Pathways

 Pathway prediction 

Reactions

 Balance checker 

Compounds

 Chemical substructure comparison 

Enzymes, Transporters, Transcription Factors

Genes

 BLAST search

Chromosomes

Operons

 Operon prediction

Model Organism Databases

DBs that describe the genome and other information about an organism

Every sequenced organism with an active experimental community requires a MOD

 Integrate genome data with information about the biochemical and genetic network of the organism 

MODs are platforms for global analyses of an organism

 Interpret gene expression data in a pathway context
 Characterize systems properties of metabolic and genetic networks
 Determine consistency of metabolic and transport networks
 In silico prediction of essential genes

EcoCyc Project – EcoCyc.org

E. coli Encyclopedia

 Model-Organism Database for E. coli
 Computational symbolic theory of E. coli
 Electronic review article for E. coli – over 3500 literature citations
 Tracks the evolving annotation of the E. coli genome

Collaborative development via Internet

 Karp (SRI) -- Bioinformatics architect
 John Ingraham -- Advisor
 (SRI) -- Metabolic pathways
 Saier (UCSD) and Paulsen (TIGR) -- Transport
 Collado (UNAM) -- Regulation of gene expression

Database content: 18,000 objects

EcoCyc = E.coli Dataset + Pathway/Genome Navigator

Pathways: 165
Reactions: 2,760
Compounds: 774
Enzymes: 914
Transporters: 162
Promoters: 812
TransFac Sites: 956
Citations: 3,508
Proteins: 4,273
Genes: 4,393
Transcription Units: 724
Factors: 110

http://EcoCyc.org/

EcoCyc Procedures

All DB updates by 5 staff curators

 Information gathered from biomedical literature
 Corrections solicited from E. coli researchers

Review-level database

Four releases per year

Available through WWW site, as data files, as downloadable application

Quality assurance of data and software:

 Evaluate database consistency constraints
 Perform element balancing of reactions
 Run other checking programs
 Display every DB object
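
The element-balancing check listed above can be illustrated with a small sketch: a reaction is balanced when every element occurs the same number of times on both sides. The parser below handles only simple formulas without parentheses or charges, the function names are hypothetical, and this is not the EcoCyc checking program itself.

```python
import re
from collections import Counter


def parse_formula(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'C6H12O6' (no parentheses)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_totals(side) -> Counter:
    """Sum atom counts over (coefficient, formula) pairs on one side."""
    totals = Counter()
    for coefficient, formula in side:
        for element, n in parse_formula(formula).items():
            totals[element] += coefficient * n
    return totals


def is_balanced(reactants, products) -> bool:
    """True when every element has equal counts on both sides."""
    return side_totals(reactants) == side_totals(products)


# Glucose oxidation: C6H12O6 + 6 O2 -> 6 CO2 + 6 H2O
balanced = is_balanced(
    [(1, "C6H12O6"), (6, "O2")],
    [(6, "CO2"), (6, "H2O")],
)
# Dropping the water leaves hydrogen unaccounted for.
unbalanced = is_balanced([(1, "C6H12O6"), (6, "O2")], [(6, "CO2")])
```

A production checker would also need to handle charges, generic groups, and compounds with unknown formulas, which this sketch ignores.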

MetaCyc: Metabolic Encyclopedia

Nonredundant metabolic pathway database

Describe a representative sample of every experimentally determined metabolic pathway

Literature-based DB with extensive references and commentary

Pathways, reactions, enzymes, substrates

460 pathways, 1267 enzymes, 4294 reactions

 172 E. coli pathways, 2735 citations 

Nucleic Acids Research 30:59-61 2002.

Jointly developed by SRI and Carnegie Institution

 New focus on plant pathways

Family of Pathway/Genome Databases

MetaCyc

Pathway Tools Implementation Details

Allegro Common Lisp

Sun and PC platforms

Ocelot object database

250,000 lines of code

Lisp-based WWW server at BioCyc.org

 Manages 15 PGDBs


Pathway Tools Architecture

 WWW Server
 Pathway/Genome Navigator
 X-Windows Graphics
 GFP API
 Object Editor
 Pathway Editor
 Reaction Editor
 Object DBMS
 Oracle

Ocelot Knowledge Server Architecture


Frame data model

 Classes, instances, inheritance
 Frames have slots that define their properties, attributes, relationships
 A slot has one or more values
 Each value can be any Lisp datatype
 Slotunits define metadata about slots:
  Domain, range, inverse
  Collection type, number of values, value constraints
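
To make the frame data model more concrete, here is a minimal Python sketch of frames with slots, inheritance from a parent class frame, and slot-unit metadata used to check values. The names `Frame`, `SlotUnit`, `put`, `get`, and `violations`, along with the example frames, are hypothetical; Ocelot itself is written in Common Lisp and its actual API is not shown here.

```python
class Frame:
    """A frame is a class or an instance holding named slots with values."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # class frame this frame inherits from
        self.slots = {}           # slot name -> list of values

    def put(self, slot, *values):
        self.slots[slot] = list(values)

    def get(self, slot):
        """Return local values, else values inherited from ancestors."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        return []


class SlotUnit:
    """Metadata about a slot: allowed value type and maximum cardinality."""

    def __init__(self, slot, value_type, max_values=None):
        self.slot = slot
        self.value_type = value_type
        self.max_values = max_values

    def violations(self, frame):
        """Return (values of the wrong type, whether there are too many values)."""
        values = frame.get(self.slot)
        wrong_type = [v for v in values if not isinstance(v, self.value_type)]
        too_many = self.max_values is not None and len(values) > self.max_values
        return wrong_type, too_many


# Hypothetical frames: a class, and an instance inheriting a slot value from it.
proteins = Frame("Proteins")
proteins.put("comment", "Generic protein class")
protein_1 = Frame("protein-1", parent=proteins)
protein_1.put("molecular-weight", 10.0)

weight_unit = SlotUnit("molecular-weight", float, max_values=1)
```

In this sketch `protein_1.get("comment")` falls back to the value stored on the `Proteins` class, and `weight_unit.violations(protein_1)` reports no problems.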

Transaction logging facility

Schema evolution


Ocelot Storage System Architecture

Persistent storage via disk files, Oracle DBMS

 Concurrent development: Oracle
 Single-user development: disk files
 Read-only delivery: bundle data into binary program

Oracle storage

 DBMS is submerged within Ocelot, invisible to users
 Relational schema is domain independent, supports multiple KBs simultaneously
 Frames transferred from DBMS to Ocelot
  On demand
  By background prefetcher
  Memory cache
  Persistent disk cache to speed performance via Internet
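
The on-demand transfer and memory cache described above can be sketched as follows. This is only an illustration under simplifying assumptions: a plain dictionary stands in for the DBMS, prefetching runs synchronously rather than in the background, and the names `FrameStore`, `get`, and `prefetch` are hypothetical.

```python
class FrameStore:
    """Loads frames from a backing store on demand and caches them in memory."""

    def __init__(self, backend):
        self.backend = backend    # stand-in for the DBMS: frame id -> frame data
        self.cache = {}           # in-memory cache of frames already fetched
        self.fetches = 0          # number of round trips to the backend

    def get(self, frame_id):
        if frame_id not in self.cache:
            self.fetches += 1
            self.cache[frame_id] = self.backend[frame_id]
        return self.cache[frame_id]

    def prefetch(self, frame_ids):
        """Warm the cache ahead of use, as a background prefetcher might."""
        for frame_id in frame_ids:
            self.get(frame_id)


backend = {"frame-1": {"name": "frame-1"}, "frame-2": {"name": "frame-2"}}
store = FrameStore(backend)
store.get("frame-1")
store.get("frame-1")   # served from the cache; no second backend fetch
```

The caller only ever asks the store for a frame, so whether the frame came from memory or from the backend stays invisible, which is the point of hiding the DBMS behind the object layer.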

The Common Lisp Programming Environment

Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000)

EcoCyc WWW Server


Pathway/Genome DBs Created by External Users

Plasmodium falciparum, Stanford University

 plasmocyc.stanford.edu

Mycobacterium tuberculosis, Stanford University

 BioCyc.org

Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington

 Arabidopsis.org:1555 

Methanococcus jannaschii, EBI

 Maine.ebi.ac.uk:1555 

Other PGDBs in progress by 24 other users

Software freely available

Each PGDB owned by its creator

Global Consistency Checking of Biochemical Network

Given:

 A PGDB for an organism
 A set of initial metabolites

Infer:

 What set of products can be synthesized by the small molecule metabolism of the organism 

Can known growth medium yield known essential compounds?

Pacific Symposium on Biocomputing p471 2001

Algorithm: Forward Propagation

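 Nutrient set initializes the metabolite set
 “Fire” each reaction in the PGDB reaction pool whose reactants are all in the metabolite set
 Add the products of fired reactions to the metabolite set
 Repeat until no new reactions fire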
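
A minimal sketch of forward propagation, assuming each reaction is given as a pair of reactant and product sets and fires only when all of its reactants are available. The function name, the data layout, and the toy reaction pool are hypothetical and contain no EcoCyc data.

```python
def forward_propagate(nutrients, reactions):
    """Return (metabolites reachable from nutrients, names of fired reactions).

    `reactions` maps a reaction name to a (reactants, products) pair of sets.
    """
    metabolites = set(nutrients)
    fired = set()
    changed = True
    while changed:
        changed = False
        for name, (reactants, products) in reactions.items():
            # A reaction fires once all of its reactants are available.
            if name not in fired and reactants <= metabolites:
                fired.add(name)
                metabolites |= products
                changed = True
    return metabolites, fired


# Toy reaction pool (hypothetical compounds).
pool = {
    "r1": ({"A", "B"}, {"C", "D"}),
    "r2": ({"C"}, {"E"}),
    "r3": ({"E", "X"}, {"F"}),   # never fires: X is not available
}
reachable, fired = forward_propagate({"A", "B"}, pool)
essential = {"E", "F"}
missing = essential - reachable
```

Comparing the reachable set against a list of essential compounds, as in the last two lines, gives the compounds that the network cannot produce from the chosen nutrients.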


Results

Phase I: Forward propagation

 21 initial compounds yielded only half of 38 essential compounds for E. coli

Phase II: Manually identify

 Bugs in EcoCyc (e.g., two objects for tryptophan)
 Missing initial protein substrates (e.g., ACP)
 Missing pathways in EcoCyc

Phase III: Forward propagation with 11 more initial metabolites

 Yielded all 38 essential compounds


Nutrient-Related Analysis: Validation of the EcoCyc Database

Results on EcoCyc:

Phase I:

Essential compounds

 Produced: 19
 Not produced: 19

Total compounds

 Produced: 28%

Reactions

 Fired: 31%

Missing Essential Compounds Due To

Bugs in EcoCyc

Narrow conceptualization of the problem

 Protein substrates 

Incomplete biochemical knowledge

Nutrient-Related Analysis: Validation of the EcoCyc Database

Results on EcoCyc:

Phase II (After adding 11 extra metabolites):

Essential compounds

 Produced: 38
 Not produced: 0

Total compounds

 Produced: 49%
 Not produced: 51%

Reactions

 Fired: 58%
 Not fired: 42%

Pathway Tools Misconceptions


PathoLogic

 Does not re-annotate genomes 

Pathway Tools does not handle quantitative information

Pathway/Genome Editors do not work through the web


HumanCyc: Human Metabolic Pathway Database Consortium

Construct DB of human metabolic pathways using PathoLogic

Link to human genome web sites

Hire one curator to refine and curate with respect to literature over a 2 year period

 Remove false-positive predictions
 Insert known pathways missed by PathoLogic
 Add comments and citations from pathways and enzymes to the literature
 Add enzyme activators, inhibitors, cofactors, tissue information

Available as flatfiles and with Pathway/Genome Navigator

New versions to be released every 6 months


Summary

Pathway/Genome Databases

 MetaCyc non-redundant DB of literature-derived pathways
 14 organism-specific PGDBs available through SRI at BioCyc.org

 Computational theories of biochemical machinery 

Pathway Tools software

 Extract pathways from genomes
 Morph annotated genome into structured ontology
 Distributed curation tools for MODs
 Query, visualization, WWW publishing

BioCyc and Pathway Tools Availability


WWW BioCyc freely available to all

 BioCyc.org

 Six BioCyc DBs openly available to all 

BioCyc DBs freely available to non-profits

 Flatfiles downloadable from BioCyc.org
 Binary executable:
  Sun UltraSparc-170 w/ 64MB memory
  PC, 400MHz CPU, 64MB memory, Windows-98 or newer
 PerlCyc API

Pathway Tools freely available to non-profits


Acknowledgements

SRI

 Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud 

EcoCyc Project

 Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier 

MetaCyc Project

 Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville 

Stanford

 Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh 

Funding sources:

 NIH National Center for Research Resources
 NIH National Institute of General Medical Sciences
 NIH National Human Genome Research Institute
 Department of Energy Microbial Cell Project
 DARPA BioSpice, UPC

BioCyc.org