
Computational Exploration of Metabolic Networks with Pathway Tools Part 1: Overview & Representations

Suzanne Paley

Bioinformatics Research Group, SRI International

http://BioCyc.org/


Function Too Large for One Mind to Grasp

Example: E. coli metabolic network

 160 pathways involving 744 reactions and 791 substrates 

Example: E. coli genetic network

 Control by 97 transcription factors of 1174 genes in 630 transcription units 

Past solutions:

 Partition theories across multiple minds
 Encode theories in natural-language text

We cannot compute with theories in those forms

 Evaluate theories for consistency with new data: microarrays
 Refine theories with respect to new data
 Compare theories describing different organisms

Solution: Biological Knowledge Bases

SRI International Bioinformatics

Store biological knowledge and theories in computers in a declarative form

 Amenable to computational analysis and generative user interfaces 

Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases

A high quality comprehensive knowledge base enables us to ask and answer important new questions

Terminology

Model Organism Database (MOD) – DB describing genome and other information about an organism

Pathway/Genome Database (PGDB) – MOD that combines information about

 Pathways, reactions, substrates
 Enzymes, transporters
 Genes, replicons
 Transcription factors, promoters, operons, DNA binding sites

BioCyc – Collection of 15 PGDBs at BioCyc.org

 EcoCyc, AgroCyc, HumanCyc


Pathway Tools Software

PathoLogic

 Prediction of metabolic network from genome
 Computational creation of new Pathway/Genome Databases

Pathway/Genome Editors

 Distributed curation of genome annotations
 Distributed object database system
 Interactive editing tools

Pathway/Genome Navigator

 WWW publishing of PGDBs
 Graphic depictions of pathways, chromosomes, operons
 Analysis operations
  Pathway visualization of gene-expression data
  Global comparisons of metabolic networks

Pathway Tools Software


[Diagram of components: Pathway/Genome Navigator; PathoLogic Pathway Predictor; Pathway/Genome Databases; Pathway/Genome Editors]

Pathway/Genome Database


[Diagram labels: Pathways; Reactions; Compounds; Proteins; Genes; Operons, Promoters, DNA Binding Sites; Chromosomes, Plasmids; CELL]

Pathway Tools Algorithms

Visualization and editing tools for the following datatypes

Full Metabolic Map

 Paint gene expression data on metabolic network; compare metabolic networks 

Pathways

 Pathway prediction 

Reactions

 Balance checker 

Compounds

 Chemical substructure comparison 

Enzymes, Transporters, Transcription Factors

Genes

Chromosomes

Operons

 Operon prediction; visualize genetic network


Definitions

Chemical reactions interconvert chemical compounds

A + B → C + D

An enzyme is a protein that accelerates chemical reactions

A pathway is a linked set of reactions

 Example: A → C → E
 Often regulated as a unit
 A conceptual unit of the cell’s biochemical machine


Operations of the Metabolic Overview


Find pathways, compounds

Find reactions

 By enzyme name, EC number, substrates, modulation
 By EC class, pathway class
 All with isozymes
 All occurring in multiple pathways

Find genes

 By name, gene class
 All regulated by a given transcriptional regulator protein

Metabolic Overview Queries


Species comparison

 Highlight reactions that are shared or not shared with any one or all of a specified set of species

Overlay expression data

 Colors reflect expression level and are user-configurable
 Can show a single experiment or an animated time series
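
The species comparison described on this slide amounts to set operations over the reactions present in each organism. The following is a minimal Python sketch, assuming each organism's reactions are available as a set of identifiers. The function name `highlight`, the reaction identifiers, and the toy organism contents are all hypothetical and are not part of the Pathway Tools API.

```python
from typing import Dict, Iterable, Set


def highlight(
    reference: Set[str],
    others: Dict[str, Set[str]],
    species: Iterable[str],
    shared: bool = True,
    quantifier: str = "any",
) -> Set[str]:
    """Return the reactions of `reference` to highlight.

    shared=True  -> reactions also present in the chosen species
    shared=False -> reactions absent from the chosen species
    quantifier   -> "any" (at least one species) or "all" (every species)
    """
    if quantifier not in ("any", "all"):
        raise ValueError("quantifier must be 'any' or 'all'")
    chosen = [others[name] for name in species]
    combine = any if quantifier == "any" else all
    result = set()
    for rxn in reference:
        # For "shared" test membership; for "not shared" test absence.
        hits = [(rxn in rxns) == shared for rxns in chosen]
        if combine(hits):
            result.add(rxn)
    return result


# Toy data: reaction identifiers and contents are made up for illustration.
ecoli = {"R1", "R2", "R3", "R4"}
pgdbs = {
    "B. subtilis": {"R1", "R2"},
    "H. pylori": {"R1", "R5"},
}
both = ["B. subtilis", "H. pylori"]
shared_with_all = highlight(ecoli, pgdbs, both, shared=True, quantifier="all")
unique_to_ecoli = highlight(ecoli, pgdbs, both, shared=False, quantifier="all")
```

The four combinations of `shared` and `quantifier` correspond to the Shared/not-shared and Any-one/All-of choices listed on the slide.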

EcoCyc Project

E. coli Encyclopedia

 Model-Organism Database for E. coli
 Began in 1992 as collaboration between Karp and Riley
 Over 3500 literature citations

Collaborative development via Internet

 Karp (SRI) -- Bioinformatics architect
 John Ingraham -- Advisor
 (SRI) Metabolic pathways
 Saier (UCSD) and Paulsen (TIGR) -- Transport
 Collado (UNAM) -- Regulation of gene expression

Ontology: 1000 biological classes

Database content: 17,700 instances


EcoCyc = E.coli Dataset + Pathway/Genome Navigator


 Pathways: 165
 Reactions: 2,760
 Compounds: 774
 Enzymes: 914
 Transporters: 162
 Promoters: 812
 TransFac Sites: 956
 Citations: 3,508
 Proteins: 4,273
 Genes: 4,393
 Transcription Units: 724
 Factors: 110

http://BioCyc.org/


MetaCyc: Metabolic Encyclopedia

Nonredundant metabolic pathway database

Describes a representative sample of every experimentally determined metabolic pathway

Literature-based DB with extensive references and commentary

Pathways, reactions, enzymes, substrates

460 pathways, 1267 enzymes, 4294 reactions

 172 E. coli pathways, 2735 citations 

Nucleic Acids Research 30:59-61 2002.

Jointly developed by SRI and Carnegie Institution

 New focus on plant pathways


MetaCyc Data

MetaCyc contains one DB object for each distinct pathway

 Distinct in terms of reaction steps
 Each pathway labeled with species it occurs in

MetaCyc pathways are experimentally determined

4218 reactions in MetaCyc

 401 lack EC numbers

MetaCyc Enzyme Data

Reaction(s) catalyzed

Alternative substrates

Cofactors / prosthetic groups

Activators and inhibitors

Subunit structure

Molecular weight, pI

Comment, literature citations

Species


MetaCyc Frequent Organisms


 Escherichia coli
 Arabidopsis thaliana
 Homo sapiens
 Pseudomonas
 Bacillus subtilis
 Salmonella typhimurium
 Sulfolobus solfataricus
 Pseudomonas putida
 Saccharomyces cerevisiae
 Haemophilus influenzae
 Glycine max
 Deinococcus radiodurans

20 20 18 14 14 13 156 47 30 21 11 10


EcoCyc and MetaCyc

Review level databases

Data derived primarily from biomedical literature

 Manual entry by staff curators
 Updates by staff curators only

Data validation

 Consistency constraints
 Lisp programs that verify other semantic relationships
 Unbalanced chemical reactions
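
One of the validation checks named here, detecting unbalanced chemical reactions, can be illustrated by counting atoms on each side. This is a minimal Python sketch and not the Lisp implementation used in Pathway Tools. It assumes elemental formulas are available per compound; the formulas below are for the neutral forms (succinic acid, fumaric acid), and charge balance is not checked. The names `FORMULAS`, `atom_totals`, and `imbalance` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Elemental formulas (neutral forms), written as element -> atom count.
FORMULAS: Dict[str, Dict[str, int]] = {
    "succinate": {"C": 4, "H": 6, "O": 4},
    "fumarate": {"C": 4, "H": 4, "O": 4},
    "FAD": {"C": 27, "H": 33, "N": 9, "O": 15, "P": 2},
    "FADH2": {"C": 27, "H": 35, "N": 9, "O": 15, "P": 2},
}

Side = List[Tuple[int, str]]  # (stoichiometric coefficient, compound name)


def atom_totals(side: Side) -> Counter:
    """Sum the atoms of every element on one side of a reaction."""
    totals: Counter = Counter()
    for coefficient, compound in side:
        for element, count in FORMULAS[compound].items():
            totals[element] += coefficient * count
    return totals


def imbalance(left: Side, right: Side) -> Dict[str, int]:
    """Return element -> (left minus right) for every element that differs.

    An empty dict means the reaction is balanced.
    """
    l, r = atom_totals(left), atom_totals(right)
    return {e: l[e] - r[e] for e in set(l) | set(r) if l[e] != r[e]}


balanced = imbalance(
    [(1, "succinate"), (1, "FAD")], [(1, "fumarate"), (1, "FADH2")]
)
# Dropping FAD/FADH2 leaves two hydrogens unaccounted for on the left.
unbalanced = imbalance([(1, "succinate")], [(1, "fumarate")])
```

Returning the per-element difference rather than a plain yes/no makes it easier for a curator to see which substrate is likely missing.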

Computationally-Derived PGDBs


[Diagram: the PathoLogic software takes an Annotated Genomic Sequence (Genes/ORFs, Gene Products, DNA Sequences) and a Multi-organism Pathway Database (MetaCyc: Pathways, Reactions, Compounds) and produces a Pathway/Genome Database (Pathways, Reactions, Compounds, Gene Products, Genes, Genomic Map)]

Integrates genome and pathway data to identify putative metabolic networks


PathoLogic Input/Output


Inputs:

 File listing genetic elements
  http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
 Files containing DNA sequence for each genetic element
 Files containing annotation for each genetic element
 MetaCyc database

Output:

 Pathway/genome database for the subject organism
 Directory tree for the subject organism
 Reports that summarize:
  Evidence contained in the input genome for the presence of reference pathways
  Reactions missing from inferred pathways
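
The two report items above (evidence for reference pathways, and reactions missing from inferred pathways) can be pictured with a simple tally. This is a minimal Python sketch under stated assumptions: it is not PathoLogic's actual inference or scoring method, and the pathway names, reaction identifiers, and the function `pathway_report` are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical reference pathways: pathway name -> ordered reaction ids.
REFERENCE_PATHWAYS: Dict[str, List[str]] = {
    "pathway-A": ["R1", "R2", "R3"],
    "pathway-B": ["R4", "R5"],
}


def pathway_report(
    reference: Dict[str, List[str]], catalyzed: Set[str]
) -> Dict[str, Dict[str, object]]:
    """For each reference pathway, report evidence and missing reactions.

    `catalyzed` is the set of reactions for which the annotated genome
    provides at least one enzyme.
    """
    report: Dict[str, Dict[str, object]] = {}
    for name, reactions in reference.items():
        present = [r for r in reactions if r in catalyzed]
        missing = [r for r in reactions if r not in catalyzed]
        report[name] = {
            "present": present,
            "missing": missing,
            # Fraction of the pathway's reactions that have an enzyme.
            "fraction": len(present) / len(reactions) if reactions else 0.0,
        }
    return report


report = pathway_report(REFERENCE_PATHWAYS, catalyzed={"R1", "R3", "R6"})
```

A pathway with a high fraction and a short `missing` list is a candidate for inclusion; the missing reactions are the ones a curator would look for among unassigned genes.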


PathoLogic Functionality

Initialize schema for new PGDB

Transform existing genome to PGDB form

Infer metabolic pathways and store in PGDB

Infer operons and store in PGDB

Assist user with manual tasks

 Assign enzymes to reactions they catalyze
 Identify false-positive pathway predictions
 Build protein complexes from monomers
 Assemble Overview diagram

BioCyc Collection of Pathway/Genome DBs


Literature-based Datasets:

Escherichia coli (EcoCyc)

MetaCyc

PGDBs at other sites:

Arabidopsis thaliana (TAIR)

Methanococcus jannaschii (EBI)

Saccharomyces cerevisiae (SGD)

Synechocystis PCC6803

Computationally-derived datasets:

Agrobacterium tumefaciens

Caulobacter crescentus

Chlamydia trachomatis

Bacillus subtilis

Helicobacter pylori

Haemophilus influenzae

Homo sapiens

Mycobacterium tuberculosis H37Rv

Mycobacterium tuberculosis CDC1551

Mycoplasma pneumoniae

Pseudomonas aeruginosa

Treponema pallidum

Vibrio cholerae

http://BioCyc.org/

Yellow = Open Database


HumanCyc: Human Metabolic Pathway Database

PGDB of human metabolic pathways built using PathoLogic

Contains information on 28,700 genes, their products, and the metabolic reactions and pathways they catalyze (no signalling pathways)

Chromosome and contigs from Ensembl

Human genetic loci from LocusLink

Mitochondrion data from GenBank

Ensembl and LocusLink gene entries were merged to eliminate redundancies where possible.

Contains links to human genome web sites

Plan to hire one curator to refine and curate with respect to the literature over a two-year period

 Remove false-positive predictions
 Insert known pathways missed by PathoLogic
 Add comments and citations from pathways and enzymes to the literature
 Add enzyme activators, inhibitors, cofactors, tissue information

Funded by commercial consortium

BioCyc and Pathway Tools Availability


WWW BioCyc freely available to all

BioCyc.org

Six BioCyc DBs openly available to all

BioCyc DBs freely available to non-profits

  

Flatfiles downloadable from BioCyc.org

Binary executable:

Sun UltraSparc-170 w/ 64MB memory

PC, 400MHz CPU, 64MB memory, Windows-98 or newer

PerlCyc API

Pathway Tools freely available to non-profits

Information Sources

Pathway Tools User’s Guide

 aic-export/ecocyc/genopath/released/doc/userguide1.pdf
  Pathway/Genome Navigator
  Appendix A: Guide to the Pathway Tools Schema
 aic-export/ecocyc/genopath/released/doc/userguide2.pdf
  PathoLogic, Editing Tools

Pathway Tools Web Site

 http://bioinformatics.ai.sri.com/ptools/
 Publications, programming examples, etc.

Pathway Tools Tutorial

 http://bioinformatics.ai.sri.com/ptools/tutorial/


Pathway Tools Implementation Details

Allegro Common Lisp

Sun and PC platforms

Ocelot object database

250,000 lines of code

Lisp-based WWW server at BioCyc.org

 Manages 15 PGDBs


Frame Data Model

Frame Data Model -- organizational structure for a PGDB

Knowledge base (KB, Database, DB)

Frames

Slots


Knowledge Base

Collection of frames and their associated slots, values, facets, and annotations

AKA: Database, PGDB

Can be stored within

 An Oracle DB
 A disk file
 A Pathway Tools binary program


Frames

Entities with which facts are associated

Kinds of frames:

 Classes: Genes, Pathways, Biosynthetic Pathways
 Instances (objects): trpA, TCA cycle

Classes:

 Superclass(es)
 Subclass(es)
 Instance(s)

A symbolic frame name (id, key) uniquely identifies each frame
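
The frame concepts on this slide (classes, instances, superclass links, and a unique symbolic name per frame) can be sketched as a small data structure. This is a minimal Python illustration and not the Ocelot or Pathway Tools implementation. The class names `Frame` and `KnowledgeBase`, their methods, and the instance `tryptophan-biosynthesis` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Frame:
    frame_id: str                      # symbolic name, unique within the KB
    is_class: bool = False
    parents: List[str] = field(default_factory=list)  # superclasses / types


class KnowledgeBase:
    def __init__(self) -> None:
        self.frames: Dict[str, Frame] = {}

    def add(self, frame_id: str, is_class: bool = False,
            parents: Optional[List[str]] = None) -> Frame:
        # Every frame is identified by its name, so names must not clash.
        if frame_id in self.frames:
            raise KeyError(f"duplicate frame id: {frame_id}")
        frame = Frame(frame_id, is_class, list(parents or []))
        self.frames[frame_id] = frame
        return frame

    def ancestors(self, frame_id: str) -> List[str]:
        """All classes reachable upward from a frame, nearest first."""
        seen: List[str] = []
        queue = list(self.frames[frame_id].parents)
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self.frames[current].parents)
        return seen

    def instances_of(self, class_id: str) -> List[str]:
        """Instances whose type is `class_id` or one of its subclasses."""
        return [f.frame_id for f in self.frames.values()
                if not f.is_class and class_id in self.ancestors(f.frame_id)]


kb = KnowledgeBase()
kb.add("Genes", is_class=True)
kb.add("Pathways", is_class=True)
kb.add("Biosynthetic-Pathways", is_class=True, parents=["Pathways"])
kb.add("trpA", parents=["Genes"])
kb.add("TCA-cycle", parents=["Pathways"])
# Hypothetical instance, added only to show inheritance through a subclass.
kb.add("tryptophan-biosynthesis", parents=["Biosynthetic-Pathways"])
```

Asking for the instances of `Pathways` returns both direct instances and instances of its subclasses, which is the behavior that makes a class taxonomy useful for queries.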


Slots

Encode attributes/properties of a frame

 Integer, real number, string 

Represent relationships between frames

 The value of a slot is the identifier of another frame 

Every slot is described by a “slot frame” in a KB that defines meta information about that slot


Properties of Slots

Number of values

 Single valued
 Multivalued: sets, bags

Slot values

 Any LISP object: Integer, real, string, symbol (frame name) 

Slotunits define properties of slots: datatypes, classes, constraints

Two slots are inverses if they encode opposite relationships

 Slot Product in class Genes  Slot Gene in class Polypeptides
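
The inverse-slot idea can be sketched as a store that updates both directions whenever one side is written. This is a minimal Python illustration, assuming a simple in-memory dictionary; the names `INVERSES`, `slots`, and `add_value` are hypothetical and do not reflect how Pathway Tools maintains inverses internally.

```python
from collections import defaultdict
from typing import Dict, List

# Declared inverse pairs; Product/Gene is the pair named on the slide.
INVERSES: Dict[str, str] = {"Product": "Gene", "Gene": "Product"}

# frame id -> slot name -> list of values (multivalued by default)
slots: Dict[str, Dict[str, List[object]]] = defaultdict(
    lambda: defaultdict(list)
)


def add_value(frame: str, slot: str, value: object) -> None:
    """Add a slot value and keep the inverse slot in step."""
    if value not in slots[frame][slot]:
        slots[frame][slot].append(value)
    inverse = INVERSES.get(slot)
    # Only relationship slots (whose value names another frame) have inverses.
    if inverse is not None and isinstance(value, str):
        if frame not in slots[value][inverse]:
            slots[value][inverse].append(frame)


add_value("sdhA", "Product", "Sdh-flavo")   # relationship: value is a frame id
add_value("sdhA", "Common-Name", "sdhA")    # attribute: value is a string
```

After the first call, the polypeptide frame `Sdh-flavo` carries a `Gene` slot pointing back at `sdhA` without any second write by the caller.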


Pathway Tools Ontology

1064 classes

 Main classes such as:
  Pathways, Reactions, Compounds, Macromolecules, Proteins, Replicons, DNA-Segments (Genes, Operons, Promoters)
 Taxonomies for Pathways, Reactions, Compounds

205 slots

 Meta-data: Creator, Creation-Date
 Comment, Citations, Common-Name, Synonyms
 Attributes: Molecular-Weight, DNA-Footprint-Size
 Relationships: Catalyzes, Component-Of, Product

Classes, instances, slots all stored side by side in DBMS, share a single namespace


Slot Links from Gene to Pathway Frame

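[Diagram: frames linked by slots, from the pathway down to its genes]
 Reaction (succinate + FAD = fumarate + FADH2) -- in-pathway -- TCA Cycle
 Reaction -- left -- succinate, FAD
 Reaction -- right -- fumarate, FADH2
 Enzymatic-reaction -- reaction -- Reaction
 Succinate dehydrogenase -- catalyzes -- Enzymatic-reaction
 Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2 -- component-of -- Succinate dehydrogenase
 sdhA, sdhB, sdhC, sdhD -- product -- Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2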


[Diagram: attributes stored at each frame in the chain]
 Pathway: TCA Cycle
 Reaction: succinate + FAD = fumarate + FADH2 (EC#, Keq)
 Enzymatic-reaction: properties of the pairing between enzyme and reaction (Cofactors, Inhibitors)
 Enzyme: Succinate dehydrogenase (Molecular wt, pI)
 Polypeptides: Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2
 Genes: sdhA, sdhB, sdhC, sdhD (Left-end-position)
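
The chain of slot links from gene to pathway can be followed programmatically once each frame's slots are available. This is a minimal Python sketch using the succinate dehydrogenase example; the dictionary layout, the function `follow`, and the identifiers `enzrxn-sdh` and `rxn-succinate-fad` are hypothetical, while the slot names are those shown in the diagram.

```python
from typing import Dict, List

# frame id -> slot name -> list of frame ids, following the diagram's links.
# The enzymatic-reaction and reaction ids are made up for illustration.
KB: Dict[str, Dict[str, List[str]]] = {
    "sdhA": {"product": ["Sdh-flavo"]},
    "sdhB": {"product": ["Sdh-Fe-S"]},
    "sdhC": {"product": ["Sdh-membrane-1"]},
    "sdhD": {"product": ["Sdh-membrane-2"]},
    "Sdh-flavo": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-Fe-S": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-1": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-2": {"component-of": ["Succinate dehydrogenase"]},
    "Succinate dehydrogenase": {"catalyzes": ["enzrxn-sdh"]},
    "enzrxn-sdh": {"reaction": ["rxn-succinate-fad"]},
    "rxn-succinate-fad": {
        "left": ["succinate", "FAD"],
        "right": ["fumarate", "FADH2"],
        "in-pathway": ["TCA Cycle"],
    },
}

GENE_TO_PATHWAY = ["product", "component-of", "catalyzes", "reaction", "in-pathway"]


def follow(start: str, slot_path: List[str]) -> List[str]:
    """Follow a chain of slots from one frame, collecting the end frames."""
    frontier = [start]
    for slot in slot_path:
        next_frontier: List[str] = []
        for frame in frontier:
            for value in KB.get(frame, {}).get(slot, []):
                if value not in next_frontier:
                    next_frontier.append(value)
        frontier = next_frontier
    return frontier


pathways_of_sdhA = follow("sdhA", GENE_TO_PATHWAY)
```

Because each step is a list, the same function handles monomers that catalyze several reactions and reactions that occur in several pathways.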

Monofunctional Monomer

[Diagram: Pathway -- Reaction -- Enzymatic-reaction -- Monomer -- Gene]


Bifunctional Monomer

[Diagram: Pathway with two Reactions, each with its own Enzymatic-reaction; both Enzymatic-reactions link to a single Monomer and its Gene]


Monofunctional Multimer

[Diagram: Pathway -- Reaction -- Enzymatic-reaction -- Multimer; the Multimer is composed of four Monomers, each linked to its own Gene]


Pathway and Substrates

[Diagram: a Pathway is linked to each of its Reactions by in-pathway; a Reaction is linked to Reactant-1 and Reactant-2 by left, and to Product-1 and Product-2 by right]


Genetic Network Representation


Describe biological entities involved in control of transcription initiation

 Promoters, operators, transcription factors, operons, terminators 

Describe molecular interactions among these entities

 Modulation of transcription factor activity
 Binding of transcription factors to DNA binding sites
 Effects on transcription initiation

Ontology for Transcriptional Regulation


One DB object defined for each biological entity and for each molecular interaction

[Diagram labels: trp; apoTrpR; Complexation reaction; TrpR*trp; site001; Int001; pro001; Int002; RpoSig70; transcription unit trpLEDCBA with genes trpL, trpE, trpD, trpC, trpB, trpA]

Int001 (binding of TrpR*trp to site001) inhibits Int002 (binding of RNA polymerase to the promoter) and consequently prevents transcription of the genes in the transcription unit.
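
The trp example can be written down with one object per interaction, as the slide describes. This is a minimal Python sketch with a deliberately Boolean reading of regulation (an interaction either occurs or is blocked); it is an illustration of the representation, not a model of the biology or of the Pathway Tools schema. The class `Interaction`, its fields, and the function `transcribed_genes` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Interaction:
    """A binding interaction stored as its own DB object."""
    ident: str
    binder: str                 # what binds (protein or complex)
    target: str                 # where it binds (site or promoter)
    inhibits: List[str] = field(default_factory=list)  # other interaction ids


INTERACTIONS: Dict[str, Interaction] = {
    "Int001": Interaction("Int001", "TrpR*trp", "site001", inhibits=["Int002"]),
    "Int002": Interaction("Int002", "RpoSig70", "pro001"),
}
TRANSCRIPTION_UNIT = {
    "name": "trpLEDCBA",
    "promoter": "pro001",
    "genes": ["trpL", "trpE", "trpD", "trpC", "trpB", "trpA"],
}


def transcribed_genes(tryptophan_present: bool) -> List[str]:
    """Genes transcribed under a Boolean reading of the interactions."""
    # Polymerase binding is always attempted; repressor binding needs the
    # TrpR*trp complex, which only forms when tryptophan is present.
    occurring = {"Int002"}
    if tryptophan_present:
        occurring.add("Int001")
    blocked = {
        target
        for ident in occurring
        for target in INTERACTIONS[ident].inhibits
    }
    promoter_bound = any(
        INTERACTIONS[i].target == TRANSCRIPTION_UNIT["promoter"]
        for i in occurring - blocked
    )
    return list(TRANSCRIPTION_UNIT["genes"]) if promoter_bound else []
```

Storing the inhibition on the interaction object, rather than on the protein or the gene, is what lets the same transcription factor have different effects at different sites.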

Principal Classes

Class names are capitalized, plural

Genetic-Elements, with subclasses:

 Chromosomes
 Plasmids

Genes

Transcription-Units

RNAs

Proteins, with subclasses:

 Polypeptides
 Protein-Complexes


Principal Classes

Reactions, with subclasses:

 Transport-Reactions 

Enzymatic-Reactions

Pathways

Compounds-And-Elements


Slots in Multiple Classes


Common-Name

Synonyms

Names (computed as union of Common-Name, Synonyms)

Comment

Citations

DB-Links

Genes Slots

Chromosome

Left-End-Position

Right-End-Position

Centisome-Position

Transcription-Direction

Product


Proteins Slots

Molecular-Weight-Seq

Molecular-Weight-Exp

pI

Locations

Modified-Form

Unmodified-Form

Component-Of


Polypeptides Slots

Gene


Protein-Complexes Slots


Components


Reactions Slots

EC-Number

Left, Right

Substrates (computed as union of Left, Right)

Enzymatic-Reaction

DeltaG0

Spontaneous?
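
A computed slot such as Substrates (the union of Left and Right) can be modeled as a value derived on access rather than stored. This is a minimal Python sketch, assuming a plain dataclass stands in for a reaction frame; the class `Reaction` and its attribute names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reaction:
    left: List[str] = field(default_factory=list)
    right: List[str] = field(default_factory=list)

    @property
    def substrates(self) -> List[str]:
        """Union of Left and Right, computed on access rather than stored."""
        merged: List[str] = []
        for compound in self.left + self.right:
            if compound not in merged:
                merged.append(compound)
        return merged


rxn = Reaction(left=["succinate", "FAD"], right=["fumarate", "FADH2"])
```

Because the value is derived each time, editing `left` or `right` can never leave `substrates` out of date.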

Enzymatic-Reactions Slots


Enzyme

Reaction

Activators

Inhibitors

Physiologically-Relevant

Cofactors

Prosthetic-Groups

Alternative-Substrates

Alternative-Cofactors

Reaction-direction

Pathways Slots

Reaction-List

Predecessors

Primaries
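
One way to picture how Reaction-List and Predecessors work together is to treat the predecessor links as edges and derive a reaction order from them. This is a minimal Python sketch under a stated assumption: each predecessor entry is taken to be a (reaction, predecessor) pair, which is a reading of the slot name and is not specified in these slides. The function `order_reactions` and the reaction identifiers are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple


def order_reactions(
    reaction_list: List[str], predecessors: List[Tuple[str, str]]
) -> List[str]:
    """Order a pathway's reactions so each follows its predecessors.

    Assumes `predecessors` holds (reaction, predecessor) pairs.
    Raises graphlib.CycleError for cyclic pathways.
    """
    graph: Dict[str, Set[str]] = {r: set() for r in reaction_list}
    for reaction, predecessor in predecessors:
        graph.setdefault(reaction, set()).add(predecessor)
    return list(TopologicalSorter(graph).static_order())


# Toy linear pathway R1 -> R2 -> R3, listed out of order on purpose.
ordered = order_reactions(
    ["R3", "R1", "R2"], [("R2", "R1"), ("R3", "R2")]
)
```

A cyclic pathway such as the TCA cycle has no such linear order, so this sketch would raise `CycleError` for it; a drawing algorithm would need to pick a starting reaction instead.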
