Recent Development of KEGG (Kyoto Encyclopedia of Genes

Integrating Biological Information using
Oracle - KEGG at Kyoto University & the
PATHWAY database project
Susumu Goto
Bioinformatics Center, Institute for Chemical Research
Kyoto University
03/09/11
Oracle Life Science Day & User Group Meeting
Contents
• Introduction to KEGG
– Kyoto Encyclopedia of Genes and Genomes
– Integrated database for pathways, chemical
reactions, genomes, expression, and more..
– Data representation with graphs
• XML representation of PATHWAY database
• Comparison between pathways and other data
• Path computation in pathways
• Application to Oracle 10g Network Model
– Storing information on PATHWAY as binary
relations in Oracle
– Path computation using Oracle
KEGG
Kyoto Encyclopedia of Genes and Genomes
• Integrated Database of Biological Systems
Information for the Post-genomic Era
– Genomes, genes, pathways of completely sequenced
organisms
– Functional annotation for each gene by comparative
genomics
– Pathway reconstruction based on the annotation
– A system for computing and comparing biological
networks from molecular interaction data
• Graph representation and application of graph algorithms
– http://www.genome.ad.jp/kegg/
Databases in KEGG
[Figure: the KEGG databases: Genomes, Genes, Orthologs, Pathway, Expression, and Chemicals and their reactions. Link labels: sequence similarity; information on relations between molecules.]
GENES and PATHWAY
• GENES: ~400,000 genes from over 100 organisms
– Parsing GenBank and EMBL for completely sequenced
genomes
– Parsing LocusLink and RefSeq for model organisms such
as human and mouse
– KEGG annotates function of each gene based on
sequence similarity
• PATHWAY: over 100 maps
– Metabolic pathways, regulatory pathways and protein
complexes
– Manually drawn and classified
– Information collected from various textbooks,
literature, and web pages
Metabolic pathway map
Graph Representation of Metabolic
Pathways and Chemical Compounds
• Metabolic Pathways
– Image maps and position of each object on them
– Graph 1
• Node: chemical compounds, Link: enzymatic reactions
– Graph 2
• Node: enzymes, Link: neighborhood relations of enzymes
on pathway maps
• Chemical Compounds
– Graph 1
• Node: atoms, Link: bonds between atoms
– Graph 2 for carbohydrates
• Node: sugars, Link: glycosylation bonds
Graph representation of PATHWAY (1)
XML representation of PATHWAY (1)
Graph 1: Compound as a node / Enzyme as a link
...
<entry id="22" name="ec:4.2.1.3" type="enzyme" reaction="rn:R01325" ... />
<entry id="23" name="ec:2.3.3.1" type="enzyme" reaction="rn:R00351" ... />
<entry id="24" name="ec:1.1.1.37" type="enzyme" reaction="rn:R00342" ... />
...
<entry id="51" name="cpd:C00036" type="compound" .../>
...
<entry id="54" name="cpd:C00024" type="compound" .../>
...
<entry id="60" name="cpd:C00010" type="compound" .../>
<entry id="61" name="cpd:C00158" type="compound" .../>
...
<relation entry1="23" entry2="24" type="ECrel" compound="51"/>
<relation entry1="22" entry2="23" type="ECrel" compound="61"/>
...
<reaction name="rn:R00351" type="reversible">
<substrate name="cpd:C00158"/>
<substrate name="cpd:C00010"/>
<product name="cpd:C00024"/>
<product name="cpd:C00036"/>
</reaction>
<reaction name="rn:R00352" type="reversible">
<substrate name="cpd:C00158"/>
<substrate name="cpd:C00010"/>
<product name="cpd:C00024"/>
<product name="cpd:C00036"/>
</reaction>
...
Producing four binary relations from a reaction with two substrates and two products
[Figure: reaction R00351 drawn as links among compounds C00158, C00010, C00024, and C00036]
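The step from one reaction to four binary relations can be sketched in Python. This is only an illustration, not the KEGG or Oracle loader: the `<pathway>` wrapper element and the function name `compound_links` are assumptions added so the fragment parses on its own, and the reaction is copied from the slide.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Reaction copied from the slide. The <pathway> wrapper is an assumption,
# added only so the fragment is a well-formed XML document.
KGML = """
<pathway>
  <reaction name="rn:R00351" type="reversible">
    <substrate name="cpd:C00158"/>
    <substrate name="cpd:C00010"/>
    <product name="cpd:C00024"/>
    <product name="cpd:C00036"/>
  </reaction>
</pathway>
"""


def compound_links(xml_text):
    """Return one (substrate, product, reaction) triple per substrate/product pair."""
    links = []
    for reaction in ET.fromstring(xml_text).iter("reaction"):
        substrates = [s.get("name") for s in reaction.findall("substrate")]
        products = [p.get("name") for p in reaction.findall("product")]
        # Two substrates x two products gives four binary relations.
        for substrate, prod in product(substrates, products):
            links.append((substrate, prod, reaction.get("name")))
    return links


links = compound_links(KGML)
for link in links:
    print(link)
```

For a reversible reaction, a loader would presumably also store each relation in the opposite direction; the sketch leaves that out.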
Graph representation of PATHWAY (2)
XML representation of PATHWAY (2)
Graph 2: Enzyme as a node / Compound as a link
...
<entry id="22" name="ec:4.2.1.3" type="enzyme" reaction="rn:R01325" ... />
<entry id="23" name="ec:2.3.3.1" type="enzyme" reaction="rn:R00351" ... />
<entry id="24" name="ec:1.1.1.37" type="enzyme" reaction="rn:R00342" ... />
...
<entry id="51" name="cpd:C00036" type="compound" .../>
...
<entry id="54" name="cpd:C00024" type="compound" .../>
...
<entry id="60" name="cpd:C00010" type="compound" .../>
<entry id="61" name="cpd:C00158" type="compound" .../>
...
<relation entry1="23" entry2="24" type="ECrel" compound="51"/>
<relation entry1="22" entry2="23" type="ECrel" compound="61"/>
...
<reaction name="rn:R00351" type="reversible">
<substrate name="cpd:C00158"/>
<substrate name="cpd:C00010"/>
<product name="cpd:C00024"/>
<product name="cpd:C00036"/>
</reaction>
<reaction name="rn:R00352" type="reversible">
<substrate name="cpd:C00158"/>
<substrate name="cpd:C00010"/>
<product name="cpd:C00024"/>
<product name="cpd:C00036"/>
</reaction>
...
Producing binary relations between two enzymes with a compound as a link name
[Figure: reactions R01325, R00351, and R00342 drawn as nodes, linked by compounds C00158 and C00036]
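The enzyme-to-enzyme relations can be read directly from the `relation` elements by resolving each entry id to its name. The Python sketch below is illustrative only: the `<pathway>` wrapper and the function name `enzyme_links` are assumptions, and the attributes elided with "..." on the slide are left out rather than guessed.

```python
import xml.etree.ElementTree as ET

# Entries and relations copied from the slide; attributes elided with "..."
# there are omitted here. The <pathway> wrapper is an assumption.
KGML = """
<pathway>
  <entry id="22" name="ec:4.2.1.3" type="enzyme" reaction="rn:R01325"/>
  <entry id="23" name="ec:2.3.3.1" type="enzyme" reaction="rn:R00351"/>
  <entry id="24" name="ec:1.1.1.37" type="enzyme" reaction="rn:R00342"/>
  <entry id="51" name="cpd:C00036" type="compound"/>
  <entry id="61" name="cpd:C00158" type="compound"/>
  <relation entry1="23" entry2="24" type="ECrel" compound="51"/>
  <relation entry1="22" entry2="23" type="ECrel" compound="61"/>
</pathway>
"""


def enzyme_links(xml_text):
    """Return (enzyme1, enzyme2, linking_compound) for every ECrel relation."""
    root = ET.fromstring(xml_text)
    # Map entry ids to names so relations can be reported by name.
    names = {entry.get("id"): entry.get("name") for entry in root.iter("entry")}
    return [
        (names[r.get("entry1")], names[r.get("entry2")], names[r.get("compound")])
        for r in root.iter("relation")
        if r.get("type") == "ECrel"
    ]


relations = enzyme_links(KGML)
for relation in relations:
    print(relation)
```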
Pathway Data Definition in Oracle … Example
Graph 1 case
Node table
NODE_ID NODE_NAME ACT COSTS SAMPLE_ID            ENTRY_ID
------- --------- --- ----- -------------------- --------
      1 C00022    Y       1 Pyruvate                   49
      2 C00122    Y       1 Fumarate                   50
      3 C00036    Y       1 Oxaloacetate               51
      4 C05379    Y       1 Oxalosuccinate             52
      5 C00074    Y       1 Phosphoenolpyruvate        53
      6 C00024    Y       1 Acetyl-CoA                 54
      7 C00149    Y       1 (S)-Malate                 55
      8 C00311    Y       1 Isocitrate                 56
      9 C00417    Y       1 cis-Aconitate              57
     10 C00042    Y       1 Succinate                  58
      :
Supplementary information
Link table
LINK_ID LINK_NAME            START_NODE_ID END_NODE_ID ACT COST SAMPLE_ID
------- -------------------- ------------- ----------- --- ---- ---------------------------------------
      1 1.1.1.42 (rn:R00268)             4          19 Y      1 isocitrate dehydrogenase (NADP)
      2 1.1.1.42 (rn:R00268)            19           4 Y      1 isocitrate dehydrogenase (NADP)
      3 1.1.1.42 (rn:R00268)             4          20 Y      1 isocitrate dehydrogenase (NADP)
      4 1.1.1.42 (rn:R00268)            20           4 Y      1 isocitrate dehydrogenase (NADP)
      5 4.1.1.49 (rn:R00341)             3          19 Y      1 phosphoenolpyruvate carboxykinase (ATP)
      6 4.1.1.49 (rn:R00341)             3           5 Y      1 phosphoenolpyruvate carboxykinase (ATP)
      7 1.1.1.37 (rn:R00342)             3           7 Y      1 malate dehydrogenase
      8 6.4.1.1 (rn:R00344)              1           3 Y      1 pyruvate carboxylase
      :
Path computation supported by the
Oracle Network Model
• Shortest path between two nodes
– Computing the shortest path between
two specified compounds
• All paths search between two nodes
– Computing all alternative paths between
two specified compounds
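The two kinds of search can be illustrated outside the database with a few lines of Python. This is a plain-Python stand-in, not the Oracle Network Model API: the function names are hypothetical, the links are the eight (START_NODE_ID, END_NODE_ID) pairs shown in the Link table example, and breadth-first search is used because every link there has a cost of 1.

```python
from collections import defaultdict, deque

# (START_NODE_ID, END_NODE_ID) pairs from the Link table example.
LINKS = [(4, 19), (19, 4), (4, 20), (20, 4), (3, 19), (3, 5), (3, 7), (1, 3)]


def build_adjacency(links):
    """Group directed links by their start node."""
    adjacency = defaultdict(list)
    for start, end in links:
        adjacency[start].append(end)
    return adjacency


def shortest_path(adjacency, source, target):
    """Breadth-first search; returns a fewest-links path or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def all_paths(adjacency, source, target, path=None):
    """Depth-first enumeration of every path that visits no node twice."""
    path = (path or []) + [source]
    if source == target:
        return [path]
    found = []
    for neighbour in adjacency.get(source, ()):
        if neighbour not in path:
            found.extend(all_paths(adjacency, neighbour, target, path))
    return found


adjacency = build_adjacency(LINKS)
print(shortest_path(adjacency, 1, 20))
print(all_paths(adjacency, 1, 20))
```

With only the eight links shown, a single route exists between nodes 1 and 20; alternative paths appear once the full link table is loaded.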
Overlaying a result of path computation onto
the existing pathway map
Hierarchy of similar proteins
EC numbers specify the hierarchical classification of enzyme reactions
2. transferase
  2.3. acyltransferase (two levels up)
    2.3.1. other than aminoacyl groups (one level up)
      2.3.1.39 2.3.1.41 2.3.1.61 .....
    2.3.2. aminoacyltransferase (one level up)
      2.3.2.2 2.3.2.6
Results of path computation using query
relaxation
Enzymes that the target
organism does not have
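One plausible reading of query relaxation, given the EC hierarchy slide, is that an enzyme the target organism lacks may be replaced by an available enzyme that shares its EC class one or two levels up. The Python sketch below follows that reading. It is an assumption about the approach, not the method used in the talk, and the function names and the example enzyme sets are hypothetical.

```python
def ec_prefix(ec_number, levels_up):
    """Drop the last `levels_up` fields of an EC number, e.g. 2.3.1.39 -> 2.3.1."""
    fields = ec_number.split(".")
    return ".".join(fields[: len(fields) - levels_up])


def relaxed_match(required, available, max_levels_up=2):
    """Return (enzyme, levels_up) for the closest available enzyme, or None.

    Level 0 is an exact match; each further level compares one field fewer.
    """
    for levels_up in range(max_levels_up + 1):
        prefix = ec_prefix(required, levels_up)
        for enzyme in sorted(available):
            if ec_prefix(enzyme, levels_up) == prefix:
                return enzyme, levels_up
    return None


# Hypothetical organism that has 2.3.1.41 and 2.3.2.2 but not 2.3.1.39.
organism_enzymes = {"2.3.1.41", "2.3.2.2"}
print(relaxed_match("2.3.1.39", organism_enzymes))
```

Trying the exact match first and widening one level at a time keeps each substitution as close to the required reaction as the hierarchy allows.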
Searching alternative pathways
between two compounds
Other applications using Oracle
• SSDB: Database of sequence similarities
– Binary relations for pairs of similar sequences
– 190,000,000 relations
– http://ssdb.genome.ad.jp/
• Annotation tool
– A tool for functional annotation of genes in GENES
– Annotation can be done using
• Best hit and bidirectional best hit relations in SSDB
• Genomic position information
• LIGAND chemical database
– http://www.genome.ad.jp/ligand/
– Based on MDL’s ISIS database
– Substructure search
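The best hit and bidirectional best hit relations mentioned for the annotation tool can be illustrated with a small Python sketch. The gene identifiers and similarity scores are invented for illustration, and the function names are hypothetical; how SSDB itself stores or computes these relations is not described on the slide.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit in the other genome."""
    best = {}
    for (query, hit), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (hit, score)
    return {query: hit for query, (hit, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return the pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Invented similarity scores between genes of two genomes, A and B.
a_vs_b = {("a1", "b1"): 900, ("a1", "b2"): 300, ("a2", "b1"): 500}
b_vs_a = {("b1", "a1"): 880, ("b2", "a1"): 310}
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```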
Possible applications for the future
• LinkDB: database of related entries
– Binary relations between database entries related
by cross-references
– 50 databases and 70 million relations
– http://www.genome.ad.jp/dbget-bin/www_linkdb
• Extraction of cliques
– Orthologous groups in SSDB
• Comparison between networks
– Comparing and extracting correlated clusters
from different networks
• Metabolic pathways and genomic positions
• Protein interaction networks and expression similarities
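For the clique extraction item, one standard technique is the Bron-Kerbosch algorithm for enumerating maximal cliques. The Python sketch below assumes genes are nodes and similarity relations are undirected edges; the gene names are invented, and nothing on the slide says that this particular algorithm is the one intended.

```python
def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion.

    `adjacency` maps each node to the set of its neighbours (undirected graph).
    """
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(clique))  # nothing can extend this clique
            return
        for node in sorted(candidates):
            expand(
                clique | {node},
                candidates & adjacency[node],
                excluded & adjacency[node],
            )
            # The node is done: never add it again at this level.
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


# Invented similarity graph: g1, g2, g3 are mutually similar; g4 only to g3.
similar = {
    "g1": {"g2", "g3"},
    "g2": {"g1", "g3"},
    "g3": {"g1", "g2", "g4"},
    "g4": {"g3"},
}
print(maximal_cliques(similar))
```

The basic recursion is exponential in the worst case, so a graph on the scale of SSDB would need pivoting or a prefilter; the sketch only shows the idea.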
Searching similar compounds
1. Searching for a maximal common subgraph
2. Counting matching atoms
3. Calculating weight as Jaccard coefficient
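Steps 2 and 3 reduce to simple arithmetic once the common subgraph is known: the Jaccard coefficient is the number of matched atoms divided by the number of atoms in the union of the two compounds. A minimal Python sketch follows, assuming the matched-atom count has already come from step 1; the function name and the example counts are hypothetical.

```python
def jaccard_from_match(n_atoms_a, n_atoms_b, n_matched):
    """Jaccard coefficient |A and B| / |A or B| from atom counts.

    `n_matched` is the number of atoms in the common subgraph (step 2).
    """
    if n_matched > min(n_atoms_a, n_atoms_b):
        raise ValueError("matched atoms cannot exceed either compound's size")
    union = n_atoms_a + n_atoms_b - n_matched
    return n_matched / union if union else 0.0


# Hypothetical compounds with 10 and 8 atoms sharing a 6-atom subgraph.
print(jaccard_from_match(10, 8, 6))
```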
Summary
• KEGG: Kyoto Encyclopedia of Genes and Genomes
– Integrated database for pathways, reactions, genomes,
expression information, and more..
– Graph representation
• Comparisons of pathways
• Path computation in pathways
• Using Oracle 10g Network Model....
– Efficient and effective pathway analysis will be achieved,
especially for pathway computation using graph search
algorithms embedded in Oracle.
– Various types and large amounts of network data, such as
networks of protein interactions and of database entries, are
expected.
– It will be much more effective if the graph comparison
algorithms can be easily applied.
KEGG project team
• Project Leader
– Minoru Kanehisa
• System Development Team
– Susumu Goto, Kotaro Shiraishi, Kayo Okamoto, Satoshi Miyazaki, Tomomi Kamiya, Yoko Sato, Akihiro Nakaya, Shuichi Kawashima, Koichiro Tonomura, Junji Fukumoto, Koichi Ohkubo
• Data Entry Team
– Miho Furumichi, Junko Yabuzaki, Nobue Takeuchi, Yuriko Matsuura, Masami Hamajima, Rumiko Yamamoto, Tomoko Komeno, Toshi Nakatani, Junko Nishida, Atsuko Tanaka, Megumi Yamaguchi, Tomoko Deno, Ayumi Kirioka, Tomoko Hattori, Kana Matsumoto, Hiroko Shino, Sanae Asanuma, Junko Yamamoto
• Curators
– Takaaki Nishioka, Yasushi Okuno, Masahiro Hattori, Toshiaki Katayama, Yoshinobu Igarashi, Keun-joon Park, Akiyasu Yoshizawa, Vachiranee Limviphuvadh