Semantics empowered Life Science Applications

Semantic empowerment
of Life Science Applications
October 2006
Amit Sheth
LSDIS Lab, Department of Computer Science,
University of Georgia
Acknowledgement: NCRR funded Bioinformatics of Glycan Expression,
collaborators, partners at CCRC (Dr. William S. York)
and Satya S. Sahoo, Cartic Ramakrishnan, Christopher Thomas, Cory Henson.
Computation, data and
semantics in life sciences
• “The development of a predictive biology will likely be one of
the major creative enterprises of the 21st century.” Roger
Brent, 1999
• “The future will be the study of the genes and proteins of
organisms in the context of their informational pathways or
networks.” L. Hood, 2000
• "Biological research is going to move from being hypothesis-driven to being data-driven." Robert Robbins
• “We’ll see over the next decade complete transformation (of
life science industry) to very database-intensive as opposed
to wet-lab intensive.” Debra Goldfarb
We will show how semantics is a key enabler for achieving the
above predictions and visions, in which information and
process play a critical role.
Semantic Web and Life Science
• Data captured per year = 1 exabyte (10^18 bytes)
(Eric Neumann, Science, 2005)
• How much is that?
– Compare it to the estimate of the total words
ever spoken by humans = 12 exabytes
• Death by data
• The need for
– Search
– Integration
– Analysis, decision support
– Discovery
Not data, but
analysis and insight,
leading to decisions
and discovery
Semantic empowerment
of Life Science Applications
Life Science research today deals with highly
heterogeneous as well as massive amounts of data
distributed across the world.
We need more automated ways for integration and
analysis leading to insight and discovery
- to understand cellular components, molecular
functions and biological processes, and more
importantly complex interactions and
interdependencies between them.
Benefits of Semantics
• Development of large domain-specific
knowledge
– for reference, common nomenclature, tagging
• Integration of heterogeneous multi-source
data: biomedical documents (text),
scientific/experimental data and structured
databases
• Semantic search, browsing, integration
analysis, and discovery
Faster and more reliable discovery leading to
quality of life improvements
What is semantics & Semantic Web
• Meaning and use of data
• From syntax and structure to semantics (beyond
formatting, organization, query interfaces,….)
• XML -> RDF -> OWL -> Rules -> Trust
• Ontologies at the heart of Semantic Web,
capturing agreement and domain knowledge
• (Automatic) Semantic annotation, reasoning,…
• Also, increasing use of Services oriented
Architecture -> semantic Web services
• W3C SW for Health Care and Life Sciences
Semantic empowerment
of Life Science Applications
This talk will demonstrate some of the efforts in:
• Building large (populated) life science ontologies
(GlycO, ProPreO)
• Gathering/extracting knowledge and metadata:
entity and relationship extraction from
unstructured data, automatic semantic annotation
of scientific/experimental data (e.g., mass
spectrometry)
• Semantic web services and registries, leading to
better discovery/reuse of scientific tools and their
composition
• Ontology-driven applications developed
Semantic Applications
• Active Semantic Medical Records Demo: an
operational health care application using multiple ontologies,
semantic annotations and rule-based decision support
• Semantic Browser Demo: contextual browsing of
PubMed aided by ontology and schema (in future instance)
level relationships
• N-glycosylation process: an example of scientific
workflow
• Integrated Semantic Information & Knowledge
System (ISIS): integrated access and analysis of
structured databases, scientific literature and experimental data
Others we will not discuss: SemBowser, SemDrug, ….
Let us start with a couple of simple applications
Life Science Ontologies
• GlycO
• An ontology for structure and function of Glycopeptides
• 573 classes, 113 relationships
• Published through the National Center for Biomedical
Ontology (NCBO)
• ProPreO
• An ontology for capturing process and lifecycle information
related to proteomic experiments
• 398 classes, 32 relationships
• 3.1 million instances
• Published through the National Center for Biomedical
Ontology (NCBO) and Open Biomedical Ontologies (OBO)
N-Glycosylation metabolic pathway
[Figure labels: N-glycan_beta_GlcNAc_9; N-glycan_alpha_man_4; N-acetyl-glucosaminyl_transferase_V]
GNT-I attaches GlcNAc at position 2
GNT-V attaches GlcNAc at position 6
UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2
<=>
UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2
UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021
GlycO ontology
• Challenge – model hundreds of thousands of
complex carbohydrate entities
• But, the differences between the entities are
small (E.g. just one component)
• How to model all the concepts but preclude
redundancy → ensure maintainability,
scalability
GlycoTree
b-D-GlcpNAc-(1-2)- a-D-Manp -(1-6)+
b-D-Manp-(1-4)- b-D-GlcpNAc -(1-4)- b-D-GlcpNAc
b-D-GlcpNAc-(1-4)- a-D-Manp -(1-3)+
b-D-GlcpNAc-(1-2)+
N. Takahashi and K. Kato, Trends in Glycosciences
and Glycotechnology, 15: 235-251
EnzyO
• The enzyme ontology EnzyO is highly
intertwined with GlycO. While its structure
is mostly that of a taxonomy, it is highly
restricted at the class level and hence
allows for comfortable classification of
enzyme instances from multiple organisms
• GlycO together with EnzyO contains all the
information that is needed for the
description of metabolic pathways
– e.g. N-Glycan Biosynthesis
Pathway representation in
GlycO
Pathways do not need to be
explicitly defined in GlycO. The
residue-, glycan-, enzyme- and
reaction descriptions contain
all the knowledge necessary to
infer pathways.
Zooming in a little …
Reaction R05987
catalyzed by enzyme 2.4.1.145
adds_glycosyl_residue
N-glycan_b-D-GlcpNAc_13
The product of this
reaction is the
Glycan with KEGG
ID 00020.
The N-Glycan with KEGG
ID 00015 is the substrate to
the reaction R05987, which
is catalyzed by an enzyme
of the class EC 2.4.1.145.
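Because pathways are inferred rather than stored, the idea can be sketched by chaining reactions whose product is the substrate of the next one. The Python sketch below is illustrative only and is not the GlycO reasoning mechanism. The `Reaction` tuple, the `infer_pathway` helper and the placeholder ID `R_next` are assumptions. The glycan IDs and the EC class are taken from the slides, assuming that KEGG ID 00015 corresponds to G00015 and so on.

```python
from typing import List, NamedTuple, Optional


class Reaction(NamedTuple):
    reaction_id: str
    substrate: str
    product: str
    enzyme_class: Optional[str]


# Reaction descriptions in no particular order; no pathway is stored.
REACTIONS = [
    # Placeholder ID: the slides give this reaction (G00020 <=> G00021)
    # without a reaction ID or an EC class.
    Reaction("R_next", "G00020", "G00021", None),
    Reaction("R05987", "G00015", "G00020", "2.4.1.145"),
]


def infer_pathway(start: str, reactions: List[Reaction]) -> List[Reaction]:
    """Follow substrate -> product links starting from one glycan.

    Simplifying assumption: at most one reaction per substrate.
    """
    by_substrate = {r.substrate: r for r in reactions}
    path, seen, current = [], set(), start
    while current in by_substrate and current not in seen:
        seen.add(current)  # guard against cycles
        step = by_substrate[current]
        path.append(step)
        current = step.product
    return path


if __name__ == "__main__":
    for step in infer_pathway("G00015", REACTIONS):
        print(step.substrate, "->", step.product, "via", step.reaction_id)
```

A real ontology would express the same links as properties between glycan, reaction and enzyme instances and leave the chaining to a reasoner or a path query.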
GlycO population
• Multiple data sources used in populating
the ontology
o KEGG - Kyoto Encyclopedia of Genes and
Genomes
o SWEETDB
o CARBANK Database
• Each data source has different schema for
storing data
• There is significant overlap of instances in
the data sources
• Hence, entity disambiguation and a
common representational format are
needed
Ontology population workflow
[Flowchart: the Semagix Freedom knowledge extractor supplies instance data.
Decision nodes: "Has CarbBank ID?" (YES / NO) and "Already in KB?" (YES: next instance / NO).
Processing steps: IUPAC to LINUCS, LINUCS to GLYDE, compare to knowledge base, insert into KB.]
Ontology population workflow
[Same flowchart, showing a glycan in LINUCS notation at the "IUPAC to LINUCS" step:]
[][Asn]{[(4+1)][b-D-GlcpNAc]
{[(4+1)][b-D-GlcpNAc]
{[(4+1)][b-D-Manp]
{[(3+1)][a-D-Manp]
{[(2+1)][b-D-GlcpNAc]
{}[(4+1)][b-D-GlcpNAc]
{}}[(6+1)][a-D-Manp]
{[(2+1)][b-D-GlcpNAc]{}}}}}}
Ontology population workflow
[Same flowchart, showing the glycan in GLYDE XML at the "LINUCS to GLYDE" step:]
<Glycan>
 <aglycon name="Asn"/>
 <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
   <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man" >
    <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man" >
     <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc" >
     </residue>
     <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc" >
     </residue>
    </residue>
    <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man" >
     <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
     </residue>
    </residue>
   </residue>
  </residue>
 </residue>
</Glycan>
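The "LINUCS to GLYDE" step can be illustrated with a small recursive parser. This is a minimal sketch and not the converter used in the project: it assumes each LINUCS node has the form `[link][name]{children}`, and the `RING_FREE` lookup that maps `GlcpNAc` to `GlcNAc` and `Manp` to `Man` is an assumption covering only this example. The element and attribute names follow the GLYDE fragment on the slide.

```python
import re
import xml.etree.ElementTree as ET

# The glycan from the slide, in LINUCS notation.
LINUCS = ("[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]"
          "{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}"
          "[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}")

# Assumed lookup (this example only): drop the ring-form letter.
RING_FREE = {"GlcpNAc": "GlcNAc", "Manp": "Man"}

# One node header: "[(4+1)][b-D-Manp]{" or, for the root, "[][Asn]{".
NODE = re.compile(r"\[(?:\((\d+)\+(\d+)\))?\]\[([^\]]+)\]\{")


def parse_node(text, pos):
    """Parse one node starting at pos; return (node dict, next position)."""
    match = NODE.match(text, pos)
    if match is None:
        raise ValueError(f"malformed LINUCS at offset {pos}")
    link, anomeric_carbon, name = match.groups()
    pos = match.end()
    children = []
    while pos < len(text) and text[pos] != "}":
        child, pos = parse_node(text, pos)
        children.append(child)
    if pos >= len(text):
        raise ValueError("unbalanced braces in LINUCS string")
    node = {"link": link, "anomeric_carbon": anomeric_carbon,
            "name": name, "children": children}
    return node, pos + 1  # skip the closing brace


def add_residue(parent, node):
    """Append a <residue> element for node (and its subtree) to parent."""
    anomer, chirality, sugar = node["name"].split("-", 2)
    elem = ET.SubElement(parent, "residue", {
        "link": node["link"],
        "anomeric_carbon": node["anomeric_carbon"],
        "anomer": anomer,
        "chirality": chirality,
        "monosaccharide": RING_FREE.get(sugar, sugar),
    })
    for child in node["children"]:
        add_residue(elem, child)


def linucs_to_glyde(text):
    """Convert a LINUCS string into a GLYDE-like element tree."""
    root, end = parse_node(text, 0)
    if end != len(text):
        raise ValueError("trailing characters after LINUCS root node")
    glycan = ET.Element("Glycan")
    ET.SubElement(glycan, "aglycon", {"name": root["name"]})
    for child in root["children"]:
        add_residue(glycan, child)
    return glycan


if __name__ == "__main__":
    print(ET.tostring(linucs_to_glyde(LINUCS), encoding="unicode"))
```

Keeping the parse step separate from the XML step means the same node tree could also be compared against existing knowledge-base entries before anything is inserted.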
ProPreO ontology
• Two aspects of glycoproteomics:
o What is it? → identification
o How much of it is there? → quantification
• Heterogeneity in data generation process,
instrumental parameters, formats
• Need data and process provenance →
ontology-mediated provenance
• Hence, ProPreO models both the
glycoproteomics experimental process and
attendant data
ProPreO population:
transformation to RDF
[Figure: Scientific Data → Computational Methods → Ontology instances]
ProPreO population:
transformation to RDF
[Figure, with key: scientific data processed by computational methods into RDF.
Protein path: protein data (amino-acid sequence) → determine N-glycosylation consensus, calculate chemical mass, calculate monoisotopic mass → "Protein RDF" (amino-acid sequence, n-glycosylation consensus, chemical mass, monoisotopic mass).
Peptide path: extract peptide amino-acid sequence from protein amino-acid sequence → "Peptide RDF" (amino-acid sequence, n-glycosylation consensus, chemical mass, monoisotopic mass, parent protein).
Intermediate outputs: Amino-acid Sequence RDF, Chemical Mass RDF, Monoisotopic Mass RDF.]
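The computational methods named in the figure can be sketched for a single peptide as follows. This is a hedged illustration, not the ProPreO population code. The predicate names in `peptide_triples` are hypothetical. The consensus check uses the commonly cited N-X-S/T sequon (X not proline), which is an assumption about what "N-glycosylation consensus" means here. The chemical (average) mass is left out.

```python
import re

# Standard monoisotopic residue masses (Da) of the 20 common amino acids.
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide (free N- and C-termini)

# N-X-S/T with X != P; the lookahead keeps overlapping sites detectable.
SEQUON = re.compile(r"N(?=[^P][ST])")


def monoisotopic_mass(sequence):
    """Sum the residue masses and add one water."""
    return sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in sequence) + WATER


def n_glycosylation_sites(sequence):
    """Return 0-based positions of Asn residues inside a sequon."""
    return [m.start() for m in SEQUON.finditer(sequence)]


def peptide_triples(peptide_id, sequence, parent_protein):
    """Describe one peptide as (subject, predicate, object) triples.

    The predicate names are hypothetical, not ProPreO property names.
    """
    return [
        (peptide_id, "has_amino_acid_sequence", sequence),
        (peptide_id, "has_monoisotopic_mass",
         round(monoisotopic_mass(sequence), 4)),
        (peptide_id, "has_n_glycosylation_consensus",
         bool(n_glycosylation_sites(sequence))),
        (peptide_id, "has_parent_protein", parent_protein),
    ]


if __name__ == "__main__":
    for triple in peptide_triples("peptide_1", "ANVTNPSNGS", "protein_1"):
        print(triple)
```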
Semantic empowerment
of Life Science Applications
This talk will demonstrate some of the efforts in:
• building large life science ontologies (GlycO - an ontology for structure and
function of Glycopeptides, and ProPreO - an ontology for capturing process
and lifecycle information related to proteomic experiments) and their
application in advanced ontology-driven semantic applications
• entity and relationship extraction from unstructured data, automatic
semantic annotation of scientific/experimental data (e.g., mass
spectrometry), and resulting capability in integrated access and analysis of
structured databases, scientific literature and experimental data
• semantic web services and registries, leading to better discovery/reuse of
scientific tools and composition of scientific workflows that process
high-throughput data and can be adaptive
• semantic applications developed
Relationship extraction
from unstructured data
(other related research: biological entity extraction)
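Overview
[Figure: UMLS (schema level): classes Biologically active substance, Lipid and Disease or Syndrome, connected by the relationships affects, complicates and causes.
MeSH (instance level): Fish Oils and Raynaud’s Disease, each an instance_of a UMLS class; the relationship between them is unknown (???????).
PubMed: document counts of 9284, 5 and 4733.]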
About the data used
• UMLS – A high level schema of the
biomedical domain
– 136 classes and 49 relationships
– Synonyms of all relationships – using variant
lookup (tools from NLM)
• MeSH
– Terms already asserted as instances of one or
more classes in UMLS
• PubMed
– Abstracts annotated with one or more MeSH
terms
T147—effect
T147—induce
T147—etiology
T147—cause
T147—effecting
T147—induced
Example PubMed abstract (for the domain
expert)
Abstract
Classification/Annotation
Method – Parse Sentences in
PubMed
SS-Tagger (University of Tokyo)
SS-Parser (University of Tokyo)
(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ
exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ
induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT
the) (NN endometrium) ) ) ) ) ) )
Method – Identify entities and
Relationships in Parse Tree
Method – Identify entities and
Relationships in Parse Tree
Modifiers
Modified entities
Composite Entities
Method – Fact Extraction from
Parse Tree
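The parse-then-extract steps can be sketched on the example sentence from the "Parse Sentences" slide. This is a simplified illustration rather than the method used in the talk: it reads the bracketed parse, takes the first NP under S as subject, the verb and first NP under VP as predicate and object, and maps the verb to T147 by prefix matching against the variants listed earlier. The helper names and the prefix-matching rule are assumptions.

```python
PARSE = ("(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
         "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
         "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
         "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )")

# Variants shown on the slide for T147.
T147_VARIANTS = ("effect", "induce", "etiology", "cause", "effecting", "induced")


def parse_tree(text):
    """Turn a bracketed parse into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def words(tree):
    """Collect the words under a node, left to right."""
    out = []
    for child in tree[1]:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out


def first_child(tree, label):
    """Return the first direct child with the given label, or None."""
    return next((c for c in tree[1]
                 if isinstance(c, tuple) and c[0] == label), None)


def extract_fact(tree):
    """Return (subject phrase, verb, object phrase) for a simple S -> NP VP."""
    sentence = first_child(tree, "S")
    subject = first_child(sentence, "NP")
    verb_phrase = first_child(sentence, "VP")
    verb = next(c for c in verb_phrase[1]
                if isinstance(c, tuple) and c[0].startswith("VB"))
    obj = first_child(verb_phrase, "NP")
    return " ".join(words(subject)), words(verb)[0], " ".join(words(obj))


def map_relationship(verb):
    """Crude variant lookup (assumed rule): prefix match against T147 variants."""
    matched = any(verb.lower().startswith(v) for v in T147_VARIANTS)
    return "T147" if matched else None


if __name__ == "__main__":
    subject, verb, obj = extract_fact(parse_tree(PARSE))
    print(subject, "|", verb, "|", obj, "|", map_relationship(verb))
```

Modifiers and composite entities, as named on the preceding slides, would need a finer walk over the NP subtrees than this whole-phrase extraction provides.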
Semantic annotation of
scientific/experimental data
ProPreO: Ontology-mediated
provenance
ms/ms peaklist data (Mass Spectrometry (MS) Data)
830.9570 194.9604 2     (parent ion m/z, parent ion abundance, parent ion charge)
580.2985 0.3592         (fragment ion m/z, fragment ion abundance)
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
ProPreO: Ontology-mediated
provenance
<ms-ms_peak_list>
<parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"
mode="ms-ms"/>
<parent_ion m-z="830.9570" abundance="194.9604" z="2"/>
<fragment_ion m-z="580.2985" abundance="0.3592"/>
<fragment_ion m-z="688.3214" abundance="0.2526"/>
<fragment_ion m-z="779.4759" abundance="38.4939"/>
<fragment_ion m-z="784.3607" abundance="21.7736"/>
<fragment_ion m-z="1543.7476" abundance="1.3822"/>
<fragment_ion m-z="1544.7595" abundance="2.9977"/>
<fragment_ion m-z="1562.8113" abundance="37.4790"/>
<fragment_ion m-z="1660.7776" abundance="476.5043"/>
</ms-ms_peak_list>
[Callout on the element and attribute names: Ontological Concepts]
Semantically Annotated MS Data
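The step from the raw peaklist to the annotated form can be sketched as below. This is an illustrative sketch, not the annotation tool from the talk. It assumes the first peaklist line holds the parent ion (m/z, abundance, charge) and every later line a fragment ion (m/z, abundance). The element and attribute names are copied from the slide, and the function name `annotate_peaklist` is made up for the example.

```python
import xml.etree.ElementTree as ET

# Raw ms/ms peaklist values as shown on the slide.
PEAKLIST = """\
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
"""

INSTRUMENT = "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"


def annotate_peaklist(text, instrument):
    """Wrap a raw peaklist in elements named after ontology concepts."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    parent, fragments = rows[0], rows[1:]

    root = ET.Element("ms-ms_peak_list")
    ET.SubElement(root, "parameter",
                  {"instrument": instrument, "mode": "ms-ms"})

    mz, abundance, charge = parent
    ET.SubElement(root, "parent_ion",
                  {"m-z": mz, "abundance": abundance, "z": charge})

    for mz, abundance in fragments:
        ET.SubElement(root, "fragment_ion",
                      {"m-z": mz, "abundance": abundance})
    return root


if __name__ == "__main__":
    tree = annotate_peaklist(PEAKLIST, INSTRUMENT)
    print(ET.tostring(tree, encoding="unicode"))
```

Values are kept as strings so the annotated file reproduces the instrument output digit for digit.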
N-Glycosylation Process (NGP)
[Workflow figure:]
Cell Culture
→ extract → Glycoprotein Fraction
→ proteolysis → Glycopeptides Fraction (1)
→ Separation technique I → Glycopeptides Fraction (n)
→ PNGase → Peptide Fraction (n)
→ Separation technique II → Peptide Fraction (n*m)
→ Mass spectrometry → ms data, ms/ms data
→ Data reduction → ms peaklist, ms/ms peaklist
Downstream steps shown: binning, N-dimensional array, signal integration,
glycopeptide identification and quantification; data reduction, peptide
identification, peptide list; data correlation
Semantic Web Process to incorporate provenance
[Figure: a pipeline of agent-run steps, each with an input (I) and an output (O):
Biological Sample → Analysis by MS/MS → Raw Data → Raw Data to Standard Format → Standard Format Data → Data Preprocess → Filtered Data → DB Search (Mascot/Sequest) → Search Results → Results Postprocess (ProValt) → Final Output.
Also shown: Storage, Biological Information, Semantic Annotation Applications.]
Converting biological
information to the W3C
Resource Description Framework
(RDF): Experience with Entrez
Gene
Collaboration with Dr. Olivier Bodenreider
(US National Library of Medicine, NIH, Bethesda, MD)
Biomedical Knowledge Repository
….
Entrez
Biomedical
Knowledge
Repository
Implementation
Entrez Gene → Entrez Gene XML → XSLT → Entrez Gene RDF → Entrez Gene RDF graph
Web interface
ENTREZ GENE
ENTREZ GENE XML
XSLT
ENTREZ GENE RDF GRAPH
….
ENTREZ GENE RDF
Connecting different genes
Human APP gene is implicated in Alzheimer's disease.
Which genes are functionally homologous to this gene?
[Figure: APP gene [Homo sapiens], APP gene [Gallus gallus] and APP gene [Canis familiaris], linked through eg:has_protein_reference_name_E to protein names:
protease nexin-II; A4 amyloid protein; amyloid-beta protein; beta-amyloid peptide; cerebral vascular amyloid peptide; amyloid beta A4 protein; amyloid protein;
amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease)]
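Once gene records are triples, finding candidate related genes becomes a join on shared property values. The sketch below illustrates that idea in plain Python rather than with an RDF store or SPARQL. Only the predicate name comes from the slide. The gene and protein-name identifiers are deliberately abstract placeholders, because the figure does not let us recover which name belongs to which gene. Sharing a protein reference name is a heuristic hint here, not evidence of functional homology.

```python
from collections import defaultdict

NAME = "eg:has_protein_reference_name_E"  # predicate shown on the slide

# Placeholder triples: identifiers are hypothetical, not Entrez Gene data.
TRIPLES = [
    ("gene_1", NAME, "protein name A"),
    ("gene_1", NAME, "protein name B"),
    ("gene_2", NAME, "protein name C"),
    ("gene_3", NAME, "protein name A"),
]


def genes_sharing_names(gene, triples):
    """Map every other gene to the protein names it shares with `gene`."""
    names_by_gene = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == NAME:
            names_by_gene[subject].add(obj)

    target = set(names_by_gene.get(gene, ()))
    return {
        other: sorted(names & target)
        for other, names in names_by_gene.items()
        if other != gene and names & target
    }


if __name__ == "__main__":
    print(genes_sharing_names("gene_1", TRIPLES))
```

In an RDF graph the same question is a two-pattern query: two gene nodes that point to the same literal through the same predicate.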
Integrated Semantic Information
and Knowledge System (ISIS)
Have I performed an error?
Give me all result files from a similar
organism, cell, preparation,
mass spectrometric conditions
and compare results.
[Figure: SPARQL query-based User Interface; ProPreO ontology; Semantic Metadata Registry; Experimental Data; Semantic Annotation Metadata File]
Is the result erroneous?
Give me all result files from a similar
organism, cell, preparation,
mass spectrometric conditions
and compare results.
PROTEOMECOMMONS
EXPERIMENTAL DATA
PROTEOMICS WORKFLOW
[Figure: Raw → Raw2mzXML → mzXML → mzXML2Pkl → Pkl → Pkl2pSplit → pSplit → MASCOT Search → MASCOT result → ProVault → ProVault result]
Summary, Observations,
Conclusions
• We now have semantics- and services-enabled
approaches that support semantic
search, semantic integration, semantic
analytics, decision support and validation
(e.g., error prevention in healthcare),
knowledge discovery, process/pathway
discovery, …
• http://lsdis.cs.uga.edu
• http://knoesis.org
http://lsdis.cs.uga.edu/projects/asdoc/
http://lsdis.cs.uga.edu/projects/glycomics/