Semantics Enabled Industrial and Scientific Applications


Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications

Keynote - the First Online Metadata and Semantics Research Conference http://www.metadata-semantics.org

November 23, 2005

Amit Sheth LSDIS Lab, Department of Computer Science, University of Georgia http://lsdis.cs.uga.edu

Acknowledgement: NCRR-funded Bioinformatics of Glycan Expression project; collaborators and partners at CCRC (Dr. William S. York); and Satya S. Sahoo, Christopher Thomas, Cartic Ramakrishnan.

Computation, data and semantics in life sciences

• “The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999
• “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000
• “Biological research is going to move from being hypothesis-driven to being data-driven.” Robert Robbins
• “We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb

We will show how semantics is a key enabler for achieving the above predictions and visions.

Bioinformatics Apps & Ontologies

• GlycO: a domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolisms of glycans)
  – Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy
  – URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco
• ProPreO: a comprehensive process ontology modeling experimental proteomics
  – Contains 330 classes, 40,000+ instances
  – Models three phases of experimental proteomics: separation techniques, mass spectrometry and data analysis
  – URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo
• Automatic semantic annotation of high-throughput experimental data (in progress)
• Semantic Web Process with WSDL-S for semantic annotations of Web Services
  – http://lsdis.cs.uga.edu -> Glycomics project (funded by NCRR)

GlycO – A domain ontology for glycans


Structural modeling and population challenges in GlycO

• An extremely large number of glycans occur in nature
• But frequently they differ only slightly in their structural properties
• Modeling all possible glycans would involve a significant number of redundant classes
• Redundancy results in often fatal complexities in maintenance and upgrade
• Population
  – Manual
  – Extraction and integration from external knowledge sources
  – GlycoTree: exploiting structural composition rules

Ontology population workflow

[Figure: ontology population workflow; one labeled element is GlycoTree (Takahashi and Kato, 2003)]

GlycoTree – A Canonical Representation of N-Glycans

[Figure: N-glycan structure in GlycoTree form; branches listed top to bottom, "+" marks a branch joining the residue below or above]
β-D-GlcpNAc-(1-6)+
β-D-GlcpNAc-(1-2)-α-D-Manp-(1-6)+
β-D-Manp-(1-4)-β-D-GlcpNAc-(1-4)-β-D-GlcpNAc
β-D-GlcpNAc-(1-4)-α-D-Manp-(1-3)+
β-D-GlcpNAc-(1-2)+

N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251
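A canonical tree form like this maps naturally onto a small tree data structure. The following is a minimal sketch, assuming the figure reads as a core chain with two mannose arms that each carry two GlcNAc residues; the `Residue` class and `count_residues` helper are illustrative names, not part of GlycoTree or GlycO.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Residue:
    """One monosaccharide residue; `linkage` is its bond to the parent residue."""
    name: str                 # e.g. "b-D-GlcpNAc" (b = beta, a = alpha)
    linkage: str = ""         # empty for the root (reducing-end) residue
    children: list = field(default_factory=list)

    def add(self, name, linkage):
        """Attach a child residue and return it so that branches can be chained."""
        child = Residue(name, linkage)
        self.children.append(child)
        return child


def count_residues(node):
    """Count residues by name over the whole tree."""
    counts = Counter([node.name])
    for child in node.children:
        counts += count_residues(child)
    return counts


# Assumed reading of the figure, built from the reducing end outward.
root = Residue("b-D-GlcpNAc")
core_glcnac = root.add("b-D-GlcpNAc", "1-4")
core_man = core_glcnac.add("b-D-Manp", "1-4")
arm6 = core_man.add("a-D-Manp", "1-6")
arm6.add("b-D-GlcpNAc", "1-6")
arm6.add("b-D-GlcpNAc", "1-2")
arm3 = core_man.add("a-D-Manp", "1-3")
arm3.add("b-D-GlcpNAc", "1-4")
arm3.add("b-D-GlcpNAc", "1-2")

counts = count_residues(root)
```

Because every glycan is expressed as a path through one shared tree shape, two structures that differ in a single branch share all other nodes, which is the kind of redundancy reduction the population strategy relies on.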

Beyond expressiveness afforded in OWL

• Probabilistic
• more

Example: Mass spectrometry analysis

Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.

Goldberg et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

Mass Spectrometry Experiment

Each m/z value in a mass spectrum can stand for many different structures (uncertainty with respect to the structure that corresponds to a peak):
• Different linkage
• Different bond
• Different isobaric structures

Very subtle differences

CBank: 16155 (honeybee venom)
CBank: 16154 (bovine)

• Peak at 1219.1
• Same molecular composition
• One diverging link
• Found in different organisms
• Background knowledge (found in honeybee venom or bovine cells) can resolve the uncertainty

These are core-fucosylated high-mannose glycans.
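The use of background knowledge to resolve this ambiguity can be sketched as a simple filter. This is an illustration only: the CBank identifiers, the peak value and the organism labels come from the slide, while the `resolve` function, the lookup table layout and the matching tolerance are assumptions.

```python
# Candidate structures per peak; IDs, peak value and sources are from the slide.
CANDIDATES = {
    1219.1: [
        {"cbank_id": 16155, "source": "honeybee venom"},
        {"cbank_id": 16154, "source": "bovine"},
    ]
}


def resolve(peak_mz, sample_source, candidates=CANDIDATES, tolerance=0.05):
    """Return candidates near `peak_mz` whose known source matches the sample.

    `tolerance` is a hypothetical m/z matching window, not a value from the talk.
    """
    hits = []
    for mz, structures in candidates.items():
        if abs(mz - peak_mz) <= tolerance:
            hits.extend(s for s in structures if s["source"] == sample_source)
    return hits


bovine_hits = resolve(1219.1, "bovine")
unknown_hits = resolve(1219.1, "mouse")
```

For a bovine sample the filter leaves only CBank 16154; for a source with no recorded structure it returns nothing, so the peak stays unannotated rather than wrongly annotated.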

Even in the same organism

CBank: 21821
CBank: 21982

Different enzymes lead to these linkages.

• Both glycans found in bovine cells
• Both have a mass of 3425.11
• Same composition
• Different linkage
• Since expression levels of different genes can be measured in the cell, we can get the probability of each structure in the sample
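One simple way to turn measured expression levels into structure probabilities is to normalize them. The sketch below assumes that the probability of each structure is proportional to the expression level of the enzyme that produces its linkage, which is a strong simplification; the expression values are invented for illustration and `structure_probabilities` is a hypothetical helper.

```python
def structure_probabilities(enzyme_expression):
    """Normalize expression levels into probabilities, one per candidate structure.

    Assumes probability is proportional to the expression level of the enzyme
    that creates the distinguishing linkage (an illustrative simplification).
    """
    total = sum(enzyme_expression.values())
    if total <= 0:
        raise ValueError("expression levels must sum to a positive value")
    return {structure: level / total for structure, level in enzyme_expression.items()}


# Hypothetical expression levels of the enzymes behind each candidate structure.
expression = {"CBank:21821": 30.0, "CBank:21982": 10.0}
probs = structure_probabilities(expression)
```

With these made-up numbers the first structure receives probability 0.75 and the second 0.25.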

Model 1: associate probability as part of Semantic Annotation

• Annotate the mass spec diagram with all possibilities and assign probabilities according to the scientist’s or tool’s best knowledge

P(S | M = 3461.57) = 0.6
P(T | M = 3461.57) = 0.4

Goldberg et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
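Model 1 can be pictured as a peak annotation that carries every candidate structure together with its probability. The following is a minimal sketch using the values from the slide (S with 0.6 and T with 0.4 at M = 3461.57); the `PeakAnnotation` class and its methods are illustrative assumptions, not an existing annotation format.

```python
from dataclasses import dataclass


@dataclass
class PeakAnnotation:
    """A peak annotated with all candidate structures and their probabilities."""
    mz: float
    candidates: dict  # structure label -> probability

    def validate(self, eps=1e-9):
        """Check that the candidate probabilities form a distribution."""
        if abs(sum(self.candidates.values()) - 1.0) > eps:
            raise ValueError("candidate probabilities must sum to 1")

    def most_likely(self):
        """Return the label of the most probable candidate structure."""
        return max(self.candidates, key=self.candidates.get)


annotation = PeakAnnotation(mz=3461.57, candidates={"S": 0.6, "T": 0.4})
annotation.validate()
best = annotation.most_likely()
```

Keeping all candidates, rather than only the best one, lets later evidence revise the annotation without re-running the analysis.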

Model 2: Probability in ontological representation of Glycan structure

• Build a generalized probabilistic glycan structure that embodies several possible glycans
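A generalized probabilistic structure can be thought of as a glycan in which some positions hold a distribution over alternative linkages instead of a single linkage. The sketch below enumerates the concrete glycans such a structure embodies; the position names, the linkage alternatives, their probabilities and the independence assumption are all hypothetical.

```python
from itertools import product

# Each uncertain position maps alternative linkages to probabilities (made-up values).
uncertain_positions = {
    "arm_a": {"1-2": 0.7, "1-4": 0.3},
    "arm_b": {"1-3": 0.5, "1-6": 0.5},
}


def enumerate_structures(positions):
    """Yield (assignment, probability) for every concrete glycan.

    Assumes the uncertain positions are independent of each other.
    """
    names = list(positions)
    for choice in product(*(positions[name].items() for name in names)):
        assignment = {name: linkage for name, (linkage, _) in zip(names, choice)}
        probability = 1.0
        for _, p in choice:
            probability *= p
        yield assignment, probability


structures = list(enumerate_structures(uncertain_positions))
```

Two positions with two alternatives each expand to four concrete glycans whose probabilities sum to one, so a single generalized node replaces four near-identical ontology entries.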

N-Glycosylation Process (NGP)

[Figure: NGP workflow, reconstructed from the slide labels; numbers in parentheses are fraction counts]
Cell Culture -> extract -> Glycoprotein Fraction -> proteolysis -> Glycopeptides Fraction (1) -> Separation technique I -> Glycopeptides Fraction (n) -> PNGase -> Peptide Fraction (n) -> Separation technique II -> Peptide Fraction (n*m) -> Mass spectrometry
ms data -> Data reduction -> ms peaklist -> binning -> N-dimensional array -> Signal integration, Data correlation -> Glycopeptide identification and quantification
ms/ms data -> Data reduction -> ms/ms peaklist -> Peptide identification -> Peptide list
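The binning step in this workflow can be illustrated with fixed-width m/z bins. This is a generic sketch, not the project's actual data-reduction code: the bin width is arbitrary, the `bin_peaks` function is hypothetical, and the sample peaks are borrowed from the peaklist shown elsewhere in this talk.

```python
import math
from collections import defaultdict


def bin_peaks(peaks, bin_width=1.0):
    """Sum the intensities of (mz, intensity) peaks that share a fixed-width m/z bin."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[math.floor(mz / bin_width)] += intensity
    # Key each bin by its lower m/z edge.
    return {index * bin_width: total for index, total in sorted(bins.items())}


peaks = [(1543.7476, 1.3822), (1544.7595, 2.9977), (1562.8113, 37.4790)]
binned = bin_peaks(peaks, bin_width=5.0)
```

With a width of 5.0 the first two peaks fall into the bin starting at 1540.0 and their intensities are summed, while the third peak lands alone in the bin starting at 1560.0.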

Phase II: Ontology Population

• Populate ProPreO with all experimental datasets?

• Two levels of ontology population for ProPreO:
  – Level 1: Populate the ontology with instances that are stable across experimental runs. Example: human tryptic peptides, 40,000 instances in ProPreO
  – Level 2: Use URIs to point to the actual experimental datasets
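The two levels can be sketched as two separate stores: one for stable instances kept inside the ontology, and one for links from those instances to external datasets. Everything named here (the instance identifier, its properties, the `link_dataset` helper and the example.org URI) is hypothetical and only illustrates the separation.

```python
# Level 1: instances that are stable across runs live in the ontology itself.
ontology_instances = {
    "peptide:AAGK": {"type": "tryptic_peptide", "organism": "human"},  # made-up instance
}

# Level 2: experimental data stays outside the ontology and is referenced by URI.
dataset_links = {}


def link_dataset(instance_id, dataset_uri):
    """Attach a dataset URI to an existing Level 1 instance."""
    if instance_id not in ontology_instances:
        raise KeyError(f"unknown instance: {instance_id}")
    dataset_links.setdefault(instance_id, []).append(dataset_uri)


link_dataset("peptide:AAGK", "http://example.org/runs/run-001/peaklist.pkl")
```

The ontology then grows with the number of stable concepts rather than with the number of experimental runs.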

Ontology-mediated Proteomics Protocol

[Figure: protocol pipeline. Stages: Mass Spectrometer -> Conversion To PKL -> Preprocessing -> DB Search -> Post processing. File labels: RAW Results File, PKL Files (XML-based Format), Output (*.dat), DB Storing Output]

[Figure: ProPreO excerpt showing all values of the "produces ms-ms peaklist" property. Labels: Instrument, produces_ms-ms_peak_list, mass_spec_raw_data, Micromass_Q_TOF_ultima_quadrupole_time_of_flig, Micromass_Q_TOF_micro_quadrupole_time_of_f, light_ms_raw_data]

Semantic Annotation of Scientific Data

Precursor line (m/z, intensity, charge):
830.9570 194.9604 2

Fragment peaks (m/z, intensity):
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043

ms/ms peaklist data
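A peaklist of this shape is straightforward to read into a structured form before annotating it. The sketch below assumes a PKL-style layout in which the first line holds the precursor m/z, intensity and charge and every following line holds one fragment m/z and intensity; `parse_peaklist` is an illustrative helper, not part of any project tool.

```python
def parse_peaklist(text):
    """Parse a PKL-style block into (precursor dict, list of fragment tuples)."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    mz, intensity, charge = rows[0]
    precursor = {"mz": float(mz), "intensity": float(intensity), "charge": int(charge)}
    fragments = [(float(m), float(i)) for m, i in rows[1:]]
    return precursor, fragments


# Values copied from the slide.
PKL_TEXT = """
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
"""

precursor, fragments = parse_peaklist(PKL_TEXT)
```

The raw file carries no labels at all, which is why a reader (human or program) has to know the layout in advance; semantic annotation makes that knowledge explicit.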

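[Annotated XML listing; markup lost in extraction. Surviving values: 830.9570, 194.9604 and 2, followed by eight empty-element tags, matching the eight fragment peaks]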

Annotated ms/ms peaklist data
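The idea of an annotated peaklist can be sketched by wrapping each bare number in a named XML element. The element and attribute names below (`ms_ms_peak_list`, `parent_ion`, `peak`, `mz`, `abundance`, `charge`) are invented for illustration and are not the ProPreO annotation vocabulary; only the numeric values come from the slide.

```python
import xml.etree.ElementTree as ET


def annotate(precursor, fragments):
    """Build an XML tree that labels each value of a peaklist.

    `precursor` is (mz, abundance, charge); `fragments` is a list of
    (mz, abundance) pairs. All tag and attribute names are hypothetical.
    """
    mz, abundance, charge = precursor
    root = ET.Element("ms_ms_peak_list")
    ET.SubElement(
        root,
        "parent_ion",
        mz=f"{mz:.4f}",
        abundance=f"{abundance:.4f}",
        charge=str(charge),
    )
    for frag_mz, frag_abundance in fragments:
        ET.SubElement(root, "peak", mz=f"{frag_mz:.4f}", abundance=f"{frag_abundance:.4f}")
    return root


root = annotate((830.9570, 194.9604, 2), [(580.2985, 0.3592), (688.3214, 0.2526)])
xml_text = ET.tostring(root, encoding="unicode")
```

In a real annotation the element names would be tied to ontology concepts, so that tools can interpret the data without knowing the file layout.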


Service description using WSDL-S

Formalize description and classification of Web Services using ProPreO concepts

[WSDL-S listing; element tags lost in extraction. Surviving attributes, in order:]
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics"
xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl"
…..
xmlns="http://www.w3.org/2001/XMLSchema"
wssem:modelReference="ProPreO#peptide_sequence"

Callouts on the listing:
• data
• sequence
• Description of a Web Service using: Web Service Description Language
• Concepts defined in process Ontology
• ProPreO process Ontology
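The effect of a `wssem:modelReference` annotation can be shown by reading it back out of a service description. The WSDL fragment below is a hypothetical reconstruction: only the `wssem` and `ProPreO` namespace URIs and the `ProPreO#peptide_sequence` value come from the slide, while the message name, the part element and the `model_references` helper are assumptions.

```python
import xml.etree.ElementTree as ET

WSSEM = "http://www.ibm.com/xmlns/WebServices/WSSemantics"

# Hypothetical fragment; message and part names are invented for illustration.
WSDL_S = """
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics"
             xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl">
  <message name="exampleRequest" wssem:modelReference="ProPreO#peptide_sequence">
    <part name="in0" type="xsd:string"/>
  </message>
</definitions>
"""


def model_references(wsdl_text):
    """Map each annotated element's `name` to its wssem:modelReference value."""
    root = ET.fromstring(wsdl_text.strip())
    attr = f"{{{WSSEM}}}modelReference"
    return {
        element.get("name"): element.get(attr)
        for element in root.iter()
        if element.get(attr) is not None
    }


refs = model_references(WSDL_S)
```

A registry or composition tool could use such a mapping to find every service whose input is a peptide sequence in the ProPreO sense, regardless of how each service names its messages.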

Summary, Observations, Conclusions

• Ontology schema: relatively simple in business/industry, highly complex in science
• Ontology population: could have millions of assertions, or unique features when modeling complex life science domains
• Ontology population could be largely automated if access to high-quality, curated data and knowledge is available; it involves disambiguation, results in a richer representation than the extracted sources, and can be rule-based
• Ontology freshness and validation: not just schema correctness but also knowledge, that is, how well the ontology reflects the changing world
• Some applications: semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …

More information at

• http://lsdis.cs.uga.edu/projects/glycomics