Transcript Semantics Enabled Industrial and Scientific Applications
Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications
Keynote - the First Online Metadata and Semantics Research Conference http://www.metadata-semantics.org
November 23, 2005
Amit Sheth LSDIS Lab, Department of Computer Science, University of Georgia http://lsdis.cs.uga.edu
Acknowledgement: NCRR-funded Bioinformatics of Glycan Expression project; collaborators and partners at CCRC (Dr. William S. York); and Satya S. Sahoo, Christopher Thomas, Cartic Ramakrishnan.
Computation, data and semantics in life sciences
• “The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999
• “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000
• “Biological research is going to move from being hypothesis-driven to being data-driven.” Robert Robbins
• “We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb
We will show how semantics is a key enabler for achieving the above predictions and visions.
Bioinformatics Apps & Ontologies
• GlycO: a domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolisms of glycans)
  – Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy
  – URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco
• ProPreO: a comprehensive process ontology modeling experimental proteomics
  – Contains 330 classes, 40,000+ instances
  – Models three phases of experimental proteomics: separation techniques, mass spectrometry and data analysis
  – URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo
• Automatic semantic annotation of high-throughput experimental data (in progress)
• Semantic Web Process with WSDL-S for semantic annotations of Web Services
  – http://lsdis.cs.uga.edu -> Glycomics project (funded by NCRR)
GlycO – A domain ontology for glycans
Structural modeling and population challenges in GlycO
• Extremely large number of glycans occurring in nature
• But frequently there are only small differences in structural properties
• Modeling all possible glycans would involve a significant number of redundant classes
• Redundancy results in often fatal complexities in maintenance and upgrade
• Population
  – Manual
  – Extraction and integration from external knowledge sources
  – GlycoTree: exploiting structural composition rules
Ontology population workflow
GlycoTree Takahashi, Kato 2003
GlycoTree – A Canonical Representation of N-Glycans
β-D-GlcpNAc-(1-6)+
β-D-GlcpNAc-(1-2) α-D-Manp-(1-6)+
β-D-Manp-(1-4) β-D-GlcpNAc-(1-4) β-D-GlcpNAc
β-D-GlcpNAc-(1-4) α-D-Manp-(1-3)+
β-D-GlcpNAc-(1-2)+
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251
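The canonical-tree idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names and the small tree fragment below are hypothetical and are not actual GlycoTree identifiers. The point is that if every N-glycan is a subtree of one canonical tree, a glycan can be stored as a set of canonical node IDs instead of a new, largely redundant class per structure.

```python
# Hypothetical fragment of a canonical tree, stored as child -> parent.
# Node names are illustrative only, not actual GlycoTree identifiers.
CANONICAL_PARENT = {
    "GlcNAc_1": None,         # root residue
    "GlcNAc_2": "GlcNAc_1",
    "Man_3": "GlcNAc_2",
    "Man_4": "Man_3",         # one branch off the central mannose
    "Man_4p": "Man_3",        # the other branch
    "GlcNAc_5": "Man_4",
    "GlcNAc_5p": "Man_4p",
}


def is_valid_glycan(nodes):
    """Return True if the node set is a connected subtree containing the root.

    Composition rule used here: the set is non-empty, every node is known,
    and every non-root node has its canonical parent in the set.
    """
    nodes = set(nodes)
    if not nodes:
        return False
    for node in nodes:
        if node not in CANONICAL_PARENT:
            return False
        parent = CANONICAL_PARENT[node]
        if parent is not None and parent not in nodes:
            return False
    return True


def shared_residues(glycan_a, glycan_b):
    """Residues two glycans have in common, by canonical position."""
    return set(glycan_a) & set(glycan_b)


core = {"GlcNAc_1", "GlcNAc_2", "Man_3", "Man_4", "Man_4p"}
extended = core | {"GlcNAc_5"}
broken = {"GlcNAc_1", "Man_3"}  # skips GlcNAc_2, so it violates the rule
```

With this representation, two glycans that differ by one residue share all other nodes, which is the redundancy saving the slide refers to.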
Beyond expressiveness afforded in OWL
• Probabilistic
• more
Example: Mass spectrometry analysis
Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.
Goldberg, et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
Mass Spectrometry Experiment
Each m/z value in mass spec diagrams can stand for many different structures (uncertainty with respect to the structure that corresponds to a peak)
• Different linkage
• Different bond
• Different isobaric structures
Very subtle differences
CBank: 16155 (honeybee venom)
CBank: 16154 (bovine)
• Peak at 1219.1
• Same molecular composition
• One diverging link
• Found in different organisms
• Background knowledge (found in honeybee venom or bovine cells) can resolve the uncertainty
These are core-fucosylated high-mannose glycans
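A minimal sketch of how background knowledge could narrow the candidates for a peak. The two CBank entries and the peak value come from the slide; the `Candidate` type, the `candidates_for_peak` function and the 0.5 tolerance are assumptions made for illustration, not part of any tool described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    cbank_id: int
    organism: str
    peak_mz: float


# The two structures from the slide that share the peak at 1219.1.
CANDIDATES = [
    Candidate(16155, "honeybee venom", 1219.1),
    Candidate(16154, "bovine", 1219.1),
]


def candidates_for_peak(mz, tolerance=0.5, organism=None):
    """Return structures matching a peak, optionally filtered by sample origin.

    The tolerance value is an arbitrary illustrative choice.
    """
    hits = [c for c in CANDIDATES if abs(c.peak_mz - mz) <= tolerance]
    if organism is not None:
        hits = [c for c in hits if c.organism == organism]
    return hits


ambiguous = candidates_for_peak(1219.1)                    # both structures match
resolved = candidates_for_peak(1219.1, organism="bovine")  # origin resolves it
```

Without the organism, the peak is ambiguous between the two entries; knowing the sample origin leaves a single candidate.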
Even in the same organism
CBank: 21821
Different enzymes lead to these linkages
CBank: 21982
• Both glycans found in bovine cells
• Both have a mass of 3425.11
• Same composition
• Different linkage
• Since expression levels of different genes can be measured in the cell, we can get the probability of each structure in the sample
Model 1: associate probability as part of Semantic Annotation
• Annotate the mass spec diagram with all possibilities and assign probabilities according to the scientist’s or tool’s best knowledge
P(S | M = 3461.57) = 0.6
P(T | M = 3461.57) = 0.4
Goldberg, et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
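Model 1 can be sketched as a small annotation record per peak. In this illustration the structure labels S and T and the peak value 3461.57 are taken from the slide; the function name `annotate_peak` and the raw weights are hypothetical. The weights stand in for whatever evidence a scientist or tool has (for example, relative enzyme expression levels), and normalizing them to sum to 1 is an assumption of this sketch rather than the method used in the project.

```python
def annotate_peak(mz, weights):
    """Attach a probability to every candidate structure for one peak.

    `weights` maps a structure label to a non-negative score; the scores
    are normalized so that the probabilities sum to 1.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {
        "mz": mz,
        "candidates": {name: w / total for name, w in weights.items()},
    }


# Hypothetical raw scores chosen to reproduce the 0.6 / 0.4 split on the slide.
annotation = annotate_peak(3461.57, {"S": 3.0, "T": 2.0})
```

Keeping all candidates with their probabilities, rather than picking one, preserves the uncertainty for later analysis steps.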
Model 2: Probability in ontological representation of Glycan structure
• Build a generalized probabilistic glycan structure that embodies several possible glycans
N-Glycosylation Process (NGP)
[Workflow diagram]
Sample preparation: Cell Culture -> extract -> Glycoprotein Fraction -> proteolysis -> Glycopeptides Fraction (1) -> Separation technique I -> Glycopeptides Fraction (n) -> PNGase -> Peptide Fraction (n) -> Separation technique II -> Peptide Fraction (n*m) -> Mass spectrometry
MS branch: ms data -> Data reduction -> ms peaklist -> binning -> N-dimensional array -> Signal integration
MS/MS branch: ms/ms data -> Data reduction -> ms/ms peaklist -> Peptide identification -> Peptide list
Final steps: Data correlation; Glycopeptide identification and quantification
Phase II: Ontology Population
Populate ProPreO with all experimental datasets?
Two levels of ontology population for ProPreO:
Level 1: Populate the ontology with instances that are stable across experimental runs
Ex: Human tryptic peptides – 40,000 instances in ProPreO
Level 2: Use of URIs to point to actual experimental datasets
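The two levels can be sketched as two separate stores. This is a minimal illustration, not ProPreO's actual implementation: the `Ontology` class, the class names `tryptic_peptide` and `ms_ms_peak_list`, the peptide string and the example.org URI are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # Level 1: stable instances stored in the ontology, keyed by class name.
    instances: dict = field(default_factory=dict)
    # Level 2: URIs pointing at experimental datasets kept outside the ontology.
    dataset_uris: dict = field(default_factory=dict)

    def add_instance(self, cls, name):
        """Level 1: add an instance that does not change between runs."""
        self.instances.setdefault(cls, set()).add(name)

    def link_dataset(self, cls, uri):
        """Level 2: record only a pointer to run-specific data."""
        self.dataset_uris.setdefault(cls, []).append(uri)


onto = Ontology()
# Hypothetical peptide sequence, stable across experimental runs.
onto.add_instance("tryptic_peptide", "PEPTIDEK")
# Hypothetical dataset location; the data itself stays outside the ontology.
onto.link_dataset("ms_ms_peak_list", "http://example.org/runs/001.pkl")
```

The design choice is that the ontology grows only with reusable knowledge, while the volume of per-run data is referenced rather than copied.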
Ontology-mediated Proteomics Protocol
[Protocol diagram]
Pipeline stages: Mass Spectrometer -> Conversion To PKL -> Preprocessing -> DB Search -> Post processing
Files and stores: RAW Results File; PKL Files (XML-based Format); Output (*.dat); DB Storing Output
ProPreO annotation: Instrument produces_ms-ms_peak_list mass_spec_raw_data (all values of the "produces ms-ms peaklist" property)
Values shown: Micromass_Q_TOF_ultima_quadrupole_time_of_flig, Micromass_Q_TOF_micro_quadrupole_time_of_flight_ms_raw_data
Semantic Annotation of Scientific Data
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
ms/ms peaklist data
Annotated ms/ms peaklist data
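A rough sketch of what annotating such a peaklist could look like, using Python's standard `xml.etree.ElementTree`. Several things here are assumptions: the first row is read as parent ion m/z, abundance and charge, with each following row as a peak's m/z and abundance; and the element and attribute names (`ms_ms_peak_list`, `parent_ion`, `mass_spec_peak`, `mz`, `abundance`, `z`) are illustrative stand-ins rather than the markup actually used in the talk.

```python
import xml.etree.ElementTree as ET

# Peaklist values as shown on the slide.
RAW = """830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043"""


def annotate_peaklist(raw):
    """Wrap each raw number in an element that names what it means."""
    rows = [line.split() for line in raw.strip().splitlines()]
    # Assumed layout of the header row: parent m/z, abundance, charge.
    parent_mz, abundance, charge = rows[0]
    root = ET.Element("ms_ms_peak_list")
    ET.SubElement(root, "parent_ion", mz=parent_mz, abundance=abundance, z=charge)
    # Assumed layout of every other row: peak m/z, abundance.
    for mz, peak_abundance in rows[1:]:
        ET.SubElement(root, "mass_spec_peak", mz=mz, abundance=peak_abundance)
    return root


annotated = annotate_peaklist(RAW)
xml_text = ET.tostring(annotated, encoding="unicode")
```

In a real setting the element names would be tied to ontology concepts, so that software can tell a parent ion mass from a fragment peak without relying on column position.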
Service description using WSDL-S
Formalize description and classification of Web Services using ProPreO concepts
WSDL-S listing (surviving fragments): xmlns:xsd="http://www.w3.org/2001/XMLSchema", xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics"
Callouts: Web Service Description Language (WSDL); concepts defined in the process ontology; ProPreO process ontology
Summary, Observations, Conclusions
• Ontology schema: relatively simple in business/industry, highly complex in science
• Ontology population: could have millions of assertions, or unique features when modeling complex life science domains
• Ontology population could be largely automated if access to high-quality, curated data/knowledge is available; it involves disambiguation and rule-based population, and results in a richer representation than the extracted sources
• Ontology freshness (and validation: not just schema correctness but knowledge, i.e., how it reflects the changing world)
• Some applications: semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …
More information at http://lsdis.cs.uga.edu/projects/glycomics