Life Sciences: a case study for the Semantic Web

Professor Carole Goble, Information Management Group, University of Manchester, UK

Pioneers and incubators

• The Web -> Physics
  – well-organised microcosm of the general community.
  – definite and clearly articulated information dissemination needs.
  – smart, motivated people prepared to co-operate, and with the means and desire to do so.
• The Semantic Web -> Life Sciences

Why Life Sciences?

• Knowledge-based discipline
  – Collaborative history
  – Publication shift: articles -> data -> knowledge
  – Content with extensive metadata -> annotation & controlled vocabularies
  – Highly contextual, unstable and fuzzy
• In silico experiments
  – Information harvesting & PSE
  – Orchestrating resources -> workflow
  – Services that exploit enriched content
  – Support for scientific/research method = SW issues
  – Transparent collection of annotation

Why Life Sciences?

• Strong enthusiastic cohesive community
  – I3C use cases
  – Grass roots ontologies and annotation
  – Distributed annotation services
  – NEED for provenance, audit, security …
  – A chance of concrete articulation
  – Sanger, EBI & NCBI
  – ISCB

Disease Genetics & Pharmacogenomics

[Diagram labels: Hypotheses; Design; Data Capture (Clinical, Image/Signal, Genomic/Proteomic); Analysis; Integration; Clinical Resources; Individualised Medicine; Model & Analysis Libraries; Knowledge Repositories; Information Sources; Data Mining; Case-Base Reasoning; Information Fusion; Annotation / Knowledge Representation]

Cows to Proteins

• Jim Hendler-> how many cows in Texas?

Q: What ATPase superfamily proteins are found in mouse?

A:
1. P21958 (from Swiss-Prot)
2. InterPro is a pattern database and could tell you
3. Attwood’s lab expertise is in nucleotide binding proteins ….

Which compounds interact with (alpha adrenergic receptors) ((over expressed in (bladder epithelial cells)) but not (smooth muscle tissue)) of ((patients with urinary flow dysfunction) and a sensitivity to the (quinazoline family of compounds))?

Resources needed to answer this:
• Drug formulary
• High-throughput screening
• Expression database
• Chemical database
• Tissue database
• Clinical trials database
• Enzyme database
• Receptor database
• SNPs database

Webs of Knowledge

Interoperating e-Services

[Diagram: five service providers]

Interoperation is by hand or Perl scripts

But surely this is just all about querying and linking (lots of) databases?

Isn’t the information all computationally accessible already?

The document publishing navigation interface legacy

Navigation-based interaction

Identity

“Inaccessible” Descriptions

• Evolving
• Non predictive
• The structured part of the schema is open to change
• Hence flat file mark up’s prevalence
• XML is king.

Swiss-Prot Flat file

ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=86300093 [NCBI, ExPASy, Israel, Japan]; PubMed=3755672;
RA   Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H.,
RA   Prusiner S.B., Dearmond S.J.
RT   "Molecular cloning of a human prion protein cDNA.";
RL   DNA 5:315-324(1986).
RN   [6]
RP   STRUCTURE BY NMR OF 23-231.
RX   MEDLINE=97424376 [NCBI, ExPASy, Israel, Japan]; PubMed=9280298;
RA   Riek R., Hornemann S., Wider G., Glockshuber R., Wuethrich K.;
RT   "NMR characterization of the full-length recombinant murine prion
RT   protein, mPrP(23-231).";
RL   FEBS Lett. 413:282-288(1997).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND
CC       ANIMALS INFECTED WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION DISEASES, LIKE:
CC       CREUTZFELDT-JAKOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME
CC       (GSS), FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE IN
CC       SHEEP AND GOAT; BOVINE SPONGIFORM ENCEPHALOPATHY (BSE) IN CATTLE;
CC       TRANSMISSIBLE MINK ENCEPHALOPATHY (TME); CHRONIC WASTING DISEASE
CC       (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM ENCEPHALOPATHY (FSE)
CC       IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY(EUE) IN NYALA AND
CC       GREATER KUDU. THE PRION DISEASES ILLUSTRATE THREE MANIFESTATIONS OF
CC       CNS DEGENERATION: (1) INFECTIOUS (2) SPORADIC AND (3) DOMINANTLY
CC       INHERITED FORMS. TME, CWD, BSE, FSE, EUE ARE ALL THOUGHT TO OCCUR
CC       AFTER CONSUMPTION OF PRION-INFECTED FOODSTUFFS.
CC   -!- SIMILARITY: BELONGS TO THE PRION FAMILY.
DR   HSSP; P04925; 1AG2. [HSSP ENTRY / SWISS-3DIMAGE / PDB]
DR   MIM; 176640; -. [NCBI / EBI]
DR   InterPro; IPR000817; -.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal; Polymorphism;
KW   Disease mutation.
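To make the point about flat-file markup concrete, here is a minimal Python sketch of how a script might pull structure out of the line-coded entry above. It assumes only that each line starts with a two-letter line code followed by content; the function names (`parse_flat_file`, `cross_references`, `keywords`) are illustrative and do not belong to any existing library, and the sample is a handful of lines copied from the PRIO_HUMAN entry.

```python
from collections import defaultdict

# A few lines copied from the PRIO_HUMAN entry; each line starts with a
# two-letter line code followed by the field content.
SAMPLE = """\
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606;
DR   InterPro; IPR000817; -.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal; Polymorphism;
KW   Disease mutation.
"""


def parse_flat_file(text):
    """Group the content of every line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2:
            continue  # skip blank lines
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return dict(fields)


def cross_references(fields):
    """Turn DR lines into (database, primary identifier) pairs."""
    refs = []
    for dr in fields.get("DR", []):
        parts = [p.strip() for p in dr.rstrip(".").split(";")]
        refs.append((parts[0], parts[1]))
    return refs


def keywords(fields):
    """Join wrapped KW lines and split them into individual keywords."""
    joined = " ".join(fields.get("KW", []))
    return [k.strip() for k in joined.rstrip(".").split(";") if k.strip()]


entry = parse_flat_file(SAMPLE)
print(entry["AC"])
print(cross_references(entry))
print(keywords(entry))
```

The sketch recovers the cross-references and keywords, which are already structured. The free-text CC comments would still need information extraction, which is the gap the slides go on to discuss.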

Literature holds knowledge

Consequence -> information extraction big business & metadata is required.

Community-wide markup Annotation and Curation

Database sizes (entries):
• Expressed Sequence Tags: millions
• nrdb: 503,479
• TrEMBL: 234,059
• Swiss-Prot: 85,661
• InterPro: 2990
• PRINTS: 1310

Annotation: “the elucidation and description of biologically relevant features”
• Computationally formed – e.g. cross references to other database entries, date collected;
• Intellectually formed – the accumulated knowledge of an expert distilling the aggregated information drawn from multiple data sources and analyses, and the annotator's knowledge.

Swiss-Prot Annotation

[This slide repeats the PRIO_HUMAN (P04156) Swiss-Prot entry shown above.]

PRINTS Annotation

gc; PRION
gx; PR00341
gt; Prion protein signature
gp; INTERPRO; IPR000817
gp; PROSITE; PS00291 PRION_1; PS00706 PRION_2
gp; BLOCKS; BL00291
gp; PFAM; PF00377 prion
bb;
gr; 1. STAHL, N. AND PRUSINER, S.B.
gr; Prions and prion proteins.
gr; FASEB J. 5 2799-2807 (1991).
gr;
gr; 2. BRUNORI, M., CHIARA SILVESTRINI, M. AND POCCHIARI, M.
gr; The scrapie agent and the prion hypothesis.
gr; TRENDS BIOCHEM.SCI. 13 309-313 (1988).
gr;
gr; 3. PRUSINER, S.B.
gr; Scrapie prions.
gr; ANNU.REV.MICROBIOL. 43 345-374 (1989).
bb;

Prion protein (PrP) is a small glycoprotein found in high quantity in the brain of animals infected with certain degenerative neurological diseases, such as sheep scrapie and bovine spongiform encephalopathy (BSE), and the human dementias Creutzfeldt-Jacob disease (CJD) and Gerstmann-Straussler syndrome (GSS). PrP is encoded in the host genome and is expressed both in normal and infected cells. During infection, however, the PrP molecules become altered and polymerise, yielding fibrils of modified PrP protein.

PrP molecules have been found on the outer surface of plasma membranes of nerve cells, to which they are anchored through a covalent-linked glycolipid, suggesting a role as a membrane receptor. PrP is also expressed in other tissues, indicating that it may have different functions depending on its location. The primary sequences of PrP's from different sources are highly similar: all bear an N-terminal domain containing multiple tandem repeats of a Pro/Gly rich octapeptide; sites of Asn-linked glycosylation; an essential disulphide bond; and 3 hydrophobic segments. These sequences show some similarity to a chicken glycoprotein, thought to be an acetylcholine receptor-inducing activity (ARIA) molecule. It has been suggested that changes in the octapeptide repeat region may indicate a predisposition to disease, but it is not known for certain whether the repeat can meaningfully be used as a fingerprint to indicate susceptibility.

PRION is an 8-element fingerprint that provides a signature for the prion proteins. The fingerprint was derived from an initial alignment of 5 sequences: the motifs were drawn from conserved regions spanning virtually the full alignment length, including the 3 hydrophobic domains and the octapeptide repeats (WGQPHGGG). Two iterations on OWL18.0 were required to reach convergence, at which point a true set comprising 9 sequences was identified. Several partial matches were also found: these include a fragment (PRIO_RAT) lacking part of the sequence bearing the first motif, and the PrP homologue found in chicken - this matches well with only 2 of the 3 hydrophobic motifs (1 and 5) and one of the other conserved regions (6), but has an N-terminal signature based on a sextapeptide repeat (YPHNPG) rather than the characteristic PrP octapeptide.

The “Annotation Workflow”

[Diagram: EMBL, TrEMBL, Swiss-Prot, PRINTS and GPCRDB connected by analysis steps]

In silico experiments

• Nicola: Domain; Task; Events ontologies
• Simon: Support of research itself

In silico experiments

• Resource discovery, interoperation, fusion, sharing, finding, filtering
• Work flows
• Science is dynamic – change propagation
• Problem Solving Environments
• Collaborative and dynamic virtual organisations

Annotating the annotations

• Transparent annotation by side effect
• Provenance, Trust, Authentication
• Audit
• Versioning, roll-backs and snap shots
• Confidentiality
• Credit – digital signatures
• Authorisation & security …
• Automated side effects as part of the PSE
• All potentials for Semantic Web Markup
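One way to picture "annotation by side effect" is an analysis step that records who ran it, when, and on what, without the scientist doing anything extra. The Python sketch below is only an illustration under stated assumptions: the in-memory `PROVENANCE_LOG`, the `with_provenance` decorator and the toy `gc_content` step are all hypothetical names, not part of any PSE or of myGrid.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical in-memory provenance store; a real environment would persist this.
PROVENANCE_LOG = []


def digest(value):
    """Short, stable fingerprint of a JSON-serialisable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def with_provenance(version):
    """Wrap an analysis step so that each call logs a provenance record."""
    def decorator(func):
        def wrapper(*args, user="unknown"):
            result = func(*args)
            PROVENANCE_LOG.append({
                "step": func.__name__,
                "version": version,
                "user": user,
                "when": datetime.now(timezone.utc).isoformat(),
                "inputs": [digest(a) for a in args],
                "output": digest(result),
            })
            return result
        return wrapper
    return decorator


@with_provenance(version="0.1")
def gc_content(sequence):
    """Toy analysis step: fraction of G and C in a nucleotide sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


value = gc_content("ATGCGC", user="alice")
print(value)
print(PROVENANCE_LOG[-1]["step"], PROVENANCE_LOG[-1]["user"])
```

The record carries the step name, version, user, timestamp and fingerprints of inputs and output, which covers audit and versioning in the simplest sense. Trust, authentication and digital signatures would need real infrastructure that this sketch does not attempt.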

Not just data and tools…

Teams Laboratories Repositories People

Problem Space

• Ability to store and retrieve huge volumes of information
• Ability to capture, enrich, classify, publish and structure knowledge about
  – Domains
  – Organisations
  – Individuals
  – Research Collaborations
  – Experiments
  – Results
  – Services

Share info -> share meaning

[Diagram: five service providers]

Ontologies are big news

• Gene Ontology
  – Marking up annotation of major databases
  – Identity, Linking databases together
  – Classification/index framework for instances & results
  – It is sloppy but it is used by everybody!
  – Gene Ontology -> DAML+OIL -> inference!
• http://www.geneontology.org
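The simplest inference a term hierarchy supports is transitive "is-a" reasoning: a protein annotated with a specific term is implicitly annotated with every more general term. The Python sketch below illustrates only that idea. The term labels and protein names are invented for illustration and are not taken from the Gene Ontology, and this is not a DAML+OIL reasoner, which would support far richer description-logic inference.

```python
# Hypothetical mini-hierarchy: child term -> list of parent terms ("is-a").
# The labels are illustrative and are not taken from the Gene Ontology.
IS_A = {
    "ATPase activity": ["hydrolase activity"],
    "hydrolase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "nucleotide binding": ["binding"],
    "binding": ["molecular function"],
}

# Hypothetical annotations: protein -> terms assigned directly by a curator.
ANNOTATIONS = {
    "protein_A": {"ATPase activity"},
    "protein_B": {"nucleotide binding"},
}


def ancestors(term, is_a):
    """Return every term reachable by following is-a links upward."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def annotated_with(term, annotations, is_a):
    """Proteins annotated with the term directly or via a more specific term."""
    hits = []
    for protein, terms in annotations.items():
        implied = set(terms)
        for t in terms:
            implied |= ancestors(t, is_a)
        if term in implied:
            hits.append(protein)
    return sorted(hits)


print(annotated_with("catalytic activity", ANNOTATIONS, IS_A))
print(annotated_with("molecular function", ANNOTATIONS, IS_A))
```

Neither protein was annotated with "catalytic activity" or "molecular function" directly; both answers are derived from the hierarchy, which is the sense in which a shared ontology lets a query find data that was never explicitly stored.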

BioOntology Consortium

• 150 people attended the last BOC meeting
• GSK and BOC mandated DAML+OIL
• Plethora of other ontologies
  – Bioinformatics: many ontologies but under control
  – Medical informatics: tons of ontologies, out of control
• Representing the natural world is tough!!
  – Sufficiency conditions …

[Diagram labels: Functional genomics; Tissue; Disease; Structural Genomics; Population Genetics; Genome sequence; Clinical Data; Clinical trial]

• Data resources have been built introspectively for human researchers
• Information is machine readable not machine understandable
• Sharing vocabulary is a step towards unification

“The technical advantages of knowledge modeling are obvious. Knowledge bases can be automatically checked for consistency; they support inference mechanisms which derive data which have not been explicitly stored; they also offer extensive request and navigation facilities. However, the most immediate benefit of knowledge base design lies in the modeling process itself, through the effort of explication, organization and structuration [sic] of the knowledge it requires.” Editorial: Bioinformatics, July 2000

Quality & Stability

• Open Knowledge & transparency
• Data quality
• Inconsistency, incompleteness
• Provenance
• Contamination, noise, experimental rigour
• Data irregularity
• Evolution, Audit, Versioning

“… the problem in the field is not a lack of good integrating software, Smith says. The packages usually end up leading back to public databases. "The problem is: the databases are God-awful," he told BioMedNet. If the data is still fundamentally flawed, then better algorithms add little.” Temple Smith, director of the Molecular Engineering Research Center at Boston University, BioMedNet 2000

Supporting Science

• All the great stuff Simon talked about
• Information is contextual
• Personalisation
  – My view of a metabolic pathway
  – My experimental process flows
• Science is not linear
  – What did we know then
  – What do we know now
• Longevity of data
  – It has to be available in 50 years' time.
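The "what did we know then / what do we know now" requirement amounts to keeping every version of an annotation and being able to ask for the state as of a given date. A minimal Python sketch follows, assuming an append-only store keyed by ISO-format date strings. The class name `VersionedAnnotations`, the key and the recorded values are all invented for illustration.

```python
import bisect


class VersionedAnnotations:
    """Append-only store: nothing is overwritten, so past states stay queryable."""

    def __init__(self):
        self._history = {}  # key -> (sorted list of dates, matching list of values)

    def record(self, key, date, value):
        """Add a value that became known on the given ISO date (YYYY-MM-DD)."""
        dates, values = self._history.setdefault(key, ([], []))
        index = bisect.bisect_right(dates, date)
        dates.insert(index, date)
        values.insert(index, value)

    def as_of(self, key, date):
        """Latest value recorded on or before the given date, or None."""
        dates, values = self._history.get(key, ([], []))
        index = bisect.bisect_right(dates, date)
        return values[index - 1] if index else None


# Invented example data: how the recorded function of a protein changed.
store = VersionedAnnotations()
store.record("protein_X/function", "1995-01-01", "unknown")
store.record("protein_X/function", "2001-06-01", "putative membrane receptor")

print(store.as_of("protein_X/function", "1998-03-15"))  # what did we know then
print(store.as_of("protein_X/function", "2002-01-01"))  # what do we know now
print(store.as_of("protein_X/function", "1990-01-01"))  # before any record
```

ISO date strings sort in chronological order, so plain string comparison is enough here. A real system would also have to version the analyses and ontologies that produced each value.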

The Grid

• Large scale distributed data management
• Large scale distributed computation
• High speed communications
• Dynamic collaborative virtual organisations
• UK Govt £120 million
• http://www.gridform.org

Eating our own dog food myGrid

• UK research council funded e-Science Project
• Start 1st October for 36-42 months
• 19 FTEs
• £3.4 million
• 6 academic partners, 8 commercial
• Web Services + Semantic Web + Grid
• http://www.mygrid.org.uk

myGrid Objectives

• Straightforward discovery, interoperation, sharing
  – information AND processes AND best practice
• Improving quality of both experiments and data
  – provenance through information <-> process linkage
  – propagating change
• Individual creativity & collaborative working
• Enabling genomic level bioinformatics

myGrid Technologies

• Database access from the Grid
• Process enactment on the Grid
• Personalisation services
• Metadata services & Ontologies: DAML+OIL !!
• Laying the foundations for:
  – Agent Services
  – Collaboration Environments
  – Service composition
• Ontologies, Protocols & APIs
• Grid + Services + Semantic Web

“Bioinformatics is a knowledge-based discipline. Many predictions, and interpretations, of data in biology are made by comparing the data in hand against existing knowledge” Dr. Andy Brass, ad nauseam

• Analogy/knowledge-based rather than axiom-based

Remarks

• Semantic Web literacy in biology weak
• Grid literacy in biology strong
• Biology loves XML and ignores RDF
  – Annotations sit in other (non RDF) databases.
• Role of (legacy) databases and semantic web markup
  – Lots of metadata already in databases
  – Will we really mark up every database instance?
  – Exporting results as RDF
  – Using inference over results of queries
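"Exporting results as RDF" can be sketched without marking up the database itself: rows returned by a query are turned into triples on the way out. The Python example below writes N-Triples by hand. The `http://example.org/bio/` namespace and the property names (`entryName`, `organism`) are assumptions made for illustration, not an existing vocabulary; the row values are taken from the Swiss-Prot entry shown earlier.

```python
# Hypothetical namespace; example.org is reserved for illustration.
EX = "http://example.org/bio/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Rows as they might come back from a relational query over a legacy database.
ROWS = [
    {"accession": "P04156", "name": "PRIO_HUMAN", "organism": "Homo sapiens"},
]


def uri(value):
    """Wrap an absolute IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def row_to_triples(row):
    """Yield (subject, predicate, object) terms describing one result row."""
    subject = uri(EX + "protein/" + row["accession"])
    yield (subject, uri(RDF_TYPE), uri(EX + "Protein"))
    yield (subject, uri(EX + "entryName"), literal(row["name"]))
    yield (subject, uri(EX + "organism"), literal(row["organism"]))


def to_ntriples(rows):
    """Serialise all rows, one triple per line, each terminated by ' .'."""
    lines = []
    for row in rows:
        for s, p, o in row_to_triples(row):
            lines.append(f"{s} {p} {o} .")
    return "\n".join(lines)


ntriples = to_ntriples(ROWS)
print(ntriples)
```

Because only query results are converted, the legacy database keeps its own schema, and inference can then run over the exported triples rather than over every stored instance.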

Remarks

• Change management
  – What did we know then?
• Custodianship, guardianship, longevity…
• Performance, robustness, scale.
• Tools & easy to use environments
• Demonstrators

How does this bit fit?

?