Provenance in myGrid
Jun Zhao
School of Computer Science
The University of Manchester, U.K.
21 October, 2004
Outline
• myGrid
• Motivation
• Challenges
• myGrid approach
• Related work
• Conclusions
myGrid Project
• http://www.mygrid.co.uk
• A pilot e-Science project in the U.K.;
• Targeted at biologists and bioinformaticians;
• Three bio-test beds:
• Providing middleware services in a Grid environment, which are orchestrated as workflows;
e-Science in silico Experiments (workflows)
• Automate the process of experiments;
• Orchestrate distributed resources and Web/Grid services;
• Transparent, seamless access to remote data and computation resources
• Increase the collaboration and results sharing across multi-scale communities
[Figure: experiment lifecycle. Stages: forming experiments; discovering and reusing experiments and resources; personalisation; executing and monitoring experiments; sharing services & experiments; managing lifecycle, provenance and results of experiments. Labelled component: Soaplab.]
Problems when doing in silico experiments
Scientists perform experiments repeatedly, at different sites, at different times, as different users or groups;
• A large repository of zipped records about experiments!!
• frequently updated resources;
• volatile, distributed environment
PROVENANCE
• verification of data;
• “recipes” for experiment designs;
• explanation for the impact of changes;
• ownership;
• performance of services;
• data quality;
Provenance Forms
• Derivations
– A workflow log.
– Linking items, in a directed graph.
– when, who, how, which, what, where
– Execution Process-centric
• Annotations
– Attached to items or collections of items, in a structured, semi-structured or free text form.
– Annotations on one item or linking items.
– why, when, where, who, what, how.
– Data-centric
[Figure: example derivation graph whose nodes carry parameter sets such as mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000, event = 8, plot = 1.]
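The derivation form of provenance, items linked in a directed graph, can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the item names and the `derived_from` mapping are hypothetical and are not taken from myGrid.

```python
from collections import deque

# Hypothetical derivation links: each item maps to the items it was derived from.
derived_from = {
    "blast_report": ["query_sequence"],
    "alignment": ["blast_report"],
    "plot": ["alignment", "query_sequence"],
}

def lineage(item, links):
    """Return every ancestor of `item`, walking derivation links breadth-first."""
    seen = []
    queue = deque(links.get(item, []))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.append(parent)
            queue.extend(links.get(parent, []))
    return seen

print(lineage("plot", derived_from))
```

Walking the graph backwards from a result answers the "which" and "how" questions for that result; annotations would then attach the "who" and "why" to each node.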
Challenges
• Cross-referencing across runs and within experiments;
• Provenance of *good* metadata annotation
• Bridging provenance islands
• Moreover….
Challenges: Complex cross-referencing information
• Complex control flow
• Iterative data and process flow
• Repetitive running producing cross-referencing information
• Human interaction activities vs. service invocations
• Service failure and experiment re-composition
[Figure labels: experiment design file; revised experiment; state controls; iterative service; experiment run with interactions; experiment run with failures.]
Challenges
• Annotations:
– Mandatory / automatic
– Who did that
– How much should be
trusted
– Security control
– Authenticity validation
– Quality
– Cross-referencing
– Versioning
Diverse information
Challenges: provenance islands
[Figure labels: Service 1; Service 2; Workflow 1; Experimental Investigation 1; Data 1; diverse metadata of information.]
Moreover
• Intellectual property
• Preservation
• Archiving
• Query and access
• Integration
• Investigation
• Impact analysis
• ……
myGrid Approach
• Taverna workflow workbench
– Provenance plug-in;
– mIR (myGrid Information Repository) plug-in;
• myGrid information model
– Based on CCLRC scientific metadata model
– Providing shared model for services and components interactions
• Semantic Web technologies
– RDF (Resource Description Framework)
– Ontologies
• LSIDs and URNs
http://taverna.sourceforge.net
http://freefluo.sourceforge.net
B. Matthews and S. Sufi: The CLRC Scientific Metadata Model, version 1, DL TR 02001, CLRC, February 2001
RDF in a Nutshell
• Resource Description Framework
• Common model for metadata
• A graph of triples
• <subject, predicate, object>
• RDQL, repositories, integration tools, presentation tools
• Jena, Haystack
http://www.w3.org/RDF/
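To make the "graph of triples" idea concrete, here is a minimal Python sketch of an in-memory triple store with wildcard matching. It uses plain tuples rather than a real RDF toolkit such as Jena, and every identifier in it (the `ex:` names, `add`, `match`) is a hypothetical chosen for illustration.

```python
# Minimal in-memory triple store. All identifiers below are illustrative.
triples = set()

def add(subject, predicate, obj):
    """Record one <subject, predicate, object> statement."""
    triples.add((subject, predicate, obj))

def match(subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None matches anything."""
    pattern = (subject, predicate, obj)
    return sorted(
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    )

add("ex:run42", "ex:runBy", "ex:alice")
add("ex:run42", "ex:hasOutput", "ex:blast_report")
add("ex:run43", "ex:runBy", "ex:alice")

# Which runs were performed by alice?
print(match(predicate="ex:runBy", obj="ex:alice"))
```

Because every statement has the same three-part shape, new kinds of metadata can be added without changing a schema, which is the property that makes the model attractive for heterogeneous provenance.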
[Figure: provenance model in three levels: organisation level provenance, process level provenance, and data/knowledge level provenance.]
Nodes: Person, Organisation, project, Experiment design, Workflow design, Workflow run, Process, Service, Event, Data, Blast Result, DNA sequence.
Links: runBy, run for, partOf, componentProcess, componentEvent, instanceOf, hasInput, hasOutput, subClass.
Examples shown: BLAST @ NCBI; web service invocation of BLAST @ NCBI; completion of a web service invocation at 12.04pm.
Knowledge statements: the user can add templates to each workflow process to determine knowledge links between data items, e.g. "similar protein sequence to".
Data derivation: e.g. output data derived from input data.
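The process level of the model can be sketched as a workflow run that records one event per service invocation, each with its inputs and outputs. This is a rough illustration only: the class names, fields, and the way `hasInput`/`hasOutput` are attached to events are assumptions, and the workflow and data names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One service invocation (assumed to carry hasInput / hasOutput links)."""
    service: str
    inputs: list
    outputs: list

@dataclass
class WorkflowRun:
    """One execution of a workflow design (hypothetical structure)."""
    design: str
    run_by: str
    events: list = field(default_factory=list)

    def record(self, service, inputs, outputs):
        """Append an event describing a completed service invocation."""
        event = Event(service, list(inputs), list(outputs))
        self.events.append(event)
        return event

    def producer_of(self, data_id):
        """Return the first event that output `data_id`, or None."""
        for event in self.events:
            if data_id in event.outputs:
                return event
        return None

run = WorkflowRun(design="find_similar_proteins", run_by="alice")
run.record("BLAST @ NCBI", inputs=["seq1"], outputs=["blast1"])
run.record("render_report", inputs=["blast1"], outputs=["report1"])

print(run.producer_of("blast1").service)
```

Keeping the run, its events, and their data identifiers together is what allows a data item to be traced back to the invocation that produced it.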
Representing links
http://www.mygrid.org.uk/ontology#derived_from
urn:lsid:taverna.sf.net:datathing:45fg6
urn:lsid:taverna.sf.net:datathing:23ty3
• Identify link type
– Again use URI
– Allows us to use RDF infrastructure
• Repositories
• Ontologies
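A link of this kind can be written as a single triple whose predicate is the link-type URI and whose subject and object are LSIDs. The Python sketch below assumes the usual LSID layout `urn:lsid:<authority>:<namespace>:<object>[:<revision>]`; the helper names `parse_lsid` and `link` are hypothetical, and the direction of the link (which data item is the subject) is an assumption, since the slide does not state it.

```python
DERIVED_FROM = "http://www.mygrid.org.uk/ontology#derived_from"

def parse_lsid(lsid):
    """Split an LSID into its parts, raising ValueError if it is malformed."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) == 6 else None,
    }

def link(subject, predicate, obj):
    """Build a triple after checking that both ends are well-formed LSIDs."""
    parse_lsid(subject)
    parse_lsid(obj)
    return (subject, predicate, obj)

# Assumed direction: the first data item was derived from the second.
triple = link(
    "urn:lsid:taverna.sf.net:datathing:45fg6",
    DERIVED_FROM,
    "urn:lsid:taverna.sf.net:datathing:23ty3",
)
print(triple)
```

Using a URI for the link type means the same statement can be stored in any RDF repository and described further by an ontology.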
[Figure labels: LSID for GenBank data; provenance web; personalization view.]
Reflection
• First attempt
• Bridging the island
• Provenance modelling: relational + schema-less model
• Provenance collection
• Moreover:
– Provenance slicing
– Security control
– Authenticity validation
– Provenance versioning and (long-time) preservation
Related Work
• Chimera:
– Provenance cross-referencing
– www.griphyn.org/chimera/
• CombeChem:
– www.combechem.org/
• PASOA (Provenance Aware Service Oriented Architecture)
– http://twiki.pasoa.ecs.soton.ac.uk/bin/view/PASOA/WebHome
• CMCS (Collaboratory for Multi-Scale Chemical Science)
– http://cmcs.ca.sandia.gov/index.php
• ESSW (Earth System Science Workbench)
– http://essw.bren.ucsb.edu/
Acknowledgement
– myGrid team:
• esp. Carole Goble, Robert Stevens, Chris Wroe,
Mark Greenwood, Phil Lord
– IBM:
• Dennis Quan
– Williams Group:
• esp. Hannah Tipney