MyGrid - University of Southampton

Provenance in myGrid and beyond
www.mygrid.org.uk
Luc Moreau,
University of Southampton, UK
… or the Provenance of
my interest in Provenance
Overview







Bioinformatics background
myGrid facts
Services and Workflows
Provenance in myGrid
Beyond myGrid Provenance
Architectural vision
Conclusions
Large amounts of data
http://www3.ebi.ac.uk/Services/DBStats/
EMBL (July 2001): 150 Gbytes
Microarray: 1 Petabyte per annum
Sanger Centre: 20 terabytes of data
Genome sequences increase 4x per annum
Heterogeneity
Complexity
Diversity
Genomic, proteomic, transcriptomic, metabolomic,
protein-protein interactions, regulatory bio-networks,
alignments, disease, patterns & motifs, protein
structure, protein classifications, specialist
[Figure: linked data types, including phenotype, gene,
genome sequence, gene expression, proteome, protein
structure, protein sequence, P-P interactions, homology,
disease, drug and clinical trial]
Heterogeneity




Data types & forms
Community
Autonomy
Over 500 different
databases


Different formats,
structure, schemas,
coverage…
Web interfaces, flat
file distribution,…
Heterogeneous Data





Multimedia
Images & Video
Text annotations
& literature
Descriptive as well
as numeric
Knowledge-based
Text
Extraction
Bioinformatics Analysis

Different algorithms: BLAST, FASTA, pSW
Different implementations: WU-BLAST, NCBI-BLAST
Different service providers: NCBI, EBI, DDBJ
Drug Discovery
In silico experimentation

Discovery of resources and tools, staging of
operations, sharing of results

Process is as important as outcome

Science is dynamic – change happens

Scientific discovery is personal & global

Provenance and history
myGrid




EPSRC-funded pilot project
Generic middleware within an application setting
36 months within a 42-month performance period
Start: 1st October 2001
16 full-time post docs altogether
6 DTA studentships
1 technical project manager
myGrid consortium

Scientific Team



Biologists and Bioinformaticians
GSK, AZ, Merck KGaA, Manchester, EBI
Technical Team




Manchester, Southampton, Newcastle, Sheffield,
EBI, Nottingham
IBM, SUN
GeneticXchange
Network Inference, Epistemics Ltd
myGrid outcomes
e-Scientists
Bioinformatics demonstrator (Graves’ disease
and Williams’ syndrome)
Developers
myGrid-in-a-Box developer’s kit
(currently myGrid 0.4)
Integrating some existing bioinformatics tools
with myGrid (EBI services)
Graves’ disease


Autoimmune disease of
the thyroid in which the
immune system of an
individual attacks cells
in the thyroid gland
resulting in
hyperthyroidism
Weight loss, trembling,
muscle weakness,
increased pulse rate,
increased sweating and
heat intolerance, goitre,
exophthalmos
The Biology


GD is caused by the
stimulation of the
thyrotrophin receptor
by thyroid-stimulating
autoantibodies
secreted by
lymphocytes of the
immune system.
Why is the
lymphocyte causing
the antibodies that
attack the thyroid
cell?
Graves’ Disease Experimental Process
What genes are associated with Graves’ Disease?
Candidate Gene
What is known
about my candidate
gene? Literature
Previous Research
Databases
Experimental
Annotation Pipeline
What SNPs (single
nucleotide
polymorphisms) in
my candidate gene
might be relevant?
How can I visualise
annotations to my
candidate gene?
Verify relevance of
SNPs.
Genotype Assay Design System
3D Protein Structure &
SNP Visualisation
Experiment life cycle
Personalised registries
Personalised workflows
Info repository views
Personalised annotations
Personalised metadata
Security
Resource & service discovery
Repository creation
Workflow creation
Database query formation
Forming experiments
Personalisation
Discovering and reusing
experiments and resources
Workflow
discovery &
refinement
Resource &
service discovery
Repository
creation
Provenance
Providing services &
experiments
Service registration
Workflow deposition
Metadata Annotation
Third party
registration
Executing
experiments
Managing
experiments
Information repository
Metadata management
Provenance
management
Workflow evolution
Event notification
Workflow enactment
Distributed Query
processing
Job execution
Provenance
generation
Single sign-on
authorisation
Event notification
A work bench for demonstrating services
myView on
the mIR
Workflow
Metadata
about
workflow
note about
workflow
Workflows
• A workflow represents an experiment that
can be run on the Grid.
• A workflow takes data as input.
• It performs ‘activities’, which are steps
involved in analysing the data, including using
tools and services, querying databases and
running other workflows.
• A workflow can be run on the user’s local
machine, or remotely, taking advantage of
resources that are distributed.
• Data intensive grid having to deal with
heterogeneity of the data and processes.
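The description above can be sketched in a few lines of Python. This is only an illustration, not myGrid code: the `Workflow` class, the two activity functions and the sample sequence are hypothetical names invented for the example. It shows a workflow as an ordered list of activities applied to input data, where an activity may itself be another workflow.

```python
from typing import Callable, List

# Hypothetical illustration only; none of these names come from myGrid.
Activity = Callable[[str], str]


class Workflow:
    """An experiment expressed as an ordered list of activities."""

    def __init__(self, name: str, activities: List[Activity]):
        self.name = name
        self.activities = activities

    def __call__(self, data: str) -> str:
        # Calling a workflow runs each activity in order, so a workflow
        # can itself be used as an activity of another workflow.
        for activity in self.activities:
            data = activity(data)
        return data


def clean(sequence: str) -> str:
    """Stand-in for a tool: normalise a nucleotide sequence."""
    return sequence.strip().upper()


def reverse_complement(sequence: str) -> str:
    """Stand-in for an analysis service."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(sequence))


prepare = Workflow("prepare", [clean])
experiment = Workflow("experiment", [prepare, reverse_complement])
result = experiment("  aacg ")
```

Here `prepare` is reused as the first activity of `experiment`, which mirrors the point that running other workflows is one kind of activity.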
myGrid schematic
Graves’ disease scenario
Workbench
Workflow editor
Talisman
Gateway
Event
Notification
Workflow
Enactment
Information
repository
Service
Registry
Exemplars
Generic
Applications
Core
components
Knowledge management
SoapLab
Text services
Bio services
Distributed
query
processing
Services
Service Oriented Architecture
[Architecture diagram. Labels: Knowledge Services,
Ontology Server, Reasoner, Matcher, semantic
registration, Service Registry, Registry View, UDDI,
UDDI-M, Notification Service, Service Discovery, Test
Data, Build/Edit Workflow, JMS, Workflow enactment
engine, Provenance service, mIR, WSFL, Component
Discovery, mInfo Repository (workflow templates,
workflow instances, metadata, concepts, provenance
data), DB2, Distributed Query Processor, Job Execution,
Information Extraction, PASTA, SoapLab, Services]
myGrid Deployment
[Deployment diagram. Labels: User 1, User 2, Workbench,
Enactor, Gateway, Browser, Team Server, Portal,
SemanticFind, UserProxy, Views, User 1 view, User 2
view, User 1 proxy, User 2 proxy, DQPService,
Notifications, MIR, Organization Servers, Registry,
Registry Provider, Service Provider 1, Service
Provider 2, TextService, BioService1 to BioService6]
myGrid 0.4 (Nov 2003)
Describer (MAN): A tool for attaching semantic descriptions to WS and workflows
Find Service (MAN): A component for classifying and discovering services and workflows via their semantic descriptions
Ontology Server (MAN): The DAML+OIL reasoner
Workbench (NOT): A NetBeans module for examining and updating the MIR and submitting workflows for enactment
e-Science Gateway (NOT): An API giving access to myGrid core services
MIR (myGrid information repository) (MAN/NEW): A Web Service accessing a repository that can hold data for an individual scientist or a team of scientists
Notification Service (IAM): A general-purpose Web Service that supports a publish/subscribe model of event notification, based on JMS
Registry View service (IAM): A Web Service supporting a registry of published Web Services and workflows annotated with metadata, including semantic descriptions
Freefluo (ITI): Workflow enactment engine
Taverna (EBI): Workflow editing environment
Provenance: definition

Main Entry: prov·e·nance
Pronunciation: 'präv-n&n(t)s, 'prä-v&-"nän(t)s
Function: noun
Etymology: French, from provenir to come forth,
originate, from Latin provenire, from pro- forth +
venire to come -- more at PRO-, COME
Date: 1785
1 : ORIGIN, SOURCE
2 : the history of ownership of a valued object or
work of art or literature
Ownership History for European Paintings and Sculpture
(1933-1945)
As a fundamental part of its mission, The Art Institute
of Chicago has always conducted research on works in
its collection. An important part of that research is the
effort to establish the provenance (chain of ownership)
for a work, from the moment it leaves the artist’s studio
to the present. Since 1997, and in keeping with the
guidelines issued by the American Association of
Museums (AAM) in November 1999 and May 2001, the
Art Institute has intensified its efforts to determine the
provenance for the period 1933-1945 for paintings and
sculpture in its collection.
This research began with an initial survey of all the
European paintings and sculpture in the collection that
were created before 1946 and acquired by the museum
after 1932. Out of the surveyed objects, curators
identified those that had gaps in their chain of
ownership for the Nazi era (1933 to 1945). These
objects were first published on the Art Institute's
website in April 2000, and are listed here with asterisks
at the beginning of their captions. In conformity with
the May 2001 AAM guidelines, this group of objects
published on the website was expanded to include
works that have a documented chain of ownership and
hence no gaps in their provenance, but nevertheless
were in continental Europe from 1933 to 1945, and also
changed hands during that period. These works are
listed without an asterisk.
The Art Institute of Chicago continues to research the
ownership history of those objects that lack conclusive
provenance documentation. Ongoing efforts to identify
past owners include physical examination of works and
consultation of object files. Also critical are the
investigation of museum archives, auction and
exhibition catalogues, monographic studies, directories
Provenance is a fancy word that refers to
an object's history, or who owned the
object, when and where. It is to an object
what a deed trail is to a piece of land.
"Provenance is important because objects
are almost never singular," says Michael
Flanigan, a Baltimore dealer of antique
American furniture. "Provenance separates the special, almost
sacred objects of the past from the ordinary ones." Michael
describes provenance as an object's "itinerary through time."
In other words, pens or pencils from the 1860s may be nearly a
dime-a-dozen, but the one that Abraham Lincoln held in his hand
while scribbling the rough draft of the "Gettysburg Address" is not.
In many fields of collecting, provenance is the name of the game.
For example, provenance makes all the difference in military
antiques and collectibles, Michael explains. A gun used by the
French in the 19th century may have value, but one used to defeat
Napoleon at Waterloo would be worth considerably more. If you
can trace that gun into Napoleon's pocket, you have a treasure in
hand. Provenance connected to the infamous — a revolver traced
back to Billy the Kid, for example — can also boost an object's
value exponentially.
"People want to know if something was touched by a great man or
woman," says Michael. "People want some of the gilt from the past
to rub off on them. That's why it matters so much." Provenance that
leads back to not only the great names of the past but also to great
collectors also has value. "Famous collectors burnish the objects
they own," explains Michael, "so new collectors want to join that
distinguished line."
In a clear-cut case, an object's trail of ownership may consist of a
series of bills of sale. However, determining an object's provenance
is not an exact science. Sometimes a photo can pair a person to an
object, or there may be an oral or written history, such as an
autobiography, that connects a person to a particular thing.
Michael adds, however, that provenance can be a shady business.
"Provenance is something that is often proffered, but less often
proved, because it's so difficult to pin down."
Religious antiques, for example, have often been given cachet by
associating them with valued religious figures. "At one time in the
Provenance Vineyards showcases the distinctiveness of Napa
Valley's best growing areas for red wines by selecting some of
the finest Cabernet Sauvignon vineyards in the Rutherford and
Oakville districts and Merlot vineyards in the Carneros
district. The same uncompromising approach also characterizes
Provenance winemaking, with the goal of bringing the grapes
from these great vineyards to their fullest potential. Starting
with its inaugural 1999 vintage, Provenance's approachable, yet
cellar-worthy wines display a fine symmetry of substance and
elegance.
The word provenance means “origin” or “source.” In the art
world, a provenance is the record of a work’s ownership, from
its creation to the present day. Connoisseurs place great
importance in provenance as a guarantee of authenticity. The
name reflects Provenance Vineyards' goal of creating fine wines
that are works of art.
Provenance







Provenance is related to:
Experiment is repeatable, if not
reproducible, and explained by
provenance records
Who, what, where, why, when,
(w)how?
The traceability of knowledge as
it evolves and as it is derived.
Immutable metadata
Migration – travels with its data
but may not be stored with it.
Private vs Shared provenance
records.
Credit.
Early Provenance Capture
A full provenance record is linked with the results.
It’s a log of execution.
Kinds of “Provenance”

Backward Derivation




An explanation of when, by whom and how something
was produced.
Linking items, usually in a directed graph.
Execution; process-centric.
To be contrasted with forward derivation, which is a
path like a workflow, script or query.
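A backward derivation can be pictured as a walk over a directed graph from a result towards everything it was derived from. The sketch below is a minimal illustration under assumptions: the item names and the `derived_from` mapping are invented for the example and do not come from any system mentioned in the talk.

```python
from typing import Dict, List, Set

# Hypothetical derivation records: each item maps to the items it was
# directly derived from (the edges of a directed graph).
derived_from: Dict[str, List[str]] = {
    "plot": ["events"],
    "events": ["simulation", "cuts"],
    "simulation": ["parameters"],
    "cuts": [],
    "parameters": [],
}


def backward_derivation(item: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every item that contributed, directly or not, to `item`."""
    seen: Set[str] = set()
    stack = list(graph.get(item, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


ancestors = backward_derivation("plot", derived_from)
```

A forward derivation would follow the same edges in the opposite direction, starting from inputs, which is what a workflow or script does when it runs.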
[Figure: derivation graph whose nodes are parameter sets,
starting from "mass = 200", branching on decay = bb, ZZ or WW,
then on stability = 1 or 3, with further nodes adding
LowPt = 20 and HighPt = 10000, event = 8, or plot = 1]
Kinds of “Provenance”

Annotations




Attached to items or collections of items, in a
structured, semi-structured or free-text form.
Annotations on one item or linking items.
An explanation of why, when, where, who, what, how.
Data-centric
Kinds of “Provenance” in myGrid

Derivations



Workflow Enactment Engine
provides a detailed
provenance record stored in
the myGrid Information
Repository (mIR) describing
what was done, with what
services and when
XML document, soon to be
an RDF model
Annotations

Every mIR object has Dublin
Core provenance properties
described in an attribute
value model
Provenance of data

Operational execution trail
Gene:AC005412.6
SNP:000010197
input
run_for
urn: Claire
Jennings
output
process
start time
end time
by_service
lsid:HGVBase_retrieve
From Provenance to Knowledge
• Declarative semantic execution trail
contains_single_nucleotide_polymorphism
Gene:AC005412.6
input
as stated by
run_for
urn: Claire
Jennings
SNP:000010197
output
process
start time
end time
by_service
lsid:HGVBase_retrieve
From Provenance to Knowledge
• Trust and attribution
urn: Carole
Goble
disputed by
contains_single_nucleotide_polymorphism
Gene:AC005412.6
input
as stated by
run_for
urn: Claire
Jennings
SNP:000010197
output
process
start time
end time
by_service
lsid:HGVBase_retrieve
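The three diagrams above can be read as sets of subject-predicate-object statements. The sketch below encodes them as plain tuples. The identifiers and edge labels are copied from the slides, but which nodes each edge connects is an assumed reading of the diagrams, and the node names `process:1` and `statement:1` are hypothetical. Start and end times are left out because the slides give no values.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# "process:1" and "statement:1" are hypothetical node names; the other
# identifiers and edge labels are copied from the slides. The way edges
# connect nodes is an assumed reading of the diagrams.
trail: List[Triple] = [
    # Operational execution trail.
    ("process:1", "input", "Gene:AC005412.6"),
    ("process:1", "output", "SNP:000010197"),
    ("process:1", "by_service", "lsid:HGVBase_retrieve"),
    ("process:1", "run_for", "urn: Claire Jennings"),
    # Declarative semantic statement; "statement:1" stands for it below.
    ("Gene:AC005412.6", "contains_single_nucleotide_polymorphism", "SNP:000010197"),
    ("statement:1", "as stated by", "process:1"),
    # Trust and attribution: another scientist disputes the statement.
    ("statement:1", "disputed by", "urn: Carole Goble"),
]


def objects(subject: str, predicate: str, triples: List[Triple]) -> List[str]:
    """Return all objects linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


who_disputes = objects("statement:1", "disputed by", trail)
service_used = objects("process:1", "by_service", trail)
```

Keeping the execution trail, the derived statement and the dispute as separate triples lets each be queried on its own, which is the layering the three slides build up.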
Provenance vs …

Provenance vs Annotation



Provenance of an annotation
Annotation of Provenance
Provenance vs Workflow


Provenance describes past execution
A workflow is a script for future execution
What is Provenance?


Annotations may be subject to
interpretation (e.g. Alice believes
annotation X, whereas Bob does not).
Provenance should aim at recording an
undisputed view of an execution.
What is Provenance?




Provenance traces execution
Provenance must be generated
automatically
Annotations can be either generated
automatically or created by the user
Annotations can contain semantic
augmentation, which can be derived
automatically or supplied manually.
Generating provenance
[Diagram labels: Scufl workflow template; identify
workflow; mIR; input data & parameters; bind services;
OWL descriptions; RDF; registry; FreeFluo WFEE; workflow
execution; execution provenance log (startTime, endTime,
service instances invoked …); workflow knowledge
template; knowledge provenance log; RDF+OWL; data and
metadata from the run; knowledge arising from workflow]
Provenance in a Bioinformatics
Grid
myGrid builds a personalised problem-solving
environment that helps bioinformaticians find, adapt,
construct and execute in silico experiments

Provenance in the drug discovery process:
FDA requirement on drug companies to keep a
record of provenance of drug discovery as long
as the drug is in use (up to 50 years
sometimes).
Provenance in Aerospace
Engineering


Provenance requirement: to maintain a historical
record of outputs from each sub-system involved
in simulations.
Aircraft provenance data needs to be kept for up
to 99 years when sold to some countries.
Currently, little direct support is available for
this.
Provenance in Organ Transplant
Management




Decision support systems for organ and tissue
transplant rely on a wide range of data sources,
patient data, and doctors’ and surgeons’ knowledge.
Heavily regulated domain: European, national,
regional and site-specific rules govern how
decisions are made.
Application of these rules must be ensured and
auditable, and the rules may change over time.
Provenance allows previous decisions to be tracked,
which is crucial to maximise the efficiency of
matching and the recovery rate of patients.
The Grid and Virtual
Organisations



The Grid problem is defined as coordinated
resource sharing and problem solving in dynamic,
multi-institutional virtual organisations [FKT01].
Effort is required to allow users to place their
trust in the data produced by such virtual
organisations
Understanding how a given service is likely to
modify data flowing into it, and how this data
has been generated is crucial.
Provenance and Virtual
Organisations


Given a set of services in an open grid
environment that decide to form a virtual
organisation with the aim to produce a given
result;
How can we determine the process that
generated the result, especially after the
virtual organisation has been disbanded?
The lack of information about the origin of
results does not help users to trust such open
environments.
Provenance and Workflows



Workflow enactment has become
popular in the Grid and Web Services
communities
Workflow enactment can be seen as a
scripted form of virtual organisation.
The problem is similar: how can we
determine the origin of enactment
results?
Provenance: Definition



Provenance is some data able to explain how
a particular result has been derived.
In a service-oriented architecture,
provenance identifies what data is passed
between services, what services are
available, and what results are generated for
particular sets of input values, etc.
Using provenance, a user can trace the
“process” that led to the aggregation of
services producing a particular output.
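As a rough illustration of this definition, the sketch below records which data passed through which service and then traces the process behind one result. The `Invocation` record layout, the service names and the data names are all hypothetical, and the tracing assumes the recorded invocations contain no cycles.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Invocation:
    """One recorded service call (hypothetical record layout)."""
    service: str
    inputs: Tuple[str, ...]
    output: str


def explain(result: str, log: List[Invocation]) -> List[str]:
    """List the services that led to `result`, earliest first."""
    producer: Dict[str, Invocation] = {inv.output: inv for inv in log}
    steps: List[str] = []

    def visit(value: str) -> None:
        inv = producer.get(value)
        if inv is None:  # an original input, not produced by any service
            return
        for source in inv.inputs:
            visit(source)
        if inv.service not in steps:
            steps.append(inv.service)

    visit(result)
    return steps


log = [
    Invocation("fetch_sequence", ("gene_id",), "sequence"),
    Invocation("align", ("sequence", "database"), "alignment"),
    Invocation("render", ("alignment",), "report"),
]
process = explain("report", log)
```

The list returned for `"report"` names every service on the path to that output, which is the kind of trace the last bullet describes.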
What is the problem?


Provenance recording should be part of the
infrastructure, so that users can elect to
enable it when they execute their complex
tasks over the Grid or in Web Services
environments.
Currently, the Web Services protocol stack
and the Open Grid Services Architecture do
not provide any support for recording
provenance.
Architectural Vision
Architectural Vision



Provenance gathering is a collaborative
process that involves multiple entities,
including the workflow enactment engine, the
enactment engine's client, the service
directory, and the invoked services.
Provenance data will be submitted to one or
more “provenance repositories” acting as
storage for provenance data.
Upon user's requests, some analysis,
navigation and reasoning over provenance
data can be undertaken.
Architectural Vision



Storage could be achieved by a provenance service.
Provenance service would provide support for
analysis, navigation or reasoning over provenance.
Client-side support for submitting provenance data
to the provenance service.
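The division of labour described in this vision can be sketched as two small classes. This is an in-memory illustration under assumptions, not the prototype's API: `ProvenanceService`, `ProvenanceClient`, their method names and the record fields are all invented here.

```python
from collections import defaultdict
from typing import Dict, List


class ProvenanceService:
    """In-memory stand-in for a provenance repository (hypothetical API)."""

    def __init__(self) -> None:
        self._records: Dict[str, List[dict]] = defaultdict(list)

    def submit(self, session_id: str, record: dict) -> None:
        # Storage: keep records grouped by the session they belong to.
        self._records[session_id].append(dict(record))

    def query(self, session_id: str, **criteria: str) -> List[dict]:
        # Navigation: return the records matching every given field.
        return [
            r for r in self._records.get(session_id, [])
            if all(r.get(k) == v for k, v in criteria.items())
        ]


class ProvenanceClient:
    """Client-side helper used by an enactment engine or a service."""

    def __init__(self, service: ProvenanceService, session_id: str):
        self._service = service
        self._session_id = session_id

    def record_invocation(self, service_name: str, inputs: str, output: str) -> None:
        self._service.submit(
            self._session_id,
            {"service": service_name, "inputs": inputs, "output": output},
        )


store = ProvenanceService()
client = ProvenanceClient(store, session_id="run-1")
client.record_invocation("align", "sequence", "alignment")
client.record_invocation("render", "alignment", "report")
matches = store.query("run-1", service="align")
```

In a real deployment the service would sit behind a Web Service interface and persist its records; the in-memory dictionary only stands in for that storage.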
A First Prototype





(Szomszor, Moreau 03)
A service-oriented architecture for
provenance support in Grid and Web Services
environments, based on the idea of a
provenance service;
A client-side API for recording provenance
data for Web Service invocation;
A data model for storing provenance data;
A server-side interface for querying
provenance data;
Two components making use of provenance:
provenance browsing and provenance
validation.
Prototype Overview
Prototype Sequence Diagram
Prototype Sequence Diagram




To identify the interactions between
provenance service, client side library and
enactment engine
Creation of a session
Need to be able to support the most complex
workflows including conditional branching,
iteration, recursion and parallel execution.
Support asynchronous submission of
provenance data so that provenance
submission does not delay workflow
execution.
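One common way to keep provenance submission from delaying execution is to hand records to a background worker through a queue. The sketch below shows that idea only; it is not the prototype's client library, and `AsyncRecorder`, its methods and the session name are hypothetical. The `stored` list stands in for the remote provenance service.

```python
import queue
import threading
from typing import List


class AsyncRecorder:
    """Hypothetical client-side recorder: enqueue now, store later."""

    def __init__(self) -> None:
        self.stored: List[dict] = []  # stands in for the remote service
        self._pending: "queue.Queue" = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, session_id: str, event: str) -> None:
        # Returns immediately; the workflow is not blocked by storage.
        self._pending.put({"session": session_id, "event": event})

    def _drain(self) -> None:
        while True:
            item = self._pending.get()
            if item is None:  # sentinel: stop the worker
                break
            self.stored.append(item)

    def close(self) -> None:
        # Flush everything recorded so far, then stop the worker thread.
        self._pending.put(None)
        self._worker.join()


recorder = AsyncRecorder()
for step in ("start", "invoke align", "invoke render", "end"):
    recorder.record("session-1", step)
recorder.close()
events = [item["event"] for item in recorder.stored]
```

Because a single worker drains a first-in, first-out queue, records reach the store in the order they were produced, and the session identifier groups them into one run.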
Prototype Provenance Data
Model
Prototype Provenance Data Model


Must support recording of all
information necessary to replay
execution
Must support all complex forms of
workflows (recursion, iterations, parallel
execution).
Prototype Provenance
Browser
Discussion




In order for provenance data to be useful, we expect
such a protocol to support some “classical”
properties of distributed algorithms.
Using mutual authentication, an invoked service can
ensure that it submits data to a specific provenance
server, and vice-versa, a provenance server can
ensure that it receives data from a given service.
With non-repudiation, we can retain evidence of the
fact that a service has committed to executing a
particular invocation and has produced a given
result.
We anticipate that cryptographic techniques will be
useful to ensure such properties
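As a small illustration of the cryptographic techniques anticipated here, the sketch below attaches a message authentication code to a provenance record so that tampering can be detected. It is an assumption-laden example: the shared key and record fields are invented. A shared-key MAC covers authentication and integrity only; non-repudiation would need public-key digital signatures, which are not shown.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret agreed between a service and the provenance
# server. A shared key authenticates the sender but does NOT give
# non-repudiation; that would need public-key digital signatures.
SHARED_KEY = b"example-key-not-for-real-use"


def sign(record: dict, key: bytes) -> str:
    """Return a hex MAC over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(record: dict, tag: str, key: bytes) -> bool:
    """Check the MAC using a constant-time comparison."""
    return hmac.compare_digest(sign(record, key), tag)


record = {"service": "align", "input": "sequence", "output": "alignment"}
tag = sign(record, SHARED_KEY)
accepted = verify(record, tag, SHARED_KEY)
tampered = dict(record, output="something else")
rejected = not verify(tampered, tag, SHARED_KEY)
```

Sorting the JSON keys gives both sides the same byte encoding of a record, so the tag does not depend on dictionary ordering.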
Towards Trust
Towards Trust





Using the provenance of data, trust metrics
of the data can be derived from:
Trust the user places in invoked services
Trust the user places in the input data
Trust the user places in the enacted workflow
Trust the user places in the enactor
Trust the user places in the provenance
service.
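The slide lists the factors but not how they combine. The sketch below picks one possible rule purely for illustration: the numeric scale from 0.0 to 1.0, the example values and the weakest-link rule are all assumptions, not a metric proposed in the talk.

```python
from typing import Dict

# The five factors come from the slide; the numeric scale (0.0 to 1.0),
# the example values and the combination rule are all assumptions.
trust: Dict[str, float] = {
    "invoked services": 0.9,
    "input data": 0.8,
    "enacted workflow": 1.0,
    "enactor": 1.0,
    "provenance service": 0.5,
}


def weakest_link(factors: Dict[str, float]) -> float:
    """Assumed rule: a result is no more trusted than its weakest factor."""
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"trust in {name} must lie in [0, 1]")
    return min(factors.values())


score = weakest_link(trust)
```

Other rules, such as multiplying the factors or weighting them, would fit the same inputs; the choice depends on the application domain.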




The purpose of project PASOA is to investigate
provenance in Grid architectures
Funded by EPSRC under the “fundamental
computer science for e-Science call”
In collaboration with Cardiff
www.pasoa.org
Conclusion




Provenance is a rather unexplored domain
Strategic for bringing trust to open environments
Necessity to design a configurable
architecture capable of supporting multiple
requirements from very different application
domains.
Need to further investigate the algorithmic
foundations of provenance, which will lead to
scalable and secure industrial solutions.
Publications

[SM03] Martin Szomszor and Luc Moreau. Recording and
reasoning over data provenance in web and grid services. In
International Conference on Ontologies, Databases and
Applications of SEmantics (ODBASE'03), volume 2888 of Lecture
Notes in Computer Science, pages 603-620, Catania, Sicily,
Italy, November 2003.
[MCS+03] Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf
Hempel, Omer Rana, Lazslo Varga, Ulises Cortes, and Steven
Willmott. Provenance-based trust for grid computing - position
paper. 2003.
[GGS+03] Mark Greenwood, Carole Goble, Robert Stevens, Jun
Zhao, Matthew Addis, Darren Marvin, Luc Moreau, and Tom
Oinn. Provenance of e-science experiments - experience from
bioinformatics. In Proceedings of the UK OST e-Science second
All Hands Meeting 2003 (AHM'03), pages 223-226, Nottingham,
UK, September 2003.
Acknowledgements








The myGrid Southampton Team: Simon Miles,
Juri Papay, Ananth Krishna, Michael Luck,
David De Roure, Terry Payne
Mark Greenwood, Carole Goble, Manchester
Martin Szomszor, Southampton
Syd Chapman, IBM
Omer Rana, Cardiff
Andreas Schreiber and Rolf Hempel, DLR
Lazslo Varga, SZTAKI
Ulises Cortes and Steven Willmott, UPC
www.mygrid.org.uk