slides (ppt) - The Open Provenance Model

Open Provenance Model Tutorial
Session 2: OPM Overview and Semantics
Luc Moreau
University of Southampton
Session 2: Aims
In this session, you will learn about:
• The Open Provenance Model
• The definition of its abstract model
• The inferences it supports
• Various efforts to provide OPM with a
semantics
Session 2: Contents
• Requirements and non-requirements
• Definition of OPM
• Specialization of OPM with Profiles
• Formalizations of OPM
OPM (NON-)REQUIREMENTS
OPM Requirements
• To allow provenance information to be
exchanged between systems, by means of a
compatibility layer based on a shared provenance
model.
• To allow developers to build and share tools that
operate on such a provenance model.
• To define the model in a precise, technology-agnostic manner.
• To define bindings to XML/RDF separately
• To support a digital representation of provenance
for any “thing”, whether produced by computer
systems or not
OPM Non-Requirements
• OPM does not specify the internal
representations that systems have to adopt to
store and manipulate provenance internally.
• OPM does not specify protocols to store such
provenance information in provenance
repositories.
• OPM does not specify protocols to query
provenance repositories.
OPM Layered Model
[Figure: layers of the model]
• OPM Domain Specialization: Workflow, Web
• OPM Essential Profiles: Collections, Attribution
• OPM Core
• OPM Sig
• OPM based APIs: record, query
• Technology Bindings: XML, RDF
THE OPEN PROVENANCE MODEL
(OPM)
Open Provenance Model
• Allows us to express all the causes of an item
– e.g., provenance of a bottle of wine includes:
    • Grapes from which it is made
    • Where those grapes grew
    • Process in the wine’s preparation
    • How the wine was stored
    • Between which parties the wine was transported, e.g. producer to distributor to retailer
    • Where it was auctioned
• Allows for process-oriented and dataflow-oriented views
• Based on a notion of annotated causality graph
Nodes
• Artifact: Immutable piece of state, which
may have a physical embodiment in a
physical object, or a digital
representation in a computer system.
• Process: Action or series of actions
performed on or caused by artifacts, and
resulting in new artifacts.
• Agent: Contextual entity acting as a
catalyst of a process, enabling,
facilitating, controlling, affecting its
execution.
[Figure: node symbols A (artifact), P (process), Ag (agent)]
Edges
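• used(R): process P → artifact A
• wasGeneratedBy(R): artifact A → process P
• wasControlledBy(R): process P → agent Ag
• wasTriggeredBy: process P2 → process P1
• wasDerivedFrom: artifact A2 → artifact A1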
Edge labels are in the past tense to express that they describe past executions
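The node and edge kinds above can be sketched as a small data model. This is only an illustrative sketch, not an official OPM binding: the names `Node`, `Edge`, `EDGE_SIGNATURES` and `check_edge` are assumptions introduced here, and each edge is stored as (effect, cause) to match the past-tense labels.

```python
from dataclasses import dataclass
from typing import Optional

# For each edge kind: (kind of the effect node, kind of the cause node).
EDGE_SIGNATURES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}


@dataclass(frozen=True)
class Node:
    id: str
    kind: str  # "artifact", "process" or "agent"


@dataclass(frozen=True)
class Edge:
    kind: str
    effect: Node
    cause: Node
    role: Optional[str] = None  # the (R) annotation, where the edge kind has one


def check_edge(edge: Edge) -> bool:
    """Return True if the edge connects the node kinds allowed for its edge kind."""
    expected = EDGE_SIGNATURES.get(edge.kind)
    return expected == (edge.effect.kind, edge.cause.kind)


a1 = Node("A1", "artifact")
p = Node("P", "process")
ok_edge = Edge("used", p, a1, role="dividend")   # a process used an artifact
bad_edge = Edge("used", a1, p, role="dividend")  # wrong direction
```

Here `check_edge(ok_edge)` holds while `check_edge(bad_edge)` does not, because a used edge must go from a process to an artifact.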
Illustration
[Figure: process P (type=division) with edges used(dividend) and used(divisor) to artifacts A1 and A2; artifact A3 wasGeneratedBy(quotient) P and artifact A4 wasGeneratedBy(rest) P]
• Process “used” artifacts and “generated” artifacts
• Edge “roles” indicate the function of the artifact with respect to the process (akin to function parameters)
• Edges and nodes can be typed
Causation chain:
• P was caused by A1 and A2
• A3 and A4 were caused by P
• Does it mean that A3 and A4 were caused by A1 and A2?
Hierarchical Descriptions (1)
[Figure: process P used A1 (role r1) and A2 (role r2); A3 wasGeneratedBy P (role r4) and A4 wasGeneratedBy P (role r3)]
Hierarchical Descriptions (2)
[Figure: drilling down into P reveals two processes: P1 used A1 (role r1) and generated A3 (role r4); P2 used A2 (role r2) and generated A4 (role r3)]
Hierarchical Descriptions (3)
[Figure: the two graphs side by side. Left: P used A1 (r1) and A2 (r2), and generated A3 (r4) and A4 (r3). Right: P1 used A1 (r1) and generated A3 (r4); P2 used A2 (r2) and generated A4 (r3)]
If these two graphs denote the same execution, it is not true that A4 was caused by A1; hence
dependencies between artifacts need to be asserted explicitly
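A minimal sketch of why the dependency cannot simply be inferred: if one naively assumes that every output of a process depends on every input of that process, the coarse graph and the refined graph give different answers for the same execution. The function name and the tuple encoding are assumptions made for this illustration.

```python
def naive_artifact_causes(used, generated_by):
    """Naively pair every output of a process with every input of that process.

    used: set of (process, artifact) pairs
    generated_by: set of (artifact, process) pairs
    Returns a set of (output artifact, input artifact) pairs.
    """
    return {
        (out, inp)
        for (out, proc) in generated_by
        for (proc2, inp) in used
        if proc == proc2
    }


# Coarse description: a single process P.
coarse_used = {("P", "A1"), ("P", "A2")}
coarse_gen = {("A3", "P"), ("A4", "P")}

# Refined description of the same execution: P1 and P2.
fine_used = {("P1", "A1"), ("P2", "A2")}
fine_gen = {("A3", "P1"), ("A4", "P2")}

coarse = naive_artifact_causes(coarse_used, coarse_gen)
fine = naive_artifact_causes(fine_used, fine_gen)
```

The pair `("A4", "A1")` appears in `coarse` but not in `fine`, which is the inconsistency the slide points out.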
Explicit Data Derivations (1)
[Figure: the same two graphs, each with explicit wasDerivedFrom edges between artifacts; in the refined graph (P1, P2), A3 wasDerivedFrom A1 and A4 wasDerivedFrom A2]
If these two graphs denote the same execution, it is not true that A4 was caused by A1; hence
dependencies between artifacts need to be asserted explicitly
Explicit Data Derivations (2)
[Figure: the division example (process P, type=division, roles dividend, divisor, quotient, rest) with explicit wasDerivedFrom edges between the generated and the used artifacts]
Causation chain:
• P was caused by A1 and A2
• A3 and A4 were caused by P
• A3 was caused by A1 and A2
• A4 was caused by A1 and A2
Provenance of Physical Objects
Another Account of the Same Execution
Accounts
• Mechanism by which multiple descriptions of the
same execution can co-exist in the same OPM graph
• Different accounts may be provided by different
observers (or asserters)
• Accounts can overlap if they have some OPM
subgraph in common
• An account can be a refinement of another, if it
provides more details
– Support for hierarchical descriptions
• Accounts may be conflicting!
Accounts
• Account is like a graph colouring
• Nodes/edges are asserted to belong to some accounts
[Figure legend: Bake execution; Bad Bake execution; Both executions]
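The colouring idea can be sketched by tagging each edge with the set of accounts it is asserted in. The account names `"coarse"` and `"fine"` and the helper names are hypothetical, chosen for this illustration only.

```python
# Each edge: (effect, edge kind, cause, accounts in which the edge is asserted).
edges = [
    ("A3", "wasGeneratedBy", "P", {"coarse"}),
    ("P", "used", "A1", {"coarse"}),
    ("A3", "wasGeneratedBy", "P1", {"fine"}),
    ("P1", "used", "A1", {"fine"}),
    ("A3", "wasDerivedFrom", "A1", {"coarse", "fine"}),
]


def view(edges, account):
    """Return the subgraph (as a list of edges) asserted in one account."""
    return [(e, k, c) for (e, k, c, accounts) in edges if account in accounts]


def overlap(edges, acc1, acc2):
    """Return the edges that two accounts have in common."""
    return [
        (e, k, c)
        for (e, k, c, accounts) in edges
        if acc1 in accounts and acc2 in accounts
    ]
```

With this encoding, `overlap(edges, "coarse", "fine")` contains only the shared wasDerivedFrom edge, so the two accounts overlap on that subgraph.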
OPM SEMANTICS
Completion Rules
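• P2 wasTriggeredBy P1 if and only if there exists an artifact A such that P2 used A and A wasGeneratedBy P1 (equivalence)
• If A2 wasDerivedFrom A1, then there exists a process P such that A2 wasGeneratedBy P and P used A1 (converse does not necessarily hold)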
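The equivalence between a wasTriggeredBy edge and a used/wasGeneratedBy pair can be sketched in both directions. This is a simplified illustration over plain tuples; the function names and the naming scheme for introduced artifacts are assumptions, not part of OPM.

```python
def infer_triggered_by(used, generated_by):
    """If P2 used A and A wasGeneratedBy P1, infer P2 wasTriggeredBy P1.

    used: set of (process, artifact); generated_by: set of (artifact, process).
    Returns a set of (triggered process, triggering process) pairs.
    """
    return {
        (p2, p1)
        for (p2, a) in used
        for (a2, p1) in generated_by
        if a == a2
    }


def introduce_artifacts(triggered_by):
    """For each P2 wasTriggeredBy P1, introduce an artifact generated by P1 and used by P2."""
    used, generated_by = set(), set()
    for p2, p1 in sorted(triggered_by):
        fresh = f"A_{p1}_{p2}"  # hypothetical name for the introduced artifact
        generated_by.add((fresh, p1))
        used.add((p2, fresh))
    return used, generated_by


triggered = {("P2", "P1"), ("P3", "P2")}
used, generated_by = introduce_artifacts(triggered)
round_trip = infer_triggered_by(used, generated_by)
```

Going from wasTriggeredBy edges to artifacts and back yields the original edges, which is what the equivalence expresses.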
Inferences
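[Figure: a chain of edges between nodes A/P1 and A/P2 (each an artifact or a process) through intermediate artifacts A, summarized by a single starred edge between A/P1 and A/P2]
• Transitivity of edges connecting an artifact
• Starred edge “was caused by”
• What we can infer is defined by transitive closure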
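As one instance of such an inference, a starred (multi-step) derivation relation can be computed as the transitive closure of the single-step wasDerivedFrom edges. The sketch below uses a simple fixed-point loop; the function name and tuple encoding are assumptions for illustration.

```python
def transitive_closure(edges):
    """Return the transitive closure of a set of (effect, cause) pairs."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                # Chain a -> b and b -> d into a -> d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


derived_from = {("A3", "A2"), ("A2", "A1")}
derived_from_star = transitive_closure(derived_from)
```

Here `derived_from_star` additionally contains `("A3", "A1")`, the multi-step dependency.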
WasTriggeredBy is not transitive
[Figure: the chain P3 wasTriggeredBy P2 wasTriggeredBy P1, next to a starred edge between P3 and P1]
• By completion, there exists A12 generated by P1 and used by P2
• By completion, there exists A23 generated by P2 and used by P3
• A23 could have been generated before A12 was used
OPM Inferences
Valid OPM Graphs
• WasDerivedFrom* is acyclic within one
account
– Intuition: a data item cannot be derived from itself
– Note: cycles may exist in multiple accounts
• An artifact can be generated by at most one
process in a given account
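The two validity conditions can be checked mechanically for the edges of a single account. This is a sketch under the assumption that edges are given as plain tuples; `is_valid_account` is a name introduced here.

```python
def is_valid_account(derived_from, generated_by):
    """Check the two conditions above for the edges of one account.

    derived_from: set of (effect artifact, cause artifact) pairs
    generated_by: set of (artifact, process) pairs
    """
    # Condition 2: an artifact is generated by at most one process.
    generators = {}
    for artifact, process in generated_by:
        generators.setdefault(artifact, set()).add(process)
    if any(len(procs) > 1 for procs in generators.values()):
        return False

    # Condition 1: wasDerivedFrom has no cycle (depth-first search).
    graph = {}
    for effect, cause in derived_from:
        graph.setdefault(effect, set()).add(cause)
    visiting, done = set(), set()

    def has_cycle(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        found = any(has_cycle(nxt) for nxt in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return not any(has_cycle(node) for node in list(graph))
```

For example, `is_valid_account({("A2", "A1"), ("A1", "A2")}, set())` is rejected because A1 and A2 would each be derived from the other.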
Time Information
• Causality implies time ordering, but not the
converse
• Time is regarded as crucial information in the
provenance of data (though time does not
imply causality)
• The model specifies constraints that time
information must satisfy with respect to
causal dependencies
Time Constraints
[Figure: process P, wasControlledBy(R) agent Ag with start: T2 and end: T5; P used(R) an artifact A at T3, which wasGeneratedBy(R) at T1; a second artifact A wasGeneratedBy(R) P at T4 and was used(R) at T6]
T1<T3 (artifact must exist before being used)
T2<T3 (process must have started before using artifacts)
T3<T5 (process uses artifacts before it ends)
T2<T4 (process must have started before generating artifacts)
T4<T5 (process generates artifacts before it ends)
T4<T6 (artifact must exist before being used)
T2<T5 (process must have started before ending)
No constraint between T3 and T4
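The constraints listed above can be checked against recorded timestamps with a few lines of code. This is a minimal sketch assuming numeric timestamps keyed by the labels T1 to T6; the function name is an assumption.

```python
# (earlier, later, reason) triples taken from the constraints listed above.
CONSTRAINTS = [
    ("T1", "T3", "artifact must exist before being used"),
    ("T2", "T3", "process must have started before using artifacts"),
    ("T3", "T5", "process uses artifacts before it ends"),
    ("T2", "T4", "process must have started before generating artifacts"),
    ("T4", "T5", "process generates artifacts before it ends"),
    ("T4", "T6", "artifact must exist before being used"),
    ("T2", "T5", "process must have started before ending"),
]


def time_violations(times):
    """Return the violated constraints as 'Ta<Tb (reason)' strings."""
    return [
        f"{a}<{b} ({reason})"
        for (a, b, reason) in CONSTRAINTS
        if not times[a] < times[b]
    ]


# T3 and T4 may come in either order: there is no constraint between them.
ok_times = {"T1": 1, "T2": 2, "T3": 4, "T4": 3, "T5": 5, "T6": 6}
bad_times = {"T1": 1, "T2": 2, "T3": 0, "T4": 3, "T5": 5, "T6": 6}
```

`time_violations(ok_times)` is empty even though T4 precedes T3, while `bad_times` violates both T1<T3 and T2<T3.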
Annotations
Let’s not reinvent the wheel!
• All OPM entities (edges, nodes, graphs,
accounts) can be annotated
• All annotations should be addressable
(allowing for annotations of annotations)
• Bindings to formalize how annotations can be
serialized (standard in RDF, custom in XML)
• Reserved properties: hasType, hasValue, ...
OPM SPECIALIZATIONS
Concept of a Profile
• A specialisation of an OPM graph for a specific
domain or to handle a specific problem
• Profile definitions are welcome!
• Note: profile multiplicity challenges interoperability
• A profile has a unique identity
• Defines vocabulary, guidelines, expansion
guidance, serialisation format
Profile Compliance
PROFILE
•Id
•Vocabulary
•Guidance
•Expansion directives
•Serialisation
Profile Expansion
[Figure: Profile Compliant Graph → Profile-expanded Graph]
Profile Compliance
[Figure: a Profile Compliant Graph and its Profile-expanded Graph, each subject to OPM Inference, yielding Inferred Graph 1 and Inferred Graph 2]
Emerging Profiles
• Emerging Profiles
– Collections
– Dublin Core
– D-Profile
• Will be discussed in separate session
OPM FORMALIZATIONS
Early Formalizations
• OPM v1.00 and OPM v1.01 contained a set-theoretic definition of OPM and permitted
inferences
• Moved out of OPM v1.1 since it is difficult to
keep specification and formalization in sync
• While the formalization is useful in defining
OPM precisely, it does not give OPM a
meaning!
Reproducibility Semantics
(Moreau 2010)
• Sees OPM graph as an executable program:
– Each process is associated with the name of an
executable primitive
– Primitive environment maps primitive names to
primitives
• PrimitiveEnv = PrimitiveName → Primitive
• Primitive = P(Role × Value) → P(Role × Value)
– Graph factories to create new artifacts, new
processes …
Reproducibility Semantics
(Moreau 2010)
• An execution of an OPM graph results in
– A new OPM graph, describing re-execution
– A mapping between nodes of the original graph
and the resulting graph
• Execution proceeds by ordering processes
(assumes acyclicity) and re-executing them,
one by one; for each process executed, new
process node and new output artifacts are
created by factory
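The re-execution loop can be sketched as follows. This is a strongly simplified illustration, not the semantics from the cited paper: the dictionary encoding of processes, the function `reexecute`, the primed names for new nodes, and the `"division"` primitive are all assumptions. Primitives are modelled as functions from a role-to-value dictionary to a role-to-value dictionary, and instead of building a full new graph the sketch returns the recomputed artifact values plus the node mapping.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def reexecute(processes, inputs, primitive_env):
    """Re-execute processes in dependency order.

    processes: process id -> {"primitive": name,
                              "used": {role: artifact id},
                              "generated": {role: artifact id}}
    inputs: artifact id -> value, for artifacts not generated by any process
    primitive_env: primitive name -> function(dict role->value) -> dict role->value
    """
    producer = {a: p for p, d in processes.items() for a in d["generated"].values()}
    # A process depends on the processes that generated the artifacts it used.
    deps = {
        p: {producer[a] for a in d["used"].values() if a in producer}
        for p, d in processes.items()
    }
    values = dict(inputs)
    mapping = {}  # original node id -> id of the node created by re-execution
    for p in TopologicalSorter(deps).static_order():  # raises CycleError on cycles
        d = processes[p]
        args = {role: values[a] for role, a in d["used"].items()}
        results = primitive_env[d["primitive"]](args)
        mapping[p] = p + "'"
        for role, a in d["generated"].items():
            values[a] = results[role]
            mapping[a] = a + "'"
    return values, mapping


primitive_env = {
    "division": lambda args: {
        "quotient": args["dividend"] // args["divisor"],
        "rest": args["dividend"] % args["divisor"],
    }
}
processes = {
    "P": {
        "primitive": "division",
        "used": {"dividend": "A1", "divisor": "A2"},
        "generated": {"quotient": "A3", "rest": "A4"},
    }
}
values, mapping = reexecute(processes, {"A1": 17, "A2": 5}, primitive_env)
```

For the division example with inputs 17 and 5, `values["A3"]` is 3 and `values["A4"]` is 2, and `mapping` relates P, A3 and A4 to their re-executed counterparts.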
Temporal Semantics
(Kwasnikowska, Moreau, Van den Bussche 2010)
• Timepoints
– create(A): creation of artifact A
– begin(P), end(P): beginning and end of process P
– use(P,r,A): use of artifact A in role r, by process P
• Temporal theory Th(G) of a graph G is a set of inequalities: e.g.,
– begin(P)≤create(A) for any generated-by edge A → P
– create(A)≤end(P) for any used edge P → A
• Temporal interpretation of G is a triple (T, ≤, τ)
• A temporal interpretation satisfies u≤v if τ(u) ≤ τ(v)
• A temporal model of G is a temporal interpretation that
satisfies all inequalities from Th(G)
• Logical consequence G ⊨ u≤v if it is satisfied in every temporal
model of G.
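A small sketch of these definitions, restricted to the two example inequalities listed above (the full theory contains more): it builds a partial Th(G) from the edges and checks whether an assignment τ of integer times satisfies it. The function names and the string encoding of timepoints are assumptions for illustration.

```python
def temporal_theory(generated_by, used):
    """Build the example inequalities as a set of (u, v) pairs meaning u <= v.

    generated_by: set of (artifact, process) pairs, one per generated-by edge
    used: set of (process, artifact) pairs, one per used edge
    """
    theory = set()
    for a, p in generated_by:
        theory.add((f"begin({p})", f"create({a})"))
    for p, a in used:
        theory.add((f"create({a})", f"end({p})"))
    return theory


def satisfies(tau, theory):
    """Check that tau (timepoint -> integer time) satisfies every inequality."""
    return all(tau[u] <= tau[v] for (u, v) in theory)


theory = temporal_theory({("A2", "P")}, {("P", "A1")})
good_tau = {"begin(P)": 1, "end(P)": 5, "create(A1)": 0, "create(A2)": 3}
bad_tau = {"begin(P)": 1, "end(P)": 5, "create(A1)": 0, "create(A2)": 0}
```

`good_tau` is a temporal model of this fragment, whereas `bad_tau` is not, since it places create(A2) before begin(P).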
Temporal Semantics
(Kwasnikowska, Moreau, Van den Bussche 2010)
• OPM Inference: G ⊢ A → P
• Why this set of inference rules?
• Characterization of OPM inference rules in the
form of a soundness and completeness result
Cases not involving use-timepoints
– G ⊨ begin(P)≤create(A) iff G ⊢ A → P
Cases involving use-timepoints
– G ⊨ begin(P)≤use(Q,r,A) iff G ⊢ some pattern
Temporal Semantics
(Kwasnikowska, Moreau, Van den Bussche 2010)
Refinement of two OPM graphs
• Let us consider two OPM graphs G and H
• G is refined by H if, for any timepoints u,v of both G and H:
– If G ⊨ u≤v then H ⊨ u≤v
Causality Semantics
(Cheney 2010)
• Exploits Halpern and Pearl’s causal theory of
explanation
• The semantics of an OPM graph is a causal
function, mapping graph inputs to outputs
• Provenance semantics P f approximates locally
a function f, if for any u1, …, un
[[P f(u1, …, un)]]τ=fτ(u1, …, un)
for some intervention τ fixing some inputs of f
Workflow Semantics
(Missier and Goble 2010)
• Two functions:
– W2G: Workflow × Trace → OPM Graph
– G2W: OPM Graph → Workflow
• Two properties:
– Plausible workflow:
• W2G(G2W(g),T)=g
– Lossless-ness:
• G2W(W2G(w,T))=w
• Define W2G and G2W for Taverna workflow language
• Introduce annotations to be able to reconstruct
Taverna iterations
• In essence, provide a semantics for OPM by composing
G2W and Taverna semantics
Provenance Vocabulary Mappings
(Sahoo et al 2010)
OPM selected as the reference provenance model.
• First, because OPM is a general and broad model that
encompasses many aspects of provenance.
• Second, it already represents a community effort that spans
several years and is still ongoing, already benefiting from
many discussions, practical use, and several versions.
• Finally, many groups are already undergoing efforts to map
their vocabularies to OPM, and in addition there are
already some mappings (called profiles in OPM) developed
by the OPM group to some existing vocabularies.
Conclusions on OPM Semantics
• Four novel semantics of OPM published in
2010
• Deal with different subsets of OPM
• Not all fully “compatible” with OPM v1.1
• Grand theory of OPM is still an open problem
CONCLUSION AND OPEN ISSUES
Conclusions
• Over 14 teams have implemented the OPM
specification for a successful inter-operability
exercise PC3
• Open source governance model for OPM
• OPM v1.1 published and to be used in PC4
• OPM consists of a common core found in
many provenance vocabularies
• What beyond?
– Define useful profiles
– Finalize semantics
Open Issues (inter-operability)
• List of technical issues: agents, annotations,
time, streamed data, collections, mutable
objects
• How to express queries over OPM graphs?
• Security: attribution and non-repudiation
• API for recording and querying
• How to inter-operate in a distributed system?
Open Issues (research)
• Accounts
• Relations between accounts: refinement,
overlap, alternate
• Reasoning with conflicting provenance
• Reasoning with incomplete provenance
• Can we formalise profiles?