Realizing the life science grid with Taverna


Tom Oinn, [email protected]
In general a grid system is, or should be:
“A collection of resources able to act collaboratively
in pursuit of an overall objective”
A life science grid is therefore:
“A collection of resources able to act collaboratively
to solve a problem in the life science domain”

• Massive diversity of:
  • Information classes
  • Services
  • Data
  • Problems
• Relatively small data sizes
• Relatively small computational load
• Challenge is complexity and heterogeneity
• Much scientific work is exploratory
  • Environment must be flexible and easy to reconfigure
  • Environment must provide facilities for provenance capture


• Existing diverse services
  • Web based, SOAP services, custom protocols such as BioMoby etc.
• Existing data resources
  • Relational, unstructured flat file, XML
  • May or may not be exposed through some kind of service interface, e.g. SRS, BioMart
• Existing user communities
  • Large, well funded service and research projects with substantial IT support
  • Small groups with no IT support, little funding but interesting problems
  • Experts in their domain
  • Little or no experience with distributed computing



• Most bioinformaticians are not computer scientists
• Generally not supported by dedicated CS groups
• Need to allow these users to make use of their existing expertise but remove concerns such as:
  • Parallelism
  • Distributed programming
  • Fault recovery
  • Job dispatch and submission
  • Provenance capture
  • Logging and auditing
“A collection of existing legacy and novel tools and databases exposed
through a variety of technologies able to act collaboratively to solve
a problem posed by an ‘IT naïve’ user in the life science domain
across the public internet and with little or no technological support
and as inexpensively as possible.”
Users typically have no control over services (provided by 3rd parties)
so create a client side integration platform.
Should be accessible to an unsupported PhD student with standard
networking, a three year old PC and no dedicated IT support.
http://taverna.sf.net
A ‘super client’ to a variety of disparate services on both the intranet and the internet




Project homepage : http://taverna.sf.net
myGrid project page : http://www.mygrid.org.uk
OMII-UK home : http://www.omii.ac.uk
Alberto’s Taverna + EBI mini tutorial :
http://www.ebi.ac.uk/Tools/webservices/tutorials/taverna

Taverna is:
• A workflow language based on a dataflow model.
• A graphical editing environment for that language.
• An invocation system to run instances of that language on data supplied by a user of the system.

When you download it you get all this rolled into a single piece of desktop software.
The enactor can be run independently of the GUI.
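
To make the phrase "dataflow model" concrete, here is a minimal sketch in Python of how a dataflow enactor might work: each processor runs once all of its inputs are available, and the links between ports determine the execution order. This is not Taverna's workflow language or enactor; the names `run_dataflow`, `processors`, `links` and the example processors are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dataflow(processors, links, inputs):
    """Run a tiny dataflow graph (illustrative only, not Taverna's enactor).

    processors: name -> callable taking keyword arguments (one per input port)
    links:      name -> {input port: name of the upstream value}
    inputs:     name -> value supplied by the user
    """
    values = dict(inputs)
    # Each processor depends on the sources wired to its input ports.
    graph = {name: set(ports.values()) for name, ports in links.items()}
    for name in TopologicalSorter(graph).static_order():
        if name in values:
            continue  # a workflow input, nothing to compute
        args = {port: values[src] for port, src in links[name].items()}
        values[name] = processors[name](**args)
    return values


# Hypothetical three-step workflow over a sequence string.
processors = {
    "reverse": lambda seq: seq[::-1],
    "length": lambda seq: len(seq),
    "report": lambda text, n: f"{text} ({n})",
}
links = {
    "reverse": {"seq": "sequence"},
    "length": {"seq": "sequence"},
    "report": {"text": "reverse", "n": "length"},
}
result = run_dataflow(processors, links, {"sequence": "ACGT"})
```

Nothing in the workflow definition states an execution order: `reverse` and `length` are independent and could run in parallel, which is the kind of concern a dataflow engine can take away from the user.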

Taverna can interoperate with the following by default:
• SOAP based web services
• Biomart data warehouses
• Soaplab wrapped command line tools
• BioMoby services and object constructors
• Inline interpreted scripting (Java based)

Other service classes can be added through an
extension point (but you probably don’t need to)
• Add a service to the services list by pointing Taverna to its Web Services Description Language (WSDL) document online
• Taverna inspects the WSDL and extracts its operations
• Add operations to the workflow; right click to automatically add document builders and splitters for doc/literal style services
• Use a nested workflow to define polling logic: the sub-workflow fails, waits and retries if data is not ready
[Figure: example workflow with the stages Document builders; Service invocation (creates job); Polling loop (check status, fail if not ready); Get results]
*SOAP is the Simple Object Access Protocol - http://www.w3.org/TR/soap/ & http://www.w3.org/TR/wsdl
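
The polling pattern on this slide (check status, fail if not ready, wait, retry, then get results) can be sketched outside Taverna in a few lines of Python. This is only an illustration of the control flow, not how Taverna implements nested workflows; `NotReady`, `poll_until_ready` and the `check_status` / `get_results` callables are hypothetical stand-ins for the service operations.

```python
import time


class NotReady(Exception):
    """Raised by the status check when the remote job has not finished."""


def poll_until_ready(check_status, get_results, retries=5, delay=1.0,
                     sleep=time.sleep):
    """Retry a failing status check, then fetch the results.

    check_status: callable that raises NotReady while the job is running
    get_results:  callable that returns the finished job's output
    retries:      maximum number of status checks (must be at least 1)
    delay:        seconds to wait between checks
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            check_status()          # the 'sub-workflow' fails if not ready
            return get_results()    # only reached once the check succeeds
        except NotReady:
            if attempt == retries:
                raise               # give up: propagate the failure
            sleep(delay)            # wait before the next attempt
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` is enough.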
• Soaplab services are added to the services palette by pointing Taverna at the root of the Soaplab installation.
• Individual services within that server are categorized and displayed within categories.
• Services support polling and provide links to metadata directly within Taverna.
[Figure labels: Soaplab server in services list; Individual tool within category; Soaplab services support rich descriptive metadata]
http://www.ebi.ac.uk/Tools/webservices/soaplab/guide



• BioMoby provides semantic description of services
• Taverna can use this to assist in service composition at design time
• All this is provided by the Moby team: Taverna’s extension architecture allows third party developers to contribute in a loosely coupled way

• Service discovery
  • Free text search over ‘known’ services.
  • Semantic search over service repository; relies on manual service annotation and submission of those annotations to the repository.
• Provenance tracking
  • Lineage tracking of result data.
  • Automatic semantic annotation of data from service annotations.
  • Possible as the workflow engine creates a ‘managed environment’ with an overview of all data movement.
• Result visualization
  • Common renderers included in the base distribution include 3D structure, images, graph rendering.
• Extensibility
  • New service classes
  • New renderer types
  • New UI elements





• Funded through the Open Middleware Infrastructure Institute (OMII-UK) as part of the myGrid project run by Carole Goble
• Four years old, funding secured through 2008 and beyond.
• Development team at Manchester & Hinxton, UK
• Wide group of ‘friends and allies’ across the world, particularly within UK eScience
• Implemented in Java, released under LGPL licence.
[Figure: OMII-UK organisation chart]
• OMII-UK (Virtual Institute): top level support, overall strategy, coordinated builds
• Groups (areas of expertise, project management, per-project support): OGSADAI (Edinburgh); myGrid (Manchester & EBI); OMII Stack (Southampton)
• Software products: Taverna, myExperiment, …; OGSA-DQP; OMII Server side stack, WS Security etc.; GridSAM job submission tool; …




• Science varies widely in scale both in space (CPU cycles required, storage, numbers of services etc.) and time (duration of collaborations, stability of VO membership)
• Current grid infrastructure is focused on projects with large spatial and temporal scale
• Does this existing work map well to scientific problems with different characteristics, especially different temporal characteristics?
• What about security…?


• A workflow can access multiple resources
• These resources can have arbitrary security constraints
• It is likely that a given workflow requires more than one principal to be available to complete.
• How can we make multiple security agents available to the workflow engine in a principled fashion?
• Define the basic unit of a virtual experiment or fast virtual organization to map directly to a peer group within a peer to peer framework
• Peer group contains a workflow instance along with any resources required to enact that instance including arbitrarily many security agents, data stores, metadata stores etc.
• Services accessed by the workflow may (and usually will) exist outside of the peer group.
[Figure: a Peer Group (Virtual Experiment) containing a workflow instance, a data manager, policies, policy engines and sets of credentials; external tools, data and services (grid service, web service, REST service, ...) lie outside the peer group]



• A Virtual Experiment (VE) is created by the construction of a new peer group within the P2P framework
• Resources such as workflow engines, data managers and security agents exist as factory services.
• Each factory can construct a limited version of itself
  • Workflow engines with specific workflow definitions loaded
  • Data managers with specific levels of storage space
  • Security agents with policies to restrict full use of credentials
• These limited proxy objects connect to the peer group
  • This is a secured operation but as there is no delegation existing security mechanisms are adequate to get this far
• Factories may be on the intranet or internet (most likely for workflow services) or on the user’s workstation, PDA or cellphone (for security agents).
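
The factory idea above can be illustrated with a small Python sketch: a factory holding a full credential hands out a limited proxy whose policy restricts where that credential may be used, and the proxy then joins the peer group. This is a sketch of the concept only and does not describe any real P2P framework or Taverna API; `PeerGroup`, `SecurityAgentFactory`, `LimitedSecurityAgent` and the service names are hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class PeerGroup:
    """A Virtual Experiment: a uniquely identified set of member objects."""
    members: list = field(default_factory=list)
    group_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def join(self, obj):
        """Connect a limited proxy object to this peer group."""
        self.members.append(obj)
        return obj


@dataclass(frozen=True)
class LimitedSecurityAgent:
    """Proxy that only releases its credential for services its policy allows."""
    allowed_services: frozenset
    _credential: str = field(repr=False)  # kept out of repr() output

    def credential_for(self, service):
        if service not in self.allowed_services:
            raise PermissionError(f"policy forbids use with {service!r}")
        return self._credential


class SecurityAgentFactory:
    """Holds the full credential, e.g. on the user's own workstation."""

    def __init__(self, credential):
        self._credential = credential

    def create_limited(self, allowed_services):
        # The proxy carries a policy restricting full use of the credential.
        return LimitedSecurityAgent(frozenset(allowed_services), self._credential)
```

For example, `PeerGroup().join(SecurityAgentFactory("token").create_limited({"blast"}))` yields an agent that will release the credential for the hypothetical `"blast"` service and raise `PermissionError` for anything else.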



• A VE becomes collaborative when more than one user can access the objects within the peer group.
• A VE uses collaborative security when more than one user inserts a security agent into the peer group.
• Note that the peer group structure also allows multiple views on the same VE as objects can exist in more than one peer group.
  • For example, you could split the workflow instance into a monitoring and steering component and give some users access to a peer group containing both and others to one containing only the monitoring part.
• The peer group has a unique identity which can be used to discover or register it with any registry service available.
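
The monitoring/steering example can be sketched as follows, assuming for illustration that a peer group is just a named collection of shared objects. The classes `WorkflowInstance`, `Monitor` and `Steering` and the function `build_views` are hypothetical; the point is only that the same monitoring object lives in two groups, so both sets of users observe one underlying workflow instance.

```python
class WorkflowInstance:
    """Stand-in for a running workflow with a single piece of state."""

    def __init__(self):
        self.state = "running"


class Monitor:
    """Read-only component: can observe the workflow but not change it."""

    def __init__(self, workflow):
        self._workflow = workflow

    def status(self):
        return self._workflow.state


class Steering:
    """Control component: can change the workflow's state."""

    def __init__(self, workflow):
        self._workflow = workflow

    def pause(self):
        self._workflow.state = "paused"


def build_views(workflow):
    """Return (full_group, monitor_only_group) sharing one Monitor object."""
    monitor, steering = Monitor(workflow), Steering(workflow)
    full_group = {"monitor": monitor, "steering": steering}
    monitor_only_group = {"monitor": monitor}
    return full_group, monitor_only_group
```

A user given only the second group can watch progress, while a steering action taken through the first group is immediately visible to both.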


• Taverna2 under development, delivery by the end of 2007
• Rewrite of Taverna to support, amongst other things:
  • Integration with grid technologies through a set of new extensibility points
  • Transient VO management (short lived virtual organizations, 20 second upwards lifetime!)
  • More sophisticated computational model
  • Massive scalability, pipelining of nested token streams, single threaded execution model, transparent reference passing architecture
  • Monitoring and steering of running processes with arbitrary granularity through an extension point

• Implement extensions to interface to your grid
• Get a free and well supported rich client portal for non expert users
• Access otherwise out of reach user communities

• If you have a grid with resources that our community could use
  • Talk to us, tell us about it
  • Write a plugin for its resource broker, data system or security model
• If you have a scientific community who wants to access such resources
  • Again, please let us know
  • We can provide on site training
  • We are always interested in new application areas for our work
• I can be contacted at [email protected], or for more general discussion please join the mailing lists linked from http://taverna.sf.net
Please see http://www.mygrid.org.uk/wiki/Mygrid/Acknowledgements for most up to date list