Large-Scale Science Through Workflow Management Ewa Deelman Center for Grid Technologies USC Information Sciences Institute.


Large-Scale Science
Through Workflow
Management
Ewa Deelman
Center for Grid Technologies
USC Information Sciences Institute
Acknowledgements




Ewa Deelman, Carl Kesselman, Gaurang
Mehta, Gurmeet Singh, Mei-Hui Su, Karan
Vahi (Center for Grid Technologies, ISI)
James Blythe, Yolanda Gil (Intelligent
Systems Division, ISI)
http://pegasus.isi.edu
Research funded as part of the NSF
GriPhyN, NVO and SCEC projects and EU-funded GridLab
Ewa Deelman, [email protected]
www.isi.edu/~deelman
pegasus.isi.edu
Today’s Scientific
Applications




Increasing level of complexity
Use of individual application components
Reuse of individual intermediate data products (files)
Description of Data Products using Metadata Attributes

Execution environment is complex and very dynamic
• Resources come and go
• Data is replicated
• Components can be found at various locations or staged in on demand

Separation between
• the application description
• the actual execution description
Workflow Definitions



Workflow template: shows the main steps in
the scientific analysis and their dependencies
without specifying particular data products
Abstract workflow: depicts the scientific
analysis including the data used and
generated, but does not include information
about the resources needed for execution
Concrete workflow: an executable workflow
that includes details of the execution
environment
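The three definitions above can be read as three levels of binding. The following is a minimal sketch of that idea, not the Pegasus data model: the class names and the output file name `fileb` are hypothetical, while `fft`, `filea`, `host2`, and the paths are borrowed from the example used later in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateStep:
    """Workflow template: an analysis step and its dependencies, no data named."""
    component: str
    depends_on: tuple = ()


@dataclass(frozen=True)
class AbstractTask:
    """Abstract workflow: the step bound to logical data products, no resources."""
    component: str
    inputs: tuple
    outputs: tuple


@dataclass(frozen=True)
class ConcreteJob:
    """Concrete workflow: the task bound to a site, an executable path, and files."""
    site: str
    executable: str
    arguments: tuple


# Template: only the analysis step is known.
template = TemplateStep(component="fft")

# Abstract workflow: logical file names are attached ("fileb" is hypothetical).
abstract = AbstractTask(component=template.component,
                        inputs=("filea",), outputs=("fileb",))

# Concrete workflow: execution details are filled in.
concrete = ConcreteJob(site="host2",
                       executable="/usr/local/bin/fft",
                       arguments=("/home/file1",))

print(template, abstract, concrete, sep="\n")
```

Each level adds information without changing the previous one, which is what allows the same template to be reused with different data and the same abstract workflow to be mapped onto different resources.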
Workflow Evolution
[Figure: Scientific Analysis. Construct the Analysis -> Workflow Template -> Select the Input Data -> Abstract Workflow -> Map the Workflow onto Available Resources -> Concrete Workflow -> Execute the Workflow -> Tasks to be executed -> Grid Resources]
Workflow Evolution: Scientific Analysis and Execution Environment
[Figure: the workflow evolution pipeline annotated with its information sources.
User guided: Construct the Analysis (component characteristics, Library of Application Components) -> Workflow Template; Select the Input Data (data properties, Data Catalogs) -> Abstract Workflow.
Automated: Map the Workflow onto Available Resources (resource availability and characteristics, Information Services) -> Concrete Workflow; Execute the Workflow -> Tasks to be executed -> Grid Resources]
Concrete Workflow Generation and Mapping
[Figure: components shown are the Compositional Analysis Tool (CAT), Workflow Template, Input Data Selector, Montage Abstract Workflow Service, Chimera, Abstract Workflow, Pegasus, Concrete Workflow, Condor DAGMan, jobs, Grid Resources, and Results. Labels distinguish the application-dependent part from the application-independent part.]
Pegasus:
Planning for Execution in
Grids

Maps from abstract to concrete workflow





Algorithmic and AI-based techniques
Automatically locates physical locations for
both workflow components and data
Finds appropriate resources to execute
Reuses existing data products where
applicable
Publishes newly derived data products

Provides provenance information
Generating a Concrete Workflow

Information
• location of files and component instances
• State of the Grid resources

Select specific
• Resources
• Files

Add jobs required to form a concrete workflow that can be executed in the Grid environment
• Data movement
• Data registration

Each component in the abstract workflow is turned into an executable job

[Figure: the abstract workflow node "FFT filea" becomes a concrete workflow with three jobs: DataTransfer (move filea from host1://home/filea to host2://home/file1), /usr/local/bin/fft /home/file1, and Data Registration]
Information Components
used by Pegasus

Globus Monitoring and Discovery Service (MDS)
• Locates available resources
• Finds resource properties
   • Dynamic: load, queue length
   • Static: location of GridFTP server, RLS, etc.

Globus Replica Location Service
• Locates data that may be replicated
• Registers new data products

Transformation Catalog
• Locates installed executables
Example Workflow Reduction

Original abstract workflow
a -> d1 -> b -> d2 -> c

If “b” already exists (as determined by query to the
RLS), the workflow can be reduced
b -> d2 -> c
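The reduction in this example can be sketched as a backward walk from the requested output. This is an illustrative sketch only, not the Pegasus implementation: the function name, the dictionary layout, and the `existing` set (standing in for the answer to an RLS query) are assumptions.

```python
def reduce_workflow(jobs, targets, existing):
    """Return the names of the jobs that still have to run.

    jobs:     dict mapping job name -> (input files, output files)
    targets:  files the user asked for
    existing: logical file names already available (stand-in for an RLS query)
    """
    # Map every file to the job that produces it.
    producer = {out: name for name, (_, outs) in jobs.items() for out in outs}

    needed = set()
    pending = list(targets)
    while pending:
        f = pending.pop()
        if f in existing or f not in producer:
            continue  # already available, or a raw input with no producing job
        job = producer[f]
        if job not in needed:
            needed.add(job)
            pending.extend(jobs[job][0])  # the job's inputs must be available too
    return needed


# The workflow from the slide: a -> d1 -> b -> d2 -> c
jobs = {"d1": (["a"], ["b"]), "d2": (["b"], ["c"])}

full = sorted(reduce_workflow(jobs, ["c"], existing={"a"}))
reduced = sorted(reduce_workflow(jobs, ["c"], existing={"a", "b"}))
print(full)     # ['d1', 'd2']
print(reduced)  # ['d2']
```

Walking backward from the target means a job is kept only if something downstream still needs its output, so d1 drops out as soon as b is found.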
Mapping from abstract to
concrete
Reduced abstract workflow
b -> d2 -> c

Query RLS, MDS, and TC, schedule computation and data
movement

Resulting concrete workflow
Move b from A to B -> Execute d2 at B -> Move c from B to U -> Register c in the RLS
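The mapping step for this example can be sketched with plain dictionaries standing in for the three information sources. Everything here is an assumption made for illustration: the function name, the dictionary shapes, the extra site `C`, the load values, and the least-loaded selection policy are not taken from Pegasus.

```python
def map_to_concrete(task, rls, tc, mds, output_site):
    """Map one abstract task onto a site and return the ordered concrete steps."""
    # Candidate sites have the transformation installed (TC) and are known to MDS.
    candidates = [s for s in tc[task["transformation"]] if s in mds]
    if not candidates:
        raise ValueError("no site can run " + task["transformation"])
    # Stand-in policy: pick the least-loaded candidate site.
    site = min(candidates, key=lambda s: mds[s]["load"])

    steps = []
    for f in task["inputs"]:
        source = rls[f]  # where a replica of the input currently lives
        if source != site:
            steps.append(f"Move {f} from {source} to {site}")
    steps.append(f"Execute {task['transformation']} at {site}")
    for f in task["outputs"]:
        if output_site != site:
            steps.append(f"Move {f} from {site} to {output_site}")
        steps.append(f"Register {f} in the RLS")
    return steps


rls = {"b": "A"}                                 # replica locations
tc = {"d2": ["B", "C"]}                          # sites where d2 is installed
mds = {"B": {"load": 0.2}, "C": {"load": 0.9}}   # hypothetical resource state
task = {"transformation": "d2", "inputs": ["b"], "outputs": ["c"]}

steps = map_to_concrete(task, rls, tc, mds, output_site="U")
for step in steps:
    print(step)
```

With these inputs the sketch reproduces the four jobs on the slide: move b from A to B, execute d2 at B, move c from B to U, and register c in the RLS.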
Pegasus Research




• resource discovery and assessment
• resource selection
• resource provisioning
• workflow restructuring
   • tasks merged together or reordered to improve overall performance
• adaptive computing
   • Workflow refinement adapts to changing execution environment
Benefits of the workflow &
Pegasus approach

The workflow exposes
• the structure of the application
• maximum parallelism of the application

Pegasus can take advantage of the structure to
• Set a planning horizon (how far into the workflow to plan)
• Cluster a set of workflow nodes to be executed as one (for performance)

Pegasus shields from the Grid details
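One way to see how workflow structure exposes parallelism and enables clustering is to group nodes by their depth in the dependency graph. The sketch below is an assumption-laden illustration, not how Pegasus clusters jobs: the example workflow, the function names, and the fixed cluster size are hypothetical, and the width of the widest level is only a simple approximation of the available parallelism.

```python
from collections import defaultdict


def levels(deps):
    """Group nodes by depth. deps maps each node to the list of its parents."""
    depth = {}

    def depth_of(node):
        if node not in depth:
            depth[node] = 1 + max((depth_of(p) for p in deps[node]), default=0)
        return depth[node]

    by_level = defaultdict(list)
    for node in deps:
        by_level[depth_of(node)].append(node)
    return {lvl: sorted(nodes) for lvl, nodes in sorted(by_level.items())}


def cluster(nodes, size):
    """Bundle nodes of one level into groups that would be submitted as one job."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


# Hypothetical workflow: one split, four independent tasks, one merge.
deps = {
    "split": [],
    "p1": ["split"], "p2": ["split"], "p3": ["split"], "p4": ["split"],
    "merge": ["p1", "p2", "p3", "p4"],
}

by_level = levels(deps)
max_parallelism = max(len(nodes) for nodes in by_level.values())
clusters = cluster(by_level[2], 2)

print(by_level)
print(max_parallelism)
print(clusters)
```

Nodes in the same level have no dependencies on each other, so they can run in parallel or be bundled together without changing the result.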
Benefits of the workflow
& Pegasus approach





Pegasus can run the workflow on a variety of resources
Pegasus can run a single workflow across multiple
resources
Pegasus can opportunistically take advantage of
available resources (through dynamic workflow mapping)
Pegasus can take advantage of pre-existing intermediate
data products
Pegasus can improve the performance of the
application.
Mosaic of M42 created on the TeraGrid resources using Pegasus
Pegasus improved the runtime of this application by 90% over the baseline case
Bruce Berriman, John Good (Caltech)
Joe Jacob, Dan Katz (JPL)
Future Directions
Support for workflows with real-time feedback to scientists.
Providing intermediate analysis results so that the
experimental setup can be adjusted while the short-lived
samples or human subjects are available.
Cognitive Grids: Distributed Intelligent Reasoners
that Incrementally Generate the Workflow
[Figure: starting from the User’s Request, workflow refinement proceeds over time through levels of abstraction: relevant components, logical tasks, full abstract workflow, and tasks bound to resources and sent for execution. Reasoners shown: application-level knowledge, policy reasoner, Onto-based Matchmaker, workflow repair. Workflow state shown: not yet executed, partial execution, executed.]
BLAST: set of sequence comparison algorithms that are used
to search sequence databases for optimal local alignments to a query
2 major runs were performed using Chimera and Pegasus:
1) 60 genomes (4,000 sequences each), processed in 24 hours
• Genomes selected from DOE-sponsored sequencing projects
• 67 CPU-days of processing time delivered
• ~10,000 Grid jobs
• >200,000 BLAST executions
• 50 GB of data generated
2) 450 genomes processed
Speedups of 5-20 times were achieved because the compute nodes were used
efficiently by keeping the submission of jobs to the compute cluster constant.
Led by Veronika Nefedova (ANL) as part of the PACI Data Quest Expedition program
Tomography (NIH-funded project)
• Derivation of 3D structure from a series of 2D electron microscopic projection images
• Reconstruction and detailed structural analysis
   • complex structures like synapses
   • large structures like dendritic spines
• Acquisition and generation of huge amounts of data
• Large amount of state-of-the-art image processing required to segment structures from extraneous background
[Figure: Dendrite structure to be rendered by Tomography]
Work performed with Mark Ellisman, Steve Peltier, Abel Lin, Thomas Molina (SDSC)
LIGO’s pulsar search at SC 2002
The pulsar search conducted at SC 2002
• Used LIGO’s data collected during the first scientific run of the instrument
• Targeted a set of 1000 locations of known pulsars as well as random locations in the sky
• Results of the analysis were published via LDAS (LIGO Data Analysis System) to the LIGO Scientific Collaboration
• Performed using LDAS and compute and storage resources at Caltech, University of Southern California, University of Wisconsin Milwaukee
ISI people involved: Gaurang Mehta, Sonal Patil, Srividya Rao, Gurmeet Singh, Karan Vahi
Visualization by Marcus Thiebaux
Southern California Earthquake
Center
• Southern California Earthquake Center (SCEC), in collaboration with the USC
Information Sciences Institute, San Diego Supercomputer Center, the
Incorporated Research Institutions for Seismology, and the U.S. Geological
Survey, is developing the Southern California Earthquake Center Community
Modeling Environment (SCEC/CME).
• Create fully three-dimensional (3D) simulations of fault-system dynamics.
• Physics-based simulations can potentially provide enormous practical benefits
for assessing and mitigating earthquake risks through Seismic Hazard Analysis
(SHA).
• The SCEC/CME system is an integrated geophysical simulation modeling
framework that automates the process of selecting, configuring, and executing
models of earthquake systems.
Acknowledgments: Philip Maechling and Vipin Gupta, University of Southern California
Figure 1: Fréchet sensitivity kernel showing travel path between a Yorba
Linda earthquake and the TriNet Station DLA.