Large-Scale Science Through Workflow Management
Ewa Deelman
Center for Grid Technologies, USC Information Sciences Institute
Acknowledgements
Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi (Center for Grid Technologies, ISI)
James Blythe, Yolanda Gil (Intelligent Systems Division, ISI)
http://pegasus.isi.edu
Research funded as part of the NSF GriPhyN, NVO, and SCEC projects and the EU-funded GridLab project.

Today's Scientific Applications
Increasing level of complexity
Use of individual application components
Reuse of individual intermediate data products (files)
Description of data products using metadata attributes
Execution environment is complex and very dynamic: resources come and go, data is replicated, and components can be found at various locations or staged in on demand
Separation between the application description and the actual execution description

Workflow Definitions
Workflow template: shows the main steps in the scientific analysis and their dependencies without specifying particular data products.
Abstract workflow: depicts the scientific analysis including the data used and generated, but does not include information about the resources needed for execution.
Concrete workflow: an executable workflow that includes details of the execution environment.

Workflow Evolution
Construct the analysis -> workflow template
Select the input data -> abstract workflow
Map the workflow onto available resources -> concrete workflow
Execute the workflow -> tasks executed on Grid resources
The first two steps are user-guided, drawing on a library of application components (component characteristics) and data catalogs (data properties); the mapping and execution steps are automated, drawing on information services (resource availability and characteristics).

Concrete Workflow Generation and Mapping
Application-dependent tools: the Compositional Analysis Tool (CAT) produces a workflow template; an input data selector, the Montage abstract workflow service, or Chimera produce the abstract workflow.
Application-independent tools: Pegasus maps the abstract workflow to a concrete workflow, which Condor DAGMan runs as jobs on Grid resources.

Pegasus: Planning for Execution in Grids
Maps from abstract to concrete workflow using algorithmic and AI-based techniques
Automatically locates physical locations for both workflow components and data
Finds appropriate resources to execute the components
Reuses existing data products where applicable
Publishes newly derived data products
Provides provenance information

Generating a Concrete Workflow
Information used: the location of files and component instances, and the state of the Grid resources.
Pegasus selects specific resources and files and adds the jobs required to form a concrete workflow that can be executed in the Grid environment (data movement and data registration). Each component in the abstract workflow is turned into an executable job.
Example: the abstract node "FFT filea" becomes a data transfer job (move filea from host1://home/filea to host2://home/file1), the compute job /usr/local/bin/fft /home/file1, and a data registration job.
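To make that expansion concrete, here is a minimal sketch of how one abstract component could be turned into stage-in, compute, and registration jobs. The dictionaries and helper below merely stand in for Pegasus's queries to the Replica Location Service and Transformation Catalog; the names, paths, and job records are illustrative and are not the actual Pegasus interfaces.

```python
# Illustrative sketch only: these dictionaries stand in for queries to the
# Replica Location Service (logical -> physical files) and the Transformation
# Catalog ((logical transformation, site) -> installed executable).
replica_catalog = {"filea": "host1://home/filea"}
transformation_catalog = {("FFT", "host2"): "/usr/local/bin/fft"}

def concretize(task, site, workdir="/home"):
    """Turn one abstract component into stage-in, compute, and register jobs."""
    jobs, local_paths = [], []
    for lfn in task["inputs"]:
        src = replica_catalog[lfn]                      # existing physical copy
        local = f"{workdir}/{lfn}"                      # path on the execution site
        jobs.append({"type": "transfer", "from": src, "to": f"{site}:/{local}"})
        local_paths.append(local)
    exe = transformation_catalog[(task["transformation"], site)]
    jobs.append({"type": "compute", "site": site, "cmd": [exe] + local_paths})
    for lfn in task["outputs"]:                         # publish what the job creates
        jobs.append({"type": "register", "lfn": lfn,
                     "pfn": f"{site}:/{workdir}/{lfn}"})
    return jobs

abstract_task = {"transformation": "FFT", "inputs": ["filea"], "outputs": ["filea.fft"]}
for job in concretize(abstract_task, site="host2"):
    print(job)
```

In the real system the physical file locations and installed executables come from the catalogs described in the next slide, and the transfers are typically carried out with GridFTP.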
Information Components used by Pegasus
Globus Monitoring and Discovery Service (MDS): locates available resources and finds their properties, both dynamic (load, queue length) and static (location of the GridFTP server, RLS, etc.).
Globus Replica Location Service (RLS): locates data that may be replicated and registers new data products.
Transformation Catalog (TC): locates installed executables.

Example Workflow Reduction
In the original abstract workflow, job d1 reads file a and produces file b, and job d2 reads b and produces file c. If "b" already exists (as determined by a query to the RLS), the workflow can be reduced: d1 is dropped, and only d2 (producing c from the existing b) remains.

Mapping from abstract to concrete
Pegasus queries the RLS, MDS, and TC, then schedules computation and data movement. The reduced workflow becomes: move b from site A to site B, execute d2 at B, move c from B to the user's site U, and register c in the RLS.
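A minimal sketch of this reduce-and-map step, assuming a plain dictionary stands in for the RLS and that the execution and user sites are fixed to B and U; the job names match the example above, but the URL and data structures are illustrative rather than actual catalog contents.

```python
# Illustrative sketch: a dictionary stands in for the RLS; file b already has a copy.
rls = {"b": "gsiftp://siteA/data/b"}

abstract_workflow = {
    "d1": {"inputs": ["a"], "outputs": ["b"]},
    "d2": {"inputs": ["b"], "outputs": ["c"]},
}

def reduce_workflow(workflow, replicas):
    """Workflow reduction: drop jobs whose outputs all exist already."""
    return {job: spec for job, spec in workflow.items()
            if not all(out in replicas for out in spec["outputs"])}

def plan(workflow, replicas, exec_site="B", user_site="U"):
    """Emit a concrete plan for this tiny example (no general dependency ordering)."""
    steps = []
    for job, spec in workflow.items():
        for lfn in spec["inputs"]:
            if lfn in replicas:                         # stage in existing data
                steps.append(f"move {lfn} from {replicas[lfn]} to {exec_site}")
        steps.append(f"execute {job} at {exec_site}")
        for lfn in spec["outputs"]:                     # stage out and publish results
            steps.append(f"move {lfn} from {exec_site} to {user_site}")
            steps.append(f"register {lfn} in the RLS")
    return steps

reduced = reduce_workflow(abstract_workflow, rls)       # d1 is pruned; d2 remains
for step in plan(reduced, rls):
    print(step)
```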
Pegasus Research
Resource discovery and assessment
Resource selection
Resource provisioning
Workflow restructuring: tasks merged together or reordered to improve overall performance
Adaptive computing: workflow refinement adapts to the changing execution environment

Benefits of the workflow & Pegasus approach
The workflow exposes the structure of the application and its maximum parallelism.
Pegasus can take advantage of that structure to set a planning horizon (how far into the workflow to plan) and to cluster a set of workflow nodes to be executed as one (for performance).
Pegasus shields the user from the Grid details.
Pegasus can run the workflow on a variety of resources, and can run a single workflow across multiple resources.
Pegasus can opportunistically take advantage of available resources (through dynamic workflow mapping).
Pegasus can take advantage of pre-existing intermediate data products.
Pegasus can improve the performance of the application.

Mosaic of M42 created on TeraGrid resources using Pegasus.
Pegasus improved the runtime of this application by 90% over the baseline case.
Bruce Berriman, John Good (Caltech); Joe Jacob, Dan Katz (JPL).

Future Directions
Support for workflows with real-time feedback to scientists: providing intermediate analysis results so that the experimental setup can be adjusted while short-lived samples or human subjects are available.

Cognitive Grids: Distributed Intelligent Reasoners that Incrementally Generate the Workflow
A user's request is refined over time and across levels of abstraction: relevant components, logical tasks, a full abstract workflow, and finally tasks bound to resources and sent for execution. The refinement draws on application-level knowledge, a policy reasoner, an ontology-based matchmaker, and workflow repair, while portions of the workflow are not yet executed, partially executed, or already executed.

BLAST
BLAST is a set of sequence comparison algorithms used to search sequence databases for optimal local alignments to a query. Two major runs were performed using Chimera and Pegasus:
1) 60 genomes (4,000 sequences each), selected from DOE-sponsored sequencing projects, processed in 24 hours: 67 CPU-days of processing time delivered, ~10,000 Grid jobs, >200,000 BLAST executions, and 50 GB of data generated.
2) 450 genomes processed.
Speedups of 5-20x were achieved because the compute nodes were used efficiently, by keeping the submission of jobs to the compute cluster constant.
Led by Veronika Nefedova (ANL) as part of the PACI Data Quest Expedition program.

Tomography (NIH-funded project)
Derivation of 3D structure from a series of 2D electron microscopic projection images; reconstruction and detailed structural analysis of complex structures such as synapses and large structures such as dendritic spines. The work involves acquisition and generation of huge amounts of data, and a large amount of state-of-the-art image processing is required to segment structures from the extraneous background.
Figure: dendrite structure to be rendered by tomography.
Work performed with Mark Ellisman, Steve Peltier, Abel Lin, Thomas Molina (SDSC).

LIGO's pulsar search at SC 2002
The pulsar search conducted at SC 2002 used LIGO's data collected during the first scientific run of the instrument and targeted a set of 1,000 locations of known pulsars as well as random locations in the sky. Results of the analysis were published via LDAS (LIGO Data Analysis System) to the LIGO Scientific Collaboration. The search was performed using LDAS and compute and storage resources at Caltech, the University of Southern California, and the University of Wisconsin-Milwaukee.
ISI people involved: Gaurang Mehta, Sonal Patil, Srividya Rao, Gurmeet Singh, Karan Vahi. Visualization by Marcus Thiebaux.

Southern California Earthquake Center
• The Southern California Earthquake Center (SCEC), in collaboration with the USC Information Sciences Institute, the San Diego Supercomputer Center, the Incorporated Research Institutions for Seismology, and the U.S. Geological Survey, is developing the Southern California Earthquake Center Community Modeling Environment (SCEC/CME).
• The goal is to create fully three-dimensional (3D) simulations of fault-system dynamics.
• Physics-based simulations can potentially provide enormous practical benefits for assessing and mitigating earthquake risks through Seismic Hazard Analysis (SHA).
• The SCEC/CME system is an integrated geophysical simulation modeling framework that automates the process of selecting, configuring, and executing models of earthquake systems.
Figure 1: Fréchet sensitivity kernel showing the travel path between a Yorba Linda earthquake and the TriNet station DLA.
Acknowledgments: Philip Maechling and Vipin Gupta, University of Southern California.