Transcript Slide 1
Pegasus: Planning for Execution in Grids
http://pegasus.isi.edu

Virtual Data Concepts
-- Capture and manage information about relationships among
   -- Data (of widely varying representations)
   -- Programs (& their execution needs)
   -- Computations (& execution environments)
-- Apply this information to, e.g.
   -- Discovery: data and program discovery
   -- Workflow: structured paradigm for organizing, locating, specifying, & requesting data
   -- Explanation: provenance
-- Research part of the NSF-funded GriPhyN project

[Architecture diagram: Pegasus command line clients (rls-client, rls-query-client, tc-client, genpoolconfig client) front the PEGASUS ENGINE (CPlanner / gencdag). The engine's logic draws on a replica query and registration mechanism (RLS, file, database), a Transformation Catalog mechanism (TC), and a resource information catalog (MDS). Pluggable site selectors: Round Robin, Random, Group, Min-Min, Max-Min, Grasp, file-based, and a non-Java callout; replica selection is likewise pluggable. Data transfer mechanisms: globus-url-copy, Transfer2, Multiple Transfer, Stork, and GridLab transfer. Submit writers: Condor, GridLab GRMS, and a Stork writer. The legend distinguishes existing interfaces and production implementations from interfaces in development and research implementations.]

Pegasus: Planning for Execution in Grids
-- Maps from abstract to concrete workflow
   -- Algorithmic and AI-based techniques
-- Automatically locates physical locations for both components (transformations) and data
   -- Uses Globus RLS and the Transformation Catalog
-- Finds appropriate resources to execute
   -- via Globus MDS
-- Reuses existing data products where applicable
-- Publishes newly derived data products
   -- Chimera virtual data catalog & MCS
-- Uses the Globus CoG Kit for authentication

[Mapping pipeline diagram: Abstract Workflow -> Check Resource Access (MDS, pool config) -> Reduce the Workflow (RLS: available data) -> Perform Site Selection (site selector, TC, MDS, pool config) -> Cluster Individual Jobs -> Add Transfer Nodes (replica selector, RLS) -> Write Submit Files -> fully instantiated workflow as DAGMan/Condor-G submit files.]
(Sketches of these mapping steps and of a round-robin site selector appear at the end of this transcript.)

Planning and Scheduling Granularity

Deferred Planning
[Diagram: a new abstract workflow is cut by a particular partitioning into partial workflows PW A, PW B, PW C, which are planned and executed one at a time.]

Re-Planning
-- Pegasus' log files record the sites considered.
[Diagram: after a grid setup job, each partition A is handled by Pegasus(A) = Submit(A) followed by DAGMan(Submit(A)), retried Y times. Pegasus(X): Pegasus generates the concrete workflow and the submit files for X, i.e. Submit(X). DAGMan(Submit(X)): DAGMan executes the concrete workflow for X.]
-- Leverage Condor's job retry mechanism to trigger a retry of a partition in case of failure.
-- Parse the Condor log files to determine the sites at which a job failed.
-- Subsequent invocations of Pegasus on the same partition are aware of the bad sites (a sketch of this retry loop appears at the end of this transcript).

-- Partitioning
   -- Sets the granularity for planning ahead.
-- Node Aggregation
   -- Combines nodes in the workflow so they are scheduled as one unit (see the aggregation sketch at the end of this transcript).
   -- Minimizes scheduling overhead and planning overhead.
-- Related but separate concepts
   -- Small jobs -> high level of node aggregation -> large partitions
   -- Very dynamic system -> small partitions

Pegasus Portal
[Deployment diagram: a submit host runs the Pegasus portal together with Pegasus, DAGMan, Chimera information, the Transformation Catalog, pool configuration info, and properties; it queries the MDS index and the RLS (RLI plus per-pool LRCs) and submits jobs through jobmanagers to several pools of resources, each with compute resources, a local MDS, a GridFTP server, and a storage system.]

People Involved:
USC/ISI: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi, James Blythe, Yolanda Gil
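Sketch of the mapping pipeline. The steps named on the pipeline slide (Check Resource Access, Reduce the Workflow, Perform Site Selection, Cluster Individual Jobs, Add Transfer Nodes, Write Submit Files) can be illustrated with a minimal Python sketch. This is not the real Pegasus implementation (Pegasus itself is Java); every name here (Job, Workflow, rls.lookup, tc.has_transformation, mds.available_sites, site_selector.choose) is a hypothetical stand-in for the corresponding catalog query.

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Job:
    name: str
    inputs: list
    outputs: list
    site: Optional[str] = None


@dataclass
class Workflow:
    jobs: list = field(default_factory=list)


def map_workflow(abstract, rls, tc, mds, site_selector):
    # 1. Check resource access: keep only the sites we are authorized to use.
    sites = [s for s in mds.available_sites() if mds.has_access(s)]

    # 2. Reduce the workflow: drop jobs whose outputs are already registered
    #    in the replica catalog (RLS), reusing existing data products.
    needed = [j for j in abstract.jobs
              if not all(rls.lookup(lfn) for lfn in j.outputs)]

    concrete = Workflow()
    for job in needed:
        # 3. Site selection: only sites where the transformation is installed
        #    (Transformation Catalog) are candidates; the pluggable selector
        #    picks one of them.
        candidates = [s for s in sites if tc.has_transformation(s, job.name)]
        job.site = site_selector.choose(job, candidates)

        # 4. (Clustering of small jobs into one unit is omitted here; see the
        #    aggregation sketch below.)

        # 5. Add transfer nodes for inputs not already present at the chosen site.
        for lfn in job.inputs:
            if job.site not in rls.lookup(lfn):
                concrete.jobs.append(
                    Job(name="stage_in_" + lfn, inputs=[lfn],
                        outputs=[lfn], site=job.site))
        concrete.jobs.append(job)

    # 6. A submit writer would then turn the concrete workflow into
    #    DAGMan/Condor-G submit files; newly derived products would also be
    #    registered back into the RLS.
    return concrete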
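Sketch of a site selector. Round Robin is one of the strategies listed on the architecture slide. A toy version, using the same hypothetical choose(job, candidates) interface as the sketch above (this is not the real Pegasus site selector API):

import itertools


class RoundRobinSiteSelector:
    # Cycles through the candidate sites so work is spread evenly; Random,
    # Group, Min-Min, and Max-Min strategies would plug into the same
    # hypothetical slot.
    def __init__(self):
        self._counter = itertools.count()

    def choose(self, job, candidates):
        if not candidates:
            raise RuntimeError("no site can run transformation " + job.name)
        return candidates[next(self._counter) % len(candidates)]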
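Sketch of node aggregation. Aggregation (clustering) groups several workflow nodes so the scheduler handles them as one unit, which is why small jobs call for a high level of aggregation. The grouping policy below is purely illustrative; the cluster size limit and the same-site rule are assumptions, not Pegasus defaults.

def aggregate(jobs, max_cluster_size=10):
    # Group consecutive jobs mapped to the same site into clusters of at most
    # max_cluster_size jobs; each cluster is then scheduled as one unit,
    # cutting scheduling and planning overhead.
    clusters, current = [], []
    for job in jobs:
        same_site = current and current[-1].site == job.site
        if current and (not same_site or len(current) >= max_cluster_size):
            clusters.append(current)
            current = []
        current.append(job)
    if current:
        clusters.append(current)
    return clusters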
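Sketch of the re-planning loop. The re-planning slide combines deferred planning with retry: each partition is planned just before it runs, failures are detected from the execution logs, and the failed sites are excluded on the next attempt. In the sketch below, plan(), run_dagman(), and parse_failed_sites() are hypothetical helpers standing in for Pegasus, DAGMan, and the Condor log parser; the retry count Y is a parameter.

def run_partition(partition, plan, run_dagman, parse_failed_sites, retries):
    # "Retry Y times": replan and rerun the same partition, avoiding sites
    # that earlier attempts recorded as bad.
    bad_sites = set()
    for _ in range(retries):
        submit_files = plan(partition, exclude_sites=bad_sites)  # Pegasus(X) -> Submit(X)
        log_file = run_dagman(submit_files)                      # DAGMan(Submit(X))
        failed = parse_failed_sites(log_file)                    # parse the Condor logs
        if not failed:
            return True
        bad_sites |= failed  # later invocations are aware of the bad sites
    return False


def run_workflow(partitions, plan, run_dagman, parse_failed_sites, retries=3):
    # Partitions (PW A, PW B, PW C, ...) run in order; the whole workflow
    # fails if any partition exhausts its retries.
    for pw in partitions:
        if not run_partition(pw, plan, run_dagman, parse_failed_sites, retries):
            raise RuntimeError("partition failed after all retries")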