
Workload Management WP
Status and next steps
Massimo Sgaravatto
INFN Padova
Where we are

CMS-HLT use case (Monte Carlo production
and reconstruction) analyzed in terms of
GRID requirements and GRID tools availability


Discussions with Globus team and Condor team
Definition of a prototype architecture of workload
management system


Use of Globus and Condor mechanisms
But major developments needed
Prototype workload management system
architecture
[Diagram of the prototype architecture]
Submit jobs: condor_submit (Globus Universe) to Condor-G
Condor-G able to provide a reliable/crashproof job submission service
Master: resource discovery, using the info from the Grid Information Service (GIS)
Master chooses to which Globus resources the jobs must be submitted
Globus GRAM as uniform interface to different local resource management systems (CONDOR, LSF, PBS)
Farms at Site1, Site2, Site3, each behind a Globus GRAM
Where we are


Evaluating the existing components (D1.1) and “putting together” the
various building blocks
Evaluation of Globus


Collaboration with WP 1 of INFN-GRID project (Evaluation of the Globus
toolkit) http://www.infn.it/globus
Evaluation of Globus GRAM




Evaluation of Condor-G

The current implementation is a prototype


GRAM as uniform interface to different underlying resource management systems
Evaluation of RSL
“Cooperation” between GRAM and GIS
It works, but some problems must be solved
Globus + Condor-G tested with a real CMS MC production

Many memory leaks found in the Globus jobmanager !!!

Fixes (provided by Francesco Prelz) submitted to Globus team

Feedback received only on the bugs in the GAA and GSS modules (new fixes “merged” with the
original ones)
First deliverables



Month 3: Report on current technology
(report) D1.1
Month 6: Definition of architecture for
scheduling, resource management, security
and job description (report) D1.2
Month 9: Components and documentation for
the 1st release: initial workload management
system (prototype) D1.3
Proposed work plan

Let’s continue the implementation of the proposed
prototype



Evaluation of current technologies (Globus, Condor) (D1.1)
Functionalities for the 1st release
First release


We can propose the functionalities that could be
implemented
“Negotiation” in the ATF



To understand if these functionalities “address” the proposed
use cases
To understand if our module can be “plugged” together with
the other “pieces”
To understand if the other WPs can provide the functionalities
required by WP 1
Proposed functionalities for the
1st release





First version of job description language
(JDL)
First version of broker (master), that
decides where to submit the jobs
Job submission service
First version of logging and
bookkeeping services
First user interface
Job Description Language (JDL)

Used when the job is submitted, to specify


The application
The input data set

File ? Collection of files ? “Logical” or “physical” names ?

Needs to be discussed with WP 2, WP 8, ATF

Where the output data must be saved
(Required and preferable) resources
Info for bookkeeping
… ???

Prototype: Condor ClassAds
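
As a rough illustration of the items listed above, a job description can be thought of as a set of attribute/value pairs, in the spirit of Condor ClassAds. The sketch below uses a plain Python dictionary instead of real ClassAd syntax. Every attribute name and value in it is a hypothetical placeholder, not part of any agreed JDL.

```python
# Hypothetical job description: all attribute names and values are
# illustrative assumptions, not an agreed JDL schema.
job_description = {
    "executable": "cms_mc_production",          # the application
    "input_data": ["lfn:example-dataset"],      # "logical" names assumed here
    "output_location": "example-storage-site",  # where the output must be saved
    "requirements": {"architecture": "INTEL"},  # required resources
    "preferences": {"min_scratch_mb": 500},     # preferable resources
    "bookkeeping": {"production_id": "test-001"},
}

# Fields assumed to be mandatory for this sketch only.
REQUIRED_FIELDS = ("executable", "input_data", "output_location")


def missing_fields(description):
    """Return the required fields that the description does not define."""
    return [name for name in REQUIRED_FIELDS if name not in description]


print(missing_fields(job_description))
```

Whether the input data set is a file or a collection, and whether names are logical or physical, is left open here exactly as in the list above.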



Broker/Master

Choice of resource (farm) where to
submit job



Input: JDL expression
Output: computing resource choice
Published resource access lists
(gridmap-files in the Globus-based
prototype) are checked as a first step in
the resource match-making
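
The access-list check described above could be sketched as follows. The example assumes that each resource publishes the text of its gridmap-file and that every entry is a quoted certificate subject followed by a comma-separated list of local accounts. The resource names, subject and account below are hypothetical.

```python
import shlex


def parse_gridmap(text):
    """Map each certificate subject to its list of local account names."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        parts = shlex.split(line)         # handles the quoted subject name
        if len(parts) < 2:
            continue                      # skip malformed entries
        mapping[parts[0]] = parts[1].split(",")
    return mapping


def accessible_resources(user_subject, published_gridmaps):
    """Keep only the resources whose access list contains the user."""
    return [name for name, text in published_gridmaps.items()
            if user_subject in parse_gridmap(text)]


# Hypothetical published access lists for two farms.
gridmaps = {
    "farm-site1": '"/O=Grid/O=Example/CN=Jane Doe" cms001\n',
    "farm-site2": "# no users mapped yet\n",
}
print(accessible_resources("/O=Grid/O=Example/CN=Jane Doe", gridmaps))
```

Only the resources returned by this first step would then take part in the match-making proper.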
Broker/Master

The “accessible” computing resources are matched
with the job request according to:

Availability of the requested input data set
In the 1st release the broker will have to choose a resource
where this input data set is already available (we are not going
to “trigger” the replication of the input data set)

Availability of the appropriate application “sandbox”
If necessary, this sandbox could be “copied” and installed
in the executing farm when not already available there (“code
migration”) (in the 1st release ???)

Queue characteristics and status (architecture, etc…) vs.
job requests
Let’s start with a few, simple parameters

Availability of the requested amount of scratch space
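
A minimal sketch of how the matching criteria listed above could be combined, assuming that jobs and resources are described by simple dictionaries. All field names are hypothetical, and the final choice of the matching resource with the fewest queued jobs is an assumption made for the example, not a policy stated here.

```python
def matches(job, resource):
    """Check one resource against the criteria listed above."""
    has_data = set(job["input_data"]) <= set(resource["datasets"])
    has_sandbox = job["sandbox"] in resource["sandboxes"]
    right_arch = job["architecture"] == resource["architecture"]
    enough_scratch = resource["free_scratch_mb"] >= job["scratch_mb"]
    return has_data and has_sandbox and right_arch and enough_scratch


def choose_resource(job, resources):
    """Return the name of the least loaded matching resource, or None."""
    candidates = [r for r in resources if matches(job, r)]
    if not candidates:
        return None
    # Assumed ranking: fewest queued jobs wins.
    return min(candidates, key=lambda r: r["queued_jobs"])["name"]


# Hypothetical job and resource descriptions.
job = {"input_data": ["dataset-a"], "sandbox": "cms-sandbox",
       "architecture": "INTEL", "scratch_mb": 500}
resources = [
    {"name": "site1", "datasets": ["dataset-a"], "sandboxes": ["cms-sandbox"],
     "architecture": "INTEL", "free_scratch_mb": 2000, "queued_jobs": 12},
    {"name": "site2", "datasets": ["dataset-a"], "sandboxes": ["cms-sandbox"],
     "architecture": "INTEL", "free_scratch_mb": 800, "queued_jobs": 3},
    {"name": "site3", "datasets": [], "sandboxes": ["cms-sandbox"],
     "architecture": "INTEL", "free_scratch_mb": 9000, "queued_jobs": 0},
]
print(choose_resource(job, resources))
```

Here site3 is excluded because the input data set is not available there, which mirrors the 1st release restriction that no replication is triggered.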
Broker/Master


We assume that all the information needed
by the broker is “published” in one “Grid
Information Space” (GIS in the Globus-based
prototype) by the other WPs
Prototype: Condor matchmaking library


Match between the info published in the GIS and
the ClassAds defined in the JDL
A “translator” from GIS attributes to
ClassAds is necessary

Some work already done by Globus team ???
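
One possible shape for such a translator is sketched below, assuming the GIS entry has already been read into a Python dictionary. The GIS attribute names, the ClassAd attribute names and the mapping between them are all hypothetical. The output uses the simple "Name = value" one-attribute-per-line form of ClassAds, with strings in double quotes.

```python
def to_classad_value(value):
    """Render a Python value as a ClassAd literal (sketch only)."""
    if isinstance(value, bool):           # test bool before int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return '"' + str(value).replace('"', '\\"') + '"'


def gis_to_classad(attributes, name_map):
    """Translate the mapped GIS attributes into ClassAd-style lines."""
    lines = []
    for gis_name, classad_name in name_map.items():
        if gis_name in attributes:
            value = to_classad_value(attributes[gis_name])
            lines.append(f"{classad_name} = {value}")
    return "\n".join(lines)


# Hypothetical GIS entry and attribute-name mapping.
gis_entry = {"cpu-arch": "INTEL", "free-scratch-mb": 2048, "queue-open": True}
name_map = {"cpu-arch": "Arch",
            "free-scratch-mb": "FreeScratchMB",
            "queue-open": "QueueOpen"}
print(gis_to_classad(gis_entry, name_map))
```

Attributes that have no entry in the mapping are simply dropped, so the mapping table doubles as the list of parameters the broker is able to use.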
Job submission service


Input: job to submit + computing resource
choice (provided by broker)
Reliable, fault tolerant, crash proof service


Reliability of the executing machines is up to WP 4
Prototype: Condor-G


Submission of jobs to Globus resources (farms)
New implementation of Condor-G (+ new Globus
job manager) available soon
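
To make the hand-over from broker to submission service concrete, the sketch below generates a submit description for condor_submit in the Globus universe. The keywords (universe, globusscheduler, executable, queue) are the ones we understand Condor-G releases of this generation to use and should be checked against the installed version. The GRAM contact string, executable path and file names are hypothetical.

```python
def globus_submit_description(executable, gram_contact, arguments=""):
    """Build the text of a Globus-universe submit description (sketch)."""
    lines = [
        "universe = globus",
        f"globusscheduler = {gram_contact}",   # resource chosen by the broker
        f"executable = {executable}",
    ]
    if arguments:
        lines.append(f"arguments = {arguments}")
    lines += [
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical GRAM contact string returned by the broker.
text = globus_submit_description("/bin/hostname",
                                 "gram.site1.example.org/jobmanager-pbs")
print(text)
```

The generated text would be written to a file and passed to condor_submit. The reliability of the submission itself then rests on Condor-G.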
“Code” migration

Not easy at all !!!



A complex run-time environment must be “installed”
in the target farm
A STRONG collaboration with WP 8
(and WP 4) is necessary to define an “application
sandbox” that can easily be installed in one
farm and doesn’t “conflict” with other
sandboxes
Use of “application repositories” ???

When an application must be installed on one
farm, the sandbox is downloaded from such a
repository
Bookkeeping

Necessary to “record” for each job







Submitting user identity
Input data
Output data
Status of processing
Where and when the processing has been done
Other bookkeeping info specified in the JDL
…???
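
The per-job information listed above could be held in a record like the one sketched here. The class, its field names and the status values are hypothetical and only illustrate one possible layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BookkeepingRecord:
    """One record per job (sketch with assumed field names)."""
    user_identity: str                      # submitting user identity
    input_data: List[str]
    output_data: List[str] = field(default_factory=list)
    status: str = "submitted"               # status of processing
    resource: Optional[str] = None          # where the processing was done
    finished_at: Optional[str] = None       # when the processing was done
    extra: Dict[str, str] = field(default_factory=dict)  # other info from the JDL

    def mark_done(self, resource, finished_at, output_data):
        """Fill in the fields that are only known after processing."""
        self.status = "done"
        self.resource = resource
        self.finished_at = finished_at
        self.output_data = list(output_data)


record = BookkeepingRecord("/O=Grid/O=Example/CN=Jane Doe", ["dataset-a"],
                           extra={"production_id": "test-001"})
record.mark_done("site2", "2000-11-30T12:00:00", ["dataset-a-reco"])
print(record.status, record.resource)
```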
Logging

Necessary to keep track of the
significant events that occurred in the
system




Requests by users
Computing resource choice (by broker)
Submission to resource
…???
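
As a small sketch of such an event log, each significant event could be appended as one JSON line. The event kinds, field names and job identifier below are hypothetical.

```python
import json
import time


def log_event(log, kind, job_id, **details):
    """Append one timestamped event to the log and return it."""
    event = {"time": time.time(), "kind": kind, "job_id": job_id}
    event.update(details)
    log.append(json.dumps(event, sort_keys=True))
    return event


# In-memory list standing in for a persistent log.
event_log = []
log_event(event_log, "user_request", "job-1", user="jane")
log_event(event_log, "resource_choice", "job-1", resource="site2")
log_event(event_log, "submission", "job-1", resource="site2")

kinds = [json.loads(line)["kind"] for line in event_log]
print(kinds)
```

Keeping one self-describing line per event makes it straightforward to add further event types later.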
User Interface

Job management






Job submission
Job removal
Job status monitoring
Access to bookkeeping info
Access to logging info
…???