
The Workload Management System in the DataGrid project
Massimo Sgaravatto
INFN Padova
The EDG Workload Management System – n° 1
Grid vision
“Dependable, consistent, pervasive access to resources”

Enable communities (“virtual organizations”) to share
geographically distributed resources as they pursue common goals in the absence of
central control, omniscience, trust relationships

Make it easy to use diverse, geographically distributed, locally managed and
controlled computing facilities as if they formed a coherent local cluster

People have been discussing the Grid for several years …

… but until a few years ago, more or less only the Globus Toolkit was available

Globus Toolkit: core services for Grid tools and applications (authentication, information
service, resource management, etc.)
Good basis to build on, but:

• No higher-level services

• Many problems (e.g. handling of lots of data) not addressed

• No production-quality implementations

Not possible to do real work with Grids yet …
EU DataGrid


DataGrid is a project funded by the European Union whose objective is to build
and exploit the next-generation computing infrastructure, providing intensive
computation and analysis of shared large-scale databases
Enable data-intensive sciences by providing worldwide Grid testbeds to
large distributed scientific organizations (“Virtual Organizations”, VOs)

Duration: Jan 1, 2001 - Dec 31, 2003

Applications/End Users Communities: HEP, Earth Observation, Biology

Specific Project Objectives:

Middleware for fabric & grid management

Large scale testbed

Collaborate and coordinate with other projects

Contribute to Open Standards and international bodies
DataGrid Main Partners

CERN – International (Switzerland/France)

CNRS - France

ESA/ESRIN – International (Italy)

INFN - Italy

NIKHEF – The Netherlands

PPARC - UK
Assistant Partners
Industrial Partners
•Datamat (Italy)
•IBM-UK (UK)
•CS-SI (France)
Research and Academic Institutes
•CESNET (Czech Republic)
•Commissariat à l'énergie atomique (CEA) – France
•Computer and Automation Research Institute,
Hungarian Academy of Sciences (MTA SZTAKI)
•Consiglio Nazionale delle Ricerche (Italy)
•Helsinki Institute of Physics – Finland
•Institut de Fisica d'Altes Energies (IFAE) - Spain
•Istituto Trentino di Cultura (IRST) – Italy
•Konrad-Zuse-Zentrum für Informationstechnik Berlin - Germany
•Royal Netherlands Meteorological Institute (KNMI)
•Ruprecht-Karls-Universität Heidelberg - Germany
•Stichting Academisch Rekencentrum Amsterdam (SARA) – Netherlands
•Swedish Research Council - Sweden
DataGrid Work Packages

The EDG collaboration is structured in 12 Work Packages

WP1: Workload Management System
WP2: Data Management
WP3: Grid Information and Monitoring
WP4: Fabric Management
WP5: Storage Element / Storage Resource Manager
WP6: Testbed and demonstrators
WP7: Network Monitoring
WP8: High Energy Physics Applications
WP9: Earth Observation
WP10: Biology
WP11: Dissemination
WP12: Management
WP1 Task
• The objective of the first DataGrid work package was, and still is (according
  to the project "Technical Annex"):
  To define and implement a suitable architecture for distributed
  scheduling and resource management in a Grid environment
• Many challenging issues:
  • Large heterogeneous environments
  • Large numbers (thousands) of independent users
  • Optimizing the choice of execution location based on the availability of
    data, computation and network resources
  • Uniform interface to possibly different local resource management
    systems under different administrative domains
  • Policies on resource usage
  • Reliability, scalability, …
  • …
EDG Tutorial Overview
Workload Management Services
Data Management Services
Networking
Information Service
Fabric Management
WP1 teams
• INFN
  • INFN Catania
  • INFN CNAF (Bologna)
  • INFN Milano
  • INFN Padova
  • INFN Pisa
  • INFN Torino
• CESNET (Czech Republic)
• Datamat SpA (Rome)
• Imperial College (UK)
Approach
• We needed much more experience with the various Grid issues
• The application requirements were not completely defined yet
  • They evolved as more familiarity with the Grid model was acquired
• Fast prototyping instead of a classic top-down approach
• Implementation of a first prototype Workload Management System (WMS)
  • By integrating existing tools
    • Globus
    • Condor
  • And by implementing new middleware
Functionalities of the first WMS

• Lightweight User Interface (UI) to submit jobs and control them
  • Also allows a "small" amount of data to be transferred between the client machine
    and the executing machine (input/output sandboxes)
• Job characteristics and requirements described via an appropriate Job
  Description Language (JDL)
• WP1's Resource Broker (RB) chooses an appropriate computing resource
  (Computing Element, CE) for the job, based on the constraints specified in
  the JDL and on the status of the Grid
• RB strategy is to send the job to an appropriate CE:
  • Where the submitting user has proper authorization
  • That matches the characteristics specified in the JDL (architecture, computing
    power, application environment, etc.)
  • Where the specified input data (and possibly the chosen output Storage Element)
    are determined to be "close enough"
• Throughout this process, WP1's Logging and Bookkeeping services
  maintain a "state machine" view of each job
First WMS: players and flows
dg-job-submit myjob.jdl

myjob.jdl:
Executable = "$(CMS)/exe/sum.exe";
InputData = "LF:testbed0-00019";
ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
DataAccessProtocol = "gridftp";
InputSandbox = {"$(CMS)/exe/sum.exe", "/home/user/DATA/*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.Architecture == "INTEL" && other.OpSys == "LINUX Red Hat 6.2";
Rank = other.FreeCPUs;
Experiences with the first WMS
• First Workload Management System deployed in the EDG testbed at
  the end of the first year of the project
• Application users have now been working with this first release of the WMS
  for about a year and a half
  • Stress tests and quasi-production activities
    • CMS stress tests
    • ATLAS efforts
    • …
• Significant achievements exploited by the experiments …
• … but also various problems were spotted
  • Impacting in particular the reliability and scalability of the system
Review of WP1 WMS architecture
WP1 Workload Management System architecture reviewed
• To apply the “lessons” learned and address the shortcomings that emerged
  with the first release of the software, in particular:
  • To increase reliability
  • To address the scalability problems
• To support new functionalities
• To favor interoperability with other Grid frameworks, by allowing WP1
  modules (e.g. RB) to be exploited also “outside” the EDG WMS
Improvements wrt first rel. of WMS
• Reliability and scalability problems addressed
  • No longer a monolithic, long-lived process
    • Some functionalities (e.g. matchmaking) delegated to pluggable modules
    • Less exposed to memory leaks (coming not only from EDG software)
  • No more multiple job info repositories
    • No more job status inconsistencies, which caused problems
• Techniques to quickly recover from failures
• Reliable communications among components
  • Done via the file system (filequeues)
  • For example, jobs are not lost if the target entity is temporarily down:
    when it restarts, it gets and processes the jobs
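The idea of communicating through the file system can be illustrated with a minimal Python sketch. This is not the EDG filequeue implementation: the `FileQueue` class, the one-file-per-request layout and the JSON encoding are assumptions made only for illustration. It shows why requests survive while the consumer is down: they stay on disk until a consumer processes and removes them.

```python
import json
import os
import tempfile
import uuid


class FileQueue:
    """Hypothetical directory-backed queue: one file per pending request."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self._counter = 0  # orders the requests written by this producer

    def put(self, request):
        # Write under a temporary name, then rename: os.replace is atomic,
        # so a reader never sees a half-written request file.
        self._counter += 1
        name = f"{self._counter:08d}-{uuid.uuid4().hex}.req"
        tmp_path = os.path.join(self.directory, name + ".tmp")
        with open(tmp_path, "w") as handle:
            json.dump(request, handle)
        os.replace(tmp_path, os.path.join(self.directory, name))

    def drain(self, process):
        # Handle pending requests in file-name order; a request file is
        # removed only after it has been processed successfully.
        handled = 0
        for name in sorted(os.listdir(self.directory)):
            if not name.endswith(".req"):
                continue
            path = os.path.join(self.directory, name)
            with open(path) as handle:
                process(json.load(handle))
            os.remove(path)
            handled += 1
        return handled


def demo():
    with tempfile.TemporaryDirectory() as spool:
        # The producer enqueues jobs while the consumer is "down".
        producer = FileQueue(spool)
        producer.put({"job": "job-1"})
        producer.put({"job": "job-2"})

        # The consumer (re)starts later and still finds the requests on disk.
        received = []
        consumer = FileQueue(spool)
        handled = consumer.drain(received.append)
        return handled, [request["job"] for request in received]
```

Calling `demo()` enqueues two requests before any consumer exists and then drains them from a separately created queue object, which stands in for a restarted component.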
Improvements wrt first rel. of WMS
• Flexibility and interoperability increased
  • Much more feasible to exploit the Resource Broker also outside the
    DataGrid WMS
  • Much easier to implement and “plug” into the system a module
    implementing the chosen scheduling strategy, defined according to
    one’s own needs and requirements
• Glue Schema for Information Services to describe Grid resources
  • Common schema agreed between US and EU High Energy Physics Grid
    projects
• Various rel. 1 problems fixed
• Various enhancements in design and implementation in the various
  modules
  • Also due to enhancements in the underlying software (e.g. Condor and
    Globus)
WP1 WMS reviewed architecture
Details in EDG deliverable D1.4 …
Job submission
[Figure: components involved in job submission. UI; RB node (Network Server, Workload Manager, Job Controller/CondorG); RLS; Information Service; Computing Element and Storage Element, each publishing its characteristics and status]
Job submission: step by step
[Figure sequence: the same diagram is repeated for every step. Components shown: UI; RB node (Network Server, Workload Manager, MatchMaker/Broker, Job Adapter, Job Controller/CondorG, RB storage); RLS (Replica Catalog); Information Service; Computing Element and Storage Element, each publishing its characteristics and status. A "Job Status" column grows as the job progresses.]

UI: allows users to access the functionalities of the WMS

edg-job-submit myjob.jdl

myjob.jdl:
JobType = "Normal";
Executable = "$(CMS)/exe/sum.exe";
InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.GlueHostOperatingSystemName == "linux" &&
               other.GlueHostOperatingSystemRelease == "Red Hat 6.2" &&
               other.GlueCEPolicyMaxWallClockTime > 10000;
Rank = other.GlueCEStateFreeCPUs;

Job Description Language (JDL) to specify job characteristics and requirements
Job status: submitted

NS: network daemon responsible for accepting incoming requests
The job and its Input Sandbox files go from the UI to the RB storage
Job status: waiting

WM: responsible to take the appropriate actions to satisfy the request

MatchMaker/Broker: where must this job be executed?
Matchmaker: responsible to find the “best” CE where to submit a job
• RLS: where are (which SEs) the needed data?
• Information Service: what is the status of the Grid?
Result: CE choice

JA: responsible for the final “touches” to the job before performing submission
(e.g. creation of wrapper script, etc.)
Job status: ready

JC: responsible for the actual job management operations (done via CondorG)
The job and its Input Sandbox files reach the Computing Element
Job status: scheduled

The job runs on the Computing Element, with “Grid enabled” data transfers/accesses to the Storage Element
Job status: running

The Output Sandbox files go back to the RB storage
Job status: done

edg-job-get-output <dg-job-id>
The Output Sandbox files are retrieved to the UI
Job status: cleared
Job monitoring
edg-job-status <dg-job-id>
edg-job-get-logging-info <dg-job-id>
[Figure: job monitoring. UI; RB node (Network Server, Workload Manager, Job Controller/CondorG, Log Monitor, Logging & Bookkeeping); Computing Element. Flows shown: log of job events, job status]
LB: receives and stores job events; processes corresponding job status
LM: parses the CondorG log file (where CondorG logs info about jobs) and notifies the LB
Logging and Bookkeeping (LB) service
• Stores logging and bookkeeping information concerning events
  generated by the various components of the WMS (“push” model)
• Using this information, the LB service keeps a state machine view of
  each job
• Extended querying capabilities
  • E.g. give me all jobs marked as ‘XYZ’ (user tag) and running on CE1 or CE2
• Possible to have more LB servers per WMS
  • Could be useful in case of LB overload
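The "push" model, the per-job state machine view and the tag-based query can be sketched in a few lines of Python. This is only an illustration: the `Bookkeeping` class, its method names and the rule that a late event never moves a job backwards are assumptions, and only the states named in the job submission walk-through are modelled (the real service knows more states).

```python
# States in the order in which they appear in the job submission walk-through.
STATES = ["submitted", "waiting", "ready", "scheduled", "running", "done", "cleared"]


class Bookkeeping:
    """Hypothetical in-memory stand-in for an LB server."""

    def __init__(self):
        self.events = []  # every event pushed by a component, in arrival order
        self.jobs = {}    # job id -> {"state", "ce", "tags"}

    def log_event(self, job_id, new_state, source, ce=None, tags=()):
        # "Push" model: components call this; the server never polls them.
        self.events.append((job_id, new_state, source))
        job = self.jobs.setdefault(job_id, {"state": None, "ce": None, "tags": set()})
        job["tags"].update(tags)
        if ce is not None:
            job["ce"] = ce
        # Assumed rule: only move forward, so a late event cannot roll back.
        current = -1 if job["state"] is None else STATES.index(job["state"])
        if STATES.index(new_state) > current:
            job["state"] = new_state

    def status(self, job_id):
        return self.jobs[job_id]["state"]

    def query(self, tag, state, ces):
        # "All jobs marked with `tag` and in `state` on one of `ces`".
        return sorted(
            job_id for job_id, job in self.jobs.items()
            if tag in job["tags"] and job["state"] == state and job["ce"] in ces
        )


def demo():
    lb = Bookkeeping()
    lb.log_event("job-a", "submitted", "UI", tags=["XYZ"])
    lb.log_event("job-a", "waiting", "NetworkServer")
    lb.log_event("job-a", "ready", "WorkloadManager", ce="CE1")
    lb.log_event("job-a", "scheduled", "JobController")
    lb.log_event("job-a", "running", "LogMonitor")
    lb.log_event("job-a", "scheduled", "JobController")  # late, out-of-order event
    lb.log_event("job-b", "submitted", "UI", tags=["XYZ"])
    lb.log_event("job-b", "running", "LogMonitor", ce="CE3")
    lb.log_event("job-c", "running", "LogMonitor", ce="CE2")
    return lb.status("job-a"), lb.query("XYZ", "running", {"CE1", "CE2"})
```

In `demo()`, only `job-a` carries the tag, is running, and sits on one of the requested CEs, so it is the only job returned by the query.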
Possible job states
Logging and Bookkeeping (LB) service
• LB components:
  • Local logger: responsible for accepting messages from their sources and
    for passing them to the inter-logger
    • Information flow implemented on top of inter-process communication
      mechanisms and backed up by a log file that allows a correct recovery of
      the inter-logger if some problems occur
  • Inter-logger: responsible for forwarding them to the bookkeeping servers
    • The inter-logger, running as a separate process, makes the logging procedure
      robust with respect to local and network faults
  • Bookkeeping server: accepts messages from the inter-logger and saves
    them on its permanent storage
    • Supports queries (in particular job status queries) generated by a consumer
      API
LB architecture
User Interface (UI)
• Allows users to access the functionalities of the WMS
  • To submit jobs
  • To see the suitable resources for a job (without submitting it)
  • To cancel a job
  • To see the status of a job
  • To retrieve the output of a job
  • …
• Interfaces
  • Command line interface (Python)
  • C++ and Java API
  • GUI
GUI
Job preparation: JDL
• Information to be specified when a job has to be submitted:
  • Job characteristics
    • Executable
    • StdInput, StdOutput, StdError
    • Environment variables
    • Input Sandbox and Output Sandbox
    • …
  • Job requirements and preferences (rank) on the computing resources
    • Specified using attributes of resources published in the Information Service
    • E.g. requirements/preferences on the architecture type, OS, memory,
      software dependencies, …
  • Job data requirements
    • Input Data
    • Protocols spoken by application to access data
    • Output Storage Element
Job preparation: JDL
• Information specified using a Job Description Language (JDL)
  • Based upon Condor’s CLASSified ADvertisement language (ClassAd)
  • Fully extensible language
• A ClassAd
  • Constructed with the classad construction operator []
  • It is a sequence of attributes separated by semicolons
  • An attribute is a pair (key, value), where value can be a Boolean, an
    Integer, a list of strings, …
    <attribute> = <value>;
• So, the JDL allows the definition of a set of attributes, which the WMS takes
  into account when making its scheduling decision
Example of JDL File
[
JobType = "Normal";
Executable = "gridTest";
StdError = "stderr.log";
StdOutput = "stdout.log";
InputSandbox = {"home/joda/test/gridTest"};
OutputSandbox = {"stderr.log", "stdout.log"};
InputData = {"lfn:green", "guid:red"};
DataAccessProtocol = "gridftp";
Requirements = other.GlueHostOperatingSystemName == "LINUX"
               && other.GlueCEStateFreeCPUs >= 4;
Rank = other.GlueCEPolicyMaxCPUTime;
]
Workload Manager
[Figure: job submission diagram (job status: submitted, waiting) with the Workload Manager highlighted on the RB node]
WM: responsible to take the appropriate actions to satisfy the request
Workload Manager (WM)
• Given a valid request coming from the UI, it has to take the
  appropriate actions to satisfy it
• It may need support from other components, which are specific to
  the different request types
  • All these components that offer support to the Workload Manager
    provide a class whose interface is inherited from a Helper class
  • Essentially the Helper, given a JDL expression, returns a modified one,
    which represents the output of the required action
  • E.g. if the request was to find a suitable resource for a job, the input
    JDL expression is the one specified by the user at submission time, and
    the output is the JDL expression augmented with the resource choice
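The Helper pattern described above (JDL expression in, modified JDL expression out) might look roughly as follows in Python. The real helpers are classes of the WMS code base; everything named here (`Helper.resolve`, `FixedResourceHelper`, `WorkloadManager.register`, the `ChosenCE` attribute and the host name) is hypothetical, and a JDL expression is represented as a plain dictionary.

```python
class Helper:
    """Hypothetical base class: takes a job description, returns a modified copy."""

    def resolve(self, jdl):
        raise NotImplementedError


class FixedResourceHelper(Helper):
    """Toy helper that "chooses" a resource by adding one attribute."""

    def __init__(self, ce_id):
        self.ce_id = ce_id

    def resolve(self, jdl):
        augmented = dict(jdl)               # leave the user's description untouched
        augmented["ChosenCE"] = self.ce_id  # hypothetical attribute name
        return augmented


class WorkloadManager:
    """Dispatches each request type to the helper registered for it."""

    def __init__(self):
        self.helpers = {}

    def register(self, request_type, helper):
        self.helpers[request_type] = helper

    def handle(self, request_type, jdl):
        return self.helpers[request_type].resolve(jdl)


def demo():
    wm = WorkloadManager()
    wm.register("match", FixedResourceHelper("ce01.example.org"))
    user_jdl = {"Executable": "gridTest", "Rank": "other.GlueCEStateFreeCPUs"}
    result = wm.handle("match", user_jdl)
    return user_jdl, result
```

Because every helper exposes the same interface, the Workload Manager does not need to know what a given helper does; it only passes the description through and gets an augmented one back.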
Resource Broker
[Figure: job submission diagram (job status: submitted, waiting) with the MatchMaker/Broker highlighted on the RB node: where must this job be executed?]
Resource Broker (RB)
• Helper providing support to the WM
• Given a job submission request, the RB is responsible for finding the “best”
  resource to which to submit the job
  • The CE chosen by the RB has to match the job requirements (e.g. runtime
    environment, data access requirements, and so on)
  • If 2 or more CEs satisfy all the requirements, the one with the best Rank (which
    expresses a preference) is chosen
• The Resource Broker can be “decomposed” into three sub-modules:
  • a sub-module responsible for performing the matchmaking, therefore returning
    all the resources suitable for that JDL expression
  • a sub-module responsible for ranking the matched resources,
    therefore returning just the “best” resource suitable for that JDL expression
  • a sub-module implementing the chosen scheduling strategy
    • easily pluggable and replaceable with other ones implementing different
      scheduling strategies
Resource Broker (RB)
• It performs the matchmaking between the resource requirements (specified
  in the job JDL expression) and the status of the Grid
• It interacts with the Information Services, and also with the EDG Data
  Management Services to resolve data requirements
• Also provides gangmatching besides simple matchmaking
  • Allows both Computing Element and Storage Element information to be
    taken into account in the matchmaking
  • For example, to require a job to run on a CE close to an SE with “enough space”
• Implementation
  • Information retrieved from Information Services (IS) and Data Management
    Services is converted into ClassAds
  • Matchmaking between the user’s JDL and the Information Services/Data Management
    Services ClassAds is performed via the Condor matchmaking library
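The difference between simple matchmaking and gangmatching can be sketched as follows: instead of testing one CE at a time, the job is matched against a (CE, SE) pair. The real implementation relies on the Condor ClassAd library; the sketch below replaces ClassAds with plain Python dictionaries and predicates, and all attribute names (`os`, `close_ses`, `free_gb`) are assumptions.

```python
def gang_match(job, computing_elements, storage_elements):
    """Return the (CE, SE) name pairs that satisfy the job as a group."""
    matches = []
    for ce in computing_elements:
        if not job["ce_requirements"](ce):
            continue
        # The SE constraint is evaluated only on SEs close to this CE.
        for se_name in ce["close_ses"]:
            if job["se_requirements"](storage_elements[se_name]):
                matches.append((ce["name"], se_name))
    return matches


def demo():
    ces = [
        {"name": "CE1", "os": "linux", "close_ses": ["SE1"]},
        {"name": "CE2", "os": "linux", "close_ses": ["SE2"]},
        {"name": "CE3", "os": "other", "close_ses": ["SE2"]},
    ]
    ses = {"SE1": {"free_gb": 5}, "SE2": {"free_gb": 500}}
    job = {
        "ce_requirements": lambda ce: ce["os"] == "linux",
        # "Enough space" is taken to mean at least 100 GB in this example.
        "se_requirements": lambda se: se["free_gb"] >= 100,
    }
    return gang_match(job, ces, ses)
```

In the example, CE1 passes the CE check but its close SE is too full, and CE3 is close to a suitable SE but fails the CE check, so only the pair (CE2, SE2) matches.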
WMS matchmaking scenarios
Possible scenarios for matchmaking:
1. Direct job submission
   • Resource chosen by user
2. Job submission with only computational requirements
3. Job submission with also data access requirements
Direct job submission
edg-job-submit -r CEId
• Resource specified by the user
• Job is simply submitted on the given CE
• RB doesn’t perform any matchmaking algorithm
• Information services not queried at all
Job submission with only comput. reqs
• Matchmaking algorithm:
  • Requirements check
    • RB contacts a “cache” of the Information Services “close” to the RB (where
      information is not updated too often but where queries can be performed
      quickly) to check which CEs satisfy all the requirements
    • This also includes the authorization check (where is the user allowed to
      submit jobs?)
  • Suitable resources directly queried (GRISes queried) to evaluate the Rank
    expression (which usually refers to dynamic values)
  • If more than one CE satisfies the job requirements, the CE with the best
    rank is chosen by the RB
  • If there are two or more CEs with the same best rank, the choice among
    them is made at random
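The three steps of this algorithm (requirements and authorization check on cached data, rank on freshly queried data, random tie-break) can be written down as a short Python sketch. It is not the RB code: `choose_ce`, the dictionary layout and the `authorized_users` field are hypothetical, and "best rank" is assumed to mean the highest value, as with `Rank = other.GlueCEStateFreeCPUs`.

```python
import random


def choose_ce(requirements, rank, cached_info, query_live, user, rng=random):
    # Step 1: requirements and authorization check against the cached view.
    suitable = [
        name for name, info in cached_info.items()
        if user in info["authorized_users"] and requirements(info)
    ]
    if not suitable:
        return None
    # Step 2: rank is evaluated on freshly queried, dynamic values.
    ranks = {name: rank(query_live(name)) for name in suitable}
    best = max(ranks.values())
    # Step 3: random choice among the CEs that share the best rank.
    return rng.choice(sorted(name for name, value in ranks.items() if value == best))


def demo():
    cached = {
        "CE1": {"os": "linux", "authorized_users": {"alice"}},
        "CE2": {"os": "linux", "authorized_users": {"alice", "bob"}},
        "CE3": {"os": "linux", "authorized_users": {"bob"}},
        "CE4": {"os": "other", "authorized_users": {"alice"}},
    }
    live = {
        "CE1": {"free_cpus": 4},
        "CE2": {"free_cpus": 12},
        "CE3": {"free_cpus": 64},
        "CE4": {"free_cpus": 32},
    }
    return choose_ce(
        requirements=lambda info: info["os"] == "linux",
        rank=lambda info: info["free_cpus"],
        cached_info=cached,
        query_live=live.__getitem__,
        user="alice",
    )
```

For the user `alice`, CE3 is excluded by the authorization check and CE4 by the requirements, so the rank is only evaluated for CE1 and CE2, and CE2 wins with more free CPUs.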
Job submission with data access reqs
• Two strategies
• Default one
  • Submit jobs close to data
  • Matchmaking algorithm:
    • Requirements check as in the previous case
    • CE chosen among the suitable ones (the CEs which passed the requirements check)
      as the one with most of the needed files “close” to it (i.e. stored on SEs
      close to the considered CE)
• Newer strategy
  • Matchmaking algorithm:
    • Requirements check as in the previous case
    • The CE is chosen by the Optor service provided by Data Management Services
      among the suitable CEs (the CEs which passed the requirements check)
  • Optor
    • Exploits file replication as a technique for improving data access
    • Optimizes the creation and deletion of replicas
    • Takes into account data location, network information, etc.
  • Will be the new default when proven to work as expected (currently under test)
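The default "submit jobs close to data" strategy amounts to counting, for each suitable CE, how many of the needed files have a replica on an SE close to that CE. A minimal Python sketch under assumed data structures (the function name and the three mappings are hypothetical; in the WMS this information comes from the Information Services and the replica catalog):

```python
def choose_ce_close_to_data(suitable_ces, close_ses, replica_locations, input_files):
    """Pick the suitable CE with the most input files on SEs close to it."""

    def files_nearby(ce):
        near = set(close_ses.get(ce, ()))
        return sum(
            1 for name in input_files
            if near & set(replica_locations.get(name, ()))
        )

    # max() keeps the first CE in case of a tie; how ties are broken in the
    # real broker is not stated here.
    return max(suitable_ces, key=files_nearby)


def demo():
    close_ses = {"CE1": ["SE1"], "CE2": ["SE2", "SE3"]}
    replicas = {
        "lfn:a": ["SE1", "SE2"],
        "lfn:b": ["SE3"],
        "lfn:c": ["SE2"],
    }
    return choose_ce_close_to_data(
        ["CE1", "CE2"], close_ses, replicas, ["lfn:a", "lfn:b", "lfn:c"]
    )
```

CE1 has one of the three files nearby and CE2 has all three, so CE2 is chosen. The Optor-based strategy is not sketched, since its decision logic is not described in these slides.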
Job Submission
• Job submission to the Globus resource (CE) done via CondorG,
  which relies on the Globus GRAM mechanisms
• Why Condor-G?
  • Reliable job submission system
    • Persistency: CondorG keeps a persistent (crash-proof) queue of jobs
    • Two-phase commit protocol used by CondorG for job management operations
  • Logging system: CondorG logs all the relevant events (e.g. job started
    its execution, job execution completed, etc.) concerning the managed
    jobs
    • Log file then parsed by LogMonitor, which triggers appropriate actions on
      certain events
  • Need for interoperability with the US Grid projects, of which CondorG is
    an important component
  • Increased openness of the CondorG framework
    • Condor is going open source
Job submission
• A job wrapper script (created by JobAdapter, JA) is submitted to the CE
• Job wrapper:
  • Download of the input sandbox files from the RB node
  • Execution of the user’s job
  • Upload of the output sandbox files produced by the job to the RB node
  • Logging of some LB events
  • Setting of some environment variables
Job types
• Possibility to submit:
  • Normal jobs
  • Sequential jobs
  • Interactive jobs
  • Checkpointable jobs
  • Parallel (MPI) jobs
  • Partitionable jobs (not yet in release 2.0)
  • DAG jobs (not yet in release 2.0)
Interactive jobs
• Specified by setting JobType = “Interactive” in the JDL
• When an interactive job is executed, a window for the stdin, stdout and
  stderr streams is opened
  • Possibility to send the stdin to the job
  • Possibility to have the stderr and stdout of the job while it is running
Interactive jobs
edg-job-submit jobint.jdl

jobint.jdl:
[
JobType = "Interactive";
Executable = "int-prg.exe";
StdOutput = "Outfile";
InputSandbox = "/home/user/int-prg.exe";
OutputSandbox = "Outfile";
Requirements = other.GlueHostOperatingSystemName == "linux" &&
               other.GlueHostOperatingSystemRelease == "RH 6.2";
]

[Figure: interactive job flows. On the submission machine: User Interface, Job Shadow (shadow host, shadow port). On the RB node: Network Server, Workload Manager, Job Controller/CondorG, LB. On the Computing Element: Gatekeeper, LRMS, worker nodes WN 1 … WN n running the Job, Console Agent and Pillow Process. Flows shown: JDL, InputSandbox, OutputSandbox, RSL, files transfer, StdIn, StdOut, StdErr. Legend: new flows vs. usual job submission flows]
Bridging on the UI machine of the job standard streams (stdin, stdout and stderr) by
integrating the Condor Bypass software
Job checkpointing
• Checkpointing: saving the job state from time to time
  • Useful to prevent data loss due to unexpected failures
  • To allow job preemption
• Approach: provide users with a “trivial” logical job checkpointing service
  • User can save from time to time the state of the job (defined by the application)
  • A job can be restarted from an intermediate (i.e. previously saved) job state
  • Also exploited in the job partitioning framework
• Different from “classical” checkpointing (i.e. saving all the information
  related to a process: the process’s data and stack segments, open files, etc.)
  • Very difficult to apply (e.g. problems saving the state of open network
    connections)
  • Not necessary for most of the DataGrid reference applications
    • Sequential processing cases
    • The state of the application is represented by a small amount of information
      defined by the application itself
Job checkpointing example
Example of application (e.g. HEP Monte Carlo simulation):
int main()
{
  …
  for (int i = event; i < EVMAX; i++)
  { <process event i>; }
  ...
  exit(0);
}
Job checkpointing example
#include "checkpointing.h"
int main()
{
  JobState state(JobState::job);
  // Retrieval of the last saved state: the job can restart from that point
  event = state.getIntValue("first_event");
  PFN_of_file_on_SE = state.getStringValue("filename");
  ….
  var_n = state.getBoolValue("var_n");
  <copy file_on_SE locally>;
  …
  for (int i = event; i < EVMAX; i++)
  {
    <process event i>;
    ...
    // The user saves the state of the job from time to time
    state.saveValue("first_event", i+1);
    <save intermediate file on a SE>;
    state.saveValue("filename", PFN of file_on_SE);
    ...
    state.saveValue("var_n", value_n);
    state.saveState();
  }
  …
  exit(0);
}
• User code must be easily instrumented in order to exploit the checkpointing framework …
• User defines what is a state
  • Defined as <var, value> pairs
  • Must be “enough” to restart a computation from a previously saved state
• User can save from time to time the state of the job
• Retrieval of the last saved state: the job can restart from that point
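The same logical checkpointing idea can be tried out locally with a small Python sketch. It does not use the EDG checkpointing API: the `JobState` class below is a hypothetical stand-in that keeps the <var, value> pairs in a local JSON file, and "processing an event" is replaced by adding its index to a running sum.

```python
import json
import os
import tempfile


class JobState:
    """Hypothetical <var, value> store persisted as a JSON file."""

    def __init__(self, path):
        self.path = path
        self.values = {}
        if os.path.exists(path):
            with open(path) as handle:
                self.values = json.load(handle)

    def get(self, name, default):
        return self.values.get(name, default)

    def save_value(self, name, value):
        self.values[name] = value

    def save_state(self):
        # Write-then-rename so a crash cannot leave a truncated state file.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(self.values, handle)
        os.replace(tmp_path, self.path)


def run(state_path, total_events, fail_at=None):
    state = JobState(state_path)
    first_event = state.get("first_event", 0)  # 0 when no state was saved yet
    partial_sum = state.get("partial_sum", 0)
    processed = []
    for i in range(first_event, total_events):
        if i == fail_at:
            raise RuntimeError("simulated failure")
        partial_sum += i                       # stand-in for "process event i"
        processed.append(i)
        state.save_value("first_event", i + 1)
        state.save_value("partial_sum", partial_sum)
        state.save_state()
    return processed, partial_sum


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "job.state")
        try:
            run(path, total_events=6, fail_at=4)
        except RuntimeError:
            pass
        # The "resubmitted" job resumes from event 4 instead of event 0.
        return run(path, total_events=6)
```

The second call to `run` processes only events 4 and 5, yet returns the sum over all six events, because the partial result was part of the saved state.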
Job checkpointing scenarios
• Scenario 1
  • Job submitted to a CE
  • When the job runs, it saves its state from time to time
  • Job failure, due to a Grid problem (e.g. CE problem)
  • Job resubmitted by the WMS, possibly to a different CE
  • Job restarts its computation from the last saved state
    • No need to restart from the beginning
    • The computation done until that moment is not lost
• Scenario 2
  • Job failure, but not detected by the Grid middleware
  • User can retrieve a saved state for the job (typically the last one)
    edg-job-get-chkpt –o <state> <edg-jobid>
  • User resubmits the job, specifying that the job must start from a specific (the
    retrieved one) initial state
    edg-job-submit –chkpt <state> <JDL file>
Parallel jobs

• Possibility to submit MPI jobs
• MPICH implementation supported
• Only parallel jobs inside a single CE can be submitted
• Submission of parallel jobs very similar to normal jobs
  • Just needed to specify the number (n) of requested CPUs in the JDL
• Matchmaking
  • CE chosen by RB has to have MPICH software installed, and at least n total CPUs
  • If there are two or more CEs satisfying all the requirements, the one with the
    highest number of free CPUs is chosen
• Implementation
  • NodeNumber nodes allocated via CondorG-GRAM mechanisms
  • Job wrapper performs: mpirun -np #NodeNumber executable
Job dependencies

• Problem
  • Define and manage dependencies between jobs
  • Can be defined by Directed Acyclic Graphs (DAGs), whose nodes are program
    executions (jobs), and whose arcs represent dependencies between them
• Approach
  • DAG managed within the WMS by a meta-scheduler, called DAGMan (DAG
    Manager)
    • Originally developed within the Condor project
    • DAGMan is responsible for navigating the graph, determining which nodes are
      free of dependencies, and following the execution of the corresponding jobs
  • Before submitting a job (node), it is necessary to choose the resource to
    which to submit it
    • Lazy scheduling model: a job (node) is bound to a resource just before that
      job is ready to be submitted
• Ready for September release
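The graph navigation and the lazy scheduling model can be illustrated with a small Python sketch. This is not DAGMan: `run_dag` is a hypothetical function that repeatedly picks the nodes free of dependencies, asks a broker callback for a resource only at that moment, and assumes every submitted job completes successfully.

```python
def run_dag(dependencies, choose_resource):
    """dependencies maps each node to the set of nodes it depends on."""
    remaining = {node: set(parents) for node, parents in dependencies.items()}
    completed = []
    bindings = {}
    while remaining:
        ready = sorted(node for node, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("cycle detected: the graph is not acyclic")
        for node in ready:
            # Lazy binding: the resource is chosen only now, when the node
            # is free of dependencies and about to be submitted.
            bindings[node] = choose_resource(node, completed)
            completed.append(node)  # the job is assumed to run to completion
            del remaining[node]
        for parents in remaining.values():
            parents.difference_update(ready)
    return completed, bindings


def demo():
    dag = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

    # Hypothetical broker: alternates between two CEs depending on how many
    # jobs have completed so far, to show that the choice uses current state.
    def pick(node, done):
        return "CE1" if len(done) % 2 == 0 else "CE2"

    return run_dag(dag, pick)
```

Binding late means that the broker decides with the Grid status at submission time of each node, rather than with the status at the time the whole DAG was submitted.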
The EDG Workload Management System – n° 77
Job partitioning

Problem

A job may have to process a large set of “independent elements” (e.g. HEP events)

Worth decomposing the job into smaller sub-jobs (each one responsible for processing just a sub-set of the original large set of elements), in order to reduce the overall time needed to process all these elements and to optimize the usage of all available Grid resources

Approach

Use of job checkpointing and DAGMan mechanisms

Original job partitioned into sub-jobs which can be executed in parallel

At the end each sub-job must save a final state, which is then retrieved by a job aggregator, responsible for collecting the results of the sub-jobs and producing the overall output
Ready for September release
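
The partition and aggregate pattern described above can be sketched as follows. This is a toy illustration under stated assumptions: partition, sub_job and aggregate are hypothetical names, the "events" are plain integers, the final state of a sub-job is a small dictionary, and the sub-jobs run one after the other here, whereas on the Grid they would run in parallel.

```python
from typing import Dict, List, Sequence


def partition(elements: Sequence[int], n_subjobs: int) -> List[Sequence[int]]:
    """Split the elements into at most n_subjobs contiguous, near-equal chunks."""
    size, extra = divmod(len(elements), n_subjobs)
    chunks, start = [], 0
    for i in range(n_subjobs):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # Skip empty chunks when there are few elements.
            chunks.append(elements[start:end])
        start = end
    return chunks


def sub_job(chunk: Sequence[int]) -> Dict[str, int]:
    """Process one chunk and return its final state (here, a partial sum)."""
    return {"processed": len(chunk), "partial": sum(chunk)}


def aggregate(states: List[Dict[str, int]]) -> Dict[str, int]:
    """Job aggregator: combine the final states saved by the sub-jobs."""
    return {
        "processed": sum(s["processed"] for s in states),
        "total": sum(s["partial"] for s in states),
    }


events = list(range(10))
states = [sub_job(chunk) for chunk in partition(events, 3)]
result = aggregate(states)
print(result)
```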
Grid accounting


Purpose

Tracking the resource usage by each user

Avoid abuses of the DataGrid resources

Help the Workload Management System in balancing the resource loads
Approach:

Instead of using a “passive” accounting system that only stores information about resource usage, we chose to experiment with an “active” accounting system based on an economic model

Every entity in the Grid (Users, Computing Elements, Storage Elements, …) spends or earns some “virtual currency” (GridCredits) when performing some grid-related work

A price can be dynamically assigned to every Grid resource, so that economic feedback can help the Broker maintain a workload equilibrium
Some examples:

The accounting assigns a cost to the user jobs. The job cost is then paid by the users to the resources.

Users can submit jobs only if they have the necessary funds.

Computing elements can buy access to the storage elements.

A storage element can buy a file from another storage element.
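
The first two examples can be illustrated with a toy ledger. The sketch below is not DGAS code: the Bank class, submit_job and the account names are hypothetical, and the payment is made at submission time only to keep the example short.

```python
class InsufficientFunds(Exception):
    """Raised when a payer does not hold enough GridCredits."""


class Bank:
    """Toy ledger holding a GridCredits balance per entity (user, CE, SE)."""

    def __init__(self):
        self.balances = {}

    def open_account(self, owner, credits=0):
        self.balances[owner] = credits

    def can_pay(self, payer, amount):
        return self.balances[payer] >= amount

    def transfer(self, payer, payee, amount):
        if not self.can_pay(payer, amount):
            raise InsufficientFunds(payer)
        self.balances[payer] -= amount
        self.balances[payee] += amount


def submit_job(bank, user, resource, cost):
    """Accept the job only if the user has the funds; the cost goes to the resource."""
    if not bank.can_pay(user, cost):
        return False
    bank.transfer(user, resource, cost)
    return True


bank = Bank()
bank.open_account("alice", credits=100)
bank.open_account("ce-a")
print(submit_job(bank, "alice", "ce-a", cost=60))  # accepted
print(submit_job(bank, "alice", "ce-a", cost=60))  # refused, only 40 credits left
print(bank.balances)
```

The same transfer operation would cover the other examples as well, for instance a computing element paying a storage element for access.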
Grid Accounting

DGAS (DataGrid Accounting System) software composed of modules:

Bank service: every entity (User, CE, …) has one account on a bank server known as HLR

Price Authority: the service responsible for assigning prices to the Grid resources. The price is assigned according to some economic model.

Economic model: resources can be priced with many different algorithms. Different pricing schemes result in different behaviours of the Grid
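
To show how different pricing schemes can lead to different behaviours, here is a small Python sketch with two interchangeable pricing rules. Everything in it is an assumption made for illustration: the PriceAuthority class, the flat_price and load_based_price rules, the resource fields and the idea that a broker simply picks the cheapest resource are not taken from the DGAS software.

```python
def flat_price(base_price, busy_cpus, total_cpus):
    """Hypothetical rule: the price never changes."""
    return base_price


def load_based_price(base_price, busy_cpus, total_cpus):
    """Hypothetical rule: the price grows linearly with the load."""
    return base_price * (1 + busy_cpus / total_cpus)


class PriceAuthority:
    """Assigns a price to a resource according to a pluggable pricing rule."""

    def __init__(self, pricing_rule):
        self.pricing_rule = pricing_rule

    def price(self, resource):
        return self.pricing_rule(
            resource["base_price"], resource["busy_cpus"], resource["total_cpus"]
        )


def cheapest(resources, authority):
    """Return the name of the resource with the lowest current price."""
    return min(resources, key=lambda name: authority.price(resources[name]))


resources = {
    "ce-a": {"base_price": 10, "busy_cpus": 9, "total_cpus": 10},
    "ce-b": {"base_price": 12, "busy_cpus": 2, "total_cpus": 10},
}
print(cheapest(resources, PriceAuthority(flat_price)))
print(cheapest(resources, PriceAuthority(load_based_price)))
```

With the flat rule the nominally cheaper but heavily loaded resource stays the cheapest, while the load-based rule makes the less loaded one more attractive, which is the kind of economic feedback that could help balance the workload.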
DGAS working scheme
WMS Security

The secure environment within the DataGrid project is based on the Globus GSI (Grid Security Infrastructure) mechanisms

Based on Public Key Infrastructure (PKI): each user and each service owns a credential, consisting of a pair of <private, public> keys, whose main purpose is to prove their identity

User or service identity and their public key are included in an X.509 certificate signed by a trusted Certification Authority (CA)

Security in the WMS

Within the WMS, the various components need a delegated user credential to act on behalf of that user

E.g.: necessary when a job has to be submitted to the Computing Element (proxy needed by Globus GRAM mechanisms)

All interactions between WMS components, especially those that are network-separated, are mutually authenticated (each component using either its own credential or a delegated user credential)
Proxy renewal

Why?

To avoid job failures caused by a job outliving the validity of its initial delegated proxy, without having to resort to long-term user proxies

Solution

Short-term proxies created as usual on the UI machine:

grid-proxy-init -hours <hours>

User registers the proxy in a MyProxy server:

myproxy-init -s <server> [-t <cred> -c <proxy>]

User specifies the MyProxy server in the JDL to enable proxy renewal:

MyProxyServer=myproxy.host.name

The proxy is automatically renewed by the WMS, without user intervention, for the whole life of the job: a WMS proxy renewal service gets a newer proxy from the MyProxy server when the existing one is about to expire
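
The renewal logic can be sketched as a periodic check on the remaining lifetime of the proxy. The Python code below is a simulation only: Proxy, renew_if_needed and simulate are hypothetical names, time is counted in whole hours, and the fetch_new_proxy callback stands in for the request that the renewal service would send to the MyProxy server.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Proxy:
    expires_at: int  # Hour at which the proxy stops being valid.


def renew_if_needed(
    proxy: Proxy,
    now: int,
    threshold: int,
    fetch_new_proxy: Callable[[int], Proxy],
) -> Proxy:
    """Return a newer proxy if the current one expires within `threshold` hours."""
    if proxy.expires_at - now <= threshold:
        return fetch_new_proxy(now)
    return proxy


def simulate(job_hours: int, proxy_hours: int, threshold: int) -> List[int]:
    """Check the proxy every hour of the job and return the renewal hours."""
    proxy = Proxy(expires_at=proxy_hours)
    renewals = []
    for now in range(job_hours + 1):
        renewed = renew_if_needed(
            proxy, now, threshold, lambda t: Proxy(expires_at=t + proxy_hours)
        )
        if renewed is not proxy:
            renewals.append(now)
            proxy = renewed
        # The job must never be left with an expired proxy.
        assert proxy.expires_at > now, "proxy expired while the job was running"
    return renewals


print(simulate(job_hours=30, proxy_hours=12, threshold=1))
```

In this simulation a 30-hour job survives on 12-hour proxies because each proxy is replaced shortly before it expires, which is the point of keeping the user proxies short-lived.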
Status

The new WMS is being integrated in the DataGrid testbed

Preliminary results of the new Workload Management System, in terms of reliability, stability and performance, are very encouraging

A more comprehensive evaluation will be possible once real users have carried out real test activities on the large scale DataGrid testbed
Future

Activities will continue in the context of the EGEE project (?)

EGEE aims to create a reliable and dependable European grid infrastructure

Focus

Production quality software

Addressing all aspects of RAMPS (Reliability, Availability, Maintainability, Performance and Scale)

Reengineering of existing software instead of new developments

Adoption of standards

The ones emerging from the Global Grid Forum

OGSA
Conclusions

In the first phase of the EDG project, WP1 implemented a working Workload Management System prototype

Applications have been gaining experience with this WMS for a year and a half

Revised WMS architecture

To address emerged shortcomings

To support new functionalities

APIs, Interactive jobs, Job checkpointing, Gangmatching, …

Hooks to support other functionalities planned to be integrated later

DAGMan, Job partitioning, Grid accounting, Resource reservation and co-allocation
Further information

EDG Web site

http://www.edg.org

EDG WP1 Web site

http://www.infn.it/workload-grid