The Workload Management
System in the DataGrid
project
Massimo Sgaravatto
INFN Padova
[email protected]
The EDG Workload Management System – n° 1
Grid vision
“Dependable, consistent, pervasive access to resources”
Enable communities (“virtual organizations”) to share
geographically distributed resources as they pursue common goals in the absence of
central control, omniscience, trust relationships
Make it easy to use diverse, geographically distributed, locally managed and
controlled computing facilities as if they formed a coherent local cluster
The Grid has been under discussion for several years …
… but until a few years ago essentially only the Globus toolkit was available
Globus toolkit: core services for Grid tools and applications (Authentication, Information
service, Resource management, etc.)
Good basis to build on but:
No higher level services
Many problems (e.g. handling of lots of data) not addressed
No production quality implementations
Not possible to do real work with Grids yet …
EU DataGrid
DataGrid is a project funded by the European Union whose objective is to build and
exploit the next generation computing infrastructure, providing intensive
computation and analysis of shared large-scale databases
Enable data intensive sciences by providing world wide Grid test beds to
large distributed scientific organizations ( “Virtual Organizations, VOs”)
Duration: Jan 1, 2001 - Dec 31, 2003
Applications/End Users Communities: HEP, Earth Observation, Biology
Specific Project Objectives:
Middleware for fabric & grid management
Large scale testbed
Collaborate and coordinate with other projects
Contribute to Open Standards and international bodies
DataGrid Main Partners
CERN – International (Switzerland/France)
CNRS - France
ESA/ESRIN – International (Italy)
INFN - Italy
NIKHEF – The Netherlands
PPARC - UK
Assistant Partners
Industrial Partners
•Datamat (Italy)
•IBM-UK (UK)
•CS-SI (France)
Research and Academic Institutes
•CESNET (Czech Republic)
•Commissariat à l'énergie atomique (CEA) – France
•Computer and Automation Research Institute,
Hungarian Academy of Sciences (MTA SZTAKI)
•Consiglio Nazionale delle Ricerche (Italy)
•Helsinki Institute of Physics – Finland
•Institut de Fisica d'Altes Energies (IFAE) - Spain
•Istituto Trentino di Cultura (IRST) – Italy
•Konrad-Zuse-Zentrum für Informationstechnik Berlin - Germany
•Royal Netherlands Meteorological Institute (KNMI)
•Ruprecht-Karls-Universität Heidelberg - Germany
•Stichting Academisch Rekencentrum Amsterdam (SARA) – Netherlands
•Swedish Research Council - Sweden
DataGrid Work Packages
The EDG collaboration is structured in 12 Work Packages
WP1: Workload Management System
WP2: Data Management
WP3: Grid Information and Monitoring
WP4: Fabric Management
WP5: Storage Element / Storage Resource Manager
WP6: Testbed and demonstrators
WP7: Network Monitoring
WP8: High Energy Physics Applications
WP9: Earth Observation
WP10: Biology
WP11: Dissemination
WP12: Management
WP1 Task
The objective of the first DataGrid work package was, and still is (according
to the project "Technical Annex"):
To define and implement a suitable architecture for distributed
scheduling and resource management in a Grid environment
Many challenging issues:
Large heterogeneous environments
Large numbers (thousands) of independent users
Optimizing the choice of execution location based on the availability of
data, computation and network resources
Uniform interface to possible different local resource management
systems under different administrative domains
Policies on resource usage
Reliability, scalability, …
…
EDG Tutorial Overview
Workload Management Services
Data Management Services
Networking
Information Service
Fabric Management
WP1 teams
INFN
INFN Catania
INFN Cnaf (Bologna)
INFN Milano
INFN Padova
INFN Pisa
INFN Torino
CESNET (Czech Republic)
Datamat SpA (Rome)
Imperial College (UK)
Approach
We needed much more experience with the various Grid issues
The application requirements were not completely defined yet
They evolved as more familiarity with the Grid model was acquired
Fast prototyping instead of a classic top-down approach
Implementation of a first prototype Workload Management
System (WMS)
By integrating existing tools
Globus
Condor
And by implementing new middleware
Functionalities of the first WMS
Lightweight User Interface (UI) to submit jobs and control them
Also allows a "small" amount of data to be transferred between the client machine
and the executing machine (input/output sandboxes)
Job characteristics and requirements described via an appropriate Job
Description Language (JDL)
WP1's Resource Broker (RB) chooses an appropriate computing resource
(Computing Element, CE) for the job, based on the constraints specified in
the JDL and on the status of the Grid
RB strategy is to send the job to an appropriate CE:
Where the submitting user has proper authorization
That matches the characteristics specified in the JDL (architecture, computing
power, application environment, etc.)
Where the specified input data (and possibly the chosen output Storage Element)
are determined to be "close enough"
Throughout this process, WP1's Logging and Bookkeeping services
maintain a "state machine" view of each job
First WMS: players and flows
dg-job-submit myjob.jdl
Myjob.jdl
Executable = "$(CMS)/exe/sum.exe";
InputData = "LF:testbed0-00019";
ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
DataAccessProtocol = "gridftp";
InputSandbox = {"$(CMS)/exe/sum.exe", "/home/user/DATA/*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.Architecture == "INTEL" && other.OpSys == "LINUX Red Hat 6.2";
Rank = other.FreeCPUs;
Experiences with the first WMS
First Workload Management System deployed in the EDG testbed at
the end of the first year of the project
Application users have now been working with this first release of the WMS
for about a year and a half
Stress tests and quasi-production activities
CMS stress tests
Atlas efforts
…
Significant achievements exploited by the experiments …
… but also various problems were spotted
Impacting in particular the reliability and scalability of the system
Review of WP1 WMS architecture
WP1 Workload Management System architecture reviewed
To apply the "lessons" learned and address the
shortcomings that emerged with the first release of the
software, in particular
To increase reliability
To address the scalability problems
To support new functionalities
To favor interoperability with other Grid frameworks,
by allowing WP1 modules (e.g. RB) to be used also
"outside" the EDG WMS
Improvements wrt first rel. of WMS
Reliability and scalability problems addressed
No longer a monolithic long-lived process
Some functionalities (e.g. matchmaking) delegated to pluggable modules
Less exposed to memory leaks (coming not only from EDG software)
No more multiple job info repositories
No more job status inconsistencies, which caused problems
Techniques to quickly recover from failures
Reliable communications among components
Done via the file system (filequeues)
For example, jobs are not lost if the target entity is temporarily down:
when it restarts it gets and processes the jobs
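The filequeue idea can be sketched as follows. This is only an illustration, assuming one file per request in a spool directory; the class name `FileQueue`, the `.job` suffix and the file naming scheme are hypothetical and not taken from the EDG code.

```python
import os
import tempfile
import time
import uuid


class FileQueue:
    """Directory-backed queue: one file per pending request (illustrative)."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def put(self, payload):
        # Write under a temporary name, then rename: a reader never sees
        # a half-written request (rename is atomic on POSIX file systems).
        name = f"{time.time_ns():020d}-{uuid.uuid4().hex}"
        tmp = os.path.join(self.directory, name + ".tmp")
        with open(tmp, "w") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, os.path.join(self.directory, name + ".job"))

    def pending(self):
        # Names start with a zero-padded timestamp, so sorting gives
        # roughly oldest-first order.
        return sorted(n for n in os.listdir(self.directory) if n.endswith(".job"))

    def process_all(self, handler):
        # A file is removed only after the handler returned, so a crash
        # in between leaves the request on disk for the next start.
        done = 0
        for name in self.pending():
            path = os.path.join(self.directory, name)
            with open(path) as f:
                handler(f.read())
            os.remove(path)
            done += 1
        return done


def demo():
    with tempfile.TemporaryDirectory() as spool:
        FileQueue(spool).put("job-1")
        FileQueue(spool).put("job-2")
        # The consumer was "down" until now; a fresh instance finds both jobs.
        seen = []
        count = FileQueue(spool).process_all(seen.append)
        return count, sorted(seen), FileQueue(spool).pending()
```

Because the queue state lives only in the directory, the producer and the consumer do not need to be running at the same time.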
Improvements wrt first rel. of WMS
Flexibility and interoperability increased
Much more feasible to exploit the Resource Broker also outside the
DataGrid WMS
Much easier to implement and "plug" into the system a module
implementing a scheduling strategy chosen according to
one's own needs and requirements
Glue Schema for Information Services to describe Grid resources
Common schema agreed between US and EU High Energy Physics Grid
projects
Various rel. 1 problems fixed
Various enhancements in design and implementation in the various
modules
Also due to enhancements in the underlying software (e.g. Condor and
Globus)
WP1 WMS reviewed architecture
Details in EDG deliverable
D1.4 …
Job submission
[Diagram: UI; RB node (Network Server, Workload Manager, Job Controller/CondorG); RLS;
Information Service; Computing Element (CE characteristics & status);
Storage Element (SE characteristics & status)]
[Job submission diagram; job status: submitted]
UI: allows users to access the functionalities of the WMS
[Job submission diagram; job status: submitted]
edg-job-submit myjob.jdl
Myjob.jdl
JobType = "Normal";
Executable = "$(CMS)/exe/sum.exe";
InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.GlueHostOperatingSystemName == "linux" &&
other.GlueHostOperatingSystemRelease == "Red Hat 6.2" &&
other.GlueCEPolicyMaxWallClockTime > 10000;
Rank = other.GlueCEStateFreeCPUs;
Job Description Language (JDL) to specify job characteristics and requirements
[Job submission diagram; job status: submitted, then waiting]
NS: network daemon responsible for accepting incoming requests
Job and Input Sandbox files transferred from the UI to the RB storage
[Job submission diagram; job status: waiting]
WM: responsible for taking the appropriate actions to satisfy the request
[Job submission diagram; job status: waiting]
Workload Manager asks the MatchMaker/Broker: where must this job be executed?
[Job submission diagram; job status: waiting]
Matchmaker: responsible for finding the "best" CE to which to submit a job
[Job submission diagram; job status: waiting]
MatchMaker/Broker query to the RLS: where are the needed data (on which SEs)?
MatchMaker/Broker query to the Information Service: what is the status of the Grid?
[Job submission diagram; job status: waiting]
CE choice returned by the MatchMaker/Broker to the Workload Manager
[Job submission diagram; job status: waiting]
JA (Job Adapter): responsible for the final "touches" to the job before performing
submission (e.g. creation of wrapper script, etc.)
[Job submission diagram; job status: ready]
JC: responsible for the actual job management operations (done via CondorG)
[Job submission diagram; job status: scheduled]
Job and Input Sandbox files transferred to the Computing Element
[Job submission diagram; job status: running]
Job on the Computing Element, with its Input Sandbox
"Grid enabled" data transfers/accesses to the Storage Element
[Job submission diagram; job status: done]
Output Sandbox files transferred from the Computing Element to the RB storage
[Job submission diagram; job status: done]
edg-job-get-output <dg-job-id>
Output Sandbox requested from the UI
[Job submission diagram; job status: cleared]
Output Sandbox files transferred from the RB storage to the UI
Job monitoring
[Diagram: UI; RB node (Network Server, Workload Manager, Job Controller/CondorG,
Log Monitor); Logging & Bookkeeping; Computing Element]
edg-job-status <dg-job-id>
edg-job-get-logging-info <dg-job-id>
LB: receives and stores job events; processes corresponding job status
LM: parses CondorG log file (where CondorG logs info about jobs) and notifies LB
Logging and Bookkeeping (LB) service
Stores logging and bookkeeping information concerning events
generated by the various components of the WMS ("push" model)
Using this information, the LB service keeps a state machine view of
each job
Extended querying capabilities
E.g. Give me all jobs marked as 'XYZ' (user tag) and running on CE1 or CE2
Possible to have more LB servers per WMS
Could be useful if an LB server is overloaded
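A rough sketch of how a state machine view could be derived from pushed events. The state names are the ones shown in the job flow diagrams; the event names, the record layout and the helper `running_jobs` are hypothetical, and failure or cancellation paths are left out.

```python
# Job states as they appear in the job flow diagrams, in order.
STATES = ["submitted", "waiting", "ready", "scheduled", "running", "done", "cleared"]

# Hypothetical event names; each event moves a job to one state.
EVENT_TO_STATE = {
    "accepted_by_ns": "waiting",
    "matched": "ready",
    "transferred_to_ce": "scheduled",
    "started": "running",
    "finished": "done",
    "output_retrieved": "cleared",
}


def job_status(events):
    """Replay the events logged for one job and return its current state."""
    state = "submitted"
    for event in events:
        new_state = EVENT_TO_STATE.get(event)
        # Unknown events are ignored, and so are events that would move the
        # job backwards (e.g. a message that arrives late).
        if new_state and STATES.index(new_state) > STATES.index(state):
            state = new_state
    return state


def running_jobs(jobs, user_tag, ces):
    """Query in the spirit of: all jobs tagged user_tag and running on one of ces."""
    return [
        job["id"]
        for job in jobs
        if job["tag"] == user_tag
        and job["ce"] in ces
        and job_status(job["events"]) == "running"
    ]
```

Deriving the status from the stored events, rather than storing the status itself, keeps a single source of information per job.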
Possible job states
Logging and Bookkeeping (LB) service
LB components:
Local logger: responsible for accepting messages from their sources and
for passing them to the inter-logger
Inter-logger: responsible for forwarding them to the bookkeeping servers
Information flow implemented on top of inter-process communication
mechanisms and backed up by a log file that allows a correct recovery of
the inter-logger if problems occur
The inter-logger, running as a separate process, makes the logging procedure
robust with respect to local and network faults
Bookkeeping server: accepts messages from the inter-logger and saves
them on its permanent storage
Supports queries (in particular job status queries) generated by a consumer
API
LB architecture
User Interface (UI)
Allows users to access the functionalities of the WMS
To submit jobs
To see the suitable resources for a job (without submitting it)
To cancel a job
To see the status of the job
To retrieve the output of a job
…
Interfaces
Command line interface (python)
C++ and Java API
GUI
GUI
Job preparation: JDL
Information to be specified when a job has to be submitted:
Job characteristics
Executable
StdInput, StdOutput, StdError
Environment variables
Input Sandbox and Output Sandbox
…
Job requirements and preferences (rank) on the computing resources
Specified using attributes of resources published in the Information Service
E.g. requirements/preferences on the architecture type, OS, memory,
software dependencies, …
Job data requirements
Input Data
Protocols spoken by application to access data
Output Storage Element
Job preparation: JDL
Information specified using a Job Description Language (JDL)
Based upon Condor's CLASSified ADvertisement language (ClassAd)
Fully extensible language
A ClassAd
Constructed with the classad construction operator []
It is a sequence of attributes separated by semi-colons
An attribute is a pair (key, value), where value can be a Boolean, an
Integer, a list of strings, …
<attribute> = <value>;
So, the JDL allows the definition of a set of attributes that the WMS takes
into account when making its scheduling decision
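To make the `<attribute> = <value>;` structure concrete, here is a toy splitter. It is not the ClassAd parser used by the WMS: values are kept as raw strings, nested ClassAds and escaped quotes are not handled, and the function names are made up for this sketch.

```python
def _store(statement, attributes):
    # Collapse line breaks and runs of spaces (this also touches whitespace
    # inside quoted strings, which is acceptable for a sketch).
    statement = " ".join(statement.split())
    if statement:
        # The first '=' is the assignment; later '==' stay in the value.
        key, _, value = statement.partition("=")
        attributes[key.strip()] = value.strip()


def parse_jdl(text):
    """Split a JDL-like text into {attribute name: raw value string}."""
    body = text.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]  # drop the enclosing [] construction operator
    attributes, current, in_string = {}, [], False
    for ch in body:
        if ch == '"':
            in_string = not in_string
        if ch == ";" and not in_string:
            _store("".join(current), attributes)  # ';' ends one attribute
            current = []
        else:
            current.append(ch)
    _store("".join(current), attributes)  # tolerate a missing final ';'
    return attributes
```

Splitting only on semicolons outside double quotes is what lets a multi-line `Requirements` expression survive as a single attribute.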
Example of JDL File
[
JobType = "Normal";
Executable = "gridTest";
StdError = "stderr.log";
StdOutput = "stdout.log";
InputSandbox = {"home/joda/test/gridTest"};
OutputSandbox = {"stderr.log", "stdout.log"};
InputData = {"lfn:green", "guid:red"};
DataAccessProtocol = "gridftp";
Requirements = other.GlueHostOperatingSystemName == "LINUX"
&& other.GlueCEStateFreeCPUs >= 4;
Rank = other.GlueCEPolicyMaxCPUTime;
]
Workload Manager (WM)
Given a valid request coming from the UI, it has to take the
appropriate actions to satisfy it
It may need support from other components, which are specific to
the different request types
All these components that offer support to the Workload Manager
provide a class whose interface is inherited from a Helper class
Essentially the Helper, given a JDL expression, returns a modified one,
which represents the output of the required action
E.g. if the request was to find a suitable resource for a job, the input
JDL expression is the one specified by the user at submission time, and
the output is the JDL expression augmented with the resource choice
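The Helper pattern described above might look roughly like this. The slides only state that helpers share an interface inherited from a Helper class and map a JDL expression to a modified one; the method name `resolve`, the `ChosenCE` attribute and the request type key are assumptions made for the sketch.

```python
class Helper:
    """Common interface: take a JDL expression, return a modified copy."""

    def resolve(self, jdl):
        raise NotImplementedError


class BrokerHelper(Helper):
    """Hypothetical helper that augments the JDL with a resource choice."""

    def __init__(self, choose_ce):
        self.choose_ce = choose_ce  # callable: jdl -> CE identifier

    def resolve(self, jdl):
        result = dict(jdl)  # leave the user's original expression untouched
        result["ChosenCE"] = self.choose_ce(jdl)  # attribute name is an assumption
        return result


class WorkloadManager:
    """Dispatches each request type to the helper registered for it."""

    def __init__(self, helpers):
        self.helpers = helpers  # request type -> Helper instance

    def handle(self, request_type, jdl):
        return self.helpers[request_type].resolve(jdl)
```

With this shape, supporting a new request type means registering another Helper, without changing the Workload Manager itself.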
Resource Broker (RB)
Helper providing support to the WM
Given a job submission request, the RB is responsible for finding the "best"
resource to which to submit the job
The CE chosen by RB has to match the job requirements (e.g. runtime
environment, data access requirements, and so on)
If 2 or more CEs satisfy all the requirements, the one with the best Rank (which
expresses a preference) is chosen
The Resource Broker can be "decomposed" into three sub-modules:
a sub-module responsible for performing the matchmaking, therefore returning
all the resources suitable for that JDL expression
a sub-module responsible for ranking the matched resources,
therefore returning just the "best" resource suitable for that JDL expression
a sub-module implementing the chosen scheduling strategy
easily pluggable and replaceable with other ones implementing different scheduling
strategies
Resource Broker (RB)
It performs the matchmaking between the resource requirements (specified
in the job JDL expression) and the status of the Grid
It interacts with the Information Services, and also with the EDG Data
Management Services to resolve data requirements
Also provides gangmatching in addition to simple matchmaking
Allows both Computing Element and Storage Element information to be taken
into account in the matchmaking
For example, to require a job to run on a CE close to an SE with "enough space"
Implementation
Information retrieved from Information Services (IS) and Data Management
Services converted into ClassAds
Matchmaking between User’s JDL and Information Services/Data Management
Services classads performed via Condor Matchmaking library
WMS matchmaking scenarios
Possible scenarios for matchmaking:
1. Direct job submission
Resource chosen by user
2. Job submission with only computational requirements
3. Job submission with also data access requirements
Direct job submission
edg-job-submit -r CEId
Resource specified by the user
Job is simply submitted on the given CE
RB doesn't perform any matchmaking algorithm
Information services not queried at all
Job submission with only comput. reqs
Matchmaking algorithm:
Requirements check
RB contacts a "cache" of the Information Services "close" to the RB (where
information is not updated very often but where queries can be performed
quickly) to check which CEs satisfy all the requirements
This also includes the authorization check (where is the user allowed to submit
jobs?)
Suitable resources directly queried (GRISes queried) to evaluate the Rank
expression (which usually refers to dynamic values)
If more than one CE satisfies the job requirements, the CE with the best
rank is chosen by the RB
If two or more CEs share the same best rank, the RB chooses
among them at random
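The algorithm above can be sketched in a few lines. The data layout (dictionaries with `authorized_users`, requirement and rank callables) is invented for illustration, a higher rank is assumed to be better, and the split between the cached Information Services and the direct GRIS queries is not modelled.

```python
import random


def matchmake(job, ces, rng=random):
    """Return the id of the chosen CE, or None if no CE is suitable.

    job: {"user": str, "requirements": callable(ce) -> bool,
          "rank": callable(ce) -> number}
    ces: list of dicts describing Computing Elements (hypothetical layout)
    """
    # Step 1: requirements check, including the authorization check.
    suitable = [
        ce for ce in ces
        if job["user"] in ce["authorized_users"] and job["requirements"](ce)
    ]
    if not suitable:
        return None
    # Step 2: evaluate Rank only on the CEs that passed step 1.
    ranked = [(job["rank"](ce), ce) for ce in suitable]
    best = max(rank for rank, _ in ranked)
    # Step 3: random choice among the CEs sharing the best rank.
    return rng.choice([ce["id"] for rank, ce in ranked if rank == best])
```

Evaluating the rank only after the requirements filter mirrors the slide's point that the expensive, dynamic queries are limited to the suitable resources.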
Job submission with data access reqs
Two strategies
Default one
Submit jobs close to data
Matchmaking algorithm:
Requirements check as in the previous case
CE chosen among the suitable ones (the CEs which passed the requirements check)
as the one for which most of the needed files are stored on SEs "close" to it
Newer strategy
Matchmaking algorithm:
Requirements check as in the previous case
The CE is chosen by the Optor service provided by Data Management Services
among the suitable CEs (the CEs which passed the requirements check)
Optor
Exploits file replication as a technique for improving data access
Optimizes the creation and deletion of replicas
Takes into account data location, network information, etc.
Will be new default when proven to work as expected (currently under test)
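The default "submit jobs close to data" strategy can be illustrated as below. The replica and closeness tables are hypothetical inputs (in the real system they would come from the data management and information services), and breaking ties by list order is an assumption. The Optor-based strategy is not modelled.

```python
def choose_ce_close_to_data(suitable_ces, input_files, replicas, close_ses):
    """Pick the suitable CE that has the most input files on a 'close' SE.

    replicas:  file name -> set of SEs holding a replica of that file
    close_ses: CE id -> set of SEs considered close to that CE
    """
    def files_nearby(ce):
        near = close_ses.get(ce, set())
        # A file counts if at least one of its replicas sits on a close SE.
        return sum(1 for f in input_files if replicas.get(f, set()) & near)

    # max() keeps the first CE among equals, so ties follow the list order.
    return max(suitable_ces, key=files_nearby)
```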
Job Submission
Job submission to the Globus resource (CE) done via CondorG,
which relies on the Globus GRAM mechanisms
Why Condor-G?
Reliable job submission system
Persistency: CondorG keeps a persistent (crash proof) queue of jobs
Logging system: CondorG logs all the relevant events (e.g. job started
its execution, job execution completed, etc.) concerning the managed
jobs
Two-phase commit protocol used by CondorG for job management operations
Log file then parsed by LogMonitor, which triggers appropriate actions on
certain events
Need for interoperability with the US Grid projects, of which CondorG is
an important component
Increased openness of the CondorG framework
Condor is going open source
Job submission
A job wrapper script (created by JobAdapter, JA) is submitted to the
CE
Job Wrapper:
Download of the input sandbox files from the RB node
Execution of user's job
Upload of the output sandbox files produced by the job to the RB node
Logging of some LB events
Setting of some environment variables
Job types
Possibility to submit:
Normal jobs
Sequential jobs
Interactive jobs
Checkpointable jobs
Parallel (MPI) jobs
Partitionable jobs (not yet in release 2.0)
DAG jobs (not yet in release 2.0)
Interactive jobs
Specified by setting JobType = "Interactive" in the JDL
When an interactive job is executed, a window for the stdin, stdout,
stderr streams is opened
Possibility to send the stdin to the job
Possibility to have the stderr and stdout of the job while it
is running
Interactive jobs
[Diagram: interactive job flows. Components shown: User Interface (submission machine),
Job Shadow, Network Server, Workload Manager, Job Controller/CondorG, LB, Gatekeeper,
LRMS, Console Agent, Pillow Process, Job, worker nodes WN 1 … WN n of the Computing Element.
Usual job submission flows: JDL, RSL, InputSandbox, OutputSandbox, files transfer.
New flows: shadow host, shadow port, StdIn, StdOut, StdErr]
edg-job-submit jobint.jdl
jobint.jdl
[
JobType = "Interactive";
Executable = "int-prg.exe";
StdOutput = "Outfile";
InputSandbox = {"/home/user/int-prg.exe"};
OutputSandbox = {"Outfile"};
Requirements =
other.GlueHostOperatingSystemName == "linux" &&
other.GlueHostOperatingSystemRelease == "RH 6.2";
]
Bridging on the UI machine of the job standard streams
(stdin, stdout and stderr) by integrating the Condor Bypass software
Job checkpointing
Checkpointing: saving the job state from time to time
Useful to prevent data loss due to unexpected failures
To allow job preemption
Approach: provide users with a "trivial" logical job checkpointing service
User can save from time to time the state of the job (defined by the application)
A job can be restarted from an intermediate (i.e. previously saved) job state
Also exploited in the job partitioning framework
Different from "classical" checkpointing (i.e. saving all the information
related to a process: process's data and stack segments, open files, etc.)
Very difficult to apply (e.g. problems saving the state of open network
connections)
Not necessary for most of the DataGrid reference applications
Sequential processing cases
The state of the application is represented by a small amount of information defined by
the application itself
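A small, self-contained illustration of logical checkpointing for a sequential event loop. It is not the EDG checkpointing API: the state is kept as `<var, value>` pairs in a local JSON file, and the names `JobState`, `process_events` and `fail_at` are chosen for this sketch only.

```python
import json
import os
import tempfile


class JobState:
    """<var, value> pairs persisted to a local JSON file (illustrative)."""

    def __init__(self, path):
        self.path = path
        self.values = {}
        if os.path.exists(path):
            with open(path) as f:
                self.values = json.load(f)  # resume from the last saved state

    def get(self, name, default=None):
        return self.values.get(name, default)

    def save_value(self, name, value):
        self.values[name] = value

    def save_state(self):
        # Write to a temporary file and rename it, so a crash during the
        # write cannot corrupt the previously saved state.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.values, f)
        os.replace(tmp, self.path)


def process_events(state_path, ev_max, fail_at=None):
    state = JobState(state_path)
    processed = []
    for i in range(state.get("first_event", 0), ev_max):
        if i == fail_at:
            raise RuntimeError("simulated failure")
        processed.append(i)  # stands in for <process event i>
        state.save_value("first_event", i + 1)
        state.save_state()
    return processed


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "job.state")
        try:
            process_events(path, 6, fail_at=3)  # first run fails at event 3
        except RuntimeError:
            pass
        return process_events(path, 6)  # the restarted job resumes at event 3
```

The saved state is just the index of the next event, which is the "small amount of information defined by the application itself" that makes this approach cheap.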
Job checkpointing example
int main ()
{
  …
  for (int i=event; i < EVMAX; i++)
  {
    < process event i>;
  }
  ...
  exit(0);
}
Example of application (e.g. HEP Monte Carlo simulation)
Job checkpointing example
#include "checkpointing.h"
int main ()
{
  JobState state(JobState::job);
  event = state.getIntValue("first_event");
  PFN_of_file_on_SE = state.getStringValue("filename");
  ….
  var_n = state.getBoolValue("var_n");
  < copy file_on_SE locally>;
  …
  for (int i=event; i < EVMAX; i++)
  {
    < process event i>;
    ...
    state.saveValue("first_event", i+1);
    < save intermediate file on a SE>;
    state.saveValue("filename", PFN_of_file_on_SE);
    ...
    state.saveValue("var_n", value_n);
    state.saveState();
  }
  …
  exit(0);
}
User code must be easily instrumented in order to exploit the checkpointing framework …
User defines what is a state
Defined as <var, value> pairs
Must be "enough" to restart a computation from a previously saved state
User can save from time to time the state of the job
Retrieval of the last saved state
The job can restart from that point
Job checkpointing scenarios
Scenario 1
Job submitted to a CE
When job runs it saves from time to time its state
Job failure due to a Grid problem (e.g. CE problem)
Job resubmitted by the WMS possibly to a different CE
Job restarts its computation from the last saved state
No need to restart from the beginning
The computation done till that moment is not lost
Scenario 2
Job failure, but not detected by the Grid middleware
User can retrieve a saved state for the job (typically the last one)
edg-job-get-chkpt -o <state> <edg-jobid>
User resubmits the job, specifying that the job must start from a specific (the
retrieved one) initial state
edg-job-submit –chkpt <state> <JDL file>
Parallel jobs
Possibility to submit MPI jobs
MPICH implementation supported
Only parallel jobs inside a single CE can be submitted
Submission of parallel jobs very similar to normal jobs
Matchmaking
Only the number (n) of requested CPUs needs to be specified in the JDL
CE chosen by RB has to have MPICH software installed, and at least n total CPUs
If two or more CEs satisfy all the requirements, the one with the
highest number of free CPUs is chosen
Implementation
NodeNumber nodes allocated via CondorG-GRAM mechanisms
Job wrapper performs: mpirun -np <NodeNumber> executable
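The MPI-specific selection and the wrapper command can be sketched as follows. The CE description fields (`has_mpich`, `total_cpus`, `free_cpus`) are hypothetical stand-ins for whatever the Information Service actually publishes.

```python
def choose_mpi_ce(ces, node_number):
    """Among CEs with MPICH and enough total CPUs, prefer the most free CPUs."""
    suitable = [
        ce for ce in ces
        if ce["has_mpich"] and ce["total_cpus"] >= node_number
    ]
    if not suitable:
        return None
    return max(suitable, key=lambda ce: ce["free_cpus"])


def mpirun_command(node_number, executable):
    """Command line the job wrapper would run, as an argument list."""
    return ["mpirun", "-np", str(node_number), executable]
```

Note that the requirement is on total CPUs while the preference is on free CPUs, so a busy but large CE can still be chosen when nothing better exists.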
Job dependencies
Problem
Define and manage dependencies between jobs
Can be defined by Directed Acyclic Graphs (DAGs), whose nodes are program
executions (jobs), and whose arcs represent dependencies between them
Approach
DAG managed within the WMS by a meta-scheduler, called DAGMan (DAG
Manager)
Originally developed within the Condor project
DAGMan is responsible for navigating the graph, determining which nodes are free of
dependencies, and following the execution of the corresponding jobs
Before submitting a job (node), it is necessary to choose the resource to which to
submit it
Lazy scheduling model: a job (node) is bound to a resource just before that job
is ready to be submitted
Ready for September release
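A toy version of the DAG traversal with lazy resource binding. It is not DAGMan: submission is treated as immediate completion, and `choose_resource` is a hypothetical callback standing in for the Resource Broker.

```python
def run_dag(dependencies, choose_resource):
    """Return [(node, resource)] in submission order.

    dependencies: node -> set of nodes that must complete first
    choose_resource: callable(node) -> resource, invoked only when the
                     node has become free of dependencies (lazy binding)
    """
    remaining = {node: set(deps) for node, deps in dependencies.items()}
    done, schedule = set(), []
    while remaining:
        # Nodes whose dependencies have all completed are ready to run.
        ready = sorted(n for n, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("no runnable node left: cycle or unknown dependency")
        for node in ready:
            schedule.append((node, choose_resource(node)))
            done.add(node)  # sketch: a submitted node completes at once
            del remaining[node]
    return schedule
```

Binding late means the resource choice for a node can reflect the Grid status at the time the node becomes runnable, not at the time the whole DAG was submitted.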
Job partitioning
Problem
A job which has to process a large set of "independent elements" (e.g.
HEP events)
Worthwhile to decompose the job into smaller sub-jobs (each one responsible
for processing just a subset of the original large set of elements), in
order to reduce the overall time needed to process all these elements,
and to optimize the usage of all available Grid resources
Approach
Use of job checkpointing and DAGMan mechanisms
Original job partitioned into sub-jobs which can be executed in parallel
At the end each sub-job must save a final state, then retrieved by a job
aggregator, responsible for collecting the results of the sub-jobs and
producing the overall output
Ready for September release
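The partition and aggregate steps can be illustrated with plain ranges of event indices. How the WMS actually splits a job is not described in the slides; the even split and the `events_processed` field of the final state are assumptions.

```python
def partition(first, last, n_subjobs):
    """Split events [first, last) into at most n_subjobs contiguous ranges."""
    size, extra = divmod(last - first, n_subjobs)
    ranges, start = [], first
    for k in range(n_subjobs):
        # The first `extra` sub-jobs take one additional event each.
        end = start + size + (1 if k < extra else 0)
        if end > start:  # skip empty sub-jobs when events are scarce
            ranges.append((start, end))
        start = end
    return ranges


def aggregate(final_states):
    """Job aggregator: combine the final states saved by the sub-jobs."""
    return sum(state["events_processed"] for state in final_states)
```

Each range would become one node of a DAG whose last node is the aggregator, which is where the checkpointing and DAGMan mechanisms meet.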
Grid accounting
Purpose
Tracking the resource usage by each user
Avoid abuses of the DataGrid resources
Help the Workload Management System in balancing the resource loads
Approach:
Instead of using a “passive” accounting system that only stores information
about resource usage, we chose to experiment with an “active” accounting system
based on an economic model
Every entity in the Grid (Users, Computing Elements, Storage Elements,…) spends or earns some
“virtual currency” (GridCredits) when performing some grid-related work
A price can be dynamically assigned to every Grid-resource, so an economic feedback can help the
Broker in maintaining a workload equilibrium
Some examples:
The accounting assigns a cost to the user jobs. The job cost is then paid by the users to the
resources.
Users can submit jobs only if they have the necessary funds.
Computing elements can buy access to the storage elements.
A storage element can buy a file from another storage element.
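The first two examples (the job cost is paid by the user to the resource, and users can submit only if they have the necessary funds) can be sketched as a simple account transfer. The `Bank` class, its methods, and the account names are hypothetical and do not reflect the real accounting service's interface.

```python
class InsufficientFunds(Exception):
    """Raised when a payer cannot cover the requested amount."""


class Bank:
    """Toy bank holding one GridCredits account per Grid entity."""

    def __init__(self):
        self.accounts = {}

    def open_account(self, owner, credits=0):
        self.accounts[owner] = credits

    def transfer(self, payer, payee, amount):
        """Move credits from payer to payee, refusing overdrafts."""
        if amount < 0:
            raise ValueError("amount must be non-negative")
        if self.accounts[payer] < amount:
            raise InsufficientFunds(payer)
        self.accounts[payer] -= amount
        self.accounts[payee] += amount


def submit_job(bank, user, computing_element, job_cost):
    """Allow the submission only if the user can pay the job cost."""
    try:
        bank.transfer(user, computing_element, job_cost)
    except InsufficientFunds:
        return False
    return True


bank = Bank()
bank.open_account("alice", credits=100)
bank.open_account("ce-a")
first = submit_job(bank, "alice", "ce-a", job_cost=70)   # accepted
second = submit_job(bank, "alice", "ce-a", job_cost=70)  # rejected: 30 left
```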
Grid Accounting
DGAS (DataGrid Accounting System) software is composed of modules:
Bank service: Every entity (User, CE,…) has one account on a bank
server known as HLR
Price Authority: It is the service responsible for assigning prices to the
Grid resources. The price is assigned according to some economic
model.
Economic model: Resources can be priced with many different
algorithms. Different pricing schemes result in different behaviours of
the Grid
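To illustrate how a pricing scheme can feed back into load balancing, the sketch below uses one possible economic model: a price that grows linearly with the fraction of busy CPUs. This formula, the function names, and the sample numbers are assumptions chosen for the example, not the model used by DGAS.

```python
def price(base_price, busy_cpus, total_cpus):
    """Hypothetical pricing rule: the busier the resource, the higher its price."""
    load = busy_cpus / total_cpus
    return base_price * (1 + load)


def cheapest(resources, base_price=10):
    """Return the name of the resource with the lowest current price."""
    return min(
        resources,
        key=lambda name: price(base_price, *resources[name]),
    )


# name -> (busy CPUs, total CPUs)
resources = {"ce-a": (8, 16), "ce-b": (4, 16)}
prices = {name: price(10, busy, total) for name, (busy, total) in resources.items()}
choice = cheapest(resources)
```

A broker that prefers cheaper resources would steer new jobs towards `ce-b`, the less loaded of the two, which is the kind of workload equilibrium an economic feedback is meant to encourage.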
DGAS working scheme
WMS Security
The secure environment within the DataGrid project is based on the
Globus GSI (Grid Security Infrastructure) mechanisms
Based on Public Key Infrastructure (PKI): each user and each service
owns a credential, consisting of a pair of <private, public> keys, whose
main purpose is to prove their identity
The user or service identity and its public key are included in an X.509
certificate signed by a trusted Certification Authority (CA)
Security in the WMS
Within the WMS, the various components need a delegated user
credential (to act on behalf of that user)
E.g.: necessary when a job has to be submitted to the Computing Element
(proxy needed by Globus GRAM mechanisms)
All interactions between WMS components, especially those that are
network-separated, are mutually authenticated (each component using either its own
credential or a delegated user credential)
Proxy renewal
Why?
To avoid job failures caused by a job outliving the validity of its initial
delegated proxy, without resorting to long-term user proxies
Solution
Short-term proxies created as usual on the UI machine:
grid-proxy-init -hours <hours>
User registers proxy into a MyProxy server:
myproxy-init -s <server> [-t <cred> -c <proxy>]
User specifies the MyProxy server in the JDL to enable proxy renewal:
MyProxyServer=myproxy.host.name
The proxy is automatically renewed by the WMS, without user intervention,
for the whole lifetime of the job by a WMS proxy renewal service, which gets a
newer proxy from the MyProxy server when the existing one is about to expire
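The renewal decision can be sketched as a simple check on the remaining lifetime of the proxy. Everything here is illustrative: the dictionary representation of a proxy, the 30-minute safety margin, and the `fetch_new_proxy` callback (standing in for the request to the MyProxy server) are assumptions, not the renewal service's actual interface.

```python
from datetime import datetime, timedelta


def needs_renewal(expires_at, now, margin=timedelta(minutes=30)):
    """True when the proxy expires within the safety margin."""
    return expires_at - now <= margin


def renew_if_needed(proxy, now, fetch_new_proxy, margin=timedelta(minutes=30)):
    """Return a fresh proxy if the current one is about to expire."""
    if needs_renewal(proxy["expires_at"], now, margin):
        # Hypothetical callback that obtains a newer proxy from the server.
        return fetch_new_proxy(proxy["server"])
    return proxy


def fake_fetch(server):
    # Stand-in for the real retrieval: returns a proxy valid until 22:00.
    return {"server": server, "expires_at": datetime(2003, 6, 1, 22, 0)}


proxy = {"server": "myproxy.host.name", "expires_at": datetime(2003, 6, 1, 12, 0)}

# Five hours before expiry: the existing proxy is kept.
early = renew_if_needed(proxy, datetime(2003, 6, 1, 7, 0), fake_fetch)
# Twenty minutes before expiry: a newer proxy is fetched.
late = renew_if_needed(proxy, datetime(2003, 6, 1, 11, 40), fake_fetch)
```

A renewal service would run such a check periodically for every registered job until the job completes.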
Status
The new WMS is being integrated in the DataGrid testbed
Preliminary results of the new Workload Management System, in
terms of reliability, stability and performance, are very encouraging
A more comprehensive evaluation will be possible once real users have
carried out real test activities on the large-scale DataGrid testbed
Future
Activities will continue in the context of the EGEE project (?)
EGEE aims to create a reliable and dependable European grid infrastructure
Focus
Production quality software
Addressing all aspects of RAMPS (Reliability, Availability, Maintainability,
Performance and Scale)
Reengineering of existing software instead of new developments
Adoption of standards
The ones emerging from the Global Grid Forum
OGSA
Conclusions
In the first phase of the EDG project, WP1 implemented a working
Workload Management System prototype
Applications have been gaining experience with this WMS for a year and a half
Revised WMS architecture
To address the shortcomings that emerged
To support new functionalities
APIs, Interactive jobs, Job checkpointing, Gangmatching, …
Hooks to support other functionalities planned to be integrated later
DAGMan, Job partitioning, Grid accounting, Resource reservation and co-allocation
Further information
EDG Web site
http://www.edg.org
EDG WP1 Web site
http://www.infn.it/workload-grid