
Job Submission and Resource Brokering
WP 1
Contents:
•The components
•What (should) work now, and configuration
•How to submit jobs … the UI and JDL
•Planned future functionality
Documentation available from:
http://server11.infn.it/workload-grid/documents.htm
A particularly gripping read is the “Administrator and
User Guide” released last Friday.
The User Interface (UI):
All user interactions are through the UI
Installed on the submitting machine
Communicates with both the Resource Broker (RB) and
the Logging Broker (LB)
On job submission the UI assigns a unique job
identifier to the job (dg_jobId), sends the executable,
Job Description File and Input Sandbox to the RB. It
also sends notification of the submission to the LB.
The UI can also be used to query the status of a job…
which it does by interrogating the LB
Configuration:
The UI configuration is contained in UI_ConfigEnv.cfg,
which holds the following information:
• address and port of accessible RBs
• address and port of accessible LBs
• default location of the local storage areas for the
Input/Output sandbox files
• default values for the JDL mandatory attributes
• default number of retries on fatal errors when
connecting to the LB.
Users concurrently using the same submitting machine
use the same configuration files.
For users (or groups of users) having particular needs it
is possible to “customise” the UI configuration through
the -config option supported by each UI command.
The Resource Broker (RB):
Situated at a central location (not local to your machine).
One RB is expected per VO; currently there is only one, at CERN.
Jobs are queued locally (stored in a PostgreSQL database).
Interrogates the replica catalogue and the information
services and attempts to match the job to an available
resource. Matching is based on the Condor ClassAd
library.
If a suitable match is made, the RB can submit the job to
the Job Submission Service (JSS). All events
and status information are sent to the LB.
Configuration:
Most people will never need to configure their own RB.
However for completeness the configuration file is:
<install path>/etc/rb.conf. This contains entries for the
replica catalogue, the MDS etc.
For more detailed information see the “Administrator
and User Guide”.
Input/Output Sandboxes etc are stored on the machine
hosting the RB and so a reasonable amount of disk space
is required.
The Job Submission Service (JSS):
If the RB has successfully matched a job to a resource, the
job is passed to the JSS (which is usually on the same
machine).
The JSS queues the job internally in a PostgreSQL
database.
Job submission is performed using Condor-G
The JSS also monitors jobs until their completion,
notifying the LB of any significant events.
Configuration:
Again, most people will never need to configure a JSS server.
The configuration file is <install path>/etc/jss.conf
The Logging Broker (LB):
All events throughout the job submission, execution
and output retrieval processes are logged by the LB
in a MySQL database.
All information is time stamped.
It is through the logged information that users are
able to discover the state of their jobs.
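The idea of deriving a job's state from time-stamped events can be sketched in a few lines of Python. This is only an illustration: the class name, the in-memory list and the event names other than RUNNING are hypothetical, and the real LB stores its events in MySQL with its own schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EventLog:
    """Toy in-memory stand-in for the LB event store (hypothetical)."""
    events: list = field(default_factory=list)

    def log(self, job_id, source, event, when=None):
        # Every event is time stamped; default to "now" in UTC.
        when = when or datetime.now(timezone.utc)
        self.events.append((when, job_id, source, event))

    def status(self, job_id):
        # The job state is taken from the most recent event for that job.
        mine = [e for e in self.events if e[1] == job_id]
        if not mine:
            return None
        return max(mine, key=lambda e: e[0])[3]


log = EventLog()
t0 = datetime(2001, 6, 5, 10, 20, 0, tzinfo=timezone.utc)
log.log("job-1", "UI", "SUBMITTED", t0)
log.log("job-1", "RB", "MATCHED", t0.replace(minute=22))
log.log("job-1", "JSS", "RUNNING", t0.replace(minute=24))
print(log.status("job-1"))  # RUNNING
```

Each component (UI, RB, JSS) only appends events; the status query reads the log and never needs to contact the components themselves.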
Configuration:
An LB local logger must be installed on all machines
which are pushing information into the LB system (RB
and JSS machines and the gatekeeper machines of
each CE). An exception to this is the job submission
machine which can have a local logger but it is not
mandatory.
The LB server need only be installed on a server
machine.
The local logger requires no configuration and the
server is configured when the database is created
using <install path>/etc/server.sql.
No further configuration is required.
Submitting a job:
ClassAds are:
Declarative – rather than procedural… that is, they
describe notions of compatibility rather than specifying
a procedure to determine compatibility
Simple – both syntactically and semantically … easy to use
Portable – Nothing is used that requires features specific
to a given architecture
ClassAds have dynamic typing and so only values have
types (not expressions)
As well as the usual types (numeric, string, Boolean), values
can also have types such as time intervals and timestamps,
and esoteric values such as undefined and error.
ClassAds can be nested
ClassAds have the usual set of operators (See the JDL
how to).
An example:
Executable = "WP1testF";
StdOutput = "sim.out";
StdError = "sim.err";
InputSandbox = {"/home/datamat/sim.exe",
"/home/datamat/DATA/*"};
OutputSandbox = {"sim.out","sim.err","testD.out"};
Rank = other.TotalCPUs * other.AverageSI00;
Requirements = other.LRMSType == "PBS" \
&& (other.OpSys == "Linux RH 6.1" || other.OpSys == "Linux RH 6.2") && \
self.Rank > 10 && other.FreeCPUs > 1;
RetryCount = 2;
Arguments = "file1";
InputData = "LF:test10099-1001";
ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
DataAccessProtocol = "gridftp";
OutputSE = "grid001.cnaf.infn.it";
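To make the roles of Requirements and Rank concrete, here is a rough Python rendering of how a broker might use the two expressions from the example above: Requirements filters the resources, Rank orders the survivors. The resource records are invented for illustration, and the sketch assumes that a higher Rank is preferred; it does not use the ClassAd library and is not the RB's actual algorithm.

```python
def rank(resource):
    # Rank = other.TotalCPUs * other.AverageSI00
    return resource["TotalCPUs"] * resource["AverageSI00"]


def requirements(resource):
    # Requirements expression from the example, written as plain Python.
    return (
        resource["LRMSType"] == "PBS"
        and resource["OpSys"] in ("Linux RH 6.1", "Linux RH 6.2")
        and rank(resource) > 10
        and resource["FreeCPUs"] > 1
    )


# Hypothetical resources, standing in for what the information services publish.
resources = [
    {"Name": "ce-a", "LRMSType": "PBS", "OpSys": "Linux RH 6.2",
     "TotalCPUs": 16, "AverageSI00": 20, "FreeCPUs": 4},
    {"Name": "ce-b", "LRMSType": "LSF", "OpSys": "Linux RH 6.2",
     "TotalCPUs": 64, "AverageSI00": 30, "FreeCPUs": 10},
    {"Name": "ce-c", "LRMSType": "PBS", "OpSys": "Linux RH 6.1",
     "TotalCPUs": 32, "AverageSI00": 25, "FreeCPUs": 1},
    {"Name": "ce-d", "LRMSType": "PBS", "OpSys": "Linux RH 6.1",
     "TotalCPUs": 8, "AverageSI00": 50, "FreeCPUs": 2},
]

# Keep only matching resources, best rank first (assumed: higher is better).
candidates = sorted((r for r in resources if requirements(r)),
                    key=rank, reverse=True)
best = candidates[0]["Name"] if candidates else None
print([r["Name"] for r in candidates], best)
```

Here ce-b is rejected on LRMSType and ce-c on FreeCPUs, leaving ce-d (rank 400) ahead of ce-a (rank 320).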
Submitting a job:
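Truth tables for the ClassAd logical operators (F = false, T = true, U = undefined, E = error):

AND | F  T  U  E        OR | F  T  U  E        NOT |
----+-----------       ----+-----------       ----+---
 F  | F  F  F  E        F  | F  T  U  E        F  | T
 T  | F  T  U  E        T  | T  T  T  E        T  | F
 U  | F  U  U  E        U  | U  T  U  E        U  | U
 E  | E  E  E  E        E  | E  E  E  E        E  | E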
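The truth tables above can be reproduced with a few rules, sketched here in Python. The function names and the single-letter encoding are choices made for this illustration only; the sketch follows the tables as given on the slide rather than the ClassAd library's implementation.

```python
F, T, U, E = "F", "T", "U", "E"  # false, true, undefined, error


def and_(a, b):
    if E in (a, b):      # error dominates everything
        return E
    if F in (a, b):      # a single false decides the result
        return F
    if U in (a, b):      # otherwise undefined propagates
        return U
    return T


def or_(a, b):
    if E in (a, b):      # error dominates everything
        return E
    if T in (a, b):      # a single true decides the result
        return T
    if U in (a, b):      # otherwise undefined propagates
        return U
    return F


def not_(a):
    return {F: T, T: F, U: U, E: E}[a]


# Print the AND and OR tables row by row, in the order F, T, U, E.
for name, op in (("AND", and_), ("OR", or_)):
    print(name)
    for a in (F, T, U, E):
        print(" ", a, "|", " ".join(op(a, b) for b in (F, T, U, E)))
```

The practical consequence for a Requirements expression is that a resource which does not publish an attribute yields undefined rather than false, so a conjunction with it is only decided if some other term is false.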
Submitting a job:
dg-job-submit
Allows the user to submit a job for execution on remote
resources in a grid.
SYNOPSIS
dg-job-submit [-help]
dg-job-submit [-version]
dg-job-submit [-template]
dg-job-submit <job_description_file> [-input input_file | -resource res_id] [-notify e_mail_address] [-config group_name] [-output out_file] [-noint] [-debug]
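For scripted submission, the command line can be assembled programmatically. The following Python helper is a sketch only: the function name and the file names are hypothetical, it covers just a subset of the options listed in the synopsis, and it builds the argument list without running anything.

```python
def build_submit_command(jdl_file, notify=None, config=None,
                         output=None, noint=False, debug=False):
    """Return the argv list for a dg-job-submit call (nothing is executed)."""
    argv = ["dg-job-submit", jdl_file]
    # Options that take a value are added only when a value was given.
    for flag, value in (("-notify", notify),
                        ("-config", config),
                        ("-output", output)):
        if value is not None:
            argv += [flag, value]
    # Plain switches.
    if noint:
        argv.append("-noint")
    if debug:
        argv.append("-debug")
    return argv


cmd = build_submit_command("job.jdl", output="job_ids.txt", noint=True)
print(" ".join(cmd))  # dg-job-submit job.jdl -output job_ids.txt -noint
```

On a machine with the UI installed, the resulting list could be handed to subprocess.run(cmd) from the standard library.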
##############################################
#
# -------- Job description file ---------#
##############################################
Executable = "$(CMS)/fpacini/exe/sum.exe";
InputData = "LF:testbed0-00019";
ReplicaCatalog = "ldap://firefox.esrin.esa.it:2155/ReplicaCatalog1";
DataAccessProtocol = "gridftp";
RetryCount = 10;
Rank = other.MaxCpuTime;
Requirements = other.LRMSType == "Condor" && \
other.Architecture == "INTEL" && other.OpSys == "LINUX" && \
other.FreeCpus >= 4;
dg-get-job-output
This command asks the RB for the job output files (specified by the
OutputSandbox attribute of the job-ad) and stores them on the local disk
of the submitting machine.
SYNOPSIS
dg-get-job-output [-help]
dg-get-job-output [-version]
dg-get-job-output < dg_jobId1 … dg_jobIdn | -input input_file > [-dir directory_path] [-config group_name] [-noint] [-debug]
Examples
Consider the following command:
$> dg-get-job-output https://grid004.it:2234/124.75.74.12/12354732109721?www.rb.com:4577 -dir /home/data
It retrieves the files listed in the OutputSandbox attribute from the RB and
stores them locally in /home/data/12354732109721.
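In the example above the output directory is the -dir path followed by the last path component of the dg_jobId. A small Python sketch of that mapping, using only the standard library, is shown below. The function name is hypothetical, and the rule is inferred from this single example rather than taken from a specification of the dg_jobId format.

```python
import posixpath
from urllib.parse import urlsplit


def output_dir_for(job_id, base_dir):
    """Guess the local output directory from a dg_jobId (inferred rule)."""
    parts = urlsplit(job_id)
    # The last path component is the per-job unique string in the example.
    unique = posixpath.basename(parts.path)
    return posixpath.join(base_dir, unique)


job_id = "https://grid004.it:2234/124.75.74.12/12354732109721?www.rb.com:4577"
print(output_dir_for(job_id, "/home/data"))  # /home/data/12354732109721
```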
dg-list-job-match
Returns the list of resources fulfilling job requirements.
SYNOPSIS
dg-list-job-match [-help]
dg-list-job-match [-version]
dg-list-job-match <job_description_file> [-verbose] [-config group_name] [-output output_file] [-noint] [-debug]
dg-job-cancel
Cancels one or more submitted jobs.
SYNOPSIS
dg-job-cancel [-help]
dg-job-cancel [-version]
dg-job-cancel < dg_jobId1 … dg_jobIdn | -input input_file | -all > [-notify e_mail_address] [-config group_name] [-output output_file] [-noint] [-debug]
dg-job-status
Displays bookkeeping information about submitted jobs.
SYNOPSIS
dg-job-status [-help]
dg-job-status [-version]
dg-job-status < dg_jobId1 … dg_jobIdn | -input input_file | -all > [-full] [-config group_name] [-output output_file] [-noint] [-debug]
Examples
$> dg-job-status dg_jobId2
displays the following lines:
********************************************************************
BOOKKEEPING INFORMATION
Printing status for the job: dg_jobId2
dg_JobId          = firefox.esrin.esa.it__20010514_163007_21833_RB1_LB3
Job Owner         = /C=IT/O=ESA/OU=ESRIN/CN=Fabrizio Pacini/[email protected]
Status            = RUNNING
Location          = firefox.esa.it:2119/jobmanager-condor
Job Destination   = http://ramses.esrin.esa.it/rams/dataset1
Status Enter Time = 10:24:32 05-06-2001 GMT
Last Update Time  = 10:25:45 05-06-2001 GMT
CpuTime           = 1
********************************************************************
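Output of this kind is easy to post-process in a script. The Python sketch below turns the bookkeeping lines into a dictionary; it assumes one "name = value" pair per line, as in the example, and the function name is hypothetical. Real output may differ between releases.

```python
def parse_status(text):
    """Collect 'name = value' lines into a dict; other lines are ignored."""
    info = {}
    for line in text.splitlines():
        # Split at the first '=' only, so values containing '=' stay intact.
        key, sep, value = line.partition("=")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


sample = """\
BOOKKEEPING INFORMATION
Printing status for the job: dg_jobId2
Status            = RUNNING
Location          = firefox.esa.it:2119/jobmanager-condor
Status Enter Time = 10:24:32 05-06-2001 GMT
CpuTime           = 1
"""

info = parse_status(sample)
print(info["Status"], info["CpuTime"])  # RUNNING 1
```

All values are returned as strings; converting CpuTime to a number or the timestamps to date objects is left to the caller.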
dg-get-logging-info
Displays logging information about submitted jobs.
SYNOPSIS
dg-get-logging-info [-help]
dg-get-logging-info [-version]
dg-get-logging-info < dg_jobId1 … dg_jobIdn | -input input_file | -all > [-from T1] [-to T2] [-level logLevel] [-config group_name] [-output output_file] [-noint] [-debug]
Job Submission: There is a GUI
Things to come over the next year
Release | Dependencies | Job                                           | Partner
--------+--------------+-----------------------------------------------+--------------------------
1.4     | WP4          | Support for interactive jobs                  | UI/RB/JSS groups
2       | WP4          | Support for job partitioning                  | INFN PD/PPARC
1.3     | WP4          | Ability to submit MPI jobs                    | UI/RB/JSS groups
1.4     | WP4          | Specification of job dependencies             | INFN CNAF/PPARC
1.4     | WP7 WP2      | Triggering of file transfers                  | INFN TO + Catania
1.4     | WP7          | Integration of network into scheduling policy | INFN TO + Catania + CNAF?
1.3     |              | Develop APIs for application                  | DATAMAT
1.4     |              | Development of GUI                            | DATAMAT
Deployment of accounting infrastructure over testbeds
(HLR with command line interface)
INFN TO
2
Full integration of cost estimation/accounting into
scheduling policies
INFN TO +CT
1.?
Review command requirement from D8.1A: "hold",
"move queue". Document reviewed by February.
Implications to RB architecture to be understood.
DATAMAT
1.4
Globus CAS
+WP4
1.?
WP8 WP9
WP10
1.2
Review of job info from D8.1A. Document to be
reviewed by January. Implications may need
coordination/blessing of WP4, and need to be finalised
and matched alongside their schedule
CESNET
Support for Proxy renewal
CESNET, JSS part INFN PD
possible UI change
???
WP3
Availability of L&B info through "standard" WP3
mechanism. Interfacing with WP3 R-GMA will be tested
by May. Feedback will be provided.
CESNET
1.4
WP2 WP4 WP5
WP7
Advanced reservation API. Usefulness dependent on
Testbed QoS configuration
INFN CNAF
Integration of advanced reservation (co-allocation) into
RB
INFN CNAF
2