Workload management
Owen Maroney, Imperial College London
(with a little help from David Colling)
Contents
• Brief review of the WMS architecture used in LCG2.
• Future UK plans in the WMS area.
WMS used in LCG2:
• EDG release 2(.1) architecture
• Slightly hardened and made more robust
    Now takes less time to submit jobs
    But appears to be reliable and scalable to current levels of LCG-2
• Uses (modified) BDII instead of R-GMA (gin/gout)
    Strictly speaking this is a monitoring issue rather than a WMS issue.
WMS used in LCG2:
The description that follows was shown at GridPP7 and is mainly taken from
an even earlier presentation by Massimo Sgaravatto. So this is just a
reminder; however, there have been no changes in the basic architecture
between then and LCG2.
edg-job-submit myjob.jdl

Myjob.jdl
JobType = "Normal";
Executable = "$(CMS)/exe/sum.exe";
InputData = "LF:testbed0-00019";
InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.GlueHostOperatingSystemName == "linux" &&
    other.GlueHostOperatingSystemRelease == "Red Hat 7.3" &&
    other.GlueCEPolicyMaxWallClockTime > 10000;
Rank = other.GlueCEStateFreeCPUs;

UI: allows users to access the functionalities of the WMS
Job Description Language (JDL) to specify job characteristics and requirements

[Diagram: UI, RB node, Network Server, Replica Catalog, Workload Manager, Inform. Service, Job Contr. (CondorG), Computing Element (CE characts & status), Storage Element (SE characts & status)]
Job Status: submitted
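As a minimal sketch, a job description like the one on this slide could be generated programmatically. The helper `to_jdl` and its quoting rules are hypothetical and not part of the EDG tools; attributes named in `expressions` (such as Rank) are written out verbatim instead of being quoted.

```python
def to_jdl(attributes, expressions=()):
    """Render a dict as JDL-style text (hypothetical helper).

    Strings are quoted, lists become {...}, and any attribute named in
    `expressions` is emitted verbatim so that it stays an expression.
    """
    lines = []
    for name, value in attributes.items():
        if name in expressions:
            rendered = str(value)
        elif isinstance(value, (list, tuple)):
            rendered = "{" + ", ".join(f'"{v}"' for v in value) + "}"
        else:
            rendered = f'"{value}"'
        lines.append(f"{name} = {rendered};")
    return "\n".join(lines)


# Values taken from the Myjob.jdl example on the slide.
job = {
    "JobType": "Normal",
    "Executable": "$(CMS)/exe/sum.exe",
    "OutputSandbox": ["sim.err", "test.out", "sim.log"],
    "Rank": "other.GlueCEStateFreeCPUs",
}
jdl_text = to_jdl(job, expressions={"Rank"})
print(jdl_text)
```

The resulting text could be saved to a file and passed to edg-job-submit as shown on the slide.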
Job submission
NS: network daemon responsible for accepting incoming requests
[Diagram: Job and Input Sandbox files, UI to RB storage]
Job Status: submitted
Job submission
WM: responsible for taking the appropriate actions to satisfy the request
[Diagram: Job at the Workload Manager]
Job Status: submitted, waiting
Job submission
Matchmaker: where must this job be executed?
Job Status: submitted, waiting
Job submission
Matchmaker/Broker: responsible for finding the "best" CE to submit a job to
Job Status: submitted, waiting
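The matchmaking step can be sketched as a filter followed by a sort. This is only an illustration of the idea, not the broker's actual code: the CE records are made up, the `match` function is hypothetical, and it assumes that a higher Rank value is preferred. The attribute names are the Glue attributes used in the Myjob.jdl example.

```python
def match(ces, requirements, rank):
    """Return the CEs that satisfy `requirements`, best rank first."""
    eligible = [ce for ce in ces if requirements(ce)]
    return sorted(eligible, key=rank, reverse=True)


# Hypothetical CE records, as the information service might publish them.
ces = [
    {"id": "ce-a", "GlueHostOperatingSystemName": "linux",
     "GlueCEPolicyMaxWallClockTime": 20000, "GlueCEStateFreeCPUs": 4},
    {"id": "ce-b", "GlueHostOperatingSystemName": "linux",
     "GlueCEPolicyMaxWallClockTime": 5000, "GlueCEStateFreeCPUs": 50},
    {"id": "ce-c", "GlueHostOperatingSystemName": "linux",
     "GlueCEPolicyMaxWallClockTime": 30000, "GlueCEStateFreeCPUs": 12},
]


def requirements(ce):
    # Mirrors part of the Requirements expression in Myjob.jdl.
    return (ce["GlueHostOperatingSystemName"] == "linux"
            and ce["GlueCEPolicyMaxWallClockTime"] > 10000)


def rank(ce):
    # Mirrors Rank = other.GlueCEStateFreeCPUs.
    return ce["GlueCEStateFreeCPUs"]


ranked_ids = [ce["id"] for ce in match(ces, requirements, rank)]
print(ranked_ids)
```

Here ce-b is rejected because its wall-clock limit is too low, and the remaining CEs are ordered by free CPUs.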
Job submission
Matchmaker/Broker to Replica Catalog: where (on which SEs) are the needed data?
Matchmaker/Broker to Inform. Service: what is the status of the Grid?
Job Status: submitted, waiting
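A rough sketch of how the data-location question could narrow the choice of CE. Everything here is assumed for illustration: the catalogue contents, the host names, the `close_ses` mapping from each CE to its nearby SEs, and the rule that every input file must have a replica on a nearby SE. The slides do not say whether the real broker requires this or merely prefers it.

```python
def ces_near_data(logical_files, replica_catalog, close_ses):
    """Return the CE ids that have, for every file, a replica on a close SE."""
    result = []
    for ce, ses in close_ses.items():
        if all(replica_catalog.get(lf, set()) & ses for lf in logical_files):
            result.append(ce)
    return sorted(result)


# Hypothetical replica catalogue: logical file name -> SEs holding a replica.
replica_catalog = {
    "LF:testbed0-00019": {"se1.example.org", "se2.example.org"},
}

# Hypothetical mapping of each CE to the SEs considered close to it.
close_ses = {
    "ce-a": {"se1.example.org"},
    "ce-b": {"se3.example.org"},
    "ce-c": {"se2.example.org", "se3.example.org"},
}

candidates = ces_near_data(["LF:testbed0-00019"], replica_catalog, close_ses)
print(candidates)
```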
Job submission
Matchmaker: CE choice
Job Status: submitted, waiting
Job submission
JA (Job Adapter): responsible for the final "touches" to the job before performing submission (e.g. creation of wrapper script, etc.)
Job Status: submitted, waiting
Job submission
JC: responsible for the actual job management operations (done via CondorG)
Job Status: submitted, waiting, ready
Job submission
[Diagram: Job and Input Sandbox files sent to the Computing Element]
Job Status: submitted, waiting, ready, scheduled
Job submission
[Diagram: Job on the Computing Element; "Grid enabled" data transfers/accesses to the Storage Element]
Job Status: submitted, waiting, ready, scheduled, running
Job submission
[Diagram: Output Sandbox files returned from the Computing Element]
Job Status: submitted, waiting, ready, scheduled, running, done
Job submission
edg-job-get-output <dg-job-id>
[Diagram: Output Sandbox]
Job Status: submitted, waiting, ready, scheduled, running, done
Job submission
[Diagram: Output Sandbox files at the UI]
Job Status: submitted, waiting, ready, scheduled, running, done, cleared
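The sequence of job states built up over these slides can be summarised as a simple state machine. This is a sketch of the successful path only, using the status names from the slides; failure or cancellation states are not modelled, and the function names are hypothetical.

```python
# Successful-path transitions, in the order the slides introduce them.
TRANSITIONS = {
    "submitted": "waiting",
    "waiting": "ready",
    "ready": "scheduled",
    "scheduled": "running",
    "running": "done",
    "done": "cleared",
}


def advance(status):
    """Return the status that follows `status` on the successful path."""
    if status not in TRANSITIONS:
        raise ValueError(f"no transition from {status!r}")
    return TRANSITIONS[status]


def lifecycle(start="submitted"):
    """Return the full list of statuses from `start` to the final one."""
    path = [start]
    while path[-1] in TRANSITIONS:
        path.append(TRANSITIONS[path[-1]])
    return path


print(" -> ".join(lifecycle()))
```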
Logging and bookkeeping
edg-job-status <dg-job-id>
LB: receives and stores job events; processes corresponding job status
LM: parses CondorG log file (where CondorG logs info about jobs) and notifies LB
[Diagram: UI, RB node, Network Server, Workload Manager, Job Contr. (CondorG), Log Monitor, Logging & Bookkeeping, Computing Element; log of job events, job status]
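The LB role described here, storing events and deriving a job status from them, can be sketched as follows. The `Bookkeeping` class is hypothetical and greatly simplified: events are reduced to bare status names, and the current status is taken to be the most recent event. The real service records richer events, and the CondorG log format parsed by the LM is not shown in the slides, so it is not modelled.

```python
from collections import defaultdict


class Bookkeeping:
    """Toy event store: keeps every event and derives the current status."""

    def __init__(self):
        self._events = defaultdict(list)

    def log_event(self, job_id, status):
        # Called by components (for example a log monitor) to report an event.
        self._events[job_id].append(status)

    def history(self, job_id):
        # Return a copy so callers cannot modify the stored events.
        return list(self._events.get(job_id, []))

    def status(self, job_id):
        # Current status is the last recorded event, or None if unknown.
        events = self._events.get(job_id)
        return events[-1] if events else None


lb = Bookkeeping()
for event in ("submitted", "waiting", "ready", "scheduled"):
    lb.log_event("job-1", event)
print(lb.status("job-1"))
```

A query such as edg-job-status would then correspond to a call like `lb.status(...)` in this toy model.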
Future UK plans
The WMS will change with ARDA
(e.g. it will go to a pull rather than a push model for job distribution)
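The difference between the two models can be illustrated with a generic sketch. The slides give no details of the ARDA design, so this is not a description of it: both functions and all job and CE names are hypothetical. In the push model the broker picks a CE for each job at submission time; in the pull model jobs wait in a central queue and a CE takes the next one when it asks for work.

```python
from collections import deque


def push_dispatch(jobs, choose_ce):
    """Push model: the broker assigns every job to a CE up front."""
    return {job: choose_ce(job) for job in jobs}


def pull_dispatch(jobs, ce_requests):
    """Pull model: each CE request takes the next job from a central queue."""
    queue = deque(jobs)
    assignment = {}
    for ce in ce_requests:
        if not queue:
            break
        assignment[queue.popleft()] = ce
    return assignment


pushed = push_dispatch(["j1", "j2"], lambda job: "ce-a")
pulled = pull_dispatch(["j1", "j2", "j3"], ["ce-b", "ce-a"])
print(pushed)
print(pulled)
```

In the pull case j3 stays queued until another CE asks for work, so no job is committed to a CE that is not ready for it.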
UK emphasis is going to be on testing scalability
Plan is:
    Instrument WMS code
    Build testbed (between Imperial HEP and LeSC) capable of simulating the load of the entire LCG
    Understand the characteristics of different sorts of (HEP) job and feed this into the simulation.
Also plan:
    To examine and understand the performance of the WMS in operation.
Future UK plans
Details of the testbed construction are still to be worked out;
however, this effort will be integrated into the EGEE/LCG test plan.
This effort also neatly dovetails into the GridCC project
(see talk at GridPP11?)