Workpage 1 Middleware
David Colling, Imperial College London, [email protected]

I shall talk about release 2, as you should all know about release 1 by now!

With thanks…
Many slides have been taken from Massimo Sgaravatto and were used in the EU review.

The architecture

Completely new architecture: the workload management system has been refactored to streamline the flow of job information, thereby addressing problems and shortcomings found with release 1.x. The refactored components also provide hooks and features to support new functionality. The best description is in deliverable D1.4 (the state at the time of the review); see D1.4 for details…

User Interface: although there have been several changes to the architecture, the commands available at the user end are the same, so the new architecture looks the same to the users.

Network Server: a generic network daemon, responsible for accepting incoming requests from the UI (e.g. job submission, job removal), which, if valid, are then passed to the Workload Manager.

Workload Manager: the core component of the Workload Management System. Given a valid request, it must take the appropriate actions to satisfy it. To do so, it may need support from other components, which are specific to the different request types.

Resource Broker: this has been turned into one of the modules that help the Workload Manager, actually three submodules:
• Matchmaking
• Ranking
• Scheduling

Job Adapter: puts the finishing touches to the job's JDL and creates the job wrapper.

Job Controller and CondorG: actually submit the job to the resources and track its progress.

So how does this all work…

Job submission example (for a "simple" job)
[Diagram: RB node containing Network Server, Workload Manager and Job Controller/CondorG; plus UI, Replica Catalog, Information Service, Computing Element and Storage Element, with CE and SE characteristics & status feeding the Information Service]
edg-job-submit myjob.jdl

myjob.jdl:

  JobType = "Normal";
  Executable = "$(CMS)/exe/sum.exe";
  InputData = "LF:testbed0-00019";
  ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g,dc=cnaf,dc=infn,dc=it";
  DataAccessProtocol = "gridftp";
  InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
  OutputSandbox = {"sim.err", "test.out", "sim.log"};
  Requirements = other.GlueHostOperatingSystemName == "linux" && other.GlueHostOperatingSystemRelease == "Red Hat 6.2" && other.GlueCEPolicyMaxWallClockTime > 10000;
  Rank = other.GlueCEStateFreeCPUs;

Job submission

UI: allows users to access the functionalities of the WMS. The Job Description Language (JDL) is used to specify job characteristics and requirements. (Job status: submitted)

NS: the network daemon responsible for accepting the incoming request. The Input Sandbox files are transferred from the UI to the RB storage. (Job status: submitted → waiting)

WM: responsible for taking the appropriate actions to satisfy the request. Where must this job be executed?
Matchmaker: responsible for finding the "best" CE where to submit the job. It asks the Replica Catalog where (on which SEs) the needed data are, and asks the Information Service for the status of the Grid (CE and SE characteristics and status), then makes the CE choice.

JA (Job Adapter): responsible for the final "touches" to the job before performing submission (e.g. creation of the wrapper script, etc.). (Job status: waiting → ready)

JC (Job Controller): responsible for the actual job management operations (done via CondorG). (Job status: ready → scheduled)
The Input Sandbox files are transferred to the Computing Element and the job starts running, with "Grid enabled" data transfers/accesses to the Storage Element. (Job status: scheduled → running → done)

edg-job-get-output <dg-job-id>: the Output Sandbox files are copied from the Computing Element to the RB storage and then retrieved by the UI. (Job status: done → cleared)

Logging and bookkeeping

edg-job-status <dg-job-id>

LB (Logging & Bookkeeping): receives and stores job events, and processes them into the corresponding job status.

LM (Log Monitor): parses the CondorG log file (where CondorG logs information about jobs) and notifies the LB.

Timescales and functionality…

Release 2 of the WP 1 software is due at J+27. New functionality to include:
• MPI job submission
• User APIs
• Accounting infrastructure
• Interactive job support
• Job logical checkpointing

Pretty much on time (says Massimo). It will be tested against non-EDG resources using GLUE. I don't have time to go through all of these, so I will just go through checkpointing (as this was shown in the review).
Job Checkpointing

Job checkpoint states are saved in the LB server (state.saveState()), and a job can later retrieve its checkpoint. The LB is also used (even in release 1) as the repository of job status, and has already proved to be robust and reliable. The load can be distributed between multiple LB servers, to address scalability problems.

Job checkpointing scenario

edg-job-submit jobchkpt.jdl

jobchkpt.jdl:

  [
    JobType = "Checkpointable";
    Executable = "hsum.exe";
    StdOutput = "Outfile";
    InputSandbox = "/home/user/hsum.exe";
    OutputSandbox = "Outfile";
    Requirements = member("ROOT", other.GlueHostApplicationSoftwareRunTimeEnvironment) && member("CHKPT", other.GlueHostApplicationSoftwareRunTimeEnvironment);
    Rank = -other.GlueCEStateEstimatedResponseTime;
  ]

The job is submitted as before: the UI passes it to the Network Server, the Input Sandbox files go to the RB storage, the Workload Manager (Matchmaker, Job Adapter) selects a CE, and the Job Controller/CondorG submits it to Computing Element X. (Job status: submitted → waiting → ready → scheduled → running)
From time to time the user's job asks to save its intermediate state:

  …
  <save intermediate files>;
  State.saveValue("var1", value1);
  …
  State.saveValue("varn", valuen);
  State.saveState();
  …

The intermediate files are saved (e.g. on a Storage Element) and the job state is saved in the Logging & Bookkeeping Server.

The job then fails (e.g. because of a CE problem). (Job status: running → done (failed))

The Workload Manager reschedules and resubmits the job. The Matchmaker again asks where this job must be executed, possibly choosing a different CE from the one where the job was previously submitted; here the CE choice is Computing Element Y. The Job Adapter prepares the job and the Job Controller/CondorG submits it to Computing Element Y, transferring the Input Sandbox files again. (Job status: done (failed) → waiting → ready → scheduled)
When the job starts on Computing Element Y it retrieves the last saved state from the LB server, along with the intermediate files previously saved. (Job status: scheduled → running)

The job keeps running, starting from the point corresponding to the retrieved state; it does not need to start from the beginning.

Job checkpointing example

Example of an application (e.g. a HEP Monte Carlo simulation):

  int main () {
    …
    for (int i = event; i < EVMAX; i++) {
      <process event i>;
    }
    ...
    exit(0);
  }

The same application instrumented for checkpointing:

  #include "checkpointing.h"
  int main () {
    JobState state(JobState::job);
    event = state.getIntValue("first_event");
    PFN_of_file_on_SE = state.getStringValue("filename");
    …
    var_n = state.getBoolValue("var_n");
    <copy file_on_SE locally>;
    …
    for (int i = event; i < EVMAX; i++) {
      <process event i>;
      ...
      state.saveValue("first_event", i+1);
      <save intermediate file on a SE>;
      state.saveValue("filename", PFN of file_on_SE);
      ...
      state.saveValue("var_n", value_n);
      state.saveState();
    }
    …
    exit(0);
  }

• User code must be easily instrumented in order to exploit the checkpointing framework.
• The user defines what a state is: it is defined as <var, value> pairs, and must be "enough" to restart the computation from a previously saved state.
• The user can save the state of the job from time to time.
• On restart, the last saved state is retrieved and the job can restart from that point.

Further additional functionality

The order of implementation is not up to the WP 1 people…

Dependent jobs: using Condor DAGMan. These use the same JDL as normal jobs, for example:

  A = [ Executable = "A.sh"; PreScript = "PreA.sh"; PreScriptArguments = { "1" }; Children = { "B", "C" } ];
  B = [ Executable = "B.sh"; PostScript = "PostA.sh"; PostScriptArguments = { "$RETURN" }; Children = { "D" } ];
  C = [ Executable = "C.sh"; Children = { "D" } ];
  D = [ Executable = "D.sh"; PreScript = "PreD.sh"; PostScript = "PostD.sh"; PostScriptArguments = { "1", "a" } ]

Job partitioning will be similar to checkpointing, with the jobs being partitioned according to some variable. Partitioned jobs will also have a pre-job and an aggregator, e.g.:

  JobType = Partitionable;
  Executable = ...;
  JobSteps = ...;
  StepWeight = ...;
  Requirements = ...;
  ...
  Prejob = [
    Executable = ...;
    Requirements = ...;
    ...
  ];
  Aggregator = [
    Executable = ...;
    Requirements = ...;
    ...
  ];

There will also be advanced reservation of resources and co-location.

Further into the future…

EDG will not use OGSA; however, the future is in the OGSA grid world. Work is being done at LeSC (see Steven Newhouse's talk tomorrow) to wrap the WP 1 components:
• Communication via JDML and LBML
• Virtualisation of the RB through an OGSA factory
• Use of virtualisation to load balance
• Increased interoperability

Summary

• The workload management middleware is being refactored, addressing shortcomings in releases 1.x.
• This allows additional functionality to be easily incorporated, and the use of the components by other projects.
• OGSA is the future, and work is being done to allow the WP 1 components to work in such a world.