SLURM for Watson Bluegene/Q


SLURM for Yorktown Bluegene/Q
© 2014 IBM Corporation
SLURM on Wat2q
• Goals
• Set up a scheduler for the Yorktown Bluegene system to increase research utilization of the system.
• Become familiar with the Bluegene/Q SRM (system resource manager) interfaces, as it is a model for future HPC control APIs.
• Divide the Yorktown system into multiple submidplane blocks.
• Develop scripts to allow users (optionally) to land on a specific submidplane block.
• Get SLURM to run the bgas.pl script automatically based on information in the SLURM sbatch command used to queue a job.
• This requires that jobs be limited to running on complete partitions.
• SLURM by default will attempt to run a job on part of a submidplane partition if that partition is already booted.
• This is accomplished with prolog scripts (see the sketch below).
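As a rough illustration of the prolog approach (a hypothetical sketch, not the actual wat2q prolog or bgas.pl; the environment variable and the bgas.pl invocation are assumptions, and the real script and its arguments are site-specific):
#!/bin/bash
# Hypothetical prolog sketch: decide, from the partition the job was queued to,
# whether the bgas.pl switch script needs to run before the job starts.
# SLURM_JOB_PARTITION is assumed to be available in the prolog environment.
if [ "$SLURM_JOB_PARTITION" = "bgas" ]; then
    /usr/local/sbin/bgas.pl || exit 1   # a non-zero exit keeps the job from starting
fi
exit 0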
SLURM Scheduling Jobs
SLURM Allocation vs. Task Placement
• Allocation is the selection of the resources needed for the job.
– Each job includes zero or more job steps (srun).
– Each job step comprises one or more tasks.
– This is done by the “sbatch” command.
• Task placement is the process of assigning a subset of the job’s allocated resources (cpus) to each task.
– This is handled by the SLURM “srun” command invoked from within the script scheduled by “sbatch”.
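A minimal sketch of the split (the executable paths and task counts are illustrative assumptions): sbatch makes the allocation, and each srun inside the script is a job step that places tasks on it.
#!/bin/bash
#SBATCH --nodes=64            # allocation: sbatch reserves a 64-node block
# Task placement: each srun below is a job step run within that allocation.
srun --ntasks-per-node=16 /path/to/app_a.elf
srun --ntasks-per-node=32 /path/to/app_b.elf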
Effectively this becomes a game of Tetris
Slurm documentation
• Slurm docs can be found here:
– http://slurm.schedmd.com/documentation.html
• Typical commands:
sacct – displays accounting data for all jobs and job steps in the SLURM job accounting log.
sbatch – submits a batch job to SLURM.
scancel – used to signal jobs or job steps that are under the control of SLURM.
scontrol – used to view and modify the SLURM configuration and state.
sinfo – views information about SLURM nodes and partitions.
smap – graphically views information about SLURM jobs and partitions and sets configuration parameters.
squeue – views information about jobs located in the SLURM scheduling queue.
srun – runs parallel jobs.
sstat – displays various status information of a running job/step.
sview – graphical user interface to view and modify SLURM state.
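For a quick orientation on a running system, a few of these in combination (a generic sketch; the job id is a placeholder):
sinfo                        # partitions and node states
squeue -u $USER              # your queued and running jobs
scontrol show job <jobid>    # detailed state of one job
sacct -j <jobid>             # accounting record once the job has finished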
SLURM functions
• SLURMD carries out five key tasks and has five corresponding subsystems:
– Machine Status
• responds to SLURMCTLD requests for machine state information and sends asynchronous reports of state changes to help with queue control.
– Job Status
• responds to SLURMCTLD requests for job state information and sends asynchronous reports of state changes to help with queue control.
– Remote Execution
• starts, monitors, and cleans up after a set of processes (usually shared by a parallel job), as decided by SLURMCTLD (or by direct user intervention).
– Stream Copy Service
• handles all STDERR, STDIN, and STDOUT for remote tasks. This may involve redirection, and it always involves locally buffering job output to avoid blocking local tasks.
– Job Control
• propagates signals and job-termination requests to any SLURM-managed processes (often interacting with the Remote Execution subsystem).
Slurm software
• SLURM daemons don’t execute directly on the compute nodes.
• SLURM gets system state, allocates resources, and obtains other state from the Bluegene/Q control system.
• This interface is entirely contained in a SLURM plugin (src/plugins/select/bluegene).
• The user interacts with the Bluegene system using the following SLURM commands:
– sbatch
– srun
– scontrol
– squeue
Slurm Architecture for Bluegene/Q
Job Launch Process
Sview of BlueGene system
Slurm naming conventions
• Slurm names things with torus coordinates.
• Top-level names use four-dimensional midplane coordinates.
• Submidplane partitions use five-dimensional torus coordinates.
Bgq name          Slurm name
R00-M0            bgq0000
R00-M1            bgq0001
R01-M0            bgq0010
R01-M1            bgq0011
R00               bgq[0000x0001]
R01               bgq[0010x0011]
R00R01            bgq[0000x0011]
R00-M0-N00        bgq0000[00000x11111]
R00-M0-N01        bgq0000[00200x11311]
R00-M0-N02        bgq0000[00020x11131]
R00-M0-N03        bgq0000[00220x11331]
R00-M0-N04        bgq0000[20000x31111]
R00-M0-N05        bgq0000[20200x31311]
R00-M0-N06        bgq0000[20020x31131]
R00-M0-N07        bgq0000[20220x31331]
R00-M0-N08        bgq0000[02000x12111]
R00-M0-N09        bgq0000[02200x13311]
R00-M0-N10        bgq0000[02020x13131]
R00-M0-N11        bgq0000[03330x13331]
R00-M0-N12        bgq0000[22000x33111]
R00-M0-N13        bgq0000[22200x33311]
R00-M0-N14        bgq0000[22020x33131]
R00-M0-N15        bgq0000[22220x33331]

Example larger block:
R01-M0-N00-128    bgq0010[00000x11331]
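These SLURM names are what the ordinary commands expect; for example (a hedged illustration, with a hypothetical script name):
sinfo -n bgq0000                                # state of midplane R00-M0
sbatch --nodelist=bgq0001 --nodes=64 myjob.sh   # restrict a job to R00-M1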
Slurm queuing a job.
• Use the sbatch command to queue a script that will run one or more jobs.
• Within the script presented to the sbatch command, do one or more “srun” commands.
– The srun command will eventually cause a runjob command to be created.
• For example, this schedules the script rj01.sh to be run when a 64-node block on the partition “prod” is booted:
sbatch --nodes=64 --partition=prod rj01.sh
– Inside rj01.sh we have:
#!/bin/bash
srun --chdir=/bgusr/home1/bvt_scratch /bgusr/home1/bgqadmin/bvtapps/dgemmdiag/dgemmdiag.elf
– The srun will call runjob as follows:
runjob --exe /bgusr/home1/bgqadmin/bvtapps/dgemmdiag/dgemmdiag.elf --block RMP28Ap122959767 \
    --cwd /bgusr/home1/bvt_scratch
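The same request can also be embedded in the script with #SBATCH directives, so only the script name is given on the command line (a sketch reusing the paths above); it is then queued simply with “sbatch rj01.sh”.
#!/bin/bash
#SBATCH --nodes=64
#SBATCH --partition=prod
srun --chdir=/bgusr/home1/bvt_scratch /bgusr/home1/bgqadmin/bvtapps/dgemmdiag/dgemmdiag.elf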
Queuing a job with only one script.
• Using sbatch/srun to queue a job typically requires two scripts: one to queue the job (sbatch) and one to run one or more jobs (srun) once the block is allocated.
• One can do this with a single script using this simple boilerplate:
#!/bin/bash
if [ -z "$SLURM_JOBID" ]; then
    # Not yet running under SLURM: submit this same script ($0) as a batch job.
    sbatch --gid=bqluan --time=5:00 --nodes=128 --ntasks-per-node=32 -O --qos=umax-128 $0
else
    # Now running inside the allocation: launch the application with srun.
    srun --chdir=/gpfs/DDNgpfs2/bqluan/mushroomP \
        --output=equilibrate-4V-21-new.out --error=equilibrate-4V-22-new.err \
        /gpfs/DDNgpfs1/smts/bin/bgq/namd2.9 equilibrate-4V-22-new.namd
fi
• The above script is a re-expression of the following (original) runjob script:
runjob --block R01-M0-N04-128 --ranks-per-node 32 --cwd /gpfs/DDNgpfs2/bqluan/mushroomP \
    --exe /gpfs/DDNgpfs1/smts/bin/bgq/namd2.9 \
    --args equilibrate-4V-21-new.namd > equilibrate-4V-21-new.out 2> equilibrate-4V-21-new.err &
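With this pattern the user just executes the script directly; the first pass submits it via sbatch, and SLURM’s re-run of the script inside the allocation takes the srun branch. For example (the file name is hypothetical):
chmod +x run-namd.sh
./run-namd.sh    # submits itself; when SLURM re-runs it, SLURM_JOBID is set and srun fires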
Srun/runjob decoder
Runjob option        Srun option
--cwd                --chdir
--exe                (first field without an option)
--label xx           --label=xx
--verbose            --verbose
--ranks-per-node     --ntasks-per-node
All other options    --launcher-opts=

• --launcher-opts is a catch-all for all other runjob options.
• For example:
--launcher-opts="--timeout 300 --strace"
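Applying the decoder, a runjob invocation can be rewritten as an srun line inside the sbatch script (the block name, path, and rank count here are illustrative only; block selection itself is handled by sbatch rather than srun):
# runjob form
runjob --block R00-M0-N02-64 --ranks-per-node 16 --cwd /scratch/run1 --exe /home/user/app.elf
# equivalent srun form inside the sbatch script
srun --ntasks-per-node=16 --chdir=/scratch/run1 /home/user/app.elf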
Partitions (SLURM queue names).
• We have set up multiple basic SLURM queues (partitions):
– prod – regular production nodes (R00-M0, R00-M1, R01-M0, R01-M1).
– bgas – full-system BGAS allocation (R00-M0, R00-M1, R01-M0, R01-M1).
• There are a couple of midplane-level reservations set up to run each day:
– bgas_daily – active 3:00 am to 3:30 pm.
– bgas_full – active 3:30 pm to 6:00 pm.
• The default queue/partition is the “prod” queue.
• The queue/partition name is used by the prolog script to determine whether it is necessary to switch the IO nodes to either BGAS or production.
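Choosing a queue is just an sbatch/sinfo option (a hedged example; the script name is hypothetical):
sinfo --partition=prod                              # state of the default production queue
sbatch --partition=bgas --nodes=64 my_bgas_job.sh   # queue a job on the BGAS partition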
SLURM small block divisions.
• Block divisions as of May 2014:
– bgq0000 (R00-M0) – divided into sixteen 32-way blocks.
– bgq0001 (R00-M1) – divided into 32-, 64-, 128-, and 256-way (overlapping) blocks.
– bgq0010 (R01-M0) – divided into 64-, 128-, and 256-way (overlapping) blocks.
– bgq0011 (R01-M1) – divided into 64-, 128-, and 256-way (overlapping) blocks.
• The sbatch option “--nodes=xx”, where xx is 32, 64, 128, or 256, will cause a job to land on one of the small block partitions. SLURM will pick which small block to run it on.
• Prolog scripts ensure that partial blocks are not used (i.e. two 32-way jobs running on the same 64-way block at the same time).
• You can restrict which midplane SLURM will try to select its blocks from with --nodelist=xxxx, where xxxx is bgq0000, bgq0001, bgq0010, or bgq0011 (see the example below).
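For instance, to let SLURM choose any free 64-way block but only on midplane R01-M0 (the script name is hypothetical):
sbatch --partition=prod --nodelist=bgq0010 --nodes=64 myjob.sh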
Getting SLURM to run on a specific node card/block
• To get SLURM to land on a specific block we use the prolog script and the “--nodelist” and “--constraint” options for sbatch.
• For example:
sbatch --partition=prod --nodelist=bgq0000 --nodes=32 --constraint=N00-32
• NOTE:
– The --nodes option and the constraint must agree as to the size.
– A sub-block of that size MUST exist on the nodelist requested.
• Valid constraints are:
– Nxx-32, where xx is 00-15
– Nxx-64, where xx is 00, 02, 04, 06, 08, 10, 12, 14
– Nxx-128, where xx is 00, 04, 08, 12
– Nxx-256, where xx is 00, 08
• If the block is not capable of being scheduled, the job will be canceled and a message will appear in the stdout file (slurm-$jobid.out).
• Using the higher-numbered Nxx cards for 64- and 32-way jobs is discouraged, because the system will try to run the job on the lower-numbered cards first and work down each node card in turn until it lands on the card it needs to run on.
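Another worked example under the same rules: a 128-way job pinned to node card group N04 on R01-M0 (the script name is a hypothetical placeholder):
sbatch --partition=prod --nodelist=bgq0010 --nodes=128 --constraint=N04-128 rj01.sh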
SLURM Job order.
• If the user uses the --constraint parameter to select a specific node card, the order in which jobs were submitted may not be respected.
• This is because the prolog scripts can reject the node SLURM first selects, either because it tries to run on a block larger than requested or because of a constraint.
– When the job is rejected on a specific node, it gets re-queued, and this causes some reordering.
• If job order is required, one can use the --dependency=singleton and --job-name options as follows:
– sbatch --job-name=a --dependency=singleton -N32 --constraint=N01-32 rj01.sh
• Another way to do this is with the other forms of “--dependency” (see the chaining sketch below):
– after:job_id[:jobid...] : This job can begin execution after the specified jobs have begun execution.
– afterany:job_id[:jobid...] : This job can begin execution after the specified jobs have terminated.
– afternotok:job_id[:jobid...] : This job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc.).
– afterok:job_id[:jobid...] : This job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).
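A hedged sketch of chaining two jobs with afterok (step1.sh and step2.sh are hypothetical; --parsable, where available, makes sbatch print only the job id, otherwise the id can be taken from the “Submitted batch job” message):
jobid=$(sbatch --parsable -N32 step1.sh)          # submit step 1 and capture its job id
sbatch --dependency=afterok:$jobid -N32 step2.sh  # step 2 runs only if step 1 exits 0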
SLURM – reservations.
• SLURM can reserve an entire midplane for jobs by a specific reservation id.
• The current version can only reserve entire midplane blocks (not sub-midplane).
– The September release of SLURM is supposed to have better sub-midplane capabilities for both node selection and reservations.
• Creating a reservation:
scontrol create reservation user=myid starttime=now nodes=bgq0001 duration=120
– This will reply with a reservation id as follows:
Reservation created: myid_5
• Using the reservation:
sbatch --reservation=myid_5 --nodes=64 my.script
• This web page outlines reservations in more detail:
https://computing.llnl.gov/linux/slurm/reservations.html
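Reservations can also be inspected and cleaned up with scontrol (a brief sketch using the id from the example above):
scontrol show reservation                    # list current reservations and their windows
scontrol delete ReservationName=myid_5       # remove the reservation when it is no longer needed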
Reservation Time-limit interaction.
• For each job in the queue there is an execution time limit imposed on it.
• The default for this normally comes from the queue name.
– It can be overridden at various levels, such as the sbatch command line.
– The initial default for the SLURM queues is 1 hour, so to override it use the --time parameter on sbatch as follows:
sbatch --time=xxx nameofscript.sh
• The xxx value is in minutes; other forms of dates/times can be found in the sbatch man page (“man sbatch”).
• The job will not run if the time limit overlaps a node reservation.
– For example, if there is a reservation every day at 3:30 for the entire machine and the time limit associated with the job would overlap that full-system reservation, the job will not run until after the reservation is over.
– If the time limit exceeds the queue/partition time limit, the job will be left in the pending state indefinitely.
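For example (a hedged sketch with a hypothetical script name): a 30-minute request submitted in the early afternoon can still finish before the 3:30 pm full-system reservation, while an 8-hour request has to wait until the reservation ends.
sbatch --time=30 --nodes=64 myjob.sh        # 30 minutes
sbatch --time=8:00:00 --nodes=64 myjob.sh   # 8 hours, in hours:minutes:seconds form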
QOS settings.
• QOS (quality of service) settings are used by SLURM to control limits on the amount of resources a given user/group/account/job can consume at any one time.
• Our initial deployment of SLURM will associate a default QOS setting limiting each user to the total number of compute nodes that they previously had as a static allocation.
• This will be used to keep users from consuming all of the machine by submitting multiple sbatch commands, but it still allows a user to run three 32-way jobs if their normal allocation was 128 nodes.
• Each user will have a “default QOS” setting associated with their ID, as well as a list of QOS settings they are allowed to use:
– umax-32 == user max nodes = 32
– umax-64 == user max nodes = 64
– umax-128 == user max nodes = 128
– …
• One can select one of the authorized QOS settings on the sbatch command line as follows:
sbatch --qos=umax-128 --nodes=32 xx.sh
– The above command would allow the user to run four 32-way jobs in parallel before the queue would back up their jobs behind other work.
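The QOS values tied to an ID can usually be checked with sacctmgr (a hedged sketch; the exact format fields available depend on the site’s accounting setup):
sacctmgr show qos                                          # list the defined QOS levels
sacctmgr show assoc user=$USER format=user,qos,defaultqos  # QOS settings your ID may use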