cray_batch_queues


ISTeC Cray
High-Performance Computing System
Richard Casey, PhD
RMRCE
CSU Center for Bioinformatics
System Architecture
[System diagram: compute blades (batch compute nodes) and compute blades (interactive compute nodes) connected by the SeaStar 2+ interconnect; service blades host the login node, boot node, and Lustre file system node.]
XT6m Compute Node Architecture
[Node diagram: four six-core "Greyhound" dies, each with a 6 MB L3 cache and its own DDR3 memory channels, linked by HyperTransport 3 (HT3) and connected to the SeaStar interconnect.]
• Each compute node contains 2 processors (2 sockets)
• 64-bit AMD Opteron "Magny-Cours" 1.9 GHz processors
• 1 NUMA processor = 6 cores
• 4 NUMA processors per compute node
• 24 cores per compute node
• 4 NUMA processors per compute blade
• 32 GB RAM (shared) per compute node = 1.664 TB total RAM (ECC DDR3 SDRAM)
• 1.33 GB RAM per core
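A quick way to confirm the 24-core layout is to launch a single process on a compute node with aprun, e.g. from inside a batch or interactive job (a sketch; it assumes the standard Linux nproc utility is available in the compute-node image):

# with the layout above, one compute node should report 24 cores
aprun -n 1 nproc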
Compute Node Status
• Check whether interactive and batch compute nodes are up or down:
– xtprocadmin
NID   (HEX)   NODENAME     TYPE      STATUS   MODE
12    0xc     c0-0c0s3n0   compute   up       interactive
13    0xd     c0-0c0s3n1   compute   up       interactive
14    0xe     c0-0c0s3n2   compute   up       interactive
15    0xf     c0-0c0s3n3   compute   up       interactive
16    0x10    c0-0c0s4n0   compute   up       interactive
17    0x11    c0-0c0s4n1   compute   up       interactive
18    0x12    c0-0c0s4n2   compute   up       interactive
42    0x2a    c0-0c1s2n2   compute   up       batch
43    0x2b    c0-0c1s2n3   compute   up       batch
44    0x2c    c0-0c1s3n0   compute   up       batch
45    0x2d    c0-0c1s3n1   compute   up       batch
61    0x3d    c0-0c1s7n1   compute   up       batch
62    0x3e    c0-0c1s7n2   compute   up       batch
63    0x3f    c0-0c1s7n3   compute   up       batch
Naming convention: CabinetX-Y, CageX, SlotX, NodeX
e.g. c0-0c0s3n0 = Cabinet0-0, Cage0, Slot3, Node0
• Currently:
– 1,248 batch compute cores (fluctuates somewhat)
– 192 interactive compute cores (fluctuates somewhat)
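The STATUS and MODE columns above make it easy to tally this capacity yourself; a minimal sketch, assuming the six-column xtprocadmin layout shown above and 24 cores per node:

xtprocadmin | awk '$5=="up" && $6=="batch"       {b++}
                   $5=="up" && $6=="interactive" {i++}
                   END {print b*24 " batch cores, " i*24 " interactive cores up"}'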
Compute Node Status
• Check the state of interactive and batch compute nodes and whether they are already allocated to other users' jobs:
– xtnodestat
Current Allocation Status at Tue Apr 19 08:15:02 2011

     C0-0
  n3 -------B
  n2 -------B
  n1 -------
c1n0 -------
  n3 SSSaa;-
  n2    aa;-
  n1    aa;-
c0n0 SSSaa;-
     s01234567

Legend:
   nonexistent node                   S  service node (login, boot, lustrefs)
;  free interactive compute node      -  free batch compute node
A  allocated, but idle compute node   ?  suspect compute node
X  down compute node                  Y  down or admindown service node
Z  admindown compute node

Available compute nodes:   4 interactive,  38 batch

Reading the display: "C0-0" is the cabinet ID, the s0-s7 labels along the bottom are the slots (blades), and each row is one node (n0-n3) within a cage. "S" marks the service nodes, "-" and ";" mark free batch and free interactive compute nodes, and letters mark compute nodes already allocated to jobs.
Batch Queues
• Current batch queue configuration
• Under re-evaluation; may change in the future to fair-share queues
Queue_name       Priority   Max_runtime (wallclock)   Max_num_jobs_per_user
small            high       1 hr.                     20
medium           medium     24 hrs.                   2
large            low        168 hrs. (1 week)         1
ccm_queue        --         --                        --
priority_queue   --         --                        --
batch            --         --                        --
woodward         --         --                        --
woodward_ccm     --         --                        --
EFS              --         --                        --
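Queue selection is done at submit time with the -q flag (described in the batch-script section below); a minimal sketch, assuming a job script named myjob.pbs:

qsub -q small  myjob.pbs    # short jobs: up to 1 hr walltime, 20 jobs per user
qsub -q medium myjob.pbs    # up to 24 hrs, 2 jobs per user
qsub -q large  myjob.pbs    # up to 168 hrs (1 week), 1 job per user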
Batch Jobs
• PBS/Torque/Moab Batch Queue Management System
– For submission and management of jobs in batch queues
– Use for jobs with large resource requirements (long-running, # of cores, memory, etc.)
• List all available queues:
– qstat -Q (brief)
– qstat -Qf (full)
rcasey@cray2:~> qstat -Q
Queue              Max   Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T
----------------   ---   ---   ---   ---   ---   ---   ---   ---   ---   --- -
batch                0     0   yes   yes     0     0     0     0     0     0 E
• Show the status of jobs in all queues:
– qstat                (all queued jobs)
– qstat -u username    (only queued jobs for "username")
– (Note: if there are no jobs in any of the batch queues, this command shows nothing and just returns the Linux prompt.)
rcasey@cray2:~/lustrefs/mpi_c> qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1753.sdb                  mpic.job         rcasey                 0 R batch
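To narrow the listing to your own jobs, pass your username (a minimal sketch; "rcasey" is the example user from the session above):

qstat -u rcasey    # re-run periodically to watch the S column move from Q to R to C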
Batch Jobs
• Common Job States
– Q: job is queued
– R: job is running
– E: job is exiting after having run
– C: job is completed after having run
• Submit a job to the default batch queue:
– qsub filename
– "filename" is the name of a file that contains batch queue commands
– Command-line directives override batch script directives
– e.g. "qsub -N newname script"; "newname" overrides "#PBS -N name" in the batch script
• Delete a job from the batch queues:
– qdel jobid
– "jobid" is the job ID number as displayed by the "qstat" command. You must be the owner of the job in order to delete it (see the short submit/monitor/delete sketch below).
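A minimal submit/monitor/delete cycle, assuming a job script named mpic.job and the job ID 1753.sdb from the qstat example above:

qsub mpic.job       # returns a job ID such as 1753.sdb
qstat               # watch the S column: Q (queued) -> R (running) -> C (completed)
qdel 1753.sdb       # remove the job (you must be its owner)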
Sample Batch Job Script
#!/bin/bash
#PBS -N jobname
#PBS -j oe
#PBS -l mppwidth=24
#PBS -l mppdepth=1
#PBS -l walltime=1:00:00
#PBS -q small
cd $PBS_O_WORKDIR
date
export OMP_NUM_THREADS=1
aprun -n24 -d1 executable
• Batch queue directives:
– -N           name of the job
– -j oe        combine standard output and standard error in a single file
– -l mppwidth  specifies number of cores to allocate to the job (MPI tasks)
– -l mppdepth  specifies number of threads per core (OpenMP)
– -l walltime  specifies maximum amount of wall clock time for the job to run (hh:mm:ss)
– -q           specifies which queue to submit the job to (if none specified, the job is sent to the small queue)
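Since command-line directives override those in the script (noted above), any of these can be changed for a single submission without editing the file; a minimal sketch, assuming the script is saved as jobscript.pbs:

qsub -N shortrun -l walltime=00:30:00 jobscript.pbs   # overrides the job name and walltime for this run only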
Sample Batch Job Script
• The PBS_O_WORKDIR environment variable is set by Torque/PBS. It contains the absolute path to the directory from which you submitted your job and is required for Torque/PBS to find your executable files.
• Linux commands and environment variables can be included in the batch job script.
• The value set in the aprun "-n" parameter should match the value set in the PBS "mppwidth" directive:
– e.g. #PBS -l mppwidth=24
– e.g. aprun -n 24 exe
• Request proper resources:
– If "-n" or "mppwidth" > 1,248, the job will be held in the queued state for a while and then deleted
– If "mppwidth" < "-n", you get the error message "apsched: claim exceeds reservation's nodecount"
– If "mppwidth" > "-n", then OK
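A short sketch restating those rules against the 24-core script above (the 48-core line is hypothetical and shown only to illustrate the failure case):

#PBS -l mppwidth=24
aprun -n 24 ./exe     # OK: -n equals mppwidth
aprun -n 12 ./exe     # OK: -n is less than mppwidth
# aprun -n 48 ./exe   # fails: "apsched: claim exceeds reservation's nodecount"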
Sample Batch Job Script
• For MPI code
• ALPS places MPI tasks sequentially on cores within a compute node
• If mppwidth = -n > 24, ALPS places MPI tasks on multiple compute nodes
#PBS -N mpicode
#PBS -j oe
#PBS -l mppwidth=12
#PBS -l walltime=00:10:00
#PBS -q small

# mppwidth = -n = number of cores
cd $PBS_O_WORKDIR
cc -o mpicode mpicode.c
aprun -n12 ./mpicode
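When more cores are requested than a single node provides (24), ALPS spreads the MPI ranks across nodes, as noted above; a minimal sketch asking for two full nodes (the value 48 is illustrative):

#PBS -l mppwidth=48
aprun -n48 ./mpicode    # 48 ranks = 2 x 24-core compute nodes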
Sample Batch Job Script
• For OpenMP code
• ALPS places OpenMP threads sequentially on cores within a compute node
• mppdepth = OMP_NUM_THREADS = -d <= 24
• If -d exceeds 24, you get the error message "apsched: -d value cannot exceed largest node size"
#PBS -N openmpcode
#PBS -j oe
#PBS -l mppdepth=6
#PBS -l walltime=00:10:00
#PBS -q small

# mppdepth = OMP_NUM_THREADS = -d <= 24 (number of cores)
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=6
cc -o openmpcode openmpcode.c
aprun -d6 ./openmpcode
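To use every core of a node for threads, set the depth to the stated upper limit of 24 (a sketch; exceeding 24 triggers the apsched error quoted above):

#PBS -l mppdepth=24
export OMP_NUM_THREADS=24
aprun -d24 ./openmpcode    # one thread per core on a single 24-core node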
Sample Batch Job Script
• For hybrid MPI / OpenMP code
• ALPS places MPI tasks sequentially on cores within a compute node and launches OpenMP threads per MPI task
• By default, ALPS places one OpenMP thread per MPI task; use mppdepth = OMP_NUM_THREADS = -d to change the number of threads per task
#PBS -N hybrid
#PBS -j oe
#PBS -l mppwidth=6
#PBS -l mppdepth=2
#PBS -l walltime=00:10:00
#PBS -q small

# mppwidth = -n = number of MPI tasks
# mppdepth = OMP_NUM_THREADS = -d <= 24 (number of OpenMP threads per MPI task)
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=2
cc -o hybridcode hybridcode.c
aprun -n6 -d2 ./hybridcode
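Core accounting for the hybrid case follows from the two directives: total cores = mppwidth x mppdepth (6 x 2 = 12 above). A sketch of a full-node layout under the same rules (the 4 x 6 split is illustrative):

#PBS -l mppwidth=4
#PBS -l mppdepth=6
export OMP_NUM_THREADS=6
aprun -n4 -d6 ./hybridcode    # 4 MPI tasks x 6 threads = 24 cores, one full compute node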