Installing and Running SGE at DESY (Zeuthen)

Wolfgang Friebel, 15.10.2001, HEPiX Meeting Berkeley

Introduction

Motivations for using a batch system:
- more effective usage of the available computers (e.g. a more uniform load)
- usage of resources 24h/day
- assignment of resources according to policies (who gets how much CPU, and when)
- quicker execution of tasks (the system knows the most powerful, least loaded nodes)

Our goal: you tell the batch system a script name and what you need in terms of disk space, memory and CPU time; the batch system guarantees the fastest possible turnaround.

It could even be used to get xterm windows on the least loaded machines for interactive use.

Batch Systems Overview

- Condor: targeted at using idle workstations (not used at DESY)
- NQS: public domain and commercial versions, basic functionality; used for APE100 projects
- Loadleveler: mostly found on IBM machines, used at DESY
- LSF: popular, rich set of features, licensed software, used at DESY
- PBS: public domain and commercial versions, origin NASA; rich set of features, became popular recently, used in H1
- Codine/GRD: batch system similar to LSF in functionality, used in HERA-B and for all farms at DESY Zeuthen
- SGE/SGEEE: Sun Grid Engine (Enterprise Edition), the open source successors of Codine/GRD; became the only batch system at Zeuthen (except for the legacy APE100 batch system)

The old Batch System Concept

- Each group ran a separate cluster with separate instances of GRD or Codine
- Project priorities within a group were maintained by configuring several queues reflecting those priorities
  - queues were named after priority (e.g. long, medium, short, idle, ...)
  - or according to task (e.g. simulation, production, test, ...)
  - individuals had to obey group dependent rules to submit jobs
- Priorities between different groups were realized through cluster size (CPU power)
- Urgent tasks were handled by asking other groups for temporary use of their cluster
  - administrative overhead to enable accounts on the machines
  - users had to adapt their batch jobs to the new environment
- There were always heavily overloaded clusters next to machines with lots of idle CPU cycles

A new Scheme for Batch Processing

Two factors led us to design a new batch processing scheme:
- shortcomings of the old system, especially the non uniform usage pattern
- the licensing situation: our GRD license ended, and we wanted to move to the open source successor of GRD

The new scheme:
- one central batch system for all groups
  - dynamic allocation of resources according to the current needs of the groups
  - more uniform configuration of the batch nodes
- very few queue types
  - basically only two: a queue for ordinary batch jobs and an idle queue
  - most scheduling decisions are based on other mechanisms (see below)
- resource requests for jobs determine the queuing
  - resource definitions are based on the concept of complexes (explained later)
  - users should request resources if the defaults are not well suited for their jobs
  - bookkeeping of resources happens within the batch system

The Sun Grid Engine Components

Components of the system:

- Queues: contain information on the number of jobs and the job characteristics that are allowed on a given host. Jobs need to fit into a queue to get executed. Queues are bound to specific hosts.
- Resources: features of hosts or queues that are known to SGE. Resource attributes are defined in so-called complexes (global, host, queue and user defined).
- Projects: contain lists of users (usersets) that are working together. The importance relative to other projects may be defined using shares.
- Policies: algorithms that define which jobs are scheduled to which queues and how the priority of running jobs is set. SGEEE knows functional, share based, deadline and override policies.
- Shares: SGEEE can use a pool of tickets to determine the importance of jobs. The pool of tickets owned by a project, job etc. is called its share.

Benefits of Using the SGEEE Batch System

For users:
- jobs get executed on the most suitable (least loaded, fastest) machine
- fair scheduling according to the defined sharing policies
- no one else can overuse the system and provoke system degradation
- users need no knowledge of the host names where their jobs can run
- quick access to the load parameters of all managed hosts

For administrators:
- one-time allocation of resources to users, projects and groups
- no manual intervention needed to guarantee the policies
- reconfiguration of the running system (to adapt to changing usage patterns)
- easy monitoring of hosts and jobs

Policies for the Job Handling within SGEEE

Within SGEEE, tickets are used to distribute the workload:

- User based functional policy
  - tickets are assigned to projects, users and jobs; more tickets mean higher priority and faster execution (if concurrent jobs are running on a CPU)
- Share based policy
  - certain fractions of the system resources (shares) can be assigned to projects and users
  - projects and users receive those shares within a configurable moving time window (e.g. CPU usage for a month, based on the usage during the past month)
- Deadline policy
  - by redistributing tickets the system can give a job increasing weight so that it meets a certain deadline; can be used by authorized users only
- Override policy
  - sysadmins can give additional tickets to jobs, users or projects to temporarily adjust their relative importance

Classes of Hosts and Users

Host roles:
- Submit Host: node that is allowed to submit jobs (qsub) and query their status
- Exec Host: node that is allowed to run (and submit) jobs
- Admin Host: node from which admin commands may be issued
- Master Host: node controlling all SGE activity, collecting status information, keeping access control lists etc.
A certain host can have any mixture of the roles above.

User roles:
- Administrator: user who is allowed to fully control SGE
- Operator: user with admin privileges who is not allowed to change the queue configuration
- Owner: user who is allowed to suspend jobs in queues he owns, or to disable owned queues
- User: can manipulate only his own jobs

The Zeuthen SGEEE Installation

- SGEEE built from the source with AFS support
  - another system (SGE with AFS) was built for the HERA-B experiment
- Two separate clusters (no mix of operating systems)
  - 95 Linux nodes in the default SGEEE cell
  - other Linux machines (public login) used as submit hosts
  - 17 HP-UX nodes in the cell "hp"
- A cell is a separate pool of nodes controlled by a master node (see the sketch below)

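To address the second cluster, a job can be submitted into the other cell, as a minimal sketch (SGE_CELL is the standard environment variable for selecting a cell; the resource request mirrors the examples later in these slides):

  # submit into the HP-UX cell "hp" instead of the default cell
  export SGE_CELL=hp
  qsub -l t=1:00:00 job_script
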
The Zeuthen SGEEE Installation

- In production since 9/2001
- Smooth migration from the old system
  - two batch systems were running in parallel for a limited time
  - coexistence of the old queue configuration scheme and the new one
- Ongoing tuning of the new system
  - the initial goal was to reestablish the functionality of the old system
  - initially some projects were bound to subgroups of hosts
  - now step by step changes towards a truly homogeneous system

Our Queue Concept

- one queue per CPU, with a large time limit and low priority
  - users have to specify at least a CPU time limit (usually much smaller than the queue limit)
  - users can request other resources (memory, disk) that differ from the default values
- optionally a second queue that gets suspended as soon as there are jobs in the first queue (idle queue)
- interactive use is possible because of the low batch priority
- the relative importance of jobs, users and projects is respected because of the sharing policies

Complexes within SGE

- Complexes are containers for resource definitions
- Resources can be requested by a batch job (see the sketch below)
  - hard requests need to be fulfilled (e.g. host architecture)
  - soft requests are fulfilled if possible
- The actual value of some resource parameters is known
  - the amount of available main memory or disk space can be used for scheduling decisions
  - arbitrary "load sensors" can be written to measure further resource parameters
- Resources can be reserved for the current job
  - parameters can be made "consumable": a portion of the requested resource gets subtracted from the currently available value of the resource parameter
- The most important parameters are known to SGEEE
  - parameters like CPU time, virtual free memory etc. are built in already
  - to be used, some of them need to be activated in the configuration

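How such requests could look on the command line, as a minimal sketch (the t time complex is the Zeuthen shortcut used throughout these slides; arch and virtual_free are standard SGE complex attributes, and glinux is assumed as the Linux architecture string):

  # hard requests (the default) must be satisfied for the job to start;
  # soft requests are honoured if possible
  qsub -hard -l t=2:00:00,arch=glinux -soft -l virtual_free=500M job_script
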
Our Complexes Concept

Users have to specify for a job:
- a time limit (CPU time)

Users can request for a job:
- a certain amount of virtual and real free memory
- the existence of one or two scratch disks
- coming soon:
  - the available free disk space on a given scratch disk
  - a guaranteed amount of reserved disk space
  - more hardware oriented features, such as using only machines from a subcluster (farm) or running on a specific host (not recommended)
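
A sketch of such a submission (the datadir complex for requesting a scratch disk appears later in these slides; vf as the usual shortcut for virtual_free is an assumption):

  # request 30 min of CPU time, 500 MB of free virtual memory and a /data disk
  qsub -l t=0:30:00,vf=500M,datadir job_script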

Experiences

- The system is easily usable from a user's point of view
- The system is highly configurable (it takes some time to find the optimum policies to implement)
- The system is very stable
  - crashing jobs are mostly due to failing token renewal (our plugin procedure based on arc and batchtkauth)
  - other failures are due to path aliases for the automounter that are missing (on purpose!)
- The system dynamically adapts process priorities to meet the share policies or to keep up with changing policies
- The SGE(EE) maintainers are very active and keep implementing new ideas
  - quick incorporation of patches; reported bugs get fixed asap

Advanced Use of SGEEE

Using the perl API:
- every aspect of the batch system is accessible through the perl API
- the perl API is available after "use SGE;" in perl scripts
- there is almost no documentation, but a few sample scripts exist in /afs/ifh.de/user/f/friebel/public and in /afs/ifh.de/products/source/gridengine/source/experimental/perlgui

Using the load information reported by SGEEE:
- each host reports a number of load values to the master host (qmaster)
- there is a default set of load parameters that are always reported
- further parameters can be reported by writing load sensors
- qhost is a simple interface to display that information (see the sketch below)
- a powerful monitoring system could be built around this feature, based on the built-in "Performance Data Collection" (PDC) subsystem

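Two qhost invocations for illustration (a minimal sketch; filtering with -l is standard qhost usage, the glinux architecture string is an assumption):

  # list all execution hosts with architecture, load and memory values
  qhost
  # show only the hosts that satisfy a given resource request
  qhost -l arch=glinux
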
Conclusions

- ease of installation from source
- access to the source code
- chance of integration into a monitoring system
- API for C and Perl
- excellent load balancing mechanisms
- managing the requests of concurrent groups
- mechanisms for recovery from machine crashes
- fallback solutions for dying daemons
- weakest point: the AFS integration and the token prolongation mechanism (basically the same code as for Loadleveler and for older LSF versions)

Conclusions

SGEEE has all the ingredients to build a company wide batch infrastructure:
- allocation of resources according to policies, ranging from departmental policies down to individual user policies
- dynamic adjustment of the priorities of running jobs to meet the policies
- support for interactive jobs, array jobs and parallel jobs
- can be used with Kerberos (4 and 5) and AFS

SGEEE is open source, maintained by Sun:
- deeper knowledge can be gained by studying the code
- the code can be enhanced (examples: more schedulers, tighter AFS integration, monitoring-only daemons)
- the code is centrally maintained by a core developer team

SGEEE could play a more important role in HEP (as a component of a grid environment; an open, industry grade batch system as the recommended solution within HEPiX?)

References

- Download page for the source code of SGE(EE):
  http://gridengine.sunsource.net/servlets/ProjectSource
- Lots of docs from Raytheon:
  http://www.arl.hpc.mil/docs/grd/
- Support forum and mailing lists:
  http://supportforum.Sun.COM/gridengine/
- GRD at a conference in 1998:
  http://hoover.hpac.tudelft.nl/cugs98cd/S98PROC/AUTHORS/FERSTL/INDEX.HTM
- Zeuthen pages with a URL to the reference manual:
  http://www-zeuthen.desy.de/computing/services/batch/
- The SGEEE reference manual, user and installation guide:
  http://www-zeuthen.desy.de/…/batch/sge53.pdf

Technical Details of SGEEE (not presented)

- Submitting Jobs
- The graphical interface qmon
- Job submission and file systems
- Sample job script
- Advanced usage of qsub
- Abnormal job termination

Submitting Jobs

Requirements for submitting jobs:
- have a valid token (verify with tokens), otherwise obtain a new one (klog)
- ensure that your .[t]cshrc or .zshrc executes no commands that need a terminal (tty); users often have a stty command in their startup scripts (see the sketch below)
- you are within batch if the env variable JOB_NAME is set, or if the env variable ENVIRONMENT is set to BATCH

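A guard for such commands in a startup file, as a minimal sketch in zsh (the stty line is just an example of a command that needs a tty):

  # execute terminal-dependent commands only outside of batch jobs
  if [ -z "$JOB_NAME" ] && [ "$ENVIRONMENT" != BATCH ]; then
      stty erase '^?'    # would fail without a terminal
  fi
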
Submitting a job:
- specify what resources you need (-l option) and which script should be executed:
  qsub -l t=1:00:00 job_script
- in the simplest case the job script contains one line, the name of the executable
- many more options are available
- alternatively, use the graphical interface to submit jobs:
  qmon &

The Submit Window of qmon

(screenshot of the qmon job submission dialog)

Job Submission and File Systems

Current working directory:
- the directory from which the qsub command was called; by default STDOUT and STDERR of a job go into files created in $HOME. Because of quota limits and archiving policies that is not recommended.
- with the -cwd option to qsub the files get created in the current working directory; for performance reasons this should be on a local file system
- if cwd is in NFS space, the batch system must not use the real mount point; the path is translated according to /usr/SGE/default/common/sge_aliases. As every job stores the full contents of sge_aliases, it is advantageous to get rid of that file and to discourage the use of NFS as the current working directory.
- if required, create your own $HOME/.sge_aliases file

Local file space (Zeuthen policies):
- /usr1/tmp is guaranteed to exist on all Linux nodes and typically has > 10 GB capacity
- /data exists on some Linux nodes and typically has > 15 GB capacity; a job can request the existence of /data with -l datadir
- $TMP[DIR] is a unique directory below /usr1/tmp that gets erased at the end of the job; normal jobs should make use of that mechanism if possible (see the sketch below)

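A minimal sketch of a job body using that mechanism (the input/output names and the executable are hypothetical):

  # work in the per-job scratch directory that SGE creates below /usr1/tmp;
  # it is erased automatically when the job ends
  cd $TMPDIR
  cp $HOME/large_input .
  my_program < large_input > result
  cp result $HOME/    # save the output before $TMPDIR is erased
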
A Simple Job Script

#!/bin/zsh
# run the job under zsh, otherwise the default shell would be used
#$ -S /bin/zsh
# the CPU time limit for this job
#$ -l t=0:30:00
# merge STDOUT and STDERR
#$ -j y
WORKDIR=/usr1/tmp/$LOGNAME/$JOB_ID
DATADIR=/net/ilos/h1data7
echo using working directory $WORKDIR
mkdir -p $WORKDIR
cp $DATADIR/large_input $WORKDIR
cd $WORKDIR
h1_reco
cp large_out $DATADIR
# clean up only if the copied output has the same size as the local file
if [ "$(wc -c < large_out)" = "$(wc -c < $DATADIR/large_out)" ]; then
    cd; rm -r $WORKDIR
fi

Advanced Usage of qsub

Option files:
- instead of giving qsub options on the command line, users may store them in .sge_projects files in their $HOME or current working directories
- content of a sample .sge_projects file:
  -cwd -S /usr/local/bin/perl -j y -l t=24:00:00

Array jobs:
- SGE allows scheduling n identical jobs with one qsub call, using the -t option:
  qsub -t 1-10 array_job_script
- within the script, use the variable SGE_TASK_ID to select different inputs and to write to distinct output files (SGE_TASK_ID is 1...10 in the example above; see the sketch below)

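A sketch of a matching array job script (the program and file names are hypothetical):

  #!/bin/zsh
  #$ -S /bin/zsh
  #$ -l t=1:00:00
  # each task of "qsub -t 1-10" gets its own SGE_TASK_ID (here 1..10) and
  # uses it to pick its input file and to name its output file
  my_analysis < input.$SGE_TASK_ID > output.$SGE_TASK_ID
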
Conditional job execution:
- jobs can be scheduled to wait for dependent jobs to finish successfully (rc=0)
- jobs can be submitted in hold state (to be released by the user or an operator)
- jobs can be told not to start before a given date
- dependent jobs can be started on the same host (using qalter -q $QUEUE ... within the script); see the sketch below

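The corresponding qsub options, as a sketch (the job ID and the date are made up):

  qsub -hold_jid 4711 job_script    # wait for job 4711 to finish successfully
  qsub -h job_script                # submit in (user) hold state
  qsub -a 200111011200 job_script   # do not start before Nov 1, 2001, 12:00
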
Abnormal Job Termination

Termination because the CPU limit was exceeded:
- jobs get an XCPU signal that can be caught by the job; termination procedures can then be executed before the SIGKILL signal is sent (see the sketch below)
- SIGKILL is sent a few minutes after XCPU; it cannot be caught

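A sketch of catching XCPU in a zsh job script (h1_reco and $DATADIR are taken from the earlier sample script; partial_results and the rescue logic are hypothetical):

  # rescue partial output when the XCPU warning arrives, before the
  # uncatchable SIGKILL follows a few minutes later
  trap 'kill $!; cp partial_results $DATADIR; exit 1' XCPU
  h1_reco &    # run the payload in the background
  wait         # returns early when XCPU arrives, then the trap runs
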
Restart after execution host crashes:
- if a host crashes while a job is running, the job will be restarted; in that case the variable RESTARTED is set to 1
- the job is reexecuted from the beginning on any free host; if the job can be restarted using results achieved so far, the variable RESTARTED can be checked (see the sketch below)
- the job can be forced to run on the same host by inserting
  qalter -q $QUEUE $JOB_ID
  literally in the job script
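
A sketch of such a check (the checkpoint file and the resume option are hypothetical):

  # reuse earlier results if this run is a restart after a host crash
  if [ "$RESTARTED" = 1 ] && [ -f $WORKDIR/checkpoint ]; then
      my_program --resume $WORKDIR/checkpoint
  else
      my_program
  fi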

Signaling the end of the job:
- with the qsub option -notify, a SIGUSR1 signal is sent to the job a few minutes before the job is suspended or terminated