
Harvesting unused clock
cycles with Condor
Ian C. Smith*
*Advanced Research Computing
The University of Liverpool
what is Condor ?
High Performance versus High Throughput Computing
Condor fundamentals
setting up and running a Condor Pool
The University of Liverpool Condor Pool
example applications
What is Condor ?
a specialized system for delivering High Throughput Computing
a harvester of unused computing resources
developed by Computer Science Dept at University of Wisconsin
in late ‘80s
free and (now) open source software
widely used in academia and increasingly in industry
available for many platforms: Linux, Solaris, AIX, Windows
XP/Vista/7, Mac OS
HPC vs HTC (1)
High Performance Computing (HPC)
- delivers large amounts of computing power over relatively short periods of time (peak FLOPS ratings important)
- can also provide lots of memory, large amounts of fast (parallel) storage
- fairly exotic hardware, may need plenty of TLC
- large capital outlay on hardware
- need to run specialised parallel (MPI) codes to get the benefit (can run serial codes but these are a poor use of resources)
- users run relatively small numbers of parallel jobs
- essential for certain time-critical applications
HPC vs HTC (2)
High Throughput Computing (HTC)
- allows many computational tasks to be completed over a long period of time (peak FLOPS ratings not so important)
- users more concerned with running large numbers of jobs over a long time span than a few short burst computations
- makes use of existing commodity hardware (e.g. desktop PCs)
- small capital outlay on hardware possible
- limited memory and storage available generally
- mostly aimed at running concurrent serial jobs (although MPI and PVM are supported by Condor)
Types of Condor application
large numbers of independent calculations typically (“pleasantly parallel”)
data parallel applications – split large datasets into smaller parts
and analyse independently
- biological sequence analysis
- processing of census data
optimisation problems
- microprocessor design and testing
applications based on Monte Carlo methods
- radiotherapy treatment analysis
- epidemiological studies
A “typical” Condor pool
[Diagram, repeated over five animation steps: a submit/execute host, a submit host, the central manager and two groups of execute hosts; the third step adds “Match Info” labels]
ClassAds and Matchmaking
ClassAds are a fundamental part of Condor
similar to classified advertisements in a paper
“Job Ads” represent jobs to Condor (similar to “wanted” ads)
“Machine Ads” represent compute resources in a Condor Pool
(similar to “for sale” ads)
Condor central manager matches Machine Ads to Job Ads and
hence machines to jobs
Job Ads are created using submit description files
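As a rough illustration of the matchmaking idea above, job ads and machine ads can be modelled as Python dictionaries, with the job's requirements as a test applied to each machine ad and its rank as a score. This is a toy model only: it is not Condor's ClassAd language or its negotiator, real matchmaking is two-sided (machines can state requirements too), and the machine names, attribute values and the helper `match` are all invented for the example.

```python
# Toy matchmaking: pick the best-ranked machine ad that satisfies
# the job ad's requirements. Illustrative only, not Condor code.

machine_ads = [
    {"Name": "pc1", "OpSys": "WINNT51", "Arch": "Intel", "Memory": 2048, "Kflops": 900000},
    {"Name": "pc2", "OpSys": "LINUX", "Arch": "Intel", "Memory": 4096, "Kflops": 1200000},
    {"Name": "pc3", "OpSys": "WINNT51", "Arch": "Intel", "Memory": 1024, "Kflops": 1500000},
    {"Name": "pc4", "OpSys": "WINNT51", "Arch": "Intel", "Memory": 4096, "Kflops": 1100000},
]

job_ad = {
    "Cmd": "example.exe",
    # the "wanted" part: what the job needs from a machine
    "Requirements": lambda m: (
        m["OpSys"] == "WINNT51" and m["Arch"] == "Intel" and m["Memory"] >= 2000
    ),
    # preference among acceptable machines (higher is better)
    "Rank": lambda m: m["Kflops"],
}


def match(job, machines):
    """Return the acceptable machine with the highest rank, or None."""
    candidates = [m for m in machines if job["Requirements"](m)]
    if not candidates:
        return None
    return max(candidates, key=job["Rank"])


best = match(job_ad, machine_ads)
print(best["Name"] if best else "no match")
```

Here pc2 fails the operating system test and pc3 has too little memory, so the choice is between pc1 and pc4, and the rank decides in favour of pc4.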
Simple submit description file
# simple submit description file
# (anything following a # is a comment and is ignored by Condor)
# this would be used for Windows XP based execute hosts
universe = vanilla
# what to run
executable = example.exe
# job's standard output
output = stdout.out$(PROCESS)
# log job's activities
log = mylog.log$(PROCESS)
# input files needed
transfer_input_files = common.txt, myinput$(PROCESS).txt
# what machines to run on
requirements = ( Arch=="Intel") && ( OpSys=="WINNT51" )
# number of jobs to queue
queue 2
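To make the role of $(PROCESS) concrete, the following Python sketch lists the file names that "queue 2" implies for the submit file above. It only imitates the substitution (process numbers start at 0); the helper names `expand` and `files_for_queue` are made up for the example and are not part of Condor.

```python
import re


def expand(template: str, process: int) -> str:
    """Replace $(PROCESS), in any letter case, with the process number."""
    return re.sub(r"\$\(process\)", str(process), template, flags=re.IGNORECASE)


def files_for_queue(n_jobs: int):
    """List the per-job file names implied by the submit file."""
    jobs = []
    for process in range(n_jobs):  # process numbers run from 0 to n_jobs - 1
        jobs.append({
            "output": expand("stdout.out$(PROCESS)", process),
            "log": expand("mylog.log$(PROCESS)", process),
            "inputs": ["common.txt", expand("myinput$(PROCESS).txt", process)],
        })
    return jobs


for job in files_for_queue(2):
    print(job)
```

So the two jobs share common.txt but read myinput0.txt and myinput1.txt respectively, and each writes its own output and log file.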
Requirements and Rank
Requirements expression determines where (and when) a job will
run, e.g.
Requirements = ( OpSys == "WINNT51" ) && \
               ( Arch == "Intel" ) && \
               ( Memory >= 2000 ) && \
               ( Disk >= 33554432 ) && \
               ( ( ClockMin > 1020 ) || \
                 ( ClockDay == 6 ) || ( ClockDay == 0 ) )
Callouts:
OpSys: Windows XP OS wanted
Arch: Intel/compatible processor
Memory, Disk: want at least 2 GB memory and at least 32 GB of free disk
(no expression shown): must have MATLAB installed
ClockMin, ClockDay: only run jobs after 5 pm OR at weekends
Rank is used to express a preference
Rank = Kflops
# run on machines with best floating point performance first
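The units in the Requirements example are easy to misread, so here is a small Python sketch that evaluates the same tests against a sample machine. It assumes the usual meanings of the attributes (Memory in MB, Disk in KB, ClockMin as minutes since midnight, ClockDay as 0 for Sunday through 6 for Saturday) and reads the weekend test as ClockDay == 6 or ClockDay == 0. The MATLAB check is left out because its expression is not given, and the sample machine values are invented.

```python
from datetime import datetime


def clock_attrs(now: datetime) -> dict:
    """ClockMin = minutes since midnight; ClockDay = 0 (Sunday) .. 6 (Saturday)."""
    return {
        "ClockMin": now.hour * 60 + now.minute,
        "ClockDay": (now.weekday() + 1) % 7,  # Python counts Monday as 0
    }


def requirements(m: dict) -> bool:
    return (
        m["OpSys"] == "WINNT51"
        and m["Arch"] == "Intel"
        and m["Memory"] >= 2000        # MB, roughly 2 GB
        and m["Disk"] >= 33554432      # KB: 32 * 1024 * 1024, i.e. 32 GB
        and (m["ClockMin"] > 1020      # 1020 = 17 * 60, i.e. after 5 pm
             or m["ClockDay"] == 6     # Saturday
             or m["ClockDay"] == 0)    # Sunday
    )


machine = {"OpSys": "WINNT51", "Arch": "Intel", "Memory": 2048, "Disk": 40000000}

weekday_morning = {**machine, **clock_attrs(datetime(2010, 6, 16, 10, 30))}  # Wednesday
weekday_evening = {**machine, **clock_attrs(datetime(2010, 6, 16, 18, 0))}
saturday_noon = {**machine, **clock_attrs(datetime(2010, 6, 19, 12, 0))}

print(requirements(weekday_morning),
      requirements(weekday_evening),
      requirements(saturday_noon))
```

The same machine is rejected on a weekday morning but accepted after 5 pm and at the weekend, which is the "when" part of the expression.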
Job submission and monitoring
[einstein@submit ~]$ condor_submit example.sub
Submitting job(s).
2 job(s) submitted to cluster 100.
[einstein@submit ~]$ condor_q
-- Submitter: : <> :
7/22 14:19  172+21:28:36 R 0 22.0 checkprogress.cron
1/13 13:59    0+00:00:00 I 0 0.0 env
1/15 19:18    0+04:29:33 R 0
1/15 19:33    0+00:00:00 R 0
1/15 19:33    0+00:00:00 H 0
1/15 19:34    0+00:00:00 R 0
4/5 13:46     0+00:00:00 I 0
4/5 13:46     0+00:00:00 I 0
4/5 13:52     0+00:00:00 I 0
4/5 13:52     0+00:00:00 I 0
4/5 13:55     0+00:00:00 I 0 0.0 cosmos
557 jobs; 402 idle, 145 running, 1 held
[einstein@submit ~]$
Condor policies
Condor supports a wide range of policies for when to start jobs
- run jobs only outside office hours
- run jobs only if load average on host is small and there has been no recent activity
- run jobs at any time on one core (at low priority)
- run jobs only submitted by certain users
also a wide choice of what to do when a job is about to be
interrupted e.g.
- suspend the job for a limited time then let it resume
- checkpoint the job and migrate it to another machine
- kill off the job immediately
UNIX or Windows execute hosts ? (1)
UNIX
- Condor’s natural environment
- not widely installed on desktop machines (but depends on institution...)
- supports the Condor “standard universe” containing many useful features
  - checkpointing allows jobs to be migrated from one machine to another without loss of useful work
  - Remote Procedure Calls give transparent access to files on submit host
  - streaming of standard output (stdout) from jobs to submit host
- network filesystems work well, making installation and configuration much easier
- leverages large amount of scientific and engineering codes which have been developed under UNIX
UNIX or Windows execute hosts ? (2)
Windows
- world’s most widely installed OS – rich source of execute hosts
- many commercial 3rd party applications run on Windows
- using shared (network) filesystems can be difficult under Condor
- only supports the “vanilla” Condor universe
  - no checkpointing – evicted jobs may waste a lot of cycles
  - all input and output files need to be transferred to/from execute host
  - output streaming not supported
- may be difficult to port “legacy” UNIX codes (although Cygwin and Co-Linux can make life easier)
- Windows support from the U-W Condor Team tends to lag behind UNIX
Setting up a Condor pool
best to start off small and build up pool slowly
need to understand Condor fundamentals:
- role of Condor processes and how they interact
- life-cycle of jobs
- ClassAds and Matchmaking
avoid firewalls if possible (may be easier said than done ...)
talk to central IT services (particularly network and PC teams)
submit hosts may need to be fairly high spec if large numbers of
jobs are to be run - ideally want
- multi-core/processor machine (quad core at least)
- plenty of memory (say 8 GB or more)
- large fast access filestore (e.g. 1 TB RAID)
Where to go for help
Read The Fine Manual !
log files contain a lot of useful information
take a look at the presentations, tutorials and “how-to recipes” on
the Condor website:
search the condor-users mail list archive:
subscribe to the condor-users mail list
join the Campus Grids SIG:
commercial support is also available (e.g. Cycle Computing)
University of Liverpool Condor Pool
contains around 400 machines running the University’s Managed
Windows Service (currently XP but moving to Windows 7 soon)
most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80
GB disk, configured with two job slots / machine
single submission point for Condor jobs provided by Sun Solaris
V445 SMP server
policy is to run jobs only after at least 5 minutes of inactivity and with low
load average during office hours, and at any time outside of office hours
job will be killed off if running when a user logs in to a PC
web interface for specific applications
support for running large numbers of MATLAB jobs
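The start and eviction policy described above can be paraphrased as two small Python predicates. This is only a sketch of the logic, not the pool's actual Condor configuration: the office hours (09:00 to 17:00, Monday to Friday) and the load threshold of 0.3 are assumptions chosen for the example, as are the function names.

```python
def may_start_job(idle_minutes, load_avg, hour, is_weekend,
                  office_start=9, office_end=17, max_load=0.3):
    """Decide whether a PC may start a Condor job.

    office_start, office_end and max_load are assumed values,
    not figures taken from the Liverpool configuration.
    """
    in_office_hours = (not is_weekend) and office_start <= hour < office_end
    if not in_office_hours:
        return True  # outside office hours jobs may start at any time
    # during office hours: at least 5 minutes of inactivity and low load
    return idle_minutes >= 5 and load_avg <= max_load


def must_kill_job(user_logged_in):
    """A running job is killed off as soon as a user logs in to the PC."""
    return user_logged_in


print(may_start_job(idle_minutes=2, load_avg=0.1, hour=11, is_weekend=False))
print(may_start_job(idle_minutes=10, load_avg=0.1, hour=11, is_weekend=False))
print(may_start_job(idle_minutes=0, load_avg=0.9, hour=22, is_weekend=False))
```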
Condor service caveats
only suitable for DOS-based applications running in batch mode
no communication between processes possible (“pleasantly
parallel” applications only)
statically linked executables work best (although can cope with
all files needed by application must be present on local disk
(cannot access network drives)
shorter jobs more likely to run to completion (10-20 min seems to
work best)
very long running jobs can be accommodated using Condor DAGMan
or user level check-pointing (details available soon on the Condor
Running MATLAB jobs under Condor
many users prefer to create applications using MATLAB rather
than traditional compiled languages (e.g. FORTRAN, C)
need to create standalone application from M-file(s) using
MATLAB compiler
standalone application can run without a MATLAB license
run-time libraries still need to be accessible to MATLAB jobs
nearly all toolbox functions available to standalone applications
simple (but powerful) file I/O makes checkpointing easier
see Liverpool Condor website for more information
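User-level checkpointing of the kind mentioned above amounts to saving the job's state to a file every so often and reloading it at start-up. The sketch below shows the pattern in Python rather than MATLAB; the file name, the save interval and the trivial "work" (summing integers) are placeholders, and it is not the Liverpool implementation.

```python
import json
import os


def run_with_checkpoint(n_steps, path="checkpoint.json", save_every=100):
    """Sum 0..n_steps-1, saving progress so a killed job can resume."""
    state = {"next_step": 0, "total": 0}
    if os.path.exists(path):
        # a previous run was interrupted: carry on from where it stopped
        with open(path) as f:
            state = json.load(f)
    for step in range(state["next_step"], n_steps):
        state["total"] += step          # stand-in for one unit of real work
        state["next_step"] = step + 1
        if state["next_step"] % save_every == 0:
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, path)       # swap in the new checkpoint in one step
    return state["total"]
```

Writing to a temporary file and renaming it means a job killed in the middle of a save still leaves the previous checkpoint intact. At most save_every steps of work are lost on eviction, which fits the advice to keep individual units of work short.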
Power-saving and Green IT at Liverpool
we have around 2 000 centrally managed classroom PCs across
campus which were powered up overnight, at weekends and
during vacations.
original power-saving policy was to power off machines after 30
minutes of inactivity; we now hibernate them after 15 minutes of inactivity
policy has reduced wasteful inactivity time by ~ 200 000 – 250 000
hours per week (equivalent to 20-25 MWh) leading to an estimated
saving of approx. £125 000 p.a.
3rd party power management software (PowerMAN) prevents
machines hibernating whilst Condor jobs are running
Condor’s own power management features allow machines to be
woken up automatically according to demand
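The savings figures quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the numbers on the slide; the one assumption is that the weekly saving applies for all 52 weeks of the year, so the implied electricity price is indicative only.

```python
hours_saved = (200_000, 250_000)   # PC-hours of idle time avoided per week
energy_mwh = (20, 25)              # equivalent energy per week, in MWh
saving_gbp_per_year = 125_000      # estimated annual saving

# implied average power draw of an idle PC, in watts
implied_watts = [mwh * 1_000_000 / hours
                 for mwh, hours in zip(energy_mwh, hours_saved)]

# annual energy saved, assuming the same saving in every week of the year
annual_mwh = [mwh * 52 for mwh in energy_mwh]

# implied price per kWh in pounds
implied_price = [saving_gbp_per_year / (mwh * 1000) for mwh in annual_mwh]

print(implied_watts)
print(annual_mwh)
print([round(p, 3) for p in implied_price])
```

Both ends of the range correspond to about 100 W per idle PC, and the annual saving is consistent with an electricity price of roughly 10 to 12 pence per kWh under the 52-week assumption.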
Condor-G and Grid Computing
Condor-G is an extension to Condor allowing job submission to
remote resources using Globus
provides familiar Condor-like interface to users hiding the
underlying middleware complexity
we have used Condor-G to give users grid access to a variety of
HPC resources:
- local HPC clusters (UL-Grid)
- NW-Grid resources at Daresbury Lab, Lancaster and Manchester
- National Grid Service facilities
Grid Computing Server tools provide a batch environment similar
to that of cluster systems (e.g. Sun Grid Engine)
Web portal removes the need for command line use completely
Radiotherapy example
3D model of normal tissue was developed in which complications
are generated when ‘irradiated’ [1]
aim is to provide insight into connection between dose-distribution characteristics, different organ architectures and
complication rates beyond that of analytical methods
code written in MATLAB and compiled into standalone executable
set of 800 simulations took ~ 36 hours to run on Condor pool
would require 4-5 months of computing time on a single PC
several dozen sets of simulations have since been completed
[1] Rutkowska E., Baker C.R. and Nahum A.E. Mechanistic
simulation of normal-tissue damage in
radiotherapy—implications for dose–volume analyses. Phys.
Med. Biol. 55 (2010) 2121–2136.
Personalised Medicine example
project is a Genome-Wide Association Study
- aims to identify genetic predictors of response to anti-epileptic drugs
- try to identify regions of the human genome that differ between individuals (referred to as SNPs)
- 800 patients genotyped at 500 000 SNPs along the entire genome
- test statistically the association between SNPs and outcomes (e.g. time to withdrawal of drug due to adverse effects)
very large data-parallel problem – ideal for Condor
divide datasets into small partitions so that individual jobs run for
15-30 minutes
batch of 26 chromosomes (2 600 jobs) required ~ 5 hours compute
time on Condor but ~ 5 weeks on a single PC
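The partitioning step above can be sketched in Python as follows. The helper `partition` is hypothetical and simply splits the SNP indexes into near-equal contiguous chunks, one per job; the real study partitions its datasets per chromosome, which this sketch does not model. The speed-up figure assumes the single PC runs continuously for the full five weeks.

```python
def partition(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous (start, stop) chunks."""
    base, extra = divmod(n_items, n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        chunks.append((start, start + size))
        start += size
    return chunks


n_snps, n_jobs, n_chromosomes = 500_000, 2_600, 26

chunks = partition(n_snps, n_jobs)
jobs_per_chromosome = n_jobs // n_chromosomes

single_pc_hours = 5 * 7 * 24   # about 5 weeks
condor_hours = 5               # about 5 hours
speedup = single_pc_hours / condor_hours

print(len(chunks), chunks[0], chunks[-1])
print(jobs_per_chromosome, speedup)
```

Each job then handles about 192 SNPs, there are 100 jobs per chromosome, and the quoted timings correspond to a speed-up of about 168 times.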
Epidemiology example
researchers have simulated the consequences
of an incursion of H5N1 avian influenza into
the British poultry flock [2]
Monte Carlo type method - highly parallel
original code written in MATLAB and compiled
into standalone application
individual simulations take only 10-15 minutes
to run – ideal for Condor
require ~ 10 000 - 20 000 simulations per
would have needed several years of compute
time on single machine, on Condor needed a
few weeks
[2] Sharkey, K.J., Bowers R.G., Morgan K.L., Robinson S.E. and Christley R.M.
Epidemiological consequences of an incursion of highly pathogenic H5N1 avian
influenza into the British poultry flock. Proc. R. Soc. B 2008 275, 19-28
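Monte Carlo runs like those described above map naturally onto Condor because each simulation only needs its own random seed. The Python sketch below shows one way to derive a distinct, reproducible seed from a job's process number. The function `simulate_outbreak` is a made-up stand-in (it just counts random "infections") and has nothing to do with the model in [2]; the base seed and parameters are arbitrary.

```python
import random


def simulate_outbreak(seed, n_farms=1000, p_infect=0.01):
    """Hypothetical stand-in for one simulation: count 'infected' farms."""
    rng = random.Random(seed)  # private generator, reproducible for a given seed
    return sum(1 for _ in range(n_farms) if rng.random() < p_infect)


def run_job(process, base_seed=12345):
    """What one Condor job might do: derive its seed from its process number."""
    return simulate_outbreak(base_seed + process)


# on a pool each call would be a separate job; here they run one after another
results = [run_job(p) for p in range(20)]
print(sum(results) / len(results))
```

Because each job's seed depends only on its process number, a job that is evicted and restarted reproduces exactly the same simulation.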
Further Information
[email protected]