Condor by Example
Douglas Thain
Computer Sciences Department
University of Wisconsin-Madison
October 2000
[email protected]
http://www.cs.wisc.edu/condor
Lecture Format:
› In each lecture:
Lecture to whole group.
Workshop and examples at computer.
› Oops!
Some items are filled in at the last
minute.
Please fill the _______ with notes.
www.cs.wisc.edu/condor
Outline
› Overview
› Submitting Jobs, Getting Feedback
› Setting Requirements with ClassAds
› Which Universe?
› Move to Workshop
What is Condor?
› Condor converts a collection of unrelated workstations into a high-throughput computing facility.
› Condor uses matchmaking to make sure that everyone is happy.
What is High-Throughput Computing?
› High-performance: CPU cycles/second under ideal circumstances.
  • “How fast can I run simulation X on this machine?”
› High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances.
  • “How many times can I run simulation X in the next week using all available machines?”
What is High-Throughput Computing?
› Condor does whatever it takes to run your jobs, even if some machines…
  • Crash!
  • Are disconnected
  • Run out of disk space
  • Are removed from or added to the pool
  • Are put to other uses
What is Matchmaking?
› Condor uses matchmaking to make sure that work gets done within the constraints of both users and owners.
› Users (jobs) have constraints:
  • “I need an Alpha with 256 MB RAM”
› Owners (machines) have constraints:
  • “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
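A minimal Python sketch of the two-sided idea above. It is not Condor's matchmaker: Condor expresses these constraints as ClassAds, and the dictionary keys (arch, memory_mb, owner_away, owner) and function names here are assumptions made for illustration. A match happens only when the job's constraint accepts the machine and the owner's constraint accepts the job.

```python
# Illustrative only: key names and functions are hypothetical, not Condor's API.

def job_accepts(machine):
    # User constraint from the slide: "I need an Alpha with 256 MB RAM"
    return machine["arch"] == "ALPHA" and machine["memory_mb"] >= 256

def owner_accepts(machine, job):
    # Owner constraint from the slide: only when away, and never Bob's jobs
    return machine["owner_away"] and job["owner"] != "bob"

def matches(machine, job):
    # Both sides must agree before work is placed on the machine
    return job_accepts(machine) and owner_accepts(machine, job)

machine = {"arch": "ALPHA", "memory_mb": 256, "owner_away": True}
print(matches(machine, {"owner": "alice"}))  # True
print(matches(machine, {"owner": "bob"}))    # False
```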
Who uses Condor?
› Hundreds of universities and companies around the world!
› University of Wisconsin, USA
  • 682 CPUs in one building
  • Computer architecture simulations
› National Institute of Physics, Italy
  • 200 CPUs in many cities
  • Reconstruction of collider events
› And many others!
What can Condor do for me?
Condor can…
› …increase your throughput.
› …do your housekeeping.
› …improve reliability.
› …give performance feedback.
Cluster Overview
Machine       Memory   CPU      Disk
Server        512 MB   800 MHz  20 GB
Client (x5)   128 MB   666 MHz  10 GB each
Network: 100 Mb/s
How many machines now?
› The map is out of date!
› The system is always changing.
› First example: What machines (and
of what kind) are in the pool now?
How Many Machines?
% condor_status

Name           OpSys        Arch   State      Activity  LoadAv  Mem
lxpc1.na.infn  LINUX-GLIBC  INTEL  Unclaimed  Idle      0.000    30
axpd21.pd.inf  OSF1         ALPHA  Owner      Idle      0.266    96
vlsi11.pd.inf  SOLARIS26    SUN4u  Claimed    Busy      0.000   256
. . .

                   Machines  Owner  Claimed  Unclaimed  Matched  Preempting
ALPHA/OSF1              115     67       46          1        0           1
INTEL/LINUX              53     18        0         35        0           0
INTEL/LINUX-GLIBC        16      7        0          9        0           0
SUN4u/SOLARIS251          1      1        0          0        0           0
SUN4u/SOLARIS26           6      2        0          4        0           0
SUN4u/SOLARIS27           1      1        0          0        0           0
SUN4x/SOLARIS26           2      1        0          1        0           0
Total                   194     97       46         50        0           1
Machine States
› Most machines will be:
Owner:
• The machine’s owner is busy at the console,
so no Condor jobs may run.
Claimed:
• Condor has selected the machine to run jobs
for other users.
Machine States
› Only a few should be:
Unclaimed:
• The owner is gone, but Condor has not yet
selected the machine.
Matched:
• Between claimed and unclaimed.
Preempting:
• Condor is busy removing a job.
More Things to Try
% condor_status -help
% condor_status -avail
% condor_status -run
% condor_status -total
% condor_status -pool condor.cs.wisc.edu
Submitting Jobs
Steps to Running a Job
› Re-link for Condor.
› Submit the job.
› Watch the progress.
› Receive email when done.
Example Job
Integrate sin(x) from 0 to 10, using 10 million slices.
Simple program takes a few seconds.
% ./integrate 10 10000000
2.0445075
C     Integrate sin(x) from 0 to LIMIT using SLICES steps.
      PROGRAM INTEGRATE
      CHARACTER STR*10
      REAL X, SLICES, LIMIT

C     Read LIMIT and SLICES from the command line.
      CALL GETARG(1,STR)
      READ (STR,*) LIMIT
      CALL GETARG(2,STR)
      READ (STR,*) SLICES

      TOTAL=0
      STEP=LIMIT/SLICES
C     Sum the area of one rectangle per step.
      DO X=0, LIMIT, STEP
         TOTAL = TOTAL + SIN(X)*STEP
      END DO

      PRINT *, TOTAL
      END
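For readers who want to try the same calculation without a Fortran compiler, here is a rough Python restatement of the rectangle sum. It is a sketch, not a port: it uses an integer loop counter and double precision (both are choices made here, not taken from the slide), so its output need not agree with the single-precision figure printed by the example program. The function name integrate_sin is hypothetical.

```python
import math

def integrate_sin(limit, slices):
    """Left rectangle sum of sin(x) on [0, limit] using `slices` equal steps."""
    step = limit / slices
    total = 0.0
    for i in range(slices):
        # Evaluate at the left edge of each slice and add the rectangle's area
        total += math.sin(i * step) * step
    return total

# Fewer slices than the slide's 10 million, to keep the run short
approx = integrate_sin(10.0, 100_000)
# Closed form of the integral of sin(x) from 0 to 10, for comparison
exact = 1.0 - math.cos(10.0)
print(approx, exact)
```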
Re-link for Condor
› If you normally compile like this:
g77 integrate.f -o integrate
› Then compile for Condor like this:
condor_compile g77 integrate.f -o integrate
Submit the Job
› Create a submit file:
  • emacs integrate.submit &
› Submit the job:
  • condor_submit integrate.submit

Executable = integrate
Arguments = 10 10000000
Output = integrate.out
Log = integrate.log
queue
Watch the Progress
% condor_q

-- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038> :
 ID    OWNER  SUBMITTED     RUN_TIME  ST  PRI  SIZE  CMD
 5.0   thain  6/21 12:40  0+00:00:15  R   0    2.5   fib 40

› ID: Each job gets a unique number.
› ST: Status: Unexpanded, Running or Idle
› SIZE: Size of program image (MB)
Receive E-mail When Done
This is an automated email from the Condor system
on machine "axpbo8.bo.infn.it". Do not reply.

Your condor job
/tmp_mnt/usr/users/ccl/thain/test/fib 40
exited with status 0.

Submitted at:    Wed Jun 21 14:24:42 2000
Completed at:    Wed Jun 21 14:36:36 2000

Real Time:       0 00:11:54
Run Time:        0 00:06:52
Committed Time:  0 00:01:37
. . .
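As a quick arithmetic check on the email above, the Real Time figure equals the gap between the Submitted and Completed timestamps. The short Python sketch below recomputes it. The format string is an assumption about how to read the timestamps as printed on the slide, and it relies on English day and month names.

```python
from datetime import datetime

# Assumed layout of the timestamps as they appear in the email
FORMAT = "%a %b %d %H:%M:%S %Y"

submitted = datetime.strptime("Wed Jun 21 14:24:42 2000", FORMAT)
completed = datetime.strptime("Wed Jun 21 14:36:36 2000", FORMAT)

# Wall-clock time between submission and completion
real_time = completed - submitted
print(real_time)  # 0:11:54
```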
Running Many Processes
› 100 processes are almost as easy as 1.
› Each condor_submit makes one cluster of one or more processes.
› Add the number of processes to run to the Queue statement.
› Use the $(PROCESS) variable to give each process slightly different instructions.
Running Many Processes
› Run the same program on 50 different intervals.
› Output goes in integrate.out.0, integrate.out.1, and so on…

Executable = integrate
Arguments = $(PROCESS) 10000000
Output = integrate.out.$(PROCESS)
Log = integrate.log
Queue 50
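To make the effect of $(PROCESS) concrete, the Python sketch below performs the same textual substitution for each of the 50 processes. It only imitates the idea: the function name expand and the dictionary layout are hypothetical, and condor_submit does this work itself. Process numbers are counted from 0.

```python
def expand(template, count):
    """Return one copy of the template per process, with $(PROCESS) filled in."""
    return [
        {key: value.replace("$(PROCESS)", str(proc)) for key, value in template.items()}
        for proc in range(count)  # process numbers run from 0 to count - 1
    ]

# The two lines of the submit file that mention $(PROCESS)
template = {
    "Arguments": "$(PROCESS) 10000000",
    "Output": "integrate.out.$(PROCESS)",
}

jobs = expand(template, 50)
print(jobs[0]["Output"])      # integrate.out.0
print(jobs[49]["Arguments"])  # 49 10000000
```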
Running Many Processes
% condor_q

-- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038>
 ID    OWNER  SUBMITTED     RUN_TIME  ST  PRI  SIZE  CMD
 9.3   thain  6/23 10:47  0+00:05:40  R   0    2.5   fib 3
 9.6   thain  6/23 10:47  0+00:05:11  R   0    2.5   fib 6
 9.7   thain  6/23 10:47  0+00:05:09  R   0    2.5   fib 7
. . .
21 jobs; 2 idle, 19 running, 0 held

› In the ID column, the part before the dot is the cluster number and the part after it is the process number.
Where Are They Running?
› condor_q -run

-- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038> :
 ID     OWNER  SUBMITTED     RUN_TIME  HOST(S)
 9.47   thain  6/23 10:47  0+00:07:03  ax4bbt.bo.infn.it
 9.48   thain  6/23 10:47  0+00:06:51  pewobo1.bo.infn.it
 9.49   thain  6/23 10:47  0+00:06:30  osde01.pd.infn.it

› HOST(S): Current location
Help! I’m buried in Email!
› By default, Condor sends one email
for each completed process.
› Add one of these to your submit file:
notification = error
notification = never
› To send it to someone else:
notify_user = [email protected]
Removing Processes
› Remove one process:
condor_rm 9.47
› Remove a whole cluster:
condor_rm 9
› Remove everything!
condor_rm -a
Getting Feedback
What have I done?
› The user log file (fib.log) shows a
chronological list of everything important
that happened to a job.
001 (007.035.000) 06/21 17:03:44 Job executing on host: <140.105.6.155:2219>
004 (007.035.000) 06/21 17:04:58 Job was evicted.
009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.
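The log lines above are regular enough to pick apart with a short script. The Python sketch below does so for the three sample lines only. The pattern is inferred from those samples and real user logs may carry more fields; reading (007.035.000) as cluster, process, and a third counter is an assumption, and parse_events is a hypothetical helper.

```python
import re

# Pattern inferred from the three sample lines; not a full log-format parser
EVENT = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<sub>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)

SAMPLE = """\
001 (007.035.000) 06/21 17:03:44 Job executing on host: <140.105.6.155:2219>
004 (007.035.000) 06/21 17:04:58 Job was evicted.
009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.
"""

def parse_events(log_text):
    """Return (job id, time, message) for every line that fits the pattern."""
    events = []
    for line in log_text.splitlines():
        m = EVENT.match(line)
        if m:
            # Drop leading zeros so the id looks like the ones condor_q prints
            job_id = f"{int(m['cluster'])}.{int(m['proc'])}"
            events.append((job_id, m["time"], m["text"]))
    return events

for job_id, when, text in parse_events(SAMPLE):
    print(job_id, when, text)
```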
What have I done?
% condor_history

 ID     OWNER  SUBMITTED    CPU_USAGE   ST  COMPLETED   CMD
 9.3    thain  6/23 10:47   0+00:00:00  C   6/23 10:58  fib 3
 9.40   thain  6/23 10:47   0+00:00:24  C   6/23 10:59  fib 40
 9.10   thain  6/23 10:47   0+00:00:00  C   6/23 11:01  fib 10
 9.47   thain  6/23 10:47   0+00:05:45  C   6/23 11:01  fib 47
 9.7    thain  6/23 10:47   0+00:00:00  C   6/23 11:01  fib 7
Brief I/O Summary
% condor_q -io

-- Schedd: c01.cs.wisc.edu : <128.105.146.101:2016>
 ID      OWNER  READ      WRITE     SEEK  XPUT       BUFSIZE   BLKSIZE
 756.15  joe    244.9 KB  379.8 KB    71  1.3 KB/s   512.0 KB  32.0 KB
 758.24  joe    198.8 KB  219.5 KB    78  45.0 B /s  512.0 KB  32.0 KB
 758.26  joe     44.7 KB   22.1 KB  2727  13.0 B /s  512.0 KB  32.0 KB

3 jobs; 0 idle, 3 running, 0 held
Complete I/O Summary in Email
Your condor job "/usr/joe/records.remote input output" exited with status 0.

Total I/O:
  104.2 KB/s effective throughput
  5 files opened
  104 reads totaling 411.0 KB
  316 writes totaling 1.2 MB
  102 seeks

I/O by File:
  buffered file /usr/joe/input
    opened 2 times
    100 reads totaling 398.6 KB
    311 writes totaling 1.2 MB
    101 seeks

(Only since Condor Version 6.1.11)
Complete I/O Summary in Email
› The summary helps identify performance problems. Even advanced users don't know exactly how their programs and libraries operate.
Complete I/O Summary in Email
› Example:
  • CMSSIM: collider simulation
  • “Why is this job so slow?”
  • Data summary: read 250 MB from 20 MB file.
  • Very high SEEK total -> random access.
  • Solution: Increase buffer to 20 MB.
Who Uses Condor?
% condor_q -global

-- Schedd: to02xd.to.infn.it : <192.84.137.2:1030>
 ID     OWNER     SUBMITTED      RUN_TIME  ST  PRI  SIZE  CMD
 127.0  garzelli  6/21 18:45   1+14:18:16  R   0    17.2  tosti2trisdn

-- Schedd: quark.ts.infn.it : <140.105.6.101:3908>
 ID     OWNER     SUBMITTED      RUN_TIME  ST  PRI  SIZE  CMD
 600.0  dellaric  4/10 14:57  55+09:20:31  R   0     9.1  john p2.dat
 665.0  dellaric  6/2  11:14  20+03:27:30  R   0     9.2  john p1.dat
 788.0  pamela    6/20 09:27   3+04:41:43  R   0    15.4  montepamela
Who uses Condor?
% condor_status -submitters

Name               Machine     Running  IdleJobs  MaxJobsRunning
[email protected]  decux1.pv.       22        34             200
[email protected]  quark.ts.i        6         1             200
[email protected]  to05xd.to.       21        49             200
. . .

                   RunningJobs  IdleJobs
[email protected]            0         1
[email protected]            6         1
[email protected]           22        34
Total                       59        86
Who Uses Condor?
% condor_userprio
Last Priority Update: 6/23 16:27

                               Effective
User Name                      Priority
------------------------------ ---------
[email protected]                   0.50
[email protected]                   0.50
[email protected]                   0.50
[email protected]                   2.00
[email protected]                   3.00
[email protected]                   5.81
[email protected]                  18.18
[email protected]                  19.72
------------------------------ ---------
Number of users shown: 8
Who Uses Condor?
› The user priority is computed by Condor to estimate how much of the pool’s CPU resources have been used by each submitter.
› Lighter users receive a lower priority value: they will be allocated CPUs before heavy users.
› Users consuming the same amount of CPU will be allocated an equal amount.
Measuring Goodput
› Goodput is the amount of time a
workstation spends making forward
progress on work assigned by Condor.
› This is a big topic all by itself:
http://www.cs.wisc.edu/condor/goodput
Measuring Goodput
% condor_q -goodput

-- Submitter: coral.cs.wisc.edu : <128.105.175.116:45697> : coral.cs.wisc.edu
 ID      OWNER  SUBMITTED     RUN_TIME  GOODPUT  CPU_UTIL  Mb/s
 719.74  thain  6/23 07:35  2+20:47:59   100.0%     87.6%  0.00
 719.75  thain  6/23 07:35  2+20:38:45    40.5%     99.8%  0.00
 719.76  thain  6/23 07:35  2+20:38:16    96.9%     98.7%  0.00
 719.77  thain  6/23 07:35  2+21:10:06   100.0%     99.8%  0.00
Setting Requirements
› We believe that Condor must allow
both users (jobs) and owners
(machines) to set requirements.
› This is an absolute necessity in order
to convince people to participate in
the community.
ClassAds
› ClassAds are a simple language for describing both the properties and the requirements of jobs and machines.
› Condor stores nearly everything in ClassAds. Use the -l option to condor_q and condor_status to get the full details.
ClassAd for a Machine
› condor_status -l axpbo8
MyType = "Machine"
TargetType = "Job"
Name = "axpbo8.bo.infn.it"
START = TRUE
VirtualMemory = 342696
Disk = 28728536
Memory = 160
Cpus = 1
Arch = "ALPHA"
OpSys = "OSF1"
ClassAd for a Job
› condor_q -l 9.49
MyType = "Job"
TargetType = "Machine"
Owner = "thain"
Cmd = "/tmp_mnt/usr/users/ccl/thain/test/fib"
Out = "fib.out.49"
Args = "49"
ImageSize = 2544
DiskUsage = 2544
Requirements = (Arch == "ALPHA") && (OpSys == "OSF1") &&
    (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)
Default Requirements
› By default, Condor assumes the requirements for your job are: “I need a machine with…”
  • The same operating system and architecture as my workstation.
  • Enough disk to store the program.
  • Enough virtual memory to run the program.
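The default requirements can be restated as a small Python check, using the attribute values shown in the machine ClassAd for axpbo8 and the job ClassAd for 9.49. This is only a sketch of the logic. Condor evaluates ClassAd expressions itself; the function default_requirements, its parameters, and the second machine (linux_box) are hypothetical.

```python
# Values copied from the machine ClassAd for axpbo8
machine = {
    "Arch": "ALPHA", "OpSys": "OSF1",
    "Disk": 28728536, "VirtualMemory": 342696, "Memory": 160,
}
# Values copied from the job ClassAd for 9.49
job = {"ImageSize": 2544, "DiskUsage": 2544}

def default_requirements(job, machine, my_arch="ALPHA", my_opsys="OSF1"):
    """Python restatement of the Requirements expression shown for job 9.49."""
    return (
        machine["Arch"] == my_arch                         # same architecture
        and machine["OpSys"] == my_opsys                   # same operating system
        and machine["Disk"] >= job["DiskUsage"]            # enough disk
        and machine["VirtualMemory"] >= job["ImageSize"]   # enough virtual memory
    )

print(default_requirements(job, machine))  # True

# Hypothetical machine with a different architecture and operating system
linux_box = dict(machine, Arch="INTEL", OpSys="LINUX")
print(default_requirements(job, linux_box))  # False
```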
ClassAd Requirements
› Similar to C/C++/Java expressions:
  • Symbols: Arch, OpSys, Memory, Mips
  • Values: 15, 6.5, "LINUX"
  • Operators:
    • ==, <, >, <=, >=
    • &&, ||
    • ( )
Adding Requirements
› In the submit file, add a line beginning with "requirements = "

Executable = fib
Arguments = 40
Output = fib.out
Log = fib.log
Requirements = (Memory > 64)
queue
Example Requirements
› (Memory > 64)
› (Machine == "axpbo3.bo.infn.it")
› (Mips > 100) || (Kflops > 10000)
› (Subnet != "131.154.10") && (Disk > 20000000)
Preferences
› Condor assumes that any machines that match your requirements are suitable.
› However, you may prefer some machines over others. (100 Mips is better than 10)
› To indicate a preference, you may provide a ClassAd expression which ranks all matches.
Rank
› The rank expression is evaluated into
a number for every potential
matching machine.
› A machine with a higher number will
be preferred over a machine with a
lower number.
Rank Examples
› Prefer machines with more Mips:
  • Rank = Mips
› Prefer machines with a high ratio of memory to cpu performance:
  • Rank = Memory/Mips
› Prefer more memory, but add 100 to the rank if the machine is Solaris 2.7:
  • Rank = Memory + 100*(OpSys == "SOLARIS27")
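The last rank expression can be tried out in Python, where a true comparison also counts as 1 in arithmetic. The sketch below ranks the three machines from the earlier condor_status listing (names shortened) plus one hypothetical Solaris 2.7 machine, sol27, invented for illustration. It ignores requirements entirely and only shows how a higher rank moves a machine to the front.

```python
machines = [
    {"Name": "lxpc1", "OpSys": "LINUX-GLIBC", "Memory": 30},
    {"Name": "axpd21", "OpSys": "OSF1", "Memory": 96},
    {"Name": "vlsi11", "OpSys": "SOLARIS26", "Memory": 256},
    {"Name": "sol27", "OpSys": "SOLARIS27", "Memory": 192},  # hypothetical machine
]

def rank(machine):
    # Rank = Memory + 100*(OpSys == "SOLARIS27"); True counts as 1, False as 0
    return machine["Memory"] + 100 * (machine["OpSys"] == "SOLARIS27")

# Higher rank is preferred, so sort in descending order
best_first = sorted(machines, key=rank, reverse=True)
print([(m["Name"], rank(m)) for m in best_first])
```

With these numbers the Solaris 2.7 machine comes first even though vlsi11 has more memory, because the bonus of 100 outweighs the 64 MB difference.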
Standard or Vanilla?
Which Universe?
› Each Condor universe provides different services to different kinds of programs:
  • Standard: relinked UNIX programs
  • Vanilla: unmodified UNIX programs
  • PVM, Scheduler, Globus (not described here)
Standard Universe
› Submit a specially-linked UNIX application to the Condor system.
› Advantages:
  • Checkpointing for fault tolerance.
  • Remote I/O services:
    • Friendly environment anywhere in the world.
    • Data buffering and staging.
    • I/O performance feedback.
    • User remapping of data sources.
Standard Universe
› Disadvantages:
Must statically link with Condor library.
Limited class of applications:
• Single-process UNIX binaries.
• Certain system calls prohibited.
System Call Limitations
› Standard universe does not allow:
  • Multiple processes: fork(), exec(), system()
  • Inter-process communication: semaphores, messages, shared memory
  • Complex I/O: mmap(), select(), poll(), non-blocking I/O, …
  • Kernel-level threads (user-level threads are OK)
System Call Limitations
› Too restrictive?
Use the vanilla universe.
Vanilla Universe
› Submit any sort of UNIX program to the Condor system.
› Advantages:
  • No relinking required.
  • Any program at all, including:
    • Binaries
    • Shell scripts
    • Interpreted programs (java, perl)
    • Multiple processes
Vanilla Universe
› Disadvantages:
No checkpointing.
Very limited remote I/O services.
• Specify input files explicitly.
• Specify output files explicitly.
Condor will refuse to start a vanilla job
on a machine that is unfriendly.
• ClassAds: FilesystemDomain and UIDDomain
Which Universe?
› Standard:
Good for mixed Condor pools, flocked
pools, and the Grid at large.
› Vanilla:
Good for a Condor pool of identical
machines.
Conclusion
› Condor expands your reach to many CPUs, even those you cannot log in to.
› Condor makes it easy to run and manage large numbers of jobs.
› Good candidates for the standard universe are single-process CPU-bound jobs with simple I/O.
› Too restrictive? Use the vanilla universe, but expect fewer available machines.
Move to Workshop
Meet again in room ____ at _____.
Bring printouts to follow along.