Condor - Jefferson Lab


High-Throughput
Computing With
Condor
Peter Couvares
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/~pfc
Who Are We?
www.cs.wisc.edu/condor
The Condor Project
(Established ‘85)
Distributed systems CS research performed
by a team that faces:
 software engineering challenges in a
Unix/Linux/NT environment,
 active interaction with users and collaborators,
 daily maintenance and support challenges of a
distributed production environment,
 and educating and training students.
Funding - NSF, NASA, DoE, DoD, IBM, INTEL, Microsoft and the UW Graduate School.
www.cs.wisc.edu/condor
The Condor
System
www.cs.wisc.edu/condor
The Condor System
› Unix and NT
› Operational since 1986
› More than 1300 CPUs at UW-Madison
› Available on the web
› More than 150 clusters worldwide in academia and industry
www.cs.wisc.edu/condor
What is Condor?
› Condor converts collections of distributively owned workstations and dedicated clusters into a high-throughput computing facility.
› Condor uses matchmaking to make sure that everyone is happy.
www.cs.wisc.edu/condor
What is High-Throughput
Computing?
› High-performance: CPU cycles/second
under ideal circumstances.
 “How fast can I run simulation X on this
machine?”
› High-throughput: CPU cycles/day (week,
month, year?) under non-ideal
circumstances.
 “How many times can I run simulation X in the
next month using all available machines?”
www.cs.wisc.edu/condor
What is High-Throughput
Computing?
› Condor does whatever it takes to run
your jobs, even if some machines…
Crash! (or are disconnected)
Run out of disk space
Don’t have your software installed
Are frequently needed by others
Are far away & admin’ed by someone
else
www.cs.wisc.edu/condor
What is Matchmaking?
› Condor uses Matchmaking to make sure that work gets done within the constraints of both users and owners.
› Users (jobs) have constraints:
 “I need an Alpha with 256 MB RAM”
› Owners (machines) have constraints:
 “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
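(Illustration, not from the talk.) Both sides of a match are ordinary ClassAd expressions. A minimal sketch, assuming standard Condor attribute names (Arch, Memory, KeyboardIdle, Owner); the specific values are made up:

    # In the user's submit description file: the job's requirements
    requirements = (Arch == "ALPHA") && (Memory >= 256)

    # In the owner's condor_config: the machine's policy for starting jobs
    # (only after 15 minutes of keyboard idle time, and never for Bob)
    START = (KeyboardIdle > 15 * 60) && (Owner != "bob")

During matchmaking the central manager pairs a job with a machine only when both expressions are satisfied.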
www.cs.wisc.edu/condor
“What can Condor
do for me?”
Condor can…
› …do your housekeeping.
› …improve reliability.
› …give performance feedback.
› …increase your throughput!
www.cs.wisc.edu/condor
Some Numbers: UW-CS Pool, 6/98-6/00
Total: 4,000,000 hours (~450 years)

“Real” Users: 1,700,000 hours (~260 years)
  CS-Optimization          610,000 hours
  CS-Architecture          350,000 hours
  Physics                  245,000 hours
  Statistics                80,000 hours
  Engine Research Center    38,000 hours
  Math                      90,000 hours
  Civil Engineering         27,000 hours
  Business                     970 hours

“External” Users: 165,000 hours (~19 years)
  MIT                       76,000 hours
  Cornell                   38,000 hours
  UCSD                      38,000 hours
  CalTech                   18,000 hours
www.cs.wisc.edu/condor
Condor & Physics
www.cs.wisc.edu/condor
Current CMS Activity
› Simulation (CMSIM) for CalTech
 provided >135,000 CPU hours to date
 peak day ~ 4000 CPU hours
 via NCSA Alliance, Condor has allocated
1,000,000 hours total to CalTech
› Simulation and Reconstruction
(CMSIM + ORCA) for HEP group at
UW-Madison
www.cs.wisc.edu/condor
INFN Condor Pool - Italy
› Italian National Institute for Research in Nuclear and Subnuclear Physics
› 19 locations, each running a Condor pool
› as few as 1 CPU -- to >100 CPUs
› each locally controlled
› each “flocks” jobs to other pools when available
www.cs.wisc.edu/condor
Particle Physics Data Grid
› The PPDG Project is...
a software engineering effort to design,
implement, experiment, evaluate, and
prototype HEP-specific data-transfer
and caching software tools for Grid
environments
› For example...
www.cs.wisc.edu/condor
Condor PPDG Work
› Condor Data Manager
 technology to automate & coordinate data movement from a variety of long-term repositories to available Condor computing resources & back again
 keeping the pipeline full!
 SRB (SDSC), SAM (Fermi), PPDG HRM
www.cs.wisc.edu/condor
PPDG Collaborators
California Institute of Technology
 Harvey B. Newman, Julian J. Bunn, Koen Holtman, Asad Samar, Takako Hickey, Iosif Legrand, Vladimir Litvin, Philippe Galvez, James C.T. Pool, Roy Williams
Argonne National Laboratory
 Ian Foster, Steven Tuecke, Lawrence Price, David Malon, Ed May
Berkeley Laboratory
 Stewart C. Loken, Ian Hinchcliffe, Doug Olson, Alexandre Vaniachine, Arie Shoshani, Andreas Mueller, Alex Sim, John Wu
Brookhaven National Laboratory
 Bruce Gibbard, Richard Baker, Torre Wenaus
Fermi National Laboratory
 Victoria White, Philip Demar, Donald Petravick, Matthias Kasemann, Ruth Pordes, James Amundson, Rich Wellner, Igor Terekhov, Shahzad Muzaffar
University of Florida
 Paul Avery
San Diego Supercomputer Center
 Margaret Simmons, Reagan Moore
Stanford Linear Accelerator Center
 Richard P. Mount, Les Cottrell, Andrew Hanushevsky, Davide Salomoni
Thomas Jefferson National Accelerator Facility
 Chip Watson, Ian Bird, Jie Chen
University of Wisconsin
 Miron Livny, Peter Couvares, Tevfik Kosar
www.cs.wisc.edu/condor
National Grid Efforts
› GriPhyN (Grid Physics Network)
› National Technology Grid - NCSA
Alliance (NSF-PACI)
› Information Power Grid - IPG (NASA)
› close collaboration with the Globus
project
www.cs.wisc.edu/condor
I have 600
simulations to run.
How can Condor
help me?
www.cs.wisc.edu/condor
My Application …
Simulate the behavior of F(x,y,z) for 20
values of x, 10 values of y and 3 values
of z (20*10*3 = 600)
F takes, on average, 3 hours to compute on a “typical” workstation (total = 1800 hours)
F requires a “moderate” (128MB) amount of
memory
F performs “moderate” I/O - (x,y,z) is 5
MB and F(x,y,z) is 50 MB
www.cs.wisc.edu/condor
Step I - get organized!
› Write a script that creates 600 input files, one for each of the (x,y,z) combinations
› Write a script that will collect the data from the 600 output files
› Turn your workstation into a “Personal Condor”
› Submit a cluster of 600 jobs to your personal Condor (a submit-file sketch follows this list)
› Go on a long vacation … (2.5 months)
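A minimal sketch of what the 600-job submit file could look like; the executable and file names are placeholders (not from the talk), and $(Process) counts from 0 to 599:

    # Submit description file: one cluster of 600 jobs, one per input file
    universe     = standard
    executable   = simulate_F
    requirements = Memory >= 128
    input        = input.$(Process)
    output       = output.$(Process)
    error        = error.$(Process)
    log          = simulate_F.log
    queue 600

Submitting this with condor_submit creates the whole cluster at once; condor_q shows its progress.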
www.cs.wisc.edu/condor
[Slide diagram: 600 Condor jobs submitted to the personal Condor on your workstation]
www.cs.wisc.edu/condor
Step II - build your
personal Grid
› Install Condor on the desktop machine next door
› …and on the machines in the classroom.
› Install Condor on the department’s Linux cluster or the O2K in the basement.
› Configure these machines to be part of your Condor pool (a configuration sketch follows this list).
› Go on a shorter vacation ...
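Joining machines to your pool is a configuration change, not new code. A rough sketch of the relevant condor_config entries on each added machine, with a placeholder hostname standing in for your workstation (which acts as the pool’s central manager):

    # Point this machine at the pool's central manager
    CONDOR_HOST = mydesk.cs.wisc.edu
    # Run only the daemons an execute/submit node needs
    DAEMON_LIST = MASTER, STARTD, SCHEDD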
www.cs.wisc.edu/condor
[Slide diagram: 600 Condor jobs; your personal Condor now also draws on the group’s machines]
www.cs.wisc.edu/condor
Step III - take advantage
of your friends
› Get permission from “friendly” Condor pools to access their resources
› Configure your personal Condor to “flock” to these pools (a configuration sketch follows this list)
› Reconsider your vacation plans ...
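Flocking is likewise configured rather than programmed. A sketch of the submit-side condor_config, assuming the friendly pools’ central managers are known (hostnames are placeholders):

    # Pools to flock to, tried in order when no local machine is free
    FLOCK_TO = condor.example.edu, pool.friendly.org
    # Each friendly pool must, in turn, list this machine in its own FLOCK_FROM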
www.cs.wisc.edu/condor
[Slide diagram: 600 Condor jobs; your personal Condor flocks from your workstation and the group’s machines to friendly Condor pools]
www.cs.wisc.edu/condor
Think BIG.
Go to the Grid.
www.cs.wisc.edu/condor
Upgrade to Condor-G
A Grid-enabled version of Condor that
uses the inter-domain services of
Globus to bring Grid resources into the
domain of your Personal Condor
Easy to use on different platforms
Robust
Supports SMPs & dedicated schedulers
www.cs.wisc.edu/condor
Step IV - Go for the Grid
› Get access (account(s) + certificate(s)) to a “Computational” Grid
› Submit 599 “Grid Universe” Condor glide-in jobs to your personal Condor (a submit-file sketch follows this list)
› Take the rest of the afternoon off ...
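A Grid Universe submission is, in essence, a Condor-G job handed to a Globus gatekeeper. A minimal sketch in the “globus” universe syntax of that era; the gatekeeper address, jobmanager name, and startup script are placeholders, and in practice a helper tool such as condor_glidein prepared these submissions:

    # Condor-G: run a glide-in startup job on a remote Grid resource
    universe        = globus
    globusscheduler = gatekeeper.example.edu/jobmanager-lsf
    executable      = glidein_startup
    output          = glidein.out
    error           = glidein.err
    log             = glidein.log
    queue 599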
www.cs.wisc.edu/condor
[Slide diagram: 600 Condor jobs; your personal Condor spans your workstation, the group’s machines, friendly Condor pools, and 599 glide-ins running on Globus Grid resources managed by PBS, LSF, and Condor]
www.cs.wisc.edu/condor
What Have We Done with
the Grid Already?
› NUG30
quadratic assignment problem
30 facilities, 30 locations
• minimize cost of transferring materials
between them
posed as a challenge in 1968, long unsolved
but with a good pruning algorithm &
high-throughput computing...
www.cs.wisc.edu/condor
NUG30 Personal Condor
Grid
For the run we will be flocking to
-- the main Condor pool at Wisconsin (600 processors)
-- the Condor pool at Georgia Tech (190 Linux boxes)
-- the Condor pool at UNM (40 processors)
-- the Condor pool at Columbia (16 processors)
-- the Condor pool at Northwestern (12 processors)
-- the Condor pool at NCSA (65 processors)
-- the Condor pool at INFN (200 processors)
We will be using glide_in to access the Origin 2000 (through LSF) at NCSA.
We will use "hobble_in" to access the Chiba City Linux cluster and Origin
2000 here at Argonne.
www.cs.wisc.edu/condor
NUG30 - Solved!!!
Sender: [email protected]
Subject: Re: Let the festivities begin.

Hi dear Condor Team,
you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days !
More stats tomorrow !!! We are off celebrating !
condor rules !
cheers,
JP.
www.cs.wisc.edu/condor
Conclusion
Computing power is everywhere;
we try to make it usable by anyone.
www.cs.wisc.edu/condor
Need more info?
› Condor Web Page
(http://www.cs.wisc.edu/condor)
› Peter Couvares
([email protected])
www.cs.wisc.edu/condor