Schedd On The Side


Routing Jobs to the Grid
Dan Bradley
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
What’s a Job Router?
Specialized scheduler operating on schedd’s jobs.
[Figure: Schedd job queue (Job 1 through Job 5, …, Job 4*) alongside the Job Router, a.k.a. Schedd On The Side]
Adapted Quill Technology
› Using Quill library to mirror job queue in memory
o Efficient - just “tails” the log
o Independent - mirror without clogging schedd command queue
› Modifying the job queue is another matter - must interact with schedd
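A minimal sketch of the "tail the log" idea, to show why mirroring is cheap and independent of the schedd: the reader only remembers a file offset and replays newly appended records into a dictionary. The record format here (`SET <job> <attr> <value>` and `DEL <job>`) and the `QueueMirror` class are hypothetical illustrations, not the actual Quill library or the schedd's real log format.

```python
import io


class QueueMirror:
    """In-memory copy of a job queue, rebuilt by replaying an append-only log.

    The "SET"/"DEL" record format is a made-up stand-in for a real queue log.
    """

    def __init__(self, log):
        self.log = log      # any seekable text stream (file or StringIO)
        self.offset = 0     # position just past the last complete record read
        self.jobs = {}      # job id -> {attribute name: value}

    def poll(self):
        """Replay records appended since the last poll; return how many."""
        self.log.seek(self.offset)
        consumed = 0
        while True:
            line = self.log.readline()
            if not line.endswith("\n"):
                break  # end of file, or a half-written record: retry later
            self.offset = self.log.tell()
            self._apply(line.split())
            consumed += 1
        return consumed

    def _apply(self, fields):
        if len(fields) >= 4 and fields[0] == "SET":
            job, attr = fields[1], fields[2]
            self.jobs.setdefault(job, {})[attr] = " ".join(fields[3:])
        elif len(fields) >= 2 and fields[0] == "DEL":
            self.jobs.pop(fields[1], None)
```

Because `poll()` only reads, it never competes with clients for the schedd's command queue; changing a job would still have to go through the schedd itself.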
Usage Case
Routing: Vanilla -> Grid
Condor Farm Story
•Now that this is working, how can I use my collaborator’s resources too?
[Figure: condor_submit feeding the Application’s “Random Seed” jobs into the Schedd job queue, which runs them on Startd Resources]
Option #1: Merge Farms
› Combine machines with collaborator into one Condor resource pool.
o Everything works just like it did before.
o Excellent option for small to medium clusters.
o Requires bidirectional connectivity to all startds, or equivalent via GCB.
o Requires some administrative coordination (e.g. upgrades, negotiator policy, security).
Option #1b: Submit to Multiple Pools
› condor_submit -remote …
› Works
› Ok for small scale
› Have to manually partition jobs
Option #2: Flocking Together
•full featured (standard universe, etc.)
•automatic matchmaking
•easy to configure
[Figure: Schedd job queue of “Random Seed” jobs matched to both Local Startds and Remote Startds]
•requires bidirectional connectivity
•both sites must run Condor
Option #3: Grid Universe
[Figure: Schedd job queue split between vanilla jobs and jobs sent through the site X Gatekeeper to the X Startds]
•easier to live with private networks
•may use non-Condor resources
•restricted Condor feature set (e.g. no standard universe over grid)
•must pre-allocate jobs between vanilla and grid universes
Option #4: Routing Jobs
•dynamic allocation of jobs between vanilla and grid universes.
•not every job is appropriate for transformation into a grid job.
[Figure: Schedd On The Side moving “Random Seed” jobs from the vanilla queue (Local Startds) to the Gatekeepers and Schedds of site X, site Y, and site Z]
Example Routing Table
[GridResource = "gt2 gatekeeper.site1/jobmanager-pbs";
MaxJobs = 500;
MaxIdle = 50;
set_GlobusRSL = "(…)"
]
[GridResource = "condor schedd.site2 collector.site2";
MaxJobs = 700;
MaxIdle = 100;
Requirements = other.ImageSize < 500
]
…
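One way to read the routing table above, sketched in Python. This is an illustration under stated assumptions, not the router's actual algorithm: it assumes MaxJobs caps the jobs currently sent along a route, MaxIdle caps how many of those may sit idle, and `set_`-prefixed entries overwrite the matching attribute in the routed copy. Route `Requirements` are modeled as plain Python callables rather than ClassAd expressions, and the function names are hypothetical.

```python
import copy

VANILLA, GRID = 5, 9   # Condor's JobUniverse codes for vanilla and grid
IDLE = 1               # Condor's JobStatus code for an idle job


def route_has_room(route, routed_jobs):
    """routed_jobs: job ads already sent along this route (assumed bookkeeping)."""
    idle = sum(1 for job in routed_jobs if job.get("JobStatus") == IDLE)
    return len(routed_jobs) < route["MaxJobs"] and idle < route["MaxIdle"]


def pick_route(job, routes, routed_by_route):
    """Return the index of the first matching route with spare capacity."""
    for index, route in enumerate(routes):
        requirements = route.get("Requirements", lambda job: True)
        if requirements(job) and route_has_room(route, routed_by_route.get(index, [])):
            return index
    return None  # no route available: the job stays a vanilla job


def apply_route(job, route):
    """Build the grid-universe copy of a vanilla job; the original is untouched."""
    routed = copy.deepcopy(job)
    routed["JobUniverse"] = GRID
    routed["GridResource"] = route["GridResource"]
    for key, value in route.items():
        if key.startswith("set_"):
            routed[key[len("set_"):]] = value
    return routed
```

Leaving the original ad unmodified matches the picture of a routed copy living next to the vanilla job, so a job that cannot be routed can still run locally.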
What About I/O?
› Jobs must be sandboxable (i.e. input and output are specified via the file transfer mechanism).
› Routing of standard universe jobs is not supported.
› Must have enough storage space at the site for input/output files!
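The three conditions above could be combined into a pre-routing check along these lines. This is a sketch only: `JobUniverse` and `ShouldTransferFiles` are real job attributes, but treating `IF_NEEDED` as sandboxable is an assumption, and `InputBytes`, `ExpectedOutputBytes`, and `site_free_bytes` are hypothetical names for sizes the router would have to learn some other way.

```python
STANDARD, VANILLA = 1, 5  # Condor's JobUniverse codes


def is_sandboxable(job):
    """True if the job declares its files through the file transfer mechanism."""
    return job.get("ShouldTransferFiles", "NO") in ("YES", "IF_NEEDED")


def can_route(job, site_free_bytes):
    """Check universe, sandboxing, and storage before routing a job to a site."""
    if job.get("JobUniverse") != VANILLA:
        return False  # e.g. standard universe jobs are not routed
    if not is_sandboxable(job):
        return False
    # Hypothetical attributes: estimated input and output sizes in bytes.
    needed = job.get("InputBytes", 0) + job.get("ExpectedOutputBytes", 0)
    return needed <= site_free_bytes
```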
What Types of Grids?
› Routing table may contain any combination of grid types supported by Condor’s grid universe.
› Example: Condor-C
[Figure: Schedd On The Side routing “Random Seed” jobs from the local Schedd to Schedd X at site X]
•for two Condor sites, schedd-to-schedd submission requires no additional software
•however, still not as trivial to use as flocking
Source Routing
› Routing the old-fashioned way:
universe = Grid
GridResource = condor site1 …
remote_universe = Grid
remote_GridResource = condor site2 …
remote_remote_universe = Grid
remote_remote_GridResource = pbs
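The pattern in the submit description above is mechanical: each extra hop adds one more `remote_` prefix. A small sketch that generates the lines from a list of hops makes the nesting explicit; `source_route` is a hypothetical helper written for this illustration, not a Condor tool.

```python
def source_route(hops):
    """Turn an ordered list of GridResource values into submit-file lines.

    hops[0] is where the job is first submitted; each later hop is reached
    from the previous one, so its attributes gain one more "remote_" prefix.
    """
    lines = []
    for depth, resource in enumerate(hops):
        prefix = "remote_" * depth
        lines.append(f"{prefix}universe = Grid")
        lines.append(f"{prefix}GridResource = {resource}")
    return lines
```

Called with the three hops from the slide, it returns the same six lines, which shows why hand-written source routes become unwieldy as the chain grows.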
Routing At the Site
•navigate internal firewalls
•provide custom routes for special users
•improve scalability
•However, keep in mind I/O requirements etc.
[Figure: Gatekeeper at site X, Schedd On The Side, Schedd X2, Schedd X3]
Multicast in Future?
› Currently: route one job to one site
› Multicast: route one job to many sites
› Thin out all but first to germinate
› … or all but first to yield fruit.
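Since multicast is only a proposal here, the following is a speculative sketch of the thinning rule rather than a description of any existing feature. It assumes the router records, per site, the time a copy reached the chosen milestone (started running for "germinate", completed for "yield fruit"), with None for copies that have not reached it; `thin_copies` is a hypothetical name.

```python
def thin_copies(milestone_time):
    """Pick the copy that reached the milestone first; the rest get removed.

    milestone_time maps site name -> timestamp at which that site's copy
    reached the milestone, or None if it has not reached it yet.
    Returns (winner, sites_to_remove); (None, []) while no copy has got there.
    """
    reached = {site: t for site, t in milestone_time.items() if t is not None}
    if not reached:
        return None, []
    winner = min(reached, key=reached.get)
    losers = sorted(site for site in milestone_time if site != winner)
    return winner, losers
```

The same function covers both variants on the slide; only the milestone being timestamped changes.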
Future Glidein Factory
[Figure: home Schedd job queue of “Random Seed” jobs; Schedd On The Side submits glidein jobs through the site X Gatekeeper to Schedd X and its Startds]
•true late binding of jobs to resources
•may run on top of non-Condor sites
•supports full feature-set of Condor (e.g. standard universe)
•requires GCB for private networks
Gliding in the Factory
[Figure: Schedd On The Side, glidein factory, and Schedd at site X]
schedd-to-schedd
•hierarchical strategy for scalability and reliability
•better match for private networks
schedd-to-gatekeeper
•may require some additional horsepower from gatekeeper machine, perhaps a dedicated element for “edge services”.
Pluggable Router
› Beyond simple ClassAd transforms
› Plugins would fire when a job matches an entry in the routing table
› Don’t yet understand semantics
› There is work to do!
Thanks
Interested? Let us know.
We are currently using job routing for specific users at UW.
Future development will focus on more use cases.
Dan Bradley
[email protected]