Predicting Queue Waiting Time in
Batch Controlled Systems
Rich Wolski, Dan Nurmi, John Brevik, Graziano Obertelli
Computer Science Department
University of California, Santa Barbara
Problem: Predicting Delay in Batch Queues
• Time in queue is experienced as application delay
• Sounds like an easy problem, but
— Distribution of load from users is a matter of some debate
— Scheduling policy is partially hidden
— Sites need to change the policies dynamically and without warning
— Job execution times are difficult to predict
• Much research in this area over the past 20 years, but few solutions
— Current commercial systems provide high variance estimates
— Most sites simply disable this feature
Hard Problem
[Figure: SDSC Datastar High Queue, March 2004 to October 2005. Delay (seconds, log scale from 1 to 10,000,000) versus time.]
For Scheduling: It’s all about the big Q
• Predictions of the form
— “What is the maximum time my job will wait with X% certainty?”
— “What is the minimum time my job will wait with X% certainty?”
• Requires two estimates if certainty is to be quantified
— Estimate the (1-X) quantile for the distribution of availability => Qx
— Estimate the upper or lower X% confidence bound on the statistic Qx => Q(x,lb)
• If the estimates are unbiased, and the distribution is stationary, future availability duration will be larger than Q(x,lb) X% of the time, guaranteed
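The two-step estimate above (a quantile, then a confidence bound on that quantile) can be illustrated for the queue-wait case with order statistics and the Binomial distribution. This is a minimal sketch, not the authors' implementation: it assumes independent observations from a stationary distribution, the function names are made up for illustration, and the naive summation only suits small histories (large ones need the careful numerics the talk mentions).

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(Binomial(n, p) <= k), summed term by term."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def upper_bound_on_quantile(waits, q=0.95, confidence=0.95):
    """Return an order statistic that is >= the q-quantile with the given confidence.

    Returns None when the sample is too small to support the bound.
    """
    data = sorted(waits)
    n = len(data)
    for k in range(1, n + 1):
        # The k-th smallest value is at least the q-quantile when at most k-1
        # observations fall below it, which has probability P(Bin(n, q) <= k-1).
        if binomial_cdf(k - 1, n, q) >= confidence:
            return data[k - 1]
    return None
```

With 100 past wait times, the bound is the 99th smallest observation. With fewer than 59 observations, no order statistic gives 95% confidence on the 0.95 quantile and the function returns None.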
New Predictive Methodology
• New quantile estimator based on the Binomial distribution
  • Requires a carefully engineered numerical system to deal with large-scale combinatorics
• New changepoint detector
  • Binomial method in a time series context is difficult
  • Need a system for determining
    • Stationary regions in the data
    • Minimum statistically meaningful history in each region
• New clustering methodology (coming soon)
  • More accurate estimates are possible if predictions are made from jobs with similar characteristics
  • Takes dynamic policy changes into account more effectively
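The "minimum statistically meaningful history" has a closed form under the Binomial argument: the largest of n observations is at least the q-quantile with probability 1 - q^n. The sketch below computes that minimum and pairs it with a simple changepoint rule. The rule itself (trim the history after a run of consecutive mispredictions) and the names `trim_on_changepoint` and `run_length` are assumptions made for illustration; the slides do not specify the detector.

```python
from math import ceil, log


def min_history(q: float = 0.95, confidence: float = 0.95) -> int:
    """Smallest n for which the sample maximum bounds the q-quantile from above.

    The maximum of n observations is >= the q-quantile with probability
    1 - q**n, so we need 1 - q**n >= confidence.
    """
    return ceil(log(1 - confidence) / log(q))


def trim_on_changepoint(history, missed, run_length=3, q=0.95, confidence=0.95):
    """Hypothetical rule: after `run_length` consecutive misses, keep only the
    most recent minimum-size window of observations."""
    recent = missed[-run_length:]
    if len(recent) == run_length and all(recent):
        return history[-min_history(q, confidence):]
    return history
```

For the 0.95 quantile at 95% confidence, `min_history()` returns 59.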
Ten Years of Supercomputing
[Figure: Percentage of Wait Times Correctly Predicted with 95% Confidence. Y axis: percentage of correct predictions (0.93 to 0.98). One entry per queue, grouped by machine and trace period.]
Machines and trace periods: LLNL SP (1/02 - 10/02), NERSC SP (3/01 - 3/03), LANL O2K (12/99 - 4/00), SDSC SP (4/98 - 4/00), TACC RS (1/04 - 5/04), SDSC PGN (1/96 - 1/97)
Queue labels: standby, q256s, q128l, q64l, q32l, q16s, q4s, q1l, systest, development, high, normal, low, low, high, express, normal, schammpq, scavenger, ir-shared, shared, chammpq, small, blue, interactive, premium, low, debug, reg-long, regular, low
See it In Action
• http://pompone.cs.ucsb.edu/~rgarver/bqindex.php
Predicting Things Upside Down
• Deadline scheduling: My job needs to start in the next X seconds for the results to be meaningful.
— Amitava Mujumdar, Tharaka Devaditha, Adam Birnbaum (SDSC)
– Need to run a 4 minute image reconstruction that completes in the next 8 minutes
• Given a
— Machine
— Queue
— Processor count
— Run time
— Deadline
• What is the probability that a job will meet the deadline?
• http://pompone.cs.ucsb.edu/~rgarver/invbqueue.php
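The inverted question can be illustrated with a plain empirical estimate: the fraction of comparable past jobs whose wait left enough time to finish before the deadline. This is a simplified stand-in, not the confidence-bounded method behind the tool above, and the function name and sample wait times are hypothetical.

```python
def probability_of_meeting_deadline(past_waits, run_time, deadline):
    """Empirical estimate of P(wait + run_time <= deadline).

    past_waits: queue delays (seconds) of earlier jobs submitted to the same
    machine and queue with a similar processor count and run time.
    """
    if not past_waits:
        raise ValueError("need at least one past wait time")
    slack = deadline - run_time  # longest wait that still meets the deadline
    if slack < 0:
        return 0.0
    on_time = sum(1 for w in past_waits if w <= slack)
    return on_time / len(past_waits)


# The 4 minute reconstruction with an 8 minute deadline from the slide,
# against made-up wait times (seconds).
example = probability_of_meeting_deadline([30, 120, 200, 600, 3600], 240, 480)
```

Here three of the five hypothetical waits are at most 240 seconds, so `example` is 0.6.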
How Well Does it Work with an Application?
[Figure: EMAN processing pipeline diagram. Labels: Electron Micrograph, Particles, Preliminary 3D Model, Refine, Final 3D Model.]
EMAN has been developed at Baylor College of Medicine by the research group of Wah Chiu and Steven Ludtke {wah,sludtke}@bcm.tmc.edu
VGrADS EMAN Batch Scheduler
• EMAN emulator
• Experiment:
— Run the EMAN scheduler to determine a job launch sequence
— Launch the jobs by submitting them to the queues specified by the scheduler
— When an EMAN job acquires the processors, exit and “sleep” the emulator for the predicted execution time
– Saves system allocation time
— Record the overall makespan
— Chicago TeraGrid, SDSC TeraGrid, NCSA TeraGrid and CNSI Dell at UCSB
— 57 separate runs
• Results: mean observed and mean predicted makespans are not significantly different at alpha = 0.05
95% Upper Bound on Median
EMAN Schedule Measurements and Predictions
[Figure: makespan (seconds, log scale from 1 to 1,000,000) versus execution instance (0 to 60). Series: Measured, Predicted.]
avg. measured = 23906
avg. predicted = 15593
t = 1.29
p-value = 0.1
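A comparison of measured and predicted makespans like the one summarized above can be sketched with a t statistic. The slides do not say which variant of the t-test was used, so the paired form here is an assumption, and the numbers are illustrative rather than data from the talk.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(measured, predicted):
    """t statistic for the mean of the paired differences measured - predicted."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [m - p for m, p in zip(measured, predicted)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Illustrative numbers only, not data from the talk.
measured = [10.0, 12.0, 9.0, 14.0]
predicted = [8.0, 11.0, 10.0, 12.0]
t = paired_t_statistic(measured, predicted)

# Two-sided critical value of Student's t for alpha = 0.05 and 3 degrees of freedom.
T_CRIT_DF3 = 3.182
significantly_different = abs(t) > T_CRIT_DF3
```

When the statistic stays below the critical value, as in this toy example, the difference between the two means is not significant at alpha = 0.05.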
Clustering
• RMS ratio of Binomial with Clustering to without
— Both achieve 95% correctness
— Measures “tightness” improvement through clustering
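One way to read the "RMS ratio" is sketched below. It assumes the RMS error is taken over the gap between each predicted bound and the wait actually observed; the slide does not define the metric, and the function names are hypothetical.

```python
from math import sqrt


def rms_error(bounds, observed):
    """Root-mean-square gap between predicted bounds and observed waits."""
    if len(bounds) != len(observed) or not bounds:
        raise ValueError("need two equal-length, non-empty sequences")
    return sqrt(sum((b - o) ** 2 for b, o in zip(bounds, observed)) / len(bounds))


def rms_ratio(clustered_bounds, plain_bounds, observed):
    """RMS error with clustering divided by RMS error without it."""
    return rms_error(clustered_bounds, observed) / rms_error(plain_bounds, observed)
```

Under this reading, a ratio below 1 means the clustered bounds sit closer to the observed waits while both methods keep the same correctness level.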
Batch Queue Prediction for Grid Systems
• A good point-valued prediction remains elusive
• Grid users certainly can use bounds instead
— Early job completion is okay, typically
— Bounds give a good intuitive feel for which queue will be quickest
— EMAN doesn’t use ranges…it should
• Automatic schedulers are coming
— VGrADS is developing new schedulers (workflow)
— NEESGrid and ISI are in development (workflow)
— Large-scale sensor network simulation
What’s Next?
• Open questions:
— Does the availability of predictions affect load?
– Rolling out production tools now and we will be monitoring
– Job cancellation does not affect results
— If it does, will allocations be stable?
– Grid economies
— Conditional prediction and resubmission
• Virtual resource reservations (VGrADS)
— Virtual Cluster??
• Thanks
— NSF SCI, VGrADS, SDSC, TACC
• Us: [email protected], [email protected]