The Effect of Heavy-Tailed Job Size Distributions on System Design Mor Harchol-Balter


The Effect of Heavy-Tailed Job Size Distributions on System Design
Mor Harchol-Balter
MIT Laboratory for Computer Science
1
Distribution of job sizes --> System Design
2
Problem 1:
CPU Load Balancing in a N.O.W.
[Harchol-Balter, Downey --- Sigmetrics ‘96;
Transactions on Computer Systems ‘97]
3
CPU Load Balancing in a NOW
[Figure: processes (p) spread across four hosts A, B, C, and D]
2 types of migration
Non-preemptive migration (NP) (a.k.a. placement, remote execution)
Preemptive migration (P) (a.k.a. active process migration)
The Problem: Policy Decisions
1. Should we bother with P migration, or is NP enough?
2. Which processes should we migrate? ** migration policy **
4
Unix process CPU lifetime measurements [HD96]
[Figure: log-log plot of the fraction of jobs with CPU duration > x against duration (x secs)]
$\Pr\{\text{Life} > x\} = 1/x$
• We measured over 1 million UNIX processes.
• Instructional, research, and sys. admin. machines.
• Job of cpu age x has probability 1/2 of using another x.
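The "probability 1/2 of using another x" bullet follows directly from the 1/x fit. A minimal sketch, assuming the fit holds exactly for x >= 1 second and comparing it with a memoryless exponential of mean 1 (the function names and the exponential's mean are illustrative choices, not from the talk):

```python
import math

def pareto_survival(x: float) -> float:
    """Pr{Life > x} = 1/x, assumed exact for x >= 1 (an idealization of the measured fit)."""
    return 1.0 if x < 1.0 else 1.0 / x

def exp_survival(x: float, mean: float = 1.0) -> float:
    """Pr{Life > x} for an exponential lifetime; the mean is an illustrative choice."""
    return math.exp(-x / mean)

def prob_uses_another_age(survival, age: float) -> float:
    """Pr{Life > 2*age | Life > age}: chance that a job of CPU age `age` uses another `age`."""
    return survival(2 * age) / survival(age)

if __name__ == "__main__":
    for age in (1.0, 10.0, 100.0):
        print(age,
              prob_uses_another_age(pareto_survival, age),
              prob_uses_another_age(exp_survival, age))
```

Under the 1/x fit the conditional probability stays at 1/2 whatever the age, while under the exponential it shrinks rapidly as the age grows.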
5
Compare with exponential distribution
[Figure: fraction of jobs with size > x, exponential vs. Pareto (expvspareto.eps)]
6
So what?
• 1/x distribution measured previously by [Leland,Ott86] at Bellcore.
• [LO86] result rarely referenced and ignored when referenced.
• WHY?
1) People not convinced lifetime distribution matters.
2) Belief that 1/x distribution is useless in developing migration policy because not analytically tractable.
7
Why lifetime distribution matters
• Exponential distribution (memoryless) => migrate newborns
DFR (decreasing failure rate) distrib. => old jobs live longer. May pay to migrate old jobs.
• 70% of our jobs had lifetime < 0.1s < cost of NP migration
=> NP not viable without name-lists
8
How to design a Preemptive Migration policy?
CPU Lifetime distribution => migrate “older” jobs
But how old is old enough?
Not obvious: Expected remaining CPU lifetime of every job is infinity!
We showed how to use lifetime distribution to derive policy which guarantees that the migrant’s slowdown improves in expectation:
$\text{age}(p) > \frac{\text{migration cost}(p)}{\#\text{source} - \#\text{target} - 1}$
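The eligibility rule above can be sketched as a small predicate. This is an illustration under assumptions: #source and #target are read as the number of processes on the source and target hosts, the function names are hypothetical, and refusing to migrate when the denominator is not positive is a choice made here, not something the slide states.

```python
def min_migration_age(migration_cost: float, n_source: int, n_target: int) -> float:
    """Age threshold from the slide: migration cost / (#source - #target - 1).

    Returns infinity when the source is not loaded enough for the
    denominator to be positive (an assumption of this sketch).
    """
    gap = n_source - n_target - 1
    if gap <= 0:
        return float("inf")
    return migration_cost / gap

def should_migrate(age: float, migration_cost: float,
                   n_source: int, n_target: int) -> bool:
    """True when a process of CPU age `age` meets the eligibility criterion."""
    return age > min_migration_age(migration_cost, n_source, n_target)

if __name__ == "__main__":
    # Source holds 5 processes, target holds 2, migrating costs 1 second:
    # the threshold is 1 / (5 - 2 - 1) = 0.5 seconds of CPU age.
    print(should_migrate(2.0, 1.0, 5, 2))
    print(should_migrate(0.3, 1.0, 5, 2))
```

The larger the load gap between source and target, the younger a process can be and still qualify.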
9
Trace-driven simulation comparison
Our Preemptive Policy (P)
Eligibility criterion: $\text{age} > \frac{\text{migration cost}}{\#\text{source} - \#\text{target} - 1}$
vs.
Non-preemptive Policy (NP) with optimized name-list
Eligibility criterion: process name on name-list
For comparison: 2 strategies purposely as simple & similar as possible.
• Simulated network of 6 hosts.
• Process start times and durations from real machine traces.
• 8 experiments, each 1 hour. Ranged from 15,000 - 30,000 jobs. System utilization: .27 - .54.
10
Trace-driven simulation results
[Figure: mean slowdown (y-axis, 1 to 3.5) per run number (x-axis, 0 to 7) for no load balancing, non-preemptive policy NP, and preemptive policy P]
• NP improvement 20%
• P improvement 50%
11
Mean slowdown as a function of P mean migration cost
[Figure: mean slowdown against P mean migration cost in seconds, assuming NP cost = 0.3 sec (pmignolabel.eps)]
12
Question: Why was P more effective than NP ?
Answer: P better than NP at detecting long jobs.
• mean lifetime of migrant in NP is 1.5 - 2.1 secs
• mean lifetime of migrant in P is 4.5 - 5.7 secs
Question: Why does it matter if NP misses migrating a few big jobs?
Answer: In 1/x distribution, big jobs have all the CPU.
• P only migrates 4% of all jobs, but they account for 55% of CPU (“heavy-tailed property”)
13
Distribution of job sizes --> System Design
-- DFR (Decreasing Failure Rate)
-- Heavy-tailed property
14
Problem 2:
Task Assignment in a Distributed Server
[Harchol-Balter, Crovella, Murta --- Performance Tools ‘98]
15
Distributed Servers
[Figure: a large number of jobs arrive at the load balancer (L.B.), which dispatches them to hosts 1-4]
Load Balancer employs TAP (Task Assignment Policy): rule for assigning jobs to hosts
Applications: distributed web servers, database servers, batch computing servers, etc.
Age-old Question: What’s a good TAP?
16
The Model
[Figure: a large number of jobs arrive at the L.B., which feeds four FCFS hosts]
• Size of job is known.
• Jobs are not preemptible.
• Arriving job is immediately queued at a host and remains there.
• Jobs queued at a host are processed in FCFS order.
Motivation for model: Distributed batch computing server for computationally-intensive, possibly parallel jobs.
17
Which TAP is best (given model)?
[Figure: L.B. dispatching jobs to hosts 1-4]
1. Round-Robin
2. Random
3. Size-based scheme
(e.g., 30 min queue; 2 hr queue; 4 hr queue)
4. Shortest-Queue
Go to queue with fewest jobs
5. Dynamic
Go to queue with least work left (work = sum of task sizes)
“best” -- minimize mean waiting time
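The five candidate policies can be written as interchangeable "choose a host" functions. A minimal sketch, assuming each host's queue is represented as a list of the sizes of the jobs waiting there; that representation, the function names, and the example cutoffs are illustrative and not taken from the talk.

```python
import random
from itertools import count

def round_robin(n_hosts):
    """Return a chooser that cycles through hosts 0..n_hosts-1 in order."""
    counter = count()
    def choose(job_size, queues):
        return next(counter) % n_hosts
    return choose

def random_tap(job_size, queues):
    """Pick a host uniformly at random."""
    return random.randrange(len(queues))

def size_based(cutoffs):
    """Return a chooser for ascending size cutoffs; sizes above the last cutoff go to the last host."""
    def choose(job_size, queues):
        for host, limit in enumerate(cutoffs):
            if job_size <= limit:
                return host
        return len(cutoffs)
    return choose

def shortest_queue(job_size, queues):
    """Go to the queue with the fewest jobs."""
    return min(range(len(queues)), key=lambda h: len(queues[h]))

def dynamic(job_size, queues):
    """Go to the queue with the least work left (work = sum of task sizes)."""
    return min(range(len(queues)), key=lambda h: sum(queues[h]))

if __name__ == "__main__":
    queues = [[5.0], [1.0, 1.0], [3.0], [10.0]]
    print(shortest_queue(4.0, queues), dynamic(4.0, queues))
```

Shortest-Queue and Dynamic can disagree, as in the example queues: host 0 has the fewest jobs while host 1 has the least work left.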
18
To answer “Which TAP is best” question,
need to understand distribution of task sizes.
19
Pareto (heavy-tailed) distribution
$\Pr\{\text{Size} > x\} \sim x^{-a}$, $0 < a < 2$
[Figure: distribution curve against job size]
Properties:
• Decreasing Failure Rate
• Infinite variance!
• Heavy-tail property -- Minuscule fraction (<1%) of the very largest jobs comprise half the load.
a : degree of variability
0 ------ a ------ 2
a near 0: more variable & more heavy-tailed; a near 2: less variable & less heavy-tailed
20
We assume Bounded Pareto Distribution, B(k, p, a)
$f(x) = \frac{a k^{a}}{1 - (k/p)^{a}} \, x^{-a-1}$, $k \le x \le p$
[Figure: f(x) against job size, between k and p]
• All finite moments
• Still very high variance
• Still heavy-tailed property
X ~ B(k, p, a) with fixed mean 3000.
p : fixed at $10^{10}$
a : ranges between 0 and 2, to show impact of variability
k : about $10^{3}$, varies a bit as needed to keep mean constant.
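The Bounded Pareto density integrates in closed form, which makes its moments and an inverse-CDF sampler easy to write down. A minimal sketch, assuming B(k, p, a) has density proportional to x^(-a-1) on [k, p]; the function names are hypothetical, and a = 1.5 in the usage example is just one illustrative point in the 0 < a < 2 range.

```python
import random

def bp_norm(k: float, p: float, a: float) -> float:
    """Normalizing constant a*k^a / (1 - (k/p)^a) of the B(k, p, a) density."""
    return a * k**a / (1.0 - (k / p)**a)

def bp_pdf(x: float, k: float, p: float, a: float) -> float:
    """Density f(x); zero outside [k, p]."""
    if x < k or x > p:
        return 0.0
    return bp_norm(k, p, a) * x**(-a - 1.0)

def bp_moment(j: float, k: float, p: float, a: float) -> float:
    """E{X^j}, from integrating x^j * f(x) over [k, p]; valid for j != a."""
    if j == a:
        raise ValueError("j == a needs the logarithmic form, not handled in this sketch")
    return bp_norm(k, p, a) * (p**(j - a) - k**(j - a)) / (j - a)

def bp_sample(k: float, p: float, a: float, rng=random) -> float:
    """Draw one job size by inverting the CDF F(x) = (1 - (k/x)^a) / (1 - (k/p)^a)."""
    u = rng.random()
    return k * (1.0 - u * (1.0 - (k / p)**a)) ** (-1.0 / a)

if __name__ == "__main__":
    k, p, a = 1e3, 1e10, 1.5
    print("mean:", bp_moment(1, k, p, a))            # close to 3000 for these values
    print("second moment:", bp_moment(2, k, p, a))
    print("one sample:", bp_sample(k, p, a))
```

With p = 10^10 and a = 1.5, taking k = 10^3 gives a mean close to 3000, in line with the "k about 10^3" remark.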
21
Which TAP is best (given model)?
[Figure: L.B. dispatching jobs to hosts 1-4]
1. Round-Robin
2. Random
3. Size-based scheme
(e.g., 30 min queue; 2 hr queue; 4 hr queue)
4. Shortest-Queue
Go to queue with fewest jobs
5. Dynamic
Go to queue with least work left (work = sum of task sizes)
“best” -- minimize mean waiting time
22
Our Size-Based TAP: SITA-E
SITA-E : Size Interval Task Assignment with Equal Load
[Figure: the L.B. assigns each task, based on task size, to one of the hosts S, M, L, XL; a plot of x·f(x) against task size is divided at cutoffs x0, x1, x2, x3, x4]
• Assign range of task sizes to each host.
• Task size distribution determines cutoff points
so load remains balanced, in expectation.
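One way to compute such cutoffs for a Bounded Pareto B(k, p, a) task size distribution is to require that the integral of x·f(x) over each size interval be the same. The closed form below is worked out here from that equal-load condition; the slides do not show it, so treat it as a sketch. Function names and parameter values are illustrative.

```python
import bisect

def sita_e_cutoffs(k: float, p: float, a: float, n_hosts: int) -> list:
    """Cutoffs [x0, ..., xh] with x0 = k and xh = p, giving each interval equal expected load.

    For a != 1 the load on [k, x] is proportional to x^(1-a) - k^(1-a);
    for a == 1 it is proportional to log(x / k).
    """
    cutoffs = []
    for i in range(n_hosts + 1):
        frac = i / n_hosts
        if a == 1.0:
            cutoffs.append(k * (p / k) ** frac)
        else:
            e = 1.0 - a
            cutoffs.append((k**e + frac * (p**e - k**e)) ** (1.0 / e))
    return cutoffs

def host_for(size: float, cutoffs: list) -> int:
    """Index of the host whose size interval contains `size`."""
    interior = cutoffs[1:-1]          # x1 .. x(h-1) separate the hosts
    return bisect.bisect_left(interior, size)

if __name__ == "__main__":
    cuts = sita_e_cutoffs(1e3, 1e10, 1.5, 4)
    print(cuts)
    print(host_for(1500.0, cuts), host_for(1e6, cuts))
```

Because most of the load sits in the tail, the first intervals are narrow and the last one spans almost the whole size range.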
23
Simulations: Mean Waiting Time
[Figure: simulated mean waiting time (simmeanwait.eps)]
• Random & RR similar.
• Under less variability (high a), Dynamic best.
• Under high variability (lower a), SITA-E beats Dynamic by a factor of 100, and beats RR by a factor of 10000.
24
Simulations: Mean Queue Length
[Figure: simulated mean queue length (simmeanqueue.eps)]
• Random & RR similar.
• Under less variability (high a), Dynamic best.
• Under high variability (lower a), SITA-E beats Dynamic by a factor of 100, and beats RR by a factor of 1000.
• In SITA-E, queue length = 2-3, always.
25
Simulations: Mean Slowdown
[Figure: simulated mean slowdown (simmeanslow.eps)]
• Random & RR similar.
• Under less variability (high a), Dynamic best.
• Under high variability (lower a), SITA-E beats Dynamic by a factor of 100, and beats RR by a factor of 10000.
• In SITA-E, slowdown = 2-3 always.
26
WHY?
27
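Recall, P-K formula for M/G/1 FCFS queue
Mean Waiting Time: $E\{W\} = \frac{\lambda E\{X^2\}}{2(1-\rho)}$
where $E\{X^2\}$ is the second moment of the task size distribution, $\lambda$ is the arrival rate, and $\rho$ is the load.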
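The P-K formula is short enough to evaluate directly. A minimal sketch, with the load taken as arrival rate times mean task size; the function name and the sample numbers are illustrative.

```python
def pk_mean_wait(arrival_rate: float, mean_size: float, second_moment: float) -> float:
    """Mean waiting time in an M/G/1 FCFS queue: lambda * E{X^2} / (2 * (1 - rho))."""
    load = arrival_rate * mean_size          # rho = lambda * E{X}
    if not 0.0 <= load < 1.0:
        raise ValueError("the formula needs load rho in [0, 1)")
    return arrival_rate * second_moment / (2.0 * (1.0 - load))

if __name__ == "__main__":
    # Same arrival rate and mean task size, different second moments:
    # the waiting time scales linearly with E{X^2}.
    print(pk_mean_wait(0.5, 1.0, 1.0))      # deterministic sizes, E{X^2} = 1
    print(pk_mean_wait(0.5, 1.0, 2.0))      # exponential sizes, E{X^2} = 2
    print(pk_mean_wait(0.5, 1.0, 2000.0))   # highly variable sizes
```

At a fixed load, the mean wait depends on the task size distribution only through its second moment, which is what makes high-variance workloads so costly under FCFS.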
28
Analysis of Random and Round-Robin
Each host sees same $E\{X^2\}$ as incoming distribution.
Random TAP:
[Figure: Poisson arrivals with B(k,p,a) task sizes reach the L.B., which sends each job to one of hosts 1-4 with probability 1/4]
Each host: M / B(k,p,a) / 1
Round-Robin TAP:
Each host: $E_h$ / B(k,p,a) / 1 : Not Much Better.
29
Analysis of Distributed Servers with Random, Round-Robin, and Dynamic TAPs
$E\{W_{\text{Random}}\}$ is directly proportional to $E\{X^2\}$.
$E\{W_{\text{Round-Robin}}\}$ is directly proportional to $E\{X^2\}$.
We prove: $E\{W_{\text{Dynamic}}\}$ is directly proportional to $E\{X^2\}$.
[Figure: $E\{X^2\}$ for B(k,p,a) distribution (fig2.eps)]
30
Analysis of SITA-E
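[Figure: Poisson arrivals with B(k,p,a) task sizes reach the L.B., which routes a fraction p1 of jobs to host S, p2 to M, p3 to L, and p4 to XL; a plot of x·f(x) against task size is divided at cutoffs x0, x1, x2, x3, x4]
We prove SITA-E is fully analyzable under B(k,p,a) distribution.
Closed-form expressions for the $x_i$’s and $p_i$’s.
Formulas for all performance metrics.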
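To see how such an analysis can go, each host can be treated as its own M/G/1 FCFS queue and the P-K formula applied per host. The sketch below is worked out here and is not the authors' derivation: it assumes Poisson arrivals, equal-load cutoffs, a not equal to 1 or 2, and illustrative parameter values; all function names are hypothetical.

```python
def partial_moment(j, lo, hi, k, p, a):
    """Integral of x^j * f(x) over [lo, hi] for the B(k, p, a) density; needs j != a."""
    c = a * k**a / (1.0 - (k / p)**a)
    return c * (hi**(j - a) - lo**(j - a)) / (j - a)

def pk_wait(lam, m1, m2):
    """P-K mean waiting time for an M/G/1 FCFS queue with moments m1, m2."""
    return lam * m2 / (2.0 * (1.0 - lam * m1))

def random_tap_wait(lam, k, p, a, h):
    """Random TAP: each host sees rate lam/h and the full task size distribution."""
    m1 = partial_moment(1, k, p, k, p, a)
    m2 = partial_moment(2, k, p, k, p, a)
    return pk_wait(lam / h, m1, m2)

def sita_e_wait(lam, k, p, a, h):
    """SITA-E: host i sees only sizes in (x_{i-1}, x_i]; average the per-host waits over jobs."""
    e = 1.0 - a
    cuts = [(k**e + i / h * (p**e - k**e)) ** (1.0 / e) for i in range(h + 1)]
    total = 0.0
    for lo, hi in zip(cuts, cuts[1:]):
        frac = partial_moment(0, lo, hi, k, p, a)      # fraction of jobs sent to this host
        m1 = partial_moment(1, lo, hi, k, p, a) / frac  # mean size seen by this host
        m2 = partial_moment(2, lo, hi, k, p, a) / frac  # second moment seen by this host
        total += frac * pk_wait(lam * frac, m1, m2)
    return total

if __name__ == "__main__":
    k, p, a, h = 1e3, 1e10, 1.5, 4
    mean = partial_moment(1, k, p, k, p, a)
    lam = 0.5 * h / mean                                # system load 0.5 (illustrative)
    print("Random :", random_tap_wait(lam, k, p, a, h))
    print("SITA-E :", sita_e_wait(lam, k, p, a, h))
```

For these illustrative parameters SITA-E's mean wait comes out well below Random's, because only the last host is exposed to the very large second moment and few jobs are sent there.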
31
3-fold effect of SITA-E
1. Reduced $E\{X^2\}$ at lower hosts. In fact, analysis shows for 1 < a < 2, coefficient of variation is less than 1 for all hosts but last host.
2. Low numbered hosts count more in mean, because more jobs go there. Intensified by heavy-tailed property.
3. Smaller jobs go to hosts with lower waiting times.
=> low mean slowdown.
32
Analytic Results:
[Figure: analytically-derived mean slowdown, 0 < a < 2 (predmeanslow.eps)]
[Figure: analytically-derived mean slowdown, 1 < a < 2 (predmeanslowupclose.eps)]
33
Reflections
• “Best” TAP depends on task size distribution.
-- What would “best” have been under exponential distribution?
• Improving the (instantaneous) load balance is not necessarily the best heuristic for improving performance.
• Important characteristics of task size distribution: very high variance and heavy-tailed property.
• Additional advantage of SITA-E: Caching benefits
34
NEW Generalized Model
[Figure: a large number of jobs arrive at the L.B., which feeds four FCFS hosts]
• Size of job is known.
No a priori knowledge
• Jobs are not preemptible.
• Arriving job is immediately queued at a host and remains there.
• Jobs queued at a host are processed in FCFS order.
35
Distribution of job sizes --> System Design
-- Very high variability
-- Heavy-tailed property
36
Problem 3:
The case for SRPT scheduling in Web servers*
Joint work with: Crovella, Frangioso, Medina, Sengupta
* in progress ...
37
Theoretical Motivation
• It’s well known that SRPT (Shortest Remaining Processing Time) minimizes mean flow time.
[Figure: jobs arriving at a single server (load $\rho$ < 1)]
How about using SRPT in Web servers as opposed to the traditional processor-sharing (PS) type scheduling?
38
Immediate Objections
• Can’t assume known job size
• Is there enough of a performance gain?
• Starvation, starvation, starvation!
• A real Web server is much more complicated than a single processor
39
Immediate Objections
• Can’t assume known job size
Actually, for many servers most requests are for static files.
Approximation: Size of job is proportional to size of file requested.
Size of file has Pareto (1.1 < a < 1.3) tail [CTB97].
• Is there enough of a performance gain?
Analysis of single queue with Poisson arrivals shows, under load $\rho$ = 0.9,
SRPT mean flowtime is 1/3 of PS.
SRPT mean slowdown is 1/10 of PS.
Trace-driven simulation results for a single queue are even more dramatic.
40
Immediate Objections
• Starvation, starvation, starvation!
Analysis of single queue shows starvation under SRPT is a problem for many distributions, but NOT for Web workloads.
Even jobs in the 99th percentile experience mean slowdown far less than under PS.
Q: What makes Web workloads special wrt starvation?
A: heavy-tailed property
Web job size distribution --> Bounded Pareto (a = 1).
Largest 1% of all tasks account for more than 50% of the load.
In exponential distribution, largest 1% of all tasks account for 5% of total load.
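The two load-share figures can be checked with a short calculation. A minimal sketch: the exponential case is scale-free, while the Bounded Pareto case uses k = 10^3 and p = 10^10 as illustrative bounds (an assumption, since this slide gives none); the function names are hypothetical.

```python
import math

def exp_top_share(top_fraction: float) -> float:
    """Share of total load carried by the largest `top_fraction` of jobs, exponential sizes.

    With mean 1 (the result does not depend on the mean), the size threshold t
    satisfies exp(-t) = top_fraction, and the load above t is (t + 1) * exp(-t).
    """
    t = -math.log(top_fraction)
    return (t + 1.0) * math.exp(-t)

def bp_top_share_a1(top_fraction: float, k: float, p: float) -> float:
    """Same share for a Bounded Pareto B(k, p, a = 1).

    The threshold t solves (k/t - k/p) / (1 - k/p) = top_fraction; for a = 1
    the load on [t, p] is proportional to log(p / t).
    """
    t = k / (top_fraction * (1.0 - k / p) + k / p)
    return math.log(p / t) / math.log(p / k)

if __name__ == "__main__":
    print("exponential   :", exp_top_share(0.01))
    print("Bounded Pareto:", bp_top_share_a1(0.01, 1e3, 1e10))
```

The exponential share comes out at roughly 5 to 6 percent, whereas the Bounded Pareto share exceeds one half for these bounds, which is the contrast the slide draws.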
41
Immediate Objections
• A real Web server is much more complex than a single processor
• Many devices -- where to do the scheduling?
• One job at a time execution model is no longer appropriate.
Goal: Develop system that maintains high throughput while maximally favoring small jobs.
42
Server Designs for SRPT
2 projects
1) Coarse approximation to SRPT, familiar programming model.
• Thread-per-request.
• Thread priority based on size, updatable occasionally.
• Implemented as a Web server on Solaris.
2) Closer approximation to SRPT, unusual programming model.
• No longer thread-per-request. Instead, thread-per-queue.
• Read, write queues managed in SRPT order.
• Implemented as a Web server on Linux.
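The idea of a queue "managed in SRPT order" can be illustrated with a heap keyed by remaining bytes. This is a generic sketch of SRPT ordering, not the design of either server described above; the class, its methods, and the chunk size are hypothetical.

```python
import heapq
from itertools import count

class SRPTQueue:
    """Always serves the request with the fewest remaining bytes."""

    def __init__(self):
        self._heap = []
        self._tie = count()   # breaks ties in arrival order

    def add(self, request_id, remaining_bytes):
        heapq.heappush(self._heap, (remaining_bytes, next(self._tie), request_id))

    def serve(self, chunk_bytes):
        """Send up to chunk_bytes of the shortest-remaining request.

        Returns (request_id, bytes_sent, finished), or None if the queue is empty.
        An unfinished request is re-queued with its reduced remaining size.
        """
        if not self._heap:
            return None
        remaining, _, request_id = heapq.heappop(self._heap)
        sent = min(chunk_bytes, remaining)
        left = remaining - sent
        if left > 0:
            self.add(request_id, left)
        return request_id, sent, left == 0

if __name__ == "__main__":
    q = SRPTQueue()
    q.add("big", 10_000)
    q.add("small", 500)
    print(q.serve(1_000))   # the small request is served first and finishes
    print(q.serve(1_000))   # then the big one makes progress
```

A newly arrived short request overtakes a long one that is already in progress, which is what favors small jobs.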
43
Preliminary Results
Mean Response Time as a function of Server Load
[Figure: mean response time as a function of server load (latency.eps)]
44
Distribution of job sizes --> System Design
-- Heavy-tailed property
45
Talk Conclusions
Distribution of job sizes --> System Design
3 Examples described during this talk:
-- Migration policies for N.O.W.
-- Task Assignment in distributed servers
-- Scheduling in Web servers
My research approach in general:
-- Practical, common problems in system design
-- Careful modeling of problem to incorporate important details like job size distribution.
-- Analysis, Simulation, Implementation
-- Not afraid to look for surprising solutions
46
Which TAP is best (given model)?
[Figure: L.B. dispatching jobs to hosts 1-4]
1. Round-Robin
2. Random
3. Shortest-Queue
Go to queue with fewest jobs
-- Idea: Equalize number of jobs in each queue
4. Dynamic
Go to queue with least work left (work = sum of task sizes)
-- Idea: Best individual performance.
-- Instantaneous load balance
5. Size-based scheme
(e.g., 30 min queue; 2 hr queue; 4 hr queue)
-- Idea: No “biggies” ahead of you
“best” -- minimize mean waiting time, mean queue length, mean slowdown
47
Preliminary Results
Mean Response Time as a function of Server Load
[Figure: mean response time as a function of server load (srptps.eps)]
48
Which TAP is best (given model)?
[Figure: L.B. dispatching jobs to hosts 1-4]
1. Round-Robin
2. Random
3. Shortest-Queue
Go to queue with fewest jobs
4. Dynamic
Go to queue with least work left (work = sum of task sizes)
5. Size-based scheme
(e.g., 30 min queue; 2 hr queue; 4 hr queue)
“best” -- minimize mean waiting time, mean queue length, mean slowdown
49