Varys
Efficient Coflow Scheduling
Mosharaf Chowdhury, Yuan Zhong, Ion Stoica
UC Berkeley
Communication is Crucial
Performance
Facebook analytics jobs spend 33% of their runtime in communication [1]
As in-memory systems proliferate, the network is likely to become the primary bottleneck
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011
Flow
A sequence of packets between two endpoints
Independent unit of allocation, sharing, load balancing, and/or prioritization
Optimizing Communication Performance: Networking Approach
“Let systems figure it out”
Optimizing Communication Performance: Systems Approach
“Let users figure it out”
Framework        # Comm. Params*
Spark 1.0.1      6
Hadoop 1.0.4     10
YARN 2.3.0       20
*Lower bound. Does not include many parameters that can indirectly impact communication, e.g., number of reducers. Also excludes control-plane communication/RPC parameters.
Optimizing Communication Performance
Systems Approach: “Let users figure it out”
Networking Approach: “Let systems figure it out”
Coflow [1]
A collection of parallel flows
Distributed endpoints
Each flow is independent
Completion time depends on the last flow to complete
1. Coflow: A Networking Abstraction for Cluster Applications, HotNets’2012
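The coflow definition above can be sketched as a small data structure. This is only an illustration: the `Flow` and `Coflow` classes, their field names, and the endpoint names are assumptions made for this example, not the API from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Flow:
    src: str            # sending endpoint (hypothetical name)
    dst: str            # receiving endpoint (hypothetical name)
    finish_time: float  # when the last byte of this flow arrives


@dataclass
class Coflow:
    name: str
    flows: list = field(default_factory=list)

    def completion_time(self) -> float:
        # The coflow is done only when its slowest flow is done.
        return max(f.finish_time for f in self.flows)


# Three parallel flows between distributed endpoints, e.g. one shuffle.
shuffle = Coflow("shuffle", [
    Flow("m1", "r1", 2.0),
    Flow("m2", "r1", 3.5),
    Flow("m1", "r2", 1.0),
])
cct = shuffle.completion_time()
print(cct)
```

Speeding up the 1.0 or 2.0 flow here would leave the coflow completion time unchanged, which is why per-flow metrics can mislead.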
How to schedule coflows …
#1 … for faster completion of coflows?
#2 … to meet more deadlines?
[Figure: machines 1, 2, …, N on each side of the DC Fabric]
Varys
Enables coflows in data-intensive clusters
1. Simpler Frameworks: Zero user-side configuration using a simple coflow API
2. Better performance: Faster and more predictable transfers through coflow scheduling
Benefits of Inter-Coflow Scheduling
[Figure: flows of 3 units, 3-ε units, and 6 units from Coflow 1 and Coflow 2 on Link 1 and Link 2; per-link (L1, L2) timelines with the time axis marked at 2, 4, 6]
Schedule                          Coflow1 comp. time   Coflow2 comp. time
Fair Sharing                      6                    6
Flow-level Prioritization [1,2]   6                    6
The Optimal                       3                    6
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
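The three schedules in the two-link example can be reproduced with a short simulation. The flow placement is an assumption read off the figure labels: Coflow 1 has one 3-unit flow on Link 1, and Coflow 2 has a (3-ε)-unit flow on Link 1 plus a 6-unit flow on Link 2. Links are assumed to have unit capacity, and all function names are hypothetical.

```python
EPS = 0.01  # stands in for the slide's ε

# Assumed placement: (coflow, link, size in units); links have unit capacity.
FLOWS = [("C1", "L1", 3.0), ("C2", "L1", 3.0 - EPS), ("C2", "L2", 6.0)]


def serial_finish(flows, key):
    """Serve the flows on each link one at a time, in `key` order."""
    finish = {}
    for link in {f[1] for f in flows}:
        t = 0.0
        for flow in sorted((f for f in flows if f[1] == link), key=key):
            t += flow[2]
            finish[flow] = t
    return finish


def fair_finish(flows):
    """Per-link fair sharing: active flows split the link equally."""
    finish = {}
    for link in {f[1] for f in flows}:
        pending = sorted((f for f in flows if f[1] == link), key=lambda f: f[2])
        t, sent = 0.0, 0.0  # sent = units each active flow has transferred
        while pending:
            flow = pending.pop(0)
            t += (flow[2] - sent) * (len(pending) + 1)
            sent = flow[2]
            finish[flow] = t
    return finish


def coflow_times(finish):
    """A coflow completes when its last flow completes."""
    times = {}
    for (coflow, _, _), t in finish.items():
        times[coflow] = max(times.get(coflow, 0.0), t)
    return times


fair = coflow_times(fair_finish(FLOWS))
per_flow = coflow_times(serial_finish(FLOWS, key=lambda f: f[2]))    # smallest flow first
per_coflow = coflow_times(serial_finish(FLOWS, key=lambda f: f[0]))  # Coflow 1 first

for name, times in [("fair sharing", fair), ("flow-level", per_flow), ("coflow-aware", per_coflow)]:
    print(name, {c: round(t, 2) for c, t in sorted(times.items())})
```

Under these assumptions, fair sharing and smallest-flow-first both leave Coflow 1 finishing at about 6 (exactly 6-ε), while serving Coflow 1 first finishes it at 3 without delaying Coflow 2 beyond 6.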
Inter-Coflow Scheduling is NP-Hard
[Figure: flows of 3 units, 3-ε units, and 6 units from Coflow 1 and Coflow 2 on Link 1 and Link 2]
Concurrent Open Shop Scheduling [1]
• Tasks on independent machines
• Examples include job scheduling and caching blocks
• Use an ordering heuristic
[Figure: the same flows on a DC Fabric connecting Ingress Ports (Machine Uplinks) 1 to 3 with Egress Ports (Machine Downlinks) 1 to 3]
1. A note on the complexity of the concurrent open shop problem, Journal of Scheduling, 9(4):389–396, 2006
Inter-Coflow Scheduling
[Figure: flows of 3 units, 3-ε units, and 6 units from Coflow 1 and Coflow 2 on Link 1 and Link 2]
Concurrent Open Shop Scheduling with Coupled Resources
• Flows on dependent links
• Consider ordering and matching constraints
Characterized COSS-CR
Proved that list scheduling might not result in optimal solution
[Figure: the same flows on a DC Fabric connecting Ingress Ports (Machine Uplinks) 1 to 3 with Egress Ports (Machine Downlinks) 1 to 3]
Varys
Employs a two-step algorithm to minimize coflow completion times
1. Ordering heuristic: Keeps an ordered list of coflows to be scheduled, preempting if needed
2. Allocation algorithm: Allocates minimum required resources to each coflow to finish
Ordering Heuristic: SEBF
[Figure: coflows C1 and C2 with flows across ports P1, P2, P3, drawn against time]
          C1    C2
Length    3     4
Width     2     3
Size      5     12
Shortest-First / Narrowest-First / Smallest-First: C1 ends at 5, C2 ends at 9
Smallest-Effective-Bottleneck-First: C2 ends at 4, C1 ends at 9
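The comparison of ordering heuristics can be sketched in code. The flow placement below is hypothetical: it is chosen only so that C1 has length 3, width 2, size 5 and C2 has length 4, width 3, size 12, and so that both coflows use one common port. Ports are assumed to have unit capacity, and `serial_completion` is a deliberately simple model in which coflows run strictly one after another.

```python
from collections import defaultdict

# Hypothetical placement: (ingress port, egress port, size).
COFLOWS = {
    "C1": [("P1", "Q1", 3), ("P2", "Q1", 2)],
    "C2": [("P3", "Q1", 4), ("P1", "Q2", 4), ("P2", "Q3", 4)],
}


def length(flows):
    return max(s for _, _, s in flows)   # largest flow


def width(flows):
    return len(flows)                    # number of parallel flows


def size(flows):
    return sum(s for _, _, s in flows)   # total units


def bottleneck(flows, capacity=1.0):
    """Time to drain the most loaded port if the coflow ran alone."""
    load = defaultdict(float)
    for src, dst, s in flows:
        load[("in", src)] += s
        load[("out", dst)] += s
    return max(load.values()) / capacity


def order_by(metric):
    return sorted(COFLOWS, key=lambda name: metric(COFLOWS[name]))


def serial_completion(order):
    """Completion times when coflows run strictly one after another."""
    t, done = 0.0, {}
    for name in order:
        t += bottleneck(COFLOWS[name])
        done[name] = t
    return done


for metric in (length, width, size, bottleneck):
    order = order_by(metric)
    print(metric.__name__, order, serial_completion(order))
```

With this placement, the length, width, and size orderings all put C1 first (C1 ends at 5, C2 at 9), while ordering by bottleneck puts C2 first (C2 ends at 4, C1 at 9), which lowers the sum of completion times from 14 to 13.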
Allocation Algorithm
A coflow cannot finish before its very last flow
Finishing flows faster than the bottleneck cannot decrease a coflow’s completion time
MADD
Ensure minimum allocation to each flow for it to finish at the desired duration; for example,
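A minimal sketch of the allocation idea, taking the desired duration to be the coflow's own bottleneck duration (one possible choice, assumed here). Each flow gets the rate size / duration, so every flow finishes together with the bottleneck and no bandwidth is spent finishing a flow early. The function name, the port names, and the unit port capacity are assumptions made for this example.

```python
from collections import defaultdict


def madd_rates(flows, capacity=1.0):
    """Give every flow just enough rate to finish with the bottleneck.

    flows: list of (ingress port, egress port, remaining size).
    Returns (gamma, rates), where gamma is the bottleneck duration.
    """
    load = defaultdict(float)
    for src, dst, s in flows:
        load[("in", src)] += s
        load[("out", dst)] += s
    gamma = max(load.values()) / capacity    # slowest port sets the pace
    rates = [s / gamma for _, _, s in flows]  # minimum rate per flow
    return gamma, rates


# Two flows of 3 and 2 units converging on one egress port (hypothetical).
gamma, rates = madd_rates([("P1", "Q1", 3), ("P2", "Q1", 2)])
print(gamma, rates)
```

In this example the shared egress port is fully used, while each ingress port keeps spare capacity (0.4 and 0.6 of a unit) that other coflows could use.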
Varys
Enables frameworks to take advantage of coflow scheduling
1. Exposes the coflow API
2. Enforces through a centralized scheduler
Evaluation
A 3000-node trace-driven simulation matched against a 100-node EC2 deployment
1. Does it improve performance?
2. Can it beat non-preemptive solutions?
YES
Faster Jobs
                         Comm. Improv.   Job Improv.
All jobs          Avg.   3.16X           1.25X
                  95th   3.84X           1.15X
Comm. Heavy [1]   Avg.   2.50X           1.85X
                  95th   2.94X           1.74X
1. 26% of jobs spend at least 50% of their duration in communication stages.
Better than Non-Preemptive Solutions
w.r.t. FIFO [1]
Avg.   5.65X
95th   7.70X
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011
What About Perpetual Starvation?
NO
Four Challenges in the Context of Multipoint-to-Multipoint Coflows
#1 Coflow Dependencies: multi-stage jobs; multi-wave stages
#2 Unknown Flow Information: pipelining between stages; task failures and restarts
#3 Decentralized Varys: master failure; low-latency analytics
#4 Theory Behind “Concurrent Open Shop Scheduling with Coupled Resources”
Varys
Greedily schedules coflows without worrying about flow-level metrics
• Consolidates network optimization of data-intensive frameworks
• Improves job performance by addressing the COSS-CR problem
• Increases predictability through informed admission control
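One way the admission-control bullet could work is sketched below. This is an assumed toy model, not the scheduler from the talk: a coflow with a deadline is admitted only if every port it touches can still spare the rate needed to finish each flow exactly at the deadline. The class name, method name, port names, and unit port capacity are all hypothetical, and reservations are never released in this sketch.

```python
from collections import defaultdict


class AdmissionController:
    """Toy admission control over unit-capacity ports (an assumed model)."""

    def __init__(self, capacity=1.0):
        self.capacity = capacity
        self.reserved = defaultdict(float)  # port -> rate already promised

    def try_admit(self, flows, deadline):
        # Rate each port needs so that every flow ends exactly at the deadline.
        need = defaultdict(float)
        for src, dst, size in flows:
            need[("in", src)] += size / deadline
            need[("out", dst)] += size / deadline
        if any(self.reserved[p] + r > self.capacity + 1e-9 for p, r in need.items()):
            return False  # admitting would break a guarantee already given
        for p, r in need.items():
            self.reserved[p] += r
        return True


ctl = AdmissionController()
first = ctl.try_admit([("P1", "Q1", 3), ("P2", "Q1", 2)], deadline=10)  # Q1 needs 0.5
second = ctl.try_admit([("P3", "Q1", 4)], deadline=5)                   # Q1 would need 1.3
third = ctl.try_admit([("P3", "Q1", 4)], deadline=8)                    # Q1 reaches 1.0
print(first, second, third)
```

Rejecting the second request up front, instead of letting it start and miss its deadline, is what makes completion times predictable for the coflows that are admitted.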
http://varys.net/
Mosharaf Chowdhury - @mosharaf