Coflow
Mending the Application-Network Gap
in Big Data Analytics
Mosharaf Chowdhury
UC Berkeley
Big Data
The volume of data businesses want to make sense of is increasing
• Increasing variety of sources: web, mobile, wearables, vehicles, scientific, …
• Cheaper disks, SSDs, and memory
• Stalling processor speeds
Big Datacenters for Massive Parallelism
[Timeline, 2005–2015: MapReduce, Hadoop, Dryad, DryadLINQ, Hive, Pregel, Dremel, Spark, GraphLab, Storm, BlinkDB, Spark-Streaming, GraphX, …]
Data-Parallel Applications
Multi-stage dataflow
• Computation interleaved with communication
Computation stage (e.g., Map, Reduce)
• Distributed across many machines
• Tasks run in parallel
Communication stage (e.g., Shuffle)
• Between successive computation stages
A communication stage cannot complete until all of its data have been transferred.
[Figure: a job with a Map stage, a Shuffle, and a Reduce stage]
Communication is Crucial for Performance
Facebook jobs spend ~25% of their runtime, on average, in intermediate communication.¹
As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck.
1. Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.
Flow
• Transfers data from a source to a destination
• Independent unit of allocation, sharing, load balancing, and/or prioritization
Faster Communication Stages: The Networking Approach
“Configuration should be handled at the system level”
Existing Solutions
[Timeline, 1980s–2015: per-flow fairness (WFQ, GPS, CSFQ, RED, ECN, XCP, RCP, DCTCP) and flow completion time (D3, DeTail, PDQ, D2TCP, pFabric, FCP)]
Independent flows cannot capture the collective communication behavior common in data-parallel applications.
Why Do They Fall Short?
[Figure: a shuffle with senders s1–s3 and receivers r1–r2 over a datacenter network with three input links and three output links]
Per-Flow Fair Sharing
[Figure: per-flow fair sharing on the links to r1 and r2; shuffle completion time = 5, average flow completion time = 3.66]
Solutions focusing on flow completion time cannot further decrease the shuffle completion time.
Improve Application-Level Performance¹
[Figure: the same shuffle under per-flow fair sharing vs. data-proportional allocation]
Per-Flow Fair Sharing: shuffle completion time = 5, average flow completion time = 3.66
Data-Proportional Allocation: slow down faster flows to accelerate slower flows; every flow finishes at time 4, so shuffle completion time = 4 and average flow completion time = 4
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
Faster Communication Stages: The Systems Approach
“Configuration should be handled by the end users”
Applications know their performance goals, but they have no means to let the network know.
Faster Communication Stages: Systems Approach vs. Networking Approach
MIND THE GAP
Holistic Approach
“Configuration should be handled by the end users” + “Configuration should be handled at the system level”
Applications and the network working together
Coflow
A communication abstraction for data-parallel applications to express their performance goals:
1. Minimize completion times,
2. Meet deadlines, or
3. Perform fair allocation.
[Coflow structures: single flow, parallel flows, shuffle, broadcast, aggregation, all-to-all]
How to schedule coflows online …
#1 … for faster completion of coflows?
#2 … to meet more deadlines?
#3 … for fair allocation?
[Figure: coflows arriving online at a datacenter fabric with N input and N output ports]
Varys¹
Enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
1. Efficient Coflow Scheduling with Varys, SIGCOMM’2014.
Coflow
A communication abstraction for data-parallel applications to express their performance goals. It captures:
1. The size of each flow,
2. The total number of flows, and
3. The endpoints of individual flows.
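To make the abstraction concrete, here is a minimal sketch of the information a coflow carries, written as plain Scala; the Flow/Coflow names and fields are illustrative assumptions, not the actual Varys data structures.

```scala
// A minimal, illustrative model of a coflow: each flow knows its endpoints and size,
// and the coflow groups the flows so a scheduler can reason about them collectively.
case class Flow(src: Int, dst: Int, bytes: Long)

case class Coflow(id: String, flows: Seq[Flow]) {
  def numFlows: Int    = flows.size                 // the total number of flows
  def totalBytes: Long = flows.map(_.bytes).sum     // aggregate size across all flows
}
```

A scheduler that sees this whole structure can optimize for the coflow's completion rather than for each flow in isolation.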
Benefits of Inter-Coflow Scheduling
[Example: two coflows over two unit-capacity links. Coflow 1 has one 3-unit flow on Link 1; Coflow 2 has a 2-unit flow on Link 1 and a 6-unit flow on Link 2]
Fair Sharing: Coflow 1 comp. time = 5, Coflow 2 comp. time = 6
Smallest-Flow First¹,²: Coflow 1 comp. time = 5, Coflow 2 comp. time = 6
The Optimal: Coflow 1 comp. time = 3, Coflow 2 comp. time = 6
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
Inter-Coflow Scheduling is NP-Hard
[Same example as above, comparing Fair Sharing, Flow-level Prioritization¹,², and The Optimal]
Concurrent Open Shop Scheduling³
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
3. A Note on the Complexity of the Concurrent Open Shop Problem, Journal of Scheduling, 9(4):389–396, 2006.
Inter-Coflow Scheduling
Concurrent Open Shop Scheduling with Coupled Resources
• Examples include job scheduling and caching blocks
• Solutions use an ordering heuristic
• Must also consider matching constraints between input and output links
[Figure: the same two coflows mapped onto a 3×3 datacenter fabric with coupled input and output links]
Varys
Employs a two-step algorithm to minimize coflow completion times
1. Ordering heuristic: keep an ordered list of coflows to be scheduled, preempting if needed
2. Allocation algorithm: allocate the minimum required resources to each coflow to finish it in minimum time
Ordering Heuristic
[Example: three coflows C1, C2, and C3 sharing a 3×3 datacenter fabric]
Candidate orderings and the resulting total coflow completion time (CCT):
• Shortest-First, by length (C1 = 3, C2 = 5, C3 = 6): C1 ends at 3, C2 at 13, C3 at 19; total CCT = 35
• Narrowest-First, by width (C1 = 3, C2 = 2, C3 = 1): C3 ends at 6, C2 at 16, C1 at 19; total CCT = 41
• Smallest-First, by total size (C1 = 9, C2 = 10, C3 = 6): C3 ends at 6, C1 at 9, C2 at 19; total CCT = 34
• Smallest-Bottleneck, by bottleneck size (C1 = 3, C2 = 10, C3 = 6): C1 ends at 3, C3 at 9, C2 at 19; total CCT = 31
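The winning heuristic above orders coflows by their bottleneck, the most heavily loaded port any of their flows must cross. Below is a minimal sketch of that ordering step, assuming unit-capacity ports and the same illustrative Flow/Coflow types as before; it is not the Varys code.

```scala
// Smallest-Bottleneck-First ordering: a rough sketch, assuming every port has the
// same (unit) bandwidth and each coflow has at least one flow.
case class Flow(src: Int, dst: Int, bytes: Long)
case class Coflow(id: String, flows: Seq[Flow])

// A coflow's bottleneck: the largest number of bytes it must push through
// any single input or output port.
def bottleneck(c: Coflow): Long = {
  val perSrcPort = c.flows.groupBy(_.src).values.map(_.map(_.bytes).sum)
  val perDstPort = c.flows.groupBy(_.dst).values.map(_.map(_.bytes).sum)
  (perSrcPort ++ perDstPort).max
}

// Schedule coflows in non-decreasing order of their bottlenecks,
// preempting coflows that appear later in this order when needed.
def smallestBottleneckFirst(coflows: Seq[Coflow]): Seq[Coflow] =
  coflows.sortBy(bottleneck)
```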
Allocation Algorithm
• A coflow cannot finish before its very last flow
• Finishing flows faster than the bottleneck cannot decrease a coflow’s completion time
• Therefore, allocate the minimum flow rates such that all flows of a coflow finish together, on time
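One way to realize this rule is data-proportional rate allocation: compute the coflow's bottleneck completion time and give every flow exactly the rate it needs to finish at that moment. A minimal sketch, assuming known per-port capacities and one coflow being allocated at a time; illustrative only, not the Varys implementation.

```scala
// Data-proportional rate allocation sketch.
case class Flow(src: Int, dst: Int, bytes: Long)
case class Coflow(id: String, flows: Seq[Flow])

// The earliest the coflow can finish: its most loaded (bottleneck) port.
def minCompletionTime(c: Coflow, inCap: Int => Double, outCap: Int => Double): Double = {
  val inTimes  = c.flows.groupBy(_.src).map { case (p, fs) => fs.map(_.bytes).sum / inCap(p) }
  val outTimes = c.flows.groupBy(_.dst).map { case (p, fs) => fs.map(_.bytes).sum / outCap(p) }
  (inTimes ++ outTimes).max
}

// Give each flow just enough rate to finish exactly at the bottleneck time,
// so no bandwidth is wasted finishing any flow earlier than the coflow itself.
def rates(c: Coflow, inCap: Int => Double, outCap: Int => Double): Map[Flow, Double] = {
  val t = minCompletionTime(c, inCap, outCap)
  c.flows.map(f => f -> (f.bytes / t)).toMap
}
```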
Varys
Enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
The Need for Coordination
[Example: two coflows C1 and C2 across three output links; bottlenecks: C1 = 4, C2 = 5]
Scheduling with coordination: C1 ends at 4, C2 ends at 9; total CCT = 13
Scheduling without coordination: the two coflows end at 7 and 12; total CCT = 19
Uncoordinated local decisions interleave coflows, hurting performance.
Varys Architecture
Centralized master-slave architecture
• Applications use a client library to communicate with the master
• Actual timing and rates are determined by the coflow scheduler
[Figure: the Varys master (topology monitor, usage estimator, coflow scheduler) coordinates Varys daemons on each machine; sender, receiver, and driver tasks call the Varys client library (put, get, reg) on top of the network interface and the (distributed) file system]
1. Download from http://varys.net
Varys
Enables coflows in data-intensive clusters
1. Coflow Scheduler: faster, application-aware data transfers throughout the network
2. Global Coordination: consistent calculation and enforcement of scheduler decisions
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users
The Coflow API
1. NO changes to user jobs
2. NO storage management
Operations: register, put, get, unregister

@driver
  b ← register(BROADCAST)
  s ← register(SHUFFLE)
  b.put(content) → id
  …
  b.unregister()
  s.unregister()

@mapper
  b.get(id)
  …
  s.put(content) → ids1

@reducer
  s.get(ids1)
  …

[Figure: the driver (JobTracker), mappers, and reducers coordinate a broadcast coflow b and a shuffle coflow s through these calls]
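For a sense of how these four calls fit into a job, here is a hedged Scala sketch of a driver, a mapper, and a reducer written against an assumed CoflowClient/CoflowHandle interface that mirrors the slide's pseudocode; it is not the actual Varys client library.

```scala
// Assumed interfaces for illustration only.
trait CoflowHandle {
  def put(content: Array[Byte]): String      // returns an id that receivers later pass to get()
  def get(id: String): Array[Byte]
  def unregister(): Unit
}
trait CoflowClient {
  def register(kind: String): CoflowHandle   // kind: "BROADCAST", "SHUFFLE", ...
}

// @driver: register one broadcast coflow and one shuffle coflow, put the broadcast content.
def driver(client: CoflowClient, jobConf: Array[Byte]): (CoflowHandle, CoflowHandle, String) = {
  val b = client.register("BROADCAST")
  val s = client.register("SHUFFLE")
  val confId = b.put(jobConf)                // mappers will b.get(confId)
  (b, s, confId)                             // call b.unregister() and s.unregister() when the job ends
}

// @mapper: fetch the broadcast, do the map work, contribute this task's partition to the shuffle.
def mapper(b: CoflowHandle, s: CoflowHandle, confId: String): String = {
  val conf = b.get(confId)
  val mapOutput = conf.reverse               // stand-in for real map work
  s.put(mapOutput)                           // returns the id reducers will fetch
}

// @reducer: fetch the shuffle partitions it needs by id.
def reducer(s: CoflowHandle, ids: Seq[String]): Seq[Array[Byte]] =
  ids.map(s.get)
```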
Evaluation
A 3000-machine trace-driven simulation matched against a 100-machine EC2 deployment
1. Does it improve performance?
2. Can it beat non-preemptive solutions?
3. Do we really need coordination?
YES

Better than Per-Flow Fairness

      Comm. Improv.             Job Improv.
      All jobs   Comm.-heavy    All jobs   Comm.-heavy
EC2   1.85X      3.16X          1.25X      2.50X
Sim.  3.21X      4.86X          1.11X      3.39X
Preemption is Necessary [Sim.]
What about starvation? NO.
[Bar chart, overhead over Varys: Varys = 1.00; Varys NC, Per-Flow Fairness, FIFO¹, Per-Flow Prioritization, and FIFO-LM incur overheads ranging from 1.10 to 22.07]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011
Lack of Coordination Hurts [Sim.]
[Same bar chart: overhead over Varys for Varys NC, Per-Flow Fairness, FIFO¹, Per-Flow Prioritization²,³, and FIFO-LM⁴]
Smallest-flow-first (per-flow prioritization)²,³
• Minimizes flow completion time
FIFO-LM⁴ performs decentralized coflow scheduling
• Suffers due to local decisions
• Works well for small, similar coflows
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM’2014
Coflow
A communication abstraction for data-parallel applications to express their performance goals. It captures:
1. The size of each flow,
2. The total number of flows, and
3. The endpoints of individual flows.
In practice, these are often unknown in advance because of:
• Pipelining between stages
• Speculative executions
• Task failures and restarts
How to Perform Coflow Scheduling Without Complete Knowledge?
Implications: Minimize Avg. Comp. Time
• Flows in a single link, with complete knowledge: Smallest-Flow-First
• Flows in a single link, without complete knowledge: Least-Attained Service (LAS)
• Coflows in an entire datacenter, with complete knowledge: Ordering by Bottleneck Size + Data-Proportional Rate Allocation
• Coflows in an entire datacenter, without complete knowledge: ?
Revisiting Ordering Heuristics
[Same example as before: Shortest-First (total CCT = 35), Narrowest-First (41), Smallest-First (34), and Smallest-Bottleneck (31)]
✖ Smallest-Bottleneck, the best heuristic, requires complete knowledge of flow sizes, which is unavailable here.
Coflow-Aware LAS (CLAS)
Set a priority that decreases with how much a coflow has already sent
• The more a coflow has sent, the lower its priority
• Smaller coflows finish faster
Use the total size of each coflow (across all its flows) to set priorities
• Avoids the drawbacks of full decentralization
Coflow-Aware LAS (CLAS)
Continuous priorities reduce to fair sharing when similar coflows coexist
• Priority oscillation
• Coflow 1 comp. time = 6, Coflow 2 comp. time = 6
FIFO works well for similar coflows
• Coflow 1 comp. time = 3, Coflow 2 comp. time = 6
Discretized Coflow-Aware LAS (DCLAS)
Priority discretization
• Change priority when total size exceeds predefined thresholds
Scheduling policies
• FIFO within the same queue
• Prioritization across queues
Weighted sharing across queues
• Guarantees starvation avoidance
[Figure: K FIFO queues, from the highest-priority queue Q1 to the lowest-priority queue QK]
How to Discretize Priorities?
Exponentially spaced thresholds
• K: number of queues
• A: threshold constant
• E: threshold exponent
Q1 (highest priority) holds coflows that have sent [0, E¹A) bytes; Q2 holds [E¹A, E²A); …; QK (lowest priority) holds [E^(K-1)A, ∞).
Loose coordination suffices to calculate global coflow sizes
• Slaves make independent decisions in between
Small coflows (smaller than E¹A) do not experience coordination overheads!
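To make the thresholding concrete, the sketch below maps a coflow's total bytes sent to its queue, using the slide's K, A, and E notation; the function and the example parameter values are illustrative assumptions, not Aalo's defaults.

```scala
// Priority discretization sketch following the slide's notation: Q1 (highest priority)
// holds coflows that have sent less than E*A bytes, each later queue Qq holds
// [E^(q-1)*A, E^q*A), and QK (lowest priority) holds everything from E^(K-1)*A upward.
def queueOf(bytesSent: Long, K: Int, A: Double, E: Double): Int =
  (1 until K).find(q => bytesSent < math.pow(E, q) * A).getOrElse(K)

// Example: with K = 10 queues, A = 10 MB, and E = 10, a coflow that has sent 5 MB
// stays in Q1 (so it never waits for coordination), while one that has sent 500 MB
// falls into Q2.
val mb = 1024L * 1024
val q1 = queueOf(bytesSent = 5 * mb,   K = 10, A = 10.0 * mb, E = 10.0)  // 1
val q2 = queueOf(bytesSent = 500 * mb, K = 10, A = 10.0 * mb, E = 10.0)  // 2
```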
Closely Approximates Varys [Sim. & EC2]
[Same bar chart, overhead over Varys: Varys = 1.00 and Varys without complete knowledge ≈ 1.10, while Per-Flow Fairness, FIFO¹, Per-Flow Prioritization²,³, and FIFO-LM⁴ range up to 22.07]
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM’2014
My Contributions
Network-Aware Applications
• Spark (NSDI’12): Top-Level Apache Project
• Sinbad (SIGCOMM’13): Merged at Facebook
Application-Aware Network Scheduling
• Orchestra (SIGCOMM’11): Merged with Spark
• Varys / Coflow (SIGCOMM’14): Open-Source
• Aalo (SIGCOMM’15): Open-Source
Datacenter Resource Allocation
• FairCloud (SIGCOMM’12): @HP
• HARP (SIGCOMM’12): @Microsoft Bing
• ViNEYard (ToN’12): Open-Source
Communication-First Big Data Systems
In-Datacenter Analytics
• Cheaper SSDs and DRAM, the proliferation of optical networks, and resource disaggregation will make the network the primary bottleneck
Inter-Datacenter Analytics
• Bandwidth-constrained wide-area networks
End User Delivery
• Faster, more responsive delivery of analytics results over the Internet for a better end user experience
Systems | MIND THE GAP | Networking
Better capture application-level performance goals using coflows
Coflows improve application-level performance and usability
• Extends the networking and scheduling literature
Coordination, even if not free, is worth paying for in many cases
[email protected]
http://mosharaf.com
Improve Flow Completion Times
[Figure: the same shuffle under Per-Flow Fair Sharing vs. Smallest-Flow First¹,²]
Per-Flow Fair Sharing: shuffle completion time = 5, average flow completion time = 3.66
Smallest-Flow First: shuffle completion time = 6, average flow completion time = 2.66
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013.
Distributions of Coflow Characteristics
[CDFs (fraction of coflows) of coflow length (bytes), coflow size (bytes), coflow width (number of flows), and coflow bottleneck size (bytes)]
Traffic Sources
1. Ingest and replicate new data
2. Read input from remote machines, when needed
3. Transfer intermediate data
4. Write and replicate output
[Pie chart: percentage of traffic by category at Facebook: 46%, 30%, 14%, and 10%]
Distribution of Shuffle Durations
Facebook jobs spend ~25% of runtime on average in intermediate communication.
[CDF: fraction of jobs vs. fraction of runtime spent in shuffle]
Month-long trace from a 3000-machine production MapReduce cluster at Facebook: 320,000 jobs, 150 million tasks
Theoretical Results
Structure of optimal schedules
• Permutation schedules might not always lead to the optimal solution
Approximation ratio of COSS-CR
• Polynomial-time algorithm with a constant approximation ratio (64/3)¹
The need for coordination
• Fully decentralized schedulers can perform arbitrarily worse than the optimal
1. Due to Zhen Qiu, Cliff Stein, and Yuan Zhong from the Department of Industrial Engineering and Operations Research, Columbia University.
The Coflow API
1. NO changes to user jobs
2. NO storage management
Operations: register, put, get, unregister

@driver
  b ← register(BROADCAST, numFlows)
  s ← register(SHUFFLE, numFlows, {b, …})
  b.put(content, size) → id
  …
  b.unregister()
  s.unregister()

@mapper
  b.get(id)
  …
  s.put(content, size) → ids1

@reducer
  s.get(ids1)
  …

[Figure: the driver (JobTracker), mappers, and reducers coordinate a broadcast coflow b and a shuffle coflow s, now also declaring the number of flows and per-flow sizes]
Varys
Employs a two-step algorithm to support coflow deadlines
1. Admission control: do not admit any coflow that cannot be completed within its deadline without violating existing deadlines
2. Allocation algorithm: allocate the minimum required resources to each coflow to finish it exactly at its deadline
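As a rough illustration of the admission test, the sketch below checks, port by port, whether the capacity left after honoring the rates already promised to admitted coflows can carry the new coflow's bytes by its deadline. The types and the reservation model are simplifying assumptions, not the Varys implementation.

```scala
// Deadline admission control: a simplified sketch. Assumes each admitted coflow holds a
// constant reserved rate on every port it uses until its own deadline, and admits a new
// coflow only if the leftover capacity on each of its ports can deliver its bytes in time.
case class Reservation(port: Int, rate: Double, until: Double)   // reserved bytes/sec until time `until`
case class Demand(port: Int, bytes: Double)                      // bytes the new coflow sends through `port`

def admissible(demands: Seq[Demand],
               deadline: Double,                 // seconds from now
               capacity: Int => Double,          // port capacity in bytes/sec
               existing: Seq[Reservation]): Boolean =
  demands.forall { d =>
    // Byte-capacity of this port up to the new deadline, minus bytes already promised
    // to previously admitted coflows within that window.
    val promised = existing.filter(_.port == d.port)
                           .map(r => r.rate * math.min(r.until, deadline))
                           .sum
    val leftover = capacity(d.port) * deadline - promised
    leftover >= d.bytes                          // reject otherwise, so existing deadlines stay safe
  }
```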
More Predictable
[Bar charts for the EC2 deployment and the Facebook trace simulation: percentage of coflows that met their deadline, were not admitted, or missed their deadline under Varys, per-flow fairness (Fair), and EDF¹ (Earliest-Deadline First)]
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM’2012
Optimizing Communication Performance: The Systems Approach
“Let users figure it out”
Number of communication-related parameters*:
• Spark-v1.1.1: 6
• Hadoop-v1.2.1: 10
• YARN-v2.6.0: 20
*Lower bound. Does not include many parameters that can indirectly impact communication (e.g., number of reducers). Also excludes control-plane communication/RPC parameters.
Experimental Methodology
Varys deployment in EC2
• 100 m2.4xlarge machines
• Each machine has 8 CPU cores, 68.4 GB memory, and 1 Gbps NIC
• ~900 Mbps/machine during all-to-all communication
Trace-driven simulation
• Detailed replay of a day-long Facebook trace (circa October 2010)
• 3000-machine, 150-rack cluster with 10:1 oversubscription