Transcript Slide 1

Orchestra: Managing Data Transfers in Computer Clusters
Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, Ion Stoica
UC Berkeley
Moving Data is Expensive
Typical MapReduce jobs at Facebook spend 33% of their running time in large data transfers
An application for training a spam classifier on Twitter data spends 40% of its time in communication
Limits Scalability
Scalability of a Netflix-like recommendation system is bottlenecked by communication
[Chart: iteration time (s) vs. number of machines (10, 30, 60, 90), broken into communication and computation]
Did not scale beyond 60 nodes
» Communication time increased faster than computation time decreased
Transfer Patterns
Transfer: the set of all flows transporting data between two stages of a job (e.g., Map to Reduce)
Patterns include shuffle, broadcast, and incast*
» A transfer acts as a barrier: the next stage cannot start until it completes
Completion time: time for the last receiver to finish
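To make the transfer abstraction concrete, here is a minimal sketch (illustrative only; the Flow and Transfer names are not from the paper) of a transfer as a set of flows whose completion time is the finish time of the last receiver:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Flow:
    # One point-to-point flow inside a transfer (illustrative).
    sender: str
    receiver: str
    bytes_total: int
    finish_time: float = 0.0   # filled in once the flow completes

@dataclass
class Transfer:
    # All flows moving data between two stages of a job.
    flows: List[Flow] = field(default_factory=list)

    def completion_time(self) -> float:
        # A transfer acts as a barrier: it is done only when the
        # slowest receiver has received all of its data.
        return max(f.finish_time for f in self.flows)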
Contributions
1. Optimize at the level of transfers instead of individual flows
2. Inter-transfer coordination
Orchestra
[Architecture diagram]
Inter-Transfer Controller (ITC): applies a cross-transfer policy (fair sharing, FIFO, or priority)
One Transfer Controller (TC) per transfer, with pluggable mechanisms:
» TC (shuffle): Hadoop shuffle or WSS
» TC (broadcast): HDFS, tree, or Cornet
In the diagram, three transfers are active: a shuffle, broadcast 1, and broadcast 2
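To illustrate the division of labor (a sketch under my own naming, not the actual Orchestra API): each TC registers its transfer with the ITC, and the ITC turns the cluster-wide policy into per-transfer weights that the TCs then enforce.

class InterTransferController:
    # Illustrative sketch: maps a cluster-wide policy to per-transfer weights.
    def __init__(self, policy="fair"):
        self.policy = policy
        self.transfers = []                  # (transfer_id, priority, arrival index)

    def register(self, transfer_id, priority=1):
        self.transfers.append((transfer_id, priority, len(self.transfers)))

    def weights(self):
        if self.policy == "fair":            # equal share to every active transfer
            return {t: 1 for t, _, _ in self.transfers}
        if self.policy == "priority":        # share proportional to priority
            return {t: p for t, p, _ in self.transfers}
        if self.policy == "fifo":            # all weight to the oldest transfer
            oldest = min(self.transfers, key=lambda x: x[2])[0]
            return {t: 1 if t == oldest else 0 for t, _, _ in self.transfers}
        raise ValueError("unknown policy")

# Example: a priority policy weighting a production shuffle over a batch broadcast.
itc = InterTransferController(policy="priority")
itc.register("shuffle-prod", priority=3)
itc.register("broadcast-batch", priority=1)
print(itc.weights())    # {'shuffle-prod': 3, 'broadcast-batch': 1}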
Outline
Cooperative broadcast (Cornet)
» Infer and utilize topology information
Weighted Shuffle Scheduling (WSS)
» Assign flow rates to optimize shuffle completion time
Inter-Transfer Controller
» Implement weighted fair sharing between transfers
End-to-end performance
Cornet: Cooperative broadcast
Broadcast the same data to every receiver
» Fast, scalable, adaptive to bandwidth, and resilient
Peer-to-peer mechanism optimized for cooperative environments
Observations → Cornet design decisions (a toy sketch of the block exchange follows)
1. High-bandwidth, low-latency network → large block size (4-16 MB)
2. No selfish or malicious peers → no need for incentives (e.g., TFT), no (un)choking; everyone stays till the end
3. Topology matters → topology-aware broadcast
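The design decisions above suggest a BitTorrent-like protocol stripped of its incentive machinery. The toy simulation below (my own sketch, not Cornet's code; it ignores bandwidth and topology) shows the core exchange: data is cut into large blocks and each receiver repeatedly fetches a missing block from any peer, or the source, that already has it, with no choking and no tit-for-tat.

import random

def cornet_round(have, all_blocks, peers, uploads):
    # One exchange round: every unfinished receiver fetches one missing
    # block from some node that already holds it (no choking, no incentives).
    for node in peers:
        missing = all_blocks - have[node]
        if not missing:
            continue                              # already has everything
        block = random.choice(sorted(missing))
        holders = [p for p in peers if block in have[p]] or ["source"]
        holder = random.choice(holders)           # any cooperative holder will do
        uploads[holder] = uploads.get(holder, 0) + 1
        have[node].add(block)

def broadcast(num_blocks, peers):
    # Blocks are large (4-16 MB in Cornet), so num_blocks stays small.
    all_blocks = set(range(num_blocks))
    have = {p: set() for p in peers}
    uploads, rounds = {}, 0
    while any(have[p] != all_blocks for p in peers):
        cornet_round(have, all_blocks, peers, uploads)
        rounds += 1
    return rounds, uploads

rounds, uploads = broadcast(num_blocks=8, peers=[f"n{i}" for i in range(10)])
print(rounds, uploads)    # upload load spreads across peers, not just the source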
Cornet performance
1 GB of data to 100 receivers on EC2
[Chart: completion times of status quo mechanisms vs. Cornet]
4.5x to 5x improvement
Topology-aware Cornet
Many data center networks employ tree topologies
Each rack should receive exactly one copy of the broadcast
» Minimize cross-rack communication
Topology information reduces cross-rack data transfer
» Fit a mixture of spherical Gaussians to infer the network topology
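The slide only names the statistical tool; as a rough sketch (assuming per-node features such as measured RTTs to a few landmark nodes, and using scikit-learn's GaussianMixture rather than whatever the authors implemented), inferring rack-like clusters with a mixture of spherical Gaussians could look like this:

import numpy as np
from sklearn.mixture import GaussianMixture

def infer_racks(node_features, num_clusters):
    # Nodes in the same rack should see similar latencies, so their
    # feature vectors fall into the same Gaussian component.
    gmm = GaussianMixture(n_components=num_clusters,
                          covariance_type="spherical",   # spherical Gaussians
                          random_state=0)
    return gmm.fit_predict(node_features)

# Hypothetical example: 6 nodes, RTTs (ms) to 2 landmark nodes,
# forming two tight groups that stand in for two racks.
rtts = np.array([[0.2, 1.1], [0.3, 1.0], [0.25, 1.05],
                 [1.2, 0.2], [1.1, 0.3], [1.15, 0.25]])
print(infer_racks(rtts, num_clusters=2))   # e.g., [0 0 0 1 1 1]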
Topology-aware Cornet
200 MB of data to 30 receivers on DETER
[Chart: completion times; 3 inferred clusters]
~2x faster than vanilla Cornet
Status quo in Shuffle
[Diagram: receivers r1 and r2; senders s1 and s2 each hold one unit for r1, s4 and s5 each hold one unit for r2, and s3 holds two units for each receiver]
With per-flow fair sharing:
» Links into r1 and r2 are full for 3 time units
» Then the link out of s3 is the only bottleneck for 2 more time units
Completion time: 5 time units
Weighted Shuffle Scheduling
Allocate rates to each flow using weighted fair sharing, where the weight of a flow between a sender-receiver pair is proportional to the total amount of data to be sent (see the sketch below)
[Diagram: same example with per-flow weights: s1→r1 = 1, s2→r1 = 1, s3→r1 = 2, s3→r2 = 2, s4→r2 = 1, s5→r2 = 1]
Completion time: 4 time units
Up to 1.5x improvement
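A minimal sketch of the WSS rule (illustrative code, not the Orchestra implementation; it allocates only over receiver links and assumes unit link capacities, which happens to be enough for this example): give each flow a share of its receiver's link proportional to the bytes it must deliver.

def wss_rates(flows, link_capacity=1.0):
    # flows: list of (sender, receiver, bytes).
    # On each receiver link, split capacity in proportion to each flow's
    # data volume (the WSS weights). Sender-side limits are not modeled.
    rates = {}
    for _, r, _ in flows:
        inbound = [f for f in flows if f[1] == r]
        total = sum(size for _, _, size in inbound)
        for s, _, size in inbound:
            rates[(s, r)] = link_capacity * size / total
    return rates

# The example from the slides: s3 has twice as much data for each receiver.
flows = [("s1", "r1", 1), ("s2", "r1", 1), ("s3", "r1", 2),
         ("s3", "r2", 2), ("s4", "r2", 1), ("s5", "r2", 1)]
rates = wss_rates(flows)
# Every flow finishes at bytes/rate = 4 time units, vs. 5 under per-flow fairness.
print(max(size / rates[(s, r)] for s, r, size in flows))   # -> 4.0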
Inter-Transfer Controller
aka Conductor
Weighted fair sharing
» Each transfer is assigned a weight
» Congested links are shared in proportion to the transfers' weights
Implementation: Weighted Flow Assignment (WFA)
» Each transfer gets a number of TCP connections proportional to its weight (see the sketch below)
» Requires no changes in the network or in end host OSes
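A sketch of the idea behind WFA (my own illustrative code and names, not Orchestra's API): given a budget of TCP connections on a congested link, hand each transfer a number of connections proportional to its weight, so that TCP's own per-flow fairness approximates the intended weighted shares.

def assign_connections(transfer_weights, total_connections):
    # Split a pool of TCP connections across transfers in proportion to
    # their weights; every active transfer keeps at least one connection.
    # (Rounding can drift from the exact total; a real system would reconcile.)
    total_weight = sum(transfer_weights.values())
    return {t: max(1, round(total_connections * w / total_weight))
            for t, w in transfer_weights.items()}

# Hypothetical: a high-priority shuffle (weight 3) vs. a background
# broadcast (weight 1), with 8 connections available on the link.
print(assign_connections({"shuffle-hi": 3, "broadcast-bg": 1}, 8))
# -> {'shuffle-hi': 6, 'broadcast-bg': 2}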
Benefits of the ITC
Shuffle using 30 nodes on EC2
Two priority classes, FIFO within each class
» Low priority transfer: 2 GB per reducer
» High priority transfers: 250 MB per reducer
[Charts: % of active flows vs. time (s), without inter-transfer scheduling and with priority scheduling in Conductor; one low priority job and three high priority jobs]
43% reduction in completion time of the high priority transfers
6% increase for the low priority transfer
End-to-end evaluation
Developed in the context of Spark, an iterative, in-memory MapReduce-like framework
Evaluated using two iterative applications developed by ML researchers at UC Berkeley
» Training a spam classifier on Twitter data
» Recommendation system for the Netflix challenge
Faster spam classification
Communication reduced from 42% to 28% of the iteration time
Overall 22% reduction in iteration time
Scalable recommendation system
[Charts (before and after Orchestra): iteration time (s) vs. number of machines (10, 30, 60, 90), broken into communication and computation]
1.9x faster at 90 nodes
Related work
DCN architectures (VL2, fat-tree, etc.)
» Mechanisms for faster networks, not policies for better sharing
Schedulers for data-intensive applications (Hadoop scheduler, Quincy, Mesos, etc.)
» Schedule CPU, memory, and disk across the cluster
Hedera
» Transfer-unaware flow scheduling
Seawall
» Performance isolation among cloud tenants
Summary
Optimize transfers instead of individual flows
» Utilize knowledge about application semantics
Coordinate transfers
» Orchestra enables policy-based transfer management
» Cornet performs up to 4.5x better than the status quo
» WSS can outperform default solutions by 1.5x
No changes in the network or in end host OSes
http://www.mosharaf.com/
BACKUP SLIDES
MapReduce logs
Week-long trace of 188,000 MapReduce jobs from a 3000-node cluster
Maximum number of concurrent transfers is in the several hundreds
[CDF chart: fraction of job lifetime spent in the shuffle phase]
33% of time spent in shuffle on average
Monarch (Oakland’11)
Real-time spam classification from 345,000 tweets with URLs
» Logistic regression
» Written in Spark
Spends 42% of the iteration time in transfers
» 30% broadcast
» 12% shuffle
100 iterations to converge
Collaborative Filtering
Netflix challenge
» Predict users' ratings for movies they haven't seen, based on their ratings for other movies
Does not scale beyond 60 nodes
[Chart: iteration time (s) vs. number of machines (10, 30, 60, 90), broken into communication and computation]
385 MB of data broadcast in each iteration
Cornet performance
1 GB of data to 100 receivers on EC2
4.5x to 6.5x improvement
Shuffle bottlenecks
A shuffle can be bottlenecked:
» At a sender
» At a receiver
» In the network
An optimal shuffle schedule must keep at least one link fully utilized throughout the transfer
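One way to make the claim concrete (a sketch under the simplifying assumption that every sender and receiver access link has the same capacity and the network core is not the bottleneck): no schedule can finish before the most heavily loaded sender or receiver link has drained, so an optimal schedule keeps that link saturated for the whole transfer.

from collections import defaultdict

def shuffle_lower_bound(flows, link_capacity=1.0):
    # Lower bound on shuffle completion time: the busiest sender or
    # receiver access link, assuming equal capacity on all edge links.
    out_bytes = defaultdict(float)   # data each sender must push
    in_bytes = defaultdict(float)    # data each receiver must pull
    for sender, receiver, size in flows:
        out_bytes[sender] += size
        in_bytes[receiver] += size
    busiest = max(list(out_bytes.values()) + list(in_bytes.values()))
    return busiest / link_capacity

# The example from the WSS slides: the bound is 4 time units, and WSS achieves it.
flows = [("s1", "r1", 1), ("s2", "r1", 1), ("s3", "r1", 2),
         ("s3", "r2", 2), ("s4", "r2", 1), ("s5", "r2", 1)]
print(shuffle_lower_bound(flows))    # -> 4.0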
Current implementations
Shuffle 1 GB to 30 reducers on EC2