Transcript [pptx]
Phurti: Application and Network-Aware Flow Scheduling for Multi-Tenant MapReduce Clusters
Chris Cai, Shayan Saeed, Indranil Gupta, Roy Campbell, Franck Le
Systems Research Group
Distributed Protocols Research Group
Outline
• Introduction
• System Architecture
• Scheduling Algorithm
• Evaluation
• Summary
Multi-tenancy in MapReduce Clusters
[Diagram: multiple users submit MapReduce jobs to a shared MapReduce cluster.]
• Better ROI, high utilization.
• How to share resources?
• Network is the primary bottleneck.
Problem Statement
How to schedule network traffic to improve completion time for MapReduce jobs?
Application-Awareness in Scheduling
[Figure: Job 2 sends 6 units on Link 1 and 2 units on Link 2; Job 1 sends 3 units on Link 2. Timelines compare three schedules.]
• Fair Sharing (such as DCTCP): Job 1 completion time = 5, Job 2 completion time = 6
• Shortest Flow First (such as PDQ): Job 1 completion time = 5, Job 2 completion time = 6
• Application-Aware: Job 1 completion time = 3, Job 2 completion time = 6
Network-Awareness in Scheduling
[Topology diagram: hosts N1–N4 attached to switches S1 and S2. Job 1 traffic (3 units) can use Path 1 and Job 2 traffic (3 units) can use Path 2, two disjoint paths.]
Network-Awareness in Scheduling
[Timelines: Network-Agnostic — Job 1 completion time = 6, Job 2 completion time = 6. Network-Aware — Job 1 completion time = 3, Job 2 completion time = 6.]
Takeaway: Do not schedule interfering flows of concurrent jobs together
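The takeaway relies on knowing when two flows interfere, i.e., when their paths share at least one link. A minimal sketch of such a check, assuming flows are described by hypothetical link-list paths (not Phurti's actual representation):

```python
def interferes(path_a, path_b):
    # Two flows interfere if their paths share at least one link.
    return not set(path_a).isdisjoint(path_b)

# Hypothetical link-level paths through switches S1 and S2:
path1 = ["N1-S1", "S1-N2"]   # Job 1 on Path 1
path2 = ["N3-S2", "S2-N4"]   # Job 2 on Path 2 (disjoint from path1)
path3 = ["N3-S1", "S1-N2"]   # shares link S1-N2 with path1

print(interferes(path1, path2))  # False: safe to schedule together
print(interferes(path1, path3))  # True: interfering, do not run concurrently
```

Disjoint flows such as path1 and path2 can run concurrently, which is exactly what the network-aware schedule above exploits.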
Related Work
• Traditional flow-scheduling
– PDQ [SIGCOMM ‘12], Hedera [NSDI ‘10]
– Only improve network-level metrics
• Application and Network-Aware Task Schedulers
– Cross-Layer Scheduling [IC2E 2015], Tetris [SIGCOMM ’14]
– Schedule tasks instead of network traffic
• Application-Aware traffic schedulers
– Baraat [SIGCOMM ‘14], Varys [SIGCOMM ’14]
– Unaware of network topology
Phurti: Contributions
• Improves Job Completion Time
• Fairness and Starvation Protection
• Scalable
• API Compatibility
• Hardware Compatibility
Phurti Framework
[Architecture diagram: Hadoop nodes N1–N6 communicate with the Phurti scheduling framework through a northbound API; Phurti programs SDN switches S1 and S2 through a southbound API.]
Phurti Algorithm – Intuition
[Figure: flows of Job 1 and Job 2 laid out on ports P1 and P2. Job 1's maximum sequential traffic is 4 units and it finishes at time 4; Job 2's maximum sequential traffic is 5 units and it finishes at time 5.]
Takeaway: Job completion time is determined by the maximum sequential traffic.
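One plausible formalization of this metric: for each endpoint port, sum the traffic it must send or receive, and take the maximum over all ports. A sketch under that assumption, with hypothetical (src, dst, size) flow tuples rather than Phurti's actual API:

```python
from collections import defaultdict

def max_sequential_traffic(flows):
    """flows: list of (src, dst, size) tuples.
    The busiest port bounds how much traffic must pass through
    a single endpoint sequentially."""
    port_traffic = defaultdict(int)
    for src, dst, size in flows:
        port_traffic[src] += size   # units the source must send
        port_traffic[dst] += size   # units the destination must receive
    return max(port_traffic.values())

# Small example with assumed unit-size flows:
j1 = [("N1", "N4", 1), ("N4", "N1", 1)]  # N1 sends 1 and receives 1
j2 = [("N2", "N3", 1)]
print(max_sequential_traffic(j1))  # 2
print(max_sequential_traffic(j2))  # 1
```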
Phurti Algorithm – Intuition (cont.)
Job 1 traffic: max. sequential traffic 4 units. Job 2 traffic: max. sequential traffic 5 units.
[Timelines: if Job 1 is scheduled first, Job 1 completion time = 4 and Job 2 completion time = 8; if Job 2 is scheduled first, Job 1 completion time = 8 and Job 2 completion time = 5.]
Observation: It is better to schedule the job with the smaller maximum sequential traffic first.
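Plugging in the completion times read off the two timelines confirms the observation numerically:

```python
# Completion times from the two timelines above.
job1_first = [4, 8]   # Job 1 scheduled first: J1 = 4, J2 = 8
job2_first = [8, 5]   # Job 2 scheduled first: J1 = 8, J2 = 5

print(sum(job1_first) / 2)  # 6.0 — lower average completion time
print(sum(job2_first) / 2)  # 6.5
```

Scheduling the job with the smaller maximum sequential traffic first reduces the average completion time without changing the total work done.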
Phurti Algorithm
• Assign priorities to jobs based on maximum sequential traffic (latency improvement).
• Let flows of the highest-priority job transfer.
• Let non-interfering flows of the lower-priority jobs transfer (throughput maximization).
• Let other lower-priority flows transfer at a small rate (starvation protection).

[Example topology: hosts N1–N4 attached to switches s1, s2, s3.]

Job  Flows          Max Seq. Traffic  Priority
J1   N1→N4, N4→N1   2                 LOW
J2   N2→N3          1                 HIGH
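The four steps can be sketched as one pass over active jobs sorted by maximum sequential traffic. This is an illustrative reconstruction, not Phurti's implementation; the `interferes` helper, the rate constants, and the job dictionary layout are all assumptions:

```python
FULL_RATE = 1.0     # fraction of link bandwidth
SMALL_RATE = 0.05   # starvation-protection trickle (assumed value)

def interferes(path_a, path_b):
    # Two flows interfere if their paths share at least one link.
    return not set(path_a).isdisjoint(path_b)

def schedule(jobs):
    """jobs: dict name -> {"mst": int, "flows": [path, ...]}.
    Returns a transmit rate per (job, flow index)."""
    # Priority: smaller maximum sequential traffic first.
    order = sorted(jobs, key=lambda j: jobs[j]["mst"])
    claimed = []   # link paths already granted the full rate
    rates = {}
    for job in order:
        for i, path in enumerate(jobs[job]["flows"]):
            if all(not interferes(path, c) for c in claimed):
                # Highest-priority job, or a non-interfering flow of a
                # lower-priority job: transfer at full rate.
                rates[(job, i)] = FULL_RATE
                claimed.append(path)
            else:
                # Interfering lower-priority flow: small rate only.
                rates[(job, i)] = SMALL_RATE
    return rates

# Two jobs whose flows collide on link s1-N4 (hypothetical paths):
jobs = {
    "J1": {"mst": 2, "flows": [["N1-s1", "s1-N4"]]},
    "J2": {"mst": 1, "flows": [["N2-s1", "s1-N4"]]},
}
print(schedule(jobs))
```

J2 has the smaller maximum sequential traffic, so its flow claims the shared link at full rate; J1's interfering flow is throttled to the small starvation-protection rate rather than blocked outright.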
Evaluation
• Baseline: Fair Sharing (default in MapReduce)
• Testbed: 6 nodes, 2 SDN switches
• SWIM workload: generated from a Facebook Hadoop trace

Job Size Bin  % of total jobs  % of total bytes in shuffled data
Small         62%              5.5%
Medium        16%              10.3%
Large         22%              84.2%
Job Completion Time
[CDF of the per-job difference in completion time (sec), Phurti minus baseline, ranging from about -800 to 200.]
Negative values mean Phurti performs better. 95% of jobs have better job completion time under Phurti.
Job Completion Time
[Bar chart: fractional improvement (average and 95th percentile) for job types Overall, Small, Medium, and Large.]
The 13% improvement in 95th-percentile job completion time shows starvation protection. Improvement is much larger for smaller jobs, since they typically have higher priority.
Flow Scheduling Overhead
Simulated on a fat-tree topology with 128 hosts.
[Plot: scheduling time (milliseconds) vs. number of simultaneous flow arrivals, 20 to 100.]
Even in the unlikely event of 100 simultaneous incoming flows, the scheduling time is 4.5 ms, a negligible overhead.
Flow Scheduling Overhead
Scheduling time for a new flow with 10 ongoing flows in the network.
Scheduling overhead grows much more slowly than linearly, showing that Phurti scales with an increasing number of hosts.
Phurti vs. Varys
Simulated on a 128-host fat-tree topology with the core network having 1x, 5x, and 10x the capacity of the access links.
[CDF of the difference in shuffle completion time (sec), Phurti minus Varys, ranging from about -120 to 20.]
Phurti is better than Varys in every case, and outperforms it significantly when the core network has much less capacity (oversubscribed).
Phurti: Contributions
• Improves completion time for 95% of jobs; decreases average completion time by 20% across all jobs.
• Fairness and starvation protection: improves tail job completion time by 13%.
• Scalable: shown to scale to 1024 hosts and 100 simultaneous flow arrivals.
• API Compatibility
• Hardware Compatibility
BACKUP SLIDES
Effective Transmit Rate
[CDF of per-job effective transmit rate, 0 to 1.]
80% of jobs have an effective transmit rate larger than 0.9, showing minimal throttling.