Transcript [pptx]
Slide 1: Phurti: Application- and Network-Aware Flow Scheduling for Multi-Tenant MapReduce Clusters
Chris Cai, Shayan Saeed, Indranil Gupta, Roy Campbell, Franck Le
Systems Research Group / Distributed Protocols Research Group

Slide 2: Outline
• Introduction
• System Architecture
• Scheduling Algorithm
• Evaluation
• Summary

Slide 3: Multi-tenancy in MapReduce Clusters
• Many users submit MapReduce jobs to a shared MapReduce cluster.
• Better ROI, high utilization.
• How to share resources?
• The network is the primary bottleneck.

Slide 4: Problem Statement
How to schedule network traffic to improve completion time for MapReduce jobs?

Slide 5: Application-Awareness in Scheduling
[Figure: Job 1 traffic and Job 2 traffic (6, 2, and 3 units) spread across Link 1 and Link 2; timelines compare three scheduling policies.]
• Fair Sharing (such as DCTCP): Job 1 completion time = 5, Job 2 completion time = 6.
• Shortest Flow First (such as PDQ): Job 1 completion time = 5, Job 2 completion time = 6.
• Application-Aware: Job 1 completion time = 3, Job 2 completion time = 6.

Slide 6: Network-Awareness in Scheduling
[Figure: nodes N1–N4 connected through switches S1 and S2; Job 1 traffic (3 units) uses Path 1 and Job 2 traffic (3 units) uses Path 2.]

Slide 7: Network-Awareness in Scheduling (cont.)
• Network-Agnostic: Job 1 completion time = 6, Job 2 completion time = 6.
• Network-Aware: Job 1 completion time = 3, Job 2 completion time = 6.
Takeaway: do not schedule interfering flows of concurrent jobs together.

Slide 8: Related Work
• Traditional flow scheduling: PDQ [SIGCOMM '12], Hedera [NSDI '10]. These only improve network-level metrics.
• Application- and network-aware task schedulers: Cross-Layer Scheduling [IC2E '15], Tetris [SIGCOMM '14]. These schedule tasks instead of network traffic.
• Application-aware traffic schedulers: Baraat [SIGCOMM '14], Varys [SIGCOMM '14]. These are unaware of the network topology.

Slide 9: Phurti: Contributions
• Improves job completion time
• Fairness and starvation protection
• Scalable
• API compatibility
• Hardware compatibility
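The interference takeaway from the network-awareness slides can be sketched in a few lines. This is a minimal illustration, not Phurti's implementation: the helper names `links` and `interferes` are mine, and the concrete routes assume (as the figure's takeaway implies) that Path 1 and Path 2 both traverse a shared S1-S2 core link.

```python
def links(path):
    # Decompose a node-level path into the set of directed links it uses.
    return {(path[i], path[i + 1]) for i in range(len(path) - 1)}

def interferes(path_a, path_b):
    # Two flows interfere when their paths share at least one link; running
    # them concurrently delays both jobs, as in the network-agnostic timeline.
    return bool(links(path_a) & links(path_b))

# Hypothetical routes matching the slide topology:
path1 = ["N1", "S1", "S2", "N4"]  # Job 1 traffic, Path 1
path2 = ["N2", "S1", "S2", "N3"]  # Job 2 traffic, Path 2
print(interferes(path1, path2))   # True: better to run these jobs one after another
```

With the shared S1-S2 link, serializing the two jobs lets Job 1 finish at time 3 while Job 2 still finishes at time 6, matching the network-aware timeline on the slide.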
Slide 11: Phurti Framework
• Hadoop nodes N1–N6 talk to the Phurti scheduling framework through a northbound API.
• SDN switches S1 and S2 are controlled through a southbound API.

Slide 13: Phurti Algorithm – Intuition
[Figure: Job 1 flows and Job 2 flows spread across paths P1 and P2.]
• Job 1: maximum sequential traffic = 4 units; completion time = 4.
• Job 2: maximum sequential traffic = 5 units; completion time = 5.
Takeaway: job completion time is determined by the maximum sequential traffic.

Slide 14: Phurti Algorithm – Intuition (cont.)
• Job 1 maximum sequential traffic: 4 units. Job 2 maximum sequential traffic: 5 units.
• If Job 1 is scheduled first: Job 1 completion time = 4, Job 2 completion time = 8.
• If Job 2 is scheduled first: Job 1 completion time = 8, Job 2 completion time = 5.
Observation: it is better to schedule the job with the smaller maximum sequential traffic first.

Slide 15: Phurti Algorithm
• Assign priorities to jobs based on maximum sequential traffic.
• Let the flows of the highest-priority job transfer (latency improvement).
• Let non-interfering flows of lower-priority jobs transfer (throughput maximization).
• Let the other lower-priority flows transfer at a small rate (starvation protection).
[Example: on a topology with nodes N1–N4 and switches s1–s3, job J1 (flows N1→N4 and N4→N1, maximum sequential traffic 2) gets LOW priority; job J2 (flow N2→N3, maximum sequential traffic 1) gets HIGH priority.]

Slide 16: Evaluation
• Baseline: fair sharing (the default in MapReduce).
• Testbed: 6 nodes, 2 SDN switches.
• SWIM workload: a workload generated from a Facebook Hadoop trace.

Job Size Bin   % of total jobs   % of total bytes in shuffled data
Small          62%               5.5%
Medium         16%               10.3%
Large          22%               84.2%

Slide 17: Job Completion Time
• Negative values mean Phurti performs better.
• 95% of jobs have a better job completion time under Phurti.
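The priority and rate rules on the Phurti algorithm slide can be sketched as follows. This is a simplified sketch, not the paper's implementation: interference is checked at path granularity rather than per link, the flow sizes in the example are made up for illustration, and all function names are mine.

```python
from collections import defaultdict

def max_sequential_traffic(flows):
    # flows: iterable of (path, size) pairs for one job. Flows that share a
    # path must be sent one after another, so the job's completion time is
    # lower-bounded by its heaviest per-path byte total.
    per_path = defaultdict(int)
    for path, size in flows:
        per_path[path] += size
    return max(per_path.values()) if per_path else 0

def prioritize(jobs):
    # jobs: {job_id: flows}. The job with the smaller maximum sequential
    # traffic comes first in the returned order (higher priority).
    return sorted(jobs, key=lambda j: max_sequential_traffic(jobs[j]))

def schedule(jobs, small_rate=0.05):
    # Assign a rate share to each (job, path): full rate for the
    # highest-priority job, full rate for non-interfering flows of
    # lower-priority jobs, and a small trickle rate for the rest
    # (starvation protection).
    order = prioritize(jobs)
    busy, rates = set(), {}
    for job in order:
        paths = {path for path, _ in jobs[job]}
        for path in paths:
            rates[(job, path)] = 1.0 if path not in busy else small_rate
        busy |= paths
    return rates

# Illustrative workload (sizes not taken from the paper):
jobs = {
    "J1": [("P1", 1), ("P1", 3), ("P2", 2), ("P2", 2)],  # max seq. traffic 4
    "J2": [("P1", 5), ("P3", 2)],                        # max seq. traffic 5
}
print(prioritize(jobs))  # ['J1', 'J2']: J1 has the smaller max sequential traffic
```

Here `schedule` gives J1 full rate on P1 and P2, throttles J2 on the contested path P1, and still lets J2's non-interfering flow on P3 run at full rate, mirroring the three rules on the slide.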
[CDF figure: difference in job completion time (sec).]

Slide 18: Job Completion Time Fractional Improvement
• 13% improvement in the 95th-percentile job completion time, showing starvation protection.
• The improvement is much better for smaller jobs, since they typically have higher priority.
[Bar chart: average and 95th-percentile fractional improvement for Overall, Small, Medium, and Large job types.]

Slide 19: Flow Scheduling Overhead
• Simulated a fat-tree topology with 128 hosts.
• Even in the unlikely event of 100 simultaneous incoming flows, the scheduling time is 4.5 ms, a negligible overhead.
[Figure: scheduling time (milliseconds) vs. number of simultaneous flow arrivals.]

Slide 20: Flow Scheduling Overhead (cont.)
• Scheduling time for a new flow with 10 ongoing flows in the network.
• Scheduling overhead grows much more slowly than linearly, showing that Phurti scales with an increasing number of hosts.

Slide 21: Phurti vs. Varys
• Simulated a 128-host fat-tree topology with the core network having 1x, 5x, and 10x the capacity of the access links.
• Phurti outperforms Varys significantly when the core network has much less capacity (oversubscribed).
• Phurti is better than Varys in every case.
[CDF figure: difference in shuffle completion time (sec).]

Slide 22: Phurti: Contributions
• Improves completion time for 95% of the jobs; decreases the average completion time by 20% across all jobs.
• Fairness and starvation protection: improves tail job completion time by 13%.
• Scalable: shown to scale to 1024 hosts and 100 simultaneous flow arrivals.
• API compatibility.
• Hardware compatibility.

Slide 23: BACKUP SLIDES

Slide 24: Effective Transmit Rate
• 80% of jobs have an effective transmit rate larger than 0.9, showing minimal throttling.
[CDF figure: effective transmit rate.]