Coflow: Mending the Application-Network Gap in Big Data Analytics
Mosharaf Chowdhury, UC Berkeley

Big Data
The volume of data businesses want to make sense of is increasing.
• Increasing variety of sources: web, mobile, wearables, vehicles, scientific, …
• Cheaper disks, SSDs, and memory
• Stalling processor speeds

Big Datacenters for Massive Parallelism
A decade of data-parallel systems, 2005 to 2015: MapReduce, Hadoop, Dryad, DryadLINQ, Hive, Pregel, Dremel, Spark, Storm, GraphLab, BlinkDB, Spark Streaming, GraphX, …

Data-Parallel Applications
Multi-stage dataflow: computation interleaved with communication.
• Computation stage (e.g., Map, Reduce): distributed across many machines; tasks run in parallel.
• Communication stage (e.g., Shuffle): sits between successive computation stages.
A communication stage cannot complete until all of its data have been transferred.

Communication is Crucial for Performance
Facebook jobs spend ~25% of their runtime on average in intermediate communication.¹
As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck.
1. Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.

Flow
• Transfers data from a source to a destination.
• The independent unit of allocation, sharing, load balancing, and/or prioritization.

Faster Communication Stages: Networking Approach
"Configuration should be handled at the system level."
Existing solutions span three decades: GPS and WFQ (1980s); RED, ECN, and CSFQ (1990s); XCP and RCP (2000s); DCTCP, D3, D2TCP, PDQ, DeTail, pFabric, and FCP (2010–2015). The goal has shifted from per-flow fairness to minimizing flow completion time.
Independent flows, however, cannot capture the collective communication behavior common in data-parallel applications.

Why Do They Fall Short?
Consider a shuffle with three senders (s1, s2, s3) and two receivers (r1, r2) across a datacenter network with three input links and three output links.
Under per-flow fair sharing: shuffle completion time = 5; average flow completion time = 3.66.
Solutions focusing on flow completion time cannot further decrease the shuffle completion time.

Improve Application-Level Performance¹
Data-proportional allocation slows down faster flows to accelerate slower flows so that every flow of the shuffle finishes together.
• Per-flow fair sharing: shuffle completion time = 5; average flow completion time = 3.66.
• Data-proportional allocation: shuffle completion time = 4; average flow completion time = 4.
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
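The data-proportional idea above can be sketched in a few lines: compute the completion time of the shuffle's most loaded port, then pace every flow to finish exactly at that time. The Python below is an illustrative sketch, not Orchestra's implementation, and the flow sizes in the example are made up rather than the ones behind the figure's 5-versus-4 comparison.

```python
from collections import defaultdict

def data_proportional_rates(flows, capacity=1.0):
    """Orchestra-style data-proportional allocation for a single shuffle.

    flows: list of (sender, receiver, size) tuples; sizes and capacity share
    the same units per unit time. Returns (completion_time, rates) such that
    every flow finishes together when the bottleneck port finishes.
    """
    sent = defaultdict(float)      # bytes leaving each sender port
    received = defaultdict(float)  # bytes entering each receiver port
    for src, dst, size in flows:
        sent[src] += size
        received[dst] += size

    # The shuffle cannot finish before its most heavily loaded port does.
    completion_time = max(max(sent.values()), max(received.values())) / capacity

    # Slow every flow down just enough that it finishes exactly at that time.
    rates = {(src, dst): size / completion_time for src, dst, size in flows}
    return completion_time, rates

if __name__ == "__main__":
    # Illustrative flow sizes only.
    shuffle = [("s1", "r1", 3), ("s2", "r1", 2), ("s2", "r2", 3), ("s3", "r2", 1)]
    t, rates = data_proportional_rates(shuffle)
    print(f"shuffle completes at t = {t}")   # bottleneck: r1 must receive 5 units
    for flow, rate in sorted(rates.items()):
        print(flow, f"rate = {rate:.2f}")
```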
Faster Communication Stages: Systems Approach
"Configuration should be handled by the end users."
Applications know their performance goals, but they have no means to let the network know.

Mind the Gap
The networking approach says configuration should be handled at the system level; the systems approach says it should be handled by the end users. A holistic approach has applications and the network working together.

Coflow
A communication abstraction for data-parallel applications to express their performance goals:
1. Minimize completion times,
2. Meet deadlines, or
3. Perform fair allocation.
Coflows capture common patterns such as a single flow, parallel flows, broadcast, aggregation, shuffle, and all-to-all communication.

How to schedule coflows online over an N × N datacenter fabric …
1. … for faster completion of coflows?
2. … to meet more deadlines?
3. … for fair allocation?

Varys¹
Enables coflows in data-intensive clusters.
1. Coflow Scheduler: faster, application-aware data transfers throughout the network.
2. Global Coordination: consistent calculation and enforcement of scheduler decisions.
3. The Coflow API: decouples network optimizations from applications, relieving developers and end users.
1. Efficient Coflow Scheduling with Varys, SIGCOMM'2014.

Coflow, Revisited
The abstraction lets applications express their performance goals; to act on them, the scheduler relies on:
1. The size of each flow,
2. The total number of flows, and
3. The endpoints of individual flows.

Benefits of Inter-Coflow Scheduling
Two coflows share two unit-capacity links: Coflow 1 has a single 3-unit flow on Link 1; Coflow 2 has a 2-unit flow on Link 1 and a 6-unit flow on Link 2.
• Fair sharing: Coflow 1 completes at time 5; Coflow 2 completes at time 6.
• Smallest-flow first¹,² (flow-level prioritization): Coflow 1 completes at 5; Coflow 2 completes at 6.
• The optimal schedule: Coflow 1 completes at 3; Coflow 2 completes at 6.
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.

Inter-Coflow Scheduling is NP-Hard
Inter-coflow scheduling generalizes concurrent open shop scheduling:¹
• Examples include job scheduling and caching blocks.
• Solutions use an ordering heuristic.
Over a datacenter fabric with input and output links, it becomes concurrent open shop scheduling with coupled resources, which must additionally consider matching constraints between input and output ports.
1. A Note on the Complexity of the Concurrent Open Shop Problem, Journal of Scheduling, 9(4):389–396, 2006.
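The numbers in the two-coflow example can be reproduced with a toy fluid simulation. The flow placement matches the example above (Coflow 1: one 3-unit flow; Coflow 2: a 2-unit flow on the same link plus a 6-unit flow on the other), and the two policies are deliberately simplified sketches, not Varys itself.

```python
def simulate(flows, policy):
    """Fluid simulation of flows on unit-capacity links.

    flows: dict flow_id -> (coflow, link, size).
    policy(active, remaining) -> dict flow_id -> rate (per-link rates sum to <= 1).
    Returns dict coflow -> completion time.
    """
    remaining = {f: size for f, (_, _, size) in flows.items()}
    finish, now = {}, 0.0
    while remaining:
        active = {f: flows[f] for f in remaining}
        rates = policy(active, remaining)
        # Advance time to the next flow completion.
        step = min(remaining[f] / rates[f] for f in remaining if rates[f] > 0)
        now += step
        for f in list(remaining):
            remaining[f] -= rates.get(f, 0.0) * step
            if remaining[f] <= 1e-9:
                finish[f] = now
                del remaining[f]
    cct = {}
    for f, (coflow, _, _) in flows.items():
        cct[coflow] = max(cct.get(coflow, 0.0), finish[f])
    return cct

def fair_sharing(active, remaining):
    """Every active flow gets an equal share of its link."""
    count = {}
    for _, (_, link, _) in active.items():
        count[link] = count.get(link, 0) + 1
    return {f: 1.0 / count[link] for f, (_, link, _) in active.items()}

def coflow_order(order):
    """Give each link entirely to its highest-priority active coflow."""
    def policy(active, remaining):
        rates = {f: 0.0 for f in active}
        best = {}
        for f, (coflow, link, _) in active.items():
            if link not in best or order.index(coflow) < order.index(active[best[link]][0]):
                best[link] = f
        for f in best.values():
            rates[f] = 1.0
        return rates
    return policy

if __name__ == "__main__":
    flows = {
        "c1_a": ("C1", "link1", 3.0),   # Coflow 1: one 3-unit flow
        "c2_a": ("C2", "link1", 2.0),   # Coflow 2: 2 units on link1 ...
        "c2_b": ("C2", "link2", 6.0),   # ... and 6 units on link2
    }
    print("fair sharing: ", simulate(flows, fair_sharing))                 # {'C1': 5.0, 'C2': 6.0}
    print("C1 before C2: ", simulate(flows, coflow_order(["C1", "C2"])))   # {'C1': 3.0, 'C2': 6.0}
```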
Varys: Scheduling to Minimize Coflow Completion Times
Varys employs a two-step algorithm:
1. Ordering heuristic: keep an ordered list of coflows to be scheduled, preempting when needed.
2. Allocation algorithm: allocate the minimum required resources to each coflow so that it finishes in the minimum time.

Ordering Heuristic
Three coflows compete on a 3 × 3 datacenter fabric. Their characteristics:
• Length (largest flow): C1 = 3, C2 = 5, C3 = 6
• Width (number of flows): C1 = 3, C2 = 2, C3 = 1
• Size (total bytes): C1 = 9, C2 = 10, C3 = 6
• Bottleneck (most loaded port): C1 = 3, C2 = 10, C3 = 6
Total coflow completion time (CCT) under different orderings:
• Shortest-first (by length): total CCT = 35
• Narrowest-first (by width): total CCT = 41
• Smallest-first (by size): total CCT = 34
• Smallest-bottleneck-first: total CCT = 31

Allocation Algorithm
• A coflow cannot finish before its very last flow.
• Finishing flows faster than the bottleneck cannot decrease the coflow's completion time.
• Therefore, allocate the minimum flow rates such that all flows of a coflow finish together, exactly on time.
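Putting the two steps together, here is a minimal sketch, assuming unit-capacity machine ports and a single scheduling round (the real scheduler also handles arrivals, departures, and work conservation): order coflows by their bottleneck, then give each one, in that order, just enough rate on every port for all of its flows to finish together. The coflows in the example are hypothetical.

```python
from collections import defaultdict

def port_loads(flows):
    """Bytes each ingress ("in", sender) and egress ("out", receiver) port must carry."""
    load = defaultdict(float)
    for src, dst, size in flows:
        load[("in", src)] += size
        load[("out", dst)] += size
    return load

def min_completion_time(flows, free):
    """Smallest time in which the coflow can finish using the leftover per-port
    bandwidth in `free`; infinite if some port it needs is already saturated."""
    return max(load / free[port] if free[port] > 0 else float("inf")
               for port, load in port_loads(flows).items())

def schedule(coflows, capacity=1.0):
    """One scheduling round: order coflows by smallest bottleneck, then give each
    one, in that order, just enough rate on every port so that all of its flows
    finish together at its minimum completion time. Rates would be recomputed
    whenever a coflow arrives or finishes."""
    free = defaultdict(lambda: capacity)                # leftover capacity per port
    order = sorted(coflows, key=lambda c: max(port_loads(coflows[c]).values()))
    allocations = []
    for name in order:
        flows = coflows[name]
        t = min_completion_time(flows, free)
        if t == float("inf"):
            continue                                    # a needed port is saturated; wait
        for src, dst, size in flows:                    # pace every flow to finish at t
            rate = size / t
            free[("in", src)] -= rate
            free[("out", dst)] -= rate
            allocations.append((name, (src, dst), rate))
    return allocations

if __name__ == "__main__":
    coflows = {                                         # hypothetical coflows, unit-capacity ports
        "C1": [("m1", "m4", 4), ("m2", "m4", 2)],
        "C2": [("m1", "m5", 2), ("m2", "m5", 2)],
        "C3": [("m3", "m6", 5)],
    }
    for name, flow, rate in sorted(schedule(coflows)):
        print(name, flow, f"rate = {rate:.2f}")
```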
Varys, Part 2: Global Coordination
Consistent calculation and enforcement of scheduler decisions.

The Need for Coordination
Two coflows on a 3 × 3 fabric with bottlenecks of 4 (C1) and 5 (C2):
• Scheduling with coordination: C1 ends at 4 and C2 at 9; total CCT = 13.
• Scheduling without coordination: C1 ends at 7 and C2 at 12; total CCT = 19.
Uncoordinated local decisions interleave coflows, hurting performance.

Varys Architecture
A centralized master-slave architecture:
• Applications use a client library to communicate with the master.
• Actual timing and rates are determined by the coflow scheduler.
[Architecture figure: senders, receivers, and the driver issue put/get/register calls through the Varys client library; a Varys daemon on each machine sits between the network interface and the (distributed) file system; the Varys master hosts the coflow scheduler alongside a topology monitor and a usage estimator.]
Download from http://varys.net.

Varys, Part 3: The Coflow API
Decouples network optimizations from applications, relieving developers and end users:
1. NO changes to user jobs.
2. NO storage management.

The Coflow API
Primitives: register, put, get, unregister.
@driver (JobTracker): b ← register(BROADCAST); s ← register(SHUFFLE); b.put(content); … b.unregister(); s.unregister()
@mapper: b.get(id); s.put(content)
@reducer: s.get(ids1) …

Evaluation
A 3000-machine trace-driven simulation matched against a 100-machine EC2 deployment.
1. Does it improve performance?
2. Can it beat non-preemptive solutions?
3. Do we really need coordination?

Better than Per-Flow Fairness? YES
Improvements over per-flow fairness:
        Comm. Improv.   Comm. Improv.    Job Improv.   Job Improv.
        (all jobs)      (comm.-heavy)    (all jobs)    (comm.-heavy)
EC2     1.85X           3.16X            1.25X         2.50X
Sim.    3.21X           4.86X            1.11X         3.39X

Preemption is Necessary [Sim.]
[Bar chart: overhead over Varys (normalized to 1.00) for per-flow fairness,¹ FIFO,²,³ per-flow prioritization, FIFO-LM,⁴ and Varys without coordination (Varys NC); the values shown range from 1.10 up to 22.07.]
What about starvation? NO.

Lack of Coordination Hurts [Sim.]
The same comparison highlights the decentralized baselines:
• Smallest-flow-first (per-flow priorities): minimizes flow completion time.
• FIFO-LM⁴ performs decentralized coflow scheduling: it suffers due to local decisions, though it works well for small, similar coflows.
1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM'2014.

Coflow, Without Complete Knowledge
The scheduler so far assumed it knows 1. the size of each flow, 2. the total number of flows, and 3. the endpoints of individual flows. Pipelining between stages, speculative execution, and task failures and restarts make that information unavailable in practice.
How to perform coflow scheduling without complete knowledge?

Implications
To minimize average completion time:
• Flows in a single link: with complete knowledge, smallest-flow-first; without complete knowledge, least-attained service (LAS).
• Coflows in an entire datacenter: with complete knowledge, ordering by bottleneck size plus data-proportional rate allocation; without complete knowledge, an open question.

Revisiting Ordering Heuristics
Smallest-bottleneck-first gave the best total CCT (31, versus 35 for shortest-first, 41 for narrowest-first, and 34 for smallest-first), but it cannot be computed without knowing flow sizes.

Coflow-Aware LAS (CLAS)
Set a priority that decreases with how much a coflow has already sent:
• The more a coflow has sent, the lower its priority.
• Smaller coflows finish faster.
Use the total size of coflows to set priorities:
• Avoids the drawbacks of full decentralization.

Coflow-Aware LAS (CLAS), Continued
Continuous priority reduces to fair sharing when similar coflows coexist (priority oscillation): in the two-coflow example, both complete at time 6. FIFO works well for similar coflows: one completes at 3 and the other at 6.

Discretized Coflow-Aware LAS (D-CLAS)
Priority discretization:
• Change a coflow's priority when its total size sent exceeds predefined thresholds.
Scheduling policies:
• FIFO within the same queue.
• Prioritization across queues.
Weighted sharing across queues:
• Guarantees starvation avoidance.
Coflows move from the highest-priority queue Q1 down to the lowest-priority queue QK as they send more data.

How to Discretize Priorities?
Use exponentially spaced thresholds, where K is the number of queues, A is the threshold constant, and E is the threshold exponent: Q1 holds coflows that have sent less than E^1 A bytes, Q2 holds [E^1 A, E^2 A), and so on up to QK, which holds everything beyond E^(K-1) A.
Loose coordination suffices to calculate global coflow sizes; slaves make independent decisions in between.
Small coflows (smaller than E^1 A) do not experience coordination overheads!
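A minimal sketch of the discretized priority assignment follows. The threshold parameters (K = 10 queues, A = 10 MiB, E = 10) are illustrative assumptions rather than the system's actual defaults, and the weighted sharing used for starvation avoidance is only noted in a comment.

```python
def dclas_queue(bytes_sent, num_queues=10, A=10 * 2**20, E=10):
    """Map a coflow to a priority queue based on how much it has already sent.

    Queue 1 (highest priority) holds coflows that have sent less than E*A
    bytes; queue i holds [E^(i-1)*A, E^i*A); the last queue holds everything
    beyond E^(num_queues-1)*A.
    """
    for i in range(1, num_queues):
        if bytes_sent < (E ** i) * A:
            return i
    return num_queues

def next_coflow(active):
    """Pick the coflow to serve: strict priority across queues, FIFO (by
    arrival time) within a queue. The real design also applies weighted
    sharing across queues so low-priority coflows are not starved."""
    return min(active, key=lambda c: (dclas_queue(c[2]), c[1]))[0]

if __name__ == "__main__":
    MiB = 2**20
    active = [
        ("broadcast", 3.0, 2 * MiB),      # tiny: still in the highest-priority queue
        ("shuffle-1", 1.0, 5000 * MiB),   # large: demoted to a lower-priority queue
        ("shuffle-2", 2.0, 5000 * MiB),   # same queue as shuffle-1, so FIFO between them
    ]
    print(next_coflow(active))            # -> broadcast
```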
Closely Approximates Varys [Sim. & EC2]
[Bar chart: overhead over Varys (normalized to 1.00) for per-flow fairness, FIFO, per-flow prioritization, FIFO-LM, and Varys without complete knowledge; the values shown range from 1.10 up to 22.07. Scheduling without complete knowledge stays close to Varys, while the flow-level and fully decentralized alternatives fall well behind.]

My Contributions
• Orchestra (SIGCOMM'11): merged with Spark
• Spark (NSDI'12): top-level Apache project
• FairCloud (SIGCOMM'12): @HP
• HARP (SIGCOMM'12): @Microsoft Bing
• ViNEYard (ToN'12): open-source
• Sinbad (SIGCOMM'13): merged at Facebook
• Varys (SIGCOMM'14): open-source
• Aalo (SIGCOMM'15): open-source
Spanning network-aware applications, application-aware network scheduling, and datacenter resource allocation.

Communication-First Big Data Systems
• In-datacenter analytics: cheaper SSDs and DRAM, the proliferation of optical networks, and resource disaggregation will make the network the primary bottleneck.
• Inter-datacenter analytics: bandwidth-constrained wide-area networks.
• End-user delivery: faster, more responsive delivery of analytics results over the Internet for a better end-user experience.

Mind the Gap: Summary
• Coflows better capture application-level performance goals than individual flows.
• Coflows improve application-level performance and usability, extending the networking and scheduling literature.
• Coordination – even if not free – is worth paying for in many cases.
[email protected] | http://mosharaf.com

Improve Flow Completion Times
For the earlier three-sender, two-receiver shuffle:
• Per-flow fair sharing: shuffle completion time = 5; average flow completion time = 3.66.
• Smallest-flow first:¹,² shuffle completion time = 6; average flow completion time = 2.66.
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.

Distributions of Coflow Characteristics
[Four CDFs over the Facebook trace: fraction of coflows versus coflow size (bytes), coflow length (bytes), coflow width (number of flows), and coflow bottleneck size (bytes).]
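The four metrics in those CDFs (and in the earlier ordering-heuristic table) can be computed directly from a coflow's flows. Below is a small sketch; the three example coflows are one possible reconstruction whose length, width, size, and bottleneck match the values in the ordering-heuristic slides (3/3/9/3, 5/2/10/10, and 6/1/6/6), not the exact placement in the figure.

```python
from collections import defaultdict

def coflow_characteristics(flows):
    """flows: list of (sender, receiver, size_bytes) for one coflow.

    length     -- size of the coflow's largest flow
    width      -- number of flows in the coflow
    size       -- sum of all flow sizes
    bottleneck -- most bytes any single sender or receiver port must move
    """
    port = defaultdict(float)
    for src, dst, size in flows:
        port[("in", src)] += size
        port[("out", dst)] += size
    return {
        "length": max(size for _, _, size in flows),
        "width": len(flows),
        "size": sum(size for _, _, size in flows),
        "bottleneck": max(port.values()),
    }

if __name__ == "__main__":
    coflows = {
        "C1": [("s1", "r1", 3), ("s2", "r2", 3), ("s3", "r3", 3)],
        "C2": [("s2", "r3", 5), ("s3", "r3", 5)],
        "C3": [("s1", "r2", 6)],
    }
    for name, flows in coflows.items():
        print(name, coflow_characteristics(flows))
```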
Traffic Sources
1. Ingest and replicate new data
2. Read input from remote machines, when needed
3. Transfer intermediate data
4. Write and replicate output
[Pie chart: percentage of traffic by category at Facebook, with shares of 10, 30, 46, and 14 percent.]

Performance: Distribution of Shuffle Durations
Facebook jobs spend ~25% of their runtime on average in intermediate communication.
Month-long trace from a 3000-machine production MapReduce cluster at Facebook: 320,000 jobs and 150 million tasks.
[CDF: fraction of jobs versus fraction of runtime spent in shuffle.]

Theoretical Results
• Structure of optimal schedules: permutation schedules might not always lead to the optimal solution.
• Approximation ratio of COSS-CR: a polynomial-time algorithm with a constant approximation ratio of 64/3.¹
• The need for coordination: fully decentralized schedulers can perform arbitrarily worse than the optimal.
1. Due to Zhen Qiu, Cliff Stein, and Yuan Zhong from the Department of Industrial Engineering and Operations Research, Columbia University.

The Coflow API (with explicit sizes)
1. NO changes to user jobs.
2. NO storage management.
Primitives: register, put, get, unregister.
@driver (JobTracker): b ← register(BROADCAST, numFlows); s ← register(SHUFFLE, numFlows, …); b.put(content, size); … b.unregister(); s.unregister()
@mapper: b.get(id); s.put(content, size)
@reducer: s.get(ids1) …

Varys: Supporting Coflow Deadlines
Varys employs a two-step algorithm to support coflow deadlines:
1. Admission control: do not admit any coflow that cannot be completed within its deadline without violating existing deadlines.
2. Allocation algorithm: allocate the minimum required resources to each admitted coflow to finish it exactly at its deadline.

More Predictable
[Charts from the EC2 deployment and the Facebook trace simulation: the percentage of coflows that met their deadline, missed it, or were not admitted, under Varys, per-flow fairness, and earliest-deadline first (EDF).¹]
1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.

Optimizing Communication Performance: Systems Approach
"Let users figure it out."
Number of communication-related parameters:*
• Spark v1.1.1: 6
• Hadoop v1.2.1: 10
• YARN v2.6.0: 20
*Lower bound. Does not include many parameters that can indirectly impact communication (e.g., the number of reducers). Also excludes control-plane communication/RPC parameters.

Experimental Methodology
Varys deployment on EC2:
• 100 m2.4xlarge machines
• Each machine has 8 CPU cores, 68.4 GB memory, and a 1 Gbps NIC
• ~900 Mbps/machine during all-to-all communication
Trace-driven simulation:
• Detailed replay of a day-long Facebook trace (circa October 2010)
• 3000-machine, 150-rack cluster with 10:1 oversubscription
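Returning to the deadline-mode slides above, the admission check can be sketched as a simple reservation test: admit a coflow only if pacing each of its flows to finish exactly at the deadline fits in the bandwidth left over by already-admitted coflows. This is an illustrative simplification (unit-capacity ports, and reservations are never released when coflows finish), not Varys's actual admission logic.

```python
from collections import defaultdict

class DeadlineAdmission:
    """Sketch of deadline-mode scheduling: admit a coflow only if it can finish
    by its deadline using leftover bandwidth, then reserve just enough rate for
    each of its flows to finish exactly at the deadline. Because admitted
    coflows keep their reservations, a new arrival cannot push an existing
    coflow past its deadline in this model."""

    def __init__(self, capacity=1.0):
        self.free = defaultdict(lambda: capacity)     # leftover capacity per port

    def admit(self, flows, deadline):
        """flows: list of (sender, receiver, size); deadline in the same time unit."""
        demand = defaultdict(float)
        for src, dst, size in flows:
            demand[("in", src)] += size / deadline    # rate needed at each port
            demand[("out", dst)] += size / deadline
        if any(need > self.free[port] + 1e-9 for port, need in demand.items()):
            return False                              # cannot meet the deadline; reject
        for port, need in demand.items():             # reserve the minimal rates
            self.free[port] -= need
        return True

if __name__ == "__main__":
    ctrl = DeadlineAdmission()
    print(ctrl.admit([("s1", "r1", 4), ("s2", "r1", 4)], deadline=10))  # True: r1 needs rate 0.8
    print(ctrl.admit([("s3", "r1", 5)], deadline=10))                   # False: r1 has only 0.2 left
    print(ctrl.admit([("s3", "r2", 5)], deadline=10))                   # True: r2 is idle
```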