Reining in the Outliers in MapReduce Jobs using Mantri Ganesh Ananthanarayanan†, Srikanth Kandula*, Albert Greenberg*, Ion Stoica†,Yi Lu*, Bikas Saha*, Ed Harris* † UC Berkeley * Microsoft.

Reining in the Outliers in MapReduce Jobs using Mantri

Ganesh Ananthanarayanan†, Srikanth Kandula*, Albert Greenberg*, Ion Stoica†, Yi Lu*, Bikas Saha*, Ed Harris*

† UC Berkeley * Microsoft

MapReduce Jobs

• Basis of analytics in modern Internet services
◦ E.g., Dryad, Hadoop
• Job → {Phase} → {Task}
• Graph flow consists of pipelines as well as strict blocks

Example Dryad Job Graph

[Figure: example Dryad job graph. Phase labels: Map.1, Reduce.1, Map.2, Reduce.2, Join. Node labels: Distr. File System, EXTRACT, AGGREGATE_PARTITION, FULL_AGGREGATE, PROCESS, COMBINE. Annotations: Phase, Pipeline.]

Log Analysis from Production

• Logs from production cluster with thousands of machines, sampled over six months
• 10,000+ jobs, 80PB of data, 4PB network transfers
◦ Task-level details
◦ Production and experimental jobs

Outliers hurt!



Outliers: tasks that run longer than the rest in the phase
• Median phase has 10% outliers, running for >10x longer
• Slow down jobs by 35% at median
• Operational inefficiency
◦ Unpredictability in completion times affects SLAs
◦ Hurts development productivity
◦ Wastes compute cycles

Why do outliers occur?

[Figure: task stages (Read Input, Execute) annotated with outlier causes: Input Unavailable, Network Congestion, Local Contention, Workload Imbalance]
Mantri: A system that mitigates outliers based on root-cause analysis

Mantri’s Outlier Mitigation

• Avoid Recomputation
• Network-aware Task Placement
• Duplicate Outliers
• Cognizant of Workload Imbalance

Recomputes: Illustration

[Figure: (a) Barrier phases, (b) Cascading recomputes. Each panel marks actual and ideal completion and the inflation between them. Legend: normal task, recompute task.]

What causes recomputes? [1]

• Faulty machines
◦ Bad disks, non-persistent hardware quirks (4%)

Set of faulty machines varies with time, not constant

What causes recomputes? [2]

• Transient machine load
◦ Recomputes correlate with machine load
◦ Requests for data access dropped

Replicate costly outputs

MR: recompute probability of a machine
[Figure: Task 1, Task 2 (MR_2), Task 3 (MR_3); Replicate (T_Rep)]
$T_{Recomp} = MR_3 (1 - MR_2)\, T_3 + MR_3\, MR_2\, (T_3 + T_2)$
If $T_{Rep} < T_{Recomp}$: REPLICATE
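The replication test on this slide can be sketched in a few lines. This is a minimal illustration, not Mantri's implementation: it assumes a two-task chain in which Task 3 reads the output of Task 2, that MR_2 and MR_3 are the recompute probabilities of the machines holding those outputs, and that the expected recompute time is MR_3 (1 - MR_2) T_3 + MR_3 MR_2 (T_3 + T_2). The function names are hypothetical.

```python
def expected_recompute_time(mr2: float, mr3: float, t2: float, t3: float) -> float:
    """Expected time to recompute Task 3's output (assumed two-task chain)."""
    # Task 3's output is lost but Task 2's output is intact: rerun Task 3 only.
    only_task3 = mr3 * (1.0 - mr2) * t3
    # Both outputs are lost: the recompute cascades, so rerun Task 2 and Task 3.
    cascade = mr3 * mr2 * (t3 + t2)
    return only_task3 + cascade


def should_replicate(t_rep: float, mr2: float, mr3: float, t2: float, t3: float) -> bool:
    """Replicate the output only when replication is cheaper than the expected recompute."""
    return t_rep < expected_recompute_time(mr2, mr3, t2, t3)
```

With mr2 = 0.5, mr3 = 0.2, t2 = 10 and t3 = 20, the expected recompute time is 2 + 3 = 5 time units, so an output that takes 4 units to replicate would be replicated and one that takes 6 would not.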

Transient Failure Causes

• Recomputes manifest in clutches
• Machine prone to cause recomputes till the problem is fixed
◦ Load abates, critical process restart etc.
• Clue: At least r recomputes within t time window on a machine
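The "at least r recomputes within a time window t" clue maps naturally onto a per-machine sliding window. The sketch below is one way to express it, assuming recompute events arrive with non-decreasing timestamps; the class name `RecomputeMonitor` and its method are hypothetical and not taken from Mantri.

```python
from collections import defaultdict, deque


class RecomputeMonitor:
    """Flags machines that caused at least r recomputes within the last t time units."""

    def __init__(self, r: int, t: float) -> None:
        self.r = r
        self.t = t
        self._events = defaultdict(deque)  # machine -> timestamps of recent recomputes

    def record(self, machine: str, now: float) -> bool:
        """Record a recompute caused by `machine`; return True if it now looks suspect."""
        events = self._events[machine]
        events.append(now)
        # Drop events that have fallen out of the window (now - t, now].
        while events and now - events[0] > self.t:
            events.popleft()
        return len(events) >= self.r
```

A scheduler could use the returned flag to trigger speculative recomputes for outputs stored on that machine, and the flag clears on its own once events age out of the window.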

Speculative Recomputes

• Anticipatorily recompute tasks whose outputs are unread
[Figure: a task fails to read its input data (Read Fail), triggering speculative recomputes of tasks whose output data is still unread]

Mantri’s Outlier Mitigation

• Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.
• Network-aware Task Placement
• Duplicate Outliers
• Cognizant of Workload Imbalance

Reduce Tasks

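• Tasks access output of tasks from previous phases
• Reduce phase (74% of total traffic)
[Figure: reduce tasks read map outputs locally and across racks over the network; variable congestion on the network turns a reduce task into an outlier. Labels: Distr. File System, Local, Map, Network, Reduce, Rack.]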

Smart placement smoothens hotspots

Traffic-based Allotment

Goal: Minimize phase completion time
For every rack:
◦ d : data
◦ u : available uplink bandwidth
◦ v : available downlink bandwidth
Solve for task allocation fractions, a

Local Control is a good approx.

Goal: Minimize phase completion time
For every rack:
◦ d : data, D : data over all racks
Link utilizations average out in the long term and are steady in the short term
• Let rack i have fraction $a_i$ of the tasks
◦ Time uploading, $T_u = d_i (1 - a_i) / u_i$
◦ Time downloading, $T_d = (D - d_i)\, a_i / v_i$
• $Time_i = \max\{T_u, T_d\}$
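The per-rack rule can be turned into a closed form. Taking T_u = d_i (1 - a_i) / u_i and T_d = (D - d_i) a_i / v_i, the upload time falls as a_i grows while the download time rises, so max{T_u, T_d} is smallest where the two are equal, which gives a_i = d_i v_i / (d_i v_i + (D - d_i) u_i). The sketch below computes that value per rack. The final normalization so the fractions sum to 1 is an assumption of this sketch, and the function names are hypothetical.

```python
def rack_time(d_i: float, total: float, u_i: float, v_i: float, a_i: float) -> float:
    """Time_i = max{T_u, T_d} for a rack holding d_i of `total` data with task fraction a_i."""
    t_up = d_i * (1.0 - a_i) / u_i        # data leaving the rack for tasks placed elsewhere
    t_down = (total - d_i) * a_i / v_i    # data entering the rack for tasks placed here
    return max(t_up, t_down)


def local_fraction(d_i: float, total: float, u_i: float, v_i: float) -> float:
    """Fraction a_i that equalizes T_u and T_d, and so minimizes their maximum."""
    denom = d_i * v_i + (total - d_i) * u_i
    return d_i * v_i / denom if denom > 0 else 0.0


def allocation_fractions(data, uplink, downlink):
    """Per-rack fractions, normalized to sum to 1 (normalization is an assumption)."""
    total = sum(data)
    raw = [local_fraction(d, total, u, v) for d, u, v in zip(data, uplink, downlink)]
    norm = sum(raw)
    return [x / norm for x in raw] if norm > 0 else raw
```

For two racks holding 30 and 10 units of data with equal bandwidths, the rack with more data receives three quarters of the tasks, which keeps most of the data local and balances its uplink and downlink times at 7.5 units each.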

Mantri’s Outlier Mitigation

• Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.
• Network-aware Task Placement
◦ Traffic on link proportional to bandwidth
• Duplicate Outliers
• Cognizant of Workload Imbalance

Contentions cause outliers

• Tasks contend for local resources
◦ Processor, memory etc.
• Duplicate tasks elsewhere in the cluster
◦ Current schemes duplicate towards end of the phase (e.g., LATE [OSDI 2008])
• Duplicate outlier or schedule pending task?

Resource-Aware Restart

[Figure: timeline of a running task, marking now, its remaining time $t_{rem}$, and a potential restart taking $t_{new}$]
Save time and resources: $P(c \cdot t_{new} < (c + 1) \cdot t_{rem})$
Continuously observe and kill wasteful copies

Mantri’s Outlier Mitigation

• Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.
• Network-aware Task Placement
◦ Traffic on link proportional to bandwidth
• Duplicate Outliers
◦ Resource-Aware Restart
• Cognizant of Workload Imbalance

Workload Imbalance

• A quarter of the outlier tasks have more data to process
◦ Unequal key partitions for reduce tasks
• Ignoring these is better than duplicating them
• Schedule tasks in descending order of data to process
◦ Time ∝ (Data to Process)
◦ [Graham ‘69] At worst, 33% worse than optimal
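Scheduling in descending order of data size is the classic longest-processing-time-first heuristic from the Graham reference. The sketch below illustrates it under the slide's assumption that task time is proportional to data size; the function name and the use of abstract "slots" are hypothetical simplifications.

```python
import heapq


def schedule_longest_first(task_sizes, num_slots):
    """Greedy longest-first assignment of tasks to slots (num_slots must be >= 1).

    Returns (assignment, makespan), where assignment[s] lists the task indices
    given to slot s and makespan is the load of the busiest slot.
    """
    slots = [(0.0, s) for s in range(num_slots)]  # (current load, slot id) min-heap
    heapq.heapify(slots)
    assignment = [[] for _ in range(num_slots)]
    # Largest tasks first, so small tasks fill the gaps at the end of the phase.
    order = sorted(enumerate(task_sizes), key=lambda pair: pair[1], reverse=True)
    for task_id, size in order:
        load, slot = heapq.heappop(slots)         # least-loaded slot so far
        assignment[slot].append(task_id)
        heapq.heappush(slots, (load + size, slot))
    makespan = max(load for load, _ in slots)
    return assignment, makespan
```

For sizes [7, 5, 4, 3, 1] on two slots, the tasks are assigned as [[0, 3], [1, 2, 4]] and both slots finish at time 10, which is optimal for this input.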

Mantri’s Outlier Mitigation

• Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.
• Network-aware Task Placement
◦ Traffic on link proportional to bandwidth
• Duplicate Outliers
◦ Resource-Aware Restart
• Cognizant of Workload Imbalance
◦ Schedule in descending order of size

Proactive: Predict to act early
Reactive: Be resource-aware

Results

• Deployed in production Bing clusters
• Trace-driven simulations
◦ Mimic workflow, failures, data skew
◦ Compare with existing and idealized schemes

Jobs in the Wild

Jobs faster by 32% at median, consuming fewer resources

• Act Early: Duplicates issued when task 42% done (77% for Dryad)
• Light: Issues fewer copies (0.47x as many as Dryad)
• Accurate: 2.8x higher success rate of copies

Recomputation Avoidance

Eliminates most recomputes with minimal extra resources
Replication and speculation work well in tandem

Network-Aware Placement

[Figure: bandwidth approximations]
Mantri approximates the ideal well

Summary

• From measurements in a production cluster,
◦ Outliers are a significant problem
◦ Are due to an interplay between storage, network and map-reduce
• Mantri, a cause- and resource-aware mitigation
• Deployment shows encouraging results
• “Reining in the Outliers in MapReduce Clusters using Mantri”, USENIX OSDI 2010
