Reining in the Outliers in MapReduce Jobs using Mantri Ganesh Ananthanarayanan†, Srikanth Kandula*, Albert Greenberg*, Ion Stoica†,Yi Lu*, Bikas Saha*, Ed Harris* † UC Berkeley * Microsoft.
Reining in the Outliers in MapReduce Jobs using
Mantri
Ganesh Ananthanarayanan † , Srikanth Kandula*, Albert Greenberg*, Ion Stoica † , Yi Lu*, Bikas Saha*, Ed Harris*
† UC Berkeley * Microsoft
MapReduce Jobs
Basis of analytics in modern Internet services
◦ E.g., Dryad, Hadoop
Job → {Phase} → {Task}
Graph flow consists of pipelines as well as strict blocks
Example Dryad Job Graph
[Figure labels: Map.1, Map.2, Reduce.1, Reduce.2, Join Phase, Pipeline; stages EXTRACT, AGGREGATE_PARTITION, FULL_AGGREGATE, PROCESS, COMBINE; Distr. File System at input and output]
Log Analysis from Production
Logs from a production cluster with thousands of machines, sampled over six months
10,000+ jobs, 80PB of data, 4PB of network transfers
◦ Task-level details
◦ Production and experimental jobs
Outliers hurt!
Tasks that run longer than the rest in the phase
Median phase has 10% outliers, running for >10x longer
Slow down jobs by 35% at median
Operational Inefficiency
◦ Unpredictability in completion times affects SLAs
◦ Hurts development productivity
◦ Wastes compute cycles
Why do outliers occur?
[Figure: a task's stages (Read Input, Execute) annotated with outlier causes: Input Unavailable, Network Congestion, Local Contention, Workload Imbalance]
Mantri: A system that mitigates outliers based on root-cause analysis
Mantri’s Outlier Mitigation
Avoid Recomputation
Network-aware Task Placement
Duplicate Outliers
Cognizant of Workload Imbalance
Recomputes: Illustration
[Figure: (a) Barrier phases, (b) Cascading Recomputes. Each panel marks Actual, Ideal, and Inflation; legend: Normal task, Recompute task]
What causes recomputes? [1]
Faulty machines
◦ Bad disks, non-persistent hardware quirks (4%)
Set of faulty machines varies with time, not constant
What causes recomputes? [2]
Transient machine load
◦ Recomputes correlate with machine load
◦ Requests for data access dropped
Replicate costly outputs
MR: Recompute probability of a machine
[Figure labels: Task 1, Task 2 ($MR_2$), Task 3 ($MR_3$), Replicate ($T_{Rep}$)]
$T_{Recomp} = MR_3 (1 - MR_2)\, T_3 + MR_3\, MR_2\, (T_3 + T_2)$
$T_{Rep} < T_{Recomp}$: REPLICATE
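The replicate-or-recompute test on this slide can be sketched in a few lines of Python. This is a minimal sketch, assuming $T_2$ and $T_3$ are the running times of Task 2 and Task 3, and that $MR_2$ and $MR_3$ are the recompute probabilities of the machines holding their outputs, so that $T_{Recomp} = MR_3 (1 - MR_2) T_3 + MR_3 MR_2 (T_3 + T_2)$. The function names and the numbers are hypothetical and are not taken from the talk.

```python
def expected_recompute_time(mr2: float, t2: float, mr3: float, t3: float) -> float:
    """Expected time to regenerate Task 3's output (assumed reading of the slide).

    With probability mr3 * (1 - mr2) only Task 3 reruns; with probability
    mr3 * mr2 Task 2's output is lost as well, so both tasks rerun.
    """
    only_task3 = mr3 * (1.0 - mr2) * t3
    cascade = mr3 * mr2 * (t3 + t2)
    return only_task3 + cascade


def should_replicate(t_rep: float, t_recomp: float) -> bool:
    """Replicate the output when copying it is cheaper than the expected recompute."""
    return t_rep < t_recomp


# Illustrative numbers only, not measurements from the production cluster.
t_recomp = expected_recompute_time(mr2=0.5, t2=4.0, mr3=0.2, t3=10.0)
print(t_recomp)
print(should_replicate(t_rep=1.0, t_recomp=t_recomp))
```

Expanding the expression gives $MR_3 T_3 + MR_3 MR_2 T_2$, which shows why outputs at the end of a long, failure-prone chain are the costly ones worth replicating.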
Transient Failure Causes
Recomputes manifest in clusters
A machine is prone to cause recomputes until the problem is fixed
◦ Load abates, critical process restarts, etc.
Clue: At least r recomputes within a time window t on a machine
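The "at least r recomputes within a time window t" clue maps naturally onto a sliding-window counter per machine. The sketch below is one possible way to track it, not the mechanism used in Mantri; the class name `RecomputeWatcher`, the machine id, and the thresholds are hypothetical, and timestamps are assumed to arrive in nondecreasing order.

```python
from collections import defaultdict, deque


class RecomputeWatcher:
    """Flags a machine once it causes at least r recomputes within t seconds."""

    def __init__(self, r: int, t: float) -> None:
        self.r = r
        self.t = t
        self._events = defaultdict(deque)  # machine -> recent recompute timestamps

    def record(self, machine: str, now: float) -> bool:
        """Record one recompute at time `now`; return True if the machine looks suspect."""
        events = self._events[machine]
        events.append(now)
        # Drop recomputes that fell out of the window ending at `now`.
        while events and now - events[0] > self.t:
            events.popleft()
        return len(events) >= self.r


# Hypothetical thresholds: 3 recomputes within 60 seconds.
watcher = RecomputeWatcher(r=3, t=60.0)
flags = [watcher.record("m1", ts) for ts in (0.0, 10.0, 100.0, 120.0, 130.0)]
print(flags)
```

A flagged machine would be a candidate for the speculative recomputes and preferential replication described on the neighboring slides.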
Speculative Recomputes
Anticipatorily recompute tasks whose outputs are unread
[Figure labels: Task, Input Data (Read Fail), Speculative Recompute, Unread Data]
Mantri’s Outlier Mitigation
Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.
Network-aware Task Placement
Duplicate Outliers
Cognizant of Workload Imbalance
Reduce Tasks
Tasks access output of tasks from previous phases
Reduce phase (74% of total traffic)
[Figure labels: Distr. File System, Local, Map, Network, Reduce, Outlier!]
Variable Congestion
[Figure labels: Reduce task, Map output, Rack]
Smart placement smoothens hotspots
Traffic-based Allotment
Goal: Minimize phase completion time
For every rack:
◦ d : data
◦ u : available uplink bandwidth
◦ v : available downlink bandwidth
Solve for task allocation fractions, $a_i$
Local Control is a good approx.
Goal: Minimize phase completion time
For every rack:
◦ d : data, D : data over all racks
Link utilizations average out in long term, are steady on the short term
Let rack i have fraction $a_i$ of tasks
◦ Time uploading, $T_u = d_i (1 - a_i) / u_i$
◦ Time downloading, $T_d = (D - d_i)\, a_i / v_i$
$\mathrm{Time}_i = \max\{T_u, T_d\}$
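To make the per-rack model concrete, the sketch below evaluates $\mathrm{Time}_i = \max\{T_u, T_d\}$ with $T_u = d_i (1 - a_i) / u_i$ and $T_d = (D - d_i)\, a_i / v_i$, and picks for each rack the fraction at which the two terms are equal. Since $T_u$ falls and $T_d$ rises as $a_i$ grows, that crossing point minimizes the maximum for the rack taken alone. Rescaling the fractions so that they sum to 1 is an assumption added here for illustration; the function names and numbers are hypothetical, and this is one reading of "local control", not necessarily the heuristic Mantri uses.

```python
def rack_time(d_i: float, total: float, u_i: float, v_i: float, a_i: float) -> float:
    """Time_i = max(T_u, T_d) for one rack, using the slide's symbols."""
    t_up = d_i * (1.0 - a_i) / u_i       # data leaving the rack over its uplink
    t_down = (total - d_i) * a_i / v_i   # data entering the rack over its downlink
    return max(t_up, t_down)


def balanced_fraction(d_i: float, total: float, u_i: float, v_i: float) -> float:
    """Fraction a_i at which T_u == T_d (requires total > 0 and positive bandwidths)."""
    return d_i * v_i / (d_i * v_i + (total - d_i) * u_i)


def allot(data, uplink, downlink):
    """Per-rack balanced fractions, rescaled to sum to 1 (rescaling is an assumption)."""
    total = sum(data)
    raw = [balanced_fraction(d, total, u, v) for d, u, v in zip(data, uplink, downlink)]
    scale = sum(raw)
    return [a / scale for a in raw]


# Two racks with equal bandwidth but unequal data (made-up numbers).
data, up, down = [10.0, 30.0], [1.0, 1.0], [1.0, 1.0]
fractions = allot(data, up, down)
times = [rack_time(d, sum(data), u, v, a)
         for d, u, v, a in zip(data, up, down, fractions)]
print(fractions, times)
```

With equal bandwidths the rack holding more data receives proportionally more tasks, which keeps most of the traffic off the network.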
Mantri’s Outlier Mitigation
Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.
Network-aware Task Placement
◦ Traffic on link proportional to bandwidth
Duplicate Outliers
Cognizant of Workload Imbalance
Contentions cause outliers
Tasks contend for local resources
◦ Processor, memory, etc.
Duplicate tasks elsewhere in the cluster
◦ Current schemes duplicate towards the end of the phase (e.g., LATE [OSDI 2008])
Duplicate outlier or schedule pending task?
Resource-Aware Restart
[Figure: timeline of a running task showing the current time (now), its remaining time $t_{rem}$, and a potential restart taking $t_{new}$]
Save time and resources: $P(c \cdot t_{new} < (c + 1) \cdot t_{rem})$
Continuously observe and kill wasteful copies
Mantri’s Outlier Mitigation
Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.
Network-aware Task Placement
◦ Traffic on link proportional to bandwidth
Duplicate Outliers
◦ Resource-Aware Restart
Cognizant of Workload Imbalance
Workload Imbalance
A quarter of the outlier tasks have more data to process
◦ Unequal key partitions for reduce tasks
Ignoring these is better than duplicating them
Schedule tasks in descending order of data to process
◦ Time ∝ (Data to Process)
◦ [Graham ‘69] At worst, within 33% of optimal
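Scheduling in descending order of data to process is the classic longest-task-first greedy rule. The sketch below is a minimal illustration, assuming running time is exactly proportional to input size (proportionality constant 1) and that all slots are identical; `schedule_descending` and the task sizes are hypothetical.

```python
import heapq


def schedule_descending(task_sizes, num_slots):
    """Assign tasks, largest first, to whichever slot becomes free earliest."""
    slots = [(0.0, s) for s in range(num_slots)]  # (time the slot frees up, slot id)
    heapq.heapify(slots)
    assignment = {s: [] for s in range(num_slots)}
    for size in sorted(task_sizes, reverse=True):
        free_at, s = heapq.heappop(slots)
        assignment[s].append(size)
        heapq.heappush(slots, (free_at + size, s))
    makespan = max(free_at for free_at, _ in slots)
    return assignment, makespan


# Made-up task sizes on two slots.
assignment, makespan = schedule_descending([3, 7, 2, 5, 3, 4], num_slots=2)
print(assignment, makespan)
```

Starting the largest tasks first means the tasks with the most data are never left to begin late in the phase, which is what turns a heavy task into an outlier.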
Mantri’s Outlier Mitigation
Avoid Recomputation
◦ Preferential Replication + Speculative Recomp.
Network-aware Task Placement
◦ Traffic on link proportional to bandwidth
Duplicate Outliers
◦ Resource-Aware Restart
Cognizant of Workload Imbalance
◦ Schedule in descending order of size
Proactive: Predict to act early
Reactive: Be resource-aware
Results
Deployed in production Bing clusters
Trace-driven simulations
◦ Mimic workflow, failures, data skew
◦ Compare with existing and idealized schemes
Jobs in the Wild
Jobs faster by 32% at median, consuming fewer resources
Act Early: Duplicates issued when task is 42% done (77% for Dryad)
Light: Issues fewer copies (0.47x as many as Dryad)
Accurate: 2.8x higher success rate of copies
Recomputation Avoidance
Eliminates most recomputes with minimal extra resources
Replication and speculation work well in tandem
Network-Aware Placement
Bandwidth approximations
Mantri approximates the ideal well
Summary
From measurements in a production cluster,
◦ Outliers are a significant problem
◦ Are due to an interplay between storage, network and map-reduce
Mantri, a cause- and resource-aware mitigation
Deployment shows encouraging results
“Reining in the Outliers in MapReduce Clusters using Mantri”, USENIX OSDI 2010