Efficient Failure Resilience for Big Data


Transcript

Understanding the Effects and Implications of Compute Node Failures in Hadoop
Florin Dinu, T. S. Eugene Ng

Computing in the Big Data Era

• Big Data – challenging for previous systems
• Big Data frameworks:
  – MapReduce @ Google
  – Dryad @ Microsoft
  – Hadoop @ Yahoo & Facebook
2

Hadoop Is Widely Used

Protein Sequencing, Image Processing, Web Processing, Machine Learning, Advertising Analytics, Log Storage and Analysis, and many more …
3

Building Around Hadoop (SIGMOD 2010)

4

Building On Top Of Hadoop

Building on core Hadoop functionality

5

The Danger of Compute-Node Failures

“In each cluster’s first year, it’s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur.” – Jeff Dean, Google I/O 2008

“Average worker deaths per job: 5.0” – Jeff Dean, Keynote I, PACT 2006

Causes: large scale use of commodity components

6

The Danger of Compute-Node Failures

Amazon, SOSP 2009

In the cloud, compute-node failures are the norm, NOT the exception

7

Failures From Hadoop’s Point of View

Situations indistinguishable from compute-node failures:
• Switch failures
• Longer-term disconnectivity
• Unplanned reboots
• Maintenance work (upgrades)
• Quota limits
• Challenging environments:
  – Spot markets (price-driven availability)
  – Volunteer computing systems
  – Virtualized environments

Important to understand effect of compute-node failures on Hadoop

8

The Problem

• Hadoop is widely used
• Compute-node failures are common

Hadoop needs to be failure resilient in an efficient way

• Minimize impact on job running times
• Minimize resources needed
9

Contribution

• First in-depth analysis of the impact of failures on Hadoop
  – Uncover several inefficiencies
• Potential for future work
  – Immediate practical relevance
  – Basis for realistic modeling of Hadoop
10

Quick Hadoop Background

11

Background – the Tasks

[Diagram: the master runs the JobTracker and NameNode; each worker runs a TaskTracker (MGR) and a DataNode and hosts map (M) and reduce (R) tasks. TaskTrackers ask the master for work ("Give me work!", "More work?"). The example job runs 2 waves of map tasks and 2 waves of reduce tasks.]

12

Background – Data Flow

[Diagram: map tasks read their input from HDFS, map outputs are shuffled to the reduce tasks, and reducers write their output back to HDFS.]

13

Background – Speculative Execution

Ideal case: similar progress rates.
0 <= Progress Score <= 1
Progress Rate = Progress Score / time, e.g. 0.05/sec
14


Background – Speculative Execution (SE)

Goal of SE: detect underperforming nodes and duplicate the computation.
Reality: varying progress rates. Reasons for underperforming tasks: node overload, network congestion, etc.

Underperforming tasks (outliers) in Hadoop: more than 1 STD slower than the mean progress rate (sketched in the code below).
15
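To make the outlier rule concrete, here is a minimal Java sketch (not Hadoop's actual scheduler code; class and method names are illustrative): a task's progress rate is its progress score divided by elapsed time, and the task is an outlier if that rate falls more than one standard deviation below the mean rate of its peers.

```java
import java.util.List;

// Illustrative sketch of Hadoop-style outlier detection: a task is an
// outlier if its progress rate is more than one standard deviation
// below the mean progress rate of its peers.
public class OutlierSketch {

    // Progress rate = progress score (0..1) divided by elapsed time.
    static double progressRate(double progressScore, double elapsedSeconds) {
        return progressScore / elapsedSeconds;   // e.g. 0.5 / 10s = 0.05/sec
    }

    static boolean isOutlier(double rate, List<Double> allRates) {
        double mean = allRates.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = allRates.stream()
                .mapToDouble(r -> (r - mean) * (r - mean))
                .average().orElse(0.0);
        double std = Math.sqrt(variance);
        // Slower than (mean - 1*std) => candidate for speculative execution.
        return rate < mean - std;
    }

    public static void main(String[] args) {
        List<Double> rates = List.of(0.05, 0.048, 0.052, 0.01); // last task is slow
        System.out.println(isOutlier(0.01, rates)); // true: more than 1 STD below the mean
    }
}
```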

How does Hadoop detect failures?

16

Failures of the Distributed Processes

[Diagram: the TaskTracker and DataNode on each worker send periodic heartbeats to the master; their failures are detected through timeouts, heartbeats and periodic checks.]

17

Timeouts, Heartbeats & Periodic Checks

[Timeline: heartbeats arrive periodically until the failure; after several missed checks the master concludes "AHA! It failed".]

• A failure interrupts the heartbeat stream
• The master periodically checks for changes
• Failure is declared after a number of checks

Conservative approach – last line of defense (a sketch of this check loop follows below)

18
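A minimal sketch of this timeout-and-periodic-check scheme, assuming a 600 s expiry interval and a 200 s check period (values chosen to be consistent with the 600-800 s detection times reported in the backup slides; this is not the actual JobTracker code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of timeout + periodic-check failure detection for TaskTrackers.
// A failure only interrupts the heartbeat stream; the master notices it
// when a periodic check finds that no heartbeat arrived for a full
// expiry interval.
public class HeartbeatMonitor {
    static final long EXPIRY_MS = 600_000;   // assumed 10-minute expiry interval
    static final long CHECK_MS  = 200_000;   // assumed periodic-check granularity

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a TaskTracker heartbeat arrives.
    void onHeartbeat(String tracker) {
        lastHeartbeat.put(tracker, System.currentTimeMillis());
    }

    // Called every CHECK_MS by a timer thread.
    void periodicCheck() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > EXPIRY_MS) {
                // Conservative: declared dead only after the full interval, so
                // detection takes between EXPIRY_MS and EXPIRY_MS + CHECK_MS
                // (600-800 s with these assumed values).
                System.out.println("TaskTracker " + e.getKey() + " declared dead");
                lastHeartbeat.remove(e.getKey());
            }
        }
    }
}
```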

Failures of the Individual Tasks (Maps)

[Diagram: reducers repeatedly ask for map outputs ("Give me data!") and, after failed fetch attempts separated by Δt, send notifications to the JobTracker.]

• Map failures are inferred from reducer notifications (see the sketch below)
• Conservative – does not react to temporary failures

19
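A sketch of how this inference might look on the JobTracker side, assuming a threshold of three notifications per map output as mentioned in the backup slides (class and method names are illustrative, not Hadoop's):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: count fetch-failure notifications per map output and only
// re-execute the map after several complaints have accumulated.
// Conservative: a single transient fetch failure does not trigger re-execution.
public class MapFailureInference {
    // Assumed threshold: 3 notifications before a map output is declared lost.
    static final int NOTIFICATIONS_NEEDED = 3;

    private final Map<String, Integer> notificationsPerMap = new HashMap<>();

    // A reducer reports that it could not fetch the output of mapId.
    void onFetchFailureNotification(String mapId) {
        int count = notificationsPerMap.merge(mapId, 1, Integer::sum);
        if (count >= NOTIFICATIONS_NEEDED) {
            System.out.println("Map output " + mapId + " declared lost; re-executing map");
            notificationsPerMap.remove(mapId);
        }
    }
}
```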

Failures of the Individual Tasks (Reducers)

A reducer is declared failed when:
• R complains too much (ratio of failed to successful shuffle attempts)
• R is stalled for too long (no new successful shuffle attempts)

[Diagram: a reducer keeps asking a map for data ("Give me data!"); the map does not answer, so the reducer's failed attempts accumulate.]

Notifications also help infer reducer failures. A sketch of the two checks follows below.

20
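A sketch of the two reducer-side checks above (the "complains too much" ratio and the "stalled too long" condition); the thresholds are illustrative placeholders, not Hadoop's actual configuration values:

```java
// Sketch of the two heuristics that can make a reducer be declared failed:
//  1. it "complains too much": too many failed shuffle attempts relative
//     to its total attempts (e.g. 3 out of 3 failed), or
//  2. it is "stalled for too long": no new successful attempt for a while.
public class ReducerHealth {
    // Illustrative thresholds, not Hadoop's actual configuration values.
    static final double MAX_FAILED_RATIO = 0.5;
    static final long   MAX_STALL_MS     = 300_000;

    private int failedAttempts = 0;
    private int totalAttempts  = 0;
    private long lastSuccessMs = System.currentTimeMillis();

    void recordShuffleAttempt(boolean succeeded) {
        totalAttempts++;
        if (succeeded) {
            lastSuccessMs = System.currentTimeMillis();
        } else {
            failedAttempts++;
        }
    }

    boolean shouldDie() {
        boolean complainsTooMuch =
                totalAttempts > 0 && (double) failedAttempts / totalAttempts > MAX_FAILED_RATIO;
        boolean stalledTooLong =
                System.currentTimeMillis() - lastSuccessMs > MAX_STALL_MS;
        return complainsTooMuch || stalledTooLong;
    }
}
```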

Do these mechanisms work well?

21

Methodology

• Focus on failures of the distributed components (TaskTracker and DataNode)
• Inject these failures separately
• Single failures:
  – Enough to catch many shortcomings
  – Identified the mechanisms responsible
  – Relevant to multiple failures too

22

Mechanisms Under Task Tracker Failure?

• OpenCirrus testbed, 15 nodes
• Sort 10 GB, 14 reducers
• 220 s running time without failures
• Inject a failure at a random time
• Findings also relevant to larger jobs

LARGE, VARIABLE, UNPREDICTABLE job running times.
Poor performance under failure.

23

Clustering Results Based on Cause

[Chart: failure-injection runs clustered by cause.]
• Few reducers impacted: notification mechanism ineffective, timeouts fire
• Failure has no impact
• Not due to notifications

In 70% of cases the notification mechanism is ineffective

24

Clustering Results Based on Cause

• More reducers impacted: notification mechanism detects the failure, timeouts do not fire

The notification mechanism detects the failure in:
• Few cases
• A specific moment in the job

25

Side Effects: Induced Reducer Death

Failures propagate to healthy tasks.

Negative effects:
• Time and resource waste for re-execution
• Job failure – a small number of runs fail completely

R complains too much? (failed / total attempts, e.g. 3 out of 3 failed)

[Diagram: a reducer asks the failed map for data ("Give me data!") and accumulates failed attempts.]

Unlucky reducers die early

26

Side Effects: Induced Reducer Death

R stalled for too long? (no new successful attempts)

[Diagram: the reducer keeps asking the failed map for data ("Give me data!"); the map does not answer.]

All reducers may eventually die.

Fundamental problem:
• Task failures are inferred from connection failures
• Connection failures have many possible causes
• Hadoop has no way to distinguish the cause (source? destination?)
27

More Reducers: 4 per Node = 56 Total

• Job running times are spread out even more
• More reducers = more chances for the effects explained above

28

Effect of DataNode Failures

[Diagram: worker node running a TaskTracker (with M and R tasks) and a DataNode; the DataNode fails.]

29

Timeouts When Writing Data

[Diagram: a task writing data to a DataNode hits a Write Timeout (WTO).]

30

Timeouts When Writing Data

[Diagram: a task connecting to a DataNode hits a Connect Timeout (CTO).]

31
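To make the two timeout types concrete, here is a generic client-side sketch using plain java.net sockets (this is not the HDFS DFSClient): the Connect Timeout (CTO) bounds connection establishment, and the Write Timeout (WTO) is approximated here by a bounded wait for the write acknowledgement.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

// Generic illustration of the two timeouts a writer can hit when a
// DataNode fails: a Connect Timeout (CTO) while opening the connection,
// and a Write Timeout (WTO) while waiting for the write to be acknowledged.
// This is NOT the HDFS client implementation, just a sketch of the pattern.
public class WriteTimeoutsSketch {
    static final int CONNECT_TIMEOUT_MS = 15_000;  // illustrative CTO
    static final int ACK_TIMEOUT_MS     = 60_000;  // illustrative WTO (ack wait)

    static void writeBlock(String host, int port, byte[] data) throws IOException {
        try (Socket s = new Socket()) {
            // CTO: give up if the DataNode does not accept the connection in time.
            s.connect(new InetSocketAddress(host, port), CONNECT_TIMEOUT_MS);
            // WTO (approximated): give up if no acknowledgement arrives in time.
            s.setSoTimeout(ACK_TIMEOUT_MS);

            OutputStream out = s.getOutputStream();
            out.write(data);
            out.flush();

            InputStream in = s.getInputStream();
            int ack = in.read();           // blocks at most ACK_TIMEOUT_MS
            if (ack == -1) {
                throw new IOException("connection closed before acknowledgement");
            }
        }
    }
}
```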

Effect on Speculative Execution

Outliers in Hadoop: >1 STD slower than mean progress rate

[Plot: distribution of task progress rates (PR), annotated with low, high and very high PR tasks, the AVG, and the AVG – 1*STD outlier threshold.]

32

Delayed Speculative Execution

[Plot: Avg(PR) – STD(PR) over time for reducers R9 and R11.
• ~50 s: reducers wait for the mappers
• ~100 s: map outputs are read
• ~150 s: reducers write output]
33

Delayed Speculative Execution

[Plot continued:
• ~200 s: the failure occurs; reducers hit write timeouts (WTO); R9 is speculatively executed
• > 200 s: the new, very fast R9 skews the statistics (the threshold becomes "very low" for R11)
• ~400 s: R11 is finally speculatively executed (the threshold is "finally low enough")]

34

Delayed Speculative Execution

• Hadoop’s assumptions about progress rates are invalidated
• Stats skewed by the very fast speculated task
• Significant impact on job running time

35
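A small self-contained illustration of the effect described on slides 33-35 (the numbers are made up to mimic the shape of the experiment, not measured values): once the very fast speculative copy of R9 enters the statistics, avg(PR) – std(PR) shifts so much that the stuck R11 no longer falls below the threshold, so its speculation is delayed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of delayed speculative execution: a very fast speculative
// copy of one reducer skews the progress-rate statistics so that another
// stuck reducer temporarily stops looking like an outlier.
// All numbers are made up for illustration only.
public class DelayedSpeculationDemo {

    static double mean(List<Double> xs) {
        return xs.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    static double std(List<Double> xs) {
        double m = mean(xs);
        double var = xs.stream().mapToDouble(x -> (x - m) * (x - m)).average().orElse(0.0);
        return Math.sqrt(var);
    }

    static boolean isOutlier(double rate, List<Double> rates) {
        return rate < mean(rates) - std(rates);
    }

    public static void main(String[] args) {
        // Healthy reducers plus two reducers (R9, R11) stuck in write timeouts.
        List<Double> rates = new ArrayList<>(List.of(0.050, 0.048, 0.052, 0.049));
        double stuckR9 = 0.004, stuckR11 = 0.004;
        rates.add(stuckR9);
        rates.add(stuckR11);
        System.out.println("R11 outlier before speculation: " + isOutlier(stuckR11, rates));

        // R9 is speculated; the new copy advances extremely fast and enters
        // the statistics with a huge progress rate.
        rates.add(0.50);
        System.out.println("R11 outlier after fast R9 copy:  " + isOutlier(stuckR11, rates));
    }
}
```

Running this prints true before the speculative R9 is added and false afterwards: the threshold avg – std drops below R11's rate, which is exactly the delay described on the slides.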

52 reducers – 1 Wave

• Reducers stuck in WTO – delayed speculative execution
• CTO after WTO – reconnect to the failed DataNode
36

Delayed SE – A General Problem

Failures and timeouts are not the only cause. To suffer from delayed SE you need:
• Slow tasks that benefit from SE
  – I showed the ones stuck in a WTO
  – Others: slow or heterogeneous nodes, slow transfers (heterogeneous networks)
• Fast-advancing tasks
  – I showed varying data input availability
  – Others: varying task input size, varying network speed

Statistical SE algorithms need to be carefully used

37

Conclusion - Inefficiencies Under Failures

• TaskTracker failures
  – Large, variable and unpredictable job running times
  – Variable efficiency depending on the number of reducers
  – Failures propagate to healthy tasks
  – Success of TCP connections is not enough
• DataNode failures
  – Delayed speculative execution
  – No sharing of potential failure information (details in the paper)
38

Ways Forward

• Provide dynamic information about the infrastructure to applications (at least in private DCs)
• Make speculative execution cause-aware
  – Why is a task slow at runtime?
  – Move beyond statistical SE algorithms
  – Estimate the PR of tasks (use environment and data characteristics)
• Share some information between tasks
  – In Hadoop, tasks rediscover failures individually
  – Lots of work on SE decisions (when and where to SE)
  – These decisions can be invalidated by such runtime inefficiencies
39

Thank you

40

Backup slides

41

Experiment: Results

[Plot: job running times clustered into groups G1–G7.]

Large variability in job running times
42

Group G1 – few reducers impacted

• M1 is copied by all reducers before the failure
• After the failure, R1_1 cannot access M1
• R1_1 needs to send 3 notifications: ~1250 s
• The TaskTracker is declared dead after 600-800 s

[Diagram: reducer R1_1 sends Notif(M1) to the JobTracker; maps M2, M3 and reducers R1, R2, R3 are unaffected.]

Slow recovery when few reducers are impacted
43
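One way to reconcile these numbers (an interpretation, assuming each of R1_1's notifications takes roughly as long as the first one, which a later slide reports at ~416 s): 3 notifications x ~416 s ≈ 1250 s, matching the ~1250 s quoted above and well beyond the 600-800 s after which the TaskTracker is declared dead, so with few reducers impacted the notification path is too slow to help.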

Group G2 – timing of failure

200 s difference between G1 and G2.

[Timelines for G1 and G2, both with marks at 170 s and 600 s; the job end differs by ~200 s.]

The timing of the failure relative to the JobTracker checks impacts job running time
44

Group G3 – early notifications

• In G1, notifications are sent after 416 s
• In G3, early notifications => map outputs are declared lost
• Cause: code-level race conditions

[Diagram: timing of reducer R2's shuffle attempts for map outputs M5 and M6 (M5-1, M6-1, M5-2, M6-2, ...); a different interleaving of attempts leads to early notifications.]

Early notifications increase job running time variability
45

Group G4 & G5 – many reducers impacted

• G4: many reducers send notifications after 416 s; the map output is declared lost before the TaskTracker is declared dead
• G5: same as G4, but early notifications are sent

[Diagram: several reducers send Notif(M1, M2, M3, M4, M5) to the JobTracker.]

Job running time under failure varies with the number of reducers impacted
46

TaskTracker Failures

• Few reducers impacted: not enough notifications, timeouts fire
• Many reducers impacted: enough notifications sent, timeouts do not fire

LARGE, VARIABLE, UNPREDICTABLE job running times.
Efficiency varies with the number of affected reducers.

47

Node Failures: No RST Packets

No RST -> no notifications -> timeouts always fire

48

Not Sharing Failure Information

• Different SE algorithm (OSDI ’08): tasks are speculatively executed even before the failure
• Delayed SE is not the cause
• Both the initial and the speculative task connect to the failed node

No sharing of potential failure information

49

Delayed Speculative Execution

Outlier: a task t is an outlier if avg(PR(all)) – std(PR(all)) > PR(t)

[Plot: the outlier limit over time; reducers R9 and R11 are stuck in WTO.]

Stats skewed by very fast speculative tasks.
Hadoop’s assumptions about progress rates are invalidated.

50

Delayed Speculative Execution

Timeline:
• ~50 s: reducers wait for map outputs
• ~100 s: reducers get map outputs
• ~200 s: failure => reducers time out (WTO)
• ~200 s: R9 is speculatively executed; its huge progress rate skews the statistics
• ~400 s: R11 is finally speculatively executed

Stats skewed by very fast speculative tasks.
Hadoop’s assumptions about progress rates are invalidated.

51