Improving MapReduce Performance Using Smart Speculative Execution Strategy


Improving MapReduce Performance Using Smart Speculative Execution Strategy
Qi Chen, Cheng Liu, and Zhen Xiao
Oct 2013
To appear in IEEE Transactions on Computers
Outline
1. Introduction
2. Background
3. Previous work
4. Pitfalls
5. Our Design
6. Evaluation
7. Conclusion
Introduction

The new era of Big Data is coming!
– 20 PB per day (2008)
– 30 TB per day (2009)
– 60 TB per day (2010)
– petabytes per day

What does big data mean?
– Important user information
– Significant business value
MapReduce

What is MapReduce?
– The most popular parallel computing model, proposed by Google

Applications:
– Database operations: Select, Join, Group
– Search engine: page rank, inverted index, log analysis
– Machine learning: clustering, machine translation, recommendation
– Scientific computation, cryptanalysis, …
Straggler

What is a straggler in MapReduce?
– A node on which tasks take an unusually long time to finish
It will:
– Delay the job execution time
– Degrade the cluster throughput
How to solve it?
– Speculative execution: a slow task is backed up on an alternative machine in the hope that the backup copy finishes faster
Outline
1. Introduction
2. Background
3. Previous work
4. Pitfalls
5. Our Design
6. Evaluation
7. Conclusion
Architecture

[Figure: the master assigns map tasks over the input splits (Split 1 … Split M); in the map stage, each map task writes its output into partitions (Part 1, Part 2); in the reduce stage, reduce tasks are assigned the partitions and write the output files (Output1, Output2).]
Programming model

Input: (key, value) pairs
Output: (key*, value*) pairs

Stage  | Phase   | Transformation
Map    | Map     | List(K1,V1) → List(K2,V2)
       | Combine | List(K2,V2) → List(K2, List(V2))
Reduce | Copy    | List(K2, List(V2)) → List(K2, List(V2))
       | Sort    | List(K2, List(V2)) → Ordered (K2, List(V2))
       | Reduce  | Ordered (K2, List(V2)) → List(K3,V3)
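The stage/phase transformations above can be sketched in a few lines of Python. This is a minimal WordCount illustration of the programming model, not Hadoop's actual API; the function names (`map_fn`, `combine`, `reduce_fn`) are illustrative.

```python
from collections import defaultdict

def map_fn(record):
    """Map: (K1, V1) -> List(K2, V2)."""
    _, line = record
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combine: List(K2, V2) -> List(K2, List(V2)), grouped by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    """Reduce: Ordered (K2, List(V2)) -> (K3, V3)."""
    return key, sum(values)

records = [(0, "big data big value")]
intermediate = combine(p for r in records for p in map_fn(r))
result = dict(reduce_fn(k, v) for k, v in sorted(intermediate.items()))
# result == {"big": 2, "data": 1, "value": 1}
```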
Causes of Stragglers

Internal factors:
– resource capacity of worker nodes is heterogeneous
– resource competition due to other MapReduce tasks running on the same worker node

External factors:
– resource competition due to co-hosted applications
– input data skew
– remote input or output source is too slow
– faulty hardware
Outline
1. Introduction
2. Background
3. Previous work
4. Pitfalls
5. Our Design
6. Evaluation
7. Conclusion
Previous work

Google and Dryad
– When a stage is close to completion, back up an arbitrary set of the remaining tasks

Hadoop Original
– Back up tasks whose progress falls behind the average by a fixed gap

LATE (OSDI'08)
– Backup task: 1) longest remaining time, 2) progress rate below threshold
– Identify a worker as slow when its performance score falls below a threshold

Mantri (OSDI'10)
– Saves cluster computing resources
– Backs up outliers as soon as they show up
– Kill-restart when the cluster is busy, lazy duplicate when the cluster is idle
Outline
1. Introduction
2. Background
3. Previous work
4. Pitfalls
5. Our Design
6. Evaluation
7. Conclusion
Pitfalls in Selecting Slow Tasks

Using the average progress rate to identify slow tasks and estimate task remaining time

Hadoop and LATE assume that:
– tasks of the same type process almost the same amount of input data
– the progress rate is either stable or accelerating during a task's lifetime

There are scenarios in which these assumptions break down:
Input data skew
– Sort benchmark on 10 GB of input data following the Zipf distribution (α = 1.0)

Phase percentage varies
– Speed varies across different phases
– Different jobs have different phase duration ratios
– The same job has different phase duration ratios in different environments

Reduce tasks start asynchronously
– Tasks in different phases cannot be compared directly

It takes a long time to identify stragglers
– Stragglers cannot be identified in time
Pitfalls in Selecting Backup Node

Identifying slow worker nodes
– LATE: sum of the progress of all completed and running tasks on the node
– Hadoop: average progress rate of all completed tasks on the node
– Some worker nodes may do more time-consuming tasks (e.g., tasks with more data to process, or non-local map tasks) and unfairly receive a lower performance score

Choosing backup worker nodes
– LATE and Hadoop ignore data locality
– Our observation: a data-local map task can be over three times faster than a non-local map task
Outline
1. Introduction
2. Background
3. Previous work
4. Pitfalls
5. Our Design
6. Evaluation
7. Conclusion
Selecting Backup Candidates

Using per-phase process speed
– Divide each task into multiple phases
– Map task: Map, Combine
– Reduce task: Copy, Sort, Reduce
– Use the phase process speed to identify slow tasks and estimate task remaining time
– Compare tasks only within the same phase (Map with Map, Combine with Combine, etc.)
Selecting Backup Candidates

Using EWMA to predict process speed

Z(t) = α · Y(t) + (1 − α) · Z(t − 1), 0 < α ≤ 1
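The EWMA predictor above is a one-liner in code. This is a minimal sketch; the smoothing weight α = 0.3 is an illustrative assumption, not the value used in the paper.

```python
def ewma(samples, alpha=0.3):
    """Z(t) = alpha * Y(t) + (1 - alpha) * Z(t - 1).

    Returns the smoothed process-speed estimate after all samples.
    alpha = 0.3 is an assumed illustrative value."""
    z = samples[0]               # initialize Z(0) with the first observation
    for y in samples[1:]:
        z = alpha * y + (1 - alpha) * z
    return z

speeds = [10.0, 10.0, 4.0]       # MB/s samples; a sudden slowdown at the end
est = ewma(speeds)               # 0.3*4 + 0.7*10 = 8.2: the dip is damped
```

The EWMA damps transient dips, so one slow sample does not immediately mark a task as a straggler.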
Selecting Backup Candidates

Estimating task remaining time and backup time

rem_time = rem_time_cur_phase + rem_time_following_phases
         = rem_data_cur_phase / bandwidth_cur_phase + Σ_{p ∈ following_phases} est_time_p · factor_d

backup_time = Σ_p est_time_p · factor_d

factor_d = data_input / data_avg

– We use the phase average process speed to estimate the remaining time of a phase, est_time_p
– Because the process speed of the copy phase tends to be fast at the beginning and drop later, we estimate the remaining time of the copy phase as:

rem_time_copy = (finish_percent_map − finish_percent_copy) / process_speed_copy
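The remaining-time and backup-time estimates can be sketched directly from the formulas above. Variable names follow the slide's symbols; the numeric inputs are illustrative assumptions.

```python
def remaining_time(rem_data_cur, bandwidth_cur, est_times_following, factor_d):
    """rem_time = rem_data_cur / bandwidth_cur + sum(est_time_p * factor_d)
    over the phases that have not started yet."""
    cur = rem_data_cur / bandwidth_cur
    following = sum(t * factor_d for t in est_times_following)
    return cur + following

def backup_time(est_times_all_phases, factor_d):
    """backup_time = sum over all phases of est_time_p * factor_d."""
    return sum(t * factor_d for t in est_times_all_phases)

factor = 128.0 / 64.0            # factor_d = data_input / data_avg = 2.0
rem = remaining_time(rem_data_cur=32.0, bandwidth_cur=8.0,
                     est_times_following=[5.0, 3.0], factor_d=factor)
bak = backup_time([6.0, 5.0, 3.0], factor_d=factor)
# rem == 32/8 + (5+3)*2 == 20.0 ; bak == (6+5+3)*2 == 28.0
```

The data-skew factor scales the phase averages so a task with twice the average input is expected to take twice as long.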
Selecting Backup Candidates

Maximizing cost performance
– Cost: the computing resources occupied by tasks
– Performance: the shortening of the job execution time and the increase of the cluster throughput

profit_backup = α · (rem_time − backup_time) − β · 2 · backup_time
profit_not_backup = α · 0 − β · rem_time

profit_backup > profit_not_backup
⇔ rem_time / backup_time > (α + 2β) / (α + β)

Let γ = β / α; then the condition becomes:

rem_time / backup_time > (1 + 2γ) / (1 + γ)

We hope that:
– when the cluster is idle, the cost of speculative execution is less of a concern
– when the cluster is busy, the cost is an important consideration

γ = load_factor = #pending_tasks / #free_slots
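The profit comparison above reduces to a single threshold test, sketched below. Setting γ to the load factor makes the threshold adaptive: it approaches 1 on an idle cluster and 2 on a busy one. The function name and numbers are illustrative.

```python
def should_backup(rem_time, backup_time, pending_tasks, free_slots):
    """Back up iff rem_time / backup_time > (1 + 2*gamma) / (1 + gamma),
    where gamma is the cluster load factor."""
    gamma = pending_tasks / free_slots       # load_factor
    return rem_time / backup_time > (1 + 2 * gamma) / (1 + gamma)

# Idle cluster (gamma -> 0): threshold -> 1, back up aggressively.
idle = should_backup(rem_time=30, backup_time=25, pending_tasks=0, free_slots=10)
# Busy cluster (large gamma): threshold -> 2, back up only clear stragglers.
busy = should_backup(rem_time=30, backup_time=25, pending_tasks=100, free_slots=10)
# idle is True, busy is False
```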
Selecting Proper Backup Nodes

Assign backup tasks to fast nodes
– How do we measure the performance of a node?
– We use the predicted process bandwidth of data-local map tasks completed on the node to represent its performance

Consider data locality
– Note: the process speed of a data-local map task can be 3 times that of a non-local map task
– Therefore, we keep process-speed statistics of data-local, rack-local, and non-local map tasks for each node
– For nodes that have not processed any map task at a specific locality level, we use the average process speed of all nodes at this level as an estimate

Launch a backup on node i only if the task's remaining time is longer than its backup time on node i
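The per-level statistics with a cluster-wide fallback can be sketched as below. The data layout and function name are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean

# Per the slide: each node keeps process-speed samples per locality level;
# when a node has no sample at a level, fall back to the average over all
# nodes at that level.
LEVELS = ("data_local", "rack_local", "non_local")

def node_speed(node_stats, cluster_stats, level="data_local"):
    """Predicted process bandwidth of map tasks at `level` on this node."""
    samples = node_stats.get(level, [])
    if samples:
        return mean(samples)
    return mean(cluster_stats[level])    # fallback: average over all nodes

cluster = {"data_local": [30.0, 26.0], "rack_local": [18.0], "non_local": [9.0]}
fresh_node = {}                          # no completed map tasks yet
speed = node_speed(fresh_node, cluster)  # falls back to mean([30, 26]) = 28.0
```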
Summary

A task will be backed up when it meets the following conditions:
– it has executed for a certain amount of time (i.e., the speculative lag)
– both the progress rate and the process bandwidth in the current phase of the task are sufficiently low
– the profit of doing the backup outweighs that of not doing it
– its estimated remaining time is longer than the predicted time to finish on a backup node
– it has the longest remaining time among all the tasks satisfying the conditions above
Outline
1. Introduction
2. Background
3. Previous work
4. Pitfalls
5. Our Design
6. Evaluation
7. Conclusion
Experiment Environment

Two scales:
– Small: 30 virtual machines on 15 physical machines
– Large: 100 virtual machines on 30 physical machines

Each physical machine:
– dual processors (2.4 GHz Intel(R) Xeon(R) E5620 with 16 logical cores), 24 GB of RAM, and two 150 GB disks
– organized in three racks connected by 1 Gbps Ethernet

Each virtual machine:
– 2 virtual cores, 4 GB of RAM, and 40 GB of disk space

Benchmarks:
– Sort, WordCount, Grep, Gridmix
Scheduling in Heterogeneous Environments

Load of each host in heterogeneous environments:

Load       | Hosts | VMs
1 VM/host  | 3     | 3
2 VMs/host | 11    | 22
5 VMs/host | 1     | 5
Total      | 15    | 30
Scheduling in Heterogeneous Environments

Working with different workloads:

Workload  | Job completion time improvement | Cluster throughput improvement
WordCount | 10%                             | 5%
Sort      | 19%                             | 15%
Grep      | 39%                             | 38%
Gridmix   | 13%                             | 15%
Scheduling in Heterogeneous Environments

Analysis (using WordCount and Grep):

Strategy    | Precision (Map / Reduce) | Recall (Map / Reduce) | Average find time (Map / Reduce)
Hadoop-LATE | 37.6% / 3%               | 100% / 100%           | 70s / 66s
Hadoop-MCP  | 93.3% / 87.1%            | 100% / 45.2%          | 56s / 32s

Improvement          | Execution speed | Cluster throughput
+Accurate prediction | 27%             | 29%
+Cost performance    | 31%             | 32%
All                  | 39%             | 38%
Scheduling in Heterogeneous Environments

Handling data skew (Sort)

[Figure callouts: execution speed +17% and +37%; cluster throughput +19% and +44%]
Scheduling in Heterogeneous Environments

Competing with other applications
– Run some I/O-intensive processes on some servers
– A dd process creates large files in a loop, writing random data on some physical machines
– MCP runs 36% faster than Hadoop-LATE and increases the cluster throughput by 34%
Large-scale Experiment

Load distribution:

Load       | Hosts | VMs
3 VMs/host | 27    | 81
5 VMs/host | 4     | 20
Total      | 31    | 101

MCP finishes jobs 21% faster than Hadoop-LATE and improves the cluster throughput by 16%.
Scheduling in Homogeneous Environments

Small-scale cluster with each host running 2 VMs

There is no straggler node in the cluster

MCP finishes jobs 6% faster than Hadoop-LATE
and 2% faster than Hadoop-None.

Hadoop-LATE behaves worse than Hadoop-None
due to too many unnecessary reduce backups

MCP improves reduce backup precision by 40%

MCP can achieve better data locality for map tasks
Scheduling Cost

We measure the average time that MCP and Hadoop-LATE spend on speculative scheduling in a job with 350 map tasks and 110 reduce tasks:
– MCP spends about 0.54 ms (complexity O(n))
– LATE spends about 0.74 ms (complexity O(n log n))
Outline
1. Introduction
2. Background
3. Previous work
4. Pitfalls
5. Our Design
6. Evaluation
7. Conclusion
Conclusion

We provide an analysis of the pitfalls of current speculative execution strategies in MapReduce
– Scenarios: data skew, tasks that start asynchronously, improper configuration of phase percentages, etc.

We develop a new strategy, MCP, to handle these scenarios:
– Accurate slow-task prediction and remaining-time estimation
– Takes the cost performance of computing resources into account
– Takes both data locality and data skew into consideration when choosing proper worker nodes

MCP fits well in both heterogeneous and homogeneous environments
– handles data skew well, is quite scalable, and incurs less overhead
Thank You!