Extending Map-Reduce for
Efficient Predicate-Based
Sampling
Raman Grover
Michael J. Carey
Department of Computer Science,
University of California, Irvine
Work started at Facebook, summer 2010
Contents
• Problem statement
• Map-Reduce overview
• Inefficiencies in the native implementation
• An incremental processing mechanism
• Experimental evaluation
• Conclusion
• References
Problem Statement
• Collect and process as much as you can!
• Collected data may accumulate to tera/petabytes!

Sampling: the sampled data is additionally required to satisfy a given set of predicates ("Predicate-Based Sampling"). For example:

SELECT age, profession, salary
FROM CENSUS
WHERE AGE > 25 AND AGE <= 30
AND GENDER = 'FEMALE'
AND STATE = 'CALIFORNIA'
LIMIT 1000
We needed…
(a) A pseudo-random sample.
(b) Response time that is not a function of the size of the input.

Similar sampling queries are common at Facebook.
What were the challenges?
• Absence of indexes
• Wide variety of predicates
• Size of the data
Paper Contributions!
• Mechanism for incremental processing
• Policies for incremental processing
• Implementation over Hadoop/Hive
• Efficient predicate-based sampling
Map-Reduce Overview
• Map-Reduce treats data as a list of (key, value) pairs.
• It expresses a computation in terms of two functions: map and reduce.

map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(k3, v3)

• The map function, defined by the user, takes a key-value pair as input and outputs a list of intermediate key-value pairs.
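As a concrete illustration of these signatures (this example is not from the paper; it is the standard word-count job written against Hadoop's Java API):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // reduce(k2, list(v2)) -> list(k3, v3): sum the counts emitted for each word
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}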
Predicate-Based Sampling: A Map-Reduce Based Solution
K = required sample size, N = number of map tasks.
• Each map task evaluates the predicate(s) on the (key, value) pairs in its input split and collects the first K pairs that satisfy the predicate(s).
• The reduce task then selects the first K pairs from the map outputs.
• With N mappers, the map output contains at most N*K (key, value) pairs.
But we are processing all of the input data!
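A sketch of this job is shown below. It is an illustrative reconstruction rather than the authors' code: the sample size K and the predicate are hard-coded, and the job is assumed to be configured with a single reducer.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PredicateSample {
  static final int K = 10000;   // required sample size (the LIMIT clause)

  // Each map task emits at most K records that satisfy the predicate.
  public static class SampleMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    private int emitted = 0;

    @Override
    protected void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      if (emitted >= K) {
        return;                          // this task already has K matches
      }
      if (satisfiesPredicate(record)) {
        context.write(NullWritable.get(), record);
        emitted++;
      }
    }

    private boolean satisfiesPredicate(Text record) {
      // Placeholder for the user-supplied WHERE clause.
      return record.toString().contains("CALIFORNIA");
    }
  }

  // The single reducer keeps the first K of the at most N*K map outputs.
  public static class FirstKReducer
      extends Reducer<NullWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      int kept = 0;
      for (Text v : values) {
        if (kept++ >= K) {
          break;
        }
        context.write(NullWritable.get(), v);
      }
    }
  }
}

Note that even after a map task has found its K matches, Hadoop still invokes map() for every remaining record of the split, and every split is still scheduled; that wasted work is exactly the inefficiency discussed next.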
Inefficiencies in the Native Implementation
• Delay in the completion of the Map phase.
• All input splits must be processed for the job to achieve its goal.
• In the case of predicate-based sampling, its benefits can be lost if significant time and resources are wasted on futile work that was not truly required to compute the desired result.
What Happens at Runtime?
[Figure: all input splits are processed by the Map phase; the map outputs feed the Reduce, whose output is the desired sample: 'K' records, each of which satisfies the given predicate(s).]
Did we really need to process the whole input to produce the desired sample? The input data could be in the range of tera/petabytes.
Hadoop with Incremental Processing
[Figure: a configurable Input Provider receives reports from the map tasks and answers "do we need to process more input?" If yes, more input is added to the Map phase; if no ("we are good"), the Reduce phase produces the output.]
The job produces the desired output, but does less work and finishes earlier.
Hadoop with Incremental Processing
[Figure: map task reports flow to the Input Provider, which selects input splits from the DFS. At each evaluation it answers "is more input needed?" with one of three responses:]
• "Input Available": add more input splits to the running job.
• "End of Input": no further addition of input.
• Not enough information: "wait and see" until the next evaluation.
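The slides do not show the actual Java contract of the pluggable Input Provider, so the sketch below is hypothetical (the names InputProvider, Response, and JobStats are mine); it only illustrates the three responses described above.

public interface InputProvider {

  // Hypothetical snapshot of the map task reports gathered so far.
  final class JobStats {
    public long qualifyingRecordsFound;  // records that satisfied the predicate(s)
    public long requiredSampleSize;      // K, the desired sample size
    public int  runningMapTasks;         // map tasks still in flight
  }

  enum Response {
    INPUT_AVAILABLE,   // add more input splits to the running job
    WAIT_AND_SEE,      // not enough information yet; ask again next interval
    END_OF_INPUT       // desired output is reachable; add no more input
  }

  // Invoked by the framework once per evaluation interval.
  Response needMoreInput(JobStats reports);
}

// Example provider for the sampling job: stop adding input once enough
// qualifying records have been reported.
class SamplingInputProvider implements InputProvider {
  @Override
  public Response needMoreInput(JobStats s) {
    if (s.qualifyingRecordsFound >= s.requiredSampleSize) {
      return Response.END_OF_INPUT;
    }
    if (s.runningMapTasks > 0) {
      return Response.WAIT_AND_SEE;   // pending tasks may still contribute
    }
    return Response.INPUT_AVAILABLE;
  }
}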
We Have a Mechanism; We Need a Policy!
In controlling the intake of input, decisions need to take into account:
• the capacity of the cluster
• the current (or expected) load on the cluster
• the priority of the job (acceptable response time and resource consumption)
These factors can be combined in different ways to form different policies.
Defining a Policy
• Grab Limit
• Work Threshold
• Evaluation Interval
Forming Policies
TS = total map slots in the cluster
AS = available map slots in the cluster
IS = number of input splits

Policies, in decreasing degree of aggressiveness:

Policy                    Grab Limit                     Work Threshold
Hadoop (Hadoop default)   Infinity                       N/A
HA (Highly Aggressive)    max(0.5*TS, AS)                0
MA (Mid Aggressive)       AS != 0 ? 0.5*AS : 0.2*TS      5% of IS
LA (Less Aggressive)      AS != 0 ? 0.2*AS : 0.1*TS      10% of IS
C (Conservative)          0.1*AS                         15% of IS

Evaluation Interval: kept at 4 seconds under each policy.
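For concreteness, a small helper that evaluates the Grab Limit column of the table from TS and AS is sketched below; this is only an illustration of the formulas above, not the authors' implementation.

public final class GrabLimit {
  // ts = total map slots in the cluster, as = currently available map slots
  public static int compute(String policy, int ts, int as) {
    switch (policy) {
      case "HADOOP": return Integer.MAX_VALUE;               // "Infinity": grab everything
      case "HA":     return Math.max((int) (0.5 * ts), as);  // Highly Aggressive
      case "MA":     return as != 0 ? (int) (0.5 * as) : (int) (0.2 * ts);
      case "LA":     return as != 0 ? (int) (0.2 * as) : (int) (0.1 * ts);
      case "C":      return (int) (0.1 * as);                // Conservative
      default:       throw new IllegalArgumentException("Unknown policy: " + policy);
    }
  }
}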
EXPERIMENTAL EVALUATION
Does This Work?

Queries & test data: the LINEITEM table from the TPC-H dataset.

SELECT <attribute> FROM
LINEITEM WHERE <predicate>
LIMIT 10000

Hardware: 10-node cluster; each node has 12 GB RAM and 4 cores/disks.

Data skew is modeled with a Zipf distribution.
Single User Workload

SELECT ORDERKEY, PARTKEY, SUPPKEY
FROM LINEITEM
WHERE predicate LIMIT 10000

Run on an idle Hadoop cluster.

Single User Workload: Uniform Distribution (Z=0)
[Chart: varying dataset size]

Single User Workload: High Skew (Z=2)
[Chart: varying dataset size]

Single User Workload: Moderate Skew
[Chart: partitions processed]
Homogeneous Workload

SELECT <attribute(s)>
FROM LINEITEM
WHERE predicate
LIMIT 10000

• 10 concurrent users; each submits a query, waits for completion, and submits again.
Single User: Not a Realistic Setting!
[Figure: sampling queries (SELECT <attribute(s)> FROM LINEITEM WHERE predicate LIMIT 10000) share the Hadoop cluster with other, possibly large, jobs/queries; the scheduling policy and resource usage determine how we maximize throughput.]
Homogeneous Workload
Heterogeneous Workload

Class A (sampling):
SELECT <attribute(s)>
FROM LINEITEM
WHERE predicate
LIMIT 10000

Class B (non-sampling):
SELECT <attribute(s)>
FROM LINEITEM
WHERE predicate
ORDER BY <attribute>

• Vary the % of users in class A (sampling).
• Vary the policy used by users in class A.
• Measure the impact of the policy in use on the throughput for the other class of users.
Heterogeneous Workload (Default Scheduler)
Heterogeneous Workload (Fair Scheduler)
Scheduler Impact
• So far we have seen experimental results using the default Hadoop scheduler.
• Alternate scheduler: the Fair Scheduler.
• The scheduler decides the allocation of resources amongst running jobs; incremental processing allows a job to ask for fewer resources.
Analysis
Analysis (continued)
CPU Utilization: Fair Scheduler (10 pools)
[Chart: CPU utilization (%) over time]

CPU Utilization: Default Scheduler
[Chart: CPU utilization (%) over time]
So What Do We Conclude?
• It is trivial to express predicate-based sampling as a single Map-Reduce job.
• Hadoop's default approach to job execution is not well-suited to dynamic jobs such as our predicate-based sampling problem.
• We modified the execution model to support incremental processing. The framework allows the user to plug in the logic for the Input Provider and the execution policy.
Conclusions (continued)
• Experiments on an idle cluster with no concurrent jobs showed that a conservative policy (such as C) does not utilize the cluster to its capacity. In contrast, in a multi-user setting, a conservative policy was seen to consume fewer resources and give higher throughput.
• The LA policy emerged as a good overall policy in both the homogeneous and the heterogeneous workload settings.
My Contribution!
• Input Provider extension
• Dynamic selection of policies
References
• A. Thusoo, Z. Shao, et al. "Data Warehousing and Analytics Infrastructure at Facebook."
• Hadoop. http://hadoop.apache.org
• Hive Website. http://hadoop.apache.org/hiv
• S. Babu. "Towards Automatic Optimization of MapReduce Programs."