Performance Modeling and Optimization for Pig program


Meeting Service Level Objectives
of Pig Programs
Zhuoyao Zhang, Ludmila Cherkasova,
Abhishek Verma, Boon Thau Loo
University of Pennsylvania
Hewlett-Packard Labs
Cloud Environment
• Advantages
▫ Large amount of resources
▫ Elasticity
▫ Pay-as-you-go pricing model
• Challenges
▫ Distributed resources
▫ Error-prone
MapReduce and Pig
• MapReduce: Simple and fault-tolerant
framework for data processing in the cloud
• Pig
▫ Advanced MapReduce-based platform
▫ Widely used: Yahoo!, Twitter, LinkedIn
▫ PigLatin: A high-level declarative language for
expressing data analysis tasks as Pig programs
[Figure: a Pig program executed as a workflow of MapReduce jobs j1 to j7]
Motivation
• Latency-sensitive applications
▫ Personalized advertising
▫ Spam and fraud detection
▫ Real-time log analysis
• How many resources does an application need
to meet its deadline?
Contributions
• Performance modeling for Pig programs
▫ Given a Pig program, estimate its completion
time as a function of the assigned resources
• Deadline-driven resource allocation
estimates for Pig programs
▫ Given a completion time target, determine the
amount of resources a Pig program needs to
achieve it
Outline
• Introduction
• Building block
▫ Performance model for single MapReduce jobs
• Resource allocation for Pig programs
• Evaluation
• Conclusion and ongoing work
Theoretical Makespan Bounds
• Bounds-based makespan estimates
▫ n tasks, k servers
▫ avg: average duration of the n tasks
▫ max: maximum duration of the n tasks
• Lower bound
$T^{low} = \frac{n \cdot avg}{k}$
• Upper bound
$T^{up} = \frac{(n - 1) \cdot avg}{k} + max$
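The two bounds can be checked numerically with a minimal sketch. It assumes a greedy assignment in which each task, taken in list order, goes to the server that becomes free first. The function names, task durations, and server count below are hypothetical and do not come from the slides.

```python
import heapq


def makespan_bounds(durations, k):
    """Return (lower, upper) makespan bounds for n tasks on k servers."""
    n = len(durations)
    avg = sum(durations) / n
    lower = n * avg / k
    upper = (n - 1) * avg / k + max(durations)
    return lower, upper


def greedy_makespan(durations, k):
    """Simulate greedy assignment (an assumption of this sketch):
    each task, in list order, runs on the server that frees up first."""
    free_at = [0.0] * k            # min-heap of times at which servers become free
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Hypothetical example: 6 tasks on 3 servers.
durations = [2, 5, 3, 4, 1, 6]
k = 3
lower, upper = makespan_bounds(durations, k)
actual = greedy_makespan(durations, k)
print(f"lower={lower:.2f} actual={actual:.2f} upper={upper:.2f}")
```

For this input the simulated makespan falls between the two bounds, which is the property the estimates rely on.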
Illustration
Schedule 1: 1 4 3 2 3 1 2
▫ Makespan = 4
▫ Lower bound = 4
Schedule 2: 3 1 2 3 2 1 4
▫ Makespan = 7
▫ Upper bound = 8
Estimate Completion Time for Single MR Job
• Estimate the bounds of the job completion
time based on the job profile
▫ Most production jobs are executed routinely on
new data sets
▫ Job profile built from previous runs
▪ Map stage: $M_{avg}$, $M_{max}$, AvgInputSize, Selectivity
▪ Reduce stage: $Sh_{avg}$, $Sh_{max}$, $R_{avg}$, $R_{max}$, Selectivity
▫ Use the profile to predict the completion time of
future runs
Estimate Completion Time for Single MR Job
• Estimating bounds on the duration of map and
reduce stages
• Map stage duration depends on:
▫ $N_M$ -- the number of map tasks
▫ $S_M$ -- the number of map slots
$T_M^{low} = M_{avg} \cdot \frac{N_M}{S_M}$
$T_M^{up} = M_{avg} \cdot \frac{N_M - 1}{S_M} + M_{max}$
• Reduce stage duration depends on:
▫ $N_R$ -- the number of reduce tasks
▫ $S_R$ -- the number of reduce slots
• Job duration: $T_J^{low}$, $T_J^{up}$, $T_J^{avg}$
▫ Sum of the map and reduce stage durations
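The stage bounds and their sum can be sketched as follows. The sketch makes two assumptions that the slides do not state: the shuffle and reduce phases of the reduce stage are bounded with the same formula as the map stage, and the average estimate is the midpoint of the lower and upper bounds. `JobProfile`, the function names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Per-task durations (in seconds) measured on a previous run."""
    m_avg: float
    m_max: float
    sh_avg: float
    sh_max: float
    r_avg: float
    r_max: float


def stage_bounds(n_tasks, n_slots, avg, longest):
    """Lower and upper bounds for a stage of n_tasks on n_slots."""
    low = avg * n_tasks / n_slots
    up = avg * (n_tasks - 1) / n_slots + longest
    return low, up


def job_bounds(p, n_map, n_reduce, s_map, s_reduce):
    """Return (T_low, T_up, T_avg) for the whole job.

    Assumption: shuffle and reduce phases are each bounded like the
    map stage, and T_avg is the midpoint of T_low and T_up.
    """
    phases = [
        stage_bounds(n_map, s_map, p.m_avg, p.m_max),
        stage_bounds(n_reduce, s_reduce, p.sh_avg, p.sh_max),
        stage_bounds(n_reduce, s_reduce, p.r_avg, p.r_max),
    ]
    t_low = sum(low for low, _ in phases)
    t_up = sum(up for _, up in phases)
    return t_low, t_up, (t_low + t_up) / 2


# Hypothetical profile and cluster allocation.
profile = JobProfile(m_avg=10, m_max=20, sh_avg=5, sh_max=8, r_avg=4, r_max=6)
t_low, t_up, t_avg = job_bounds(profile, n_map=100, n_reduce=20, s_map=20, s_reduce=10)
print(f"T_low={t_low:.1f}s T_up={t_up:.1f}s T_avg={t_avg:.1f}s")
```

Because the profile holds only per-task statistics, the same profile can be reused with different task and slot counts to predict future runs.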
Resource Allocation for Single MR Job
• Given a deadline D and the job profile, find the
minimal resources needed to complete the job within D
▫ Inputs: statistics from the job profile and the
given number of map/reduce tasks
▫ Find the values of $S_M^J$ and $S_R^J$ that minimize
$S_M^J + S_R^J$, using Lagrange multipliers
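The Lagrange step can be worked out as a short derivation. It assumes the completion-time estimate takes the form $A / S_M^J + B / S_R^J + C$, where $A$, $B$, and $C$ are constants derived from the job profile and the task counts (the form used for Pig programs later in this talk), and that the minimum is reached when the deadline constraint is tight. This is a sketch under those assumptions, not necessarily the exact derivation used by the authors.

```latex
\begin{align*}
&\text{minimize } S_M^J + S_R^J
  \quad \text{subject to} \quad \frac{A}{S_M^J} + \frac{B}{S_R^J} = D - C \\
&\mathcal{L} = S_M^J + S_R^J
  + \lambda \left( \frac{A}{S_M^J} + \frac{B}{S_R^J} - (D - C) \right) \\
&\frac{\partial \mathcal{L}}{\partial S_M^J} = 1 - \frac{\lambda A}{(S_M^J)^2} = 0
  \;\Rightarrow\; S_M^J = \sqrt{\lambda A},
  \qquad
  \frac{\partial \mathcal{L}}{\partial S_R^J} = 0
  \;\Rightarrow\; S_R^J = \sqrt{\lambda B} \\
&\text{Substituting into the constraint: }
  \sqrt{\lambda} = \frac{\sqrt{A} + \sqrt{B}}{D - C} \\
&S_M^J = \frac{A + \sqrt{AB}}{D - C},
  \qquad
  S_R^J = \frac{B + \sqrt{AB}}{D - C}
\end{align*}
```

In practice the resulting values would be rounded up to whole slots.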
Outline
• Introduction
• Building block
▫ Performance model for single MapReduce jobs
• Resource allocation for Pig programs
• Evaluation
• Conclusion and ongoing work
Performance Model for Pig Programs
• Let P = {J1, J2, ..., JN}; extract the job profile of
each job contained in P
▫ Assign a unique name to each job within a program
• The program completion time is estimated as the sum of the
completion times of all the jobs contained in P
$T_P = \sum_{1 \le i \le N} T_i$
Resource Allocation for Pig Programs
• Possible strategy: find an appropriate pair of
map and reduce slot counts for each job in the program
$$
\begin{cases}
\frac{A_1}{S_M^1} + \frac{B_1}{S_R^1} + C_1 \le d_1 \\
\frac{A_2}{S_M^2} + \frac{B_2}{S_R^2} + C_2 \le d_2 \\
\quad \vdots \\
\frac{A_N}{S_M^N} + \frac{B_N}{S_R^N} + C_N \le d_N
\end{cases}
\quad \text{with} \quad \sum_{1 \le i \le N} d_i \le D
$$
• Problem: difficult for the scheduler to
implement and manage
Resource Allocation for Pig Programs
• A simpler and more elegant solution
▫ Allocate the same set of resources to the entire
program instead of a separate set to each job
• Rewrite the previous equations into
$T_P = \frac{\sum_{1 \le i \le N} A_i}{S_M^P} + \frac{\sum_{1 \le i \le N} B_i}{S_R^P} + \sum_{1 \le i \le N} C_i \le D$
Find the minimum set of map and reduce slots
$(S_M^P, S_R^P)$ for the entire Pig program
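A minimal sketch of this program-level allocation follows. It assumes the per-job coefficients $A_i$, $B_i$, $C_i$ are already known, treats the deadline constraint as tight, and uses the closed-form minimizer of $S_M^P + S_R^P$ obtained with Lagrange multipliers. Rounding both values up keeps the deadline satisfied but is not guaranteed to give the smallest integer pair. The function names and all numbers are hypothetical.

```python
import math


def program_allocation(jobs, deadline):
    """Return (map_slots, reduce_slots) for a whole Pig program.

    jobs: list of (A_i, B_i, C_i) tuples, one per MapReduce job.
    Assumption: the minimum lies where sum(A)/S_M + sum(B)/S_R + sum(C)
    equals the deadline.
    """
    a = sum(j[0] for j in jobs)
    b = sum(j[1] for j in jobs)
    c = sum(j[2] for j in jobs)
    budget = deadline - c          # time left for the slot-dependent terms
    if budget <= 0:
        raise ValueError("deadline cannot be met with any number of slots")
    root = math.sqrt(a * b)
    s_map = (a + root) / budget
    s_reduce = (b + root) / budget
    # Round up to whole slots so the estimate stays within the deadline.
    return math.ceil(s_map), math.ceil(s_reduce)


def estimated_time(jobs, s_map, s_reduce):
    """Evaluate the program completion-time estimate for a given allocation."""
    return sum(a / s_map + b / s_reduce + c for a, b, c in jobs)


# Hypothetical coefficients for a two-job program and a 130-second deadline.
jobs = [(400, 100, 10), (500, 300, 20)]
s_map, s_reduce = program_allocation(jobs, deadline=130)
print(s_map, s_reduce, estimated_time(jobs, s_map, s_reduce))
```

Using one slot pair for the whole program means the scheduler tracks a single allocation rather than one per job.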
Experiment Setup
• 66-node cluster in 2 racks
▫ 4 AMD 2.39GHz cores
▫ 8 GB RAM
▫ Two 160GB hard disks
• Configuration
▫ 1 jobtracker, 1 namenode, 64 worker nodes
▫ 2 map slots and 1 reduce slot for each node
Benchmark
• PigMix benchmark
▫ 17 programs
▫ 8 tables as the input data
• Dataset
▫ Test dataset
▪ Generated with the PigMix data generator
▪ Total size around 1TB
▫ Experimental dataset
▪ Same layout as the test dataset
▪ 20% larger in size
Model Accuracy
• How well does our performance model capture Pig
program completion time?
[Figure: normalized results for predicted and measured completion time]
Meeting Deadlines
• Are we meeting deadlines with our resource
allocation model?
[Figure: PigMix executed on the experimental dataset: do we meet deadlines?]
Conclusion
• Conclusion
▫ The performance model can accurately estimate the
completion time of MapReduce workflows
▫ It enables automatic resource provisioning for
MapReduce workflows with deadlines
• Ongoing work
▫ Refine the performance model for workflows with
concurrent jobs
▫ Incorporate failure scenarios into the current model
Thank you