Join Algorithms In MapReduce


A Comparison of Join Algorithms for Log Processing in MapReduce

Spyros Blanas, Jignesh M. Patel (University of Wisconsin-Madison)
Eugene J. Shekita, Yuanyuan Tian (IBM Almaden Research Center)
SIGMOD 2010
August 1, 2010
Presented by Hyojin Song

Contents

Introduction

Join Algorithms In MapReduce

Experimental Evaluation

Discussion

Conclusion


Introduction(1/3)

Log Processing
– Important type of data analysis commonly done with MapReduce
– A log of events
  • click-stream
  • log of phone call records
  • a sequence of transactions
– To compute various statistics for business insight
  • Often needs to be joined, filtered, aggregated, or mined for patterns
– Log data and reference data (user information)

Log Table

Call records            Number
2010.09.24.14:20.30     01191655603
2010.09.24.14:30.45     01046841397
2010.09.25.19:11.118    01926540846
2010.09.28.06:40.97     01098446512
2010.09.29.08:44.08     01013461655
……                      ……

Reference Table

Number         Name
01191655603    송효진
01046841397    안철수
01926540846    한효주
01098446512    안인석
01013461655    마음이
……             ……

Introduction(2/3)

MapReduce Framework
– Used to analyze large volumes of data
– The success of MapReduce
  • Simple programming framework
  • Manages parallelization, fault tolerance, and load balancing
– The criticisms of MapReduce
  • lack of a schema
  • lack of a declarative query language
  • lack of indexes
– Difficult for joins
  • Not originally designed to combine information from several data sources
  • Users resort to simple but inefficient algorithms to perform joins

Introduction(3/3)

The benefits of MapReduce for log processing
– Scalability
  • China Mobile gathers 5-8TB of phone call records per day
  • Facebook collects almost 6TB of new log data every day, with 1.7PB in total
– Schema free
  • Flexibility: a log record may also change over time
– Simple scans preferable (<-> index scans)
– Time-consuming work
  • Graceful fault-tolerance support (<-> parallel RDBMS)

The goal of this paper
– The implementation of several well-known join strategies in MapReduce
– Comprehensive experiments to compare these join techniques

Join Algorithms In MapReduce

Problem Statement
1. Repartition Join
2. Improved Repartition Join
3. Directed Join
4. Broadcast Join
5. Semi-Join
6. Per-split Semi-Join

Join Algorithms in MR

Problem Statement

– An equi-join between a log table L and a reference table R on a single column, with |L| >> |R|
– Performance can be further improved with some preprocessing techniques
  • Well-known in the RDBMS literature
  • Adapting them to MapReduce is not always straightforward
– Crucial implementation details of these join algorithms
  • To implement two additional functions: init() and close()
  • These are called before and after each map or reduce task
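The call order of init() and close() around a task can be sketched in Python as follows. This is only a simulation under assumptions: the `MapTask` class and the `run_task` driver are hypothetical names for illustration and do not correspond to the API of any MapReduce framework.

```python
class MapTask:
    """Hypothetical map task with init()/close() hooks around map()."""

    def __init__(self):
        self.events = []

    def init(self):
        # Called once before the first record, e.g. to load a hash table.
        self.events.append("init")

    def map(self, key, value):
        # Called once per input record; here it simply passes the record on.
        self.events.append(("map", key))
        return [(key, value)]

    def close(self):
        # Called once after the last record, e.g. to emit buffered output.
        self.events.append("close")
        return []


def run_task(task, records):
    """Simulated driver: init(), then map() per record, then close()."""
    output = []
    task.init()
    for key, value in records:
        output.extend(task.map(key, value))
    output.extend(task.close())
    return output
```

The join algorithms on the following slides rely on these hooks, for example to build a hash table before any record is mapped.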

Join Algorithms in MR

1. Repartition Join

The most commonly used join strategy in the MapReduce framework
– L and R are dynamically partitioned on the join key
– The corresponding pairs of partitions are joined
– Similar to the partitioned sort-merge join in parallel RDBMSs

Example Tables (Log table & User table)
– Log table
  • 500,000 records
  • Each log record has a lecture name and a grade
– User table
  • 10,000 records
– Join key is the student ID

Log Table

Log       Student ID
DB B+     2008-2424
KRR A     2010-8281
Opt A     2005-3682
ML C0     2009-0078
OS A+     2010-1004
NL D      2008-0909
…         …

User Table

Student ID    Name
2008-0909     Ahn Jaemin
2010-1004     Kim Somin
2009-0078     Song Hyojin
2005-3682     Lee Taewhi
2010-8281     An Inseok
…             …

Join Algorithms in MR

1. Repartition Join

[Figure: map phase of the standard repartition join. Each map task reads a split of R or L from the distributed file system, tags each record with its source table, and writes intermediate results keyed by student ID to local disk, e.g. (2010-8281, L: KRR A), (2008-2424, L: DB B), (2009-0078, R: Song), (2010-8281, R: An). In the reduce phase, the records for each key are collected in a buffer.]

Join Algorithms in MR

1. Repartition Join

[Figure: reduce phase of the standard repartition join. The intermediate results read from local disk are separated per join key into two buffers, B_R for R records and B_L for L records, and the matching pairs are written to the output file in the distributed file system.]

Output File (Distributed File System)

Student ID    Name            Log
2009-0078     Song Hyo Jin    ML C
2010-8281     An In Seok      KRR A
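The standard repartition join above can be sketched in Python roughly as follows. This is a single-process simulation, not framework code: the shuffle is imitated with a dictionary, and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def repartition_map(table_tag, records):
    """Map phase: tag each record with its source table, keyed by join key."""
    for join_key, payload in records:
        yield join_key, (table_tag, payload)


def repartition_reduce(join_key, tagged_values):
    """Reduce phase: buffer both sides for one key, then emit all pairs."""
    buf_r, buf_l = [], []
    for tag, payload in tagged_values:
        (buf_r if tag == "R" else buf_l).append(payload)
    for r in buf_r:
        for l in buf_l:
            yield join_key, r, l


def repartition_join(log_records, ref_records):
    """Simulated shuffle: group all map output by join key, then reduce."""
    groups = defaultdict(list)
    for key, value in repartition_map("L", log_records):
        groups[key].append(value)
    for key, value in repartition_map("R", ref_records):
        groups[key].append(value)
    result = []
    for key in sorted(groups):
        result.extend(repartition_reduce(key, groups[key]))
    return result


log_table = [("2008-2424", "DB B+"), ("2010-8281", "KRR A"), ("2009-0078", "ML C0")]
user_table = [("2009-0078", "Song Hyojin"), ("2010-8281", "An Inseok")]
joined = repartition_join(log_table, user_table)
```

Note that `repartition_reduce` holds both `buf_r` and `buf_l` in memory for a key before producing any output.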

Join Algorithms in MR

1. Repartition Join

Standard Repartition Join
– Potential problem: all records for a given join key have to be buffered
  • They may not fit in memory when the data is highly skewed or the key cardinality is small
– Variants of the standard repartition join are used in Pig, Hive, and Jaql today
  • They all suffer from the buffering problem

Improved Repartition Join
– The output key is changed to a composite of the join key and the table tag
– The partitioning and grouping functions are customized
– Records from the smaller table R are buffered and L records are streamed to generate the join output

Join Algorithms in MR

2. Improved Repartition Join

[Figure: map phase of the improved repartition join. Each map task reads a split of R or L from the distributed file system and writes intermediate results to local disk under a composite output key made of the student ID and the table tag, e.g. (2010-8281 L, L: KRR A), (2009-0078 R, R: Song). The reduce phase groups these records by student ID.]

Join Algorithms in MR

2. Improved Repartition Join

[Figure: reduce phase of the improved repartition join. For each join key, only the R records (e.g. 2010-8281 R: An, 2009-0078 R: Song) are held in the buffer B_R; the L records are streamed against it and the result is written to the output file in the distributed file system.]

Output File (Distributed File System)

Student ID    Name            Log
2009-0078     Song Hyo Jin    ML C
2010-8281     An In Seok      KRR A
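A minimal Python sketch of the improved repartition join, under stated assumptions: the composite key is modeled as a `(join_key, tag)` tuple, the tag values are chosen so that R sorts before L, and the partitioner and grouping step are hypothetical stand-ins for the customized functions mentioned on the slide.

```python
from itertools import groupby

TAG_R, TAG_L = 0, 1  # assumption: R sorts before L within one join key


def improved_map(tag, records):
    """Emit a composite key (join_key, tag) so the tag takes part in sorting."""
    for join_key, payload in records:
        yield (join_key, tag), payload


def partition(composite_key, num_reducers):
    """Custom partitioner: hash only the join-key part of the composite key."""
    join_key, _tag = composite_key
    return hash(join_key) % num_reducers


def improved_reduce(join_key, sorted_values):
    """Buffer only R records; stream each L record against that buffer."""
    buf_r = []
    for (_key, tag), payload in sorted_values:
        if tag == TAG_R:
            buf_r.append(payload)
        else:
            for r in buf_r:
                yield join_key, r, payload


def improved_repartition_join(log_records, ref_records, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    pairs = list(improved_map(TAG_L, log_records))
    pairs += list(improved_map(TAG_R, ref_records))
    for key, payload in pairs:
        buckets[partition(key, num_reducers)].append((key, payload))
    result = []
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])  # sort on the full composite key
        # Custom grouping: group on the join key only, ignoring the tag.
        for join_key, group in groupby(bucket, key=lambda kv: kv[0][0]):
            result.extend(improved_reduce(join_key, group))
    return result
```

Because all R records for a key arrive first, only `buf_r` is held in memory, which is the point of the composite key.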

Join Algorithms in MR

3. Directed Join

Preprocessing for Repartition Join (Directed Join)
– Both L and R have already been partitioned on the join key
  • Pre-partitioning L on the join key
  • Then at query time, matching partitions from L and R can be directly joined
– A map-only MapReduce job
  • During the init phase, R_i is retrieved from the DFS if it is not already in local storage
  • A main-memory hash table is built on R_i
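The directed join can be illustrated with a small Python simulation. The partitioning function, the `DirectedJoinTask` class, and the driver are assumptions for illustration; in particular, "retrieving R_i from the DFS" is modeled by simply passing the partition in.

```python
def partition_id(join_key, num_partitions):
    # Deterministic toy partitioning function (an assumption of this sketch).
    return sum(join_key.encode()) % num_partitions


def pre_partition(records, num_partitions):
    """Preprocessing step: split a table into partitions on the join key."""
    parts = [[] for _ in range(num_partitions)]
    for join_key, payload in records:
        parts[partition_id(join_key, num_partitions)].append((join_key, payload))
    return parts


class DirectedJoinTask:
    """Map-only task that joins partition L_i with the matching R_i."""

    def __init__(self, r_partition):
        self.r_partition = r_partition
        self.hash_table = {}

    def init(self):
        # Build a main-memory hash table on R_i before any L record arrives.
        for join_key, payload in self.r_partition:
            self.hash_table.setdefault(join_key, []).append(payload)

    def map(self, join_key, l_payload):
        # Probe the hash table with each L record.
        for r_payload in self.hash_table.get(join_key, []):
            yield join_key, r_payload, l_payload


def directed_join(log_records, ref_records, num_partitions=2):
    l_parts = pre_partition(log_records, num_partitions)
    r_parts = pre_partition(ref_records, num_partitions)
    result = []
    for l_i, r_i in zip(l_parts, r_parts):
        task = DirectedJoinTask(r_i)
        task.init()
        for join_key, payload in l_i:
            result.extend(task.map(join_key, payload))
    return result
```

No shuffle is simulated here, since matching partitions are joined directly in the map tasks.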

Join Algorithms in MR

4. Broadcast Join

Broadcast Join
– In most applications, |R| << |L|
– Instead of moving both R and L across the network, the smaller table R is broadcast, which avoids the network overhead of moving the larger table L
– A map-only job
– Each map task uses a main-memory hash table for either L or R

Join Algorithms in MR

4. Broadcast Join

Broadcast Join
– If R is smaller than a split of L
  • Build the hash table on R
– If R is larger than a split of L
  • Build the hash table on the split of L

Preprocessing for Broadcast Join
– Most nodes in the cluster have a local copy of R in advance
– This avoids retrieving R from the DFS in the init() function
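The choice of which side to hash can be sketched in Python as below. This is an illustration under assumptions: table size is approximated by record count, and `broadcast_join_split` is a hypothetical function standing in for a single map task.

```python
def build_hash_table(records):
    """Main-memory hash table from join key to the list of payloads."""
    table = {}
    for key, payload in records:
        table.setdefault(key, []).append(payload)
    return table


def broadcast_join_split(l_split, r_table):
    """Join one split of L against the broadcast table R in one map task."""
    result = []
    if len(r_table) <= len(l_split):
        # R is the smaller side: hash R, then probe with each L record.
        table = build_hash_table(r_table)
        for key, l_payload in l_split:
            for r_payload in table.get(key, []):
                result.append((key, r_payload, l_payload))
    else:
        # The split of L is smaller: hash the split, then scan R and probe.
        table = build_hash_table(l_split)
        for key, r_payload in r_table:
            for l_payload in table.get(key, []):
                result.append((key, r_payload, l_payload))
    return result
```

Both branches produce the same set of joined records; only the memory footprint of the hash table differs.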

Join Algorithms in MR

5. Semi-Join

Semi-Join
– In some applications, |R| << |L|
  • At Facebook, the user table has hundreds of millions of records
  • A few million unique active users per hour
– To avoid sending over the network the records in R that will not join with L

Preprocessing for Semi-Join
– The first two phases of the semi-join can be moved to a preprocessing step
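The slide does not spell out the phases, so the following Python sketch assumes a three-phase structure: extract the unique join keys of L, filter R down to the referenced records, then join the filtered R with L. The function name and the in-memory data structures are illustrative only.

```python
def semi_join(log_records, ref_records):
    # Phase 1 (assumed): extract the unique join keys that appear in L.
    l_unique_keys = {key for key, _payload in log_records}

    # Phase 2 (assumed): keep only the R records referenced by some L record.
    r_filtered = [(key, payload) for key, payload in ref_records
                  if key in l_unique_keys]

    # Phase 3 (assumed): join the filtered R with L using a hash table on R.
    table = {}
    for key, r_payload in r_filtered:
        table.setdefault(key, []).append(r_payload)
    joined = [(key, r_payload, l_payload)
              for key, l_payload in log_records
              for r_payload in table.get(key, [])]
    return r_filtered, joined
```

Only `r_filtered`, not the whole of R, would need to be sent over the network in the last phase.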

Join Algorithms in MR

6. Per-Split Semi-Join

Per-Split Semi-Join
– The problem with semi-join: not every record in the extracted R will join with a particular split L_i
– L_i can be joined directly with R_i (the records of R referenced by L_i)

Preprocessing for Per-split Semi-join
– Also benefits from moving its first two phases to a preprocessing step
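A per-split variant can be sketched in the same style. As before, this is a hedged single-process illustration: `filter_r_for_split` and `per_split_semi_join` are hypothetical names, and the phases are assumed to mirror those of the semi-join, applied to each split L_i separately.

```python
def filter_r_for_split(l_split, ref_records):
    """Build R_i: the records of R whose key appears in the split L_i."""
    keys_i = {key for key, _payload in l_split}
    r_i = {}
    for key, r_payload in ref_records:
        if key in keys_i:
            r_i.setdefault(key, []).append(r_payload)
    return r_i


def per_split_semi_join(l_splits, ref_records):
    """Join every split L_i directly with its own filtered R_i."""
    result = []
    for l_i in l_splits:
        r_i = filter_r_for_split(l_i, ref_records)
        for key, l_payload in l_i:
            for r_payload in r_i.get(key, []):
                result.append((key, r_payload, l_payload))
    return result
```

Each R_i contains only what its split needs, at the cost of possibly copying the same R record into several R_i.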

Experimental Evaluation

1. Environment
2. Datasets
3. MapReduce Time Breakdown
4. Experimental Results

Experimental Evaluation

1. Environment

System Specification
– All experiments run on a 100-node cluster
– Each node: a single 2.4GHz Intel Core 2 Duo processor, 4GB of DRAM and two SATA disks
– Red Hat Enterprise Server 5.2 running Linux 2.6.18

Network Specification
– The 100 nodes were spread across two racks
– Each rack had its own gigabit Ethernet switch
– The rack level bandwidth is 32Gb/s
– Under full load, 35MB/s cross-rack node-to-node bandwidth
– Each node can execute two map and two reduce tasks concurrently
– Hadoop version 0.19.0, HDFS (128MB block size)

Experimental Evaluation

2. Datasets

Datasets         Join column size    Record size            Total size
Event Log (L)    10 bytes            100 bytes (average)    500GB
User Info (R)    5 bytes             100 bytes (exactly)    10MB~100GB

• The join result is a 10-byte join key
• n-to-1 join
• Many users are inactive
• All the records in L always appear in the result
• The fraction of R referenced by L was fixed at 0.1%, 1%, or 10%
• To simulate some users being more active than others, a Zipf distribution was used
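How such a log might be generated can be sketched in Python. This is not the authors' generator: the function name, the Zipf exponent `zipf_s`, the seed, and the choice of which R keys count as referenced are all assumptions made for illustration.

```python
import random


def make_log_keys(r_keys, referenced_fraction, num_log_records,
                  zipf_s=1.0, seed=0):
    """Draw join keys for L from a fixed fraction of R with Zipf-like skew."""
    rng = random.Random(seed)
    # Only this fraction of R is ever referenced (e.g. 0.001, 0.01 or 0.1).
    num_referenced = max(1, round(len(r_keys) * referenced_fraction))
    referenced = r_keys[:num_referenced]
    # Zipf-like weights: the key at rank k is drawn with weight 1 / k**s.
    weights = [1.0 / (rank ** zipf_s) for rank in range(1, num_referenced + 1)]
    return rng.choices(referenced, weights=weights, k=num_log_records)
```

Every generated key exists in R, so each L record appears in the join result, and low-rank keys play the role of the most active users.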

Experimental Evaluation

3. MapReduce Time Breakdown

Experimental Evaluation

3. MapReduce Time Breakdown

MapReduce Time Breakdown
– What transpires during the execution of a MapReduce job
– The overhead of various execution components of MapReduce
– System Environment
  • The standard repartition join algorithm
  • 500GB log table and 30MB reference table
  • 1% of the reference table actually referenced by the log records
  • 4000 map tasks and 200 reduce tasks
  • Each node was assigned 40 map and 2 reduce tasks
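The task counts above are consistent with simple arithmetic, shown here as a small Python check. It assumes one map task per 128MB HDFS block of the log table, binary units (1GB = 1024MB), and an even spread over the 100 nodes; the small reference table is ignored.

```python
def task_counts(log_gb=500, block_mb=128, nodes=100, reduce_tasks=200):
    """Return (map tasks, map tasks per node, reduce tasks per node)."""
    map_tasks = (log_gb * 1024) // block_mb  # one map task per block of L
    return map_tasks, map_tasks // nodes, reduce_tasks // nodes


counts = task_counts()
```

With the defaults, `counts` evaluates to 4000 map tasks, 40 map tasks per node, and 2 reduce tasks per node.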

Experimental Evaluation

3. MapReduce Time Breakdown

Interesting Observations on MapReduce
– The map phase was clearly CPU-bound
– The reduce phase was limited by the network bandwidth
  • Writing the three copies of the join result to HDFS
– The disk and the network activities were moderate and periodic during the map phase
  • The peaks were related to the output generation in the map task
  • The shuffle phase in the reduce task
– Almost idle for about 30 seconds, between the 9 min and 10 min mark
  • Waiting for the slowest map task
– By enabling independent and concurrent map tasks, almost all CPU, disk and network activities can be overlapped

Experimental Evaluation

4. Experimental Results

[Figures: experimental results with no preprocessing and with preprocessing]

Experimental Evaluation

4. Experimental Results


Discussion

Choosing the Right Strategy
– To determine what is the right join strategy for a given circumstance
– To provide an important first step for query optimization


Conclusion

Joining log data with reference data in MapReduce has emerged as an important part of
– Analytic operations for enterprise customers
– Web 2.0 companies

To design a series of join algorithms on top of MapReduce
– Without requiring any modification to the actual framework
– To propose many details for efficient implementation
  • Two additional functions: init(), close()
  • Practical preprocessing techniques

Future work
– Multi-way joins
– Indexing methods to speed up join queries
– Optimization module (selecting appropriate join algorithms)
– New programming models to extend the MapReduce framework