Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
Hung-chih Yang (1), Ali Dasdan (1), Ruey-Lung Hsiao (2), D. Stott Parker (2)
(1) Yahoo!
(2) Computer Science Department, UCLA
SIGMOD 2007, Beijing, China
Presented by Jongheum Yeon, 2009. 08. 13.
Outline
• Introduction
• Map-Reduce
• Map-Reduce-Merge
• Conclusions
Copyright © 2009 by CEBT
Introduction
• New data-processing systems should consider alternatives to using big, traditional databases
• Map-Reduce does a good job, in a limited context, with extraordinary simplicity
• Map-Reduce-Merge tries to extend the applicability without giving up too much simplicity
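Introduction (cont’d)
[Figure: data-processing stacks compared layer by layer; each column is one stack]
Layer         Stack 1              Stack 2          Stack 3      Stack 4
Application   SQL                  Sawzall          ≈SQL         LINQ, SQL
Language      Parallel Databases   Sawzall          Pig, Hive    DryadLINQ, Scope
Execution     Parallel Databases   Map-Reduce       Hadoop       Dryad
Storage       Parallel Databases   GFS, BigTable    HDFS, S3     Cosmos, Azure, SQL Server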
Map-Reduce : Motivation
• Many special-purpose tasks operate on and produce large amounts of data
  – Crawled documents, web requests, etc.
  – Inverted indices, summaries, other kinds of derived data
• The work needs to be distributed across a large number of machines to finish in a reasonable time
  – Parallelize the computation
  – Distribute data
  – These extra concerns obscure the original computation
Map-Reduce : Benefits
• Automatic parallelization and distribution
  – User code complexity and size reduced
• Transparent fault-tolerance
• I/O scheduling
  – Fine-grained partitioning of tasks
  – Dynamically scheduled on available workers
• Status and monitoring
Map-Reduce : Programming Model
• Input & Output: each a set of key/value pairs
• Programmer specifies two functions:
  – map (in_key, in_value) -> list (out_key, intermediate_value)
    – Processes input key/value pair
    – Produces set of intermediate pairs
  – reduce (out_key, list(intermediate_value)) -> list (out_value)
    – Produces a set of merged output values (usually just one)
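The two signatures above can be illustrated with a minimal single-process sketch. The driver name run_map_reduce and the word-count functions are assumptions chosen for illustration, not part of the slides; a real system runs the same contract across many machines.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Single-process illustration of the map -> group -> reduce contract."""
    groups = defaultdict(list)
    # Map phase: each input pair may emit any number of intermediate pairs.
    for in_key, in_value in records:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)
    # Reduce phase: each distinct key is reduced together with all its values.
    return {key: list(reduce_fn(key, values)) for key, values in sorted(groups.items())}


def count_words_map(doc_id, text):
    # Emit (word, 1) for every word occurrence in the document.
    for word in text.split():
        yield word, 1


def count_words_reduce(word, counts):
    # Usually a single merged output value per key.
    yield sum(counts)


word_counts = run_map_reduce(
    [("d1", "to be or not to be"), ("d2", "to do")],
    count_words_map,
    count_words_reduce,
)
print(word_counts)
```

The user code is only the two small functions; grouping by key is handled by the driver, which is the part the framework would parallelize.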
Map-Reduce : Data Flow
[Figure: blocks of input data feed parallel Map tasks, whose outputs feed Reduce tasks]
Map-Reduce : Data Flow
• Map: generates a new key and its value
• Reduce: integrates the values that share the same key
[Figure: two Map tasks each read (Key1, Value1); the first emits (KeyA, ValueX) and (KeyB, ValueY), the second emits (KeyB, ValueZ); the Reduce tasks output A=X and B=Y,Z]
Map-Reduce : Architecture
[Figure: a Master, Worker nodes running Map and Reduce tasks, and GFS]
Map-Reduce : Architecture
• Master
  – Assigns and maintains the state of each map/reduce task
  – Propagates intermediate files to reduce tasks
• Worker
  – Executes Map or Reduce tasks at the Master's request
Map-Reduce : Distributed Processing
[Figure: the input file is split into Input 1 … Input M, each processed by a Map task; each Map task writes an intermediate file with partitions 1 … R; the shuffle delivers the partitions to the Reduce tasks, which write Output 1 … Output R of the output file]
Map-Reduce : Example
• Inverted Index

Documents:
DocID=1: IDS Lab's page
DocID=2: IDB Lab's page

Word dictionary:
Word    wordID
Lab     101
's      201
page    203
IDS     301
IDB     302

Inverted index:
wordID  docID  Location
101     1      1
101     2      1
201     1      2
201     2      2
203     1      3
203     2      3
301     1      0
302     2      0
Map-Reduce : Example (cont’d)
• Input data to Map
Key (docID)  Value (Text)
1            IDS Lab's page
2            IDB Lab's page
• Output of Map
Key (wordID)  Value (docID:Location)
301           1:0
101           1:1
201           1:2
203           1:3
302           2:0
101           2:1
201           2:2
203           2:3
Map-Reduce : Example (cont’d)
• Shuffle
  – Collects values with the same key and conveys them to Reduce
Key (wordID)  Value (docID:Location)
101           1:1 2:1
201           1:2 2:2
203           1:3 2:3
301           1:0
302           2:0
• Reduce writes the final result
101=1:1, 2:1
201=1:2, 2:2
203=1:3, 2:3
301=1:0
302=2:0
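The inverted-index walkthrough above can be reproduced with a small sketch. It assumes the documents are already tokenized (with the words rendered in English) and reuses the wordIDs from the slides; the function names index_map, shuffle, and index_reduce are hypothetical.

```python
from collections import defaultdict

# Word-to-ID dictionary from the slides (words rendered in English).
WORD_IDS = {"Lab": 101, "'s": 201, "page": 203, "IDS": 301, "IDB": 302}

# Assumption: documents arrive pre-tokenized, keyed by docID.
DOCUMENTS = {
    1: ["IDS", "Lab", "'s", "page"],
    2: ["IDB", "Lab", "'s", "page"],
}


def index_map(doc_id, tokens):
    # Emit (wordID, "docID:Location") for each token position.
    for location, token in enumerate(tokens):
        yield WORD_IDS[token], f"{doc_id}:{location}"


def shuffle(pairs):
    # Collect the values that share a key, ordered by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def index_reduce(word_id, postings):
    # Write one output line per wordID.
    return f"{word_id}=" + ", ".join(postings)


mapped = [pair for doc_id, tokens in DOCUMENTS.items() for pair in index_map(doc_id, tokens)]
index_lines = [index_reduce(word_id, postings) for word_id, postings in shuffle(mapped).items()]
print("\n".join(index_lines))
```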
Map-Reduce : Example (cont’d)
• Other Examples
  – Distributed Grep
  – Count URL Access Frequency
    – Map emits <URL, 1>
    – Reduce emits <URL, total count>
  – Reverse Web-Link Graph
    – Map emits <target, source>
    – Reduce emits <target, list(source)>
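As one illustration of these examples, the reverse web-link graph can be sketched as follows. The page names and the PAGES structure are made-up sample data, and the grouping loop stands in for the shuffle.

```python
from collections import defaultdict

# Hypothetical crawl result: source page -> pages it links to.
PAGES = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}


def links_map(source, targets):
    # Emit <target, source> for every outgoing link.
    for target in targets:
        yield target, source


def links_reduce(target, sources):
    # Emit <target, list(source)>; sorted only to make the output stable.
    return target, sorted(sources)


grouped = defaultdict(list)
for source, targets in PAGES.items():
    for target, src in links_map(source, targets):
        grouped[target].append(src)

reverse_graph = dict(links_reduce(target, sources) for target, sources in grouped.items())
print(reverse_graph)
```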
Map-Reduce-Merge
• Map-Reduce is an extremely simple model, but it applies only in a limited context
• Map-Reduce mainly handles homogeneous datasets
• Relational operators, especially joins, are hard to implement with Map-Reduce
• Map-Reduce-Merge tries to keep the simplicity of Map-Reduce while extending it to be more complete
Map-Reduce-Merge
• Adds a merge phase to the Map-Reduce algorithm
• Allows processing of multiple heterogeneous datasets
• Like Map and Reduce, the Merge phase is implemented by the developer
• Example:
  – Two datasets: department and employee
  – Goal: compute each employee's bonus based on individual rewards and the department bonus adjustment
Map-Reduce-Merge
• Example
  – Match keys on dept_id across the two tables
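A rough sketch of the employee/department example: each dataset is reduced on its own, and the merge step matches the two outputs on dept_id. The row layouts, the sample values, and the use of a dictionary lookup for matching are assumptions made for brevity; the slides do not specify them.

```python
from collections import defaultdict

# Hypothetical rows: (emp_id, dept_id, reward) and (dept_id, bonus_adjustment).
EMPLOYEE_REWARDS = [("e1", "d1", 100.0), ("e1", "d1", 50.0), ("e2", "d2", 200.0)]
DEPARTMENT_ADJUSTMENTS = [("d1", 1.5), ("d2", 0.5)]


def reduce_employee(rows):
    # Sum the individual rewards per (dept_id, emp_id).
    totals = defaultdict(float)
    for emp_id, dept_id, reward in rows:
        totals[(dept_id, emp_id)] += reward
    return sorted(totals.items())


def reduce_department(rows):
    # One adjustment factor per dept_id.
    return dict(rows)


def merge_bonus(employee_totals, adjustments):
    # Match on dept_id and scale each employee's total by the department factor.
    for (dept_id, emp_id), total in employee_totals:
        if dept_id in adjustments:
            yield emp_id, total * adjustments[dept_id]


bonuses = dict(
    merge_bonus(reduce_employee(EMPLOYEE_REWARDS), reduce_department(DEPARTMENT_ADJUSTMENTS))
)
print(bonuses)
```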
Map-Reduce-Merge: Extending Map-Reduce
• Changes the reduce phase and adds a merge phase
• Phases
  1. Map: (k1, v1) → [(k2, v2)]
  2. Reduce: (k2, [v2]) → [v3]
• becomes:
  1. Map: (k1, v1) → [(k2, v2)]
  2. Reduce: (k2, [v2]) → (k2, [v3])
  3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])
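The extended signatures can be read as function contracts. The sketch below is one possible in-memory rendering; the names map_reduce and merge_phase are hypothetical, and pairing every left output with every right output is a simplification rather than how a real implementation would schedule mergers.

```python
from collections import defaultdict
from itertools import product


def map_reduce(records, map_fn, reduce_fn):
    """Map: (k1, v1) -> [(k2, v2)], then Reduce: (k2, [v2]) -> (k2, [v3])."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce keeps the key in its output so that Merge can match on it.
    return [(k2, list(reduce_fn(k2, v2s))) for k2, v2s in sorted(groups.items())]


def merge_phase(left, right, merge_fn):
    """Merge: ((k2, [v3]), (k3, [v4])) -> (k4, [v5]); merge_fn returns None to skip a pair."""
    merged = []
    for left_pair, right_pair in product(left, right):
        result = merge_fn(left_pair, right_pair)
        if result is not None:
            merged.append(result)
    return merged


def identity_map(key, value):
    return [(key, value)]


def sum_reduce(key, values):
    return [sum(values)]


def add_on_equal_keys(left_pair, right_pair):
    (k2, v3s), (k3, v4s) = left_pair, right_pair
    if k2 == k3:
        return k2, [a + b for a in v3s for b in v4s]
    return None


left = map_reduce([("x", 1), ("x", 2), ("y", 5)], identity_map, sum_reduce)
right = map_reduce([("x", 10), ("z", 7)], identity_map, sum_reduce)
merged = merge_phase(left, right, add_on_equal_keys)
print(merged)
```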
Map-Reduce-Merge
• Additional user-definable operations
  – Merger: same principle as map and reduce
    – analogous to the map and reduce definitions, defines the logic of the merge operation
  – Processor: processes data from one source
    – processes data on an individual source
  – Partition selector: selects the data that should go to the merger
    – which data should go to which merger?
  – Configurable iterator: how to iterate through each list as the merging is done
    – how to step through each of the lists as you merge
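To make the partition selector concrete, here is a hedged sketch. The function names and the (merger_id, left_count, right_count) signature are assumptions for illustration and should not be read as the paper's API.

```python
def select_same_index(merger_id, left_count, right_count):
    """Merger i reads partition i from each side; assumes both sides share key ranges."""
    return [merger_id], [merger_id]


def select_everything(merger_id, left_count, right_count):
    """Fallback: every merger reads every partition from both sides."""
    return list(range(left_count)), list(range(right_count))


def merge_inputs(selector, merger_id, left_partitions, right_partitions):
    """Gather the reducer outputs that one merger will process."""
    left_ids, right_ids = selector(merger_id, len(left_partitions), len(right_partitions))
    left = [row for i in left_ids for row in left_partitions[i]]
    right = [row for i in right_ids for row in right_partitions[i]]
    return left, right


# Made-up reducer outputs, two partitions per side.
LEFT = [[1, 2], [11, 12]]
RIGHT = [[2, 3], [12, 13]]

inputs_for_merger_1 = merge_inputs(select_same_index, 1, LEFT, RIGHT)
inputs_for_merger_0_all = merge_inputs(select_everything, 0, LEFT, RIGHT)
print(inputs_for_merger_1)
```

A narrower selector means each merger pulls fewer reducer outputs, which is the kind of choice this operation leaves to the developer.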
Map-Reduce-Merge : Relational Data Processing
• Relational operators can be implemented using the Map-Reduce-Merge model. This includes:
  – Projection
  – Aggregation
  – Generalized selection
  – Joins
  – Set union
  – Set intersection
  – Set difference
  – Etc.
Map-Reduce-Merge : Example, Set Union
• Each of the two Map-Reduce jobs emits a sorted list of unique elements
• The Merge phase merges the two lists by iterating as follows:
  – Store the smaller of the two current values and advance its iterator by one
  – If they are equal, store one of them and advance both iterators
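The iteration rule above is a two-pointer merge. A minimal sketch, assuming both inputs are already sorted lists of unique elements (the function name merge_union is hypothetical):

```python
def merge_union(left, right):
    """Union of two sorted lists of unique elements."""
    i = j = 0
    result = []
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            # Left value is smaller: store it and advance the left iterator.
            result.append(left[i])
            i += 1
        elif right[j] < left[i]:
            # Right value is smaller: store it and advance the right iterator.
            result.append(right[j])
            j += 1
        else:
            # Equal: store one copy and advance both iterators.
            result.append(left[i])
            i += 1
            j += 1
    # One list is exhausted; the rest of the other belongs to the union.
    result.extend(left[i:])
    result.extend(right[j:])
    return result


union = merge_union([1, 3, 5], [2, 3, 6, 7])
print(union)
```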
Map-Reduce-Merge : Example, Set Difference
• We have two sets, A and B, and want to compute A - B
• Each of the two Map-Reduce jobs emits a sorted list of unique elements
• The Merge phase iterates simultaneously over the two lists:
  – If A's value is less than B's, store A's value and advance A's iterator
  – If B's value is smaller, advance B's iterator
  – If the two are equal, advance both iterators
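A minimal sketch of this difference rule, again assuming sorted lists of unique elements. The final step, keeping whatever remains of A once B is exhausted, is not stated on the slide but follows from the definition of A - B.

```python
def merge_difference(a, b):
    """Elements of sorted list a that do not appear in sorted list b."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            # a's value cannot appear later in b: keep it.
            result.append(a[i])
            i += 1
        elif b[j] < a[i]:
            # b's value is irrelevant to the result: skip it.
            j += 1
        else:
            # Present in both: drop it and advance both iterators.
            i += 1
            j += 1
    # b is exhausted, so every remaining element of a survives.
    result.extend(a[i:])
    return result


difference = merge_difference([1, 2, 4, 6, 9], [2, 3, 6])
print(difference)
```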
Map-Reduce-Merge : Example, Sort-Merge Join
• Map: partition records into mutually exclusive buckets by key range; each key range is assigned to a reducer
• Reduce: merge the data in each bucket into a sorted set (that is, sort the data)
• Merge: the merger joins the sorted data for each key range
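The three steps can be sketched in one process as follows. The key-range boundaries, the sample records, and all function names are assumptions for illustration; both inputs are partitioned with the same boundaries so that bucket i on one side only needs to meet bucket i on the other.

```python
from bisect import bisect_right
from itertools import groupby
from operator import itemgetter


def map_partition(records, boundaries):
    """Map: route each (key, value) record to the bucket that owns its key range."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect_right(boundaries, key)].append((key, value))
    return buckets


def reduce_sort(bucket):
    """Reduce: sort one bucket by key."""
    return sorted(bucket, key=itemgetter(0))


def group_by_key(sorted_records):
    return [(k, [v for _, v in rows]) for k, rows in groupby(sorted_records, key=itemgetter(0))]


def merge_join(left, right):
    """Merge: join two key-sorted lists, pairing all values that share a key."""
    left_groups, right_groups = group_by_key(left), group_by_key(right)
    i = j = 0
    joined = []
    while i < len(left_groups) and j < len(right_groups):
        left_key, left_values = left_groups[i]
        right_key, right_values = right_groups[j]
        if left_key < right_key:
            i += 1
        elif right_key < left_key:
            j += 1
        else:
            joined.extend((left_key, lv, rv) for lv in left_values for rv in right_values)
            i += 1
            j += 1
    return joined


def sort_merge_join(left_records, right_records, boundaries):
    left_buckets = [reduce_sort(b) for b in map_partition(left_records, boundaries)]
    right_buckets = [reduce_sort(b) for b in map_partition(right_records, boundaries)]
    joined = []
    # One merger per key range: bucket i only ever meets bucket i.
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        joined.extend(merge_join(left_bucket, right_bucket))
    return joined


# Hypothetical boundaries: keys below 10, keys from 10 to 19, keys of 20 and above.
joined = sort_merge_join(
    [(25, "x"), (3, "a"), (12, "b"), (3, "c")],
    [(12, "B"), (3, "A"), (30, "Z")],
    boundaries=[10, 20],
)
print(joined)
```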
Map-Reduce-Merge : Optimizations
• Map-Reduce already optimizes using locality and backup tasks
• Optimize the number of connections between the outputs of the reduce phase and the inputs of the merge phase (example: set intersection)
• Combine two phases into one (example: ReduceMerge)
Conclusions
• Map-Reduce-Merge allows us to work on heterogeneous datasets
• Map-Reduce-Merge supports joins, which Map-Reduce does not directly support
• Next step: develop an SQL-like interface and an optimizer to simplify the development of a Map-Reduce-Merge workflow