Map-Reduce-Merge: Simplified Relational
Data Processing on Large Clusters
Hung-chih Yang, Ali Dasdan (Yahoo!)
Ruey-Lung Hsiao, D. Stott Parker (Computer Science Department, UCLA)
SIGMOD 2007, Beijing, China
Presented by Jongheum Yeon, 2009. 08. 13.
Outline
Introduction
Map-Reduce
Map-Reduce-Merge
Conclusions
Copyright 2009 by CEBT
Introduction
New data-processing systems should consider alternatives to
using big, traditional databases
Map-Reduce does a good job, in a limited context, with
extraordinary simplicity
Map-Reduce-Merge will try to extend the applicability without
giving up too much simplicity
Introduction (cont’d)
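[Figure: comparison of data-processing stacks across four layers (Application, Language, Execution, Storage). Labels shown: Parallel Databases with SQL; Sawzall, Map-Reduce, GFS and BigTable; Pig, Hive, Hadoop, HDFS and S3; DryadLINQ, Scope, Dryad, Cosmos, Azure and SQL Server. Other language labels: ≈SQL; LINQ, SQL.]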
Map-Reduce : Motivation
Many special-purpose tasks operate on and produce large
amounts of data
Inputs: crawled documents, web requests, etc.
Outputs: inverted indices, summaries, other kinds of derived data
The computation must be distributed across a large number of machines to
finish in a reasonable time
Parallelize the computation
Distribute data
These extra concerns obscure the original computation
Map-Reduce : Benefits
Automatic parallelization and distribution
User code complexity and size reduced
Transparent fault-tolerance
I/O scheduling
Fine-grained partitioning of tasks
Dynamically scheduled on available workers
Status and monitoring
Map-Reduce : Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list (out_key, intermediate_value)
– Processes input key/value pair
– Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list (out_value)
– Produces a set of merged output values (usually just one)
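The two signatures above can be sketched in a few lines of Python. This is a minimal single-process illustration, not the distributed framework: `run_map_reduce`, `word_count_map` and `word_count_reduce` are hypothetical names, and word counting is used only as a familiar example that does not appear on the slide.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework (no parallelism, no fault tolerance)."""
    groups = defaultdict(list)
    # Map phase: each input pair yields zero or more intermediate pairs.
    for in_key, in_value in records:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)
    # Reduce phase: each key is reduced together with all of its values.
    return {out_key: reduce_fn(out_key, values)
            for out_key, values in sorted(groups.items())}


def word_count_map(in_key, in_value):
    # in_key is a document name (unused), in_value is the document text.
    for word in in_value.split():
        yield word, 1


def word_count_reduce(out_key, intermediate_values):
    # Returns a list of output values; here it holds a single total.
    return [sum(intermediate_values)]


records = [("doc1", "a b a"), ("doc2", "b c")]
result = run_map_reduce(records, word_count_map, word_count_reduce)
print(result)
```

Only the two user functions change from one job to the next; the grouping step in the middle is what the framework supplies.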
Map-Reduce : Data Flow
[Figure: three Data blocks each feed a Map task; the Map outputs feed two Reduce tasks]
Map-Reduce : Data Flow
Map : Generate new Key and its value
Reduce : Integrate values of same key
[Figure: two Map tasks each take an input pair (Key1, Value1). The first emits (KeyA, ValueX) and (KeyB, ValueY); the second emits (KeyB, ValueZ). One Reduce task outputs A=X and the other outputs B=Y,Z.]
Map-Reduce : Architecture
[Figure: one Master and four Workers, two running Map and two running Reduce, with GFS as the storage layer]
Map-Reduce : Architecture
Master
Assigns and maintains the state of each map/reduce task
Propagates the locations of intermediate files to reduce tasks
Worker
Executes Map or Reduce tasks at the Master's request
Map-Reduce : Distributed Processing
[Figure: the input file is split into Input 1 to Input M, one per Map task. Each Map task writes an intermediate file with partitions 1 to R. The partitions are shuffled to R Reduce tasks, which write Output 1 to Output R of the output file.]
Map-Reduce : Example
Inverted Index
Documents
DocID=1: IDS lab's page (tokens: IDS, lab, 's, page)
DocID=2: IDB lab's page (tokens: IDB, lab, 's, page)
Word dictionary
Word    wordID
lab     101
's      201
page    203
IDS     301
IDB     302
Inverted index
wordID  docID  Location
101     1      1
101     2      1
201     1      2
201     2      2
203     1      3
203     2      3
301     1      0
302     2      0
Map-Reduce : Example (cont’d)
Input data to Map
Key (docID)  Value (Text)
1            IDS lab's page
2            IDB lab's page
Output of Map
Key (wordID)  Value (docID:Location)
301           1:0
101           1:1
201           1:2
203           1:3
302           2:0
101           2:1
201           2:2
203           2:3
Map-Reduce : Example (cont’d)
Shuffle
Collect same keys and convey them to Reduce
Key (wordID)  Value (docID:Location)
101           1:1 2:1
201           1:2 2:2
203           1:3 2:3
301           1:0
302           2:0
Reduce writes the final result
101=1:1, 2:1
201=1:2, 2:2
203=1:3, 2:3
301=1:0
302=2:0
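The inverted-index walk-through can be reproduced with a short Python sketch. It assumes the documents arrive already tokenized and glossed in English, and that the word dictionary is available as a plain lookup table; `WORD_IDS`, `DOCUMENTS`, `index_map`, `shuffle` and `index_reduce` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict

# Word dictionary from the example (token -> wordID); English glosses are an assumption.
WORD_IDS = {"lab": 101, "'s": 201, "page": 203, "IDS": 301, "IDB": 302}

# Input to Map: (docID, tokens). Pre-tokenized input is an assumption of this sketch.
DOCUMENTS = [
    (1, ["IDS", "lab", "'s", "page"]),
    (2, ["IDB", "lab", "'s", "page"]),
]


def index_map(doc_id, tokens):
    # Emit (wordID, "docID:Location") for every token position.
    for location, token in enumerate(tokens):
        yield WORD_IDS[token], f"{doc_id}:{location}"


def shuffle(pairs):
    # Collect the values of identical keys, then order the keys.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def index_reduce(word_id, postings):
    # Write one line of the final index per wordID.
    return f"{word_id}=" + ", ".join(postings)


intermediate = [pair for doc_id, tokens in DOCUMENTS
                for pair in index_map(doc_id, tokens)]
index_lines = [index_reduce(word_id, postings)
               for word_id, postings in shuffle(intermediate)]
print("\n".join(index_lines))
```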
Map-Reduce : Example (cont’d)
Other Examples
Distributed Grep
Count URL Access Frequency
– Map emits <URL, 1>
– Reduce emits <URL, total count>
Reverse Web-Link Graph
– Map emits <target, source>
– Reduce emits <target, list(source)>
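Two of the examples above, URL access counting and the reverse web-link graph, might look as follows in Python. This is a sketch under stated assumptions: each log record is taken to be just a URL, each page record is a (source, list of targets) pair, and `map_reduce` is a hypothetical in-memory driver rather than a real framework call.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Count URL access frequency: <URL, 1> -> <URL, total count>
def url_map(url):
    yield url, 1


def url_reduce(url, counts):
    return sum(counts)


# Reverse web-link graph: <target, source> -> <target, list(source)>
def link_map(page):
    source, targets = page
    for target in targets:
        yield target, source


def link_reduce(target, sources):
    return sorted(sources)


access_log = ["/a", "/b", "/a"]
url_counts = map_reduce(access_log, url_map, url_reduce)

pages = [("p1", ["p2", "p3"]), ("p2", ["p3"])]
reverse_links = map_reduce(pages, link_map, link_reduce)
print(url_counts, reverse_links)
```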
Map-Reduce-Merge
Map-Reduce is an extremely simple model, but with limited
context
Map-Reduce handles mainly homogeneous datasets
Relational operators are hard to implement with Map-Reduce (especially join operations)
Map-Reduce-Merge tries to keep the simplicity of Map-Reduce
while extending it to be more complete
Map-Reduce-Merge
Adds a merge phase to the Map-Reduce algorithm
Allows processing of multiple heterogeneous datasets
Like Map and Reduce, the Merge phase is implemented by the
developer
Example:
Two datasets: department and employee
Goal: compute each employee’s bonus based on individual rewards and
the department bonus adjustment
Map-Reduce-Merge
Example
Match keys on dept_id in tables
Map-Reduce-Merge: Extending Map-Reduce
Changes: Reduce now passes its key through to its output, and a Merge phase is added
Phases
1. Map: (k1, v1) → [(k2, v2)]
2. Reduce: (k2, [v2]) → [v3]
becomes:
1. Map: (k1, v1) → [(k2, v2)]
2. Reduce: (k2, [v2]) → (k2, [v3])
3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])
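The three signatures can be exercised on the employee and department example with a small Python sketch. The record layouts are assumptions made for illustration: employee records are (emp_id, (dept_id, reward)), department records are (dept_id, adjustment), and the bonus is taken to be the reward total multiplied by the adjustment. The nested-loop driver is a simplification, not the scheduling a real system would use.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Map: (k1, v1) -> [(k2, v2)]; Reduce: (k2, [v2]) -> (k2, [v3]); output sorted by k2."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return [reduce_fn(k2, v2s) for k2, v2s in sorted(groups.items())]


def emp_map(emp_id, value):
    dept_id, reward = value
    yield (dept_id, emp_id), reward          # re-key by (dept_id, emp_id)


def emp_reduce(key, rewards):
    return key, [sum(rewards)]               # total reward per employee


def dept_map(dept_id, adjustment):
    yield dept_id, adjustment


def dept_reduce(dept_id, adjustments):
    return dept_id, [adjustments[-1]]        # assumes one adjustment per department


def merge(left, right):
    """Merge: ((k2, [v3]), (k3, [v4])) -> (k4, [v5]), emitted only when dept_id matches."""
    (dept_id, emp_id), [reward_total] = left
    right_dept_id, [adjustment] = right
    if dept_id != right_dept_id:
        return None
    return emp_id, [reward_total * adjustment]


def map_reduce_merge(employees, departments):
    emp_out = map_reduce(employees, emp_map, emp_reduce)
    dept_out = map_reduce(departments, dept_map, dept_reduce)
    merged = []
    for left in emp_out:                     # simple nested loop, for clarity only
        for right in dept_out:
            result = merge(left, right)
            if result is not None:
                merged.append(result)
    return merged


employees = [("e1", ("d1", 100)), ("e1", ("d1", 50)), ("e2", ("d2", 200))]
departments = [("d1", 2), ("d2", 3)]
bonuses = map_reduce_merge(employees, departments)
print(bonuses)
```

Keeping k2 in the reducer output is what lets the merger match records from the two lineages on dept_id.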
Map-Reduce-Merge
Additional user-definable operations
Merger: same principle as map and reduce
– analogous to the map and reduce definitions; defines the logic of the
merge operation
Processor: processes data from one source
– processes data on an individual source
Partition selector: selects the data that should go to the merger
– which data should go to which merger?
Configurable iterator: how to iterate through each list as the
merging is done
– how to step through each of the lists as you merge
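As a rough illustration of the partition selector, the sketch below decides which reducer partitions each merger reads. The function names, the argument list and the two policies are assumptions made for this example; they are not the interface of any actual implementation.

```python
def select_matching_partition(merger_id, num_left, num_right):
    """Policy for an equi-join: both sides used the same range partitioner,
    so merger i only needs partition i from each side."""
    return [merger_id], [merger_id]


def select_all_right_partitions(merger_id, num_left, num_right):
    """Policy for a nested-loop style merge: merger i reads left partition i
    and every partition of the right side."""
    return [merger_id], list(range(num_right))


def plan_merges(num_mergers, num_left, num_right, selector):
    # Ask the selector, once per merger, which partitions that merger should fetch.
    return {merger_id: selector(merger_id, num_left, num_right)
            for merger_id in range(num_mergers)}


matching_plan = plan_merges(2, 2, 2, select_matching_partition)
nested_plan = plan_merges(2, 2, 3, select_all_right_partitions)
print(matching_plan)
print(nested_plan)
```

The first policy keeps the number of reducer-to-merger connections linear in the number of mergers, while the second connects every merger to every right-hand partition.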
Map-Reduce-Merge : Relational Data Processing
Relational operators can be implemented using the Map-Reduce-Merge model. This includes:
Projection
Aggregation
Generalized selection
Joins
Set union
Set intersection
Set difference
Etc…
Map-Reduce-Merge : Example, Set Union
Each of the two Map-Reduces emits a sorted list of unique elements
The Merge merges the two lists by iterating in the following way:
Store the smaller of the two current values and advance its iterator by one
If they are equal, store one of them and advance both iterators
When one list is exhausted, store the remaining values of the other
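The iteration rule for set union can be written as a short Python function. It is a sketch that assumes both inputs are already sorted lists of unique values, as the two Map-Reduce passes are said to produce; `merge_union` is a hypothetical name.

```python
def merge_union(left, right):
    """Union of two sorted lists of unique elements, in one simultaneous pass."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            out.append(left[i])      # left value is smaller: store it, advance left
            i += 1
        elif right[j] < left[i]:
            out.append(right[j])     # right value is smaller: store it, advance right
            j += 1
        else:
            out.append(left[i])      # equal: store one copy, advance both
            i += 1
            j += 1
    # One side is exhausted; whatever remains on the other side belongs to the union.
    out.extend(left[i:])
    out.extend(right[j:])
    return out


union = merge_union([1, 3, 5], [2, 3, 6])
print(union)
```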
Map-Reduce-Merge : Example, Set Difference
Given two sets A and B, we want to compute A - B
Each of the two Map-Reduces emits a sorted list of unique elements
The merge iterates simultaneously over the two lists:
If A’s value is less than B’s, store A’s value and increment A’s iterator
If B’s value is smaller, increment B’s iterator
If the two are equal, increment both iterators
When B is exhausted, store the remaining values of A
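The same style of merge gives set difference. The sketch below assumes sorted inputs with unique values and uses the hypothetical name `merge_difference`; the only step beyond the three rules is copying what is left of A once B runs out.

```python
def merge_difference(a, b):
    """A - B for two sorted lists of unique elements, in one simultaneous pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])   # a[i] cannot appear later in b: keep it
            i += 1
        elif b[j] < a[i]:
            j += 1             # b[j] is not in a: skip it
        else:
            i += 1             # present in both: drop it from the result
            j += 1
    out.extend(a[i:])          # b is exhausted: the rest of a survives
    return out


difference = merge_difference([1, 2, 4, 7], [2, 3, 4])
print(difference)
```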
Map-Reduce-Merge : Example, Sort-Merge Join
Map: partitions records into mutually exclusive key-range buckets;
each key range is assigned to a reducer
Reduce: merges the records of each bucket into a sorted set (that is, sorts
the data)
Merge: the merger joins the sorted data for each key range
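The three phases of the sort-merge join can be sketched in Python as follows. The range boundaries, the (key, value) record layout and all function names are assumptions for illustration, and everything runs in one process rather than on separate mappers, reducers and mergers.

```python
from bisect import bisect_right


def map_partition(records, boundaries):
    """Map: place each (key, value) record in the bucket for its key range."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect_right(boundaries, key)].append((key, value))
    return buckets


def reduce_sort(bucket):
    """Reduce: sort one bucket by key."""
    return sorted(bucket, key=lambda record: record[0])


def merge_join(left, right):
    """Merge: join two key-sorted lists from the same key range."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif right_key < left_key:
            j += 1
        else:
            # Find the run of right records sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left record with that key against the whole run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    out.append((left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return out


def sort_merge_join(left_records, right_records, boundaries):
    left_buckets = [reduce_sort(b) for b in map_partition(left_records, boundaries)]
    right_buckets = [reduce_sort(b) for b in map_partition(right_records, boundaries)]
    joined = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        joined.extend(merge_join(left_bucket, right_bucket))
    return joined


left_records = [(15, "x"), (3, "a"), (25, "m"), (3, "b")]
right_records = [(3, "P"), (25, "Q"), (16, "Z")]
joined = sort_merge_join(left_records, right_records, boundaries=[10, 20])
print(joined)
```

Because both inputs use the same boundaries, each merger only ever compares records from one key range.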
Map-Reduce-Merge : Optimizations
Map-Reduce already optimizes using locality and backup tasks
Reduce the number of connections between the outputs of the
reduce phase and the inputs of the merge phase (example: set
intersection)
Combine two phases into one (example: ReduceMerge)
Conclusions
Map-Reduce-Merge allows us to work on heterogeneous datasets
Map-Reduce-Merge supports joins, which Map-Reduce does not
support directly
Next step: develop an SQL-like interface and an optimizer that
simplify the development of a Map-Reduce-Merge workflow