RankReduce – Processing K-Nearest Neighbors queries on Top


Join Using MapReduce
Cloud Group, WAMDM
Youzhong MA
July 6, 2015
Outline
- Overview about Join Using MapReduce
- Details
  - Efficient Parallel kNN Joins for Large Data in MapReduce [EDBT'2012]
  - Efficient Parallel Set-Similarity Joins Using MapReduce [SIGMOD'2010]
  - Parallel Top-K Similarity Join Algorithms using MapReduce [ICDE'2012]
- Conclusion
- Trajectory Similarity Join Using MapReduce
Overview about Join Using MapReduce

Join types and representative work:
- Basic Join
  - [Join Comparison, SIGMOD'2010]: two tables, equal join; repartition join, broadcast join, semi-join, etc.
  - [Optimizing Joins, VLDB'2010]: multiple tables, equal join; star join, chain join.
  - [Map-Reduce-Merge, SIGMOD'2007] and [Map-Join-Reduce, TKDE'2010]: multiple tables, equal join; modify the MapReduce framework.
- Theta Join
  - [Theta-Joins, SIGMOD'2011]: reducer-centered cost model and a join model.
- Complex Join
  - Set Similarity Join [Set Join, SIGMOD'2010]: string and set based similarity.
  - Top-k Join [Top-k Join, ICDE'2012]: top-k closest pairs; essential pair partitioning; divide-and-conquer and branch-and-bound algorithms.
  - kNN Join [Parallel kNN Joins, EDBT'2012]: approximate kNN, Z-order based method.
  - Similarity Join [Fuzzy Join, ICDE'2012]: Hamming distance, edit distance, and Jaccard distance measures.
Introduction
- k nearest neighbor join (kNN join): given two data sets R and S, for every point q in R, the kNN join returns the k nearest points of q from S.
- Example: a 3-NN join for q returns the pairs (q, p1), (q, p3), (q, p4).
- In short: find the kNN in S for every point in R.
- Applications: data mining, spatial databases, etc.
Efficient Parallel kNN Joins for Large Data in MapReduce [Chi Zhang et al. EDBT’2012]
Introduction
- Exact kNN join:
  - knn(r, S) = the set of kNN of r from S.
  - knnJ(R, S) = {(r, knn(r, S)) | for all r ∈ R}.
- Approximate kNN join:
  - aknn(r, S) = approximate kNN of r from S.
  - Let p = the kth NN of r in knn(r, S) and p' = the kth NN of r in aknn(r, S); aknn(r, S) is a c-approximation of knn(r, S) if d(r, p) ≤ d(r, p') ≤ c · d(r, p).
  - aknnJ(R, S) = {(r, aknn(r, S)) | ∀r ∈ R}.
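The exact and approximate definitions above can be checked with a small brute-force sketch; the function names and the use of Euclidean distance are my own illustrative choices, not from the paper:

```python
import math

def knn(r, S, k):
    """Exact kNN of point r from S: the k points of S closest to r."""
    return sorted(S, key=lambda p: math.dist(r, p))[:k]

def knn_join(R, S, k):
    """knnJ(R, S) = {(r, knn(r, S)) | r in R}."""
    return {r: knn(r, S, k) for r in R}

def is_c_approximation(r, approx, exact, c):
    """aknn(r, S) c-approximates knn(r, S) if d(r, p) <= d(r, p') <= c * d(r, p),
    where p / p' are the k-th NN in the exact / approximate answer."""
    d_exact = math.dist(r, exact[-1])
    d_approx = math.dist(r, approx[-1])
    return d_exact <= d_approx <= c * d_exact
```

For example, an approximate answer whose k-th neighbor is twice as far as the true k-th neighbor is a valid 2-approximation.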
Exact kNN join: Block Nested Loop Join
- Block nested loop join (BNLJ) based method:
  - Partition R and S, each into n equal-sized disjoint blocks.
  - Perform BNLJ for each possible pair of blocks (Ri, Sj).
  - Get the global kNN results from the n local kNN results for every record in R.
[Diagram: R is split into R1, R2 and S into S1, S2; BNLJ is run on (R1, S1), (R1, S2), (R2, S1), (R2, S2), and the local results are merged into BNLJ(R1, S) and BNLJ(R2, S).]
Exact kNN join: Block Nested Loop Join
- Two-round MapReduce algorithm:
  - Round 1: run BNLJ on every block pair, e.g. BNLJ(R1, S1), BNLJ(R1, S2), producing local kNN results.
  - Round 2: for each record of R, merge its n local kNN lists into the global kNN result.
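A minimal single-machine simulation of the two rounds; the round-robin blocking, Euclidean distance, and all names are illustrative assumptions, not the paper's exact scheme:

```python
import math
from itertools import product

def bnlj_knn_join(R, S, k, n):
    """Two-round BNLJ kNN join, simulated on one machine.
    Round 1: a local kNN for every (R_i, S_j) block pair (n^2 tasks).
    Round 2: merge the n local candidate lists per record into a global top-k."""
    R_blocks = [R[i::n] for i in range(n)]   # n disjoint blocks of R
    S_blocks = [S[j::n] for j in range(n)]   # n disjoint blocks of S

    # Round 1: local kNN candidates, grouped by record of R (the shuffle key)
    candidates = {tuple(r): [] for r in R}
    for Ri, Sj in product(R_blocks, S_blocks):
        for r in Ri:
            local = sorted(Sj, key=lambda p: math.dist(r, p))[:k]
            candidates[tuple(r)].extend(local)

    # Round 2: global top-k from the n local lists of each record
    return {r: sorted(cands, key=lambda p: math.dist(r, p))[:k]
            for r, cands in candidates.items()}
```

Note the n² block pairs in round 1: this is exactly the communication cost the approximate method later avoids.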
Exact kNN join: Block R-tree Join
- Use a spatial index (R-tree) to improve performance:
  - Build an R-tree index over the block of S in each bucket to speed up the kNN computations.
  - Similar to the BNLJ algorithm; only replace BNLJ with the block R-tree join (BRJ) in the first round.
Approximate kNN Join
- Problems with the exact kNN join solution:
  - Too much communication and computation (n² buckets required).
- We therefore look for an approximate solution requiring only O(n) buckets:
  - Space-filling curve based methods ([YLK10], dubbed zkNN).

[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and kNN-joins in large relational databases (almost) for free. ICDE, 2010.
Approximate kNN Join: Z-order kNN join
- The idea of zkNN:
  - Transform d-dimensional points to 1-D values using Z-values.
  - Map a d-dimensional kNN join query to 1-D range queries.
  - Use multiple random shift copies to improve spatial locality.
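Z-values (Morton codes) interleave the bits of a point's coordinates, so points close in space tend to get close 1-D values. A minimal 2-D sketch of the transformation and of a random shift copy; the helper names are mine:

```python
def z_value(x, y, bits=16):
    """Interleave the bits of integer coordinates x and y into one
    1-D Morton (Z-order) value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # even bit positions: x
        z |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions: y
    return z

def shifted_z(point, shift):
    """Random shift copy: translate the point before computing its z-value,
    so neighbors split by a quadrant boundary become close in some copy."""
    x, y = point
    dx, dy = shift
    return z_value(x + dx, y + dy)
```

A kNN query then becomes a 1-D range scan around the query point's z-value, repeated once per shifted copy.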
Approximate kNN Join: H-zkNNJ
- Apply zkNN for join in MapReduce (H-zkNNJ): a partition based algorithm.
- Partitioning policy: achieve communication and computation costs linear in the number of blocks n in each input data set.
- Partitioning by z-values: partition the input data sets Ri and Si into {Ri,1, ..., Ri,n} and {Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n−1}.
- This enables a small neighborhood search around each query point (the figures illustrate k = 2 and k = 3).
Approximate kNN Join: H-zkNNJ
- Choice of partitioning values:
  - Goal: load balance; evenly partition Ri or Si.
  - Each block of Ri and Si shares the same boundary, so we only search a small neighborhood and minimize communication.
- Computation of partitioning values:
  - Quantiles can be used to evenly partition a data set D: sort D and retrieve its (n − 1) quantiles (expensive).
  - We propose a sampling based method to estimate the quantiles instead.
  - We proved that both estimations are close to the original ranks with a high probability (1 − e^(−2…)).
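A simplified version of such a sampling-based quantile estimation; the Bernoulli sampling rate, the fallback, and all names are my own assumptions, not the paper's exact scheme:

```python
import random

def estimate_partition_values(z_values, n, sample_prob=0.05, seed=7):
    """Estimate the (n - 1) z-value quantiles of a data set from a small
    Bernoulli sample, avoiding a full sort of the whole data set."""
    rng = random.Random(seed)
    sample = sorted(z for z in z_values if rng.random() < sample_prob)
    if len(sample) < n - 1:                 # degenerate sample: fall back
        sample = sorted(z_values)
    # pick the j/n-th rank within the sample, j = 1 .. n-1
    return [sample[len(sample) * j // n] for j in range(1, n)]
```

The returned z-values become the block boundaries, so each of the n blocks of Ri (or Si) holds roughly the same number of points.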
Approximate kNN Join: H-zkNNJ
- The H-zkNNJ algorithm can be implemented in 3 rounds of MapReduce.
  - Round 1: construct the random shift copies Ri and Si of R and S, i ∈ [1, α], and generate the partitioning values for Ri and Si.
Approximate kNN Join: H-zkNNJ
- The H-zkNNJ algorithm can be implemented in 3 rounds of MapReduce.
  - Round 2: partition Ri and Si into blocks and compute the candidate points for knn(r, S) for every r ∈ R.
  - Round 3: determine knn(r, C(r)) for every r ∈ R from the (r, Ci(r)) pairs emitted by round 2.
Experiments
[Figures: running time and communication cost of the compared methods.]
Motivating Scenarios
- Detecting plagiarism:
  - Before publishing a journal, editors have to make sure there is no plagiarized paper among the hundreds of papers to be included in the journal.
- Near-duplicate elimination:
  - The archive of a search engine can contain multiple copies of the same page.
  - Reasons: re-crawling, different hosts holding the same redundant copies of a page, etc.
Efficient Parallel Set-Similarity Joins Using MapReduce [Rares Vernica et al. SIGMOD’2010]
Problem Statement
- Given two collections of objects/items/records, a similarity metric sim(o1, o2) and a threshold λ, find the pairs of objects/items/records satisfying sim(o1, o2) > λ.
- Solution: similarity join.
- Some of the collections are enormous:
  - Google N-gram database: ~1 trillion records
  - GenBank: 416 GB of data
  - Facebook: 400 million active users
- Try to process this data in a parallel, distributed way => MapReduce.
Set-Similarity Join (SSJoin)
- SSJoin: a powerful primitive for supporting (string-)similarity joins.
- Input: 2 collections of sets, S1, S2, ..., Sn and T1, T2, ..., Tn, where each set contains tokens {word1, word2, ..., wordn}.
- Goal: identify all pairs of highly similar sets, SSJoin with predicate pred: sim(Si, Tj) > 0.3, using the Jaccard similarity

  sim(Si, Tj) = |Si ∩ Tj| / |Si ∪ Tj|
Set-Similarity Join
- Most SSJoin algorithms are signature-based. Signatures:
  - Have a filtering effect: the SSJoin algorithm compares only candidates, not all pairs.
  - Ensure correctness: Sign(r) ∩ Sign(s) ≠ ∅ whenever sim(r, s) ≥ λ.
- One possible signature scheme: prefix-filtering.
  - Compute a global ordering of tokens: Marat … W. … Safin … Rafael … Nadal … P. … Smith … John
  - Compute the signature of each input set: take the prefix of length n.
    - Sign({John, W., Smith}) = [W., Smith]
    - Sign({Marat, Safin}) = [Marat, Safin]
    - Sign({Rafael, P., Nadal}) = [Rafael, Nadal]
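A sketch of prefix-filtering as a candidate test. The prefix length formula |s| − ⌈λ·|s|⌉ + 1 is the standard one for a Jaccard threshold λ (the slide's example simply uses a fixed prefix length); helper names are mine:

```python
import math

def prefix_signature(tokens, global_order, threshold):
    """Sort the set by the global (rarest-first) token order and keep a prefix.
    For a Jaccard threshold t, a prefix of |s| - ceil(t * |s|) + 1 tokens
    guarantees Sign(r) & Sign(s) is non-empty whenever sim(r, s) >= t."""
    ordered = sorted(tokens, key=global_order.index)
    k = len(ordered) - math.ceil(threshold * len(ordered)) + 1
    return set(ordered[:k])

def may_be_similar(r, s, global_order, threshold):
    """Candidate test: the two prefixes must share at least one token."""
    return bool(prefix_signature(r, global_order, threshold) &
                prefix_signature(s, global_order, threshold))
```

Pairs whose prefixes are disjoint are pruned without ever computing the full similarity.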
Set-Similarity Join
- Filtering phase: before doing the actual SSJoin, cluster/group the candidates into buckets, e.g.:
  - cluster/bucket 1: {John, W., Smith} with {Smith, John}, …
  - cluster/bucket 2: {Marat, Safin} with {Safin, Marat, Michailowitsc}; {Rafael, P., Nadal} with {Nadal, Rafael, Parera}, …
- Run the SSJoin on each cluster => less workload.
Parallel Set-Similarity Join
- The method comprises 3 stages:
  - Stage I (Token Ordering): compute data statistics for good signatures.
  - Stage II (RID-Pair Generation): group candidates based on their signatures and compute the SSJoin.
  - Stage III (Record Join): generate the actual pairs of joined records.
Stage I: Data Statistics (Basic Token Ordering)
- Creates a global ordering of the tokens in the join column, based on their frequency.
- 2 MapReduce cycles:
  - 1st: computing the token frequencies
  - 2nd: ordering the tokens by their frequencies
Basic Token Ordering – 1st MapReduce cycle
- map: tokenize the join value of each record and emit each token with an occurrence count of 1.
- reduce: for each token, compute the total count (frequency).
Basic Token Ordering – 2nd MapReduce cycle
- map: interchange key with value, emitting (frequency, token).
- reduce (use only 1 reducer): emits the tokens, now sorted by frequency.
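Both cycles can be simulated in-memory; this is a sketch with my own names (the real implementation runs as two Hadoop jobs):

```python
from collections import Counter

def basic_token_ordering(records):
    """Cycle 1: map tokenizes each record and emits (token, 1);
    reduce sums the counts to (token, frequency).
    Cycle 2: map swaps to (frequency, token); the single reducer
    emits the tokens in increasing frequency order (rarest first)."""
    freq = Counter(tok for rec in records for tok in rec.split())
    # single reducer: sort by (frequency, token) and emit the tokens
    return [tok for f, tok in sorted((f, t) for t, f in freq.items())]
```

The single reducer in cycle 2 is what produces one consistent global ordering for all later stages.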
Stage II: RID-Pair Generation – Map Phase
- Scan the input records and, for each record:
  - project it on RID & join attribute;
  - tokenize it;
  - extract the prefix according to the global ordering of tokens obtained in the Token Ordering stage;
  - route the tokens to the appropriate reducer.
Routing: using individual tokens
- Treats each token as a key.
- For each record, generates a (key, value) pair for each of its prefix tokens.
- Example: given the global ordering

  Token:     A   B   E   D   G   C   F
  Frequency: 10  10  22  23  23  40  48

  "A B C" => prefix of length 2: A, B => generate/emit 2 (key, value) pairs: (A, (1, A B C)) and (B, (1, A B C)).
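The routing step above can be sketched as follows; the names are mine and the prefix length is fixed to 2 as in the example:

```python
def route_by_token(rid, tokens, global_order, prefix_len=2):
    """Emit one (key, value) pair per prefix token, as in the example:
    "A B C" with prefix A, B yields (A, (rid, tokens)) and (B, (rid, tokens))."""
    ordered = sorted(tokens, key=global_order.index)
    return [(tok, (rid, tokens)) for tok in ordered[:prefix_len]]
```

The MapReduce shuffle then groups all records sharing a prefix token at the same reducer.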
Grouping/Routing: using individual tokens
- Advantage: high quality of grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer).
- Disadvantage: high replication of data (the same records might be checked for similarity in multiple reducers, i.e. redundant work).
Routing: using grouped tokens
- Example: given the same global ordering (A, B, E, D, G, C, F with frequencies 10, 10, 22, 23, 23, 40, 48):
  "A B C" => prefix of length 2: A, B. Suppose A, B belong to group X and C belongs to group Y => generate/emit 2 (key, value) pairs: (X, (1, A B C)) and (Y, (1, A B C)).
Grouping/Routing: using grouped tokens
- Advantage: replication of data is not so pervasive.
- Disadvantage: the quality of grouping is not as high (records having no chance of being similar may be sent to the same reducer, which then checks their similarity).
RID-Pair Generation: Reduce Phase
- This is the core of the entire method.
- Each reducer processes one or more buckets of candidates.
- In each bucket, the reducer looks for pairs of join attribute values satisfying the join predicate.
- If the similarity of the 2 candidates >= threshold => output their RIDs and also their similarity.
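A sketch of the reduce-phase verification inside one bucket, using Jaccard similarity (the metric used in the examples above); names are mine:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)

def verify_bucket(bucket, threshold):
    """For each candidate pair (rid, token_set) in the bucket, output
    (rid1, rid2, sim) whenever the similarity reaches the threshold."""
    out = []
    for (rid1, s1), (rid2, s2) in combinations(bucket, 2):
        sim = jaccard(s1, s2)
        if sim >= threshold:
            out.append((rid1, rid2, sim))
    return out
```

Real implementations also apply length and positional filters before computing the full similarity, but the verification logic is the same.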
Stage III: Generate pairs of joined records
- Until now we have only pairs of RIDs, but we need the actual records.
- Uses 2 MapReduce cycles:
  - 1st cycle: fills in the record information for each half of each pair
  - 2nd cycle: brings together the previously filled-in records
Handling Insufficient Memory
- Map-based block processing: the map function replicates the blocks and interleaves them in the order they will be processed by the reducer.
- Reduce-based block processing: the map function sends each block exactly once.
Evaluation
- Cluster: 10-node IBM x3650, running Hadoop.
- Data sets:
  - DBLP: 1.2M publications
  - CITESEERX: 1.3M publications
- Algorithm abbreviations: BTO (basic token ordering), BK (basic kernel), BRJ (basic record join), PK (PPJoin+ kernel), OPRJ (one-phase record join).
- Best algorithm: BTO-PK-OPRJ.
- Most expensive stage: the RID-pair generation.
- With fixed data size and varying cluster size, the best time is again BTO-PK-OPRJ.
Conclusion
- kNN join is computation-intensive; the goal is to minimize communication and computation.
- Effective filtering strategies reduce the candidate pairs:
  - Parallel kNN Joins [EDBT'2012]: space-filling curve based methods ([YLK10], dubbed zkNN); from n² buckets down to O(n) buckets.
  - Efficient Parallel Set-Similarity Joins [SIGMOD'2010]: the prefix-filtering principle reduces the candidate pairs.
- Good partition strategies achieve good load balance:
  - Parallel kNN Joins [EDBT'2012]: evenly partitioning the dataset using a sampling method.
  - Efficient Parallel Set-Similarity Joins [SIGMOD'2010]: global ordering based on token frequency.
Problem statement
- Trajectory: a trajectory T is a sequence of pairs {<l1, t1>, …, <ln, tn>}, where li ∈ ℝ^d and ti ∈ ℕ.
- Trajectory join: given two sets of trajectories R and S and a threshold ε, the result of the trajectory join query is the set V of pairs <Ri, Sj> with Ri ∈ R, Sj ∈ S such that D(Ri, Sj) ≤ ε for a given user-defined distance function D.
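The definition leaves D user-defined. As an illustration only, the sketch below assumes D is the mean Euclidean distance over the timestamps the two trajectories share; all names are mine:

```python
import math
from itertools import product

def traj_distance(T1, T2):
    """Assumed D: mean Euclidean distance between positions at the
    timestamps the two trajectories share (a placeholder choice).
    A trajectory is a list of (timestamp, location) pairs."""
    p1, p2 = dict(T1), dict(T2)          # timestamp -> location
    common = p1.keys() & p2.keys()
    return math.inf if not common else sum(
        math.dist(p1[t], p2[t]) for t in common) / len(common)

def trajectory_join(R, S, eps):
    """All index pairs (i, j) with D(R[i], S[j]) <= eps."""
    return [(i, j) for (i, Ri), (j, Sj) in
            product(enumerate(R), enumerate(S))
            if traj_distance(Ri, Sj) <= eps]
```

This nested-loop version is exactly the naïve baseline the next slide improves on.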
Solutions
- Naïve approaches:
  - Block nested loop join (BNLJ) based method
  - Block nested loop join + sliding window
[Diagram: R is split into R1, R2 and S into S1, S2; BNLJ is run on (R1, S1), (R1, S2), (R2, S1), (R2, S2).]
Improved approaches
- Symbolic representation for trajectories based on the Piecewise Aggregate Approximation (PAA) technique.
- Challenge: data skew.
- Solutions:
  - Use hierarchical PAA to filter the candidate pairs recursively.
  - Divide the dense PAA regions into sub-partitions.
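PAA itself is simple to sketch. This 1-D version (my own helper, not the talk's implementation) splits a series into w equal-length segments and keeps each segment's mean:

```python
def paa(series, w):
    """Piecewise Aggregate Approximation: represent the series by the
    means of w (near-)equal-length segments."""
    n = len(series)
    return [sum(series[n * i // w : n * (i + 1) // w]) /
            (n * (i + 1) // w - n * i // w) for i in range(w)]
```

The compact PAA vectors serve as the symbolic representation on which candidate trajectory pairs are filtered before the exact distances are computed.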
Thank you!
Questions?