Top-k Set Similarity Joins Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang University of New South Wales and NICTA.


Top-k Set Similarity Joins
Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang
University of New South Wales and NICTA
Motivation

Data Cleaning
University                     City       State       Postal Code
University of New South Wales  Sydney     NSW         2052
University of Sydney           Sydney     NSW         2006
University of Melbourne        Melbourne  Victoria    3010
University of Queensland       Brisbane   Queensland  4072
University of New South Vales  Sydney     NSW         2052
More Applications
Obama Has Busy Final Day Before Taking Office as Bush Says Farewells
iht.com, Jan 20, 2009
New York Times, Jan 19th, 2009
(Traditional) Set Similarity Join



Each record is tokenized into a set
Given a collection of records, the set similarity join problem is to find all pairs of records, <x,y>, such that sim(x,y) ≥ t
Common similarity functions, with x = {A,B,C,D,E} and y = {B,C,D,E,F}:
jaccard: $J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t$, here 4/6 = 0.67
cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x| \cdot |y|}} \ge t$, here 4/5 = 0.8
dice: $D(x, y) = \frac{2 |x \cap y|}{|x| + |y|} \ge t$, here 8/10 = 0.8
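The three similarity functions can be checked on the example sets with a few lines of Python. This is a minimal sketch that assumes records are non-empty Python sets; the function names are illustrative and do not come from the slides.

```python
from math import sqrt

def jaccard(x, y):
    # |x ∩ y| / |x ∪ y|
    return len(x & y) / len(x | y)

def cosine(x, y):
    # |x ∩ y| / sqrt(|x| * |y|)
    return len(x & y) / sqrt(len(x) * len(y))

def dice(x, y):
    # 2 |x ∩ y| / (|x| + |y|)
    return 2 * len(x & y) / (len(x) + len(y))

x = set("ABCDE")
y = set("BCDEF")
for name, sim in (("jaccard", jaccard), ("cosine", cosine), ("dice", dice)):
    print(name, round(sim(x, y), 2))
```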
What if t is unknown beforehand?
What if t is unknown beforehand?

Example – using jaccard similarity function







w = {A, B, C, D, E}
x = {A, B, C, E, F}
y = {B, C, D, E, F}
z = {B, C, F, G, H}
If t = 0.7 → no results
If t = 0.4 → <w,x>, <w,y>, <x,y>, <x,z>, <y,z> → too many results and long running time
Return the top-k results ranked by their similarity values
if k = 1 → <w,x>
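The example can be reproduced with a brute-force sketch that scores every pair. The helper names (`scored_pairs`, `threshold_join`, `topk_bruteforce`) are hypothetical. Note that <w,x>, <w,y> and <x,y> all tie at 4/6 here, so the top-1 answer depends on tie-breaking; this sketch keeps the first pair in enumeration order.

```python
from itertools import combinations

def jaccard(x, y):
    return len(x & y) / len(x | y)

records = {
    "w": set("ABCDE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}

def scored_pairs(records):
    # Score all n*(n-1)/2 pairs; fine for a toy example only.
    ids = sorted(records)
    return [(jaccard(records[a], records[b]), a, b) for a, b in combinations(ids, 2)]

def threshold_join(records, t):
    # Traditional similarity join: keep pairs with sim >= t.
    return [(a, b) for s, a, b in scored_pairs(records) if s >= t]

def topk_bruteforce(records, k):
    # Stable sort by similarity, so ties keep enumeration order.
    ranked = sorted(scored_pairs(records), key=lambda p: p[0], reverse=True)
    return [(a, b) for _, a, b in ranked[:k]]

print(threshold_join(records, 0.7))
print(threshold_join(records, 0.4))
print(topk_bruteforce(records, 1))
```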
Top-k Set Similarity Join

Return top-k pairs of records, ranked by similarity scores.

Advantages over traditional similarity join



without specifying a threshold
output results progressively → benefit interactive applications
produces most meaningful results under limited resources or time constraints → can be stopped at any time, but still guarantee sim(output results) ≥ sim(unseen pairs)
Straightforward Solution

Start from a certain t, repeat the following steps:




answer traditional sim-join with t as threshold
if # of results ≥ k, stop and output k results with highest sim
else, decrease t
Example (jaccard, k = 2)
w = {A, B, C, E}
x = {A, B, C, E, F}
y = {B, C, D, E, F}
z = {B, C, F, G, H}
t = 0.9 → no result
t = 0.8 → <w,x>
t = 0.7 → <w,x> (results don’t change!)
t = 0.6 → <w,x>, <x,y>
Which thresholds shall we enumerate? 0.8, 0.6
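The repeated-invocation loop can be sketched as follows. The starting threshold of 0.9 and the step of 0.1 are assumptions taken from the example, and `threshold_join` is a brute-force stand-in for a real similarity-join algorithm.

```python
from itertools import combinations

def jaccard(x, y):
    return len(x & y) / len(x | y)

def threshold_join(records, t):
    # Stand-in for a traditional sim-join; returns (sim, a, b), best first.
    ids = sorted(records)
    pairs = [(jaccard(records[a], records[b]), a, b) for a, b in combinations(ids, 2)]
    return sorted((p for p in pairs if p[0] >= t), key=lambda p: p[0], reverse=True)

def straightforward_topk(records, k, start=0.9, step=0.1):
    tried = []  # (threshold, number of results) for every invocation
    t = start
    while True:
        results = threshold_join(records, t)
        tried.append((t, len(results)))
        if len(results) >= k or t <= 0:
            return results[:k], tried
        t = round(t - step, 10)  # rounding avoids drift such as 0.7000000000000001

records = {"w": set("ABCE"), "x": set("ABCEF"), "y": set("BCDEF"), "z": set("BCFGH")}
top, tried = straightforward_topk(records, k=2)
print([(a, b) for _, a, b in top])
print(tried)
```

The join is run from scratch four times, and the run at t = 0.7 returns the same pairs as the run at t = 0.8, which is the wasted work the slide points at.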
Naïve and Index-Based Algorithms

Naïve Algorithm:
Compare every pair of objects → O(n^2) time complexity
Index-based Algorithm [Sarawagi et al. SIGMOD04]:
[Figure: Record Set → Index Construction → inverted lists (token → record_id) → Candidate Generation → Verification → Result Pairs. The inverted lists map tokens such as A, B, C to the ids of the records containing them (e.g., A → w, x); example candidate pairs: <w,x>, <w,y>, <x,y>, <x,z>, …]
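The index, candidate generation and verification pipeline can be illustrated with a simplified sketch. It is not the exact algorithm of the cited paper: it indexes every token, takes any record sharing a token as a candidate, and verifies with jaccard. The name `index_join` is hypothetical.

```python
from collections import defaultdict

def jaccard(x, y):
    return len(x & y) / len(x | y)

def index_join(records, t):
    index = defaultdict(list)  # inverted lists: token -> ids of records seen so far
    results = []
    for rid in sorted(records):
        tokens = records[rid]
        # Candidate generation: every earlier record sharing at least one token.
        candidates = set()
        for tok in tokens:
            candidates.update(index[tok])
        # Verification: compute the real similarity for each candidate.
        for cid in sorted(candidates):
            if jaccard(records[cid], tokens) >= t:
                results.append((cid, rid))
        # Index construction: add this record to the list of each of its tokens.
        for tok in tokens:
            index[tok].append(rid)
    return results

records = {"w": set("ABCE"), "x": set("ABCEF"), "y": set("BCDEF"), "z": set("BCFGH")}
print(index_join(records, 0.6))
```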
Prefix Filter

Sort the tokens by a global ordering



[Chaudhuri et al. ICDE06, Bayardo et al. WWW07]
increasing order of document frequency
Only need to index the first few tokens (prefix) for each record
Example:
jaccard t = 0.8 → |x ∩ y| ≥ 4 if |x| = |y| = 5
x = A B | E F G (sorted, prefix = A B)
y = C D | E F G (sorted, prefix = C D)
The prefixes share no token, so the upper bound of the overlap O(x,y) = 3 < 4!
Must share at least one token in prefix to be a candidate pair
For jaccard, prefix length = ⌊|x| * (1 - t)⌋ + 1 → each t is associated with a prefix length
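A small sketch of the prefix filter for jaccard follows. It assumes alphabetical order stands in for the global ordering (the slides use increasing document frequency), and the function names are illustrative. The prefix length is computed in the equivalent integer form |x| - ⌈t|x|⌉ + 1, with a small epsilon to guard against floating-point error.

```python
from math import ceil

def prefix_length(size, t):
    # Equivalent to floor(size * (1 - t)) + 1 for jaccard.
    return size - ceil(t * size - 1e-9) + 1

def may_match(x_sorted, y_sorted, t):
    # Prefix filter: a pair can reach threshold t only if the prefixes intersect.
    px = set(x_sorted[:prefix_length(len(x_sorted), t)])
    py = set(y_sorted[:prefix_length(len(y_sorted), t)])
    return bool(px & py)

x = list("ABEFG")
y = list("CDEFG")
print(prefix_length(5, 0.8))   # 2
print(may_match(x, y, 0.8))    # False: prefixes {A, B} and {C, D} are disjoint
```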
Necessary Thresholds

Each prefix is associated with a threshold, i.e., the
maximum possible similarity a record can achieve with
other records.
x = A B C …, with t = 1.0, 0.8, 0.6 at prefix lengths 1, 2, 3
What thresholds shall we enumerate? All the thresholds
with which prefixes are associated!
Necessary thresholds


If we move from one necessary threshold to the next, there exists a database instance where the results will change
extend prefix by one token, and consider the new t
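For jaccard, the threshold attached to each prefix position can be computed directly. The reasoning assumed here: if the first token a record x shares with another record sits at position p (1-based) of x, at most |x| - p + 1 tokens can overlap and the union has at least |x| tokens, so the similarity is at most (|x| - p + 1) / |x|. The function names are hypothetical.

```python
def prefix_threshold(size, position):
    # Upper bound on jaccard when the first shared token is at this 1-based position.
    return (size - position + 1) / size

def necessary_thresholds(size):
    # One threshold per prefix length, in decreasing order.
    return [prefix_threshold(size, p) for p in range(1, size + 1)]

print(necessary_thresholds(5))  # [1.0, 0.8, 0.6, 0.4, 0.2]
print(necessary_thresholds(4))  # [1.0, 0.75, 0.5, 0.25]
```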
Event-driven Model

Problem: repeated invocation of sim-join algorithm


t is decreasing → run sim-join algorithm in an incremental way
Prefix Event <x, A, t>
initialize prefix length for each record as 1 → <x, A, 1.0>
for each prefix event
probe the inverted list of the token for candidate pairs, verify the candidate pairs, and insert them into temp results.
insert x into A’s inverted list
extend prefix by one token → maintain prefix events with a max-heap on t
stop when t ≤ k-th temp result’s similarity
[Figure: thresholds of successive prefix events. x: 1.0, 0.75; y: 1.0, 0.8, 0.6; z: 1.0, 0.9, 0.8, 0.7]
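The event-driven loop can be sketched in Python as below. This is an illustrative reading of the steps above, not the authors' implementation: records are assumed to be duplicate-free token lists already sorted by the global ordering, the max-heap is emulated with negated thresholds in `heapq`, and a plain hash set is used to skip already-verified pairs.

```python
import heapq
from collections import defaultdict

def jaccard(x, y):
    sx, sy = set(x), set(y)
    return len(sx & sy) / len(sx | sy)

def topk_join(records, k):
    index = defaultdict(list)  # token -> ids of records whose prefix contains it
    verified = set()           # simple dedup of verified pairs
    results = []               # min-heap of (sim, a, b), at most k entries
    # Prefix events <record, position, t>, stored as (-t, id, position).
    events = [(-1.0, rid, 0) for rid in records if records[rid]]
    heapq.heapify(events)
    while events:
        neg_t, rid, pos = heapq.heappop(events)
        t = -neg_t
        # Stop once no unseen pair can beat the k-th temporary result.
        if len(results) == k and t <= results[0][0]:
            break
        tokens = records[rid]
        tok = tokens[pos]
        for cid in index[tok]:  # probe the inverted list for candidates
            pair = (min(cid, rid), max(cid, rid))
            if pair in verified:
                continue
            verified.add(pair)
            s = jaccard(records[cid], tokens)
            if len(results) < k:
                heapq.heappush(results, (s, *pair))
            elif s > results[0][0]:
                heapq.heapreplace(results, (s, *pair))
        index[tok].append(rid)  # insert the record into the token's list
        if pos + 1 < len(tokens):  # extend the prefix by one token
            n = len(tokens)
            heapq.heappush(events, (-(n - pos - 1) / n, rid, pos + 1))
    return sorted(results, reverse=True)

records = {"w": list("ABCE"), "x": list("ABCEF"), "y": list("BCDEF"), "z": list("BCFGH")}
for s, a, b in topk_join(records, k=2):
    print(a, b, round(s, 2))
```

On these four records with k = 2, the loop stops when the next event has t = 0.6, which is below the second-best temporary similarity of 0.67.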
topk-join - Example
jaccard, k=2
Records (tokens sorted by the global ordering):
w = A B C E
x = A B C E F
y = B C D E F
z = B C F G H
Prefix events: <x, B, 0.8>, <y, C, 0.8>, <z, C, 0.8>, <w, B, 0.75>
Inverted lists (token → record_id): A → w, x; B → y, z, x, w; C → y, z
Temporary results: (w,x) = 0.8, (x,y) = 0.67, (y,z) = 0.43
Stop: t = 0.6 ≤ 2nd temp result’s sim
Some pairs are verified twice!
Optimizations - Verification


In the above example, (w,x) and (y,z) have been verified
twice
How to avoid repeated verification?


memorize all verified pairs with a hash table → too much memory consumption
check if this pair will be identified again when it is verified for the first time
x = A B D E F, y = A C D E F, with prefix thresholds 1.0, 0.8, 0.6 on the first three tokens
if k-th temp result’s sim = 0.7, <x,y> won’t be identified again! (the next shared token, D, has threshold 0.6)
keep only those pairs that will be identified again before the algorithm stops
guarantee no pair will be verified twice
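The slide does not spell out the exact test, so the following is one possible reading, sketched under stated assumptions: a pair is identified again only if some later shared token lies at positions whose prefix thresholds, in both records, still exceed the current k-th temporary similarity. The names `bound` and `identified_again` are hypothetical.

```python
def bound(size, pos0):
    # Jaccard prefix threshold for a 0-based position.
    return (size - pos0) / size

def identified_again(x, y, s_k):
    # x, y: token lists sorted by the global ordering, sharing at least one token.
    # s_k: similarity of the current k-th temporary result.
    y_tokens = set(y)
    common = [tok for tok in x if tok in y_tokens]
    for tok in common[1:]:  # skip the token that identified the pair first
        if bound(len(x), x.index(tok)) > s_k and bound(len(y), y.index(tok)) > s_k:
            return True  # both prefix events would fire before the algorithm stops
    return False

x = list("ABDEF")
y = list("ACDEF")
print(identified_again(x, y, 0.7))  # False: D carries threshold 0.6, below 0.7
print(identified_again(x, y, 0.5))  # True: the events on D would still be processed
```

Since the k-th temporary similarity only grows, a pair that this check rejects stays rejected, so only pairs that pass need to be remembered.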
Optimizations - Indexing

How to reduce inverted list size to save memory?



x = A C D E F
y = B C D E F
<x,y> is identified by <x, C, 0.8> or <y, C, 0.8>, yet the maximum similarity they can achieve is 4/6 = 0.67
t is decreasing → calculate the upper bound of similarity for future probings into inverted lists
don’t insert into inverted list if this upper bound ≤ k-th temp result’s similarity
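One way to obtain such an upper bound for jaccard is derived here from the example, not quoted from the slides. Suppose x is indexed at a token with prefix threshold t_x and a later record y probes that list at a token with threshold t_y. The overlap o satisfies o ≤ t_x|x| and o ≤ t_y|y|, so o / (|x| + |y| - o) ≤ 1 / (1/t_x + 1/t_y - 1). Later probes have t_y ≤ t_x = t, which gives at most t / (2 - t). Pairs that share an earlier token are assumed to be found through that token instead.

```python
def index_upper_bound(t):
    # Best jaccard any future probe can reach through an entry indexed at threshold t.
    return t / (2 - t)

def should_index(t, s_k):
    # Skip the insertion when no future pair can beat the k-th temporary result.
    return index_upper_bound(t) > s_k

print(round(index_upper_bound(0.8), 2))  # 0.67, matching 4/6 in the example
print(should_index(0.8, 0.7))            # False
```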
Experiment Settings

Algorithms
topk-join
pptopk: modified ppjoin [Xiao et al. WWW08], a prefix-filter based approach, with t = 0.95, 0.90, 0.85...
Measure
compare topk-join and pptopk (candidate size, running time)
output results progressively
Dataset
dataset                         # of records  avg. record size
DBLP (author, title)            855k          14.0
TREC (author, title, abstract)  348k          130.1
TREC-3GRAM                      348k          868.5
UNIREF-3GRAM (protein seq.)     500k          372.9
Experiment Results
Thank you!
Questions?
Related Work

Index-based approaches
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.
Prefix-based approaches
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008.
PartEnum
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
