Top-k Set Similarity Joins Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang University of New South Wales and NICTA.
Top-k Set Similarity Joins
Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang
University of New South Wales and NICTA
Motivation
Data Cleaning
University | City | State | Postal Code
University of New South Wales | Sydney | NSW | 2052
University of Sydney | Sydney | NSW | 2006
University of Melbourne | Melbourne | Victoria | 3010
University of Queensland | Brisbane | Queensland | 4072
University of New South Vales | Sydney | NSW | 2052
More Applications
Obama Has Busy Final Day Before Taking Office as Bush Says Farewells
iht.com, Jan 20, 2009
New York Times, Jan 19th, 2009
(Traditional) Set Similarity Join
Each record is tokenized into a set
Given a collection of records, the set similarity join problem is to find all pairs of records, <x,y>, such that sim(x,y) ≥ t
Common similarity functions, with x = {A,B,C,D,E} and y = {B,C,D,E,F}:
jaccard: $J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t$, here 4/6 = 0.67
cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x| \cdot |y|}} \ge t$, here 4/5 = 0.8
dice: $D(x, y) = \frac{2 |x \cap y|}{|x| + |y|} \ge t$, here 8/10 = 0.8
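The three similarity functions can be checked on the example sets with a minimal Python sketch. The function names are illustrative, and records are assumed to be non-empty Python sets.

```python
import math

def jaccard(x, y):
    # |x ∩ y| / |x ∪ y|
    return len(x & y) / len(x | y)

def cosine(x, y):
    # |x ∩ y| / sqrt(|x| * |y|)
    return len(x & y) / math.sqrt(len(x) * len(y))

def dice(x, y):
    # 2 |x ∩ y| / (|x| + |y|)
    return 2 * len(x & y) / (len(x) + len(y))

x = set("ABCDE")
y = set("BCDEF")
scores = (jaccard(x, y), cosine(x, y), dice(x, y))
```

For these sets the overlap is 4, so the scores are 4/6, 4/5 and 8/10, as on the slide.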
What if t is unknown beforehand?
Example – using jaccard similarity function
w = {A, B, C, D, E}
x = {A, B, C, E, F}
y = {B, C, D, E, F}
z = {B, C, F, G, H}
If t = 0.7 -> no results
If t = 0.4 -> <w,x>, <w,y>, <x,y>, <x,z>, <y,z> -> too many results and long running time
Return the top-k results ranked by their similarity values
if k = 1 -> <w,x>
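The example with w, x, y and z can be reproduced with a brute-force sketch that scores every pair. The helper names (all_pairs, threshold_join, topk_join_naive) are made up for illustration, and ties between equally similar pairs are broken arbitrarily.

```python
import heapq
from itertools import combinations

def jaccard(x, y):
    return len(x & y) / len(x | y)

records = {
    "w": set("ABCDE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}

def all_pairs(records):
    # Score every pair of records (quadratic; for illustration only).
    return [((a, b), jaccard(records[a], records[b]))
            for a, b in combinations(sorted(records), 2)]

def threshold_join(records, t):
    # Traditional join: every pair with similarity >= t.
    return [pair for pair, sim in all_pairs(records) if sim >= t]

def topk_join_naive(records, k):
    # Top-k join: the k most similar pairs, no threshold needed.
    return heapq.nlargest(k, all_pairs(records), key=lambda p: p[1])

none_found = threshold_join(records, 0.7)
many_found = threshold_join(records, 0.4)
best = topk_join_naive(records, 1)
```

With t = 0.7 the result is empty, with t = 0.4 five pairs qualify, and the single best pair has similarity 4/6.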
Top-k Set Similarity Join
Return top-k pairs of records, ranked by similarity scores.
Advantages over traditional similarity join
no threshold needs to be specified
outputs results progressively -> benefits interactive applications
produces the most meaningful results under limited resources or time constraints -> can be stopped at any time, and still guarantees sim(output results) ≥ sim(unseen pairs)
Straightforward Solution
Start from a certain t, repeat the following steps:
answer traditional sim-join with t as threshold
if # of results ≥ k, stop and output the k results with highest sim
else, decrease t
Example (jaccard, k = 2)
w = {A, B, C, E}
x = {A, B, C, E, F}
y = {B, C, D, E, F}
z = {B, C, F, G, H}
t = 0.9: no result
t = 0.8: <w,x>
t = 0.7: <w,x> (results don’t change!)
t = 0.6: <w,x>, <x,y>
Which thresholds shall we enumerate? 0.8, 0.6
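The straightforward solution can be sketched as a loop around a threshold join. The brute-force threshold_join below is only a stand-in for a real similarity join, and the starting threshold and step size (0.9 and 0.1) are assumptions taken from the example rather than part of the method.

```python
from itertools import combinations

def jaccard(x, y):
    return len(x & y) / len(x | y)

def threshold_join(records, t):
    # Stand-in for a traditional similarity join (here: brute force).
    pairs = combinations(sorted(records), 2)
    scored = [((a, b), jaccard(records[a], records[b])) for a, b in pairs]
    return [(pair, sim) for pair, sim in scored if sim >= t]

def topk_by_decreasing_threshold(records, k, start=0.9, step=0.1):
    t = start
    while True:
        results = threshold_join(records, t)
        # Stop once k results are found, or when t cannot go lower.
        if len(results) >= k or t <= 0:
            results.sort(key=lambda p: p[1], reverse=True)
            return results[:k], t
        # Rounding keeps the thresholds at 0.8, 0.7, 0.6, ... despite
        # floating-point error.
        t = round(t - step, 10)

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
top2, final_t = topk_by_decreasing_threshold(records, 2)
```

On the example the join runs four times (t = 0.9, 0.8, 0.7, 0.6) before two results are available, and the run at 0.7 returns nothing new.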
Naïve and Index-Based Algorithms
Naïve Algorithm:
Compare every pair of objects -> O(n^2) time complexity
Index-based Algorithm [Sarawagi et al. SIGMOD04]:
Record Set -> Index Construction -> inverted lists -> Candidate Generation -> Verification -> Result Pairs
inverted lists (token -> record_id):
A -> w, x
B -> x, z, …
C -> y, z, …
candidate pairs: <w,x>, <w,y>, <x,y>, <x,z>, …
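The index-based pipeline (index construction, candidate generation, verification) can be sketched as follows. This is a simplified illustration with a made-up function name, index_join, and not the exact algorithm of the cited paper: every token of every record is indexed, and any two records sharing a token become a candidate pair.

```python
def jaccard(x, y):
    return len(x & y) / len(x | y)

def index_join(records, t):
    index = {}      # inverted lists: token -> ids of records indexed so far
    results = []
    for rid in sorted(records):
        tokens = records[rid]
        # Candidate generation: records sharing at least one token.
        candidates = set()
        for tok in tokens:
            candidates.update(index.get(tok, ()))
        # Verification: compute the exact similarity of each candidate.
        for other in sorted(candidates):
            sim = jaccard(tokens, records[other])
            if sim >= t:
                results.append(((other, rid), sim))
        # Index construction: add this record to its tokens' lists.
        for tok in tokens:
            index.setdefault(tok, []).append(rid)
    return results

records = {
    "w": set("ABCDE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
result_pairs = [pair for pair, _ in index_join(records, 0.6)]
```

Records that share no token are never compared, which is where the saving over the naïve algorithm comes from.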
Prefix Filter
Sort the tokens by a global ordering
[Chaudhuri et al. ICDE06, Bayardo et al. WWW07]
increasing order of document frequency
Only need to index the first few tokens (prefix) for each record
Example:
jaccard t = 0.8 -> |x ∩ y| ≥ 4 if |x| = |y| = 5
x = A B | E F G (sorted; prefix: A B)
y = C D | E F G (sorted; prefix: C D)
upper bound of O(x,y) = 3 < 4!
Must share at least one token in prefix to be a candidate pair
For jaccard, prefix length = |x| * (1 – t) + 1 -> each t is associated with a prefix length
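A small sketch of the prefix filter for jaccard. It assumes each record is a token list already sorted by the same global ordering (alphabetical order stands in for document frequency here). It writes the prefix length as |x| - ceil(t * |x|) + 1, which equals the slide's formula rounded down. The function names and the small epsilon guarding against floating-point error are choices made for this illustration.

```python
import math

def prefix_length(size, t):
    # Jaccard prefix length: size - ceil(t * size) + 1.
    return size - math.ceil(t * size - 1e-9) + 1

def may_be_similar(x_sorted, y_sorted, t):
    # A pair can reach similarity t only if the two prefixes share a token.
    px = set(x_sorted[:prefix_length(len(x_sorted), t)])
    py = set(y_sorted[:prefix_length(len(y_sorted), t)])
    return bool(px & py)

x = list("ABEFG")
y = list("CDEFG")
pruned = not may_be_similar(x, y, 0.8)
```

Here both prefixes have length 2 and are disjoint, so the pair is pruned without computing its similarity.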
Necessary Thresholds
Each prefix is associated with a threshold, i.e., the
maximum possible similarity a record can achieve with
other records.
x = A B C with t = 1.0, 0.8, 0.6 for prefix lengths 1, 2, 3
What thresholds shall we enumerate? All the thresholds with which prefixes are associated!
Necessary thresholds
If we change between different thresholds, there exists a database instance where the results will change
-> extend prefix by one token, and consider the new t
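Inverting the jaccard prefix-length formula gives the threshold tied to each prefix length p of a record of size n: t = (n - p + 1) / n. A minimal sketch, with an illustrative function name:

```python
def necessary_thresholds(size):
    # Threshold associated with prefix lengths 1..size of one record.
    return [(size - p + 1) / size for p in range(1, size + 1)]

five = necessary_thresholds(5)
four = necessary_thresholds(4)
```

For a five-token record the first three values are 1.0, 0.8 and 0.6, and a four-token record starts with 1.0 and 0.75.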
Event-driven Model
Problem: repeated invocation of sim-join algorithm
t is decreasing -> run sim-join algorithm in an incremental way
Prefix Event <x, A, t>
initialize prefix length for each record as 1 -> <x, A, 1.0>
for each prefix event
probe the inverted list of the token for candidate pairs, verify the candidate pairs, and insert them into temp results
insert x into A’s inverted list
extend prefix by one token
maintain prefix events with a max-heap on t
stop when t ≤ k-th temp result’s similarity
prefix event thresholds per record:
x: 1.0, 0.75
y: 1.0, 0.8, 0.6
z: 1.0, 0.9, 0.8, 0.7
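The event-driven model can be sketched in Python as below. This is a simplified reading of the slide, not the authors' implementation: it assumes jaccard similarity, non-empty records given as token lists sorted by one global ordering (alphabetical here), and a plain set to skip pairs that were already verified. All names are illustrative.

```python
import heapq

def jaccard(x, y):
    return len(set(x) & set(y)) / len(set(x) | set(y))

def topk_join(records, k):
    # Max-heap of prefix events (t, record id, prefix position); heapq is
    # a min-heap, so thresholds are stored negated.
    events = [(-1.0, rid, 0) for rid in records]
    heapq.heapify(events)
    index = {}     # token -> ids of records whose prefix contains it
    top = []       # min-heap of (sim, pair): the k best temp results
    seen = set()   # pairs already verified
    while events:
        neg_t, rid, pos = heapq.heappop(events)
        t = -neg_t
        if len(top) == k and t <= top[0][0]:
            break  # no unseen pair can beat the k-th temp result
        tokens = records[rid]
        tok = tokens[pos]
        # Probe the token's inverted list and verify new candidates.
        for other in index.get(tok, []):
            pair = tuple(sorted((rid, other)))
            if pair in seen:
                continue
            seen.add(pair)
            sim = jaccard(tokens, records[other])
            if len(top) < k:
                heapq.heappush(top, (sim, pair))
            elif sim > top[0][0]:
                heapq.heapreplace(top, (sim, pair))
        # Insert the record into the list, then extend its prefix.
        index.setdefault(tok, []).append(rid)
        if pos + 1 < len(tokens):
            next_t = (len(tokens) - (pos + 1)) / len(tokens)
            heapq.heappush(events, (-next_t, rid, pos + 1))
    return sorted(top, reverse=True)

records = {
    "w": list("ABCE"),
    "x": list("ABCEF"),
    "y": list("BCDEF"),
    "z": list("BCFGH"),
}
top2 = topk_join(records, 2)
```

With k = 2 the loop stops at the first event with t = 0.6, because the second-best temporary result already has similarity 4/6.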
topk-join - Example
jaccard, k=2
w = {A, B, C, E}
x = {A, B, C, E, F}
y = {B, C, D, E, F}
z = {B, C, F, G, H}
prefix events: <x, B, 0.8>, <y, C, 0.8>, <z, C, 0.8>, <w, B, 0.75>
inverted list (token -> record_id):
A -> w, x
B -> y, z, x, w
C -> y, z
temporary results: (w,x) = 0.8, (x,y) = 0.67, (y,z) = 0.43
stop when t = 0.6 ≤ 2nd temp result’s sim
(w,x) and (y,z) are verified twice!
Optimizations - Verification
In the above example, (w,x) and (y,z) have been verified twice
How to avoid repeated verification?
memorize all verified pairs with a hash table -> too much memory consumption
check if this pair will be identified again when it is verified for the first time
t: 1.0 0.8 0.6
x = A B D E F
y = A C D E F
if k-th temp result’s sim = 0.7 -> won’t be identified again!
keep only those pairs that will be identified again before the algorithm stops
guarantees that no pair will be verified twice
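One way to read this check, as a hedged sketch rather than the authors' exact procedure: after a pair is verified through a common token, look for the next token the two records share. The pair would be found again only when both prefix events for that token have fired, that is, at the smaller of the two thresholds, and only if that threshold is still above the k-th temporary similarity. The function name and 0-based positions are assumptions of this example.

```python
def will_be_identified_again(x, y, px, py, kth_sim):
    # x, y: token lists in global order; px, py: 0-based positions of the
    # common token through which the pair was just identified.
    later_y = {tok: j for j, tok in enumerate(y) if j > py}
    for i in range(px + 1, len(x)):
        j = later_y.get(x[i])
        if j is not None:
            # Threshold at which both events for this token have fired.
            t = min((len(x) - i) / len(x), (len(y) - j) / len(y))
            return t > kth_sim
    return False  # no further common token

x = list("ABDEF")
y = list("ACDEF")
again = will_be_identified_again(x, y, 0, 0, 0.7)
```

In the slide's example the next common token D is reached at t = 0.6, which is below 0.7, so the pair does not need to be remembered.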
Optimizations - Indexing
How to reduce inverted list size to save memory?
x = A C D E F
y = B C D E F
identified by <x, C, 0.8> or <y, C, 0.8>, yet the maximum similarity they can achieve is 4/6 = 0.67
t is decreasing -> calculate the upper bound of similarity for future probings into inverted lists
don’t insert into inverted list if this upper bound ≤ k-th temp result’s similarity
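The 4/6 bound in the example can be reproduced with a short sketch. It assumes the sizes of both records and the 0-based positions of their first common token are known. The slides do not spell out the bound used for unknown future records, so this only illustrates the idea, with illustrative function names.

```python
def pair_upper_bound(size_x, pos_x, size_y, pos_y):
    # Tokens before the first common token cannot be shared, so the
    # overlap is at most the shorter remaining suffix.
    overlap = min(size_x - pos_x, size_y - pos_y)
    return overlap / (size_x + size_y - overlap)

def should_index(upper_bound, kth_sim):
    # Skip the insertion if no future probe could beat the k-th result.
    return upper_bound > kth_sim

bound = pair_upper_bound(5, 1, 5, 1)
```

For x and y above the first common token C sits at position 1 in both records, giving an overlap of at most 4 and a bound of 4/6.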
Experiment Settings
Algorithms
topk-join
pptopk: modified ppjoin [Xiao et al. WWW08], a prefix-filter based approach, with t = 0.95, 0.90, 0.85...
Measure
compare topk-join and pptopk (candidate size, running time)
output results progressively
Dataset
dataset | # of records | avg. record size
DBLP (author, title) | 855k | 14.0
TREC (author, title, abstract) | 348k | 130.1
TREC-3GRAM | 348k | 868.5
UNIREF-3GRAM (protein seq.) | 500k | 372.9
Experiment Results
Thank you!
Questions?
Related Work
Index-based approaches
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.
Prefix-based approaches
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008.
PartEnum
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.