Efficient Similarity Joins for Near Duplicate Detection

Efficient Similarity Joins
for Near Duplicate Detection
Chuan Xiao
The University of New South Wales, Australia
Joint Work: Wei Wang (UNSW), Xuemin Lin (UNSW),
Jeffrey Xu Yu (CUHK)
Outline
Introduction
 Algorithms
 Experiments
 Conclusion

2015/7/21
Efficient Similarity Joins for Near Duplicate Detection
2
Near Duplicate Data
On one end, a winded Pete Sampras tried to summon
enough energy to give the New York fans another
memorable win to talk about it on the subway ride
home. On the other side, Roger Federer wore a sly
grin like he knew age was about to catch up to the
former world No. 1 - the man who owns the record of
14 Grand Slams he wants.
03/11/2008 | 11:28 AM
By JAY COHEN, AP
Sports Writer
Mar 11, 4:23 am EDT
Applications

For Web search engines:
 Perform focused crawling
 Increase the quality and diversity of query results
 Identify spams.
For Web mining:
 Perform document clustering
 Find replicate Web collections
 Detect plagiarism

Example (two near-duplicate Q&A answers):
Q. What are the advantages of RAID5 over RAID4?
A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.

Q. What are the advantages of RAID5 over RAID4?
A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.

Example (SPAM TEMPLATE):
Sir/Madam,
We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or personal e-mail address attached to ticket number 653-908-321-675 with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of 960.000.00 Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS).
CONGRATULATIONS!!!
Sincerely yours,
<NAME>
<AFFILIATION>
Similarity Join




near duplicates = pairs of objects with high similarity
similarity -> quantitative way -> similarity function
Given a collection of records, the similarity join problem is to find all
pairs of records, <x,y>, such that sim(x,y)>=t
Tokenize:


Each record is a set of tokens from a finite universe.
Suppose each record is a single text document
 x = “yes as soon as possible”
 y = “as soon as possible please”


 word:   yes  as  soon  as_1  possible  please
 token:  A    B   C     D     E         F
 (as_1 denotes the second occurrence of "as")
x = {A, B, C, D, E}
y = {B, C, D, E, F}
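The word-to-token mapping above can be sketched in a few lines. This is only an illustration, assuming whitespace tokenization and a hypothetical `tokenize` helper that numbers repeated words (the `as_1` naming is an assumption, not the authors' code).

```python
from collections import Counter

def tokenize(text):
    """Split on whitespace and number repeated words so the result is a set."""
    seen = Counter()
    tokens = set()
    for word in text.lower().split():
        # The k-th repeat of a word becomes a distinct token: "as", "as_1", ...
        tokens.add(word if seen[word] == 0 else f"{word}_{seen[word]}")
        seen[word] += 1
    return tokens

x = tokenize("yes as soon as possible")
y = tokenize("as soon as possible please")
shared = x & y  # tokens the two records have in common
print(sorted(x), sorted(y), len(shared))
```

Numbering repeats keeps each record a plain set, so set intersection and union can be used directly by the similarity functions.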
Similarity Function

Common similarity functions (example: x = {A,B,C,D,E}, y = {B,C,D,E,F}):
 Jaccard: $J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t$   (4/6 = 0.67)
 Cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x| \cdot |y|}} \ge t$   (4/5 = 0.8)
 Overlap: $O(x, y) = |x \cap y| \ge t$   (4)
Jaccard can be equivalently converted to Overlap:
 $J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t \iff O(x, y) \ge \frac{t}{1 + t} (|x| + |y|)$
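As a quick check of the three functions on the running example, here is a minimal sketch. The function names are hypothetical, and the threshold is passed as an exact `Fraction` (an assumption made here so that boundary cases such as 16.0 are not disturbed by floating-point rounding).

```python
import math
from fractions import Fraction

def jaccard(x, y):
    return len(x & y) / len(x | y)

def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x, y):
    return len(x & y)

def min_overlap_for_jaccard(t, size_x, size_y):
    """Smallest integer overlap implied by J(x, y) >= t, i.e. ceil(t/(1+t) * (|x|+|y|))."""
    return math.ceil(t / (1 + t) * (size_x + size_y))

x = set("ABCDE")
y = set("BCDEF")
t = Fraction(4, 5)  # Jaccard threshold 0.8
print(jaccard(x, y), cosine(x, y), overlap(x, y))
print(min_overlap_for_jaccard(t, len(x), len(y)))
```

For two records of size 5 and t = 0.8 the required overlap is 5, so x and y (overlap 4) cannot reach the threshold.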
Naïve and Index-Based Algorithms

Naïve Algorithm:
 Compare every pair of objects -> O(n^2) time complexity
Index-based Algorithm [MIR, SIGMOD04]:
 Record Set -> Index Construction -> Candidate Generation -> Verification -> Result Pairs
 inverted lists (token: record_id):
  A: w x y
  B: x z …
  C: y z …
 candidate pairs: <w,x>, <w,y>, <x,y>, <x,z>, …
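The two approaches can be contrasted with a small sketch. The function names are hypothetical, and the toy records are built only from the visible entries of the inverted lists on the slide (the elided "…" entries are ignored), so this is an illustration rather than the cited algorithms themselves.

```python
from collections import defaultdict
from itertools import combinations

def naive_join(records, t):
    """Compare every pair of records: O(n^2) overlap computations."""
    return {(a, b) for a, b in combinations(sorted(records), 2)
            if len(records[a] & records[b]) >= t}

def index_join(records, t):
    """Index construction -> candidate generation -> verification."""
    inverted = defaultdict(list)            # token -> list of record ids
    for rid in sorted(records):
        for token in records[rid]:
            inverted[token].append(rid)
    candidates = set()
    for rids in inverted.values():          # records sharing at least one token
        candidates.update(combinations(rids, 2))
    results = {(a, b) for a, b in candidates
               if len(records[a] & records[b]) >= t}
    return candidates, results

# Toy records consistent with the visible inverted lists: A: w x y, B: x z, C: y z
records = {"w": {"A"}, "x": {"A", "B"}, "y": {"A", "C"}, "z": {"B", "C"}}
candidates, results = index_join(records, t=1)
print(sorted(candidates))
```

Only pairs that share a token are ever verified; the pair <w,z> is never generated because no inverted list contains both records.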
Index-Based Algorithm


Index-based Algorithm Example
 Suppose sim(x,y) = O(x,y) >= t = 3
 Name  Record
 w     Data Mining: Concepts and Techniques
 x     Web Data Mining Techniques
 y     Data Mining: Concepts, Models, Methods, and Algorithms
 z     Data Management Concepts
 u     The Merchant of Venice
 v     Romeo and Juliet
 Result: <w,x>, <w,y>
 stop words -> too many candidate pairs!
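The effect of stop words on candidate generation can be reproduced on the six titles. This is a rough sketch, assuming lowercase word tokens with punctuation stripped; enumerating pairs that share any token yields the same candidates an inverted index would produce.

```python
import re
from itertools import combinations

titles = {
    "w": "Data Mining: Concepts and Techniques",
    "x": "Web Data Mining Techniques",
    "y": "Data Mining: Concepts, Models, Methods, and Algorithms",
    "z": "Data Management Concepts",
    "u": "The Merchant of Venice",
    "v": "Romeo and Juliet",
}
tokens = {rid: set(re.findall(r"[a-z]+", title.lower()))
          for rid, title in titles.items()}

# Any two records sharing at least one token become a candidate pair.
candidates = {(a, b) for a, b in combinations(sorted(tokens), 2)
              if tokens[a] & tokens[b]}
# Verification with the overlap threshold t = 3.
results = {(a, b) for a, b in candidates if len(tokens[a] & tokens[b]) >= 3}
# Candidates that exist only because of the stop word "and".
stop_word_only = {(a, b) for a, b in candidates
                  if tokens[a] & tokens[b] == {"and"}}
print(len(candidates), sorted(results), sorted(stop_word_only))
```

Eight candidate pairs are verified to obtain two results, and two of the candidates share nothing but "and".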
Prefix Filter [ICDE06, WWW07]

Sort the tokens by a global ordering
 increasing order of document frequency
Index the first few tokens (prefix) for each record
Example:
 suppose sim(x,y) = O(x,y) >= t = 4
 [Figure: tokens of x and y before and after sorting by the global ordering, with the prefix of each sorted record marked; uboundO(x,y) = 3 < 4!]
Must share at least one token in prefix to be a candidate pair
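A minimal sketch of the prefix filter for an overlap threshold follows. The records here are hypothetical (the tokens in the slide's figure are not recoverable), ties in document frequency are broken alphabetically as an assumption, and the prefix length |x| - t + 1 is the one used for overlap thresholds.

```python
from collections import Counter

def global_order(records):
    """Rank tokens by increasing document frequency (ties broken alphabetically)."""
    df = Counter(token for rec in records.values() for token in rec)
    ordered = sorted(df, key=lambda tok: (df[tok], tok))
    return {tok: rank for rank, tok in enumerate(ordered)}

def overlap_prefix(record, rank, t):
    """First |x| - t + 1 tokens of the record under the global ordering."""
    ordered = sorted(record, key=rank.__getitem__)
    return ordered[: max(len(ordered) - t + 1, 0)]

def may_reach_overlap(x, y, rank, t):
    """Prefix filter: False guarantees O(x, y) < t; True means 'verify this pair'."""
    return bool(set(overlap_prefix(x, rank, t)) & set(overlap_prefix(y, rank, t)))

# Hypothetical records: C, D, E are frequent; A, B, F, G are rare.
records = {"x": set("ABCDE"), "y": set("CDEFG")}
rank = global_order(records)
print(overlap_prefix(records["x"], rank, 4), overlap_prefix(records["y"], rank, 4))
print(may_reach_overlap(records["x"], records["y"], rank, 4))
```

With t = 4 the prefixes are [A, B] and [F, G]; they are disjoint, so the pair is pruned without computing the overlap, which is in fact 3.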
Prefix Filter [ICDE06, WWW07]

O(x,y) >= t  =>  prefix length = |x| - t + 1
J(x,y) >= t  =>  O(x,y) >= t |x|  =>  prefix length = floor((1-t) |x|) + 1

Example: suppose sim(x,y) = J(x,y) >= t = 0.8
 w = {C, D, E, F}
 x = {B, C, D, E, F}
 y = {A, B, C, D, F}
 z = {G, A, B, E, F}
 Candidate Pairs: <w,x>, <x,y>, <y,z>
 Results: <w,x>
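The Jaccard example can be replayed with a short sketch. It assumes that each record is listed in the global token order (so G precedes A), writes the prefix length as |x| - ceil(t|x|) + 1, which equals floor((1-t)|x|) + 1, and uses an exact `Fraction` threshold to avoid rounding errors; the helper names are hypothetical.

```python
import math
from fractions import Fraction
from itertools import combinations

T = Fraction(4, 5)  # Jaccard threshold 0.8, kept exact

records = {  # tokens assumed to be listed in the global order
    "w": ["C", "D", "E", "F"],
    "x": ["B", "C", "D", "E", "F"],
    "y": ["A", "B", "C", "D", "F"],
    "z": ["G", "A", "B", "E", "F"],
}

def prefix_length(size, t):
    # floor((1 - t) * size) + 1, computed as size - ceil(t * size) + 1
    return size - math.ceil(t * size) + 1

def jaccard(a, b):
    a, b = set(a), set(b)
    return Fraction(len(a & b), len(a | b))

prefixes = {rid: set(rec[: prefix_length(len(rec), T)])
            for rid, rec in records.items()}
candidates = [(a, b) for a, b in combinations(sorted(records), 2)
              if prefixes[a] & prefixes[b]]
results = [(a, b) for a, b in candidates
           if jaccard(records[a], records[b]) >= T]
print(candidates, results)
```

Three of the six possible pairs survive the filter, and verification keeps only <w,x>, whose Jaccard similarity is exactly 4/5.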
Prefix + Positional Information


We use prefix filter (All-Pairs [WWW07]) as basic framework
Intuition
 tokens sorted -> rank, or position of tokens within a record
 estimate tighter upper bounds of overlap between x and y with positional information
Contributions
 index construction
  index not only tokens, but their positions in the record
  -> ppjoin algorithm
 candidate generation
  probe tokens in suffix, compare the positions in the record
  -> ppjoin+ algorithm
Positional Filter within Prefix (ppjoin)

Index both tokens and their positions
 position:  1  2  3  4  5
 x =        B  C  D  E  F
 y =        A  B  C  D  F
 ubound O(x,y) = 1 + min(4, 3) = 4
 $\text{ubound } J(x, y) = \frac{|x \cap y|}{|x \cup y|} = \frac{4}{6} < t = 0.8$
uboundO(x,y) = 1 + min(|x| - px, |y| - py)
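The bound on this slide is small enough to check directly. In the sketch below the function names are hypothetical, positions are 1-based as on the slide, and the Jaccard bound divides the overlap bound by |x| + |y| minus that bound.

```python
def positional_ubound(size_x, size_y, px, py):
    """Upper bound on O(x, y) when the first shared token is at positions px, py (1-based)."""
    return 1 + min(size_x - px, size_y - py)

def jaccard_ubound(size_x, size_y, overlap_ubound):
    """Upper bound on J(x, y) implied by an upper bound on the overlap."""
    return overlap_ubound / (size_x + size_y - overlap_ubound)

x = ["B", "C", "D", "E", "F"]
y = ["A", "B", "C", "D", "F"]
px = x.index("B") + 1   # position 1 in x
py = y.index("B") + 1   # position 2 in y
ub_o = positional_ubound(len(x), len(y), px, py)   # 1 + min(4, 3)
ub_j = jaccard_ubound(len(x), len(y), ub_o)        # 4 / 6
print(ub_o, ub_j, ub_j < 0.8)
```

Because the Jaccard bound is below the threshold 0.8, the pair can be discarded as soon as the shared token B is seen, without reading the rest of either record.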
Positional Filter within Suffix (ppjoin+)



probe tokens in suffix, and compare their positions
suppose sim(x,y) = J(x,y) >= t = 0.8
 |x| = |y| = 18, O(x,y) >= 16
 [Figure: x with prefix A B D E and y with prefix A C D E; token Q is located in both suffixes by binary search]
 uboundO(x,y) = 3 + 4 + 1 + 7 = 15 < 16
Positional Filter within Suffix (ppjoin+)

Divide and Conquer
 [Figure: prefix A B C D and recursively partitioned suffix for each of the two records]
 ubound_{dep=1} = 4 + 6 + 1 + 7 = 18
 ubound_{dep=2} = 4 + 3 + 1 + 1 + 1 + 3 + 1 + 3 = 17
 ubound_{dep=3} = 4 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 15
probe suffix recursively, until either the candidate pair is pruned or max-depth is reached
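The recursive probing can be sketched as a lower bound on the Hamming distance between two suffixes, which is equivalent to an upper bound on their overlap. This is an illustrative variant, not the authors' exact procedure: the function name, the choice of the middle token of the first list as the probe, and the `hmax` early-exit parameter are assumptions, and tokens are assumed to compare in the global order.

```python
from bisect import bisect_left

def hamming_lower_bound(x, y, hmax, depth=1, max_depth=2):
    """Lower bound on |x - y| + |y - x| for two token lists sorted in the global order."""
    if not x or not y or depth > max_depth:
        return abs(len(x) - len(y))          # size difference is always a valid bound
    mid = (len(x) - 1) // 2
    w = x[mid]                               # probing token
    pos = bisect_left(y, w)                  # binary search for w in y
    found = pos < len(y) and y[pos] == w
    xl, xr = x[:mid], x[mid + 1:]
    yl = y[:pos]
    yr = y[pos + 1:] if found else y[pos:]
    diff = 0 if found else 1                 # w itself is a mismatch if absent from y
    bound = abs(len(xl) - len(yl)) + abs(len(xr) - len(yr)) + diff
    if bound > hmax:
        return bound                         # already enough to prune the pair
    left = hamming_lower_bound(xl, yl, hmax - abs(len(xr) - len(yr)) - diff,
                               depth + 1, max_depth)
    right = hamming_lower_bound(xr, yr, hmax - left - diff, depth + 1, max_depth)
    return left + right + diff

# Illustrative suffixes; a pair is pruned when the bound exceeds hmax.
print(hamming_lower_bound(list("BDEF"), list("BCDF"), hmax=0, max_depth=1))
print(hamming_lower_bound(list("ACDEF"), list("ABEF"), hmax=1, max_depth=1))
print(hamming_lower_bound(list("ACDEF"), list("ABEF"), hmax=1, max_depth=3))
```

Every returned value is a valid lower bound because tokens smaller and larger than the probe can only match within their own partition. Deeper recursion tightens the bound: the second pair is not pruned at depth 1 but is at depth 3.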
Effect of Filters

sim(x,y) = J(x,y) >= t = 0.8
 u = {C, D, E, F}
 v = {B, C, D, E, F}
 w = {A, B, C, D, F}
 x = {G, A, B, E, F}
 y = {A, B, D, E, F}
 z = {G, A, C, D, E, F}
after prefix filter:
 <u,v>, <v,w>, <v,y>, <w,x>, <w,y>, <w,z>, <x,y>, <x,z>, <y,z>
after ppjoin:
 <u,v>, <w,y>, <w,z>, <x,z>, <y,z>
after ppjoin+ (max-depth = 1):
 <u,v>, <x,z>, <y,z>
real result:
 <u,v>
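The first two candidate sets and the real result can be reproduced with a compact sketch of prefix filtering with and without the positional bound. It assumes the tokens of each record are listed in the global order (G first), processes records by increasing size, applies the size condition |y| >= t|x|, and uses an exact `Fraction` threshold; the function and variable names are hypothetical and the suffix filter of ppjoin+ is not included.

```python
import math
from collections import defaultdict
from fractions import Fraction

T = Fraction(4, 5)
RECORDS = {  # tokens assumed to be listed in the global order G < A < B < C < D < E < F
    "u": "CDEF", "v": "BCDEF", "w": "ABCDF",
    "x": "GABEF", "y": "ABDEF", "z": "GACDEF",
}

def candidates(records, t, positional):
    index = defaultdict(list)        # token -> [(record id, 1-based position)]
    found = set()
    for xid, x in sorted(records.items(), key=lambda kv: len(kv[1])):
        overlap = {}                 # yid -> shared prefix tokens so far; None = pruned
        for i in range(1, len(x) - math.ceil(t * len(x)) + 2):   # prefix positions
            token = x[i - 1]
            for yid, j in index[token]:
                y = records[yid]
                if len(y) < t * len(x) or overlap.get(yid, 0) is None:
                    continue
                alpha = math.ceil(t / (1 + t) * (len(x) + len(y)))  # required overlap
                ubound = 1 + min(len(x) - i, len(y) - j)
                if positional and overlap.get(yid, 0) + ubound < alpha:
                    overlap[yid] = None
                else:
                    overlap[yid] = overlap.get(yid, 0) + 1
            index[token].append((xid, i))
        found.update(tuple(sorted((xid, yid))) for yid, n in overlap.items() if n)
    return found

def jaccard(a, b):
    return Fraction(len(set(a) & set(b)), len(set(a) | set(b)))

after_prefix = candidates(RECORDS, T, positional=False)
after_ppjoin = candidates(RECORDS, T, positional=True)
real = {p for p in after_ppjoin if jaccard(RECORDS[p[0]], RECORDS[p[1]]) >= T}
print(len(after_prefix), sorted(after_ppjoin), sorted(real))
```

The prefix filter leaves nine candidates, the positional bound reduces them to five, and verification confirms only <u,v>.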
Experiment Settings

Algorithms Compared
 All-Pairs [WWW07]
 PPJoin
 PPJoin+
Measure
 Jaccard, Cosine
 Candidate Size, Running Time
Near Duplicate Web Page Detection
 compare with shingling [SEQS97]
Experiment Settings

Environment
 Pentium D 3.00GHz CPU, 2GB RAM
 Debian 4.1, GCC 4.1.2 with -O3
Dataset
 dataset                                # of records   avg. size
 DBLP (author, title)                   0.9M           14.0
 ENRON (email)                          0.5M           142.4
 DBLP-3GRAM                             0.9M           102.5
 TREC-4GRAM (author, title, abstract)   0.35M          866.9
 TREC-32shingle                         0.35M          32
Experiment Results – DBLP, Jaccard

[Figures: Candidate Pairs; Running Time]
Exp. Results – Near Duplicate Web Page Detection

extract q-gram and shingle sets, and perform similarity join
rs = result from TREC-32shingle, rq = result from TREC-4gram
 Precision = tp / rs
 Recall = tp / rq
Results:
 threshold (Jaccard)   precision   recall   time (qgram allpairs)   time (qgram ppjoin+)   time (shingling + ssjoin)
 0.95                  0.38        0.11     41.98s                  11.76s                 1.00s
 0.90                  0.48        0.06     245.03s                 43.37s                 1.03s
 0.85                  0.58        0.04     926.54s                 202.65s                1.03s
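The two measures can be computed from the result sets as follows. This sketch assumes that tp counts the pairs reported by both methods, which the slide does not spell out, and the pair sets are hypothetical placeholders rather than the TREC results.

```python
def precision_recall(rs, rq):
    """rs: pairs found via shingling; rq: pairs found by the exact q-gram join."""
    tp = len(rs & rq)   # assumed meaning of tp: pairs present in both results
    return tp / len(rs), tp / len(rq)

# Hypothetical result sets of record-id pairs
rq = {(1, 2), (1, 3), (2, 3), (4, 5)}
rs = {(1, 2), (4, 6)}
precision, recall = precision_recall(rs, rq)
print(precision, recall)
```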
Conclusion

Contributions
 New algorithms for set-similarity joins
  positional filtering within prefix -> ppjoin
  positional filtering within suffix -> ppjoin+
 Features
  exact
  outperform existing algorithms
  integrated with near duplicate Web page detection methods
Future Work:
 other similarity function
  edit-distance
 top-k similarity search queries
Related Work

Approximate:
 LSH: A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
 Shingling: A. Z. Broder. On the resemblance and containment of documents. In SEQS, 1997.
Exact:
 Index-based:
  S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
 Prefix-based:
  S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
  All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
 Pigeon-hole principle based:
  PartEnum: A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
Thank you!
Questions?
References



[SEQS97] A. Z. Broder. On the resemblance and containment of documents. In SEQS, 1997.
[MIR] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1st edition, May 1999.
[VLDB99] LSH: A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
[SIGMOD04] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
[ICDE06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
[VLDB06] PartEnum: A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
[WWW07] All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
Backup Slides

Memory Issues
 We need twice the amount of memory as All-Pairs when building the index.
 Space / Time
 Some techniques to deal with memory
  Do not build index for widowed tokens (appear only once)
  Sort the records by increasing size; dynamically remove shorter records from inverted lists
Integrated with RDBMS
 Prefix filter in RDBMS [ICDE06]
 Need to implement positional filters in both prefix and suffix
Q: What if the probing tokens are not found in y?
 Convert overlap to hamming distance
 Estimate the upper bound of hamming distance
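The conversion from overlap to Hamming distance mentioned above follows from standard set identities. In the sketch below, H denotes the Hamming distance between two records viewed as sets and \alpha the required overlap; both symbols are notation assumed here rather than taken from the slides.

```latex
\begin{align*}
H(x, y) &= |x \setminus y| + |y \setminus x| \\
        &= \bigl(|x| - O(x, y)\bigr) + \bigl(|y| - O(x, y)\bigr) \\
        &= |x| + |y| - 2\,O(x, y), \\
O(x, y) \ge \alpha &\iff H(x, y) \le |x| + |y| - 2\alpha .
\end{align*}
```

A probing token that is missing from y simply contributes 1 to the Hamming distance, so the comparison against the bound on the right-hand side still applies.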
