Efficient Similarity Joins for Near Duplicate Detection
Chuan Xiao
The University of New South Wales, Australia
Joint Work: Wei Wang (UNSW), Xuemin Lin (UNSW),
Jeffrey Xu Yu (CUHK)
Outline
Introduction
Algorithms
Experiments
Conclusion
2015/7/21
Near Duplicate Data
On one end, a winded Pete Sampras tried to summon
enough energy to give the New York fans another
memorable win to talk about it on the subway ride
home. On the other side, Roger Federer wore a sly
grin like he knew age was about to catch up to the
former world No. 1 - the man who owns the record of
14 Grand Slams he wants.
03/11/2008 | 11:28 AM
By JAY COHEN, AP
Sports Writer
Mar 11, 4:23 am EDT
Applications
For Web search engines:
Perform focused crawling
Increase the quality and diversity of query results
Identify spams
For Web mining:
Perform document clustering
Find replicate Web collections
Detect plagiarism
Q. What are the advantages of RAID5 over RAID4?
A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.
Q. What are the advantages of RAID5 over RAID4?
A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.
SPAM TEMPLATE
Sir/Madam,
We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal e-mail address attached to ticket number 653-908-321-675 with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of 960.000.00 Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS).
CONGRATULATIONS!!!
Sincerely yours,
<NAME>
<AFFILIATION>
Similarity Join
near duplicates = pairs of objects with high similarity
similarity -> quantitative way -> similarity function
Given a collection of records, the similarity join problem is to find all
pairs of records, <x,y>, such that sim(x,y)>=t
Tokenize:
Each record is a set of tokens from a finite universe.
Suppose each record is a single text document
x = “yes as soon as possible”
y = “as soon as possible please”
word       token
yes        A
as         B
soon       C
as1        D    (second occurrence of "as")
possible   E
please     F
x = {A, B, C, D, E}
y = {B, C, D, E, F}
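The tokenization step can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the helper names `tokenize` and `to_token_ids` are hypothetical, and it assumes that repeated words are made distinct by appending an occurrence counter (so the second "as" becomes `as1`).

```python
from collections import Counter
from string import ascii_uppercase


def tokenize(text):
    """Split on whitespace; number repeated words (as, as1, as2, ...)."""
    seen = Counter()
    tokens = []
    for word in text.lower().split():
        count = seen[word]
        tokens.append(word if count == 0 else f"{word}{count}")
        seen[word] += 1
    return tokens


def to_token_ids(records):
    """Assign letters A, B, C, ... to tokens in order of first appearance."""
    ids = {}
    encoded = []
    for text in records:
        rec = set()
        for tok in tokenize(text):
            if tok not in ids:
                # Illustration only: assumes at most 26 distinct tokens.
                ids[tok] = ascii_uppercase[len(ids)]
            rec.add(ids[tok])
        encoded.append(rec)
    return ids, encoded


ids, (x, y) = to_token_ids(["yes as soon as possible", "as soon as possible please"])
print(sorted(x))  # ['A', 'B', 'C', 'D', 'E']
print(sorted(y))  # ['B', 'C', 'D', 'E', 'F']
```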
Similarity Function
Common similarity functions:
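Jaccard: $J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t$
Cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x| \cdot |y|}} \ge t$
Overlap: $O(x, y) = |x \cap y| \ge t$
Example: x = {A,B,C,D,E}, y = {B,C,D,E,F}
J(x, y) = 4/6 = 0.67, C(x, y) = 4/5 = 0.8, O(x, y) = 4
Jaccard can be equivalently converted to Overlap:
$J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t \iff O(x, y) \ge \frac{t}{1 + t} (|x| + |y|)$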
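The three similarity functions and the Jaccard-to-overlap conversion can be checked with a short Python sketch. The conversion follows from $|x \cup y| = |x| + |y| - |x \cap y|$. The function names are illustrative, and the small epsilon inside `ceil` is an assumption added here to avoid floating-point round-up on exact values.

```python
import math


def overlap(x, y):
    return len(x & y)


def jaccard(x, y):
    return len(x & y) / len(x | y)


def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))


def min_overlap_for_jaccard(t, size_x, size_y):
    """Smallest integer overlap implied by J(x, y) >= t."""
    # J >= t  <=>  O >= t / (1 + t) * (|x| + |y|); epsilon guards float error.
    return math.ceil(t / (1 + t) * (size_x + size_y) - 1e-9)


x = set("ABCDE")
y = set("BCDEF")
print(overlap(x, y))            # 4
print(round(jaccard(x, y), 2))  # 0.67
print(cosine(x, y))             # 0.8
print(min_overlap_for_jaccard(0.8, len(x), len(y)))  # 5
```

With t = 0.8 the pair needs an overlap of at least 5, so an overlap of 4 is not enough, in agreement with J(x, y) = 0.67 < 0.8.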
Naïve and Index-Based Algorithms
Naïve Algorithm:
Compare every pair of objects -> O(n^2) time complexity
Index-based Algorithm [MIR, SIGMOD04]:
Record Set -> Index Construction -> Candidate Generation -> Verification -> Result Pairs
inverted lists (token: record_id):
A: w x y
B: x z …
C: y z …
candidate pairs: <w,x>, <w,y>, <x,y>, <x,z>, …
Index-Based Algorithm
Example
Suppose sim(x,y) = O(x,y) >= t = 3
record   Name
w        Data Mining: Concepts and Techniques
x        Web Data Mining Techniques
y        Data Mining: Concepts, Models, Methods, and Algorithms
z        Data Management Concepts
u        The Merchant of Venice
v        Romeo and Juliet
Result: <w,x>, <w,y>
stop words -> too many candidate pairs!
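The index-based pipeline (index construction, candidate generation, verification) can be sketched on this example in Python. This is a minimal sketch under stated assumptions: words are lower-cased and punctuation is dropped, and every pair of records sharing any token counts as a candidate. It is not the implementation from [MIR] or [SIGMOD04].

```python
import re
from collections import defaultdict
from itertools import combinations

records = {
    "w": "Data Mining: Concepts and Techniques",
    "x": "Web Data Mining Techniques",
    "y": "Data Mining: Concepts, Models, Methods, and Algorithms",
    "z": "Data Management Concepts",
    "u": "The Merchant of Venice",
    "v": "Romeo and Juliet",
}
T = 3  # overlap threshold from the example

# Tokenize into lower-case word sets (punctuation dropped).
sets = {rid: set(re.findall(r"[a-z]+", text.lower())) for rid, text in records.items()}

# Index construction: token -> list of record ids containing it.
index = defaultdict(list)
for rid in sorted(sets):
    for tok in sets[rid]:
        index[tok].append(rid)

# Candidate generation: every pair of records sharing at least one token.
candidates = set()
for rids in index.values():
    candidates.update(combinations(rids, 2))

# Verification: compute the real overlap for each candidate.
results = sorted(p for p in candidates if len(sets[p[0]] & sets[p[1]]) >= T)

print(len(candidates), "candidates")  # 8 candidates
print(results)  # [('w', 'x'), ('w', 'y')]
```

Eight candidate pairs are verified to find two results; the pairs involving v exist only because of the stop word "and".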
Prefix Filter [ICDE06, WWW07]
Sort the tokens by a global ordering
increasing order of document frequency
Index the first few tokens (prefix) for each record
Example:
suppose sim(x,y) = O(x,y) >= t = 4
[Figure: x and y with tokens sorted by the global ordering and prefixes marked; uboundO(x,y) = 3 < 4!]
Must share at least one token in prefix to be a candidate pair
Prefix Filter [ICDE06, WWW07]
$O(x,y) \ge t \Rightarrow$ prefix length $= |x| - t + 1$
$J(x,y) \ge t \Rightarrow O(x,y) \ge t|x| \Rightarrow$ prefix length $= \lfloor (1-t)|x| \rfloor + 1$
Example: suppose sim(x,y) = J(x,y) >= t = 0.8
w = {C, D, E, F}
x = {B, C, D, E, F}
y = {A, B, C, D, F}
z = {G, A, B, E, F}
Candidate Pairs: <w,x>, <x,y>, <y,z>
Results: <w,x>
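A minimal Python sketch of prefix filtering on this example follows. It assumes the global ordering G, A, B, C, D, E, F (increasing document frequency, ties broken alphabetically), which is consistent with how the records are written but is not stated explicitly. The prefix length is computed as |x| - ceil(t|x|) + 1, an integer form of the rule above; the epsilon is an added guard against floating-point error.

```python
import math
from collections import defaultdict

ORDER = "GABCDEF"  # assumed global ordering (rarest token first)
RANK = {tok: i for i, tok in enumerate(ORDER)}
T = 0.8

records = {
    "w": set("CDEF"),
    "x": set("BCDEF"),
    "y": set("ABCDF"),
    "z": set("GABEF"),
}


def prefix(rec, t):
    """First |x| - ceil(t * |x|) + 1 tokens of rec under the global ordering."""
    tokens = sorted(rec, key=RANK.__getitem__)
    length = len(tokens) - math.ceil(t * len(tokens) - 1e-9) + 1
    return tokens[:length]


# Index only prefix tokens, then pair up records sharing a prefix token.
index = defaultdict(list)
candidates = set()
for rid in sorted(records):
    for tok in prefix(records[rid], T):
        for other in index[tok]:
            candidates.add((other, rid))
        index[tok].append(rid)


def jaccard(a, b):
    return len(a & b) / len(a | b)


results = sorted(p for p in candidates if jaccard(records[p[0]], records[p[1]]) >= T)
print(sorted(candidates))  # [('w', 'x'), ('x', 'y'), ('y', 'z')]
print(results)             # [('w', 'x')]
```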
Prefix + Positional Information
We use prefix filter (All-Pairs [WWW07]) as the basic framework
Intuition
tokens sorted -> rank, or position of tokens within a record
estimate tighter upper bounds of overlap between x and y with
positional information
Contributions
index construction
index not only tokens, but their positions in the record -> ppjoin algorithm
candidate generation
probe tokens in suffix, compare the positions in the record -> ppjoin+ algorithm
Positional Filter within Prefix (ppjoin)
Index both tokens and their positions
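position: 1 2 3 4 5
x = B C D E F
y = A B C D F
ubound O(x,y) = 1 + min(4, 3) = 4
$|x \cap y| \le 4$ and $|x \cup y| \ge 6$, so ubound $J(x,y) = \frac{4}{6} < t = 0.8$
$\text{ubound } O(x,y) = 1 + \min(|x| - p_x, |y| - p_y)$, where $p_x$ and $p_y$ are the positions of the matching token in x and y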
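The positional bound can be evaluated directly. The Python sketch below uses illustrative function names and 1-based positions, and compares the bound with the minimum overlap implied by the Jaccard threshold, t / (1 + t) * (|x| + |y|) rounded up; the epsilon is an added guard against floating-point error.

```python
import math


def positional_ubound(size_x, pos_x, size_y, pos_y):
    """Upper bound on O(x, y) when the first common token sits at
    1-based positions pos_x in x and pos_y in y (both sorted globally)."""
    return 1 + min(size_x - pos_x, size_y - pos_y)


def required_overlap(t, size_x, size_y):
    """Minimum overlap implied by J(x, y) >= t."""
    return math.ceil(t / (1 + t) * (size_x + size_y) - 1e-9)


x = ["B", "C", "D", "E", "F"]
y = ["A", "B", "C", "D", "F"]
token = "B"  # first token shared by the two records

ub = positional_ubound(len(x), x.index(token) + 1, len(y), y.index(token) + 1)
need = required_overlap(0.8, len(x), len(y))
print(ub, need, "pruned" if ub < need else "kept")  # 4 5 pruned
```

The pair is pruned without looking at any token after B, because at most 3 further tokens of y can match.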
Positional Filter within Suffix (ppjoin+)
probe tokens in suffix, and compare their positions
suppose sim(x,y) = J(x,y) >= t = 0.8
|x| = |y| = 18, O(x,y) >= 16
x = A B D E (prefix) | … Q … (suffix)
y = A C D E (prefix) | … Q … (suffix)
binary search for Q
uboundO(x,y) = 3 + 4 + 1 + 7 = 15 < 16
Positional Filter within Suffix (ppjoin+)
Divide and Conquer
[Figure: two records with prefix A B C D; the suffixes are recursively split into partitions by probing tokens]
ubound (depth = 1) = 4 + 6 + 1 + 7 = 18
ubound (depth = 2) = 4 + 3 + 1 + 1 + 1 + 3 + 1 + 3 = 17
ubound (depth = 3) = 4 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 15
probe the suffix recursively until either the candidate pair is
pruned or max-depth is reached
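The divide-and-conquer idea can be sketched as a recursive upper bound on the overlap of two sorted suffixes. This is a simplified illustration, not the ppjoin+ procedure itself: it bounds overlap directly rather than working with Hamming distance, always probes with the middle token of `ys`, and the function name and sample data are hypothetical.

```python
from bisect import bisect_left


def suffix_overlap_ubound(xs, ys, max_depth, depth=1):
    """Upper bound on the number of common tokens of xs and ys.

    xs and ys are suffixes sorted by the same global order. Probing with a
    token Q splits both lists; tokens left of Q can only match tokens left
    of Q, and likewise on the right.
    """
    if depth > max_depth or not xs or not ys:
        return min(len(xs), len(ys))
    mid = len(ys) // 2
    probe = ys[mid]             # probing token Q
    i = bisect_left(xs, probe)  # binary search for Q in xs
    found = 1 if i < len(xs) and xs[i] == probe else 0
    left = suffix_overlap_ubound(xs[:i], ys[:mid], max_depth, depth + 1)
    right = suffix_overlap_ubound(xs[i + found:], ys[mid + 1:], max_depth, depth + 1)
    return left + found + right


# Hypothetical suffixes; integers stand for token ranks in the global ordering.
xs = [1, 3, 5, 7, 9, 11]
ys = [2, 4, 6, 8, 10, 12]
bounds = [suffix_overlap_ubound(xs, ys, max_depth=d) for d in range(4)]
print(bounds)  # [6, 5, 3, 0]
```

Each extra level of recursion tightens the bound; a candidate can be discarded as soon as the prefix overlap plus this bound falls below the required overlap.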
Effect of Filters
sim(x,y) = J(x,y) >= t = 0.8
u = {C, D, E, F}
v = {B, C, D, E, F}
w = {A, B, C, D, F}
x = {G, A, B, E, F}
y = {A, B, D, E, F}
z = {G, A, C, D, E, F}
after prefix filter:
<u,v>, <v,w>, <v,y>, <w,x>, <w,y>, <w,z>, <x,y>, <x,z>, <y,z>
after ppjoin:
<u,v>, <w,y>, <w,z>, <x,z>, <y,z>
after ppjoin+ (max-depth = 1):
<u,v>, <x,z>, <y,z>
real result:
<u,v>
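The first two stages can be reproduced with a short Python sketch of prefix-indexed candidate generation, with and without the positional filter. It is a simplified reading of ppjoin, not the authors' code: the global ordering G, A, B, C, D, E, F is assumed, the suffix filter of ppjoin+ is omitted, and the epsilon guards `ceil` against floating-point error.

```python
import math
from collections import defaultdict

ORDER = "GABCDEF"  # assumed global ordering (rarest token first)
RANK = {tok: i for i, tok in enumerate(ORDER)}
T = 0.8
EPS = 1e-9

records = {
    "u": "CDEF", "v": "BCDEF", "w": "ABCDF",
    "x": "GABEF", "y": "ABDEF", "z": "GACDEF",
}
sorted_recs = {r: sorted(toks, key=RANK.__getitem__) for r, toks in records.items()}


def candidates(use_positions):
    index = defaultdict(list)  # token -> [(record id, 1-based position)]
    found = set()
    # Process records by increasing size, indexing prefixes as we go.
    for rid in sorted(sorted_recs, key=lambda r: (len(sorted_recs[r]), r)):
        x = sorted_recs[rid]
        plen = len(x) - math.ceil(T * len(x) - EPS) + 1
        overlap = defaultdict(int)  # prefix overlap seen so far, per record
        pruned = set()
        for i, tok in enumerate(x[:plen], start=1):
            for other, j in index[tok]:
                if other in pruned:
                    continue
                y = sorted_recs[other]
                need = math.ceil(T / (1 + T) * (len(x) + len(y)) - EPS)
                ubound = overlap[other] + 1 + min(len(x) - i, len(y) - j)
                if use_positions and ubound < need:
                    pruned.add(other)  # positional filter: cannot reach `need`
                    overlap.pop(other, None)
                else:
                    overlap[other] += 1
            index[tok].append((rid, i))
        found.update((other, rid) for other in overlap if overlap[other] > 0)
    return sorted(found)


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


after_prefix = candidates(use_positions=False)
after_ppjoin = candidates(use_positions=True)
real = [p for p in after_ppjoin if jaccard(records[p[0]], records[p[1]]) >= T]
print(len(after_prefix))  # 9
print(after_ppjoin)
print(real)               # [('u', 'v')]
```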
Experiment Settings
Algorithms Compared
All-Pairs [WWW07]
PPJoin
PPJoin+
Measure
Jaccard, Cosine
Candidate Size, Running Time
Near Duplicate Web Page Detection
compare with shingling [SEQS97]
Experiment Settings
Environment
Pentium D 3.00GHz CPU, 2GB RAM
Debian 4.1, GCC 4.1.2 with -O3
Dataset
dataset                                 # of records   avg. size
DBLP (author, title)                    0.9M           14.0
ENRON (email)                           0.5M           142.4
DBLP-3GRAM                              0.9M           102.5
TREC-4GRAM (author, title, abstract)    0.35M          866.9
TREC-32shingle                          0.35M          32
Experiment Results – DBLP, Jaccard
[Figures: Candidate Pairs; Running Time]
Exp. Results – Near Duplicate Web Page Detection
extract qgram and shingles set, and perform similarity join
rs = result from TREC-32shingle, rq = result from TREC-4gram
Precision = tp / rs
Recall = tp / rq
Results:
threshold (Jaccard)   precision   recall   time (qgram + allpairs)   time (qgram + ppjoin+)   time (shingling + ssjoin)
0.95                  0.38        0.11     41.98s                    11.76s                   1.00s
0.90                  0.48        0.06     245.03s                   43.37s                   1.03s
0.85                  0.58        0.04     926.54s                   202.65s                  1.03s
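Precision and recall of one result set against another can be computed as below. This sketch assumes that tp is the number of pairs reported by both methods and that the q-gram result rq serves as the reference; the sample pairs are hypothetical and are not taken from the experiments.

```python
def precision_recall(rs, rq):
    """rs: pairs from the shingling-based join; rq: pairs from the q-gram join."""
    tp = len(rs & rq)  # pairs reported by both methods
    return tp / len(rs), tp / len(rq)


# Hypothetical result sets, for illustration only.
rq = {("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")}
rs = {("a", "b"), ("d", "f")}
precision, recall = precision_recall(rs, rq)
print(precision, recall)  # 0.5 0.25
```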
Conclusion
Contributions
New algorithms for set-similarity joins
Features
positional filtering within prefix -> ppjoin
positional filtering within suffix -> ppjoin+
exact
outperform existing algorithms
integrated with near duplicate Web page detection methods
Future Work:
other similarity function
edit-distance
top-k similarity search queries
Related Work
Approximate:
LSH: A. Gionis, P. Indyk, and R. Motwani. Similarity search in high
dimensions via hashing. In VLDB, 1999.
Shingling: A. Z. Broder. On the resemblance and containment of
documents. In SEQS, 1997.
Exact:
Index-based:
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In
SIGMOD, 2004.
Prefix-based:
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity
joins in data cleaning. In ICDE, 2006.
All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity
search. In WWW, 2007.
Pigeon-hole principle based:
PartEnum: A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity
joins. In VLDB, 2006.
Thank you!
Questions?
References
[SEQS97] A. Z. Broder. On the resemblance and containment of documents.
In SEQS 1997.
[MIR] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval.
Addison Wesley, 1st edition, May 1999.
[VLDB99] LSH: A. Gionis, P. Indyk, and R. Motwani. Similarity search in high
dimensions via hashing. In VLDB, 1999.
[SIGMOD04] S. Sarawagi and A. Kirpal. Efficient set joins on similarity
predicates. In SIGMOD, 2004.
[ICDE06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for
similarity joins in data cleaning. In ICDE, 2006.
[VLDB06] PartEnum: A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity
joins. In VLDB, 2006.
[WWW07] All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs
similarity search. In WWW, 2007.
Backup Slides
Memory Issues
We need twice as much memory as All-Pairs when building the index.
Space / Time
Some techniques to deal with memory
Do not build index for widowed tokens (appear only once)
Sort the records by increasing size; dynamically remove
shorter records from inverted lists
Integrated with RDBMS
Prefix filter in RDBMS [ICDE06]
Need to implement positional filters in both prefix and suffix
Q: What if the probing tokens are not found in y?
Convert overlap to hamming distance
Estimate the upper bound of hamming distance