Near-duplicates detection: Comparison of the two algorithms seen in class

Near-duplicates detection: Comparison of the two algorithms seen in class
Romain Colle
Description of algorithms
● 1st pass through the data: both algorithms compute a signature for each document and perform LSH on these signatures.
● 2nd pass through the data: verification of the relevance of the duplicate pairs found (Jaccard similarity).
● Algorithm SH uses shingles + MinHashing to compute the signatures.
● Algorithm SK uses sketches of projections onto random hyperplanes to compute the signatures (both signature schemes are sketched below).
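Below is a minimal sketch, in Python, of the two signature schemes and the banding step used for LSH. This is not the course code: the sizes (k = 5 character shingles, 100 hash functions / hyperplanes, 20 bands × 5 rows) are illustrative assumptions, as are all function names.

```python
import hashlib
import random

def shingles(text, k=5):
    """Set of k-character shingles of a document (k=5 is an arbitrary choice)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100, seed=0):
    """Algorithm SH: for each salted hash function, keep the minimum
    hash value over the document's shingles."""
    salts = random.Random(seed).sample(range(1 << 32), num_hashes)
    return [min(int(hashlib.md5(f"{salt}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set)
            for salt in salts]

def hyperplane_sketch(vector, num_planes=100, seed=0):
    """Algorithm SK: one bit per random hyperplane, namely the sign of the
    dot product between the document vector and the hyperplane's normal."""
    rng = random.Random(seed)
    return [int(sum(v * rng.gauss(0, 1) for v in vector) >= 0)
            for _ in range(num_planes)]

def lsh_buckets(signature, bands=20, rows=5):
    """Split the signature into bands; documents sharing any band's
    bucket key become a candidate duplicate pair."""
    assert bands * rows == len(signature)
    return [(b, tuple(signature[b * rows:(b + 1) * rows]))
            for b in range(bands)]

if __name__ == "__main__":
    sig = minhash_signature(shingles("a small example document body"))
    print(len(lsh_buckets(sig)), "candidate bucket keys")
```

These signatures make LSH meaningful because two MinHash values agree in a given position with probability equal to the Jaccard similarity of the shingle sets, and two hyperplane bits agree with probability 1 − θ/π for document vectors at angle θ.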
Experimentation method
● Run both algorithms on the data set (WebBase) and compute precision.
● Remove the duplicate pairs found from the data set.
● Generate and insert a large number of (near-)duplicate documents (~10% of the data set).
● Run both algorithms on the new data set and compute precision and recall (see the metric sketch below).
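As a rough sketch of how the metrics could be computed (assuming the inserted documents give us the ground-truth pairs; the function names here are ours, not the presenter's):

```python
def jaccard(a, b):
    """Jaccard similarity of two shingle sets (the 2nd-pass verification)."""
    return len(a & b) / len(a | b)

def precision_recall(found_pairs, true_pairs):
    """Precision: fraction of reported pairs that are true near-duplicates.
    Recall: fraction of true near-duplicate pairs that were reported."""
    found, true = set(found_pairs), set(true_pairs)
    true_positives = len(found & true)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(true) if true else 0.0
    return precision, recall
```

Note that on the original WebBase set only precision is measurable, since the full set of true near-duplicates is unknown; inserting known duplicates is what makes recall measurable on the modified set.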
Results (original data set)
[Bar chart: near-duplicate pairs found in the original data set. Y-axis: number of pairs found (log scale); X-axis: small, medium, and large data sets; series: Shingles vs. Sketches.]
Results (modified data set)
[Bar chart: recall in % for the customized data sets, on three configurations: few insertions (medium-size data), large insertions (medium-size data), and large insertions (large data); series: Shingles vs. Sketches.]
[Bar chart: precision in % for the same three configurations; series: Shingles vs. Sketches.]
Conclusion
● Algorithm SK rocks!
● However, it is computationally more expensive.
● Tradeoff between speed and recall/precision (given that algorithm SH performs quite well).