Detecting Near-Duplicates for Web Crawling
DETECTING NEAR-DUPLICATES FOR
WEB CRAWLING
Authors:
Gurmeet Singh Manku,
Arvind Jain, and
Anish Das Sarma
Presentation By:
Fernando Arreola
Outline
De-duplication
Goal of the Paper
Why is De-duplication Important?
Algorithm
Experiment
Related Work
Tying it Back to Lecture
Paper Evaluation
Questions
6/20/2011
Detecting Near-Duplicates for Web Crawling
2
De-duplication
The process of eliminating near-duplicate web documents in a generic crawl
Challenge of near-duplicates:
Identifying exact duplicates is easy
Use checksums
How to identify near-duplicates?
Near-duplicates are identical in content but have differences in small areas
e.g. ads, counters, and timestamps
Goal of the Paper
Present a near-duplicate detection system which improves web crawling
The near-duplicate detection system includes:
Simhash technique
Technique used to transform a web-page into an f-bit fingerprint
Solution to the Hamming Distance Problem
Given an f-bit fingerprint, find all fingerprints in a given collection which differ from it in at most k bit positions
Why is De-duplication Important?
Elimination of near duplicates:
Saves network bandwidth
Do not have to crawl content if similar to previously crawled content
Reduces storage cost
Do not have to store in local repository if similar to previously crawled content
Improves quality of search indexes
Local repository used for building search indexes is not polluted by near-duplicates
Algorithm: Simhash Technique
Convert web-page to a set of features
Using Information Retrieval techniques
e.g. tokenization, phrase detection
Give a weight to each feature
Hash each feature into an f-bit value
Have an f-dimensional vector
Dimension values start at 0
Update the f-dimensional vector with the weight of each feature
If the i-th bit of the hash value is one -> add the weight of the feature to the i-th vector component
If the i-th bit of the hash value is zero -> subtract the weight of the feature from the i-th vector component
The vector will have positive and negative components
The sign (+/-) of each component gives one bit of the fingerprint
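The steps above can be sketched in Python. This is a minimal illustration, not the paper's production implementation: the choice of MD5 as the per-feature hash is an assumption made for the sketch.

```python
import hashlib

def simhash(features, f=64):
    """Combine weighted features into an f-bit simhash fingerprint.

    features: iterable of (feature_string, weight) pairs.
    MD5 is an assumed stand-in for the paper's f-bit hash function.
    """
    v = [0] * f  # f-dimensional vector, every component starts at 0
    for feature, weight in features:
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16)
        for i in range(f):
            if (h >> i) & 1:
                v[i] += weight   # i-th bit is one -> add the weight
            else:
                v[i] -= weight   # i-th bit is zero -> subtract the weight
    # The sign of each component yields one bit of the fingerprint
    fp = 0
    for i in range(f):
        if v[i] > 0:
            fp |= 1 << i
    return fp
```

Near-duplicate pages then map to fingerprints that differ in only a few bit positions, which is what the Hamming Distance step exploits.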
Algorithm: Simhash Technique (cont.)
Very simple example
One web-page
Web-page text: “Simhash Technique”
Reduced to two features
“Simhash” -> weight = 2
“Technique” -> weight = 4
Hash features to 4 bits
“Simhash” -> 1101
“Technique” -> 0110
Algorithm: Simhash Technique (cont.)
Start vector with all zeroes: [0, 0, 0, 0]
Algorithm: Simhash Technique (cont.)
Apply “Simhash” feature (weight = 2)

feature's f-bit value | calculation | new vector value
          1           |    0 + 2    |        2
          1           |    0 + 2    |        2
          0           |    0 - 2    |       -2
          1           |    0 + 2    |        2
Algorithm: Simhash Technique (cont.)
Apply “Technique” feature (weight = 4)

Starting from vector [2, 2, -2, 2]:

feature's f-bit value | calculation | new vector value
          0           |    2 - 4    |       -2
          1           |    2 + 4    |        6
          1           |   -2 + 4    |        2
          0           |    2 - 4    |       -2
Algorithm: Simhash Technique (cont.)
Final vector: [-2, 6, 2, -2]
Sign of the vector values is -, +, +, -
Final 4-bit fingerprint = 0110
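The worked example on the last few slides can be replayed in a few lines of Python; the 4-bit hashes and weights are the ones given above, and `simhash_from_hashes` is a hypothetical helper name for this sketch:

```python
def simhash_from_hashes(hashed_features, f=4):
    """Build an f-bit fingerprint from pre-hashed (bits, weight) pairs."""
    v = [0] * f
    for bits, weight in hashed_features:
        for i, b in enumerate(bits):  # walk the f-bit hash left to right
            v[i] += weight if b == "1" else -weight
    # Positive component -> bit 1, otherwise bit 0
    return "".join("1" if x > 0 else "0" for x in v), v

fp, vec = simhash_from_hashes([("1101", 2), ("0110", 4)])
print(vec)  # [-2, 6, 2, -2]
print(fp)   # 0110
```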
Algorithm: Solution to Hamming Distance Problem
Problem: Given an f-bit fingerprint F, find all fingerprints in a given collection which differ from F in at most k bit positions
Solution:
Create tables containing the fingerprints
Each table has a permutation (π) and a small integer (p) associated with it
Apply the permutation associated with each table to its fingerprints
Sort the tables
Store the tables in the main memory of a set of machines
Iterate through the tables in parallel
Find all permuted fingerprints whose top p_i bits match the top p_i bits of π_i(F)
For the fingerprints that matched, check if they differ from π_i(F) in at most k bits
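A single-machine sketch of this lookup (the paper spreads the sorted tables across many machines; `build_table` and `query` are illustrative names for this sketch):

```python
import bisect

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def build_table(fingerprints, perm):
    """Permute every fingerprint with the table's pi, then sort."""
    return sorted(perm(fp) for fp in fingerprints)

def query(tables, F, k, p=3, f=8):
    """Return permuted fingerprints within Hamming distance k of F.

    tables: list of (perm, sorted_permuted_fingerprints) pairs.
    For each table, binary-search for entries whose top p bits equal
    the top p bits of perm(F), then verify the full k-bit condition.
    """
    matches = []
    for perm, sorted_fps in tables:
        pf = perm(F)
        top = pf >> (f - p)
        lo = bisect.bisect_left(sorted_fps, top << (f - p))
        hi = bisect.bisect_left(sorted_fps, (top + 1) << (f - p))
        for candidate in sorted_fps[lo:hi]:
            if hamming(candidate, pf) <= k:  # permutation preserves distance
                matches.append(candidate)
    return matches
```

With the 8-bit example that follows (π = swap the two 4-bit halves), querying F = 0100 1101 with k = 3 surfaces exactly the fingerprint the slides flag as a match.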
Algorithm: Solution to Hamming Distance Problem (cont.)
Simple example
F = 0100 1101
k = 3
Have a collection of 8 fingerprints:
1100 0101
1111 1111
0101 1100
0111 1110
1111 1110
0010 0001
1111 0101
1101 0010
Create two tables
Algorithm: Solution to Hamming Distance Problem (cont.)
Fingerprints:
1100 0101
1111 1111
0101 1100
0111 1110
1111 1110
0010 0001
1111 0101
1101 0010

Table 1 (p = 3; π = swap last four bits with first four bits):
0101 1100
1111 1111
1100 0101
1110 0111

Table 2 (p = 3; π = move last two bits to the front):
1011 1111
0100 1000
0111 1101
1011 0100
Algorithm: Solution to Hamming Distance Problem (cont.)
Table 1 (p = 3; π = swap last four bits with first four bits):
0101 1100
1111 1111
1100 0101
1110 0111

Table 2 (p = 3; π = move last two bits to the front):
1011 1111
0100 1000
0111 1101
1011 0100

Sort each table:

Table 1 (sorted):
0101 1100
1100 0101
1110 0111
1111 1111

Table 2 (sorted):
0100 1000
0111 1101
1011 0100
1011 1111
Algorithm: Solution to Hamming Distance Problem (cont.)
F = 0100 1101
π1(F) = 1101 0100
π2(F) = 0101 0011

Table 1 (p = 3; π = swap last four bits with first four bits):
0101 1100
1100 0101  <- Match! (top 3 bits equal the top 3 bits of π1(F))
1110 0111
1111 1111

Table 2 (p = 3; π = move last two bits to the front):
0100 1000  <- Match! (top 3 bits equal the top 3 bits of π2(F))
0111 1101
1011 0100
1011 1111
Algorithm: Solution to Hamming Distance Problem (cont.)
With k = 3, only the fingerprint in the first table is a near-duplicate of F

Table 1 candidate:
π1(F)     = 1101 0100
candidate = 1100 0101
-> they differ in 2 bit positions (<= 3), so it is a near-duplicate

Table 2 candidate:
π2(F)     = 0101 0011
candidate = 0100 1000
-> they differ in 4 bit positions (> 3), so it is not a near-duplicate
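The two comparisons above reduce to XOR-and-popcount, which is all the verification step needs:

```python
def hamming(a, b):
    """Count differing bit positions via XOR and popcount."""
    return bin(a ^ b).count("1")

# Table 1 candidate vs pi1(F): 1100 0101 vs 1101 0100
print(hamming(0b11000101, 0b11010100))  # 2 -> within k = 3, near-duplicate
# Table 2 candidate vs pi2(F): 0100 1000 vs 0101 0011
print(hamming(0b01001000, 0b01010011))  # 4 -> exceeds k = 3, rejected
```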
Algorithm: Compression of Tables
1. Store the first fingerprint in a block (1024 bytes)
2. XOR the current fingerprint with the previous one
3. Append to the block the Huffman code for the position of the most significant 1 bit
4. Append to the block the bits after the most significant 1 bit
5. Repeat steps 2-4 until the block is full
Comparing to the query fingerprint:
Use the last fingerprint (key) of each block and perform interpolation search to find and decompress the appropriate block
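Steps 2-4 can be sketched as follows. Only the XOR-delta split is shown; the Huffman coding of the bit position and the 1024-byte block packing are omitted, and `delta_parts` is a hypothetical helper name:

```python
def delta_parts(prev, cur):
    """Split the XOR of consecutive fingerprints into the position of
    its most significant 1 bit plus the bits that follow it."""
    x = prev ^ cur
    if x == 0:
        return None  # identical fingerprints: nothing to encode
    msb = x.bit_length() - 1         # position of the most significant 1 bit
    trailing = x & ((1 << msb) - 1)  # the bits after that 1 bit
    return msb, trailing
```

Because sorted fingerprints share long common prefixes, the XOR is small and both parts compress well.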
Algorithm: Extending to Batch Queries
Problem: Want to get near-duplicates for a batch of query fingerprints, not just one
Solution:
Use the Google File System (GFS) and MapReduce
Create two files:
File F has the collection of fingerprints
File Q has the query fingerprints
Store the files in GFS
GFS breaks up the files into chunks
Use MapReduce to solve the Hamming Distance Problem for each chunk of F against all queries in Q
MapReduce allows a task to be created per chunk
Iterate through the chunks in parallel
Each task produces an output of the near-duplicates found
Produce a sorted file from the output of all tasks
Remove duplicates if necessary
Experiment: Parameters
8 Billion web pages used
k = 1 … 10
Manually tagged pairs as follows:
True positives
Differ slightly
False positives
Radically different pairs
Unknown
Could not be evaluated
Experiment: Results
Accuracy
Low k value -> a lot of false negatives
High k value -> a lot of false positives
Best value -> k = 3
75% of near-duplicates reported
75% of reported cases are true positives
Running Time
Hamming Distance solution: O(log(p))
Batch Query + Compression:
32GB file & 200 tasks -> runs in under 100 seconds
Related Work
Clustering related documents
Detect near-duplicates to show related pages
Data extraction
Determine the schema of similar pages to obtain information
Plagiarism
Detect pages that have borrowed from each other
Spam
Detect spam before the user receives it
Tying it Back to Lecture
Similarities
Indicated importance of de-duplication to save crawler
resources
Brief summary of several uses for near-duplicate detection
Differences
Lecture focus:
Breadth-first look at algorithms for near-duplicate detection
Paper focus:
In-depth look at simhash and the Hamming Distance algorithm
Includes how to implement them and their effectiveness
Paper Evaluation: Pros
Thorough step-by-step explanation of the algorithm
implementation
Thorough explanation of how the conclusions were reached
Included brief description of how to improve
simhash + Hamming Distance algorithm
Categorize web-pages before running simhash, create an algorithm to remove ads or timestamps, etc.
Paper Evaluation: Cons
No comparison
How much more effective or faster is it than other algorithms?
By how much did it improve the crawler?
Limited batch queries to a specific technology
Implementation required use of GFS
An approach not restricted to a certain technology might be more widely applicable
Any Questions?
???