Detecting Near-Duplicates for Web Crawling


Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma
Presented by Yen-Yi Hung
Overview
- Introduction
- Previous Related Work
- Algorithm
- Evaluation
- Future Work
- Pros & Cons
- Comment
- References
Introduction – The drawbacks of duplicate pages
- Waste network bandwidth
- Affect refresh times
- Impact politeness constraints
- Increase storage costs
- Affect the quality of search indexes
- Increase the load on the remote hosts that serve such pages
- Affect customer satisfaction
Introduction – Challenges and Contributions of this paper
Challenges:
- Dealing with the issue of scale
- Determining near-duplicates efficiently
Contributions:
- Showing that simhash can be used to handle a huge volume of queries
- Developing a way to solve the Hamming Distance Problem quickly (for a single online query or a batch of multiple queries)
Previous Related Work
Related techniques differ in the corpus they target, their end goals, their feature sets, and their signature schemes.
Corpus:
- Web documents
- Files in a file system
- E-mails
- Domain-specific corpora
Previous Related Work (II)
End goals:
- Web mirrors
- Clustering for related-documents queries
- Data extraction
- Plagiarism detection
- Spam detection
- Duplicates in domain-specific corpora
Feature sets:
- Shingles from page content
- Document vector from page content
- Connectivity information
- Anchor text, anchor window
- Phrases
Previous Related Work (III)
Signature schemes:
- Mod-p shingles
- Min-hash for Jaccard similarity of sets
- Signatures/fingerprints over IR-based document vectors
- Checksums
This paper focuses on web documents. Its goal is to improve web crawling using the simhash technique.
Algorithm – Simhash fingerprinting
What can simhash do?
- Map high-dimensional vectors to small-sized fingerprints.
The atypical property of simhash:
- Similar documents have similar hash values (unlike ordinary hash functions, where near-identical inputs hash to very different values).
How is it applied?
- Convert each web page to a set of weighted features (computed using standard IR techniques), then hash that feature set to an f-bit fingerprint, as in the sketch below.
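Conceptually, simhash hashes each weighted feature independently and lets the features vote on every output bit. Below is a minimal sketch in Python, assuming whitespace tokens weighted by term frequency as the feature set and a truncated MD5 digest as the per-feature hash; the paper leaves both choices to standard IR machinery, so these are illustrative stand-ins, not the authors' implementation.

    import hashlib
    from collections import Counter

    def simhash(text: str, f: int = 64) -> int:
        """Map a document to an f-bit fingerprint; similar texts share most bits."""
        weights = Counter(text.lower().split())  # assumed features: tf-weighted tokens
        v = [0] * f                              # one running total per bit position
        for feature, weight in weights.items():
            # Hash each feature to f bits (truncated MD5 is an illustrative stand-in).
            h = int.from_bytes(hashlib.md5(feature.encode()).digest()[:f // 8], "big")
            for i in range(f):
                # Each feature votes +weight where its hash bit is 1, -weight where 0.
                v[i] += weight if (h >> i) & 1 else -weight
        # Bit i of the fingerprint is 1 exactly when the i-th running total is positive.
        return sum(1 << i for i in range(f) if v[i] > 0)

Two near-identical documents then differ in only a few bit positions, i.e. bin(simhash(a) ^ simhash(b)).count("1") stays small, which is exactly the property the Hamming distance search below exploits.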
Algorithm – Hamming Distance Problem
Hamming Distance Problem: given a collection of f-bit fingerprints and a query fingerprint F, identify whether any existing fingerprint differs from F in at most k bits.
But simply probing the entire fingerprint collection is impractical, so what should we do?
1. Build t tables T1, T2, …, Tt. Each table Ti has an associated integer pi and a permutation πi.
2. Apply permutation πi to every existing fingerprint in table Ti, and sort each Ti.
Algorithm – Hamming Distance Problem
3. Given a query fingerprint F and the integer k bounding the Hamming distance, answer the query in two steps:
Step 1: Find all permuted fingerprints in Ti whose top pi bit-positions match the top pi bit-positions of πi(F).
Step 2: For each fingerprint found in Step 1, check whether it differs from πi(F) in at most k bit-positions.
Time complexity:
- Step 1 can be done in O(pi) steps using binary search; assuming uniformly random fingerprints, interpolation search shrinks this to O(log pi) expected steps.
A sketch of the full table scheme follows below.
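A compact sketch of the table scheme, with t = 4 tables, pi = 16 for every table, and simple bit rotations standing in for the paper's permutations (the paper splits the fingerprint into blocks and permutes those). Step 2 here is a linear scan with a popcount; interpolation search is omitted for brevity.

    from bisect import bisect_left

    F_BITS = 64

    def rotate(x: int, r: int) -> int:
        """Left-rotate an F_BITS-bit integer by r bits (a stand-in permutation;
        rotations preserve Hamming distance, as the paper's permutations do)."""
        return ((x << r) | (x >> (F_BITS - r))) & ((1 << F_BITS) - 1)

    class HammingIndex:
        def __init__(self, fingerprints, t=4, p=16, k=3):
            self.t, self.p, self.k = t, p, k
            self.rots = [i * (F_BITS // t) for i in range(t)]
            # One sorted table of permuted fingerprints per rotation.
            self.tables = [sorted(rotate(fp, r) for fp in fingerprints)
                           for r in self.rots]

        def query(self, f: int) -> bool:
            """True iff some stored fingerprint is within k bits of f."""
            for table, r in zip(self.tables, self.rots):
                pf = rotate(f, r)
                # Step 1: binary search for entries sharing the top p bits of pf.
                lo = pf & ~((1 << (F_BITS - self.p)) - 1)  # smallest possible match
                hi = lo | ((1 << (F_BITS - self.p)) - 1)   # largest possible match
                i = bisect_left(table, lo)
                # Step 2: verify each candidate's full Hamming distance.
                while i < len(table) and table[i] <= hi:
                    if bin(table[i] ^ pf).count("1") <= self.k:
                        return True
                    i += 1
            return False

With these particular parameters the four rotations expose four disjoint 16-bit blocks as the top bits, so any k ≤ 3 flipped bits must leave at least one block untouched and the search misses nothing; the paper generalizes this trade-off between the number of tables t and the prefix length pi.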
Algorithm – Compression of Fingerprints
Step 1: Store the first fingerprint of the block in its entirety.
Step 2: Compute the XOR of the next two successive fingerprints and find the position h of its most-significant 1-bit.
Step 3: Append the Huffman code of h to the block.
Step 4: Append the bits to the right of that most-significant 1-bit to the block.
Step 5: Repeat steps 2–4 until the block (1024 bytes) is full. A sketch follows below.
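A sketch of the block construction under two stated simplifications: the block is built as a Python string of "0"/"1" characters rather than packed bytes, and a fixed 6-bit encoding of h stands in for the Huffman code the paper trains over observed h values. It assumes the fingerprints arrive sorted and strictly increasing.

    def compress_block(fps, block_bits=8192):
        """Delta-encode sorted 64-bit fingerprints into one 1024-byte block."""
        out = format(fps[0], "064b")       # Step 1: first fingerprint verbatim
        prev = fps[0]
        for fp in fps[1:]:
            x = fp ^ prev                  # Step 2: XOR of successive fingerprints
            h = x.bit_length() - 1         # position of its most-significant 1-bit
            code = format(h, "06b")        # Step 3: stand-in for the Huffman code of h
            # Step 4: the h bits to the right of that 1-bit; since fp and prev
            # agree above position h, (prev, h, tail) reconstruct fp exactly.
            tail = format(fp & ((1 << h) - 1), "0{}b".format(h)) if h else ""
            if len(out) + len(code) + len(tail) > block_bits:
                break                      # Step 5: stop once the block is full
            out += code + tail
            prev = fp
        return out

Decompression reverses the steps: read the first 64 bits, then repeatedly decode h and the h-bit tail, copying the previous fingerprint's bits above position h, setting bit h (fingerprints are increasing), and appending the tail.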
Algorithm – Batch query implementation
Both file F (the existing fingerprints) and file Q (the batch of query fingerprints) are stored in GFS, a shared-nothing distributed file system. Batch processing is split into two phases:
- Phase 1: Solve the Hamming distance problem with one chunk of F and the entire file Q as input; the output of each task is a set of near-duplicate fingerprints.
- Phase 2: MapReduce removes duplicates from the Phase 1 results and produces a single sorted file. A sketch of both phases follows below.
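A toy, single-machine sketch of the two phases, reusing the HammingIndex sketch from earlier; the chunk list and driver loop are stand-ins for GFS chunk placement and MapReduce scheduling, which run the Phase 1 tasks in parallel across many machines.

    def phase1(f_chunk, queries, k=3):
        """Phase 1 (one mapper): solve the Hamming distance problem over a
        single chunk of file F and the entire file Q."""
        index = HammingIndex(f_chunk, k=k)
        return {q for q in queries if index.query(q)}  # near-duplicate queries

    def phase2(partial_results):
        """Phase 2: merge per-chunk outputs, remove duplicates, sort."""
        return sorted(set().union(*partial_results))

    def batch_dedup(F, Q, chunk_size=1_000_000, k=3):
        """Driver: split F into chunks, run Phase 1 per chunk, then combine."""
        chunks = [F[i:i + chunk_size] for i in range(0, len(F), chunk_size)]
        return phase2([phase1(c, Q, k) for c in chunks])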
Evaluation
Is simhash a reasonable technique for the de-duplication problem?
- Choosing k = 3 gives precision and recall ≒ 0.75.
- For comparison, the algorithms in "Finding near-duplicate web pages: a large-scale evaluation of algorithms" (M. R. Henzinger, SIGIR 2006) achieve precision and recall around 0.8.
Evaluation
Do the characteristics of simhash affect the results, and if so, is the impact significant?
- Fig. 2(a): the right half follows the expected distribution but the left half does not, because some pages with similar content show only a moderate difference in their simhash values.
- Fig. 2(b): the distribution has spikes caused by empty pages, "file not found" pages, and the near-identical login pages of some bulletin-board software.
Evaluation
- With 32 GB of batch query fingerprints and 200 mappers, the combined scan rate can exceed 1 GBps.
- Given a fixed number of mappers, the time taken is roughly proportional to the size of file Q. (Compression plays an important role here.)
Future Work
Based on this paper:
- Document size
- Category-information de-duplication
- Near-duplication vs. clustering
Other research topics:
- A more cost-effective approach that uses only URL information for de-duplication
Pros
- Efficient and practical
- Uses compression and a purpose-built storage design (GFS) to make fingerprint-based de-duplication workable at scale
- Gives a compact but thorough survey of de-duplication-related work
Cons
- Limited accuracy: the method rests not on explicit content matching but on an estimate of similarity
- The paper does not compare its evaluation results against other algorithms
- Though compression techniques are provided, the space cost remains open to question
- Content-based de-duplication can only run after the web pages have been downloaded, so it does not reduce the bandwidth wasted during crawling
Comment
This paper provides an efficient way to use simhash to solve the de-duplication problem for large amounts of data. Though not the first work to target large collections of web pages, it does report query sizes drawn from a real-world workload.
References
- P. Ferragina, R. Grossi, A. Gupta, R. Shah, and J. S. Vitter. On searching compressed string collections cache-obliviously. In Proceedings of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), Vancouver, Canada, June 2008.
- H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov. IRLbot: Scaling to 6 billion pages and beyond. ACM Transactions on the Web (TWEB), 3(3):1-34, June 2009.
- R. Cai, J.-M. Yang, W. Lai, Y. Wang, and L. Zhang. iRobot: An intelligent crawler for web forums. In Proceedings of the 17th International Conference on World Wide Web (WWW), Beijing, China, April 2008.
- A. Dasgupta, R. Kumar, and A. Sasturkar. De-duping URLs via rewrite rules. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, August 2008.
- L. Huang, L. Wang, and X. Li. Achieving both high precision and high recall in near-duplicate detection. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), Napa Valley, CA, USA, October 2008.
- E. Cohen and H. Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. In Proceedings of the 11th International Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Seattle, WA, USA, June 2009.
- A. Agarwal, H. S. Koppula, K. P. Leela, K. P. Chitrapura, S. Garg, P. Kumar GM, C. Haty, A. Roy, and A. Sasturkar. URL normalization for de-duplication of web pages. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), Hong Kong, China, November 2009.
- H. S. Koppula, K. P. Leela, A. Agarwal, K. P. Chitrapura, S. Garg, and A. Sasturkar. Learning URL patterns for webpage de-duplication. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (WSDM), New York, NY, USA, February 2010.
- M. R. Henzinger. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In SIGIR 2006, pages 284-291, 2006.