MapDupReducer: Detecting Near Duplicates over Massive Datasets Chaokun Wang1 Jianmin Wang1 Xuemin Lin2 Wei Wang2 Haixun Wang3 Hongsong Li3 Wanpeng Tian1

MapDupReducer: Detecting Near Duplicates over Massive Datasets
Chaokun Wang1 Jianmin Wang1 Xuemin Lin2 Wei Wang2 Haixun Wang3 Hongsong Li3 Wanpeng Tian1 Jun Xu41 Rui Li1
1School of Software, Tsinghua University
Key Laboratory for Information System Security, Ministry of Education
Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China
2School of Computer Science and Engineering, University of New South Wales & NICTA, Sydney, Australia
3Microsoft Research Asia
4Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
1 Introduction
X = {x1, x2, ..., xn} is a document set where each xi is
a document. The task of Near Duplicate Detection
(NDD) is to find all pairs (xi, xj) such that the
similarity between xi and xj is not smaller than a given
threshold (typically close to 1).
Challenges
• Detecting near duplicates exactly in a large data set is
very expensive.
• Handling NDD in a parallel fashion is non-trivial.
4.2 User Interface
• A GUI to select a data source and set a parameter.
• The beginning of a document is shown as a tooltip.
• The differences between two duplicates are shown in color.
2 Related Work
3 System Architecture
The MapDupReducer system consists of two parts:
• the user application
• the server engine
MapDupReducer is implemented on top of Hadoop v?.?.?.
4 Our Approach
4.1 Near Duplicate Detection by the PPJoin Paradigm in MapReduce
Four MapReduce jobs for NDD:
• Job1: Compute the total order
• Job2: Prefix filter
• Job3: Position filter
• Job4: Verification (Jaccard similarity)
4.3 An Example
Consider a set of three documents x, y, and z and a Jaccard threshold t = 0.8.
Job1:
Mapper:
IN: (x, “C D E F”) (y, “B C D E F”) (z, “A B C D F”)
OUT: (C,1) (D,1) (E,1) (F,1) (B,1) (C,1) (D,1) (E,1) (F,1) (A,1) (B,1) (C,1) (D,1) (F,1)
Reducer:
IN: (A, {1}) (B, {1 1}) (C, {1 1 1}) (D, {1 1 1}) (E, {1 1}) (F, {1 1 1})
OUT: (A,1) (B,2) (C,3) (D,3) (E,2) (F,3)
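The counting step of Job1 can be imitated in memory. This is a minimal sketch, not the poster's Hadoop code: the function names are hypothetical, and the tie-breaking rule (rarest token first, ties broken alphabetically) is an assumption, since the poster only says that a total order is computed.

```python
from collections import defaultdict


def job1_map(doc_id, text):
    """Emit (token, 1) for every token of one document."""
    return [(token, 1) for token in text.split()]


def job1_reduce(token, counts):
    """Sum the ones emitted for a single token."""
    return (token, sum(counts))


def token_frequencies(docs):
    """Run the map and reduce steps; the dict stands in for the shuffle phase."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for token, one in job1_map(doc_id, text):
            grouped[token].append(one)
    return dict(job1_reduce(token, counts) for token, counts in grouped.items())


def total_order(freq):
    """Assumed ordering: ascending frequency, ties broken alphabetically."""
    return sorted(freq, key=lambda token: (freq[token], token))


docs = {"x": "C D E F", "y": "B C D E F", "z": "A B C D F"}
freq = token_frequencies(docs)
order = total_order(freq)
```

With the three example documents, `freq` reproduces the reducer output listed for Job1, and under the assumed tie-break the resulting order is A, B, E, C, D, F.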
Job2:
Mapper:
IN: (x, “C D E F”) (y, “B C D E F”) (z, “A B C D F”)
OUT: (E,x@1) (B,y@1) (E,y@2) (A,z@1) (B,z@2)
Reducer:
IN: (A,{z@1}) (B,{y@1 z@2}) (E,{x@1 y@2})
OUT: ((y,z,01), 1#2) ((x,y,01), 1#2)
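The prefix filter of Job2 can be sketched as follows. The sketch assumes the usual prefix length for a Jaccard threshold t, namely |d| - ceil(t * |d|) + 1, and a total order A, B, E, C, D, F (ascending frequency, alphabetical ties), which is one order consistent with the example. All function names are hypothetical, and the extra field written as 01 in the example output is not modeled.

```python
import math
from collections import defaultdict
from itertools import combinations

# Assumed total order from Job1: rarest token first, ties alphabetical.
ORDER = ["A", "B", "E", "C", "D", "F"]
RANK = {token: i for i, token in enumerate(ORDER)}


def prefix_length(size, t):
    """Prefix length for Jaccard threshold t (epsilon guards float error)."""
    return size - math.ceil(t * size - 1e-9) + 1


def job2_map(doc_id, text, t):
    """Emit (token, (doc_id, position)) for each prefix token, 1-based."""
    tokens = sorted(text.split(), key=RANK.__getitem__)
    prefix = tokens[:prefix_length(len(tokens), t)]
    return [(token, (doc_id, pos)) for pos, token in enumerate(prefix, start=1)]


def job2_reduce(token, postings):
    """Pair up all documents that share this prefix token."""
    return [((a, b), (pos_a, pos_b))
            for (a, pos_a), (b, pos_b) in combinations(sorted(postings), 2)]


def candidate_pairs(docs, t):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for doc_id, text in docs.items():
        for token, posting in job2_map(doc_id, text, t):
            grouped[token].append(posting)
    pairs = []
    for token in sorted(grouped):
        pairs.extend(job2_reduce(token, grouped[token]))
    return pairs


docs = {"x": "C D E F", "y": "B C D E F", "z": "A B C D F"}
candidates = candidate_pairs(docs, 0.8)
```

For the example documents this yields the candidate pairs (y, z) and (x, y), each with positions 1 and 2, which corresponds to the 1#2 values in the reducer output.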
Job3:
Mapper: a special IdentityMapper
Reducer:
IN: ((y,z), 01), {1#2}; ((x,y), 01), {1#2}
OUT: (x,y)
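Why (y,z) is dropped while (x,y) survives can be checked with a small sketch of a positional filter. It assumes the standard PPJoin-style bound: a pair with Jaccard similarity at least t must share at least ceil(t / (1 + t) * (|a| + |b|)) tokens, and a shared token at 1-based positions (i, j) allows at most 1 + min(|a| - i, |b| - j) shared tokens from that point on. The function names and the hard-coded inputs are illustrative only.

```python
import math


def min_overlap(size_a, size_b, t):
    """Smallest overlap compatible with Jaccard >= t (epsilon guards float error)."""
    return math.ceil(t / (1 + t) * (size_a + size_b) - 1e-9)


def passes_position_filter(size_a, pos_a, size_b, pos_b, t):
    """Keep the pair only if the overlap upper bound can still reach min_overlap.

    Positions are 1-based. The bound counts the shared token itself plus the
    shorter of the two remaining suffixes, so it is meant for the first token
    the two documents share in the total order.
    """
    upper_bound = 1 + min(size_a - pos_a, size_b - pos_b)
    return upper_bound >= min_overlap(size_a, size_b, t)


# Document sizes and candidate pairs taken from the example.
SIZES = {"x": 4, "y": 5, "z": 5}
CANDIDATES = [(("y", "z"), (1, 2)), (("x", "y"), (1, 2))]

survivors = [
    pair
    for pair, (pos_a, pos_b) in CANDIDATES
    if passes_position_filter(SIZES[pair[0]], pos_a, SIZES[pair[1]], pos_b, 0.8)
]
```

For (y,z) the required overlap is 5 but the bound is 1 + min(4, 3) = 4, so the pair is pruned. For (x,y) the required overlap is 4 and the bound is 1 + min(3, 3) = 4, so the pair is kept.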
Job4: fetches the corresponding pre-processed documents
and compares them to generate the final NDD results.
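The verification in Job4 amounts to computing the exact Jaccard similarity of each surviving pair. A minimal sketch, assuming documents are whitespace-separated token sets as in the example (the function names are hypothetical):

```python
def jaccard(text_a, text_b):
    """Jaccard similarity of the token sets of two documents."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    if not set_a and not set_b:
        return 1.0  # convention chosen here for two empty documents
    return len(set_a & set_b) / len(set_a | set_b)


def verify(docs, pairs, t):
    """Keep only the pairs whose exact similarity reaches the threshold."""
    return [(a, b) for a, b in pairs if jaccard(docs[a], docs[b]) >= t]


docs = {"x": "C D E F", "y": "B C D E F", "z": "A B C D F"}
results = verify(docs, [("x", "y"), ("y", "z")], 0.8)
```

Here x and y share 4 of 5 distinct tokens (similarity 0.8), while y and z share 4 of 6, so only (x, y) is reported.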
5 Preliminary Experimental Results
[Figure. Y-axis: Time Cost (s)]
The MapReduce framework is a mature platform with
distinctive features such as ease of use and high fault tolerance.
• map(k, v) → list(k1, v1)
• reduce(k1, list(v1)) → v2
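The two signatures above can be illustrated with a tiny in-memory driver. This is only a single-process sketch of the programming model, not Hadoop's API; `run_mapreduce` and the inverted-index example are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to each (k, v), group by k1, then reduce each group."""
    grouped = defaultdict(list)  # plays the role of the shuffle phase
    for k, v in records:
        for k1, v1 in mapper(k, v):
            grouped[k1].append(v1)
    return {k1: reducer(k1, values) for k1, values in grouped.items()}


def index_mapper(doc_id, text):
    """map(k, v) -> list(k1, v1): emit (token, doc_id) pairs."""
    return [(token, doc_id) for token in text.split()]


def index_reducer(token, doc_ids):
    """reduce(k1, list(v1)) -> v2: the sorted list of distinct documents."""
    return sorted(set(doc_ids))


index = run_mapreduce([("d1", "a b"), ("d2", "b c")], index_mapper, index_reducer)
```

The result maps each token to the documents that contain it, for example `"b"` to `["d1", "d2"]`.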
Approximate String Join
• Given tables R1 and R2 with string attributes Ai and Aj respectively,
and a threshold k, return {(t, t′) | (t, t′) ∈ R1 × R2,
dist(R1.Ai(t), R2.Aj(t′)) ≤ k}.
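A naive version of this join can be written directly from the definition. The sketch assumes that dist is the Levenshtein edit distance, which the poster does not specify, and it reduces each table to a plain list of attribute values; `edit_distance` and `approx_join` are hypothetical names.

```python
def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        cur = [i]
        for j, char_t in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (char_s != char_t)))  # substitution
        prev = cur
    return prev[-1]


def approx_join(r1, r2, k, dist=edit_distance):
    """All pairs from r1 x r2 whose distance is at most k (quadratic scan)."""
    return [(a, b) for a in r1 for b in r2 if dist(a, b) <= k]


matches = approx_join(["color", "gray"], ["colour", "grey", "green"], 1)
```

The quadratic scan compares every pair, which is what filtering techniques such as those used by PPJoin are designed to avoid.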
PPJoin
• an extension to the All-Pairs algorithm
• combines positional filtering and prefix-filtering
[Figure, continued. Series: MapDupReduce, FuzzyJoin. X-axis: The size of the input document set]
