Sketching - University of Cincinnati



Lecture 18
 Syntactic Web Clustering
 CS 728 - 2007

1
Outline

Previously:
- Studied web clustering based on web link structure
- Some discussion of term-document vector spaces

Today:
- Syntactic clustering of the web
- Identifying syntactic duplicates
- Locality-sensitive hash functions
- Resemblance and shingling
- Min-wise independent permutations
- The sketching model
- Hamming distance and edit distance
2
Motivation: Near-Duplicate Elimination

Many web pages are duplicates or near-duplicates of other pages:
- Mirror sites
- FAQs, manuals, legal documents
- Different versions of the same document
- Plagiarism

Duplicates are bad for search engines:
- They increase index size
- They harm the quality of search results

Question: how do we efficiently process the repository of crawled pages and eliminate (near-)duplicates?
3
Syntactic Clustering of the Web
[Broder, Glassman, Manasse, Zweig 97]

- U: space of all possible documents
- S ⊆ U: collection of documents
- sim: U × U → [0,1]: a similarity measure on documents
  - If p, q are very similar, sim(p,q) is close to 1
  - If p, q are very dissimilar, sim(p,q) is close to 0
  - Usually sim(p,q) = 1 − d(p,q), where d(p,q) is a normalized distance between p and q
- G: a threshold graph on S: p, q are connected by an edge iff sim(p,q) ≥ t (t = threshold)
- Goal: find the connected components of G
4
Main Challenges

- S is huge
  - The web has 10 billion pages
  - Documents are not compressed
  - Storing S needs many disks
  - Each sim computation is costly
- Documents in S should be processed in a stream
- Main memory is small relative to S
- Cannot afford more than O(|S|) time
- How to create the graph G? Naively, it requires |S| passes and |S|^2 similarity computations
5
Sketching Schemes

- T: a small set (|S| < |T| << |U|)
- A sketching scheme for sim:
  - Compression function: a randomized mapping ρ: U → T
  - Reconstruction function: σ: T × T → [0,1]
  - For every pair p, q, with high probability, σ(ρ(p), ρ(q)) ≈ sim(p,q)
6
Syntactic Clustering by Sketching

P ← empty table of size |S|
G ← empty graph on |S| nodes
for i = 1,…,|S|:
    read document pi from the stream
    P[i] ← ρ(pi)
for i = 1,…,|S|:
    for j = 1,…,|S|:
        if σ(P[i], P[j]) ≥ t:
            add edge (i,j) to G
output connected components of G
7
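The two-phase algorithm above can be sketched in Python. Here `rho` (the compression function) and `rho_hat` (the reconstruction function) are hypothetical stand-ins for a concrete sketching scheme such as min-wise hashing, and a union-find structure replaces the explicit graph G:

```python
# Minimal sketch of clustering-by-sketching; `rho` and `rho_hat` are
# caller-supplied stand-ins for a real compression/reconstruction pair.
from itertools import combinations

def cluster_by_sketching(docs, rho, rho_hat, t):
    """Group documents whose reconstructed similarity is >= t."""
    sketches = [rho(p) for p in docs]      # one streaming pass over S
    n = len(docs)
    parent = list(range(n))                # union-find over doc indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):  # |S|^2 sketch comparisons
        if rho_hat(sketches[i], sketches[j]) >= t:
            parent[find(i)] = find(j)       # "add edge (i,j) to G"

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())            # connected components
```

For a toy run, `rho` can be the set of words of a document and `rho_hat` the Jaccard coefficient of two such sets.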
Analysis

- Can compute sketches in one pass
- Table P can be stored in a single file on a single machine
- Creating G requires |S|^2 applications of σ
  - Easier than full-fledged computations of sim
  - Quadratic time is still a problem
- Connected-components algorithm is heavy but feasible
- Need a linear-time algorithm, even if it is only approximate

Idea: use hashing
8
Sketching vs. Fingerprinting vs. Hashing

- Hashing h: U → {0,1}^k
  - Set-membership testing for a set S of size n
  - Desire uniform distribution over bin addresses {0,1}^k
  - Minimize collisions per bin, to reduce lookup time
  - Minimize hash table size: N = 2^k ≈ n
- Fingerprinting f: U → {0,1}^k
  - Object-equality testing over a set S of size n
  - Distribution over {0,1}^k is irrelevant
  - Avoid collisions altogether
  - Tolerate larger k; typically N > n^2
- Sketching φ: U → {0,1}^k
  - Similarity testing for a set S of size n
  - Distribution over {0,1}^k is irrelevant
  - Minimize collisions of dissimilar sets
  - Minimize table size: N = 2^k ≈ n
Sketching via Locality Sensitive Hashing (LSH)
[Indyk, Motwani 98]

- H = { h | h: U → T }: a family of hash functions
- H is locality sensitive w.r.t. sim if for all p, q ∈ U,
  Pr[h(p) = h(q)] = sim(p,q)
  - Probability is over the random choice of h from H
  - Probability of collision = similarity between p and q
10
Syntactic Clustering by LSH

P ← empty table of size |S|
G ← empty graph on |S| nodes
choose a random h from H
for i = 1,…,|S|:
    read document pi from the stream
    P[i] ← h(pi)
sort P and group by value
output groups
11
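A minimal Python sketch of this loop, assuming `h` is drawn from some locality-sensitive family (any callable works for the mechanics). Grouping hash values in a dictionary plays the role of the sort-and-group step:

```python
# One hash per document, then group by hash value; documents that
# collide are likely similar by the LSH property.
from collections import defaultdict

def cluster_by_lsh(docs, h):
    groups = defaultdict(list)
    for i, p in enumerate(docs):   # one streaming pass: P[i] <- h(p_i)
        groups[h(p)].append(i)
    return list(groups.values())   # each group = one hash value
```

This runs in a single pass plus O(|S|) grouping work, instead of the |S|^2 comparisons of the sketch-table approach.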
Analysis

- Can compute hash values in one pass
- Table P can be stored in a single file on a single machine
- Sorting and grouping takes O(|S| log |S|) simple comparisons
- Each group consists of pages with the same hash value
  - By the LSH property, they are likely to be similar to each other
- Let's apply this to the web and see if it makes sense
- Need a sim measure. Idea: shingling
12
Shingling and Resemblance
[Broder et al 97]

- tokens: words, numbers, HTML tags, etc.
- tokenization(p): sequence of tokens produced from document p
- w: a small integer
- Sw(p) = w-shingling of p = the set of all distinct contiguous subsequences of tokenization(p) of length w
  - Ex: p = "a rose is a rose is a rose", w = 4
  - Sw(p) = { (a rose is a), (rose is a rose), (is a rose is) }
  - Possible to use multisets as well
- resemblance_w(p,q) = |Sw(p) ∩ Sw(q)| / |Sw(p) ∪ Sw(q)|
13
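The definitions above translate directly to Python; splitting on whitespace is a crude stand-in for a real tokenizer (which would also handle numbers, HTML tags, etc.):

```python
def shingles(text, w):
    """S_w(p): all distinct contiguous w-token subsequences."""
    tokens = text.split()  # crude tokenizer
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(p, q, w):
    """|S_w(p) & S_w(q)| / |S_w(p) | S_w(q)| (Jaccard of shingle sets)."""
    sp, sq = shingles(p, w), shingles(q, w)
    return len(sp & sq) / len(sp | sq)
```

On the slide's example, `shingles("a rose is a rose is a rose", 4)` yields exactly the three shingles listed.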
Shingling Example

- A = "a rose is a rose is a rose"
- B = "a rose is a flower which is a rose"
- Preserving multiplicity:
  - w=1 → sim(SA,SB) = 0.7
    - SA = {a, a, a, is, is, rose, rose, rose}
    - SB = {a, a, a, is, is, rose, rose, flower, which}
  - w=2 → sim(SA,SB) = 0.5
  - w=3 → sim(SA,SB) = 0.3
- Disregarding multiplicity:
  - w=1 → sim(SA,SB) = 0.6
  - w=2 → sim(SA,SB) = 0.5
  - w=3 → sim(SA,SB) = 3/7 ≈ 0.43
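The multiplicity-preserving numbers above can be checked with `collections.Counter`, whose `&` and `|` operators implement element-wise min and max of counts, i.e. multiset intersection and union:

```python
from collections import Counter

def multiset_shingles(text, w):
    """Shingles with multiplicity, as a Counter."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1))

def multiset_resemblance(a, b, w):
    """Jaccard coefficient over shingle multisets."""
    ca, cb = multiset_shingles(a, w), multiset_shingles(b, w)
    inter = sum((ca & cb).values())  # element-wise min of counts
    union = sum((ca | cb).values())  # element-wise max of counts
    return inter / union
```

For the two sentences above this reproduces 0.7, 0.5, and 0.3 for w = 1, 2, 3.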
LSH for Resemblance

- resemblance_w(p,q) = |Sw(p) ∩ Sw(q)| / |Sw(p) ∪ Sw(q)|
- π: a random permutation on W, the set of all length-w token sequences
  - π induces a random order on W
  - π also induces a random order on any subset X ⊆ W
  - For each such subset X and each x ∈ X, Pr[min π(X) = π(x)] = 1/|X|
- LSH for resemblance: h(p) = min π(Sw(p))
15
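A small experiment, under the assumption that the shingle universe is tiny enough to permute explicitly, illustrates the LSH property Pr[h(p) = h(q)] = resemblance. The sets `A` and `B` below are toy shingle sets, not from the slides:

```python
import random

def minhash(shingle_set, perm):
    """h(p) = minimum rank of p's shingles under the permutation."""
    return min(perm[s] for s in shingle_set)

A = {"a", "b", "c", "d"}
B = {"c", "d", "e", "f"}          # resemblance(A, B) = 2/6
universe = sorted(A | B)
rng = random.Random(0)
hits, trials = 0, 20000
for _ in range(trials):
    order = universe[:]
    rng.shuffle(order)            # a uniformly random permutation
    perm = {s: r for r, s in enumerate(order)}
    hits += minhash(A, perm) == minhash(B, perm)
print(hits / trials)              # close to 2/6, about 0.333
```

The collision fraction concentrates around |A ∩ B| / |A ∪ B| = 1/3, matching the lemma that follows.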
LSH for Resemblance (cont.)

- Lemma: Pr[min π(Sw(p)) = min π(Sw(q))] = resemblance_w(p,q)
- Proof: Let X = Sw(p) ∪ Sw(q). Since π orders X uniformly at random, the minimum of π(X) falls on each x ∈ X with probability 1/|X|. The two minima coincide iff that minimizer lies in Sw(p) ∩ Sw(q), which happens with probability |Sw(p) ∩ Sw(q)| / |Sw(p) ∪ Sw(q)|.
16
Problems

- How do we pick π?
  - Need a truly random choice
  - Need to efficiently find the min element
- How many possible values of π?
  - There are |Σ|^w! permutations, so representing π needs O(|Σ|^w log |Σ|^w) bits at minimum
  - Still need to compute the min element
Some Theory: Pairwise Independence

- Universal hash functions (pairwise independent):
  - H: a finite collection (family) of hash functions mapping U → {0,…,m−1}
  - H is universal if, for h in H picked uniformly at random, and for all x1, x2 in U with x1 ≠ x2,
    Pr[h(x1) = h(x2)] ≤ 1/m
- The class of hash functions
  h_{a,b}(x) = ((a·x + b) mod p) mod m
  is universal (p ≥ |U| a prime, a ∈ {1,…,p−1}, b ∈ {0,…,p−1})
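The family h_{a,b} is a few lines of Python. The prime chosen here, 2^61 − 1 (a Mersenne prime), is an assumption for the sketch; any prime at least as large as the key universe works:

```python
import random

P = 2**61 - 1  # prime modulus; assumes keys are below 2^61 - 1

def make_universal_hash(m, rng=random):
    """Draw h_{a,b}(x) = ((a*x + b) mod P) mod m at random."""
    a = rng.randrange(1, P)  # a in {1, ..., p-1}
    b = rng.randrange(0, P)  # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % P) % m
```

A drawn function is deterministic, so it can be reused across a whole pass over the document stream.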
Some Theory: Min-wise Independence

- Min-wise independent permutations:
  - F: a finite collection (family) of permutations mapping {1,…,n} to {1,…,n}
  - F is min-wise independent if, for π in F picked uniformly at random, and for every X ⊆ {1,…,n} and all x ∈ X,
    Pr[min{π(X)} = π(x)] = 1/|X|
- It is actually hard to find a "compact" collection of permutations that is min-wise independent, but we can use an approximation.
- In practice, universal hashes work well!
Back to Similarity and Resemblance

- If π is picked from a min-wise independent family, then:
  Pr[min{π(S(A))} = min{π(S(B))}] = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| = r(A,B)
- This suggests we could just keep one minimum value as our "sketch", but our confidence would be low (high variance)
- What we want for a sketch of size k is either:
  - use k π's, or
  - keep the k minimum values for one π
Multiple Permutations

- Better variance reduction:
  - Instead of keeping k minima of one permutation, keep one minimum per permutation
  - Use multiple, independent permutations
- Sketch construction:
  - Pick p random permutations of U: π1, π2, …, πp
  - sk(A) = the minimal elements under π1(SA), …, πp(SA)
- Claim: E[ sim(sk(A), sk(B)) ] = sim(SA, SB)
  - The earlier lemma gives the case p=1
  - Linearity of expectation
  - Variance reduction follows from the independence of π1, …, πp
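A sketch of this construction, substituting k universal-style hash functions for the k random permutations (as noted above, exact min-wise families are impractical, and universal hashes work well in practice). `zlib.crc32` is used here only to map string shingles to integers deterministically:

```python
import random
import zlib

P = 2**61 - 1  # prime modulus for the hash family

def make_minhash_fns(k, seed=0):
    """k independent hash functions standing in for k permutations."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]
    return [lambda s, a=a, b=b: (a * zlib.crc32(s.encode()) + b) % P
            for a, b in coeffs]

def sketch(shingle_set, fns):
    """sk(A): the minimal element under each of the k orderings."""
    return [min(f(s) for s in shingle_set) for f in fns]

def estimated_resemblance(sk_a, sk_b):
    """Fraction of agreeing coordinates; its expectation is r(A,B)."""
    agree = sum(x == y for x, y in zip(sk_a, sk_b))
    return agree / len(sk_a)
```

With k on the order of a few hundred, the per-coordinate variance averages out and the estimate is close to the true resemblance.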
Other Known Sketching Schemes

- Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
- Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]
- Cosine similarity [Charikar 02]
- Earth mover distance [Charikar 02]
- Edit distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04]
22
The General Sketching Model

- The k vs. r gap problem:
  - Promise: d(x,y) ≤ k or d(x,y) ≥ r
  - Goal: decide which of the two holds
- Alice holds x, Bob holds y, and they use shared randomness
- Alice sends the sketch ρ(x) and Bob sends the sketch ρ(y) to a referee
- From the two sketches alone, the referee decides whether d(x,y) ≤ k or d(x,y) ≥ r
- The gap between k and r is the approximation
23
Applications

- Large data sets:
  - Clustering
  - Nearest-neighbor schemes
  - Data streams
- Management of files over the network:
  - Differential backup
  - Synchronization
- Theory:
  - Low-distortion embeddings
  - Simultaneous-messages communication complexity
24