Sketching - University of Cincinnati
Lecture 18
Syntactic Web Clustering
CS 728 - 2007
Outline
Previously
Studied web clustering based on web link structure
Some discussion of term-document vector spaces
Today
Syntactic clustering of the web
Identifying syntactic duplicates
Locality sensitive hash functions
Resemblance and shingling
Min-wise independent permutations
The sketching model
Hamming distance and Edit distance
Motivation: Near-Duplicate Elimination
Many web pages are duplicates or near-duplicates of other pages
Mirror sites
FAQs, manuals, legal documents
Different versions of the same document
Plagiarism
Duplicates are bad for search engines
Increase index size
Harm quality of search results
Question: How to efficiently process the repository of crawled pages and eliminate (near-)duplicates?
Syntactic Clustering of the Web
[Broder, Glassman, Manasse, Zweig 97]
U: space of all possible documents
S ⊆ U: collection of documents
Given sim: U × U → [0,1]: a similarity measure among documents
If p,q are very similar, sim(p,q) is close to 1
If p,q are very dissimilar, sim(p,q) is close to 0
Usually: sim(p,q) = 1 – d(p,q), where d(p,q) is a normalized distance between p and q.
G: a threshold graph on S:
p,q are connected by an edge iff sim(p,q) ≥ t (t = threshold)
Goal: find the connected components of G
Main Challenges
S is huge
Web has 10 billion pages
Documents are not compressed
Needs many disks to store S
Each sim computation is costly
Documents in S should be processed in a stream
Main memory is small relative to S
Cannot afford more than O(|S|) time
How to create the graph G?
Naively, requires |S| passes and |S|^2 similarity computations
Sketching Schemes
T = a small set (|S| < |T| << |U|)
A sketching scheme for sim:
Compression function: a randomized mapping φ: U → T
Reconstruction function: ψ: T × T → [0,1]
For every pair p,q, with high probability, ψ(φ(p),φ(q)) ≈ sim(p,q)
Syntactic Clustering by Sketching
P ← empty table of size |S|
G ← empty graph on |S| nodes
for i = 1,…,|S|
    read document pi from the stream
    P[i] ← φ(pi)
for i = 1,…,|S|
    for j = 1,…,|S|
        if (ψ(P[i],P[j]) ≥ t)
            add edge (i,j) to G
output connected components of G
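The loop above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: `compress` and `reconstruct` are hypothetical stand-ins for the compression and reconstruction functions, and the toy versions used here (word sets and their Jaccard ratio) are assumptions chosen only to make the example run. Because sim is symmetric, the sketch visits each unordered pair once instead of all (i, j).

```python
from itertools import combinations

def connected_components(n, edges):
    """Union-find over nodes 0..n-1; returns the components as sorted lists."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

def cluster_by_sketching(stream, compress, reconstruct, t):
    # One pass over the stream: only the sketches are kept.
    P = [compress(doc) for doc in stream]
    # Quadratic number of calls to the (cheap) reconstruction function.
    edges = [(i, j) for i, j in combinations(range(len(P)), 2)
             if reconstruct(P[i], P[j]) >= t]
    return connected_components(len(P), edges)

# Toy stand-ins (assumptions): sketch = set of words, estimate = Jaccard ratio.
def toy_compress(doc):
    return frozenset(doc.split())

def toy_reconstruct(a, b):
    return len(a & b) / len(a | b)

docs = ["a rose is a rose", "a rose is a flower", "completely different text"]
clusters = cluster_by_sketching(docs, toy_compress, toy_reconstruct, t=0.5)
print(clusters)
```

The first two toy documents share 3 of 4 distinct words, so they end up in one component; the third stays alone.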
Analysis
Can compute sketches in one pass
Table P can be stored in a single file on a single machine
Creating G requires |S|^2 applications of the reconstruction function ψ
Easier than full-fledged computations of sim
Quadratic time is still a problem
Connected components algorithm is heavy but feasible
Need a linear-time algorithm, even if it is only approximate
Idea: Use Hashing
Sketching vs Fingerprinting vs Hashing
Hashing h: {0,1}* → {0,1}^k
Set Membership testing for set S of size n
Desire uniform distribution over bin address {0,1}^k
Minimize collisions per bin – reduce lookup time
Minimize hash table size n ≈ N = 2^k
Fingerprinting f: {0,1}* → {0,1}^k
Object Equality testing over set S of size n
Distribution over {0,1}^k is irrelevant
Avoid collisions altogether
Tolerate larger k – typically N > n^2
Sketching φ: {0,1}* → {0,1}^k
Similarity testing for set S of size n
Distribution over {0,1}^k is irrelevant
Minimize collisions of dissimilar sets
Minimize table size n ≈ N = 2^k
Sketching via Locality Sensitive Hashing (LSH)
[Indyk, Motwani, 98]
H = { h | h: U → T }: a family of hash functions
H is locality sensitive w.r.t. sim if for all p,q ∈ U, Pr[h(p) = h(q)] = sim(p,q).
Probability is over random choice of h from H
Probability of collision = similarity between p and q
Syntactic Clustering by LSH
P ← empty table of size |S|
G ← empty graph on |S| nodes
Choose random h
for i = 1,…,|S|
    read document pi from the stream
    P[i] ← h(pi)
sort P and group by value
output groups
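A minimal Python sketch of the sort-and-group step, under stated assumptions: `h` is any function that maps a document to a sortable hash value, and the `toy_h` used here (the alphabetically smallest word, i.e. a min-hash under one fixed order rather than a random one) is a hypothetical placeholder, not a real locality sensitive family.

```python
from itertools import groupby

def cluster_by_lsh(stream, h):
    # One pass: store (hash value, document index) pairs.
    P = [(h(doc), i) for i, doc in enumerate(stream)]
    P.sort()  # O(|S| log |S|) simple comparisons
    # Documents with equal hash values form one group.
    return [[i for _, i in group]
            for _, group in groupby(P, key=lambda entry: entry[0])]

def toy_h(doc):
    # Placeholder hash (assumption): smallest word in alphabetical order.
    return min(doc.split())

docs = ["b c a", "a d", "z y"]
groups = cluster_by_lsh(docs, toy_h)
print(groups)
```

No pairwise comparisons are made; the price is that a single hash value may separate similar pages or merge dissimilar ones.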
Analysis
Can compute hash values in one pass
Table P can be stored in a single file on a single machine
Sorting and grouping takes O(|S| log |S|) simple comparisons
Each group consists of pages whose hash value is the same
By LSH property, they are likely to be similar to each other
Let’s apply this to the web and see if it makes sense
Need sim measure – Idea: shingling
Shingling and Resemblance
[Broder et al 97]
tokens: words, numbers, HTML tags, etc.
tokenization(p): sequence of tokens produced from document p
w: a small integer
Sw(p) = w-shingling of p = set of all distinct contiguous subsequences of tokenization(p) of length w.
Ex: p = “a rose is a rose is a rose”, w = 4
Sw(p) = { (a rose is a), (rose is a rose), (is a rose is) }
Possible to use multisets as well
$\mathrm{resemblance}_w(p,q) = \frac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$
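A minimal sketch of w-shingling and resemblance in Python. It assumes whitespace tokenization (the slides also count numbers and HTML tags as tokens) and takes resemblance to be the ratio of shared to total distinct shingles, as in [Broder et al 97]; the function names are illustrative.

```python
def tokenize(text):
    # Assumption: tokens are whitespace-separated words.
    return text.split()

def shingles(text, w):
    """Set of all distinct contiguous token subsequences of length w."""
    tokens = tokenize(text)
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(p, q, w):
    sp, sq = shingles(p, w), shingles(q, w)
    if not sp and not sq:
        return 1.0  # convention chosen here for two documents with no shingles
    return len(sp & sq) / len(sp | sq)

p = "a rose is a rose is a rose"
print(shingles(p, 4))
```

For the example document and w = 4 this yields the three shingles listed above, since the repeated ones collapse in the set.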
Shingling Example
A = “a rose is a rose is a rose”
B = “a rose is a flower which is a rose”
Preserving multiplicity
w=1 sim(SA,SB) = 0.7
SA = {a, a, a, is, is, rose, rose, rose}
SB = {a, a, a, is, is, rose, rose, flower, which}
w=2 sim(SA,SB) = 0.5
w=3 sim(SA,SB) = 0.3
Disregarding multiplicity
w=1 sim(SA,SB) = 0.6
w=2 sim(SA,SB) = 0.5
w=3 sim(SA,SB) = 3/7 ≈ 0.4286
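The similarity values in this example can be reproduced with a short Python check. It assumes whitespace tokenization and, for the multiset case, takes intersection and union as the element-wise minimum and maximum of shingle counts; that reading is an assumption, but it matches the numbers on the slide.

```python
from collections import Counter

def shingle_bag(text, w):
    """Multiset (Counter) of the w-shingles of a whitespace-tokenized text."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1))

def sim_multiset(a, b, w):
    sa, sb = shingle_bag(a, w), shingle_bag(b, w)
    # Counter & keeps minimum counts, Counter | keeps maximum counts.
    return sum((sa & sb).values()) / sum((sa | sb).values())

def sim_set(a, b, w):
    sa, sb = set(shingle_bag(a, w)), set(shingle_bag(b, w))
    return len(sa & sb) / len(sa | sb)

A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"
for w in (1, 2, 3):
    print(w, sim_multiset(A, B, w), sim_set(A, B, w))
```

With multiplicity the ratios are 7/10, 5/10 and 3/10; without it they are 3/5, 3/6 and 3/7.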
LSH for Resemblance
$\mathrm{resemblance}_w(p,q) = \frac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$
π = a random permutation on Σ^w (Σ = the set of all tokens)
π induces a random order on all length w sequences of tokens
π also induces a random order on any subset X ⊆ Σ^w
For each such subset and for each x ∈ X, Pr(min(π(X)) = π(x)) = 1/|X|
LSH for resemblance: h(p) = min(π(Sw(p)))
LSH for Resemblance (cont.)
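Lemma: $\Pr[h(p) = h(q)] = \mathrm{resemblance}_w(p,q)$
Proof: Under a random π, every element of $S_w(p) \cup S_w(q)$ is equally likely to be the one with the minimum image. h(p) = h(q) exactly when that element lies in $S_w(p) \cap S_w(q)$, which happens with probability $|S_w(p) \cap S_w(q)| / |S_w(p) \cup S_w(q)|$.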
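The collision probability of the min-hash h can be checked exactly on a tiny universe by enumerating every permutation. The sets `A`, `B` and the universe `U` below are made-up stand-ins for shingle sets; this is an illustrative check under those assumptions, not part of the original lecture.

```python
from fractions import Fraction
from itertools import permutations

def collision_probability(A, B, universe):
    """Exact Pr[min pi(A) == min pi(B)] over all permutations pi of universe."""
    items = sorted(universe)
    hits = total = 0
    for images in permutations(range(len(items))):
        pi = dict(zip(items, images))
        hits += min(pi[x] for x in A) == min(pi[x] for x in B)
        total += 1
    return Fraction(hits, total)

# Hypothetical shingle sets, encoded as small integers.
A = {1, 2, 3, 4}
B = {3, 4, 5}
U = {1, 2, 3, 4, 5, 6}
print(collision_probability(A, B, U), Fraction(len(A & B), len(A | B)))
```

Both printed values agree: the two minima coincide exactly when the smallest element of the union falls in the intersection.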
Problems
How do we pick π?
Need random choice
Need to efficiently find min element
How many possible values of π?
(|Σ|^w)! So need Ω(|Σ|^w log |Σ|^w) bits to represent π at minimum
Still need to compute min element
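To get a feel for the size of that representation, the number of bits needed to name one permutation of n items is log2(n!). The sketch below evaluates it with `math.lgamma`; the vocabulary size of 10^5 tokens and the width w = 4 are made-up figures for illustration only.

```python
import math

def bits_to_name_permutation(n):
    """log2(n!): bits needed to single out one of the n! permutations."""
    return math.lgamma(n + 1) / math.log(2)

# Hypothetical sizes (assumptions): 10**5 distinct tokens, shingle width w = 4.
universe_size = (10 ** 5) ** 4
bits = bits_to_name_permutation(universe_size)
print(f"{bits:.3e} bits")
```

Even under these modest assumptions the count is above 10^21 bits, which is why an explicit random permutation of all shingles is not an option.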
Some Theory: Pairwise independent
Universal Hash functions: (Pairwise independent)
H : a finite collection (family) of hash functions mapping U → {0...m-1}
H is universal if,
for h in H picked uniformly at random,
and for all x1, x2 in U, x1 ≠ x2
Pr(h(x1) = h(x2)) ≤ 1/m
The class of hash functions
hab(x) = ((a x + b) mod p) mod m
is universal (p ≥ m is a prime, a ∈ {1…p-1}, b ∈ {0…p-1})
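A minimal Python sketch of this family, with an exhaustive check of the collision bound for small parameters. It assumes keys are integers in {0…p-1}; the values p = 11 and m = 4 are arbitrary illustrative choices.

```python
import random
from fractions import Fraction
from itertools import combinations

def make_universal_hash(p, m, rng=random):
    """Draw h_ab(x) = ((a*x + b) mod p) mod m uniformly from the family."""
    a = rng.randrange(1, p)  # a in {1, ..., p-1}
    b = rng.randrange(0, p)  # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % m

def max_collision_probability(p, m):
    """Largest Pr[h(x1) == h(x2)] over key pairs x1 != x2, by enumerating all (a, b)."""
    worst = Fraction(0)
    for x1, x2 in combinations(range(p), 2):
        collisions = sum(
            ((a * x1 + b) % p) % m == ((a * x2 + b) % p) % m
            for a in range(1, p) for b in range(p)
        )
        worst = max(worst, Fraction(collisions, (p - 1) * p))
    return worst

h = make_universal_hash(11, 4, random.Random(1))
print([h(x) for x in range(11)], max_collision_probability(11, 4))
```

The enumeration confirms that no pair of distinct keys collides with probability above 1/m for these parameters.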
Some Theory: Minwise independent
Minwise independent permutations:
Sn : a finite collection (family) of permutations mapping {1…n} to {1…n}
Sn is minwise independent if,
for π in Sn picked uniformly at random,
and for all X subset of {1…n}, and all x in X
Pr(min{π(X)} = π(x)) = 1/|X|
It is actually hard to find a “compact” collection of hash
functions that is minwise independent, but we can use
an approximation.
In practice – universal hashes work well!
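One common way to act on this in practice is to replace the random permutation with a random linear map modulo a large prime. The sketch below does that under stated assumptions: shingles are first mapped to integers with CRC32 (an arbitrary choice made here), the prime is 2^61 - 1, and the resulting family is only approximately min-wise independent.

```python
import random
import zlib

P = (1 << 61) - 1  # the Mersenne prime 2^61 - 1

def shingle_id(shingle):
    """Map a shingle (tuple of tokens) to an integer; CRC32 is an arbitrary choice."""
    return zlib.crc32(" ".join(shingle).encode("utf-8"))

def make_linear_permutation(rng):
    """x -> (a*x + b) mod P is a bijection on {0, ..., P-1} because a != 0."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: (a * x + b) % P

def minhash(shingle_set, pi):
    # Smallest image of any shingle under the approximate permutation.
    return min(pi(shingle_id(s)) for s in shingle_set)

rng = random.Random(0)
pi = make_linear_permutation(rng)
X = {("a", "rose", "is"), ("rose", "is", "a"), ("is", "a", "rose")}
print(minhash(X, pi))
```

Storing such a map takes two integers, compared with the enormous table a truly random permutation of all shingles would need.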
Back to similarity and resemblance
If π in Sn and Sn is minwise independent then:
$$\Pr[\min\{\pi(S(A))\} = \min\{\pi(S(B))\}] = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|} = r(A,B)$$
This suggests we could just keep one minimum value as
our “sketch”, but our confidence would be low (high
variance)
What we want for a sketch of size k is either
use k π’s,
or keep the k minimum values for one π
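The second option (one permutation, k smallest values) can be sketched as follows. The estimator looks at the k smallest values of the two sketches combined and counts how many occur in both; this follows the estimator described in [Broder et al 97] as I understand it. The explicit shuffled ranking and the integer sets are illustrative assumptions.

```python
import random

def random_rank(universe, seed=0):
    """Rank of each item under one random permutation of the universe."""
    items = sorted(universe)
    random.Random(seed).shuffle(items)
    return {x: r for r, x in enumerate(items)}

def bottom_k(S, rank, k):
    # Sketch: the k smallest permuted values of the set.
    return sorted(rank[x] for x in S)[:k]

def estimate(sk_a, sk_b, k):
    a, b = set(sk_a), set(sk_b)
    smallest = sorted(a | b)[:k]  # k smallest values of the union
    return sum(v in a and v in b for v in smallest) / len(smallest)

rank = random_rank(range(100))
A, B, C = set(range(0, 60)), set(range(30, 90)), set(range(90, 100))
k = 10
print(estimate(bottom_k(A, rank, k), bottom_k(B, rank, k), k))
```

Here the true resemblance of A and B is 30/90; with k = 10 the estimate is coarse, and larger k tightens it.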
Multiple Permutations
Better Variance Reduction
Instead of larger k, stick with k=1
Multiple, independent permutations
Sketch Construction
Pick p random permutations of U – π1,π2, …,πp
sk(A) = minimal elements under π1(SA), …, πp(SA)
Claim: E[ sim(sk(A),sk(B)) ] = sim(SA,SB)
Earlier lemma true for p=1
Linearity of expectations
Variance reduction – independence of π1, …,πp
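A minimal sketch of this construction, assuming a universe small enough to shuffle explicitly (for real shingle spaces a hashed approximation would be needed). The similarity of two sketches is taken to be the fraction of permutations on which the minimal elements agree; the integer sets stand in for SA and SB.

```python
import random

def make_permutations(universe, p, seed=0):
    """p independent random permutations, each stored as an item -> rank dict."""
    rng = random.Random(seed)
    items = sorted(universe)
    perms = []
    for _ in range(p):
        order = items[:]
        rng.shuffle(order)
        perms.append({x: r for r, x in enumerate(order)})
    return perms

def sketch(S, perms):
    # sk(A): the minimal element of S under each permutation.
    return [min(S, key=pi.__getitem__) for pi in perms]

def estimate_sim(sk_a, sk_b):
    # Fraction of permutations on which the two minima coincide.
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

perms = make_permutations(range(100), p=200)
A, B, C = set(range(0, 60)), set(range(30, 90)), set(range(90, 100))
print(estimate_sim(sketch(A, perms), sketch(B, perms)))
```

Each coordinate agrees with probability sim(SA,SB), so averaging over p independent permutations keeps the expectation and shrinks the variance by a factor of p.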
Other Known Sketching Schemes
Resemblance
[Broder, Glassman, Manasse, Zweig 97],
[Broder, Charikar, Frieze, Mitzenmacher 98]
Hamming distance
[Kushilevitz, Ostrovsky, Rabani 98],
[Indyk, Motwani 98],
[Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]
Cosine similarity [Charikar 02]
Earth mover distance [Charikar 02]
Edit distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04]
The General Sketching Model
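k vs. r Gap Problem
Promise: d(x,y) ≤ k or d(x,y) ≥ r
Goal: Decide which of the two holds.
[Figure: Alice holds x and Bob holds y. Using shared randomness, each sends a sketch of its input to a Referee, who decides between d(x,y) ≤ k and d(x,y) ≥ r. The figure also carries the label "Approximation".]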
Applications
Large data sets
Clustering
Nearest Neighbor schemes
Data streams
Management of Files over the Network
Differential backup
Synchronization
Theory
Low distortion embeddings
Simultaneous messages communication complexity