CS206 --- Electronic Commerce

My Favorite Algorithms for
Large-Scale Data Mining
Shingling
Minhashing
Locality-Sensitive Hashing
1
Similarity Search
A universal set of “objects.”
A collection of sets of objects.
Find the pairs of sets that are “similar.”
• Jaccard similarity of sets = size of
intersection divided by size of union.
2
Example: Jaccard Similarity
3 in intersection.
8 in union.
Jaccard similarity
= 3/8
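The definition can be sketched in a few lines of Python. The two sets below are hypothetical, chosen only so that 3 objects are shared and 8 appear overall; returning 0 for two empty sets is an assumption, since the slides do not cover that case.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets get similarity 0
    return len(a & b) / len(union)


# Hypothetical sets: 3 objects in the intersection, 8 in the union.
s1 = {1, 2, 3, 4, 5}
s2 = {3, 4, 5, 6, 7, 8}
print(jaccard(s1, s2))  # 3/8 = 0.375
```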
3
Applications
• Collaborative Filtering: Amazon
customers as the set of products they
buy.
  • Recommend what similar customers bought.
• Similar Documents: A document as its
set of k-shingles = strings of k
consecutive characters.
  • Examples: news articles from same source,
plagiarism.
4
Applications – (2)
Fingerprint Checking : Represent a
fingerprint by the set of positions of
minutiae.
• Requires discretization.
Entity Resolution : Represent records
describing individuals by sets of
attribute/value pairs.
5
Key Ideas
1. Shingling : (Andrei Broder) Convert
documents into sets.
2. Minhashing : (Edith Cohen, Broder)
Construct small signatures for sets so
Jaccard similarity of sets can be
determined from the signatures.
3. Locality-Sensitive Hashing : (Rajeev
Motwani, Piotr Indyk) Focus on (likely)
similar pairs without looking at all pairs.
6
The Big Picture – Documents
Document
→ The set of strings of length k that appear in the document
→ Signatures: short integer vectors that represent the sets and reflect their similarity
→ Locality-sensitive hashing
→ Candidate pairs: those pairs of signatures that we need to test for similarity.
7
The Big Picture – Fingerprints
Fingerprint
→ (Optional minhashing here)
→ Locality-sensitive hashing
→ Candidate pairs: those pairs of fingerprints that we need to test for similarity.
8
When Is Similarity Interesting?
1. When the sets are so large or so many
that they cannot fit in main memory.
2. When there are so many sets that
comparing all pairs of sets takes too
much time.
9
Shingling
k -Shingles
Documents as Sets
10
Shingles
A k-shingle (or k-gram) for a document
is a sequence of k characters that
appears in the document.
Example: k = 2; doc = abcab. Set of 2-shingles = {ab, bc, ca}.
• Option: regard shingles as a bag, and count
ab twice.
Represent a doc by its set of k -shingles.
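A minimal sketch of shingling, covering both the set view and the optional bag view. The function names are illustrative and not taken from the slides.

```python
from collections import Counter


def shingles(doc: str, k: int) -> set:
    """Set of all length-k substrings (k-shingles) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def shingle_bag(doc: str, k: int) -> Counter:
    """Bag (multiset) variant: repeated shingles are counted."""
    return Counter(doc[i:i + k] for i in range(len(doc) - k + 1))


print(sorted(shingles("abcab", 2)))   # ['ab', 'bc', 'ca']
print(shingle_bag("abcab", 2)["ab"])  # 2
```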
11
Working Assumption
Documents that have lots of shingles in
common have similar text, even if the
text appears in different order.
Careful: you must pick k large enough,
or most documents will have most
shingles.
• k = 5 is OK for short documents; k = 10 is
better for long documents.
12
Shingles: Compression Option
To compress long shingles, we can hash
them to (say) 4 bytes.
Represent a doc by the set of hash
values of its k-shingles.
Two documents could (rarely) appear to
have shingles in common, when in fact
only the hash-values were shared.
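One way to sketch the compression option is shown below. The slides do not name a hash function; CRC-32 from Python's standard `zlib` module is used here purely as an assumed stand-in that yields 4-byte values.

```python
import zlib


def hashed_shingles(doc: str, k: int) -> set:
    """Represent doc by the set of 32-bit hash values of its k-shingles.

    Assumption: CRC-32 stands in for whatever 4-byte hash is preferred.
    """
    return {
        zlib.crc32(doc[i:i + k].encode("utf-8"))
        for i in range(len(doc) - k + 1)
    }


values = hashed_shingles("the quick brown fox", 10)
print(len(values), all(v < 2**32 for v in values))
```

Two different shingles can share a hash value, which is the rare false overlap the slide warns about.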
13
Minhashing
Matrix Formulation
Signatures
Similarity of Signatures
14
Similarity as a Matrix Problem
Think of sets represented by a matrix
of 0’s and 1’s.
Row = object.
Column = set.
1 means that object is in that set.
15
Example: Similarity of Columns
Row   C1   C2
u     0    1    *
v     1    0    *
w     1    1    **
x     0    0
y     1    1    **
z     0    1    *
(* = row in the union; ** = row in the intersection)
C1 = {v, w, y}
C2 = {u, w, y, z}
Sim(C1, C2) = 2/5 = 0.4
16
Four Types of Rows
Given columns C1 and C2, rows may be
classified as:
Type   C1   C2
a      1    1
b      1    0
c      0    1
d      0    0
Also, a = number of rows of type a, etc.
Note Sim(C1, C2) = a/(a + b + c).
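As an illustration, the row types can be counted directly for the two columns of the earlier example (rows u through z). The function name is an assumption made for this sketch.

```python
def row_type_counts(c1, c2):
    """Return (a, b, c, d): the number of rows of each type."""
    pairs = list(zip(c1, c2))
    a = pairs.count((1, 1))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))
    return a, b, c, d


# Columns from the "Similarity of Columns" example, rows u, v, w, x, y, z.
C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]
a, b, c, d = row_type_counts(C1, C2)
print(a, b, c, d)       # 2 1 2 1
print(a / (a + b + c))  # 0.4
```

Type-d rows do not enter the formula, which is why sparse data with many such rows causes no difficulty.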
17
Minhashing
Imagine the rows permuted randomly.
Define “hash” function h(C) = the
number of the first (in the permuted
order) row in which column C has 1.
Use several (100?) independent hash
functions to create a signature with that
number of integer hash-values.
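A direct, unoptimized sketch of this definition follows: it physically shuffles the row order once per hash function, which only works for small matrices. The function name, the seed parameter, and the use of `None` for an all-zero column are assumptions of this sketch.

```python
import random


def minhash_signatures(columns, num_hashes, seed=0):
    """One signature per column, each with num_hashes minhash values.

    Each value is the 1-based position, in a random row order, of the
    first row in which the column has a 1 (None for an all-zero column).
    """
    n = len(columns[0])
    rng = random.Random(seed)
    signatures = [[] for _ in columns]
    for _ in range(num_hashes):
        order = list(range(n))
        rng.shuffle(order)  # order[p] is the row placed at position p
        for sig, col in zip(signatures, columns):
            first = next(
                (pos for pos, row in enumerate(order, start=1) if col[row] == 1),
                None,
            )
            sig.append(first)
    return signatures


C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]
sig1, sig2 = minhash_signatures([C1, C2], num_hashes=100)
print(sig1[:5], sig2[:5])
```

The same permutations must be used for every column, which is why all columns are processed together here.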
18
Minhashing Example
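Input matrix, with the position of each row under three permutations (labeled π1, π2, π3 here):

π1  π2  π3     C1  C2  C3  C4
1   4   3      1   0   1   0
3   2   4      1   0   0   1
7   1   7      0   1   0   1
6   3   6      0   1   0   1
2   6   1      0   1   0   1
5   7   2      1   0   1   0
4   5   5      1   0   1   0

Signature matrix M:

       C1  C2  C3  C4
π3     2   1   2   1
π2     2   1   4   1
π1     1   2   1   2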
19
Surprising Property
The probability (over all permutations
of the rows) that h(C1) = h(C2) is the
same as Sim(C1, C2).
Both are a/(a + b + c)! Why?
• Look down columns C1 and C2 (in
permuted order) until we see a 1.
• If it’s a type-a row, then h(C1) = h(C2).
If a type-b or type-c row, then not.
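The property can be checked exactly on a small case by enumerating every permutation of the rows. This sketch uses the six-row columns C1 and C2 from the earlier similarity example, for which Sim(C1, C2) = 2/5; the helper name is illustrative.

```python
from fractions import Fraction
from itertools import permutations


def first_one(column, order):
    """Position (in the permuted order) of the first row holding a 1."""
    for position, row in enumerate(order):
        if column[row] == 1:
            return position
    return None


C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]

orders = list(permutations(range(6)))  # all 720 row orders
agree = sum(1 for o in orders if first_one(C1, o) == first_one(C2, o))
print(Fraction(agree, len(orders)))  # 2/5
```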
20
Similarity for Signatures
The similarity of signatures is the
fraction of the rows in which they
agree.
• Remember, each row corresponds to a
permutation or “hash function.”
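A small sketch of the comparison, applied to columns of the signature matrix from the Minhashing Example slide. With only three rows the estimate is coarse; the function name is an assumption.

```python
def signature_similarity(sig1, sig2):
    """Fraction of signature rows in which the two signatures agree."""
    if len(sig1) != len(sig2) or not sig1:
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


# Columns 1 to 4 of the signature matrix in the Minhashing Example.
col1, col2, col3, col4 = [2, 2, 1], [1, 1, 2], [2, 4, 1], [1, 1, 2]
print(signature_similarity(col1, col3))  # 2 of 3 rows agree
print(signature_similarity(col1, col2))  # 0.0
print(signature_similarity(col2, col4))  # 1.0
```

For comparison, the Jaccard similarities of the underlying columns are 3/4 for columns 1 and 3, 0 for columns 1 and 2, and 3/4 for columns 2 and 4, so three hash functions give only a rough estimate.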
21
Implementation – (1)
• You can’t really permute rows
physically.
• Good approximation to permuting
rows: pick 100 (?) hash functions.
• For each column c and each hash
function h_i, keep a “slot” M(i, c) for
that minhash value.
22
Implementation – (2)
initialize every M(i, c) to ∞
for each row r
    for each column c
        if c has 1 in row r
            for each hash function h_i do
                if h_i(r) is a smaller value than M(i, c) then
                    M(i, c) := h_i(r);
23
Example
Row   C1   C2
1     1    0
2     0    1
3     1    1
4     1    0
5     0    1

h(x) = x mod 5
g(x) = (2x + 1) mod 5

After row   Hash values            Sig1 (h, g)   Sig2 (h, g)
1           h(1) = 1, g(1) = 3     (1, 3)        (-, -)
2           h(2) = 2, g(2) = 0     (1, 3)        (2, 0)
3           h(3) = 3, g(3) = 2     (1, 2)        (2, 0)
4           h(4) = 4, g(4) = 4     (1, 2)        (2, 0)
5           h(5) = 0, g(5) = 1     (1, 2)        (0, 0)
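The row-by-row algorithm can be sketched in Python and run on the example above. The data layout (a list of row numbers paired with bit lists) and the function name are assumptions of this sketch; slots start at infinity so that any hash value replaces them.

```python
import math


def minhash_by_rows(rows, hash_funcs):
    """rows: list of (row_number, [bit for each column]).

    Returns M, where M[i][c] is the minhash of column c under hash_funcs[i].
    """
    num_cols = len(rows[0][1])
    M = [[math.inf] * num_cols for _ in hash_funcs]
    for r, bits in rows:
        values = [h(r) for h in hash_funcs]  # each h_i(r) computed once per row
        for c, bit in enumerate(bits):
            if bit == 1:
                for i, v in enumerate(values):
                    if v < M[i][c]:
                        M[i][c] = v
    return M


rows = [(1, [1, 0]), (2, [0, 1]), (3, [1, 1]), (4, [1, 0]), (5, [0, 1])]
h = lambda x: x % 5
g = lambda x: (2 * x + 1) % 5
M = minhash_by_rows(rows, [h, g])
print(M)  # [[1, 0], [2, 0]]: Sig1 = (1, 2), Sig2 = (0, 0)
```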
24
Implementation – (3)
Often, data is given by column, not
row.
• E.g., columns = documents, rows =
shingles.
If so, sort matrix once so it is by row.
And always compute h_i(r) only once
for each row.
25
Locality-Sensitive Hashing
The All-Pairs Problem
Banding of Signature Matrices
Other LSH Techniques
26
Finding Similar Sets
We can use minhashing to replace sets
(columns of the matrix) by short lists of
integers.
But we still need to compare each pair
of signatures.
Example: 20 million Amazon customers;
2*10^14 pairs of customers to evaluate.
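The pair count is the number of ways to choose 2 customers out of 20 million, which a short check confirms:

```python
import math

n = 20_000_000
pairs = math.comb(n, 2)  # n * (n - 1) / 2
print(pairs)  # 199999990000000, about 2 * 10**14
```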
27
Locality-Sensitive Hashing
• What we want seems impossible. Map
signatures to buckets so that:
1. Two similar signatures have a very good
chance of appearing in the same bucket.
2. If two signatures are not very similar,
they probably don’t appear in one bucket.
• Then, we only have to compare
bucket-mates (candidate pairs).
28
LSH for Signatures
Think of the signature for each column
(set) as a column of the signature
matrix S.
Divide the rows of S into b bands of r
rows each.
29
Partition Into Bands – (1)
[Figure: signature matrix S, in which each column is one signature, divided into b bands of r rows per band.]
30
Partition into Bands – (2)
For each band, hash its portion of each
column to a hash table with many buckets.
Candidate column pairs are those that hash
to the same bucket for ≥ 1 band.
Tune b and r to catch most similar pairs,
but few nonsimilar pairs.
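The banding step might be sketched as follows. As a simplifying assumption, the band's values are used directly as a dictionary key, so two columns share a bucket only when the band is identical; a real hash table with a fixed number of buckets could also produce accidental collisions. The names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures, b, r):
    """signatures: dict mapping a column name to a list of b * r integers.

    Returns the set of name pairs that share a bucket in at least one band.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # a separate hash table for each band
        for name, sig in signatures.items():
            if len(sig) != b * r:
                raise ValueError("every signature needs b * r values")
            buckets[tuple(sig[band * r:(band + 1) * r])].append(name)
        for members in buckets.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates


# Hypothetical signatures with b = 2 bands of r = 2 rows each.
sigs = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 7, 7, 7]}
print(lsh_candidates(sigs, b=2, r=2))  # {('A', 'B')}
```

A and B become a candidate pair because their first bands agree, even though their second bands differ.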
31
[Figure: matrix S divided into b bands of r rows each; each band's portion of a column is hashed to buckets.]
32
Analysis of LSH – What We Want
[Figure: ideal step function. Horizontal axis: similarity s of two sets, with threshold t marked. Vertical axis: probability of sharing a bucket. No chance if s < t; probability = 1 if s > t.]
33
What One Band of One Row Gives You
Remember: probability of equal hash-values = similarity.
[Figure: probability of sharing a bucket plotted against similarity s of two sets, with threshold t marked.]
34
What b Bands of r Rows Gives You
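[Figure: S-curve of the probability of sharing a bucket plotted against similarity s of two sets, with threshold t ~ (1/b)^(1/r).]
s^r = probability that all rows of a band are equal.
1 - s^r = probability that some row of a band is unequal.
(1 - s^r)^b = probability that no bands are identical.
1 - (1 - s^r)^b = probability that at least one band is identical.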
35
Example: b = 20; r = 5
s      1 - (1 - s^r)^b
.2     .006
.3     .047
.4     .186
.5     .470
.6     .802
.7     .975
.8     .9996
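The table values can be recomputed from the formula 1 - (1 - s^r)^b, along with the approximate threshold (1/b)^(1/r). The function name is an assumption of this sketch.

```python
def prob_candidate(s, r, b):
    """Probability that two sets of similarity s share at least one bucket."""
    return 1 - (1 - s ** r) ** b


r, b = 5, 20
for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {prob_candidate(s, r, b):.4f}")

threshold = (1 / b) ** (1 / r)
print(f"approximate threshold t: {threshold:.3f}")
```

For b = 20 and r = 5 the threshold comes out near 0.55, which matches where the table values rise most steeply.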
36
Summary of Minhash/LSH
1. Represent the objects you are
comparing by sets (e.g., shingling).
2. Represent the sets by signatures
(minhashing).
3. Use LSH to create buckets; candidate
pairs are those in the same bucket.
4. Evaluate only the candidate pairs.
37
Application: LSH for Fingerprints
Place a grid on a fingerprint.
• Normalize so identical prints will overlap.
Set of grid points where minutiae are
located represents the fingerprint.
• Possibly, treat minutiae near a grid
boundary as if also present in adjacent grid
points.
38
Discretizing Minutiae
[Figure: grid over a fingerprint. A minutia is located in one grid square, close to a boundary; maybe pretend it is in the adjacent square also.]
39
Applying LSH to Fingerprints
We could minhash the bit-vectors to
obtain signatures.
But since there probably aren’t too
many grid points, we can work from the
bit-vectors directly.
40
LSH/Fingerprints – (2)
Pick 100 (?) sets of 3 (?) grid points,
randomly.
For each set of three grid points, those
prints that have 1 for all three points
are placed in a bucket.
• All pairs in this bucket are candidates.
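A rough sketch of this bucketing scheme, assuming each print is stored as the set of grid-point indices where it has a minutia (equivalently, the positions of the 1s in its bit-vector). The function and parameter names are illustrative.

```python
import random
from itertools import combinations


def fingerprint_candidates(prints, num_grid_points, num_sets=100, set_size=3, seed=0):
    """prints: dict mapping a print name to its set of grid-point indices.

    For each random set of grid points, prints with a minutia at all of
    them share a bucket; every pair within a bucket is a candidate.
    """
    rng = random.Random(seed)
    candidates = set()
    for _ in range(num_sets):
        points = rng.sample(range(num_grid_points), set_size)
        bucket = [name for name, grid in prints.items()
                  if all(p in grid for p in points)]
        candidates.update(combinations(sorted(bucket), 2))
    return candidates


# Hypothetical prints on a 10-point grid.
prints = {"p1": set(range(10)), "p2": set(range(10)), "p3": set()}
print(fingerprint_candidates(prints, num_grid_points=10))  # {('p1', 'p2')}
```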
41
Application: Matching
Customer Records
I once took a consulting job solving the
following problem:
• Company A agreed to solicit customers for
Company B, for a fee.
• They then argued over how many
customers.
• Neither recorded exactly which customers
were involved.
42
Customer Records – (2)
Company B had about 1 million records
of all its customers.
Company A had about 1 million records
describing customers, some of which it
had signed up for B.
Records had name, address, and
phone, but for various reasons, they
could be different for the same person.
43
Customer Records – (3)
Step 1: Design a measure (“score”) of
how similar records are:
• E.g., deduct points for small misspellings
(“Jeffrey” vs. “Geoffery”) or same phone
with different area code.
Step 2: Score all pairs of records;
report high scores as matches.
44
Customer Records – (4)
Problem: (1 million)^2 is too many pairs
of records to score.
Solution: A simple LSH.
• Three hash functions: exact values of name,
address, phone.
  • Compare iff records are identical in at least one.
• Misses similar records with a small
difference in all three fields.
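A sketch of this simple LSH, bucketing on the exact value of each field in turn. For brevity it pairs records within a single list rather than across the two companies' files, and the field names and sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records):
    """records: list of dicts with 'name', 'address', and 'phone' keys.

    Returns index pairs (i, j), i < j, identical in at least one field.
    """
    candidates = set()
    for field in ("name", "address", "phone"):
        buckets = defaultdict(list)
        for idx, rec in enumerate(records):
            buckets[rec[field]].append(idx)
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates


# Hypothetical records for illustration only.
records = [
    {"name": "Jeffrey Smith", "address": "1 Main St", "phone": "555-0100"},
    {"name": "Geoffery Smith", "address": "1 Main St", "phone": "555-0199"},
    {"name": "Jeffrey Smith", "address": "9 Oak Ave", "phone": "555-0142"},
    {"name": "Ann Lee", "address": "2 Elm Rd", "phone": "555-0123"},
]
print(sorted(candidate_pairs(records)))  # [(0, 1), (0, 2)]
```

Records 1 and 2 differ in all three fields, so they are never compared, which is the kind of miss noted above.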
45
Aside: Validation of Results
We were able to tell what values of the
scoring function were reliable in an
interesting way.
• Identical records had an average creation-date
difference of 10 days.
• We only looked for records created within
90 days of each other, so bogus matches had a 45-day
average.
46
Validation – (2)
By looking at the pool of matches with
a fixed score, we could compute the
average time-difference, say x, and
deduce that fraction (45 - x)/35 of them
were valid matches.
Alas, the lawyers didn’t think the jury
would understand.
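The fraction follows from treating the pool as a mixture: if a fraction f of the matches are valid (average difference 10 days) and the rest are bogus (average 45 days), then x = 10f + 45(1 - f), so f = (45 - x)/35. A short sketch, with an assumed function name:

```python
def valid_fraction(x, valid_avg=10.0, bogus_avg=45.0):
    """Fraction f of valid matches in a pool with average difference x.

    Solves x = f * valid_avg + (1 - f) * bogus_avg for f.
    """
    return (bogus_avg - x) / (bogus_avg - valid_avg)


print(valid_fraction(10.0))  # 1.0: every match valid
print(valid_fraction(27.5))  # 0.5: half valid
print(valid_fraction(45.0))  # 0.0: every match bogus
```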
47
The End
Thanks for Listening
48