Dimension Reduction in the
Hamming Cube
(and its Applications)
Rafail Ostrovsky
UCLA
http://www.cs.ucla.edu/~rafail/
PLAN
Problem Formulations
Communication complexity game
What really happened? (dimension
reduction)
Solutions to 2 problems
– ANN
– k-clustering
What’s next?
Problem statements
Johnson-Lindenstrauss lemma: n points in a high-dimensional Hilbert space can be embedded into an O(log n)-dimensional subspace with small distortion.
Q: how do we do it for the Hamming cube?
(we show how to avoid the impossibility result of [Charikar-Sahai])
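Not on the slide: a minimal numpy sketch of one standard JL construction (a scaled Gaussian projection); the constant in the target dimension is illustrative.

```python
import numpy as np

def jl_project(points, eps=0.5, seed=0):
    """Embed n points from R^d into k = O(log(n)/eps^2) dimensions via a
    random Gaussian matrix; pairwise distances are preserved up to a
    (1 +/- eps) factor with high probability."""
    n, d = points.shape
    k = int(np.ceil(4 * np.log(n) / eps**2))  # constant 4 is illustrative
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(d, k)) / np.sqrt(k)  # scaled random projection
    return points @ R

# Example: 100 points in 10,000 dimensions drop to ~74 dimensions.
X = np.random.default_rng(1).normal(size=(100, 10_000))
Y = jl_project(X, eps=0.5)
```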
Many different formulations of ANN
ANN – “approximate nearest neighbor search”
(many applications in computational
geometry, biology/stringology, IR, other areas)
Here are different formulations:
Approximate Searching
Motivation: given a DB of “names” and a user with a
“target” name, find whether any of the DB names are “close”
to the target name, without doing a linear scan.
DB names: Jon, Alice, Bob, Eve, McVeigh, Kate, Fred
Target: MacVeigh ?
Geometric formulation
Nearest Neighbor Search (NNS): given N blue
points (and a distance function, say Euclidean
distance in R^d), store all these points somehow
Data structure question
Given a new red point, find the closest blue point.
Naive solution 1: store the blue points “as is” and,
when given a red point, measure distances to all
blue points.
Q: can we do better?
Can we do better?
Easy in small dimensions (Voronoi diagrams)
“Curse of dimensionality” in high dimensions…
[KOR]: Can get a good “approximate” solution
efficiently!
Hamming Cube Formulation for ANN
Given a DB of N blue n-bit strings, process
them somehow. Given an n-bit red string, find its
ANN in the hypercube {0,1}^n
00101011
01011001
11101001
10110110
11010101
11011000
10101010
10101111
11010100
Naïve solution 2: pre-compute all (exponentially
many) answers (we want small data structures!)
Clustering problem that I’ll discuss in detail
K-clustering
An example of Clustering – find “centers”
Given N points in R^d
A clustering formulation
Find cluster “centers”
Clustering formulation
The “cost” is the sum of distances
Main technique
First, as a communication game
Second, interpreted as a dimension reduction
COMMUNICATION COMPLEXITY GAME
Given two players, Alice and Bob:
Alice is secretly given a string x
Bob is secretly given a string y
They want to estimate the Hamming distance
between x and y with small communication
(and small error), given that they share
common randomness.
How can they do it? (say |x| = |y| = n)
Much easier: how do we check that x = y?
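A warm-up sketch for the equality question (my code, not from the talk): with shared randomness, comparing random parities answers “x = y?” with one bit per round, since a uniform mask separates unequal strings with probability exactly 1/2.

```python
import random

def parity(bits, mask):
    # inner product mod 2 of a bit string with a 0/1 mask
    return sum(b & m for b, m in zip(bits, mask)) % 2

def equality_test(x, y, reps=40, seed=0):
    """Shared-randomness equality test for equal-length bit strings.
    If x != y, each round catches the difference with probability 1/2,
    so the error probability after `reps` rounds is 2**-reps."""
    rng = random.Random(seed)  # stands in for the shared random tape
    for _ in range(reps):
        mask = [rng.randrange(2) for _ in range(len(x))]
        if parity(x, mask) != parity(y, mask):
            return False  # certainly unequal
    return True  # equal, except with probability 2**-reps
```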
Main lemma: an abstract game
How can Alice and Bob estimate the Hamming distance between X
and Y with small CC?
We assume Alice and Bob share randomness.
ALICE holds X1 X2 X3 X4 … Xn
BOB holds Y1 Y2 Y3 Y4 … Yn
A simpler question
To estimate the Hamming distance between X and Y
(within a factor of (1+ε)) with small CC, it suffices that
Alice and Bob can, for any L, distinguish:
– H(X,Y) <= L
OR
– H(X,Y) > (1+ε)L
Q: why doesn't sampling work?
ALICE holds X1 X2 X3 X4 … Xn
BOB holds Y1 Y2 Y3 Y4 … Yn
Alice and Bob pick the SAME random n-bit blue string R,
where each bit of R is 1 independently with probability 1/(2L)
[Figure: Alice XORs together the bits of X selected by R, Bob XORs
the same positions of Y; each side outputs a single bit 0/1.]
What is the difference in probabilities?
– H(X,Y) <= L
versus
– H(X,Y) > (1+ε)L
[Figure: the same single-bit XOR experiment, applied to X and to Y.]
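Worked out (a derivation implicit in the slide): each of the h = H(X,Y) differing coordinates enters R independently with probability p = 1/(2L), and the two output bits disagree exactly when an odd number of differing coordinates is selected, so

```latex
\Pr[\text{output bits differ}]
  \;=\; \frac{1-(1-2p)^{h}}{2}
  \;=\; \frac{1-\left(1-\tfrac{1}{L}\right)^{h}}{2}.
```

For h <= L this is at most (1-(1-1/L)^L)/2 ≈ (1-e^{-1})/2, while for h > (1+ε)L it is at least roughly (1-e^{-(1+ε)})/2, so the two cases are separated by a gap of Ω(ε).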
How do we amplify?
[Figure: the same single-bit XOR experiment; one R alone gives
only a small constant gap in probabilities.]
How do we amplify? Repeat with many
independent R's drawn from the same distribution!
[Figure: the single-bit XOR experiment repeated with independent R's.]
A refined game with small communication
How can Alice and Bob distinguish:
– H(X,Y) <= L
OR
– H(X,Y) > (1+ε)L
Shared: pick (1/ε²)·log N R's with the correct distribution.
ALICE: for each R, XORs the subset of X1 X2 … Xn selected by R.
BOB: for each R, XORs the same subset of Y1 Y2 … Yn.
They compare the outputs of this linear transformation.
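A minimal end-to-end sketch of this refined game in Python (my code, not the paper's; the repetition constant and the acceptance threshold are illustrative):

```python
import math
import random

def make_masks(n, L, N, eps, rng):
    """Shared randomness: O(log(N)/eps^2) masks, each bit 1 w.p. 1/(2L)."""
    m = math.ceil(8 * math.log(max(N, 2)) / eps**2)  # constant 8 is illustrative
    p = 1.0 / (2 * L)
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(m)]

def sketch(bits, masks):
    """Each party sends one parity bit per mask (a linear transformation)."""
    return [sum(b & r for b, r in zip(bits, mask)) % 2 for mask in masks]

def looks_close(x, y, L, N=1000, eps=0.5, seed=0):
    """Accept iff the empirical disagreement rate falls below the midpoint
    of the per-mask probabilities at h = L and h = (1+eps)L."""
    rng = random.Random(seed)  # stands in for the common randomness
    masks = make_masks(len(x), L, N, eps, rng)
    sx, sy = sketch(x, masks), sketch(y, masks)
    rate = sum(a != b for a, b in zip(sx, sy)) / len(masks)
    lo = (1 - (1 - 1 / L) ** L) / 2                 # rate when H(x,y) = L
    hi = (1 - (1 - 1 / L) ** ((1 + eps) * L)) / 2   # rate at H = (1+eps)L
    return rate < (lo + hi) / 2
```

Alice sends her parity bits and Bob compares them with his own, so the communication is O((1/ε²) log N) bits regardless of n.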
Dimension Reduction in the Hamming Cube [OR]
For each L, we can pick O(log N) R's
and boost the probabilities!
Key Property: we get an embedding
from the large cube into a small cube
that preserves ranges around L very well.
Key idea in applications: we can build
an inverse lookup table for the small cube!
Applications
Applications of the dimension reduction in the Hamming cube:
For ANN in the Hamming cube and R^d
For k-clustering
Application to ANN in the Hamming Cube
For each possible L, build a “small cube” and
project the original DB into it
Pre-compute an inverse table for each entry of
the small cube.
Why is this efficient?
How do we answer any query?
How do we navigate between different L?
Putting it all together:
User's private approximate search on a DB
Each projection is O(log N) R's. The user picks many
such projections for each L-range. That defines
all the embeddings.
Now the DB builds inverse lookup tables for each
projection, as new DB's for each L.
The user can now “project” its query into the small cube
and use binary search on L
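A hedged sketch of the inverse-table construction for one value of L (my simplification: the real [KOR] construction controls errors far more carefully; the helpers and the naive nearest-projection rule below are stand-ins). One such table is built per L, and a query binary-searches over L:

```python
import itertools
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def project(bits, masks):
    """Map an n-bit string to the m-bit small cube via parity masks."""
    return tuple(sum(b & r for b, r in zip(bits, mask)) % 2 for mask in masks)

def build_table(db, L, m, seed=0):
    """Project the DB to an m-bit cube (m = O(log N), so 2^m = poly(N)) and
    pre-compute an answer for EVERY point of the small cube."""
    rng = random.Random(seed)
    p = 1.0 / (2 * L)
    masks = [[1 if rng.random() < p else 0 for _ in range(len(db[0]))]
             for _ in range(m)]
    images = {project(x, masks): x for x in db}  # collisions overwritten (sketch!)
    table = {pt: images[min(images, key=lambda q: hamming(q, pt))]
             for pt in itertools.product([0, 1], repeat=m)}
    return masks, table

def query(q, masks, table):
    """Answer a query by projecting it and reading the pre-computed entry."""
    return table[project(q, masks)]
```

This is why it is efficient: the table has 2^m = poly(N) entries, and a query costs only one projection plus one lookup.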
MAIN THM [KOR]
Can build a poly-size data structure to do ANN
for high-dimensional data in time polynomial in
d and poly-log in N:
– for the Hamming cube
– L_1
– L_2
– square of the Euclidean distance
[IM] had a similar result, with a slightly weaker
guarantee.
Dealing with R^d
Project to random lines, choose “cut” points…
Well, not exactly… we need “navigation”
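A hedged sketch of the “random lines and cut points” idea (illustrative only; as the slide says, the real reduction also needs navigation across distance scales):

```python
import numpy as np

def l2_to_hamming(points, n_bits=256, seed=0):
    """Map points in R^d to {0,1}^n_bits: for each output bit, project onto
    a random direction g and compare to a random cut point t. Two points
    disagree on a bit with probability proportional to |g.(x - y)|, so the
    expected Hamming distance grows with the l2 distance."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    G = rng.normal(size=(d, n_bits))                 # random lines
    proj = points @ G                                # positions along each line
    t = rng.uniform(proj.min(), proj.max(), n_bits)  # random cut points
    return (proj > t).astype(np.uint8)
```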
Clustering
Huge number of applications (IR,
mining, analysis of stat data, biology,
automatic taxonomy formation, web,
topic-specific data-collections, etc.)
Two independent issues:
– Representation of data
– Forming “clusters” (many incomparable
methods)
Representation of data examples
Latent semantic indexing yields points in R^d
with l_2 distance (distance indicating similarity)
Min-wise permutations (Broder et al.) yield
points in the Hamming metric
Many other representations from the IR literature
lead to other metrics, including the edit-distance
metric on strings
Recent news: we show (Ostrovsky-Rabani 04)
that we can embed the edit-distance metric into l_1
with small distortion: distortion = exp(sqrt(log n · log log n))
Geometric Clustering: examples
Min-sum clustering in R^d: form clusters s.t. the
sum of intra-cluster distances is minimized
k-clustering: pick k “centers” in the ambient
space. The cost is the sum of distances from
each data point to the closest center
Agglomerative clustering (form clusters below
some distance threshold)
Q: which is better?
Methods are (in general) incomparable
Min-SUM
2-Clustering
A k-clustering problem: notation
N – number of points
d – dimension
k – number of centers
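With the notation fixed, the objective is a one-liner (numpy assumed):

```python
import numpy as np

def k_clustering_cost(points, centers):
    """k-clustering cost: sum over the N points of the distance to the
    closest of the k centers. points: (N, d); centers: (k, d)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.min(axis=1).sum()
```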
About k-clustering
When k is fixed, this is easy for small d
[Kleinberg, Papadimitriou, Raghavan]: NP-complete
for k=2 for the cube
[Drineas, Frieze, Kannan, Vempala, Vinay]: NP-complete
for R^d for the square of the Euclidean distance
When k is not fixed, this is facility location (Euclidean k-median)
For fixed d but growing k, a PTAS was given by [Arora,
Raghavan, Rao] (using dynamic programming)
(This talk): a PTAS for fixed k, arbitrary d
Common tools in geometric PTAS
Dynamic programming
Sampling [Schulman, AS, DLVK]
[DFKVV] use SVD
Embeddings/dimension reduction seem
useless because
– Too many candidate centers
– May introduce new centers
Our k-clustering result
A PTAS for fixed k:
– Hamming cube {0,1}^d
– l_1^d
– l_2^d (Euclidean distance)
– square of the Euclidean distance
Main ideas
For 2-clustering, finding a good partition is as
good as solving the problem
Switch to the cube
Try partitions in the embedded low-dimensional data set
Given a partition, compute centers and cost in
the original data set
Embedding/dimension reduction is used to reduce the
number of partitions
Stronger property of [OR] dimension reduction
Our random linear transformation preserves
ranges!
THE ALGORITHM
The algorithm yet again
Guess the 2-center distance
Map to the small cube
Partition in the small cube
Measure the partition in the big cube
THM: gets within (1+ε) of optimal.
Disclaimer: a PTAS is (almost) never practical;
this shows feasibility only, and more ideas are
needed for a practical solution.
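A heavily simplified skeleton of those four steps for k = 2 on Hamming data (hedged: the real PTAS does not enumerate all small-cube partitions as this brute force does; per the disclaimer, feasibility only):

```python
import itertools
import random

def majority_center(cluster):
    """In the Hamming cube, the optimal center of a cluster is the
    coordinatewise majority vector."""
    n = len(cluster[0])
    return tuple(int(2 * sum(x[i] for x in cluster) >= len(cluster))
                 for i in range(n))

def two_clustering(db, L, m=3, seed=0):
    """Guess L, map to an m-bit cube, try every 2-coloring of the small-cube
    images, and measure each induced partition back in the big cube."""
    rng = random.Random(seed)
    p = 1.0 / (2 * L)
    masks = [[1 if rng.random() < p else 0 for _ in range(len(db[0]))]
             for _ in range(m)]
    pairs = [(x, tuple(sum(b & r for b, r in zip(x, mk)) % 2 for mk in masks))
             for x in db]
    images = sorted({im for _, im in pairs})
    best = None
    for side in itertools.product([0, 1], repeat=len(images)):
        label = dict(zip(images, side))
        parts = [[x for x, im in pairs if label[im] == s] for s in (0, 1)]
        if not parts[0] or not parts[1]:
            continue  # both clusters must be non-empty
        cost = sum(sum(a != b for a, b in zip(x, majority_center(part)))
                   for part in parts for x in part)
        best = cost if best is None else min(best, cost)
    return best  # the real algorithm also minimizes over guesses of L
```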
Dealing with k>2
The apex of a tournament is a node of maximum out-degree
Fact: the apex has a path of length at most 2 to every node
Every point is assigned to the apex of its center
“tournament” (see the sketch after this list):
– Guess all (k choose 2) center distances
– Embed into (k choose 2) small cubes
– Guess the center projections in the small cubes
– For every point and every pair of centers, define a
“tournament” recording which center is closer in the projection
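The apex fact is easy to state in code (a small illustrative sketch; `beats[i][j]` is a hypothetical 0/1 adjacency matrix meaning “i beats j”):

```python
def apex(beats):
    """A node of maximum out-degree in a tournament; such a node reaches
    every other node by a path of length at most 2 (a 'king')."""
    n = len(beats)
    return max(range(n), key=lambda i: sum(beats[i][j] for j in range(n)))

def reaches_within_two(beats, a):
    """Verify the fact for node a."""
    n = len(beats)
    return all(beats[a][j] or any(beats[a][w] and beats[w][j] for w in range(n))
               for j in range(n) if j != a)
```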
Conclusions
Dimension reduction in the
cube lets us deal with a
huge number of
“incomparable” attributes.
Embeddings of other metrics
into the cube allow fast ANN
for those metrics.
Real applications still require
considerable additional ideas.
Fun area to work in!