
Dimension Reduction in the Hamming Cube (and its Applications)
Rafail Ostrovsky
UCLA
PLAN
• Problem formulations
• Communication complexity game
• What really happened? (dimension reduction)
• Solutions to 2 problems
  – ANN
  – k-clustering
• What's next?
Problem statements
• Johnson-Lindenstrauss lemma: n points in a high-dimensional Hilbert space can be embedded into an O(log n)-dimensional subspace with small distortion. (A sketch of this classical step follows below.)
• Q: how do we do it for the Hamming cube?
• (We show how to avoid the impossibility result of [Charikar-Sahai].)
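For intuition only, here is a minimal numpy sketch (mine, not from the talk) of the Euclidean Johnson-Lindenstrauss step: a random Gaussian projection; the function name and parameters are illustrative.

```python
# A minimal sketch (not from the talk) of the Euclidean Johnson-Lindenstrauss step:
# a random Gaussian projection; function name and parameters are illustrative.
import numpy as np

def jl_project(points, target_dim, seed=0):
    """Map n points in R^d to R^target_dim with a random Gaussian linear map."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    # Entries N(0, 1/target_dim), so squared lengths are preserved in expectation.
    proj = rng.normal(0.0, 1.0 / np.sqrt(target_dim), size=(d, target_dim))
    return points @ proj

# Tiny demo: pairwise distances survive the projection up to small distortion (w.h.p.).
pts = np.random.default_rng(1).normal(size=(50, 10_000))
low = jl_project(pts, target_dim=500)
i, j = 3, 7
print(np.linalg.norm(pts[i] - pts[j]), np.linalg.norm(low[i] - low[j]))
```

The rest of the talk asks for an analogue of such a map when the points live in {0,1}^n and the relevant distance is the Hamming distance.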
Many different formulations of ANN
• ANN – "approximate nearest neighbor search"
• (Many applications in computational geometry, biology/stringology, IR, and other areas.)
• Here are different formulations:
Approximate Searching
• Motivation: given a DB of "names" and a user with a "target" name, find whether any of the DB names are "close" to the target name, without doing a linear scan.
  DB: Jon, Alice, Bob, Eve, McVeigh, Kate, Fred
  Query: MacVeigh ?
Geometric formulation
• Nearest Neighbor Search (NNS): given N blue points (and a distance function, say Euclidean distance in R^d), store all these points somehow.
Data structure question
• Given a new red point, find the closest blue point.
• Naive solution 1: store the blue points "as is" and, when given a red point, measure distances to all blue points.
• Q: can we do better?
Can we do better?
• Easy in small dimensions (Voronoi diagrams)
• "Curse of dimensionality" in high dimensions…
• [KOR]: can get a good "approximate" solution efficiently!
Hamming Cube Formulation for ANN
• Given a DB of N blue n-bit strings, process them somehow. Given an n-bit red string, find its ANN in the hypercube {0,1}^n.
  DB: 00101011, 01011001, 11101001, 10110110, 11010101, 11011000, 10101010, 10101111
  Query: 11010100
• Naïve solution 2: pre-compute all (exponentially many) answers (we want small data structures!)
Clustering problem that I'll discuss in detail
• k-clustering
An example of Clustering – find "centers"
• Given N points in R^d
A clustering formulation
• Find cluster "centers"
Clustering formulation
• The "cost" is the sum of distances
Main technique
• First, as a communication game
• Second, interpreted as a dimension reduction
COMMUNICATION COMPLEXITY GAME
• Given two players, Alice and Bob:
• Alice is secretly given a string x
• Bob is secretly given a string y
• They want to estimate the Hamming distance between x and y with small communication (and small error), provided that they have common randomness.
• How can they do it? (say |x| = |y| = n)
• Much easier: how do we check that x = y? (see the sketch below)
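As a warm-up, here is a minimal sketch (mine, not from the talk) of the easier equality question with shared randomness: both players XOR the bits of their string selected by the same random subsets and compare the resulting parity bits.

```python
# A warm-up sketch (mine, not from the talk): equality testing with shared randomness.
# Each shared random subset contributes one parity bit; a single subset already
# catches x != y with probability 1/2, so t subsets fail with probability 2**-t.
import random

def parity_fingerprint(bits, subsets):
    """XOR together the bits selected by each shared random subset."""
    return [sum(bits[i] for i in s) % 2 for s in subsets]

def probably_equal(x, y, t=40, seed=1234):
    rng = random.Random(seed)            # stands in for the common random string
    n = len(x)
    subsets = [[i for i in range(n) if rng.random() < 0.5] for _ in range(t)]
    return parity_fingerprint(x, subsets) == parity_fingerprint(y, subsets)

x = [1, 0, 1, 1, 0, 0, 1, 0]
y = list(x); y[5] ^= 1                   # flip one bit
print(probably_equal(x, x), probably_equal(x, y))   # True, then False (w.h.p.)
```

The Hamming-distance protocol on the next slides uses the same kind of shared random parities, only with a biased (probability 1/(2L)) choice of coordinates.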
Main lemma: an abstract game
• How can Alice and Bob estimate the Hamming distance between X and Y with small CC?
• We assume Alice and Bob share randomness.
  ALICE holds X1X2X3X4…Xn        BOB holds Y1Y2Y3Y4…Yn
A simpler question
• To estimate the Hamming distance between X and Y (within a factor (1+ε)) with small CC, it is sufficient for Alice and Bob, for any L, to be able to distinguish:
  – H(X,Y) ≤ L
  OR
  – H(X,Y) > (1+ε)L
• Q: why does sampling coordinates not work? (For small L, a uniform sample is very unlikely to hit any of the few differing coordinates.)
  ALICE holds X1X2X3X4…Xn        BOB holds Y1Y2Y3Y4…Yn
Alice and Bob pick the SAME n-bit blue R
• Each bit of R is 1 independently with probability 1/(2L).
  [Figure: Alice XORs the bits of X in the positions where R is 1 and outputs a single bit 0/1; Bob does the same with Y. A small simulation follows below.]
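A minimal simulation of this one-bit test (my sketch, not code from the talk): if each bit of R is 1 with probability 1/(2L), the two parity bits disagree with probability (1 − (1 − 1/L)^{H(X,Y)})/2, so even a single R gives a small gap between the two cases to distinguish.

```python
# A minimal simulation (mine) of the one-bit test: with Pr[R_i = 1] = 1/(2L),
# the two parity bits disagree with probability (1 - (1 - 1/L)**H(X,Y)) / 2.
import random

def one_bit_test(x, y, L, rng):
    """One round: pick R, then compare Alice's and Bob's parities of the selected bits."""
    r = [1 if rng.random() < 1.0 / (2 * L) else 0 for _ in range(len(x))]
    alice = sum(xi for xi, ri in zip(x, r) if ri) % 2   # XOR of the selected X_i
    bob = sum(yi for yi, ri in zip(y, r) if ri) % 2     # XOR of the selected Y_i
    return alice != bob

def disagreement_rate(hamming_dist, n, L, trials=4000, seed=0):
    rng = random.Random(seed)
    x = [0] * n
    y = [1] * hamming_dist + [0] * (n - hamming_dist)   # H(x, y) = hamming_dist
    return sum(one_bit_test(x, y, L, rng) for _ in range(trials)) / trials

# The two cases to distinguish, with L = 20 and eps = 0.5:
print(disagreement_rate(20, 400, L=20))   # H <= L          -> about 0.32
print(disagreement_rate(30, 400, L=20))   # H > (1+eps) L   -> about 0.39
```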
What is the difference in probabilities?
• Compare the two cases: H(X,Y) ≤ L and H(X,Y) > (1+ε)L.
• The two output bits disagree exactly when R selects an odd number of differing positions, which happens with probability (1 − (1 − 1/L)^{H(X,Y)})/2: roughly (1 − e^{-1})/2 in the first case and about (1 − e^{-(1+ε)})/2 in the second, an Ω(ε) gap.
  [Same picture as before: one parity bit from X and one from Y, per shared R.]
How do we amplify?
How do we amplify? Repeat with many independent R's, all from the same distribution!
A refined game with small communication
• How can Alice and Bob distinguish
  – H(X,Y) ≤ L   OR
  – H(X,Y) > (1+ε)L ?
• Pick O((1/ε²) log N) R's with the correct distribution.
• ALICE (holding X1X2X3X4…Xn): for each R, compute the XOR of the selected subset of the Xi.
• BOB (holding Y1Y2Y3Y4…Yn): for each R, compute the XOR of the same subset of the Yi.
• Compare the outputs of this linear transformation. (A sketch of the whole protocol follows below.)
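A runnable sketch of this refined game (the constant in m and the exact acceptance threshold are my choices for illustration): both parties apply the same random GF(2)-linear map of m ≈ (1/ε²) log N parities, and the fraction of disagreeing output bits is compared against a threshold placed between the two expected rates.

```python
# A runnable sketch (mine) of the refined game; the constant in m and the exact
# threshold are illustrative choices, not the ones from the talk.
import math
import random

def make_Rs(n, L, m, seed):
    """The shared randomness: m index sets, each index kept with probability 1/(2L)."""
    rng = random.Random(seed)
    return [[i for i in range(n) if rng.random() < 1.0 / (2 * L)] for _ in range(m)]

def sketch(bits, Rs):
    """The random GF(2)-linear map: one parity bit per R."""
    return [sum(bits[i] for i in R) % 2 for R in Rs]

def looks_close(x, y, L, eps, N, seed=0):
    """Accept iff the sketches suggest H(x, y) <= L rather than H(x, y) > (1 + eps) * L."""
    n = len(x)
    m = int(8 / eps**2 * math.log(max(N, 2)))          # O((1/eps^2) log N) repetitions
    Rs = make_Rs(n, L, m, seed)
    disagree = sum(a != b for a, b in zip(sketch(x, Rs), sketch(y, Rs))) / m
    p_low = (1 - (1 - 1 / L) ** L) / 2                  # expected rate at H = L
    p_high = (1 - (1 - 1 / L) ** ((1 + eps) * L)) / 2   # expected rate at H = (1+eps) L
    return disagree <= (p_low + p_high) / 2             # threshold halfway in between

n, L, eps, N = 1000, 50, 0.5, 10**6
x = [0] * n
y_close = [1] * 40 + [0] * (n - 40)                     # H = 40  <= L
y_far = [1] * 120 + [0] * (n - 120)                     # H = 120 >  (1+eps) L
print(looks_close(x, y_close, L, eps, N), looks_close(x, y_far, L, eps, N))  # True False (w.h.p.)
```

Alice only needs to send her m parity bits, so the communication is O((1/ε²) log N) bits rather than n.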
Dimension Reduction in the Hamming Cube [OR]
• For each L, we can pick O(log N) R's and boost the probabilities!
• Key property: we get an embedding from the large cube into a small cube that preserves ranges around L very well.
Dimension Reduction in the Hamming Cube [OR]
• For each L, we can pick O(log N) R's and boost the probabilities!
• Key property: we get an embedding from the large cube into a small cube that preserves ranges around L.
• Key idea in applications: we can build an inverse lookup table for the small cube!
Applications
• Applications of the dimension reduction in the Hamming cube:
  – ANN in the Hamming cube and in R^d
  – k-clustering
Application to ANN in the Hamming Cube
• For each possible L, build a "small cube" and project the original DB into the small cube.
• Pre-compute an inverse table for each entry of the small cube. (A toy sketch follows below.)
• Why is this efficient?
• How do we answer any query?
• How do we navigate between different L?
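A toy sketch of this data structure for one fixed L (the function names, the choice of m, and the radius threshold are mine; the real construction balances the number of R's against the error probability): project every DB string into a small cube of m parity bits and pre-compute, for every point of the small cube, which DB strings project nearby, so a query is answered by a single lookup.

```python
# A toy sketch (mine) of the small-cube inverse table for one fixed L; the choice
# of m and of the radius threshold are illustrative, not the talk's parameters.
import itertools
import random

def make_Rs(n, L, m, seed=0):
    rng = random.Random(seed)
    return [[i for i in range(n) if rng.random() < 1.0 / (2 * L)] for _ in range(m)]

def project(bits, Rs):
    """Image of an n-bit string in the small cube {0,1}^m: one parity per R."""
    return tuple(sum(bits[i] for i in R) % 2 for R in Rs)

def build_inverse_table(db, L, m=14, seed=0):
    """For every point z of the small cube, pre-compute which DB strings project
    within the 'close' radius of z, so that a query is a single lookup."""
    Rs = make_Rs(len(db[0]), L, m, seed)
    p_close = (1 - (1 - 1 / L) ** L) / 2        # expected disagreement rate at H = L
    p_far = (1 - (1 - 1 / L) ** (2 * L)) / 2    # ... and at H = 2L (i.e. eps = 1)
    radius = m * (p_close + p_far) / 2
    images = [project(s, Rs) for s in db]
    table = {}
    for z in itertools.product((0, 1), repeat=m):   # 2^m entries: poly(N) when m = O(log N)
        table[z] = [i for i, im in enumerate(images)
                    if sum(a != b for a, b in zip(z, im)) <= radius]
    return Rs, table

def query(q, Rs, table):
    """Candidate DB indices whose distance to q is (w.h.p.) not much more than L."""
    return table[project(q, Rs)]

rng = random.Random(1)
db = [[rng.randint(0, 1) for _ in range(200)] for _ in range(20)]
q = list(db[7]); q[0] ^= 1; q[5] ^= 1               # a query 2 bit-flips away from db[7]
Rs, table = build_inverse_table(db, L=10)
print(7 in query(q, Rs, table), query(q, Rs, table))  # True; a few false candidates may appear
```

With m = Θ((1/ε²) log N) repetitions the false candidates disappear with high probability; here m is kept tiny so the 2^m-entry table stays small.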
Putting it all together: user's private approximate search from a DB
• Each projection is O(log N) R's. The user picks many such projections for each L-range; that defines all the embeddings.
• Now, the DB builds inverse lookup tables for each projection, as new DBs, one per L.
• The user can now "project" its query into the small cube and use binary search on L.
MAIN THM [KOR]
• We can build a poly-size data structure that answers ANN queries on high-dimensional data in time polynomial in d and poly-logarithmic in N:
  – for the Hamming cube
  – for l_1
  – for l_2
  – for the square of the Euclidean distance
• [IM] obtained a similar result with a slightly weaker guarantee.
Dealing with R^d
• Project onto random lines, choose "cut" points… (see the sketch below)
• Well, not exactly… we need "navigation".
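One folklore way to realize the "random lines and cut points" idea (a rough numpy sketch of the general principle, mine; not necessarily the exact [KOR] reduction, and with no "navigation" across scales): project each point onto random unit directions and threshold each projection against random cut points, so that the Hamming distance between the resulting bit strings grows roughly linearly with the Euclidean distance at the scale set by the cut density.

```python
# A rough numpy sketch (mine) of the folklore "random lines + cut points" idea;
# not necessarily the exact [KOR] reduction, and with no "navigation" across scales.
import numpy as np

def embed_to_cube(points, num_dirs=200, cuts_per_dir=8, scale=10.0, seed=0):
    """Project onto random unit directions and threshold against random cut points.
    The Hamming distance between codes grows roughly linearly with the Euclidean
    distance, at the scale set by the density of cut points."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    dirs = rng.normal(size=(d, num_dirs))
    dirs /= np.linalg.norm(dirs, axis=0)                 # random unit directions
    proj = points @ dirs                                  # (n, num_dirs) projections
    cuts = rng.uniform(-scale, scale, size=(num_dirs, cuts_per_dir))
    bits = (proj[:, :, None] > cuts[None, :, :])          # bit = 1 iff above the cut
    return bits.reshape(len(points), -1).astype(np.uint8)

# Points at Euclidean distances 1, 2, 4 from a base point get roughly proportional
# Hamming distances between their codes.
rng = np.random.default_rng(1)
direction = rng.normal(size=50); direction /= np.linalg.norm(direction)
pts = np.stack([t * direction for t in (0.0, 1.0, 2.0, 4.0)])
codes = embed_to_cube(pts)
for i in (1, 2, 3):
    print(np.linalg.norm(pts[0] - pts[i]), int(np.sum(codes[0] != codes[i])))
```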
Clustering
• Huge number of applications (IR, data mining, analysis of statistical data, biology, automatic taxonomy formation, the web, topic-specific data collections, etc.)
• Two independent issues:
  – Representation of data
  – Forming "clusters" (many incomparable methods)
Representation of data: examples
• Latent semantic indexing yields points in R^d with the l_2 distance (distance indicating similarity).
• The min-wise permutation approach (Broder et al.) yields points in the Hamming metric.
• Many other representations from the IR literature lead to other metrics, including the edit-distance metric on strings.
• Recent news: we show (Ostrovsky-Rabani 04) that we can embed the edit-distance metric into l_1 with small distortion: distortion = exp(sqrt(log n · log log n)).
Geometric clustering: examples
• Min-sum clustering in R^d: form clusters such that the sum of intra-cluster distances is minimized.
• k-clustering: pick k "centers" in the ambient space; the cost is the sum of distances from each data point to the closest center.
• Agglomerative clustering (form clusters below some distance threshold).
• Q: which is better?
Methods are (in general) incomparable
Min-SUM
2-Clustering
A k-clustering problem: notation
• N – number of points
• d – dimension
• k – number of centers
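In this notation, the cost from the earlier k-clustering definition is simply the sum over the N points of the (Euclidean) distance to the nearest of the k centers; a small numpy helper (the function name is mine):

```python
# A small helper (mine) for the cost in this notation: the sum over the N points
# of the Euclidean distance to the nearest of the k centers.
import numpy as np

def k_clustering_cost(points, centers):
    """points: (N, d), centers: (k, d); returns sum_i min_j ||points[i] - centers[j]||."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float(dists.min(axis=1).sum())

pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
print(k_clustering_cost(pts, np.array([[0.5, 0.0], [10.5, 0.0]])))   # 4 * 0.5 = 2.0
```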
About k-clustering
• When k is fixed, this is easy for small d.
• [Kleinberg, Papadimitriou, Raghavan]: NP-complete for k = 2 on the cube.
• [Drineas, Frieze, Kannan, Vempala, Vinay]: NP-complete for R^d with the square of the Euclidean distance.
• When k is not fixed, this is facility location (Euclidean k-median).
• For fixed d but growing k, a PTAS was given by [Arora, Raghavan, Rao] (using dynamic programming).
• (This talk): a PTAS for fixed k and arbitrary d.
Common tools in geometric PTAS
• Dynamic programming
• Sampling [Schulman, AS, DLVK]
• [DFKVV] use SVD
• Embeddings/dimension reduction seem useless because:
  – there are too many candidate centers
  – they may introduce new centers
Our k-clustering result
• A PTAS for fixed k, for:
  – the Hamming cube {0,1}^d
  – l_1^d
  – l_2^d (Euclidean distance)
  – the square of the Euclidean distance
Main ideas
• For 2-clustering, finding a good partition is as good as solving the problem.
• Switch to the cube.
• Try partitions in the embedded low-dimensional data set.
• Given a partition, compute the centers and the cost in the original data set.
• The embedding/dimension reduction is used to reduce the number of partitions.
Stronger property of [OR] dimension reduction
• Our random linear transformation preserves ranges!
THE ALGORITHM
The algorithm yet again
• Guess the 2-center distance
• Map to the small cube
• Partition in the small cube
• Measure the partition in the big cube
• THM: gets within (1+ε) of optimal. (A toy rendering of these steps follows below.)
• Disclaimer: a PTAS is (almost never) practical; this shows "feasibility" only, and more ideas are needed for a practical solution.
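A toy rendering of these four steps for 2-clustering with squared-Euclidean cost (a brute-force illustration of the structure, not the actual PTAS: the real algorithm only tries the poly-many partitions suggested by the dimension-reduced data, whereas this sketch tries them all): candidate partitions are generated on a "small" side, but the centers and the cost are always evaluated on the original points.

```python
# A toy rendering (mine) of the four steps for 2-clustering with squared-Euclidean
# cost.  It is a brute-force stand-in: the real PTAS only tries the poly-many
# partitions suggested by the dimension-reduced ("small cube") data, while this
# sketch tries every partition of a tiny data set.
import itertools
import numpy as np

def cost_of_partition(points, labels):
    """Steps 3-4: given a partition, the optimal centers are the centroids, and the
    cost is always measured on the original ("big cube" / R^d) data."""
    total = 0.0
    for c in (0, 1):
        cluster = points[labels == c]
        if len(cluster) == 0:
            continue
        center = cluster.mean(axis=0)                  # optimal center for squared l2
        total += float(((cluster - center) ** 2).sum())
    return total

def two_clustering_brute(points):
    """Stand-in for 'partition in the small cube': enumerate candidate partitions,
    score each one in the original space, keep the best."""
    best = (np.inf, None)
    for mask in itertools.product((0, 1), repeat=len(points)):
        best = min(best, (cost_of_partition(points, np.array(mask)), mask))
    return best

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]])
cost, labels = two_clustering_brute(pts)
print(cost, labels)   # separates the two obvious groups
```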
Dealing with k > 2
• The apex of a tournament is a node of maximum out-degree.
• Fact: the apex has a path of length at most 2 to every node.
• Every point is assigned to an apex of the center "tournaments" (see the sketch below):
  – Guess all (k choose 2) center distances
  – Embed into (k choose 2) small cubes
  – Guess the center projections in the small cubes
  – For every point and every pair of centers, the "tournament" records which center is closer in the projection
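A small sketch of the tournament step for one data point (mine, with hypothetical names): given, for each pair of candidate centers, which one looks closer in the projections, assign the point to the apex, i.e. a maximum out-degree node (a "king").

```python
# A small sketch (mine, with hypothetical names) of the tournament step for one data
# point: given, for each pair of candidate centers, which one looks closer in the
# projections, assign the point to the apex (a maximum out-degree node, a "king").
def apex_of_tournament(k, closer):
    """closer(a, b) -> True iff center a beats center b for this point.
    The apex reaches every other node by a path of length at most 2."""
    wins = [sum(1 for b in range(k) if b != a and closer(a, b)) for a in range(k)]
    return max(range(k), key=lambda a: wins[a])

# Toy tournament on 4 centers: a beats b iff (a, b) is listed.
beats = {(0, 1), (2, 0), (1, 2), (0, 3), (1, 3), (2, 3)}
print(apex_of_tournament(4, lambda a, b: (a, b) in beats))   # centers 0, 1, 2 tie; prints 0
```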
Conclusions
• Dimension reduction in the cube allows us to deal with a huge number of "incomparable" attributes.
• Embeddings of other metrics into the cube allow fast ANN for those metrics.
• Real applications still require considerable additional ideas.
• A fun area to work in!