Transcript Slides
Spectral Approaches to Nearest Neighbor Search
arXiv:1408.0751
Robert Krauthgamer (Weizmann Institute)
Joint with: Amirali Abdullah, Alexandr Andoni, Ravi Kannan
Les Houches, January 2015

Nearest Neighbor Search (NNS)
- Preprocess: a set P of n points in ℝ^d
- Query: given a query point q, report a point p* ∈ P with the smallest distance to q (a brute-force baseline is sketched below)

[figure: a query point q and its nearest neighbor p* among the points of P]

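A minimal baseline for this definition, assuming Euclidean distance and points stored as a NumPy array; the function and variable names are illustrative, not from the talk.

```python
import numpy as np

def linear_scan_nn(P: np.ndarray, q: np.ndarray) -> int:
    """Return the index of the exact nearest neighbor of q in P (O(n*d) time)."""
    # Squared Euclidean distances from q to every point in P.
    dists = np.sum((P - q) ** 2, axis=1)
    return int(np.argmin(dists))

# Example usage on random data.
rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 64))   # n = 1000 points in d = 64 dimensions
q = rng.normal(size=64)
print(linear_scan_nn(P, q))
```
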
Motivation
- Generic setup:
  - points model objects (e.g. images)
  - distance models a (dis)similarity measure
- Application areas:
  - machine learning: k-NN rule
  - signal processing, vector quantization, bioinformatics, etc.
- Distance can be:
  - Hamming, Euclidean, edit distance, earth-mover distance, ...

[figure: two binary vectors (e.g. 011100 vs 001100) compared coordinate-wise, illustrating the Hamming distance between a query q and a point p*]

Curse of Dimensionality
- All exact algorithms degrade rapidly with the dimension d:

  Algorithm                   Query time      Space
  Full indexing               O(log n · d)    n^{O(d)} (Voronoi diagram size)
  No indexing (linear scan)   O(n · d)        O(n · d)

Approximate NNS
- Given a query point q, report p′ ∈ P s.t. ||p′ − q|| ≤ c · min_{p∈P} ||p − q||
  - c ≥ 1 : approximation factor
  - randomized: return such a p′ with probability ≥ 90%
- Heuristic perspective: gives a set of candidates (hopefully small)

[figure: a query q, its exact nearest neighbor p*, and an approximate answer p′]

NNS algorithms
It's all about space partitions!
- Low-dimensional:
  [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], ...
- High-dimensional:
  [Indyk-Motwani'98], [Kushilevitz-Ostrovsky-Rabani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [Andoni-Indyk'06], [Andoni-Indyk-Nguyen-Razenshteyn'14], [Andoni-Razenshteyn'15]

Low-dimensional
- kd-trees, ...
- c = 1 + ε
- runtime: ε^{-O(d)} · log n

High-dimensional
- Locality-Sensitive Hashing
- Crucial use of random projections
  - Johnson-Lindenstrauss Lemma: project onto a random subspace of dimension O(ε⁻² log n) for a 1 + ε approximation (a projection sketch follows below)
- Runtime: n^{1/c} for c-approximation

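A minimal sketch of the random-projection step that the Johnson-Lindenstrauss lemma licenses; the target dimension m = O(ε⁻² log n) is as stated above, while the constant 8 and the Gaussian construction are illustrative assumptions.

```python
import numpy as np

def jl_project(P, eps, rng=None):
    """Project n points in d dims to m = O(eps^-2 log n) dims with a random Gaussian matrix;
    pairwise distances are preserved up to a 1 +/- eps factor with high probability."""
    rng = rng or np.random.default_rng()
    n, d = P.shape
    m = int(np.ceil(8 * np.log(n) / eps ** 2))   # the constant 8 is an illustrative choice
    A = rng.normal(size=(d, m)) / np.sqrt(m)     # scaled Gaussian projection matrix
    return P @ A, A

rng = np.random.default_rng(1)
P = rng.normal(size=(500, 10_000))
P_low, A = jl_project(P, eps=0.5, rng=rng)
print(P_low.shape)                               # (500, m) with m ~ eps^-2 log n
```
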
Practice
Data-aware partitions
- optimize the partition to your dataset
  - PCA-tree [Sproull'91, McNames'01, Verma-Kpotufe-Dasgupta'09]
  - randomized kd-trees [Silpa-Anan-Hartley'08, Muja-Lowe'09]
  - spectral/PCA/semantic/WTA hashing [Weiss-Torralba-Fergus'08, Wang-Kumar-Chang'09, Salakhutdinov-Hinton'09, Yagnik-Strelow-Ross-Lin'11]

Practice vs Theory
- Data-aware projections often outperform (vanilla) random-projection methods
- But: no guarantees (correctness or performance)
- JL is generally optimal [Alon'03, Jayram-Woodruff'11]
  - even for some NNS setups! [Andoni-Indyk-Patrascu'06]

Why do data-aware projections outperform random projections?
Is there an algorithmic framework to study this phenomenon?

Plan for the rest
- Model
- Two spectral algorithms
- Conclusion

Our model
- "low-dimensional signal + large noise", inside a high-dimensional space
- Signal: P ⊂ S for a subspace S ⊂ ℝ^d of dimension k ≪ d
- Data: each point is perturbed by full-dimensional Gaussian noise N_d(0, σ²I_d)

[figure: points of P lying in the k-dimensional subspace S, each perturbed off the subspace by noise]

Model properties
- Data: P̃ = P + G
  - points in P have at least unit norm
- Query: q̃ = q + g_q, s.t.:
  - ||q − p*|| ≤ 1 for the "nearest neighbor" p*
  - ||q − p|| ≥ 1 + ε for everybody else
- Noise entries N(0, σ²), with σ ≈ 1/d^{1/4} (up to a factor poly(ε⁻¹ k log n)); a data-generation sketch follows below
- Claim: the exact nearest neighbor is still the same
- The noise is large:
  - it has magnitude σ√d ≈ d^{1/4} ≫ 1
  - the top k dimensions of q capture only a sub-constant fraction of its mass
  - JL would not work: after the noise, the gap is very close to 1

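To make the model concrete, here is a small generator for instances of this form; the specific sizes, the coordinate distribution, and the choice σ = d^{-1/4} are illustrative assumptions matching the slide's scaling.

```python
import numpy as np

def make_instance(n=2000, d=4096, k=10, sigma=None, rng=None):
    """Signal: n points in a random k-dim subspace S of R^d (norm >= 1);
    data: each point perturbed by full-dimensional Gaussian noise N(0, sigma^2 I_d)."""
    rng = rng or np.random.default_rng(0)
    sigma = sigma if sigma is not None else d ** -0.25    # noise magnitude ~ sigma*sqrt(d) = d^{1/4}
    # Random k-dimensional subspace S, spanned by the orthonormal columns of B (d x k).
    B, _ = np.linalg.qr(rng.normal(size=(d, k)))
    coords = rng.uniform(1.0, 2.0, size=(n, k))           # signal coordinates, at least unit norm
    P = coords @ B.T                                       # signal points, lying in S
    G = rng.normal(scale=sigma, size=(n, d))               # full-dimensional Gaussian noise
    return P, P + G, B

P, P_noisy, B = make_instance()
print(np.linalg.norm(P_noisy - P, axis=1).mean())  # ~ sigma*sqrt(d) = d^{1/4} ~ 8
```
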
Algorithms via PCA
- Find the "signal subspace" S?
  - then we can project everything onto S and solve NNS there (an SVD sketch follows below)
- Use Principal Component Analysis (PCA)?
  - ≈ extract the top direction(s) from the SVD
  - e.g., the k-dimensional subspace U that minimizes Σ_{p∈P} d²(p, U)
- If PCA removes the noise "perfectly", we are done:
  - U = S
  - can reduce to k-dimensional NNS
- NNS performance as if we were in k dimensions, for the full model?
  - this is the best we can hope for: the dataset may contain a "worst-case" k-dimensional instance
  - i.e., a reduction from dimension d to k
  - Spoiler: Yes

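A minimal sketch of the "project onto a top-k PCA subspace and search there" idea via the SVD; this is the naive version that the next slide shows can fail under noise, not the paper's algorithm.

```python
import numpy as np

def top_k_pca_subspace(X: np.ndarray, k: int) -> np.ndarray:
    """Return a d x k matrix whose orthonormal columns span the top-k right-singular subspace of X,
    i.e. the k-dimensional subspace (through the origin) minimizing the sum of squared distances."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T

def pca_then_scan(X: np.ndarray, q: np.ndarray, k: int) -> int:
    """Project data and query onto the top-k PCA subspace, then do exact NNS there."""
    U = top_k_pca_subspace(X, k)
    Xk, qk = X @ U, q @ U
    return int(np.argmin(np.sum((Xk - qk) ** 2, axis=1)))

# Example usage on random data.
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 128))
q = rng.normal(size=128)
print(pca_then_scan(X, q, k=10))
```
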
PCA under noise fails
- Does PCA find the "signal subspace" S under noise?
- No!
  - PCA minimizes Σ_{p∈P̃} d²(p, U)
  - good only on "average", not in the "worst case"
  - weak signal directions get overpowered by noise directions
  - a typical noise direction contributes Σ_{i=1}^n g_i² = Θ(nσ²) (a back-of-the-envelope comparison follows below)

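A back-of-the-envelope comparison of the per-direction squared mass, written out under the slide's scaling σ ≈ d^{-1/4}; this is a sketch of the intuition, not a quoted calculation from the paper.

```latex
% Squared mass captured by a single direction, summed over the n data points:
\[
  \text{one noise direction: } \sum_{i=1}^{n} g_i^2 \;\approx\; n\sigma^2 \;=\; \Theta\!\left(\frac{n}{\sqrt{d}}\right),
  \qquad
  \text{an important signal direction: } \;\Omega\!\left(\frac{n}{k}\right).
\]
% A weak signal direction may carry much less than n/sqrt(d) of the squared mass,
% so the least-squares objective can prefer a pure-noise direction over it.
```
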
1st Algorithm: intuition
- Extract the "well-captured points"
  - points whose signal lies mostly inside the top PCA space
  - should work for a large fraction of the points
- Iterate on the rest

Iterative PCA
• Find the top PCA subspace U
• C = points well-captured by U
• Build an NNS data structure on {C projected onto U}
• Iterate on the remaining points, P ∖ C
Query: query each NNS data structure separately

- To make this work:
  - nearly no noise in U: ensure U is close to S
    - U is determined by the heavy-enough spectral directions (its dimension may be less than k)
  - capture only points whose signal lies fully inside U
    - well-captured: the distance to U is explained by noise only
(an illustrative code sketch of the loop follows below)

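An illustrative implementation of this loop, under simplifying assumptions: the "heavy enough" cutoff, the capture tolerance, and the inner linear scan stand in for the paper's sampled PCA estimate and its low-dimensional NNS data structure.

```python
import numpy as np

def top_subspace(X, k):
    """Orthonormal basis (d x k') of the heavy spectral directions of X, with k' <= k."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    heavy = s >= 0.5 * s[0]                       # illustrative "heavy enough" cutoff
    return Vt[heavy][:k].T

def build_iterative_pca(X, k, capture_tol):
    """Return a list of (basis U, point indices, projected points): one NNS instance per iteration."""
    idx = np.arange(len(X))
    structures = []
    while len(idx) > 0:
        U = top_subspace(X[idx], k)
        resid = np.linalg.norm(X[idx] - X[idx] @ U @ U.T, axis=1)
        captured = resid <= capture_tol           # distance to U explained by noise only
        if not captured.any():
            captured[:] = True                    # fallback: capture the rest
        structures.append((U, idx[captured], X[idx[captured]] @ U))
        idx = idx[~captured]
    return structures

def query_iterative_pca(structures, q):
    """Query every per-iteration structure (here: a linear scan in the projected space)."""
    best, best_dist = None, np.inf
    for U, ids, Xk in structures:
        d2 = np.sum((Xk - q @ U) ** 2, axis=1)
        j = int(np.argmin(d2))
        if d2[j] < best_dist:
            best, best_dist = ids[j], d2[j]
    return best
```
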
Simpler model
- Assume: small noise
  - p̃_i = p_i + α_i, where ||α_i|| ≤ α ≪ ε
  - the noise can even be adversarial
- Algorithm:
  • Find the top-k PCA subspace U
  • C = points well-captured by U
    - well-captured if d(p̃, U) ≤ 2α
  • Build NNS on {C projected onto U}
  • Iterate on the remaining points, P ∖ C
  Query: query each NNS separately
- Claim 1: if p* is captured in C, we will find it via that NNS
  - for any captured p̃: ||p̃_U − q̃_U|| = ||p̃ − q̃|| ± 4α = ||p − q|| ± 5α
- Claim 2: the number of iterations is O(log n)
  - Σ_{p̃∈P̃} d²(p̃, U) ≤ Σ_{p̃∈P̃} d²(p̃, S) ≤ n · α²
  - so for at most a 1/4-fraction of points, d²(p̃, U) ≥ 4α²
  - hence a constant fraction is captured in each iteration (the Markov step is spelled out below)

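The step behind Claim 2 is a Markov-type argument; spelled out here, consistent with the slide's inequalities but not quoted from the paper.

```latex
% Each signal point p_i lies in S within distance ||alpha_i|| <= alpha of \tilde{p}_i,
% and U is the best-fit k-dimensional subspace for \tilde{P}, hence
\[
  \sum_{\tilde p \in \tilde P} d^2(\tilde p, U) \;\le\; \sum_{\tilde p \in \tilde P} d^2(\tilde p, S) \;\le\; n\,\alpha^2 .
\]
% By Markov's inequality at most a 1/4-fraction of the points can have d^2(\tilde p, U) >= 4 alpha^2,
% so at least 3/4 of the remaining points satisfy d(\tilde p, U) <= 2 alpha and are captured
% in each iteration; all points are therefore captured after O(log n) iterations.
```
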
Analysis of general model
- Need to use the randomness of the noise
- Want to say that the "signal" is stronger than the "noise" (on average)
- Use random matrix theory:
  - P̃ = P + G
  - G is a random n × d matrix with entries N(0, σ²)
    - all its squared singular values satisfy σ_i(G)² ≤ σ²n ≈ n/√d
  - P has rank ≤ k and (Frobenius norm)² ≥ n
    - important signal directions have σ_i(P)² ≥ Ω(n/k)
    - directions with σ_i(P)² ≪ εn/k can be ignored
  - Important signal directions are stronger than the noise! (a numerical check follows below)

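A quick numerical sanity check of these two scales; the sizes n ≫ d ≫ k and the choice σ = d^{-1/4} are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 20_000, 256, 5
sigma = d ** -0.25                                   # noise std from the model, sigma^2 = 1/sqrt(d)

B, _ = np.linalg.qr(rng.normal(size=(d, k)))         # orthonormal basis of a random signal subspace
P = rng.normal(size=(n, k)) @ B.T                    # rank-k signal, each direction carries ~n mass
G = rng.normal(scale=sigma, size=(n, d))             # Gaussian noise matrix

noise_sq = np.linalg.svd(G, compute_uv=False)[0] ** 2
signal_sq = np.linalg.svd(P, compute_uv=False)[:k] ** 2

print(noise_sq, n / np.sqrt(d))    # top noise sv^2: ~ sigma^2*n = n/sqrt(d) up to constants (n >> d)
print(signal_sq, n / k)            # signal sv^2's: >= Omega(n/k), well above the noise level
```
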
Closeness of subspaces?
- Trickier than singular values
  - the top singular vector is not stable under perturbation!
  - it is only stable if the second singular value is much smaller
- How to even define "closeness" of subspaces?
- To the rescue: Wedin's sin-theta theorem
  - sin θ(S, U) = max_{x∈S, ||x||=1} min_{y∈U} ||x − y||

[figure: subspaces S and U and the angle θ between them]

Wedin's sin-theta theorem
- Developed by [Davis-Kahan'70], [Wedin'72]
- Theorem: consider P̃ = P + G, where
  - U is the top-k subspace of P̃
  - S is the k-dimensional space containing P
  - then: sin θ(S, U) ≤ ||G|| / σ_k(P)
- Another way to see why we need to take only directions with sufficiently heavy singular values (a numerical illustration follows below)

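A small numerical illustration of the theorem on a synthetic instance; sin θ is computed from the largest principal angle between the two subspaces, and the instance parameters are illustrative assumptions.

```python
import numpy as np

def sin_theta(S_basis: np.ndarray, U_basis: np.ndarray) -> float:
    """sin of the largest principal angle between the column spaces of two orthonormal bases."""
    # Cosines of the principal angles are the singular values of S^T U; take the smallest.
    cos_min = np.linalg.svd(S_basis.T @ U_basis, compute_uv=False)[-1]
    return float(np.sqrt(max(0.0, 1.0 - cos_min ** 2)))

rng = np.random.default_rng(3)
n, d, k = 20_000, 256, 5
sigma = d ** -0.25

S, _ = np.linalg.qr(rng.normal(size=(d, k)))             # true signal subspace
P = rng.normal(size=(n, k)) @ S.T                         # signal matrix (rank k, lies in S)
G = rng.normal(scale=sigma, size=(n, d))                  # noise
U = np.linalg.svd(P + G, full_matrices=False)[2][:k].T    # top-k subspace of the noisy data

bound = np.linalg.svd(G, compute_uv=False)[0] / np.linalg.svd(P, compute_uv=False)[k - 1]
print(sin_theta(S, U), "<=", bound)                       # Wedin: sin(theta) <= ||G|| / sigma_k(P)
```
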
Additional issue: Conditioning
- After an iteration, the noise is not random anymore!
  - non-captured points might be "biased" by the capturing criterion
- Fix: estimate the top PCA subspace from a small sample of the data
  - this requirement might be purely an artifact of the analysis
  - but it does not sound like a bad idea in practice either

Performance of Iterative PCA
- Can prove there are O(√d · log n) iterations
- In each, we have an NNS instance in a ≤ k-dimensional space
- Overall query time: O((1/ε)^k · k · √d · log^{3/2} n)
- Reduced to O(√d · log n) instances of k-dimensional NNS!

2nd Algorithm: PCA-tree
Closer to algorithms used in practice.
• Find the top PCA direction v
• Partition into slabs ⊥ v
• Snap points to the ⊥ hyperplanes
• Recurse on each slice
Query:
• follow all tree paths that may contain p*

[figure: the point set partitioned into slabs perpendicular to v, slab width ≈ ε/k]

2 algorithmic modifications
• Find the top PCA direction v
• Partition into slabs ⊥ v
• Snap points to the ⊥ hyperplanes
• Recurse on each slice
Query:
• follow all tree paths that may contain p*

- Centering (see the construction sketch after this slide):
  - need to use centered PCA (subtract the average)
  - otherwise errors from the perturbations accumulate
- Sparsification:
  - need to sparsify the set of points in each node of the tree
  - otherwise we can get a "dense" cluster:
    - not enough variance in the signal
    - lots of noise

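An illustrative recursive construction along these lines; the slab width, the stopping rule, and the omission of the sparsification step and of the query routine are simplifying assumptions rather than the paper's exact procedure.

```python
import numpy as np

class PCATreeNode:
    def __init__(self, points, ids, depth=0, slab_width=0.5, max_depth=10, min_size=32):
        self.ids = ids
        self.children = []
        if depth >= max_depth or len(points) <= min_size:
            self.points, self.v, self.center = points, None, None   # leaf: store points as-is
            return
        # Centering: subtract the average before extracting the top PCA direction.
        self.center = points.mean(axis=0)
        X = points - self.center
        self.v = np.linalg.svd(X, full_matrices=False)[2][0]        # top PCA direction
        proj = X @ self.v
        # Partition into slabs perpendicular to v and snap each point to its slab's mid-hyperplane.
        slab = np.floor(proj / slab_width).astype(int)
        self.slabs = {}
        for s in np.unique(slab):
            mask = slab == s
            snapped = points[mask] - np.outer(proj[mask] - (s + 0.5) * slab_width, self.v)
            self.slabs[s] = (s + 0.5) * slab_width   # snap offset (a query routine would use these)
            self.children.append(PCATreeNode(snapped, ids[mask], depth + 1,
                                             slab_width, max_depth, min_size))

# Usage: build on noisy data; a query would descend into every slab whose
# hyperplane might still contain its nearest neighbor (not shown here).
rng = np.random.default_rng(4)
data = rng.normal(size=(2000, 64))
tree = PCATreeNode(data, np.arange(len(data)))
```

Centering is done per node before extracting the top direction, matching the first modification above.
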
Analysis
- An "extreme" version of the Iterative PCA algorithm:
  - just use the top PCA direction: it is guaranteed to carry signal!
- Main lemma: the tree depth is ≤ 2k
  - because each discovered direction is close to S
  - snapping acts like orthogonalizing with respect to each such direction
  - so there cannot be too many of them
- Query runtime: O((k/ε)^{2k})
- Overall performs like O(k · log k)-dimensional NNS!

Wrap-up
Why do data-aware projections outperform random projections?
Is there an algorithmic framework to study this phenomenon?

- Here:
  - Model: "low-dimensional signal + large noise"
  - like NNS in a low-dimensional space
  - via the "right" adaptation of PCA
- Immediate questions:
  - other, less-structured signal/noise models?
  - algorithms with runtime dependent on the spectrum?
- Broader question: an analysis that explains empirical success?