Transcript Slides

Spectral Approaches to
Nearest Neighbor Search
arXiv:1408.0751
Robert Krauthgamer (Weizmann Institute)
Joint with: Amirali Abdullah, Alexandr Andoni,
Ravi Kannan
Les Houches, January 2015
Nearest Neighbor Search (NNS)

Preprocess: a set P of n points in R^d

Query: given a query point q, report a point p* ∈ P with the smallest distance to q
π‘βˆ—
π‘ž
Motivation

Generic setup:
  Points model objects (e.g. images)
  Distance models (dis)similarity measure

Application areas:
  machine learning: k-NN rule
  signal processing, vector quantization, bioinformatics, etc.

Distance can be:
  Hamming, Euclidean, edit distance, earth-mover distance, …

[Figure: two columns of binary strings illustrating Hamming distance, with a query q and its nearest neighbor p*]

Curse of Dimensionality

All exact algorithms degrade rapidly with the dimension d

  Algorithm                  | Query time    | Space
  Full indexing              | O(log n · d)  | n^{O(d)}  (Voronoi diagram size)
  No indexing – linear scan  | O(n · d)      | O(n · d)

Approximate NNS

Given a query point q, report p′ ∈ P s.t.
  ||p′ − q|| ≤ c · min_{p ∈ P} ||p − q||

  c ≥ 1 : approximation factor
  randomized: return such p′ with probability ≥ 90%

Heuristic perspective: gives a set of candidates (hopefully small)

NNS algorithms
It's all about space partitions!

Low-dimensional
  [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98],
  [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], …

High-dimensional
  [Indyk-Motwani'98], [Kushilevitz-Ostrovsky-Rabani'98], [Indyk'98, '01],
  [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04],
  [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [Andoni-Indyk'06],
  [Andoni-Indyk-Nguyen-Razenshteyn'14], [Andoni-Razenshteyn'15]

Low-dimensional
kd-trees, …
  c = 1 + ε
  runtime: ε^{−O(d)} · log n

High-dimensional
Locality-Sensitive Hashing
  Crucial use of random projections

Johnson-Lindenstrauss Lemma: project to a random subspace of
dimension O(ε^{−2} log n) for 1 + ε approximation

Runtime: n^{1/c} for c-approximation
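
As a rough illustration of the random-projection idea (a generic JL-style sketch, not the LSH scheme itself; the sizes and the constant inside m are illustrative choices, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 1000, 2000, 0.2
P = rng.normal(size=(n, d))                    # n data points in R^d

# project to a random subspace of dimension m = O(eps^-2 log n)
m = int(np.ceil(8 * np.log(n) / eps**2))       # the constant 8 is an illustrative choice
A = rng.normal(size=(m, d)) / np.sqrt(m)       # random Gaussian projection
P_low = P @ A.T                                # projected points in R^m

# pairwise distances are preserved up to roughly a (1 +/- eps) factor
pairs = rng.choice(n, size=(5, 2), replace=False)
orig = np.linalg.norm(P[pairs[:, 0]] - P[pairs[:, 1]], axis=1)
proj = np.linalg.norm(P_low[pairs[:, 0]] - P_low[pairs[:, 1]], axis=1)
print(np.round(proj / orig, 3))                # ratios close to 1
```
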
Practice
Data-aware partitions
  optimize the partition to your dataset
  PCA-tree [Sproull'91, McNames'01, Verma-Kpotufe-Dasgupta'09]
  randomized kd-trees [Silpa-Anan-Hartley'08, Muja-Lowe'09]
  spectral/PCA/semantic/WTA hashing [Weiss-Torralba-Fergus'08, Wang-Kumar-Chang'09,
  Salakhutdinov-Hinton'09, Yagnik-Strelow-Ross-Lin'11]

Practice vs Theory

Data-aware projections often outperform (vanilla) random-projection methods
But: no guarantees (correctness or performance)

JL is generally optimal [Alon'03, Jayram-Woodruff'11]
  even for some NNS setups! [Andoni-Indyk-Patrascu'06]

Why do data-aware projections outperform random projections?
Algorithmic framework to study this phenomenon?

Plan for the rest

Model
Two spectral algorithms
Conclusion

Our model

"low-dimensional signal + large noise"
  inside a high-dimensional space

Signal: P ⊂ U for a subspace U ⊂ R^d of dimension k ≪ d
Data: each point is perturbed by full-dimensional Gaussian noise N_d(0, σ² I_d)
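
As a concrete reference point, here is a minimal numpy sketch of one way to generate a synthetic instance of this model; the sizes n, d, k and the noise level σ ≈ 1/d^{1/4} (taken from the next slide) are illustrative choices, not parameters from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 500, 4096, 10
sigma = d ** -0.25                          # noise level sigma ~ 1/d^{1/4} (next slide)

# random k-dimensional signal subspace U inside R^d (orthonormal columns)
U, _ = np.linalg.qr(rng.normal(size=(d, k)))

# signal points inside U, rescaled to have at least unit norm (assumption on the next slide)
C = rng.normal(size=(n, k))
norms = np.linalg.norm(C, axis=1, keepdims=True)
C = C / norms * np.maximum(1.0, norms)
P = C @ U.T                                 # n x d signal matrix, lies exactly in U

# observed data: each point perturbed by full-dimensional Gaussian noise N_d(0, sigma^2 I_d)
G = sigma * rng.normal(size=(n, d))
P_tilde = P + G

print("typical noise magnitude:", np.linalg.norm(G, axis=1).mean())   # ~ sigma*sqrt(d) = d^{1/4}
print("typical signal norm:    ", np.linalg.norm(P, axis=1).mean())
```
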
Model properties

Data: P̃ = P + G
Query: q̃ = q + g_q, s.t.:
  points in P have at least unit norm
  ||q − p*|| ≤ 1 for the "nearest neighbor" p*
  ||q − p|| ≥ 1 + ε for everybody else
Noise entries are N(0, σ²) with
  σ ≈ 1/d^{1/4}, up to a factor poly(ε⁻¹ k log n)

Claim: the exact nearest neighbor is still the same

Noise is large:
  it has magnitude σ√d ≈ d^{1/4} ≫ 1
  the top k dimensions capture only a sub-constant mass
  JL would not work: after the noise, the gap is very close to 1

Algorithms via PCA

Find the "signal subspace" U?
  then we can project everything to U and solve NNS there

Use Principal Component Analysis (PCA)?
  ≈ extract top direction(s) from the SVD
  e.g., the k-dimensional space S that minimizes Σ_{p∈P} d²(p, S)

If PCA removes the noise "perfectly", we are done:
  S = U
  Can reduce to k-dimensional NNS
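
A minimal sketch of this PCA step on synthetic data drawn roughly from the model above (the sizes and the signal strength are made-up values; as the next slides explain, the recovered S need not be close to U once some signal directions are weak):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 500, 2048, 10
U, _ = np.linalg.qr(rng.normal(size=(d, k)))          # true signal subspace
P_tilde = 2.0 * rng.normal(size=(n, k)) @ U.T + (d ** -0.25) * rng.normal(size=(n, d))

# top-k PCA subspace S of the data = span of the top-k right singular vectors
_, _, Vt = np.linalg.svd(P_tilde, full_matrices=False)
S = Vt[:k].T                                          # d x k, orthonormal columns

# project the data (and, at query time, the query) onto S; then solve NNS in k dims
P_proj = P_tilde @ S                                  # n x k coordinates inside S

# how well does S align with the true U?  (1.0 = perfect)
print("alignment:", np.linalg.norm(U.T @ S) ** 2 / k)  # typically high for this easy instance
```
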
NNS performance as if we were in k dimensions, for the full model?

Best we can hope for:
  the dataset contains a "worst-case" k-dimensional instance
Reduction from dimension d to k

Spoiler: Yes

PCA under noise fails


Does PCA find the "signal subspace" U under noise?
No.

PCA minimizes Σ_{p∈P} d²(p, S)
  good only on "average", not "worst-case"
  weak signal directions get overpowered by noise directions
  a typical noise direction contributes Σ_{i=1}^n g_i² = Θ(nσ²)
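
To make the "overpowered by noise" point concrete, here is a small illustrative numpy experiment (parameters are mine, not from the talk): one strong and one weak signal direction plus noise of level σ ≈ 1/d^{1/4}; the top-2 PCA subspace recovers the strong direction, while the weak one ranks below the noise directions, each of which carries about nσ² of squared mass:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 10_000
sigma = d ** -0.25                          # noise level ~ 1/d^{1/4}

# two orthonormal signal directions: one strong, one weak
B, _ = np.linalg.qr(rng.normal(size=(d, 2)))
u_strong, u_weak = B[:, 0], B[:, 1]

# signal: large component along u_strong, small but meaningful one along u_weak
P = np.outer(2.0 * rng.normal(size=n), u_strong) + np.outer(0.3 * rng.normal(size=n), u_weak)
P_tilde = P + sigma * rng.normal(size=(n, d))          # full-dimensional noise

# top-2 PCA subspace of the noisy data
_, svals, Vt = np.linalg.svd(P_tilde, full_matrices=False)
S = Vt[:2].T

# the strong direction is captured, the weak one is drowned out by noise
print("strong direction captured:", np.linalg.norm(u_strong @ S))   # close to 1
print("weak direction captured:  ", np.linalg.norm(u_weak @ S))     # close to 0
print("top singular values:", np.round(svals[:4], 1))
```
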
1st Algorithm: intuition

Extract "well-captured points"
  points with signal mostly inside the top PCA space
  should work for a large fraction of the points
Iterate on the rest

Iterative PCA
• Find top PCA subspace S
• C = points well-captured by S
• Build NNS data structure on {C projected onto S}
• Iterate on the remaining points, P ∖ C
Query: query each NNS data structure separately

To make this work:
  Nearly no noise in S: ensure S is close to U
    S is determined by the heavy-enough spectral directions (its dimension may be less than k)
  Capture only points whose signal is fully inside S
    well-captured: distance to S explained by the noise only
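
A schematic numpy sketch of this loop, reflecting my reading of the pseudocode above; the "heavy-enough" and capture thresholds are simplified placeholders rather than the quantities from the analysis, and the per-iteration NNS is just brute force in the projected coordinates:

```python
import numpy as np

def iterative_pca_index(P, k, capture_dist):
    """Schematic iterative-PCA data structure.

    Returns a list of (subspace S, original indices, points projected onto S).
    """
    remaining = np.arange(len(P))
    index = []
    while len(remaining) > 0:
        X = P[remaining]
        _, svals, Vt = np.linalg.svd(X, full_matrices=False)
        heavy = svals**2 >= svals[0]**2 / (2 * k)         # keep only heavy spectral directions
        S = Vt[heavy][:k].T                               # subspace of dimension <= k
        dist = np.linalg.norm(X - (X @ S) @ S.T, axis=1)  # distance of each point to S
        captured = dist <= capture_dist                   # "distance explained by noise only"
        if not captured.any():                            # guard so the loop always terminates
            captured[np.argmin(dist)] = True
        index.append((S, remaining[captured], X[captured] @ S))
        remaining = remaining[~captured]
    return index

def query_index(index, q):
    """Query every per-iteration structure (brute force in the low-dimensional projection)."""
    best_dist, best_id = np.inf, None
    for S, ids, proj in index:
        d = np.linalg.norm(proj - q @ S, axis=1)
        j = int(np.argmin(d))
        if d[j] < best_dist:
            best_dist, best_id = d[j], ids[j]
    return best_id, best_dist
```
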
Simpler model

Assume: small noise
  p̃_i = p_i + α_i, where ||α_i|| ≪ ε
  (the α_i can even be adversarial)

Algorithm:
  • Find top-k PCA subspace S
  • C = points well-captured by S
    (well-captured if d(p, S) ≤ 2α)
  • Build NNS on {C projected onto S}
  • Iterate on the remaining points, P ∖ C
  Query: query each NNS separately

Claim 1: if p* is captured by C, we will find it in the NNS
  for any captured p:
    ||p̃_S − q_S|| = ||p̃ − q|| ± 4α = ||p − q|| ± 5α

Claim 2: the number of iterations is O(log n)
  Σ_{p∈P} d²(p, S) ≤ Σ_{p∈P} d²(p, U) ≤ n · α²
  so for at most a 1/4-fraction of the points, d²(p, S) ≥ 4α²
  hence a constant fraction is captured in each iteration
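
Spelling out the Claim 2 bullets as a single Markov-inequality step (my reading of the argument above):

```latex
\begin{align*}
\sum_{p \in P} d^2(p, S) \;\le\; \sum_{p \in P} d^2(p, U) \;\le\; n\,\alpha^2
   &\qquad \text{($S$ minimizes the sum over $k$-dim.\ subspaces; each $d(p,U)\le\alpha$)}\\[2pt]
\#\{\, p : d^2(p, S) \ge 4\alpha^2 \,\} \;\le\; \tfrac{n\alpha^2}{4\alpha^2} \;=\; \tfrac{n}{4}
   &\qquad \text{(Markov's inequality)}
\end{align*}
```

So at least a 3/4-fraction of the remaining points is captured in each iteration, and after O(log n) iterations no points are left.
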
Analysis of general model

Need to use the randomness of the noise
Want to say that the "signal" is stronger than the "noise" (on average)
Use random matrix theory

P̃ = P + G
  G is a random n × d matrix with entries N(0, σ²)
    all its singular values satisfy λ² ≤ σ²n ≈ n/√d
  P has rank ≤ k and (Frobenius norm)² ≥ n
    important directions have λ² ≥ Ω(n/k)
    can ignore directions with λ² ≪ εn/k

Important signal directions are stronger than the noise!

Closeness of subspaces?

Trickier than singular values
  the top singular vector is not stable under perturbation!
  it is only stable if the second singular value is much smaller

How to even define "closeness" of subspaces?

To the rescue: Wedin's sin-theta theorem
  sin θ(S, U) = max_{x∈S, ||x||=1}  min_{y∈U} ||x − y||

[Figure: the angle θ between subspaces S and U]
Wedin's sin-theta theorem

Developed by [Davis-Kahan'70], [Wedin'72]

Theorem:
  Consider P̃ = P + G
  S is the top-l subspace of P̃
  U is the k-space containing P

  Then: sin θ(S, U) ≤ ||G|| / λ_l(P̃)

Another way to see why we need to take only directions with
sufficiently heavy singular values

Additional issue: Conditioning

After an iteration, the noise is not random anymore!
  the non-captured points might be "biased" by the capturing criterion

Fix: estimate the top PCA subspace from a small sample of the data
  this might be needed purely for the analysis
  but it does not sound like a bad idea in practice either

Performance of Iterative PCA

Can prove there are O(√(d log n)) iterations

In each, we have NNS in a ≤ k-dimensional space

Overall query time: O((1/ε)^{O(k)} · d · log^{3/2} n)

Reduced to O(√(d log n)) instances of k-dimensional NNS!

2nd Algorithm: PCA-tree
Closer to algorithms used in practice

• Find the top PCA direction v
• Partition into slabs ⊥ v of width ≈ ε/k
• Snap points to the ⊥ hyperplanes
• Recurse on each slab

Query:
• follow all tree paths that may contain p*
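
A schematic numpy sketch of such a PCA-tree, with the two modifications from the next slide (centered PCA, and a depth cap standing in for sparsification); the slab width, stopping rule and thresholds are illustrative, not the exact choices analyzed in the paper:

```python
import numpy as np

def build_pca_tree(P, ids, width, min_size=8, depth=0, max_depth=60):
    """Schematic PCA-tree: split along the top centered-PCA direction into slabs
    of the given width, snap each slab's points onto its central hyperplane, recurse."""
    if len(ids) <= min_size or depth >= max_depth:
        return {"leaf": True, "ids": ids, "pts": P}
    mean = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - mean, full_matrices=False)    # centered PCA (next slide)
    v = Vt[0]                                                   # top PCA direction
    t = (P - mean) @ v                                          # coordinate along v
    slab = np.floor(t / width).astype(int)
    children = {}
    for s in np.unique(slab):
        m = slab == s
        snapped = P[m] - np.outer(t[m] - (s + 0.5) * width, v)  # snap onto the slab's hyperplane
        children[s] = build_pca_tree(snapped, ids[m], width, min_size, depth + 1, max_depth)
    return {"leaf": False, "mean": mean, "v": v, "width": width, "children": children}

def query_pca_tree(node, q, radius):
    """Follow every child slab that the ball B(q, radius) can intersect; brute force at leaves.
    Distances at the leaves are against snapped points, i.e. approximate."""
    if node["leaf"]:
        d = np.linalg.norm(node["pts"] - q, axis=1)
        j = int(np.argmin(d))
        return d[j], node["ids"][j]
    tq = (q - node["mean"]) @ node["v"]
    best = (np.inf, None)
    for s, child in node["children"].items():
        lo, hi = s * node["width"], (s + 1) * node["width"]
        if lo - radius <= tq <= hi + radius:                    # slab may contain p*
            cand = query_pca_tree(child, q, radius)
            if cand[0] < best[0]:
                best = cand
    return best
```

With the synthetic data from the earlier model sketch, one could build the tree as build_pca_tree(P_tilde, np.arange(len(P_tilde)), width=eps / k) and query with query_pca_tree(tree, q, radius=1 + eps); eps, k, P_tilde and q are the hypothetical names used in those sketches.
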
2 algorithmic modifications
• Find the top PCA direction v
• Partition into slabs ⊥ v
• Snap points to the ⊥ hyperplanes
• Recurse on each slab
Query:
• follow all tree paths that may contain p*

Centering:
  need to use centered PCA (subtract the average)
  otherwise errors from the perturbations accumulate

Sparsification:
  need to sparsify the set of points in each node of the tree
  otherwise we can get a "dense" cluster:
    not enough variance in the signal
    lots of noise

Analysis

An "extreme" version of the Iterative PCA algorithm:
  just use the top PCA direction: guaranteed to have signal!

Main lemma: the tree depth is ≤ 2k
  because each discovered direction is close to U
  snapping: like orthogonalizing with respect to each one
  cannot have too many such directions

Query runtime: O((k/ε)^{2k})
  (roughly: at each of the ≤ 2k levels, the query ball meets about k/ε slabs of width ε/k)

Overall performs like O(k · log k)-dimensional NNS!

Wrap-up
Why do data-aware projections outperform random projections?
Algorithmic framework to study this phenomenon?

Here:
  Model: "low-dimensional signal + large noise"
  like NNS in a low-dimensional space
  via the "right" adaptation of PCA

Immediate questions:
  Other, less-structured signal/noise models?
  Algorithms with runtime dependent on the spectrum?

Broader question: an analysis that explains empirical success?