Spectral Approaches to Nearest Neighbor Search (slides)
Spectral Approaches to Nearest Neighbor Search
Alex Andoni
Joint work with: Amirali Abdullah, Ravi Kannan, Robi Krauthgamer
Nearest Neighbor Search (NNS)
• Preprocess: a set P of points
• Query: given a query point q, report a point p* ∈ P with the smallest distance to q
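To make the problem concrete, the trivial exact solution is a linear scan over all points; a minimal NumPy sketch (purely illustrative, not part of the original slides):

```python
import numpy as np

def nearest_neighbor(points, query):
    """Exact NNS by linear scan: return the point of `points` closest to `query`.

    points: (n, d) array, query: (d,) array. Runs in O(n * d) time per query.
    """
    dists = np.linalg.norm(points - query, axis=1)  # Euclidean distance to every point
    return points[np.argmin(dists)]
```

The entire difficulty of NNS is doing substantially better than this O(n · d) scan, especially once the dimension d is large.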
Motivation
• Generic setup:
  • Points model objects (e.g. images)
  • Distance models a (dis)similarity measure
• Application areas:
  • machine learning: k-NN rule
  • speech/image/video/music recognition, vector quantization, bioinformatics, etc.
• Distance can be: Hamming, Euclidean, edit distance, Earth-mover distance, etc.
• Primitive for other problems
[figure: example points shown as binary strings, with the nearest neighbor p* of a query q]
Curse of dimensionality
• All exact algorithms degrade rapidly with the dimension d

  Algorithm                  | Query time   | Space
  Full indexing              | O(d · log n) | n^{O(d)} (Voronoi diagram size)
  No indexing (linear scan)  | O(n · d)     | O(n · d)
Approximate NNS
• Given a query point q, report a point p′ ∈ P s.t. ||p′ − q|| ≤ c · ||p* − q||
  • c: approximation factor
  • randomized: such a point p′ is returned with 90% probability
• Heuristic perspective: gives a set of candidates (hopefully small)
NNS algorithms
• It's all about space partitions!
• Low-dimensional:
  [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], …
• High-dimensional:
  [Indyk-Motwani'98], [Kushilevitz-Ostrovsky-Rabani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A-Indyk'06], [A-Indyk-Nguyen-Razenshteyn'14], [A-Razenshteyn'14]
Low-dimensional
• kd-trees, …
• c = 1 + ε
• runtime: (1/ε)^{O(d)} · log n
High-dimensional
• Locality-Sensitive Hashing
• Crucial use of random projections
  • Johnson-Lindenstrauss Lemma: project to a random subspace of dimension O(log n / ε²) for 1 + ε approximation
  • Runtime: n^{1/c} for c-approximation
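As an illustration of the random-projection primitive behind JL and LSH, here is a minimal NumPy sketch of a Johnson-Lindenstrauss-style projection; the function name and the constant 8 in the target dimension are illustrative choices, not taken from the slides:

```python
import numpy as np

def jl_project(points, eps, rng=None):
    """Project an (n, d) point set to a random subspace of dimension O(log n / eps^2).

    Uses a Gaussian projection matrix; with this scaling, all pairwise distances
    are preserved up to a (1 +/- eps) factor with high probability.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = points.shape
    m = int(np.ceil(8 * np.log(n) / eps ** 2))    # target dimension O(log n / eps^2)
    R = rng.standard_normal((d, m)) / np.sqrt(m)  # scaling preserves norms in expectation
    return points @ R
```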
Practice
• Data-aware partitions:
  • optimize the partition to your dataset
  • PCA-tree [S'91, M'01, VKD'09]
  • randomized kd-trees [SAH08, ML09]
  • spectral/PCA/semantic/WTA hashing [WTF08, CKW09, SH09, YSRL11]
Practice vs Theory
• Data-aware projections often outperform (vanilla) random-projection methods
• But no guarantees (correctness or performance)
• JL is generally optimal [Alon'03, JW'13]
  • even for some NNS regimes! [AIP'06]

Why do data-aware projections outperform random projections?
Is there an algorithmic framework to study this phenomenon?
Plan for the rest
• Model
• Two spectral algorithms
• Conclusion
Our model
• "low-dimensional signal + large noise"
  • inside a high-dimensional space
• Signal: P ⊂ U, where U ⊂ ℝ^d is a subspace of dimension k ≪ d
• Data: each point is perturbed by full-dimensional Gaussian noise N_d(0, σ²)
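A minimal sketch of a data generator matching this model; the default parameter values are illustrative, and only the structure (signal in a random k-dimensional subspace plus isotropic Gaussian noise of standard deviation σ ≈ d^{-1/4}) is taken from the slides:

```python
import numpy as np

def make_instance(n=1000, d=1000, k=10, sigma=None, rng=None):
    """Generate 'low-dimensional signal + large noise' data.

    Signal points live in a random k-dimensional subspace U of R^d; each point
    is then perturbed by full-dimensional N(0, sigma^2 I) noise. sigma defaults
    to d**-0.25, the noise level assumed in the model.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = d ** -0.25 if sigma is None else sigma
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal basis of a random subspace
    signal = rng.standard_normal((n, k)) @ U.T        # points P inside U
    noise = sigma * rng.standard_normal((n, d))       # full-dimensional Gaussian noise G
    return signal + noise, U
```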
Model properties
• Data: P̃ = P + G, where G is the full-dimensional Gaussian noise
• Query: q̃ = q + g_q, such that:
  • ||q − p*|| ≤ 1 for the "nearest neighbor"
  • ||q − p|| ≥ 1 + ε for everybody else
  • noise entries are N(0, σ²), with σ ≈ 1/d^{1/4}
    • roughly the limit at which the nearest neighbor stays the same
    • up to factors poly in ε^{-1} log n
• Noise is large:
  • σ√d ≈ d^{1/4} ≫ 1
  • the top k dimensions of q capture only a sub-constant fraction of its mass
  • JL would not work: after the noise, the gap is very close to 1
NNS performance as if we are in k dimensions?
• Best we can hope for:
  • the dataset may contain a "worst-case" k-dimensional instance
• Reduction from dimension d to dimension k
• Spoiler: Yes
Tool: PCA
• Principal Component Analysis
  • like SVD
  • top PCA direction: the direction maximizing the variance of the data
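A minimal sketch of how the top PCA directions can be computed via an SVD of the (centered) data matrix; illustrative code, not from the slides:

```python
import numpy as np

def top_pca_subspace(points, k):
    """Return a (d, k) orthonormal basis of the top-k PCA subspace.

    Centered PCA: subtract the mean, then take the top-k right singular vectors
    of the centered data matrix (equivalently, the top eigenvectors of the
    sample covariance matrix).
    """
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # rows of vt are principal directions
    return vt[:k].T
```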
Naïve attempt via PCA
• Use PCA to find the "signal subspace" U?
  • find the top-k PCA subspace
  • project everything onto it and solve NNS there
• Does NOT work:
  • sparse signal directions can be overpowered by noise directions
  • PCA is good only "on average", not in the "worst case"
1st Algorithm: intuition
• Extract "well-captured points":
  • points whose signal lies mostly inside the top PCA space
  • should work for a large fraction of the points
• Iterate on the rest
Iterative PCA
• Find the top PCA subspace Ũ
• C = points well-captured by Ũ
• Build an NNS data structure on {C projected onto Ũ}
• Iterate on the remaining points, P − C
• Query: query each NNS data structure separately
• To make "PCA work":
  • nearly no noise in Ũ: ensure Ũ is close to U
    • Ũ is determined by the heavy-enough spectral directions (its dimension may be less than k)
  • capture points whose signal lies fully inside Ũ
    • well-captured: distance to Ũ explained by noise only
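A schematic Python sketch of the iterative-PCA preprocessing loop described above; the capture threshold, and the choice to store the projected points directly rather than a dedicated low-dimensional NNS structure, are illustrative simplifications rather than the parameters from the paper:

```python
import numpy as np

def iterative_pca_build(points, k, sigma, capture_factor=2.0):
    """Split the dataset into groups, each with its own low-dimensional projection.

    Repeatedly: compute the top-k (centered) PCA subspace of the remaining points,
    keep the points whose residual distance to that subspace looks like noise only
    ("well-captured"), project them onto the subspace, and iterate on the rest.
    Returns a list of (basis, projected_points, original_indices) triples.
    """
    d = points.shape[1]
    structures = []
    idx = np.arange(len(points))
    remaining = points
    while len(remaining) > 0:
        centered = remaining - remaining.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[:k].T                                   # top-k PCA subspace of this round
        residual = centered - (centered @ basis) @ basis.T
        # "Well-captured": residual explained by noise alone (threshold is illustrative).
        captured = np.linalg.norm(residual, axis=1) <= capture_factor * sigma * np.sqrt(d)
        if not captured.any():                             # guard against an empty round
            captured[:] = True
        structures.append((basis, remaining[captured] @ basis, idx[captured]))
        remaining, idx = remaining[~captured], idx[~captured]
    return structures
```

At query time one would project the query onto each stored basis, search the corresponding low-dimensional structure, and keep the best candidate found across all groups.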
Analysis of PCA
• Generally want to say that the "signal" is stronger than the "noise" (on average)
• Use random matrix theory:
  • P̃ = P + G
  • G is a random n × d matrix with entries N(0, σ²)
    • all singular values of G satisfy λ ≲ σ(√n + √d)
  • P has rank k and squared Frobenius norm at least n
    • important directions have λ² ≥ Ω(n/k)
    • important signal directions are stronger than the noise
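A quick numerical sanity check of the random-matrix fact used here, namely that the spectral norm of an n × d Gaussian noise matrix concentrates around σ(√n + √d); the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 2000, 500, 0.1

G = sigma * rng.standard_normal((n, d))        # noise matrix with N(0, sigma^2) entries
top_singular_value = np.linalg.norm(G, ord=2)  # largest singular value ||G||
print(top_singular_value, sigma * (np.sqrt(n) + np.sqrt(d)))
# The two printed numbers should be close, illustrating ||G|| ~= sigma * (sqrt(n) + sqrt(d)).
```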
Closeness of subspaces?
• Trickier than singular values
  • the top singular vector is not necessarily stable under perturbation!
  • it is only stable if the second singular value is much smaller
• How to even define "closeness" of subspaces?
• To the rescue: Wedin's sin-theta theorem
  • sin θ(U, Ũ) = max_{x ∈ U, ||x|| = 1} min_{y ∈ Ũ} ||x − y||
Wedin's sin-theta theorem
• Developed by [Davis-Kahan'70], [Wedin'72]
• Theorem:
  • consider P̃ = P + G
  • Ũ is the top-k PCA subspace of P̃
  • U is the (signal) space of P
  • then: sin θ(U, Ũ) ≤ ||G|| / σ_k(P), where σ_k(P) is the k-th singular value of P
• Another way to see why we need to take only directions with sufficiently heavy singular values
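A small illustration that combines the definition above with the theorem: build a rank-k signal matrix, add Gaussian noise, and compare the sin-theta distance between the true and the estimated top-k subspaces against ||G|| / σ_k(P). All parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k, sigma = 2000, 200, 5, 0.05

# Rank-k signal P (points in a random k-dimensional subspace) plus Gaussian noise G.
U_true, _ = np.linalg.qr(rng.standard_normal((d, k)))
P = rng.standard_normal((n, k)) @ U_true.T
G = sigma * rng.standard_normal((n, d))

# Top-k PCA subspace of the noisy data.
_, _, vt = np.linalg.svd(P + G, full_matrices=False)
U_hat = vt[:k].T

# sin of the largest principal angle between the true and the estimated subspaces.
sin_theta = np.linalg.norm(U_true.T - (U_true.T @ U_hat) @ U_hat.T, ord=2)

bound = np.linalg.norm(G, ord=2) / np.linalg.svd(P, compute_uv=False)[k - 1]
print(sin_theta, bound)  # the observed sin-theta should stay below the Wedin-style bound
```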
Additional issue: Conditioning
• After an iteration, the noise is not random anymore!
  • conditioning arises from the selection of points (the non-captured ones)
• Fix: estimate the top PCA subspace from a small sample of the data
  • this might be needed purely for the analysis
  • but it does not sound like a bad idea in practice either
Performance of Iterative PCA
• Can prove there are O(√(d log n)) iterations
• In each iteration, we do NNS in a space of dimension ≤ k
• Overall query time: (1/ε)^{O(k)} · √d · log^{3/2} n
• Reduced to O(√(d log n)) instances of k-dimensional NNS!
2nd Algorithm: PCA-tree
• Closer to the algorithms used in practice
• Build:
  • find the top PCA direction v
  • partition the points into slabs perpendicular to v (slab width ≈ ε/√k)
  • snap the points to the slabs' hyperplanes
  • recurse on each slab
• Query: follow all tree paths that may contain p*
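A schematic sketch of the PCA-tree construction just described (centered PCA direction, slab partitioning, snapping); the slab width and leaf size are illustrative parameters, and the sparsification step from the next slide is omitted for brevity:

```python
import numpy as np

def build_pca_tree(points, slab_width, leaf_size=32):
    """Recursively partition points into slabs along the top (centered) PCA direction.

    Each internal node stores the direction v and the slab width; points are
    snapped to their slab's central hyperplane before recursing, mirroring the
    snapping step on the slide.
    """
    if len(points) <= leaf_size or np.allclose(points, points.mean(axis=0)):
        return {"leaf": True, "points": points}
    mean = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - mean, full_matrices=False)
    v = vt[0]                                        # top centered-PCA direction
    proj = (points - mean) @ v
    slabs = np.floor(proj / slab_width).astype(int)  # slab index of each point
    children = {}
    for s in np.unique(slabs):
        mask = slabs == s
        center = (s + 0.5) * slab_width              # snap onto the slab's central hyperplane
        snapped = points[mask] + np.outer(center - proj[mask], v)
        children[int(s)] = build_pca_tree(snapped, slab_width, leaf_size)
    return {"leaf": False, "v": v, "mean": mean, "width": slab_width, "children": children}
```

At query time one would descend into every slab whose range could still contain the nearest neighbor of the (projected) query, which is the "follow all tree paths that may contain p*" step.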
2 algorithmic modifications
• Centering:
  • need to use centered PCA (subtract the average)
  • otherwise errors from the perturbations accumulate
• Sparsification:
  • need to sparsify the set of points in each node of the tree
  • otherwise we can get a tight cluster:
    • not enough variance in the signal
    • lots of noise
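A tiny demonstration of why the centering modification matters: on data whose mean is far from the origin, the top uncentered singular vector mostly points toward the mean rather than along the direction of largest variance (NumPy, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Points varying mostly along the first axis, but shifted far from the origin along the second.
X = rng.standard_normal((1000, 2)) * np.array([3.0, 0.1]) + np.array([0.0, 50.0])

_, _, vt_raw = np.linalg.svd(X, full_matrices=False)                        # uncentered PCA
_, _, vt_centered = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)  # centered PCA

print(np.round(vt_raw[0], 3))       # ~ +/-[0, 1]: dominated by the offset toward the mean
print(np.round(vt_centered[0], 3))  # ~ +/-[1, 0]: the true direction of maximum variance
```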
Analysis
• An "extreme" version of the Iterative PCA algorithm:
  • just use the top PCA direction: it is guaranteed to contain signal!
• Main lemma: the tree depth is ≤ 2k
  • because each discovered direction is close to U
  • snapping: like orthogonalizing with respect to each such direction
  • we cannot have too many such directions
• Query runtime: (k/ε)^{O(k)}
• Final theorem: like NNS in O(k · log k)-dimensional space
Wrap-up
Why do data-aware projections outperform random projections?
Is there an algorithmic framework to study this phenomenon?
• Here:
  • model: "low-dimensional signal + large noise"
  • like NNS in a low-dimensional space
  • via the "right" adaptation of PCA
• Immediate questions:
  • PCA-tree: is it also like NNS in k-dimensional space?
  • other, less-structured signal/noise models?
  • algorithms with runtime dependent on the spectrum?
Post-perspective
• Can data-aware projections help in worst-case situations?
• Data-aware LSH is provably better than classic LSH
  • [A-Indyk-Nguyen-Razenshteyn'14], [A-Razenshteyn]
  • overall improves performance almost quadratically
  • but not based on (spectral) projections
    • uses a random, data-dependent hash function
• Data-aware projections for LSH?
• Instance-optimal partitions/projections?