NICTA Machine Learning Research Group Seminar
APPROXIMATE NEAREST NEIGHBOR QUERIES WITH A TINY INDEX
6/26/14
Wei Wang, University of New South Wales
Outline
Overview of Our Research
SRS: c-Approximate Nearest Neighbor with a Tiny Index [PVLDB 2015]
Conclusions
Research Projects
Similarity Query Processing
Keyword Search on (Semi-)Structured Data
Graph
Succinct Data Structures
NN and c-ANN Queries
(Figure: example with D = {a, b, x}; NN = a, while x may be returned as a c-ANN point. c-ANN is also known as (1+ε)-ANN.)
Definitions
A set of points, D = {O_1, …, O_n}, in d-dimensional Euclidean space (d is large, e.g., hundreds)
Given a query point q, find the closest point, O*, in D
Relaxed version:
Return a c-ANN point, i.e., a point whose distance to q is at most c·Dist(O*, q)
May return a c-ANN point with at least constant probability
Applications and Challenges
Applications
Feature vectors: Data Mining, Multimedia DB
Fundamental geometric problem: the "post-office problem"
Quantization in coding/compression
…
Challenges
Curse of Dimensionality / Concentration of Measure
Hard to find algorithms sub-linear in n and polynomial in d
Large data size: 1KB for a single point with 256 dims
Existing Solutions
NN:
Linear scan is (practically) the best approach using linear space & time
O(d^5 log n) query time, O(n^(2d+δ)) space
O(d·n^(1−ε(d))) query time, O(dn) space
External memory: linear scan takes O(dn/B) I/Os, O(dn) space
(1+ε)-ANN:
LSH is the best approach using sub-quadratic space
O(log n + (1/ε)^((d−1)/2)) query time, O(n·log(1/ε)) space
Probabilistic tests remove the exponential dependency on d
Fast JLT: O(d·log d + ε^(−3)·log^2 n) query time, O(n^max(2, ε^(−2))) space
LSH-based: Õ(d·n^(ρ+o(1))) query time, Õ(n^(1+ρ+o(1)) + nd) space, where ρ = 1/(1+ε) + o_c(1)
Approximate NN for Multimedia Retrieval
Cover-tree
Spill-tree
Reduce to NN search with Hamming distance
Dimensionality reduction (e.g., PCA)
Quantization-based approaches (e.g., CK-Means)
Locality Sensitive Hashing (LSH)
Equality search
Index: store o into bucket h(o)
Query: retrieve every o in the bucket h(q); verify whether o = q
LSH
h: R^d → Z (technically, dependent on r)
∀h ∈ LSH-family, Pr[h(q) = h(o)] ∝ 1/Dist(q, o)
"Near-by" points (blue) have a higher chance of colliding with q than "far-away" points (red)
LSH: Indexing & Query Processing
Index
For a fixed r
sig(o) = ⟨h_1(o), h_2(o), …, h_k(o)⟩
store o into bucket sig(o)
Query
Search with a fixed r
Retrieve and "verify" points in the bucket sig(q)
Repeat this L times (boosting) → constant success probability
Reduce Query Cost
Iteratively increase r
Galloping search to find the first good r
Incurs additional cost + only c^2 quality guarantee
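The fixed-r index/query scheme above can be sketched as follows. This is a minimal illustration, assuming E2LSH-style hash functions h(o) = ⌊(a·o + b)/w⌋ built from 2-stable (Gaussian) projections; the class and parameter names are illustrative, not from the talk.

```python
import numpy as np
from collections import defaultdict

class LSHTable:
    """One hash table with a composite key sig(o) = <h_1(o), ..., h_k(o)>."""

    def __init__(self, d, k, w, rng):
        self.a = rng.standard_normal((k, d))  # k 2-stable projection vectors
        self.b = rng.uniform(0, w, size=k)    # random offsets
        self.w = w                            # bucket width (plays the role of r)
        self.buckets = defaultdict(list)

    def sig(self, o):
        # composite key <h_1(o), ..., h_k(o)>, each h_i(o) = floor((a_i.o + b_i)/w)
        return tuple(np.floor((self.a @ o + self.b) / self.w).astype(int))

    def insert(self, idx, o):
        self.buckets[self.sig(o)].append(idx)

    def query(self, q):
        # candidates sharing q's bucket; the caller verifies true distances
        return self.buckets.get(self.sig(q), [])
```

Boosting then corresponds to building L independent `LSHTable`s and taking the union of their candidate lists.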
Locality Sensitive Hashing (LSH)
Standard LSH
Solves c^2-ANN via binary search on R(c^i, c^(i+1))-NN problems
LSH on external memory:
LSB-forest [SIGMOD'09, TODS'10]:
A different reduction from c^2-ANN to a R(c^i, c^(i+1))-NN problem
O((dn/B)^0.5) query, O((dn/B)^1.5) space
C2LSH [SIGMOD'12]:
Does not use composite hash keys
Performs fine-granular counting of the number of collisions in m LSH projections
O(n·log(n)/B) query, O(n·log(n)/B) space
SRS (Ours):
O(n/B) query, O(n/B) space
Weakness of Ext-Memory LSH Methods
Existing methods use super-linear space
Thousands (or more) of hash tables needed if rigorous
People resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval
Can only handle c where c = x^2, for integer x ≥ 2
To enable reusing the hash table (merging buckets)
Valuable information lost (due to quantization)
Update? (changes to n and c)
Index sizes on the Audio dataset (40 MB):
LSB-forest: 1500 MB
C2LSH: 127 MB
SRS (Ours): 2 MB
SRS: Our Proposed Method
Solves c-ANN queries with O(n) query time and O(n) space, with constant probability
The constants hidden in O(·) are very small
The early-termination condition is provably effective
Advantages:
Small index
Rich functionality
Simple
Central idea:
c-ANN query in d dims → kNN query in m dims with filtering
Model the distribution of m "stable random projections"
2-stable Random Projection
Let D be the 2-stable distribution = the standard Normal distribution N(0, 1)
For two i.i.d. random variables A ~ D, B ~ D: x·A + y·B ~ (x^2 + y^2)^(1/2) · D
(Illustration: projecting a vector V onto random directions r_1, r_2)
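The 2-stability property can be checked numerically; the snippet below (illustrative, not from the talk) samples x·A + y·B and confirms its standard deviation matches (x^2 + y^2)^(1/2).

```python
import numpy as np

# Numerical check of 2-stability: for fixed x, y and i.i.d. A, B ~ N(0, 1),
# x*A + y*B has the same distribution as sqrt(x^2 + y^2) * N(0, 1).
rng = np.random.default_rng(42)
x, y = 3.0, 4.0
n_samples = 200_000
combo = x * rng.standard_normal(n_samples) + y * rng.standard_normal(n_samples)
# standard deviation should be close to sqrt(x^2 + y^2) = 5
print(round(float(combo.std()), 2))
```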
Dist(O) and ProjDist(O) and Their Relationship
O in d dims → (via m 2-stable random projections r_1, …, r_m) → (z_1, …, z_m) in m dims
With V = O − Q, Dist(O) = ǁVǁ, and ProjDist(O) = ǁ(z_1, …, z_m)ǁ:
z_1 ≔ ⟨V, r_1⟩ ~ N(0, ǁVǁ^2)
z_2 ≔ ⟨V, r_2⟩ ~ N(0, ǁVǁ^2)
z_1^2 + … + z_m^2 ~ ǁVǁ^2 · χ^2_m
i.e., a scaled Chi-squared distribution with m degrees of freedom
Ψ_m(x): cdf of the standard χ^2_m distribution
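The scaled-χ² relationship above can be verified empirically: the sketch below (illustrative) projects v = O − Q onto m i.i.d. Gaussian vectors and checks that ProjDist²/Dist² has mean m, the mean of χ²_m.

```python
import numpy as np

# Empirical check that ProjDist^2(O) / Dist^2(O) ~ chi^2_m:
# project v onto m i.i.d. N(0,1) vectors per trial and take the
# ratio of squared norms; its mean should be close to m.
rng = np.random.default_rng(0)
d, m, trials = 32, 6, 20_000
v = rng.standard_normal(d)
R = rng.standard_normal((trials, m, d))      # m projection vectors per trial
ratios = ((R @ v) ** 2).sum(axis=1) / (v @ v)  # ProjDist^2 / Dist^2
print(round(float(ratios.mean()), 2))          # close to m = 6
```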
LSH-like Property
Intuitive idea:
If Dist(O1) ≪ Dist(O2), then ProjDist(O1) < ProjDist(O2) with high probability
But the converse is NOT true
With few projections, the NN object in the projected space is most likely not the NN object in the original space, as many far-away objects are projected before the NN/c-ANN objects
But we can bound the expected number of such cases! (say, T)
Solution
Perform incremental k-NN search in the projected space until T objects have been accessed
+ early-termination test
Indexing
Finding the minimum m
Input: n, c, T ≔ max # of points to be accessed by the algorithm
Output: m: # of 2-stable random projections; T′ ≤ T: a better bound on T
m = O(n/T). We use T = O(n), so m = O(1), to achieve a linear-space index
Generate m 2-stable random projections → n projected points in an m-dimensional space
Index these projections using any index that supports incremental kNN search, e.g., an R-tree
Space cost: O(m·n) = O(n)
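The construction step above can be sketched in a few lines. In SRS the m-dimensional projections are then indexed by a structure supporting incremental kNN search (e.g., an R-tree); here we only materialize them, and the function and variable names are illustrative.

```python
import numpy as np

def build_srs_index(data, m, seed=0):
    """Project n d-dim points down to m dims with 2-stable (Gaussian)
    projections; the result uses O(m*n) = O(n) space for constant m."""
    n, d = data.shape
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((d, m))  # m 2-stable random projections
    proj = data @ G                  # n projected points in m dims
    return proj, G
```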
SRS-αβ(T, c, p_τ)
Compute proj(Q)
Do incremental kNN search from proj(Q):
for k = 1 to T    // stopping condition α
    Compute Dist(O_k)
    Maintain O_min = argmin_{1≤i≤k} Dist(O_i)
    If early-termination test (c, p_τ) = TRUE    // stopping condition β
        BREAK
Return O_min
Early-termination test: Ψ_m( c^2 · ProjDist^2(O_k) / Dist^2(O_min) ) > p_τ
Example: c = 4, d = 256, m = 6, T = 0.00242n, B = 1024, p_τ = 0.18
gives Index = 0.0059n, Query = 0.0084n, success probability = 0.13
Main Theorem: SRS-αβ returns a c-ANN point with probability p_τ − f(m, c), with O(n) I/O cost
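The pseudo-code above can be sketched as follows. This is a simplified illustration, not the paper's implementation: incremental kNN in the projected space is simulated with a full sort instead of an R-tree, m is assumed even so that Ψ_m has a closed form (avoiding an external stats library), and all names are illustrative.

```python
import math
import numpy as np

def chi2_cdf_even(m, x):
    # Psi_m(x): cdf of chi^2 with an even number m of degrees of freedom,
    # via the closed form 1 - exp(-x/2) * sum_{i < m/2} (x/2)^i / i!
    total, term = 0.0, 1.0
    for i in range(m // 2):
        total += term
        term *= (x / 2.0) / (i + 1)
    return 1.0 - math.exp(-x / 2.0) * total

def srs_query(data, proj, G, q, c, T, p_tau):
    """SRS-alpha-beta sketch: scan points in order of projected distance,
    verify true distances, and stop early once the beta test fires."""
    m = G.shape[1]
    proj_q = q @ G
    proj_dists = np.linalg.norm(proj - proj_q, axis=1)
    order = np.argsort(proj_dists)               # incremental kNN order
    o_min, d_min = -1, math.inf
    for k in range(min(T, len(order))):          # stopping condition alpha
        i = int(order[k])
        d = float(np.linalg.norm(data[i] - q))   # Dist(O_k)
        if d < d_min:
            o_min, d_min = i, d
        ratio = (c * proj_dists[i] / d_min) ** 2
        if chi2_cdf_even(m, ratio) > p_tau:      # stopping condition beta
            break
    return o_min
```

With p_tau close to 1 the β test never fires and the scan degenerates to an exhaustive search, which is useful for sanity-checking the sketch.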
Variations of SRS-αβ(T, c, p_τ)
(Same pseudo-code as SRS-αβ, varying which stopping condition is in effect)
1. SRS-α: better quality; query cost is O(T)
2. SRS-β: best quality; query cost bounded by O(n); handles c = 1
3. SRS-αβ(T, c′, p_τ): better quality; query cost bounded by O(T)
All with success probability at least p_τ
Other Results
Can easily be extended to support top-k c-ANN queries (k > 1)
No previously known guarantee on the correctness of returned results
We guarantee correctness with probability at least p_τ if SRS-αβ stops due to the early-termination condition
≈100% in practice (97% in theory)
Analysis
Stopping Condition α
"Near" point: the NN point; its distance ≕ r
"Far" points: points whose distance > c·r
Then for any κ > 0 and any o:
Pr[ProjDist(o) ≤ κ·r | o is a near point] ≥ Ψ_m(κ^2)
Pr[ProjDist(o) ≤ κ·r | o is a far point] ≤ Ψ_m(κ^2/c^2)
Both because ProjDist^2(o)/Dist^2(o) ~ χ^2_m
P1 ≔ Pr[the NN point is projected before κ·r] ≥ Ψ_m(κ^2)
P2 ≔ Pr[# of far points projected before κ·r < T] ≥ 1 − Ψ_m(κ^2/c^2)·(n/T) (by Markov's inequality)
Choose κ such that P1 + P2 − 1 > 0
Feasible due to the good concentration bounds for χ^2_m
Choosing κ
(Figure: the two χ^2-type densities with c = 4; mode at m − 2; the near-point curve (blue) scaled by 4 and the far-point curve (red) by 4·c^2 = 64, with the threshold κ·r marked)
ProjDist(O_T): Case I
Consider cases where both conditions hold (re. near and far points), i.e., with probability P1 + P2 − 1
(Figure: ProjDist(o) in m dims vs. ProjDist(O_T); here O_min = the NN point)
ProjDist(O_T): Case II
Consider cases where both conditions hold (re. near and far points), i.e., with probability P1 + P2 − 1
(Figure: ProjDist(o) in m dims vs. ProjDist(O_T); here O_min = a c-ANN point)
Early-termination Condition (β)
Proof omitted here
Also relies on the fact that the squared sum of the m projected distances follows a scaled χ^2_m distribution
Key to:
Handling the case where c = 1
Returns the NN point with guaranteed probability
Impossible to handle by LSH-based methods
Guaranteeing the correctness of the top-k c-ANN points returned when stopped by this condition
No such guarantee by any previous method
Experiment Setup
Algorithms
LSB-forest [SIGMOD'09, TODS'10]
C2LSH [SIGMOD'12]
SRS-* [PVLDB 2015]
Data
Measures
Index size, query cost, result quality, success probability
Datasets
(Dataset table omitted; listed sizes include 16GB, 369GB, and 5.6PB)
Tiny Image Dataset (8M pts, 384 dims)
Fastest: SRS-αβ; slowest: C2LSH; quality is the other way around
SRS-α has comparable quality to C2LSH yet much lower cost
SRS-* dominates LSB-forest
Approximate Nearest Neighbor
Empirically better than the theoretical guarantee
With 15% of the I/Os of a linear scan, returns the NN with probability 71%
With 62% of the I/Os of a linear scan, returns the NN with probability 99.7%
Large Dataset (0.45 Billion Points)
Summary
c-ANN queries in arbitrarily high-dimensional space → kNN queries in low-dimensional space
Our index size is approximately m/d of the size of the data file
Opens up a new direction for c-ANN queries in high-dimensional space:
Finding efficient solutions to the kNN problem in 6-10 dimensional space
Q&A
Similarity Query Processing Project Homepage:
http://www.cse.unsw.edu.au/~weiw/project/simjoin.html