Nearest Neighbor Retrieval Using
Distance-Based Hashing
Vassilis Athitsos, University of Texas, Arlington
Michalis Potamias+, Boston University
Panagiotis Papapetrou, Boston University
George Kollios, Boston University
nearest neighbor problem
Setting:
database of objects S
distance function D
Given:
query Q (previously unseen)
Find and Return:
object P* from S that is closest to Q
NNs appear in various applications under many different distance functions:
classification of handwritten digits
hand-pose estimation
Can perform linear scan…
Cost: large S, expensive D
7/7/2015
Distance Based Hashing
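The linear-scan baseline mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `linear_scan_nn` and the returned call counter are assumptions added here to make the cost (one distance computation per database object) visible.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def linear_scan_nn(
    database: Sequence[T],
    query: T,
    dist: Callable[[T, T], float],
) -> Tuple[Optional[T], float, int]:
    """Return (nearest object, its distance, number of distance computations).

    `dist` is treated as a black box, so the scan needs exactly
    len(database) distance computations.
    """
    best, best_d, n_calls = None, float("inf"), 0
    for obj in database:
        d = dist(query, obj)
        n_calls += 1
        if d < best_d:
            best, best_d = obj, d
    return best, best_d, n_calls


if __name__ == "__main__":
    print(linear_scan_nn([1.0, 4.0, 9.0], 5.0, lambda a, b: abs(a - b)))
```

With a large S and an expensive D, the call counter is the quantity that every indexing scheme in the rest of the talk tries to reduce.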
cost model
Dominating cost:
Distance function may be very “expensive”
Time series (DTW)
String Alignment (Edit)
Computer vision
Cost Model: minimize the number of distance computations
[Figure: dynamic programming for edit distance]
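To make the "expensive distance" point concrete, here is a small sketch of the standard dynamic program for edit (Levenshtein) distance with unit costs. The function name `edit_distance` is chosen here for illustration; the slide only names the technique. Each call costs time proportional to len(a) * len(b), which is why the cost model counts distance computations.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping two rows."""
    # prev[j] = distance between the first i-1 characters of a and the first j of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```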
some existing solutions
If objects are low dimensional, exact nearest neighbor search is fast
If objects are high dimensional, for some distance functions (e.g., Hamming) approximate nearest neighbor search is fast, using LSH
However, in many interesting settings (high dimensional, non-metric) “linear scan” may be the only approach for exact NNs
dbh setting
No assumptions about the distance function
possibly non-metric
Distance function computations dominate the cost
Trade perfect accuracy for faster results
dbh method overview
Preprocess:
Hash database using appropriate functions
Query Q arrives:
Hash it!
Filter: Retrieve colliding objects as “candidate NNs”
Refine: Compute the actual distance between query and candidates
Return: Candidate that is closest to Q
Background
Background: hash-based indexing
Building the index: hash the database objects with h1
Query time: hash the query with h1, compute D between the query and the colliding objects, return the min
Use L tables in parallel: h1, h2, …, hL
Background: locality sensitive hashing
Choice of Hash Functions is important!
LSH family of functions [IM98]
[Figure: points x, y, z with radii r and cr]
An LSHF in a Hash-based Indexing scheme guarantees sublinear behavior for approximate NNs!
Such families have been constructed for Hamming, L2…
What if there is no LSH family for the Distance function used? (Edit, DTW, etc.)
Distance Based Hashing
Hash-based indexing scheme
Can be applied to any space & any D
Its hash functions treat D as a black box
Optimization
DBH: family of hash functions
Pseudo-line projection [FL95] maps an object into the real line
y, z are pivot points from the database
Project x on the y-z pseudoline:
$F^{y,z}(x) = \frac{D(x,y)^2 + D(y,z)^2 - D(x,z)^2}{2\,D(y,z)}$
Use a threshold to make it discrete valued
[Figure: triangle with vertices x, y, z, sides D(x,y), D(x,z), D(y,z), and the projection F^{y,z}(x) on the y-z line]
-- This family is not an LSHF
++ Definition does not depend on the specific distance function, only on the 3 pairwise distances.
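The pseudo-line projection and its discretization can be sketched as below. The names `line_projection` and `binary_hash` are illustrative, and the single cut-off `threshold` is an assumption: the slide says only that a threshold is used, not how it is chosen or which side maps to 1.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def line_projection(x: T, y: T, z: T, dist: Callable[[T, T], float]) -> float:
    """F^{y,z}(x) = (D(x,y)^2 + D(y,z)^2 - D(x,z)^2) / (2 D(y,z)).

    Only the three pairwise distances are used; `dist` is a black box.
    """
    d_yz = dist(y, z)
    if d_yz == 0:
        raise ValueError("pivots y and z must be at nonzero distance")
    return (dist(x, y) ** 2 + d_yz ** 2 - dist(x, z) ** 2) / (2 * d_yz)


def binary_hash(x: T, y: T, z: T, dist: Callable[[T, T], float],
                threshold: float) -> int:
    """Discretize the projection: 1 if it reaches `threshold`, else 0 (assumed convention)."""
    return 1 if line_projection(x, y, z, dist) >= threshold else 0


if __name__ == "__main__":
    d = lambda a, b: abs(a - b)
    print(line_projection(3.0, 0.0, 10.0, d), binary_hash(3.0, 0.0, 10.0, d, 5.0))
```

On the real line with D(a, b) = |a - b| and pivots 0 and 10, the projection of 3 is 3, which matches the geometric intuition; for a non-metric D the same formula still evaluates, which is the point of the construction.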
DBH: method
Preprocessing:
1. Use a random choice of K of these pseudoline projections to define a hash function
2. Build L such (K-bit) functions
3. Hash all objects of S to the L hash tables
At query time:
1. Apply the same L functions to Q
2. Filter: Retrieve colliding objects (candidate set)
3. Refine: Invoke D for candidates
4. Return: Nearest*
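The preprocessing and query steps above can be sketched as a toy index. This is an illustration under stated assumptions, not the authors' implementation: the class name `DBHIndex`, the `seed` parameter, and in particular the choice of the median projection as the threshold are all hypothetical choices made here so the example runs.

```python
import random
from collections import defaultdict


class DBHIndex:
    """Toy distance-based hashing index; `dist` is treated as a black box."""

    def __init__(self, database, dist, K, L, seed=0):
        self.database = list(database)
        self.dist = dist
        rng = random.Random(seed)
        # L hash functions, each a list of K binary pseudo-line projections.
        self.functions = [[self._random_bit(rng) for _ in range(K)]
                          for _ in range(L)]
        # One table per K-bit function: key -> list of database indices.
        self.tables = []
        for bits in self.functions:
            table = defaultdict(list)
            for idx, obj in enumerate(self.database):
                table[self._key(obj, bits)].append(idx)
            self.tables.append(table)

    def _project(self, x, y, z, d_yz):
        return (self.dist(x, y) ** 2 + d_yz ** 2 - self.dist(x, z) ** 2) / (2 * d_yz)

    def _random_bit(self, rng):
        # Pick two random pivots from the database at nonzero distance.
        for _ in range(100):
            y, z = rng.sample(self.database, 2)
            d_yz = self.dist(y, z)
            if d_yz > 0:
                break
        else:
            raise ValueError("could not find two pivots at nonzero distance")
        # Assumption: threshold at the median projection of the database.
        projections = sorted(self._project(o, y, z, d_yz) for o in self.database)
        return y, z, d_yz, projections[len(projections) // 2]

    def _key(self, x, bits):
        return tuple(int(self._project(x, y, z, d_yz) >= t)
                     for y, z, d_yz, t in bits)

    def query(self, q):
        """Return (best candidate, its distance, number of candidates refined)."""
        candidates = set()
        for bits, table in zip(self.functions, self.tables):   # filter step
            candidates.update(table.get(self._key(q, bits), ()))
        best, best_d = None, float("inf")
        for idx in candidates:                                  # refine step
            d = self.dist(q, self.database[idx])
            if d < best_d:
                best, best_d = self.database[idx], d
        return best, best_d, len(candidates)


if __name__ == "__main__":
    db = [float(i) for i in range(50)]
    index = DBHIndex(db, lambda a, b: abs(a - b), K=3, L=4)
    print(index.query(17.2))
```

The returned object is the nearest among the candidates only, so it may differ from the true nearest neighbor; if no object collides with the query, the sketch returns `None` with an infinite distance.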
DBH: accuracy vs cost
Accuracy: Percentage of queries for which DBH returns the true NN
Cost: Number of distance computations
Problem: Given a desired accuracy, minimize the cost
Choice of K, L affects the cost and the accuracy
Sampling: approximate distributions
Probability of NNs colliding
Probability of non-NNs colliding
Perform binary search for best (K, L)
[Figure: TRAINING PHASE. Distance matrix and desired accuracy determine K, L, which define the DBH index structure]
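One possible reading of the binary search step is sketched below: for a fixed K, find the smallest L that reaches the desired accuracy, relying on the fact that 1 - (1 - c^K)^L never decreases as L grows. The function names, the cap `L_max`, and the input `sample_c` (sampled estimates of the single-projection collision probability between queries and their true NNs) are assumptions for illustration; the slide does not spell out the search procedure.

```python
from typing import Optional, Sequence


def estimated_accuracy(sample_c: Sequence[float], K: int, L: int) -> float:
    """Mean over sampled queries of 1 - (1 - c^K)^L."""
    return sum(1.0 - (1.0 - c ** K) ** L for c in sample_c) / len(sample_c)


def smallest_L(sample_c: Sequence[float], K: int, target: float,
               L_max: int = 1024) -> Optional[int]:
    """Smallest L in [1, L_max] whose estimated accuracy reaches `target`, else None."""
    if estimated_accuracy(sample_c, K, L_max) < target:
        return None
    lo, hi = 1, L_max
    while lo < hi:
        mid = (lo + hi) // 2
        if estimated_accuracy(sample_c, K, mid) >= target:
            hi = mid          # mid is enough tables; try fewer
        else:
            lo = mid + 1      # need more tables
    return lo


if __name__ == "__main__":
    print(smallest_L([0.5], K=1, target=0.9))
```

Repeating this for each candidate K and keeping the (K, L) pair with the lowest estimated cost would be one way to complete the training phase.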
DBH: accuracy
Probability of collision between any query Q and its nearest neighbor N(Q) for a single projection function:
$C(Q, N(Q)) = \Pr_{h \in H_{DBH}}\big(h(Q) = h(N(Q))\big)$
Employ sampling to estimate C(Q,N(Q))
[Figure: histogram of the estimated C(Q,N(Q)); x-axis from 0.5 to 1, y-axis counts from 0 to 800]
Use K and L to shift distribution to desired accuracy
Probability of collision in at least one of the L K-bit tables:
$C_{K,L}(Q, N(Q)) = 1 - \big(1 - C(Q, N(Q))^K\big)^L$
…and compute
$\mathrm{Accuracy}_{K,L} = \int_{Q \in X} C_{K,L}(Q, N(Q)) \Pr(Q)\, dQ$
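The accuracy estimate above can be illustrated numerically. In this sketch the integral over queries is replaced by an average over a sample, and C(Q, N(Q)) is estimated as the fraction of sampled binary hash functions on which Q and N(Q) agree. All function names are hypothetical, and the toy hash functions in the usage lines are stand-ins, not DBH projections.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def estimate_c(q: T, nn: T, sampled_bits: Sequence[Callable[[T], int]]) -> float:
    """Fraction of sampled binary hash functions on which q and nn collide."""
    agree = sum(1 for h in sampled_bits if h(q) == h(nn))
    return agree / len(sampled_bits)


def collision_prob_kl(c: float, K: int, L: int) -> float:
    """C_{K,L} = 1 - (1 - c^K)^L: collision in at least one of L K-bit tables."""
    return 1.0 - (1.0 - c ** K) ** L


def estimate_accuracy(sample_c: Sequence[float], K: int, L: int) -> float:
    """Sample mean of C_{K,L}(Q, N(Q)), approximating the integral over queries."""
    return sum(collision_prob_kl(c, K, L) for c in sample_c) / len(sample_c)


if __name__ == "__main__":
    # Toy binary functions: threshold a real number at t (t bound per lambda).
    bits = [lambda x, t=t: int(x >= t) for t in (1, 2, 3, 4)]
    c = estimate_c(2.5, 3.5, bits)
    print(c, collision_prob_kl(c, K=2, L=3))
```

Raising K lowers the per-table collision probability c^K, while raising L pushes the "at least one table" probability back up, which is how the two parameters shift the distribution.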
DBH: cost
[Figure: pseudo-line projection F^{y,z}(x) of x onto pivots y, z (hash step), followed by D computations on the colliding objects and a min (lookup step)]
Hash and LookUp
HashCost: number of distance computations needed to evaluate the hash functions
$\mathrm{HashCost}_{K,L}(Q) = 2KL$
LookupCost: number of objects that collide in at least one of the L hash tables
$\mathrm{LookupCost}_{K,L}(Q) = \sum_{x \in U} C_{K,L}(Q, x)$
Query Cost:
$\mathrm{Cost}_{K,L}(Q) = \mathrm{LookupCost}_{K,L}(Q) + \mathrm{HashCost}_{K,L}(Q)$
Total Cost (for all Queries):
$\mathrm{Cost}_{K,L} = \int_{Q \in X} \mathrm{Cost}_{K,L}(Q) \Pr(Q)\, dQ$
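A small numeric sketch of the per-query cost model follows. The function names are hypothetical, and `collision_probs` stands for estimated single-projection collision probabilities C(Q, x) between one query and each database object x, which would come from sampling in practice.

```python
from typing import Sequence


def hash_cost(K: int, L: int) -> int:
    """Two distance computations (to pivots y and z) per projection, K*L projections."""
    return 2 * K * L


def lookup_cost(collision_probs: Sequence[float], K: int, L: int) -> float:
    """Expected number of database objects colliding with Q in at least one table."""
    return sum(1.0 - (1.0 - c ** K) ** L for c in collision_probs)


def query_cost(collision_probs: Sequence[float], K: int, L: int) -> float:
    """Cost_{K,L}(Q) = LookupCost_{K,L}(Q) + HashCost_{K,L}(Q)."""
    return lookup_cost(collision_probs, K, L) + hash_cost(K, L)


if __name__ == "__main__":
    probs = [1.0, 0.5, 0.0]
    for K, L in [(1, 1), (2, 1), (2, 3)]:
        print(K, L, query_cost(probs, K, L))
```

The two terms pull in opposite directions: larger K shrinks the lookup term by making collisions rarer, while every extra projection adds to the hash term, which is why K and L are chosen jointly.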
DBH: further optimization
Hierarchical DBH
1. Build M parallel DBH indices for different subsets of queries
Partition according to distribution D(Q,N(Q))
Queries that are close to their NN are “easier”
2. Reduce HashCost by restricting HDBH to a small subset of database pivot-points for the projections
Experiments
experiments: datasets
We test DBH on 3 datasets:
Unipen (time series ~30 – digits): Dynamic Time Warping, 10K (test: 5K)
MNIST (images 28x28 – digits): Shape Context Matching, 60K (test: 10K)
Hands (images 256x256 – hand-pose): Chamfer Distance, 80K (test: 1K)
experiments: results
Training set used to optimize K, L
Test-set experiment
Compare to a modified VP-tree that handles non-metric data
Accuracy vs Cost plot
X-axis: Accuracy
Y-axis: Distance Computations
conclusion
Distance Based Hashing is a hash-based indexing framework for NN retrieval
No sublinear guarantee, just a speedup
General purpose: no properties assumed for the distance function (black box)
May be further optimized for bigger speedups
Future: Can we build a scheme for a “black box” distance function and provide a statistical argument for behavior that is sublinear in the size of the database?
thank you!
Famous NNs : Castor (Κάστωρ) and Polydeuces (Πολυδεύκης)