Nearest Neighbor Retrieval Using Distance-Based Hashing

Vassilis Athitsos, University of Texas, Arlington
Michalis Potamias+, Boston University
Panagiotis Papapetrou, Boston University
George Kollios, Boston University
nearest neighbor problem

Setting:
  database of objects S
  distance function D

Given:
  query Q (previously unseen)

Find and Return:
  object P* from S that is closest to Q

NNs appear in various applications, under many different distance functions:
  classification of handwritten digits
  hand-pose estimation

Can perform linear scan…
Cost:
  large S
  expensive D
7/7/2015
Distance Based Hashing
2
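The linear-scan baseline from this slide can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `linear_scan_nn` and the toy data are assumptions, and the distance is passed in as a black box so that any D can be plugged in.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def linear_scan_nn(
    query: T, database: Sequence[T], dist: Callable[[T, T], float]
) -> Tuple[T, float, int]:
    """Return (nearest object, its distance, number of distance computations)."""
    if not database:
        raise ValueError("database must not be empty")
    best_obj, best_dist = database[0], dist(query, database[0])
    for obj in database[1:]:
        d = dist(query, obj)
        if d < best_dist:  # ties keep the earlier object
            best_obj, best_dist = obj, d
    # One distance computation per database object: cost grows with |S|.
    return best_obj, best_dist, len(database)


# Toy usage: numbers under absolute difference (hypothetical data).
nearest, distance, n_computed = linear_scan_nn(
    7, [1, 5, 9, 20], lambda a, b: abs(a - b)
)
```

The third return value makes the cost explicit: a linear scan always pays |S| distance computations, which is what becomes prohibitive when S is large and D is expensive.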
cost model

Dominating cost: the distance function may be very “expensive”
  Time series (DTW)
  String Alignment (Edit)
  Computer vision

[Figure: Dynamic Programming for Edit Distance]

Cost Model: minimize number of distance computations
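To make the cost model concrete, here is a small sketch of the edit distance computed by dynamic programming, together with a wrapper that counts how many times a distance function is invoked. The names `edit_distance` and `CountingDistance` are illustrative assumptions; the counter is simply one way to measure cost in the unit this slide proposes.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


class CountingDistance:
    """Wrap a distance function and count how often it is called."""

    def __init__(self, dist):
        self.dist = dist
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return self.dist(a, b)


# Toy usage: a linear scan over four words costs four distance computations.
counted = CountingDistance(edit_distance)
words = ["sitting", "mitten", "kitchen", "bitten"]
closest = min(words, key=lambda w: counted("kitten", w))
```

Each call fills a table of size len(a) × len(b), so every distance computation that an index avoids saves a quadratic amount of work.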
some existing solutions

If objects are low dimensional, exact nearest neighbors are fast

If objects are high dimensional, for some distance functions (Hamming) approximate nearest neighbors are fast, using LSH

However, in many interesting settings (high dimensional, non-metric) “linear scan” may be the only approach for exact NNs
dbh setting

No assumptions for the distance function
  probably non-metric

Distance function computations dominate the cost

Trade perfect accuracy for faster results
dbh method overview

Preprocess:
  Hash database using appropriate functions

Query Q arrives:
  Hash it!
  Filter: Retrieve colliding objects as “candidate NNs”
  Refine: Compute the actual distance between query and candidates
  Return: Candidate that is closest to Q
Background
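Background: hash-based indexing

Building the index
  [Figure: the database is hashed with h1 into a hash table]

Query Time
  [Figure: the query is hashed with h1, D is computed against the colliding objects, and the min is returned]

Use L tables in parallel
  [Figure: L hash tables built with h1, h2, …, hL]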
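The generic scheme on this slide (L hash tables, candidates taken from colliding buckets, then the minimum under D) can be sketched as below. The class name `MultiTableIndex` and the two toy hash functions are assumptions made for illustration; the slide itself does not prescribe any particular hash function.

```python
from collections import defaultdict
from typing import Callable, Hashable, Sequence


class MultiTableIndex:
    """One hash table per hash function; query = filter by collision, refine by D."""

    def __init__(self, hash_functions: Sequence[Callable[[object], Hashable]],
                 dist: Callable[[object, object], float]):
        self.hash_functions = list(hash_functions)
        self.dist = dist
        self.tables = [defaultdict(list) for _ in self.hash_functions]
        self.objects = []

    def add(self, obj):
        idx = len(self.objects)
        self.objects.append(obj)
        for h, table in zip(self.hash_functions, self.tables):
            table[h(obj)].append(idx)

    def query(self, q):
        # Filter: union of the buckets that q falls into across all tables.
        candidates = set()
        for h, table in zip(self.hash_functions, self.tables):
            candidates.update(table.get(h(q), ()))
        if not candidates:
            return None
        # Refine: compute D only for the candidates and return the minimum.
        best = min(candidates, key=lambda i: self.dist(q, self.objects[i]))
        return self.objects[best]


# Toy usage: two shifted bucketings of integers (hypothetical hash functions).
index = MultiTableIndex(
    [lambda x: x // 10, lambda x: (x + 5) // 10],
    lambda a, b: abs(a - b),
)
for value in [3, 12, 27, 48]:
    index.add(value)
```

In this toy setup, the query 8 shares a bucket with 3 in the first table and with 12 in the second, so the second table is what lets the true nearest neighbor 12 into the candidate set. That is the motivation for using several tables in parallel.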
Background: locality sensitive hashing

Choice of Hash Functions is important!
LSH family of functions [IM98]
[Figure: points x, y, z with radii r and cr]

An LSHF in a Hash-based Indexing scheme guarantees sublinear behavior for approximate NNs!
Such families have been constructed for Hamming, L2…

What if there is no LSH family for the Distance function used?
  Edit, DTW etc.
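As a concrete instance of a locality sensitive family, the sketch below uses bit sampling for Hamming distance, the standard construction in which a hash function reads a few randomly chosen coordinates. The helper names are assumptions. For a single sampled coordinate, two vectors of length n at Hamming distance d collide with probability 1 - d/n, so closer vectors collide more often.

```python
import random


def hamming(x, y):
    """Number of coordinates in which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))


def make_bit_sampling_hash(n_bits, k, rng):
    """Return h(x) that concatenates k randomly chosen coordinates of x."""
    positions = [rng.randrange(n_bits) for _ in range(k)]

    def h(x):
        return tuple(x[i] for i in positions)

    return h


def single_bit_collision_probability(x, y):
    """Exact Pr[h(x) == h(y)] when h samples one coordinate uniformly."""
    return 1.0 - hamming(x, y) / len(x)


# Toy usage with two 8-bit vectors that differ in two positions.
x = (0, 1, 1, 0, 1, 0, 0, 1)
y = (0, 1, 0, 0, 1, 0, 1, 1)
h = make_bit_sampling_hash(n_bits=8, k=3, rng=random.Random(0))
p = single_bit_collision_probability(x, y)
```

No comparable construction is available for an arbitrary distance such as edit distance or DTW, which is the gap the question at the end of this slide points to.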
Distance Based Hashing
Hash-based indexing scheme
Can be applied to any space & any D
Its hash functions treat D as a black box
Optimization
DBH: family of hash functions

Pseudo-Line projection [FL95] maps an object into the real line
  y, z are pivot-points from the database
  Project x on the y-z pseudoline
  Use a threshold to make it discrete valued

[Figure: triangle with vertices x, y, z and sides D(x,y), D(x,z), D(y,z); F^{y,z}(x) is the projection of x onto the line through y and z]

$$F^{y,z}(x) = \frac{D(x,y)^2 + D(y,z)^2 - D(x,z)^2}{2\,D(y,z)}$$

-- This family is not an LSHF
++ Definition does not depend on the specific distance function, only on the 3 pairwise distances.
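The projection and its binarization can be sketched as follows. The projection follows the pseudo-line formula F^{y,z}(x) = (D(x,y)^2 + D(y,z)^2 - D(x,z)^2) / (2 D(y,z)). The slide only says that a threshold makes the value discrete; taking the median projection of a sample as that threshold is an assumption made here so that the resulting bit splits the sample roughly in half.

```python
import math
import statistics


def pseudoline_projection(x, y, z, dist) -> float:
    """F^{y,z}(x): project x onto the pseudo-line through pivots y and z."""
    d_yz = dist(y, z)
    if d_yz == 0:
        raise ValueError("pivots must be at non-zero distance")
    return (dist(x, y) ** 2 + d_yz ** 2 - dist(x, z) ** 2) / (2.0 * d_yz)


def make_binary_hash(y, z, dist, sample):
    """Binarize F^{y,z} with one threshold (assumption: the sample median)."""
    threshold = statistics.median(
        pseudoline_projection(s, y, z, dist) for s in sample
    )

    def h(x) -> int:
        return 1 if pseudoline_projection(x, y, z, dist) >= threshold else 0

    return h


# Toy usage: in a Euclidean space the projection is the ordinary one.
proj_1d = pseudoline_projection(3, 0, 10, lambda a, b: abs(a - b))
proj_2d = pseudoline_projection((1, 3), (0, 0), (4, 0), math.dist)
h = make_binary_hash(0, 10, lambda a, b: abs(a - b), sample=[0, 2, 4, 6, 8, 10])
```

Only `dist` is ever called on the objects, which is what lets the same code run unchanged on strings, time series or images once a suitable distance is supplied.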
DBH: method

Preprocessing:
  1. Use a random choice of K of these pseudoline projections to define a hash function
  2. Build L such (K-bit) functions
  3. Hash all objects of S to the L h-tables

At query time:
  1. Apply the same L functions to Q
  2. Filter: Retrieve colliding objects (candidate set)
  3. Refine: Invoke D for candidates
  4. Return: Nearest*
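The preprocessing and query steps above can be put together in a compact sketch. This is an illustration under stated assumptions rather than the authors' implementation: the class name `DBHIndex`, the median threshold for each bit, and the choice to draw pivots from the whole database are all choices made here for brevity.

```python
import random
import statistics
from collections import defaultdict


def projection(x, y, z, d_yz, dist):
    """Pseudo-line projection F^{y,z}(x), with D(y,z) precomputed as d_yz."""
    return (dist(x, y) ** 2 + d_yz ** 2 - dist(x, z) ** 2) / (2.0 * d_yz)


class DBHIndex:
    """Sketch: L hash tables, each keyed by K binarized projections."""

    def __init__(self, database, dist, K, L, seed=0):
        rng = random.Random(seed)
        self.database = list(database)
        self.dist = dist
        self.functions = []  # L entries, each a list of K (y, z, d_yz, threshold)
        self.tables = []     # L dicts mapping a K-bit key to database indices
        for _ in range(L):
            bits = [self._random_bit(rng) for _ in range(K)]
            table = defaultdict(list)
            for i, obj in enumerate(self.database):
                table[self._key(obj, bits)].append(i)
            self.functions.append(bits)
            self.tables.append(table)

    def _random_bit(self, rng):
        # Pick two database pivots at non-zero distance.
        while True:
            y, z = rng.sample(self.database, 2)
            d_yz = self.dist(y, z)
            if d_yz > 0:
                break
        # Assumption: threshold at the median projection, so the bit is balanced.
        threshold = statistics.median(
            projection(s, y, z, d_yz, self.dist) for s in self.database
        )
        return (y, z, d_yz, threshold)

    def _key(self, x, bits):
        return tuple(
            int(projection(x, y, z, d_yz, self.dist) >= t) for y, z, d_yz, t in bits
        )

    def query(self, q):
        """Filter by collisions, refine with D; returns (nearest, #candidates)."""
        candidates = set()
        for bits, table in zip(self.functions, self.tables):
            candidates.update(table.get(self._key(q, bits), ()))
        if not candidates:
            return None, 0
        best = min(candidates, key=lambda i: self.dist(q, self.database[i]))
        return self.database[best], len(candidates)


# Toy usage: integers under absolute difference stand in for a black-box D.
index = DBHIndex(range(0, 100, 5), lambda a, b: abs(a - b), K=2, L=4)
nearest, n_candidates = index.query(35)
```

The returned object is only the nearest among the candidates, which is why the slide marks the result with an asterisk: if the true nearest neighbor collides with the query in none of the L tables, it is missed.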
DBH: accuracy vs cost

Accuracy: Percentage of queries for which DBH returns the true NN
Cost: Number of distance computations
Problem: Given a desired accuracy, minimize the cost
Choice of K, L affects the cost and the accuracy

Sampling: approximate distributions
  Probability of NNs colliding
  Probability of non-NNs colliding
Perform binary search for best (K, L)

[Figure: TRAINING PHASE, with a distance matrix and the desired accuracy as inputs, and K, L and the DBH index structure as outputs]
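One plausible reading of the binary search step is sketched below: for a fixed K, the estimated accuracy can only grow as L grows, so the smallest L that reaches the desired accuracy can be found by bisection. The sketch assumes that a per-query collision probability c with the nearest neighbor has already been estimated by sampling, and that a query finds its neighbor in at least one of L K-bit tables with probability 1 - (1 - c^K)^L. The function names and the `max_L` cap are assumptions.

```python
def collision_in_some_table(c, K, L):
    """Probability of colliding in at least one of L tables of K bits each."""
    return 1.0 - (1.0 - c ** K) ** L


def estimated_accuracy(nn_collision_probs, K, L):
    """Mean collision probability over a sample of queries."""
    total = sum(collision_in_some_table(c, K, L) for c in nn_collision_probs)
    return total / len(nn_collision_probs)


def smallest_L(nn_collision_probs, K, target, max_L=1024):
    """Smallest L reaching the target accuracy for this K, or None if unreachable.

    Bisection is valid because the estimated accuracy is non-decreasing in L.
    """
    if estimated_accuracy(nn_collision_probs, K, max_L) < target:
        return None
    lo, hi = 1, max_L
    while lo < hi:
        mid = (lo + hi) // 2
        if estimated_accuracy(nn_collision_probs, K, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Toy usage: every sampled query collides with its NN with probability 0.5.
L_needed = smallest_L([0.5], K=1, target=0.9)
```

Repeating this for each candidate K and keeping the (K, L) pair with the lowest estimated cost would complete the parameter search; that outer loop is omitted here.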
DBH: accuracy

Probability of collision between any Query Q and its Nearest Neighbor N(Q) for a single projection function:

$$C(Q, N(Q)) = \Pr_{h \in H_{DBH}}\big(h(Q) = h(N(Q))\big)$$

Employ sampling to estimate C(Q,N(Q))

[Figure: histogram; x-axis ticks from 0.5 to 1, y-axis ticks from 0 to 800]

Use K and L to shift distribution to desired accuracy

Probability of collision in at least one of the L K-bit tables:

$$C_{K,L}(Q, N(Q)) = 1 - \big(1 - C(Q, N(Q))^K\big)^L$$

…and compute

$$\mathrm{Accuracy}_{K,L} = \int_{Q \in X} C_{K,L}(Q, N(Q)) \Pr(Q)\, dQ$$
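The sampling estimate and the two formulas on this slide can be sketched numerically. The single-function collision rate is estimated as the fraction of sampled binary hash functions on which Q and N(Q) agree, and the integral over queries is replaced by an average over sampled (query, nearest neighbor) pairs. All names and the toy threshold functions are assumptions for illustration.

```python
def estimate_collision_rate(q, nn, sampled_hashes):
    """Estimate C(Q, N(Q)) as the fraction of sampled functions with h(Q) == h(N(Q))."""
    hits = sum(1 for h in sampled_hashes if h(q) == h(nn))
    return hits / len(sampled_hashes)


def table_collision_probability(c, K, L):
    """C_{K,L} = 1 - (1 - C^K)^L: collide in at least one of L K-bit tables."""
    return 1.0 - (1.0 - c ** K) ** L


def estimated_accuracy(pairs, sampled_hashes, K, L):
    """Average C_{K,L}(Q, N(Q)) over sampled (query, nearest neighbor) pairs."""
    total = 0.0
    for q, nn in pairs:
        c = estimate_collision_rate(q, nn, sampled_hashes)
        total += table_collision_probability(c, K, L)
    return total / len(pairs)


# Toy usage: four threshold functions on numbers play the role of sampled hashes.
hashes = [lambda x, t=t: int(x >= t) for t in (2, 4, 6, 8)]
c_single = estimate_collision_rate(3, 5, hashes)          # only t=4 separates 3 and 5
c_two_tables = table_collision_probability(c_single, K=1, L=2)
acc = estimated_accuracy([(3, 5), (7, 7)], hashes, K=1, L=2)
```

Raising K lowers the per-table collision probability (fewer false candidates), while raising L raises the chance of colliding somewhere, which is the sense in which K and L shift the distribution.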
DBH: cost

[Figure: Hash and LookUp. The pseudo-line projection F^{y,z}(x) is computed from D(x,y), D(x,z) and D(y,z); D is then computed against the colliding objects and the min is returned]

HashCost: Number of distance computations to evaluate hash functions

$$\mathrm{HashCost}_{K,L}(Q) = 2KL$$

LookupCost: number of objects that collide in at least one of the L hash tables

$$\mathrm{LookupCost}_{K,L}(Q) = \sum_{x \in U} C_{K,L}(Q, x)$$

Query Cost:

$$\mathrm{Cost}_{K,L}(Q) = \mathrm{LookupCost}_{K,L}(Q) + \mathrm{HashCost}_{K,L}(Q)$$

Total Cost (for all Queries):

$$\mathrm{Cost}_{K,L} = \int_{Q \in X} \mathrm{Cost}_{K,L}(Q) \Pr(Q)\, dQ$$
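The cost formulas can be evaluated directly once collision probabilities between a query and the database objects have been estimated. The sketch below assumes those single-function probabilities are given as plain lists, and it replaces the integral over queries with an average over a sample; the function names are illustrative.

```python
def table_collision_probability(c, K, L):
    """C_{K,L} = 1 - (1 - C^K)^L: collide in at least one of L K-bit tables."""
    return 1.0 - (1.0 - c ** K) ** L


def hash_cost(K, L):
    """Two distance computations (to pivots y and z) per bit, K*L bits in total."""
    return 2 * K * L


def lookup_cost(collision_probs, K, L):
    """Expected number of database objects that collide with Q in some table."""
    return sum(table_collision_probability(c, K, L) for c in collision_probs)


def query_cost(collision_probs, K, L):
    return lookup_cost(collision_probs, K, L) + hash_cost(K, L)


def total_cost(per_query_collision_probs, K, L):
    """Average query cost over a sample of queries."""
    costs = [query_cost(probs, K, L) for probs in per_query_collision_probs]
    return sum(costs) / len(costs)


# Toy usage: one query against three database objects (hypothetical probabilities).
probs = [1.0, 0.5, 0.0]
cost = query_cost(probs, K=1, L=2)
avg = total_cost([probs, [0.0, 0.0, 0.0]], K=1, L=2)
```

The comparison point is the linear scan, whose cost is simply the database size; an index pays off only when the lookup cost plus the 2KL hashing overhead stays below that number.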
DBH: further optimization

Hierarchical DBH
  1. Build M parallel DBH indices for different subsets of queries
       Partition according to distribution D(Q,N(Q))
       Queries that are close to their NN are “easier”
  2. Reduce HashCost by restricting HDBH to a small subset of database pivot-points for the projections
Experiments
experiments: datasets

We test DBH on 3 datasets:

  Unipen (timeseries ~30 – digits)
    Dynamic Time Warping
    10K (test: 5K)

  MNIST (images 28x28 – digits)
    Shape Context Matching
    60K (test: 10K)

  Hands (images 256x256 – hand-pose)
    Chamfer Distance
    80K (test: 1K)
experiments: results

Training-set: to optimize K, L

Test-set: experiment

Compare to modified VP-tree
  handles non-metric data

Accuracy vs Cost plot
  X-axis: Accuracy
  Y-axis: Distance Computations
experiments: results
conclusion

Distance Based Hashing is a hash-based indexing framework for NN retrieval
  Not sublinear, just speedup
  General purpose: No properties assumed for distance function - black box
  May be further optimized for bigger speedups

Future: Can we build a scheme for a “black box” distance function and provide a statistical argument for behavior that is sublinear in the size of the database?
thank you!
Famous NNs : Castor (Κάστωρ) and Polydeuces (Πολυδεύκης)