
SIMILARITY SEARCH
The Metric Space Approach
Pavel Zezula, Giuseppe Amato,
Vlastislav Dohnal, Michal Batko
Table of Contents
Part I: Metric searching in a nutshell
- Foundations of metric space searching
- Survey of existing approaches
Part II: Metric searching in large collections
- Centralized index structures
- Approximate similarity search
- Parallel and distributed indexes
Approximate similarity search

- Approximate similarity search overcomes problems of exact similarity search with traditional access methods:
  - exact similarity search returns mathematically precise result sets;
  - it offers only a moderate improvement of performance with respect to a sequential scan;
  - it suffers from the dimensionality curse.
- Similarity is subjective, so in some cases approximate result sets also satisfy the user.
- Approximate similarity search processes queries faster at the price of imprecision in the returned result sets.
- It is useful, for instance, in interactive systems:
  - similarity search is an iterative process where temporary results are used to formulate new queries;
  - improvements of up to two orders of magnitude are possible.
Approximate similarity search

- Approximation strategies:
  - Relaxed pruning conditions
    - Data regions overlapping the query region may be discarded, depending on the specific strategy.
  - Early termination of the search algorithm
    - The search algorithm may stop before all regions have been accessed.
Approximate Similarity Search
1. relative error approximation (pruning condition)
   - Range and k-NN search queries
2. good fraction approximation
3. small chance improvement approximation
4. proximity-based approximation
5. PAC nearest neighbor searching
6. performance trials
Relative error approximation

- Let oN be the nearest neighbour of q. If

  $$\frac{d(o^A, q)}{d(o^N, q)} \le 1 + \epsilon$$

  then oA is the (1+ε)-approximate nearest neighbor of q.
- This can be generalized to the k-th nearest neighbor:

  $$\frac{d(o_k^A, q)}{d(o_k^N, q)} \le 1 + \epsilon$$
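As a quick numeric illustration (the distances below are invented for this example, not taken from the slides): with ε = 0.2, a candidate oA at distance 1.15 from q is a 1.2-approximate nearest neighbor whenever the true nearest neighbor oN lies at distance 1.0, since

$$\frac{d(o^A, q)}{d(o^N, q)} = \frac{1.15}{1.00} = 1.15 \le 1 + 0.2 .$$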
Relative error approximation

- Exact pruning strategy: a data region with center p and covering radius rp overlaps the query region (q, rq), and thus must be accessed, when

  $$d(q, p) \le r_q + r_p$$

[Figure: query ball centred at q with radius rq intersecting a data ball centred at p with radius rp.]
Relative error approximation

- Approximate pruning strategy: the query radius is shrunk by the factor 1+ε, so a data region is accessed only when

  $$d(q, p) \le \frac{r_q}{1+\epsilon} + r_p$$

[Figure: the same two balls; the reduced radius rq/(1+ε) lets the algorithm skip regions that intersect only the outer part of the query ball.]
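A minimal sketch of how these two tests could look in code; the `Ball` class and function names are hypothetical helpers for illustration, not the book's or the M-Tree's actual API:

```python
from dataclasses import dataclass

@dataclass
class Ball:
    """A ball region: routing object (center) plus covering radius."""
    center: object
    radius: float

def overlaps_exact(d, q, r_q, region):
    """Exact test: the region must be visited if it intersects the query ball."""
    return d(q, region.center) <= r_q + region.radius

def overlaps_approximate(d, q, r_q, region, eps):
    """Relaxed test: the query radius is shrunk by (1 + eps), so regions that
    intersect only the outer part of the query ball are skipped."""
    return d(q, region.center) <= r_q / (1.0 + eps) + region.radius

# Purely illustrative check with a 1-D metric:
d = lambda a, b: abs(a - b)
region = Ball(center=10.0, radius=2.0)
print(overlaps_exact(d, 5.0, 3.5, region))                  # True:  5 <= 3.5 + 2
print(overlaps_approximate(d, 5.0, 3.5, region, eps=1.0))   # False: 5 >  1.75 + 2
```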
Approximate Similarity Search
1. relative error approximation (pruning condition)
   - Range and k-NN search queries
2. good fraction approximation (stop condition)
   - k-NN search queries
3. small chance improvement approximation
4. proximity-based approximation
5. PAC nearest neighbor searching
6. performance trials
Good fraction approximation

- The k-NN algorithm determines the final result by iteratively reducing the distances of the current result set.
- When the current result set belongs to a specified fraction of the objects closest to the query, the approximate algorithm stops.
  - Example: stop when the current result set belongs to the 10% of the objects closest to the query.
Good fraction approximation

- For this strategy we use the distance distribution, defined as

  $$F_q(x) = \Pr\{d(o, q) \le x\}$$

- The distance distribution Fq(x) specifies the probability that the distance of a random object o from q is smaller than x.
- It is easy to see that Fq(x) gives, in probabilistic terms, the fraction of the database corresponding to the set of objects whose distance from q is smaller than x.
Good fraction approximation
[Figure: plot of the distance distribution Fq(x); the highlighted value Fq(d(q, ok)) is the fraction of the data set whose distances from q are smaller than d(q, ok).]
Good fraction approximation

- When Fq(d(q, ok)) < r, all objects of the current result set belong to the fraction r of the dataset closest to the query, so the algorithm can stop.

[Figure: query q with its current k-th nearest neighbour ok.]
Good fraction approximation

- Fq(x) is difficult to handle, since we would need to compute it for every possible query.
- It was proven that the overall distance distribution F(x), defined as

  $$F(x) = \Pr\{d(o_1, o_2) \le x\}$$

  can be used in practice instead of Fq(x), since the two have statistically the same behaviour.
- F(x) can easily be estimated as a discrete function and maintained in main memory.
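A sketch of how F(x) might be estimated as a discrete function and then used as the stop condition. The sampling size, function names and threshold below are illustrative assumptions, not the book's implementation:

```python
import bisect
import random

def estimate_overall_distribution(objects, d, n_pairs=10_000, seed=0):
    """Estimate F(x) = Pr{d(o1, o2) <= x} from the distances of random pairs.
    The sorted sample is small enough to be kept in main memory."""
    rng = random.Random(seed)
    return sorted(d(rng.choice(objects), rng.choice(objects))
                  for _ in range(n_pairs))

def F(sample, x):
    """Fraction of sampled pair distances that are <= x."""
    return bisect.bisect_right(sample, x) / len(sample)

def good_fraction_stop(sample, d_q_ok, frac=0.10):
    """Stop when the current k-th neighbour already lies within the requested
    fraction of the data set closest to the query (e.g. the closest 10%)."""
    return F(sample, d_q_ok) <= frac
```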
Approximate Similarity Search
1. relative error approximation (pruning condition)
   - Range and k-NN search queries
2. good fraction approximation (stop condition)
   - k-NN search queries
3. small chance improvement approximation (stop condition)
   - k-NN search queries
4. proximity-based approximation
5. PAC nearest neighbor searching
6. performance trials
Small chance improvement approximation

- The M-Tree's k-NN algorithm determines the final result by incrementally improving the current result set.
- At each step of the algorithm the temporary result is improved and the distance of the k-th element decreases.
- When the improvement of the temporary result set slows down, the algorithm can stop.
Small chance improvement approximation

- Consider the distance of the current k-th neighbour as a function of the iteration number:

  $$f(x) := d(q, o_k^A)$$

[Figure: f(x), the distance of the current k-th neighbour, plotted against the iteration number (0–1500).]
Small chance improvement approximation

- The function f(x) is not known a priori.
- A regression curve φ(x), which approximates f(x), is computed with the least squares method while the algorithm proceeds.
- Through the derivative of φ(x) it is possible to decide when the algorithm should stop.
Small chance improvement approximation

- The regression curve has the form

  $$\varphi(x) = c_1 \varphi_1(x) + c_2$$

  where c1 and c2 are chosen so that

  $$\sum_{i} \left( c_1 \varphi_1(i) + c_2 - f(i) \right)^2$$

  is minimal (the sum runs over the iterations performed so far).
- Both φ1(x) = ln(x) and φ1(x) = 1/x have been used.
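A sketch of the regression-based stop test, here with φ1(x) = ln(x) and a plain least-squares fit; the minimum number of points and the slope threshold are illustrative assumptions:

```python
import math

def fit_regression(distances, phi1=math.log):
    """Least-squares fit of phi(x) = c1*phi1(x) + c2 to the observed distances
    f(1), f(2), ... of the current k-th neighbour, one value per iteration."""
    xs = range(1, len(distances) + 1)   # start at 1 so ln(x) and 1/x are defined
    px = [phi1(x) for x in xs]
    n = len(distances)
    mean_p, mean_f = sum(px) / n, sum(distances) / n
    var = sum((p - mean_p) ** 2 for p in px)
    c1 = sum((p - mean_p) * (f - mean_f) for p, f in zip(px, distances)) / var
    c2 = mean_f - c1 * mean_p
    return c1, c2

def should_stop(distances, slope_threshold=1e-4, min_points=10):
    """Stop when the fitted curve is nearly flat: for phi1 = ln, the derivative
    of phi(x) at the current iteration x is c1 / x."""
    if len(distances) < min_points:
        return False
    c1, _ = fit_regression(distances)
    return abs(c1 / len(distances)) < slope_threshold
```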
Regression curves

[Figure: the measured distance per iteration together with the hyperbolic and logarithmic regression curves fitted to it.]
Approximate Similarity Search
1. relative error approximation (pruning condition)
   - Range and k-NN search queries
2. good fraction approximation (stop condition)
   - k-NN search queries
3. small chance improvement approximation (stop condition)
   - k-NN search queries
4. proximity-based approximation (pruning condition)
   - Range and k-NN search queries
5. PAC nearest neighbor searching
6. performance trials
Proximity-based approximation

- Regions whose probability of containing qualifying objects is below a given threshold are pruned, even if they overlap the query region.
  - The proximity of two regions is defined as the probability that a randomly chosen object appears in both regions.
- This resulted in a performance improvement of up to two orders of magnitude, both for range queries and for nearest neighbour queries.
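One simple way to picture the pruning rule in code is to estimate proximity from a sample of the database, as sketched below; the book derives proximity analytically from distance distributions, so the sampling approach, the threshold value and the names here are only illustrative:

```python
def proximity(sample, d, q_center, q_radius, p_center, p_radius):
    """Estimate the proximity of the query ball and a data ball as the
    fraction of sampled objects that fall inside both regions."""
    inside_both = sum(1 for o in sample
                      if d(o, q_center) <= q_radius and d(o, p_center) <= p_radius)
    return inside_both / len(sample)

def can_prune(sample, d, q_center, q_radius, p_center, p_radius, threshold=0.01):
    """Prune a region whose estimated probability of containing qualifying
    objects is below the threshold, even if it overlaps the query ball."""
    return proximity(sample, d, q_center, q_radius, p_center, p_radius) < threshold
```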
Proximity-based approximation
[Figure: query region centred at q and data regions R1, R2, R3; regions with low proximity to the query region can be pruned even when they intersect it.]
Approximate Similarity Search
1. relative error approximation (pruning condition)
   - Range and k-NN search queries
2. good fraction approximation (stop condition)
   - k-NN search queries
3. small chance improvement approximation (stop condition)
   - k-NN search queries
4. proximity-based approximation (pruning condition)
   - Range and k-NN search queries
5. PAC nearest neighbor searching (pruning & stop)
   - 1-NN search queries
6. performance trials
PAC nearest neighbour searching

- It uses, at the same time, a relaxed branching condition and a stop condition:
  - the relaxed branching condition is the same one used in the relative error approximation to find a (1+ε)-approximate nearest neighbor;
  - in addition, the algorithm halts prematurely when it is sufficiently confident that the (1+ε)-approximate nearest neighbor has already been found, i.e. when the probability of a larger error drops below a threshold δ.
- It can only be used for 1-NN search queries.
PAC nearest neighbour searching

- Let us suppose that the nearest neighbour found so far is oA.
- Let εact be the actual relative error on the distance of oA:

  $$\epsilon_{act} = \frac{d(o^A, q)}{d(o^N, q)} - 1$$

- The algorithm stops if

  $$\Pr\{\epsilon_{act} > \epsilon\} \le \delta$$

- The above probability is obtained by computing the distribution of the distance of the nearest neighbor.
PAC nearest neighbour searching

- Distribution of the distance of the nearest neighbor in X (of cardinality n) with respect to q:

  $$G_q(x) = \Pr\{\exists o \in X : d(q, o) \le x\} = 1 - (1 - F_q(x))^n$$

- Given that

  $$\Pr\{\epsilon_{act} > \epsilon\} = \Pr\{\exists o \in X : d(q, o^A)/d(q, o) > 1 + \epsilon\}$$
  $$= \Pr\{\exists o \in X : d(q, o) < d(q, o^A)/(1 + \epsilon)\} = G_q\!\left(d(q, o^A)/(1 + \epsilon)\right),$$

  the algorithm halts when

  $$G_q\!\left(d(q, o^A)/(1 + \epsilon)\right) \le \delta$$
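A sketch of the PAC halting test; Fq(x) is approximated here by the discrete overall distribution F(x) estimated from sampled pairs (as in the earlier good fraction sketch), and the parameter names are illustrative:

```python
import bisect

def F(sample, x):
    """Discrete estimate of F(x): fraction of sampled pair distances <= x."""
    return bisect.bisect_right(sample, x) / len(sample)

def G(sample, n, x):
    """G_q(x) = 1 - (1 - F_q(x))^n: probability that at least one of the
    n database objects lies within distance x of the query."""
    return 1.0 - (1.0 - F(sample, x)) ** n

def pac_should_stop(sample, n, d_q_oa, eps, delta):
    """Halt when the probability that the relative error of the current
    candidate oA exceeds eps is at most delta:
    G_q(d(q, oA) / (1 + eps)) <= delta."""
    return G(sample, n, d_q_oa / (1.0 + eps)) <= delta
```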
Approximate Similarity Search
1. relative error approximation (pruning condition)
   - Range and k-NN search queries
2. good fraction approximation (stop condition)
   - k-NN search queries
3. small chance improvement approximation (stop condition)
   - k-NN search queries
4. proximity-based approximation (pruning condition)
   - Range and k-NN search queries
5. PAC nearest neighbor searching (pruning & stop)
   - 1-NN search queries
6. performance trials
Comparison tests

- Tests on a dataset of 11,000 objects.
  - Objects are vectors of 45 dimensions.
- We compared the five approximation approaches:
  - range queries were tested on the methods that support them:
    - relative error approximation
    - proximity-based approximation
  - nearest-neighbour queries were tested on all methods.
Comparisons: range queries

[Figure: relative error approximation; improvement in efficiency (IE) versus the approximation parameter R for range queries with radii r = 1,800, 2,200, 2,600 and 3,000.]
Comparisons: range queries

[Figure: proximity-based approximation; IE versus the approximation parameter R for range queries with radii r = 1,800, 2,200, 2,600 and 3,000.]
Comparisons: NN queries

[Figure: relative error approximation; IE versus EP for k = 1, 3, 10 and 50.]
Comparisons: NN queries

[Figure: good fraction approximation; IE versus EP for k = 1, 3, 10 and 50.]
Comparisons: NN queries

[Figure: small chance improvement approximation; IE versus EP for k = 1, 3, 10 and 50.]
Comparisons: NN queries

[Figure: proximity-based approximation; IE versus EP for k = 1, 3, 10 and 50.]
Comparisons: NN queries

[Figure: PAC nearest neighbour searching; IE versus EP for ε = 2, 3 and 4.]
Conclusions: Approximate similarity search in metric spaces

- These techniques for approximate similarity search can be applied to generic metric spaces.
  - Vector spaces are a special case of metric spaces.
- High accuracy of the approximate results is generally obtained together with a high improvement of efficiency.
  - The best performance is obtained with the good fraction approximation method.
  - The proximity-based approximation is slightly worse than the good fraction approximation, but it can be used for range queries as well as k-NN queries.