Query Sensitive Embeddings

Vassilis Athitsos,
Marios Hadjieleftheriou,
George Kollios,
Stan Sclaroff
Abstract

A common problem in many types of databases is retrieving the most similar matches to a query object. Finding those matches in a large database can be too slow to be practical, especially in domains where objects are compared using computationally expensive similarity (or distance) measures.

[Presenter's note: Comparing a lot of high-dimensional objects can be expensive.]

This paper proposes a novel method for approximate nearest neighbor retrieval in such spaces. Our method is embedding-based, meaning that it constructs a function that maps objects into a real vector space. The mapping preserves a large amount of the proximity structure of the original space, and it can be used to rapidly obtain a short list of likely matches to the query.

[Presenter's note: This paper proposes a way to reduce the dimensionality/cost of the comparisons…]

The main novelty of our method is that it constructs, together with the embedding, a query-sensitive distance measure that should be used when measuring distances in the vector space. The term “query sensitive” means that the distance measure changes depending on the current query object.

[Presenter's note: …by training an algorithm to give different weights to different measures depending on the query.]

We report experiments with an image database of handwritten digits, and a time-series database. In both cases, the proposed method outperforms existing state-of-the-art embedding methods, meaning that it provides significantly better trade-offs between efficiency and retrieval accuracy.

[Presenter's note: Tests on real-world data show better efficiency/accuracy trade-offs than the best known algorithms.]
Authors’ Previous Work

BoostMap: A method for efficient approximate similarity rankings. (2004)
Athitsos, Sclaroff, Kollios

Indexing multi-dimensional time-series with support for multiple distance measures. (2003)
Hadjieleftheriou
Terms/Concepts

Embedding

Maps any high-dimensional object into
a d-dimensional vector.
[Figure: a high-dimensional object mapped to a lower-dimensional vector.]
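
A minimal Python sketch of one standard way such an embedding can be built (not necessarily the paper's exact construction): each output coordinate is the true distance from the object to a fixed reference object. The names embed, reference_objects, and true_distance are hypothetical.

    def embed(x, reference_objects, true_distance):
        # One output coordinate per reference object: the true
        # distance from x to that reference object.
        return [true_distance(x, r) for r in reference_objects]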
…more Terms/Concepts

Classifier

Given 3 points q, a & b, which is closer
to q, a or b?
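
In code, the ground-truth answer for a triple comes from the true distance measure; a minimal sketch (true_distance is a hypothetical stand-in):

    def proximity_label(q, a, b, true_distance):
        # +1 if a is closer to q than b is, -1 otherwise.
        return 1 if true_distance(q, a) < true_distance(q, b) else -1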
…more Terms/Concepts

Distance Measure


A metric or non-metric measure of the true proximity of any two objects.
Metrics

How close is:
[ 2 5 72 3 5 ] to [ 5 5 45 1 1 ]?

Euclidean (L2) Distance
sqrt((2-5)^2 + (5-5)^2 + (72-45)^2 + (3-1)^2 + (5-1)^2)

[ 1 3 ] to [ 5 7 ]

Manhattan (L1) Distance
 |1-5| + |3-7|
…more Terms/Concepts

Splitter

Returns 1 if a given object is in a
particular group (defined for that
splitter), 0 otherwise.
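
A minimal sketch of one plausible splitter family, assuming the group is defined by a cheap 1D embedding f and a value range (the interval form and all names here are assumptions, not taken from the paper):

    def make_splitter(f, lo, hi):
        # f is a cheap 1D embedding; the splitter's "group" is the
        # set of objects whose value f(x) falls in [lo, hi].
        def splitter(x):
            return 1 if lo <= f(x) <= hi else 0
        return splitter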
Related Work

Hash/Tree Structures for High
Dimensional Data

Problems?
Performance degrades in high dimensions.
Tree-based methods rely on Euclidean/metric properties, which do not hold in non-metric spaces.


AdaBoost

“Adaptive Boosting” generates new
classifiers based on the failures of
previous classifiers.
Motivations for Query Sensitive
Distance Measures

Lack of Contrasting

Two high-dimensional objects are unlikely to be similar in all dimensions.
[Figure: two example vectors that match in a few coordinates but differ widely in the rest.]
Motivations for Query Sensitive
Distance Measures

Statistical Sensitivity

Data is rarely uniformly distributed, so for any given object there may be relatively few coordinates that are statistically significant for it.
Simple Embeddings

It is assumed that an embedding (F) maps an object into a vector space in which distances are significantly cheaper to compute than the ‘true’ distance (Dx).
Weak Classifiers

Given a triple (q, a, b), a simple embedding F correctly classifies the triple more than 50% of the time (see the sketch below).
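
Concretely, a 1D embedding F votes on a triple by comparing distances in the embedded space; it only needs to beat random guessing to be useful. A sketch:

    def weak_classify(F, q, a, b):
        # Positive margin: F says a is closer to q; negative: b is.
        return abs(F(q) - F(b)) - abs(F(q) - F(a))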
Key to Query Sensitive Embeddings



Multiple embeddings are significantly
cheaper to compute than the actual
distance between two objects.
Many weak classifiers can be combined to
create a strong classifier (proven in the
BoostMap paper)
Each classifier can be assigned a different
weight depending upon the query via
splitters.
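
A minimal sketch of that last idea, assuming a classifier's weight alpha is simply switched on only for queries in its splitter's group (a simplification of the paper's construction):

    def query_sensitive_weight(alpha, splitter, q):
        # The classifier's weight alpha counts only for queries
        # inside the splitter's group; elsewhere it contributes 0.
        return alpha if splitter(q) == 1 else 0.0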
Constructing an Embedding & Query
Sensitive Distance Measure (via BoostMap)
1. Specify a large family of 1D embeddings.
2. Use the embeddings to specify binary classifiers on object triples (q, a, b).
3. Combine the many classifiers into a single classifier H using AdaBoost (see the sketch below).
4. Use H to define query-sensitive embeddings and distance measures.
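
A hedged sketch of step 3 in standard discrete-AdaBoost form (the weak-learner pool and reweighting details are simplified relative to the paper):

    import math

    def adaboost(triples, labels, weak_classifiers, rounds):
        # triples: list of (q, a, b); labels: +1 if a is truly closer
        # to q, else -1; weak_classifiers: callables mapping a triple
        # to +1/-1. Returns H as a list of (alpha, classifier) pairs.
        n = len(triples)
        weights = [1.0 / n] * n
        H = []
        for _ in range(rounds):
            # Pick the weak classifier with the lowest weighted error.
            def weighted_error(h):
                return sum(w for w, t, y in zip(weights, triples, labels)
                           if h(*t) != y)
            h = min(weak_classifiers, key=weighted_error)
            err = min(max(weighted_error(h), 1e-10), 1 - 1e-10)
            alpha = 0.5 * math.log((1 - err) / err)
            H.append((alpha, h))
            # Increase the weight of triples this classifier got wrong.
            weights = [w * math.exp(-alpha * y * h(*t))
                       for w, t, y in zip(weights, triples, labels)]
            total = sum(weights)
            weights = [w / total for w in weights]
        return H

The combined classifier H then labels a triple by the sign of the alpha-weighted sum of its members' votes.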
Result: Fout & Dout




Fout is a d-dimensional embedding
composed of d 1D embeddings from H.
Dout is a distance measure on vectors produced by Fout.
Dout is like a weighted L1 measure, but it is query-sensitive, and it is neither symmetric nor a metric (see the sketch below).
Because Dout is query-sensitive, the classifier H produced by AdaBoost can be expressed exactly in terms of Fout and Dout.
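
A minimal sketch of a query-sensitive weighted-L1 measure in this spirit (weights_q would come from the query's splitters; all names are hypothetical):

    def d_out(F_q, F_x, weights_q):
        # Weighted L1 between embedded vectors; weights_q are the
        # per-dimension weights computed from the query q, so swapping
        # the two arguments generally changes the result (asymmetry).
        return sum(w * abs(fq - fx)
                   for w, fq, fx in zip(weights_q, F_q, F_x))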
Training


Original BoostMap chose triples
randomly.
Can achieve better results by
choosing triples similar to the
results you want to retrieve.

If you want k-nearest neighbor, choose
triples (q,a,b) such that a is a knearest neighbor of q, but b is not.
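
A sketch of that sampling scheme, assuming true nearest-neighbor lists have been precomputed for the training queries (the knn mapping, and the hashability of objects, are assumptions of this sketch):

    import random

    def sample_knn_triple(q, knn, database, k):
        # knn maps each training query to its true neighbors, nearest
        # first; a comes from the top k, b from everything else.
        top_k = set(knn[q][:k])
        a = random.choice(knn[q][:k])
        b = random.choice([x for x in database
                           if x not in top_k and x != q])
        return (q, a, b)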
Complexity

BoostMap requires one-time
training. O(mt)


Online retrieval takes O(d) time


Other methods require no training.
Cost is “similar” to other methods.
Other Methods:



FastMap
SparseMap
Metric Map
Filter & Refine Retrieval
1. Compute the embedding for the query object and any reference/pivot objects.
2. Find the database objects with the most similar vectors.
3. Sort the results by the true distance measure (see the sketch below).
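
Putting the pieces together, a self-contained sketch of filter-and-refine retrieval; embed is assumed to have its reference objects baked in, query_weights supplies the query-sensitive weights, and p (the short-list size) is a tuning knob, all hypothetical names:

    def retrieve(query, database, embed, query_weights, true_distance, k, p):
        # Filter: rank the whole database by the cheap query-sensitive
        # weighted-L1 distance in embedding space; keep p candidates.
        F_q = embed(query)
        w_q = query_weights(query)

        def cheap_distance(x):
            return sum(w * abs(fq - fx)
                       for w, fq, fx in zip(w_q, F_q, embed(x)))

        candidates = sorted(database, key=cheap_distance)[:p]
        # Refine: re-rank only those p candidates by the expensive
        # true distance; return the k best.
        return sorted(candidates, key=lambda x: true_distance(query, x))[:k]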
Experimental Results


Query-Sensitive Embeddings lead to
better performance than
embeddings using a global L1
distance measure.
Outperforms FastMap and the original BoostMap.
Conclusions & Further Work


Embeddings are the only family of methods that are both efficient and not domain-specific.
How can this algorithm be applied to choosing a meaningful distance measure for high-dimensional vectors?