Query Sensitive Embeddings
Vassilis Athitsos,
Marios Hadjieleftheriou,
George Kollios,
Stan Sclaroff
Abstract
A common problem in many types of databases is retrieving the most similar matches to a query object. Finding those matches in a large database can be too slow to be practical, especially in domains where objects are compared using computationally expensive similarity (or distance) measures. This paper proposes a novel method for approximate nearest neighbor retrieval in such spaces. Our method is embedding-based, meaning that it constructs a function that maps objects into a real vector space. The mapping preserves a large amount of the proximity structure of the original space, and it can be used to rapidly obtain a short list of likely matches to the query. The main novelty of our method is that it constructs, together with the embedding, a query sensitive distance measure that should be used when measuring distances in the vector space. The term "query sensitive" means that the distance measure changes depending on the current query object. We report experiments with an image database of handwritten digits, and a time-series database. In both cases, the proposed method outperforms existing state-of-the-art embedding methods, meaning that it provides significantly better trade-offs between efficiency and retrieval accuracy.
In short: Comparing a lot of high-dimensional objects can be expensive. This paper proposes a way to reduce the dimensionality/cost of the comparisons by training an algorithm to give different weights to different measures depending on the query. Tests on real-world data show better efficiency/accuracy trade-offs than the known best algorithms.
Authors’ Previous Work
BoostMap: A method for efficient
approximate similarity rankings.
(2004)
Athitsos, Sclaroff, Kollios
Indexing multi-dimensional time-series with support for multiple
distance measures. (2003)
Hadjieleftheriou
Terms/Concepts
Embedding
Maps any high-dimensional object into
a d-dimensional vector.
(Slide illustration: example objects mapped to short numeric vectors.)
…more Terms/Concepts
Classifier
Given three points q, a, and b: which of
a or b is closer to q?
…more Terms/Concepts
Distance Measure
Metric/Non-Metric measure of the true
proximity of any 2 objects.
Metrics
How close is [2, 5, 72, 3, 5] to [5, 5, 45, 1, 1]?
Euclidean (L2) distance:
√((2−5)² + (5−5)² + (72−45)² + (3−1)² + (5−1)²)
How close is [1, 3] to [5, 7]?
Manhattan (L1) distance:
|1−5| + |3−7| = 8
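The two metrics above are easy to sketch in a few lines of Python; the helper names below are illustrative, and the example values are the ones from the slide.

```python
import math

def euclidean(u, v):
    # L2 distance: square root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # L1 distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(u, v))

d2 = euclidean([2, 5, 72, 3, 5], [5, 5, 45, 1, 1])  # sqrt(9 + 0 + 729 + 4 + 16)
d1 = manhattan([1, 3], [5, 7])                       # 4 + 4 = 8
```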
…more Terms/Concepts
Splitter
Returns 1 if a given object is in a
particular group (defined for that
splitter), 0 otherwise.
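One simple way to realize a splitter (an illustrative sketch, not necessarily the paper's exact construction) is to threshold the distance to a pivot object: the splitter's group is "everything within the threshold of the pivot."

```python
def make_splitter(pivot, threshold, distance):
    # Returns a 0/1 function: 1 if the object falls inside this splitter's
    # group (within `threshold` of `pivot` under `distance`), 0 otherwise.
    # The pivot/threshold form is one plausible instantiation.
    def splitter(obj):
        return 1 if distance(obj, pivot) <= threshold else 0
    return splitter

l1 = lambda u, v: sum(abs(a - b) for a, b in zip(u, v))
s = make_splitter([0, 0], 5, l1)
s([1, 3])  # inside the group -> 1
s([5, 7])  # outside the group -> 0
```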
Related Work
Hash/Tree Structures for High
Dimensional Data
Problems?
Degrades in high dimensions.
Tree-based methods rely on
Euclidean/metric properties, which do not
hold in non-metric spaces.
AdaBoost
“Adaptive Boosting” generates new
classifiers based on the failures of
previous classifiers.
Motivations for Query Sensitive
Distance Measures
Lack of Contrast
Two high-dimensional objects are
unlikely to be similar in all dimensions.
(Slide illustration: two vectors whose coordinates mostly agree, e.g. 4 vs 4 and 5 vs 5, but differ sharply in others, e.g. 6 vs 23 and 2 vs 20.)
Motivations for Query Sensitive
Distance Measures
Statistical Sensitivity
Data is rarely uniformly distributed, so
for any two objects there may be
relatively few coordinates that are
statistically significant for comparing them.
Simple Embeddings
It is assumed that an embedding
(F) maps objects into a vector
space where distances are significantly
cheaper to compute than the 'true'
distance (Dx).
Weak Classifiers
Given a triple (q, a, b), a simple
embedding F correctly classifies the
triple more than 50% of the time
(i.e., slightly better than chance).
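A standard family of 1D embeddings (used in BoostMap) maps each object to its true distance from a fixed reference object; each such embedding yields a weak classifier on triples. The sketch below assumes that family; function names are illustrative.

```python
def reference_embedding(r, D):
    # 1D embedding: maps object x to its true distance from reference object r.
    return lambda x: D(x, r)

def classify_triple(F, q, a, b):
    # +1 if the embedding F says a is closer to q than b is,
    # -1 if it says b is closer, 0 on a tie.
    margin = abs(F(q) - F(b)) - abs(F(q) - F(a))
    return (margin > 0) - (margin < 0)

l1 = lambda u, v: sum(abs(x - y) for x, y in zip(u, v))
F = reference_embedding([0, 0], l1)
classify_triple(F, [1, 1], [2, 2], [9, 9])  # -> 1 (a looks closer to q than b)
```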
Key to Query Sensitive Embeddings
Multiple embeddings are significantly
cheaper to compute than the actual
distance between two objects.
Many weak classifiers can be combined to
create a strong classifier (proven in the
BoostMap paper)
Each classifier can be assigned a different
weight depending upon the query via
splitters.
Constructing an Embedding & Query
Sensitive Distance Measure (via BoostMap)
1. Specify a large family of 1D
embeddings.
2. Use the embeddings to specify binary
classifiers on object triples (q, a, b).
3. Combine the many classifiers into a
single classifier H using AdaBoost.
4. Use H to define query sensitive
embeddings and distance measures.
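Step 3, the AdaBoost combination, can be sketched as follows. This is a bare-bones boosting loop over precomputed weak-classifier outputs on training triples, not the full BoostMap trainer; all names are illustrative.

```python
import math

def adaboost(weak_outputs, labels, rounds):
    # weak_outputs[j][i]: output (+1/-1) of weak classifier j on training triple i.
    # labels[i]: ground truth (+1 if a is truly closer to q than b in triple i).
    # Returns (chosen classifier indices, weights alpha) defining
    # H(x) = sign(sum_j alpha_j * h_j(x)).
    n = len(labels)
    w = [1.0 / n] * n                          # weights over training triples
    chosen, alphas = [], []
    for _ in range(rounds):
        # Pick the weak classifier with the lowest weighted error.
        errs = [sum(wi for wi, h, y in zip(w, hs, labels) if h != y)
                for hs in weak_outputs]
        j = min(range(len(errs)), key=errs.__getitem__)
        err = max(min(errs[j], 1 - 1e-12), 1e-12)  # clamp away from 0 and 1
        alpha = 0.5 * math.log((1 - err) / err)
        chosen.append(j)
        alphas.append(alpha)
        # Reweight: misclassified triples gain weight, then renormalize.
        w = [wi * math.exp(-alpha * y * h)
             for wi, h, y in zip(w, weak_outputs[j], labels)]
        z = sum(w)
        w = [wi / z for wi in w]
    return chosen, alphas
```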
Result: Fout & Dout
Fout is a d-dimensional embedding
composed of d 1D embeddings from H.
Dout is a distance measure of vectors
produced by Fout
Dout is like a weighted L1 measure, but it
is query-sensitive and is neither symmetric nor a metric.
Because Dout is query-sensitive, Fout
together with Dout reproduces exactly
the combined classifier H.
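The query-sensitive weighted L1 idea can be written compactly: each coordinate's weight is switched on or off by a splitter evaluated on the query, which is why the measure is generally not symmetric. A minimal sketch, with illustrative names:

```python
def query_sensitive_l1(q_vec, x_vec, alphas, splitter_flags):
    # D_out(q, x) = sum_j alpha_j * S_j(q) * |q_j - x_j|,
    # where splitter_flags[j] = S_j(q) is 0 or 1 for the *query*.
    # Since the active coordinates depend on q, in general
    # D_out(q, x) != D_out(x, q).
    return sum(a * s * abs(qj - xj)
               for a, s, qj, xj in zip(alphas, splitter_flags, q_vec, x_vec))

# Only coordinate 0 is active for this query, so coordinate 1 is ignored.
query_sensitive_l1([0, 0], [3, 4], [1.0, 2.0], [1, 0])  # -> 3.0
```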
Training
Original BoostMap chose triples
randomly.
Better results can be achieved by
choosing triples that match the
results you want to retrieve.
If you want k-nearest-neighbor retrieval, choose
triples (q, a, b) such that a is a k-nearest neighbor of q, but b is not.
Complexity
BoostMap requires one-time
training. O(mt)
Online retrieval takes O(d) time
Other methods require no training.
Cost is “similar” to other methods.
Other Methods:
FastMap
SparseMap
Metric Map
Filter & Refine Retrieval
1. Compute the embedding for the
query object and any
reference/pivot objects.
2. Find the database objects with the
most similar vectors.
3. Sort the results by the true
distance measure.
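The three steps above can be sketched as a single function. For clarity this toy version embeds database objects on the fly; in practice those vectors would be precomputed offline. All names and the toy 1D embedding are illustrative assumptions.

```python
def filter_and_refine(query, database, embed, d_embed, d_true, k, shortlist):
    # 1. Embed the query object.
    q_vec = embed(query)
    # 2. Filter: rank by the cheap embedded distance, keep a shortlist.
    candidates = sorted(database,
                        key=lambda x: d_embed(q_vec, embed(x)))[:shortlist]
    # 3. Refine: re-rank only the shortlist by the expensive true distance.
    return sorted(candidates, key=lambda x: d_true(query, x))[:k]

l1 = lambda u, v: sum(abs(a - b) for a, b in zip(u, v))
coord_sum = lambda v: [sum(v)]            # crude illustrative 1D embedding
embed_dist = lambda u, v: abs(u[0] - v[0])

filter_and_refine([1, 2], [[0, 0], [1, 1], [10, 10], [2, 3]],
                  coord_sum, embed_dist, l1, k=1, shortlist=3)  # -> [[1, 1]]
```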
Experimental Results
Query-Sensitive Embeddings lead to
better performance than
embeddings using a global L1
distance measure.
Outperforms FastMap and the
original BoostMap
Conclusions & Further Work
Embeddings are the only family of
methods that are efficient and nonspecific.
How can this algorithm be applied
to choosing a meaningful
distance measure for high-dimensional
vectors?