Similarity Search with MUFIN


Multi Feature Indexing Network
MUFIN
Similarity Search Platform for many Applications
Pavel Zezula
Faculty of Informatics
Masaryk University, Brno
23.1.2012
MUFIN: Multi Feature Indexing Network
1
Outline of the talk
• Why similarity
• Principles of metric similarity searching
• The MUFIN approach
• Demo applications
• Future directions
Real-Life Motivation
The social psychology view
• Any event in the history of an organism is, in a sense,
unique.
• Recognition, learning, and judgment presuppose
an ability to categorize stimuli and classify
situations by similarity.
• Similarity (proximity, resemblance, communality,
representativeness, psychological distance, etc.) is
fundamental to theories of perception, learning,
judgment, etc.
Contemporary Networked Media
The digital data view
• Almost everything that we see, read, hear, write,
measure, or observe can be digital.
• Users autonomously contribute to production of global
media and the growth is exponential.
• Sites like Flickr, YouTube, and Facebook host user-contributed
content for a variety of events.
• The elements of networked media are related by
numerous multi-faceted links of similarity.
Examples with Similarity
• Does the computer disk of a suspected criminal contain
illegal multimedia material?
• What are the stocks with similar price histories?
• Which companies advertise their logos in the live TV
broadcast of a football match?
• Is the situation on the web getting close to any of the
network attacks which resulted in significant damage in
the past?
Challenge
• Networked media is getting close to the human “factbases”
– the gap between physical and digital has blurred
• Similarity data management is needed to connect,
search, filter, merge, relate, rank, cluster, classify,
identify, or categorize objects across various
collections.
WHY?
It is similarity that reveals the world.
Limitations: Data Types
We have
• Attributes
– Numbers, strings, etc.
• Text (text-based)
– Documents, annotations
We need
• Multimedia
– Image, video, audio
• Security
– Biometrics
• Medicine
– EKG, EEG, EMG, EMR, CT, etc.
• Scientific data
– Biology, chemistry, physics, life sciences, economics
• Others
– Motion, emotion, events, etc.
Limitations: Models of Similarity
We have
• Simple geometric models, typically vector spaces
We need
• More complex models
• Non-metric models
• Asymmetric similarity
• Subjective similarity
• Context-aware similarity
• Complex similarity
• Etc.
Limitations: Queries
We have
• Simple query
– Nearest neighbor
– Range
We need
• More query types
– Reverse NN, distinct NN, similarity join
• Other similarity-based operations
– Filtering, classification, event detection, clustering, etc.
• Similarity algebra
– May become the basis of a “Similarity Data Management System”
Limitations: Implementation Strategies
We have
• Centralized or parallel processing
We need
• Scalable and distributed architectures
• MapReduce-like approaches
• P2P architectures
• Cloud computing
• Self-organized architectures
• Etc.
Search Strategy Evolution
[Figure: search strategy evolution along two axes, scalability and determinism]
Scalability (from well established to cutting-edge research)
• centralized, parallel, distributed, peer-to-peer, self-organized
• drivers: data volume - exponential; number of users (queries); variety of data types; multi-lingual, multi-feature, multi-modal queries
Determinism (from high to low)
• exact match ► similarity
• precise ► approximate
• same answer ► good answer; recommendation
• fixed query ► personalized; context aware
• fixed infrastructure ► dynamic mapping; mobile devices
Similarity Data Management System
[Figure: the Similarity Data Management System, labeled with stimuli, similarity, effectiveness, efficiency, and algebra]
Metric Search Grows in Popularity
• Hanan Samet: Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann, 2006
• P. Zezula, G. Amato, V. Dohnal, and M. Batko: Similarity Search: The Metric Space Approach, Springer, 2006
The MUFIN Approach
MUFIN: MUlti-Feature Indexing Network
[Figure: the MUFIN search infrastructure and its three properties]
• Extensibility: metric space
• Scalability: P2P structure
• Independence: infrastructure as a service
Extensibility:
Metric Abstraction of Similarity
• Metric space: M = (D,d)
– D – domain
– distance function d(x,y)
• For all x, y, z ∈ D:
– d(x,y) ≥ 0 (non-negativity)
– d(x,y) = 0 ⇔ x = y (identity)
– d(x,y) = d(y,x) (symmetry)
– d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality)
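The four postulates can be checked empirically on a sample of objects. The following is a minimal sketch, not part of MUFIN; the function name `check_metric_postulates` and the two toy distance functions are assumptions made for illustration. Passing the check on a sample does not prove that a function is a metric, but a failure proves that it is not.

```python
from itertools import product

def check_metric_postulates(objects, d, tol=1e-12):
    """Return the sorted names of postulates violated on the sample (empty if none)."""
    violated = set()
    for x, y in product(objects, repeat=2):
        if d(x, y) < 0:
            violated.add("non-negativity")
        if (d(x, y) == 0) != (x == y):
            violated.add("identity")
        if abs(d(x, y) - d(y, x)) > tol:
            violated.add("symmetry")
    for x, y, z in product(objects, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            violated.add("triangle inequality")
    return sorted(violated)

def abs_diff(x, y):
    """Absolute difference on numbers: a metric."""
    return abs(x - y)

def squared_diff(x, y):
    """Squared difference: symmetric and non-negative, but not a metric."""
    return (x - y) ** 2

sample = [0.0, 1.0, 2.0, 3.5]
print(check_metric_postulates(sample, abs_diff))
print(check_metric_postulates(sample, squared_diff))
```

The squared difference fails only the triangle inequality, which is exactly the property that metric indexes rely on for pruning.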
Examples of Distance Functions
• Lp Minkowski distance (for vectors)
• L1 – city-block distance: $L_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• L2 – Euclidean distance: $L_2(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• L∞ – infinity (maximum) distance: $L_\infty(x, y) = \max_{i=1}^{n} |x_i - y_i|$
• Edit distance (for strings)
• minimal number of insertions, deletions and substitutions
• d(‘application’, ‘applet’) = 6
• Jaccard’s coefficient (for sets A,B): $d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
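The distance functions listed on this slide can be sketched in a few lines each. This is an illustrative sketch only; the function names are assumptions and are not taken from MUFIN or MESSIF. The edit distance uses the standard dynamic-programming formulation with unit costs.

```python
import math

def minkowski(x, y, p):
    """L_p distance between two equal-length vectors; p may be math.inf."""
    if p == math.inf:
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def edit_distance(s, t):
    """Minimal number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution or match
        prev = curr
    return prev[-1]

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

print(minkowski((0, 0), (3, 4), 1))            # city-block
print(minkowski((0, 0), (3, 4), 2))            # Euclidean
print(minkowski((0, 0), (3, 4), math.inf))     # maximum
print(edit_distance("application", "applet"))  # 6, as on the slide
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
```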
Examples of Distance Functions
• Mahalanobis distance
– for vectors with correlated dimensions
• Hausdorff distance
– for sets with elements related by another distance
• Earth mover's distance
– primarily for histograms (sets of weighted features)
• and many others
Similarity Search Problem
• For X ⊆ D in metric space M,
pre-process X so that the similarity queries
are executed efficiently.
No total ordering exists!
Similarity Queries
• Range query
• Nearest neighbor query
• Similarity join
• Combined queries
• Complex queries
Similarity Range Query
• range query
– R(q,r) = { x ∈ X | d(q,x) ≤ r }
… all museums up to 2km from my hotel …
[Figure: query object q with a ball of radius r around it]
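The definition of R(q,r) translates directly into a linear scan. The sketch below is a baseline without any index; the names `range_query` and `euclidean` and the sample coordinates are assumptions made for illustration.

```python
def range_query(X, d, q, r):
    """R(q, r): all objects of X within distance r of the query object q."""
    return [x for x in X if d(q, x) <= r]

def euclidean(a, b):
    """L2 distance between two points given as tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

# Hypothetical positions in km, with the hotel at the origin.
hotel = (0.0, 0.0)
museums = [(1.0, 1.0), (0.0, 2.0), (3.0, 0.0), (-1.5, 0.5)]
nearby = range_query(museums, euclidean, hotel, 2.0)
print(nearby)
```

An index structure returns the same answer while computing d for only a fraction of X.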
Nearest Neighbor Query
• the nearest neighbor query
– NN(q) = x
– x ∈ X, ∀y ∈ X, d(q,x) ≤ d(q,y)
• k-nearest neighbor query
– k-NN(q,k) = A
– A ⊆ X, |A| = k
– ∀x ∈ A, y ∈ X – A, d(q,x) ≤ d(q,y)
… five closest museums to my hotel …
[Figure: query object q and its k = 5 nearest neighbors]
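A k-nearest neighbor query can likewise be answered by a scan that keeps the k objects with the smallest distance to q. This is a minimal sketch using the standard library; the function name `knn_query` and the sample data are assumptions.

```python
import heapq

def knn_query(X, d, q, k):
    """k-NN(q, k): the k objects of X closest to q, ordered by increasing distance."""
    return heapq.nsmallest(k, X, key=lambda x: d(q, x))

def abs_diff(x, y):
    return abs(x - y)

positions = [5, 1, 9, 3, 7]
closest = knn_query(positions, abs_diff, 4.4, 3)
print(closest)
```

When several objects lie at the same distance from q, the set A in the definition is not unique; this sketch returns one valid choice.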
Similarity Join Queries
• similarity join of two data sets X ⊆ D, Y ⊆ D, with threshold m ≥ 0:
$J(X, Y, m) = \{(x, y) \in X \times Y : d(x, y) \le m\}$
• similarity self join ⇔ X = Y
…pairs of hotels and museums which are five minutes walk apart …
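The join definition corresponds to a nested loop over both sets. The sketch below is the quadratic baseline, not an indexed join; the name `similarity_join` and the sample positions are assumptions.

```python
def similarity_join(X, Y, d, m):
    """J(X, Y, m): all pairs (x, y) from X × Y with d(x, y) <= m."""
    return [(x, y) for x in X for y in Y if d(x, y) <= m]

def abs_diff(x, y):
    return abs(x - y)

# Hypothetical positions along a street, in minutes of walking.
hotels = [0, 10]
museums = [1, 4, 11]
pairs = similarity_join(hotels, museums, abs_diff, 2)
print(pairs)
```

Calling `similarity_join(X, X, d, m)` gives the self join; it also reports every object paired with itself and each pair in both orders.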
Combined Queries
• Range + Nearest neighbors
$kNN(q, r) = \{R \subseteq X, |R| \le k \wedge \forall x \in R, y \in X - R : d(q, x) \le d(q, y) \wedge d(q, x) \le r\}$
• Nearest neighbor + similarity joins
– by analogy
Complex Queries
• Find the best matches of circular shape
objects with red color
• The best match for circular shape or red color
need not be the best match combined
• A0 algorithm
• Threshold algorithm
Partitioning Principles
• Given a set X ⊆ D in M=(D,d), basic
partitioning principles have been defined:
– Ball partitioning
– Generalized hyper-plane partitioning
– Excluded middle partitioning
– Clustering
Ball Partitioning
• Inner set: { x ∈ X | d(p,x) ≤ dm }
• Outer set: { x ∈ X | d(p,x) > dm }
[Figure: pivot p with a ball of radius dm]
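Ball partitioning splits X around a pivot p, and the triangle inequality then tells a range query which side it may skip. The sketch below shows both steps. The pruning rule is not stated on the slide; it is the standard consequence of the triangle inequality, and the function names are assumptions.

```python
def ball_partition(X, d, p, dm):
    """Split X into the inner set (d(p, x) <= dm) and the outer set (d(p, x) > dm)."""
    inner = [x for x in X if d(p, x) <= dm]
    outer = [x for x in X if d(p, x) > dm]
    return inner, outer

def sides_to_search(d, p, dm, q, r):
    """Sides that can contain answers of R(q, r).

    Any answer x satisfies d(p, q) - r <= d(p, x) <= d(p, q) + r.
    """
    dq = d(p, q)
    sides = []
    if dq - r <= dm:
        sides.append("inner")
    if dq + r > dm:
        sides.append("outer")
    return sides

def abs_diff(x, y):
    return abs(x - y)

inner, outer = ball_partition([1, 2, 5, 8, 9], abs_diff, p=5, dm=3)
print(inner, outer)
print(sides_to_search(abs_diff, 5, 3, q=5, r=1))
print(sides_to_search(abs_diff, 5, 3, q=12, r=1))
print(sides_to_search(abs_diff, 5, 3, q=8, r=1))
```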
Generalized Hyper-plane
• { x ∈ X | d(p1,x) ≤ d(p2,x) }
• { x ∈ X | d(p1,x) > d(p2,x) }
[Figure: pivots p1 and p2 separated by the generalized hyper-plane]
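Generalized hyper-plane partitioning assigns each object to the closer of two pivots. The sketch below also includes a conservative test for which sides a range query must visit; that test is not on the slide and follows from the triangle inequality. All function names are assumptions.

```python
def hyperplane_partition(X, d, p1, p2):
    """Split X by the closer pivot; ties go to the p1 side."""
    left = [x for x in X if d(p1, x) <= d(p2, x)]
    right = [x for x in X if d(p1, x) > d(p2, x)]
    return left, right

def sides_to_search(d, p1, p2, q, r):
    """Sides that can contain answers of R(q, r) (conservative: never misses one)."""
    d1, d2 = d(q, p1), d(q, p2)
    sides = []
    if d1 - r <= d2 + r:
        sides.append("left")
    if d1 + r >= d2 - r:
        sides.append("right")
    return sides

def abs_diff(x, y):
    return abs(x - y)

left, right = hyperplane_partition([0, 1, 4, 6, 10], abs_diff, p1=2, p2=8)
print(left, right)
print(sides_to_search(abs_diff, 2, 8, q=1, r=1))
print(sides_to_search(abs_diff, 2, 8, q=5, r=1))
```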
Excluded Middle Partitioning
• Inner set: { x ∈ X | d(p,x) ≤ dm − ρ }
• Outer set: { x ∈ X | d(p,x) > dm + ρ }
• Excluded set: otherwise
[Figure: pivot p with radius dm and an excluded ring of width 2ρ]
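Excluded middle partitioning can be sketched as a three-way split. The point of the excluded ring is that any inner object and any outer object are more than 2ρ apart, so a range query with radius at most ρ never needs both the inner and the outer set. The function name and the sample data below are assumptions.

```python
def excluded_middle_partition(X, d, p, dm, rho):
    """Split X into inner, outer and excluded sets around pivot p."""
    inner, outer, excluded = [], [], []
    for x in X:
        dist = d(p, x)
        if dist <= dm - rho:
            inner.append(x)
        elif dist > dm + rho:
            outer.append(x)
        else:
            excluded.append(x)
    return inner, outer, excluded

def abs_diff(x, y):
    return abs(x - y)

inner, outer, excluded = excluded_middle_partition(
    [0, 1, 2, 3, 4, 5, 6], abs_diff, p=0, dm=3, rho=1)
print(inner, outer, excluded)

# Separation property: every inner/outer pair is more than 2 * rho apart.
gap = min(abs_diff(x, y) for x in inner for y in outer)
print(gap)
```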
Clustering
• Cluster data into sets
– bounded by a ball region
– { x ∈ X | d(p_i,x) ≤ r_i^c }
Scalability: Peer-to-Peer Indexing
• Local search: M-tree, D-Index, M-Index
• Native metric techniques: GHT*, VPT*
• Transformation techniques: M-CAN, M-Chord
The M-tree [Ciaccia, Patella, Zezula, VLDB 1997]
1) Paged organization
2) Dynamic
3) Suitable for arbitrary metric spaces
4) I/O and CPU optimization - computing d can be time-consuming
The M-tree Idea
[Figure: M-tree regions over objects A–F under the L2 (Euclidean) metric]
• Depending on the metric, the “shape” of index regions changes: L1 (city-block), L∞ (max-metric), weighted-Euclidean, quadratic form
M-tree: Example
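[Figure: an M-tree over objects o1–o11. Routing entries are listed as (object, covering radius, distance to parent); leaf entries as (object, distance to parent).]
• Root: (o1, 4.5, -.-), (o2, 6.9, -.-)
• Entries under o1: (o1, 1.4, 0.0), (o10, 1.2, 3.3)
• Entries under o2: (o7, 1.3, 3.8), (o2, 2.9, 0.0), (o4, 1.6, 5.3)
• Leaf under o1: (o1, 0.0), (o6, 1.4)
• Leaf under o10: (o10, 0.0), (o3, 1.2)
• Leaf under o7: (o7, 0.0), (o5, 1.3), (o11, 1.0)
• Leaf under o2: (o2, 0.0), (o8, 2.9)
• Leaf under o4: (o4, 0.0), (o9, 1.6)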
M-tree family
• Bulk loading
• Slim-tree
• Multi-way insertion
• PM-tree
• M2-tree
• etc.
D-Index [Dohnal, Gennaro, Zezula, MTA 2002]
[Figure: D-Index structure with the following parts]
• 4 separable buckets at the first level
• 2 separable buckets at the second level
• exclusion bucket of the whole structure
D-index: Insertion
D-index: Range Search
[Figure: range queries with center q and radius r evaluated against the D-index levels]
Implementation Postulates of
Distributed Indexes
• dynamism – nodes can be added and removed
• no hot-spots – no centralized nodes, no
flooding by messages (transactions)
• update independence – network update at
one site does not require an immediate
change propagation to all the other sites
Distributed
Similarity Search Structures
• Native metric structures:
– GHT* (Generalized Hyperplane Tree)
– VPT* (Vantage Point Tree)
• Transformation approaches:
– M-CAN (Metric Content Addressable Network)
– M-Chord (Metric Chord)
GHT* Address Search Tree
• Based on the Generalized Hyperplane Tree
[Uhl91]
– two pivots for binary partitioning
[Figure: space partitioned by the pivot pairs (p1, p2), (p3, p4) and (p5, p6), next to the corresponding binary tree]
GHT* Address Search Tree
• Inner node
– two pivots (reference objects)
• Leaf node
– BID pointer to a bucket if data stored on the current peer
– NNID pointer to a peer if data stored on a different peer
[Figure: AST with inner nodes (p1, p2), (p3, p4), (p5, p6) and leaves BID1, BID2, BID3 and NNID2, which points to Peer 2]
GHT* Address Search Tree
[Figure: address search trees held by Peer 1, Peer 2 and Peer 3]
GHT* Range Query
• Range query R(q,r)
– traverse peer’s own AST
– search buckets for all BIDs found
– forward query to all NNIDs found
[Figure: range query with center q and radius r over the space partitioned by pivots p1–p6, and the traversed AST with leaves BID1, BID2, BID3 and NNID2 (Peer 2)]
AST: Logarithmic replication
• Full AST on every peer is space consuming
– replication of pivots grows in a linear way
• Store only a part of the AST:
– all paths to local buckets
• Deleted sub-trees:
– replaced by NNID of the leftmost peer
[Figure: full AST with pivots p1–p14, bucket BID1 and peer pointers NNID2–NNID8]
AST: Logarithmic Replication (cont.)
• Resulting tree
– replication of pivots grows in a logarithmic way
[Figure: pruned AST with pivots p1, p2, p3, p4, p7, p8, bucket BID1 and peer pointers NNID2, NNID3, NNID5]
VPT* Structure
• Similar to the GHT* - ball partitioning is used for AST
• Based on the Vantage Point Tree [Yia93]
– inner nodes have one pivot and a radius
– different traversing conditions
[Figure: ball regions around pivots p1, p2, p3 with radii r1, r2, r3, next to the tree with inner nodes p1 (r1), p2 (r2), p3 (r3)]
M-Chord: The Metric Chord
• Transform metric space to one-dimensional domain
– Use M-Index - a generalized version of the iDistance
• Divide the domain into intervals
– assign each interval to a peer
• Use the Chord P2P protocol for navigation
• Alternatively, the Skip Graphs distributed protocol can be used
M-Chord: Indexing the Distance
• iDistance – indexing technique for vector domains
– cluster analysis = centers = reference points pi
– assign iDistance keys to objects x ∈ Ci: $iDist(x) = d(p_i, x) + i \cdot c$
– range query R(q,r): identify intervals of interest
• Generalization to metric spaces
– select pivots { p0 ,..., pn }
– then partition: Voronoi-style
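The key assignment and the "intervals of interest" can be sketched as follows. This is a simplified illustration, not the M-Chord implementation: it assumes that c is larger than any distance in the data set so that the key ranges of different clusters do not overlap, that objects go to their closest pivot (Voronoi-style), and that every cluster is probed. The function names are assumptions.

```python
def idistance_key(x, pivots, d, c):
    """iDist(x) = d(p_i, x) + i * c, where p_i is the pivot closest to x."""
    i, dist = min(((i, d(p, x)) for i, p in enumerate(pivots)),
                  key=lambda pair: pair[1])
    return dist + i * c

def query_intervals(q, r, pivots, d, c):
    """Key intervals that may hold answers of R(q, r), one per cluster.

    An answer x in cluster i satisfies |d(p_i, x) - d(p_i, q)| <= r.
    """
    intervals = []
    for i, p in enumerate(pivots):
        dq = d(p, q)
        low = i * c + max(dq - r, 0)
        high = i * c + min(dq + r, c)
        if low <= high:
            intervals.append((low, high))
    return intervals

def abs_diff(x, y):
    return abs(x - y)

pivots = [0.0, 10.0]
print(idistance_key(3.0, pivots, abs_diff, c=100))
print(idistance_key(8.0, pivots, abs_diff, c=100))
print(query_intervals(4.0, 1.0, pivots, abs_diff, c=100))
```

In the example, the object 3.0 is an answer to R(4.0, 1.0) and its key 3.0 falls inside the first returned interval.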
M-Chord: Chord Protocol
• Peer-to-Peer navigation protocol
• Peers are responsible for intervals of keys
• O(log n) hops to localize a node storing a key
M-Chord
• set the iDistance domain
• make it uniform: function h, $mchord(x) = h(d(p_i, x) + i \cdot c)$
• Use Chord on this domain
M-Chord: Range Query
• Node Nq initiates the search
• Determine intervals
– generalized iDistance
• Forward requests to peers on
intervals
• Search in the nodes
– using local organization
• Merge the received partial
answers
M-CAN: The Metric CAN
• Based on the Content-Addressable Network (CAN)
– a DHT navigating in an N-dimensional vector space
• The Idea:
1. Map the metric space to a vector space
– given N pivots: p1, p2 , … , pN, transform every o into
vector F(o)
2. Use CAN to
– distribute the vector space zones among the nodes
– navigate in the network
CAN: Principles & Navigation
• CAN – the principles
– the space is divided in zones
– each node “owns” a zone
– nodes know their neighbors
• CAN – the navigation
– greedy routing
– in every step, move to the neighbor closer to the target location
[Figure: a 2-dimensional vector space divided into zones 1–6, with a target location (x,y)]
M-CAN: Contractiveness & Filtering
• Use the L∞ as a distance measure
– the mapping F is contractive
$L_\infty(F(x), F(y)) \le d(x, y)$
• More pivots ⇒ better filtering
– but CAN routing is better for fewer dimensions
• Additional filtering
– some pivots are only used for filtering data (inside the
explored nodes)
– they are not used for mapping into CAN vector space
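Contractiveness is what makes pivot-based filtering safe: if the mapped objects are already farther apart than r under L∞, the original objects must be as well. The sketch below assumes the usual pivot mapping F(o) = (d(p1,o), ..., d(pN,o)); the slides do not spell out F, so this form and all function names are assumptions.

```python
import math
from itertools import combinations

def pivot_map(o, pivots, d):
    """Assumed mapping F(o): the vector of distances from o to each pivot."""
    return tuple(d(p, o) for p in pivots)

def l_inf(u, v):
    """L-infinity distance between two equal-length vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

def filter_candidates(X, pivots, d, q, r):
    """Keep only objects that the contractive mapping cannot rule out for R(q, r)."""
    fq = pivot_map(q, pivots, d)
    return [x for x in X if l_inf(fq, pivot_map(x, pivots, d)) <= r]

pivots = [(0.0, 0.0), (10.0, 0.0)]
X = [(1.0, 0.0), (5.0, 5.0), (9.0, 0.0), (1.0, 1.0)]
candidates = filter_candidates(X, pivots, math.dist, q=(0.0, 0.0), r=2.0)
print(candidates)

# Contractiveness on the sample: L_inf(F(x), F(y)) <= d(x, y) for every pair.
contractive = all(
    l_inf(pivot_map(x, pivots, math.dist), pivot_map(y, pivots, math.dist))
    <= math.dist(x, y) + 1e-9
    for x, y in combinations(X, 2))
print(contractive)
```

Surviving candidates still have to be verified with the original distance d, since the mapping can only rule objects out.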
Infrastructure Independence:
MESSIF
Metric Similarity Search Implementation Framework
[Figure: MESSIF architecture with the following components]
• Performance statistics
• Distributed index structures
• Centralized index structures
• Metric space (D,d)
– Vectors: Lp and quadratic form
– Strings: (weighted) edit and protein sequence
• Operations
– Insert, delete, range query, k-NN query, incremental k-NN
• Storage
– Volatile memory
– Persistent memory
• Communication
– Net
Applications: a Word Cloud
Concepts of the Image search
Image base
Images and their Descriptors
[Figure: an image at the image level and its representation at the descriptor level, with R, G, B color channels]
CoPhIR: Content-based Photo Image Retrieval
100M images + metadata + MPEG-7 VDs
http://cophir.isti.cnr.it/
• Largest publicly available collection of high-quality image
metadata: 106 million images
• Each image contains:
• Five MPEG-7 VDs: Scalable Color, Color Structure, Color Layout, Edge
Histogram, Homogeneous Texture
• Other textual information: title, tags, comments, etc.
• Photos have been crawled from the Flickr photo-sharing site.
Image Search Demo
http://mufin.fi.muni.cz/imgsearch/
[Figure: the MUFIN search engine as configured for the image search demo]
• Extensibility: COPHIR descriptors (color structure, scalable color, color layout, edge histogram, homogeneous texture)
• Scalability: M-Chord + M-Index
• Infrastructure: 6 x IBM server x3400 – 2 servers used
MUFIN demos
• http://mufin.fi.muni.cz/imgsearch/similar
• http://www.pixmac.com/
• http://mufin.fi.muni.cz/twenga/random
• http://mufin.fi.muni.cz/fingerprints/random
• http://mufin.fi.muni.cz/subseq/random
• http://mufin.fi.muni.cz/plugins/annotation
MUFIN Future Research Directions
• MUFIN - a universal similarity search technology
• Research directions in:
– Core technology
– Applications
– A style of computing
[Figure: MUFIN search engine with extensibility (metric space), scalability (P2P structures) and infrastructure (performance tuning)]
MUFIN Future Research Directions
[Figure: MUFIN search engine and its infrastructure]
• New style of computing
– Cloud Computing
– Similarity Search as Service
October 28, 2011
Major Applications
– Images:
• Sub-image retrieval
• Ranking
• Annotation
• Categorization
• Benchmarking
– Signals:
• Audio recognition
• Time series similarity
– Videos:
• Event detection
– Biometrics:
• Face recognition
• Fingerprint recognition
• Gait recognition
A New Style of Computing
• From the project-oriented approach towards
similarity cloud for multimedia findability
through similarity searching
Advantages:
– Cloud makes similarity search accessible to common
users
– Computational resources are shared – users don’t
need to maintain any hardware infrastructure
– Users don’t need to care for the OS, security, software
platform, etc.