Multimedia Information
Retrieval
Problem
On the Web and in local DBs a large amount of
information is not textual:
audio (speech, music…)
images,
video,
…
How can we efficiently retrieve multimedia
information?
p. 2
Application examples
Web indexing:
Multimedia retrieval from the Web
Identify and ban (illegal or unauthorized) ads and
images
Trademark & copyright
Interactive museums
Commercial DBs
p. 3
Application examples[2]
Satellite images (military, government, …)
Medical images
Entertainment
Criminal investigation (scene analysis, face
recognition, ..)
…
p. 4
First generation multimedia information
retrieval systems
Off-line: multimedia documents are
associated with a textual description, e.g.:
Manual annotation (“content descriptive
metadata”)
The text surrounding an image in the document
(e.g. the figure caption)
On-line: using textual IR based on “keyword
match” (e.g. Google Images)
p. 5
p. 6
image taken from: A. Del Bimbo, Visual Information Retrieval
Limitation of textual approach
Manual annotation on large
multimedia DBs is unfeasible
Describing a scene or an audio is
highly subjective (different
annotators might perceive/highlight
different details)
p. 7
Precision might be quite low
Google Image can retrieve up to 80% NON
RELEVANT DOCUMENTS even for specific queries
[1] Fergus, Fei-Fei, Perona, Zisserman, Learning Object Categories from
Google’s Image Search, ICCV 05
p. 8
… & Recall
Many relevant images (videos, audio) are not
retrieved
p. 9
Current state-of-the-art retrieval
models…
“Content Based” systems:
Ignore the textual phase
User query might be non-textual
Model perceptual similarity between the
query and the multimedia document
Still limited to DBs (does not scale on the
Web)
p. 10
Examples of multimedia search
queries
Find a song by singing the refrain
Retrieving a specific soccer action frame
in a sports video
Searching for a painting with a
specific detail or texture or painting
technique (e.g. chiaroscuro)
…
p. 11
Current state-of-the-art retrieval
models…[2]
Automated image annotation:
Pre-processing (“information
extraction”): automatically extract
some information from the image and
associate it to some textual label
Retrieval is then a “traditional” text
retrieval
p. 12
Example of image annotation
p. 13
Image Retrieval wrt textual Retrieval
Analysis and representation of non-symbolic
information
A text can be seen as a combination of atomic
symbolic elements (words or tokens)
An image is a collection of non-symbolic
elements (pixels) and an audio is represented
as a wave… there is no vocabulary of basic
meaning elements, as there is for text!
p. 14
Basic elements of a Content Based
Multimedia IR
On the user's side:
The query is a multimedia object (an image, a
sketch, an audio frame, …)
The output is an ordered list of elements ranked
according to perceptual similarity wrt the query
There are a variety of optional interactive features
to visualize image collections or give a feedback
to the system
p. 15
Example of “clustered”
visualization in Google Swirl
p. 16
Query by image example
The query is an
image detail
p. 17
Query by image example [2]
Note that the query
and the detail might
not perfectly match,
e.g. the query can be
chosen from an
image prior to
a restoration of the
picture
p. 18
Query by sketch
p. 19
image taken from: A. Del Bimbo, Visual Information Retrieval
Basic elements of a Content Based
Multimedia IR [2]
From the “system” perspective:
Representation of the multimedia object (e.g.
what is the feature space)
Modeling the notion of perceptual similarity
(e.g., through specific matching algorithms)
Efficient indexing of the feature space (the
“vocabulary” is orders of magnitude larger than for
words)
Relevance feedback and visualization interface
p. 20
MULTIMEDIA OBJECT
REPRESENTATION
p. 21
Representing an image through a
set of features
As for text, a feature is a representation,
through a vector of elements, of the image (or
of a detail I')
If I' is an image detail, then a feature f for I' is
defined as:
f(I') ∈ Rk,
f(I') = (v0, …, vk-1)T,
k ≥ 1
p. 22
Representing an image through a
set of features [2]
In general, a feature is a measurable
characteristic of an image
The image is then represented using the
measurable values of its selected features f1,
…, fn
p. 23
Local and global features
I' = I: global feature (remember: I is the image, I' a
detail)
I' ⊂ I: local feature
Local features:
How to select the relevant image parts that we want
to represent (I'1, I'2, …)
Local features allow us to cope with missing
elements, occlusions, background, …
p. 24
Main problems in image
representation
Selecting features is crucial
Just as for text, the same meaning can be
conveyed by apparently very different images
(different according to specific features)
But the problem of “variability” is much harder
p. 25
Variability [1]: orientation and rotation
p. 26
Michelangelo 1475-1564
Variability [2]: lighting and brightness
p. 27
Variability [3]: deformation
Xu, Beihong 1943
p. 28
Variability [4]: intra-class variability
p. 29
Selection of image focus [1]:
occlusion
p. 30
Magritte, 1957
Selection of image focus [2]: background separation
p. 31
Klimt, 1913
Example: local feature
(figure: a local feature fi(I') computed on a detail I' of the image I)
p. 32
image taken from: Tutorial CVPR 07
Feature Extraction
What are image features?
Primitive features
General features:
Mean color (RGB)
Color histogram
Color layout, texture, etc.
Semantic features
Domain-specific features:
Face recognition, fingerprint matching, etc.
p. 33
Examples of “simple” features:
gray level histogram
Pixel intensity histogram in I':
The range [0, 255] is partitioned into k bins
Assign a bin to every pixel: I(p) → divk(I(p))
f(I') = (v0, …, vk-1)T, where:
vi = # { p ∈ I’ : divk(I(p)) = i }
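A minimal Python sketch of this feature, assuming divk maps an intensity v to bin floor(v·k/256) (a common choice; the slides do not define divk explicitly):

```python
# Gray-level histogram feature: a minimal sketch.
# Assumption: div_k maps an intensity in [0, 255] to a bin index
# via floor(v * k / 256), so each of the k bins covers 256/k levels.

def gray_histogram(pixels, k=8):
    """Return f(I') = (v0, ..., v_{k-1}) for a flat list of intensities."""
    bins = [0] * k
    for v in pixels:
        bins[v * k // 256] += 1
    return bins

# Example: a tiny 2x2 "image detail" with intensities in 0..255
detail = [0, 64, 128, 255]
print(gray_histogram(detail, k=4))  # [1, 1, 1, 1]
```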
p. 34
Example
Frequency count of each individual color
Most commonly used color feature
representation
(figure: an image and its corresponding histogram)
p. 35
Examples of “domain-specific”
features : facial metrics
f(I) = (d1, d2, d3, d4)T
p. 36
More features
shape
texture
p. 37
Feature space
If we now use n features in R, then I can be
represented as a feature vector x(I) = (f1(I),
…fn(I))T
x(I) is a point in Rn, the feature space
p. 38
Feature Space [2]
More in general, if:
fi(I) ∈ Rk (a single feature is a k-dimensional
vector)
then: x(I) = (f1(I)T … fn(I)T)T is a point in
Rn*k
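The stacking of n k-dimensional features into a single point x(I) can be sketched as follows (the two toy feature vectors are hypothetical):

```python
# Building the feature-space point x(I): a minimal sketch.
# Assumption: each feature f_i(I) is a list of k real values; stacking
# n of them gives a point in R^{n*k}.

def feature_vector(features):
    """Concatenate per-feature vectors f_1(I), ..., f_n(I) into x(I)."""
    x = []
    for f in features:
        x.extend(f)
    return x

f1 = [0.2, 0.8]   # e.g. a 2-bin gray histogram (k = 2)
f2 = [0.1, 0.9]   # e.g. a 2-bin texture descriptor (hypothetical)
x = feature_vector([f1, f2])
print(len(x))  # 4 = n * k with n = 2, k = 2
```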
p. 39
Ex.: Feature space(R2)
p. 40
Feature Space[3]
The concept of feature space is similar BUT
NOT IDENTICAL TO the vector space model
of traditional IR (where the real values are the
tf*idf weights of words in the document collection)
It is the most common, but not the unique,
representation in content-based multimedia IR
p. 41
SIMILARITY
p. 42
Perceptual similarity
In text retrieval, similarity between two
documents is modeled as a function of the
common words in the two documents (e.g.
cosine similarity with tf*idf feature vectors)
In multimedia retrieval a similar notion of
“distance” between vectors is applied…
p. 43
Perceptual similarity [2]
In the feature space, similarity is (inversely)
proportional to a distance measure between
feature vectors (not necessarily a Euclidean
distance): dist(x(I1),x(I2))
Given the query Q, the system output is an
image list I1, I2, … ordered according to:
I1 = arg minI dist(x(Q),x(I)), …
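This ranking can be sketched under a Euclidean-distance assumption (the image IDs and 2-D feature vectors below are hypothetical toy data):

```python
import math

# Ranking by perceptual similarity: a minimal sketch using Euclidean
# distance (the slides note that other distance measures are possible).

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def rank(query, db):
    """Return (image_id, distance) pairs sorted by increasing distance."""
    return sorted(((i, dist(query, x)) for i, x in db.items()),
                  key=lambda pair: pair[1])

# Hypothetical toy database of 2-D feature vectors
db = {"I1": [0.0, 0.0], "I2": [3.0, 4.0], "I3": [1.0, 1.0]}
print(rank([0.0, 0.0], db))  # I1 first (distance 0.0), then I3, then I2
```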
p. 44
Example(R2)
p. 45
Perceptual similarity [3]
Other matching algorithms use more complex
representations or more complex similarity
functions, which are usually dependent on the
type of multimedia object and retrieval tasks
p. 46
INDEXING
p. 47
Indexing
Problem: how can we efficiently index data in a multidimensional space?
Several data structures (such as the IR keyword
dictionary) are indexed using some ordering
(e.g. alphabetic ordering):
xi ≤ xj ∨ xj ≤ xi (0 ≤ i, j ≤ N)
In Rk this cannot be done (remember: every
feature is multi-dimensional!)
p. 48
k-d Tree
It is a generalization of a binary search tree
to k dimensions
At each tree level we cyclically consider one of
the k features
p. 49
k-d Tree [2]
Suppose we wish to index a set of N k-dimensional points:
P1, …, PN, Pi ∈ Rk, Pi = (xi1, …, xik)
We select the first dimension (feature) and
find the value L1, which is the median of x11,
…, xN1
p. 50
k-d Tree [3]
The root of the tree stores L1
The left sub-tree (TL) includes the points Pi
s.t. xi1 ≤ L1
The right sub-tree (TR) will include all the other
points
At level 1, we select the second feature and,
separately for TL and TR, we compute L2 and L3,
selected such that:
L2 is the median of the second-feature values xj12, xj22, …
of the points in TL
L3 is the median of the corresponding values in TR
p. 51
k-d Tree [4]
When the last (k-th) feature has been
considered, we backtrack and cyclically
consider the first feature again
Points are associated with the tree leaves
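The construction scheme above can be sketched in Python; two details are illustrative assumptions rather than taken from the slides: we use the lower median (so both halves are always non-empty) and store single points in the leaves.

```python
# A minimal k-d tree construction sketch.
# Assumptions (not from the slides): lower median as the split value,
# single-point leaves; points <= split go left, the rest go right.

def build_kdtree(points, depth=0):
    if len(points) <= 1:
        return points[0] if points else None       # leaf: a single point
    k = len(points[0])
    axis = depth % k                               # cycle over the k features
    points = sorted(points, key=lambda p: p[axis])
    m = (len(points) - 1) // 2                     # lower median index
    return {"split": points[m][axis], "axis": axis,
            "left": build_kdtree(points[:m + 1], depth + 1),   # <= split
            "right": build_kdtree(points[m + 1:], depth + 1)}  # > split

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(tree["split"], tree["axis"])  # 5 0: first split on x at the median
```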
p. 52
Example
We start with a set of 2-dimensional points.
In L1, P5 ‘s x coordinate is the median of the dataset
In L2, P2 is the median of y values in the partition, and in L3 P7
We then consider again x values, and in L4 the median is again P2
etc.
p. 53
image taken from: Hemant M. Kakde, Range Searching using Kd Tree
IMAGE, VIDEO AND AUDIO
RETRIEVAL
p. 54
..So far
We analyzed:
Query types
Feature types
Similarity functions
Indexing methods
Now we present retrieval methods
Retrieval strategies clearly depend upon the
multimedia object representation technique
p. 55
Retrieval by color: color
histograms
We can represent an image through the color
histogram of an image part I' (we have already
seen how histograms are created for gray-level
images):
A single pixel p can be represented with different
encodings: RGB, HSV, CIE LAB, …
Each channel (value range) is partitioned into k
bins:
f(I') = (r0, …, rk-1, g0, …, gk-1, b0, …, bk-1)T,
ri = # { p ∈ I’: divk(R(p)) = i },
gi = # { p ∈ I’: divk(G(p)) = i },
bi = # { p ∈ I’: divk(B(p)) = i }
p. 56
Color histograms [2]
Alternatively, we divide the RGB space into k3 bins:
f(I') = (z0, …, zh-1)T, h = k3, the # of combinations of the 3
channel values
If zi represents the triple of RGB bin indices (i1, i2, i3),
then:
zi = # { p ∈ I’: divk(R(p)) = i1 and divk(G(p)) = i2
and divk(B(p)) = i3 }
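The k³-bin variant can be sketched in the same way, with the same assumed divk (floor(v·k/256), a common choice not fixed by the slides):

```python
# Joint RGB histogram with k^3 bins: a minimal sketch.
# Each pixel is an (R, G, B) triple in [0, 255]; div_k is assumed to
# be floor(v * k / 256), as for the gray-level histogram.

def rgb_histogram(pixels, k=4):
    """Return f(I') = (z0, ..., z_{k^3 - 1}) for a list of RGB triples."""
    z = [0] * (k ** 3)
    for r, g, b in pixels:
        i1, i2, i3 = r * k // 256, g * k // 256, b * k // 256
        z[i1 * k * k + i2 * k + i3] += 1     # flatten the triple (i1, i2, i3)
    return z

pixels = [(255, 0, 0), (255, 0, 0), (0, 255, 0)]   # two reds, one green
h = rgb_histogram(pixels, k=2)
print(sum(h), len(h))  # 3 8  (3 pixels counted, 2^3 bins)
```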
p. 57
Color histogram [3]: example (4
bins)
p. 58
image taken from: Wikipedia
Retrieval by texture
p. 59
Statistical Approach
Tamura features: based on the
analysis of the local intensity
distribution of the image, in
order to measure perceptual
characteristics of the texture,
such as:
Contrast
Granularity
Direction
p. 60
Video retrieval
A video is a sequence of images
Every image is called a FRAME
p. 62
Elements of a video
Frame: a single image
Shot: A sequence of frames taken from a
single camera
Scene: a set of consecutive shots that
reflect the same space, time and action
p. 63
Video sequence segmentation
If we can automatically identify “editing
effects” (cuts, dissolves, …) between shots,
we can then automatically partition a video into
shots
Identifying scenes is much more complicated,
since this is a “semantic” concept
p. 64
Video search
Videos can be represented efficiently using
“key frames”, which are representative of every
shot
A key frame can then be treated and
processed as a “still image”:
We can then apply everything we have just seen for
single images
p. 65
Video search
Alternatively, we can search a video for a
specific “motion” (e.g., a specific trajectory of
a soccer action, …)
p. 66
Audio retrieval
Several types of audio:
Spoken audio
Any audio signal within the frequency range that
can be perceived by the human ear (e.g. a
thunderstorm)
Music:
We must model the different instruments, musical effects,
etc.
p. 67
Audio Query types
Query by example: the input is an audio file,
used to search “similar” files
Query by humming:
User sings the searched melody
p. 68
Representation and similarity
The feature space can be obtained using, e.g.,
histograms computed from the spectral
representation of the signal
Perceptual similarity is computed as the
distance between multidimensional points, as
for images
Distance metrics: Euclidean, Mahalanobis,
histogram distance measures (Histogram
Intersection, Kullback-Leibler divergence, chi-square, etc.)
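Two of the listed histogram measures can be sketched on normalized histograms (bins summing to 1; the toy histograms below are illustrative):

```python
# Two histogram comparison measures from the list above, sketched on
# normalized histograms (bins sum to 1).

def histogram_intersection(p, q):
    """Similarity in [0, 1]: sum of bin-wise minima (1 = identical).
    Note: this is a similarity, not a distance."""
    return sum(min(a, b) for a, b in zip(p, q))

def chi_square(p, q):
    """Chi-square distance: 0 for identical histograms."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

p = [0.5, 0.5, 0.0]
q = [0.5, 0.25, 0.25]
print(histogram_intersection(p, q))  # 0.75
print(chi_square(p, q))
```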
p. 69
Putting it all together: combining different perceptual
elements
p. 70
Content Based systems:
limitations
All the information concerning the target
multimedia objects is provided by the query
(e.g., a given shape, or color, or audio signal)
p. 71
Content Based systems:
limitations [2]
Even if the representation and matching techniques are
sophisticated, it is difficult to distinguish shape changes
that are still referring to the searched object from noise
p. 72
Limitations of Content Based
systems [3]
Human brain can distinguish among different shapes of
the same object only after having seen several objects of
the same type in different positions
To obtain similar performance, artificial systems need
to be trained to recognize objects using machine
learning algorithms for:
Automated image annotation
Automated image classification
p. 73
References
A. Del Bimbo, Visual Information Retrieval,
Morgan Kaufmann Publishers, Inc., San
Francisco, California, 1999
Forsyth, Ponce, Computer Vision, a Modern
Approach 2003
p. 74
references[2]
Smeulders et al., Content-Based Image Retrieval at
the End of the Early Years, IEEE PAMI 2000
Long et al., Fundamentals of Content-based Image
Retrieval, in: D. D. Feng, W. C. Siu, H. J. Zhang
(Eds.), Multimedia Information Retrieval &
Management - Technological Fundamentals and
Applications, Springer-Verlag, New York (2003)
Foote et al., An Overview of Audio Information
Retrieval, ACM Multimedia Systems, 1998
Hemant M. Kakde, Range Searching using Kd Tree,
2005
p. 75