Statistical Models for Automatic Video Annotation & Retrieval

Image Retrieval Basics
Uichin Lee
KAIST KSE
Slides based on “Relevance Models for Automatic Image and Video
Annotation & Retrieval” by R. Manmatha (UMASS)
How do we retrieve images?
• Using Content-based Image Retrieval (CBIR) systems
– Hard to represent information needs using abstract image features:
color percentages, color layout, and textures.
How do we retrieve images?
• IBM QBIC system example using color
How do we retrieve images?
• Use Google image search !
– Google uses filenames and surrounding text, and
ignores the contents of the images.
How do we retrieve images?
• Using manual annotations
– Libraries, Museums
– Manual annotation is expensive.
Example record (Library of Congress, American Memory Collections):
CREATED/PUBLISHED: 1940 August
NOTES: Store or cafe with soft drink signs: Coca-Cola, Orange-Crush, Royal Crown, Double Cola and Dr. Pepper.
SUBJECTS: Carbonated beverages; Advertisements; Restaurants; United States--Mississippi--Natchez; Slides--Color
CALL NUMBER: LC-USF35-115
How to retrieve images/videos?
• Retrieval based on similarity search of visual
features (similar to traditional IR with visterms)
– Doesn’t support textual queries
– Doesn’t capture “semantics”
• Automatically annotate images then retrieve
based on the textual annotations
Example Annotations:
Tiger, grass.
Content-based image retrieval
(Figure: pipeline – extract features from the images in the database and from the query, compute similarity between the two feature sets, and rank the images.)
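A minimal sketch of the similarity-search step shown above, assuming feature vectors have already been extracted from the database images and from the query; the vectors and image ids are placeholders:

```python
import numpy as np

def rank_images(query_feat, db_feats):
    """Rank database images by Euclidean distance to the query feature vector.
    db_feats: dict mapping image id -> feature vector (same length as query_feat)."""
    dists = {img_id: np.linalg.norm(query_feat - feat) for img_id, feat in db_feats.items()}
    # Smaller distance = more similar, so sort ascending by distance
    return sorted(dists, key=dists.get)

# Toy example: 3-bin "features" for three database images and a query
db = {"img1": np.array([0.7, 0.2, 0.1]),
      "img2": np.array([0.1, 0.8, 0.1]),
      "img3": np.array([0.3, 0.3, 0.4])}
print(rank_images(np.array([0.6, 0.3, 0.1]), db))   # img1 should rank first
```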
Visterms: image vocabulary
• Can we represent all the images with a
finite set of symbols?
– Text documents consist of words
– Images consist of visterms
(Figure: an example image whose regions map to visterms such as V123, V89, V988, V4552, V12336, V2, V765, V9887.)
Construction of visterms
1. Segment images into visual segments (e.g., Blobworld, the Normalized-cuts algorithm)
2. Extract features from the segments
3. Cluster similar segments (k-means; see the sketch below)
4. Each cluster is a visterm
(Figure: images → segments → visterms (= blob-tokens), e.g., V1, V2, V3, V4, V5, V6, …)
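A minimal sketch of steps 2–4, assuming segment feature vectors have already been extracted; the feature dimensionality, the number of clusters, and the use of scikit-learn's KMeans are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assume each row is a feature vector extracted from one image segment
# (e.g., mean color + texture statistics); values here are random placeholders.
segment_features = np.random.rand(200, 8)

# Step 3: cluster similar segments; each of the k clusters becomes a visterm.
k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(segment_features)

# Step 4: map every segment to its visterm id, e.g. "V17"
visterms = ["V%d" % c for c in kmeans.labels_]
print(visterms[:5])
```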
Segmentation
• Segment images into parts (tiles or regions)
– Tiling: break the image down into simple geometric shapes, e.g., (a) 5 tiles, (b) 9 tiles
– Regioning: break the image down into visually coherent areas, e.g., (c) 5 regions, (d) 9 regions
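A minimal sketch of the tiling variant, assuming the image is a NumPy array of shape H × W × C; the grid size is illustrative:

```python
import numpy as np

def tile_image(img, rows, cols):
    """Split an image array (H x W x C) into rows*cols rectangular tiles."""
    h, w = img.shape[0] // rows, img.shape[1] // cols
    return [img[r*h:(r+1)*h, c*w:(c+1)*w] for r in range(rows) for c in range(cols)]

# Toy 90x90 RGB image split into a 3x3 grid of 9 tiles
tiles = tile_image(np.zeros((90, 90, 3)), 3, 3)
print(len(tiles), tiles[0].shape)   # 9 (30, 30, 3)
```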
Image features
• Information about color, texture, or shape
extracted from an image (or part of an image)
is known as an image feature
• Features
– Color (e.g., Red), Texture (e.g., Sandy), Shape
– SIFT (Scale-invariant feature transform)*
–…
(Figure: a color histogram, e.g., counts for red and orange bins, and an example texture patch.)
David G. Lowe “Distinctive image features from scale-invariant keypoints” (IJCV 2004)
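A minimal sketch of one such feature, a per-channel color histogram, assuming the image (or region) is an RGB NumPy array with values in 0–255; the number of bins is an arbitrary choice:

```python
import numpy as np

def color_histogram(img, bins=8):
    """Per-channel color histogram for an RGB image array with values in [0, 255]."""
    hist = [np.histogram(img[..., ch], bins=bins, range=(0, 256))[0] for ch in range(3)]
    hist = np.concatenate(hist).astype(float)
    # Normalize so histograms from regions of different sizes are comparable
    return hist / hist.sum()

# Toy example on a random 64x64 RGB "image"
img = np.random.randint(0, 256, size=(64, 64, 3))
print(color_histogram(img).shape)   # (24,) = 3 channels x 8 bins
```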
Discrete visterms
• Segmentation vs. rectangular partition
– Tiling vs. regioning
• Result: rectangular partitioning performs better than segmentation!
– The model is learned over many images,
– whereas segmentation operates over one image at a time.
Automatic annotation & retrieval
• Automatically annotate unseen images
– Start from a training set of annotated images
• We do not know which word corresponds to which part of the image.
– Compute visterms (based on image features)
– Learn a model and annotate the set of test images
– Learn all annotations at the same time
• Retrieval based on the annotation output
– Use query likelihood language model
– Rank test images according to the likelihoods
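A minimal sketch of the retrieval step just described, assuming the annotation model has already produced a per-word probability P(w | I) for each test image; the toy distributions and the smoothing constant are assumptions:

```python
import math

def query_likelihood(query_words, annotation_probs, epsilon=1e-6):
    """Log-likelihood of drawing the query words from an image's annotation distribution.
    annotation_probs: dict word -> P(word | image), as produced by an annotation model."""
    return sum(math.log(annotation_probs.get(w, epsilon)) for w in query_words)

# Toy annotation distributions for two test images
images = {"img1": {"tiger": 0.4, "grass": 0.3, "water": 0.1},
          "img2": {"people": 0.5, "dance": 0.2, "grass": 0.05}}
query = ["tiger", "grass"]
ranking = sorted(images, key=lambda i: query_likelihood(query, images[i]), reverse=True)
print(ranking)   # img1 ranked first
```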
Correspondence (matching)
(Figure: training images with their annotations and visterms, e.g., {Tiger, Grass} ↔ {V1, V3}; {Maui, People, Dance} ↔ {V2, V4, V6}; {Sea, Sand, Sea_Lion} ↔ {V5, V12, V321}.)
• Now we want to find the relationship between
visterms and words.
– P( Tiger | V1 ), P( V1 | Tiger ), P( Maui | V3,V4 )
Correspondence models
• Co-occurrence model
• Translation model
• Normalized & regularized model
• Cross-media relevance model
• Continuous relevance model
• Multiple Bernoulli model, …
Co-occurrence models
• Mori et al. 1999
• Create the co-occurrence table using a training set of annotated images
• Tends to annotate with high-frequency words
• Context is ignored
– Needs joint probability models
     w1   w2   w3   w4
V1   12    2    0    1
V2   32   40   13   32
V3   13   12    0    0
V4   65   43   12    0
P( w1 | v1 ) = 12/(12+2+0+1)=0.8
P( v3 | w2 ) = 12/(2+40+12+43)=0.12
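A minimal sketch that reproduces the two probabilities above by row- and column-normalizing the co-occurrence table:

```python
import numpy as np

# Co-occurrence counts from the table above: rows = visterms V1..V4, cols = words w1..w4
counts = np.array([[12,  2,  0,  1],
                   [32, 40, 13, 32],
                   [13, 12,  0,  0],
                   [65, 43, 12,  0]], dtype=float)

# P(w | v): normalize each row;  P(v | w): normalize each column
p_w_given_v = counts / counts.sum(axis=1, keepdims=True)
p_v_given_w = counts / counts.sum(axis=0, keepdims=True)

print(round(p_w_given_v[0, 0], 2))   # P(w1 | V1) = 12/15 = 0.8
print(round(p_v_given_w[2, 1], 2))   # P(V3 | w2) = 12/97 ≈ 0.12
```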
Cross media relevance models
• Estimating the relevance model – the joint distribution of words and visterms
P(w | I) ≈ P(w | b1…bm) = P(w, b1…bm) / Σw P(w, b1…bm)
• Training:
– Joint distribution computed as an expectation over the training images J
– P(w, b1, …, bm) = ΣJ P(J) P(w, b1, …, bm | J)
• Annotation:
– Compute P(w | I) for different w.
– Annotate the image with every possible w in the vocabulary with
associated probabilities (or pick top k words)
• Retrieval:
– Given a query Q, find the probability of drawing Q from image I: P(Q | I)
– Rank images according to this probability.
J. Jeon, V. Lavrenko and R. Manmatha, Automatic Image Annotation and Retrieval using Cross-Media
Relevance Models, in Proc. SIGIR ’03.
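A minimal sketch of the joint estimate P(w, b1…bm) = ΣJ P(J) P(w | J) Πi P(bi | J) used by the cross-media relevance model, assuming a uniform prior over training images and simple linear smoothing toward collection frequencies; the smoothing weights and toy training set are assumptions, not the paper's exact estimators:

```python
def cmrm_joint(word, blobs, training_set, alpha=0.1, beta=0.1):
    """P(w, b1..bm) = sum_J P(J) * P(w|J) * prod_i P(bi|J).
    training_set: list of dicts with 'words' and 'blobs' (lists of tokens) per training image J.
    P(w|J) and P(b|J) are smoothed linearly toward collection frequencies (alpha, beta assumed)."""
    all_words = [w for J in training_set for w in J["words"]]
    all_blobs = [b for J in training_set for b in J["blobs"]]
    p_w_coll = all_words.count(word) / len(all_words)
    p_J = 1.0 / len(training_set)                       # uniform prior over training images
    total = 0.0
    for J in training_set:
        p_w = (1 - alpha) * J["words"].count(word) / len(J["words"]) + alpha * p_w_coll
        p_bs = 1.0
        for b in blobs:
            p_b_coll = all_blobs.count(b) / len(all_blobs)
            p_bs *= (1 - beta) * J["blobs"].count(b) / len(J["blobs"]) + beta * p_b_coll
        total += p_J * p_w * p_bs
    return total

# Toy training set of two annotated images
train = [{"words": ["tiger", "grass"], "blobs": ["V1", "V3"]},
         {"words": ["people", "dance"], "blobs": ["V2", "V4", "V6"]}]
print(cmrm_joint("tiger", ["V1", "V3"], train))   # joint probability of "tiger" with these blobs
```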