
Combining Multiple Representations on the TRECVID Search Task
Arjen P. de Vries
Thijs Westerveld
Tzvetanka I. Ianeva
ICASSP, May 21 2004
Introduction
• Video Retrieval should take advantage of information from all available sources and modalities
– …but so far ASR best for almost any query
• LL11@TRECVID2003: Combining information sources
– Different models/modalities
– Multiple example images
‘Language Modelling’ approach to IR
[Figure: Docs, Models]
Retrieval
• Calculate conditional probabilities of observing query samples given each model in the collection
[Figure: Query scored against Models: P(Q|M1), P(Q|M2), P(Q|M3), P(Q|M4)]
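The ranking step described above can be sketched for the textual case as follows. This is a minimal illustration, not the authors' implementation: the smoothing scheme (linear interpolation with a collection model), the weight `lam`, and the function name `rank_by_query_likelihood` are all assumptions introduced here.

```python
import math
from collections import Counter

def rank_by_query_likelihood(query_terms, docs, lam=0.5):
    """Rank documents by log P(Q|M_d) under smoothed unigram models.

    docs maps a document name to its list of terms.
    lam is a hypothetical interpolation weight (an assumption, not from the slides).
    """
    doc_counts = {name: Counter(terms) for name, terms in docs.items()}
    collection = Counter()
    for counts in doc_counts.values():
        collection.update(counts)
    collection_size = sum(collection.values())

    scores = {}
    for name, counts in doc_counts.items():
        doc_len = sum(counts.values())
        log_p = 0.0
        for term in query_terms:
            p_doc = counts[term] / doc_len if doc_len else 0.0
            p_coll = collection[term] / collection_size if collection_size else 0.0
            # Assumed smoothing: mix the document model with the collection model
            p = lam * p_doc + (1.0 - lam) * p_coll
            if p == 0.0:
                log_p = float("-inf")
                break
            log_p += math.log(p)
        scores[name] = log_p
    # Highest query likelihood first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "M1": ["dow", "jones", "rise"],
    "M2": ["baseball", "pitcher", "throws"],
    "M3": ["dow", "points", "day"],
}
ranking = rank_by_query_likelihood(["dow", "jones"], docs)
```

Working in log space avoids numerical underflow when a query has many samples, which matters more for the visual models, where a query image contributes many pixel-block samples.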
Static Model
• Indexing
– Estimate a Gaussian Mixture Model from each keyframe (using EM)
– Fixed number of components (C=8)
– Feature vectors contain colour, texture, and position information from pixel blocks: <x,y,DCT>
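The feature-extraction step above might look roughly like the sketch below. It covers only the construction of <x,y,DCT> vectors; fitting the mixture with EM would then run on these vectors. The block size of 8, the single grey-level channel, the choice of three low-frequency coefficients, and the function names are all assumptions made for illustration, since the slides do not state them.

```python
import math

def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for i in range(n):
                for j in range(n):
                    s += (block[i][j]
                          * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * j + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

def block_features(image, block=8):
    """Return one [x, y, dct...] vector per non-overlapping pixel block.

    image is a list of rows of grey values (assumed single channel).
    """
    height, width = len(image), len(image[0])
    features = []
    for top in range(0, height - block + 1, block):
        for left in range(0, width - block + 1, block):
            pixels = [row[left:left + block] for row in image[top:top + block]]
            coeffs = dct2(pixels)
            # Hypothetical choice: DC term plus the two lowest AC terms
            dct = [coeffs[0][0], coeffs[0][1], coeffs[1][0]]
            # Block-centre position, normalised to [0, 1]
            x = (left + block / 2) / width
            y = (top + block / 2) / height
            features.append([x, y] + dct)
    return features

flat_image = [[10.0] * 16 for _ in range(16)]
feats = block_features(flat_image)
```

For a uniform image only the DC coefficient is non-zero, which is a quick sanity check on the transform.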
Dynamic Model
• Indexing:
– GMM of multiple frames (N=29) around keyframe
– Feature vectors extended with timestamp in [0,1]: <x,y,t,DCT>
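Extending the per-frame vectors with a timestamp could be sketched as below. The slides state only that t lies in [0,1]; the linear mapping from frame index to t and the function name `add_timestamps` are assumptions.

```python
def add_timestamps(frames_features):
    """Turn per-frame [x, y, *dct] vectors into [x, y, t, *dct] samples.

    frames_features is a list of frames in temporal order; each frame is a
    list of feature vectors. t is assumed to grow linearly from 0 to 1.
    """
    n = len(frames_features)
    samples = []
    for idx, frame in enumerate(frames_features):
        # A single frame has no temporal extent; place it mid-interval
        t = idx / (n - 1) if n > 1 else 0.5
        for x, y, *dct in frame:
            samples.append([x, y, t, *dct])
    return samples

# 29 frames around the keyframe, one toy feature vector per frame
frames = [[[0.5, 0.5, 1.0]] for _ in range(29)]
samples = add_timestamps(frames)
```

With N=29 the keyframe sits at index 14 and receives t = 0.5, so the model is centred on it in time.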
Dynamic Model Advantages
• More training data for models
• Reduced dependency upon selecting appropriate keyframe
• Some spatio-temporal aspects of shot are captured
– (Dis-)appearance of objects
Experimental Set-up
• Build models for each shot
– Static, Dynamic, Language
• Build Queries from topics
– Construct simple keyword text query
– Select visual example
– Rescale and compress example images to match video size and quality
Combining Modalities
• Independence assumption textual/visual
– P(Qt,Qv|Shot) = P(Qt|LM) * P(Qv|GMM)
• Combination works if both runs useful [CWI:TREC:2002]
• Dynamic run more useful than static run
Run           MAP
ASR only      .130
Static only   .022
Static+ASR    .105
Dynamic only  .022
Dynamic+ASR   .132
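Under the independence assumption P(Qt,Qv|Shot) = P(Qt|LM) * P(Qv|GMM), the product becomes a sum of log-probabilities per shot. A minimal sketch, assuming both runs already provide a log-likelihood for every shot (the function name and the dictionary layout are hypothetical):

```python
def combine_modalities(text_log_probs, visual_log_probs):
    """Rank shots by log P(Qt|LM) + log P(Qv|GMM).

    Both arguments map shot id -> log-likelihood. Only shots scored by
    both runs are ranked (an assumption of this sketch).
    """
    combined = {
        shot: text_log_probs[shot] + visual_log_probs[shot]
        for shot in text_log_probs.keys() & visual_log_probs.keys()
    }
    # Best combined score first; ties broken by shot id for determinism
    return sorted(combined, key=lambda shot: (-combined[shot], shot))

text_scores = {"s1": -2.0, "s2": -1.0, "s3": -5.0}
visual_scores = {"s1": -1.0, "s2": -4.0, "s3": -0.5}
order = combine_modalities(text_scores, visual_scores)
```

In this toy input "s2" is the best text match and "s3" the best visual match, but "s1" ranks first because it scores reasonably on both.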
Dynamic: Higher Initial Precision
Dow Jones Topic (120)
• “Dow Jones Industrial Average rise day points”
[Figure: result panels joined by “+” and “=”]
Arafat topic (103)
Baseball topic (101)
(102)
Basketball
Basketball Topic
Merging Run Results
• Combining (conflicting) examples difficult [CWI:TREC:2002]
• Single example → Miss relevant shots
• Round-Robin Merging
[Figure: two ranked lists (1 to 10 each) interleaved into a combined list: 1, 1, 2, 2, 3, 3, 4, 4, …]
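Round-robin merging of the per-example result lists can be sketched as follows. Skipping a shot that an earlier list already contributed is an assumption of this sketch; the slides show only the interleaving of ranks.

```python
def round_robin_merge(ranked_lists):
    """Interleave ranked lists: rank 1 of each list, then rank 2, and so on.

    Shots already taken from an earlier list are skipped (assumed behaviour).
    """
    merged, seen = [], set()
    depth = max((len(ranking) for ranking in ranked_lists), default=0)
    for rank in range(depth):
        for ranking in ranked_lists:
            if rank < len(ranking):
                shot = ranking[rank]
                if shot not in seen:
                    seen.add(shot)
                    merged.append(shot)
    return merged

# One ranked list per visual example
merged = round_robin_merge([["a", "b", "c"], ["b", "d", "e"]])
```

Because every example contributes its own top results, relevant shots that match only one of several conflicting examples still appear near the top of the merged list.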
Merging Run Results
Examples   MAP    MAP +ASR
Single     .022   .132
All        .031   .149
Selected   .039   .151
Best       .050   .155
Flames Topic (112)
Conclusions
• For most topics, neither the static nor the dynamic visual model captures the user information need sufficiently…
• …averaged over 25 topics however, it is better to use both modalities than ASR only
Working hypothesis: Matching against both modalities gives robustness
Conclusions
• Dynamic captures visual similarity better
– Thanks to spatio-temporal aspects?
• Experiments with full covariance matrix for <x,y,t>-dims
• Static model of KF is too fragile
– Dependency on single KF?
• To be tested by ranking max(all I-frames in shot)
– Not enough training data?
Conclusions
• Visual aspects of an information need are best captured by using multiple examples
• Combining results for multiple (good) examples in round-robin fashion, each ranked on both modalities, gives near-best performance for almost all topics