Combining Multiple Representations on the TRECVID Search Task
Arjen P. de Vries, Thijs Westerveld, Tzvetanka I. Ianeva
ICASSP, May 21 2004

Introduction
• Video Retrieval should take advantage of information from all available sources and modalities
  – …but so far ASR best for almost any query
• LL11@TRECVID2003: Combining information sources
  – Different models/modalities
  – Multiple example images

‘Language Modelling’ approach to IR
[Figure: Docs, Models]

Retrieval Models
• Calculate conditional probabilities of observing query samples given each model in the collection
[Figure: Query scored against each model: P(Q|M1), P(Q|M2), P(Q|M3), P(Q|M4)]

Static Model
• Indexing
  – Estimate a Gaussian Mixture Model from each keyframe (using EM)
  – Fixed number of components (C=8)
  – Feature vectors contain colour, texture, and position information from pixel blocks: <x,y,DCT>

Dynamic Model
• Indexing:
  – GMM of multiple frames (N=29) around keyframe
  – Feature vectors extended with timestamp in [0,1]: <x,y,t,DCT>
[Figure: axis with ticks 0, .5, 1]

Dynamic Model Advantages
• More training data for models
• Reduced dependency upon selecting appropriate keyframe
• Some spatio-temporal aspects of shot are captured
  – (Dis-)appearance of objects

Experimental Set-up
• Build models for each shot
  – Static, Dynamic, Language
• Build Queries from topics
  – Construct simple keyword text query
  – Select visual example
  – Rescale and compress example images to match video size and quality

Combining Modalities
• Independence assumption textual/visual
  – P(Qt,Qv|Shot) = P(Qt|LM) * P(Qv|GMM)
• Combination works if both runs useful [CWI:TREC:2002]
• Dynamic run more useful than static run

  Run            MAP
  ASR only       .130
  Static only    .022
  Static+ASR     .105
  Dynamic only   .022
  Dynamic+ASR    .132

Combining Modalities
• Dynamic: Higher Initial Precision
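The independence assumption on the Combining Modalities slide, P(Qt,Qv|Shot) = P(Qt|LM) * P(Qv|GMM), can be illustrated with a minimal sketch. It works in log space, so the product becomes a sum. The function names, shot identifiers, and probability values below are hypothetical and chosen only for illustration; they are not taken from the talk or from the TRECVID runs.

```python
import math


def combined_log_score(log_p_text, log_p_visual):
    """log P(Qt,Qv|Shot) = log P(Qt|LM) + log P(Qv|GMM), assuming independence."""
    return log_p_text + log_p_visual


def rank_shots(text_scores, visual_scores):
    """Rank shots that have both scores by combined log-likelihood, best first."""
    shared = text_scores.keys() & visual_scores.keys()
    combined = {
        shot: combined_log_score(text_scores[shot], visual_scores[shot])
        for shot in shared
    }
    # Sort by descending score; ties are broken by shot id for determinism.
    return sorted(combined, key=lambda shot: (-combined[shot], shot))


# Hypothetical per-shot probabilities (illustrative values only).
text_scores = {
    "shot_a": math.log(0.02),
    "shot_b": math.log(0.10),
    "shot_c": math.log(0.05),
}
visual_scores = {
    "shot_a": math.log(0.30),
    "shot_b": math.log(0.01),
    "shot_c": math.log(0.20),
}

ranking = rank_shots(text_scores, visual_scores)
print(ranking)
```

In this toy data the shot with the best text score (shot_b) drops to the bottom because its visual score is very low, which shows how a weak run can pull down a combined ranking.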
Dow Jones Topic (120)
• “Dow Jones Industrial Average rise day points”
[Figure: result panels joined by + and =]

Arafat Topic (103)

Baseball and Basketball Topics (101, 102)

Basketball Topic

Merging Run Results
• Combining (conflicting) examples difficult [CWI:TREC:2002]
• Single example: Miss relevant shots
• Round-Robin Merging
[Figure: two ranked lists (1 to 10) interleaved into a combined list: 1, 1, 2, 2, 3, 3, 4, 4, ...]

  Examples    MAP     MAP +ASR
  Single      .022    .132
  All         .031    .149
  Selected    .039    .151
  Best        .050    .155

Flames Topic (112)

Conclusions
• For most topics, neither the static nor the dynamic visual model captures the user information need sufficiently…
• …averaged over 25 topics, however, it is better to use both modalities than ASR only
• Working hypothesis: Matching against both modalities gives robustness

Conclusions
• Dynamic captures visual similarity better
  – Thanks to spatio-temporal aspects?
    • Experiments with full covariance matrix for <x,y,t>-dims
• Static model of KF is too fragile
  – Dependency on single KF?
    • To be tested by ranking max(all I-frames in shot)
  – Not enough training data?

Conclusions
• Visual aspects of an information need are best captured by using multiple examples
• Combining results for multiple (good) examples in round-robin fashion, each ranked on both modalities, gives near-best performance for almost all topics
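The round-robin merging named in the conclusions can be sketched as follows: take the top result of each per-example run in turn, then the second result of each, and so on. Skipping shots that an earlier run already contributed is an assumption of this sketch, since the slides only show rank positions being interleaved. The function name and shot identifiers are hypothetical.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking positions where a shorter run has ended


def round_robin_merge(runs):
    """Interleave ranked lists rank by rank, keeping the first occurrence of each shot."""
    merged = []
    seen = set()
    # zip_longest yields one tuple per rank: (rank-1 of every run), (rank-2 of every run), ...
    for tier in zip_longest(*runs, fillvalue=_MISSING):
        for shot in tier:
            if shot is _MISSING or shot in seen:
                continue  # assumption: duplicates across runs are dropped
            seen.add(shot)
            merged.append(shot)
    return merged


# Hypothetical ranked results for two visual examples of the same topic.
run_1 = ["s1", "s2", "s3"]
run_2 = ["s4", "s1", "s5", "s6"]

merged = round_robin_merge([run_1, run_2])
print(merged)
```

Because each run contributes its best-ranked shots early, a single poor example cannot push every relevant shot found by the other examples far down the merged list.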