International Conference on Multimedia & Expo, July 10, 2006


Exploiting Human Abilities in
Video Retrieval Interfaces
Maximizing the Synergy between Man and Machine
Alex Hauptmann
School of Computer Science
Carnegie Mellon University
Background
Automatic video analysis for detection/recognition is still quite poor
• Consider the baseline (random guessing): improvement over it is limited
• Consider near-duplicates (trivial similarity)
• Does not generalize well across video sources
• Better than nothing
Need humans to make up for this shortcoming!
Differences from VIRAT
• Most interface work was done on broadcast TV
• Harder:
• Unconstrained subject matter
• Graphics, animations, photos
• Broadcasters
• Many different, short shots out of context
• Easier:
• Better resolution
• Conventions in editing, structure
• Audio track
• Keyframes are the typical unit of analysis
“Classic” Interface Work
• Interactive Video Queries
• Fielded text query matching capabilities
• Fast image matching with a simplified interface for launching image queries (see the sketch below)
• Interactive Browsing, Filtering, and Summarizing
• Browsing by visual concepts
• Quick display of contents and context in synchronized views
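To make “fast image matching” concrete, here is a minimal color-histogram sketch. This is an assumed approach for illustration, not the Informedia matcher; the file names are placeholders.

```python
# Rank library keyframes against a query keyframe by HSV histogram similarity.
# Requires opencv-python; image paths below are placeholders.
import cv2

def hsv_hist(path, bins=(8, 8, 8)):
    """Normalized 3-D HSV color histogram of an image file."""
    img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([img], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

query = hsv_hist("query_keyframe.jpg")
library = {"shot_001.jpg": hsv_hist("shot_001.jpg")}  # one entry per keyframe

# Higher correlation = better match; best candidates first
ranked = sorted(library,
                key=lambda k: cv2.compareHist(query, library[k],
                                              cv2.HISTCMP_CORREL),
                reverse=True)
```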
Informedia Client Interface Example
Interface options
Christel08
Suggesting Related Concepts
Zavesky07
Fork Browser
Snoek09
“Classic” Video Interface Results
• Concept browsing and image search used frequently
• Novices still have lower performance than experts
• Some topics reduce “interactivity” to a one-shot query with little browsing/exploration
• The classic Informedia interface, including concept browsing, is often good enough that the user never proceeds to any additional text or image query
• “Classic Informedia” scored highest among systems tested with novice users in TRECVID evaluations
Visual Browsing
Augmented Video Retrieval
The computer observes the user and LEARNS, based on what is marked as relevant.
The system can learn:
• What image characteristics are relevant
• What text characteristics (words) are relevant
• What combination weights should be used
We exploit the human’s ability to quickly mark relevant video and the computer’s ability to learn from given examples.
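As one concrete possibility (a minimal sketch, not the Informedia implementation), the shots a user marks can train a logistic model over per-modality match scores; the learned coefficients then act as combination weights. The feature layout and numbers below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: per-modality match scores for one shot
# columns = [text_score, image_score, audio_score, motion_score]
X_marked = np.array([
    [0.9, 0.7, 0.2, 0.1],   # user marked relevant
    [0.8, 0.6, 0.3, 0.2],   # user marked relevant
    [0.1, 0.2, 0.4, 0.5],   # user marked non-relevant
    [0.2, 0.1, 0.5, 0.6],   # user marked non-relevant
])
y_marked = np.array([1, 1, 0, 0])

# The model's coefficients serve as per-modality combination weights
model = LogisticRegression().fit(X_marked, y_marked)

# Re-rank the whole collection by predicted relevance
X_all = np.random.rand(1000, 4)           # stand-in scores for every shot
relevance = model.predict_proba(X_all)[:, 1]
reranked = np.argsort(-relevance)         # best candidates first
```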
Combining Concept Detectors for Retrieval
[Diagram: a multimodal question (“Find Pope John Paul”) spanning text, audio, motion, and image is run against the video library through diverse knowledge sources/APIs (closed-caption text, audio features, motion features, color features, face, building, … (3k)); their outputs (Output1, Output2, …, Outputn) are combined into a final ranked list for the interface.]
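For illustration, a hedged sketch of this kind of late fusion: normalize each source’s scores so they are comparable, then combine them with per-source weights. The source names, weights, and scores are stand-ins, not the system’s actual configuration.

```python
import numpy as np

def normalize(scores):
    """Min-max normalize so sources with different score ranges are comparable."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)

# Per-shot scores from each knowledge source (one entry per shot)
sources = {
    "closed_caption_text": [3.1, 0.0, 1.2, 5.0],
    "color_feature":       [0.2, 0.9, 0.4, 0.3],
    "face_detector":       [0.0, 0.8, 0.1, 0.9],
}
weights = {"closed_caption_text": 0.5, "color_feature": 0.2, "face_detector": 0.3}

# Weighted sum of normalized scores = combined relevance per shot
fused = sum(w * normalize(sources[name]) for name, w in weights.items())
final_ranking = np.argsort(-fused)   # shot indices, best match first
```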
Why Relevance Feedback?
• Limited training data
• Untrained sources are useful for some specific searches
Example query: finding some boats/ships
Q-Type: general object
Txt: 0.5, Img: 0.3, Face: -0.5 (learned from the training set)
Outdoor: ?, Ocean: ? (unable to be learned)
Probabilistic Local Context Analysis (pLCA)
Yan07
• Goal: Refine the results of the current query
• Method: Treat the combination parameters υ of “un-learned” sources as latent variables and compute P(yj|aj,Dj)
• Discover useful search knowledge based on the initial results Ai
[Diagram: the query arrives with unknown source weights υ1:?, υ2:?, …, υN:?; the initial search assigns scores A1, …, AM to Video 1 through Video M, whose true relevance Y1, …, YM is latent.]
Undirected Model and Parameter Estimation
Compute the posterior probability of document relevance Y given the initial results A, based on an undirected graphical model:

P(y \mid a; D, Q) = \frac{1}{Z} \int P(\upsilon \mid Q; \upsilon^0)\, \exp\!\Big( \sum_{j=1}^{M} y_j a_j + \sum_{j=1}^{M} \sum_{l} \upsilon_l\, y_j\, f_l(D_j, Q) \Big)\, d\upsilon

[Diagram: undirected graphical model with a shared weight node υ connected to every relevance variable Yj; each Yj links the initial score Aj and the document Dj, for j = 1, …, M.]

Variational inference, i.e., iterate until convergence and approximate P(yj|aj) by q_{y_j}:

• Maximize w.r.t. the variational parameters of Y:
q_{y_j} = \Big( 1 + \exp\big( -a_j - \sum_l q_{\upsilon_l} f_l(D_j, Q) \big) \Big)^{-1}

• Maximize w.r.t. the variational parameters of υ:
q_{\upsilon_l} = \upsilon_l^0 + \sum_j q_{y_j} f_l(D_j, Q)
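Read as fixed-point equations, the two maximization steps simply alternate until the beliefs stop changing. A minimal numeric sketch of that loop follows (my reading of the slide, not Yan’s released code; the array names and toy values are assumptions).

```python
# a[j]   : initial search score of document j
# f[j,l] : output of un-learned source l on document j, i.e. f_l(D_j, Q)
# v0[l]  : prior weight of source l
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def plca(a, f, v0, iters=50):
    q_y = sigmoid(a)                 # initialize relevance beliefs from scores
    for _ in range(iters):
        q_v = v0 + f.T @ q_y         # update beliefs over source weights υ
        q_y = sigmoid(a + f @ q_v)   # update per-document relevance beliefs
    return q_y, q_v

# Toy run: 3 documents, 2 un-learned sources (stand-in numbers)
a = np.array([1.2, -0.3, 0.8])
f = np.array([[0.9, 0.1],
              [0.2, 0.7],
              [0.8, 0.3]])
q_y, q_v = plca(a, f, v0=np.zeros(2))
print(q_y)   # refined relevance probabilities P(y_j | a_j)
```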
Automatic vs Interactive Search
Extreme Video Retrieval
• Automatic retrieval baseline for ranking order
• Two methods of presentation:
• System-controlled Presentation – Rapid Serial Visual Presentation (RSVP)
• User-controlled Presentation – Manual Browsing with Resizing of Pages
System-controlled Presentation
• Rapid Serial Visual Presentation (RSVP)
• Minimizes eye movements
• All images in the same location
• Maximizes information transfer: System → Human
• Up to 10 key images/second
• 1 or 2 images per page
• Presentation intervals are dynamically adjustable
• Click when a relevant shot is seen
• Mark the previous page also as relevant
• A final verification step is necessary
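A sketch of the RSVP control loop under the assumptions above: pages of one or two keyframes flash at a fixed screen position at an adjustable interval, and a click marks both the current and the previous page to absorb reaction lag. `show_page` and `click_occurred` are hypothetical stand-ins for real UI hooks.

```python
import random
import time

def show_page(page):
    print(f"showing page {page}")    # stand-in: draw keyframes, same spot

def click_occurred():
    return random.random() < 0.1     # stand-in: poll the mouse button

def rsvp(pages, interval=0.2):
    marked = set()
    for i, page in enumerate(pages):
        show_page(page)
        time.sleep(interval)         # user can adjust this dynamically
        if click_occurred():
            marked.add(i)
            if i > 0:
                marked.add(i - 1)    # reaction time: also mark previous page
    return marked                    # candidates for the final verification step

hits = rsvp(range(20), interval=0.1)   # 10 pages/second at the extreme
```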
User-controlled presentations
• Manual Browsing with Resizing of Pages
• Manually page through images
• User decides to view next page
• Vary the number of images on a page
• Allow chording on the keypad to mark shots
• A very brief final verification step
MBRP – Manual Browsing with Resizable Pages
Extreme QA with RSVP
• 3x3 display
• 1 page/second
• Numpad chording to select shots
Mindreading – an EEG interface
• Learn Relevant/Non-Relevant
• 5 EEG probes
• Simple features
• Too slow
• 250 ms/image
• Significant recovery time after a hit
• Relevance feedback
Summarizing Video: Beyond Keyframes
• BBC Rushes
• Unedited video for TV series production
• Summarize as video in 1/50th of the total duration
• Note the non-scalable target factor
• Lots of smart analysis: clustering, salience, redundancy, importance
• Best performance for retrieval was to play every frame
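The “play every frame” baseline amounts to re-timing the video rather than discarding content. A minimal sketch with OpenCV (paths are placeholders, and note that players and containers may cap very high frame rates):

```python
# Keep every frame but write the output at 50x the input frame rate,
# so a 50-minute rush plays back in about one minute.
import cv2

SPEEDUP = 50
cap = cv2.VideoCapture("rushes_input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("summary_50x.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), fps * SPEEDUP, (w, h))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    out.write(frame)   # every frame kept; only the playback clock changes
cap.release()
out.release()
```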
Speed-up summarization results
Surveillance Event Detection
• Interesting stuff is rare
• Detection accuracy is limited
• Monitor many streams
Surveillance Event Detection
• Need action, not a key frame
• Difficult for humans
• Combine speed-up with automatic analysis
• Slow down when interesting stuff happens
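One simple way to realize this, sketched under assumed inputs: map per-segment detector confidence to a playback rate, racing through uneventful footage and dropping toward real time when the detector fires. The confidences, threshold, and rates are illustrative, not tuned system values.

```python
def playback_rate(event_confidence, fast=20.0, slow=1.0, threshold=0.5):
    """Speed-up factor for the current segment of a surveillance stream."""
    return slow if event_confidence >= threshold else fast

# Per-segment detector confidences for one camera stream (stand-in values)
confidences = [0.05, 0.10, 0.80, 0.90, 0.20]
rates = [playback_rate(c) for c in confidences]   # [20.0, 20.0, 1.0, 1.0, 20.0]
```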
Summary
Interfaces have much to contribute to retrieval
• We don’t know what is best
• Task-specific
• User-specific
• System-dependent
• Collaborative search
• Combining “best of current systems”
• Simpler is usually better (Occam’s razor)
General principles are difficult to find
Questions?