Informedia at TRECVID 2003:
Analyzing and Searching Broadcast News Video
TRECVID 2003
Carnegie Mellon University
A. Hauptmann, R.V. Baron, M.-Y. Chen, M. Christel, P.
Duygulu, C. Huang, R. Jin, W.-H. Lin, T. Ng, N. Moraveji,
N. Papernick, C.G.M. Snoek, G. Tzanetakis, J. Yang, R.
Yang, and H.D. Wactlar
Overview (1/3)
TRECVID 2003
Shot boundary determination
identify the shot boundaries in the given video clip(s)
Story segmentation
identify the story boundary and types (miscellaneous or news)
High-level feature extraction
Outdoors, News subject face, People, Building, Road, Animal, ...
Search
Given the search test collection and a multimedia statement of information
need (topic), return a ranked list of common reference shots
from the test collection
Overview (2/3)
Search
Interactive Search
Manual Search
Overview (3/3)
Semantic Classifiers
most are trained on keyframes
Interactive Search
allow more effective browsing and visualization of the
results of text queries using a variety of filter strategies
Manual Search
use multiple retrieval agents (color, texture, ASR, OCR
and some of the classifiers, e.g. anchor, PersonX)
Negative Pseudo-Relevance Feedback
Co-retrieval
Even the text-based baseline using the OKAPI formula
performed better than other groups' systems
Extracted Features and Non-TRECVID Metadata
Classifiers for Anchors and Commercials (1/3)
Audio Features
These features assist the extraction of the following
medium-level audio-based features: music, male speech,
female speech, and noise.
Based on the magnitude spectrum calculated using a
Short-Time Fourier Transform (STFT)
consist of features that summarize the overall spectral
characteristics:
Spectral Centroid, Rolloff, Relative Subband Energies, and the
Mel-Frequency Cepstral Coefficients (MFCCs)
male/female discrimination: using the Average Magnitude
Difference Function (AMDF)
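A minimal Python sketch (not the authors' implementation) of two of these spectral features plus the AMDF; the windowing, frame handling, and the 0.85 rolloff percentage are assumptions:

```python
import numpy as np

def spectral_features(frame, sr, rolloff_pct=0.85):
    # Magnitude spectrum of one windowed frame (Short-Time Fourier Transform).
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Spectral centroid: magnitude-weighted mean frequency.
    centroid = (freqs * mag).sum() / (mag.sum() + 1e-12)
    # Rolloff: frequency below which rolloff_pct of the spectral energy lies.
    cum = np.cumsum(mag)
    rolloff = freqs[np.searchsorted(cum, rolloff_pct * cum[-1])]
    return centroid, rolloff

def amdf(frame, lag):
    # Average Magnitude Difference Function at one lag; its minima over a
    # range of lags indicate pitch, which helps separate male/female speech.
    return np.abs(frame[:-lag] - frame[lag:]).mean()
```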
Extracted Features and Non-TRECVID Metadata
Classifiers for Anchors and Commercials (2/3)
Low-level Image Features
The color feature is the mean and variance of each
color channel in HSV (Hue-Saturation-Value) color
space over a 5×5 image tessellation.
Another low-level feature is the Canny edge-direction
histogram.
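A minimal sketch of the tessellated color feature as described, assuming OpenCV for the HSV conversion (BGR input follows OpenCV's convention):

```python
import numpy as np
import cv2  # OpenCV, assumed available

def color_feature(image_bgr, grid=5):
    # Mean and variance of H, S, V over a grid x grid tessellation.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    h, w, _ = hsv.shape
    feats = []
    for r in range(grid):
        for c in range(grid):
            cell = hsv[r * h // grid:(r + 1) * h // grid,
                       c * w // grid:(c + 1) * w // grid]
            feats.extend(cell.mean(axis=(0, 1)))  # per-channel mean
            feats.extend(cell.var(axis=(0, 1)))   # per-channel variance
    return np.array(feats)  # 5 * 5 cells * 6 stats = 150 dimensions
```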
Face Features
Schneiderman’s face detector algorithm
Size and position of the largest face are used as
additional face features
Extracted Features and Non-TRECVID Metadata
Classifiers for Anchors and Commercials (3/3)
Text-based features
the most reliable high-level feature
Automatic Speech Recognition (ASR) transcripts, Video
Optical Character Recognition (VOCR)
Video OCR (VOCR)
Manber and Wu’s approximate string matching
technique, e.g. “Clinton” may retrieve “Cllnton”,
“Ciintonfi”, “Cltnton” and “Clinton”
However, heavily corrupted text such as "EIICKINSON" (for
"DICKINSON") and "Cincintoli" (for "Cincinnati") remains problematic
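Manber and Wu's technique is a bit-parallel approximate matcher; as a simpler stand-in illustrating the same idea, a plain edit-distance match works (the max_errors threshold is an assumption):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(query, ocr_tokens, max_errors=2):
    # Retrieve OCR tokens within max_errors edits of the query, so
    # "Clinton" also matches "Cllnton" and "Cltnton".
    q = query.lower()
    return [t for t in ocr_tokens if edit_distance(q, t.lower()) <= max_errors]
```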
Fisher Linear Discriminant for
Anchors and Commercials (1/2)
Multimodal combination approach: apply FLD to every
feature set and synthesize new feature vectors.
These synthesized feature vectors represent the
content; standard feature-vector classification
approaches are then applied.
Two different SVM-based classifiers:
anchor: color histogram, face info., and speaker info.
commercial: color histogram and audio feature
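A minimal sketch of the FLD-then-SVM pipeline using scikit-learn; the single-component projection and the SVM settings are assumptions, not the authors' exact configuration:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def fld_fuse_and_classify(train_sets, y_train, test_sets):
    # Apply FLD to each feature set (e.g. color, face, speaker), concatenate
    # the 1-D projections into a synthesized vector, then classify with an SVM.
    z_train, z_test = [], []
    for X_tr, X_te in zip(train_sets, test_sets):
        fld = LinearDiscriminantAnalysis(n_components=1)  # binary task -> 1 dim
        z_train.append(fld.fit_transform(X_tr, y_train))
        z_test.append(fld.transform(X_te))
    clf = SVC(probability=True).fit(np.hstack(z_train), y_train)
    return clf.predict_proba(np.hstack(z_test))[:, 1]  # P(positive class)
```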
Fisher Linear Discriminant for
Anchors and Commercials (2/2)
FLD weights for anchor detection
Anchor and Commercial classifier result
Feature Classifiers (1/7)
Baseline SVM Classifier with Common Annotation Data
SVM with a degree-2 polynomial kernel
use only image features (no face)
perform video-based cross-validation with portions of
the common annotation data
Feature            MAP
Outdoors           0.112
Buildings          0.071
Roads              0.028
Vegetation         0.112
Cars               0.040
Aircraft           0.059
Sports             0.051
Weather News       0.017
Physical violence  0.012
Animals            0.017
Feature Classifiers (2/7)
Building Detection
explore a classifier adapting the man-made structure detection
method of Kumar and Hebert
this method produces binary detection outputs for each cell of a 22×16
grid; 5 features are extracted from the binary detection outputs:
number of positive grid cells;
area of the bounding box that includes all the positive cells;
x and y coordinates of the center of mass of the positive cells;
ratio of the bounding box's width to its height;
compactness
462 images are used as positive examples and 495 images as
negative examples; classified by FLD and SVM
MAP 0.042 (man-made structures) vs. 0.071 (baseline SVM)
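A sketch of the five summary features from the binary detection grid; note the center of mass contributes two numbers, and the compactness definition (positives divided by bounding-box area) is an assumption:

```python
import numpy as np

def structure_grid_features(grid):
    # grid: binary array of per-cell detections, e.g. shape (16, 22).
    ys, xs = np.nonzero(grid)
    if len(xs) == 0:
        return np.zeros(6)
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    w, h = x1 - x0 + 1, y1 - y0 + 1
    n_pos = len(xs)                   # number of positive grid cells
    bbox_area = w * h                 # bounding-box area
    cx, cy = xs.mean(), ys.mean()     # center of mass (x and y: two values)
    aspect = w / h                    # width/height ratio
    compactness = n_pos / bbox_area   # fill ratio of the bounding box
    return np.array([n_pos, bbox_area, cx, cy, aspect, compactness])
```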
Feature Classifiers (3/7)
Plane Detection using additional still image data
use the image features described above
3368 plane examples selected from the web, the Corel data
set, and the University of Oxford data set as
positive examples
3516 negative examples
By FLD and SVM, MAP 0.008 vs. 0.059 (baseline)
Car Detection
modify the Schneiderman face detector algorithm
Outperforms the baseline with MAP 0.114 vs. 0.040
Feature Classifiers (4/7)
Zoom Detection
use MPEG motion vectors to estimate the probability
of a zoom pattern
MAP 0.632
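One plausible way to estimate a zoom pattern from motion vectors (not necessarily the authors' method) is to fit a pure-scaling motion model about the frame center:

```python
import numpy as np

def zoom_score(positions, vectors):
    # positions: (N, 2) macroblock centers; vectors: (N, 2) MPEG motion vectors.
    # Fit v ~ s * (p - c): in a zoom, motion points radially in/out of center.
    center = positions.mean(axis=0)
    radial = positions - center
    # Least-squares scale factor s relating radial position to motion.
    s = (radial * vectors).sum() / ((radial ** 2).sum() + 1e-12)
    residual = vectors - s * radial
    explained = 1.0 - residual.var() / (vectors.var() + 1e-12)
    # Large |s| with explained near 1 suggests a zoom-like field.
    return s, explained
```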
Female Speech
use an SVM trained on the LIMSI-provided speech
features, together with the face characteristics
MAP 0.465
Feature Classifiers (5/7)
Text and Timing for Weather News, Outdoors, Sporting
Event, Physical Violence, and Person X Classifiers
Models based only on text information are better than
random baselines on the development data
Feature Classifiers (6/7)
Timing information is the implicit temporal structure of
broadcast news; it is especially strong for weather reports and sports.
Feature Classifiers (7/7)
For each shot, the predictions from both text-based and
timing-based classifiers have to be considered
Except for weather news, the results suggest that the text
information of the broadcast news in the shot may not be
enough to detect these high-level features.
News Subject Monologues (1/2)
Based on the LIMSI speech annotations, they developed a
voice-over detector and a frequent-speaker detector
VOCR is applied to extract overlaid text in the hope of
finding people's names
News Subject Monologues (2/2)
Another feature measures the average amount of motion in
a camera shot, based on frame difference
also use commercial and anchor detectors
combine individual detectors and features by using two
well-known classifier combination schemes, namely
stacking and bagging
MAP 0.616
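A minimal sketch of the frame-difference motion measure; grayscale frames and plain mean absolute difference are assumptions:

```python
import numpy as np

def average_motion(frames):
    # Average per-pixel frame difference across a shot: a crude proxy for
    # camera/object motion (monologue shots tend to score low).
    diffs = [np.abs(frames[i + 1].astype(np.float32) -
                    frames[i].astype(np.float32)).mean()
             for i in range(len(frames) - 1)]
    return float(np.mean(diffs)) if diffs else 0.0
```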
Finding Person X in Broadcast News (1/3)
Use text information from the transcript
and face information
Text score models the relationship between the name
of person X and time:
S: one shot; T_S: time of the key frame;
T_O: time of the person's name
$p_{\text{name}}(S) = \lambda^{|T_S - T_O|}$
$P_{\text{text}}(S) = P_{\text{name}}(S)\,(1 - P_{\text{anchor}}(S))$
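A sketch implementing the scoring as reconstructed above; the decay base and the use of the nearest mention of the name are assumptions:

```python
def p_text(shot_time, name_times, p_anchor, decay=0.9):
    # Exponential decay with the gap (seconds) between the shot keyframe and
    # the nearest mention of person X's name, discounted for anchor shots.
    if not name_times:
        return 0.0
    gap = min(abs(shot_time - t) for t in name_times)
    p_name = decay ** gap            # assumed lambda = 0.9
    return p_name * (1.0 - p_anchor)
```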
Finding Person X in Broadcast News (2/3)
A more limited face recognition based on video shots
collect sample faces {F1, F2, …, Fn} for person X
and all faces {f1, f2, …, fm} from I-frames of the news shots whose P_text is
larger than zero
build the eigenspace for those faces
{f1, f2, …, fm, F1, F2, …, Fn} and represent them by the eigenfaces
{eigf1, eigf2, …, eigfm, eigF1, …, eigFn}
combine rank scores to estimate which shots are likely
to contain that face
$R(eigf_i) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{r_j(eigf_i)}$
$S_{\text{face}}(S) = \frac{1}{k} \sum_{eigf_i \in S} R(eigf_i)$
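A sketch of the rank-combination scores, assuming r_j(eigf_i) is the 1-based rank of candidate face eigf_i in the list ordered by similarity to sample face eigF_j, and k is the number of faces in shot S:

```python
import numpy as np

def rank_score(rank_lists):
    # rank_lists: (n_samples, n_candidates) array of 1-based ranks.
    # R(eigf_i) = (1/n) * sum_j 1 / r_j(eigf_i): mean reciprocal rank of each
    # candidate face across the n sample-face ranking lists.
    return (1.0 / np.asarray(rank_lists, dtype=float)).mean(axis=0)

def face_score(candidate_scores_in_shot):
    # S_face(S): average R over the k candidate faces appearing in shot S.
    return float(np.mean(candidate_scores_in_shot))
```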
Finding Person X in Broadcast News (3/3)
Using "Madeleine Albright" as person X, we obtained 20
faces from a Google image search as sample query faces.
Learning Combination Weights in
Manual Retrieval (1/5)
Shot-based video retrieval: a set of features is extracted,
and each shot is associated with a vector of individual
retrieval scores from different media search modules;
finally, these retrieval scores are fused into a final
ordered list via some aggregation algorithm
Learning Combination Weights in
Manual Retrieval (2/5)
use the weighted Borda fuse model as the basic combination
approach for multiple search modules, i.e. for each shot its
final score is $y = \sum_{i=1}^{n} w_i s_i$
Similarity Measures
For video frames, a harmonic mean of the Euclidean
distances from each query image (color, texture, edge)
is computed as the distance between the query and a video
frame
For text, matching against CC and OCR transcripts uses the
OKAPI BM-25 formula
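A sketch of the two aggregation steps just described (the small epsilon guard against division by zero is an assumption):

```python
import numpy as np

def harmonic_mean_distance(query_feats, frame_feat):
    # Harmonic mean of Euclidean distances from each query image's feature
    # vector (e.g. color, texture, edge) to one candidate video frame.
    dists = [np.linalg.norm(q - frame_feat) for q in query_feats]
    return len(dists) / sum(1.0 / (d + 1e-12) for d in dists)

def weighted_fusion(score_vectors, weights):
    # Weighted linear fusion y = sum_i w_i * s_i over per-module score lists.
    # score_vectors: (n_modules, n_shots); weights: (n_modules,).
    return np.asarray(score_vectors).T @ np.asarray(weights)
```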
Learning Combination Weights in
Manual Retrieval (3/5)
Negative Pseudo-Relevance Feedback (NPRF)
NPRF is effective at providing a more adaptive similarity
measure for image retrieval
Propose a better strategy to sample negative examples,
inspired by Maximal Marginal Relevance (MMR)
Maximal Marginal Irrelevance (MMIR)
$\mathrm{MMIR} = \arg\min_{D_i \in T \setminus S} \left[ \lambda\, \mathrm{Sim}_1(D_i, Q) + (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \right]$
Learning Combination Weights in
Manual Retrieval (4/5)
The Value of Intermediate-level Detectors
The text-based feature is good at global ranking, and the other
features are useful in refining the ranking afterwards
Learning Weights for each Modality in Video
Retrieval
Baseline: Setting weights based on query types
Person query: w=(text 2, face 1, color 1, anchor 0)
Non-person query: w=(text 2, face -1, color 1, anchor -1)
Aircraft and animal: w=(text 2, face -1, edge 1, anchor -1)
Learning Combination Weights in
Manual Retrieval (5/5)
Learning weights using a labeled training set
Supervised learning algorithm on the development set
Co-Retrieval
a set of video shots is first labeled as relevant using text-based
features, and the results are augmented by learning with the
other visual and intermediate-level features
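A sketch of learning per-module fusion weights on labeled development data; logistic regression is a stand-in here, since the slides do not name the supervised learner:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_weights(module_scores, relevance_labels):
    # module_scores: (n_shots, n_modules) per-module retrieval scores;
    # relevance_labels: 1 = relevant, 0 = not relevant.
    X = np.asarray(module_scores)
    y = np.asarray(relevance_labels)
    model = LogisticRegression().fit(X, y)
    return model.coef_.ravel()  # one fusion weight per search module
```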
Experimental results
Interactive TREC Video Retrieval
Evaluation for 2003 (1/2)
This interface has the following features:
Storyboards of images spanning video story
segments
Emphasizing shots that match a user's query to reduce
the image count
Resolution and layout under user control
Additional filtering provided through shot classifiers
Display of filter count and distribution to guide
manipulation of storyboard views
Interactive TREC Video Retrieval
Evaluation for 2003 (2/2)
Conclusions
We believe the browsing interfaces and image-based
search improvements made for 2003 led to the increased
performance of the new system, as these strategies
allowed relevant content to be found even when it had no
associated narrative or text metadata.