Content Based Multimedia Signal Processing


Content Based Multimedia
Signal Processing
Yu Hen Hu
University of Wisconsin – Madison
Outline
• Multimedia Content Description Interface (MPEG-7)
• Video content features
• Spoken content features
• Multimedia indexing and retrieval
• Multimedia summary and filtering
• Other applications
MPEG-7 Overview
• A large amount of digital content is available
• It is easy to create, digitize, and distribute audiovisual content
• Family album syndrome
– Need to organize, index, and retrieve
• Information overload
– Need filtering
• MPEG-7 objective
– Provide interoperability among systems and applications used in the generation, management, distribution, and consumption of audio-visual content descriptions.
– Help users identify, retrieve, or filter audio-visual information.
Potential Application of MPEG-7
• Summary
– Generation of multimedia
program guide or content
summary
– Generation of content
description of A/V archive to
allow seamless exchange
among content creator,
aggregator, and consumer.
• Filtering
– Filter and transform
multimedia streams in
resource limited
environment by matching
user preference, available
resource and content
description.
• Retrieval
– Recall music using samples
of tunes
– Recall pictures using
sketches of shape, color
movement, description of
scenario
• Recommendation
– Recommend program
materials by matching user
preference (profile) to
program content
• Indexing
– Create family photo or video
library index
Content descriptions
• Descriptors
– MPEG-7 contains
standardized descriptors
for audio, visual, and
generic content.
– Standardizes how these
content features are
characterized, but not
how they are extracted.
– Different levels of syntactic
and semantic description
are available
• Description Scheme
– Specify the structure and
relations among different
A/V descriptors
• Description Definition
Language (DDL)
– Standardized language
based on XML (eXtensible
Markup Language) for
defining new Ds and
DSs, and for extending or
modifying existing Ds
and DSs.
Visual Color Descriptors
• Color space: HSV (hue-saturation-value)
– Scalable color descriptor
(SCD): a uniform 256-bin
color histogram of an
image in HSV space, encoded
by a Haar transform.
• Color layout descriptor:
– spatial distribution of
color in an arbitrarily
shaped region.
• Dominant color
descriptor (DCD):
– colors are clustered first.
• Color structure
descriptor (CSD):
– scan an 8×8 block in a
sliding window, and count
occurrences of particular
colors in the window.
• Group of Frames/Group
of Pictures (GoF/GoP)
color descriptor
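As a rough illustration of the histogram step behind the scalable color descriptor, the sketch below quantizes pixels into a uniform HSV histogram. The Haar-transform encoding is omitted, and the bin counts (`h_bins`, `s_bins`, `v_bins`) are illustrative choices, not the exact MPEG-7 quantization.

```python
import colorsys

# Sketch of a uniform HSV histogram; bin counts are illustrative
# assumptions, not the MPEG-7 quantization tables.
def hsv_histogram(rgb_pixels, h_bins=16, s_bins=4, v_bins=4):
    hist = [0] * (h_bins * s_bins * v_bins)
    for r, g, b in rgb_pixels:  # channel values in [0, 1]
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        # quantize each channel, clamping the upper boundary
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1
    return hist

pixels = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
hist = hsv_histogram(pixels)
print(sum(hist))  # 3: every pixel falls in exactly one bin
```

A real descriptor would normalize the histogram and compact it (MPEG-7 uses a Haar transform of the bin values) before storage.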
Visual Texture Descriptor
• Texture Browsing D.
– Regularity:
• 0: irregular; 3: periodic
– Directionality
• Up to 2 directions
• Quantized 1–6 in 30° increments
– Coarseness
• 0: fine; 3: coarse
• Edge histogram D.
– 16 sub-images
– 5 (edge direction)
bins/sub-image
• Homogeneous Texture
D. (HTD)
– Divide frequency space
into 30 bins (5 radial, 6
angular)
– 2D Gabor filter bank
applied to each bin
– Energy and energy
deviation in each bin
computed to form
descriptor.
Visual Shape Descriptor
• 3D Shape D. – Shape
spectrum
– Histogram (100 bins,
12 bits/bin) of a shape index,
computed over the 3D surface.
– Each shape index measures
local convexity.
• Region-based D.: ART
– Angular Radial Transform
– Shape analysis based on
moments
– ART basis functions:
Vnm(ρ, θ) = exp(jmθ) Rn(ρ)
Rn(ρ) = 2 cos(πnρ), n ≠ 0
Rn(ρ) = 1, n = 0
• Contour-based shape
descriptor
– Curvature scale space
(CSS)
– N points/curve, successively
smoothed by [0.25 0.5 0.25]
until the curve becomes convex.
– The curvature at each point
forms the curvature function
at that scale.
– Peaks across scales are
used as features
• 2D/3D descriptors
– Use multiple 2D descriptors
to describe 3D shape
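The CSS smoothing step above can be sketched as repeated convolution of the contour with the [0.25 0.5 0.25] kernel. A minimal version, assuming a closed contour given as a list of 2-D points (indices wrap around):

```python
# One pass of the [0.25, 0.5, 0.25] smoothing kernel over a closed
# contour; repeated passes progressively flatten the curve, which is
# the scale axis of the curvature scale space.
def smooth(points):
    n = len(points)
    return [
        tuple(
            0.25 * points[(i - 1) % n][d]
            + 0.5 * points[i][d]
            + 0.25 * points[(i + 1) % n][d]
            for d in range(2)
        )
        for i in range(n)
    ]

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(smooth(square))
# [(0.25, 0.25), (0.75, 0.25), (0.75, 0.75), (0.25, 0.75)]
```

Each pass pulls the corners toward the centroid; the full descriptor tracks where curvature zero-crossings disappear as the number of passes grows.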
Visual Motion Descriptor
• Motion activity D. (of a video segment)
– Intensity
– Direction of activity
– Spatial distribution of activity
– Temporal distribution of motion activity
• Camera motion
– Panning
– Booming (lift up)
– Tracking
– Tilting
– Zooming
– Rolling (around image center)
– Dollying (backward)
• Warping parameters (w.r.t. a mosaic)
• Motion trajectory (of a moving region)
• Parametric motion (of a moving region)
MPEG-7 Audio Content Descriptors
• 4 classes of audio signals
– Pure music
– Pure speech
– Pure sound effects
– Arbitrary sound track
• Audio descriptors
– Silence Ds: SilenceType
– Sound effect Ds:
• Audio spectrum
• Sound effect features
– Spoken content Ds:
• Speaker type
• Link type
• Extraction info type
• Confusion info type
– Timbre Ds:
• Instrument
• Harmonic instrument
• Percussive instrument
– Melody contour Ds:
• Contour
• Meter
• Beat
Spoken content description
• Goal: to support potentially erroneous decodings, extracted using an automatic speech recognition (ASR) system, for robust retrieval.
• Processing pipeline: speech waveform → audio processing → ASR → MPEG-7 encoder → header + lattice
• Spoken content header
– Word lexicon (vocabulary)
– Phone lexicon:
• IPA (International Phonetic Association alphabet)
• SAMPA (Speech Assessment Methods Phonetic Alphabet)
– Phone confusion statistics
– Speaker
• Spoken content lattice (word or phone)
– Lattice nodes
– Word and phone links
• Example lattice links: "BORE" (P=0.6), "IS" (P=0.7), "HIS" (P=0.3)
Use of Content Features
• Multimedia information
retrieval
– Create searchable
archives of A/V materials,
e.g. albums, digital libraries
– Real-world examples:
• Call routing
• Technical support
• On-line manuals
• Shopping
• Multimedia on demand
• Filtering
– Automated email sorter
– Personalized information
portal
• Enhance low-level
signal processing
– Coding and trans-coding
– Post-processing
Content-based Retrieval
[Block diagram: an Input Module extracts features from multimedia data and stores them in a Feature Database, alongside the Image Database. A Query Module performs feature extraction and interactive query formation on the user's input. A Retrieval Module compares query features against the Feature Database and returns output to the user for browsing and feedback.]
Multimedia CBR System Design Issues
• Requirement analysis
– How the multimedia materials are to be used
– Determines what set of features are needed.
• Archiving
– How should individual objects be stored? At what granularity?
• Indexing (query) and retrieving
– With multi-dimensional indices, what is an effective and efficient
retrieval method?
– What is a suitable perceptually-consistent similarity measure?
• User interface
– Modality? Text or spoken language or others?
– Interactive or batch? Will dialogue be available?
Multimedia Archiving
• Facts:
– Often in compressed format and needs large
storage space
– Content index will also occupy storage space
• Issues
– Granularity must match underlying file system
– Logical versus physical segmentation
– File allocation on file system must support multiple
stream access and low latency
Indexing and Retrieving
• Index
– A very high dimensional
binary vector
– Encoding of content
features
– Text-based content can
be represented with term
vectors
– A/V content features can
be either Boolean vectors
or term vectors
• Retrieval
– Retrieval is a pattern
classification problem
– Use index vector as the
feature vector
– Classify each object as
relevant and irrelevant to
a query vector (template)
– A perceptually consistent
similarity measure is
essential
Term Vector Query
• Each document is represented by a specific term vector
• A term is a key-word or a phrase
• A term vector is a vector over terms; each dimension of the vector
corresponds to a term.
• Dimension of a term vector = total number of distinct terms.
• Example:
Set of terms = [tree, cake, happy, cry, mother, father, big, small]
document = “Father gives me a big cake. I am so happy”, “mother
planted a small tree”
Term vectors: [ 0, 1, 1, 0, 0, 1, 1, 0], [1, 0, 0, 0, 1, 0, 0, 1]
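The binary term vectors above can be computed mechanically; a minimal sketch using the slide's term list (the punctuation stripping is a simplifying assumption):

```python
# Fixed term list from the slide's example.
TERMS = ["tree", "cake", "happy", "cry", "mother", "father", "big", "small"]

def term_vector(document):
    """Mark each term as present (1) or absent (0) in the document."""
    words = {w.strip(".,").lower() for w in document.split()}
    return [1 if t in words else 0 for t in TERMS]

print(term_vector("Father gives me a big cake. I am so happy"))
# [0, 1, 1, 0, 0, 1, 1, 0]
print(term_vector("mother planted a small tree"))
# [1, 0, 0, 0, 1, 0, 0, 1]
```

A real system would also apply stemming ("ideas" → "idea") and stop-word removal before matching terms.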
Inverse Term Frequency Vector
– A probabilistic term vector representation.
– Relative term frequency (within a document):
tf(t,d) = count of term t in d / # of terms in document d
– Inverse document frequency:
df(t) = total # of documents / # of documents containing t
– Weighted term frequency:
dt = tf(t,d) · log[df(t)]
– Inverse document frequency term vector D = [d1, d2, … ]
ITF Vector Example
Document 1: The weather is great these days.
Document 2: These are great ideas.
Document 3: You look great.
Stop words eliminated: the, is, these, are, you

Term     | tf(t,1) | tf(t,2) | tf(t,3) | df(t) | D1   | D2   | D3
weather  | 1/6     | 0       | 0       | 3     | 0.08 | 0.00 | 0.00
great    | 1/6     | 1/4     | 1/3     | 1     | 0.00 | 0.00 | 0.00
day      | 1/6     | 0       | 0       | 3     | 0.08 | 0.00 | 0.00
idea     | 0       | 1/4     | 0       | 3     | 0.00 | 0.12 | 0.00
look     | 0       | 0       | 1/3     | 3     | 0.00 | 0.00 | 0.16

(df(t) is the ratio of total documents to documents containing t; the weights use log base 10.)
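The table values can be reproduced directly from the formulas on the previous slide. In the sketch below the tf values are taken from the example, and log base 10 is assumed since it matches the printed weights:

```python
import math

# tf(t,d) per document, with stop words already removed;
# values are the fractions from the example table.
docs_tf = {
    "D1": {"weather": 1/6, "great": 1/6, "day": 1/6},
    "D2": {"great": 1/4, "idea": 1/4},
    "D3": {"great": 1/3, "look": 1/3},
}
terms = ["weather", "great", "day", "idea", "look"]
n_docs = len(docs_tf)

def weight(t, d):
    # df(t) = total documents / documents containing t;
    # weighted term frequency = tf(t,d) * log10(df(t))
    containing = sum(1 for tf in docs_tf.values() if t in tf)
    df = n_docs / containing
    return docs_tf[d].get(t, 0.0) * math.log10(df)

for d in docs_tf:
    print(d, [round(weight(t, d), 2) for t in terms])
# D1 [0.08, 0.0, 0.08, 0.0, 0.0]
# D2 [0.0, 0.0, 0.0, 0.12, 0.0]
# D3 [0.0, 0.0, 0.0, 0.0, 0.16]
```

Note how "great" gets weight 0 everywhere: it appears in every document, so df = 1 and log(df) = 0, exactly the behavior the inverse-frequency weighting is designed to produce.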
Human Computer Interface
HCI is a match-maker: matching the needs of humans and computers.
[Diagram: the human side issues commands via voice, gesture, push button/key, expression, and eye movement; the computer side returns data as sensations: visual, audio, pressure, and smell (in virtual environments).]
Basic HCI Design Principles
• Consistency: The same command means the same thing
• Intuition: Use metaphors that are familiar to the user
• Adaptability: Adapt to the user's skill and style
• Economy: Use minimum effort to achieve a goal
• Non-intrusive: Do not decide for the user without asking
• Structure: Present only relevant information to the user, in a simple manner.
User Models
• User profiles:
– Categorize users using features relevant to tasks
– Static features: age, sex, etc.
– Dynamic features: activity logs, etc.
– Derived features: skill levels, preferences, etc.
• Use of profiles for HCI
– Adaptation: customize the HCI for different categories of users
– Better understanding of the user's needs
Principles of Dialogue Design
• Feedback: Always acknowledge the user's input
• Status: Always inform users where they are in the system
• Escape: Provide a graceful way to exit halfway through.
• Minimal work: Minimize the amount of input the user must provide
• Default: Provide default values to minimize work
• Help: Provide context-sensitive help
• Undo: Allow the user to make an unintentional mistake and correct it
• Consistency: The same action means the same thing throughout the dialogue
Performance Evaluation
• The document retrieval problem is a
hypothesis testing problem:
H0: di is relevant to q (r=1)
H1: di is irrelevant to q (r=0)
• Type I error (Pe1 = P{r=0|H0}):
relevant but not retrieved.
• Type II error (Pe2 = P{r=1|H1}):
irrelevant but retrieved.

Contingency table for evaluating retrieval:

             | Retrieved | Not retrieved
Relevant     | w         | x
Irrelevant   | y         | z

• Precision–Recall curve
– P(recision) = w/(w+y) is a
measure of the specificity of the
result
– R(ecall) = w/(w+x) is an indicator
of the completeness of the result.
• Operating curve
– Pe1 = x/(w+x) = 1 – R
– Pe2 = y/(y+z) = F(allout)
• Expected search length =
average # of documents that need to
be examined to retrieve a given
number of relevant documents.
• Subjective criteria
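The measures above follow directly from the contingency counts; a small helper (the counts in the usage line are hypothetical, chosen only for illustration):

```python
# Retrieval metrics from the contingency counts:
# w = relevant & retrieved, x = relevant & not retrieved,
# y = irrelevant & retrieved, z = irrelevant & not retrieved.
def retrieval_metrics(w, x, y, z):
    precision = w / (w + y)  # specificity of the result
    recall = w / (w + x)     # completeness of the result
    fallout = y / (y + z)    # Type II error Pe2
    type1 = 1 - recall       # Type I error Pe1 = x/(w+x)
    return precision, recall, fallout, type1

# Hypothetical outcome: 10 relevant documents, 12 retrieved in total.
p, r, f, e1 = retrieval_metrics(w=8, x=2, y=4, z=86)
print(round(p, 3), round(r, 3), round(f, 3), round(e1, 3))
# 0.667 0.8 0.044 0.2
```

Sweeping the retrieval threshold and plotting (R, P) pairs traces the precision–recall curve; plotting (fallout, recall) gives the operating curve.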
Example: MetaSEEk
• MetaSEEk-A meta-search engine
– Purpose: retrieving images
– Method: Select and interface with multiple on-line
image search engines
– Search principle: rank search engines and their search
options by their performance on different classes of
queries
A. B. Benitez, M. Beigi, and S.-F. Chang, "Using Relevance
Feedback in Content-Based Image Metasearch," IEEE Internet
Computing, vol. 2, no. 4, pp. 59–69, July/August 1998.
Basic idea of MetaSEEk
• Classify the user queries into different clusters by
their visual content.
• Rank the different search engines according to their
performance on the different classes of user queries
• Select the search engines and search options
according to their rank for the specific query cluster
• Display the search results to the user
• Update these performance rankings according to user
feedback
Overview: Basic components of a
meta-search engine
Content-Based Visual Query (1)
• Advantage
– Ease of creating, capturing, and collecting digital
imagery
• Approaches
– Extract significant features (Color, Texture, Shape,
Structure)
– Organize Feature Vectors
– Compute the closeness of the feature vectors
– Retrieve matched or most similar images
Content-Based Visual Query (2)
Improve Efficiency
• Keyword-based search
– Match images with particular subjects and narrow
down the search scope
• Clustering
– Classify images into various categories based on
their contents
• Indexing
– Applied to the image feature vectors to support
efficient access to the database
Cluster the visual data
• K-means algorithm
– Simplicity
– Reduced computation
• Tamura features (for texture)
• For color, feature vectors are calculated from
the color histogram
• Euclidean distance is used to compare feature vectors
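A minimal K-means over color-histogram feature vectors with Euclidean distance, as described above. The tiny 3-bin histograms are toy data for illustration only:

```python
import random

# Minimal K-means: assign each histogram to the nearest center
# (squared Euclidean distance), then recompute centers as means.
def kmeans(vectors, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            d = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
            clusters[d.index(min(d))].append(v)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster empties out
                centers[i] = [sum(col) / len(cl) for col in zip(*cl)]
    return centers, clusters

# Toy 3-bin color histograms: two "red-heavy", two "blue-heavy" images.
hists = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.1, 0.1, 0.8], [0.0, 0.2, 0.8]]
centers, clusters = kmeans(hists, k=2)
print(len(clusters[0]), len(clusters[1]))  # 2 2
```

The appeal noted on the slide is visible here: each iteration costs only O(nk) distance computations, and the update rule is a simple mean.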
Conceptual structure of the
meta-search database.
Multimedia summary and filtering
• Summary
– Text: email reading
– Image: caption
generation
– Video: highlights, storyboard
• Issues:
– Segmentation
– Clustering of segments
– Labeling clusters
– Associating syntactic
and semantic labels
• Filtering
– Same as retrieval: filter
out irrelevant objects
based on a given
criterion (query)
– Often need to be
performed based on
content features
• E.g. filtering traffic
accidents or law
violations from traffic
monitoring videos
Content based Coding and Post-processing
• Different coding
decisions based on low
level content features
– coding mode (inter/intra
selection)
– motion estimation
• Object based coding
– Encoding different
regions (VOP) separately
– Using different coder for
different types of regions
• Multiple abstraction layer
coding
– An analysis/synthesis
approach
– Synthesize low level
contents from higher level
abstraction
• E.g. texture synthesis
• Content-based post-processing
– Identify content types and
then synthesize low-level
content