Video Processing
Wen-Hung Liao
6/2/2005
5/15/2005
Outline
Basics of Video
Video Processing
Video coding/compression/conversion
Digital video production
Video special effects
Video content analysis
Summary
Basics of Video
Component video
Composite video
Digital video
Component Video
Higher-end video systems make use of three
separate video signals for the red, green, and blue
image planes. Each color channel is sent as a
separate video signal.
Most computer systems use Component Video, with
separate signals for R, G, and B.
For any color separation scheme, Component Video gives
the best color reproduction since there is no crosstalk
between the three channels.
This is not the case for S-Video or Composite Video,
discussed next.
Component video, however, requires more bandwidth and
good synchronization of the three components.
Composite Video
Color (chrominance) and intensity (luminance) signals are mixed
into a single carrier wave.
Chrominance is a composition of two color components (I and Q,
or U and V).
In NTSC TV, e.g., I and Q are combined into a chroma signal,
and a color subcarrier is then employed to put the chroma signal
at the high-frequency end of the signal shared with the luminance
signal.
The chrominance and luminance components can be separated
at the receiver end and then the two color components can be
further recovered.
When connecting to TVs or VCRs, Composite Video uses only
one wire and video color signals are mixed, not sent separately.
The audio and sync signals are additions to this one signal.
Since color and intensity are wrapped into the same signal, some
interference between the luminance and chrominance signals is
inevitable.
S-Video
As a compromise, S-Video (Separated Video, or Super-Video, e.g., in
S-VHS) uses two wires: one for luminance and another for a
composite chrominance signal.
As a result, there is less crosstalk between the color information
and the crucial gray-scale information.
The reason for placing luminance into its own part of the signal is
that black-and-white information is most crucial for visual
perception.
In fact, humans are able to differentiate spatial resolution in grayscale images with a much higher acuity than for the color part of
color images.
As a result, we can send less accurate color information than
must be sent for intensity information: we can only see fairly
large blobs of color, so it makes sense to send less color detail.
Digital Video
The advantages of digital representation for video
are many.
For example:
Video can be stored on digital devices or in memory, ready
to be processed (noise removal, cut and paste, etc.), and
integrated to various multimedia applications;
Direct access is possible, which makes nonlinear video
editing achievable as a simple, rather than a complex, task;
Repeated recording does not degrade image quality;
Ease of encryption and better tolerance to channel noise.
Chroma Subsampling
Since humans see color with much less spatial
resolution than they see black and white, it makes
sense to decimate the chrominance signal.
Interesting (but not necessarily informative!) names
have arisen to label the different schemes used.
To begin with, numbers are given stating how many
pixel values, per four original pixels, are actually
sent:
The chroma subsampling scheme 4:4:4 indicates that no
chroma subsampling is used: each pixel's Y, Cb and Cr
values are transmitted, 4 for each of Y, Cb, Cr.
Chroma Subsampling (2)
The scheme 4:2:2 indicates horizontal subsampling
of the Cb, Cr signals by a factor of 2. That is, of four
pixels horizontally labeled as 0 to 3, all four Ys are
sent, and every two Cb's and two Cr's are sent, as
(Cb0, Y0)(Cr0,Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and
so on (or averaging is used).
The scheme 4:1:1 subsamples horizontally by a
factor of 4.
The scheme 4:2:0 subsamples in both the horizontal
and vertical dimensions by a factor of 2.
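As an illustration of the 4:2:0 case, the following is a minimal Python sketch (the array names, shapes, and the use of simple 2x2 block averaging rather than co-sited sampling are assumptions for illustration): Y is kept at full resolution while each 2x2 block of Cb and Cr is reduced to a single averaged sample. The 4:2:2 and 4:1:1 schemes would pool only along the horizontal dimension instead.

import numpy as np

def subsample_420(y, cb, cr):
    """4:2:0 chroma subsampling: keep Y at full resolution and
    average each 2x2 block of Cb and Cr into one sample.
    Assumes the frame height and width are even."""
    def pool2x2(c):
        h, w = c.shape
        # Group pixels into 2x2 blocks and average each block.
        return c.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, pool2x2(cb), pool2x2(cr)

# Example: a 4x4 frame -> Y stays 4x4, Cb and Cr become 2x2.
y  = np.arange(16, dtype=np.float64).reshape(4, 4)
cb = np.full((4, 4), 128.0)
cr = np.full((4, 4), 128.0)
y2, cb2, cr2 = subsample_420(y, cb, cr)
print(y2.shape, cb2.shape, cr2.shape)   # (4, 4) (2, 2) (2, 2)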
Chroma Subsampling (3)
RGB/YUV Conversion
http://www.fourcc.org/index.php?http%3A//www.fourcc.org/intro.php
RGB to YUV Conversion
Y = (0.257 * R) + (0.504 * G) + (0.098 * B) + 16
Cr = V = (0.439 * R) - (0.368 * G) - (0.071 * B) + 128
Cb = U = -(0.148 * R) - (0.291 * G) + (0.439 * B) + 128
YUV to RGB Conversion
B = 1.164(Y - 16) + 2.018(U - 128)
G = 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128)
R = 1.164(Y - 16) + 1.596(V - 128)
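The formulas above translate directly into code. Below is a minimal sketch for scalar 8-bit values; the clipping of the reconstructed RGB values to the 0-255 range is an assumption, since the slide does not specify how out-of-range results are handled.

def rgb_to_yuv(r, g, b):
    """RGB -> Y'CbCr using the coefficients given above
    (BT.601-style, Y in [16, 235], Cb/Cr centered at 128)."""
    y = 0.257 * r + 0.504 * g + 0.098 * b + 16
    v = 0.439 * r - 0.368 * g - 0.071 * b + 128   # Cr
    u = -0.148 * r - 0.291 * g + 0.439 * b + 128  # Cb
    return y, u, v

def yuv_to_rgb(y, u, v):
    """Inverse conversion, again following the formulas above,
    with results clipped to the displayable 0-255 range (assumption)."""
    clip = lambda x: max(0.0, min(255.0, x))
    r = 1.164 * (y - 16) + 1.596 * (v - 128)
    g = 1.164 * (y - 16) - 0.813 * (v - 128) - 0.391 * (u - 128)
    b = 1.164 * (y - 16) + 2.018 * (u - 128)
    return clip(r), clip(g), clip(b)

# Round trip for a mid-gray pixel comes back (approximately) unchanged.
print(yuv_to_rgb(*rgb_to_yuv(128, 128, 128)))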
Video Coding Standards
MPEG Standards (1, 2, 4, 7, 21)
MPEG-1: VCD
MPEG-2: DVD
MPEG-4: video objects
MPEG-7: Multimedia database
MPEG-21: framework
H.26x series (H.261, H.263, H.264): video
conferencing
Digital Video Production
Tools: Adobe Premiere, After Effects,…
Resources:
http://www.cc.gatech.edu/dvfx/resources.htm
Examples:
http://www.cc.gatech.edu/dvfx/videos/dvfx2005.html
Video Special Effects
Examples:
EffectTV: http://effectv.sourceforge.net/
FreeFrame: http://freeframe.sourceforge.net/gallery.html
Types of Special Effects
Applying to the whole image frame (see the sketch after this list)
Applying to part of the image (edges, moving
pixels,…)
Applying to a collection of frames
Applying to detected areas
Overlaying virtual objects:
at pre-determined locations
in response to user’s position
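A minimal sketch of the first and fourth categories above (a whole-frame effect and an effect applied only to detected areas), operating on frames represented as NumPy arrays; the function names, the negative filter, and the recoloring overlay are illustrative assumptions, not the methods used by EffectTV or FreeFrame.

import numpy as np

def negative_effect(frame):
    """Whole-frame effect: invert every pixel (assumes an 8-bit RGB frame)."""
    return 255 - frame

def overlay_effect(frame, mask, color=(255, 0, 0)):
    """Effect applied only to detected areas: recolor pixels where a
    boolean mask (e.g., moving pixels from frame differencing) is True."""
    out = frame.copy()
    out[mask] = color
    return out

# Example on a synthetic 2x2 RGB frame.
frame = np.zeros((2, 2, 3), dtype=np.uint8)
mask = np.array([[True, False], [False, True]])
print(negative_effect(frame)[0, 0])       # [255 255 255]
print(overlay_effect(frame, mask)[0, 0])  # [255   0   0]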
Video Content Analysis
Event detection
For indexing/searching
To obtain a high-level semantic description of
the content.
Image Databases
Problem: accessing and searching large databases
of images, videos and music
Traditional solutions: file IDs, keywords, associated
text.
Problems:
can’t query based on visual or musical properties
depends on the particular vocabulary used
doesn’t provide queries by example
time consuming
Solution: content-based retrieval using automatic
analysis tools (see http://wwwqbic.almaden.ibm.com)
Retrieval of images by similarity
Components:
Extraction of features or image signatures and efficient
representation and storage
A set of similarity measures
A user interface for efficient and ordered presentation of
retrieved images and for supporting relevance feedback
Considerations
Many definitions of similarity are possible
User interface plays a crucial role
Visual content-based retrieval is best utilized when
combined with traditional search
Image features for similarity definition
Color similarity
Similarity: e.g., “distance” between color histograms (a sketch follows at the end of this list)
Should use perceptually meaningful color spaces (HSV,
Lab...)
Should be relatively independent of illumination (color
constancy)
Locality: “find a red object such as this one”
Texture similarity
Texture feature extraction (statistical models)
Texture qualities: directionality, roughness, granularity...
Shape Similarity
Must distinguish between similarity of the actual geometrical
2-D shapes in the image and similarity of the underlying 3-D shape
Shape features: circularity, eccentricity, principal axis orientation...
Spatial similarity
Assumes images have been (automatically or manually)
segmented into meaningful objects (symbolic image)
Considers the spatial layout of the objects in the scene
Object presence analysis
Is this particular object in the image?
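Returning to the color-similarity bullet above, the following is a minimal sketch of histogram-based color similarity. The bin count, the L1 distance measure, and the use of RGB rather than a perceptual color space such as HSV or Lab are simplifying assumptions for illustration.

import numpy as np

def color_histogram(image, bins=8):
    """Per-channel histogram of an 8-bit RGB image, normalized so it
    sums to 1 (making the feature independent of image size)."""
    hist = np.concatenate([
        np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]).astype(np.float64)
    return hist / hist.sum()

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms:
    0 = identical color distributions, 2 = completely disjoint."""
    return np.abs(h1 - h2).sum()

# Example: a dark image versus a bright one gives the maximum distance.
dark   = np.zeros((32, 32, 3), dtype=np.uint8)
bright = np.full((32, 32, 3), 255, dtype=np.uint8)
print(histogram_distance(color_histogram(dark), color_histogram(bright)))  # 2.0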
Main components of retrieval system
Database population: images and videos are
processed to extract features (color, texture, shape,
camera and object motion)
Database query: the user composes a query via a graphical user
interface. Features are generated from the graphical query and
input to the matching engine
Relevance feedback: automatically adjusts existing
query using information fed back by user about
relevance of previously retrieved objects
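One common way to realize the relevance-feedback step is a Rocchio-style update of the query feature vector. The sketch below is an illustrative assumption, not the scheme of any particular system; the weights alpha, beta, and gamma are arbitrary choices.

import numpy as np

def rocchio_update(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query feature vector toward the mean of features the user
    marked relevant and away from the mean of those marked non-relevant."""
    q = alpha * np.asarray(query, dtype=np.float64)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    return q

# Example with 3-dimensional feature vectors (e.g., tiny color histograms).
q0  = [0.5, 0.3, 0.2]
rel = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]
non = [[0.1, 0.1, 0.8]]
print(rocchio_update(q0, rel, non))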
Video parsing and representation
Interaction with video using conventional
VCR-like manipulation is difficult; we need to
introduce structural video analysis
Video parsing
Temporal segmentation into elemental units
Compact representation of elemental unit
Temporal segmentation
Fundamental unit of video manipulation: video shots
Types of transition between shots:
Abrupt shot change (cut); see the detection sketch after this list
Fades: slow change in brightness
Dissolve
Wipe: pixels from the second shot replace those of the previous
shot in regular patterns
Other factors of image change:
Motion, including camera motion and object motion
Luminosity changes and noise
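A minimal sketch of detecting the first transition type, abrupt shot changes, by thresholding frame-to-frame histogram differences; the grayscale histograms and the threshold value are assumptions for illustration, and gradual transitions (fades, dissolves, wipes) would require multi-frame tests.

import numpy as np

def gray_histogram(frame, bins=64):
    """Normalized intensity histogram of an 8-bit grayscale frame."""
    h = np.histogram(frame, bins=bins, range=(0, 256))[0].astype(np.float64)
    return h / h.sum()

def detect_cuts(frames, threshold=0.5):
    """Report frame indices where the histogram difference between
    consecutive frames exceeds the threshold (candidate abrupt cuts)."""
    cuts = []
    prev = gray_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        curr = gray_histogram(frame)
        if np.abs(curr - prev).sum() > threshold:
            cuts.append(i)
        prev = curr
    return cuts

# Example: two dark frames followed by two bright frames -> cut at index 2.
frames = [np.zeros((8, 8), np.uint8)] * 2 + [np.full((8, 8), 200, np.uint8)] * 2
print(detect_cuts(frames))  # [2]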
Representation of Video
Video database population has three major components:
Shot detection
Representative frame creation for each shot
Derivation of layered representation of coherently moving
structures/objects
A representative frame (R-frame) is used for:
population: R-frame is treated as a still image for representation
query: R-frames are basic units initially returned in video query
Choice of R-frame:
first, middle, or last frame in the video shot
sprite built by seamlessly mosaicking all frames in a shot
Video soundtrack analysis
Image/sound relationships are critical to the perception and
understanding of video content. Possibilities:
Speech, music, and Foley sound detection and representation
Locutor (speaker) identification and retrieval
Word spotting and labeling (speech recognition)
A possible query could be: “find the next time this locutor is
again present in this soundtrack”
Video scene analysis
500-1000 shots per hour in typical movies
One level above shot: sequence or scene (a series of
consecutive shots constituting a unit from the narrative point of
view)