Video Rewrite: Driving Visual Speech with Audio


Video Rewrite
Driving Visual Speech with Audio
Christoph Bregler
Michele Covell
Malcolm Slaney
Presenter: Jack Jeryes
3/3/2008
What is video rewrite?
Use existing footage to create new
video of a person mouthing words
that they did not speak in the
original footage.
Example:
Why video rewrite?
Movie dubbing:
to sync the actors’ lip motions to
the new soundtrack
Teleconferencing
Special effects
Approach
Learn from example footage how
a person’s face changes during
speech
(dynamics and idiosyncrasies)
Stages
Video rewrite has two stages:
Analysis stage
Synthesis stage
Analysis stage:
Uses the audio track to segment the video into triphones.
Vision techniques find the head orientation and the shape
and position of the mouth and chin in each image.
Synthesis stage:
Segments new audio and uses it to select triphones from the video model.
Based on labels from the analysis stage, the new mouth images
are morphed into a new background face.
Analysis for video modeling
The analysis stage creates an annotated
database of example video clips (the video model),
derived from unconstrained footage.
-Annotation Using Image Analysis
-Annotation Using Audio Analysis
Annotation Using Image Analysis
As the face moves within the frame, we need to know
the mouth position and lip shapes at all times.
Use eigenpoints (well suited to low-resolution images).
Eigenpoints:
A small set of hand-labeled facial images is used
to train subspace models.
Given a new image, the eigenpoint models
tell us the positions of points on the lips and jaw.
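A minimal sketch of the subspace idea behind eigenpoints, assuming a joint PCA over image pixels and labeled point coordinates; the coupling scheme and all names here are illustrative, not the authors' exact formulation:

```python
import numpy as np

# Training: each example pairs a vectorized face-region image with its
# hand-labeled control points (x, y pairs), stacked into one vector.
def train_eigenpoints(images, points, n_components=10):
    # images: (N, P) pixel vectors; points: (N, 2K) point coordinates
    data = np.hstack([images, points])          # joint observation vectors
    mean = data.mean(axis=0)
    centered = data - mean
    # Principal components of the joint image+point space
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]                    # (C, P + 2K)
    return mean, basis, images.shape[1]

# Estimation: project a new image onto the subspace using only the image
# part of the basis, then reconstruct the (unknown) point coordinates.
def estimate_points(image, mean, basis, n_pixels):
    img_basis, pt_basis = basis[:, :n_pixels], basis[:, n_pixels:]
    coeffs, *_ = np.linalg.lstsq(img_basis.T, image - mean[:n_pixels],
                                 rcond=None)
    return mean[n_pixels:] + coeffs @ pt_basis   # predicted (x, y) coords
```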
Eigenpoints (cont.)
54 eigenpoints for each image:
34 on the mouth
20 on the chin and jaw line.
Only 26 images were hand-labeled
(26 of 14,218 frames, about 0.2%).
Extended the hand-annotated dataset by
morphing pairs of labeled images to form
intermediate images (sketched below).
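The extension step can be pictured as interpolation between labeled pairs. A minimal sketch, assuming grayscale image arrays and (x, y) point arrays; the plain cross-dissolve here is a simplification of a full image morph:

```python
import numpy as np

def intermediate_example(img_a, pts_a, img_b, pts_b, t=0.5):
    """Form a synthetic training example between two hand-labeled ones.

    Control points are interpolated linearly; a full morph would also
    warp each image toward the interpolated points before blending.
    """
    pts_mid = (1 - t) * pts_a + t * pts_b   # interpolated point labels
    img_mid = (1 - t) * img_a + t * img_b   # cross-dissolve (simplified)
    return img_mid, pts_mid
```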
Eigenpoints (cont.)
Eigenpoints alone cannot handle a wide
variety of head motions.
Thus, warp each face image into a
standard reference plane
prior to eigenpoint labeling.
Use an affine transform that minimizes the
mean-squared error between a large
portion of the face image and a facial
template.
Mask to estimate global warp
Each image is warped to account for changes in the head’s
position, size, and rotation. The transform minimizes the
difference between the transformed images and the face
template. The mask (left) forces the minimization to
consider only the upper face (right).
Global mapping…
Once the best global mapping is found, it is
inverted and applied to the image, putting that
face into the standard coordinate frame.
We then perform eigenpoints analysis on this
pre-warped image to find the fiduciary points.
Finally, we back-project the fiduciary points
through the global warp to place them on the
original face image
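A sketch of this per-frame pipeline, assuming OpenCV 4.x; ECC alignment stands in for the paper's masked mean-squared-error minimization, and estimate_points_fn is a hypothetical eigenpoint estimator:

```python
import numpy as np
import cv2

def annotate_frame(frame_gray, template, mask, estimate_points_fn):
    """One frame of the analysis pipeline (illustrative).

    frame_gray, template: uint8 grayscale images; mask: uint8 upper-face
    mask; estimate_points_fn returns (K, 2) points in the standard frame.
    """
    # 1. Estimate the global affine warp aligning the frame to the facial
    #    template; ECC alignment is one way to minimize a masked image
    #    difference (the paper's exact estimator may differ).
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-6)
    _, warp = cv2.findTransformECC(template, frame_gray, warp,
                                   cv2.MOTION_AFFINE, criteria, mask, 5)
    # 2. Apply the inverted mapping, putting the face into the standard
    #    coordinate frame of the template.
    h, w = template.shape
    standard = cv2.warpAffine(frame_gray, warp, (w, h),
                              flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
    # 3. Eigenpoint analysis on the pre-warped image.
    pts_standard = estimate_points_fn(standard)
    # 4. Back-project the fiduciary points through the global warp onto
    #    the original face image.
    pts_h = np.hstack([pts_standard, np.ones((len(pts_standard), 1))])
    return pts_h @ warp.T
```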
Annotation Using Audio Analysis
All the speech is segmented into sequences
of phonemes.
Coarticulation must be considered:
the /T/ in “beet” looks different from
the /T/ in “boot.”
Annotation Using Audio Analysis
Use triphones: collections of three
sequential phonemes
“teapot” is split into:
/SIL-T-IY/, /T-IY-P/, /IY-P-AA/,
/P-AA-T/, and /AA-T-SIL/
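A minimal sketch of this segmentation, assuming the phoneme sequence (with silence markers) already comes from the audio alignment:

```python
def to_triphones(phonemes):
    """Split a phoneme sequence into overlapping triphones.

    >>> to_triphones(["SIL", "T", "IY", "P", "AA", "T", "SIL"])
    ['SIL-T-IY', 'T-IY-P', 'IY-P-AA', 'P-AA-T', 'AA-T-SIL']
    """
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]
```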
Annotation Using Audio Analysis
While synthesizing a video:
- Emphasize the middle of each triphone.
- Cross-fade the overlapping regions of
neighboring triphones (sketched below).
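A sketch of the weighting and cross-fade, assuming a triangular window peaking at the triphone's middle; the actual fade shapes are an assumption:

```python
import numpy as np

def triphone_weights(n_frames):
    """Triangular weights that emphasize the middle of a triphone clip."""
    return 1.0 - np.abs(np.linspace(-1.0, 1.0, n_frames))

def cross_fade(frames_a, frames_b, n_overlap):
    """Blend the tail of one triphone clip into the head of the next.

    frames_a, frames_b: (T, H, W) grayscale frame stacks.
    """
    alpha = np.linspace(0.0, 1.0, n_overlap)[:, None, None]
    blended = (1 - alpha) * frames_a[-n_overlap:] + alpha * frames_b[:n_overlap]
    return np.concatenate([frames_a[:-n_overlap], blended,
                           frames_b[n_overlap:]])
```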
Synthesis using a video model
The synthesis stage segments new audio and uses it to
select triphones from the video model.
Based on labels from the analysis stage, the new mouth
images are morphed into a new background face.
Synthesis using a video model
The background, head tilts, and eye blinks are
taken from the source footage in the same
order as they were shot.
The triphone images include the mouth,
chin, and part of the cheeks.
Illumination-matching techniques are used
to avoid visible seams (sketched below).
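The illumination matching can be pictured as a simple gain correction; this sketch only illustrates the idea and is not the paper's actual technique:

```python
import numpy as np

def match_illumination(mouth_patch, background_region):
    """Scale a triphone mouth patch so its mean brightness matches the
    region of the background face it will replace, reducing seams."""
    gain = background_region.mean() / max(mouth_patch.mean(), 1e-6)
    return np.clip(mouth_patch * gain, 0, 255)
```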
Selection of Triphone Videos
Choose a sequence of clips that
approximates the desired transitions
and shape continuity.
Selection of Triphone Videos
Given a triphone in the new speech
utterance, we compute a matching distance
to each triphone in the video database:
error = α · Dp + (1 − α) · Ds
Dp = phoneme-context distance
Ds = lip-shape distance
α = relative weight of the two distances
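With those definitions, selection reduces to a nearest-neighbor search over the database. A sketch, where phoneme_context_distance and lip_shape_distance stand in for the Dp and Ds computations described next, and alpha is the weight α:

```python
def select_triphone(target, database, alpha,
                    phoneme_context_distance, lip_shape_distance):
    """Pick the database triphone video minimizing
    error = alpha * Dp + (1 - alpha) * Ds."""
    def error(candidate):
        dp = phoneme_context_distance(target, candidate)
        ds = lip_shape_distance(target, candidate)
        return alpha * dp + (1 - alpha) * ds
    return min(database, key=error)
```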
Dp = phoneme-context distance
Dp is based on categorical distances
between phoneme categories and
between viseme classes
Dp = weighted sum(viseme-distance,
phonemic-distance)
26 viseme classes:
1- /CH/ /JH/ /SH/ /ZH/
2- /K/ /G/ /N/ /L/ /T/ /D/
3- /P/ /B/ /M/
..
Dp = phoneme-context distance
- Phonemic distance(/P/, /P/) = 0
(same phonemic category)
- Viseme distance(/P/, /IY/) = 1
(different viseme classes)
- Dp(/P/, /B/) = between 0 and 1
(same viseme class,
different phonemic category)
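A sketch of the per-phoneme distance matching these examples, with an illustrative viseme-class table and an assumed within-class weight w:

```python
# Illustrative viseme classes (three of the 26 shown).
VISEME_CLASS = {"CH": 1, "JH": 1, "SH": 1, "ZH": 1,
                "K": 2, "G": 2, "N": 2, "L": 2, "T": 2, "D": 2,
                "P": 3, "B": 3, "M": 3}

def phoneme_distance(p, q, w=0.5):
    """0 for identical phonemes, 1 across viseme classes, and an
    intermediate value w within a viseme class (w is an assumption)."""
    if p == q:
        return 0.0                 # same phonemic category
    cp, cq = VISEME_CLASS.get(p), VISEME_CLASS.get(q)
    if cp is not None and cp == cq:
        return w                   # same viseme class, different phoneme
    return 1.0                     # different viseme classes
```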
Ds = lip-shape distance
Ds measures how closely the mouth
contours match in overlapping segments
of adjacent triphone videos.
In “teapot”:
the contours for /IY/ and /P/ in /T-IY-P/
should match the contours for
/IY/ and /P/ in /IY-P-AA/
Ds = lip-shape distance
Frame-by-frame Euclidean distance between
4-element feature vectors:
(overall lip width, overall lip height,
inner lip height, height of visible teeth)
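A sketch of Ds over the overlapping frames, using the four-element feature vectors named above:

```python
import numpy as np

def lip_shape_distance(features_a, features_b):
    """Frame-by-frame Euclidean distance between 4-element lip features
    (lip width, lip height, inner lip height, visible-teeth height)
    over the overlapping frames of two adjacent triphone videos."""
    overlap = min(len(features_a), len(features_b))
    diff = np.asarray(features_a[-overlap:]) - np.asarray(features_b[:overlap])
    return np.linalg.norm(diff, axis=1).mean()
```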
Stitching It All Together
The remaining task is to stitch the triphone
videos into the background sequence.