
Toshiba Update 28/02/2006
Emotional Speech Modelling and
Synthesis
Zeynep Inanoglu
Machine Intelligence Laboratory
CU Engineering Department
Supervisor: Prof. Steve Young
Agenda
- Project Motivation
- Review of Work on Intonation Modelling
  – Intonation Models and Training
  – Intonation Synthesis from HMMs
  – Intonation Adaptation and Perceptual Tests
- Alternative Intonation Labels
  – Prosodizer Labels
  – Lexical Stress Labels
- Controlling More Parameters
  – Pitch Synchronous Harmonic Model
  – Transplantation of pitch, duration and voice quality
- Emotional Speech Data Collection
- Summary and Future Direction
Review: Project Motivation
- To synthesize or resynthesize speech with desired emotional expressivity.
- Initial focus on pitch modelling.
- Intonation (F0) contours have two distinct functions:
  – Convey the prominence structure and sentence modality.
  – Convey signals about the speaker's emotional state.
- The interaction between the two functions is largely unexplored (Banziger & Scherer, 2005).
- Goal:
  – Choose building blocks of intonation.
  – Model them statistically.
  – Adapt the models to different emotions.
  – Generate intonation contours from the models.
Review: Intonation Modelling
- Basic Models
  – Seven basic models: A (accent), U (unstressed), RB (rising boundary), FB (falling boundary), ARB, AFB, SIL.
  – 3-state, single-mixture, left-to-right HMMs.
  – Data: Boston Radio Corpus (48 minutes of speech, female speaker).
  – Features: mean-normalized raw F0 and energy values, as well as their differentials.
- Context-Sensitive Models
  – Tri-unit models (U+A-RB).
  – Full-context models (U+A-RB::vowel_pos=2::num_a=1::…).
  – Decision-tree-based parameter tying was performed for the context-sensitive models.
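The feature setup above might be sketched as follows. This is a minimal illustration, not the corpus recipe: the delta computation (`np.gradient`) and the choice to mean-normalize F0 over voiced frames only are assumptions.

```python
import numpy as np

def f0_features(f0, energy):
    """Per-frame feature vectors: mean-normalized F0 and energy
    plus simple first differentials (delta scheme is an assumption)."""
    f0 = np.asarray(f0, dtype=float)
    energy = np.asarray(energy, dtype=float)
    voiced = f0 > 0                       # unvoiced frames carry f0 == 0
    f0n = f0 - f0[voiced].mean()          # mean-normalize over voiced frames
    en = energy - energy.mean()
    d_f0 = np.gradient(f0n)               # differential features
    d_en = np.gradient(en)
    return np.stack([f0n, en, d_f0, d_en], axis=1)

feats = f0_features([0, 120, 130, 125, 0], [0.2, 0.9, 1.0, 0.8, 0.1])
```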
Review: Generation from Models
- The goal is to generate an optimal sequence of F0 values directly from the syllable HMMs given the intonation models:

  P(O | λ) = Σ_{all Q} P(O | Q, λ) P(Q | λ)

- Taking the state means alone results in a piecewise-constant sequence of mean state values.
- The cepstral parameter generation algorithm of the HTS system is used for interpolated F0 generation (Tokuda et al., 1995):

  o_t = [f_t^T, Δf_t^T, Δ²f_t^T]^T

- Differential F0 features are used as constraints in contour generation, resulting in smoother contours.
Review: Generation from Models
[Figure: F0 contours generated from syllable HMMs, with syllable-level unit labels, for "I saw him" (a a u, shown twice) and "yesterday" under two labellings (u u u u vs. a u u fb)]
Review: Model Adaptation with MLLR
- Adapt models with Maximum Likelihood Linear Regression (MLLR).
- Adaptation data from the Emotional Prosody Corpus, which consists of four-syllable phrases in a variety of emotions.
- Happy and sad speech were chosen for this experiment.
[Figure: neutral, sad and happy F0 contours generated from the adapted models]
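A toy sketch of an MLLR-style global mean transform, estimated here by ordinary least squares over (model mean, adaptation mean) pairs. Real MLLR additionally weights by state occupancies and covariances, which this omits; the example data are hypothetical.

```python
import numpy as np

def estimate_mllr_transform(mus, obs):
    """Estimate W = [b, A] such that adapted_mean = A @ mu + b,
    by least squares over paired model/adaptation means."""
    X = np.hstack([np.ones((len(mus), 1)), np.asarray(mus, float)])  # extended means [1, mu]
    Y = np.asarray(obs, float)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W.T                            # rows map [1, mu] -> adapted mean

def adapt(W, mu):
    return W @ np.concatenate([[1.0], mu])

# Hypothetical example: sad speech lowers and compresses F0 means
mus = [[100.0], [120.0], [140.0]]
obs = [[ 90.0], [105.0], [120.0]]
W = estimate_mllr_transform(mus, obs)
```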
Review: Perceptual Tests
- Utterances with sad contours were identified 80% of the time; this was significant (p < 0.01).
- Listeners formed a bimodal distribution in their ability to detect happy utterances. Overall, only 46% of the happy intonation was identified as happier than neutral. (The elusiveness of "smiling voice" is well known in the literature.)
- Happy models worked better on utterances with more accents and rising boundaries: the organization of labels matters!
Alternative Intonation Labels
- Manual intonation labels are subjective, and their creation is time-consuming.
- Evaluate alternative labelling methods:
  – Automatic ToBI labels generated by Prosodizer; the Prosodizer-generated labels are converted to the seven basic units.
  – Lexical stress labels: one (primary stress), two (secondary stress), zero (no stress), sil (silence).
- Evaluation on Boston Radio Corpus female speaker f2b:

  Labels       Mean Sq. Error
  Manual       30.60
  Prosodizer   31.83
  Lexical      34.56
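An evaluation along these lines could compare generated contours against the reference as follows; restricting the error to voiced reference frames is an assumption about the protocol.

```python
import numpy as np

def contour_mse(generated, reference):
    """Mean squared error between a generated F0 contour and the
    reference, evaluated on voiced frames only (reference F0 > 0)."""
    g = np.asarray(generated, float)
    r = np.asarray(reference, float)
    voiced = r > 0
    return float(np.mean((g[voiced] - r[voiced]) ** 2))

err = contour_mse([100, 110, 0, 120], [102, 108, 0, 0])
```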
Perceptual Investigation of Other Emotions
[Figure: example lexical-stress label sequences (sil-one+one, ..one.. ..zero.., zero-one+sil) and perceptual results for boredom, contempt, disgust, interest, cold anger, panic and hot anger]
Controlling More Parameters
- Pitch Synchronous Harmonic Model (Hui Ye, 2004)
  – The size of the analysis/synthesis window equals one pitch period.
  – Each frame is represented as a sum of harmonically related sinusoids (amplitudes and phases).
  – For voiced frames, LSF representations of the vocal tract are acquired.
  – A better framework for manipulating pitch, duration and voice quality.
- Implemented pitch, duration and voice quality transplantation.
- Set up a framework for emotion conversion (prosody, duration and vocal tract).
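The harmonic representation can be illustrated with a minimal synthesis routine; windowing and phase conventions are simplified, so this is a sketch of the idea rather than the PSHM implementation.

```python
import numpy as np

def synthesize_frame(f0, fs, amps, phases):
    """Synthesize one pitch period as a sum of harmonically
    related sinusoids (harmonic k has frequency k * f0)."""
    period = int(round(fs / f0))          # samples in one pitch period
    n = np.arange(period)
    frame = np.zeros(period)
    for k, (a, phi) in enumerate(zip(amps, phases), start=1):
        frame += a * np.cos(2 * np.pi * k * f0 * n / fs + phi)
    return frame

frame = synthesize_frame(100.0, 16000, amps=[1.0, 0.5, 0.25],
                         phases=[0.0, 0.0, 0.0])
```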
Transplantation with PSHM
- Duration Transplantation
- Pitch Transplantation, per phone:
  1. Compute pitch alignment.
  2. Recompute spectral envelope.
  3. Restore time.
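Step 1 above, the per-phone pitch alignment, might look like this in its simplest form, assuming plain linear time-warping of the target contour onto the source phone's length (the actual alignment scheme is not specified here).

```python
import numpy as np

def transplant_pitch(source_f0, target_f0):
    """Align the target phone's F0 contour to the source phone's
    length by linear interpolation: the source keeps its duration
    but takes on the target's pitch movement."""
    src_pos = np.linspace(0.0, 1.0, len(source_f0))
    tgt_pos = np.linspace(0.0, 1.0, len(target_f0))
    return np.interp(src_pos, tgt_pos, target_f0)

f0 = transplant_pitch([110] * 5, [120.0, 140.0, 160.0])
```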
Transplantation with PSHM
- Vocal Tract Transplantation
  – Align frames based on DTW over MFCC distances.
  – Convert LSF parameters to LPC.
  – Filter the source harmonics with the target LPC.
  – Compute the new sinusoidal amplitudes.
[Figure: neutral vs. happy spectral envelopes for /eh/]
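The frame-alignment step could be sketched as a standard DTW, using Euclidean distance between frame vectors as a stand-in for the MFCC distance; path constraints and any band limits are omitted.

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two frame sequences by dynamic time warping over
    pairwise Euclidean distances; returns the warping path as
    (source_index, target_index) pairs."""
    src, tgt = np.asarray(src, float), np.asarray(tgt, float)
    n, m = len(src), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(src[i - 1] - tgt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

path = dtw_align([[0.0], [1.0], [2.0]], [[0.0], [2.0]])
```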
Transplantation Results
[Audio examples: source → target pairs Neutral → Happy, Neutral → Sad and Happy → Angry, each under four conditions: Pitch, Pitch+Duration, LSF, and Pitch+Duration+LSF]
- Conversion of voice quality improves target emotion perception in all transplantations.
- LSF transplantation is the driving factor for anger, while both LSF and prosody transplantation play an important role for happy and sad.
Emotional Speech Data Collection
- 4 emotions: happy, sad, surprised, angry.
- Two speakers: 1 male & 1 female (Suzanne Park, Matthew Johnson).
- Toshiba TTS Training Corpus.
- Happy & Sad: 1250 sentences.
  – 900 from the phonetically balanced short sentences.
  – 300 long sentences.
  – 25 questions & 25 exclamations.
- Surprise & Anger: 625 sentences.
  – 300 phonetically balanced short sentences.
  – 300 long sentences.
  – 25 questions.
- Neutral data collection for the male speaker (1250 sentences).
Emotional Speech Data Collection
- Emotion elicitation by context prompting, e.g. for "I like a party with an atmosphere":
  – Happy: You have just arrived at the best party in town.
  – Sad: You never get invitations to good parties any more.
- Expected recording time:
  – 12 days for the female speaker (two and a half weeks non-stop).
  – 15 days for the male speaker (twice a week for two months).
  – 6-hour days.
- Post-processing:
  – Phonetic alignment
  – Pitch marks
  – Text analysis (syllable boundaries, Prosodizer labels)
Future Direction
- Data collection and labelling.
- Experiments with emotion conversion:
  – Prosody conversion based on HMM models.
  – Voice quality conversion.
  – Joint modelling of prosody and voice quality in emotional speech.
  – Investigation of the voice source and its effects on emotion.
- Integration of speech modification techniques into a TTS framework.
- Comparison of speech modification techniques with unit-selection techniques.