Transcript Slide 1

Facial expression as an input annotation modality
for affective speech-to-speech translation
Éva Székely, Zeeshan Ahmed, Ingmar Steiner, Julie Carson-Berndsen
University College Dublin
Introduction
Expressive speech synthesis in human interaction
In speech-to-speech translation the input is audiovisual, so the affective state does not need to be predicted from text alone
Introduction
Goal: transfer paralinguistic information from the source to the target language by means of an intermediate, symbolic representation, using facial expression as an input annotation modality.
FEAST: Facial Expression-based Affective Speech Translation
System Architecture of FEAST
[Block diagram: the video input goes through paralinguistic processing (facial analysis -> emotion classification -> style selection); the audio input goes through linguistic processing (speech recognition -> content analysis -> translation, mocked up in the prototype); both branches feed the expressive synthesis module, which produces the output audio.]
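To make the data flow of the diagram concrete, here is a minimal sketch of the pipeline in Python. It is illustrative only: every function and the emotion-to-style mapping are hypothetical placeholders, and the linguistic branch is mocked up, as in the prototype.

```python
# Illustrative sketch of the FEAST data flow, not the actual implementation.
# All functions and the EMOTION_TO_STYLE mapping are hypothetical placeholders.

EMOTION_TO_STYLE = {          # assumed mapping from classifier labels
    "happy": "cheerful",      # to the styles of the synthesis voice
    "sad": "depressed",
    "angry": "aggressive",
    "neutral": "neutral",
}

def analyse_face(video_frames):
    """Placeholder for per-frame facial analysis (SHORE in FEAST)."""
    return [{"happy": 0.7, "sad": 0.1, "angry": 0.1, "surprised": 0.1}
            for _ in video_frames]

def classify_emotion(face_features):
    """Placeholder for the per-utterance visual emotion classifier."""
    totals = {}
    for frame in face_features:
        for label, score in frame.items():
            totals[label] = totals.get(label, 0.0) + score
    best = max(totals, key=totals.get)
    return best if best in EMOTION_TO_STYLE else "neutral"

def translate_utterance(video_frames, source_text):
    """Paralinguistic branch selects a style; linguistic branch is mocked up."""
    emotion = classify_emotion(analyse_face(video_frames))
    style = EMOTION_TO_STYLE[emotion]
    target_text = source_text        # mock-up: ASR and MT not yet integrated
    return target_text, style        # passed on to expressive synthesis

print(translate_utterance(range(10), "I am having a wonderful day"))
```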
Face detection and analysis
SHORE library for real-time face detection and analysis
http://www.iis.fraunhofer.de/en/bf/bsy/produkte/shore/
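SHORE itself is a C++ library, so the snippet below is only a hedged illustration of how per-frame expression scores of the kind SHORE reports (e.g. happiness, sadness, anger, surprise ratings for a detected face) could be collapsed into one feature vector per utterance. The field names and the mean/standard-deviation aggregation are assumptions, not the SHORE API.

```python
import numpy as np

# Hypothetical per-frame scores, in the spirit of SHORE's expression ratings;
# the real library is C++ and its API is not reproduced here.
FRAME_FIELDS = ["happy", "sad", "angry", "surprised"]

def utterance_features(frame_scores):
    """Collapse a sequence of per-frame score dicts into one feature vector.

    frame_scores: list of dicts mapping each FRAME_FIELD to a score in [0, 1].
    Returns the mean and standard deviation of each field over the utterance.
    """
    matrix = np.array([[frame[f] for f in FRAME_FIELDS] for frame in frame_scores])
    return np.concatenate([matrix.mean(axis=0), matrix.std(axis=0)])

# Example: three analysed frames of a mostly happy-looking utterance
frames = [
    {"happy": 0.9, "sad": 0.0, "angry": 0.1, "surprised": 0.2},
    {"happy": 0.8, "sad": 0.1, "angry": 0.0, "surprised": 0.1},
    {"happy": 0.7, "sad": 0.0, "angry": 0.1, "surprised": 0.3},
]
print(utterance_features(frames))  # 8-dimensional utterance-level feature vector
```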
Emotion classification and style selection
Aim of the facial expression analysis in the FEAST system: a single decision regarding the emotional state of the speaker over each utterance
Visual emotion classifier, trained on segments of the SEMAINE database, with input features from SHORE
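A minimal sketch of such a per-utterance classifier, assuming utterance-level feature vectors like those above and a scikit-learn SVM; the feature dimensionality, labels, and random stand-in data are assumptions, not the actual FEAST training setup.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

LABELS = ["happy", "sad", "angry", "neutral"]

# Random stand-ins for utterance-level feature vectors extracted from SEMAINE
# segments (535 for training, 107 for testing in the experiment reported here).
rng = np.random.default_rng(0)
X_train, y_train = rng.random((535, 8)), rng.choice(LABELS, 535)
X_test = rng.random((107, 8))

# One decision per utterance: features are scaled, then classified by an SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)   # one emotion label per test utterance
```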
Expressive speech synthesis
Expressive unit-selection synthesis using the open-source synthesis platform MARY TTS
German male voice dfki-pavoque-styles, with four styles (see the request sketch below):
Cheerful
Depressed
Aggressive
Neutral
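A hedged sketch of how a synthesis request could look against a locally running MARY TTS server. The /process endpoint and its INPUT_TEXT, OUTPUT_TYPE, LOCALE, and VOICE parameters are part of the standard MARY HTTP interface; the STYLE values below simply mirror the labels on this slide and may differ from the identifiers the voice actually expects.

```python
import urllib.parse
import urllib.request

# Sketch of a synthesis request to a local MARY TTS server (default port 59125).
# The STYLE parameter values are assumptions based on this slide.
MARY_URL = "http://localhost:59125/process"

def synthesise(text, style="neutral", out_path="out.wav"):
    params = urllib.parse.urlencode({
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": "de",
        "VOICE": "dfki-pavoque-styles",
        "STYLE": style,               # e.g. "cheerful", "depressed", "aggressive"
    })
    with urllib.request.urlopen(f"{MARY_URL}?{params}") as response:
        with open(out_path, "wb") as f:
            f.write(response.read())

synthesise("Ich habe heute einen wunderbaren Tag", style="cheerful")
```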
The SEMAINE database (semaine-db.eu)
Audiovisual database collected to study natural social signals occurring in English conversations
Conversations with four emotionally stereotyped characters:
Poppy (happy, outgoing)
Obadiah (sad, depressive)
Spike (angry, confrontational)
Prudence (even tempered, sensible)
Evaluation experiments
1. Does the system accurately classify emotion on the utterance level, based on the facial expression in the video input?
2. Do the synthetic voice styles succeed in conveying the target emotion category?
3. Do listeners agree with the cross-lingual transfer of paralinguistic information from the multimodal stimuli to the expressive synthetic output?
Experiment 1: Classification of facial expressions
Support Vector Machine (SVM) classifier trained on utterances of the male operators from the SEMAINE database; 535 utterances used for training, 107 for testing (English video).
Confusion matrix (%), intended emotion (rows) vs. predicted emotion (columns):

             happy   sad   angry   neutral
  happy         88     6       0         6
  sad           17    52      13        17
  angry          4    17      67        13
  neutral       31     8      23        38
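A matrix like the one above can be produced directly from the intended and predicted labels. A minimal sketch with scikit-learn, using made-up labels rather than the actual SEMAINE test data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["happy", "sad", "angry", "neutral"]

# Illustrative stand-ins for the 107 test utterances; the real evaluation
# uses the intended SEMAINE emotions and the SVM's predictions.
rng = np.random.default_rng(1)
intended = rng.choice(LABELS, 107)
predicted = rng.choice(LABELS, 107)

counts = confusion_matrix(intended, predicted, labels=LABELS)
# Row-normalise to percentages, as in the table above.
percent = 100 * counts / counts.sum(axis=1, keepdims=True)
print(np.round(percent).astype(int))
```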
Experiment 2: Perception of expressive synthesis
Perception experiment with 20 subjects
Listen to natural and synthesised stimuli and choose which voice style describes the utterance best:
Cheerful
Depressed
Aggressive
Neutral
Experiment 2: Results
Confusion matrices (%), intended style (rows) vs. perceived style (columns):

German natural speech
               cheerful   depressed   aggressive   neutral
  cheerful           87           0            1        12
  depressed           1          96            0         3
  aggressive          0           1           97         2
  neutral             8          18            3        71

German synthesis
               cheerful   depressed   aggressive   neutral
  cheerful           43           3            4        50
  depressed           6          39            1        54
  aggressive          1           0           72        27
  neutral            12           6           12        70
Experiment 3: Adequacy for S2S translation
Perceptual experiment with 14 bilingual participants
24 utterances from the SEMAINE operator data and their corresponding translation in each voice style
Listeners were asked to choose which German translation matches the original video best.
Examples - Poppy (happy)
[Embedded audio examples in the four voice styles: Neutral, Cheerful, Aggressive, Depressed]
Examples - Prudence (neutral)
[Embedded audio examples in the four voice styles: Neutral, Cheerful, Aggressive, Depressed]
Examples - Spike (angry)
[Embedded audio examples in the four voice styles: Neutral, Cheerful, Aggressive, Depressed]
Examples - Obadiah (sad)
[Embedded audio examples in the four voice styles: Neutral, Cheerful, Aggressive, Depressed]
Experiment 3: Results
Confusion matrix (%) for the English video / German TTS condition, intended emotion in the video (rows) vs. selected voice style (columns):

               cheerful   depressed   aggressive   neutral
  cheerful           80           2           14         4
  depressed          10          76            0        14
  aggressive         17           1           82         0
  neutral            56           5            6        33
Conclusion
Preserving the paralinguistic content of a message across languages is possible with significantly greater than chance accuracy
The visual emotion classifier performed with an overall accuracy of 63.5%
Cheerful/happy is often mistaken for neutral (conditioned by the voice)
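As a worked illustration of how an above-chance claim like this can be checked, a one-sided binomial test against the 25% four-way chance level, using the Experiment 1 figures since those counts are stated; the correct-count of roughly 68 is an assumption derived from the reported 63.5% accuracy over 107 test utterances.

```python
from scipy.stats import binomtest

n_test = 107                       # test utterances in Experiment 1
n_correct = round(0.635 * n_test)  # about 68, inferred from the 63.5% accuracy
result = binomtest(n_correct, n_test, p=0.25, alternative="greater")
print(result.pvalue)               # far below 0.05, i.e. well above chance
```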
Future Work
Extending the classifier to predict the affective state of the user based on acoustic and prosodic analysis as well as facial expressions.
Demonstration of the prototype system that takes live input through a webcam and microphone.
Integration of a speech recogniser and a machine translation component.
Questions?