Why predict emotions? Feature granularity levels

1. Why predict emotions?
Poster by Greg Nicholas. Adapted from the paper by Greg Nicholas, Mihai Rotaru, & Diane Litman.
Affective computing is a direction for improving spoken dialogue systems:
• Emotion detection (prediction)
• Emotion handling
2. Feature granularity levels
Detecting emotion: train a classifier on features extracted from user turns.
Types of features: lexical, pitch, amplitude, duration.
• Turn level: previous work mostly uses features computed over the entire turn (a single approximation of the pitch contour). Efficient, but the approximation of the pitch contour is coarse.
• Word level: [1] uses pitch features computed at the word level (one approximation of the pitch contour per word).
We concentrate on pitch features to detect uncertainty.
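To make the granularity contrast concrete, here is a minimal Python sketch. The pitch values, word spans, and min/max/mean statistics are invented for illustration; the poster does not spell out its exact feature set beyond pitch-based features.

```python
# Minimal sketch (invented data): contrast one turn-level pitch summary
# with per-word summaries. `turn_pitch` is a toy per-frame f0 track and
# `word_spans` are assumed frame ranges for each word.

def pitch_stats(frames):
    # Summarize a pitch contour with simple statistics.
    return {"min": min(frames), "max": max(frames),
            "mean": sum(frames) / len(frames)}

def turn_level_features(turn_pitch):
    # One coarse summary over the entire turn.
    return pitch_stats(turn_pitch)

def word_level_features(turn_pitch, word_spans):
    # One summary per word: a finer approximation of the contour.
    return [pitch_stats(turn_pitch[start:end]) for start, end in word_spans]

turn_pitch = [110, 112, 150, 180, 175, 120, 115, 118, 119, 160]
word_spans = [(0, 2), (2, 5), (5, 7), (7, 9), (9, 10)]  # "the force of the truck"

print(turn_level_features(turn_pitch))                   # one feature set
print(len(word_level_features(turn_pitch, word_spans)))  # 5 feature sets
```

The turn-level summary collapses the whole contour into one set of statistics, while the word-level version keeps one set per word.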
3. Problems classifying the overall turn emotion
Turn-level is simple:
• Labeling granularity = turn
• One set of features per turn
Word-level is more complicated:
• Label granularity mismatch: label at turn level, features at word level
• Variable number of features per turn
Example student turn: "The force of the truck"
• Turn-level: one feature set is extracted for the whole turn, so there is exactly one prediction for the turn (Uncertain).
• Word-level: one feature set per word ("the" "force" "of" "the" "truck" yields five sets), so there are five word-level predictions but only one turn-level label (Uncertain); how to combine them into a single overall turn prediction is the open question.
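A toy sketch of this mismatch (the helper function and example turns below are hypothetical, not corpus data):

```python
# Hypothetical sketch: the corpus provides one emotion label per turn,
# while word-level extraction yields a variable number of feature sets.

def fake_word_features(word):
    # Stand-in for real pitch features; word length is a toy feature.
    return [len(word)]

turns = [
    {"words": ["the", "force", "of", "the", "truck"], "label": "uncertain"},
    {"words": ["gravity"], "label": "non-uncertain"},
]

for turn in turns:
    feature_sets = [fake_word_features(w) for w in turn["words"]]
    # A turn-level learner expects one (features, label) pair per turn,
    # but here len(feature_sets) varies while the label count does not.
    print(len(feature_sets), "feature sets -> 1 label:", turn["label"])
```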
4. Techniques to solve this problem

Technique 1: Word-level emotion model (WLEM)
• Train: a word-level model, using the turn's emotion label for every word
• Predict: an emotion label for each word
• Combine: majority voting over the word-level predictions
Word-level feature sets offer a better approximation of the pitch contour than a single turn-level set (e.g. they capture the big pitch changes in uttering a word like "great").
Example: for the turn "the force of the truck", one feature set is extracted per word, and the word-level predictions (Non-uncertain, Uncertain, Non-uncertain, Uncertain, Non-uncertain) are combined by majority vote into the overall turn prediction: Non-uncertain (3/5).
Issues:
• Turn-to-word labeling assumption: the turn's label is assumed to hold for every word
• Majority voting is a very simple scheme
[1] showed that the WLEM method works better than turn-level features.

Technique 2: Predefined subset of sub-turn units (PSSU)
• Combine: concatenate the features of 3 words (first, middle, last) into a conglomerate feature set
• Train & predict: a turn-level model with the turn's emotion label
Example: for "the force of the truck", the feature sets of "the", "of", and "truck" are combined into one PSSU feature set, from which the model predicts Non-uncertain.
Issues:
• Might lose details from the discarded words
A similar idea was used in [2] at the breath-group level, but not at the word level.

5. Experimental Results
ITSPOKE dialogues
• Domain: qualitative physics tutoring
• Backend: WHY2-Atlas, Sphinx2 speech recognition, Cepstral text-to-speech

Corpus comparison with the previous study [1]:

                            Previous [1]               Current
  # of turns                220                        9854
  # of words                511                        27548
  words/turns               2.32                       2.80
  Emotional classification  Emotional/Non-emotional    Uncertain/Non-uncertain
  Class distribution        129/91 (E/nE)              2189/7665 (U/nU)
  Baseline                  58.64%                     77.79%

Comparison of recall and precision for predicting uncertain turns
[Figure: recall, precision, and f-measure (y-axis 0.30 to 0.80) for Turn-level, Word-level (WLEM), and Word-level (PSSU)]
• Turn-level: medium recall and precision
• WLEM: best recall, lowest precision; tends to over-generalize
• PSSU: good recall, best precision; much less over-generalization, overall the best choice

Overall prediction accuracy (baseline: 77.79%):

  Turn-level           81.97 (0.09)
  Word-level (WLEM)    82.53 (0.07)
  Word-level (PSSU)    84.11 (0.05)

• WLEM slightly improves upon turn-level (+0.56%)
• PSSU shows a much better improvement (+2.14%)
• Overall, PSSU is best according to this metric as well
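As a rough illustration, the two combination schemes can be sketched as follows, assuming per-word predictions and feature sets are already available (all values below are toy data, not the paper's features or trained models):

```python
# Toy sketch of the two word-level combination schemes.

def wlem_combine(word_predictions):
    # WLEM: majority voting over the per-word emotion predictions.
    return max(set(word_predictions), key=word_predictions.count)

def pssu_combine(word_feature_sets):
    # PSSU: concatenate the first, middle, and last word's features
    # into one conglomerate feature set for a turn-level model.
    first = word_feature_sets[0]
    middle = word_feature_sets[len(word_feature_sets) // 2]
    last = word_feature_sets[-1]
    return first + middle + last

# Per-word predictions as in the poster's example turn:
preds = ["non-uncertain", "uncertain", "non-uncertain",
         "uncertain", "non-uncertain"]
print(wlem_combine(preds))  # "non-uncertain" (3/5 majority)

# Toy per-word feature sets for "the force of the truck":
features = [[1.0], [2.0], [3.0], [4.0], [5.0]]
print(pssu_combine(features))  # [1.0, 3.0, 5.0]
```

Note that `pssu_combine` always yields a fixed-size feature set, which is what lets a standard turn-level classifier consume it directly.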
6. Future work
Many alterations could further improve these techniques:
• Annotate each individual word for certainty instead of whole turns
• Include other features pictured above: lexical, amplitude, etc.
• Try predicting in a human-human dialogue context
• Better combination techniques (e.g. confidence weighting)
• More selective choices for PSSU than the middle word of the turn
(e.g. longest word in the turn, ensuring the word chosen has
domain-specific content)
[1] M. Rotaru and D. Litman, "Using Word-level Pitch Features to Better Predict Student Emotions during Spoken Tutoring Dialogues," Proceedings of Interspeech, 2005.
[2] J. Liscombe, J. Hirschberg, and J. J. Venditti, "Detecting Certainness in Spoken Tutorial Dialogues," Proceedings of Interspeech, 2005.