Getting to the Heart of the Matter ; Nick Campbell LREC


Getting to the Heart of the Matter;
(or, “Speech is more than just
the expression of text or language”)
Nick Campbell
LREC
2004
ATR Network Informatics Labs
Keihanna Science City, Kyoto, Japan
[email protected]
Overview …
• This talk addresses the current
needs for so-called ‘emotion’ in
speech, but points out that the
issue is better described as the
expression of ‘relationships’ and
‘attitudes’ rather than by the
currently held raw (or big-six)
emotional states.
Comment
– For decades now, we have been producing and
improving methods for the input and output of speech
signals by computer, but the market seems very slow
to take up these technologies.
– In spite of the early promises for human-computer
voice-based interactions, the man or woman in the
street has yet to make much use of this technology in
their daily lives.
– There are already applications where computers
mediate in human spoken communications, but in only
a few limited domains.
– Our technology appears to have fallen short of its
earlier promises!
The latest buzz-word in speech
technology research : ‘emotion’
• Why is it that the latest promises make so much
of the word ‘emotion’?
• Perhaps because the current technology is
based so much upon written text as the core for
its processing?
• Speech recognition is evaluated by the extent to
which it can ‘accurately’ transliterate a spoken
utterance; and speech synthesis is driven, in the
majority of cases, just from the input text alone.
Real interactive speech
(cf read-speech)
• “spontaneous speech is ill-formed and often
includes redundant information such as
disfluencies, fillers, repetitions, repairs, and word
fragments” (S. Furui 2003, and many others)
• But we don’t just talk text!
– natural speech is interactive, so we show
relationships as much as we give information …
• And we don’t just talk sentences …
– grunts are common!
Example Dialogue
(a person talking to a robot)
• The human speaks
• The robot employs speech recognition
– (and presumably some form of processing)
then replies using speech synthesis
– (which the human supposedly understands)
• The interaction is ‘successful’ if the robot
responds in an intended manner
Example dialogue 1
• Excuse me
• Yes, can I help you?
• Errm, I’d like you to come here and take a
look at this …
• Certainly, wait a minute please.
• Can you see the problem?
• No, I’m afraid I don’t understand what you
are trying to show me.
• But look at this, here …
• Oh yes, I see what you mean!
Example dialogue 2
• Oi!
• Uh?
• Here!
• Oh …
• Okay?
• Eh?
• Look!
• Ah!
Which do we want?
• As engineers:
– The former – we can do it now
• As humans:
– The latter – it’s what we are used to
• And the robots?
– They should behave in the least obtrusive
way – naturally!
How should we talk with
robots?
• First, let’s take a look at how we talk with
each other …
not using actors – but real people
– in everyday conversational situations …
• Labov : the Observer’s Paradox
– interactions lose their naturalness when an
observer intrudes!
Overcoming the
Observer’s Paradox
analysis of a very large corpus of spoken
interaction
• The JST/Crest ESP project
JST Bulletin No. 131: “Living Information Technology for an Advanced Media Society”
“A Computer Processing System for Expressive Speech”
JST/CREST ESP Project
Expressive speaking styles
Nick Campbell
ATR Human Information Science Laboratories
Principal Investigator
Project Goals
• Speech technology
– Speech synthesis with 'feeling'
– Speaking-style feature analysis/detection
• Corpus of spontaneous speech
– 1000 hours of natural speech
• Scientific contribution
– Paralinguistics & communication
Progress to date
• More than 1000 hours recorded
• 500 hours speech collected
• 250 hours transcribed
• 75 hours labelled
– 25 voices
• Interfaces & specs are evolving
• We foresee some very new unit-selection techniques being developed
The
‘Pirelli-calendar’
approach
In 1970 a team of photographers took 1000 rolls of 36-exposure film on location to an island in the Pacific
in order to produce a calendar of twelve (glamour) images.
-> similarly, if we record an ‘almost infinite’ corpus of
speech, and develop techniques to extract the
interesting portions, then we will produce data which
is both representative and sufficient for studying the
full range of speaking-styles used in ordinary human
communication.
long-term recordings:
daily interactive speech
• MD & small head-mounted lavalier mic
• conversations with parents/husband/
friends/colleagues/clinic/others
• Japanese native-language speakers both sexes,
mixed ages, mixed scenarios
• Recording over a continuing period,
– speaking-style correlates of changes in
familiarity/interlocutor to be studied.
Problem
Proposal
Solution
Transcription
A scale based on dimensions of ‘speaking style’
Labelling emotion
• Two dimensions: active (strong) vs. passive (weak), and positive (bright) vs. negative (dark)
• Quadrant examples: ‘anger’ (active, negative), ‘joy’ (active, positive), ‘sadness’ (passive, negative), ‘acceptance’ (passive, positive)
• Plus free input of a single adjective or verb
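A minimal sketch, assuming a simple record per utterance (the field names and numeric ranges below are not from the talk), of how one such label might be stored:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmotionLabel:
    """Illustrative record for the two-dimensional labelling scheme above."""
    activation: float          # -1.0 (passive/weak) ... +1.0 (active/strong)
    valence: float             # -1.0 (negative/dark) ... +1.0 (positive/bright)
    free_word: Optional[str]   # a single adjective or verb entered by the labeller

# Example: an utterance heard as mildly joyful
label = EmotionLabel(activation=0.4, valence=0.7, free_word="cheerful")
```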
Discourse Act Labelling
a – greeting
b – closing
c – introduce-self
d – introduce-topic
e – give-information
f – give-opinion
g – affirm
h – negate
i – accept
j – reject
k – acknowledge
l – interject
m – thank
n – apologize
o – argue
p – suggest, offer
q – notice
s – connector
r – request-action
t – complain
u – flatter
w – talking-to-self
x – disfluency
y – acting
z – repeat
r* – request (a~z)
v* – verify (a~z)
*? – when the act is unclear
(see LREC 2004, 09-SE Wednesday 4pm)
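Purely for illustration (not material from the talk), the single-letter codes above could be kept in a simple lookup table:

```python
# Illustrative mapping of the single-letter discourse-act codes listed above.
DISCOURSE_ACTS = {
    "a": "greeting",        "b": "closing",          "c": "introduce-self",
    "d": "introduce-topic", "e": "give-information", "f": "give-opinion",
    "g": "affirm",          "h": "negate",           "i": "accept",
    "j": "reject",          "k": "acknowledge",      "l": "interject",
    "m": "thank",           "n": "apologize",        "o": "argue",
    "p": "suggest, offer",  "q": "notice",           "s": "connector",
    "r": "request-action",  "t": "complain",         "u": "flatter",
    "w": "talking-to-self", "x": "disfluency",       "y": "acting",
    "z": "repeat",
}

print(DISCOURSE_ACTS["k"])  # -> acknowledge
```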
Acoustic Analysis / Visualisation tool
[Screenshot of the tool, showing: quasi-syllable boundaries; boundaries of quasi-syllabic nuclei; phonetic labels (if available); the sonorant energy contour; the F0 contour; (a) variance in delta-cepstrum; (b) formant/FFT cepstral distance; glottal AQ (pressed ~ breathy); a composite (a & b) measure of reliability; and estimated vocal-tract area-functions.]
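As a rough sketch only (this is not the ATR tool; the frame sizes, thresholds, and autocorrelation method are assumptions), two of the displayed quantities, a short-time energy contour and an F0 contour, might be computed roughly as follows:

```python
import numpy as np

def frame_energy(signal: np.ndarray, frame: int = 400, hop: int = 160) -> np.ndarray:
    """Short-time log energy, one value per frame (frame/hop in samples)."""
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame, hop)]
    return np.array([10 * np.log10(np.sum(f ** 2) + 1e-12) for f in frames])

def frame_f0(signal: np.ndarray, fs: int, frame: int = 400, hop: int = 160,
             fmin: float = 70.0, fmax: float = 400.0) -> np.ndarray:
    """Very simple autocorrelation F0 estimate per frame (0.0 = unvoiced/undecided)."""
    lo, hi = int(fs / fmax), int(fs / fmin)
    f0 = []
    for i in range(0, len(signal) - frame, hop):
        x = signal[i:i + frame] - np.mean(signal[i:i + frame])
        ac = np.correlate(x, x, mode="full")[frame - 1:]   # autocorrelation, lag >= 0
        if ac[0] <= 0:
            f0.append(0.0)
            continue
        lag = lo + int(np.argmax(ac[lo:hi]))               # best lag in the F0 range
        f0.append(fs / lag if ac[lag] / ac[0] > 0.3 else 0.0)
    return np.array(f0)
```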
Voice Quality & Affect
• 13,604 conversational utterances
• 1 female Japanese speaker (age 32)
• listener/speech-act/emotion labels
• Interlocutor counts: Child 139, Family 3623, Friends 9044, Others 632, Self 116
Listener
relations
Talking to:
• child
• family
• friends
• others
• self
NAQ & F0
by family
m1 - mother
m2 - father
m3 - baby girl
m4 - husband
m5 - big sister
m6 - nephew
m8 - aunt
Meaningful speech is a
uniquely human characteristic,
but …
• Apes use gestural communication, but not for
communicating propositional content.
• Birds and seals can mimic human sounds, but
their tunes don't contain semantic meaning.
• Bees can communicate precise geographical
locations with their dances ….
• African wild dogs show a high degree of social
organisation, and they use body postures and
the prosody of their barks to guide the hunt and
keep the pack together.
Human language
development
• It is likely that early humans used their voices in
similar ways to the hunting dogs, and that the use
of voice to complement or replace face-to-face
communication (and touch) for social interaction
and reassurance pre-dated propositional
communication.
– In this case, prosody would have been a
precursor to meaningful speech, which
developed later.
Language as Distal
Communication
• The ‘park or ride’ hypothesis (Ross, 2001)
– the development of language in humans.
“Human mothers would have had to put down their
helpless but heavy babies (who had difficulty in
clinging on by themselves) in order to forage for
food, but they maintained contact with each other
through voice, or tone-of-voice” (my italics)
– This distal communication would have reassured both
mother and child that all was well, even though they
might actually be out of direct sight of each other.
Non-linguistic speech
“it is all too tempting to think of
language as consisting of a set (infinite,
of course) of independent meaning-form
pairs. This way of thinking has become
habitual in modern linguistics” (Hurford
1999)
– But part of being human, and of taking
one's place in a social network, also
involves making inferences about the
feelings of others and having an empathy
for those feelings. (me)
“Motherese”
“If the origins of human language, or distal
communication, can be traced back to the
music of motherese, or infant-directed
prosody, then it is easy to speculate that the
sounds of the human voice replaced the
vision of the face (and body) for the
identification of social and security-related
information.”
(Falk, 2003, my italics).
Prosody and
Cognitive Neurology
“Just as stereoscopic vision yields more
than the simple sum of input from the two
eyes alone, so binaural listening probably
gives us more than just the sum of the text
and its linguistic prosody alone” (Auchlin
2003)
– Language may be predominantly processed in the left brain, but much of its prosody is processed in the right.
Right-brain prosody
• Several studies have confirmed that
understanding of propositional content activates
the prefrontal cortex bilaterally, more on the left
than on the right, and that, in contrast,
responding to emotional prosody activates the
right prefrontal cortex more.
• “the right frontal lobe is perhaps particularly
critical, maybe because of its central role in the
neural network, for social cognition, including
inferences about feelings of others and empathy
for those feelings.” (Pandya et al, 1996)
• See also Monrad-Krohn (1945~), etc …
Binaural Speech Processing
(an extreme view!)
• Information coming into the right ear and the
left ear is processed separately in the brain
before being perceived as speech.
• Speculation:
– if the left brain (right ear) is more tuned for linguistic
processing, and the right brain (left ear) more tuned
for affective processing, then it is likely that the
separate activation of the two hemispheres gives an
extra-linguistic ‘depth’ to an utterance.
(but cf. telephones!)
A two-tiered view of speech
communication
two types of utterance :
• I-type express linguistic information
• A-type express affect
– The former can be well described by their text
alone; but the latter also need prosodic info.
– any utterance may contain both I-type and A-type
information, but is primarily of one type or the
other.
– The expressivity of an utterance is realised
through a socially-motivated framework that
determines its physical characteristics.
a framework for utterance specification
Self + Other + Event
• An utterance is realised as an event (=E*)
taking place within the framework of mood
and interest (=S) and friend and friendly (=O)
constraints
• mood & interest, friend & friendly:
– If motivation or interest in the content of the utterance is high, then the speech is typically more expressive. If the speaker is in a good mood, then more so …
– If the listener (other) is a friend, then the speech is more relaxed, and in a friendly situation, even more so
* I-type or A-type
Realising Conversational
Speech Utterances
U = (S, O) | E

U : the utterance
S : self (mood & interest), marked – or +
O : other (friend & friendly), marked – or +
(four combinations: – –, – +, + –, + +)
E : the event (effect): giving or getting either
I : information, or
A : affect
(the four cases labelled a, b, c, d)
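Purely as an illustrative sketch (not part of the original talk; the field names and the scoring rule below are assumptions), the (S, O) | E framing could be encoded as a small data structure in which expressivity grows with the positive S and O constraints:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    """Toy encoding of U = (S, O) | E; names and scoring are illustrative only."""
    mood: bool      # S: is the speaker in a good mood?
    interest: bool  # S: is the speaker interested in the content?
    friend: bool    # O: is the listener a friend?
    friendly: bool  # O: is the situation friendly?
    event: str      # E: 'I' (information) or 'A' (affect)
    giving: bool    # giving (True) vs. getting (False); kept for the a/b/c/d cells

    def expressivity(self) -> int:
        """Assumed rule: A-type events start from a higher baseline, and each
        positive S or O constraint adds one step of expressiveness."""
        base = 2 if self.event == "A" else 1
        return base + sum([self.mood, self.interest, self.friend, self.friendly])

# Example: giving affect to a friend, friendly situation, good mood, high interest
u = Utterance(mood=True, interest=True, friend=True, friendly=True,
              event="A", giving=True)
print(u.expressivity())  # -> 6 under these illustrative assumptions
```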
Discussion
• Our analysis of a very large corpus of natural
spontaneous conversational speech indicates
that both Information & Affect may be realised
in parallel in speech, for both social and
historical reasons
• Speech synthesis (and recognition) should
soon start to take these two different types of
communication into consideration
- i.e., not emotion, but function & interaction
Conclusion
• Speech conveys multiple tiers of information, not all
of which are considered in present linguistic or
speech technology research.
• Prosody has an important and extra-linguistic
communicative function which can be explained by
language evolution and cognitive neurology
• If speech technology is to consider ‘language-in-action’ as well as ‘language-as-system’, then those
levels of information which cannot be accurately portrayed by a transcription of the speech alone
must be taken into consideration.
Thank you
speech
language
expression
noise
Monrad-Krohn
• uses of speech prosody categorised into four main
groups:
– i) Intrinsic prosody, for the intonation contours which
distinguish e.g., a declarative from an interrogative sentence,
– ii) Intellectual prosody, for the intonation which gives a
sentence its particular situated meaning by placing emphasis
on certain words rather than others,
– iii) Emotional prosody, for expressing anger, joy, and the
other emotions, and
– iv) Inarticulate prosody, which consists of grunts or sighs and
conveys approval or hesitation
Definition of the Glottal AQ (Amplitude Quotient)
-- figures taken from Alku et al. (JASA, August 2002) --
AQ = f_ac / d_peak = T2 ~ the “effective decay time” (Fant et al., 1994)
(f_ac : AC amplitude of the glottal-flow pulse; d_peak : negative peak of the glottal-flow derivative)
[Figure: estimated glottal-flow waveforms and their derivatives, with a stylised triangular glottal-flow pulse, for (1) breathy and (2) pressed phonation.]
With thanks to Parham Mokhtari
Normalised Amplitude Quotient (NAQ)
-- Alku et al. (2002) --
NAQ = AQ / T0 = AQ · F0
• Normalises the approximately inverse relationship with F0
• Yields a parameter more closely associated with phonation quality
• NAQ is closely related to the glottal Closing Quotient (CQ), but is more reliably measured than CQ (Alku et al., 2002)
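The two formulas above translate directly into code; the following is only a rough sketch under the assumption that an estimated glottal-flow signal is already available (in practice it would first be obtained by inverse filtering the speech):

```python
import numpy as np

def amplitude_quotient(glottal_flow: np.ndarray, fs: float) -> float:
    """AQ = f_ac / d_peak (Alku et al., 2002).

    f_ac   : AC (peak-to-peak) amplitude of the glottal-flow pulse
    d_peak : magnitude of the negative peak of the flow derivative
    Assumes `glottal_flow` is an already-estimated flow signal sampled at `fs` Hz.
    """
    f_ac = float(glottal_flow.max() - glottal_flow.min())
    d_peak = float(-np.min(np.diff(glottal_flow) * fs))  # negative peak of dU/dt
    return f_ac / d_peak                                  # units: seconds

def normalised_amplitude_quotient(aq: float, f0: float) -> float:
    """NAQ = AQ / T0 = AQ * F0 (dimensionless), normalising the F0 dependence."""
    return aq * f0
```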
[Figure panels: Speaker FIA, Speaker FAN]