Speech acoustics and phonetics


From speech signal acoustics
to perception
Louis C.W. Pols
Institute of Phonetic Sciences (IFA)
Amsterdam Center for Language
and Communication (ACLC)
NATO-ASI “Dynamics of Speech
Production and Perception”
Il Ciocco, Tuscany, Italy, July 4, 2002
Overview

how do we perceive (speech) dynamics?
from psychoacoustics to speech perception
(lack of) context; robustness; continuity
V and C reduction; coarticulation
The Intelligent Ear. On the Nature of Sound Perception, by Reinier Plomp (2002)
perceptual compensation for articulatory undershoot?
speech efficiency
conclusions
July 4, 2002
From speech signal acoustics to perception, Il Ciocco
Various scientific preferences

several biases have affected the history of
(speech &) hearing research (Plomp, 2002):




dominance of sinusoidal tones as stimuli
preference for microscopic approach (e.g.,
phoneme discrimination rather than intelligibility)
emphasis on psychophysical (rather than
cognitive) aspects of hearing
clean stimuli in the lab rather than the acoustic
reality of the outside world (disruptive sounds)
Psychoacoustics - speech perception

duration, pitch, loudness, timbre, direction
absolute and masked threshold, jnd (just-noticeable difference), discrimination
continuity
complexity (pure - complex tone, voicing)
effect of context, meaning (intelligibility), frequency of occurrence
phoneme: more text-guided than perceived
speech perceptual tasks:
  phoneme -> sentence identification; discrimination; matching
Detection thresholds and jnd

simple, stationary signals

phenomenon                 threshold/jnd                    remarks
threshold of hearing       0 dB at 1000 Hz                  frequency dependent
threshold of duration      constant energy at 10 – 300 ms   Energy = Power x Duration
frequency discrimination   1.5 Hz at 1000 Hz                more when < 200 ms
intensity discrimination   0.5 – 1 dB                       up to 80 dB SL
temporal discrimination    5 ms at 50 ms                    duration dependent
masking                    psychophysical tuning curve
pitch of complex tones     low pitch                        many peculiarities
gap detection              3 ms for wide-band noise         more at low freq. for narrow-band noise

multi-harmonic, single-formant-like periodic signals

phenomenon                 threshold/jnd                    remarks
formant frequency          3 - 5 %                          one formant only; < 3 % with more experienced subjects
formant amplitude          3 dB                             F2 in synthetic vowel
overall intensity          1.5 dB                           synthetic vowel, mainly F1
formant bandwidth BW       20 - 40 %                        one-formant vowel
F0 (pitch)                 0.3 - 0.5 %                      synthetic vowel
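
The duration row of the table (constant energy between 10 and 300 ms, Energy = Power x Duration) implies a simple trade-off: if the threshold energy is fixed, halving the duration raises the threshold power by about 3 dB. The sketch below works that out. The function name and the 300 ms reference are illustrative choices, not taken from the slides.

```python
import math

def threshold_shift_db(duration_ms, reference_ms=300.0):
    """Change in threshold power (dB) relative to a signal of reference_ms,
    assuming threshold energy (power x duration) stays constant."""
    if not (10.0 <= duration_ms <= 300.0 and 10.0 <= reference_ms <= 300.0):
        # The slide states the constant-energy rule only for 10-300 ms.
        raise ValueError("constant-energy rule is only stated for 10-300 ms")
    # Constant energy: P * T = P_ref * T_ref, so P / P_ref = T_ref / T.
    return 10.0 * math.log10(reference_ms / duration_ms)

for d in (300, 150, 30, 10):
    print(f"{d:>3} ms: +{threshold_shift_db(d):.1f} dB re 300 ms")
```

Outside the 10 to 300 ms range the function raises an error rather than extrapolating, since the table makes no claim there.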
Perceiving speech-like transitions

Ph.D thesis A. van Wieringen (1995)
  “Perceiving dynamic speechlike sounds. Psychoacoustics and speech perception”
  see also van Wieringen & Pols, Acustica 84 (1998) 520-528
stimulus characteristics
  (segmented and/or reversed) natural or synthetic
  tone glide; single- or multi-formant transition
  isolated transition; initial or final transition with steady state
  converging or diverging transition (variable duration or slope)
task: jnd/DL; matching; absolute identification; classification
DL for short speech-like transitions
[Figure: difference limens in endpoint frequency (Hz, 0 to 240) against transition duration (20 to 50 ms, short to longer transitions), for initial and final transitions; stimuli range from simple to complex: tone glide, single formant (including single-isolated), complex.]
Adapted from van Wieringen & Pols (1998), Acta Acustica 84, 520-528, “Discrimination of short and rapid speechlike transitions”
Perceiving (speech) dynamics

vowel perception with or without transitions?
our claims (vSon, IFA Proc. 17 (1993)):
  evidence for compensatory processes (perceptual overshoot, dynamic specification) only when in an appropriate context
  synthetic isolated dynamic formant tracks lead to perceptual undershoot (= averaging)
  silent center studies are ambiguous
concl.: info in formant dynamics is only used when V’s are heard in appropriate context
[Figure: schematic F1 and F2 tracks (frequency in Hz against time in ms) for dynamic tokens (onglide, complete, offglide) and stationary (reference) tokens. Formant excursions: F1 = -225 Hz or 225 Hz; F2 = -375 Hz or 375 Hz. Durations shown: 25, 50, 100, 150 ms (two panels) and 6.3, 12.5, 25, 50, 100, 150 ms (one panel).]
Vowel identification



compare V responses for dynamic stimuli
with those for static stimuli
calculate net shift in V responses per onglide
(CV), complete (CVC), or offglide (VC)
result: responses average over the trailing
part of the formant track
Perceptual undershoot
[Figure: % net shift (-50 to 50) against token duration (25, 50, 100, 150 ms), with curves for F1 = -225 Hz, F1 = 225 Hz, F2 = -375 Hz, and F2 = 375 Hz.]
Net shift in vowel responses to tokens with curved formant tracks vs. stationary tokens. All values significant, except small open triangles.
Effect of local context



“Perisegmental speech improves consonant and vowel identification”, vSon & Pols, Speech Comm. 29, 1-22 (1999)
also “Phoneme recognition as a function of task and context”, IFA Proc. 24, 27-38 (2001) and Proc. SPRAAC, 25-30 (2001)
also Pols & vSon (1993), “Acoustics and perception of dynamic vowel segments”, Speech Comm. 13, 135-147
V and C identification




gated tokens from 120 CVC speech fragments
taken from a long text reading
50 ms V kernel, + V trans., + C part (L/R)
stimuli randomized; V identification (17 Ss)
and Ci and Cf identification (15 Ss)
results:



phoneme identification benefits from extra speech
left context more beneficial than right context
better identification when also other member of
pair was identified correctly (context effect)
Stimulus token types
[Figure: gated token types on a time axis (0 to 233 ms). Vowel identification: Kernel (50), V (112), CV (91), VC (91), CVC (152). Consonant identification: CT (41), CCT (56), CV (91), CCV (106) for the initial consonant; TC (41), TCC (56), VC (91), VCC (106) for the final consonant. A 50 ms kernel, the transitions, and offsets of +10 ms, –10 ms, and +25 ms are marked.]

[Figure: vowel identification errors (%, 0 to 40) and log2 perplexity (bits, 0.0 to 1.5) per stimulus type (Kernel, V, CV, VC, CVC), for all tokens (N = 2040), + Accent (N = 1003), and – Accent (N = 1037).]
Error rates of vowel identification for the individual stimulus token types. Long-short vowel errors (/α-a:, -o:/) are ignored
[Figure: errors (%, 0 to 70) for C and V in CV tokens and for V and C in VC tokens, split by whether the other segment was correct or in error; N = 1680.]
V and C in CV tokens were identified better when the other member of the pair was identified correctly
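
The comparison in this figure amounts to an error rate conditioned on the other member of the pair. A minimal sketch of that bookkeeping follows; the function name and the response data are hypothetical, made up only to show the calculation, and are not the experimental results.

```python
def conditional_error_rates(pairs):
    """pairs: iterable of (segment_correct, other_correct) booleans.
    Returns the segment's error rate (%) keyed by whether the other
    member of the pair was identified correctly."""
    counts = {True: [0, 0], False: [0, 0]}  # other_correct -> [errors, total]
    for segment_correct, other_correct in pairs:
        counts[other_correct][1] += 1
        if not segment_correct:
            counts[other_correct][0] += 1
    return {
        other: (100.0 * errors / total if total else float("nan"))
        for other, (errors, total) in counts.items()
    }

# Hypothetical responses: (V correct?, C correct?) for a set of CV tokens.
responses = (
    [(True, True)] * 8 + [(False, True)] * 2
    + [(True, False)] * 3 + [(False, False)] * 3
)
rates = conditional_error_rates(responses)
print(f"V error when C correct: {rates[True]:.0f}%")
print(f"V error when C wrong:   {rates[False]:.0f}%")
```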
Effect of (lack of) context

100 Dutch listeners identifying V segments
“Vowel contrast reduction”, K-vBeinum (1980)

3 conditions                     M1     M2     F1     F2     Av.
isolated V (3)             %     95.2   88.9   88.0   86.4   89.6
                           ASC   433    404    447    634    480
words (5)                  %     88.1   78.8   84.9   85.3   84.3
                           ASC   406    320    374    529    407
unstr., free conv. (10)    %     31.2   28.7   33.3   38.9   33.0
                           ASC   174    119    209    255    189

$$\mathrm{ASC} = \frac{1}{n}\sum_{i=1}^{n}\left|LF_i - \overline{LF}\right|^2 \ \text{(total variance)}, \qquad LF_i = 100\,\log_{10} F_i$$
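
As a rough illustration of the ASC measure (total variance of the vowel points after the log transform LFi = 100 10log Fi), the sketch below computes it for two small vowel sets. It assumes each vowel is an (F1, F2) point and that the deviation is taken from the centroid of all points; the formant values are hypothetical, not data from the study.

```python
import math

def log_formants(formants_hz):
    """Apply LF = 100 * log10(F) to each formant of one vowel."""
    return tuple(100.0 * math.log10(f) for f in formants_hz)

def acoustic_system_contrast(vowels_hz):
    """Mean squared distance of the log-formant points to their centroid."""
    points = [log_formants(v) for v in vowels_hz]
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    return sum(
        sum((p[d] - centroid[d]) ** 2 for d in range(dims)) for p in points
    ) / n

# Hypothetical (F1, F2) values in Hz: a spread-out and a reduced vowel set.
clear = [(300, 2300), (750, 1300), (320, 800)]
reduced = [(400, 1800), (600, 1400), (420, 1100)]
print(f"ASC clear:   {acoustic_system_contrast(clear):.0f}")
print(f"ASC reduced: {acoustic_system_contrast(reduced):.0f}")
```

A more compact vowel space gives a smaller value, which is the direction of the drop from isolated vowels to unstressed conversational vowels in the table.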
Human word intelligibility vs. noise
from Ph.D thesis H. Steeneken (1992), ‘On measuring and predicting speech intelligibility’
Robustness to degraded speech


speech = time-modulated signal in frequency bands
relatively insensitive to (spectral) distortions
  prerequisite for digital hearing aid
  modulating spectral slope: -5 to +5 dB/oct, 0.25-2 Hz
temporal smearing of envelope modulation
  ca. 4 Hz max. in modulation spectrum (syllable)
  LP>4 Hz and HP<8 Hz little effect on intelligibility
spectral envelope smearing
  for BW>1/3 oct masked SRT starts to degrade
(for references, see keynote paper Pols in Proc. ICPhS’99)
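
To make the notion of a modulation spectrum with a maximum near 4 Hz concrete, here is a minimal sketch that evaluates the spectrum of an amplitude envelope at a few modulation frequencies. The envelope is synthetic (a 4 Hz component plus a weaker 12 Hz one), chosen only for illustration; with real speech one would first extract the envelope per frequency band.

```python
import math

def modulation_spectrum(envelope, sample_rate_hz, freqs_hz):
    """Amplitude of the envelope's DFT at the requested modulation frequencies."""
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [x - mean for x in envelope]  # remove the DC component
    spectrum = {}
    for f in freqs_hz:
        w = 2.0 * math.pi * f / sample_rate_hz
        re = sum(x * math.cos(w * k) for k, x in enumerate(centered))
        im = sum(x * math.sin(w * k) for k, x in enumerate(centered))
        spectrum[f] = 2.0 * math.hypot(re, im) / n
    return spectrum

# Synthetic envelope sampled at 100 Hz for 2 s (assumed values).
rate = 100
envelope = [
    1.0
    + 0.8 * math.cos(2.0 * math.pi * 4 * k / rate)
    + 0.2 * math.cos(2.0 * math.pi * 12 * k / rate)
    for k in range(2 * rate)
]
spectrum = modulation_spectrum(envelope, rate, range(1, 17))
peak = max(spectrum, key=spectrum.get)
print(f"strongest modulation at {peak} Hz")
```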

Some examples

partly reversed speech (Saberi & Perrott, Nature, 4/99)
  fixed duration segments time reversed or shifted in time
  perfect sentence intelligibility up to 50 ms
  (demo: every 50 ms reversed; original)
  low frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
  syllable as information unit? (S. Greenberg)
gap and click restoration (Warren)
gating experiments
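
The partly reversed speech manipulation (fixed-duration segments reversed in time) is easy to sketch on a list of samples. The function below is an illustration under simple assumptions: mono samples in a Python list, and a final shorter segment that is reversed as well. It is not the procedure used to produce the original demo.

```python
def reverse_segments(samples, sample_rate_hz, segment_ms):
    """Reverse each consecutive segment of segment_ms in place in time."""
    seg_len = max(1, round(sample_rate_hz * segment_ms / 1000.0))
    out = []
    for start in range(0, len(samples), seg_len):
        # Reverse the samples inside this segment; segment order is kept.
        out.extend(reversed(samples[start:start + seg_len]))
    return out

# Toy signal: 10 samples at 1000 Hz, reversed in 3 ms segments.
signal = list(range(10))
print(reverse_segments(signal, 1000, 3))
```

For speech sampled at 16 kHz, a 50 ms segment would correspond to 800 samples per reversed chunk.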
Continuity, especially while masked

continuity effect (Miller & Licklider), auditory induction (Warren), pulsation threshold (Houtgast)
  also for gliding tones
  also for complex tones
  also for pitch
fission, fusion
segregation, streaming
phonemic restoration
[Figure: frequency (Hz; 500, 900, 1200, 2000 marked) against time.]
V and C reduction, coarticulation



spectral variability is not random but,
at least partly, speaker-, style-, and
context-specific
read - spontaneous; stressed - unstressed
not just for vowels, but also for consonants



duration; spectral balance
intervocalic sound energy difference
F2 slope difference; locus equation
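
A locus equation is commonly written as a straight-line fit of F2 at the transition onset against F2 in the vowel, F2_onset = k * F2_vowel + c. The sketch below fits k and c by ordinary least squares. The formant values are hypothetical and lie exactly on a line so the result is easy to check; the slides do not say how the fit was done in the cited work.

```python
def fit_locus_equation(f2_vowel_hz, f2_onset_hz):
    """Least-squares fit of F2_onset = slope * F2_vowel + intercept."""
    n = len(f2_vowel_hz)
    mean_x = sum(f2_vowel_hz) / n
    mean_y = sum(f2_onset_hz) / n
    sxx = sum((x - mean_x) ** 2 for x in f2_vowel_hz)
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(f2_vowel_hz, f2_onset_hz)
    )
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical F2 values in Hz for one consonant in five vowel contexts.
f2_vowel = [800, 1200, 1600, 2000, 2400]
f2_onset = [1300, 1500, 1700, 1900, 2100]
slope, intercept = fit_locus_equation(f2_vowel, f2_onset)
print(f"slope = {slope:.2f}, intercept = {intercept:.0f} Hz")
```

A slope difference between speaking styles, as mentioned in the list above, would then be a comparison of the fitted slope across conditions.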
C-duration and C error rate
[Figure, two panels: mean consonant duration (ms) and mean error rate for C identification (%), each for read vs. spontaneous speech in stressed, unstressed, and total sets. p values shown for stressed, unstressed, total: 0.001, 0.006, 0.001 (duration) and 0.001, 0.001, 0.001 (error rate).]
791 VCV pairs (read & spontaneous; stressed & unstressed segments; one male speaker); C identification by 22 Dutch subjects
Adapted from van Son & Pols (Eurospeech’97)
Perception of acoustic V reduction

Ph.D thesis Dick van Bergem (1995)
  “Acoustic and lexical vowel reduction”
lexical V reduction: Fr /betõ/ vs. Du /b@tOn/
acoustic V reduction:
  Du ‘miljoen’ as /mIljun/ or as /m@ljun/
identify the unstressed vowels (as V or @)
  by 20 listeners (8 M, 12 F)
  in 47 words (cond. W and S)
  or 20 words (cond. P), like ‘milJOEN’ or ‘biosCOOP’
  spoken by 20 male speakers (2280 stimuli)
[Figure: 4 reduction stages for 20 speakers; % schwa responses on /I/ by 20 listeners: 5%, 36%, 60%, 69%; model prediction for schwa in this m-l context. Adapted from vBergem (1995).]
Conclusion: vowel reduction is not centralization but contextual assimilation
Speech efficiency

speech is most efficient if it contains only the information needed to understand it:
  “Speech is the missing information” (Lindblom, JASA ‘96)
less information needed for more predictable things:
  shorter duration and more spectral reduction for high-frequency syllables and words
  C-confusion correlates with acoustic factors (duration, CoG) and with information content (syll./word freq.)
  I(x) = -log2(Prob(x)) in bits
(see van Son, Koopmans-van Beinum, and Pols (ICSLP’98))
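
The information measure I(x) = -log2(Prob(x)) can be estimated from relative frequencies in a corpus. The sketch below does this for a toy token list; estimating Prob(x) as count divided by total is an assumption made here for illustration, and the tokens are made up.

```python
import math
from collections import Counter

def information_content(tokens):
    """Return I(x) = -log2(Prob(x)) in bits, with Prob(x) = count / total."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {item: -math.log2(count / total) for item, count in counts.items()}

# Toy corpus: the frequent item carries fewer bits than the rare ones.
corpus = ["de", "de", "kat", "zit"]
info = information_content(corpus)
for item, bits in info.items():
    print(f"I({item}) = {bits:.1f} bits")
```

Frequent syllables or words get low values, which matches the claim that more predictable items need less acoustic information.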
Correlation between consonant confusion and 4 measures indicated
[Figure: correlation coefficients (0 to -0.40) between consonant confusion and Duration, CoG, I(syllable), and I(word), for the conditions All, Read -, Read +, Spont -, and Spont +; significance marked with *, ** and +.]
Material: Dutch male speaker; 20 min. read/spontaneous speech; 12k syllables; 8k words; 791 VCV pairs (read/spontaneous), of which 308 lexically stressed and 483 unstressed; C identification by 22 Ss
Adapted from van Son et al. (Proc. ICSLP’98)
Conclusions




perceiving speech (segments) very much depends on speech quality and context
presenting segments in isolation is also a kind of context
formant transitions are only ‘properly’ interpreted (perceptual compensation for spectro-temporal undershoot) when presented in an appropriate context
reduced V are best perceived as schwa if transitions are contextually assimilated