Flexible, Robust, and Efficient Human Speech Processing
Human and Machine Performance in Speech Processing
Louis C.W. Pols
Institute of Phonetic Sciences / ACLC
University of Amsterdam, The Netherlands
(Apologies: this presentation resembles my keynote at ICPhS'99, San Francisco, CA)
IFA
Herengracht 338
Amsterdam
welcome
Heraeus-Seminar
“Speech Recognition and Speech Understanding”
April 3-5, 2000, Physikzentrum Bad Honnef, Germany
Overview
Phonetics and speech technology
Do recognizers need ‘intelligent ears’?
What is knowledge?
How good is human/machine speech recognition?
How good is synthetic speech?
Pre-processor characteristics
Useful (phonetic) knowledge
Computational phonetics
Discussion/conclusions
Phonetics ↔ Speech Technology
AFFINITY
to phonetics, from speech technology:
- source / filter
- individuality
- context
- prosody
- more data
- new models
- probabilities
- speech vs. NLP
to speech technology, from phonetics:
- human performance
- specific knowledge
- regularities
- multiple features
- EU FPV, DARPA
- applications
- user orientation
- evaluation
Machine performance
more difficult if the test condition deviates from the training condition, because of:
- nativeness and age of speakers
- size and content of vocabulary
- speaking style, emotion, rate
- microphone, background noise, reverberation, communication channel
- non-availability of certain features
however, machines never get tired, bored, or distracted
Do recognizers need intelligent ears?
'intelligent ears' = front-end pre-processor
only if it improves performance
humans are generally better speech processors than machines, so perhaps system developers can learn from human behavior
robustness is at stake (noise, reverberation, incompleteness, restoration, competing speakers, variable speaking rate, context, dialects, non-nativeness, style, emotion)
What is knowledge?
phonetic knowledge
probabilistic knowledge from databases
fixed set of features vs. adaptable set
trading relations, selectivity
knowledge of the world, expectation
global vs. detailed
see video
(with permission from Interbrew Nederland NV)
Video is a metaphor for:
from global to detail (world → Europe → Holland → North Sea coast → Scheveningen beach → young lady drinking Dommelsch beer)
sound → speech → speaker → English → utterance: 'recognize speech' or 'wreck a nice beach'
zoom in on whatever information is available
make an intelligent interpretation, given context
beware of distractors!
Human auditory sensitivity
stationary vs. dynamic signals
simple vs. spectrally complex
detection threshold
just noticeable differences
Detection thresholds and jnd
simple, stationary signals:
phenomenon | threshold/jnd | remarks
threshold of hearing | 0 dB at 1000 Hz | frequency dependent
threshold of duration | constant energy at 10-300 ms | Energy = Power x Duration
frequency discrimination | 1.5 Hz at 1000 Hz | more when < 200 ms; up to 80 dB SL
intensity discrimination | 0.5-1 dB | duration dependent
temporal discrimination | 5 ms at 50 ms | gap detection: 3 ms for wide-band noise, more at low freq. for narrow-band noise
masking | psychophysical tuning curve |
pitch of complex tones | low pitch | many peculiarities

multi-harmonic, single-formant-like periodic signals:
phenomenon | threshold/jnd | remarks
formant frequency | 3-5% | one formant only; < 3% with more experienced subjects
formant amplitude | 3 dB | F2 in synthetic vowel
overall intensity | 1.5 dB | synthetic vowel
formant bandwidth (BW) | 20-40% | one-formant vowel
F0 (pitch) | 0.3-0.5% | synthetic vowel, mainly F1

Table 3 in Proc. ICPhS'99 paper
DL for short speech-like transitions
[Figure: difference limens in endpoint frequency (Hz; 0-240) vs. transition duration (20-50 ms), for complex, single, single-isolated, and tone-glide stimuli; DLs are larger for complex than for simple signals, and decrease from short to longer transitions]
Adapted from van Wieringen & Pols (Acta Acustica '98)
How good is human / machine speech recognition?
corpus | description | vocabulary size | perplexity | machine % word error | human % word error
TI digits | read digits | 10 | 10 | 0.72 | 0.009
alphabet | read letters | 26 | 26 | 5 | 1.6
Resource Management | read sentences | 1,000 | 60-1,000 | 17 | 2
NAB | read sentences | 5,000-unlimited | 45-160 | 6.6 | 0.4
Switchboard CSR | spontaneous telephone conversations | 2,000-unlimited | 80-150 | 43 | 4
Switchboard wordspotting | idem | 20 keywords | - | 31.1 | 7.4
Adapted from Lippmann (SpeCom, 1997)
How good is
human / machine speech recognition?
machine SR surprisingly good for certain tasks
machine SR could be better for many others
- robustness, outliers
what are the limits of human performance?
- in noise
- for degraded speech
- missing information (trading)
Human word intelligibility vs. noise
[Figure: word intelligibility as a function of noise level; at noise levels where humans just start to have some trouble, recognizers already have serious trouble]
Adapted from Steeneken (1992)
Robustness to degraded speech
speech = time-modulated signal in frequency bands
relatively insensitive to (spectral) distortions
- prerequisite for digital hearing aids
- modulating spectral slope: -5 to +5 dB/oct, 0.25-2 Hz
temporal smearing of envelope modulation
- ca. 4 Hz maximum in modulation spectrum (syllable rate)
- LP > 4 Hz and HP < 8 Hz have little effect on intelligibility
spectral envelope smearing
- for BW > 1/3 oct, masked SRT starts to degrade
(for references, see paper in Proc. ICPhS'99)
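The temporal-smearing manipulation can be sketched as: extract the amplitude envelope, low-pass filter it in the modulation domain, and re-impose it on the fine structure. A rough numpy illustration, not the procedure of the cited studies (the |x| envelope and brick-wall FFT filter are crude stand-ins for Hilbert envelopes and proper modulation filters):

```python
import numpy as np

def smear_envelope(signal, fs, cutoff_hz=4.0):
    """Low-pass filter the amplitude envelope of `signal` at `cutoff_hz`
    (brick-wall, via FFT), then re-impose the smoothed envelope on the
    fine structure: modulations faster than the cutoff are removed."""
    envelope = np.abs(signal)                     # crude amplitude envelope
    spectrum = np.fft.rfft(envelope)
    freqs = np.fft.rfftfreq(envelope.size, d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0.0             # drop fast modulations
    smoothed = np.fft.irfft(spectrum, n=envelope.size)
    fine = signal / np.maximum(envelope, 1e-12)   # unit-magnitude carrier
    return fine * np.clip(smoothed, 0.0, None)

# A 10 Hz amplitude modulation on a square-wave carrier is flattened,
# because 10 Hz lies above the 4 Hz modulation cutoff:
fs = 1000
t = np.arange(fs) / fs
carrier = np.sign(np.sin(2 * np.pi * 100 * t + 0.1))
modulated = (1.0 + 0.8 * np.sin(2 * np.pi * 10 * t)) * carrier
flattened = smear_envelope(modulated, fs)
```

With the 4 Hz cutoff from the slide only slow, roughly syllable-rate modulations survive; per the slide, low-pass cutoffs above about 4 Hz leave intelligibility largely unaffected.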
Robustness to degraded speech
and missing information
partly reversed speech (Saberi & Perrott, Nature, 4/99)
- fixed-duration segments time-reversed or shifted in time
- perfect sentence intelligibility up to 50 ms (demo: reversed every 50 ms vs. original)
- low-frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
- syllable as information unit? (S. Greenberg)
gap and click restoration (Warren)
gating experiments
How good is synthetic speech?
(not the main theme of this seminar, but synthesis and dialogue still deserve attention)
good enough for certain applications
could be better in most others
evaluation: application-specific or multi-tier required
interesting experience: Synthesis workshop
at Jenolan Caves, Australia, Nov. 1998
Workshop evaluation procedure
participants as native listeners
DARPA-type procedures in data preparation
balanced listening design
no detailed results made public
3 text types
- newspaper sentences
- semantically unpredictable sentences
- telephone directory entries
42 systems in 8 languages tested
Screen for newspaper sentences
Some global results
it worked, but with many practical problems
(for demo see http://www.fon.hum.uva.nl)
this seems the way to proceed and to expand
global rating (poor to excellent)
- text analysis, prosody & signal processing
and/or more detailed scores
transcriptions subjectively judged
- major/minor/no problems per entry
web site access of several systems
(http://www.ldc.upenn.edu/ltts/)
Phonetic knowledge to improve
speech synthesis
(assuming concatenative synthesis)
control emotion, style, voice characteristics
perceptual implications of
- parameterization (LPC, PSOLA)
- discontinuities (spectral, temporal, prosody)
improve naturalness (prosody!)
active adaptation to other conditions
- hyper/hypo, noise, comm. channel, listener impairment
systematic evaluation
Desired pre-processor characteristics
in Automatic Speech Recognition
basic sensitivity for stationary and dynamic sounds
robustness to degraded speech
- rather insensitive to spectral and temporal smearing
robustness to noise and reverberation
filter characteristics
- are BP, PLP, MFCC, RASTA, TRAPS good enough?
- lateral inhibition (spectral sharpening); dynamics
what can be neglected?
- non-linearities, limited dynamic range, active elements,
co-modulation, secondary pitch, etc.
Caricature of present-day speech
recognizer
trained with a variety of speech input
- much global information, no interrelations
monaural, uni-modal input
pitch extractor generally not operational
performs well on average behavior
- does poorly on any type of outlier (OOV, non-native, fast
or whispered speech, other communication channel)
neglects lots of useful (phonetic) information
heavily relies on language model
Useful (phonetic) knowledge
neglected so far
pitch information
(systematic) durational variability
spectral reduction/coarticulation (other than multiphone)
intelligent selection from multiple features
quick adaptation to speaker, style & channel
communicative expectations
multi-modality
binaural hearing
Useful information: durational variability
[Table: durations of the vowel /iy/ (4,626 tokens; mean 95 ms, s.d. 39 ms), broken down per factor and level (speaking rate R, stress S, position in word Lw, position in utterance Lu), with mean, s.d., and token count per level]
Adapted from Wang (1998)
Useful information: durational variability
[Same table of /iy/ durations (4,626 tokens), annotated: overall average = 95 ms; normal rate = 95 ms; primary stress = 104 ms; word-final = 136 ms; utterance-final = 186 ms]
Adapted from Wang (1998)
Useful information:
V and C reduction, coarticulation
spectral variability is not random but, at least
partly, speaker-, style-, and context-specific
read - spontaneous; stressed - unstressed
not just for vowels, but also for consonants
-
duration
spectral balance
intervocalic sound energy difference
F2 slope difference
locus equation
[Figure: mean consonant duration (ms) and mean error rate for C identification (%), read vs. spontaneous and stressed vs. unstressed; spontaneous and unstressed consonants are shorter and are misidentified more often; differences significant at p = 0.001-0.006]
791 VCV pairs (read & spontaneous; stressed & unstressed segments; one male speaker)
C-identification by 22 Dutch subjects
Adapted from van Son & Pols (Eurospeech'97)
Other useful information:
pronunciation variation (ESCA workshop)
acoustic attributes of prominence (B. Streefkerk)
speech efficiency (post-doc project R. v. Son)
confidence measure
units in speech recognition
- rather than PLU, perhaps syllables (S. Greenberg)
quick adaptation
prosody-driven recognition / understanding
multiple features
Speech efficiency
speech is most efficient if it contains only the
information needed to understand it:
“Speech is the missing information” (Lindblom, JASA ‘96)
less information is needed for more predictable things:
- shorter duration and more spectral reduction for high-frequency syllables and words
- C-confusion correlates with acoustic factors (duration, CoG) and with information content (syllable/word frequency)
I(x) = -log2(Prob(x)) in bits
(see van Son, Koopmans-van Beinum, and Pols (ICSLP'98))
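The information measure on this slide can be computed directly from corpus frequencies. A tiny sketch (the counts and corpus size are invented for illustration):

```python
import math

def information_content(count, total):
    """I(x) = -log2(P(x)) in bits: rarer syllables or words carry more
    information, so they should resist durational/spectral reduction."""
    return -math.log2(count / total)

# Invented counts in a hypothetical 12,000-syllable corpus:
common = information_content(600, 12000)  # P = 0.05    -> about 4.3 bits
rare = information_content(3, 12000)      # P = 0.00025 -> about 12.0 bits
```

On this efficiency view, the frequent (low-I) syllable can afford more reduction than the rare one without hurting understanding.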
Correlation between consonant confusion and 4 measures
[Figure: correlation coefficients (about -0.05 to -0.40) between consonant confusion and duration, CoG, I(syllable), and I(word), for read and spontaneous speech, lexically stressed (+, 308 cases) and unstressed (-, 483 cases); significance: + p 0.01, * p 0.001]
Data: 20 min. read/spontaneous speech, 12k syllables, 8k words, 791 VCV pairs, one Dutch male speaker, C identification by 22 subjects
Adapted from van Son et al. (Proc. ICSLP'98)
Computational Phonetics
(first suggested by R. Moore, ICPhS’95 Stockholm)
duration modeling
optimal unit selection (like in concatenative synthesis)
pronunciation variation modeling (SpeCom Nov. ‘99)
vowel reduction models
computational prosody
information measures for confusion
speech efficiency models
modulation transfer function for speech
Discussion / Conclusions
speech technology needs further improvement
for certain tasks (flexibility, robustness)
phonetic knowledge can help if provided in an
implementable form; computational phonetics
is probably a good way to do that
phonetics and speech / language technology
should work together more closely, for their
mutual benefit
this Heraeus-seminar is a possible platform for
that discussion