Flexible, Robust, and Efficient Human Speech Processing


Flexible, Robust, and Efficient Human Speech Processing
versus Present-day Speech Technology

Louis C.W. Pols
Institute of Phonetic Sciences (IFA) / IFOTT
University of Amsterdam
Herengracht 338, Amsterdam, The Netherlands

Welcome
My pre-predecessor: Louise Kaiser
Secretary of First International Congress of Phonetic Sciences
Amsterdam, 3-8 July 1932
Amsterdam ICPhS’32
Jac. van Ginneken, president
L. Kaiser, secretary
A. Roozendaal, treasurer
Subjects:
- physiology of speech and voice (experimental phonetics in its strict meaning)
- development of speech and voice in the individual; their evolution in the history of mankind; the influence of heredity
- anthropology of speech and voice
- phonology
- linguistic psychology
- pathology of speech and voice
- comparative physiology of the sounds of animals
- musicology

136 participants from 16 countries
43 plenary papers
24 demonstrations
Amsterdam ICPhS’32
Some of the participants:
prof. Daniel Jones, London: The theory of phonemes, and its
importance in Practical Linguistics
Sir Richard Paget, London: The Evolution of Speech in Men
prof. R.H. Stetson, Oberlin: Breathing Movements in Speech
prof. Prince N. Trubetzkoy, Wien: Charakter und Methode der
systematischen phonologischen Darstellung einer gegebenen Sprache
(Character and method of the systematic phonological description of a given language)
dr. E. Zwirner, Berlin-Buch:
- Phonetische Untersuchungen an Aphasischen und Amusischen
(Phonetic investigations of aphasics and amusics)
- Quantität, Lautdauerschätzung und Lautkurvenmessung (Theorie und Material)
(Quantity, estimation of sound duration, and measurement of sound curves (theory and material))
-----------------------------------------------------------------
2nd, London '35; 3rd, Ghent '38; 4th, Helsinki '61; 5th, Münster '64;
Overview
Phonetics and speech technology:
- Do recognizers need 'intelligent ears'?
- What is knowledge?
- How good is human/machine speech recognition?
- How good is synthetic speech?
- Pre-processor characteristics
- Useful (phonetic) knowledge
- Computational phonetics
- Discussion/conclusions

Phonetics ↔ Speech Technology

Affinity from phonetics to speech technology:
- source/filter theory
- individuality
- context
- prosody
- human performance
- specific knowledge
- regularities
- multiple features

Affinity from speech technology to phonetics:
- more data
- new models
- probabilities
- speech vs. NLP
- EU FPV and DARPA programmes
- applications
- user orientation
- evaluation
Do recognizers need intelligent ears?
('intelligent ears' = front-end pre-processor)
- only if it improves performance
- humans are generally better speech processors than machines; perhaps system developers can learn from human behavior
- robustness is at stake (noise, reverberation, incompleteness, restoration, competing speakers, variable speaking rate, context, dialects, non-nativeness, style, emotion)

What is knowledge?
Phonetic knowledge:
- probabilistic knowledge from databases
- fixed set of features vs. adaptable set
- trading relations, selectivity
- knowledge of the world, expectation
- global vs. detailed
- see video

(with permission from Interbrew Nederland NV)
The video is a metaphor for:
- from global to detail (world → Europe → Holland → North Sea coast → Scheveningen → beach → young lady → drinking Dommelsch beer)
- sound → speech → speaker → English → utterance → 'recognize speech' or 'wreck a nice beach'
- zoom in on whatever information is available
- make an intelligent interpretation, given context
- beware of distractors!

Human auditory sensitivity
- stationary vs. dynamic signals
- simple vs. spectrally complex
- detection threshold
- just noticeable differences (jnd)
- see Table 3 in the paper

Detection thresholds and jnd

Simple, stationary signals:

phenomenon                 | threshold / jnd                | remarks
threshold of hearing       | 0 dB at 1000 Hz                | frequency dependent
threshold of duration      | constant energy at 10-300 ms   | Energy = Power x Duration; more when < 200 ms
frequency discrimination   | 1.5 Hz at 1000 Hz              |
intensity discrimination   | 0.5-1 dB                       | up to 80 dB SL
temporal discrimination    | ≈ 3 ms for wide-band noise;    | more at low frequencies for narrow-band
(gap detection)            | ≈ 5 ms at 50 ms                | noise; duration dependent
masking                    | psychophysical tuning curve    |
pitch of complex tones     | low pitch                      | many peculiarities

Periodic (multi-harmonic, single-formant-like) signals:

phenomenon          | threshold / jnd | remarks
formant frequency   | 3-5%            | < 3% with more experienced subjects; F2 in synthetic vowel
formant amplitude   | ≈ 3 dB          | one formant only
overall intensity   | ≈ 1.5 dB        | synthetic vowel
formant bandwidth   | 20-40%          | one-formant vowel
F0 (pitch)          | 0.3-0.5%        | synthetic vowel, mainly F1
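As a rough illustration of how a relative jnd of this kind can be used, e.g. to decide whether two formant values are perceptibly different, here is a minimal sketch; the 4% default is a mid-range assumption taken from the 3-5% formant-frequency row, and the function name is invented:

```python
def formants_distinguishable(f_a_hz, f_b_hz, jnd_fraction=0.04):
    """True if two formant frequencies differ by more than the relative jnd.

    The jnd for formant frequency is roughly 3-5% (see the table above);
    0.04 is a mid-range assumption, not a measured constant.
    """
    return abs(f_a_hz - f_b_hz) / min(f_a_hz, f_b_hz) > jnd_fraction
```

For example, 500 vs. 530 Hz (6% apart) exceeds the assumed jnd, while 1500 vs. 1530 Hz (2% apart) does not.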
DL for short speech-like transitions

[Figure: difference limens (DL) in endpoint frequency (Hz, 0-240) of short speech-like transitions, as a function of transition duration (20-50 ms); DLs are larger for complex signals than for single and single-isolated tone glides. Adapted from van Wieringen & Pols (Acta Acustica '98)]
How good is human / machine speech recognition?

corpus                   | description                         | vocabulary size | perplexity | word error % (machine) | word error % (human)
TI digits                | read digits                         | 10              | 10         | 0.72                   | 0.009
alphabet                 | read letters                        | 26              | 26         | 5                      | 1.6
Resource Management      | read sentences                      | 1,000           | 60-1,000   | 17                     | 2
NAB                      | read sentences                      | 5,000-unlimited | 45-160     | 6.6                    | 0.4
Switchboard CSR          | spontaneous telephone conversations | 2,000-unlimited | 80-150     | 43                     | 4
Switchboard wordspotting | idem                                | 20 keywords     | -          | 31.1                   | 7.4

Adapted from Lippmann (SpeCom, 1997)
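The word-error percentages in such comparisons are conventionally computed as (substitutions + insertions + deletions) divided by the number of reference words, via a dynamic-programming alignment of the two word sequences. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (subs + ins + dels) / reference length,
    computed with dynamic-programming edit distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that the rate can exceed 100% when the hypothesis contains many insertions, as in "recognize speech" vs. "wreck a nice beach".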
How good is human / machine speech recognition?
- machine speech recognition is surprisingly good for certain tasks
- machine speech recognition could be better for many others (robustness, outliers)
- what are the limits of human performance?
  - in noise
  - for degraded speech
  - with missing information (trading)
Human word intelligibility vs. noise

[Figure: word intelligibility as a function of noise level (adapted from Steeneken, 1992): at noise levels where humans start to have some trouble, recognizers already have serious trouble]
Robustness to degraded speech
Speech is a time-modulated signal in frequency bands, and is relatively insensitive to (spectral) distortions:
- modulating the spectral slope (-5 to +5 dB/oct, 0.25-2 Hz) has little effect; a prerequisite for digital hearing aids
- temporal smearing of envelope modulation: ca. 4 Hz maximum in the modulation spectrum → the syllable rate; low-pass filtering > 4 Hz and high-pass filtering < 8 Hz have little effect on intelligibility
- spectral envelope smearing: for bandwidths > 1/3 oct the masked SRT starts to degrade
(for references, see the paper in Proc. ICPhS'99)
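The temporal-smearing manipulation can be sketched as low-pass filtering the amplitude envelope and re-imposing it on the fine structure. This is a crude single-band illustration, not the original processing (which operated per frequency band); the envelope estimate `|signal|` is a deliberate simplification:

```python
import numpy as np

def smear_envelope(signal, sample_rate, cutoff_hz):
    """Temporal envelope smearing: low-pass filter the amplitude envelope
    (crudely |signal|, filtered in the FFT domain) and re-impose it on the
    fine structure. A single-band sketch of the manipulation behind the
    'low-pass > 4 Hz has little effect' result."""
    envelope = np.abs(signal).astype(float)
    spectrum = np.fft.rfft(envelope)
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0            # discard fast modulations
    smeared = np.fft.irfft(spectrum, n=len(envelope))
    fine = np.divide(signal, envelope, out=np.zeros_like(envelope),
                     where=envelope > 1e-12)     # fine structure
    return fine * np.clip(smeared, 0.0, None)
```

With a 4 Hz cutoff, a 10 Hz amplitude modulation is flattened away; with a 20 Hz cutoff it survives.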
Robustness to degraded speech and missing information
- partly reversed speech (Saberi & Perrott, Nature, 4/99)
  - fixed-duration segments time-reversed or shifted in time
  - perfect sentence intelligibility up to 50-ms segments (demo: original vs. every 50 ms reversed)
  - the low-frequency modulation envelope (3-8 Hz) matters more than the acoustic spectrum
  - the syllable as information unit? (S. Greenberg)
- gap and click restoration (Warren)
- gating experiments
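The segment-reversal manipulation itself is simple to state in code; a minimal sketch (parameter names and defaults are illustrative):

```python
import numpy as np

def reverse_segments(signal, segment_ms, sample_rate=16000):
    """Time-reverse every fixed-duration segment of the signal, the
    Saberi & Perrott-style manipulation: with 50-ms segments, sentence
    intelligibility remains essentially perfect."""
    seg = max(1, int(round(sample_rate * segment_ms / 1000.0)))
    out = signal.copy()
    for start in range(0, len(signal), seg):
        out[start:start + seg] = signal[start:start + seg][::-1]
    return out
```

Applying the function twice restores the original signal, since each segment is reversed in place.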

How good is synthetic speech?
- good enough for certain applications
- could be better in most others
- evaluation: application-specific or multi-tier required
- interesting experience: Synthesis workshop at Jenolan Caves, Australia, Nov. 1998

Workshop evaluation procedure
- participants as native listeners
- DARPA-type procedures in data preparation
- balanced listening design
- no detailed results made public
- 3 text types:
  - newspaper sentences
  - semantically unpredictable sentences
  - telephone directory entries
- 42 systems in 8 languages tested
[Screen shot: listening-test screen for newspaper sentences]
Some global results
- it worked, but with many practical problems (for a demo see http://www.fon.hum.uva.nl)
- this seems the way to proceed and to expand
- global rating (poor to excellent) of text analysis, prosody & signal processing, and/or more detailed scores
- transcriptions subjectively judged: major/minor/no problems per entry
- web-site access to several systems (http://www.ldc.upenn.edu/ltts/)
Phonetic knowledge to improve speech synthesis
(assuming concatenative synthesis)
- control of emotion, style, voice characteristics
- perceptual implications of:
  - parameterization (LPC, PSOLA)
  - discontinuities (spectral, temporal, prosodic)
- improved naturalness (prosody!)
- active adaptation to other conditions (hyper-/hypo-articulation, noise, communication channel, listener impairment)
- systematic evaluation
Desired pre-processor characteristics in Automatic Speech Recognition
- basic sensitivity for stationary and dynamic sounds
- robustness to degraded speech (rather insensitive to spectral and temporal smearing)
- robustness to noise and reverberation
- filter characteristics:
  - are BP, PLP, MFCC, RASTA, TRAPS good enough?
  - lateral inhibition (spectral sharpening); dynamics
- what can be neglected? non-linearities, limited dynamic range, active elements, co-modulation, secondary pitch, etc.
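For concreteness, the MFCC front-end mentioned above follows a textbook recipe: framing with a window, power spectrum, triangular mel filterbank, log compression, DCT. The sketch below is generic (not any particular toolkit's implementation) and all parameter defaults are illustrative:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sample_rate=16000, n_fft=512, frame_step=160,
         n_filters=26, n_ceps=13):
    """Minimal MFCC sketch: framing, power spectrum, triangular mel
    filterbank, log, DCT-II. Textbook recipe for illustration only."""
    # frame the signal with a Hamming window
    frames = []
    for start in range(0, len(signal) - n_fft + 1, frame_step):
        frames.append(signal[start:start + n_fft] * np.hamming(n_fft))
    frames = np.array(frames)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II to decorrelate the log filterbank energies
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return log_energy @ dct.T
```

The output is one 13-coefficient vector per 10-ms frame (at 16 kHz with the defaults above), which is the kind of representation the slide questions the sufficiency of.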
Caricature of a present-day speech recognizer
- trained with a variety of speech input: much global information, no interrelations
- monaural, uni-modal input
- pitch extractor generally not operational
- performs well on average behavior; does poorly on any type of outlier (OOV, non-native, fast or whispered speech, other communication channel)
- neglects lots of useful (phonetic) information
- relies heavily on the language model
Useful (phonetic) knowledge neglected so far
- pitch information
- (systematic) durational variability
- spectral reduction/coarticulation (other than multiphone)
- intelligent selection from multiple features
- quick adaptation to speaker, style & channel
- communicative expectations
- multi-modality
- binaural hearing
Useful information: durational variability

[Table: durations of the vowel /iy/ (4626 tokens), broken down by factor levels for speaking rate (R), stress (S), position in the word (Lw), and position in the utterance (Lu), with mean, s.d., and count per cell. Adapted from Wang (1998)]
Useful information: durational variability

[Same table, annotated: overall average = 95 ms; normal rate = 95 ms; primary stress = 104 ms; word-final = 136 ms; utterance-final = 186 ms. Adapted from Wang (1998)]
Useful information: V and C reduction, coarticulation
- spectral variability is not random but, at least partly, speaker-, style-, and context-specific
- read vs. spontaneous; stressed vs. unstressed
- holds not just for vowels, but also for consonants:
  - duration
  - spectral balance
  - intervocalic sound-energy difference
  - F2 slope difference
  - locus equation
[Figure: mean consonant duration (ms) and mean error rate (%) of consonant identification, for read vs. spontaneous speech and stressed vs. unstressed segments; all differences significant (p <= 0.006). Material: 791 VCV pairs (read & spontaneous; stressed & unstressed segments; one male speaker); C identification by 22 Dutch subjects. Adapted from van Son & Pols (Eurospeech '97)]
Other useful information:
- pronunciation variation (ESCA workshop)
- acoustic attributes of prominence (B. Streefkerk)
- speech efficiency (post-doc project R. v. Son)
- confidence measures
- units in speech recognition: rather than PLUs, perhaps syllables (S. Greenberg)
- quick adaptation
- prosody-driven recognition / understanding
- multiple features
Speech efficiency
- speech is most efficient if it contains only the information needed to understand it: "Speech is the missing information" (Lindblom, JASA '96)
- less information is needed for more predictable things:
  - shorter duration and more spectral reduction for high-frequency syllables and words
  - C-confusion correlates with acoustic factors (duration, CoG) and with information content (syllable/word frequency)
- I(x) = -log2(Prob(x)) in bits
(see van Son, Koopmans-van Beinum, and Pols (ICSLP'98))
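The information measure I(x) = -log2(Prob(x)) can be estimated directly from corpus frequency counts; a minimal sketch (the syllable counts below are invented for illustration, not corpus data):

```python
import math
from collections import Counter

def information_content(counts):
    """I(x) = -log2(Prob(x)) in bits, estimating Prob(x) as the
    relative frequency of x in the given corpus counts."""
    total = sum(counts.values())
    return {x: -math.log2(n / total) for x, n in counts.items()}

# Hypothetical toy counts; real estimates come from a large corpus.
syllable_counts = Counter({"de": 40, "het": 20, "na": 4})
bits = information_content(syllable_counts)
```

High-frequency syllables carry fewer bits, which is the sense in which they can afford shorter duration and more spectral reduction.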
Correlation between consonant confusion and 4 measures

[Figure: correlation coefficients (roughly -0.05 to -0.40, significance marked) between consonant confusion and duration, CoG, I(syllable), and I(word), for read and spontaneous speech. Material: 20 min. read/spontaneous speech, 12k syllables, 8k words; 791 VCV pairs (308 lexically stressed, 483 unstressed); one Dutch male speaker; C identification by 22 subjects. Adapted from van Son et al. (Proc. ICSLP'98)]
Computational Phonetics
(R. Moore, ICPhS'95 Stockholm)
- duration modeling
- optimal unit selection (as in concatenative synthesis)
- pronunciation-variation modeling
- vowel-reduction models
- computational prosody
- information measures for confusion
- speech-efficiency models
- modulation transfer function for speech
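The "duration modeling" item can be made concrete with a toy multiplicative model in the spirit of the durational-variability data shown earlier: a predicted duration is an intrinsic duration times one multiplier per active contextual factor. All factor names and multiplier values below are hypothetical, not fitted to data:

```python
def predict_duration(base_ms, active_factors, multipliers):
    """Toy multiplicative duration model: predicted duration =
    intrinsic duration x one multiplier per active contextual factor.
    Factor names and values are hypothetical illustrations."""
    duration = base_ms
    for factor in active_factors:
        duration *= multipliers.get(factor, 1.0)
    return duration

# hypothetical multipliers: final lengthening, unstressed shortening
multipliers = {"utterance_final": 1.9, "unstressed": 0.85, "fast_rate": 0.8}
```

Fitting such multipliers from a labeled corpus, rather than guessing them, is exactly the kind of exercise "computational phonetics" refers to.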

Discussion / Conclusions
- speech technology needs further improvement for certain tasks (flexibility, robustness)
- phonetic knowledge can help if provided in an implementable form; computational phonetics is probably a good way to do that
- phonetics and speech/language technology should work together more closely, for their mutual benefit
- this conference is the ideal platform for that