
Determining Which Acoustic Features
Contribute Most to Speech Intelligibility
John-Paul Hosom
Alexander Kain
Akiko Kusumoto
[email protected]
Center for Spoken Language Understanding (CSLU)
OGI School of Science & Engineering
Oregon Health & Science University (OHSU)
0 / 50
image from http://www.ph.tn.tudelft.nl/~vanwijk/Athens2004/niceones/images.html
1 / 50
Outline
1. Introduction
2. Background: Speaking Styles
3. Background: Acoustic Features
4. Background: Prior Work on Clear Speech
5. Objectives of Current Study
6. Methods
7. Results
8. Conclusion
2 / 50
1. Introduction
Motivation #1:
• Difficult to understand speech in noise,
especially with a hearing impairment.
• When people speak clearly, speech becomes
more intelligible.
• Automatic enhancement of speech could be used
in next-generation hearing aids.
• Attempts to modify speech by computer to improve
intelligibility not yet very successful.
• Need to understand which parts of signal should be
modified and how to modify them.
3 / 50
1. Introduction
Motivation #2:
• Even best current model for computer speech
recognition does not provide sufficiently accurate
results.
• Current research applies new mathematical
techniques to this model, but techniques are
generally not motivated by studies of human
speech perception.
• A better understanding of how acoustic features
contribute to speech intelligibility could guide
research on improving computer speech recognition.
4 / 50
1. Introduction
Research Objective:
To identify the relative contribution of acoustic
features to intelligibility by examining
conversational and clear speech.
Long-Term Goals:
• Accurately predict speech intelligibility from
acoustic features,
• Integrate most effective features into computer
speech-recognition models,
• Develop novel signal-processing algorithms for
hearing aids.
5 / 50
2. Speaking Styles: Production
“Conversational speech” and “clear speech” easily
produced with simple instructions to speakers.
• Conversational (CNV) speech:
“read text conversationally as in daily
communication.”
• Clear (CLR) speech:
“read text clearly as if talking to a hearing-impaired listener.”
6 / 50
2. Speaking Styles: Perception
• To compare CNV and CLR speech intelligibility,
same sentences read in both styles, then listened to
by group of subjects.
• Intelligibility measured as the percentage of
sentences that are correctly recognized by listener.
• CLR speech increases intelligibility for a variety of:
 Listeners,
(young listeners, elderly listeners)
 Speech materials,
(meaningful sentences, nonsense syllables)
 Noise conditions.
(white noise, multi-talker babble noise)
7 / 50
Outline
1. Introduction
2. Background: Speaking Styles
3. Background: Acoustic Features
4. Background: Prior Work on Clear Speech
5. Objectives of Current Study
6. Methods
7. Results
8. Conclusion
8 / 50
3. Acoustic Features: Representations
• Acoustic Features
 Duration (length of each distinct sound)
 Energy
 Pitch
 Spectrum (spectrogram)
 Formants
 Residual (power spectrum without formants)
9 / 50
3. Acoustic Features: Waveform
• Time-Domain Waveform
“Sound originates from the motion or vibration
of an object. This motion is impressed upon
the [air] as a pattern of changes in pressure.”
[Moore, p. 2]

[Figure: waveform of the word “two”; amplitude vs. time (msec)]
10 / 50
3. Acoustic Features: Energy
• Energy
Energy is proportional to the square of the
pressure variation. Log scale is used to reflect
human perception.
 t N 1 2

10 log10   xn N 
 n t

xn = waveform sample x at time point n
N = number of time samples
waveform
energy
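The short-time energy formula above can be sketched in Python (a minimal illustration; the frame of sample values is made up):

```python
import math

def frame_energy_db(samples, start, n):
    """Short-time energy in dB: 10 * log10 of the mean squared sample value,
    matching E = 10 log10((1/N) * sum of x_n^2) on this slide."""
    acc = sum(samples[i] ** 2 for i in range(start, start + n))
    return 10 * math.log10(acc / n)

# a tiny 4-sample frame (made-up values)
frame = [0.1, -0.2, 0.15, -0.05]
print(round(frame_energy_db(frame, 0, 4), 2))  # about -17.27 dB
```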
11 / 50
3. Acoustic Features: Pitch
• Pitch
Pitch (F0) is rate of vibration of vocal folds.
F0 (fundamental frequency) = 1 / periodicity

[Figure: the speech production apparatus, showing the vocal tract, nasal tract, tongue, and vocal folds (larynx), with airflow through the vocal folds plotted as amplitude over time (from Olive, p. 23)]
12 / 50
3. Acoustic Features: Pitch
• Pitch
Pitch (F0) is rate of vibration of vocal folds.
[Figure: two waveforms (amplitude vs. time in msec), one with pitch = 117 Hz and one with pitch = 83 Hz]
13 / 50
3. Acoustic Features: Spectrum
• Phoneme:
Abstract representation of basic unit of speech
(“cat”: /k æ t/).
• Spectrum:
What makes one phoneme, /e/, sound different from
another phoneme, /i/?
Different shapes of the vocal tract: /e/ is produced
with the tongue low and in the back of the mouth;
/i/ with the tongue high and toward the front.
14 / 50
3. Acoustic Features: Spectrum
• Source of speech is pulses of air from vocal folds.
• This source is filtered by vocal tract “tube”.
• Speech waveform is result of filtered source signal.
• Different shapes of tube create different filters,
different resonant frequencies, different phonemes.
[Figure: vocal-tract shapes and resulting spectra for /e/ and /i/ (from Ladefoged, pp. 58–59)]
15 / 50
3. Acoustic Features: Spectrum
Resonant frequencies identified by frequency
analysis of speech signal. Fourier Transform
expresses a signal in terms of signal strength at
different frequencies:
X(f) = ∫_{t=−∞}^{+∞} x(t) e^{−j2πft} dt = ∫_{t=−∞}^{+∞} x(t) [cos(2πft) − j sin(2πft)] dt

e^{jθ} = cos(θ) + j sin(θ)

S(f) = 10 log10( |X(f)|² )
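The transform and the power spectrum S(f) can be illustrated with a naive discrete version in Python (a sketch, not an efficient FFT; the test signal is made up):

```python
import cmath
import math

def power_spectrum_db(x):
    """Naive DFT, then S(f) = 10*log10(|X(f)|^2) for each frequency bin."""
    n = len(x)
    spec = []
    for f in range(n):
        xf = sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        spec.append(10 * math.log10(abs(xf) ** 2 + 1e-12))  # floor avoids log(0)
    return spec

# a pure cosine completing one cycle over 8 samples puts its power in bin 1
sig = [math.cos(2 * math.pi * t / 8) for t in range(8)]
spec = power_spectrum_db(sig)
```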
16 / 50
3. Acoustic Features: Spectrum
The time-domain waveform and power spectrum
can be plotted like this (/e/):
[Figure: time-domain waveform of /e/ and its power spectrum; spectral power from 10 to 90 dB, frequency from 0 Hz to 4000 Hz]
17 / 50
3. Acoustic Features: Spectrum
The time-domain waveform and power spectrum
can be plotted like this (/e/):
[Figure: the same waveform and power spectrum of /e/, with the harmonic spacing F0 = 95 Hz marked; spectral power from 10 to 90 dB, frequency from 0 Hz to 4000 Hz]
18 / 50
3. Acoustic Features: Spectrum
The resonant frequencies, or formants, are clearly
different for vowels /e/ and /i/.
Spectral envelope is important for phoneme identity
(envelope = general spectral shape, no harmonics).
[Figure: power spectra of /e/ and /i/ with overlaid spectral envelopes, 0–4 kHz]
19 / 50
3. Acoustic Features: Formants
Formants (dependent on vocal-tract shape) are
independent of pitch (rate of vocal-fold vibration).
[Figure: spectra of /e/ at F0 = 80 Hz and at F0 = 160 Hz, 0–4 kHz; the formant peaks stay in the same locations]
20 / 50
3. Acoustic Features: Formants
• Formants are specified by frequency and numbered in
order of increasing frequency. For /e/, F1=710 Hz,
F2=1100 Hz.
• F1, F2, and sometimes F3 often sufficient for
identifying vowels.
• For vowels, sound source is air pushed through
vibrating vocal folds. Source waveform is filtered
by vocal-tract shape. Formants correspond to
these filters.
• Digital model of a formant can be implemented
using an infinite-impulse response (IIR) filter.
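The digital formant idea can be sketched as a two-pole IIR resonator (a Klatt-style sketch; the 60 Hz bandwidth and 8 kHz sample rate are illustrative assumptions, with 710 Hz taken from the /e/ F1 mentioned above):

```python
import math

def resonator(x, freq_hz, bw_hz, fs_hz):
    """Two-pole IIR resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2],
    with a resonance at freq_hz of bandwidth bw_hz."""
    c = -math.exp(-2 * math.pi * bw_hz / fs_hz)
    b = 2 * math.exp(-math.pi * bw_hz / fs_hz) * math.cos(2 * math.pi * freq_hz / fs_hz)
    a = 1 - b - c
    y = [0.0, 0.0]
    for xn in x:
        y.append(a * xn + b * y[-1] + c * y[-2])
    return y[2:]

# the impulse response rings (and decays) at roughly the resonant frequency
impulse = [1.0] + [0.0] * 99
out = resonator(impulse, 710.0, 60.0, 8000.0)
```

At 710 Hz and an 8 kHz sample rate the ringing crosses zero roughly every 5–6 samples, which is an easy property to check.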
21 / 50
3. Acoustic Features: Formants
Formant frequencies (averages for English):
vowel   F1 (Hz)   F2 (Hz)   F3 (Hz)
iy        280      2250      2890
ih        400      1920      2560
eh        550      1770      2490
ae        690      1660      2490
ah        600      1200      2540
aa        710      1100      2400
uh        450      1030      2380
uw        310       870      2250

(from Ladefoged, p. 193)
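As a toy illustration of the claim that F1 and F2 often suffice to identify vowels, a measured (F1, F2) pair can be classified by nearest neighbor against the chart averages (a sketch; the table values are as read from the Ladefoged chart above, and real vowels vary by speaker and context):

```python
import math

# average (F1, F2) pairs in Hz, from the chart above
VOWELS = {"iy": (280, 2250), "ih": (400, 1920), "eh": (550, 1770),
          "ae": (690, 1660), "ah": (600, 1200), "aa": (710, 1100),
          "uh": (450, 1030), "uw": (310, 870)}

def nearest_vowel(f1, f2):
    """Return the vowel whose average (F1, F2) is closest in Euclidean distance."""
    return min(VOWELS, key=lambda v: math.hypot(f1 - VOWELS[v][0], f2 - VOWELS[v][1]))

print(nearest_vowel(700, 1120))  # closest to the "aa" averages
```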
22 / 50
3. Acoustic Features: Coarticulation
[Figure: spectrogram (frequency vs. time) of “you are”: /j u e r/]
23 / 50
3. Acoustic Features: Coarticulation
[Figure: the same spectrogram of “you are”: /j u e r/, with the phoneme regions marked]
24 / 50
3. Acoustic Features: Coarticulation
[Figure: spectrogram (frequency vs. time) showing formants moving continuously across phoneme boundaries]
25 / 50
3. Acoustic Features: Vowel Neutralization
When speech is uttered quickly, or is not clearly
enunciated, formants shift toward neutral vowel:
[Figure: formant shift toward the neutral vowel (from van Bergem 1993, p. 8)]
26 / 50
Outline
1. Introduction
2. Background: Speaking Styles
3. Background: Acoustic Features
4. Background: Prior Work on Clear Speech
5. Objectives of Current Study
6. Methods
7. Results
8. Conclusion
27 / 50
4. Prior Work: Acoustics of Clear Speech
1. Pitch (F0): more variation, higher average.
2. Energy: Consonant-vowel (CV) energy ratio
increases for stops (/p/, /t/, /k/, /b/, /d/, /g/).
3. Pauses: Longer in duration and more frequent.
4. Phoneme and sentence duration: longer.
• However, correlation between a characteristic of
an acoustic feature and intelligibility does not mean
the characteristic causes increased intelligibility.
• For example, fast speech can be just as intelligible
as slow speech; longer sentence duration not a
cause of increased intelligibility.
28 / 50
4. Prior Work: Speech Modification
• Lengthen phoneme durations [e.g. Uchanski 1996]
• Insert pauses at phrase boundaries or word
boundaries [e.g. Gordon-Salant 1997; Liu 2006].
• Amplify consonant energy in consonant-vowel
(CV) contexts [Gordon-Salant, 1986; Hazan, 1998].
Positive results at sentence level reported in
only one case, using extreme modification.
(Hazan 1998, 4.2% improvement)
29 / 50
5. Objectives: Background
Summary of Current State:
• CLR speech intelligibility higher than CNV speech.
• Speech has acoustic features that interact in
complex ways.
• Correlation between acoustic features and
intelligibility has been shown, but causation not
demonstrated.
• Signal modification of CNV speech shows little or
no intelligibility improvement.
• Reason for inability to dramatically improve CNV
speech intelligibility not known.
30 / 50
5. Objectives of Current Study
Objectives of Current Study:
1. To validate that CLR speech is more intelligible
than CNV speech for our speech material,
2. To process CNV speech so that intelligibility is
significantly closer to CLR speech,
We propose a hybridization algorithm that
creates “hybrid” (HYB) speech using features
from both CNV and CLR speech
3. To determine acoustic features of CLR speech
that cause increased intelligibility.
31 / 50
Outline
1. Introduction
2. Background: Speaking Styles
3. Background: Acoustic Features
4. Background: Prior Work on Clear Speech
5. Objectives of Current Study
6. Methods
7. Results
8. Conclusion
32 / 50
6. Methods: Hybridization Algorithm
Hybridization:
• Input: parallel recordings of a sentence spoken in
both CNV and CLR styles.
• Signal processing replaces certain acoustic
features from CNV speech with those of CLR
speech.
• Output: synthetic speech signal.
• Uses Pitch-Synchronous Overlap-Add (PSOLA)
for pitch and/or duration modification [Moulines and
Charpentier, 1990].
33 / 50
6. Methods: Hybridization with PSOLA
Original CNV speech
• Duration Modification: duplicate or eliminate
glottal pulses; scale 2.0 → lengthen duration.
• Pitch Modification: alter the distance between
glottal pulses; scale 2.0 → raise pitch.

[Figure: original signal with glottal pulses a, b, c, d over 33 ms; duration-modified signal with duplicated pulses spanning 66 ms; pitch-modified signal with reduced pulse spacing spanning 25 ms]
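The duration-modification idea (duplicate or eliminate whole glottal periods) can be sketched as follows; this shows only the period-selection step, not full PSOLA with windowed overlap-add:

```python
def stretch_periods(periods, scale):
    """Duplicate or eliminate whole pitch periods so the total duration
    scales by `scale` (scale 2.0 repeats each period twice)."""
    n_out = round(len(periods) * scale)
    return [periods[min(int(i / scale), len(periods) - 1)] for i in range(n_out)]

# two made-up one-period waveforms, "a" and "b"
a, b = [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]
doubled = stretch_periods([a, b], 2.0)  # -> a, a, b, b
```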
34 / 50
6. Methods: Hybridization Algorithm
Stage 1: Database Preparation
• For both CNV and CLR speech: phoneme labelling,
voicing detection, pitch marking, and placement of
auxiliary marks.
• Phoneme alignment between CLR and CNV speech.
• Parallelization between CLR and CNV speech
(features P, N).

Stages 2 and 3: Feature Analysis and Selection
• Extract F0 (F), long-term energy (E), phoneme
duration (D), and spectrum (S) from each style; a
hybrid configuration selects each feature from
either CNV or CLR.

Stage 4: Waveform Synthesis
• Pitch-Synchronous Overlap-Add (PSOLA) produces the
output: HYB speech (stimuli such as CLR-D).
35 / 50
6. Methods: Hybridization Algorithm
(Stage 1: Database Preparation, diagram repeated)
For each sentence (CLR and CNV recordings):
• Manually label phoneme identity and locations.
• Match phonemes in CLR and CNV recordings.
• Identify location of each glottal pulse.
36 / 50
6. Methods: Hybridization Algorithm
• Extract acoustic features:
Spectrum (S), F0 (F), Energy (E), Duration (D)
• For each feature, select from CNV or CLR for
generating speech waveform.
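The per-feature selection can be sketched as a lookup. The feature set S, F, E, D and the CLR-D naming convention are from the slides; the container values here are placeholders standing in for real extracted features:

```python
# placeholder feature values standing in for real extracted features
CNV = {"S": "cnv_spectrum", "F": "cnv_f0", "E": "cnv_energy", "D": "cnv_duration"}
CLR = {"S": "clr_spectrum", "F": "clr_f0", "E": "clr_energy", "D": "clr_duration"}

def hybrid_config(take_from_clr):
    """Select each feature from CLR or CNV; name the stimulus CLR-<features>."""
    feats = {k: (CLR[k] if k in take_from_clr else CNV[k]) for k in CNV}
    name = "CLR-" + "".join(sorted(take_from_clr)) if take_from_clr else "CNV"
    return name, feats

name, feats = hybrid_config({"D"})  # duration from CLR, everything else from CNV
```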
(Stages 2 and 3: Feature Analysis and Selection, diagram repeated)
37 / 50
6. Methods: Hybridization Algorithm
• Use PSOLA to generate waveform using selected
features with spectrum at each glottal pulse.
• Output is HYB speech, named according to
features taken from CLR speech, e.g. CLR-D.
(Stage 4: Waveform Synthesis, diagram repeated)
38 / 50
6. Methods: Speech Corpus
• Public database of sentences, syntactically and
semantically valid.
 Ex: His shirt was clean but one button was gone.
 5 keywords (underlined) for measuring
intelligibility.
 Long enough to test effects of prosodic
features (combination of duration, energy,
pitch).
 Short enough to minimize memory effects.
• One male speaker read the text material in both
CNV and CLR speaking styles.
39 / 50
6. Methods: Perceptual Test
For each listener:
1. Audiometric test (to ensure normal hearing),
2. Find optimal noise level for this listener,
3. Measure intelligibility of CLR, CNV, and HYB
speech.
For finding optimal noise levels and measuring
intelligibility, the listener’s task is to repeat the
sentence aloud.
40 / 50
6. Methods: Finding Optimal Noise Level
• Total energy of each sentence normalized (65 dBA).
• To avoid “ceiling effect,” sentences played with
background noise (12-speaker babble noise).
• To normalize performance differences between
listeners, noise set to a specific level for each
listener.
• Noise level set so that each listener correctly
identifies CNV sentences 50% of the time.
41 / 50
6. Methods: Measuring Intelligibility
• 48 sentences per subject
• Correct response for sentence when at least 4 of 5
keywords correctly repeated by listener.
Intelligibility (%) =
(# of sentences correctly identified / # of sentences presented) × 100
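The scoring rule above (a sentence is correct when at least 4 of its 5 keywords are repeated) can be sketched as:

```python
def intelligibility(keyword_hits):
    """Percentage of sentences correct, where a sentence counts as correct
    when the listener repeats at least 4 of its 5 keywords."""
    correct = sum(1 for hits in keyword_hits if hits >= 4)
    return 100.0 * correct / len(keyword_hits)

# made-up keyword-hit counts for four presented sentences
print(intelligibility([5, 4, 3, 2]))  # -> 50.0
```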
42 / 50
6. Methods: Listeners
• Subjects:
 12 listeners with normal hearing
 age 19 – 40 (mean 29.17)
 Average noise level -0.24 dB SNR
• Significance Testing
 Paired t-test with p < 0.05
43 / 50
6. Methods: Features
• Energy and pitch always taken from CNV speech.
• Test importance of other two acoustic features:
 spectrum (for phoneme identity)
 duration (for syntactic parsing)
• Test co-dependence of spectrum and duration.
[Diagram: Speech Waveform → Spectrum (Residual, Formants) and Prosody (Duration, Energy, Pitch)]
44 / 50
6. Methods: Stimuli
• Conditions:
1. CNV Original
2. HYB Speech, CLR-Dur
3. HYB Speech, CLR-Spec
4. HYB Speech, CLR-DurSpec
5. CLR Original
[Diagram: speech-waveform feature tree, repeated]
45 / 50
Outline
1. Introduction
2. Background: Speaking Styles
3. Background: Acoustic Features
4. Background: Prior Work on Clear Speech
5. Objectives of Current Study
6. Methods
7. Results
8. Conclusion
46 / 50
7. Results
Mean Intelligibility (%)
• 10% difference between CNV and CLR-Dur
• 11% difference between CNV and CLR-Spec
• 18% difference between CNV and CLR-DurSpec
• 25% difference between CNV and CLR
Condition       Mean Intelligibility (%)
CNV              64
CLR-Dur          74
CLR-Spec         75 *
CLR-DurSpec      82 *
CLR              89 *

* = significant difference, compared with CNV
47 / 50
8. Conclusion
Results of Objectives:
1. To validate that CLR speech is more intelligible than
CNV speech,
Confirmed: 25% absolute difference (significant).
2. To process CNV speech so that intelligibility is
significantly closer to CLR speech,
Confirmed: 18% absolute improvement (significant).
3. To determine acoustic features of CLR speech that
cause increased intelligibility.
Spectrum and combination of Spectrum and
Duration are effective. Duration alone almost
significant.
48 / 50
8. Conclusion
Conclusions:
1. The single acoustic feature that yields greatest
intelligibility improvement is the spectrum, but it
contributes less than half of possible improvement.
2. Duration alone yields improvements almost as
good as spectrum alone. (Prior work indicates,
however, that total sentence duration and pause
patterns are not important for intelligibility.)
3. The combination of duration and spectrum does
not quite yield the intelligibility of CLR speech;
further work to determine if difference due to (a)
pitch, (b) energy, (c) signal-processing artifacts.
49 / 50
8. Conclusion
Long-Term Goals:
• Identify more specific features that contribute to
speech intelligibility and their degree of contribution,
[Diagram: speech-waveform feature tree, repeated]
• Evaluate different speakers and listener groups,
• Accurately predict speech intelligibility from
acoustics,
• Integrate most effective features into signal-processing and speech-recognition algorithms.
50 / 50
Thank you!
CSLU will have job opening(s) in Summer/Fall 2007
for phonetic transcription and syntactic labeling. If
interested, please e-mail [email protected]
or [email protected]
(also, special thanks to my dog, Nayan…)
51 / 50