Lombard Speech


Speaking Style Conversion
Dr. Elizabeth Godoy
Speech Processing Guest Lecture
December 11, 2012
Apply voice conversion (VC) principles to a different problem…
E.Godoy, Speaking Style Conversion
December 11, 2012
Speech Intelligibility Context
Speech is often heard in adverse conditions
  Noisy environments
  Listener has difficulty hearing/understanding
Example of speech with environmental barriers (noise / no noise):
  The speech is not very intelligible!
How to transform speech to make it more intelligible…?
  To make speech synthesis systems more effective
Intelligible Speaking Styles
I. Lombard speech
  Speaker is immersed in noise
  Human reflex to increase the speech loudness
  Examples: normal vs. Lombard
II. Clear speech
  Listener faces barrier (noise, hearing, language, …)
  Speaker adapts strategy to increase speech clarity
  Examples: casual vs. clear
VC to improve speech intelligibility?
Voice Conversion
  Modify speech to change the speaker identity
  Learn transformation from source-to-target speaker
Speaking Style Conversion
  Modify speech to improve intelligibility
  Determine transformation from normal-to-intelligible style
Spectral Envelope: still very important!
Overview: Analyses-to-Modifications
I. Acoustic analyses to identify (mainly spectral) characteristics of Lombard & Clear styles
  i. Average Spectra
  ii. Vowel Spaces
II. Results of the analyses inspire spectral modifications to improve intelligibility
  i. Spectral energy band boosting (corrective filters)
  ii. Formant shifting (frequency warping)
Corpora
Lombard-normal: Grid
  8 speakers (4 male, 4 female)
  50 sentences each
  LombardNinf96: most extreme (Lu & Cooke)
Clear-casual: LUCID read sentences
  8 speakers (4 male, 4 female)
  50 sentences each
  Read speech: most exaggerated (Baker & Hazan)
Average Relative Spectra
Recall Amplitude Scaling in DFWA:
  $\log(A_q(f)) = \log(S_q^{y}(f)) - \log(S_q^{x}(W_q^{-1}(f)))$
Average Relative Spectra is similar:
  Difference between normal (X) and intelligible (Y) style
  Average across all frames
  $\log(S_R(f)) = \log(S_q^{Y}(f)) - \log(S_q^{X}(f))$
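The average relative spectrum can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the function names, the FFT size and the synthetic frames are assumptions, and averaging in the log domain is only one possible reading of "average across all frames".

```python
import numpy as np

def average_log_spectrum(frames, n_fft=512, eps=1e-10):
    """Mean natural-log magnitude spectrum over all frames (rows)."""
    windowed = frames * np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(windowed, n=n_fft, axis=1))
    return np.mean(np.log(mag + eps), axis=0)

def average_relative_spectrum_db(frames_x, frames_y, n_fft=512):
    """log S_R(f) = log S^Y(f) - log S^X(f), converted to dB."""
    diff = average_log_spectrum(frames_y, n_fft) - average_log_spectrum(frames_x, n_fft)
    return 20.0 / np.log(10.0) * diff

# Synthetic stand-in data (assumption): style Y is style X scaled by 2,
# so the relative spectrum should be flat at about +6 dB.
rng = np.random.default_rng(0)
frames_x = rng.standard_normal((200, 400))   # 200 frames of 400 samples
frames_y = 2.0 * frames_x
rel_db = average_relative_spectrum_db(frames_x, frames_y)
```

With real data, `frames_x` and `frames_y` would hold frames from the normal and the intelligible style of the same speaker, and `rel_db` would show where the intelligible style carries more or less energy.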
Average Relative Spectra (by Speaker)
[Figure: GRID Average Relative Spectra for each speaker (Lombard-normal); axes: Hz, speaker index, dB]
[Figure: LUCID Average Relative Spectra for each speaker (Clear-casual); axes: Hz, speaker index, dB]
Average Relative Spectra (Overall)
Average Relative Spectra: All frames, All speakers
[Figure: Lombard-normal and Clear-casual average relative spectra; dB vs. Hz (0-8000 Hz)]
Lombard speech: Spectral energy boosting “where formants are” (~500-4500Hz)
Clear speech: Varies depending on speaker strategy, extent of differences mild overall
Vowel Spaces (average for all speakers)
[Figure: Lombard-normal: Vowel Space, ALL Speakers (normal vs. lombard); axes: F1 (Hz), F2 (Hz)]
[Figure: Clear-casual: Vowel Space, ALL Speakers (casual vs. clear); axes: F1 (Hz), F2 (Hz)]
Lombard speech: Vowel Space Translation
Clear speech: Vowel Space Expansion
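The difference between translation and expansion can be made concrete with a small Python sketch. The corner-vowel formant values below are hypothetical, not taken from the Grid or LUCID corpora, and the function names are illustrative. Translation shows up as a centroid shift with an unchanged area; expansion shows up as a larger area with an unchanged centroid.

```python
import numpy as np

def polygon_area(points):
    """Shoelace area of a polygon given as ordered (F1, F2) vertices."""
    x, y = points[:, 0], points[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def compare_vowel_spaces(base, styled):
    """Return (centroid shift in Hz, area ratio styled/base)."""
    shift = styled.mean(axis=0) - base.mean(axis=0)
    ratio = polygon_area(styled) / polygon_area(base)
    return shift, ratio

# Hypothetical mean (F1, F2) in Hz for three corner vowels, ordered around the polygon.
base = np.array([[350.0, 2200.0], [700.0, 1400.0], [400.0, 1000.0]])

# Pure translation: every vowel moves by the same offset.
translated = base + np.array([60.0, 40.0])

# Pure expansion: every vowel moves away from the centroid by 20%.
centroid = base.mean(axis=0)
expanded = centroid + 1.2 * (base - centroid)

shift_t, ratio_t = compare_vowel_spaces(base, translated)
shift_e, ratio_e = compare_vowel_spaces(base, expanded)
```

Here the translated space has an area ratio of 1 and a non-zero centroid shift, while the expanded space has a zero centroid shift and an area ratio of 1.2 squared.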
Inspiration for Speech Modifications
1. Spectral energy band boosting (Lombard)
2. Vowel space expansion (Clear)
Features associated with increased speech intelligibility
  Though not observed together in human speech production…
  Signal processing algorithms can accomplish both!
Spectral Energy Band Boosting

Corrective Filters
[Figure: Average Correction Filter for All Speakers (all frames vs. Enhanced (Lombard: high SII)); dB vs. Hz (0-8000 Hz). Caption: Lombard-inspired & Enhanced (high SII)]
[Figure: Spectral Energy Band Boosting, Varying Gain 0:0.5:3; dB vs. Hz (0-8000 Hz). Caption: Corrective Filter: Varying Gain]
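A corrective filter with varying gain might be sketched in Python as below. The filter shape is a hypothetical raised-cosine bump over roughly 500-4500 Hz, not the average correction filter measured in the lecture, and treating the gain as a multiplier on the filter in dB is an assumption about what "varying gain" means. All names and parameter values are illustrative.

```python
import numpy as np

def band_boost_db(freqs_hz, lo_hz=500.0, hi_hz=4500.0, peak_db=6.0):
    """Hypothetical boost curve: 0 dB outside [lo_hz, hi_hz], peak_db at the band centre."""
    shape = np.zeros_like(freqs_hz, dtype=float)
    in_band = (freqs_hz >= lo_hz) & (freqs_hz <= hi_hz)
    phase = (freqs_hz[in_band] - lo_hz) / (hi_hz - lo_hz)
    shape[in_band] = 0.5 * (1.0 - np.cos(2.0 * np.pi * phase))
    return peak_db * shape

def apply_corrective_filter(mag, filter_db, gain):
    """Scale the dB filter by `gain`, then apply it to a linear magnitude spectrum."""
    return mag * 10.0 ** (gain * filter_db / 20.0)

fs, n_fft = 16000, 512                       # assumed sampling rate and FFT size
freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
filt_db = band_boost_db(freqs)

flat = np.ones_like(freqs)                   # toy flat magnitude spectrum
gains = np.arange(0.0, 3.5, 0.5)             # 0:0.5:3 as on the slide
boosted = {float(g): apply_corrective_filter(flat, filt_db, g) for g in gains}
```

A gain of 0 leaves the spectrum untouched and larger gains deepen the boost. In practice the filtered frame would be resynthesized and its level handled separately; that step is outside this sketch.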
Frequency Warping for VS Expansion
[Figure: Clear-casual: Vowel Space, ALL Speakers (casual vs. clear); axes: F1 (Hz), F2 (Hz)]
[Figure: LUCID: Frequency differences for F1, F2; ALL (F1diff, F2diff); Hz vs. Casual F1 and F2 (Hz)]
Curve fitting formant shifts inspires warping…
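One way to turn fitted formant shifts into a warping function is sketched below in Python. The (frequency, shift) pairs are hypothetical rather than LUCID measurements, the cubic polynomial is an arbitrary choice of curve, and the piecewise-linear warping built from anchor points is an assumption about how the fit could be used. Function names are illustrative.

```python
import numpy as np

# Hypothetical data: casual formant frequency (Hz) and clear-minus-casual shift (Hz).
casual_hz = np.array([300.0, 500.0, 700.0, 1000.0, 1500.0, 2000.0, 2500.0])
shift_hz = np.array([-20.0, 10.0, 40.0, -60.0, -10.0, 50.0, 90.0])

def fit_shift_curve(freq_hz, shift, degree=3):
    """Least-squares polynomial fit of shift vs. frequency (fitted in kHz for conditioning)."""
    coeffs = np.polyfit(freq_hz / 1000.0, shift, degree)
    return lambda f_hz: np.polyval(coeffs, np.asarray(f_hz) / 1000.0)

def build_warp(shift_curve, anchors_hz, nyquist_hz):
    """Piecewise-linear warp W: anchors map to anchor + fitted shift; 0 and Nyquist stay fixed."""
    src = np.concatenate(([0.0], anchors_hz, [nyquist_hz]))
    dst = np.concatenate(([0.0], anchors_hz + shift_curve(anchors_hz), [nyquist_hz]))
    if np.any(np.diff(dst) <= 0):
        raise ValueError("warping function must be strictly increasing")
    return src, dst

def warp_envelope(envelope, freqs_hz, src, dst):
    """Warped envelope at f takes the original envelope value at W^{-1}(f)."""
    source_freq = np.interp(freqs_hz, dst, src)
    return np.interp(source_freq, freqs_hz, envelope)

fs, n_fft = 16000, 512                       # assumed sampling rate and FFT size
freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
curve = fit_shift_curve(casual_hz, shift_hz)
src, dst = build_warp(curve, casual_hz, fs / 2.0)

# Toy spectral envelope with a single peak at 700 Hz.
envelope = np.exp(-0.5 * ((freqs - 700.0) / 100.0) ** 2)
warped = warp_envelope(envelope, freqs, src, dst)
```

Keeping 0 Hz and the Nyquist frequency fixed and checking monotonicity avoids folding the spectrum; a zero shift curve reduces the warp to the identity.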
Sound Samples
With Noise (SSN, 0dB)
 Original
 Warp
 Boost
 BW
No Noise
 Original
 WarpE
 Boost
 BW
Want more?

See Maria’s presentation for more details … 
Voice & Speaking Style Conversion Parallels

Voice Conversion
  Dynamic Frequency Warping + Amplitude Scaling
  (based on acoustic-phonetic spaces of source & target speakers)
Speaking Style Conversion
  Frequency Warping + Corrective Filter
  1. Clear-speech inspired frequency warping for vowel space expansion
  2. Lombard-speech inspired corrective filters to increase loudness
Thank you!
More Questions?
Extras…
Objective Metrics for Evaluation
I. Loudness
  Energy in frequency bands weighted based on human hearing
II. Speech Intelligibility Index (SII)
  Energy & modulations in frequency bands relative to a noise masker
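The core idea of an SII-style metric, speech energy in each band judged relative to the noise masker, can be sketched in Python as follows. This is a heavily simplified, stationary illustration and not the extended SII used in the lecture. It keeps only the mapping of band SNR from the range -15 to +15 dB onto audibility between 0 and 1; uniform band importance, the band levels, and the function names are assumptions.

```python
import numpy as np

def band_audibility(speech_db, noise_db):
    """Per-band audibility: SNR clipped to [-15, +15] dB and mapped linearly to [0, 1]."""
    snr = np.asarray(speech_db, dtype=float) - np.asarray(noise_db, dtype=float)
    return (np.clip(snr, -15.0, 15.0) + 15.0) / 30.0

def sii_like_index(speech_db, noise_db, importance=None):
    """Importance-weighted sum of band audibilities (uniform weights if none given)."""
    aud = band_audibility(speech_db, noise_db)
    if importance is None:
        importance = np.full(aud.shape, 1.0 / aud.size)   # assumption: equal band importance
    return float(np.sum(np.asarray(importance, dtype=float) * aud))

# Hypothetical band levels in dB for four frequency bands.
speech_db = np.array([60.0, 55.0, 50.0, 45.0])
noise_db = np.array([60.0, 40.0, 50.0, 60.0])

index_plain = sii_like_index(speech_db, noise_db)
index_boosted = sii_like_index(speech_db + 6.0, noise_db)   # same speech, 6 dB louder
```

Raising the speech level relative to the masker increases the index until each band saturates at +15 dB SNR, which illustrates why a level-driven metric of this kind rewards louder speech.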
Loudness Distributions
[Figure: Loudness Histogram (normal vs. lombard); x-axis: Loudness value]
[Figure: Loudness Histogram (casual vs. clear); x-axis: Loudness value]
Lombard speech: “louder” for voiced (bi-modal)
Clear speech: not “louder” than casual speech
Transients: neither style distinguishes on average
Extended SII Distributions
[Figure: extended SII Histogram (normal vs. lombard); x-axis: SII]
[Figure: extended SII Histogram (casual vs. clear); x-axis: SII]
extSII highly correlated with average loudness
Lombard speech objectively more intelligible
Clear speech intelligibility gain not captured by extSII
  Limitations of objective intelligibility metrics
Observations from Analyses

Lombard Speech
  Spectral boosting in inclusive formant region
    Increase in Loudness (also extSII)
  Vowel space translation, but no expansion
Clear Speech
  Small changes in average spectra (slight spectral “flattening”)
  Consistent vowel space expansion
    Greater vowel discrimination
Comparison between styles
  Acoustic differences translate into perceptual distinctions linked to intelligibility gains
  Spectral boosting & Vowel space expansion: mutually exclusive