An Acoustic Profile of Speech Efficiency

AN ACOUSTIC PROFILE
OF SPEECH EFFICIENCY
R.J.J.H. van Son, Barbertje M. Streefkerk, and
Louis C.W. Pols
Institute of Phonetic Sciences / ACLC
University of Amsterdam,
Herengracht 338, 1016 CG Amsterdam, The Netherlands
tel: +31 20 5252183; fax: +31 20 5252197
email: [email protected]
ICSLP2000, Beijing, China, Oct. 20, 2000
INTRODUCTION
• Speech is "efficient":
Important components are emphasized
Less important ones are de-emphasized
• Two mechanisms:
1) Prosody:
Lexical Stress and Sentence Accent (Prominence)
2) Predictability:
Frequency of Occurrence (tested) and
Context (not tested)
MECHANISMS FOR
EFFICIENT SPEECH
Speech emphasis should mirror importance
which largely corresponds to unpredictability
• Prosodic structure distributes emphasis
according to importance (lexical stress,
sentence accent / prominence)
• Speakers can (de-)emphasize according to
supposed (un)importance
• Speech production mechanisms can facilitate
redundant speech or hamper unpredictable
speech
QUESTIONS
• Can the distribution of emphasis or reduction
be completely explained by Prosody?
(Lexical stress and Sentence Accent / Prominence)
• If not, can we identify a speech production
mechanism that would assist efficiency in
speech?
e.g. preprogrammed articulation of redundant
and / or high-frequency syllable-like segments?
SPEECH MATERIAL (DUTCH)
• Single Male Speaker: Vowels and Consonants
Matched Informal and Read speech, 791 matched VCV pairs
• Polyphone: Vowels only
273 speakers (out of 5000), telephone speech, 1244 read sentences
Segmented with a modified HMM recognizer (Xue Wang)
• Corpora sizes: Number of realizations of vowels and consonants
Corpus                        Unstressed       Stressed       Total
                  Accent:      –       +       –       +
Single Speaker    consonants   550     180     569     283     1582
Single Speaker    vowels       812     461     528     224     2025
Polyphone         vowels      4435    4942    9603    3516    22496
• Accent: Sentence accent / Prominence
• Stressed/Unstressed: Lexical stress
METHODS:
SPEECH PREPARATION
• Single speaker corpus
– All 2 x 791 VCV segments hand-labeled
– Also sentence accent determined by hand
– 22 Native listeners identified consonants from this corpus
• Polyphone corpus
– Automatically labeled using a pronunciation lexicon and a
modified HMM recognizer
– 10 Judges marked prominent words (prominence 1-10)
• Word and Syllable -log2(Frequencies) for both
corpora were determined from Dutch CELEX
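The -log2(frequency) measure above can be sketched as follows. This is a minimal illustration, assuming relative frequencies are taken over a plain table of counts; the function name and the toy counts are hypothetical and are not taken from CELEX.

```python
import math


def neg_log2_frequency(counts):
    """Map each item to -log2 of its relative frequency in `counts`.

    Rare items get high values, frequent items get low values.
    """
    total = sum(counts.values())
    return {item: -math.log2(n / total) for item, n in counts.items() if n > 0}


# Hypothetical counts for illustration only (not CELEX data).
toy_counts = {"de": 8, "van": 4, "spraak": 2, "klinker": 2}
info = neg_log2_frequency(toy_counts)
```

Halving the relative frequency of an item raises its value by exactly one bit, which is why the measure is convenient for correlating with acoustic reduction.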
METHODS: ANALYSIS
Single Speaker Corpus
Consonants and Vowels
• Duration in ms (vowels and consonants)
• Contrast (vowels only)
F1 / F2 distance to (300, 1450) Hz in semitones
• Spectral Center of Gravity (CoG) (V and C)
Weighted mean frequency in semitones at point of
maximum energy
• Log2(Perplexity) from consonant identification
Calculated from confusion matrices
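Two of the measures above can be sketched in a few lines. The slides do not give the exact formulas, so this assumes (a) that vowel contrast is the Euclidean distance in the semitone-scaled F1/F2 plane from the point (300, 1450) Hz, and (b) that log2(perplexity) is the entropy, in bits, of the listeners' responses to one stimulus (one row of a confusion matrix). All function names are hypothetical.

```python
import math

# Reference point in the F1/F2 plane, as given on the slide.
CENTER_F1_HZ = 300.0
CENTER_F2_HZ = 1450.0


def semitones(f_hz, ref_hz):
    """Distance from ref_hz to f_hz in semitones (12 per octave)."""
    return 12.0 * math.log2(f_hz / ref_hz)


def vowel_contrast(f1_hz, f2_hz):
    """Assumed: Euclidean F1/F2 distance to (300, 1450) Hz, in semitones."""
    return math.hypot(semitones(f1_hz, CENTER_F1_HZ),
                      semitones(f2_hz, CENTER_F2_HZ))


def log2_perplexity(response_counts):
    """Assumed: entropy in bits of one confusion-matrix row.

    `response_counts` holds how many listeners gave each response
    to a single stimulus.
    """
    total = sum(response_counts)
    return -sum((n / total) * math.log2(n / total)
                for n in response_counts if n > 0)
```

Under these assumptions a vowel realized exactly at (300, 1450) Hz has zero contrast, and a consonant that all listeners identify the same way has a log2(perplexity) of zero.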
METHODS: ANALYSIS
Polyphone Corpus
Vowels only
• Loudness
in sone
• Spectral Center of Gravity (CoG)
Weighted mean frequency in semitones averaged over
the segment
• Prominence (1-10)
The number of 'PROMINENT' listener judgements
0 – 5 is considered Unaccented
6 –10 is considered Accented
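The CoG and prominence measures above can be sketched as follows. This assumes the CoG is the power-weighted mean frequency of a spectral frame, converted to semitones relative to a reference frequency, and then averaged over the frames of a segment; the 100 Hz reference and all function names are assumptions, since the slides do not state them.

```python
import math


def frame_cog_semitones(freqs_hz, power, ref_hz=100.0):
    """Power-weighted mean frequency of one frame, in semitones re ref_hz.

    ref_hz is a hypothetical reference; the slides do not specify one.
    """
    total = sum(power)
    mean_hz = sum(f * p for f, p in zip(freqs_hz, power)) / total
    return 12.0 * math.log2(mean_hz / ref_hz)


def segment_cog_semitones(frames, ref_hz=100.0):
    """Average the per-frame CoG over all (freqs_hz, power) frames."""
    values = [frame_cog_semitones(f, p, ref_hz) for f, p in frames]
    return sum(values) / len(values)


def is_accented(prominent_judgements):
    """Slide rule: 0-5 'PROMINENT' judgements is Unaccented, 6-10 Accented."""
    return prominent_judgements >= 6
```

With ten judges, `is_accented` simply asks whether a majority marked the word as prominent.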
CONSISTENCY OF MEASUREMENTS
Correlation coefficients between factors
[Figure: correlation coefficient R (vertical axis, 0 to 0.7) per condition: Accent – / + within Unstressed and Stressed, and Total.
Single Speaker, consonants (n=1582): Duration x CoG, Duration x Px, CoG x Px
Single Speaker, vowels (n=2025): Duration x Contrast, Duration x CoG, Contrast x CoG
Polyphone (n=22496): Loudness x CoG
Filled symbols: p<=0.01]
• Duration in ms
• Loudness in sones
• CoG: Spectral Center of Gravity (semitones)
• Px: log2(Perplexity) plotted is –R
• Contrast: F1/ F2 distance to (300, 1450) Hz (semitones)
CONSONANT REDUCTION VERSUS
FREQUENCY OF OCCURRENCE
(correlation coefficients)
[Figure: two panels, Syllable frequencies and Word frequencies; single speaker corpus (n=1582).
Vertical axis: correlation coefficient R (0 to 0.35).
Horizontal axis: Accent – / + within Unstressed and Stressed, and Total.
Series: Duration, CoG, Perplexity. Filled symbols: p<=0.01]
• CoG: Spectral Center of Gravity (semitones)
• Perplexity: log2(Perplexity), plotted is –R.
• Syllable and word frequencies were correlated (R=0.230, p=0.01)
VOWEL REDUCTION VERSUS
FREQUENCY OF OCCURRENCE
(correlation coefficients)
[Figure: two panels, Syllable frequencies and Word frequencies; single speaker corpus (n=2025).
Vertical axis: correlation coefficient R (0 to 0.35).
Horizontal axis: Accent – / + within Unstressed and Stressed, and Total.
Series: Duration, CoG, Contrast. Filled symbols: p<=0.01]
• Duration in ms
• Contrast: F1/ F2 distance to (300, 1450) Hz (semitones)
• CoG: Spectral Center of Gravity (semitones)
• Syllable and word frequencies were correlated (R=0.280, p<=0.01)
DISCUSSION OF SINGLE
SPEAKER DATA
• There are consistent correlations between
frequency of occurrence and “acoustic
reduction” (duration, CoG and contrast), but
not for consonant identification (perplexity)
• Correlations for syllable frequencies tend to
be larger than those for word frequencies
(p0.01)
• Correlations were found after accounting for
Phoneme identity, Lexical Stress and
Sentence Accent
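The slides do not say how phoneme identity, stress and accent were accounted for. One plausible way, shown here purely as an assumption, is to subtract the mean of each (phoneme, stress, accent) cell from both variables and then correlate the residuals. The data and all names below are hypothetical toy values, not results from the corpus.

```python
import math
from collections import defaultdict


def center_within_cells(values, cells):
    """Subtract from each value the mean of its cell (e.g. phoneme/stress/accent)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, c in zip(values, cells):
        sums[c] += v
        counts[c] += 1
    return [v - sums[c] / counts[c] for v, c in zip(values, cells)]


def pearson_r(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data: two cells keyed by (phoneme, stressed, accented).
cells = [("a", True, False)] * 3 + [("s", False, False)] * 3
duration_ms = [90.0, 100.0, 110.0, 40.0, 50.0, 60.0]
neg_log2_freq = [8.0, 10.0, 12.0, 3.0, 5.0, 7.0]

r = pearson_r(center_within_cells(duration_ms, cells),
              center_within_cells(neg_log2_freq, cells))
```

Centering within cells removes differences that are due only to which phoneme, stress level or accent condition a token belongs to, so any remaining correlation reflects variation inside those categories.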
PROMINENCE VERSUS VOWEL REDUCTION
AND FREQUENCY OF OCCURRENCE
(correlation coefficients)
[Figure: Polyphone corpus (n=22496).
Vertical axis: correlation coefficient R (0 to 0.5) with prominence.
Horizontal axis: Lexical stress – / + and Total.
Series: Loudness, CoG, Syllable freq., Word freq. Filled symbols: p<=0.01]
• Loudness (sone)
• CoG: Spectral Center of Gravity (semitones)
• Syllable and word frequencies (-log2(freq))
VOWEL REDUCTION VERSUS
FREQUENCY OF OCCURRENCE
(correlation coefficients)
[Figure: two panels, Syllable frequencies and Word frequencies; Polyphone corpus (n=22496).
Vertical axis: correlation coefficient R (0 to 0.10).
Horizontal axis: Accent – / + within Unstressed and Stressed, and Total.
Series: Loudness, CoG. Filled symbols: p<=0.01
Accent: + Prom > 5; – Prom <= 5]
• Loudness (sone)
• CoG: Spectral Center of Gravity (semitones)
• Syllable and word frequencies were correlated (R=0.316, p<=0.01)
DISCUSSION OF
POLYPHONE DATA
• Perceived prominence correlates with “acoustic
vowel reduction” (loudness, CoG) and
frequency of occurrence (syllable and word)
• There are small but consistent correlations
between “acoustic vowel reduction” and
frequency of occurrence
• Correlations were found after accounting for
Vowel identity, Lexical Stress and Prominence
CONCLUSIONS
• LEXICAL STRESS and
SENTENCE ACCENT / PROMINENCE cannot
explain all of the “efficiency” of speech:
FREQUENCY OF OCCURRENCE and possibly
CONTEXT in general are needed for a full account
• A SYLLABARY which speeds up (and reduces) the
articulation of “stored”, high-frequency, syllables
with respect to “computed”, rare, syllables might
explain at least part of our data
SPOKEN LANGUAGE CORPUS
How Efficient is Speech
• 8-10 speakers:
~60 minutes of speech each
(fixed and variable materials)
• Informal story telling and retold stories
~15 min
• Reading continuous texts
~15 min
• Reading Isolated (Pseudo-) sentences
~20 min
• Word lists
~ 5 min
• Syllable lists
~ 5 min
MEASURING
SPEECH EFFICIENCY
• Speaking Style differences
(Informal, Retold, Read, Sentences, Lists)
• Predictability
– Frequency of Occurrence (words and syllables)
– In Context (language models)
– Cloze-tests
– Shadowing (RT or delay)
• Acoustic Reduction
– Segment identification
– Duration
– Spectral reduction