Acoustic and Linguistic Characterization of Spontaneous Speech

Masanobu Nakamura, Koji Iwano, and
Sadaoki Furui
Department of Computer Science
Tokyo Institute of Technology
Tokyo, Japan
Introduction(1/2)
Background

Present speech recognition technology
 High recognition accuracy for read speech
 Rather poor accuracy for spontaneous speech

Recognition accuracy for spontaneous speech needs to be improved.

What are the differences between spontaneous and read speech?
Why is the recognition accuracy for spontaneous speech low?

Introduction(2/2)
Goals


Statistical and quantitative analysis of acoustic and
linguistic differences between spontaneous and read
speech.
Investigation of acoustic and linguistic characteristics
which affect speech recognition performance in
spontaneous speech.
Corpus of Spontaneous Japanese (CSJ)

A large-scale spontaneous speech corpus
 Roughly 7M words (morphemes), with a total speech length of 650 hours
 Orthographic and phonetic transcriptions are provided manually.

Speaking styles
 Academic presentations (AP)
  Live recordings of academic presentations
  Fields: engineering, social sciences, and humanities
 Extemporaneous presentations (EP)
  Studio recordings of paid lay speakers
  Small audience and relatively relaxed atmosphere
  More informal than AP
 Dialogue speech (D)
  Interviews, task-oriented dialogues, and free dialogues
 Read speech (R)
  Transcriptions of AP or EP read aloud by the same speaker

Disfluency ratio

Filled pauses (F), word fragments (W), and reduced
articulation or mispronunciation (M)

Approximately one-tenth of the words are disfluencies
in the spontaneous speech in the CSJ.
The ratio of F is significantly higher than that of W and
M.

Acoustic characteristics
Acoustic feature extraction
1. 39-dimensional feature vectors
 12-dimensional MFCC, log-energy, and their first and second derivatives
 25 ms window shifted every 10 ms
 Cepstral mean subtraction (CMS) is applied to each utterance.
2. HMMs
 Monophone HMMs with a single Gaussian mixture component
 Left-to-right topology with three self-loops
 Trained on samples for every combination of phoneme, speaker, and utterance style
3. Acoustic features for each phoneme
 Mean and variance vectors of the 12-dimensional MFCC at the 2nd state of the HMM
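
The 39-dimensional front end listed above (12 MFCCs plus log-energy, with first and second derivatives, and per-utterance CMS) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the 13 static coefficients per frame are already computed, uses a regression-style delta with a window of two frames on each side, and applies CMS to the 12 cepstral columns only. The function names `frame_count`, `deltas`, and `make_39dim` are hypothetical.

```python
import numpy as np

def frame_count(duration_ms: float, win_ms: float = 25.0, shift_ms: float = 10.0) -> int:
    """Number of full 25 ms windows that fit when shifting by 10 ms."""
    if duration_ms < win_ms:
        return 0
    return 1 + int((duration_ms - win_ms) // shift_ms)

def deltas(feats: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based time derivative of a (frames, dims) matrix.

    The window width is an assumption; the slides do not state it.
    """
    num_frames = feats.shape[0]
    # Repeat the edge frames so every frame has `width` neighbours on each side.
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(feats, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + num_frames]
        behind = padded[width - n : width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom

def make_39dim(static: np.ndarray) -> np.ndarray:
    """static: (frames, 13) = 12 MFCCs followed by log-energy, for ONE utterance."""
    static = static.astype(float).copy()
    # CMS: subtract the per-utterance mean from each cepstral coefficient.
    static[:, :12] -= static[:, :12].mean(axis=0)
    d1 = deltas(static)   # first derivatives
    d2 = deltas(d1)       # second derivatives
    return np.hstack([static, d1, d2])
```

Only the first 12 columns of the result (the mean-normalized MFCCs) correspond to the per-phoneme mean and variance vectors used in the later analysis.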
Target phonemes
31 Japanese phonemes (10 vowels and 21 consonants)
Vowels: /a, i, u, e, o, a:, i:, u:, e:, o:/
Consonants: /w, y, r, p, t, k, b, d, g, j, ts, ch, z, s, sh, h, f, N, N:, m, n/

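Reduction ratio

Quantitative analysis of the spectral space reduction for spontaneous speech

Definition

$$\mathrm{red}_p(X) = \frac{\lVert \mu_p(X) - \mathrm{Av}[\mu_p(X)] \rVert}{\lVert \mu_p(R) - \mathrm{Av}[\mu_p(R)] \rVert}$$

 $\mu_p(X)$: mean vector of phoneme $p$ uttered in speaking style $X$
 $\mu_p(R)$: mean vector of phoneme $p$ in read speech
 $\mathrm{Av}$: average over all phonemes
 $\lVert \cdot \rVert$: Euclidean norm (distance)

[Figure: phoneme p in speaking style X and in read speech, each shown relative to the center of the distribution of all phonemes]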
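
As a rough sketch of how such a ratio could be computed, the snippet below assumes the reduction ratio compares each phoneme's Euclidean distance from the all-phoneme center in style X with the same distance in read speech. The function name `reduction_ratio` and the array layout (one row per phoneme, 12 MFCC columns, same phoneme order in both arrays) are assumptions made for illustration.

```python
import numpy as np

def reduction_ratio(means_x: np.ndarray, means_r: np.ndarray) -> np.ndarray:
    """Per-phoneme reduction ratio of style X relative to read speech R.

    means_x, means_r: (phonemes, 12) mean MFCC vectors, rows in the same order.
    """
    center_x = means_x.mean(axis=0)  # Av over all phonemes, style X
    center_r = means_r.mean(axis=0)  # Av over all phonemes, read speech
    dist_x = np.linalg.norm(means_x - center_x, axis=1)
    dist_r = np.linalg.norm(means_r - center_r, axis=1)
    return dist_x / dist_r
```

Under this definition read speech compared with itself gives a ratio of 1 for every phoneme, and values below 1 indicate that the phoneme has moved toward the center of the MFCC space.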
Reduction ratio averaged over 10 speakers

[Figure: reduction ratio of each phoneme for each speaking style, with the reference line red_p(X) = 1]

The MFCC space is reduced for almost all phonemes, and the reduction is largest for dialogue utterances.

Reduction ratio averaged over vowels and consonants

For every speaking style, the distribution of spontaneous speech is reduced in comparison with read speech, and the reduction is largest for dialogue speech.

Between-phoneme distances

The reduction of the MFCC distance between each phoneme pair is measured using the Mahalanobis distance.

[Figure: phoneme cepstra (/a/, /n/, /k/, /m/, /u/, /r/) in the MFCC space, with the Mahalanobis distance measured between each phoneme pair]

Mahalanobis distance

Mahalanobis distance $D_{ij}(X)$ between phonemes $i$ and $j$:

 $K$: dimension of the MFCC vector ($K = 12$)
 $\mu_{ik}(X)$ and $\sigma_{ik}^2(X)$: $k$-th elements of the mean and variance vectors of the MFCC for phoneme $i$ uttered in speaking style $X$
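
The exact formula is not preserved in this transcript, so the sketch below uses one common diagonal-covariance form as an assumption: the squared mean difference in each MFCC dimension is divided by the average of the two phonemes' variances, summed over the K = 12 dimensions, and square-rooted. The paper's own equation may pool the variances differently. The function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def mahalanobis_diag(mu_i, var_i, mu_j, var_j) -> float:
    """Distance between two phonemes with diagonal covariances.

    ASSUMPTION: the pooled variance is the average of the two variance vectors.
    """
    pooled = 0.5 * (np.asarray(var_i) + np.asarray(var_j))
    diff = np.asarray(mu_i) - np.asarray(mu_j)
    return float(np.sqrt(np.sum(diff ** 2 / pooled)))

def pairwise_distances(means: np.ndarray, variances: np.ndarray) -> dict:
    """Return {(i, j): distance} for every unordered phoneme pair i < j."""
    return {
        (i, j): mahalanobis_diag(means[i], variances[i], means[j], variances[j])
        for i, j in combinations(range(len(means)), 2)
    }
```

With 31 target phonemes this yields 465 pairwise distances per speaking style.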
Cumulative frequency distribution of Mahalanobis distances

Mahalanobis distances between every phoneme pair for each speaking style

[Figure: cumulative frequency of Mahalanobis distances for each speaking style; an arrow indicates increasing spontaneity]

The Mahalanobis distance between phonemes decreases as the spontaneity of the utterances increases.
The more spontaneous the utterances, the more reduced the cepstrum space becomes.
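
A cumulative frequency curve of this kind can be computed from a list of pairwise distances as sketched below. The function name `cumulative_frequency` and the choice of thresholds are illustrative assumptions; the slides do not specify how the curves were binned.

```python
import numpy as np

def cumulative_frequency(distances, thresholds) -> np.ndarray:
    """Fraction of phoneme pairs whose distance is <= each threshold."""
    d = np.sort(np.asarray(distances, dtype=float))
    # side="right" counts the values that are less than or equal to the threshold.
    counts = np.searchsorted(d, thresholds, side="right")
    return counts / d.size
```

When the distances for one speaking style are uniformly smaller than for another, its curve reaches any given fraction at a smaller threshold, which is the leftward shift described for more spontaneous styles.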
Relationship between phoneme distances and phoneme accuracy (1/2)

Investigation of the relationship between mean phoneme distance and phoneme recognition accuracy

Acoustic model
 A common model for all speaking styles
 Trained on data from 100 males and 100 females for AP and 150 males and 150 females for EP (about 2M phoneme samples each)
Language model
 Phoneme network constrained by phoneme-class probabilities

Relationship between phoneme distances and phoneme accuracy (2/2)

[Figure: mean phoneme distance versus phoneme recognition accuracy; correlation coefficient 0.97]

Strong correlation between mean phoneme distance and phoneme accuracy
Reduction of the distances between phonemes is a major factor contributing to the degradation of spontaneous speech recognition accuracy.
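
A correlation coefficient of this kind is normally the Pearson coefficient between two paired series, here one mean phoneme distance and one phoneme accuracy per condition. The sketch below assumes that is the statistic used; the slides do not name it, and the function name `pearson_r` is hypothetical.

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centered values
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))
```

A value near +1 means accuracy rises almost linearly with distance; a value near -1 means one quantity falls as the other rises.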
Linguistic characteristics
Written text and spontaneous speech corpora
Mainichi newspaper (NP)
 Written text corpus
News commentary (NC)
 Transcriptions of utterances spoken from prepared text
Academic presentations (AP) (in CSJ)
Extemporaneous presentations (EP) (in CSJ)
Dialogue (D) (in CSJ)

Part-of-speech observation frequency

[Figure: part-of-speech frequencies for each corpus, with nouns and fillers highlighted]

The frequency of nouns is much higher in the newspaper corpus than in spontaneous speech.
The frequency of fillers is much higher in dialogue than in news commentary and presentations.

Perplexity matrix

A trigram language model is built for each speaking style, and test-set perplexity is measured for every combination of styles.

[Figure: perplexity matrix, with the diagonal elements highlighted]

Test-set perplexity for spontaneous speech is roughly five times larger than that for written newspaper texts.
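
The procedure of training one trigram model per style and scoring every style's test set with every model can be sketched as below. This is a toy illustration only: it uses add-k smoothing over a shared vocabulary, which is an assumption (the slides do not say which smoothing or toolkit was used), and it indexes the matrix as (training style, test style), which is also an assumption. All function names are hypothetical.

```python
import math
from collections import Counter

def _trigrams(words):
    """Yield (w1, w2, w3) tuples with sentence-boundary padding."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    return zip(padded, padded[1:], padded[2:])

def train_trigram(sentences):
    """Count trigrams and their two-word contexts."""
    tri, ctx = Counter(), Counter()
    for words in sentences:
        for a, b, c in _trigrams(words):
            tri[(a, b, c)] += 1
            ctx[(a, b)] += 1
    return tri, ctx

def perplexity(model, sentences, vocab_size, k=1.0):
    """Test-set perplexity under add-k smoothing (an assumed smoothing scheme)."""
    tri, ctx = model
    log_prob, n_tokens = 0.0, 0
    for words in sentences:
        for a, b, c in _trigrams(words):
            p = (tri[(a, b, c)] + k) / (ctx[(a, b)] + k * vocab_size)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

def perplexity_matrix(train_sets, test_sets):
    """Return {(train_style, test_style): perplexity} for all style pairs."""
    vocab = set()
    for corpus in list(train_sets.values()) + list(test_sets.values()):
        for sentence in corpus:
            vocab.update(sentence)
    vocab_size = len(vocab) + 1  # +1 for the end-of-sentence symbol
    models = {s: train_trigram(train_sets[s]) for s in train_sets}
    return {
        (i, j): perplexity(models[i], test_sets[j], vocab_size)
        for i in train_sets
        for j in test_sets
    }
```

The diagonal entries, where the training and test styles match, are the test-set perplexities compared across styles; off-diagonal entries measure how poorly one style's model predicts another style.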
Distance matrix for visualization

Visualization of the relationships between the language models
The perplexity matrix is symmetrized as follows:

 PP matrix (PP(a_ij)) → Symmetrization → Distance matrix (D(d_ij)) → Visualization

Correction

Equation (3) in the paper is wrong.

The correct equation (3) is as follows:

Difference between language models

Relationship between the language models, projected onto a two-dimensional space derived from the distance matrix using multidimensional scaling (MDS)

Newspaper text and dialogue are situated at the two extreme positions.
Presentations and news commentary are situated in between.
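
The pipeline from perplexity matrix to a two-dimensional map can be sketched as below. Two loud assumptions apply. First, the paper's corrected symmetrization equation is not preserved in this transcript, so `symmetrize` uses a stand-in (the average of the two cross-perplexities, with a zero diagonal) rather than the authors' formula. Second, the slides say only "MDS", so classical (Torgerson) MDS is assumed. Both function names are hypothetical.

```python
import numpy as np

def symmetrize(pp) -> np.ndarray:
    """STAND-IN symmetrization: d_ij = (a_ij + a_ji) / 2, with d_ii = 0.

    This is not the paper's equation (3); it only illustrates the step.
    """
    pp = np.asarray(pp, dtype=float)
    d = 0.5 * (pp + pp.T)
    np.fill_diagonal(d, 0.0)
    return d

def classical_mds(d: np.ndarray, dims: int = 2) -> np.ndarray:
    """Embed n items in `dims` dimensions from an (n, n) distance matrix."""
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigval, eigvec = np.linalg.eigh(gram)
    order = np.argsort(eigval)[::-1][:dims]  # indices of the largest eigenvalues
    scale = np.sqrt(np.clip(eigval[order], 0.0, None))
    return eigvec[:, order] * scale
```

Each row of the result is the 2-D position of one language model; models whose cross-perplexities are high end up far apart, which is how two corpora can appear at opposite extremes of the map.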

Relationship between perplexity and word accuracy (1/2)

Investigation of the relationship between test-set perplexity and word accuracy

Acoustic model
 A common model for all speaking styles
 Trained on data from 10 males and 10 females for each speaking style (about 750K phoneme samples)
Language models
 Separate models for each speaking style

Relationship between perplexity and word accuracy (2/2)

[Figure: test-set perplexity (diagonal elements of the PP matrix) versus word accuracy; correlation coefficient -0.98]

Across the speaking styles, test-set perplexity and word recognition accuracy are highly correlated, with a correlation coefficient of -0.98.

Conclusion(1/2)

Clarified the differences in acoustic and linguistic characteristics between spontaneous speech and read speech.

Acoustic characteristics
 The spectral distribution of spontaneous speech is reduced in comparison with that of read speech.
 The more spontaneous the speech, the smaller the distances between phonemes.
 There is a high correlation between the mean phoneme distance and the phoneme recognition accuracy.
 Spontaneous speech can be characterized by the reduction of spectral space in comparison with that of read speech, and this is one of the major factors contributing to the decrease in phoneme recognition accuracy.

Conclusion(2/2)

Linguistic characteristics
 The perplexity of language models for spontaneous speech is significantly higher than that for written text.
 Spontaneous speech frequently includes ungrammatical phenomena and linguistic variations, including repetitions and repairs.
 There is a high correlation between the test-set perplexity and the word recognition accuracy.
 The increase in test-set perplexity for spontaneous speech is one of the major factors contributing to the decrease in word recognition accuracy.

Future research

Analysis over wider ranges of spontaneous speech, using utterances other than those included in the CSJ
 Is the relationship between phoneme distances and phoneme recognition accuracy general?
 Is the relationship between test-set perplexity and word recognition accuracy general?

How to account for filled pauses, repairs, hesitations, repetitions, partial words, and other disfluencies in spontaneous speech

Investigation of how the results obtained in this paper can be used to improve recognition performance for spontaneous speech
 Creating methods for adapting acoustic and language models to spontaneous speech

Thank you very much for your
kind attention!
E-mail: [email protected]