Speech and speaker normalization
(in vowel normalization)
Venice International University
Phonetic and technological aspects of speaker characteristics
Prof. Dr. J. Harrington
Presented by
Clara Tillmanns
18.10.2007
Contents
1. Speech and speaker normalization in
vowel normalization: definition
2. Influencing parameters and
instruments for vowel normalization
3. Theories
4. Studies: Johnson 1990 and 1999
5. Recapitulation
Clara Tillmanns - Speech and speaker normalization
Definition
Normalization.
We know there is extensive variation in speech.
How is it that listeners nevertheless agree in their
perception of vowels?
Fig. 1:
Scatter plot of first and second formant
values of American English vowels.
From Peterson & Barney 1952
Definition
Normalization.
Which information influences this decision?
Definition
Normalization.
And, which mechanism leads to the decision?
Contents
1. Speech and speaker normalization: definition
2. Influencing parameters and instruments for
vowel normalization
- Context
- Formant ratio
- F0
- Visual information
- Auditory gestalts
3. Theories
4. Studies: Johnson 1990 and 1999
5. Recapitulation
Influencing parameters and
instruments for vowel normalization
Extrinsic:
- Context
Intrinsic:
- Formant ratio
- F0
- Auditory gestalts
Visual information
Influencing parameters and
instruments for vowel normalization
Extrinsic (syllable external):
- Context
Intrinsic (syllable internal):
- Formant ratio
- F0
- Auditory gestalts
Visual information
Influencing parameters and
instruments for vowel normalization
Extrinsic (syllable external):
- Context (vocalic, prosodic, tonal)
Intrinsic (syllable internal):
- Formant ratio
- F0
- Auditory gestalts
Visual information
Influencing parameters and
instruments for vowel normalization
Context:
Perceived vowel quality is influenced
- by the formant frequencies of context vowels
(Ladefoged & Broadbent 1957)
- by the F0 range of the carrier phrase (Johnson 1990)
Tones: The pitch range of a context utterance influences
the perception of Mandarin Chinese tones (Leather 1983)
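One simple way to picture the context effect is to place the test vowel within the formant range of the carrier phrase. The sketch below is only an illustration under that assumption: the slides do not specify a mechanism, range scaling is a swapped-in simplification, the function name `relative_f1` is hypothetical, and all frequency values are invented for illustration.

```python
def relative_f1(test_f1, carrier_f1_values):
    """Place a test vowel's F1 within the carrier phrase's F1 range.

    Returns 0.0 at the lowest carrier F1 and 1.0 at the highest.
    Range scaling is an assumed simplification, not a cited model.
    """
    low, high = min(carrier_f1_values), max(carrier_f1_values)
    return (test_f1 - low) / (high - low)


# Hypothetical F1 values in Hz, for illustration only.
carrier_low_f1 = [200.0, 350.0, 600.0]   # carrier phrase with a low F1 range
carrier_high_f1 = [350.0, 500.0, 750.0]  # carrier phrase with a high F1 range
test_f1 = 450.0                          # the same test token in both contexts

print(relative_f1(test_f1, carrier_low_f1))
print(relative_f1(test_f1, carrier_high_f1))
```

The same 450 Hz token sits relatively high in the first carrier's range and relatively low in the second, which is the kind of relational shift the context findings point to.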
Influencing parameters and
instruments for vowel normalization
Extrinsic (syllable external):
- Context (vocalic, prosodic, tonal)
Intrinsic (syllable internal):
- Formant ratio (relative patterns)
- F0 (gender)
- Auditory gestalts
Visual information
Influencing parameters and
instruments for vowel normalization
Formant ratio
Vowels are relative patterns, not absolute
frequencies
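A minimal sketch of this idea, assuming hypothetical formant values (not measurements from any study cited here): if one talker's formants are a uniformly scaled copy of another's, the absolute frequencies differ but the ratios between formants stay the same. The function name `formant_ratios` and the scale factor 1.2 are illustrative assumptions.

```python
import math


def formant_ratios(f1, f2, f3):
    """Return the log ratios (log F2/F1, log F3/F2) of three formants in Hz."""
    return (math.log(f2 / f1), math.log(f3 / f2))


# Hypothetical values for illustration only: talker B's formants are
# talker A's scaled by an assumed uniform factor of 1.2.
talker_a = (300.0, 2300.0, 3000.0)
talker_b = tuple(1.2 * f for f in talker_a)

ratios_a = formant_ratios(*talker_a)
ratios_b = formant_ratios(*talker_b)

# The absolute frequencies differ; the relative pattern does not.
print(talker_a, ratios_a)
print(talker_b, ratios_b)
```

Uniform scaling is a deliberate simplification; real talker differences are not a single scale factor.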
Influencing parameters and
instruments for vowel normalization
Formant ratio
Fig. 2: Spectrograms of a man and a woman saying “cat”.
The three lowest vowel formants (vocal tract resonant
frequencies) are marked as F1, F2 and F3 (Johnson 2004)
Influencing parameters and
instruments for vowel normalization
F0
Miller 1953:
Doubled F0 and found a vowel category shift for
most American English vowels
Fujisaki & Kawashima 1968:
Found F1 boundary shifts of 100 to 200 Hz
for F0 shifts of 200 Hz
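As a rough worked example, reading the Fujisaki & Kawashima figures as F1 boundary shifts of 100 to 200 Hz for an F0 shift of 200 Hz, the implied rate is between 0.5 and 1.0 Hz of boundary shift per Hz of F0 change. The sketch below assumes the relation is linear, which the slide does not claim; the function name `f1_boundary_shift` is hypothetical.

```python
def f1_boundary_shift(f0_shift_hz, rate):
    """Linear sketch: F1 boundary shift (Hz) for a given F0 shift (Hz).

    Linearity is an assumption made for illustration only.
    """
    return rate * f0_shift_hz


# Rates implied by the reported figures: 100/200 and 200/200 Hz per Hz.
low_rate = 100 / 200
high_rate = 200 / 200

# Under the linearity assumption, a 100 Hz rise in F0 would move the
# F1 boundary by somewhere between these two values.
print(f1_boundary_shift(100, low_rate), f1_boundary_shift(100, high_rate))
```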
Influencing parameters and
instruments for vowel normalization
Extrinsic (syllable external):
- Context (vocalic, prosodic, tonal)
Intrinsic (syllable internal):
- Formant ratio (relative patterns)
- F0 (gender)
- Auditory gestalts
Visual information (articulatory gestures, gender / age)
Influencing parameters and
instruments for vowel normalization
Visual information
- Gender: boundary shift much like the F0 shift
(Strand & Johnson 1996)
- Age
- Vowel quality: boundary shift through differing
visual phonetic information (Johnson 1999)
- Sociocultural: speech intelligibility is reduced
when the voice is associated with an Asian-looking
face (Rubin 1992)
Influencing parameters and
instruments for vowel normalization
Auditory gestalts - “secondary cues”
Duration
Formant frequency movement trajectories:
- Lehiste & Meltzer 1973:
  - Fixed-duration vowels synthesized with steady-state formant
    frequencies: 51% correct
  - Mixed lists of the original vowels from men, women and
    children: 79% correct
- Hillenbrand & Nearey 1999:
  - Flat-formant vowels were correctly identified 74% of the time,
    while vowels synthesized with the original formant frequency
    trajectories were correctly identified 89% of the time.
Contents
1. Speech and speaker normalization in vowel
normalization: definition
2. Influencing parameters and instruments for vowel
normalization
3. Theories
3.1 Vocal tract normalization (VTN)
3.2 Talker normalization (TN)
4. Studies: Johnson 1990 and 1999
5. Recapitulation
Theories - VTN
“Vocal tract normalization theories
consider that listeners perceptually
evaluate vowels on a talker specific
coordinate system.” (Johnson 2004)
• Context vowels (reference)
• Visual information about the size of the
vocal tract
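To make "talker specific coordinate system" concrete, the sketch below rescales each talker's formant values relative to that talker's own mean and spread. This z-score transform is one commonly used procedure and is swapped in here as an assumption; the slides do not name a specific algorithm. The function name `z_normalize` and all frequency values are hypothetical.

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Express each value relative to the talker's own mean and spread."""
    center = mean(values)
    spread = pstdev(values)
    return [(v - center) / spread for v in values]


# Hypothetical F1 values in Hz for three vowels, for illustration only.
# Talker B's values are talker A's scaled by 1.2.
talker_a_f1 = [300.0, 500.0, 700.0]
talker_b_f1 = [360.0, 600.0, 840.0]

normalized_a = z_normalize(talker_a_f1)
normalized_b = z_normalize(talker_b_f1)

# In talker-relative coordinates the two vowel systems coincide.
print(normalized_a)
print(normalized_b)
```

A transform like this needs reference material from the talker (for example context vowels) before a single test vowel can be placed, which is why context and visual cues matter for this class of theory.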
Theories - VTN
But: Talkers may differ from each other at the
level of their articulatory habits of speech:
“Perception may not be able to depend on vocal
tract normalization to “remove” talker
differences by removing vocal tract
differences.” (Johnson 2004)
→ Does speaker/speech variation depend on
anatomical differences only?
Theories - VTN
Cross-linguistic gender differences
Bladon, Henton and Pickering (1984):
The differences between men and women vary
from language to language.
→ Cultural factors are involved in defining and
shaping male or female speech
→ Anatomy does not completely determine the
vowel formant frequencies
Theories - VTN
Fig. 3:
Spectral shift needed to normalize male
and female spectra.
From Bladon, Henton & Pickering (1984)
Theories - VTN
“This seems to suggest that talkers choose
different styles of speaking as social, dialectal
gender markers.
A speaker normalization that removes vocal
tract differences will fail to account for the
linguistic categorical similarity of vowels that
are different due to different habits of
articulation.”
(Johnson 2004)
Theories - TN
Talker normalization is subject to expectations:
Magnuson & Nusbaum (1994) compared 1-voice with 2-voice
instructions in a mixed-talker and blocked-talker experiment.
The blocked-talker advantage disappeared when subjects did not
know about the different F0s of the two voices.
Talker normalization is an active process:
Kato & Kakehi (1988), listener adaptation to talker voice:
recognition accuracy increased over the course of 5 stimuli
presented in noise.
Theories - TN
“In this approach, cognitive categories are represented
as collections of the stored cognitive representations
of experienced instances of the category,
rather than as normalized abstract representations from
which category-internal structure has been removed”
(Johnson 2004)
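The exemplar idea in this quotation can be sketched as a classifier that keeps every experienced token and lets a new token activate the stored ones by similarity. This is a generic similarity-weighted exemplar model written for illustration, not the implementation from Johnson 2004; the function name `categorize`, the `sensitivity` parameter, and all (F1, F2) values are hypothetical.

```python
import math
from collections import defaultdict


def categorize(token, exemplars, sensitivity=0.01):
    """Return the category whose stored exemplars are most activated.

    token: (F1, F2) in Hz; exemplars: list of ((F1, F2), label).
    Activation decays exponentially with Euclidean distance (an assumed
    similarity function chosen for simplicity).
    """
    activation = defaultdict(float)
    for features, label in exemplars:
        distance = math.dist(token, features)
        activation[label] += math.exp(-sensitivity * distance)
    return max(activation, key=activation.get)


# Hypothetical stored instances from different talkers; no normalization
# is applied, so talker variation stays inside each category.
stored = [
    ((300.0, 2300.0), "i"),
    ((360.0, 2700.0), "i"),
    ((700.0, 1200.0), "a"),
    ((850.0, 1400.0), "a"),
]

print(categorize((340.0, 2500.0), stored))
print(categorize((800.0, 1300.0), stored))
```

Because the category is just its collection of instances, hearing more tokens from a talker changes later decisions without any explicit normalization step.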
Contents
1. Speech and speaker normalization in vowel
normalization: definition
2. Influencing parameters and instruments for vowel
normalization
3. Theories
4. Studies
4.1 Johnson 1990
4.2 Johnson 1999
5. Recapitulation
Studies
“The role of perceived speaker identity in
F0 normalization of vowels” (Johnson
1990)
Vowels from a “hood”-“hud” continuum were
presented in two different intonational
contexts, which were judged to have been
produced by different speakers even
though the F0 of the test word was identical in
the two contexts.
Studies
“The role of perceived speaker identity in F0
normalization of vowels” (Johnson 1990)
Identification shifted as a result of the
intonational context,
which was interpreted as evidence for the
role of perceived speaker identity in vowel
normalization
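A shift in identification is usually quantified as a change in the category boundary, the point on the continuum where responses cross 50%. The sketch below estimates that point by linear interpolation; the response proportions are invented for illustration and are not data from Johnson 1990, and the function name `category_boundary` is hypothetical.

```python
def category_boundary(steps, prop_hood):
    """Interpolate the continuum step where "hood" responses cross 50%."""
    pairs = list(zip(steps, prop_hood))
    for (x0, p0), (x1, p1) in zip(pairs, pairs[1:]):
        if p0 >= 0.5 > p1:
            return x0 + (p0 - 0.5) * (x1 - x0) / (p0 - p1)
    raise ValueError("responses never cross 50%")


# Hypothetical proportions of "hood" responses on a 7-step continuum,
# one list per intonational context. Illustration only.
steps = [1, 2, 3, 4, 5, 6, 7]
context_1 = [1.0, 0.9, 0.8, 0.6, 0.4, 0.1, 0.0]
context_2 = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0, 0.0]

boundary_1 = category_boundary(steps, context_1)
boundary_2 = category_boundary(steps, context_2)

# The difference between the two boundaries is the shift attributed
# to the context manipulation.
print(boundary_1, boundary_2, boundary_1 - boundary_2)
```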
Studies
“Auditory-visual integration of talker gender in
vowel perception” (Johnson 1999)
Exp. 1 found that the gender of auditory-visually
presented stimuli shifts the phoneme
boundary of a vowel continuum.
Exp. 2 found that visual phonetic information is
integrated in the boundary shift.
Exp. 3 showed that listeners integrate abstract
gender information with phonetic information
in speech perception.
Contents
1. Speech and speaker normalization in vowel
normalization: definition
2. Influencing parameters and instruments for vowel
normalization
3. Theories
4. Studies: Johnson 1990 and 1999
5. Recapitulation
Recapitulation
- Strong internal and external influences on the
perception (of vowels)
- Explanation must integrate repeated
learning
- Information on speaker identity influences
the perception (of vowels)
But: Is the perception of speaker identity
influenced by certain components of the
speech signal?
Can speaker identity be manipulated?
References
Bladon, R. A., Henton, C. G. & Pickering, J. B. (1984) Towards an auditory theory of speaker normalization. Language & Communication 4, 59-69.
Fujisaki, H. & Kawashima, T. (1968) The roles of pitch and higher formants in the perception of vowels. IEEE Transactions on Audio and Electroacoustics AU-16, 73-77.
Hillenbrand, J. M. & Nearey, T. M. (1999) Identification of resynthesized /hVd/ utterances: Effects of formant contour. J. Acoust. Soc. Am. 105, 3509-3523.
Ladefoged, P. & Broadbent, D. E. (1957) Information conveyed by vowels. J. Acoust. Soc. Am. 29, 98-104.
Leather, J. (1983) Speaker normalization in the perception of lexical tone. Journal of Phonetics 11, 373-382.
Lehiste, I. & Meltzer, D. (1973) Vowel and speaker identification in natural and synthetic speech. Language and Speech 16, 356-364.
Johnson, K., Strand, E. A. & D’Imperio, M. (1999) Auditory-visual integration of talker gender in vowel perception. Journal of Phonetics 27, 359-384.
Johnson, K. (2004) Speaker normalization in speech perception. Ohio State University.
Johnson, K. (1990) The role of perceived speaker identity in F0 normalization of vowels. J. Acoust. Soc. Am. 88, 642-654.
Kato, K. & Kakehi, K. (1988) Listener adaptability to individual speaker differences in monosyllabic speech perception. J. Acoust. Soc. of Japan 44, 180-186.
Magnuson, J. & Nusbaum, H. (1994) Are representations used for talker identification available for talker normalization? Proceedings of the International Conference on Spoken Language Processing.
Miller, R. L. (1953) Auditory tests with synthetic vowels. J. Acoust. Soc. Am. 25, 114-121.
Peterson, G. E. & Barney, H. L. (1952) Control methods used in the study of vowels. J. Acoust. Soc. Am. 24, 175-184.
Rubin, D. L. (1992) Non-language factors affecting undergraduates’ judgments of non-native English-speaking teaching assistants. Research in Higher Education 33(4).
Strand, E. A. & Johnson, K. (1996) Gradient and visual speaker normalization in the perception of fricatives. In Natural language processing and speech technology: Results of the 3rd KONVENS conference, Bielefeld (D. Gibbon, Ed.), Berlin: Mouton de Gruyter (pp. 14-26).