Acquiring and implementing phonetic knowledge
Louis C.W. Pols
Institute of Phonetic Sciences (IFA)
http://www.fon.hum.uva.nl/
Amsterdam Center for Language and
Communication (ACLC) / LOT
Faculty of Humanities, University of Amsterdam
Herengracht 338, Amsterdam, The Netherlands
Eurospeech 2001 - Scandinavia
Aalborg, Sept. 3, 2001, Keynote
why so excited?
[Figure: speech waveform, amplitude from -0.2692 to 0.3956, Time (s) from 0 to 0.408463]
speech & speech research are beautiful
doing, supervising, talking, publishing is fun
speech community is wonderful
ISCA, former ESCA, is the best
Paul Dalsgaard and colleagues, Aalborg, Denmark,
and Eurospeech 2001-Scandinavia unique
so…..
what better could happen to me than
getting this ISCA medal, here and now
in the year that I became 60!
and 75 years chair Phonetics in A’dam
outline
phonetic knowledge
acquiring and implementing that knowledge
30 years ago, 7th ICA in Budapest, Sept. 1971
nothing compared to G. Fant & K. Stevens (ESCA medallists) who
can easily talk about half a century of experience in speech research!
speech production and speech perception
supervising some 25 Ph.D. projects
speech acquisition (L1 and L2)
speech technology
speech databases
what might future bring us?
acquiring and implementing
phonetic knowledge
from speech production and speech perception
via speech analysis
via experimental procedures
via data mining in speech databases
via literature
formalizing and generalizing knowledge
applying knowledge via rules, statistical
procedures, proper selections, etc.
phonetic knowledge is
indispensable for
language acquisition (both L1 and L2)
education and training
aids for the handicapped
speech technology (analysis, coding, synthesis,
recognition, dialogs, translation, spotting)
but, see Eurospeech Special Event 7, Friday, 9:00-12:30
Integration of Phonetic Knowledge in Speech Technology:
a) Experiments and Experiences,
Presentations
b) Is Phonetic Knowledge any Use? Panel Discussion
7th ICA
Budapest
17th ICA now in Rome, Italy (Sept. 2-7)
every 3 years; first one in 1953 in Delft, Neth.
7th ICA in Budapest, Hungary (Sept. 1971),
plus subsequent Speech Symposium in Szeged
my first active participation in a major
(speech) conference
substantial international participation on speech
proper view of state-of-the-art 30 years ago
state-of-the-art 30 years ago (1)
speech perception
- Kasuya: effect of context on vowel perception
- Rao: plosive - vowel interaction
- Kozhevnikov: perception of AM vowel-like stimuli
- Chistovich: vowel discrimination, plus keynote on
  importance of psycho-acoustics for speech perception
- followed by Symposium on ‘Auditory Analysis and
Perception of Speech’, Leningrad, Aug. 1973
speech production
- Fujimura: dynamic palatography, electromyography,
and Tokyo x-ray microbeam system
state-of-the-art 30 years ago (2)
speech processing
- Velichko: dynamic programming
- Atal: initial ideas about predictive coding
speech synthesis (no rule synthesis, no diphones)
- Liljencrants & Fant: OVE III formant synthesizer
- Coker: articulatory synthesis
- Mermelstein and Atal: Vocal Tract transfer functions
- Rabiner: digital formant synthesizer
  ‘we were away a year ago’, ‘may we all learn a yellow lion roar’
- Denes: word concatenation
- Itakura: digital filters of ladder form for synthesis
state-of-the-art 30 years ago (3)
speech recognition (only template matching, simple
time normalization, no probabilistic approach)
- isolated word recognition (some 50 words)
  Erman: over telephone
  Neely: in noise
  (carefully spoken by one male speaker: Ken Stevens)
  Pols: dimensional representation of BF spectra
  Rao: diad matching
  Bonner DAWID-II system
- Sakoe: dynamic processing for time normalization
- Dreyfus-Graf: artificial language to simplify recogn.
- Flanagan: keynote on focal points in sp. comm. res.
state-of-the-art 30 years ago (4)
musical acoustics
- Sundberg: real time pitch extraction in folk music
- Mathews: music synthesis
psycho-acoustics
- Houtgast: psychophysical evidence for lateral inhibition
- Evans & Wilson: neurophysiological evidence
- Julesz: critical bands in vision and audition
- de Boer: reverse-correlation method
speech production and perception
three representative events:
Speech Recognition As pAttern Classification
(SPRAAC), MPI-workshop July 11-13, 2001
- van Son & Pols: ‘Phoneme recognition as a function
of task and context’
- Moore & Cutler: ‘Constraints on theories of human
vs. machine recognition of speech’
MIT Symposium on ‘Invariance and variability
of speech processes’, Cambridge, Oct. 1983
Symposium on ‘Auditory analysis and
perception of speech’, Leningrad, Aug. 1973
supervising some 25 Ph.D. projects
ideas and productivity via these students
Dutch habit: good-looking booklet of each thesis
plus reports at conf., workshops, and in open lit.
in 3 main fields of research
- early speech acquisition (normal/pathological)
- speech production and perception (normal/pathological)
- speech technology
joint responsibility for several projects
- daily supervision by Florien Koopmans-van Beinum
- with colleague promotores
NAME                          YEAR   early acq.
Loes Klaassen-Don (UL)        1983
Gerrit Bloothooft (VU)        1985
Louis ten Bosch               1991
Herman Steeneken              1992
Paul van Alphen               1992
Mirjam Tielen                 1992
Amos van Gelderen             1992
Cecile Kuijpers               1993   x
Rob van Son                   1993
Jeannette van der Stelt       1993   x
Dick van Bergem               1995
Astrid van Wieringen          1995
Henning Reetz                 1996
Xue Wang                      1997
Irma Verdonck-de Leeuw        1998
Paul Boersma                  1998
Sylvie Mozziconacci (TUE)     1998
Kino Jansonius-Schultheiss    1999   x
Monique van Donzel            1999
Jan van Dijk                  2001
Ahmed Elgendy                 2001
Corina van As                 2001
[columns prod./perc. and techn.: 22 x marks in total; their assignment to rows is not recoverable]
Univ. of Amsterdam
Sept. 26, 2001
coarticulatory effects on the schwa
D. van Bergem (1995)
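stylized F2-tracks with second order polynomials
F2-track of the schwa via model prediction:
$F_2(t) = \bar{F}_2(t) + \Delta_{2C_1}(t) + \Delta_{2C_2}(t) + \Delta_{2V}(t)$
with $\Delta_{2C_1}$, $\Delta_{2C_2}$, and $\Delta_{2V}$ at onset, center, offset (derived from actual data)
[Figure panels: contexts t-n and w-l]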
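As an illustration of what a second-order stylization involves, the sketch below builds the unique parabola through three F2 values at onset, center, and offset. The normalized time axis and the F2 values in Hz are hypothetical, and this is only one plausible way to turn three values into a track, not necessarily the procedure used in the thesis.

```python
def quadratic_through(p0, p1, p2):
    """Return f(t), the second-order polynomial through three (t, F2) points."""
    (t0, f0), (t1, f1), (t2, f2) = p0, p1, p2

    def track(t):
        # Lagrange form of the interpolating parabola.
        return (
            f0 * (t - t1) * (t - t2) / ((t0 - t1) * (t0 - t2))
            + f1 * (t - t0) * (t - t2) / ((t1 - t0) * (t1 - t2))
            + f2 * (t - t0) * (t - t1) / ((t2 - t0) * (t2 - t1))
        )

    return track


# Hypothetical F2 values (Hz) at onset, center, and offset of a schwa,
# on a time axis normalized to the interval 0..1.
schwa_track = quadratic_through((0.0, 1700.0), (0.5, 1500.0), (1.0, 1400.0))

# Sample the stylized track at 11 equally spaced points.
stylized = [schwa_track(i / 10) for i in range(11)]
```

The returned function reproduces the three input values exactly and interpolates smoothly in between.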
Gradual Learning Algorithm (GLA)
P. Boersma (1998)
Word type                          light     dark      judgment     conjectured frequency
                                   version   version   difference   of light variant (%)
light                              1.30      6.10       4.80        99.956
Louanne                            1.10      5.55       4.45        99.923
gray-ling, gai-ly, free-ly         1.57      3.34       1.77        94.53
Mailer, Hayley, Greeley, Daley     1.90      2.64       0.74        76.69
mail-er, hail-y, gale-y, feel-y    3.01      2.01      -1.00        16.67
mail it                            4.40      1.10      -3.30         0.49
bell, help                         6.60      1.12      -5.48         0.0011
(light/dark version: average score of 10 raters; acceptability 1 = high, 7 = low)
word forms presented to GLA according to this probability of ‘lightness’
Boersma & Hayes, Linguistic Inquiry 32(1), 2001, 45-86
OT-type ranked Constraints                   Ranking value
IDENT-OO(vowel features, phrasal)            108.146
DARK [ł] IS POSTVOCALIC                      107.760
PRETONIC /l/ IS LIGHT                        103.422
IDENT-OO(vowel features, morphological)      103.394
PREVOCALIC /l/ IS LIGHT                      100.786
/l/ IS DARK                                   99.084
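To show how continuous ranking values can yield variable outputs, here is a minimal sketch of stochastic evaluation: each ranking value receives Gaussian noise at evaluation time, and the constraints are then sorted. The ranking values are those listed above. The noise standard deviation of 2.0 is an assumption not stated on the slide, and the function names are illustrative only.

```python
import math
import random

# Ranking values as listed on the slide.
RANKING = {
    "IDENT-OO(vowel features, phrasal)": 108.146,
    "DARK [ł] IS POSTVOCALIC": 107.760,
    "PRETONIC /l/ IS LIGHT": 103.422,
    "IDENT-OO(vowel features, morphological)": 103.394,
    "PREVOCALIC /l/ IS LIGHT": 100.786,
    "/l/ IS DARK": 99.084,
}

NOISE_SD = 2.0  # assumption: evaluation noise, not given on the slide


def sample_ranking(rng, noise_sd=NOISE_SD):
    """Draw one noisy evaluation and return constraints from highest to lowest."""
    noisy = {name: value + rng.gauss(0.0, noise_sd) for name, value in RANKING.items()}
    return sorted(noisy, key=noisy.get, reverse=True)


def p_outranks(a, b, noise_sd=NOISE_SD):
    """Probability that constraint a outranks b in a single noisy evaluation.

    The difference of two independent Gaussian noises has standard deviation
    noise_sd * sqrt(2), so the probability is Phi(diff / (noise_sd * sqrt(2))).
    """
    diff = RANKING[a] - RANKING[b]
    return 0.5 * (1.0 + math.erf(diff / (2.0 * noise_sd)))


rng = random.Random(1)
one_evaluation = sample_ranking(rng)
p_close = p_outranks("PRETONIC /l/ IS LIGHT", "IDENT-OO(vowel features, morphological)")
p_far = p_outranks("IDENT-OO(vowel features, phrasal)", "/l/ IS DARK")
```

Under this assumption, constraints with nearly equal ranking values swap order in roughly half of the evaluations, while constraints that are far apart almost never do.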
running this grammar for 1 million trials results in
Word type                          Observed     Projected        Modeled          Predicted
                                   judgment     freq. of light   freq. of light   judgment
                                   difference   variant (%)      variant (%)      difference
light                               4.80        99.956           99.938            4.59
Louanne                             4.45        99.923           99.904            4.31
gray-ling, gai-ly, free-ly          1.77        94.53            95.76             1.94
Mailer, Hayley, Greeley, Daley      0.74        76.69            72.62             0.61
mail-er, hail-y, gale-y, feel-y    -1.00        16.67            16.63            -1.00
mail it                            -3.30         0.49             0.47            -3.33
bell, help                         -5.48         0.0011           0                -6.00
excellent fit!
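One simple way to put a number on the fit is the mean and maximum absolute deviation between the paired columns. The sketch below uses the values from the slide; the choice of these two measures is an assumption for illustration, not the evaluation reported by the authors.

```python
# Values copied from the slide, one entry per word type.
projected = [99.956, 99.923, 94.53, 76.69, 16.67, 0.49, 0.0011]
modeled = [99.938, 99.904, 95.76, 72.62, 16.63, 0.47, 0.0]
observed_diff = [4.80, 4.45, 1.77, 0.74, -1.00, -3.30, -5.48]
predicted_diff = [4.59, 4.31, 1.94, 0.61, -1.00, -3.33, -6.00]


def mean_abs_error(xs, ys):
    """Mean absolute deviation between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def max_abs_error(xs, ys):
    """Largest absolute deviation between two equally long sequences."""
    return max(abs(x - y) for x, y in zip(xs, ys))


freq_mae = mean_abs_error(projected, modeled)
freq_max = max_abs_error(projected, modeled)
judg_mae = mean_abs_error(observed_diff, predicted_diff)
```

The largest frequency deviation, about 4 percentage points, occurs for the word type Mailer, Hayley, Greeley, Daley.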
speech signal processing package
‘praat’
mainly developed and maintained by P. Boersma
meanwhile >4000 registered users in 85 countries
freely available upon request
(http://www.fon.hum.uva.nl/praat/)
for all common platforms: Macintosh, Windows,
Linux, SGI, Solaris, HP-UX
user friendly, excellent graphical output, scriptable
see demo at Educational Arena (Thu. afternoon)
praat = ‘doing phonetics by computer’
used, among other things, for transcriptions in Spoken Dutch Corpus
phonetic knowledge and
early speech acquisition
source filter description system (FvB-JvdSt)
early indicators for dyslexia (C. Schwippert)
early hearing screening with babies
but, early detection requires early intervention
optimizing digital hearing aids
objective adaptation of hearing aids for babies
cochlear implants, also for young babies
early speech development
[Table: developmental stages by Phonation type (No Phonation, Uninterrupted Phonation, Interrupted Phonation, Variegated Unint. Phon., Variegated Interrupted Phon.) and Articulation type (No Art, One Art, Two Art); cells give Stage I to Stage V, some combined (Stage III + IV, Stage II + III + IV, Stage IV + V); average onset (in weeks): 0, 6, 10, 20, 31, 40]
vB, Cl, vdD, Developmental Sc. 4(1), 2001, 61-70; see poster, sess. C26
phonetic knowledge and
speech technology (1)
speech technology barely existed 30 years ago
ideal test bed for all acquired speech knowledge
speech synthesis
- fully natural synthetic speech (including multilingual
and in various speaking styles): text interpretation
and speech generation problem solved
- even better if optimized for noisy and reverberant
conditions and for non-natives and elderly people
speech understanding
- full performance: speaker adaptation, robust word
recognition, and speech understanding problem solved
predicting prominence
Ph.D. project Barbertje Streefkerk (oral, sess. B32)
acoustical and/or textual features to predict prom.
(for ASR and rule synthesis purposes, respectively)
prominence = judgment by listeners at word level
textual feat.: POS (11 categ.), #syll, word pos., co-occ.
rule set to predict prom. (level 0-4): for results see paper
acoustical features (7), with additional normalizations (+5):
- F0: median & range, syll. & word; additional: median sent.
- duration: vowel, syllable; additional: Vnorm.; sent. rate
- intensity: vowel; additional: Vnorm.; sentence
neural net predictor: ~ 82% best score (prom. 0 / 1)
phonetic knowledge and
speech technology (2)
speech technological needs for handicapped
- artificial voice for laryngectomized speakers
- better digital hearing aid for hearing impaired
- better cochlear implant for deaf
- natural speech output for visually impaired
- training aids for speech and language impaired
speech technology in education and training
phonetic knowledge in
speech databases
speech databases potentially are a wealth of
phonetic knowledge
requires annotation (manual or automatic) at
various levels (from segmental to prosodic &
linguistic)
requires SQL-type access & intelligent data mining
new ways of defining knowledge, e.g.
- duration modeling
- pronunciation variants
- concatenative synthesis (best match)
2 examples
Spoken Dutch Corpus
- Dutch-Flemish project, start June 1998, 5 years
- 10M words: ~1000 hrs of speech, many styles/speakers
- for all 10M: orthography, lemmas, POS
- for 1M: phonetic and syntactic annotation
- for 250k: prosodic annotation
IFA corpus (Dutch), R. van Son (poster, sess. D36)
- few speakers (4 M and 4 F), but >30 min./speaker
- various speaking styles per speaker, and
  all material phonemically segmented and labeled
- free access via SQL query language
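To make the idea of SQL-type access concrete, the sketch below queries a small in-memory table of labeled segments for token counts and mean durations per style and word position. The table name, column names, and all rows are hypothetical; this is not the actual schema or data of the IFA corpus.

```python
import sqlite3

# Hypothetical schema: one row per phonemically labeled segment.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments ("
    " speaker TEXT, style TEXT, phoneme TEXT,"
    " word_position TEXT, duration_ms REAL)"
)

# Invented example rows, for illustration only.
rows = [
    ("F1", "read", "n", "initial", 72.0),
    ("F1", "read", "n", "medial", 58.0),
    ("F1", "spontaneous", "n", "medial", 49.0),
    ("M1", "read", "s", "initial", 95.0),
    ("M1", "spontaneous", "s", "medial", 70.0),
    ("M1", "spontaneous", "n", "medial", 51.0),
]
conn.executemany("INSERT INTO segments VALUES (?, ?, ?, ?, ?)", rows)

# Token count and mean duration of /n/ per speaking style and word position.
query = (
    "SELECT style, word_position, COUNT(*), AVG(duration_ms) "
    "FROM segments WHERE phoneme = ? "
    "GROUP BY style, word_position "
    "ORDER BY style, word_position"
)
result = conn.execute(query, ("n",)).fetchall()
conn.close()
```

Grouping in the query itself keeps the selection criteria (phoneme, style, position) explicit and easy to vary, which is the kind of data mining the slide has in mind.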
Spoken Dutch Corpus
W. Levelt (chairman Board), J.P. Martens (overall
coordinator), Nijmegen Univ. (Dutch coordination)
so far, mainly ‘project-internal’ results, e.g.
optimizing transcription protocols, e.g.
- orthographic (using ‘praat’)
- phonetic: ‘doe ik’: du-w-Ik; ‘is zes’: Is_sEs
determining consistency and efficiency (costs)
optimizing automatic procedures for
- POS-tagging & lemmatization
- syntactic annotation (semi-automatic)
- grapheme-to-phoneme conversion
- word alignment
IFA corpus: Consonant duration
Intervocalic Nasals, Fricatives, Stops, and Glides
in Spontaneous and Read connected speech (2 or more syllable words)
accounting for the effects of speaker (8), style, and phoneme identity
word freq. < 1/4000 (CELEX); words not at sentence boundary
[Figure: Corrected Mean Duration (ms, axis from 50 to 90) against Within Word Position (Initial, Medial, Final) for read str., read unstr., spont. str., and spont. unstr.; N = 6332; token counts per curve: (202, 295, 20), (715, 837, 75), (96, 810, 94), (285, 2586, 317); dashed / +: p < 0.005 (Mann-Whitney Signed Rank test)]
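The significance marks in the duration figure come from a rank-based test. As a rough illustration, the sketch below computes the Mann-Whitney U statistic for two independent samples, with a two-sided p-value from the normal approximation. Treating the test as the rank-sum (U) variant is an assumption, the approximation omits tie and continuity corrections, and the durations are invented.

```python
import math


def mann_whitney_u(a, b):
    """Return (U_a, U_b): U_a counts pairs with a-value > b-value, ties as 0.5."""
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    return u_a, len(a) * len(b) - u_a


def two_sided_p_normal(a, b):
    """Two-sided p-value via the normal approximation (no tie/continuity correction)."""
    n1, n2 = len(a), len(b)
    u_a, _ = mann_whitney_u(a, b)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u_a - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical consonant durations in ms (not corpus data).
read_ms = [82, 79, 88, 91, 77, 85]
spont_ms = [61, 70, 66, 58, 73, 64]

u_read, u_spont = mann_whitney_u(read_ms, spont_ms)
p_value = two_sided_p_normal(read_ms, spont_ms)
```

For samples as small as these an exact test would be preferable; the normal approximation is better suited to the token counts shown in the figure.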
some conclusions (1)
let speech speak for itself (speech databases)
25 Ph.D. students can do much more than one
(administratively overloaded) senior
despite skepticism: much progress in last 30 yrs
over 10,000 active in spoken language community
?
700 papers at E’01 > all speech papers in 1971
JASA: speech 2nd (14.4%) in 1999 (N<700); 6th in 1970 (5.1%)
joint phonetic knowledge is insufficient to solve
today’s communicative demands
some conclusions (2)
speech is most natural form of communication,
however, natural HC dialog is far away
synthetic speech is intelligible, but no proper
control over naturalness and speaker/style char.
ASR requires greater robustness and quicker
adaptation
speech and language technology could be used
more in education, language training and aids for
the handicapped
much basic knowledge about sp. perc. still missing
some intriguing questions
how do listeners normalize over speakers?
how do listeners handle speech variation?
is there always a cause for any variation?
what is a realistic and efficient front-end?
- also for noisy speech and high-pitched voices
how do we acquire our mother tongue and a
foreign language?
what are implications of speaking/hearing defect
- plus hearing aid and cochlear implant
epilogue
privileged to have been part of this lively
speech community for over 30 years
high expectations of progress to come
phonetic knowledge nowadays more accessible
and easier to implement in
- descriptive models (computational phonetics), and
- technological systems
thank you all for your kind attention!