Acquiring and implementing phonetic knowledge

Louis C.W. Pols
Institute of Phonetic Sciences (IFA)
http://www.fon.hum.uva.nl/
Amsterdam Center for Language and
Communication (ACLC) / LOT
Faculty of Humanities, University of Amsterdam
Herengracht 338, Amsterdam, The Netherlands
Eurospeech 2001 - Scandinavia
Aalborg, Sept. 3, 2001, Keynote
why so excited?
• speech & speech research are beautiful
• doing, supervising, talking, publishing is fun
• speech community is wonderful
• ISCA, former ESCA, is the best
• Paul Dalsgaard c.s., Aalborg, Denmark, and Eurospeech 2001-Scandinavia unique
so...
what better could happen to me than getting this ISCA medal, here and now, in the year that I became 60!
and 75 years chair Phonetics in A’dam
[Figure: speech waveform; horizontal axis Time (s), 0 to 0.408463]
outline
• phonetic knowledge
• acquiring and implementing that knowledge
• 30 years ago, 7th ICA in Budapest, Sept. 1971
  - nothing compared to G. Fant & K. Stevens (ESCA medallists), who can easily talk about half a century of experience in speech research!
• speech production and speech perception
• supervising some 25 Ph.D. projects
• speech acquisition (L1 and L2)
• speech technology
• speech databases
• what might the future bring us?
acquiring and implementing phonetic knowledge
• from speech production and speech perception
• via speech analysis
• via experimental procedures
• via data mining in speech databases
• via literature
• formalizing and generalizing knowledge
• applying knowledge via rules, statistical procedures, proper selections, etc.
phonetic knowledge is indispensable for
• language acquisition (both L1 and L2)
• education and training
• aids for the handicapped
• speech technology (analysis, coding, synthesis, recognition, dialogs, translation, spotting)
but, see Eurospeech Special Event 7, Friday, 9:00-12:30
Integration of Phonetic Knowledge in Speech Technology:
a) Experiments and Experiences, Presentations
b) Is Phonetic Knowledge any Use? Panel Discussion
7th ICA, Budapest
• 17th ICA now in Rome, Italy (Sept. 2-7)
• every 3 years; first one in 1953 in Delft, Neth.
• 7th ICA in Budapest, Hungary (Sept. 1971), plus subsequent Speech Symposium in Szeged
• my first active participation in a major (speech) conference
• substantial international participation on speech
• proper view of state-of-the-art 30 years ago
state-of-the-art 30 years ago (1)
• speech perception
  - Kasuya: effect of context on vowel perception
  - Rao: plosive - vowel interaction
  - Kozhevnikov: perception of AM vowel-like stimuli
  - Chistovich: vowel discrimination, plus keynote on importance of psycho-acoustics for speech perception
  - followed by Symposium on ‘Auditory Analysis and Perception of Speech’, Leningrad, Aug. 1973
• speech production
  - Fujimura: dynamic palatography, electromyography, and Tokyo x-ray microbeam system
state-of-the-art 30 years ago (2)
• speech processing
  - Velichko: dynamic programming
  - Atal: initial ideas about predictive coding
• speech synthesis (no rule synthesis, no diphones)
  - Liljencrants & Fant: OVE III formant synthesizer
  - Coker: articulatory synthesis
  - Mermelstein and Atal: Vocal Tract transfer functions
  - Rabiner: digital formant synthesizer
    ‘we were away a year ago’, ‘may we all learn a yellow lion roar’
  - Denes: word concatenation
  - Itakura: digital filters of ladder form for synthesis
state-of-the-art 30 years ago (3)
• speech recognition (only template matching, simple time normalization, no probabilistic approach)
  - isolated word recognition (some 50 words)
    Erman: over telephone
    Neely: in noise
    (carefully spoken by one male speaker: Ken Stevens)
    Pols: dimensional representation of BF spectra
    Rao: diad matching
    Bonner DAWID-II system
  - Sakoe: dynamic processing for time normalization
  - Dreyfus-Graf: artificial language to simplify recogn.
  - Flanagan: keynote on focal points in sp. comm. res.
state-of-the-art 30 years ago (4)
• musical acoustics
  - Sundberg: real time pitch extraction in folk music
  - Mathews: music synthesis
• psycho-acoustics
  - Houtgast: psychophysical evidence for lateral inhibition
  - Evans & Wilson: neurophysiological evidence
  - Julesz: critical bands in vision and audition
  - de Boer: reverse-correlation method
speech production and perception
three representative events:
• Speech Recognition As pAttern Classification (SPRAAC), MPI-workshop July 11-13, 2001
  - van Son & Pols: ‘Phoneme recognition as a function of task and context’
  - Moore & Cutler: ‘Constraints on theories of human vs. machine recognition of speech’
• MIT Symposium on ‘Invariance and variability of speech processes’, Cambridge, Oct. 1983
• Symposium on ‘Auditory analysis and perception of speech’, Leningrad, Aug. 1973
supervising some 25 Ph.D. projects
• ideas and productivity via these students
• Dutch habit: good-looking booklet of each thesis
• plus reports at conf., workshops, and in open lit.
• in 3 main fields of research
  - early speech acquisition (normal/pathological)
  - speech production and perception (normal/pathological)
  - speech technology
• joint responsibility for several projects
  - daily supervision by Florien Koopmans-van Beinum
  - with colleague promotores
NAME                         YEAR  early acq.
Loes Klaassen-Don (UL)       1983
Gerrit Bloothooft (VU)       1985
Louis ten Bosch              1991
Herman Steeneken             1992
Paul van Alphen              1992
Mirjam Tielen                1992
Amos van Gelderen            1992
Cecile Kuijpers              1993  x
Rob van Son                  1993
Jeannette van der Stelt      1993  x
Dick van Bergem              1995
Astrid van Wieringen         1995
Henning Reetz                1996
Xue Wang                     1997
Irma Verdonck-de Leeuw       1998
Paul Boersma                 1998
Sylvie Mozziconacci (TUE)    1998
Kino Jansonius-Schultheiss   1999  x
Monique van Donzel           1999
Jan van Dijk                 2001
Ahmed Elgendy                2001
Corina van As                2001
(columns prod./perc. and techn.: 22 marks in total; their assignment to rows is not recoverable from this transcript)
Univ. of Amsterdam
Sept. 26, 2001
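coarticulatory effects on the schwa
D. van Bergem (1995)
stylized F2-tracks with second order polynomials
F2-track of the schwa via model prediction:
$\hat{F}_2(t) = \bar{F}_2(t) + \Delta_{2,C_1}(t) + \Delta_{2,C_2}(t) + \Delta_{2,V}(t)$
with three parameters, at onset, center, and offset (derived from actual data)
(the original Greek symbols did not survive extraction; the hat, bar, and Δ are stand-ins)
[Figure panels: contexts t-n and w-l]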
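As an illustration of the idea on this slide, the following minimal Python sketch builds a stylized F2 track as a second-order polynomial through three anchor values (onset, center, offset) and sums a mean track with separate contributions for C1, C2, and V. The function names, the normalized time axis from 0 to 1, and all numbers in Hz are assumptions made for the example; they are not taken from van Bergem's data or implementation.

```python
def parabola_through(onset, center, offset):
    """Return f(t), the second-order polynomial through
    (0, onset), (0.5, center) and (1, offset) on a normalized time axis."""
    a = 2.0 * (onset - 2.0 * center + offset)
    b = offset - onset - a
    c = onset
    return lambda t: a * t * t + b * t + c


def predict_schwa_f2(mean_track, c1_shift, c2_shift, v_shift):
    """Additive prediction: each argument is an (onset, center, offset)
    triple in Hz; the shifts are added to the mean track point by point."""
    summed = tuple(
        m + s1 + s2 + sv
        for m, s1, s2, sv in zip(mean_track, c1_shift, c2_shift, v_shift)
    )
    return parabola_through(*summed)


# Hypothetical values, chosen only to show the mechanics.
track = predict_schwa_f2(
    mean_track=(1500.0, 1500.0, 1500.0),
    c1_shift=(200.0, 50.0, 0.0),      # preceding consonant pulls the onset
    c2_shift=(0.0, -40.0, -150.0),    # following consonant pulls the offset
    v_shift=(0.0, 20.0, 60.0),        # neighbouring vowel, small effect
)
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"t = {t:.2f}  F2 = {track(t):.1f} Hz")
```

Because fitting a parabola through three points is a linear operation, summing the anchor values first gives the same track as summing four separately fitted parabolas.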
Gradual Learning Algorithm (GLA)
P. Boersma (1998)
average acceptability score of 10 raters (1 = high, 7 = low)

Word type                        light    dark     judgment    conjectured frequency
                                 version  version  difference  of light variant (%)
light                            1.30     6.10      4.80       99.956
Louanne                          1.10     5.55      4.45       99.923
gray-ling, gai-ly, free-ly       1.57     3.34      1.77       94.53
Mailer, Hayley, Greeley, Daley   1.90     2.64      0.74       76.69
mail-er, hail-y, gale-y, feel-y  3.01     2.01     -1.00       16.67
mail it                          4.40     1.10     -3.30       0.49
bell, help                       6.60     1.12     -5.48       0.0011

word forms presented to GLA according to this probability of ‘lightness’
Boersma & Hayes, Linguistic Inquiry 32(1), 2001, 45-86
OT-type ranked Constraints

Constraint                               Ranking value
IDENT-OO(vowel features, phrasal)        108.146
DARK [ł] IS POSTVOCALIC                  107.760
PRETONIC /l/ IS LIGHT                    103.422
IDENT-OO(vowel features, morphological)  103.394
PREVOCALIC /l/ IS LIGHT                  100.786
/l/ IS DARK                              99.084
running this grammar for 1 million trials results in

Word type                        Observed    Projected      Modeled        Predicted
                                 judgment    freq. of light freq. of light judgment
                                 difference  variant (%)    variant (%)    difference
light                             4.80       99.956         99.938          4.59
Louanne                           4.45       99.923         99.904          4.31
gray-ling, gai-ly, free-ly        1.77       94.53          95.76           1.94
Mailer, Hayley, Greeley, Daley    0.74       76.69          72.62           0.61
mail-er, hail-y, gale-y, feel-y  -1.00       16.67          16.63          -1.00
mail it                          -3.30       0.49           0.47           -3.33
bell, help                       -5.48       0.0011         0              -6.00

excellent fit!
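To make the learning procedure behind these numbers more concrete, here is a minimal Python sketch of a stochastic ranking evaluation and a gradual, error-driven update step. It is a toy under stated assumptions, not the grammar of Boersma & Hayes: the two constraint names are borrowed from the slide, but the violation marks in the tableau, the evaluation noise (2.0), the plasticity (0.1), the initial ranking values (100), and the 90% share of light forms are all hypothetical.

```python
import random

EVALUATION_NOISE = 2.0  # assumed standard deviation of the ranking noise
PLASTICITY = 0.1        # assumed step size of each ranking adjustment


def evaluate(ranking, tableau, rng):
    """Return the winning candidate under one noisy evaluation.

    ranking: dict constraint -> ranking value
    tableau: dict candidate -> dict constraint -> number of violations
    """
    noisy = {c: v + rng.gauss(0.0, EVALUATION_NOISE) for c, v in ranking.items()}
    order = sorted(noisy, key=noisy.get, reverse=True)  # highest-ranked first
    # The winner has the lexicographically smallest violation profile.
    return min(tableau, key=lambda cand: [tableau[cand].get(c, 0) for c in order])


def learn(ranking, tableau, observed, rng):
    """One learning step: adjust rankings only if the learner's output is wrong."""
    produced = evaluate(ranking, tableau, rng)
    if produced == observed:
        return
    for c in ranking:
        wrong = tableau[produced].get(c, 0)
        right = tableau[observed].get(c, 0)
        if right > wrong:    # constraint penalizes the observed form: demote
            ranking[c] -= PLASTICITY
        elif wrong > right:  # constraint penalizes the learner's error: promote
            ranking[c] += PLASTICITY


# Toy tableau with hypothetical violation marks.
tableau = {
    "light": {"/l/ IS DARK": 1},
    "dark": {"PREVOCALIC /l/ IS LIGHT": 1},
}
ranking = {"/l/ IS DARK": 100.0, "PREVOCALIC /l/ IS LIGHT": 100.0}
rng = random.Random(0)

for _ in range(10000):
    observed = "light" if rng.random() < 0.9 else "dark"
    learn(ranking, tableau, observed, rng)

share = sum(evaluate(ranking, tableau, rng) == "light" for _ in range(10000)) / 10000
print(ranking, f"share of light outputs: {share:.2f}")
```

Because updates happen only on mismatches, the ranking values stop drifting once the learner's output frequencies roughly match the frequencies in the data, which is how a frequency-matching fit like the one in the table can arise.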
speech signal processing package ‘praat’
• mainly developed and maintained by P. Boersma
• meanwhile >4000 registered users in 85 countries
• freely available upon request (http://www.fon.hum.uva.nl/praat/)
• for all common platforms: Macintosh, Windows, Linux, SGI, Solaris, HP-UX
• user friendly, excellent graphical output, scriptable
• see demo at Educational Arena (Thu. afternoon)
• praat = ‘doing phonetics by computer’
• used, among other things, for transcriptions in Spoken Dutch Corpus
phonetic knowledge and early speech acquisition
• source filter description system (FvB-JvdSt)
• early indicators for dyslexia (C. Schwippert)
• early hearing screening with babies
• but, early detection requires early intervention
• optimizing digital hearing aids
• objective adaptation of hearing aids for babies
• cochlear implants, also for young babies
early speech development
[Table: developmental stages (Stage I to Stage V, including combined stages such as III + IV and II + III + IV) by articulation type (No Art, One Art, Two Art) and phonation type (No Phonation, Uninterrupted Phonation, Interrupted Phonation, Variegated Unint. Phon., Variegated Interrupted Phon.); average onset (in weeks) marked at 0, 6, 10, 20, 31, 40]
vB, Cl, vdD, Developmental Sc. 4(1), 2001, 61-70; see poster, sess. C26
phonetic knowledge and speech technology (1)
• speech technology barely existed 30 years ago
• ideal test bed for all acquired speech knowledge
• speech synthesis
  - fully natural synthetic speech (including multilingual and in various speaking styles) -> text interpretation and speech generation problem solved
  - even better if optimized for noisy and reverberant conditions and for non-natives and elderly people
• speech understanding
  - full performance -> speaker adaptation, robust word recognition, and speech understanding problem solved
predicting prominence
• Ph.D. project Barbertje Streefkerk (oral, sess. B32)
• acoustical and/or textual features to predict prom. (for ASR and rule synthesis purposes, respectively)
• prominence = judgment by listeners at word level
• textual feat.: POS (11 categ.), #syll, word pos., co-occ.
  - rule set to predict prom. (level 0-4): for results see paper
• acoustical features (7), with additional features (+5):
  - F0: median & range, syll. & word; additional: median sent.
  - duration: vowel, syllable; additional: Vnorm.; sent. rate
  - intensity: vowel; additional: Vnorm.; sentence
  - neural net predictor: ~82% best score (prom. 0 / 1)
phonetic knowledge and speech technology (2)
• speech technological needs for handicapped
  - artificial voice for laryngectomized speakers
  - better digital hearing aid for hearing impaired
  - better cochlear implant for deaf
  - natural speech output for visually impaired
  - training aids for speech and language impaired
• speech technology in education and training
phonetic knowledge in speech databases
• speech databases potentially are a wealth of phonetic knowledge
• requires annotation (manual or automatic) at various levels (from segmental to prosodic & linguistic)
• requires SQL-type access & intelligent data mining
• new ways of defining knowledge, e.g.
  - duration modeling
  - pronunciation variants
  - concatenative synthesis (best match)
2 examples
• Spoken Dutch Corpus
  - Dutch-Flemish project, start June 1998, 5 years
  - 10M words: ~1000 hrs of speech, many styles/speakers
  - for all 10M: orthography, lemmas, POS
  - for 1M: phonetic and syntactic annotation
  - for 250k: prosodic annotation
• IFA corpus (Dutch), R. van Son (poster, sess. D36)
  - few speakers (4 M and 4 F), but >30 min./speaker
  - various speaking styles per speaker, and all material phonemically segmented and labeled
  - free access via SQL query language
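What SQL-type access to a segmented and labeled corpus might look like can be sketched with Python's built-in sqlite3 module. The table name, column names, and all rows below are invented for illustration; they do not reflect the actual IFA corpus schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per labeled segment.
conn.execute(
    """CREATE TABLE segment (
           speaker     TEXT,
           style       TEXT,     -- 'read' or 'spont'
           phoneme     TEXT,
           position    TEXT,     -- 'initial', 'medial' or 'final' within the word
           stressed    INTEGER,  -- 1 = stressed syllable, 0 = unstressed
           duration_ms REAL
       )"""
)

# Made-up example rows.
rows = [
    ("F1", "read", "n", "initial", 1, 80.0),
    ("F1", "read", "n", "medial", 0, 60.0),
    ("M1", "read", "m", "initial", 1, 90.0),
    ("M1", "spont", "n", "initial", 0, 70.0),
    ("F1", "spont", "s", "medial", 0, 95.0),
    ("M1", "spont", "m", "medial", 0, 50.0),
]
conn.executemany("INSERT INTO segment VALUES (?, ?, ?, ?, ?, ?)", rows)

# Mean nasal duration per speaking style and within-word position.
query = """
    SELECT style, position, COUNT(*) AS n, AVG(duration_ms) AS mean_ms
    FROM segment
    WHERE phoneme IN ('n', 'm')
    GROUP BY style, position
    ORDER BY style, position
"""
result = conn.execute(query).fetchall()
for style, position, n, mean_ms in result:
    print(f"{style:5s} {position:7s} n={n} mean={mean_ms:.1f} ms")
```

A query of this kind is one way to do the data mining mentioned earlier: the same table can be regrouped by speaker, stress, or phoneme class without touching the audio again.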
Spoken Dutch Corpus
• W. Levelt (chairman Board), J.P. Martens (overall coordinator), Nijmegen Univ. (Dutch coordination)
• so far, mainly ‘project-internal’ results, e.g.
• optimizing transcription protocols, e.g.
  - orthographic (using ‘praat’)
  - phonetic ‘doe ik’: du-w-Ik ‘is zes’: Is_sEs
• determining consistency and efficiency (costs)
• optimizing automatic procedures for
  - POS-tagging & lemmatization
  - syntactic annotation (semi-automatic)
  - grapheme-to-phoneme conversion
  - word alignment
IFA corpus: Consonant duration
Intervocalic Nasals, Fricatives, Stops, and Glides in Spontaneous and Read connected speech (words of 2 or more syllables)
accounting for the effects of speaker (8), style, and phoneme identity
word freq. < 1/4000 (CELEX); words not at sentence boundary
[Figure: Corrected Mean Duration (ms, axis from 50 to 90) against Within Word Position (Initial, Medial, Final) for four conditions: read str., read unstr., spont. str., spont. unstr.; N = 6332; token counts shown in the plot: (202, 295, 20), (715, 837, 75), (96, 810, 94), (285, 2586, 317); dashed / +: p < 0.005 (Mann-Whitney Signed Rank test)]
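The significance marks in this figure come from a rank-based test. The slide's label mixes the names of two tests, so the sketch below assumes the Mann-Whitney U (rank-sum) variant for two independent samples. It is a pure-Python illustration with average ranks for ties and a normal approximation without tie correction; the function name and the duration values are made up.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, z) for two independent samples.

    U is the smaller of the two U statistics; z is its normal
    approximation (no tie or continuity correction)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    u = min(u_x, n1 * n2 - u_x)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return u, (u - mean_u) / sd_u


# Hypothetical consonant durations in ms for two speaking styles.
spont = [52.0, 55.0, 58.0]
read = [61.0, 66.0, 70.0]
u, z = mann_whitney_u(spont, read)
print(f"U = {u}, z = {z:.2f}")
```

With samples as small as in this toy example an exact table would be preferable to the normal approximation; with thousands of tokens, as in the figure, the approximation is the usual choice.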
some conclusions (1)
• let speech speak for itself (speech databases)
• 25 Ph.D. students can do much more than one (administratively overloaded) senior
• despite skepticism: much progress in last 30 yrs
• over 10,000 active in spoken language community?
  - 700 papers at E’01 > all speech papers in 1971
  - JASA: speech 2nd (14.4%) in 1999 (N<700); 6th in 1970 (5.1%)
• joint phonetic knowledge is insufficient to solve today’s communicative demands
some conclusions (2)
• speech is the most natural form of communication; however, natural HC dialog is far away
• synthetic speech is intelligible, but no proper control over naturalness and speaker/style char.
• ASR requires greater robustness and quicker adaptation
• speech and language technology could be used more in education, language training and aids for the handicapped
• much basic knowledge about sp. perc. still missing
some intriguing questions
• how do listeners normalize over speakers?
• how do listeners handle speech variation?
• is there always a cause for any variation?
• what is a realistic and efficient front-end?
  - also for noisy speech and high-pitched voices
• how do we acquire our mother tongue and a foreign language?
• what are the implications of a speaking/hearing defect?
  - plus hearing aid and cochlear implant
epilogue
• privileged to have been part of this lively speech community for over 30 years
• high expectations of progress to come
• phonetic knowledge nowadays more accessible
• and easier to implement in
  - descriptive models (computational phonetics), and
  - technological systems

thank you all for your kind attention!