Speech acoustics and phonetics

Louis C.W. Pols
Institute of Phonetic Sciences (IFA)
Amsterdam Center for Language and Communication (ACLC)

NATO-ASI “Dynamics of Speech Production and Perception”
Il Ciocco, Tuscany, Italy, July 1, 2002

Overview

- Dynamics in speech acoustics
- Contour modeling (mainly formants)
- Aspects of spectral undershoot
- Modeling V and C reduction
- Phonetic knowledge from speech corpora
  - IFA, CGN, TIMIT, found speech
- Conclusions

July 1st, 2002 Speech acoustics and phonetics, Il Ciocco

Dynamics in speech acoustics

- Dynamics is the norm, not stationarity
  - articulatory efficiency
- Dynamics is everywhere
  - generally no word boundaries in speech
  - deletion of words, syllables, phonemes; insertion
  - within/between word coarticulation/assimilation
  - vowel and consonant reduction
- Acoustic manifestations
  - segment duration, F0, loudness, spectral quality

Dynamics is the norm

- The speaker speaks as sloppily as the listeners allow in communication
  - communicative efficiency
- Articulatory vs. perceptual efficiency
  - do spectral transitions facilitate or hamper perception? -> see other presentation
- Speaker flexibility; speaking style (clear vs. sloppy); speaking rate

Dynamics is everywhere

- Deletion
  - ‘bread and butter’ /brEmbY3/
  - ‘Amsterdam’ (Du) /Amst@rdAm/ -> /Ams@dAm/
  - ‘koninklijke’ (Du) /konIŋkl@k@/ -> /kol@k@/
- Insertion
  - homorganic glide insertion: ‘die een’ (Du) /dij@n/
- Degemination
  - ‘is zichtbaar’ (Du) /Is zIxtbar/ -> /IsIxbar/
- Reduction, coarticulation, assimilation

Acoustic manifestations

- pitch, loudness, formant, component contours
  - contour stylization (e.g., pitch in praat)
  - contour modeling (D. van Bergem; R. van Son)
    - n-th degree curve fitting
    - Legendre polynomials
    - 16 points per segment
- (phoneme) segmentation
  - by hand (time consuming; inconsistent)
  - automatically (via forced phoneme recognition and a pronunciation lexicon with alternatives; systematic errors)
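The "16 points per segment" representation can be sketched as a time normalization: every segment, whatever its duration, is resampled to the same number of equally spaced points. The sketch below assumes linear interpolation; the function name and the F2 values are hypothetical illustrations, not the procedure or data used in the cited work.

```python
def resample_contour(times, values, n_points=16):
    """Linearly interpolate a contour at n_points equally spaced instants."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (time, value) samples")
    start, end = times[0], times[-1]
    step = (end - start) / (n_points - 1)
    resampled = []
    left = 0  # index of the sample just before the current instant
    for k in range(n_points):
        t = start + k * step
        # Move to the pair of original samples that brackets instant t.
        while left < len(times) - 2 and times[left + 1] < t:
            left += 1
        span = times[left + 1] - times[left]
        weight = min(max((t - times[left]) / span, 0.0), 1.0)
        resampled.append(values[left] + weight * (values[left + 1] - values[left]))
    return resampled


# Hypothetical F2 track (Hz) of one vowel, measured every 10 ms.
times = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08]
f2 = [1450.0, 1520.0, 1600.0, 1680.0, 1720.0, 1730.0, 1700.0, 1640.0, 1560.0]
track16 = resample_contour(times, f2)
print(len(track16), round(track16[0]), round(track16[-1]))
```

After this step, tracks of short and long segments can be compared point by point, which is what makes shape comparisons across speaking rates possible.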

Contour modeling

- allows modeling of specific phenomena
  - pitch accentuation (vs. vowel onset)
  - reduction, centralization, undershoot
- allows generation of stimuli for perc. expts.
  - phoneme identification in extending context
  - 2-alternatives forced choice identif. of continua
  - discrimination, RT
- allows statistics on large speech corpora
  - TIMIT, CGN, IFA-corpus, Switchboard

Static vs. dynamic V recogn.

- see Weenink (2001)
  - “Vowel normalizations with the TIMIT acoustic phonetic speech corpus”, IFA Proc. 24, 117-123
- 438 males, both train & test sent. of TIMIT
- 35,385 vowel segments, hand segmented
- 13 monophthongal vowel categories
- 1-Bark bandfilter anal. (18), intensity normal.
- 3 frames per segment: central and 25 ms L/R
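The difference between the static (1 frame) and dynamic (3 frames) representation can be sketched as follows. The sketch assumes that "25 ms L/R" means 25 ms to the left and right of the central frame, and that intensity normalization means subtracting the mean band level of each frame; both are assumptions, and the function names and the frame step are hypothetical.

```python
def normalize_intensity(frame):
    """Subtract the mean band level (dB) so overall loudness drops out."""
    mean_level = sum(frame) / len(frame)
    return [level - mean_level for level in frame]


def vowel_features(frames, frame_step_ms, dynamic=False, offset_ms=25.0):
    """Return a static (1 frame) or dynamic (3 frame) feature vector.

    frames: band levels per analysis frame (e.g. 18 one-Bark filters)
    covering one hand-segmented vowel.
    """
    centre = len(frames) // 2
    if dynamic:
        shift = round(offset_ms / frame_step_ms)
        # Clamp so that short segments still yield three frames.
        picks = [max(centre - shift, 0), centre,
                 min(centre + shift, len(frames) - 1)]
    else:
        picks = [centre]
    features = []
    for index in picks:
        features.extend(normalize_intensity(frames[index]))
    return features


# Toy segment: 13 frames of 18 band levels, 5 ms apart (made-up numbers).
frames = [[float(i + band) for band in range(18)] for i in range(13)]
static = vowel_features(frames, frame_step_ms=5.0)
dynamic = vowel_features(frames, frame_step_ms=5.0, dynamic=True)
print(len(static), len(dynamic))
```

With 18 filters, the static vector has 18 dimensions and the dynamic one 54; a classifier such as discriminant analysis would then be trained on either representation.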

Some results

Vowel classif. (%) with discriminant functions

Condition                # Items                   Static (1 frame)   Dynamic (3 frames)
Original                 35,385 (438x13x(1…25))    59.3               66.9
  speaker normalized     35,385                    62.2               69.2
V centers per speaker    5,374 (438x13)            78.9               90.1
  speaker normalized     5,374                     87.9               94.5

Formant tracks / speaking rate

- Ph.D. thesis Rob van Son (1993)
  - “Spectro-temporal features of vowel segments”
  - see also Speech Comm. 13, 135-148 (Pols & vSon)
- 850-words text, read at normal and fast rate
- hand segmentation of 7 most freq. V + schwa
- formant tracks
  - via 16 points per segm. or 5 Legendre polynomials
- influence of rate, V-dur., context, sent. acc.
  - evidence for duration-controlled undershoot?
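Describing a formant track with a handful of Legendre polynomials can be sketched as a least-squares fit on the normalized time axis [-1, 1]. The sketch assumes that "5 Legendre polynomials" means orders 0 to 4 and that the track is equally spaced (for instance 16 points); the helper names and the test track are hypothetical, and the original analysis may have been computed differently.

```python
def legendre_basis(x, order):
    """Values P_0(x)..P_order(x) via the Bonnet recurrence."""
    values = [1.0, x]
    for n in range(1, order):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: order + 1]


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def legendre_fit(track, order=4):
    """Least-squares Legendre coefficients of an equally spaced track."""
    n = len(track)
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    basis = [legendre_basis(x, order) for x in xs]
    size = order + 1
    # Normal equations (B^T B) c = B^T y for the basis matrix B.
    normal = [[sum(b[i] * b[j] for b in basis) for j in range(size)]
              for i in range(size)]
    rhs = [sum(b[i] * y for b, y in zip(basis, track)) for i in range(size)]
    return solve(normal, rhs)


# Hypothetical 16-point F2 track: level 1500 Hz, rising tilt, downward curvature.
xs = [-1.0 + 2.0 * i / 15 for i in range(16)]
track = [1500.0 + 200.0 * x - 100.0 * (3 * x * x - 1) / 2 for x in xs]
coeffs = legendre_fit(track)
print([round(c, 1) for c in coeffs])
```

In this representation the zeroth coefficient describes the overall level of the track, the first its tilt and the second its curvature, so undershoot shows up mainly as a change in the second-order coefficient.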

Some results

- no differences for F1/F2 in vowel center for normal- or fast-rate speech; only some overall rise in F1 for fast rate (irrespective of V)
- same formant track shape (normalized to 16 points) for normal- or fast-rate speech
- same results when using the more elaborate Legendre polynomials
- Concl.: changes in V-duration do not change the amount of undershoot -> active control of articulation speed

Formant representations

[Figure: vowel positions for normal rate vs. fast rate in two panels: zeroth order Legendre polynomial coefficients (mean F_i in vowel segment) and second order Legendre polynomial coefficients (axes reversed)]

Modeling vowel reduction

- Ph.D. thesis Dick van Bergem (1995)
  - “Acoustic and lexical vowel reduction”
  - see also Speech Communication 16, 329-358
- lexical V reduction: Fr /betõ/ vs. Du /b@tOn/
- acoustic V reduction: /banan, bAnan, b@nan/
  - f(sent. acc., w. str., w. class): can-candy-canteen
- coarticulatory effects on the schwa
  - C1@C2V- and VC1@C2-type nonsense words
- perceptual effects (full V or schwa, e.g. ‘ananas’)

Some results

[Figure panels: t-n, w-l]

The schwa is not just a centralized vowel but something that is completely assimilated with its phonemic context

Modeling consonant reduction

- Sp. Comm. (1999) 28, 125-140 (vSon & Pols)
- 20 min. speech, both spontaneous and read
- 2 x 791 similar VCV; hand segmented
- 5 aspects of V and C reduction
  - related to coarticulation: F2 slope differences at CV- vs. VC-boundaries; F2 locus equations (F2 onset vs. F2 target)
  - related to speaking effort: duration; spectral COG (mean freq.); V-C sound energy differences
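The spectral COG (center of gravity, i.e. mean frequency) can be sketched as the weighted mean of the frequency bins of a spectrum. The sketch assumes weighting by power; the exact weighting in the cited study is not stated on the slide, and the spectra below are made-up numbers.

```python
def spectral_cog(freqs_hz, power):
    """Center of gravity: power-weighted mean frequency of a spectrum."""
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    return sum(f * p for f, p in zip(freqs_hz, power)) / total


# Hypothetical four-bin spectra of the same consonant in two styles.
freqs = [500.0, 1500.0, 2500.0, 3500.0]
read_speech = [4.0, 3.0, 2.0, 1.0]
spontaneous = [6.0, 3.0, 1.0, 0.0]  # less high-frequency energy

cog_read = spectral_cog(freqs, read_speech)
cog_spont = spectral_cog(freqs, spontaneous)
print(cog_read, cog_spont)
```

A spectrum with relatively less high-frequency energy yields a lower COG, which is why the measure is used here as an indicator of speaking effort.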

Some results

- V markedly reduced in spontaneous speech
- lower F2-slope diff. in spontaneous speech -> decrease in articulation speed
- no systematic effect on F2 locus equation; V onsets and targets change in concert -> any V reduction mirrored by comparable change in C
- spont. sp.: V and C shorter; lower COG -> decrease in vocal and articulatory effort
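A locus equation is a straight-line fit of F2 at vowel onset against F2 at the vowel target, computed over many vowel contexts of one consonant. The sketch below uses ordinary least squares and made-up F2 values to illustrate why onsets and targets that "change in concert" leave the equation unaffected; none of the numbers come from the cited study.

```python
def locus_equation(f2_target, f2_onset):
    """Fit F2_onset = slope * F2_target + intercept by least squares."""
    n = len(f2_target)
    mean_x = sum(f2_target) / n
    mean_y = sum(f2_onset) / n
    sxx = sum((x - mean_x) ** 2 for x in f2_target)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(f2_target, f2_onset))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical read speech: F2 targets spread widely (Hz).
read_target = [800.0, 1200.0, 1600.0, 2000.0]
read_onset = [0.5 * t + 900.0 for t in read_target]

# Hypothetical spontaneous speech: targets centralized, onsets move with them.
spont_target = [1000.0, 1300.0, 1500.0, 1800.0]
spont_onset = [0.5 * t + 900.0 for t in spont_target]

read_fit = locus_equation(read_target, read_onset)
spont_fit = locus_equation(spont_target, spont_onset)
print(read_fit, spont_fit)
```

Both fits return the same slope and intercept even though the vowel targets are reduced in the second set, because every onset moved along the same line as its target.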

Access to large corpora

- more, and more realistic, data
- phonetic knowledge via statistical analyses
- e.g. highly accessible IFA-corpus (free, SQL)
  - see “Structure and access of the open source IFA-corpus”, IFA Proc. 24, 15-26 (vSon & Pols)
  - on-line http://www.fon.hum.uva.nl/IFAcorpus/
- 4 M/4F speakers, 5.5 hrs of speech
  - from informal to read + sent., words, syllables
  - ~ 50Kwords segm. and labeled at phoneme level

Some results

- speech + annot. + meta data: relational DB
- realization of final n, e.g. Du ‘geven’ /xev@(n)/

Style          #wrds    /@n/     /@/     All    % /@n/
Informal        5,250      1     304     305    0.3
Retelling       6,229     13     236     249    5.2
Narr. story    14,453    180     372     552    33
Sentences      14,970    203     340     543    37
Pseudo-sent     2,554     62      19      81    77
All            43,456    459   1,271   1,730

LF 42, HF 30, 36 (% /@n/; rows unclear)

Spoken Dutch Corpus (CGN)

- 10 M words, 1,000 hrs of speech
- variety of styles, incl. telephone speech
- adult Dutch and Flemish speakers
- for linguistic and technological research
- see various LREC and ICSLP papers (2002)
- see also http://lands.let.kun.nl/cgn/home.htm
- fully transcribed: orthogr., POS, lemmas
- partly transcr.: phonemic, prosodic, syntactic

TIMIT

- popular DB in acoustic phonetics and ASR
  - also telephone version (NTIMIT)
- hand segmented & labeled at phoneme level
- 438 males, 192 females (8 dialect regions)
- 10 sent./sp. (2 fixed, 5 phon. compact, 3 diverse)
  - sa1: “She had your dark suit in greasy wash water all year”
- includes separate test data (112 M, 56 F)
- e.g. Ph.D. thesis X. Wang (1997) “Incorporating knowledge on segmental duration in HMM-based continuous speech recognition”

Useful info: durational variability

[Figure: tree of duration statistics (count, mean, s.d.) for /iy/, split successively by the factors R, S, Lw and Lu, each with numbered factor levels]

Root /iy/: count 4626, mean 95, s.d. 39
R level 0: 1544, 83; level 1: 1588, 95, 36; level 2: 1494, 109, 46
Marked means: normal rate = 95, word final = 136, utterance final = 186

Adopted from Wang (1998)

[Figure: histograms of normalized phone duration and speaking rate for all 3,696 training sent. (sx + si) of the TIMIT training set]

normalized phone duration $\tilde{d} = \frac{d - \mu}{\sigma}$, speaking rate $r = \frac{1}{N} \sum_{i=1}^{N} \tilde{d}_i$
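The two quantities on this slide can be sketched in a few lines, assuming that the normalized phone duration is the z-score of a phone's duration against the mean and s.d. of its phone class, and that the speaking rate of an utterance is the mean of these scores over its N phones. The /iy/ statistics (mean 95, s.d. 39) are the root values from the duration table; the /s/ statistics, the utterance and the use of milliseconds are hypothetical.

```python
def normalized_duration(duration_ms, mean_ms, sd_ms):
    """Z-score of one phone duration against its phone-class statistics."""
    return (duration_ms - mean_ms) / sd_ms


def speaking_rate(phones, stats):
    """Mean normalized phone duration over the N phones of an utterance."""
    scores = [normalized_duration(duration, *stats[label])
              for label, duration in phones]
    return sum(scores) / len(scores)


# (mean, s.d.) per phone class; /iy/ from the slide, /s/ made up.
stats = {"iy": (95.0, 39.0), "s": (110.0, 40.0)}

# Hypothetical utterance as (phone label, duration in ms) pairs.
utterance = [("iy", 134.0), ("s", 90.0), ("iy", 56.0)]
rate = speaking_rate(utterance, stats)
print(round(rate, 3))
```

Under this definition a negative value means that the phones are on average shorter than usual (faster speech), and a positive value that they are longer (slower speech).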

‘found’ speech

- DARPA-LVSR community rather ambitious
- Broadcast News (BN), Sp.Comm. 37 (2002)

< ’95: WSJ, NAB read sp.; audio training data 100 hrs; text (for LM) 430 K
1995: Market place; audio training data 10 hrs; best % WER on test set 27.0 %
1996: F0-F5, FX partitioned; audio training data 55 hrs; text (for LM) 122 M; best % WER on test set 27.1 % (1:46 hrs)
1997: 3 hrs test, unpartit.; audio training data + 50 hrs; text (for LM) 540 M; best % WER on test set 16.2 % (3 hrs)
1998: + non Engl. speech, also < 10x RT; audio training data + 100 hrs; text (for LM) > 900 M; best % WER on test set 13.5 -> 16.1 % (3 hrs; 10xRT)

For Proc. DARPA Workshops, see http://www.nist.gov/speech/proc/darpa99/index.htm

Articul.-acoustic features in ASR

- “A Dutch treatment of an elitist approach to articulatory-acoustic feature classification”, Proc. Eurospeech-2001, 1729-1732 (M. Wester et al.)
- “Integrating articulatory features into acoustic models for speech recognition”, Phonus 5, 73-86 (K. Kirchhoff, 2000)
- “An overlapping-feature-based phonological model incorporating linguistic constraints: Applications to speech recognition”, JASA 111 (2), 1086-1101 (J. Sun & L. Deng, 2002)

Conclusions

- examples of dynamics in speech acoustics
- going from formal to informal speech:
  - less dynamics, more reduction (artic. guided)
  - undershoot vs. speaking style
  - sloppiness or articulatory limits?
- functionality of dynamics? -> other paper
- systematicity of dynamics?
  - easing ASR, rules for TTS, acquiring knowledge?