
CS 551/651:
Structure of Spoken Language
Lecture 5: Characteristics of Place of Articulation;
Phonetic Transcription
John-Paul Hosom
Fall 2010
Acoustic-Phonetic Features: Manner of Articulation
Approximately 8 manners of articulation:
Name          Sub-Types           Examples
Vowel         vowel, diphthong    aa, iy, uw, eh, ow, …
Approximant   liquid, glide       l, r, w, y
Nasal         -                   m, n, ng
Stop          unvoiced, voiced    p, t, k, b, d, g
Fricative     unvoiced, voiced    f, th, s, sh, v, dh, z, zh
Affricate     unvoiced, voiced    ch, jh
Aspiration    -                   h
Flap          -                   dx, nx
Change in manner of articulation usually abrupt and visible;
manner provides much information about location of phonemes.
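The point about manner changes marking phoneme locations can be illustrated with a small lookup. This is only a sketch: the dictionary copies the manner classes and example phonemes listed above (the vowel list is open-ended on the slide), and the names `MANNER_TO_PHONEMES`, `PHONEME_TO_MANNER`, and `manner_boundaries` are hypothetical, introduced here for illustration.

```python
# Manner classes and example phonemes as listed on the slide.
# The vowel entry is incomplete by design: the slide ends its list with "…".
MANNER_TO_PHONEMES = {
    "vowel": ["aa", "iy", "uw", "eh", "ow"],
    "approximant": ["l", "r", "w", "y"],
    "nasal": ["m", "n", "ng"],
    "stop": ["p", "t", "k", "b", "d", "g"],
    "fricative": ["f", "th", "s", "sh", "v", "dh", "z", "zh"],
    "affricate": ["ch", "jh"],
    "aspiration": ["h"],
    "flap": ["dx", "nx"],
}

# Invert the table: phoneme symbol -> manner class.
PHONEME_TO_MANNER = {
    phoneme: manner
    for manner, phonemes in MANNER_TO_PHONEMES.items()
    for phoneme in phonemes
}


def manner_boundaries(phonemes):
    """Return the indices i at which the manner class changes
    between phonemes[i - 1] and phonemes[i]."""
    manners = [PHONEME_TO_MANNER[p] for p in phonemes]
    return [i for i in range(1, len(manners)) if manners[i] != manners[i - 1]]
```

For a sequence such as h, eh, l, ow, every adjacent pair differs in manner, so each phoneme boundary coincides with a manner change; for m, n, aa only the nasal-to-vowel transition does.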
Acoustic-Phonetic Features: Place of Articulation
Approximately 8 places of articulation for consonants:
Name              Examples
Labial            p, b, m, (w)
Labio-Dental      f, v
Dental            th, dh
Alveolar          t, d, s, z, n, l*
Palato-Alveolar   sh, zh, ch**, jh**, r***
Palatal           y
Velar             k, g, ng, (w)
Glottal           h
*   /l/ doesn’t have the same coarticulatory properties as other alveolars
**  starts as alveolar (/t/, /d/), then becomes palato-alveolar
*** /r/ can have a complex place of articulation
Place of articulation is more subject to coarticulation than manner;
F2 trajectory is important for identifying place of articulation.
Acoustic-Phonetic Features: Place of Articulation
Labial (/p/, /b/, /m/, /w/):
• constriction (or complete closure) at lips
• the only unvoiced labial is /p/
• the only nasal labial is /m/
• characterized by F1, F2, (even) F3 of adjacent vowel(s)
rapidly and briefly decreasing at border with labial
Acoustic-Phonetic Features: Place of Articulation
Labio-Dental (/f/, /v/):
• produced by constriction between lower lip and upper teeth
• in English, all labio-dental phonemes are fricatives
• can be characterized by formants of adjacent vowel(s)
decreasing at border with the labio-dental (similar to
characteristics of labials)
Dental (/th/, /dh/):
• produced by constriction between tongue tip and upper teeth
(sometimes tongue tip is closer to alveolar ridge)
• in English, all dental phonemes are fricatives
• may be characterized by stronger energy above 6 kHz,
but weaker than /sh/, /zh/ fricatives
Acoustic-Phonetic Features: Place of Articulation
Alveolar (/t/, /d/, /s/, /z/, /n/, /l/):
• tongue tip is at or near alveolar ridge
• a large number of English consonants are alveolar
• primary cue to alveolars: F2 of neighboring vowel(s)
is around 1800 Hz, except for /l/
• /l/ has low F1 (≈400 Hz) and F2 (≈1000 Hz), high F3
• /l/ before vowel is “light” /l/, after vowel is “dark” /l/.
Acoustic-Phonetic Features: Place of Articulation
Palato-Alveolar (/sh/, /zh/, /ch/, /jh/, /r/):
• tongue is between alveolar ridge and hard palate
• 2 fricatives, 2 affricates, 1 rhotic consonant (r sound)
• retroflex has “depression” midway along tongue
• the palato-alveolar fricatives tend to have strong energy
due to weak constriction allowing large airflow
• /r/ (and /er/) most easily identified by F3 below 2000 Hz
• /r/ sometimes considered alveolar approximant
Palatal (/y/):
• produced with tongue close to hard palate
• “extreme” production of /iy/
• F1-F2 tend to be more spread than /iy/, F1 is lower than /iy/
Acoustic-Phonetic Features: Place of Articulation
Velar (/k/, /g/, /ng/):
• produced with constriction against velum (soft palate)
• only plosives /k/ and /g/, and nasal /ng/
• characteristic of velars is the “velar pinch”, in which
F2 and F3 of neighboring vowel become very close
at boundary with velar. Most visible in front vowel /ih/
Acoustic-Phonetic Features: Place of Articulation
Glottal (/h/):
• /h/ is the nominal glottal phoneme in English; in
reality, the tongue can be in any vowel-like position
• the primary cues for /h/ are formant structure without
voicing, an energy dip, and/or an increase in aspiration
noise at higher frequencies.
Distinctive Phonetic Features: Summary
• Distinctive features may be used to categorize phonetic
sub-classes and show relationships between phonemes
• There is often not a one-to-one correspondence between a
feature value and a particular trait in the speech signal
• A variety of context-dependent and context-independent cues
(sometimes conflicting, sometimes complementary) serve to
identify features
• Speech is highly variable and highly context-dependent, and
cues to phonemic identity are spread in both the spectral
and time domains. This diffusion of features makes
automatic speech recognition difficult, but human speech
recognition is able to use it for robustness.
Redundancy
• Distinctive features are not always independent; some
redundancy may be implied (especially binary features)
• Example: Spanish
        i   e   a   o   u
High    +   -   -   -   +
Low     -   -   +   -   -
Back    -   -   +   +   +
Round   -   -   -   +   +
Implied relationships:
+high → -low       +low → -high
+round → +back     +low → +back
-back → -low       +round → -low
-back → -round     +low → -round
These relationships are language and feature-set specific.
(from Schane, p. 35-38)
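The implied relationships can be checked mechanically against the vowel inventory. The sketch below is illustrative only: the feature values follow the Spanish table above (treat any value that the transcript leaves ambiguous as an assumption), and `SPANISH_VOWELS`, `RULES`, and `implication_holds` are hypothetical names.

```python
# Feature matrix for the five Spanish vowels (True = "+", False = "-").
# Values follow the table above; where the transcript is ambiguous they
# are an assumption.
SPANISH_VOWELS = {
    "i": {"high": True,  "low": False, "back": False, "round": False},
    "e": {"high": False, "low": False, "back": False, "round": False},
    "a": {"high": False, "low": True,  "back": True,  "round": False},
    "o": {"high": False, "low": False, "back": True,  "round": True},
    "u": {"high": True,  "low": False, "back": True,  "round": True},
}

# Each rule reads: (feature, value) implies (feature, value).
RULES = [
    ("high", True, "low", False),    # +high  -> -low
    ("round", True, "back", True),   # +round -> +back
    ("back", False, "low", False),   # -back  -> -low
    ("low", True, "high", False),    # +low   -> -high
    ("low", True, "back", True),     # +low   -> +back
    ("round", True, "low", False),   # +round -> -low
    ("back", False, "round", False), # -back  -> -round
    ("low", True, "round", False),   # +low   -> -round
]


def implication_holds(inventory, if_feature, if_value, then_feature, then_value):
    """True if every segment with if_feature == if_value
    also has then_feature == then_value."""
    return all(
        features[then_feature] == then_value
        for features in inventory.values()
        if features[if_feature] == if_value
    )
```

A relationship that is not implied fails the same check: +back does not imply +round in this inventory, because /a/ is +back but -round. This is what makes the relationships specific to the language and the feature set.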
Redundancy
• Redundant information can be indicated by circling redundant
features:
        i   e   a   o   u
High    +   -   -   -   +
Low     -   -   +   -   -
Back    -   -   +   +   +
Round   -   -   -   +   +
[Circles marking the redundant values are not preserved in this transcript.]
• Some redundancies are universal (can’t be +high and +low)
• Phonetic sequences also have constraints (redundant info.):
English has no more than 3 word-initial consonants; in a
3-consonant cluster, the first consonant is always /s/, the next is
always /p/, /t/, or /k/, and the third is always /r/ or /l/
(from Schane, p. 36-40)
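The sequence constraint can be written as a simple membership test. This sketch encodes only the constraint as stated on the slide, not a full account of English phonotactics; the function name `valid_three_consonant_onset` is hypothetical.

```python
def valid_three_consonant_onset(onset):
    """Check a word-initial cluster of exactly three consonants against
    the constraint as stated on the slide: /s/, then /p t k/, then /r l/."""
    if len(onset) != 3:
        raise ValueError("expected exactly three consonant symbols")
    first, second, third = onset
    return first == "s" and second in {"p", "t", "k"} and third in {"r", "l"}
```

Because each position is so restricted, knowing that a word begins with three consonants already determines the first one and narrows the others, which is the sense in which the information is redundant.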
Phonetic Transcription
Given a corpus of speech data, it’s often necessary to create a
transcription:
• word level
• phoneme level
• time-aligned phoneme level
• time-aligned detailed phoneme level (with diacritics)
• other information: phonetic stress, emotion, syntax, repairs
Most common are word-level and time-aligned phoneme level.
Time-aligned phonetic transcription examples:
start   end   label
0       110   .pau
110     180   h
180     240   eh
240     280   l
280     390   ow
390     540   .pau
[Labels from a second example, times not preserved: t, uw, .br]
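A time-aligned phoneme transcription can be handled as a list of (start, end, label) records. The following is a minimal sketch, assuming one "start end label" record per line and times in msec; neither the file layout nor the unit is stated on the slide, and the pairing of times with labels in `EXAMPLE` is inferred from their order. `Segment`, `parse_segments`, and `is_contiguous` are hypothetical names.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: int   # assumed to be msec
    end: int     # assumed to be msec
    label: str


def parse_segments(text: str) -> List[Segment]:
    """Parse lines of the assumed form 'start end label'."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, label = int(parts[0]), int(parts[1]), parts[2]
        if end <= start:
            raise ValueError(f"segment {label!r} has non-positive duration")
        segments.append(Segment(start, end, label))
    return segments


def is_contiguous(segments: List[Segment]) -> bool:
    """True if each segment starts exactly where the previous one ends."""
    return all(a.end == b.start for a, b in zip(segments, segments[1:]))


# The "hello" example, with times paired to labels in order (an inference).
EXAMPLE = """\
0 110 .pau
110 180 h
180 240 eh
240 280 l
280 390 ow
390 540 .pau
"""
```

Storing both start and end times is redundant when segments are contiguous, but it makes gaps or overlaps introduced during labeling easy to detect.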
Phonetic Transcription
Are phonemes precise quantities with exact boundaries?
No… humans disagree on phonetic labels and boundary positions;
disagreement may be a matter of interpretation of the utterance.
Phonetic label agreement between humans*:
Language   Full Labels   Base Labels   Broad Categories
English    70%           71%           89%
German     61%           65%           81%
Mandarin   66%           78%           87%
Spanish    74%           82%           90%
Full, Base Label Set: 55 (English), 62 (German), 50 (Mandarin),
42 (Spanish)
Broad Categories: 7, corresponding to manner of articulation
*From Cole, Oshika, et al., ICSLP’94
Phonetic Transcription
70% agreement on 55 phonemes, 89% agreement on 7 categories
Best phoneme-level automatic speech recognition results on TIMIT,
with a 39-phoneme symbol set: 75.8% (Antoniou, 2001; Reynolds and
Antoniou, 2003?)
Differences:
1. Human agreement evaluated on spontaneous speech (stories),
TIMIT is read speech
2. Humans used 55 phonemes; 39 phonemes for evaluating TIMIT
Phoneme agreement doesn’t translate into word accuracy…
human word error rate is typically an order of magnitude lower
than that of the best automatic speech recognition system.
Phonetic Transcription
Phonetic label boundary agreement between humans:
Agreement measured by comparing two manual labelings, A and B,
and computing the percentage of cases in which B labels are within
some threshold (20 msec) of A labels.
[Figure: agreement (%) vs. threshold (msec); thresholds 0-45 msec, agreement axis 50-100%; curves labeled Cosi, Ljolje, Wesenick, Cole, Leung, Hosom]
Average agreement of 93.8% within 20 msec threshold;
Maximum agreement of 96% within 20 msec
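The agreement measure described above can be sketched as follows. It assumes the two labelings contain the same phoneme sequence, so that boundaries pair up by index, and it treats "within" as less than or equal to the threshold; both are assumptions, and `boundary_agreement` is a hypothetical name.

```python
def boundary_agreement(boundaries_a, boundaries_b, threshold_ms=20.0):
    """Percentage of B boundaries that lie within threshold_ms of the
    corresponding A boundary (boundaries paired by index)."""
    if len(boundaries_a) != len(boundaries_b):
        raise ValueError("labelings must have the same number of boundaries")
    if not boundaries_a:
        raise ValueError("no boundaries to compare")
    hits = sum(
        1
        for a, b in zip(boundaries_a, boundaries_b)
        if abs(a - b) <= threshold_ms
    )
    return 100.0 * hits / len(boundaries_a)
```

Evaluating the function over a range of thresholds, for example every 5 msec from 0 to 45, gives one agreement-versus-threshold curve of the kind plotted for each study.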
Phonetic Transcription
Is there a “correct” answer? No; inherently subjective although
semi-arbitrary guidelines can be imposed.
Is measuring accuracy meaningless? No; phonemes do have
identity and order, although details may be subjective.
Sometimes very precise (if semi-arbitrary) labels and boundaries
are extremely important (e.g. concatenative text-to-speech databases).
What about getting a computer to generate transcriptions, or
at least phonetic boundaries?
Advantages: consistent, fast
Disadvantages: less accurate than human transcription;
not robust to different speakers, environments
Phonetic Transcription
Automatic Phonetic Alignment (assume phonetic identity is known):
Two common methods:
(1) “Forced Alignment”:
Use existing speech recognizer, constrained to recognize
only the “correct” phoneme sequence. The search process
used by HMM recognizers returns both phoneme identity and
location. Location information is boundary information.
(2) Dynamic Time Warping:
(a) Use text-to-speech or utterance “templates” to generate
same speech content with known boundaries. (b) Warp time
scale of reference (TTS or template) with input speech to
minimize spectral error. (c) Convert known boundary
locations to original time scale.
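Steps (b) and (c) of the dynamic time warping method can be sketched in a few lines. This is an illustration under stated assumptions, not the method of any particular system: frames are generic feature vectors, the spectral error is plain Euclidean distance, the step pattern is the basic symmetric one, and a reference boundary is mapped to the first input frame aligned with it. `dtw_path` and `map_boundaries` are hypothetical names.

```python
import math


def dtw_path(ref, inp):
    """Align two sequences of feature vectors (one list of floats per frame).
    Returns the minimum-cost warping path as (ref_frame, input_frame) pairs."""
    n, m = len(ref), len(inp)
    inf = float("inf")
    # cost[i][j]: minimum cumulative distance aligning ref[:i] with inp[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], inp[j - 1])  # assumed distance measure
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


def map_boundaries(path, ref_boundaries):
    """Map each reference boundary (frame index at which a phoneme starts)
    to the first input frame aligned with that reference frame."""
    first_match = {}
    for ref_frame, inp_frame in path:
        first_match.setdefault(ref_frame, inp_frame)
    return [first_match[b] for b in ref_boundaries]
```

For instance, if the reference has a boundary at frame 2 and the input holds the first phoneme one frame longer, `map_boundaries(dtw_path(ref, inp), [2])` places the boundary at input frame 3. Multiplying frame indices by the frame step converts them back to times.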
Phonetic Transcription
Accuracy of automatic alignment
Speaker-independent alignment using Forced Alignment:
[Figure: agreement (%) vs. threshold (msec); thresholds 0-45 msec, agreement axis 30-100%; curves labeled Brugnara, Ljolje1, Pellom, Wightman, Ljolje2, Rapp, Svendsen1, Pauws, Dalsgaard, Malfrere1, Kipp, Stober1, Stober2, Hosom]
Phonetic Transcription
Comparing manual and automatic alignment of TIMIT corpus:
[Figure: Automatic vs. Manual curves; axis ticks 50-100 (vertical) and 0-45 (horizontal), presumably agreement (%) vs. threshold (msec) as in the preceding plots]
• Automatic method still makes “stupid” mistakes.
• Manual labeling criteria not rigorously defined.
• Performance degrades significantly in presence of noise.
• Assumes correct phonetic sequence is known…