Landmark-Based Speech Recognition

Landmark-Based Speech Recognition:
Spectrogram Reading, Support Vector Machines,
Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson
University of Illinois at Urbana-Champaign, USA
Assistant Professor, Electrical and Computer Engineering Department
Assistant Professor, Beckman Institute for Advanced Science and Technology
Adjunct Professor, Speech and Hearing Sciences Department
Lecture 1
Introduction to Spectrogram Reading
• Review
– Laplace and Fourier transforms
– Short-time Fourier transform (STFT) and windowing
– White noise
– Periodic signals
• Spectrogram reading: Pitch
– Wideband and narrowband spectrograms
• Spectrogram reading: Manner
– Speech physiology
– Manner classification of phonemes
• Spectrogram reading: Formants
– Log-linear form of a rational filter
Laplace and Fourier Transforms
Transform Properties
Transforms worth knowing:
Impulses
Transforms worth knowing: Filters
Rectangular Window
Hamming & Hanning Windows
Periodic Signals
Random Signals (Noise)
The Short-Time Fourier Transform
The Spectrogram
Narrowband Spectrogram: N > 2T0
Wideband Spectrogram: N < T0
Fundamental Frequency (Pitch): F0 = 1/T0
[Figure annotations: 10F0, 4T0]
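The window-length rules above (N > 2T0 for narrowband, N < T0 for wideband) can be tried on a synthetic signal. The following is a minimal sketch, not the lecture's own code: the sampling rate, the 100 Hz pitch, the hop sizes, and the helper name `spectrogram` are all assumptions chosen for illustration.

```python
import numpy as np

def spectrogram(x, win_len, hop):
    """Log-magnitude STFT in dB; rows are frequency bins, columns are frames."""
    window = np.hamming(win_len)
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.array([x[s:s + win_len] * window for s in starts])
    # Small floor avoids log10(0) in frames that contain no energy
    return 20 * np.log10(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10).T

fs = 8000            # sampling rate in Hz (assumed)
f0 = 100             # fundamental frequency in Hz (assumed)
t0 = fs // f0        # pitch period T0 in samples (80)
x = np.zeros(fs)     # one second of signal
x[::t0] = 1.0        # periodic impulse train with period T0

narrow = spectrogram(x, win_len=4 * t0, hop=t0 // 4)   # N > 2*T0
wide = spectrogram(x, win_len=t0 // 2, hop=t0 // 8)    # N < T0
```

In `narrow`, every frame spans several pitch periods, so the harmonics of F0 appear as ridges along the frequency axis. In `wide`, a frame holds at most one pulse, so the pitch appears as alternating loud and silent columns along the time axis.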
On to New Material:
Manner Features, Speech
Production, and Landmarks
Anatomy of Speech Production
[Figure labels: hard palate, lips, nasal cavity, oral cavity, soft palate (open), pharynx, tongue blade, epiglottis, tongue body, vocal folds, jaw, tongue root]
Speech sources: Voicing,
Turbulence, and Transients
• The vocal folds:
– A nonlinear, high-impedance oscillator
– Excitation is like a periodic impulse train
• Turbulence:
– Vortices striking an obstacle produce white noise
– Excitation is like white noise
• Transient:
– High pressure, suddenly released
– Excitation is like a single loud impulse, δ(t)
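The three excitation types listed above can be written down as discrete-time signals. This is an illustrative sketch only: the sampling rate, the duration, the 125 Hz pitch, and the use of Gaussian samples for white noise are assumptions.

```python
import numpy as np

fs = 8000                        # sampling rate in Hz (assumed)
n = fs // 10                     # 100 ms of excitation
rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

# Voicing: periodic impulse train, one pulse every T0 = fs / F0 samples
f0 = 125                         # pitch in Hz (assumed)
voicing = np.zeros(n)
voicing[::fs // f0] = 1.0

# Turbulence: white noise (Gaussian samples are one common choice)
turbulence = rng.standard_normal(n)

# Transient: a single impulse, the discrete stand-in for delta(t)
transient = np.zeros(n)
transient[0] = 1.0

# A single impulse has a flat magnitude spectrum
transient_spectrum = np.abs(np.fft.rfft(transient))
```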
The vocal folds: A nonlinear, high-impedance oscillator
Vocal tract “rings” like a bell,
shaping the sound produced
by the vocal folds
(Cross-sectional area of the
vocal tract: 0.5-10 cm²)
The glottis (the opening between the
vocal folds) has an open area of
0.03 cm². To get through, air from
the lungs must speed up to a
high-speed jet.
Vocal folds flap back and forth,
driven by the jet, with a rate of
100-200 pulses/second.
Turbulence: Vortices striking an
obstacle produce white noise
In a fricative, the area of the tongue
constriction is about 0.2 cm². To get
through, air speeds up into a turbulent jet.
The turbulent jet strikes downstream
obstacles, such as the teeth. The jet
contains vortices of all radii between
0 and 2 mm, so the resulting sound
contains noise at all frequencies above
about 700 Hz.
Transient:
High pressure, suddenly released
While tongue tip is closed, air
pressure builds up behind the
constriction.
When constriction is released,
there is a sudden change in air
flow through the constriction (from
0 to nonzero). The sudden
change in airflow is heard as a
“pop.”
The Source-Filter Model of Speech
Production
Corresponds to: S(s) = H(s)E(s), where
S(s) = Recorded speech spectrum
E(s) = Source spectrum
H(s) = Transfer function = Filtering by the vocal tract
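Because S(s) = H(s)E(s), the log magnitude of the speech spectrum is the sum of the log magnitudes of the filter and the source. The sketch below checks this numerically with a toy source and a toy filter; the 500 Hz resonance, its decay rate, and the signal lengths are hypothetical values, not measurements from the lecture.

```python
import numpy as np

fs = 8000                          # sampling rate in Hz (assumed)
e = np.zeros(400)
e[::80] = 1.0                      # source: impulse train with F0 = 100 Hz

# Hypothetical one-resonance "vocal tract": a damped cosine at 500 Hz
n = np.arange(200)
h = np.exp(-np.pi * 100 * n / fs) * np.cos(2 * np.pi * 500 * n / fs)

s = np.convolve(e, h)              # time domain: speech = filter convolved with source
nfft = len(s)                      # long enough that FFT products equal linear convolution
S, H, E = (np.fft.rfft(v, nfft) for v in (s, h, e))

# Frequency domain: a product of spectra is a sum of log magnitudes
log_S = np.log(np.abs(S))
log_sum = np.log(np.abs(H)) + np.log(np.abs(E))
```

Plotting `log_S` against frequency would show the fine harmonic structure of the source riding on the smooth envelope of the filter, which is what makes the two separable in the log spectrum.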
Manner Classification of
Phonemes: [continuant]
• [-continuant] = lips or tongue close
COMPLETELY on midline of the vocal tract:
– stops (p,b,t,d,k,g)
– nasals (m,n,ng)
– affricates (q,j,ch,zh)
– syllable-initial lateral (l, e.g., “lake”)
• [+continuant] = no complete closure:
– fricatives (f,v,s,z,sh,x, Chinese h)
– glides (w,y,r, English h)
– vowels (a,e,i,o,u)
– diphthongs (in “buy,” “boy,” “bow”)
Manner Classification of
Phonemes: [sonorant]
• [+sonorant] = “a sound you can sing” (Latin)
– nasals (m,n,ng)
– lateral (l)
– glides (w,y,r)
– vowels (a,e,i,o,u)
– diphthongs (buy, boy, bow)
• [-sonorant] = air pressure builds up behind constriction;
voicing amplitude drops (also called an “obstruent
consonant”)
– stops (p,b,t,d,k,g)
– affricates (q,j,ch,zh)
– fricatives (f,v,s,z,sh,x)
• Special status of “sonorant” in Chinese:
– “initial” must be all-sonorant (“liang”) or all-obstruent (“qing”)
– “final” must be all-sonorant
Sonorant Consonants: Glide,
Lateral, Nasal
“layya ton” -- /l/, /y/, /t/, /n/
(the /y/ is [+continuant]; the others are [-continuant])
“ame” -- /m/ [-continuant]
Obstruent Consonants: Fricatives,
Affricates, and Stops
sa (+continuant)
shi (+continuant)
ba (-continuant)
qe (-continuant)
iji (-continuant)
ita (-continuant)
Place of Primary Articulation
Palatal (Blade): q, j, sh, y, i
Alveolar (Blade): t, d, s, z, n, l
Retroflex (Blade): ch, zh, x, r, er
Dental (Blade): th, dh
Labial (Lips): p, b, f, v, m, w, u, o
Velar (Body): k, g, ng, w, u
Uvular (Body): h, o
Pharyngeal (Body): a, ae
Laryngeal: h
Features of Secondary Articulators:
[lateral], [nasal], [affricated], [aspirated]
• [+sonorant,+continuant]: vowels, glides
• [+sonorant,-continuant]:
– [+nasal] = soft palate is open; air escapes through the nose
– [+lateral] = tongue is open on the sides; air can escape around
edges of tongue
• [-sonorant,+continuant]: fricatives
• [-sonorant,-continuant]:
– [+affricated]: tongue stays nearly closed after release, causing
frication (q,j,ch,zh)
– [+aspirated]: larynx stays open after release, causing aspiration
(p,t,k)
– [-affricated,-aspirated]: nothing special happens after release;
vowel starts immediately (b,d,g)
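The four combinations of [sonorant] and [continuant] above amount to a small lookup table. A minimal sketch follows; the dictionary names and the handful of phonemes included are illustrative choices, and /l/ is entered with its syllable-initial value.

```python
# Manner classes keyed by ([sonorant], [continuant]); True means "+", False means "-".
MANNER = {
    (True, True): "vowel or glide",
    (True, False): "nasal or lateral",
    (False, True): "fricative",
    (False, False): "stop or affricate",
}

# A few phonemes from the lists above (a sample, not a full inventory).
FEATURES = {
    "a": (True, True), "y": (True, True),
    "m": (True, False), "l": (True, False),   # syllable-initial /l/
    "s": (False, True), "sh": (False, True),
    "t": (False, False), "q": (False, False),
}

def manner(phoneme):
    """Look up the manner class implied by a phoneme's two feature values."""
    return MANNER[FEATURES[phoneme]]
```

Within a class, the secondary features separate the members: [nasal] or [lateral] for the sonorant non-continuants, [affricated] or [aspirated] for the obstruent non-continuants.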
Waveforms and Spectrograms:
Aspirated and Unaspirated Stops
Unaspirated: /b/
Aspirated: /t/
Phonetic Subsegments in the
Release of an Aspirated Stop
Waveforms and Spectrograms:
Fricatives and Affricates
sa
shi
qe
iji
Landmarks: Changes in the
features [continuant], [sonorant]
/t/ release
/l/ release
/t/ closure
/m/ release
/m/ closure
/v/ release
/v/ closure
/k/
/n/ release
/n/ closure
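At the symbolic level, a landmark is any boundary between adjacent segments where [sonorant] or [continuant] changes value. The sketch below applies that rule to a phoneme string; it is not an acoustic landmark detector, and the segmentation of “layya ton” into l-a-y-a-t-o-n is an assumption made for the example.

```python
# (sonorant, continuant) per segment; True means "+", False means "-".
FEATURES = {
    "l": (True, False), "n": (True, False),
    "y": (True, True), "a": (True, True), "o": (True, True),
    "t": (False, False),
}

def landmarks(segments):
    """Return (boundary index, left segment, right segment) wherever
    [sonorant] or [continuant] changes between adjacent segments."""
    return [
        (i + 1, left, right)
        for i, (left, right) in enumerate(zip(segments, segments[1:]))
        if FEATURES[left] != FEATURES[right]
    ]

marks = landmarks(["l", "a", "y", "a", "t", "o", "n"])
```

Here `marks` holds four boundaries: the /l/ release, the /t/ closure, the /t/ release, and the /n/ closure. The glide /y/ produces none, because it shares both feature values with the neighboring vowels.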
The Vocal Tract Transfer Function
Log-Spectral Separation of Source
and Filter
Formant Frequencies = Resonant
Frequencies of the Vocal Tract
Formant Frequencies of a Vowel
From Peterson and Barney,
“Control Methods in a Study of
the Vowels,”
Journal of the Acoustical
Society of America, 1952
Classifying Vowels
F2=1200Hz
F1=800Hz
Therefore vowel is /AH/
F2 starts at 1200Hz,
rises to 2000Hz
F1 starts at 800Hz,
falls to 300Hz
Therefore diphthong is /AY/
Rational Filters: Obstruents
Example: Front cavity resonance of /ch/ (q) is near F3 of the following vowel
Rational Filters: Nasal Consonants
Examples: Nasal Consonants
/m/: This talker makes /m/ with
resonances at 1000Hz, 1800Hz
uncancelled, but with the resonance at
300Hz cancelled by zeros.
/ng/: This talker makes /ng/ with
resonances at 300Hz, 1000Hz
uncancelled, but with the resonance at
1800Hz cancelled by zeros.
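The cancellation described above can be illustrated with a rational transfer function, whose log magnitude is the sum of the zero terms minus the sum of the pole terms. In this sketch the resonance frequencies come from the /m/ example, while the 100 Hz bandwidths, the exact placement of the zero on top of the 300 Hz pole, and the function names are assumptions.

```python
import numpy as np

def conj_pair(freq_hz, bw_hz=100.0):
    """Complex-conjugate s-plane root pair for one resonance (bandwidth is assumed)."""
    root = -np.pi * bw_hz + 2j * np.pi * freq_hz
    return [root, np.conj(root)]

def log_magnitude_db(freqs_hz, poles, zeros=()):
    """20*log10|H(j*2*pi*f)| for H(s) = prod(s - zeros) / prod(s - poles)."""
    s = 2j * np.pi * np.asarray(freqs_hz, dtype=float)[:, None]
    db = np.zeros(len(freqs_hz))
    # Log-linear form: each zero adds a term, each pole subtracts one
    for roots, sign in ((zeros, +1), (poles, -1)):
        if len(roots):
            db += sign * 20 * np.log10(np.abs(s - np.asarray(roots)[None, :])).sum(axis=1)
    return db

f = np.arange(100.0, 2500.0, 10.0)
poles = conj_pair(300) + conj_pair(1000) + conj_pair(1800)
no_zeros = log_magnitude_db(f, poles)
m_like = log_magnitude_db(f, poles, zeros=conj_pair(300))   # zero sits on the 300 Hz pole
```

With the zero pair in place, `m_like` matches the response of a filter that has only the 1000 Hz and 1800 Hz resonances, while `no_zeros` still peaks at 300 Hz. In real speech the zeros rarely land exactly on a pole, so the cancellation is usually partial.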
Summary
• Spectrogram is the log magnitude of the STFT.
• Wideband spectrogram: N<T0, pitch shows up in the time domain
• Narrowband spectrogram: N>2T0, pitch shows up in the frequency
domain
• Landmarks occur at changes in the values of the distinctive features
[continuant] and [sonorant]:
– [+continuant,+sonorant]: vowels, glides, diphthongs
– [+continuant,-sonorant]: fricatives
– [-continuant,+sonorant]: nasals, laterals
– [-continuant,-sonorant]: stops, affricates
• Recognition of Vowels and Glides: F1 and F2 are usually enough
• Recognition of Diphthongs: F1 and F2 at two separate points in time
(beginning and ending of the vowel).
• Obstruent Consonants: Back cavity formants are cancelled by zeros,
leaving only the front cavity formants (e.g., F3 for /sh/, /q/)
• Nasal Consonants: Resonances of the mouth-nose system are often
cancelled by zeros, leaving primarily low-frequency energy.