CS 224S / LINGUIST 285 Spoken Language Processing Dan Jurafsky Stanford University Spring 2014 Lecture 6: Feature Extraction.
CS 224S / LINGUIST 285
Spoken Language Processing
Dan Jurafsky
Stanford University
Spring 2014
Lecture 6: Feature Extraction
Outline
Feature extraction
How to compute MFCCs
Dealing with variation
Adaptation
MLLR
MAP
Lombard speech
Foreign accent
Pronunciation variation
Discrete Representation of Signal
Represent a continuous signal in discrete form.
Image from Bryan Pellom
Sampling
• Measuring amplitude of a signal at time t
• The sampling rate must give at least two samples
for each cycle
• One for the positive, and one for the negative half of
each cycle
• More than two samples per cycle is ok
• Less than two samples will cause frequencies to be
missed
• So the maximum frequency that can be measured is
half the sampling rate.
• The maximum frequency measurable at a given
sampling rate is called the Nyquist frequency
Sampling
Original signal in red: if we measure only at the
green dots, we will see a lower-frequency wave and
miss the correct higher-frequency one!
Sampling
• In practice we use the following sample rates
• 16,000 Hz (samples/sec), for microphones,
“wideband”
• 8,000 Hz (samples/sec) Telephone
• Why?
• Need at least 2 samples per cycle
• Max measurable frequency is half the sampling
rate
• Human speech < 10 kHz, so at most a 20 kHz
sampling rate is needed
• Telephone speech is filtered at 4 kHz, so 8 kHz is enough.
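The Nyquist limit can be seen directly in code. A small sketch (the 8 kHz sampling rate and 6 kHz tone are illustrative choices): a tone above half the sampling rate shows up at the wrong, aliased frequency.

```python
import numpy as np

# A 6 kHz tone sampled at 8 kHz violates the Nyquist criterion
# (max measurable frequency = 4 kHz), so it aliases.
fs = 8000          # sampling rate (Hz)
f_true = 6000      # true tone frequency, above the 4 kHz Nyquist limit
n = np.arange(fs)  # one second of samples
x = np.cos(2 * np.pi * f_true * n / fs)

# The strongest bin in the spectrum lands at the aliased frequency,
# fs - f_true = 2000 Hz, not at 6000 Hz.
spectrum = np.abs(np.fft.rfft(x))
f_measured = np.argmax(spectrum) * fs / len(x)
```

Less than two samples per cycle, and the 6 kHz wave is indistinguishable from a 2 kHz one.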
Digitizing Speech (II)
Quantization
Representing the real value of each amplitude as an integer
8-bit (-128 to 127) or 16-bit (-32768 to 32767)
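A minimal sketch of 16-bit quantization (the function name and the clip-and-round scheme are illustrative; real encoders differ in rounding details):

```python
import numpy as np

# Quantize a float signal in [-1, 1] to 16-bit PCM integers.
def quantize_16bit(x):
    # Scale to the int16 range and round; clip guards against overflow at +1.0.
    return np.clip(np.round(x * 32767), -32768, 32767).astype(np.int16)

samples = np.array([0.0, 0.25, -0.25, 1.0, -1.0])
pcm = quantize_16bit(samples)
```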
Formats:
16 bit PCM
8 bit mu-law; log compression
LSB (Intel) vs. MSB (Sun, Apple)
Headers:
Raw (no header)
Microsoft wav
Sun .au (40-byte header)
Discrete Representation of Signal
Byte swapping
Little-endian vs. Big-endian
Some audio formats have headers
Headers contain meta-information such as
sampling rate and recording conditions
Examples: Microsoft wav, NIST SPHERE
A 'raw' file has no header
Nice sound manipulation tool: Sox
http://sox.sourceforge.net/
change sampling rate
convert speech formats
MFCC
Mel-Frequency Cepstral Coefficient (MFCC)
Most widely used spectral representation in ASR
Pre-Emphasis
Pre-emphasis: boosting the energy in the high
frequencies
Q: Why do this?
A: The spectrum for voiced segments has more energy
at lower frequencies than higher frequencies.
This is called spectral tilt
Spectral tilt is caused by the nature of the glottal
pulse
Boosting high-frequency energy gives more info to the
Acoustic Model
Improves phone recognition performance
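Pre-emphasis is a one-line first-order filter. A sketch (the 0.97 coefficient is the conventional value used later in these slides):

```python
import numpy as np

# First-order pre-emphasis filter: y[n] = x[n] - 0.97 * x[n-1].
# It attenuates low frequencies, counteracting spectral tilt.
def pre_emphasize(x, alpha=0.97):
    return np.append(x[0], x[1:] - alpha * x[:-1])

x = np.array([1.0, 1.0, 1.0, 1.0])   # a flat (pure low-frequency) signal
y = pre_emphasize(x)                  # mostly suppressed after the first sample
```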
George Miller figure
Example of pre-emphasis
Spectral slice from the vowel [aa] before and after pre-emphasis
MFCC
Windowing
Image from Bryan Pellom
Windowing
Why divide speech signal into successive
overlapping frames?
Speech is not a stationary signal; we want
information about a small enough region that
the spectral information is a useful cue.
Frames
Frame size: typically, 10-25ms
Frame shift: the length of time between
successive frames, typically, 5-10ms
Common window shapes
Rectangular window:
Hamming window
Window in time domain
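The Hamming window can be computed directly from its formula; here is a sketch for one 25 ms frame at an assumed 16 kHz sampling rate:

```python
import numpy as np

fs = 16000
frame = np.ones(int(0.025 * fs))   # a 25 ms frame at 16 kHz = 400 samples

# Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
# Tapers the frame edges toward (but not exactly to) zero,
# unlike the rectangular window, which leaves them untouched.
n = np.arange(len(frame))
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (len(frame) - 1))
windowed = frame * hamming
```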
MFCC
Discrete Fourier Transform
Input:
Windowed signal x[n]…x[m]
Output:
For each of N discrete frequency bands
A complex number X[k] representing the magnitude and
phase of that frequency component in the original signal
Discrete Fourier Transform (DFT)
Standard algorithm for computing the DFT:
Fast Fourier Transform (FFT), with complexity O(N log N)
In general, choose N = 512 or 1024
Discrete Fourier Transform
computing a spectrum
A 25 ms Hamming-windowed signal from [iy]
And its spectrum as computed by DFT (plus other
smoothing)
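The windowing and DFT steps together look like this in NumPy (the 1 kHz test tone and 16 kHz rate are illustrative; N = 512 as suggested above):

```python
import numpy as np

fs = 16000
t = np.arange(400) / fs                  # one 25 ms frame
x = np.sin(2 * np.pi * 1000 * t)         # a 1 kHz tone

# Hamming-window the frame, then take a real-input FFT,
# zero-padded to N = 512: one complex value per frequency band.
N = 512
X = np.fft.rfft(x * np.hamming(len(x)), n=N)
freqs = np.fft.rfftfreq(N, d=1 / fs)     # band center frequencies in Hz

# The magnitude spectrum peaks at the tone's frequency.
peak_hz = freqs[np.argmax(np.abs(X))]
```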
MFCC
Mel-scale
Human hearing is not equally sensitive to all frequency
bands
Less sensitive at higher frequencies, roughly > 1000 Hz
I.e. human perception of frequency is non-linear:
Mel-scale
A mel is a unit of pitch
Pairs of sounds
perceptually equidistant in pitch
are separated by an equal number of mels
Mel-scale is approximately linear below 1 kHz and
logarithmic above 1 kHz
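One common formula for the mel scale (several variants exist) makes the linear-then-logarithmic behavior easy to check:

```python
import numpy as np

# A common mel-scale formula: mel(f) = 2595 * log10(1 + f / 700)
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Roughly linear below 1 kHz, logarithmic above:
# equal 1000 Hz steps span fewer and fewer mels as frequency grows.
low_step = hz_to_mel(1000) - hz_to_mel(0)      # close to 1000 mels
high_step = hz_to_mel(8000) - hz_to_mel(7000)  # far fewer mels
```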
Mel Filter Bank Processing
Mel Filter bank
Roughly uniformly spaced before 1 kHz
logarithmic scale after 1 kHz
Mel-filter Bank Processing
Apply the bank of Mel-scaled filters to the spectrum
Each filter output is the sum of its filtered spectral
components
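A minimal sketch of building such a filter bank (the 26-filter count, 16 kHz rate, and 512-point FFT are assumed, typical values; real toolkits differ in edge handling):

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, fs=16000):
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Center frequencies evenly spaced in mels: they bunch together
    # below 1 kHz and spread out logarithmically above it.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / fs).astype(int)

    # Triangular filters: rise from the left neighbor's center,
    # fall to the right neighbor's center.
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fbank[i, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[i, k] = (right - k) / (right - center)
    return fbank

fbank = mel_filterbank()
```

Each filter output is then the dot product of one row of `fbank` with the magnitude spectrum.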
MFCC
Log energy computation
Compute the logarithm of the square magnitude of
the output of Mel-filter bank
Log energy computation
Why log energy?
Logarithm compresses dynamic range of values
Human response to signal level is logarithmic
humans less sensitive to slight differences in
amplitude at high amplitudes than low
amplitudes
Makes frequency estimates less sensitive to slight
variations in input (power variation due to
speaker’s mouth moving closer to mike)
Phase information not helpful in speech
MFCC
The Cepstrum
One way to think about this
Separating the source and filter
Speech waveform is created by
A glottal source waveform
Passes through a vocal tract which because of its shape has a
particular filtering characteristic
Remember articulatory facts from lecture 2:
The vocal cord vibrations create harmonics
The mouth is an amplifier
Depending on shape of oral cavity, some harmonics are
amplified more than others
Vocal Fold Vibration
UCLA Phonetics Lab Demo
George Miller figure
We care about the filter not the
source
Most characteristics of the source
F0
Details of glottal pulse
Don’t matter for phone detection
What we care about is the filter
The exact position of the articulators in the oral
tract
So we want a way to separate these
And use only the filter function
The Cepstrum
The spectrum of the log of the spectrum
Spectrum
Log spectrum
Spectrum of log spectrum
Thinking about the Cepstrum
Mel Frequency cepstrum
The cepstrum requires Fourier analysis
But we’re going from frequency space back to time
So we actually apply inverse DFT
Details for signal processing gurus: Since the log
power spectrum is real and symmetric, inverse DFT
reduces to a Discrete Cosine Transform (DCT)
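The DCT step can be written out directly from its definition. A sketch (the stand-in log filterbank energies are illustrative, and the DCT-II is implemented by hand rather than via a library):

```python
import numpy as np

# Orthonormal DCT-II, written out from its definition:
# c[k] = scale(k) * 2 * sum_j x[j] * cos(pi * (j + 0.5) * k / N)
def dct2(x):
    N = len(x)
    n = np.arange(N)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / N)
    out = 2.0 * basis @ x
    out[0] *= np.sqrt(1.0 / (4 * N))   # orthonormal scaling for c0
    out[1:] *= np.sqrt(1.0 / (2 * N))  # and for the rest
    return out

# Stand-in log mel filterbank energies (26 filters); keep the first
# 12 coefficients, discarding the later ones (which carry source
# detail like the F0 spike).
log_mel_energies = np.log(np.arange(1, 27, dtype=float))
mfcc = dct2(log_mel_energies)[:12]
```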
Another advantage of the
Cepstrum
DCT produces highly uncorrelated features
If we use diagonal covariance matrices for our
Gaussian mixture models, we can only model
uncorrelated features.
In general we’ll just use the first 12 cepstral
coefficients (we don’t want the later ones which have
e.g. the F0 spike)
MFCC
Dynamic Cepstral Coefficient
The cepstral coefficients do not capture
energy
So we add an energy feature
“Delta” features
Speech signal is not constant
slope of formants,
change from stop burst to release
So in addition to the cepstral features
Need to model changes in the cepstral
features over time.
“delta features”
“double delta” (acceleration) features
Delta and double-delta
Derivative: in order to obtain temporal information
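A sketch of one common delta formula (the two-frames-each-side window is an assumed, typical choice): d[t] = Σₙ n·(c[t+n] − c[t−n]) / (2 Σₙ n²).

```python
import numpy as np

def deltas(c, N=2):
    # c: (frames, dims) cepstral matrix; N frames of context each side.
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(c, ((N, N), (0, 0)), mode='edge')  # repeat edge frames
    d = np.zeros_like(c)
    for t in range(c.shape[0]):
        for n in range(1, N + 1):
            d[t] += n * (padded[t + N + n] - padded[t + N - n])
    return d / denom

# A linear ramp in every dimension has slope exactly 1 per frame.
cep = np.tile(np.arange(5, dtype=float)[:, None], (1, 13))
d = deltas(cep)
```

Applying `deltas` twice gives the double-delta (acceleration) features.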
Typical MFCC features
Window size: 25ms
Window shift: 10ms
Pre-emphasis coefficient: 0.97
MFCC:
12 MFCC (mel frequency cepstral coefficients)
1 energy feature
12 delta MFCC features
12 double-delta MFCC features
1 delta energy feature
1 double-delta energy feature
Total 39-dimensional features
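Assembling the 39-dimensional vector is just stacking. A sketch (the random inputs and the simple first-difference delta estimator are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 100
cepstra = rng.standard_normal((n_frames, 12))   # 12 MFCCs per frame
energy = rng.standard_normal((n_frames, 1))     # 1 energy feature

def deltas(c):
    # Simple first difference as a stand-in delta estimator.
    return np.diff(c, axis=0, prepend=c[:1])

static = np.hstack([cepstra, energy])            # 13 static features
delta = deltas(static)                           # 13 delta features
delta2 = deltas(delta)                           # 13 double-delta features
features = np.hstack([static, delta, delta2])    # 39 total per frame
```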
Why is MFCC so popular?
Efficient to compute
Incorporates a perceptual Mel frequency
scale
Separates the source and filter
IDFT(DCT) decorrelates the features
Necessary for diagonal assumption in
HMM modeling
There are alternatives like PLP
Feature extraction for DNNs
Mel-scaled log energy
For DNN (neural net) acoustic models
instead of Gaussians
We don’t need the features to be
decorrelated
So we use mel-scaled log-energy spectral
features instead of MFCCs
Just run the same feature extraction but
skip the discrete cosine transform.
Acoustic modeling of variation
Variation due to speaker differences
Speaker adaptation
MLLR
MAP
Splitting acoustic models by gender
Speaker adaptation approaches also solve
Variation due to environment
Lombard speech
Foreign accent
Acoustic and pronunciation adaptation to accent
Variation due to genre differences
Pronunciation modeling
Acoustic Model Adaptation
Shift the means and variances of Gaussians to better
match the input feature distribution
Maximum Likelihood Linear Regression (MLLR)
Maximum A Posteriori (MAP) Adaptation
For both speaker adaptation and environment
adaptation
Widely used!
Maximum Likelihood Linear
Regression (MLLR)
Leggetter, C.J. and P. Woodland. 1995. Maximum
likelihood linear regression for speaker adaptation of
continuous density hidden Markov models. Computer
Speech and Language 9:2, 171-185.
Given:
a trained AM
a small “adaptation” dataset from a new speaker
Learn new values for the Gaussian mean vectors
Not by just training on the new data (too small)
But by learning a linear transform which moves the
means.
Maximum Likelihood Linear
Regression (MLLR)
Estimates a linear transform matrix (W) and bias
vector (w) to transform HMM model means:
μnew = Wr μold + wr
Transform estimated to maximize the likelihood of
the adaptation data
Slide from Bryan Pellom
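Applying an already-estimated MLLR transform is a single matrix operation per mean. A sketch (the W and w values here are illustrative, not estimated from adaptation data):

```python
import numpy as np

# One shared transform (W, w) moves every Gaussian mean at once:
# mu_new = W @ mu_old + w
n_gaussians, dim = 4, 3
rng = np.random.default_rng(1)
means = rng.standard_normal((n_gaussians, dim))  # stand-in HMM means

W = np.eye(dim) * 1.1          # shared transform matrix (illustrative)
w = np.full(dim, 0.5)          # shared bias vector (illustrative)

new_means = means @ W.T + w    # all means transformed in one step
```

This is why so little adaptation data suffices: only one (W, w) pair is estimated, not one mean per triphone.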
MLLR
New equation for output likelihood
b_j(o_t) = 1 / ((2π)^(n/2) |Σ_j|^(1/2)) exp( -(1/2) (o_t - (Wμ_j + w))^T Σ_j^(-1) (o_t - (Wμ_j + w)) )
MLLR
Q: Why is estimating a linear transform from
adaptation data different than just training on
the data?
A: Even from a very small amount of data we
can learn 1 single transform for all triphones!
So small number of parameters.
A2: If we have enough data, we could learn
more transforms (but still less than the number
of triphones). One per phone (~50) is often
done.
MLLR: Learning
Given
a small labeled adaptation set (a couple of
sentences)
a trained AM
Do forward-backward alignment on
adaptation set to compute state occupation
probabilities γj(t).
W can now be computed by solving a system
of simultaneous equations involving γj(t)
MLLR performance on baby task (RM)
(Leggetter and Woodland 1995)
Only 3 sentences!
11 seconds of speech!
MLLR doesn’t need a supervised
adaptation set!
Slide from Bryan Pellom
Slide from Bryan Pellom after Huang et al
Summary
MLLR: works on small amounts of
adaptation data
MAP: Maximum A Posterior Adaptation
Works well on large adaptation sets
Acoustic adaptation techniques are quite
successful at dealing with speaker variability
If we can get 10 seconds with the speaker.
Sources of Variability: Environment
Noise at source
Car engine, windows open
Fridge/computer fans
Noise in channel
Poor microphone
Poor channel in general (cellphone)
Reverberation
Lots of research on noise-robustness
Spectral subtraction for additive noise
Cepstral Mean Normalization
Microphone arrays
What is additive noise?
Sound pressure for two non-coherent sources:
p^2 = p_s^2 + p_n^2
• p_s : speech source
• p_n : noise source
• p : mixture of speech and noise sources
Slide from Kalle Palomäki, Ulpu Remes,
Mikko Kurimo
What is additive noise?
[Spectrogram examples of speech in noise at SNRs of -18, -14, -10, -6, -2, and 2 dB]
Slide from Kalle Palomäki, Ulpu Remes,
Mikko Kurimo
Additive Noise: Spectral
Subtraction
Find some silence in the signal, record it and
compute the spectrum of the noise.
Subtract this spectrum from the rest of the signal
Hope that the noise is constant.
There are weird artifacts of the subtraction that
have to be cleaned up
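A minimal sketch of the idea (frame sizes and signals are illustrative; real systems add oversubtraction factors and smoothing to tame those artifacts):

```python
import numpy as np

def spectral_subtract(noisy_frames, noise_frames):
    # Estimate the noise magnitude spectrum from "silent" frames.
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)
    spectra = np.fft.rfft(noisy_frames, axis=1)
    # Subtract it from each frame; floor at zero, since negative
    # magnitudes are meaningless (this flooring causes the artifacts).
    cleaned_mag = np.maximum(np.abs(spectra) - noise_mag, 0.0)
    # Keep the noisy phase; phase matters little for speech features.
    phase = np.angle(spectra)
    return np.fft.irfft(cleaned_mag * np.exp(1j * phase),
                        n=noisy_frames.shape[1], axis=1)

rng = np.random.default_rng(2)
noise = 0.1 * rng.standard_normal((10, 256))           # "silence" frames
clean = np.sin(2 * np.pi * 8 * np.arange(256) / 256)[None, :].repeat(10, axis=0)
denoised = spectral_subtract(clean + 0.1 * rng.standard_normal((10, 256)), noise)
```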
Additive Noise:
Parallel Model Combination
Best but impossible: train models with exact same noisy
speech as test set
Instead:
Collect noise in test, generate a model
Combine noise model and clean-speech models in real-time
performed on model parameters in cepstral domain
Noise and signal are additive in linear domain so transform the
parameters from cepstral to linear domain for combination
[Diagram: the clean-speech HMM and noise HMM live in the cepstral domain; both are mapped to the linear spectral domain via C^-1 and exp, combined, then mapped back via log and C to produce the noisy-speech HMM]
Slide from Li Lin Shan 李琳山
Cepstral Mean Normalization
Microphone, room acoustics, etc.
Treat as channel distortion
A linear filter h[n] convolved with the signal
y[n] = x[n] ∗ h[n]
In frequency space:
Y(k) = X(k)H(k)
In log frequency space:
log Y(k) = log X(k) + log H(k)
H is constant for a given sentence
So subtracting the mean of the sentence, we eliminate
this constant filter
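CMN is one line in practice. A sketch (the utterance and channel offset are synthetic stand-ins): the same cepstra with and without a constant channel offset normalize to identical features.

```python
import numpy as np

# Cepstral mean normalization: subtract each dimension's mean over the
# utterance, removing any constant (log-domain) channel filter.
def cmn(cepstra):
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(3)
utterance = rng.standard_normal((200, 13))            # stand-in cepstra
channel_offset = np.linspace(-1.0, 1.0, 13)           # a constant filter H

normalized_clean = cmn(utterance)
normalized_filtered = cmn(utterance + channel_offset)  # same result
```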
Sources of Variability: Genre/Style/Task
Read versus conversational speech
Lombard speech
Domain (Booking restaurants, dictation, or
meeting summarization)
One simple example: The Lombard
effect
Changes in speech production in the presence of
background noise
Increase in:
Amplitude
Pitch
Formant frequencies
Result: intelligibility (to humans) increases
Me talking
over Ray Charles:
longer, louder, higher
Lombard Speech
Me talking over
silence
Analysis of Speech Features under LE
Fundamental Frequency
[Figures: distributions of fundamental frequency (70–570 Hz) for female and male speakers in the Czech SPEECON (office vs. car), CZKCC (engine off vs. engine on), and CLSD'05 (neutral vs. Lombard effect) corpora]
Slides from John Hansen
Analysis of Speech Features under LE
Formant Locations
[Figures: F1–F2 formant plots of vowels from female and male digit utterances in the CZKCC and CLSD'05 corpora, comparing neutral (N) and Lombard-effect (LE) speech]
Slides from John Hansen
One solution to Lombard speech
MLLR
Sources of Variability: Speaker
Gender
Dialect/Foreign Accent
Individual Differences
Physical differences
Language differences (“idiolect”)
VTLN
Speakers overlap
in their phones
Vowels from
different speakers:
VTLN
Vocal Tract Length Normalization
Remember we said the vocal tract was a tube of length L
If you scale the tube by a factor k, the new length is
L′ = kL
Then the formants are scaled by 1/k
In decoding, try various ks
warp the frequency axis linearly during the FFT
computation so as to fit some “canonical” speaker.
Then compute MFCCs as usual
Slide adapted from Chen, Picheny, Eide
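A sketch of the linear warping step (the 16 kHz rate, toy spectrum, and candidate k values are illustrative; real systems pick the k that maximizes likelihood):

```python
import numpy as np

# Linear VTLN: read the spectrum at scaled frequency positions f/k,
# simulating a vocal tract of different length.
def warp_spectrum(spectrum, k, fs=16000):
    n = len(spectrum)
    freqs = np.linspace(0, fs / 2, n)
    # Linear interpolation onto the warped frequency axis.
    return np.interp(freqs / k, freqs, spectrum)

spectrum = np.exp(-np.linspace(0, 8, 257))   # toy spectral envelope
# Try several warp factors; k = 1.0 leaves the spectrum unchanged.
candidates = [warp_spectrum(spectrum, k) for k in (0.9, 1.0, 1.1)]
```

MFCCs are then computed as usual from the warped spectrum for whichever k fits the "canonical" speaker best.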
Acoustic Adaptation to Foreign
Accent
Train on accented data (if you have enough)
Otherwise, combine MLLR and MAP
MLLR (move the means toward native
speakers)
MAP (mix accented and native speakers
with weights)
Variation due to task/genre
Probably largest remaining source of
error in current ASR
I.e., is an unsolved problem
Maybe one of you will solve it!
Variation due to the conversational
genre
Weintraub, Taussig, Hunicke-Smith, Snodgrass. 1996.
Effect of Speaking Style on LVCSR Performance.
SRI collected a spontaneous conversational speech
corpus, in two parts:
1. Spontaneous Switchboard-style conversation on
an assigned topic
2. A reading session in which participants read
transcripts of their own conversations, either:
a. As if they were dictating to a computer
b. As if they were having a conversation
How do the 3 genres affect WER?
WER on the exact same words:
Speaking Style             Word Error
Read Dictation             28.8%
Read Conversational        37.6%
Spontaneous Conversation   52.6%
Conclusion: it’s not the words, it’s something about the
pronunciation of spontaneous speech
Conversational pronunciations!
Switchboard corpus
I was like, “Itʼs just a stupid bug!”
ax z l ay k ih s jh ah s t ey s t uw p ih b ah g
HMMs built from pronunciation dictionary
Actual phones don’t match dictionary sequence!
I was:  ax z   not   ay w ah z
It’s:   ih s   not   ih t s
Testing the hypothesis that pronunciation
is the problem
Saraclar, M, H. Nock, and S. Khudanpur. 2000. Pronunciation modeling by sharing
Gaussian densities across phonetic models. Computer Speech and Language
14:137-160.
“Cheating experiment” or “Oracle experiment”
What is error rate if oracle gave perfect knowledge?
What if you knew what pronunciation to use?
Extracted the actual pronunciation of each word in
Switchboard test set from phone recognizer
Use that pronunciation in dictionary
Baseline SWBD system WER:         47%
Oracle Pronunciation Dictionary:  27%
Solutions
We’ve tried many things
Multiple pronunciations per word
Decision trees to decide how a word
should be pronounced
Nothing has worked well
Conclusions so far:
triphones do well at minor phonetic
variation
the problem is massive deletions of
phones
Pronunciation modeling in current
recognizers
Use single pronunciation for a word
How to choose this pronunciation?
Generate many pronunciations
Forced alignment on training set
Merge similar pronunciations
For each word in the dictionary:
If it occurs in training, pick its most likely pronunciation
Else learn mappings from seen pronunciations,
and apply these to unseen pronunciations
Outline
Feature extraction
MFCC
Dealing with variation
Adaptation
MLLR
MAP
Lombard speech
Foreign accent
Pronunciation variation