Front end signal processing for ASR


Feature Extraction for
speech applications
Chapters 19-22
The course so far
• Brief introduction to speech analysis and
recognition for humans and machines
• Some basics on speech production, acoustics,
pattern classification, speech units
Where to next
• Multi-week focus on audio signal processing
• Start off with the “front end” for ASR
• Goal: generate features good for classification
• Waveform is too variable
• Current front ends make some sense in terms
of signal characteristics alone (production
model) - recall the spectral envelope
• But analogy to perceptual system is there too
• A bit of this now (much more on ASR in April)
Biological analogy
• Essentially all ASR front ends start with
a spectral analysis stage
• “Output” from the ear is frequency dependent
• Target-probe experiments going back to
Fletcher (remember him?) suggest a
“critical band”
• Other measurements also suggest similar
mechanism (linear below 1kHz, log above)
Basic Idea (Fletcher)
• Look at response to pure tone in white noise
• Set tone level to just audible
• Reduce noise BW, initially same threshold
• For noise BW below critical value, audibility
threshold goes down
• Presence or absence of tone based on SNR
within the band
Feature Extraction for ASR
[Block diagram: spectral (envelope) analysis, followed by auditory model / normalizations]
Deriving the envelope
(or the excitation)
[Source-filter model: an excitation e(n) drives a time-varying filter ht(n), so y(n) = e(n) * ht(n)]
HOW CAN WE GET e(n) OR h(n) FROM y(n)?
But first, why?
• Excitation/pitch: for vocoding; for synthesis;
for signal transformation; for prosody
extraction (emotion, sentence end, ASR for
tonal languages …); for voicing category in ASR
• Filter (envelope): for vocoding; for synthesis;
for phonetically relevant information for ASR
• Frequency dependency appears to be a key
aspect of a system that works - human audition
Spectral Envelope Estimation
• Filters
• Cepstral Deconvolution
(Homomorphic filtering)
• LPC
Channel vocoder
(analysis)
[Block diagram: bandpass power estimation, broad w.r.t. harmonics. Input e(n)*h(n) → band-pass filter → rectifier → low-pass filter, with the intermediate waveforms labeled A, B, C]
Deriving spectral envelope
with a filter bank
[Block diagram: speech feeds N parallel channels; each channel is band-pass filter (BP k) → rectify → low-pass filter (LP k) → decimate, producing N magnitude signals]
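As a concrete illustration of the block diagram above, here is a minimal Python sketch of per-band envelope extraction (band-pass, rectify, low-pass, decimate). The band edges, filter orders, smoothing cutoff, and decimation factor are illustrative choices, not values from the slides.

import numpy as np
from scipy import signal

def filterbank_envelopes(x, fs, band_edges, lp_cutoff=50.0, decimate_by=80):
    """x: speech waveform; band_edges: list of (lo, hi) in Hz."""
    envelopes = []
    for lo, hi in band_edges:
        # Band-pass filter (one channel of the bank)
        sos_bp = signal.butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = signal.sosfilt(sos_bp, x)
        # Rectify, then low-pass to keep only the slowly varying envelope
        sos_lp = signal.butter(4, lp_cutoff, btype="lowpass", fs=fs, output="sos")
        env = signal.sosfilt(sos_lp, np.abs(band))
        # Decimate: the envelope needs far fewer samples than the waveform
        envelopes.append(env[::decimate_by])
    return np.array(envelopes)   # shape: (n_bands, n_frames)

# e.g. E = filterbank_envelopes(x, 16000, [(100, 400), (400, 1000), (1000, 2400), (2400, 5000)])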
Filterbank properties
• Original Dudley Voder/Vocoder: 10 filters,
300 Hz bandwidth (based on # fingers!)
• A decade later, Vaderson used 30 filters,
100 Hz bandwidth (better)
• Using variable frequency resolution, can use
16 filters with the same quality
Mel filterbank
• Warping function B(f) = 1125 ln (1 + f/700)
• Based on listening experiments with pitch
(mel is for “melody”)
Other warping functions
• Bark(f) = [26.8 / (1 + (1960/f))] - 0.53
(named after Barkhausen, who proposed a loudness scale)
Based on critical-band estimates from masking
experiments
• ERB(f) = 21.4 log10(1+ 4.37f/1000)
(Equivalent Rectangular Bandwidth)
Similarly based on masking experiments,
but with better estimates of auditory filter shape
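For reference, a small Python sketch of the three warping functions quoted above (f in Hz); the values in the last comment are just sanity checks.

import numpy as np

def mel(f):
    return 1125.0 * np.log(1.0 + f / 700.0)           # B(f) = 1125 ln(1 + f/700)

def bark(f):
    return 26.8 / (1.0 + 1960.0 / f) - 0.53            # Bark scale, as given above

def erb(f):
    return 21.4 * np.log10(1.0 + 4.37 * f / 1000.0)    # ERB-rate scale

# mel(1000) ~ 1000, bark(1000) ~ 8.5, erb(1000) ~ 15.6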
All together now
Towards other
deconvolution methods
• Filters seem biologically plausible
• Other operations could potentially
separate excitation from filter
• Periodic source provides harmonics
(close together in frequency)
• Filter provides broad influence
(envelope) on harmonic series
• Can we use these facts to separate?
“Homomorphic”
processing
• Linear processing is well-behaved
• Some simple nonlinearities also permit
simple processing, interpretation
• Logarithm a good example; multiplicative
effects become additive
• Sometimes in additive domain, parts
more separable
• Famous example: “blind” deconvolution
of Caruso recordings
IEEE Oral History Transcripts: Oppenheim on
Stockham’s Deconvolution of Caruso Recordings (1)
Oppenheim: Then all speech compression systems and many speech recognition systems are
oriented toward doing this deconvolution, then processing things separately, and then going on
from there. A very different application of homomorphic deconvolution was something that Tom
Stockham did. He started it at Lincoln and continued it at the University of Utah. It has become
very famous, actually. It involves using homomorphic deconvolution to restore old Caruso
recordings.
Goldstein: I have heard about that.
Oppenheim: Yes. So you know that's become one of the well-known applications of
deconvolution for speech.
…
Oppenheim: What happens in a recording like Caruso's is that he was singing into a horn to make
the recording. The recording horn has an impulse response, and that distorts the effect of his
voice, my talking like this. [cupping his hands around his mouth]
Goldstein: Okay.
IEEE Oral History Transcripts (2)
Oppenheim: So there is a reverberant quality to it. Now what you want to do is deconvolve that
out, because what you hear when I do this [cupping his hands around his mouth] is the
convolution of what I'm saying and the impulse response of this horn. Now you could say, "Well
why don't you go off and measure it. Just get one
of those old horns, measure its impulse response, and then you can do the deconvolution."
The problem is that the characteristics of those horns changed with temperature, and they
changed with the way they were turned up each time. So you've got to estimate that from the
music itself. That led to a whole notion which I believe Tom launched, which is the concept of
blind deconvolution. In other words, being able to estimate from the signal that you've got the
convolutional piece that you want to get rid of. Tom did that using some of the techniques of
homomorphic filtering. Tom and a student of his at Utah named Neil Miller did some further
work. After the deconvolution, what happens is you apply some high pass filtering to the
recording. That's what it ends up doing. What that does is amplify some of the noise that's on
the recording. Tom and Neil knew Caruso's singing. You can use the homomorphic vocoder
that I developed to analyze the singing and then resynthesize it. When you resynthesize it you
can do so without the noise. They did that, and of course what happens is not only do you get
rid of the noise but you get rid of the orchestra. That's actually become a very fun demo which I
still play in my class. This was done twenty years ago, but it's still pretty dramatic. You hear
Caruso singing with the orchestra, then you can hear the enhanced version after the blind
deconvolution, and then you can also hear the result after you get rid of the orchestra. Getting
rid of the orchestra is something you can't do with linear filtering. It has to be a nonlinear
technique.
Log processing
• Suppose y(n) = e(n) * h(n)
• Then Y(f) = E(f)H(f)
• And log Y(f) = log E(f) + log H(f)
• In some cases, these pieces are
separable by a linear filter
• If all you want is H, processing can
smooth Y(f)
Source-filter separation by
cepstral analysis
[Block diagram: windowed speech → FFT → log magnitude → FFT (cepstrum) → separation function in the time (quefrency) domain; the excitation part feeds pitch detection, while the low-quefrency part gives the spectral envelope]
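A minimal Python sketch of the pipeline above for a single windowed frame: FFT, log magnitude, inverse FFT to get the (real) cepstrum, then a low-quefrency lifter to recover a smoothed spectral envelope. The lifter length and FFT size are illustrative choices.

import numpy as np

def cepstral_envelope(frame, n_keep=30, n_fft=512):
    spec = np.fft.rfft(frame, n_fft)
    log_mag = np.log(np.abs(spec) + 1e-10)          # log magnitude spectrum
    ceps = np.fft.irfft(log_mag, n_fft)             # real cepstrum
    lifter = np.zeros(n_fft)
    lifter[:n_keep] = 1.0                           # keep low quefrencies (envelope part)
    lifter[-(n_keep - 1):] = 1.0                    # and their symmetric counterparts
    smooth_log_mag = np.fft.rfft(ceps * lifter, n_fft).real
    return ceps, smooth_log_mag                     # cepstrum, smoothed log spectrum

The high quefrencies removed by the lifter carry the excitation; a peak there at the pitch period is what the pitch detector in the diagram looks for.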
Cepstral features
• Typically truncated (smooths the estimate; why?)
• Corresponds to spectral envelope estimation
• Features also are roughly orthogonal
• Common transformation for many spectral
features
• Used almost universally for ASR (in some form)
• To reconstruct speech (without min phase
assumption) need complex cepstrum
An alternative:
Incorporate Production
• Assume simple excitation/vocal tract model
• Assume cascaded resonators for vocal tract
frequency response (envelope)
• Find resonator parameters for best spectral
approximation
Resonator frequency response
Pole-only (complex) resonator:

H(z) = 1 / (1 - b z^-1 - c z^-2)

with b = 2r cos(θ) and c = -r^2,
where r = pole magnitude, θ = pole angle
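A short sketch of what this looks like numerically, assuming an illustrative pole at about 500 Hz with magnitude 0.97 at an 8 kHz sampling rate:

import numpy as np
from scipy import signal

r, theta = 0.97, 2 * np.pi * 500 / 8000        # pole magnitude and angle (illustrative)
b_coef = 2 * r * np.cos(theta)
c_coef = -r**2
# H(z) = 1 / (1 - b z^-1 - c z^-2)  ->  denominator polynomial [1, -b, -c]
w, h = signal.freqz(b=[1.0], a=[1.0, -b_coef, -c_coef], worN=512, fs=8000)
# |h| peaks near 500 Hz; moving r closer to 1 gives a sharper, narrower resonance.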
Error Signal

e(n) = y(n) - ỹ(n) = y(n) - Σ_{j=1..P} a_j y(n-j)

E(z) = Y(z) - Ỹ(z) = Y(z) - Σ_{j=1..P} a_j z^-j Y(z)
     = Y(z) (1 - Σ_{j=1..P} a_j z^-j)

so

E(z) = Y(z) / H(z)

where

H(z) = 1 / (1 - Σ_{j=1..P} a_j z^-j)
Some LPC Issues
• Error criterion
• Model order
Error Criterion
LPC Peak Modeling
• Total error constrained to be (at best)
gain factor squared
• Error where model spectrum is larger
contributes less
• Model spectrum tends to “hug” peaks
LPC spectra and error
More effects of
LPC error criterion
• Globally tracks, but worse match in
log spectrum for low values
• “Attempts” to model anti-aliasing
filter, mic response
• Ill-conditioned for wide-ranging spectral
values
Other LPC properties
• Behavior in noise
• Sharpness of peaks
• Speaker dependence
LPC Model Order
• Too few, can’t represent formants
• Too many, model detail, especially harmonics
• Too many, low error, ill-conditioned matrices
LPC Speech Spectra
LPC Prediction error
Optimal Model Order
• Akaike Information Criterion (AIC)
• Cross-validation (trial and error)
Coefficient Estimation
• Minimize squared error - set derivs to zero
• Compute in blocks or on-line
• For blocks, use autocorrelation or covariance
methods (pertains to windowing, edge effects)
Minimizing the error criterion

D = Σ_{n=0..N-1} e^2(n) = Σ_{n=0..N-1} ( y(n) - Σ_{j=1..P} a_j y(n-j) )^2

If we take partial derivatives with respect to each a_j and set them to zero:

Σ_{j=1..P} a_j φ(i,j) = φ(i,0)   for i = 1, 2, ..., P

where φ(i,j) is a correlation sum between versions
of the signal delayed by i and j points
Solving the Equations
• Autocorrelation method: Levinson or Durbin
recursions, O(P^2) ops; uses Toeplitz property
(constant along left-right diagonals);
guaranteed stable
• Covariance method: Cholesky decomposition,
O(P^3) ops; uses only the symmetry property; not
guaranteed stable
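A compact Python sketch of the autocorrelation route (Levinson-Durbin recursion) for one windowed frame; the model order P = 12 is an illustrative choice.

import numpy as np

def lpc_autocorrelation(frame, P=12):
    # Autocorrelation sums r[0..P] (the Toeplitz case)
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(P + 1)])
    a = np.zeros(P + 1)      # polynomial A(z) = 1 + a_1 z^-1 + ... + a_P z^-P
    a[0] = 1.0
    err = r[0]
    for i in range(1, P + 1):
        # Reflection (PARCOR) coefficient for this order
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]     # order-update of the polynomial
        err *= (1.0 - k * k)                           # prediction-error energy
    return -a[1:], err       # predictor coefficients a_j (as defined above), residual energy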
LPC-based representations
• Predictor polynomial - ai, 1<=i<=p , direct computation
• Root pairs - roots of polynomial, complex pairs
• Reflection coefficients - recursion; interpolated values
always stable (also called PARCOR coefficients ki, 1<=i<=p)
• Log area ratios = ln((1-k)/(1+k)) , low spectral sensitivity
• Line spectral frequencies - freq. pts around resonance;
low spectral sensitivity, stable
• Cepstra - can be unstable, but useful for recognition
LPC analysis block diagram
Spectral Estimation

Property                    Filter Banks   Cepstral Analysis   LPC
Reduced Pitch Effects            X                X             X
Excitation Estimate                               X             X
Direct Access to Spectra         X
Less Resolution at HF            X
Orthogonal Outputs                                X
Peak-hugging Property                                           X
Reduced Computation                                             X
Feature Extraction for
ASR
Chapter 22
ASR Front End
• Coarse spectral representation (envelope)
• Coarsest for high frequencies
• Limitations for each basic type (filter bank,
cepstrum, LPC)
Limitations for archetypes
• Filter banks - correlated outputs, no focus on peaks
• Cepstral analysis - uniform spectral resolution, no focus on
peaks
• LPC - uniform spectral resolution
• Solution: hybrid approaches
Two “Standards”
• Mel Cepstrum: Bridle (1974), Davis and
Mermelstein (1980)
• Perceptual Linear Prediction (PLP): Hermansky,
~1985, 1990
Mel Cepstral Analysis vs. PLP Analysis

Step             Mel Cepstrum                 PLP
Preemphasis      Single-zero FIR              Done in critical-band step
FFT, | |^2       (common to both)
Critical bands   Triangular                   Trapezoidal
Compression      Log                          Cube root
IFFT             (common to both)
Smoothing        Cepstral truncation /        LPC analysis
                 liftering
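To make the left-hand column concrete, here is a minimal mel-cepstrum sketch for one frame in Python. All parameter values (24 triangular filters, 13 cepstra, 0.97 preemphasis, Hamming window) are illustrative, and a DCT stands in for the IFFT, as is usual for a real, even log spectrum.

import numpy as np

def mel_filterbank(n_filters, n_fft, fs, f_lo=0.0, f_hi=None):
    """Triangular filters equally spaced on the mel scale B(f) = 1125 ln(1 + f/700)."""
    f_hi = f_hi or fs / 2.0
    mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (np.exp(m / 1125.0) - 1.0)
    edges = mel_inv(np.linspace(mel(f_lo), mel(f_hi), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, fs=16000, n_fft=512, n_filters=24, n_ceps=13):
    pre = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])     # single-zero FIR preemphasis
    power = np.abs(np.fft.rfft(pre * np.hamming(len(pre)), n_fft)) ** 2
    fbank = mel_filterbank(n_filters, n_fft, fs) @ power         # triangular critical bands
    log_e = np.log(fbank + 1e-10)                                # log compression
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ log_e                                           # truncated cepstrum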
Perceptual Linear
Prediction (PLP)
[Hermansky 1990]
• Auditory-like modifications of the short-term speech spectrum
prior to its approximation by an all-pole autoregressive model
(or cepstral truncation in the case of MFCC):
- critical-band spectral resolution
- equal-loudness sensitivity
- intensity-loudness nonlinearity
• These 3 are applied in virtually all state-of-the-art
experimental ASR systems
Steps 2-4 of PLP
Dynamic Features
• Delta features - local slope in cepstrum
• Computed by filtering/linear regression
• Higher derivatives often used now
• Typically used in combination w/ “static”
features
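A small sketch of the regression form of the delta computation, for a cepstral trajectory C of shape (n_frames, n_ceps); the window half-width K = 2 is a common but illustrative choice.

import numpy as np

def delta(C, K=2):
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    Cp = np.pad(C, ((K, K), (0, 0)), mode="edge")    # repeat edge frames
    d = np.zeros_like(C, dtype=float)
    for k in range(1, K + 1):
        d += k * (Cp[K + k:K + k + len(C)] - Cp[K - k:K - k + len(C)])
    return d / denom

# Higher derivatives ("delta-deltas") are simply delta(delta(C)).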
Speaker robustness - VTLN
• Different vocal tract lengths -> different
formant positions (e.g., male vs female)
• Expansion/compression can be estimated
• Typically use statistical modeling to optimize
• Can look at characteristics like pitch or 3rd
formant
Acoustic (environment)
robustness
• Convolutional error (e.g., microphone, channel
spectrum)
• Additive noise (e.g., fans, auto engine)
• Limitations for typical solutions: time-invariant
or slowly varying, linear, phone-independent
Key Processing Step for
ASR:
Cepstral Mean Subtraction
• Imagine a fixed filter h(n), so x(n)=s(n)*h(n)
• Same arguments as before, but
- let s vary over time
- let h be fixed over time
• Then average cepstra should represent the
fixed component (including fixed part of s)
• (Think about it)
Convolutional Error
X(ω,t) = S(ω,t)H(ω,t)
|X(ω,t)|2 = |S(ω,t)|2 |H(ω,t)|2
log |X(ω,t)|2 = log|S(ω,t)|2 + log |H(ω,t)|2
Cx(n,t) = CS (n,t) + CH (n,t)
Convolutional error
strategies
• Blind deconvolution/cepstral mean subtraction:
Atal 1974
• On-line method- RelAtive SpecTral Analysis
(RASTA): Hermansky and Morgan, 1991
Some variants on CMS
• Subtract utterance mean from each
cepstral coefficient
• Compute mean over a longer region (e.g.,
conversational side)
• Compute a running mean
• Use the mean from the last utterance
• Also divide by std deviation
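A minimal sketch of the first variant (per-utterance mean subtraction, optionally with variance normalization), for a cepstral matrix C of shape (n_frames, n_ceps):

import numpy as np

def cms(C, normalize_variance=False):
    out = C - C.mean(axis=0, keepdims=True)              # remove the fixed (convolutional) part
    if normalize_variance:
        out = out / (C.std(axis=0, keepdims=True) + 1e-10)
    return out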
“Standard” RASTA
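The filter itself is not reproduced in this text, so the following Python sketch assumes the band-pass IIR commonly quoted for standard RASTA (numerator 0.1·[2, 1, 0, -1, -2], denominator [1, -0.98]), applied along time to each log (or otherwise compressed) spectral-energy trajectory:

import numpy as np
from scipy import signal

def rasta_filter(log_spectra):
    """log_spectra: (n_frames, n_bands) trajectories, filtered along the time axis."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])   # assumed RASTA numerator
    a = np.array([1.0, -0.98])                        # assumed RASTA denominator
    return signal.lfilter(b, a, log_spectra, axis=0)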
Some of the proposed
improvements to RASTA
• Run backwards and forwards in time (gets
rid of phase in transfer fn)
• Train filter on data (discriminative RASTA)
• Use multiple filters
• Use in combination with Wiener filtering
Long-time convolution
• Reverberation has effects beyond the typical
analysis frame
• Can do log spectral subtraction w/ long frames
• Alternatively, smear system training data to
improve match to temporal smearing in test
• In practice, this is an unsolved problem
(especially when noise is present, i.e., always)
Additive noise (stationary)
• Subtract off noise spectral estimate
• Need a noise estimate
• Use a second microphone if you have it
Wiener filter / spectral
subtraction
• Assume that X = S + N
(suppressing freq dependence in the notation)
• If uncorrelated, |X|^2 = |S|^2 + |N|^2 (PSDs)
• Spectral subtraction: |Sest|^2 = |X|^2 - |Nest|^2 , i.e. |H|^2 = 1 - |N|^2 / |X|^2
• If there is no channel effect, the Wiener filter is
H = |S|^2 / (|S|^2 + |N|^2)
• So the Wiener filter is H = 1 - |N|^2 / |X|^2
• Similar effect, but the two differ in the exponent (gain vs. gain squared)
• In practice many variants - also smoothing to
avoid “musical noise”
Just Suppose …
• What if, for some ω, |Nest|^2 > |X|^2 ?
• Then |Sest|^2 = |X|^2 - |Nest|^2 is negative
• But if it is a PSD …
• So, what should we do?
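The usual answer is to clamp rather than go negative. A sketch of magnitude-squared spectral subtraction with a simple spectral floor (the floor fraction is an illustrative choice):

import numpy as np

def spectral_subtraction(X_power, N_power_est, floor=0.01):
    """X_power, N_power_est: per-bin power spectra |X|^2 and |Nest|^2."""
    S_est = X_power - N_power_est
    # Where the noise estimate exceeds the observation, keep a small fraction
    # of the observed power instead of a negative (impossible) PSD.
    return np.maximum(S_est, floor * X_power)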
Piano with noise
Piano with noise and Wiener filtering
ETSI standard: AFE
• Aurora competition
• AFE = “Advanced Front End”
• Noise est., Wiener filtering, done twice
• Emphasis on high SNR parts of waveform
• Other methods did well later
(e.g., Ellis – Tandem [MLP+HMM], 2 streams,
PLP+MSG)
Modulation-filtered SpectroGram (MSG)
Kingsbury, 1998 Berkeley PhD thesis
Noise and convolution
• Can use a different form of RASTA: “J-RASTA”
• Filters a log-like function of the spectrum:
f(x) = log(1 + Jx)
where J is roughly inversely proportional to the noise power
• Many other methods (primarily statistical)
• None lower word error rates to clean levels
Noise and convolution - other compensation methods
• Given “stereo” data, find additive vector to best
match the cepstra
• Get data from multiple testing
environments/microphones, find best match
• Vector Taylor Series methods (approx effect on
cepstra of noise, convolution)
• SPLICE (Stereo-based Piecewise LInear
Compensation for Environments) methods
• Or else, adaptation of stat model
Noise and convolution - what would we really want?
• For online case, would like to be insensitive
to noise and convolutional errors
• Would like to do this without needing known noise
regions
• People can do this
• So - study auditory system?
“Auditory” properties
in speech “front ends”
• Nonlinear spacing/bandwidth for filter bank
• Compression (log for MFCC, cube root for PLP)
• Preemphasis/equal loudness
• Smoothing for envelope estimate
• Insensitivity to constant spectral multiplier
Auditory Models
• Shifting definitions
• Typically means whatever we aren’t using yet
• Example: Ensemble Interval Histogram (EIH) -
looking for coherence across bands of
histogram of threshold crossings
Seneff Auditory Model
Auditory Models (cont.)
• Representation of cochlear output - e.g., the
cochleagram
• Representation of temporal information - the
correlogram (particularly for pitch)
- shows autocorrelation function for each
spectral component; i.e., frequency vs lag
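A sketch of one correlogram frame, assuming band_signals holds the per-channel filter-bank outputs for a single analysis window (window length and maximum lag are illustrative):

import numpy as np

def correlogram_frame(band_signals, max_lag=400):
    """band_signals: (n_bands, n_samples) array; returns (n_bands, max_lag)."""
    n_bands, n = band_signals.shape
    cg = np.zeros((n_bands, max_lag))
    for b in range(n_bands):
        x = band_signals[b] - band_signals[b].mean()
        full = np.correlate(x, x, mode="full")[n - 1:]    # lags 0 .. n-1
        cg[b] = full[:max_lag] / (full[0] + 1e-10)        # normalize by lag-0 energy
    return cg   # a common ridge across channels at the pitch period indicates pitch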
Correlogram example 1
Correlogram example 2
Correlogram example 3
Correlogram example 4
Correlogram example 5
Other perspectives
• Temporal information in individual bands
(TRAPS/HATS)
• Spectro-Temporal Receptive Fields (models
from ferret brain experiments)
• Multiple mappings for greater robustness,
including more “sluggish” features
ASR Systems are half-deaf
• Phonetic classification is very poor (even in low-noise conditions)
• Success is due to constraints (domain, speaker,
noise-canceling mic, etc.)
• These constraints can mask the underlying
weakness of the technology
Pushing the envelope (aside)
• Problem: Spectral envelope is a fragile information carrier
• Solution: Probabilities from multiple time-frequency patches
[Figure: OLD - a single 25 ms frame (stepped by 10 ms) produces the estimate of sound identity; NEW - multiple time-frequency patches up to 1 s long each produce an estimate (i-th, k-th, n-th, …), combined by information fusion into the estimate of sound identity]
Multi-rate features
[Figure: three parallel streams - narrowband 500 ms (HATS) built from 13 overlapping spectral slices, broadband 100 ms (Tandem) built from 9 frames of PLP cepstra, and broadband 25 ms from 1 frame of PLP cepstra; MLPs produce posteriors that are combined, and the resulting features are concatenated]
Multiple microphones
• Array approaches for beamforming
• “Distant” second microphone for noise estimate -
use cross-correlation to derive the transfer fn
for the noise to get from the noise sensor to the
signal sensor
The Rest of the System
• Focus has been on features
• Feature choice affects statistics
• Noise/channel robustness strategies often
focus on the statistical models
• For now, we will focus on a deterministic view -
later, deterministic ASR (ch 24)
• First, pitch and general audio – chap. 16, 31,
35, 37, 39
End - feature extraction; on to DTW …