Automatic speech recognition using an echo state network


Automatic speech recognition using an echo state network
Mark D. Skowronski
Computational Neuro-Engineering Lab
Electrical and Computer Engineering
University of Florida, Gainesville, FL, USA
May 10, 2006
CNEL Seminar History
• Ratio spectrum, Oct. 2000
• HFCC, Sept. 2002
• Bats, Dec. 2004
• Electrohysterography, Aug. 2005
• Echo state network, May 2006
Overview
• ASR motivations
• Intro to echo state network
• Multiple readout filters
• ASR experiments
• Conclusions
ASR Motivations
• Speech is the most natural form of communication among humans.
• Human-machine interaction lags behind with tactile interfaces.
• The bottleneck in machine understanding is signal-to-symbol translation.
• Human speech is a “tough” signal:
  – Nonstationary
  – Non-Gaussian
  – Nonlinear systems for production/perception
How to handle the “non”-ness of speech?
ASR State of the Art
• Feature extraction: HFCC
  – Bio-inspired frequency analysis
  – Tailored for statistical models
• Acoustic pattern rec: HMM
  – Piecewise-stationary stochastic model
  – Efficient training/testing algorithms
  – …but several simplistic assumptions
• Language models
  – Use knowledge of language, grammar
  – HMM implementations
  – Machine language understanding still elusive (spam blockers)
Hidden Markov Model
Premier stochastic model of non-stationary time series used for decision making.
Assumptions:
1) Speech is a piecewise-stationary process.
2) Features are independent.
3) State duration is exponential.
4) State transition probability depends only on the previous and next states.
Can we devise a better pattern recognition model?
Echo State Network
• Partially trained recurrent neural network, Herbert Jaeger, 2001
• Unique characteristics:
  – Recurrent “reservoir” of processing elements, interconnected with random untrained weights.
  – Linear readout weights trained with simple regression provide a closed-form, stable, unique solution.
ESN Diagram & Equations
x(n) = f(W·x(n−1) + Win·u(n))
y(n) = Wout·x(n)
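A minimal NumPy sketch of these two equations (function and variable names are illustrative, not from the original implementation):

```python
import numpy as np

def esn_step(x_prev, u, W, W_in, f=np.tanh):
    """Reservoir update: x(n) = f(W·x(n-1) + Win·u(n))."""
    return f(W @ x_prev + W_in @ u)

def esn_readout(x, W_out):
    """Linear readout: y(n) = Wout·x(n)."""
    return W_out @ x
```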
ESN Matrices
• Win: untrained, M × Min matrix
  – Zero mean, unity variance, normally distributed
  – Scaled by rin
• W: untrained, M × M matrix
  – Zero mean, unity variance, normally distributed
  – Scaled such that spectral radius r < 1
• Wout: trained, linear regression, Mout × M matrix
  – Regression → closed-form, stable, unique solution
  – O(M²) complexity per data point
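A sketch of how the untrained matrices above might be generated; the rescaling to a target spectral radius follows the slide, while names, seeding, and defaults are illustrative assumptions:

```python
import numpy as np

def init_reservoir(M, M_in, r=0.9, r_in=0.3, seed=0):
    rng = np.random.default_rng(seed)
    W_in = r_in * rng.standard_normal((M, M_in))   # untrained input weights, scaled by rin
    W = rng.standard_normal((M, M))                # untrained recurrent weights
    W *= r / max(abs(np.linalg.eigvals(W)))        # rescale so spectral radius equals r
    return W, W_in
```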
Echo States Conditions
• The network has echo states if x(n) is uniquely determined by the left-infinite input sequence …, u(n−1), u(n).
• x(n) is an “echo” of all previous inputs.
• If f is the tanh activation function:
  – σmax(W) = ||W|| < 1 guarantees echo states.
  – r = |λmax(W)| > 1 guarantees no echo states.
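A small helper that tests only the two conditions stated above for a given recurrent matrix W (a sketch; the intermediate regime is left undecided, as on the slide):

```python
import numpy as np

def echo_state_check(W):
    sigma_max = np.linalg.norm(W, 2)               # largest singular value ||W||
    spec_radius = max(abs(np.linalg.eigvals(W)))   # |lambda_max(W)|
    if sigma_max < 1:
        return "echo states guaranteed (sigma_max < 1)"
    if spec_radius > 1:
        return "no echo states (spectral radius > 1)"
    return "inconclusive (sigma_max >= 1, spectral radius <= 1)"
```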
ESN Training
• Minimize mean-squared error between y(n) and desired signal d(n).
• Wiener solution:
  Wout = R^(−1)·p
  Wout = ⟨x(n)·x(n)^T⟩^(−1) · ⟨x(n)·d(n)^T⟩
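A sketch of this Wiener solution, assuming the reservoir states are collected as columns of X (M × N) and the desired signals as columns of D (Mout × N); names and shapes are assumptions:

```python
import numpy as np

def train_readout(X, D):
    N = X.shape[1]
    R = X @ X.T / N                    # autocorrelation <x(n)·x(n)^T>
    p = X @ D.T / N                    # cross-correlation <x(n)·d(n)^T>
    W_out = np.linalg.solve(R, p).T    # Wout = (R^-1·p)^T, shape Mout x M, maps x -> y
    return W_out
```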
ESN Example: Mackey-Glass
• M = 60 PEs, r = 0.9, rin = 0.3
• u(n): Mackey-Glass series, 10000 samples
• d(n) = u(n+1)
• Prediction gain (var(u)/var(e)):
  – Input: 16.3 dB
  – Wiener: 45.1 dB
  – ESN: 62.6 dB
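The prediction gain quoted above can be computed with a one-line helper (a sketch; u is the input series and e the prediction error):

```python
import numpy as np

def prediction_gain_db(u, e):
    """Prediction gain 10·log10(var(u)/var(e)) in dB."""
    return 10.0 * np.log10(np.var(u) / np.var(e))
```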
Multiple Readout Filters
• Wout projects reservoir space to output space.
• Question: how to divide reservoir space and use multiple readout filters?
• Answer: competitive network of filters:
  y_k(n) = Wout_k·x(n),  k ∈ [1, K]
• Question: how to train/test competitive network of K filters?
• Answer: mimic HMM.
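A sketch of the competitive bank of K readout filters: each filter produces its own output, and the filter with the lowest MSE against the desired signal wins (names and shapes are illustrative):

```python
import numpy as np

def competitive_readout(x, d, W_out_list):
    """x: reservoir state, d: desired output, W_out_list: K readout matrices."""
    errors = [np.mean((W_out @ x - d) ** 2) for W_out in W_out_list]
    k_win = int(np.argmin(errors))     # winner-take-all over the K filters
    return k_win, errors
```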
HMM vs. ESN Classifier
                    HMM                              ESN Classifier
Output              Likelihood                       MSE
Architecture        States, left-to-right            States, left-to-right
Minimum element     Gaussian kernel                  Readout filter
Elements combined   GMM                              Winner-take-all
Transitions         State transition matrix          Binary switching matrix
Training            Segmental K-means (Baum-Welch)   Segmental K-means
Discriminatory      No                               Maybe, depends on desired signal
Segmental K-means: Init
For each input xi(n) and desired di(n) for sequence i:
• Divide x, d into equal-sized chunks Xη, Dη (one per state).
• For each n, select k(n) ∈ [1, K] uniformly at random.
• Accumulate:
  A_k(n),η += Xη,i(n)·(Xη,i(n))^T
  B_k(n),η += Xη,i(n)·(Dη,i(n))^T
After initialization with all sequences:
  Wout_k,η = (A_k,η)^(−1)·B_k,η
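A sketch of this initialization, assuming each training sequence is stored as a pair (X, D) with reservoir states and desired outputs as columns; the accumulation and per-(filter, state) solve mirror the equations above, while names, shapes, and the small ridge term are assumptions:

```python
import numpy as np

def init_filters(sequences, K, n_states, M, M_out, seed=0):
    rng = np.random.default_rng(seed)
    A = np.zeros((K, n_states, M, M))          # autocorrelation accumulators A_{k,eta}
    B = np.zeros((K, n_states, M, M_out))      # cross-correlation accumulators B_{k,eta}
    for X, D in sequences:                     # X: M x N states, D: M_out x N desired
        N = X.shape[1]
        bounds = np.linspace(0, N, n_states + 1, dtype=int)
        for eta in range(n_states):            # equal-sized chunk per state
            for n in range(bounds[eta], bounds[eta + 1]):
                k = rng.integers(K)            # uniform random filter assignment
                A[k, eta] += np.outer(X[:, n], X[:, n])
                B[k, eta] += np.outer(X[:, n], D[:, n])
    # Wout_{k,eta} = (A_{k,eta})^-1 · B_{k,eta}; small ridge term guards against singular A
    eye = 1e-6 * np.eye(M)
    return np.array([[np.linalg.solve(A[k, eta] + eye, B[k, eta]).T
                      for eta in range(n_states)] for k in range(K)])
```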
Segmental K-means: Training
• For each utterance:
  – Produce MSE for each readout filter.
  – Find Viterbi path through MSE matrix.
  – Use features from each state to update auto- and cross-correlation matrices.
• After all utterances: Wiener solution.
• Guaranteed to converge to a local minimum in MSE over the training set.
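The Viterbi step in this loop can be sketched as a simple left-to-right dynamic program over the per-frame MSE matrix (states must be visited in order); this is an illustrative reconstruction, not the original code:

```python
import numpy as np

def viterbi_segment(mse):
    """mse: n_states x N per-frame MSE; returns the min-cost left-to-right state path."""
    S, N = mse.shape
    cost = np.full((S, N), np.inf)
    back = np.zeros((S, N), dtype=int)
    cost[0, 0] = mse[0, 0]                     # path must start in the first state
    for n in range(1, N):
        for s in range(S):
            stay = cost[s, n - 1]
            move = cost[s - 1, n - 1] if s > 0 else np.inf
            back[s, n] = s if stay <= move else s - 1
            cost[s, n] = mse[s, n] + min(stay, move)
    path = [S - 1]                             # path must end in the last state
    for n in range(N - 1, 0, -1):
        path.append(back[path[-1], n])
    return path[::-1]                          # state index for each frame
```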
ASR Example 1
• Isolated English digits “zero”-“nine” from the TI46 corpus: 8 male, 8 female speakers, 26 utterances each, 12.5 kHz sampling rate.
• ESN: M=60 PEs, r=2.0, rin=0.1, 10 word models, various #states and #filters per state.
• Features: 13 HFCC, 100 fps, Hamming window, pre-emphasis (α=0.95), CMS, Δ+ΔΔ (±4 frames).
• Pre-processing: zero-mean and whitening transform (see the sketch after this list).
• M1/F1: testing; M2/F2: validation; M3-M8/F3-F8: training.
• Two to six training epochs for all models.
• Desired: next frame of 39-dimensional features.
• Test: corrupted by additive noise from “real” sources (subway, babble, car, exhibition hall, restaurant, street, airport terminal, train station).
• Baseline: HMM with identical input features.
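A sketch of the zero-mean and whitening pre-processing step noted above; the slide gives no details beyond its name, so a PCA-style whitening fit on the training feature frames is assumed here:

```python
import numpy as np

def fit_whitening(F):
    """F: N x 39 matrix of training feature frames."""
    mu = F.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(F - mu, rowvar=False))
    T = vecs / np.sqrt(vals + 1e-10)    # scale each principal axis to unit variance
    return mu, T

def apply_whitening(F, mu, T):
    return (F - mu) @ T                 # zero-mean, decorrelated, unit-variance features
```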
ASR Results, noise free
Number of classification errors out of 518 (smaller is better).
Entries: ESN (HMM) errors.
        K=1      K=2      K=3     K=4     K=5     K=10
Nst=1   7 (171)  6 (136)  3 (65)  2 (33)  3 (4)   2 (2)
Nst=2   1 (83)   1 (46)   0 (4)   1 (3)   2 (2)   1 (0)
Nst=3   0 (126)  1 (4)    0 (2)   0 (2)   0 (1)   2 (0)
Nst=5   1 (11)   1 (2)    0 (0)   0 (0)   1 (0)   0 (0)
Nst=10  1 (2)    1 (0)    1 (0)   1 (0)   0 (0)   0 (0)
Nst=15  0 (1)    0 (0)    0 (0)   0 (0)   0 (0)   1 (0)
Nst=20  0        0        0       0       0       1
ASR Results, noisy
Average accuracy (%), all noise sources, 0-20 dB SNR (larger is better). Entries: ESN (HMM).
        K=1          K=2          K=3          K=4          K=5          K=10
Nst=1   70.9 (22.4)  70.0 (29.7)  74.6 (45.6)  74.3 (46.0)  74.3 (36.2)  75.8 (50.9)
Nst=2   76.3 (41.5)  77.6 (47.6)  78.3 (50.1)  77.7 (53.8)  77.1 (50.2)  75.8 (64.5)
Nst=3   78.8 (29.2)  79.2 (44.6)  79.3 (51.7)  79.2 (58.6)  79.1 (58.6)  78.8 (55.6)
Nst=5   81.4 (51.6)  81.1 (56.4)  81.6 (59.7)  81.9 (59.2)  81.3 (59.2)  81.3 (53.5)
Nst=10  84.6 (57.2)  84.4 (61.1)  84.4 (58.7)  83.6 (55.7)  83.5 (56.2)  81.0 (52.2)
Nst=15  85.4 (64.0)  85.1 (62.0)  85.0 (59.2)  83.8 (56.4)  82.8 (52.9)  78.4 (52.2)
Nst=20  85.8         85.6         84.0         83.5         82.5         72.3
ASR Results, noisy
Single mixture per state (K=1): ESN classifier
ASR Results, noisy
Single mixture per state (K=1): HMM baseline
ASR Example 2
• Same experimental setup as Example 1.
• ESN: M=600 PEs, 10 states, 1 filter per state, rin=0.1, various r.
• Desired: one-of-many encoding of class, ±1, tanh output activation function AFTER the linear readout filter (see the sketch after this list).
• Test: corrupted by additive speech-shaped noise.
• Baseline: HMM with identical input features.
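A sketch of the one-of-many desired signal and the post-readout tanh used in this example (names and shapes are illustrative assumptions):

```python
import numpy as np

def one_of_many_targets(class_idx, n_classes, n_frames):
    """Constant ±1 targets: +1 for the true class, -1 elsewhere."""
    d = -np.ones((n_classes, n_frames))
    d[class_idx, :] = 1.0
    return d

def class_scores(W_out, X):
    """tanh applied AFTER the linear readout; scores summed over frames."""
    return np.tanh(W_out @ X).sum(axis=1)
```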
ASR Results, noisy
Discussion
• What gives the ESN classifier its noise-robust characteristics?
• Theory: the ESN reservoir provides context of the noisy input, allowing the reservoir to reduce the effects of noise by averaging.
• Theory: the nonlinearity and high dimensionality of the network increase the linear separability of classes in reservoir space.
Future Work
• Replace winner-take-all with mixture-of-experts.
• Replace segmental K-means with a Baum-Welch-type training algorithm.
• “Grow” the network during training.
• Consider nonlinear activation functions (e.g., tanh, softmax) AFTER the linear readout filter.
Conclusions
• ESN classifier using inspiration from HMM:
  – Multiple readout filters per state, multiple states.
  – Trained as a competitive network of filters.
  – Segmental K-means guaranteed to converge to a local minimum of total MSE over the training set.
• ESN classifier noise robust compared to HMM:
  – Ave. over all sources, 0-20 dB SNR: +21 percentage points
  – Ave. over all sources: +9 dB SNR