Automatic speech recognition using an echo state network


Automatic speech recognition using an echo state network
Mark D. Skowronski
Computational Neuro-Engineering Lab
Electrical and Computer Engineering
University of Florida, Gainesville, FL, USA
May 10, 2006
CNEL Seminar History
• Ratio spectrum, Oct. 2000
• HFCC, Sept. 2002
• Bats, Dec. 2004
• Electrohysterography, Aug. 2005
• Echo state network, May 2006
Overview
• ASR motivations
• Intro to echo state network
• Multiple readout filters
• ASR experiments
• Conclusions
ASR Motivations
• Speech is the most natural form of communication among humans.
• Human-machine interaction lags behind with tactile interfaces.
• The bottleneck in machine understanding is signal-to-symbol translation.
• Human speech is a “tough” signal:
  – Nonstationary
  – Non-Gaussian
  – Nonlinear systems for production/perception
How to handle the “non”-ness of speech?
ASR State of the Art
• Feature extraction: HFCC
  – Bio-inspired frequency analysis
  – Tailored for statistical models
• Acoustic pattern rec: HMM
  – Piecewise-stationary stochastic model
  – Efficient training/testing algorithms
  – …but several simplistic assumptions
• Language models
  – Use knowledge of language, grammar
  – HMM implementations
  – Machine language understanding still elusive (spam blockers)
Hidden Markov Model
Premier stochastic model of non-stationary time series used for decision making.
Assumptions:
1) Speech is a piecewise-stationary process.
2) Features are independent.
3) State duration is exponential.
4) State transition probability depends only on the previous and next states.
Can we devise a better pattern recognition model?
Echo State Network
• Partially trained recurrent neural network, Herbert Jaeger, 2001
• Unique characteristics:
  – Recurrent “reservoir” of processing elements, interconnected with random untrained weights.
  – Linear readout weights trained with simple regression provide a closed-form, stable, unique solution.
ESN Diagram & Equations
x(n) = f(W·x(n−1) + Win·u(n))
y(n) = Wout·x(n)
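A minimal NumPy sketch of these two equations (function and variable names are illustrative, not from the original implementation):

```python
import numpy as np

def esn_step(x_prev, u, W, W_in, f=np.tanh):
    """Reservoir update: x(n) = f(W·x(n-1) + Win·u(n))."""
    return f(W @ x_prev + W_in @ u)

def esn_readout(x, W_out):
    """Linear readout: y(n) = Wout·x(n)."""
    return W_out @ x
```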
ESN Matrices
• Win: untrained, M × Min matrix
  – Zero mean, unity variance, normally distributed
  – Scaled by rin
• W: untrained, M × M matrix
  – Zero mean, unity variance, normally distributed
  – Scaled such that spectral radius r < 1
• Wout: trained, linear regression, Mout × M matrix
  – Regression → closed-form, stable, unique solution
  – O(M²) complexity per data point
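A sketch of how the untrained matrices above might be generated; the rescaling to a target spectral radius follows the slide, while names, seeding, and defaults are illustrative assumptions:

```python
import numpy as np

def init_reservoir(M, M_in, r=0.9, r_in=0.3, seed=0):
    rng = np.random.default_rng(seed)
    W_in = r_in * rng.standard_normal((M, M_in))   # untrained input weights, scaled by rin
    W = rng.standard_normal((M, M))                # untrained recurrent weights
    W *= r / max(abs(np.linalg.eigvals(W)))        # rescale so spectral radius equals r
    return W, W_in
```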
Echo States Conditions
• The network has echo states if x(n) is uniquely determined by the left-infinite input sequence …, u(n−1), u(n).
• x(n) is an “echo” of all previous inputs.
• If f is the tanh activation function:
  – σmax(W) = ||W|| < 1 guarantees echo states.
  – r = |λmax(W)| > 1 guarantees no echo states.
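A small helper that tests only the two conditions stated above for a given recurrent matrix W (a sketch; the intermediate regime is left undecided, as on the slide):

```python
import numpy as np

def echo_state_check(W):
    sigma_max = np.linalg.norm(W, 2)               # largest singular value ||W||
    spec_radius = max(abs(np.linalg.eigvals(W)))   # |lambda_max(W)|
    if sigma_max < 1:
        return "echo states guaranteed (sigma_max < 1)"
    if spec_radius > 1:
        return "no echo states (spectral radius > 1)"
    return "inconclusive (sigma_max >= 1, spectral radius <= 1)"
```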
ESN Training
• Minimize mean-squared error between y(n) and desired signal d(n).
• Wiener solution:
  Wout = R^(−1)·p
  Wout = ⟨x(n)·x(n)^T⟩^(−1) · ⟨x(n)·d(n)^T⟩
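A sketch of this Wiener solution, assuming the reservoir states are collected as columns of X (M × N) and the desired signals as columns of D (Mout × N); names and shapes are assumptions:

```python
import numpy as np

def train_readout(X, D):
    N = X.shape[1]
    R = X @ X.T / N                    # autocorrelation <x(n)·x(n)^T>
    p = X @ D.T / N                    # cross-correlation <x(n)·d(n)^T>
    W_out = np.linalg.solve(R, p).T    # Wout = (R^-1·p)^T, shape Mout x M, maps x -> y
    return W_out
```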
ESN Example: Mackey-Glass
• M = 60 PEs, r = 0.9, rin = 0.3
• u(n): Mackey-Glass series, 10000 samples
• d(n) = u(n+1)
• Prediction gain (var(u)/var(e)):
  – Input: 16.3 dB
  – Wiener: 45.1 dB
  – ESN: 62.6 dB
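The prediction gain quoted above can be computed with a one-line helper (a sketch; u is the input series and e the prediction error):

```python
import numpy as np

def prediction_gain_db(u, e):
    """Prediction gain 10·log10(var(u)/var(e)) in dB."""
    return 10.0 * np.log10(np.var(u) / np.var(e))
```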
Multiple Readout Filters
• Wout projects reservoir space to output space.
• Question: how to divide reservoir space and use multiple readout filters?
• Answer: competitive network of filters:
  y_k(n) = Wout_k·x(n),  k ∈ [1, K]
• Question: how to train/test competitive network of K filters?
• Answer: mimic HMM.
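A sketch of the competitive bank of K readout filters: each filter produces its own output, and the filter with the lowest MSE against the desired signal wins (names and shapes are illustrative):

```python
import numpy as np

def competitive_readout(x, d, W_out_list):
    """x: reservoir state, d: desired output, W_out_list: K readout matrices."""
    errors = [np.mean((W_out @ x - d) ** 2) for W_out in W_out_list]
    k_win = int(np.argmin(errors))     # winner-take-all over the K filters
    return k_win, errors
```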
HMM vs. ESN Classifier
                    HMM                              ESN Classifier
Output              Likelihood                       MSE
Architecture        States, left-to-right            States, left-to-right
Minimum element     Gaussian kernel                  Readout filter
Elements combined   GMM                              Winner-take-all
Transitions         State transition matrix          Binary switching matrix
Training            Segmental K-means (Baum-Welch)   Segmental K-means
Discriminatory      No                               Maybe, depends on desired signal
Segmental K-means: Init
For each input xi(n) and desired di(n) for sequence i:
• Divide x, d into equal-sized chunks Xη, Dη (one per state).
• For each n, select k(n) ∈ [1, K] uniformly at random.
• Accumulate:
  A_k(n),η += Xη,i(n)·(Xη,i(n))^T
  B_k(n),η += Xη,i(n)·(Dη,i(n))^T
After initialization with all sequences:
  Wout_k,η = (A_k,η)^(−1)·B_k,η
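A sketch of this initialization, assuming each training sequence is stored as a pair (X, D) with reservoir states and desired outputs as columns; the accumulation and per-(filter, state) solve mirror the equations above, while names, shapes, and the small ridge term are assumptions:

```python
import numpy as np

def init_filters(sequences, K, n_states, M, M_out, seed=0):
    rng = np.random.default_rng(seed)
    A = np.zeros((K, n_states, M, M))          # autocorrelation accumulators A_{k,eta}
    B = np.zeros((K, n_states, M, M_out))      # cross-correlation accumulators B_{k,eta}
    for X, D in sequences:                     # X: M x N states, D: M_out x N desired
        N = X.shape[1]
        bounds = np.linspace(0, N, n_states + 1, dtype=int)
        for eta in range(n_states):            # equal-sized chunk per state
            for n in range(bounds[eta], bounds[eta + 1]):
                k = rng.integers(K)            # uniform random filter assignment
                A[k, eta] += np.outer(X[:, n], X[:, n])
                B[k, eta] += np.outer(X[:, n], D[:, n])
    # Wout_{k,eta} = (A_{k,eta})^-1 · B_{k,eta}; small ridge term guards against singular A
    eye = 1e-6 * np.eye(M)
    return np.array([[np.linalg.solve(A[k, eta] + eye, B[k, eta]).T
                      for eta in range(n_states)] for k in range(K)])
```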
Segmental K-means: Training
• For each utterance:
  – Produce MSE for each readout filter.
  – Find Viterbi path through MSE matrix.
  – Use features from each state to update auto- and cross-correlation matrices.
• After all utterances: Wiener solution.
• Guaranteed to converge to a local minimum in MSE over the training set.
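The Viterbi step in this loop can be sketched as a simple left-to-right dynamic program over the per-frame MSE matrix (states must be visited in order); this is an illustrative reconstruction, not the original code:

```python
import numpy as np

def viterbi_segment(mse):
    """mse: n_states x N per-frame MSE; returns the min-cost left-to-right state path."""
    S, N = mse.shape
    cost = np.full((S, N), np.inf)
    back = np.zeros((S, N), dtype=int)
    cost[0, 0] = mse[0, 0]                     # path must start in the first state
    for n in range(1, N):
        for s in range(S):
            stay = cost[s, n - 1]
            move = cost[s - 1, n - 1] if s > 0 else np.inf
            back[s, n] = s if stay <= move else s - 1
            cost[s, n] = mse[s, n] + min(stay, move)
    path = [S - 1]                             # path must end in the last state
    for n in range(N - 1, 0, -1):
        path.append(back[path[-1], n])
    return path[::-1]                          # state index for each frame
```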
ASR Example 1
• Isolated English digits “zero”-“nine” from the TI46 corpus: 8 male, 8 female speakers, 26 utterances each, 12.5 kHz sampling rate.
• ESN: M=60 PEs, r=2.0, rin=0.1, 10 word models, various #states and #filters per state.
• Features: 13 HFCC, 100 fps, Hamming window, pre-emphasis (α=0.95), CMS, Δ+ΔΔ (±4 frames).
• Pre-processing: zero-mean and whitening transform (see the sketch after this list).
• M1/F1: testing; M2/F2: validation; M3-M8/F3-F8: training.
• Two to six training epochs for all models.
• Desired: next frame of 39-dimensional features.
• Test: corrupted by additive noise from “real” sources (subway, babble, car, exhibition hall, restaurant, street, airport terminal, train station).
• Baseline: HMM with identical input features.
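A sketch of the zero-mean and whitening pre-processing step noted above; the slide gives no details beyond its name, so a PCA-style whitening fit on the training feature frames is assumed here:

```python
import numpy as np

def fit_whitening(F):
    """F: N x 39 matrix of training feature frames."""
    mu = F.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(F - mu, rowvar=False))
    T = vecs / np.sqrt(vals + 1e-10)    # scale each principal axis to unit variance
    return mu, T

def apply_whitening(F, mu, T):
    return (F - mu) @ T                 # zero-mean, decorrelated, unit-variance features
```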
ASR Results, noise free
Number of classification errors out of 518 (smaller is better).
Entries: ESN (HMM) errors.
        K=1      K=2      K=3     K=4     K=5     K=10
Nst=1   7 (171)  6 (136)  3 (65)  2 (33)  3 (4)   2 (2)
Nst=2   1 (83)   1 (46)   0 (4)   1 (3)   2 (2)   1 (0)
Nst=3   0 (126)  1 (4)    0 (2)   0 (2)   0 (1)   2 (0)
Nst=5   1 (11)   1 (2)    0 (0)   0 (0)   1 (0)   0 (0)
Nst=10  1 (2)    1 (0)    1 (0)   1 (0)   0 (0)   0 (0)
Nst=15  0 (1)    0 (0)    0 (0)   0 (0)   0 (0)   1 (0)
Nst=20  0        0        0       0       0       1
ASR Results, noisy
Average accuracy (%), all noise sources, 0-20 dB SNR (larger is better). Entries: ESN (HMM).
        K=1          K=2          K=3          K=4          K=5          K=10
Nst=1   70.9 (22.4)  70.0 (29.7)  74.6 (45.6)  74.3 (46.0)  74.3 (36.2)  75.8 (50.9)
Nst=2   76.3 (41.5)  77.6 (47.6)  78.3 (50.1)  77.7 (53.8)  77.1 (50.2)  75.8 (64.5)
Nst=3   78.8 (29.2)  79.2 (44.6)  79.3 (51.7)  79.2 (58.6)  79.1 (58.6)  78.8 (55.6)
Nst=5   81.4 (51.6)  81.1 (56.4)  81.6 (59.7)  81.9 (59.2)  81.3 (59.2)  81.3 (53.5)
Nst=10  84.6 (57.2)  84.4 (61.1)  84.4 (58.7)  83.6 (55.7)  83.5 (56.2)  81.0 (52.2)
Nst=15  85.4 (64.0)  85.1 (62.0)  85.0 (59.2)  83.8 (56.4)  82.8 (52.9)  78.4 (52.2)
Nst=20  85.8         85.6         84.0         83.5         82.5         72.3
ASR Results, noisy
Single mixture per state (K=1): ESN classifier
ASR Results, noisy
Single mixture per state (K=1): HMM baseline
ASR Example 2
• Same experimental setup as Example 1.
• ESN: M=600 PEs, 10 states, 1 filter per state, rin=0.1, various r.
• Desired: one-of-many encoding of class, ±1, tanh output activation function AFTER the linear readout filter (see the sketch after this list).
• Test: corrupted by additive speech-shaped noise.
• Baseline: HMM with identical input features.
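A sketch of the one-of-many desired signal and the post-readout tanh used in this example (names and shapes are illustrative assumptions):

```python
import numpy as np

def one_of_many_targets(class_idx, n_classes, n_frames):
    """Constant ±1 targets: +1 for the true class, -1 elsewhere."""
    d = -np.ones((n_classes, n_frames))
    d[class_idx, :] = 1.0
    return d

def class_scores(W_out, X):
    """tanh applied AFTER the linear readout; scores summed over frames."""
    return np.tanh(W_out @ X).sum(axis=1)
```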
ASR Results, noisy
Discussion
• What gives the ESN classifier its noise-robust characteristics?
• Theory: the ESN reservoir provides context of the noisy input, allowing the reservoir to reduce the effects of noise by averaging.
• Theory: the nonlinearity and high dimensionality of the network increase the linear separability of classes in reservoir space.
Future Work
• Replace winner-take-all with mixture-of-experts.
• Replace segmental K-means with a Baum-Welch-type training algorithm.
• “Grow” the network during training.
• Consider nonlinear activation functions (e.g., tanh, softmax) AFTER the linear readout filter.
Conclusions
• ESN classifier using inspiration from HMM:
  – Multiple readout filters per state, multiple states.
  – Trained as a competitive network of filters.
  – Segmental K-means guaranteed to converge to a local minimum of total MSE over the training set.
• ESN classifier noise robust compared to HMM:
  – Ave. over all sources, 0-20 dB SNR: +21 percentage points
  – Ave. over all sources: +9 dB SNR