Adaptive Systems: Is Mean-Squared Error the Ultimate

Learning Using Augmented Error Criterion
Yadunandana N. Rao
Advisor: Dr. Jose C. Principe
Overview
[Diagram: linear adaptive systems are characterized by the criterion (MSE → AEC), the algorithm (LMS/RLS → AEC algorithms), and the topology (FIR, IIR).]
Why another criterion?
MSE gives biased parameter estimates with noisy data.
[Block diagram: the input x(n) is observed with additive noise v(n) and feeds the adaptive filter w; the desired signal d(n) is observed with additive noise u(n); e(n) is the difference between the noisy desired signal and the filter output.]
T. Söderström, P. Stoica. "System Identification." Prentice-Hall, London, United Kingdom, 1989.
Is the Wiener-MSE solution optimal?
Assumptions:
1. v(n) and u(n) are uncorrelated with the input and the desired signal
2. v(n) and u(n) are uncorrelated with each other
White input noise: $W = (R + \sigma^2 I)^{-1} P$, with unknown $\sigma^2$
Colored input noise: $W = (R + V)^{-1} P$, with unknown $V$
The solution changes with the noise statistics.
An example
[Figure: true weights vs. the RLS estimate for a 50-tap filter at 0 dB input SNR, illustrating the bias of the MSE solution.]
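A minimal NumPy sketch of this effect (an illustrative setup, not the experiment on the slide): a short FIR filter is identified with the Wiener/least-squares solution from input observed at 0 dB SNR, and the estimate comes out visibly shrunk relative to the true weights.

```python
# Illustrative sketch: bias of the Wiener/MSE solution under input noise.
import numpy as np

rng = np.random.default_rng(0)
L, n = 5, 50_000
w_true = rng.standard_normal(L)

x = rng.standard_normal(n)                    # clean input
d = np.convolve(x, w_true)[:n]                # noise-free desired signal
x_noisy = x + rng.standard_normal(n)          # input SNR = 0 dB

# Rows of X are the lagged noisy input vectors [x(n), x(n-1), ..., x(n-L+1)].
X = np.column_stack([np.roll(x_noisy, k) for k in range(L)])[L:]
d = d[L:]

R = X.T @ X / len(X)                          # noisy input correlation matrix
P = X.T @ d / len(X)                          # input-desired cross-correlation
w_mse = np.linalg.solve(R, P)                 # Wiener/MSE solution

print(np.round(w_true, 3))
print(np.round(w_mse, 3))                     # shrunk toward zero: biased
```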
Existing solutions…
Total Least Squares
- Gives an exact unbiased estimate iff v(n) and u(n) are i.i.d. with equal variances
- Input is noisy and desired is noise-free
Y.N. Rao, J.C. Principe. "Efficient Total Least Squares Method for System Modeling using Minor Component Analysis." IEEE Workshop on Neural Networks for Signal Processing XII, 2002.
Existing solutions…
Extended Total Least Squares
- Gives an exact unbiased estimate with colored v(n) and u(n) iff the noise statistics are known
J. Mathews, A. Cichocki. "Total Least Squares Estimation." Technical Report, University of Utah, USA and Brain Science Institute Riken, 2000.
Going beyond MSE - Motivation
Assumption:
1. v(n) and u(n) are white
The input covariance matrix is then $R = R_x + \sigma^2 I$: only the diagonal terms are corrupted. We will exploit this fact.
Going beyond MSE - Motivation
$w$ = estimated weights (length L), $w_T$ = true weights (length M)
$e(n) = x^T(n)\,[w_T - w] + u(n) - v^T(n)\,w$
$\rho_e(\Delta) = E[e(n)\,e(n-\Delta)] = [w_T - w]^T E[x(n)\,x^T(n-\Delta)]\,[w_T - w] + w^T E[v(n)\,v^T(n-\Delta)]\,w$
If Δ ≥ L and w = w_T, then $\rho_e(\Delta) = 0$.
J.C. Principe, Y.N. Rao, D. Erdogmus. "Error Whitening Wiener Filters: Theory and Algorithms." Chapter 10 in Least-Mean-Square Adaptive Filters, S. Haykin, B. Widrow (eds.), John Wiley, New York, 2003.
Augmented Error Criterion (AEC)
$E[e(n)\,e(n-\Delta)] = E[e^2(n)] - 0.5\,E\big[(e(n) - e(n-\Delta))^2\big]$
Define $\dot e(n) = e(n) - e(n-\Delta)$. Then
$J(w) = E[e^2(n)] + \beta\,E[\dot e^2(n)]$
combines an MSE term with an error penalty term.
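For concreteness, a minimal sketch of evaluating this cost for a candidate weight vector; the helper name aec_cost and the data layout (rows of X are the lagged noisy input vectors, d the noisy desired samples) are illustrative assumptions.

```python
import numpy as np

def aec_cost(w, X, d, beta=-0.5, lag=5):
    """J(w) = E[e^2(n)] + beta * E[(e(n) - e(n-lag))^2]."""
    e = d - X @ w                    # error signal e(n)
    e_dot = e[lag:] - e[:-lag]       # e_dot(n) = e(n) - e(n-lag)
    return np.mean(e**2) + beta * np.mean(e_dot**2)
```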
AEC can be interpreted as…
$J(w) = E[e^2(n)] + \beta\,E[\dot e^2(n)]$
For β > 0:
- Error-constrained (penalty) MSE
- Error smoothness constraint
- Joint MSE and error entropy
From AEC to Error Whitening
For β < 0: simultaneous minimization of MSE and maximization of error entropy.
With β = -0.5, the AEC cost function reduces to
$J(w) = E[e(n)\,e(n-\Delta)]$
When J(w) = 0, the resulting w partially whitens the error signal and is unbiased (Δ > L) even with white noise.
Optimal AEC solution w*
Irrespective of β, the stationary point of the AEC cost function is
$w^* = (R + \beta S)^{-1}(P + \beta Q)$
Choose a suitable lag L, with
$R = E[x_n x_n^T], \qquad S = E[(x_n - x_{n-L})(x_n - x_{n-L})^T]$
$P = E[x_n d_n], \qquad Q = E[(x_n - x_{n-L})(d_n - d_{n-L})]$
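A small sketch of this closed-form solution from sample averages; the function name, the data layout (rows of X are the input vectors x_n, d the desired samples), and the default lag are illustrative assumptions.

```python
import numpy as np

def aec_solution(X, d, beta=-0.5, lag=5):
    """w* = (R + beta*S)^(-1) (P + beta*Q) from sample estimates."""
    Xn, dn = X[lag:], d[lag:]            # x_n, d_n
    Xl, dl = X[:-lag], d[:-lag]          # x_{n-L}, d_{n-L}
    dX, dd = Xn - Xl, dn - dl
    R = Xn.T @ Xn / len(Xn)
    S = dX.T @ dX / len(dX)
    P = Xn.T @ dn / len(Xn)
    Q = dX.T @ dd / len(dX)
    return np.linalg.solve(R + beta * S, P + beta * Q)
```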
In summary AEC…
$J(w) = E[e^2(n)] + \beta\,E[\dot e^2(n)]$
β = 0: MSE, found by minimization
β = -0.5: EWC, found by root finding!
β > 0: AEC, found by minimization
The shape of the performance surface differs across these cases.
Searching for AEC-optimal w (β > 0)
[Contour plot of the performance surface over (w1, w2) for β > 0.]
Searching for AEC-optimal w (β < 0)
[Contour plot of the performance surface over (w1, w2) for β < 0.]
Searching for AEC-optimal w (β < 0)
[Contour plot for β < 0 with increasing and decreasing directions marked around the stationary point.]
Stochastic search – AEC-LMS
Problem: the stationary point of AEC with β < 0 can be a global minimum, a global maximum, or a saddle point. Theoretically, a saddle point is unstable, and a single-sign step-size can never converge to a saddle point.
Solution: use sign information.
AEC-LMS: β = -0.5
$w(n+1) = w(n) + \eta(n)\,\mathrm{sgn}\!\left(e^2(n) - 0.5\,\dot e^2(n)\right)\left[e(n)\,x(n) - 0.5\,\dot e(n)\,\dot x(n)\right]$
$w^* = (R - 0.5\,S)^{-1}(P - 0.5\,Q)$
Convergence in the MS sense iff
$0 < \eta(n) < \dfrac{2\,E\!\left(e_k^2 - 0.5\,\dot e_k^2\right)}{E\left\|e_k x_k - 0.5\,\dot e_k \dot x_k\right\|^2}$
The steady-state error autocorrelation at lag L, $\hat\rho_e(L)$, is bounded in terms of $\mathrm{Tr}(R + \sigma_v^2 I)$, the excess error power $E[e_a^2(k)]$, and the noise variances $\sigma_u^2$ and $\sigma_v^2$ (the latter scaled by $\|w\|^2$).
Y.N. Rao, D. Erdogmus, G.Y. Rao, J.C. Principe. "Stochastic Error Whitening Algorithm for Linear Filter Estimation with Noisy Data." Neural Networks, June 2003.
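A sketch of one AEC-LMS iteration implementing the update above for β = -0.5; the argument names (current and lag-L input vectors and desired samples) and the fixed step-size are illustrative assumptions.

```python
import numpy as np

def aec_lms_step(w, x_vec, x_lag, d, d_lag, eta=1e-3):
    """One step of w(n+1) = w(n) + eta*sgn(e^2 - 0.5*e_dot^2)*[e*x - 0.5*e_dot*x_dot]."""
    e = d - w @ x_vec
    e_lag = d_lag - w @ x_lag
    e_dot = e - e_lag                      # e_dot(n) = e(n) - e(n-L)
    x_dot = x_vec - x_lag                  # x_dot(n) = x(n) - x(n-L)
    s = np.sign(e**2 - 0.5 * e_dot**2)
    return w + eta * s * (e * x_vec - 0.5 * e_dot * x_dot)
```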
[Figure: simulation result at 10 dB SNR.]
Quasi-Newton AEC
$w^* = (R + \beta S)^{-1}(P + \beta Q)$
Problem: the optimal solution requires a matrix inversion.
Solution: the matrices R and S are positive-definite, symmetric, and allow a rank-1 recursion; overall, T = R + βS has a rank-2 update.
Quasi-Newton AEC
$T(n) = R(n) + \beta S(n)$
$T(n) = T(n-1) + \beta\,(2x(n) - x(n-L))\,x^T(n) + x(n)\,(x(n) - \beta\,x(n-L))^T$
Use the Sherman-Morrison-Woodbury identity:
$(A + BCD^T)^{-1} = A^{-1} - A^{-1}B\,(C^{-1} + D^T A^{-1}B)^{-1}D^T A^{-1}$
Y.N. Rao, D. Erdogmus, G.Y. Rao, J.C. Principe. "Fast Error Whitening Algorithms for System Identification and Control." IEEE Workshop on Neural Networks for Signal Processing XIII, September 2003.
Quasi-Newton AEC
Initialize $T^{-1}(0) = cI$, with $c$ a large positive constant, and $w(0) = 0$.
At every iteration, compute:
$B = \left[\,\beta\,(2x(n) - x(n-L)) \quad x(n)\,\right]$
$D = \left[\,x(n) \quad \big(x(n) - \beta\,x(n-L)\big)\,\right]$
$\kappa(n) = T^{-1}(n-1)\,B\left[\,I_{2\times 2} + D^T T^{-1}(n-1)\,B\,\right]^{-1}$
$y(n) = x^T(n)\,w(n-1), \qquad y(n-L) = x^T(n-L)\,w(n-1)$
$e(n) = \left[\,d(n) - y(n)\,;\;\; \big(d(n) - y(n)\big) - \beta\big(d(n-L) - y(n-L)\big)\,\right]$
$w(n) = w(n-1) + \kappa(n)\,e(n)$
$T^{-1}(n) = T^{-1}(n-1) - \kappa(n)\,D^T T^{-1}(n-1)$
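A minimal sketch of one iteration of this recursion, with Tinv a running estimate of T⁻¹(n); the function name, calling convention, and the initialization noted below are illustrative assumptions.

```python
import numpy as np

def quasi_newton_aec_step(Tinv, w, x, x_lag, d, d_lag, beta=-0.5):
    """One rank-2 recursive update of T^(-1) and w via Sherman-Morrison-Woodbury."""
    B = np.column_stack([beta * (2 * x - x_lag), x])
    D = np.column_stack([x, x - beta * x_lag])
    kappa = Tinv @ B @ np.linalg.inv(np.eye(2) + D.T @ Tinv @ B)
    y, y_lag = x @ w, x_lag @ w                        # filter outputs
    e = np.array([d - y, (d - y) - beta * (d_lag - y_lag)])
    w = w + kappa @ e                                  # weight update
    Tinv = Tinv - kappa @ (D.T @ Tinv)                 # T^(-1)(n)
    return Tinv, w
```

As on the slide, Tinv would be initialized to c·I with a large positive c and w to the zero vector.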
Quasi-Newton AEC analysis
Fact 1: Convergence is achieved in a finite number of steps.
Fact 2: The estimation error covariance is bounded from above.
Fact 3: The trace of the error covariance depends mainly on the smallest eigenvalue of R + βS.
The upper bound on $E[\varepsilon_n^T \varepsilon_n]$ is proportional to $\sigma_u^2$ and involves traces of terms in the lag-0, lag-L, and lag-2L input correlation matrices.
Y.N. Rao, D. Erdogmus, G.Y. Rao, J.C. Principe. "Fast Error Whitening Algorithms for System Identification and Control with Noisy Data." NeuroComputing, to appear in 2004.
Minor Components based EWC
Optimal EWC solution, motivated from TLS:
$w^* = (R - 0.5\,S)^{-1}(P - 0.5\,Q) = R_L^{-1} P_L$
$R_L = E[\,x_n x_{n-L}^T + x_{n-L} x_n^T\,], \qquad P_L = E[\,x_n d_{n-L} + x_{n-L} d_n\,]$
Augmented data matrix:
$G = \begin{bmatrix} R_L & P_L \\ P_L^T & 2\rho_d(L) \end{bmatrix}$, a symmetric, indefinite matrix.
Minor Components based EWC
Problem: computing the eigenvector corresponding to the zero eigenvalue of an indefinite matrix.
Inverse iteration (EWC-TLS):
$\tilde w(n+1) = G^{-1}(n+1)\,w(n), \qquad w(n+1) = \tilde w(n+1)\,/\,\|\tilde w(n+1)\|$
Y.N. Rao, D. Erdogmus, J.C. Principe. "Error Whitening Criterion for Adaptive Filtering: Theory and Algorithms." IEEE Transactions on Signal Processing, to appear.
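A sketch of the inverse iteration, assuming G has already been estimated as on the previous slide and following the usual TLS convention that the minor-component eigenvector is proportional to [w; -1]; the function name and iteration count are illustrative.

```python
import numpy as np

def ewc_tls(G, n_iter=50, seed=0):
    """Inverse iteration toward the eigenvector of G with eigenvalue closest to zero."""
    v = np.random.default_rng(seed).standard_normal(G.shape[0])
    for _ in range(n_iter):
        v = np.linalg.solve(G, v)      # v <- G^(-1) v
        v = v / np.linalg.norm(v)      # normalize
    return -v[:-1] / v[-1]             # read off the weights from [w; -1]
```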
Inverse control using EWC
[Block diagram: model-reference inverse control with an adaptive controller in cascade with the plant (model), the reference model output compared against the cascade output, and additive noise; the plant is AR and the controller/model is FIR.]
[Figures: "Histogram plot of the errors" (EWC vs. MSE); "Performance with EWC controller-plant pair" and "Performance with MSE controller-plant pair" (output vs. desired over 1000 samples).]
Going beyond white noise…
EWC can be extended to handle colored noise if:
- the noise correlation depth is known
- the noise covariance structure is known
Otherwise the results will be biased by the noise terms.
Exploit the fact that the output and desired signals have independent noise terms.
Modified cost function
$J(w) = \sum_{\Delta=1}^{N} E[\,e_k d_{k-\Delta} + e_{k-\Delta} d_k\,]$
- N – filter length (assume sufficient order)
- e – error signal with noisy data
- d – noisy desired signal
- Δ – lags chosen (need many!)
Y.N. Rao, D. Erdogmus, J.C. Principe. "Accurate Linear Parameter Estimation in Colored Noise." International Conference on Acoustics, Speech and Signal Processing, May 2004.
Cost function…
$E[e_k d_{k-\Delta}] = w_T^T E[x_k x_{k-\Delta}^T]\,w_T - w_T^T E[x_k x_{k-\Delta}^T]\,w + E[u_k u_{k-\Delta}]$
$E[e_{k-\Delta} d_k] = w_T^T E[x_{k-\Delta} x_k^T]\,w_T - w_T^T E[x_{k-\Delta} x_k^T]\,w + E[u_{k-\Delta} u_k]$
If the noise in the desired signal is white, $E[u_k u_{k-\Delta}] = 0$ and the input noise drops out completely:
$J(w) = \sum_{\Delta=1}^{N} \left(w_T^T R_\Delta w_T - w_T^T R_\Delta w\right), \qquad R_\Delta = E[\,x_k x_{k-\Delta}^T + x_{k-\Delta} x_k^T\,]$
Optimal solution by root-finding
There is a single unique solution $w^*$ of $J(w^*) = 0$, and $w^* = w_T$:
$w^* = \begin{bmatrix} E[d_k x_{k-1}^T + d_{k-1} x_k^T] \\ E[d_k x_{k-2}^T + d_{k-2} x_k^T] \\ \vdots \\ E[d_k x_{k-N}^T + d_{k-N} x_k^T] \end{bmatrix}^{-1} 2\begin{bmatrix} E[d_k d_{k-1}] \\ E[d_k d_{k-2}] \\ \vdots \\ E[d_k d_{k-N}] \end{bmatrix}$
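A sketch of this root-finding solution computed from sample averages, one equation per lag Δ = 1…N; the data layout (rows of X are the lagged noisy input vectors x_k, d the noisy desired samples) is an illustrative assumption.

```python
import numpy as np

def colored_noise_solution(X, d):
    """Solve the stacked equations E[d_k x_{k-D}^T + d_{k-D} x_k^T] w = 2 E[d_k d_{k-D}]."""
    N = X.shape[1]                        # filter length = number of lags
    A = np.zeros((N, N))
    b = np.zeros(N)
    for i, lag in enumerate(range(1, N + 1)):
        Xk, dk = X[lag:], d[lag:]         # x_k, d_k
        Xl, dl = X[:-lag], d[:-lag]       # x_{k-lag}, d_{k-lag}
        A[i] = np.mean(dk[:, None] * Xl + dl[:, None] * Xk, axis=0)
        b[i] = 2.0 * np.mean(dk * dl)
    return np.linalg.solve(A, b)
```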
Stochastic algorithm
$w_{k+1} = w_k + \mu \sum_{\Delta=1}^{N} \mathrm{sign}\!\left(e_k d_{k-\Delta} + e_{k-\Delta} d_k\right)\left(x_k d_{k-\Delta} + x_{k-\Delta} d_k\right)$
Asymptotically converges to the optimal solution iff
$0 < \mu < \dfrac{\sum_{\Delta=1}^{N} 2\,E[\,e_k d_{k-\Delta} + e_{k-\Delta} d_k\,]}{E\,\|\nabla J(w_k)\|^2}$
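A sketch of one iteration of this sign algorithm; the buffering convention (the most recent N+1 input vectors and desired samples, newest last) and the step-size are illustrative assumptions.

```python
import numpy as np

def colored_noise_step(w, X_recent, d_recent, mu=1e-4):
    """w <- w + mu * sum_D sign(e_k d_{k-D} + e_{k-D} d_k) (x_k d_{k-D} + x_{k-D} d_k)."""
    x_k, d_k = X_recent[-1], d_recent[-1]
    e = d_recent - X_recent @ w            # errors for the buffered samples
    e_k = e[-1]
    update = np.zeros_like(w)
    for lag in range(1, len(d_recent)):
        x_l, d_l, e_l = X_recent[-1 - lag], d_recent[-1 - lag], e[-1 - lag]
        z = e_k * d_l + e_l * d_k
        update += np.sign(z) * (x_k * d_l + x_l * d_k)
    return w + mu * update
```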
Local stability
[Figure: 10 dB input SNR and 10 dB output SNR.]
System ID in colored input noise
[Figure: -10 dB input SNR and 10 dB output SNR (white noise).]
Extensions to colored noise in the desired signal
$E[e_k d_{k-\Delta}] = w_T^T E[x_k x_{k-\Delta}^T]\,w_T - w_T^T E[x_k x_{k-\Delta}^T]\,w + E[u_k u_{k-\Delta}]$
$E[e_{k-\Delta} d_k] = w_T^T E[x_{k-\Delta} x_k^T]\,w_T - w_T^T E[x_{k-\Delta} x_k^T]\,w + E[u_{k-\Delta} u_k]$
If the noise in the desired signal is colored, then
$E[e_k d_{k-\Delta}] + E[e_{k-\Delta} d_k] = w_T^T R_\Delta w_T - w_T^T R_\Delta w + 2\,E[u_k u_{k-\Delta}]$
Introduce a penalty term in the cost function such that the overall cost converges to $2\,E[u_k u_{k-\Delta}]$.
But we do not know $2\,E[u_k u_{k-\Delta}]$
Introduce estimators of $2\,E[u_k u_{k-\Delta}]$ in the cost. Define $z_\Delta(k) = e_k d_{k-\Delta} + e_{k-\Delta} d_k$ and
$J(w, \lambda, \theta) = \sum_{\Delta=1}^{N} z_\Delta^2(k) + \sum_{\Delta=1}^{N} \lambda_\Delta\left[z_\Delta(k) - \theta_\Delta\right]^2 - \alpha\sum_{\Delta=1}^{N} \lambda_\Delta^2 + \beta\sum_{\Delta=1}^{N} \theta_\Delta^2$
The constants α and β are positive real numbers that control the stability.
Gradients…
$\dfrac{\partial J(w,\lambda,\theta)}{\partial w} = -2\sum_{\Delta=1}^{N} z_\Delta(k)\left(x_k d_{k-\Delta} + x_{k-\Delta} d_k\right) - 2\sum_{\Delta=1}^{N} \lambda_\Delta\left[z_\Delta(k) - \theta_\Delta\right]\left(x_k d_{k-\Delta} + x_{k-\Delta} d_k\right)$
$\dfrac{\partial J(w,\lambda,\theta)}{\partial \lambda_\Delta} = \left[z_\Delta(k) - \theta_\Delta\right]^2 - 2\alpha\lambda_\Delta$
$\dfrac{\partial J(w,\lambda,\theta)}{\partial \theta_\Delta} = -2\lambda_\Delta\left[z_\Delta(k) - \theta_\Delta\right] + 2\beta\theta_\Delta$
Parameter updates
Each parameter is adapted with its own step-size on the cost $J(w,\lambda,\theta)$ defined above:
$w_{k+1} = w_k - \mu_w\,\dfrac{\partial J(w_k, \lambda_k, \theta_k)}{\partial w_k}$
$\lambda_{\Delta,k+1} = \lambda_{\Delta,k} + \mu_\lambda\,\dfrac{\partial J(w_k, \lambda_k, \theta_k)}{\partial \lambda_{\Delta,k}}$
$\theta_{\Delta,k+1} = \theta_{\Delta,k} - \mu_\theta\,\dfrac{\partial J(w_k, \lambda_k, \theta_k)}{\partial \theta_{\Delta,k}}$
Convergence
[Figure: 0 dB SNR for both input and desired data.]
Summary
- Noise is everywhere
- MSE is not optimal, even for linear systems
- The proposed AEC and its extensions handle noisy data
- Simple online algorithms optimize AEC
Future Thoughts
- Complete analysis of the modified algorithm
- Extensions to non-linear systems
  - Difficult with global non-linear models
  - Using multiple models?
- Unsupervised learning
  - Robust subspace estimation
  - Clustering?
- Other applications
Acknowledgements
Dr. Jose C. Principe
Dr. Deniz Erdogmus
Dr. Petre Stoica
Thank You!