Transcript Document

LMS Algorithm in a Reproducing Kernel Hilbert Space
Weifeng Liu, P. P. Pokharel, J. C. Principe
Computational NeuroEngineering Laboratory, University of Florida
Acknowledgment: This work was partially supported by NSF grants ECS-0300340 and ECS-0601271.
Outline

- Introduction
- Least mean square (LMS) algorithm (easy)
- Reproducing kernel Hilbert space (tricky)
- The convergence and regularization analysis (important)
- Learning from error models (interesting)
Introduction

- Puskal (2006): kernel LMS
- Kivinen and Smola (2004): online learning with kernels (more like leaky LMS)
- Moody and Platt (1990s): resource allocation networks (growing and pruning)
LMS (1960, Widrow and Hoff)

Given a sequence of examples from $U \times R$:
$((u_1, y_1), \ldots, (u_N, y_N))$
where U is a compact subset of $R^L$.

The model is assumed to be
$y_n = w^o(u_n) + v(n)$

The cost function:
$J(w) = \frac{1}{N} \sum_{i=1}^{N} (y_i - w(u_i))^2$
LMS

The LMS algorithm:
$w_0 = 0$
$e_n^a = y_n - w_{n-1}(u_n)$
$w_n = w_{n-1} + \eta e_n^a u_n$    (1)

The weight after n iterations:
$w_n = \eta \sum_{i=1}^{n} e_i^a u_i$    (2)
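The recursion (1) is straightforward to code. Below is a minimal NumPy sketch of the LMS filter, not from the slides; the function name and argument shapes are illustrative (U holds one input vector per row).

    import numpy as np

    def lms(U, y, eta=0.2):
        """Run the LMS recursion (1) over the examples (u_n, y_n)."""
        N, L = U.shape
        w = np.zeros(L)               # w_0 = 0
        errors = np.empty(N)
        for n in range(N):
            e = y[n] - w @ U[n]       # a priori error e_n^a
            w = w + eta * e * U[n]    # weight update
            errors[n] = e
        return w, errors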
Reproducing kernel Hilbert space

- A continuous, symmetric, positive-definite kernel $\kappa: U \times U \to R$, a mapping $\Phi$, and an inner product $\langle \cdot, \cdot \rangle_H$.
- H is the closure of the span of all $\Phi(u)$.
- Reproducing property: $\langle f, \Phi(u) \rangle_H = f(u)$
- Kernel trick: $\langle \Phi(u_1), \Phi(u_2) \rangle_H = \kappa(u_1, u_2)$
- The induced norm: $\|f\|_H^2 = \langle f, f \rangle_H$
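A quick check of the reproducing property for a function in the span of the mapped inputs (this worked line is added for clarity; it is not on the original slide):

$f = \sum_i \alpha_i \Phi(u_i) \;\Rightarrow\; \langle f, \Phi(u) \rangle_H = \sum_i \alpha_i \langle \Phi(u_i), \Phi(u) \rangle_H = \sum_i \alpha_i \kappa(u_i, u) = f(u).$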
RKHS

Kernel trick:
- an inner product in the feature space;
- the similarity measure you need.

Mercer's theorem:
$\Phi(u) = [\phi_1(u), \phi_2(u), \ldots, \phi_M(u)]^T$
Common kernels

Gaussian kernel:
$\kappa(u_i, u_j) = \exp(-a \|u_i - u_j\|^2)$

Polynomial kernel:
$\kappa(u_i, u_j) = (u_i^T u_j + 1)^p$
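These two kernels translate directly into NumPy functions; the sketch below follows the formulas above, with the width a and degree p left as free parameters (the function names are illustrative).

    import numpy as np

    def gaussian_kernel(ui, uj, a=1.0):
        """kappa(ui, uj) = exp(-a * ||ui - uj||^2)"""
        d = np.asarray(ui) - np.asarray(uj)
        return np.exp(-a * np.dot(d, d))

    def polynomial_kernel(ui, uj, p=2):
        """kappa(ui, uj) = (ui^T uj + 1)^p"""
        return (np.dot(ui, uj) + 1.0) ** p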
Kernel LMS

Transform the input $u_i$ to $\Phi(u_i)$:
$((\Phi(u_1), y_1), \ldots, (\Phi(u_N), y_N))$
Assume $\Phi(u_i) \in R^M$.

The model is assumed to be
$y_n = \Omega^o(\Phi(u_n)) + v(n)$

The cost function:
$J(\Omega) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \Omega(\Phi(u_i)))^2$
Kernel LMS

The KLMS algorithm:
$\Omega_0 = 0$
$e_n^a = y_n - \Omega_{n-1}(\Phi(u_n))$
$\Omega_n = \Omega_{n-1} + \eta e_n^a \Phi(u_n)$    (3)

The weight after n iterations:
$\Omega_n = \eta \sum_{i=1}^{n} e_i^a \Phi(u_i)$    (4)
Kernel LMS

By the kernel trick, the recursion never needs $\Phi$ explicitly:
$\Omega_{n-1}(\Phi(u_n)) = \eta \sum_{i=1}^{n-1} e_i^a \langle \Phi(u_i), \Phi(u_n) \rangle_H = \eta \sum_{i=1}^{n-1} e_i^a \kappa(u_i, u_n),$
$e_n^a = y_n - \Omega_{n-1}(\Phi(u_n)),$
$\Omega_n = \eta \sum_{i=1}^{n} e_i^a \Phi(u_i).$    (5)
Kernel LMS

After learning, the input-output relation is
$y = \Omega_N(\Phi(u)) = \eta \sum_{i=1}^{N} e_i^a \kappa(u_i, u)$    (6)
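Putting (3)-(6) together, the filter only ever evaluates kernels against the stored inputs and their a priori errors. The following is a minimal sketch of KLMS with the Gaussian kernel; the function names, defaults, and data layout are illustrative, not from the slides.

    import numpy as np

    def klms_train(U, y, eta=0.2, a=1.0):
        """KLMS: store the training inputs and their a priori errors (eq. 5)."""
        centers, errors = [], []
        for un, yn in zip(U, y):
            # Omega_{n-1}(Phi(u_n)) = eta * sum_i e_i^a * kappa(u_i, u_n)
            pred = eta * sum(e * np.exp(-a * np.sum((c - un) ** 2))
                             for c, e in zip(centers, errors))
            errors.append(yn - pred)          # a priori error e_n^a
            centers.append(np.asarray(un))
        return np.array(centers), np.array(errors)

    def klms_predict(u, centers, errors, eta=0.2, a=1.0):
        """Input-output relation (6): y = eta * sum_i e_i^a * kappa(u_i, u)."""
        k = np.exp(-a * np.sum((centers - u) ** 2, axis=1))
        return eta * np.dot(errors, k)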
KLMS vs. RBF

KLMS: $y = \eta \sum_{i=1}^{N} e_i^a \kappa(u_i, u)$    (7)

RBF: $y = \sum_{i=1}^{N} \alpha_i \kappa(u_i, u)$    (8)
where $\alpha$ satisfies $G\alpha = y$ and G is the Gram matrix, $G(i,j) = \kappa(u_i, u_j)$.

- RBF needs regularization.
- Does KLMS need regularization?
KLMS vs. LMS

- Kernel LMS is nothing but LMS in the feature space, a very high-dimensional reproducing kernel Hilbert space (M > N).
- The eigenvalue spread is awful; does it converge?
Example: MG (Mackey-Glass) signal prediction

- Time embedding: 10
- Learning rate: 0.2
- 500 training data points
- 100 test data points
- Gaussian noise with variance 0.04
[Figure: learning curves, MSE versus training iteration (0 to 500), for the linear filter ("mse linear") and the kernel filter ("mse kernel").]
Example: MG signal prediction

MSE comparison:

                Linear LMS   KLMS     RBF (λ=0)   RBF (λ=.1)   RBF (λ=1)   RBF (λ=10)
    training    0.021        0.0060   0           0.0026      0.0036      0.010
    test        0.026        0.0066   0.019       0.0041      0.0050      0.014
Complexity Comparison

                 RBF            KLMS      LMS
    Computation  O(N^3)         O(N^2)    O(L)
    Memory       O(N^2 + N*L)   O(N*L)    O(L)
The asymptotic analysis on convergence: small-step-size theory

Denote $x_i = \Phi(u_i) \in R^M$. The correlation matrix
$R_x = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T$
is singular. Assume $\lambda_1 \ge \ldots \ge \lambda_k > \lambda_{k+1} = \ldots = \lambda_M = 0$
and $R_x = P \Lambda P^T$.
The asymptotic analysis on convergence: small-step-size theory

Denote $\Omega_n - \Omega^o = \sum_{i=1}^{M} \varepsilon_i(n) P_i$. Then we have
$E[\varepsilon_i(n)] = (1 - \eta \lambda_i)^n \varepsilon_i(0)$
$E[|\varepsilon_i(n)|^2] = \frac{\eta J_{\min}}{2 - \eta \lambda_i} + (1 - \eta \lambda_i)^{2n} \left( |\varepsilon_i(0)|^2 - \frac{\eta J_{\min}}{2 - \eta \lambda_i} \right)$
The weight stays at the initial place in the zero-eigenvalue directions

If $\lambda_i = 0$, we have
$E[\varepsilon_i(n)] = \varepsilon_i(0)$
$E[|\varepsilon_i(n)|^2] = |\varepsilon_i(0)|^2$
The zero-eigenvalue directions do not affect the MSE

Denote $J(n) = E[|y - \Omega_n(x)|^2]$. Then
$J(n) = J_{\min} + \frac{\eta J_{\min}}{2} \sum_{i=1}^{M} \lambda_i + \sum_{i=1}^{M} \lambda_i \left( |\varepsilon_i(0)|^2 - \frac{\eta J_{\min}}{2} \right) (1 - \eta \lambda_i)^{2n}$

It does not care about the null space! It only focuses on the data space!
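The step from the per-mode second moments to J(n) is the usual excess-MSE decomposition under the standard small-step-size (independence) assumptions; the intermediate line below is added for completeness, using $2 - \eta\lambda_i \approx 2$:

$J(n) = J_{\min} + \sum_{i=1}^{M} \lambda_i \, E[|\varepsilon_i(n)|^2] \approx J_{\min} + \sum_{i=1}^{M} \lambda_i \left[ \frac{\eta J_{\min}}{2} + (1 - \eta\lambda_i)^{2n} \left( |\varepsilon_i(0)|^2 - \frac{\eta J_{\min}}{2} \right) \right].$

Every term with $\lambda_i = 0$ drops out, which is why the null-space directions do not affect the MSE.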
The minimum norm initialization

The initialization $\Omega_0 = 0$ gives the minimum-norm possible solution. Writing
$\Omega_n = \sum_{i=1}^{M} \hat{w}_i P_i$,
$\|\Omega_n\|^2 = \sum_{i=1}^{M} \|\hat{w}_i\|^2 = \sum_{i=1}^{k} \|\hat{w}_i\|^2 + \sum_{i=k+1}^{M} \|\hat{w}_i\|^2,$
and the second sum remains zero because the weight never moves along the zero-eigenvalue directions.
Minimum norm solution

[Figure: example fit illustrating the minimum-norm solution; annotations on the slide: "Learning is ill-posed", "Over-learning".]
Regularization Technique

- Learning from finite data is ill-posed.
- A priori information, such as smoothness, is needed.
- The norm of the function, which indicates the 'slope' of the linear operator, is constrained.
- In statistical learning theory, the norm is associated with the confidence of uniform convergence.
Regularized RBF

The cost function:
$J(\Omega) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \Omega(\Phi(u_n)))^2$  subject to  $\|\Omega\|^2 \le C$,
or equivalently
$J(\Omega) = \frac{1}{N} \sum_{n=1}^{N} (y_n - \Omega(\Phi(u_n)))^2 + \lambda \|\Omega\|^2$
KLMS as a learning algorithm

The model: $y_n = \Omega^o(x_n) + v(n)$ with $x_n = \Phi(u_n)$.

The following inequalities hold:
$\|e^a\|^2 \le \eta^{-1} \|\Omega^o\|^2 + 2\|v\|^2$
$\|e^a\|^2 \le 2\|y\|^2$

The proof combines H∞ robustness, the triangle inequality, a matrix transformation, derivatives, and so on.
The numerical analysis

The solution of the regularized RBF network is
$y = \sum_{i=1}^{N} \alpha_i \kappa(u_i, u)$, with $\alpha = (G + \lambda I)^{-1} y$.

The source of the ill-posedness is the inversion of the matrix $(G + \lambda I)$:
$\|(G + \lambda I)^{-1}\| \to \infty$ as $\lambda \to 0$
(since the Gram matrix G is typically close to singular).
The numerical analysis

The solution of KLMS is
$y = \eta \sum_{i=1}^{N} e_i^a \kappa(u_i, u)$, with $e^a = L y$.

By the inequality above, $\|L\|^2 \le 2$.
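Why $e^a = L y$ with a lower-triangular L: stacking the a priori errors of the recursion (5) gives a triangular linear system in y. This intermediate step is added here for clarity; the norm bound then follows from $\|e^a\|^2 \le 2\|y\|^2$.

$e_n^a = y_n - \eta \sum_{i=1}^{n-1} e_i^a \kappa(u_i, u_n) \;\Longleftrightarrow\; (I + \eta S)\, e^a = y, \qquad S(n,i) = \kappa(u_i, u_n) \text{ for } i < n, \; 0 \text{ otherwise},$

so $e^a = (I + \eta S)^{-1} y =: L y$, and L is lower triangular because $I + \eta S$ is.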
Example: MG signal prediction

Solution norm comparison:

                   KLMS    RBF (λ=0)   RBF (λ=.1)   RBF (λ=1)   RBF (λ=10)
    weight norm    0.520   4.8e+3      10.90        1.37        0.231
The conclusion

- The LMS algorithm can be readily used in an RKHS to derive nonlinear algorithms.
- From the machine learning point of view, the LMS method is a simple tool to obtain a regularized solution.
Demo
LMS learning model

- An event happens, and a decision is made.
- If the decision is correct, nothing happens.
- If an error is incurred, a correction is made to the original model.
- If we do things right, everything is fine and life goes on.
- If we do something wrong, lessons are drawn and our abilities are honed.
Would we over-learn?

- If we attempt to model the real world mathematically, what dimension is appropriate?
- Are we likely to over-learn?
- Are we using the LMS algorithm?
- What is good about remembering the past?
- What is bad about being a perfectionist?

"If you shut your door to all errors, truth will be shut out." --- Rabindranath Tagore