Widrow-Hoff Learning


CHAPTER 10
Widrow-Hoff Learning
Ming-Feng Yeh
Objectives
Widrow-Hoff learning is an approximate steepest descent algorithm in which the performance index is the mean square error.
It is widely used today in many signal processing applications.
It is a precursor to the backpropagation algorithm for multilayer networks.
ADALINE Network
The ADALINE (Adaptive Linear Neuron) network and its learning rule, the LMS (Least Mean Square) algorithm, were proposed by Bernard Widrow and Marcian Hoff in 1960.
Both the ADALINE network and the perceptron suffer from the same inherent limitation: they can only solve linearly separable problems.
The LMS algorithm minimizes the mean square error (MSE), and therefore tries to move the decision boundaries as far from the training patterns as possible.
ADALINE Network
[Figure: the ADALINE network alongside the single-layer perceptron. Input p (R x 1), weight matrix W (S x R), bias b (S x 1), net input n = Wp + b, output a = purelin(Wp + b); the architecture matches the perceptron except that the transfer function is linear.]
Single ADALINE
A single ADALINE with two inputs:
$\mathbf{p} = \begin{bmatrix} p_1 \\ p_2 \end{bmatrix}$, $\mathbf{W} = \begin{bmatrix} w_{1,1} & w_{1,2} \end{bmatrix}$, $n = \mathbf{Wp} + b$, $a = \mathrm{purelin}(n) = n$
Setting n = 0, the equation Wp + b = 0 specifies a decision boundary in the input plane, with a > 0 on one side and a < 0 on the other.
The ADALINE can be used to classify objects into two categories if they are linearly separable.
[Figure: the decision boundary Wp + b = 0 in the (p1, p2) plane, with the weight vector W normal to the boundary, a > 0 on one side and a < 0 on the other.]
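As a quick illustration (my own sketch, not part of the original slides), the following Python snippet runs a two-input ADALINE as a classifier; the weight and bias values are illustrative choices, placing the decision boundary on the line p1 + p2 - 1 = 0.

import numpy as np

# A two-input ADALINE used as a classifier (illustrative weights and bias).
w = np.array([1.0, 1.0])   # [w11, w12]
b = -1.0                   # bias, so the boundary is p1 + p2 - 1 = 0

def adaline(p):
    # a = purelin(Wp + b) = Wp + b
    return w @ p + b

for p in ([2.0, 2.0], [0.0, 0.0]):
    a = adaline(np.array(p))
    print(p, "a =", a, "-> category", 1 if a >= 0 else -1)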
Mean Square Error
The LMS algorithm is an example of supervised training.
The LMS algorithm adjusts the weights and biases of the ADALINE in order to minimize the mean square error, where the error is the difference between the target output ($t_q$) and the network output ($a_q$).
Collect the adjustable parameters and the input into single vectors:
$\mathbf{x} = \begin{bmatrix} {}_{1}\mathbf{w} \\ b \end{bmatrix}$, $\mathbf{z} = \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix}$, so that $a = {}_{1}\mathbf{w}^T\mathbf{p} + b = \mathbf{x}^T\mathbf{z}$
MSE: $F(\mathbf{x}) = E[e^2] = E[(t - a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$, where $E[\cdot]$ denotes the expected value.
Performance Optimization
Develop algorithms to optimize a performance index F(x), where "optimize" means to find the value of x that minimizes F(x).
The optimization algorithms are iterative:
$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{p}_k$, or equivalently $\Delta\mathbf{x}_k = \mathbf{x}_{k+1} - \mathbf{x}_k = \alpha_k\mathbf{p}_k$
$\mathbf{p}_k$: a search direction
$\alpha_k$: a positive learning rate, which determines the length of the step
$\mathbf{x}_0$: initial guess
Taylor Series Expansion
Taylor series (scalar case):
$F(x) = F(x^*) + \frac{dF(x)}{dx}\Big|_{x=x^*}(x - x^*) + \frac{1}{2!}\frac{d^2F(x)}{dx^2}\Big|_{x=x^*}(x - x^*)^2 + \frac{1}{3!}\frac{d^3F(x)}{dx^3}\Big|_{x=x^*}(x - x^*)^3 + \cdots$
Vector case, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$:
$F(\mathbf{x}) = F(\mathbf{x}^*) + \sum_{i=1}^{n}\frac{\partial F(\mathbf{x})}{\partial x_i}\Big|_{\mathbf{x}=\mathbf{x}^*}(x_i - x_i^*) + \frac{1}{2!}\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\partial^2 F(\mathbf{x})}{\partial x_i\,\partial x_j}\Big|_{\mathbf{x}=\mathbf{x}^*}(x_i - x_i^*)(x_j - x_j^*) + \cdots$
In matrix form:
$F(\mathbf{x}) = F(\mathbf{x}^*) + \nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*}(\mathbf{x} - \mathbf{x}^*) + \frac{1}{2!}(\mathbf{x} - \mathbf{x}^*)^T\,\nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*}(\mathbf{x} - \mathbf{x}^*) + \cdots$
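To make the expansion concrete, here is a small numerical check (not from the slides): it compares the first- and second-order Taylor approximations of F(x) = cos(x) about x* = 0 with the exact value.

import numpy as np

# F(x) = cos(x), expanded about x* = 0: F(x*) = 1, F'(x*) = 0, F''(x*) = -1,
# so the first-order approximation is 1 and the second-order is 1 - x^2/2.
for x in (0.1, 0.5, 1.0):
    exact = np.cos(x)
    first = 1.0
    second = 1.0 - 0.5 * x**2
    print(f"x={x:4.1f}  exact={exact:.4f}  1st={first:.4f}  2nd={second:.4f}")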
Gradient & Hessian
F (x)  F ( x1 , x2 , ..., xn )
 

F
(
x
)

F ( x)
Gradient:

 x1



F ( x) 
F ( x) 
x2
xn

 2
F ( x)

2
 x21
 
Hessian:  2 F (x)   x x F (x)
 2 1
 2 
  F ( x)
 xn x1
Ming-Feng Yeh
2
F ( x)
x1x2
2
F ( x)
2
x2

2

F ( x)
xn x2




T

2
F ( x) 
x1xn

2


F ( x) 
x2xn



2


F
(
x
)

xn2
9
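The gradient and Hessian can be checked numerically by finite differences. The sketch below is my own, using F(x) = x1^2 + 2*x2^2, the example function that appears on the next slide.

import numpy as np

def F(x):
    # Example performance index: F(x) = x1^2 + 2*x2^2
    return x[0]**2 + 2.0 * x[1]**2

def num_gradient(F, x, h=1e-5):
    # Central-difference estimate of the gradient of F at x.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (F(x + e) - F(x - e)) / (2 * h)
    return g

def num_hessian(F, x, h=1e-4):
    # Central-difference estimate of the Hessian of F at x.
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (F(x + ei + ej) - F(x + ei - ej)
                       - F(x - ei + ej) + F(x - ei - ej)) / (4 * h * h)
    return H

x = np.array([0.5, 0.5])
print(num_gradient(F, x))   # analytic gradient [2*x1, 4*x2] = [1, 2]
print(num_hessian(F, x))    # analytic Hessian [[2, 0], [0, 4]]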
Directional Derivative
The i-th element of the gradient, $\partial F(\mathbf{x})/\partial x_i$, is the first derivative of the performance index F along the $x_i$ axis.
Let p be a vector in the direction along which we wish to know the derivative.
Directional derivative: $\dfrac{\mathbf{p}^T\nabla F(\mathbf{x})}{\lVert\mathbf{p}\rVert}$
Example: $F(\mathbf{x}) = x_1^2 + 2x_2^2$. Find the derivative of F(x) at the point $\mathbf{x}^* = \begin{bmatrix} 0.5 & 0.5 \end{bmatrix}^T$ in the direction $\mathbf{p} = \begin{bmatrix} 2 & -1 \end{bmatrix}^T$.
$\nabla F(\mathbf{x})\Big|_{\mathbf{x}=\mathbf{x}^*} = \begin{bmatrix} 2x_1 \\ 4x_2 \end{bmatrix}_{\mathbf{x}=\mathbf{x}^*} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \qquad \frac{\mathbf{p}^T\nabla F(\mathbf{x}^*)}{\lVert\mathbf{p}\rVert} = \frac{\begin{bmatrix} 2 & -1 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \end{bmatrix}}{\sqrt{5}} = \frac{0}{\sqrt{5}} = 0$
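The example above can be verified in a couple of lines of Python (my own check, not part of the slides):

import numpy as np

# Directional derivative of F(x) = x1^2 + 2*x2^2 at x* = [0.5, 0.5] along p = [2, -1].
x_star = np.array([0.5, 0.5])
p = np.array([2.0, -1.0])
grad = np.array([2.0 * x_star[0], 4.0 * x_star[1]])   # analytic gradient [2*x1, 4*x2]
print(p @ grad / np.linalg.norm(p))                   # 0.0: to first order, F does not change along p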
Steepest Descent
Goal: the function F(x) should decrease at each iteration, i.e., $F(\mathbf{x}_{k+1}) < F(\mathbf{x}_k)$.
Central idea: first-order Taylor series expansion,
$F(\mathbf{x}_{k+1}) = F(\mathbf{x}_k + \Delta\mathbf{x}_k) \approx F(\mathbf{x}_k) + \mathbf{g}_k^T\Delta\mathbf{x}_k$, where $\mathbf{g}_k \equiv \nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$
For $F(\mathbf{x}_{k+1}) < F(\mathbf{x}_k)$ we need $\mathbf{g}_k^T\Delta\mathbf{x}_k = \alpha_k\mathbf{g}_k^T\mathbf{p}_k < 0$, i.e., $\mathbf{g}_k^T\mathbf{p}_k < 0$.
Any vector $\mathbf{p}_k$ that satisfies $\mathbf{g}_k^T\mathbf{p}_k < 0$ is called a descent direction.
The vector that points in the steepest descent direction is $\mathbf{p}_k = -\mathbf{g}_k$.
Steepest descent: $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{p}_k = \mathbf{x}_k - \alpha_k\mathbf{g}_k$, i.e., $\Delta\mathbf{x}_k = -\alpha_k\mathbf{g}_k$
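A minimal steepest descent loop on the earlier example F(x) = x1^2 + 2*x2^2 looks like this (my own sketch; the learning rate and starting point are arbitrary choices):

import numpy as np

def grad_F(x):
    # gradient of F(x) = x1^2 + 2*x2^2
    return np.array([2.0 * x[0], 4.0 * x[1]])

alpha = 0.1                      # constant learning rate
x = np.array([0.5, 0.5])         # initial guess x0
for k in range(50):
    x = x - alpha * grad_F(x)    # x_{k+1} = x_k - alpha * g_k
print(x)                         # approaches the minimum at [0, 0]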
Approximation-Based Formulation
Given input/target training data $\{\mathbf{p}_1, t_1\}, \{\mathbf{p}_2, t_2\}, \ldots, \{\mathbf{p}_Q, t_Q\}$, the objective of network training is to find the optimal weights that minimize the squared error between the target value and the actual network response.
Model (network) function: $a = \varphi(\mathbf{z}, \mathbf{x})$, with $\mathbf{x} = \begin{bmatrix} \mathbf{w}^T & b \end{bmatrix}^T$ and $\mathbf{z} = \begin{bmatrix} \mathbf{p}^T & 1 \end{bmatrix}^T$
Least-squares-error function: $E(\mathbf{z}, \mathbf{x}) = \frac{1}{2}\big(t - \varphi(\mathbf{z}, \mathbf{x})\big)^2$
The weight vector x can be trained by minimizing the error function along the gradient-descent direction:
$\Delta\mathbf{x} = -\alpha\frac{\partial E(\mathbf{z}, \mathbf{x})}{\partial\mathbf{x}} = \alpha\big(t - \varphi(\mathbf{z}, \mathbf{x})\big)\frac{\partial\varphi(\mathbf{z}, \mathbf{x})}{\partial\mathbf{x}}$
Delta Learning Rule
ADALINE: $a = \varphi(\mathbf{z}, \mathbf{x}) = \mathbf{w}^T\mathbf{p} + b = \mathbf{x}^T\mathbf{z}$, with $\mathbf{x} = \begin{bmatrix} \mathbf{w}^T & b \end{bmatrix}^T$ and $\mathbf{z} = \begin{bmatrix} \mathbf{p}^T & 1 \end{bmatrix}^T$
Least-squares-error criterion: minimize $E = \frac{1}{2}(t - a)^2$
Gradient:
$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial a}\frac{\partial a}{\partial w_j} = -(t - a)\,p_j, \qquad \frac{\partial E}{\partial b} = \frac{\partial E}{\partial a}\frac{\partial a}{\partial b} = -(t - a)$
Delta learning rule ($\Delta\mathbf{x} = -\alpha\,\partial E/\partial\mathbf{x}$):
$w_j(k+1) = w_j(k) + \alpha(t - a)\,p_j, \quad j = 1, 2, \ldots, R$
$b(k+1) = b(k) + \alpha(t - a)$
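One delta-rule update, written out in Python (a sketch; the learning rate and the input/target values are illustrative, not taken from the slides):

import numpy as np

alpha = 0.1
w = np.array([0.0, 0.0])
b = 0.0
p = np.array([1.0, -1.0])   # input pattern
t = 1.0                     # target

a = w @ p + b               # ADALINE output: purelin(w'p + b)
e = t - a                   # error t - a
w = w + alpha * e * p       # w_j(k+1) = w_j(k) + alpha*(t - a)*p_j
b = b + alpha * e           # b(k+1)   = b(k)   + alpha*(t - a)
print(w, b)                 # [0.1, -0.1] 0.1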
Mean Square Error
$F(\mathbf{x}) = E[(t - \mathbf{x}^T\mathbf{z})^2] = E[t^2 - 2t\,\mathbf{x}^T\mathbf{z} + \mathbf{x}^T\mathbf{z}\mathbf{z}^T\mathbf{x}] = E[t^2] - 2\mathbf{x}^T E[t\mathbf{z}] + \mathbf{x}^T E[\mathbf{z}\mathbf{z}^T]\mathbf{x} = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$
$c = E[t^2]$
$\mathbf{h} = E[t\mathbf{z}]$: cross-correlation vector between t and z
$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T]$: input correlation matrix
$\nabla F(\mathbf{x}) = \nabla(c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}) = -2\mathbf{h} + 2\mathbf{R}\mathbf{x} = \mathbf{0} \;\Rightarrow\; \mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$
(using $\nabla(\mathbf{h}^T\mathbf{x}) = \nabla(\mathbf{x}^T\mathbf{h}) = \mathbf{h}$ for a constant vector h, and $\nabla(\mathbf{x}^T\mathbf{R}\mathbf{x}) = \mathbf{R}\mathbf{x} + \mathbf{R}^T\mathbf{x} = 2\mathbf{R}\mathbf{x}$ for a symmetric matrix R)
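The statistics c, h and R can be estimated from training data and the minimum solved for directly. The snippet below is a sketch with three illustrative input/target pairs of my own choosing; each row of Z is a vector z = [p; 1].

import numpy as np

Z = np.array([[ 1.0,  1.0, 1.0],
              [ 1.0, -1.0, 1.0],
              [-1.0, -1.0, 1.0]])
T = np.array([1.0, -1.0, -1.0])

c = np.mean(T**2)                   # c = E[t^2]
h = (Z * T[:, None]).mean(axis=0)   # h = E[t z]
R = (Z.T @ Z) / len(T)              # R = E[z z^T]

x_star = np.linalg.solve(R, h)      # x* = R^{-1} h minimizes the MSE
print(c, h, x_star)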
Mean Square Error
If the correlation matrix R is positive definite, there will be a unique stationary point $\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$, which will be a strong minimum.
Strong minimum: the point $\mathbf{x}^*$ is a strong minimum of F(x) if a scalar $\delta > 0$ exists such that $F(\mathbf{x}^*) < F(\mathbf{x}^* + \Delta\mathbf{x})$ for all $\Delta\mathbf{x}$ such that $\delta > \lVert\Delta\mathbf{x}\rVert > 0$.
Global minimum: the point $\mathbf{x}^*$ is a unique global minimum of F(x) if $F(\mathbf{x}^*) < F(\mathbf{x}^* + \Delta\mathbf{x})$ for all $\Delta\mathbf{x} \neq \mathbf{0}$.
Weak minimum: the point $\mathbf{x}^*$ is a weak minimum of F(x) if it is not a strong minimum and a scalar $\delta > 0$ exists such that $F(\mathbf{x}^*) \leq F(\mathbf{x}^* + \Delta\mathbf{x})$ for all $\Delta\mathbf{x}$ such that $\delta > \lVert\Delta\mathbf{x}\rVert > 0$.
LMS Algorithm
The LMS algorithm locates the minimum point by using an approximate steepest descent, in which the gradient is estimated from a single sample.
Estimate the mean square error F(x) by the squared error at iteration k:
$\hat{F}(\mathbf{x}) = \big(t(k) - a(k)\big)^2 = e^2(k)$
Estimated gradient: $\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k)$
$\big[\nabla e^2(k)\big]_j = \dfrac{\partial e^2(k)}{\partial w_j} = 2e(k)\dfrac{\partial e(k)}{\partial w_j}$, for $j = 1, 2, \ldots, R$
$\big[\nabla e^2(k)\big]_{R+1} = \dfrac{\partial e^2(k)}{\partial b} = 2e(k)\dfrac{\partial e(k)}{\partial b}$
LMS Algorithm
The network output and error at iteration k:
$a(k) = \mathbf{W}\mathbf{p}(k) + b = \begin{bmatrix} w_1 & w_2 & \cdots & w_R \end{bmatrix}\begin{bmatrix} p_1(k) \\ p_2(k) \\ \vdots \\ p_R(k) \end{bmatrix} + b = \sum_{i=1}^{R} w_i\,p_i(k) + b$
$e(k) = t(k) - a(k) = t(k) - \Big(\sum_{i=1}^{R} w_i\,p_i(k) + b\Big)$
Therefore $\dfrac{\partial e(k)}{\partial w_j} = -p_j(k)$ and $\dfrac{\partial e(k)}{\partial b} = -1$, so
$\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k) = -2e(k)\,\mathbf{z}(k)$
LMS Algorithm
The steepest descent algorithm with constant learning rate $\alpha$ is $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$.
Substituting the gradient estimate $\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k) = -2e(k)\mathbf{z}(k)$ gives $\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha e(k)\mathbf{z}(k)$.
Matrix notation of the LMS algorithm:
$\mathbf{W}(k+1) = \mathbf{W}(k) + 2\alpha e(k)\mathbf{p}^T(k)$
$b(k+1) = b(k) + 2\alpha e(k)$
The LMS algorithm is also referred to as the delta rule or the Widrow-Hoff learning algorithm.
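Putting the update rules together gives a very small LMS trainer. The function below is my own sketch (the name lms_train and its arguments are not from the slides); it cycles through the training pairs, applying the W and b updates above.

import numpy as np

def lms_train(patterns, targets, alpha, epochs=1):
    # patterns: list of 1-D numpy arrays; targets: list of scalars
    W = np.zeros(len(patterns[0]))
    b = 0.0
    for _ in range(epochs):
        for p, t in zip(patterns, targets):
            a = W @ p + b                 # a(k) = W p(k) + b
            e = t - a                     # e(k) = t(k) - a(k)
            W = W + 2 * alpha * e * p     # W(k+1) = W(k) + 2*alpha*e(k)*p(k)^T
            b = b + 2 * alpha * e         # b(k+1) = b(k) + 2*alpha*e(k)
    return W, b

# Example call with illustrative data:
# W, b = lms_train([np.array([1.0, 1.0]), np.array([1.0, -1.0])], [1.0, -1.0], alpha=0.1, epochs=20)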
Quadratic Functions
General form of a quadratic function:
$G(\mathbf{x}) = c + \mathbf{d}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}$
$\nabla G(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{d}$
$\nabla^2 G(\mathbf{x}) = \mathbf{A}$ (A: Hessian matrix)
If the eigenvalues of the Hessian matrix are all positive, then the quadratic function will have one unique global minimum.
ADALINE network mean square error: $F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$, so $\mathbf{d} = -2\mathbf{h}$ and $\mathbf{A} = 2\mathbf{R}$.
Stable Learning Rates
Suppose that the performance index is a quadratic function: $G(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{d}^T\mathbf{x} + c$, so $\nabla G(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{d}$.
Steepest descent with constant learning rate: $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\mathbf{g}_k = \mathbf{x}_k - \alpha(\mathbf{A}\mathbf{x}_k + \mathbf{d})$
$\Rightarrow \mathbf{x}_{k+1} = [\mathbf{I} - \alpha\mathbf{A}]\mathbf{x}_k - \alpha\mathbf{d}$
This linear dynamic system will be stable if the eigenvalues of the matrix $[\mathbf{I} - \alpha\mathbf{A}]$ are less than one in magnitude.
Stable Learning Rates
Let $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ and $\{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n\}$ be the eigenvalues and eigenvectors of the Hessian matrix A. Then
$[\mathbf{I} - \alpha\mathbf{A}]\mathbf{z}_i = \mathbf{z}_i - \alpha\mathbf{A}\mathbf{z}_i = \mathbf{z}_i - \alpha\lambda_i\mathbf{z}_i = (1 - \alpha\lambda_i)\mathbf{z}_i$
The condition for the stability of the steepest descent algorithm is then $|1 - \alpha\lambda_i| < 1$.
Assume that the quadratic function has a strong minimum point; then its eigenvalues must be positive numbers. Hence $\alpha < 2/\lambda_i$.
This must be true for all eigenvalues: $\alpha < \dfrac{2}{\lambda_{\max}}$
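In code, the bound follows directly from the eigenvalues of the Hessian A = 2R. The sketch below uses the correlation matrix that appears in the orange/apple example on the following slides:

import numpy as np

R = np.array([[ 1.0, 0.0, -1.0],
              [ 0.0, 1.0,  0.0],
              [-1.0, 0.0,  1.0]])

eigvals = np.linalg.eigvalsh(2 * R)   # eigenvalues of the Hessian A = 2R
alpha_max = 2.0 / eigvals.max()       # stability requires alpha < 2 / lambda_max
print(eigvals, alpha_max)             # eigenvalues 0, 2, 4, so alpha < 0.5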
Analysis of Convergence
In the LMS algorithm $\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha e(k)\mathbf{z}(k)$, the weight vector $\mathbf{x}_k$ is a function only of $\mathbf{z}(k-1), \mathbf{z}(k-2), \ldots, \mathbf{z}(0)$.
Assume that successive input vectors are statistically independent; then $\mathbf{x}_k$ is independent of $\mathbf{z}(k)$.
The expected value of the weight vector will converge to $\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$. This is the minimum mean square error $\{E[e_k^2]\}$ solution.
Taking the expectation of the update gives $E[\mathbf{x}_{k+1}] = [\mathbf{I} - \alpha\mathbf{A}]E[\mathbf{x}_k] - \alpha\mathbf{d} = [\mathbf{I} - 2\alpha\mathbf{R}]E[\mathbf{x}_k] + 2\alpha\mathbf{h}$, with $\mathbf{d} = -2\mathbf{h}$, $\mathbf{A} = 2\mathbf{R}$.
The condition on stability is $0 < \alpha < 1/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of R.
The steady-state solution satisfies $E[\mathbf{x}_{ss}] = [\mathbf{I} - 2\alpha\mathbf{R}]E[\mathbf{x}_{ss}] + 2\alpha\mathbf{h}$, or $E[\mathbf{x}_{ss}] = \mathbf{R}^{-1}\mathbf{h} = \mathbf{x}^*$.
Orange/Apple Example
Training set (orange and apple patterns):
$\left\{\mathbf{p}_1 = \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix}, t_1 = -1\right\}, \qquad \left\{\mathbf{p}_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}, t_2 = 1\right\}$
Input correlation matrix (each pattern presented with probability 1/2):
$\mathbf{R} = E[\mathbf{p}\mathbf{p}^T] = \tfrac{1}{2}\mathbf{p}_1\mathbf{p}_1^T + \tfrac{1}{2}\mathbf{p}_2\mathbf{p}_2^T = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix}$
Eigenvalues: $\lambda = 0.0, 1.0, 2.0$, so the stable learning rate satisfies $\alpha < \dfrac{1}{\lambda_{\max}} = 0.5$.
In practical applications it might not be practical to calculate R, and $\alpha$ could instead be selected by trial and error.
Orange/Apple Example
Start, arbitrarily, with all the weights set to zero, and then apply the inputs p1, p2, p1, p2, ..., in that order, calculating the new weights after each input is presented. With $\mathbf{W}(0) = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}$ and $\alpha = 0.2$:
First iteration (present $\mathbf{p}_1$, $t_1 = -1$):
$a(0) = \mathbf{W}(0)\mathbf{p}(0) = \mathbf{W}(0)\mathbf{p}_1 = 0$, $\quad e(0) = t(0) - a(0) = t_1 - a(0) = -1$
$\mathbf{W}(1) = \mathbf{W}(0) + 2\alpha e(0)\mathbf{p}^T(0) = \begin{bmatrix} -0.4 & 0.4 & 0.4 \end{bmatrix}$
Second iteration (present $\mathbf{p}_2$, $t_2 = 1$):
$a(1) = \mathbf{W}(1)\mathbf{p}(1) = \mathbf{W}(1)\mathbf{p}_2 = -0.4$, $\quad e(1) = t(1) - a(1) = t_2 - a(1) = 1.4$
$\mathbf{W}(2) = \mathbf{W}(1) + 2\alpha e(1)\mathbf{p}^T(1) = \begin{bmatrix} 0.16 & 0.96 & -0.16 \end{bmatrix}$
Orange/Apple Example
a(2)  W(2)p(2)  W(2)p1  0.64
 e(2)  t (2)  a(2)  t1  a(2)  0.36
W(3)  W(2)  2e(2)pT (2)  0.0160 1.1040  0.0160.
W()  0 1 0.
 This decision boundary falls halfway between the two
reference patterns. The perceptron rule did NOT
produce such a boundary,
 The perceptron rule stops as soon as the patterns are
correctly classified, even though some patterns may
be close to the boundaries. The LMS algorithm
minimizes the mean square error.
Ming-Feng Yeh
25
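The whole sequence can be reproduced with a few lines of Python (my own sketch of the iteration described above: no bias, alpha = 0.2, patterns alternating p1, p2, p1, p2, ...):

import numpy as np

p1 = np.array([1.0, -1.0, -1.0]); t1 = -1.0
p2 = np.array([1.0,  1.0, -1.0]); t2 =  1.0
alpha = 0.2

W = np.zeros(3)
for k in range(100):
    p, t = (p1, t1) if k % 2 == 0 else (p2, t2)
    e = t - W @ p                # e(k) = t(k) - W(k) p(k)
    W = W + 2 * alpha * e * p    # W(k+1) = W(k) + 2*alpha*e(k)*p(k)^T
print(W)                         # converges toward [0, 1, 0]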
Solved Problem P10.2
Category I: $\mathbf{p}_1 = \begin{bmatrix} 1 & 1 \end{bmatrix}^T$, $\mathbf{p}_2 = \begin{bmatrix} 1 & -1 \end{bmatrix}^T$; Category II: $\mathbf{p}_3 = \begin{bmatrix} -2 & -2 \end{bmatrix}^T$
Since these are linearly separable, we can design an ADALINE network to make such a distinction. As shown in the figure, one solution is $b = 3$, $w_{1,1} = 1$, $w_{1,2} = 1$.
Category III: $\mathbf{p}_1 = \begin{bmatrix} 1 & 1 \end{bmatrix}^T$, $\mathbf{p}_2 = \begin{bmatrix} 1 & -1 \end{bmatrix}^T$; Category IV: $\mathbf{p}_3 = \begin{bmatrix} 1 & 0 \end{bmatrix}^T$
These are NOT linearly separable (p3 lies between p1 and p2), so an ADALINE network CANNOT distinguish between them.
[Figures: the two pattern sets in the (p1, p2) plane, with the separating decision boundary drawn for the first case.]
Solved Problem P10.3
$\left\{\mathbf{p}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, t_1 = 1\right\}, \qquad \left\{\mathbf{p}_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, t_2 = -1\right\}$
These patterns occur with equal probability, and they are used to train an ADALINE network with no bias. What does the MSE performance surface look like?
$c = E[t^2] = 0.5\,(1)^2 + 0.5\,(-1)^2 = 1$
$\mathbf{h} = E[t\mathbf{z}] = 0.5\,(1)\begin{bmatrix} 1 \\ 1 \end{bmatrix} + 0.5\,(-1)\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$
$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T] = 0.5\,\mathbf{p}_1\mathbf{p}_1^T + 0.5\,\mathbf{p}_2\mathbf{p}_2^T = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x} = 1 - 2w_2 + w_1^2 + w_2^2, \qquad \mathbf{x} = \begin{bmatrix} w_1 & w_2 \end{bmatrix}^T$
Solved Problem P10.3
F (x)  c  2xT h  xT Rx  1  2w2  w12  w22 , x  w1 w2 
1
1 0 0 0
x R h
 



0 1 1 1

1
4
3
2
The Hessian matrix of F(x),
w2 1
2R, has both eigenvalues
at 2. So the contour of the
0
performance surface will be
-1
circular. The center of the
-2
contours (the minimum
-3

point) is x .
Ming-Feng Yeh
-2
-1
0
1
2
3
w1
28
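These quantities are easy to verify numerically (my own check of the P10.3 results):

import numpy as np

p1, t1 = np.array([1.0,  1.0]),  1.0
p2, t2 = np.array([1.0, -1.0]), -1.0

c = 0.5 * t1**2 + 0.5 * t2**2
h = 0.5 * t1 * p1 + 0.5 * t2 * p2
R = 0.5 * np.outer(p1, p1) + 0.5 * np.outer(p2, p2)

print(c, h, np.linalg.solve(R, h))   # 1.0, [0, 1], x* = [0, 1]
print(np.linalg.eigvalsh(2 * R))     # both Hessian eigenvalues equal 2 -> circular contours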
Solved Problem P10.4
Training set (as in P10.3): $\left\{\mathbf{p}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, t_1 = 1\right\}, \left\{\mathbf{p}_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, t_2 = -1\right\}$
Train the network using the LMS algorithm, with the initial guess set to zero and a learning rate $\alpha = 0.25$.
$a(0) = \mathrm{purelin}\left(\begin{bmatrix} 0 & 0 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = 0$, $\quad e(0) = t(0) - a(0) = 1 - 0 = 1$
$\mathbf{W}(1) = \mathbf{W}(0) + 2\alpha e(0)\mathbf{p}^T(0) = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}$
$a(1) = \mathrm{purelin}\left(\begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix}\right) = 0$, $\quad e(1) = t(1) - a(1) = -1 - 0 = -1$
$\mathbf{W}(2) = \mathbf{W}(1) + 2\alpha e(1)\mathbf{p}^T(1) = \begin{bmatrix} 0 & 1 \end{bmatrix}$
Note that after two iterations the weights equal the minimum-MSE solution ${}_1\mathbf{w} = \mathbf{x}^* = \begin{bmatrix} 0 & 1 \end{bmatrix}^T$ found in P10.3.
Tapped Delay Line
p1 (k )  y(k )
y (k )
D
p2 (k )  y(k  1)
D
D
At the output of the
tapped delay line
we have an R-dim.
vector, consisting
of the input signal at
the current time
and at delays of
from 1 to R–1 time
steps.
pR (k )  y(k  R  1)
Ming-Feng Yeh
30
Adaptive Filter
Combining a tapped delay line with an ADALINE gives an adaptive filter:
$a(k) = \mathrm{purelin}(\mathbf{W}\mathbf{p} + b) = \sum_{i=1}^{R} w_{1,i}\,y(k-i+1) + b$
[Figure: adaptive filter. The delayed inputs y(k), y(k-1), ..., y(k-R+1) are weighted by w11, w12, ..., w1R, summed with the bias b, and passed through purelin to give a(k).]
Solved Problem P10.1
Adaptive filter with $w_{1,1} = 2$, $w_{1,2} = -1$, $w_{1,3} = 3$ and input sequence
$\{y(k)\} = \{\ldots, 0, 0, 0, 5, -4, 0, 0, 0, \ldots\}$, where $y(0) = 5$ and $y(1) = -4$.
Just prior to k = 0 (k < 0): three zeros have entered the filter, i.e., $y(-3) = y(-2) = y(-1) = 0$, so the output just prior to k = 0 is zero.
k = 0: $a(0) = \mathbf{W}\mathbf{p}(0) = \begin{bmatrix} 2 & -1 & 3 \end{bmatrix}\begin{bmatrix} 5 \\ 0 \\ 0 \end{bmatrix} = 10$
Solved Problem P10.1
k = 1: $a(1) = \mathbf{W}\mathbf{p}(1) = \begin{bmatrix} 2 & -1 & 3 \end{bmatrix}\begin{bmatrix} -4 \\ 5 \\ 0 \end{bmatrix} = -13$
k = 2: $a(2) = \mathbf{W}\mathbf{p}(2) = \begin{bmatrix} 2 & -1 & 3 \end{bmatrix}\begin{bmatrix} 0 \\ -4 \\ 5 \end{bmatrix} = 19$
k = 3: $a(3) = \mathbf{W}\mathbf{p}(3) = \begin{bmatrix} 2 & -1 & 3 \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ -4 \end{bmatrix} = -12$
k = 4: $a(4) = \mathbf{W}\mathbf{p}(4) = \begin{bmatrix} 2 & -1 & 3 \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} = 0$
Solved Problem P10.1
 a(1)  0, a(0)  10, a(1)  13, a(2)  19, a(3)  12, a(4)  0
 a (k )  W p(k )  w11
w12
 y (k ) 
w13  y (k  1) 
 y ( k  2) 


 The effect of y(0) last from k = 0 through k = 2,
so it will have an influence for three time
intervals.
This corresponds to the length of the impulse
response of this filter.
Ming-Feng Yeh
34
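The filter response can be reproduced with a short simulation of the tapped delay line (my own sketch of P10.1):

import numpy as np

w = np.array([2.0, -1.0, 3.0])   # [w11, w12, w13], no bias
y = {0: 5.0, 1: -4.0}            # nonzero input samples; y(k) = 0 elsewhere

def y_at(k):
    return y.get(k, 0.0)

for k in range(0, 5):
    p = np.array([y_at(k), y_at(k - 1), y_at(k - 2)])   # [y(k), y(k-1), y(k-2)]
    print(k, w @ p)                                     # 10, -13, 19, -12, 0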
Solved Problem P10.6
Application of the ADALINE: an adaptive predictor.
The purpose of this filter is to predict the next value of the input signal from the two previous values.
Suppose that the input signal is a stationary random process with autocorrelation function
$C_y(n) = E[y(k)\,y(k+n)]$, with $C_y(0) = 3$, $C_y(1) = -1$, $C_y(2) = -1$.
[Figure: adaptive predictor. A two-tap delay line feeds weights w11 and w12; the output a(k) is compared with the target t(k) = y(k) to form the error e(k).]
Solved Problem P10.6
i. Sketch the contour plot of the performance index (MSE).
The input vector to the predictor and the target are
$\mathbf{z}(k) = \mathbf{p}(k) = \begin{bmatrix} y(k-1) \\ y(k-2) \end{bmatrix}, \qquad t(k) = y(k)$
$c = E[t^2(k)] = E[y^2(k)] = C_y(0) = 3$
$\mathbf{h} = E[t\mathbf{z}] = E\begin{bmatrix} y(k)\,y(k-1) \\ y(k)\,y(k-2) \end{bmatrix} = \begin{bmatrix} C_y(1) \\ C_y(2) \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \end{bmatrix}$
$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T] = \begin{bmatrix} C_y(0) & C_y(1) \\ C_y(1) & C_y(0) \end{bmatrix} = \begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix}$
Solved Problem P10.6
Performance index (MSE): $F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$
The optimal weights are
$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h} = \begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix}^{-1}\begin{bmatrix} -1 \\ -1 \end{bmatrix} = \begin{bmatrix} 3/8 & 1/8 \\ 1/8 & 3/8 \end{bmatrix}\begin{bmatrix} -1 \\ -1 \end{bmatrix} = \begin{bmatrix} -1/2 \\ -1/2 \end{bmatrix}$
The Hessian matrix is $\nabla^2 F(\mathbf{x}) = \mathbf{A} = 2\mathbf{R} = \begin{bmatrix} 6 & -2 \\ -2 & 6 \end{bmatrix}$
Eigenvalues: $\lambda_1 = 4$, $\lambda_2 = 8$. Eigenvectors: $\mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, $\mathbf{v}_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$
The contours of F(x) will be elliptical, with the long axis of each ellipse along the first eigenvector v1, since the first eigenvalue has the smallest magnitude. The ellipses will be centered at $\mathbf{x}^*$.
[Figure: elliptical contours of F(x) in the (w11, w12) plane, centered at x*.]
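A short numerical check of these results, assuming the autocorrelation values reconstructed above (Cy(0) = 3, Cy(1) = -1, Cy(2) = -1):

import numpy as np

Cy0, Cy1, Cy2 = 3.0, -1.0, -1.0

h = np.array([Cy1, Cy2])
R = np.array([[Cy0, Cy1],
              [Cy1, Cy0]])

x_star = np.linalg.solve(R, h)       # optimal predictor weights
eigvals = np.linalg.eigvalsh(2 * R)  # eigenvalues of the Hessian A = 2R
alpha_max = 2.0 / eigvals.max()      # maximum stable learning rate

print(x_star)      # [-0.5, -0.5]
print(eigvals)     # [4, 8]
print(alpha_max)   # 0.25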
Solved Problem P10.6
ii. The maximum stable value of the learning rate for the LMS algorithm:
$\alpha < 2/\lambda_{\max} = 2/8 = 0.25$
iii. The LMS algorithm is approximate steepest descent, so the trajectory for small learning rates will move perpendicular to the contour lines.
[Figure: LMS trajectory superimposed on the elliptical contours, moving perpendicular to the contours toward the minimum x*.]
Applications
Noise cancellation system to remove 60-Hz noise from an EEG signal (Fig. 10.6)
Echo cancellation system in long-distance telephone lines (Fig. 10.10)
Filtering engine noise from a pilot's voice signal (Fig. P10.8)