Intelligent Systems and Control


Artificial neural networks – Lecture 5
Prof. Kang Li
Email: [email protected]
Last lecture
Sequential BP
Some issues about MLP training
Advanced MLP learning algorithms
This Lecture
RBF
Radial Basis Function networks
[Figure: RBF network architecture, with an input layer (u1, …, un), a hidden layer of radial basis neurons (h1, …, hL), and an output layer (y1, …, ym); hidden-layer weights w^h and output-layer weights w^y.]
Radial basis function: a function whose value depends only on the radial distance from a centre:

\[
\phi(\| x - c_i \|)
\]

The XOR problem, for example, is not linearly separable but is quadratically separable: it can be solved using such radial basis functions.
Some nonlinear functions
(a) Multiquadric: \( \phi(u) = (x^2 + \sigma^2)^{1/2} \)

(b) Inverse multiquadric: \( \phi(u) = (x^2 + \sigma^2)^{-1/2} \)

(c) Gaussian: \( \phi(u) = \exp\!\left( -\dfrac{x^2}{2\sigma^2} \right) \)

(d) Thin plate spline: \( \phi(u) = x^2 \log x \)

where \( x = \| u - c \| \) is the radial distance from the centre c and \( \sigma \) is the width parameter.
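A minimal sketch of these four basis functions in Python (assuming NumPy; the function names and the guard against log(0) in the thin plate spline are my own choices):

```python
import numpy as np

def multiquadric(x, sigma):
    # phi = (x^2 + sigma^2)^(1/2), with x = ||u - c|| the radial distance
    return np.sqrt(x**2 + sigma**2)

def inverse_multiquadric(x, sigma):
    # phi = (x^2 + sigma^2)^(-1/2)
    return 1.0 / np.sqrt(x**2 + sigma**2)

def gaussian(x, sigma):
    # phi = exp(-x^2 / (2 sigma^2))
    return np.exp(-x**2 / (2.0 * sigma**2))

def thin_plate_spline(x):
    # phi = x^2 * log(x); taken as 0 at x = 0 (the usual convention)
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x**2 * np.log(np.where(x > 0, x, 1.0)), 0.0)
```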
Global support basis functions: thin plate spline and multiquadric functions generalise globally, or have global support (\( \phi(u) \to \infty \) as \( u \to \infty \)): adjusting the parameters of an individual neuron in the network can affect the network output at all points in the input space. The MLP also falls into this category.

Local support basis functions: the Gaussian and inverse multiquadric (IMQ) functions are only significantly greater than zero over a finite interval around their centres (\( \phi(u) \to 0 \) as \( u \to \infty \)). Networks with these basis functions are said to have local support, because adjusting the parameters associated with a given neuron only affects a small portion of the input space. This property is particularly useful when training a neural network on-line, as it means new information can be learned without degrading information previously learned at other points in the input space. Global support networks do not have this property.
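As an illustrative sketch (my own construction, not from the lecture), the following shows local support numerically: moving one Gaussian centre changes the network output only near that centre:

```python
import numpy as np

# Three Gaussian neurons on a 1-D input; heights fixed at 1 for simplicity.
centres = np.array([-2.0, 0.0, 2.0])
width = 0.5

def network(u, c):
    # y(u) = sum_i exp(-(u - c_i)^2 / w^2)
    return np.exp(-((u - c) ** 2) / width**2).sum()

u_grid = np.linspace(-4.0, 4.0, 9)
before = np.array([network(u, centres) for u in u_grid])

# Nudge only the first centre; the output changes near u = -2
# but stays essentially unchanged elsewhere (local support).
moved = centres.copy()
moved[0] = -1.5
after = np.array([network(u, moved) for u in u_grid])

print(np.round(np.abs(after - before), 4))  # large near -2, ~0 elsewhere
```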
[Figure: example plots of global support functions and local support functions.]
The locality of a basis function is determined by the parameter \( \sigma \), which is usually referred to as the width of the basis function.

[Figure: the Gaussian function for \( \sigma = 0.1, 0.5, 0.7, 1.0, 1.3, 1.6 \).]
• Originally the idea was to use a weighted sum of the outputs from the basis functions for various problems such as classification, density estimation, etc.
• The approach is motivated by many frameworks (regularisation, Bayesian inference, classification, kernel density estimation, noisy interpolation, etc.), but all of them suggest that the basis functions should be set so as to represent the data.
• Centres can be thought of as prototypes of the input data.
MLP vs RBF
Mathematical equations for RBF
The complete network equation for an RBF network with Gaussian
basis functions is given by:

\[
y = \sum_{i=1}^{L} h_i \exp\!\left( -\frac{\| u - c_i \|^2}{w_i^2} \right)
\]

where \( c_i \), \( w_i \) and \( h_i \) are the centre, width and height of the i-th basis function respectively, and L is the number of hidden-layer neurons.
In local support networks the approximation is generated by the neurons forming overlapping 'bumps' which combine to give the overall mapping. The more neurons used, the greater the overlap and the smoother the approximation obtained.
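A minimal sketch of this forward pass (assuming NumPy; the array shapes and names are my own):

```python
import numpy as np

def rbf_forward(U, centres, widths, heights):
    """Gaussian RBF network output y = sum_i h_i * exp(-||u - c_i||^2 / w_i^2).

    U:       (N, d) input patterns
    centres: (L, d) basis function centres c_i
    widths:  (L,)   basis function widths w_i
    heights: (L,)   output-layer weights (heights) h_i
    """
    # Squared distances ||u - c_i||^2 for every pattern/centre pair: (N, L)
    sq_dist = np.sum((U[:, None, :] - centres[None, :, :]) ** 2, axis=2)
    Phi = np.exp(-sq_dist / widths**2)   # hidden-layer outputs, (N, L)
    return Phi @ heights                 # network outputs, (N,)
```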
Approximation capabilities – curse of dimensionality for RBFNN
Theorem: An RBF network consisting of a single hidden layer of radial basis functions and a linear output neuron can approximate arbitrarily well any bounded continuous function (Hartman et al., 1990).
Another important theoretical result, by Barron (1993), relates to the bounds on the approximation error when using RBFs and MLPs. Barron shows that the approximation bound for a single hidden layer MLP with sigmoidal nonlinearities is of the order

\[
E = O\!\left( \frac{1}{N_h} \right)
\]

whereas for an RBFNN it is of the order

\[
E = O\!\left( \frac{1}{N_h^{2/d}} \right)
\]

where \( N_h \) is the number of hidden nodes and d is the number of input dimensions.
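As a rough illustration (my own, ignoring the unknown constants in the O(·) bounds and treating them as equalities), the number of hidden nodes needed to reach a target error grows very differently for the two architectures:

```python
# Treat the bounds E ~ 1/N_h (MLP) and E ~ 1/N_h^(2/d) (RBF) as equalities,
# ignoring constants, and solve for the number of hidden nodes N_h needed
# to reach a target error eps.
def mlp_nodes(eps):
    return eps ** -1           # N_h ~ 1/eps, independent of the input dimension d

def rbf_nodes(eps, d):
    return eps ** (-d / 2)     # N_h ~ eps^(-d/2), i.e. exponential in d

eps = 0.1
for d in (2, 5, 10):
    print(d, round(mlp_nodes(eps)), round(rbf_nodes(eps, d)))
# d=2: 10 vs 10, d=5: 10 vs ~316, d=10: 10 vs 100000 -- the curse of dimensionality
```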
RBF training – a two-step approach
• The first step in the two-step approach is to select the centres and widths of the RBF neurons without having to train the weights of the output layer.
• Traditionally this has involved placing the centres on a uniform
grid with the widths chosen as a function of the inter-neuron
spacing to give good interpolation.
• While this strategy guarantees good approximation capabilities, the number of neurons needed increases exponentially with the dimension of the mapping. This 'curse of dimensionality' is a major restriction on the use of local support networks.
• Uniform placement of the centres is also very inefficient in
problems where only a small portion of the input space is active.
Solution: Place a small number of neurons in a manner which
reflects the distribution of the training data.
RBF training (cont.) – step one
The simplest approach is to employ a subset of the training data,
selected at random, as centres.
Alternatively an optimal placement with respect to the
distribution of the data can be obtained by minimising the total
Euclidean distance (Ek) between the training patterns and the
closest centres, that is:

\[
E_k = \sum_{k=1}^{N} \min_{i} \, \| u_k - c(i) \|
\]
This can be determined using an unsupervised training
procedure known as k-means clustering.
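Before turning to k-means, a minimal sketch of the two centre-placement ideas above, random subset selection and the total-distance criterion E_k (assuming NumPy; names are my own):

```python
import numpy as np

def random_centres(U, L, seed=0):
    """Pick L training patterns at random to act as the RBF centres."""
    rng = np.random.default_rng(seed)
    return U[rng.choice(len(U), size=L, replace=False)]

def total_distance(U, C):
    """E_k: sum over all patterns of the distance to the closest centre."""
    dists = np.linalg.norm(U[:, None, :] - C[None, :, :], axis=2)  # (N, L)
    return dists.min(axis=1).sum()
```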
K-means clustering algorithm
- divides data points into K subgroups based on similarity
1. Given N samples, set the initial centres \( c_0(i), \; i = 1, \dots, K \), the initial learning rate \( \eta_0 \), and the iteration counter j = 0.

2. For i = 1 to N, do

\[
m = \arg\min_{k} \| u_i - c_j(k) \|
\]

\[
c_{j+1}(m) = c_j(m) + \eta_j \, ( u_i - c_j(m) )
\]

3. Reduce \( \eta_j \), so that \( \eta_{j+1} = \dfrac{\eta_j}{j+1} \).

4. j++, go to step 2 until convergence.
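A minimal sketch of this sequential (on-line) k-means procedure in Python (assuming NumPy; the initialisation and the stopping test are my own choices):

```python
import numpy as np

def sequential_kmeans(U, K, eta0=0.5, max_iter=100, tol=1e-6, seed=0):
    """On-line k-means as described above: the closest centre is nudged
    towards each sample in turn, and the learning rate decays each pass."""
    rng = np.random.default_rng(seed)
    C = U[rng.choice(len(U), size=K, replace=False)].astype(float)  # step 1
    eta = eta0
    for j in range(max_iter):
        C_old = C.copy()
        for u in U:                                          # step 2
            m = np.argmin(np.linalg.norm(u - C, axis=1))     # closest centre
            C[m] += eta * (u - C[m])                         # move it towards u
        eta = eta / (j + 1)                                  # step 3: eta_{j+1} = eta_j / (j+1)
        if np.linalg.norm(C - C_old) < tol:                  # step 4: until convergence
            break
    return C
```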
RBF training (cont.)
Step 2 - Linear training
The important consequence of being able to select suitable centres
and widths for RBF networks is that:
• the remaining weights appear linearly in the network equation,
• they can be efficiently computed with standard linear least squares or SVD,
• consequently, RBF networks can be trained much more rapidly than MLPs.
Cost function

\[
E = \frac{1}{2 N m} \sum_{i=1}^{N} \big( d(i) - y(i) \big)^{T} \big( d(i) - y(i) \big)
  = \frac{1}{2 N m} ( D - Y )^{T} ( D - Y )
\]
RBF training (cont.)
Definitions

\[
\phi_i = \exp\!\left( -\frac{\| u - c_i \|^2}{w_i^2} \right)
\]

\[
y = \sum_{i=1}^{L} h_i \phi_i = \phi^{T} h,
\qquad
\phi = [\, \phi_1 \; \cdots \; \phi_L \,]^{T},
\qquad
h = [\, h_1 \; \cdots \; h_L \,]^{T}
\]

\[
\Phi = [\, \phi(1) \; \cdots \; \phi(N) \,]^{T},
\qquad
Y = \Phi h
\]

Then

\[
E = \frac{1}{2 N m} ( D - \Phi h )^{T} ( D - \Phi h )
\]
\[
\hat{h} = \arg\min_{h} \frac{1}{2 m N} \| \Phi h - D \|^{2} = ( \Phi^{T} \Phi )^{-1} \Phi^{T} D
\]

• \( ( \Phi^{T} \Phi )^{-1} \Phi^{T} \) is the pseudo-inverse of \( \Phi \), and it is generally computed using Singular Value Decomposition (SVD).
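A minimal sketch of this linear training step (assuming NumPy; in practice one would use an SVD-based least squares solver rather than form the normal equations explicitly):

```python
import numpy as np

def train_output_weights(Phi, D):
    """Solve h = argmin ||Phi h - D||^2.

    Phi: (N, L) matrix of hidden-layer outputs phi(1), ..., phi(N)
    D:   (N,) or (N, m) array of desired outputs
    """
    # np.linalg.lstsq uses an SVD-based least squares solver, which is the
    # numerically preferred way of applying the pseudo-inverse of Phi.
    h, *_ = np.linalg.lstsq(Phi, D, rcond=None)
    return h

# Equivalent (but less stable) closed form: (Phi^T Phi)^-1 Phi^T D
# h = np.linalg.solve(Phi.T @ Phi, Phi.T @ D)
```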
• It is possible to treat the RBFN as an MLP, so the gradient descent algorithm is also applicable to RBFNs.
• Other training algorithms are also available, such as the OLS (orthogonal least squares) algorithm, which is able to both select the radial basis functions (data points) and compute the weights of the output layer.
Problems with RBF
1. Due to the local nature of the basis functions, an RBF network has problems ignoring 'noisy' input dimensions, unlike MLPs.
2. The basis function parameters chosen in the first step may not be optimal for the output task, so an optimal RBF network is not achievable when the two-step training algorithm is used.
3. Because of the dependence on distance, if the variation in one input variable is small with respect to the others, it will contribute very little to the outcome, since \( (l + e)^2 \approx l^2 \). Therefore, pre-process the data so that each input variable has zero mean and unit variance via the simple transformation \( \tilde{x} = (x - \bar{x}) / \sigma \) (a sketch follows after this list).
4. The curse of dimensionality.
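A minimal sketch of this pre-processing step (my own illustration, assuming NumPy):

```python
import numpy as np

def standardise(U):
    """Transform each input dimension to zero mean and unit variance:
    x_tilde = (x - x_bar) / sigma, applied column-wise."""
    mean = U.mean(axis=0)
    std = U.std(axis=0)
    return (U - mean) / np.where(std > 0, std, 1.0)  # guard against zero variance
```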
Comparison of MLP to RBFN
RBF: hidden unit outputs are functions of the distance from a prototype vector (centre).
MLP: hidden unit outputs are monotonic functions of a weighted linear sum of the inputs.

RBF: localised hidden units mean that only a few contribute to the output => no interference between units => faster convergence.
MLP: distributed representation, as many hidden units contribute to the network output => interference between units => non-linear training => slow convergence.

RBF: one hidden layer.
MLP: can have more than one hidden layer.

RBF: hybrid learning, with supervised learning in one set of weights.
MLP: global supervised learning of all weights.

RBF: localised approximations to nonlinear mappings.
MLP: global approximations to nonlinear mappings.