
Chapter 5
NEURAL NETWORKS
by S. Betul Ceran
Outline
 Introduction
 Feed-forward Network Functions
 Network Training
 Error Backpropagation
 Regularization
Introduction
Multi-Layer Perceptron (1)
 Layered perceptron networks can realize any logical function; however, there is no simple way to estimate their parameters or to generalize the (single-layer) perceptron convergence procedure.
 Multi-layer perceptron (MLP) networks are a class of models formed from layers of sigmoidal nodes, which can be used for regression or classification purposes.
 They are commonly trained using gradient descent on a mean squared error performance function, with a technique known as error backpropagation used to calculate the gradients.
 Widely applied to many prediction and classification problems over the past 15 years.
Multi-Layer Perceptron (2)
 XOR (exclusive OR) problem:
 0 + 0 = 0
 0 + 1 = 1
 1 + 0 = 1
 1 + 1 = 0 (addition mod 2)
 The perceptron does not work here: a single layer generates only a linear decision boundary, and the XOR classes are not linearly separable.
Universal Approximation
Figure: a network built from a 1st, 2nd, and 3rd layer.
Universal approximation: a three-layer network can in principle approximate any continuous function to arbitrary accuracy, given enough hidden units!
Feed-forward Network Functions
 $y(\mathbf{x}, \mathbf{w}) = f\left( \sum_{j=1}^{M} w_j \phi_j(\mathbf{x}) \right)$   (1)
 f: nonlinear activation function
 Extensions to the previous linear models by hidden units:
– Make the basis functions φj(x) depend on parameters
– Adjust these parameters during training
 Construct linear combinations of the input variables x1, …, xD:
 $a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}$   (2)
 Transform each of them using a nonlinear activation function:
 $z_j = h(a_j)$   (3)
Cont’d
 Linearly combine them to give the output unit activations:
 $a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}$   (4)
 The key difference from the perceptron is the continuous sigmoidal nonlinearity in the hidden units, i.e. the neural network function is differentiable w.r.t. the network parameters,
 whereas the perceptron uses step functions.
Weight-space symmetry
 The network function is unchanged by certain permutations and sign flips in weight space.
 E.g. since tanh(–a) = –tanh(a), flipping the sign of all weights into a hidden unit can be compensated by flipping the sign of all weights out of that unit.
Two-layer neural network
zj: hidden unit
 $y_k(\mathbf{x}, \mathbf{w}) = f\left( \sum_{j=1}^{M} w_{kj}^{(2)}\, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)$
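As an illustration (not from the original slides), a minimal NumPy sketch of this two-layer network function, assuming tanh hidden units h and an identity output activation f; the shapes and names are illustrative:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Two-layer network y_k(x, w): linear output over tanh hidden units.

    W1: (M, D) first-layer weights, b1: (M,) biases  -> a_j, z_j
    W2: (K, M) second-layer weights, b2: (K,) biases -> a_k, y_k
    """
    a = W1 @ x + b1    # a_j = sum_i w_ji^(1) x_i + w_j0^(1)
    z = np.tanh(a)     # z_j = h(a_j)
    y = W2 @ z + b2    # a_k = sum_j w_kj^(2) z_j + w_k0^(2); f = identity here
    return y

# Example: D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2))
```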
A multi-layer perceptron fitted to different functions
Figure: fits to f(x) = x², f(x) = sin(x), f(x) = |x|, and f(x) = H(x) (the Heaviside step function).
Network Training
 The problem of assigning ‘credit’ or ‘blame’ to the individual elements (hidden units) involved in forming the overall response of a learning system
 In neural networks, the problem is deciding which weights should be altered, by how much, and in which direction
 Analogous to deciding how much a weight in an early layer contributes to the output and thus to the error
 We therefore want to find out how the weight wij affects the error, i.e. we want:
 $\frac{\partial E(t)}{\partial w_{ij}(t)}$
Error Backpropagation
Two phases of back-propagation
Activation and Error back-propagation
Weight updates
Other minimization procedures
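The slides do not give code, so here is a minimal sketch of the two phases for the two-layer network above, assuming tanh hidden units, linear outputs and a sum-of-squares error; the function and variable names are illustrative:

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    """One forward/backward pass; returns gradients of E = 0.5 * ||y - t||^2."""
    # Phase 1a: forward propagation of activations
    a = W1 @ x + b1
    z = np.tanh(a)
    y = W2 @ z + b2                                   # linear output units
    # Phase 1b: backward propagation of errors (deltas)
    delta_out = y - t                                 # dE/da_k for linear output + squared error
    delta_hid = (1.0 - z**2) * (W2.T @ delta_out)     # tanh'(a) = 1 - tanh(a)^2
    # Phase 2: gradients used for the weight updates
    grad_W2 = np.outer(delta_out, z)
    grad_b2 = delta_out
    grad_W1 = np.outer(delta_hid, x)
    grad_b1 = delta_hid
    return grad_W1, grad_b1, grad_W2, grad_b2
```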
Two schemes of training
 There are two schemes for updating the weights:
– Batch: update the weights after all patterns have been presented (one epoch).
– Online: update the weights after each pattern is presented.
 Although the batch scheme implements true gradient descent, the online scheme is often preferred since
– it requires less storage, and
– it is noisier, hence less likely to get stuck in a local minimum (which is a problem with nonlinear activation functions).
 In the online update scheme, the order of presentation matters! (See the sketch after this slide.)
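A sketch contrasting the two schemes, reusing the illustrative backprop() helper from the earlier sketch (that function is assumed to be in scope); eta is the learning rate:

```python
import numpy as np

# Online (stochastic) updates: one pattern at a time, order shuffled each epoch
def train_online(X, T, params, eta=0.01, epochs=10):
    W1, b1, W2, b2 = params
    for _ in range(epochs):
        for n in np.random.permutation(len(X)):   # order of presentation matters
            gW1, gb1, gW2, gb2 = backprop(X[n], T[n], W1, b1, W2, b2)
            W1 -= eta * gW1; b1 -= eta * gb1
            W2 -= eta * gW2; b2 -= eta * gb2
    return W1, b1, W2, b2

# Batch updates: accumulate gradients over the whole epoch, then take one step
def train_batch(X, T, params, eta=0.01, epochs=10):
    W1, b1, W2, b2 = params
    for _ in range(epochs):
        sums = [np.zeros_like(p) for p in (W1, b1, W2, b2)]
        for n in range(len(X)):
            for s, g in zip(sums, backprop(X[n], T[n], W1, b1, W2, b2)):
                s += g
        W1 -= eta * sums[0]; b1 -= eta * sums[1]
        W2 -= eta * sums[2]; b2 -= eta * sums[3]
    return W1, b1, W2, b2
```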
Problems of back-propagation
 It is extremely slow, if it converges at all.
 It may get stuck in a local minimum.
 It is sensitive to initial conditions.
 It may start oscillating.
Regularization (1)
 How to adjust the number of hidden units to
get the best performance while avoiding
over-fitting
 Add a penalty term to the error function
 The simplest regularizer is weight decay:
 $\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\,\mathbf{w}^{\mathrm{T}}\mathbf{w}$
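A small sketch of how this weight-decay term modifies the error and its gradient; E and gradE stand for any error function and its gradient and are placeholders, not part of the slides:

```python
import numpy as np

def regularized_error(w, E, lam):
    """E~(w) = E(w) + (lam/2) * w^T w  -- simple weight decay."""
    return E(w) + 0.5 * lam * (w @ w)

def regularized_gradient(w, gradE, lam):
    """Gradient of E~: dE/dw + lam * w."""
    return gradE(w) + lam * w
```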
Changing number of hidden units
Figure: over-fitting on a sinusoidal data set as the number of hidden units grows.
Regularization (2)
 One approach is to choose the specific solution having the smallest validation set error
Figure: error vs. number of hidden units.
Consistent Gaussian Priors
 One disadvantage of weight decay is its
inconsistency with certain scaling properties of
network mappings
 A linear transformation of the inputs can be absorbed into the first-layer weights so that the overall mapping remains unchanged:
 $x_i \rightarrow \tilde{x}_i = a x_i + b$
 $w_{ji} \rightarrow \tilde{w}_{ji} = \frac{1}{a}\, w_{ji}, \qquad w_{j0} \rightarrow \tilde{w}_{j0} = w_{j0} - \frac{b}{a} \sum_i w_{ji}$
Cont’d
 A similar transformation can be achieved in the output by changing the 2nd-layer weights accordingly:
 $y_k \rightarrow \tilde{y}_k = c\, y_k + d$
 $w_{kj} \rightarrow \tilde{w}_{kj} = c\, w_{kj}, \qquad w_{k0} \rightarrow \tilde{w}_{k0} = c\, w_{k0} + d$
 Then a regularizer of the following form is invariant under these linear transformations:
 $\frac{\lambda_1}{2} \sum_{w \in W_1} w^2 + \frac{\lambda_2}{2} \sum_{w \in W_2} w^2$
 W1: set of weights in the 1st layer
 W2: set of weights in the 2nd layer
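A sketch of this layer-wise regularizer with separate coefficients λ1 and λ2; excluding the biases is an assumption on my part (it keeps the term invariant under the shifts b and d above):

```python
import numpy as np

def consistent_prior_penalty(W1, W2, lam1, lam2):
    """(lam1/2) * sum of squared 1st-layer weights + (lam2/2) * sum of squared 2nd-layer weights.

    Bias parameters w_j0 and w_k0 are left out so the penalty is unchanged
    by the linear input/output rescalings shown above.
    """
    return 0.5 * lam1 * np.sum(W1**2) + 0.5 * lam2 * np.sum(W2**2)
```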
Effect of consistent Gaussian priors
Early Stopping
 A method to
– obtain good generalization performance and
– control the effective complexity of the network
 Instead of iterating until the error reaches a minimum on the training data set,
 stop at the point of smallest error w.r.t. the validation data set (see the sketch below)
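A minimal early-stopping loop as a sketch; train_one_epoch() and validation_error() are hypothetical helpers, not part of the slides:

```python
import copy

def train_with_early_stopping(model, patience=10, max_epochs=1000):
    best_err, best_model, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # hypothetical: one pass of gradient descent
        err = validation_error(model)          # hypothetical: error on held-out data
        if err < best_err:
            best_err, best_model = err, copy.deepcopy(model)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # validation error stopped improving
                break
    return best_model
```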
Effect of early stopping
Figure: error vs. number of iterations for the training set and the validation set; the validation set error shows a slight increase after its minimum.
Invariances
 Alternative approaches for encouraging an adaptive model to exhibit the required invariances
 E.g. invariance to position within the image, or to size
Various approaches
1. Augment the training set with transformed replicas according to the desired invariances (see the sketch after this list)
2. Add a regularization term to the error function; tangent propagation
3. Extract invariant features in the preprocessing stage for later use
4. Build the invariance properties into the network structure; convolutional networks
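A small sketch of approach 1 for translation invariance, generating shifted replicas with NumPy; rotations or scalings would follow the same pattern:

```python
import numpy as np

def augment_with_shifts(images, labels, shifts=(-1, 1)):
    """Return the original images plus copies shifted by a few pixels (wrap-around)."""
    aug_x, aug_t = list(images), list(labels)
    for img, t in zip(images, labels):
        for s in shifts:
            aug_x.append(np.roll(img, s, axis=0))  # vertical shift
            aug_x.append(np.roll(img, s, axis=1))  # horizontal shift
            aug_t.extend([t, t])
    return np.array(aug_x), np.array(aug_t)
```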
Tangent Propagation (Simard et al., 1992)
 A continuous transformation of a particular input vector xn can be approximated by the tangent vector τn
 A regularization function can be derived by differentiating the output function y w.r.t. the transformation parameter ξ
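A sketch of a tangent-propagation-style penalty, approximating the directional derivative τᵀ∇ₓy by a finite difference rather than the exact derivative used by Simard et al.; y_fn, x and tau are illustrative names:

```python
import numpy as np

def tangent_penalty(y_fn, x, tau, eps=1e-4):
    """Penalize the change of the network output along the tangent direction tau.

    y_fn: function mapping an input vector to the output vector y(x).
    (tau^T grad_x y_k) is approximated by (y_k(x + eps*tau) - y_k(x)) / eps.
    """
    dydxi = (y_fn(x + eps * tau) - y_fn(x)) / eps   # directional derivative estimate
    return 0.5 * np.sum(dydxi**2)
```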
Tangent vector implementation
Figure: the original image x; the tangent vector τ corresponding to a small clockwise rotation; the result x + ετ of adding a small contribution from the tangent vector; and the true rotated image.