PowerPoint Presentation - Chiang Mai University


EE459
Neural Networks
Backpropagation
Kasin Prakobwaitayakit
Department of Electrical Engineering
Chiang Mai University
1
Background
Artificial neural networks (ANNs) provide a general,
practical method for learning real-valued, discrete-valued, and vector-valued functions from examples.
Algorithms such as BACKPROPAGATION use gradient
descent to tune network parameters to best fit a training
set of input-output pairs. ANN learning is robust to
errors in the training data and has been successfully
applied to problems such as face recognition/detection,
speech recognition, and learning robot control strategies.
2
Autonomous Vehicle Steering
3
Characteristics of ANNs
•Instances are represented by many attribute-value pairs.
•The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued
attributes.
•The training examples may contain errors.
•Long training times are acceptable.
•Fast evaluation of the learned target function may be
required.
•The ability of humans to understand the learned target
function is not important.
4
Very simple example
net input = 0.4 × 0 + (−0.1) × 1 = −0.1
(Figure: two input nodes with activations 0 and 1, connected to an output node by weights 0.4 and −0.1; the output is 0.)
5
Learning problem to be
solved
• Suppose we have an input pattern (0 1)
• We have a single output pattern (1)
• We have a net input of -0.1, which gives an
output pattern of (0)
• How could we adjust the weights, so that
this situation is remedied and the
spontaneous output matches our target
output pattern of (1)?
6
Answer
• Increase the weights, so that the net
input exceeds 0.0
• E.g., add 0.2 to all weights
• Observation: Weight from input node
with activation 0 does not have any
effect on the net input
• So we will leave it alone
7
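The toy example and fix on slides 5-7 can be sketched as follows; the threshold output unit and the function names are illustrative additions, not from the slides:

```python
# Sketch of the toy example above: input pattern (0, 1), weights (0.4, -0.1).
weights = [0.4, -0.1]
inputs = [0, 1]

def net_input(w, x):
    # weighted sum of inputs
    return sum(wi * xi for wi, xi in zip(w, x))

def output(net):
    # simple threshold unit: fire only when the net input exceeds 0.0
    return 1 if net > 0.0 else 0

print(net_input(weights, inputs))           # -0.1
print(output(net_input(weights, inputs)))   # 0

# Remedy from slide 7: add 0.2, but only to weights whose input is active,
# since a weight from an input with activation 0 cannot affect the net input.
weights = [w + 0.2 if x == 1 else w for w, x in zip(weights, inputs)]
print(net_input(weights, inputs))           # 0.1 -> output is now 1
```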
Perceptrons
One type of ANN system is based on a unit called a perceptron.
The perceptron function can be written compactly as
o(x) = sgn(w · x), where w is the weight vector, x is the input vector, and sgn(y) is 1 if y > 0 and −1 otherwise.
The space H of candidate hypotheses considered in perceptron
learning is the set of all possible real-valued weight vectors.
8
Representational Power of
Perceptrons

9
Decision surface
linear decision surface
nonlinear decision surface
Programming Example of Decision Surface
10
The Perceptron Training Rule
One way to learn an acceptable weight vector is to begin with
random weights, then iteratively apply the perceptron to each
training example, modifying the perceptron weights whenever it
misclassifies an example. This process is repeated, iterating through
the training examples as many times as needed until the perceptron
classifies all training examples correctly. Weights are modified at
each step according to the perceptron training rule, which revises the
weight wᵢ associated with input xᵢ according to the rule
wᵢ ← wᵢ + Δwᵢ, where Δwᵢ = η(t − o)xᵢ, t is the target output, o is the perceptron output, and η is the learning rate.
11
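The training loop described above can be sketched as follows; the OR data set, the sgn(0) = −1 convention, and all names are illustrative choices, not from the slides:

```python
def sgn(net):
    # threshold output; sgn(0) treated as -1 here (an arbitrary convention)
    return 1 if net > 0 else -1

def train_perceptron(examples, eta=0.1, epochs=100):
    # examples: list of (input_vector, target) with targets in {-1, +1}
    n = len(examples[0][0])
    w = [0.0] * (n + 1)            # w[0] is the bias weight (constant input x0 = 1)
    for _ in range(epochs):
        errors = 0
        for x, t in examples:
            xs = [1.0] + list(x)   # prepend the constant input for the bias
            o = sgn(sum(wi * xi for wi, xi in zip(w, xs)))
            if o != t:             # modify weights only on a misclassification
                errors += 1
                for i in range(len(w)):
                    w[i] += eta * (t - o) * xs[i]   # perceptron training rule
        if errors == 0:            # all training examples classified correctly
            break
    return w

# Linearly separable data (logical OR) converges:
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(data)
```

By the perceptron convergence theorem, the loop terminates for linearly separable data such as this OR example.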
Gradient Descent and Delta
Rule
The delta training rule is best understood by considering
the task of training an unthresholded perceptron; that is,
a linear unit for which the output o is given by o(x) = w · x.
In order to derive a weight learning rule for linear units,
let us begin by specifying a measure for the training error
of a hypothesis (weight vector) relative to the training
examples: E(w) = ½ Σ_{d∈D} (t_d − o_d)², where D is the set of training examples, t_d is the target output for training example d, and o_d is the linear unit's output for d.
12
Visualizing the Hypothesis
Space
(Figure: the error surface over weight space; gradient descent moves from a randomly chosen initial weight vector down to the minimum error.)
13
Derivation of the Gradient Descent
Rule
The vector of partial derivatives (∂E/∂w₀, ∂E/∂w₁, …, ∂E/∂wₙ) is called the gradient of E with
respect to the weight vector w, written ∇E(w).
The gradient specifies the direction that produces the steepest increase
in E. The negative of this vector therefore gives the direction of
steepest decrease. The training rule for gradient descent is
w ← w + Δw, where Δw = −η ∇E(w).
14
Derivation of the Gradient Descent Rule (cont.)
The negative sign is present because we want to move
the weight vector in the direction that decreases E. This
training rule can also be written in its component form
Δwᵢ = −η ∂E/∂wᵢ,
which makes it clear that steepest descent is achieved by
altering each component wᵢ of w in proportion to ∂E/∂wᵢ.
15
Derivation of the Gradient Descent Rule (cont.)
The vector of derivatives that forms the gradient can be obtained by
differentiating E:
∂E/∂wᵢ = Σ_d (t_d − o_d)(−x_id).
The weight-update rule for standard gradient descent can
be summarized as
Δwᵢ = η Σ_d (t_d − o_d) x_id.
16
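The standard (batch) gradient-descent rule for a linear unit can be sketched as below; the data set, learning rate, and step count are illustrative, not from the slides:

```python
def gradient_descent(examples, eta=0.05, steps=500):
    # examples: list of (input_vector, target) pairs for a linear unit o = w . x
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(steps):
        delta = [0.0] * n
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))   # linear unit output
            for i in range(n):
                # accumulate delta_w_i = eta * sum_d (t_d - o_d) * x_id
                delta[i] += eta * (t - o) * x[i]
        w = [wi + di for wi, di in zip(w, delta)]      # one batch update per step
    return w

# Noise-free data generated by w* = (2, -1); gradient descent recovers it:
data = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 3)]
w = gradient_descent(data)
```

Note the whole training set is summed over before each weight update, which is what distinguishes this from the stochastic version on the next slide.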
EECP0720 Expert Systems – Artificial Neural Networks
Stochastic Approximation to Gradient Descent
17
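The stochastic approximation replaces the summed batch update with an update after every single training example. A minimal sketch, with illustrative data and hyperparameters:

```python
import random

def stochastic_gradient_descent(examples, eta=0.05, epochs=200, seed=0):
    # incremental delta rule: update w after each example rather than
    # summing the gradient over the whole training set
    rng = random.Random(seed)
    examples = list(examples)
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        rng.shuffle(examples)                  # visit examples in random order
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Same noise-free data as the batch example; SGD also recovers w* = (2, -1):
data = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 3)]
w = stochastic_gradient_descent(data)
```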
Summary of Perceptron
Perceptron training rule guaranteed to succeed if
•training examples are linearly separable
•sufficiently small learning rate
Linear unit training rule uses gradient descent
•guaranteed to converge to hypothesis with minimum
squared error
•given sufficiently small learning rate
•even when training data contains noise
18
BACKPROPAGATION Algorithm
19
Error Function
The Backpropagation algorithm learns the weights for a multilayer
network, given a network with a fixed set of units and
interconnections. It employs gradient descent to attempt to
minimize the squared error between the network output values and
the target values for those outputs. We begin by redefining E to
sum the errors over all of the network output units:
E(w) = ½ Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)²,
where outputs is the set of output units in the network, and tkd and
okd are the target and output values associated with the kth output
unit and training example d.
20
Architecture of Backpropagation
21
Backpropagation Learning Algorithm
22
Backpropagation Learning Algorithm (cont.)
23
Backpropagation Learning Algorithm (cont.)
24
Backpropagation Learning Algorithm (cont.)
25
Backpropagation Learning Algorithm (cont.)
26
Inputs To Neurons
• Arise from other neurons or from outside
the network
• Nodes whose inputs arise outside the
network are called input nodes and simply
copy values
• An input may excite or inhibit the
response of the neuron to which it is
applied, depending upon the weight of the
connection
27
Weights
• Represent synaptic efficacy and may be
excitatory or inhibitory
• Normally, positive weights are considered
as excitatory while negative weights are
thought of as inhibitory
• Learning is the process of modifying the
weights in order to produce a network that
performs some function
28
Output
• The response function is normally
nonlinear
• Samples include
– Sigmoid: f(x) = 1 / (1 + e⁻ˣ)
– Piecewise linear: f(x) = x if x ≥ θ, and f(x) = 0 if x < θ
29
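The two response functions above can be sketched in Python; the default threshold θ = 0 for the piecewise-linear unit is an illustrative choice:

```python
import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^-x), a smooth squashing function onto (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def piecewise_linear(x, theta=0.0):
    # passes x through when it reaches the threshold, else outputs 0
    return x if x >= theta else 0.0
```

The sigmoid's differentiability is what backpropagation relies on; the piecewise-linear form is cheaper to evaluate but not smooth at θ.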
Backpropagation
Preparation
• Training Set
A collection of input-output patterns that are
used to train the network
• Testing Set
A collection of input-output patterns that are
used to assess network performance
• Learning Rate-η
A scalar parameter, analogous to step size in
numerical integration, used to set the rate of
adjustments
30
Network Error
• Total-Sum-Squared-Error (TSSE)
TSSE = ½ Σ_patterns Σ_outputs (desired − actual)²
• Root-Mean-Squared-Error (RMSE)
RMSE = √( 2 · TSSE / (#patterns · #outputs) )
31
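The two error measures above can be sketched directly; the function names and the list-of-lists representation (one output vector per pattern) are illustrative:

```python
import math

def tsse(desired, actual):
    # half the sum of squared errors over all patterns and all outputs
    return 0.5 * sum((d - a) ** 2
                     for dp, ap in zip(desired, actual)
                     for d, a in zip(dp, ap))

def rmse(desired, actual):
    # square root of the mean squared error; the factor 2 cancels
    # the 1/2 inside TSSE
    n_patterns, n_outputs = len(desired), len(desired[0])
    return math.sqrt(2.0 * tsse(desired, actual) / (n_patterns * n_outputs))
```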
A Pseudo-Code Algorithm
• Randomly choose the initial weights
• While error is too large
– For each training pattern
• Apply the inputs to the network
• Calculate the output for every neuron from the input layer,
through the hidden layer(s), to the output layer
• Calculate the error at the outputs
• Use the output error to compute error signals for pre-output layers
• Use the error signals to compute weight adjustments
• Apply the weight adjustments
– Periodically evaluate the network performance
32
Face Detection using Neural Networks
Training Process
(Diagram: a face database with target output 1 and a non-face database with target output 0 are fed to a neural network, which learns to classify each input as Face or Non-Face.)
33
Backpropagation Using
Gradient Descent
• Advantages
– Relatively simple implementation
– Standard method and generally works well
• Disadvantages
– Slow and inefficient
– Can get stuck in local minima, resulting in suboptimal solutions
34
Local Minima
(Figure: an error curve with a local minimum and the global minimum.)
35
Alternatives To Gradient
Descent
• Simulated Annealing
– Advantages
• Can guarantee the optimal solution (global
minimum), given a sufficiently slow cooling schedule
– Disadvantages
• May be slower than gradient descent
• Much more complicated implementation
36
Alternatives To Gradient
Descent
• Genetic Algorithms/Evolutionary
Strategies
– Advantages
• Faster than simulated annealing
• Less likely to get stuck in local minima
– Disadvantages
• Slower than gradient descent
• Memory intensive for large nets
37
Enhancements To
Gradient Descent
• Momentum
– Adds a percentage of the last movement
to the current movement
38
Enhancements To
Gradient Descent
• Momentum
– Useful to get over small bumps in the error
function
– Often finds a minimum in fewer steps
– Δw(t) = −η·d·y + α·Δw(t−1), where
• Δw is the change in weight
• η is the learning rate
• d is the error
• y is different depending on which layer we are
calculating
• α is the momentum parameter
39
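The momentum update above, with η for the learning rate, α for the momentum parameter, and Δw(t) = −η·d·y + α·Δw(t−1), can be sketched for a single weight; the sample values fed to the loop are illustrative:

```python
def momentum_step(prev_dw, eta, d, y, alpha):
    # current movement plus a fraction (alpha) of the last movement
    return -eta * d * y + alpha * prev_dw

# Repeated steps in the same direction accumulate, carrying the weight
# over small bumps in the error function:
dw = 0.0
for d, y in [(0.5, 1.0), (0.4, 1.0), (0.3, 1.0)]:
    dw = momentum_step(dw, eta=0.1, d=d, y=y, alpha=0.9)
```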
Enhancements To
Gradient Descent
• Adaptive Backpropagation Algorithm
– It assigns each weight a learning rate
– That learning rate is determined by the sign of
the gradient of the error function from the
last iteration
• If the signs are equal it is more likely to be a shallow
slope so the learning rate is increased
• The signs are more likely to differ on a steep slope
so the learning rate is decreased
– This will speed up the advancement when on
gradual slopes
40
Enhancements To
Gradient Descent
• Adaptive Backpropagation
– Possible Problems:
• Since we minimize the error for each weight
separately, the overall error may increase
– Solution:
• Calculate the total output error after each
adaptation and if it is greater than the
previous error reject that adaptation and
calculate new learning rates
41
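The per-weight rate adaptation above can be sketched as follows; the increase/decrease factors 1.2 and 0.5 are illustrative choices, not values from the slides:

```python
def adapt_rates(rates, grads, prev_grads, up=1.2, down=0.5):
    # each weight carries its own learning rate, adjusted by comparing
    # the sign of its error gradient with the previous iteration's
    new_rates = []
    for eta, g, pg in zip(rates, grads, prev_grads):
        if g * pg > 0:
            new_rates.append(eta * up)    # signs agree: likely a shallow slope
        elif g * pg < 0:
            new_rates.append(eta * down)  # signs differ: likely a steep slope
        else:
            new_rates.append(eta)         # a zero gradient leaves the rate alone
    return new_rates

rates = adapt_rates([0.1, 0.1], grads=[0.2, -0.3], prev_grads=[0.4, 0.3])
# the first weight's rate grows, the second shrinks
```

The rejection step described above would wrap this in an outer check: recompute the total output error after the adaptation and, if it grew, discard the adaptation and recompute the rates.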
Enhancements To
Gradient Descent
• SuperSAB(Super Self-Adapting Backpropagation)
– Combines the momentum and adaptive methods.
– Uses adaptive method and momentum so long as the sign
of the gradient does not change
• This is an additive effect of both methods resulting in a
faster traversal of gradual slopes
– When the sign of the gradient does change the
momentum will cancel the drastic drop in learning rate
• This allows for the function to roll up the other side of the
minimum possibly escaping local minima
42
Enhancements To
Gradient Descent
• SuperSAB
– Experiments show that the SuperSAB
converges faster than gradient descent
– Overall this algorithm is less sensitive to the
shape of the error surface (and so is less likely
to get caught in local minima)
43
Other Ways To Minimize
Error
• Varying training data
– Cycle through input classes
– Randomly select from input classes
• Add noise to training data
– Randomly change value of input node (with low
probability)
• Retrain with expected inputs after initial
training
– E.g. Speech recognition
44
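The "add noise to training data" idea above can be sketched for binary input patterns; the flip probability and function name are illustrative assumptions:

```python
import random

def add_noise(pattern, flip_prob=0.05, rng=None):
    # randomly change the value of each binary input node with low probability
    r = rng or random
    return [1 - v if r.random() < flip_prob else v for v in pattern]
```

Training on such perturbed copies of the input patterns tends to make the learned function less sensitive to small input errors.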
Other Ways To Minimize
Error
• Adding and removing neurons from
layers
– Adding neurons speeds up learning but
may cause loss in generalization
– Removing neurons has the opposite
effect
45