Forecasting & Demand Planner Module 4 – Basic Concepts



Threshold units
A threshold unit takes inputs x_0, …, x_n over weights w_0, …, w_n, computes the weighted sum Σ_{i=0}^{n} w_i x_i, and outputs
o = 1 if Σ_{i=0}^{n} w_i x_i > 0, and 0 otherwise
History
• McCulloch & Pitts (1943): neural networks and artificial intelligence were born
• Hebb (1949): The Organization of Behaviour
• Minsky (1954): Neural Networks (PhD thesis)
• Rosenblatt (1960): Perceptron
• Minsky & Papert (1969): Perceptrons
• Hopfield (1982): Hopfield networks
• Kohonen (1982): Self-organizing maps
• Rumelhart, Hinton & Williams (1986): Back-propagation
• Broomhead & Lowe (1988): Radial basis functions (RBF)
• Linsker (1988): Infomax principle
• Vapnik (1990): Support vector machine
• Most recently: spiking neural networks
History of Neural Networks
• 1943: McCulloch and Pitts - Modeling the Neuron for Parallel Distributed Processing
• 1958: Rosenblatt - Perceptron
• 1969: Minsky and Papert publish limits on the
ability of a perceptron to generalize
• 1970’s and 1980’s: ANN renaissance
• 1986: Rumelhart, Hinton + Williams present
backpropagation
• 1989: Tsividis: Neural Network on a chip
[Photo: William McCulloch]
Neural Networks
• McCulloch & Pitts (1943) are
generally recognised as the designers
of the first neural network
• Many of their ideas still used today
(e.g. many simple units combine to
give increased computational power
and the idea of a threshold)
Neural Networks
• Hebb (1949) developed the first
learning rule (on the premise that if
two neurons were active at the same
time the strength between them
should be increased)
Neural Networks
• During the 50’s and 60’s many
researchers worked on the
perceptron amidst great excitement.
• 1969 saw the death of neural network
research for about 15 years – Minsky
& Papert
• Only in the mid 80's (Parker and LeCun) was interest revived (in fact Werbos had discovered the algorithm in 1974)
How Does the Brain Work? (1)
NEURON
• The cell that performs information processing in the brain
• Fundamental functional unit of all nervous system tissue
How Does the Brain Work? (2)
Each neuron consists of: SOMA, DENDRITES, AXON, and SYNAPSE
Biological neurons
[Figure: a biological neuron showing dendrites, the cell body, the axon, and a synapse onto the dendrites of the next cell]
Neural Networks
• We are born with about 100 billion
neurons
• A neuron may connect to as many as
100,000 other neurons
Biological inspiration
[Figure: neuron anatomy showing dendrites, the soma (cell body), and the axon]
Biological inspiration
[Figure: the axon of one neuron meets the dendrites of the next at synapses]
The information transmission happens at the synapses.
Biological inspiration
The spikes travelling along the axon of the pre-synaptic neuron
trigger the release of neurotransmitter substances at the synapse.
The neurotransmitters cause excitation or inhibition in the dendrite
of the post-synaptic neuron.
The integration of the excitatory and inhibitory signals may produce
spikes in the post-synaptic neuron.
The contribution of the signals depends on the strength of the
synaptic connection.
Biological Neurons
• The human information processing system consists of the brain
• Neuron: the basic building block
– a cell that communicates information to and from various parts of the body
• Simplest model of a neuron: considered as a threshold unit, a processing element (PE)
• Collects inputs & produces output if the sum of the inputs exceeds an internal threshold value
Artificial Neural Nets (ANNs)
• Many neuron-like PEs units
– Input & output units receive and broadcast signals to the environment,
respectively
– Internal units called hidden units since they are not in contact with external
environment
– units connected by weighted links (synapses)
• A parallel computation system because
– Signals travel independently on weighted channels & units can update their
state in parallel
– However, most NNs can be simulated in serial computers
• A directed graph with edges labeled by weights is typically used to describe the connections among units
A NODE
[Figure: a node with input links carrying activations a_j over weights W_{j,i}; the input function computes in_i = Σ_j W_{j,i} a_j; the activation function g gives the activation level a_i = g(in_i), which is sent along the output links]
Each processing unit has a simple program that:
a) computes a weighted sum of the input data it receives from those units which feed into it
b) outputs a single value, which in general is a non-linear function of the weighted sum of its inputs --- this output then becomes an input to those units into which the original unit feeds
g = Activation functions for units
Step function (Linear Threshold Unit): step(x) = 1 if x ≥ threshold, 0 if x < threshold
Sign function: sign(x) = +1 if x ≥ 0, −1 if x < 0
Sigmoid function: sigmoid(x) = 1/(1 + e^{−x})
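To make these definitions concrete, here is a minimal sketch of the three activation functions in Python (the function names and the default threshold are our own choices, not from the slides):

```python
import math

def step(x, threshold=0.0):
    # Linear Threshold Unit: 1 at or above the threshold, else 0
    return 1 if x >= threshold else 0

def sign(x):
    # +1 for non-negative input, -1 otherwise
    return 1 if x >= 0 else -1

def sigmoid(x):
    # Smooth squashing function with values in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(step(0.3), sign(-2.0), round(sigmoid(0.0), 2))  # 1 -1 0.5
```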
Real vs artificial neurons
[Figure: a biological neuron (dendrites, cell body, axon, synapse) alongside its artificial counterpart]
Threshold units: inputs x_0, …, x_n arrive over weights w_0, …, w_n and are summed as Σ_{i=0}^{n} w_i x_i; the output is
o = 1 if Σ_{i=0}^{n} w_i x_i > 0, and 0 otherwise
Artificial neurons
Neurons work by processing information. They receive and provide information in form of spikes.
The McCulloch-Pitts model: inputs x_1, …, x_n arrive over weights w_1, …, w_n and produce the output y through
z = Σ_{i=1}^{n} w_i x_i ;  y = H(z)
where H is the threshold (Heaviside) step function.
Mathematical representation
The neuron calculates a weighted sum of inputs and compares it to a threshold. If the sum is higher than the threshold, the output is set to 1, otherwise to −1. This thresholding is the source of the neuron's non-linearity.
Artificial neurons
Inputs x_1, x_2, …, x_n arrive over weights w_1, w_2, …, w_n at a unit f with threshold θ:
f(x_1, x_2, …, x_n) = 1 if Σ_{i=1}^{n} x_i w_i ≥ θ, and 0 otherwise
Basic Concepts
Definition of a node:
[Figure: Input 0 … Input n arrive over connections with weights W_0 … W_n, joined by a bias weight W_b, at a summing junction followed by the activation function f_H(x), which produces the output]
• A node is an element which performs the function
y = f_H(Σ(w_i x_i) + W_b)
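As a sketch of this definition in Python (the helper name node_output and the example numbers are ours):

```python
def node_output(inputs, weights, bias_weight, activation):
    """Compute y = f_H(sum_i(w_i * x_i) + W_b) for one node."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias_weight
    return activation(net)

# Example with a hard threshold as f_H:
y = node_output([1.0, 0.0], [0.5, 0.5], -0.2, lambda net: 1 if net > 0 else 0)
print(y)  # 1, since 0.5*1 + 0.5*0 - 0.2 = 0.3 > 0
```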
Anatomy of an Artificial Neuron
[Figure: inputs x_1 … x_i plus a constant bias input 1, with weights w_0 (bias), w_1, …, w_i; h combines the weights and inputs, and the activation function f produces the output]
h: combine w_i & x_i
f: activation function
y = f(h(w_0, w_i, x_i))
Simple Perceptron
[Figure: Input 0 and Input 1 arrive over weights W_0 and W_1, joined by a bias weight W_b, at a summing junction followed by f_H(x), which produces the Output]
• Binary logic application
• f_H(x) = u(x) [linear threshold]
• W_i = random(−1, 1)
• Y = u(W_0 X_0 + W_1 X_1 + W_b)
• Now how do we train it?
Artificial Neuron
• From experience: examples / training data
• Strength of connection between the neurons is stored as a weight value for the specific connection.
• Learning the solution to a problem = changing the connection weights
[Figure: a physical neuron beside an artificial neuron]
Mathematical Representation
[Figure: inputs x_1, …, x_n arrive over weights w_1, …, w_n, together with a bias b, at a summing junction followed by the activation function f, which produces the output y; the stages are Inputs → Weights → Summation → Activation → Output]
net = Σ_{i=1}^{n} w_i x_i + b
y = f(net)
Perceptron Learning Rule
• A simple perceptron is a single-unit network.
• Change the weight by an amount proportional to the difference between the desired output and the actual output:
ΔW_i = η (D − Y) I_i
where η is the learning rate, D is the desired output, Y is the actual output, and I_i is the input.
Linear Neurons
•Obviously, the fact that threshold units can only output the values 0 and 1 restricts their applicability to certain problems.
•We can overcome this limitation by eliminating the threshold and simply turning f_i into the identity function, so that we get:
o_i(t) = net_i(t)
•With this kind of neuron, we can build networks with m input neurons and n output neurons that compute a function f: R^m → R^n.
Linear Neurons
•Linear neurons are quite popular and useful for applications such as interpolation.
•However, they have a serious limitation: each neuron computes a linear function, and therefore the overall network function f: R^m → R^n is also linear.
•This means that if an input vector x results in an output vector y, then for any factor α the input αx will result in the output αy.
•Obviously, many interesting functions cannot be realized by networks of linear neurons.
Mathematical Representation
Common activation functions a = f(n):
• Step: a = f(n) = 1 if n ≥ 0, and 0 if n < 0
• Sigmoid: a = f(n) = 1/(1 + e^{−n})
• Linear: a = f(n) = n
• Gaussian: a = f(n) = e^{−n²}
Gaussian Neurons
•Another type of neuron overcomes this problem by using a Gaussian activation function:
f_i(net_i(t)) = e^{−(net_i(t))²}
[Figure: bell-shaped plot of f_i(net_i(t)) rising to 1 at net_i(t) = 0, over net_i(t) from −1 to 1]
Gaussian Neurons
•Gaussian neurons are able to realize non-linear functions.
•Therefore, networks of Gaussian units are in principle unrestricted
with regard to the functions that they can realize.
•The drawback of Gaussian neurons is that we have to make sure
that their net input does not exceed 1.
•This adds some difficulty to the learning in Gaussian networks.
Sigmoidal Neurons
•Sigmoidal neurons accept any vectors of real numbers as input,
and they output a real number between 0 and 1.
•Sigmoidal neurons are the most common type of artificial neuron,
especially in learning networks.
•A network of sigmoidal units with m input neurons and n output neurons realizes a network function
f: R^m → (0,1)^n
Sigmoidal Neurons
f_i(net_i(t)) = 1 / (1 + e^{−(net_i(t) − θ)/τ})
[Figure: sigmoid curves of f_i(net_i(t)) for τ = 0.1 and τ = 1, plotted over net_i(t) from −1 to 1]
•The parameter τ controls the slope of the sigmoid function, while the parameter θ controls the horizontal offset of the function in a way similar to the threshold neurons.
Example: A simple single unit adaptive network
• The network has 2 inputs and one output. All are binary. The output is
– 1 if W_0 I_0 + W_1 I_1 + W_b > 0
– 0 if W_0 I_0 + W_1 I_1 + W_b ≤ 0
• We want it to learn simple OR: output a 1 if either I_0 or I_1 is 1.
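Before looking at training, note that weights solving OR can simply be written down by hand. A minimal sketch (the particular weight values below are one illustrative choice, not from the slides):

```python
def adaptive_unit(i0, i1, w0, w1, wb):
    # Output 1 if W0*I0 + W1*I1 + Wb > 0, else 0
    return 1 if w0 * i0 + w1 * i1 + wb > 0 else 0

# Hand-picked weights implementing OR: any single active input clears the bias.
W0, W1, WB = 1.0, 1.0, -0.5
for i0 in (0, 1):
    for i1 in (0, 1):
        print(i0, i1, "->", adaptive_unit(i0, i1, W0, W1, WB))
# 0 0 -> 0,  0 1 -> 1,  1 0 -> 1,  1 1 -> 1
```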
Artificial neurons
The McCulloch-Pitts model:
• spikes are interpreted as spike rates;
• synaptic strengths are translated as synaptic weights;
• excitation means positive product between the incoming spike rate and the corresponding synaptic weight;
• inhibition means negative product between the incoming spike rate and the corresponding synaptic weight.
Artificial neurons
Nonlinear generalization of the McCulloch-Pitts neuron:
y = f(x, w)
y is the neuron's output, x is the vector of inputs, and w is the vector of synaptic weights.
Examples:
y = 1 / (1 + e^{−wᵀx − a})   (sigmoidal neuron)
y = e^{−||x − w||² / (2a²)}   (Gaussian neuron)
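A minimal sketch of both generalizations in Python (list-based vectors; the example numbers are invented):

```python
import math

def sigmoidal_neuron(x, w, a):
    # y = 1 / (1 + e^{-(w.x) - a})
    z = sum(wi * xi for wi, xi in zip(w, x)) + a
    return 1.0 / (1.0 + math.exp(-z))

def gaussian_neuron(x, w, a):
    # y = e^{-||x - w||^2 / (2 a^2)}: maximal when x coincides with w
    sq_dist = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return math.exp(-sq_dist / (2.0 * a * a))

print(sigmoidal_neuron([1.0, 2.0], [0.5, -0.25], 0.0))  # 0.5, since z = 0
print(gaussian_neuron([1.0, 2.0], [1.0, 2.0], 1.0))     # 1.0, since x = w
```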
NNs: Dimensions of a Neural Network
– Knowledge about the learning task is given in the form of
examples called training examples.
– ANN is specified by:
– an architecture: a set of neurons and links connecting neurons.
Each link has a weight,
– a neuron model: the information processing unit of the NN,
– a learning algorithm: used for training the NN by modifying the
weights in order to solve the particular learning task correctly
on the training examples.
The aim is to obtain a NN that generalizes well, that is, that
behaves correctly on new instances of the learning task.
Neural Network Architectures
Many kinds of structures; the main distinction is made between two classes:
a) feed-forward (a directed acyclic graph (DAG)): links are unidirectional, no cycles
b) recurrent: links form arbitrary topologies, e.g., Hopfield networks and Boltzmann machines
Recurrent networks can be unstable, oscillate, or exhibit chaotic behavior; e.g., given some input values, they can take a long time to compute a stable output, and learning is made more difficult...
However, they can implement more complex agent designs and can model systems with state.
We will focus more on feed-forward networks.
Single Layer Feed-forward
[Figure: an input layer of source nodes fully connected to an output layer of neurons]
Multi-layer feed-forward
[Figure: a 3-4-2 network with an input layer, a hidden layer, and an output layer]
Feed-forward networks:
Advantage: lack of cycles => computation proceeds uniformly from input units to output units.
- Activation from the previous time step plays no part in the computation, as it is not fed back to an earlier unit.
- The network simply computes a function of the input values that depends on the weight settings -- it has no internal state other than the weights themselves.
- Fixed structure and fixed activation function g: thus the functions representable by a feed-forward network are restricted to have a certain parameterized structure.
Learning in biological systems
Learning = learning by adaptation
The young animal learns that the green fruits are sour, while the
yellowish/reddish ones are sweet. The learning happens by adapting
the fruit picking behaviour.
At the neural level the learning happens by changing of the synaptic
strengths, eliminating some synapses, and building new ones.
Learning as optimisation
The objective of adapting the responses on the basis of the
information received from the environment is to achieve a better
state. E.g., the animal likes to eat many energy rich, juicy fruits that
make its stomach full, and makes it feel happy.
In other words, the objective of learning in biological organisms is to
optimise the amount of available resources, happiness, or in general
to achieve a closer to optimal state.
Synapse concept
• The synapse resistance to the incoming signal can be
changed during a "learning" process [1949]
Hebb’s Rule:
If an input of a neuron is repeatedly and
persistently causing the neuron to fire, a metabolic
change happens in the synapse of that particular
input to reduce its resistance
Neural Network Learning
• Objective of neural network learning: given
a set of examples, find parameter settings
that minimize the error.
• Programmer specifies
- numbers of units in each layer
- connectivity between units,
• Unknowns
- connection weights
Supervised Learning in ANNs
•In supervised learning, we train an ANN with a set of vector pairs,
so-called exemplars.
•Each pair (x, y) consists of an input vector x and a corresponding
output vector y.
•Whenever the network receives input x, we would like it to
provide output y.
•The exemplars thus describe the function that we want to “teach”
our network.
•Besides learning the exemplars, we would like our network to
generalize, that is, give plausible output for inputs that the network
had not been trained with.
Supervised Learning in ANNs
•There is a tradeoff between a network’s ability to precisely learn
the given exemplars and its ability to generalize (i.e., inter- and
extrapolate).
•This problem is similar to fitting a function to a given set of data
points.
•Let us assume that you want to find a fitting function f: R → R for a set of three data points.
•You try to do this with polynomials of degree one (a straight line), two, and nine.
Supervised Learning in ANNs
[Figure: three data points in the (x, f(x)) plane fitted by polynomials of degree 1, degree 2, and degree 9]
•Obviously, the polynomial of degree 2 provides the most plausible fit.
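This fitting experiment is easy to reproduce with numpy (a sketch; the three data points below are invented for illustration):

```python
import numpy as np

# Three invented data points
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.1, 0.9, 0.3])

# Degree 1 underfits (2 coefficients for 3 points), degree 2 fits the three
# points exactly, and degree 9 has surplus coefficients free to misbehave
# between the points (numpy warns that the fit is poorly conditioned).
for deg in (1, 2, 9):
    coeffs = np.polyfit(x, y, deg)
    pred = np.polyval(coeffs, 1.5)   # evaluate at an untrained input
    print(f"degree {deg}: f(1.5) = {pred:.3f}")
```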
Overfitting
[Figure: the real distribution versus an overfitted model]
Supervised Learning in ANNs
•The same principle applies to ANNs:
• If an ANN has too few neurons, it may not have
enough degrees of freedom to precisely
approximate the desired function.
• If an ANN has too many neurons, it will learn the exemplars perfectly, but its additional degrees of freedom may cause it to show implausible behavior for untrained inputs; it then exhibits poor generalization ability.
•Unfortunately, there are no known equations that could tell you
the optimal size of your network for a given application; you always
have to experiment.
Learning in Neural Nets
Learning Tasks

Supervised
Data: labeled examples (input, desired output)
Tasks: classification, pattern recognition, regression
NN models: perceptron, adaline, feed-forward NN, radial basis function, support vector machines

Unsupervised
Data: unlabeled examples (different realizations of the input)
Tasks: clustering, content addressable memory
NN models: self-organizing maps (SOM), Hopfield networks
Learning Algorithms
Depend on the network architecture:
• Error correcting learning (perceptron)
• Delta rule (AdaLine, Backprop)
• Competitive Learning (Self Organizing Maps)
Perceptrons
• Perceptrons are single-layer feed-forward networks
• Each output unit is independent of the others
• Can assume a single output unit
• Activation of the output unit is calculated by:
O = Step(Σ_{j=0}^{n} w_j x_j)
where x_j is the activation of input unit j, and we assume an additional weight and input to represent the threshold
Perceptron
[Figure: inputs x_1, …, x_n with weights w_1, …, w_n, plus a fixed input X_0 = 1 with weight w_0, feed the sum Σ_{j=0}^{n} w_j x_j]
O = 1 if Σ_{j=0}^{n} w_j x_j > 0, and −1 otherwise
Perceptron
Rosenblatt (1958) defined a perceptron to be a machine that
learns, using examples, to assign input vectors (samples) to
different classes, using linear functions of the inputs
Minsky and Papert (1969) instead describe the perceptron as a stochastic gradient-descent algorithm that attempts to linearly separate a set of n-dimensional training data.
Linearly Separable
[Figure: two scatter plots of + and − examples in the (x_1, x_2) plane; in (a) a straight line separates the classes, in (b) no straight line can]
Some functions are not representable, e.g., (b) is not linearly separable.
So what can be represented using perceptrons?
[Figure: linear decision boundaries for AND and OR]
Representation theorem: 1-layer feed-forward networks can only represent linearly separable functions. That is, the decision surface separating positive from negative examples has to be a plane.
Learning Boolean AND
[Figure: a perceptron learning the Boolean AND function]
XOR
• No w_0, w_1, w_2 satisfy:
w_0 ≤ 0
w_2 + w_0 > 0
w_1 + w_0 > 0
w_1 + w_2 + w_0 ≤ 0
(Minsky and Papert, 1969)
Expressive limits of perceptrons
• Can the XOR function be represented by a perceptron
(a network without a hidden layer)?
XOR cannot be represented.
How can perceptrons be designed?
• The Perceptron Learning Theorem (Rosenblatt, 1960):
Given enough training examples, there is an algorithm
that will learn any linearly separable function.
Theorem 1 (Minsky and Papert, 1969) The perceptron
rule converges to weights that correctly classify all
training examples provided the given data set represents
a function that is linearly separable
The perceptron learning algorithm
• Inputs: training set {(x_1, x_2, …, x_n, t)}
• Method
– Randomly initialize weights w_i in [−0.5, 0.5]
– Repeat for several epochs until convergence:
• for each example:
– Calculate network output o.
– Adjust weights by the perceptron training rule:
Δw_i = η (t − o) x_i
w_i ← w_i + Δw_i
where η is the learning rate and (t − o) is the error.
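A minimal sketch of this algorithm in Python, learning the OR function from the earlier example (the learning rate and epoch count are arbitrary choices):

```python
import random

def train_perceptron(data, eta=0.1, epochs=50):
    """data: list of (inputs, target); a constant input 1 is prepended for the threshold."""
    n = len(data[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in data:
            xs = [1] + list(x)                      # bias input
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0
            # Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
    return w

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(or_data)
for x, t in or_data:
    o = 1 if sum(wi * xi for wi, xi in zip(w, [1] + list(x))) > 0 else 0
    print(x, "target:", t, "output:", o)
```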
Why does the method work?
• The perceptron learning rule performs gradient descent in weight space.
– Error surface: the surface that describes the error on each example as a function of all the weights in the network. A set of weights defines a point on this surface.
– We look at the partial derivative of the surface with respect to each weight (i.e., the gradient -- how much the error would change if we made a small change in each weight). Then the weights are altered by an amount proportional to the slope in each direction (corresponding to a weight). Thus the network as a whole moves in the direction of steepest descent on the error surface.
• The error surface in weight space has a single global minimum and no local minima. Gradient descent is guaranteed to find the global minimum, provided the learning rate is not so big that you overshoot it.
Multi-layer, feed-forward networks
Perceptrons are rather weak as computing models since they can only learn linearly-separable functions.
Thus, we now focus on multi-layer, feed-forward networks of non-linear sigmoid units, i.e.,
g(x) = 1 / (1 + e^{−x})
Multi-layer feed-forward networks
Multi-layer, feed-forward networks extend perceptrons (i.e., 1-layer networks) into n-layer networks by:
• Partitioning units into layers 0 to L such that:
•the lowermost layer, layer 0, contains the input units;
•the topmost layer, numbered L, contains the output units;
•the layers numbered 1 to L−1 are the hidden layers.
•Connectivity means bottom-up connections only, with no cycles, hence the name "feed-forward" nets.
•Input layers transmit input values to hidden layer nodes, and hence do not perform any computation.
Note: the layer number indicates the distance of a node from the input nodes.
Multilayer feed forward network
[Figure: a layer of input units x_0, …, x_4, a layer of hidden units v_1, v_2, v_3, and a layer of output units o_1, o_2]
Multi-layer feed-forward networks
• Multi-layer feed-forward networks can be trained by
back-propagation provided the activation function g is a
differentiable function.
– Threshold units don’t qualify, but the sigmoid function does.
• Back-propagation learning is a gradient descent
search through the parameter space to minimize the
sum-of-squares error.
– The most common learning algorithm for multilayer networks
Sigmoid units
[Figure: inputs x_0, …, x_n with weights w_0, …, w_n feed the sum Σ_{i=0}^{n} w_i x_i into a sigmoid unit for g, producing output o]
σ(a) = 1 / (1 + e^{−a})
∂σ(a)/∂a = σ(a)(1 − σ(a))
This is g′ (the basis for gradient descent).
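This closed form of the derivative is easy to verify numerically (a small sketch):

```python
import math

def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigma_prime(a):
    # Closed form used by back-propagation: sigma(a) * (1 - sigma(a))
    return sigma(a) * (1.0 - sigma(a))

a, h = 0.7, 1e-6
numeric = (sigma(a + h) - sigma(a - h)) / (2 * h)   # central difference
print(abs(numeric - sigma_prime(a)) < 1e-8)          # True
```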
Determining optimal network structure
Weak point of fixed-structure networks: a poor choice can lead to poor performance.
Too small a network: the model is incapable of representing the desired function.
Too big a network: it will be able to memorize all examples by forming a large lookup table, but will not generalize well to inputs that have not been seen before.
Thus finding a good network structure is another example of a search problem.
Some approaches to search for a solution to this problem include genetic algorithms, but using GAs is very CPU-intensive.
Learning rate
• Ideally, each weight should have its own learning rate
• As a substitute, each neuron or each layer could have its
own rate
• Learning rates should be proportional to the square root of the number of inputs to the neuron
Setting the parameter values
• How are the weights initialized?
• Do weights change after the presentation of each pattern
or only after all patterns of the training set have been
presented?
• How is the value of the learning rate chosen?
• When should training stop?
• How many hidden layers and how many nodes in each
hidden layer should be chosen to build a feedforward
network for a given problem?
• How many patterns should there be in a training set?
• How does one know that the network has learnt
something useful?
When should neural nets be used for learning a problem
• If instances are given as attribute-value pairs.
– Pre-processing required: continuous input values should be scaled to the [0, 1] range, and discrete values need to be converted to Boolean features.
• If there is noise in the training examples.
• If long training time is acceptable.
Neural Networks: Advantages
•Distributed representations
•Simple computations
•Robust with respect to noisy data
•Robust with respect to node failure
•Empirically shown to work well for many problem domains
•Parallel processing
Neural Networks: Disadvantages
•Training is slow
•Interpretability is hard
•Network topology layouts are ad hoc
•Can be hard to debug
•May converge to a local, not global, minimum of error
•May be hard to describe a problem in terms of features with
numerical values
Back-propagation Algorithm
[Figure: neuron j receives y_0 = +1 (bias, with weight w_j0(n) = b_j(n)) and inputs y_1(n), …, y_i(n) over weights w_ji(n); the summing junction produces net_j(n), and the activation f(.) produces y_j(n)]
net_j(n) = Σ_{i=0}^{m} w_ji(n) y_i(n)
y_j(n) = f_j(net_j(n))
e_j(n) = d_j(n) − y_j(n)
Total error (summed over all output neurons C):
ξ(n) = (1/2) Σ_{j∈C} e_j²(n)
Average squared error:
ξ_av = (1/N) Σ_{n=1}^{N} ξ(n)
where N = number of items in the training set
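A small sketch computing these two error measures over a toy training set (the desired/actual values are invented):

```python
# Each item: (desired outputs d, actual outputs y) over the output neurons C
training_results = [
    ([1.0, 0.0], [0.8, 0.1]),
    ([0.0, 1.0], [0.3, 0.7]),
]

def instantaneous_error(d, y):
    # xi(n) = 1/2 * sum_j e_j(n)^2, with e_j(n) = d_j(n) - y_j(n)
    return 0.5 * sum((dj - yj) ** 2 for dj, yj in zip(d, y))

xi = [instantaneous_error(d, y) for d, y in training_results]
xi_av = sum(xi) / len(xi)    # xi_av = (1/N) * sum_n xi(n)
print(xi, xi_av)              # [0.025, 0.09] 0.0575
```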
Back-propagation Algorithm
By the chain rule,
∂ξ(n)/∂w_ji(n) = [∂ξ(n)/∂e_j(n)] [∂e_j(n)/∂y_j(n)] [∂y_j(n)/∂net_j(n)] [∂net_j(n)/∂w_ji(n)]
with
∂ξ(n)/∂e_j(n) = e_j(n), as ξ(n) = (1/2) Σ_{j∈C} e_j²(n)
∂e_j(n)/∂y_j(n) = −1, as e_j(n) = d_j(n) − y_j(n)
∂y_j(n)/∂net_j(n) = f′_j(net_j(n)), as y_j(n) = f_j(net_j(n))
∂net_j(n)/∂w_ji(n) = y_i(n), as net_j(n) = Σ_{i=0}^{m} w_ji(n) y_i(n)
so that
∂ξ(n)/∂w_ji(n) = −e_j(n) f′_j(net_j(n)) y_i(n)
Gradient descent therefore adjusts each weight by
Δw_ji(n) = η δ_j(n) y_i(n)
where the error term (local gradient) is
δ_j(n) = e_j(n) f′_j(net_j(n))
If f_j(net_j(n)) = 1 / (1 + exp(−net_j(n))), then
f′(net_j(n)) = y_j(n)[1 − y_j(n)], as y_j(n) = f_j(net_j(n))
Back-propagation Algorithm
Neuron k is an output node:
δ_k(n) = e_k(n) f′(net_k(n)) = [d_k(n) − y_k(n)] y_k(n)[1 − y_k(n)]
Neuron j is a hidden node: its local gradient is accumulated from the output neurons k it feeds (through net_k = Σ_j f_j(net_j) w_kj):
δ_j(n) = f′_j(net_j(n)) Σ_k δ_k(n) w_kj(n) = y_j(n)[1 − y_j(n)] Σ_k δ_k(n) w_kj(n)
[Figure: hidden layer neuron j feeds output layer neurons 1, 2, … through weights w_1j, w_2j, …]
The weight adjustment is
Δw_ji(n) = η δ_j(n) y_i(n)
(weight adjustment = learning rate × local gradient × input signal)
w_ji(n+1) = w_ji(n) + Δw_ji(n)
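A sketch of one such update for a single sigmoid output neuron fed by a layer of hidden activations (all values invented for illustration):

```python
import math

eta = 0.5
y_hidden = [1.0, 0.6, 0.4]     # y_0 = +1 is the bias "activation"
w_kj = [0.1, -0.2, 0.3]        # weights from the hidden layer into output neuron k
d_k = 1.0                       # desired output

net_k = sum(w * y for w, y in zip(w_kj, y_hidden))
y_k = 1.0 / (1.0 + math.exp(-net_k))

# Output node: delta_k = (d_k - y_k) * y_k * (1 - y_k)
delta_k = (d_k - y_k) * y_k * (1.0 - y_k)

# Hidden node j: delta_j = y_j * (1 - y_j) * sum_k delta_k * w_kj
# (a single output neuron here, so the sum has one term; the bias slot
# gets 0 since y_0 * (1 - y_0) = 0, and it has no incoming weights anyway)
delta_hidden = [y * (1.0 - y) * delta_k * w for y, w in zip(y_hidden, w_kj)]

# Weight adjustment: delta_w = eta * (local gradient) * (input signal)
w_kj = [w + eta * delta_k * y for w, y in zip(w_kj, y_hidden)]
print(round(delta_k, 4), [round(d, 4) for d in delta_hidden],
      [round(w, 4) for w in w_kj])
```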
[Figure: error surface plotted over the weights W1 and W2, with the error on the vertical axis]
Brain vs. Digital Computers (1)
Computers require hundreds of cycles to simulate the firing of a neuron.
The brain can fire all the neurons in a single step: parallelism.
Serial computers require billions of cycles to perform some tasks that the brain does in less than a second, e.g., face recognition.
What are Neural Networks
• An interconnected assembly of simple processing
elements, units, neurons or nodes, whose
functionality is loosely based on the animal
neuron
• The processing ability of the network is stored in
the interunit connection strengths, or weights,
obtained by a process of adaptation to, or learning
from, a set of training patterns.
Definition of Neural Network
A Neural Network is a system composed of many simple processing elements operating in parallel which can acquire, store, and utilize experiential knowledge.
Architecture
• Connectivity:
– fully connected
– partially connected
• Feedback
– feedforward network: no feedback
• simpler, more stable, proven most useful
– recurrent network: feedback from output to input units
• complex dynamics, may be unstable
• Number of layers, i.e., presence of hidden layers
Feedforward, Fully-Connected with One Hidden Layer
[Figure: nodes and connections arranged as an input layer, a hidden layer, and an output layer, with inputs entering on the left and outputs leaving on the right]
Hidden Units
• Layer of nodes between input and output
nodes
• Allow a network to learn non-linear
functions
• Allow the net to represent combinations of
the input features
Learning Algorithms
• How the network learns the relationship between the inputs and outputs
• Type of algorithm used depends on the type of network: architecture, type of learning, etc.
• Back Propagation: most popular
– modifications exist: quickprop, Delta-bar-Delta
• Others: conjugate gradient descent, Levenberg-Marquardt, K-Means, Kohonen, standard pseudo-inverse (SVD) linear optimization
Types of Networks
• Multilayer Perceptron
• Radial Basis Function
• Kohonen
• Linear
• Hopfield
• Adaline/Madaline
• Probabilistic Neural Network (PNN)
• General Regression Neural Network (GRNN)
• and at least thirty others
A Neural Network is a system composed of many simple processing elements operating in parallel which can acquire, store, and utilize experiential knowledge.
Basic Artificial Model
– Consists of simple processing elements called neurons, units or nodes
– Each neuron is connected to other nodes with an associated weight (strength) which typically multiplies the signal transmitted. Each neuron has a single threshold value.
Characterization
– Architecture: the pattern of nodes and connections between them
– Learning algorithm, or training method: method for determining the weights of the connections
– Activation function: function that produces an output based on the input values received by the node
Perceptrons
• First studied in the late 1950s
• Also known as Layered Feed-Forward Networks
• The only efficient learning element at that time
was for single-layered networks
• Today, used as a synonym for a single-layer, feedforward network
Single Layer Perceptron
[Figure: two variants of a single-layer perceptron; inputs X_1, X_2, …, X_n arrive over weights w_1, w_2, …, w_n at a summing junction Σ, which produces OUT = F(NET)]
The squashing function need not be sigmoidal.
[Figure-only slides: Perceptron Architecture; Single-Neuron Perceptron; Decision Boundary; Example OR and OR solution; Multiple-Neuron Perceptron; Learning Rule Test Problem; Starting Point; Tentative Learning Rule; Second Input Vector; Third Input Vector; Unified Learning Rule; Multiple-Neuron Perceptron; Apple/Banana Example, second iteration and check]
Historical Note
There was great interest in Perceptrons in the '50s and '60s, centred on the work of Rosenblatt. This was crushed by the publication of "Perceptrons" by Minsky and Papert.
[Figure: the EOR (exclusive-OR) problem; the corner points (0,0), (0,1), (1,0), (1,1) cannot be split correctly by any single line, since the regions must be linearly separable]
There were also training problems, especially with higher-order nets.
[Figure: a general feed-forward network over inputs x_1, …, x_n]
MLP is used to describe any general feed-forward (no recurrent connections) network.
However, we will concentrate on nets with units arranged in layers.
[Figure: a layered network over inputs x_1, …, x_n]
NB: different books refer to the above as either a 4-layer network (number of layers of neurons) or a 3-layer network (number of layers of adaptive weights). We will follow the latter convention.
1st question: what do the extra layers gain you? Start by looking at what a single layer can't do.
XOR problem
A single layer generates a linear decision boundary.
XOR (exclusive OR) problem:
0 + 0 = 0
1 + 1 = 2 = 0 mod 2
1 + 0 = 1
0 + 1 = 1
The perceptron does not work here.
Minsky & Papert (1969) offered a solution to the XOR problem by combining perceptron unit responses using a second layer of units.
[Figure: a two-layer network with hidden units 1 and 2, output unit 3, and bias inputs +1, acting on the corner points (±1, ±1)]
This is a linearly separable problem! For the 4 points { (−1,1), (−1,−1), (1,1), (1,−1) } it is always linearly separable if we want to have three points in a class.
Three-layer networks
[Figure: inputs x_1, x_2, …, x_n feed hidden layers, which feed the output]
What do each of the layers do?
The 1st layer draws linear boundaries. The 2nd layer combines the boundaries. The 3rd layer can generate arbitrarily complex boundaries.
Can also view the 2nd layer as using local knowledge while the 3rd layer does global.
With sigmoidal activation functions one can show that a 3-layer net can approximate any function to arbitrary accuracy: the property of Universal Approximation.
Proof by thinking of superposition of sigmoids.
Not practically useful, as an arbitrarily large number of units may be needed; it is more of an existence proof.
The same is true for a 2-layer net, provided the function is continuous and maps one finite-dimensional space to another.
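The superposition idea can be seen in a tiny sketch: the difference of two steep, shifted sigmoids yields a localized bump, and a weighted sum of such bumps approximates a continuous function (all constants here are illustrative):

```python
import math

def sigmoid(x, center, slope=200.0):
    # A steep sigmoid centred at `center`
    return 1.0 / (1.0 + math.exp(-slope * (x - center)))

def bump(x, left, right):
    # Difference of two steep sigmoids: roughly 1 on [left, right], 0 elsewhere
    return sigmoid(x, left) - sigmoid(x, right)

def approx(x, n=20):
    # Weighted sum of n bumps approximating f(x) = x^2 on [0, 1]
    width = 1.0 / n
    return sum(((i + 0.5) * width) ** 2 * bump(x, i * width, (i + 1) * width)
               for i in range(n))

for x in (0.25, 0.5, 0.9):
    print(f"x={x}: approx={approx(x):.3f}, true={x*x:.3f}")
```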
Example: Learning addition
First find the outputs O_I, O_II. In order to do this, propagate the inputs forward. First find the outputs for the neurons of the hidden layer:
O_1 = O(Σ_i W_1i X_i) = O(W_10·1 + W_11 X_1 + W_12 X_2)
O_2 = O(Σ_i W_2i X_i) = O(W_20·1 + W_21 X_1 + W_22 X_2)
O_3 = O(Σ_i W_3i X_i) = O(W_30·1 + W_31 X_1 + W_32 X_2)
Example: Learning addition
Then find the outputs of the neurons of the output layer:
O_I = O(Σ_i W_Ii O_i) = O(W_I0·1 + W_I1 O_1 + W_I2 O_2 + W_I3 O_3)
O_II = O(Σ_i W_IIi O_i) = O(W_II0·1 + W_II1 O_1 + W_II2 O_2 + W_II3 O_3)
Example: Learning addition
Now propagate back the errors. In order to do that, first find the errors for the output layer, and also update the weights between the hidden layer and the output layer:
δ_I = O_I(1 − O_I)(t_I − O_I)
δ_II = O_II(1 − O_II)(t_II − O_II)
ΔW_I0 = η δ_I
ΔW_I1 = η δ_I O_1
ΔW_I2 = η δ_I O_2
ΔW_I3 = η δ_I O_3
ΔW_II0 = η δ_II
ΔW_II1 = η δ_II O_1
ΔW_II2 = η δ_II O_2
ΔW_II3 = η δ_II O_3
Example: Learning addition
And backpropagate the errors to the hidden layer:
δ_1 = O_1(1 − O_1) Σ_k W_k1 δ_k = O_1(1 − O_1)(W_I1 δ_I + W_II1 δ_II)
δ_2 = O_2(1 − O_2) Σ_k W_k2 δ_k = O_2(1 − O_2)(W_I2 δ_I + W_II2 δ_II)
δ_3 = O_3(1 − O_3) Σ_k W_k3 δ_k = O_3(1 − O_3)(W_I3 δ_I + W_II3 δ_II)
ΔW_10 = η δ_1 X_0 = η δ_1
ΔW_11 = η δ_1 X_1
ΔW_12 = η δ_1 X_2
ΔW_20 = η δ_2 X_0 = η δ_2
ΔW_21 = η δ_2 X_1
ΔW_22 = η δ_2 X_2
ΔW_30 = η δ_3 X_0 = η δ_3
ΔW_31 = η δ_3 X_1
ΔW_32 = η δ_3 X_2
Example: Learning addition
Finally, update the weights!
W_10 ← W_10 + ΔW_10
W_11 ← W_11 + ΔW_11
W_12 ← W_12 + ΔW_12
W_20 ← W_20 + ΔW_20
W_21 ← W_21 + ΔW_21
W_22 ← W_22 + ΔW_22
W_30 ← W_30 + ΔW_30
W_31 ← W_31 + ΔW_31
W_32 ← W_32 + ΔW_32
W_I0 ← W_I0 + ΔW_I0
W_I1 ← W_I1 + ΔW_I1
W_I2 ← W_I2 + ΔW_I2
W_I3 ← W_I3 + ΔW_I3
W_II0 ← W_II0 + ΔW_II0
W_II1 ← W_II1 + ΔW_II1
W_II2 ← W_II2 + ΔW_II2
W_II3 ← W_II3 + ΔW_II3
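Putting the forward pass, the back-propagated errors, and the weight updates together, here is a sketch of the full 2-3-2 network learning one-bit addition (the two outputs play the role of O_I and O_II, here read as the carry and sum bits of X_1 + X_2; the learning rate, epoch count, and random seed are arbitrary choices, and convergence is not guaranteed for every initialization):

```python
import math, random

random.seed(0)
O = lambda z: 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
eta = 0.5

# Training set: (X1, X2) -> (carry, sum) for one-bit addition
data = [((0, 0), (0, 0)), ((0, 1), (0, 1)), ((1, 0), (0, 1)), ((1, 1), (1, 0))]

# W[j][i]: hidden neuron j from inputs (1, X1, X2); V[k][j]: output k from (1, O1, O2, O3)
W = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
V = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]

for epoch in range(20000):
    for (x1, x2), targets in data:
        xs = (1, x1, x2)
        # Forward pass
        hidden = [O(sum(w * x for w, x in zip(Wj, xs))) for Wj in W]
        hs = [1] + hidden
        out = [O(sum(v * h for v, h in zip(Vk, hs))) for Vk in V]
        # Output deltas: delta = O(1 - O)(t - O)
        d_out = [o * (1 - o) * (t - o) for o, t in zip(out, targets)]
        # Hidden deltas: delta_j = O_j(1 - O_j) * sum_k V[k][j+1] * delta_k
        d_hid = [h * (1 - h) * sum(V[k][j + 1] * d_out[k] for k in range(2))
                 for j, h in enumerate(hidden)]
        # Weight updates for both layers
        for k in range(2):
            V[k] = [v + eta * d_out[k] * h for v, h in zip(V[k], hs)]
        for j in range(3):
            W[j] = [w + eta * d_hid[j] * x for w, x in zip(W[j], xs)]

for (x1, x2), targets in data:
    hidden = [O(sum(w * x for w, x in zip(Wj, (1, x1, x2)))) for Wj in W]
    out = [O(sum(v * h for v, h in zip(Vk, [1] + hidden))) for Vk in V]
    print((x1, x2), "->", [round(o) for o in out], "target", list(targets))
```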