Neural Network Models of Memory

• Long-term memory:
- weight-based memory; the memory representation
takes its form in the strength or weight of neural
connections
• Short-term memory:
- activity-based memory, in which information is
retained as a sustained or persistent pattern of
activity in specific neural populations
Weight-based memory
• Long-term associative memories can be
formed by Hebbian learning: changes in
synaptic weights between neurons
Donald O. Hebb
The Neuron
• The neuron is the basic information processing unit of a
NN. It consists of:
1 A set of synapses or connecting links, each link
characterized by a weight:
W1, W2, …, Wm
2 An adder function (linear combiner) which computes
the weighted sum of the inputs:
u = Σ_{j=1..m} wj xj
3 An activation function (squashing function) φ for limiting
the amplitude of the output of the neuron:
y = φ(u + b)
Neural Networks
NN 1
5
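The three components above can be sketched in a few lines of Python (an illustrative sketch; the function name, weights, and bias are made up, and a sigmoid is used as the activation φ):

```python
import math

def neuron_output(x, w, b):
    """A single neuron: weighted sum of inputs plus bias,
    squashed by a sigmoid activation."""
    u = sum(wj * xj for wj, xj in zip(w, x))   # linear combiner
    return 1.0 / (1.0 + math.exp(-(u + b)))    # activation phi(u + b)

# Example with two inputs
y = neuron_output(x=[1.0, 0.5], w=[0.4, -0.2], b=0.1)
print(round(y, 3))  # 0.599
```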
Computation at Units
• Compute a 0-1 or a graded function of the
weighted sum of the inputs
• g () is the activation function
[Diagram: inputs x1, x2, …, xn with weights w1, w2, …, wn feed a unit g, whose output is g(w·x), where w·x = Σ_i wi xi]
The Neuron
[Diagram: input signals x1, x2, …, xm are scaled by synaptic weights w1, w2, …, wm and combined by a summing function, together with bias b, to give the local field v; an activation function φ() then produces the output y]
Common Activation Functions
• Step function:
g(x)=1, if x >= t ( t is a threshold)
g(x) = 0, if x < t
• Sign function:
g(x)=1, if x >= t ( t is a threshold)
g(x) = -1, if x < t
• Sigmoid function: g(x)= 1/(1+exp(-x))
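The three activation functions can be written directly (a minimal sketch; thresholds default to 0 for illustration):

```python
import math

def step(x, t=0.0):
    """Step function: 1 if x >= t, else 0 (t is the threshold)."""
    return 1 if x >= t else 0

def sign(x, t=0.0):
    """Sign function: 1 if x >= t, else -1."""
    return 1 if x >= t else -1

def sigmoid(x):
    """Sigmoid: smooth, graded output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(step(0.2), sign(-0.2), round(sigmoid(0.0), 2))  # 1 -1 0.5
```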
Bias of a Neuron
• Bias b has the effect of applying an affine transformation
to u
v = u + b
• v is the induced field of the neuron, where
u = Σ_{j=1..m} wj xj
Bias as extra input
• Bias is an external parameter of the neuron. Can be modeled
by adding an extra input.
v = Σ_{j=0..m} wj xj, with x0 = +1 and w0 = b
[Diagram: inputs x0 = +1, x1, …, xm with weights w0 = b, w1, …, wm feed the summing function, giving the local field v; the activation function φ() produces the output y]
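The equivalence between an explicit bias and an extra input can be checked numerically (a sketch; the function names and values are made up):

```python
# Treating the bias as an extra input: prepend x0 = +1 and w0 = b.

def local_field(x, w, b):
    """Local field with an explicit bias term."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def local_field_extra_input(x, w, b):
    """Same computation with the bias folded in as input x0 = +1."""
    x_ext = [1.0] + list(x)   # x0 = +1
    w_ext = [b] + list(w)     # w0 = b
    return sum(wj * xj for wj, xj in zip(w_ext, x_ext))

x, w, b = [0.5, -1.0], [2.0, 0.3], 0.7
print(abs(local_field(x, w, b) - local_field_extra_input(x, w, b)) < 1e-12)  # True
```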
Face Recognition
90% accurate at learning head pose and recognizing 1-of-20 faces
Handwritten digit recognition
Computing with spaces
error: E = (y − g(Wx))², where +1 = cat, −1 = dog
[Diagram: perceptual features x1, x2 are mapped to the output y = g(Wx); cats and dogs fall in different regions of the feature space]
Can Implement Boolean Functions
• A unit can implement And, Or, and Not
• Need to map True and False to numbers:
– e.g. True = 1.0, False= 0.0
• (Exercise) Use a step function and show how to
implement various simple Boolean functions
• Combining the units, we can get any Boolean
function of n variables
Can obtain logical circuits as special case
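As the exercise suggests, a step-function unit can implement And, Or, and Not; here is one possible choice of weights and thresholds (a sketch, not the only solution):

```python
# True = 1.0, False = 0.0; each function is one threshold unit.

def unit(x, w, t):
    """Single unit with a step activation and threshold t."""
    return 1.0 if sum(wi * xi for wi, xi in zip(w, x)) >= t else 0.0

AND = lambda a, b: unit([a, b], [1.0, 1.0], t=2.0)
OR  = lambda a, b: unit([a, b], [1.0, 1.0], t=1.0)
NOT = lambda a:    unit([a],    [-1.0],     t=0.0)

print(AND(1, 1), OR(0, 1), NOT(1))  # 1.0 1.0 0.0
```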
Network Structures
• Feedforward (no cycles): less powerful, but more
easily understood
– Input units
– Hidden layers
– Output units
• Perceptron: no hidden layer, so it basically corresponds
to one unit; also basically a linear threshold function
(ltf)
• Ltf: defined by a weight vector w and a threshold t; the
value is 1 iff w·x ≥ t, otherwise 0
Single Layer Feed-forward
[Diagram: an input layer of source nodes fully connected to an output layer of neurons]
Multi layer feed-forward
3-4-2 Network
[Diagram: input layer (3 nodes), hidden layer (4 neurons), output layer (2 neurons)]
Network Structures
• Recurrent (cycles exist), more powerful as they
can implement state, but harder to analyze.
Examples:
• Hopfield network, symmetric connections, interesting
properties, useful for implementing associative
memory
• Boltzmann machines: more general, with applications
in constraint satisfaction and combinatorial
optimization
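A Hopfield-style associative memory can be sketched in a few lines: patterns are stored with a Hebbian (outer-product) rule and recalled by iterating a sign update (an illustrative sketch; the patterns are made up, and synchronous updates are used for brevity):

```python
import numpy as np

# Store two +/-1 patterns in the weights, then recall from a noisy cue.
patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, 1, -1, -1, -1]])

# Hebbian storage: W = sum of outer products, zero self-connections
W = sum(np.outer(p, p) for p in patterns)
np.fill_diagonal(W, 0)

def recall(state, steps=10):
    """Iterate the sign update until (hopefully) a stored pattern."""
    s = state.copy()
    for _ in range(steps):
        s = np.where(W @ s >= 0, 1, -1)   # synchronous update
    return s

noisy = np.array([1, -1, 1, -1, 1, 1])    # first pattern, one bit flipped
print(recall(noisy))  # recovers the first stored pattern
```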
Simple recurrent networks
[Diagram: input x(i) and context units z1, z2 feed the hidden layer x1, x2; the hidden layer feeds the output x(i+1) and is copied back to the context units at each step]
(Elman, 1990)
Recurrent network
Recurrent Network with hidden neuron(s): the unit-delay
operator z⁻¹ implies a dynamic system
[Diagram: input, hidden, and output layers, with feedback connections through unit-delay (z⁻¹) elements]
Perceptron Capabilities
• Quite expressive: many, but not all Boolean
functions can be expressed. Examples:
– conjunctions and disjunctions, for example:
x1 ∨ x2 ⟺ x1 + x2 ≥ 1
– more generally, can represent functions that are
true if and only if at least k of the inputs are true:
x1 + x2 + … + xn ≥ k
– Can’t represent XOR
Representable Functions
• Perceptrons have a monotonicity property:
If a link has positive weight, activation can
only increase as the corresponding input value
increases (irrespective of other input values)
• Can’t represent functions where input
interactions can cancel one another’s effect
(e.g. XOR)
Representable Functions
• Can represent only linearly separable
functions
• Geometrically: only if there is a line (plane)
separating the positives from the negatives
• The good news: such functions are PAC
learnable and learning algorithms exist
Linearly Separable
[Figure: + and − points in the plane that can be separated by a single line]
NOT linearly Separable
[Figure: + and − points that no single line can separate]
Problems with simple networks
[Diagram: a single unit computing y from inputs x1 and x2]
Some kinds of data are not linearly separable
[Figure: in the (x1, x2) plane, AND and OR are linearly separable, but XOR is not]
A solution: multiple layers
[Diagram: input layer (x1, x2), hidden layer (z1, z2), output layer (y); the hidden layer re-represents the inputs so that the classes become separable]
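The multiple-layer solution can be made concrete by hand-wiring a two-layer step-unit network for XOR (one possible choice of weights and thresholds, for illustration):

```python
# Hidden units compute OR and AND; the output computes OR AND NOT(AND),
# which is exactly XOR.

def step_unit(x, w, t):
    """Threshold unit with a step activation."""
    return 1.0 if sum(wi * xi for wi, xi in zip(w, x)) >= t else 0.0

def xor(x1, x2):
    z1 = step_unit([x1, x2], [1.0, 1.0], t=1.0)     # OR
    z2 = step_unit([x1, x2], [1.0, 1.0], t=2.0)     # AND
    return step_unit([z1, z2], [1.0, -1.0], t=1.0)  # z1 AND NOT z2

print([xor(a, b) for a in (0, 1) for b in (0, 1)])  # [0.0, 1.0, 1.0, 0.0]
```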
The Perceptron Learning Algorithm
• Example of current-best-hypothesis (CBH)
search (so incremental, etc.):
• Begin with a hypothesis (a perceptron)
• Repeat over all examples several times
– Adjust weights as examples are seen
• Until all examples correctly classified or a
stopping criterion reached
Method for Adjusting Weights
• One weight update possibility:
• If classification correct, don’t change
• Otherwise:
– If false negative, add the input: wj ← wj + xj
– If false positive, subtract the input: wj ← wj − xj
• Intuition: For instance, if example is positive,
strengthen/increase the weights corresponding to
the positive attributes of the example
Properties of the Algorithm
• In general, also apply a learning rate η:
wj ← wj ± η xj
• The adjustment is in the direction of
minimizing error on the example
• If the learning rate is appropriate and the
examples are linearly separable, after a finite
number of iterations, the algorithm converges
to a linear separator
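The update rule with a learning rate can be sketched as follows (the data, rate, and epoch count are made up; OR is used as a linearly separable example, with the bias folded in as input x0 = 1):

```python
# Minimal perceptron learning sketch (step activation, threshold 0).

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def train(examples, eta=0.1, epochs=50):
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, target in examples:
            y = predict(w, x)
            if y != target:                       # misclassified
                sign = 1 if target == 1 else -1   # add or subtract input
                w = [wi + eta * sign * xi for wi, xi in zip(w, x)]
    return w

# Learn OR (first component of each x is the bias input x0 = 1)
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train(data)
print([predict(w, x) for x, _ in data])  # [0, 1, 1, 1]
```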
Another Algorithm
(least-sum-squares algorithm)
• Define and minimize an error function
• S is the set of examples, f() is the ideal
function, and h() is the linear function
corresponding to the current perceptron
• Error of the perceptron (over all examples):
E(h) = (1/2) Σ_{e∈S} (f(e) − h(e))²
• Note: h(e) = w·x(e) = Σ_i wi xi(e)
• Weights are adjusted by gradient descent:
Δwi = −η ∂E/∂wi
The Delta Rule
+1 = cat, −1 = dog
E = (y − g(Wx))², for any function g with derivative g′
∂E/∂wj = −2 (y − g(Wx)) g′(Wx) xj
Δwj = η (y − g(Wx)) g′(Wx) xj
(output error × influence of input)
[Diagram: perceptual features x1, x2 feed the output unit y]
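A one-unit delta-rule sketch, using g = tanh so that g′ = 1 − g² (the data, learning rate, and epoch count are made up):

```python
import math

# Delta rule for one unit; targets +1 = cat, -1 = dog.

def delta_rule(data, eta=0.1, epochs=200):
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            s = sum(wi * xi for wi, xi in zip(w, x))
            g, gp = math.tanh(s), 1 - math.tanh(s) ** 2   # g and g'
            # w_j <- w_j + eta * (y - g) * g'(s) * x_j
            w = [wi + eta * (y - g) * gp * xi for wi, xi in zip(w, x)]
    return w

data = [([1.0, 0.2], 1), ([0.9, 0.1], 1), ([0.1, 1.0], -1), ([0.2, 0.8], -1)]
w = delta_rule(data)
preds = [1 if math.tanh(sum(wi * xi for wi, xi in zip(w, x))) > 0 else -1
         for x, _ in data]
print(preds)  # [1, 1, -1, -1]
```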
Derivative of Error
• Gradient (derivative) of E:
∇E(h) = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
• Take the steepest descent direction:
wi ← wi + Δwi, where Δwi = −η ∂E/∂wi
• ∂E/∂wi is the gradient along wi; η is the learning rate
Gradient Descent
• The algorithm: pick an initial random perceptron
and repeatedly compute the error and modify the
perceptron (take a step along the reverse of the
gradient)
General-purpose learning mechanisms
[Figure: the error E plotted against a weight wij; where ∂E/∂wij > 0 the weight should decrease, where ∂E/∂wij < 0 it should increase, and ∂E/∂wij = 0 at a minimum]
Δwij = −η ∂E/∂wij (η is the learning rate)
Gradient Calculation
∂E/∂wi = ∂/∂wi [ (1/2) Σ_{e∈S} (f(e) − h(e))² ]
= (1/2) Σ_{e∈S} ∂/∂wi (f(e) − h(e))²
= (1/2) Σ_{e∈S} 2 (f(e) − h(e)) ∂/∂wi (f(e) − h(e))
= Σ_{e∈S} (f(e) − h(e)) (∂/∂wi f(e) − ∂/∂wi h(e))
Derivation (cont.)
∂E/∂wi = Σ_{e∈S} (f(e) − h(e)) (0 − ∂/∂wi (w·x(e))), as f(e) is constant
= −Σ_{e∈S} (f(e) − h(e)) ∂/∂wi (Σ_j wj xj(e))
= −Σ_{e∈S} (f(e) − h(e)) xi(e),
since ∂/∂wi (wj xj(e)) = 0 for i ≠ j
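The derived gradient gives the batch update Δwi = η Σ_{e∈S} (f(e) − h(e)) xi(e); a minimal sketch (the data, target function, learning rate, and epoch count are made up):

```python
# Batch gradient descent for a linear unit (least-sum-squares).

def h(w, x):
    """Linear hypothesis h(e) = w . x(e)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def lms(examples, eta=0.05, epochs=500):
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        grads = [0.0] * len(w)
        for x, f in examples:          # accumulate over all examples
            err = f - h(w, x)
            grads = [g + err * xi for g, xi in zip(grads, x)]
        # Step along the negative gradient: w_i += eta * sum err * x_i
        w = [wi + eta * g for wi, g in zip(w, grads)]
    return w

# Examples generated by f(x) = 2*x1 - 1*x2 (x0 = 1 is a bias input)
data = [([1, 1, 0], 2), ([1, 0, 1], -1), ([1, 1, 1], 1), ([1, 2, 1], 3)]
w = lms(data)
print([round(wi, 2) for wi in w])  # approximately [0.0, 2.0, -1.0]
```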
Properties of the algorithm
• Error function has no local minima (is
quadratic)
• The algorithm is a gradient descent method
to the global minimum, and will asymptotically
converge
• Even if not linearly separable, can find a
good (minimum error) linear classifier
• Incremental?
Multilayer Feed-Forward Networks
• Multiple perceptrons, layered
• Example: a two-layer network with 3 inputs, one
output, and one hidden layer (two hidden units)
[Diagram: inputs x1, x2, x3 feed a hidden layer of two units, which feeds the output layer]
Power/Expressiveness
• Can represent interactions among inputs (unlike
perceptrons)
• Two-layer networks can represent any Boolean
function, and continuous functions (within a
tolerance), as long as the number of hidden units is
sufficient and appropriate activation functions are used
• Learning algorithms exist, but weaker guarantees
than perceptron learning algorithms
Back-Propagation
• Similar to the perceptron learning algorithm
and gradient descent for perceptrons
• Problem to overcome: How to adjust internal
links (how to distribute the “blame” or the
error)
• Assumption: internal units use nonlinear,
differentiable activation functions
• sigmoid functions are convenient
Back-Propagation (cont.)
• Start with a network with random weights
• Repeat until a stopping criterion is met
– For each example, compute the network
output and, for each unit i, its error term δi
– Update each weight wij (the weight of the link going
from node i to node j):
wij ← wij + Δwij, where Δwij = η δj o(i)
(o(i) is the output of unit i)
The Error Term
• δi = o′(i) Err(i)
• Err(i) = fi(e) − hi(e), if i is an output node
• Err(i) = Σj wij δj, if i is an internal node
• o′(i) is the derivative of node i
• For a sigmoid, o′(i) = o(i)(1 − o(i))
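One backpropagation step for a small 2-2-1 sigmoid network, following δi = o′(i) Err(i) and wij ← wij + η δj o(i) (the network size, initial weights, and η are made up for illustration):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def backprop_step(x, target, W_hid, W_out, eta=0.5):
    # Forward pass: hidden activations, then output
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W_hid]
    y = sigmoid(sum(w * hi for w, hi in zip(W_out, h)))
    # Error terms: delta = o'(i) * Err(i), with o' = o(1 - o) for sigmoids
    delta_out = y * (1 - y) * (target - y)
    delta_hid = [h[i] * (1 - h[i]) * W_out[i] * delta_out
                 for i in range(len(h))]
    # Weight updates: w_ij <- w_ij + eta * delta_j * o(i)
    W_out = [w + eta * delta_out * h[i] for i, w in enumerate(W_out)]
    W_hid = [[w + eta * delta_hid[j] * x[i] for i, w in enumerate(row)]
             for j, row in enumerate(W_hid)]
    return W_hid, W_out, y

W_hid = [[0.1, -0.2], [0.3, 0.4]]   # weights into each hidden unit
W_out = [0.2, -0.1]                 # weights into the output unit
for _ in range(500):
    W_hid, W_out, y = backprop_step([1.0, 0.0], 1.0, W_hid, W_out)
print(y > 0.9)  # repeated updates drive the output toward the target
```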
Derivation
• Write the error for a single training example; as
before, use the sum of squared errors (convenient
for differentiation, etc.):
E = (1/2) Σ_{output unit i} (fi(e) − hi(e))²
• Differentiate (with respect to each weight…)
• For example, for the weight wji connecting node j to output i, we get:
∂E/∂wji = −o(j) o′(i) Err(i) = −o(j) δi
Properties
• Converges to a minimum, but could be a local
minimum
• Could be slow to converge
(Note: Training a three node net is NP-Complete!)
• Must watch for over-fitting just as in decision trees
(use validation sets, etc.)
• Network structure? Often two layers suffice; start
with relatively few hidden units
Properties (cont.)
• Many variations on basic backpropagation, e.g. momentum:
Δwij(n) = η o(i) δj + α Δwij(n−1), 0 ≤ α < 1
(Δwij(n) is the nth update amount; α is a constant)
• Reduce the learning rate η with time (applies to perceptrons
as well)
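The momentum variant can be sketched as follows: each update adds a fraction α of the previous update to the current gradient step (values are illustrative):

```python
def momentum_updates(grads, eta=0.1, alpha=0.9):
    """Given a sequence of gradient terms (o(i)*delta_j per step),
    return the successive weight updates with momentum."""
    updates, prev = [], 0.0
    for g in grads:
        dw = eta * g + alpha * prev   # current step + momentum term
        updates.append(dw)
        prev = dw
    return updates

# With a constant gradient, updates grow toward eta*g / (1 - alpha)
ups = momentum_updates([1.0, 1.0, 1.0, 1.0])
print([round(u, 3) for u in ups])  # [0.1, 0.19, 0.271, 0.344]
```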
NN properties
• Can handle domains with
– continuous and discrete attributes
– many attributes
– noisy data
• Could be slow at training but fast at evaluation
time
• Human understanding of what the network
does could be limited
Networks, features, and spaces
• Artificial neural networks can represent any
continuous function…
• Simple algorithms for learning from data
– fuzzy boundaries
– effects of typicality
• A way to explain how people could learn things
that look like rules and symbols…
• Big question: how much of cognition can be
explained by the input data?
Challenges for neural networks
• Being able to learn anything can make it
harder to learn specific things
– this is the “bias-variance tradeoff”