
Artificial Neural Networks
• Interconnected networks of simple units
("artificial neurons").
– Weight wij is the weight of the ith input into
unit j.
– Depending on the final output value we
determine the class.
– If there is more than one output unit, we choose the one with the greatest value.
• Learning takes place by adjusting the
weights in the network
– so that the desired output is produced
whenever a training instance is presented.
Single Perceptron Unit
• We start by looking at a
simpler kind of "neural-like"
unit called a perceptron.
Let x0 = 1 and w0 = b; then the hyperplane equation is

w · x = 0,   or   Σ_{j=0..n} wj xj = 0

h(x) = sign(w · x)
Depending on the value
of h(x) it outputs one
class or the other.
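To make the unit concrete, here is a minimal Python sketch of a single perceptron unit; the function name `perceptron_predict` and the example weights are illustrative choices, not from the slides, and the unit returns class 1 / class 0 rather than ±1.

```python
# Minimal perceptron unit: h(x) = sign(w . x), with x0 = 1 folded in as a bias input.
def perceptron_predict(w, x):
    """w and x are equal-length lists; x[0] is assumed to be 1 (the bias input)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z >= 0 else 0   # threshold at the hyperplane w . x = 0

# Example (illustrative weights): a unit that fires when x1 + x2 >= 1.5, i.e. logical AND.
w = [-1.5, 1.0, 1.0]            # w0 = b = -1.5
print(perceptron_predict(w, [1, 1, 1]))  # -> 1
print(perceptron_predict(w, [1, 1, 0]))  # -> 0
```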
Beyond Linear Separability
• Values of the XOR boolean
function cannot be separated by
a single perceptron unit.
Multi-Layer Perceptron
• Solution: Combine multiple
linear separators.
• Introducing "hidden" units into NNs makes them much more powerful:
– they are no longer limited to linearly separable problems.
• Earlier layers transform the problem into more tractable problems for the later layers.
Example: XOR problem
Output: “class 0” or “class 1”.
Decision boundary of the output unit: w23·o2 + w13·o1 + w03 = 0.
With w03 = -1/2, w13 = -1, w23 = 1, this becomes o2 - o1 - 1/2 = 0.
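A small sketch of how such a two-layer network computes XOR. The output-unit weights are the ones from the slide (w03 = -1/2, w13 = -1, w23 = 1); the hidden-unit weights are one possible choice (an AND unit and an OR unit) and are an assumption, not given above.

```python
def step(z):
    # Perceptron threshold unit: outputs 1 if z >= 0, else 0.
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    # Hidden layer (assumed weights): o1 = AND(x1, x2), o2 = OR(x1, x2).
    o1 = step(1.0 * x1 + 1.0 * x2 - 1.5)   # AND
    o2 = step(1.0 * x1 + 1.0 * x2 - 0.5)   # OR
    # Output unit with the slide's weights: o2 - o1 - 1/2 >= 0  <=>  class 1.
    return step(1.0 * o2 - 1.0 * o1 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))   # prints the XOR truth table
```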
Multi-Layer Perceptron
• Any set of training points can be separated by a three-layer
perceptron network.
• “Almost any” set of points is separable by a two-layer
perceptron network.
Backpropagation technique
High level summary:
1. Present a training sample to the neural network.
2. Calculate the error in each output neuron. This is the local error.
3. Adjust the weights of each neuron to lower the local error.
4. Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
5. Repeat from step 3 on the neurons at the previous level, using each one's "blame" as its error.
Autonomous Land Vehicle In a
Neural Network (ALVINN)
• ALVINN is an automatic steering system for a car based on
input from a camera mounted on the vehicle.
– Successfully demonstrated in a cross-country trip.
ALVINN
• The ALVINN neural
network is shown here. It
has
– 960 inputs (a 30x32
array derived from the
pixels of an image),
– 4 hidden units and
– 30 output units (each
representing a steering
command).
SVMs vs. ANNs
• Comparable in practice.
• Some comment:
"SVMs have been developed in the reverse order to the
development of neural networks (NNs). SVMs evolved from the
sound theory to the implementation and experiments, while the
NNs followed more heuristic path, from applications and
extensive experimentation to the theory." (Wang 2005)
Soft Threshold
• A natural question to ask is whether we could use gradient
ascent/descent to train a multi-layer perceptron.
• The answer is that we can't as long as the output is discontinuous
with respect to changes in the inputs and the weights.
– In a perceptron unit it doesn't matter how far a point is from the
decision boundary, we will still get a 0 or a 1.
• We need a smooth output (as a function of changes in the
network weights) if we're to do gradient descent.
Sigmoid Unit
• Commonly used in neural nets is a "sigmoid" (S-like) function (see the figure on the right).
– The one used here is called the
logistic function.
• Value z is also called the
"activation" of a neuron.
Training
• Key property of the sigmoid is that it is differentiable.
– This means that we can use gradient based methods of
minimization for training.
• The output of a multi-layer net of sigmoid units is a function of
two vectors, the inputs (x) and the weights (w).
– Well, as we train the ANN the training instances are considered
fixed.
• The output of this function (y) varies smoothly with changes in
the weights.
The ½ in the error function below is only there to simplify the derivations.
Gradient Descent
We follow gradient descent. Online version: we consider each time only the error for one data item:

E = ½ (y(x^m, w) - y^m)^2

∇w E = (y(x^m, w) - y^m) ∇w y(x^m, w)

where ∇w y(x^m, w) = ( ∂y/∂w1, ..., ∂y/∂wn )

We change w as follows:

w ← w - η ∇w E

The gradient of the training error is computed as a function of the weights. As a shorthand, we will denote y(x^m, w) just by y.
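A minimal sketch of this online (per-example) gradient-descent loop in Python. The function names `predict` and `gradient_wrt_weights` are placeholders for the network's forward pass and for ∇w y; `eta` is the learning rate η.

```python
def online_gradient_descent(w, data, predict, gradient_wrt_weights, eta=0.1, epochs=10):
    """w: list of weights; data: list of (x_m, y_m) pairs.
    Updates w in place using  w <- w - eta * (y - y_m) * grad_w y(x_m, w)."""
    for _ in range(epochs):
        for x_m, y_m in data:
            y = predict(w, x_m)                       # network output y(x_m, w)
            grad_y = gradient_wrt_weights(w, x_m)     # vector of dy/dw_i
            for i in range(len(w)):
                w[i] -= eta * (y - y_m) * grad_y[i]   # gradient of E = 1/2 (y - y_m)^2
    return w
```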
Gradient Descent – Single Unit

∇w y(x^m, w) = ( ∂y/∂w1, ..., ∂y/∂wn )

z = Σ_{i=0..n} wi xi^m

∂y/∂wi = (∂y/∂z)(∂z/∂wi) = (∂s(z)/∂z)(∂z/∂wi) = (∂s(z)/∂z) xi^m

y = s(z) = 1 / (1 + e^(-z))

Substituting in the equation of the previous slide we get (for the arbitrary i-th element of w):

∂E/∂wi = (y - y^m) (∂s(z)/∂z) xi^m

wi ← wi - η (y - y^m) (∂s(z)/∂z) xi^m        (the "delta rule")
Derivative of the sigmoid

s(z) = 1 / (1 + e^(-z))

ds(z)/dz = d/dz (1 + e^(-z))^(-1)
         = [(1 + e^(-z))^(-2)] [e^(-z)]
         = [1 / (1 + e^(-z))] [e^(-z) / (1 + e^(-z))]
         = s(z) (1 - s(z))
         = y (1 - y)

So the delta-rule update becomes:

wi ← wi - η (y - y^m) (∂s(z)/∂z) xi^m
   = wi - η (y - y^m) y (1 - y) xi^m
   = wi - η y (1 - y) (y - y^m) xi^m
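A small sketch of a single sigmoid unit trained with the delta rule derived above; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sigmoid_unit(w, x, y_m, eta=0.1):
    """One delta-rule step for a single sigmoid unit.
    w, x: equal-length lists (x[0] = 1 is the bias input); y_m: target output."""
    z = sum(wi * xi for wi, xi in zip(w, x))   # z = sum_i w_i x_i
    y = sigmoid(z)                             # y = s(z)
    for i in range(len(w)):
        # w_i <- w_i - eta * (y - y_m) * y * (1 - y) * x_i
        w[i] -= eta * (y - y_m) * y * (1.0 - y) * x[i]
    return w, y
```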
Generalized Delta Rule
For an output unit p we similarly have:

w_ip ← w_ip - η δp yi^m = w_ip - η yp (1 - yp) (yp - y^m) yi^m

(p = 3 in this example.)
[Network diagram: inputs x1, x2 and a bias input of 1, hidden units 1 and 2 with outputs y1, y2, output unit 3 with output y3; weights w_ij from input/unit i into unit j.]
Backpropagation Example
[Same network: inputs x1, x2 plus bias 1, hidden units 1 and 2, output unit 3.]

First do forward propagation: compute the zi's and yi's.

δ3 = y3 (1 - y3) (y3 - y^m)
δ2 = y2 (1 - y2) δ3 w23
δ1 = y1 (1 - y1) δ3 w13

(We'll see soon why δ2 and δ3 have these formulas.)

Then update the weights:

w03 ← w03 - η δ3 (1)     w13 ← w13 - η δ3 y1     w23 ← w23 - η δ3 y2
w02 ← w02 - η δ2 (1)     w12 ← w12 - η δ2 x1     w22 ← w22 - η δ2 x2
w01 ← w01 - η δ1 (1)     w11 ← w11 - η δ1 x1     w21 ← w21 - η δ1 x2
Deriving δ2 and δ1

δ2 = ∂E/∂z2 = (∂E/∂z3)(∂z3/∂z2) = δ3 ∂z3/∂z2

δ3 ∂z3/∂z2 = δ3 ∂(w13 y1 + w23 y2)/∂z2
           = δ3 w23 ∂y2/∂z2
           = δ3 w23 ∂s(z2)/∂z2
           = δ3 w23 s(z2) (1 - s(z2))
           = δ3 w23 y2 (1 - y2)

We similarly derive δ1.
Backpropagation Algorithm
1. Initialize weights to small random values.
2. Choose a random sample training item, say (x^m, y^m).
3. Compute total input zj and output yj for each unit (forward prop).
4. Compute δp for the output layer: δp = yp (1 - yp) (yp - y^m).
5. Compute δj for all preceding layers by the backprop rule.
6. Compute the weight change by the descent rule (repeat for all weights).

• Note that each expression involves data local to a particular unit; we don't have to look around summing things over the whole network.
• It is for this reason (simplicity, locality and, therefore, efficiency) that backpropagation has become the dominant paradigm for training neural nets.
Generalized Delta Rule
In general, for a hidden unit j we have

δj = yj (1 - yj) Σ_{k ∈ downstream(j)} δk wjk

wij ← wij - η δj yi
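Putting the forward pass, the delta formulas and the descent rule together, here is a minimal Python sketch of one backpropagation step for a small network with one hidden layer of sigmoid units and a single sigmoid output. The names (`forward`, `backprop_step`) and the list-of-lists weight layout are illustrative choices, not from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W_hidden, w_out, x):
    """x includes the bias input x[0] = 1. Returns hidden outputs (with a leading
    bias 1.0) and the network output y."""
    h = [1.0]
    for w in W_hidden:                                   # one weight vector per hidden unit
        h.append(sigmoid(sum(wi * xi for wi, xi in zip(w, x))))
    y = sigmoid(sum(wi * hi for wi, hi in zip(w_out, h)))
    return h, y

def backprop_step(W_hidden, w_out, x, y_m, eta=0.1):
    """One stochastic backpropagation step for a single training pair (x, y_m)."""
    h, y = forward(W_hidden, w_out, x)
    delta_out = y * (1 - y) * (y - y_m)                  # delta_p = y_p(1 - y_p)(y_p - y^m)
    # Hidden deltas: delta_j = y_j(1 - y_j) * sum over downstream units of delta_k * w_jk
    # (here the only downstream unit is the single output).
    deltas_hidden = [h[j + 1] * (1 - h[j + 1]) * delta_out * w_out[j + 1]
                     for j in range(len(W_hidden))]
    # Descent rule: w_ij <- w_ij - eta * delta_j * (input i into unit j)
    for i in range(len(w_out)):
        w_out[i] -= eta * delta_out * h[i]
    for j, w in enumerate(W_hidden):
        for i in range(len(w)):
            w[i] -= eta * deltas_hidden[j] * x[i]
    return y

# Example: a 2-input, 2-hidden-unit, 1-output network (illustrative weights).
W_hidden = [[0.1, 0.2, -0.1], [0.0, -0.2, 0.3]]   # rows: unit 1, unit 2; cols: bias, x1, x2
w_out = [0.1, 0.4, -0.3]                          # bias, from y1, from y2
backprop_step(W_hidden, w_out, [1.0, 0.0, 1.0], y_m=1.0)
```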
Input And Output Encoding
• For neural networks, all attribute values must be encoded in a standardized
manner, taking values between 0 and 1, even for categorical variables.
• For continuous variables, we simply apply the min-max normalization:
X* = [X - min(X)]/[max(X)-min(X)]
• For categorical variables use indicator (flag) variables.
– E.g. marital status attribute, containing values single, married, divorced.
– Records for single would have
1 for single, and 0 for the rest, i.e. (1,0,0)
– Records for married would have
1 for married, and 0 for the rest, i.e. (0,1,0)
– Records for divorced would have
1 for divorced, and 0 for the rest, i.e. (0,0,1)
– Records for unknown would have
0 for all, i.e. (0,0,0)
• In general, categorical attributes with k values can be translated into k - 1
indicator attributes.
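A small sketch of both encodings; the function names `min_max_normalize` and `indicator_encode` are illustrative.

```python
def min_max_normalize(x, x_min, x_max):
    # X* = (X - min(X)) / (max(X) - min(X)), giving a value in [0, 1].
    return (x - x_min) / (x_max - x_min)

def indicator_encode(value, categories):
    """One indicator (flag) attribute per known category; all zeros for an unknown value.
    E.g. categories = ["single", "married", "divorced"]:
      "married" -> [0, 1, 0],  "unknown" -> [0, 0, 0]."""
    return [1 if value == c else 0 for c in categories]

print(min_max_normalize(25, 0, 100))                                    # -> 0.25
print(indicator_encode("divorced", ["single", "married", "divorced"]))  # -> [0, 0, 1]
```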
Output
• Neural network output nodes always return a continuous value between 0 and
1 as output.
• Many classification problems have a dichotomous result, with only two
possible outcomes.
– E.g., “Meningitis: yes or no”.
• For such problems, one option is to use a single output node, with a threshold value set a priori which would separate the classes.
– For example, with the threshold “Yes if output ≥ 0.3,” an output of 0.4 from the output node would classify that record as likely to be “Yes”.
• Single output nodes may also be used when the classes are clearly ordered. E.g., suppose that we would like to classify patients’ disease levels. We can say:
– If 0 ≤ output < 0.33, classify “mild”
– If 0.33 ≤ output < 0.66, classify “severe”
– If 0.66 ≤ output < 1, classify “grave”
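A quick sketch of both options (a single a-priori threshold, and ordered bins); the cutoffs are the ones from the bullets above.

```python
def classify_binary(output, threshold=0.3):
    # "Yes" if the single output node's value reaches the a-priori threshold.
    return "Yes" if output >= threshold else "No"

def classify_ordered(output):
    # Ordered classes mapped onto sub-intervals of [0, 1).
    if output < 0.33:
        return "mild"
    elif output < 0.66:
        return "severe"
    else:
        return "grave"

print(classify_binary(0.4))    # -> "Yes"
print(classify_ordered(0.45))  # -> "severe"
```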
Multiple Output Nodes
• If we have unordered categories for the target attribute, we
create one output node for each possible category.
– E.g. for marital status as target attribute, the network would have
four output nodes in the output layer, one for each of:
• single, married, divorced, and unknown.
• The output node with the highest value is then chosen as the classification for that particular record.
NN for Estimation And Prediction
• Since NNs produce continuous output, they can be used for estimation and prediction.
• Suppose we are interested in predicting the price of a stock three months in the future.
– Presumably, we would have encoded price information using the
min-max normalization.
– However, the neural network would output a value between zero
and 1.
• The min-max normalization needs to be inverted.
• This denormalization is:
prediction = output * (max – min) + min
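A one-liner sketch of this denormalization step, inverting the min-max normalization shown earlier; the price range used in the call is made up for illustration.

```python
def denormalize(output, price_min, price_max):
    # prediction = output * (max - min) + min
    return output * (price_max - price_min) + price_min

print(denormalize(0.65, 20.0, 80.0))  # -> 59.0
```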
ANN Example
[Network diagram: inputs x1, x2, x3 plus a bias input of 1, hidden units 1 and 2 with outputs y1, y2, output unit 3 with output y3; weights w_ij from input/unit i into unit j.]
Learning Weights
For an output unit p we similarly have:

w_ip ← w_ip - η δp yi^m = w_ip - η yp (1 - yp) (yp - y^m) yi^m

(p = 3 in this example; the network is the three-input one shown above.)
Backpropagation
First do forward propagation: compute the zi's and yi's.

δ3 = y3 (1 - y3) (y3 - y^m)
δ2 = y2 (1 - y2) δ3 w23
δ1 = y1 (1 - y1) δ3 w13

Then update the weights:

w03 ← w03 - η δ3 (1)     w13 ← w13 - η δ3 y1     w23 ← w23 - η δ3 y2
w02 ← w02 - η δ2 (1)     w12 ← w12 - η δ2 x1     w22 ← w22 - η δ2 x2     w32 ← w32 - η δ2 x3
w01 ← w01 - η δ1 (1)     w11 ← w11 - η δ1 x1     w21 ← w21 - η δ1 x2     w31 ← w31 - η δ1 x3
Backpropagation Example
First do forward propagation: compute the zi's and yi's.

Suppose we have initially chosen (randomly) the weights given in the table. Also given in the table is one training instance (first column).

Training instance   Weights into unit 1   Weights into unit 2   Weights into unit 3
x0 = 1.0            w01 = 0.5             w02 = 0.7             w03 = 0.5
x1 = 0.4            w11 = 0.6             w12 = 0.9             w13 = 0.9
x2 = 0.2            w21 = 0.8             w22 = 0.8             w23 = 0.9
x3 = 0.7            w31 = 0.6             w32 = 0.4
Feed-Forward Example
(Using the weights and training instance from the table above.)

z1 = 1.0*0.5 + 0.4*0.6 + 0.2*0.8 + 0.7*0.6 = 1.32
y1 = 1/(1+e^(-z1)) = 1/(1+e^(-1.32)) = 0.7892

z2 = 1.0*0.7 + 0.4*0.9 + 0.2*0.8 + 0.7*0.4 = 1.5
y2 = 1/(1+e^(-z2)) = 1/(1+e^(-1.5)) = 0.8175

z3 = 1.0*0.5 + 0.7892*0.9 + 0.8175*0.9 ≈ 1.95
y3 = 1/(1+e^(-z3)) = 1/(1+e^(-1.95)) ≈ 0.875
Backpropagation
• So, the network output, for the given training example, is y3 ≈ 0.875.
• Assume the actual value of the target attribute is y = 0.8.
• Then the prediction error equals 0.8 - 0.875 = -0.075.
Now
• δ3 = y3(1-y3)(y3-y) = 0.875*(1-0.875)*(0.875-0.8) ≈ 0.008
• Let’s use a learning rate of η = 0.01. Then, we update the weights:
w03 ← w03 - η δ3 (1) = 0.5 - 0.01*0.008*1 = 0.49992
w13 ← w13 - η δ3 y1 = 0.9 - 0.01*0.008*0.7892 ≈ 0.8999
w23 ← w23 - η δ3 y2 = 0.9 - 0.01*0.008*0.8175 ≈ 0.8999
Backpropagation
• δ2 = y2(1-y2) δ3 w23 = 0.8175*(1-0.8175)*0.008*0.9 ≈ 0.001
• δ1 = y1(1-y1) δ3 w13 = 0.7892*(1-0.7892)*0.008*0.9 ≈ 0.0012
• Then, we update the weights:
w02 ← w02 - η δ2 (1) = 0.7 - 0.01*0.001*1 ≈ 0.6999
w12 ← w12 - η δ2 x1 = 0.9 - 0.01*0.001*0.4 ≈ 0.8999
w22 ← w22 - η δ2 x2 = 0.8 - 0.01*0.001*0.2 ≈ 0.7999
w32 ← w32 - η δ2 x3 = 0.4 - 0.01*0.001*0.7 ≈ 0.3999
w01 ← w01 - η δ1 (1) = 0.5 - 0.01*0.001*1 ≈ 0.4999
w11 ← w11 - η δ1 x1 = 0.6 - 0.01*0.001*0.4 ≈ 0.5999
w21 ← w21 - η δ1 x2 = 0.8 - 0.01*0.001*0.2 ≈ 0.7999
w31 ← w31 - η δ1 x3 = 0.6 - 0.01*0.001*0.7 ≈ 0.5999
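To check the arithmetic, here is a short sketch that reproduces this worked example end to end (forward pass, deltas, weight updates); the list-based weight layout is just an illustrative choice.

```python
import math

def s(z):                      # logistic function
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 0.4, 0.2, 0.7]       # x0 (bias), x1, x2, x3
w1 = [0.5, 0.6, 0.8, 0.6]      # w01, w11, w21, w31
w2 = [0.7, 0.9, 0.8, 0.4]      # w02, w12, w22, w32
w3 = [0.5, 0.9, 0.9]           # w03, w13, w23 (bias, from y1, from y2)
y_target, eta = 0.8, 0.01

# Forward pass
z1 = sum(wi * xi for wi, xi in zip(w1, x)); y1 = s(z1)   # 1.32, ~0.7892
z2 = sum(wi * xi for wi, xi in zip(w2, x)); y2 = s(z2)   # 1.50, ~0.8175
z3 = w3[0] + w3[1] * y1 + w3[2] * y2;       y3 = s(z3)   # ~1.95, ~0.875

# Deltas
d3 = y3 * (1 - y3) * (y3 - y_target)        # ~0.008
d2 = y2 * (1 - y2) * d3 * w3[2]             # ~0.001
d1 = y1 * (1 - y1) * d3 * w3[1]             # ~0.0012

# Weight updates (descent rule)
w3 = [w3[0] - eta * d3 * 1.0, w3[1] - eta * d3 * y1, w3[2] - eta * d3 * y2]
w2 = [w2[i] - eta * d2 * x[i] for i in range(4)]
w1 = [w1[i] - eta * d1 * x[i] for i in range(4)]

print(round(y3, 3), [round(w, 5) for w in w3])   # e.g. 0.875 and w03 ~ 0.49992
```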