Transcript Slide 1

Linear Regression
Each dot in the figure provides information about the weight (x-axis,
units: U.S. pounds) and fuel consumption (y-axis, units: miles per
gallon) for one of 74 cars (data from 1979). Clearly weight and fuel
consumption are linked, so that, in general, heavier cars use more
fuel.
Now suppose we are given the weight of a 75th car and are asked to predict
how much fuel it will use. A simple approach is a linear model:
y = w1 x + w0    (1)
This is a linear model: in an xy-plot, equation 1 describes a straight line
with slope w1 and intercept w0 on the y-axis.
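To make this concrete, a linear prediction in Python might look as follows; the slope and intercept values here are made-up placeholders, not the actual fit to the 74-car data set:

# Linear model of equation 1: y = w1 * x + w0
w1 = -0.005   # hypothetical slope: mpg lost per extra pound
w0 = 40.0     # hypothetical intercept: mpg at zero weight

def predict(x):
    """Predicted fuel consumption (mpg) for a car weighing x pounds."""
    return w1 * x + w0

print(predict(3000.0))   # prediction for a hypothetical 3000 lb car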
The Loss Function
In order to make precise what we mean by being a "good predictor", we
define a loss (also called objective or error) function E over the model
parameters. A popular choice for E is the sum-squared error:
E = ½ Σi (ti − yi)²    (2)
In words, it is the sum over all points i in our data set of the squared
difference between the target value ti (here: actual fuel consumption) and
the model's prediction yi, calculated from the input value xi (here: weight of
the car) by equation 1; the conventional factor ½ merely cancels the 2 that
appears when differentiating the square. For a linear model, the sum-squared error is a
quadratic function of the model parameters. Figure 3 shows E for a range
of values of w0 and w1. Figure 4 shows the same function as a contour
plot.
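As a minimal sketch in Python (names are mine; the data are assumed to be given as parallel lists of inputs xs and targets ts):

def sum_squared_error(xs, ts, w1, w0):
    """E of equation 2: half the sum of squared differences between
    the targets t_i and the predictions y_i = w1*x_i + w0."""
    return 0.5 * sum((t - (w1 * x + w0)) ** 2 for x, t in zip(xs, ts))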
Minimizing the Loss
The loss function E provides us with an objective measure of predictive error
for a specific choice of model parameters. We can thus restate our goal of
finding the best (linear) model as finding the values for the model parameters
that minimize E. For linear models, linear regression provides a direct way
to compute these optimal model parameters. (See any statistics textbook for
details.) However, this analytical approach does not generalize to nonlinear
models (which we will get to by the end of this lecture). Even though the
solution cannot be calculated explicitly in that case, the problem can still be
solved by an iterative numerical technique called gradient descent, which is
described later in this lecture.
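For the curious, the direct computation mentioned above is a standard least-squares solve; here is a sketch using numpy, an illustration of the textbook method rather than code from the lecture:

import numpy as np

def fit_linear(xs, ts):
    """Return (w1, w0) minimizing E in closed form via least squares."""
    A = np.column_stack([xs, np.ones(len(xs))])   # design matrix [x, 1]
    (w1, w0), *_ = np.linalg.lstsq(A, ts, rcond=None)
    return w1, w0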
It's a neural network
Our linear model of equation 1 can in fact be implemented by the simple
neural network shown in the next figure. It consists of a bias unit, an input unit,
and a linear output unit. The input unit makes external input x (here: the
weight of a car) available to the network, while the bias unit always has a
constant output of 1. The output unit computes the sum:
y2 = y1 w21 + 1.0 w20    (3)
It is easy to see that this is equivalent to equation 1, with w21 implementing
the slope of the straight line, and w20 its intercept with the y-axis.
Linear Neural Networks
Our car example showed how we could discover an optimal linear
function for predicting one variable (fuel consumption) from one other
(weight). Suppose now that we are also given one or more additional
variables which could be useful as predictors. Our simple neural
network model can easily be extended to this case by adding more
input units (Fig. 1).
Similarly, we may want to predict more than one variable from the
data that we're given. This can easily be accommodated by adding
more output units (Fig. 2). The loss function for a network with
multiple outputs is obtained simply by adding the loss for each output
unit together. The network now has a typical layered structure: a layer
of input units (and the bias), connected by a layer of weights to a layer
of output units.
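In matrix form (a sketch with made-up sizes and values; the notation is mine, not from the slides), such a layered linear network is a single matrix-vector product:

import numpy as np

# One row of weights per output unit; the last column is the bias weight.
W = np.array([[ 0.2, -0.5,  1.0],
              [ 0.7,  0.1, -0.3]])   # illustrative values only

def forward(x):
    """Append the constant bias input 1 and compute all outputs at once."""
    return W @ np.append(x, 1.0)

print(forward([2.0, 3.0]))   # two predictions from two inputs

The loss for this network is then just equation 2 applied to each output unit and added together.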
Linear Separability
[Figure: two plots of +/− examples in the (x1, x2) plane: in (a) a straight line can separate the + from the − points; in (b) no straight line can.]
Some functions are not representable, e.g., (b) is not linearly separable.
So what can be represented
using perceptrons?
[Figures: decision surfaces for AND and OR, each separable by a single straight line]
Representation theorem: 1-layer feedforward networks can
only represent linearly separable functions. That is,
the decision surface separating positive from negative
examples has to be a plane.
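For example, AND is linearly separable: taking the output to be 1 when w1 x1 + w2 x2 + w0 > 0, the (illustrative) choice w1 = w2 = 1, w0 = −1.5 outputs 1 only for x1 = x2 = 1, since 1 + 1 − 1.5 > 0 while 1 − 1.5 ≤ 0 and −1.5 ≤ 0. The next slides show that no such choice exists for XOR.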
Learning Boolean AND / XOR
• No w0, w1, w2 satisfy (for XOR, with output 1 iff w1 x1 + w2 x2 + w0 > 0):
  w0 ≤ 0
  w2 + w0 > 0
  w1 + w0 > 0
  w1 + w2 + w0 ≤ 0
(Minsky and Papert, 1969)
Expressive limits of
perceptrons
• Can the XOR function be represented by a
perceptron (a network without a hidden layer)?
No: as the inequalities above show, XOR cannot be represented.
How can perceptrons be
designed?
• The Perceptron Learning Theorem (Rosenblatt,
1960): Given enough training examples, there is
an algorithm that will learn any linearly separable
function.
Theorem 1 (Minsky and Papert, 1969) The perceptron
rule converges to weights that correctly classify all
training examples provided the given data set represents
a function that is linearly separable.
The perceptron learning
algorithm
• Inputs: training set {(x1,x2,…,xn,t)}
• Method
– Randomly initialize weights w(i), with -0.5 <= w(i) <= 0.5
– Repeat for several epochs until convergence:
• for each example
– Calculate network output o.
– Adjust weights by the perceptron training rule:
Δwi = η (t − o) xi
wi ← wi + Δwi
where η is the learning rate and (t − o) is the error.
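A minimal Python sketch of the algorithm (the data format, learning rate, and epoch count are illustrative assumptions):

import random

def train_perceptron(data, eta=0.1, epochs=100):
    """data: list of (inputs, target) pairs with target 0 or 1.
    Returns weights [w0, w1, ..., wn], where w0 is the bias weight."""
    n = len(data[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in data:
            xs = [1.0] + list(x)                         # bias input first
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else 0
            for i in range(len(w)):                      # w_i += eta*(t-o)*x_i
                w[i] += eta * (t - o) * xs[i]
    return w

# Example: learning AND (linearly separable, so this converges)
print(train_perceptron([((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]))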
Multi-layer feed-forward
networks
Multi-layer, feed-forward networks extend perceptrons (i.e., 1-layer
networks) into n-layer networks by:
• Partitioning the units into layers 0 to L, such that:
  • the lowermost layer, layer 0, contains the input units;
  • the topmost layer, numbered L, contains the output units;
  • the layers numbered 1 to L−1 are the hidden layers.
• Connectivity is bottom-up only, with no cycles, hence the name
"feed-forward" nets.
• Input layers transmit input values to hidden layer nodes and hence do not
perform any computation.
Note: a layer's number indicates the distance of its nodes from the input
nodes; a sketch of the resulting forward pass follows.
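Put as code, the forward pass through such a layered network might look like this (numpy; the layer contents and the tanh activation, introduced later in this lecture, are illustrative assumptions):

import numpy as np

def forward(x, weight_matrices):
    """weight_matrices[l] maps layer l to layer l+1; a bias input of 1 is
    appended at each layer. Connections are bottom-up only, so a single
    pass over the layers suffices."""
    y = np.asarray(x, dtype=float)
    for W in weight_matrices:
        y = np.tanh(W @ np.append(y, 1.0))
    return y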
Minimizing the Loss
By repeating this weight update over and over, we move "downhill" in E until
we reach a minimum, where the gradient G = 0, so that no further progress is possible.
Computing the gradient
For our linear model, the gradient G of E (equation 2) with respect to the
weights follows from the chain rule:
∂E/∂w1 = −Σi (ti − yi) xi,  ∂E/∂w0 = −Σi (ti − yi)
The Gradient Descent
Algorithm
1. Initialize all weights to small random values
2. REPEAT until done
   1. For each weight wij, set Δwij := 0
   2. For each data point (x, t)p:
      1. set the input units to x
      2. compute the value of the output units
      3. for each weight wij, set Δwij := Δwij + (ti − yi) yj
   3. For each weight wij, set wij := wij + µ Δwij
The algorithm terminates once we are at, or sufficiently near
to, the minimum of the error function, where G = 0. We say
then that the algorithm has converged.
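In Python, the batch algorithm for the one-input linear model might be sketched as follows (µ and the fixed step count are illustrative; for data on the scale of pounds and mpg, µ must be very small):

def gradient_descent(xs, ts, mu=1e-8, steps=1000):
    """Batch gradient descent for y = w1*x + w0 under the loss of equation 2."""
    w1, w0 = 0.0, 0.0
    for _ in range(steps):
        d1 = d0 = 0.0
        for x, t in zip(xs, ts):          # accumulate over all data points
            err = t - (w1 * x + w0)       # (t_i - y_i)
            d1 += err * x
            d0 += err
        w1 += mu * d1                     # move "downhill" in E
        w0 += mu * d0
    return w1, w0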
In summary

                          general case             linear network
Training data             (x, t)                   (x, t)
Model parameters          w                        w
Model                     y = g(w, x)              yi = Σj wij yj
Error function            E(y, t)                  E = ½ Σi (ti − yi)²
Gradient w.r.t. wij       ∂E/∂wij                  −(ti − yi) yj
Weight update rule        Δwij = −µ ∂E/∂wij        Δwij = µ (ti − yi) yj
The Learning Rate
An important consideration is the learning rate µ, which determines by
how much we change the weights w at each step. If µ is too small, the
algorithm will take a long time to converge; if µ is too large, we may end
up bouncing around the error surface out of control: the algorithm diverges.
Batch vs. Online Learning
Above we have accumulated the gradient
contributions for all data points in the training set
before updating the weights. This method is often
referred to as batch learning. An alternative approach
is online learning, where the weights are updated
immediately after seeing each data point. Since the
gradient for a single data point can be considered a
noisy approximation to the overall gradient G (next
Fig.), this is also called stochastic (noisy) gradient
descent.
Batch vs. Online Learning
Online learning has a number of advantages:
•it is often much faster, especially when the training set is
redundant (contains many similar data points),
•it can be used when there is no fixed training set (new data
keeps coming in),
•it is better at tracking nonstationary environments (where the
best model gradually changes over time),
•the noise in the gradient can help to escape from local minima
(which are a problem for gradient descent in nonlinear models).
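The difference between the two schemes, sketched for the linear model of equation 1 (function and variable names are mine):

def batch_epoch(data, w1, w0, eta):
    """Accumulate the gradient over ALL points, then update once."""
    d1 = sum((t - (w1 * x + w0)) * x for x, t in data)
    d0 = sum((t - (w1 * x + w0)) for x, t in data)
    return w1 + eta * d1, w0 + eta * d0

def online_epoch(data, w1, w0, eta):
    """Update immediately after each point (stochastic gradient descent)."""
    for x, t in data:
        err = t - (w1 * x + w0)
        w1, w0 = w1 + eta * err * x, w0 + eta * err
    return w1, w0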
Multi-layer networks
A nonlinear problem
Consider again the best linear fit we found for the car data. Notice
that the data points are not evenly distributed around the line: for low
weights, we see more miles per gallon than our model predicts. In
fact, it looks as if a simple curve might fit these data better than the
straight line. We can enable our neural network to do such curve
fitting by giving it an additional node which has a suitably curved
(nonlinear) activation function. A useful function for this purpose is
the S-shaped hyperbolic tangent (tanh) function
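For reference, tanh(u) = (e^u − e^−u) / (e^u + e^−u). It is smooth, saturates at −1 and +1, and has the convenient derivative tanh(u)' = 1 − tanh(u)², a fact the backpropagation algorithm below exploits.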
Multi-layer networks
The Algorithm
We want to train a multi-layer feedforward network by gradient descent to
approximate an unknown function, based on some training data consisting
of pairs (x,t). The vector x represents a pattern of input to the network, and
the vector t the corresponding target (desired output). As we have seen
before, the overall gradient with respect to the entire training set is just the
sum of the gradients for each pattern; in what follows we will therefore
describe how to compute the gradient for just a single training pattern. As
before, we will number the units, and denote the weight from unit j to unit i
by wij.
1. Definitions:
   • the error signal for unit j:  δj = −∂E/∂netj
   • the (negative) gradient for weight wij:  Δwij = −∂E/∂wij
   • the set of nodes anterior to unit i:  Ai = {j : ∃ wij}
   • the set of nodes posterior to unit j:  Pj = {i : ∃ wij}
2. The gradient. As we did for linear networks before, we expand the gradient
into two factors by use of the chain rule:
   Δwij = −∂E/∂wij = (−∂E/∂neti) · ∂neti/∂wij
The first factor is the error of unit i, δi. The second is
   ∂neti/∂wij = ∂(Σk∈Ai wik yk)/∂wij = yj
Putting the two together, we get
   Δwij = δi yj
To compute this gradient, we thus need to know the activity and the error
for all relevant nodes in the network.
3. Forward activation. The activity of the input units is determined by the
network's external input x. For all other units, the activity is propagated
forward:
   yi = fi(neti) = fi(Σj∈Ai wij yj)
Note that before the activity of unit i can be calculated, the activity of
all its anterior nodes (forming the set Ai) must be known. Since
feedforward networks do not contain cycles, there is an ordering of
nodes from input to output that respects this condition.
4. Calculating output error. Assuming that we are using the sum-squared loss
   E = ½ Σo (to − yo)²
the error for output unit o is simply
   δo = to − yo
5. Error backpropagation. For hidden units, we must propagate the error
back from the output nodes (hence the name of the algorithm). Again using
the chain rule, we can expand the error of a hidden unit in terms of its
posterior nodes:
   δj = Σi∈Pj δi · ∂neti/∂yj · ∂yj/∂netj
Of the three factors inside the sum, the first is just the error of node i. The second is
   ∂neti/∂yj = wij
while the third is the derivative of node j's activation function:
   ∂yj/∂netj = fj'(netj)
For hidden units h that use the tanh activation function, we can make
use of the special identity tanh(u)' = 1 − tanh(u)², giving us
   fh'(neth) = 1 − yh²
Putting all the pieces together we get
   δj = fj'(netj) Σi∈Pj δi wij
Note that in order to calculate the error for unit j, we must first
know the error of all its posterior nodes (forming the set Pj).
Again, as long as there are no cycles in the network, there is an
ordering of nodes from the output back to the input that respects
this condition. For example, we can simply use the reverse of the
order in which activity was propagated forward.
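Putting the whole procedure together, a compact numpy sketch for a network with one tanh hidden layer and linear outputs (the layer sizes, initialization range, and learning rate are assumptions made for illustration):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 1, 3, 1                       # made-up layer sizes
W1 = rng.uniform(-0.5, 0.5, (n_hid, n_in + 1))     # input (+bias) -> hidden
W2 = rng.uniform(-0.5, 0.5, (n_out, n_hid + 1))    # hidden (+bias) -> output

def backprop_step(x, t, mu=0.01):
    """One gradient step for a single training pattern (x, t)."""
    global W1, W2
    # 3. Forward activation (bias input 1 appended at each layer).
    y0 = np.append(x, 1.0)
    y1 = np.append(np.tanh(W1 @ y0), 1.0)
    y2 = W2 @ y1                                   # linear output units
    # 4. Output error under the sum-squared loss: delta_o = t_o - y_o.
    d2 = t - y2
    # 5. Backpropagation: delta_j = (1 - y_j^2) * sum_i delta_i * w_ij.
    d1 = (1.0 - y1[:-1] ** 2) * (W2[:, :-1].T @ d2)
    # Weight update: w_ij += mu * delta_i * y_j.
    W2 += mu * np.outer(d2, y1)
    W1 += mu * np.outer(d1, y0)

backprop_step(np.array([0.5]), np.array([0.25]))   # one illustrative step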