Transcript exin4

unit #4
Giansalvo EXIN Cirrincione
Single-layer networks
They directly compute linear discriminant functions using the training set (TS), without the need to determine probability densities.
Linear discriminant functions
Two classes
$y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$
x is assigned to class $C_1$ if $y(\mathbf{x}) \ge 0$ and to class $C_2$ if $y(\mathbf{x}) < 0$.
The decision boundary $y(\mathbf{x}) = 0$ is a (d-1)-dimensional hyperplane.
Linear discriminant functions
Several classes
$y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0} = \sum_{i=1}^{d} w_{ki}\,x_i + w_{k0}$
x is assigned to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \ne k$.
The decision boundary between classes $C_k$ and $C_j$ is given by $(\mathbf{w}_k - \mathbf{w}_j)^T\mathbf{x} + (w_{k0} - w_{j0}) = 0$.
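As an illustration, here is a minimal NumPy sketch (with made-up weights and inputs) of evaluating the discriminants $y_k(\mathbf{x})$ and assigning $\mathbf{x}$ to the class with the largest value:

```python
import numpy as np

# Hypothetical weights for c = 3 classes in d = 2 dimensions:
# row k holds w_k, and w0[k] is the corresponding bias w_k0.
W = np.array([[ 1.0, -0.5],
              [-0.2,  0.8],
              [ 0.3,  0.3]])
w0 = np.array([0.1, -0.4, 0.0])

def classify(x):
    """Assign x to the class C_k with the largest discriminant y_k(x) = w_k^T x + w_k0."""
    y = W @ x + w0            # vector of discriminant values y_k(x)
    return int(np.argmax(y))  # index k of the winning class

x = np.array([0.7, 0.2])
print("x assigned to class C%d" % (classify(x) + 1))
```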
Linear discriminant functions
Several classes
The decision regions are always simply connected and convex.
To see this, let $\hat{\mathbf{x}} = \alpha\,\mathbf{x}^A + (1-\alpha)\,\mathbf{x}^B$ where $0 \le \alpha \le 1$, with
$y_k(\mathbf{x}^A) > y_j(\mathbf{x}^A)$ and $y_k(\mathbf{x}^B) > y_j(\mathbf{x}^B)$ for all $j \ne k$.
By linearity, $y_k(\hat{\mathbf{x}}) = \alpha\,y_k(\mathbf{x}^A) + (1-\alpha)\,y_k(\mathbf{x}^B)$, and hence $y_k(\hat{\mathbf{x}}) > y_j(\hat{\mathbf{x}})$ for all $j \ne k$.
Logistic discrimination
monotonic activation function
• two classes
• Gaussians with $\Sigma_1 = \Sigma_2 = \Sigma$
With a monotonic activation function applied to the linear sum, the decision boundary is still linear.
Logistic discrimination
logistic sigmoid: $g(a) = \dfrac{1}{1 + \exp(-a)}$
Logistic discrimination
The use of the logistic sigmoid activation
function allows the outputs of the discriminant
to be interpreted as posterior probabilities.
binary input vectors
Let $P_{ki}$ denote the probability that the input $x_i$ takes the value +1 when the input vector is drawn from class $C_k$. The corresponding probability that $x_i = 0$ is then given by $1 - P_{ki}$.
pxi Ck   P 1  Pki 
xi
ki
1 xi
(Bernoullidistribution)
Assuming the input variables are statistically independent, the
probability for the complete input vector is given by:
px Ck    Pkixi 1  Pki 
d
i 1
1 xi
binary input vectors
yk x  ln px Ck  ln PCk 
d
yk x    wki xi  wk 0
i 1
where
w ki  ln Pki  ln1  Pki 
d
wk 0   ln1  Pki   ln PCk 
i 1
Linear discriminant functions arise when we consider
input patterns in which the variables are binary.
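A short sketch of how these weights could be computed from (hypothetical) Bernoulli parameters $P_{ki}$ and priors $P(C_k)$ and used to classify a binary vector:

```python
import numpy as np

# Hypothetical Bernoulli parameters: P[k, i] = P_ki = P(x_i = 1 | C_k)
P = np.array([[0.9, 0.2, 0.6],
              [0.3, 0.7, 0.5]])
prior = np.array([0.5, 0.5])                       # P(C_k), assumed equal here

W  = np.log(P) - np.log(1.0 - P)                   # w_ki = ln P_ki - ln(1 - P_ki)
w0 = np.log(1.0 - P).sum(axis=1) + np.log(prior)   # w_k0

x = np.array([1, 0, 1])                            # a binary input vector
y = W @ x + w0                                     # y_k(x) = ln p(x|C_k) + ln P(C_k)
print("assign x to class C%d" % (np.argmax(y) + 1))
```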
binary input vectors
Consider a set of independent binary variables having Bernoulli
class-conditional densities. For the two-class problem:
$P(C_1 \mid \mathbf{x}) = g(\mathbf{w}^T\mathbf{x} + w_0)$
where $g(a)$ is the logistic sigmoid and
$w_i = \ln\dfrac{P_{1i}}{P_{2i}} - \ln\dfrac{1 - P_{1i}}{1 - P_{2i}}$
$w_0 = \sum_i \ln\dfrac{1 - P_{1i}}{1 - P_{2i}} + \ln\dfrac{P(C_1)}{P(C_2)}$
For both normally distributed and Bernoulli-distributed class-conditional densities, the posterior probabilities are obtained by a logistic single-layer network.
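For the two-class Bernoulli case above, a minimal sketch (again with hypothetical parameters) of evaluating the posterior $P(C_1 \mid \mathbf{x}) = g(\mathbf{w}^T\mathbf{x} + w_0)$:

```python
import numpy as np

P1 = np.array([0.9, 0.2, 0.6])   # P_1i = P(x_i = 1 | C_1)   (hypothetical)
P2 = np.array([0.3, 0.7, 0.5])   # P_2i = P(x_i = 1 | C_2)   (hypothetical)
prior1, prior2 = 0.5, 0.5        # class priors P(C_1), P(C_2)

# w_i = ln(P_1i / P_2i) - ln((1 - P_1i) / (1 - P_2i))
w = np.log(P1 / P2) - np.log((1.0 - P1) / (1.0 - P2))
# w_0 = sum_i ln((1 - P_1i) / (1 - P_2i)) + ln(P(C_1) / P(C_2))
w0 = np.log((1.0 - P1) / (1.0 - P2)).sum() + np.log(prior1 / prior2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([1, 0, 1])
print("P(C1 | x) =", sigmoid(w @ x + w0))
```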
homework
Generalized discriminant functions
fixed non-linear basis functions: $y_k(\mathbf{x}) = \sum_{j=0}^{M} w_{kj}\,\phi_j(\mathbf{x})$
An extra basis function $\phi_0$ equal to one absorbs the bias.
Such a representation can approximate any CONTINUOUS functional transformation to arbitrary accuracy.
Training:
• least-squares techniques
• perceptron learning
• Fisher discriminant
Sum-of-squares error function
$E(\mathbf{w}) = \dfrac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{c}\left(y_k(\mathbf{x}^n) - t_k^n\right)^2$
where $t_k^n$ is the target for output $k$ when the input is $\mathbf{x}^n$. The error is quadratic in the weights.
Geometrical interpretation of least squares: the least-squares solution corresponds to an orthogonal projection of the target data onto the column space of the design matrix $\Phi$.
Pseudo-inverse solution
normal equations: $(\Phi^T\Phi)\,W^T = \Phi^T T$, solved by $W^T = \Phi^\dagger T$
with $T$ the $N \times c$ target matrix, $\Phi$ the $N \times (M+1)$ design matrix of basis-function activations, and $W$ the $c \times (M+1)$ weight matrix.
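A minimal sketch of the pseudo-inverse solution, with random numbers standing in for a real design matrix $\Phi$ and a 1-of-c target matrix $T$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, c = 100, 4, 3                      # N patterns, M basis functions, c classes

Phi = rng.normal(size=(N, M))            # basis-function activations
Phi = np.hstack([np.ones((N, 1)), Phi])  # extra column phi_0 = 1 for the biases -> N x (M+1)
T = np.eye(c)[rng.integers(0, c, N)]     # 1-of-c target matrix, N x c

# Normal equations (Phi^T Phi) W^T = Phi^T T, solved via the pseudo-inverse:
W = (np.linalg.pinv(Phi) @ T).T          # weight matrix, c x (M+1)

Y = Phi @ W.T                            # network outputs on the training set
print("sum-of-squares error:", 0.5 * np.sum((Y - T) ** 2))
```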
Pseudo-inverse solution
If $\Phi^T\Phi$ is singular, the pseudo-inverse is defined by the limiting process $\Phi^\dagger = \lim_{\epsilon \to 0}(\Phi^T\Phi + \epsilon I)^{-1}\Phi^T$, and the resulting solution still minimizes the sum-of-squares error.
bias
The role of the biases is to compensate for the
difference between the averages (over the data set) of
the target values and the averages of the output vectors
gradient descent
Group all of the parameters (weights and biases)
together to form a single weight vector w.
batch: $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla E\big|_{\mathbf{w}^{(\tau)}}$
sequential: $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla E^n\big|_{\mathbf{w}^{(\tau)}}$
If $\eta$ is chosen correctly, sequential gradient descent becomes the Robbins-Monro procedure for finding the root of the regression function.
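A small sketch contrasting the two update schemes on the sum-of-squares error of a linear single-layer network (the toy data and the fixed learning rate $\eta$ are assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])  # inputs with a bias column
t = rng.normal(size=(50, 1))                                 # targets
eta = 0.01

def grad(w, X, t):
    """Gradient of E = 0.5 * sum_n (x_n^T w - t_n)^2 with respect to w."""
    return X.T @ (X @ w - t)

# Batch: each step uses the gradient of the full error E.
w_batch = np.zeros((3, 1))
for _ in range(100):
    w_batch -= eta * grad(w_batch, X, t)

# Sequential: one step per pattern, using the gradient of E^n only.
w_seq = np.zeros((3, 1))
for _ in range(100):
    for n in range(X.shape[0]):
        w_seq -= eta * grad(w_seq, X[n:n+1], t[n:n+1])

print(w_batch.ravel(), w_seq.ravel())
```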
gradient descent
Differentiable non-linear activation functions: $y_k = g(a_k)$ with $a_k = \sum_i w_{ki}\,x_i + w_{k0}$. The error gradient now involves $g'(a_k)$; for the logistic sigmoid, $g'(a) = g(a)\,(1 - g(a))$, and both batch and sequential gradient descent apply unchanged.
homework
Generate and plot a set of data
points in two dimensions, drawn
from two classes each of which is
described by a Gaussian class-conditional density function.
Implement the gradient descent
algorithm for training a logistic
discriminant, and plot the decision
boundary at regular intervals during
the training procedure on the same
graph as the data. Explore the effect
of choosing different values for the
learning rate. Compare the
behaviour of the sequential and
batch weight update procedures.
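A possible starting point for this homework: a sketch that draws two hypothetical Gaussian classes, runs sequential gradient descent on a logistic discriminant (here using the cross-entropy error; the sum-of-squares error would work similarly), and plots the final decision boundary:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Two Gaussian classes in 2-D (means and covariances are arbitrary choices).
XA = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], 100)
XB = rng.multivariate_normal([2.5, 2.0], [[1.0, -0.2], [-0.2, 1.0]], 100)
X = np.vstack([XA, XB])
t = np.hstack([np.zeros(100), np.ones(100)])          # targets 0 / 1
Phi = np.hstack([np.ones((200, 1)), X])               # bias input fixed to 1

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Sequential gradient descent; for the cross-entropy error the
# per-pattern gradient is simply (y - t) * phi.
w = np.zeros(3)
eta = 0.05
for epoch in range(50):
    for n in rng.permutation(200):
        y = sigmoid(w @ Phi[n])
        w -= eta * (y - t[n]) * Phi[n]

# Plot the data and the final decision boundary w0 + w1*x1 + w2*x2 = 0.
plt.scatter(*XA.T, marker='o', label='class 1')
plt.scatter(*XB.T, marker='x', label='class 2')
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 2)
plt.plot(x1, -(w[0] + w[1] * x1) / w[2], 'k-', label='decision boundary')
plt.legend()
plt.show()
```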
The perceptron
Applied to classification problems in which the inputs are
usually binary images of characters or simple shapes
The first-layer processing units have fixed weights, each connected to a random subset of the input pixels.
The perceptron
Define the error function in terms of the total number of misclassifications over the TS (or, more generally, in terms of a loss matrix). However, such an error function is piecewise constant w.r.t. the weights, so gradient descent cannot be applied.
Target coding:
$t^n = \begin{cases} +1 & \text{if } \mathbf{x}^n \in C_1 \\ -1 & \text{if } \mathbf{x}^n \in C_2 \end{cases}$
Wanted: $\mathbf{w}^T\boldsymbol{\phi}^n\,t^n > 0$ for every pattern.
Minimize the perceptron criterion
$E^{\mathrm{perc}}(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^T\boldsymbol{\phi}^n\,t^n$
where $\mathcal{M}$ is the set of misclassified patterns. It is proportional to the sum of the absolute distances of the misclassified input patterns to the decision boundary.
The criterion is continuous and piecewise linear
The perceptron
Apply the sequential gradient descent rule to the perceptron criterion:
$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta\,\boldsymbol{\phi}^n\,t^n$ for each misclassified pattern $\boldsymbol{\phi}^n$.
Cycle through all of the patterns in the TS and test each pattern in
turn using the current set of weight values. If the pattern is correctly
classified do nothing, otherwise add the pattern vector to the weight
vector if the pattern is labelled class C1 or subtract the pattern vector
from the weight vector if the pattern is labelled class C2.
The value of $\eta$ is unimportant, since changing it is equivalent to a re-scaling of the weights and biases.
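A minimal sketch of this learning rule on toy, linearly separable data (with $\eta = 1$ and null initial conditions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy linearly separable data: phi vectors with a leading 1 for the bias.
X1 = rng.normal([2.0, 2.0], 0.5, size=(20, 2))    # class C1, target t = +1
X2 = rng.normal([-2.0, -2.0], 0.5, size=(20, 2))  # class C2, target t = -1
Phi = np.hstack([np.ones((40, 1)), np.vstack([X1, X2])])
t = np.hstack([np.ones(20), -np.ones(20)])

w = np.zeros(3)                        # null initial conditions
changed = True
while changed:                         # terminates if the data are linearly separable
    changed = False
    for n in range(len(t)):
        if t[n] * (w @ Phi[n]) <= 0:   # pattern misclassified
            w += t[n] * Phi[n]         # add/subtract the pattern vector (eta = 1)
            changed = True
print("weights:", w)
```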
The perceptron
The perceptron convergence theorem
For any data set which is linearly separable, the perceptron learning
rule is guaranteed to find a solution in a finite number of steps.
proof (assume a solution vector $\hat{\mathbf{w}}$ exists; start from null initial conditions)
end proof
The perceptron convergence theorem
Prove that, for arbitrary vectors $\mathbf{w}$ and $\hat{\mathbf{w}}$, the following equality is satisfied:
homework
Hence, show that an upper limit
on the number of weight updates
needed for convergence of the
perceptron algorithm is given by:
homework
If the data set happens not to be linearly
separable, then the learning algorithm will never
terminate. If we arbitrarily stop the learning
process, there is no guarantee that the weight
vector found will generalize well for new data.
• decrease $\eta$ during the training process;
• the pocket algorithm.
It involves retaining a copy (in one’s pocket) of the
set of weights which has so far survived unchanged
for the longest number of pattern presentations.
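A rough sketch of the pocket idea, assuming perceptron-style updates and keeping in the pocket the weight vector that survived unchanged for the longest run of presentations:

```python
import numpy as np

def pocket_perceptron(Phi, t, epochs=100, seed=0):
    """Perceptron updates, keeping in the 'pocket' the weight vector that
    survived unchanged for the longest run of pattern presentations."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    pocket_w, best_run, run = w.copy(), 0, 0
    for _ in range(epochs):
        for n in rng.permutation(len(t)):
            if t[n] * (w @ Phi[n]) <= 0:       # misclassified: update and reset the run
                if run > best_run:             # current w beat the pocketed one
                    pocket_w, best_run = w.copy(), run
                w = w + t[n] * Phi[n]
                run = 0
            else:
                run += 1                       # w survived another presentation
    return pocket_w if best_run > run else w

# Toy usage on data that may not be linearly separable:
rng = np.random.default_rng(0)
Phi = np.hstack([np.ones((40, 1)),
                 np.vstack([rng.normal([2, 2], 1.5, (20, 2)),
                            rng.normal([-2, -2], 1.5, (20, 2))])])
t = np.hstack([np.ones(20), -np.ones(20)])
print(pocket_perceptron(Phi, t))
```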
Limitations of the perceptron
Even though the data set of input patterns may not be linearly separable in the input space, it can become linearly separable in the $\phi$-space. However, this requires the number and complexity of the $\phi_j$'s to grow very rapidly (typically exponentially).
Limiting the complexity: the diameter-limited perceptron, in which each $\phi_j$ has a restricted receptive field.
Fisher’s linear discriminant
optimal linear dimensionality reduction
yw x
T
no bias
select a projection which maximizes the class separation
•N1 points of class C1
•N2 points of class C2
Fisher’s linear discriminant
Maximize the separation of the projected class means:
$m_2 - m_1 = \mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)$
where $m_k = \mathbf{w}^T\mathbf{m}_k$ is the class mean of the projected data from class $C_k$. This can be made arbitrarily large by increasing the magnitude of $\mathbf{w}$, so $\mathbf{w}$ is constrained to unit length.
Constrained optimization: $\mathbf{w} \propto (\mathbf{m}_2 - \mathbf{m}_1)$
Maximize a function which
represents the difference
between the projected class
means, normalized by a
measure of the within-class
scatter along the direction of w.
Fisher’s linear discriminant
The within-class scatter of the transformed data from class $C_k$ is described by the within-class covariance, given by $s_k^2 = \sum_{n \in C_k}(y^n - m_k)^2$.
Fisher criterion:
$J(\mathbf{w}) = \dfrac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \dfrac{\mathbf{w}^T S_B\,\mathbf{w}}{\mathbf{w}^T S_W\,\mathbf{w}}$
where $S_B$ is the between-class covariance matrix and $S_W$ is the within-class covariance matrix.
Fisher’s linear discriminant
w  PC of SW1S B
w  SW1 m2  m1 
Generalized
eigenvector
problem
S B w  m2  m1
Fisher’s linear discriminant
w  PC of SW1S B
w  SW1 m2  m1 
EXAMPLE
Fisher’s linear discriminant
w  PC of SW1S B
w  SW1 m2  m1 
The projected data can subsequently be used to construct a discriminant, by choosing a threshold $y_0$ so that we classify a new point as belonging to $C_1$ if $y(\mathbf{x}) \ge y_0$ and classify it as belonging to $C_2$ otherwise. Note that $y = \mathbf{w}^T\mathbf{x}$ is the sum of a set of random variables, and so we may invoke the central limit theorem and model the class-conditional density functions $p(y \mid C_k)$ using normal distributions. Once we have obtained a suitable weight vector and a threshold, the procedure for deciding the class of a new vector is identical to that of the perceptron network. So, the Fisher criterion can be viewed as a learning law for the single-layer network.
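A minimal sketch of the two-class Fisher direction $\mathbf{w} \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$ on toy Gaussian data; for simplicity the threshold is taken as the midpoint of the projected class means rather than fitted from normal models of $p(y \mid C_k)$:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], 100)   # class C1
X2 = rng.multivariate_normal([3.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], 100)   # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class covariance S_W: sum of the scatter matrices of the two classes.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(S_W, m2 - m1)          # w proportional to S_W^{-1} (m2 - m1)

# Simple threshold: midpoint of the projected class means.
# With this choice of w, larger projections correspond to class C2.
y0 = 0.5 * (w @ m1 + w @ m2)

x_new = np.array([1.0, 0.3])
print("class C2" if w @ x_new > y0 else "class C1")
```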
Fisher’s linear discriminant
relation to the least-squares approach
$t^n = \begin{cases} \phantom{-}N/N_1 & \text{if } \mathbf{x}^n \in C_1 \\ -N/N_2 & \text{if } \mathbf{x}^n \in C_2 \end{cases}$
With this target coding, the least-squares solution for the weights is equivalent to the Fisher solution.
Fisher’s linear discriminant
relation to the least-squares approach
The least-squares bias plays the role of the threshold: $w_0 = -\mathbf{w}^T\mathbf{m}$, where $\mathbf{m}$ is the mean of the whole data set.
Fisher’s linear discriminant
Several classes: project onto $d' > 1$ linear features
$\mathbf{y} = W\mathbf{x}$, where $W$ is a $d' \times d$ matrix of weights (no biases)
The within-class covariance matrix generalizes to the sum over all c classes, $S_W = \sum_{k=1}^{c} S_k$.
Fisher’s linear discriminant
Several classes
In the projected $d'$-dimensional y-space, analogous within-class and between-class covariance matrices can be defined.
Fisher’s linear discriminant
Several classes
One possible criterion: $J(W) = \mathrm{Tr}\{s_W^{-1}\,s_B\}$ (covariances computed in the y-space)
weights: the $d'$ principal eigenvectors (PC's) of $S_W^{-1}S_B$
This criterion is unable to find
more than (c - 1) linear features
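A sketch of extracting $d'$ linear features as the leading eigenvectors of $S_W^{-1}S_B$ on toy three-class data (eigenvectors are taken from np.linalg.eig and cast to their real parts, a simplification):

```python
import numpy as np

rng = np.random.default_rng(0)
means = [np.array([0.0, 0.0, 0.0]), np.array([3.0, 1.0, 0.0]), np.array([0.0, 3.0, 2.0])]
classes = [rng.normal(m, 1.0, size=(50, 3)) for m in means]   # c = 3 classes in d = 3

m_all = np.vstack(classes).mean(axis=0)
S_W = sum((X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) for X in classes)
S_B = sum(len(X) * np.outer(X.mean(axis=0) - m_all, X.mean(axis=0) - m_all)
          for X in classes)

# d' leading eigenvectors of S_W^{-1} S_B form the rows of W
# (at most c - 1 of them carry useful discriminative information).
evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(evals.real)[::-1]
d_prime = 2
W = evecs.real[:, order[:d_prime]].T      # d' x d projection matrix

Y = np.vstack(classes) @ W.T              # projected data y = W x
print(Y.shape)
```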