Lecture 4 From linear machine to flexible discriminants (1)


Lecture 4
Linear machine
Linear discriminant functions
Generalized linear discriminant function
Fisher’s Linear discriminant
Perceptron
Optimal separating hyperplane
Linear discriminant functions
g(x) = w^t x + w_0
The signed distance from x to the hyperplane g(x) = 0 is r = g(x)/||w||. If w is a unit vector, g(x) itself is the signed distance r. Decide the class by the sign of r.
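As a quick illustration, a minimal NumPy sketch of this rule; the vectors w, w0 and the sample x below are made-up values, not data from the lecture:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance from x to the hyperplane g(x) = w^t x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

# made-up two-dimensional example
w, w0 = np.array([2.0, 1.0]), -1.0
x = np.array([1.0, 3.0])
r = signed_distance(x, w, w0)
print(r, 1 if r > 0 else 2)   # decide the class by the sign of r
```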
Linear discriminant functions
If x1 and x2 are both on the decision surface, then w^t x1 + w_0 = w^t x2 + w_0 = 0, so w^t (x1 − x2) = 0: w is normal to any vector lying in the surface.
From the discriminant function point of view: writing x = x_p + r w/||w||, where x_p is the projection of x onto the surface, gives g(x) = r ||w||, i.e., r = g(x)/||w||.
Linear discriminant functions
More than two classes (#classes = c).
Dichotomize (one class versus the rest)? c linear discriminants.
Pairwise (one class versus another)? c(c−1)/2 linear discriminants.
For example, with c = 4 this means 4 one-versus-rest discriminants or 6 pairwise ones; either construction can leave regions where the class assignment is ambiguous.
Linear discriminant functions
Remember what we did in the Bayes Decision class?
Define c linear discriminant functions:
g_i(x) = w_i^t x + w_i0,  i = 1, ..., c
The overall classifier maximizes g_i(x) at every x: decide class i if g_i(x) > g_j(x) for all j ≠ i.
The resulting classifier is a linear machine. The space is divided into c regions.
The boundary between neighboring regions R_i and R_j is linear, because it is defined by g_i(x) = g_j(x), i.e., (w_i − w_j)^t x + (w_i0 − w_j0) = 0, which is a hyperplane.
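A minimal sketch of a linear machine, assuming the class weight vectors are stacked as the rows of a matrix W with a bias vector w0; all numbers are made up:

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class whose discriminant g_i(x) = w_i^t x + w_i0 is largest."""
    g = W @ x + w0               # vector of the c discriminant values
    return int(np.argmax(g))     # index of the winning class

# made-up example: c = 3 classes, d = 2 features
W = np.array([[1.0, 0.0],        # rows are the weight vectors w_i
              [0.0, 1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, -0.5, 0.3])  # the biases w_i0
print(linear_machine(np.array([0.8, 0.2]), W, w0))   # -> 0 (class 0 wins)
```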
Generalized linear discriminant functions
Write g(x) = a^t y, where y = (y_1(x), ..., y_m(x))^t is a vector of (possibly nonlinear) functions of x. The discriminant is linear in y, but when we transform x this way, a linear discriminant function in y can produce a nonlinear separation in the original feature space.
Generalized linear discriminant functions
In the two-class case, g(x) = g_1(x) − g_2(x).
Example (one-dimensional x, quadratic mapping y = (1, x, x^2)^t):
a^t = (−3, 2, 5), so g(x) = a^t y = −3 + 2x + 5x^2
g(x) = 0 when x = 3/5 or x = −1
g(x) > 0 when x > 3/5 or x < −1: decide R1
g(x) < 0 when −1 < x < 3/5: decide R2
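The region assignment in this example can be checked numerically; a small sketch with a few made-up test points:

```python
import numpy as np

a = np.array([-3.0, 2.0, 5.0])       # a^t = (-3, 2, 5)

def g(x):
    y = np.array([1.0, x, x**2])     # y = (1, x, x^2)^t
    return a @ y                     # g(x) = a^t y = -3 + 2x + 5x^2

for x in (-2.0, -1.0, 0.0, 1.0):     # g(x) = 0 exactly at x = -1 and x = 3/5
    region = "R1" if g(x) > 0 else ("R2" if g(x) < 0 else "boundary")
    print(f"x = {x:5.2f}, g(x) = {g(x):6.2f}, {region}")
```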
Fisher Linear discriminant
The goal: project the data from d dimensions onto a line. Find
the line that maximizes the class separation after projection.
The projection is y = w^t x. The magnitude of w is irrelevant, as it only scales y; the direction of w is what matters.
Projected mean: m̃_i = (1/n_i) Σ_{x in class i} w^t x = w^t m_i, where m_i is the sample mean of class i.
Fisher Linear discriminant
Then the distance between the projected means is |m̃_1 − m̃_2| = |w^t (m_1 − m_2)|.
Our goal is to make this distance large relative to a measure of the variation in each class.
Define the scatter of the projected class i: s̃_i^2 = Σ_{y in class i} (y − m̃_i)^2.
(1/n)(s̃_1^2 + s̃_2^2) is an estimate of the pooled variance.
Fisher's linear discriminant maximizes J(w) = |m̃_1 − m̃_2|^2 / (s̃_1^2 + s̃_2^2) over all w.
Fisher Linear discriminant
Let S_i = Σ_{x in class i} (x − m_i)(x − m_i)^t. Note that, up to a constant factor, this is the sample version of the class covariance matrix Σ_i.
Let S_W = S_1 + S_2 (S_W: within-class scatter matrix). Then s̃_1^2 + s̃_2^2 = w^t S_W w.
Let S_B = (m_1 − m_2)(m_1 − m_2)^t (S_B: between-class scatter matrix). Then (m̃_1 − m̃_2)^2 = w^t S_B w.
So J(w) = (w^t S_B w)/(w^t S_W w), a generalized Rayleigh quotient.
Fisher Linear discriminant
The maximizing w satisfies the generalized eigenvalue equation S_B w = λ S_W w. Because for any w, S_B w is always in the direction of m_1 − m_2, the solution is w ∝ S_W^{-1}(m_1 − m_2).
Notice this is the same direction obtained from the Bayes decision rule when the two densities are normal with equal covariance matrices.
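A minimal NumPy sketch of the two-class Fisher direction w ∝ S_W^{-1}(m_1 − m_2); the data matrices X1 and X2 below are made-up Gaussian samples (rows are samples):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher linear discriminant direction for two classes (rows = samples)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)          # class-1 scatter
    S2 = (X2 - m2).T @ (X2 - m2)          # class-2 scatter
    Sw = S1 + S2                          # within-class scatter matrix
    w = np.linalg.solve(Sw, m1 - m2)      # w proportional to Sw^{-1}(m1 - m2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))   # made-up class-1 samples
X2 = rng.normal([3, 2], 1.0, size=(50, 2))   # made-up class-2 samples
w = fisher_direction(X1, X2)
print(w)                                     # projection of the data: y = X @ w
```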
Multiple discriminant analysis
Now there are c classes. The goal is to project to c-1
dimensional space and maximize the between-group scatter
relative to within-group scatter.
Why c − 1? We need c − 1 discriminant functions.
Within-class scatter: S_W = Σ_{i=1}^{c} S_i, with S_i = Σ_{x in class i} (x − m_i)(x − m_i)^t.
Total mean: m = (1/n) Σ_x x = (1/n) Σ_{i=1}^{c} n_i m_i.
Multiple discriminant analysis
Between-group scatter: S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^t.
Total scatter: S_T = S_W + S_B.
Take a d×(c−1) projection matrix W: y = W^t x, and the projected scatter matrices are W^t S_W W and W^t S_B W.
Multiple discriminant analysis
The goal is to maximize J(W) = |W^t S_B W| / |W^t S_W W| (a ratio of determinants).
The solution: the columns of W are the generalized eigenvectors corresponding to the c − 1 largest eigenvalues of S_B w_i = λ_i S_W w_i.
Since the projected scatter is not class-specific, this is more like a dimension-reduction procedure that captures as much class information as possible.
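A sketch of this projection via the generalized eigenproblem S_B w_i = λ_i S_W w_i, using scipy.linalg.eigh; the three classes in the usage lines are made up:

```python
import numpy as np
from scipy.linalg import eigh

def mda(Xs):
    """Columns of W = generalized eigenvectors of (S_B, S_W) with largest eigenvalues."""
    c = len(Xs)
    means = [X.mean(axis=0) for X in Xs]
    ns = [len(X) for X in Xs]
    m = sum(n * mi for n, mi in zip(ns, means)) / sum(ns)                  # total mean
    Sw = sum((X - mi).T @ (X - mi) for X, mi in zip(Xs, means))            # within-class scatter
    Sb = sum(n * np.outer(mi - m, mi - m) for n, mi in zip(ns, means))     # between-group scatter
    evals, evecs = eigh(Sb, Sw)             # generalized symmetric eigenproblem
    order = np.argsort(evals)[::-1]         # largest eigenvalues first
    return evecs[:, order[:c - 1]]          # d x (c-1) projection matrix W

rng = np.random.default_rng(1)
Xs = [rng.normal(mu, 1.0, size=(40, 3)) for mu in ([0, 0, 0], [4, 0, 0], [0, 4, 0])]
W = mda(Xs)
Y = Xs[0] @ W                               # projected class-1 samples
print(W.shape, Y.shape)                     # (3, 2) (40, 2)
```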
Multiple discriminant analysis
Eleven classes, projected onto the first two eigenvectors (figure).
Multiple discriminant analysis
As the rank of the eigenvector increases, the separability decreases.
Separating hyperplane
Let’s do some data augmentation to make things easier.
If we have a decision boundary between two classes, g(x) = w^t x + w_0 = 0, let
y = (1, x_1, ..., x_d)^t = (1, x^t)^t   and   a = (w_0, w_1, ..., w_d)^t.
Then g(x) = a^t y.
What's the benefit? The hyperplane a^t y = 0 always goes through the origin of the augmented space.
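A small sketch of the augmentation step, assuming the samples are the rows of a matrix X (made up below):

```python
import numpy as np

def augment(X):
    """Prepend a constant 1 to every sample: y = (1, x1, ..., xd)^t."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.array([[1.0, 2.0], [3.0, -1.0]])   # made-up samples
Y = augment(X)                            # rows are the augmented vectors y
print(Y)                                  # g(x) = a^t y with a = (w0, w1, ..., wd)^t
```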
Linearly separable case
Now we want to use the training samples to find the weight
vector a which classifies all samples correctly.
If a exists, the samples are linearly separable.
a^t y_i > 0 for every y_i in class 1
a^t y_i < 0 for every y_i in class 2
If every y_i in class 2 is replaced by its negative, then we are trying to find a such that a^t y_i > 0 for every sample.
Such an a is called a "separating vector" or "solution vector".
Each equation a^t y_i = 0 defines a hyperplane through the origin of weight space with y_i as a normal vector.
The solution vector, if it exists, lies on the positive side of every such hyperplane, i.e., in the intersection of n half-spaces.
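The "replace class-2 samples by their negatives" step is a one-liner; a sketch assuming the augmented samples are rows of Y and the labels are 1 and 2 (both made up):

```python
import numpy as np

def normalize_samples(Y, labels):
    """Negate the augmented class-2 samples so the goal becomes a^t y_i > 0 for every i."""
    Y = Y.copy()
    Y[labels == 2] *= -1.0
    return Y

Y = np.array([[1.0, 1.0, 2.0], [1.0, 3.0, -1.0]])   # made-up augmented samples
labels = np.array([1, 2])
print(normalize_samples(Y, labels))
```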
Linearly separable case
Every vector in the grey region is a solution vector; the region is called the "solution region". A vector in the middle of the region intuitively looks better, and we can impose additional conditions to select it.
Linearly separable case
One such condition: maximize the minimum distance from the samples to the plane.
Gradient descent procedure
How to find a solution vector?
A general approach:
Define a function J(a) which is minimized if a is a solution
vector.
Start with an arbitrary vector a(0).
Compute the gradient ∇J(a(k)).
Move from a(k) in the direction of the negative gradient: a(k+1) = a(k) − η(k) ∇J(a(k)).
Iterate; stop when the gain is smaller than a threshold.
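A generic sketch of this procedure; the criterion J(a) = ||a − b||^2 and its gradient in the usage line are made up purely for illustration:

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, tol=1e-6, max_iter=1000):
    """Minimize J by moving against its gradient until the step is small."""
    a = np.asarray(a0, dtype=float)
    for k in range(max_iter):
        step = eta * grad_J(a)
        a = a - step                      # a(k+1) = a(k) - eta * grad J(a(k))
        if np.linalg.norm(step) < tol:    # stop when the gain is below the threshold
            break
    return a

# made-up quadratic criterion J(a) = ||a - b||^2 with gradient 2(a - b)
b = np.array([1.0, -2.0])
print(gradient_descent(lambda a: 2 * (a - b), a0=[0.0, 0.0]))   # -> close to b
```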
Perceptron
The perceptron criterion is J_p(a) = Σ_{y in Y(a)} (−a^t y), where Y(a) is the set of samples misclassified by a.
When Y(a) is empty, define J_p(a) = 0.
Because a^t y ≤ 0 when y is misclassified, J_p(a) is nonnegative.
The gradient is simple: ∇J_p = Σ_{y in Y(a)} (−y).
The update rule is: a(k+1) = a(k) + η(k) Σ_{y in Y(a(k))} y, where η(k) is the learning rate.
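A minimal batch-perceptron sketch following this update rule; Y holds augmented and normalized samples as rows, and the data in the usage lines are made up (and linearly separable):

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Y: rows are augmented, normalized samples (class-2 rows already negated)."""
    a = np.zeros(Y.shape[1])
    for k in range(max_iter):
        mis = Y[Y @ a <= 0]               # Y(a): the misclassified samples
        if len(mis) == 0:                 # J_p(a) = 0, a separating vector was found
            break
        a = a + eta * mis.sum(axis=0)     # a(k+1) = a(k) + eta * sum of misclassified y
    return a

# made-up linearly separable data (already augmented and normalized)
Y = np.array([[ 1.0, 2.0,  1.0],
              [ 1.0, 1.5,  0.5],
              [-1.0, 0.5,  0.2],
              [-1.0, 1.0, -1.0]])
a = batch_perceptron(Y)
print(a, Y @ a)                            # all entries of Y @ a should be positive
```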
Perceptron
The perceptron adjusts a only according to misclassified
samples; correctly classified samples are ignored.
The final a is a linear combination of the training points.
To have good testing-sample performance, a large set of
training samples is needed; however, it is almost certain that
a large set of training samples is not linearly separable.
When the samples are not linearly separable, the iteration does not stop. We can let η(k) → 0 as k → ∞ (for example, η(k) = η(1)/k).
However, how should the rate of decrease be chosen?
Optimal separating hyperplane
The perceptron finds one separating plane out of infinitely many possibilities. How do we find the best among them?
The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point.
• Unique solution
• Better test-sample performance
Optimal separating hyperplane
Notation change! Here y_i ∈ {−1, +1} is the class label of sample i, and x_i is the (augmented) sample.
min ||a||^2
s.t. y_i a^t x_i ≥ 1, i = 1, ..., N
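A hedged sketch of solving this small quadratic program directly with scipy.optimize.minimize (SLSQP); the samples x_i and labels y_i below are made up, and a dedicated QP or SVM solver would normally be used instead:

```python
import numpy as np
from scipy.optimize import minimize

# made-up separable data: augmented samples x_i (leading 1) and labels y_i in {-1, +1}
X = np.array([[1.0,  2.0,  2.0],
              [1.0,  3.0,  3.0],
              [1.0, -1.0, -1.0],
              [1.0, -2.0,  0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

objective = lambda a: a @ a                                    # ||a||^2
constraints = [{"type": "ineq", "fun": lambda a, i=i: y[i] * (X[i] @ a) - 1.0}
               for i in range(len(y))]                         # y_i a^t x_i >= 1
res = minimize(objective, x0=np.zeros(X.shape[1]),
               method="SLSQP", constraints=constraints)
print(res.x)   # weight vector of the optimal separating hyperplane
```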
We shall visit the support vector machine next time.