Linear Methods for
Classification
Lecture Notes for CMPUT 466/551
Nilanjan Ray
Linear Classification
• What is meant by linear classification?
– The decision boundaries in the feature
(input) space are linear
• Should the regions be contiguous?
[Figure: piecewise linear decision boundaries in a 2D input space (X1, X2), partitioning it into regions R1, R2, R3, R4]
Linear Classification…
• There is a discriminant function $\delta_k(x)$ for
each class k
• Classification rule: $R_k = \{x : k = \arg\max_j \delta_j(x)\}$
• In higher dimensional space the decision
boundaries are piecewise hyperplanar
• Remember that the 0-1 loss function led to the
classification rule: $R_k = \{x : k = \arg\max_j \Pr(G = j \mid X = x)\}$
• So, $\Pr(G = k \mid X = x)$ can serve as $\delta_k(x)$
Linear Classification…
• All we require here is that the class boundaries
$\{x : \delta_k(x) = \delta_j(x)\}$ be linear for every (k, j) pair
• One can achieve this if the $\delta_k(x)$ themselves are
linear, or if some monotone transform of $\delta_k(x)$ is linear
– An example:
$$P(G = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}, \qquad P(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)}$$
so that
$$\log\!\left[\frac{P(G = 1 \mid X = x)}{P(G = 2 \mid X = x)}\right] = \beta_0 + \beta^T x$$
is linear.
Linear Classification as a Linear
Regression
2D Input space: X = (X1, X2)
Number of classes/categories K=3, so output Y = (Y1, Y2, Y3)
Training sample, size N=5,
$$\mathbf{X} = \begin{pmatrix}
1 & x_{11} & x_{12}\\
1 & x_{21} & x_{22}\\
1 & x_{31} & x_{32}\\
1 & x_{41} & x_{42}\\
1 & x_{51} & x_{52}
\end{pmatrix}, \qquad
\mathbf{Y} = \begin{pmatrix}
y_{11} & y_{12} & y_{13}\\
y_{21} & y_{22} & y_{23}\\
y_{31} & y_{32} & y_{33}\\
y_{41} & y_{42} & y_{43}\\
y_{51} & y_{52} & y_{53}
\end{pmatrix}$$
$\mathbf{Y}$ is the indicator matrix: each row has exactly one 1, indicating the category/class.
Regression output:
$$\hat{Y}((x_1, x_2)) = (1\ x_1\ x_2)\,(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = (1\ x_1\ x_2)\,(\hat{\beta}_1\ \hat{\beta}_2\ \hat{\beta}_3)$$
Or,
$$\hat{Y}_1((x_1, x_2)) = (1\ x_1\ x_2)\,\hat{\beta}_1, \qquad \hat{Y}_2((x_1, x_2)) = (1\ x_1\ x_2)\,\hat{\beta}_2, \qquad \hat{Y}_3((x_1, x_2)) = (1\ x_1\ x_2)\,\hat{\beta}_3$$
Classification rule:
$$\hat{G}((x_1, x_2)) = \arg\max_k \hat{Y}_k((x_1, x_2))$$
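A minimal NumPy sketch of this indicator-matrix regression classifier; the toy data, variable names, and helper function are my own, not from the notes:

```python
import numpy as np

# Toy training set: N = 5 points in 2D, labels in {0, 1, 2} (K = 3 classes)
x = np.array([[0.1, 0.2], [0.9, 0.8], [0.5, 0.4], [0.2, 0.9], [0.8, 0.1]])
g = np.array([0, 1, 2, 1, 0])

N, K = x.shape[0], 3
X = np.hstack([np.ones((N, 1)), x])      # prepend the constant column of 1s
Y = np.eye(K)[g]                         # indicator matrix: exactly one 1 per row

# Least-squares coefficients: Beta_hat = (X^T X)^{-1} X^T Y, one column per class
Beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

def classify(x_new):
    """Return arg max_k Y_hat_k((x1, x2)) for a new 2D point."""
    y_hat = np.concatenate([[1.0], x_new]) @ Beta_hat
    return int(np.argmax(y_hat))

print(classify(np.array([0.15, 0.25])))  # predicted class index
```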
The Masking
Linear regression of the indicator matrix can lead to masking
[Figure: masking with three classes in a 2D input space; the fitted scores $\hat{Y}_1 = (1\ x_1\ x_2)\hat{\beta}_1$, $\hat{Y}_2 = (1\ x_1\ x_2)\hat{\beta}_2$, $\hat{Y}_3 = (1\ x_1\ x_2)\hat{\beta}_3$ plotted along a viewing direction]
LDA can avoid this masking
Linear Discriminant Analysis
Essentially minimum error Bayes’ classifier
Assumes that the conditional class densities are (multivariate) Gaussian
Assumes equal covariance for every class
Posterior probability (application of Bayes' rule):
$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$$
$\pi_k$ is the prior probability for class k
$f_k(x)$ is the class-conditional density or likelihood:
$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\exp\!\left(-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right)$$
LDA…
$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{\pi_k}{\pi_l} + \log\frac{f_k(x)}{f_l(x)}$$
$$= \left(\log\pi_k + x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k\right) - \left(\log\pi_l + x^T\Sigma^{-1}\mu_l - \tfrac{1}{2}\mu_l^T\Sigma^{-1}\mu_l\right)$$
$$= \delta_k(x) - \delta_l(x)$$
Classification rule:
$$\hat{G}(x) = \arg\max_k \delta_k(x)$$
is equivalent to:
$$\hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x)$$
The good old Bayes classifier!
LDA…
When are we going to use the training data?
Training data: $(g_i, x_i),\ i = 1, \dots, N$ (total N input-output pairs)
$N_k$ = number of pairs in class k; total number of classes: K
Training data utilized to estimate:
Prior probabilities: $\hat{\pi}_k = N_k / N$
Means: $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$
Covariance matrix: $\hat{\Sigma} = \sum_{k=1}^{K}\sum_{g_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$
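A compact NumPy sketch of these estimates and the resulting linear discriminant $\delta_k(x) = x^T\hat{\Sigma}^{-1}\hat{\mu}_k - \tfrac{1}{2}\hat{\mu}_k^T\hat{\Sigma}^{-1}\hat{\mu}_k + \log\hat{\pi}_k$; the function and variable names are my own:

```python
import numpy as np

def lda_fit(x, g, K):
    """Estimate priors, class means, and the pooled covariance from (x_i, g_i) pairs."""
    N, p = x.shape
    priors = np.array([np.mean(g == k) for k in range(K)])        # pi_hat_k = N_k / N
    means = np.array([x[g == k].mean(axis=0) for k in range(K)])  # mu_hat_k
    Sigma = np.zeros((p, p))
    for k in range(K):                                            # pooled covariance
        d = x[g == k] - means[k]
        Sigma += d.T @ d
    Sigma /= (N - K)
    return priors, means, Sigma

def lda_predict(x_new, priors, means, Sigma):
    """Classify x_new by the largest linear discriminant delta_k(x)."""
    Sigma_inv = np.linalg.inv(Sigma)
    deltas = [x_new @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
              for mu, pi in zip(means, priors)]
    return int(np.argmax(deltas))
```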
LDA: Example
LDA was able to avoid masking here
Quadratic Discriminant Analysis
• Relaxes the equal-covariance assumption: the class conditional
probability densities (still multivariate Gaussians) are allowed to have
different covariance matrices
• The class decision boundaries are no longer linear; they are quadratic
$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\frac{\pi_k}{\pi_l} + \log\frac{f_k(x)}{f_l(x)}$$
$$= \left(\log\pi_k - \tfrac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) - \tfrac{1}{2}\log|\Sigma_k|\right) - \left(\log\pi_l - \tfrac{1}{2}(x - \mu_l)^T\Sigma_l^{-1}(x - \mu_l) - \tfrac{1}{2}\log|\Sigma_l|\right)$$
$$= \delta_k(x) - \delta_l(x)$$
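A minimal sketch of the corresponding quadratic discriminant; it assumes per-class priors, means, and covariance matrices have already been estimated, and the names are my own:

```python
import numpy as np

def qda_delta(x_new, pi_k, mu_k, Sigma_k):
    """Quadratic discriminant delta_k(x) for one class."""
    d = x_new - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)
    return np.log(pi_k) - 0.5 * d @ np.linalg.solve(Sigma_k, d) - 0.5 * logdet

def qda_predict(x_new, priors, means, covs):
    """Pick the class with the largest quadratic discriminant."""
    deltas = [qda_delta(x_new, pi, mu, S) for pi, mu, S in zip(priors, means, covs)]
    return int(np.argmax(deltas))
```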
QDA and Masking
Better than Linear Regression in terms of handling masking.
Usually computationally more expensive than LDA
Fisher’s Linear Discriminant
[DHS]
From the training set we want to find a direction along which the separation
between the class means is high and the overlap between the classes is small
Fisher’s LD…
Projection of a vector x on a unit vector w: $w^T x$
[Figure: geometric interpretation of the projection $w^T x$ of x onto the direction w]
From the training set we want to find a direction w along which the separation
between the projected class means is high and
the overlap between the projected classes is small
Fisher’s LD…
Class means:
$$m_1 = \frac{1}{N_1}\sum_{x_i \in R_1} x_i, \qquad m_2 = \frac{1}{N_2}\sum_{x_i \in R_2} x_i$$
Projected class means:
$$\tilde{m}_1 = \frac{1}{N_1}\sum_{x_i \in R_1} w^T x_i = w^T m_1, \qquad \tilde{m}_2 = \frac{1}{N_2}\sum_{x_i \in R_2} w^T x_i = w^T m_2$$
Difference between projected class means:
$$\tilde{m}_2 - \tilde{m}_1 = w^T(m_2 - m_1)$$
Scatter of the projected data $y_i = w^T x_i$ (this will indicate overlap between the classes):
$$\tilde{s}_1^2 = \sum_{y_i : x_i \in R_1}(y_i - \tilde{m}_1)^2 = \sum_{x_i \in R_1}(w^T x_i - w^T m_1)^2 = w^T\Big[\sum_{x_i \in R_1}(x_i - m_1)(x_i - m_1)^T\Big] w = w^T S_1 w$$
$$\tilde{s}_2^2 = \sum_{y_i : x_i \in R_2}(y_i - \tilde{m}_2)^2 = \sum_{x_i \in R_2}(w^T x_i - w^T m_2)^2 = w^T\Big[\sum_{x_i \in R_2}(x_i - m_2)(x_i - m_2)^T\Big] w = w^T S_2 w$$
Fisher’s LD…
Ratio of the difference of projected means over total scatter (a Rayleigh quotient):
$$r(w) = \frac{(\tilde{m}_2 - \tilde{m}_1)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T S_B w}{w^T S_W w}$$
where
$$S_W = S_1 + S_2, \qquad S_B = (m_2 - m_1)(m_2 - m_1)^T$$
We want to maximize r(w). The solution is
$$w = S_W^{-1}(m_2 - m_1)$$
Fisher’s LD: Classifier
So far so good. However, how do we get the classifier?
All we know at this point is that the direction $w = S_W^{-1}(m_2 - m_1)$
separates the projected data very well
Since we know that the projected class means are well separated,
we can choose the average of the two projected means as a threshold
for classification
Classification rule: x in R2 if y(x) > 0, else x in R1, where
$$y(x) = w^T x - \tfrac{1}{2}(\tilde{m}_1 + \tilde{m}_2) = w^T x - \tfrac{1}{2} w^T(m_1 + m_2) = (m_2 - m_1)^T S_W^{-1}\Big(x - \tfrac{1}{2}(m_1 + m_2)\Big)$$
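A short NumPy sketch of this two-class Fisher discriminant, computing $w = S_W^{-1}(m_2 - m_1)$ and thresholding at the midpoint of the projected means; the function and variable names are my own:

```python
import numpy as np

def fisher_ld_fit(x1, x2):
    """x1, x2: arrays of shape (N1, p) and (N2, p) holding the two classes."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    S1 = (x1 - m1).T @ (x1 - m1)            # scatter of class 1
    S2 = (x2 - m2).T @ (x2 - m2)            # scatter of class 2
    Sw = S1 + S2
    w = np.linalg.solve(Sw, m2 - m1)        # w = Sw^{-1} (m2 - m1)
    threshold = 0.5 * w @ (m1 + m2)         # midpoint of the projected means
    return w, threshold

def fisher_ld_predict(x_new, w, threshold):
    """Return 2 if y(x) = w^T x - threshold > 0, else 1."""
    return 2 if w @ x_new - threshold > 0 else 1
```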
Fisher’s LD: Multiple Classes
There are k classes C1, …, Ck, with ni elements in the ith class
Compute means for the classes:
$$m_i = \frac{1}{n_i}\sum_{x \in C_i} x$$
Compute the grand mean:
$$m = \frac{1}{n_1 + \dots + n_k}\sum_{i}\sum_{x \in C_i} x$$
Compute the within-class and between-class scatter matrices:
$$S_W = \sum_{x \in C_1}(x - m_1)(x - m_1)^T + \dots + \sum_{x \in C_k}(x - m_k)(x - m_k)^T$$
$$S_B = n_1(m_1 - m)(m_1 - m)^T + \dots + n_k(m_k - m)(m_k - m)^T$$
Maximize the Rayleigh ratio:
$$r(w) = \frac{w^T S_B w}{w^T S_W w}$$
The solution w is the eigenvector of $S_W^{-1} S_B$ with the largest eigenvalue
At most (k-1) eigenvalues will be non-zero. Dimensionality reduction.
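A sketch of this multi-class computation using SciPy's symmetric generalized eigensolver for $S_B w = \lambda S_W w$ (equivalent to the eigenvectors of $S_W^{-1} S_B$ when $S_W$ is positive definite); the function name and interface are my own:

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(x, g, k, n_dims=None):
    """Return up to (k-1) discriminant directions from data x (N, p) and labels g in {0, ..., k-1}."""
    p = x.shape[1]
    m = x.mean(axis=0)                             # grand mean
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for c in range(k):
        xc = x[g == c]
        mc = xc.mean(axis=0)
        Sw += (xc - mc).T @ (xc - mc)              # within-class scatter
        Sb += len(xc) * np.outer(mc - m, mc - m)   # between-class scatter
    # Generalized eigenproblem Sb w = lambda Sw w; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = eigh(Sb, Sw)
    n_dims = n_dims or (k - 1)                     # at most k-1 non-zero eigenvalues
    return eigvecs[:, ::-1][:, :n_dims]            # directions with the largest eigenvalues
```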
Fisher’s LD and LDA
They become the same when:
(1) The prior probabilities are the same
(2) There is a common covariance matrix for the class conditional densities
(3) Both class conditional densities are multivariate Gaussian
Ex. Show that Fisher's LD classifier and LDA produce the
same classification rule given the above assumptions
Note: (1) Fisher's LD does not assume Gaussian densities
(2) Fisher's LD can be used for dimension reduction in a multiple-class scenario
Logistic Regression
• The output of the regression is the posterior
probability, i.e., Pr(output | input)
• Always ensures that the sum of the output variables
is 1 and each output is non-negative
• A linear classification method
• We need to know about two concepts to
understand logistic regression
– Newton-Raphson method
– Maximum likelihood estimation
Newton-Raphson Method
A technique for solving a non-linear equation f(x) = 0
Taylor series: $f(x_{n+1}) \approx f(x_n) + (x_{n+1} - x_n) f'(x_n)$
After rearrangement: $x_{n+1} = x_n + \dfrac{f(x_{n+1}) - f(x_n)}{f'(x_n)}$
If $x_{n+1}$ is a root or very close to the root, then $f(x_{n+1}) \approx 0$
So:
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
Rule for iteration; need an initial guess $x_0$
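A small Python sketch of this scalar iteration; the example function, tolerance, and iteration cap are my own choices:

```python
def newton_raphson(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Iterate x_{n+1} = x_n - f(x_n) / f'(x_n) starting from the initial guess x0."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:        # stop once the update is negligible
            break
    return x

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2) ~ 1.41421356
print(newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```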
Newton-Raphson in Multi-dimensions
We want to solve the equations:
$$f_1(x_1, x_2, \dots, x_N) = 0, \quad f_2(x_1, x_2, \dots, x_N) = 0, \quad \dots, \quad f_N(x_1, x_2, \dots, x_N) = 0$$
Taylor series:
$$f_j(x + \Delta x) \approx f_j(x) + \sum_{k=1}^{N}\frac{\partial f_j}{\partial x_k}\,\Delta x_k, \qquad j = 1, \dots, N$$
After some rearrangement, the rule for iteration (need an initial guess):
$$\begin{pmatrix} x_1^{n+1}\\ x_2^{n+1}\\ \vdots\\ x_N^{n+1} \end{pmatrix} =
\begin{pmatrix} x_1^{n}\\ x_2^{n}\\ \vdots\\ x_N^{n} \end{pmatrix} -
\begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_N}\\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_N}\\
\vdots & & & \vdots\\
\frac{\partial f_N}{\partial x_1} & \frac{\partial f_N}{\partial x_2} & \cdots & \frac{\partial f_N}{\partial x_N}
\end{pmatrix}^{-1}
\begin{pmatrix} f_1(x_1^n, x_2^n, \dots, x_N^n)\\ f_2(x_1^n, x_2^n, \dots, x_N^n)\\ \vdots\\ f_N(x_1^n, x_2^n, \dots, x_N^n) \end{pmatrix}$$
The matrix of partial derivatives is the Jacobian matrix.
Newton-Raphson : Example
Solve:
$$f_1(x_1, x_2) = x_1^2 + \cos(x_2) = 0$$
$$f_2(x_1, x_2) = \sin(x_1) + x_1^2 x_2^3 = 0$$
Iteration rule (need an initial guess):
$$\begin{pmatrix} x_1^{n+1}\\ x_2^{n+1} \end{pmatrix} =
\begin{pmatrix} x_1^{n}\\ x_2^{n} \end{pmatrix} -
\begin{pmatrix} 2x_1^n & -\sin(x_2^n)\\ \cos(x_1^n) + 2x_1^n (x_2^n)^3 & 3(x_1^n)^2 (x_2^n)^2 \end{pmatrix}^{-1}
\begin{pmatrix} (x_1^n)^2 + \cos(x_2^n)\\ \sin(x_1^n) + (x_1^n)^2 (x_2^n)^3 \end{pmatrix}$$
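A NumPy sketch of this vector iteration for the system above, using the signs as reconstructed here; the initial guess and stopping rule are my own, and convergence depends on the starting point:

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([x1**2 + np.cos(x2),
                     np.sin(x1) + x1**2 * x2**3])

def jacobian(x):
    x1, x2 = x
    return np.array([[2 * x1,                       -np.sin(x2)],
                     [np.cos(x1) + 2 * x1 * x2**3,   3 * x1**2 * x2**2]])

x = np.array([0.5, 1.5])                    # initial guess
for _ in range(50):
    step = np.linalg.solve(jacobian(x), f(x))
    x -= step
    if np.linalg.norm(step) < 1e-10:        # stop once the update is negligible
        break
print(x, f(x))                              # x should approximately satisfy f(x) = 0
```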
Maximum Likelihood Parameter
Estimation
Let’s start with an example. We want to find out the unknown
parameters mean and standard deviation of a Gaussian pdf,
given N independent samples from it.
$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Samples: $x_1, \dots, x_N$
Form the likelihood function:
$$L(\mu, \sigma) = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Estimate the parameters that maximize the likelihood function:
$$(\hat{\mu}, \hat{\sigma}) = \arg\max_{\mu, \sigma} L(\mu, \sigma)$$
Let's find out $(\hat{\mu}, \hat{\sigma})$
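One way to work this out: take the log-likelihood and set its partial derivatives to zero, which gives the sample mean and the (biased) sample variance:
$$\log L(\mu, \sigma) = -N\log\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$$
$$\frac{\partial \log L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
$$\frac{\partial \log L}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{N}(x_i - \mu)^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^2$$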
Logistic Regression Model
The method directly models the posterior probabilities as the output of regression
$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}, \qquad k = 1, \dots, K-1$$
$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$$
Note that the class boundaries are linear
How can we show this linear nature?
What is the discriminant function for every class in this model?
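One way to see the linearity: the log-odds with respect to class K are linear in x, and they can serve as the discriminant functions:
$$\log\frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x, \qquad k = 1, \dots, K-1$$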
Logistic Regression Computation
Let’s fit the logistic regression model for K=2, i.e., number of classes is 2
Training set: (xi, gi), i=1,…,N
Log-likelihood:
$$\ell(\beta) = \sum_{i=1}^{N}\log \Pr(G = y_i \mid X = x_i)$$
$$= \sum_{i=1}^{N}\Big[\, y_i \log \Pr(G = 1 \mid X = x_i) + (1 - y_i)\log \Pr(G = 0 \mid X = x_i)\,\Big]$$
$$= \sum_{i=1}^{N}\Big[\, y_i\,\beta^T x_i + \log\frac{1}{1 + \exp(\beta^T x_i)}\,\Big]$$
$$= \sum_{i=1}^{N}\Big[\, y_i\,\beta^T x_i - \log\bigl(1 + \exp(\beta^T x_i)\bigr)\,\Big]$$
We want to maximize the log-likelihood in order to estimate $\beta$
Logistic Regression Computation…
$$\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{N}\left( y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}\right) x_i = 0$$
These are (p+1) non-linear equations in $\beta$
Solve by the Newton-Raphson method:
$$\beta^{\text{new}} = \beta^{\text{old}} - \left[\frac{\partial^2 \ell(\beta^{\text{old}})}{\partial\beta\,\partial\beta^T}\right]^{-1}\frac{\partial \ell(\beta^{\text{old}})}{\partial\beta}$$
(the matrix of second derivatives is the Jacobian of the gradient)
Let's work out the details hidden in the above equation.
In the process we'll learn a bit about vector differentiation etc.
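A NumPy sketch of this Newton-Raphson update for the two-class model; it uses the standard closed form of the second derivative, $\partial^2\ell/\partial\beta\,\partial\beta^T = -\sum_i p_i(1 - p_i)\,x_i x_i^T$ with $p_i = \Pr(G = 1 \mid x_i)$, which the notes have not derived yet, and the function name is my own:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson for two-class logistic regression.
    X: (N, p+1) array whose first column is all 1s; y: labels in {0, 1}."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))             # p_i = Pr(G = 1 | x_i)
        gradient = X.T @ (y - p)                        # d l / d beta
        hessian = -(X * (p * (1 - p))[:, None]).T @ X   # d^2 l / (d beta d beta^T)
        beta -= np.linalg.solve(hessian, gradient)      # beta_new = beta_old - H^{-1} grad
    return beta
```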