Transcript Document

Linear Classifiers
Dept. of Computer Science & Engineering,
Shanghai Jiao Tong University
Outline
• Linear Regression
• Linear and Quadratic Discriminant Functions
• Reduced Rank Linear Discriminant Analysis
• Logistic Regression
• Separating Hyperplanes
Linear Regression
• The response classes are coded by indicator variables: for K classes there are K indicator variables $Y_k$, $k = 1, 2, \ldots, K$, each indicating membership in one class. The N training instances of the indicator vector form the $N \times K$ indicator response matrix $Y$.
• For each class $k$ the training data define a mapping $X \mapsto Y_k$.
• According to the linear regression model:
  $\hat f(x)^T = (1, x^T)\hat B$
  $\hat Y = X(X^T X)^{-1} X^T Y$
Linear Regression
• Given $x$, the classification is
  $\hat G(x) = \arg\max_{k \in \mathcal{G}} \hat f_k(x)$
• Equivalently, a target $t_k$ is constructed for each class, where $t_k$ is the $k$-th column of the $K \times K$ identity matrix, and $\hat B$ is chosen by the sum-of-squared-norm criterion
  $\min_B \sum_{i=1}^{N} \left\| y_i - \left[(1, x_i^T)B\right]^T \right\|^2$
  and the classification is then
  $\hat G(x) = \arg\min_{k \in \mathcal{G}} \| \hat f(x) - t_k \|^2$
Problems of Linear Regression
• The data come from three classes in $\mathbb{R}^2$ and are easily separated by linear decision boundaries.
• The left plot shows the boundaries found by linear regression of the indicator response variables.
• The middle class is completely masked.
The rug plot at the bottom indicates the positions and class membership of each observation; the three curves are the fitted regressions to the three class-indicator variables.
Problems of Linear Regression
• The left plot shows the boundaries found by linear discriminant analysis; the right plot shows the fitted regressions to the three class-indicator variables.
Linear Discriminant Analysis
• According to the Bayes-optimal classification mentioned in Chapter 2, we need the posterior probabilities $\Pr(G \mid X)$.
• Assume:
  $f_k(x)$ — the class-conditional density of $X$ in class $G = k$;
  $\pi_k$ — the prior probability of class $k$, with $\sum_{k=1}^{K} \pi_k = 1$.
• Bayes' theorem gives us the discriminant:
  $\Pr(G = k \mid X = x) = \dfrac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$
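A small numpy/scipy sketch of this posterior computation, assuming Gaussian class-conditional densities with made-up parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-class setup in R^2 (parameters are made up).
priors = np.array([0.6, 0.4])
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
cov = np.eye(2)

def posterior(x):
    """Pr(G = k | X = x) = f_k(x) pi_k / sum_l f_l(x) pi_l."""
    dens = np.array([multivariate_normal.pdf(x, mean=m, cov=cov) for m in means])
    unnorm = dens * priors
    return unnorm / unnorm.sum()

print(posterior(np.array([1.0, 0.5])))   # posterior probabilities, sum to 1
```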
Linear Discriminant Analysis
• Multivariate Gaussian density:
  $f_k(x) = \dfrac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}$
• Comparing two classes $k$ and $l$, assume $\Sigma_k = \Sigma_l = \Sigma$:
  $\log\dfrac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log\dfrac{f_k(x)}{f_l(x)} + \log\dfrac{\pi_k}{\pi_l}$
  $\qquad = \log\dfrac{\pi_k}{\pi_l} - \dfrac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l)$
Linear Discriminant Analysis
• The linear log-odds function above implies that the decision boundary between classes $k$ and $l$ is linear in $x$; in $p$ dimensions it is a hyperplane.
• Linear discriminant function:
  $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \dfrac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log\pi_k$
• So we estimate $\hat\mu_k$, $\hat\pi_k$, $\hat\Sigma$.
Parameter Estimation
$\hat\pi_k = \dfrac{N_k}{N}$, where $N_k$ is the number of class-$k$ observations;
$\hat\mu_k = \sum_{g_i = k} x_i / N_k$;
$\hat\Sigma = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K)$.
LDA Rule
$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \dfrac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log\pi_k$
LDA rule: $\hat G(x) = \arg\max_l \,\delta_l(x)$
Decision boundary: $\{x \mid \delta_k(x) = \delta_l(x)\}$
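A minimal numpy sketch of the whole pipeline on made-up data: estimate $\hat\pi_k$, $\hat\mu_k$, $\hat\Sigma$ as above, then classify by the largest discriminant (function and variable names are my own):

```python
import numpy as np

def lda_fit(X, g, K):
    """Pooled-covariance LDA estimates: priors, class means, common Sigma."""
    N, p = X.shape
    priors = np.array([(g == k).mean() for k in range(K)])
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = sum(
        (X[g == k] - means[k]).T @ (X[g == k] - means[k]) for k in range(K)
    ) / (N - K)
    return priors, means, Sigma

def lda_predict(X, priors, means, Sigma):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    deltas = X @ Sinv @ means.T \
        - 0.5 * np.sum(means @ Sinv * means, axis=1) \
        + np.log(priors)
    return deltas.argmax(axis=1)

# Made-up example with two Gaussian classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
g = np.repeat([0, 1], 50)
print(lda_predict(X, *lda_fit(X, g, K=2))[:5])
```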
Linear Discriminant Analysis
• Three Gaussian distributions with the same covariance and different means. The Bayes boundaries are shown on the left (solid lines). On the right are the fitted LDA boundaries for a sample of 30 points drawn from each Gaussian.
Quadratic Discriminant Analysis
• When the covariances $\Sigma_k$ differ:
  $\delta_k(x) = -\dfrac{1}{2}\log|\Sigma_k| - \dfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log\pi_k$
• This is the quadratic discriminant function.
• The decision boundary between each pair of classes is described by a quadratic equation:
  $\{x : \delta_k(x) = \delta_l(x)\}$
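A short numpy sketch of the quadratic discriminant, assuming per-class priors, means, and covariance estimates have already been computed (the function name and argument names are my own):

```python
import numpy as np

def qda_discriminants(X, priors, means, covs):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k."""
    deltas = []
    for pi_k, mu_k, S_k in zip(priors, means, covs):
        diff = X - mu_k
        quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S_k), diff)
        deltas.append(-0.5 * np.linalg.slogdet(S_k)[1] - 0.5 * quad + np.log(pi_k))
    return np.stack(deltas, axis=1)        # shape (N, K)

# prediction: qda_discriminants(X, priors, means, covs).argmax(axis=1)
```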
LDA & QDA
• Boundaries for a three-class problem found by linear discriminant analysis in the original two-dimensional space $X_1, X_2$ (left) and in the five-dimensional space $X_1, X_2, X_1X_2, X_1^2, X_2^2$ (right).
LDA & QDA
• Boundaries for the three-class problem found by LDA in the five-dimensional space above (left) and by quadratic discriminant analysis (right).
Regularized Discriminant Analysis
• Shrink the separate covariances of QDA toward a common covariance as in LDA:
  $\hat\Sigma_k(\alpha) = \alpha\hat\Sigma_k + (1 - \alpha)\hat\Sigma, \quad \alpha \in [0,1]$   (regularized QDA)
• $\hat\Sigma$ itself can be shrunk toward a scalar covariance:
  $\hat\Sigma(\gamma) = \gamma\hat\Sigma + (1 - \gamma)\hat\sigma^2 I, \quad \gamma \in [0,1]$   (regularized LDA)
• Together: $\hat\Sigma_k(\alpha, \gamma)$
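A one-function numpy sketch of the combined shrinkage $\hat\Sigma_k(\alpha,\gamma)$; the function name is my own, and the scalar variance estimate (average diagonal of $\hat\Sigma$) is an assumption:

```python
import numpy as np

def regularized_cov(Sigma_k, Sigma, alpha, gamma):
    """Sigma_k(alpha, gamma) = alpha * Sigma_k + (1 - alpha) * Sigma(gamma),
    where Sigma(gamma) = gamma * Sigma + (1 - gamma) * sigma^2 * I."""
    p = Sigma.shape[0]
    sigma2 = np.trace(Sigma) / p                  # assumed scalar variance estimate
    Sigma_gamma = gamma * Sigma + (1 - gamma) * sigma2 * np.eye(p)
    return alpha * Sigma_k + (1 - alpha) * Sigma_gamma
```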
Regularized Discriminant Analysis
• One could also use $\hat\Sigma(\gamma) = \gamma\hat\Sigma + (1 - \gamma)\,\mathrm{diag}(\hat\Sigma)$.
• In recent microarray work, a diagonal covariance is used together with shrunken class centroids:
  $\delta_k(x) = -\sum_{j=1}^{p} \dfrac{(x_j - \hat\mu_{jk})^2}{s_j^2} + 2\log\pi_k$
  where $\hat\mu_{jk}$ is a “shrunken centroid”.
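A brief numpy sketch of this diagonal-covariance discriminant, assuming the shrunken centroids `mu_shrunk` (K x p) and within-class standard deviations `s` (length p) have already been computed; the shrinkage step itself is not shown:

```python
import numpy as np

def shrunken_centroid_discriminants(X, mu_shrunk, s, priors):
    """delta_k(x) = -sum_j (x_j - mu_jk)^2 / s_j^2 + 2 log pi_k."""
    # X: (N, p), mu_shrunk: (K, p), s: (p,), priors: (K,)
    d2 = ((X[:, None, :] - mu_shrunk[None, :, :]) / s) ** 2   # (N, K, p)
    return -d2.sum(axis=2) + 2 * np.log(priors)               # (N, K)
```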
• Test and training errors for the vowel data, using regularized discriminant analysis with a series of values of $\alpha \in [0,1]$. The optimum for the test data occurs around $\alpha \approx 0.9$, close to quadratic discriminant analysis.
Computations for LDA
$\delta_k(x) = -\dfrac{1}{2}\log|\Sigma_k| - \dfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log\pi_k$
• Compute the eigen-decomposition for each $\hat\Sigma_k = U_k D_k U_k^T$, where $U_k$ is $p \times p$ orthonormal and $D_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$.
• The ingredients for $\delta_k(x)$ are then:
  $(x - \hat\mu_k)^T \hat\Sigma_k^{-1} (x - \hat\mu_k) = \left[U_k^T(x - \hat\mu_k)\right]^T D_k^{-1} \left[U_k^T(x - \hat\mu_k)\right]$
  $\log|\hat\Sigma_k| = \sum_l \log d_{kl}$
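A short numpy sketch of these two ingredients (the function and variable names are my own):

```python
import numpy as np

def lda_ingredients(x, mu_k, Sigma_k):
    """Mahalanobis term and log-determinant via the eigendecomposition of Sigma_k."""
    d, U = np.linalg.eigh(Sigma_k)        # Sigma_k = U diag(d) U^T
    z = U.T @ (x - mu_k)                  # rotate into the eigenbasis
    mahalanobis = np.sum(z ** 2 / d)      # (x - mu)^T Sigma^{-1} (x - mu)
    log_det = np.sum(np.log(d))           # log |Sigma_k|
    return mahalanobis, log_det
```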
Reduced Rank LDA
• Let $\hat\Sigma = UDU^T$.
• Let $X^* = D^{-1/2}U^T X$ and $\hat\mu_k^* = D^{-1/2}U^T\hat\mu_k$, i.e. sphere the data using $\hat\Sigma^{-1/2}$.
• LDA: classify $x$ to the class $k$ minimizing
  $\dfrac{1}{2}\|x^* - \hat\mu_k^*\|^2 - \log\hat\pi_k$
• That is, the closest centroid in the sphered space (apart from the $\log\hat\pi_k$ term).
• Can project the data onto the $(K-1)$-dimensional subspace spanned by $\hat\mu_1^*, \ldots, \hat\mu_K^*$, and lose nothing!
• Can project onto an even lower-dimensional subspace using principal components of the $\hat\mu_k^*$, $k = 1, \ldots, K$.
Reduced Rank LDA
• Compute the $K \times p$ matrix $M$ of class centroids.
• Compute $W = \hat\Sigma$ (the within-class covariance) and $M^* = M W^{-1/2}$.
• Compute $B^*$, the covariance matrix of $M^*$, and its eigen-decomposition $B^* = V^* D_B V^{*T}$.
• $Z_l = v_l^T X$ with $v_l = W^{-1/2} v_l^*$ is the $l$-th discriminant variable (canonical variable).
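A compact numpy sketch of these steps on inputs `X` with labels `g`; the function name and the eigendecomposition route to $W^{-1/2}$ are my own choices:

```python
import numpy as np

def discriminant_coordinates(X, g, K):
    """Canonical (discriminant) variables Z_l = v_l^T x for reduced-rank LDA."""
    N, p = X.shape
    M = np.array([X[g == k].mean(axis=0) for k in range(K)])    # K x p centroid matrix
    W = sum((X[g == k] - M[k]).T @ (X[g == k] - M[k])
            for k in range(K)) / (N - K)                         # within-class covariance
    d, U = np.linalg.eigh(W)
    W_inv_half = U @ np.diag(d ** -0.5) @ U.T                    # W^{-1/2}
    M_star = M @ W_inv_half                                      # sphered centroids
    B_star = np.cov(M_star, rowvar=False)                        # covariance of M*
    dB, V_star = np.linalg.eigh(B_star)                          # B* = V* D_B V*^T
    V = W_inv_half @ V_star[:, np.argsort(dB)[::-1]]             # v_l = W^{-1/2} v_l*
    return X @ V                                                 # columns are Z_1, Z_2, ...
```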
• A two-dimensional plot of the vowel training data. There are eleven classes with $X \in \mathbb{R}^{10}$, and this is the best view in terms of an LDA model. The heavy circles are the projected mean vectors for each class.
• Projections onto different pairs of canonical variates.
Reduced Rank LDA
• Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel).
• The discriminant direction minimizes this overlap for Gaussian data (right panel).
Fisher’s problem
• Find $Z = a^T X$ such that the “between-class variance” is maximized relative to the “within-class variance”.
• Maximize the “Rayleigh quotient”:
  $\max_a \dfrac{a^T B a}{a^T W a}$
  equivalently, $\max_a a^T B a$ subject to $a^T W a = 1$; the solution is $a = v_1$.
• Then $\max_{a_2} a_2^T B a_2$ subject to $a_2^T W a_2 = 1$ and $a_2^T W a_1 = 0$, giving $a_2 = v_2$, etc.
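A small sketch of solving this via the generalized eigenproblem $Ba = \lambda Wa$ with `scipy.linalg.eigh`; `B` and `W` are assumed to be the between- and within-class covariance matrices computed as in the reduced-rank LDA steps above:

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(B, W):
    """Directions a_1, a_2, ... maximizing a^T B a subject to a^T W a = 1
    (each W-orthogonal to the previous ones)."""
    eigvals, A = eigh(B, W)               # generalized eigenproblem B a = lambda W a
    order = np.argsort(eigvals)[::-1]     # largest Rayleigh quotient first
    return A[:, order]                    # columns are a_1, a_2, ...
```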
• Training and test error rates for the vowel data, as a function of the dimension of the discriminant subspace.
• In this case the best rate is for dimension 2.
South African Heart Disease Data
Logistic Regression
• Model:
  $\log\dfrac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{10} + \beta_1^T x$
  $\log\dfrac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{20} + \beta_2^T x$
  $\vdots$
  $\log\dfrac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{(K-1)0} + \beta_{K-1}^T x$
• Equivalently:
  $\Pr(G = k \mid X = x) = \dfrac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}, \quad k = 1, \ldots, K-1$
  $\Pr(G = K \mid X = x) = \dfrac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$
• Log-likelihood:
  $\ell(\theta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta)$
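A hedged sketch of fitting such a model: scikit-learn's multinomial `LogisticRegression` stands in for maximizing the log-likelihood above (it parametrizes all K classes with a softmax rather than K-1 log-ratios, but it is the same model); the data here are made up just to make the snippet runnable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up three-class data.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(k, 1.0, (40, 2)) for k in range(3)])
g = np.repeat([0, 1, 2], 40)

# Maximizes the multinomial log-likelihood sum_i log p_{g_i}(x_i; theta).
clf = LogisticRegression(max_iter=1000).fit(X, g)
print(clf.intercept_, clf.coef_)     # intercepts and coefficient vectors, one row per class
print(clf.predict_proba(X[:3]))      # Pr(G = k | X = x)
```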
Logistic Regression / LDA
• For LDA:
  $\log\dfrac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \log\dfrac{\pi_k}{\pi_K} - \dfrac{1}{2}(\mu_k + \mu_K)^T\Sigma^{-1}(\mu_k - \mu_K) + x^T\Sigma^{-1}(\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^T x$
  — the same form as logistic regression.
• LR maximizes the conditional likelihood $\Pr(G = k \mid X = x) = \dfrac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$, leaving the marginal probability $\Pr(X)$ unspecified in the factorization $\Pr(X, G = k) = \Pr(X)\Pr(G = k \mid X = x)$.
• LDA maximizes the full likelihood based on the joint density $\Pr(X, G = k) = \phi(X; \mu_k, \Sigma)\,\pi_k$, with marginal $\Pr(X) = \sum_{k=1}^{K}\pi_k\,\phi(X; \mu_k, \Sigma)$.
Rosenblatt’s Perceptron Learning Algorithm
$D(\beta, \beta_0) = -\sum_{i \in M} y_i(x_i^T\beta + \beta_0)$, where $y_i \in \{-1, 1\}$ and $M$ is the set of misclassified observations.
• $D(\beta, \beta_0)$ is proportional to the distance of the points in $M$ from the decision boundary.
• Stochastic gradient descent:
  $\dfrac{\partial D}{\partial\beta} = -\sum_{i \in M} y_i x_i, \qquad \dfrac{\partial D}{\partial\beta_0} = -\sum_{i \in M} y_i$
  For each misclassified observation, update
  $\begin{pmatrix}\beta \\ \beta_0\end{pmatrix} \leftarrow \begin{pmatrix}\beta \\ \beta_0\end{pmatrix} + \rho\begin{pmatrix}y_i x_i \\ y_i\end{pmatrix}$
  with learning rate $\rho$ (which can be taken to be 1).
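A compact numpy sketch of the stochastic update, with the learning rate fixed at 1.0; the data `X` and labels `y` in $\{-1, +1\}$ are assumed to exist, and the function name is my own:

```python
import numpy as np

def perceptron(X, y, epochs=50, rho=1.0):
    """Rosenblatt updates: for each misclassified (x_i, y_i),
    beta += rho * y_i * x_i and beta0 += rho * y_i."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (x_i @ beta + beta0) <= 0:   # misclassified (or on the boundary)
                beta += rho * y_i * x_i
                beta0 += rho * y_i
    return beta, beta0
```

The loop converges in finitely many updates only when the classes are linearly separable; otherwise it cycles, which is one motivation for the optimal separating hyperplane below.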
Separating Hyperplanes
• A toy example with two
classes separable by
hyperplane. The orange
line is the least squares
solution, which
misclassifies one of the
training points. Also
shown are two blue
separating hyperplanes
found by the perceptron
learning algorithm with
different random starts.
Optimal Separating Hyperplanes
$\max_{\beta,\,\beta_0,\,\|\beta\| = 1} C$
subject to
$y_i(x_i^T\beta + \beta_0) \geq C, \quad i = 1, \ldots, N$
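As a hedged illustration, the hard-margin solution can be approximated with scikit-learn's linear SVM by using a very large penalty parameter (note that scikit-learn's `C` is a penalty weight, unrelated to the margin $C$ in the formulation above, which it does not solve directly); the data here are made up:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up separable data with labels in {-1, +1}.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y = np.repeat([-1, 1], 30)

# A large penalty approximates the hard-margin optimal separating hyperplane.
svm = SVC(kernel="linear", C=1e6).fit(X, y)
beta, beta0 = svm.coef_.ravel(), svm.intercept_[0]
margin = 1.0 / np.linalg.norm(beta)    # geometric margin C = 1 / ||beta||
print(beta, beta0, margin)
```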