Introduction to Decision Theory and Machine Learning


Elements of Pattern Recognition
CNS/EE-148 -- Lecture 5
M. Weber
P. Perona
What is Classification?
• We want to assign objects to classes based on a selection of
attributes (features).
• Examples:
– (age, income) → {credit worthy, not credit worthy}
– (blood cell count, body temp) → {flu, hepatitis B, hepatitis C}
– (pixel vector) → {Bill Clinton, coffee cup}
• Feature vector can be continuous, discrete or mixed.
What is Classification?
• Want to find a function from measurements (feature vectors) to class labels:

c : \mathbb{R}^2 \to \{C_0, C_1, C_2, C_?\}

• Statistical methods use the joint pdf p(C, x)
• Assume p(C, x) is known for now

[Figure: space of feature vectors (x_1, x_2) with Signal 1, Signal 2, and Noise clusters separated by a decision boundary]
Some Terminology
• p(C) is called a prior or a priori probability
• p(x|C) is called a class-conditional density, or the likelihood of C with respect to x
• p(C|x) is called a posterior or a posteriori probability
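As a quick numerical illustration (not from the original slides; toy numbers only), Bayes' rule turns a prior and a class-conditional likelihood into a posterior:

```python
import numpy as np

# Bayes' rule: p(C|x) = p(x|C) p(C) / p(x), with p(x) = sum_i p(x|C_i) p(C_i).
priors = np.array([0.7, 0.3])             # p(C_1), p(C_2)  (illustrative values)
likelihoods = np.array([0.2, 0.6])        # p(x|C_1), p(x|C_2) evaluated at some x

evidence = np.sum(likelihoods * priors)   # p(x)
posteriors = likelihoods * priors / evidence
print(posteriors)                         # p(C_1|x), p(C_2|x); they sum to 1
```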
Examples
• One measurement, symmetric cost, equal priors
P(\text{error}) = \int P(\text{error} \mid x)\, p(x)\, dx

P(\text{error} \mid x) =
\begin{cases}
P(C_1 \mid x), & \text{if } c(x) = C_2 \\
P(C_2 \mid x), & \text{if } c(x) = C_1
\end{cases}

[Figure: a "bad" decision threshold for the class-conditional densities p(x|C1) and p(x|C2) plotted over x]
Examples
• One measurement, symmetric cost, equal priors
[Figure: a "good" decision threshold for the class-conditional densities p(x|C1) and p(x|C2) plotted over x]
How to Make the Best Decision?
(Bayes Decision Theory)
• Define a cost function for mistakes, e.g.
L(i, j) = 1 - \delta_{ij}
• Minimize expected loss (risk) over entire p(C,x).
R  E[ L(C , c( x))]  E[ E[ L(C , c( x)) | x]]
  E[ L(C , c( x)) | x]p( x)dx
• It is therefore sufficient to make the optimal decision for each individual x.
E[L(C, c(x)) \mid x] = \sum_{i=1}^{N} L(C_i, c(x))\, p(C_i \mid x)
• Result: decide according to maximum posterior probability:
c(x) = \arg\max_i \; p(C_i \mid x)
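As a sketch of the decision rule above (my illustration, not part of the lecture), minimizing the conditional expected loss reduces to the MAP rule when the loss is 0/1:

```python
import numpy as np

def bayes_decision(posteriors, loss):
    """Pick the class index that minimizes E[L(C, c(x)) | x].

    posteriors: p(C_i | x), shape (N,)
    loss:       loss[i, j] = cost of deciding class j when the true class is i
    """
    expected_loss = loss.T @ posteriors   # entry j: sum_i L(C_i, C_j) p(C_i | x)
    return int(np.argmin(expected_loss))

# With the 0/1 loss L(i, j) = 1 - delta_ij, this is exactly the MAP rule.
posteriors = np.array([0.2, 0.5, 0.3])
zero_one_loss = 1.0 - np.eye(3)
assert bayes_decision(posteriors, zero_one_loss) == int(np.argmax(posteriors))
```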
Two Classes, C1, C2
• It is helpful to consider the likelihood ratio:
\frac{p(C_1 \mid x)}{p(C_2 \mid x)} = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}
• Use known priors p(Ci) or ignore them.
• For a more elaborate loss function (the proof is easy):

g(x) = \frac{p(x \mid C_1)}{p(x \mid C_2)} \;\gtrless\; \frac{l_{12} - l_{22}}{l_{21} - l_{11}} \cdot \frac{p(C_2)}{p(C_1)}
• g(x) is called a discriminant function
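A minimal sketch of this likelihood-ratio test (the function and the example numbers are mine, assuming the loss convention l_ij = cost of deciding C_i when the true class is C_j):

```python
import numpy as np

def likelihood_ratio_decision(px_c1, px_c2, p1, p2,
                              l11=0.0, l12=1.0, l21=1.0, l22=0.0):
    """Decide C1 if the likelihood ratio g(x) exceeds the loss/prior threshold."""
    g = px_c1 / px_c2
    threshold = (l12 - l22) / (l21 - l11) * (p2 / p1)
    return "C1" if g > threshold else "C2"

# With symmetric (0/1) loss and equal priors the threshold is 1.
print(likelihood_ratio_decision(px_c1=0.3, px_c2=0.1, p1=0.5, p2=0.5))  # -> C1
```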
Discriminant Functions for Multivariate
Gaussian Class Conditional Densities
• Two multivariate Gaussians in d dimensions
• Since log is monotonic, we can look at log g(x).
\log g(x) = \log\frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} = g_1(x) - g_2(x)

g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) - \tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma_i| + \log p(C_i)

(The quadratic term is the squared Mahalanobis distance; the (d/2) log 2π term is superfluous, since it does not depend on the class.)
Mahalanobis Distance
d_i^2(x) = (x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i)

• iso-distance lines = iso-probability lines
• Decision surface: d_1^2(x) - d_2^2(x) = \text{const.}

[Figure: (x_1, x_2) plane with iso-distance ellipses around μ_1 and μ_2 and the resulting decision surface]
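A short sketch (mine, not from the slides) of the squared Mahalanobis distance defined above:

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance d^2(x) = (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(sigma, diff))

# Toy 2-D example; the numbers are illustrative only.
mu = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mahalanobis_sq(np.array([2.0, 1.0]), mu, sigma))
```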
Case 1: Σ_i = σ²I
• Discriminant functions…

g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log p(C_i)

• …simplify to:

g_1(x) - g_2(x) = -\frac{\|x - \mu_1\|^2}{2\sigma^2} + \frac{\|x - \mu_2\|^2}{2\sigma^2} + \log\frac{p(C_1)}{p(C_2)}

= -\frac{1}{2\sigma^2}\big( x^T x - 2\mu_1^T x + \mu_1^T \mu_1 - x^T x + 2\mu_2^T x - \mu_2^T \mu_2 \big) + \log\frac{p(C_1)}{p(C_2)}

= \frac{(\mu_1 - \mu_2)^T (x - \mu_2)}{\sigma^2} - \frac{(\mu_1 - \mu_2)^T (\mu_1 - \mu_2)}{2\sigma^2} + \log\frac{p(C_1)}{p(C_2)}
Decision Boundary
g_1(x) - g_2(x) = 0

\Leftrightarrow\; (\mu_1 - \mu_2)^T (x - \mu_2) = \tfrac{1}{2}\|\mu_1 - \mu_2\|^2 - \sigma^2 \log\frac{p(C_1)}{p(C_2)}

\Leftrightarrow\; \frac{(\mu_1 - \mu_2)^T}{\|\mu_1 - \mu_2\|}(x - \mu_2) = \tfrac{1}{2}\|\mu_1 - \mu_2\| - \frac{\sigma^2}{\|\mu_1 - \mu_2\|}\log\frac{p(C_1)}{p(C_2)}

• If μ_2 = 0, we obtain…

\frac{\mu_1^T}{\|\mu_1\|}\, x = \tfrac{1}{2}\|\mu_1\| - \frac{\sigma^2}{\|\mu_1\|}\log\frac{p(C_1)}{p(C_2)}
The matched filter! With an expression for the threshold.
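As an illustration of the matched-filter rule just derived (a sketch with made-up numbers, not the lecture's code): project x onto μ_1 − μ_2 and compare against the threshold.

```python
import numpy as np

def matched_filter_decide(x, mu1, mu2, sigma2, p1, p2):
    """Case Sigma_i = sigma^2 I: project onto (mu1 - mu2), then threshold."""
    w = mu1 - mu2
    norm_w = np.linalg.norm(w)
    projection = w @ (x - mu2) / norm_w
    threshold = 0.5 * norm_w - (sigma2 / norm_w) * np.log(p1 / p2)
    return 1 if projection > threshold else 2

# Illustrative numbers only.
mu1, mu2 = np.array([2.0, 0.0]), np.array([0.0, 0.0])
print(matched_filter_decide(np.array([1.5, 0.3]), mu1, mu2,
                            sigma2=1.0, p1=0.5, p2=0.5))   # -> 1
```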
Two Signals and Additive White Gaussian Noise
( 1  2 )T
1
2
p(C1 )
( x  2 )  1  2 
log
1  2
2
1  2
p(C2 )
x2
Signal 1
1-2
1
2
p(C1 )
1  2 
log
2
1  2
p(C2 )
1
x
2
x-2
Signal 2
x1
Case 2: Σ_i = Σ
• Two classes, 2D measurements, p(x|C) are
multivariate Gaussians with equal covariance
matrices.
• Derivation is similar
– Quadratic term vanishes since it is independent of class
– We obtain a linear decision surface
• Matlab demo
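The Matlab demo itself is not reproduced in this transcript; the following is a rough Python sketch (mine) of what a linear classifier with a shared covariance matrix looks like:

```python
import numpy as np

def lda_decision(x, mu1, mu2, sigma, p1, p2):
    """Equal covariance case: the quadratic term cancels, so the decision
    surface is linear, w^T x + w0 = 0 with w = Sigma^{-1}(mu1 - mu2)."""
    w = np.linalg.solve(sigma, mu1 - mu2)
    w0 = -0.5 * w @ (mu1 + mu2) + np.log(p1 / p2)
    return 1 if w @ x + w0 > 0 else 2

# Toy example with identity covariance (illustrative values).
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
print(lda_decision(np.array([0.5, 0.2]), mu1, mu2, np.eye(2), 0.5, 0.5))  # -> 1
```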
Case 3: General Covariance Matrix
• See transparency
Isn't this too simple?
• Not at all…
• It is true that images form complicated manifolds (from a
pixel point of view, translation, rotation and scaling are all
highly non-linear operations)
• The high dimensionality helps
Assume Unknown Class Densities
• In real life, we do not know the class conditional densities.
• But we do have example data.
• This puts us in the typical machine learning scenario:
We want to learn a function, c(x), from examples.
• Why not just estimate class densities from examples and
apply the previous ideas?
– Learning a Gaussian (a simple density) in N dimensions needs at least N² samples (the covariance matrix alone has on the order of N²/2 free parameters)!
• 10x10 pixels → 10,000 examples!
– Avoid estimating densities whenever you can! (too general)
– posterior is generally simpler than class conditional (see transparency)
Remember PCA?
• Principal components are eigenvectors of the covariance matrix:

C = \frac{1}{N}\sum_i (x_i - \mu)(x_i - \mu)^T = U S U^T

x \approx \mu + \hat{U} z

• Use reconstruction error for recognition (e.g. Eigenfaces)
– good
• reduces dimensionality
– bad
• no model within subspace
• linearity may be inappropriate
• covariance not appropriate to optimize discrimination

[Figure: data cloud in the (x_1, x_2) plane with mean μ, principal direction u_1, and a sample x]
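A compact sketch (mine, not the lecture's code) of PCA by eigendecomposition of the covariance matrix, keeping the top-k directions Û and reconstructing x ≈ μ + Ûz:

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of C = U S U^T. X has one sample per row."""
    mu = X.mean(axis=0)
    Xc = X - mu
    C = Xc.T @ Xc / X.shape[0]
    _, U = np.linalg.eigh(C)           # eigenvalues in ascending order
    U_hat = U[:, ::-1][:, :k]          # top-k principal directions
    Z = Xc @ U_hat                     # subspace coordinates z
    X_rec = mu + Z @ U_hat.T           # reconstruction x ~ mu + U_hat z
    return U_hat, Z, X_rec

# The reconstruction error ||x - x_rec|| can then be used for recognition
# (e.g. Eigenfaces), as mentioned on the slide.
```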
Fisher’s Linear Discriminant
• Goal: Reduce dimensionality
before training classifiers etc.
(Feature Selection)
• Similar goal as PCA!
• Fisher has classification in
mind…
• Find projection directions such
that separation is easiest
• Eigenfaces vs. Fisherfaces
[Figure: two class clusters in the (x_1, x_2) plane]
Fisher’s Linear Discriminant
• Assume we have n d-dimensional samples x1,…,xn
• n1 from set (class) X1 and n2 from set X2
• we form linear combinations:
y = w^T x

• and obtain y_1, …, y_n
• only direction of w is important
Objective for Fisher
• Measure the separation as the distance between the means
after projecting (k = 1,2):
\tilde{m}_k = \frac{1}{n_k}\sum_{y \in Y_k} y = \frac{1}{n_k}\sum_{x \in X_k} w^T x = w^T m_k
• Measure the scatter after projecting:
\tilde{s}_k^2 = \sum_{y \in Y_k} (y - \tilde{m}_k)^2
• Objective becomes to maximize
J(w) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}
• We need to make the dependence on w explicit:
\tilde{s}_k^2 = \sum_{x \in X_k} (w^T x - w^T m_k)^2 = \sum_{x \in X_k} w^T (x - m_k)(x - m_k)^T w = w^T S_k w
• Defining the within-class scatter matrix, SW=S1+S2, we obtain
\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w
• Similarly for the separation (between-class scatter matrix)
(\tilde{m}_1 - \tilde{m}_2)^2 = (w^T m_1 - w^T m_2)^2 = w^T (m_1 - m_2)(m_1 - m_2)^T w = w^T S_B w
• Finally we can write
J(w) = \frac{w^T S_B w}{w^T S_W w}
Fisher’s Solution
J(w) = \frac{w^T S_B w}{w^T S_W w}
• This is called a generalized Rayleigh quotient. Any w that
maximizes J must satisfy the generalized eigenvalue
problem
S_B w = \lambda S_W w
• Since S_B is singular (rank 1) and S_B w is always in the
direction of (m_1 − m_2), we are done:
w \propto S_W^{-1}(m_1 - m_2)
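A short sketch (mine) of Fisher's solution, computing w = S_W^{-1}(m_1 − m_2) from labelled samples:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w = S_W^{-1} (m1 - m2).

    X1, X2: samples of the two classes, one row per sample."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    S_W = S1 + S2                      # within-class scatter matrix
    w = np.linalg.solve(S_W, m1 - m2)
    return w / np.linalg.norm(w)       # only the direction matters

# Projections y = X @ w give 1-D features that can then be thresholded.
```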
Comments on FLD
• We did not follow Bayes Decision Theory
• FLD is useful for many types of densities
• Fisher can be extended (see demo):
– more than one projection direction
– more than two clusters
• Let’s try it out: Matlab Demo
Fisher vs. Bayes
• Assume we do have identical Gaussian class densities, then Bayes
says:
w^T x + w_0 = 0, \quad w = \Sigma^{-1}(\mu_1 - \mu_2)
• while Fisher says:
w \propto S_W^{-1}(m_1 - m_2)
• Since SW is proportional to the covariance matrix, w is in the same
direction in both cases.
• Comforting...
What have we achieved?
• Found out that maximum posterior strategy is
optimal. Always.
• Looked at different cases of Gaussian class
densities, where we could derive simple decision
rules.
• Gaussian classifiers do a reasonable job!
• Learned about FLD which is useful and often
preferable to PCA.
Just for Fun: Support Vector Machine
• Very fashionable… state of the art?
• Does not model densities
• Fits decision surface
directly
• Maximizes margin → reduces “complexity”
• Decision surface only
depends on nearby
samples
• Matlab Demo
[Figure: two classes in the (x_1, x_2) plane with a maximum-margin decision surface]
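The SVM Matlab demo is not included in the transcript; as a stand-in, here is a small example using scikit-learn's SVC (an assumed external dependency, not something the lecture relies on):

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

# Toy 2-D data: two Gaussian blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0)     # maximum-margin linear decision surface
clf.fit(X, y)
print(clf.support_vectors_.shape)     # only these nearby samples define the surface
print(clf.predict([[0.5, 0.5]]))
```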
Learning Algorithms
[Diagram: an unknown function y = f(x), drawn from a set of functions, generates examples (x_i, y_i) according to p(x, y); the examples are fed to a learning algorithm, which outputs a learned function]
Assume Unknown Class Densities
• SVM Examples
• Densities are hard to estimate → avoid it
– example from Ripley
• Give intuitions on overfitting
• Need to learn
– Standard machine learning problem
– Training/Test sets