Introduction to Decision Theory and Machine Learning
Elements of Pattern Recognition
CNS/EE-148 -- Lecture 5
M. Weber
P. Perona
What is Classification?
• We want to assign objects to classes based on a selection of
attributes (features).
• Examples:
– (age, income) → {credit worthy, not credit worthy}
– (blood cell count, body temp) → {flu, hepatitis B, hepatitis C}
– (pixel vector) → {Bill Clinton, coffee cup}
• The feature vector can be continuous, discrete, or mixed.
What is Classification?
• Want to find a function from measurements to class labels:

  c : \mathbb{R}^2 \to \{C_0, C_1, C_2, C_?\}

• Statistical methods use the pdf p(C, x).
• Assume p(C, x) is known for now.

[Figure: space of feature vectors in the (x1, x2) plane, with regions for Signal 1, Signal 2, and Noise separated by a decision boundary]
Some Terminology
• p(C) is called a prior or a priori probability
• p(x|C) is called a class-conditional density
or likelihood of C with respect to x
• p(C|x) is called a posterior or
a posteriori probability
Examples
• One measurement, symmetric cost, equal priors
  P(\text{error}) = \int P(\text{error} \mid x)\, p(x)\, dx

  P(\text{error} \mid x) = \begin{cases} P(C_1 \mid x), & \text{if } c(x) = C_2 \\ P(C_2 \mid x), & \text{if } c(x) = C_1 \end{cases}

[Figure: class-conditional densities p(x|C1) and p(x|C2) over x with a badly placed decision boundary ("bad")]
Examples
• One measurement, symmetric cost, equal priors
[Figure: the same densities p(x|C1) and p(x|C2) over x with a well-placed decision boundary ("good")]
How to Make the Best Decision?
(Bayes Decision Theory)
• Define a cost function for mistakes, e.g.

  L(i, j) = 1 - \delta_{ij}

• Minimize expected loss (risk) over the entire p(C, x):

  R = E[L(C, c(x))] = E\big[ E[L(C, c(x)) \mid x] \big] = \int E[L(C, c(x)) \mid x]\, p(x)\, dx

• Sufficient to assure optimal decision for each individual x:

  E[L(C, c(x)) \mid x] = \sum_{i=1}^{N} L(C_i, c(x))\, p(C_i \mid x)

• Result: decide according to maximum posterior probability:

  c(x) = \arg\max_i \, p(C_i \mid x)
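Not from the lecture: a minimal NumPy sketch of the maximum-posterior rule above, with made-up 1-D Gaussian class-conditional densities and priors standing in for a known p(C, x).

```python
import numpy as np

# Minimal sketch: Bayes decision by maximum posterior, assuming the class
# densities and priors are known (illustrative values, not from the lecture).
def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = [0.5, 0.5]                      # p(C_i)
means, sigmas = [0.0, 2.0], [1.0, 1.0]   # class-conditional parameters

def classify(x):
    # posterior is proportional to likelihood * prior; pick the maximum
    posteriors = [gaussian_pdf(x, m, s) * p
                  for m, s, p in zip(means, sigmas, priors)]
    return int(np.argmax(posteriors))    # index of the chosen class

print(classify(0.3), classify(1.7))      # -> 0 1 for these example parameters
```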
Two Classes, C1, C2
• It is helpful to consider the likelihood ratio:

  \frac{p(C_1 \mid x)}{p(C_2 \mid x)} = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}

• Use known priors p(Ci) or ignore them.
• For a more elaborate loss function (proof is easy), decide C_1 if

  g(x) = \frac{p(x \mid C_1)}{p(x \mid C_2)} > \frac{(l_{12} - l_{22})\, p(C_2)}{(l_{21} - l_{11})\, p(C_1)}
• g(x) is called a discriminant function
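As a sketch only (the loss values and likelihoods below are illustrative, not from the lecture), the likelihood-ratio test with its loss-dependent threshold can be written as:

```python
import numpy as np

# Likelihood-ratio test sketch. loss[i, j] = cost of deciding C_{i+1} when the
# true class is C_{j+1}; the values are made up for illustration.
loss = np.array([[0.0, 1.0],    # l11, l12
                 [5.0, 0.0]])   # l21, l22
p1, p2 = 0.5, 0.5               # priors p(C1), p(C2)

def decide(lik1, lik2):
    """Return 1 (decide C1) or 2 (decide C2) from class-conditional likelihoods."""
    threshold = (loss[0, 1] - loss[1, 1]) * p2 / ((loss[1, 0] - loss[0, 0]) * p1)
    return 1 if lik1 / lik2 > threshold else 2

print(decide(0.3, 0.1))   # likelihood ratio 3 > threshold 0.2 -> decide C1
```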
Discriminant Functions for Multivariate
Gaussian Class Conditional Densities
• Two multivariate Gaussians in d dimensions
• Since log is monotonic, we can look at log g(x).
  \log g(x) = \log \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} = g_1(x) - g_2(x)

  g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma_i| + \log p(C_i)

  (The quadratic term is the squared Mahalanobis distance; the (d/2) log 2π term is the same for every class and therefore superfluous.)
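A small sketch of the discriminant above (illustrative parameters, not from the lecture); the constant (d/2) log 2π term is dropped since it cancels in g1(x) - g2(x):

```python
import numpy as np

# Quadratic discriminant for a multivariate Gaussian class-conditional density.
def gaussian_discriminant(x, mu, cov, prior):
    diff = x - mu
    mahalanobis_sq = diff @ np.linalg.inv(cov) @ diff   # squared Mahalanobis distance
    return (-0.5 * mahalanobis_sq
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

# Decide C1 if g1(x) > g2(x); example values are made up.
x = np.array([1.0, 0.5])
g1 = gaussian_discriminant(x, np.zeros(2), np.eye(2), 0.5)
g2 = gaussian_discriminant(x, np.array([3.0, 3.0]), np.eye(2), 0.5)
print("decide C1" if g1 > g2 else "decide C2")
```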
Mahalanobis Distance
  d_i^2(x) = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)

• Iso-distance lines = iso-probability lines
• Decision surface: d_1^2(x) - d_2^2(x) = \text{const.}

[Figure: (x1, x2) plane with iso-distance ellipses around μ1 and μ2 and the resulting decision surface]
Case 1: Σi = σ²I
• Discriminant functions…

  g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log p(C_i)

• …simplify to:

  g_1(x) - g_2(x) = -\frac{\|x - \mu_1\|^2}{2\sigma^2} + \frac{\|x - \mu_2\|^2}{2\sigma^2} + \log\frac{p(C_1)}{p(C_2)}
                  = \frac{1}{\sigma^2}(\mu_1 - \mu_2)^T (x - \mu_2) - \frac{\|\mu_1 - \mu_2\|^2}{2\sigma^2} + \log\frac{p(C_1)}{p(C_2)}
Decision Boundary
  g_1(x) - g_2(x) = 0

  \frac{(\mu_1 - \mu_2)^T}{\|\mu_1 - \mu_2\|}\,(x - \mu_2) = \frac{\|\mu_1 - \mu_2\|}{2} - \frac{\sigma^2}{\|\mu_1 - \mu_2\|}\log\frac{p(C_1)}{p(C_2)}

• If μ2 = 0, we obtain

  \frac{\mu_1^T}{\|\mu_1\|}\, x = \frac{\|\mu_1\|}{2} - \frac{\sigma^2}{\|\mu_1\|}\log\frac{p(C_1)}{p(C_2)}

  The matched filter! With an expression for the threshold.
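A NumPy sketch of this Case 1 rule (the signals, noise variance, and priors below are made-up values): classification reduces to projecting x onto μ1 - μ2 and comparing against the threshold derived above.

```python
import numpy as np

# Matched-filter decision for Sigma_i = sigma^2 * I: project onto mu1 - mu2
# and compare to a threshold that depends on the priors and noise variance.
def matched_filter_decision(x, mu1, mu2, sigma2, p1=0.5, p2=0.5):
    w = mu1 - mu2
    projection = w @ (x - mu2)
    threshold = 0.5 * (w @ w) - sigma2 * np.log(p1 / p2)
    return 1 if projection >= threshold else 2   # class label

# Illustrative signals in additive white Gaussian noise.
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 1.0])
x = np.array([0.8, 1.2])
print(matched_filter_decision(x, mu1, mu2, sigma2=0.5))  # -> 1
```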
Two Signals and Additive White Gaussian Noise
  \frac{(\mu_1 - \mu_2)^T}{\|\mu_1 - \mu_2\|}\,(x - \mu_2) = \frac{\|\mu_1 - \mu_2\|}{2} - \frac{\sigma^2}{\|\mu_1 - \mu_2\|}\log\frac{p(C_1)}{p(C_2)}

[Figure: (x1, x2) plane showing Signal 1 at μ1 and Signal 2 at μ2, with the vectors μ1 - μ2 and x - μ2 that enter the projection in the decision rule]
Case 2: Σi = Σ
• Two classes, 2D measurements, p(x|C) are
multivariate Gaussians with equal covariance
matrices.
• Derivation is similar
– Quadratic term vanishes since it is independent of class
– We obtain a linear decision surface
• Matlab demo
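Not the lecture's Matlab demo: a NumPy sketch of the equal-covariance case, where the discriminant is linear, g1(x) - g2(x) = wᵀx + w0 with w = Σ⁻¹(μ1 - μ2). The parameter values are illustrative only.

```python
import numpy as np

# Equal covariance matrices: the decision surface is linear.
def linear_discriminant(mu1, mu2, cov, p1=0.5, p2=0.5):
    cov_inv = np.linalg.inv(cov)
    w = cov_inv @ (mu1 - mu2)
    w0 = -0.5 * (mu1 + mu2) @ w + np.log(p1 / p2)
    return w, w0                       # decide C1 when w @ x + w0 > 0

# Illustrative parameters (not from the demo).
w, w0 = linear_discriminant(np.array([2.0, 0.0]), np.array([0.0, 0.0]),
                            np.array([[1.0, 0.3], [0.3, 1.0]]))
x = np.array([1.5, 0.2])
print("C1" if w @ x + w0 > 0 else "C2")
```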
Case 3: General Covariance Matrix
• See transparency
Isn’t this too simple?
• Not at all…
• It is true that images form complicated manifolds (from a
pixel point of view, translation, rotation and scaling are all
highly non-linear operations)
• The high dimensionality helps
Assume Unknown Class Densities
• In real life, we do not know the class conditional densities.
• But we do have example data.
• This puts us in the typical machine learning scenario:
We want to learn a function, c(x), from examples.
• Why not just estimate class densities from examples and
apply the previous ideas?
– Learn a Gaussian (a simple density): in N dimensions you need at least N² samples!
• 10×10 pixels → 10,000 examples!
– Avoid estimating densities whenever you can! (too general)
– posterior is generally simpler than class conditional (see transparency)
Remember PCA?
• Principal components are
eigenvectors of covariance
matrix
  C = \frac{1}{N}\sum_i (x_i - \bar{x})(x_i - \bar{x})^T = U S U^T

  x \approx \hat{U} z
• Use reconstruction error for
recognition (e.g. Eigenfaces)
– good
• reduces dimensionality
– bad
• no model within subspace
• linearity may be inappropriate
• covariance not appropriate to
optimize discrimination
[Figure: data cloud in the (x1, x2) plane with a sample x and the first principal direction u1]
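For concreteness, a short sketch (random data standing in for face images) of principal components as eigenvectors of the sample covariance matrix, plus the reconstruction error used for recognition:

```python
import numpy as np

# PCA via eigen-decomposition of the covariance matrix, and subspace
# reconstruction error (cf. Eigenfaces).
def pca(X, k):
    """X: (N, d) data matrix. Returns mean and the top-k eigenvectors U_hat."""
    mean = X.mean(axis=0)
    C = np.cov(X - mean, rowvar=False)            # d x d covariance
    eigvals, eigvecs = np.linalg.eigh(C)          # ascending eigenvalues
    U_hat = eigvecs[:, ::-1][:, :k]               # top-k principal directions
    return mean, U_hat

def reconstruction_error(x, mean, U_hat):
    z = U_hat.T @ (x - mean)                      # project into the subspace
    x_rec = mean + U_hat @ z                      # x is approx. mean + U_hat z
    return np.linalg.norm(x - x_rec)

# Illustrative use with random data.
X = np.random.randn(200, 10)
mean, U_hat = pca(X, k=3)
print(reconstruction_error(X[0], mean, U_hat))
```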
Fisher’s Linear Discriminant
• Goal: Reduce dimensionality
before training classifiers etc.
(Feature Selection)
• Similar goal as PCA!
• Fisher has classification in
mind…
• Find projection directions such
that separation is easiest
• Eigenfaces vs. Fisherfaces
[Figure: two class clusters in the (x1, x2) plane with candidate projection directions]
Fisher’s Linear Discriminant
• Assume we have n d-dimensional samples x1,…,xn
• n1 from set (class) X1 and n2 from set X2
• we form linear combinations:
  y = w^T x

• and obtain y_1, …, y_n
• only direction of w is important
Objective for Fisher
• Measure the separation as the distance between the means
after projecting (k = 1,2):
  \tilde{m}_k = \frac{1}{n_k}\sum_{y \in Y_k} y = \frac{1}{n_k}\sum_{x \in X_k} w^T x = w^T m_k
• Measure the scatter after projecting:
  \tilde{s}_k^2 = \sum_{y \in Y_k} (y - \tilde{m}_k)^2
• Objective becomes to maximize
  J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}
• We need to make the dependence on w explicit:
  \tilde{s}_k^2 = \sum_{x \in X_k} (w^T x - w^T m_k)^2 = w^T \Big[ \sum_{x \in X_k} (x - m_k)(x - m_k)^T \Big] w = w^T S_k w
• Defining the within-class scatter matrix, SW=S1+S2, we obtain
  \tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w
• Similarly for the separation (between-class scatter matrix)
  (\tilde{m}_1 - \tilde{m}_2)^2 = (w^T m_1 - w^T m_2)^2 = w^T (m_1 - m_2)(m_1 - m_2)^T w = w^T S_B w
• Finally we can write
  J(w) = \frac{w^T S_B w}{w^T S_W w}
Fisher’s Solution
  J(w) = \frac{w^T S_B w}{w^T S_W w}
• This is called a generalized Rayleigh quotient. Any w that
maximizes J must satisfy the generalized eigenvalue
problem
  S_B w = \lambda S_W w
• Since SB is very singular (rank 1), and SBw is in the
direction of (m1-m2), we are done:
  w = S_W^{-1}(m_1 - m_2)
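A short sketch of Fisher's solution (the two Gaussian clouds below are made-up sample data, not the lecture's demo): compute the within-class scatter and solve for w = S_W⁻¹(m1 - m2).

```python
import numpy as np

# Fisher's linear discriminant direction from two labeled sample sets.
def fisher_direction(X1, X2):
    """X1, X2: (n_k, d) sample matrices for the two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2                          # within-class scatter matrix
    return np.linalg.solve(Sw, m1 - m2)   # direction of w (scale is irrelevant)

# Illustrative data: two Gaussian clouds.
rng = np.random.default_rng(0)
X1 = rng.normal([2.0, 0.0], 1.0, size=(100, 2))
X2 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
w = fisher_direction(X1, X2)
y1, y2 = X1 @ w, X2 @ w                   # 1-D projections used for classification
print(w, y1.mean() > y2.mean())
```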
Comments on FLD
• We did not follow Bayes Decision Theory
• FLD is useful for many types of densities
• Fisher can be extended (see demo):
– more than one projection direction
– more than two clusters
• Let’s try it out: Matlab Demo
Fisher vs. Bayes
• Assume we do have identical Gaussian class densities, then Bayes
says:
  w^T x + w_0 = 0, \quad w = \Sigma^{-1}(\mu_1 - \mu_2)
• while Fisher says:
  w = S_W^{-1}(m_1 - m_2)
• Since SW is proportional to the covariance matrix, w is in the same
direction in both cases.
• Comforting...
What have we achieved?
• Found out that maximum posterior strategy is
optimal. Always.
• Looked at different cases of Gaussian class
densities, where we could derive simple decision
rules.
• Gaussian classifiers do a reasonable job!
• Learned about FLD which is useful and often
preferable to PCA.
Just for Fun: Support Vector Machine
• Very fashionable… state of the art?
• Does not model densities
• Fits decision surface
directly
• Maximizes margin → reduces “complexity”
• Decision surface only
depends on nearby
samples
• Matlab Demo
[Figure: two classes in the (x1, x2) plane separated by a maximum-margin decision surface]
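Not the lecture's Matlab demo: a small sketch using scikit-learn's SVC (assuming that library is available), showing that a linear SVM fits the decision surface directly and that only the support vectors determine it. The data are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Linear SVM on two synthetic Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 2.0], 1.0, size=(50, 2)),
               rng.normal([-2.0, -2.0], 1.0, size=(50, 2))])
y = np.array([1] * 50 + [2] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", len(clf.support_vectors_))
print(clf.predict([[1.5, 1.0], [-1.0, -2.5]]))   # -> [1 2]
```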
Learning Algorithms
[Diagram: an unknown function y = f(x) (f = ?) and the distribution p(x, y) generate examples (xi, yi); a learning algorithm searches a set of functions and outputs the learned function]
Assume Unknown Class Densities
• SVM Examples
• Densities are hard to estimate → avoid estimating them
– example from Ripley
• Give intuitions on overfitting
• Need to learn
– Standard machine learning problem
– Training/Test sets