Pattern Classification
Chapter 2 (Part 1):
Bayesian Decision Theory
(Sections 2.1-2.2)
• Introduction
• Bayesian Decision Theory–Continuous Features
Introduction
• The sea bass/salmon example
• State of nature, prior
• State of nature is a random variable
• The catch of salmon and sea bass is equiprobable
• P(ω1) = P(ω2) (uniform priors)
• P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)
• Decision rule with only the prior information
• Decide ω1 if P(ω1) > P(ω2), otherwise decide ω2
• Use of the class-conditional information
• P(x | ω1) and P(x | ω2) describe the difference in lightness between populations of sea bass and salmon
• Posterior, likelihood, evidence
• P(ωj | x) = P(x | ωj) P(ωj) / P(x)   (Bayes rule)
• Posterior = (Likelihood × Prior) / Evidence
• Where, in the case of two categories:
  P(x) = Σ_{j=1}^{2} P(x | ωj) P(ωj)
• Intuitive decision rule given the posterior probabilities:
Given x:
  if P(ω1 | x) > P(ω2 | x), decide the true state of nature is ω1
  if P(ω1 | x) < P(ω2 | x), decide the true state of nature is ω2
Why do this? Whenever we observe a particular x, the probability of error is:
  P(error | x) = P(ω1 | x) if we decide ω2
  P(error | x) = P(ω2 | x) if we decide ω1
• Minimizing the probability of error
• Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
Therefore:
  P(error | x) = min [P(ω1 | x), P(ω2 | x)]
  (Bayes decision)
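As a concrete illustration (not part of the original slides), a minimal Python sketch of Bayes rule and this minimum-error decision for two categories; the likelihood and prior values passed in are hypothetical.

```python
# A minimal sketch of Bayes rule and the minimum-error decision for two categories.
# The likelihoods and priors below are hypothetical, chosen only for illustration.

def bayes_decision(lik1, lik2, prior1, prior2):
    """Return the decided class (1 or 2), its posterior, and P(error | x)."""
    evidence = lik1 * prior1 + lik2 * prior2          # P(x)
    post1 = lik1 * prior1 / evidence                  # P(w1 | x)
    post2 = lik2 * prior2 / evidence                  # P(w2 | x)
    decision = 1 if post1 > post2 else 2              # pick the larger posterior
    p_error = min(post1, post2)                       # P(error | x)
    return decision, max(post1, post2), p_error

# Hypothetical likelihoods P(x | w1) = 0.6, P(x | w2) = 0.3 at some observed x,
# with uniform priors as in the sea bass / salmon example.
print(bayes_decision(0.6, 0.3, 0.5, 0.5))
```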
Bayesian Decision Theory – Continuous Features
Generalization of the preceding ideas
• Use of more than one feature
• Use of more than two states of nature
• Allowing actions, and not only deciding on the state of nature
• Introduce a loss function which is more general than the probability of error
• Allowing actions other than classification primarily allows the possibility of rejection
• Refusing to make a decision in close or bad cases!
• Letting the loss function state how costly each action taken is
Bayesian Decision Theory – Continuous Features
• Let {ω1, ω2, …, ωc} be the set of c states of nature (or "classes")
• Let {α1, α2, …, αa} be the set of possible actions
• Let λ(αi | ωj) be the loss for action αi when the state of nature is ωj
What is the expected loss for action αi?
For any given x, the expected loss is
  R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)
Conditional risk = expected loss
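A minimal sketch of this conditional-risk computation, assuming a hypothetical 2×2 loss matrix and hypothetical posteriors; the best action is the one with the smallest R(αi | x).

```python
import numpy as np

# Conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | w_j) P(w_j | x).
# The loss matrix and posteriors below are hypothetical.
loss = np.array([[0.0, 2.0],       # lambda(alpha_1 | w_1), lambda(alpha_1 | w_2)
                 [1.0, 0.0]])      # lambda(alpha_2 | w_1), lambda(alpha_2 | w_2)
posteriors = np.array([0.7, 0.3])  # P(w_1 | x), P(w_2 | x)

cond_risk = loss @ posteriors      # R(alpha_i | x) for each action i
best_action = np.argmin(cond_risk) # minimum-risk action
print(cond_risk, "-> take action", best_action + 1)
```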
Overall risk
R = sum of all R(αi | x) for i = 1, …, a
(conditional risks)
Minimizing R:
Minimizing R(αi | x) for i = 1, …, a
  R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)   for i = 1, …, a
Select the action αi for which R(αi | x) is minimum.
R is then minimized, and in this case R is called the
Bayes risk = best performance that can be achieved!
Two-Category Classification
α1: deciding ω1
α2: deciding ω2
λij = λ(αi | ωj):
loss incurred for deciding ωi when the true state of nature is ωj
Conditional risk:
  R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
  R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
Our rule is the following:
if R(α1 | x) < R(α2 | x),
  action α1: "decide ω1" is taken
This results in the equivalent rule:
decide ω1 if:
  (λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)
and decide ω2 otherwise
Two-Category Decision Theory: Chopping Machine
α1 = chop
α2 = DO NOT chop
ω1 = NO hand in machine
ω2 = hand in machine
λ11 = λ(α1 | ω1) = $0.00
λ12 = λ(α1 | ω2) = $100.00
λ21 = λ(α2 | ω1) = $0.01
λ22 = λ(α2 | ω2) = $0.01
Therefore our rule becomes:
  (λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)
  0.01 P(x | ω1) P(ω1) > 99.99 P(x | ω2) P(ω2)
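A small sketch that plugs the chopping-machine losses into the two-category rule above; the class-conditional densities and priors used in the example call are hypothetical.

```python
# Two-category minimum-risk rule using the chopping-machine losses from the slide.
lam11, lam12 = 0.00, 100.00   # loss for chopping: no hand / hand in machine
lam21, lam22 = 0.01, 0.01     # loss for not chopping: no hand / hand in machine

def decide_chop(px_w1, px_w2, p_w1, p_w2):
    """Decide alpha_1 (chop) iff (lam21-lam11) P(x|w1)P(w1) > (lam12-lam22) P(x|w2)P(w2)."""
    left = (lam21 - lam11) * px_w1 * p_w1    # 0.01 * P(x|w1) P(w1)
    right = (lam12 - lam22) * px_w2 * p_w2   # 99.99 * P(x|w2) P(w2)
    return "chop" if left > right else "do not chop"

# Even a small probability of a hand being present keeps the machine from chopping.
# The values below are hypothetical.
print(decide_chop(px_w1=0.9, px_w2=0.1, p_w1=0.5, p_w2=0.5))
```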
Our rule is the following:
if R(α1 | x) < R(α2 | x),
  action α1: "decide ω1" is taken
This results in the equivalent rule:
decide ω1 if:
  (λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)
and decide ω2 otherwise
Exercise
Select the optimal decision where:
  Ω = {ω1, ω2}
  P(x | ω1) ~ N(2, 0.5) (normal distribution)
  P(x | ω2) ~ N(1.5, 0.2)
  P(ω1) = 2/3
  P(ω2) = 1/3
  λ = [ 1  2
        3  4 ]
Chapter 2 (Part 2):
Bayesian Decision Theory
(Sections 2.3-2.5)
• Minimum-Error-Rate Classification
• Classifiers, Discriminant Functions and Decision Surfaces
• The Normal Density
Minimum-Error-Rate Classification
• Actions are decisions on classes
  If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j
• Seek a decision rule that minimizes the probability of error, which is the error rate
• Introduction of the zero-one loss function:
  λ(αi, ωj) = 0 if i = j,  1 if i ≠ j,   for i, j = 1, …, c
Therefore, the conditional risk for each action is:
  R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)
            = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x)   (average probability of error)
"The risk corresponding to this loss function is the average (or expected) probability of error"
• Minimizing the risk requires maximizing P(ωi | x)
  (since R(αi | x) = 1 − P(ωi | x))
• For minimum error rate:
• Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i
Two-Category Classification
α1: deciding ω1
α2: deciding ω2
λij = λ(αi | ωj):
loss incurred for deciding ωi when the true state of nature is ωj
Conditional risk:
  R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
  R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
Our rule is the following:
if R(α1 | x) < R(α2 | x),
  action α1: "decide ω1" is taken
This results in the equivalent rule:
decide ω1 if:
  (λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)
and decide ω2 otherwise
Likelihood ratio:
The preceding rule is equivalent to the following rule:
  if P(x | ω1) / P(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
then take action α1 (decide ω1);
otherwise take action α2 (decide ω2).
• Regions of decision and the zero-one loss function, therefore:
  Let θλ = [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)], then decide ω1 if:
    P(x | ω1) / P(x | ω2) > θλ
• If λ is the zero-one loss function, which means:
    λ = [ 0  1
          1  0 ]
  then θλ = P(ω2) / P(ω1) = θa
  If instead
    λ = [ 0  2
          1  0 ]
  then θλ = 2 P(ω2) / P(ω1) = θb
Classifiers, Discriminant Functions and Decision Surfaces
• The multi-category case
• Set of discriminant functions gi(x), i = 1, …, c
• The classifier assigns a feature vector x to class ωi if:
  gi(x) > gj(x) for all j ≠ i
• Let gi(x) = −R(αi | x)
  (max. discriminant corresponds to min. risk!)
• For the minimum error rate, we take
  gi(x) = P(ωi | x)
  (max. discriminant corresponds to max. posterior!)
  gi(x) ≡ P(x | ωi) P(ωi)
  gi(x) = ln P(x | ωi) + ln P(ωi)
  (ln: natural logarithm!)
• Feature space divided into c decision regions
  If gi(x) > gj(x) for all j ≠ i, then x is in Ri
  (Ri means: assign x to ωi)
• The two-category case
• A classifier is a "dichotomizer" that has two discriminant functions g1 and g2
  Let g(x) ≡ g1(x) − g2(x)
  Decide ω1 if g(x) > 0; otherwise decide ω2
• The computation of g(x): one can take either
  g(x) = P(ω1 | x) − P(ω2 | x)
or, equivalently for the decision,
  g(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]
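A minimal sketch of the dichotomizer in its logarithmic form; the likelihoods and priors passed in are hypothetical.

```python
import math

# Dichotomizer g(x) = ln P(x|w1)/P(x|w2) + ln P(w1)/P(w2); decide w1 iff g(x) > 0.
def g(px_w1, px_w2, p_w1, p_w2):
    return math.log(px_w1 / px_w2) + math.log(p_w1 / p_w2)

# Hypothetical values: a strong likelihood ratio can outweigh an unfavorable prior.
value = g(px_w1=0.25, px_w2=0.05, p_w1=0.4, p_w2=0.6)
print(value, "-> decide w1" if value > 0 else "-> decide w2")
```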
On to higher dimensions!
The Normal Density
• Univariate density
  • Density which is analytically tractable
  • Continuous density
  • A lot of processes are asymptotically Gaussian
  • Handwritten characters and speech sounds are ideal or prototype patterns corrupted by a random process (central limit theorem)
  P(x) = [1 / (√(2π) σ)] exp[ −(1/2) ((x − μ) / σ)² ]
Where:
  μ = mean (or expected value) of x
  σ² = expected squared deviation, or variance
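A direct transcription of this univariate density into code (a sketch, not from the slides):

```python
import math

# Univariate normal density P(x) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x - mu)/sigma)**2)
def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

print(normal_pdf(0.0, mu=0.0, sigma=1.0))   # peak of the standard normal, ~0.3989
```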
• Multivariate density
• The multivariate normal density in d dimensions is:
  P(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[ −(1/2) (x − μ)^t Σ⁻¹ (x − μ) ]
where:
  x = (x1, x2, …, xd)^t (t stands for the transpose vector form)
  μ = (μ1, μ2, …, μd)^t is the mean vector
  Σ = d×d covariance matrix
  |Σ| and Σ⁻¹ are its determinant and inverse, respectively
Chapter 2 (Part 3):
Bayesian Decision Theory
(Sections 2.6, 2.9)
• Discriminant Functions for the Normal Density
• Bayes Decision Theory – Discrete Features
Discriminant Functions for the Normal Density
• We saw that the minimum error-rate classification can be achieved by the discriminant function
  gi(x) = ln P(x | ωi) + ln P(ωi)
• Case of the multivariate normal:
  gi(x) = −(1/2) (x − μi)^t Σi⁻¹ (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
Case Σi = σ²I
(I stands for the identity matrix)
  gi(x) = wi^t x + wi0   (linear discriminant function)
where:
  wi = μi / σ²
  wi0 = −[1 / (2σ²)] μi^t μi + ln P(ωi)
(wi0 is called the threshold for the ith category!)
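A sketch of this linear discriminant, with hypothetical means, priors, and σ; the class with the larger gi(x) wins.

```python
import numpy as np

# Linear discriminant for the case Sigma_i = sigma^2 I.
def linear_discriminant(x, mu_i, prior_i, sigma):
    w_i = mu_i / sigma**2                                    # w_i = mu_i / sigma^2
    w_i0 = -mu_i @ mu_i / (2 * sigma**2) + np.log(prior_i)   # threshold term w_i0
    return w_i @ x + w_i0                                    # g_i(x) = w_i^t x + w_i0

# Hypothetical two-class example with equal priors and sigma = 1.
x = np.array([1.2, 0.6])
g1 = linear_discriminant(x, np.array([0.0, 0.0]), 0.5, sigma=1.0)
g2 = linear_discriminant(x, np.array([2.0, 1.0]), 0.5, sigma=1.0)
print("decide w1" if g1 > g2 else "decide w2")
```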
• A classifier that uses linear discriminant functions
is called “a linear machine”
• The decision surfaces for a linear machine are
pieces of hyperplanes defined by:
gi(x) = gj(x)
The hyperplane separating Ri and Rj is given by
  w^t (x − x0) = 0
where:
  w = μi − μj
  x0 = (1/2)(μi + μj) − [σ² / ‖μi − μj‖²] ln [P(ωi) / P(ωj)] (μi − μj)
The hyperplane is always orthogonal to the line linking the means!
  If P(ωi) = P(ωj), then x0 = (1/2)(μi + μj)
Case Σi = Σ (the covariance matrices of all classes are identical but arbitrary!)
Hyperplane separating Ri and Rj:
  w^t (x − x0) = 0
where:
  w = Σ⁻¹ (μi − μj)
  x0 = (1/2)(μi + μj) − [ln (P(ωi) / P(ωj)) / ((μi − μj)^t Σ⁻¹ (μi − μj))] (μi − μj)
Here the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!
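A sketch computing w and x0 for this shared-covariance case from hypothetical means, priors, and Σ:

```python
import numpy as np

# Hyperplane parameters for the case Sigma_i = Sigma (hypothetical class parameters).
mu_i = np.array([2.0, 0.0])
mu_j = np.array([0.0, 1.0])
prior_i, prior_j = 0.6, 0.4
sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])

sigma_inv = np.linalg.inv(sigma)
diff = mu_i - mu_j
w = sigma_inv @ diff                                            # w = Sigma^-1 (mu_i - mu_j)
x0 = 0.5 * (mu_i + mu_j) - (np.log(prior_i / prior_j) /
                            (diff @ sigma_inv @ diff)) * diff   # point on the boundary

# The decision boundary is the set of x with w^t (x - x0) = 0.
print(w, x0)
```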
Case Σi = arbitrary
The covariance matrices are different for each category:
  gi(x) = x^t Wi x + wi^t x + wi0
where:
  Wi = −(1/2) Σi⁻¹
  wi = Σi⁻¹ μi
  wi0 = −(1/2) μi^t Σi⁻¹ μi − (1/2) ln |Σi| + ln P(ωi)
Here the separating surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids.
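A sketch of this quadratic discriminant for arbitrary Σi, using hypothetical class parameters:

```python
import numpy as np

# Quadratic discriminant g_i(x) = x^t W_i x + w_i^t x + w_i0 for arbitrary Sigma_i.
def quadratic_discriminant(x, mu, sigma, prior):
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv                                          # W_i = -1/2 Sigma_i^-1
    w = sigma_inv @ mu                                            # w_i = Sigma_i^-1 mu_i
    w0 = (-0.5 * mu @ sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))                                        # w_i0
    return x @ W @ x + w @ x + w0

# Hypothetical two-class example with different covariance matrices.
x = np.array([0.5, 1.0])
g1 = quadratic_discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = quadratic_discriminant(x, np.array([1.0, 1.0]),
                            np.array([[2.0, 0.0], [0.0, 0.5]]), 0.5)
print("decide w1" if g1 > g2 else "decide w2")
```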
Bayes Decision Theory – Discrete Features
• Components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm
• Case of independent binary features in a two-category problem:
  Let x = [x1, x2, …, xd]^t, where each xi is either 0 or 1, with probabilities:
  pi = P(xi = 1 | ω1)
  qi = P(xi = 1 | ω2)
• The discriminant function in this case is:
  g(x) = Σ_{i=1}^{d} wi xi + w0
where:
  wi = ln [ pi (1 − qi) / (qi (1 − pi)) ],   i = 1, …, d
and:
  w0 = Σ_{i=1}^{d} ln [(1 − pi) / (1 − qi)] + ln [P(ω1) / P(ω2)]
Decide ω1 if g(x) > 0, and ω2 if g(x) ≤ 0.
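A sketch computing the weights wi and w0 from hypothetical values of pi, qi, and the priors, and classifying one binary feature vector:

```python
import numpy as np

# Discriminant for independent binary features; p, q, priors, and x are hypothetical.
p = np.array([0.8, 0.6, 0.7])      # p_i = P(x_i = 1 | w1)
q = np.array([0.3, 0.4, 0.5])      # q_i = P(x_i = 1 | w2)
prior1, prior2 = 0.5, 0.5

w = np.log(p * (1 - q) / (q * (1 - p)))                             # w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)    # w_0

x = np.array([1, 0, 1])            # an observed binary feature vector
g = w @ x + w0
print("decide w1" if g > 0 else "decide w2")
```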