Transcript Chapter02

240-650
Principles of Pattern Recognition
Montri Karnjanadecha
[email protected]
http://fivedots.coe.psu.ac.th/~montri
Chapter 2
Bayesian Decision Theory
Statistical Approach to Pattern Recognition
A Simple Example
• Suppose that we are given two classes w1 and w2
– P(w1) = 0.7
– P(w2) = 0.3
– No measurement is given
• Guessing
– What shall we do to recognize a given input?
– What is the best we can do statistically? Why?
A More Complicated Example
• Suppose that we are given two classes
– A single measurement x
– P(w1|x) and P(w2|x) are given graphically
A Bayesian Example
• Suppose that we are given two classes
– A single measurement x
– We are given p(x|w1) and p(x|w2) this time
A Bayesian Example – cont.
Bayesian Decision Theory
• Bayes formula
  P(wj|x) = p(x|wj) P(wj) / p(x),   since p(wj, x) = P(wj|x) p(x) = p(x|wj) P(wj)
• In case of two categories
  p(x) = ∑_{j=1}^{2} p(x|wj) P(wj)
• In English, it can be expressed as
  posterior = (likelihood × prior) / evidence
Bayesian Decision Theory – cont.
• A posterior probability
  – The probability of the state of nature being wj given that feature value x has been measured
• Likelihood
  – p(x|wj) is the likelihood of wj with respect to x
• Evidence
  – The evidence factor can be viewed as a scaling factor that guarantees that the posterior probabilities sum to one
Bayesian Decision Theory – cont.
• Whenever we observe a particular x, the probability of error is
  P(error|x) = P(w1|x) if we decide w2
               P(w2|x) if we decide w1
• The average probability of error is given by
  P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx   (integrals over all x)
Bayesian Decision Theory – cont.
• Bayes decision rule
Decide w1 if P(w1|x) > P(w2|x); otherwise decide w2
• Prob. of error
P(error|x)=min[P(w1|x), P(w2|x)]
• Since the evidence p(x) is the same for both classes, it can be ignored and the decision rule becomes:
  Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); otherwise decide w2
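As a concrete illustration of the rule above, here is a minimal Python sketch (not part of the original slides; the priors, means, and variances are invented for illustration). It computes the posteriors with Bayes formula from two Gaussian class-conditional densities and decides for the class with the larger posterior; the minimum posterior gives P(error|x).

```python
import numpy as np

def gauss(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Illustrative numbers only
priors = np.array([0.7, 0.3])                            # P(w1), P(w2)
likelihoods = lambda x: np.array([gauss(x, 2.0, 1.0),    # p(x|w1)
                                  gauss(x, 4.0, 1.0)])   # p(x|w2)

def decide(x):
    joint = likelihoods(x) * priors        # p(x|wj) P(wj)
    evidence = joint.sum()                 # p(x)
    posterior = joint / evidence           # P(wj|x)
    label = int(posterior.argmax()) + 1    # decide w1 or w2
    p_error = posterior.min()              # P(error|x) = min[P(w1|x), P(w2|x)]
    return label, posterior, p_error

label, post, p_err = decide(2.8)
print(f"decide w{label}, posteriors = {post.round(3)}, P(error|x) = {p_err:.3f}")
```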
Bayesian Decision Theory – Continuous Features
• Feature space
  – In general, an input can be represented by a vector x, a point in a d-dimensional Euclidean space R^d
• Loss function
  – The loss function states exactly how costly each action is and is used to convert a probability determination into a decision
  – Written as λ(αi|wj)
Loss Function
 ( i | w j )
• Describe the loss incurred for taking action i
when the state of nature is wj
Conditional Risk
• Suppose we observe a particular x
• We take action αi
• If the true state of nature is wj, by definition we will incur the loss λ(αi|wj)
• We can minimize our expected loss by selecting the action that minimizes the conditional risk R(αi|x):
  R(αi|x) = ∑_{j=1}^{c} λ(αi|wj) P(wj|x)
Bayesian Decision Theory
• Suppose that there are c categories
{w1, w2, ..., wc}
• Conditional risk
  R(αi|x) = ∑_{j=1}^{c} λ(αi|wj) P(wj|x)
• Risk is the average expected loss
  R = ∫ R(α(x)|x) p(x) dx
Bayesian Decision Theory
• Bayes decision rule
– For a given x, select the action αi for which the conditional risk is minimum:
  α* = arg min_{αi} R(αi|x)
– The resulting minimum overall risk is called the Bayes risk, denoted R*, which is the best performance that can be achieved
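A small sketch of this rule in Python (the loss matrix and posteriors are made up for illustration): compute R(αi|x) for every action and take the action with the smallest conditional risk.

```python
import numpy as np

# Hypothetical loss matrix: loss[i, j] = lambda(alpha_i | w_j)
loss = np.array([[0.0, 2.0],     # action a1: no loss if truth is w1, costly if w2
                 [1.0, 0.0]])    # action a2: costs 1 if truth is w1
posteriors = np.array([0.6, 0.4])            # P(w1|x), P(w2|x) for some observed x

cond_risk = loss @ posteriors                # R(alpha_i|x) = sum_j loss[i,j] P(wj|x)
best_action = int(cond_risk.argmin()) + 1    # Bayes decision: minimize conditional risk
print(cond_risk, "-> take action", best_action)
```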
Two-Category Classification
• Let λij = λ(αi|wj)
• Conditional risk
  R(α1|x) = λ11 P(w1|x) + λ12 P(w2|x)
  R(α2|x) = λ21 P(w1|x) + λ22 P(w2|x)
• Fundamental decision rule
  Decide w1 if R(α1|x) < R(α2|x)
Two-Category Classification – cont.
• The decision rule can be written in several equivalent ways
  – Decide w1 if any one of the following (equivalent) conditions is true:
    (λ21 - λ11) P(w1|x) > (λ12 - λ22) P(w2|x)
    (λ21 - λ11) p(x|w1) P(w1) > (λ12 - λ22) p(x|w2) P(w2)
    p(x|w1)/p(x|w2) > [(λ12 - λ22)/(λ21 - λ11)] · [P(w2)/P(w1)]   (likelihood ratio form)
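A Python sketch of the likelihood-ratio form of the rule (losses, priors, and the Gaussian class-conditional densities are invented for illustration):

```python
import numpy as np

def gauss(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Illustrative values only
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0      # losses lambda_ij
P1, P2 = 0.7, 0.3                            # priors P(w1), P(w2)

def decide(x):
    ratio = gauss(x, 2.0, 1.0) / gauss(x, 4.0, 1.0)      # p(x|w1) / p(x|w2)
    threshold = (l12 - l22) / (l21 - l11) * (P2 / P1)
    return "w1" if ratio > threshold else "w2"

print(decide(2.5), decide(4.5))
```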
Minimum-Error-Rate Classification
• A special case of the Bayes decision rule with
the following zero-one loss function
0
 ( i | w j )  
1
if i  j
if i  j
– Assigns no loss to a correct decision
– Assigns unit loss to any error
– All errors are equally costly
Minimum-Error-Rate Classification
• Conditional risk under the zero-one loss:
  R(αi|x) = ∑_{j=1}^{c} λ(αi|wj) P(wj|x)
          = ∑_{j≠i} P(wj|x)
          = 1 - P(wi|x)
Minimum-Error-Rate Classification
• We should select the αi that maximizes the posterior probability P(wi|x)
• For minimum error rate:
  Decide wi if P(wi|x) > P(wj|x) for all j ≠ i
Minimum-Error-Rate Classification
Classifiers, Discriminant Functions, and
Decision Surfaces
• There are many ways to represent pattern
classifiers
• One of the most useful is in terms of a set of
discriminant functions gi(x), i=1,…,c
• The classifier assigns a feature vector x to
class wi if
  gi(x) > gj(x) for all j ≠ i
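For instance, a generic classifier can be written directly in terms of a list of discriminant functions. A Python sketch (the particular gi used here are arbitrary placeholders, not from the slides):

```python
import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant function is largest."""
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores)) + 1            # class index wi (1-based)

# Placeholder discriminant functions for three classes (negative squared distance to a mean)
g = [lambda x, m=m: -np.sum((x - m) ** 2)
     for m in (np.zeros(2), np.ones(2), 2 * np.ones(2))]
print(classify(np.array([0.9, 1.1]), g))         # -> 2
```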
The Multicategory Classifier
Classifiers, Discriminant Functions, and
Decision Surfaces
• There are many equivalent discriminant
functions
– i.e., the classification results will be the same even
though they are different functions
– For example, if f is a monotonically increasing function, then f(gi(x)) is an equivalent discriminant function
Classifiers, Discriminant Functions, and
Decision Surfaces
• Some discriminant functions are easier to understand or to compute than others
Decision Regions
• The effect of any decision rule is to divide the feature space into c decision regions R1, ..., Rc
  If gi(x) > gj(x) for all j ≠ i, then x ∈ Ri
  – The regions are separated by decision boundaries, where ties occur among the largest discriminant functions
Decision Regions – cont.
Two-Category Case (Dichotomizer)
• The two-category case is a special case
  – Instead of two discriminant functions, a single one can be used:
    g(x) = g1(x) - g2(x)
    g(x) = P(w1|x) - P(w2|x)
    g(x) = ln [p(x|w1)/p(x|w2)] + ln [P(w1)/P(w2)]
  – Decide w1 if g(x) > 0; otherwise decide w2
The Normal Density
• Univariate Gaussian density
  p(x) = 1/(√(2π) σ) exp[ -(1/2) ((x - μ)/σ)² ]
• Mean
  μ = E[x] = ∫ x p(x) dx
• Variance
  σ² = E[(x - μ)²] = ∫ (x - μ)² p(x) dx
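A direct transcription of the density into Python (a sketch, not from the slides; the sample check at the end uses randomly generated data to approximate the mean and variance integrals):

```python
import numpy as np

def univariate_normal(x, mu, sigma):
    """p(x) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x - mu)/sigma)**2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

print(univariate_normal(1.0, mu=1.0, sigma=2.0))       # density at the mean

# Sample mean and variance approximate E[x] and E[(x - mu)^2]
samples = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=100_000)
print(samples.mean(), samples.var())                    # approximately 1.0 and 4.0
```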
The Normal Density
The Normal Density
• Central Limit Theorem
– The aggregate effect of the sum of a large
number of small, independent random
disturbances will lead to a Gaussian distribution
– Gaussian is often a good model for the actual
probability distribution
The Multivariate Normal Density
• Multivariate density (in d dimensions)
  p(x) = 1/((2π)^{d/2} |Σ|^{1/2}) exp[ -(1/2) (x - μ)^t Σ^{-1} (x - μ) ]
• Abbreviation
  p(x) ~ N(μ, Σ)
The Multivariate Normal Density
• Mean
  μ = E[x] = ∫ x p(x) dx
• Covariance matrix
  Σ = E[(x - μ)(x - μ)^t] = ∫ (x - μ)(x - μ)^t p(x) dx
• The ijth component of Σ
  σij = E[(xi - μi)(xj - μj)]
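A minimal Python sketch of the multivariate density (not from the slides; the mean vector and covariance matrix below are chosen only for illustration):

```python
import numpy as np

def multivariate_normal(x, mu, Sigma):
    """p(x) = exp(-0.5*(x-mu)^t Sigma^{-1} (x-mu)) / ((2*pi)^{d/2} |Sigma|^{1/2})."""
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm_const

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(multivariate_normal(np.array([1.0, -0.5]), mu, Sigma))
```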
Statistical Independence
  – If xi and xj are statistically independent, then σij = 0
  – The covariance matrix then becomes a diagonal matrix; all off-diagonal elements are zero
Whitening Transform
  Aw = Φ Λ^{-1/2}
  where Φ is the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ is the diagonal matrix of the corresponding eigenvalues of Σ
Whitening Transform
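A sketch of the transform in Python (not from the slides; the covariance and sample size are illustrative). The eigendecomposition of the sample covariance gives Φ and Λ, and applying Aw = Φ Λ^{-1/2} to the data should produce a covariance close to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=10_000)

# Eigendecomposition of the sample covariance: Sigma ~= Phi Lambda Phi^t
eigvals, Phi = np.linalg.eigh(np.cov(X, rowvar=False))
Aw = Phi @ np.diag(eigvals ** -0.5)          # whitening transform Aw = Phi Lambda^{-1/2}

X_white = X @ Aw                             # y = Aw^t x, applied row-wise
print(np.cov(X_white, rowvar=False).round(3))   # approximately the identity matrix
```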
Squared Mahalanobis Distance from x to μ
  r² = (x - μ)^t Σ^{-1} (x - μ)
  – The loci of constant density are hyperellipsoids of constant Mahalanobis distance to μ
  – The principal axes of the hyperellipsoids are given by the eigenvectors of Σ
  – The lengths of the axes are determined by the eigenvalues of Σ
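A small Python helper for the squared Mahalanobis distance (a sketch; the mean and covariance values are invented for illustration):

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """r^2 = (x - mu)^t Sigma^{-1} (x - mu)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))   # solve avoids an explicit inverse

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mahalanobis_sq(np.array([2.0, 2.5]), mu, Sigma))
```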
Discriminant Functions for the Normal Density
• Minimum-error-rate discriminant functions
  gi(x) = ln p(x|wi) + ln P(wi)
• If the densities p(x|wi) are multivariate normal, i.e., p(x|wi) ~ N(μi, Σi), then we have:
  gi(x) = -(1/2)(x - μi)^t Σi^{-1} (x - μi) - (d/2) ln 2π - (1/2) ln |Σi| + ln P(wi)
Discriminant Functions for the Normal Density
• Case 1: Σi = σ²I
  – Features are statistically independent and each feature has the same variance σ²
  gi(x) = -‖x - μi‖² / (2σ²) + ln P(wi)
  – where ‖·‖ denotes the Euclidean norm:
    ‖x - μi‖² = (x - μi)^t (x - μi)
Case 1: Σi = σ²I
Linear Discriminant Function
• It is not necessary to compute distances
– Expanding the form (x - μi)^t (x - μi) yields
  gi(x) = -(1/(2σ²)) [x^t x - 2 μi^t x + μi^t μi] + ln P(wi)
– The term x^t x is the same for all i
– We have the following linear discriminant function:
  gi(x) = wi^t x + wi0
Linear Discriminant Function
where
  wi = μi / σ²
and
  wi0 = -(1/(2σ²)) μi^t μi + ln P(wi)
wi0 is the threshold or bias for the ith category
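A Python sketch of the Case 1 linear machine (the means, σ, priors, and test point are invented for illustration):

```python
import numpy as np

def case1_discriminants(means, sigma, priors):
    """Return (w_i, w_i0) for g_i(x) = w_i^t x + w_i0 when Sigma_i = sigma^2 I."""
    ws = [mu / sigma**2 for mu in means]
    w0s = [-(mu @ mu) / (2 * sigma**2) + np.log(P) for mu, P in zip(means, priors)]
    return ws, w0s

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
ws, w0s = case1_discriminants(means, sigma=1.0, priors=[0.6, 0.4])

x = np.array([1.0, 2.0])
scores = [w @ x + w0 for w, w0 in zip(ws, w0s)]
print("decide w%d" % (int(np.argmax(scores)) + 1))
```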
Linear Machine
• A classifier that uses linear discriminant
functions is called a linear machine
• Its decision surfaces are pieces of
hyperplanes defined by the linear equations
  gi(x) = gj(x) for the two categories with the highest posterior probabilities. For our case this equation can be written as
  w^t (x - x0) = 0
Linear Machine
where
  w = μi - μj
and
  x0 = (1/2)(μi + μj) - [σ² / ‖μi - μj‖²] ln[P(wi)/P(wj)] (μi - μj)
If P(wi) = P(wj), the second term vanishes and the classifier becomes a minimum-distance classifier
Priors change -> decision boundaries shift
Case 2: Σi = Σ
• Covariance matrices for all of the classes are identical but otherwise arbitrary
• The cluster for the ith class is centered about μi
• Discriminant function:
  gi(x) = -(1/2)(x - μi)^t Σ^{-1} (x - μi) + ln P(wi)
  (the ln P(wi) term can be ignored if the prior probabilities are the same for all classes)
Case 2: Discriminant function
  gi(x) = wi^t x + wi0
where
  wi = Σ^{-1} μi
and
  wi0 = -(1/2) μi^t Σ^{-1} μi + ln P(wi)
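A Python sketch of the Case 2 discriminant with a shared covariance matrix (all numbers below are illustrative only):

```python
import numpy as np

def case2_discriminants(means, Sigma, priors):
    """Return (w_i, w_i0) for g_i(x) = w_i^t x + w_i0 when Sigma_i = Sigma for all i."""
    Sigma_inv = np.linalg.inv(Sigma)
    ws = [Sigma_inv @ mu for mu in means]
    w0s = [-0.5 * mu @ Sigma_inv @ mu + np.log(P) for mu, P in zip(means, priors)]
    return ws, w0s

means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = np.array([[1.5, 0.4],
                  [0.4, 1.0]])
ws, w0s = case2_discriminants(means, Sigma, priors=[0.5, 0.5])

x = np.array([1.2, 0.8])
print("decide w%d" % (int(np.argmax([w @ x + w0 for w, w0 in zip(ws, w0s)])) + 1))
```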
For the Two-Category Case
• If Ri and Rj are contiguous, the boundary
between them has the equation
  w^t (x - x0) = 0
where
  w = Σ^{-1} (μi - μj)
and
  x0 = (1/2)(μi + μj) - [ln(P(wi)/P(wj)) / ((μi - μj)^t Σ^{-1} (μi - μj))] (μi - μj)
Case 3: Σi arbitrary
• In general, the covariance matrices are different for each category
• The only term that can be dropped is the (d/2) ln 2π term
Case 3: Σi arbitrary
The discriminant functions are
  gi(x) = x^t Wi x + wi^t x + wi0
where
  Wi = -(1/2) Σi^{-1}
  wi = Σi^{-1} μi
and
  wi0 = -(1/2) μi^t Σi^{-1} μi - (1/2) ln |Σi| + ln P(wi)
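A Python sketch of the Case 3 quadratic discriminant (the per-class means, covariances, priors, and test point are invented for illustration):

```python
import numpy as np

def case3_discriminant(x, mu, Sigma, prior):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 with W_i = -0.5 * Sigma_i^{-1}."""
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return x @ W @ x + w @ x + w0

# Illustrative parameters for two classes with different covariances
params = [(np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 0.5]]), 0.5),
          (np.array([2.0, 1.0]), np.array([[2.0, 0.7], [0.7, 1.5]]), 0.5)]

x = np.array([1.0, 0.5])
scores = [case3_discriminant(x, mu, S, P) for mu, S, P in params]
print("decide w%d" % (int(np.argmax(scores)) + 1))
```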
Two-category case
• The decision surfaces are hyperquadrics (hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, ...)
Example