Pattern Classification

All materials in these slides were taken from
Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork,
John Wiley & Sons, 2000,
with the permission of the authors and the publisher

Chapter 3:
Maximum-Likelihood & Bayesian Parameter Estimation (part 1)

• Introduction
• Maximum-Likelihood Estimation
• Example of a Specific Case
• The Gaussian Case: unknown μ and Σ
• Bias
• Appendix: ML Problem Statement

Introduction

• Data availability in a Bayesian framework
  We could design an optimal classifier if we knew:
  • P(ωi) (priors)
  • P(x | ωi) (class-conditional densities)
  Unfortunately, we rarely have this complete information!

• Design a classifier from a training sample
  • No problem with prior estimation
  • Samples are often too small for class-conditional estimation
    (large dimension of feature space!)

• A priori information about the problem
  Do we know something about the distribution?
  → find parameters to characterize the distribution

• Example: normality of P(x | ωi)
  P(x | ωi) ~ N(μi, Σi)
  Characterized by 2 parameters: μi and Σi

• Estimation techniques
  • Maximum-Likelihood (ML) and Bayesian estimation
  • Results are nearly identical, but the approaches are different

• Parameters in ML estimation are fixed but unknown!
  The best parameters are obtained by maximizing the probability
  of obtaining the samples observed

• Bayesian methods view the parameters as random variables
  having some known distribution

• In either approach, we use P(ωi | x) for our classification rule!

Maximum-Likelihood Estimation

• Has good convergence properties as the sample size increases
• Simpler than alternative techniques

• General principle
  Assume we have c classes and
  P(x | ωj) ~ N(μj, Σj)
  P(x | ωj) ≡ P(x | ωj, θj), where:

  \theta_j = (\mu_j, \Sigma_j) = (\mu_j^1, \mu_j^2, \ldots, \sigma_j^{11}, \sigma_j^{22}, \ldots, \mathrm{cov}(x_j^m, x_j^n), \ldots)
• Use the information provided by the training samples to estimate
  θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated
  with the i-th category

• Suppose that D contains n samples, x1, x2, …, xn:

  P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta) = F(\theta)

  P(D | θ) is called the likelihood of θ w.r.t. the set of samples

• The ML estimate of θ is, by definition, the value θ̂ that
  maximizes P(D | θ):
  "It is the value of θ that best agrees with the actually
  observed training sample."
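Below is a minimal sketch (not from the slides) of this definition, assuming numpy/scipy; the samples and candidate parameters are made up for illustration, and a univariate Gaussian is used as the model. The log form, introduced on the next slide, is numerically safer than the raw product.

```python
# A minimal sketch of the likelihood P(D | theta) for a univariate Gaussian
# model. The samples and candidate parameters are made up for illustration.
import numpy as np
from scipy.stats import norm

D = np.array([4.2, 5.1, 4.8, 5.5, 4.9])  # hypothetical training samples

def likelihood(D, mu, sigma):
    # P(D | theta) = product over k of P(x_k | theta)
    return np.prod(norm.pdf(D, loc=mu, scale=sigma))

def log_likelihood(D, mu, sigma):
    # l(theta) = sum over k of ln P(x_k | theta) -- numerically safer
    return np.sum(norm.logpdf(D, loc=mu, scale=sigma))

# The likelihood is larger for parameters that agree with the data:
print(likelihood(D, mu=5.0, sigma=0.5))  # candidate near the sample mean
print(likelihood(D, mu=0.0, sigma=0.5))  # candidate far from the data -> tiny
```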
• Optimal estimation
  Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator:

  \nabla_\theta = \left( \frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \ldots, \frac{\partial}{\partial\theta_p} \right)^t

• We define l(θ) as the log-likelihood function:
  l(θ) = ln P(D | θ)
  (recall D is the training data)

• New problem statement:
  determine the θ that maximizes the log-likelihood:

  \hat{\theta} = \arg\max_{\theta} l(\theta)
• The definition of l(θ) is:

  l(\theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta)

  and

  \nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta)    (eq. 6)

• The set of necessary conditions for an optimum is:

  \nabla_\theta l = 0    (eq. 7)
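As a concrete (and entirely illustrative) version of this problem statement, the sketch below maximizes l(θ) numerically for a univariate Gaussian by minimizing the negative log-likelihood with scipy.optimize.minimize; the data are made up, and σ is parameterized by its log so it stays positive during the search.

```python
# A minimal sketch: solve theta_hat = argmax l(theta) numerically for a
# univariate Gaussian by minimizing -l(theta). Data are made up.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

D = np.array([4.2, 5.1, 4.8, 5.5, 4.9])

def neg_log_likelihood(theta):
    mu, log_sigma = theta                  # sigma parameterized by its log
    sigma = np.exp(log_sigma)              # keeps sigma > 0 during the search
    return -np.sum(norm.logpdf(D, loc=mu, scale=sigma))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # matches the closed-form estimates derived below
```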
• Example, the Gaussian case: unknown μ
  • We assume we know the covariance matrix Σ
  • p(xk | μ) ~ N(μ, Σ)
    (samples are drawn from a multivariate normal population)

  \ln p(x_k \mid \mu) = -\frac{1}{2} \ln\left[(2\pi)^d |\Sigma|\right] - \frac{1}{2} (x_k - \mu)^t \Sigma^{-1} (x_k - \mu)

  and

  \nabla_\mu \ln p(x_k \mid \mu) = \Sigma^{-1} (x_k - \mu)    (eq. 9)

• Here θ = μ, therefore the ML estimate for μ must satisfy:

  \sum_{k=1}^{n} \Sigma^{-1} (x_k - \hat{\mu}) = 0    (from eqs. 6, 7 & 9)
• Multiplying by Σ and rearranging, we obtain:

  \hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k

  Just the arithmetic average of the training samples!

• Conclusion:
  If P(xk | ωj) (j = 1, 2, …, c) is supposed to be Gaussian in a
  d-dimensional feature space, then we can estimate the vector
  θ = (θ1, θ2, …, θc)^t and perform an optimal classification!
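As a quick numerical check (not from the slides; the data and the known Σ are made up), the log-likelihood is indeed largest at the sample mean:

```python
# Check that the sample mean maximizes the Gaussian log-likelihood when
# Sigma is known. The data and Sigma below are made up for illustration.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])         # assumed known covariance
X = rng.multivariate_normal([1.0, -1.0], Sigma, size=200)

def log_likelihood(mu):
    return np.sum(multivariate_normal.logpdf(X, mean=mu, cov=Sigma))

mu_hat = X.mean(axis=0)                             # arithmetic average
for delta in (0.0, 0.1, -0.1):
    print(delta, log_likelihood(mu_hat + delta))    # peak at delta = 0
```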
• Example, the Gaussian case: unknown μ and Σ

• First consider the univariate case: unknown μ and σ²
  θ = (θ1, θ2) = (μ, σ²)

  l = \ln p(x_k \mid \theta) = -\frac{1}{2} \ln 2\pi\theta_2 - \frac{1}{2\theta_2} (x_k - \theta_1)^2

  \nabla_\theta l = \begin{pmatrix} \dfrac{\partial}{\partial\theta_1} \ln P(x_k \mid \theta) \\[4pt] \dfrac{\partial}{\partial\theta_2} \ln P(x_k \mid \theta) \end{pmatrix} = 0

  which yields the two conditions:

  \frac{1}{\theta_2} (x_k - \theta_1) = 0

  -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2} = 0
• Summation over the training set:

  \sum_{k=1}^{n} \frac{1}{\hat{\theta}_2} (x_k - \hat{\theta}_1) = 0    (1)

  -\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\theta}_1)^2}{\hat{\theta}_2^2} = 0    (2)

• Combining (1) and (2), one obtains:

  \hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k ; \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2
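A minimal sketch (made-up data) of these closed-form estimates; note the division by n rather than n − 1, which is the biased ML form discussed under Bias below.

```python
# Closed-form univariate ML estimates from equations (1) and (2).
# The sample values are made up for illustration.
import numpy as np

x = np.array([4.2, 5.1, 4.8, 5.5, 4.9])

mu_hat = x.mean()                         # (1/n) * sum of x_k
sigma2_hat = np.mean((x - mu_hat) ** 2)   # (1/n) * sum of squared deviations
print(mu_hat, sigma2_hat)                 # sigma2_hat equals np.var(x) (ddof=0)
```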
• The ML estimates for the multivariate case are similar
  • The scalars μ and xk are replaced with vectors
  • The variance σ² is replaced by the covariance matrix Σ

  \hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k

  \hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^t
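A minimal sketch (made-up data) that mirrors the two formulas above, writing the outer-product sum explicitly:

```python
# Multivariate ML estimates of mu and Sigma; data are made up.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 2.0], [[2.0, 0.5], [0.5, 1.0]], size=500)
n = X.shape[0]

mu_hat = X.mean(axis=0)             # (1/n) * sum of x_k
diffs = X - mu_hat
Sigma_hat = (diffs.T @ diffs) / n   # (1/n) * sum of outer products
print(mu_hat)
print(Sigma_hat)                    # same as np.cov(X, rowvar=False, ddof=0)
```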
Bias

• The ML estimate for σ² is biased:

  E\left[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right] = \frac{n-1}{n} \sigma^2 \neq \sigma^2

• Extreme case: n = 1 gives E[·] = 0 ≠ σ²
• As n increases, the bias is reduced
  → this type of estimator is called asymptotically unbiased
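The bias can be checked empirically. In the sketch below (not from the slides; σ², n, and the trial count are made up), the average ML variance estimate over many size-n samples comes out close to ((n − 1)/n)σ² rather than σ²:

```python
# Empirical check of E[(1/n) * sum (x_i - xbar)^2] = ((n-1)/n) * sigma^2.
# The distribution parameters and trial count are made up for illustration.
import numpy as np

rng = np.random.default_rng(2)
sigma2, n, trials = 4.0, 5, 100_000

estimates = np.empty(trials)
for t in range(trials):
    x = rng.normal(0.0, np.sqrt(sigma2), size=n)
    estimates[t] = np.mean((x - x.mean()) ** 2)  # ML estimate, divides by n

print(estimates.mean())           # close to ((n-1)/n) * sigma2 = 3.2
print((n - 1) / n * sigma2)
```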
• An elementary unbiased estimator for Σ is:

  C = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^t

  (the sample covariance matrix)

• This estimator is unbiased for all distributions
  → such estimators are called absolutely unbiased
• Our earlier estimator for Σ is biased:

  \hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^t

  In fact, it is asymptotically unbiased: observe that

  \hat{\Sigma} = \frac{n-1}{n} C
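A short sketch (made-up data) showing this relation numerically; numpy's ddof argument selects the divisor, so ddof=0 gives the biased Σ̂ and ddof=1 gives the unbiased C.

```python
# Biased ML covariance vs. unbiased sample covariance C; data are made up.
# numpy's ddof argument selects the divisor: ddof=0 -> n, ddof=1 -> n-1.
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.4], [0.4, 1.0]], size=50)
n = X.shape[0]

Sigma_hat = np.cov(X, rowvar=False, ddof=0)  # ML estimate, divides by n
C = np.cov(X, rowvar=False, ddof=1)          # unbiased, divides by n-1

print(np.allclose(Sigma_hat, (n - 1) / n * C))  # True: Sigma_hat = ((n-1)/n) C
```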
Appendix: ML Problem Statement

• Let D = {x1, x2, …, xn}

  P(x_1, \ldots, x_n \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta); \quad |D| = n

• Our goal is to determine θ̂, the value of θ that maximizes
  the likelihood of this sample set!
[Figure: the sample set D (|D| = n) is partitioned into class subsets D1, …, Dk, …, Dc; the samples in each subset Dj are drawn according to the class-conditional density P(x | ωj) ~ N(μj, Σj).]
 = (1, 2, …, c)
Problem: find ̂ such that:
Max P(D | )  MaxP (x 1,...,x n | )

n
 Max  P(x k | )
k 1
Pattern Classification, Chapter2 3