Transcript lecture_05
ECE 8443 – Pattern Recognition
LECTURE 05: MAXIMUM LIKELIHOOD ESTIMATION
• Objectives:
Discrete Features
Maximum Likelihood
• Resources:
D.H.S.: Chapter 3 (Part 1)
D.H.S.: Chapter 3 (Part 2)
J.O.S.: Tutorial
Nebula: Links
BGSU: Example
A.W.M.: Tutorial
A.W.M.: Links
S.P.: Primer
CSRN: Unbiased
A.W.M.: Bias
Discrete Features
• For problems where features are discrete:
$$\int p(\mathbf{x}|\omega_j)\,d\mathbf{x} \;\longrightarrow\; \sum_{\mathbf{x}} P(\mathbf{x}|\omega_j)$$
• Bayes formula involves probabilities (not densities); a short numerical sketch follows below:
$$P(\omega_j|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_j)P(\omega_j)}{p(\mathbf{x})} \;\longrightarrow\; P(\omega_j|\mathbf{x}) = \frac{P(\mathbf{x}|\omega_j)P(\omega_j)}{P(\mathbf{x})}$$
where
$$P(\mathbf{x}) = \sum_{j=1}^{c} P(\mathbf{x}|\omega_j)P(\omega_j)$$
• Bayes rule remains the same:
$$\alpha^* = \arg\min_{i} R(\alpha_i|\mathbf{x})$$
• The maximum entropy distribution is a uniform distribution:
$$P(\mathbf{x} = \mathbf{x}_i) = \frac{1}{N}$$
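As a quick numerical illustration of the discrete form of Bayes formula above, the sketch below evaluates the posteriors for a made-up problem (two classes, one discrete feature); the probability tables and priors are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical example: 2 classes, one discrete feature x taking 3 values.
# Each row is a class-conditional probability mass function P(x | w_j); rows sum to 1.
P_x_given_w = np.array([[0.7, 0.2, 0.1],    # P(x | w_1)
                        [0.2, 0.3, 0.5]])   # P(x | w_2)
priors = np.array([0.6, 0.4])               # P(w_1), P(w_2)

x = 2                                        # observed feature value (index)

# Bayes formula with probabilities (not densities):
# P(w_j | x) = P(x | w_j) P(w_j) / P(x),  where  P(x) = sum_j P(x | w_j) P(w_j)
joint = P_x_given_w[:, x] * priors
posterior = joint / joint.sum()

print(posterior)   # approximately [0.23, 0.77] -> decide w_2 for minimum error rate
```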
Discriminant Functions For Discrete Features
• Consider independent binary features:
$$\mathbf{x} = (x_1, \ldots, x_d)^t, \qquad p_i = \Pr[x_i = 1\,|\,\omega_1], \qquad q_i = \Pr[x_i = 1\,|\,\omega_2]$$
• Assuming conditional independence:
$$P(\mathbf{x}|\omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1-p_i)^{1-x_i} \qquad\qquad P(\mathbf{x}|\omega_2) = \prod_{i=1}^{d} q_i^{x_i}(1-q_i)^{1-x_i}$$
• The likelihood ratio is:
$$\frac{P(\mathbf{x}|\omega_1)}{P(\mathbf{x}|\omega_2)} = \prod_{i=1}^{d} \frac{p_i^{x_i}(1-p_i)^{1-x_i}}{q_i^{x_i}(1-q_i)^{1-x_i}}$$
• The discriminant function is:
$$g(\mathbf{x}) = \sum_{i=1}^{d}\left[x_i\ln\frac{p_i}{q_i} + (1-x_i)\ln\frac{1-p_i}{1-q_i}\right] + \ln\frac{P(\omega_1)}{P(\omega_2)} = \sum_{i=1}^{d} w_i x_i + w_0$$
where
$$w_i = \ln\frac{p_i(1-q_i)}{q_i(1-p_i)}, \quad i = 1, \ldots, d \qquad\text{and}\qquad w_0 = \sum_{i=1}^{d}\ln\frac{1-p_i}{1-q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
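As a concrete illustration of this discriminant, here is a minimal Python sketch; the values of p_i, q_i, and the priors below are hypothetical, chosen only to show the weight and bias computation.

```python
import numpy as np

# Hypothetical parameters for d = 3 independent binary features.
p = np.array([0.8, 0.6, 0.3])   # p_i = Pr[x_i = 1 | w_1]
q = np.array([0.4, 0.5, 0.7])   # q_i = Pr[x_i = 1 | w_2]
P1, P2 = 0.5, 0.5               # priors P(w_1), P(w_2)

# Weights: w_i = ln[p_i (1 - q_i) / (q_i (1 - p_i))]
w = np.log(p * (1 - q) / (q * (1 - p)))
# Bias: w_0 = sum_i ln[(1 - p_i) / (1 - q_i)] + ln[P(w_1) / P(w_2)]
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)

def g(x):
    """Linear discriminant for a binary feature vector x; g(x) > 0 decides w_1."""
    return np.dot(w, x) + w0

print(g(np.array([1, 1, 0])))   # positive -> decide w_1
print(g(np.array([0, 0, 1])))   # negative -> decide w_2
```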
Introduction to Maximum Likelihood Estimation
• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ω_i), and the class-conditional densities, p(x|ω_i).
• What can we do if we do not have this information?
• What limitations do we face?
• There are two common approaches to parameter estimation: maximum
likelihood and Bayesian estimation.
• Maximum Likelihood: treat the parameters as quantities whose values are
fixed but unknown.
• Bayes: treat the parameters as random variables having some known prior distribution. Observing samples converts this prior to a posterior density.
• Bayesian Learning: sharpen the a posteriori density, causing it to peak near the true value.
General Principle
• I.I.D.: c data sets, D_1, ..., D_c, where the samples in D_j are drawn independently according to p(x|ω_j).
• Assume p(x|ω_j) has a known parametric form and is completely determined by the parameter vector θ_j (e.g., p(x|ω_j) ~ N(μ_j, Σ_j), where θ_j consists of the components of μ_j and Σ_j: θ_j = [μ_1, ..., μ_d, σ_11, σ_12, ..., σ_dd]).
• p(x|ω_j) has an explicit dependence on θ_j: p(x|ω_j, θ_j).
• Use training samples to estimate θ_1, θ_2, ..., θ_c.
• Functional independence: assume D_i gives no useful information about θ_j for i ≠ j.
• This simplifies notation to a set D of training samples (x_1, ..., x_n) drawn independently from p(x|θ), used to estimate θ.
• Because the samples were drawn independently:
$$p(D|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta})$$
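In practice the product above is evaluated in the log domain to avoid numerical underflow. Below is a minimal sketch for a univariate Gaussian; the sample D and the candidate parameter values are hypothetical.

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    """Log of the univariate Gaussian density N(mu, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

D = np.array([1.2, 0.7, 2.1, 1.5, 0.9])   # hypothetical i.i.d. sample
mu, sigma = 1.0, 1.0                       # candidate parameter values

# p(D | theta) = prod_k p(x_k | theta); summing logs avoids underflow.
log_likelihood = np.sum(gauss_logpdf(D, mu, sigma))
likelihood = np.exp(log_likelihood)
print(log_likelihood, likelihood)
```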
Example of ML Estimation
• p(D|θ) is called the likelihood of θ with respect to the data.
• The value of θ that maximizes this likelihood, denoted θ̂, is the maximum likelihood (ML) estimate of θ.
• Given several training points
• Top: candidate source distributions are
shown
• Which distribution is the ML estimate?
• Middle: an estimate of the likelihood of the data as a function of θ (the mean)
• Bottom: log likelihood
General Mathematics
Let $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_p)^t$ and let $\nabla_{\boldsymbol{\theta}} = \left[\dfrac{\partial}{\partial\theta_1}, \ldots, \dfrac{\partial}{\partial\theta_p}\right]^t$.
Define the log likelihood: $l(\boldsymbol{\theta}) \equiv \ln p(D|\boldsymbol{\theta})$.
$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta})$$
$$l(\boldsymbol{\theta}) = \sum_{k=1}^{n}\ln p(\mathbf{x}_k|\boldsymbol{\theta}) \qquad\qquad \nabla_{\boldsymbol{\theta}}\, l = \sum_{k=1}^{n}\nabla_{\boldsymbol{\theta}}\ln p(\mathbf{x}_k|\boldsymbol{\theta})$$
• The ML estimate is found by solving this equation:
$$\nabla_{\boldsymbol{\theta}}\, l = \sum_{k=1}^{n}\nabla_{\boldsymbol{\theta}}\left[\ln p(\mathbf{x}_k|\boldsymbol{\theta})\right] = 0$$
• The solution to this equation can
be a global maximum, a local
maximum, or even an inflection
point.
• Under what conditions is it a global
maximum?
Maximum A Posteriori Estimation
• A class of estimators – maximum a posteriori (MAP) – maximizes l(θ)p(θ), where p(θ) describes the prior probability of different parameter values (a small numerical sketch follows below).
• An ML estimator is a MAP estimator for uniform priors.
• A MAP estimator finds the peak, or mode, of a posterior density.
• MAP estimators are not transformation invariant (if we perform a nonlinear
transformation of the input data, the estimator is no longer optimum in the
new space). This observation will be useful later in the course.
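As a small numerical sketch of the ML/MAP relationship described above, the code below assumes a Gaussian likelihood with known variance and a hypothetical Gaussian prior on the mean, and finds both estimates by a simple grid search; with a flat (uniform) prior, the MAP estimate would coincide with the ML estimate.

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    """Log of the univariate Gaussian density N(mu, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

D = np.array([2.3, 1.9, 2.8, 2.1])      # hypothetical data, known sigma = 1
mu0, sigma0 = 0.0, 1.0                   # hypothetical Gaussian prior on the mean

grid = np.linspace(-2, 5, 7001)          # candidate values of the mean
log_lik = np.array([np.sum(gauss_logpdf(D, m, 1.0)) for m in grid])
log_prior = gauss_logpdf(grid, mu0, sigma0)

mu_ml = grid[np.argmax(log_lik)]               # ML: equivalent to a flat prior
mu_map = grid[np.argmax(log_lik + log_prior)]  # MAP: mode of the posterior

print(mu_ml, mu_map)   # ML = sample mean (2.275); MAP (about 1.82) is pulled toward the prior mean
```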
Gaussian Case: Unknown Mean
• Consider the case where only the mean, θ = μ, is unknown:
$$\sum_{k=1}^{n}\nabla_{\mu}\ln p(\mathbf{x}_k|\mu) = 0$$
$$\ln p(\mathbf{x}_k|\mu) = \ln\left[\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)\right)\right] = -\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)$$
which implies:
$$\nabla_{\mu}\ln p(\mathbf{x}_k|\mu) = \Sigma^{-1}(\mathbf{x}_k-\mu)$$
because:
$$\nabla_{\mu}\left[-\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)\right] = \Sigma^{-1}(\mathbf{x}_k-\mu)$$
Gaussian Case: Unknown Mean
• Substituting into the expression for the total likelihood:
$$\nabla_{\mu}\, l = \sum_{k=1}^{n}\nabla_{\mu}\ln p(\mathbf{x}_k|\mu) = \sum_{k=1}^{n}\Sigma^{-1}(\mathbf{x}_k-\mu) = 0$$
• Rearranging terms:
$$\sum_{k=1}^{n}\Sigma^{-1}(\mathbf{x}_k-\hat{\mu}) = 0 \;\Rightarrow\; \sum_{k=1}^{n}(\mathbf{x}_k-\hat{\mu}) = 0 \;\Rightarrow\; \sum_{k=1}^{n}\mathbf{x}_k - n\hat{\mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$
• Significance???
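The significance is that the ML estimate of a Gaussian mean is simply the sample average of the training data. A minimal sketch (with synthetic data) that checks this numerically by comparing the closed-form estimate to a brute-force maximization of the log likelihood:

```python
import numpy as np

def gauss_loglik(D, mu, sigma=1.0):
    """Total log likelihood of the sample D for a Gaussian with mean mu."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (D - mu)**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
D = rng.normal(loc=3.0, scale=1.0, size=50)   # hypothetical training sample

mu_hat = D.mean()                              # closed-form ML estimate: (1/n) sum_k x_k

# Brute-force check: the log likelihood peaks at (or next to) the sample mean.
grid = np.linspace(D.min(), D.max(), 10001)
mu_grid = grid[np.argmax([gauss_loglik(D, m) for m in grid])]

print(mu_hat, mu_grid)   # the two agree to within the grid spacing
```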
Gaussian Case: Unknown Mean and Variance
• Let θ = [θ_1, θ_2] = [μ, σ²]. The log likelihood of a SINGLE point is:
$$\ln p(x_k|\boldsymbol{\theta}) = -\frac{1}{2}\ln\left[2\pi\theta_2\right] - \frac{1}{2\theta_2}(x_k-\theta_1)^2$$
$$\nabla_{\boldsymbol{\theta}}\, l = \nabla_{\boldsymbol{\theta}}\ln p(x_k|\boldsymbol{\theta}) = \begin{bmatrix}\dfrac{1}{\theta_2}(x_k-\theta_1)\\[2mm] -\dfrac{1}{2\theta_2} + \dfrac{(x_k-\theta_1)^2}{2\theta_2^2}\end{bmatrix}$$
• The full likelihood leads to:
$$\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2}(x_k-\hat{\theta}_1) = 0 \qquad\qquad -\sum_{k=1}^{n}\frac{1}{2\hat{\theta}_2} + \sum_{k=1}^{n}\frac{(x_k-\hat{\theta}_1)^2}{2\hat{\theta}_2^2} = 0$$
Gaussian Case: Unknown Mean and Variance
• This leads to these equations:
$$\hat{\theta}_1 = \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k \qquad\qquad \hat{\theta}_2 = \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2$$
• In the multivariate case:
$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k \qquad\qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t$$
• The true covariance is the expected value of the matrix $(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t$, which is a familiar result.
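A minimal numpy sketch of these multivariate estimates on synthetic data (the true parameters below are made up for illustration). Note that the ML covariance estimate divides by n rather than n − 1, so it matches numpy's biased estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2-D Gaussian sample.
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=500)   # shape (n, d)

n = X.shape[0]
mu_hat = X.mean(axis=0)                 # (1/n) sum_k x_k

# ML covariance: (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t  (biased; divides by n)
diff = X - mu_hat
Sigma_hat = diff.T @ diff / n

print(mu_hat)
print(Sigma_hat)
print(np.allclose(Sigma_hat, np.cov(X.T, bias=True)))   # matches numpy's biased estimator
```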
Convergence of the Mean
• Does the maximum likelihood estimate of the variance converge to the true
value of the variance? Let’s start with a few simple results we will need later.
• Expected value of the ML estimate of the mean:
$$E[\hat{\mu}] = E\left[\frac{1}{n}\sum_{i=1}^{n}x_i\right] = \frac{1}{n}\sum_{i=1}^{n}E[x_i] = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu$$
Variance of the ML Estimate of the Mean
$$\mathrm{var}[\hat{\mu}] = E[\hat{\mu}^2] - (E[\hat{\mu}])^2 = E[\hat{\mu}^2] - \mu^2 = E\left[\frac{1}{n}\sum_{i=1}^{n}x_i\cdot\frac{1}{n}\sum_{j=1}^{n}x_j\right] - \mu^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}E[x_i x_j] - \mu^2$$
• The expected value of x_i x_j will be μ² for i ≠ j, since the two random variables are independent.
• The expected value of x_i² will be μ² + σ².
• Hence, in the summation above, we have n² − n terms with expected value μ² and n terms with expected value μ² + σ².
• Thus,
$$\mathrm{var}[\hat{\mu}] = \frac{1}{n^2}\left[(n^2-n)\mu^2 + n(\mu^2+\sigma^2)\right] - \mu^2 = \frac{\sigma^2}{n}$$
which implies:
$$E[\hat{\mu}^2] = \mathrm{var}[\hat{\mu}] + (E[\hat{\mu}])^2 = \frac{\sigma^2}{n} + \mu^2$$
• We see that the variance of the estimate goes to zero as n goes to infinity, and our estimate converges to the true mean (the error goes to zero).
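A quick Monte Carlo check of the σ²/n result (the sample sizes and number of trials below are arbitrary): the empirical variance of the sample mean across repeated draws shrinks like σ²/n.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 0.0, 2.0          # true parameters (sigma^2 = 4)
trials = 20000                # number of repeated experiments per sample size

for n in (10, 100, 1000):
    # Draw `trials` independent samples of size n and compute each sample mean.
    means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    print(n, means.var(), sigma**2 / n)   # empirical var[mu_hat] vs. sigma^2 / n
```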
Summary
• Discriminant functions for discrete features are completely analogous to the
continuous case (end of Chapter 2).
• To develop an optimal classifier, we need reliable estimates of the statistics of
the features.
• In Maximum Likelihood (ML) estimation, we treat the parameters as having
unknown but fixed values.
• ML estimation justifies many well-known results for estimating parameters (e.g., computing the mean by averaging the observations).
• Biased and unbiased estimators.
• Convergence of the mean and variance estimates.