Transcript lecture_05
ECE 8527 – Introduction to Machine Learning and Pattern Recognition
LECTURE 05: BAYESIAN ESTIMATION
• Objectives:
Bayesian Estimation
Example
• Resources:
D.H.S.: Chapter 3 (Part 2)
J.O.S.: Bayesian Parameter Estimation
A.K.: The Holy Trinity
A.E.: Bayesian Methods
J.H.: Euro Coin
Introduction to Bayesian Parameter Estimation
• In Chapter 2, we learned how to design an optimal classifier if we knew the
prior probabilities, P(ωi), and class-conditional densities, p(x|ωi).
• Bayes: treat the parameters as random variables having some known prior
distribution. Observation of samples converts this to a posterior.
• Bayesian learning: sharpen the a posteriori density, causing it to peak near
the true value.
• Supervised vs. unsupervised: do we know the class assignments of the
training data?
• Bayesian estimation and ML estimation produce very similar results in many
cases.
• Reduces statistical inference (prior knowledge or beliefs about the world) to
probabilities.
ECE 8527: Lecture 05, Slide 1
Class-Conditional Densities
• Posterior probabilities, P(ωi|x), are central to Bayesian classification.
• Bayes formula allows us to compute P(ωi|x) from the priors, P(ωi), and the
likelihood, p(x|ωi).
• But what if the priors and class-conditional densities are unknown?
• The answer is that we can compute the posterior, P(ωi|x), using all of the
information at our disposal (e.g., training data).
• For a training set, D, Bayes formula becomes:

$$P(\omega_i \mid \mathbf{x}, D) = \frac{\overbrace{p(\mathbf{x} \mid \omega_i, D)}^{\text{likelihood}}\;\overbrace{P(\omega_i \mid D)}^{\text{prior}}}{\underbrace{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D)\, P(\omega_j \mid D)}_{\text{evidence}}}$$
• We assume priors are known: P(ωi|D) = P(ωi).
• Also, assume functional independence: the samples in $D_i$ have no influence
on $p(\mathbf{x} \mid \omega_j, D)$ if $i \neq j$. This gives:

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\, P(\omega_j)}$$
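This formula is easy to sanity-check numerically. Below is a minimal Python sketch with made-up priors and likelihood values (all numbers are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical two-class example: values below are invented for illustration.
priors = np.array([0.6, 0.4])            # P(w_i), assumed known
likelihoods = np.array([0.3, 0.9])       # p(x | w_i, D_i) evaluated at one x
evidence = np.sum(likelihoods * priors)  # sum_j p(x | w_j, D_j) P(w_j)
posteriors = likelihoods * priors / evidence
print(posteriors, posteriors.sum())      # P(w_i | x, D); sums to 1.0
```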
ECE 8527: Lecture 05, Slide 2
The Parameter Distribution
• Assume the parametric form of the evidence, p(x), is known: p(x|θ).
• Any information we have about θ prior to collecting samples is contained in a
known prior density p(θ).
• Observation of samples converts this to a posterior, p(θ|D), which we hope is
peaked around the true value of θ.
• Our goal is to estimate a parameter vector, θ, and from it the density:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x}, \boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

• We can write the joint distribution as a product:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta}, D)\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta} = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

because the test sample, $\mathbf{x}$, is drawn independently of the training
samples in D (given θ).
• This equation links the class-conditional density $p(\mathbf{x} \mid D)$
to the posterior, $p(\boldsymbol{\theta} \mid D)$. But numerical solutions are typically required!
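Since numerical solutions are typically required, a grid-based sketch of this integral may help. Assumptions here: θ is a scalar (the unknown mean), σ is known, and the posterior p(θ|D) is a stand-in Gaussian chosen only for illustration:

```python
import numpy as np
from scipy.stats import norm

# Sketch: approximate p(x|D) = ∫ p(x|theta) p(theta|D) d(theta) on a grid.
sigma = 1.0
theta_grid = np.linspace(-5, 5, 2001)
# Stand-in posterior p(theta|D); here an illustrative N(0.8, 0.2^2).
post = norm.pdf(theta_grid, loc=0.8, scale=0.2)
post /= np.trapz(post, theta_grid)          # normalize numerically

def predictive(x):
    return np.trapz(norm.pdf(x, loc=theta_grid, scale=sigma) * post, theta_grid)

print(predictive(1.0))
```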
ECE 8527: Lecture 05, Slide 3
Univariate Gaussian Case
• Case: only the mean, μ, is unknown:

$$p(x \mid \mu) \sim N(\mu, \sigma^2)$$

• Known prior density:

$$p(\mu) \sim N(\mu_0, \sigma_0^2)$$

• Using Bayes formula:

$$p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha\, p(D \mid \mu)\, p(\mu) = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)$$

• Rationale: once a value of μ is known, the density for x is completely
known. α is a normalization factor that depends on the data, D.
ECE 8527: Lecture 05, Slide 4
Univariate Gaussian Case
• Applying our Gaussian assumptions:

$$p(\mu \mid D) = \alpha \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^2\right] \cdot \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right]$$

• Collecting the exponents and absorbing all factors that do not depend on μ
into the normalization constant:

$$p(\mu \mid D) = \alpha' \exp\left[-\frac{1}{2}\left(\left(\frac{\mu - \mu_0}{\sigma_0}\right)^2 + \sum_{k=1}^{n}\left(\frac{x_k - \mu}{\sigma}\right)^2\right)\right]$$
ECE 8527: Lecture 05, Slide 5
Univariate Gaussian Case (Cont.)
• Now we need to work this into a simpler form. Expanding the quadratics and
absorbing every term that does not depend on μ into the constant:

$$p(\mu \mid D) = \alpha'' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$

$$= \alpha'' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{n\,\hat{\mu}_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$

where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.
ECE 8527: Lecture 05, Slide 6
Univariate Gaussian Case (Cont.)
• p(μ|D) is an exponential of a quadratic function, which makes it a normal
distribution. Because this is true for any n, it is referred to as a reproducing
density.
• p(μ) is referred to as a conjugate prior.
• Write $p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$:

$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$$

• Expand the quadratic term:

$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[-\frac{1}{2}\left(\frac{\mu^2 - 2\mu_n\mu + \mu_n^2}{\sigma_n^2}\right)\right]$$

• Equate coefficients of our two functions:

$$\frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[-\frac{1}{2}\left(\frac{\mu^2 - 2\mu_n\mu + \mu_n^2}{\sigma_n^2}\right)\right] = \alpha'' \exp\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{n\,\hat{\mu}_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$
ECE 8527: Lecture 05, Slide 7
Univariate Gaussian Case (Cont.)
• Rearrange terms so that the dependencies on μ are clear:

$$\exp\left[-\frac{1}{2\sigma_n^2}\mu^2 + \frac{\mu_n}{\sigma_n^2}\mu - \frac{\mu_n^2}{2\sigma_n^2}\right] \propto \exp\left[-\frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 + \left(\frac{n\,\hat{\mu}_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]$$

• Associate the terms related to μ² and μ:

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad \text{and} \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n\,\hat{\mu}_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}$$

• There is actually a third equation, involving the terms not related to μ
(the constants), but we can ignore it since it is not a function of μ and is a
complicated equation to solve.
ECE 8527: Lecture 05, Slide 8
Univariate Gaussian Case (Cont.)
• Two equations and two unknowns. Solve for μ_n and σ_n². First, solve for σ_n²:

$$\sigma_n^2 = \frac{1}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_0^2}} = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

• Next, solve for μ_n:

$$\mu_n = \sigma_n^2\left(\frac{n\,\hat{\mu}_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right) = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \left(\frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\right)\mu_0$$

• Summarizing (see the numeric check below):

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \left(\frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\right)\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$
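As a quick sanity check of this algebra, the following sketch verifies the two coefficient equations numerically; the parameter values are arbitrary examples:

```python
import numpy as np

# Verify 1/sigma_n^2 = n/sigma^2 + 1/sigma0^2 and
# mu_n/sigma_n^2 = n*mu_hat/sigma^2 + mu0/sigma0^2 for example values.
sigma_sq, sigma0_sq, mu0, n = 1.0, 4.0, 0.5, 25
rng = np.random.default_rng(4)
x = rng.normal(2.0, np.sqrt(sigma_sq), n)
mu_hat = x.mean()
sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
mu_n = (n * sigma0_sq * mu_hat + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
print(np.isclose(1/sigma_n_sq, n/sigma_sq + 1/sigma0_sq))              # True
print(np.isclose(mu_n/sigma_n_sq, n*mu_hat/sigma_sq + mu0/sigma0_sq))  # True
```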
ECE 8527: Lecture 05, Slide 9
Bayesian Learning
• μ_n represents our best guess after n samples.
• σ_n² represents our uncertainty about this guess.
• σ_n² approaches σ²/n for large n: each additional observation decreases our
uncertainty.
• The posterior, p(μ|D), becomes more sharply peaked as n grows large. This is
known as Bayesian learning.
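A minimal sketch of this behavior, assuming the univariate known-σ model from the previous slides (the prior and true mean below are illustrative):

```python
import numpy as np

# Update equations from Slide 9: known sigma^2, prior N(mu0, sigma0^2).
def posterior_params(x, mu0, sigma0_sq, sigma_sq):
    n = len(x)
    mu_hat = np.mean(x)
    mu_n = (n * sigma0_sq * mu_hat + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
    sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

rng = np.random.default_rng(0)
true_mu, sigma_sq = 2.0, 1.0
for n in [1, 10, 100, 1000]:
    x = rng.normal(true_mu, np.sqrt(sigma_sq), size=n)
    mu_n, s_n = posterior_params(x, mu0=0.0, sigma0_sq=1.0, sigma_sq=sigma_sq)
    print(n, round(mu_n, 3), round(s_n, 5), round(sigma_sq / n, 5))
    # mu_n approaches the true mean; s_n tracks sigma^2/n for large n
```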
ECE 8527: Lecture 05, Slide 10
Class-Conditional Density
• How do we obtain p(x|D)? (The derivation is tedious.)

$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu = \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right] d\mu$$

$$= \frac{1}{2\pi\sigma\sigma_n} \exp\left[-\frac{1}{2}\,\frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2}\right] f(\sigma, \sigma_n)$$

where:

$$f(\sigma, \sigma_n) = \int \exp\left[-\frac{1}{2}\,\frac{\sigma^2 + \sigma_n^2}{\sigma^2\sigma_n^2}\left(\mu - \frac{\sigma_n^2 x + \sigma^2\mu_n}{\sigma^2 + \sigma_n^2}\right)^2\right] d\mu$$

• Note that:

$$p(x \mid D) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)$$
• The conditional mean, μn, is treated as the true mean.
• p(x|D) and P(ωj) can be used to design the classifier.
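The closed form can be checked against the integral directly. A sketch, with illustrative values for μ_n and σ_n:

```python
import numpy as np
from scipy.stats import norm

# Verify p(x|D) = N(mu_n, sigma^2 + sigma_n^2) against the integral
# ∫ p(x|mu) p(mu|D) d(mu), using example values.
sigma, mu_n, sigma_n = 1.0, 1.5, 0.3
mu_grid = np.linspace(mu_n - 8, mu_n + 8, 4001)
x = 2.0
numeric = np.trapz(norm.pdf(x, mu_grid, sigma) * norm.pdf(mu_grid, mu_n, sigma_n),
                   mu_grid)
closed = norm.pdf(x, mu_n, np.sqrt(sigma**2 + sigma_n**2))
print(numeric, closed)   # agree to several decimal places
```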
ECE 8527: Lecture 05, Slide 11
Multivariate Case
• Assume:

$$p(\mathbf{x} \mid \boldsymbol{\mu}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \quad \text{and} \quad p(\boldsymbol{\mu}) \sim N(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$$

where $\boldsymbol{\Sigma}$, $\boldsymbol{\mu}_0$, and $\boldsymbol{\Sigma}_0$ are assumed to be known.
• Applying Bayes formula:

$$p(\boldsymbol{\mu} \mid D) = \alpha \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\mu})\, p(\boldsymbol{\mu}) = \alpha' \exp\left[-\frac{1}{2}\left(\boldsymbol{\mu}^t\left(n\boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1}\right)\boldsymbol{\mu} - 2\boldsymbol{\mu}^t\left(\boldsymbol{\Sigma}^{-1}\sum_{k=1}^{n}\mathbf{x}_k + \boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0\right)\right)\right]$$

which has the form:

$$p(\boldsymbol{\mu} \mid D) = \alpha'' \exp\left[-\frac{1}{2}(\boldsymbol{\mu} - \boldsymbol{\mu}_n)^t \boldsymbol{\Sigma}_n^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_n)\right]$$

• Once again: $p(\boldsymbol{\mu} \mid D) \sim N(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$,
and we have a reproducing density.
ECE 8527: Lecture 05, Slide 12
Estimation Equations
• Equating coefficients between the two Gaussians:

$$\boldsymbol{\Sigma}_n^{-1} = n\boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}_0^{-1}$$

$$\boldsymbol{\Sigma}_n^{-1}\boldsymbol{\mu}_n = n\boldsymbol{\Sigma}^{-1}\hat{\boldsymbol{\mu}}_n + \boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0, \qquad \hat{\boldsymbol{\mu}}_n = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$

• The solution to these equations is:

$$\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_0\left(\boldsymbol{\Sigma}_0 + \frac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\hat{\boldsymbol{\mu}}_n + \frac{1}{n}\boldsymbol{\Sigma}\left(\boldsymbol{\Sigma}_0 + \frac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\boldsymbol{\mu}_0$$

$$\boldsymbol{\Sigma}_n = \boldsymbol{\Sigma}_0\left(\boldsymbol{\Sigma}_0 + \frac{1}{n}\boldsymbol{\Sigma}\right)^{-1}\frac{1}{n}\boldsymbol{\Sigma}$$

• It also follows that: $p(\mathbf{x} \mid D) \sim N(\boldsymbol{\mu}_n,\ \boldsymbol{\Sigma} + \boldsymbol{\Sigma}_n)$
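A sketch of these estimation equations; the data, prior, and covariances below are illustrative, and Σ is assumed known:

```python
import numpy as np

# Multivariate posterior parameters: known Sigma, prior N(mu0, Sigma0).
def multivariate_posterior(X, mu0, Sigma0, Sigma):
    n = len(X)
    mu_hat = X.mean(axis=0)
    A = np.linalg.inv(Sigma0 + Sigma / n)       # (Sigma0 + (1/n)Sigma)^{-1}
    mu_n = Sigma0 @ A @ mu_hat + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n

rng = np.random.default_rng(1)
Sigma = np.eye(2)
X = rng.multivariate_normal([1.0, -1.0], Sigma, size=200)
mu_n, Sigma_n = multivariate_posterior(X, np.zeros(2), np.eye(2), Sigma)
print(mu_n)        # close to the true mean [1, -1]
print(Sigma_n)     # shrinks roughly like Sigma/n
```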
ECE 8527: Lecture 05, Slide 13
General Theory
• The computation of p(x | D) can be applied to any situation in which the
unknown density can be parameterized.
• The basic assumptions are:
The form of p(x | θ) is assumed known, but the value of θ is not known
exactly.
Our knowledge about θ is assumed to be contained in a known prior
density p(θ).
The rest of our knowledge about θ is contained in a set D of n random
variables x1, x2, …, xn drawn independently according to the unknown
probability density function p(x).
ECE 8527: Lecture 05, Slide 14
Formal Solution
• The desired class-conditional density is given by:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

• Using Bayes formula, we can write p(θ|D) as:

$$p(\boldsymbol{\theta} \mid D) = \frac{p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(D \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}}$$

• and by the independence assumption:

$$p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$
• This constitutes the formal solution to the problem because we have an
expression for the probability of the data given the parameters.
• This also illuminates the relation to the maximum likelihood estimate:
• Suppose p(D | θ) reaches a sharp peak at $\theta = \hat{\theta}$.
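In practice the product above is evaluated in the log domain for numerical stability. A sketch, assuming a univariate Gaussian likelihood with known σ (the data and grid are illustrative):

```python
import numpy as np
from scipy.stats import norm

# p(D|theta) factors into a product by independence; we evaluate it as a sum
# of log-likelihoods and locate its peak on a grid.
rng = np.random.default_rng(5)
D = rng.normal(1.0, 1.0, 500)

def log_likelihood(theta, sigma=1.0):
    return np.sum(norm.logpdf(D, theta, sigma))

thetas = np.linspace(0, 2, 201)
ll = np.array([log_likelihood(t) for t in thetas])
print(thetas[np.argmax(ll)])   # peaks near the ML estimate (the sample mean)
```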
ECE 8527: Lecture 05, Slide 15
Comparison to Maximum Likelihood
• This also illuminates the relation to the maximum likelihood estimate:
Suppose p(D | θ) reaches a sharp peak at $\theta = \hat{\theta}$.
p(θ | D) will also peak at the same place if p(θ) is well-behaved.
p(x | D) will be approximately $p(\mathbf{x} \mid \hat{\boldsymbol{\theta}})$, which is the ML result.
If the peak of p(D| θ ) is very sharp, then the influence of prior information
on the uncertainty of the true value of θ can be ignored.
However, the Bayes solution tells us how to use all of the available
information to compute the desired density p(x|D).
ECE 8527: Lecture 05, Slide 16
Recursive Bayes Incremental Learning
• To indicate explicitly the dependence on the number of samples, let:

$$D^n = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$$

• We can then write our expression for p(D | θ) recursively:

$$p(D^n \mid \boldsymbol{\theta}) = p(\mathbf{x}_n \mid \boldsymbol{\theta})\, p(D^{n-1} \mid \boldsymbol{\theta})$$

• We can write the posterior density using a recursive relation:

$$p(\boldsymbol{\theta} \mid D^n) = \frac{p(D^n \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int p(D^n \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}} = \frac{p(\mathbf{x}_n \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D^{n-1})}{\int p(\mathbf{x}_n \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D^{n-1})\, d\boldsymbol{\theta}}$$

where $p(\boldsymbol{\theta} \mid D^0) = p(\boldsymbol{\theta})$.
• This is called recursive Bayes incremental learning because it gives us a
method for incrementally updating our estimates, as sketched below.
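A grid-based sketch of the recursion, assuming a scalar θ (the unknown mean), known σ = 1, and an illustrative N(0, 1) prior:

```python
import numpy as np
from scipy.stats import norm

# Recursive Bayes on a grid: start from the prior and fold in one sample at a
# time via p(theta|D^n) ∝ p(x_n|theta) p(theta|D^{n-1}).
theta = np.linspace(-4, 4, 2001)
post = norm.pdf(theta, 0.0, 1.0)                 # p(theta|D^0) = p(theta)
rng = np.random.default_rng(2)
for x_n in rng.normal(1.5, 1.0, size=50):        # sigma assumed known = 1
    post = norm.pdf(x_n, theta, 1.0) * post
    post /= np.trapz(post, theta)                # renormalize at each step
print(theta[np.argmax(post)])                    # posterior peaks near 1.5
```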
ECE 8527: Lecture 05, Slide 17
When do ML and Bayesian Estimation Differ?
• For infinite amounts of data, the solutions converge. However, limited data
is always a problem.
• If prior information is reliable, a Bayesian estimate can be superior.
• Bayesian estimates for uniform priors are similar to an ML solution.
• If p(θ| D) is broad or asymmetric around the true value, the approaches are
likely to produce different solutions.
• When designing a classifier using these techniques, there are three
sources of error:
Bayes Error: the error due to overlapping distributions.
Model Error: the error due to an incorrect model or incorrect assumption
about the parametric form.
Estimation Error: the error arising from the fact that the parameters are
estimated from a finite amount of data.
ECE 8527: Lecture 05, Slide 18
Noninformative Priors and Invariance
• The information about the prior is based on the designer’s knowledge of the
problem domain.
• We expect the prior distributions to be translation and scale invariant:
they should not depend on the actual value of the parameter.
• A prior that satisfies this property is referred to as a
“noninformative prior”:
The Bayesian approach remains applicable even when little or no prior
information is available.
Such situations can be handled by choosing a prior density giving equal
weight to all possible values of θ.
Priors that seemingly impart no prior preference, the so-called
noninformative priors, also arise when the prior is required to be invariant
under certain transformations.
Frequently, the desire to treat all possible values of θ equitably leads to
priors with infinite mass. Such noninformative priors are called improper
priors.
ECE 8527: Lecture 05, Slide 19
Example of Noninformative Priors
• For example, if we assume the prior distribution of the mean of a continuous
random variable is independent of the choice of the origin, the only prior
that could satisfy this is a uniform distribution over the entire real line,
which is improper (it cannot integrate to one).
• Consider a parameter σ, and a transformation of this variable to a new
variable, $\tilde{\sigma} = \ln(\sigma)$. Suppose we also scale σ by a positive constant:
$\hat{\sigma} = \ln(a\sigma) = \ln a + \ln \sigma$. A noninformative prior on σ is the inverse
distribution p(σ) = 1/σ, which is also improper.
ECE 8527: Lecture 05, Slide 20
Sufficient Statistics
• Direct computation of p(D|θ) and p(θ|D) for large data sets is challenging
(e.g., neural networks).
• We need a parametric form for p(x|θ) (e.g., a Gaussian).
• Gaussian case: computation of the sample mean and covariance, which was
straightforward, contained all the information relevant to estimating the
unknown population mean and covariance.
• This property exists for other distributions.
• A sufficient statistic is a function s of the samples D that contains all the
information relevant to a parameter, θ.
• A statistic, s, is said to be sufficient for θ if p(D|s,θ) is independent of θ:

$$p(\boldsymbol{\theta} \mid \mathbf{s}, D) = \frac{p(D \mid \mathbf{s}, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathbf{s})}{p(D \mid \mathbf{s})} = p(\boldsymbol{\theta} \mid \mathbf{s})$$
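A small illustration of sufficiency using the known-variance Gaussian case from the earlier slides: the posterior over μ depends on D only through n and the sample mean, so two very different datasets with the same (n, mean) yield the same posterior. All values below are illustrative:

```python
import numpy as np

# For the known-variance Gaussian, (n, sample mean) is sufficient for mu:
# the posterior parameters never see the raw samples.
def posterior_params(n, mu_hat, mu0=0.0, sigma0_sq=1.0, sigma_sq=1.0):
    mu_n = (n * sigma0_sq * mu_hat + sigma_sq * mu0) / (n * sigma0_sq + sigma_sq)
    sigma_n_sq = sigma0_sq * sigma_sq / (n * sigma0_sq + sigma_sq)
    return mu_n, sigma_n_sq

rng = np.random.default_rng(3)
D1 = rng.normal(2.0, 1.0, 100)               # noisy samples
D2 = np.full(100, D1.mean())                 # very different data, same n and mean
print(posterior_params(len(D1), D1.mean()))
print(posterior_params(len(D2), D2.mean()))  # (near-)identical posteriors
```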
ECE 8527: Lecture 05, Slide 21
The Factorization Theorem
• Theorem: A statistic, s, is sufficient for θ, if and only if p(D|θ) can be
written as: $p(D \mid \boldsymbol{\theta}) = g(\mathbf{s}, \boldsymbol{\theta})\, h(D)$.
• There are many ways to formulate sufficient statistics
(e.g., define a vector of the samples themselves).
• Useful only when the function g() and the sufficient statistic are simple
(e.g., sample mean calculation).
• The factoring of p(D|θ) is not unique:

$$g'(\mathbf{s}, \boldsymbol{\theta}) = f(\mathbf{s})\, g(\mathbf{s}, \boldsymbol{\theta}), \qquad h'(D) = h(D) / f(\mathbf{s})$$

• Define a kernel density invariant to this scaling:

$$\tilde{g}(\mathbf{s}, \boldsymbol{\theta}) = \frac{g(\mathbf{s}, \boldsymbol{\theta})}{\int g(\mathbf{s}, \boldsymbol{\theta}')\, d\boldsymbol{\theta}'}$$
• Significance: most practical applications of parameter estimation involve
simple sufficient statistics and simple kernel densities.
ECE 8527: Lecture 05, Slide 22
Gaussian Distributions
• For a multivariate Gaussian with known covariance and unknown mean, μ:

$$p(D \mid \boldsymbol{\mu}) = \prod_{k=1}^{n} \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}_k - \boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu})\right]$$

$$= \frac{1}{(2\pi)^{nd/2}|\boldsymbol{\Sigma}|^{n/2}} \exp\left[-\frac{1}{2}\left(n\boldsymbol{\mu}^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} - 2\boldsymbol{\mu}^t\boldsymbol{\Sigma}^{-1}\sum_{k=1}^{n}\mathbf{x}_k + \sum_{k=1}^{n}\mathbf{x}_k^t\boldsymbol{\Sigma}^{-1}\mathbf{x}_k\right)\right]$$

$$= \underbrace{\exp\left[-\frac{n}{2}\left(\boldsymbol{\mu}^t\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} - 2\boldsymbol{\mu}^t\boldsymbol{\Sigma}^{-1}\hat{\boldsymbol{\mu}}_n\right)\right]}_{g(\hat{\boldsymbol{\mu}}_n,\, \boldsymbol{\mu})}\ \underbrace{\frac{1}{(2\pi)^{nd/2}|\boldsymbol{\Sigma}|^{n/2}} \exp\left[-\frac{1}{2}\sum_{k=1}^{n}\mathbf{x}_k^t\boldsymbol{\Sigma}^{-1}\mathbf{x}_k\right]}_{h(D)}$$

• This isolates the μ dependence in the first term, and hence, the sample mean
is a sufficient statistic using the Factorization Theorem.
• The kernel is:

$$\tilde{g}(\hat{\boldsymbol{\mu}}_n, \boldsymbol{\mu}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}/n|^{1/2}} \exp\left[-\frac{1}{2}(\boldsymbol{\mu} - \hat{\boldsymbol{\mu}}_n)^t \left(\frac{\boldsymbol{\Sigma}}{n}\right)^{-1} (\boldsymbol{\mu} - \hat{\boldsymbol{\mu}}_n)\right]$$
ECE 8527: Lecture 05, Slide 23
The Exponential Family
• This can be generalized to the exponential family:

$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \alpha(\mathbf{x}) \exp\left[a(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})^t \mathbf{c}(\mathbf{x})\right]$$

and:

$$p(D \mid \boldsymbol{\theta}) = \exp\left[n\,a(\boldsymbol{\theta}) + \mathbf{b}(\boldsymbol{\theta})^t \sum_{k=1}^{n}\mathbf{c}(\mathbf{x}_k)\right] \prod_{k=1}^{n} \alpha(\mathbf{x}_k) = g(\mathbf{s}, \boldsymbol{\theta})\, h(D)$$
• Examples: the Gaussian, exponential, Poisson, and binomial distributions are
all members of this family; a sketch for the Gaussian follows.
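As a concrete instance, the univariate Gaussian with known σ = 1 fits this form with a(θ) = −θ²/2, b(θ) = θ, and c(x) = x. The decomposition is standard; the code below is only an illustrative check:

```python
import numpy as np
from scipy.stats import norm

# N(theta, 1) in exponential-family form:
#   p(x|theta) = alpha(x) * exp[a(theta) + b(theta) * c(x)]
# with alpha(x) = exp(-x^2/2)/sqrt(2*pi), a(theta) = -theta^2/2,
# b(theta) = theta, c(x) = x.
def gaussian_exp_family(x, theta):
    alpha = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    return alpha * np.exp(-theta**2 / 2 + theta * x)

print(gaussian_exp_family(0.7, 1.2))        # matches...
print(norm.pdf(0.7, loc=1.2, scale=1.0))    # ...the direct evaluation
```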
ECE 8527: Lecture 05, Slide 24
Summary
• Introduction of Bayesian parameter estimation.
• The role of the class-conditional distribution in a Bayesian estimate.
• Estimation of the posterior and probability density function assuming the only
unknown parameter is the mean, and the conditional density of the “features”
given the mean, p(x|θ), can be modeled as a Gaussian distribution.
• Bayesian estimates of the mean for the multivariate Gaussian case.
• General theory for Bayesian estimation.
• Comparison to maximum likelihood estimates.
• Recursive Bayesian incremental learning.
• Noninformative priors.
• Sufficient statistics.
• Kernel density.
ECE 8527: Lecture 05, Slide 25
“The Euro Coin”
• Getting ahead a bit, let’s see how we can put these ideas to work on a simple
example due to David MacKay, and explained by Jon Hamaker.
ECE 8527: Lecture 05, Slide 26