Transcript Document
Bayes Decision Theory
Course Outline
MODEL INFORMATION
• Complete → “Optimal” Rules
• Incomplete:
  – Supervised Learning
    Parametric Approach → Plug-in Rules
    Nonparametric Approach → Density Estimation, Geometric Rules (k-NN, MLP)
  – Unsupervised Learning
    Parametric Approach → Mixture Resolving
    Nonparametric Approach → Cluster Analysis (Hard, Fuzzy)
Two-dimensional Feature Space
Supervised Learning
Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation
• Introduction
• Maximum-Likelihood Estimation
• Bayesian Estimation
• Curse of Dimensionality
• Component Analysis & Discriminants
• EM Algorithm
Introduction
Bayesian framework: we could design an optimal classifier if we knew:
• P(ω_i): priors
• P(x | ω_i): class-conditional densities
Unfortunately, we rarely have this complete information!
Design a classifier based on a set of labeled training samples (supervised learning). Assume the priors are known. We need a sufficient number of training samples for estimating the class-conditional densities, especially when the dimensionality of the feature space is large.
Pattern Classification, Chapter 3
Assumption about the problem: a parametric model of P(x | ω_i) is available. Assume P(x | ω_i) is multivariate Gaussian: P(x | ω_i) ~ N(μ_i, Σ_i), characterized by the two parameters μ_i and Σ_i. Parameter estimation techniques: Maximum-Likelihood (ML) and Bayesian estimation. The results of the two procedures are nearly identical, but there is a subtle difference.
In ML estimation, the parameters are assumed to be fixed but unknown! The Bayesian parameter estimation procedure, by its nature, utilizes whatever prior information is available about the unknown parameter. MLE: the best parameters are obtained by maximizing the probability of obtaining the samples observed. Bayesian methods view the parameters as random variables having some known prior distribution; how do we know the priors?
In either approach, we use P(ω_i | x) for our classification rule!
Maximum-Likelihood Estimation
Has good convergence properties as the sample size increases; the estimated parameter value approaches the true value as n increases. Simpler than alternative techniques. General principle: assume we have c classes and P(x | ω_j) ~ N(μ_j, Σ_j), i.e., P(x | ω_j) ≡ P(x | ω_j, θ_j), where
θ_j = (μ_j, Σ_j) = (μ_j^1, μ_j^2, ..., σ_j^11, σ_j^22, ..., cov(x_j^m, x_j^n), ...)
Use class j samples to estimate class j parameters
Use the information in the training samples to estimate θ = (θ_1, θ_2, ..., θ_c); θ_i (i = 1, 2, ..., c) is associated with the ith category. Suppose the sample set D contains n i.i.d. samples x_1, x_2, ..., x_n.
P(D | θ) = ∏_{k=1}^n P(x_k | θ) ≡ F(θ)

(P(D | θ) is called the likelihood of θ w.r.t. the set of samples.)
The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ): “It is the value of θ that best agrees with the actually observed training samples.”
Optimal estimation: let θ = (θ_1, θ_2, ..., θ_p)^t and let ∇_θ be the gradient operator

∇_θ = (∂/∂θ_1, ∂/∂θ_2, ..., ∂/∂θ_p)^t
We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ). New problem statement: determine the θ that maximizes the log-likelihood:
θ̂ = arg max_θ l(θ)
Set of necessary conditions for an optimum is:
∇_θ l = Σ_{k=1}^n ∇_θ ln P(x_k | θ)

∇_θ l = 0
Example of a specific case: μ unknown, P(x | μ) ~ N(μ, Σ) (samples are drawn from a multivariate normal population).
ln P(x_k | μ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(x_k − μ)^t Σ^{-1} (x_k − μ)

and

∇_μ ln P(x_k | μ) = Σ^{-1} (x_k − μ)
θ = μ; therefore the ML estimate for μ must satisfy:
Σ_{k=1}^n Σ^{-1} (x_k − μ̂) = 0
• Multiplying by Σ and rearranging, we obtain:
μ̂ = (1/n) Σ_{k=1}^n x_k

which is the arithmetic average (the sample mean) of the training samples!
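The ML estimates above (the sample mean, plus the analogous ML covariance estimate) are easy to check numerically. A minimal sketch, with made-up data, not from the slides:

```python
import numpy as np

# ML estimates for a multivariate Gaussian: the sample mean derived above,
# and the corresponding (biased) ML covariance estimate.
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5],
                     [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=5000)  # n = 5000 samples

n = X.shape[0]
mu_hat = X.mean(axis=0)          # arithmetic average of the training samples
diff = X - mu_hat
cov_hat = diff.T @ diff / n      # ML covariance (divides by n, not n - 1)

print(mu_hat)
print(cov_hat)
```

With a few thousand samples both estimates land close to the true parameters.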
Conclusion: Given P(x_k | ω_j), j = 1, 2, ..., c, to be Gaussian in a d-dimensional feature space, estimate the vector θ = (θ_1, θ_2, ..., θ_c)^t and perform classification using the Bayes decision rule of Chapter 2!
ML Estimation: Univariate Gaussian Case: μ and σ² unknown
θ = (θ_1, θ_2) = (μ, σ²)
l = ln P(x_k | θ) = −(1/2) ln(2π θ_2) − (1/(2θ_2)) (x_k − θ_1)²

∇_θ l = ( ∂(ln P(x_k | θ))/∂θ_1 , ∂(ln P(x_k | θ))/∂θ_2 )^t = 0

which gives

(1/θ_2)(x_k − θ_1) = 0

−1/(2θ_2) + (x_k − θ_1)²/(2θ_2²) = 0
Summation over all n samples:

Σ_{k=1}^n (1/θ̂_2)(x_k − θ̂_1) = 0    (1)

−Σ_{k=1}^n (1/θ̂_2) + Σ_{k=1}^n (x_k − θ̂_1)²/θ̂_2² = 0    (2)

Combining (1) and (2), one obtains:

μ̂ = (1/n) Σ_{k=1}^n x_k ;    σ̂² = (1/n) Σ_{k=1}^n (x_k − μ̂)²
Bias: the ML estimate for σ² is biased:

E[ (1/n) Σ_{i=1}^n (x_i − x̄)² ] = ((n − 1)/n) σ² ≠ σ²

An unbiased estimator for Σ is:

C = (1/(n − 1)) Σ_{k=1}^n (x_k − μ̂)(x_k − μ̂)^t    (sample covariance matrix)
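The (n − 1)/n bias is visible in a quick simulation; a sketch with invented numbers (n = 5, σ² = 4, so the ML estimate should average to (4/5)·4 = 3.2):

```python
import numpy as np

# Average the ML (divide-by-n) and unbiased (divide-by-(n-1)) variance
# estimates over many small samples of size n = 5 drawn from N(0, 4).
rng = np.random.default_rng(1)
n, trials, sigma2 = 5, 200_000, 4.0
X = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

mu_hat = X.mean(axis=1, keepdims=True)
sq_dev = (X - mu_hat) ** 2
ml_var = sq_dev.sum(axis=1) / n              # biased: E = (n-1)/n * sigma^2 = 3.2
unbiased_var = sq_dev.sum(axis=1) / (n - 1)  # unbiased: E = sigma^2 = 4.0

print(ml_var.mean(), unbiased_var.mean())
```

The gap shrinks as n grows, which is why the bias matters mostly for small samples.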
Bayesian Estimation
(Bayesian learning approach for pattern classification problems)
In MLE, θ was supposed to have a fixed value. In BE, θ is a random variable. The computation of the posterior probabilities P(ω_i | x) lies at the heart of Bayesian classification. Goal: compute P(ω_i | x, D). Given the training sample set D, the Bayes formula can be written:
P(ω_i | x, D) = P(x | ω_i, D) P(ω_i | D) / Σ_{j=1}^c P(x | ω_j, D) P(ω_j | D)
To demonstrate the preceding equation, use:
P(x, D | ω_i) = P(x | ω_i, D) P(D | ω_i)

P(x | D) = Σ_j P(x, ω_j | D)

P(ω_i) = P(ω_i | D)    (the training sample provides this!)

Thus:

P(ω_i | x, D) = P(x | ω_i, D_i) P(ω_i) / Σ_{j=1}^c P(x | ω_j, D_j) P(ω_j)
Bayesian Parameter Estimation: Gaussian Case
Goal: estimate θ using the a-posteriori density P(θ | D). The univariate Gaussian case: μ is the only unknown parameter; compute P(μ | D).
P(x | μ) ~ N(μ, σ²)

P(μ) ~ N(μ_0, σ_0²)

μ_0 and σ_0 are known!
P(μ | D) = P(D | μ) P(μ) / ∫ P(D | μ) P(μ) dμ    (1)

∝ ∏_{k=1}^n P(x_k | μ) · P(μ)
Reproducing density
P(μ | D) ~ N(μ_n, σ_n²)    (2)
The updated parameters of the prior:

μ_n = ( nσ_0² / (nσ_0² + σ²) ) μ̂_n + ( σ² / (nσ_0² + σ²) ) μ_0

and

σ_n² = σ_0² σ² / (nσ_0² + σ²)

where μ̂_n = (1/n) Σ_{k=1}^n x_k is the sample mean.
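The update formulas can be sketched directly in code; the prior parameters and the true mean below are invented purely for illustration:

```python
import numpy as np

# Bayesian update of an unknown Gaussian mean: prior N(mu0, sigma0^2),
# known noise variance sigma^2, and n observed samples.
rng = np.random.default_rng(2)
sigma2 = 1.0                    # known sigma^2
mu0, sigma0_2 = 0.0, 10.0       # prior: P(mu) ~ N(0, 10)
x = rng.normal(3.0, np.sqrt(sigma2), size=50)   # true mean is 3.0

n = len(x)
xbar = x.mean()                 # sample mean mu_hat_n
denom = n * sigma0_2 + sigma2
mu_n = (n * sigma0_2 / denom) * xbar + (sigma2 / denom) * mu0
sigma_n2 = sigma0_2 * sigma2 / denom

print(mu_n, sigma_n2)
# Predictive density (derived below in the notes): P(x | D) ~ N(mu_n, sigma2 + sigma_n2)
```

With 50 samples the posterior mean sits essentially on the sample mean and the posterior variance collapses, matching the formulas.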
The univariate case P(x | D): P(μ | D) has been computed; P(x | D) remains to be computed!
P(x | D) = ∫ P(x | μ) P(μ | D) dμ

P(x | D) is Gaussian.
It provides:

P(x | D) ~ N(μ_n, σ² + σ_n²)
P(x | D_j, ω_j) is the desired class-conditional density. Together with P(ω_j), and using the Bayes formula, we obtain the Bayesian classification rule:
max_j P(ω_j | x, D) ≡ max_j P(x | ω_j, D_j) P(ω_j)
Bayesian Parameter Estimation: General Theory
The P(x | D) computation can be applied to any situation in which the unknown density can be parametrized. The basic assumptions are:
• The form of P(x | θ) is assumed known, but the value of θ is not known exactly
• Our knowledge about θ is assumed to be contained in a known prior density P(θ)
• The rest of our knowledge about θ is contained in a set D of n random variables x_1, x_2, ..., x_n drawn according to P(x)
The basic problem is: “Compute the posterior density P(θ | D)”, then “Derive P(x | D)”. Using the Bayes formula, we have:
P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ

And by the independence assumption:

P(D | θ) = ∏_{k=1}^n P(x_k | θ)
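For a one-dimensional θ, this general recipe can be carried out numerically on a grid. A sketch assuming a N(θ, 1) likelihood and a N(0, 1) prior (both invented here for illustration):

```python
import numpy as np

# Grid-based posterior: multiply the i.i.d. likelihoods by the prior,
# normalize to get P(theta | D), then average P(x | theta) over the
# posterior to get the predictive density P(x | D).
rng = np.random.default_rng(3)
data = rng.normal(1.5, 1.0, size=30)       # samples; sigma = 1 assumed known

theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]

def norm_pdf(x, mu):                       # N(mu, 1) density
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

log_lik = sum(np.log(norm_pdf(xk, theta)) for xk in data)  # sum_k ln P(x_k | theta)
prior = norm_pdf(theta, 0.0)               # P(theta) ~ N(0, 1)
post = np.exp(log_lik - log_lik.max()) * prior
post /= post.sum() * dtheta                # normalized P(theta | D)

x_new = 1.5
p_x_given_D = (norm_pdf(x_new, theta) * post).sum() * dtheta
print(p_x_given_D)
```

Working with log-likelihoods and subtracting the maximum before exponentiating avoids numerical underflow when n is large.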
Overfitting
Problem of Insufficient Data
• How to train a classifier (e.g., estimate the covariance matrix) when the training set size is small (compared to the number of features) • Reduce the dimensionality – Select a subset of features – Combine available features to get a smaller number of more “salient” features.
• Bayesian techniques – Assume a reasonable prior on the parameters to compensate for small amount of training data • Model Simplification – Assume statistical independence • Heuristics – Threshold the estimated covariance matrix such that only correlations above a threshold are retained.
Practical Observations
• Most heuristics and model simplifications are almost surely incorrect • In practice, however, the performance of classifiers based on model simplification is better than with full parameter estimation • Paradox: How can a suboptimal/simplified model perform better than the MLE of the full parameter set on the test dataset?
– The answer involves the problem of insufficient data
Insufficient Data in Curve Fitting
Curve Fitting Example
(contd.) • The example shows that a 10th-degree polynomial fits the training data with zero error – However, the test (generalization) error is much higher for this fitted curve • When the data size is small, one cannot be sure about how complex the model should be • A small change in the data will change the parameters of the 10th-degree polynomial significantly; such instability is not a desirable quality
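A quick sketch of this effect (the data are synthetic and invented here: 11 noisy samples of sin(2πx)); the 10th-degree fit interpolates the training set while a cubic does not:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Fit degree-3 and degree-10 polynomials to 11 noisy points and compare
# training error with generalization (test) error on the noise-free curve.
rng = np.random.default_rng(4)
f = lambda x: np.sin(2.0 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 11)
y_train = f(x_train) + rng.normal(0.0, 0.2, size=x_train.shape)
x_test = np.linspace(0.02, 0.98, 200)
y_test = f(x_test)

errs = {}
for degree in (3, 10):
    p = Polynomial.fit(x_train, y_train, degree)        # least-squares fit
    errs[degree] = (np.mean((p(x_train) - y_train) ** 2),  # training MSE
                    np.mean((p(x_test) - y_test) ** 2))    # test MSE
print(errs)
```

The degree-10 training error is essentially zero (11 points, 11 coefficients), yet its test error does not share that optimism.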
Handling insufficient data
• Heuristics and model simplifications • Shrinkage is an intermediate approach, which combines “common covariance” with individual covariance matrices – Individual covariance matrices shrink towards a common covariance matrix.
– Also called regularized discriminant analysis • Shrinkage estimator for a covariance matrix, given shrinkage factor 0 < α < 1:
Σ_i(α) = [ (1 − α) n_i Σ_i + α n Σ ] / [ (1 − α) n_i + α n ]
• Further, the common covariance can be shrunk towards the identity matrix:
Σ(β) = (1 − β) Σ + β I
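Both shrinkage formulas are one-liners; a sketch with made-up matrices and factors (the helper names are mine, not from the slides):

```python
import numpy as np

def shrink_class_cov(cov_i, cov_common, n_i, n, alpha):
    """Sigma_i(alpha): blend a class covariance with the common covariance."""
    num = (1.0 - alpha) * n_i * cov_i + alpha * n * cov_common
    den = (1.0 - alpha) * n_i + alpha * n
    return num / den

def shrink_to_identity(cov, beta):
    """Sigma(beta) = (1 - beta) * Sigma + beta * I."""
    return (1.0 - beta) * cov + beta * np.eye(cov.shape[0])

cov_i = np.array([[4.0, 1.0],
                  [1.0, 2.0]])
cov_common = np.eye(2)
s = shrink_class_cov(cov_i, cov_common, n_i=20, n=200, alpha=0.5)
print(shrink_to_identity(s, beta=0.1))
```

At α → 0 the estimator reduces to the individual class covariance and at α → 1 to the common covariance, which is the sense in which it is an intermediate approach.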
Problems of Dimensionality
Introduction
• Real-world applications usually come with a large number of features – Text in documents is represented using frequencies of tens of thousands of words – Images are often represented by extracting local features from a large number of regions within an image • Naive intuition: the more features, the better the classification performance? – Not always!
• There are two issues that must be confronted with high dimensional feature spaces – How does the classification accuracy depend on the dimensionality and the number of training samples?
– What is the computational complexity of the classifier?
Statistically Independent Features
• If features are statistically independent, it is possible to get excellent performance as dimensionality increases • For a two-class problem with multivariate normal classes and equal prior probabilities, the probability of error is
P(e) = (1/√(2π)) ∫_{r/2}^∞ e^{−u²/2} du
where the Mahalanobis distance r between the class means is given by
r² = (μ_1 − μ_2)^T Σ^{-1} (μ_1 − μ_2)
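The error integral is a Gaussian tail, so it can be evaluated with the complementary error function; a small sketch (the function name is mine):

```python
import math

def bayes_error(r):
    """P(e) = (1/sqrt(2*pi)) * integral from r/2 to infinity of exp(-u^2/2) du."""
    return 0.5 * math.erfc((r / 2.0) / math.sqrt(2.0))

# Error shrinks as the Mahalanobis distance between the class means grows.
for r in (0.0, 1.0, 2.0, 4.0):
    print(r, bayes_error(r))
```

At r = 0 the classes coincide and P(e) = 1/2; the error then decreases monotonically in r, which is the fact the next slide exploits.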
Statistically Independent Features
• When features are independent, the covariance matrix is diagonal, and we have

r² = Σ_{i=1}^d ( (μ_{i1} − μ_{i2}) / σ_i )²

• Since r² increases monotonically with an increase in the number of features, P(e) decreases
• As long as the means of the features in the two classes differ, the error decreases
Increasing Dimensionality
• If a given set of features does not result in good classification performance, it is natural to add more features • High dimensionality results in increased cost and complexity for both feature extraction and classification • If the probabilistic structure of the problem is completely known, adding new features cannot increase the Bayes risk
Curse of Dimensionality
• In practice, increasing the dimensionality beyond a certain point in the presence of a finite number of training samples often leads to worse, rather than better, performance • The main reasons for this paradox are as follows: – the Gaussian assumption that is typically made is almost surely incorrect – the training sample size is always finite, so the estimation of the class-conditional density is not very accurate • Analysis of this “curse of dimensionality” problem is difficult
A Simple Example
• Trunk (PAMI, 1979) provided a simple example illustrating this phenomenon.
p(ω_1) = p(ω_2) = 1/2

μ_1 = μ,    μ_2 = −μ

x | ω_1 ~ G(μ_1, I),    x | ω_2 ~ G(μ_2, I),    N: number of features

p(x | ω_1) = ∏_{i=1}^N (1/√(2π)) e^{−(1/2)(x_i − μ_i)²}

p(x | ω_2) = ∏_{i=1}^N (1/√(2π)) e^{−(1/2)(x_i + μ_i)²}

μ_i: the ith component of the mean vector

μ_1 = ( 1, 1/√2, 1/√3, 1/√4, ... )

μ_2 = −( 1, 1/√2, 1/√3, 1/√4, ... )
Case 1: Mean Values Known
• Bayes decision rule:

Decide ω_1 if x^t μ > 0, i.e. x_1 μ_1 + x_2 μ_2 + ... + x_N μ_N > 0, or Σ_{i=1}^N x_i μ_i > 0

P(e) = (1/√(2π)) ∫_{‖μ_1 − μ_2‖/2}^∞ e^{−z²/2} dz,    where ‖μ_1 − μ_2‖²/4 = Σ_{i=1}^N 1/i

P(e) = (1/√(2π)) ∫_{√(Σ_{i=1}^N 1/i)}^∞ e^{−z²/2} dz

Σ_{i=1}^N 1/i is a divergent series, so P(e) → 0 as N → ∞
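Case 1 can be checked numerically: plugging the harmonic sum into the Gaussian tail shows P(e) falling toward 0 as features are added (the function name is mine):

```python
import math

def p_error_known_means(N):
    """P(e) with known means: Gaussian tail beyond sqrt(sum_{i=1}^N 1/i)."""
    h = sum(1.0 / i for i in range(1, N + 1))   # divergent harmonic sum
    return 0.5 * math.erfc(math.sqrt(h) / math.sqrt(2.0))

for N in (1, 10, 100, 1000):
    print(N, p_error_known_means(N))
```

The decrease is slow (the harmonic sum grows like ln N), but it never stops: with the true means available, every added feature helps.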
Case 2: Mean Values Unknown
• m labeled training samples are available; pooled estimate of the mean:

μ̂ = (1/m) Σ_{i=1}^m x_i    (samples from ω_2 enter with their sign flipped)

Plug-in decision rule: decide ω_1 if x^t μ̂ > 0

P_e(N, m) = P(ω_1) P(x^t μ̂ ≤ 0 | ω_1) + P(ω_2) P(x^t μ̂ > 0 | ω_2)
          = P(x^t μ̂ ≤ 0 | ω_1)    (due to symmetry)

Let z = x^t μ̂ = Σ_{i=1}^N x_i μ̂_i. It is difficult to compute the distribution of z.
Case 2: Mean Values Unknown
As N → ∞, (z − E(z)) / √VAR(z) ~ G(0, 1)    (standard normal)

P_e(N, m) = P(z ≤ 0 | ω_1) = P( (z − E(z))/√VAR(z) ≤ −E(z)/√VAR(z) )

P_e(m, N) = (1/√(2π)) ∫_{E(z)/√VAR(z)}^∞ e^{−z²/2} dz

E(z) = Σ_{i=1}^N 1/i,    VAR(z) = (1 + 1/m) Σ_{i=1}^N 1/i + N/m

lim_{N→∞} E(z)/√VAR(z) = 0,  hence  lim_{N→∞} P_e(m, N) = 1/2
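The Case 2 limit can likewise be checked by evaluating E(z)/√VAR(z) for growing N at fixed m; with the expressions above, the plug-in error climbs back toward 1/2 (the function name is mine):

```python
import math

def p_error_plugin(N, m):
    """Asymptotic plug-in error: Gaussian tail beyond E(z) / sqrt(VAR(z))."""
    h = sum(1.0 / i for i in range(1, N + 1))    # E(z) = sum_{i=1}^N 1/i
    var_z = (1.0 + 1.0 / m) * h + N / m          # VAR(z)
    return 0.5 * math.erfc((h / math.sqrt(var_z)) / math.sqrt(2.0))

for N in (10, 100, 10_000, 1_000_000):
    print(N, p_error_plugin(N, m=25))
```

The N/m term in the variance eventually swamps the logarithmically growing mean, so with estimated means the error first drops and then rises toward chance; this is Trunk's version of the curse of dimensionality.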
Component Analysis and Discriminants
Combine features in order to reduce the dimension of the feature space. Linear combinations are simple to compute and tractable. Project high-dimensional data onto a lower-dimensional space. Two classical approaches for finding an “optimal” linear transformation:
• PCA (Principal Component Analysis): “Projection that best represents the data in a least-squares sense”
• MDA (Multiple Discriminant Analysis): “Projection that best separates the data in a least-squares sense”
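A minimal PCA sketch (synthetic 3-d data, invented here): project onto the top eigenvectors of the sample covariance, which is the least-squares-best linear representation:

```python
import numpy as np

# PCA: center the data, eigendecompose the sample covariance, and keep
# the directions with the largest eigenvalues.
rng = np.random.default_rng(5)
X = rng.multivariate_normal(
    [0.0, 0.0, 0.0],
    [[5.0, 2.0, 0.0],
     [2.0, 3.0, 0.0],
     [0.0, 0.0, 0.1]], size=1000)

Xc = X - X.mean(axis=0)                 # center
cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance
evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
W = evecs[:, ::-1][:, :2]               # top-2 principal directions
Z = Xc @ W                              # data projected to 2 dimensions

print(Z.shape, evals[::-1])
```

The discarded direction carries only the smallest eigenvalue's variance, so reconstructing from Z loses almost nothing here; MDA would instead pick directions using the class labels.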