Transcript Document

Bayes Decision Theory

Course Outline

[Course outline diagram: taxonomy of approaches]
• Model information: complete vs. incomplete
• Supervised learning
  – Parametric approach: "optimal" rules, plug-in rules
  – Nonparametric approach: density estimation, geometric rules (k-NN, MLP)
• Unsupervised learning
  – Parametric approach: mixture resolving
  – Nonparametric approach: cluster analysis (hard, fuzzy)

Two-dimensional Feature Space

Supervised Learning

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation

• Introduction
• Maximum-Likelihood Estimation
• Bayesian Estimation
• Curse of Dimensionality
• Component Analysis & Discriminants
• EM Algorithm

Introduction

• Bayesian framework: we could design an optimal classifier if we knew
  – P(ω_i): the prior probabilities
  – P(x | ω_i): the class-conditional densities
• Unfortunately, we rarely have this complete information!

• Design a classifier based on a set of labeled training samples (supervised learning)
• Assume the priors are known
• A sufficient number of training samples is needed for estimating the class-conditional densities, especially when the dimensionality of the feature space is large


• Assumption about the problem: a parametric model of P(x | ω_i) is available
• Assume P(x | ω_i) is multivariate Gaussian: P(x | ω_i) ~ N(μ_i, Σ_i)
• Characterized by two parameters: the mean vector μ_i and the covariance matrix Σ_i
• Parameter estimation techniques: Maximum-Likelihood (ML) and Bayesian estimation
• Results of the two procedures are nearly identical, but there is a subtle difference


• In ML estimation, the parameters are assumed to be fixed but unknown!
• The Bayesian parameter estimation procedure, by its nature, utilizes whatever prior information is available about the unknown parameter
• MLE: the best parameters are obtained by maximizing the probability of obtaining the samples observed
• Bayesian methods view the parameters as random variables having some known prior distribution; how do we know the priors?

• In either approach, we use P(ω_i | x) for our classification rule!


Maximum-Likelihood Estimation

• ML estimation has good convergence properties as the sample size increases; the estimated parameter value approaches the true value as n increases
• Simpler than any other alternative technique
• General principle: assume we have c classes and P(x | ω_j) ~ N(μ_j, Σ_j), i.e. P(x | ω_j) ≡ P(x | ω_j, θ_j), where
$$\theta_j = (\mu_j, \Sigma_j) = \left(\mu_j^1, \mu_j^2, \ldots, \sigma_j^{11}, \sigma_j^{22}, \ldots, \operatorname{cov}(x_j^m, x_j^n), \ldots\right)$$
• Use the samples from class ω_j to estimate the parameters of class ω_j


• Use the information in the training samples to estimate θ = (θ_1, θ_2, …, θ_c); each θ_i (i = 1, 2, …, c) is associated with the i-th category
• Suppose the sample set D contains n i.i.d. samples x_1, x_2, …, x_n

$$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta) = F(\theta)$$
P(D | θ) is called the likelihood of θ with respect to the set of samples.

• The ML estimate of θ is, by definition, the value that maximizes P(D | θ)
• "It is the value of θ that best agrees with the actually observed training samples"


• Optimal estimation
• Let θ = (θ_1, θ_2, …, θ_p)^t and let ∇_θ be the gradient operator
$$\nabla_\theta = \left[\frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \ldots, \frac{\partial}{\partial\theta_p}\right]^t$$

• We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ)
• New problem statement: determine the θ that maximizes the log-likelihood
$$\hat\theta = \arg\max_{\theta}\; l(\theta)$$


Set of necessary conditions for an optimum is:

$$\nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta), \qquad \nabla_\theta l = 0$$


• Example of a specific case: unknown μ
• P(x | μ) ~ N(μ, Σ) (the samples are drawn from a multivariate normal population)

$$\ln P(x_k \mid \mu) = -\frac{1}{2}\ln\left[(2\pi)^d\,|\Sigma|\right] - \frac{1}{2}(x_k - \mu)^t\,\Sigma^{-1}(x_k - \mu)$$
and
$$\nabla_\mu \ln P(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)$$

 =  , therefore the ML estimate for  must satisfy:

k

n

 

k

1

1 ( x k

)

0 Pattern Classification

, Chapter 3

• Multiplying by Σ and rearranging, we obtain:
$$\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k$$
which is the arithmetic average (the sample mean) of the training samples!


Conclusion: Given that P(x_k | ω_j), j = 1, 2, …, c, is Gaussian in a d-dimensional feature space, estimate the vector θ = (θ_1, θ_2, …, θ_c)^t and perform classification using the Bayes decision rule of Chapter 2!
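As a concrete illustration of this plug-in procedure, here is a minimal sketch, assuming NumPy; the function names and the use of full per-class covariances are illustrative choices, not from the original slides. It fits Gaussian ML estimates per class and classifies with the Bayes rule.

```python
import numpy as np

def fit_ml_gaussians(X, y):
    """ML estimates (mean, covariance, prior) for each class label in y."""
    n = len(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                          # ML mean: arithmetic average
        Sigma = np.cov(Xc, rowvar=False, bias=True)   # ML covariance (divides by n_c)
        params[c] = (mu, Sigma, len(Xc) / n)          # prior estimated by class frequency
    return params

def log_gaussian(x, mu, Sigma):
    """Log of the multivariate normal density N(mu, Sigma) evaluated at x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def classify(x, params):
    """Bayes decision rule with the ML estimates plugged in."""
    return max(params, key=lambda c: log_gaussian(x, params[c][0], params[c][1])
                                     + np.log(params[c][2]))
```

With enough samples per class this is just the quadratic Gaussian classifier of Chapter 2 with the true parameters replaced by their ML estimates.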


• ML Estimation: the univariate Gaussian case, with μ and σ² both unknown:
$$\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$$
$$l = \ln P(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$$
$$\nabla_\theta l = \begin{bmatrix} \dfrac{\partial}{\partial\theta_1}\bigl(\ln P(x_k \mid \theta)\bigr) \\[2mm] \dfrac{\partial}{\partial\theta_2}\bigl(\ln P(x_k \mid \theta)\bigr) \end{bmatrix} = \begin{bmatrix} \dfrac{1}{\theta_2}(x_k - \theta_1) \\[2mm] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^{\,2}} \end{bmatrix} = 0$$


Summation over all n samples gives:
$$\sum_{k=1}^{n} \frac{1}{\hat\theta_2}\,(x_k - \hat\theta_1) = 0 \qquad (1)$$
$$-\sum_{k=1}^{n} \frac{1}{\hat\theta_2} + \sum_{k=1}^{n} \frac{(x_k - \hat\theta_1)^2}{\hat\theta_2^{\,2}} = 0 \qquad (2)$$

Combining (1) and (2), one obtains:
$$\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k \; ; \qquad \hat\sigma^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat\mu)^2$$
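A quick numerical check of these closed-form estimates, as a sketch assuming NumPy; the distribution parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)        # samples from N(5, 4)

mu_hat = x.sum() / len(x)                            # (1/n) * sum_k x_k
sigma2_hat = ((x - mu_hat) ** 2).sum() / len(x)      # (1/n) * sum_k (x_k - mu_hat)^2

# np.mean(x) and np.var(x) (which also divides by n) give the same values;
# for large n both estimates are close to the true 5 and 4.
print(mu_hat, sigma2_hat)
```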


• Bias
• The ML estimate for σ² is biased:
$$E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$

• An unbiased estimator for Σ is the sample covariance matrix:
$$C = \frac{1}{n-1}\sum_{k=1}^{n} (x_k - \hat\mu)(x_k - \hat\mu)^t$$
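The same correction in code, as a small sketch assuming NumPy: `np.cov` divides by n−1 by default (the unbiased C above), while `bias=True` gives the ML estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=50)

Sigma_ml = np.cov(X, rowvar=False, bias=True)   # divides by n      -> ML (biased) estimate
C_unbiased = np.cov(X, rowvar=False)            # divides by n - 1  -> sample covariance matrix C

n = X.shape[0]
assert np.allclose(Sigma_ml, (n - 1) / n * C_unbiased)   # they differ exactly by (n-1)/n
```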


• Bayesian Estimation (a Bayesian learning approach for pattern classification problems)
• In MLE, θ was assumed to have a fixed (but unknown) value
• In Bayesian estimation, θ is a random variable
• The computation of the posterior probabilities P(ω_i | x) lies at the heart of Bayesian classification
• Goal: compute P(ω_i | x, D)
• Given the training sample set D, Bayes formula can be written:

$$P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} P(x \mid \omega_j, D)\, P(\omega_j \mid D)}$$

• To demonstrate the preceding equation, use:

$$P(x, D \mid \omega_i) = P(x \mid \omega_i, D)\, P(D \mid \omega_i)$$
$$P(x \mid D) = \sum_{j} P(x, \omega_j \mid D)$$
$$P(\omega_i) = P(\omega_i \mid D) \quad \text{(the training samples provide this!)}$$
Thus:
$$P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j, D_j)\, P(\omega_j)}$$

• Bayesian Parameter Estimation: Gaussian Case

• Goal: estimate θ using the a-posteriori density P(θ | D)
• The univariate Gaussian case: μ is the only unknown parameter; compute P(μ | D)

$$P(x \mid \mu) \sim N(\mu, \sigma^2), \qquad P(\mu) \sim N(\mu_0, \sigma_0^2)$$
where μ_0 and σ_0 are known!

$$P(\mu \mid D) = \frac{P(D \mid \mu)\, P(\mu)}{\int P(D \mid \mu)\, P(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} P(x_k \mid \mu)\, P(\mu) \qquad (1)$$
where α is a normalization factor that does not depend on μ.

• Reproducing density:
$$P(\mu \mid D) \sim N(\mu_n, \sigma_n^2) \qquad (2)$$

The updated parameters of the prior:
$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat\mu_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0
\qquad \text{and} \qquad
\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$
where μ̂_n = (1/n) Σ_{k=1}^{n} x_k is the sample mean of the n training samples.
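A minimal sketch of these update equations, assuming NumPy; the prior and data values are illustrative.

```python
import numpy as np

def posterior_params(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_n, sigma_n^2) over the mean of a univariate Gaussian
    with known variance sigma_sq and prior N(mu0, sigma0_sq)."""
    n = len(x)
    sample_mean = x.mean()                                   # the mu_hat_n in the update formula
    denom = n * sigma0_sq + sigma_sq
    mu_n = (n * sigma0_sq / denom) * sample_mean + (sigma_sq / denom) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / denom
    return mu_n, sigma_n_sq

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=1.0, size=20)                  # known sigma^2 = 1
print(posterior_params(x, mu0=0.0, sigma0_sq=4.0, sigma_sq=1.0))
```

As n grows, μ_n moves toward the sample mean and σ_n² shrinks toward zero, so the posterior sharpens around the ML estimate.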


• The univariate case: P(x | D)
• P(μ | D) has been computed; P(x | D) remains to be computed!

$$P(x \mid D) = \int P(x \mid \mu)\, P(\mu \mid D)\, d\mu$$
P(x | D) is Gaussian.

It provides:
$$P(x \mid D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$

The desired class-conditional density is P(x | D_j, ω_j); using P(x | D_j, ω_j) together with the priors P(ω_j) and Bayes formula, we obtain the Bayesian classification rule:

$$\max_{j} P(\omega_j \mid x, D) \;\Longleftrightarrow\; \max_{j} P(x \mid \omega_j, D_j)\, P(\omega_j)$$

• Bayesian Parameter Estimation: General Theory
• The P(x | D) computation can be applied to any situation in which the unknown density can be parameterized; the basic assumptions are:
  – The form of P(x | θ) is assumed known, but the value of θ is not known exactly
  – Our knowledge about θ is assumed to be contained in a known prior density P(θ)
  – The rest of our knowledge about θ is contained in a set D of n random samples x_1, x_2, …, x_n drawn according to P(x)

The basic problem is: "Compute the posterior density P(θ | D)", then "Derive P(x | D)". Using Bayes formula, we have:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{\int P(D \mid \theta)\, P(\theta)\, d\theta}$$

And, by the independence assumption:
$$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$$

Overfitting

Problem of Insufficient Data

• How to train a classifier (e.g., estimate the covariance matrix) when the training set size is small compared to the number of features?
• Reduce the dimensionality
  – Select a subset of features
  – Combine available features to get a smaller number of more "salient" features

• Bayesian techniques
  – Assume a reasonable prior on the parameters to compensate for the small amount of training data
• Model simplification
  – Assume statistical independence
• Heuristics
  – Threshold the estimated covariance matrix so that only correlations above a threshold are retained

Practical Observations

• Most heuristics and model simplifications are almost surely incorrect
• In practice, however, the performance of classifiers based on model simplification is better than with full parameter estimation
• Paradox: how can a suboptimal/simplified model perform better on a test dataset than the MLE of the full parameter set?

– The answer involves the problem of insufficient data

Insufficient Data in Curve Fitting

Curve Fitting Example

(contd.)
• The example shows that a 10th-degree polynomial fits the training data with zero error
  – However, the test (generalization) error is much higher for this fitted curve
• When the data size is small, one cannot be sure about how complex the model should be
• A small change in the data will change the parameters of the 10th-degree polynomial significantly, which is not a desirable quality (lack of stability)

Handling insufficient data

• Heuristics and model simplifications
• Shrinkage is an intermediate approach, which combines a "common covariance" with the individual covariance matrices
  – Individual covariance matrices shrink towards a common covariance matrix

  – Also called regularized discriminant analysis
• Shrinkage estimator for a covariance matrix, given shrinkage factor 0 < α < 1:
$$\Sigma_i(\alpha) = \frac{(1-\alpha)\, n_i\, \Sigma_i + \alpha\, n\, \Sigma}{(1-\alpha)\, n_i + \alpha\, n}$$
• Further, the common covariance can be shrunk towards the identity matrix:
$$\Sigma(\beta) = (1-\beta)\, \Sigma + \beta\, I$$
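A direct transcription of these two shrinkage formulas as a sketch, assuming NumPy; the function name and arguments are illustrative.

```python
import numpy as np

def shrinkage_covariance(Sigma_i, n_i, Sigma, n, alpha, beta=0.0):
    """Shrink the class covariance Sigma_i toward the common covariance Sigma (factor alpha),
    optionally shrinking the common covariance toward the identity first (factor beta)."""
    d = Sigma.shape[0]
    Sigma = (1 - beta) * Sigma + beta * np.eye(d)        # Sigma(beta) = (1-beta)*Sigma + beta*I
    num = (1 - alpha) * n_i * Sigma_i + alpha * n * Sigma
    den = (1 - alpha) * n_i + alpha * n
    return num / den                                     # Sigma_i(alpha)
```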

Problems of Dimensionality

Introduction

• Real-world applications usually come with a large number of features
  – Text in documents is represented using frequencies of tens of thousands of words
  – Images are often represented by extracting local features from a large number of regions within an image
• Naive intuition: the more features, the better the classification performance?
  – Not always!

• There are two issues that must be confronted with high-dimensional feature spaces:
  – How does the classification accuracy depend on the dimensionality and the number of training samples?

– What is the computational complexity of the classifier?

Statistically Independent Features

• If features are statistically independent, it is possible to get excellent performance as the dimensionality increases
• For a two-class problem with multivariate normal classes and equal prior probabilities, the probability of error is
$$P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du$$
where the squared Mahalanobis distance is defined as
$$r^2 = (\mu_1 - \mu_2)^T\, \Sigma^{-1}\, (\mu_1 - \mu_2)$$

Statistically Independent Features

• When the features are independent, the covariance matrix is diagonal, and we have
$$r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2$$
• Since r² increases monotonically with an increase in the number of features, P(e) decreases
• As long as the means of the corresponding features in the two classes differ, the error decreases
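A small sketch of this error computation for the diagonal-covariance case, assuming NumPy; the per-feature means and standard deviations are made up.

```python
import math
import numpy as np

mu1 = np.array([0.5, 1.0, 0.2, 0.8])      # hypothetical class-1 feature means
mu2 = np.array([0.0, 0.4, 0.0, 0.3])      # hypothetical class-2 feature means
sigma = np.array([1.0, 2.0, 0.5, 1.5])    # per-feature standard deviations (diagonal Sigma)

r = math.sqrt(np.sum(((mu1 - mu2) / sigma) ** 2))    # Mahalanobis distance between the means
p_error = 0.5 * math.erfc(r / (2 * math.sqrt(2)))    # P(e) = Gaussian tail above r/2
print(r, p_error)                                    # adding informative features raises r and lowers P(e)
```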

Increasing Dimensionality

• If a given set of features does not result in good classification performance, it is natural to add more features
• High dimensionality results in increased cost and complexity for both feature extraction and classification
• If the probabilistic structure of the problem is completely known, adding new features cannot increase the Bayes risk

Curse of Dimensionality

• In practice, increasing the dimensionality beyond a certain point, in the presence of a finite number of training samples, often leads to worse performance rather than better
• The main reasons for this paradox are as follows:
  – The Gaussian assumption that is typically made is almost surely incorrect
  – The training sample size is always finite, so the estimation of the class-conditional density is not very accurate
• Analysis of this "curse of dimensionality" problem is difficult

A Simple Example

• Trunk (PAMI, 1979) provided a simple example illustrating this phenomenon.

$$p(\omega_1) = p(\omega_2) = \tfrac{1}{2}, \qquad \mu_1 = \mu, \quad \mu_2 = -\mu$$
$$p(x \mid \omega_1) \sim G(\mu_1, I), \qquad p(x \mid \omega_2) \sim G(\mu_2, I), \qquad N: \text{number of features}$$
$$p(x \mid \omega_1) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(x_i - \frac{1}{\sqrt{i}}\right)^2},
\qquad
p(x \mid \omega_2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(x_i + \frac{1}{\sqrt{i}}\right)^2}$$
where μ_i = 1/√i is the i-th component of the mean vector, i.e.
$$\mu_1 = \left(1, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{3}}, \tfrac{1}{\sqrt{4}}, \ldots\right)^t, \qquad \mu_2 = -\left(1, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{3}}, \tfrac{1}{\sqrt{4}}, \ldots\right)^t$$

Case 1: Mean Values Known

• Bayes decision rule (means known): decide ω_1 if x^t μ > 0, i.e.
$$x_1\mu_1 + x_2\mu_2 + \cdots + x_N\mu_N > 0, \qquad \text{or} \qquad \sum_{i=1}^{N} x_i\,\mu_i > 0$$
• The resulting probability of error is
$$P_e = \frac{1}{\sqrt{2\pi}} \int_{\gamma/2}^{\infty} e^{-\frac{1}{2}z^2}\, dz, \qquad \gamma = \|\mu_1 - \mu_2\|, \qquad \|\mu_1 - \mu_2\|^2 = 4\sum_{i=1}^{N} \frac{1}{i}$$
so that
$$P_e(N) = \frac{1}{\sqrt{2\pi}} \int_{\sqrt{\sum_{i=1}^{N} 1/i}}^{\infty} e^{-\frac{1}{2}z^2}\, dz$$
• Since Σ_{i=1}^{N} 1/i is a divergent series, P_e → 0 as N → ∞
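A short sketch that evaluates this closed-form error as N grows, assuming NumPy and the mean components μ_i = 1/√i reconstructed above.

```python
import math
import numpy as np

def trunk_error_known_means(N):
    """P_e when the means are known: the Gaussian tail above sqrt(sum_{i=1}^N 1/i)."""
    half_distance = math.sqrt(np.sum(1.0 / np.arange(1, N + 1)))   # ||mu_1 - mu_2|| / 2
    return 0.5 * math.erfc(half_distance / math.sqrt(2))           # Gaussian tail probability

for N in (1, 10, 100, 1000, 10000):
    print(N, trunk_error_known_means(N))   # decreases toward 0, slowly, since sum 1/i ~ ln N
```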

Case 2: Mean Values Unknown

• m labeled training samples are available; the mean is estimated by the pooled estimate
$$\hat\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$$
(samples from ω_2 enter the pooled estimate with their signs flipped)
• Plug-in decision rule: decide ω_1 if
$$x^t \hat\mu = x_1\hat\mu_1 + x_2\hat\mu_2 + \cdots + x_N\hat\mu_N > 0$$
• The probability of error is
$$P_e(N, m) = P(x^t\hat\mu > 0 \mid \omega_2)\,P(\omega_2) + P(x^t\hat\mu \le 0 \mid \omega_1)\,P(\omega_1) = P(x^t\hat\mu \le 0 \mid \omega_1)$$
due to symmetry.
• Let
$$z = x^t\hat\mu = \sum_{i=1}^{N} x_i\,\hat\mu_i$$
It is difficult to compute the distribution of z.

Case 2: Mean Values Unknown

Approximating the distribution of z given ω_1:
$$E[z] = \sum_{i=1}^{N} \frac{1}{i}, \qquad \operatorname{VAR}[z] = \frac{N}{m} + \left(1 + \frac{1}{m}\right)\sum_{i=1}^{N} \frac{1}{i}$$
$$\frac{z - E[z]}{\sqrt{\operatorname{VAR}[z]}} \;\sim\; G(0, 1) \quad \text{(approximately standard normal)}$$
$$P_e(N, m) = P(z \le 0 \mid \omega_1) = \frac{1}{\sqrt{2\pi}} \int_{\theta(N)}^{\infty} e^{-\frac{1}{2}z^2}\, dz, \qquad \theta(N) = \frac{E[z]}{\sqrt{\operatorname{VAR}[z]}} = \frac{\sum_{i=1}^{N} 1/i}{\sqrt{N/m + (1 + 1/m)\sum_{i=1}^{N} 1/i}}$$
$$\lim_{N \to \infty} \theta(N) = 0 \qquad \Longrightarrow \qquad \lim_{N \to \infty} P_e(m, N) = \frac{1}{2}$$
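A Monte-Carlo sketch of this unknown-mean case, assuming NumPy; the trial count and sample sizes are arbitrary, and the sign-flipped pooling of the class-ω_2 training samples follows the construction above.

```python
import numpy as np

def trunk_error_estimated_means(N, m, trials=2000, seed=0):
    """Empirical error of the plug-in rule x @ mu_hat > 0 with mu estimated from m samples."""
    rng = np.random.default_rng(seed)
    mu = 1.0 / np.sqrt(np.arange(1, N + 1))          # mu_i = 1/sqrt(i)
    errors = 0
    for _ in range(trials):
        train = mu + rng.standard_normal((m, N))     # pooled samples (omega_2 samples sign-flipped)
        mu_hat = train.mean(axis=0)
        x = mu + rng.standard_normal(N)              # test sample from omega_1 (symmetric for omega_2)
        errors += (x @ mu_hat) <= 0
    return errors / trials

for N in (1, 10, 100, 1000):
    print(N, trunk_error_estimated_means(N, m=10))   # for fixed m the error eventually climbs back toward 1/2
```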


• Component Analysis and Discriminants
• Combine features in order to reduce the dimension of the feature space
• Linear combinations are simple to compute and tractable
• Project high-dimensional data onto a lower-dimensional space
• Two classical approaches for finding an "optimal" linear transformation:
  – PCA (Principal Component Analysis): "the projection that best represents the data in a least-squares sense"
  – MDA (Multiple Discriminant Analysis): "the projection that best separates the data in a least-squares sense"
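A minimal PCA sketch, assuming NumPy: project the data onto the top-k eigenvectors of the sample covariance matrix, which gives the least-squares-optimal linear representation mentioned above.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components
    (the least-squares-optimal linear representation)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :k]                  # top-k eigenvectors as columns
    return X_centered @ W                        # (n, k) low-dimensional representation
```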
