Ch 3. Linear Models for Regression (2/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized by Yung-Kyun Noh
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents

3.4 Bayesian Model Comparison
3.5 The Evidence Approximation
 3.5.1 Evaluation of the evidence function
 3.5.2 Maximizing the evidence function
 3.5.3 Effective number of parameters
3.6 Limitations of Fixed Basis Functions
Bayesian Model Comparison (1/3)

• The problem of model selection from a Bayesian perspective
  - Over-fitting associated with maximum likelihood can be avoided by marginalizing over the model parameters instead of making point estimates of their values.
  - It also allows multiple complexity parameters to be determined simultaneously as part of the training process (e.g., in the relevance vector machine).
  - The Bayesian view of model comparison simply involves the use of probabilities to represent uncertainty in the choice of model.

• Posterior over models
  p(M_i | D) ∝ p(M_i) p(D | M_i)
  - p(M_i): prior, expressing a preference for different models.
  - p(D | M_i): model evidence (marginal likelihood), the preference shown by the data for different models. The parameters have been marginalized out.
Bayesian Model Comparison (2/3)

• Bayes factor: the ratio of the model evidences for two models,
  p(D | M_i) / p(D | M_j)

• Predictive distribution: a mixture distribution, obtained by averaging the predictive distributions of the individual models weighted by their posterior probabilities:
  p(t | x, D) = Σ_{i=1}^{L} p(t | x, M_i, D) p(M_i | D)

• Model evidence
  p(D | M_i) = ∫ p(D | w, M_i) p(w | M_i) dw
  - Sampling perspective: the marginal likelihood can be viewed as the probability of generating the data set D from a model whose parameters are sampled at random from the prior.

• Posterior distribution over parameters
  p(w | D, M_i) = p(D | w, M_i) p(w | M_i) / p(D | M_i)
  - The evidence is the normalizing term that appears in the denominator when evaluating the posterior distribution over the parameters.
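The sampling perspective above can be made concrete with a small Monte Carlo sketch in Python (not from the slides; the helper name log_evidence_mc, the Gaussian prior N(0, α⁻¹I), and the Gaussian noise with precision β are assumptions matching the linear-Gaussian model of this chapter): draw parameter vectors from the prior and average the likelihood of the whole data set.

import numpy as np

def log_evidence_mc(t, Phi, alpha, beta, n_samples=10_000, seed=None):
    """Monte Carlo estimate of ln p(t | alpha, beta): sample w from the prior
    N(0, alpha^-1 I) and average the data likelihood p(t | w, beta)."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    log_liks = np.empty(n_samples)
    for s in range(n_samples):
        w = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=M)   # w ~ prior
        resid = t - Phi @ w
        # Gaussian log-likelihood of the full data set under this sampled w
        log_liks[s] = 0.5 * N * np.log(beta / (2 * np.pi)) - 0.5 * beta * resid @ resid
    # log-mean-exp of the sampled likelihoods, for numerical stability
    m = log_liks.max()
    return m + np.log(np.exp(log_liks - m).mean())

A model that concentrates its prior on parameter settings that explain D well obtains a high average likelihood (high evidence); a very flexible model wastes most prior samples on data sets unlike D, which is exactly the Occam effect discussed on the next slide.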
Bayesian Model Comparison (3/3)

• Assume that the posterior distribution is sharply peaked around the most probable value w_MAP, with width Δw_posterior, and that the prior is flat with width Δw_prior.
  p(D) = ∫ p(D | w) p(w) dw ≈ p(D | w_MAP) Δw_posterior / Δw_prior
  ln p(D) ≈ ln p(D | w_MAP) + ln(Δw_posterior / Δw_prior)
  - For a model having a set of M parameters,
    ln p(D) ≈ ln p(D | w_MAP) + M ln(Δw_posterior / Δw_prior)
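As a reasoning sketch (my paraphrase of the argument behind these expressions, written in LaTeX notation): treat the likelihood as a box of height p(D | w_MAP) and width Δw_posterior, and the flat prior as having height 1/Δw_prior over its support, so that

\[
p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, dw
\;\approx\; p(\mathcal{D} \mid w_{\mathrm{MAP}})\, \Delta w_{\mathrm{posterior}} \cdot \frac{1}{\Delta w_{\mathrm{prior}}},
\qquad
\ln p(\mathcal{D}) \;\approx\; \ln p(\mathcal{D} \mid w_{\mathrm{MAP}})
+ \ln\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}.
\]

The second term is negative, since the posterior is narrower than the prior, and with M parameters it is multiplied by M: the first term rewards fit to the data, while the second penalizes complexity.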
• A simple model has little variability and so will generate data sets that are fairly similar to each other.
• A complex model spreads its predictive probability over too broad a range of data sets and so assigns relatively small probability to any one of them.
The Evidence Approximation (1/2)

• Fully Bayesian treatment of the linear basis function model
  - Hyperparameters: α, β.
  - Prediction: marginalize with respect to the hyperparameters as well as w.

• Predictive distribution
  p(t | t) = ∫∫∫ p(t | w, β) p(w | t, α, β) p(α, β | t) dw dα dβ
  - If the posterior distribution p(α, β | t) is sharply peaked around values α̂, β̂, the predictive distribution is obtained simply by marginalizing over w with α, β fixed to the values α̂, β̂:
    p(t | t) ≈ p(t | t, α̂, β̂) = ∫ p(t | w, β̂) p(w | t, α̂, β̂) dw
  - Posterior over the hyperparameters: p(α, β | t) ∝ p(t | α, β) p(α, β), with
    (α̂, β̂) = argmax_{α,β} p(α, β | t)
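For reference, here is a minimal numpy sketch of this predictive distribution once the hyperparameters are fixed at α̂, β̂ (the helper name predictive is my own; it uses the standard Gaussian posterior mean m_N and covariance S_N for the zero-mean Gaussian prior of this chapter):

import numpy as np

def predictive(Phi, t, phi_x, alpha_hat, beta_hat):
    """Mean and variance of the Gaussian predictive distribution
    p(t* | x*, t, alpha_hat, beta_hat) at a new input with features phi_x."""
    M = Phi.shape[1]
    A = alpha_hat * np.eye(M) + beta_hat * Phi.T @ Phi   # A = S_N^{-1}
    S_N = np.linalg.inv(A)                               # posterior covariance
    m_N = beta_hat * S_N @ Phi.T @ t                     # posterior mean
    mean = m_N @ phi_x
    var = 1.0 / beta_hat + phi_x @ S_N @ phi_x           # noise + parameter uncertainty
    return mean, var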
The Evidence Approximation (2/2)

• If the prior p(α, β) is relatively flat,
  - in the evidence framework the values α̂, β̂ are obtained by maximizing the marginal likelihood function p(t | α, β).

• With this method the hyperparameters can be determined from the training data alone (without recourse to cross-validation).
  - Recall that the ratio α/β is analogous to a regularization parameter.

• Maximizing the evidence
  - Set the derivatives of the evidence function to zero and obtain re-estimation equations for α and β.
  - Alternatively, use the expectation maximization (EM) algorithm.
Evaluation of the Evidence Function

• Marginal likelihood
  p(t | α, β) = ∫ p(t | w, β) p(w | α) dw
              = (β / 2π)^(N/2) (α / 2π)^(M/2) ∫ exp{−E(w)} dw
  where
  E(w) = β E_D(w) + α E_W(w) = (β/2) ||t − Φw||² + (α/2) wᵀw

• Model evidence in log form:
  ln p(t | α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − (1/2) ln|A| − (N/2) ln(2π)
  with
  m_N = β A⁻¹ Φᵀ t
  A = α I + β ΦᵀΦ
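A direct numpy translation of the log-evidence expression above might look as follows (a sketch; the function name log_evidence and the argument layout, with Φ as the N×M design matrix and t as the target vector, are my own choices):

import numpy as np

def log_evidence(t, Phi, alpha, beta):
    """Closed-form ln p(t | alpha, beta) for the linear-Gaussian model."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))

This is the quantity that the next slide maximizes with respect to α and β.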
Maximizing the Evidence Function

• Maximization of p(t | α, β)
  - Set the derivatives with respect to α and β to zero.

• w.r.t. α
  - u_i and λ_i are the eigenvectors and eigenvalues defined by
    (β ΦᵀΦ) u_i = λ_i u_i
  - The maximizing hyperparameter is
    α = γ / (m_Nᵀ m_N),   where   γ = Σ_i λ_i / (α + λ_i)

• w.r.t. β
  1/β = 1/(N − γ) Σ_{n=1}^{N} { t_n − m_Nᵀ φ(x_n) }²

• Since γ depends on α (and the λ_i on β), these are implicit equations, solved by iterating the re-estimates to convergence.
Effective Number of Parameters (1/2)

• γ: the effective total number of well-determined parameters
  γ = Σ_i λ_i / (α + λ_i)
  [Figure: contours of the likelihood and the prior in parameter space, with eigenvalue λ_1 small compared with λ_2.]
  - In a direction with λ_i ≫ α: λ_i / (λ_i + α) ≈ 1, the parameter is well determined by the data, and w_MAP ≈ w_ML.
  - In a direction with λ_i ≪ α: λ_i / (λ_i + α) ≈ 0, the parameter is poorly determined by the data and is driven toward zero by the prior (w_MAP ≈ 0).
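A tiny worked example (the numbers are my own, not from the slides): with α = 1 and eigenvalues λ₁ = 10, λ₂ = 0.1,

\[
\gamma = \frac{10}{10 + 1} + \frac{0.1}{0.1 + 1} = \frac{10}{11} + \frac{1}{11} = 1,
\]

so roughly one of the two parameters is well determined by the data, while the other is effectively fixed by the prior.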
Effective Number of Parameters (2/2)

[Figure: plots of 2α E_W(m_N) and γ, marking the optimal α, together with the log evidence ln p(t | α, β) and the test error, also marking the optimal α.]
Limitations of Fixed Basis Functions

• Models comprising a linear combination of fixed, nonlinear basis functions
  - have closed-form solutions to the least-squares problem;
  - have a tractable Bayesian treatment.

• The difficulty
  - The basis functions φ_j(x) are fixed before the training data set is observed, so their number typically needs to grow rapidly with the dimensionality of the input space, a manifestation of the curse of dimensionality.

• Properties of real data sets that alleviate this problem
  - The data vectors {x_n} typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller than that of the input space.
  - The target variables may have significant dependence on only a small number of possible directions within the data manifold.