
Announcement

Talk by Prof. Buhmann on “Statistical Models on Image Segmentation and Clustering in Computer Vision” in EB3105 from 4:00pm to 5:00pm
Homework 5 is out: http://www.cse.msu.edu/~cse847/assignments.html
Bayesian Learning
Rong Jin
Outline

MAP learning vs. ML learning
Minimum description length principle
Bayes optimal classifier
Bagging
Maximum Likelihood Learning (ML)

Find the model that best explains the observations by maximizing the log-likelihood of the training data
Logistic regression:
$$p(y \mid x; \theta) = \frac{1}{1 + \exp\left(-y\,(x \cdot w + c)\right)}, \qquad \theta = \{w_1, w_2, \ldots, w_m, c\}$$
Parameters are found by maximizing the likelihood of the training data:
$$(w^*, c^*) = \arg\max_{w,c}\; l(D_{train}) = \arg\max_{w,c} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left(-y_i\,(x_i \cdot w + c)\right)}$$
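As a concrete illustration (not from the slides), here is a minimal sketch of ML learning for this logistic regression model: plain gradient ascent on the log-likelihood, with labels y in {-1, +1}. The function name fit_ml_logistic, the step size, and the iteration count are illustrative choices.

```python
import numpy as np

def fit_ml_logistic(X, y, lr=0.1, n_iters=1000):
    """Gradient ascent on (1/n) sum_i log 1/(1 + exp(-y_i (x_i.w + c)))."""
    n, m = X.shape
    w, c = np.zeros(m), 0.0
    for _ in range(n_iters):
        margin = y * (X @ w + c)                         # y_i (x_i . w + c)
        g = y * (1.0 - 1.0 / (1.0 + np.exp(-margin)))    # per-example factor (1 - sigma(margin_i)) * y_i
        w += lr * (X.T @ g) / n                          # averaged gradient step in w
        c += lr * g.mean()                               # averaged gradient step in c
    return w, c
```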
Maximum A Posteriori Learning (MAP)

In ML learning, models are determined solely by the training examples
Very often, we have prior knowledge/preference about parameters/models
ML learning doesn’t incorporate the prior knowledge/preference on parameters/models
Maximum a posteriori learning (MAP): knowledge/preference about parameters/models is incorporated through a prior
$$\theta^* = \arg\max_{\theta} \Pr(D \mid \theta)\, \Pr(\theta)$$
where $\Pr(\theta)$ is the prior for the parameters
Example: Logistic Regression

ML learning:
$$(w^*, c^*) = \arg\max_{w,c}\; l(D_{train}) = \arg\max_{w,c} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left(-y_i\,(x_i \cdot w + c)\right)}$$
Prior knowledge/preference:
No feature should dominate over all other features, so prefer small weights
Gaussian prior for parameters/models:
$$\Pr(w) \propto \exp\left(-\frac{1}{\sigma^2} \sum_{i=1}^{m} w_i^2\right)$$
Example (cont’d)

MAP learning for logistic regression:
$$(w^*, c^*) = \arg\max_{w,c} \Pr(D \mid w, c)\, \Pr(w, c) = \arg\max_{w,c} \left[ \log \Pr(D \mid w, c) + \log \Pr(w, c) \right]$$
$$= \arg\max_{w,c} \left[ \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left(-y_i\,(x_i \cdot w + c)\right)} - \frac{1}{\sigma^2} \sum_{i=1}^{m} w_i^2 \right]$$
Compared to regularized logistic regression:
$$l_{reg}(D_{train}) = \sum_{i=1}^{N} \log \frac{1}{1 + \exp\left(-y_i\,(c + x_i \cdot w)\right)} - s \sum_{i=1}^{m} w_i^2$$
MAP learning with a Gaussian prior on the weights is exactly L2-regularized logistic regression, with regularization constant $s = 1/\sigma^2$
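A minimal sketch of this correspondence (not from the slides; the name fit_map_logistic and the value of sigma2 are made up, and the objective is averaged over examples): adding the Gaussian log-prior to the log-likelihood turns the ML update above into a weight-decayed, i.e. L2-regularized, update.

```python
import numpy as np

def fit_map_logistic(X, y, sigma2=10.0, lr=0.1, n_iters=1000):
    """Gradient ascent on (1/n) [ sum_i log 1/(1+exp(-y_i (x_i.w + c))) - (1/sigma2) sum_j w_j^2 ]."""
    n, m = X.shape
    w, c = np.zeros(m), 0.0
    for _ in range(n_iters):
        margin = y * (X @ w + c)
        g = y * (1.0 - 1.0 / (1.0 + np.exp(-margin)))    # likelihood part, as in ML learning
        w += lr * ((X.T @ g) - (2.0 / sigma2) * w) / n   # the log-prior contributes -2 w / sigma^2
        c += lr * g.mean()                               # no prior is placed on the bias c here
    return w, c
```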

Minimum Description Length Principle

Occam’s razor: prefer the simplest hypothesis
Simplest hypothesis = hypothesis with the shortest description length
$$h_{MDL} = \arg\min_{h \in H}\; L_{C_1}(h) + L_{C_2}(D \mid h)$$
$L_{C_1}(h)$: # of bits to encode hypothesis h (the complexity of the model)
$L_{C_2}(D \mid h)$: # of bits to encode data D given h (the # of mistakes)
$L_C(x)$ is the description length for message x under coding scheme C
Prefer the hypothesis with the shortest total description length
Minimum Description Length Principle
$$h_{MDL} = \arg\min_{h \in H}\; L_{C_1}(h) + L_{C_2}(D \mid h)$$
Transmission view: a sender must communicate the data D to a receiver. Send only D? Send only h? Send h plus the exceptions D given h?
Example: Decision Tree

H = decision trees, D = training data labels
$L_{C_1}(h)$ is the # of bits to describe tree h
$L_{C_2}(D \mid h)$ is the # of bits to describe D given tree h
Note $L_{C_2}(D \mid h) = 0$ if the examples are classified perfectly by h; we only need to describe the exceptions
$h_{MDL}$ trades off tree size for training errors (a sketch of this trade-off follows below)
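A minimal sketch of the trade-off (not from the slides; the bit counts, error counts, and the per-exception cost bits_per_exception are entirely made-up): each candidate tree is charged $L_{C_1}(h)$ bits for its own description plus a fixed number of bits per misclassified training example for $L_{C_2}(D \mid h)$, and $h_{MDL}$ is the candidate with the smallest total.

```python
# (name, bits to encode tree h, # of training examples misclassified by h)
candidates = [
    ("small tree",  40, 12),
    ("medium tree", 90,  3),
    ("large tree", 300,  0),
]
bits_per_exception = 18  # hypothetical cost of describing one exception

def total_description_length(tree_bits, n_exceptions):
    return tree_bits + n_exceptions * bits_per_exception  # L_C1(h) + L_C2(D|h)

h_mdl = min(candidates, key=lambda t: total_description_length(t[1], t[2]))
print(h_mdl[0])  # the medium tree wins: neither the smallest nor the most accurate tree
```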
MAP vs. MDL

MAP learning:
$$h_{MAP} = \arg\max_{h \in H} \Pr(D \mid h)\, \Pr(h) = \arg\max_{h \in H} \left[ \log_2 \Pr(D \mid h) + \log_2 \Pr(h) \right] = \arg\min_{h \in H} \left[ -\log_2 \Pr(h) - \log_2 \Pr(D \mid h) \right]$$
Fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses $-\log_2 p$ bits
Interpreting MAP via the MDL principle: $-\log_2 \Pr(h)$ is the description length of h under optimal coding, and $-\log_2 \Pr(D \mid h)$ is the description length of the exceptions under optimal coding
$$h_{MDL} = \arg\min_{h \in H}\; L_{C_1}(h) + L_{C_2}(D \mid h)$$
Problems with Maximum Approaches

Consider three possible hypotheses:
$$\Pr(h_1 \mid D) = 0.4, \quad \Pr(h_2 \mid D) = 0.3, \quad \Pr(h_3 \mid D) = 0.3$$
Maximum approaches will pick $h_1$
Given a new instance x:
$$h_1(x) = +, \quad h_2(x) = -, \quad h_3(x) = -$$
Maximum approaches will output +
However, is this the most probable result?
Bayes Optimal Classifier (Bayesian Average)

Bayes optimal classification:
$$c^*(x) = \arg\max_{c} \sum_{h \in H} \Pr(h \mid D)\, \Pr(c \mid h, x)$$
Example:
$$\Pr(h_1 \mid D) = 0.4, \quad \Pr(+ \mid h_1, x) = 1, \quad \Pr(- \mid h_1, x) = 0$$
$$\Pr(h_2 \mid D) = 0.3, \quad \Pr(+ \mid h_2, x) = 0, \quad \Pr(- \mid h_2, x) = 1$$
$$\Pr(h_3 \mid D) = 0.3, \quad \Pr(+ \mid h_3, x) = 0, \quad \Pr(- \mid h_3, x) = 1$$
$$\sum_h \Pr(h \mid D)\, \Pr(+ \mid h, x) = 0.4, \qquad \sum_h \Pr(h \mid D)\, \Pr(- \mid h, x) = 0.6$$
The most probable class is -
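A minimal sketch (the numbers are taken from the example above) of Bayes optimal classification as a posterior-weighted vote over hypotheses:

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # Pr(h | D)
class_probs = {                                   # Pr(c | h, x) for c in {+, -}
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

votes = {
    c: sum(posteriors[h] * class_probs[h][c] for h in posteriors)
    for c in ("+", "-")
}
print(votes)                      # {'+': 0.4, '-': 0.6}
print(max(votes, key=votes.get))  # '-' : the Bayes optimal prediction
```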
When Do We Need Bayesian Average?

Bayes optimal classification:
$$c^*(x) = \arg\max_{c} \sum_{h \in H} \Pr(h \mid D)\, \Pr(c \mid h, x)$$
When do we need Bayesian average?
When the posterior has multiple modes
When the optimal mode is flat
When NOT to use Bayesian average?
When Pr(h|D) can’t be estimated accurately
Computational Issues with Bayes Optimal Classifier

Bayes optimal classification:
$$c^*(x) = \arg\max_{c} \sum_{h \in H} \Pr(h \mid D)\, \Pr(c \mid h, x)$$
Computational issues:
Need to sum over all possible models/hypotheses h
This is expensive or impossible when the model/hypothesis space is large (example: decision trees)
Solution: sampling!
Gibbs Classifier

Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this hypothesis to classify the new instance
Surprising fact:
$$E[\mathrm{err}_{Gibbs}] \le 2\, E[\mathrm{err}_{BayesOptimal}]$$
Improve by sampling multiple hypotheses from P(h|D) and averaging their classification results, e.g. via:
Markov chain Monte Carlo (MCMC) sampling
Importance sampling
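A minimal sketch (not from the slides) of the Gibbs classifier and its averaged variant; `hypotheses` is an assumed list of (posterior weight, classifier function) pairs, and the class labels in `classes` are illustrative.

```python
import random

def gibbs_classify(hypotheses, x):
    """Draw one hypothesis h ~ P(h|D) and use it to classify x."""
    weights = [p for p, _ in hypotheses]
    _, h = random.choices(hypotheses, weights=weights, k=1)[0]
    return h(x)

def averaged_classify(hypotheses, x, n_samples=100, classes=("+", "-")):
    """Average many Gibbs draws; this approaches the Bayes optimal prediction."""
    votes = {c: 0 for c in classes}
    for _ in range(n_samples):
        votes[gibbs_classify(hypotheses, x)] += 1
    return max(votes, key=votes.get)
```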
Bagging Classifiers

In general, sampling from P(h|D) is difficult because:
1. P(h|D) is rather difficult to compute (example: how to compute P(h|D) for a decision tree?)
2. P(h|D) is impossible to compute for non-probabilistic classifiers such as SVM
3. P(h|D) is extremely small when the hypothesis space is large
Bagging classifiers:
Realize sampling from P(h|D) through a sampling of training examples
Bootstrap Sampling

Bagging = Bootstrap aggregating
Bootstrap sampling: given a set D containing m training examples
Create Di by drawing m examples at random with replacement from D
Each Di is expected to leave out about 0.37 (roughly 1/e) of the examples in D
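A minimal sketch (illustrative; m and the seed are arbitrary) checking that fraction empirically: a bootstrap sample of size m drawn with replacement misses roughly a 1/e fraction of D.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
D = np.arange(m)                              # stand-in for m training examples

D_i = rng.choice(D, size=m, replace=True)     # draw m examples with replacement
left_out = 1.0 - len(np.unique(D_i)) / m      # fraction of D never drawn into D_i
print(round(left_out, 3))                     # roughly 0.37
```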
Bagging Algorithm

Create k bootstrap samples D1, D2, …, Dk
Train a distinct classifier hi on each Di
Classify a new instance by a classifier vote with equal weights (see the sketch below):
$$c^*(x) = \arg\max_{c} \sum_{i=1}^{k} \Pr(c \mid h_i, x)$$
Bagging ≈ Bayesian Average

Bayesian average: sample hypotheses h1, h2, …, hk from the posterior P(h|D), then average $\sum_i \Pr(c \mid h_i, x)$
Bagging: draw bootstrap samples D1, D2, …, Dk from D, train h1, h2, …, hk on them, then average $\sum_i \Pr(c \mid h_i, x)$
Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D)
Empirical Study of Bagging

Bagging decision trees:
Bootstrap 50 different samples from the original training data
Learn a decision tree over each bootstrap sample
Predict the class labels for test instances by the majority vote of the 50 decision trees
Bagging decision trees performs better than a single decision tree
Bias-Variance Tradeoff

Why does bagging work better than a single classifier?
Bias-variance tradeoff
Real-valued case:
The output y for x follows $y \sim f(x) + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$
$\hat{f}(x \mid D)$ is a predictor learned from training data D
Bias-variance decomposition:
$$E_{D,y}\left[\left(y - \hat{f}(x \mid D)\right)^2\right] = E_y\left[\left(y - f(x)\right)^2\right] + E_D\left[\left(f(x) - \hat{f}(x \mid D)\right)^2\right]$$
$$= \sigma^2 + \left(f(x) - E_D\left[\hat{f}(x \mid D)\right]\right)^2 + E_D\left[\left(\hat{f}(x \mid D) - E_D\left[\hat{f}(x \mid D)\right]\right)^2\right]$$
Irreducible variance: $\sigma^2$
Model bias: $\left(f(x) - E_D\left[\hat{f}(x \mid D)\right]\right)^2$; the simpler the $\hat{f}(x \mid D)$, the larger the bias
Model variance: $E_D\left[\left(\hat{f}(x \mid D) - E_D\left[\hat{f}(x \mid D)\right]\right)^2\right]$; the simpler the $\hat{f}(x \mid D)$, the smaller the variance
Bias-Variance Tradeoff
[Figure: fitting the true model with complicated models gives small model bias but large model variance]
Bias-Variance Tradeoff
[Figure: fitting the true model with simple models gives large model bias but small model variance]
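A minimal sketch (not from the slides) of the tradeoff pictured above: Monte Carlo estimates of the model bias and model variance at a single point x0, comparing a simple predictor (a fitted constant) with a more complex one (a fitted cubic polynomial). The true function, noise level, and sample sizes are made-up choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)            # true function f(x)
sigma, n, n_datasets, x0 = 0.3, 20, 2000, 0.25

def bias2_and_variance(degree):
    preds = np.empty(n_datasets)
    for t in range(n_datasets):
        x = rng.uniform(0, 1, n)               # a fresh training set D
        y = f(x) + rng.normal(0, sigma, n)     # y ~ f(x) + eps
        coeffs = np.polyfit(x, y, degree)      # predictor learned from D
        preds[t] = np.polyval(coeffs, x0)      # its prediction at x0
    bias2 = (f(x0) - preds.mean()) ** 2        # (f(x0) - E_D[pred])^2
    variance = preds.var()                     # E_D[(pred - E_D[pred])^2]
    return bias2, variance

print("constant model (degree 0):", bias2_and_variance(0))  # larger bias, smaller variance
print("cubic model (degree 3):   ", bias2_and_variance(3))  # smaller bias, larger variance
```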
Bagging

Bagging performs better than a single classifier because it effectively reduces the model variance
[Figure: bias and variance of a single decision tree vs. a bagged decision tree]