Announcement
Talk by Prof. Buhmann on “Statistical Models on Image Segmentation and Clustering in Computer Vision” in EB3105 from 4:00pm to 5:00pm
Homework 5 is out: http://www.cse.msu.edu/~cse847/assignments.html
Bayesian Learning
Rong Jin
Outline
MAP learning vs. ML learning
Minimum description length principle
Bayes optimal classifier
Bagging
Maximum Likelihood Learning (ML)
Find the model that best explains the observations by maximizing the log-likelihood of the training data
Logistic regression:
p(y | x; θ) = 1 / (1 + exp(-y (x·w + c))),    θ = {w1, w2, ..., wm, c}
Parameters are found by maximizing the likelihood of the training data:
w*, c* = argmax_{w,c} l(D_train) = argmax_{w,c} Σ_{i=1}^n log 1 / (1 + exp(-y_i (x_i·w + c)))
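As a sketch of how these parameters could be found in practice (not part of the slides; the function name and NumPy usage are my own, and plain gradient ascent is just one of many ways to maximize this objective):

```python
import numpy as np

def fit_logistic_ml(X, y, lr=0.1, n_iters=500):
    """Maximize l(D_train) = sum_i log 1/(1 + exp(-y_i (x_i.w + c)))
    by plain gradient ascent; labels y_i are assumed to be in {-1, +1}."""
    n, m = X.shape
    w, c = np.zeros(m), 0.0
    for _ in range(n_iters):
        z = y * (X @ w + c)
        g = y / (1.0 + np.exp(z))   # d/dz log sigma(z) = sigma(-z)
        w += lr * (X.T @ g) / n     # ascend the average log-likelihood
        c += lr * g.mean()
    return w, c
```

On linearly separable data this drives the training log-likelihood toward zero; note that without a prior the weights can grow without bound, which motivates the MAP view below.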
Maximum A Posteriori Learning (MAP)
In ML learning, models are determined solely by the training examples
Very often, we have prior knowledge/preference about parameters/models
ML learning doesn’t incorporate this prior knowledge/preference on parameters/models
Maximum a posteriori learning (MAP): knowledge/preference about parameters/models is incorporated through a prior Pr(θ) on the parameters:
θ* = argmax_θ Pr(D | θ) Pr(θ)
Example: Logistic Regression
ML learning:
w*, c* = argmax_{w,c} l(D_train) = argmax_{w,c} Σ_{i=1}^n log 1 / (1 + exp(-y_i (x_i·w + c)))
Prior knowledge/preference:
No feature should dominate over all other features
Prefer small weights
Gaussian prior for parameters/models:
Pr(w) ∝ exp(-(1/(2σ²)) Σ_{i=1}^m w_i²)
Example (cont’d)
MAP learning for logistic regression:
w*, c* = argmax_{w,c} Pr(D | w, c) Pr(w, c)
       = argmax_{w,c} log Pr(D | w, c) + log Pr(w, c)
       = argmax_{w,c} Σ_{i=1}^n log 1 / (1 + exp(-y_i (x_i·w + c))) - (1/(2σ²)) Σ_{i=1}^m w_i²
Compare to regularized logistic regression:
l_reg(D_train) = Σ_{i=1}^N log 1 / (1 + exp(-y_i (c + x_i·w))) - (1/s) Σ_{i=1}^m w_i²
MAP learning with a Gaussian prior is exactly L2-regularized logistic regression (with s = 2σ²)
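To make the MAP/regularization correspondence concrete, here is a sketch (my own names and NumPy usage, not from the slides): the Gaussian log-prior simply adds an L2 penalty term to the gradient of the ML objective above.

```python
import numpy as np

def fit_logistic_map(X, y, sigma2=1.0, lr=0.1, n_iters=500):
    """MAP estimate under a Gaussian prior Pr(w) ~ exp(-||w||^2 / (2*sigma2)):
    the log-prior contributes an L2 penalty, so this is regularized
    logistic regression with regularization strength 1/sigma2."""
    n, m = X.shape
    w, c = np.zeros(m), 0.0
    for _ in range(n_iters):
        z = y * (X @ w + c)
        g = y / (1.0 + np.exp(z))                     # log-likelihood gradient part
        w += lr * ((X.T @ g) / n - w / (n * sigma2))  # plus log-prior gradient -w/sigma2
        c += lr * g.mean()
    return w, c
```

A tighter prior (smaller sigma2) pulls the learned weights toward zero, matching the "prefer small weights" preference on the previous slide.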
Minimum Description Length Principle
Occam’s razor: prefer the simplest hypothesis
Simplest hypothesis = hypothesis with the shortest description length
Description length = (# of bits to encode hypothesis h) + (# of bits to encode data D given h)
Minimum description length = complexity of the model + # of mistakes
Prefer the shortest:
h_MDL = argmin_{h∈H} L_C1(h) + L_C2(D | h)
where L_C(x) is the description length for message x under coding scheme C
Minimum Description Length Principle
h_MDL = argmin_{h∈H} L_C1(h) + L_C2(D | h)
Imagine a sender who must transmit the data D to a receiver:
Send only D?
Send only h?
Send h + D/h (the hypothesis plus its exceptions)?
Example: Decision Tree
H = decision trees, D = training data labels
L_C1(h) is the # of bits to describe tree h
L_C2(D | h) is the # of bits to describe D given tree h
Note L_C2(D | h) = 0 if the examples are classified perfectly by h;
we only need to describe the exceptions
h_MDL trades off tree size for training errors
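The tree-size/error tradeoff can be sketched numerically (my own toy encoding, not from the slides: I charge log2(C(n, k)) bits to name which k of the n examples are exceptions):

```python
import math

def mdl_score(tree_size_bits, n_errors, n_examples):
    """L_C1(h) + L_C2(D|h): bits for the tree plus bits to identify
    which of the n_examples are misclassified (the exceptions)."""
    if n_errors == 0:
        return tree_size_bits  # L_C2(D|h) = 0 when h classifies D perfectly
    # naive exception code: log2(n choose k) bits to name the k errors
    return tree_size_bits + math.log2(math.comb(n_examples, n_errors))

# A large tree that fits the data perfectly vs. a small tree with 3 errors:
big_perfect = mdl_score(tree_size_bits=200, n_errors=0, n_examples=100)
small_noisy = mdl_score(tree_size_bits=20, n_errors=3, n_examples=100)
```

Under this encoding the small tree plus its short list of exceptions wins, which is exactly the preference h_MDL expresses.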
MAP vs. MDL
MAP learning:
h_MAP = argmax_{h∈H} Pr(D | h) Pr(h) = argmax_{h∈H} log₂ Pr(D | h) + log₂ Pr(h)
      = argmin_{h∈H} -log₂ Pr(h) - log₂ Pr(D | h)
Fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses -log₂ p bits
Interpretation of h_MAP using the MDL principle:
-log₂ Pr(h) is the description length of h under optimal coding
-log₂ Pr(D | h) is the description length of the exceptions under optimal coding
So, under optimal coding schemes, MAP learning coincides with
h_MDL = argmin_{h∈H} L_C1(h) + L_C2(D | h)
Problems with Maximum Approaches
Consider three possible hypotheses:
Pr(h1 | D) = 0.4, Pr(h2 | D) = 0.3, Pr(h3 | D) = 0.3
Maximum approaches will pick h1
Given a new instance x:
h1(x) = +, h2(x) = -, h3(x) = -
Maximum approaches will output +
However, is this the most probable result?
Bayes Optimal Classifier (Bayesian Average)
Bayes optimal classification:
c*(x) = argmax_c Σ_{h∈H} Pr(h | D) Pr(c | h, x)
Example:
Pr(h1 | D) = 0.4, Pr(+ | h1, x) = 1, Pr(- | h1, x) = 0
Pr(h2 | D) = 0.3, Pr(+ | h2, x) = 0, Pr(- | h2, x) = 1
Pr(h3 | D) = 0.3, Pr(+ | h3, x) = 0, Pr(- | h3, x) = 1
Σ_h Pr(h | D) Pr(+ | h, x) = 0.4, Σ_h Pr(h | D) Pr(- | h, x) = 0.6
The most probable class is -
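The slide’s three-hypothesis example can be checked directly (a minimal sketch; the function and dictionary names are my own):

```python
def bayes_optimal(posteriors, likelihoods, classes):
    """Bayes optimal classification:
    c*(x) = argmax_c sum_{h in H} Pr(h|D) * Pr(c|h, x)."""
    return max(classes,
               key=lambda c: sum(p * likelihoods[h][c]
                                 for h, p in posteriors.items()))

# The three-hypothesis example from the slide:
posteriors = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}
likelihoods = {'h1': {'+': 1.0, '-': 0.0},
               'h2': {'+': 0.0, '-': 1.0},
               'h3': {'+': 0.0, '-': 1.0}}
```

The weighted vote gives 0.4 for + and 0.6 for -, so the Bayes optimal answer is -, even though the single most probable hypothesis h1 says +.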
When Do We Need Bayesian Average?
Bayes optimal classification:
c*(x) = argmax_c Σ_{h∈H} Pr(h | D) Pr(c | h, x)
When do we need Bayesian averaging?
When the posterior has multiple modes
When the optimal mode is flat
When NOT to use Bayesian averaging?
When we can’t estimate Pr(h | D) accurately
Computational Issues with Bayes Optimal Classifier
Bayes optimal classification:
c*(x) = argmax_c Σ_{h∈H} Pr(h | D) Pr(c | h, x)
Computational issues:
Need to sum over all possible models/hypotheses h
This is expensive or impossible when the model/hypothesis space is large
Example: decision trees
Solution: sampling!
Gibbs Classifier
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h | D)
2. Use this hypothesis to classify the new instance
Surprising fact:
E[err_Gibbs] ≤ 2 E[err_BayesOptimal]
Improve by sampling multiple hypotheses from P(h | D) and averaging their classification results:
Markov chain Monte Carlo (MCMC) sampling
Importance sampling
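The two steps above, and the multi-sample improvement, can be sketched as follows (my own function names; this assumes P(h | D) is already available as a dictionary, which the next slide explains is often the hard part):

```python
import random

def gibbs_classify(posteriors, likelihoods, classes, rng):
    """Gibbs algorithm: draw a single h ~ P(h|D), classify with it alone."""
    hs = list(posteriors)
    h = rng.choices(hs, weights=[posteriors[g] for g in hs])[0]
    return max(classes, key=lambda c: likelihoods[h][c])

def averaged_gibbs(posteriors, likelihoods, classes, k, rng):
    """Average k Gibbs draws: a Monte-Carlo estimate of the Bayes optimal vote."""
    hs = list(posteriors)
    draws = rng.choices(hs, weights=[posteriors[g] for g in hs], k=k)
    return max(classes,
               key=lambda c: sum(likelihoods[h][c] for h in draws))
```

A single Gibbs draw answers + with probability 0.4 on the earlier example, while averaging many draws converges to the Bayes optimal answer -.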
Bagging Classifiers
In general, sampling from P(h|D) is difficult because:
1. P(h|D) is rather difficult to compute (example: how would you compute P(h|D) for a decision tree?)
2. P(h|D) is impossible to compute for non-probabilistic classifiers such as SVM
3. P(h|D) is extremely small when the hypothesis space is large
Bagging classifiers: realize sampling from P(h|D) through a sampling of the training examples
Bootstrap Sampling
Bagging = Bootstrap aggregating
Bootstrap sampling: given a set D containing m training examples,
create Di by drawing m examples at random with replacement from D
Each Di is expected to leave out about 0.37 (≈ 1/e) of the examples in D
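The ≈ 0.37 left-out fraction comes from (1 - 1/m)^m → 1/e, and is easy to check empirically (a small sketch; the function name is my own):

```python
import random

def left_out_fraction(m, n_trials=200, seed=0):
    """Empirical fraction of D's m examples that never appear in one
    bootstrap sample of size m; theory gives (1 - 1/m)^m -> 1/e ~ 0.368."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        seen = {rng.randrange(m) for _ in range(m)}  # m draws with replacement
        total += (m - len(seen)) / m
    return total / n_trials
```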
Bagging Algorithm
Create k bootstrap samples D1, D2, …, Dk
Train a distinct classifier hi on each Di
Classify a new instance by a classifier vote with equal weights:
c*(x) = argmax_c Σ_{i=1}^k Pr(c | hi, x)
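The three steps above can be sketched as one generic function (my own names; `train` stands for any learner that maps a list of (example, label) pairs to a classifier):

```python
import random
from collections import Counter

def bag_and_vote(D, train, x, k=25, seed=0):
    """Bagging: create k bootstrap samples of D, train one classifier
    per sample with `train`, and classify x by an equal-weight vote."""
    rng = random.Random(seed)
    m = len(D)
    votes = Counter()
    for _ in range(k):
        Di = [D[rng.randrange(m)] for _ in range(m)]  # bootstrap sample D_i
        h = train(Di)                                 # distinct classifier h_i
        votes[h(x)] += 1                              # equal-weight vote
    return votes.most_common(1)[0][0]
```

The hard vote here is the 0/1 special case of the Σ_i Pr(c | hi, x) average; a learner that returns class probabilities could be summed instead.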
Bagging ≈ Bayesian Average
Bayesian average: sample hypotheses h1, h2, …, hk from the posterior P(h|D), then average their predictions Σ_i Pr(c | hi, x)
Bagging: draw bootstrap samples D1, D2, …, Dk from D, train hi on each Di, then average their predictions Σ_i Pr(c | hi, x)
Bootstrap sampling is almost equivalent to sampling from the posterior P(h|D)
Empirical Study of Bagging
Bagging decision trees:
Bootstrap 50 different samples from the original training data
Learn a decision tree over each bootstrap sample
Predict the class labels of test instances by the majority vote of the 50 decision trees
The bagged decision trees perform better than a single decision tree
Bias-Variance Tradeoff
Why does bagging work better than a single classifier?
Bias-variance tradeoff
Real-valued case:
The output y for x follows y ~ f(x) + ε, ε ~ N(0, σ²)
θ(x | D) is a predictor learned from the training data D
Bias-variance decomposition:
E_{D,y}[(y - θ(x|D))²] = E_y[(y - f(x))²] + E_D[(f(x) - θ(x|D))²]
                       = σ² + (f(x) - E_D[θ(x|D)])² + E_D[(θ(x|D) - E_D[θ(x|D)])²]
Irreducible variance: σ²
Model bias: (f(x) - E_D[θ(x|D)])²
Model variance: E_D[(θ(x|D) - E_D[θ(x|D)])²]
The simpler the θ(x|D), the larger the bias and the smaller the variance
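The decomposition can be verified by simulation (my own toy setup, not from the slides: I take θ(x|D) to be the mean of a few noisy samples of f(x), so the bias is essentially zero and all three terms are easy to estimate):

```python
import numpy as np

def bias_variance_check(f_x=2.0, sigma=1.0, n_train=5, n_trials=20000, seed=0):
    """Monte-Carlo check at one point x of
    E_{D,y}[(y - theta(x|D))^2] = sigma^2 + bias^2 + variance,
    with theta(x|D) = mean of n_train noisy samples of f(x)."""
    rng = np.random.default_rng(seed)
    D = rng.normal(f_x, sigma, (n_trials, n_train))  # many training sets D
    theta = D.mean(axis=1)                           # one predictor per D
    y = rng.normal(f_x, sigma, n_trials)             # fresh test labels
    lhs = np.mean((y - theta) ** 2)
    rhs = sigma**2 + (f_x - theta.mean())**2 + theta.var()
    return lhs, rhs
```

Both sides come out close to σ² + σ²/n_train, as the decomposition predicts for this estimator.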
Bias-Variance Tradeoff
[Figure: a true model fit with complicated models — small model bias, large model variance]
Bias-Variance Tradeoff
[Figure: a true model fit with simple models — large model bias, small model variance]
Bagging
Bagging performs better than a single classifier because it effectively reduces the model variance
[Figure: bias and variance of a single decision tree vs. the bagged decision trees]
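The variance-reduction claim can also be seen in a toy simulation (my own setup: each base predictor is the mean of one bootstrap resample of a small training set, and bagging averages many of them):

```python
import numpy as np

def bagged_variance(bag_size, f_x=2.0, sigma=1.0, n_train=20,
                    n_trials=5000, seed=0):
    """Variance (over training sets D) of a bagged prediction at one point:
    each base predictor is the mean of a bootstrap resample of D, and the
    bagged prediction averages bag_size of them."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_trials)
    for t in range(n_trials):
        D = rng.normal(f_x, sigma, n_train)               # one training set
        idx = rng.integers(0, n_train, (bag_size, n_train))
        preds[t] = D[idx].mean()                          # average of bag_size bootstrap means
    return preds.var()
```

Averaging many bootstrap predictors removes the extra resampling noise of a single predictor, so the bagged variance is noticeably lower; the bias (here zero) is unchanged.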