Model Assessment and Selection


Model Assessment and Selection
Chapter 7
Neil Weisenfeld, et al.
Introduction
Generalization performance of a learning method:
– Measure of prediction capability on independent test data
– Guide model selection
– Give a quantitative assessment of the chosen model
The chapter shows the key methods and how to apply them to model selection.
Bias, Variance, and Model Complexity
• Bias-Variance trade-off again
• Generalization: test-sample vs. training-sample performance
– Training performance usually increases monotonically with model complexity
Measuring Performance
Target variable $Y$, vector of inputs $X$, prediction model $\hat f(X)$.
Typical choices of loss function:
$$L(Y, \hat f(X)) = \begin{cases} (Y - \hat f(X))^2 & \text{squared error} \\ |Y - \hat f(X)| & \text{absolute error} \end{cases}$$
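As a concrete illustration, here is a minimal sketch of the two losses in Python (the function names are my own, not from the chapter):

```python
import numpy as np

def squared_error(y, y_hat):
    """L(Y, f_hat(X)) = (Y - f_hat(X))^2"""
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    """L(Y, f_hat(X)) = |Y - f_hat(X)|"""
    return np.abs(y - y_hat)
```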
Generalization Error
Test error, aka generalization error:
$$\mathrm{Err} = E\left[L(Y, \hat f(X))\right]$$
Note: this expectation averages over everything that is random, including the randomness in the training sample that produced $\hat f$.
Training error:
$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\left(y_i, \hat f(x_i)\right)$$
– average loss over the training sample
– not a good estimate of the test error (next slide)
Training Error
Training error and overfitting:
– not a good estimate of test error
– consistently decreases with model complexity
– drops to zero with high enough complexity
Categorical Data
The same ideas apply to categorical responses:
$$p_k(X) = \Pr(G = k \mid X), \qquad \hat G(X) = \arg\max_k \hat p_k(X)$$
Typical choices of loss function:
$$L\left(G, \hat G(X)\right) = I\left(G \ne \hat G(X)\right) \quad \text{(0-1 loss)}$$
$$L\left(G, \hat p(X)\right) = -2\sum_{k=1}^{K} I(G = k)\log \hat p_k(X) = -2\log \hat p_G(X) \quad \text{(log-likelihood)}$$
Test error again: $\mathrm{Err} = E\left[L(G, \hat p(X))\right]$
Training error again: $\overline{\mathrm{err}} = -\frac{2}{N}\sum_{i=1}^{N}\log \hat p_{g_i}(x_i)$
Log-likelihood = cross-entropy loss = deviance
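A minimal sketch of these two classification losses (the helper names and the array layout, with $\hat p$ as an $n \times K$ matrix of class probabilities and integer class labels, are my own assumptions):

```python
import numpy as np

def zero_one_loss(g, g_hat):
    """L(G, G_hat(X)) = I(G != G_hat(X)), elementwise over samples."""
    return (np.asarray(g) != np.asarray(g_hat)).astype(float)

def deviance_loss(g, p_hat):
    """L(G, p_hat(X)) = -2 * log p_hat_G(X); g holds integer class indices."""
    g = np.asarray(g)
    return -2.0 * np.log(p_hat[np.arange(len(g)), g])
```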
Loss Function for General Densities
For densities parameterized by $\theta(X)$, the log-likelihood can be used as a loss function:
$$L\left(Y, \theta(X)\right) = -2\log \Pr_{\theta(X)}(Y)$$
$\Pr_{\theta(X)}(Y)$ ... density of $Y$ with predictor $X$
Two separate goals
Model selection:
– Estimating the performance of different models in order to choose the (approximate) best one
Model assessment:
– Having chosen a final model, estimating its prediction error (generalization error) on new data
Ideal situation: split the data into three parts for training, validation (estimate prediction error and select the model), and testing (assess the chosen model).
Typical split: 50% / 25% / 25%
Remainder of the chapter: the data-poor situation
=> approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample reuse (cross-validation, bootstrap)
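A sketch of the 50% / 25% / 25% split on synthetic data (the proportions come from the slide; everything else here, including the random data, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X, y = rng.normal(size=(N, 5)), rng.normal(size=N)

idx = rng.permutation(N)
n_train, n_val = int(0.50 * N), int(0.25 * N)
train, val, test = np.split(idx, [n_train, n_train + n_val])

X_train, y_train = X[train], y[train]   # fit the models
X_val,   y_val   = X[val],   y[val]     # estimate prediction error, select model
X_test,  y_test  = X[test],  y[test]    # assess the chosen model
```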
Bias-Variance Decomposition
$$Y = f(X) + \varepsilon, \qquad E(\varepsilon) = 0, \qquad \mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$$
Then for an input point $X = x_0$, using squared-error loss and a regression fit:
$$\begin{aligned}
\mathrm{Err}(x_0) &= E\left[\left(Y - \hat f(x_0)\right)^2 \mid X = x_0\right] \\
&= \sigma_\varepsilon^2 + \left[E\hat f(x_0) - f(x_0)\right]^2 + E\left[\hat f(x_0) - E\hat f(x_0)\right]^2 \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2\left(\hat f(x_0)\right) + \mathrm{Var}\left(\hat f(x_0)\right)
\end{aligned}$$
– Irreducible error $\sigma_\varepsilon^2$: variance of the target around its true mean
– Bias$^2$: amount by which the average estimate differs from the true mean
– Variance: expected squared deviation of $\hat f(x_0)$ around its mean
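A Monte Carlo sketch of this decomposition at a single point $x_0$ (the true function, noise level, and the k-nearest-neighbour estimator are my own toy choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                          # true regression function (assumed)
sigma, n, x0, k, reps = 0.3, 50, 1.0, 5, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0.0, 2.0, n)
    y = f(x) + sigma * rng.normal(size=n)
    nearest = np.argsort(np.abs(x - x0))[:k]   # k nearest training points
    preds[r] = y[nearest].mean()               # kNN fit at x0

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
err_x0 = sigma**2 + bias2 + var    # estimate of Err(x0) = sigma^2 + Bias^2 + Var
```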
Bias-Variance Decomposition
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \mathrm{Bias}^2\left(\hat f(x_0)\right) + \mathrm{Var}\left(\hat f(x_0)\right)$$
kNN:
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \left[f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f(x_{(l)})\right]^2 + \frac{\sigma_\varepsilon^2}{k}$$
Linear model fit $\hat f_p(x) = x^T\hat\beta$:
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \left[f(x_0) - E\hat f_p(x_0)\right]^2 + \|h(x_0)\|^2\sigma_\varepsilon^2$$
where $h(x_0) = X\left(X^T X\right)^{-1}x_0$
Bias-Variance Decomposition
Linear model fit $\hat f_p(x) = x^T\hat\beta$:
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \left[f(x_0) - E\hat f_p(x_0)\right]^2 + \|h(x_0)\|^2\sigma_\varepsilon^2$$
where $h(x_0) = X\left(X^T X\right)^{-1}x_0$ is the $N$-dimensional weight vector giving $\hat f_p(x_0) = h(x_0)^T y$.
Averaging over the sample values $x_i$:
$$\frac{1}{N}\sum_{i=1}^{N}\mathrm{Err}(x_i) = \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^{N}\left[f(x_i) - E\hat f(x_i)\right]^2 + \frac{p}{N}\sigma_\varepsilon^2 \quad \text{... in-sample error}$$
Model complexity is directly related to the number of parameters $p$.
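A quick numerical check of the $\frac{p}{N}\sigma_\varepsilon^2$ variance term: the weight vectors $h(x_i)$ stack into the hat matrix, whose trace is $p$, so the average of $\|h(x_i)\|^2$ over the training points is $p/N$ (synthetic design matrix, my own notation):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 5
X = rng.normal(size=(N, p))

H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix; row i is h(x_i)^T
avg_sq_norm = np.mean(np.sum(H**2, axis=1))
print(avg_sq_norm, np.trace(H) / N)        # both equal p / N = 0.05
```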
Bias-Variance Decomposition
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \mathrm{Bias}^2\left(\hat f(x_0)\right) + \mathrm{Var}\left(\hat f(x_0)\right)$$
For ridge regression and other linear models, the variance has the same form as before, but with different weights.
Parameters of the best-fitting linear approximation:
$$\beta_* = \arg\min_\beta E\left[f(X) - X^T\beta\right]^2$$
Further decompose the bias:
$$E_{x_0}\left[f(x_0) - E\hat f(x_0)\right]^2 = E_{x_0}\left[f(x_0) - x_0^T\beta_*\right]^2 + E_{x_0}\left[x_0^T\beta_* - E x_0^T\hat\beta\right]^2$$
$$= \mathrm{Ave}[\text{Model Bias}]^2 + \mathrm{Ave}[\text{Estimation Bias}]^2$$
Least squares fits the best linear model -> no estimation bias
Restricted fits -> positive estimation bias in favor of reduced variance
Bias-Variance Decomposition Example
50 observations, 20 predictors, uniformly distributed in $[0,1]^{20}$.
Left panels: $Y$ is 0 if $X_1 \le \frac{1}{2}$ and 1 if $X_1 > \frac{1}{2}$, and we apply kNN.
Right panels: $Y$ is 1 if $\sum_{j=1}^{10} X_j > 5$ and 0 otherwise, and we use best subset linear regression of size $p$.
Closeup next slide
Example, continued
Regression with squared-error loss: squared bias + variance = prediction error.
Classification with 0-1 loss: squared bias + variance ≠ prediction error. Estimation errors on the right side of the boundary don't hurt!
Bias and variance interact differently for 0-1 loss than for squared-error loss.
Optimism of the Training Error Rate
Typically: training error rate < true error (the same data is being used to fit the method and to assess its error).
$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\left(y_i, \hat f(x_i)\right) \quad \text{is overly optimistic:} \quad \overline{\mathrm{err}} < \mathrm{Err} = E\left[L(Y, \hat f(X))\right]$$
Optimism of the Training Error Rate
Err is a kind of extra-sample error: the test features need not coincide with the training feature vectors.
Focus instead on the in-sample error:
$$\mathrm{Err}_{in} = \frac{1}{N}\sum_{i=1}^{N} E_Y E_{Y^{new}}\left[L\left(Y_i^{new}, \hat f(x_i)\right)\right]$$
$Y^{new}$ ... we observe $N$ new response values at each of the training points $x_i$, $i = 1, 2, \ldots, N$
Optimism: $\mathrm{op} \equiv \mathrm{Err}_{in} - E_y(\overline{\mathrm{err}})$
For squared error, 0-1, and other loss functions:
$$\mathrm{op} = \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i)$$
The amount by which $\overline{\mathrm{err}}$ underestimates the true error depends on how strongly $y_i$ affects its own prediction.
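A simulation sketch of the covariance formula for an ordinary least squares fit on a fixed design, where the optimism should come out near $2d\sigma_\varepsilon^2/N$ (the design, noise level, and repetition count are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma, reps = 50, 5, 1.0, 4000
X = rng.normal(size=(N, d))
f_true = X @ rng.normal(size=d)             # fixed true mean at the x_i

Y = np.empty((reps, N))
Yhat = np.empty((reps, N))
for r in range(reps):
    y = f_true + sigma * rng.normal(size=N)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    Y[r], Yhat[r] = y, X @ beta

cov_i = np.mean((Y - Y.mean(0)) * (Yhat - Yhat.mean(0)), axis=0)
op = 2.0 / N * cov_i.sum()
print(op, 2 * d * sigma**2 / N)             # the two should be close
```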
Optimism of the Training Error Rate
Summary:
$$\mathrm{Err}_{in} = E_y(\overline{\mathrm{err}}) + \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i)$$
The harder we fit the data, the greater $\mathrm{Cov}(\hat y_i, y_i)$ will be, thereby increasing the optimism.
For a linear fit with $d$ independent inputs/basis functions:
$$\mathrm{Err}_{in} = E_y(\overline{\mathrm{err}}) + 2\frac{d}{N}\sigma_\varepsilon^2$$
– optimism increases linearly with the number $d$ of basis functions
– optimism decreases as the training sample size increases
Optimism of the Training Error Rate
Ways to estimate prediction error:
– Estimate the optimism and then add it to the training error rate $\overline{\mathrm{err}}$
• AIC, BIC, and others work this way, for a special class of estimates that are linear in their parameters
– Direct estimates of the extra-sample error Err
• Cross-validation, bootstrap
• Can be used with any loss function, and with nonlinear, adaptive fitting techniques
Estimates of In-Sample Prediction Error
General form of the in-sample estimate:
$$\widehat{\mathrm{Err}}_{in} = \overline{\mathrm{err}} + \widehat{\mathrm{op}}$$
with an estimate of the optimism.
For a linear fit with $\mathrm{Err}_{in} = E_y(\overline{\mathrm{err}}) + 2\frac{d}{N}\sigma_\varepsilon^2$:
$$C_p = \overline{\mathrm{err}} + \frac{2d}{N}\hat\sigma_\varepsilon^2, \quad \text{the so-called } C_p \text{ statistic}$$
$\hat\sigma_\varepsilon^2$ ... estimate of the noise variance, from the mean-squared error of a low-bias model
$d$ ... # of basis functions
$N$ ... training sample size
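A minimal sketch of the $C_p$ computation (the helper name is my own; `sigma2_hat` would normally come from the mean-squared error of a low-bias model, as noted above):

```python
import numpy as np

def cp_statistic(y, y_hat, d, sigma2_hat):
    """Cp = training error + (2d/N) * estimated noise variance."""
    N = len(y)
    err_bar = np.mean((y - y_hat) ** 2)     # training error (squared-error loss)
    return err_bar + 2.0 * d / N * sigma2_hat
```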
Estimates of In-Sample Prediction Error
Similarly: Akaike Information Criterion (AIC)
– A more generally applicable estimate of $\mathrm{Err}_{in}$ when a log-likelihood loss function is used
For $N \to \infty$:
$$-2 E\left[\log \Pr_{\hat\theta}(Y)\right] \approx -\frac{2}{N} E[\mathrm{loglik}] + 2\frac{d}{N}$$
$\Pr_\theta(Y)$ ... family of densities for $Y$ (containing the true density)
$\hat\theta$ ... ML estimate of $\theta$
$$\mathrm{loglik} = \sum_{i=1}^{N}\log \Pr_{\hat\theta}(y_i) \quad \text{(log-likelihood maximized at the ML estimate of } \theta\text{)}$$
AIC
For $N \to \infty$:
$$-2 E\left[\log \Pr_{\hat\theta}(Y)\right] \approx -\frac{2}{N} E[\mathrm{loglik}] + 2\frac{d}{N}$$
For example, for the logistic regression model, using the binomial log-likelihood:
$$\mathrm{AIC} = -\frac{2}{N}\mathrm{loglik} + 2\frac{d}{N}$$
To use AIC for model selection: choose the model giving the smallest AIC over the set of models considered.
$$\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\frac{d(\alpha)}{N}\hat\sigma_\varepsilon^2$$
$\hat f_\alpha(x)$ ... set of models indexed by a tuning parameter $\alpha$
$\overline{\mathrm{err}}(\alpha)$ ... training error, $d(\alpha)$ ... # of parameters
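A sketch of the binomial-log-likelihood form of AIC for a fitted binary classifier (the helper names are my own; `p_hat` is the vector of fitted probabilities and `d` the number of parameters):

```python
import numpy as np

def binomial_loglik(y, p_hat, eps=1e-12):
    """Binomial log-likelihood for 0/1 labels y and fitted probabilities p_hat."""
    p_hat = np.clip(p_hat, eps, 1.0 - eps)
    return np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def aic(y, p_hat, d):
    """AIC = -(2/N) * loglik + 2 * d / N."""
    N = len(y)
    return -2.0 / N * binomial_loglik(y, p_hat) + 2.0 * d / N
```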
AIC
• The function $\mathrm{AIC}(\alpha)$ provides an estimate of the test error curve.
• If the basis functions are chosen adaptively with $d < p$ inputs, then
$$\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i) = d\sigma_\varepsilon^2$$
no longer holds => the optimism exceeds $(2d/N)\sigma_\varepsilon^2$, and the effective number of parameters fit is greater than $d$.
Using AIC to select the # of basis functions
Input vector: log-periodogram of a vowel, quantized to 256 uniformly spaced frequencies.
Linear logistic regression model with coefficient function:
$$\beta(f) = \sum_{m=1}^{M} h_m(f)\theta_m$$
– an expansion in $M$ spline basis functions
– For any $M$, a basis of natural cubic splines is used for the $h_m$, with knots chosen uniformly over the range of frequencies, i.e. $d(\alpha) = d(M) = M$
AIC approximately minimizes $\mathrm{Err}(M)$ for both entropy and 0-1 loss:
$$\frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i) = \frac{2d}{N}\sigma_\varepsilon^2 \quad \text{... simple formula for the linear case}$$
Using AIC to select the # of basis functions
$$\frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i) \approx \frac{2d}{N}\sigma_\varepsilon^2$$
The approximation does not hold, in general, for the 0-1 case, but it does o.k. (it is exact only for linear models with additive errors and squared-error loss).
Effective Number of Parameters
$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} \quad \text{vector of outcomes; similarly } \hat{\mathbf{y}} \text{ for predictions}$$
Linear fit (e.g. linear regression, quadratic shrinkage such as ridge regression, splines):
$$\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}$$
$\mathbf{S}$ ... $N \times N$ matrix, depends on the input vectors $x_i$ but not on the $y_i$
Effective number of parameters: $d(\mathbf{S}) = \mathrm{trace}(\mathbf{S})$
$d(\mathbf{S})$ is the correct $d$ for $C_p = \overline{\mathrm{err}} + \frac{2d}{N}\hat\sigma_\varepsilon^2$ (c.f. $\mathrm{Cov}(\hat y, y)$)
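A sketch of $d(\mathbf{S}) = \mathrm{trace}(\mathbf{S})$ for ridge regression, where $\mathbf{S} = X(X^TX + \lambda I)^{-1}X^T$ (the function name is my own):

```python
import numpy as np

def ridge_effective_df(X, lam):
    """Effective number of parameters trace(S) for ridge regression."""
    p = X.shape[1]
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(S)    # equals p when lam = 0, shrinks toward 0 as lam grows
```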
Bayesian Approach and BIC
Like AIC, it is used when fitting is carried out by maximizing a log-likelihood.
Bayesian Information Criterion (BIC):
$$\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,d$$
Assuming a Gaussian model with $\sigma_\varepsilon^2$ known:
$$-2\,\mathrm{loglik} = \sum_i \left(y_i - \hat f(x_i)\right)^2/\sigma_\varepsilon^2 = N\cdot\overline{\mathrm{err}}/\sigma_\varepsilon^2$$
$$\text{then } \mathrm{BIC} = \frac{N}{\sigma_\varepsilon^2}\left[\overline{\mathrm{err}} + (\log N)\cdot\frac{d}{N}\sigma_\varepsilon^2\right]$$
BIC is proportional to AIC except with $\log N$ rather than a factor of 2. For $N > e^2$ (approx. 7.4), BIC penalizes complex models more heavily.
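A minimal sketch of both forms of BIC above (the helper names are my own; in the Gaussian form, `sigma2` is the assumed-known noise variance):

```python
import numpy as np

def bic_from_loglik(loglik, d, N):
    """General form: BIC = -2 * loglik + (log N) * d."""
    return -2.0 * loglik + np.log(N) * d

def bic_gaussian(y, y_hat, d, sigma2):
    """Gaussian form: (N / sigma^2) * [err_bar + (log N) * (d / N) * sigma^2]."""
    N = len(y)
    err_bar = np.mean((y - y_hat) ** 2)
    return N / sigma2 * (err_bar + np.log(N) * d / N * sigma2)
```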
BIC Motivation
Given a set of candidate models $M_m$, $m = 1, \ldots, M$, with model parameters $\theta_m$.
Posterior probability of a given model: $\Pr(M_m \mid Z) \propto \Pr(M_m)\cdot\Pr(Z \mid M_m)$,
where $Z$ represents the training data $\{x_i, y_i\}_1^N$.
To compare two models, form the posterior odds:
$$\frac{\Pr(M_m \mid Z)}{\Pr(M_l \mid Z)} = \frac{\Pr(M_m)}{\Pr(M_l)}\cdot\frac{\Pr(Z \mid M_m)}{\Pr(Z \mid M_l)}$$
If the odds are > 1, choose model $m$. The prior over models (left factor) is considered constant. The right factor, the contribution of the data $Z$ to the posterior odds, is called the Bayes factor $\mathrm{BF}(Z)$.
Need to approximate $\Pr(Z \mid M_m)$. Various chicanery and approximations (p. 207) get us BIC.
Can estimate the posterior from BIC and compare the relative merits of models.
BIC: How much better is a model?
But we may want to know how the various models stack up (not just their ranking) relative to one another. Once we have the BIC:
$$\text{estimate of } \Pr(M_m \mid Z) = \frac{e^{-\frac{1}{2}\mathrm{BIC}_m}}{\sum_{l=1}^{M} e^{-\frac{1}{2}\mathrm{BIC}_l}}$$
The denominator normalizes the result, and now we can assess the relative merits of each model.
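A sketch of this normalization for a list of candidate-model BIC values (the BIC numbers below are made up for illustration):

```python
import numpy as np

def bic_posterior(bic_values):
    """Relative posterior probabilities exp(-BIC_m/2) / sum_l exp(-BIC_l/2)."""
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))   # subtract the minimum for numerical stability
    return w / w.sum()

print(bic_posterior([100.0, 102.3, 110.0]))   # the first model carries most weight
```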
Minimum Description Length (MDL)
• Huffman coding (e.g. for compression)
– Variable-length code: use the shortest symbol length (in bits) for the most frequently occurring targets. So a code for English letters might look like this (an instantaneous prefix code):
• E = 1
• N = 01
• R = 001, etc.
– The first bit codes "is this an E or not," the second bit codes "is this an N or not," etc.
Entropy gives a lower bound on the message length (information content, i.e. the number of bits required to transmit):
$$E(\text{length}) \ge -\sum_i \Pr(z_i)\log_2\left(\Pr(z_i)\right)$$
So the bit length $l_i$ for $z_i$ should be about $-\log_2 \Pr(z_i)$.
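A quick numerical illustration of the entropy bound and the implied code lengths (the probabilities are made up):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])     # assumed symbol probabilities
code_lengths = -np.log2(p)                  # 1, 2, 3, 3 bits
expected_length = np.sum(p * code_lengths)  # 1.75 bits, equal to the entropy here
```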
MDL
Now, for our data: model $M$, parameters $\theta$, data $Z = (\mathbf{X}, \mathbf{y})$:
$$\mathrm{length} = -\log\Pr(\mathbf{y}\mid\theta, M, \mathbf{X}) - \log\Pr(\theta\mid M)$$
2nd term: average code length for transmitting the model parameters; 1st term: average code length for transmitting the discrepancy between the model and the actual target values (so choose a model that minimizes this).
Minimizing description length = maximizing posterior probability.
The book adds, almost as a non-sequitur: suppose $y \sim N(\theta, \sigma^2)$ and $\theta \sim N(0, 1)$; then
$$\mathrm{length} = \text{constant} + \log\sigma + \frac{(y - \theta)^2}{2\sigma^2}$$