Model Assessment and Selection
Chapter 7
Neil Weisenfeld, et al.
Introduction
Generalization performance of a learning method:
– measures its prediction capability on independent test data
– guides model selection
– gives a quantitative assessment of the chosen model
This chapter presents the key methods for assessing performance and shows how they are applied to model selection.
Bias, Variance, and Model Complexity
• Bias-variance trade-off again
• Generalization: test-sample vs. training-sample performance
– Training-sample performance usually increases monotonically with model complexity
Measuring Performance
Target variable $Y$, vector of inputs $X$, prediction model $\hat{f}(X)$.
Typical choices of loss function:
$$L(Y, \hat{f}(X)) = \begin{cases} (Y - \hat{f}(X))^2 & \text{squared error} \\ |Y - \hat{f}(X)| & \text{absolute error} \end{cases}$$
Generalization Error
Test error, a.k.a. generalization error:
$$\mathrm{Err} = E[L(Y, \hat{f}(X))]$$
Note: this expectation averages over everything that is random, including the randomness in the training sample that produced $\hat{f}$.
Training error:
$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$$
– average loss over the training sample
– not a good estimate of the test error (next slide)
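A minimal numerical sketch of this gap (my own construction, not from the chapter): polynomial fits of increasing degree on simulated data, comparing the training error $\overline{\mathrm{err}}$ with the error on a large held-out test set under squared-error loss.

```python
# Training vs. held-out test error under squared-error loss (assumed setup).
import numpy as np

rng = np.random.default_rng(0)
n = 50
x_train = rng.uniform(0, 1, n)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, n)
x_test = rng.uniform(0, 1, 1000)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, 1000)

for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)                  # fit f_hat
    train_err = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_err = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree}: training error {train_err:.3f}, test error {test_err:.3f}")
```

The training error keeps falling as the degree grows, while the held-out error eventually turns back up.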
Training Error
Training error and overfitting:
– not a good estimate of the test error
– consistently decreases with model complexity
– drops to zero with high enough complexity
Categorical Data
The same framework applies to categorical responses. With
$$p_k(X) = \Pr(G = k \mid X), \qquad \hat{G}(X) = \arg\max_k \hat{p}_k(X),$$
typical choices of loss function are:
$$L(G, \hat{G}(X)) = I(G \neq \hat{G}(X)) \qquad \text{(0-1 loss)}$$
$$L(G, \hat{p}(X)) = -2\sum_{k=1}^{K} I(G = k)\log \hat{p}_k(X) = -2\log \hat{p}_G(X) \qquad \text{(log-likelihood)}$$
Test error again:
$$\mathrm{Err} = E[L(G, \hat{p}(X))]$$
Training error again:
$$\overline{\mathrm{err}} = -\frac{2}{N}\sum_{i=1}^{N}\log \hat{p}_{g_i}(x_i)$$
The log-likelihood loss is also known as cross-entropy loss or deviance.
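As a small illustration (assumed data, not from the slides), the two losses computed from a hypothetical matrix of predicted class probabilities:

```python
# 0-1 loss on the hard assignment G_hat(x) = argmax_k p_hat_k(x), and the
# deviance (-2 * average log-likelihood, i.e. cross-entropy) on the probabilities.
import numpy as np

p_hat = np.array([[0.7, 0.2, 0.1],      # predicted class probabilities, one row per case
                  [0.1, 0.5, 0.4],
                  [0.3, 0.3, 0.4]])
g = np.array([0, 2, 2])                 # true class labels g_i

g_hat = p_hat.argmax(axis=1)
zero_one_err = np.mean(g_hat != g)                                # training error, 0-1 loss
deviance = -2.0 * np.mean(np.log(p_hat[np.arange(len(g)), g]))    # -2/N * sum log p_hat_{g_i}(x_i)

print(zero_one_err, deviance)
```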
Loss Function for General Densities
For densities parameterized by $\theta$, the log-likelihood can be used as a loss function:
$$L(Y, \theta(X)) = -2\log \Pr_{\theta(X)}(Y)$$
where $\Pr_{\theta(X)}(Y)$ is the density of $Y$ indexed by the predictor $X$.
Two separate goals
Model selection:
– estimating the performance of different models in order to choose the (approximately) best one
Model assessment:
– having chosen a final model, estimating its prediction error (generalization error) on new data
Ideal situation: split the data into three parts for training, validation (estimate prediction error and select the model), and testing (assess the chosen model); a small split sketch follows.
Typical split: 50% / 25% / 25%
Remainder of the chapter: the data-poor situation
=> approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample reuse (cross-validation, bootstrap)
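A minimal sketch of the 50% / 25% / 25% split (indices only; any dataset X, y could be indexed this way):

```python
# Random three-way split of n cases into train / validation / test index sets.
import numpy as np

def three_way_split(n, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_val = int(0.5 * n), int(0.25 * n)
    return (idx[:n_train],                      # training: fit each candidate model
            idx[n_train:n_train + n_val],       # validation: estimate prediction error, select model
            idx[n_train + n_val:])              # test: assess the single chosen model

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))   # 500 250 250
```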
Bias-Variance Decomposition
Assume $Y = f(X) + \varepsilon$, with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$.
Then for an input point $X = x_0$, using squared-error loss and a regression fit:
$$\mathrm{Err}(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0] = \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2 = \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))$$
– Irreducible error $\sigma_\varepsilon^2$: variance of the target around its true mean
– Bias$^2$: amount by which the average estimate differs from the true mean
– Variance: expected squared deviation of $\hat{f}(x_0)$ around its mean
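A Monte Carlo sketch of this decomposition (assumed setup: a sine-shaped true function, a straight-line fit, and repeated training samples), estimating the bias and variance of $\hat{f}(x_0)$ at a single point:

```python
# Estimate Err(x0) = sigma_eps^2 + Bias^2 + Variance by refitting over many samples.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)      # true regression function (assumed)
sigma_eps, n, x0 = 0.3, 30, 0.25

preds = []
for _ in range(2000):                    # repeated training samples
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma_eps, n)
    beta = np.polyfit(x, y, 1)           # simple linear fit: f_hat(x) = b0 + b1*x
    preds.append(np.polyval(beta, x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print("sigma^2 =", sigma_eps**2, " bias^2 =", bias2, " variance =", variance)
print("Err(x0) approx", sigma_eps**2 + bias2 + variance)
```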
Bias-Variance Decomposition
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))$$
kNN:
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k}$$
Linear model fit $\hat{f}_p(x) = \hat{\beta}^T x$:
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + [f(x_0) - E\hat{f}_p(x_0)]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2$$
where $h(x_0) = X(X^T X)^{-1} x_0$ is the $N$-dimensional vector of weights producing the fit $\hat{f}_p(x_0) = h(x_0)^T y$.
Bias-Variance Decomposition
Linear model fit $\hat{f}_p(x) = \hat{\beta}^T x$:
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + [f(x_0) - E\hat{f}_p(x_0)]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2$$
where $h(x_0) = X(X^T X)^{-1} x_0$ ... the $N$-dimensional weight vector.
Averaging over the sample values $x_i$:
$$\frac{1}{N}\sum_{i=1}^{N}\mathrm{Err}(x_i) = \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^{N}[f(x_i) - E\hat{f}(x_i)]^2 + \frac{p}{N}\sigma_\varepsilon^2 \qquad \text{... the in-sample error}$$
Model complexity is directly related to the number of parameters $p$.
Bias-Variance Decomposition
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))$$
For ridge regression and other linear models, the variance has the same form as before, but with different weights.
Parameters of the best-fitting linear approximation:
$$\beta_* = \arg\min_\beta E\big[f(X) - X^T\beta\big]^2$$
Further decompose the bias:
$$E_{x_0}[f(x_0) - E\hat{f}(x_0)]^2 = E_{x_0}[f(x_0) - x_0^T\beta_*]^2 + E_{x_0}[x_0^T\beta_* - E x_0^T\hat{\beta}]^2 = \mathrm{Ave[Model\ Bias]}^2 + \mathrm{Ave[Estimation\ Bias]}^2$$
Least squares fits the best linear model -> no estimation bias.
Restricted fits -> positive estimation bias, in exchange for reduced variance.
Bias-Variance Decomposition Example
50 observations, 20 predictors, uniformly distributed in $[0,1]^{20}$.
Left panels:
$Y$ is 0 if $X_1 \le \tfrac{1}{2}$ and 1 if $X_1 > \tfrac{1}{2}$, and we apply kNN.
Right panels:
$Y$ is 1 if $\sum_{j=1}^{10} X_j > 5$ and 0 otherwise, and we use best subset linear regression of size $p$.
Closeup next slide.
Example, continued
(Figure: prediction error, squared bias, and variance curves for the four panels.)
Regression with squared-error loss: prediction error = squared bias + variance.
Classification with 0-1 loss: prediction error ≠ squared bias + variance. Estimation errors on the right side of the boundary don't hurt!
The bias-variance interaction is different for 0-1 loss than for squared-error loss.
Optimism of the Training Error Rate
Typically, the training error rate is less than the true error (the same data are being used to fit the method and to assess its error):
$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$$
is overly optimistic:
$$\overline{\mathrm{err}} < \mathrm{Err} = E[L(Y, \hat{f}(X))]$$
Optimism of the Training Error Rate
Err is a kind of extra-sample error: the test features need not coincide with the training feature vectors.
Focus instead on the in-sample error:
$$\mathrm{Err}_{\mathrm{in}} = \frac{1}{N}\sum_{i=1}^{N} E_{\mathbf{y}} E_{Y^{\mathrm{new}}}\big[L(Y_i^{\mathrm{new}}, \hat{f}(x_i))\big]$$
$Y^{\mathrm{new}}$ ... we observe $N$ new response values at each of the training points $x_i$, $i = 1, 2, \dots, N$.
Optimism: $\mathrm{op} \equiv \mathrm{Err}_{\mathrm{in}} - E_{\mathbf{y}}[\overline{\mathrm{err}}]$
For squared error, 0-1, and other loss functions:
$$\mathrm{op} = \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat{y}_i, y_i)$$
The amount by which $\overline{\mathrm{err}}$ underestimates the true error depends on how strongly $y_i$ affects its own prediction.
Optimism of the Training Error Rate
Summary:
$$\mathrm{Err}_{\mathrm{in}} = E_{\mathbf{y}}[\overline{\mathrm{err}}] + \frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat{y}_i, y_i)$$
The harder we fit the data, the greater $\mathrm{Cov}(\hat{y}_i, y_i)$ will be, thereby increasing the optimism.
For a linear fit with $d$ independent inputs or basis functions:
$$\mathrm{Err}_{\mathrm{in}} = E_{\mathbf{y}}[\overline{\mathrm{err}}] + 2\frac{d}{N}\sigma_\varepsilon^2$$
– optimism grows linearly with the number $d$ of basis functions (see the simulation sketch below)
– optimism decreases as the training sample size $N$ increases
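A simulation sketch of the covariance identity for a linear fit (my own construction): with the design matrix held fixed and many response vectors drawn, the summed covariances between fitted and observed values come out near $d\,\sigma_\varepsilon^2$.

```python
# Check sum_i Cov(yhat_i, y_i) = d * sigma^2 for ordinary least squares,
# so that the optimism is op = 2*d*sigma^2 / N.
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma = 100, 5, 1.0
X = rng.normal(size=(N, d))              # fixed design
beta = rng.normal(size=d)
H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix: yhat = H y

Y = X @ beta + sigma * rng.normal(size=(5000, N))   # many response vectors
Yhat = Y @ H.T

# empirical sum of covariances between fitted and observed values
cov_sum = sum(np.cov(Yhat[:, i], Y[:, i])[0, 1] for i in range(N))
print(cov_sum, "vs d*sigma^2 =", d * sigma**2)
print("optimism op approx", 2 * cov_sum / N)
```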
Optimism of the Training Error Rate
Ways to estimate prediction error:
– Estimate the optimism and then add it to the training error rate $\overline{\mathrm{err}}$
• AIC, BIC, and others work this way, for a special class of estimates that are linear in their parameters
– Direct estimates of the extra-sample error Err
• cross-validation, bootstrap
• can be used with any loss function, and with nonlinear, adaptive fitting techniques
Estimates of In-Sample Prediction Error
General form of the in-sample estimate:
$$\widehat{\mathrm{Err}}_{\mathrm{in}} = \overline{\mathrm{err}} + \widehat{\mathrm{op}}$$
with an estimate of the optimism.
For a linear fit, where $\mathrm{Err}_{\mathrm{in}} = E_{\mathbf{y}}[\overline{\mathrm{err}}] + 2\frac{d}{N}\sigma_\varepsilon^2$:
$$C_p = \overline{\mathrm{err}} + 2\frac{d}{N}\hat{\sigma}_\varepsilon^2, \qquad \text{the so-called } C_p \text{ statistic}$$
$\hat{\sigma}_\varepsilon^2$ ... estimate of the noise variance, from the mean squared error of a low-bias model
$d$ ... number of basis functions
$N$ ... training sample size
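A sketch of the $C_p$ computation for nested least-squares fits (assumed data-generating model with only three relevant predictors); $\hat{\sigma}_\varepsilon^2$ is taken from the mean squared error of the largest, low-bias fit:

```python
# Cp = err_bar + 2*(d/N)*sigma_hat^2 for least-squares fits of increasing size d.
import numpy as np

rng = np.random.default_rng(3)
N, p = 80, 10
X = rng.normal(size=(N, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=N)   # only 3 predictors matter

def train_err(d):
    Xd = X[:, :d]
    yhat = Xd @ np.linalg.lstsq(Xd, y, rcond=None)[0]
    return np.mean((y - yhat) ** 2)

sigma2_hat = train_err(p) * N / (N - p)        # noise variance from the full (low-bias) model
for d in range(1, p + 1):
    cp = train_err(d) + 2 * d / N * sigma2_hat
    print(f"d={d:2d}  err_bar={train_err(d):.3f}  Cp={cp:.3f}")
```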
Estimates of In-Sample Prediction Error
Similarly, the Akaike Information Criterion (AIC) is a more generally applicable estimate of $\mathrm{Err}_{\mathrm{in}}$ when a log-likelihood loss function is used.
For $N \to \infty$:
$$-2\, E\big[\log \Pr_{\hat{\theta}}(Y)\big] \approx -\frac{2}{N}\, E[\mathrm{loglik}] + 2\frac{d}{N}$$
$\Pr_\theta(Y)$ ... family of densities for $Y$ (containing the true density)
$\hat{\theta}$ ... maximum-likelihood estimate of $\theta$
$$\mathrm{loglik} = \sum_{i=1}^{N} \log \Pr_{\hat{\theta}}(y_i) \qquad \text{(log-likelihood maximized over } \theta\text{)}$$
AIC
For $N \to \infty$:
$$-2\, E\big[\log \Pr_{\hat{\theta}}(Y)\big] \approx -\frac{2}{N}\, E[\mathrm{loglik}] + 2\frac{d}{N}$$
For example, for the logistic regression model, using the binomial log-likelihood:
$$\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\frac{d}{N}$$
To use AIC for model selection: choose the model giving the smallest AIC over the set of models considered.
Given a set of models $\hat{f}_\alpha(x)$ indexed by a tuning parameter $\alpha$:
$$\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\frac{d(\alpha)}{N}\hat{\sigma}_\varepsilon^2$$
$\overline{\mathrm{err}}(\alpha)$ ... training error, $d(\alpha)$ ... number of parameters
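A sketch of AIC with the binomial log-likelihood for logistic regression models of increasing size (simulated data; the Newton/IRLS fitting loop is written out only to keep the example self-contained and is not the chapter's code):

```python
# AIC = -2/N * loglik + 2*d/N for logistic regressions with d coefficients.
import numpy as np

rng = np.random.default_rng(4)
N = 200
X_full = np.column_stack([np.ones(N), rng.normal(size=(N, 4))])
true_beta = np.array([0.5, 1.5, -1.0, 0.0, 0.0])          # last two inputs irrelevant (assumed)
y = rng.binomial(1, 1 / (1 + np.exp(-X_full @ true_beta)))

def fit_logistic(X, y, iters=25):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):                                 # Newton-Raphson / IRLS updates
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

for d in (2, 3, 5):                                        # intercept + first d-1 inputs
    X = X_full[:, :d]
    beta = fit_logistic(X, y)
    p = 1 / (1 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    aic = -2 / N * loglik + 2 * d / N
    print(f"d={d}: AIC={aic:.4f}")
```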
AIC
• The function $\mathrm{AIC}(\alpha)$ provides an estimate of the test error curve.
• If the basis functions are chosen adaptively with $d < p$ inputs, then
$$\sum_{i=1}^{N}\mathrm{Cov}(\hat{y}_i, y_i) = d\,\sigma_\varepsilon^2$$
no longer holds => the optimism exceeds $(2d/N)\,\sigma_\varepsilon^2$ => the effective number of parameters fit exceeds $d$.
Using AIC to select the # of basis functions
Input vector: the log-periodogram of a vowel, quantized to 256 uniformly spaced frequencies.
Linear logistic regression model with coefficient function
$$\beta(f) = \sum_{m=1}^{M} h_m(f)\,\theta_m$$
– an expansion in $M$ spline basis functions
– for any $M$, a basis of natural cubic splines is used for the $h_m$, with knots chosen uniformly over the range of frequencies, i.e. $d(\alpha) = d(M) = M$
AIC approximately minimizes $\mathrm{Err}(M)$ for both entropy and 0-1 loss, using the simple formula for the linear case:
$$\frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat{y}_i, y_i) = \frac{2d}{N}\sigma_\varepsilon^2$$
Using AIC to select the # of basis functions
The approximation
$$\frac{2}{N}\sum_{i=1}^{N}\mathrm{Cov}(\hat{y}_i, y_i) \approx \frac{2d}{N}\sigma_\varepsilon^2$$
does not hold in general for the 0-1 case, but it does OK here. (It is exact only for linear models with additive errors and squared-error loss.)
Effective Number of Parameters
$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} \quad \text{vector of outcomes; similarly } \hat{\mathbf{y}} \text{ for the predictions}$$
Linear fit (e.g. linear regression, quadratic shrinkage such as ridge, splines):
$$\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}$$
$\mathbf{S}$ ... an $N \times N$ matrix depending on the input vectors $x_i$ but not on the $y_i$
Effective number of parameters: $d(\mathbf{S}) = \mathrm{trace}(\mathbf{S})$
$d(\mathbf{S})$ is the correct $d$ for the $C_p$ statistic
$$C_p = \overline{\mathrm{err}} + 2\frac{d}{N}\hat{\sigma}_\varepsilon^2 \qquad \text{(c.f. } \textstyle\sum_i \mathrm{Cov}(\hat{y}_i, y_i)\text{)}$$
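A small sketch of $d(\mathbf{S}) = \mathrm{trace}(\mathbf{S})$ for ridge regression, where $\mathbf{S} = X(X^T X + \lambda I)^{-1}X^T$ (hypothetical design matrix); the trace moves from $p$ toward 0 as $\lambda$ grows:

```python
# Effective degrees of freedom of the ridge smoother matrix S, via its trace.
import numpy as np

rng = np.random.default_rng(5)
N, p = 60, 10
X = rng.normal(size=(N, p))

for lam in (0.0, 1.0, 10.0, 100.0):
    S = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T   # yhat = S y
    print(f"lambda={lam:6.1f}  trace(S)={np.trace(S):.2f}")
```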
Bayesian Approach and BIC
Like AIC, BIC is used when fitting is by maximization of a log-likelihood.
Bayesian Information Criterion (BIC):
$$\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,d$$
Assuming a Gaussian model with $\sigma_\varepsilon^2$ known,
$$-2\,\mathrm{loglik} \approx \sum_i \big(y_i - \hat{f}(x_i)\big)^2 / \sigma_\varepsilon^2 = N\,\overline{\mathrm{err}} / \sigma_\varepsilon^2,$$
then
$$\mathrm{BIC} = \frac{N}{\sigma_\varepsilon^2}\Big[\overline{\mathrm{err}} + (\log N)\,\frac{d}{N}\,\sigma_\varepsilon^2\Big]$$
BIC is proportional to AIC, except with $\log N$ in place of the factor 2. For $N > e^2$ (approximately 7.4), BIC penalizes complex models more heavily.
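A sketch of this Gaussian-model BIC next to the corresponding AIC-style estimate; the training-error values fed in are illustrative placeholders only, not results from the chapter:

```python
# BIC = (N/sigma^2) * [err_bar + (log N)*(d/N)*sigma^2] vs. err_bar + 2*(d/N)*sigma^2.
import numpy as np

def gaussian_bic(err_bar, d, N, sigma2=1.0):
    return N / sigma2 * (err_bar + np.log(N) * d / N * sigma2)

def gaussian_aic(err_bar, d, N, sigma2=1.0):
    return err_bar + 2 * d / N * sigma2

N = 100
for d, err_bar in [(2, 1.30), (5, 1.05), (10, 1.00)]:   # illustrative (d, training error) pairs
    print(d, round(gaussian_aic(err_bar, d, N), 3), round(gaussian_bic(err_bar, d, N), 1))
# Since log(100) ~ 4.6 > 2, BIC's penalty on the larger models is the heavier one.
```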
BIC Motivation
Given a set of candidate models $\mathcal{M}_m$, $m = 1, \dots, M$, with model parameters $\theta_m$, the posterior probability of a given model is
$$\Pr(\mathcal{M}_m \mid \mathbf{Z}) \propto \Pr(\mathcal{M}_m)\,\Pr(\mathbf{Z} \mid \mathcal{M}_m),$$
where $\mathbf{Z}$ represents the training data $\{x_i, y_i\}_1^N$.
To compare two models, form the posterior odds:
$$\frac{\Pr(\mathcal{M}_m \mid \mathbf{Z})}{\Pr(\mathcal{M}_l \mid \mathbf{Z})} = \frac{\Pr(\mathcal{M}_m)}{\Pr(\mathcal{M}_l)} \cdot \frac{\Pr(\mathbf{Z} \mid \mathcal{M}_m)}{\Pr(\mathbf{Z} \mid \mathcal{M}_l)}$$
If the odds are > 1, choose model $m$. The prior over models (the first factor) is typically taken as constant. The second factor, the contribution of the data $\mathbf{Z}$ to the posterior odds, is called the Bayes factor $\mathrm{BF}(\mathbf{Z})$.
We need to approximate $\Pr(\mathbf{Z} \mid \mathcal{M}_m)$; various chicanery and approximations (p. 207 of the text) get us to BIC.
We can then estimate the posterior from BIC and compare the relative merits of the models.
BIC: How much better is a model?
We may want to know how the various models stack up relative to one another (not just their ranking). Once we have the BIC values:
$$\widehat{\Pr}(\mathcal{M}_m \mid \mathbf{Z}) = \frac{e^{-\frac{1}{2}\mathrm{BIC}_m}}{\sum_{l=1}^{M} e^{-\frac{1}{2}\mathrm{BIC}_l}}$$
The denominator normalizes the result, so we can assess the relative merits of each model.
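A one-function sketch of this normalization (the BIC values used are placeholders):

```python
# Approximate posterior model probabilities from BIC values.
import numpy as np

def bic_posterior(bic_values):
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))      # subtract the min for numerical stability
    return w / w.sum()

print(bic_posterior([210.3, 205.1, 206.0]))   # the lowest-BIC model gets the most mass
```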
Minimum Description Length (MDL)
• Huffman coding (e.g. for compression)
– Variable-length code: use the shortest symbol length (in bits) for the most frequently occurring targets. So a code for English letters might look like (an instantaneous prefix code):
• E = 1
• N = 01
• R = 001, etc.
– The first bit codes "is this an E or not", the second bit codes "is this an N or not", etc.
Entropy gives a lower bound on the expected message length (information content, the number of bits required to transmit):
$$E(\mathrm{length}) \ge -\sum_i \Pr(z_i)\log_2 \Pr(z_i)$$
So the bit length $l_i$ for $z_i$ should be about $-\log_2 \Pr(z_i)$.
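A small sketch of the entropy bound for an assumed four-symbol distribution; the probabilities are chosen so the ideal code lengths match the E/N/R example above:

```python
# Ideal code lengths -log2 Pr(z_i) and the entropy lower bound on E(length).
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])        # assumed Pr(z_i) for symbols E, N, R, ...
code_lengths = -np.log2(p)                      # ideal length in bits for each symbol
entropy = np.sum(p * code_lengths)              # lower bound on expected message length
print(code_lengths, entropy)                    # [1. 2. 3. 3.]  1.75 bits
```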
MDL
Now, for our data: model $M$, parameters $\theta$, data $\mathbf{Z} = (\mathbf{X}, \mathbf{y})$:
$$\mathrm{length} = -\log \Pr(\mathbf{y} \mid \theta, M, \mathbf{X}) - \log \Pr(\theta \mid M)$$
The second term is the average code length for transmitting the model parameters; the first term is the average code length for transmitting the discrepancy between the model and the actual target values. So we choose the model that minimizes this length.
Minimizing description length = maximizing posterior probability.
The book adds, almost as a non-sequitur: suppose $y \sim N(\theta, \sigma^2)$ and $\theta \sim N(0, 1)$; then
$$\mathrm{length} = \mathrm{constant} + \log \sigma + \frac{(y - \theta)^2}{2\sigma^2} + \frac{\theta^2}{2}$$