Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

CHAPTER 13: MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

“All models are wrong; some are useful.” – George E. P. Box

• Organization of chapter in ISSO
  – Bias-variance tradeoff
  – Model selection: Cross-validation
  – Fisher information matrix: Definition, examples, and efficient computation
Model Definition and MSE
• Assume model z = h(θ, x) + v, where z is output, x is input, h(·) is some function, v is noise, and θ is vector of model parameters
  – h(·) may represent simulation model
  – h(·) may represent “metamodel” (response surface) of existing simulation
• A fundamental goal is to take n data points and estimate θ, forming θ̂_n
• A common measure of effectiveness for estimate is mean of squared model error (MSE) at fixed x:

  E{[h(θ̂_n, x) − E(z | x)]² | x}
13-2
Bias-Variance Decomposition
• The MSE of the model at a fixed x can be decomposed as:

  E{[h(θ̂_n, x) − E(z | x)]² | x}
    = E{[h(θ̂_n, x) − E(h(θ̂_n, x) | x)]² | x} + [E(h(θ̂_n, x) | x) − E(z | x)]²
    = variance at x + (bias at x)²

  where expectations are computed w.r.t. θ̂_n
• Above implies:
  – Model too simple: high bias / low variance
  – Model too complex: low bias / high variance
13-3
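To make the decomposition concrete, here is a minimal Monte Carlo sketch (my illustration, not from ISSO): it assumes a simple linear model h(θ, x) = θx with least-squares estimation, and checks numerically that the MSE at a fixed x equals variance plus squared bias.

```python
# Monte Carlo check of MSE = variance + (bias)^2 at a fixed x.
# Assumed setup (illustrative only): z = h(theta, x) + v with h(theta, x) = theta * x,
# v ~ N(0, sigma^2), and theta estimated by least squares from n data points.
import numpy as np

rng = np.random.default_rng(0)
theta_true, sigma, n, x0 = 2.0, 1.0, 20, 1.5

def h(theta, x):
    return theta * x

preds = np.empty(100_000)
for k in range(preds.size):
    xi = rng.uniform(0.5, 2.5, n)                    # inputs for one data set
    z = h(theta_true, xi) + sigma * rng.standard_normal(n)
    theta_hat = (xi @ z) / (xi @ xi)                 # least-squares estimate
    preds[k] = h(theta_hat, x0)                      # prediction at fixed x0

target = h(theta_true, x0)                           # E(z | x0) for zero-mean noise
mse = np.mean((preds - target) ** 2)
decomposed = np.var(preds) + (np.mean(preds) - target) ** 2
print(mse, decomposed)                               # agree up to Monte Carlo error
```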
Unbiased Estimator May Not be Best
(Example 13.1 from ISSO)
• Estimator is unbiased if E[h(θ̂_n, x) | x] = E(z | x) (i.e., mean of prediction is same as mean of data z)
• Example: sample mean z̄_n as estimator of true mean θ (h(θ, x) = θ in notation above)
• Alternative biased estimator of θ is r·z̄_n, where 0 < r < 1
• MSE of biased and unbiased estimators generally satisfy

  E(r·z̄_n − θ)² < E(z̄_n − θ)²

• Biased estimate better in MSE sense
  – However, optimal value of r requires knowledge of unknown θ (true θ)
13-4
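A quick numerical check of Example 13.1's message (my sketch, not from ISSO): for z_i ~ N(θ, σ²), the MSE-optimal shrinkage is the standard formula r = θ²/(θ² + σ²/n), which indeed depends on the unknown θ.

```python
# Biased estimator r * zbar_n vs. unbiased sample mean zbar_n for z_i ~ N(theta, sigma^2).
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, reps = 1.0, 2.0, 10, 200_000

zbar = (theta + sigma * rng.standard_normal((reps, n))).mean(axis=1)
r = theta**2 / (theta**2 + sigma**2 / n)   # optimal r; requires the unknown theta
print(np.mean((zbar - theta) ** 2))        # unbiased MSE ~ sigma^2 / n = 0.4
print(np.mean((r * zbar - theta) ** 2))    # biased MSE ~ 0.286 (smaller)
```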
Bias-Variance Tradeoff in Model Selection in Simple Problem
13-5
Example 13.2 in ISSO: Bias-Variance Tradeoff
• Suppose true process produces output according to z = f(x) + noise, where f(x) = (x + x²)^1.1
• Compare linear, quadratic, and cubic approximations
• Table below gives average bias², variance, and MSE:

                     bias²    variance   Overall MSE
  Linear model       510.6      10.0        520.6
  Quadratic model      0.53     20.0         20.53
  Cubic model          0.005    30.0         30.005

• Decreasing bias comes at cost of increasing variance; optimal tradeoff is quadratic model
13-6
Model Selection
• The bias-variance tradeoff provides conceptual framework for determining a good model
  – Bias-variance tradeoff not directly useful
• Need a practical method for optimizing bias-variance tradeoff
• Practical aim is to pick a model that minimizes a criterion of the form:

  f₁(fitting error from given data) + f₂(model complexity)

  where f₁ and f₂ are increasing functions
• All methods based on a tradeoff between fitting error (high variance) and model complexity (low bias)
• Criterion above may/may not be explicitly used in given method
13-7
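As one concrete instance of the criterion above: AIC (listed on the next slide) adds a complexity penalty to a fitting-error term. A minimal sketch, assuming a least-squares model with Gaussian residuals so that the fitting-error term reduces to n·log(RSS/n) up to an additive constant:

```python
# AIC-style criterion: f1(fitting error) + f2(model complexity).
# Assumes Gaussian residuals from a least-squares fit with k free parameters.
import numpy as np

def aic(residuals: np.ndarray, k: int) -> float:
    n = residuals.size
    rss = float(residuals @ residuals)
    return n * np.log(rss / n) + 2 * k   # f1 = n log(RSS/n), f2 = 2k

# Usage: compute aic(z - model_predictions, k) for each candidate model;
# smaller is better.
```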
Methods for Model Selection
• Among many popular methods are:
  – Akaike Information Criterion (AIC) (Akaike, 1974)
    • Popular in time series analysis
  – Bayesian selection (Akaike, 1977)
  – Bootstrap-based selection (Efron and Tibshirani, 1997)
  – Cross-validation (Stone, 1974)
  – Minimum description length (Rissanen, 1978)
  – V-C dimension (Vapnik and Chervonenkis, 1971)
    • Popular in computer science
• Cross-validation appears to be most popular model fitting method
13-8
Cross-Validation
• Cross-validation is simple, general method for comparing candidate models
  – Other specialized methods may work better in specific problems
• Cross-validation uses the training set of data
• Method is based on iteratively partitioning the full set of training data into training and test subsets
• For each partition, estimate model from training subset and evaluate model on test subset
  – Number of training (or test) subsets = number of model fits required
• Select model that performs best over all test subsets
13-9
Choice of Training and Test Subsets
• Let n denote total size of data set, n_T denote size of test subset, n_T < n
• Common strategy is leave-one-out: n_T = 1
  – Implies n test subsets during cross-validation process
• Often better to choose n_T > 1
  – Sometimes more efficient (sampling w/o replacement)
  – Sometimes more accurate model selection
• If n_T > 1, sampling may be with or without replacement
  – “With replacement” indicates that there are “n choose n_T” test subsets
  – With replacement may be prohibitive in practice: e.g., n = 30, n_T = 6 implies nearly 600K model fits!
• Sampling without replacement reduces number of test subsets to n/n_T (disjoint test subsets)
13-10
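The counts above are easy to verify (standard combinatorics, nothing ISSO-specific):

```python
from math import comb

n, n_T = 30, 6
print(comb(n, n_T))   # 593775 possible test subsets "with replacement" (n choose n_T)
print(n // n_T)       # 5 disjoint test subsets when sampling without replacement
```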
Conceptual Example of Sampling Without Replacement: Cross-Validation with 3 Disjoint Test Subsets
13-11
Typical Steps for Cross-Validation
Step 0 (initialization): Determine size of test subsets and candidate model. Let i be counter for test subset being used.
Step 1 (estimation): For the i-th test subset, let the remaining data be the i-th training subset. Estimate θ from this training subset.
Step 2 (error calculation): Based on estimate for θ from Step 1 (i-th training subset), calculate MSE (or other measure) with data in i-th test subset.
Step 3 (new training and test subsets): Update i to i + 1 and return to Step 1. Form mean of MSE when all test subsets have been evaluated.
Step 4 (new model): Repeat Steps 1 to 3 for next model. Choose model with lowest mean MSE as best.
13-12
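A minimal Python sketch of Steps 0 to 4 with disjoint test subsets (sampling without replacement). The sine-plus-noise data and polynomial candidate models anticipate Example 13.4 on the next slide, but all constants here, and the use of numpy.polyfit as the fitting routine, are illustrative assumptions:

```python
# Cross-validation following Steps 0-4 with disjoint test subsets.
import numpy as np

rng = np.random.default_rng(2)
n, n_T = 30, 6                                   # Step 0: 5 disjoint test subsets
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, n))
z = np.sin(x) + 0.1 * rng.standard_normal(n)     # sine process plus noise

test_subsets = rng.permutation(n).reshape(n // n_T, n_T)

def cv_mse(degree):
    errors = []
    for test in test_subsets:                    # Steps 1-3: loop over subsets
        train = np.setdiff1d(np.arange(n), test)
        coef = np.polyfit(x[train], z[train], degree)   # Step 1: fit on training subset
        resid = z[test] - np.polyval(coef, x[test])     # Step 2: error on test subset
        errors.append(np.mean(resid ** 2))
    return np.mean(errors)                       # Step 3: mean MSE over subsets

for degree in (1, 3, 10):                        # Step 4: repeat for each model;
    print(degree, cv_mse(degree))                # choose lowest mean MSE as best
```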
Numerical Illustration of Cross-Validation (Example 13.4 in ISSO)
• Consider true system corresponding to a sine function of the input with additive normally distributed noise
• Consider three candidate models:
  – Linear (affine) model
  – 3rd-order polynomial
  – 10th-order polynomial
• Suppose 30 data points are available, divided into 5 disjoint test subsets (sampling w/o replacement)
• Based on RMS error (equiv. to MSE) over test subsets, 3rd-order polynomial is preferred
• See following plot
13-13
Numerical Illustration (cont’d): Relative Fits for 3 Models with Low-Noise Observations
[Plot: sine wave (process mean) with linear, 3rd-order, and 10th-order polynomial fits]
13-14
Fisher Information Matrix
• Fundamental role of data analysis is to extract information from data
• Parameter estimation for models is central to process of extracting information
• The Fisher information matrix plays a central role in parameter estimation as a measure of information
• Information matrix summarizes the amount of information in the data relative to the parameters being estimated
13-15
Problem Setting
• Consider the classical statistical problem of estimating parameter vector θ from n data vectors z₁, z₂, …, z_n
• Suppose we have a probability density and/or mass function associated with the data
• The parameters θ appear in the probability function and affect the nature of the distribution
  – Example: z_i ~ N(mean(θ), covariance(θ)) for all i
• Let ℓ(θ | z₁, z₂, …, z_n) represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of θ conditioned on the data
13-16
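As a standard concrete case (a textbook example, not ISSO's specific notation): if the z_i are i.i.d. N(θ, σ²) with σ² known, then ℓ(θ | z₁, …, z_n) = ∏ᵢ (2πσ²)^(−1/2) exp[−(z_i − θ)²/(2σ²)], so log ℓ = −(n/2) log(2πσ²) − Σᵢ (z_i − θ)²/(2σ²). This case is reused in the sketch following the information-matrix definition.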
Information Matrix — Definition

• Recall likelihood function ℓ(θ | z₁, z₂, …, z_n)
• Information matrix defined as

  F_n(θ) ≡ E[ (∂ log ℓ/∂θ) (∂ log ℓ/∂θ)ᵀ ]

  where expectation is w.r.t. z₁, z₂, …, z_n
• Equivalent form based on Hessian matrix:

  F_n(θ) = −E[ ∂² log ℓ / (∂θ ∂θᵀ) ]

• F_n(θ) is positive semidefinite of dimension p × p (p = dim(θ))
13-17
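A small numerical check of the definition (my illustration): for i.i.d. z_i ~ N(θ, σ²) with σ known, the score is ∂ log ℓ/∂θ = Σᵢ (z_i − θ)/σ², and the scalar information works out to F_n(θ) = n/σ². Averaging the squared score over simulated data should reproduce this value.

```python
# Monte Carlo check of F_n(theta) = E[(d log l / d theta)^2] for z_i ~ N(theta, sigma^2).
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n, reps = 0.5, 2.0, 25, 200_000

z = theta + sigma * rng.standard_normal((reps, n))
score = (z - theta).sum(axis=1) / sigma**2   # d log-likelihood / d theta, per data set
print(np.mean(score**2))                     # Monte Carlo estimate of F_n(theta)
print(n / sigma**2)                          # exact value: 25 / 4 = 6.25
```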
Information Matrix — Two Key Properties

• Connection of F_n(θ) to accuracy of θ̂_n rigorously specified via two famous results (θ* = true value of θ):

1. Asymptotic normality:

   √n (θ̂_n − θ*) → N(0, F̄⁻¹) in distribution, where F̄ ≡ lim_{n→∞} F_n(θ*)/n

2. Cramér-Rao inequality:

   cov(θ̂_n) ≥ F_n(θ*)⁻¹ for all n (for unbiased θ̂_n)

• Together these imply: “smaller” F_n(θ*) indicates less accurate θ̂_n (and vice versa)
13-18
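Continuing the Gaussian-mean case from the previous sketch: the sample mean is unbiased and attains the Cramér-Rao bound with equality, var(z̄_n) = σ²/n = F_n(θ)⁻¹.

```python
# The sample mean attains the Cramer-Rao bound: var(zbar_n) = sigma^2 / n.
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, n, reps = 0.5, 2.0, 25, 200_000

zbar = (theta + sigma * rng.standard_normal((reps, n))).mean(axis=1)
print(np.var(zbar))      # ~ 0.16
print(sigma**2 / n)      # Cramer-Rao bound F_n(theta)^{-1} = 4 / 25 = 0.16
```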
Selected Applications
• Information matrix is measure of performance for several applications. Four uses are:
1. Confidence regions for parameter estimation
   – Uses asymptotic normality and/or Cramér-Rao inequality
2. Prediction bounds for mathematical models
3. Basis for “D-optimal” criterion for experimental design
   – Information matrix serves as measure of how well θ can be estimated for a given set of inputs
4. Basis for “noninformative prior” in Bayesian analysis
   – Sometimes used for “objective” Bayesian inference
13-19
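A sketch of use 1 (my illustration, again for the Gaussian-mean case): an asymptotic 95% confidence interval built from the information matrix is θ̂_n ± 1.96·F_n⁻¹ᐟ².

```python
# 95% confidence interval for theta from the information matrix (use 1 above),
# for z_i ~ N(theta, sigma^2) with sigma known, so F_n = n / sigma^2.
import numpy as np

rng = np.random.default_rng(5)
theta, sigma, n = 0.5, 2.0, 25

z = theta + sigma * rng.standard_normal(n)
theta_hat = z.mean()
half_width = 1.96 / np.sqrt(n / sigma**2)   # 1.96 * F_n^{-1/2}
print(theta_hat - half_width, theta_hat + half_width)
```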