
Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

CHAPTER 13: MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

“All models are wrong; some are useful.” – George E. P. Box

• Organization of chapter in ISSO:
– Bias-variance tradeoff
– Model selection: Cross-validation
– Fisher information matrix: Definition, examples, and efficient computation

Model Definition and MSE

• Assume model z = h(θ, x) + v, where z is output, x is input, h(·) is some function, v is noise, and θ is vector of model parameters
– h(·) may represent simulation model
– h(·) may represent “metamodel” (response surface) of existing simulation
• A fundamental goal is to take n data points and estimate θ, forming θ̂_n
• A common measure of effectiveness for estimate is mean of squared model error (MSE) at fixed x:

E{ [h(θ̂_n, x) − E(z | x)]² | x }

13-2

Bias-Variance Decomposition

• The MSE of the model at a fixed x can be decomposed as:

E{ [h(θ̂_n, x) − E(z | x)]² | x }
  = E{ [h(θ̂_n, x) − E(h(θ̂_n, x) | x)]² | x } + [E(h(θ̂_n, x) | x) − E(z | x)]²
  = variance at x + (bias at x)²

where expectations are computed w.r.t. θ̂_n
• Above implies:
– Model too simple: high bias / low variance
– Model too complex: low bias / high variance

13-3

Unbiased Estimator May Not be Best
(Example 13.1 from ISSO)

• Estimate is unbiased if E(h(θ̂_n, x)) = E(z | x) (i.e., mean of prediction is same as mean of data z)
• Example: θ̂_n as unbiased estimator of true mean μ (h(θ, x) = θ in notation above)
• Alternative biased estimator of μ is r·θ̂_n, where 0 < r < 1
• MSEs of the biased and unbiased estimators generally satisfy E[(r·θ̂_n − μ)²] < E[(θ̂_n − μ)²]
• Biased estimate is better in MSE sense (a worked special case follows below)
– However, optimal value of r requires knowledge of unknown (true) μ

13-4
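To make the claim concrete, here is a worked special case (an assumed setting for illustration, not stated on the slide): take θ̂_n to be the sample mean z̄_n of n i.i.d. observations with mean μ and variance σ², so that E(z̄_n) = μ and var(z̄_n) = σ²/n. The MSE of the shrunken estimator r·z̄_n then decomposes into variance plus squared bias and can be minimized over r in closed form:

```latex
% MSE of the shrunken sample mean, and the minimizing shrinkage factor
\mathrm{MSE}(r\,\bar{z}_n)
  = \underbrace{r^2\,\sigma^2/n}_{\text{variance}}
  + \underbrace{(r-1)^2\,\mu^2}_{\text{bias}^2},
\qquad
\frac{d}{dr}\,\mathrm{MSE}(r\,\bar{z}_n) = 0
\;\Longrightarrow\;
r^{*} = \frac{\mu^2}{\mu^2 + \sigma^2/n} < 1 .
```

Since the MSE is strictly convex in r and minimized at r* < 1 (whenever σ² > 0), the biased estimator r*·z̄_n has strictly smaller MSE than the unbiased sample mean; but r* depends on the unknown μ (and on σ²), which is exactly the caveat in the last bullet above.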

Bias-Variance Tradeoff in Model Selection in Simple Problem

13-5

Example 13.2 in ISSO: Bias-Variance Tradeoff

• Suppose true process produces output according to z = f(x) + noise, where f(x) = (x + x²)^1.1
• Compare linear, quadratic, and cubic approximations
• Table below gives average bias², variance, and overall MSE:

                      bias²     variance   Overall MSE
  Linear Model        510.6       10.0        520.6
  Quadratic Model       0.53      20.0         20.53
  Cubic Model           0.005     30.0         30.005

• Quadratic model has lowest overall MSE: the cubic model further reduces bias but at too great a cost in variance; optimal tradeoff is the quadratic model (a Monte Carlo sketch of this comparison follows below)

13-6
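A minimal Monte Carlo sketch of this kind of bias-variance comparison in Python. The input design, noise level, and replication count below are assumptions (the slide does not restate them), so only the qualitative pattern, not the exact table values, should be expected to match.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # True process mean from Example 13.2
    return (x + x**2) ** 1.1

# Assumed experimental setup: inputs on [0, 2], 20 noisy observations per
# replication, noise standard deviation 2, 2000 replications.
x_design = np.linspace(0.0, 2.0, 20)
x_eval = np.linspace(0.0, 2.0, 50)     # points at which bias/variance are averaged
noise_sd, n_reps = 2.0, 2000

for degree, name in [(1, "linear"), (2, "quadratic"), (3, "cubic")]:
    preds = np.empty((n_reps, x_eval.size))
    for rep in range(n_reps):
        z = f(x_design) + noise_sd * rng.standard_normal(x_design.size)
        coeffs = np.polyfit(x_design, z, degree)      # least-squares polynomial fit
        preds[rep] = np.polyval(coeffs, x_eval)
    bias2 = np.mean((preds.mean(axis=0) - f(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"{name:9s}  bias^2={bias2:10.4f}  variance={variance:8.4f}  MSE={bias2 + variance:10.4f}")
```

With a setup like this, the bias² term falls and the variance term rises as the polynomial degree increases, the same qualitative pattern as in the table; which degree wins overall depends on the assumed noise level and design.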

Model Selection

• The bias-variance tradeoff provides conceptual framework for determining a good model
– Bias-variance tradeoff not directly useful in practice
• Need a practical method for optimizing bias-variance tradeoff
• Practical aim is to pick a model that minimizes a criterion:

f₁(fitting error from given data) + f₂(model complexity)

where f₁ and f₂ are increasing functions (an AIC-style instance is sketched below)
• All methods based on a tradeoff between fitting error (high variance) and model complexity (low bias)
• Criterion above may or may not be explicitly used in a given method

13-7

Methods for Model Selection

• Among many popular methods are:
– Akaike Information Criterion (AIC) (Akaike, 1974)
  • Popular in time series analysis
– Bayesian selection (Akaike, 1977)
– Bootstrap-based selection (Efron and Tibshirani, 1997)
– Cross-validation (Stone, 1974)
– Minimum description length (Rissanen, 1978)
– V-C dimension (Vapnik and Chervonenkis, 1971)
  • Popular in computer science
• Cross-validation appears to be the most popular model fitting method

13-8

Cross-Validation

• Cross-validation is a simple, general method for comparing candidate models
– Other specialized methods may work better in specific problems
• Cross-validation uses the training set of data
• Method is based on iteratively partitioning the full set of training data into training and test subsets
• For each partition, estimate model from training subset and evaluate model on test subset
– Number of training (or test) subsets = number of model fits required
• Select model that performs best over all test subsets

13-9

Choice of Training and Test Subsets

• Let n denote total size of data set and n_T denote size of test subset, n_T < n
• Common strategy is leave-one-out: n_T = 1
– Implies n test subsets during cross-validation process
• Often better to choose n_T > 1
– Sometimes more efficient (sampling w/o replacement)
– Sometimes more accurate model selection
• If n_T > 1, sampling may be with or without replacement
– “With replacement” indicates that there are “n choose n_T” test subsets
– With replacement may be prohibitive in practice: e.g., n = 30, n_T = 6 implies nearly 600K model fits! (checked numerically below)
• Sampling without replacement reduces number of test subsets to n / n_T (disjoint test subsets)

13-10
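The subset counts quoted above can be checked directly (a quick sketch; `math.comb` requires Python 3.8+):

```python
import math

n, n_T = 30, 6
subsets_with_replacement = math.comb(n, n_T)   # all "n choose n_T" possible test subsets
subsets_disjoint = n // n_T                    # disjoint test subsets (w/o replacement)
print(subsets_with_replacement)                # 593775 -- nearly 600K model fits per model
print(subsets_disjoint)                        # 5 model fits per candidate model
```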

Conceptual Example of Sampling Without Replacement: Cross-Validation with 3 Disjoint Test Subsets

13-11

Typical Steps for Cross-Validation

Step 0 (initialization): Determine size of test subsets and candidate model. Let i be counter for test subset being used.

Step 1 (estimation): For the i-th test subset, let the remaining data be the i-th training subset. Estimate θ from this training subset.

Step 2 (error calculation): Based on estimate for θ from Step 1 (i-th training subset), calculate MSE (or other measure) with data in i-th test subset.

Step 3 (new training and test subsets): Update i to i + 1 and return to Step 1. Form mean of MSE when all test subsets have been evaluated.

Step 4 (new model): Repeat Steps 1 to 3 for next model. Choose model with lowest mean MSE as best. (A code sketch of Steps 0 to 4 follows below.)

13-12
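A minimal Python sketch of Steps 0 to 4, assuming polynomial candidate models fit by least squares, disjoint test subsets, and squared error as the performance measure (all illustrative choices, not prescribed by the slide). The usage lines mirror the setting of Example 13.4 on the next slides: sine-function truth, 30 data points, 5 disjoint test subsets.

```python
import numpy as np

def cross_validate(x, z, degrees, n_subsets, seed=0):
    """Steps 0-4: for each candidate polynomial degree, average the test-subset MSE
    over disjoint test subsets; returns {degree: mean MSE}."""
    idx = np.random.default_rng(seed).permutation(len(z))
    folds = np.array_split(idx, n_subsets)              # Step 0: disjoint test subsets
    mean_mse = {}
    for degree in degrees:                              # Step 4: repeat for each model
        fold_mse = []
        for test_idx in folds:                          # Steps 1-3: loop over test subsets
            train_idx = np.setdiff1d(idx, test_idx)
            coeffs = np.polyfit(x[train_idx], z[train_idx], degree)   # Step 1: estimate theta
            resid = z[test_idx] - np.polyval(coeffs, x[test_idx])
            fold_mse.append(np.mean(resid**2))                        # Step 2: test-subset MSE
        mean_mse[degree] = float(np.mean(fold_mse))     # Step 3: mean MSE over all subsets
    return mean_mse

# Usage in the spirit of Example 13.4 (the degree-10 fit may warn about conditioning).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0 * np.pi, 30)
z = np.sin(x) + 0.1 * rng.standard_normal(x.size)
scores = cross_validate(x, z, degrees=(1, 3, 10), n_subsets=5)
print(scores, "-> preferred degree:", min(scores, key=scores.get))
```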

Numerical Illustration of Cross-Validation (Example 13.4 in ISSO)

• Consider true system corresponding to a sine function of the input with additive normally distributed noise
• Consider three candidate models:
– Linear (affine) model
– 3rd-order polynomial
– 10th-order polynomial
• Suppose 30 data points are available, divided into 5 disjoint test subsets (sampling w/o replacement)
• Based on RMS error (equivalent to MSE) over test subsets, 3rd-order polynomial is preferred
• See following plot

13-13

Numerical Illustration (cont’d): Relative Fits for 3 Models with Low-Noise Observations

[Plot: sine wave (process mean) with linear, 3rd-order, and 10th-order polynomial fits]

13-14

Fisher Information Matrix

• Fundamental role of data analysis is to extract information from data
• Parameter estimation for models is central to the process of extracting information
• The Fisher information matrix plays a central role in parameter estimation as a way of measuring information
• Information matrix summarizes the amount of information in the data relative to the parameters being estimated

13-15

Problem Setting

• Consider the classical statistical problem of estimating parameter vector θ from n data vectors z₁, z₂, …, z_n
• Suppose we have a probability density and/or mass function associated with the data
• The parameters θ appear in the probability function and affect the nature of the distribution
– Example: z_i ~ N(mean(θ), covariance(θ)) for all i
• Let ℓ(θ | z₁, z₂, …, z_n) represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of θ conditioned on the data (a minimal coded example follows below)

13-16
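As a concrete (assumed) instance of such a likelihood, here is a minimal sketch for the scalar Gaussian case with θ = (μ, σ²) and i.i.d. data; the function returns log ℓ(θ | z₁, …, z_n), the quantity differentiated in the information-matrix definition on the next slide.

```python
import numpy as np

def log_likelihood(theta, z):
    """log l(theta | z_1, ..., z_n) for i.i.d. z_i ~ N(mu, sigma2), theta = (mu, sigma2)."""
    mu, sigma2 = theta
    n = len(z)
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - np.sum((z - mu) ** 2) / (2.0 * sigma2)

# Usage: compare the log-likelihood of a small assumed data set at two parameter values.
z = np.array([0.8, 1.3, 0.4, 1.1, 0.9])
print(log_likelihood((1.0, 0.1), z), log_likelihood((0.0, 0.1), z))
```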

Information Matrix — Definition

• Recall likelihood function ℓ(θ | z₁, z₂, …, z_n)
• Information matrix defined as

F_n(θ) ≡ E[ (∂ log ℓ / ∂θ) · (∂ log ℓ / ∂θ)ᵀ ]

where expectation is w.r.t. z₁, z₂, …, z_n
• Equivalent form based on Hessian matrix:

F_n(θ) = −E[ ∂² log ℓ / (∂θ ∂θᵀ) ]

• F_n(θ) is positive semidefinite of dimension p × p (p = dim(θ))

13-17

Information Matrix — Two Key Properties

• Connection of F_n(θ) to the accuracy of θ̂_n is rigorously specified via two famous results (θ* = true value of θ):

1. Asymptotic normality:
   √n (θ̂_n − θ*) → N(0, F̄⁻¹) in distribution, where F̄ ≡ lim_{n→∞} F_n(θ*) / n

2. Cramér-Rao inequality:
   cov(θ̂_n) ≥ F_n(θ*)⁻¹ for all n (for unbiased θ̂_n)

• “Smaller” F_n(θ*) implies greater uncertainty (variability) in θ̂_n, and vice versa (a numerical Gaussian special case follows below)

13-18
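To make the definition on the previous slide and Property 2 above concrete, here is a small Monte Carlo sketch for an assumed special case: n i.i.d. scalar observations z_i ~ N(μ, σ²) with θ = (μ, σ²) (chosen for illustration; not taken from the slides). It estimates F_n(θ) from the outer-product form of the score and compares it with the known closed form n·diag(1/σ², 1/(2σ⁴)); for the μ-component, var(z̄_n) = σ²/n = (n/σ²)⁻¹, so the sample mean attains the Cramér-Rao bound exactly.

```python
import numpy as np

# Assumed illustrative model: z_1, ..., z_n i.i.d. N(mu, sigma2), theta = (mu, sigma2).
mu, sigma2, n = 1.0, 2.0, 50
n_mc = 50_000                          # Monte Carlo replications of the data set
rng = np.random.default_rng(0)

def score(z):
    """Gradient of the log-likelihood of the sample z w.r.t. theta = (mu, sigma2)."""
    d_mu = np.sum(z - mu) / sigma2
    d_s2 = -n / (2.0 * sigma2) + np.sum((z - mu) ** 2) / (2.0 * sigma2**2)
    return np.array([d_mu, d_s2])

acc = np.zeros((2, 2))
for _ in range(n_mc):
    z = rng.normal(mu, np.sqrt(sigma2), size=n)
    g = score(z)
    acc += np.outer(g, g)              # E[(d log l / d theta)(d log l / d theta)^T]
F_mc = acc / n_mc

F_exact = n * np.diag([1.0 / sigma2, 1.0 / (2.0 * sigma2**2)])
print(np.round(F_mc, 3))               # Monte Carlo estimate of F_n(theta)
print(F_exact)                         # closed form: diag(n/sigma2, n/(2*sigma2^2)) = diag(25, 6.25)
# Cramer-Rao check for mu: var(sample mean) = sigma2/n = 0.04 = 1/F_n[0, 0],
# so the (unbiased) sample mean attains the bound.
```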

Selected Applications

• Information matrix is measure of performance for several applications. Four uses are:

1. Confidence regions for parameter estimation
– Uses asymptotic normality and/or Cramér-Rao inequality
2. Prediction bounds for mathematical models
3. Basis for “D-optimal” criterion for experimental design
– Information matrix serves as measure of how well θ can be estimated for a given set of inputs
4. Basis for “noninformative prior” in Bayesian analysis
– Sometimes used for “objective” Bayesian inference

13-19