Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

CHAPTER 13: MODELING CONSIDERATIONS AND STATISTICAL INFORMATION

“All models are wrong; some are useful.” – George E. P. Box

• Organization of chapter in ISSO
  – Bias-variance tradeoff
  – Model selection: Cross-validation
  – Fisher information matrix: Definition, examples, and efficient computation
Model Definition and MSE
• Assume model z = h(θ, x) + v, where z is output, x is input, h(·) is some function, v is noise, and θ is vector of model parameters
  – h(·) may represent simulation model
  – h(·) may represent “metamodel” (response surface) of existing simulation
• A fundamental goal is to take n data points and estimate θ, forming θ̂_n
• A common measure of effectiveness for estimate is mean of squared model error (MSE) at fixed x:

  E{[h(θ̂_n, x) − E(z | x)]² | x}
13-2
Bias-Variance Decomposition
• The MSE of the model at a fixed x can be decomposed as:

  E{[h(θ̂_n, x) − E(z | x)]² | x}
    = E{[h(θ̂_n, x) − E(h(θ̂_n, x) | x)]² | x} + [E(h(θ̂_n, x) | x) − E(z | x)]²
    = variance at x + (bias at x)²

  where expectations are computed w.r.t. θ̂_n
• Above implies:
  – Model too simple: high bias / low variance
  – Model too complex: low bias / high variance
13-3
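To make the decomposition concrete, here is a minimal Monte Carlo sketch (my illustration, not from ISSO): it assumes a simple linear model h(θ, x) = θx with least-squares estimation, and checks numerically that the MSE at a fixed x equals variance plus squared bias.

```python
# Monte Carlo check of MSE = variance + (bias)^2 at a fixed x.
# Assumed setup (illustrative only): z = h(theta, x) + v with h(theta, x) = theta * x,
# v ~ N(0, sigma^2), and theta estimated by least squares from n data points.
import numpy as np

rng = np.random.default_rng(0)
theta_true, sigma, n, x0 = 2.0, 1.0, 20, 1.5

def h(theta, x):
    return theta * x

preds = np.empty(100_000)
for k in range(preds.size):
    xi = rng.uniform(0.5, 2.5, n)                    # inputs for one data set
    z = h(theta_true, xi) + sigma * rng.standard_normal(n)
    theta_hat = (xi @ z) / (xi @ xi)                 # least-squares estimate
    preds[k] = h(theta_hat, x0)                      # prediction at fixed x0

target = h(theta_true, x0)                           # E(z | x0) for zero-mean noise
mse = np.mean((preds - target) ** 2)
decomposed = np.var(preds) + (np.mean(preds) - target) ** 2
print(mse, decomposed)                               # agree up to Monte Carlo error
```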
Unbiased Estimator May Not be Best
(Example 13.1 from ISSO)
• Estimator is unbiased if E[h(θ̂_n, x) | x] = E(z | x) (i.e., mean of prediction is same as mean of data z)
• Example: sample mean z̄_n as estimator of true mean θ (h(θ, x) = θ in notation above)
• Alternative biased estimator of θ is r·z̄_n, where 0 < r < 1
• MSE of biased and unbiased estimators generally satisfy

  E(r·z̄_n − θ)² < E(z̄_n − θ)²

• Biased estimate better in MSE sense
  – However, optimal value of r requires knowledge of unknown θ (true θ)
13-4
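A quick numerical check of Example 13.1's message (my sketch, not from ISSO): for z_i ~ N(θ, σ²), the MSE-optimal shrinkage is the standard formula r = θ²/(θ² + σ²/n), which indeed depends on the unknown θ.

```python
# Biased estimator r * zbar_n vs. unbiased sample mean zbar_n for z_i ~ N(theta, sigma^2).
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, reps = 1.0, 2.0, 10, 200_000

zbar = (theta + sigma * rng.standard_normal((reps, n))).mean(axis=1)
r = theta**2 / (theta**2 + sigma**2 / n)   # optimal r; requires the unknown theta
print(np.mean((zbar - theta) ** 2))        # unbiased MSE ~ sigma^2 / n = 0.4
print(np.mean((r * zbar - theta) ** 2))    # biased MSE ~ 0.286 (smaller)
```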
Bias-Variance Tradeoff in Model Selection in Simple Problem
13-5
Example 13.2 in ISSO: Bias-Variance Tradeoff
• Suppose true process produces output according to z = f(x) + noise, where f(x) = (x + x²)^1.1
• Compare linear, quadratic, and cubic approximations
• Table below gives average bias², variance, and MSE:

                     bias²    variance   Overall MSE
  Linear model       510.6      10.0        520.6
  Quadratic model      0.53     20.0         20.53
  Cubic model          0.005    30.0         30.005

• Decreasing bias comes at cost of increasing variance; optimal tradeoff is quadratic model
13-6
Model Selection
• The bias-variance tradeoff provides conceptual framework for determining a good model
  – Bias-variance tradeoff not directly useful
• Need a practical method for optimizing bias-variance tradeoff
• Practical aim is to pick a model that minimizes a criterion of the form:

  f₁(fitting error from given data) + f₂(model complexity)

  where f₁ and f₂ are increasing functions
• All methods based on a tradeoff between fitting error (high variance) and model complexity (low bias)
• Criterion above may/may not be explicitly used in given method
13-7
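As one concrete instance of the criterion above: AIC (listed on the next slide) adds a complexity penalty to a fitting-error term. A minimal sketch, assuming a least-squares model with Gaussian residuals so that the fitting-error term reduces to n·log(RSS/n) up to an additive constant:

```python
# AIC-style criterion: f1(fitting error) + f2(model complexity).
# Assumes Gaussian residuals from a least-squares fit with k free parameters.
import numpy as np

def aic(residuals: np.ndarray, k: int) -> float:
    n = residuals.size
    rss = float(residuals @ residuals)
    return n * np.log(rss / n) + 2 * k   # f1 = n log(RSS/n), f2 = 2k

# Usage: compute aic(z - model_predictions, k) for each candidate model;
# smaller is better.
```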
Methods for Model Selection
• Among many popular methods are:
  – Akaike Information Criterion (AIC) (Akaike, 1974)
    • Popular in time series analysis
  – Bayesian selection (Akaike, 1977)
  – Bootstrap-based selection (Efron and Tibshirani, 1997)
  – Cross-validation (Stone, 1974)
  – Minimum description length (Rissanen, 1978)
  – V-C dimension (Vapnik and Chervonenkis, 1971)
    • Popular in computer science
• Cross-validation appears to be most popular model fitting method
13-8
Cross-Validation
• Cross-validation is simple, general method for comparing candidate models
  – Other specialized methods may work better in specific problems
• Cross-validation uses the training set of data
• Method is based on iteratively partitioning the full set of training data into training and test subsets
• For each partition, estimate model from training subset and evaluate model on test subset
  – Number of training (or test) subsets = number of model fits required
• Select model that performs best over all test subsets
13-9
Choice of Training and Test Subsets
• Let n denote total size of data set, n_T denote size of test subset, n_T < n
• Common strategy is leave-one-out: n_T = 1
  – Implies n test subsets during cross-validation process
• Often better to choose n_T > 1
  – Sometimes more efficient (sampling w/o replacement)
  – Sometimes more accurate model selection
• If n_T > 1, sampling may be with or without replacement
  – “With replacement” indicates that there are “n choose n_T” test subsets
  – With replacement may be prohibitive in practice: e.g., n = 30, n_T = 6 implies nearly 600K model fits!
• Sampling without replacement reduces number of test subsets to n/n_T (disjoint test subsets)
13-10
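The counts above are easy to verify (standard combinatorics, nothing ISSO-specific):

```python
from math import comb

n, n_T = 30, 6
print(comb(n, n_T))   # 593775 possible test subsets "with replacement" (n choose n_T)
print(n // n_T)       # 5 disjoint test subsets when sampling without replacement
```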
Conceptual Example of Sampling Without Replacement: Cross-Validation with 3 Disjoint Test Subsets
13-11
Typical Steps for Cross-Validation
Step 0 (initialization): Determine size of test subsets and candidate model. Let i be counter for test subset being used.
Step 1 (estimation): For the i-th test subset, let the remaining data be the i-th training subset. Estimate θ from this training subset.
Step 2 (error calculation): Based on estimate for θ from Step 1 (i-th training subset), calculate MSE (or other measure) with data in i-th test subset.
Step 3 (new training and test subsets): Update i to i + 1 and return to Step 1. Form mean of MSE when all test subsets have been evaluated.
Step 4 (new model): Repeat Steps 1 to 3 for next model. Choose model with lowest mean MSE as best.
13-12
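A minimal Python sketch of Steps 0 to 4 with disjoint test subsets (sampling without replacement). The sine-plus-noise data and polynomial candidate models anticipate Example 13.4 on the next slide, but all constants here, and the use of numpy.polyfit as the fitting routine, are illustrative assumptions:

```python
# Cross-validation following Steps 0-4 with disjoint test subsets.
import numpy as np

rng = np.random.default_rng(2)
n, n_T = 30, 6                                   # Step 0: 5 disjoint test subsets
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, n))
z = np.sin(x) + 0.1 * rng.standard_normal(n)     # sine process plus noise

test_subsets = rng.permutation(n).reshape(n // n_T, n_T)

def cv_mse(degree):
    errors = []
    for test in test_subsets:                    # Steps 1-3: loop over subsets
        train = np.setdiff1d(np.arange(n), test)
        coef = np.polyfit(x[train], z[train], degree)   # Step 1: fit on training subset
        resid = z[test] - np.polyval(coef, x[test])     # Step 2: error on test subset
        errors.append(np.mean(resid ** 2))
    return np.mean(errors)                       # Step 3: mean MSE over subsets

for degree in (1, 3, 10):                        # Step 4: repeat for each model;
    print(degree, cv_mse(degree))                # choose lowest mean MSE as best
```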
Numerical Illustration of Cross-Validation (Example 13.4 in ISSO)
• Consider true system corresponding to a sine function of the input with additive normally distributed noise
• Consider three candidate models:
  – Linear (affine) model
  – 3rd-order polynomial
  – 10th-order polynomial
• Suppose 30 data points are available, divided into 5 disjoint test subsets (sampling w/o replacement)
• Based on RMS error (equiv. to MSE) over test subsets, 3rd-order polynomial is preferred
• See following plot
13-13
Numerical Illustration (cont’d): Relative Fits for 3 Models with Low-Noise Observations
[Plot: sine wave (process mean) with linear, 3rd-order, and 10th-order polynomial fits]
13-14
Fisher Information Matrix
• Fundamental role of data analysis is to extract information from data
• Parameter estimation for models is central to process of extracting information
• The Fisher information matrix plays a central role in parameter estimation as a measure of information
• Information matrix summarizes the amount of information in the data relative to the parameters being estimated
13-15
Problem Setting
• Consider the classical statistical problem of estimating parameter vector θ from n data vectors z₁, z₂, …, z_n
• Suppose we have a probability density and/or mass function associated with the data
• The parameters θ appear in the probability function and affect the nature of the distribution
  – Example: z_i ~ N(mean(θ), covariance(θ)) for all i
• Let ℓ(θ | z₁, z₂, …, z_n) represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of θ conditioned on the data
13-16
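As a standard concrete case (a textbook example, not ISSO's specific notation): if the z_i are i.i.d. N(θ, σ²) with σ² known, then ℓ(θ | z₁, …, z_n) = ∏ᵢ (2πσ²)^(−1/2) exp[−(z_i − θ)²/(2σ²)], so log ℓ = −(n/2) log(2πσ²) − Σᵢ (z_i − θ)²/(2σ²). This case is reused in the sketch following the information-matrix definition.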
Information Matrix — Definition

• Recall likelihood function ℓ(θ | z₁, z₂, …, z_n)
• Information matrix defined as

  F_n(θ) ≡ E[ (∂ log ℓ/∂θ) (∂ log ℓ/∂θ)ᵀ ]

  where expectation is w.r.t. z₁, z₂, …, z_n
• Equivalent form based on Hessian matrix:

  F_n(θ) = −E[ ∂² log ℓ / (∂θ ∂θᵀ) ]

• F_n(θ) is positive semidefinite of dimension p × p (p = dim(θ))
13-17
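A small numerical check of the definition (my illustration): for i.i.d. z_i ~ N(θ, σ²) with σ known, the score is ∂ log ℓ/∂θ = Σᵢ (z_i − θ)/σ², and the scalar information works out to F_n(θ) = n/σ². Averaging the squared score over simulated data should reproduce this value.

```python
# Monte Carlo check of F_n(theta) = E[(d log l / d theta)^2] for z_i ~ N(theta, sigma^2).
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n, reps = 0.5, 2.0, 25, 200_000

z = theta + sigma * rng.standard_normal((reps, n))
score = (z - theta).sum(axis=1) / sigma**2   # d log-likelihood / d theta, per data set
print(np.mean(score**2))                     # Monte Carlo estimate of F_n(theta)
print(n / sigma**2)                          # exact value: 25 / 4 = 6.25
```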
Information Matrix — Two Key Properties

• Connection of F_n(θ) to accuracy of θ̂_n rigorously specified via two famous results (θ* = true value of θ):

1. Asymptotic normality:

   √n (θ̂_n − θ*) → N(0, F̄⁻¹) in distribution, where F̄ ≡ lim_{n→∞} F_n(θ*)/n

2. Cramér-Rao inequality:

   cov(θ̂_n) ≥ F_n(θ*)⁻¹ for all n (for unbiased θ̂_n)

• Together these imply: “smaller” F_n(θ*) indicates less accurate θ̂_n (and vice versa)
13-18
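Continuing the Gaussian-mean case from the previous sketch: the sample mean is unbiased and attains the Cramér-Rao bound with equality, var(z̄_n) = σ²/n = F_n(θ)⁻¹.

```python
# The sample mean attains the Cramer-Rao bound: var(zbar_n) = sigma^2 / n.
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, n, reps = 0.5, 2.0, 25, 200_000

zbar = (theta + sigma * rng.standard_normal((reps, n))).mean(axis=1)
print(np.var(zbar))      # ~ 0.16
print(sigma**2 / n)      # Cramer-Rao bound F_n(theta)^{-1} = 4 / 25 = 0.16
```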
Selected Applications
• Information matrix is measure of performance for several applications. Four uses are:
1. Confidence regions for parameter estimation
   – Uses asymptotic normality and/or Cramér-Rao inequality
2. Prediction bounds for mathematical models
3. Basis for “D-optimal” criterion for experimental design
   – Information matrix serves as measure of how well θ can be estimated for a given set of inputs
4. Basis for “noninformative prior” in Bayesian analysis
   – Sometimes used for “objective” Bayesian inference
13-19
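A sketch of use 1 (my illustration, again for the Gaussian-mean case): an asymptotic 95% confidence interval built from the information matrix is θ̂_n ± 1.96·F_n⁻¹ᐟ².

```python
# 95% confidence interval for theta from the information matrix (use 1 above),
# for z_i ~ N(theta, sigma^2) with sigma known, so F_n = n / sigma^2.
import numpy as np

rng = np.random.default_rng(5)
theta, sigma, n = 0.5, 2.0, 25

z = theta + sigma * rng.standard_normal(n)
theta_hat = z.mean()
half_width = 1.96 / np.sqrt(n / sigma**2)   # 1.96 * F_n^{-1/2}
print(theta_hat - half_width, theta_hat + half_width)
```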