module 1 slides - Chemical Engineering


CHEE824
Nonlinear Regression Analysis
J. McLellan
Winter 2004
Module 1:
Linear Regression
Outline
• assessing systematic relationships
• matrix representation for multiple regression
• least squares parameter estimates
• diagnostics
  » graphical
  » quantitative
• further diagnostics
  » testing the need for terms
  » lack of fit test
• precision of parameter estimates, predicted responses
• correlation between parameter estimates
The Scenario
We want to describe the systematic relationship between a response variable and a number of explanatory variables - multiple regression.
We will consider the case which is linear in the parameters.
Assessing Systematic Relationships
Is there a systematic relationship?
Two approaches:
• graphical
  » scatterplots, casement plots
• quantitative
  » form correlations between the response and the explanatory variables
  » consider forming the correlation matrix - a table of pairwise correlations between the response and each explanatory variable, and between pairs of explanatory variables
• correlation between explanatory variables leads to correlated parameter estimates
Graphical Methods for Analyzing Data
Visualizing relationships between variables
Techniques
• scatterplots
• scatterplot matrices
  » also referred to as "casement plots"
• time sequence plots
Scatterplots
… are also referred to as "x-y diagrams"
• plot values of one variable against another
• look for systematic trend in the data
  » nature of the trend
    • linear?
    • exponential?
    • quadratic?
  » degree of scatter - does the spread increase/decrease over the range?
    • an indication that the variance isn't constant over the range of the data
Scatterplots - Example
• tooth discoloration data - discoloration vs. fluoride
[Scatterplot: DISCOLOR vs. FLUORIDE - trend: possibly nonlinear?]
Scatterplot - Example
• tooth discoloration data - discoloration vs. brushing
[Scatterplot: DISCOLOR vs. BRUSHING - significant trend? - doesn't appear to be present]
Scatterplot - Example
• tooth discoloration data - discoloration vs. brushing
[Scatterplot: DISCOLOR vs. BRUSHING - variance appears to decrease as # of brushings increases]
Scatterplot matrices
… are a table of scatterplots for a set of variables
Look for
  » systematic trend between "independent" variable and dependent variables - to be described by the estimated model
  » systematic trend between supposedly independent variables - indicates that these quantities are correlated
    • correlation can negatively influence model estimation results
    • not independent information
• scatterplot matrices can be generated automatically with statistical software, or manually using Excel
Scatterplot Matrices - tooth data
[Matrix plot: FLUORIDE, AGE, BRUSHING, DISCOLOR]
Time Sequence Plot
- for the naphtha 90% point - indicates the amount of heavy hydrocarbons present in gasoline range material
[Time sequence plot: naphtha 90% point (degrees F) vs. time - an excursion (sudden shift in operation), and meandering about the average operating point (time correlation in the data)]
What do dynamic data look like?
[Time series plot of industrial data: var1 and var2 vs. sample number]
Assessing Systematic Relationships
Quantitative Methods
• correlation
  » formal def'n plus sample statistic ("Pearson's r")
• covariance
  » formal def'n plus sample statistic
These provide a quantitative measure of systematic LINEAR relationships.
Covariance
Formal Definition
• given two random variables X and Y, the covariance is
  $$\mathrm{Cov}(X, Y) = E\{(X - \mu_X)(Y - \mu_Y)\}$$
• E{ } - expected value
• the sign of the covariance indicates the sign of the slope of the systematic linear relationship
  » positive value --> positive slope
  » negative value --> negative slope
• issue - covariance is SCALE DEPENDENT
Covariance
• motivation for covariance as a measure of systematic linear relationship
  » look at pairs of departures about the means of X and Y
[Sketch: two scatter plots with quadrant lines drawn through the mean of X, Y]
Correlation
• is the "dimensionless" covariance
  » divide the covariance by the standard dev'ns of X, Y
• formal definition
  $$\mathrm{Corr}(X, Y) = \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
• properties
  » dimensionless
  » range: $-1 \le \rho(X, Y) \le 1$
    • -1: strong linear relationship with negative slope
    • +1: strong linear relationship with positive slope
Note - the correlation gives NO information about the actual numerical value of the slope.
Estimating Covariance, Correlation
… from process data (with N pairs of observations)
Sample Covariance
$$R = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})$$
Sample Correlation
$$r = \frac{\frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{s_X s_Y}$$
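A minimal numerical sketch of these two statistics (Python; it uses the solder thickness data that appears later in this module, and the variable names are illustrative):

```python
import numpy as np

# Sample covariance and Pearson's r, matching the formulas above.
# Data: solder temperature (C) and thickness (microns), used later in the module.
x = np.array([245., 215., 218., 265., 251., 213., 234., 257., 244., 225.])
y = np.array([171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8])

N = len(x)
R = np.sum((x - x.mean()) * (y - y.mean())) / (N - 1)  # sample covariance
r = R / (x.std(ddof=1) * y.std(ddof=1))                # sample correlation
print(R, r)                                            # r is approx -0.92
```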
Making Inferences
The sample covariance and correlation are STATISTICS, and have their own probability distributions.
Confidence interval for the sample correlation
  » the following is approximately distributed as the standard normal random variable:
    $$\sqrt{N-3} \left( \tanh^{-1}(r) - \tanh^{-1}(\rho) \right)$$
  » derive confidence limits for $\tanh^{-1}(\rho)$, and convert to confidence limits for the true correlation using tanh
Confidence Interval for Correlation
Procedure
1. find $z_{\alpha/2}$ for the desired confidence level
2. the confidence interval for $\tanh^{-1}(\rho)$ is
   $$\tanh^{-1}(r) \pm \frac{z_{\alpha/2}}{\sqrt{N-3}}$$
3. convert the limits to confidence limits for the correlation by taking tanh of the limits in step 2
A hypothesis test can also be performed using this function of the correlation, comparing to the standard normal distribution.
Example - Solder Thickness
Objective - study the effect of temperature on solder thickness
Data - in pairs:

Solder Temperature (C)   Solder Thickness (microns)
245                      171.6
215                      201.1
218                      213.2
265                      153.3
251                      178.9
213                      226.6
234                      190.3
257                      171
244                      197.5
225                      209.8
Example - Solder Thickness
[Scatterplot: Solder Thickness (microns) vs. Solder Temperature (C)]

Correlation matrix:
                             Solder Temperature (C)   Solder Thickness (microns)
Solder Temperature (C)       1
Solder Thickness (microns)   -0.920001236             1
Example - Solder Thickness
Confidence Interval
$z_{\alpha/2}$ of 1.96 (95% confidence level)

limits in tanh^-1(rho):   -2.329837282   -0.848216548
limits in rho:            -0.981238575   -0.690136605
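A short sketch of this calculation in Python (a hypothetical script; the numbers reproduce the table above):

```python
import numpy as np

# Fisher z-transform confidence interval for the correlation,
# following the 3-step procedure; r and N are from the solder data.
r, N, z = -0.920001236, 10, 1.96           # sample correlation, sample size, z_{alpha/2}

center = np.arctanh(r)                     # tanh^{-1}(r)
half = z / np.sqrt(N - 3)
lo, hi = center - half, center + half      # limits in tanh^{-1}(rho)
print(lo, hi)                              # approx -2.3298, -0.8482
print(np.tanh(lo), np.tanh(hi))            # approx -0.9812, -0.6901 (limits in rho)
```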
Empirical Modeling - Terminology
• response
» “dependent” variable - responds to changes in other
variables
» the response is the characteristic of interest which we are
trying to predict
• explanatory variable
» “independent” variable, regressor variable, input, factor
» these are the quantities that we believe have an
influence on the response
• parameter
» coefficients in the model that describe how the
regressors influence the response
Models
When we are estimating a model from data, we consider the following form:
$$Y = f(X, \beta) + \varepsilon$$
with response Y, explanatory variables X, parameters β, and "random error" ε.
The Random Error Term
• is included to reflect the fact that measured data contain variability
  » successive measurements under the same conditions (values of the explanatory variables) are likely to be slightly different
  » this is the stochastic component
  » the functional form describes the deterministic component
  » random error is not necessarily the result of mistakes in experimental procedures - it reflects inherent variability
  » "noise"
Types of Models
• linear/nonlinear in the parameters
• linear/nonlinear in the explanatory variables
• number of response variables
– single response (standard regression)
– multi-response (or “multivariate” models)
From the perspective of statistical model-building,
the key point is whether the model is linear or
nonlinear in the PARAMETERS.
Linear Regression Models
• linear in the parameters, e.g.,
  $$T_{95} = b_1 T_{LGO} + b_2 T_{mid} + \varepsilon$$
• can be nonlinear in the regressors (e.g., with squared or square-root transformations of the temperatures) while remaining linear in the parameters $b_1$, $b_2$
Nonlinear Regression Models
• nonlinear in the parameters
  - e.g., the Arrhenius rate expression
    $$r = k_0 \exp\left(-\frac{E}{RT}\right)$$
    nonlinear in $E$; linear in $k_0$ (if $E$ is fixed)
Nonlinear Regression Models
• sometimes transformably linear
• start with
  $$r = k_0 \exp\left(-\frac{E}{RT}\right)$$
  and take ln of both sides to produce
  $$\ln(r) = \ln(k_0) - \frac{E}{RT}$$
  which is of the form
  $$Y = \beta_0 + \beta_1 \frac{1}{RT}$$
  linear in the parameters
Transformations
• note that linearizing the nonlinear equation by
transformation can lead to misleading estimates if the
proper estimation method is not used
• transforming the data can alter the statistical
distribution of the random error term
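A minimal sketch of the transformed fit (Python; synthetic data, not from the course notes - here the error is multiplicative lognormal, for which the log-transform is actually appropriate):

```python
import numpy as np

# Estimate ln(k0) and E by linear least squares on the log-transformed
# Arrhenius model: ln(r) = ln(k0) - E/(R*T). Synthetic data for illustration.
Rgas = 8.314                                 # J/(mol K)
k0_true, E_true = 1.0e7, 5.0e4
T = np.linspace(300.0, 400.0, 12)            # temperatures, K
rng = np.random.default_rng(0)
r = k0_true * np.exp(-E_true / (Rgas * T)) * np.exp(rng.normal(0, 0.05, T.size))

# Regress ln(r) on 1/(Rgas*T): intercept = ln(k0), slope = -E
A = np.column_stack([np.ones_like(T), 1.0 / (Rgas * T)])
beta, *_ = np.linalg.lstsq(A, np.log(r), rcond=None)
k0_hat, E_hat = np.exp(beta[0]), -beta[1]
print(k0_hat, E_hat)
```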
Ordinary LS vs. Multi-Response
• single response (ordinary least squares)
  $$T_{95} = b_1 T_{LGO} + b_2 T_{mid} + \varepsilon$$
• multi-response (e.g., Partial Least Squares)
  $$T_{95,LGO} = b_{11} T_{LGO} + b_{12} T_{mid} + \varepsilon_1$$
  $$T_{95,kero} = b_{21} T_{kero} + b_{22} T_{mid} + \varepsilon_2$$
  - issue - joint behaviour of responses, noise
We will be focussing on single response models.
Linear Multiple Regression
Model Equation
$$Y_i = \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i$$
where $Y_i$ is the i-th observation of the response (i-th data point), $X_{i1}, \ldots, X_{ip}$ are the i-th values of explanatory variables $X_1, \ldots, X_p$, and $\varepsilon_i$ is the random noise in the i-th observation of the response.
The intercept can be considered as corresponding to an X which always has the value "1".
Assumptions for Least Squares Estimation
Values of explanatory variables are known EXACTLY
» random error is strictly in the response variable
» practically - a random component will almost always be
present in the explanatory variables as well
» we assume that this component has a substantially
smaller effect on the response than the random
component in the response
» if random fluctuations in the explanatory variables are
important, consider alternative method (“Errors in
Variables” approach)
Assumptions for Least Squares Estimation
The form of the equation provides an adequate representation for the data
  » can test adequacy of the model as a diagnostic
Variance of the random error is CONSTANT over the range of data collected
  » e.g., the variance of random fluctuations in thickness measurements at high temperatures is the same as the variance at low temperatures
  » data are "heteroscedastic" if the variance is not constant - a different estimation procedure is required
  » thought - percentage error in instruments?
Assumptions for Least Squares Estimation
The random fluctuations in each measurement are statistically independent from those of other measurements
  » at same experimental conditions
  » at other experimental conditions
  » implies that the random component has no "memory"
  » no correlation between measurements
Random error term is normally distributed
  » typical assumption
  » not essential for least squares estimation
  » important when determining confidence intervals, conducting hypothesis tests
Least Squares Estimation - graphically
least squares - minimize the sum of squared prediction errors
[Sketch: response (solder thickness) vs. T - observations scattered about the deterministic "true" relationship; each prediction error is a "residual"]
More Notation and Terminology
Random error is "independent, identically distributed" (I.I.D.) -- can say that it is IID Normal
Capitals - Y - denote random variables
  - except in the case of explanatory variables, where capitals are used to denote the formal def'n
Lower case - y, x - denotes measured values of the variables
Model
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
Measurement
$$y = \beta_0 + \beta_1 x + \varepsilon$$
More Notation and Terminology
Estimate - denoted by a "hat"
  » examples - estimates of the response, parameter: $\hat{y}$, $\hat{\beta}_0$
Residual - difference between measured and predicted response
$$e = y - \hat{y}$$
Matrix Representation for Multiple Regression
We can arrange the observations in "tabular" form - a vector of observations, and a matrix of explanatory values:
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{N-1} \\ Y_N \end{bmatrix} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & & & \vdots \\ X_{N-1,1} & X_{N-1,2} & \cdots & X_{N-1,p} \\ X_{N,1} & X_{N,2} & \cdots & X_{N,p} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_{N-1} \\ \varepsilon_N \end{bmatrix}$$
Matrix Representation for Multiple Regression
The model is written as:
$$Y = X\beta + \varepsilon$$
where Y is an N×1 vector, X an N×p matrix, β a p×1 vector, and ε an N×1 vector.
N --> number of data observations
p --> number of parameters
Least Squares Parameter Estimates
We make the same assumptions as in the straight line
regression case:
» independent random noise components in each
observation
» explanatory variables known exactly - no randomness
» variance constant over experimental region (identically
distributed noise components)
Residual Vector
Given a set of parameter values $\tilde{\beta}$, the residual vector is formed from the matrix expression:
$$\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_{N-1} \\ e_N \end{bmatrix} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{N-1} \\ Y_N \end{bmatrix} - \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & & & \vdots \\ X_{N-1,1} & X_{N-1,2} & \cdots & X_{N-1,p} \\ X_{N,1} & X_{N,2} & \cdots & X_{N,p} \end{bmatrix} \begin{bmatrix} \tilde{\beta}_1 \\ \tilde{\beta}_2 \\ \vdots \\ \tilde{\beta}_p \end{bmatrix}$$
Sum of Squares of Residuals
… is the same as before, but can be expressed as the squared length of the residual vector:
$$SSE = \sum_{i=1}^{N} e_i^2 = \mathbf{e}^T \mathbf{e} = \|\mathbf{e}\|^2 = (Y - X\tilde{\beta})^T (Y - X\tilde{\beta})$$
Least Squares Parameter Estimates
Find the set of parameter values that minimize the sum of squares of residuals (SSE)
  » apply the necessary conditions for an optimum from calculus (stationary point):
    $$\frac{\partial}{\partial \beta}(SSE) = 0$$
  » system of N equations in p unknowns, with number of parameters < number of observations: over-determined system of equations
  » solution - the set of parameter values that comes "closest to satisfying all equations" (in a least squares sense)
Least Squares Parameter Estimates
The solution is:
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
$(X^T X)^{-1} X^T$ is the generalized matrix inverse of X - a generalization of the standard concept of the matrix inverse to the case of a non-square matrix.
Example - Solder Thickness
Let's analyze the data considered for the straight line case:

Solder Temperature (C)   Solder Thickness (microns)
245                      171.6
215                      201.1
218                      213.2
265                      153.3
251                      178.9
213                      226.6
234                      190.3
257                      171
244                      197.5
225                      209.8

Model:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
Example - Solder Thickness
In matrix form: $Y = X\beta + \varepsilon$
$$\begin{bmatrix} 171.6 \\ 201.1 \\ 213.2 \\ 153.3 \\ 178.9 \\ 226.6 \\ 190.3 \\ 171 \\ 197.5 \\ 209.8 \end{bmatrix} = \begin{bmatrix} 1 & 245 \\ 1 & 215 \\ 1 & 218 \\ 1 & 265 \\ 1 & 251 \\ 1 & 213 \\ 1 & 234 \\ 1 & 257 \\ 1 & 244 \\ 1 & 225 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \end{bmatrix}$$
Example - Solder Thickness
In order to calculate the least squares estimates:
$$X^T X = \begin{bmatrix} 10 & 2367 \\ 2367 & 563335 \end{bmatrix}; \qquad X^T Y = \begin{bmatrix} 1913.3 \\ 449422.5 \end{bmatrix}$$
Example - Solder Thickness
The least squares parameter estimates are obtained as:
$$\hat{\beta} = (X^T X)^{-1} X^T Y = \begin{bmatrix} 18.373 & -0.0772 \\ -0.0772 & 0.0003 \end{bmatrix} \begin{bmatrix} 1913.3 \\ 449422.5 \end{bmatrix} = \begin{bmatrix} 458.10 \\ -1.13 \end{bmatrix}$$
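A small sketch of this computation (Python/NumPy; the normal-equations form mirrors the slide, though `numpy.linalg.lstsq` is numerically preferable in practice):

```python
import numpy as np

# Least squares estimates for the solder thickness example:
# beta_hat = (X^T X)^{-1} X^T Y
temp = np.array([245., 215., 218., 265., 251., 213., 234., 257., 244., 225.])
thick = np.array([171.6, 201.1, 213.2, 153.3, 178.9, 226.6, 190.3, 171.0, 197.5, 209.8])

X = np.column_stack([np.ones_like(temp), temp])    # intercept column of 1's
beta_hat = np.linalg.solve(X.T @ X, X.T @ thick)   # normal equations
print(beta_hat)                                    # approx [458.10, -1.13]
```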
Example - Wave Solder Defects
(page 8-31, Course Notes)

Wave Solder Defects Data
Run   Conveyor Speed   Pot Temp   Flux Density   No. of Defects
1     -1               -1         -1             100
2      1               -1         -1             119
3     -1                1         -1             118
4      1                1         -1             217
5     -1               -1          1              20
6      1               -1          1              42
7     -1                1          1              41
8      1                1          1             113
9      0                0          0             101
10     0                0          0              96
11     0                0          0             115
Example - Wave Solder Defects
In matrix form: $Y = X\beta + \varepsilon$
$$\begin{bmatrix} 100 \\ 119 \\ 118 \\ 217 \\ 20 \\ 42 \\ 41 \\ 113 \\ 101 \\ 96 \\ 115 \end{bmatrix} = \begin{bmatrix} 1 & -1 & -1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \end{bmatrix}$$
Example - Wave Solder Defects
To calculate the least squares parameter estimates:
$$X^T X = \begin{bmatrix} 11 & 0 & 0 & 0 \\ 0 & 8 & 0 & 0 \\ 0 & 0 & 8 & 0 \\ 0 & 0 & 0 & 8 \end{bmatrix}; \qquad X^T Y = \begin{bmatrix} 1082 \\ 212 \\ 208 \\ -338 \end{bmatrix}$$
Example - Wave Solder Defects
Least squares parameter estimates:
$$\hat{\beta} = (X^T X)^{-1} X^T Y = \begin{bmatrix} \frac{1}{11} & 0 & 0 & 0 \\ 0 & \frac{1}{8} & 0 & 0 \\ 0 & 0 & \frac{1}{8} & 0 \\ 0 & 0 & 0 & \frac{1}{8} \end{bmatrix} \begin{bmatrix} 1082 \\ 212 \\ 208 \\ -338 \end{bmatrix} = \begin{bmatrix} 98.36 \\ 26.50 \\ 26.0 \\ -42.25 \end{bmatrix}$$
Examples - Comments
• if there are N runs, and the model has p parameters, $X^T X$ is a p×p matrix (smaller dimension than the number of runs)
• elements of $X^T Y$ are $\sum_i x_{ij} y_i$ for parameters j = 1, …, p
• in the Wave Solder Defects example, the values of the explanatory variables for the runs followed very specific patterns of -1 and +1, and $X^T X$ was a diagonal matrix
• in the Solder Thickness example, the values of the explanatory variable did not follow a specific pattern, and $X^T X$ was not diagonal
Graphical Diagnostics
Basic Principle - extract as much trend as possible from the data
Residuals should have no remaining trend
  » with respect to the explanatory variables
  » with respect to the data sequence number
  » with respect to other possible explanatory variables ("secondary variables")
  » with respect to predicted values
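A minimal plotting sketch for these checks (Python/matplotlib; `X` and `y` are an assumed design matrix and response):

```python
import numpy as np
import matplotlib.pyplot as plt

# Residual diagnostic plots for a fitted linear model:
# residuals vs. predicted values, and residuals vs. data sequence number.
def residual_plots(X, y):
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat

    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    axes[0].scatter(X @ beta_hat, e)
    axes[0].axhline(0.0)
    axes[0].set(xlabel="predicted value", ylabel="residual")
    axes[1].plot(e, marker="o")
    axes[1].axhline(0.0)
    axes[1].set(xlabel="sequence number", ylabel="residual")
    plt.show()
```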
Graphical Diagnostics
Residuals vs. Predicted Response Values
[Plot: residuals $e_i$ vs. predicted values $\hat{y}_i$ - DESIRED RESIDUAL PROFILE]
- even scatter over the range of prediction
- no discernible pattern
- roughly half the residuals are positive, half negative
Graphical Diagnostics
Residuals vs. Predicted Response Values
[Plot: residuals $e_i$ vs. predicted values $\hat{y}_i$ - RESIDUAL PROFILE WITH OUTLIERS]
- outlier lies outside the main body of residuals
Graphical Diagnostics
Residuals vs. Predicted Response Values
[Plot: residuals $e_i$ vs. predicted values $\hat{y}_i$ - NON-CONSTANT VARIANCE]
- variance of the residuals appears to increase with higher predictions
Graphical Diagnostics
Residuals vs. Explanatory Variables
  » ideal - no systematic trend present in plot
  » inadequate model - evidence of trend present
[Plot: residuals $e_i$ vs. x with a left-over quadratic trend - need a quadratic term in the model]
Graphical Diagnostics
Residuals vs. Explanatory Variables Not in Model
  » ideal - no systematic trend present in plot
  » inadequate model - evidence of trend present
[Plot: residuals $e_i$ vs. w with a systematic trend not accounted for in the model - include a linear term in "w"]
Graphical Diagnostics
Residuals vs. Order of Data Collection
[Plot: residuals $e_i$ drifting steadily with time t - failure to account for a time trend in the data]
[Plot: residuals $e_i$ in slow swings with time t - successive random noise components are correlated; consider a more complex model - a time series model for the random component?]
Quantitative Diagnostics - Ratio Tests
Residual Variance Test
  » is the variance of the residuals significant relative to the inherent noise variance?
  » same test as that for the straight line data
  » only distinction - the number of degrees of freedom for the Mean Squared Error => N-p, where p is the number of parameters in the model
  » compare the ratio to $F_{N-p, M-1, 0.95}$, where M is the number of data points used to estimate the inherent variance
  » significant? --> model is INADEQUATE
Quantitative Diagnostics - Ratio Tests
Residual Variance Ratio
$$\frac{s_{residuals}^2}{s_{inherent}^2} = \frac{\text{Mean Squared Error of Residuals }(MSE)}{s_{inherent}^2}$$
Mean Squared Error of Residuals (variance of the residuals):
$$s_{residuals}^2 = MSE = \frac{\sum_{i=1}^{N} e_i^2}{N - p}$$
Quantitative Diagnostics - Ratio Tests
Mean Square Regression Ratio
  » same as in the straight line case except for degrees of freedom
Variance described by the model:
$$MSR = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{p - 1}$$
Quantitative Diagnostics - Ratio Test
Test Ratio: $\frac{MSR}{MSE}$ is compared against $F_{p-1, N-p, 0.95}$
Conclusions?
- ratio is statistically significant --> significant trend HAS been modeled
- NOT statistically significant --> significant trend has NOT been modeled, and the model is inadequate in its present form
For the multiple regression case, this test is a coarse measure of whether some trend has been modeled - it provides no indication of which X's are important.
Analysis of Variance Tables
The ratio tests involve dissection of the sum of squares:
$$SSR = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 \qquad SSE = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \qquad TSS = \sum_{i=1}^{N} (y_i - \bar{y})^2 = SSR + SSE$$
Analysis of Variance (ANOVA) for Regression

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square       F-Value       p-value
Regression            p-1                  SSR              MSR = SSR/(p-1)   F = MSR/MSE   p
Residuals             N-p                  SSE              MSE = SSE/(N-p)
Total                 N-1                  TSS
Quantitative Diagnostics - R2
Coefficient of Determination ("R2 Coefficient")
  » square of the correlation between observed and predicted values:
    $$R^2 = [\mathrm{corr}(y, \hat{y})]^2$$
  » relationship to sums of squares:
    $$R^2 = 1 - \frac{SSE}{TSS} = \frac{SSR}{TSS}$$
  » values typically reported in "%", i.e., 100 R2
  » ideal - R2 near 100%
Issues with R2
• R2 is sensitive to extreme data points, resulting in a misleading indication of quality of fit
• R2 can be made artificially large by adding more parameters to the model
  » put a curve through every point - a "connect the dots" model --> simply modeling noise in the data, rather than trend
  » solution - define the "adjusted R2", which penalizes the addition of parameters to the model
Adjusted R2
Adjust for the number of parameters relative to the number of observations
  » account for degrees of freedom of the sums of squares
  » define in terms of Mean Squared quantities
$$R_{adj}^2 = 1 - \frac{MSE}{TSS/(N-1)} = 1 - \frac{SSE/(N-p)}{TSS/(N-1)}$$
  » want a value close to 1 (or 100%), as before
  » if N >> p, adjusted R2 is close to R2
  » provides a measure of agreement, but does not account for the magnitude of the residual error
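A short sketch of both quantities (Python; `X`, `y` as in the earlier examples):

```python
import numpy as np

# R^2 and adjusted R^2 for a linear model fit by least squares.
def r_squared(X, y):
    N, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    sse = e @ e
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / tss
    r2_adj = 1.0 - (sse / (N - p)) / (tss / (N - 1))
    return r2, r2_adj
```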
Testing the Need for Groups of Terms
In words: "Does a specific group of terms account for significant trend in the model?"
Test
  » compare the difference in residual variance between the full and reduced models
  » benchmark against an estimate of the inherent variation
  » if significant, conclude that the group of terms IS required
  » if not significant, conclude that the group of terms can be dropped from the model - not explaining significant trend
  » note that the remaining parameters should be re-estimated in this case
Testing the Need for Groups of Terms
Test:
A - denotes the full model (with all terms)
B - denotes the reduced model (group of terms deleted)
Form:
$$\frac{(SSE_{model\,B} - SSE_{model\,A})/(p_A - p_B)}{s^2}$$
pA, pB are the numbers of parameters in models A, B
s2 is an estimate of the inherent noise variance:
  » estimate as SSEA/(N - pA)
Testing the Need for Groups of Terms
Compare this ratio to
$$F_{p_A - p_B, \; \nu_{inherent}, \; 0.95}$$
  » if MSEA is used as the estimate of the inherent variance, then the degrees of freedom of the inherent variance estimate are N - pA
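A sketch of this test (Python/SciPy; `X_full` and `X_red` are assumed design matrices for models A and B, with common response `y`):

```python
import numpy as np
from scipy import stats

# Extra-sum-of-squares F-test for a group of terms.
def sse(X, y):
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    return e @ e

def group_f_test(X_full, X_red, y):
    N, pA = X_full.shape
    pB = X_red.shape[1]
    s2 = sse(X_full, y) / (N - pA)            # inherent variance estimate (MSE_A)
    F = (sse(X_red, y) - sse(X_full, y)) / ((pA - pB) * s2)
    p_value = stats.f.sf(F, pA - pB, N - pA)
    return F, p_value
```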
Lack of Fit Test
If we have replicate runs in our regression data set, we can break out the noise variance from the residuals, and assess the component of the residuals due to unmodelled trend.
Replicates
  » repeated runs at the SAME experimental conditions
  » note that all explanatory variables must be at fixed conditions
  » indication of inherent variance because no other factors are changing
  » measure of repeatability of experiments
Using Replicates
We can estimate the sample variance for each set of replicates, and pool the estimates of the variance
  » constancy of variance can be checked using Bartlett's test
  » constant variance is assumed for ordinary least squares estimation
For each replicate set "i", with $n_i$ values and average $\bar{y}_{i\cdot}$, we have:
$$s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2}{n_i - 1}$$
Using Replicates
The pooled estimate of variance is:
$$s_{pooled}^2 = \frac{\sum_{i=1}^{m} (n_i - 1)\, s_i^2}{\left(\sum_{i=1}^{m} n_i\right) - m}$$
i.e., convert back to sums of squares, and divide by the total number of degrees of freedom (the sum of the degrees of freedom for each variance estimate)
The Lack of Fit Test
Back to the sum of squares "block":
$$TSS = SSR + SSE, \qquad SSE = SSE_{LOF} + SSE_P$$
where $SSE_{LOF}$ is the "lack of fit" sum of squares and $SSE_P$ is the "pure error" sum of squares
The Lack of Fit Test
We partition the SSE into two components:
  » component due to inherent noise
  » component due to unmodeled trend
Pure error sum of squares (SSEP):
$$SSE_P = \sum_{i=1}^{m} \left( \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2 \right)$$
i.e., add together the sums of squares associated with each replicate group (there are "m" replicate groups in total)
The Lack of Fit Test
The "lack of fit" sum of squares (SSELOF) is formed by backing out SSEP from SSE:
$$SSE_{LOF} = SSE - SSE_P$$
Degrees of Freedom:
- for SSEP: $\left(\sum_{i=1}^{m} n_i\right) - m$
- for SSELOF: $(N - p) - \left[\left(\sum_{i=1}^{m} n_i\right) - m\right]$
The Lack of Fit Test
The test ratio:
$$\frac{MSE_{LOF}}{MSE_P} = \frac{SSE_{LOF}/\nu_{LOF}}{SSE_P/\nu_{Pure}}$$
Compare to $F_{\nu_{LOF}, \nu_{Pure}, 0.95}$
  » significant? - there is significant unmodeled trend, and the model should be modified
  » not significant? - there is no significant unmodeled trend, which supports model adequacy
Example - Wave Solder Defects
From the earlier regression, SSE = 2694.0 and SSR = 25306.5.
Replicate set - runs 9, 10, 11 at (0, 0, 0), with defect counts 101, 96, 115:
std. dev'n = 9.848858; sample var = 97; sum of squares (as (n_i - 1)s^2) = 194

LACK OF FIT TEST - ANOVA
Source       df   SS         MS         F          value from F-table (95% pt)
Residual     7    2694.045
LOF          5    2500.045   500.0091   5.154733   19.3 (this is F5,2,0.95)
Pure Error   2    194        97

This was done by hand - Excel has no Lack of Fit test.
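The same test as a Python sketch (SciPy assumed; the numbers reproduce the hand calculation above):

```python
import numpy as np
from scipy import stats

# Lack-of-fit test for the wave solder data (one replicate group).
sse, N, p = 2694.045, 11, 4                  # residual SS and dimensions from the fit
reps = np.array([101., 96., 115.])           # replicates at (0, 0, 0)

sse_p = np.sum((reps - reps.mean()) ** 2)    # pure error SS = 194
df_p = reps.size - 1                         # = 2
sse_lof = sse - sse_p                        # = 2500.045
df_lof = (N - p) - df_p                      # = 5

F = (sse_lof / df_lof) / (sse_p / df_p)      # approx 5.15
print(F, stats.f.ppf(0.95, df_lof, df_p))    # compare to F_{5,2,0.95} approx 19.3
```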
A Comment on the Ratio Tests
Order of preference (or "value") - from most definitive to least definitive:
• Lack of Fit Test -- MSELOF/MSEP
• MSE/s2inherent
• MSR/MSE
If at all possible, try to include replicate runs in your experimental program so that the Lack of Fit test can be conducted.
Many statistical software packages will perform the Lack of Fit test in their regression modules - Excel does NOT.
The Parameter Estimate Covariance Matrix
… summarizes the variance-covariance structure of the parameter estimates:
$$\begin{bmatrix} \mathrm{Var}(\hat{\beta}_1) & \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2) & \cdots & \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_p) \\ \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2) & \mathrm{Var}(\hat{\beta}_2) & \cdots & \mathrm{Cov}(\hat{\beta}_2, \hat{\beta}_p) \\ \vdots & & \ddots & \vdots \\ \mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_p) & \mathrm{Cov}(\hat{\beta}_2, \hat{\beta}_p) & \cdots & \mathrm{Var}(\hat{\beta}_p) \end{bmatrix}$$
Properties of the Covariance Matrix
• symmetric -- Cov(b1, b2) = Cov(b2, b1)
• diagonal entries are always non-negative
• off-diagonal entries can be +ve or -ve
• matrix is positive definite:
  $v^T \Sigma_{\hat{\beta}} \, v > 0$ for any nonzero vector v
Parameter Estimate Covariance Matrix
The covariance matrix of the parameter estimates is defined as:
$$\Sigma_{\hat{\beta}} = E\left\{ (\hat{\beta} - \beta)(\hat{\beta} - \beta)^T \right\}$$
Compare this expression with the variance for a single parameter:
$$\mathrm{Var}(\hat{\beta}) = E\{ (\hat{\beta} - \beta)^2 \}$$
For linear regression, the covariance matrix is obtained as:
$$\Sigma_{\hat{\beta}} = (X^T X)^{-1} \sigma^2$$
Parameter Estimate Covariance Matrix
Key point - the covariance structure of the parameter estimates is governed by the experimental run conditions used for the explanatory variables - the Experimental Design.
Example - the Wave Solder Defects data:
$$X^T X = \begin{bmatrix} 11 & 0 & 0 & 0 \\ 0 & 8 & 0 & 0 \\ 0 & 0 & 8 & 0 \\ 0 & 0 & 0 & 8 \end{bmatrix}; \qquad (X^T X)^{-1} = \begin{bmatrix} \frac{1}{11} & 0 & 0 & 0 \\ 0 & \frac{1}{8} & 0 & 0 \\ 0 & 0 & \frac{1}{8} & 0 \\ 0 & 0 & 0 & \frac{1}{8} \end{bmatrix}$$
The parameter estimates are uncorrelated, and the variances of the non-intercept parameters are the same - towards "uniform precision" of the parameter estimates.
Estimating the Parameter Covariance Matrix
The X matrix is known - the set of run conditions - so the only estimated quantity is the inherent noise variance
  » from replicates, external estimate, or MSE
For the wave solder defect data, the residual variance from the MSE is 384.86 with 7 degrees of freedom, and the parameter covariance matrix is:
$$\hat{\Sigma}_{\hat{\beta}} = (X^T X)^{-1} s_e^2 = \begin{bmatrix} \frac{1}{11} & 0 & 0 & 0 \\ 0 & \frac{1}{8} & 0 & 0 \\ 0 & 0 & \frac{1}{8} & 0 \\ 0 & 0 & 0 & \frac{1}{8} \end{bmatrix} (384.86) = \begin{bmatrix} 34.99 & 0 & 0 & 0 \\ 0 & 48.11 & 0 & 0 \\ 0 & 0 & 48.11 & 0 \\ 0 & 0 & 0 & 48.11 \end{bmatrix}$$
Using the Covariance Matrix
Variances of parameter estimates
  » are obtained from the diagonal of the matrix
  » square root is the standard dev'n, or "standard error", of the parameter estimates
    • use to formulate confidence intervals for the parameters
    • use in hypothesis tests for the parameters
Correlations between the parameter estimates
  » can be obtained by taking the covariance from the appropriate off-diagonal element, and dividing by the standard errors of the individual parameter estimates
Correlation of the Parameter Estimates
Note that
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$$
i.e., the parameter estimate for the intercept depends linearly on the slope!
  » the slope and intercept estimates are correlated
[Sketch: changing the slope changes the point of intersection with the axis, because the line must go through the centroid of the data]
Getting Rid of the Covariance
Let's define the explanatory variable as the deviation from its average:
$$Z = X - \bar{X}$$
- the average of z is zero
Least squares parameter estimates:
$$\hat{\beta}_0 = \bar{Y}, \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{N} z_i Y_i}{\sum_{i=1}^{N} z_i^2}$$
- note that now there is no explicit dependence on the slope value in the intercept expression
Getting Rid of the Covariance
In this form of the model, the slope and intercept parameter estimates are uncorrelated.
Why is lack of correlation useful?
  » allows independent decisions about parameter estimates
  » decide whether the slope is significant, the intercept is significant, individually
  » "unique" assignment of trend
    • intercept clearly associated with the mean of the y's
    • slope clearly associated with the steepness of the trend
  » correlation can be eliminated by altering the form of the model, and the choice of experimental points
Confidence Intervals for Parameters
… similar procedure to the straight line case:
  » given the standard error for a parameter estimate, use the appropriate t-value, and form the interval as:
    $$\hat{\beta}_i \pm t_{\nu, \alpha/2} \; s_{\hat{\beta}_i}$$
The degrees of freedom for the t-statistic come from the estimate of the inherent noise variance
  » the degrees of freedom will be the same for all of the parameter estimates
If the confidence interval contains zero, the parameter is plausibly zero and consideration should be given to deleting the term.
Hypothesis Tests for Parameters
… represent an alternative approach to testing whether a term should be retained in the model
Null hypothesis - parameter = 0
Alternate hypothesis - parameter is not equal to 0
Test statistic:
$$\frac{\hat{\beta}_i}{s_{\hat{\beta}_i}}$$
  » compare its absolute value to $t_{\nu, \alpha/2}$
  » if the test statistic is greater ("outside the fence"), the parameter is significant -- retain it
  » inside the fence? - consider deleting the term
Example - Wave Solder Defects Data
The test statistic will be compared to $t_{7, 0.025} = 2.365$, because the MSE is used to calculate the standard errors of the parameters, and has 7 degrees of freedom.
Test statistic for the intercept:
$$\frac{\hat{\beta}_0}{s_{\hat{\beta}_0}} = \frac{98.36}{\sqrt{34.99}} = 16.63$$
Since 16.63 > 2.365, conclude that the intercept parameter IS significant and should be retained.
Example - Wave Solder Defects Data
For the next term in the model:
$$\frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \frac{26.5}{\sqrt{48.11}} = 3.82 > 2.365$$
Therefore this term should be retained in the model.
Because the parameter estimates are uncorrelated in this model, terms can be dropped without the need to re-estimate the other parameters in the model -- in general, you will have to re-estimate the final model once more to obtain the parameter estimates corresponding to the final model form.
Example - Wave Solder Defects Data
From Excel:

                 Coefficients    Standard Error   t Stat     P-value     Lower 95%   Upper 95%
Intercept        98.36363636     5.915031978      16.62943   6.948E-07   84.376818   112.3505
Conveyor Speed   26.5            6.935989803      3.820652   0.0065367   10.099002   42.901
Pot Temp         26              6.935989803      3.748564   0.0071817   9.599002    42.401
Flux Density     -42.25          6.935989803      -6.09142   0.0004953   -58.651     -25.849

Standard Error - standard dev'ns of each parameter estimate
t Stat - test statistic for each parameter
P-value - prob. that a value is greater than the computed test ratio (2-tailed test!)
Lower/Upper 95% - confidence limits
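A sketch reproducing the core of this table (Python/SciPy; the quantities follow directly from $(X^T X)^{-1}$ and the MSE):

```python
import numpy as np
from scipy import stats

# Coefficients, standard errors, t statistics, p-values and 95% CIs
# for the wave solder model, from the covariance matrix (X^T X)^{-1} * MSE.
X = np.column_stack([np.ones(11),
                     [-1, 1, -1, 1, -1, 1, -1, 1, 0, 0, 0],   # conveyor speed
                     [-1, -1, 1, 1, -1, -1, 1, 1, 0, 0, 0],   # pot temp
                     [-1, -1, -1, -1, 1, 1, 1, 1, 0, 0, 0]])  # flux density
y = np.array([100., 119., 118., 217., 20., 42., 41., 113., 101., 96., 115.])

N, p = X.shape
beta = np.linalg.solve(X.T @ X, X.T @ y)
mse = np.sum((y - X @ beta) ** 2) / (N - p)            # 384.86, 7 df
se = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * mse)    # standard errors
t = beta / se                                          # t statistics
pvals = 2 * stats.t.sf(np.abs(t), N - p)               # 2-tailed p-values
tcrit = stats.t.ppf(0.975, N - p)                      # 2.365
print(beta, se, t, pvals, beta - tcrit * se, beta + tcrit * se)
```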
Precision of the Predicted Responses
The predicted response from an estimated model has uncertainty, because it is a function of the parameter estimates, which have uncertainty.
e.g., Wave Solder Defect Model - first response, at the point (-1, -1, -1):
$$\hat{y}_1 = \hat{\beta}_0 + \hat{\beta}_1(-1) + \hat{\beta}_2(-1) + \hat{\beta}_3(-1)$$
If the parameter estimates were uncorrelated, the variance of the predicted response would be:
$$\mathrm{Var}(\hat{y}_1) = \mathrm{Var}(\hat{\beta}_0) + \mathrm{Var}(\hat{\beta}_1) + \mathrm{Var}(\hat{\beta}_2) + \mathrm{Var}(\hat{\beta}_3)$$
(recall the results for the variance of a sum of random variables)
Precision of the Predicted Responses
In general, both the variances and covariances of the parameter estimates must be taken into account.
For prediction at the k-th data point:
$$\mathrm{Var}(\hat{y}_k) = \mathbf{x}_k^T (X^T X)^{-1} \mathbf{x}_k \, \sigma^2 = \begin{bmatrix} x_{k1} & x_{k2} & \cdots & x_{kp} \end{bmatrix} (X^T X)^{-1} \begin{bmatrix} x_{k1} \\ x_{k2} \\ \vdots \\ x_{kp} \end{bmatrix} \sigma^2$$
Example - Wave Solder Defects Model
In this example, the parameter estimates are uncorrelated
  » XTX is diagonal
  » the variance of the predicted response is in fact the sum of the variances of the parameter estimates
Variance of prediction at run #11 (0, 0, 0):
$$\mathrm{Var}(\hat{y}_{11}) = \mathrm{Var}(\hat{\beta}_0) + \mathrm{Var}(\hat{\beta}_1)(0) + \mathrm{Var}(\hat{\beta}_2)(0) + \mathrm{Var}(\hat{\beta}_3)(0) = \mathrm{Var}(\hat{\beta}_0)$$
Precision of "Future" Predictions
Suppose we want to predict the response at conditions other than those of the experimental runs --> a future run.
The value we observe will consist of the deterministic component, plus the noise component.
In predicting this value, we must consider:
  » uncertainty from our prediction of the deterministic component
  » the noise component
The variance of this future prediction is
$$\mathrm{Var}(\hat{y}_{future}) + \sigma^2$$
where $\mathrm{Var}(\hat{y}_{future})$ is computed using the same expression as for the variance of predicted responses at experimental run conditions.
Estimating Precision of Predicted Responses
Use an estimate of the inherent noise variance:
$$s_{\hat{y}_k}^2 = \mathbf{x}_k^T (X^T X)^{-1} \mathbf{x}_k \; s_e^2$$
The degrees of freedom for the estimated variance of the predicted response are those of the estimate of the noise variance
  » replicates, external estimate, MSE
Confidence Limits for Predicted Responses
Follow an approach similar to that for the parameters - the 100(1-alpha)% confidence limits for the predicted response at the k-th run are:
$$\hat{y}_k \pm t_{\nu, \alpha/2} \; s_{\hat{y}_k}$$
  » degrees of freedom are those of the inherent noise variance estimate
If the prediction is for a response at conditions OTHER than one of the experimental runs, the limits are:
$$\hat{y}_k \pm t_{\nu, \alpha/2} \sqrt{s_{\hat{y}_k}^2 + s_e^2}$$
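Both interval types as a Python sketch (continuing the wave solder example; `x_k` includes the leading 1 for the intercept, and the commented calls are illustrative):

```python
import numpy as np
from scipy import stats

# Confidence limits for a predicted response, at an experimental run
# and at a new ("future") condition. X, beta, mse as computed earlier.
def prediction_limits(X, beta, mse, dof, x_k, future=False, alpha=0.05):
    y_hat = x_k @ beta
    var_pred = x_k @ np.linalg.inv(X.T @ X) @ x_k * mse   # Var(y_hat_k)
    if future:
        var_pred += mse                                   # add the noise variance
    half = stats.t.ppf(1 - alpha / 2, dof) * np.sqrt(var_pred)
    return y_hat - half, y_hat + half

# e.g., limits at run #11 (0,0,0) and at an assumed new condition:
# prediction_limits(X, beta, mse, 7, np.array([1., 0., 0., 0.]))
# prediction_limits(X, beta, mse, 7, np.array([1., 0.5, -0.5, 0.]), future=True)
```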
Practical Guidelines for Model Development
1) Consider CODING your explanatory variables
Coding - one standard form:
$$\tilde{x}_i = \frac{x_i - \bar{x}_i}{\mathrm{range}(x_i)/2}$$
  » places a designed experiment into +1, -1 form
  » if the run conditions are from an experimental design, this coding must be used in order to obtain all of the benefits from the design - uncorrelated parameter estimates
  » if the conditions are not from an experimental design, such a coding improves the numerical conditioning of the problem -- similar numerical scales for all variables
Practical Guidelines for Model Development
2) Types of models
  » linear in the explanatory variables
  » linear with two-factor interactions (xi xj)
  » general polynomials
3) Watch for collinearity in the X matrix - run condition patterns for two or more explanatory variables are almost the same
  » prevents clear assignment of trend to each factor
  » shows up as (near-)singularity of the XTX matrix
  » associated with very strong correlation between parameter estimates
Practical Guidelines for Model Development
4) Be careful not to extrapolate excessively beyond the range of the data
5) Maximum number of parameters that can be fit to a data set = number of unique run conditions:
$$N - \left[\left(\sum_{i=1}^{m} n_i\right) - m\right]$$
  » N - number of data points
  » m - number of replicate sets
  » ni - number of points in replicate set "i"
  » as the number of parameters increases, the precision of the predictions decreases - you start modeling noise
Practical Guidelines for Model Development
6) Model building sequence
  » "building" approach - start with few terms and add as necessary
  » "pruning" approach - start with more terms and remove those which aren't statistically significant
  » stepwise regression - terms are added, and retained, according to some criterion - frequently R2
    • uncorrelated? criterion?
  » "all subsets" regression - consider all subsets of model terms of a certain type, and select the model with the best criterion
    • significant computational load
Polynomial Models
Order - the maximum, over the p terms in the model, of the sum of the exponents in a given term
e.g.,
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^2 + \beta_3 x_1^2 x_2^3 + \varepsilon$$
is a fifth-order model
Two-factor interaction
  » product term - x1x2
  » implies that the impact of x1 on the response depends on the value of x2
Polynomial Models
Comments
  » polynomial models can sometimes suffer from collinearity problems - coding helps this
  » polynomials can provide approximations to nonlinear functions - think of Taylor series approximations
  » high-order polynomial models can sometimes be replaced by fewer nonlinear function terms
    • e.g., ln(x) vs. 3rd order polynomial
… answers the question
Where do the true values of the parameters lie?
Recall that for individual parameters, we gain an understanding of
where the true value lies by:
» examining the variability pattern (distribution) for the
parameter estimate
» identify a range in which most of the values of the
parameter estimate are likely to lie
» manipulate this range to determine an interval which is
likely to contain the true value of the parameter
111
Joint Confidence Region
Confidence interval for individual parameter:
Step 1) The ratio of the estimate to its standard deviation is
distributed as a Student’s t-distribution with degrees of freedom
equal to that of the standard devn of the variance estimate
i  i
~ t
s
i
Step 2) Find interval [ t , / 2 , t , / 2 ] which contains 100(1   )%
of values -i.e., probability of a t-value falling in this interval is (1  )
 t

 , / 2 si
Step 3) Rearrange this interval to obtain interval i
which contains true value of parameter 100(1   )% of the time
112
Joint Confidence Region
Comments on Individual Confidence Intervals:
» sometimes referred to as marginal confidence intervals cf. marginal distributions vs. joint distributions from earlier
» marginal confidence intervals do NOT account for
correlations between the parameter estimates
» examining only marginal confidence intervals can
sometimes be misleading if there is strong correlation
between several parameter estimates
• value of one parameter estimate depends in part on anther
• deletion of the other changes the value of the parameter
estimate
• decision to retain might be altered
113
Joint Confidence Region
Sequence:
Step 1) Identify a statistic which is a function of the parameter estimate statistics
Step 2) Identify a region in which values of this statistic lie a certain fraction of the time (a $100(1-\alpha)\%$ region)
Step 3) Use this information to determine a region which contains the true value of the parameters $100(1-\alpha)\%$ of the time
The quantity
(    )T XT X(    )
p
s2

estimate of
inherent
noise variance
(if MSE is used, degrees of freedom is n-p)
~ Fp,n  p
is the ratio of two sums of squares, and is distributed as an Fdistribution with p degrees of freedom in the numerator, and n-p
degrees of freedom in the denominator
115
Joint Confidence Region
We can define a region by thinking of those values of the ratio which have a value less than $F_{p, n-p, 1-\alpha}$, i.e.,
$$\frac{(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta)}{p \, s^2} \le F_{p, n-p, 1-\alpha}$$
Rearranging yields:
$$(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) \le p \, s^2 \, F_{p, n-p, 1-\alpha}$$
Joint Confidence Region - Definition
The $100(1-\alpha)\%$ joint confidence region for the parameters is defined as those parameter values $\beta$ satisfying:
$$(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) \le p \, s^2 \, F_{p, n-p, 1-\alpha}$$
Interpretation:
  » the region defined by this inequality contains the true values of the parameters $100(1-\alpha)\%$ of the time
  » if values of zero for one or more parameters lie in this region, those parameters are plausibly zero, and consideration should be given to dropping the corresponding terms from the model
Joint Confidence Region - Example with 2 Parameters
Let's reconsider the solder thickness example:
$$X^T X = \begin{bmatrix} 10 & 2367 \\ 2367 & 563335 \end{bmatrix}; \qquad \hat{\beta} = \begin{bmatrix} 458.10 \\ -1.13 \end{bmatrix}; \qquad s^2 = 135.38$$
95% Joint Confidence Region (JCR) for the slope and intercept:
$$(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta) = \begin{bmatrix} 458.10 - \beta_0 & -1.13 - \beta_1 \end{bmatrix} X^T X \begin{bmatrix} 458.10 - \beta_0 \\ -1.13 - \beta_1 \end{bmatrix} \le 2 \, s^2 \, F_{2, 10-2, 0.95}$$
Joint Confidence Region - Example with 2 Parameters
95% Joint Confidence Region (JCR) for the slope and intercept:
$$\begin{bmatrix} 458.10 - \beta_0 & -1.13 - \beta_1 \end{bmatrix} X^T X \begin{bmatrix} 458.10 - \beta_0 \\ -1.13 - \beta_1 \end{bmatrix} \le 2(135.38) F_{2,8,0.95} = 2(135.38)(4.46) = 1207.59$$
The boundary is an ellipse…
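A sketch that traces this ellipse numerically (Python/matplotlib; the eigen-decomposition of $X^T X$ gives the ellipse axes):

```python
import numpy as np
import matplotlib.pyplot as plt

# Boundary of the 95% joint confidence region:
# (beta_hat - beta)^T X^T X (beta_hat - beta) = p * s^2 * F
XtX = np.array([[10., 2367.], [2367., 563335.]])
beta_hat = np.array([458.10, -1.13])
rhs = 2 * 135.38 * 4.46                      # p * s^2 * F_{2,8,0.95} = 1207.59

vals, vecs = np.linalg.eigh(XtX)             # ellipse axes from eigenvectors
t = np.linspace(0, 2 * np.pi, 400)
circle = np.vstack([np.cos(t), np.sin(t)])
ellipse = beta_hat[:, None] + vecs @ (np.sqrt(rhs / vals)[:, None] * circle)

plt.plot(ellipse[0], ellipse[1])
plt.xlabel("Intercept"); plt.ylabel("Slope")
plt.show()
```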
Joint Confidence Region - Example with 2 Parameters
[Plot: elliptical 95% JCR, intercept from 320 to 600 on the horizontal axis, slope from -1.6 to -0.6 on the vertical axis]
- rotated - implies correlation between the estimates of slope and intercept
- region centred at the least squares parameter estimates
- greater "shadow" along the horizontal axis --> the variance of the intercept estimate is greater than that of the slope
Interpreting Joint Confidence Regions
1) Are the axes of the ellipse aligned with the coordinate axes?
  » is the ellipse horizontal or vertical?
  » alignment indicates no correlation between the parameter estimates
2) Which axis has the greatest shadow?
  » projection of the ellipse along an axis
  » indicates which parameter estimate has the greatest variance
3) The elliptical region is, by definition, centred at the least squares parameter estimates
4) Long, narrow, rotated ellipses indicate significant correlation between parameter estimates
5) If a value of zero for one or more parameters lies in the region, these parameters are plausibly zero - consider deleting them from the model
Joint Confidence Regions
What is the motivation for the ratio
$$\frac{(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta)}{p \, s^2}$$
used to define the joint confidence region?
Consider the joint distribution for the parameter estimates:
$$\frac{1}{(2\pi)^{p/2} \sqrt{\det(\Sigma_{\hat{\beta}})}} \exp\left\{ -\frac{1}{2} (\hat{\beta} - \beta)^T \Sigma_{\hat{\beta}}^{-1} (\hat{\beta} - \beta) \right\}$$
Substituting in the estimate for the parameter covariance matrix:
$$(\hat{\beta} - \beta)^T \left( (X^T X)^{-1} s^2 \right)^{-1} (\hat{\beta} - \beta) = \frac{(\hat{\beta} - \beta)^T X^T X (\hat{\beta} - \beta)}{s^2}$$
Confidence Intervals from Densities
[Sketch: individual interval - univariate density $f_{\hat{\beta}}(b)$ with lower and upper limits enclosing area = 1-alpha; joint region - bivariate density $f_{\hat{\beta}_0 \hat{\beta}_1}(b_0, b_1)$ whose contour encloses volume = 1-alpha, the Joint Confidence Region]
Relationship to Marginal Confidence Limits
[Plot: the JCR ellipse (intercept 320 to 600, slope -1.6 to -0.6), centred at the least squares parameter estimates, with the marginal confidence intervals for slope and intercept marked along the axes]
Relationship to Marginal Confidence Limits
[Plot: the same ellipse, comparing the 95% confidence region for the parameters considered jointly with the rectangular 95% region implied by considering the parameters individually (the marginal confidence intervals for slope and intercept)]
Relationship to Marginal Confidence Intervals
Marginal confidence intervals are contained in the joint confidence region
  » potential to miss portions of plausible parameter values at the tails of the ellipsoid
  » using individual confidence intervals implies a rectangular region, which includes sets of parameter values that lie outside the joint confidence region
  » both situations can lead to
    • erroneous acceptance of terms in the model
    • erroneous rejection of terms in the model
Going Further - Nonlinear Regression Models
Model:
$$Y_i = f(\mathbf{x}_i, \beta) + \varepsilon_i$$
with explanatory variables $\mathbf{x}_i$, parameters $\beta$, and random noise component $\varepsilon_i$.
Estimation Approach:
  » linearize the model with respect to the parameters
  » treat the linearization as a linear regression problem
  » iterate by repeating linearization/estimation/linearization about the new estimates, … until convergence to parameter values - Gauss-Newton iteration - or solve a numerical optimization problem
Interpretation - Columns of X
- values of a given variable at different operating points
- entries in XTX
  » dot products of vectors of regressor variable values
  » related to correlation between regressor variables
- form of XTX is dictated by the experimental design
  • e.g., 2k design - diagonal form
Parameter Estimation - Graphical View
[Sketch: the observation vector y is approximated by $\hat{y}$ in the plane spanned by the columns of X; the residual vector joins the approximation to the observations]
Parameter Estimation - Nonlinear Regression Case
[Sketch: the same picture, but the approximating observation vector lies on a curved model surface]
Properties of LS Parameter Estimates
Key Point - parameter estimates are random variables
  » because of how stochastic variation in the data propagates through the estimation calculations
  » parameter estimates have a variability pattern - probability distribution and density functions
Unbiased:
$$E\{\hat{\beta}\} = \beta$$
  » the "average" of repeated data collection / estimation sequences will be the true value of the parameter vector
Properties of Parameter Estimates
Consistent
  » behaviour as the number of data points tends to infinity
  » with probability 1,
    $$\lim_{N \to \infty} \hat{\beta} = \beta$$
  » the distribution narrows as N becomes large
Efficient
  » the variance of the least squares estimates is less than that of other types of parameter estimates
Properties of Parameter Estimates
Covariance Structure
  » summarized by the variance-covariance matrix:
$$\mathrm{Cov}(\hat{\beta}) = (X^T X)^{-1} \sigma^2$$
(structure dictated by the experimental design; $\sigma^2$ is the variance of the noise)
For the straight line case:
$$\mathrm{Cov}(\hat{\beta}) = \begin{bmatrix} \mathrm{Var}(\hat{\beta}_0) & \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) \\ \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) & \mathrm{Var}(\hat{\beta}_1) \end{bmatrix}$$
Prediction Variance
… in matrix form:
$$\mathrm{var}(\hat{y}_k) = \mathbf{x}_k^T (X^T X)^{-1} \mathbf{x}_k \, \sigma^2$$
where $\mathbf{x}_k$ is the vector of conditions at the k-th data point
Joint Confidence Regions
Variability in the data can affect the parameter estimates jointly, depending on the structure of the data and model.
[Sketch: elliptical section of the sum of squares (or likelihood) function in the $(\beta_1, \beta_2)$ plane, with the marginal confidence limits marked]