Prediction and Lack of Fit in Regression


Prediction, Correlation, and Lack of Fit in Regression (§11.4, 11.5, 11.7)
Outline
• Confidence intervals and prediction intervals
• Regression assumptions
• Checking assumptions (model adequacy)
• Correlation
• Influential observations
Prediction
Our regression model is

Y = β0 + β1·X + ε,   ε ~ N(0, σ²),

so that the average value of the response at X = x is

E[y_x] = β0 + β1·x
Example data: number of components (xi) and repair time (yi), 14 cases.

 i   xi    yi
 1    1    23
 2    2    29
 3    4    64
 4    4    72
 5    4    80
 6    5    87
 7    6    96
 8    6   105
 9    8   127
10    8   119
11    9   145
12    9   149
13   10   165
14   10   154
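The least-squares fit for these data can be checked with a short computation from the sums Sxx and Sxy; a minimal sketch in plain Python (the variable names are my own):

```python
# Repair-time data from the table above
x = [1, 2, 4, 4, 4, 5, 6, 6, 8, 8, 9, 9, 10, 10]
y = [23, 29, 64, 72, 80, 87, 96, 105, 127, 119, 145, 149, 165, 154]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx          # slope estimate (beta1-hat)
b0 = ybar - b1 * xbar   # intercept estimate (beta0-hat)
print(round(b0, 3), round(b1, 4))   # -> 7.711 15.1982
```

The result matches the fitted line Y = 7.71100 + 15.1982 X shown in the regression plot at the end of these slides.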
The estimated average response at X = x is therefore

Ê[y_x] = ŷ_x = β̂0 + β̂1·x   (the expected value)

This quantity is a statistic, a random variable; hence it has a sampling distribution.
Sample estimate (under the regression assumption of normally distributed ε) and its associated variance:

ŷ_x = β̂0 + β̂1·x

Var[ŷ_x] = σ̂² · (1/n + (x − x̄)²/Sxx),   σ̂² = MSE

A (1 − α)100% CI for the average response at X = x is therefore:

ŷ_x ± t_{n−2, α/2} · √Var[ŷ_x]
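Applied to the repair-time data, this interval can be computed directly; a minimal sketch in plain Python, where x0 = 5 is an illustrative choice and the critical value t_{12, 0.025} ≈ 2.179 is hardcoded from a t-table:

```python
import math

# Repair-time data (14 cases)
x = [1, 2, 4, 4, 4, 5, 6, 6, 8, 8, 9, 9, 10, 10]
y = [23, 29, 64, 72, 80, 87, 96, 105, 127, 119, 145, 149, 165, 154]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

MSE = (Syy - b1 * Sxy) / (n - 2)   # sigma-hat squared
t_crit = 2.179                     # t_{12, 0.025} from a t-table

x0 = 5                             # illustrative value of X
yhat = b0 + b1 * x0
se_mean = math.sqrt(MSE * (1 / n + (x0 - xbar) ** 2 / Sxx))
ci = (yhat - t_crit * se_mean, yhat + t_crit * se_mean)
print(f"95% CI for the mean response at x = 5: ({ci[0]:.1f}, {ci[1]:.1f})")
```

Note that √MSE here reproduces the S = 6.43260 reported in the regression plot at the end of these slides.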
Prediction and Predictor Confidence

The best predictor of an individual response y at X = x, ŷ_x,pred, is simply the estimated average response at X = x:

ŷ_x,pred = β̂0 + β̂1·x

These are random variables; they vary from sample to sample, hence the predicted value is also a random variable, with

Var[ŷ_x,pred] = σ̂² · (1 + 1/n + (x − x̄)²/Sxx)

A (1 − α)100% prediction interval for an individual response at X = x:

ŷ_x,pred ± t_{n−2, α/2} · √Var[ŷ_x,pred]

The variance associated with an individual prediction is larger than that for the mean value. Why? Because it includes both the uncertainty in the estimated mean response and the variance σ² of a single new observation around that mean.
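The same computation as for the CI applies, with the extra 1 inside the parentheses; a sketch in plain Python (again with the illustrative x0 = 5 and a hardcoded t_{12, 0.025} ≈ 2.179), showing that the prediction interval is wider:

```python
import math

# Repair-time data (14 cases)
x = [1, 2, 4, 4, 4, 5, 6, 6, 8, 8, 9, 9, 10, 10]
y = [23, 29, 64, 72, 80, 87, 96, 105, 127, 119, 145, 149, 165, 154]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
MSE = (Syy - b1 * Sxy) / (n - 2)

t_crit = 2.179                     # t_{12, 0.025} from a t-table
x0 = 5                             # illustrative value of X
yhat = b0 + b1 * x0

# Standard errors for the mean response and for an individual prediction
se_mean = math.sqrt(MSE * (1 / n + (x0 - xbar) ** 2 / Sxx))
se_pred = math.sqrt(MSE * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))
pi = (yhat - t_crit * se_pred, yhat + t_crit * se_pred)
print(f"95% PI at x = 5: ({pi[0]:.1f}, {pi[1]:.1f}); "
      f"se_pred = {se_pred:.2f} > se_mean = {se_mean:.2f}")
```

The two variances differ by exactly σ̂² = MSE, the contribution of the single new observation.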
Prediction band: what we would expect for one new observation.

Confidence band: what we would expect for the mean of many observations taken at the value X = x.
Regression Assumptions and Lack of Fit

Regression model: yi = E[yi] + εi

Assumptions:
• Effect additivity (multiple regression)
• Normality of the residuals
• Homoscedasticity of the residuals
• Independence of the residuals
Additivity
Additivity assumption:

E(yi) = β0 + β1·xi

“The expected value of an observation is a weighted linear combination of a number of factors.”

Which factors? (model uncertainty)
• number of factors in the model
• interactions of factors
• powers or transformations of factors
Homoscedasticity and Normality
Observations never equal their expected values:

yi = E[yi] + εi,   E[εi] = 0

No systematic biases.

Homoscedasticity assumption:

Var[εi] = σ²

The unexplained component has a common variance for all i.

Normality assumption:

εi ~ N(0, σ²)

The unexplained component has a normal distribution.
Independence
yi = E[yi] + εi

Independence assumption:

Corr(εi, εj) = Corr(yi, yj) = 0, for i ≠ j.

Responses in one experimental unit are not correlated with, affected by, or related to responses for other experimental units.
Correlation Coefficient
A measure of the strength of the linear relationship between two variables.

Product-moment correlation coefficient:

corr(x, y) = r = Sxy / √(Sxx·Syy)

In SLR, r is related to the slope of the fitted regression equation:

r = β̂1 · √(Sxx/Syy)

r² (or R²) represents the proportion of the total variability of the Y-values that is accounted for by the linear regression on the independent variable X:

r² = Sxy² / (Sxx·Syy)

R² = SSR/TSS: the proportion of variability in Y explained by X.
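For the repair-time data, r and r² can be computed from the same sums used for the fitted line; a minimal sketch in plain Python:

```python
import math

# Repair-time data (14 cases)
x = [1, 2, 4, 4, 4, 5, 6, 6, 8, 8, 9, 9, 10, 10]
y = [23, 29, 64, 72, 80, 87, 96, 105, 127, 119, 145, 149, 165, 154]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = Sxy / math.sqrt(Sxx * Syy)
print(round(r, 3), round(r * r, 3))   # -> 0.991 0.981
```

The r² value matches the R-Sq = 98.1% reported in the regression plot at the end of these slides.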
Properties of r

1. r lies between −1 and +1.
   r > 0 indicates a positive linear relationship.
   r < 0 indicates a negative linear relationship.
   r = 0 indicates no linear relationship.
   r = ±1 indicates a perfect linear relationship.
2. The larger the absolute value of r, the stronger the linear relationship.
3. r² lies between 0 and 1.
Checking Assumptions

How well does the model fit? Do predicted values seem to be placed in the middle of observed values? Do residuals satisfy the regression assumptions? (Problems seen in a plot of X vs. Y will be reflected in the residual plot.)
• Constant variance?
• Regularities suggestive of lack of independence or of a more complex model?
• Poorly fit observations?
Model Adequacy

Studentized residuals (ei):

ei = ε̂i / √(MSE(i) · (1 − hi))

These let us gauge whether a residual is too large. A studentized residual has approximately a standard normal distribution, hence it is very unlikely that any studentized residual will fall outside the range [−3, 3].

MSE(i) is the MSE calculated leaving observation i out of the computations. hi is the ith diagonal of the projection matrix for the predictor space (the ith hat diagonal element).
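For simple linear regression the hat diagonals have a closed form, hi = 1/n + (xi − x̄)²/Sxx, and MSE(i) can be obtained from the full-data MSE without refitting; a sketch in plain Python for the repair-time data (the leave-one-out identity used here is the standard one, not something stated on these slides):

```python
import math

# Repair-time data (14 cases); p = 2 parameters in SLR
x = [1, 2, 4, 4, 4, 5, 6, 6, 8, 8, 9, 9, 10, 10]
y = [23, 29, 64, 72, 80, 87, 96, 105, 127, 119, 145, 149, 165, 154]
n, p = len(x), 2

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
MSE = sum(e * e for e in resid) / (n - p)

# Hat diagonals (leverages) for simple linear regression
h = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]

# Externally studentized residuals: MSE(i) via the leave-one-out identity
t_stud = []
for e, hi in zip(resid, h):
    mse_i = ((n - p) * MSE - e * e / (1 - hi)) / (n - p - 1)
    t_stud.append(e / math.sqrt(mse_i * (1 - hi)))

print([round(t, 2) for t in t_stud])
```

A useful sanity check is that the hat diagonals sum to p, the number of regression parameters.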
Normality of residuals

Formal goodness-of-fit tests:
• Kolmogorov–Smirnov test
• Shapiro–Wilk test (n < 50)
• D’Agostino’s test (n ≥ 50)
All are quite conservative; they fail to reject the hypothesis of normality more often than they should.

Graphical approach: the quantile–quantile plot (qq-plot).
1. Compute and sort the simple residuals: e[1] ≤ e[2] ≤ … ≤ e[n].
2. Associate to each residual a standard normal quantile, z[i] = Φ⁻¹((i − 0.5)/n) (normsinv((i − 0.5)/n) in a spreadsheet).
3. Plot z[i] versus e[i] and compare to the 45° line.
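The quantile step can be done with the standard library's inverse normal CDF (NormalDist.inv_cdf plays the role of normsinv); a sketch in plain Python using a made-up set of residuals for illustration:

```python
from statistics import NormalDist

# Illustrative residuals (made up, not the repair-time ones)
resid = [4.2, -1.3, 0.8, -6.0, 2.7, 1.1, -0.4, 3.5]
n = len(resid)

e_sorted = sorted(resid)  # e[1] <= e[2] <= ... <= e[n]
z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Plot z[i] against e_sorted[i] and compare to the 45-degree line;
# here we just print the (quantile, residual) pairs.
for zi, ei in zip(z, e_sorted):
    print(f"{zi:6.3f}  {ei:6.2f}")
```

The (i − 0.5)/n plotting positions avoid requesting the quantile at 0 or 1, where Φ⁻¹ is infinite.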
Influence Diagnostics
(Ways to detect influential observations)
Does a particular observation, consisting of a pair of (X, Y) values (a case), have undue influence on the fit of the regression model? That is, which cases are greatly affecting the estimates of the p regression parameters in the model? (For simple linear regression, p = 2.)

Standardized/studentized residuals. The ei are used to detect cases that are outlying with respect to their Y values. Check cases with |ei| > 2 or 3.

Hat diagonal elements. The hi are used to detect cases that are outlying with respect to their X values. Check cases with hi > 2p/n.
Dffits. Measures the influence the ith case has on the ith fitted value, comparing the ith fitted value with the ith fitted value obtained by omitting the ith case. Check cases for which |Dffits| > 2√(p/n).

Cook’s Distance. Similar to Dffits, but considers instead the influence of the ith case on all n fitted values. Check when Cook’s distance > F(p, n−p, 0.50).

Covariance Ratio. The change in the determinant of the covariance matrix that occurs when the ith case is deleted. Check cases with |CovRatio − 1| ≥ 3p/n.

Dfbetas. A measure of the influence of the ith case on each estimated regression parameter. For each regression parameter, check cases with |Dfbeta| > 2/√n.
Cutoffs for these data (p = 2, n = 14): hat = 0.29, CovRatio = 0.43, Dffits = 0.76, Dfbetas = 0.53
[Regression plot: fitted line Y = 7.71100 + 15.1982 X, with S = 6.43260, R-Sq = 98.1%, R-Sq(adj) = 98.0%; X from 0 to 10, Y from 20 to 170.]
[Normal probability plot of the residuals (response is Y): normal scores from −2 to 2 plotted against residuals from −10 to 10.]
[Plot of residuals versus fitted values (response is Y): fitted values from 20 to 170, residuals from −10 to 10; observations 1, 2, and 5 are labeled.]