Notes on Residuals Slides


Notes on Residuals
Simple Linear Regression Models
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
Terminology
The predicted or fitted values result from
substituting each sample x value into the
equation for the least squares line. This gives
ŷ1  a  bx1 =1st predicted value
ŷ2  a  bx 2 =2nd predicted value
...
ŷn  a  bx n =nth predicted value
The residuals for the least squares line are the
values: y1  y
ˆ 1 , y2  yˆ 2 , ..., yn  yˆ n
Example
Consider the following data on percentage
unemployment and suicide rates.
City             Percentage Unemployed   Suicide Rate
New York                 3.0                   72
Los Angeles              4.7                  224
Chicago                  3.0                   82
Philadelphia             3.2                   92
Detroit                  3.8                  104
Boston                   2.5                   71
San Francisco            4.8                  235
Washington               2.7                   81
Pittsburgh               4.4                   86
St. Louis                3.1                  102
Cleveland                3.5                  104
* Smith, D. (1977) Patterns in Human Geography, Canada: Douglas David and Charles Ltd., 158.
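As a rough check on the example, the least squares line for these data can be computed directly. The sketch below is not part of the slides (which rely on Minitab); it assumes Python with NumPy, the variable names are illustrative, and it uses the standard formulas b = Sxy/Sxx and a = ȳ - b·x̄.

```python
import numpy as np

# Data from the table above: percentage unemployed (x) and suicide rate (y).
x = np.array([3.0, 4.7, 3.0, 3.2, 3.8, 2.5, 4.8, 2.7, 4.4, 3.1, 3.5])
y = np.array([72, 224, 82, 92, 104, 71, 235, 81, 86, 102, 104], dtype=float)

# Standard least squares formulas for the slope b and intercept a.
s_xx = np.sum((x - x.mean()) ** 2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
b = s_xy / s_xx
a = y.mean() - b * x.mean()

# Predicted (fitted) values: substitute each sample x into a + b*x.
y_hat = a + b * x

print(f"a = {a:.2f}, b = {b:.2f}")
print(np.round(y_hat, 2))
```

The fitted value for New York (x = 3.0) comes out at about 83.3, which agrees with the ŷ column shown later in these notes.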
Example
The plot of the data points produced by
Minitab follows.
[Figure: Minitab plot of the data points]
Residual Analysis
The simple linear regression model equation
is $y = \alpha + \beta x + e$, where $e$ represents the
random deviation of an observed y value
from the population regression line $\alpha + \beta x$.
Key assumptions about e
1. At any particular x value, the distribution
of e is a normal distribution.
2. At any particular x value, the standard
deviation of e is $\sigma$, which is constant
over all values of x.
Residual Analysis
To check these assumptions, one would
examine the deviations $e_1, e_2, \ldots, e_n$.
Generally, the deviations are not known, so
we check the assumptions by looking at
the residuals, which are the deviations from
the estimated line, a + bx.
The residuals are given by
$y_1 - \hat{y}_1 = y_1 - (a + b x_1)$
$y_2 - \hat{y}_2 = y_2 - (a + b x_2)$
...
$y_n - \hat{y}_n = y_n - (a + b x_n)$
Standardized Residuals
Recall: A quantity is standardized by
subtracting its mean value and then dividing
by its true (or estimated) standard deviation.
For the residuals, the true mean is zero (0)
if the assumptions are true.
The estimated standard deviation of a residual
depends on the x value. The estimated standard
deviation of the ith residual, $y_i - \hat{y}_i$, is given by
$$s_{y_i - \hat{y}_i} = s_e \sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}}$$
Standardized Residuals
As you can see from the formula for the
estimated standard deviation, calculating
the standardized residuals by hand is
tedious and error-prone.
Fortunately, most statistical software
packages are set up to perform these
calculations.
Standardized Residuals - Example
Consider the data on percentage unemployment
and suicide rates
City             Percentage   Suicide      ŷ      Residual   Standardized
                 Unemployed   Rate                 y - ŷ      Residual
New York            3.0          72      83.31     -11.31       -0.34
Los Angeles         4.7         224     183.70      40.30        1.34
Chicago             3.0          82      83.31      -1.31       -0.04
Philadelphia        3.2          92      95.12      -3.12       -0.09
Detroit             3.8         104     130.55     -26.55       -0.78
Boston              2.5          71      53.78      17.22        0.55
San Francisco       4.8         235     189.61      45.39        1.56
Washington          2.7          81      65.59      15.41        0.48
Pittsburgh          4.4          86     165.99     -79.98       -2.50
St. Louis           3.1         102      89.21      12.79        0.38
Cleveland           3.5         104     112.84      -8.84       -0.26
Notice that the standardized residual for Pittsburgh
is -2.50, somewhat large for this size data set.
Example
[Figure: plot of the data with the Pittsburgh point labeled]
Pittsburgh: this point has an unusually
large (negative) residual.
Normal Plots
Notice that both of the normal plots look similar. If
a software package is available to do the
calculation and plots, it is preferable to look at the
normal plot of the standardized residuals.
[Figure: two normal probability plots of the residuals (response is Suicide). Axes: Normal Score against Residual in one plot, Normal Score against Standardized Residual in the other.]
In both cases, the points look reasonably linear
with the possible exception of Pittsburgh, so the
assumption that the errors are normally distributed
seems to be supported by the sample data.
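A normal probability plot pairs the ordered residuals with normal scores, so how close the plot is to a straight line can be summarized by the correlation between the two. The sketch below is an illustration, not something the slides compute: it assumes Python with NumPy, uses the standardized residuals as printed in the table, and assumes plotting positions of (i - 0.375) / (n + 0.25), one common convention among several.

```python
from statistics import NormalDist

import numpy as np

# Standardized residuals as printed in the table, one per city.
std_resid = [-0.34, 1.34, -0.04, -0.09, -0.78, 0.55,
             1.56, 0.48, -2.50, 0.38, -0.26]

n = len(std_resid)
ordered = sorted(std_resid)

# Assumed plotting positions; software packages differ in this choice.
normal_scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]

# Correlation between ordered residuals and normal scores: values near 1
# correspond to a nearly straight normal probability plot.
r = np.corrcoef(ordered, normal_scores)[0, 1]
print(f"correlation = {r:.3f}")
```

The smallest ordered value is Pittsburgh's -2.50, which sits further from the line than the other points, consistent with the comment about the plots.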
More Comments
The fact that Pittsburgh has a large
standardized residual makes it worthwhile
to look at that city carefully to make sure the
figures were reported correctly. One might
also look to see if there are some reasons
that Pittsburgh should be looked at
separately because some other
characteristic distinguishes it from all of the
other cities.
Pittsburgh does have a large effect on
the model.
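One way to see Pittsburgh's effect is to refit the line with that city left out and compare the coefficients. This is a sketch assuming Python with NumPy; the helper `fit_line` is a hypothetical name introduced here, and dropping a point is shown only as a diagnostic, not as a recommendation to discard the observation.

```python
import numpy as np

x = np.array([3.0, 4.7, 3.0, 3.2, 3.8, 2.5, 4.8, 2.7, 4.4, 3.1, 3.5])
y = np.array([72, 224, 82, 92, 104, 71, 235, 81, 86, 102, 104], dtype=float)
pittsburgh = 8  # position of Pittsburgh in the table's row order


def fit_line(x_vals, y_vals):
    """Return (intercept, slope) of the least squares line."""
    slope, intercept = np.polyfit(x_vals, y_vals, 1)
    return intercept, slope


a_all, b_all = fit_line(x, y)

# Boolean mask that keeps every city except Pittsburgh.
keep = np.arange(len(x)) != pittsburgh
a_without, b_without = fit_line(x[keep], y[keep])

print(f"all cities:         a = {a_all:8.2f}, b = {b_all:6.2f}")
print(f"without Pittsburgh: a = {a_without:8.2f}, b = {b_without:6.2f}")
```

With these data the slope rises from roughly 59 to roughly 73 when Pittsburgh is omitted, which is the sense in which a single city has a large effect on the fitted model.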
Visual Interpretation of
Standardized Residuals
[Figure: Standardized Residuals Versus x (response is y)]
This plot is an example of a satisfactory plot that
indicates that the model assumptions are reasonable.
Visual Interpretation of
Standardized Residuals
[Figure: Standardized Residuals Versus x (response is y)]
This plot suggests that a curvilinear regression model
is needed.
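The curved pattern is easy to reproduce. The sketch below uses made-up, noise-free data (not from the slides) in which y is exactly quadratic in x, fits a straight line with NumPy, and prints the signs of the residuals in order of x.

```python
import numpy as np

# Made-up data with a purely quadratic relationship, for illustration only.
x = np.arange(0, 11, dtype=float)
y = x ** 2

# Fit a straight line even though the true relationship is curved.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Signs of the residuals, listed in increasing order of x.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)
```

The residuals are positive at both ends of the x range and negative in the middle. A systematic run of signs like this, rather than a random scatter about zero, is what suggests a curvilinear model.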
Visual Interpretation of
Standardized Residuals
[Figure: Standardized Residuals Versus x (response is y)]
This plot suggests a non-constant variance. The
assumptions of the model are not correct.
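A fan-shaped residual plot can be mimicked with artificial data whose deviations grow with x. The sketch below is purely illustrative: the data are made up (deterministic deviations that alternate in sign and are proportional to x), and comparing the mean absolute residual in the lower and upper halves of the x range is an informal check chosen for this example, not a procedure from the slides.

```python
import numpy as np

# Made-up data: deviations alternate in sign and grow in proportion to x.
x = np.arange(1, 21, dtype=float)
alternating = np.where(np.arange(len(x)) % 2 == 0, 1.0, -1.0)
y = 2 + 3 * x + 0.5 * x * alternating

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Typical residual size for small x versus large x.
low = np.mean(np.abs(residuals[:10]))
high = np.mean(np.abs(residuals[10:]))
print(f"mean |residual| for x <= 10: {low:.2f}")
print(f"mean |residual| for x >  10: {high:.2f}")
```

When the constant-variance assumption holds, the two figures should be of similar size; here the spread is clearly larger for large x.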
Visual Interpretation of
Standardized Residuals
[Figure: Standardized Residuals Versus x (response is y)]
This plot shows a data point with a large standardized
residual.
Visual Interpretation of
Standardized Residuals
[Figure: Standardized Residuals Versus x (response is y)]
This plot shows a potentially influential observation.
Example - % Unemployment vs. Suicide Rate
[Figure: residual plot for the example, with three annotations]
Annotation: Generally decreasing pattern to these points.
Annotation: These two points are quite influential since they are far away from the others in terms of the % unemployed.
Annotation: Unusually large residual.
This plot of the residuals (errors) indicates some
possible problems with this linear model. You can see
a pattern to the points.
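How far each city lies from the others in x can be quantified with the quantity 1/n + (x_i - x̄)²/Sxx, commonly called leverage; it is built from the same term that appears in the standard deviation formula for a residual. The slides do not compute it, so the sketch below (Python with NumPy, illustrative variable names) is an added illustration. The figure is not available here, so which two points the annotation refers to is an inference.

```python
import numpy as np

cities = ["New York", "Los Angeles", "Chicago", "Philadelphia", "Detroit",
          "Boston", "San Francisco", "Washington", "Pittsburgh",
          "St. Louis", "Cleveland"]
x = np.array([3.0, 4.7, 3.0, 3.2, 3.8, 2.5, 4.8, 2.7, 4.4, 3.1, 3.5])

n = len(x)
s_xx = np.sum((x - x.mean()) ** 2)

# Leverage of each observation in simple linear regression.
h = 1 / n + (x - x.mean()) ** 2 / s_xx

# List cities from highest to lowest leverage.
order = np.argsort(h)[::-1]
top_two = [cities[i] for i in order[:2]]
for i in order:
    print(f"{cities[i]:15s} x = {x[i]:.1f}  leverage = {h[i]:.3f}")
```

San Francisco and Los Angeles have the two largest values, which fits the annotation about two points lying far from the others in % unemployed.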