Chapter 5
Residuals, Residual Plots,
& Influential points
Residuals (error)
• The vertical deviation between the observations & the LSRL
• The sum of the residuals is always zero
• error = observed - expected
$\text{residual} = y - \hat{y}$
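As a minimal sketch of the definitions above, the following Python fits a least-squares line to a small data set and checks that the residuals sum to zero (up to floating-point rounding). The data values and the helper name `lsrl` are hypothetical, chosen only for illustration.

```python
def lsrl(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

slope, intercept = lsrl(xs, ys)
predicted = [intercept + slope * x for x in xs]
# residual = observed - predicted
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

print(residuals)
print(abs(sum(residuals)) < 1e-9)  # True: residuals from the LSRL sum to zero
```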
Residual plot
• A scatterplot of the (x, residual) pairs.
• Residuals can be graphed against other
statistics besides x
• Purpose is to tell if a linear association exists between the x & y variables
Consider a population of adult women. Let’s examine the relationship between their height and weight.
[Figure: Weight vs. Height, with heights 60, 64, and 68 marked on the horizontal axis]
Suppose we now take a random sample from our population of women.
[Figure: Weight vs. Height for the sample, with residuals marked]
Residual plot
• A scatterplot of the (x, residual) pairs.
• Residuals can be graphed against other
statistics besides x
• Purpose is to tell if the model is an
appropriate fit between the x & y variables
• If no pattern exists between the points in
the residual plot, then the model is
appropriate.
[Figure: two residual plots (residuals vs. x), one labeled "Model is appropriate" and the other labeled "Model is NOT appropriate"]
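To illustrate what a pattern in the residuals looks like, here is a hedged sketch that fits a line to deliberately curved data. The data (y = x squared) are made up for this example and are not from the slides.

```python
# Hypothetical curved data: a straight line is the wrong model here.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
for x, resid in zip(xs, residuals):
    print(x, resid)
# The residuals are positive at both ends and negative in the middle:
# a U-shaped pattern, so the linear model is NOT appropriate.
```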
Age   Range of Motion
35    154
24    142
40    137
31    133
28    122
25    126
26    135
16    135
14    108
20    120
21    127
30    122
One measure of the success of knee
surgery is post-surgical range of motion
for the knee joint following a knee
dislocation. Is there a linear
relationship between age & range of
motion?
Sketch a residual plot.
[Figure: residual plot, residuals vs. age (x)]
Since there is no pattern in the residual plot, a linear model is appropriate for the relationship between age and range of motion.
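The (x, residual) pairs for this residual plot can be computed as in the sketch below. It uses the age and range-of-motion values from the table above; the variable names are assumptions of this example.

```python
# Knee-surgery data from the table above.
ages = [35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30]
rom = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(rom) / n

# Least-squares slope and intercept.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, rom))
         / sum((x - x_bar) ** 2 for x in ages))
intercept = y_bar - slope * x_bar

# (x, residual) pairs: these are the points of the residual plot.
pairs = [(x, y - (intercept + slope * x)) for x, y in zip(ages, rom)]
for x, resid in sorted(pairs):
    print(f"age {x:2d}  residual {resid:7.2f}")
```

Plotting these pairs with age on the horizontal axis gives the residual plot to inspect for a pattern.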
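Plot the residuals against the y-hats. How does this residual plot compare to the previous one?
[Figure: residual plot, residuals vs. y-hat]
[Figure: two residual plots side by side, residuals vs. x and residuals vs. y-hat]
Residual plots show the same pattern whether the residuals are plotted against x or against y-hat, because y-hat is a linear function of x. Only the horizontal scale changes (and the plot is mirrored left to right when the slope is negative).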
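One way to see why the two plots match: since y-hat is a linear function of x, the points keep the same left-to-right order on either axis when the slope is positive. The sketch below checks this for the knee-surgery data; the variable names are assumptions of this example.

```python
ages = [35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30]
rom = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(rom) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, rom))
         / sum((x - x_bar) ** 2 for x in ages))
intercept = y_bar - slope * x_bar

y_hats = [intercept + slope * x for x in ages]

# Left-to-right order of the points when plotted against x and against y-hat.
order_by_x = sorted(range(n), key=lambda i: ages[i])
order_by_y_hat = sorted(range(n), key=lambda i: y_hats[i])
print(order_by_x == order_by_y_hat)  # True here, because the slope is positive
```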
Coefficient of determination
• $r^2$
• The proportion of variation in y that can be attributed to an approximate linear relationship between x & y
• Remains the same no matter which variable is labeled x
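The last bullet can be checked numerically. This sketch computes the coefficient of determination twice for the knee-surgery data from this chapter, once with age as x and once with range of motion as x. The function name `r_squared` is an assumption of this example.

```python
def r_squared(xs, ys):
    """1 - SSResid/SSTotal for the least-squares line predicting ys from xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_resid / ss_total

ages = [35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30]
rom = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122]

print(round(r_squared(ages, rom), 4))  # 0.3062
print(round(r_squared(rom, ages), 4))  # 0.3062
```

The two regression lines differ, but the proportion of variation explained is the same.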
Let’s examine $r^2$.
Suppose you were going to predict a future y but you didn’t know the x-value. Your best guess would be the overall mean of the existing y’s.
Total variation: the total sum of the squared deviations from the mean.
$\bar{y} = 130.083$
SSTotal = 1564.917
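A short sketch of the "overall mean" model: it reproduces the mean and the total sum of squared deviations from the range-of-motion values in this chapter's data.

```python
# Range-of-motion values (y) from the knee-surgery data.
rom = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122]

y_bar = sum(rom) / len(rom)

# Total variation: squared deviations of each y from the overall mean.
ss_total = sum((y - y_bar) ** 2 for y in rom)

print(f"{y_bar:.3f}")     # 130.083
print(f"{ss_total:.3f}")  # 1564.917
```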
Now suppose you were going to predict a future y but you DO know the x-value. Your best guess would be the point on the LSRL for that x-value (y-hat).
$\hat{y} = .871x + 107.583$
Sum of the squared residuals using the LSRL:
SSResid = 1085.735
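The "regression on x" model can be sketched the same way: fit the LSRL to the knee-surgery data, then add up the squared residuals. Variable names are assumptions of this example.

```python
ages = [35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30]
rom = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(rom) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, rom))
         / sum((x - x_bar) ** 2 for x in ages))
intercept = y_bar - slope * x_bar

# Squared vertical distances from each observation to the LSRL.
ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(ages, rom))

print(f"y-hat = {slope:.3f}x + {intercept:.3f}")  # y-hat = 0.871x + 107.583
print(f"{ss_resid:.3f}")                          # 1085.735
```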
SSTotal = 1564.917
SSResid = 1085.735
By what percent did the sum of the squared error go down when you went from just an “overall mean” model to the “regression on x” model?
This is $r^2$: the amount of the variation in the y-values that is explained by the x-values.
$r^2 = \frac{\text{SSTotal} - \text{SSResid}}{\text{SSTotal}} = \frac{1564.91667 - 1085.735}{1564.91667} = .3062$
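The arithmetic above, written out as a short check using the two sums of squares quoted on the slide:

```python
ss_total = 1564.917  # squared error of the "overall mean" model
ss_resid = 1085.735  # squared error of the "regression on x" model

# Proportional reduction in squared error.
r_squared = (ss_total - ss_resid) / ss_total

print(f"{r_squared:.4f}")         # 0.3062
print(f"{100 * r_squared:.1f}%")  # 30.6%
```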
How well does age predict the
range of motion after knee
surgery?
30.6% of the variation in
range of motion after knee
surgery can be explained by
the approximate linear
regression of age and range
of motion.
Interpretation of $r^2$
$r^2$ (as a percent) of the variation in y can be explained by the approximate linear regression of x & y.
Computer-generated regression analysis of knee surgery data:

Predictor   Coef     Stdev    T      P
Constant    107.58   11.12    9.67   0.000
Age         0.8710   0.4146   2.10   0.062

s = 10.42   R-sq = 30.6%   R-sq(adj) = 23.7%

What is the equation of the LSRL? Find the slope & y-intercept.
$\hat{y} = 107.58 + .8710x$
What are the correlation coefficient and the coefficient of determination?
$r = .5532$ (positive, since the slope is positive) and $r^2 = .306$
NEVER use adjusted $r^2$!
Be sure to convert $r^2$ to decimal before taking the square root!
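A small sketch of recovering r from the printout: convert R-sq from a percent to a decimal, take the square root, and give the result the sign of the slope. The variable names are assumptions of this example; the numbers come from the printout.

```python
import math

r_sq_percent = 30.6  # R-sq from the printout (NOT R-sq(adj))
slope = 0.8710       # coefficient of Age from the printout

# Convert to a decimal first, then take the square root.
# r carries the same sign as the slope of the LSRL.
r = math.copysign(math.sqrt(r_sq_percent / 100), slope)

print(round(r, 4))  # 0.5532
```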
Outlier
• In a regression setting, an outlier is a data point with a large residual
Influential point
• A point that influences where the LSRL is located
• If removed, it will significantly change the slope of the LSRL
Racket   Resonance (Hz)   Acceleration (m/sec/sec)
1        105              36.0
2        106              35.0
3        110              34.5
4        111              36.8
5        112              37.0
6        113              34.0
7        113              34.2
8        114              33.8
9        114              35.0
10       119              35.0
11       120              33.6
12       121              34.2
13       126              36.2
14       189              30.0
One factor in the
development of tennis elbow
is the impact-induced
vibration of the racket and
arm at ball contact.
Sketch a scatterplot of these
data.
Calculate the LSRL &
correlation coefficient.
Does there appear to be an
influential point? If so,
remove it and then calculate
the new LSRL &
correlation coefficient.
(189,30) could be
influential. Remove &
recalculate LSRL
(189,30) was influential
since it moved the
LSRL
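The comparison can be sketched as follows: compute the LSRL and correlation coefficient for all 14 rackets, then again with (189, 30) removed. The helper name `lsrl_and_r` is an assumption of this example; the data are from the racket table above.

```python
import math

def lsrl_and_r(points):
    """Return (slope, intercept, r) for a list of (x, y) points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar, sxy / math.sqrt(sxx * syy)

# (resonance in Hz, acceleration in m/sec/sec)
rackets = [(105, 36.0), (106, 35.0), (110, 34.5), (111, 36.8), (112, 37.0),
           (113, 34.0), (113, 34.2), (114, 33.8), (114, 35.0), (119, 35.0),
           (120, 33.6), (121, 34.2), (126, 36.2), (189, 30.0)]
suspect = (189, 30.0)
without = [p for p in rackets if p != suspect]

for label, data in (("all 14 rackets", rackets), ("without (189, 30)", without)):
    slope, intercept, r = lsrl_and_r(data)
    print(f"{label}: y-hat = {intercept:.2f} + {slope:.4f}x, r = {r:.3f}")
```

With these data the slope is roughly halved in magnitude and the correlation weakens from roughly -0.8 to roughly -0.2 once the point is dropped, which is what marks it as influential.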
Which of these measures are
resistant?
• LSRL
• Correlation coefficient
• Coefficient of determination
NONE – all are affected by outliers