16 Diagnostics - University of South Florida

Diagnostics
Checking Assumptions and Bad Data
Questions
• What is the linearity assumption? How can you tell if it seems met?
• What is homoscedasticity (heteroscedasticity)? How can you tell if it’s a problem?
• What is an outlier?
• What is leverage?
• What is a residual?
• How can you use residuals in assuring that the regression model is a good representation of the data?
• Why consider a standardized residual?
• What is a studentized residual?
Linear Model
• Linear relations between X and Y
• Normal distribution of error of prediction
• Homoscedasticity (homogeneity of error in Y across levels of X)
Good-Looking Graph
[Figure: scatterplot of Y against X.]
No apparent departures from line.
Same Data, Different Graph
[Figure: residuals plotted against X.]
No systematic relations between X and residuals.
Problem with Linearity
[Figure: scatterplot of Miles per Gallon against Horsepower; R Sq Linear = 0.595.]
Problem with Heteroscedasticity
[Figure: scatterplot of Y against X.]
Common problem when Y = $ (a dollar amount)
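One rough way to look for heteroscedasticity is to compare the size of the residuals at low and high X. The sketch below uses made-up numbers (not data from these slides) in which the scatter around the line grows with X. The split at the middle of X and the ratio are illustrative choices, not a formal test.

```python
# Hypothetical data: Y follows X, but the scatter grows as X increases.
X = [1, 2, 3, 4, 5, 6, 7, 8]
Y = [1.1, 1.9, 3.2, 3.8, 6.0, 4.5, 9.0, 5.5]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n
ss_x = sum((x - mean_x) ** 2 for x in X)

# Ordinary least-squares slope and intercept.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / ss_x
a = mean_y - b * mean_x
resid = [y - (a + b * x) for x, y in zip(X, Y)]

# Mean squared residual in the lower and upper halves of X.
# X is already sorted here, so the halves are the first and last n/2 points.
half = n // 2
ms_low = sum(e ** 2 for e in resid[:half]) / half
ms_high = sum(e ** 2 for e in resid[half:]) / half
spread_ratio = ms_high / ms_low

print(f"mean squared residual, low X:  {ms_low:.3f}")
print(f"mean squared residual, high X: {ms_high:.3f}")
print(f"ratio (high / low):            {spread_ratio:.1f}")
```

A ratio near 1 is what homoscedasticity would suggest; a ratio far above 1, as with these invented numbers, matches the fan shape in a residual plot.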
Outliers
[Figure: scatterplot of Y against X with one point labeled "Outlier".]
Outlier = pathological point
Review
• What is the linearity assumption? How can you tell if it seems met?
• What is homoscedasticity (heteroscedasticity)? How can you tell if it’s a problem?
• What is an outlier?
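Residuals
• Zresid: $z_{resid} = \frac{Y - Y'}{SD_{resid}} = \frac{e}{SD_{resid}} = \frac{e}{S_{Y.X}}$, where $e = Y - Y'$ is the residual (observed minus predicted Y)
• Look for large values (some say |z|>2)
• Studentized residual (Student Residual): the residual divided by its own standard error,
$S_{e_i} = S_{Y.X}\sqrt{1 - \left[\frac{1}{N} + \frac{(X - \bar{X})^2}{\sum x^2}\right]}$
The studentized residual considers the distance of the point from the mean. The farther X is from the mean, the smaller the standard error and the larger the studentized residual for a given raw residual. Look for large values.
Also, studentized deleted residual (RStudent).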
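The two kinds of scaled residual can be computed side by side. The sketch below uses the seven X, Y pairs from the Example in these slides and assumes that S_Y.X is the standard error of estimate with N - 2 degrees of freedom; the variable names are illustrative.

```python
import math

# Seven (X, Y) pairs from the Example in these slides.
X = [2, 3, 3, 4, 4, 5, 8]
Y = [2, 3, 1, 1, 3, 2, 8]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n
ss_x = sum((x - mean_x) ** 2 for x in X)  # sum of squared deviations of X

# Least-squares slope and intercept.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / ss_x
a = mean_y - b * mean_x

# Raw residuals e = Y - Y' and the standard error of estimate S_Y.X
# (assumed here to use N - 2 degrees of freedom).
resid = [y - (a + b * x) for x, y in zip(X, Y)]
s_yx = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))

# Zresid: every residual is divided by the same S_Y.X.
z_resid = [e / s_yx for e in resid]

# Studentized residual: each residual is divided by its own standard error,
# which shrinks as X moves away from the mean of X.
se_resid = [s_yx * math.sqrt(1 - (1 / n + (x - mean_x) ** 2 / ss_x)) for x in X]
student = [e / se for e, se in zip(resid, se_resid)]

for x, e, z, t in zip(X, resid, z_resid, student):
    print(f"X={x}  e={e:7.4f}  Zresid={z:7.3f}  Student={t:7.3f}")
```

For the point at X = 8 the studentized residual comes out much larger than Zresid, because that point is far from the mean of X and its residual has a small standard error.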
Influence Analysis
• Leverage: $h_i = \frac{1}{N} + \frac{(X - \bar{X})^2}{\sum x^2}$
• Leverage is an index of the importance of an observation to a regression analysis.
– Function of X only
– Large deviations from mean are influential
– Maximum is 1; min is 1/N
– Average value is (k+1)/N, where k is the number of IVs
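The leverage properties listed above can be checked numerically. This is a small sketch using the X values from the Example in these slides, with k = 1 because there is a single IV.

```python
# X values from the Example in these slides; one independent variable.
X = [2, 3, 3, 4, 4, 5, 8]
k = 1

n = len(X)
mean_x = sum(X) / n
ss_x = sum((x - mean_x) ** 2 for x in X)  # sum of squared deviations of X

# Leverage depends on X only: h_i = 1/N + (X_i - mean)^2 / sum of squares.
leverage = [1 / n + (x - mean_x) ** 2 / ss_x for x in X]

for x, h in zip(X, leverage):
    print(f"X={x}  h={h:.4f}")

print(f"minimum possible (1/N): {1 / n:.4f}")
print(f"average leverage:       {sum(leverage) / n:.4f}")
print(f"(k+1)/N:                {(k + 1) / n:.4f}")
```

The point at X = 8 has by far the largest leverage, and the average of the seven values equals (k+1)/N.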
Influence Analysis (2)
• DFBETA and standardized DFBETA
• Change in slope or intercept resulting when you delete the ith person.
• Allow for influence of both X and Y
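DFBETA can be computed literally, by refitting the line with each person left out. The sketch below does that for the seven X, Y pairs from the Example in these slides. The standardized version assumes the common convention of dividing the change by the deleted-case residual SD times the square root of the matching diagonal element of (X'X)^-1; the slides do not state which convention was used, although this one gives -3.5303 for the intercept of the last case, as in the worked example.

```python
import math

# Seven (X, Y) pairs from the Example in these slides.
X = [2, 3, 3, 4, 4, 5, 8]
Y = [2, 3, 1, 1, 3, 2, 8]

def fit_line(xs, ys):
    """Return (intercept, slope, residual SD) for a simple regression of ys on xs."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    ssx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ssx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, math.sqrt(sse / (m - 2))

n = len(X)
mean_x = sum(X) / n
ss_x = sum((x - mean_x) ** 2 for x in X)
a_full, b_full, _ = fit_line(X, Y)

# Scale factors for the standardized version (assumed convention, see lead-in).
c_a = math.sqrt(1 / n + mean_x ** 2 / ss_x)
c_b = math.sqrt(1 / ss_x)

dfbeta_a, dfbeta_b, dfbetas_a, dfbetas_b = [], [], [], []
for i in range(n):
    # Refit without person i.
    a_i, b_i, s_i = fit_line(X[:i] + X[i + 1:], Y[:i] + Y[i + 1:])
    dfbeta_a.append(a_full - a_i)
    dfbeta_b.append(b_full - b_i)
    dfbetas_a.append((a_full - a_i) / (s_i * c_a))
    dfbetas_b.append((b_full - b_i) / (s_i * c_b))

for i in range(n):
    print(f"case {i + 1}: DFBETA a={dfbeta_a[i]:8.4f}  b={dfbeta_b[i]:8.4f}  "
          f"standardized a={dfbetas_a[i]:8.4f}  b={dfbetas_b[i]:8.4f}")
```

Deleting the last person (X = 8, Y = 8) changes the slope by the full 1.0125, so the slope fitted to the remaining six points is zero.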
Example
        X       Y
        2       2
        3       3
        3       1
        4       1
        4       3
        5       2
        8       8
M =     4.14    2.86
SX = 1.95, SY = 2.41
b = 1.01, a = -1.34
r = .82; r² = .67; p < .05.
[Figure: scatterplot of Y against X.]
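The summary statistics for this example follow directly from the seven X, Y pairs. The sketch below recomputes them; the standard deviations are assumed to use N - 1 in the denominator, and the function name is illustrative.

```python
import math

# Seven (X, Y) pairs from the Example.
X = [2, 3, 3, 4, 4, 5, 8]
Y = [2, 3, 1, 1, 3, 2, 8]

def describe_fit(xs, ys):
    """Return means, SDs (N - 1), slope, intercept and correlation as a dict."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    sp_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sp_xy / ss_x
    return {
        "mean_x": mx,
        "mean_y": my,
        "sd_x": math.sqrt(ss_x / (n - 1)),
        "sd_y": math.sqrt(ss_y / (n - 1)),
        "b": slope,
        "a": my - slope * mx,
        "r": sp_xy / math.sqrt(ss_x * ss_y),
    }

stats = describe_fit(X, Y)
for name, value in stats.items():
    print(f"{name:7s} {value:8.4f}")
print(f"r^2     {stats['r'] ** 2:8.4f}")
```

Rounded to two decimals, the output should agree with the means, standard deviations, slope, intercept and correlation reported in the Example.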
Example (2)
Y    Pred      Resid      Student     Rstudent   DFBETA     DFBETA
                          Residual               a          b
2    .6875     1.3125     1.072       1.0923     .7577      -.6044
3    1.7       1.3        .962        .9526      .3943      -.2546
1    1.7       -.7        -.518       -.476      -.1970     .1272
1    2.7125    -1.7125    -1.224      -1.3086    -.2524     .0423
3    2.7125    .2875      .206        .1846      .0356      -.006
2    3.725     -1.725     -1.256      -1.3584    .0198      -.2681
8    6.7625    1.2375     1.803       2.7249     -3.5303    4.8407
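The Rstudent column can be reproduced without refitting the line seven times. The sketch below assumes the usual shortcut for the residual variance with one case deleted, (SSE - e²/(1 - h)) / (N - k - 2), and divides each residual by the resulting standard error; the variable names are illustrative.

```python
import math

# Seven (X, Y) pairs from the Example; k = 1 independent variable.
X = [2, 3, 3, 4, 4, 5, 8]
Y = [2, 3, 1, 1, 3, 2, 8]
k = 1

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n
ss_x = sum((x - mean_x) ** 2 for x in X)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / ss_x
a = mean_y - b * mean_x

resid = [y - (a + b * x) for x, y in zip(X, Y)]
leverage = [1 / n + (x - mean_x) ** 2 / ss_x for x in X]
sse = sum(e ** 2 for e in resid)

rstudent = []
for e, h in zip(resid, leverage):
    # Residual variance of the fit with this case deleted (shortcut formula).
    s2_deleted = (sse - e ** 2 / (1 - h)) / (n - k - 2)
    rstudent.append(e / math.sqrt(s2_deleted * (1 - h)))

for y, t in zip(Y, rstudent):
    print(f"Y={y}  Rstudent={t:8.4f}")
```

Because the deleted-case variance leaves out the point's own residual, Rstudent grows faster than the studentized residual for a point that fits badly, which is why the last case stands out so clearly.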
Remedies
• Fit curves if needed.
• Note heteroscedasticity for applied problems.
• Investigate all outliers. May delete them or not, depending. Report your actions.
Review
• What is leverage?
• What is a residual?
• How can you use residuals in assuring that the regression model is a good representation of the data?
• Why consider a standardized residual?
• What is a studentized residual?