Clinical Research Training Program 2021
REGRESSION DIAGNOSTICS I
Fall 2004
www.edc.gsph.pitt.edu/faculty/dodge/clres2021.html
OUTLINE
Purpose of Regression Diagnostics
Residuals
Ordinary residuals, standardized residuals,
studentized residuals, Jackknife residuals
Leverage points
Diagonal elements of the hat matrix
Influential observations
Cook’s distance
Collinearity
Alternate Strategies of Analysis
Purpose of Regression Diagnostics
The techniques of regression diagnostics
are employed to check the assumptions
and to assess the accuracy of
computations for a regression analysis.
MODEL ASSUMPTIONS
• Independence: the errors associated with one
observation are not correlated with the errors of any
other observation
• Linearity: the relationship between the predictors X
and the outcome Y should be linear
• Homoscedasticity: the error variance should be
constant
• Normality: the errors should be normally
distributed
• Model Specification: the model should be properly
specified (including all relevant variables, and
excluding irrelevant variables).
UNUSUAL OBSERVATIONS
Outliers: In linear regression, an outlier is an observation with a large residual. In
other words, it is an observation whose
dependent variable value is unusual given
its values on the predictor variables X.
An outlier may indicate a sample
peculiarity or may indicate a data entry
error or other problem.
UNUSUAL OBSERVATIONS
Leverage: An observation with an
extreme value on a predictor variable X is
called a point with high leverage.
Leverage is a measure of how far an
independent variable deviates from its
mean. These leverage points can have an
unusually large effect on the estimates of the regression coefficients.
UNUSUAL OBSERVATIONS
Influence: An observation is said to be
influential if removing the observation
substantially changes the estimates of the regression coefficients. Influence can be
thought of as the product of leverage and
outlierness.
SIMPLE APPROACHES
Detect errors in the data & pinpoint potential
violations of the assumptions:
• Check the type of subject
• Check the procedure for data collection
• Check the unit of measurement for each variable
• Check the plausible range of values and a typical value for each variable
• Descriptive statistics (see the sketch below)
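As a rough STATA® sketch of these checks (assuming the slides' sbp and age variables; the 70/250 mmHg cutoffs are hypothetical illustrations, not clinical rules):

* Descriptive statistics and simple range/plausibility checks
summarize sbp age              // means, SDs, minima, maxima
codebook sbp age               // units, missing values, typical values
list if sbp < 70 | sbp > 250   // flag implausible blood pressures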
[Scatter plot: systolic blood pressure (y-axis, 120–200) versus age (x-axis, 40–65).]
SIMPLE APPROACHES
• Analysis of residuals and other regression diagnostic procedures provide the most refined and accurate evaluation of model assumptions.
RESIDUAL ANALYSIS
Model:
$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i$,   i = 1, …, n,
where $\varepsilon_i$ is the (unobserved) error term for the ith response.
Fitted model:
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \cdots + \hat{\beta}_p x_{pi}$
Ordinary residuals (the difference between the observed and the expected outcomes):
$e_i = \hat{\varepsilon}_i = y_i - \hat{y}_i$
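A minimal STATA® sketch of these quantities, assuming the slides' sbp/age example:

* Fit the model, then obtain fitted values and ordinary residuals
regress sbp age
predict yhat, xb           // fitted values
predict e, residuals       // ordinary residuals e_i
gen e_check = sbp - yhat   // e_i = y_i - yhat_i, computed directly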
LEAST-SQUARES METHOD
$e_i = y_i - \hat{y}_i$
[Scatter plot with fitted line $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$: y (20–45) versus estriol (mg/24 hr) levels of pregnant women (5–30); the residual is the vertical distance between an observed point $(x_i, y_i)$ and the corresponding point on the line $(x_i, \hat{y}_i)$.]
ORDINARY RESIDUALS
The ordinary residuals {ei} reflect the amount
of discrepancy between observed and
predicted values that remains after the data
have been fitted by the least-squares model.
Underlying assumption for the unobserved errors:
$\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$.
Each residual $e_i$ represents an estimate of the corresponding unobserved error $\varepsilon_i$.
ORDINARY RESIDUALS
• The mean of {e_i} is 0:
$\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i = 0.$
• The estimate of the population variance computed from the n residuals is
$\hat{\sigma}^2 = s^2 = \frac{1}{n-p-1}\sum_{i=1}^{n}(e_i - 0)^2 = \text{MSE}.$
$s^2$ is an unbiased estimator of $\sigma^2$ if the model is correct (verified numerically in the sketch below).
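Both properties can be checked in STATA®; a sketch, again assuming the sbp/age example:

* The residuals average to 0, and s^2 equals the MSE
regress sbp age
predict e, residuals
summarize e              // mean is 0 up to rounding error
display e(rmse)^2        // s^2 = (root MSE)^2
display e(rss)/e(df_r)   // same value: SS(residual)/(n - p - 1)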
ORDINARY RESIDUALS
• Ordinary residuals are correlated (they sum to 0) and have unequal variances, even though the underlying errors are independent and have equal variance (see the sketch below):
$\text{Var}(e_i) = E\{(e_i - 0)^2\} = \sigma^2(1 - h_{ii}),$
where $h_{ii}$ is the ith diagonal element of the $(n \times n)$ matrix $H = X(X'X)^{-1}X'$, the hat matrix. Note $\hat{y} = Hy$ (fitted y = H · observed y).
• {e_i} are not independent random variables (they sum to zero). However, if n >> p, this dependency can be ignored in any analysis of the residuals.
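A sketch of the per-observation residual standard deviations implied by this formula (variable names are illustrative):

* Estimated SD of each ordinary residual: s * sqrt(1 - h_ii)
regress sbp age
predict h, hat                   // diagonal elements h_ii
gen sd_e = e(rmse)*sqrt(1 - h)   // larger h_ii means smaller Var(e_i)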
STANDARDIZED RESIDUALS
The standardized residual is defined as $z_i = e_i / s$.
The standardized residuals sum to 0 and hence are not independent.
The standardized residuals have unit variance:
$\frac{1}{n-p-1}\sum_{i=1}^{n} z_i^2 = \frac{1}{n-p-1}\sum_{i=1}^{n}\left(\frac{e_i}{s}\right)^2 = \frac{1}{s^2}\cdot\frac{1}{n-p-1}\sum_{i=1}^{n} e_i^2 = 1.$
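STATA® has no predict option for this version of the standardized residual (see the terminology table later in the deck), but it is one line by hand; a sketch:

* Standardized residuals z_i = e_i / s
regress sbp age
predict e, residuals
gen z = e / e(rmse)   // mean 0 by construction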
STUDENTIZED RESIDUALS
The studentized residual is defined as
$r_i = \frac{e_i}{SD(e_i)} = \frac{e_i}{s\sqrt{1 - h_{ii}}},$
where $h_{ii}$ is the ith diagonal element of the $(n \times n)$ matrix $H = X(X'X)^{-1}X'$, the hat matrix. Note $\hat{y} = Hy$ (fitted y = H · observed y).
• The value $h_{ii}$ ranges from 0 to 1 and is a measure of leverage. A high value means more leverage, i.e., x is further away from the X-centroid (the x-variable means). (A STATA® sketch follows.)
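A sketch in STATA®, where the rstandard option of predict computes exactly this quantity (Stata labels it the "standardized" residual; see the terminology table later in the deck):

* Studentized residuals r_i = e_i / (s * sqrt(1 - h_ii))
regress sbp age
predict rstd, rstandard                      // built-in version
predict e, residuals
predict h, hat
gen rstd_check = e / (e(rmse)*sqrt(1 - h))   // same, computed by hand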
STUDENTIZED RESIDUALS
If a data point has a higher leverage value (larger $h_{ii}$), it will have a bigger studentized residual. Therefore, residuals at the edges of the predictor space will have higher studentized residual values, making them easier to single out.
If the data follow the usual assumptions for linear regression, the studentized residual approximately follows a t distribution with n − p − 1 degrees of freedom ($t_{n-p-1}$).
LEVERAGE MEASURES
• The quantity $h_{ii}$, the leverage, measures the distance of the ith observation from the set of x-variable means, namely from $\bar{x}_1, \ldots, \bar{x}_p$.
• $h_{ii}$ indicates, for a fixed $x_i$, how much $\hat{y}_i$ moves when $y_i$ moves a little bit.
• If $\hat{y}_i$ moves a lot, then $y_i$ has the potential to drive the regression, so the point is a leverage point. However, if $\hat{y}_i$ hardly moves at all, then $y_i$ has no chance of driving the regression.
LEVERAGE MEASURES
Under the model $y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i$, we have $\sum_{i=1}^{n} h_{ii} = p + 1$. Consequently, the average leverage value is $\bar{h} = (p+1)/n$.
Hoaglin and Welsch (1978) recommended scrutinizing any observation for which $h_{ii} > 2(p+1)/n$ (see the sketch below).
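A sketch of this rule in STATA®, using the results stored by regress (e(df_m) is the number of predictors p, and e(N) is n):

* Flag observations whose leverage exceeds 2(p+1)/n
regress sbp age
predict h, hat
list sbp age h if h > 2*(e(df_m) + 1)/e(N)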
JACKKNIFE RESIDUALS
• The jackknife residual is defined as
$r_{(-i)} = \frac{e_i}{s_{(-i)}\sqrt{1 - h_{ii}}},$
where $s_{(-i)}^2$ is the MSE computed with the ith observation deleted.
• If the ith observation lies far from the rest of the data, $s_{(-i)}$ will tend to be much smaller than s, which in turn will make $r_{(-i)}$ larger in comparison to $r_i$.
• If the ith observation has a larger leverage value $h_{ii}$, then $r_{(-i)}$ will become larger in comparison to $r_i$. (A STATA® sketch follows this list.)
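In STATA®, the rstudent option of predict computes jackknife residuals directly; a sketch, which also recovers them from the studentized residuals via the standard identity $r_{(-i)} = r_i\sqrt{(n-p-2)/(n-p-1-r_i^2)}$:

* Jackknife residuals, built-in and via the identity above
regress sbp age
predict rj, rstudent    // jackknife residuals
predict ri, rstandard   // studentized residuals r_i
gen rj_check = ri*sqrt((e(df_r) - 1)/(e(df_r) - ri^2))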
JACKKNIFE RESIDUALS
Therefore, residuals at the edges of the predictor space (high leverage) or far away from the rest of the data (small $s_{(-i)}$ value) will have higher jackknife residual values, making them easier to single out.
If the usual assumptions are met, each jackknife residual exactly follows a t distribution with n − p − 2 degrees of freedom.
RESIDUALS

Kleinbaum et al.       | STATA®                 | SAS                    | SPSS
-----------------------+------------------------+------------------------+-------------------------
(Ordinary) Residuals   | Residuals              | Residuals              | Unstandardized Residuals
Standardized Residuals | --                     | --                     | Standardized Residuals
Studentized Residuals  | Standardized Residuals | Standardized Residuals | Studentized Residuals
Jackknife Residuals    | Studentized Residuals  | Studentized Residuals  | Studentized Residuals
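For reference, a sketch of the predict options behind the STATA® column (the generated variable names are arbitrary):

* Row-by-row STATA equivalents of the terminology table
regress sbp age
predict e, residuals     // (ordinary) residuals
predict rs, rstandard    // studentized residuals ("standardized" in Stata)
predict rj, rstudent     // jackknife residuals ("studentized" in Stata)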
Graphical Analysis of Residuals
• Stem-and-leaf diagrams, histograms, boxplots, and normal probability plots of residuals can be employed to test the normality assumption.
• Studentized residuals (or jackknife residuals) vs. $\hat{y}$ can be employed to detect outliers.
• Studentized residuals (or jackknife residuals) vs. $\hat{y}$ can be employed to detect nonlinearity.
• Studentized residuals (or jackknife residuals) vs. $\hat{y}$ (or x) can be employed to detect heteroscedasticity.
Graphical Analysis of Residuals
• Stem-and-leaf diagrams, histograms, boxplots, and normal probability plots of residuals can be employed to test the normality assumption.
• For the normal probability plot of residuals, if the data points fall away from the ideal 45-degree line, the normality assumption is questionable.
• regress sbp age
• predict rstudent, rstudent
• qnorm rstudent, grid
[Normal probability plot: studentized residuals vs. inverse normal, with grid lines at the 5, 10, 25, 50, 75, 90, and 95 percentiles.]
Graphical Analysis of Residuals
• Studentized residuals (or jackknife residuals) vs. $\hat{y}$ can be employed to detect outliers.
• For a sufficiently large sample, 95% of jackknife residuals should lie between ±2.
• For a sufficiently large sample, 99% of jackknife residuals should lie between ±2.5.
• Any observation for which the absolute value of the jackknife residual is 3 or more is likely to be an outlier (see the sketch below).
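A sketch of these rules of thumb in STATA®, assuming the sbp/age example of the next slide:

* Count and list observations beyond the 2, 2.5, and 3 cutoffs
regress sbp age
predict rstudent, rstudent
count if abs(rstudent) > 2                    // expect roughly 5% of n
count if abs(rstudent) > 2.5                  // expect roughly 1% of n
list sbp age rstudent if abs(rstudent) >= 3   // likely outliers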
• regress sbp age
• predict yhat, xb
• predict rstudent, rstudent
• graph rstudent yhat, yline(-3, -2, 0, 2, 3)
[Plot of studentized residuals vs. fitted values (about 121 to 173), with horizontal reference lines at −3, −2, 0, 2, and 3.]
Graphical Analysis of Residuals
• Studentized residuals (or jackknife residuals) vs. $\hat{y}$ can be employed to detect nonlinearity.
• Studentized residuals (or jackknife residuals) vs. $\hat{y}$ (or x) can be employed to detect heteroscedasticity.
• In STATA®, the command “hettest” performs the Cook-Weisberg test for heteroscedasticity after “regress.”
. regress SBP Age

      Source |       SS       df       MS         Number of obs =      70
-------------+----------------------------        F(  1,    68) =   77.92
       Model |  15068.9324     1  15068.9324      Prob > F      =  0.0000
    Residual |  13150.4391    68   193.38881      R-squared     =  0.5340
-------------+----------------------------        Adj R-squared =  0.5271
       Total |  28219.3714    69  408.976398      Root MSE      =  13.906

-----------------------------------------------------------------
  SBP |      Coef.   Std. Err.       t     P>|t|       [95% CI]
------+----------------------------------------------------------
  Age |   .9871668   .1118317     8.83    0.000     .764   1.210
_cons |   104.1781   5.422842    19.21    0.000   93.357 115.000
-----------------------------------------------------------------

. hettest

Cook-Weisberg test for heteroskedasticity using fitted values of SBP
Ho: Constant variance
chi2(1)      =     0.07
Prob > chi2  =   0.7975
Analysis of Leverage Points
Observations corresponding to large diagonal elements of the hat matrix (i.e., $h_{ii} > 2(p+1)/n$) are considered leverage points.
. regress sbp age

      Source |       SS       df       MS         Number of obs =      32
-------------+----------------------------        F(  1,    30) =   45.18
       Model |  3861.63037     1  3861.63037      Prob > F      =  0.0000
    Residual |  2564.33838    30  85.4779458      R-squared     =  0.6009
-------------+----------------------------        Adj R-squared =  0.5876
       Total |  6425.96875    31  207.289315      Root MSE      =  9.2454

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |     1.6045   .2387159     6.72   0.000     1.116977    2.092023
       _cons |   59.09162   12.81626     4.61   0.000     32.91733    85.26592
------------------------------------------------------------------------------

. predict hat, hat
. predict rstudent, rstudent
. list sbp age yhat rstudent hat if hat>(4/32)

       sbp   age       yhat    rstudent         hat
  2.   122    41   124.8761   -.3287683    .1312917
Should we remove outliers?
• Error in the data management step? (coding error, data entry error, etc.)
• Suspect data points? (subjects could not perform the task, subjects did not take the experiment seriously, etc.)
• Are the data points from a different population than the rest of the data?
• If the data points are removed, how much do the results change? (see the sketch below)
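One way to answer the last question is a simple sensitivity check: refit the model without the flagged points and compare the coefficients. A sketch, assuming the sbp/age example and a |jackknife residual| ≥ 3 flagging rule:

* Sensitivity check: how much do the estimates change without the outliers?
regress sbp age
predict rj, rstudent
regress sbp age if abs(rj) < 3   // refit excluding flagged observations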