Chapter 6 Diagnostics for Leverage and Influence

Ray-Bing Chen
Institute of Statistics
National University of Kaohsiung
6.1 Importance of Detecting
Influential Observations
• We usually assume equal weights for the observations.
For example, the sample mean weights every observation equally.
• As noted in Section 2.7, the location of observations in x-space can play an important role in determining
the regression coefficients (see Figures 2.6 and 2.7).
• Outliers are observations that have unusual y
values.
• As shown in Section 4.4, outliers can be identified from the
residuals.
• See Figure 6.1
• The point A is called a leverage point.
• Leverage point:
– Has an unusual x-value and may control certain
model properties.
– This point does not affect the estimates of the
regression coefficients, but it certainly will have a
dramatic effect on the model summary statistics
such as $R^2$ and the standard errors of the
regression coefficients.
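The behaviour described above can be checked numerically. The following is a minimal sketch on hypothetical data (the numbers and the helper name `fit_summary` are illustrative assumptions, not taken from the text): a remote point is placed exactly on the line fitted to the other observations, and the coefficients, $R^2$ and the standard error of the slope are compared before and after adding it.

```python
import numpy as np

def fit_summary(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (coefficients, R^2, se of slope)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    n, p = X.shape
    ms_res = sse / (n - p)
    se_slope = np.sqrt(ms_res * np.linalg.inv(X.T @ X)[1, 1])
    return beta, 1.0 - sse / sst, se_slope

# Hypothetical data: ten points scattered about a straight line.
x = np.arange(1.0, 11.0)
y = 2.0 + 3.0 * x + np.array([0.5, -0.4, 0.3, -0.6, 0.2, 0.4, -0.3, 0.6, -0.5, -0.2])
beta0, r2_0, se0 = fit_summary(x, y)

# Add a remote point lying exactly on the fitted line (a pure leverage point).
x_lev = 30.0
y_lev = beta0[0] + beta0[1] * x_lev
beta1, r2_1, se1 = fit_summary(np.append(x, x_lev), np.append(y, y_lev))

print("coefficients unchanged:", np.allclose(beta0, beta1))
print("R^2 before/after:", r2_0, r2_1)
print("se(slope) before/after:", se0, se1)
```

Because the added point has a zero residual, the normal equations are still satisfied by the original coefficients, while the total sum of squares and the spread in x both grow.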
• See the point A in Figure 6.2
• Influence point:
– Point A has a moderately unusual x-coordinate, and its y value is unusual as well.
– An influence point has a noticeable impact on
the model coefficients in that it pulls the
regression model in its direction.
• Sometimes we find that a small subset of data
exerts a disproportionate influence on the model
coefficients and properties.
• In the extreme case, the parameter estimates may
depend more on the influential subset of points than on
the majority of the data.
• We would like for a regression model to be
representative of all of the sample observations,
not an artifact of a few.
• If the influential points are bad values, then they
should be eliminated from the sample.
• If they are not bad values, there may be nothing
wrong with these points, but if they control key
model properties we would like to know it, as it
could affect the end use of the regression model.
• Here we present several diagnostics for leverage
and influence. It is important to use these
diagnostics in conjunction with the residual
analysis techniques of Chapter 4.
6.2 Leverage
• The location of points in x-space is potentially
important in determining the properties of the
regression model.
• In particular, remote points potentially have a
disproportionate impact on the parameter
estimates, standard errors, predicted values, and
model summary statistics.
• The hat matrix plays an important role in identifying
influential observations:
$H = X(X'X)^{-1}X'$
• H determines the variances and covariances of the
fitted values $\hat{y}$ and the residuals e: $\mathrm{Var}(\hat{y}) = \sigma^2 H$ and $\mathrm{Var}(e) = \sigma^2 (I - H)$.
• The elements $h_{ij}$ of H may be interpreted as the
amount of leverage exerted by the jth observation
$y_j$ on the ith fitted value $\hat{y}_i$.
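As an illustration of the hat matrix, here is a minimal NumPy sketch. The data and the helper name `hat_matrix` are hypothetical; the last point is deliberately placed far from the others in x so that its hat diagonal stands out.

```python
import numpy as np

def hat_matrix(X):
    """Return H = X (X'X)^{-1} X' for a full-column-rank model matrix X."""
    return X @ np.linalg.solve(X.T @ X, X.T)

# Hypothetical data: intercept plus one regressor, last point remote in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 14.8])
X = np.column_stack([np.ones_like(x), x])

H = hat_matrix(X)
h = np.diag(H)                              # hat diagonals h_ii
beta = np.linalg.solve(X.T @ X, X.T @ y)    # least-squares coefficients
y_hat = H @ y                               # fitted values are H y

print("hat diagonals:", h)
print("sum of hat diagonals:", h.sum())     # equals p, the number of columns of X
```

Row i of H holds the weights that each observed response receives in the ith fitted value, which is why its entries are read as leverage.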
• The point A in Figure 6.1 will have a large hat
diagonal and is assuredly a leverage point, but it
has almost no effect on the regression coefficients
because it lies almost on the line passing through
the remaining observations. (Because the hat
diagonals examine only the location of the
observation in x-space)
• Observations with large hat diagonals and large
residuals are likely to be influential.
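• The traditional cutoff flags the ith observation as a leverage point when hii exceeds 2p/n, twice the average hat diagonal p/n. If 2p/n > 1, then the cutoff value does not apply.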
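The cutoff rule can be sketched as a small function. This is an illustrative sketch only: the function name `leverage_points` and the data are assumptions, and returning `None` when 2p/n exceeds 1 is just one way to signal that the cutoff does not apply.

```python
import numpy as np

def leverage_points(X):
    """Indices i with h_ii > 2p/n, or None when the cutoff 2p/n exceeds 1."""
    n, p = X.shape
    cutoff = 2.0 * p / n
    if cutoff > 1.0:
        return None  # the cutoff does not apply
    # Diagonal of X (X'X)^{-1} X' without forming the full n x n matrix.
    h = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
    return np.flatnonzero(h > cutoff)

# Hypothetical data: the observation at x = 15 is remote in x-space.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0, 6.0, 7.0])
X = np.column_stack([np.ones_like(x), x])

flagged = leverage_points(X)
print("leverage points (0-based indices):", flagged)
```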
Example 6.1 The Delivery Time Data
• In Example 3.1, p = 3 and n = 25, so the cutoff value is
2p/n = 0.24. That is, if hii exceeds 0.24, then the ith
observation is a leverage point.
• Observations 9 and 22 are leverage points.
• See Figure 3.4 (the matrix of scatterplots), Figure
3.11, and Table 4.1 (the studentized residuals and
R-student).
• The corresponding residuals for observation 22
are not unusually large, which indicates that
observation 22 has little influence on the fitted
model.
• Both scaled residuals for observation 9 are
moderately large, suggesting that this observation
may have moderate influence on the model.
6.3 Measures of Influence:
Cook’s D
• It is desirable to consider both the location of the
point in x-space and the response variable in
measuring influence.
• Cook (1977, 1979) suggested a measure of
the squared distance between the least-squares
estimate based on all n points, $\hat{\beta}$, and
the estimate obtained by deleting the ith point, $\hat{\beta}_{(i)}$.
• Usually the distance is scaled by $X'X$ and $p\,MS_{Res}$:
$D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})' X'X (\hat{\beta}_{(i)} - \hat{\beta})}{p\,MS_{Res}}, \quad i = 1, 2, \ldots, n$
• Points with large values of $D_i$ have considerable
influence on the least-squares estimate.
• The magnitude of $D_i$ is usually assessed by
comparing it to $F_{\alpha, p, n-p}$.
• If $D_i = F_{0.5, p, n-p}$, then deleting point i would move $\hat{\beta}_{(i)}$
to the boundary of an approximate 50% confidence
region for $\beta$ based on the complete data set.
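• A large displacement indicates that the least-squares estimate is sensitive to the ith data point.
• Since $F_{0.5, p, n-p} \approx 1$, points with $D_i > 1$ are usually considered influential.
• The $D_i$ statistic may be rewritten as
$D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}, \quad i = 1, 2, \ldots, n$
where $r_i$ is the ith studentized residual.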
• Di is proportional to the product of the square of
the ith studentized residual and hii / (1 – hii).
• This ratio can be shown to be the distance from
the vector xi to the centroid of the remaining data.
• Di is made up of a component that reflects how
well the model fits the ith observation yi and a
component that measures how far that point is
from the rest of the data.
• Either component (or both) may contribute to a
large value of Di.
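• Di combines residual magnitude for the ith
observation and the location of that point in x-space to assess influence.
• Because $X\hat{\beta}_{(i)} - X\hat{\beta} = \hat{y}_{(i)} - \hat{y}$, another way to
write Cook's distance measure is
$D_i = \frac{(\hat{y}_{(i)} - \hat{y})'(\hat{y}_{(i)} - \hat{y})}{p\,MS_{Res}}$
• Apart from the factor $p\,MS_{Res}$, Di is also the squared Euclidean distance that the vector of
fitted values moves when the ith observation is
deleted.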
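The two forms of Cook's distance described in this section (the product of the squared studentized residual and hii / (1 – hii), divided by p, and the scaled squared shift in the fitted values after deletion) can be compared numerically. The sketch below uses hypothetical data and hypothetical function names; it is meant only to show that the two computations agree.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D via D_i = (r_i^2 / p) * h_ii / (1 - h_ii)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # hat diagonals
    e = y - X @ (XtX_inv @ X.T @ y)               # ordinary residuals
    ms_res = e @ e / (n - p)
    r = e / np.sqrt(ms_res * (1.0 - h))           # studentized residuals
    return r**2 / p * h / (1.0 - h)

def cooks_distance_by_deletion(X, y):
    """Cook's D from the definition: refit without observation i each time."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    ms_res = e @ e / (n - p)
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        shift = X @ (beta_i - beta)               # y_hat_(i) - y_hat
        D[i] = shift @ shift / (p * ms_res)
    return D

# Hypothetical data: the last point is remote in x and off the trend in y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 9.0])
X = np.column_stack([np.ones_like(x), x])

D = cooks_distance(X, y)
D_del = cooks_distance_by_deletion(X, y)
print("closed form:", D)
print("by deletion:", D_del)
```

The closed form needs only one fit, which is why it is the version normally used in practice; the deletion loop is included purely as a check.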
Example 6.2 The Delivery Time Data
• Column b of Table 6.1 contains the values of
Cook’s distance measure for the soft drink
delivery time data.
6.4 Measures of Influence:
DFFITS and DFBETAS
• Cook’s D is a deletion diagnostic.
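• Belsley, Kuh and Welsch (1980) introduced two
useful measures of deletion influence.
• First one: how much the regression coefficient $\hat{\beta}_j$
changes, in standard deviation units, if the ith observation is deleted:
$DFBETAS_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{S_{(i)}^2 C_{jj}}}$
• Here $C_{jj}$ is the jth diagonal element of $(X'X)^{-1}$, $\hat{\beta}_{j(i)}$ is the jth regression coefficient computed without the ith observation, and $S_{(i)}^2$ is the residual mean square from that fit.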
• A large value of DFBETASj,i indicates that
observation i has considerable influence on the jth
regression coefficient.
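• Define $R = (X'X)^{-1}X'$, a $p \times n$ matrix.
• The n elements in the jth row of R produce the
leverage that the n observations in the sample have
on the estimate of the jth coefficient, $\hat{\beta}_j$.
• With $r_j'$ denoting the jth row of R, $r_{j,i}$ its ith element, and $t_i$ the R-student residual:
$DFBETAS_{j,i} = \frac{r_{j,i}}{\sqrt{r_j' r_j}} \cdot \frac{e_i}{S_{(i)}(1 - h_{ii})} = \frac{r_{j,i}}{\sqrt{r_j' r_j}} \cdot \frac{t_i}{\sqrt{1 - h_{ii}}}$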
• DFBETASj,i measures both leverage and the effect
of a large residual.
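• Cutoff value: $2/\sqrt{n}$
• That is, if $|DFBETAS_{j,i}| > 2/\sqrt{n}$, then the ith
observation warrants examination.
• Second one: the deletion influence of the ith
observation on the predicted or fitted value:
$DFFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{S_{(i)}^2 h_{ii}}}, \quad i = 1, 2, \ldots, n$
where $\hat{y}_{(i)}$ is the fitted value of $y_i$ obtained without the use of the ith observation.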
• DFFITSi is the number of standard deviations that
the fitted value changes if observation i is
removed.
• DFFITSi is also affected by both leverage and
prediction error.
• Cutoff value: $2\sqrt{p/n}$
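A minimal sketch of DFFITS follows, assuming the standard computational form $DFFITS_i = \sqrt{h_{ii}/(1-h_{ii})}\,t_i$ with $t_i$ the R-student residual, which makes the leverage and prediction-error components explicit. The function name and data are hypothetical.

```python
import numpy as np

def dffits(X, y):
    """DFFITS_i = sqrt(h_ii / (1 - h_ii)) * t_i, with t_i the R-student residual."""
    n, p = X.shape
    R = np.linalg.solve(X.T @ X, X.T)            # (X'X)^{-1} X'
    h = np.einsum("ij,ji->i", X, R)              # hat diagonals
    e = y - X @ (R @ y)                          # ordinary residuals
    # S_(i)^2: residual mean square with observation i deleted.
    s2_del = (e @ e - e**2 / (1.0 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_del * (1.0 - h))          # R-student
    return np.sqrt(h / (1.0 - h)) * t

# Hypothetical data: the last point is remote in x and off the trend in y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 9.0])
X = np.column_stack([np.ones_like(x), x])

n, p = X.shape
d = dffits(X, y)
flagged = np.abs(d) > 2.0 * np.sqrt(p / n)       # cutoff 2 * sqrt(p / n)
print(d)
print(flagged)
```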
6.5 A Measure of Model
Performance
• The diagnostics Di, DFBETASj,i and DFFITSi
provide insight about the effect of observations on
the estimated coefficients and the fitted values.
• They give no information about the overall precision of
estimation.
• Generalized variance:
$GV(\hat{\beta}) = |\mathrm{Var}(\hat{\beta})| = |\sigma^2 (X'X)^{-1}|$
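• To express the role of the ith observation on the
precision of estimation, we define
$COVRATIO_i = \frac{|(X_{(i)}' X_{(i)})^{-1} S_{(i)}^2|}{|(X'X)^{-1} MS_{Res}|}, \quad i = 1, 2, \ldots, n$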
• If COVRATIOi > 1, then the ith observation
improves the precision of estimation.
• If COVRATIOi < 1, inclusion of the ith point
degrades precision.
• Because $COVRATIO_i = \frac{(S_{(i)}^2)^p}{MS_{Res}^p} \cdot \frac{1}{1 - h_{ii}}$, the factor $1/(1 - h_{ii})$ means that a high-leverage point will
make COVRATIOi large.
• The ith point is considered influential if
COVRATIOi > 1 + 3p/n or COVRATIOi < 1 – 3p/n
• These cutoffs are recommended only for large
samples.
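COVRATIO can be computed straight from its definition as a ratio of generalized variances, refitting the model without each point in turn. The sketch below does exactly that on hypothetical data (the function name `covratio` and the numbers are assumptions for illustration) and then applies the 1 ± 3p/n cutoffs stated above.

```python
import numpy as np

def covratio(X, y):
    """COVRATIO_i from the determinant definition, refitting without point i."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    ms_res = e @ e / (n - p)
    det_full = np.linalg.det(np.linalg.inv(X.T @ X) * ms_res)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        ei = yi - Xi @ np.linalg.lstsq(Xi, yi, rcond=None)[0]
        s2_del = ei @ ei / (n - 1 - p)           # S_(i)^2
        out[i] = np.linalg.det(np.linalg.inv(Xi.T @ Xi) * s2_del) / det_full
    return out

# Hypothetical data: twelve points, the last one remote in x and off the trend.
x = np.append(np.arange(1.0, 12.0), 25.0)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, -0.2, 3.0])
y = 1.0 + 0.5 * x + noise
X = np.column_stack([np.ones_like(x), x])

n, p = X.shape
cr = covratio(X, y)
flagged = (cr > 1.0 + 3.0 * p / n) | (cr < 1.0 - 3.0 * p / n)
print(cr)
print(flagged)
```

With only twelve observations this data set is far from the large-sample setting for which the cutoffs are recommended, so the flags here are purely illustrative.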
Example 6.4 The Delivery Time Data
• The cutoffs are 1.36 and 0.64.
• Observations 9 and 22 are influential.
• Obs. 9 degrades precision of estimation.
• The influence of Obs. 22 is fairly small.
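The cutoffs quoted in Example 6.4 follow from the rule 1 ± 3p/n, using the values p = 3 and n = 25 given for the delivery time data in Example 6.1:

```latex
\[
1 + \frac{3p}{n} = 1 + \frac{3(3)}{25} = 1.36,
\qquad
1 - \frac{3p}{n} = 1 - \frac{3(3)}{25} = 0.64
\]
```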
6.6 Detecting Groups of
Influential Observations
• The above methods only focus on single-observation deletion diagnostics for influence and
leverage.
• The single-observation diagnostics can be extended to the multiple-observation case.
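• Extend Cook's distance measure.
• Let $\mathbf{i}$ be the $m \times 1$ vector of indices specifying the
points to be deleted, and define
$D_{\mathbf{i}} = \frac{(\hat{\beta}_{(\mathbf{i})} - \hat{\beta})' X'X (\hat{\beta}_{(\mathbf{i})} - \hat{\beta})}{p\,MS_{Res}}$
where $\hat{\beta}_{(\mathbf{i})}$ is the least-squares estimate computed without the m points indexed by $\mathbf{i}$.
• $D_{\mathbf{i}}$ is a multiple-observation version of Cook's
distance measure.
• A large value of $D_{\mathbf{i}}$ indicates that the set of m points
is influential.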
• In some data sets, subsets of points are jointly
influential but individual points are not!
• Sebert, Montgomery and Rollier (1998)
investigate the use of cluster analysis to find the set
of influential observations in regression (single-linkage clustering procedure).
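The multiple-observation version of Cook's distance can be sketched by deleting a whole set of rows at once and measuring how far the coefficient vector moves. The function name `cooks_distance_subset` and the data are hypothetical; the data set is constructed so that two remote points sit close together, which is the kind of situation in which points may be jointly influential without either being flagged on its own.

```python
import numpy as np

def cooks_distance_subset(X, y, idx):
    """Cook's distance for deleting all rows listed in idx at the same time."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    ms_res = e @ e / (n - p)
    keep = np.ones(n, dtype=bool)
    keep[np.asarray(idx)] = False
    beta_del = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    d = beta_del - beta                          # shift in the coefficient vector
    return d @ (X.T @ X) @ d / (p * ms_res)

# Hypothetical data: two neighbouring remote points (x = 15, 16) off the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 15.0, 16.0])
y = np.array([1.2, 1.8, 3.1, 4.0, 4.9, 6.1, 8.0, 8.5])
X = np.column_stack([np.ones_like(x), x])

D_a = cooks_distance_subset(X, y, [6])           # delete only x = 15
D_b = cooks_distance_subset(X, y, [7])           # delete only x = 16
D_pair = cooks_distance_subset(X, y, [6, 7])     # delete both together
print(D_a, D_b, D_pair)
```

Deleting either remote point alone changes the fit little because the other one still holds the line in place; deleting both lets the line swing back to the bulk of the data.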
6.7 Treatment of Influential
Observations
• Diagnostics for leverage and influence are an
important part of the regression model-builder’s
arsenal of tools.
• They offer the analyst insight about the data, and signal
which observations may deserve more scrutiny.
• Should influential observations ever be discarded?
• A compromise between deleting an observation
and retaining it is to consider using an estimation
technique that is not impacted as severely by
influential points as least squares.