High-leverage outlier

STATS 330: Lecture 11
4/7/2015
Outliers and high-leverage points

 An outlier is a point that has a larger or smaller y value than the model would suggest
• Can be due to a genuine large error e
• Can be caused by typographical errors in recording the data
 A high-leverage point is a point with extreme values of the explanatory variables
Outliers

 The effect of an outlier depends on whether it is also a high-leverage point
 A “high leverage” outlier
• Can attract the fitted plane, distorting the fit, sometimes extremely
• In extreme cases may not have a big residual
• In extreme cases can increase R2
 A “low leverage” outlier
• Does not distort the fit to the same extent
• Usually has a big residual
• Inflates standard errors, decreases R2
[Figure: four scatterplots of y against x. Panel 1: no outliers, no high-leverage points. Panel 2: a low-leverage outlier with a big residual. Panel 3: a high-leverage point that is not an outlier. Panel 4: a high-leverage outlier.]
Example: the education data (ignoring urban)

[Figure: scatterplots of educ against percap and under18 for the education data, with one high-leverage point marked.]
[Figure: residuals(educ.lm) plotted against under18, percap and predict(educ.lm), with cases labelled by number. One residual is somewhat extreme: an outlier also?]
Measuring leverage

It can be shown (see e.g. STATS 310) that the fitted value of case i is related to the response data y1, ..., yn by the equation

ŷ = Hy, where H = X(X'X)^(-1)X',

so that

ŷi = hi1 y1 + ... + hii yi + ... + hin yn

The hij depend on the explanatory variables. The quantities hii are called “hat matrix diagonals” (HMDs) and measure the influence yi has on the ith fitted value. They can also be interpreted as the distance between the x-data for the ith case and the average x-data for all the cases. Thus, they directly measure how extreme the x-values of each point are.
Interpreting the HMDs

 Each HMD lies between 0 and 1
 The average HMD is p/n
• (p = number of regression coefficients, p = k+1)
 An HMD of more than 3p/n is considered extreme
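As a concrete illustration of these rules (a minimal sketch in Python rather than the R used in the lecture, with made-up data): for simple linear regression with a single predictor, the hat matrix diagonal reduces to the standard closed form h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)², so a point with an extreme x-value gets a large HMD.

```python
# Hat-matrix diagonals for SIMPLE linear regression (one predictor).
# Illustrative sketch only: for y = b0 + b1*x, the diagonal of
# H = X (X'X)^(-1) X' reduces to the closed form
#     h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2

def hat_values(x):
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Hypothetical data: nine ordinary x-values plus one extreme one.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
h = hat_values(x)

p, n = 2, len(x)                  # p = k + 1 = 2 coefficients
print(sum(h))                     # the HMDs always sum to p
print([round(hi, 3) for hi in h if hi > 3 * p / n])  # only the extreme point exceeds 3p/n
```

Here the extreme point gets h ≈ 0.91, far above the 3p/n = 0.6 cutoff, while the average HMD is p/n = 0.2.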
Example: the education data

educ.lm <- lm(educ ~ percap + under18, data = educ.df)

> hatvalues(educ.lm)[50]
       50
0.3428523

n = 50, p = 3, so the cutoff 3p/n is
> 9/50
[1] 0.18
Clearly extreme!
Studentized residuals

 How can we recognize a big residual? How big is big?
 The actual size depends on the units in which the y-variable is measured, so we need to standardize them.
 We can divide by their standard deviations.
 The variance of a typical residual e is
var(e) = (1 − h) σ²
where h is the hat matrix diagonal for the point and σ² is the error variance.
Studentized residuals (2)

 “Internally studentised” (called “standardised” in R):

r_i = e_i / (s √(1 − h_i))

where s² is the usual estimate of σ².

 “Externally studentised” (called “studentised” in R):

t_i = e_i / (s_(i) √(1 − h_i))

where s²_(i) is the estimate of σ² after deleting the ith data point.
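The two scalings can be made concrete with a short sketch. The code below is an illustration in Python with hypothetical data (the lecture itself uses R's stdres/studres): e_i and h_i come from the full fit, then each residual is scaled either by the usual estimate s or by the leave-one-out estimate s_(i).

```python
import math

def fit(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

def studentized(x, y):
    n, p = len(x), 2
    b0, b1 = fit(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    s2 = sum(ei ** 2 for ei in e) / (n - p)       # usual estimate of sigma^2
    internal = [ei / math.sqrt(s2 * (1 - hi)) for ei, hi in zip(e, h)]
    external = []
    for i in range(n):                            # delete case i, re-estimate sigma^2
        xd, yd = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        b0d, b1d = fit(xd, yd)
        s2i = sum((yj - (b0d + b1d * xj)) ** 2
                  for xj, yj in zip(xd, yd)) / (n - 1 - p)
        external.append(e[i] / math.sqrt(s2i * (1 - h[i])))
    return internal, external

# Hypothetical data: y roughly 2x, except case 8 (x = 9) is an outlier.
x = list(range(1, 11))
y = [2.1, 3.9, 6.2, 8.0, 10.1, 11.9, 14.2, 16.0, 25.0, 20.1]
internal, external = studentized(x, y)
print(round(internal[8], 2), round(external[8], 2))
```

The outlying case is the only one outside (−2, 2) on either scale, and its externally studentised value is far larger than the internal one because s_(i) is not inflated by the outlier's own residual.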
Studentized residuals (3)

 How big is big?
 Both types of studentised residual are approximately distributed as standard normals when the model is OK and there are no outliers (in fact the externally studentised one has a t-distribution).
 Thus, studentised residuals should be between -2 and 2 with approximately 95% probability.
Studentized residuals (4)

Calculating in R:

library(MASS)      # load the MASS library
stdres(educ.lm)    # internally studentised ("standardised" in R)
studres(educ.lm)   # externally studentised ("studentised" in R)

> stdres(educ.lm)[50]
      50
3.275808
> studres(educ.lm)[50]
      50
3.700221
What does studentised mean?
Recognizing outliers

 If a point is a low-leverage outlier, the residual will usually be large, so a large residual and a low HMD indicate an outlier.
 If a point is a high-leverage outlier, then a large error will usually cause a large residual.
 However, in extreme cases, a high-leverage outlier may not have a very big residual, depending on how much the point attracts the fitted plane. Thus, if a point has a large HMD and the residual is not particularly big, we can't always tell whether the point is an outlier or not.
High-leverage outlier

[Figure: plot of y versus x for Example 5, with the fitted line pulled away from the true line by a high-leverage outlier, and the corresponding residuals-vs-fitted plot. The outlier has a small residual!]
Leverage-residual plot

Can plot standardised residuals versus leverage (HMDs): the leverage-residual plot (LR plot).

plot(educ.lm, which=5)

[Figure: "Residuals vs Leverage" plot for lm(educ ~ percap + under18), with Cook's distance contours at 0.5 and 1. Point 50 is high leverage with a big residual: it is an outlier.]
Interpreting LR plots

[Diagram: schematic LR plot of standardised residual against leverage, divided at residuals of ±2 and leverage of 3p/n:
• low leverage, residual beyond ±2: low-leverage outlier
• high leverage, residual beyond ±2: high-leverage outlier
• high leverage, residual within ±2: possible high-leverage outlier
• low leverage, residual within ±2: OK]
Residuals and HMD’s

[Figure: plot of y versus x for Example 1, with fitted and true lines, and the corresponding residuals-vs-leverage plot.]

No big studentized residuals, no big HMD’s (3p/n = 0.2 for this example).
Residuals and HMD’s (2)

[Figure: plot of y versus x for Example 2, with point 24 marked, and the corresponding residuals-vs-leverage plot.]

One big studentized residual (point 24), no big HMD’s (3p/n = 0.2 for this example). The line moves a bit.
Residuals and HMD’s (3)

[Figure: plot of y versus x for Example 3, with point 1 marked, and the corresponding residuals-vs-leverage plot.]

No big studentized residual, one big HMD, point 1 (3p/n = 0.2 for this example). The line hardly moves: point 1 is high leverage but not influential.
Residuals and HMD’s (4)

[Figure: plot of y versus x for Example 4, with point 1 marked, and the corresponding residuals-vs-leverage plot.]

One big studentized residual and one big HMD, both point 1 (3p/n = 0.2 for this example). The line moves but the residual is large: point 1 is influential.
Residuals and HMD’s (5)

[Figure: plot of y versus x for Example 5, with point 1 marked, and the corresponding residuals-vs-leverage plot.]

No big studentized residuals, but a big HMD (3p/n = 0.2 for this example). Point 1 is high leverage and influential.
Influential points

 How can we tell if a high-leverage point/outlier is affecting the regression?
 By deleting the point and refitting the regression: a large change in the coefficients means the point is affecting the regression.
 Such points are called influential points.
 We don't want the analysis to be driven by one or two points.
“Leave-one-out” measures

 We can calculate a variety of measures by leaving out each data point in turn, and looking at the change in key regression quantities such as
• Coefficients
• Fitted values
• Standard errors
 We discuss each in turn.
Example: education data

          With point 50   Without point 50
Const        -557              -278
percap          0.071             0.059
under18         1.555             0.932
Standardized difference in coefficients: DFBETAS

Formula:

( β̂_j − β̂_j(−i) ) / s.e.( β̂_j )

Problem when: greater than 1 in absolute value. This is the criterion coded into R.
Standardized difference in fitted values: DFFITS

Formula:

( ŷ_i − ŷ_i(−i) ) / s.e.( ŷ_i )

Problem when: greater than 3√(p/(n − p)) in absolute value
(p = number of regression coefficients).
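Both leave-one-out measures can be computed by brute force, exactly as defined: delete each case, refit, and standardize the change. The sketch below (Python with hypothetical data, not the lecture's R code) follows the slide formulas and standardizes by the full-fit standard errors; R's own dfbetas/dffits scale by the deleted estimate s_(i) instead, so its numbers differ slightly.

```python
import math

def fit(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

def dfbetas_dffits(x, y):
    n, p = len(x), 2
    b0, b1 = fit(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(ei ** 2 for ei in e) / (n - p))
    se_b1 = s / math.sqrt(sxx)                    # s.e. of the slope
    dfbetas, dffits = [], []
    for i in range(n):
        xd, yd = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        b0d, b1d = fit(xd, yd)                    # refit without case i
        dfbetas.append((b1 - b1d) / se_b1)        # slide formula for the slope
        yhat_full = b0 + b1 * x[i]
        yhat_del = b0d + b1d * x[i]
        dffits.append((yhat_full - yhat_del) / (s * math.sqrt(h[i])))
    return dfbetas, dffits

# Hypothetical data: y roughly 2x, except case 8 (x = 9) is an outlier.
x = list(range(1, 11))
y = [2.1, 3.9, 6.2, 8.0, 10.1, 11.9, 14.2, 16.0, 25.0, 20.1]
dfbetas, dffits = dfbetas_dffits(x, y)

n, p = len(x), 2
cutoff = 3 * math.sqrt(p / (n - p))               # DFFITS flag: 3*sqrt(p/(n-p))
print([i for i in range(n) if abs(dffits[i]) > cutoff])   # flagged by DFFITS
print([i for i in range(n) if abs(dfbetas[i]) > 1])       # flagged by DFBETAS
```

On this made-up data only the outlying case trips either criterion: deleting it changes both the slope and its own fitted value by far more than one standard error.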
COVRATIO & Cook's D

• COVRATIO: measures the change in the standard errors of the estimated coefficients.
Problem indicated when: COVRATIO is more than 1 + 3p/n or less than 1 − 3p/n.

• Cook's D: measures the overall change in the coefficients.
Problem when: more than qf(0.50, p, n−p) (the lower 50% point of the F distribution), roughly 1 in most cases.
Calculations

> influence.measures(educ.lm)
Influence measures of
  lm(formula = educ ~ under18 + percap, data = educ.df)

     dfb.1_ dfb.un18 dfb.prcp   dffit cov.r   cook.d    hat inf
10  0.06381 -0.02222 -0.16792 -0.3631 0.803 4.05e-02 0.0257   *
44  0.02289 -0.02948  0.00298 -0.0340 1.283 3.94e-04 0.1690   *
50 -2.36876  2.23393  1.50181  2.4733 0.821 1.66e+00 0.3429   *

p = 3, n = 50, 3p/n = 0.18, 3√(p/(n−p)) = 0.758, qf(0.5, 3, 47) = 0.8002294
Plotting influence

# set up plot window with a 2 x 4 array of plots
par(mfrow=c(2,4))
# plot dfbetas, dffits, cov ratio, Cook's D, HMD's
influenceplots(educ.lm)
[Figure: 2 x 4 array of index plots of dfb.1_, dfb.un18, dfb.prcp, DFFITS, ABS(COV RATIO−1), Cook's D and hat values against observation number; points 44 and 50 are labelled, with case 50 standing out on every measure.]
Remedies for outliers

 Correct typographical errors in the data.
 Delete a small number of points and refit (we don't want the fitted regression to be determined by one or two influential points).
 Report the existence of outliers separately: they are often of scientific interest.
 Don't delete too many points (1 or 2 max).
Summary: Doing it in R

 LR plot:
plot(educ.lm, which=5)
 Full diagnostic display:
plot(educ.lm)
 Influence measures:
influence.measures(educ.lm)
 Plots of influence measures:
par(mfrow=c(2,4))
influenceplots(educ.lm)
HMD Summary

 Hat matrix diagonals
• Measure the effect of a point on its fitted value
• Measure how outlying the x-values are (how “high-leverage” a point is)
• Are always between 0 and 1, with bigger values indicating higher leverage
• Points with HMD’s more than 3p/n are considered “high leverage”