A Broad Overview of Key Statistical Concepts


Outliers and influential data points

The distinction
• An outlier is a data point whose response y does not follow the general trend of the rest of the data.
• A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated beta coefficients, or the hypothesis test results.
No outliers? No influential data points?

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]
Any outliers? Any influential data points?

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]
Any outliers? Any influential data points?

[Scatterplot of y versus x with two fitted lines: y = 1.73 + 5.12x and y = 2.96 + 5.04x]
Any outliers? Any influential data points?

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]
Any outliers? Any influential data points?

[Scatterplot of y versus x with two fitted lines: y = 1.73 + 5.12x and y = 2.47 + 4.93x]
Any outliers? Any influential data points?

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]
Any outliers? Any influential data points?

[Scatterplot of y versus x with two fitted lines: y = 1.73 + 5.12x and y = 8.51 + 3.32x]
Impact on regression analyses
• Not every outlier strongly influences the regression analysis.
• Always determine whether the regression analysis is unduly influenced by one or a few data points.
• Use simple plots for simple linear regression.
• Use summary measures for multiple linear regression.
The leverages h_i

The predicted response can be written as a linear combination of the n observed values y_1, y_2, …, y_n:

    ŷ_i = h_1 y_1 + h_2 y_2 + … + h_i y_i + … + h_n y_n,   for i = 1, …, n

where the weights h_1, h_2, …, h_i, …, h_n depend only on the predictor values.

The leverage h_i quantifies the influence that the observed response y_i has on its predicted value ŷ_i.
Properties of the leverages h_i
• The leverage h_i is:
  – a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.
  – a number between 0 and 1, inclusive.
• The sum of the h_i equals p, the number of parameters.
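These properties can be checked numerically. A minimal sketch, using the four-point data set (x = 1, 2, 3, 4) that appears later in the residuals example: the leverages are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ, where X is the design matrix with an intercept column.

```python
import numpy as np

# Four-point illustrative data set (from the residuals example later on).
x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])   # design matrix: intercept + x

# Leverages = diagonal of the hat matrix H = X (X'X)^-1 X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

print(np.round(h, 3))      # [0.7 0.3 0.3 0.7] -- each h_i lies in [0, 1]
print(round(h.sum(), 3))   # 2.0 -- the sum equals p = 2 parameters
```

Note how the extreme x values (1 and 4) carry the larger leverages, reflecting their distance from the mean of the x values.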
Any high leverages h_i?

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]

[Dotplot of x, 0 to 9, sample mean = 4.751; labeled points: h(1) = 0.176, h(21) = 0.163, h(11) = 0.048]

HI1: 0.176297, 0.077744, 0.048147, 0.072580, 0.163492, 0.157454, 0.065028, 0.049313, 0.109616, 0.127014, 0.061276, 0.051829, 0.127489, 0.119313, 0.050974, 0.055760, 0.140453, 0.086145, 0.049628, 0.069311, 0.141136

Sum of HI1 = 2.0000
Any high leverages h_i?

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]

[Dotplot of x, 0 to 14, sample mean = 5.227; labeled points: h(1) = 0.153, h(11) = 0.048, h(21) = 0.358]

HI1: 0.153481, 0.077557, 0.047632, 0.078121, 0.357535, 0.139367, 0.066879, 0.048156, 0.088549, 0.116292, 0.063589, 0.049557, 0.096634, 0.110382, 0.050033, 0.055893, 0.096227, 0.084374, 0.052121, 0.057574, 0.110048

Sum of HI1 = 2.0000
Identifying data points whose x values are extreme … and therefore potentially influential

Using leverages to identify extreme x values

Minitab flags any observation whose leverage value, h_i, is more than 3 times larger than the mean leverage value

    h̄ = (Σ_{i=1}^{n} h_i) / n = p / n

… or if it is greater than 0.99 (whichever is smallest). Here:

    3(p/n) = 3(2/21) = 0.286
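The flagging rule above is a quick arithmetic check; a minimal sketch for this data set (p = 2 parameters, n = 21 observations):

```python
# Mean leverage is p/n; Minitab's flag threshold is 3(p/n), capped at 0.99.
p, n = 2, 21
threshold = min(3 * p / n, 0.99)
print(round(threshold, 3))   # 0.286, matching the computation above
```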
x = 14.00, y = 68.00, HI1 = 0.357535

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]

Unusual Observations
Obs     x      y     Fit  SE Fit  Residual  St Resid
 21  14.0  68.00  71.449   1.620    -3.449   -1.59 X

X denotes an observation whose X value gives it large influence.
    3(p/n) = 3(2/21) = 0.286

x = 13.00, y = 15.00, HI2 = 0.311532

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]

Unusual Observations
Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  13.0  15.00  51.66    5.83    -36.66   -4.23 RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
Identifying outliers (unusual y values)

Identifying outliers
• Residuals
• Standardized residuals
  – also called internally studentized residuals

Residuals

Ordinary residuals are defined for each observation, i = 1, …, n:

    e_i = y_i − ŷ_i
  x   y  FITS1  RESI1
  1   2    2.2   -0.2
  2   5    4.4    0.6
  3   6    6.6   -0.6
  4   9    8.8    0.2
Standardized residuals

Standardized residuals are defined for each observation, i = 1, …, n:

    e_i* = e_i / s(e_i) = e_i / √(MSE (1 − h_i))

With MSE1 = 0.400000:

  x   y  FITS1  RESI1  HI1     SRES1
  1   2    2.2   -0.2  0.7  -0.57735
  2   5    4.4    0.6  0.3   1.13389
  3   6    6.6   -0.6  0.3  -1.13389
  4   9    8.8    0.2  0.7   0.57735
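The SRES1 column can be reproduced directly from the definition. A minimal sketch, assuming the residuals, leverages, and MSE shown in the table:

```python
import numpy as np

# Residuals, leverages, and MSE from the four-point example above.
resid = np.array([-0.2, 0.6, -0.6, 0.2])
h = np.array([0.7, 0.3, 0.3, 0.7])
mse = 0.4

# Standardized (internally studentized) residual: e_i / sqrt(MSE (1 - h_i)).
sres = resid / np.sqrt(mse * (1 - h))
print(np.round(sres, 5))   # [-0.57735  1.13389 -1.13389  0.57735]
```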
Standardized residuals
• Standardized residuals quantify how large the residuals are in standard deviation units.
  – An observation with a standardized residual that is larger than 3 (in absolute value) is considered an outlier.
  – Recall that Minitab flags any observation with a standardized residual that is larger than 2 (in absolute value).
An outlier?

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]

S = 4.711

      x        y    FITS1       HI1     s(e)    RESI1     SRES1
0.10000  -0.0716   3.4614  0.176297  4.27561  -3.5330  -0.82635
0.45401   4.1673   5.2446  0.157454  4.32424  -1.0774  -0.24916
1.09765   6.5703   8.4869  0.127014  4.40166  -1.9166  -0.43544
1.27936  13.8150   9.4022  0.119313  4.42103   4.4128   0.99818
2.20611  11.4501  14.0706  0.086145  4.50352  -2.6205  -0.58191
...
8.70156  46.5475  46.7904  0.140453  4.36765  -0.2429  -0.05561
9.16463  45.7762  49.1230  0.163492  4.30872  -3.3468  -0.77679
4.00000  40.0000  23.1070  0.050974  4.58936  16.8930   3.68110
Unusual Observations
Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  4.00  40.00  23.11    1.06     16.89     3.68R

R denotes an observation with a large standardized residual.
Identifying influential data points

Identifying influential data points
• Deleted residuals
• Deleted t residuals
  – also called studentized deleted residuals
  – also called externally studentized residuals
• Difference in fits, DFITS
• Cook's distance measure

Basic idea of these four measures
• Delete the observations one at a time, each time refitting the regression model on the remaining n − 1 observations.
• Compare the results using all n observations to the results with the ith observation deleted, to see how much influence the observation has on the analysis.
Deleted residuals

y_i = the observed response for the ith observation
ŷ_(i) = the predicted response for the ith observation, based on the model estimated with the ith observation deleted

Deleted residual:  d_i = y_i − ŷ_(i)

[Scatterplot of y versus x (x from 0 to 10, y from 0 to 15) with two fitted lines: y = 0.6 + 1.55x and y = 3.82 − 0.13x]

    y_4 = 2.1
    ŷ_(4) = 0.6 + 1.55(10) = 16.1
    d_4 = 2.1 − 16.1 = −14
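The computation above can be sketched directly: refit the line without the fourth observation, then compare y_4 to the prediction from that refit.

```python
import numpy as np

# The four-point data set from the slide (the fourth point is unusual).
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])

i = 3                                     # the fourth observation (x = 10)
keep = np.arange(len(x)) != i
b1, b0 = np.polyfit(x[keep], y[keep], 1)  # refit on the remaining n-1 points

d_i = y[i] - (b0 + b1 * x[i])             # deleted residual d_4

print(round(b0, 2), round(b1, 2))         # 0.6 1.55 -- the refitted line
print(round(d_i, 1))                      # -14.0
```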
Deleted t residuals

A deleted t residual is just a standardized deleted residual:

    t_i = d_i / s(d_i) = d_i / √(MSE_(i) / (1 − h_i))

The deleted t residuals follow a t distribution with (n − 1) − p degrees of freedom.
[Scatterplot of y versus x (x from 0 to 10, y from 0 to 15) with two fitted lines: y = 0.6 + 1.55x and y = 3.82 − 0.13x]

  x    y  RESI1     TRES1
  1  2.1  -1.59   -1.7431
  2  3.8   0.24    0.1217
  3  5.2   1.77    1.6361
 10  2.1  -0.42  -19.7990
[Scatterplot of y versus x with two fitted lines: y = 1.73 + 5.12x and y = 2.96 + 5.04x]

Row        x        y    RESI1     SRES1     TRES1
  1  0.10000  -0.0716  -3.5330  -0.82635  -0.81916
  2  0.45401   4.1673  -1.0774  -0.24916  -0.24291
  3  1.09765   6.5703  -1.9166  -0.43544  -0.42596
...
 19  8.70156  46.5475  -0.2429  -0.05561  -0.05413
 20  9.16463  45.7762  -3.3468  -0.77679  -0.76837
 21  4.00000  40.0000  16.8930   3.68110   6.69012
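A standard identity (not shown on the slide, but consistent with the definitions above) converts an internally studentized residual r_i into the deleted t residual without refitting: t_i = r_i √((n − p − 1) / (n − p − r_i²)). A quick check against row 21 of the table:

```python
import math

# Row 21 of the table above: internally studentized residual r = 3.68110,
# with n = 21 observations and p = 2 parameters.
r = 3.68110
n, p = 21, 2

# Identity relating internally and externally studentized residuals.
t = r * math.sqrt((n - p - 1) / (n - p - r ** 2))
print(round(t, 2))   # ~6.69, matching TRES1 for row 21
```

Note how a standardized residual of 3.68 becomes a deleted t residual of 6.69: deleting the outlier shrinks the MSE, so the same residual looks far more extreme.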
DFITS

The difference in fits,

    DFITS_i = (ŷ_i − ŷ_(i)) / √(MSE_(i) h_i)

is the number of standard deviations that the fitted value changes when the ith case is omitted.

DFITS

An observation is deemed influential if the absolute value of its DFITS value is …
… greater than 1 for small to medium data sets
… greater than 2√(p/n) for large data sets
… or if it just sticks out like a sore thumb

Here:

    2√(p/n) = 2√(2/21) = 0.62
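A minimal sketch of the cutoffs quoted above, for this data set (p = 2, n = 21):

```python
import math

# Large-data-set DFITS cutoff: 2 * sqrt(p/n).
p, n = 2, 21
cutoff = 2 * math.sqrt(p / n)
print(round(cutoff, 2))          # 0.62

# Observation 21 (x = 14) has DFITS = -1.23841, exceeding both cutoffs.
print(abs(-1.23841) > 1)         # True
print(abs(-1.23841) > cutoff)    # True
```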
x = 14.00, y = 68.00, DFIT1 = -1.23841

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]
Row        x        y     DFIT1
  1   0.1000  -0.0716  -0.52503
  2   0.4540   4.1673  -0.08388
  3   1.0977   6.5703  -0.18232
  4   1.2794  13.8150   0.75898
  5   2.2061  11.4501  -0.21823
  6   2.5006  12.9554  -0.20155
  7   3.0403  20.1575   0.27774
  8   3.2358  17.5633  -0.08230
  9   4.4531  26.0317   0.13865
 10   4.1699  22.7573  -0.02221
 11   5.2847  26.3030  -0.18487
 12   5.5924  30.6885   0.05523
 13   5.9209  33.9402   0.19741
 14   6.6607  30.9228  -0.42449
 15   6.7995  34.1100  -0.17249
 16   7.9794  44.4536   0.29918
 17   8.4154  46.5022   0.30960
 18   8.7161  50.0568   0.63049
 19   8.7016  46.5475   0.14948
 20   9.1646  45.7762  -0.25094
 21  14.0000  68.0000  -1.23841
    2√(p/n) = 2√(2/21) = 0.62

x = 13.00, y = 15.00, DFIT2 = -11.4670

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]
Row        x        y     DFIT2
  1   0.1000  -0.0716   -0.4028
  2   0.4540   4.1673   -0.2438
  3   1.0977   6.5703   -0.2058
  4   1.2794  13.8150    0.0376
  5   2.2061  11.4501   -0.1314
  6   2.5006  12.9554   -0.1096
  7   3.0403  20.1575    0.0405
  8   3.2358  17.5633   -0.0424
  9   4.4531  26.0317    0.0602
 10   4.1699  22.7573    0.0092
 11   5.2847  26.3030    0.0054
 12   5.5924  30.6885    0.0782
 13   5.9209  33.9402    0.1278
 14   6.6607  30.9228    0.0072
 15   6.7995  34.1100    0.0731
 16   7.9794  44.4536    0.2805
 17   8.4154  46.5022    0.3236
 18   8.7161  50.0568    0.4361
 19   8.7016  46.5475    0.3089
 20   9.1646  45.7762    0.2492
 21  13.0000  15.0000  -11.4670
Cook’s distance

    D_i = [(y_i − ŷ_i)² / (p · MSE)] · [h_i / (1 − h_i)²]

• D_i depends on both the residual e_i and the leverage h_i.
• D_i summarizes how much all of the estimated beta coefficients change when deleting the ith observation.
• A large D_i indicates that y_i has a strong influence on the estimated beta coefficients.
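The residual/leverage form of D_i can be computed directly. A minimal sketch using the four-point data set from the residuals example (x = 1…4, y = 2, 5, 6, 9), not the lecture's 21-point data set:

```python
import numpy as np

# Four-point data set from the residuals example.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
X = np.column_stack([np.ones_like(x), x])
p = X.shape[1]                                    # p = 2 parameters

# Fit, residuals, leverages, and MSE.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
mse = (e @ e) / (len(x) - p)                      # MSE = 0.4, as on the slide

# Cook's distance: D_i = [e_i^2 / (p * MSE)] * [h_i / (1 - h_i)^2].
D = (e ** 2) / (p * mse) * h / (1 - h) ** 2
print(np.round(D, 3))   # [0.389 0.276 0.276 0.389]
```

The high-leverage end points (h = 0.7) pick up the larger Cook's distances even though all four residuals are small, illustrating how D_i blends residual size with leverage.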
Cook’s distance
• Compare D_i to the F(p, n − p) distribution.
• If D_i is greater than the 50th percentile, F(0.50; p, n − p), then the ith observation has lots of influence.
x = 14.00, y = 68.00, COOK1 = 0.701960;  F(0.50; 2, 19) = 0.7191

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]
x = 13.00, y = 15.00, COOK2 = 4.04801;  F(0.50; 2, 19) = 0.7191

[Scatterplot of y versus x; x from 0 to 14, y from 0 to 70]