A Broad Overview of Key Statistical Concepts


Outliers and influential data points

The distinction
• An outlier is a data point whose response y does not follow the general trend of the rest of the data.
• A data point is influential if it unduly influences any part of a regression analysis, such as predicted responses, estimated slope coefficients, hypothesis test results, etc.
No outliers? No influential data points?

[Scatterplot of y versus x]
Any outliers? Any influential data points?

[Scatterplot of y versus x]
Any outliers? Any influential data points?

[Scatterplot of y versus x with fitted lines y = 2.96 + 5.04x (all data) and y = 1.73 + 5.12x (blue data point excluded)]
Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor   Coef    SE Coef   T      P
Constant    1.732   1.121     1.55   0.140
x           5.1169  0.2003    25.55  0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%
With the blue data point:

The regression equation is y = 2.96 + 5.04 x

Predictor   Coef    SE Coef   T      P
Constant    2.958   2.009     1.47   0.157
x           5.0373  0.3633    13.86  0.000

S = 4.711   R-Sq = 91.0%   R-Sq(adj) = 90.5%
Any outliers? Any influential data points?

[Scatterplot of y versus x]
Any outliers? Any influential data points?

[Scatterplot of y versus x with fitted lines y = 2.47 + 4.93x (all data) and y = 1.73 + 5.12x (blue data point excluded)]
Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor   Coef    SE Coef   T      P
Constant    1.732   1.121     1.55   0.140
x           5.1169  0.2003    25.55  0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%
With the blue data point:

The regression equation is y = 2.47 + 4.93 x

Predictor   Coef    SE Coef   T      P
Constant    2.468   1.076     2.29   0.033
x           4.9272  0.1719    28.66  0.000

S = 2.709   R-Sq = 97.7%   R-Sq(adj) = 97.6%
Any outliers? Any influential data points?

[Scatterplot of y versus x]
Any outliers? Any influential data points?

[Scatterplot of y versus x with fitted lines y = 8.51 + 3.32x (all data) and y = 1.73 + 5.12x (blue data point excluded)]
Without the blue data point:

The regression equation is y = 1.73 + 5.12 x

Predictor   Coef    SE Coef   T      P
Constant    1.732   1.121     1.55   0.140
x           5.1169  0.2003    25.55  0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%
With the blue data point:

The regression equation is y = 8.50 + 3.32 x

Predictor   Coef    SE Coef   T      P
Constant    8.505   4.222     2.01   0.058
x           3.3198  0.6862    4.84   0.000

S = 10.45   R-Sq = 55.2%   R-Sq(adj) = 52.8%
Impact on regression analyses
• Not every outlier strongly influences the regression analysis.
• Always determine whether the regression analysis is unduly influenced by one or a few data points.
• Use simple plots for simple linear regression, and summary measures for multiple linear regression.
The leverages h_ii

The predicted response can be written as a linear combination of the n observed values y_1, y_2, …, y_n:

\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n \quad \text{for } i = 1, \ldots, n

where the weights h_{i1}, h_{i2}, …, h_{ii}, …, h_{in} depend only on the predictor values.

For example:

\hat{y}_1 = h_{11} y_1 + h_{12} y_2 + \cdots + h_{1n} y_n
\hat{y}_2 = h_{21} y_1 + h_{22} y_2 + \cdots + h_{2n} y_n
\vdots
\hat{y}_n = h_{n1} y_1 + h_{n2} y_2 + \cdots + h_{nn} y_n
The leverages h_ii

Because the predicted response can be written as:

\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n \quad \text{for } i = 1, \ldots, n

the leverage h_ii quantifies the influence that the observed response y_i has on its predicted value \hat{y}_i.
Properties of the leverages h_ii
• The leverage h_ii is:
  – a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.
  – a number between 0 and 1, inclusive.
• The sum of the h_ii equals p, the number of parameters.
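Since the leverages are just the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ, they are easy to compute directly. Below is a minimal numpy sketch (not part of the original lecture; the x values are hypothetical) that verifies both properties listed above:

```python
import numpy as np

# Hypothetical predictor values for a simple linear regression (n = 5)
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
n = len(x)

# Design matrix: column of 1s plus x, so p = 2 parameters
X = np.column_stack([np.ones(n), x])

# Hat matrix H = X (X'X)^{-1} X'; its diagonal entries are the leverages h_ii
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverages = np.diag(H)

print(leverages)        # each h_ii lies between 0 and 1
print(leverages.sum())  # sums to p = 2, as stated above
```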
Any high leverages h_ii?

[Scatterplot of y versus x, and dotplot of the x values (sample mean = 4.751) highlighting h(1,1) = 0.176, h(11,11) = 0.048, and h(21,21) = 0.163]

HI1:
0.176297, 0.077744, 0.048147, 0.072580, 0.163492, 0.157454, 0.065028,
0.049313, 0.109616, 0.127014, 0.061276, 0.051829, 0.127489, 0.119313,
0.050974, 0.055760, 0.140453, 0.086145, 0.049628, 0.069311, 0.141136

Sum of HI1 = 2.0000
Any high leverages h_ii?

[Scatterplot of y versus x, and dotplot of the x values (sample mean = 5.227) highlighting h(1,1) = 0.153, h(11,11) = 0.048, and h(21,21) = 0.358]

HI1:
0.153481, 0.077557, 0.047632, 0.078121, 0.357535, 0.139367, 0.066879,
0.048156, 0.088549, 0.116292, 0.063589, 0.049557, 0.096634, 0.110382,
0.050033, 0.055893, 0.096227, 0.084374, 0.052121, 0.057574, 0.110048

Sum of HI1 = 2.0000
Identifying data points whose x values are extreme … and therefore potentially influential
Using leverages to identify extreme x values

Minitab flags any observation whose leverage value, h_ii, is more than 3 times larger than the mean leverage value

\bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p}{n}

…or whose leverage is greater than 0.99 (whichever threshold is smaller).
3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) = 0.286

x      y      HI1
14.00  68.00  0.357535

[Scatterplot of y versus x with the high-leverage point at (14, 68) highlighted]

Unusual Observations
Obs   x     y      Fit     SE Fit  Residual  St Resid
21    14.0  68.00  71.449  1.620   -3.449    -1.59 X

X denotes an observation whose X value gives it large influence.
3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) = 0.286

x      y      HI2
13.00  15.00  0.311532

[Scatterplot of y versus x with the point at (13, 15) highlighted]

Unusual Observations
Obs   x     y      Fit    SE Fit  Residual  St Resid
21    13.0  15.00  51.66  5.83    -36.66    -4.23 RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
Important distinction!
• The leverage merely quantifies the potential for a data point to exert strong influence on the regression analysis.
• The leverage depends only on the predictor values.
• Whether the data point is influential or not depends on the observed value y_i.
Identifying outliers (unusual y values)

• Residuals
• Standardized residuals
  – also called internally studentized residuals
Residuals

Ordinary residuals, defined for each observation, i = 1, …, n:

e_i = y_i - \hat{y}_i

x  y  FITS1  RESI1
1  2  2.2    -0.2
2  5  4.4     0.6
3  6  6.6    -0.6
4  9  8.8     0.2
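A short numpy sketch (an illustration, not the Minitab session the slide shows) that refits this four-point example and reproduces the FITS1 and RESI1 columns:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])

# Least-squares fit of y = b0 + b1*x
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

fits = X @ b          # FITS1: 2.2, 4.4, 6.6, 8.8
residuals = y - fits  # RESI1: -0.2, 0.6, -0.6, 0.2
print(fits, residuals)
```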
Standardized residuals

Standardized residuals, defined for each observation, i = 1, …, n:

e_i^* = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}

MSE1 = 0.400000

x  y  FITS1  RESI1  HI1  SRES1
1  2  2.2    -0.2   0.7  -0.57735
2  5  4.4     0.6   0.3   1.13389
3  6  6.6    -0.6   0.3  -1.13389
4  9  8.8     0.2   0.7   0.57735
Standardized residuals
• Standardized residuals quantify how large the residuals are in standard deviation units.
  – An observation with a standardized residual that is larger than 3 (in absolute value) is generally deemed an outlier.
  – Recall that Minitab flags any observation with a standardized residual that is larger than 2 (in absolute value).
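Continuing the same four-point example, a numpy sketch that computes the standardized residuals from the MSE and the leverages, reproducing the SRES1 column above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
n, p = len(x), 2

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b

mse = (e ** 2).sum() / (n - p)                 # MSE1 = 0.4
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # HI1: 0.7, 0.3, 0.3, 0.7

sres = e / np.sqrt(mse * (1 - h))              # SRES1: -0.577, 1.134, -1.134, 0.577
print(sres)
print(np.abs(sres) > 3)                        # the usual "outlier" flag
```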
An outlier?

[Scatterplot of y versus x; S = 4.711]

Row  x        y        FITS1    HI1       s(e)     RESI1    SRES1
1    0.10000  -0.0716   3.4614  0.176297  4.27561  -3.5330  -0.82635
2    0.45401   4.1673   5.2446  0.157454  4.32424  -1.0774  -0.24916
3    1.09765   6.5703   8.4869  0.127014  4.40166  -1.9166  -0.43544
4    1.27936  13.8150   9.4022  0.119313  4.42103   4.4128   0.99818
5    2.20611  11.4501  14.0706  0.086145  4.50352  -2.6205  -0.58191
...
19   8.70156  46.5475  46.7904  0.140453  4.36765  -0.2429  -0.05561
20   9.16463  45.7762  49.1230  0.163492  4.30872  -3.3468  -0.77679
21   4.00000  40.0000  23.1070  0.050974  4.58936  16.8930   3.68110
Unusual Observations
Obs   x     y      Fit    SE Fit  Residual  St Resid
21    4.00  40.00  23.11  1.06    16.89     3.68 R

R denotes an observation with a large standardized residual.
Why should we care? (Regression of y on x with outlier)

The regression equation is y = 2.95763 + 5.03734 x

S = 4.71075   R-Sq = 91.0%   R-Sq(adj) = 90.5%

Analysis of Variance
Source      DF  SS       MS       F        P
Regression  1   4265.82  4265.82  192.230  0.000
Error       19   421.63    22.19
Total       20  4687.46
Why should we care? (Regression of y on x without outlier)

The regression equation is y = 1.73217 + 5.11687 x

S = 2.5919   R-Sq = 97.3%   R-Sq(adj) = 97.2%

Analysis of Variance
Source      DF  SS       MS       F        P
Regression  1   4386.07  4386.07  652.841  0.000
Error       18   120.93     6.72
Total       19  4507.00
Identifying influential data points

• Deleted residuals
• Deleted t residuals
  – also called studentized deleted residuals
  – also called externally studentized residuals
• Difference in fits, DFITS
• Cook's distance measure
Basic idea of these four measures
• Delete the observations one at a time, each time refitting the regression model on the remaining n − 1 observations.
• Compare the results using all n observations to the results with the ith observation deleted, to see how much influence the observation has on the analysis.
Deleted residuals

y_i = the observed response for the ith observation
\hat{y}_{(i)} = the predicted response for the ith observation, based on the model estimated with the ith observation deleted

Deleted residual: d_i = y_i - \hat{y}_{(i)}
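The definition can be implemented literally: drop observation i, refit, and predict back at x_i. A minimal numpy sketch, using the four-point example (x = 1, 2, 3, 10) from the next slide:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])

def deleted_residual(i):
    # Refit the line with observation i removed ...
    keep = np.arange(len(x)) != i
    X = np.column_stack([np.ones(keep.sum()), x[keep]])
    b, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
    # ... then compare y_i with the prediction from the reduced fit
    y_hat_i = b[0] + b[1] * x[i]
    return y[i] - y_hat_i

print(deleted_residual(3))  # d_4 = 2.1 - 16.1 = -14
```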
[Scatterplot with fitted lines y = 0.6 + 1.55x (without the point at x = 10) and y = 3.82 − 0.13x (all data)]

y_4 = 2.1
\hat{y}_{(4)} = 0.6 + 1.55(10) = 16.1
d_4 = 2.1 - 16.1 = -14
Deleted t residuals

A deleted t residual is just a standardized deleted residual:

t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}

The deleted t residuals follow a t distribution with (n − 1) − p degrees of freedom.
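For reference, one way to compute these in Python (an illustrative sketch, not the tool used in the slides) is statsmodels' OLSInfluence, which reproduces the TRES1 column shown on the next slide for the four-point example:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])

fit = sm.OLS(y, sm.add_constant(x)).fit()
infl = OLSInfluence(fit)

# Externally studentized (deleted t) residuals, t with (n-1)-p df
print(infl.resid_studentized_external)  # -1.7431, 0.1217, 1.6361, -19.7990
```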
[Scatterplot with fitted lines y = 0.6 + 1.55x and y = 3.82 − 0.13x]

x   y    RESI1  TRES1
1   2.1  -1.59   -1.7431
2   3.8   0.24    0.1217
3   5.2   1.77    1.6361
10  2.1  -0.42  -19.7990
The t(1) distribution

[Density curve of the t(1) distribution over −4 to 4]

Do any of the deleted t residuals stick out like a sore thumb?
[Scatterplot of y versus x with fitted lines y = 2.96 + 5.04x (all data) and y = 1.73 + 5.12x (without the outlier)]

Row  x        y        RESI1    SRES1     TRES1
1    0.10000  -0.0716  -3.5330  -0.82635  -0.81916
2    0.45401   4.1673  -1.0774  -0.24916  -0.24291
3    1.09765   6.5703  -1.9166  -0.43544  -0.42596
...
19   8.70156  46.5475  -0.2429  -0.05561  -0.05413
20   9.16463  45.7762  -3.3468  -0.77679  -0.76837
21   4.00000  40.0000  16.8930   3.68110   6.69012
The t(18) distribution

[Density curve of the t(18) distribution over −3 to 3]

Do any of the deleted t residuals stick out like a sore thumb?
DFITS

The difference in fits:

DFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}}

is the number of standard deviations that the fitted value changes when the ith case is omitted.
Using DFITS
An observation is deemed influential…

…if the absolute value of its DFIT value is greater than:

2\sqrt{\frac{p + 1}{n - p - 1}}

…or if the absolute value of its DFIT value sticks out like a sore thumb from the other DFIT values.
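A numpy sketch of the DFITS computation and this cutoff (an illustration on the small hypothetical four-point set used earlier, not the slides' 21-point data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
n, p = len(x), 2

X = np.column_stack([np.ones(n), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
b, *_ = np.linalg.lstsq(X, y, rcond=None)

def dfits(i):
    keep = np.arange(n) != i
    Xi, yi = X[keep], y[keep]
    bi, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
    mse_i = ((yi - Xi @ bi) ** 2).sum() / (n - 1 - p)  # MSE with obs i deleted
    yhat_full = X[i] @ b                               # fit from all n points
    yhat_del = X[i] @ bi                               # fit with obs i deleted
    return (yhat_full - yhat_del) / np.sqrt(mse_i * h[i])

cutoff = 2 * np.sqrt((p + 1) / (n - p - 1))  # the rule of thumb above
print([dfits(i) for i in range(n)], cutoff)
```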
p 1
2 1
2
2
 0.82
n  p 1
21 2  1
x
14.00
y
68.00
70
60
50
y
40
30
20
10
0
0
2
4
6
8
x
10
12
14
DFIT1
-1.23841
Row  x        y        DFIT1
1    0.1000   -0.0716  -0.52503
2    0.4540    4.1673  -0.08388
3    1.0977    6.5703  -0.18232
4    1.2794   13.8150   0.75898
5    2.2061   11.4501  -0.21823
6    2.5006   12.9554  -0.20155
7    3.0403   20.1575   0.27774
8    3.2358   17.5633  -0.08230
9    4.4531   26.0317   0.13865
10   4.1699   22.7573  -0.02221
11   5.2847   26.3030  -0.18487
12   5.5924   30.6885   0.05523
13   5.9209   33.9402   0.19741
14   6.6607   30.9228  -0.42449
15   6.7995   34.1100  -0.17249
16   7.9794   44.4536   0.29918
17   8.4154   46.5022   0.30960
18   8.7161   50.0568   0.63049
19   8.7016   46.5475   0.14948
20   9.1646   45.7762  -0.25094
21   14.0000  68.0000  -1.23841
p 1
2 1
2
2
 0.82
n  p 1
21 2  1
x
13.00
y
15.00
70
60
50
y
40
30
20
10
0
0
2
4
6
8
x
10
12
14
DFIT2
-11.4670
Row  x        y        DFIT2
1    0.1000   -0.0716  -0.4028
2    0.4540    4.1673  -0.2438
3    1.0977    6.5703  -0.2058
4    1.2794   13.8150   0.0376
5    2.2061   11.4501  -0.1314
6    2.5006   12.9554  -0.1096
7    3.0403   20.1575   0.0405
8    3.2358   17.5633  -0.0424
9    4.4531   26.0317   0.0602
10   4.1699   22.7573   0.0092
11   5.2847   26.3030   0.0054
12   5.5924   30.6885   0.0782
13   5.9209   33.9402   0.1278
14   6.6607   30.9228   0.0072
15   6.7995   34.1100   0.0731
16   7.9794   44.4536   0.2805
17   8.4154   46.5022   0.3236
18   8.7161   50.0568   0.4361
19   8.7016   46.5475   0.3089
20   9.1646   45.7762   0.2492
21   13.0000  15.0000  -11.4670
p 1
2 1
2
2
 0.82
n  p 1
21 2  1
x
4.00
y
40.00
70
60
50
y
40
30
20
10
0
0
2
4
6
8
x
10
12
14
DFIT3
1.5505
Row  x        y        DFIT3
1    0.10000  -0.0716  -0.37897
2    0.45401   4.1673  -0.10501
3    1.09765   6.5703  -0.16248
4    1.27936  13.8150   0.36737
5    2.20611  11.4501  -0.17547
6    2.50064  12.9554  -0.16377
7    3.04030  20.1575   0.10670
8    3.23583  17.5633  -0.09265
9    4.45308  26.0317   0.03061
10   4.16990  22.7573  -0.05850
11   5.28474  26.3030  -0.16025
12   5.59238  30.6885  -0.02183
13   5.92091  33.9402   0.05988
14   6.66066  30.9228  -0.34036
15   6.79953  34.1100  -0.18835
16   7.97943  44.4536   0.10017
17   8.41536  46.5022   0.09771
18   8.71607  50.0568   0.29275
19   8.70156  46.5475  -0.02188
20   9.16463  45.7762  -0.33969
21   4.00000  40.0000   1.55050
Cook's distance

D_i = \frac{(y_i - \hat{y}_i)^2}{p \times MSE} \left[ \frac{h_{ii}}{(1 - h_{ii})^2} \right]

• D_i depends on both the residual e_i and the leverage h_ii.
• D_i summarizes how much each of the estimated coefficients changes when deleting the ith observation.
• A large D_i indicates that y_i has a strong influence on the estimated coefficients.
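A numpy sketch that evaluates this formula directly (again on the small hypothetical four-point illustration, not the slides' data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
n, p = len(x), 2

X = np.column_stack([np.ones(n), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
mse = (e ** 2).sum() / (n - p)

# Cook's distance, directly from the formula above
D = (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
print(D)  # values well above 1 deserve a closer look
```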
Effect on estimates of removing each data point one at a time?

[Scatterplot of y versus x]
Effect on estimates of removing each data point one at a time?

[Scatterplot of estimated slope (b1) versus estimated intercept (b0), with the all-data estimate marked]
Effect on estimates of removing each data point one at a time?

[Scatterplot of y versus x]
Effect on estimates of removing each data point one at a time?

[Scatterplot of estimated slope (b1) versus estimated intercept (b0), with the all-data estimate and the estimate with (13, 15) removed both marked]
Using Cook's distance
• If D_i is greater than 1, then the ith data point is worthy of further investigation.
• If D_i is greater than 4, then the ith data point is most certainly influential.
• Or, if D_i sticks out like a sore thumb from the other D_i values, it is most certainly influential.
x      y      COOK1
14.00  68.00  0.701960

[Scatterplot of y versus x with the high-leverage point at (14, 68) highlighted]
Row  x        y        COOK1
1    0.1000   -0.0716  0.134156
2    0.4540    4.1673  0.003705
3    1.0977    6.5703  0.017302
4    1.2794   13.8150  0.241688
5    2.2061   11.4501  0.024434
6    2.5006   12.9554  0.020879
7    3.0403   20.1575  0.038414
8    3.2358   17.5633  0.003555
9    4.4531   26.0317  0.009944
10   4.1699   22.7573  0.000260
11   5.2847   26.3030  0.017379
12   5.5924   30.6885  0.001605
13   5.9209   33.9402  0.019747
14   6.6607   30.9228  0.081345
15   6.7995   34.1100  0.015290
16   7.9794   44.4536  0.044621
17   8.4154   46.5022  0.047961
18   8.7161   50.0568  0.173897
19   8.7016   46.5475  0.011657
20   9.1646   45.7762  0.032320
21   14.0000  68.0000  0.701960
x      y      COOK2
13.00  15.00  4.04801

[Scatterplot of y versus x with the point at (13, 15) highlighted]
Row  x        y        COOK2
1    0.1000   -0.0716  0.08172
2    0.4540    4.1673  0.03076
3    1.0977    6.5703  0.02198
4    1.2794   13.8150  0.00075
5    2.2061   11.4501  0.00901
6    2.5006   12.9554  0.00629
7    3.0403   20.1575  0.00086
8    3.2358   17.5633  0.00095
9    4.4531   26.0317  0.00191
10   4.1699   22.7573  0.00004
11   5.2847   26.3030  0.00002
12   5.5924   30.6885  0.00320
13   5.9209   33.9402  0.00848
14   6.6607   30.9228  0.00003
15   6.7995   34.1100  0.00280
16   7.9794   44.4536  0.03958
17   8.4154   46.5022  0.05229
18   8.7161   50.0568  0.09180
19   8.7016   46.5475  0.04809
20   9.1646   45.7762  0.03194
21   13.0000  15.0000  4.04801
x     y      COOK3
4.00  40.00  0.36391

[Scatterplot of y versus x with the outlier at (4, 40) highlighted]
Row  x        y        COOK3
1    0.10000  -0.0716  0.073075
2    0.45401   4.1673  0.005801
3    1.09765   6.5703  0.013793
4    1.27936  13.8150  0.067493
5    2.20611  11.4501  0.015960
6    2.50064  12.9554  0.013909
7    3.04030  20.1575  0.005955
8    3.23583  17.5633  0.004498
9    4.45308  26.0317  0.000494
10   4.16990  22.7573  0.001799
11   5.28474  26.3030  0.013191
12   5.59238  30.6885  0.000251
13   5.92091  33.9402  0.001886
14   6.66066  30.9228  0.056276
15   6.79953  34.1100  0.018263
16   7.97943  44.4536  0.005272
17   8.41536  46.5022  0.005020
18   8.71607  50.0568  0.043959
19   8.70156  46.5475  0.000253
20   9.16463  45.7762  0.058966
21   4.00000  40.0000  0.363914
A strategy for dealing with problematic data points
• Don't forget that the above methods are just statistical tools. It's okay to use common sense and knowledge about the situation.
• First, check for obvious data errors.
  – If a data entry error, simply correct it.
  – If not representative of the population, delete it.
  – If a procedural error invalidates the measurement, delete it.
A comment about deleting data points
• Do not delete data just because they do not fit your preconceived regression model.
• You must have a good, objective reason for deleting data points.
• If you delete any data after you've collected it, justify and describe the deletion in your reports.
• If you're not sure what to do about a data point, analyze the data twice and report both results.
A strategy for dealing with problematic data points (cont'd)
• Then, consider model misspecification.
  – Any important variables missing?
  – Any nonlinearity that needs to be modeled?
  – Any missing interaction terms?
• If nonlinearity is an issue, one possibility is to reduce the scope of the model and fit a linear model.