Regression diagnostics


Transcript: Regression diagnostics

Slide 1

V. Regression Diagnostics

 Regression analysis assumes a random sample of independent observations on the same individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The covariances of the errors associated with any two different y observations are 0: i.e. the errors are uncorrelated—the errors associated with one value of y have no effect on the errors associated with other y values.
(4) The probability distribution of e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions hold, the relationship between the outcome variable & the independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important: even if the distribution of a regression model’s residuals departs from approximate normality, the central limit theorem makes us generally confident that confidence intervals & p-values will be trustworthy approximations if the sample is at least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a result of clustered observations or time-series observations: the errors are not independent from one observation to another but rather are correlated.

 E.g., in a cluster sample of individuals from neighborhoods, schools, or households, the individuals within any such unit tend to be significantly more homogeneous than are individuals in the wider sample. Ditto for panel or time-series observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is whether the violations surpass some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the outcome variable needs to be redefined to properly account for the y/x relationships (e.g., from ‘miles per gallon’ to ‘gallons per mile’).
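 A minimal sketch of such a redefinition (assuming Stata’s auto dataset, which is not the data used below):
. sysuse auto, clear
. gen gpm = 1/mpg     // redefined outcome: gallons per mile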

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t suggest a transformation, but log wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know whether its wage is an error, so we’ll keep an eye on id=24 for possible problems.
 Let’s examine the independent variables:


. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. ladder exper
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitci lwage tenure
. twoway qfitci lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 One form of model misspecification is omitted variables, which also bias the slope coefficients (see Allison, Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In STATA we use ‘estat ovtest’ (also known as the regression specification error test, RESET) to indicate whether there are important omitted variables or not.
 ‘estat ovtest’ adds powers of the model’s fitted values as extra regressors: we want the test to turn out insignificant, so that we fail to reject the null hypothesis that the model has no important omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of wage
Ho: model has no omitted variables
        F(3, 517) = 9.37
        Prob > F  = 0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
Ramsey RESET test using powers of the fitted values of lwage
Ho: model has no omitted variables
        F(3, 515) = 2.11
        Prob > F  = 0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional specification is ‘linktest’: it tests whether y is properly specified or not.
 linktest’s ‘_hatsq’ must test insignificant: we want to fail to reject the null hypothesis that y is specified correctly.

. linktest
------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest   (p = .12)
. linktest       (_hatsq p = .15)

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s residuals versus fitted values to obtain a graphic perspective on the model’s fit & problems.

. rvfplot, yline(0) ml(id)

[Scatterplot: residuals (about -2 to 1) vs. fitted values (about 1 to 2.5), observations labeled by id; the spread fans out, with low-end outliers such as id=24, id=128, & id=12.]

 Problems of heteroscedasticity?

 So, even though the model passed linktest & estat ovtest, at least one basic problem remains to be overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out which of the explanatory variables has a significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Signs of coefficients may be opposite of the hypothesized direction (but this could be Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’): ‘VIF’ > 10. VIF measures the inflation in a coefficient’s variance due to multicollinearity.
 Square root of VIF: shows the amount of increase in an explanatory variable’s standard error due to multicollinearity.
(7) Post-model estimation (STATA command ‘vif’): ‘Tolerance’ < .10. Tolerance, the reciprocal of VIF, measures the extent of variance that is independent of the other explanatory variables.
(8) Pre-model estimation (downloadable STATA command ‘collin’): ‘Condition Number’ > 15 or especially > 30. (A sketch of these checks follows.)
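 A minimal sketch of checks (1), (6), (7), & (8), assuming the model below (‘collin’ is user-written & must be located & installed first, e.g. via ‘findit collin’):
. pwcorr hsc scol ccol exper exper2 female nonwhite, sig   // (1) bivariate correlations
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. vif                                                      // (6) & (7): VIF & tolerance (1/VIF)
. collin hsc scol ccol exper exper2 female nonwhite        // (8) condition number, pre-model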


. vif
    Variable |      VIF       1/VIF
-------------+----------------------
       exper |    15.62    0.064013
      exper2 |    14.65    0.068269
         hsc |     1.88    0.532923
  _Itencat_3 |     1.83    0.545275
        ccol |     1.70    0.589431
        scol |     1.69    0.591201
  _Itencat_2 |     1.50    0.666210
  _Itencat_1 |     1.38    0.726064
      female |     1.07    0.934088
    nonwhite |     1.03    0.971242
-------------+----------------------
    Mean VIF |     4.23

 The seemingly troublesome scores for exper & exper2 are an artifact of the quadratic form & pose no problem. The results look fine.

What would we do if there were a problem of multicollinearity?

 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see Mendenhall/Sincich), perhaps using STATA’s ‘center’ command (a sketch follows this list).

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
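 A minimal sketch of centering, applied here to exper before forming its square (experc & experc2 are made-up names):
. su exper, meanonly
. gen experc = exper - r(mean)     // centered predictor
. gen experc2 = experc^2           // quadratic term built from the centered variable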

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in general: if the sample is as large as 100-200, the central limit theorem says that confidence intervals & p-values will be good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of standardized) residuals because we can assess their distribution relative to the normal distribution:
. predict rstu if e(sample), rstudent
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid
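 Two other standard views of the residuals’ distribution (a sketch, reusing the rstu variable created above):
. kdensity rstu, normal     // kernel density with a normal curve overlaid
. qnorm rstu                // quantile-normal plot: departures show up in the tails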

. hist rstu, norm

[Histogram of studentized residuals (about -4 to 4) with density on the y-axis & a normal curve overlaid; roughly bell-shaped, with a few low-end outliers.]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
---------------------------------------------------
              Source |       chi2     df        p
---------------------+-----------------------------
  Heteroskedasticity |      21.27     13   0.0677
            Skewness |       4.25      4   0.3733
            Kurtosis |       2.47      1   0.1160
---------------------+-----------------------------
               Total |      27.99     18   0.0621
---------------------------------------------------

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage
        chi2(1)     = 15.76
        Prob > chi2 = 0.0001

 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |      chi2   df        p
-------------+--------------------------------
         hsc |      0.14    1   1.0000  #
        scol |      0.17    1   1.0000  #
        ccol |      2.47    1   0.7078  #
       exper |      1.56    1   0.9077  #
      exper2 |      0.10    1   1.0000  #
  _Itencat_1 |      0.44    1   0.9992  #
  _Itencat_2 |      0.00    1   1.0000  #
  _Itencat_3 |     10.03    1   0.0153  #
      female |      1.02    1   0.9762  #
    nonwhite |      0.03    1   1.0000  #
-------------+--------------------------------
simultaneous |     23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
----------------------------------------------
    Variable |      chi2   df        p
-------------+--------------------------------
         hsc |      0.14    1   1.0000  #
        scol |      0.17    1   1.0000  #
        ccol |      2.47    1   0.7078  #
       exper |      3.20    1   0.5341  #
      exper2 |      3.20    1   0.5341  #
  _Itencat_1 |      0.44    1   0.9992  #
  _Itencat_2 |      0.00    1   1.0000  #
  _Itencat_3 |     10.03    1   0.0153  #
      female |      1.02    1   0.9762  #
    nonwhite |      0.03    1   1.0000  #
----------------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Scatterplot: residuals vs. fitted values, labeled by id; the same fan-shaped pattern as before, with low-end outliers id=24, id=128, & id=12.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Scatterplot: residuals vs. _Itencat_3 (tencat==3, coded 0 to 1), labeled by id; the residual spread differs across the two categories, & id=24 is again the extreme low value.]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I’m guessing that including the variable age, which the data set doesn’t have, would either solve or reduce the problem. Why?

 Maybe the small # of observations for the high end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor (multi-level or binary) may work—although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to use weighted least squares regression (see the sketch after this list).
 If nothing else works, we could use robust standard errors. These relax Assumption 2—to the point that we wouldn’t have to check for non-constant variance (& in fact the diagnostics for doing so wouldn’t work).
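 A minimal sketch of one feasible-WLS recipe (an illustration, not the course’s procedure; ehat, lehat2, & ghat are made-up names, & _Itencat_* assumes the xi-generated dummies exist):
. reg lwage hsc scol ccol exper exper2 _Itencat_* female nonwhite
. predict ehat, resid
. gen lehat2 = ln(ehat^2)          // log of squared residuals
. reg lehat2 hsc scol ccol exper exper2 _Itencat_* female nonwhite
. predict ghat, xb                 // fitted log-variance
. reg lwage hsc scol ccol exper exper2 _Itencat_* female nonwhite [aw = 1/exp(ghat)]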

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |   .22241643***    .22241643***
        scol |   .32030543***    .32030543***
        ccol |   .68798333***    .68798333***
       exper |   .02854957***    .02854957***
      exper2 |  -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |   .22096745***    .22096745***
  _Itencat_3 |   .28798112***    .28798112***
      female |  -.29395956***   -.29395956***
    nonwhite |  -.06409284      -.06409284
       _cons |   1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors, so that we can explore additional diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s ‘estat bgodfrey’, the Breusch-Godfrey Lagrange multiplier test.
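 A minimal sketch (assuming yearly time-series data with made-up variables y, x, & year):
. tsset year                  // declare the time variable first
. reg y x
. estat bgodfrey, lags(1)     // Breusch-Godfrey LM test for serial correlation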

This model seems to be satisfactory from the perspective of linear regression’s assumptions, with the exception of a borderline (statistically insignificant) problem of non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope coefficients depends on its discrepancy & leverage:
discrepancy: how far the observation falls from the mean of y on the Y-axis;
leverage: how far the observation falls from the mean of x on the X-axis.

discrepancy × leverage = influence
 Highly influential observations are most likely to occur in small samples.

 Discrepancy is measured by residuals, a standardized version of which is the studentized residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or more usually represent outliers (i.e. y-values with high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant, reduce its fit, & increase its standard errors, but by themselves—without leverage—they have little influence on the slope coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
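 A sketch of flagging cases by these two cutoffs (assuming the current model, with k = 11 estimated coefficients including the constant & n = 526):
. predict rstu if e(sample), rstudent
. predict h if e(sample), hat
. list id rstu if abs(rstu) >= 3 & rstu < .     // high discrepancy
. list id h if h >= 3*11/526 & h < .            // high leverage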

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
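 A sketch using the slide’s large-sample cutoff of 4/n (dfits is predict’s DFITS option; dft is a made-up name):
. predict d if e(sample), cooksd
. predict dft if e(sample), dfits
. list id d if d >= 4/e(N) & d < .
. list id dft if abs(dft) >= 4/e(N) & dft < .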

 DFBETAs also measure the actual influence of observations, in this case on particular slope coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure of the influence of observations on the slope coefficients.
 Every DFBETA increment of 1 shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2 divided by the square root of n (in large samples), represent influential outliers.
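 A sketch using the 2/sqrt(n) cutoff (dfbeta’s generated variable names vary by Stata version—DFvarname in older versions, _dfbeta_# in newer ones; the DF_Itencat_3 name follows this example’s output):
. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) >= 2/sqrt(e(N)) & DF_Itencat_3 < .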

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[Leverage-vs.-residual-squared plot: leverage (0 to about .06) against normalized residual squared (0 to about .04), labeled by id; a few points (e.g., id=465, id=306) show relatively high leverage, while id=24 & id=128 show high residuals.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[Added-variable plots, one per predictor: e(lwage | X) against e(predictor | X). The plotted coefficients match the regression output, e.g. hsc: coef = .22241643, t = 4.56; scol: coef = .32030543, t = 5.86; ccol: coef = .68798333, t = 11.96; exper: coef = .02854957, t = 5.67; exper2: coef = -.00057702, t = -5.37; _Itencat_1: coef = -.0027571, t = -.06; _Itencat_2: coef = .22096745, t = 4.49; _Itencat_3: coef = .28798112, t = 5.17; female: coef = -.29395956, t = -8.22; nonwhite: coef = -.06409284, t = -1.11. No simultaneous x/y extremes stand out.]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3: e(lwage | X) vs. e(_Itencat_3 | X), labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 sits low on the residual axis but near the middle of the x-axis.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstudent
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[Scatterplot: Cook’s Distance (d, 0 to about .03) against id (0 to 526); id=24 has the largest value, roughly .03—still far below the cutoff of 1.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24
 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors, if possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
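 A minimal sketch (hadimvo ships with older Stata; the variable choices & p() level here are illustrative, & out is a made-up name):
. hadimvo wage educ exper, gen(out) p(.05)     // flags joint multivariate outliers; out==1 marks them
. list id wage educ exper if out==1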

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 2

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

. hist rstu, norm

[Graph: histogram of the studentized residuals (about -4 to 4) with a normal-density overlay.]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi’s decomposition of IM-test

              Source |   chi2    df       p
---------------------+------------------------
  Heteroskedasticity |  21.27    13    0.0677
            Skewness |   4.25     4    0.3733
            Kurtosis |   2.47     1    0.1160
---------------------+------------------------
               Total |  27.99    18    0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

    chi2(1)      =   15.76
    Prob > chi2  =   0.0001

 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |   chi2   df        p
-------------+--------------------------------
         hsc |   0.14    1    1.0000 #
        scol |   0.17    1    1.0000 #
        ccol |   2.47    1    0.7078 #
       exper |   1.56    1    0.9077 #
      exper2 |   0.10    1    1.0000 #
  _Itencat_1 |   0.44    1    0.9992 #
  _Itencat_2 |   0.00    1    1.0000 #
  _Itencat_3 |  10.03    1    0.0153 #
      female |   1.02    1    0.9762 #
    nonwhite |   0.03    1    1.0000 #
-------------+--------------------------------
simultaneous |  23.68   10    0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)

Szroeter’s test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |   chi2   df        p
-------------+-------------------------
         hsc |   0.14    1    1.0000 #
        scol |   0.17    1    1.0000 #
        ccol |   2.47    1    0.7078 #
       exper |   3.20    1    0.5341 #
      exper2 |   3.20    1    0.5341 #
  _Itencat_1 |   0.44    1    0.9992 #
  _Itencat_2 |   0.00    1    1.0000 #
  _Itencat_3 |  10.03    1    0.0153 #
      female |   1.02    1    0.9762 #
    nonwhite |   0.03    1    1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Graph: residuals (about -2 to 1) vs. fitted values (about 1 to 2.5), markers labeled by id; id=24 sits at the extreme bottom.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Graph: residuals vs. _Itencat_3 (tencat==3, 0 to 1), markers labeled by id; id=24 again sits at the extreme bottom.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to non-constant
variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
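 A rough sketch of the weighted approach, assuming we had an estimate, varhat, of each observation’s error variance (a hypothetical variable; constructing it is the hard part):
. gen w = 1/varhat
. reg lwage hsc scol ccol exper exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite [aweight=w]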

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |   .22241643***   .22241643***
        scol |   .32030543***   .32030543***
        ccol |   .68798333***   .68798333***
       exper |   .02854957***   .02854957***
      exper2 |  -.00057702***  -.00057702***
  _Itencat_1 |  -.0027571      -.0027571
  _Itencat_2 |   .22096745***   .22096745***
  _Itencat_3 |   .28798112***   .28798112***
      female |  -.29395956***  -.29395956***
    nonwhite |  -.06409284     -.06409284
       _cons |   1.1567164***   1.1567164***
-------------+--------------------------------
           N |      526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant
variance has to be pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors,
so that we can explore additional diagnostics.

 Our final model, however, will use robust
standard errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither clustered nor panel/time-series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command (the Breusch-Godfrey
Lagrange Multiplier test).
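 A minimal sketch (y, x, & t are placeholders; the data must first be declared as time series):
. tsset t
. reg y x
. estat bgodfrey    // Breusch-Godfrey LM test for serial correlation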

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the
X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
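 A minimal sketch of flagging such outliers (rstudent is the full option name):
. predict rstu if e(sample), rstudent
. list id rstu if abs(rstu) > 3 & rstu < .    // |studentized residual| of 3 or more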

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
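 A sketch of flagging relatively large hat values (e(df_m) & e(N) are stored by regress; e(df_m)+1 counts the coefficients, including the constant):
. predict h if e(sample), hat
. list id h if h > 3*(e(df_m)+1)/e(N) & h < .    // hat values of 3*k/n or more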

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
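 The same logic for Cook’s Distance, as a sketch:
. predict d if e(sample), cooksd
. list id d if d > 4/e(N) & d < .    // the large-sample 4/n cutoff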

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of an observation’s influence on the slope
coefficients.
 A DFBETA of 1 means that deleting the observation
shifts the corresponding slope coefficient by 1
standard error: DFBETAs of 1 or more, or of at least
2 divided by the square root of n (in large samples),
represent influential outliers.
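 A sketch for one DFBETA, using the large-sample cutoff (the DF_ names follow the pattern dfbeta produces for this model):
. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(e(N)) & DF_Itencat_3 < .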

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Graph: leverage (0 to .06) vs. normalized residual squared (0 to .04), markers labeled by id; id=24 shows the largest normalized residual squared & id=465 the highest leverage.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[Graphs: added-variable plots of e(lwage | X) against each predictor’s e(x | X), with partial coefficients:
   hsc         coef =  .22241643   se = .04881795   t =  4.56
   scol        coef =  .32030543   se = .05467635   t =  5.86
   ccol        coef =  .68798333   se = .05753517   t = 11.96
   exper       coef =  .02854957   se = .00503299   t =  5.67
   exper2      coef = -.00057702   se = .00010737   t = -5.37
   _Itencat_1  coef = -.0027571    se = .0491805    t =  -.06
   _Itencat_2  coef =  .22096745   se = .04916835   t =  4.49
   _Itencat_3  coef =  .28798112   se = .0557211    t =  5.17
   female      coef = -.29395956   se = .03576118   t = -8.22
   nonwhite    coef = -.06409284   se = .05772311   t = -1.11]

. avplot _Itencat_3, ml(id)

[Graph: e(lwage | X) vs. e(_Itencat_3 | X), markers labeled by id; coef = .28798112, se = .0557211, t = 5.17; id=24 sits at the bottom of the plot.]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Graph: Cook’s Distance d (0 to .03) vs. id (1 to 526); id=24 has the largest value, followed by ids 150, 128, 59, & 58.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors, if possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
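 A minimal sketch with this model’s terms (qreg estimates the median by default):
. qreg lwage hsc scol ccol exper exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. rreg lwage hsc scol ccol exper exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite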

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
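 A sketch of hadimvo’s use, assuming its documented syntax, with an illustrative variable list (gen() names the flag variable; p(.05) sets the significance level):
. hadimvo lwage exper tenure, gen(outlier) p(.05)
. list id lwage exper tenure if outlier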

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 3

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[added-variable plots: e( lwage | X ) against the partial for each predictor —
hsc: coef = .22241643, se = .04881795, t = 4.56
scol: coef = .32030543, se = .05467635, t = 5.86
ccol: coef = .68798333, se = .05753517, t = 11.96
exper: coef = .02854957, se = .00503299, t = 5.67
exper2: coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571, se = .0491805, t = -.06
_Itencat_2: coef = .22096745, se = .04916835, t = 4.49
_Itencat_3: coef = .28798112, se = .0557211, t = 5.17
female: coef = -.29395956, se = .03576118, t = -8.22
nonwhite: coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[avplot for _Itencat_3: e( lwage | X ) on the y-axis against e( _Itencat_3 | X ) on the x-axis, points labeled by id; coef = .28798112, se = .0557211, t = 5.17; id=24 sits at the bottom of the plot]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
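 As a quick numeric check (a sketch using the
±3 studentized-residual cutoff noted earlier):
. list id rstu if abs(rstu) > 3 & rstu < .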
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter of d by id: Cook’s Distance (y-axis, 0 to .03) against id (x-axis, 0 to 500); id=24 stands apart with d of about .03]

 Note id=24.
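 To complement the plot, a sketch of the 4/n
screen (e(N) is the estimation sample size, 526
here):
. count if d >= 4/e(N) & d < .
. list id d if d >= 4/e(N) & d < .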
. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
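 A hedged sketch of those alternatives applied
to the working model (illustration only, same
predictors as before):
. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite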

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
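 A sketch of how hadimvo might be called—its
gen() option names the flag variable it creates;
the exact syntax here is an assumption, so check
help hadimvo:
. hadimvo lwage exper tenure, gen(mvout)
. list id wage exper tenure if mvout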

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect
about 5% of observations to fall beyond ±2
standard deviations by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 4

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 5

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest     (Prob > F = .12)
. linktest         (_hatsq p = .15)

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot, observations labeled by id; x-axis:
fitted values (1 to 2.5), y-axis: residuals (-2 to 1). Low-end
outliers include id=58, 203, 381, 260, 12, 128, & 24.]
 Problems of heteroscedasticity?


 So, even though the model passed
linktest & estat ovtest, at least one
basic problem remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) – ‘VIF’>10: measures
inflation in variance due to multicollinearity.
 Square root of VIF: shows the amount of increase in an
explanatory variable’s standard error due to multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’<.10:
reciprocal of VIF; measures the extent of variance that is
independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command ‘collin’) –
‘Condition Number’>15 or especially >30.
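 To illustrate what these indicators measure: a predictor’s VIF is
1/(1-R2) from an auxiliary regression of that predictor on all the
others, & its tolerance is 1-R2—a minimal sketch for exper:
. reg exper exper2 hsc scol ccol _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. di "VIF = " 1/(1-e(r2)) "  tolerance = " 1-e(r2)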


. vif

    Variable |      VIF       1/VIF
-------------+----------------------
       exper |    15.62    0.064013
      exper2 |    14.65    0.068269
         hsc |     1.88    0.532923
  _Itencat_3 |     1.83    0.545275
        ccol |     1.70    0.589431
        scol |     1.69    0.591201
  _Itencat_2 |     1.50    0.666210
  _Itencat_1 |     1.38    0.726064
      female |     1.07    0.934088
    nonwhite |     1.03    0.971242
-------------+----------------------
    Mean VIF |     4.23

 The seemingly troublesome scores for exper &
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command (see the sketch after this list).

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
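 For the centering option above, a minimal by-hand sketch (experc &
experc2 are illustrative names); centering exper before squaring it
typically cuts the exper/exper2 correlation & their VIFs:
. su exper, meanonly
. gen experc=exper-r(mean)
. gen experc2=experc^2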

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

. hist rstu, norm

[Histogram of studentized residuals with a normal-density overlay;
x-axis: studentized residuals (-4 to 4), y-axis: density (0 to .5).]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df         p
---------------------+-----------------------------
  Heteroskedasticity |      21.27     13    0.0677
            Skewness |       4.25      4    0.3733
            Kurtosis |       2.47      1    0.1160
---------------------+-----------------------------
               Total |      27.99     18    0.0621
---------------------------------------------------

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

        chi2(1)      =   15.76
        Prob > chi2  =  0.0001
 There seem to be problems. Let’s
inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------
    Variable |      chi2   df        p
-------------+-------------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      1.56    1   0.9077 #
      exper2 |      0.10    1   1.0000 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
-------------+-------------------------------
simultaneous |     23.68   10   0.0085
---------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

. estat szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |      chi2   df        p
-------------+-------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      3.20    1   0.5341 #
      exper2 |      3.20    1   0.5341 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot, observations labeled by id; x-axis:
fitted values (1 to 2.5), y-axis: residuals (-2 to 1). The same
low-end outliers (id=58, 203, 381, 260, 12, 128, 24) stand out.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot for _Itencat_3, observations labeled
by id; x-axis: tencat==3 (0 to 1), y-axis: residuals (-2 to 1).
Low-end outliers include id=282, 188, 457, 150, 448, 212, 128, & 24.]

 By the way, note id=24.


 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to
non-constant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression (see
the sketch after this list).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
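 For the WLS option above, one feasible-GLS sketch (the weighting
scheme follows Wooldridge’s textbook procedure; ehat, loge2, loge2hat,
& w are illustrative names):
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. predict ehat if e(sample), resid
. gen loge2=ln(ehat^2)
. reg loge2 hs scol ccol exper exper2 _Itencat_* female nonwhite
. predict loge2hat if e(sample), xb
. gen w=1/exp(loge2hat)
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=w]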

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite |  -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors so that we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel or
time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
‘estat bgodfrey’ (the Breusch-Godfrey
Lagrange Multiplier test).
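 A generic sketch of that last option (y, x1, x2, & year are
hypothetical names—these wage data are not a time series):
. tsset year
. reg y x1 x2
. estat bgodfrey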

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the
X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values mean observations
farther from the x-means.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of an observation’s influence on the slope
coefficients.
 Every DFBETA increment of 1 shifts the
corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more, or of at least
2 divided by the square root of n (in large
samples), represent influential outliers.
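 When we get to the numeric checks below, those cutoffs can be
applied directly—a sketch (here n=526; treating k as the model’s 11
estimated parameters is an assumption; the DF_* names follow the
dfbeta output used later):
. predict rstu if e(sample), rstu
. predict h if e(sample), hat
. predict d if e(sample), cooksd
. dfbeta
. list id rstu if abs(rstu)>3 & rstu<.
. list id h if h>3*11/526 & h<.
. list id d if d>4/526 & d<.
. list id DF_Itencat_3 if abs(DF_Itencat_3)>2/sqrt(526) & DF_Itencat_3<.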

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[Leverage-versus-squared-residual plot, observations labeled by id;
x-axis: normalized residual squared (0 to .04), y-axis: leverage
(about .01 to .06). id=465 & 306 have the highest leverage; id=24 &
128 the largest squared residuals.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots

[Added-variable plots of e(lwage | X) against each predictor: look
for simultaneous x/y extremes. Panel annotations: hsc coef=.22241643
(t=4.56); scol coef=.32030543 (t=5.86); ccol coef=.68798333 (t=11.96);
exper coef=.02854957 (t=5.67); exper2 coef=-.00057702 (t=-5.37);
_Itencat_1 coef=-.0027571 (t=-.06); _Itencat_2 coef=.22096745 (t=4.49);
_Itencat_3 coef=.28798112 (t=5.17); female coef=-.29395956 (t=-8.22);
nonwhite coef=-.06409284 (t=-1.11).]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3, observations labeled by id;
x-axis: e(_Itencat_3 | X) (-1 to 1), y-axis: e(lwage | X) (-2 to 1).
coef = .28798112, se = .0557211, t = 5.17. id=24 again sits at the
bottom of the plot.]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook’s Distance d (y-axis, 0 to .03) against id
(x-axis, 0 to 500). id=24 has the largest distance (about .03),
followed by id=150, 128, 59, 58, 282, & 212.]

 Note id=24.


. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors, if
possible.

 Examine the model’s adequacy (see the
sections on model specification &
non-constant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics
with Stata; & Chen et al., Regression with Stata
(UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
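 A hedged sketch of those alternatives on this model (qreg defaults
to median regression; rreg iteratively down-weights outlying cases):
. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite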

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
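 A minimal hadimvo sketch (badobs is an illustrative name for the
generated 0/1 outlier flag):
. hadimvo wage educ exper tenure, gen(badobs)
. list id wage educ exper tenure if badobs==1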

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 6

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1          m2_robust
-------------+--------------------------------
         hsc |  .22241643***   .22241643***
        scol |  .32030543***   .32030543***
        ccol |  .68798333***   .68798333***
       exper |  .02854957***   .02854957***
      exper2 | -.00057702***  -.00057702***
  _Itencat_1 |  -.0027571      -.0027571
  _Itencat_2 |  .22096745***   .22096745***
  _Itencat_3 |  .28798112***   .28798112***
      female | -.29395956***  -.29395956***
    nonwhite | -.06409284     -.06409284
       _cons |  1.1567164***   1.1567164***
-------------+--------------------------------
           N |       526            526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant
variance has to be pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors
so that we can explore additional diagnostics.

 Our final model, however, will use robust
standard errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample is
neither a cluster sample nor panel/time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command for the Breusch-Godfrey
Lagrange Multiplier test.
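 To illustrate option (3) on a hypothetical
time-series dataset (a declared time variable t &
generic y, x1, x2; not our WAGE1 data):

. tsset t
. reg y x1 x2
. estat bgodfrey, lags(1/3)

 As with the other diagnostics, we want bgodfrey
to test insignificant, so that we fail to reject the
null hypothesis of no serial correlation up to the
specified lags.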

 This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the
X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
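 A minimal sketch for flagging such cases with
the ±3 rule of thumb (rstu is the studentized-residual
variable we also create later):

. predict rstu if e(sample), rstudent
. list id rstu if abs(rstu) >= 3 & rstu < .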

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
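 A sketch of the corresponding check, assuming
the rule’s k counts the model’s coefficients
including the constant:

. predict h if e(sample), hat
. scalar hcut = 3*(e(df_m) + 1)/e(N)
. list id h if h > hcut & h < .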

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
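 A sketch of the same check for Cook’s Distance
& DFITS, using the cutoffs just given:

. predict d if e(sample), cooksd
. predict dfits if e(sample), dfits
. list id d if d >= 4/e(N) & d < .
. list id dfits if abs(dfits) >= 1 & dfits < .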

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of an observation’s influence on particular slope
coefficients.
 Every DFBETA increment of 1 shifts the
corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more, or of at least
2/sqrt(n) (in large samples), represent influential
outliers.
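 A sketch of the DFBETA check (the generated
variable names vary by Stata version; the DF_*
names below are the ones this example produces
later):

. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) >= 2/sqrt(e(N)) & DF_Itencat_3 < .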

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[leverage (y, about .01 to .06) plotted against normalized residual squared (x, 0 to about .04), points labeled by id]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[added-variable plots of e(lwage | X) against each predictor’s partialled values; panel annotations:
hsc: coef = .22241643, se = .04881795, t = 4.56
scol: coef = .32030543, se = .05467635, t = 5.86
ccol: coef = .68798333, se = .05753517, t = 11.96
exper: coef = .02854957, se = .00503299, t = 5.67
exper2: coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571, se = .0491805, t = -.06
_Itencat_2: coef = .22096745, se = .04916835, t = 4.49
_Itencat_3: coef = .28798112, se = .0557211, t = 5.17
female: coef = -.29395956, se = .03576118, t = -8.22
nonwhite: coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[added-variable plot: e(lwage | X) vs. e(_Itencat_3 | X), points labeled by id; coef = .28798112, se = .0557211, t = 5.17; id=24 is the extreme low residual]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[scatter of Cook’s Distance (d, 0 to about .03) against id (0 to 526); id=24 has the largest value]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA
ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
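 For instance, a minimal sketch of the quantile
(median) & robust alternatives for our model
(assuming the _Itencat_* dummies that xi created
earlier):

. qreg lwage hs scol ccol exper exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. rreg lwage hs scol ccol exper exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite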

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
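 A sketch of hadimvo’s use on several quantitative
variables (in newer Stata versions hadimvo must be
user-installed; gen() names the 0/1 flag variable):

. hadimvo wage educ exper tenure, gen(outlier) p(.05)
. list id wage educ exper tenure if outlier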

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect about
5% of observations to fall more than two
standard deviations from the mean.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust

which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 9

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |      chi2   df        p
-------------+-------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      3.20    1   0.5341 #
      exper2 |      3.20    1   0.5341 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residuals-vs-fitted plot, markers labeled by id: x-axis "Fitted values" (1 to 2.5), y-axis residuals (-2 to 1); id=24 sits alone near -2, with ids 128, 12, 260, 58, 203, & 381 also at the low end.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residuals-vs-predictor plot, markers labeled by id: x-axis "tencat==3" (0 to 1), y-axis residuals (-2 to 1); id=24 is again the extreme low point, with ids 282, 188, 457, 150, 448, & 212 below the rest.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to
non-constant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder (see the sketch
following this list).
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).
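
 A sketch of that transformation search, using tenure as the offending predictor (the + 1 accommodates the tenure=0 cases, as noted earlier in the handout):
. ladder tenure
. qladder tenure
. gen ltenure = ln(tenure + 1)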

 A more complicated option would be to
use weighted least squares regression (a
sketch follows below).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
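
 A minimal feasible-WLS sketch (one hypothetical weight construction among several): model the log of the squared residuals, then weight each case by the inverse of its fitted variance.
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. predict e2 if e(sample), resid
. gen loge2 = ln(e2^2)
. reg loge2 hsc scol ccol exper exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. predict loge2hat if e(sample)
. gen w = 1/exp(loge2hat)
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite [aweight=w]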

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. est store m1
.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1            m2_robust
-------------+--------------------------------
         hsc |   .22241643***    .22241643***
        scol |   .32030543***    .32030543***
        ccol |   .68798333***    .68798333***
       exper |   .02854957***    .02854957***
      exper2 |  -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |   .22096745***    .22096745***
  _Itencat_3 |   .28798112***    .28798112***
      female |  -.29395956***   -.29395956***
    nonwhite |  -.06409284      -.06409284
       _cons |   1.1567164***    1.1567164***
-------------+--------------------------------
           N |         526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors, so that we can explore additional
diagnostics.

 Our final model, however, will use
robust standard errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel or
time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey, the Breusch-Godfrey
Lagrange multiplier test.
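
 A small sketch for the time-series case (hypothetical variables; the data must be tsset first):
. tsset year
. reg y x1 x2
. estat bgodfrey, lags(1/3)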

This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of a borderline problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the
X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
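
 A quick way to flag cases by the ±3 rule of thumb (using the studentized residuals we generate below):
. list id rstu lwage if abs(rstu) > 3 & rstu < .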

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
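
 A sketch of the 3*k/n screen, reading k as the number of coefficients including the constant (e(df_m) & e(N) are stored by regress):
. predict h if e(sample), hat
. scalar hcut = 3*(e(df_m) + 1)/e(N)
. list id h if h > hcut & h < .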

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
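
 A sketch of the corresponding screen, applying the 4/n large-sample rule:
. predict d if e(sample), cooksd
. predict dfits if e(sample), dfits
. list id d dfits if d > 4/e(N) & d < .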

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 A DFBETA of 1 means that including the
observation shifts the corresponding slope
coefficient by 1 standard error: DFBETAs of 1 or
more, or of at least 2 divided by the square root
of n (in large samples), represent influential outliers.
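
 A sketch of the DFBETA screen for the high tenure dummy (the DF_ variable names follow the dfbeta output used below; newer Statas name them _dfbeta_#):
. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(e(N)) & DF_Itencat_3 < .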

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-vs-residual-squared plot, markers labeled by id: x-axis "Normalized residual squared" (0 to .04), y-axis leverage (up to about .06); id=465 sits highest on the leverage axis, & ids 128 & 24 sit farthest right on the residual axis.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[Added-variable plots of e( lwage | X ) against e( x | X ) for each predictor, with panel captions:
  hsc:         coef = .22241643,  se = .04881795, t = 4.56
  scol:        coef = .32030543,  se = .05467635, t = 5.86
  ccol:        coef = .68798333,  se = .05753517, t = 11.96
  exper:       coef = .02854957,  se = .00503299, t = 5.67
  exper2:      coef = -.00057702, se = .00010737, t = -5.37
  _Itencat_1:  coef = -.0027571,  se = .0491805,  t = -.06
  _Itencat_2:  coef = .22096745,  se = .04916835, t = 4.49
  _Itencat_3:  coef = .28798112,  se = .0557211,  t = 5.17
  female:      coef = -.29395956, se = .03576118, t = -8.22
  nonwhite:    coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Added-variable plot of e( lwage | X ) against e( _Itencat_3 | X ), markers labeled by id; id=24 again sits at the extreme low end. coef = .28798112, se = .0557211, t = 5.17]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
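
 Before plotting, we can also rank the cases by d (a small sketch):
. gsort -d
. list id d in 1/10
. sort id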

. scatter d id

[Scatter of Cook's Distance (d) against id: x-axis "id" (0 to 500), y-axis d (0 to .03); id=24 stands highest at d ≈ .03, followed by ids 150, 128, 59, 58, 282, & 212.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors, if
possible.
 Examine the model’s adequacy (see the
sections on model specification &
non-constant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS
web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
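
 A sketch of hadimvo (an older built-in command; the variable list & cutoff here are ours to choose, not the handout's):
. hadimvo lwage exper tenure, gen(multiout) p(.05)
. list id lwage exper tenure if multiout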

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect about
5% of observations to be outliers by chance
alone (i.e. beyond ±2 standard deviations).
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 10

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter output: Cook's Distance d (y-axis, 0 to .03) against id (x-axis, 0 to 526). id=24 is highest at roughly .03; a handful of points (e.g., 150, 128, 59, 58, 282, 212) sit near .02; the rest fall below .01]
 Note id=24.
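 To put numbers on the plot: a sketch using the cutoffs given earlier (d of 1 or more, or 4/n or more in large samples; n = 526 here).

. su d
. count if d >= 1 & d < .
. list id rstu h d if d > 4/526 & d < .

 No observation comes near d = 1; the looser 4/n screen simply lists the points (id=24 among them) worth a second look.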


. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there's no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model's adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.
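 E.g., a minimal sketch refitting this same specification (qreg estimates the median by default):

. xi: qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi: rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite

 Comparing these coefficients with the OLS estimates shows how much the extreme observations actually matter.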

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
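 E.g., a sketch, assuming hadimvo's gen() & p() options; 'mvout' is a hypothetical name for the generated outlier flag:

. hadimvo wage educ exper tenure, gen(mvout) p(.05)
. list id wage educ exper tenure if mvout==1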

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
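 E.g., with n = 526, that's roughly .05*526 ≈ 26 outlying observations expected by chance alone.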
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 11

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 12

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to determine
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) – ‘VIF’ > 10: measures inflation in
variance due to multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’ < .10: the reciprocal of
VIF; measures the extent of an explanatory variable’s variance that is independent of
the other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’ > 15, or especially > 30.


. vif

    Variable |    VIF       1/VIF
-------------+-----------------------
       exper |  15.62    0.064013
      exper2 |  14.65    0.068269
         hsc |   1.88    0.532923
  _Itencat_3 |   1.83    0.545275
        ccol |   1.70    0.589431
        scol |   1.69    0.591201
  _Itencat_2 |   1.50    0.666210
  _Itencat_1 |   1.38    0.726064
      female |   1.07    0.934088
    nonwhite |   1.03    0.971242
-------------+-----------------------
    Mean VIF |   4.23

 The seemingly troublesome scores for exper
& exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using the user-written
‘center’ command or by hand, as sketched after this list.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
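
A minimal sketch of centering by hand, in case the ‘center’ command isn’t installed (c_exper & c_exper2 are made-up names):

. su exper, meanonly
. gen c_exper = exper - r(mean)    // deviation from the sample mean
. gen c_exper2 = c_exper^2         // quadratic built from the centered term

Centering this way leaves the model’s fit unchanged but typically shrinks the exper/exper2 correlation & their VIFs.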

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations even if the residuals
aren’t normal.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid
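
A quick sketch of two of the other plots just mentioned, applied to the rstu variable created above (both are official Stata commands):

. kdensity rstu, norm    // kernel density with a normal overlay
. qnorm rstu             // normal quantile plot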

. hist rstu, norm

[Histogram of the studentized residuals with a normal overlay: density (0 to .5) on the y-axis, studentized residuals (-4 to 4) on the x-axis.]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
. estat imtest

Cameron & Trivedi's decomposition of IM-test

              Source |   chi2    df        p
---------------------+------------------------
  Heteroskedasticity |  21.27    13    0.0677
            Skewness |   4.25     4    0.3733
            Kurtosis |   2.47     1    0.1160
---------------------+------------------------
               Total |  27.99    18    0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

        chi2(1)     =   15.76
        Prob > chi2 =   0.0001

 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |   chi2   df        p
-------------+--------------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   1.56    1   0.9077 #
      exper2 |   0.10    1   1.0000 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
-------------+--------------------------------
simultaneous |  23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |   chi2   df        p
-------------+-------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   3.20    1   0.5341 #
      exper2 |   3.20    1   0.5341 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot: residuals (-2 to 1) against fitted values (1 to 2.5), points labeled by id; id=24 again sits at the extreme bottom.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot: residuals (-2 to 1) against tencat==3 (0 to 1 on the x-axis), points labeled by id; id=24 is the extreme low residual.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to
non-constant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder (see the sketch
after these bullets).
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).
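
A sketch of the transformation hunt mentioned in the first option above (qladder & ladder are official Stata commands; tenure is the predictor the tests flagged):

. qladder tenure    // quantile plots across the ladder of powers
. ladder tenure     // normality chi2 for each power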

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1            m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors so that we can explore additional
diagnostics.

 Our final model, however, will use
robust standard errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither clustered nor panel/time-series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command for the Breusch-Godfrey
Lagrange multiplier test.
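
A hypothetical sketch of (3), since our wage data aren’t time series (yr, y, & x are made-up names):

. tsset yr                     // declare the time variable
. reg y x
. estat bgodfrey, lags(1 2)    // Breusch-Godfrey LM test for autocorrelation at lags 1 & 2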

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation’s y-value
falls from the mean of y;
leverage: how far the observation’s x-values
fall from the mean of x.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
without leverage they exert little influence on the
slope coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more, where k is the number of coefficients
(including the constant) & n the sample size,
are relatively large & indicate potential
influence (see the sketch below).
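
A sketch of that screen for our model, assuming k = 11 (ten predictors plus the constant) & n = 526:

. predict h if e(sample), hat
. list id h if h > 3*11/526 & h < .    // flag hat values above 3*k/n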

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more suggest substantial influence (of 1 or
more standard deviations) on the model’s
overall fit; in large samples the usual cutoffs
are 4/n for Cook’s Distance & 2*sqrt(k/n)
for DFITS.
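
A sketch of the large-sample Cook’s Distance screen (4/n with n = 526; d is the variable name used later in this deck):

. predict d if e(sample), cooksd
. list id d if d > 4/526 & d < .    // flag observations above 4/n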

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of an observation’s influence on the slope
coefficients.
 A DFBETA of 1 means that omitting the
observation would shift the corresponding
coefficient by one standard error: DFBETAs of 1
or more, or of at least 2/sqrt(n) in large samples,
represent influential outliers (see the sketch below).
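
A sketch of the large-sample DFBETA screen, using the DF_ names that this deck’s dfbeta command generates (n = 526):

. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(526) & DF_Itencat_3 < .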

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot: leverage (0 to .06) against normalized residual squared (0 to .04), points labeled by id; id=465 has the highest leverage & id=24 the largest normalized residual squared.]

 There are signs of a high-residual point (id=24) & a high-leverage
point, but, given that no points appear in the top right-hand area,
no observations appear to be influential.

. avplots   (look for simultaneous x/y extremes)

[Added-variable plots of e(lwage|X) against e(x|X) for each predictor:
    hsc:         coef = .22241643,   se = .04881795,  t =  4.56
    scol:        coef = .32030543,   se = .05467635,  t =  5.86
    ccol:        coef = .68798333,   se = .05753517,  t = 11.96
    exper:       coef = .02854957,   se = .00503299,  t =  5.67
    exper2:      coef = -.00057702,  se = .00010737,  t = -5.37
    _Itencat_1:  coef = -.0027571,   se = .0491805,   t =  -.06
    _Itencat_2:  coef = .22096745,   se = .04916835,  t =  4.49
    _Itencat_3:  coef = .28798112,   se = .0557211,   t =  5.17
    female:      coef = -.29395956,  se = .03576118,  t = -8.22
    nonwhite:    coef = -.06409284,  se = .05772311,  t = -1.11]

. avplot _Itencat_3, ml(id)

[Added-variable plot of e(lwage|X) against e(_Itencat_3|X), points labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 sits at the extreme bottom of the plot.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook’s Distance d (0 to .03) against id (0 to 500+); id=24 has the largest distance, at about .03.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers
(see the sketch below).
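
A sketch of those alternatives applied to our model (qreg fits a median regression by default; genwt(w) saves rreg’s robustness weights in a made-up variable w):

. xi: qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi: rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite, genwt(w)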

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
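
A sketch of hadimvo, assuming its usual syntax (odd is a made-up flag variable; lwage & exper are an illustrative variable pair):

. hadimvo lwage exper, gen(odd)    // flags joint outliers at the default significance level
. list id lwage exper if odd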

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 Even in a normal distribution we expect
about 5% of observations to fall more than
two standard deviations from the mean by
chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 13

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by the hat value, a non-negative statistic that summarizes how far an observation's explanatory values fall from their means: greater hat values are farther from the x-means.

 Hat values are likely to be greater in small or moderate samples: values of 3*k/n or more (where k is the number of model parameters & n the sample size) are relatively large & indicate potential influence.
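
 A matching numeric check (a sketch; counting the constant, k = 11 for this model, an assumption about how k is tallied):

. predict h if e(sample), hat
. list id h if h > 3*11/526 & h < .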

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of observations, in this case on particular slope coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure of an observation's influence on the slope coefficients.
 A DFBETA of 1 means that including the observation shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2/sqrt(n) (in large samples), represent influential outliers.
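
 A sketch of that cutoff check for one slope (dfbeta is run below; DF_Itencat_3 is the name Stata gives the dfbeta for _Itencat_3):

. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(526) & DF_Itencat_3 < .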

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[lvr2plot output: leverage (y-axis, roughly .01 to .06) against normalized residual squared (x-axis, 0 to .04), all 526 points labeled by id; ids 465 & 306 sit highest on the leverage axis, & id=24 stands apart on the residual axis.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[avplots output: a panel of added-variable plots of e(lwage | X) against the partialled residuals of each predictor; each plot's slope equals the model coefficient, e.g. hsc coef = .22241643 (t = 4.56), scol .32030543 (t = 5.86), ccol .68798333 (t = 11.96), exper .02854957 (t = 5.67), exper2 -.00057702 (t = -5.37), _Itencat_1 -.0027571 (t = -.06), _Itencat_2 .22096745 (t = 4.49), _Itencat_3 .28798112 (t = 5.17), female -.29395956 (t = -8.22), nonwhite -.06409284 (t = -1.11).]

. avplot _Itencat_3, ml(id)

[avplot output: e(lwage | X) against e(_Itencat_3 | X), points labeled by id; coef = .28798112, se = .0557211, t = 5.17; id=24 sits at the bottom of the plot.]

 There again is id=24. Why isn't it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
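
 Before graphing, a quick check of d against the conventional large-sample cutoff (a sketch; 4/n = 4/526 ≈ .0076 here):

. list id rstu h d if d > 4/526 & d < .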

. scatter d id

[scatter output: Cook's Distance d (y-axis, 0 to about .03) against id (x-axis, 0 to 526), points labeled by id; id=24 has the largest value (≈.03), trailed by ids 150, 128, 59, & 58, all far below the cutoff of 1.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there's no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.
 Examine the model's adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of regression:

 Try, e.g., quantile (including median) regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.
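
 A sketch of those two alternatives with this model's variables (qreg estimates median regression by default):

. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite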

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
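
 A sketch of its use (an assumption about its syntax: gen() names an indicator flagging multivariate outliers):

. hadimvo lwage exper tenure, gen(mvout)
. list id lwage exper tenure if mvout==1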

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect 5% of the observations to be outliers.
 Don't over-fit a model to a sample: remember that there is sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 14

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 15

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know whether its wage is an error,
so we’ll keep an eye on id=24 for possible
problems.
 Let’s examine the independent variables:


. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. ladder exper
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitci lwage tenure
. twoway qfitci lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In STATA we use ‘estat ovtest’ (also known
as the regression specification error test,
RESET) to indicate whether there are
important omitted variables.
 ‘estat ovtest’ adds powers of the model’s
fitted values to the regression: we want the
test to be insignificant, i.e. we want to fail to
reject the null hypothesis that the model has
no important omitted variables.
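 To see the logic, here is a rough hand-rolled
version of RESET (a sketch; yhat2-yhat4 are
hypothetical names, & ‘estat ovtest’ does all of
this automatically):

. reg wage educ exper tenure female nonwhite
. predict yhat if e(sample), xb
. gen yhat2=yhat^2
. gen yhat3=yhat^3
. gen yhat4=yhat^4
. reg wage educ exper tenure female nonwhite yhat2 yhat3 yhat4
. test yhat2 yhat3 yhat4    // joint F-test of the added powers, as in RESET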

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of wage
Ho: model has no omitted variables
        F(3, 517) =   9.37
        Prob > F  =   0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
Ramsey RESET test using powers of the fitted values of lwage
Ho: model has no omitted variables
        F(3, 515) =   2.11
        Prob > F  =   0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification is via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘_hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specified correctly.
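 Roughly, linktest does the following under the
hood (a sketch; xbhat & xbhat2 are hypothetical
names for what linktest reports as _hat & _hatsq):

. reg lwage educ educ2 exper exper2 ltenure female nonwhite
. predict xbhat if e(sample), xb
. gen xbhat2=xbhat^2
. reg lwage xbhat xbhat2    // a significant xbhat2 signals misspecification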

. linktest
------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest    // p = .12
. linktest        // _hatsq p = .15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residuals versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot: residuals from roughly -2 to 1 against fitted values from roughly 1 to 2.5, each point labeled with its id; a handful of low-end outliers (id=24 lowest, with 128, 12, 58, 203, & 381 nearby) fall well below the main cloud.]

 Problems of heteroscedasticity?
 So, even though the model passed
linktest & estat ovtest, at least one
basic problem remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) – ‘VIF’ > 10:
measures inflation in variance due to multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’ < .10:
reciprocal of VIF; measures the extent of variance that is
independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’ > 15 or especially > 30.
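 E.g., a minimal pre-model check with ‘collin’ might
look like this (a sketch; locate & install it via ‘findit
collin’, & the predictor list is only illustrative):

. findit collin
. collin hsc scol ccol exper exper2 female nonwhite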

. vif

    Variable |    VIF       1/VIF
-------------+------------------------
       exper |  15.62    0.064013
      exper2 |  14.65    0.068269
         hsc |   1.88    0.532923
  _Itencat_3 |   1.83    0.545275
        ccol |   1.70    0.589431
        scol |   1.69    0.591201
  _Itencat_2 |   1.50    0.666210
  _Itencat_1 |   1.38    0.726064
      female |   1.07    0.934088
    nonwhite |   1.03    0.971242
-------------+------------------------
    Mean VIF |   4.23

 The seemingly troublesome scores for exper
& exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?
 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command; see the sketch after this list.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
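 As flagged above, a minimal centering sketch for
the quadratic in exper, assuming we center by hand
rather than with the downloadable ‘center’ command:

. su exper, meanonly
. gen c_exper=exper-r(mean)    // centered experience
. gen c_exper2=c_exper^2       // quadratic built from the centered variable

 Centering leaves the fitted values unchanged but
typically cuts the exper/exper2 correlation & their VIFs.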

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

. hist rstu, norm

[Histogram of the studentized residuals (x-axis roughly -4 to 4) with a normal-density overlay; density on the y-axis.]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

              Source |   chi2   df        p
---------------------+----------------------
  Heteroskedasticity |  21.27   13   0.0677
            Skewness |   4.25    4   0.3733
            Kurtosis |   2.47    1   0.1160
---------------------+----------------------
               Total |  27.99   18   0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

        chi2(1)     =   15.76
        Prob > chi2 =   0.0001

 There seem to be problems. Let’s inspect
the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |   chi2   df        p
-------------+--------------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   1.56    1   0.9077 #
      exper2 |   0.10    1   1.0000 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
-------------+--------------------------------
simultaneous |  23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |   chi2   df        p
-------------+-------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   3.20    1   0.5341 #
      exper2 |   3.20    1   0.5341 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[The residual-versus-fitted plot again: the same low-end outliers (id=24, 128, 12) stand out below the main cloud.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot for _Itencat_3 (tencat==3, coded 0/1, on the x-axis): residuals span roughly -2 to 1 in both categories, with id=24 again at the extreme bottom.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to
non-constant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations for the
high end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work,
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression
(see the sketch after this list).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
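 For the record, a rough WLS sketch (the weight
is purely hypothetical, assuming the residual
variance grows with tenure; Stata treats aweights
as inversely proportional to the variance):

. gen w=1/(1+tenure)
. reg lwage hs scol ccol exper exper2 female nonwhite [aweight=w]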

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1            m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 | -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |      526              526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors so that we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel or
time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
‘estat bgodfrey’ (Breusch-Godfrey Lagrange
multiplier test).
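E.g., a sketch for time-series data (year, y, x1,
& x2 are hypothetical names):

. tsset year
. reg y x1 x2
. estat bgodfrey, lags(1/2)    // Breusch-Godfrey LM test for serial correlation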

 This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of a borderline problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the
X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
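 A quick sketch of flagging such cases (rst is a
hypothetical variable name):

. predict rst if e(sample), rstudent
. list id rst if abs(rst)>=3 & rst<.    // |studentized residual| of 3 or more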

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
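 A sketch of the 3*k/n rule, pulling k (coefficients,
including the constant) & n from the estimation
results:

. predict hat1 if e(sample), hat
. scalar hcut=3*(e(df_m)+1)/e(N)    // 3*k/n
. list id hat1 if hat1>hcut & hat1<.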

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
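 A sketch using the 4/n cutoff (d1 & dfits1 are
hypothetical names):

. predict d1 if e(sample), cooksd
. predict dfits1 if e(sample), dfits
. list id d1 dfits1 if d1>4/e(N) & d1<.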

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 shifts the
corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more, or of at least
2/sqrt(n) (in large samples),
represent influential outliers.
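 A sketch for a single coefficient (the DFexper
name follows the convention above; dfbeta
generates one such variable per predictor):

. dfbeta
. scalar dbcut=2/sqrt(e(N))    // the large-sample cutoff
. list id DFexper if abs(DFexper)>dbcut & DFexper<.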

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot: leverage (0 to about .06) against normalized residual squared (0 to about .04), labeled by id; id=465 sits highest on leverage, while id=24 & id=128 sit farthest right on residual squared.]

 There are signs of a high-residual point & a high-leverage
point (id=24), but, given that no points appear in the top
right-hand area, no observations appear to be influential.

. avplots

[Added-variable plots for each predictor: look for simultaneous x/y extremes. The partial slopes match the regression coefficients, e.g.: hsc coef = .22241643 (t = 4.56); scol .32030543 (t = 5.86); ccol .68798333 (t = 11.96); exper .02854957 (t = 5.67); exper2 -.00057702 (t = -5.37); _Itencat_1 -.0027571 (t = -.06); _Itencat_2 .22096745 (t = 4.49); _Itencat_3 .28798112 (t = 5.17); female -.29395956 (t = -8.22); nonwhite -.06409284 (t = -1.11).]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3 alone: coef = .28798112, se = .0557211, t = 5.17; id=24 sits near the bottom but is not extreme on both axes at once.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook's Distance (0 to about .03) against id (1-526): id=24 is highest at about .03, with id=150, 128, 59, 58, 282, & 212 next; everything is far below the cutoff of 1.]

 Note id=24.
. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?
 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification &
non-constant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA
ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
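 Sketches of both alternatives (the predictor list is
illustrative):

. qreg lwage hs scol ccol exper exper2 female nonwhite    // median regression
. rreg lwage hs scol ccol exper exper2 female nonwhite    // robust regression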

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
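 A sketch of hadimvo (an older downloadable
command; ‘out’ is a hypothetical flag variable, &
p(.05) is the significance level):

. hadimvo wage educ exper tenure, gen(out) p(.05)
. list id wage educ exper tenure if out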

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 16

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Scatterplot: residuals (about -2 to 1) vs. Fitted values (about 1 to
2.5), yline at 0, points labeled by id; id=24 has the lowest residual,
with ids 128, 12, 260, 58, 203, 381, & 41 also toward the bottom.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Scatterplot: residuals (about -2 to 1) vs. tencat==3 (0 to 1), yline
at 0, points labeled by id; id=24 again has the lowest residual, with
ids 282, 188, 457, 150, 448, 212, & 128 nearby.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to non-constant
variance. I’m guessing that including the
variable age, which the data set doesn’t have,
would either solve or reduce the problem. Why?

 Maybe the small number of observations for
the high end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder (see the sketch
after this list).
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
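
 A sketch of those remedies in Stata (not from the
original handout; the interaction term & the weight
variable w are hypothetical—w would have to come
from modeling the residual variance):

. qladder tenure
. gen expXten = exper*tenure
. xi: reg lwage hsc scol ccol exper exper2 i.tencat
female nonwhite [aweight=w]
. xi: reg lwage hsc scol ccol exper exper2 i.tencat
female nonwhite, robust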

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi: reg lwage hsc scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi: reg lwage hsc scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)
----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |   .22241643***   .22241643***
        scol |   .32030543***   .32030543***
        ccol |   .68798333***   .68798333***
       exper |   .02854957***   .02854957***
      exper2 |  -.00057702***  -.00057702***
  _Itencat_1 |  -.0027571      -.0027571
  _Itencat_2 |   .22096745***   .22096745***
  _Itencat_3 |   .28798112***   .28798112***
      female |  -.29395956***  -.29395956***
    nonwhite |  -.06409284     -.06409284
       _cons |   1.1567164***   1.1567164***
-------------+--------------------------------
           N |         526            526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant
variance has to be pronounced in order to
make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors, so that we can explore additional
diagnostics.

 Our final model, however, will use
robust standard errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither clustered nor panel/time-series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command for the Breusch-Godfrey
Lagrange Multiplier test.
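
 A minimal sketch of that test (the time variable t
& the regressors here are hypothetical placeholders):

. tsset t
. reg y x1 x2
. estat bgodfrey, lags(1)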

 This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of a practically insignificant
problem with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation falls
from the mean of y on the Y-axis;
leverage: how far the observation falls
from the mean of x on the X-axis.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
on their own they don’t influence the regression
coefficients.

 Leverage is measured by the hat value, a
non-negative statistic that summarizes how far
an observation’s explanatory values fall from
their means: the greater the hat value, the
farther the observation is from the x-means.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of an observation’s influence on the slope
coefficients.
 A DFBETA of 1 means that the observation shifts
the corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more, or of at least 2
divided by the square root of n (in large samples),
represent influential outliers.
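
 A sketch of that check, for use once the model has
been fit (n = 526; the DF_ variable names follow this
handout’s dfbeta output & may differ by Stata version):

. dfbeta
. su DF*
. list id DF_Itencat_3 if abs(DF_Itencat_3) >
2/sqrt(526) & DF_Itencat_3 < .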

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Plot: leverage (0 to .06) vs. Normalized residual squared (0 to .04),
points labeled by id; ids 465 & 306 have the highest leverage, while
ids 24 & 128 sit far to the right on the residual axis.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots

 Look for simultaneous x/y extremes.

[avplots: added-variable plots of e(lwage | X) against the partial
residual of each predictor. Panel captions:
hsc:        coef =  .22241643, se = .04881795, t =  4.56
scol:       coef =  .32030543, se = .05467635, t =  5.86
ccol:       coef =  .68798333, se = .05753517, t = 11.96
exper:      coef =  .02854957, se = .00503299, t =  5.67
exper2:     coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571,  se = .0491805,  t =  -.06
_Itencat_2: coef =  .22096745, se = .04916835, t =  4.49
_Itencat_3: coef =  .28798112, se = .0557211,  t =  5.17
female:     coef = -.29395956, se = .03576118, t = -8.22
nonwhite:   coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[avplot: e(lwage | X) vs. e(_Itencat_3 | X), points labeled by id;
caption: coef = .28798112, se = .0557211, t = 5.17; id=24 again
appears at the extreme low end of the residuals.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
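
 Before plotting, a quick numeric check of those
cutoffs; a sketch (n = 526 & k = 10 predictors are
taken from this model):

. count if abs(rstu) > 3 & rstu < .
. count if h > 3*10/526 & h < .
. list id rstu h d if d > 4/526 & d < .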

. scatter d id

[Scatterplot: Cook’s Distance d (0 to .03) vs. id (0 to 500+);
id=24 has by far the largest d, with ids 150, 128, 59, 58, 282,
212, 105, 260, 382, 381, & 487 next.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments (e.g., adding
omitted variables, interactions, log or other
transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA-ATS
web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
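
 A sketch of hadimvo (an older Stata command; the
syntax & the variable choices here are illustrative
assumptions, not from the handout):

. hadimvo lwage exper tenure, gen(badobs) p(.05)
. list id lwage exper tenure if badobs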

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect about
5% of observations to fall beyond two
standard deviations by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi: reg lwage hsc scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 17

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
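 E.g. (a sketch with this model's specification; output not shown):

. xi: qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi: rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite, genwt(w)

qreg fits the conditional median by default; rreg's genwt(w) option saves the
final robust weights, so we can list which observations were downweighted.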

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
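 A minimal sketch of hadimvo (the variable choices here are illustrative only;
syntax as given in Hamilton):

. hadimvo lwage educ exper tenure, gen(multiout)
. list id lwage educ exper tenure if multiout==1

gen() creates a 0/1 indicator flagging joint (multivariate) outliers at the
command's default significance level.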

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect about
5% of the observations to be outliers
(those beyond ±2 standard deviations).
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 18

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 19

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
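
As a minimal sketch of the centering option, assuming the wage model from above is in memory (variable names are illustrative):

. * center exper at its sample mean, then rebuild the quadratic term
. su exper, meanonly
. gen exper_c = exper - r(mean)
. gen exper2_c = exper_c^2
. xi:reg lwage hsc scol ccol exper_c exper2_c i.tencat female nonwhite

Centering leaves the model’s fit unchanged; it only reduces the correlation between a variable & its square.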

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid
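
Beyond the histogram, the other plots mentioned above can be sketched in two lines (kdensity with a normal overlay, & a normal quantile plot):

. kdensity rstu, normal
. qnorm rstu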

. hist rstu, norm

[Figure: histogram of studentized residuals with a normal overlay; x-axis ‘Studentized residuals’ (-4 to 4), y-axis ‘Density’ (0 to .5)]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |      chi2     df        p
---------------------+-----------------------------
  Heteroskedasticity |     21.27     13     0.0677
            Skewness |      4.25      4     0.3733
            Kurtosis |      2.47      1     0.1160
---------------------+-----------------------------
               Total |     27.99     18     0.0621
---------------------------------------------------

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

    chi2(1)     =  15.76
    Prob > chi2 =  0.0001

 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
-----------------------------------------------
     Variable |     chi2   df        p
--------------+--------------------------------
          hsc |     0.14    1   1.0000 #
         scol |     0.17    1   1.0000 #
         ccol |     2.47    1   0.7078 #
        exper |     1.56    1   0.9077 #
       exper2 |     0.10    1   1.0000 #
   _Itencat_1 |     0.44    1   0.9992 #
   _Itencat_2 |     0.00    1   1.0000 #
   _Itencat_3 |    10.03    1   0.0153 #
       female |     1.02    1   0.9762 #
     nonwhite |     0.03    1   1.0000 #
--------------+--------------------------------
 simultaneous |    23.68   10   0.0085
-----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |     chi2   df        p
-------------+-------------------------
         hsc |     0.14    1   1.0000 #
        scol |     0.17    1   1.0000 #
        ccol |     2.47    1   0.7078 #
       exper |     3.20    1   0.5341 #
      exper2 |     3.20    1   0.5341 #
  _Itencat_1 |     0.44    1   0.9992 #
  _Itencat_2 |     0.00    1   1.0000 #
  _Itencat_3 |    10.03    1   0.0153 #
      female |     1.02    1   0.9762 #
    nonwhite |     0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Figure: rvfplot of residuals vs. fitted values, observations labeled by id; x-axis ‘Fitted values’]

. rvpplot _Itencat_3, yline(0) ml(id)

[Figure: rvpplot of residuals vs. tencat==3 (0 to 1), observations labeled by id]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to non-constant
variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression
(sketched below).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
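
As a rough sketch of weighted least squares, assuming we had an estimate of each observation’s error variance in a (hypothetical) variable w:

. * weight each case by the inverse of its estimated error variance (w is hypothetical)
. reg lwage hsc scol ccol exper exper2 _Itencat_* female nonwhite [aweight=1/w]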

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |  .22241643***   .22241643***
        scol |  .32030543***   .32030543***
        ccol |  .68798333***   .68798333***
       exper |  .02854957***   .02854957***
      exper2 | -.00057702***  -.00057702***
  _Itencat_1 | -.0027571      -.0027571
  _Itencat_2 |  .22096745***   .22096745***
  _Itencat_3 |  .28798112***   .28798112***
      female | -.29395956***  -.29395956***
    nonwhite | -.06409284     -.06409284
       _cons |  1.1567164***   1.1567164***
-------------+--------------------------------
           N |      526            526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant
variance has to be pronounced in
order to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors so that we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel or time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command (the Breusch-Godfrey
Lagrange Multiplier test).
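
A minimal sketch, assuming time-series data with a (hypothetical) time variable t & outcome/predictors y, x1, x2:

. tsset t
. reg y x1 x2
. estat bgodfrey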

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean of y;
leverage: how far the observation on the X-axis falls from the mean of x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
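
Once studentized residuals have been computed (rstu, as earlier), the ±3 rule can be sketched as a one-liner:

. list id rstu if abs(rstu) > 3 & rstu < .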

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
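
A sketch of the 3*k/n cutoff after the regression, taking k as the number of estimated coefficients including the constant (variable & scalar names are illustrative):

. predict h if e(sample), hat
. scalar hcut = 3*(e(df_m)+1)/e(N)
. list id h if h > hcut & h < .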

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
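
A sketch of flagging influential cases by these cutoffs (names illustrative; 4/n is the large-sample cutoff given above):

. predict d if e(sample), cooksd
. predict dfits if e(sample), dfits
. scalar dcut = 4/e(N)
. list id d dfits if (d > dcut | abs(dfits) > dcut) & d < .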

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 A DFBETA of 1 means that the observation
shifts the corresponding slope coefficient by
1 standard error: DFBETAs of 1 or more, or of
at least 2/sqrt(n) (in large samples),
represent influential outliers.
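
A sketch of flagging large DFBETAs (dfbeta’s generated variable names vary by Stata version—here the DFexper name from the text; modern Stata uses _dfbeta_#):

. dfbeta
. scalar bcut = 2/sqrt(e(N))
. list id DFexper if abs(DFexper) > bcut & DFexper < .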

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Figure: lvr2plot of leverage vs. normalized residual squared, observations labeled by id]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[Figure: added-variable plots of e(lwage|X) against each predictor:
  hsc:        coef =  .22241643, se = .04881795, t =  4.56
  scol:       coef =  .32030543, se = .05467635, t =  5.86
  ccol:       coef =  .68798333, se = .05753517, t = 11.96
  exper:      coef =  .02854957, se = .00503299, t =  5.67
  exper2:     coef = -.00057702, se = .00010737, t = -5.37
  _Itencat_1: coef = -.0027571,  se = .0491805,  t =  -.06
  _Itencat_2: coef =  .22096745, se = .04916835, t =  4.49
  _Itencat_3: coef =  .28798112, se = .0557211,  t =  5.17
  female:     coef = -.29395956, se = .03576118, t = -8.22
  nonwhite:   coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Figure: added-variable plot for _Itencat_3, observations labeled by id; coef = .28798112, se = .0557211, t = 5.17]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Figure: Cook’s Distance (d, 0 to about .03) plotted against id (0 to 500); the largest value belongs to id=24]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
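
A sketch of the two alternatives on this model, assuming the xi-generated tenure dummies are still in memory:

. qreg lwage hsc scol ccol exper exper2 _Itencat_* female nonwhite
. rreg lwage hsc scol ccol exper exper2 _Itencat_* female nonwhite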

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 20

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[lvr2plot output: leverage (y-axis, 0 to about .06)
vs. normalized residual squared (x-axis, 0 to about
.04), with points labeled by id; ids 465 & 306 show
the highest leverage, & ids 24 & 128 the largest
normalized squared residuals]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[avplots output: one added-variable plot per
predictor, e(lwage | X) against e(x | X); the fitted
slopes equal the model coefficients: hsc coef =
.22241643 (se = .04881795, t = 4.56); scol coef =
.32030543 (se = .05467635, t = 5.86); ccol coef =
.68798333 (se = .05753517, t = 11.96); exper coef =
.02854957 (se = .00503299, t = 5.67); exper2 coef =
-.00057702 (se = .00010737, t = -5.37); _Itencat_1
coef = -.0027571 (se = .0491805, t = -.06);
_Itencat_2 coef = .22096745 (se = .04916835, t =
4.49); _Itencat_3 coef = .28798112 (se = .0557211,
t = 5.17); female coef = -.29395956 (se = .03576118,
t = -8.22); nonwhite coef = -.06409284 (se =
.05772311, t = -1.11)]

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id)

[avplot output for _Itencat_3: e(lwage | X) against
e(_Itencat_3 | X), points labeled by id; coef =
.28798112, se = .0557211, t = 5.17; id=24 sits at
the extreme bottom of the plot]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter of d (Cook’s Distance, 0 to about .03)
against id (0 to about 500): id=24 has by far the
largest value, roughly .03]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA
ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
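 For instance, sketches re-using this deck’s model
(each new fit would need its own diagnostics):
. xi: qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi: rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite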

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
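 A sketch (‘outlier’ is simply a chosen variable
name; p(.05) is hadimvo’s default significance
level; in current Stata the command must first be
user-installed):
. hadimvo lwage exper tenure, gen(outlier) p(.05)
. list id lwage exper tenure if outlier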

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect about
5% of the observations to be outliers by chance
alone (i.e. to fall beyond ±2 standard deviations).
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 21

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 22

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification is ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘_hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specified correctly.
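 As a hedged sketch, linktest amounts to
refitting y on the model’s prediction & its
square (the names hat & hatsq are ours):

. reg lwage educ educ2 exper exper2 ltenure female nonwhite
. predict hat if e(sample), xb
. gen hatsq = hat^2
. reg lwage hat hatsq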

. linktest

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest   [p = .12]
. linktest   [_hatsq p = .15]

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residuals versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

. rvfplot, yline(0) ml(id)

[Residuals-versus-fitted plot, observations labeled by id: fitted values run from about 1 to 2.5, residuals from about -2 to 1; id=24 is the extreme low residual.]

 Problems of heteroscedasticity?

 So, even though the model passed
linktest & estat ovtest, at least one
basic problem remains to be overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) – ‘VIF’ > 10: measures inflation in a
coefficient’s variance due to multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’ < .10: the reciprocal of VIF;
measures the extent of an explanatory variable’s variance that is independent of the other
explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’ > 15, or especially > 30.

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
       exper |     15.62    0.064013
      exper2 |     14.65    0.068269
         hsc |      1.88    0.532923
  _Itencat_3 |      1.83    0.545275
        ccol |      1.70    0.589431
        scol |      1.69    0.591201
  _Itencat_2 |      1.50    0.666210
  _Itencat_1 |      1.38    0.726064
      female |      1.07    0.934088
    nonwhite |      1.03    0.971242
-------------+----------------------
    Mean VIF |      4.23

 The seemingly troublesome scores for exper &
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?
 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command; a minimal sketch follows this list.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
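 As promised above, a minimal sketch of centering
by hand, using exper as the offending variable (the
user-written ‘center’ command would automate the
first two steps):

. su exper
. gen exper_c = exper - r(mean)
. gen exper_c2 = exper_c^2
. xi:reg lwage hs scol ccol exper_c exper_c2 i.tencat female nonwhite
. vif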

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that confidence
intervals & p-values will be good approximations
even if the residuals aren’t normal.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstudent
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

. hist rstu, norm

[Histogram of the studentized residuals with a normal-curve overlay: density on the y-axis; the residuals run from about -4 to 4, with a few low-end outliers.]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df         p
---------------------+-----------------------------
  Heteroskedasticity |      21.27     13    0.0677
            Skewness |       4.25      4    0.3733
            Kurtosis |       2.47      1    0.1160
---------------------+-----------------------------
               Total |      27.99     18    0.0621
---------------------------------------------------

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

        chi2(1)     =    15.76
        Prob > chi2 =     0.0001
 There seem to be problems. Let’s
inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |      chi2   df        p
-------------+--------------------------------
         hsc |      0.14    1    1.0000 #
        scol |      0.17    1    1.0000 #
        ccol |      2.47    1    0.7078 #
       exper |      1.56    1    0.9077 #
      exper2 |      0.10    1    1.0000 #
  _Itencat_1 |      0.44    1    0.9992 #
  _Itencat_2 |      0.00    1    1.0000 #
  _Itencat_3 |     10.03    1    0.0153 #
      female |      1.02    1    0.9762 #
    nonwhite |      0.03    1    1.0000 #
-------------+--------------------------------
simultaneous |     23.68   10    0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

. estat szroeter, rhs mt(sidak)

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |      chi2   df        p
-------------+-------------------------
         hsc |      0.14    1    1.0000 #
        scol |      0.17    1    1.0000 #
        ccol |      2.47    1    0.7078 #
       exper |      3.20    1    0.5341 #
      exper2 |      3.20    1    0.5341 #
  _Itencat_1 |      0.44    1    0.9992 #
  _Itencat_2 |      0.00    1    1.0000 #
  _Itencat_3 |     10.03    1    0.0153 #
      female |      1.02    1    0.9762 #
    nonwhite |      0.03    1    1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residuals-versus-fitted plot, labeled by id: the same pattern as before, with id=24 again the extreme low residual.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot for tencat==3 (x = 0 or 1), labeled by id; id=24 again sits at the bottom.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to
non-constant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations at the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder (a sketch appears
after this list).
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
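 As noted in the options above, a hedged sketch of
the transformation & interaction route (femXten &
this particular specification are illustrative
assumptions, not recommendations):

. qladder lwage
. ladder tenure
. gen femXten = female*ltenure
. xi:reg lwage hs scol ccol exper exper2 i.tencat female femXten nonwhite
. estat hettest, rhs mt(sidak)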

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1          m2_robust
-------------+--------------------------------
         hsc |   .22241643***   .22241643***
        scol |   .32030543***   .32030543***
        ccol |   .68798333***   .68798333***
       exper |   .02854957***   .02854957***
      exper2 |  -.00057702***  -.00057702***
  _Itencat_1 |   -.0027571      -.0027571
  _Itencat_2 |   .22096745***   .22096745***
  _Itencat_3 |   .28798112***   .28798112***
      female |  -.29395956***  -.29395956***
    nonwhite |  -.06409284     -.06409284
       _cons |   1.1567164***   1.1567164***
-------------+--------------------------------
           N |         526            526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors so that we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
‘estat bgodfrey’ for the Breusch-Godfrey
Lagrange Multiplier test.
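 A hedged sketch for the time-series case (t, y,
x1, & x2 are placeholders; the data must first be
declared with tsset):

. tsset t
. reg y x1 x2
. estat bgodfrey, lags(1 2)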

This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of a borderline problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation falls
from the mean of y on the Y-axis;
leverage: how far the observation falls from the
mean of x on the X-axis.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
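 A quick sketch of flagging such outliers with the
±3 cutoff (rstu was created earlier with ‘predict’):

. list id rstu if abs(rstu) >= 3 & rstu < .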

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
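 A sketch of flagging large hat values by the 3*k/n
rule, taking k as the number of estimated
coefficients, e(df_m)+1 (that reading of k is our
assumption):

. predict hatval if e(sample), hat
. list id hatval if hatval > 3*(e(df_m)+1)/e(N) & hatval < .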

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
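 A sketch of flagging observations by the 4/n
cutoff for Cook’s Distance (dfits would be checked
the same way):

. predict dcook if e(sample), cooksd
. list id dcook if dcook > 4/e(N) & dcook < .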

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of an observation’s influence on the slope
coefficients.
 Every DFBETA increment of 1 shifts the
corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more, or of at least 2
divided by the square root of n (in large samples),
represent influential outliers.
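 A hedged sketch of flagging DFBETAs by the
2/sqrt(n) cutoff (the DF_ variable names follow the
output shown later; 2/sqrt(526) is about .09 here):

. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(e(N)) & DF_Itencat_3 < .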

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot, labeled by id: leverage on the y-axis (up to about .06) against normalized residual squared on the x-axis (up to about .04); id=24 & id=128 show the largest normalized squared residuals.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[Added-variable plots of e(lwage | X) against e(x | X) for each predictor; the partial coefficient per panel:
   hsc:        coef =  .22241643, se = .04881795, t =  4.56
   scol:       coef =  .32030543, se = .05467635, t =  5.86
   ccol:       coef =  .68798333, se = .05753517, t = 11.96
   exper:      coef =  .02854957, se = .00503299, t =  5.67
   exper2:     coef = -.00057702, se = .00010737, t = -5.37
   _Itencat_1: coef = -.0027571,  se = .0491805,  t =  -.06
   _Itencat_2: coef =  .22096745, se = .04916835, t =  4.49
   _Itencat_3: coef =  .28798112, se = .0557211,  t =  5.17
   female:     coef = -.29395956, se = .03576118, t = -8.22
   nonwhite:   coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3, labeled by id (coef = .28798112, se = .0557211, t = 5.17); id=24 again sits at the bottom of the plot.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstudent
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook’s Distance (d) against id: all values are small (the largest, id=24, is only about .03, far below 1), with id=150, 128, 59, 58, 282, & 212 next.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors, if
possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA-ATS
web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
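 Hedged examples of both, simply reusing our model
(qreg & rreg ship with Stata; genwt() saves rreg’s
final weights):

. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite, genwt(w)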

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
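 A sketch of hadimvo (a downloadable command; the
gen() option follows its help file, & ‘out’ is our
name for the 0/1 outlier indicator it creates):

. hadimvo lwage exper tenure, gen(out)
. list id lwage exper tenure if out==1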

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 23

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |     m1           m2_robust
-------------+--------------------------------
         hsc |   .22241643***   .22241643***
        scol |   .32030543***   .32030543***
        ccol |   .68798333***   .68798333***
       exper |   .02854957***   .02854957***
      exper2 |  -.00057702***  -.00057702***
  _Itencat_1 |  -.0027571      -.0027571
  _Itencat_2 |   .22096745***   .22096745***
  _Itencat_3 |   .28798112***   .28798112***
      female |  -.29395956***  -.29395956***
    nonwhite |  -.06409284     -.06409284
       _cons |   1.1567164***   1.1567164***
-------------+--------------------------------
           N |        526            526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it's a cluster sample, by using Stata's cluster option with the sample-cluster variable:

. xi:reg wage educ educ2 exper i.tenure, robust cluster(district)

 But again, our data aren't based on a cluster sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.
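
 A minimal time-series sketch (the names y, x, & yearvar are illustrative—these data aren't time series):

* declare the time variable, fit the model, then test for serial correlation:
. tsset yearvar
. reg y x
. estat bgodfrey, lags(1/2)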

This model seems to be satisfactory from the perspective of linear regression's assumptions, with the exception of a statistically detectable but practically insignificant problem with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation's influence on the slope coefficients depends on its discrepancy & leverage:
discrepancy: how far the observation falls from the mean of y on the Y-axis;
leverage: how far the observation falls from the mean of x on the X-axis.

discrepancy × leverage = influence
 Highly influential observations are most likely to occur in small samples.

 Discrepancy is measured by residuals, a standardized version of which is the studentized residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or more usually represent outliers (i.e. y-values with high residuals), which indicate potential influence.
 Large outliers can affect the equation's constant, reduce its fit, & increase its standard errors; but without leverage they have little influence on the slope coefficients.

 Leverage is measured by the hat value, a non-negative statistic that summarizes how far the explanatory variables fall from their means: greater hat values mark observations farther from the x-means.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of observations, in this case on particular slope coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure of an observation's influence on the slope coefficients.
 A DFBETA of 1 means that the observation shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2 divided by the square root of n (in large samples), represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[lvr2plot: leverage (to about .06) vs. normalized residual squared (to about .04), points labeled by id; figure residue omitted.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[avplots: added-variable plots of e(lwage | X) against e(x | X) for each predictor; figure residue omitted. Panel statistics:
  hsc:        coef = .22241643,  se = .04881795, t = 4.56
  scol:       coef = .32030543,  se = .05467635, t = 5.86
  ccol:       coef = .68798333,  se = .05753517, t = 11.96
  exper:      coef = .02854957,  se = .00503299, t = 5.67
  exper2:     coef = -.00057702, se = .00010737, t = -5.37
  _Itencat_1: coef = -.0027571,  se = .0491805,  t = -.06
  _Itencat_2: coef = .22096745,  se = .04916835, t = 4.49
  _Itencat_3: coef = .28798112,  se = .0557211,  t = 5.17
  female:     coef = -.29395956, se = .03576118, t = -8.22
  nonwhite:   coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[avplot: e(lwage | X) vs. e(_Itencat_3 | X), points labeled by id; figure residue omitted. coef = .28798112, se = .0557211, t = 5.17. id=24 is labeled near the bottom of the plot.]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
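
 A sketch of how the rule-of-thumb cutoffs might be checked numerically (assuming n = 526 & k = 11 model parameters; the DF_* names follow the dfbeta output above):

* flag observations beyond the cutoffs discussed earlier:
. list id rstu if abs(rstu) > 3 & rstu < .
. list id h if h > 3*11/526 & h < .
. list id d if d > 4/526 & d < .
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(526) & DF_Itencat_3 < .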
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter d id: Cook's Distance (0 to about .03) vs. id (0 to 526); figure residue omitted. id=24 has the largest d (about .03), followed by ids 150, 128, 59, & 58.]

 Note id=24.


. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there's no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.
 Examine the model's adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg); a sketch follows below. See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.
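
 A minimal sketch—these runs aren't in the original analysis—of median (quantile) & robust regression on the same specification, for comparison with the OLS estimates:

. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite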

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—one that includes a significance test—to examine whether observations are outliers on more than one quantitative variable, see the command 'hadimvo' (a sketch follows below).
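
 A minimal hadimvo sketch (the command ships with older Stata releases; the flag name mvout & the p-level are illustrative):

* flag joint outliers on wage & tenure at the 5% level:
. hadimvo wage tenure, gen(mvout) p(.05)
. list id wage tenure if mvout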

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 Even in a normal distribution some observations fall far from the mean by chance alone (about 5% lie beyond two standard deviations).
 Don't over-fit a model to a sample—remember that there is sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust



. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 26

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------
    Variable |   chi2   df        p
-------------+--------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   1.56    1   0.9077 #
      exper2 |   0.10    1   1.0000 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
-------------+--------------------------
simultaneous |  23.68   10   0.0085
----------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
----------------------------------------
    Variable |   chi2   df        p
-------------+--------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   3.20    1   0.5341 #
      exper2 |   3.20    1   0.5341 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
----------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Figure: residuals (y-axis, -2 to 1) plotted against fitted values (x-axis, 1 to 2.5), points labeled by id; id=24 sits at the extreme low end.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Figure: residuals (y-axis, -2 to 1) plotted against tencat==3 (x-axis, 0 to 1), points labeled by id; id=24 again sits at the bottom.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to
non-constant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations at the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder (a sketch
follows this list).
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).
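
A minimal sketch of that kind of exploration, assuming tenure is the offending predictor (the interaction term is purely illustrative, not a recommended model):

* inspect candidate power transformations of tenure
. qladder tenure
* try an illustrative interaction of tenure category with experience
. xi: reg lwage hs scol ccol exper exper2 i.tencat*exper female nonwhite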

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model’s.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |   .22241643***    .22241643***
        scol |   .32030543***    .32030543***
        ccol |   .68798333***    .68798333***
       exper |   .02854957***    .02854957***
      exper2 |  -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |   .22096745***    .22096745***
  _Itencat_3 |   .28798112***    .28798112***
      female |  -.29395956***   -.29395956***
    nonwhite |  -.06409284      -.06409284
       _cons |   1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors, so that we can explore additional
diagnostics.
 Our final model, however, will use
robust standard errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither clustered nor panel/time-series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command (Breusch-Godfrey
Lagrange multiplier test).
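
A hedged sketch of how that would look with time-series data (year, y, x1, & x2 are hypothetical):

* declare the time variable, fit the model, then test for serial correlation
. tsset year
. reg y x1 x2
. estat bgodfrey, lags(1)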

This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of a problem of non-constant
variance that proved practically insignificant.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation falls from
the mean of y on the Y-axis;
leverage: how far the observation falls from the
mean of x on the X-axis.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
on their own (without high leverage) they don’t
influence the regression coefficients.
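
As a one-line sketch, the ±3 rule can be applied to the studentized residuals predicted earlier:

. list id rstu if abs(rstu) >= 3 & rstu < .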

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
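
A minimal sketch of flagging large hat values by that rule, assuming the regression is still in memory and taking k as the number of estimated coefficients (e(df_m)+1) and n as e(N):

. predict h if e(sample), hat
. list id h if h > 3*(e(df_m)+1)/e(N) & h < .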

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
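
And a similar sketch for the 4/n rule on Cook’s Distance (cooksd is the predict option these notes use below):

. predict d if e(sample), cooksd
. list id d if d > 4/e(N) & d < .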

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 A DFBETA of 1 corresponds to a shift of 1
standard error in the corresponding slope
coefficient: DFBETAs of 1 or more, or of at least 2
divided by the square root of n (in large samples),
represent influential outliers.
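
A sketch of that large-sample cutoff for the high-tenure dummy, assuming dfbeta’s generated names follow the DF_* pattern shown later in these notes:

. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(e(N)) & DF_Itencat_3 < .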

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[Figure: leverage (y-axis, 0 to .06) plotted against normalized residual squared (x-axis, 0 to .04), points labeled by id; id=465 appears highest on leverage, while ids 24 & 128 lie farthest along the residual axis.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[Figure: added-variable plots of e( lwage | X ) against the partialled-out residual of each predictor, one panel per predictor, with these panel annotations:
 hsc:        coef =  .22241643, se = .04881795, t =  4.56
 scol:       coef =  .32030543, se = .05467635, t =  5.86
 ccol:       coef =  .68798333, se = .05753517, t = 11.96
 exper:      coef =  .02854957, se = .00503299, t =  5.67
 exper2:     coef = -.00057702, se = .00010737, t = -5.37
 _Itencat_1: coef = -.0027571,  se = .0491805,  t =  -.06
 _Itencat_2: coef =  .22096745, se = .04916835, t =  4.49
 _Itencat_3: coef =  .28798112, se = .0557211,  t =  5.17
 female:     coef = -.29395956, se = .03576118, t = -8.22
 nonwhite:   coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Figure: added-variable plot of e( lwage | X ) against e( _Itencat_3 | X ), points labeled by id; coef = .28798112, se = .0557211, t = 5.17; id=24 sits at the low extreme.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[Figure: scatterplot of Cook’s Distance d (y-axis, 0 to .03) against id (x-axis, 0 to 500); id=24 stands out with the largest distance.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the
sections on model specification &
non-constant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
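
A hedged sketch of hadimvo (its gen() option names a new outlier-flag variable; the variable list here is illustrative):

. hadimvo wage exper tenure, gen(multiout)
. list id wage exper tenure if multiout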

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 27

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id)
[Added-variable plot for _Itencat_3, points labeled by observation id; id=24 again sits at the low extreme. coef = .28798112, se = .0557211, t = 5.17]
 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
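 Before plotting, a minimal numeric sketch of flagging observations against the rules of thumb discussed earlier (these cutoffs are conventions, not Stata defaults; 526 is the sample size & 11 is our count of the model’s coefficients, including the constant):
. * observations whose Cook’s Distance exceeds 4/n
. list id d if d > 4/526 & d < .
. * large studentized residuals & high-leverage points
. list id rstu if abs(rstu) > 3 & rstu < .
. list id h if h > 3*11/526 & h < .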

. scatter d id
[Scatterplot of Cook’s Distance (d, 0 to .03) against id (0 to 500); id=24 has much the largest distance, about .03, with ids 150, 128, 59, 58, 282 & 212 next, all well below .02.]
 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.
 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
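 For concreteness, a hedged sketch of those alternatives applied to the working model (qreg & rreg are official Stata commands; the quantile shown is just the median):
. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite, quantile(.5)
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite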

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
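 A minimal sketch of its use (hadimvo ships with older Stata releases; the indicator name ‘multiout’ & the variables checked here are our own choices):
. hadimvo wage educ exper tenure, gen(multiout)
. list id wage educ exper tenure if multiout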

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 28

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regression model’s
residuals departs from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or time-series observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or time-series observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, so we’ll keep an eye on id=24 for possible problems.
 Let’s examine the independent variables:


. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. ladder exper
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2
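 An aside: exper & its square will inevitably be highly correlated (see the multicollinearity section below). If that ever needed attention, centering before squaring is a common remedy; a minimal sketch, with variable names of our own:
. su exper, meanonly
. gen experc = exper - r(mean)
. gen experc2 = experc^2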

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitci lwage tenure
. twoway qfitci lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the model’s fitted values: we want it to test insignificant so that we fail to reject the null hypothesis that the model has no important omitted variables.
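 To see what ovtest does under the hood, here is a hand-rolled sketch of the same idea (estat ovtest automates this; the variable names are our own):
. reg wage educ exper tenure female nonwhite
. predict yhat if e(sample), xb
. gen yhat2 = yhat^2
. gen yhat3 = yhat^3
. gen yhat4 = yhat^4
. reg wage educ exper tenure female nonwhite yhat2 yhat3 yhat4
. test yhat2 yhat3 yhat4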

. reg wage educ exper tenure female nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of wage
Ho: model has no omitted variables
F(3, 517) = 9.37
Prob > F = 0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure female nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of lwage
Ho: model has no omitted variables
F(3, 515) = 2.11
Prob > F = 0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional specification is via ‘linktest’: it tests whether y is properly specified or not.
 linktest’s ‘_hatsq’ must test insignificant: we want to fail to reject the null hypothesis that y is specified correctly.
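 linktest amounts to the following hand-rolled sketch (variable names are our own): refit y on its own linear prediction & that prediction squared; a significant squared term signals misspecification.
. reg lwage educ educ2 exper exper2 ltenure female nonwhite
. predict xbhat if e(sample), xb
. gen xbhat2 = xbhat^2
. reg lwage xbhat xbhat2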

. linktest
------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest (Prob > F = .12)
. linktest (_hatsq p = .15)

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.
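 (The transcript never shows how the categorized predictors were built; a hedged sketch of one plausible construction—the cutpoints are our guess, & hs, scol, & ccol would be analogous education categories:)
. egen tencat = cut(tenure), at(0, 1, 5, 15, 45) icodes
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite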

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

. rvfplot, yline(0) ml(id)
[Residuals-versus-fitted plot: residuals (-2 to 1) against fitted values (1 to 2.5), points labeled by observation id. The lowest residuals belong to ids 58, 203, 381, 12, 128, &—far below the rest—24.]
 Problems of heteroscedasticity?

 So, even though the model passed
linktest & estat ovtest, at least one
basic problem remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables (see the
sketch after this list): but, because multiple
regression expresses joint linear effects, such
correlations (or their absence) aren’t reliable
indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.
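 Sign (1), at least, is easy to check directly. A minimal
sketch (the _Itencat_* dummies exist once xi: has
expanded i.tencat; names as in the vif table below):

. pwcorr hsc scol ccol exper exper2 _Itencat_* female nonwhite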

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) –
‘VIF’ > 10: measures inflation in variance due to
multicollinearity.
 The square root of VIF shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) –
‘Tolerance’ < .10: the reciprocal of VIF; measures the extent of a
predictor’s variance that is independent of the other explanatory variables.

(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’ > 15 or especially > 30.
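 To see what VIF measures, you can compute one by
hand: regress a given predictor on all the others & take
1/(1 - R^2); its square root is the factor by which that
predictor’s standard error is inflated. A sketch for exper:

. qui reg exper exper2 hsc scol ccol _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. di "VIF = " 1/(1 - e(r2)) "   sqrt(VIF) = " sqrt(1/(1 - e(r2)))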

. vif

    Variable |    VIF       1/VIF
-------------+-----------------------
       exper |   15.62    0.064013
      exper2 |   14.65    0.068269
         hsc |    1.88    0.532923
  _Itencat_3 |    1.83    0.545275
        ccol |    1.70    0.589431
        scol |    1.69    0.591201
  _Itencat_2 |    1.50    0.666210
  _Itencat_1 |    1.38    0.726064
      female |    1.07    0.934088
    nonwhite |    1.03    0.971242
-------------+-----------------------
    Mean VIF |    4.23

 The seemingly troublesome scores for exper &
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command (see the sketch after this list).

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
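 For instance, centering exper before squaring it
typically tames the exper/exper2 collinearity. A
minimal sketch without the ‘center’ add-on (exper_c
& exper_c2 are new, illustrative variable names):

. su exper, meanonly
. gen exper_c = exper - r(mean)
. gen exper_c2 = exper_c^2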

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
trustworthy approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid
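The normal quantile plot mentioned above takes one
more line (a sketch, using the rstu just created):

. qnorm rstu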

. hist rstu, norm

[Histogram of the studentized residuals (x-axis, -4 to 4) with a normal curve overlaid; y-axis: density, 0 to .5.]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
. estat imtest

Cameron & Trivedi's decomposition of IM-test

              Source |   chi2     df        p
---------------------+------------------------
  Heteroskedasticity |  21.27     13   0.0677
            Skewness |   4.25      4   0.3733
            Kurtosis |   2.47      1   0.1160
---------------------+------------------------
               Total |  27.99     18   0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

        chi2(1)      =   15.76
        Prob > chi2  =  0.0001

 There seem to be problems. Let’s
inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |   chi2   df        p
-------------+--------------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   1.56    1   0.9077 #
      exper2 |   0.10    1   1.0000 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
-------------+--------------------------------
simultaneous |  23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |   chi2   df        p
-------------+-------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   3.20    1   0.5341 #
      exper2 |   3.20    1   0.5341 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residuals-versus-fitted plot, repeated: residuals (y-axis, roughly -2 to 1) against fitted values (x-axis, roughly 1 to 2.5), labeled by id; id=24 is again the extreme low outlier.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residuals-versus-predictor plot: residuals (y-axis, roughly -2 to 1) against tencat==3 (x-axis, 0 to 1), labeled by id.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to non-constant
variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression (see
the sketch after this list).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
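 A rough sketch of the weighted-least-squares
option above, under the purely illustrative
assumption that the error variance grows with the
square of the fitted values (yhat & wt are new,
illustrative variable names):

. qui xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. predict yhat if e(sample), xb
. gen wt = 1/yhat^2
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=wt]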

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1            m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 | -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all! (Robust standard
errors leave the coefficients unchanged by
construction; the point of the comparison is that
the significance stars don’t change either.)

 See Allison, who points out that non-constant
variance has to be pronounced in order to
make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors so that we can explore additional
diagnostics.

 Our final model, however, will use
robust standard errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither clustered nor panel/time-series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey (the Breusch-Godfrey
Lagrange Multiplier test).
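A sketch for case (3), assuming time-series data
with a time variable year & a generic model of y on
x (none of which is in our wage data):

. tsset year
. reg y x
. estat bgodfrey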

This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of a marginal problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis
falls from the mean for x.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
without leverage they don’t influence the slope coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of observations on particular slope
coefficients.
 A DFBETA of 1 means that omitting the
observation would shift the corresponding slope
coefficient by 1 standard error: DFBETAs of 1 or
more, or of at least 2 divided by the square root of
n (in large samples), represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot: leverage (y-axis, roughly 0 to .06) against normalized residual squared (x-axis, roughly 0 to .04), points labeled by id; id=465 sits highest on the leverage axis & id=24 farthest out on the residual axis.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots   (look for simultaneous x/y extremes)

[Added-variable plots of e(lwage | X) against e(x | X) for each predictor, each annotated with its coefficient, se, & t:
hsc (coef = .22241643, se = .04881795, t = 4.56); scol (coef = .32030543, se = .05467635, t = 5.86);
ccol (coef = .68798333, se = .05753517, t = 11.96); exper (coef = .02854957, se = .00503299, t = 5.67);
exper2 (coef = -.00057702, se = .00010737, t = -5.37); _Itencat_1 (coef = -.0027571, se = .0491805, t = -.06);
_Itencat_2 (coef = .22096745, se = .04916835, t = 4.49); _Itencat_3 (coef = .28798112, se = .0557211, t = 5.17);
female (coef = -.29395956, se = .03576118, t = -8.22); nonwhite (coef = -.06409284, se = .05772311, t = -1.11).]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3: e(lwage | X) against e(_Itencat_3 | X), labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 again sits at the bottom.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
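 Before graphing, the cutoffs given earlier can be
turned into flags. A minimal sketch (k & n come from
the stored estimation results; the DF_Itencat_3
name assumes the dfbeta output above):

. scalar k = e(df_m)
. scalar n = e(N)
. list id rstu h d if abs(rstu) >= 3 & rstu < .
. list id rstu h d if h >= 3*k/n & h < .
. list id rstu h d if d >= 4/n & d < .
. list id DF_Itencat_3 if abs(DF_Itencat_3) >= 2/sqrt(n) & DF_Itencat_3 < .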

. scatter d id

[Scatterplot of Cook's Distance d (y-axis, 0 to roughly .03) against id (x-axis, 0 to 500); id=24 stands out with the largest value, with ids 150, 128, 59, 58, & 282 next.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
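 hadimvo is an older add-on (findit hadimvo
locates it); a sketch of its use, assuming its gen()
syntax & an illustrative varlist (mvout is a new,
illustrative variable name flagging the outliers):

. findit hadimvo
. hadimvo lwage exper tenure, gen(mvout)
. list id lwage exper tenure if mvout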

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect roughly
5% of the observations to fall beyond ±2
standard deviations by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 30

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command for the Breusch-Godfrey
Lagrange Multiplier test.
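
 For time-series data, the check might look like the following sketch (it assumes a time variable t & variables y, x1, & x2; our wage data are cross-sectional, so this is illustrative only):

. * sketch: declare the time dimension, fit the model, then run
. * the Breusch-Godfrey test for serially correlated errors
. tsset t
. reg y x1 x2
. estat bgodfrey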

This model seems satisfactory from the perspective of linear regression’s assumptions, with the exception of a practically insignificant problem of non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation falls from the mean of y on the Y-axis;
leverage: how far the observation falls from the mean of x on the X-axis.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant, reduce its fit, & increase its standard errors, but on their own (without leverage) they have little influence on the slope coefficients.

 Leverage is measured by the hat value, a non-negative statistic that summarizes how far an observation’s explanatory values fall from their means: greater hat values mark observations farther from the x-means.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 A DFBETA of 1 means that the observation shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2 divided by the square root of n (in large samples), represent influential outliers.
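
 As a rough sketch, the cutoffs above might be checked like this (rstu2, h2, & d2 are illustrative variable names; e(df_m) & e(N) are the model df & sample size saved by the last regress):

. * sketch: compute the influence measures, then count cases
. * beyond the usual cutoffs
. predict rstu2 if e(sample), rstudent
. predict h2 if e(sample), hat
. predict d2 if e(sample), cooksd
. count if abs(rstu2) >= 3 & rstu2 < .
. * the 3*k/n rule, with k = number of parameters (e(df_m)+1)
. count if h2 >= 3*(e(df_m)+1)/e(N) & h2 < .
. count if d2 >= 4/e(N) & d2 < .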

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[lvr2plot: leverage (y-axis, 0 to .06) vs. normalized residual squared (x-axis, 0 to .04), points labeled by id. Ids 465 & 306 show the highest leverage; ids 24 & 128 show the largest normalized residuals; no point is extreme on both axes.]
 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots   (look for simultaneous x/y extremes)

[avplots: added-variable plots of e(lwage | X) against each predictor’s e(x | X), one panel per regressor, each annotated with its coefficient, se, & t: hsc coef=.22241643, se=.04881795, t=4.56; scol coef=.32030543, se=.05467635, t=5.86; ccol coef=.68798333, se=.05753517, t=11.96; exper coef=.02854957, se=.00503299, t=5.67; exper2 coef=-.00057702, se=.00010737, t=-5.37; _Itencat_1 coef=-.0027571, se=.0491805, t=-.06; _Itencat_2 coef=.22096745, se=.04916835, t=4.49; _Itencat_3 coef=.28798112, se=.0557211, t=5.17; female coef=-.29395956, se=.03576118, t=-8.22; nonwhite coef=-.06409284, se=.05772311, t=-1.11.]

. avplot _Itencat_3, ml(id)

[avplot of _Itencat_3: e(lwage | X) vs. e(_Itencat_3 | X), points labeled by id; coef=.28798112, se=.0557211, t=5.17. Observation 24 again sits at the bottom of the plot (a large negative residual) but near the middle of the x-range.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
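
 For example, a quick numeric look at the largest values (a sketch only; note that gsort re-orders the data set):

. * sketch: show the five largest Cook's Distance values
. * & the 4/n reference value
. gsort -d
. list id d in 1/5
. di 4/e(N)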

. scatter d id
[scatterplot of Cook’s Distance (d) against id (x-axis 0 to 500, y-axis 0 to .03): observation 24 stands alone at the top (d ≈ .03); a handful of points (e.g., 150, 128, 59, 58, 282, 212) fall near .02, & the rest lie below.]

 Note id=24.
. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
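
 For instance, sketches of median & robust regression versions of our model (median is qreg’s default quantile):

. * sketch: median regression with the same predictors
. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. * sketch: Stata's iteratively reweighted robust regression
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite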

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
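
 A sketch of how hadimvo might be used (it’s a user-written command, so the syntax shown, the 5% cutoff, & the flag-variable name are illustrative assumptions):

. * sketch: flag joint outliers on lwage, exper, & tenure
. hadimvo lwage exper tenure, gen(multiout) p(.05)
. list id lwage exper tenure if multiout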

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect about 5% of observations to fall beyond two standard deviations from the mean by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 31

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 32

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under- or
overestimate the y/x relationship, &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know whether its wage is an error, so we’ll
keep an eye on id=24 for possible problems.
 Let’s examine the independent variables:


. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. ladder exper
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitci lwage tenure
. twoway qfitci lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed versions of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also cause
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In STATA we use ‘estat ovtest’ (also known
as the regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant, i.e. we want to fail to reject the null
hypothesis that the model has no important
omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of wage
Ho: model has no omitted variables
        F(3, 517) = 9.37
        Prob > F  = 0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
Ramsey RESET test using powers of the fitted values of lwage
Ho: model has no omitted variables
        F(3, 515) = 2.11
        Prob > F  = 0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification is via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘_hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specified correctly.

. linktest
------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest   (p = .12)
. linktest   (_hatsq p = .15)

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residuals versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

. rvfplot, yline(0) ml(id)

[Residuals-versus-fitted plot: residuals (y-axis, about -2 to 1) against fitted values of lwage (x-axis, about 1 to 2.5), points labeled by id; id=24 sits alone near the bottom.]

 Problems of heteroscedasticity?

 So, even though the model passed
linktest & estat ovtest, at least one
basic problem remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’ < .10: the reciprocal of VIF; measures the extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’ > 15 or especially > 30.
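
 For instance, a quick pre-model check with the downloadable collin command might look like this (a sketch; the variable list is just the raw predictors, & collin must be installed first, e.g. via findit):
. findit collin
. collin educ exper tenure
Condition numbers above 15 (& especially above 30) would signal trouble.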

. vif
    Variable |    VIF     1/VIF
-------------+-------------------
       exper |  15.62   0.064013
      exper2 |  14.65   0.068269
         hsc |   1.88   0.532923
  _Itencat_3 |   1.83   0.545275
        ccol |   1.70   0.589431
        scol |   1.69   0.591201
  _Itencat_2 |   1.50   0.666210
  _Itencat_1 |   1.38   0.726064
      female |   1.07   0.934088
    nonwhite |   1.03   0.971242
-------------+-------------------
    Mean VIF |   4.23

 The seemingly troublesome scores for exper
& exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?

 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this? (See the sketch after this list.)

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
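
 As a sketch of the indexing idea flagged above (x1-x3 are hypothetical, conceptually related explanatory variables; we standardize them & average them into one composite):
. egen zx1 = std(x1)
. egen zx2 = std(x2)
. egen zx3 = std(x3)
. gen myindex = (zx1 + zx2 + zx3)/3
The composite then replaces x1-x3 in the model, trading detail for stability.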

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

. hist rstu, norm

[Histogram of the studentized residuals (x-axis roughly -4 to 4) with a normal-density overlay.]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
---------------------------------------------------
              Source |       chi2     df       p
---------------------+-----------------------------
  Heteroskedasticity |      21.27     13    0.0677
            Skewness |       4.25      4    0.3733
            Kurtosis |       2.47      1    0.1160
---------------------+-----------------------------
               Total |      27.99     18    0.0621
---------------------------------------------------

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage
        chi2(1)     = 15.76
        Prob > chi2 = 0.0001

 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |     chi2   df        p
-------------+--------------------------------
         hsc |     0.14    1   1.0000 #
        scol |     0.17    1   1.0000 #
        ccol |     2.47    1   0.7078 #
       exper |     1.56    1   0.9077 #
      exper2 |     0.10    1   1.0000 #
  _Itencat_1 |     0.44    1   0.9992 #
  _Itencat_2 |     0.00    1   1.0000 #
  _Itencat_3 |    10.03    1   0.0153 #
      female |     1.02    1   0.9762 #
    nonwhite |     0.03    1   1.0000 #
-------------+--------------------------------
simultaneous |    23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |     chi2   df        p
-------------+-------------------------
         hsc |     0.14    1   1.0000 #
        scol |     0.17    1   1.0000 #
        ccol |     2.47    1   0.7078 #
       exper |     3.20    1   0.5341 #
      exper2 |     3.20    1   0.5341 #
  _Itencat_1 |     0.44    1   0.9992 #
  _Itencat_2 |     0.00    1   1.0000 #
  _Itencat_3 |    10.03    1   0.0153 #
      female |     1.02    1   0.9762 #
    nonwhite |     0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residuals-versus-fitted plot, as before: residuals against fitted values of lwage, labeled by id; id=24 again sits alone near the bottom.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residuals-versus-predictor plot: residuals (y-axis) against tencat==3 (x-axis, 0 to 1), labeled by id.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to non-constant
variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information
(see the sketch after this list).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).
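
 As a sketch of the categorizing option above (the cut points are purely illustrative, not the ones actually used to build tencat):
. egen tencat2 = cut(tenure), at(0, 2, 5, 15, 45) icodes
. tab tencat2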

 A more complicated option would be to
use weighted least squares regression.
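
 A minimal sketch of one such approach, feasible GLS in the spirit of Wooldridge (the weight construction from logged squared residuals is the standard recipe, not something the slides specify):
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. predict ehat if e(sample), resid
. gen loge2 = ln(ehat^2)
. reg loge2 hs scol ccol exper exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. predict ghat if e(sample), xb
. gen wt = 1/exp(ghat)
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=wt]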
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model’s.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)
----------------------------------------------
    Variable |      m1            m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 | -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant
variance has to be pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors so that
we can explore additional diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither clustered nor panel nor time-series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
‘estat bgodfrey’ command for the Breusch-Godfrey
Lagrange Multiplier test.
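
 A hypothetical sketch (year, y, & x1 are placeholders; our WAGE1 data are cross-sectional, so this is illustration only):
. tsset year
. reg y x1
. estat bgodfrey, lags(1/2)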

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the
X-axis falls from the mean for x.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
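
 Using the studentized residuals computed earlier, a quick screen might be (a sketch):
. list id rstu if abs(rstu) > 3 & rstu < .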

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
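
 A sketch of the corresponding screen (counting the constant in k via e(df_m)+1 is my bookkeeping, not the slides’):
. predict lev if e(sample), hat
. list id lev if lev > 3*(e(df_m)+1)/e(N) & lev < .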

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
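
 The analogous screen for overall influence might be (a sketch; 4/e(N) implements the 4/n rule):
. predict dcook if e(sample), cooksd
. list id dcook if dcook > 4/e(N) & dcook < .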

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of individual observations on slope
coefficients.
 A DFBETA of 1 means that the observation shifts the
corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more, or of at least 2
divided by the square root of n (in large samples),
represent influential outliers.
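
 A sketch of the DFBETA screen (the DF_* names follow the dfbeta naming shown later for this model; n = 526 here):
. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(526) & DF_Itencat_3 < .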

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot: leverage (y-axis) against normalized residual squared (x-axis), points labeled by id; id=24 & id=128 have the largest normalized residuals squared.]

 There are signs of a high-residual point & a high-leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots

[Added-variable plots: e(lwage | X) against e(x | X) for each predictor—hsc (coef = .22241643, se = .04881795, t = 4.56), scol (coef = .32030543, se = .05467635, t = 5.86), ccol (coef = .68798333, se = .05753517, t = 11.96), exper (coef = .02854957, se = .00503299, t = 5.67), exper2 (coef = -.00057702, se = .00010737, t = -5.37), _Itencat_1 (coef = -.0027571, se = .0491805, t = -.06), _Itencat_2 (coef = .22096745, se = .04916835, t = 4.49), _Itencat_3 (coef = .28798112, se = .0557211, t = 5.17), female (coef = -.29395956, se = .03576118, t = -8.22), nonwhite (coef = -.06409284, se = .05772311, t = -1.11). Look for simultaneous x/y extremes.]

. avplot _Itencat_3, ml(id)

[Added-variable plot: e(lwage | X) against e(_Itencat_3 | X), labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 again sits at the extreme bottom.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook's Distance (d, y-axis from 0 to about .03) against id; id=24 has the largest value, with id=150, 128, 59, & 58 next.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
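
 For instance (a sketch; qreg estimates median regression by default):
. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite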

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
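
 A sketch of hadimvo (the variable list & cutoff are illustrative):
. hadimvo wage educ exper tenure, gen(multiout) p(.01)
. list id wage educ exper tenure if multiout==1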

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 33

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[scatterplot omitted: residuals (about -2 to 1) vs. fitted values of lwage (about 1 to 2.5), points labeled by id; id=24 sits at the extreme low residual, near -2]

. rvpplot _Itencat_3, yline(0) ml(id)

[scatterplot omitted: residuals (about -2 to 1) vs. tencat==3 (0 to 1), points labeled by id]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I'm guessing that including the variable age, which the data set doesn't have, would either solve or reduce the problem. Why?

 Maybe the small # of observations at the high end of tenure matters as well.
 Other options:

 Try interactions &/or transformations based on qladder & ladder (see the sketch below).
 Categorizing a continuous predictor (multi-level or binary) may work, although at the cost of lost information.
 We also could transform the outcome variable (see qladder, ladder, etc.), though not in this example because we had good reason for creating log(wage).
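 A minimal sketch of that exploration, using commands already shown in this handout (the interaction variable at the end is purely illustrative, not something the handout estimates):

. ladder tenure
. qladder tenure
. gen expten = exper*tenure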

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |   .22241643***   .22241643***
        scol |   .32030543***   .32030543***
        ccol |   .68798333***   .68798333***
       exper |   .02854957***   .02854957***
      exper2 |  -.00057702***  -.00057702***
  _Itencat_1 |   -.0027571      -.0027571
  _Itencat_2 |   .22096745***   .22096745***
  _Itencat_3 |   .28798112***   .28798112***
      female |  -.29395956***  -.29395956***
    nonwhite |  -.06409284     -.06409284
       _cons |   1.1567164***   1.1567164***
-------------+--------------------------------
           N |        526            526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won't use robust standard errors so that we can explore additional diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there's no need to worry about correlated errors: the sample is neither clustered nor panel/time-series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.
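 A minimal sketch of that workflow; the time variable, regressors, & lag choice here are hypothetical, since our data aren't time series:

. tsset year
. reg y x1 x2
. estat bgodfrey, lags(1/4)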

This model seems to be satisfactory from the perspective of linear regression's assumptions, with the exception of a practically insignificant problem with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation's influence on the slope coefficients depends on its discrepancy & leverage:
discrepancy: how far the observation falls from the mean of y on the Y-axis;
leverage: how far the observation falls from the mean of x on the X-axis.

discrepancy × leverage = influence
 Highly influential observations are most likely to occur in small samples.

 Discrepancy is measured by residuals, a standardized version of which is the studentized residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or more usually represent outliers (i.e. y-values with high residuals), which indicate potential influence.
 Large outliers can affect the equation's constant, reduce its fit, & increase its standard errors, but on their own (without leverage) they don't influence the slope coefficients.

 Leverage is measured by the hat value, a non-negative statistic that summarizes how far an observation's explanatory values fall from their means: greater hat values are farther from the x-means.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of observations, in this case on particular slope coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure of the influence of individual observations on the slope coefficients.
 A DFBETA of 1 means that deleting the observation would shift the corresponding slope coefficient by 1 standard error: absolute DFBETAs of 1 or more, or of at least 2 divided by the square root of n (in large samples), represent influential outliers.
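 To tie the cutoffs together, here is a minimal sketch; the variable names are my own (to avoid clobbering the rstu, h, & d created later), & the cutoff formulas simply restate the handout's rules of thumb, with k = e(df_m)+1 counting the constant:

. predict rstu2 if e(sample), rstudent
. predict h2 if e(sample), hat
. predict d2 if e(sample), cooksd
. predict dfts if e(sample), dfits
. dfbeta
. list id rstu2 h2 d2 dfts if (abs(rstu2) > 3 | h2 > 3*(e(df_m)+1)/e(N) | d2 > 4/e(N) | abs(dfts) > 4/e(N)) & !missing(rstu2, h2, d2, dfts)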

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[leverage-vs-residual-squared plot omitted: leverage (y-axis, 0 to about .06) vs. normalized residual squared (x-axis, 0 to about .04), points labeled by id]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[added-variable plots omitted, one per predictor, each plotting e(lwage | X) against e(predictor | X); slopes match the regression coefficients:
hsc: coef = .22241643, se = .04881795, t = 4.56
scol: coef = .32030543, se = .05467635, t = 5.86
ccol: coef = .68798333, se = .05753517, t = 11.96
exper: coef = .02854957, se = .00503299, t = 5.67
exper2: coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571, se = .0491805, t = -.06
_Itencat_2: coef = .22096745, se = .04916835, t = 4.49
_Itencat_3: coef = .28798112, se = .0557211, t = 5.17
female: coef = -.29395956, se = .03576118, t = -8.22
nonwhite: coef = -.06409284, se = .05772311, t = -1.11]
. avplot _Itencat_3, ml(id)

[added-variable plot omitted: e(lwage | X) vs. e(_Itencat_3 | X), points labeled by id; coef = .28798112, se = .0557211, t = 5.17]

 There again is id=24. Why isn't it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatterplot omitted: Cook's Distance d (y-axis, 0 to about .03) vs. id (x-axis, 0 to 526), points labeled by id; id=24 has the largest value, near .03]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there's no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.
 Examine the model's adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile (including median) regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg); see the sketch below. See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.
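 A minimal sketch of both alternatives on the current model (assuming the xi-generated tenure dummies are still in memory):

. qreg lwage hsc scol ccol exper exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. rreg lwage hsc scol ccol exper exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite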

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
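 A minimal sketch of hadimvo (the variable list, generated flag, & 5% level are my own illustrative choices):

. hadimvo lwage exper tenure, gen(multiout) p(.05)
. list id wage lwage exper tenure if multiout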

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect 5% of the observations to be outliers.
 Don't over-fit a model to a sample—remember that there is sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 34

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[Graph: scatter d id — Cook’s Distance (d, y-axis 0 to .03) against observation id (x-axis 0 to 500); id=24 has the largest d, at about .03]

 Note id=24.
. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
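
 As a sketch of those alternatives (these commands were not run in the original analysis; the _Itencat_* dummies assume the earlier xi: i.tencat expansion):

. * median (quantile .5) regression—qreg’s default
. qreg lwage hs scol ccol exper exper2 _Itencat_* female nonwhite
. * robust regression via iteratively reweighted least squares
. rreg lwage hs scol ccol exper exper2 _Itencat_* female nonwhite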

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
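
 A minimal sketch of hadimvo (following the syntax in Chen et al.’s UCLA ATS web book; the flag-variable name multiout & the p(.05) cutoff are illustrative, & the command may need to be installed separately in newer Stata releases):

. * flag observations that are multivariate outliers at the .05 level
. hadimvo wage educ exper tenure, gen(multiout) p(.05)
. list id wage educ exper tenure if multiout==1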

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect about 5% of observations to fall beyond ±2 standard deviations—‘outliers’ by that criterion (with n=526, roughly 26 such cases by chance alone).
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 35

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 36

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’): VIF > 10. VIF measures the inflation in a
coefficient’s variance due to multicollinearity.
 The square root of VIF shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’): Tolerance < .10. Tolerance, the reciprocal of
VIF, measures the extent of an explanatory variable’s variance that is independent of the
other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’): Condition Number > 15, or especially > 30.

. vif

    Variable |    VIF     1/VIF
-------------+------------------
       exper |  15.62   0.064013
      exper2 |  14.65   0.068269
         hsc |   1.88   0.532923
  _Itencat_3 |   1.83   0.545275
        ccol |   1.70   0.589431
        scol |   1.69   0.591201
  _Itencat_2 |   1.50   0.666210
  _Itencat_1 |   1.38   0.726064
      female |   1.07   0.934088
    nonwhite |   1.03   0.971242
-------------+------------------
    Mean VIF |   4.23

 The seemingly troublesome scores for exper
& exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command (see the sketch after this list).

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
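
 A minimal sketch of centering by hand, in case the downloadable ‘center’ command isn’t installed (exper_c & exper_c2 are names made up here):

* center exper at its sample mean, then rebuild the quadratic
. su exper, meanonly
. gen exper_c = exper - r(mean)
. gen exper_c2 = exper_c^2
. xi:reg lwage hsc scol ccol exper_c exper_c2 i.tencat female nonwhite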

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

. hist rstu, norm

[histogram of rstu with a normal curve overlaid; x-axis: Studentized residuals (–4 to 4), y-axis: Density (0 to .5)]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.
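
 Two other quick looks at the same residuals (a sketch, assuming rstu from above):

* normal quantile plot: points should hug the diagonal
. qnorm rstu
* kernel density with a normal overlay
. kdensity rstu, normal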

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

              Source |      chi2     df       p
---------------------+---------------------------
  Heteroskedasticity |     21.27     13    0.0677
            Skewness |      4.25      4    0.3733
            Kurtosis |      2.47      1    0.1160
---------------------+---------------------------
               Total |     27.99     18    0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

        chi2(1)     =    15.76
        Prob > chi2 =    0.0001
 There seem to be problems. Let’s
inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
-----------------------------------------------
     Variable |     chi2   df        p
--------------+---------------------------------
          hsc |     0.14    1    1.0000 #
         scol |     0.17    1    1.0000 #
         ccol |     2.47    1    0.7078 #
        exper |     1.56    1    0.9077 #
       exper2 |     0.10    1    1.0000 #
   _Itencat_1 |     0.44    1    0.9992 #
   _Itencat_2 |     0.00    1    1.0000 #
   _Itencat_3 |    10.03    1    0.0153 #
       female |     1.02    1    0.9762 #
     nonwhite |     0.03    1    1.0000 #
--------------+---------------------------------
 simultaneous |    23.68   10    0.0085
-----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
-----------------------------------------
    Variable |     chi2   df        p
-------------+---------------------------
         hsc |     0.14    1    1.0000 #
        scol |     0.17    1    1.0000 #
        ccol |     2.47    1    0.7078 #
       exper |     3.20    1    0.5341 #
      exper2 |     3.20    1    0.5341 #
  _Itencat_1 |     0.44    1    0.9992 #
  _Itencat_2 |     0.00    1    1.0000 #
  _Itencat_3 |    10.03    1    0.0153 #
      female |     1.02    1    0.9762 #
    nonwhite |     0.03    1    1.0000 #
-----------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[rvfplot: residuals vs. Fitted values (1 to 2.5), points labeled by id; id=24 again appears at the extreme bottom]

. rvpplot _Itencat_3, yline(0) ml(id)

[rvpplot residue: residuals vs. tencat==3 (x from 0 to 1), points labeled by id; id=24 is the extreme low residual]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to non-constant
variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression (see
the sketch below).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
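
 A minimal weighted-least-squares sketch. The recipe here—modeling the log of the squared residuals, then weighting by the inverse fitted variance—is one common choice, assumed rather than taken from the slides; ehat, logesq, logvar, & wt are names made up here:

* 1. fit the model & get residuals
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. predict ehat, resid
* 2. model the variance as a function of the predictors
. gen logesq = ln(ehat^2)
. xi:reg logesq hsc scol ccol exper exper2 i.tencat female nonwhite
. predict logvar, xb
* 3. re-estimate, weighting by the inverse fitted variance
. gen wt = 1/exp(logvar)
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite [aweight=wt]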

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |   .22241643***   .22241643***
        scol |   .32030543***   .32030543***
        ccol |   .68798333***   .68798333***
       exper |   .02854957***   .02854957***
      exper2 |  -.00057702***  -.00057702***
  _Itencat_1 |  -.0027571      -.0027571
  _Itencat_2 |   .22096745***   .22096745***
  _Itencat_3 |   .28798112***   .28798112***
      female |  -.29395956***  -.29395956***
    nonwhite |  -.06409284     -.06409284
       _cons |   1.1567164***   1.1567164***
-------------+--------------------------------
           N |      526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust
standard errors so that we can
explore additional diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel/time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command for the
Breusch-Godfrey Lagrange multiplier test.
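
 A sketch of (3) for time-series data (not these data; yvar, xvar, & year are hypothetical):

* declare the time variable, fit the model, then run the
* Breusch-Godfrey LM test for serial correlation
. tsset year
. reg yvar xvar
. estat bgodfrey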

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation falls
from the mean of y on the Y-axis;
leverage: how far the observation falls
from the mean of x on the X-axis.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
without leverage they don’t necessarily move the
slope coefficients.
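
 A sketch of flagging such outliers directly (rstu created with predict, as run below in these slides):

* list observations with studentized residuals beyond +/-3
. predict rstu if e(sample), rstu
. list id rstu if abs(rstu) > 3 & rstu < .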

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
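
 A sketch of screening with the 3*k/n rule, assuming k counts all coefficients, including the constant (the predict command mirrors the one run later in these slides):

* hat (leverage) values, then flag the relatively large ones
. predict h if e(sample), hat
. list id h if h > 3*(e(df_m)+1)/e(N) & h < .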

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
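
 A sketch of computing both & screening with the slide’s cutoffs (dft is a made-up variable name):

* Cook's Distance & DFITS after the regression
. predict d if e(sample), cooksd
. predict dft if e(sample), dfits
. list id d dft if (d > 4/e(N) & d < .) | (abs(dft) > 1 & dft < .)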

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of an observation’s influence on particular slope
coefficients.
 A DFBETA of 1 means the observation shifts the
corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more—or, in large samples,
of at least 2 divided by the square root of n—
represent influential outliers.
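
 A sketch of screening one DFBETA—the high tenure category—against the 2/sqrt(n) cutoff (DF_Itencat_3 is the name this Stata’s dfbeta command generates, as shown below):

* compute DFBETAs for every predictor, then list observations
* with large influence on the top tenure category's slope
. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(e(N)) & DF_Itencat_3 < .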

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[lvr2plot: leverage (y, to about .06) vs. Normalized residual squared (x, to about .04), points labeled by id; id=465 has the highest leverage & id=24 the largest normalized residual]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots  (look for simultaneous x/y extremes)

[avplots residue: added-variable plots of e(lwage | X) against e(x | X) for each predictor, with each panel’s coef, se, & t printed beneath it—e.g., female: coef = -.29395956, se = .03576118, t = -8.22]

. avplot _Itencat_3, ml(id)

[avplot residue: e(lwage | X) vs. e(_Itencat_3 | X), points labeled by id; coef = .28798112, se = .0557211, t = 5.17]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter residue: Cook's Distance d (y, 0 to about .03) against id (x, 0 to 526); id=24 has the largest d, about .03]

 Note id=24.


. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification &
non-constant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA
ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
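
 A sketch, assuming the (older, user-written) hadimvo command is installed; mvout is a name made up here:

* flag joint outliers on several quantitative variables at once
. hadimvo wage educ exper, gen(mvout)
. list id wage educ exper if mvout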

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 37

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[lvr2plot: leverage vs. normalized residual squared, points labeled by id. ids 465 & 306 sit highest on the leverage axis (near .05-.06); ids 24 & 128 lie farthest out on the residual axis; no points fall in the upper right-hand (high-leverage, high-residual) region.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[avplots: added-variable plots of e(lwage | X) against each predictor's residuals, one panel per predictor:
hsc: coef = .22241643, se = .04881795, t = 4.56
scol: coef = .32030543, se = .05467635, t = 5.86
ccol: coef = .68798333, se = .05753517, t = 11.96
exper: coef = .02854957, se = .00503299, t = 5.67
exper2: coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571, se = .0491805, t = -.06
_Itencat_2: coef = .22096745, se = .04916835, t = 4.49
_Itencat_3: coef = .28798112, se = .0557211, t = 5.17
female: coef = -.29395956, se = .03576118, t = -8.22
nonwhite: coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[avplot: e(lwage | X) vs. e(_Itencat_3 | X), points labeled by id; id=24 again appears as the extreme low-residual point. coef = .28798112, se = .0557211, t = 5.17]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstudent
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter of Cook's Distance (d) against id: all values fall below .03; id=24 stands out as the highest point.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there's no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.
 Examine the model's adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
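
As a sketch, either alternative can be fit with the same predictors as our model (qreg estimates median regression by default):

. xi: qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi: rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite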

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
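
A sketch of hadimvo, assuming the command is installed (findit hadimvo): gen() names the new 0/1 flag variable & p() sets the significance level for flagging.

. hadimvo wage educ exper tenure, gen(multiout) p(.05)
. tab multiout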

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect roughly 5% of observations to lie more than two standard deviations from the mean, i.e. to be outliers by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 38

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 39

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is omitted variables, which also cause biased slope coefficients (see Allison, Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In STATA we use ‘estat ovtest’ (the regression specification error test, RESET) to indicate whether there are important omitted variables or not.
 ‘estat ovtest’ adds polynomials of the model’s fitted values to the regression: we want it to test insignificant, so that we fail to reject the null hypothesis that the model has no important omitted variables.
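 For intuition, a hand-rolled sketch of what RESET does (it uses powers 2-4 of the fitted values; the yhat* names are hypothetical):

* Hand-rolled RESET sketch:
. reg wage educ exper tenure female nonwhite
. predict yhat, xb
. gen yhat2=yhat^2
. gen yhat3=yhat^3
. gen yhat4=yhat^4
. reg wage educ exper tenure female nonwhite yhat2 yhat3 yhat4
. test yhat2 yhat3 yhat4
* The joint F-test (3 numerator df) mirrors what estat ovtest reports.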

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of wage
Ho: model has no omitted variables
       F(3, 517) =   9.37
       Prob > F  = 0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest

Ramsey RESET test using powers of the fitted values of lwage
Ho: model has no omitted variables
       F(3, 515) =   2.11
       Prob > F  = 0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional specification is via ‘linktest’: it tests whether y is properly specified or not.
 linktest’s ‘_hatsq’ must test insignificant: we want to fail to reject the null hypothesis that y is specified correctly.
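 Under the hood, linktest refits y on the model’s prediction & its square—a sketch, assuming the model of interest was just estimated (hat & hatsq are hypothetical names):

* Hand-rolled linktest sketch:
. predict hat, xb
. gen hatsq=hat^2
. reg lwage hat hatsq
* A significant hatsq signals functional misspecification.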

. linktest

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite

. estat ovtest   (p = .12)
. linktest       (_hatsq p = .15)

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s residuals versus fitted values to obtain a graphic perspective on the model’s fit & problems.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot: residuals (y-axis, roughly -2 to 1) against fitted values (x-axis, roughly 1 to 2.5), observations labeled by id; id=24 is the extreme low residual.]

 Problems of heteroscedasticity?

 So, even though the model passed linktest & estat ovtest, at least one basic problem remains to be overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out which of the explanatory variables has a significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) – ‘VIF’>10: measures the inflation in a coefficient’s variance due to multicollinearity.
 The square root of VIF shows the amount of increase in an explanatory variable’s standard error due to multicollinearity.
(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’<.10: the reciprocal of VIF; measures the extent of variance that is independent of the other explanatory variables.
(8) Pre-model estimation (downloadable STATA command ‘collin’) – ‘Condition Number’>15 or especially >30.
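 For intuition, VIF_j = 1/(1 - R2_j), where R2_j comes from regressing predictor j on the other predictors—a sketch for exper (it assumes the xi-generated dummies from the model below):

* VIF for exper by hand (compare to the vif output below):
. reg exper hsc scol ccol exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. di 1/(1 - e(r2))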

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
       exper |     15.62    0.064013
      exper2 |     14.65    0.068269
         hsc |      1.88    0.532923
  _Itencat_3 |      1.83    0.545275
        ccol |      1.70    0.589431
        scol |      1.69    0.591201
  _Itencat_2 |      1.50    0.666210
  _Itencat_1 |      1.38    0.726064
      female |      1.07    0.934088
    nonwhite |      1.03    0.971242
-------------+----------------------
    Mean VIF |      4.23

 The seemingly troublesome scores for exper & exper2 are an artifact of the quadratic form & pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see Mendenhall/Sincich), perhaps using STATA’s ‘center’ command—see the sketch after this list.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
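 A minimal sketch of centering by hand (the *_c names are hypothetical; an illustrative respecification, not a recommendation):

* Center exper & rebuild its quadratic from the centered version:
. su exper, meanonly
. gen exper_c=exper - r(mean)
. gen exper_c2=exper_c^2
. xi:reg lwage hs scol ccol exper_c exper_c2 i.tencat female nonwhite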

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in general: if the sample is as large as 100-200, the central limit theorem says that confidence intervals & p-values will be trustworthy approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of standardized) residuals because we can assess their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid
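 Two more quick looks at the residuals, per the plot types listed above—a sketch using the rstu variable just created:
. kdensity rstu, normal
. qnorm rstu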

. hist rstu, norm

[Histogram of the studentized residuals with a normal-density overlay; x-axis: studentized residuals (-4 to 4), y-axis: density (0 to .5).]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df         p
---------------------+-----------------------------
  Heteroskedasticity |      21.27     13    0.0677
            Skewness |       4.25      4    0.3733
            Kurtosis |       2.47      1    0.1160
---------------------+-----------------------------
               Total |      27.99     18    0.0621
---------------------------------------------------

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

       chi2(1)     =   15.76
       Prob > chi2 =  0.0001
 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
     Variable |      chi2   df        p
--------------+-------------------------------
          hsc |      0.14    1   1.0000  #
         scol |      0.17    1   1.0000  #
         ccol |      2.47    1   0.7078  #
        exper |      1.56    1   0.9077  #
       exper2 |      0.10    1   1.0000  #
   _Itencat_1 |      0.44    1   0.9992  #
   _Itencat_2 |      0.00    1   1.0000  #
   _Itencat_3 |     10.03    1   0.0153  #
       female |      1.02    1   0.9762  #
     nonwhite |      0.03    1   1.0000  #
--------------+-------------------------------
 simultaneous |     23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
----------------------------------------------
     Variable |      chi2   df        p
--------------+-------------------------------
          hsc |      0.14    1   1.0000  #
         scol |      0.17    1   1.0000  #
         ccol |      2.47    1   0.7078  #
        exper |      3.20    1   0.5341  #
       exper2 |      3.20    1   0.5341  #
   _Itencat_1 |      0.44    1   0.9992  #
   _Itencat_2 |      0.00    1   1.0000  #
   _Itencat_3 |     10.03    1   0.0153  #
       female |      1.02    1   0.9762  #
     nonwhite |      0.03    1   1.0000  #
----------------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot, labeled by id: the same pattern as before, with id=24 the extreme low residual.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot of the residuals against _Itencat_3 (tencat==3), labeled by id.]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I’m guessing that including the variable age, which the data set doesn’t have, would either solve or reduce the problem. Why?

 Maybe the small number of observations at the high end of tenure matters as well.
 Other options:

 Try interactions &/or transformations based on qladder & ladder.
 Categorizing a continuous predictor (multi-level or binary) may work—although at the cost of lost information.
 We also could transform the outcome variable (see qladder, ladder, etc.), though not in this example because we had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
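 For the WLS option, a hedged sketch of one feasible-GLS recipe (model the log of the squared residuals, then weight by the inverse of the fitted variance; ehat, loge2, lvar, & w are hypothetical names):

* Feasible GLS / WLS sketch:
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. predict ehat, resid
. gen loge2=ln(ehat^2)
. reg loge2 hs scol ccol exper exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. predict lvar, xb
. gen w=1/exp(lvar)
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=w]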

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |       m1          m2_robust
-------------+--------------------------------
         hsc |  .22241643***   .22241643***
        scol |  .32030543***   .32030543***
        ccol |  .68798333***   .68798333***
       exper |  .02854957***   .02854957***
      exper2 | -.00057702***  -.00057702***
  _Itencat_1 |  -.0027571      -.0027571
  _Itencat_2 |  .22096745***   .22096745***
  _Itencat_3 |  .28798112***   .28798112***
      female | -.29395956***  -.29395956***
    nonwhite | -.06409284     -.06409284
       _cons |  1.1567164***   1.1567164***
-------------+--------------------------------
           N |        526            526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors so that we can explore additional diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need to worry about correlated errors: the sample is neither a cluster sample nor panel or time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s ‘estat bgodfrey’ (the Breusch-Godfrey Lagrange multiplier test).
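 A hypothetical time-series sketch (it assumes a time variable t & tsset data; y & x are placeholders):
. tsset t
. reg y x
. estat bgodfrey, lags(1 2)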

This model seems to be satisfactory from the perspective of linear regression’s assumptions, with the exception of a marginal problem with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope coefficients depends on its discrepancy & leverage:
discrepancy: how far the observation on the Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
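 A quick sketch for flagging such outliers, assuming studentized residuals have been saved as rstu (see the predict command below):
. count if abs(rstu)>=3 & rstu<.
. list id lwage rstu if abs(rstu)>=3 & rstu<.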

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2/sqrt(n) (in large samples), represent influential outliers.
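 For concreteness, a sketch of these rule-of-thumb cutoffs for our model (assuming k = 11 model terms including the constant & n = 526):
. di 3*11/526      // hat-value cutoff, 3*k/n ≈ .063
. di 4/526         // Cook's Distance & DFITS cutoff, 4/n ≈ .0076
. di 2/sqrt(526)   // DFBETA cutoff, 2/sqrt(n) ≈ .087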

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot, labeled by id: leverage (y-axis, up to about .06) against normalized residual squared (x-axis, up to about .04); id=24 has the largest normalized residual squared & id=465 the highest leverage.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots

[Added-variable plots of e(lwage | X) against e(x | X) for each predictor: look for simultaneous x/y extremes. The slopes match the model's coefficients: hsc coef = .22241643 (t = 4.56); scol coef = .32030543 (t = 5.86); ccol coef = .68798333 (t = 11.96); exper coef = .02854957 (t = 5.67); exper2 coef = -.00057702 (t = -5.37); _Itencat_1 coef = -.0027571 (t = -.06); _Itencat_2 coef = .22096745 (t = 4.49); _Itencat_3 coef = .28798112 (t = 5.17); female coef = -.29395956 (t = -8.22); nonwhite coef = -.06409284 (t = -1.11).]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3, labeled by id; coef = .28798112, se = .0557211, t = 5.17.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook's Distance (d) against id; id=24 has the largest value, about .03.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression resists y-outliers, while robust regression (rreg) downweights large-residual observations & first drops gross outliers (Cook’s Distance > 1).

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
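 A hedged sketch of hadimvo (an older, downloadable command; the variable list & 5% cutoff here are illustrative):
. hadimvo wage educ exper tenure, gen(out) p(.05)
. list id wage educ exper tenure if out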

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect about 5% of the observations to fall beyond ±2 standard deviations—outliers by that criterion.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 40

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
[rvfplot: residuals (about -2 to 1) against fitted values (about 1 to 2.5), with observation ids as marker labels; id=24 sits at the extreme low end of the residuals]

. rvpplot _Itencat_3, yline(0) ml(id)

[rvpplot: residuals (about -2 to 1) against tencat==3 (0 to 1), with observation ids as marker labels; id=24 again sits at the extreme low end]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I'm guessing that including the variable age, which the data set doesn't have, would either solve or reduce the problem. Why?

 Maybe the small number of observations at the high end of tenure matters as well.
 Other options (see the sketch after this list):

 Try interactions &/or transformations based on qladder & ladder.
 Categorizing a continuous predictor (multi-level or binary) may work, although at the cost of lost information.
 We also could transform the outcome variable (see qladder, ladder, etc.), though not in this example because we had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the diagnostics & compare the new model's coefficients to the original model's.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)
----------------------------------------------
    Variable |     m1            m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 | -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It's a good idea, in any case, to specify robust standard errors in a final model.

 For now, we won’t use robust standard errors so that we can explore additional diagnostics.

 Our final model, however, will use robust standard errors.

Correlated Errors
 In the case of these data there’s no need to worry about correlated errors: the sample is neither clustered nor panel/time-series.
 In general there’s no straightforward way to check for correlated errors.
 If we suspect correlated errors, we compensate in one or more of the following three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s estat bgodfrey command for the Breusch-Godfrey Lagrange multiplier test (sketch below).
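
 A minimal sketch of that time-series check, assuming a hypothetical time variable year & placeholder regressors y, x1, & x2:

. tsset year
. reg y x1 x2
. estat bgodfrey, lags(1 2)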

This model seems to be satisfactory from the perspective of linear regression’s assumptions, with the exception of a statistically detectable but practically insignificant problem with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope coefficients depends on its discrepancy & leverage:
discrepancy: how far the observation falls from the mean of y on the Y-axis;
leverage: how far the observation falls from the mean of x on the X-axis.

discrepancy × leverage = influence
 Highly influential observations are most likely to occur in small samples.

 Discrepancy is measured by residuals, a standardized version of which is the studentized residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or more usually represent outliers (i.e. y-values with high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant, reduce its fit, & increase its standard errors, but without leverage they don’t influence the slope coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of observations, in this case on particular slope coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure of an observation’s influence on the slope coefficients.
 A DFBETA of 1 means that omitting the observation would shift the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2/√n (in large samples), represent influential outliers. (Worked cutoffs for these data follow below.)
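
 To make these cutoffs concrete for the wage data (n = 526, & assuming k = 11 estimated coefficients: the 10 predictors plus the constant), the rules of thumb work out to roughly: hat values of 3*11/526 ≈ .063 or more; Cook’s Distance or DFITS of 4/526 ≈ .0076 or more; & DFBETAs of 2/√526 ≈ .087 or more.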

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[lvr2plot: leverage (about 0 to .06) against normalized residual squared (about 0 to .04), with observation ids as marker labels; id=24 has by far the largest normalized residual squared, while ids 465 & 306 have the highest leverage]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[avplots: added-variable plots of e( lwage | X ) against e( x | X ) for each predictor; the estimated partial slopes:
hsc: coef = .22241643, se = .04881795, t = 4.56
scol: coef = .32030543, se = .05467635, t = 5.86
ccol: coef = .68798333, se = .05753517, t = 11.96
exper: coef = .02854957, se = .00503299, t = 5.67
exper2: coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571, se = .0491805, t = -.06
_Itencat_2: coef = .22096745, se = .04916835, t = 4.49
_Itencat_3: coef = .28798112, se = .0557211, t = 5.17
female: coef = -.29395956, se = .03576118, t = -8.22
nonwhite: coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[avplot: e( lwage | X ) against e( _Itencat_3 | X ), with observation ids as marker labels; coef = .28798112, se = .0557211, t = 5.17; id=24 sits at the extreme low end of the residual axis]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
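
 A sketch for pulling flagged cases out numerically, using the rule-of-thumb cutoffs computed earlier (the ‘& var < .’ clauses exclude missing values, which Stata treats as very large):

. list id rstu h d if abs(rstu) > 3 & rstu < .
. list id rstu h d if h > 3*11/526 & h < .
. list id rstu h d if d > 4/526 & d < .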
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter: Cook’s Distance d (0 to about .03) against id (1 to 526); id=24 has the largest d, followed by ids 150, 128, 59, 58, 282, & 212]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg), as sketched below. See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
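
 A minimal sketch of those alternatives on the current model (illustrative only; qreg fits median regression by default):

. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite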

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics (a sketch of hadimvo follows).
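
 A sketch of hadimvo, with hypothetical variable & option choices (gen() stores an indicator for flagged observations; p() sets the significance level):

. hadimvo wage educ exper tenure, gen(multiout) p(.05)
. list id wage educ exper tenure if multiout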

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 43

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
. estat imtest
Cameron & Trivedi's decomposition of IM-test

              Source |  chi2   df       p
---------------------+---------------------
  Heteroskedasticity | 21.27   13   0.0677
            Skewness |  4.25    4   0.3733
            Kurtosis |  2.47    1   0.1160
---------------------+---------------------
               Total | 27.99   18   0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)      =  15.76
Prob > chi2  =  0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |   chi2   df      p
-------------+--------------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   1.56    1   0.9077 #
      exper2 |   0.10    1   1.0000 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
-------------+--------------------------------
simultaneous |  23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious problem is with tenure.

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |   chi2   df      p
-------------+-------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   3.20    1   0.5341 #
      exper2 |   3.20    1   0.5341 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residuals-versus-fitted plot, points labeled by id: x-axis fitted values (1 to 2.5), y-axis residuals (-2 to 1); id=24 sits alone at the bottom.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residuals-versus-predictor plot for tencat==3 (0/1 on the x-axis), points labeled by id, residuals (-2 to 1) on the y-axis.]

 By the way, note id=24.
 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I’m guessing that including the variable age, which the data set doesn’t have, would either solve or reduce the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor (multi-level or binary) may work—although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to use weighted least squares regression (see the sketch after this list).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
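
A rough sketch of the weighted-least-squares idea above, in the feasible-GLS spirit of Wooldridge; ehat, loge2, logvar, & w are constructed names, & modeling the log squared residuals on _Itencat_3 (the offending variable) is just one way to build the weights:

. predict ehat if e(sample), resid
. gen loge2 = ln(ehat^2)          /* log squared residuals from the original model */
. reg loge2 _Itencat_3            /* model the error variance */
. predict logvar
. gen w = 1/exp(logvar)           /* weight inversely to the estimated variance */
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite [aweight=w]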

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)
----------------------------------------------
    Variable |      m1            m2_robust
-------------+--------------------------------
         hsc |   .22241643***    .22241643***
        scol |   .32030543***    .32030543***
        ccol |   .68798333***    .68798333***
       exper |   .02854957***    .02854957***
      exper2 |  -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |   .22096745***    .22096745***
  _Itencat_3 |   .28798112***    .28798112***
      female |  -.29395956***   -.29395956***
    nonwhite |  -.06409284      -.06409284
       _cons |   1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to specify robust standard errors in a final model.

 For now, we won’t use robust standard errors, so that we can explore additional diagnostics.
 Our final model, however, will use robust standard errors.

Correlated Errors
 In the case of these data there’s no need to worry about correlated errors: the sample is neither clustered nor panel/time-series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s bgodfrey command (estat bgodfrey) for the Breusch-Godfrey Lagrange multiplier test, as sketched below.
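
A minimal sketch for the time-series case; y, x, & year are hypothetical variables, not in these data:

. tsset year
. reg y x
. estat bgodfrey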

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope coefficients depends on its discrepancy & leverage:
discrepancy: how far the observation falls from the mean of y on the Y-axis;
leverage: how far the observation falls from the mean of x on the X-axis.

discrepancy × leverage = influence
 Highly influential observations are most likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
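
A minimal sketch of flagging observations beyond the ±3 cutoff (the ‘& rstu < .’ clause keeps missing values out of the list, since Stata treats missing as large):

. predict rstu if e(sample), rstu
. list id rstu if abs(rstu) > 3 & rstu < .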

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
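
A minimal sketch of flagging large hat values, reading k as the number of estimated coefficients (e(df_m)+1 after regress), which is one interpretation of the 3*k/n rule:

. predict h if e(sample), hat
. list id h if h > 3*(e(df_m)+1)/e(N) & h < .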

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
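
A minimal sketch of the corresponding flags, using the large-sample 4/n cutoff named above (swap in 1 for the absolute version):

. predict d if e(sample), cooksd
. predict dfits if e(sample), dfits
. list id d if d > 4/e(N) & d < .
. list id dfits if abs(dfits) > 4/e(N) & dfits < .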

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 A DFBETA of 1 means that the observation shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2/sqrt(n) (in large samples), represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot, points labeled by id: x-axis normalized residual squared (0 to .04), y-axis leverage (roughly .01 to .06); id=24 lies at the far right (large residual) & id=465 near the top (high leverage).]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[Added-variable plots of e( lwage | X ) against e( x | X ) for each predictor, with the panel annotations:
hsc: coef = .22241643, se = .04881795, t = 4.56
scol: coef = .32030543, se = .05467635, t = 5.86
ccol: coef = .68798333, se = .05753517, t = 11.96
exper: coef = .02854957, se = .00503299, t = 5.67
exper2: coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571, se = .0491805, t = -.06
_Itencat_2: coef = .22096745, se = .04916835, t = 4.49
_Itencat_3: coef = .28798112, se = .0557211, t = 5.17
female: coef = -.29395956, se = .03576118, t = -8.22
nonwhite: coef = -.06409284, se = .05772311, t = -1.11]
. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3, points labeled by id: e( _Itencat_3 | X ) on the x-axis (-1 to 1), e( lwage | X ) on the y-axis (-2 to 1); coef = .28798112, se = .0557211, t = 5.17. id=24 again sits at the bottom.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook’s Distance (d, 0 to .03) against id (0 to 526); id=24 stands well above the rest.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
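
A minimal sketch, assuming hadimvo is available in your Stata installation (odd is a constructed flag name marking multivariate outliers):

. hadimvo wage educ exper tenure, gen(odd)
. list id wage educ exper tenure if odd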

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect about 5% of the observations to be outliers by chance alone (i.e., to fall beyond ±2 standard deviations).
 Don’t over-fit a model to a sample—remember that there is sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust

Slide 44

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[avplots output: added-variable plots of e(lwage | X) against each predictor’s partialled values. Panel estimates: hsc coef = .22241643, se = .04881795, t = 4.56; scol coef = .32030543, se = .05467635, t = 5.86; ccol coef = .68798333, se = .05753517, t = 11.96; exper coef = .02854957, se = .00503299, t = 5.67; exper2 coef = -.00057702, se = .00010737, t = -5.37; _Itencat_1 coef = -.0027571, se = .0491805, t = -.06; _Itencat_2 coef = .22096745, se = .04916835, t = 4.49; _Itencat_3 coef = .28798112, se = .0557211, t = 5.17; female coef = -.29395956, se = .03576118, t = -8.22; nonwhite coef = -.06409284, se = .05772311, t = -1.11]
. avplot _Itencat_3, ml(id)

[avplot output: e(lwage | X) vs. e(_Itencat_3 | X), points labeled by id; id=24 sits isolated at the bottom of the plot; coef = .28798112, se = .0557211, t = 5.17]
 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
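 A numeric companion to the plot below (a sketch using the 4/n large-sample cutoff noted earlier; d is the Cook’s Distance variable just created):
. list id d if d > 4/e(N) & d < .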

. scatter d id
[scatter output: Cook’s Distance d (y-axis, 0 to .03) against id (x-axis, 0 to 500+); id=24 is the highest point, near d = .03]

 Note id=24.
. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24
 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.
 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg), as sketched below. See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.
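 A hedged illustration of those alternatives on this deck’s model (the syntax mirrors the earlier xi:reg command; whether either actually fits these data better is left open):
. xi: qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
* median (50th-percentile) regression, resistant to outlying y values
. xi: rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
* iteratively reweighted regression that downweights high-influence cases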

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
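 A minimal hadimvo sketch (hadimvo is an older Stata command; treat the option names & variable list here as assumptions for illustration):
. hadimvo wage educ exper tenure, gen(badobs) p(.05)
* badobs==1 flags joint multivariate outliers at the 5% level
. list id wage educ exper tenure if badobs==1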

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect about 5% of observations to fall beyond ±2 standard deviations, so a few outliers arise by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 45

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 46

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s residuals versus fitted values to obtain a graphic perspective on the model’s fit & problems.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot with points labeled by id: residuals run from about -2 to 1 against fitted values from about 1 to 2.5; id=24 sits at the extreme low end.]

 Problems of heteroscedasticity?

 So, even though the model passed linktest & estat ovtest, at least one basic problem remains to be overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out which of the explanatory variables has a significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) – ‘VIF’ > 10: measures inflation in variance due to multicollinearity.
 Square root of VIF: shows the amount of increase in an explanatory variable’s standard error due to multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’ < .10: the reciprocal of VIF; measures the extent of variance that is independent of the other explanatory variables.
(8) Pre-model estimation (downloadable STATA command ‘collin’) – ‘Condition Number’ > 15 or especially > 30.
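
 As a sketch of what VIF & tolerance measure for a single predictor (variable names as in the vif output below):

. qui reg exper exper2 hsc scol ccol _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. di "VIF = " 1/(1 - e(r2)) "   tolerance = " 1 - e(r2)   // VIF = 1/(1 - R-squared)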

. vif

    Variable |      VIF       1/VIF
-------------+----------------------
       exper |    15.62    0.064013
      exper2 |    14.65    0.068269
         hsc |     1.88    0.532923
  _Itencat_3 |     1.83    0.545275
        ccol |     1.70    0.589431
        scol |     1.69    0.591201
  _Itencat_2 |     1.50    0.666210
  _Itencat_1 |     1.38    0.726064
      female |     1.07    0.934088
    nonwhite |     1.03    0.971242
-------------+----------------------
    Mean VIF |     4.23

 The seemingly troublesome scores for exper & exper2 are an artifact of the quadratic form & pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?

 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see Mendenhall/Sincich), perhaps using STATA’s ‘center’ command; a sketch of centering by hand follows after this list.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
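
 As promised above, a minimal sketch of centering by hand (experc & experc2 are hypothetical names):

. su exper, meanonly
. gen experc = exper - r(mean)   // experc & experc2 are hypothetical names
. gen experc2 = experc^2

 Centering exper before squaring typically cuts the exper/exper2 correlation—& hence their VIFs—without changing the model’s fit.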

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in general: if the sample is as large as 100-200, the central limit theorem says that confidence intervals & p-values will be trustworthy approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of standardized) residuals because we can assess their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

. hist rstu, norm

[Histogram of the studentized residuals (x-axis roughly -4 to 4) with a normal-density overlay; y-axis: density, 0 to .5.]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.
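
 Other quick looks from the list above—kdensity with a normal overlay, & a normal quantile plot:

. kdensity rstu, normal
. qnorm rstu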

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
. estat imtest

Cameron & Trivedi's decomposition of IM-test

              Source |       chi2     df        p
---------------------+----------------------------
  Heteroskedasticity |      21.27     13    0.0677
            Skewness |       4.25      4    0.3733
            Kurtosis |       2.47      1    0.1160
---------------------+----------------------------
               Total |      27.99     18    0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

        chi2(1)     =  15.76
        Prob > chi2 =  0.0001

 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |      chi2   df        p
-------------+--------------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      1.56    1   0.9077 #
      exper2 |      0.10    1   1.0000 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
-------------+--------------------------------
simultaneous |     23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |      chi2   df       p
-------------+-------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      3.20    1   0.5341 #
      exper2 |      3.20    1   0.5341 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot, as before: residuals (about -2 to 1) against fitted values (about 1 to 2.5), labeled by id; id=24 again sits at the extreme low end.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot: residuals against _Itencat_3 (tencat==3, coded 0/1), labeled by id; id=24 is the extreme low outlier.]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I’m guessing that including the variable age, which the data set doesn’t have, would either solve or reduce the problem. Why?

 Maybe the small number of observations at the high end of tenure matters as well.
 Other options:

 Try interactions &/or transformations based on qladder & ladder.
 Categorizing a continuous predictor (multi-level or binary) may work—although at the cost of lost information.
 We also could transform the outcome variable (see qladder, ladder, etc.), though not in this example because we had good reason for creating log(wage).

 A more complicated option would be to use weighted least squares regression (see the sketch after this list).
 If nothing else works, we could use robust standard errors. These relax Assumption 2—to the point that we wouldn’t have to check for non-constant variance (& in fact the diagnostics for doing so wouldn’t work).
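
 As promised, a minimal sketch of one standard feasible-GLS recipe for weighted least squares—model the log squared residuals, then weight by the inverse of the fitted variance (ehat, lne2, lnvar, & wt are hypothetical names):

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. predict ehat if e(sample), resid   // ehat, lne2, lnvar, & wt are hypothetical names
. gen lne2 = ln(ehat^2)
. xi:reg lne2 hs scol ccol exper exper2 i.tencat female nonwhite
. predict lnvar if e(sample), xb
. gen wt = 1/exp(lnvar)
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aw=wt]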

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1            m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to specify robust standard errors in a final model.

 For now, we won’t use robust standard errors, so that we can explore additional diagnostics.
 Our final model, however, will use robust standard errors.

Correlated Errors
 In the case of these data there’s no need to worry about correlated errors: the sample is neither clustered nor panel/time-series.
 In general there’s no straightforward way to check for correlated errors.
 If we suspect correlated errors, we compensate in one or more of the following three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s cluster option with the sample-cluster variable.
. xi:reg wage educ educ2 exper i.tenure, robust cluster(district)
 But again, our data aren’t based on a cluster sample.

(3) if it’s time-series data, by using Stata’s ‘estat bgodfrey’ for the Breusch-Godfrey Lagrange Multiplier test.
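
 A sketch of that usage on time-series data (yr, y, & x are hypothetical names; the data must be tsset first):

. tsset yr   // yr, y, & x are hypothetical names
. reg y x
. estat bgodfrey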

 This model seems to be satisfactory from the perspective of linear regression’s assumptions, with the exception of a borderline problem with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope coefficients depends on its discrepancy & leverage:
discrepancy: how far the observation on the Y-axis falls from the mean of y;
leverage: how far the observation on the X-axis falls from the mean of x.

discrepancy + leverage = influence
 Highly influential observations are most likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2 divided by the square root of n (in large samples), represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot, labeled by id: leverage (y-axis, roughly 0 to .06) against normalized residual squared (x-axis, roughly 0 to .04). id=465 & id=306 have the highest leverage; id=24 & id=128 have the largest squared residuals.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots

[Added-variable plots for each predictor: look for simultaneous x/y extremes. Slopes (coef, se, t): hsc .22241643 (.04881795, 4.56); scol .32030543 (.05467635, 5.86); ccol .68798333 (.05753517, 11.96); exper .02854957 (.00503299, 5.67); exper2 -.00057702 (.00010737, -5.37); _Itencat_1 -.0027571 (.0491805, -0.06); _Itencat_2 .22096745 (.04916835, 4.49); _Itencat_3 .28798112 (.0557211, 5.17); female -.29395956 (.03576118, -8.22); nonwhite -.06409284 (.05772311, -1.11).]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3, labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 again sits at the extreme low end.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
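
 A quick numeric check against the rough cutoffs named earlier (assuming n = 526 & k = 11 estimated coefficients, so 4/n for d & 3*k/n for h):

. list id rstu h d if d > 4/526 & d < .        // Cook's Distance cutoff 4/n
. list id rstu h d if h > 3*11/526 & h < .     // hat-value cutoff 3*k/n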

. scatter d id

[Scatterplot of Cook's Distance (y-axis, 0 to about .03) against id (x-axis, 0 to about 526). id=24 stands out highest, followed by 150, 128, 59, & 58—all far below 1.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors, if possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.
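
 A sketch of the two alternatives applied to our model—median regression via qreg, then rreg:

. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite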

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
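
 A sketch of hadimvo’s usage (badobs is a hypothetical name for the generated indicator):

. hadimvo lwage exper tenure, gen(badobs)   // badobs is a hypothetical name
. list id lwage exper tenure if badobs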

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect about 5% of the observations to be outliers (i.e. beyond ±2 standard deviations).
 Don’t over-fit a model to a sample—remember that there is sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 47

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)
----------------------------------------------
    Variable |      m1            m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant
variance has to be pronounced in order to
make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors so that we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel/time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bgodfrey command (run after the regression
as estat bgodfrey) for the Breusch-Godfrey
Lagrange Multiplier test.
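
 A minimal sketch of that last option (my illustration; it assumes time-series data with a time variable year & variables y & x, none of which are in these data):

. tsset year
. reg y x
. estat bgodfrey, lags(1/2)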

This model seems satisfactory from the
perspective of linear regression’s assumptions,
except for a practically insignificant problem
of non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
by themselves they don’t necessarily influence the
slope coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
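
 For reference, the conventional formula (my addition, not from the slides) shows how Cook’s Distance multiplies discrepancy by leverage:

D_i = (r_i^2 / p) * h_i / (1 - h_i)

where r_i is the (internally) studentized residual, h_i the hat value, & p the number of estimated coefficients, including the constant.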

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 A DFBETA of 1 means that including (vs.
omitting) the observation shifts the
corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more in absolute value,
or of at least 2/sqrt(n) (in large samples),
represent influential outliers.
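
 A hedged sketch of screening on that cutoff (my illustration; n = 526 here, so 2/sqrt(526) is about .09; I use the deck’s DF_ naming, while newer Statas name these variables _dfbeta_#):

. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(526) & DF_Itencat_3 < .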

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Figure: leverage (up to about .06) plotted against normalized residual squared (up to about .04), markers labeled by id. Points 465 & 306 sit highest on the leverage axis; id=24 sits farthest out on the residual axis.]

 There are signs of a high-residual point & a high-leverage
point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[Figure: added-variable plots of e(lwage | X) against each predictor, one panel per coefficient:
hsc: coef = .22241643, se = .04881795, t = 4.56
scol: coef = .32030543, se = .05467635, t = 5.86
ccol: coef = .68798333, se = .05753517, t = 11.96
exper: coef = .02854957, se = .00503299, t = 5.67
exper2: coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571, se = .0491805, t = -.06
_Itencat_2: coef = .22096745, se = .04916835, t = 4.49
_Itencat_3: coef = .28798112, se = .0557211, t = 5.17
female: coef = -.29395956, se = .03576118, t = -8.22
nonwhite: coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Figure: added-variable plot of e(lwage | X) vs. e(_Itencat_3 | X), markers labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 sits at the extreme bottom of the plot but not at an x-extreme.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
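
 A quick numeric companion to the plot (my addition): sort d in descending order & list the largest few cases:

. gsort -d
. list id rstu h d in 1/5
. sort id   // restore the original sort order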

. scatter d id

[Figure: Cook’s Distance (d) plotted against id (0-526); all values fall below about .03, far under the cutoff of 1, with id=24 the largest.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
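
 A hedged sketch of those alternatives on the current model (my illustration; qreg fits the median by default):

. xi: qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi: rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite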

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
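
 A hedged sketch of that check (my illustration; hadimvo ships with older Statas & may need to be installed, & the flag variable badobs & the 5% cutoff are my choices):

. hadimvo lwage exper tenure, gen(badobs) p(.05)
. list id lwage exper tenure if badobs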

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect about
5% of the observations to fall beyond ±2
standard deviations by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust


educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 50

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[residual-vs-fitted scatterplot, markers labeled by observation id; y-axis: residuals, about -2 to 1; x-axis: Fitted values, about 1 to 2.5; a few low-end outliers, with id=24 at the very bottom]

. rvpplot _Itencat_3, yline(0) ml(id)

[residual-vs-predictor scatterplot for _Itencat_3 (tencat==3), markers labeled by observation id; y-axis: residuals, about -2 to 1; x-axis: tencat==3, 0 to 1; id=24 again sits at the very bottom]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I’m guessing that including the variable age, which the data set doesn’t have, would either solve or reduce the problem. Why?

 Maybe the small number of observations at the high end of tenure matters as well.
 Other options:

 Try interactions &/or transformations based on qladder & ladder.
 Categorizing a continuous predictor (multi-level or binary) may work—although at the cost of lost information.
 We also could transform the outcome variable (see qladder, ladder, etc.), though not in this example because we had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
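 A minimal WLS sketch, under the purely illustrative assumption that the residual variance grows in proportion to tenure + 1, so each observation is weighted by its inverse (w is a hypothetical weight variable, not part of the original analysis):
. gen w = 1/(tenure + 1)
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite [aweight=w]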

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)
----------------------------------------------
    Variable |      m1            m2_robust
-------------+--------------------------------
         hsc |   .22241643***    .22241643***
        scol |   .32030543***    .32030543***
        ccol |   .68798333***    .68798333***
       exper |   .02854957***    .02854957***
      exper2 |  -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |   .22096745***    .22096745***
  _Itencat_3 |   .28798112***    .28798112***
      female |  -.29395956***   -.29395956***
    nonwhite |  -.06409284      -.06409284
       _cons |   1.1567164***    1.1567164***
-------------+--------------------------------
           N |      526              526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to specify robust standard errors in a final model.

 For now, we won’t use robust standard errors so that we can explore additional diagnostics.
 Our final model, however, will use robust standard errors.

Correlated Errors
 In the case of these data there’s no need to worry about correlated errors: the sample is neither a cluster sample nor panel/time-series data.
 In general there’s no straightforward way to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s estat bgodfrey command for the Breusch-Godfrey Lagrange Multiplier test.
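 A minimal sketch for the time-series case (year, y, & x are hypothetical variables, not part of these data):
. tsset year
. reg y x
. estat bgodfrey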

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis falls from the mean for x.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
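 A minimal sketch of flagging such outliers, assuming rstu holds the studentized residuals:
. predict rstu if e(sample), rstu
. list id rstu if abs(rstu) > 3 & rstu < .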

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
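 A minimal sketch of the 3*k/n rule, taking k (my assumption) as the number of coefficients including the constant:
. predict h if e(sample), hat
. scalar hcut = 3*(e(df_m) + 1)/e(N)
. list id h if h > scalar(hcut) & h < .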

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
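 A minimal sketch of flagging observations by the 4/n rule:
. predict d if e(sample), cooksd
. predict dfit if e(sample), dfits
. list id d dfit if d > 4/e(N) & d < .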

 DFBETAs also measure the actual influence of observations, in this case on particular slope coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure of an observation’s influence on the slope coefficients.
 A DFBETA of 1 means that deleting the observation would shift the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2 divided by the square root of n (in large samples), represent influential outliers.
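 A minimal sketch of that rule for one coefficient, using the DF_ names shown below (the exact names dfbeta generates depend on the Stata version):
. scalar dfcut = 2/sqrt(e(N))
. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > scalar(dfcut) & DF_Itencat_3 < .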

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[leverage-vs-residual-squared plot, markers labeled by observation id; y-axis: leverage, about 0 to .06; x-axis: Normalized residual squared, 0 to .04; id=465 sits highest on leverage & id=24 farthest right on residual squared, but no point is high on both]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots   (look for simultaneous x/y extremes)

[added-variable plots of e( lwage | X ) against each predictor, one panel per variable:
hsc: coef = .22241643, se = .04881795, t = 4.56
scol: coef = .32030543, se = .05467635, t = 5.86
ccol: coef = .68798333, se = .05753517, t = 11.96
exper: coef = .02854957, se = .00503299, t = 5.67
exper2: coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571, se = .0491805, t = -.06
_Itencat_2: coef = .22096745, se = .04916835, t = 4.49
_Itencat_3: coef = .28798112, se = .0557211, t = 5.17
female: coef = -.29395956, se = .03576118, t = -8.22
nonwhite: coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[added-variable plot of e( lwage | X ) vs. e( _Itencat_3 | X ), markers labeled by observation id; coef = .28798112, se = .0557211, t = 5.17; id=24 sits at the bottom of the plot]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[scatterplot of Cook’s Distance (d, 0 to .03) against id (0 to 526); id=24 has the largest value, about .03; the next largest, around .02, include ids 150, 128, 59, 58, 282, & 212]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.
 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
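 Minimal sketches of these alternatives with the same predictors as our model:
. xi:qreg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hsc scol ccol exper exper2 i.tencat female nonwhite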

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
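 A minimal sketch of hadimvo as I recall its syntax from older Stata releases (outmv is a hypothetical variable name flagging multivariate outliers):
. hadimvo lwage exper tenure, gen(outmv)
. list id lwage exper tenure if outmv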

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect about 5% of the observations to be outliers (i.e. |z| ≥ 2).
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 51

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[Figure: scatter of Cook’s Distance (d) against id (0–526), y-axis 0–.03. id=24 stands apart at about .03; the next largest (150, 128, 59, 58) sit near .02, & all the rest fall below.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
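
 E.g., a sketch of how we might refit our model along these lines (median regression via qreg, then rreg, with the same predictors as our final model):

. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite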

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
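
 E.g., a sketch (assuming the hadimvo command is installed; ‘out’ is a flag variable we name ourselves):

. hadimvo wage educ exper tenure, gen(out)
. list id wage educ exper tenure if out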

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 52

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 53

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out which of the explanatory variables has a significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) – ‘VIF’ > 10: measures inflation in variance due to multicollinearity.
 Square root of VIF: shows the amount of increase in an explanatory variable’s standard error due to multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’ < .10: reciprocal of VIF; measures extent of variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command ‘collin’) – ‘Condition Number’ > 15 or especially > 30.
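 For reference (a standard definition, not spelled out on the slide): the VIF for the j-th predictor comes from regressing that predictor on all the others,

VIF_j = 1/(1 - R_j^2),   Tolerance_j = 1/VIF_j = 1 - R_j^2,

so VIF > 10 corresponds to R_j^2 > .90, i.e. over 90% of that predictor’s variance is shared with the other predictors.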


. vif

    Variable |      VIF       1/VIF
-------------+----------------------
       exper |    15.62    0.064013
      exper2 |    14.65    0.068269
         hsc |     1.88    0.532923
  _Itencat_3 |     1.83    0.545275
        ccol |     1.70    0.589431
        scol |     1.69    0.591201
  _Itencat_2 |     1.50    0.666210
  _Itencat_1 |     1.38    0.726064
      female |     1.07    0.934088
    nonwhite |     1.03    0.971242
-------------+----------------------
    Mean VIF |     4.23

 The seemingly troublesome scores for exper & exper2 are an artifact of the quadratic form & pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.
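 A minimal sketch of centering by hand, in case the user-written command isn’t installed (the c_ names are hypothetical):
. su exper, meanonly
. gen c_exper = exper - r(mean)
. gen c_exper2 = c_exper^2
* refit the model with c_exper & c_exper2 in place of exper & exper2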

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?
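 One simple sketch: standardize the related variables & average them into an index (x1-x3 & the z_/index names are hypothetical):
. egen z_x1 = std(x1)
. egen z_x2 = std(x2)
. egen z_x3 = std(x3)
. egen index = rowmean(z_x1 z_x2 z_x3)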

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in general: if the sample is as large as 100-200, the central limit theorem says that confidence intervals & p-values will be trustworthy approximations even if the residuals aren’t exactly normal.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of standardized) residuals because we can assess their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid
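 The other plots just mentioned can be produced the same way, using the rstu variable created above:
. kdensity rstu, normal
. qnorm rstu
. gr box rstu, marker(1, mlab(id))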

. hist rstu, norm

[Histogram of studentized residuals with a normal overlay: density (0 to .5) on the y-axis, studentized residuals (-4 to 4) on the x-axis.]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df         p
---------------------+-----------------------------
  Heteroskedasticity |      21.27     13    0.0677
            Skewness |       4.25      4    0.3733
            Kurtosis |       2.47      1    0.1160
---------------------+-----------------------------
               Total |      27.99     18    0.0621
---------------------------------------------------

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)      =  15.76
Prob > chi2  =  0.0001

 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |      chi2   df         p
-------------+--------------------------------
         hsc |      0.14    1    1.0000 #
        scol |      0.17    1    1.0000 #
        ccol |      2.47    1    0.7078 #
       exper |      1.56    1    0.9077 #
      exper2 |      0.10    1    1.0000 #
  _Itencat_1 |      0.44    1    0.9992 #
  _Itencat_2 |      0.00    1    1.0000 #
  _Itencat_3 |     10.03    1    0.0153 #
      female |      1.02    1    0.9762 #
    nonwhite |      0.03    1    1.0000 #
-------------+--------------------------------
simultaneous |     23.68   10    0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |      chi2   df        p
-------------+-------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      3.20    1   0.5341 #
      exper2 |      3.20    1   0.5341 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot, as before: residuals against fitted values, points labeled by id; the same low-end outliers (id=24, 128, 12, 203, 381) stand out.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot: residuals against tencat==3 (0 to 1), points labeled by id; the residual spread differs across the two values, & id=24 again sits at the bottom.]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I’m guessing that including the variable age, which the data set doesn’t have, would either solve or reduce the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor (multi-level or binary) may work—although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to use weighted least squares regression (see the sketch after this list).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
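 A rough sketch of one feasible WLS recipe (estimate the variance function from the log squared residuals, then weight by its inverse; this is a textbook approach, not a procedure from these slides, & the ehat/lh2hat/w names are hypothetical):
. predict ehat if e(sample), resid
. gen lehat2 = ln(ehat^2)
. reg lehat2 hsc scol ccol exper exper2 _Itencat_* female nonwhite
. predict lh2hat
. gen w = 1/exp(lh2hat)
. reg lwage hsc scol ccol exper exper2 _Itencat_* female nonwhite [aweight=w]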

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1          m2_robust
-------------+--------------------------------
         hsc |   .22241643***   .22241643***
        scol |   .32030543***   .32030543***
        ccol |   .68798333***   .68798333***
       exper |   .02854957***   .02854957***
      exper2 |  -.00057702***  -.00057702***
  _Itencat_1 |   -.0027571      -.0027571
  _Itencat_2 |   .22096745***   .22096745***
  _Itencat_3 |   .28798112***   .28798112***
      female |  -.29395956***  -.29395956***
    nonwhite |  -.06409284     -.06409284
       _cons |   1.1567164***   1.1567164***
-------------+--------------------------------
           N |         526            526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors, so that we can explore additional diagnostics.

 Our final model, however, will use robust standard errors.

Correlated Errors
 In the case of these data there’s no need to worry about correlated errors: the sample is neither clustered nor panel/time-series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s estat bgodfrey command for the Breusch-Godfrey Lagrange multiplier test.
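 A sketch for that time-series case only (hypothetical variables; assumes a time index t exists & has been declared):
. tsset t
. reg y x1 x2
. estat bgodfrey, lags(1 2)
* insignificant chi2 statistics mean no evidence of serial correlation at lags 1 & 2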

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis falls from the mean for x.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
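 A quick sketch of flagging such cases (rstu2 is a hypothetical name, chosen so as not to redefine the rstu created earlier):
. predict rstu2 if e(sample), rstu
. list id rstu2 if abs(rstu2) > 3 & rstu2 < .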

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
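 A sketch of flagging high-leverage cases by that rule (hv is a hypothetical name; the current model has k = 10 predictors & n = 526):
. predict hv if e(sample), hat
. list id hv if hv > 3*10/526 & hv < .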

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
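 A sketch of flagging cases by the slides’ 4/n rule (cd & dfit1 are hypothetical names; n = 526 here):
. predict cd if e(sample), cooksd
. predict dfit1 if e(sample), dfits
. list id cd dfit1 if (cd > 4/526 | abs(dfit1) > 4/526) & cd < .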

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 A DFBETA of 1 means that the observation shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2 divided by the square root of n (in large samples), represent influential outliers.
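 A sketch using that large-sample cutoff (dfbeta creates one DF_ variable per slope, as named later in this session; 2/sqrt(526) ≈ .087):
. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(526) & DF_Itencat_3 < .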

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[Leverage-versus-residual-squared plot: leverage (up to about .06) against normalized residual squared (up to about .04), points labeled by id; id=24 & id=128 lie farthest right on the residual axis, while id=465 & id=306 sit highest on the leverage axis.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots   (look for simultaneous x/y extremes)

[Grid of added-variable plots, e(lwage | X) against e(x | X) for each predictor; each panel’s slope equals the model coefficient:
   hsc:        coef =  .22241643, se = .04881795, t =  4.56
   scol:       coef =  .32030543, se = .05467635, t =  5.86
   ccol:       coef =  .68798333, se = .05753517, t = 11.96
   exper:      coef =  .02854957, se = .00503299, t =  5.67
   exper2:     coef = -.00057702, se = .00010737, t = -5.37
   _Itencat_1: coef = -.0027571,  se = .0491805,  t =  -.06
   _Itencat_2: coef =  .22096745, se = .04916835, t =  4.49
   _Itencat_3: coef =  .28798112, se = .0557211,  t =  5.17
   female:     coef = -.29395956, se = .03576118, t = -8.22
   nonwhite:   coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3: e(lwage | X) against e(_Itencat_3 | X), points labeled by id; id=24 again sits at the extreme low end of the residuals.
coef = .28798112, se = .0557211, t = 5.17]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[Scatterplot of Cook’s Distance (d) against id (0 to 500): nearly all values are tiny; id=24 is the largest at about .03, with id=150, 128, 59, 58, 282, & 212 next.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors, if possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 54

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
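A sketch applying those cutoffs (d & dfits are simply our variable names):

. predict d if e(sample), cooksd
. predict dfits if e(sample), dfits
. list id d dfits if (d > 4/e(N) & d < .) | (abs(dfits) > 1 & dfits < .)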

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2/sqrt(n) (in large samples), represent influential outliers.
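A sketch of the 2/sqrt(n) rule for one coefficient, using the DF_* variable names that dfbeta creates for this model:

. dfbeta
. scalar dfcut = 2/sqrt(e(N))
. list id DF_Itencat_3 if abs(DF_Itencat_3) > dfcut & DF_Itencat_3 < .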

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[lvr2plot: leverage vs. normalized residual squared, points labeled by id. Leverage runs from 0 to about .06 & normalized residual squared from 0 to about .04; ids 465 & 306 sit highest on leverage, while id=24 (followed by 128) sits farthest right on normalized residual squared.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[avplots: added-variable plots of e( lwage | X ) against each residualized predictor; panel captions—
hsc: coef = .22241643, se = .04881795, t = 4.56
scol: coef = .32030543, se = .05467635, t = 5.86
ccol: coef = .68798333, se = .05753517, t = 11.96
exper: coef = .02854957, se = .00503299, t = 5.67
exper2: coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571, se = .0491805, t = -.06
_Itencat_2: coef = .22096745, se = .04916835, t = 4.49
_Itencat_3: coef = .28798112, se = .0557211, t = 5.17
female: coef = -.29395956, se = .03576118, t = -8.22
nonwhite: coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[avplot _Itencat_3: e( lwage | X ) vs. e( _Itencat_3 | X ), labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 sits at the very bottom (large negative residual) but near the middle of the x-range.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[scatter of Cook’s Distance (d) against id: d runs from 0 to about .03; id=24 stands alone at the top (≈.03), trailed by ids 150, 128, 59, 58, 282 & 212; the rest cluster near 0.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?
 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.
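A minimal sketch of both alternatives, re-using our final specification:

. xi: qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi: rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite

qreg estimates the conditional median by default; rreg iteratively downweights observations with large residuals (after dropping any with Cook’s Distance > 1).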

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
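A sketch of hadimvo (out is our name for the generated indicator; p() sets the significance level for declaring outliers):

. hadimvo wage educ exper, gen(out) p(.05)
. list id wage educ exper if out==1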

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect about 5% of the observations to fall beyond ±2 standard deviations by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 55

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 56

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. ladder exper
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitci lwage tenure
. twoway qfitci lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is omitted variables, which also cause biased slope coefficients (see Allison, Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In STATA we use ‘estat ovtest’ (also known as the regression specification error test, RESET) to indicate whether there are important omitted variables or not.
 ‘estat ovtest’ adds polynomials of the model’s fitted values to the regression: we want the test to come out insignificant, i.e. we want to fail to reject the null hypothesis that the model has no important omitted variables.
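 A sketch of what RESET does behind the scenes (the yhat* names are ours, not part of the command): augment the model with powers of the fitted values & jointly test them:
. reg wage educ exper tenure female nonwhite
. predict yhat, xb
. gen yhat2=yhat^2
. gen yhat3=yhat^3
. gen yhat4=yhat^4
. reg wage educ exper tenure female nonwhite yhat2 yhat3 yhat4
. test yhat2 yhat3 yhat4
 The joint F-test on yhat2-yhat4 mirrors the F(3, ...) that estat ovtest reports.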

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of wage
Ho: model has no omitted variables
        F(3, 517)  =     9.37
        Prob > F   =   0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
Ramsey RESET test using powers of the fitted values of lwage
Ho: model has no omitted variables
        F(3, 515)  =     2.11
        Prob > F   =   0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional specification is via ‘linktest’: it tests whether y is properly specified or not.
 linktest’s ‘_hatsq’ must test insignificant: we want to fail to reject the null hypothesis that y is specified correctly.

. linktest
------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest   (p = .12)
. linktest   (_hatsq p = .15)

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.
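 The transcript doesn’t show how the categorized predictors were constructed; one plausible sketch (the cutpoints here are our guesses, not necessarily the ones actually used) is:
. gen hs=(educ==12)
. gen scol=(educ>12 & educ<16)
. gen ccol=(educ>=16 & educ<.)
. egen tencat=cut(tenure), at(0 1 5 15 45) icodes
 The ‘& educ<.’ guard keeps missing values out of the top education category.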

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s residuals versus fitted values to obtain a graphic perspective on the model’s fit & problems.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot: residuals (about -2 to 1) vs. fitted values of lwage (about 1 to 2.5), points labeled by id. A tail of large negative residuals stands out at the bottom, including id=58, 203, 381, 12, 128, & 24.]

 Problems of heteroscedasticity?
 So, even though the model passed linktest & estat ovtest, at least one basic problem remains to be overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out which of the explanatory variables has a significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) – ‘VIF’ > 10: measures inflation in variance due to multicollinearity.
 Square root of VIF: shows the amount of increase in an explanatory variable’s standard error due to multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’ < .10: the reciprocal of VIF; measures the extent of variance that is independent of the other explanatory variables.
(8) Pre-model estimation (downloadable STATA command ‘collin’) – ‘Condition Number’ > 15 or especially > 30.


. vif

    Variable |       VIF       1/VIF
-------------+----------------------
       exper |     15.62    0.064013
      exper2 |     14.65    0.068269
         hsc |      1.88    0.532923
  _Itencat_3 |      1.83    0.545275
        ccol |      1.70    0.589431
        scol |      1.69    0.591201
  _Itencat_2 |      1.50    0.666210
  _Itencat_1 |      1.38    0.726064
      female |      1.07    0.934088
    nonwhite |      1.03    0.971242
-------------+----------------------
    Mean VIF |      4.23

 The seemingly troublesome scores for exper & exper2 are an artifact of the quadratic form & pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see Mendenhall/Sincich), perhaps using STATA’s ‘center’ command (a sketch follows this list).

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
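 For instance, the quadratic pair exper & exper2 could be centered by hand (a sketch; experc & experc2 are our names):
. su exper
. gen experc = exper - r(mean)
. gen experc2 = experc^2
 Centering leaves the model’s fit unchanged but typically shrinks the VIFs for a variable & its square.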

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in general: if the sample is as large as 100-200, the central limit theorem says that confidence intervals & p-values will be good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of standardized) residuals because we can assess their distribution relative to the normal distribution:
. predict rstu if e(sample), rstudent
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

. hist rstu, norm

[Histogram of rstu with a normal-density overlay: density (0 to .5) vs. studentized residuals (-4 to 4).]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.
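 Two other quick looks at the same rstu variable (a sketch):
. kdensity rstu, normal
. qnorm rstu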

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

              Source |      chi2    df       p
---------------------+-------------------------
  Heteroskedasticity |     21.27    13   0.0677
            Skewness |      4.25     4   0.3733
            Kurtosis |      2.47     1   0.1160
---------------------+-------------------------
               Total |     27.99    18   0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

        chi2(1)      =    15.76
        Prob > chi2  =   0.0001
 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------
    Variable |      chi2   df       p
-------------+--------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      1.56    1   0.9077 #
      exper2 |      0.10    1   1.0000 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
-------------+--------------------------
simultaneous |     23.68   10   0.0085
----------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
----------------------------------------
    Variable |      chi2   df       p
-------------+--------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      3.20    1   0.5341 #
      exper2 |      3.20    1   0.5341 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
----------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot for the categorized-predictor model: residuals (about -2 to 1) vs. fitted values of lwage (about 1 to 2.5), points labeled by id. The same low-end outliers appear: id=58, 203, 260, 12, 128, & 24.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot: residuals (about -2 to 1) vs. tencat==3 (0 to 1), points labeled by id. Low-end outliers include id=282, 188, 457, 150, 448, 212, 128, & 24.]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I’m guessing that including the variable age, which the data set doesn’t have, would either solve or reduce the problem. Why?

 Maybe the small # of observations at the high end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor (multi-level or binary) may work—although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to use weighted least squares regression (a sketch follows this list).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
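 A sketch of the weighted-least-squares option mentioned above, using the usual log-variance recipe (uhat, luhat2, ghat, & w are our names, & this is one feasible version, not the only one):
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. predict uhat, resid
. gen luhat2 = ln(uhat^2)
. xi:reg luhat2 hs scol ccol exper exper2 i.tencat female nonwhite
. predict ghat, xb
. gen w = 1/exp(ghat)
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=w]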

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------------
    Variable |       m1             m2_robust
-------------+--------------------------------------
         hsc |   .22241643***    .22241643***
        scol |   .32030543***    .32030543***
        ccol |   .68798333***    .68798333***
       exper |   .02854957***    .02854957***
      exper2 |  -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |   .22096745***    .22096745***
  _Itencat_3 |   .28798112***    .28798112***
      female |  -.29395956***   -.29395956***
    nonwhite |  -.06409284      -.06409284
       _cons |   1.1567164***    1.1567164***
-------------+--------------------------------------
           N |        526             526
----------------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors so that we can explore additional diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need to worry about correlated errors: the sample is neither a cluster sample nor panel or time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s ‘estat bgodfrey’ command for the Breusch-Godfrey Lagrange Multiplier test.
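 A sketch of that workflow on hypothetical time-series data (y, x, & year are placeholders, not variables in WAGE1):
. tsset year
. reg y x
. estat bgodfrey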

This model seems to be satisfactory from the perspective of linear regression’s assumptions, with the exception of a practically insignificant problem with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation falls from the mean of y on the Y-axis;
leverage: how far the observation falls from the mean of x on the X-axis.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or more usually represent outliers (i.e. y-values with large residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant, reduce its fit, & increase its standard errors, but without high leverage they don’t influence the regression coefficients.
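 A quick way to list such outliers (a sketch; drop rstu first if it already exists from earlier):
. predict rstu if e(sample), rstudent
. list id rstu if abs(rstu) >= 3 & rstu < .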

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small or moderate samples: values of 3*k/n or more (where k is the number of coefficients & n the sample size) are relatively large & indicate potential influence.
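 A sketch of flagging large hat values against that cutoff; e(df_m)+1 counts the model’s coefficients, including the constant:
. predict h if e(sample), hat
. list id h if h >= 3*(e(df_m)+1)/e(N) & h < .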

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
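 A sketch of screening on those cutoffs (dfits1 is our name; the 4/n rule is used here since n = 526 is large):
. predict d if e(sample), cooksd
. predict dfits1 if e(sample), dfits
. list id d dfits1 if (d >= 4/e(N) & d < .) | (abs(dfits1) >= 1 & dfits1 < .)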

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2 divided by the square root of n (in large samples), represent influential outliers.
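 A sketch of screening one coefficient’s DFBETAs; the generated names vary by Stata version (this transcript’s version produces DF_* names, newer ones produce _dfbeta_*):
. dfbeta
. list id _dfbeta_1 if abs(_dfbeta_1) >= 2/sqrt(e(N)) & _dfbeta_1 < .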

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-squared-residual plot: leverage (about 0 to .06) vs. normalized residual squared (about 0 to .04), points labeled by id. id=465 has the highest leverage; id=128 & id=24 have the largest squared residuals; no points fall in the high-leverage, high-residual (top right) corner.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots

[Added-variable (partial-regression) plots, one per predictor: look for simultaneous x/y extremes. The slope annotations read: hsc coef = .22241643 (t = 4.56); scol coef = .32030543 (t = 5.86); ccol coef = .68798333 (t = 11.96); exper coef = .02854957 (t = 5.67); exper2 coef = -.00057702 (t = -5.37); _Itencat_1 coef = -.0027571 (t = -.06); _Itencat_2 coef = .22096745 (t = 4.49); _Itencat_3 coef = .28798112 (t = 5.17); female coef = -.29395956 (t = -8.22); nonwhite coef = -.06409284 (t = -1.11).]

. avplot _Itencat_3, ml(id)

[Added-variable plot: e(lwage | X) vs. e(_Itencat_3 | X), points labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 sits at the very bottom of the plot.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstudent
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook’s Distance d (0 to about .03) against id (1 to 526). id=24 is highest at about .03, followed by id=150, 128, 282, & 212; all values are far below the conventional cutoff of 1.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.
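 A sketch of hadimvo on a few of this model’s quantitative variables (an older/user-supplied command; badobs is our name for the generated flag):
. hadimvo lwage exper tenure, gen(badobs)
. list id lwage exper tenure if badobs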

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 57

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

. rvpplot _Itencat_3, yline(0) ml(id)

[rvpplot: residuals (about -2 to +1) against tencat==3 (0 to 1), points labeled by id; id=24 again sits at the bottom]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I'm guessing that including the variable age, which the data set doesn't have, would either solve or reduce the problem. Why?

 Maybe the small # of observations for the high end of tenure matters as well.
 Other options:

 Try interactions &/or transformations based on qladder & ladder (see the sketch after this list).
 Categorizing a continuous predictor (multi-level or binary) may work—although at the cost of lost information.
 We also could transform the outcome variable (see qladder, ladder, etc.), though not in this example because we had good reason for creating log(wage).
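 A minimal sketch of the first two options (hiten & exper_hiten are hypothetical names for an illustrative interaction; the other variables follow the commands used earlier):

. ladder tenure
. qladder tenure
. gen byte hiten = (tencat==3)
. gen exper_hiten = exper*hiten
. xi:reg lwage hsc scol ccol exper exper2 i.tencat exper_hiten female nonwhite
. estat hettest, rhs mt(sidak)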

 A more complicated option would be to use weighted least squares regression (see the sketch below).
 If nothing else works, we could use robust standard errors. These relax Assumption 2—to the point that we wouldn't have to check for non-constant variance (& in fact the diagnostics for doing so wouldn't work).
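 A minimal sketch of feasible WLS (FGLS along the lines of Wooldridge; ehat, loge2, ghat, & w are hypothetical variable names):

. quietly xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. predict ehat if e(sample), resid
. gen loge2 = ln(ehat^2)
. quietly reg loge2 hsc scol ccol exper exper2 _Itencat_* female nonwhite
. predict ghat
. gen w = 1/exp(ghat)
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite [aweight=w]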

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)
----------------------------------------------
    Variable |      m1          m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won't use robust standard errors so that we can explore additional diagnostics.
 Our final model, however, will use robust standard errors.

Correlated Errors
 In the case of these data there's no need to worry about correlated errors: the sample is neither a cluster sample nor panel/time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it's time-series data, by using Stata's estat bgodfrey command for the Breusch-Godfrey Lagrange multiplier test.
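 A minimal sketch, assuming time-series data with a time variable t and hypothetical y & x (these wage data have neither):

. tsset t
. reg y x
. estat bgodfrey, lags(1/4)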

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the corresponding slope coefficient by 1 standard deviation: DFBETAs of 1 or more, or of at least 2 divided by the square root of n (in large samples), represent influential outliers.
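 As a sketch, these rules of thumb can be checked numerically (variable names anticipate the predict commands used below; thresholds follow the cutoffs above):

. quietly xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. predict rstu if e(sample), rstudent
. predict h if e(sample), hat
. predict d if e(sample), cooksd
. list id rstu h d if abs(rstu)>3 & rstu<.
. list id rstu h d if h>3*(e(df_m)+1)/e(N) & h<.
. list id rstu h d if d>4/e(N) & d<.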

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[lvr2plot: leverage (about 0 to .06) against normalized residual squared (about 0 to .04), points labeled by id; id=24 has the largest normalized residual squared & id=465 the largest leverage]
 There are signs of a high-residual point & a high-leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[avplots: added-variable plots of e(lwage|X) against each predictor's partial; panel estimates — hsc: coef=.22241643, t=4.56; scol: coef=.32030543, t=5.86; ccol: coef=.68798333, t=11.96; exper: coef=.02854957, t=5.67; exper2: coef=-.00057702, t=-5.37; _Itencat_1: coef=-.0027571, t=-.06; _Itencat_2: coef=.22096745, t=4.49; _Itencat_3: coef=.28798112, t=5.17; female: coef=-.29395956, t=-8.22; nonwhite: coef=-.06409284, t=-1.11]

. avplot _Itencat_3, ml(id)

[avplot: e(lwage|X) against e(_Itencat_3|X), points labeled by id; coef = .28798112, se = .0557211, t = 5.17; id=24 again sits at the bottom of the plot]

 There again is id=24. Why isn't it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter d id: Cook's Distance (0 to about .03) by observation id; id=24 has the largest value, followed by ids 150, 128, 59, 58, 282, & 212]

 Note id=24.
. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there's no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.
 Examine the model's adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.
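 A minimal sketch of both alternatives with the current model (qreg defaults to median regression; the specifications simply mirror the OLS model):

. xi:qreg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hsc scol ccol exper exper2 i.tencat female nonwhite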

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
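 A minimal sketch of hadimvo (available in older Stata releases or via search hadimvo; the variable pair, the generated flag multiout, & the p(.05) cutoff are illustrative only):

. hadimvo wage exper, gen(multiout) p(.05)
. list id wage exper if multiout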

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 58

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg); see the sketch after this list. See Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
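 E.g., a minimal sketch of the adjust-&-compare routine, reusing est store & est table (the female-by-tenure interaction is purely illustrative, not a recommendation):

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m_base
. xi:reg lwage hs scol ccol exper exper2 i.tencat*female nonwhite
. est store m_adj
. est table m_base m_adj, star stats(N)

 And a sketch of the alternative estimators (qreg defaults to median regression; rreg’s genwt() option saves the final weights, so listing the smallest weights shows which observations were downweighted most):

. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite, genwt(w)
. sort w
. list id w in 1/10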

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
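 A hedged sketch of hadimvo (user-written in current Statas, installable via findit hadimvo; the variable list & the 5% level here are illustrative):

. hadimvo wage educ exper tenure, gen(mvout) p(.05)
. list id wage educ exper tenure if mvout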

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect about 5% of observations to fall beyond 2 standard deviations of the mean, i.e. some apparent outliers arise by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 59

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 60

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
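
 A minimal sketch of the centering remedy for this model’s quadratic in exper; plain generate arithmetic is used in case the user-written ‘center’ command isn’t installed:

. su exper, meanonly
. gen exper_c = exper - r(mean)      // center exper on its sample mean
. gen exper_c2 = exper_c^2           // rebuild the quadratic from the centered term
. xi:reg lwage hsc scol ccol exper_c exper_c2 i.tencat female nonwhite
. vif                                // the centered pair should show far lower VIFs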

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid
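
 Two of the other plot options named above, as a quick sketch (both are built-in commands):

. qnorm rstu               // normal quantile plot of the studentized residuals
. kdensity rstu, normal    // kernel density with a normal overlay for comparison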

. hist rstu, norm

[histogram of studentized residuals (x: Studentized residuals, y: Density) with normal curve overlaid]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

             Source |  chi2   df       p
--------------------+--------------------
 Heteroskedasticity | 21.27   13  0.0677
           Skewness |  4.25    4  0.3733
           Kurtosis |  2.47    1  0.1160
--------------------+--------------------
              Total | 27.99   18  0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.
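
 As a sketch, the whole battery can be run right after the regression; the default estat hettest produced the output below:

. estat hettest                   // Breusch-Pagan/Cook-Weisberg on the fitted values
. estat hettest, rhs mt(sidak)    // then test each right-hand-side variable
. estat szroeter, rhs mt(sidak)
. estat imtest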

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)     =  15.76
Prob > chi2 =  0.0001

 There seem to be problems. Let’s
inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------
    Variable |   chi2   df       p
-------------+--------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   1.56    1   0.9077 #
      exper2 |   0.10    1   1.0000 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
-------------+--------------------------
simultaneous |  23.68   10   0.0085
----------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |   chi2   df       p
-------------+-------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   3.20    1   0.5341 #
      exper2 |   3.20    1   0.5341 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[rvfplot: residuals vs. fitted values, points labeled by id]

. rvpplot _Itencat_3, yline(0) ml(id)

[rvpplot: residuals vs. tencat==3, points labeled by id]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to non-constant
variance. I’m guessing that including the
variable age, which the data set doesn’t have,
would either solve or reduce the problem. Why?

 Maybe the small # of observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder (see the sketch
after this list).
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
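
 A sketch of the transformation & weighting options above; the weight is purely hypothetical (it just assumes the residual variance grows with tenure):

. qladder tenure                  // grid of quantile-normal plots across the ladder of powers
. ladder tenure                   // the matching formal tests
. gen w = 1/(1 + tenure)          // hypothetical inverse-variance weight
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite [aweight=w]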

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |   .22241643***    .22241643***
        scol |   .32030543***    .32030543***
        ccol |   .68798333***    .68798333***
       exper |   .02854957***    .02854957***
      exper2 |  -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |   .22096745***    .22096745***
  _Itencat_3 |   .28798112***    .28798112***
      female |  -.29395956***   -.29395956***
    nonwhite |  -.06409284      -.06409284
       _cons |   1.1567164***    1.1567164***
-------------+--------------------------------
           N |         526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors so that we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel/time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command for the Breusch-Godfrey
Lagrange Multiplier test.
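
 A sketch for the time-series case (t, y, & the x’s are hypothetical placeholders):

. tsset t                         // declare the time variable first
. reg y x1 x2
. estat bgodfrey, lags(1/4)       // Breusch-Godfrey LM test for serial correlation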

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the
X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
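
 In Stata terms, a sketch of the ±3 rule (rstu is the name used throughout these slides):

. predict rstu if e(sample), rstu
. list id rstu if abs(rstu) >= 3 & rstu < .    // flag y-outliers beyond +/-3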

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
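
 The same idea for hat values, sketching the 3*k/n cutoff (k counts the constant, hence e(df_m)+1):

. predict h if e(sample), hat
. scalar hcut = 3*(e(df_m)+1)/e(N)             // 3*k/n
. list id h if h >= hcut & h < .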

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
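
 And a sketch for the overall-fit measures, using the large-sample 4/n rule stated above:

. predict d if e(sample), cooksd
. predict dfit if e(sample), dfits
. list id d dfit if (d >= 4/e(N) | abs(dfit) >= 4/e(N)) & d < .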

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 A DFBETA of 1 means the observation shifts the
corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more, or of at least 2
divided by the square root of n (in large samples),
represent influential outliers.
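
 A sketch for DFBETAs (the DF_* names follow the output shown later in these slides; newer Stata versions name them _dfbeta_*):

. dfbeta                                       // one DF_* variable per predictor
. list id DF_Itencat_3 if abs(DF_Itencat_3) >= 2/sqrt(e(N)) & DF_Itencat_3 < .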

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[lvr2plot: leverage vs. normalized residual squared, points labeled by id]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[avplots: one added-variable plot per predictor, each annotated with its coef, se, & t]

. avplot _Itencat_3, ml(id)

[avplot: e(lwage | X) vs. e(_Itencat_3 | X), points labeled by id; coef = .28798112, se = .0557211, t = 5.17]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[scatter plot of Cook's Distance (d) against id, points labeled by id]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
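
 A sketch of hadimvo; the variable list & cutoff are illustrative only, & the command is from older Stata releases so it may need to be installed separately:

. hadimvo lwage exper tenure, gen(badobs) p(.05)   // flags joint outliers at p = .05
. tab badobs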

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 61

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[Graph: leverage-versus-squared-residual plot, leverage
(0 to .06) on the vertical axis & normalized residual
squared (0 to .04) on the horizontal axis, points labeled
by id. A few points (e.g., 465 & 306) sit relatively high
on the leverage axis; id=24 (with id=128 nearby) lies far
out on the residual axis; the top right-hand corner is
empty.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[Graph: added-variable plots of e(lwage | X) against the
partial residual of each predictor. Panel slopes (coef,
se, t): hsc .22241643, .04881795, 4.56; scol .32030543,
.05467635, 5.86; ccol .68798333, .05753517, 11.96;
exper .02854957, .00503299, 5.67; exper2 -.00057702,
.00010737, -5.37; _Itencat_1 -.0027571, .0491805, -.06;
_Itencat_2 .22096745, .04916835, 4.49; _Itencat_3
.28798112, .0557211, 5.17; female -.29395956, .03576118,
-8.22; nonwhite -.06409284, .05772311, -1.11.]

. avplot _Itencat_3, ml(id)

[Graph: added-variable plot of e(lwage | X) against
e(_Itencat_3 | X), points labeled by id; coef = .28798112,
se = .0557211, t = 5.17. id=24 again appears at the
extreme bottom of the plot.]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
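
 One hedged way (not from the slides) to follow the
summary up numerically is to rank the sample by d, list
the most influential cases, & then restore the original
order:

. gsort -d
. list id rstu h d in 1/10
. sort id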

. scatter d id
[Graph: scatterplot of Cook’s Distance d (vertical axis,
0 to .03) against id (horizontal axis, 0 to 500+). id=24
stands alone at the top at roughly .03; a handful of
cases (e.g., 150, 128, 59, 58, 282, 212, 105, 260, 382,
381, 487) fall between .01 & .02; the rest cluster
near 0.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments (e.g., adding
omitted variables, interactions, log or other
transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS
web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
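
 As a hedged illustration with this document’s
variables, those alternatives might be run as:

. xi: qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi: rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite

 qreg estimates median regression by default; both
lines are sketches, not a recommended final model.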

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
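
 A hedged sketch of hadimvo (an older official
command that must be installed separately in recent
Statas); the indicator name odd & the p(.05) cutoff are
illustrative:

. hadimvo wage educ exper tenure, gen(odd) p(.05)
. list id wage educ exper tenure if odd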

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 Even in a normal distribution we expect about
5% of the observations to fall beyond ±2 standard
deviations—i.e. some ‘outliers’ occur by chance
alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 62

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 63

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification is via ‘linktest’: it tests whether y
is properly specified.
 linktest’s ‘_hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specified correctly.
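 A hedged sketch of linktest’s logic (its
internals may differ slightly):

. * Hedged sketch: refit y on its linear prediction & the squared
. * prediction; xhat plays the role of _hat, xhat2 of _hatsq
. quietly reg lwage educ educ2 exper exper2 ltenure female nonwhite
. predict xhat, xb
. gen xhat2 = xhat^2
. reg lwage xhat xhat2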

. linktest

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.
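 The transcript doesn’t show how the
categorized predictors were built; a hedged
sketch (the codings & cut points are my
assumptions, not taken from the original) might
be:

. * Hypothetical construction of the education dummies & categorized tenure
. gen hsc  = (educ==12)
. gen scol = (educ>12 & educ<16)
. gen ccol = (educ>=16)
. egen tencat = cut(tenure), at(0 1 5 15 45) icodes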

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest   [Prob > F = .12]
. linktest       [_hatsq p = .15]

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residuals versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot, points labeled by id; x-axis: fitted values, roughly 1 to 2.5; y-axis: residuals, roughly -2 to 1. id=24 is the extreme low residual.]

 Problems of heteroscedasticity?
 So, even though the model passed
linktest & estat ovtest, at least one
basic problem remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’<.10: the reciprocal of VIF; measures the
extent of variance that is independent of the other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.


. vif

    Variable |    VIF       1/VIF
-------------+----------------------
       exper |   15.62    0.064013
      exper2 |   14.65    0.068269
         hsc |    1.88    0.532923
  _Itencat_3 |    1.83    0.545275
        ccol |    1.70    0.589431
        scol |    1.69    0.591201
  _Itencat_2 |    1.50    0.666210
  _Itencat_1 |    1.38    0.726064
      female |    1.07    0.934088
    nonwhite |    1.03    0.971242
-------------+----------------------
    Mean VIF |    4.23

 The seemingly troublesome scores for exper
& exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.
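 To see where these numbers come from, here
is a hedged by-hand computation of the VIF for
exper, via an auxiliary regression on the other
predictors:

. * VIF = 1/(1 - R-squared) from regressing predictor j on the others;
. * tolerance is the complement, 1 - R-squared
. quietly reg exper exper2 hsc scol ccol _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. di "VIF = " 1/(1 - e(r2)) "   tolerance = " 1 - e(r2)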

What would we do if there were a
problem of multicollinearity?

 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command (a sketch follows this list).

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this? (see the
sketch after this list)

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
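 Hedged sketches of two of the remedies
above—centering by hand, & a simple
standardized index (x1 & x2 are hypothetical
placeholders, not variables in these data):

. * Centering a predictor around its mean
. su exper, meanonly
. gen c_exper  = exper - r(mean)
. gen c_exper2 = c_exper^2
. * A simple two-item standardized index
. egen z_x1 = std(x1)
. egen z_x2 = std(x2)
. gen index = (z_x1 + z_x2)/2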

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid
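 A couple of hedged additions in the same
spirit—other standard views of the residuals
(the normal quantile plot & kernel density the
bullet above mentions):

. qnorm rstu
. kdensity rstu, normal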

. hist rstu, norm

[Histogram of the studentized residuals with a normal-density overlay; x-axis: studentized residuals, -4 to 4; y-axis: density, 0 to .5.]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
. estat imtest

Cameron & Trivedi's decomposition of IM-test

              Source |   chi2   df        p
---------------------+------------------------
  Heteroskedasticity |  21.27   13   0.0677
            Skewness |   4.25    4   0.3733
            Kurtosis |   2.47    1   0.1160
---------------------+------------------------
               Total |  27.99   18   0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

        chi2(1)      =  15.76
        Prob > chi2  =  0.0001
 There seem to be problems. Let’s inspect
the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |   chi2   df        p
-------------+--------------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   1.56    1   0.9077 #
      exper2 |   0.10    1   1.0000 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
-------------+--------------------------------
simultaneous |  23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |   chi2   df        p
-------------+-------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   3.20    1   0.5341 #
      exper2 |   3.20    1   0.5341 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot, points labeled by id; x-axis: fitted values, roughly 1 to 2.5; y-axis: residuals, roughly -2 to 1. id=24 again sits at the extreme low end.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot for _Itencat_3 (tencat==3), points labeled by id; residuals run from about -2 to 1. id=24 is the extreme low residual.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to
non-constant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression (a
hedged sketch follows this list).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
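 A hedged sketch of the weighted-least-squares
idea (the variance model below is my
assumption, not part of the original analysis):

. * Feasible WLS sketch: model ln(e^2) on the offending predictor, then
. * weight by the inverse of the estimated variance
. quietly xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. predict res, resid
. gen lres2 = ln(res^2)
. quietly reg lres2 _Itencat_3
. predict lvhat, xb
. gen w = 1/exp(lvhat)
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=w]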

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |   .22241643***   .22241643***
        scol |   .32030543***   .32030543***
        ccol |   .68798333***   .68798333***
       exper |   .02854957***   .02854957***
      exper2 |  -.00057702***  -.00057702***
  _Itencat_1 |   -.0027571      -.0027571
  _Itencat_2 |   .22096745***   .22096745***
  _Itencat_3 |   .28798112***   .28798112***
      female |  -.29395956***  -.29395956***
    nonwhite |  -.06409284     -.06409284
       _cons |   1.1567164***   1.1567164***
-------------+--------------------------------
           N |        526            526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors, so that we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel or
time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command for the Breusch-Godfrey
Lagrange Multiplier test (a hedged sketch follows).
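 A hedged sketch for the time-series case
(ours isn’t; year, y, & x are hypothetical
placeholders):

. * LM test for serial correlation at lags 1 & 2
. tsset year
. quietly reg y x
. estat bgodfrey, lags(1 2)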

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the
X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
divided by the square root of n (in large samples),
represent influential outliers.
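 A hedged sketch of applying these
rule-of-thumb cutoffs (k = 11 coefficients &
n = 526 are my readings of the current model):

. * Flag potentially influential observations against the rules of thumb
. predict hat0 if e(sample), hat
. predict cook0 if e(sample), cooksd
. list id hat0 if hat0 > 3*11/526 & hat0 < .
. list id cook0 if cook0 > 4/526 & cook0 < .
. * after running dfbeta (below), DFBETAs could be screened similarly, e.g.
. * list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(526)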

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot, points labeled by id; x-axis: normalized residual squared, 0 to about .04; y-axis: leverage, about .01 to .06. id=24 stands out on the residual axis.]

 There are signs of a high-residual point & a high-leverage
point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[Added-variable plots of e(lwage | X) against each predictor's residuals. Panel annotations: hsc coef = .22241643 (t = 4.56); scol coef = .32030543 (t = 5.86); ccol coef = .68798333 (t = 11.96); exper coef = .02854957 (t = 5.67); exper2 coef = -.00057702 (t = -5.37); _Itencat_1 coef = -.0027571 (t = -.06); _Itencat_2 coef = .22096745 (t = 4.49); _Itencat_3 coef = .28798112 (t = 5.17); female coef = -.29395956 (t = -8.22); nonwhite coef = -.06409284 (t = -1.11).]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3, points labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 again appears at the extreme low end of the residuals.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook's Distance (d, 0 to about .03) against id (0 to 526). id=24 has the largest d, followed by ids 150, 128, 59, 58, 282, & 212—all far below the cutoff of 1.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification &
non-constant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA-ATS
web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
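 Hedged sketches with the same specification
(command forms only; results not run here):

. * Median regression, then robust regression, on the same model
. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite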

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
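 A hedged sketch of hadimvo (an older,
user-supplied command; its exact syntax may
vary by Stata version, & ‘outlier’ is an
illustrative name):

. * Flag multivariate outliers at the .05 significance level
. hadimvo lwage exper tenure, gen(outlier) p(.05)
. list id lwage exper tenure if outlier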

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 64

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 | -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |       526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant
variance has to be pronounced in order to
make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use
robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel or
time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bgodfrey option (estat bgodfrey) for the
Breusch-Godfrey Lagrange Multiplier test.
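
 A minimal sketch, assuming time-series data with a
time variable; the names year, y, x1, & x2 are
hypothetical:

. * illustrative only: year, y, x1 & x2 are hypothetical
. tsset year
. reg y x1 x2
. estat bgodfrey, lags(1 2)

 estat bgodfrey tests the null hypothesis of no serial
correlation in the residuals up to the specified lags.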

 This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of a practically insignificant
problem with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the
X-axis falls from the mean for x.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic.
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
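
 A minimal sketch of flagging such outliers (rstu
abbreviates predict’s rstudent option; the predict
step also appears in the walkthrough below):

. predict rstu if e(sample), rstu
. list id rstu if abs(rstu) > 3 & rstu < .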

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
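
 A minimal sketch of flagging large hat values by this
rule, taking k = 11 coefficients (including the
constant) & n = 526 for the current model:

. predict h if e(sample), hat
. * 3*k/n cutoff with k=11, n=526
. list id h if h > 3*11/526 & h < .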

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
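
 A minimal sketch using the cutoffs just described,
with n = 526 (the variable names d & dfits are
illustrative):

. predict d if e(sample), cooksd
. predict dfits if e(sample), dfits
. * flag d above 4/n, or |DFITS| of 1 or more
. list id d dfits if (d > 4/526 | abs(dfits) > 1) & d < .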

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of observations on particular
slope coefficients.
 A DFBETA of 1 means that the observation shifts
the corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more, or of at least 2
divided by the square root of n (in large samples),
represent influential outliers.
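
 A minimal sketch—dfbeta generates one DF_ variable
per predictor (following the older Stata naming this
document uses), & we flag the size-adjusted cutoff of
2 divided by the square root of n:

. dfbeta
. * 2/sqrt(n) cutoff with n=526
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(526) & DF_Itencat_3 < .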

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[lvr2plot output: leverage (y-axis, 0 to .06) vs. normalized residual squared (x-axis, 0 to .04), observations labeled by id]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[avplots output: added-variable plots of e(lwage | X) against each residualized predictor. Panel estimates:
  e(hsc | X):        coef = .22241643,  se = .04881795, t = 4.56
  e(scol | X):       coef = .32030543,  se = .05467635, t = 5.86
  e(ccol | X):       coef = .68798333,  se = .05753517, t = 11.96
  e(exper | X):      coef = .02854957,  se = .00503299, t = 5.67
  e(exper2 | X):     coef = -.00057702, se = .00010737, t = -5.37
  e(_Itencat_1 | X): coef = -.0027571,  se = .0491805,  t = -.06
  e(_Itencat_2 | X): coef = .22096745,  se = .04916835, t = 4.49
  e(_Itencat_3 | X): coef = .28798112,  se = .0557211,  t = 5.17
  e(female | X):     coef = -.29395956, se = .03576118, t = -8.22
  e(nonwhite | X):   coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[avplot output: e(lwage | X) vs. e(_Itencat_3 | X) (x-axis -1 to 1), observations labeled by id; id=24 sits at the bottom of the plot]
coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter output: Cook’s Distance (d, y-axis 0 to .03) plotted against id (x-axis 0 to 500); id=24 has the largest value]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA-ATS
web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
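
 A minimal sketch of these alternatives on the
current model (illustrative only; it assumes the
xi-generated _Itencat_ dummies are still in memory):

. * qreg with its defaults fits median regression
. qreg lwage hs scol ccol exper exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. * rreg iteratively downweights influential observations
. rreg lwage hs scol ccol exper exper2 _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite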

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
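
 A minimal sketch, assuming hadimvo is available in
your Stata version (it’s an older command, dropped
from recent releases) & using an illustrative variable
list & significance level:

. * variable list & p() level are illustrative
. hadimvo wage educ exper, gen(outlier) p(.05)
. list id wage educ exper if outlier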

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect roughly
5% of the observations to fall beyond ±2
standard deviations by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 67

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |      chi2   df         p
-------------+--------------------------------
         hsc |      0.14    1    1.0000 #
        scol |      0.17    1    1.0000 #
        ccol |      2.47    1    0.7078 #
       exper |      1.56    1    0.9077 #
      exper2 |      0.10    1    1.0000 #
  _Itencat_1 |      0.44    1    0.9992 #
  _Itencat_2 |      0.00    1    1.0000 #
  _Itencat_3 |     10.03    1    0.0153 #
      female |      1.02    1    0.9762 #
    nonwhite |      0.03    1    1.0000 #
-------------+--------------------------------
simultaneous |     23.68   10    0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

Szroeter's test for homoskedasticity

Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |      chi2   df        p
-------------+-------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      3.20    1   0.5341 #
      exper2 |      3.20    1   0.5341 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot, markers labeled by id; x-axis: fitted values (1 to 2.5); y-axis: residuals (-2 to 1); reference line at y = 0. id=24 sits alone at the bottom.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot for _Itencat_3 (tencat==3), markers labeled by id; x-axis: 0 to 1; y-axis: residuals (-2 to 1); reference line at y = 0. id=24 again appears at the bottom.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to non-constant
variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
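
 A minimal sketch of the WLS idea—the weight here (inverse
squared fitted values) is purely an illustrative assumption,
not a recipe from these slides:

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. predict yhat if e(sample), xb
. gen w = 1/yhat^2
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite [aweight=w]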

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |   .22241643***    .22241643***
        scol |   .32030543***    .32030543***
        ccol |   .68798333***    .68798333***
       exper |   .02854957***    .02854957***
      exper2 |  -.00057702***   -.00057702***
  _Itencat_1 |   -.0027571       -.0027571
  _Itencat_2 |   .22096745***    .22096745***
  _Itencat_3 |   .28798112***    .28798112***
      female |  -.29395956***   -.29395956***
    nonwhite |  -.06409284      -.06409284
       _cons |   1.1567164***    1.1567164***
-------------+--------------------------------
           N |         526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant
variance has to be pronounced in order
to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors
so that we can explore additional diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel or time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey (the Breusch-Godfrey
Lagrange Multiplier test).
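
 A minimal sketch for the time-series case, with hypothetical
variables year, y, & x:

. tsset year
. reg y x
. estat bgodfrey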

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the
X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
on their own—without leverage—they don’t influence
the slope coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more—where k is the number of coefficients
(including the constant) & n the sample size—are
relatively large & indicate potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of an observation’s influence on the slope
coefficients.
 A DFBETA of 1 means that omitting the observation
would shift the corresponding slope coefficient by 1
standard error: DFBETAs of 1 or more, or of at least
2/sqrt(n) (in large samples), represent influential
outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot, markers labeled by id; x-axis: normalized residual squared (0 to .04); y-axis: leverage (0 to .06). id=24 has the largest normalized residual; id=465 the highest leverage; no point is extreme on both.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[Added-variable plots for all predictors of lwage. Panel annotations: hsc coef = .22241643, se = .04881795, t = 4.56; scol coef = .32030543, se = .05467635, t = 5.86; ccol coef = .68798333, se = .05753517, t = 11.96; exper coef = .02854957, se = .00503299, t = 5.67; exper2 coef = -.00057702, se = .00010737, t = -5.37; _Itencat_1 coef = -.0027571, se = .0491805, t = -.06; _Itencat_2 coef = .22096745, se = .04916835, t = 4.49; _Itencat_3 coef = .28798112, se = .0557211, t = 5.17; female coef = -.29395956, se = .03576118, t = -8.22; nonwhite coef = -.06409284, se = .05772311, t = -1.11.]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3, markers labeled by id; x-axis: e( _Itencat_3 | X ) (-1 to 1); y-axis: e( lwage | X ) (-2 to 1); coef = .28798112, se = .0557211, t = 5.17. id=24 appears at the bottom.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
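
 For instance, a minimal sketch that flags cases by the rough
cutoffs given earlier—|rstu| of 3+, hat values of 3*k/n+
(assuming k=11 coefficients, counting the constant, & n=526),
& Cook’s Distance of 4/n+:

. list id rstu if abs(rstu) >= 3 & rstu < .
. list id h if h >= 3*11/526 & h < .
. list id d if d >= 4/526 & d < .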

. scatter d id

[Scatterplot of Cook's Distance (d, y-axis: 0 to .03) against id (x-axis: 0 to 526), markers labeled by id; id=24 has by far the largest value (about .03), with id=150, 128, 59, & 58 next (about .02).]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
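
 A minimal sketch, re-fitting the current model by median
regression (qreg defaults to the .5 quantile) & by robust
regression:

. xi:qreg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hsc scol ccol exper exper2 i.tencat female nonwhite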

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
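
 A minimal sketch—the syntax is as in the older Stata
releases where hadimvo was an official command, & the dummy
name odd is arbitrary:

. hadimvo wage educ exper tenure, gen(odd)
. list id wage educ exper tenure if odd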

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 68

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

[avplot output: e( lwage | X ) against e( _Itencat_3 | X ), every point labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 again sits at the bottom of the residual scale.]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
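 A one-line sketch (not in the slides) to list the
observations beyond the usual ±3 cutoff for studentized
residuals:

. list id rstu if abs(rstu) > 3 & !missing(rstu)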
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter output: Cook's Distance d (0 to about .03) plotted against id (0 to 500+); id=24 shows the largest value, near .03, with ids 150, 128, 59, & 58 next at roughly .02.]

 Note id=24.
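 Beyond eyeballing the plot, a minimal sketch (not in the
slides) that flags observations above the 4/n rule of thumb
for Cook's Distance:

* hedged sketch: n taken from the count of nonmissing d values
. quietly count if !missing(d)
. list id rstu h d if d > 4/r(N) & !missing(d)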

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
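 A minimal command sketch of both options on the working
specification (illustrative only; mirrors the slides' xi:
syntax; qreg's default quantile is the median):

. xi: qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi: rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite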

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
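 A hedged sketch of hadimvo usage (syntax as documented in
older Stata releases; gen() creates a 0/1 flag marking the
multivariate outliers, with 'mvout' an illustrative name):

. hadimvo wage exper tenure, gen(mvout)
. list id wage exper tenure if mvout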

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect about
5% of observations to be ‘outliers’ (i.e. to fall
beyond ±2 standard deviations) by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 69

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 70

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot, markers labeled by id: residuals (y-axis, -2 to 1) against fitted values (x-axis, 1 to 2.5). Most observations band around the zero line, but a run of low-end outliers (id=58, 203, 381, 12, 128 &, at the extreme bottom, 24) sits below -1.]

 Problems of heteroscedasticity?

 So, even though the model passed
linktest & estat ovtest, at least one
basic problem remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators. (A
quick scan is sketched after this list.)
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.
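
 For sign (1), one quick way to scan the bivariate correlations among this data set’s continuous predictors (a sketch only, not from the slides):

. pwcorr educ exper tenure, sig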

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’):
‘VIF’>10 measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’):
‘Tolerance’<.10, the reciprocal of VIF, measures the extent of
variance that is independent of the other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’): ‘Condition Number’>15 or especially >30.

. vif

    Variable |    VIF       1/VIF
-------------+----------------------
       exper |   15.62    0.064013
      exper2 |   14.65    0.068269
         hsc |    1.88    0.532923
  _Itencat_3 |    1.83    0.545275
        ccol |    1.70    0.589431
        scol |    1.69    0.591201
  _Itencat_2 |    1.50    0.666210
  _Itencat_1 |    1.38    0.726064
      female |    1.07    0.934088
    nonwhite |    1.03    0.971242
-------------+----------------------
    Mean VIF |    4.23

 The seemingly troublesome scores for exper &
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.
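
 As a worked illustration of indicator (6): the square root of exper’s VIF is sqrt(15.62) ≈ 3.95, i.e. exper’s standard error is roughly 4 times what it would be if exper were uncorrelated with the other predictors.

. di sqrt(15.62)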

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this? One
sketch follows this list.

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
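
 One way to build the index flagged above (a sketch only; zeduc, zexper & skill_index are hypothetical names, & the grouping would need a substantive rationale): standardize the correlated predictors & average them.

. egen zeduc = std(educ)
. egen zexper = std(exper)
. gen skill_index = (zeduc + zexper)/2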

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations even if the residuals
aren’t normal.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
. predict e if e(sample), resid
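
 Sketches of the other plots just mentioned, once rstu is in hand (standard commands, not shown in the slides):

. kdensity rstu, normal
. qnorm rstu
. gr box rstu, marker(1, mlab(id))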

. hist rstu, norm

[Histogram of the studentized residuals with a normal-density overlay: density (y-axis, 0 to .5) against studentized residuals (x-axis, -4 to 4).]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:

Cameron & Trivedi's decomposition of IM-test

              Source |   chi2     df        p
---------------------+------------------------
  Heteroskedasticity |  21.27     13    0.0677
            Skewness |   4.25      4    0.3733
            Kurtosis |   2.47      1    0.1160
---------------------+------------------------
               Total |  27.99     18    0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

        chi2(1)      =  15.76
        Prob > chi2  =  0.0001

 There seem to be problems. Let’s inspect the
individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |   chi2   df        p
-------------+--------------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   1.56    1   0.9077 #
      exper2 |   0.10    1   1.0000 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
-------------+--------------------------------
simultaneous |  23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
----------------------------------------------
    Variable |   chi2   df        p
-------------+--------------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   3.20    1   0.5341 #
      exper2 |   3.20    1   0.5341 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
----------------------------------------------
# Sidak adjusted p-values

 Both hettest & szroeter say that the serious
problem is with tenure.

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot for the categorized-predictor model, markers labeled by id: residuals (y-axis, -2 to 1) against fitted values (x-axis, 1 to 2.5); the same low-end outliers (58, 203, 381, 260, 12, 128 &, at the bottom, 24) stand apart below -1.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot, markers labeled by id: residuals (y-axis, -2 to 1) against tencat==3 (x-axis, 0 to 1); id=24 is again the extreme low point, with 282, 188, 457, 150, 448, 212 & 128 nearby.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted variables
is a principal response to non-constant
variance. I’m guessing that including the
variable age, which the data set doesn’t have,
would either solve or reduce the problem. Why?

 Maybe the small # of observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—although
at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression (a
sketch follows this list).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
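
 To make the weighted-least-squares option concrete, here is a minimal sketch of one common feasible-GLS recipe (not from the slides; ehat, yhat, lehat2, ghat & w are hypothetical variable names): model the log of the squared residuals on the fitted values, then reweight by the inverse of the estimated variance.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. predict ehat, resid
. predict yhat, xb
. gen lehat2 = ln(ehat^2)
. reg lehat2 yhat
. predict ghat, xb
. gen w = 1/exp(ghat)
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=w]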

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |  .22241643***   .22241643***
        scol |  .32030543***   .32030543***
        ccol |  .68798333***   .68798333***
       exper |  .02854957***   .02854957***
      exper2 | -.00057702***  -.00057702***
  _Itencat_1 |  -.0027571      -.0027571
  _Itencat_2 |  .22096745***   .22096745***
  _Itencat_3 |  .28798112***   .28798112***
      female | -.29395956***  -.29395956***
    nonwhite | -.06409284     -.06409284
       _cons |  1.1567164***   1.1567164***
-------------+--------------------------------
           N |        526            526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant
variance has to be pronounced in order to
make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors so that we can explore additional
diagnostics.
 Our final model, however, will use robust
standard errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel or
time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command for the Breusch-Godfrey
Lagrange multiplier test.
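
A minimal sketch, assuming the data had a time variable to declare (year is hypothetical here; our cross-sectional data have none):

. tsset year
. reg lwage hs scol ccol exper exper2 female nonwhite
. estat bgodfrey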

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the
X-axis falls from the mean for x.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
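
 As a worked example, assuming k counts the model’s coefficients including the constant (k = 11 here) & n = 526: hat values of 3*11/526 ≈ .063 or more would flag potential influence.

. di 3*11/526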

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 shifts the
corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more, or of at least 2
divided by the square root of n (in large samples),
represent influential outliers.
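
 For these data the large-sample DFBETA cutoff works out to 2/sqrt(526) ≈ .087.

. di 2/sqrt(526)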

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot, points labeled by id: leverage (y-axis, roughly .01 to .06, with id=465 & 306 highest) against normalized residual squared (x-axis, 0 to .04, with id=24 & 128 farthest right).]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots

[Added-variable plots of e(lwage | X) against e(x | X) for each predictor: look for simultaneous x/y extremes. The partial slopes shown are: hsc coef = .22241643, se = .04881795, t = 4.56; scol coef = .32030543, se = .05467635, t = 5.86; ccol coef = .68798333, se = .05753517, t = 11.96; exper coef = .02854957, se = .00503299, t = 5.67; exper2 coef = -.00057702, se = .00010737, t = -5.37; _Itencat_1 coef = -.0027571, se = .0491805, t = -.06; _Itencat_2 coef = .22096745, se = .04916835, t = 4.49; _Itencat_3 coef = .28798112, se = .0557211, t = 5.17; female coef = -.29395956, se = .03576118, t = -8.22; nonwhite coef = -.06409284, se = .05772311, t = -1.11.]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3, markers labeled by id: e(lwage | X) (y-axis, -2 to 1) against e(_Itencat_3 | X) (x-axis, -1 to 1); coef = .28798112, se = .0557211, t = 5.17. id=24 sits at the extreme low end of the residuals.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook's Distance d (y-axis, 0 to .03) against id (x-axis, 0 to 500): id=24 is highest at about .03; id=150, 128, 59, 58, 282, 212, 105, 260, 382, 381 & 487 fall between roughly .01 & .02; the rest cluster near 0. Note id=24.]

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors, if possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments (e.g., adding
omitted variables, interactions, log or other
transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA
ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
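
 A minimal sketch of the median-regression option, reusing the final model’s predictors (qreg fits the median by default):

. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite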

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
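
 A sketch of hadimvo, assuming the syntax of the older Stata releases that shipped it (the flag variable outlier & the 5% cutoff are illustrative choices):

. hadimvo wage educ exper tenure, gen(outlier) p(.05)
. list id wage educ exper tenure if outlier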

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 71

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis
falls from the mean for x.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
without leverage they have little influence on the
slope coefficients.
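 In Stata such cases can be flagged like this (a sketch; rstu is just an illustrative variable name):

. predict rstu if e(sample), rstudent
. list id rstu if abs(rstu) >= 3 & rstu < .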

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
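 For example (a sketch—here counting k as the number of estimated coefficients, e(df_m)+1 after regress):

. predict h if e(sample), hat
. list id h if h >= 3*(e(df_m)+1)/e(N) & h < .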

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
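 Applied to Cook’s Distance with the large-sample cutoff (a sketch):

. predict d if e(sample), cooksd
. list id d if d >= 4/e(N) & d < .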

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 shifts the
corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more, or of at least
2/sqrt(n) (in large samples), represent
influential outliers.
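 In Stata (a sketch, using the DF_ variable names the dfbeta command produces here & the large-sample cutoff):

. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) >= 2/sqrt(e(N)) & DF_Itencat_3 < .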

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[lvr2plot output: leverage (y-axis, 0 to .06) vs. normalized residual squared (x-axis, 0 to .04), with ids as marker labels. Ids 465 & 306 have the highest leverage; ids 24 & 128 have the largest normalized squared residuals.]

 There are signs of a high residual point (id=24) & high
leverage points (ids 465, 306), but, given that no points
appear in the top right-hand area, no observations appear
to be influential.

. avplots: look for simultaneous x/y extremes.
[avplots output: added-variable plots of e(lwage | X) against each predictor’s residualized values—hsc (coef = .22241643, se = .04881795, t = 4.56), scol (coef = .32030543, se = .05467635, t = 5.86), ccol (coef = .68798333, se = .05753517, t = 11.96), exper (coef = .02854957, se = .00503299, t = 5.67), exper2 (coef = -.00057702, se = .00010737, t = -5.37), _Itencat_1 (coef = -.0027571, se = .0491805, t = -.06), _Itencat_2 (coef = .22096745, se = .04916835, t = 4.49), _Itencat_3 (coef = .28798112, se = .0557211, t = 5.17), female (coef = -.29395956, se = .03576118, t = -8.22), & nonwhite (coef = -.06409284, se = .05772311, t = -1.11).]

. avplot _Itencat_3, ml(id)

[avplot output: e(lwage | X) vs. e(_Itencat_3 | X) with ids as marker labels; ids 24, 128, & 466 sit at the bottom of the plot. coef = .28798112, se = .0557211, t = 5.17]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter of Cook’s Distance d (y-axis, 0 to .03) against id (x-axis, 0 to 500): id=24 has by far the largest value (≈.03); the next largest (ids 150, 128, 59, 58, 282, 212, 105, 260, 382, 381, 487) are .02 or below—all well under the cutoffs.]
 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA-ATS
web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
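 For instance (a sketch—qreg fits median regression by default):

. qreg lwage hs scol ccol exper exper2 _Itencat_* female nonwhite
. rreg lwage hs scol ccol exper exper2 _Itencat_* female nonwhite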

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
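 A sketch of its use (hadimvo is an add-on command, & the gen() & p() options shown are from memory—check its help file):

. hadimvo wage educ exper tenure, gen(outlier) p(.05)
. list id wage educ exper tenure if outlier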

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect about
5% of observations to fall beyond ±2
standard deviations by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 72

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 73

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the outcome variable needs to be redefined to properly account for the y/x relationships (e.g., from ‘miles per gallon’ to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.
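
 As a hedged illustration of these remedies, using Stata’s shipped auto data rather than WAGE1 (so the variable names here are illustrative only):

. sysuse auto, clear
. gen gpm = 1/mpg         // redefine ‘miles per gallon’ as ‘gallons per mile’
. reg gpm weight foreign  // OLS on the redefined outcome
. qreg price weight       // median (quantile) regression instead of OLS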

 Let’s begin by exploring the variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t suggest a transformation, but log wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know whether its wage is an error, so we’ll keep an eye on id=24 for possible problems.
 Let’s examine the independent variables:

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. ladder exper
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitci lwage tenure
. twoway qfitci lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In STATA we use ‘estat ovtest’ (also known as the regression specification test, RESET) to indicate whether there are important omitted variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.
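
 Before running it, here is a hand-made sketch of RESET’s logic (the built-in ‘estat ovtest’ is the authoritative version; this only illustrates the idea of testing powers of the fitted values):

. reg wage educ exper tenure female nonwhite
. predict yhat, xb
. gen yhat2=yhat^2
. gen yhat3=yhat^3
. gen yhat4=yhat^4
. reg wage educ exper tenure female nonwhite yhat2 yhat3 yhat4
. test yhat2 yhat3 yhat4   // a significant joint F-test signals misspecification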

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of wage
Ho: model has no omitted variables
       F(3, 517) =   9.37
       Prob > F  = 0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest

Ramsey RESET test using powers of the fitted values of lwage
Ho: model has no omitted variables
       F(3, 515) =   2.11
       Prob > F  = 0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional specification is via ‘linktest’: it tests whether y is properly specified or not.
 linktest’s ‘_hatsq’ must test insignificant: we want to fail to reject the null hypothesis that y is specified correctly.
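
 The same idea, sketched by hand (the built-in ‘linktest’ is the authoritative version):

. reg lwage educ educ2 exper exper2 ltenure female nonwhite
. predict hat, xb
. gen hatsq=hat^2
. reg lwage hat hatsq   // a significant hatsq signals that y is misspecified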

. linktest

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest   // Prob > F = .12
. linktest       // _hatsq p = .15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s residuals versus fitted values to obtain a graphic perspective on the model’s fit & problems.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot: residuals (about -2 to 1) against fitted values (about 1 to 2.5), each observation labeled by id; id=24 sits at the extreme low end.]

 Problems of heteroscedasticity?

 So, even though the model passed
linktest & estat ovtest, at least one
basic problem remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) – ‘VIF’>10: measures inflation in variance due to multicollinearity (a hand-computed sketch follows this list).
 Square root of VIF: shows the amount of increase in an explanatory variable’s standard error due to multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’<.10: reciprocal of VIF; measures the extent of variance that is independent of the other explanatory variables.
(8) Pre-model estimation (downloadable STATA command ‘collin’) – ‘Condition Number’>15 or especially >30.
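
 To see what VIF & tolerance mean, a hand-computed sketch for one predictor: regress it on the other explanatory variables; tolerance = 1 - R-squared, & VIF is its reciprocal.

. quietly reg exper educ tenure female nonwhite
. di "tolerance = " 1-e(r2) "   VIF = " 1/(1-e(r2))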

. vif

    Variable |    VIF       1/VIF
-------------+------------------------
       exper |   15.62    0.064013
      exper2 |   14.65    0.068269
         hsc |    1.88    0.532923
  _Itencat_3 |    1.83    0.545275
        ccol |    1.70    0.589431
        scol |    1.69    0.591201
  _Itencat_2 |    1.50    0.666210
  _Itencat_1 |    1.38    0.726064
      female |    1.07    0.934088
    nonwhite |    1.03    0.971242
-------------+------------------------
    Mean VIF |    4.23

 The seemingly troublesome scores for exper & exper2 are an artifact of the quadratic form & pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?

 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see Mendenhall/Sincich), perhaps using STATA’s ‘center’ command (a sketch follows this list).

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
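
 A minimal sketch of the centering remedy, in case the downloadable ‘center’ command isn’t installed:

. su exper, meanonly
. gen experc=exper-r(mean)   // centered predictor
. gen experc2=experc^2       // quadratic term, now less collinear with experc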

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in general: if the sample is as large as 100-200, the central limit theorem says that confidence intervals & p-values will be trustworthy approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid
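
 A few other quick looks at the residuals (a sketch using the rstu variable created above):

. kdensity rstu, normal   // kernel density with normal overlay
. qnorm rstu              // normal quantile plot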

. hist rstu, norm

[Histogram of studentized residuals (x-axis -4 to 4) with a normal-density overlay; y-axis: density, 0 to .5.]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

              Source |  chi2   df        p
---------------------+----------------------
  Heteroskedasticity | 21.27   13   0.0677
            Skewness |  4.25    4   0.3733
            Kurtosis |  2.47    1   0.1160
---------------------+----------------------
               Total | 27.99   18   0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

       chi2(1)     =  15.76
       Prob > chi2 =  0.0001
 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |   chi2   df        p
-------------+--------------------------------
         hsc |   0.14    1    1.0000 #
        scol |   0.17    1    1.0000 #
        ccol |   2.47    1    0.7078 #
       exper |   1.56    1    0.9077 #
      exper2 |   0.10    1    1.0000 #
  _Itencat_1 |   0.44    1    0.9992 #
  _Itencat_2 |   0.00    1    1.0000 #
  _Itencat_3 |  10.03    1    0.0153 #
      female |   1.02    1    0.9762 #
    nonwhite |   0.03    1    1.0000 #
-------------+--------------------------------
simultaneous |  23.68   10    0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious problem is with tenure.

. szroeter, rhs mt(sidak)

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |   chi2   df        p
-------------+-------------------------
         hsc |   0.14    1    1.0000 #
        scol |   0.17    1    1.0000 #
        ccol |   2.47    1    0.7078 #
       exper |   3.20    1    0.5341 #
      exper2 |   3.20    1    0.5341 #
  _Itencat_1 |   0.44    1    0.9992 #
  _Itencat_2 |   0.00    1    1.0000 #
  _Itencat_3 |  10.03    1    0.0153 #
      female |   1.02    1    0.9762 #
    nonwhite |   0.03    1    1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted plot: residuals (about -2 to 1) against fitted values of lwage (about 1 to 2.5), labeled by id; id=24 is again the extreme low outlier.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot: residuals against tencat==3 (x-axis 0 to 1), labeled by id.]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I’m guessing that including the variable age, which the data set doesn’t have, would either solve or reduce the problem. Why?

 Maybe the small # of observations for the high end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor (multi-level or binary) may work—although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to use weighted least squares regression (see the sketch after this list).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
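
 Here is a hedged sketch of the weighted least squares option mentioned above, using one standard FGLS recipe (model the log of the squared residuals, then weight by the inverse of the fitted variance); it isn’t the only way to choose weights:

. quietly xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. predict ehat, resid
. gen lne2=ln(ehat^2)
. quietly xi:reg lne2 hs scol ccol exper exper2 i.tencat female nonwhite
. predict lnvar, xb
. gen w=1/exp(lnvar)
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=w]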

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

------------------------------------------------
    Variable |      m1            m2_robust
-------------+----------------------------------
         hsc |   .22241643***    .22241643***
        scol |   .32030543***    .32030543***
        ccol |   .68798333***    .68798333***
       exper |   .02854957***    .02854957***
      exper2 |  -.00057702***   -.00057702***
  _Itencat_1 |   -.0027571       -.0027571
  _Itencat_2 |   .22096745***    .22096745***
  _Itencat_3 |   .28798112***    .28798112***
      female |  -.29395956***   -.29395956***
    nonwhite |  -.06409284      -.06409284
       _cons |   1.1567164***    1.1567164***
-------------+----------------------------------
           N |         526             526
------------------------------------------------
          legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors so that we can explore additional diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need to worry about correlated errors: the sample is neither a cluster sample nor panel/time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s ‘estat bgodfrey’ (the Breusch-Godfrey Lagrange multiplier test).
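
 A sketch for the time-series case (the dataset & variables here are hypothetical):

. tsset year
. reg y x1 x2
. estat bgodfrey, lags(1 2)   // Breusch-Godfrey LM test for serial correlation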

 This model seems to be satisfactory from the perspective of linear regression’s assumptions, with the exception of a practically insignificant problem with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis falls from the mean for x.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 A DFBETA of 1 means the observation shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2/sqrt(n) (in large samples), represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot: leverage (y-axis, about 0 to .06) against normalized residual squared (x-axis, about 0 to .04), labeled by id; id=24 stands out on the residual axis.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots   // look for simultaneous x/y extremes

[Added-variable plots of e(lwage|X) against e(x|X) for each predictor. Estimated partial coefficients:
  hsc:        coef =  .22241643, se = .04881795, t =  4.56
  scol:       coef =  .32030543, se = .05467635, t =  5.86
  ccol:       coef =  .68798333, se = .05753517, t = 11.96
  exper:      coef =  .02854957, se = .00503299, t =  5.67
  exper2:     coef = -.00057702, se = .00010737, t = -5.37
  _Itencat_1: coef =  -.0027571, se =  .0491805, t =  -.06
  _Itencat_2: coef =  .22096745, se = .04916835, t =  4.49
  _Itencat_3: coef =  .28798112, se =  .0557211, t =  5.17
  female:     coef = -.29395956, se = .03576118, t = -8.22
  nonwhite:   coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3: e(lwage|X) against e(_Itencat_3|X), labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 appears at the low extreme.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
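
 A quick sketch for flagging cases against the rule-of-thumb cutoffs (here k=11 estimated coefficients & n=526, so hat values are checked against 3*k/n & Cook’s Distance against 4/n):

. count if abs(rstu)>3 & rstu<.
. count if h>3*11/526 & h<.
. count if d>4/526 & d<.
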
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook's Distance d (y-axis, 0 to .03) against id (x-axis, 0 to 500); id=24 has the largest distance, with id=150 & id=128 next.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
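
 A sketch of its use (hadimvo is an older, downloadable command, so treat this syntax as assumed rather than guaranteed):

. hadimvo wage educ exper, gen(out) p(.05)
. list id wage educ exper if out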

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 74

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residuals-vs-fitted plot: residuals (roughly -2 to +1) against fitted values (roughly 1 to 2.5), markers labeled by id; a handful of large negative residuals stand out at the bottom, with id=24 the most extreme.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residuals plotted against _Itencat_3 (tencat==3, 0 to 1), markers labeled by id; ids 282, 188, 457, 150, 448, 212, 128 & 24 form the low tail.]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I’m guessing that including the variable age, which the data set doesn’t have, would either solve or reduce the problem. Why?

 Maybe the small # of observations for the high end of tenure matters as well.
 Other options (see the sketch below):

 Try interactions &/or transformations based on qladder & ladder.
 Categorizing a continuous predictor (multi-level or binary) may work, although at the cost of lost information.
 We also could transform the outcome variable (see qladder, ladder, etc.), though not in this example because we had good reason for creating log(wage).
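
A minimal sketch of the first two options, assuming WAGE1 is loaded; the cutpoints for the categorized copy of tenure are illustrative, not taken from these slides:

* explore power transformations of tenure
qladder tenure     // grid of quantile-normal plots across the ladder of powers
ladder tenure      // chi-squared search for the best power

* categorize a continuous predictor (hypothetical cutpoints)
recode tenure (0/1 = 0) (2/4 = 1) (5/14 = 2) (15/max = 3), gen(tencat4)
xi: reg lwage hsc scol ccol exper exper2 i.tencat4 female nonwhite

* or try an interaction, e.g. female x tenure
gen femXten = female*tenure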

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors so that we can explore additional diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need to worry about correlated errors: the sample is neither a cluster sample nor panel or time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s estat bgodfrey for the Breusch-Godfrey Lagrange multiplier test.
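
A minimal sketch of options (2) & (3); the cluster variable district & the time variable t are hypothetical, since WAGE1 has neither a cluster nor a time dimension:

* (2) cluster sample: cluster-robust standard errors
xi: reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, ///
    cluster(district)

* (3) time series: declare the time variable, then test for serial correlation
tsset t
reg y x1 x2       // y, x1, x2 stand in for time-series variables
estat bgodfrey    // Breusch-Godfrey LM test for autocorrelation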

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope coefficients depends on its discrepancy & leverage:
discrepancy: how far the observation on the Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis falls from the mean for x.

discrepancy × leverage = influence: an observation must have both to sway the coefficients.
 Highly influential observations are most likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant, reduce its fit, & increase its standard errors, but without leverage they don’t influence the slope coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 A DFBETA of 1 means the observation shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2/√n, i.e. 2 divided by the square root of n (in large samples), represent influential outliers.
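
In symbols, these measures & cutoffs look as follows (standard textbook forms, not taken from these slides; e_i is the residual, h_i the hat value, s_(i) the residual SD with case i dropped, k the number of model parameters, n the sample size):

\[
\begin{aligned}
r_i &= \frac{e_i}{s_{(i)}\sqrt{1-h_i}} && \text{studentized residual; flag } |r_i| \ge 3 \\
h_i &= x_i'(X'X)^{-1}x_i && \text{hat value; flag } h_i \ge 3k/n \\
D_i &= \frac{r_{si}^2}{k}\cdot\frac{h_i}{1-h_i} && \text{Cook's Distance, with } r_{si} \text{ the standardized residual; flag } D_i \ge 1 \text{ or } 4/n \\
\mathrm{DFITS}_i &= r_i\sqrt{\frac{h_i}{1-h_i}} && \text{influence on overall fit} \\
\mathrm{DFBETA}_{ij} &= \frac{b_j - b_{j(i)}}{\mathrm{se}\,(b_{j(i)})} && \text{flag } |\mathrm{DFBETA}| \ge 1 \text{ or } 2/\sqrt{n}
\end{aligned}
\]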

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[Leverage-vs-squared-residual plot: leverage (0 to about .06) against normalized residual squared (0 to about .04), markers labeled by id; ids 465 & 306 sit highest on the leverage axis, while ids 128 & 24 sit far right on the residual axis.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[Added-variable plots of e(lwage | X) against e(x | X) for each predictor:
       hsc: coef =  .22241643, se = .04881795, t =  4.56
      scol: coef =  .32030543, se = .05467635, t =  5.86
      ccol: coef =  .68798333, se = .05753517, t = 11.96
     exper: coef =  .02854957, se = .00503299, t =  5.67
    exper2: coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571,  se = .0491805,  t =  -.06
_Itencat_2: coef =  .22096745, se = .04916835, t =  4.49
_Itencat_3: coef =  .28798112, se = .0557211,  t =  5.17
    female: coef = -.29395956, se = .03576118, t = -8.22
  nonwhite: coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Added-variable plot of e(lwage | X) against e(_Itencat_3 | X), markers labeled by id: coef = .28798112, se = .0557211, t = 5.17; id=24 is again the extreme low residual.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook’s Distance d (0 to about .03) against id (1 to 526); id=24 has by far the largest distance, about .03, with ids 150, 128, 59, 58, 282 & 212 next at around .02.]

 Note id=24.
. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.
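
A minimal sketch of the two alternatives with this example’s variables; the genwt() weight variable w is just an illustrative name:

* median (quantile) regression: resistant to outliers in y
xi: qreg lwage hsc scol ccol exper exper2 i.tencat female nonwhite

* robust regression: iteratively downweights high-influence cases
xi: rreg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, genwt(w)
list id w if w < .5 & w < .    // cases that were heavily downweighted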

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
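
A hedged sketch of hadimvo; the syntax below follows the older Stata releases that shipped it, so treat the details as an assumption:

* flag joint (multivariate) outliers at the 5% significance level
hadimvo wage educ exper tenure, gen(badobs) p(.05)
tab badobs
list id wage educ exper tenure if badobs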

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect 5% of the observations to be outliers (roughly, beyond ±2 standard deviations).
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 75

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[Scatterplot of Cook’s Distance: d (y-axis, 0 to .03) against id (x-axis, 0 to 500), points labeled by id. id=24 stands out with the largest d, at about .03; the next-largest cases (e.g., 150, 128, 59, 58) sit near .02.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments (e.g., adding
omitted variables, interactions, log or other
transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg), as sketched below. See the Stata manual;
Hamilton, Statistics with Stata; & Chen et al.,
Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
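
 A minimal sketch of those alternatives, refitting
our final specification (qreg estimates the median
by default; the stored names m_qreg & m_rreg are
just illustrative):

* median regression & robust regression with the same predictors
. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m_qreg
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m_rreg
* compare the two sets of coefficients side by side
. est table m_qreg m_rreg, star stats(N)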

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine whether observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
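
 A minimal sketch, assuming hadimvo is installed
(findit hadimvo) & follows its classic syntax, in
which gen() creates a 0/1 flag marking multivariate
outliers at the default significance level:

* flag joint outliers on wage, educ, & tenure, then inspect them
. hadimvo wage educ tenure, gen(mvout)
. list id wage educ tenure if mvout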

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect about
5% of observations to fall more than two
standard deviations from the mean by chance
alone.
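
 A quick check of that figure, using normal(),
Stata’s standard-normal CDF:

* two-tailed area beyond +/-1.96 standard deviations (about .05)
. display 2*normal(-1.96)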
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 76

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 77

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) – ‘VIF’>10: measures
inflation in variance due to multicollinearity.
 Square root of VIF: shows the amount of increase in an explanatory
variable’s standard error due to multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’<.10:
reciprocal of VIF; measures the extent of variance that is independent of
the other explanatory variables.
(8) Pre-model estimation (downloadable STATA command ‘collin’) –
‘Condition Number’>15 or especially >30.

. vif

    Variable |   VIF    1/VIF
-------------+-----------------
       exper | 15.62   0.064013
      exper2 | 14.65   0.068269
         hsc |  1.88   0.532923
  _Itencat_3 |  1.83   0.545275
        ccol |  1.70   0.589431
        scol |  1.69   0.591201
  _Itencat_2 |  1.50   0.666210
  _Itencat_1 |  1.38   0.726064
      female |  1.07   0.934088
    nonwhite |  1.03   0.971242
-------------+-----------------
    Mean VIF |  4.23

 The seemingly troublesome scores for exper
& exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command (a hand-rolled sketch follows this list).

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
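
A minimal, hand-rolled sketch of centering (the user-written ‘center’
command automates this); exper & its square serve purely as the
illustration:

. quietly summarize exper
. generate exper_c = exper - r(mean)
. generate exper_c2 = exper_c^2

Refitting the model with exper_c & exper_c2 typically shrinks the
exper/exper2 VIFs while leaving the fitted values unchanged.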

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

. hist rstu, norm

[Figure: histogram of studentized residuals (x-axis: Studentized residuals, y-axis: Density), with a normal curve overlaid]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

              Source |  chi2   df      p
---------------------+--------------------
  Heteroskedasticity | 21.27   13   0.0677
            Skewness |  4.25    4   0.3733
            Kurtosis |  2.47    1   0.1160
---------------------+--------------------
               Total | 27.99   18   0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

    chi2(1)     =  15.76
    Prob > chi2 =  0.0001

 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |   chi2   df       p
-------------+--------------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   1.56    1   0.9077 #
      exper2 |   0.10    1   1.0000 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
-------------+--------------------------------
simultaneous |  23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |   chi2   df       p
-------------+-------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   3.20    1   0.5341 #
      exper2 |   3.20    1   0.5341 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Figure: residuals vs. fitted values (x-axis: Fitted values), observation IDs as marker labels; id=24 again appears at the bottom of the plot]

. rvpplot _Itencat_3, yline(0) ml(id)

[Figure: residuals vs. tencat==3 (x-axis: tencat==3), observation IDs as marker labels]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to non-constant
variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression (a
minimal sketch follows below).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
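
A minimal sketch of WLS via Stata’s analytic weights; varhat is a
hypothetical per-case estimate of the error variance, not a variable in
these data:

. generate w = 1/varhat
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite [aweight=w]

aweights weight each case in inverse proportion to its error variance,
which is exactly the WLS idea.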

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1          m2_robust
-------------+--------------------------------
         hsc |  .22241643***   .22241643***
        scol |  .32030543***   .32030543***
        ccol |  .68798333***   .68798333***
       exper |  .02854957***   .02854957***
      exper2 | -.00057702***  -.00057702***
  _Itencat_1 | -.0027571      -.0027571
  _Itencat_2 |  .22096745***   .22096745***
  _Itencat_3 |  .28798112***   .28798112***
      female | -.29395956***  -.29395956***
    nonwhite | -.06409284     -.06409284
       _cons |  1.1567164***   1.1567164***
-------------+--------------------------------
           N |        526            526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant
variance has to be pronounced in order
to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors so that we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel or
time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
Breusch-Godfrey Lagrange multiplier test,
estat bgodfrey (a hedged sketch follows).
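
A hedged sketch for the time-series case (y, x1, & year are hypothetical;
the test requires the data to be tsset):

. tsset year
. reg y x1
. estat bgodfrey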

 This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis falls from the mean for x.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
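
A minimal sketch of applying the 3*k/n rule of thumb to these data,
assuming k counts all 11 estimated parameters (10 predictors plus the
constant) & n=526:

. predict h if e(sample), hat
. list id h if h > 3*11/526 & !missing(h)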

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
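
A minimal sketch of flagging cases by the 4/n cutoff for Cook’s Distance
(n=526 here):

. predict d if e(sample), cooksd
. list id d if d > 4/526 & !missing(d)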

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 A DFBETA of 1 means that omitting the
observation would shift the corresponding slope
coefficient by 1 standard error: DFBETAs of 1 or
more, or of at least 2/sqrt(n) (in large samples),
represent influential outliers.
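
A minimal sketch of the 2/sqrt(n) cutoff, using the high-tenure dummy as
the example (dfbeta generates the DF_ variables):

. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(526) & !missing(DF_Itencat_3)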

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[Figure: leverage vs. normalized residual squared (x-axis: Normalized residual squared), observation IDs as marker labels]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[Figure: added-variable plots of e(lwage | X) against e(x | X) for each predictor, each panel annotated with its coef, se, & t]

. avplot _Itencat_3, ml(id)

[Figure: added-variable plot for _Itencat_3 (coef = .28798112, se = .0557211, t = 5.17), observation IDs as marker labels]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Figure: Cook’s Distance (d) plotted against observation id; id=24 stands out with the largest distance, roughly .03]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers (see
the sketch below).
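
A minimal sketch with the same predictors as the final model (median
regression via qreg; robust regression via rreg):

. xi:qreg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hsc scol ccol exper exper2 i.tencat female nonwhite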

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
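
A hedged sketch of hadimvo; the variable pair is purely illustrative:

. hadimvo wage exper, gen(odd)
. list id wage exper if odd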

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect about
5% of the observations to fall beyond ±2
standard deviations by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 78

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[Figure: lvr2plot of leverage vs. normalized residual squared, points labeled by id. Nearly all observations cluster at low leverage & low squared residual; id=465 & id=306 show the highest leverage, & id=24 shows by far the largest normalized squared residual.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[Figure: added-variable plots (avplots) of e( lwage | X ) against each partialled-out predictor. Panel estimates: hsc coef = .22241643, se = .04881795, t = 4.56; scol coef = .32030543, se = .05467635, t = 5.86; ccol coef = .68798333, se = .05753517, t = 11.96; exper coef = .02854957, se = .00503299, t = 5.67; exper2 coef = -.00057702, se = .00010737, t = -5.37; _Itencat_1 coef = -.0027571, se = .0491805, t = -.06; _Itencat_2 coef = .22096745, se = .04916835, t = 4.49; _Itencat_3 coef = .28798112, se = .0557211, t = 5.17; female coef = -.29395956, se = .03576118, t = -8.22; nonwhite coef = -.06409284, se = .05772311, t = -1.11.]

. avplot _Itencat_3, ml(id)

[Figure: added-variable plot of e( lwage | X ) against e( _Itencat_3 | X ), points labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 again sits at the extreme low end of the residuals.]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[Figure: scatter plot of Cook’s Distance (d) against id. id=24 has the largest value (about .03), trailed by ids 150, 128, 59, 58, 282, 212, 105, 260, 382, 381, & 487; the remaining observations fall below roughly .01.]

 Note id=24.


. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg); a sketch follows. See the Stata manual; Hamilton,
Statistics with Stata; & Chen et al., Regression with
Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
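
 A sketch of refitting the working model with these two alternatives:
. xi: qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi: rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite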

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine whether observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo’ (sketched below).
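
 A minimal sketch of hadimvo (the command may need to be installed; baddata is an illustrative name & p(.05) an illustrative significance level):
. hadimvo wage exper tenure, gen(baddata) p(.05)
. list id wage exper tenure if baddata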

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 79

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 80

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds powers of the model’s fitted values to the regression: we want it to test insignificant, so that we fail to reject the null hypothesis that the model has no important omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of wage
Ho: model has no omitted variables
F(3, 517)  =      9.37
Prob > F   =    0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
Ramsey RESET test using powers of the fitted values of lwage
Ho: model has no omitted variables
F(3, 515)  =      2.11
Prob > F   =    0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional specification is ‘linktest’: it tests whether y is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant: we want to fail to reject the null hypothesis that y is specified correctly.

. linktest

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest   (p = .12)
. linktest   (_hatsq p = .15)

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted scatterplot, observations labeled by id; residuals span roughly -2 to 1 across fitted values of roughly 1 to 2.5. The extreme low outlier id=24 sits at the bottom.]

 Problems of heteroscedasticity?

 So, even though the model passed
linktest & estat ovtest, at least one
basic problem remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) – ‘VIF’ > 10: measures inflation in variance due to multicollinearity.
 Square root of VIF: shows the amount of increase in an explanatory variable’s standard error due to multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’ < .10: reciprocal of VIF; measures extent of variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command ‘collin’) – ‘Condition Number’ > 15 or especially > 30.
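A quick pre-estimation sketch with the downloadable ‘collin’ command (install via ‘findit collin’; the variable list simply reuses this data set’s raw predictors):

. collin educ exper tenure female nonwhite

collin reports the VIFs, tolerances, & condition number in a single table.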

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
       exper |     15.62    0.064013
      exper2 |     14.65    0.068269
         hsc |      1.88    0.532923
  _Itencat_3 |      1.83    0.545275
        ccol |      1.70    0.589431
        scol |      1.69    0.591201
  _Itencat_2 |      1.50    0.666210
  _Itencat_1 |      1.38    0.726064
      female |      1.07    0.934088
    nonwhite |      1.03    0.971242
-------------+----------------------
    Mean VIF |      4.23

 The seemingly troublesome scores for exper
& exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command (see the hand-centering sketch after this list).

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
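As a minimal sketch of the centering idea, done by hand rather than with the ‘center’ command (the new variable names are arbitrary):

. su exper                        // leaves the mean in r(mean)
. gen exper_c = exper - r(mean)   // centered experience
. gen exper_c2 = exper_c^2        // quadratic term built from the centered variable

Centering before squaring typically reduces the correlation between a variable & its square, which is what inflated the exper/exper2 VIFs above.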

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid
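Besides the histogram below, two other quick graphic checks (a sketch; both are standard Stata commands):

. kdensity rstu, normal
. qnorm rstu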

. hist rstu, norm

[Histogram of studentized residuals with a normal overlay; x-axis: studentized residuals (-4 to 4), y-axis: density (0 to .5).]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

              Source |      chi2    df        p
---------------------+--------------------------
  Heteroskedasticity |     21.27    13    0.0677
            Skewness |      4.25     4    0.3733
            Kurtosis |      2.47     1    0.1160
---------------------+--------------------------
               Total |     27.99    18    0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)      =    15.76
Prob > chi2  =   0.0001

 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |      chi2   df        p
-------------+--------------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      1.56    1   0.9077 #
      exper2 |      0.10    1   1.0000 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
-------------+--------------------------------
simultaneous |     23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |      chi2   df        p
-------------+-------------------------
         hsc |      0.14    1   1.0000 #
        scol |      0.17    1   1.0000 #
        ccol |      2.47    1   0.7078 #
       exper |      3.20    1   0.5341 #
      exper2 |      3.20    1   0.5341 #
  _Itencat_1 |      0.44    1   0.9992 #
  _Itencat_2 |      0.00    1   1.0000 #
  _Itencat_3 |     10.03    1   0.0153 #
      female |      1.02    1   0.9762 #
    nonwhite |      0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residual-versus-fitted scatterplot, observations labeled by id; same pattern as before, with id=24 again at the extreme bottom.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residual-versus-predictor plot for _Itencat_3 (tencat==3, 0 to 1 on the x-axis), observations labeled by id; residuals span roughly -2 to 1.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to non-constant
variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression (see the sketch below).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
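As a rough illustration of the weighted-least-squares idea mentioned above (the weight below is purely hypothetical; in practice we would first model the variance):

. gen w = 1/(1 + tenure)    // hypothetical weight: downweight the high-variance, high-tenure cases
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=w]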

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model’s.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |  .22241643***   .22241643***
        scol |  .32030543***   .32030543***
        ccol |  .68798333***   .68798333***
       exper |  .02854957***   .02854957***
      exper2 | -.00057702***  -.00057702***
  _Itencat_1 |  -.0027571      -.0027571
  _Itencat_2 |  .22096745***   .22096745***
  _Itencat_3 |  .28798112***   .28798112***
      female | -.29395956***  -.29395956***
    nonwhite | -.06409284     -.06409284
       _cons |  1.1567164***   1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant
variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither clustered nor panel or time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command (the Breusch-Godfrey
Lagrange Multiplier test).
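A minimal sketch, assuming time-series data with a time variable named year & hypothetical variables y & x:

. tsset year
. reg y x
. estat bgodfrey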

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
divided by the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-squared-residual plot, observations labeled by id; x-axis: normalized residual squared (0 to .04), y-axis: leverage (0 to .06). id=24 has by far the largest squared residual.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[Added-variable plots of e(lwage | X) against e(x | X) for each predictor:
hsc: coef = .22241643, se = .04881795, t = 4.56
scol: coef = .32030543, se = .05467635, t = 5.86
ccol: coef = .68798333, se = .05753517, t = 11.96
exper: coef = .02854957, se = .00503299, t = 5.67
exper2: coef = -.00057702, se = .00010737, t = -5.37
_Itencat_1: coef = -.0027571, se = .0491805, t = -.06
_Itencat_2: coef = .22096745, se = .04916835, t = 4.49
_Itencat_3: coef = .28798112, se = .0557211, t = 5.17
female: coef = -.29395956, se = .03576118, t = -8.22
nonwhite: coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3, observations labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 appears at the bottom of the plot.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
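As a sketch of applying the DFBETA cutoff of 2 divided by the square root of n (n = 526 here) to one of the variables just generated:

. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(526) & DF_Itencat_3 < .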
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
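Besides the scatterplot below, the 4/n rule of thumb can be applied to d directly (a sketch; n = 526):

. count if d > 4/526 & d < .
. list id d if d > 4/526 & d < .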

. scatter d id

[Scatterplot of Cook’s Distance d (0 to .03) against id (roughly 0 to 500); id=24 has the largest value, about .03, far below the cutoff of 1.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg); a sketch follows below. See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
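A minimal median-regression sketch of the same specification (qreg fits the .5 quantile by default; the _Itencat_* dummies were created by the earlier xi: prefix):

. qreg lwage hs scol ccol exper exper2 _Itencat_* female nonwhite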

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
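A minimal sketch of ‘hadimvo’ (the gen() variable name is arbitrary; in recent Stata versions the command may need to be installed):

. hadimvo wage educ exper tenure, gen(multiout)
. list id wage educ exper tenure if multiout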

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 81

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to non-constant
variance. I’m guessing that including the
variable age, which the data set doesn’t
have, would either solve or reduce the
problem. Why?

 Maybe the small number of observations at the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—although
at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example, because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression
(see the sketch after this list).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
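 For illustration, here’s a minimal feasible-WLS sketch (one standard recipe; ehat, loge2, ghat & w are new variables created only for this example): model the log squared residuals on the predictors, then reweight by the inverse of the fitted variance.

. * (sketch) estimate the variance function from the OLS residuals:
. xi: reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. predict double ehat, resid
. gen double loge2 = ln(ehat^2)
. xi: reg loge2 hs scol ccol exper exper2 i.tencat female nonwhite
. predict double ghat, xb
. * re-estimate, weighting by the inverse of the estimated variance:
. gen double w = 1/exp(ghat)
. xi: reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=w]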

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526            526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors, so that we can explore additional
diagnostics.
 Our final model, however, will use
robust standard errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither clustered nor panel/time-series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command for the Breusch-Godfrey
Lagrange Multiplier test.
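 A minimal sketch, assuming hypothetical time-series data with a time variable year & regressors x1 & x2 (not our wage data):

. tsset year
. reg y x1 x2
. * test for serial correlation in the errors at lags 1 & 2:
. estat bgodfrey, lags(1/2)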

This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of a practically insignificant
problem with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation falls
from the mean of y on the Y-axis;
leverage: how far the observation falls
from the mean of x on the X-axis.

discrepancy x leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
by themselves (without leverage) they don’t
influence the slope coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
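 As a quick sketch, we can flag large hat values right after estimation (hval is a new variable created here; k is taken as e(df_m)+1 so that it counts the constant):

. predict hval if e(sample), hat
. * flag observations above the 3*k/n rule of thumb:
. list id hval if hval > 3*(e(df_m)+1)/e(N) & hval < .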

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
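 A sketch using the cutoffs just mentioned (dval & dfts are new variables created only for illustration):

. predict dval if e(sample), cooksd
. predict dfts if e(sample), dfits
. * flag Cook’s Distance above 4/n, or |DFITS| of 1 or more:
. list id dval dfts if (dval > 4/e(N) & dval < .) | (abs(dfts) >= 1 & dfts < .)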

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of individual observations on
particular slope coefficients.
 A DFBETA of 1 means that the observation shifts
the corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more, or of at least
2/sqrt(n) (in large samples), represent
influential outliers.
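 A sketch, using this document’s DF_* naming (in this Stata setup dfbeta creates one DF_ variable per predictor):

. dfbeta
. * flag observations beyond the large-sample 2/sqrt(n) rule for one coefficient:
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(e(N)) & DF_Itencat_3 < .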

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[lvr2plot: leverage (y, roughly 0 to .06) vs. normalized residual squared (x, roughly 0 to .04), points labeled by id; id=24 sits far to the right]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes.

[avplots: one added-variable plot per predictor, each plotting e(lwage | X) against e(x | X); the coef, se & t reported beneath each panel match the regression estimates above (e.g., _Itencat_3: coef = .28798112, se = .0557211, t = 5.17)]

. avplot _Itencat_3, ml(id)

[avplot: e(lwage | X) vs. e(_Itencat_3 | X), points labeled by id; coef = .28798112, se = .0557211, t = 5.17; id=24 again sits at the extreme bottom]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[scatterplot of Cook’s Distance d (y, 0 to about .03) against id (x, 0 to 500); id=24 has the largest value, at roughly .03]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata
(UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
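 For instance, a minimal sketch with our working model (median regression via qreg, then robust regression via rreg):

. xi: qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi: rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite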

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
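 A sketch of hadimvo (the gen() flag variable odd & the p(.05) significance level are illustrative choices, following the older Stata syntax):

. hadimvo wage educ exper tenure, gen(odd) p(.05)
. * inspect the observations flagged as multivariate outliers:
. list id wage educ exper tenure if odd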

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 84

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)     = 15.76
Prob > chi2 = 0.0001

 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
-----------------------------------------------
     Variable |     chi2   df       p
--------------+--------------------------------
          hsc |     0.14    1   1.0000 #
         scol |     0.17    1   1.0000 #
         ccol |     2.47    1   0.7078 #
        exper |     1.56    1   0.9077 #
       exper2 |     0.10    1   1.0000 #
   _Itencat_1 |     0.44    1   0.9992 #
   _Itencat_2 |     0.00    1   1.0000 #
   _Itencat_3 |    10.03    1   0.0153 #
       female |     1.02    1   0.9762 #
     nonwhite |     0.03    1   1.0000 #
--------------+--------------------------------
 simultaneous |    23.68   10   0.0085
-----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
----------------------------------------
     Variable |     chi2   df       p
--------------+-------------------------
          hsc |     0.14    1   1.0000 #
         scol |     0.17    1   1.0000 #
         ccol |     2.47    1   0.7078 #
        exper |     3.20    1   0.5341 #
       exper2 |     3.20    1   0.5341 #
   _Itencat_1 |     0.44    1   0.9992 #
   _Itencat_2 |     0.00    1   1.0000 #
   _Itencat_3 |    10.03    1   0.0153 #
       female |     1.02    1   0.9762 #
     nonwhite |     0.03    1   1.0000 #
----------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residuals-vs-fitted plot, points labeled by id: x-axis Fitted values (1 to 2.5), y-axis residuals (-2 to 1); id=24 is the extreme low outlier.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residuals-vs-predictor plot, points labeled by id: x-axis tencat==3 (0 to 1), y-axis residuals (-2 to 1); id=24 again sits at the bottom.]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I’m guessing that including the variable age, which the data set doesn’t have, would either solve or reduce the problem. Why?

 Maybe the small # of observations for the high end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor (multi-level or binary) may work—although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to use weighted least squares regression (a sketch follows this list).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
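 E.g., a minimal feasible-WLS sketch, assuming the error variance is modeled as a function of the fitted values (yhat, ehat, loge2, lvhat, & vhat are names we create here):
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. predict double yhat if e(sample), xb
. predict double ehat if e(sample), resid
. gen double loge2 = ln(ehat^2)
. reg loge2 yhat
. predict double lvhat, xb
. gen double vhat = exp(lvhat)
. * re-fit, weighting each case by the inverse of its estimated error variance:
. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite [aw=1/vhat]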

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |   .22241643***    .22241643***
        scol |   .32030543***    .32030543***
        ccol |   .68798333***    .68798333***
       exper |   .02854957***    .02854957***
      exper2 |  -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |   .22096745***    .22096745***
  _Itencat_3 |   .28798112***    .28798112***
      female |  -.29395956***   -.29395956***
    nonwhite |  -.06409284      -.06409284
       _cons |   1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors so that we can explore additional diagnostics.

 Our final model, however, will use robust standard errors.

Correlated Errors
 In the case of these data there’s no need to worry about correlated errors: the sample is neither a cluster sample nor panel/time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s estat bgodfrey command (the Breusch-Godfrey Lagrange multiplier test).
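 A sketch, assuming the data have been tsset (lags(1) tests for first-order serial correlation):
. estat bgodfrey, lags(1)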

 This model seems to be satisfactory from the perspective of linear regression’s assumptions, with the exception of a practically insignificant problem of non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope coefficients depends on its discrepancy & leverage:
discrepancy: how far the observation on the Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
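 E.g., a quick sketch of flagging such outliers:
. predict rstu if e(sample), rstu
. list id rstu if abs(rstu) >= 3 & rstu < .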

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
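 E.g., a sketch of the 3*k/n cutoff after regress, assuming k counts the coefficients including the constant (h & hcut are names we create):
. predict h if e(sample), hat
. scalar hcut = 3*(e(df_m)+1)/e(N)
. list id h if h >= hcut & h < .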

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
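 E.g., a sketch of flagging on the 4/n cutoff (d & dfit are names we create):
. predict d if e(sample), cooksd
. predict dfit if e(sample), dfits
. list id d dfit if (d >= 4/e(N) & d < .) | (abs(dfit) >= 4/e(N) & dfit < .)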

 DFBETAs also measure the actual influence of observations, in this case on particular slope coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure of the influence of individual observations on the slope coefficients.
 A DFBETA of 1 means that the observation shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2/sqrt(n) (in large samples), represent influential outliers.
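 E.g., a sketch using the DF_* names that dfbeta creates (as shown later in these slides):
. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) >= 2/sqrt(e(N)) & DF_Itencat_3 < .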

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-vs-residual-squared plot, points labeled by id: x-axis Normalized residual squared (0 to .04), y-axis leverage (roughly .01 to .06); id=24 & id=128 sit far right on the residual axis, while id=465 & id=306 sit high on the leverage axis.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[Added-variable plots of e(lwage | X) against each predictor's residuals:
 hsc:        coef = .22241643,  se = .04881795, t = 4.56
 scol:       coef = .32030543,  se = .05467635, t = 5.86
 ccol:       coef = .68798333,  se = .05753517, t = 11.96
 exper:      coef = .02854957,  se = .00503299, t = 5.67
 exper2:     coef = -.00057702, se = .00010737, t = -5.37
 _Itencat_1: coef = -.0027571,  se = .0491805,  t = -.06
 _Itencat_2: coef = .22096745,  se = .04916835, t = 4.49
 _Itencat_3: coef = .28798112,  se = .0557211,  t = 5.17
 female:     coef = -.29395956, se = .03576118, t = -8.22
 nonwhite:   coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3, points labeled by id: coef = .28798112, se = .0557211, t = 5.17; id=24 again falls at the bottom of the plot.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook's Distance d (y-axis, 0 to .03) against id (x-axis, 0 to 500): id=24 has the largest value, near .03; ids 150, 128, 59, 58, 282, 212, 105, 260, 382, 381, & 487 come next, between .01 & .02.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.
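 E.g., a sketch re-fitting this example’s final model by median & by robust regression:
. xi:qreg lwage hsc scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hsc scol ccol exper exper2 i.tencat female nonwhite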

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
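 E.g., a sketch of hadimvo’s use under its older-Stata syntax (the indicator ‘out’ is a name we create; p() sets the significance level):
. hadimvo wage educ exper tenure, gen(out) p(.05)
. list id wage educ exper tenure if out==1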

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 85

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id)

[avplot of _Itencat_3, points labeled by id: coef = .28798112, se = .0557211, t = 5.17. id=24 appears as an extreme low-residual point at a middling x-value.]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
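As a quick numeric check first, here is a minimal sketch, assuming the conventional cutoffs noted above (|rstu| > 3, h > 3*k/n, & d > 4/n) and assuming k = 11 estimated coefficients & n = 526 for this model:

. * count studentized-residual outliers (the "& rstu < ." excludes missings)
. count if abs(rstu) > 3 & rstu < .
. * count relatively large hat values, h > 3*k/n
. count if h > 3*11/526 & h < .
. * list observations whose Cook's Distance exceeds 4/n
. list id rstu h d if d > 4/526 & d < .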

. scatter d id
[scatter of Cook's Distance (d, y-axis 0 to .03) against id: id=24 stands out with the largest distance, about .03]

 Note id=24.
. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there's no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.
 Examine the model's adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA-ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
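For example, a minimal sketch using this model's variables (the choice of estimator & options here is purely illustrative):

. * median regression (the .5 quantile), resistant to y-outliers
. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. * robust regression, which iteratively downweights large-residual cases
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite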

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
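A minimal sketch of how 'hadimvo' might be used (it is a user-written command, so the gen() flag variable & p() significance level shown here should be checked against its help file; the variable pair is just an illustration):

. * flag observations that are joint multivariate outliers on wage & exper
. hadimvo wage exper, gen(outlier) p(.05)
. list id wage exper if outlier==1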

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 86

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 87

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

. rvfplot, yline(0) ml(id)

[Residuals vs. fitted values, points labeled by id; x-axis: fitted values (1 to 2.5), y-axis: residuals (-2 to 1). A scatter of low-end outliers stands out, with id=24 the most extreme at the bottom.]

 Problems of heteroscedasticity?

 So, even though the model passed
linktest & estat ovtest, at least one
basic problem remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.
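 For sign (1), a quick (if unreliable) check is the matrix of bivariate correlations among the x’s, e.g.:
. pwcorr hsc scol ccol exper exper2 female nonwhite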

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) –
‘VIF’ > 10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) –
‘Tolerance’ < .10: the reciprocal of VIF; measures the extent of
variance that is independent of the other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’ > 15 or especially > 30.
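 As a hedged illustration of what ‘vif’ computes for one predictor (here exper): regress it on the other x’s; then VIF = 1/(1-R2) & tolerance = 1-R2.
. quietly reg exper exper2 hsc scol ccol _Itencat_1 _Itencat_2 _Itencat_3 female nonwhite
. di "VIF = " 1/(1-e(r2)) "  tolerance = " 1-e(r2)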

. vif

    Variable |    VIF       1/VIF
-------------+----------------------
       exper |   15.62    0.064013
      exper2 |   14.65    0.068269
         hsc |    1.88    0.532923
  _Itencat_3 |    1.83    0.545275
        ccol |    1.70    0.589431
        scol |    1.69    0.591201
  _Itencat_2 |    1.50    0.666210
  _Itencat_1 |    1.38    0.726064
      female |    1.07    0.934088
    nonwhite |    1.03    0.971242
-------------+----------------------
    Mean VIF |    4.23

 The seemingly troublesome scores for exper &
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.
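 A minimal sketch of centering by hand (exper_c & exper_c2 are illustrative names; the quadratic is rebuilt from the centered variable):
. su exper, meanonly
. gen exper_c = exper - r(mean)
. gen exper_c2 = exper_c^2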

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
. predict e if e(sample), resid
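 Other quick looks at the same distribution (assuming rstu from above):
. kdensity rstu, normal
. qnorm rstu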

. hist rstu, norm

[Histogram of the studentized residuals (x-axis: -4 to 4) with a normal-density overlay; y-axis: density (0 to .5).]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

              Source |   chi2    df       p
---------------------+-----------------------
  Heteroskedasticity |  21.27    13    0.0677
            Skewness |   4.25     4    0.3733
            Kurtosis |   2.47     1    0.1160
---------------------+-----------------------
               Total |  27.99    18    0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

          chi2(1)     =    15.76
          Prob > chi2 =   0.0001
 There seem to be problems. Let’s
inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |     chi2   df        p
-------------+--------------------------------
         hsc |     0.14    1    1.0000 #
        scol |     0.17    1    1.0000 #
        ccol |     2.47    1    0.7078 #
       exper |     1.56    1    0.9077 #
      exper2 |     0.10    1    1.0000 #
  _Itencat_1 |     0.44    1    0.9992 #
  _Itencat_2 |     0.00    1    1.0000 #
  _Itencat_3 |    10.03    1    0.0153 #
      female |     1.02    1    0.9762 #
    nonwhite |     0.03    1    1.0000 #
-------------+--------------------------------
simultaneous |    23.68   10    0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |     chi2   df        p
-------------+-------------------------
         hsc |     0.14    1    1.0000 #
        scol |     0.17    1    1.0000 #
        ccol |     2.47    1    0.7078 #
       exper |     3.20    1    0.5341 #
      exper2 |     3.20    1    0.5341 #
  _Itencat_1 |     0.44    1    0.9992 #
  _Itencat_2 |     0.00    1    1.0000 #
  _Itencat_3 |    10.03    1    0.0153 #
      female |     1.02    1    0.9762 #
    nonwhite |     0.03    1    1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residuals vs. fitted values again (x-axis: 1 to 2.5; y-axis: -2 to 1), points labeled by id; the same low-end outliers (id=24, 128, 12) sit at the bottom.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residuals vs. _Itencat_3 (x-axis: tencat==3, 0 to 1; y-axis: -2 to 1), points labeled by id; id=24 again sits at the extreme bottom.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to
non-constant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information
(a sketch follows after this list).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).
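 A hypothetical sketch of the categorizing option mentioned above (the cutpoints & the name tencat2 are illustrative, not the deck’s actual tencat coding):
* bin tenure at assumed cutpoints; values of 45 or more would become missing
. egen tencat2 = cut(tenure), at(0 1 5 15 45) icodes
. tab tencat2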

 A more complicated option would be to
use weighted least squares regression
(sketched below).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
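 For the weighted least squares option flagged above, a hedged feasible-GLS sketch (e0, lne2, lnv, & w are illustrative names, not from the deck): model the log of the squared residuals, then weight by the inverse of the fitted variance.
. quietly xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. predict e0 if e(sample), resid
. gen lne2 = ln(e0^2)
. quietly reg lne2 hs scol ccol exper exper2 _Itencat_* female nonwhite
. predict lnv if e(sample), xb
. gen w = 1/exp(lnv)
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=w]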

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)
----------------------------------------------
    Variable |      m1            m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors, so that we can explore additional
diagnostics.

 Our final model, however, will use
robust standard errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel or
time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
‘estat bgodfrey’ (the Breusch-Godfrey
Lagrange Multiplier test).
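 A hypothetical time-series sketch (t, y, & x are illustrative names, not in these data):
. tsset t
. reg y x
. estat bgodfrey, lags(1)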

This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of a practically insignificant
problem with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
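 A sketch of flagging cases against that cutoff, assuming k counts the model’s coefficients including the constant (11 here) & n=526:
. predict h if e(sample), hat
. list id h if h > 3*11/526 & h < .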

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
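 A sketch of applying the 4/n rule of thumb (n=526; the variable name dfits is illustrative):
. predict d if e(sample), cooksd
. predict dfits if e(sample), dfits
. list id d dfits if (d > 4/526 | abs(dfits) > 4/526) & d < .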

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of an observation’s influence on the slope
coefficients.
 A DFBETA of 1 shifts the corresponding slope
coefficient by 1 standard error: DFBETAs of 1 or
more, or of at least 2 divided by the square root
of n (in large samples), represent influential
outliers.
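 A sketch of flagging cases on one coefficient’s DFBETA, using the deck’s DF_ naming & the 2/sqrt(n) cutoff:
. dfbeta
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(526) & DF_Itencat_3 < .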

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage vs. normalized residual squared, points labeled by id; x-axis: normalized residual squared (0 to .04), y-axis: leverage (.01 to .06). id=24 sits far right on the residual axis & a few points (e.g., id=465) sit high on the leverage axis, but no point is extreme on both.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots

[Added-variable plots of e(lwage | X) against each predictor: look for simultaneous x/y extremes. Panel annotations:
 hsc: coef = .22241643, se = .04881795, t = 4.56
 scol: coef = .32030543, se = .05467635, t = 5.86
 ccol: coef = .68798333, se = .05753517, t = 11.96
 exper: coef = .02854957, se = .00503299, t = 5.67
 exper2: coef = -.00057702, se = .00010737, t = -5.37
 _Itencat_1: coef = -.0027571, se = .0491805, t = -.06
 _Itencat_2: coef = .22096745, se = .04916835, t = 4.49
 _Itencat_3: coef = .28798112, se = .0557211, t = 5.17
 female: coef = -.29395956, se = .03576118, t = -8.22
 nonwhite: coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Added-variable plot of e(lwage | X) vs. e(_Itencat_3 | X), points labeled by id; coef = .28798112, se = .0557211, t = 5.17. id=24 sits at the extreme bottom.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook’s Distance d (y-axis: 0 to .03) against id (x-axis: 0 to 500); id=24 is the highest point at about .03, followed by ids 150, 128, 59, 58, 282, & 212.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors, if possible.

 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments (e.g., adding
omitted variables, interactions, log or other
transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS
web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
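 E.g., a hedged median-regression sketch of the same model:
. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite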

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
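 A hedged sketch (hadimvo is a downloadable command; the variable list & the generated dummy ‘bad’ are illustrative):
. hadimvo lwage exper tenure, gen(bad)
. tab bad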

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust

Slide 88

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors, so that we can explore additional
diagnostics.
 Our final model, however, will use
robust standard errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel/time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command for the Breusch-Godfrey
Lagrange Multiplier test (a minimal sketch follows below).
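
A hedged sketch of that third option, purely illustrative since our data have no time dimension; the names year, y, & x are hypothetical:

* Declare the data as time series, fit the model, then run Breusch-Godfrey:
. tsset year
. reg y x
. estat bgodfrey, lags(1)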

This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of a practically insignificant
problem of non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation’s value on the
Y-axis falls from the mean of y;
leverage: how far the observation’s value on the
X-axis falls from the mean of x.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors; but
on their own (without high leverage) they exert
little influence on the slope coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of individual observations on the
slope coefficients.
 A DFBETA of 1 means that omitting the observation
shifts the corresponding slope coefficient by 1
standard error: DFBETAs of 1 or more, or of at least
2 divided by the square root of n (in large samples),
represent influential outliers.
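
Before turning to the graphs, here is a small sketch that pulls these rough cutoffs together (an illustration only, treating k as the number of model parameters, e(df_m)+1); rstu, h, & d match the variable names created later:

* Flag observations that exceed the rough cutoffs discussed above:
. predict rstu if e(sample), rstudent
. predict h if e(sample), hat
. predict d if e(sample), cooksd
. list id rstu h d if (abs(rstu) > 3 | h > 3*(e(df_m)+1)/e(N) | d > 4/e(N)) & rstu < .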

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot: leverage on the y-axis, normalized residual squared on the x-axis, points labeled by id; id=465 sits near the top & id=24 far to the right]

 There are signs of a high-residual point & a high-leverage
point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y extremes

[Added-variable plots, one panel per predictor: each plots e( lwage | X ) against e( x_j | X ) & reports the coefficient, se, & t, e.g., hsc: coef = .22241643, se = .04881795, t = 4.56; ccol: coef = .68798333, se = .05753517, t = 11.96; _Itencat_3: coef = .28798112, se = .0557211, t = 5.17]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3, points labeled by id; coef = .28798112, se = .0557211, t = 5.17; id=24 again sits at the bottom]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.
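
One simple way to start (a small sketch, assuming d has just been created by predict) is to rank the observations by Cook’s Distance:

* List the ten largest Cook's Distance values, then restore the original order:
. gsort -d
. list id d in 1/10
. sort id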

. scatter d id

[Scatterplot of Cook’s Distance (d) against id; id=24 stands out at the top, at roughly .03]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors, if
possible.
 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers
(see the sketch just below).
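
A minimal sketch of those alternatives, using this model’s own predictors (illustrative only, with default options):

* Median (quantile) regression & robust regression as outlier-resistant checks:
. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite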

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine whether observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo’ (a sketch follows below).
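
A hedged sketch of hadimvo, an older, user-contributed command whose syntax may vary by Stata version; badobs is a made-up variable name:

* Flag multivariate outliers across several quantitative variables:
. hadimvo wage educ exper tenure, gen(badobs)
. list id wage educ exper tenure if badobs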

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 Even in a normal distribution we expect
about 5% of observations to fall beyond
±2 standard deviations by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 89

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 90

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the outcome
variable needs to be redefined to properly
account for the y/x relationships (e.g., from
‘miles per gallon’ to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t suggest a
transformation, but log wage is common for
right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exper tenure female nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, but we’ll
keep an eye on id=24 for possible problems.
 Let’s examine the independent variables:


. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. ladder exper
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitci lwage tenure
. twoway qfitci lwage ltenure
 What other options could be explored for
educ, exper, & tenure?
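
 One possibility (a sketch, not explored in the original): linear
splines via mkspline, which let the slope change at chosen knots. The
knot at 10 years of tenure below is an arbitrary illustration:

. mkspline ten1 10 ten2 = tenure
* ten1 carries the tenure slope up to 10 years, ten2 the slope beyond 10
. reg lwage educ exper exper2 ten1 ten2 female nonwhite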

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In STATA we use ‘estat ovtest’ (also known as
the regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the model’s
fitted values: we want it to test insignificant,
so that we fail to reject the null hypothesis
that the model has no important omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of wage
Ho: model has no omitted variables
        F(3, 517) = 9.37
        Prob > F  = 0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
Ramsey RESET test using powers of the fitted values of lwage
Ho: model has no omitted variables
        F(3, 515) = 2.11
        Prob > F  = 0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.
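
 One way to compare candidate specifications on statistical fit (a
sketch; the AIC/BIC comparison is an addition, not part of the original
exercise):

. reg lwage educ educ2 exper exper2 ltenure female nonwhite
. estat ic
* lower AIC/BIC indicates better fit; repeat after each candidate model & compare
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. estat ic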

 Another way to test the model’s functional
specification is ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specified correctly.

. linktest

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .3212452   .3478094     0.92   0.356      -.36203     1.00452
      _hatsq |   .2029893   .1030452     1.97   0.049     .0005559    .4054228
       _cons |   .5407215   .2855868     1.89   0.059    -.0203167     1.10176
------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest   (Prob > F = .12)
. linktest       (_hatsq p = .15)

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.
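
 For replication: the categorized predictors (hs, scol, ccol, tencat)
aren’t constructed in the commands shown. A minimal sketch of how they
might be built; the education cutoffs & tenure cut points are
assumptions, not taken from the original:

. gen byte hs = (educ == 12)
. gen byte scol = (educ > 12 & educ < 16)
. gen byte ccol = (educ >= 16) if educ < .
* hypothetical tenure categories; the original’s cut points aren’t given
. egen tencat = cut(tenure), at(0 2 5 10 45) icodes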

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

. rvfplot, yline(0) ml(id)

[residuals-vs.-fitted plot, observations labeled by id; fitted values run
from about 1 to 2.5, with most residuals banded around zero & a tail of
large negative residuals—id=24 is the extreme case near -2]

 Problems of heteroscedasticity?

 So, even though the model passed linktest &
estat ovtest, at least one basic problem remains
to be overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) - ‘VIF’ > 10: measures
inflation in variance due to multicollinearity.
 Square root of VIF: shows the amount of increase in an explanatory
variable’s standard error due to multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) - ‘Tolerance’ < .10:
reciprocal of VIF; measures the extent of variance that is independent
of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command ‘collin’) -
‘Condition Number’ > 15 or especially > 30.
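
 A sketch of the pre-estimation check in (8) (collin is user-written,
so locate & install it with ‘findit collin’; the variable list simply
mirrors the model’s predictors):

. findit collin
* a condition number > 15, & especially > 30, signals trouble
. collin hs scol ccol exper exper2 female nonwhite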

. vif

    Variable |   VIF     1/VIF
-------------+--------------------
       exper |  15.62   0.064013
      exper2 |  14.65   0.068269
         hsc |   1.88   0.532923
  _Itencat_3 |   1.83   0.545275
        ccol |   1.70   0.589431
        scol |   1.69   0.591201
  _Itencat_2 |   1.50   0.666210
  _Itencat_1 |   1.38   0.726064
      female |   1.07   0.934088
    nonwhite |   1.03   0.971242
-------------+--------------------
    Mean VIF |   4.23

 The seemingly troublesome scores for exper &
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command (see the sketch after this list).

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
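
 A minimal sketch of centering by hand, as promised above (the choice
of exper is illustrative; the ‘center’ command automates the
subtraction):

. su exper, meanonly
. gen experc = exper - r(mean)
* rebuild the quadratic from the centered variable to cut its correlation with the linear term
. gen experc2 = experc^2
. reg lwage educ experc experc2 tenure female nonwhite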

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that confidence
intervals & p-values will be good approximations
of a normal distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

. hist rstu, norm

[histogram of the studentized residuals (x-axis: -4 to 4) with a
normal-density overlay (y-axis: density, 0 to .5)]
 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

              Source |   chi2   df        p
---------------------+----------------------
  Heteroskedasticity |  21.27   13   0.0677
            Skewness |   4.25    4   0.3733
            Kurtosis |   2.47    1   0.1160
---------------------+----------------------
               Total |  27.99   18   0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

        chi2(1)     = 15.76
        Prob > chi2 = 0.0001

 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |   chi2   df        p
-------------+--------------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   1.56    1   0.9077 #
      exper2 |   0.10    1   1.0000 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
-------------+--------------------------------
simultaneous |  23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |   chi2   df        p
-------------+-------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   3.20    1   0.5341 #
      exper2 |   3.20    1   0.5341 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[residuals-vs.-fitted plot, observations labeled by id; the same low-end
outliers appear, with id=24 again the extreme negative residual]

. rvpplot _Itencat_3, yline(0) ml(id)

[residuals plotted against _Itencat_3 (tencat==3, coded 0/1),
observations labeled by id; id=24 sits at the bottom at _Itencat_3==0]

 By the way, note id=24.

 What to do about tenure? Although the model
passed ovtest, adding omitted variables is a
principal response to non-constant variance. I’m
guessing that including the variable age, which the
data set doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—although
at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to use
weighted least squares regression (a sketch
follows below).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
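
 Returning to the weighted least squares option above: a minimal
sketch of one common feasible-GLS recipe, assuming the error variance
depends on the model’s own predictors (see Wooldridge on FGLS):

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. predict e if e(sample), resid
. gen loge2 = ln(e^2)
* model the log squared residuals, then weight by the inverse of the fitted variance
. reg loge2 hs scol ccol exper exper2 _Itencat_* female nonwhite
. predict logh if e(sample), xb
. gen w = 1/exp(logh)
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=w]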

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1          m2_robust
-------------+--------------------------------
         hsc |  .22241643***   .22241643***
        scol |  .32030543***   .32030543***
        ccol |  .68798333***   .68798333***
       exper |  .02854957***   .02854957***
      exper2 | -.00057702***  -.00057702***
  _Itencat_1 | -.0027571      -.0027571
  _Itencat_2 |  .22096745***   .22096745***
  _Itencat_3 |  .28798112***   .28798112***
      female | -.29395956***  -.29395956***
    nonwhite | -.06409284     -.06409284
       _cons |  1.1567164***   1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant
variance has to be pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors
so that we can explore additional diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel or
time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
‘estat bgodfrey’ for the Breusch-Godfrey
Lagrange Multiplier test.
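
 A sketch of the time-series case (the dataset, variables, & lag
choice are purely illustrative):

. tsset year
. reg y x1 x2
* test for serial correlation in the residuals at lags 1-3
. estat bgodfrey, lags(1/3)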

 This model seems to be satisfactory from the
perspective of linear regression’s assumptions,
with the exception of a practically insignificant
problem with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the
X-axis falls from the mean for x.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
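
 A quick way to flag such outliers (a sketch applying the ±3 rule just
described):

. predict rstu if e(sample), rstudent
* list observations whose studentized residuals exceed |3|
. list id rstu if abs(rstu) > 3 & rstu < .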

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
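
 The 3*k/n rule can be applied right after estimation (a sketch; it
assumes k counts all estimated coefficients, including the constant,
which the slide leaves unstated):

. predict h if e(sample), hat
* e(df_m)+1 = number of estimated coefficients; e(N) = sample size
. list id h if h > 3*(e(df_m)+1)/e(N) & h < .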

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
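
 Flagging observations by these cutoffs (a sketch applying the
large-sample 4/n rule from the text to both statistics):

. predict d if e(sample), cooksd
. predict dfit if e(sample), dfits
* flag observations whose influence on the overall fit exceeds 4/n
. list id d dfit if (d > 4/e(N) | abs(dfit) > 4/e(N)) & d < .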

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
divided by the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[leverage-vs.-residual-squared plot, observations labeled by id; y-axis:
leverage (0 to .06), x-axis: normalized residual squared (0 to .04);
id=24 has by far the largest squared residual but modest leverage]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots

[added-variable plots for each predictor: look for simultaneous x/y
extremes;
  hsc:        coef =  .22241643, se = .04881795, t =  4.56
  scol:       coef =  .32030543, se = .05467635, t =  5.86
  ccol:       coef =  .68798333, se = .05753517, t = 11.96
  exper:      coef =  .02854957, se = .00503299, t =  5.67
  exper2:     coef = -.00057702, se = .00010737, t = -5.37
  _Itencat_1: coef = -.0027571,  se = .0491805,  t =  -.06
  _Itencat_2: coef =  .22096745, se = .04916835, t =  4.49
  _Itencat_3: coef =  .28798112, se = .0557211,  t =  5.17
  female:     coef = -.29395956, se = .03576118, t = -8.22
  nonwhite:   coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[added-variable plot for _Itencat_3, observations labeled by id;
coef = .28798112, se = .0557211, t = 5.17]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[scatterplot of Cook’s Distance (y-axis, 0 to .03) against id (x-axis,
0 to 500); id=24 stands out with the largest distance, near .03]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.
 Examine the model’s adequacy (see the
sections on model specification & non-constant
variance) & make adjustments (e.g., adding
omitted variables, interactions, log or other
transforms).
 Other options: try other types of regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA
ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
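
 A sketch of what those alternatives look like with this model (same
predictors as before; qreg fits the median):

. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
* rreg iteratively downweights high-influence observations
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite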

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
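
 A sketch of its use (hadimvo is an older, separately installed
command, & the variable list & cutoff here are illustrative):

. hadimvo wage educ exper tenure, gen(multiout) p(.05)
* multiout==1 flags observations that are multivariate outliers at the 5% level
. list id wage educ exper tenure if multiout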

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 91

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[Residuals-versus-fitted plot, points labeled by id. X-axis: fitted values, 1 to 2.5; y-axis: residuals, -2 to 1. Observation id=24 sits at the extreme bottom of the plot.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residuals-versus-predictor plot for _Itencat_3 (tencat==3), points labeled by id. Y-axis: residuals, -2 to 1. Observation id=24 is again the extreme low residual.]

 By the way, note id=24.

 What to do about tenure? Although the model passed ovtest, adding omitted variables is a principal response to non-constant variance. I’m guessing that including the variable age, which the data set doesn’t have, would either solve or reduce the problem. Why?

 Maybe the small # of observations for the high end of tenure matters as well.
 Other options (see the Stata sketch after this list):

 Try interactions &/or transformations based on qladder & ladder.
 Categorizing a continuous predictor (multi-level or binary) may work—although at the cost of lost information.
 We also could transform the outcome variable (see qladder, ladder, etc.), though not in this example because we had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
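 A minimal Stata sketch of two of these remedies. The weight variable below is purely illustrative (real WLS weights must come from a model of the error variance), & it assumes the xi-generated _Itencat_3 dummy is still in memory from the earlier run:

. qladder lwage
. ladder lwage

* illustrative weighted least squares, down-weighting the high-variance tenure group:
. gen w = 1/(1 + _Itencat_3)
. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=w]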

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust
. est store m2_robust

. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 |  -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that non-constant variance has to be pronounced in order to make a difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard errors, so that we can explore additional diagnostics.
 Our final model, however, will use robust standard errors.

Correlated Errors
 In the case of these data there’s no need to worry about correlated errors: the sample is neither a cluster sample nor panel or time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s estat bgodfrey command for the Breusch-Godfrey Lagrange Multiplier test.
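 A minimal sketch of that third case. These wage data are cross-sectional, so the time variable ‘year’ & the model below are hypothetical:

. tsset year
. reg y x1 x2
. estat bgodfrey, lags(1/3)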

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the Y-axis falls from the mean for y;
leverage: how far the observation on the X-axis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of observations, in this case on particular slope coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure of the influence of individual observations on the slope coefficients.
 A DFBETA of 1 means that the observation shifts the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2 divided by the square root of n (in large samples), represent influential outliers.
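 As a sketch, the cutoffs above translate into a short sequence of predict commands & filters. The variable names here are placeholders (the analysis below uses rstu, h, & d), & older versions of Stata name the dfbeta variables differently:

. predict rst if e(sample), rstudent
. predict lev if e(sample), hat
. predict cd if e(sample), cooksd
. predict dfts if e(sample), dfits
. dfbeta

* flag observations beyond the rules of thumb; the hat-value cutoff 3*k/n uses
* k = e(df_m)+1 coefficients (including the constant):
. list id rst if abs(rst) >= 3 & rst < .
. list id lev if lev > 3*(e(df_m)+1)/e(N) & lev < .
. list id cd if cd > 4/e(N) & cd < .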

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot, points labeled by id. X-axis: normalized residual squared, 0 to .04; y-axis: leverage, 0 to .06. Observation id=24 lies farthest right on the residual axis; a few points (e.g., id=465) show relatively high leverage; no point is extreme on both axes at once.]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[Added-variable plots of e( lwage | X ) against each predictor's e( x | X ), one panel per regressor, with partial slopes:
  hsc:        coef = .22241643,  se = .04881795, t = 4.56
  scol:       coef = .32030543,  se = .05467635, t = 5.86
  ccol:       coef = .68798333,  se = .05753517, t = 11.96
  exper:      coef = .02854957,  se = .00503299, t = 5.67
  exper2:     coef = -.00057702, se = .00010737, t = -5.37
  _Itencat_1: coef = -.0027571,  se = .0491805,  t = -.06
  _Itencat_2: coef = .22096745,  se = .04916835, t = 4.49
  _Itencat_3: coef = .28798112,  se = .0557211,  t = 5.17
  female:     coef = -.29395956, se = .03576118, t = -8.22
  nonwhite:   coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3, points labeled by id. X-axis: e( _Itencat_3 | X ), -1 to 1; y-axis: e( lwage | X ), -1 to 1. coef = .28798112, se = .0557211, t = 5.17. Observation id=24 sits at the extreme bottom of the plot.]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook's Distance (d) against id. Y-axis: 0 to .03; x-axis: id, 0 to 500. Observation id=24 has the largest Cook's Distance, roughly .03; the next largest values (e.g., ids 150, 128, 59, 58) fall near .02.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there’s no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if possible.

 Examine the model’s adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only, while robust regression works with x-outliers.
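 A hedged sketch of those alternatives, reusing the working model’s predictors & assuming the xi-generated dummies are in memory:

* qreg estimates median regression by default; rreg iteratively down-weights influential cases:
. qreg lwage hsc scol ccol exper exper2 _Itencat_* female nonwhite
. rreg lwage hsc scol ccol exper exper2 _Itencat_* female nonwhite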

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
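 A sketch of hadimvo; its exact syntax is an assumption from older Stata documentation, & in recent versions the command may need to be installed separately:

* gen() creates a dummy marking observations flagged as multivariate outliers at the chosen level:
. hadimvo lwage exper tenure, gen(outlier) p(.05)
. list id lwage exper tenure if outlier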

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 In a normal distribution we expect about 5% of observations to be outliers (i.e. beyond two standard deviations of the mean) by chance alone.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite, robust


Slide 92

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter plot of Cook's Distance (d, 0 to about .03) against id (0 to 526): id=24 stands highest at roughly .03; all other observations fall at about .02 or below]

 Note id=24.


. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
 To repeat, there's no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model's adequacy (see the
sections on model specification & non-constant
variance) & make adjustments (e.g., adding
omitted variables, interactions, log or other
transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg); a sketch follows this list. See the Stata
manual; Hamilton, Statistics with Stata; & Chen et
al., Regression with Stata (UCLA ATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.
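
 A minimal sketch of those alternatives, reusing the current specification (qreg fits median regression by default; rreg is Stata's robust regression):

. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite

 Comparing their coefficients with the OLS estimates gauges how much the outliers actually matter.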

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
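
 A hedged sketch of hadimvo (shipped with older Stata releases; the variable list & generated flag name here are illustrative):

. hadimvo lwage exper tenure, gen(multiout) p(.05)
. list id wage exper tenure if multiout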

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect about
5% of observations to lie beyond ±2
standard deviations by chance alone.
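
 A quick check of that figure (normal() is Stata's standard-normal CDF):

. di 2*normal(-2)
.04550026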
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 93

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of the influence of explanatory variables on slope
coefficients.
 Every DFBETA increment of 1 increases the
corresponding slope coefficient by 1 standard
deviation: DFBETAs of 1 or more, or of at least 2
times the square root of n (in large samples),
represent influential outliers.

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
.06

465

.05

306

.01

.02

.03

.04

520
298
305
266
252
133
463
253
282
407 167
308
262
150
164
342
196
202
72 36
226
219
511
431
444
26
241
398
391
512
250
382
67
418
109265 248
526
468
328
287
105
438
299
336
397
25
414
425
64
58
138
85
191
222
89
442
147
509
178
239
234
417
471
151
259
366
179
42
37 388 405
466
62
309
129
498
492
44
9628
303
409
390
59
319
273
23
335
175
445
447
480
503
499
325
29
337
276
130
385
174
381 212
111
315502
216
311
379
461
403
81
387
365
341
406
61
487
370
127
149
401
230 450
458
57
211
279
32
483
378
166 522
436
318
367
4
470
331
486
280
126
30
452
245
389
504
159
330
473
345
484
33
18171
258
92
497
260
121
131
146
165
462
272
485
218
267
404
488
39
334
430
455
355
376
277
293
182
237
52
426
71
402
467
99
213
140
433
6
208
183
286
160
97
506
123
205
153
49
393
119
283
375
420
115
478
16
274
422
163
408
120
386
244
514
518
27
297
479
116
326
333
439
524
207
290
199
80
48
392
227
421
114
496
11
132
220
43
400
223
515
31
125
427
141
517
316
476
10
448
508
235
181
158
82
278
449
477
441
395
364
46
229 66
296
424
185
288
161
47 112
443
157
358
7
456
195
356
162
76
217
373
170
137
359
412
490
101
168
156
360
339
155
78
268
501
275
233
351
413
56
215
332
313
231
435
192
525
173
232
3
176
2
327
289
291
113
368
489
8
371
352
69
505
372
210
494
523
428
416
1
411
255
301
225
521
236
399
55
193
142
74
350
357
460
344
347
380
377
383
124
110
60
107
20
12 457
93
77
180
194
281
139
38
338
432
45
361
106
410
65
423
54
251
84
228
474
294
70
184
363
247
353
374
145
243
284
475
320
203
5
118
169
122
73
221
307
384
507
343 440
83
63
214
271
464
369
198
300
172
493
322
68
429
189
481
100
317
79
324
396
312
134
117
495
415
144
346
491
510
354
34
95
472
459
257
209
103
148
269
446
323
519
246
143
14
270
50
94
302
469
437
9
154
104
90
419
394
53
238
295
186
500
304
91
254
256
13
513
188
348
224
454
206
108
285
51
187
21
177
362
264
242
453
329
22
482
340
249
349
35
88
98
41
86
200
51615
434
201
135
310
190
152
314
261
240
102
87
292
136
19
197
263
451 40
75
204
17
321

0

.01

128

.02
Normalized residual squared

24

.03

.04

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the the top right-hand area, no observations appear to
be influential.

. avplots: look for simultaneous x/y
1
-1

0
.5
e( scol | X )

1

0
.5
e( _Itencat_1 | X )

1

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

coef = -.0027571, se = .0491805, t = -.06

-1

-.5
0
.5
e( female | X )

1

coef = -.29395956, se = .03576118, t = -8.22

-.5

0
.5
e( nonwhite | X )

1

coef = -.06409284, se = .05772311, t = -1.11

0
-1
-10

-5
0
5
e( exper | X )

10

coef = .02854957, se = .00503299, t = 5.67

1

-1
-2

-2
-.5

coef = -.00057702, se = .00010737, t = -5.37

1

coef = .68798333, se = .05753517, t = 11.96

0

e( lwage | X )

-1

0

e( lwage | X )

600

0
.5
e( ccol | X )

1

1

coef = .32030543, se = .05467635, t = 5.86

1
0
-1
-2
-400 -200
0
200 400
e( exper2 | X )

-.5

0

coef = .22241643, se = .04881795, t = 4.56

-.5

-2

-2

1

-1

0
.5
e( hsc | X )

-2

-.5

e( lwage | X )

-1

e( lwage | X )

1
-1

0

e( lwage | X )

0
-1
-2

-2

-1

0

e( lwage | X )

1

1

2

extremes.

-1

-.5
0
.5
e( _Itencat_2 | X )

1

coef = .22096745, se = .04916835, t = 4.49

-1

-.5
0
.5
e( _Itencat_3 | X )

1

coef = .28798112, se = .0557211, t = 5.17

. avplot _Itencat_3, ml(id), ml(id)

-1

0

1

59
15 186
381
343
440
66
172
260
105
61
41
522
203
112
58
107
12
89170
95
18
450
229 42
98
177
33
171
46
17
497
140
198 405
104
142
324 505
183 444
16
92
436
25
62
326
278
284
265
352
10
29
88 52 386
355
200
513 230
217
256
378
525
476
76
31
175
220
411
421
275
154
40
399
21
383
94
245
339
235
72
158
144
515
202
167
283
325
214394
518
478
449 13488 197
178
406
35
410
227
400
196
168
334
294
165
279
420
454
68
149
417
4
484
106
299
127
385
182
252
37
176
208
423
209
79
249
181
370
302
8
468
521
187
287
194
156
44
162
22
45
486
26
401
63
267
34
122
1
307
90
281
431
375
328
179
96
356
338
195
143
84
304
130
456
110
65
239
469
28
64
5 199
30
101
251
32 427398
474
228
354
416
231
377
433
314
428
489
85
508
206
408
395
479
173
463 379
345185
269
435
11
434
246
500
430
164
138
131
319
348
366
155
78
519
83
54
425
71
308
210
121
321
362
443
318
407
329
108
201
393
357
7 372
441
169255
499
257
317
477
75
234
501
141
134
467
342
215
510 6
471
481
146
461
459
23
67
113
458
475
263
189
268
253
93
226
163
288
373
293
103
323
437
297
460
396
271
259
397184
424
413
159
462
233
221
266
298
472
274
514
311
520
145
344
365
367
496
270
292
494
191
351
213
330
114
91
384
43
157
117
523
332
272
273
20
418
211
358
409
99
422
491
353
503
102
240
232
3
526
442
309
403
336
465
148
19
414
27
495
161
180
363
129
361
419
412
452
77 359
296
216
285
512
280
264160
190
147
438
306387
337
493119
380
333
360
225
118
115
12073
404
368
192
14
374290
135
133
126
111
341 241
81
364
116
100
222
139
320
504
482
340
402
204
415
247
511
517
87
262
2
446
53
57
80
261
243
506
97
39498
132
250480
451
388464
432
485
50
310
38 312
237
316
207
335
524
322
483
48
369
509
238
507
82
86
347
301
429
313
236
305
376
223
392
153
224
70
455
9
49
350
152
109
490
124
242
258
151
390
219205
56
136
439
303 277
371
123
137 327
473
389
295
74
300
254
349
218
470
244
445
69
331
55
289
291
276 174
502
346
315
391
60
248
36
286426
166 188
47
492193
282
457
382
150
448
447
453
212
487
51
125
516
128

466

-2

24

-1

-.5

0
e( _Itencat_3 | X )

.5

coef = .28798112, se = .0557211, t = 5.17

 There again is id=24. Why isn’t it a problem?

1

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Ixtenure_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
.03

24

.02

150
128
59
58

282

212

105

260

382
381
487

.01

125
448
440
457
447
453
436
450

522
186
42
89
343
516
36 61
172 203
66
166
248
29
5162
12
492
174
229
276
16 41
391
112
502
405
188
72
241
171
47
167
426
230
18
315
473
355 390
193 218
175
25 52 74 95107 142 170
235 265286
497
178
300 324
505
177198
388
513
245
1731
291
498
3346
69
202
217
378
444
305
449
92
98
60
346
352
465
140
524
289
347
55
258
326
515
104
183
386
438
399
287
369
525
151
327
341
406
278
283
303
411
488
254
421
70
13 2838
371
464
196
277
339
10
123
137
509
88
144
244
284
445
39
312
49
331
111
237
57
483
40
82
94
127
158
476
120
149
197
242
316
223
410
470
56
208
219
295
350
73
115
431
490
165
262
275
455
7686 109
3748
299
325
389
506
376
429
200
224
9 21
139
220
227
320
27
153
154
517
328
420
35
256
349
364
64
68
136
180
236
252
296
335
400
322
392
179
290
119
407
526
374
222
412
439
50
216
313
360
507
511
521
417
156
168
207
279
238
485
97
182
380
53
106
26
81
110
124
126
133
152
160
214
249
394
2
23
147
162
181
205
383
385
423
118
301
96
294
4
122
414
454
211
130
191
192
336
518
1
270
337
353
367
370
418
478
514
14
194
264
361
402
415
493
45
100
164
314
375
384
430
432
6
129
239
247
250
297
310
356
366
408
451
195
213
319
334
348
422
456
811
99
132
261
272
280
281
333
365
401
419
425
80
87
90
143
204
269
308
395
437
484
512
44
103
159
161
228
243
309
416
446
461
468
469
474
508
65
116
209
225
288
338
354
368
403
404
413
428
452
471
30
34
71
79
84
138
148
176
187
255
302
332
340
373
387
435
475
482
489
510
3
5
22
85
135
169
199
206
232
267
274
344
458
480
491
495
504
63
83
91
141
190
215
233
240
246
251
273
293
304
307
363
472
523
20
67
101
146
210
257
285
321
342
345
359
379
393
409
442
460
494
496
500
501
519
7
19
32
75
77
102
108
113
114
117
131
134
145
163
173
185
201
231
234
253
259
266
292
306
311
317
330
351
358
377
397
427
433
434
441
443
459
467
477
479
481
486
499
43
54
78
93
121
155
157
184
189
221
226
263
268
271
298
318
323
329
357
362
372
396
398
424
462
463
466
503
520

0

15

0

100

200

300
id

 Note id=24.

400

500

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24
To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if

possible.

 Examine the model’s adequacy (see the
sections on model specification & nonconstant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLAATS web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers.

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’

 lvr2 & hadimov are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to

cause problems in small samples.
 In a normal distribution we expect
5% of the observations to be outliers.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 94

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to determine
which of the explanatory variables has a
significant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) – ‘VIF’ > 10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the factor by which
an explanatory variable’s standard error is increased by
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’) – ‘Tolerance’ < .10: the reciprocal of VIF; measures the extent of a
predictor’s variance that is independent of the other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’ > 15, or especially > 30.

. vif

    Variable |    VIF     1/VIF
-------------+----------------------
       exper |  15.62   0.064013
      exper2 |  14.65   0.068269
         hsc |   1.88   0.532923
  _Itencat_3 |   1.83   0.545275
        ccol |   1.70   0.589431
        scol |   1.69   0.591201
  _Itencat_2 |   1.50   0.666210
  _Itencat_1 |   1.38   0.726064
      female |   1.07   0.934088
    nonwhite |   1.03   0.971242
-------------+----------------------
    Mean VIF |   4.23

 The seemingly troublesome scores for exper &
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.
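 To make the square-root-of-VIF interpretation concrete, here’s a minimal sketch using exper’s figures from the output above (display is just Stata’s calculator):

. display sqrt(15.62)    // ≈ 3.95: the factor by which multicollinearity inflates exper’s standard error
. display 1/15.62        // ≈ .064: exper’s tolerance (1/VIF), matching the table above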

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command; a minimal sketch of centering follows this list.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).
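 Here’s a minimal sketch of centering by hand, assuming exper were the offending variable (the ‘center’ command mentioned above automates the first two steps):

. su exper, meanonly
. gen exper_c = exper - r(mean)    // centered experience: mean 0, same spread
. gen exper_c2 = exper_c^2         // quadratic term rebuilt from the centered variable
. xi:reg lwage hs scol ccol exper_c exper_c2 i.tencat female nonwhite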

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100-200,
the central limit theorem says that
confidence intervals & p-values will be
good approximations.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can assess
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

. hist rstu, norm

[Histogram of the studentized residuals with a normal-density overlay: density (0 to .5) on the y-axis, studentized residuals (-4 to 4) on the x-axis.]

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test

             Source |   chi2   df        p
--------------------+----------------------
 Heteroskedasticity |  21.27   13   0.0677
           Skewness |   4.25    4   0.3733
           Kurtosis |   2.47    1   0.1160
--------------------+----------------------
              Total |  27.99   18   0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)      =  15.76
Prob > chi2  =  0.0001
 There seem to be problems. Let’s inspect the individual predictors.

. estat hettest, rhs mt(sidak)

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
----------------------------------------------
    Variable |   chi2   df        p
-------------+--------------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   1.56    1   0.9077 #
      exper2 |   0.10    1   1.0000 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
-------------+--------------------------------
simultaneous |  23.68   10   0.0085
----------------------------------------------
# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 Both hettest & szroeter say that the serious
problem is with tenure.

. estat szroeter, rhs mt(sidak)

Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
---------------------------------------
    Variable |   chi2   df        p
-------------+-------------------------
         hsc |   0.14    1   1.0000 #
        scol |   0.17    1   1.0000 #
        ccol |   2.47    1   0.7078 #
       exper |   3.20    1   0.5341 #
      exper2 |   3.20    1   0.5341 #
  _Itencat_1 |   0.44    1   0.9992 #
  _Itencat_2 |   0.00    1   1.0000 #
  _Itencat_3 |  10.03    1   0.0153 #
      female |   1.02    1   0.9762 #
    nonwhite |   0.03    1   1.0000 #
---------------------------------------
# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

. rvfplot, yline(0) ml(id)

[The same residuals-versus-fitted plot, repeated for reference: residuals (about -2 to 1) against fitted values (about 1 to 2.5), points labeled by id; ids 24, 128, 12, 260, & 58 mark the low-end outliers.]

. rvpplot _Itencat_3, yline(0) ml(id)

[Residuals-versus-predictor plot for _Itencat_3 (tencat==3, coded 0/1): residuals from about -2 to 1 on the y-axis; the residual spread differs across the two categories, & ids 282, 188, 457, 150, 448, 212, 128, & 24 mark the low outliers.]

 By the way, note id=24.

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to
non-constant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # of observations at the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information.
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression (a minimal
sketch follows this list).
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).
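 As a minimal sketch of the weighted least squares idea: if we had an estimate of each observation’s error variance in a variable—call it varhat, which is hypothetical here—we would weight each case by its reciprocal:

. xi:reg lwage hs scol ccol exper exper2 i.tencat female nonwhite [aweight=1/varhat]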

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)

----------------------------------------------
    Variable |      m1           m2_robust
-------------+--------------------------------
         hsc |  .22241643***    .22241643***
        scol |  .32030543***    .32030543***
        ccol |  .68798333***    .68798333***
       exper |  .02854957***    .02854957***
      exper2 | -.00057702***   -.00057702***
  _Itencat_1 | -.0027571       -.0027571
  _Itencat_2 |  .22096745***    .22096745***
  _Itencat_3 |  .28798112***    .28798112***
      female | -.29395956***   -.29395956***
    nonwhite | -.06409284      -.06409284
       _cons |  1.1567164***    1.1567164***
-------------+--------------------------------
           N |        526             526
----------------------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that
non-constant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use robust standard
errors, so that we can explore additional
diagnostics.

 Our final model, however, will use
robust standard errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither a cluster sample nor panel or
time-series data.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
estat bgodfrey command for the Breusch-Godfrey
Lagrange multiplier test.
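 A minimal sketch, purely for illustration since these data are not time series (t, y, x1, & x2 are hypothetical):

. tsset t                       // declare the time variable
. reg y x1 x2
. estat bgodfrey, lags(1 2)     // Breusch-Godfrey LM test for 1st- & 2nd-order serial correlation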

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation falls from
the mean of y on the Y-axis;
leverage: how far the observation falls from the
mean of x on the X-axis.

discrepancy × leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.
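 A minimal sketch of flagging such cases by the ±3 rule:

. predict rstu if e(sample), rstu
. list id rstu if abs(rstu) > 3 & rstu < .    // the ‘& rstu < .’ guards against missing values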

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.
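 A minimal sketch of that cutoff, assuming k counts the model’s coefficients including the constant (11 in our model) & n = 526:

. predict h if e(sample), hat
. display 3*11/526                  // ≈ .063
. list id h if h > 3*11/526 & h < .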

 Whereas studentized residuals & hat
values each measure potential influence,
Cook’s Distance & DFITS measure the
actual influence of an observation on the
overall fit of a model.
 Cook’s Distance & DFITS values of 1 or
more, or 4/n or more (in large samples),
suggest substantial influence (of 1 or more
standard deviations) on the model’s overall
fit.
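 A minimal sketch using the 4/n cutoff with n = 526:

. predict d if e(sample), cooksd
. predict dfits if e(sample), dfits
. display 4/526                     // ≈ .0076
. list id d dfits if d > 4/526 & d < .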

 DFBETAs also measure the actual influence of
observations, in this case on particular slope
coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure
of observations’ influence on the slope
coefficients.
 A DFBETA of 1 means that the observation shifts
the corresponding slope coefficient by 1 standard
error: DFBETAs of 1 or more, or of at least
2/sqrt(n) (in large samples), represent
influential outliers.
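 A minimal sketch of the size-adjusted cutoff with n = 526 (DF_Itencat_3 is one of the variables the dfbeta command creates in this session):

. dfbeta
. display 2/sqrt(526)               // ≈ .087
. list id DF_Itencat_3 if abs(DF_Itencat_3) > 2/sqrt(526) & DF_Itencat_3 < .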

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)

[Leverage-versus-residual-squared plot: leverage (about 0 to .06) on the y-axis, normalized residual squared (0 to .04) on the x-axis, points labeled by id; ids 24 & 128 sit far out on the residual axis, ids 465 & 306 sit high on the leverage axis, & no point is extreme on both.]

 There are signs of a high residual point & a high
leverage point (id=24), but, given that no points appear
in the top right-hand area, no observations appear to
be influential.

. avplots    // look for simultaneous x/y extremes

[Added-variable plots of e(lwage | X) against e(x | X) for each predictor:
   hsc:        coef =  .22241643, se = .04881795, t =  4.56
   scol:       coef =  .32030543, se = .05467635, t =  5.86
   ccol:       coef =  .68798333, se = .05753517, t = 11.96
   exper:      coef =  .02854957, se = .00503299, t =  5.67
   exper2:     coef = -.00057702, se = .00010737, t = -5.37
   _Itencat_1: coef = -.0027571,  se = .0491805,  t =  -.06
   _Itencat_2: coef =  .22096745, se = .04916835, t =  4.49
   _Itencat_3: coef =  .28798112, se = .0557211,  t =  5.17
   female:     coef = -.29395956, se = .03576118, t = -8.22
   nonwhite:   coef = -.06409284, se = .05772311, t = -1.11]

. avplot _Itencat_3, ml(id)

[Added-variable plot for _Itencat_3: e(lwage | X) from about -2 to 1 against e(_Itencat_3 | X) from about -1 to 1, points labeled by id; id=24 again sits at the bottom of the plot. coef = .28798112, se = .0557211, t = 5.17]

 There again is id=24. Why isn’t it a problem?

 So far, we see no problems with
influential observations.
 Next let’s numerically & graphically
examine the studentized residuals
(rstu), hat values (h), Cook’s Distance
(d), & dfbeta (DF_).

. predict rstu if e(sample), rstu
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id

[Scatterplot of Cook's Distance d (0 to about .03) against id (1 to 526); id=24 has the largest distance (about .03), followed by ids 150, 128, 59, 58, 282, & 212—all far short of the cutoff of 1.]

 Note id=24.

. list rstu h d DF_Itencat_1 DF_Itencat_2
DF_Itencat_3 wage educ exper tenure
female nonwhite if id==24

 To repeat, there’s no problem of influential
outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors if
possible.

 Examine the model’s adequacy (see the
sections on model specification &
non-constant variance) & make adjustments
(e.g., adding omitted variables,
interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression
(qreg, iqreg, sqreg, bsqreg) or robust regression
(rreg). See the Stata manual; Hamilton, Statistics with
Stata; & Chen et al., Regression with Stata (UCLA-ATS
web book).
 Quantile regression works with y-outliers only,
while robust regression works with x-outliers; a
minimal sketch of both follows.
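 A minimal sketch of those alternatives, assuming the same specification as our model (qreg fits a median regression; rreg is Stata’s robust regression):

. xi:qreg lwage hs scol ccol exper exper2 i.tencat female nonwhite
. xi:rreg lwage hs scol ccol exper exper2 i.tencat female nonwhite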

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
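 A minimal sketch, assuming hadimvo’s classic syntax in which generate() creates a flag for the observations identified as multivariate outliers:

. hadimvo wage educ exper tenure, gen(multiout)
. list id wage educ exper tenure if multiout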

 lvr2plot & hadimvo are very useful tools
that cut through lots of the other
outlier/influence diagnostics.

 Recall that outliers are most likely to
cause problems in small samples.
 In a normal distribution we expect about
5% of the observations to fall more than two
standard deviations from the mean.
 Don’t over-fit a model to a
sample—remember that there is
sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite, robust


Slide 95

V. Regression Diagnostics

 Regression

analysis assumes a
random sample of independent
observations on the same
individuals (i.e. units).
 What are its other basic
assumptions? They all concern
the residuals (e):

(1) The mean of the probability
distribution of e over all possible
samples is 0: i.e. the mean of e does
not vary with the levels of x.
(2) The variance of the probability
distribution of e is constant for all levels
of x: i.e. the variance of the residuals
does not vary with the levels of x.

(3) The errors associated with any
two different y observations are
0: i.e. the errors are
uncorrelated—the errors
associated with one value of y
have no effect on the errors
associated with other y values.
(4) The probability distribution of

e is normal.

 The assumptions are commonly
summarized as I.I.D.: independent &
identically distributed residuals.
 To the degree that these assumptions
are confirmed, then the relationship
between the outcome variable &
independent variables is adequately linear.

What are the implications of these
assumptions?

 Assumption 1: ensures that the regression
coefficients are unbiased.
 Assumptions 2 & 3: ensure that the
standard errors are unbiased & are the
lowest possible, making p-values &
significance tests trustworthy.
 Assumption 4: ensures the validity of
confidence intervals & p-values.

 Assumption 4 is by far the least important:

even if the distribution of a regession model’s
residuals depart from approximate normality, the
central limit theorem makes us generally
confident that confidence intervals & p-values will
be trustworthy approximations if the sample is at
least 100-200.
 Problems with assumption 4, however, may be
indicative of problems with the other, crucial
assumptions: when might these be violated to the
degree that the findings are highly biased &
unreliable?

 Serious violations of assumption 1 result
from anything that causes serious bias of
the coefficients: examples?
 Violations of assumption 2 occur, e.g.,
when variance of income increases as the
value of income increases, or when
variance of body weight increases as body
weight itself increases: this pattern is
commonplace.

 Violations of assumption 3 occur as a
result of clustered observations or timeseries observations: variance is not
independent from one observation to
another but rather is correlated.

 E.g., in a cluster sample of individuals
from neighborhoods, schools, or
households, the individuals within any
such unit tend to be significantly more
homogeneous than are individuals in the
wider sample. Ditto for panel or timeseries observations.

 In the real world, the linear model is usually
no better than an approximation, & violations
to one extent or another are the norm.
 What matters is if the violations surpass
some critical threshold.
 Regression diagnostics: procedures to
detect violations of the linear model’s
assumptions; gauge the severity of the
violations; & take appropriate remedial
action.

 Keep in mind: statistical vs. practical
significance in evaluating the findings
of diagnostic tests.

 See King et al. for applications of the

logic of regression diagnostics to
qualitative social science research as
well.

 Keep in mind that the linear model does
not assume that the distribution of a
variable’s observations is normal.
 Its assumptions, rather, involve the
residuals (which are the sample estimates
of the population e).
 While it’s important to inspect univariate &
bivariate distributions, & to be careful about
extreme outliers, recall that multiple
regression expresses the joint,
multivariate associations of x’s with y.

 Let’s turn our attention to regression
diagnostics.
 For the sake of presenting the
material, we’ll examine & respond to
the diagnostic tests step by step.
 In ‘real life,’ though, we should go
through the entire set of diagnostic
tests first & then use them fluidly &
interactively to address the
problems.

Model Specification
 Does a regression model properly
account for the relationship between the
outcome & explanatory variables?
 See Wooldridge, Introductory
Econometrics, pages 289-94; Allison,
Multiple Regression, pages 49-52, 123-25;
Stata Reference G-M, pages 274-79; N-R,
363-65.

 If a model is functionally misspecified,
its slope coefficients may be seriously
biased, either too low or too high.
 We could then either under or
overestimate the y/x relationship; &
conclude incorrectly that a coefficient is
insignificant or significant.

 If this is a problem, perhaps the

outcome variable needs to be redefined
to properly account for the y/x
relationships (e.g., from ‘miles per gallon’
to ‘gallons per mile’).

 Or perhaps, e.g., ‘wage’ needs to be
transformed to ‘log(wage)’.
 And/or maybe not OLS but another kind
of regression—e.g., quantile regression—
should be used.

 Let’s begin by exploring the

variables we’ll use.
. use WAGE1, clear
. hist wage, norm

. gr box wage, marker(1, mlab(id))
. su wage, d
. ladder wage
 Note that ‘ladder wage’ doesn’t

suggest a transformation, but log
wage is common for right skewness.

. gen lwage=ln(wage)
. su wage lwage
. hist lwage, norm
. gr box lwage, marker(1, mlab(id))
 While a log transformation makes wage’s
distribution much more normal, it leaves an
extreme low-value outlier, id=24.
 Let’s inspect its profile:

. list id wage lwage educ exp tenure female
nonwhite if id==24

 It’s a white female with 12 years of
education, but earning a very low wage.
 We don’t know if its wage is an error, we’ll
keep an eye on id=24 for possible problems.
 Let’s exam the independent variables:

.

. hist educ
. gr box educ, marker(1, mlab(id))
. su educ, d
. sparl lwage educ
. sparl lwage educ, quad
. twoway qfitci lwage educ
. gen educ2=educ^2
. su educ educ2

. hist exper, norm
. gr box exper, marker(1, mlab(id))
. su exper, d
. exper, ladder
. sparl lwage exper
. sparl lwage exper, quad
. twoway qfitci lwage exper

. gen exper2=exper^2
. su exper exper2

. hist tenure, norm
. gr box tenure, marker(1, mlab(id))
. su tenure, d
. su tenure if tenure>=25 & tenure<.

 Note that there are only 18 cases of
tenure>=25, & increasingly fewer cases with
greater tenure.
. ladder tenure
. sparl lwage tenure

. sparl lwage tenure, quad
 Note that there are cases of tenure=0,
which must be accommodated in a log
transformation.

. gen ltenure=ln(tenure + 1)
. su tenure ltenure
. sparl lwage tenure
. sparl lwage tenure, quad
. sparl lwage ltenure

[i.e. transformed]

. twoway qfitcit lwage tenure
. twoway qfitcit lwage ltenure
 What other options could be explored for
educ, exper, & tenure?

 The data do not include age, which could
be an important factor, including as a control.
. tab female, su(lwage)
. ttest lwage, by(female)

. tab nonwhite, su(lwage)
. ttest lwage, by(nonwhite)
 Although nonwhite tests insignificant, it might
become significant in the model.

 Let’s first test the model omitting the
transformed version of the variables:
. reg wage educ exper tenure female nonwhite

 A form of model misspecification is
omitted variables, which also causes
biased slope coefficients (see Allison,
Multiple Regression, pages 49-52).

 Leaving key explanatory variables out
means that we have not adequately
controlled for important x effects on y.
 Thus we may incorrectly detect or fail to
detect y/x relationships.

 In

STATA we use ‘estat ovtest’ (also known
as regression specification test, RESET) to
indicate whether there are important omitted
variables or not.
 ‘estat ovtest’ adds polynomials to the
model’s fitted values: we want it to test
insignificant so that we fail to reject the null
hypothesis that there are no important
omitted variables.
 We want to fail to reject the null hypothesis
that the model has no omitted variables.

. reg wage educ exper tenure female
nonwhite
. estat ovtest
Ramsey RESET test using powers of the fitted values of
wage
Ho: model has no omitted variables
F(3, 517) =
9.37
Prob > F =
0.0000

 The model fails. Let’s add the
transformed variables.

. reg lwage educ educ2 exper exper2 ltenure
female nonwhite

. estat ovtest
. estat ovtest
Ramsey RESET test using powers of the fitted values of
lwage
Ho: model has no omitted variables
F(3, 515) =
2.11
Prob > F =
0.0984

 The model passes. Perhaps it would do
better if age were available, and with other
variables & other forms of these predictors.

 estat ovtest is a decisive hurdle to clear.
 Even so, passing any diagnostic test
by no means guarantees that we’ve
specified the best possible model,
statistically or substantively.
 It could be, again, that other models
would be better in terms of statistical fit
and/or substance.

 Another way to test the model’s functional
specification via ‘linktest’: it tests whether y
is properly specified or not.
 linktest’s ‘hatsq’ must test insignificant:
we want to fail to reject the null hypothesis
that y is specifed correctly.

. linktest
--------------------------------------------------------------------------------------------------lwage |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+------------------------------------------------------------------------------------_hat | .3212452 .3478094
0.92 0.356
-.36203
1.00452
_hatsq | .2029893 .1030452
1.97 0.049
.0005559 .4054228
_cons | .5407215 .2855868
1.89 0.059
-.0203167 1.10176
-----------------------------------------------------------------------------------------------------

 The model fails. Let’s try another
model that uses categorized predictors.

. xi:reg lwage hs scol ccol exper exper2
i.tencat female nonwhite

. estat ovtest=.12
. linktest=.15

 We’ll stick with this model—unless other
diagnostics indicate problems. Again, too
bad the data don’t include age.

 Passing ovtest, linktest, or other
diagnostic indicators does not mean
that we’ve necessarily specified the
best possible model—either
statistically or substantively.
 It merely means that the model has
passed some minimal statistical threshold
of data fitting.

 At this stage it’s helpful to plot the model’s
residual versus fitted values to obtain a
graphic perspective on the model’s fit &
problems.

1

. rvfplot, yline(0) ml(id)
0

172

343

260

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
95
17789
62
107
171
142
324
513
170
98
198
355
23518
505
405217
230
497
104
35231 46
449 175
52378
40197
13
33
245
88
399
140
92
525
515
72
326278
411
386
144
183
284
178
283
167
265
339
94
488410 25 421 15421
10
406
200
35
158
476
256
444
275
202
76
208
68
28
287
127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383 518431
162
279 122
182
181
64 294
328
179 4 314
417
1
385
194
45
348
478 26
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
366
484
508
187
302
288
255
164
34
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
307
267
500
11
251
138
519
85
101
257
21083 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486
372
503
268
396
39843
520
271
121
362
462102
357
18418517354 25967 157
329
318
155
78
221
226
93
298
463
424
466
323
266
306
351
377
499
311
433
77
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
19
273
393
20
494
293
480
141
458
215
523159
233
504
363
471
387
472
274
190
91
309
232
3
491
240
495 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126
205
160
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
9
56
39
82
350
49
224
86
136
341
244
123258
277 73 137
295
38
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

12

128

-2

24

1

1.5

2
Fitted values

 Problems of heteroscedasticity?

2.5

 So, even though the model passed
linktest & estat ovtest, at least one
basic problems remains to be
overcome.
 Let’s examine other assumptions.

 The model specification tests have to do
with biased slope coefficients, & thus with
Assumption 1.
 Next, though, let’s examine a potential
model problem—multicollinearity—which in
fact does not violate any of the regression
assumptions.

Multicollinearity
 Multicollinearity: high correlations
between the explanatory variables.

 Multicollinearity does not violate any linear
model assumptions: if our objective were
merely to predict values of y, there would be
no need to worry about it.
 But, like small sample or subsample size, it
does inflate standard errors.

 It thereby makes it tough to find out
which of the explanatory variables has a
signficant effect.
 For confidence intervals, p-values, &
hypothesis tests, then, it does cause
problems.

 Signs of multicollinearity:
(1) High bivariate correlations (say, .80+)
between the explanatory variables: but,
because multiple regression expresses
joint linear effects, such correlations (or
their absence) aren’t reliable indicators.
(2) Global F-test is significant, but none of the
individual t-tests is significant.

(3) Very large standard errors.

(4) Sign of coefficients may be opposite of
hypothesized direction (but could be
Simpson’s paradox, etc.)
(5) Adding/deleting explanatory variables
causes large changes in coefficients,
particularly switches between positive &
negative.

The following indicators are more reliable
because they better gauge the array of
joint effects within a model:

(6) Post-model estimation (STATA command ‘vif’) ‘VIF’>10: measures inflation in variance due to
multicollinearity.
 Square root of VIF: shows the amount of increase in
an explanatory variable’s standard error due to
multicollinearity.

(7) Post-model estimation (STATA command ‘vif’)‘Tolerance’<.10: reciprocal of VIF; measures extent of
variance that is independent of other explanatory variables.
(8) Pre-model estimation (downloadable STATA command
‘collin’) – ‘Condition Number’>15 or especially >30.

.

. vif
Variable |
VIF
1/VIF
-------------+----------------------------exper | 15.62 0.064013
exper2 | 14.65 0.068269
hsc |
1.88 0.532923
_Itencat_3 |
1.83 0.545275
ccol |
1.70 0.589431
scol |
1.69 0.591201
_Itencat_2 |
1.50 0.666210
_Itencat_1 |
1.38 0.726064
female |
1.07 0.934088
nonwhite |
1.03 0.971242
-------------+---------------------------Mean VIF |
4.23

 The seemingly troublesome scores for exper
exper2 are an artifact of the quadratic form &
pose no problem. The results look fine.

What would we do if there were a
problem of multicollinearity?


 Correct any inappropriate use of explanatory
dummy variables or computed quantitative
variables.

 ‘Center’ the offending explanatory variables (see
Mendenhall/Sincich), perhaps using STATA’s
‘center’ command.

 Eliminate variables—but this might cause
specification errors (see, e.g., ovtest).

 Collect additional data—if you have a big bank
account & plenty of time!
 Group relevant variables into sets of variables
(e.g., an index): how might we do this?

 Learn how to do ‘principal components analysis’
or ‘principal factor analysis’ (see Hamilton).

 Or do ‘ridge regression’ (see Mendenhall/
Sincich).

 Let’s skip to Assumption 4: that the residuals
are normally distributed.

 While this is by far the least important
assumption, examining the residuals at this
stage—as we began to do with rvfplot—can tip
us off to other problems.

Normal Distribution of Residuals
 This is necessary for confidence intervals
& p-values to be accurate.

 But this is the least worrisome problem in
general: if the sample is as large as 100200, the central limit theorem says that
confidence intervals & p-values will be
good approximations of a normal
distribution.

 The most basic way to assess the normality of
residuals is simply to plot the residuals via
histogram, box or dotplot, kdensity, or normal
quantile plot.
 We’ll use studentized (a version of
standardized) residuals because we can asses
their distribution relative to the normal distribution:
. predict rstu if e(sample), rstu
. su rstu, d
Note: to obtain unstandardized residuals—
predict e if e(sample), resid

.5
.4
.3
.2
.1
0

Density

. hist rstu, norm

-4

-2

0
Studentized residuals

2

4

 Not bad for the assumption of normality, but
the low-end outliers correspond to the earlier
evidence of heteroscedasticity.

 estat imtest (information matrix test) gives us a formal test
of the normal distribution of residuals—which they really
don’t need to pass—plus it leads us into the assessment of
non-constant variance:
Cameron & Trivedi's decomposition of IM-test
Source

chi2

df

p

Heteroskedasticity 21.27

13

0.0677

Skewness

4.25

4

0.3733

Kurtosis

2.47

1

0.1160

Total

27.99

18

0.0621

 Normality (skewness) is good, but the model just
edges by with respect to non-constant variance
(p=.0677). Let’s investigate.

Non-Constant Variance
 If the variance changes according to the levels
of the explanatory variables—i.e. the residuals
are not random but rather are correlated with the
values of x—then:
(1) the OLS standard errors are not optimal:
alternative approaches such as weighted least
squares would give better estimates; &
(2) the standard errors are biased either up or
down, making statistical significance either too
hard or too easy to detect.

 In STATA we test for non-constant
variance by means of:

(1) tests with estat: hettest or szroeter or
imtest; &
(2) graphs: rvfplot & rvpplot.
 We want the tests to turn out insignificant
so that we fail to reject the null hypothesis
that there’s no heteroscedasticity.

Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of lwage

chi2(1)
= 15.76
Prob > chi2 = 0.0001
 There seem to be problems. Let’s

inspect the individual predictors.

. estat hettest, rhs mt(sidak)
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
---------------------------------------------Variable |
chi2 df
p
-------------+------------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
1.56 1 0.9077 #
exper2 |
0.10 1 1.0000 #
_Itencat_1 | 0.44 1 0.9992 #
_Itencat_2 | 0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
-------------+-------------------------------simultaneous | 23.68 10 0.0085
----------------------------------------------# Sidak adjusted p-values

 It seems that the problem has to do with
‘tenure.’

. estat szroeter, rhs mt(sidak)

 both hettest & szroeter say that the serious
problem is with tenure.

. szroeter, rhs mt(sidak)
Szroeter's test for homoskedasticity
Ho: variance constant
Ha: variance monotonic in variable
--------------------------------------Variable |
chi2 df
p
-------------+------------------------hsc |
0.14 1 1.0000 #
scol |
0.17 1 1.0000 #
ccol |
2.47 1 0.7078 #
exper |
3.20 1 0.5341 #
exper2 |
3.20 1 0.5341 #
_Itencat_1 |
0.44 1 0.9992 #
_Itencat_2 |
0.00 1 1.0000 #
_Itencat_3 | 10.03 1 0.0153 #
female |
1.02 1 0.9762 #
nonwhite |
0.03 1 1.0000 #
--------------------------------------# Sidak adjusted p-values

 So, hettest & szroeter say that the high
category of tenure is to blame.
 What measures might be taken to
correct or reduce the problem?

 Let’s examine the graphs: rvfplot & rvpplot.

1

. rvfplot, yline(0) ml(id)

0

172

343

15
440 186
59

105
66
522
61
229 112
16
29
43642
17
450
89
95
62
107
171
142 177198 513
324
170
98
355
23518
505
405217 230
497
104
35231 46
449 175
52378
40
13
33
245
88
399
140
92
525
197
515
72
326278
411
386
183
284
178
283
25 421
167
265
339
94
488410 144
10
406
200
35
21
158
476
154
256
444
275
202
76
208
68
28
287 127
165
149
227
220
420
400
325
521
196
249
394
37
168
106
156
214
110
454
299
423
383
431
162
294
279
182
181
122
64
328
179
417
1
385
4 314
194
51826
45
348
478
407
96 239
370
375
356
130
430
195
408
456
8252
269
143
90 65
281
469 489
334395
228
474
209
416
401
338
428
319
354
484
508
187
302
288
255
164
34366
79
373
22
84
425
169
176
435
44
5
206
246
304
71
30
321
63
199
308
468
83
307
267
500
11
251
138
519
85
101
257
210 134
460
317
75
434
477
7 479427
330
459
443
397
32
481
234
231
131
253
114
163
263
189
486 54
372
503
268
396
398
520
43
271
121
362
462102
357
184185
329
318
155
78
221
226
93
298
463
424
466
323
266
157
306
351
377
499
311
433
77
259
67
173
342
467
358
292
442
117
345
441
496
113
108
145
409
201
501
146
379
359
273387
393
20
494159
293
480
141
458
215
523
233
504
363
471
472
274
190
91
309
232
3
491
240
49519 285
452
403
404
250
332 6
461512
116
148
368344
135
475
510
413
365
225
161
129
418
482
340
280
272
80
336
99
243
133
333
213
191
446
103
132
414
422
437
297
337
147
367
402
419
87
247
211
204
514
511
432
261
23
100
384
415
493
526
361
310
222
353
81
14
192
270
264
451
126 160
205
262
109
216
97
301
2219238
118
465
124
335
380
439
48
360
119207
53 313 153
412 290
152
507485
392
50
374
509
296506
389
364
305
27455
376
517
322
236
470
180
438
111
445
115
151
57
483
139
320
120
223
490
303
331
316
429
237
349
938
56
39
350
49
224
86
136
341
244
123258
27782 73
295
137
388
498
242
312
371
70 289
464
241
327
524
369 315254
390
55
391 347
218 346 60
69
291
473
286
248
36
426
300
193
492
74
502
174
276 447
47
166
282 382
188
457
453 51
487
150
125 448
516
212
41

-1

58
203381

260

12

128

-2

24

1

1.5

2
Fitted values

2.5

15
186
59
343
66
61
112
89
107
170
98
18
497
46
33
140
92
326
183
278
284
25
265
10
200
476
256
444
76
220
325
394
214
383
417
4
518
252
478
334
209
484
187
302
79
22
176
44
63
468
307
267
427
479
231
486
398
463
466
377
433
185
173
345
441
108
201
379
393
141
458
215
471
461
475
510
413
103
437
297
6
367
514
23
384
270
465
412
290
296
27
180
438
111
115
139
320
120
73
341
38
388
498
312
464
241
524
369
390
347
473
300
74

260
440
381
172
58
203
105
41
522
12
229
42
16
29
436
17
450
95
177
62
171
142
324
513
198
355
235
505
405
230
104
217
352
449
52
31
40
175
13
245
88
399
378
525
197
515
72
411
386
144
178
421
283
167
339
94
488
406
410
35
21
158
154
275
202
208
68
28
287
127
165
149
227
420
400
521
196
249
37
168
106
156
110
454
299
423
431
162
294
279
182
181
122
64
328
179
314
1
385
194
45
348
407
96
370
375
356
130
430
195
408
456
8
26
269
143
90
281
469
228
474
395
416
401
338
65
428
319
239
354
366
508
288
255
164
34
373
489
84
425
169
435
5
206
246
304
71
30
321
199
308
83
500
11
251
138
519
85
101
257
210
460
317
75
434
477
7
134
330
459
443
397
32
481
234
131
253
114
163
263
189
54
372
503
268
396
520
43
271
121
362
462
357
184
329
318
155
78
221
226
93
298
424
323
266
157
306
351
499
311
77
259
67
102
342
467
358
292
442
117
496
113
145
409
501
146
359
19
273
20
494
293
480
285
523
233
504
363
387
472
274
512
190
91
309
232
3
491
240
495
344
452
403
404
250
332
159
116
148
368
135
365
225
161
129
418
482
340
280
272
80
99
336
243
133
333
213
191
446
132
414
422
337
147
402
87
419
247
211
204
511
432
261
100
415
493
526
361
310
222
353
81
14
192
264
451
126
205
160
262
109
97
216
301
485
2
219
118
124
207
335
380
439
48
360
119
53
313
152
507
238
392
50
374
509
153
364
389
305
376
517
322
470
236
506
455
445
151
57
483
223
490
303
331
316
429
237
9
349
56
39
82
350
49
224
86
136
244
123
277
295
137
242
258
371
70
327
254
289
55
391
60
315
218
69
291
346
286
248
36
426
193
492
502
174
276
47
447
166
382
453
51
487
125
516

-1

0

1

. rvpplot _Itencat_3, yline(0) ml(id)

282
188
457
150
448
212
128

-2

24

0

.2

.4

.6
tencat==3

 By the way, note id=24.

.8

1

 What to do about tenure? Although the
model passed ovtest, adding omitted
variables is a principal response to nonconstant variance. I’m guessing that
including the variable age, which the data set
doesn’t have, would either solve or reduce
the problem. Why?

 Maybe the small # observations for the high
end of tenure matters as well.
 Other options:

 Try interactions &/or transformations
based on qladder & ladder.
 Categorizing a continuous predictor
(multi-level or binary) may work—
although at the cost of lost information).
 We also could transform the outcome
variable (see qladder, ladder, etc.),
though not in this example because we
had good reason for creating log(wage).

 A more complicated option would be to
use weighted least squares regression.
 If nothing else works, we could use
robust standard errors. These relax
Assumption 2—to the point that we
wouldn’t have to check for non-constant
variance (& in fact the diagnostics for
doing so wouldn’t work).

 Whatever strategies we try, we redo the
diagnostics & compare the new model’s
coefficients to the original model.
 The key question: is there a practically
significant difference in the models?
 My own data exploration finds that
nothing works, perhaps because tenure
is correlated with age, which the data set
does not include.

 What can we do?

 We can use robust standard errors.

 It’s quite common, recommended
practice to use robust standard errors
routinely, as long as the sample size isn’t
small.
 Doing so relaxes Assumption 2, which
we then no longer need to check.
 If we do use robust standard errors, lots of
our routine diagnostic procedures won’t work
because their statistical premises don’t hold.

 A reasonable, routine strategy is to use
robust standard errors at this point, then
re-estimate the model without the robust
standard errors & compare the difference.
 Even if we decide that we should stick
with robust standard errors, we can do the
following: estimate the model without
them; conduct the next rounds of
diagnostic tests; & then use robust
standard errors in our final model.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite
. est store m1
.

. xi:reg lwage hs scol ccol exper exper2 i.tencat
female nonwhite, robust
. est store m2_robust
. est table m1 m2_robust, star stats(N)

. est table m1 m2_robust, star stats(N)
---------------------------------------------Variable |
m1
m2_robust
-------------+-------------------------------hsc | .22241643***
.22241643***
scol | .32030543***
.32030543***
ccol | .68798333***
.68798333***
exper | .02854957***
.02854957***
exper2 | -.00057702***
-.00057702***
_Itencat_1 | -.0027571
-.0027571
_Itencat_2 | .22096745***
.22096745***
_Itencat_3 | .28798112***
.28798112***
female | -.29395956***
-.29395956***
nonwhite | -.06409284
-.06409284
_cons | 1.1567164***
1.1567164***
-------------+-------------------------------N |
526
526
---------------------------------------------legend: * p<0.05; ** p<0.01; *** p<0.001

 There’s no difference at all!

 See Allison, who points out that nonconstant variance has to be
pronounced in order to make a
difference.
 It’s a good idea, in any case, to
specify robust standard errors in a
final model.

 For now, we won’t use

robust standard errors so that
we can explore additional
diagnostics.

 Our final model, however,
will use robust standard
errors.

Correlated Errors
 In the case of these data there’s no need
to worry about correlated errors: the sample
is neither cluster nor panel or time series.
 In general there’s no straightforward way
to check for correlated errors.
 If we suspect correlated errors, we
compensate in one or more of the following
three ways:

(1) by using robust standard errors;
(2) if it’s a cluster sample, by using STATA’s
cluster option with the sample-cluster
variable.
. xi:reg wage educ educ2 exper i.tenure, robust
cluster(district)
 But again, our data aren’t based on a cluster
sample.

(3) if it’s time-series data, by using Stata’s
bygodfrey option for Breusch-Godfrey
Lagrange Multiplier.

This model seems to be satisfactory from the
perspective of linear regression’s assumptions
with the exception of an insignificant problem
with non-constant variance.
 But there’s another potential problem:
influential outliers.
 Particularly in small samples, OLS slope
estimates can be strongly influenced by
particular observations.

 An observation’s influence on the slope
coefficients depends on its discrepancy &
leverage:
discrepancy: how far the observation on the
Y-axis falls from the mean for y;
leverage: how far the observation on the Xaxis falls from the mean for x.

discrepancy + leverage = influence
 Highly influential observations are most
likely to occur in small samples.

 Discrepancy is measured by residuals, a
standardized version of which is the studentized
residual, which behaves like a t or z statistic:
 Studentized residuals of –3 or less or +3 or
more usually represent outliers (i.e. y-values with
high residuals), which indicate potential influence.
 Large outliers can affect the equation’s constant,
reduce its fit, & increase its standard errors, but
they don’t influence the regression coefficients.

 Leverage is measured by hat value, a
non-negative statistic that summarizes how
far the explanatory variables fall from their
means: greater hat values are farther from
their x-mean.

 Hat values are likely to be greater in small
or moderate samples: values of 3*k/n or
more are relatively large & indicate
potential influence.

 Whereas studentized residuals & hat values each measure potential influence, Cook's Distance & DFITS measure the actual influence of an observation on the overall fit of the model.
 Cook's Distance & DFITS values of 1 or more, or of 4/n or more (in large samples), suggest substantial influence (on the order of 1 or more standard deviations) on the model's overall fit.
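
A minimal sketch using both statistics & the cutoffs just given:

. predict d if e(sample), cooksd
. predict dfit if e(sample), dfits
. list id d dfit if (d > 4/e(N) & !missing(d)) | abs(dfit) > 1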

 DFBETAs also measure the actual influence of observations, in this case on particular slope coefficients (e.g., DFeduc, DFexper, & DFtenure).
 DFBETAs, then, provide the most direct measure of an observation's influence on particular slope coefficients.
 A DFBETA of 1 means that deleting the observation would shift the corresponding slope coefficient by 1 standard error: DFBETAs of 1 or more, or of at least 2/√n (in large samples), represent influential outliers.
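
A minimal do-file sketch of the 2/√n screen (the DF_* variable names follow the dfbeta output shown later in this section; newer Stata releases name these variables _dfbeta_* instead):

dfbeta
foreach v of varlist DF_* {
    list id `v' if abs(`v') > 2/sqrt(e(N)) & !missing(`v')
}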

 But before we examine these influence
indicators, let’s examine some graphic
indicators:
. lvr2plot, ml(id)
. avplots
. avplot _Itencat_3, ml(id)

. lvr2plot, ms(i) ml(id)
[leverage-versus-residual-squared plot: leverage on the vertical axis against normalized residual squared on the horizontal axis, each point labeled with its id; one point (id=24) stands out for leverage & another for its residual]

 There are signs of a high residual point & a high leverage point (id=24), but, given that no points appear in the top right-hand area, no observations appear to be influential.

. avplots: look for simultaneous x/y extremes.

[added-variable plots: e(lwage | X) graphed against e(hsc | X), e(scol | X), e(ccol | X), e(exper | X), e(exper2 | X), e(_Itencat_1 | X), e(_Itencat_2 | X), e(_Itencat_3 | X), e(female | X), & e(nonwhite | X); each panel reports its coef, se, & t, matching the regression table above]

. avplot _Itencat_3, ml(id)

[added-variable plot for _Itencat_3: e(lwage | X) against e(_Itencat_3 | X), points labeled by id; coef = .28798112, se = .0557211, t = 5.17; id=24 appears at the edge of the cloud]

 There again is id=24. Why isn’t it a problem?


 So far, we see no problems with influential observations.
 Next let's numerically & graphically examine the studentized residuals (rstu), hat values (h), Cook's Distance (d), & the DFBETAs (DF_*).

. predict rstu if e(sample), rstudent
. predict h if e(sample), hat

. predict d if e(sample), cooksd
. dfbeta

. su rstu-DF_Itencat_3
 Although rstu has some clear outliers on
both the low & high sides, the other
diagnostics look good.
 Let’s use d (Cook’s Distance) to illustrate
the further analysis of influence diagnostics.

. scatter d id
[scatter plot of Cook's Distance (d) against observation id, each point labeled with its id; the largest value, roughly .03, belongs to id=24, with all other observations falling below it]
 Note id=24.
. list rstu h d DF_Itencat_1 DF_Itencat_2 DF_Itencat_3 wage educ exper tenure female nonwhite if id==24

 To repeat, there's no problem of influential outliers.
 If there were a problem, what would we do?

 Correct outliers that are coding errors, if possible.
 Examine the model's adequacy (see the sections on model specification & non-constant variance) & make adjustments (e.g., adding omitted variables, interactions, log or other transforms).
 Other options: try other types of
regression—

 Try, e.g., quantile—including median—regression (qreg, iqreg, sqreg, bsqreg) or robust regression (rreg). See the Stata manual; Hamilton, Statistics with Stata; & Chen et al., Regression with Stata (UCLA ATS web book).
 Quantile regression resists y-outliers only, while rreg also offers some protection against x-outliers (it begins by dropping observations with Cook's Distance greater than 1).
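
A minimal sketch of the two alternatives, assuming the _Itencat_* dummies created by the earlier xi: call are still in memory (qreg fits median regression by default):

. qreg lwage hsc scol ccol exper exper2 _Itencat_* female nonwhite
. rreg lwage hsc scol ccol exper exper2 _Itencat_* female nonwhite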

 One more thing: keep in mind that there
are no significance tests associated with
these outlier/influence diagnostics.
 A given outlier, then, could result from
chance alone—as occurs by definition in a
normal distribution.
 For a way—which includes a significance
test—to examine if observations are
outliers on more than one quantitative
variable, see the command ‘hadimvo.’
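
A minimal sketch of hadimvo, using older-Stata syntax & purely illustrative variables (gen() stores a 0/1 outlier flag; p() sets the significance level for declaring outliers):

. hadimvo lwage exper tenure, gen(badobs) p(.05)
. list id lwage exper tenure if badobs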

 lvr2plot & hadimvo are very useful tools that cut through lots of the other outlier/influence diagnostics.

 Recall that outliers are most likely to cause problems in small samples.
 Even in a normal distribution we expect roughly 5% of observations to fall more than 2 standard deviations from the mean, so some apparent outliers arise by chance alone.
 Don't over-fit a model to a sample—remember that there is sample-to-sample variation.

Let’s Wrap It Up

 Using robust standard errors:

. xi:reg lwage hsc scol ccol exper exper2 i.tencat female nonwhite, robust