
Slide 1

Christopher Dougherty

EC220 - Introduction to econometrics
(chapter 6)
Slideshow: variable misspecification i: omitted variable bias
Original citation:
Dougherty, C. (2012) EC220 - Introduction to econometrics (chapter 6). [Teaching Resource]
© 2012 The Author
This version available at: http://learningresources.lse.ac.uk/132/
Available in LSE Learning Resources Online: May 2012
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License. This license allows
the user to remix, tweak, and build upon the work even for commercial purposes, as long as the user
credits the author and licenses their new creations under the identical terms.
http://creativecommons.org/licenses/by-sa/3.0/

http://learningresources.lse.ac.uk/


Slide 2

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Consequences of variable misspecification
TRUE MODEL                        FITTED MODEL
Y = β1 + β2X2 + u                 Ŷ = b1 + b2X2
Y = β1 + β2X2 + β3X3 + u          Ŷ = b1 + b2X2 + b3X3

In this sequence and the next we will investigate the consequences of misspecifying the
regression model in terms of explanatory variables.


Slide 3

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Consequences of variable misspecification
TRUE MODEL                        FITTED MODEL
Y = β1 + β2X2 + u                 Ŷ = b1 + b2X2
Y = β1 + β2X2 + β3X3 + u          Ŷ = b1 + b2X2 + b3X3

To keep the analysis simple, we will assume that there are only two possibilities. Either Y
depends only on X2, or it depends on both X2 and X3.


Slide 4

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Consequences of variable misspecification
TRUE MODEL                    FITTED  Ŷ = b1 + b2X2        FITTED  Ŷ = b1 + b2X2 + b3X3
Y = β1 + β2X2 + u             Correct specification,
                              no problems
Y = β1 + β2X2 + β3X3 + u

If Y depends only on X2, and we fit a simple regression model, we will not encounter any
problems, assuming of course that the regression model assumptions are valid.


Slide 5

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Consequences of variable misspecification
TRUE MODEL                    FITTED  Ŷ = b1 + b2X2        FITTED  Ŷ = b1 + b2X2 + b3X3
Y = β1 + β2X2 + u             Correct specification,
                              no problems
Y = β1 + β2X2 + β3X3 + u                                   Correct specification,
                                                           no problems

Likewise we will not encounter any problems if Y depends on both X2 and X3 and we fit the
multiple regression.


Slide 6

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Consequences of variable misspecification
TRUE MODEL                    FITTED  Ŷ = b1 + b2X2        FITTED  Ŷ = b1 + b2X2 + b3X3
Y = β1 + β2X2 + u             Correct specification,
                              no problems
Y = β1 + β2X2 + β3X3 + u                                   Correct specification,
                                                           no problems

In this sequence we will examine the consequences of fitting a simple regression when the
true model is multiple.


Slide 7

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Consequences of variable misspecification
TRUE MODEL                    FITTED  Ŷ = b1 + b2X2        FITTED  Ŷ = b1 + b2X2 + b3X3
Y = β1 + β2X2 + u             Correct specification,
                              no problems
Y = β1 + β2X2 + β3X3 + u                                   Correct specification,
                                                           no problems

In the next one we will do the opposite and examine the consequences of fitting a multiple
regression when the true model is simple.


Slide 8

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Consequences of variable misspecification
TRUE MODEL                    FITTED  Ŷ = b1 + b2X2        FITTED  Ŷ = b1 + b2X2 + b3X3
Y = β1 + β2X2 + u             Correct specification,
                              no problems
Y = β1 + β2X2 + β3X3 + u      Coefficients are biased      Correct specification,
                              (in general). Standard       no problems
                              errors are invalid.

The omission of a relevant explanatory variable causes the regression coefficients to be
biased and the standard errors to be invalid.
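This is easy to reproduce by simulation. The following Stata sketch is not part of the original slideshow; the parameter values (β1 = 1, β2 = 2, β3 = 3), the dependence of X3 on X2, and the sample size are all invented purely for illustration.

* Simulation sketch: true model Y = 1 + 2*X2 + 3*X3 + u,
* with X3 positively correlated with X2.
clear
set seed 123
set obs 1000
gen X2 = rnormal()
gen X3 = 0.5*X2 + rnormal()
gen u  = rnormal()
gen Y  = 1 + 2*X2 + 3*X3 + u

reg Y X2 X3      // correct specification: coefficient of X2 close to 2
reg Y X2         // X3 omitted: coefficient of X2 close to 3.5, not 2

With these values the slope of a regression of X3 on X2 tends to 0.5, so the coefficient in the misspecified regression should be centred on 2 + 3 × 0.5 = 3.5, in line with the bias expression introduced on the next slides.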


Slide 9

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Ŷ = b1 + b2X2        Y = β1 + β2X2 + β3X3 + u

E(b2) = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)²

In the present case, the omission of X3 causes b2 to be biased by the second term in the expression above: β3 multiplied by a ratio of summations. We will explain this first intuitively and then demonstrate it mathematically.


Slide 10

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Ŷ = b1 + b2X2        Y = β1 + β2X2 + β3X3 + u

E(b2) = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)²

[Diagram: X2 has a direct effect β2 on Y, holding X3 constant; X3 has an effect β3 on Y; X2 also has an apparent effect on Y, acting as a mimic for X3.]
The intuitive reason is that, in addition to its direct effect β2, X2 has an apparent indirect effect as a consequence of acting as a proxy for the missing X3.


Slide 11

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Ŷ = b1 + b2X2        Y = β1 + β2X2 + β3X3 + u

E(b2) = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)²

[Diagram: X2 has a direct effect β2 on Y, holding X3 constant; X3 has an effect β3 on Y; X2 also has an apparent effect on Y, acting as a mimic for X3.]

The strength of the proxy effect depends on two factors: the strength of the effect of X3 on Y, which is given by β3, and the ability of X2 to mimic X3.


Slide 12

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Ŷ = b1 + b2X2        Y = β1 + β2X2 + β3X3 + u

E(b2) = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)²

[Diagram: X2 has a direct effect β2 on Y, holding X3 constant; X3 has an effect β3 on Y; X2 also has an apparent effect on Y, acting as a mimic for X3.]

The ability of X2 to mimic X3 is determined by the slope coefficient obtained when X3 is regressed on X2, and this is precisely the ratio of summations multiplying β3 in the bias expression.
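The decomposition can be checked numerically. The sketch below, again with invented parameter values rather than anything taken from the slides, runs the auxiliary regression of X3 on X2 and confirms that the misspecified coefficient is close to β2 + β3 times the auxiliary slope.

* Sketch: bias factor = slope from regressing the omitted X3 on the included X2.
clear
set seed 456
set obs 1000
gen X2 = rnormal()
gen X3 = 0.5*X2 + rnormal()
gen Y  = 1 + 2*X2 + 3*X3 + rnormal()

reg X3 X2                    // auxiliary regression: slope d should be near 0.5
scalar d = _b[X2]
reg Y X2                     // misspecified regression omitting X3
display "b2 = " _b[X2] "   beta2 + beta3*d = " 2 + 3*d
* the two numbers agree apart from sampling error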


Slide 13

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Y = β1 + β2X2 + β3X3 + u        Ŷ = b1 + b2X2

Yi − Ȳ = (β1 + β2X2i + β3X3i + ui) − (β1 + β2X̄2 + β3X̄3 + ū)
       = β2(X2i − X̄2) + β3(X3i − X̄3) + (ui − ū)

We will now derive the expression for the bias mathematically. It is convenient to start by
deriving an expression for the deviation of Yi about its sample mean. It can be expressed in
terms of the deviations of X2, X3, and u about their sample means.


Slide 14

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Ŷ = b1 + b2X2        Y = β1 + β2X2 + β3X3 + u

Yi − Ȳ = β2(X2i − X̄2) + β3(X3i − X̄3) + (ui − ū)

b2 = Σ(X2i − X̄2)(Yi − Ȳ) / Σ(X2i − X̄2)²

   = [β2 Σ(X2i − X̄2)² + β3 Σ(X2i − X̄2)(X3i − X̄3) + Σ(X2i − X̄2)(ui − ū)] / Σ(X2i − X̄2)²

   = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)² + Σ(X2i − X̄2)(ui − ū) / Σ(X2i − X̄2)²

Although Y really depends on X3 as well as X2, we make a mistake and regress Y on X2 only.
The slope coefficient is therefore as shown.


Slide 15

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Ŷ = b1 + b2X2        Y = β1 + β2X2 + β3X3 + u

Yi − Ȳ = β2(X2i − X̄2) + β3(X3i − X̄3) + (ui − ū)

b2 = Σ(X2i − X̄2)(Yi − Ȳ) / Σ(X2i − X̄2)²

   = [β2 Σ(X2i − X̄2)² + β3 Σ(X2i − X̄2)(X3i − X̄3) + Σ(X2i − X̄2)(ui − ū)] / Σ(X2i − X̄2)²

   = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)² + Σ(X2i − X̄2)(ui − ū) / Σ(X2i − X̄2)²

We substitute for the Y deviations and simplify.



Slide 16

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Ŷ = b1 + b2X2        Y = β1 + β2X2 + β3X3 + u

Yi − Ȳ = β2(X2i − X̄2) + β3(X3i − X̄3) + (ui − ū)

b2 = Σ(X2i − X̄2)(Yi − Ȳ) / Σ(X2i − X̄2)²

   = [β2 Σ(X2i − X̄2)² + β3 Σ(X2i − X̄2)(X3i − X̄3) + Σ(X2i − X̄2)(ui − ū)] / Σ(X2i − X̄2)²

   = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)² + Σ(X2i − X̄2)(ui − ū) / Σ(X2i − X̄2)²

Hence we have demonstrated that b2 has three components.



Slide 17

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Ŷ = b1 + b2X2        Y = β1 + β2X2 + β3X3 + u

b2 = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)² + Σ(X2i − X̄2)(ui − ū) / Σ(X2i − X̄2)²

E(b2) = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)² + E[ Σ(X2i − X̄2)(ui − ū) / Σ(X2i − X̄2)² ]

To investigate biasedness or unbiasedness, we take the expected value of b2. The first two
terms are unaffected because they contain no random components. Thus we focus on the
expectation of the error term.


Slide 18

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

E b2    2   3

  X  X  X  X   E    X  X  u  u  




X

X
 X  X 



2i

2

3i

3

2i

2

i

2

2i

   X 2 i  X 2  u i  u  

E
2


  X 2i  X 2 






2

2

2i

1

 X

2i

 X2

1

 X

2i

2i

E   X 2 i  X 2  u i  u  

 X2

E  X



 X2

X



1

 X

2

2

2

2

2i

2i

 X 2  u i  u 

 X 2 E ui  u 

 0
X2 is nonstochastic, so the denominator of the error term is nonstochastic and may be
taken outside the expression for the expectation.


Slide 19

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

E b2    2   3

  X  X  X  X   E    X  X  u  u  




X

X
 X  X 



2i

2

3i

3

2i

2

i

2

2i

   X 2 i  X 2  u i  u  

E
2


  X 2i  X 2 






2

2

2i

1

 X

2i

 X2

1

 X

2i

2i

E   X 2 i  X 2  u i  u  

 X2

E  X



 X2

X



1

 X

2

2

2

2

2i

2i

 X 2  u i  u 

 X 2 E ui  u 

 0
In the numerator the expectation of a sum is equal to the sum of the expectations (first
expected value rule).


Slide 20

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

E b2    2   3

  X  X  X  X   E    X  X  u  u  




X

X
 X  X 



2i

2

3i

3

2i

2

i

2

2i

   X 2 i  X 2  u i  u  

E
2


  X 2i  X 2 






2

2

2i

1

 X

2i

 X2

1

 X

2i

2i

E   X 2 i  X 2  u i  u  

 X2

E  X



 X2

X



1

 X

2

2

2

2

2i

2i

 X 2  u i  u 

 X 2 E ui  u 

 0
In each product, the factor involving X2 may be taken out of the expectation because X2 is
nonstochastic.


Slide 21

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

E b2    2   3

  X  X  X  X   E    X  X  u  u  




X

X
 X  X 



2i

2

3i

3

2i

2

i

2

2i

   X 2 i  X 2  u i  u  

E
2


  X 2i  X 2 






2

2

2i

1

 X

2i

 X2

1

 X

2i

2i

E   X 2 i  X 2  u i  u  

 X2

E  X



 X2

X



1

 X

2

2

2

2

2i

2i

 X 2  u i  u 

 X 2 E ui  u 

 0
By Assumption A.3, the expected value of u is 0. It follows that the expected value of the
sample mean of u is also 0. Hence the expected value of the error term is 0.


Slide 22

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Ŷ = b1 + b2X2        Y = β1 + β2X2 + β3X3 + u

E(b2) = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)² + E[ Σ(X2i − X̄2)(ui − ū) / Σ(X2i − X̄2)² ]

E(b2) = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)²

Thus we have shown that the expected value of b2 is equal to the true value plus a bias
term. Note: the definition of a bias is the difference between the expected value of an
estimator and the true value of the parameter being estimated.
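The expectation itself can be approximated by Monte Carlo. A minimal sketch, using the same invented design as before (nothing here comes from the slideshow): draw many samples from the true model, fit the misspecified regression in each, and average the estimates.

capture program drop ovbsim
program define ovbsim, rclass
    clear
    set obs 200
    gen X2 = rnormal()
    gen X3 = 0.5*X2 + rnormal()
    gen Y  = 1 + 2*X2 + 3*X3 + rnormal()
    reg Y X2                    // misspecified: X3 omitted
    return scalar b2 = _b[X2]
end

set seed 789
simulate b2 = r(b2), reps(1000) nodots: ovbsim
summarize b2    // mean close to 3.5 = beta2 + bias, not the true beta2 = 2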


Slide 23

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

Ŷ = b1 + b2X2        Y = β1 + β2X2 + β3X3 + u

E(b2) = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)² + E[ Σ(X2i − X̄2)(ui − ū) / Σ(X2i − X̄2)² ]

E(b2) = β2 + β3 · Σ(X2i − X̄2)(X3i − X̄3) / Σ(X2i − X̄2)²

As a consequence of the misspecification, the standard errors, t tests and F test are invalid.



Slide 24

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

S = β1 + β2ASVABC + β3SM + u

. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1328069   .0097389    13.64   0.000     .1136758     .151938
          SM |   .1235071   .0330837     3.73   0.000     .0585178    .1884963
       _cons |   5.420733   .4930224    10.99   0.000     4.452244    6.389222
------------------------------------------------------------------------------

We will illustrate the bias using an educational attainment model. To keep the analysis
simple, we will assume that in the true model S depends only on ASVABC and SM. The
output above shows the corresponding regression using EAEF Data Set 21.


Slide 25

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

S = β1 + β2ASVABC + β3SM + u

. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1328069   .0097389    13.64   0.000     .1136758     .151938
          SM |   .1235071   .0330837     3.73   0.000     .0585178    .1884963
       _cons |   5.420733   .4930224    10.99   0.000     4.452244    6.389222
------------------------------------------------------------------------------

E(b2) = β2 + β3 · Σ(ASVABCi − ASVABC̄)(SMi − SM̄) / Σ(ASVABCi − ASVABC̄)²

We will run the regression a second time, omitting SM. Before we do this, we will try to
predict the direction of the bias in the coefficient of ASVABC.


Slide 26

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

S = β1 + β2ASVABC + β3SM + u

. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1328069   .0097389    13.64   0.000     .1136758     .151938
          SM |   .1235071   .0330837     3.73   0.000     .0585178    .1884963
       _cons |   5.420733   .4930224    10.99   0.000     4.452244    6.389222
------------------------------------------------------------------------------

E(b2) = β2 + β3 · Σ(ASVABCi − ASVABC̄)(SMi − SM̄) / Σ(ASVABCi − ASVABC̄)²

It is reasonable to suppose, as a matter of common sense, that β3 is positive. This assumption is strongly supported by the fact that its estimate in the multiple regression is positive and highly significant.


Slide 27

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

S = β1 + β2ASVABC + β3SM + u

. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1328069   .0097389    13.64   0.000     .1136758     .151938
          SM |   .1235071   .0330837     3.73   0.000     .0585178    .1884963
       _cons |   5.420733   .4930224    10.99   0.000     4.452244    6.389222
------------------------------------------------------------------------------

. cor SM ASVABC
(obs=540)

        |      SM   ASVABC
--------+------------------
     SM |  1.0000
 ASVABC |  0.4202   1.0000

E(b2) = β2 + β3 · Σ(ASVABCi − ASVABC̄)(SMi − SM̄) / Σ(ASVABCi − ASVABC̄)²

The correlation between ASVABC and SM is positive, so the numerator of the bias term
must be positive. The denominator is automatically positive since it is a sum of squares
and there is some variation in ASVABC. Hence the bias should be positive.


Slide 28

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

S = β1 + β2ASVABC + β3SM + u

. reg S ASVABC

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  274.19
       Model |  1081.97059     1  1081.97059           Prob > F      =  0.0000
    Residual |  2123.01275   538  3.94612035           R-squared     =  0.3376
-------------+------------------------------           Adj R-squared =  0.3364
       Total |  3204.98333   539  5.94616574           Root MSE      =  1.9865

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |    .148084   .0089431    16.56   0.000     .1305165    .1656516
       _cons |   6.066225   .4672261    12.98   0.000     5.148413    6.984036
------------------------------------------------------------------------------

Here is the regression omitting SM.



Slide 29

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

S = β1 + β2ASVABC + β3SM + u

. reg S ASVABC SM

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1328069   .0097389    13.64   0.000     .1136758     .151938
          SM |   .1235071   .0330837     3.73   0.000     .0585178    .1884963
       _cons |   5.420733   .4930224    10.99   0.000     4.452244    6.389222
------------------------------------------------------------------------------

. reg S ASVABC

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |    .148084   .0089431    16.56   0.000     .1305165    .1656516
       _cons |   6.066225   .4672261    12.98   0.000     5.148413    6.984036
------------------------------------------------------------------------------

As you can see, the coefficient of ASVABC is indeed higher when SM is omitted. Part of the
difference may be due to pure chance, but part is attributable to the bias.
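In fact the sample relationship between the two estimates is exact, not just a tendency: by OLS algebra, the simple-regression slope equals the multiple-regression slope of ASVABC plus the coefficient of SM multiplied by the slope from regressing SM on ASVABC. A sketch of the check, assuming EAEF Data Set 21 is loaded with the variables S, ASVABC and SM:

reg S ASVABC SM
scalar b2m = _b[ASVABC]           // .1328069
scalar b3m = _b[SM]               // .1235071
reg SM ASVABC                     // auxiliary regression of omitted on included
scalar d = _b[ASVABC]
reg S ASVABC
display "b2 (simple)   = " _b[ASVABC]
display "b2m + b3m*d   = " b2m + b3m*d    // identical, apart from rounding

Given the coefficients on the slide, the auxiliary slope d must be (.148084 − .1328069) / .1235071 ≈ 0.124.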


Slide 30

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

S = β1 + β2ASVABC + β3SM + u

. reg S SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   80.93
       Model |  419.086251     1  419.086251           Prob > F      =  0.0000
    Residual |  2785.89708   538  5.17824736           R-squared     =  0.1308
-------------+------------------------------           Adj R-squared =  0.1291
       Total |  3204.98333   539  5.94616574           Root MSE      =  2.2756

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          SM |   .3130793   .0348012     9.00   0.000     .2447165    .3814422
       _cons |   10.04688   .4147121    24.23   0.000     9.232226    10.86153
------------------------------------------------------------------------------

E(b3) = β3 + β2 · Σ(ASVABCi − ASVABC̄)(SMi − SM̄) / Σ(SMi − SM̄)²

Here is the regression omitting ASVABC instead of SM. We would expect b3 to be upwards biased. We anticipate that β2 is positive, and we know that both the numerator and the denominator of the other factor in the bias expression are positive.


Slide 31

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

S = β1 + β2ASVABC + β3SM + u

. reg S ASVABC SM

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ASVABC |   .1328069   .0097389    13.64   0.000     .1136758     .151938
          SM |   .1235071   .0330837     3.73   0.000     .0585178    .1884963
       _cons |   5.420733   .4930224    10.99   0.000     4.452244    6.389222
------------------------------------------------------------------------------

. reg S SM

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          SM |   .3130793   .0348012     9.00   0.000     .2447165    .3814422
       _cons |   10.04688   .4147121    24.23   0.000     9.232226    10.86153
------------------------------------------------------------------------------

In this case the bias is quite dramatic. The coefficient of SM has more than doubled. The reason for the bigger effect is that the variation in SM is much smaller than that in ASVABC, while β2 and β3 are similar in size, judging by their estimates.


Slide 32

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

S = β1 + β2ASVABC + β3SM + u

. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

. reg S ASVABC

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  274.19
       Model |  1081.97059     1  1081.97059           Prob > F      =  0.0000
    Residual |  2123.01275   538  3.94612035           R-squared     =  0.3376
-------------+------------------------------           Adj R-squared =  0.3364
       Total |  3204.98333   539  5.94616574           Root MSE      =  1.9865

. reg S SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   80.93
       Model |  419.086251     1  419.086251           Prob > F      =  0.0000
    Residual |  2785.89708   538  5.17824736           R-squared     =  0.1308
-------------+------------------------------           Adj R-squared =  0.1291
       Total |  3204.98333   539  5.94616574           Root MSE      =  2.2756

Finally, we will investigate how R2 behaves when a variable is omitted. In the simple
regression of S on ASVABC, R2 is 0.34, and in the simple regression of S on SM it is 0.13.


Slide 33

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

S = β1 + β2ASVABC + β3SM + u

. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

. reg S ASVABC

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  274.19
       Model |  1081.97059     1  1081.97059           Prob > F      =  0.0000
    Residual |  2123.01275   538  3.94612035           R-squared     =  0.3376
-------------+------------------------------           Adj R-squared =  0.3364
       Total |  3204.98333   539  5.94616574           Root MSE      =  1.9865

. reg S SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   80.93
       Model |  419.086251     1  419.086251           Prob > F      =  0.0000
    Residual |  2785.89708   538  5.17824736           R-squared     =  0.1308
-------------+------------------------------           Adj R-squared =  0.1291
       Total |  3204.98333   539  5.94616574           Root MSE      =  2.2756

Does this imply that ASVABC explains 34% of the variance in S and SM 13%? No, because
the multiple regression reveals that their joint explanatory power is 0.35, not 0.47.


Slide 34

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

S = β1 + β2ASVABC + β3SM + u

. reg S ASVABC SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  147.36
       Model |  1135.67473     2  567.837363           Prob > F      =  0.0000
    Residual |  2069.30861   537  3.85346109           R-squared     =  0.3543
-------------+------------------------------           Adj R-squared =  0.3519
       Total |  3204.98333   539  5.94616574           Root MSE      =   1.963

. reg S ASVABC

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  274.19
       Model |  1081.97059     1  1081.97059           Prob > F      =  0.0000
    Residual |  2123.01275   538  3.94612035           R-squared     =  0.3376
-------------+------------------------------           Adj R-squared =  0.3364
       Total |  3204.98333   539  5.94616574           Root MSE      =  1.9865

. reg S SM

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   80.93
       Model |  419.086251     1  419.086251           Prob > F      =  0.0000
    Residual |  2785.89708   538  5.17824736           R-squared     =  0.1308
-------------+------------------------------           Adj R-squared =  0.1291
       Total |  3204.98333   539  5.94616574           Root MSE      =  2.2756

In the second regression, ASVABC is partly acting as a proxy for SM, and this inflates its
apparent explanatory power. Similarly, in the third regression, SM is partly acting as a
proxy for ASVABC, again inflating its apparent explanatory power.


Slide 35

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

LGEARN = β1 + β2S + β3EXP + u

. reg LGEARN S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  100.86
       Model |  50.9842581     2   25.492129           Prob > F      =  0.0000
    Residual |  135.723385   537  .252743734           R-squared     =  0.2731
-------------+------------------------------           Adj R-squared =  0.2704
       Total |  186.707643   539   .34639637           Root MSE      =  .50274

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .1235911   .0090989    13.58   0.000     .1057173     .141465
         EXP |   .0350826   .0050046     7.01   0.000     .0252515    .0449137
       _cons |   .5093196   .1663823     3.06   0.002     .1824796    .8361596
------------------------------------------------------------------------------

However, it is also possible for omitted variable bias to lead to a reduction in the apparent
explanatory power of a variable. This will be demonstrated using a simple earnings
function model, supposing the logarithm of hourly earnings to depend on S and EXP.


Slide 36

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

LGEARN = β1 + β2S + β3EXP + u

. reg LGEARN S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  100.86
       Model |  50.9842581     2   25.492129           Prob > F      =  0.0000
    Residual |  135.723385   537  .252743734           R-squared     =  0.2731
-------------+------------------------------           Adj R-squared =  0.2704
       Total |  186.707643   539   .34639637           Root MSE      =  .50274

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .1235911   .0090989    13.58   0.000     .1057173     .141465
         EXP |   .0350826   .0050046     7.01   0.000     .0252515    .0449137
       _cons |   .5093196   .1663823     3.06   0.002     .1824796    .8361596
------------------------------------------------------------------------------

. cor S EXP
(obs=540)

        |       S      EXP
--------+------------------
      S |  1.0000
    EXP | -0.2179   1.0000

E(b2) = β2 + β3 · Σ(Si − S̄)(EXPi − EXP̄) / Σ(Si − S̄)²

If we omit EXP from the regression, the coefficient of S should be subject to a downward bias. β3 is likely to be positive. The numerator of the other factor in the bias term is negative since S and EXP are negatively correlated. The denominator is positive.
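A quick simulation confirms the direction of the bias. The values below (coefficients of 0.12 and 0.03, a negative loading of −0.3) are invented, chosen only to reproduce the pattern of signs in this example.

* Sketch: omitted regressor negatively correlated with the included one.
clear
set seed 321
set obs 1000
gen S      = rnormal()
gen EXP    = -0.3*S + rnormal()        // negatively correlated, like S and EXP
gen LGEARN = 1 + 0.12*S + 0.03*EXP + rnormal(0, 0.5)

reg LGEARN S EXP    // coefficient of S close to 0.12
reg LGEARN S        // EXP omitted: close to 0.12 + 0.03*(-0.3) = 0.111, biased down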


Slide 37

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE

LGEARN = β1 + β2S + β3EXP + u

. reg LGEARN S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  100.86
       Model |  50.9842581     2   25.492129           Prob > F      =  0.0000
    Residual |  135.723385   537  .252743734           R-squared     =  0.2731
-------------+------------------------------           Adj R-squared =  0.2704
       Total |  186.707643   539   .34639637           Root MSE      =  .50274

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .1235911   .0090989    13.58   0.000     .1057173     .141465
         EXP |   .0350826   .0050046     7.01   0.000     .0252515    .0449137
       _cons |   .5093196   .1663823     3.06   0.002     .1824796    .8361596
------------------------------------------------------------------------------

. cor S EXP
(obs=540)

        |       S      EXP
--------+------------------
      S |  1.0000
    EXP | -0.2179   1.0000

E(b3) = β3 + β2 · Σ(EXPi − EXP̄)(Si − S̄) / Σ(EXPi − EXP̄)²

For the same reasons, the coefficient of EXP in a simple regression of LGEARN on EXP
should be downwards biased.


Slide 38

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE
. reg LGEARN S EXP

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .1235911   .0090989    13.58   0.000     .1057173     .141465
         EXP |   .0350826   .0050046     7.01   0.000     .0252515    .0449137
       _cons |   .5093196   .1663823     3.06   0.002     .1824796    .8361596
------------------------------------------------------------------------------

. reg LGEARN S

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .1096934   .0092691    11.83   0.000     .0914853    .1279014
       _cons |   1.292241   .1287252    10.04   0.000     1.039376    1.545107
------------------------------------------------------------------------------

. reg LGEARN EXP

------------------------------------------------------------------------------
      LGEARN |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         EXP |   .0202708   .0056564     3.58   0.000     .0091595     .031382
       _cons |    2.44941   .0988233    24.79   0.000     2.255284    2.643537
------------------------------------------------------------------------------

As can be seen, the coefficients of S and EXP are indeed lower in the simple regressions.



Slide 39

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE
. reg LGEARN S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  100.86
       Model |  50.9842581     2   25.492129           Prob > F      =  0.0000
    Residual |  135.723385   537  .252743734           R-squared     =  0.2731
-------------+------------------------------           Adj R-squared =  0.2704
       Total |  186.707643   539   .34639637           Root MSE      =  .50274

. reg LGEARN S

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  140.05
       Model |  38.5643833     1  38.5643833           Prob > F      =  0.0000
    Residual |   148.14326   538  .275359219           R-squared     =  0.2065
-------------+------------------------------           Adj R-squared =  0.2051
       Total |  186.707643   539   .34639637           Root MSE      =  .52475

. reg LGEARN EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   12.84
       Model |  4.35309315     1  4.35309315           Prob > F      =  0.0004
    Residual |   182.35455   538  .338948978           R-squared     =  0.0233
-------------+------------------------------           Adj R-squared =  0.0215
       Total |  186.707643   539   .34639637           Root MSE      =  .58219

A comparison of R2 for the three regressions shows that the sum of R2 in the simple regressions (0.2065 + 0.0233 = 0.2298) is actually less than R2 in the multiple regression (0.2731).


Slide 40

VARIABLE MISSPECIFICATION I: OMISSION OF A RELEVANT VARIABLE
. reg LGEARN S EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  2,   537) =  100.86
       Model |  50.9842581     2   25.492129           Prob > F      =  0.0000
    Residual |  135.723385   537  .252743734           R-squared     =  0.2731
-------------+------------------------------           Adj R-squared =  0.2704
       Total |  186.707643   539   .34639637           Root MSE      =  .50274

. reg LGEARN S

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =  140.05
       Model |  38.5643833     1  38.5643833           Prob > F      =  0.0000
    Residual |   148.14326   538  .275359219           R-squared     =  0.2065
-------------+------------------------------           Adj R-squared =  0.2051
       Total |  186.707643   539   .34639637           Root MSE      =  .52475

. reg LGEARN EXP

      Source |       SS       df       MS              Number of obs =     540
-------------+------------------------------           F(  1,   538) =   12.84
       Model |  4.35309315     1  4.35309315           Prob > F      =  0.0004
    Residual |   182.35455   538  .338948978           R-squared     =  0.0233
-------------+------------------------------           Adj R-squared =  0.0215
       Total |  186.707643   539   .34639637           Root MSE      =  .58219

This is because the apparent explanatory power of S in the second regression has been
undermined by the downwards bias in its coefficient. The same is true for the apparent
explanatory power of EXP in the third equation.


Slide 41

Copyright Christopher Dougherty 2011.
These slideshows may be downloaded by anyone, anywhere for personal use.
Subject to respect for copyright and, where appropriate, attribution, they may be
used as a resource for teaching an econometrics course. There is no need to
refer to the author.
The content of this slideshow comes from Section 6.2 of C. Dougherty,
Introduction to Econometrics, fourth edition 2011, Oxford University Press.
Additional (free) resources for both students and instructors may be
downloaded from the OUP Online Resource Centre
http://www.oup.com/uk/orc/bin/9780199567089/.
Individuals studying econometrics on their own and who feel that they might
benefit from participation in a formal course should consider the London School
of Economics summer school course
EC212 Introduction to Econometrics
http://www2.lse.ac.uk/study/summerSchools/summerSchool/Home.aspx
or the University of London International Programmes distance learning course
20 Elements of Econometrics
www.londoninternational.ac.uk/lse.
