Research Method
Lecture 9 (Ch9)
More on specification
and Data issues
Using Proxy Variables for
Unobserved Explanatory Variables
Suppose you are interested in estimating the return
to Education. So you consider the following model.
Log(Wage)=β0+β1Educ+ β2Exp+ (β3Ability+u) …(1)
Ability is unobserved, so it is absorbed into the
composite error term (β3Ability+u). If Ability is correlated with
years of education, the OLS estimator of β1 will be biased.
Question: if Ability is correlated with Educ, what is
the direction of the bias?
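As a starting point for the question, the standard omitted-variable result can be written out as follows. This is a sketch, not part of the original slide: the auxiliary coefficients with tildes are introduced here purely for illustration and denote the linear projection of Ability on the included regressors.

```latex
\mathrm{plim}\,\tilde{\beta}_1 \;=\; \beta_1 + \beta_3\,\tilde{\delta}_1,
\qquad\text{where}\qquad
\text{Ability} \;=\; \tilde{\delta}_0 + \tilde{\delta}_1\,\text{Educ} + \tilde{\delta}_2\,\text{Exp} + r .
```

Here the estimator with a tilde is the OLS coefficient on Educ when Ability is left in the error term. The sign of the bias is therefore the sign of the product of β3 and the partial association between Ability and Educ.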
One way to eliminate the bias is to use
panel data and apply the fixed effects or
first-differencing method.
Another method is to use a proxy
variable for ability. This is the topic of this
section.
Suppose that IQ is a proxy variable for
ability, and that IQ is available in your
data.
Then, the basic idea is to estimate the following.
Regress Log(Wage) on Educ, Exp, and IQ ……………(2)
This is called the plug-in solution to the omitted variables
problem.
The question is under what conditions (2) produces
consistent estimates of the slope parameters in the original regression (1). I will
explain these conditions using the above example (though
the arguments can be easily generalized).
It turns out that the following two conditions ensure that the
plug-in solution gives consistent estimates.
Condition 1: u is uncorrelated with the proxy variable (IQ). In addition, the
original equation should satisfy the usual conditions, i.e., u
is also uncorrelated with the initial explanatory variables (Educ and Exp)
and with the omitted variable (Ability).
Condition 2: E(Ability|Educ, Exp, IQ)=E(Ability|IQ)
Condition 2 means that, once we condition on IQ, Educ and Exp
do not help explain Ability. A simpler way to express
condition 2 is that Ability can be written as:
Ability=δ0+δ3IQ+v3 …………(3)
where v3 is a random error that is uncorrelated with
IQ, Educ, and Exp. That is, the systematic part of Ability is a function
of IQ only.

Then, it is clear why these two conditions guarantee that the
plug-in solution produces consistent estimates. Just plug
(3) into (1). Then you have
Log(Wage)=(β0+δ0)+β1Educ + β2Exp + β3δ3IQ + (u+β3v3 ) …(4)
Since u and v3 are uncorrelated with all of the explanatory
variables in (4) under condition 1 and condition 2, OLS estimates
the coefficients of (4) consistently. The intercept has changed to β0+δ0, but
usually you are not interested in the intercept, and the coefficient on IQ
is β3δ3 rather than β3. Importantly,
you get consistent estimates of the slope parameters β1 and β2.
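The argument can be checked with a small simulation. The sketch below is illustrative only: every parameter value, the sample size, and the data-generating process are assumptions chosen so that conditions 1 and 2 hold by construction, and the `ols` helper is a hypothetical minimal solver written with the Python standard library, not a replacement for Stata's `reg`.

```python
import random

def ols(y, columns):
    """Return OLS coefficients [intercept, slope_1, ...] via the normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[p] * r[q] for r in rows) for q in range(k)]
         + [sum(r[p] * yi for r, yi in zip(rows, y))] for p in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for p in range(k):
        pivot = max(range(p, k), key=lambda i: abs(a[i][p]))
        a[p], a[pivot] = a[pivot], a[p]
        for i in range(k):
            if i != p:
                f = a[i][p] / a[p][p]
                a[i] = [s - f * t for s, t in zip(a[i], a[p])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(42)
n = 20_000
# Hypothetical population parameters for equations (1) and (3)
beta0, beta1, beta2, beta3 = 1.0, 0.06, 0.02, 0.10
delta0, delta3 = 0.0, 1.0

iq = [rng.gauss(0, 1) for _ in range(n)]               # standardized IQ (assumption)
exper = [rng.gauss(10, 3) for _ in range(n)]
# Educ is correlated with IQ but not with v3, so condition 2 holds
educ = [12 + 2 * q + rng.gauss(0, 1) for q in iq]
# Equation (3): Ability = delta0 + delta3*IQ + v3
ability = [delta0 + delta3 * q + rng.gauss(0, 1) for q in iq]
# Equation (1), with u independent of everything else (condition 1)
lwage = [beta0 + beta1 * e + beta2 * x + beta3 * a + rng.gauss(0, 0.3)
         for e, x, a in zip(educ, exper, ability)]

b_short = ols(lwage, [educ, exper])       # Ability left in the error term
b_proxy = ols(lwage, [educ, exper, iq])   # plug-in solution, regression (2)

print("educ coefficient, Ability omitted:", round(b_short[1], 4))
print("educ coefficient, IQ plugged in:  ", round(b_proxy[1], 4))
print("IQ coefficient (beta3*delta3):    ", round(b_proxy[3], 4))
```

Under these assumed values, the regression that omits Ability should give an Educ coefficient well above 0.06, while the plug-in regression should return a value close to 0.06 and an IQ coefficient close to β3δ3 = 0.10.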
It is also clear that, if condition 2 is violated, the plug-in
solution will not work. If condition 2 is violated, then
Ability is a function of not only IQ, but also Educ and
Exp. So you will have:
Ability=δ0+ δ1Educ+δ2Exp+δ3IQ+v3 …(5)
If you plug (5) into (1), you have
Log(Wage)=(β0+δ0)+(β1+β3δ1)Educ + (β2+β3δ2)Exp + β3δ3IQ +
(u+β3v3 ) …(6)
Thus, the coefficient on Educ is no longer β1, but β1+β3δ1.
The plug-in solution therefore produces inconsistent estimates of β1
when condition 2 is violated (unless β3δ1=0).
Exercise
Ex.1: Use Wage2.dta to estimate a log wage
equation to examine the return to
education. Include exper, tenure, married,
south, urban, and black in the equation. Do
you think that the estimated return to education is
unbiased? What do you think is the
direction of the bias?
Ex.2: Now, use IQ as a proxy for
unobserved ability. Did the result change?
Was your prediction of the direction of
the bias correct?
Answer: OLS without IQ
. use "D:\My Documents\IUJ_teaching\Research Methodology\Wooldridge Econometrics resources\data\WAGE2.DTA", clear
. reg lwage educ exper tenure married south urban black
      Source |       SS       df       MS              Number of obs =     935
-------------+------------------------------           F(  7,   927) =   44.75
       Model |  41.8377619     7  5.97682312           Prob > F      =  0.0000
    Residual |  123.818521   927  .133569063           R-squared     =  0.2526
-------------+------------------------------           Adj R-squared =  0.2469
       Total |  165.656283   934  .177362188           Root MSE      =  .36547

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0654307   .0062504    10.47   0.000     .0531642    .0776973
       exper |    .014043   .0031852     4.41   0.000      .007792     .020294
      tenure |   .0117473    .002453     4.79   0.000     .0069333    .0165613
     married |   .1994171   .0390502     5.11   0.000     .1227801     .276054
       south |  -.0909036   .0262485    -3.46   0.001     -.142417   -.0393903
       urban |   .1839121   .0269583     6.82   0.000     .1310056    .2368185
       black |  -.1883499   .0376666    -5.00   0.000    -.2622717   -.1144281
       _cons |   5.395497    .113225    47.65   0.000      5.17329    5.617704
------------------------------------------------------------------------------
Answer: OLS with IQ
. reg lwage educ exper tenure married south urban black IQ
      Source |       SS       df       MS              Number of obs =     935
-------------+------------------------------           F(  8,   926) =   41.27
       Model |  43.5360162     8  5.44200202           Prob > F      =  0.0000
    Residual |  122.120267   926  .131879338           R-squared     =  0.2628
-------------+------------------------------           Adj R-squared =  0.2564
       Total |  165.656283   934  .177362188           Root MSE      =  .36315

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0544106   .0069285     7.85   0.000     .0408133     .068008
       exper |   .0141458   .0031651     4.47   0.000     .0079342    .0203575
      tenure |   .0113951   .0024394     4.67   0.000     .0066077    .0161825
     married |   .1997644   .0388025     5.15   0.000     .1236134    .2759154
       south |  -.0801695   .0262529    -3.05   0.002    -.1316916   -.0286473
       urban |   .1819463   .0267929     6.79   0.000     .1293645    .2345281
       black |  -.1431253   .0394925    -3.62   0.000    -.2206304   -.0656202
          IQ |   .0035591   .0009918     3.59   0.000     .0016127    .0055056
       _cons |   5.176439   .1280006    40.44   0.000     4.925234    5.427644
------------------------------------------------------------------------------
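To compare the two regressions, the change in the estimated return to education can be computed directly from the reported coefficients. The snippet below is simple arithmetic on the two educ coefficients shown in the Stata output; the variable names are made up for illustration.

```python
# Coefficients on educ copied from the two Stata regressions
b_educ_without_iq = 0.0654307   # lwage on educ, exper, tenure, married, south, urban, black
b_educ_with_iq = 0.0544106      # same regression with IQ added as a proxy

drop = b_educ_without_iq - b_educ_with_iq
relative_drop = drop / b_educ_without_iq

print(f"Return to education without IQ: {100 * b_educ_without_iq:.2f}% per year")
print(f"Return to education with IQ:    {100 * b_educ_with_iq:.2f}% per year")
print(f"Absolute change: {drop:.7f}, relative change: {100 * relative_drop:.1f}%")
```

The estimated return falls by about 1.1 percentage points (from roughly 6.5% to 5.4% per year of education) once IQ is included, which is the pattern one would expect if omitted ability biases the return upward.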
Using the lagged dependent variable
as a proxy variable
Often the lag of the dependent variable is used as a
proxy for unobserved variables.
First consider the following model.
(Crime rate) =β0+β1(unemp) + β2(expenditure) +u
If there are omitted factors that directly affect the crime
rate and are at the same time correlated with the
unemployment rate, the OLS estimator of β1 will be biased. The omitted
factors may be pre-existing conditions, such as
demographic features (age, race, etc.). Crime rates could also
differ among cities for historical reasons.
The idea is that the lag of the dependent variable
may summarize such pre-existing conditions.
So, estimate the following equation
(Crime rate)_it =β0+β1(unemp)_it + β2(expenditure)_it
+ β3(Crime rate)_i,t-1 +u_it
The following slides estimate the model using
CRIME2.dta
Example
We use CRIME2.dta to estimate the regressions.
The results are the following.
. use "D:\My Documents\IUJ_teaching\Research Methodology\Wooldridge Econometrics resources\data\CRIME2.DTA", clear
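Without the lag of the dependent variable:

. reg lcrmrte unem llawexpc if year==87

      Source |       SS       df       MS              Number of obs =      46
-------------+------------------------------           F(  2,    43) =    1.30
       Model |  .271987199     2    .1359936           Prob > F      =  0.2824
    Residual |  4.48998214    43  .104418189           R-squared     =  0.0571
-------------+------------------------------           Adj R-squared =  0.0133
       Total |  4.76196934    45  .105821541           Root MSE      =  .32314

------------------------------------------------------------------------------
     lcrmrte |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        unem |  -.0290032   .0323387    -0.90   0.375    -.0942205    .0362141
    llawexpc |   .2033652   .1726534     1.18   0.245    -.1448236    .5515539
       _cons |   3.342899   1.250527     2.67   0.011     .8209721    5.864826
------------------------------------------------------------------------------

With the lag of the dependent variable:

. reg lcrmrte unem llawexpc lcrmrt_1 if year==87

      Source |       SS       df       MS              Number of obs =      46
-------------+------------------------------           F(  3,    42) =   29.73
       Model |  3.23732846     3  1.07910949           Prob > F      =  0.0000
    Residual |  1.52464088    42  .036300973           R-squared     =  0.6798
-------------+------------------------------           Adj R-squared =  0.6570
       Total |  4.76196934    45  .105821541           Root MSE      =  .19053

------------------------------------------------------------------------------
     lcrmrte |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        unem |    .008621   .0195166     0.44   0.661    -.0307652    .0480072
    llawexpc |  -.1395764   .1086412    -1.28   0.206    -.3588231    .0796704
    lcrmrt_1 |   1.193923   .1320985     9.04   0.000     .9273371    1.460508
       _cons |   .0764511   .8211433     0.09   0.926    -1.580683    1.733585
------------------------------------------------------------------------------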
Measurement error
The existence of important omitted
variables causes an endogeneity problem.
Another source of endogeneity is
measurement error.
This section explains under what
circumstances measurement error
causes endogeneity, and under what
circumstances it does not.
Measurement error in an
explanatory variable
When an explanatory variable is
measured with error, this causes an
endogeneity problem.
This is a common problem. For example,
in a typical survey, respondents may
report their annual incomes with a lot of
error. Variables such as GPA or IQ may
be reported with error as well.
Now, let us understand the nature of the problem.
Suppose that you want to estimate the following simple
regression.
y=β0+β1x1* +u …………………….(1)
where x1* is the variable free of measurement error. Suppose
that this regression satisfies MLR.1 through MLR.4.
Now, suppose that you only observe the error-ridden
variable x1. That is
x1=x1*+e1
where e1 is a random error uncorrelated with x1*.
To be more precise, the measurement error is
such that
x1=x1*+e1 …………….(2)
and
Cov(x1*, e1)=0 ………….(3)
Together, (2) and (3) are called the classical errors-in-variables (CEV) assumption.
Note that the above assumption has nothing to
do with u. We maintain the assumption that u is
uncorrelated with both x1* and x1. This also
means that u is uncorrelated with e1.
Because we only observe the error-ridden
variable x1, we can only estimate the
following model.
y=β0+β1x1+v…….(4)
Under the CEV assumption, the observed
(error-ridden) variable in regression (4) is
endogenous.
To see this, plug x1*=x1-e1 into the original
regression (1) to get
y=β0+β1x1+(u- β1e1)…….(5)
So, we have v=u- β1e1
Now, notice that
$$\mathrm{Cov}(x_1, v) = \mathrm{Cov}(x_1,\; u - \beta_1 e_1) = -\beta_1 \sigma_{e_1}^2 \neq 0$$
See the front board for the proof.
Therefore, x1 is correlated with the error
term v (whenever β1≠0), so x1 is endogenous. Thus,
OLS will be biased and inconsistent.
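For readers without access to the board, the covariance result can be reconstructed from assumptions (2) and (3) together with the maintained assumption that u is uncorrelated with x1* and e1. The following is a sketch of that step, not a transcription of the lecturer's own proof.

```latex
\begin{aligned}
\mathrm{Cov}(x_1, v)
  &= \mathrm{Cov}(x_1^* + e_1,\; u - \beta_1 e_1) \\
  &= \underbrace{\mathrm{Cov}(x_1^*, u)}_{=0}
     - \beta_1 \underbrace{\mathrm{Cov}(x_1^*, e_1)}_{=0}
     + \underbrace{\mathrm{Cov}(e_1, u)}_{=0}
     - \beta_1 \mathrm{Var}(e_1) \\
  &= -\beta_1 \sigma_{e_1}^2 .
\end{aligned}
```

The only surviving term is the one that pairs the measurement error with itself, which is why the covariance is nonzero whenever β1 is nonzero and the measurement error has positive variance.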
Under the CEV assumption, we can predict the
direction of the inconsistency (characterizing
the bias is difficult). Let β̂1 be the estimated
coefficient from the error-ridden variable
regression (4). Then, we have
$$\mathrm{plim}(\hat{\beta}_1) = \beta_1 \left( \frac{\sigma_{x_1^*}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2} \right)$$
Proof: see the front board
Since the term inside the parentheses is always
smaller than 1 (when the measurement error has positive variance),
β̂1 is inconsistent towards zero. This
is called the attenuation bias.
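The attenuation result can be illustrated with a short simulation. This is a minimal sketch under assumed values: the true slope, the variances, and the sample size are all hypothetical, and the `slope` helper is simply the textbook simple-regression formula written with the Python standard library.

```python
import random

def slope(x, y):
    """OLS slope of y on x in a simple regression: Cov(x, y) / Var(x)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)
n = 50_000
beta0, beta1 = 1.0, 2.0        # hypothetical true parameters
sd_xstar, sd_e = 1.0, 1.0      # hypothetical std. dev. of x1* and of e1

x_star = [rng.gauss(0, sd_xstar) for _ in range(n)]      # error-free regressor
x_obs = [xs + rng.gauss(0, sd_e) for xs in x_star]       # x1 = x1* + e1 (CEV)
y = [beta0 + beta1 * xs + rng.gauss(0, 1) for xs in x_star]

b_true = slope(x_star, y)      # infeasible regression on x1*
b_obs = slope(x_obs, y)        # feasible regression on the error-ridden x1
predicted = beta1 * sd_xstar**2 / (sd_xstar**2 + sd_e**2)  # plim formula

print("slope using x1*:", round(b_true, 3))
print("slope using x1: ", round(b_obs, 3))
print("plim predicted by the attenuation formula:", predicted)
```

With equal variances for x1* and e1, the formula predicts that the estimated slope shrinks to half of the true value, and the simulated slope on the error-ridden regressor should be close to that prediction.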
Errors in variables (more
general case)
Suppose you want to estimate the
following model.
y=β0+β1x1*+β2x2+….+βkxk+u
where x1* is the variable free of measurement error.
However, you only observe the error-ridden
variable x1. So you can only estimate the
above regression by replacing x1* with x1.
Assume that the other variables are
free of measurement error.
Then the probability limit of β̂1 is given by
$$\mathrm{plim}(\hat{\beta}_1) = \beta_1 \left( \frac{\sigma_{r_1^*}^2}{\sigma_{r_1^*}^2 + \sigma_{e_1}^2} \right)$$
where r1* is the population error from the
following regression.
x1*=δ0+δ1x2+…+ δk-1xk+ r1*
Measurement error in the
dependent variable
When the measurement error is in the
dependent variable, but the explanatory
variables have no measurement error,
OLS is unbiased as long as the measurement error
is uncorrelated with the explanatory variables.
Consider the following model.
y*=β0+β1x1 +u …………………….(1)
where y* is the variable free of measurement error.
But you only observe the error-ridden
variable y.
Assume the following
y=y*+e ………………………………………….…….(2)
and
Cov(y*, e)=0 ……………………………………………...(3)
Again, we maintain the assumption that u is
uncorrelated with x1, and we assume that the measurement
error e is also uncorrelated with x1.
By plugging y*=y-e into (1), we have the following
regression, which can be estimated by OLS.
y=β0+β1x1 +(u+e) ……………(5)
Since e and u are not correlated with the
explanatory variable, estimating (5) by OLS causes no bias.
The error variance is larger than in (1), however, so the
estimates are less precise.
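A small simulation can illustrate the contrast with measurement error in a regressor. The sketch below uses assumed parameter values and an assumed noise level for the measurement error, and the `slope` helper is a hypothetical stand-in for a simple OLS regression.

```python
import random

def slope(x, y):
    """OLS slope of y on x in a simple regression: Cov(x, y) / Var(x)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(1)
n = 50_000
beta0, beta1 = 1.0, 2.0                                  # hypothetical true parameters

x1 = [rng.gauss(0, 1) for _ in range(n)]                 # regressor measured without error
y_star = [beta0 + beta1 * x + rng.gauss(0, 1) for x in x1]  # error-free outcome, equation (1)
# Observed outcome y = y* + e, with e drawn independently of x1 and u
y_obs = [ys + rng.gauss(0, 2) for ys in y_star]

b_star = slope(x1, y_star)     # infeasible regression on y*
b_obs = slope(x1, y_obs)       # feasible regression on the error-ridden y

print("slope using y*:", round(b_star, 3))
print("slope using y: ", round(b_obs, 3))
```

Both slopes should be close to the assumed true value of 2. The estimate based on the error-ridden outcome is noisier from sample to sample, but it is not systematically pulled towards zero.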
Nonrandom sampling
1: Exogenous sampling
Consider the following regression
Saving=β0+β1(income)+β2(age)+u
Suppose that the survey is conducted only for people over
35 years old. This is nonrandom sampling, but the
sampling criterion is based on an independent variable.
This is called sample selection based on the
independent variables, and is an example of
exogenous sample selection.
In this case, OLS regression of the above model has no
bias.
Nonrandom sampling
2: Endogenous sampling
Consider the following regression.
Wealth=β0+β1(Educ)+β2(Exper)+u
However, suppose that only people with wealth
below $250,000 are included in the sample. Then the
sample selection criterion is based on the dependent
variable. This is called sample selection based
on the dependent variable, and is an example of
endogenous sample selection.
In this case, OLS estimates of the above regression are
always biased.
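The difference between the two kinds of selection can be seen in a simulation. The sketch below is illustrative only: it uses a single generic regressor x and outcome y with assumed parameters rather than the savings or wealth data, and the cutoffs (x above 0, y below 1) are hypothetical stand-ins for "age over 35" and "wealth below $250,000".

```python
import random

def slope(x, y):
    """OLS slope of y on x in a simple regression: Cov(x, y) / Var(x)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(2)
n = 50_000
beta0, beta1 = 1.0, 2.0                                   # hypothetical true parameters

x = [rng.gauss(0, 1) for _ in range(n)]
y = [beta0 + beta1 * xi + rng.gauss(0, 1) for xi in x]
pairs = list(zip(x, y))

# Exogenous selection: keep observations based on the explanatory variable only
exog = [(xi, yi) for xi, yi in pairs if xi > 0]
# Endogenous selection: keep observations based on the dependent variable
endog = [(xi, yi) for xi, yi in pairs if yi < 1.0]

b_full = slope(x, y)
b_exog = slope(*zip(*exog))
b_endog = slope(*zip(*endog))

print("full random sample:     ", round(b_full, 3))
print("selected on x (x > 0):  ", round(b_exog, 3))
print("selected on y (y < 1):  ", round(b_endog, 3))
```

Under these assumptions, the sample selected on x should still give a slope near 2, while the sample truncated on y should give a noticeably smaller slope.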
Stratified sampling
This is a common survey method, in
which the population is divided into non-overlapping groups, or strata. The
sampling is random within each group.
However, some groups are often
oversampled in order to increase the number of
observations for those groups. Whether this
causes a bias depends on whether the
selection is exogenous or endogenous.
If females are oversampled, and you are
interested in gender differences in
savings, then this is exogenous sample
selection. Thus, it causes no bias.
If people with low wealth are
oversampled, and you are interested in
the wealth regression, then this is
endogenous sample selection. This causes
a bias in the regression.
A more subtle form of sample
selection
Suppose that you are interested in estimating
the wage offer regression.
Log(wage offer)= β0+β1(Educ)+β2(Exper)+u
When the wage offer is 'too low' for a particular
person, the person may decide not to work. Thus, this
person will not be included in the sample. This is a
case where sample selection is caused by the person's
decision to work or not.
When the decision is based on
unobserved factors, the OLS
estimates are biased. This is called the
sample selection bias.
This is typically a problem in studies
of the wage offer for women.
This course does not cover the method to
correct for this type of bias. In the fall
semester, I will cover this type of issue in
a new course, 'the Cross Section and Panel
Data Analysis'.