Sample Selection Example
Bill Evans

• Draw 10,000 obs at random
• educ uniform over [0,16]
• age uniform over [18,64]
• wearnl=4.49 + 0.08*educ + 0.012*age + ε
• Generate missing data for wearnl
• z drawn from a standard normal, N(0,1)
• d*=-1.5+0.15*educ+0.01*age+0.15*z+v
• wearnl missing if d*≤0
• wearnl reported if d*>0
• wearnl_all = wearnl with no missing obs. (the full sample)
• εi and vi are assumed to be bivariate normal
• E(εi) = E(vi) = 0
• Var(εi) = σ²
• Var(vi) = 1
• Corr(εi,vi) = ρ
• Cov(εi,vi) = ρ σ
• In this case, ρ=0.25 and σ=0.46
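The data-generating process described above can be sketched in Python instead of Stata. This is a minimal sketch, not the author's code: the seed, the function name simulate, and the way ε is built from two independent normal draws (so that Corr(ε,v) = ρ and Var(ε) = σ²) are assumptions made for illustration.

```python
import math
import random

RHO, SIGMA = 0.25, 0.46  # values stated on the slide


def simulate(n=10_000, seed=0):
    """Draw one sample from the data-generating process on the slides."""
    rng = random.Random(seed)  # the seed is an arbitrary choice
    rows = []
    for _ in range(n):
        educ = rng.uniform(0, 16)
        age = rng.uniform(18, 64)
        z = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        u = rng.gauss(0, 1)
        # eps has variance SIGMA**2 and correlation RHO with v
        eps = SIGMA * (RHO * v + math.sqrt(1 - RHO**2) * u)
        wearnl_all = 4.49 + 0.08 * educ + 0.012 * age + eps
        d_star = -1.5 + 0.15 * educ + 0.01 * age + 0.15 * z + v
        wearnl = wearnl_all if d_star > 0 else None  # None marks a missing value
        rows.append({"educ": educ, "age": age, "z": z,
                     "wearnl_all": wearnl_all, "wearnl": wearnl})
    return rows


rows = simulate()
share_reported = sum(r["wearnl"] is not None for r in rows) / len(rows)
print(f"share of wearnl reported: {share_reported:.3f}")
```

With these parameters roughly half of the observations should end up with wearnl reported; the exact share depends on the seed.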
• Yi = β0 + β1educi + β2agei + εi
• E[Yi | SSR] = β0 + β1educi + β2agei + E[εi | SSR], where SSR is the sample selection rule (wearnl reported, d*>0)
• E[εi | SSR] = E[εi | vi>-wiγ] = ρ σ φ(wiγ)/Φ(wiγ)
• λi = φ(wiγ)/Φ(wiγ)
• wiγ = γ0+educ γ1+age γ2+z γ3
• γ1 and γ2 (the coefficients on educ and age) are both constructed to be positive, and λi falls as wiγ rises
• cov(educ, λi) < 0 and
• cov(age, λi) < 0
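The negative covariances follow from λ being a decreasing function of the index wγ. A small sketch, assuming the selection coefficients from the d* equation and using hypothetical covariate values (age 40, z = 0) chosen only for illustration:

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def inverse_mills(x):
    """lambda(x) = phi(x) / Phi(x)."""
    return _STD.pdf(x) / _STD.cdf(x)


def index(educ, age, z):
    # coefficients from d* = -1.5 + 0.15*educ + 0.01*age + 0.15*z + v
    return -1.5 + 0.15 * educ + 0.01 * age + 0.15 * z


# lambda shrinks as educ (and therefore the index) rises
for educ in (0, 8, 16):
    w = index(educ, age=40, z=0)
    print(educ, round(w, 2), round(inverse_mills(w), 3))
```

The same pattern holds when age is varied with educ held fixed, since its coefficient in the index is also positive.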
• The omitted variable λi is negatively correlated with educ and age, the observed covariates in the model
• Its coefficient, ρσ, is positive here, so the coefficients on educ and age in the selected sample will be too low
Number of non-missing observations
. * get frequency of missing data;
. tab missing;

     weekly |
  wages are |
    missing |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      5,337       53.37       53.37
          1 |      4,663       46.63      100.00
------------+-----------------------------------
      Total |     10,000      100.00
OLS on all data (no missing obs)
Generated by the equation
wearnl=4.49 + 0.08*educ + 0.012*age + ε
. * run ols model with real data;
. reg wearnl_all educ age;
      Source |       SS       df       MS              Number of obs =   10000
-------------+------------------------------           F(  2,  9997) = 3883.31
       Model |   1635.2229     2   817.61145           Prob > F      =  0.0000
    Residual |  2104.81874  9997  .210545037           R-squared     =  0.4372
-------------+------------------------------           Adj R-squared =  0.4371
       Total |  3740.04164  9999  .374041568           Root MSE      =  .45885

------------------------------------------------------------------------------
  wearnl_all |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0803226   .0009926    80.92   0.000     .0783769    .0822684
         age |   .0121512   .0003465    35.06   0.000     .0114719    .0128305
       _cons |   4.483993    .016937   264.74   0.000     4.450793    4.517193
------------------------------------------------------------------------------
OLS on reported data
Smaller MSE
. reg wearnl educ age;
      Source |       SS       df       MS              Number of obs =    5337
-------------+------------------------------           F(  2,  5334) = 1381.67
       Model |  559.207267     2  279.603633           Prob > F      =  0.0000
    Residual |  1079.42294  5334  .202366505           R-squared     =  0.3413
-------------+------------------------------           Adj R-squared =  0.3410
       Total |   1638.6302  5336  .307089618           Root MSE      =  .44985

------------------------------------------------------------------------------
      wearnl |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0703361   .0015006    46.87   0.000     .0673943     .073278
         age |   .0118533   .0004631    25.60   0.000     .0109455    .0127611
       _cons |     4.6703   .0258484   180.68   0.000     4.619626    4.720973
------------------------------------------------------------------------------
Notice that the estimates for educ and age are now smaller
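The comparison between the two OLS regressions can be reproduced in a few lines of plain Python. This is a sketch under stated assumptions, not the author's Stata code: the seed and the helper ols (least squares through the normal equations) are hypothetical, so the estimates will differ from the output above in the later decimals.

```python
import math
import random


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


rng = random.Random(1)  # arbitrary seed
X_all, y_all, X_sel, y_sel = [], [], [], []
for _ in range(10_000):
    educ, age = rng.uniform(0, 16), rng.uniform(18, 64)
    z, v, u = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    eps = 0.46 * (0.25 * v + math.sqrt(1 - 0.25**2) * u)
    y = 4.49 + 0.08 * educ + 0.012 * age + eps
    row = [1.0, educ, age]  # constant, educ, age
    X_all.append(row)
    y_all.append(y)
    if -1.5 + 0.15 * educ + 0.01 * age + 0.15 * z + v > 0:  # wearnl is reported
        X_sel.append(row)
        y_sel.append(y)

b_all = ols(X_all, y_all)
b_sel = ols(X_sel, y_sel)
print("all data:       ", [round(b, 4) for b in b_all])
print("selected sample:", [round(b, 4) for b in b_sel])
```

The coefficients are printed in the order constant, educ, age. The educ coefficient in the selected sample should come out below the full-sample value, and the constant above it.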
Probit, why is data non-missing
Generated by the equation
d*=-1.5+0.15*educ+0.01*age+0.15*z+v
. gen nonmissing=1-missing;
. label var nonmissing "=1 if data for wearnl is reported";
. * run probit, why data is reported;
. probit nonmissing educ age z;
Probit regression                                 Number of obs   =      10000
                                                  LR chi2(3)      =    2644.57
                                                  Prob > chi2     =     0.0000
Log likelihood = -5586.4551                       Pseudo R2       =     0.1914

------------------------------------------------------------------------------
  nonmissing |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1497386    .003211    46.63   0.000     .1434451    .1560321
         age |   .0100424   .0010406     9.65   0.000     .0080029    .0120819
           z |   .1403535   .0138183    10.16   0.000     .1132702    .1674369
       _cons |  -1.504516   .0526162   -28.59   0.000    -1.607642   -1.401391
------------------------------------------------------------------------------
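To read the probit coefficients as probabilities, the fitted index can be passed through the standard normal CDF. A minimal sketch using the point estimates from the output above; the covariate values (age 40, z = 0) are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

# probit point estimates copied from the output above
COEF = {"_cons": -1.504516, "educ": 0.1497386, "age": 0.0100424, "z": 0.1403535}


def prob_reported(educ, age, z):
    """P(nonmissing = 1 | educ, age, z) = Phi(fitted index)."""
    xb = COEF["_cons"] + COEF["educ"] * educ + COEF["age"] * age + COEF["z"] * z
    return NormalDist().cdf(xb)


# probability that wearnl is reported at several education levels
for educ in (0, 8, 12, 16):
    print(educ, round(prob_reported(educ, age=40, z=0.0), 3))
```

The estimated coefficients are close to the values used to generate the data (-1.5, 0.15, 0.01, 0.15), so these probabilities are close to the true reporting probabilities.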
Syntax for Heckman model in STATA
. heckman wearnl educ age, select(educ age z);
• wearnl educ age: equation of interest
• select(educ age z): variables in selection equation
Notice β’s have increased over OLS w/ missing data
Heckman selection model                         Number of obs      =     10000
(regression model with sample selection)        Censored obs       =      4663
                                                Uncensored obs     =      5337

                                                Wald chi2(2)       =    428.68
Log likelihood = -8893.456                      Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wearnl       |
        educ |   .0816778   .0064328    12.70   0.000     .0690698    .0942858
         age |   .0125495   .0006062    20.70   0.000     .0113614    .0137376
       _cons |   4.445103   .1268036    35.06   0.000     4.196572    4.693633
-------------+----------------------------------------------------------------
select       |
        educ |   .1495845   .0032106    46.59   0.000     .1432918    .1558771
         age |    .010029   .0010394     9.65   0.000     .0079919    .0120661
           z |   .1390889   .0138789    10.02   0.000     .1118867    .1662911
       _cons |  -1.502784   .0525991   -28.57   0.000    -1.605877   -1.399692
-------------+----------------------------------------------------------------
     /athrho |   .3085357   .1723748     1.79   0.073    -.0293128    .6463841
    /lnsigma |  -.7759725   .0268514   -28.90   0.000    -.8286003   -.7233447
-------------+----------------------------------------------------------------
         rho |   .2991043   .1569536                     -.0293044    .5692308
       sigma |    .460256   .0123585                      .4366601    .4851269
      lambda |   .1376646   .0756984                     -.0107016    .2860307
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     1.96   Prob > chi2 = 0.1618
------------------------------------------------------------------------------
Sigma right on (0.460 vs. true 0.46)
Rho is a little off (0.299 vs. true 0.25)
Cannot reject null (p = 0.1618):
Rho=0
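Stata estimates the ancillary parameters /athrho and /lnsigma and reports rho, sigma and lambda as transformations of them. The short check below, a sketch using the point estimates from the output above, recovers the three reported values.

```python
import math

# ancillary parameters copied from the Heckman output above
athrho, lnsigma = 0.3085357, -0.7759725

rho = math.tanh(athrho)    # inverse of athrho = atanh(rho)
sigma = math.exp(lnsigma)  # inverse of lnsigma = ln(sigma)
lam = rho * sigma          # lambda = rho * sigma

print(round(rho, 4), round(sigma, 4), round(lam, 4))
```

Working on the atanh and log scales keeps rho inside (-1, 1) and sigma positive during maximization, which is why the z statistics are shown for /athrho and /lnsigma and not for rho and sigma themselves.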
Comparison of Estimates
(standard errors in parentheses)

Covariate   OLS w/ all data    OLS w/ selected sample   MLE of Heckman SS model
Educ        0.0803 (0.0010)    0.0703 (0.0015)          0.0817 (0.0064)
Age         0.0122 (0.0003)    0.0119 (0.0005)          0.0125 (0.0006)
Constant    4.484  (0.017)     4.670  (0.026)           4.445  (0.127)
Comparison of Estimates

Covariate   OLS w/ all data    OLS w/ selected sample   MLE of Heckman SS model
Educ        0.0803             0.0703 [-12.5%]          0.0817 [1.7%]
Age         0.0122             0.0119 [-2.5%]           0.0125 [2.5%]
[% difference from OLS w/ all data]
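The bracketed figures are each estimate's percent difference from the corresponding full-sample OLS estimate. A small sketch of that arithmetic, using the rounded coefficients from the table; the helper name pct_diff is hypothetical.

```python
def pct_diff(estimate, benchmark):
    """Percent difference of an estimate from the benchmark, to one decimal."""
    return round(100 * (estimate - benchmark) / benchmark, 1)


ols_all = {"educ": 0.0803, "age": 0.0122}  # benchmark: OLS w/ all data
others = {
    "OLS w/ selected sample": {"educ": 0.0703, "age": 0.0119},
    "MLE of Heckman SS model": {"educ": 0.0817, "age": 0.0125},
}

for name, est in others.items():
    print(name, {k: pct_diff(est[k], ols_all[k]) for k in est})
```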
. * run heckman sample selection correction;
. * but use functional form to identify the model;
. heckman wearnl educ age, select(educ age);
Heckman selection model                         Number of obs      =     10000
(regression model with sample selection)        Censored obs       =      4663
                                                Uncensored obs     =      5337

                                                Wald chi2(2)       =    295.99
Log likelihood = -8946.485                      Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wearnl       |
        educ |   .0650405   .0113713     5.72   0.000     .0427531     .087328
         age |   .0115308   .0008296    13.90   0.000     .0099048    .0131568
       _cons |   4.775379   .2251815    21.21   0.000     4.334031    5.216727
-------------+----------------------------------------------------------------
select       |
        educ |   .1487287   .0031911    46.61   0.000     .1424743    .1549832
         age |    .009841   .0010369     9.49   0.000     .0078088    .0118733
       _cons |  -1.491731   .0524251   -28.45   0.000    -1.594483    -1.38898
-------------+----------------------------------------------------------------
     /athrho |  -.1419248   .3029263    -0.47   0.639    -.7356494    .4517998
    /lnsigma |  -.7939694   .0238974   -33.22   0.000    -.8408074   -.7471314
-------------+----------------------------------------------------------------
         rho |  -.1409795   .2969056                     -.6265094    .4233773
       sigma |   .4520469   .0108027                      .4313621    .4737235
      lambda |  -.0637293   .1356091                     -.3295183    .2020596
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     0.07   Prob > chi2 = 0.7894
------------------------------------------------------------------------------
Nowhere close on rho (-0.141 vs. true 0.25)
Comparison of Estimates

Covariate   OLS w/ all data    OLS w/ selected sample   MLE of Heckman SS model,
                                                        functional form ident.
Educ        0.0803             0.0703 [-12.5%]          0.065  [-19.2%]
Age         0.0122             0.0119 [-2.5%]           0.0115 [-5.7%]
[% difference from OLS w/ all data]
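One plausible reading of why the model identified by functional form alone does so poorly (an interpretation, not a claim made on the slides): without z, λ depends only on educ and age, and over the relevant range of the selection index λ is close to linear, so it is nearly collinear with the regressors already in the wage equation. The sketch below checks that near-linearity, assuming the true selection coefficients and an evenly spaced grid over the index range; both are illustrative choices.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def inverse_mills(x):
    """lambda(x) = phi(x) / Phi(x)."""
    return _STD.pdf(x) / _STD.cdf(x)


def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


# range of the index -1.5 + 0.15*educ + 0.01*age for educ in [0,16], age in [18,64]
lo = -1.5 + 0.15 * 0 + 0.01 * 18
hi = -1.5 + 0.15 * 16 + 0.01 * 64
grid = [lo + (hi - lo) * i / 200 for i in range(201)]
lam = [inverse_mills(x) for x in grid]

r = corr(grid, lam)
print(f"corr(index, lambda) over [{lo:.2f}, {hi:.2f}] = {r:.3f}")
```

A correlation close to -1 means λ adds little independent variation, which is consistent with the larger standard errors on educ and the constant in the output above.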