PANEL DATA - Uniwersytet Warszawski

FINAL MEETING –
OTHER METHODS
Development
Workshop
General conclusions on causal
analyses


Magic tool of „ceteris paribus”
– Regression is ceteris paribus by definition
– But the data need not be: they are just a subsample of the general population, and many other factors confound the relationship
Causal effects, i.e. cause and effect
– Propensity Score Matching
– Regression Discontinuity
– Fixed Effects
– Instrumental Variables
If we cannot experiment...
Cross-sectional data
– Instrumental Variables (IV)
– Propensity Score Matching
– Regression Discontinuity Design
Panel data
– Propensity Score Matching + DiD
– Before-After Estimators
– Difference in Difference Estimators (DiD)
Problems with causal inference
[Diagram: Confounding Influence (environment), Treatment, Effect, Observables, Unobservables]
Instrumental Variables solution…
[Diagram: Confounding Influence, Treatment, Outcome, Observed Factor, Unobserved Factor, Instrumental Variable(s)]
Fixed Effects Solution… (DiD does
pretty much the same)
[Diagram: Confounding Influence, Fixed Influences, Treatment, Outcome, Observed Factor, Unobserved Factor]
Propensity Score Matching
[Diagram: Confounding Influence, Treatment, Outcome, Observed Factor, Unobserved Factor]
Regression Discontinuity Design
[Diagram: Group that is key for this policy, Confounding Influence, Treatment, Effect, Observables, Unobservables]
A motivating story
– Today women in Poland have on average 1.7 children
– About 50 years ago, women had 2.8 children
– Today's women are 6 times more educated than 50 years ago. Is the drop from 2.8 to 1.7 an effect of this educational change?
– Natural experiment: in 1960 compulsory schooling was extended by one year (11 to 12 years).
  – THE SAME women born just before 1953 spent a year less in primary and secondary school than those born after 1953
  – THE SAME = ?
– RD allows us to compare fertility (controlling for individual characteristics) for women born around 1953
Regression Discontinuity Design
Idea
– Focus your analyses on a group for which treatment was random (or rather: independent)
How to do it?
– Example: weaker students have lower grades, but are also frequently „delayed” to repeat courses/years; if we give them extra classes, better students will outperform them anyway, so how do we test whether extra classes help?
– RDD compares the performance of students just above and just below the „threshold”, so quite similar ones
– RDD only works if people cannot „prevent” or „encourage” treatment by relocating themselves around the „threshold”
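The comparison of units just above and just below the threshold can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up (a slope of 1 in the assignment variable and a true jump of 2.0 at the cutoff), the function name `sharp_rd_gap` is hypothetical, and plain means inside a window are used instead of the local regression that a dedicated RD routine would fit.

```python
def sharp_rd_gap(z, y, cutoff=0.0, bandwidth=1.0):
    """Mean outcome just above the cutoff minus mean outcome just below it."""
    above = [yi for zi, yi in zip(z, y) if cutoff <= zi < cutoff + bandwidth]
    below = [yi for zi, yi in zip(z, y) if cutoff - bandwidth <= zi < cutoff]
    if not above or not below:
        raise ValueError("no observations on one side of the cutoff")
    return sum(above) / len(above) - sum(below) / len(below)


# Hypothetical data (assumption): the outcome rises with z (slope 1)
# and jumps by 2.0 when z crosses the cutoff at 0.
z = [i / 10 for i in range(-20, 20)]            # -2.0, -1.9, ..., 1.9
y = [zi + (2.0 if zi >= 0 else 0.0) for zi in z]

narrow = sharp_rd_gap(z, y, bandwidth=0.5)      # only units close to the cutoff
wide = sharp_rd_gap(z, y, bandwidth=2.0)        # every unit in the sample
print("narrow window:", narrow)
print("wide window:  ", wide)
```

With these made-up numbers the narrow window lands closer to the true jump than the wide one, because the wide window also picks up the slope in z; the price of the narrow window is that it uses far fewer observations.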
Regression Discontinuity Design
Advantages:
– Really marginal effect
– Causal, if RDD is well applied
Disadvantages:
– Sample size largely limited
– Only „local” character of estimates (marginal≠average)
Problems:
– How do we know how far from the threshold we can go (bandwidth)?
– How do we know if the design is OK?
Regression Discontinuity Design
Application
– Trade-off between a narrow „bandwidth” (for the independence assumption) and a wide „bandwidth” (to increase sample size)
– One can try to find it empirically („fuzzy” RD design)

$$\hat{\tau}_{\text{cutoff}} = \frac{Y^{+} - Y^{-}}{p^{+} - p^{-}}$$

Y is the outcome (the effect), p is the treatment probability.
+ marks the value just above the „cut-off”
- marks the value just below the „cut-off”
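The ratio above can be computed by hand as a check on intuition. The sketch below is a minimal Python illustration, not the estimator used by any particular package: the function name `wald_rd` and the data are hypothetical, the true treatment effect is set to 3.0 by construction, and simple window means stand in for Y and p on each side of the cut-off.

```python
def mean(values):
    return sum(values) / len(values)


def wald_rd(z, treated, y, cutoff=0.0, bandwidth=1.0):
    """Jump in mean outcome divided by jump in treatment probability."""
    rows = list(zip(z, treated, y))
    above = [(t, yi) for zi, t, yi in rows if cutoff <= zi < cutoff + bandwidth]
    below = [(t, yi) for zi, t, yi in rows if cutoff - bandwidth <= zi < cutoff]
    if not above or not below:
        raise ValueError("no observations on one side of the cutoff")
    y_plus, y_minus = mean([yi for _, yi in above]), mean([yi for _, yi in below])
    p_plus, p_minus = mean([t for t, _ in above]), mean([t for t, _ in below])
    if p_plus == p_minus:
        raise ValueError("treatment probability does not jump at the cutoff")
    return (y_plus - y_minus) / (p_plus - p_minus)


# Hypothetical data (assumption): 20% treated below the cutoff, 80% above,
# and treatment raises the outcome by exactly 3.0.
z = [-0.5] * 10 + [0.5] * 10
treated = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y = [1.0 + 3.0 * t for t in treated]

estimate = wald_rd(z, treated, y)
print("Wald estimate:", estimate)
```

In this constructed case the outcome jumps by 1.8 and the treatment probability by 0.6, so the ratio recovers the effect of 3.0 that was built into the data.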
How to do this in STATA?
First – download the package: ssc install rd
Second – define your model
– rd outcomevar [treatmentvar] assignmentvar [if] [in] [weight] [, options]
Third – there are some options
– mbw(numlist) multiples of the „bandwidth” in percent (default: "100 50 200", which means we always estimate at 100%, 50% and 200% of the bandwidth)
– z0(real) sets the cutoff Z0 in the assignment variable
– ddens asks for extra estimation of discontinuities in the density of Z
– graph draws the graphs we've seen automatically
Sample results in STATA - data
Contains data from votex.dta
  obs:           349                          102nd Congress
 vars:            20                          5 Nov 2007 17:02
 size:        39,437 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
fips            byte   %8.0g       fips       State code
district        byte   %8.0g                  Congr district
d               double %10.0g                 Dem vote share minus .5
win             byte   %9.0g                  Dem Won Race
lne             float  %9.0g                  Log fed expenditure in district
i               byte   %9.0g                  Incumbent
votingpop       long   %12.0g                 Voting Age Population
votpop          double %10.0g                 Voting Age Population Share
populatn        long   %12.0g                 Population
black           double %12.0g                 Black Population Share
blucllr         double %12.0g                 Blue-collar Population Share
farmer          double %12.0g                 Farmer Population Share
fedwrkr         double %12.0g                 Fed Worker Population Share
forborn         double %12.0g                 Foreign Born Population Share
manuf           double %12.0g                 Manufactur Population Share
unemplyd        double %12.0g                 Unemp Population Share
union           float  %9.0g                  Unionized Population Share
urban           double %12.0g                 Urban Population Share
veterans        double %12.0g                 Veteran Population Share
ranwin          byte   %8.0g
Output from STATA
. use votex
(102nd Congress)
. rd lne d, gr mbw(100)
Two variables specified; treatment is
assumed to jump from zero to one at Z=0.
Assignment variable Z is d
Treatment variable X_T unspecified
Outcome variable y is lne
Command used for graph: lpoly; Kernel used: triangle (default)
Bandwidth: .29287776; loc Wald Estimate: -.07739553
Estimating for bandwidth .29287775925349
         lne |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       lwald |  -.0773955   .1056062    -0.73   0.464      -.28438    .1295889
Output from STATA - graph
[Figure: Log fed expenditure in district; Bandwidth .29287775925349]
Output from STATA – „fuzzy” version
gen byte ranwin=cond(uniform()<.1,1-win,win)
rd lne ranwin d, mbw(25(25)300) bdep ox
[Figure: Estimated effect (Est) with confidence interval (CI), plotted against Bandwidth]
One last thing
Quantile regressions
A motivating story
[Figure: deciles 1 to 9 and the average, in zł (axis from 0 zł to 4 500 zł), 1992 to 2007]
Some basic „doubts” of an empirical economist…
– Compare similar to similar
– Keep statistical properties
– Understand beyond „average x”
– Understand (and be independent of) „outliers”
Robust estimators


First flavour of robust – regression with robust option
– Helps if problem is not systematic
– Does not help if problem is the nature of the process
(e.g. heterogeneity)
Second flavour of robust – nonparametric estimators
– Mathematically complex
– Takes longer to compute
– But very flexible
=> Koenker (and his followers)
How to do this in STATA?
Estimate at the median
– qreg y $in
Estimate at any other percentile
– qreg y $in, quantile(q) where q is your percentile
Estimate differences between different percentiles
– iqreg y $in, quantile(.25 .75) reps(100), where reps() sets the number of bootstrap replications used for the standard errors
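To see what a quantile regression minimises, the following Python sketch fits a line by brute force on a tiny made-up sample. It is an illustration only and not how qreg computes its estimates: the names `check_loss` and `quantile_line` are hypothetical, the data are invented (four points on y = 2x plus one outlier), and the search relies on the standard linear-programming property that some optimal line passes through two observations, which is practical only for very small samples.

```python
from itertools import combinations


def check_loss(residual, q):
    """Check function: q * r for r >= 0 and (q - 1) * r for r < 0."""
    return residual * (q - (1.0 if residual < 0 else 0.0))


def quantile_line(x, y, q=0.5):
    """Brute-force fit of y = a + b*x minimising the summed check loss."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue                      # vertical pair defines no line
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        loss = sum(check_loss(yi - a - b * xi, q) for xi, yi in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, a, b)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


def ols_line(x, y):
    """Ordinary least squares fit, for comparison."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b


# Hypothetical data (assumption): y = 2x except for one large outlier.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 100]

a_med, b_med = quantile_line(x, y, q=0.5)
a_ols, b_ols = ols_line(x, y)
print("median regression slope:", b_med)
print("OLS slope:              ", b_ols)
```

On this sample the median fit stays on the line through the four regular points, while the least-squares slope is pulled far away by the single outlier, which is the sense in which the estimator is independent of „outliers”.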
Output from STATA
Iteration  1:  WLS sum of weighted deviations =  121.88268

Iteration  1: sum of abs. weighted deviations =        111
Iteration  2: sum of abs. weighted deviations =        110

Median regression                                    Number of obs =        10
  Raw sum of deviations      157 (about 14)
  Min sum of deviations      110                     Pseudo R2     =    0.2994

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
           x |         17   3.924233     4.33   0.003     7.950702     26.0493
       _cons |          3   2.774852     1.08   0.311     -3.39882     9.39882
Output from STATA
Iteration  1:  WLS sum of weighted deviations =  80.060899

Iteration  1: sum of abs. weighted deviations =      80.66
Iteration  2: sum of abs. weighted deviations =      79.36
Iteration  3: sum of abs. weighted deviations =      78.66

.33 Quantile regression                              Number of obs =        10
  Raw sum of deviations   122.86 (about 3)
  Min sum of deviations    78.66                     Pseudo R2     =    0.3598

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
           x |         18      4.608     3.91   0.005     7.373933    28.62607
       _cons |          1   3.258348     0.31   0.767    -6.513764    8.513764
Summarising all of this
[Diagram: Confounding Influence (environment), Treatment, Effect, Observables, Unobservables]
Problems
Sample
– size
– heterogeneity
Methods
– None is perfect
– The question is important
– Nonparametric methods (kernel in PSM, or QR) are robust, but robust is not a synonym for miraculous