EXPERIMENTAL METHODS
LECTURES 1-8
Eric Bettinger
Stanford University
Where do we see randomized
experiments in public policy?
2
Oportunidades/Progresa in Mexico
Reducing Crime
Juvenile Delinquency
Early childhood development
Head Start
Education (general IES focus)
Vouchers
Career Themes
Class Size
Electricity pricing
Automated Medical Response
Housing Assistance
Housing Vouchers
Job Training
Unemployment Insurance
Welfare to Work
Health Services
College Financial Aid
College Work
Mental Health Treatments
How Have Randomized Experiments
Affected Public Policy?
3
Class Size and Tennessee STAR
California Class Size
Preschool and Perry Preschool
Head Start
Reading Curricula and Success for All
Conditional Cash Transfer and Progresa
Bolsa Escola and others
Educational Vouchers
Experiments as the “Gold Standard”
4
Wide recognition as the “Gold Standard”
World Bank
US Government and “No Child Left Behind” legislation
Why the “Gold Standard”?
Identifying Causal Impacts
Eliminating Selection Bias
Simplicity
Much Easier to Explain
No Need for Complex Statistical Modeling
Are there limitations?
5
YES!
Some potential limitations:
General equilibrium
Interpretation
Mechanism
Fragility of design
More…
We will return to this later in the course.
Key goals this week:
6
1. Understand causal modeling
2. Understand the relationship of randomization to causal modeling
3. Gain the statistical tools to analyze randomized experiments
4. Become aware of underlying assumptions, strengths, and weaknesses of experimental approaches
5. Become acquainted with key players in the development and implementation of randomized experiments
6. Gain the statistical tools to design randomized experiments
7. Understand other key issues in design and implementation of random experiments.
What is causality?
7
There are many potential definitions …
Does the treatment have to be manipulable?
Morgan and Winship: C causes E if 1) Both C and E occur; and 2)
If C did not occur and everything else was equal, E would not
have occurred.
Fisher: Outcomes differ by treatment
Neyman: Average outcomes of treatments A and B differ
Can there be effects of being female?
How do we think about mechanisms and counterfactuals?
Can differences in outcomes come from different dosages?
What is the counterfactual in the Neyman and Fisher definitions?
Defining Counterfactuals
8
Assume that each individual in the population has a
potential outcome for each potential causal state.
In the two-state model, we can define
Yi = Y1i if exposed to treatment
Yi = Y0i if not exposed to treatment
We could generalize this to Y2i, Y3i, . . . , Yki for k different
treatments.
These are potential outcomes, and we assume they exist.
Each individual has their own outcome – heterogeneous
outcomes.
We never observe more than one outcome for each person.
Counterfactual?
9
KEY POINT: We have to make assumptions about
whether the average observed for some
aggregated group is an accurate representation of
the counterfactual. In essence, the strength of an
identification strategy is our ability to accurately
measure the counterfactual outcome.
Rubin (1986): “What ‘Ifs’ Have Causal
Answers?”
10
Not all questions have causal answers.
Would I be here in Russia if I had studied art history instead of economics?
Would I be here in Russia if I were a woman rather
than a man?
SUTVA is the key assumption for determining which
questions have causal answers.
SUTVA = “stable unit treatment value assumption”
Baseline framework for understanding
SUTVA
11
What are the ingredients of an experiment?
N units indexed from i=1,…,N
T Treatments indexed by t=1,…,T
Y is outcome and indexed by t and i.
Two Key Conditions:
SUTVA says that Y will be the same for any i no matter how t is assigned.
SUTVA also says that Y will be the same no matter what
treatments the other units receive.
Condition 1. Same Y no matter how t is assigned.
12
Consider the following statement: If John Doe had been born a female, his life would have been different.
How do you make John Doe a female?
Hypothetical Y to X chromosome treatment at conception
Massive doses of hormones in utero
At-birth sex-change
Does the form of the change make a difference?
Consider an educational example
13
Are there treatments where different versions of t
matter?
Consider Class size.
If we assign a small class to a student (t), is there more than one (t) in existence? Are all small classes equal?
No – teachers differ, peers differ, and so on.
What about educational vouchers?
Vouchers are “coupons” which allow students to attend the
school of their choice.
Are outcomes the same no matter the voucher?
We have to assume that all versions of t are the same
under SUTVA.
Consider Tennessee class size
experiment (Ding and Lehrer 2011)
14
Under SUTVA, should the class size effect vary with the proportion of the school involved in the experiment?
Consider Tennessee class size
experiment
15
The problem is that it does. Why?
16
Condition 2. Yti cannot depend on whether i′ received t0 or t1
Example: suppose unit i’s treated outcome is Y1i = 3 when it alone is treated, d = (d1,d2,d3) = (1,0,0), but Y1i = 2 when another unit is also treated, d = (1,1,0), while its untreated outcome is Y0i = 1 regardless of how the other units are assigned, d = (0,1,0), (0,0,1), (0,1,1), and so on.
Consider the example above.
Are there treatments that are diluted as more people
get them?
Condition 2. Outcomes do not depend
on others’ assignments.
17
What other examples are there?
General equilibrium effects.
Do outcomes change for the control group?
What is the effect of taking this course in Russia versus at Stanford? Is the treatment the same?
Key point: My treatment cannot affect your
treatment.
Why do we care about SUTVA?
18
We can identify which questions can really be answered. A causal question is possible when SUTVA is satisfied.
SUTVA can help us sort through possible mechanisms.
It makes the interpretation clearer.
Consider religious schools
19
Would we expect the religious school effect to
change if more students attended religious school?
SUTVA does not hold if effectiveness of religious
schools depends on the number/composition of
students.
The distribution of students changes.
What, then, are the implications of the religious-school literature for voucher debates?
What should we think about SUTVA?
20
This is a useful framework for isolating the precise question for which we can claim a causal answer.
It likely holds in small samples.
In large movements, we could have general
equilibrium effects.
Simple Example
21
Recall that experiments include treatments, units of
observation, and outcomes.
Consider the following scenario: If females at firm f
had been male, their starting salaries would have
averaged 20% higher.
How do we divide this into treatments, units, and
outcomes?
It is likely not possible.
A useful motto: There is no causation without
manipulation.
The Perry Preschool Experiment
22
Motivation for large investments in preschool. Hundreds
of billions of dollars in investments.
Outline of the experiment:
1962: 123 Black preschoolers (58 in treatment)
Low-income and low IQ scores (70-85, 1 SD below the mean)
Randomly assigned at ages 3-4 to a high-quality preschool program or no program.
Data collected annually from ages 3 through 11 and at ages 14, 15, 19, 27, and 40.
Overview of Perry Results
23
Source: Schweinhart (2007)
More on Perry.
24
Source: Schweinhart (2007)
More Perry Results
25
Source: Schweinhart (2007)
More from Perry on Crime
26
Source: Schweinhart (2007)
Which Causal Questions Does Perry
Help Resolve?
27
Is attending preschool better than not attending for
low-income, low-ability, African-American students in
1962?
Is attending preschool better than not attending for
low-income, low-ability students?
Is attending preschool better than not attending for
low-income students?
Is attending preschool better than not attending for all
students?
SUTVA may not be fully satisfied but it helps us identify
which questions our studies may resolve.
Is Perry valid today?
28
Source: Schweinhart (2007)
Let’s formulate causality mathematically
29
Yi = Y1i if Di=1 (treated group)
Yi = Y0i if Di=0 (control group)
Equivalently, Yi = Y0i + (Y1i – Y0i)Di
The most common approach is to compare students attending a “treatment” to those not attending a treatment:
E[Yi | Di=1] – E[Yi | Di=0] =
E[Y1i | Di=1] – E[Y0i | Di=1] + ( E[Y0i | Di=1] – E[Y0i | Di=0])
The 1st term is the average effect on the treated; the 2nd term is selection bias.
In words: (average for group 1) – (average for group 2) = (average improvement for group 1 because of the treatment) + (average difference between group 1 and group 2 without the experiment).
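A minimal simulation sketch of this decomposition (all numbers hypothetical, not from any study): here treatment is taken up more often by units with high Y0, so the naive difference in means equals the effect on the treated plus a positive selection bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y0 = rng.normal(0, 1, n)                        # potential outcome without treatment
y1 = y0 + 0.3                                   # constant treatment effect of 0.3
d = (y0 + rng.normal(0, 1, n) > 0).astype(int)  # selection on Y0, not random

y = y0 + (y1 - y0) * d                          # observed outcome: Yi = Y0i + (Y1i - Y0i)Di

naive = y[d == 1].mean() - y[d == 0].mean()
att = (y1 - y0)[d == 1].mean()                  # E[Y1 - Y0 | D=1] = 0.3
bias = y0[d == 1].mean() - y0[d == 0].mean()    # E[Y0 | D=1] - E[Y0 | D=0] > 0

print(naive, att + bias)  # identical: naive comparison = ATT + selection bias
```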
Thinking through the expectation
30
E[Yi | Di=1] – E[Yi | Di=0] =
E[Y1i | Di=1] – E[Y0i | Di=1] + ( E[Y0i | Di=1] – E[Y0i | Di=0])
Example 1. Treatment = Religious Private Schooling
1st term is average effect on treated
Average private school outcome of individuals in religious private
schooling minus the average public school outcome for individuals
attending private school
The latter term is the unobserved counterfactual. We have to find a way
to estimate it.
2nd term is selection bias
The difference between the public school outcome that private school
attendees would have had and the public school outcome for public
school attendees.
Informs us as to how different private school attendees are from public
school attendees.
Thinking through the expectation
31
E[Yi | Di=1] – E[Yi | Di=0] =
E[Y1i | Di=1] – E[Y0i | Di=1] + ( E[Y0i | Di=1] – E[Y0i | Di=0])
Example 2. Treatment = Attending Preschool
The 1st term is the average effect on the treated:
The average preschool outcome of individuals attending preschool minus the average outcome that those students would have had if they had not gone to preschool.
The 2nd term is selection bias:
The difference between the outcome that preschool attendees would have had without preschool and the outcome of students not attending preschool.
Use of the formulation
32
It helps us figure out what we are estimating.
Later we will augment this model with the “probability of compliance.”
It identifies the key means we need to estimate to gain causal estimates.
It helps us analyze our approach regardless of our methodology.
What happens if we have
randomization?
33
E[Y0i | Di=0] = E[Y0i | Di=1]
Selection Bias is gone.
E[Yi | Di=1] – E[Yi | Di=0]
= E[Y1i | Di=1] – E[Y0i | Di=1] + ( E[Y0i | Di=1] – E[Y0i | Di=0])
= E[Y1i | Di=1] – E[Y0i | Di=1] + 0
= E[Y1i – Y0i | Di=1]
= E[Y1i – Y0i]
Simple differences in averages reveal the treatment
effect.
Hypothetical Example
34
A recent study found in the overall population that
students who attended schools of type X had higher test
scores than other students.
What do we expect the selection bias to look like? This
was a school requiring motivated parents.
If we randomized in the whole population, what
direction should we expect the treatment effect to go?
E[Yi | Di=1] – E[Yi | Di=0] =
E[Y1i | Di=1] – E[Y0i | Di=1] + ( E[Y0i | Di=1] – E[Y0i | Di=0])
Abdulkadiroglu et al (2011)
35
Examines charter schools in Boston
Charter schools are public schools which operate more like private schools.
Obama has pushed for more charter schools.
Important research question to know if they work.
Charter schools are often oversubscribed.
When oversubscribed, they use lotteries to determine who
gets in.
Other charters are not oversubscribed.
If a school is oversubscribed, what would you expect?
It probably does a pretty good job.
Abdulkadiroglu et al (2011)
36
Abdulkadiroglu et al (2011)
37
Synthesizing
38
The authors found a selection of schools of type X
which ran lotteries to determine who entered the
schools. In these schools, the researchers
exploited the randomization. The difference
between winners and losers was even higher than
the previous comparison.
Explain?
Basically, if the observed effect is greater than the treatment effect, then there are two possibilities:
Selection bias is negative at the no-wait-list schools [how likely is that?]
The treatment effect is much lower: the stronger the selection effects at these other schools, the lower the treatment effect.
Regression Formulation
39
Suppose we want to use regression analysis; how do we estimate treatment effects?
Yi = a + b*Treatmenti + ei
E[Yi | Di=1] = a + b +E(e|Di=1)
E[Yi | Di=0] = a + E(e|Di=0)
E[Yi | Di=1] – E[Yi | Di=0] = b + E(e|Di=1) – E(e|Di=0)
The last term, E(e|Di=1) – E(e|Di=0), is the selection bias.
It is nonzero exactly when e and D are correlated.
Regression Formulation
40
Yi = a + b*Treatmenti + ei
Consider the OLS estimator of b. We will call it bhat.
E[bhat] = b + E[T’e/T’T]
Selection bias would suggest that e and T are correlated.
Think of it as an omitted variable problem.
If T is randomly assigned, then E[T’e]=0. No omitted variable can be correlated with T.
Multivariate Regression
41
What is the consequence of including more X in a
regression where there is randomization?
Generally, the standard errors are lower:
X is correlated with Y, and once controlled for, it reduces the residual variance of Y.
Estimated treatment effect should be unbiased if
there is no selection bias
We will return to this.
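A small sketch on simulated data (hypothetical parameters): with randomized treatment, adding a covariate X that predicts Y leaves the estimated treatment effect unbiased but shrinks its standard error, because the residual variance falls.

```python
import numpy as np

rng = np.random.default_rng(1)
n, b = 5_000, 0.2
t = rng.integers(0, 2, n)                 # randomized treatment
x = rng.normal(0, 1, n)                   # covariate, independent of t
y = 1.0 + b * t + 0.9 * x + rng.normal(0, 1, n)

def ols_se(y, X):
    """OLS coefficients and conventional standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

ones = np.ones(n)
b1, se1 = ols_se(y, np.column_stack([ones, t]))      # treatment only
b2, se2 = ols_se(y, np.column_stack([ones, t, x]))   # treatment + covariate
print(b1[1], se1[1])   # unbiased for b, larger SE
print(b2[1], se2[1])   # still unbiased, smaller SE (residual variance falls)
```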
Recap up to now.
42
Experiments are our key interest.
Causal questions need treatments that can be manipulated, whether by the researcher or by “nature.”
SUTVA is essential for identifying which of the questions we ask are causal.
We need units, treatments, and outcomes.
Condition 1. Treatment is constant.
Condition 2. Your treatment does not affect mine.
Typical comparisons mask treatment effects and selection
bias.
Randomization removes selection bias, but there are other
ways to get rid of it.
Experiments vs. Observational Studies
43
Cox and Reid: The word experiment is used in a
quite precise sense to mean an investigation where
the system under study is under the control of the
investigator. This means that the individuals or
material investigated, the nature of the treatments
or manipulations under study and the measurement
procedures used are all selected, in their important
features at least, by the investigator. By contrast, in
an observational study some of these features, and
in particular the allocation of individuals to
treatment groups, is outside the investigator’s control.
Definition of Experiment
44
Notice that Cox and Reid never mentioned “randomization” in their
definition.
Are there experiments which are not randomized?
Not all studies lend themselves to randomization:
Effect of parental death on child outcomes
Effect of divorce on children
Effect of no schooling
We will distinguish between “field experiments” and “natural
experiments.”
Natural experiments are places where randomization occurs “by nature.”
Variation between treatment and control takes place for arbitrary and
seemingly random reasons.
Next week we are going to focus here.
In field experiments, the researcher “controls” the randomization.
Choosing the treatment
45
In designing random experiments, we start with the
treatment.
Two schools of thought:
1. Start with theory, policy, or conceptual frameworks.
2. Identify opportunities for randomization and then identify questions which might be of interest to academics or policymakers.
Angrist and Kremer and divergent approaches
The difference between program evaluation and research
The importance of partners.
Oftentimes the relationship is more important than the question
Duflo article discusses partners before the basics of
randomization
The importance of partners
46
Who are our partners?
Governments (Job Training, Negative Income Tax, Progresa, generating new pilots)
NGOs (can focus on smaller populations than governments)
Private Groups
Partners have priorities and existing beliefs.
These create limitations and opportunities in our design.
Partners have resources.
Few of us have the money or time to implement new
treatments to units or to gather data on outcomes.
Partners are key to this.
Partners can help us find populations to study.
Some examples to work through
47
Hypothesis: Students’ writing and self-confidence are linked. Self-affirming writing experiences reinforce student confidence and subsequently student outcomes.
What’s the ideal experiment? Is it plausible?
Hypothesis: Remediation is actually counterproductive in college.
Experiment? Plausibility?
Hypothesis: Deworming improves health and educational outcomes.
Experiment? Plausibility?
Hypothesis: Paying students for test scores improves academic performance.
Experiment? Plausibility?
Hypothesis: Positive (negative) verbal reinforcement improves (destroys) academic progress.
Experiment? Plausibility?
Worms paper results
48
Administer deworming medication to cure students of worms.
Worms are in dirty water. They are parasites causing health problems, and school attendance declines.
Worms are highly infectious: if a peer has worms, it is easy to get them yourself.
Randomized which students received the treatment within a community.
Randomized which communities received treatments.
Why would you want this type of design?
Key results?
No difference within communities.
Communities that received the treatment were better off.
Unit of randomization
49
Once we have the treatment, we can determine the
optimal units.
Why do we care about “units”?
Statistical power. Typically the unit of analysis is the same
as the unit of randomization.
Statistical modeling.
Contamination.
The causal question often changes with the units.
“External Validity.” Can we generalize to other populations?
Mode of delivery.
Political considerations.
Consider some simple cases
50
Which level (individuals, classes, schools) would be
best to test the following treatments?
Adoption of a new curriculum.
Creating a conditional cash transfer program based on
schooling.
Preschool subsidies.
Giving incentives to teachers.
Giving incentives to students studying in an online
course.
How many?
51
Once we know the level of randomization, we now
need to think about the ramifications in terms of the
number of units.
Power calculations are our key to determining the
size of the population necessary for answering our
causal questions.
We can derive power calculations from our
statistical models.
Case 1. Individual randomization
52
Our simple model works well (simple t-stats do
also):
Yi = a + b*Treatmenti + ei
Recall that the OLS estimator of b can be given by
bhat = Σi (yi – ȳ)(ti – p) / Σi (ti – p)²
where the sums run over i = 1, …, N, ti is the treatment indicator, and p is the fraction treated.
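A quick numerical check of this formula on simulated data; the closed-form slope also equals the simple difference in means between treated and control.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
t = rng.integers(0, 2, n).astype(float)        # treatment indicator
y = 0.5 + 0.3 * t + rng.normal(0, 1, n)

p = t.mean()                                   # fraction treated
bhat = ((y - y.mean()) * (t - p)).sum() / ((t - p) ** 2).sum()

print(bhat)
print(np.polyfit(t, y, 1)[0])                  # generic OLS slope: same number
print(y[t == 1].mean() - y[t == 0].mean())     # difference in means: same number
```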
Case 1. (cont.)
53
More importantly, we need to worry about the variance of the estimator:
V(bhat) = E[ (X′X)⁻¹ X′ee′X (X′X)⁻¹ ]
With X = [1, t], the cross-product matrix is
X′X = | N   pN |
      | pN  pN |
Case 1. (cont.)
54
(X′X)⁻¹ = 1/(p(1–p)N) · | p   –p |
                        | –p   1 |
so, with homoskedastic errors,
V(bhat) = σ² / (p(1–p)N)
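A simulation sketch checking this formula (hypothetical parameters): across many replications, the empirical variance of bhat should be close to σ²/(p(1–p)N).

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, sigma = 400, 0.5, 1.0
bhats = []
for _ in range(5_000):
    t = (rng.random(N) < p).astype(float)
    y = 0.3 * t + rng.normal(0, sigma, N)
    bhats.append(y[t == 1].mean() - y[t == 0].mean())   # OLS slope = diff in means

print(np.var(bhats))                    # empirical variance across replications
print(sigma**2 / (p * (1 - p) * N))     # formula: 1 / (0.25 * 400) = 0.01
```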
Case 1. (cont.)
55
So our t-statistic for H0: b = 0 will be
t = bhat / sqrt( σ̂² / (p(1–p)N) )
And hence the standard error bands are
bhat ± 1.96 · sqrt( σ̂² / (p(1–p)N) )
Case 1. Standard Error Bands
56
bhat ± 1.96 · sqrt( σ̂² / (p(1–p)N) )
Notice that the standard error bands depend on the
proportion getting the treatment, the total sample size,
and the estimated variance of the error term.
The standard error bands shrink as . . .
As p approaches ½.
As N increases.
As the variance of the error becomes smaller.
Including additional explanatory variables reduces the error term.
Case 1. Power Calculation
57
t = bhat / sqrt( σ̂² / (p(1–p)N) )
Suppose that I think a plausible effect size is 2 percent. I
want to make sure that 2 percent is statistically significant.
Given the variance of the error, a coefficient of 2 percent,
and the proportion treated, I can vary N to determine the
sample size needed to make 2 percent significant for a
given statistical power.
This exercise is called a power calculation.
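A hand-rolled version of this exercise under the normal approximation (a sketch, not Optimal Design’s exact t-based calculation). As an illustration, plugging in the Perry-style numbers used on the slides below (123 observations, a 0.4 SD effect, roughly half treated) gives power of about 60 percent.

```python
from math import erf, sqrt

def normal_cdf(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def power(effect, sigma, p, N, crit=1.96):
    """Approximate probability that the t-statistic exceeds the critical value."""
    se = sigma / sqrt(p * (1 - p) * N)
    return 1 - normal_cdf(crit - effect / se)

# Perry-style illustration: effect of 0.4 SD, about half treated, N = 123.
print(power(effect=0.4, sigma=1.0, p=0.5, N=123))   # ~0.60
```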
Optimal Design
58
Doing these power calculations by hand is tedious. The Smith Richardson Foundation funded the creation of software to do these power calculations. It is freely available on their website:
http://www.wtgrantfdn.org/resources/research-tools
The manual is very easy to read.
All of the ingredients are simple except the standard
deviation of the error term.
However, we can standardize our effect sizes by dividing effects by the standard deviation of the error term.
Optimal Design reports things in standard deviation units.
Optimal Design also assumes that 50 percent are treated as this
maximizes power.
Optimal Design (Screenshot 1)
59
Optimal Design
(Screenshot for Case 1)
60
Optimal Design
(Screenshot for Case 1)
61
Case 1. Choose the 1st Option
62
Case 1. Customizing
63
We next customize our calculations for our case.
For example, in Perry
Preschool there were
123 observations.
Effects on high school
graduation were about
0.4 standard deviations.
Let’s check on the
statistical power for 3
alternatives.
Case 1. Customizing
64
I’m also going to
adjust the X axis so
that we can get a
better picture.
Power calculations
65
So what kind of power did they have?
66
Generally, we want 70-80 percent power at a minimum. This was a risky
study. I would have counseled them to get closer to 150 observations.
Other ways to view it.
67
Let’s focus on the lower bound of the confidence
interval.
We will focus on the Minimal Detectable Effect Size
(MDES).
We choose the MDES versus sample size option.
MDES
68
We no longer assign a
treatment effect. Now
we assume a level of
power – 80 percent is
the default.
Effect size is on Y-axis.
Sample size on X.
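The same idea in a few lines (a sketch under the normal approximation): at 80 percent power and a two-sided 5 percent test, the MDES is roughly (1.96 + 0.84) times the standard error of bhat, here in standard deviation units.

```python
from math import sqrt

def mdes(N, p=0.5, z_crit=1.96, z_power=0.84):   # 0.84 ~ 80 percent power
    return (z_crit + z_power) * sqrt(1.0 / (p * (1 - p) * N))

for N in (80, 123, 150, 300, 1000):
    print(N, round(mdes(N), 2))
# N = 123 gives an MDES of about 0.5 SD, larger than Perry's 0.4 effect --
# consistent with the underpowered picture above.
```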
Case 2. Blocking (Multi-site Trials)
70
Oftentimes, it might be easier to run lotteries at
multiple sites.
For example, we examined the charter school paper. Lotteries were held at each school. In some sense, the charter paper is a combination of about 15 different lotteries or experiments. The authors could have used just one site, but they used all of the sites.
Blocking can also be used to decrease the variance
and the probability of bad randomization.
Blocks could be groups of homogeneous students.
Statistical Models with Blocks
71
Split between fields.
Sociology:
Yij = a + (b+uj)*Treatmentij + eij
The model allows for heterogeneity in treatment effects.
Good for trying to understand heterogeneous treatment effects (e.g., what percentage of schools have positive treatment effects).
Economics:
Yij = a + b*Treatmentij + uj + eij
The models typically have fixed effects for each block or site.
Good for trying to decrease the standard error.
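A sketch of the “economics” model on simulated multi-site data (all numbers hypothetical): site fixed effects soak up between-site variance in Y and shrink the standard error on the treatment effect.

```python
import numpy as np

rng = np.random.default_rng(4)
J, n, b = 20, 100, 0.2
site = np.repeat(np.arange(J), n)
u = rng.normal(0, 1, J)[site]                 # site effects u_j
t = (rng.random(J * n) < 0.5).astype(float)   # randomization within sites
y = b * t + u + rng.normal(0, 1, J * n)

def fit(X, y):
    """Treatment coefficient (column 1) and its conventional SE."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    return beta[1], np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

ones = np.ones(J * n)
dummies = (site[:, None] == np.arange(1, J)).astype(float)  # J-1 site dummies
print(fit(np.column_stack([ones, t]), y))            # no fixed effects
print(fit(np.column_stack([ones, t, dummies]), y))   # with fixed effects: smaller SE
```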
Let’s move on to more complex designs
72
Bettinger (2011) examines student incentives in Coshocton, Ohio,
United States.
A donor gave students up to $100 per year based on their test
scores. Students took five tests each year and could earn $20 per
test.
Coshocton had about 150 students in grades 3-6 each year. There
were four schools in the city with two classrooms in each school.
Teachers did not want to randomize within classes.
Principals did not want to randomize by classroom since many
teachers “team” teach and combine classes for some subjects.
The donor did not want to bring in additional schools outside of
Coshocton.
How do you maximize power given the political constraints?
Coshocton Randomization
73
Politically, the only option was to randomize at the
grade-school level.
There were four grades in each of the four schools.
The district was willing to run the project for as many years as would be necessary (within reason).
How would you do the power calculation?
Somewhat limited, because we are going to assume that individuals are independent over time. We won’t be able to control for this covariance.
75
Internal Validity
Unit of Randomization
Design Variation
Statistical Model
Verifying Randomization
Limits to Randomization
Key goals this week:
76
1. Understand causal modeling (Yesterday)
2. Understand the relationship of randomization to causal modeling (Today)
3. Gain the statistical tools to analyze randomized experiments (Today)
4. Become aware of underlying assumptions, strengths, and weaknesses of experimental approaches
5. Become acquainted with key players in the development and implementation of randomized experiments
6. Gain the statistical tools to design randomized experiments (Today and Tomorrow)
7. Understand other key issues in design and implementation of random experiments.
Defining causality
77
Compare outcomes of different treatments.
Treatments have to be manipulable or able to be changed.
The key is to identify the counterfactual outcome:
What would have happened without the treatment?
The counterfactual is never observed.
We need assumptions to justify why our comparison group represents the counterfactual.
Causal Questions and SUTVA
78
What are the ingredients of an experiment?
N units indexed from i=1,…,N
T Treatments indexed by t=1,…,T
Y is outcome and indexed by t and i.
Two Key Conditions:
SUTVA says that Y will be the same for any i no matter how t is assigned.
SUTVA also says that Y will be the same no matter what treatments the other units receive.
SUTVA and the Question
79
SUTVA is not about the experiment. It is about the
question.
When we decide on a comparison, SUTVA helps us
understand what causal question we are really able
to answer.
Remember how the question for Perry Preschool could be very narrow.
What should we think about SUTVA?
80
This is a useful framework for isolating the precise question for which we can claim a causal answer.
It likely holds in small samples.
In large movements, we could have general
equilibrium effects.
Statistical Models with Blocks
151
Split between fields.
Sociology:
Yij = a + (b+uj)*Treatmentij + eij
The model allows for heterogeneity in treatment effects.
Good for trying to understand heterogeneous treatment effects (e.g., what percentage of schools have positive treatment effects).
You could also include a site-specific error.
Economics:
Yij = a + b*Treatmentij + uj + eij
The models typically have fixed effects for each block or site.
Good for trying to decrease the standard error.
Variances with Blocks
152
Models with random effects (the sociology model):
V(bhat) = σδ²/J + σ²/(p(1–p)nJ)
There are J blocks and n people in each block.
σδ² is the variance of the treatment effect between sites.
The equation is the same as before except for the new first term.
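The same variance as a function (a sketch, standardized so that σ² = 1; sigma_delta2 stands for the between-site variance of the treatment effect). Setting it to zero recovers the simple one-site formula with N = nJ.

```python
def blocked_variance(J, n, p=0.5, sigma_delta2=0.0, sigma2=1.0):
    """Random-effects variance of bhat: sigma_delta^2/J + sigma^2/(p(1-p)nJ)."""
    return sigma_delta2 / J + sigma2 / (p * (1 - p) * n * J)

print(blocked_variance(J=17, n=700))                      # no effect variability
print(blocked_variance(J=17, n=700, sigma_delta2=0.01))   # first term dominates
```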
Implication of the new formula
153
Suppose that there is no variability between sites in the treatment effect:
The first term disappears.
Otherwise, the first term is going to be much larger than the second term, since the denominator of the second term is much larger.
The variance of the treatment effect across sites is hard to measure. You have to make an assumption as to what it might be.
What about the economic model
154
The economics model basically assumes zero variability across sites.
The focus is on how much of the variance the blocking variable can explain.
You can estimate this by running a regression of the outcome on dummy variables for the blocks.
An Example
Bettinger and Baker (2011)
155
Study of “coaching” in college.
Students arrive on campus and receive coaches.
Coaches call students and help them figure out how to do homework, study, plan, and prepare for course events.
The company providing coaches, Inside Track, wanted to prove itself.
It conducted random evaluations each year at each school it helped.
Bettinger and Baker (2011) cont.
156
In 2004 and 2007, Inside Track conducted 17
lotteries.
Each lottery is a “block.”
Randomization did not happen across sites but
within sites.
We used the economics model:
Yij = a + b*Treatmentij + uj + eij
Outcomes were staying in college for 1 year or 2
years.
Baseline Results with Covariates
Covariates include age, gender, ACT, HS GPA, SAT, on-campus status, home residence, merit scholarship, Pell Grant, math remediation, English remediation, and controls for having missing values in any of the covariates.

Model                            6-month    12-month   18-month   24-month
                                 retention  retention  retention  retention
Control Mean                     .580       .435       .286       .242
1. Baseline
   Treatment Effect (std error)  .052***    .053***    .043***    .034**
                                 (.008)     (.008)     (.009)     (.008)
   Lottery Controls              Yes        Yes        Yes        Yes
   N                             13,552     13,553     11,149     11,153
2. Baseline w/ Covariates
   Treatment Effect (std error)  .051***    .052***    .042***    .033**
                                 (.008)     (.008)     (.009)     (.008)
   Lottery Controls              Yes        Yes        Yes        Yes
   N                             13,552     13,553     11,149     11,153
Variation in Treatment Effect
158
The economists assume there is no difference in
treatment effects across sites.
Is that a good assumption?
Effects by Lottery?

Lottery       12-month      24-month
              Persistence   Persistence
1 (n=1583)    .078***       .020
2 (n=1629)    .057**        .039**
3 (n=1546)    .043*         .050**
4 (n=1552)    .050**        .050**
5 (n=1588)    .040          .029
6 (n=552)     .072*         --
7 (n=586)     .018          .066**
8 (n=593)     .023          -.017
9 (n=974)     .058**        --
10 (n=326)    .052          --
11 (n=479)    .091**        --
12 (n=400)    -.055         --
13 (n=300)    .162***       .054
14 (n=600)    .054          -.010
15 (n=221)    .136**        --
16 (n=176)    .062          .047
17 (n=450)    .000          .058
Standard deviation across sites in 12-month effect is about 0.098 if we
were to convert the effects to be standard deviations rather than effect
sizes.
Go Back to Design
160
Suppose you were starting a new project and
blocking seemed appropriate.
Let’s go through the software.
Blocking in Optimal Design
161
Options in Optimal Design
162
In order, the options are the following:
alpha = type I error (for our confidence interval)
P = power
sigma2 = effect size variability (variance of the treatment effect across sites)
n = number of units in each site
B = proportion of variance explained by blocks
R2 = proportion of variance explained by other variables
Graphs in Optimal Design
163
Design the Inside Track Evaluation
164
Suppose you were told that you had 17 blocks with about 700 people in each block.
What is your minimum detectable effect size?
We can assume a treatment variability across sites of 0.1.
We can also run a regression of the outcome on dummy variables for each block.
How much variation do blocks pick up?
165
MDES for Inside Track
166
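For comparison, a rough normal-approximation answer to the design question above (17 blocks of about 700, effect-size variability of 0.1 SD, half treated, and ignoring the variance explained by blocks); Optimal Design’s exact t-based figure will differ somewhat.

```python
from math import sqrt

# sigma_delta^2/J + 1/(p(1-p)nJ), with p = 0.5 and a unit-variance outcome
V = 0.1**2 / 17 + 1.0 / (0.25 * 700 * 17)
print(2.8 * sqrt(V))   # MDES at ~80 percent power: roughly 0.085 SD
```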
Summary on Case 2
167
Blocking can really help our standard errors, especially when there are big differences across blocks.
Especially in the sociologists’ preferred model, variability across blocks alters our standard errors, causing them to increase.
Standard errors start to be more responsive to the number of blocks than to the overall sample size.
Case 3. Group or Cluster Randomization
170
Some treatments are implausible at the individual level. We need to randomize at the group level.
Deworming paper
Coshocton paper
Clusters or groups all receive the treatment at the same time.
Variances with Clusters
171
Cluster randomization:
V(bhat) = σu²/(p(1–p)J) + σ²/(p(1–p)nJ)
There are J clusters and n people in each cluster.
σu² is the variance between clusters.
As in the prior case, J matters more than the overall sample size (nJ).
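A sketch of the implied MDES, writing the between-cluster share of a unit-variance outcome as rho (an assumed parametrization) and using the normal approximation. The cluster counts here are the Coshocton figures discussed a few slides below.

```python
from math import sqrt

def cluster_mdes(J, n, rho, p=0.5, M=2.8):   # M ~ 1.96 + 0.84 for 80 percent power
    V = rho / (p * (1 - p) * J) + (1 - rho) / (p * (1 - p) * n * J)
    return M * sqrt(V)

print(cluster_mdes(J=64, n=40, rho=0.05))   # ~0.19 SD
print(cluster_mdes(J=64, n=40, rho=0.10))   # ~0.25 SD
```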
Case 2. Blocking (Multi-site Trials)
172
Oftentimes, it might be easier to run lotteries at
multiple sites.
For
example, we examined the charter school paper.
Lotteries were held at each school. In some sense, the
charter paper is a combination of about 15 different
lotteries or experiments. The authors could have used
just one site, but they use all of the sites.
Blocking can also be used to decrease the variance
and the probability of bad randomization.
Blocks
could be groups of homogeneous students.
An Example
Bettinger and Baker (2011)
173
Study of “coaching” in college
Students arrive on campus and receive coaches
Coaches
call students and help them figure out how to
do homework, study, plan, and prepare for course
events.
The company providing coaches, Inside Track,
wanted to prove itself.
It
conducted random evaluations each year at each
school it helped.
Bettinger and Baker (2011) cont.
174
In 2004 and 2007, Inside Track conducted 17
lotteries.
Each lottery is a “block.”
Randomization did not happen across sites but
within sites.
We used the economics model:
Yij = a + b*Treatmentij + uj + eij
Outcomes were staying in college for 1 year or 2
years.
Baseline Results with Covariates
Covars include age, gender, ACT, HS GPA, SAT, On Campus, Home Residence, Merit Scholarship, Pell Grant,
Math Remediation, English Remediation, and controls for having missing values in any of the covars
Model
6-month
retention
12-month
retention
18-month
retention
24-month
retention
.580
.435
.286
.242
Treatment Effect
(std error)
.052***
(.008)
.053***
(.008)
.043***
(.009)
.034**
(.008)
Lottery Controls
Yes
Yes
Yes
Yes
13,552
13,553
11,149
11,153
Lottery Controls
.051***
(.008)
Yes
.052***
(.008)
Yes
.042***
(.009)
Yes
.033**
(.008)
Yes
N
13,552
13,553
11,149
11,153
Control Mean
1. Baseline
N
2. Baseline w/ Covariates
Treatment Effect
(std error)
Variation in Treatment Effect
176
The economists assume there is no difference in
treatment effects across sites.
Is that a good assumption?
Effects by Lottery?
Lottery
12-month
24-month
Persistence Persistence
.078***
.020
1 (n=1583)
Lottery
2 (n=1629)
.057**
3 (n=1546)
10 (n=326)
12-month
Persistence
.052
24-month
Persistence
--
.039**
11 (n=479)
.091**
--
.043*
.050**
12 (n=400)
-.055
--
4 (n=1552)
.050**
.050**
13 (n=300)
.162***
.054
5 (n=1588)
.040
.029
14 (n=600)
.054
-.010
6 (n=552)
.072*
--
15 (n=221)
.136**
--
7 (n=586)
.018
.066**
16 (n=176)
.062
.047
8 (n=593)
.023
-.017
17 (n=450)
.000
.058
9 (n=974)
.058**
--
Standard deviation across sites in 12-month effect is about 0.098 if we
were to convert the effects to be standard deviations rather than effect
sizes.
Go Back to Design
178
Suppose you were starting a new project and
blocking seemed appropriate.
Let’s go through the software.
Blocking in Optimal Design
179
Options in Optimal Design
180
In order, the options are the following:
alpha = type I error (for our confidence interval)
P
= Power
sigma2 = Effect size variability (variance
of treatment across sites)
n
= Number of units in each site
B
= Proportion of variance explained by Blocks
R2
= Proportion of variance explained by
other variables
Graphs in Optimal Design
181
Design the Inside Track Evaluation
182
Suppose you were told that you had 17 blocks with
about 700 people in each block
What is your minimum detectable effect size?
We
can assume a treatment variability across sites of
0.1
We can also run a regression of the outcome on dummy
variables for each block.
How much variation do blocks pick up?
183
MDES for Inside Track
184
Summary on Case 2
185
Blocking can really help our standard errors
especially when there are big differences across
blocks.
Especially in sociologists preferred model,
variability across blocks alters our standard errors
causing them to increase.
Standard
errors start to be more responsive to number
of blocks than the overall sample size.
Let’s move on to more complex designs
186
Bettinger (2011) examines student incentives in Coshocton, Ohio,
United States.
A donor gave students up to $100 per year based on their test
scores. Students took five tests each year and could earn $20 per
test.
Coshocton had about 150 students in grades 3-6 each year. There
were four schools in the city with two classrooms in each school.
Teachers did not want to randomize within classes.
Principals did not want to randomize by classroom since many
teachers “team” teach and combine classes for some subjects.
The donor did not want to bring in additional schools outside of
Coshocton.
How do you maximize power given the political constraints?
Coshocton Randomization
187
Politically, the only option was to randomize at the grade-school level.
There were four grades in each of the four schools.
The district was willing to run the project for as many years as would be necessary (within reason).
How would you do the power calculation?
We are somewhat limited because we are going to assume that individuals are independent over time. We won't be able to control for this covariance.
Case 3. Group or Cluster Randomization
188
Some treatments are implausible at the individual level.
We need to randomize at the group level.
  Deworming paper
  Coshocton paper
Clusters or groups all receive the treatment at the same time.
Variances with Clusters
189
Cluster randomization:

  V = τ² / (p(1−p)J) + σ² / (p(1−p)nJ)

There are J clusters and n people in each cluster; p is the fraction of clusters treated.
τ² is the variance between clusters and σ² is the variance within clusters.
As in the prior case, J is more important than the overall sample size (nJ).
Let’s take it to Optimal Design
190
What are the options?
191
In order, the options are the following:
alpha = type I error (for our confidence interval)
n = size of each cluster
P = power
rho = intraclass correlation (the correlation of outcomes within clusters)
R12 = proportion of variance explained by cluster-level variables
Power in Coshocton?
192
In Coshocton, there were about 40 students in each cluster.
There were 16 clusters each year for three years, for a total of 48 clusters.
At the end, we convinced them to do it one more year to get 64 clusters.
We assume an intraclass correlation of either 0.05 or 0.10.
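A rough check on these numbers, using the standard large-sample MDES approximation for cluster-randomized designs (a sketch: it ignores the degrees-of-freedom corrections Optimal Design applies):

```python
from math import sqrt
from scipy.stats import norm

def cluster_mdes(J, n, rho, alpha=0.05, power=0.80, p=0.5):
    """Approximate MDES for a cluster-randomized trial with J clusters of
    size n, intraclass correlation rho, and share p of clusters treated."""
    M = norm.ppf(1 - alpha / 2) + norm.ppf(power)        # ~2.80
    return M * sqrt((rho + (1 - rho) / n) / (p * (1 - p) * J))

# Coshocton-scale inputs: 40 students per cluster, 48 or 64 clusters.
for J in (48, 64):
    for rho in (0.05, 0.10):
        print(f"J={J}, rho={rho:.2f}: MDES ~ {cluster_mdes(J, 40, rho):.2f}")
```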
193
Why do we care so much about power?
194
Consider a very hypothetical scenario.
A “partner” wants to run a new experiment.
They want to randomize across 80 students. Only
40 will receive the treatment.
In prior studies, the same treatment has generated
only small effects – about 0.10 standard deviations.
Is it worth your time to run this small experiment?
195
Is it worth it?
196
NO!!!!
With 80 students, the minimal detectable effect size
is 0.64.
Given the prior studies, we can never find an effect.
Suppose that the treatment generates effects of
0.50. We would not have the power to actually
measure them.
In the end, we would have to conclude that our
estimated effect is no different from zero even
though it could be very large.
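The 0.64 figure can be checked with the usual individual-randomization approximation; a minimal sketch (the normal approximation gives roughly 0.63, close to the 0.64 quoted above, which uses small-sample corrections):

```python
from math import sqrt
from scipy.stats import norm

# Individual randomization, 40 treatment / 40 control, outcome in SD units.
M = norm.ppf(0.975) + norm.ppf(0.80)          # ~2.80 for alpha=.05, power=.80
N, p = 80, 0.5
mdes = M * sqrt(1 / (p * (1 - p) * N))        # = M * sqrt(4 / N)
print(f"MDES with {N} students: {mdes:.2f}")  # ~0.63
```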
One more exercise… Costs.
197
Suppose that the mayor of a large city comes and
asks you to conduct a randomized experiment in
paying students for test scores.
There are 300 schools with 4 classrooms in each
school and 30 students per classroom.
The expected cost per student treated will be 600 rubles.
Suppose we use everyone. How much power do we
have?
How should we design it?
200
Randomize at the student, the class, or the school?
For simplicity, let’s randomize each school.
150 schools in the treatment and 150 in the control.
We will have 120 children in each school.
18,000 students in the treatment and 18,000 in the control.
The total cost = 10,800,000 rubles.
The MDES = 0.078 standard deviations.
Just for fun…
201
How much more power would we have if we were
able to randomize at the student level rather than
the school?
MDES = 0.03 (less than half the prior MDES)
Let’s go back to our new exercise…
202
The mayor is really frustrated when you bring the cost estimate. It is too expensive! He wants the costs cut in half.
The total cost = 10,800,000 rubles.
The MDES = 0.078 standard deviations.
How can we save money? (Many possible answers.)
For simplicity, let's pick three options which would reduce the costs by half. Any predictions?
  Randomize at the school level but only choose 2 classrooms at each school.
  Treat schools as blocks and randomly choose 1 treatment and 1 control classroom in each school.
  Only use 150 schools.
Randomize at school level with only
two classes per school.
203
Schools as blocks with randomization
of 2 classrooms within the block
204
Only 150 Schools
205
Summing it up
206
Treating everyone:
  The total cost = 10,800,000 rubles. The MDES = 0.078 standard deviations.
Treating only 2 classrooms per school:
  The total cost = 5,400,000 rubles. The MDES = 0.083 standard deviations.
Schools as blocks and randomize across two classes:
  The total cost = 5,400,000 rubles. The MDES = 0.055 standard deviations.
  But it possibly changes the nature of the treatment.
Only using half of the schools:
  The total cost = 5,400,000 rubles. The MDES = 0.111 standard deviations.
Where are we at in the course?
207
We chose a treatment.
We used theory or policy and enlisted a partner.
We determined where to randomize and how many to randomize.
Now we randomize.
Random Assignment
208
Use a transparent form of randomization.
Have witnesses.
Random numbers are preferred.
With children, lotteries or flipping coins are best.
Make sure that partners understand the process.
Keep control.
Michigan voucher experience:
  Instead of using a randomized wait list, an administrator chose "randomly." She chose students who lived closest to her office.
  Unfortunately, richer students lived near her office.
  Even worse, we discovered it after the experiment.
Colombia vouchers:
  One city cheated and gave the vouchers to political friends.
Randomization in Coshocton
209
Verify the Randomization
210
If it works, then there should be few differences
across control and treatment groups.
[Figure: Pre-Lottery Math Test Scores in Coshocton (Regression Corrected), treatment vs. control distributions]
[Figure: Pre-Lottery Reading Test Scores (Regression Corrected), treatment vs. control distributions]
Basic Descriptive Statistics & Balance

Characteristic                Control     Difference for    Sample   Number of
                              Group Mean  Treatment (se)    Size     Lotteries
Female                        .488        .009 (.009)       12,525   15
Missing Gender                .675        -.001 (.001)      13,555   17
Age                           30.5        .123 (.209)       9,569    8
Missing Age                   .294        .0001 (.0010)     13,555   17
College Entrance Exam (SAT)   886.3       -11.01 (16.19)    1,857    4
Missing SAT                   .827        .001 (.002)       13,555   17
Living On Campus              .581        -.005 (.017)      1,955    4
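A sketch of how one row of such a balance table could be computed; the data here are simulated stand-ins, and the real table also adjusts for lottery, which this simple version omits:

```python
import numpy as np

def balance_row(x, treat):
    """Control mean and treatment-control difference (with SE) for one
    covariate, dropping missing values -- the kind of entry in the table
    above (without the lottery adjustment the actual table uses)."""
    ok = ~np.isnan(x)
    x, t = x[ok], treat[ok]
    xt, xc = x[t == 1], x[t == 0]
    diff = xt.mean() - xc.mean()
    se = np.sqrt(xt.var(ddof=1) / len(xt) + xc.var(ddof=1) / len(xc))
    return xc.mean(), diff, se

# Hypothetical data in the spirit of the Female row.
rng = np.random.default_rng(1)
treat = rng.integers(0, 2, 13555)
female = rng.binomial(1, 0.49, 13555).astype(float)
female[rng.random(13555) < 0.08] = np.nan        # some records missing gender
print("Female: control mean %.3f, diff %.3f (se %.3f)" % balance_row(female, treat))
```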
[Figure: Age distributions, treatment vs. control]
[Figure: SAT score distributions, treatment vs. control]
[Figure: High school GPA distributions, treatment vs. control]
Significant Differences by Lottery?
Lottery       # Characteristics   # Significant Diff (90%)
1 (n=1583)    2                   0
2 (n=1629)    2                   0
3 (n=1546)    2                   0
4 (n=1552)    2                   0
5 (n=1588)    2                   0
6 (n=552)     3                   0
7 (n=586)     3                   0
8 (n=593)     3                   0
9 (n=974)     9                   0
10 (n=326)    6                   0
11 (n=479)    6                   0
12 (n=400)    2                   0
13 (n=300)    1                   0
14 (n=600)    1                   0
15 (n=221)    3                   1
16 (n=176)    14                  0
17 (n=450)    12                  0
After we verify randomization. . .
218
We need to collect outcome data and estimate
effects.
Outcome data can be very expensive.
To cost…
Colombia Vouchers (American Economic Review 2002)
Random assignment of high school voucher in Bogota, Colombia
Hunting and gathering for lottery participants
16 college students from Javeriana University
Phone and house visits
Middle of Colombia’s civil war
Police at times limited our data collection
In the end, response rates were near 55 percent.
Cost of contacting each observation was roughly $300
Very rough estimate since I was not involved in financials
Data collection took 6 months, 4 trips to Colombia, and 1 full-time
manager
Publication in premiere economics journal
… or not to cost
Colombia Vouchers Part 2 (AER 2006)
In 2000, we learned of administrative data on the high school exit exam in Colombia.
One visit arranged the entire matching.
Given the universe of coverage, there was 100 percent response among students whose data were valid.
Cost per observation was roughly $6.
  Very rough estimate since I was not involved in financials.
Data collection took 2 months, 1 trip to Colombia, and 1 part-time manager.
Publication in premiere economics journal.
Lessons Learned
Administrative data hold the keys to reducing cost
of RCT evaluations
Quality of initial data collection greatly influenced
the probability that we could track students.
Time to complete the evaluation was greatly
reduced with administrative data.
The trade-off to administrative data was that we
have less information on underlying mechanisms.
Threats to Validity
222
Internal Validity: Does the experiment give unbiased answers for the population being randomized?
External Validity: Does the experiment help us understand other populations?
Threats to internal validity. . .
Contamination between treatment and control
Switching treatments
Attrition
Hawthorne or John Henry Effects
Key goals this week:
223
1. Understand causal modeling
2. Understand relationship of randomization to causal modeling
3. Gain the statistical tools to analyze randomized experiments
4. Become aware of underlying assumptions, strengths, and weaknesses of experimental approaches
5. Become acquainted with key players in the development and implementation of randomized experiments
6. Gain the statistical tools to design randomized experiments
7. Understand other key issues in design and implementation of random experiments.
Treatment Fidelity
224
I skipped this in Lecture 4, only mentioning it in class.
Once the treatment starts, we need to verify that it is
actually happening.
Are people participating in new programs?
Are individuals administering the treatment actually doing
it?
Examples:
Professional Development and Subsequent Teacher
Teaching.
Do coaches call students? 90 percent get one call. 63
percent get more than one call.
Student incentives, lightning strikes, and Kenya.
Contamination
226
Treatment Affects Control.
Peer effects are a frequent culprit.
  Teachers talk to their peers.
  Students talk to their peers.
Other spillovers can exist.
Creative research designs (e.g. the deworming paper) can help control for contamination.
Hawthorne or John Henry Effects
227
Hawthorne effects: The treatment group works harder than they normally would as a result of the treatment.
  Multiple papers provide some evidence that the class size experiment in Tennessee may have had Hawthorne effects.
John Henry effects: The control group works harder than they normally would as a result of the experiment.
Attrition
228
Attrition occurs when students leave the experiment at some point after the treatment.
Attrition can be a tricky problem.
  Always focus on whether attrition is related to being in either the control or the treatment.
  Consider how you are collecting outcome data and whether attrition favors one group over the other.
Example:
  In the coaching example, attrition occurs if students leave the university. If coaching has positive effects, then treatment students are more likely to stay at the university.
  If we collect outcome data using records from the university, then we are more likely to find treatment kids. If our outcomes come from surveys, then we might still have outcomes for students who left.
More on Attrition
229
It is useful for us to imagine the treatment and control groups.
For example, suppose that this box represents ability, with high ability at the top and low ability at the bottom. [Diagram: vertical box labeled "High Ability" at the top and "Low Ability" at the bottom]
We will assume one of these exists for both the control and treatment groups.
Outcome Data and Attrition?
230
[Diagram: treatment and control boxes split into non-attriters and attriters]
If we rely on outcome data and cannot collect data on attriters, then our randomization is compromised.
For example, if attriters come from the low end of the ability distribution, then we are going to have too many treatment people with low ability.
Our results will be biased downward in this example.
Outcome Data on Attrition
231
Choose data collection strategies that are neutral to treatment and control.
Examples:
  Survey all students using information from initial contact lists. (Why can't we update contact information from the school at the end of the experiment?)
  Rely on administrative data from a government or other collection group.
Always ask whether outcomes are related to attrition.
In surveys, always check whether survey response is symmetric across treatment and control groups.
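A minimal sketch of that symmetry check: difference an indicator for survey response by treatment status. The 3-point response gap below is a hypothetical illustration:

```python
import numpy as np

# Sketch: test whether survey response (non-attrition) is symmetric across arms.
# A large response-rate gap by treatment status signals attrition bias.
rng = np.random.default_rng(2)
treat = rng.integers(0, 2, 4000)
responded = rng.binomial(1, 0.52 + 0.03 * treat)   # hypothetical 3-point gap

gap = responded[treat == 1].mean() - responded[treat == 0].mean()
se = np.sqrt(responded[treat == 1].var(ddof=1) / (treat == 1).sum()
             + responded[treat == 0].var(ddof=1) / (treat == 0).sum())
print(f"response-rate gap: {gap:.3f} (se {se:.3f}), z = {gap / se:.1f}")
```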
Attrition in “Downstream” Outcomes
232
Some outcomes that we wish to measure are
“downstream” or they occur after another outcome we
care about.
For example, we may care if students take college
entrance exams. In many countries only a fraction of
students take these exams.
We might be interested in two outcomes:
Did the student take the exam?
How did they score on the exam?
We never observe the score unless they complete the first outcome.
Colombia Voucher Experiment
233
In Colombia Voucher Experiment, Colombian
government provided educational vouchers (or
coupons) so that low-income students could attend
private school instead of government schools.
We want to measure the impact on the likelihood
that students take the college entrance exam and
on their scores on the exam.
Only about 35 percent of students take the college
entrance exam.
Outcomes After 3 Years (Bogota 95)

Dependent Variable                   Loser's       No Ctls        Basic Ctls     Basic + 19
                                     Means (1)     (2)            (3)            Barrio Ctls (4)
Started 6th in Private               .877 (.328)   .063** (.017)  .057** (.017)  .058** (.017)
Started 7th in Private               .673 (.470)   .174** (.025)  .168** (.025)  .171** (.024)
Currently In Private School          .539 (.499)   .160** (.028)  .153** (.027)  .156** (.027)
Highest Grade Completed              7.5 (.960)    .164** (.053)  .130** (.051)  .120** (.051)
Currently In School                  .831 (.375)   .019 (.022)    .007 (.020)    .007 (.020)
Finished 7th Grade (excludes Bog 97) .847 (.360)   .040** (.020)  .031 (.019)    .029 (.019)
Finished 8th Grade                   .632 (.483)   .112** (.027)  .100** (.027)  .094** (.027)
Repetitions of 6th Grade             .194 (.454)   -.066** (.024) -.059** (.024) -.059** (.024)
Sample Size                          562           1147 (regression sample)
Long Run Concerns
Effects Observed after 3 Years
1/2 of Voucher Winners No Longer Using Voucher
after 3 Years
No Difference in Attendance Rates
Ambiguity of repetition result.
Reliance on Survey Data and Response Bias
  Response rates around 55% but symmetric across treatment and control groups.
Why Administrative Records?
All college entrants and most high school grads take the
ICFES college entrance exams.
We use ICFES registration status and test scores as outcomes.
Advantages:
• Long-term outcome of major significance
• No need to survey; no attrition
Disadvantages:
• Score outcomes may be hard to interpret
• Differential record-keeping by win/loss status may
generate a spurious treatment effect
Vouchers and the Probability of ICFES Match

                      Exact ID      ID and City   ID and 7-letter
                      Match         Match         Name Match
A. All Applicants
  Dep. Var. Mean      .354          .339          .331
  Voucher Winner      .072 (.016)   .069 (.016)   .072 (.016)
  N                   3542          3542          3542
B. Female Applicants
  Dep. Var. Mean      .387          .372          .361
  Voucher Winner      .067 (.023)   .069 (.023)   .071 (.023)
  N                   1789          1789          1789
C. Male Applicants
  Dep. Var. Mean      .320          .304          .302
  Voucher Winner      .079 (.022)   .071 (.022)   .074 (.022)
  N                   1752          1752          1753
Evaluation Strategy
How do we think of effects on scores since voucher
affected registration for the exam?
We look at test scores among those who were tested.
These are clearly contaminated by selection bias since
vouchers affect testing probability.
The resulting bias probably masks positive effects; we consider a number of selection corrections.
Figure 1a. Language Scores by Voucher Status (No Correction for Selection Bias)
[Figure: score densities for voucher winners vs. voucher losers, scores 25-65]
Figure 1a (annotated). Language Scores by Voucher Status (No Correction for Selection Bias)
[Figure: same densities. Annotation: "Voucher winners here would never have taken the exam under the assumption that the voucher effect is monotonic. These individuals do not have 'twins' in the control for test scores."]
Solution for this type of attrition
241
Bounding exercises:
  Make some reasonable assumptions about what test scores would look like. Apply the assumptions to generate "upper" and "lower" bounds.
Alternatively, we can censor the results. Censoring means that we assign a base value to everyone scoring below a certain value (or not taking the exam).
Bounding in Colombia
242
For example, in Colombia, we made two assumptions.
First, we assumed the voucher did not hurt you. So test scores should either not be affected or be positively affected. For the lower bound, we can assume that there is no selection into the test.
Second, to get the upper bound, we assume that the voucher had a monotonic effect: everyone's test score was pushed upward. In this case, we can eliminate the bottom 5-6 percent of test scores, since these students would not have taken the exam in the absence of the voucher.
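A sketch of these two bounds on simulated, hypothetical data (the taking rates and score levels below are illustrative, not the Colombian estimates): the lower bound compares observed takers as-is, and the upper bound trims the lowest winners' scores in proportion to the extra test-taking the voucher induced:

```python
import numpy as np

def bounds(scores_w, scores_l, took_w, took_l):
    """Bounding sketch in the spirit described above.
    Lower bound: compare observed takers as-is (no selection into testing).
    Upper bound: trim the lowest winners' scores in proportion to the
    extra taking the voucher induced, then compare."""
    lower = scores_w.mean() - scores_l.mean()
    excess = took_w.mean() - took_l.mean()             # extra taking among winners
    k = int(round(excess / took_w.mean() * len(scores_w)))
    upper = np.sort(scores_w)[k:].mean() - scores_l.mean()
    return lower, upper

rng = np.random.default_rng(3)
took_w, took_l = rng.binomial(1, 0.41, 1800), rng.binomial(1, 0.35, 1700)
scores_w = rng.normal(46, 6, took_w.sum())             # winners' observed scores
scores_l = rng.normal(45, 6, took_l.sum())             # losers' observed scores
print("lower bound %.2f, upper bound %.2f" % bounds(scores_w, scores_l, took_w, took_l))
```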
Lower Bound
[Figure 1a: Language Scores by Voucher Status (No Correction for Selection Bias), winners vs. losers]
Upper Bound
[Figure 2a: Language Score Distribution by Voucher Status for Equal Proportions of Winners and Losers]
Censoring
[Figure 3a: Tobit Coefficients by Censoring Percentile in Language Score Distribution, percentiles 0-90]
Summary on Attrition
246
Attrition can lead to biased results when it favors
one group over the other.
Attrition can happen in surveys in terms of nonresponse.
Choose outcome collection techniques that do not
favor treatment over control.
On “downstream” outcomes, use bounding
strategies.
Switching Treatments
247
The treatment may be desirable to the control group.
The control group may try to get the treatment through private means.
Examples:
  Coaching. The control group might seek out their own coaches.
  Colombia vouchers. Schools might offer scholarships to control group students but not to treatment students.
  Paying kids for test scores. Parents may give money to kids not chosen.
Switching Treatments or Compliance
248
Compliance means two key points:
  The treatment group participates in the treatment.
  The control group does not.
So far, we have assumed 100% compliance.
In practice, compliance is rarely 100%.
Estimation when Compliance < 100%
249
Consider the following:
  A researcher conducts a randomized experiment in which students watch an additional lecture. The lecture is online. He knows from the login tracking that 1/3 of students did not watch, and he can identify which students did not watch.
  He announces that he will exclude the students who did not participate.
What is wrong?
Examining Compliance
250
What if compliance is related to another variable which affects outcomes?
For example, suppose that this box represents ability, with high ability at the top and low ability at the bottom. [Diagram: vertical box labeled "High Ability" at the top and "Low Ability" at the bottom]
As before, we will assume one of these exists for both the control and treatment groups.
What if Ability is Related to Compliance?
251
[Diagram: treatment and control boxes; compliers at the top of the treatment box, non-compliers at the bottom, with a "?" in the control box]
Excluding noncompliers would exclude individuals with the lowest ability.
Ability is unobserved, and we do not know which control group students would have complied.
Other Compliance Problems
253
[Diagram: control box split into non-compliers and compliers, with a "?" in the treatment box]
Non-compliers in the control may have sought out the treatment.
Ability is unobserved, and we do not know which treatment group students would have been non-compliers without the experiment.
Summing Up Compliance
254
Treatment group                     Control group
Always Takers (compliers)           Always Takers (non-compliers)
Compliers                           Compliers
Never Takers (non-compliers)        Never Takers (compliers)

Comparing students who complied could lead to big biases.
We never observe who would have been non-compliers in the absence of the experiment.
Consider the treatment effect
255
We can think about what the estimated treatment
effect looks like with randomization.
E[Yi | Di=1] – E[Yi | Di=0]
We can break this up into our three groups: always takers (AT), compliers (C), and never takers (NT):
  E[Yi | Di=1] – E[Yi | Di=0] =
    P_AT * (E[Yi | Di=1, AT] – E[Yi | Di=0, AT])
    + P_C * (E[Yi | Di=1, C] – E[Yi | Di=0, C])
    + P_NT * (E[Yi | Di=1, NT] – E[Yi | Di=0, NT])
Consider the treatment effect
256
E[Yi | Di=1] – E[Yi | Di=0] =
  P_AT * Effect on Always Takers
  + P_C * Effect on Compliers
  + P_NT * Effect on Never Takers
But what is the effect on Always Takers? Never Takers?
They should be zero. Why?
Hence, the average difference between winners and losers is really just (P_C * Effect on Compliers).
Intention to Treat
257
The average difference in a comparison of treatment and control groups with randomization is just (P_C * Effect on Compliers).
We call this quantity the "intention to treat."
"Intention to treat" is likely the relevant policy variable.
If we know P_C, then we can divide "intention to treat" by this probability to get the "effect of the treatment on the treated."
Estimating the Probability of
Compliance
262
This is often called the 1st stage.
We want to run the following regression:
  Participating in the Treatment = a + b*(Randomly Chosen) + e
This is not as straightforward as it sounds.
What does "participating in the treatment" entail?
  For vouchers, does it mean receiving a voucher, using a voucher, or attending private school?
  For coaching, does it mean talking to a coach, making an action plan, or completing goals?
We need to monitor treatment fidelity if we want to measure participation in the treatment.
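A minimal sketch tying the pieces together on simulated data: the first stage (compliance rate P_C), the intention to treat, and the treatment-on-the-treated obtained by dividing the two. The 63 percent participation rate echoes the coaching example, but the data are invented:

```python
import numpy as np

# Sketch: first stage, ITT, and TOT under one-sided noncompliance.
rng = np.random.default_rng(4)
n = 10000
assigned = rng.integers(0, 2, n)                 # random assignment
complied = assigned * rng.binomial(1, 0.63, n)   # 63% of assignees participate
y = 0.20 * complied + rng.normal(0, 1, n)        # true effect 0.20 on compliers

# First stage: effect of random assignment on participation (P_C).
p_c = complied[assigned == 1].mean() - complied[assigned == 0].mean()
# ITT: difference in outcomes by assignment, ignoring compliance.
itt = y[assigned == 1].mean() - y[assigned == 0].mean()
# TOT (Wald/IV estimate): rescale the ITT by the compliance rate.
print(f"first stage {p_c:.2f}, ITT {itt:.3f}, TOT {itt / p_c:.3f}")
```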
Ethical and Practical Considerations
263
What treatments can we do?
In Russia, is there a process for getting research approved?
Can we withhold treatment if we know it works?
The only exceptions are if the knowledge gained justifies the
risk or we have budget constraints.
With budget constraints, we still have an obligation to serve
as many as our budget permits.
What happens if we observe the treatment “hurting”
people?
Some underlying considerations:
  Respect for persons, beneficence, justice.
  Protect people from harm; do not take advantage of them.
More Ethical Considerations
from Gueron (2000)
264
Social experiments should
not deny people access to services to which they are
entitled,
not reduce service levels,
address important unanswered questions,
include adequate procedures to inform program
participants and assure data confidentiality,
be used only if there is no less intrusive way to answer the
questions adequately,
have a high probability of producing results that will be
used.
Practical Considerations
265
The Judith Gueron paper on your reading list is a
good practical guide for performing randomized
experiments.
http://www.mdrc.org/publications/45/workpaper.html
Judith goes through a list of considerations that she deems key to working with people in the field. If you are ever going to run a randomized experiment, please go through her considerations before you start working with potential partners.
Estimating Causal Mechanisms
266
In the case of vouchers suppose that the “real” treatment, the
reason why treatment kids do better than control, is attendance at a
private school.
Hence, we are really interested in the regression:
test scores = a + b*attend private + e
Ideally, we would like to run this regression on the population, but
we cannot remove selection bias.
We estimate the voucher effect, but the voucher effect is a mix of
compliance and the effect of the treatment on the treated.
Can we estimate the effect of the treatment on the treated directly?
The answer is instrumental variables, or IV.
Haavelmo’s problem
Consider the following
scatter plot of price and
quantity
Is it a demand or supply
curve?
Given the obvious
downward slope, the
temptation is to say a
demand curve. But this is
wrong. Why?
Demand systems
These points are
equilibrium points. They
are points where supply
and demand intersect.
There are many possible
demand curves that are
consistent with these
points.
Simple supply and demand
So how do we solve the problem?
We need more information.
Once we have the information, we can use
instrumental variables to help us find the supply
curve.
Consider a very simple supply and demand model:
  y^d = a + b*p^d + w   (demand)
  y^s = d + f*p^s + v   (supply)
Simple Model
In the simple model of supply and demand equilibrium will
happen when demand (yd) is equal to supply (ys).
If a positive shock hits demand (e), it will cause the demand
curve to shift out and to the right. Because y and p and jointly
determined, the shock (e) will also cause the price to shift.
In OLS estimation, we assumed that our shocks (e) are
uncorrelated with our X variables.
The result is that our OLS estimates will be biased in this case.
They will not reveal the true slope of the demand curve.
This type of problem is often called the problem of
“endogenous” regressors and it is related to omitted variable
bias.
How prevalent is the problem?
If you have a model where the x variable reasonably affects y and the y variable reasonably affects x, then you likely have this problem of endogeneity.
For example, suppose we are comparing test scores
to grades in a class. We had in mind to see if your
grade predicts your test score. However, your test
scores may also predict your grades and so we have
an endogeneity problem.
A more subtle example would be smoking and
obesity. Which causes which?
So how do you fix it?
This is where extra information helps. Consider our same equation system. I have changed it so that X now enters the demand equation but not supply:
  y^d = a + b*p + c*X + w   (demand)
  y^s = d + f*p + v         (supply)
X might be something like "fads." They increase demand directly. People want more y because X increased. These fads can only affect supply through their impact on demand.
What happens as X changes?
As X changes, it causes the
demand curve to shift outward.
For each shift in demand, we
discover a new equilibrium point.
These equilibrium points show us
the shape of the supply curve.
Hence, added information about
demand helps us identify the slope
of the supply curve.
This added information is often
called “an instrument.”
Mathematical solution
We could have shown the same thing mathematically.
Let's let supply equal demand and solve the model for p and y.
We can do this through multiple steps. Setting demand equal to supply:
  a + b*p + c*X + w = d + f*p + v
  (b − f)*p = (d − a) − c*X + (v − w)
Solving for p:
  p = (d − a)/(b − f) − [c/(b − f)]*X + (v − w)/(b − f)
Substituting into the supply equation gives y:
  y = [d + f*(d − a)/(b − f)] − [c*f/(b − f)]*X + [f*(v − w)/(b − f) + v]
Two Key Equations
There are two key equations from this exercise:
  p = (d − a)/(b − f) − [c/(b − f)]*X + (v − w)/(b − f)
  y = [d + f*(d − a)/(b − f)] − [c*f/(b − f)]*X + [f*(v − w)/(b − f) + v]
These two equations show each endogenous variable (p and y) as functions of the one exogenous variable X.
These two equations are called the "reduced form."
Two Key Equations
Our previous graphical solution showed that we could solve for the coefficient "f" (i.e. the slope of the supply curve) by using the information in X.
To see this mathematically, look at the reduced form equations again:
  p = (d − a)/(b − f) − [c/(b − f)]*X + (v − w)/(b − f)
  y = [d + f*(d − a)/(b − f)] − [c*f/(b − f)]*X + [f*(v − w)/(b − f) + v]
Look at the coefficients on the X variable.
Can we use this information to solve for the slope? Absolutely. Just take the ratio of the coefficients:
  [−c*f/(b − f)] / [−c/(b − f)] = f
You will notice, however, that there is no way for us to deduce what the slope of the demand curve is in this case.
A few definitions
This technique that we used here is called “Indirect
Least Squares” and it is a close cousin of
Instrumental Variables.
The basic model of supply and demand is called
the “structural model.” The model where the
endogenous variables are functions of the
exogenous variables is called the “reduced form.”
Another example
What if the added information is in the supply equation instead? Now Z enters the supply curve but not demand:
  y^d = a + b*p + w           (demand)
  y^s = d + f*p + g*Z + v     (supply)
Z might be something like the weather, which affects the supply without impacting individuals' demands for y.
We solve the system the same way. Setting demand equal to supply:
  a + b*p + w = d + f*p + g*Z + v
  (b − f)*p = (d − a) + g*Z + (v − w)
  p = (d − a)/(b − f) + [g/(b − f)]*Z + (v − w)/(b − f)
Substituting into the demand equation gives y:
  y = [a + b*(d − a)/(b − f)] + [b*g/(b − f)]*Z + [b*(v − w)/(b − f) + w]
The new reduced form
There are two key equations from this exercise:
  p = (d − a)/(b − f) + [g/(b − f)]*Z + (v − w)/(b − f)
  y = [a + b*(d − a)/(b − f)] + [b*g/(b − f)]*Z + [b*(v − w)/(b − f) + w]
These two equations show each endogenous variable (p and y) as functions of the one exogenous variable Z.
Now the coefficients on Z can be used to figure out the slope of the demand curve:
  [b*g/(b − f)] / [g/(b − f)] = b
Identification
In the very first structural model (i.e. no X or Z), we had one endogenous
variable on the right-hand side and no exogenous variables in either
regression. This situation is called “not identified.”
In the second model (included X in demand curve), we have a different
situation. In each equation of the structural model, we had one endogenous
right-hand side variable. In the demand equation, we had no excluded
exogenous variables. In the supply equation, we had one excluded
exogenous variable (X). In this example, the demand equation is “not
identified” while the supply equation is “just identified.” The supply
equation is identified because we can trace out its slope using shifts in the
demand equation.
In the third model (included Z in supply curve), the supply equation is no
longer identified but the demand equation is “just identified.”
Identification - extensions
There are some extra cases worth considering.
Suppose that there had been two X variables in the demand equation.
If you work this through, you would have two possible solutions for the
slope. In this case, the supply curve is “over identified.” There are more
exogenous excluded variables (the two X variables) than there are
included right hand side endogenous variables (p).
Suppose that X and Z were in the structural system, then both equations
would have been just identified.
Identification matters in that there are simple rules for knowing whether you can trace out the slope. The key is to compare the number of EXCLUDED exogenous variables to the number of included, right-hand side endogenous variables.
Instrumental Variables
Let's go back to the simple model with X:
  y^d = a + b*p + c*X + w
  y^s = d + f*p + v
We know that X can predict price. Earlier, we solved for the reduced form of the price equation.
Instrumental variables is a similar technique to what we used before, except we have two distinct steps.
Instrumental Variables
First, we predict Price using X. We estimate the
reduced form model. This equation is called the first
stage. Using the equation, we predict price and save
the predicted values.
Second, we regress y on the PREDICTED price. We
don’t use the actual price. We use the predicted
price. This exercise gives us an UNBIASED estimate of
the slope of the supply curve.
Most software (e.g. Stata, SPSS) actually does this in
one simultaneous step. It does it this way in order to
get the correct standard errors.
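A minimal sketch of the two-step procedure on simulated supply-and-demand data (parameter values are invented); note the caveat above that real software combines the steps to get correct standard errors:

```python
import numpy as np

# Two-step IV sketch for the supply/demand example: X shifts demand only,
# so predicted price traces out the supply slope f (set to 1.5 here).
rng = np.random.default_rng(5)
n = 5000
X = rng.normal(size=n)                        # demand shifter ("fads")
w, v = rng.normal(size=n), rng.normal(size=n)
a, b, c, d, f = 10.0, -1.0, 2.0, 1.0, 1.5
p = ((d - a) - c * X + (v - w)) / (b - f)     # equilibrium price (reduced form)
y = d + f * p + v                             # equilibrium quantity

# Step 1 (first stage): regress price on the instrument X, save fitted values.
Z = np.column_stack([np.ones(n), X])
p_hat = Z @ np.linalg.lstsq(Z, p, rcond=None)[0]
# Step 2: regress y on PREDICTED price; the slope estimates f.
D = np.column_stack([np.ones(n), p_hat])
print("IV estimate of supply slope:", np.linalg.lstsq(D, y, rcond=None)[0][1])
```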
Instrumental Variables
We don’t always use the reduced form as the first-stage
estimate. In practice, you regress the included endogenous
variable (price) on the EXCLUDED exogenous variables. The
excluded exogenous variable is called the “instrument”
Instruments satisfy two conditions:
  They do not directly affect the "y" equation of interest (they are uncorrelated with its error term). In this case, X was excluded from the supply equation.
  They are correlated with the included endogenous variable of interest. In this case, X predicts price.
We can test the second condition, but the first is only an assumption.
A practical example
Suppose you were examining the voucher program that we previously
talked about.
Consider the following equation system:
PrivateSchool=a + b*voucher+e
Testscores=c+d*private school + v
In this case, we can’t run the second equation in OLS since private schooling is
endogenous (i.e. higher test scores could “cause” private schooling).
However, voucher predicts private schooling. Additionally, voucher satisfies the
other condition that it does not affect test scores directly. Hence voucher is
an instrument for private schooling and the test scores equation is just
identified (one endogenous and one excluded exogenous).
If there had been additional X variables that predict test scores, we would
have included them in both equations so long as the variables were
exogenous.
Other examples of IV Studies

Y                   X                  Z
College enrollment  Financial aid      Thresholds in aid
Test scores         Class size         Thresholds of maximum class size
Health              Heart surgery      Proximity to hospital
Earnings            Years of school    Quarter of birth
Birth weight        Maternal smoking   State cigarette taxes

Note: See Angrist & Krueger (2001)
Slide from Hedges (2008)
Slide from Hedges (2008)
Let’s go back to compliers
289
Just as in our simple example of instrumental variables, we can imagine a set of equations:
  Treatment = a + b*(Randomly Selected) + e
  Outcome = c + d*(Treatment) + u
The variable (Randomly Selected) is an instrument for whether a person participates in the treatment.
  It is correlated with treatment but uncorrelated with the error term (u).
We can then get an unbiased estimate of "d," the effect of the treatment on the treated.
What if our treatment has many
dimensions?
290
We talked briefly about Progressa. It provided support to families depending on their children's attendance at school. It also depended on regular health visits.
Which "treatments" might it be providing?
  Health care
  Education
  Extra income
Here is the problem.
291
Treatment = a + b(Randomly Selected) + e
Outcome = c + d(Treatment) + u
Which variable represents the treatment? Health? Education?
Income?
All three treatments likely affect the outcome, and all three
treatments are likely endogenous.
The instrument can help us predict participation; however, we have four endogenous variables in this system (the outcome and the three treatments) and only one instrument (random assignment). We are not identified.
Random assignment can give us unbiased estimates of the intention
to treat parameter, but it can only give us unbiased estimates of the
treatment on the treated if certain conditions are met.
Missing compliance group
292
First, recall that we had three types of people:
always takers, never takers, and compliers.
There is one group that we left out: defiers.
Defiers are people who do the opposite of their assignment: if assigned to the treatment, they choose not to take it, even though they would have taken it had they been in the control.
We assume that these people don't exist.
This assumption is called monotonicity. People move forward, not back.
When does IV give us causal
mechanisms?
293
SUTVA has to be satisfied.
Random assignment has to influence participation
(i.e. some people need to comply).
Random assignment cannot be related to other
outcomes or treatments which may influence the
main outcome of interest.
Monotonicity (no defiers)
Do we need randomization?
294
We just need our instrument to be exogenous to
outcomes and related to our other endogenous
variable.
Randomization certainly helps, but there are many
other possible instruments.
The key is the exclusion restriction. You can’t test
that it is really exogenous. It is a theoretical
exercise and you have to think through stories which
may or may not threaten its exogeneity.
When Randomization is not Possible…
We started the research process and began imagining
the perfect experiment.
If it was feasible and we could find the partners, we
did the experiment.
If not, we became creative and tried to find the right
discontinuity or the right instrument.
Oftentimes, these are not enough. Our data lack any
exogenous variation to help us identify the causal
impact of a treatment.
What do we do?
Observational Studies
296
Earlier we talked about observational studies.
In these studies, we observe data where some have
had the treatment and others have not.
We had no control over who received the treatment
and the treatment may have been assigned
nonrandomly.
Raw comparisons of people in the treatment and
people not in the treatment are likely still biased.
We have done nothing to eliminate the possibility of selection bias.
So what do we do?
297
Matching is the most popular technique.
There are multiple ways to match students.
We could match on a single characteristic. For example, match everyone who has the same age.
We can match cities and "control" for pre-existing differences.
In recent years, propensity score matching has
become the most popular matching technique.
298
[Figure: Number of publications in PubMed with phrase "Propensity Score," by year, 1983-2006 — rising from near zero to roughly 180 per year]
Source: Arbogast (2005)
Propensity Score Matching
299
Propensity Score Matching (PSM) attempts to match
students in the treatment to other students with similar
characteristics.
PSM is a two-step process.
We need to specify the propensity score. This is the
probability that a person gets treatment given their
observable characteristics.
We then need to match students with similar propensities.
Key idea: You and I have the same probability of
being assigned treatment given our characteristics. We
might be a good match.
Compare two schools
300
Suppose School A uses a new curriculum while School B
does not.
We can imagine that School A is also, on average, a
school with richer students.
Our propensity score model would estimate the
probability that each student attends School A.
There are some students at School B who have very similar characteristics to some students at School A.
Our analysis will focus on students in the treatment in School A for whom we can find a suitable matching student in School B.
Why PSM?
301
Prior to PSM, we typically matched students based on a
small set of characteristics (e.g. age, gender,
socioeconomic status).
Matching could be cumbersome, and there were many
possible confounding variables.
For example, suppose that I match students in the same
grade, same gender, and same income level between
School A and School B. What other factors could
separate students at these schools?
Parent education levels, the types of courses that students
take, prior academic achievement, and so on.
Why PSM?
302
Rosenbaum and Rubin (1983) proposed using a
single index rather than several characteristics.
In our example, we can include age, gender, and
socioeconomic status, but we also can control for
many, many other characteristics.
We start with specifying a linear probability model.
Binary Treatment
303
In PSM, we start with a binary treatment variable:
  T = 1 if person is in the treatment, 0 if not.
We want to model the probability that T=1.
Typically we use one of three models:
  Linear probability model. Uses OLS to estimate the relationship.
  Probit. Models the probability using a Normal distribution.
  Logit. Models the probability using a Logistic distribution.
In each case, we allow a set of control variables to predict that T=1.
Let’s do some examples.
304
Consider a job training program.
Adults who lacked skills applied to the training
program.
The program initially used randomization to assign
students.
Dehejia and Wahba (2002) investigated whether a
different control group could be constructed using
PSM.
Because randomization was initially used, they can
compare PSM methods to randomization.
Randomized Sample
305
Current Population Survey
306
The CPS is a large household survey in the United States.
They survey 50,000 households per month.
The survey inquires about labor market experiences.
Start with the Propensity Score
307
The propensity score is estimated using a logit of treatment status on:
  Age, Age^2, Age^3, Years of Schooling, Years of Schooling^2
  Dummy variables for being Married, No degree, Black, Hispanic
  Real earnings in 1974, Real earnings in 1975
  Dummy variables for being unemployed in 1974 and 1975
  An interaction between years of schooling and earnings in 1974
All earnings are in 1982 US$.
Things start pretty bleakly
308
If we had stopped there…
309
Notice how different the PS are
310
So how do we refine our matches?
311
The most common techniques are:
  Nearest Neighbor Match: For each person in the treatment, identify the control group person with the closest propensity score. This can be done with or without replacement.
  Caliper Matching: Choose all control group individuals whose propensity scores are within a certain range of the treatment group. Use weights to downweight cases where multiple control group people are matched. The range is the caliper, and the researcher decides how big it should be.
  Mahalanobis Metric Matching: Redefines the nearest neighbor match by defining distance in a more nuanced way.
A sketch of nearest-neighbor matching follows below.
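A sketch of the two-step PSM process with a logit propensity score and 1-to-1 nearest-neighbor matching with replacement, on simulated data; the covariates, coefficients, and effect size are all illustrative:

```python
import numpy as np

# Sketch: logit propensity score plus nearest-neighbor matching (illustrative data).
rng = np.random.default_rng(6)
n = 2000
age = rng.normal(35, 10, n)
educ = rng.normal(12, 2, n)
p_true = 1 / (1 + np.exp(-(-8 + 0.10 * age + 0.30 * educ)))
treat = rng.binomial(1, p_true)                        # selection on covariates
y = 1500 * treat + 100 * educ + 50 * age + rng.normal(0, 500, n)

# Step 1: estimate the propensity score with a logit fit by Newton-Raphson.
X = np.column_stack([np.ones(n), age, educ])
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (treat - p))
ps = 1 / (1 + np.exp(-X @ beta))

# Step 2: for each treated unit, take the control with the closest score.
t_idx, c_idx = np.flatnonzero(treat == 1), np.flatnonzero(treat == 0)
nearest = c_idx[np.abs(ps[c_idx][None, :] - ps[t_idx][:, None]).argmin(axis=1)]
att = (y[t_idx] - y[nearest]).mean()
print(f"naive difference: {y[treat == 1].mean() - y[treat == 0].mean():.0f}")
print(f"matched estimate: {att:.0f}")   # should be closer to the true 1500
```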
Potential Problems with Matching
312
Incomplete Matches
  Some people may not match. They may not have a nearest neighbor.
Caliper may be too wide.
  The wider the caliper, the less precise the match, and the more likely we introduce selection bias.
Some Individuals are Excluded at Both Ends of the Propensity Score
[Figure: distributions of predicted probability for participants and nonparticipants; cases outside the range of matched cases are excluded at both ends]
Source: Guo, Barth, Gibbons (2004)
Propensity score distribution before trimming (example from Hill pre-K Study)
[Figure: propensity score histograms by group.
Comparison Group (n=1,144): 25th percentile 0.30, 50th percentile 0.40, 75th percentile 0.53, mean 0.39.
Treatment Group (n=908): 25th percentile 0.42, 50th percentile 0.52, 75th percentile 0.60, mean 0.50.]
Source: Lipsey (2007)
Propensity score distribution after trimming (example from Hill pre-K Study)
[Figure: propensity score histograms by group after trimming.
Comparison Group (n=908): 25th percentile 0.36, 50th percentile 0.45, 75th percentile 0.56, mean 0.46.
Treatment Group (n=908): 25th percentile 0.42, 50th percentile 0.52, 75th percentile 0.60, mean 0.50.]
Source: Lipsey (2007)
Estimate the treatment effect, e.g., by differences between matched strata
[Figure: treatment group and control group aligned by propensity score quintiles, with matches drawn between strata]
Source: Lipsey (2007)
Let’s go back to job training
317
• The sample is much more balanced and
comparable to NSW.
What about treatment effects?
318
Let’s go back to job training
319
• Again, the sample is much more
balanced and comparable to NSW.
What about treatment effects?
320
Two extensions
321
Can we refine the estimates even further?
How can we assess validity and fit of PSM?
Refinements
322
One common strategy to help reduce bias is to subclassify the propensity score.
Divide the propensity score into multiple groups; quintiles and deciles are common. For simplicity, let's assume deciles are used.
Within each group, we can assess balance between the control and treatment groups.
For each group, we estimate a separate treatment effect.
  These are interesting in themselves (e.g. the effect of the treatment on individuals who are highly likely to participate).
Combine the group estimates by weighting each group by the percent of the treatment sample in each group (see the sketch below).
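A minimal sketch of this subclassification estimator, assuming deciles; the group definitions and weighting follow the description above:

```python
import numpy as np

def subclassified_effect(y, treat, ps, n_groups=10):
    """Sketch of subclassification: cut the propensity score into deciles,
    take the treatment-control difference within each, and weight each
    group's estimate by its share of the treatment sample."""
    edges = np.quantile(ps, np.linspace(0, 1, n_groups + 1))
    group = np.digitize(ps, edges[1:-1])          # decile index, 0..n_groups-1
    est, n_treated = 0.0, (treat == 1).sum()
    for g in range(n_groups):
        gt = (group == g) & (treat == 1)
        gc = (group == g) & (treat == 0)
        if gt.any() and gc.any():                 # need overlap within the group
            est += (y[gt].mean() - y[gc].mean()) * gt.sum() / n_treated
    return est
```

Groups where either arm is empty are skipped, which mirrors the overlap requirement behind trimming.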
Smoking and Health
323
Don Rubin, “Using Propensity Score to Help Design
Observational Studies: Application to Tobacco
Litigation.” Health Services and Outcomes Research
Methodology. 2001.
Smoking is not randomly assigned.
Goal is to measure the impact of smoking on health
outcomes. The treatment is “smoking.”
Specifying the Propensity Score
324
More Variables for the PS
325
Testing Validity and Fit
326
One potential problem is misspecification in the propensity score.
  How robust are results to alternative specifications?
Dehejia and Wahba
327
Testing Validity and Fit
328
One potential problem is misspecification in the propensity score.
  How robust are results to alternative specifications?
Examine Bias
  Researchers refer to "Bias" in PSM as the difference in the mean propensity scores of the treatment and control groups, in standard deviation units.
Evaluation of Project GRAD
329
Table 2.8. The Potential Bias of Propensity Score Matching

Sample                                      Mean(PS):  Mean(PS):  Bias
                                            Control    Treatment  Term
All Schools                                 0.091      0.652      1.832
Matched on African-American Representation  0.248      0.652      1.319
Matched on Pre-PG Achievement Levels        0.378      0.682      0.994
Matched on All Characteristics
  including Free/Reduced Lunch              0.357      0.652      0.963
Matching plus Stratification Reduces Bias
330
Table 2.8. The Potential Bias of Propensity Score Matching

Sample                             Mean(PS):  Mean(PS):  Bias
                                   Control    Treatment  Term
Boys, All                          0.1414     0.2477     0.7733
Boys, Matched                      0.4963     0.5006     0.0311
Boys, Matched, Median Averaged     --         --         0.0118
Boys, Matched, Quartile Averaged   --         --         0.0225
Girls, All                         0.1391     0.2425     0.7493
Girls, Matched                     0.4996     0.5029     0.0242
Girls, Matched, Median Averaged    --         --         0.0070
Girls, Matched, Quartile Averaged  --         --         0.0099
Testing Validity and Fit
331
One potential problem is misspecification in the propensity score.
  How robust are results to alternative specifications?
Examine Bias
  Researchers refer to "Bias" in PSM as the difference in the mean propensity scores of the treatment and control groups, in standard deviation units.
Examine Ratio of Standard Deviations of the PS
  Samples should have similar standard deviations of the PS.
Comparison of Standard Deviations
332
Table 2.9. Ratio of Standard Deviations in Propensity Score for Treatment and Control Groups

Sample                             StdDev(PS):  StdDev(PS):  Ratio of
                                   Control      Treatment    StdDevs
Boys, All                          0.1061       0.1374       0.7725
Boys, Matched                      0.0337       0.0320       1.0534
Boys, Matched, Median Averaged     0.0177       0.0153       1.1557
Boys, Matched, Quartile Averaged   0.0127       0.0124       1.0264
Girls, All                         0.1038       0.1380       0.7522
Girls, Matched                     0.0296       0.0280       1.0584
Girls, Matched, Median Averaged    0.0163       0.0144       1.1319
Girls, Matched, Quartile Averaged  0.0153       0.0104       1.4702
Testing Validity and Fit
333
One potential problem is misspecification in the propensity score.
  How robust are results to alternative specifications?
Examine Bias
  Researchers refer to "Bias" in PSM as the difference in the mean propensity scores of the treatment and control groups, in standard deviation units.
Examine Ratio of Standard Deviations of the PS
  Samples should have similar standard deviations of the PS.
Examine Ratio of Standard Errors for Each Covariate
  Regress the covariate on the PS. Save the residual.
  Compare the standard deviation of this residual for the treatment and control groups (a sketch follows below).
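A sketch of this covariate-level check: regress the covariate on the propensity score, then compare the residual standard deviations across groups; ratios near 1 (roughly .95 to 1.05, as the next slide notes) indicate comparable samples:

```python
import numpy as np

def residual_sd_ratio(x, ps, treat):
    """Sketch of the check described above: regress covariate x on the
    propensity score, then take the ratio of the residual standard
    deviations in the treatment and control groups."""
    Z = np.column_stack([np.ones(len(ps)), ps])
    resid = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    return resid[treat == 1].std(ddof=1) / resid[treat == 0].std(ddof=1)
```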
334
Examine where this ratio falls for all of the variables. The goal is to have most of the ratios between .95 and 1.05.

Table 2.10: Ratio of Standard Error Terms (r) for Control and Treatment Groups for Each Explanatory Variable in Analysis

Sample          r<0.9    0.9<r<0.95   0.95<r<1.05   1.05<r<1.10   r>1.10
Boys, All       0.5      0            0.3125        0             0.1875
Boys, Matched   0        0            0.875         0.0625        0.0625
Girls, All      0.4375   0.0625       0.3125        0             0.1875
Girls, Matched  0        0            0.875         0.0625        0.0625
Assessing Bias in Smoking Study
Unmatched
335
Smoking Study – Matched Sample
336
Using Subclassifications
337
Limitations on Propensity Score
338
If individuals differ on unobserved characteristics, then the propensity score could induce bias.
Propensity scores in small samples can lead to imbalance.
Including irrelevant covariates in the PS can reduce efficiency (Guo, Barth, Gibbons 2005).
PSM can reduce large biases, but it cannot eliminate all bias (Bloom).
PSM might help identify bias in the short run, but over time the quality of comparisons deteriorates (Bloom).
We need overlap in the two groups.
Context matters. Surveys that do not match context may
have strong unobservables.
Implementing in Stata
339
There is a nice introduction to PSM as implemented in Stata:
http://fmwww.bc.edu/RePEc/usug2001/psmatch.pdf
psmatch treated, on(score) cal(.01)