Within-Study Comparisons of Experiments and Non-Experiments


Comparing Results from RCTs
and Quasi-Experiments that share
the same Intervention Group
Thomas D. Cook
Northwestern University
Why RCTs are to be preferred
• Statistical theory re expectations
• Relative advantage over other bias-free methods, e.g., regression-discontinuity (RDD) and instrumental variables (IV)
• Ad hoc theory and research on implementation
• Privileged credibility in science and policy
• Claim that non-exp. alternatives routinely fail to
produce similar causal estimates
Dissimilar Estimates
• Come from empirical studies comparing
exp. and non-exp. results on same topic
• Strongest are within-study comparisons
• These take an experiment, throw out the
control group, and substitute a nonequivalent comparison group
• Given the intervention group is a constant, this is a test of the different control groups (see the sketch below)
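To make that logic concrete, here is a minimal sketch using simulated data assumed purely for illustration (the means, SDs, and sample sizes are hypothetical, not from any study reviewed here): because the intervention group is shared, the gap between the experimental and non-experimental impact estimates reduces to the gap between the two control groups.

    # Hypothetical simulated data illustrating the within-study comparison logic.
    import numpy as np

    rng = np.random.default_rng(0)
    treated = rng.normal(10.0, 2.0, 500)        # shared intervention group
    rct_control = rng.normal(8.0, 2.0, 500)     # randomized control group
    comparison = rng.normal(7.2, 2.0, 500)      # substituted non-equivalent comparison group

    exp_estimate = treated.mean() - rct_control.mean()
    nonexp_estimate = treated.mean() - comparison.mean()

    # The treated mean cancels, so the "bias" equals the control-group difference.
    bias = nonexp_estimate - exp_estimate
    print(bias, rct_control.mean() - comparison.mean())   # identical values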
Within-Study Comparison Lit.
• 20 studies, mostly in job training. Of the 14 in job training, reviews contend:
• (1) No study produces a clearly similar causal estimate, including Dehejia & Wahba
• (2) Some design and analysis features are associated with less bias, but bias remains
• (3) The average of the experiments is not different from the average of the non-experiments; but be careful here and note that the variance of the effect sizes differs by design type
Brief History of Literature on
Within Study Comparisons
• LaLonde; Fraker & Maynard
• 12 subsequent studies in job training
• Extension to examples in education in USA
and social welfare in Mexico, never yet
reviewed
Policy Consequences
• Department of Labor, as early as 1985
• Health and Human Services, job training
and beyond
• National Academy of Sciences
• Institute of Education Sciences
• Do within-study comparisons deserve all
this?
We will:
• Deconstruct “non-experiment” and compare experimental estimates to
• 1. Regression-discontinuity estimates
• 2. Estimates from difference-in-differences (fixed effects) designs
• Ask: Is the general conclusion about the inadequacy of non-experiments true across at least these different kinds of non-experiment?
Criteria of Good Within-Study
Comparison Design
1. Variation in mode of assignment: random or not
2. No third variables correlated with both assignment and outcome, e.g., measurement
3. Randomized experiment properly executed
4. Quasi-experiment a good instance of its “type”
5. Both design types estimate the same causal entity, e.g., LATE in regression-discontinuity
6. Acceptable criteria of correspondence between design types: ESs seem similar, do not formally differ, statistical significance patterns do not differ, etc. (a test sketch follows below)
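As one concrete reading of criterion 6, the sketch below checks whether an experimental and a non-experimental effect size formally differ, using a standard large-sample z-test and sample sizes assumed only for illustration; it is not a procedure prescribed in these slides.

    # Sketch of a z-test for the difference between two independent Cohen's d values.
    from math import sqrt
    from statistics import NormalDist

    def se_of_d(d, n1, n2):
        # Approximate standard error of Cohen's d for two independent groups of size n1, n2.
        return sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))

    def effect_sizes_differ(d_exp, se_exp, d_nonexp, se_nonexp):
        z = (d_exp - d_nonexp) / sqrt(se_exp ** 2 + se_nonexp ** 2)
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        return z, p

    # Illustrative numbers only (100 per group in each design): similar ESs give a large p-value.
    print(effect_sizes_differ(0.59, se_of_d(0.59, 100, 100), 0.57, se_of_d(0.57, 100, 100)))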
Experiments vs. Regression-Discontinuity Design Studies
Three Known Within-Study Comparisons of Exp and R-D
• Aiken, West et al. (1998): R-D study; experiment; LATE; analysis; results
• Buddelmeyer & Skoufias (2003): R-D study; experiment; LATE; analysis; results
• Black, Galdo & Smith (2005): R-D study; experiment; LATE; analysis; results (a generic R-D estimation sketch follows below)
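For readers unfamiliar with how an R-D LATE is typically obtained, the sketch below estimates the effect at the cutoff with a local linear regression on simulated data; the cutoff, bandwidth, and effect size are assumptions for illustration, not the specific analyses used by Aiken et al., Buddelmeyer & Skoufias, or Black, Galdo & Smith.

    # Generic regression-discontinuity sketch: local linear regression around a cutoff at 0.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    score = rng.uniform(-10, 10, 2000)                   # assignment (running) variable
    treated = (score < 0).astype(float)                  # e.g., remediation assigned below the cutoff
    outcome = 1.5 * treated + 0.3 * score + rng.normal(0, 1, 2000)

    bandwidth = 5.0
    window = np.abs(score) <= bandwidth                  # keep observations near the cutoff
    X = sm.add_constant(np.column_stack([treated, score, treated * score])[window])
    fit = sm.OLS(outcome[window], X).fit()
    print("estimated LATE at the cutoff:", fit.params[1])   # coefficient on 'treated'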
Comments on R-D vs Exp.
• Cumulative correspondence demonstrated over
three cases
• Is this theoretically trivial, though?
• Is it pragmatically significant, given variation in
implementation in both the experiment and R-D?
• As an “existence proof”, it belies the over-generalized argument that non-experiments don't work
• As a practical issue, does it mean we should support RDD when treatments are assigned by need or merit?
• Emboldens us to deconstruct the non-experiment further
Experiment vs. Differences-in-Differences
• Most frequent non-experimental design by far
across many fields of study
• Also modal in within-study comparisons in job
training, and so it provides major basis for past
opinion that non-experiments are routinely biased
• We review: 3 studies with comparable estimates
• 14 job training studies with dissimilar estimates
• 2 education examples with dissimilar estimates
Bloom et al
• Bloom et al. (2002; 2005): job training is the topic
• Experiment: 11 sites; 8 pre earnings waves; 20 post
• Non-experiment: 5 within-state comparisons; 4 within-city; all comparison Ss enrolled in welfare
• We present only the control/comparison contrast because the treatment time series is a constant
Issue is:
• Is there overall difference between control groups
randomly or non-randomly formed?
• If yes, can statistical controls (OLS, IV including Heckman models, propensity scores, random growth models) eliminate this difference?
• Tested 10 modes, but only one longitudinal
• Why we treat this as d-in-d rather than ITS (a minimal d-in-d sketch follows below)
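A minimal difference-in-differences sketch, with earnings numbers invented purely for illustration (not Bloom et al.'s data), shows why the design removes any fixed difference between the control and comparison groups.

    # Minimal d-in-d sketch on simulated pre/post earnings.
    import numpy as np

    rng = np.random.default_rng(2)
    pre_treat = rng.normal(300, 50, 400)    # pre-period earnings, treated group
    post_treat = rng.normal(380, 50, 400)   # post-period earnings, treated group
    pre_comp = rng.normal(280, 50, 400)     # pre-period earnings, comparison group
    post_comp = rng.normal(310, 50, 400)    # post-period earnings, comparison group

    # Subtracting each group's own pre-period mean removes time-invariant group differences.
    did = (post_treat.mean() - pre_treat.mean()) - (post_comp.mean() - pre_comp.mean())
    print("difference-in-differences impact estimate:", did)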
Bloom et al. Results
Bloom et al. Results (continued)
Implications of Bloom et al
• Averaging across the 4 within-city sites showed no difference; also true if the 5th, between-city, site is added
• Selecting the within-state and within-city comparisons obviated the need for statistical adjustments for non-equivalence; design alone did it
• Bloom et al tested differential effects of statistical adjustments in between-state comparisons where there were large differences
• None worked, and none did better than OLS
Aiken et al (1998) Revisited
• The experiment: remember that the sample was selected from a narrow range of test-score values
• Quasi-experiment: sample selection limited to students who register late or cannot be found in summer but who score in the same range as the experiment
• No differences between experiment and non-experiment on test scores or pretest writing tests
• Measurement identical in experiment and non-exp
Results for Aiken et al
• Standardized writing test: .59 and .57 (sig)
• Rated essay: .06 and .16 (ns)
• High degree of comparability in statistical
test results and effect size estimates
Implications of Aiken et al
• Like Bloom et al, careful selection of
sample gets close correspondence on
important observables.
• Little need for statistical adjustment; non-equivalence is limited only to unobservables
• Statistical adjustment minor compared to
use of sampling design to construct initial
correspondence
What happens if there is an initial
selection difference?
• Shadish, Luellen & Clark (2006)
Figure 1: Design of Shadish et al. (2006)
N = 445 undergraduate psychology students: pretests, and then random assignment to either arm
• Randomized experiment (n = 235): randomly assigned to mathematics training (n = 119) or vocabulary training (n = 116)
• Nonrandomized experiment (n = 210): self-selected into mathematics training (n = 79) or vocabulary training (n = 131)
All participants measured on both mathematics and vocabulary outcomes
What’s special in Shadish et al
• Variation in mode of assignment
• Hold constant most other factors through the first random assignment: population, measures, activity patterns
• Good experiment? Pretests; short-term and
attrition; no chance for contamination.
• Good quasi-experiment? Selection process; quality of measurement; analysis and role of Rosenbaum
Results Shadish et al
[Bar chart: bias (y-axis, 0 to 1) in the vocabulary and mathematics outcomes for the unadjusted quasi-experiment, propensity score stratification, and propensity score ANCOVA; x-axis: method of propensity score adjustment.]
Implications of Shadish et al
• Here the sampling design produced non-equivalent groups on observables, unlike Bloom
• Here the statistical adjustments worked when computed as propensity scores (a stratification sketch follows after this list)
• However, there is big overlap in the experimental and non-experimental scores due to the first-stage random assignment, making the propensity scores more valid
• Extensive, unusually valid measurement of a
relatively simple selection process, though not
homogeneous.
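The stratification sketch referred to above follows; the data-generating step and the logistic selection model are illustrative assumptions, not Shadish et al.'s data or exact specification.

    # Sketch of propensity-score stratification on simulated self-selection data.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    n = 1000
    covariates = rng.normal(size=(n, 3))                     # e.g., pretests and preference measures
    p_select = 1 / (1 + np.exp(-(covariates @ [0.8, -0.5, 0.3])))
    treated = rng.binomial(1, p_select)                      # self-selection into training
    outcome = 2.0 * treated + covariates @ [1.0, 0.5, -0.2] + rng.normal(size=n)

    pscore = LogisticRegression().fit(covariates, treated).predict_proba(covariates)[:, 1]
    strata = pd.qcut(pscore, 5, labels=False)                # five propensity-score strata

    df = pd.DataFrame({"y": outcome, "t": treated, "s": strata})
    means = df.groupby(["s", "t"])["y"].mean().unstack()     # treated/control means per stratum
    print("stratification-adjusted effect:", (means[1] - means[0]).mean())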
Limitations to Shadish et al
• What about more complex settings?
• What about more complex selection
processes?
• What about OLS and other analyses?
• This is not a unique test of propensity
scores!
Examine Within-Study
Comparison Studies with
different Results
• The Bulk of the Job Training Comparisons
• Two Examples from Education
Earliest Job Training Studies:
Adding to Smith/Todd Critique
• Mode of Assignment clearly varied
• We assume RCT implemented reasonably well
• But third-variable irrelevancies were not controlled, especially location and measurement, given the dependence on matching from extant data sets
• Large initial differences between randomly and
non-randomly formed comparison groups
• Reliance on statistical adjustment to reduce
selection, and not initial design
Recent Educational Examples
Agodini & Dynarski (2004)
• Drop-out prevention experiment in 16 middle and high schools
• Individual students, likely dropouts, were randomly assigned within schools: 16 replicates
• Quasi-experiment: students matched from 2 quite different sources, middle school controls in another study and national NELS data
• Matching on individual and school demographic factors
• 4 outcomes examined in the experiment, and so also in the non-experiment
• 128 propensity scores (16 x 4 x 2), computed basically from demographic background variables
Results
• Balanced matches were obtained in only 29 of the 128 cases
• Why was quality matching so rare? In the non-experiment the groups hardly overlap: the treatment group spans middle and high schools, but the comparisons are middle school only or from a very non-local national data set
• Mixed pattern of outcome correspondences in the 29 cases with computable propensity scores. Not good
• OLS did as well as propensity scores
Critique
• Who would design a quasi-experiment this way?
Is a mediocre non-experiment being compared to a
good experiment?
• An alternative design might have been:
• 1. Regression-discontinuity
• 2. Local comparison schools, with the same selection mechanism used to select similar comparison students
• 3. Use of multi-year prior achievement data
Wilde & Hollister (2005)
• The experiment: reducing class size in 11 sites; no pretest used at the individual level
• Quasi-experimental design: individuals in reduced classes matched to individual cases from the other 10 sites
• Propensity scores; mostly demographic
• Analysis treats each site as a separate experiment
• And so 11 replicates comparing an experimental
and non-experimental effect size
Results
• Low level of correspondence in
experimental and non-experimental effect
sizes across the 11 sites
• So for each site it makes a causal difference
whether experiment or quasi-experiment
• When aggregated across sites, results are closer: exp = .68; non-exp = 1.07 (a pooling sketch follows below)
• But they still reliably differ
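To show why site-by-site and pooled comparisons can tell different stories, here is a sketch of inverse-variance (fixed-effect) pooling with numbers assumed only for illustration, not Wilde & Hollister's data: small per-site samples yield noisy effect sizes, while the pooled estimate is far more stable.

    # Sketch: noisy per-site effect sizes vs. an inverse-variance pooled effect size.
    import numpy as np

    rng = np.random.default_rng(4)
    true_effect = 0.7
    site_n = np.full(11, 60)                     # 11 small sites
    site_se = np.sqrt(4.0 / site_n)              # rough SE of a standardized mean difference
    site_es = rng.normal(true_effect, site_se)   # noisy per-site effect sizes

    weights = 1.0 / site_se ** 2
    pooled = np.sum(weights * site_es) / np.sum(weights)
    print("per-site effect sizes:", np.round(site_es, 2))
    print("pooled effect size:", round(pooled, 2))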
Critique
• Who would design a quasi-exp on this topic
without a pretest on same scale as outcome?
• Who would design it with these controls?
• Instead select controls from one or more matched
schools on prior achievement history
• Again, a good experiment is being compared to a
bad quasi-experiment
• Who would treat this as 11 separate experiments vs. a more stable pooled experiment? Even in the authors' own analysis, the pooled results are much more congruent.
Hypothesis is that...
• The job training and educational examples that
produce different conclusions from the experiment
are examples of poor quasi-experimental design
• To compare a good exp to a poor quasi-exp is to confound the design type with the quality of its implementation: a logical fallacy
• But I reach this conclusion ex post facto and
knowing the randomized experimental results in
advance
Big Conclusions:
• R-D has given results not much different
from experiment in three of three cases.
• Simpler quasi-experiments tend to give the same results as the experiment if (a) there is population matching in the sampling design, as in the Bloom and Aiken studies, or (b) careful conceptualization and measurement of the selection model, as in Shadish et al.
What I am not Concluding:
• That a well-designed quasi-experiment is as good as an experiment. Differences in:
• Number and transparency of assumptions
• Statistical power
• Knowledge of implementation
• Social and political acceptance
• If you have the option, do an experiment because
you can rarely put right by statistics what you
have messed up by design
What I am suggesting you
consider:
• Whether this should be a unit on RCTs or on quality causal studies
• Whether you want to do RDD studies in
cases where an experiment is not possible
because resources are distributed otherwise
• Whether you want to do quasi-experiments
if group matching on the pretest is possible,
as in many school-level interventions?
More Contentiously if:
• The selection process can be
conceptualized, observed and measured
very well.
• An abbreviated ITS analysis is possible, as
in Bloom et al.
• The instinct to avoid quasi-experiments is
correct, but it reduces the scope of the
causal issues that can be examined
Results: Aiken et al
• Pretest values on SAT/CAT, 2 writing measures
• Measurement framework the same
• Pretest ACTs and writing: ns, exp vs. non-exp
• OLS tests
• Results for writing test = .59 and .57 (sig)
• Results for essay = .06 and .16 (ns)
Bloom et al Revisited
• Analysis at the individual level
• Within city, within welfare to work center,
same measurement design
• Absolute bias: yes
• Average bias: none across the 5 within-state sites, even without statistical tests
• Average bias limited to a small site and the non-within-city site (Detroit vs. Grand Rapids)
Correspondence Criteria
• Random error and no exact agreement
• Shared stat sig pattern from zero: 68%
• Two ESs not statistically different
• “Comparable” magnitude estimates
• One as a percent of the other
• Indulgence, common sense, and a mix
Our Research Issues
• Deconstructing “non-experiment”: do experimental and non-experimental ESs correspond differently for R-D, for ITS, and for simple non-equivalent designs?
• How far can we generalize results about invalidity
of non-experiments beyond job training?
• Do these within-study comparison studies bear the
weight ascribed to them in evaluation policy at
DoL and IES?