Introduction to the Concepts and Methods of Impact Evaluation Martin Ravallion


Introduction to the Concepts and Methods of Impact Evaluation
Martin Ravallion
Development Research Group, World Bank
1. The type of program considered
2. The evaluation problem
3. Generic issues
4. Single difference: randomization
5. Single difference: controls for observables
6. Single difference: exploiting program design
7. Double difference
8. Higher-order differencing
9. Instrumental variables
10. Making evaluations more useful
1. The type of program
• Assigned programs without spillover effects
• some units (individuals, households, villages) get
the program and some do not;
• and the benefits are largely confined to those to whom the program is assigned
• Possible examples:
• Social fund selects from applicants
• Workfare: gains to workers and benefiting
communities; others get nothing
• Cash transfers to eligible households only
• Ex-post evaluation
• But ex post does not mean start late!
2. The evaluation problem
What do we mean by “impact”?
• Impact = the difference between the relevant
outcome indicator with the program and that without it.
• However, we can never observe someone in two
different states of nature at the same time.
• While a post-intervention indicator is observed, its value
in the absence of the program is not, i.e., it is a counterfactual.
So all evaluation is essentially a problem of
missing data. Calls for counterfactual analysis.
Naïve comparisons can be deceptive
Common practices:
• compare outcomes after the intervention to those before, or
• compare units (people, households, villages) that receive the program with those that do not.
Potential biases from failure to control for:
• other changes over time under the counterfactual, or
• unit characteristics that influence program placement.
Naïve comparison 1: Before vs. after
We observe an outcome indicator, Y0, at t=0, and its value rises to Y1 (observed) after the intervention, at t=1.
However, we need to identify the counterfactual, Y1* (the value the indicator would have taken at t=1 without the intervention), since only then can we determine the impact of the intervention:
Impact = Y1 − Y1*
[Figure: outcome path from Y0 at t=0 to Y1 at t=1, with the counterfactual path to Y1*; the gap between Y1 and Y1* is the impact.]
Naïve comparison 2: “With” vs. “without”
Impacts on poverty? Outcome indicator: percent not poor.

                                    Without (n=56)   With (n=44)   % increase (t-test)
Case 1: Program yields 20% gain           43              80           87% (2.29)
Case 2: Program yields no gain            43              66           54% (2.00)
How can we do better?
The missing-data problem in evaluation
• For each unit (person, household, village, …) there are two possible values of the outcome variable:
  • the value under the treatment
  • the value under the counterfactual
• However, we cannot observe both for all units:
  • we cannot observe the counterfactual outcomes for the treated units
  • or the outcomes under treatment for the untreated units
• So evaluation is essentially a problem of missing data => “counterfactual analysis.”
Archetypal formulation
Outcomes (Y) with and without treatment (D) given exogenous covariates (X):
$$Y_i^T = X_i \beta^T + \mu_i^T \qquad (i = 1, \ldots, n)$$
$$Y_i^C = X_i \beta^C + \mu_i^C \qquad (i = 1, \ldots, n)$$
$$E(\mu_i^T \mid X_i) = E(\mu_i^C \mid X_i) = 0$$
Gain from the program: $G_i = Y_i^T - Y_i^C$
ATE (average treatment effect): $E(G_i)$
Conditional ATE: $E(G_i \mid X_i) = X_i(\beta^T - \beta^C)$
ATET (ATE on the treated): $E(G_i \mid D_i = 1)$
Conditional ATET: $E(G_i \mid X_i, D_i = 1) = X_i(\beta^T - \beta^C) + E(\mu_i^T - \mu_i^C \mid X_i, D_i = 1)$
The evaluation problem
Given that we cannot observe $Y_i^C$ for $D_i = 1$ or $Y_i^T$ for $D_i = 0$, suppose we estimate the following model:
$$Y_i^T = X_i \beta^T + \mu_i^T \quad \text{if } D_i = 1$$
$$Y_i^C = X_i \beta^C + \mu_i^C \quad \text{if } D_i = 0$$
Or the (equivalent) switching regression:
$$Y_i = D_i Y_i^T + (1 - D_i) Y_i^C = X_i \beta^C + X_i(\beta^T - \beta^C) D_i + \nu_i, \qquad \nu_i = D_i(\mu_i^T - \mu_i^C) + \mu_i^C$$
Common-effects specification (only intercepts differ):
$$Y_i = (\beta_0^T - \beta_0^C) D_i + X_i \beta^C + \nu_i$$
The problem: X can be assumed exogenous but, without
random assignment, D is endogenous => ordinary
regression will give a biased estimate of impact.
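To make the endogeneity problem concrete, here is a minimal simulation sketch (not part of the original lecture; all variable names and parameter values are illustrative): an unobservable that raises outcomes also raises the probability of participation, and OLS on the common-effects specification then overstates the true impact.

```python
# Illustrative simulation (not from the lecture): selection on an unobservable
# biases the OLS estimate of the common-effects impact parameter.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
true_impact = 1.0

x = rng.normal(size=n)                  # observed covariate X
u = rng.normal(size=n)                  # unobservable driving both D and Y
d = (0.5 * x + u + rng.normal(size=n) > 0).astype(float)   # endogenous placement
y = true_impact * d + 0.8 * x + u + rng.normal(size=n)      # outcome

# OLS of Y on D and X: biased upward because u sits in the error and is correlated with D
ols = sm.OLS(y, sm.add_constant(np.column_stack([d, x]))).fit()
print("OLS estimate of impact:", ols.params[1], "(true value 1.0)")
```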
Alternative solutions 1
Experimental evaluation (“Social experiment”)
• Program is randomly assigned, so that everyone has the same probability of receiving the treatment.
• In theory, this method is assumption free, but in practice many assumptions are required.
• Pure randomization is rare for anti-poverty programs in practice, since randomization precludes purposive targeting.
• Although it is sometimes feasible to partially randomize.
Alternative solutions 2
Non-experimental evaluation (“Quasi-experimental”;
“observational studies”)
One of two (non-nested) conditional
independence assumptions:
1. Placement is independent of outcomes given X
=>Single difference methods assuming conditionally
exogenous program placement.
Or placement is independent of outcome changes
=>Double difference methods
2. A correlate of placement is independent of
outcomes given D and X
=> Instrumental variables estimator.
3. Generic issues
• Selection bias
• Spillover effects
Selection bias in the outcome
difference between participants
and non-participants
Observed difference in mean outcomes between participants (D=1) and non-participants (D=0):
$$E(Y^T \mid D = 1) - E(Y^C \mid D = 0) = \underbrace{E(Y^T \mid D = 1) - E(Y^C \mid D = 1)}_{\text{ATET}} + \underbrace{E(Y^C \mid D = 1) - E(Y^C \mid D = 0)}_{\text{selection bias}}$$
ATET = average treatment effect on the treated.
Selection bias = the difference in mean outcomes (in the absence of the intervention) between participants and non-participants; it equals 0 with exogenous program placement.
Two sources of selection bias
• Selection on observables
  • Data
  • Linearity in controls?
• Selection on unobservables
  • Participants have latent attributes that yield higher/lower outcomes
• One cannot judge if exogeneity is plausible
without knowing whether one has dealt
adequately with observable heterogeneity.
• That depends on program, setting and data.
Spillover effects
• Hidden impacts for non-participants?
• Spillover effects can stem from:
• Markets
• Non-market behavior of participants/nonparticipants
• Behavior of intervening agents
(governmental/NGO)
• Example 1: Poor-area programs
• Aid targeted to poor villages+local govt. response
• Example 2: Employment Guarantee Scheme
• assigned program, but no valid comparison group.
4. Randomization
“Randomized out” group reveals the counterfactual
• As long as the assignment is genuinely random, mean impact is revealed: $E(Y^C \mid D = 1) = E(Y^C \mid D = 0)$
• ATE is consistently estimated (nonparametrically) by the difference between sample mean outcomes of participants and non-participants (a minimal sketch follows below).
• Pure randomization is the theoretical ideal for ATE, and the benchmark for non-experimental methods.
• More common: randomization conditional on ‘X’
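A minimal sketch of the experimental benchmark (illustrative, simulated data; names are hypothetical): under random assignment the ATE is simply the difference in sample means, with a t-test for that difference.

```python
# Illustrative sketch: with genuinely random assignment, the difference in
# sample means consistently estimates the ATE. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2000
d = rng.integers(0, 2, size=n)            # random assignment
y = 2.0 * d + rng.normal(size=n)          # true ATE = 2.0

ate_hat = y[d == 1].mean() - y[d == 0].mean()
res = stats.ttest_ind(y[d == 1], y[d == 0])
print(f"ATE estimate: {ate_hat:.3f}  (t = {res.statistic:.2f}, p = {res.pvalue:.3f})")
```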
Examples for developing countries
• PROGRESA in Mexico
  • Conditional cash transfer scheme
  • 1/3 of the original 500 communities selected were retained as a control; public access to data
  • Impacts on health, schooling, consumption
• Proempleo in Argentina
  • Wage subsidy + training
  • Wage subsidy: impacts on employment, but not incomes
  • Training: no impacts, though selective compliance
Lessons from practice 1
Ethical objections and political sensitivities
• Deliberately denying a program to those who need it
and providing the program to some who do not.
• Yes, too few resources to go around. But is randomization
the fairest solution to limited resources?
• What does one condition on in conditional randomizations?
• Intention-to-treat helps alleviate these concerns
=> randomize assignment, but free to not participate
• But even then, the “randomized out” group may include
people in great need.
=> Implications for design
• Choice of conditioning variables.
• Sub-optimal timing of randomization
• Selective attrition + higher costs
Lessons from practice 2
Internal validity: Selective compliance
• Some of those assigned the program choose not to participate.
• Impacts may only appear if one corrects for selective take-up.
• Randomized assignment as an IV for participation.
• Proempleo example: impacts of training only appear if one corrects for selective take-up.
Lessons from practice 3
External validity: inference for scaling up
• Systematic differences between characteristics of
people normally attracted to a program and those
randomly assigned (“randomization bias”:
Heckman-Smith)
• One ends up evaluating a different program from the one actually implemented
=> Difficult to extrapolate results from a pilot experiment to the whole population
5. Controls
Regression controls and matching
5.1 OLS regression
Ordinary least squares (OLS) estimator of impact with controls for selection on observables.
Switching regression:
$$Y_i = D_i Y_i^T + (1 - D_i) Y_i^C = X_i \beta^C + X_i(\beta^T - \beta^C) D_i + \nu_i, \qquad \nu_i = D_i(\mu_i^T - \mu_i^C) + \mu_i^C$$
Common-effects specification:
$$Y_i = (\beta_0^T - \beta_0^C) D_i + X_i \beta^C + \nu_i$$
Even with controls…
OLS only gives consistent estimates under conditionally exogenous program placement:
• there is no selection bias in placement, conditional on X;
• or (equivalently) the conditional mean counterfactual outcomes do not depend on treatment:
$$E[Y_i^C \mid X_i, D_i = 1] = E[Y_i^C \mid X_i, D_i = 0]$$
implying $E[\nu_i \mid X_i, D_i] = 0$ in the common-impact model.
5.2 Matching
Matched comparators identify counterfactual
• Match participants to non-participants from a
larger survey.
• The matches are chosen on the basis of
similarities in observed characteristics.
• This assumes no selection bias based on
unobservable heterogeneity.
• Mean impact on the treated (ATE or ATET) is
nonparametrically identified.
Propensity-score matching (PSM)
Match on the probability of participation.
• Ideally we would match on the entire vector X of observed characteristics. However, this is practically impossible: X could be huge.
• PSM: match on the basis of the propensity score (Rosenbaum and Rubin):
$$P(X_i) = \Pr(D_i = 1 \mid X_i)$$
• This assumes that participation is independent of outcomes given X. If there is no bias given X, then there is no bias given P(X).
Steps in score matching:
1: Representative, highly comparable, surveys of the
non-participants and participants.
2: Pool the two samples and estimate a logit (or probit)
model of program participation. Predicted values
are the “propensity scores”.
3: Restrict samples to assure common support
Failure of common support is an important source
of bias in observational studies (Heckman et al.)
[Figure: densities of propensity scores over [0, 1] for participants and for non-participants, with the overlapping range marked as the region of common support.]
Steps in PSM (cont.)
5: For each participant, find a sample of non-participants with similar propensity scores.
6: Compare the outcome indicators. The difference is the estimate of the gain due to the program for that observation.
7: Calculate the mean of these individual gains to obtain the average overall gain. Various weighting schemes =>
The mean impact estimator:
$$\hat{G} = \sum_{j=1}^{P} \Big( Y_{1j} - \sum_{i=1}^{NP} W_{ij} Y_{0i} \Big) \Big/ P$$
Various weighting schemes:
• Nearest k neighbors
• Kernel weights (Heckman et al.): $K_{ij} = K[P(X_i) - P(X_j)]$ and $W_{ij} = K_{ij} \big/ \sum_{j=1}^{P} K_{ij}$
(A matching sketch follows below.)
Propensity-score weighting
• PSM removes bias under the conditional exogeneity assumption.
• However, it is not the most efficient estimator.
• Hirano, Imbens and Ridder show that weighting the control observations according to their propensity score yields a fully efficient estimator.
• Regression implementation for the common impact model:
$$Y_i = \alpha D_i + \varepsilon_i$$
with weights of unity for the treated units and $\hat{P}(X)/(1 - \hat{P}(X))$ for the controls.
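A sketch of the weighting estimator described above (illustrative data and names): the common-impact regression is run by weighted least squares with weight 1 for treated units and P̂(X)/(1−P̂(X)) for controls.

```python
# Illustrative sketch of propensity-score weighting (Hirano-Imbens-Ridder style):
# WLS of Y on D with weights 1 for treated units and p/(1-p) for controls.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 4000
x = rng.normal(size=(n, 2))
d = (x @ [0.7, -0.4] + rng.normal(size=n) > 0).astype(float)
y = 1.5 * d + x @ [1.0, 0.5] + rng.normal(size=n)            # true impact = 1.5

p = sm.Logit(d, sm.add_constant(x)).fit(disp=0).predict()    # propensity scores
w = np.where(d == 1, 1.0, p / (1.0 - p))                     # odds weights for controls

wls = sm.WLS(y, sm.add_constant(d), weights=w).fit()
print("Weighted estimate of impact:", wls.params[1])
```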
How does PSM compare to an experiment?
• PSM is the observational analogue of an experiment in which placement is independent of outcomes.
• The difference is that a pure experiment does not require the untestable assumption of independence conditional on observables.
• Thus PSM requires good data.
• Example of Argentina’s Trabajar program:
  • Plausible estimates using SD matching on good data
  • Implausible estimates using weaker data
How does PSM differ from OLS?
• PSM is a non-parametric method (fully non-parametric in outcome space; optionally non-parametric in assignment space).
• Restricting the analysis to common support
=> PSM weights the data very differently to standard OLS regression.
• In practice, the results can look very different!
How does PSM perform relative to other methods?
• In comparisons with the results of a randomized experiment on a US training program, PSM gave a good approximation (Heckman et al.; Dehejia and Wahba).
• Better than the non-experimental regression-based methods studied by Lalonde for the same program.
• However, robustness has been questioned (Smith and Todd).
Lessons on matching methods
• When neither randomization nor a baseline survey is feasible, careful matching is crucial to control for observable heterogeneity.
• The validity of matching methods depends heavily on data quality: highly comparable surveys; similar economic environment.
• Common support can be a problem (especially if treatment units are lost).
• Look for heterogeneity in impact; average impact may hide important differences in the characteristics of those who gain or lose from the intervention.
6. Exploiting program design 1
Discontinuity designs
• Participate if score $M_i < m$
• Impact = $E(Y_i^T \mid M_i = m - \varepsilon) - E(Y_i^C \mid M_i = m + \varepsilon)$ for small $\varepsilon$
• Key identifying assumption: no discontinuity in counterfactual outcomes at m.
• Strict eligibility rules alone do not make this plausible (e.g., geography and local govt.)
• “Fuzzy” discontinuities in the probability of participation.
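A minimal discontinuity-design sketch (illustrative; the score, cutoff and bandwidth are assumptions): fit a local regression in a window around the eligibility cutoff m and read the impact off the jump at the cutoff.

```python
# Illustrative discontinuity-design sketch: units with score M < m participate;
# the impact is the jump in outcomes at the cutoff.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 5000
m_cut = 0.0
score = rng.normal(size=n)                        # eligibility score M
d = (score < m_cut).astype(float)                 # participate if M < m
y = 2.0 * d + 0.5 * score + rng.normal(size=n)    # true jump at the cutoff = 2.0

h = 0.25                                          # bandwidth around the cutoff
near = np.abs(score - m_cut) < h

# Local linear regression: outcome on treatment, centred score, and their interaction
s = score[near] - m_cut
X = sm.add_constant(np.column_stack([d[near], s, d[near] * s]))
rd = sm.OLS(y[near], X).fit()
print("Estimated discontinuity (impact at the cutoff):", rd.params[1])
```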
Exploiting program design 2
Pipeline comparisons
• Applicants who have not yet received program
form the comparison group
• Assumes exogenous assignment amongst applicants
• Reflects latent selection into the program
Lessons from practice
• Know your program well: program design features can be very useful for identifying impact.
• Know your setting well too: is it plausible that outcomes are continuous under the counterfactual?
• But what if you end up changing the program to identify impact? You have evaluated something else!
7. Difference-in-difference
Observed changes over time for non-participants provide the counterfactual for participants.
Steps:
1. Collect baseline data on non-participants and (probable) participants before the program.
2. Compare with data after the program.
3. Subtract the two differences, or use a regression with a dummy variable for participants.
This allows for selection bias, but the bias must be time-invariant and additive.
Outcome indicator: $Y_{it}^T = Y_{it}^C + G_{it}$ (t = 0, 1)
where
$G_{it}$ = impact (“gain”);
$Y_{it}^C$ = counterfactual;
$\hat{Y}_{it}^C$ = estimate from the comparison group.
Difference-in-difference:
$$DD = E[(Y_{i1}^T - \hat{Y}_{i1}^C) - (Y_{i0}^T - \hat{Y}_{i0}^C)]$$
i.e., the post-intervention difference in outcomes minus the baseline difference in outcomes.
Or:
$$DD = E[(Y_{i1}^T - Y_{i0}^T) - (\hat{Y}_{i1}^C - \hat{Y}_{i0}^C)]$$
i.e., the gain over time for the treatment group minus the gain over time for the comparison group.
Diff-in-diff identifies the mean gain:
$$E[(Y_{i1}^T - \hat{Y}_{i1}^C) - (Y_{i0}^T - \hat{Y}_{i0}^C)] = E(G_{i1})$$
if (i) the change over time for the comparison group reveals the counterfactual:
$$E\hat{Y}_{it}^C = EY_{it}^C$$
and (ii) the baseline is uncontaminated by the program:
$$G_{i0} = 0$$
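A sketch of the double-difference calculation on simulated two-period data (illustrative; the panel layout and names are assumptions): DD is the coefficient on the treated-by-post interaction, and the time-invariant additive selection term drops out.

```python
# Illustrative double-difference sketch on simulated two-period data:
# DD = coefficient on the (treated x post) interaction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 1500
unit_effect = rng.normal(size=n)                   # time-invariant latent heterogeneity
treated = (unit_effect + rng.normal(size=n) > 0).astype(int)   # selection on the latent term

rows = []
for t in (0, 1):
    y = unit_effect + 0.5 * t + 1.0 * treated * t + rng.normal(size=n)  # true impact = 1.0
    rows.append(pd.DataFrame({"y": y, "treated": treated, "post": t}))
panel = pd.concat(rows)

# Selection bias here is additive and time-invariant, so DD recovers the impact
dd = smf.ols("y ~ treated * post", data=panel).fit()
print("DD estimate of impact:", dd.params["treated:post"])
```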
Selection bias
[Figure: outcome paths for the treated group (Y0 to Y1) and the counterfactual (Y0 to Y1*) between t=0 and t=1, with the gaps labelled “impact” and “selection bias”.]
Diff-in-diff requires that the bias is additive and time-invariant.
[Figure: parallel paths for the two groups, so the baseline gap is unchanged over time.]
The method fails if the comparison group is on a different trajectory:
[Figure: non-parallel trajectories] => DD overestimates impact
Or…
[Figure: non-parallel trajectories in the opposite direction] => DD underestimates impact
A common problem in assessing impacts of development projects?
Example of poor-area programs: areas not targeted yield a biased counterfactual.
[Figure: income over time for targeted and non-targeted areas, on different trajectories.]
• The growth process in non-treatment areas is not indicative of what would have happened in the targeted areas without the program.
• Example from China (Jalan and Ravallion).
Matched double difference
Matching helps control for time-varying
selection bias
• Score match participants and non-participants based
on observed characteristics in baseline
• Initial conditions (incl. outcomes)
• Prior outcome trajectories
• Then do a double difference
• This deals with observable heterogeneity in initial
conditions that can influence subsequent changes over
time
Propensity-score weighted version of “matched diff-in-diff”
• Weighting the control observations according to their propensity score yields a fully efficient estimator (Hirano, Imbens and Ridder).
• Regression:
$$Y_{it} = \alpha + \beta D_{i1} t + \gamma D_{i1} + \delta t + \varepsilon_{it} \qquad (\beta = DD)$$
with weights of unity for the treated units and $\hat{P}(X)/(1 - \hat{P}(X))$ for the controls, where $P(X_i) = \Pr(D_i = 1 \mid X_i)$ is the propensity score.
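A sketch of the regression implementation just described (illustrative; the data, names and logit first step are assumptions): the same interaction regression run by weighted least squares, with weight 1 for treated units and P̂(X)/(1−P̂(X)) for comparison units.

```python
# Illustrative sketch of the propensity-score weighted diff-in-diff regression:
# Y_it on D_i1*t, D_i1 and t, weighting controls by p/(1-p) so that the
# comparison group is reweighted toward the treated covariate distribution.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 2000
x = rng.normal(size=n)                                    # baseline characteristic
d1 = (0.8 * x + rng.normal(size=n) > 0).astype(int)       # participation depends on x

p = sm.Logit(d1, sm.add_constant(x)).fit(disp=0).predict()
w = np.where(d1 == 1, 1.0, p / (1.0 - p))                 # unit weights

rows = []
for t in (0, 1):
    y = x + 0.3 * t + 1.0 * d1 * t + rng.normal(size=n)   # true DD impact = 1.0
    rows.append(pd.DataFrame({"y": y, "d1": d1, "t": t, "w": w}))
panel = pd.concat(rows)

wdd = smf.wls("y ~ d1 * t", data=panel, weights=panel["w"]).fit()
print("Weighted DD estimate:", wdd.params["d1:t"])
```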
“Fixed effects” model
• Fixed effects model on a balanced panel:
$$Y_{it} = \alpha^* + \beta D_{i1} t + \delta t + \eta_i + \varepsilon_{it}$$
where
$$\eta_i = \eta_i^T D_{i1} + \eta_i^C (1 - D_{i1}) = (\eta^T - \eta^C) D_{i1} + \eta^C + \nu_i$$
Note:
• Adding $(\eta^T - \eta^C) D_{i1}$ picks up any differences in time-mean latent factors.
• One does not require a balanced panel to estimate DD.
Lessons from practice
• Single-difference matching can be severely contaminated by selection bias
  • Latent heterogeneity in factors relevant to participation
• Tracking individuals over time allows a double difference
  • This eliminates all time-invariant additive selection bias
• Combining double difference with matching:
  • This allows us to eliminate observable heterogeneity in factors relevant to subsequent changes over time
8. Higher-order differencing
Pre-intervention baseline data unavailable
e.g., safety net intervention in response to a
crisis
Can impact be inferred by observing participants’
outcomes in the absence of the program after the
program?
New issues
• Selection bias from two sources:
  1. the decision to join the program
  2. the decision to stay or drop out
• There are observed and unobserved characteristics that affect both participation and income in the absence of the program.
• Past participation can bring current gains for those who leave the program.
Double-Matched Triple Difference
1. Match participants with a comparison group of non-participants
2. Match leavers and stayers
3. Compare gains to continuing participants with those who drop out (Ravallion et al.)
Triple Difference (DDD) = DD for stayers – DD for leavers
Outcomes for participants: $Y_{it}^T = Y_{it}^C + G_{it}$
Single difference: $E[Y_{it}^T - Y_{it}^C]$
Double difference: $E[\Delta(Y_{it}^T - Y_{it}^C)] = E(\Delta G_{it})$
Triple difference:
$$E[\Delta(Y_{i2}^T - Y_{i2}^C) \mid D_{i2} = 1] - E[\Delta(Y_{i2}^T - Y_{i2}^C) \mid D_{i2} = 0]$$
comparing “stayers” in period 2 ($D_{i2} = 1$) with “leavers” in period 2 ($D_{i2} = 0$). This equals
$$\underbrace{[E(G_{i2} \mid D_{i2} = 1) - E(G_{i2} \mid D_{i2} = 0)]}_{\text{net gain from participation}} - \underbrace{[E(G_{i1} \mid D_{i2} = 1) - E(G_{i1} \mid D_{i2} = 0)]}_{\text{selection bias}}$$
Joint conditions for DDD to estimate impact:
• no current gain to ex-participants: $E(G_{i2} \mid D_{i2} = 0) = 0$
• no selection bias in who leaves the program: $E(G_{i1} \mid D_{i2} = 1) = E(G_{i1} \mid D_{i2} = 0)$
Sign of the selection bias? If leavers have lower gains, then DDD underestimates impact.
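An illustrative triple-difference sketch (simulated data; names are hypothetical): compute the double difference separately for stayers and leavers and subtract; under the two conditions above, DDD recovers the gain to current participants.

```python
# Illustrative triple-difference sketch: DDD = DD for stayers minus DD for leavers.
# Data are simulated so that gains accrue only while participating (no gain after
# exit, no selection into leaving), so DDD recovers the gain to current participants.
import numpy as np

rng = np.random.default_rng(7)
n = 3000
gain = 1.0

stay = rng.integers(0, 2, size=n)                  # 1 = stays in period 2, 0 = leaves
base = rng.normal(size=n)                          # latent level; common trend added below

# Matched comparison outcomes (counterfactual estimates) in periods 1 and 2
yc1 = base + rng.normal(size=n)
yc2 = base + 0.4 + rng.normal(size=n)

# Participant outcomes: everyone participates in period 1; only stayers in period 2
yt1 = base + gain + rng.normal(size=n)
yt2 = base + 0.4 + gain * stay + rng.normal(size=n)

def dd(mask):
    # double difference for the selected group of participants
    return (yt2 - yc2)[mask].mean() - (yt1 - yc1)[mask].mean()

ddd = dd(stay == 1) - dd(stay == 0)
print("DDD estimate of gain to current participants:", ddd)
```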
Test for whether DDD identifies the gain to current participants
A third round of data allows a test: mean gains in round 2 should be the same whether or not one drops out in round 3:
$$DDD = E(G_{i2} \mid D_{i2} = 1, D_{i3} = 1) - E(G_{i2} \mid D_{i2} = 1, D_{i3} = 0)$$
i.e., the gain in round 2 for stayers in round 3 minus the gain in round 2 for leavers in round 3, which should be zero.
Lessons from practice
1. Tracking individuals over time:
  • addresses some of the limitations of single-difference on weak data
  • allows us to study the dynamics of recovery
2. The “baseline” can be after the program, but one must address the extra sources of selection bias.
3. Single difference for leavers vs. stayers can work well if there is an exogenous program contraction.
9. Instrumental variables
Identifying exogenous variation using a 3rd variable
Outcome regression:
$$Y_i = \alpha D_i + \varepsilon_i$$
(D = 0, 1 is our program – not random)
• “Instrument” (Z) influences participation, but does not affect outcomes given participation (the “exclusion restriction”).
• This identifies the exogenous variation in outcomes due to the program.
Treatment regression:
$$D_i = \gamma Z_i + u_i$$
Reduced-form outcome regression:
$$Y_i = \alpha(\gamma Z_i + u_i) + \varepsilon_i = \delta Z_i + \nu_i$$
where $\delta = \alpha\gamma$ and $\nu_i = \alpha u_i + \varepsilon_i$.
Instrumental variables (two-stage least squares) estimator of impact:
$$\hat{\alpha}_{IVE} = \hat{\delta}_{OLS} / \hat{\gamma}_{OLS}$$
Or: $Y_i = \alpha(\hat{\gamma} Z_i) + \nu_i$, i.e., use the predicted D purged of its endogenous part.
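A sketch of the IV logic above (illustrative; the simulated data and the encouragement-style instrument are assumptions): the impact estimate is the reduced-form coefficient divided by the first-stage coefficient, contrasted with the biased OLS of Y on D.

```python
# Illustrative IV sketch: impact = reduced-form coefficient / first-stage coefficient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 5000
alpha = 1.0                                          # true impact

z = rng.integers(0, 2, size=n).astype(float)         # instrument (e.g. random encouragement)
u = rng.normal(size=n)                               # unobservable driving both D and Y
d = (0.8 * z + u + rng.normal(size=n) > 0).astype(float)   # endogenous participation
y = alpha * d + u + rng.normal(size=n)

gamma = sm.OLS(d, sm.add_constant(z)).fit().params[1]    # first stage: D on Z
delta = sm.OLS(y, sm.add_constant(z)).fit().params[1]    # reduced form: Y on Z
print("IV estimate of impact:", delta / gamma)
print("Biased OLS of Y on D:  ", sm.OLS(y, sm.add_constant(d)).fit().params[1])
```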
Problems with IVE
1. Finding valid IVs:
  • It is usually easy to find a variable that is correlated with treatment.
  • However, the validity of the exclusion restrictions is often questionable.
2. Impact heterogeneity due to latent factors.
Sources of instrumental variables
• Partially randomized designs as a source of IVs
• Non-experimental sources of IVs:
  • Geography of program placement (Attanasio and Vera-Hernandez); “Dams” example (Duflo and Pande)
  • Political characteristics (Besley and Case; Paxson and Schady)
  • Discontinuities in survey design
Endogenous compliance: Instrumental variables estimator
D = 1 if treated, 0 if control
Z = 1 if assigned to treatment, 0 if not
Compliance regression: $D_i = \gamma_1 Z_i + \nu_{1i}$
Outcome regression (“intention-to-treat effect”): $Y_i = \gamma_2 Z_i + \nu_{2i}$
2SLS estimator (= ITT deflated by the compliance rate): $\hat{\gamma}_2 / \hat{\gamma}_1$
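A sketch of the compliance-adjusted estimator (illustrative, simulated data; names are hypothetical): the intention-to-treat effect divided by the difference in take-up rates between the assigned and non-assigned groups.

```python
# Illustrative sketch of the compliance-adjusted estimator: the intention-to-treat
# effect divided by the compliance rate (difference in take-up between arms).
import numpy as np

rng = np.random.default_rng(9)
n = 4000
z = rng.integers(0, 2, size=n)                     # randomized assignment
comply = rng.random(n) < 0.6                       # 60% take up the offer
d = (z == 1) & comply                              # treatment actually received
y = 1.5 * d + rng.normal(size=n)                   # true impact on the treated = 1.5

itt = y[z == 1].mean() - y[z == 0].mean()          # intention-to-treat effect
take_up = d[z == 1].mean() - d[z == 0].mean()      # compliance rate
print("ITT:", itt, " compliance:", take_up, " IV estimate:", itt / take_up)
```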
Essential heterogeneity and IVE
• The common-impact specification is not harmless.
• Heterogeneity in impact can arise from differences between treated units and the counterfactual in latent factors relevant to outcomes.
• For consistent estimation of ATE we must assume that selection into the program is unaffected by latent, idiosyncratic factors determining the impact (Heckman et al.).
• However, likely “winners” will no doubt be attracted to a program, or be favored by the implementing agency.
• => IVE is biased even with “ideal” IVs.
Stylized example
• Two types of people (1/2 of each):
  • Type H: high impact; large gains (G) from the program
  • Type L: low impact; no gain
• The evaluator cannot tell which is which.
• But the people themselves can tell (or have a useful clue).
• Randomized pilot:
  • Half goes to each type
  • Impact = G/2
• Scaled-up program:
  • Type H select into the program; Type L do not
  • Impact = G
IVE is only a ‘local’ effect
• IVE identifies the effect for those induced to switch by the instrument (“local average effect”).
• Suppose Z takes 2 values. Then the effect of the program is:
$$\alpha_{IVE} = \frac{E(Y \mid Z = 1) - E(Y \mid Z = 0)}{E(D \mid Z = 1) - E(D \mid Z = 0)}$$
• Care is needed in extrapolating to the whole population when there is latent heterogeneity.
Local instrumental variables
• LIV directly addresses the latent heterogeneity problem.
• The method entails a nonparametric regression of outcomes Y on the propensity score:
$$Y_i = f[\hat{P}(Z_i)] + X_i \beta + \varepsilon_i$$
• The slope of the regression function $f[\hat{P}(Z_i)]$ gives the marginal impact at that data point.
• This slope is the marginal treatment effect (Björklund and Moffitt), from which any of the standard impact parameters can be calculated (Heckman and Vytlacil).
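A rough sketch of the LIV idea (illustrative; lowess smoothing and a numerical derivative stand in for the nonparametric regression, and the data-generating process is hypothetical): regress outcomes on the estimated propensity score and read the slope off as the marginal treatment effect at each point.

```python
# Illustrative LIV sketch: nonparametric regression of Y on the propensity score
# P(Z); the slope approximates the marginal treatment effect at each score.
import numpy as np
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(10)
n = 5000
z = rng.normal(size=n)                                 # instrument
v = rng.normal(size=n)                                 # latent cost of participation
d = (z - v > 0).astype(float)                          # participation decision
gain = 1.0 + 0.5 * v                                   # heterogeneous impact tied to v
y = gain * d + rng.normal(size=n)

p_hat = sm.Logit(d, sm.add_constant(z)).fit(disp=0).predict()   # propensity score on Z

# Smooth Y against the propensity score, then take the numerical slope
smooth = lowess(y, p_hat, frac=0.3, return_sorted=True)
mte = np.gradient(smooth[:, 1], smooth[:, 0])          # marginal impact along the score
print("Marginal treatment effect near the median score:",
      np.median(mte[np.abs(smooth[:, 0] - 0.5) < 0.05]))
```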
Lessons from practice
• Partially randomized designs offer a great source of IVs.
• The bar has risen in standards for non-experimental IVE:
  • Past exclusion restrictions are often questionable in developing-country settings.
  • However, defensible options remain in practice, often motivated by theory and/or other data sources.
• Future work is likely to emphasize latent heterogeneity of impacts, especially using LIV.
10. Making evaluations more
useful
Evaluations are often not as relevant
for practitioners as they could be
10 steps to more policy-relevant
evaluations
Step 1: Make the policy questions the starting point
• Start with the questions and remain eclectic on methods of answering them.
  • Policy-relevant evaluations must start with interesting and important questions.
  • But instead many evaluators start with a preferred method and look for questions that can be addressed with that method.
  • By constraining evaluative research to situations in which one favorite method is feasible, research may exclude many of the most important and pressing development questions.
• Make sure the evaluation process is linked to the project.
  • Evaluator did not know enough about the setting and project.
  • Data collection started too late.
  • Data collection did not cover the right outcomes and did not allow for adequate controls.
    • Too many “monitoring indicators” + too few outcomes and controls
  • Evaluation did not address, or even ask, the right questions!
Step 2: Take seriously the ethical objections and political sensitivities; policy makers do!
• Pilots (using NGOs) can often get away with methods not acceptable to governments accountable to voters.
• Deliberately denying a program to those who need it and providing the program to some who do not.
  - Is randomization the fairest solution to limited resources?
  - What does one condition on in conditional randomizations?
• Key problem: the information available to the evaluator (for conditioning impacts) is a partial subset of the information available “on the ground”.
Step 3: Take a comprehensive approach to the sources of bias
• Two sources of selection bias: observables and unobservables (to the evaluator).
• Some economists have become obsessed with the latter bias, while ignoring innumerable other biases/problems:
  • less-than-ideal methods of controlling for observable heterogeneity, including ad hoc models of outcomes;
  • evidence that we have given too little attention to the problem of selection bias based on observables;
  • arbitrary preferences for one conditional independence assumption (exclusion restrictions) over another (conditional exogeneity of placement).
• One cannot scientifically judge appropriate assumptions/methods independently of program, setting and data.
Step 4: Look for spillover effects
• Are there hidden impacts for non-participants?
• Look for signs of spillover effects stemming from:
• Markets
• Behavior of participants/non-participants
• Behavior of intervening agents (governmental/NGO)
Step 5: Take a sectoral approach,
recognizing fungibility/flypaper
effects
• Fungibility
• You are not in fact evaluating what the extra
public resources (incl. aid) actually financed.
• So your evaluation may be deceptive about the
true impact of those resources.
• Flypaper effects
• Impacts may well be found largely within the
“sector”.
• Need for a broad sectoral approach
Step 6: Look for impact heterogeneity
• Impacts vary with participant characteristics (including those not observed by the evaluator) and context.
• Participant heterogeneity:
  • Interaction effects
  • Essential heterogeneity + participant responses
  • Implications for evaluation methods, project design and external validity (generalizability) =>
• Contextual heterogeneity:
  • “In certain settings anything works, in others everything fails”
  • Local institutional factors in development impact
    • Example of Bangladesh’s Food-for-Education program
    • The same program works well in one village, but fails hopelessly nearby.
Step 7: Take “scaling up” seriously
With scaling up:
• Inputs change:
  • Entry effects: the nature and composition of those who “sign up” changes with scale.
  • Migration responses.
• The intervention changes:
  • Resource effects on the intervention.
• Outcomes change:
  • Lags in outcome responses.
  • Market responses (partial equilibrium assumptions are fine for a pilot but not when scaled up).
  • Social effects/political economy effects; early vs. late capture.
But there has been little work on external validity and scaling up.
Step 8: Understand what determines impact
• Replication across differing contexts
  • Example of Bangladesh’s FFE:
    • inequality etc. within the village => outcomes of the program
    • implications for sample design => trade-off between precision of overall impact estimates and ability to explain impact heterogeneity
• Intermediate indicators
  • Example of China’s SWPRP:
    • small impact on consumption poverty
    • but a large share of the gains were saved
• Qualitative research/mixed methods
  • Test the assumptions (“theory-based evaluation”)
  • But a poor substitute for assessing impacts on final outcomes
In understanding impact, Step 9 is key =>
Step 9: Don’t reject theory and structural modeling
• Standard evaluations are “black boxes”: they give policy effects in specific settings but not structural parameters (as relevant to other settings).
• Structural methods allow us to simulate changes in program design or setting.
• However, assumptions are needed. (The same is true for black-box social experiments.) That is the role of theory.
• PROGRESA example (Attanasio et al.; Todd & Wolpin):
  • Modeling schooling choices using randomized assignment for identification
  • A budget-neutral switch from the primary to the secondary subsidy would increase impact
Step 10: Develop capabilities for evaluation within countries
• Strive for a culture of evidence-based evaluation practice.
  • China example: “Seeking truth from fact” + role of research
• Evaluation is a natural addition to the roles of the government’s sample survey unit.
  • Independence/integrity should already be in place.
  • Connectivity to other public agencies may be a bigger problem.
• Sometimes a private evaluation capability will still be required.