Searching for answers Using RCTs?


Angus Deaton, Princeton University
India International Center, October 15th, 2012
DOING GOOD EVALUATIONS:
WHAT DOES IT MEAN, WHAT DOES IT TAKE?
Evidence for policy
 Everyone agrees that policies should be based on
evidence
 Much less agreement about the nature of the
evidence
 What methods should be used?
 Is there a hierarchy of evidence?
 Are some kinds of evidence better than others?
 Are randomized controlled trials the gold standard?
 How do we move from evidence to policy?
 Rigorous evidence is of limited value if the step to policy is
not well-justified
 Two steps: developing evidence, adapting to policy, and
outcome depends on weakest link
Running examples
 Building dams
 Do dams lead to poverty reduction?
 Sanitation
 Total sanitation campaign (TSC) and its effects on child mortality
and child health
 How should such schemes be implemented?
 Microfinance
 Is MF an effective tool for poverty reduction?
 Food subsidies
 In kind versus cash? PDS versus CCTs
 In general: “finding out what works”
 “Rigorous evaluation of CCTs has shown that they work”
 Is this true, and if so, what does it mean for India? Or anywhere
else?
Background
 The “failure” of development economics and the
whole development project
 Cycling fashions at the World Bank
 Infrastructure, structural adjustment, education, health,
women, political economy, governance . . . infrastructure
 Not just the Bank, but the development community (or at least
the community of “developers”)
 Unconstrained by evidence
 Bank unable to document its contribution, if any
 Deep skepticism about its own internal evaluations
 Many argued that there had been little or no
progress
 Much less so now, though remains unclear whether the
development effort by rich countries was positive
Diagnosing the problem
 Many possible stories for this state of affairs
 One story is a failure to learn from experience
 No systematic, “rigorous,” evaluation procedure for
projects
 Casual empirical evaluation does not give credible answers
 We need “rigorous” and “credible” evidence on what works
 If the Bank had done this on all of its projects in the past,
we would know what works by now, and poverty would be
history
 Is this just the latest turn of the wheel of fashion, or
is there some truth to this?
Better empirical analysis
 Certainly true that the quality of empirical
analysis was often weak
 Correlations that were obviously not causation
 Chinese railways
 Randomized controlled trials seem to offer
solutions to these issues
 They establish causality
 Solution to the statistical problems of bias, selection,
omitted variables (confounding) etc.
 These arguments have been very successful
 In World Bank, among foundations
 J-PAL and others doing many experiments
Chorus of approval
 “The World Bank is finally embracing science” Lancet
editorial, 2004
 “Creating a culture in which rigorous randomized
evaluations are promoted, encouraged, and
financed has the potential to revolutionize social
policy during the 21st century, just as randomized
trials revolutionized medicine during the 20th.”
Esther Duflo, 2004
 Did RCTs revolutionize medicine?
 “Britain has given the world Shakespeare,
Newtonian physics, the theory of evolution,
parliamentary democracy—and the randomized
trial” BMJ editorial, 2001.
What is an RCT?
 Trial population is randomly divided into two groups,
experimentals and controls
 Experimentals get treatment
 Controls get none
 Average outcome in experimental group minus average outcome
in control group tells us if the treatment works, and by how much
on average
 An RCT estimates an average treatment effect
 In general, each person (unit) will have a different treatment
effect
 We cannot observe these for each individual
 But RCT gives us the average for the group, which is a lot!
 Minimal assumptions, absence of bias, establishing
causality are big advantages
 But is this really the only “rigorous” evaluation?
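A minimal sketch of the arithmetic on this slide, using simulated data. The outcome distribution, the effect sizes, and the function name `simulate_rct` are assumptions made for illustration; none of them come from an actual trial.

```python
import random
import statistics


def simulate_rct(n_units=10_000, seed=0):
    """Return (estimated ATE, true ATE) for one simulated trial."""
    rng = random.Random(seed)
    # Hypothetical units: a baseline outcome plus an individual treatment
    # effect that differs across units and is never observed directly.
    units = [(rng.gauss(50.0, 10.0), rng.gauss(2.0, 5.0)) for _ in range(n_units)]

    # Random assignment: shuffle, then split into experimentals and controls.
    rng.shuffle(units)
    half = n_units // 2
    experimentals = [baseline + effect for baseline, effect in units[:half]]
    controls = [baseline for baseline, _ in units[half:]]

    # Difference in group averages estimates the average treatment effect.
    estimated_ate = statistics.mean(experimentals) - statistics.mean(controls)
    true_ate = statistics.mean([effect for _, effect in units])
    return estimated_ate, true_ate


estimated, true_value = simulate_rct()
print(f"estimated ATE: {estimated:.2f}, true ATE: {true_value:.2f}")
```

The difference in means recovers the average effect reasonably well here, but it says nothing about which units gained and which lost: the individual effects stay hidden.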
Examples again
 CCTs in Mexico (Progresa), some villages got CCTs, some did not
 Better average outcomes for treatment villages
 Random selection means it must have been the CCT, not something else
 What do we learn?
 Will it work in India? External validity.
 Will it work for a specific village in Mexico?
 Why did it work? If we knew, we could answer the first two questions
 Controls knew they were going to get CCTs later? Does that matter?
 Mexico had a system of clinics: hard to take kids to a non-existent clinic
 Big issue for Santiago Levy at IADB today
 Dams: not possible to do randomized dam construction!!
 So RCTs cannot be done in all cases
 Some have argued that policies should not be implemented in these
cases
 Do many things routinely for which there have been no RCTs!
Alternative methods
 Rohini Pande and Esther Duflo’s work on dams used
placement of dams and NSS data on poverty
 Dean Spears’ work on TSC uses NFHS and other survey data
on health in conjunction with administrative data
 Alternative methods of estimating average treatment
effects
 Weaker than RCTs in some respects
 Causality, selection, bias are not automatic and must be argued
 More assumptions
 Stronger in other respects
 Access distribution of treatment effects, not just the average
 Usually much larger samples
 Triangulation helps to pin down mechanisms at work
 RCTs good at saying what happened, not good at saying why
 Ex post fairy stories (just-so stories) without evidence
Small RCTs
 Are often not large enough to be reliable
 Expensive to do, so this is not a matter that is easily fixed
 In a small trial, a few outliers can wreak havoc
 Example might be microfinance, where one or two women might be able
to do really well, and the rest not at all
 Get lots of weird and counterintuitive results
 No idea if they are real, or method is just broken
 Doubt one can learn anything from a trial of 10 experimental villages and
10 control villages in CCT experiment
 Experiment is often conducted on a convenience sample
 Not easy to get cooperation from all relevant units: e.g. in looking at
CCT, those opposed to the idea might be less willing to cooperate
 Results are correct only for the convenience population
 Not for population that would be affected by the policy
 Gold standard rhetoric protects results from questioning
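The outlier problem can be illustrated with a small simulation. This is a sketch under invented assumptions: the true treatment effect is zero, most units gain nothing, and roughly one unit in twenty does extremely well. The function names and all numbers are hypothetical.

```python
import random
import statistics


def trial_estimate(rng, n_per_arm):
    """Difference in means for one simulated trial with NO true effect."""
    def outcome():
        # Skewed outcome: most units gain nothing, a rare few do very well.
        return 100.0 if rng.random() < 0.05 else 0.0

    experimentals = [outcome() for _ in range(n_per_arm)]
    controls = [outcome() for _ in range(n_per_arm)]
    return statistics.mean(experimentals) - statistics.mean(controls)


def spread_of_estimates(n_per_arm, n_trials=1000, seed=1):
    """Standard deviation of the estimate across many repeated trials."""
    rng = random.Random(seed)
    estimates = [trial_estimate(rng, n_per_arm) for _ in range(n_trials)]
    return statistics.pstdev(estimates)


small = spread_of_estimates(n_per_arm=10)    # e.g. 10 villages per arm
large = spread_of_estimates(n_per_arm=500)
print(f"spread with 10 per arm: {small:.2f}; with 500 per arm: {large:.2f}")
```

With 10 units per arm, whether one or two high performers land in the experimental or the control group largely determines the estimate, even though the treatment does nothing in this setup.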
Large scale RCTs
 Use all of the units in a country
 PDS/CCT experiment for all of rural India
 Comparable to large social experiments in the US in the 70s
 NJ income tax experiment, SIME/DIME
 Rand Health experiment
 Rand experiment is an important part of the debate today, others
not
 Ex post data mining
 Null result is never acceptable to the sponsors
 Enormous pressure on investigators to find something
 Usually by subgroup analysis, or looking for other outcomes
 MTO has now examined thousands of outcomes
 Some of the statistically significant ones are spurious
 And we are back to the small sample problem again
 Large experiments not decisive either
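Why examining many outcomes produces spurious findings can be sketched as follows. The setup is an assumption for illustration only: outcomes are independent, each has unit variance, the true effect is zero for every one of them, and significance is judged with a simple z-test at the 5% level. The function name is hypothetical.

```python
import math
import random


def count_false_positives(n_outcomes=1000, n_per_arm=100, seed=2):
    """Count outcomes that look significant at 5% when every true effect is zero."""
    rng = random.Random(seed)
    # Standard error of a difference in means when each outcome has unit variance.
    standard_error = math.sqrt(2.0 / n_per_arm)
    significant = 0
    for _ in range(n_outcomes):
        treated_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_arm)) / n_per_arm
        control_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_arm)) / n_per_arm
        z = (treated_mean - control_mean) / standard_error
        if abs(z) > 1.96:  # conventional two-sided 5% threshold
            significant += 1
    return significant


spurious = count_false_positives()
print(f"{spurious} of 1000 outcomes look significant with no real effect")
```

Under these assumptions roughly one outcome in twenty clears the threshold by chance, so a search across thousands of outcomes or subgroups will turn up "findings" regardless of whether the program did anything.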
Dynamic effects
 Many policies take time to work out
 Lots of things work as intended in the short-run, fail later
 People learn to “work the system”
 Food rationing in Britain during the war:
 Excellent at first, big nutritional benefits, solidarity
 Crooks (“spivs”) learned to exploit it and create a black market
 Support eventually vanished, when it was continued too long
 Old age pensions in South Africa: cash transfer
 Burial insurers were allowed on site to get first access to recipients
 Higher level corruption: banks?
 Procurement and supply effects in food policy
 What would an RCT show?
 It works! Expensive and unethical to continue the experiment
 We get the wrong answer, or only part of the answer
 Issue in medicine too
TAKING EVIDENCE TO POLICY
Using a perfect evaluation

 Suppose we have a result, e.g.
 On average, CCTs make people happier than PDS
 On average, dams increase poverty
 On average, reducing open defecation improves child health and reduces
mortality
 Suppose also that these were all done perfectly, so there is no dispute
about the conclusions
 Which, of course, never happens!
 What use can we make of those results in policy?
 Should the Planning Commission ban new dams?
 Should MRD encourage better sanitation?
 Should we replace PDS by CCTs?
 That dams don’t work on average tells us little about any individual dam
 It is an individual dam that comes up for approval, not all dams!
 We need to know more: why dams cause poverty, and under what circumstances,
none of which comes from an RCT
What should a village do?
 Or any local authority that decides
 Given an RCT about CCT v PDS
 Again, the average is useful but not decisive
 Will it have the same effect for us?
 We are not the average village
 Again, we need to know why it works, not whether it works
 Neighboring village tried and is happy with the outcome
 Perhaps this is just an anecdote (“your uncle likes his new TV”)
 But for the village, the average outcome is an anecdote too
 Perhaps the authorities should visit their neighbors and see what
is going on, see if it would work for them
 Average is more useful for a public health policy that will be
applied to the whole country
 Sanitation?
Finding out what works?
 A trial and error process
 But T & E is NOT the same as an RCT
 T & E, endless tinkering, is a good description of
the Industrial Revolution
 How to invent a steam engine, or a toaster
 How medical science works, on procedures and
devices
 For which trials are close to irrelevant, and in many
cases have never been done
 T & E using knowledge and intelligence can solve
the dimensionality problem
Seeing into the machine
 Allows a village, the ministry, or the Planning
Commission to make a better choice
 It may be able to see whether it would work for them
 It may be able to see places where they could adapt it and
make it better
 Hope to understand the process & how it would work in
context
 Trial and error, plus local knowledge, hard thought
 Experimentation but not necessarily RCTs
 What are the “helping factors” that made a trial work?
 E.g. clinics in Mexico!
 Can teach us why things work which is generalizable
knowledge
Causality & helping factors

 Do not RCTs reveal causality?
 It was the treatment that did it! Not something else
 Is this not particularly helpful in policy? Yes and no.
 Causality, by itself, is not always useful
 The house burned down because the TV was left on
 Causal, but not general: TVs do not usually burn down houses
 RCT would show this causal effect
 But TVs need “helping factors” like bad wiring, or inflammable material left nearby
 We have to think about what are the helping factors, how they work, and
whether they will work for us
 Will a CCT work in a particular village, or during food price inflation, or in a competent v a
corrupt state?
 Does it need banks, or clinics to make it work?
 Does it matter who gets it? Men and women: gender issues in India v Latin America
 Replication of an RCT is not useful, because we get different results in different
contexts with or without helping factors
 Causality is “local”
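The role of a helping factor can be put as simple arithmetic. The sketch below assumes, purely for illustration, that a CCT raises an outcome by 10 units where a clinic exists and by nothing where it does not; the shares, effect sizes, and function name are hypothetical.

```python
def average_effect(share_with_clinic, effect_with_clinic=10.0, effect_without_clinic=0.0):
    """Average treatment effect when only some units have the helping factor."""
    return (share_with_clinic * effect_with_clinic
            + (1.0 - share_with_clinic) * effect_without_clinic)


# Trial run where most villages have a clinic.
trial_population = average_effect(share_with_clinic=0.9)
# Policy rolled out where few villages have one.
target_population = average_effect(share_with_clinic=0.1)
print(f"trial average: {trial_population:.1f}, target average: {target_population:.1f}")
```

The same causal mechanism yields a large average in one population and a small one in another, which is why an average from a trial elsewhere does not settle what will happen here.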
Cartwright: Local causality
Open window A and fly kite B. String C opens door D, which allows moths E to escape
and eat shirt F. Lighter shirt lowers shoe G on to switch H, which heats iron I, which burns
pants J. Smoke K enters tree L and smokes out possum M into basket N, pulling rope O
and lifting cage P, allowing woodpecker Q to chew pencil R. (Emergency knife S in case
woodpecker or possum gets sick and can’t work.)
Expanding literature
 We now have enough RCT papers to judge their quality and
the evidence that they claim
 Some excellent, some terrible
 Just like other empirical papers in development
 But they must be judged case by case, like all other empirical
work
 There is no free pass, just because they are RCTs
 Using the phrase “rigorous evaluation” as a code word for RCT is
without justification
 Right now, in economics, and aid literature, they are being given
a free pass.
 Sometimes absurd generalizations based on small special RCTs
 RCTs have no monopoly on rigour, there is no gold standard