Searching for answers Using RCTs?
Angus Deaton, Princeton University
India International Center, October 15th, 2012
DOING GOOD EVALUATIONS: WHAT DOES IT MEAN, WHAT DOES IT TAKE?
Evidence for policy
Everyone agrees that policies should be based on evidence
Much less agreement about the nature of the evidence
What methods should be used?
Is there a hierarchy of evidence?
Are some kinds of evidence better than others?
Are randomized controlled trials the gold standard?
How do we move from evidence to policy?
Rigorous evidence is of limited value if the step to policy is not well justified
Two steps, developing the evidence and adapting it to policy; the outcome depends on the weakest link
Running examples
Building dams
Do dams lead to poverty reduction?
Sanitation
Total Sanitation Campaign (TSC) and its effects on child mortality and child health
How should such schemes be implemented?
Microfinance
Is MF an effective tool for poverty reduction?
Food subsidies
In-kind versus cash? PDS versus CCTs
In general: “finding out what works”
“Rigorous evaluation of CCTs has shown that they work”
Is this true, and if so, what does it mean for India? Or anywhere else?
Background
The “failure” of development economics and the whole development project
Cycling fashions at the World Bank
Infrastructure, structural adjustment, education, health, women, political economy, governance . . . infrastructure
Not just the Bank, but the development community (or at least the community of “developers”)
Unconstrained by evidence
Bank unable to document its contribution, if any
Deep skepticism about its own internal evaluations
Many argued that there had been little or no progress
Much less so now, though it remains unclear whether the development effort by rich countries was positive
Diagnosing the problem
Many possible stories for this state of affairs
One story is a failure to learn from experience
No systematic, “rigorous” evaluation procedure for projects
Casual empirical evaluation does not give credible answers
We need “rigorous” and “credible” evidence on what works
If the Bank had done this on all of its projects in the past, we would know what works by now, and poverty would be history
Is this just the latest turn of the wheel of fashion, or is there some truth to this?
Better empirical analysis
Certainly true that the quality of empirical analysis was often weak
Correlations that were obviously not causation
Chinese railways
Randomized controlled trials seem to offer solutions to these issues
They establish causality
Solution to the statistical problems of bias, selection, omitted variables (confounding), etc.
These arguments have been very successful
At the World Bank and among foundations
J-PAL and others doing many experiments
Chorus of approval
“The World Bank is finally embracing science” (Lancet editorial, 2004)
“Creating a culture in which rigorous randomized evaluations are promoted, encouraged, and financed has the potential to revolutionize social policy during the 21st century, just as randomized trials revolutionized medicine during the 20th.” (Esther Duflo, 2004)
Did RCTs revolutionize medicine?
“Britain has given the world Shakespeare, Newtonian physics, the theory of evolution, parliamentary democracy—and the randomized trial” (BMJ editorial, 2001)
What is an RCT?
Trial population is randomly divided into two groups, experimentals and controls
Experimentals get the treatment
Controls get none
Average outcome in the experimental group minus average outcome in the control group tells us whether the treatment works, and by how much, on average
An RCT estimates an average treatment effect
In general, each person (unit) will have a different treatment effect
We cannot observe these for each individual
But the RCT gives us the average for the group, which is a lot!
Minimal assumptions, absence of bias, and established causality are big advantages
But is this really the only “rigorous” evaluation?
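A minimal simulation of the logic above, with entirely hypothetical numbers: every unit has its own treatment effect (here drawn with mean 1 but a wide spread), only one outcome per unit is ever observed, and yet the difference in group means after random assignment recovers the average effect. It also counts how many units are actually harmed, something the RCT average by itself does not reveal.

```python
import random
import statistics

# Hypothetical numbers throughout; this illustrates the logic of an RCT,
# not any particular trial.
random.seed(1)
N = 2000

# Each unit has two potential outcomes: untreated (y0) and treated (y1).
# The individual effect y1 - y0 differs from unit to unit.
units = []
for _ in range(N):
    y0 = random.gauss(10.0, 2.0)
    effect = random.gauss(1.0, 3.0)   # average effect 1, but widely dispersed
    units.append((y0, y0 + effect))

true_ate = statistics.mean(y1 - y0 for y0, y1 in units)
share_harmed = sum(y1 < y0 for y0, y1 in units) / N

# Random assignment: shuffle the indices and split them in half.
order = list(range(N))
random.shuffle(order)
treated = set(order[: N // 2])

# Only one potential outcome is observed per unit, so individual
# effects are never seen; we only compare group means.
y_treat = [units[i][1] for i in range(N) if i in treated]
y_ctrl = [units[i][0] for i in range(N) if i not in treated]
estimated_ate = statistics.mean(y_treat) - statistics.mean(y_ctrl)

print(f"true ATE {true_ate:.2f}, estimated ATE {estimated_ate:.2f}")
print(f"share of units actually harmed: {share_harmed:.2f}")
```

With these assumed parameters the estimate lands close to the true average, while roughly a third of units are made worse off by the treatment.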
Examples again
CCTs in Mexico (Progresa): some villages got CCTs, some did not
Better average outcomes for treatment villages
Random selection means it must have been the CCT, not something else
What do we learn?
Will it work in India? External validity.
Will it work for a specific village in Mexico?
Why did it work? If we knew, we could answer both questions
Controls knew they were going to get CCTs later? Does that matter?
Mexico had a system of clinics: hard to take kids to a non-existent clinic
Big issue today for Santiago Levy at the IADB
Dams: randomized dam construction is not possible!
So RCTs cannot be done in all cases
Some have argued that policies should not be implemented in these cases
We routinely do many things for which there have been no RCTs!
Alternative methods
Rohini Pande and Esther Duflo’s work on dams used the placement of dams and NSS data on poverty
Dean Spears’ work on the TSC uses NFHS and other survey data on health in conjunction with administrative data
Alternative methods of estimating average treatment effects
Weaker than RCTs in some respects
Causality, selection, and bias are not automatic and must be argued
More assumptions
Stronger in other respects
Access to the distribution of treatment effects, not just the average
Usually much larger samples
Triangulation helps to pin down the mechanisms at work
RCTs are good at saying what happened, not good at saying why
Ex post fairy stories (just-so stories) without evidence
Small RCTs
Are often not large enough to be reliable
Expensive to do, so this is not easily fixed
In a small trial, a few outliers can wreak havoc
Microfinance is one example: one or two women might do really well, and the rest not at all
Get lots of weird and counterintuitive results
No idea whether they are real or the method is just broken
I doubt one can learn anything from a CCT experiment with 10 experimental villages and 10 control villages
Experiment is often conducted on a convenience sample
Not easy to get cooperation from all relevant units: e.g. in a CCT trial, those opposed to the idea might be less willing to cooperate
Results are correct only for the convenience population
Not for the population that would be affected by the policy
Gold standard rhetoric protects results from questioning
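A rough sketch of the outlier problem in the microfinance example. The numbers are invented: in the treated group 10% of borrowers gain 50 units and everyone else gains nothing, so the true average effect is 5. Repeating the same trial many times shows how widely the estimate swings with 10 units per arm, and how often it even gets the sign wrong, compared with 1,000 units per arm.

```python
import random
import statistics

# Hypothetical microfinance-style trial: 10% of treated borrowers gain 50
# units, everyone else gains nothing, so the true average effect is 5.
random.seed(7)
TRUE_ATE = 0.1 * 50.0

def outcome(treated: bool) -> float:
    value = random.gauss(0.0, 5.0)             # ordinary variation
    if treated and random.random() < 0.1:      # rare big winner
        value += 50.0
    return value

def one_trial(n_per_arm: int) -> float:
    """Difference in means from one simulated trial with n units per arm."""
    treat = [outcome(True) for _ in range(n_per_arm)]
    ctrl = [outcome(False) for _ in range(n_per_arm)]
    return statistics.mean(treat) - statistics.mean(ctrl)

results = {}
for n in (10, 1000):
    estimates = [one_trial(n) for _ in range(400)]
    cuts = statistics.quantiles(estimates, n=20)   # 19 cut points: 5th ... 95th percentile
    wrong_sign = sum(e <= 0 for e in estimates) / len(estimates)
    results[n] = (cuts[0], cuts[-1], wrong_sign)
    print(f"n={n:4d} per arm: 90% of estimates between {cuts[0]:6.2f} and "
          f"{cuts[-1]:6.2f} (truth {TRUE_ATE}); wrong sign in {wrong_sign:.0%}")
```

In the small trial the estimate depends mostly on whether zero, one, or three "winners" happened to land in the treatment group, which is exactly the kind of weird result that cannot be told apart from a broken method.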
Large scale RCTs
Use all of the units in a country
E.g. a PDS/CCT experiment covering all of rural India
Comparable to the large social experiments in the US in the 1970s
New Jersey negative income tax experiment, SIME/DIME
RAND Health Insurance Experiment
The RAND experiment is an important part of the debate today; the others are not
Ex post data mining
A null result is never acceptable to the sponsors
Enormous pressure on investigators to find something
Usually by subgroup analysis, or by looking at other outcomes
MTO (Moving to Opportunity) has now examined thousands of outcomes
Some of the statistically significant ones are spurious
And we are back to the small sample problem again
Large experiments are not decisive either
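A minimal sketch of why examining many outcomes guarantees some spurious "findings". It uses simulated data only (nothing from MTO): the treatment has no effect on any of 1,000 hypothetical outcomes, yet a standard 5% test flags about one in twenty of them as significant.

```python
import math
import random
import statistics

random.seed(42)

N_PER_ARM = 100       # hypothetical sample size per arm
N_OUTCOMES = 1000     # hypothetical number of outcomes examined ex post
CRITICAL_Z = 1.96     # two-sided 5% test, normal approximation

def z_stat(a, b):
    """Two-sample z statistic for a difference in means (unequal variances)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

significant = 0
for _ in range(N_OUTCOMES):
    # The treatment has NO effect: both arms come from the same distribution.
    treat = [random.gauss(0.0, 1.0) for _ in range(N_PER_ARM)]
    ctrl = [random.gauss(0.0, 1.0) for _ in range(N_PER_ARM)]
    if abs(z_stat(treat, ctrl)) > CRITICAL_Z:
        significant += 1

print(f"{significant} of {N_OUTCOMES} outcomes 'significant' at 5%, all spurious")
```

Subgroup analysis has the same arithmetic: each extra cut of the data is another draw, and each subgroup is smaller, which brings back the small sample problem.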
Dynamic effects
Many policies take time to work out
Lots of things work as intended in the short run but fail later
People learn to “work the system”
Food rationing in Britain during the war:
Excellent at first, big nutritional benefits, solidarity
Crooks (“spivs”) learned to exploit it and create a black market
Support eventually vanished when it was continued too long
Old age pensions in South Africa: a cash transfer
Burial insurers were allowed on site to get first access to recipients
Higher-level corruption: banks?
Procurement and supply effects in food policy
What would an RCT show?
It works! Continuing the experiment long enough to see later effects would be expensive and unethical
We get the wrong answer, or only part of the answer
An issue in medicine too
TAKING EVIDENCE TO POLICY
Using a perfect evaluation
Suppose we have results such as:
On average, CCTs make people happier than PDS
On average, dams increase poverty
On average, reducing open defecation improves child health and reduces mortality
Suppose also that these were all done perfectly, so there is no dispute about the conclusions
Which, of course, never happens!
What use can we make of those results in policy?
Should the Planning Commission ban new dams?
Should MRD encourage better sanitation?
Should we replace PDS with CCTs?
That dams don’t work on average tells us little about any individual dam
It is an individual dam that comes up for approval, not all dams!
We need to know more: why dams cause poverty, and under what circumstances, none of which comes from an RCT
What should a village do?
Or any local authority that makes the decision
Given an RCT comparing CCTs with PDS
Again, the average is useful but not decisive
Will it have the same effect for us?
We are not the average village
Again, we need to know why it works, not whether it works
A neighboring village tried it and is happy with the outcome
Perhaps this is just an anecdote (“your uncle likes his new TV”)
But for the village, the average outcome is an anecdote too
Perhaps the authorities should visit their neighbors, see what is going on, and see whether it would work for them
The average is more useful for a public health policy that will be applied to the whole country
Sanitation?
Finding out what works?
A trial and error process
But T & E is NOT the same as an RCT
T & E, endless tinkering, is a good description of the Industrial Revolution
How to invent a steam engine, or a toaster
How medical science works, on procedures and devices
For which trials are close to irrelevant, and in many cases have never been done
T & E using knowledge and intelligence can solve the dimensionality problem
Seeing into the machine
Allows a village, the ministry, or the Planning Commission to make a better choice
They may be able to see whether it would work for them
They may be able to see places where they could adapt it and make it better
Hope to understand the process and how it would work in context
Trial and error, plus local knowledge and hard thought
Experimentation, but not necessarily RCTs
What are the “helping factors” that made a trial work?
E.g. clinics in Mexico!
Can teach us why things work, which is generalizable knowledge
Causality & helping factors
Don't RCTs reveal causality?
Isn't this particularly helpful for policy? Yes and no.
Causality, by itself, is not always useful
The house burned down because the TV was left on
Causal, but not general: TVs do not usually burn down houses
An RCT would show this causal effect
But TVs need “helping factors”, like bad wiring or inflammable material left nearby
We have to think about what the helping factors are, how they work, and whether they will work for us
“It was the treatment that did it, not something else!”
But will a CCT work in a particular village, or during food price inflation, or in a competent vs. a corrupt state?
Does it need banks, or clinics, to make it work?
Does it matter who gets it? Men and women: gender issues in India vs. Latin America
Replication of an RCT is not useful, because we get different results in different contexts, with or without the helping factors
Causality is “local”
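A minimal sketch of the helping-factor argument, with hypothetical numbers: the treatment raises the outcome by 2 units, but only where a helping factor is present (think of the clinics in Mexico). Two perfectly run RCTs of the same treatment, in contexts that differ only in how common the helping factor is, report very different average effects.

```python
import random
import statistics

# Hypothetical: the treatment raises the outcome by 2 units, but only for
# units that also have the "helping factor" (think: a working clinic nearby).
random.seed(3)
EFFECT_WITH_HELPER = 2.0

def outcome(treated: bool, p_helper: float) -> float:
    has_helper = random.random() < p_helper
    effect = EFFECT_WITH_HELPER if (treated and has_helper) else 0.0
    return random.gauss(10.0, 1.0) + effect

def run_trial(p_helper: float, n_per_arm: int = 5000) -> float:
    """Difference in means from one simulated RCT in a given context."""
    treat = [outcome(True, p_helper) for _ in range(n_per_arm)]
    ctrl = [outcome(False, p_helper) for _ in range(n_per_arm)]
    return statistics.mean(treat) - statistics.mean(ctrl)

# Same treatment, same mechanism, two contexts that differ only in how
# common the helping factor is.
estimates = {p: run_trial(p) for p in (0.8, 0.1)}
for p, est in estimates.items():
    print(f"helper share {p:.0%}: estimated ATE {est:.2f} "
          f"(expected {EFFECT_WITH_HELPER * p:.2f})")
```

Both estimates are internally valid; they differ because each is an average over a different mix of units with and without the helping factor. Neither tells a new village what to expect unless it knows whether it has the helping factor.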
Cartwright: Local causality
Open window A and fly kite B. String C opens door D, which allows moths E to escape and eat shirt F. Lighter shirt lowers shoe G onto switch H, which heats iron I, which burns pants J. Smoke K enters tree L and smokes out possum M into basket N, pulling rope O and lifting cage P, allowing woodpecker Q to chew pencil R. (Emergency knife S in case woodpecker or possum gets sick and can’t work.)
Expanding literature
We now have enough RCT papers to judge their quality and the evidence that they claim
Some excellent, some terrible
Just like other empirical papers in development
But they must be judged case by case, like all other empirical work
There is no free pass just because they are RCTs
Using the phrase “rigorous evaluation” as a code word for RCT is without justification
Right now, in the economics and aid literature, they are being given a free pass
Sometimes absurd generalizations based on small, special RCTs
RCTs have no monopoly on rigour; there is no gold standard