Transcript Document

Building Evidence in Education:
Workshop for EEF evaluators
2nd June: York
6th June: London
www.educationendowmentfoundation.org.uk
The EEF by numbers
• 3,000 schools participating in projects
• 34 topics in the Toolkit
• 14 members of EEF team
• 6,000 heads presented to since launch
• 600,000 pupils involved in EEF projects
• £220m estimated spend over lifetime of the EEF
• 16 independent evaluation teams
• 10 reports published
• 83 evaluations funded to date
Session 1: Design
RCT design, power calculations and
randomisation
Ben Styles (NFER)
Maximising power using the NPD
John Jerrim (Institute of Education)
RCT design, power calculations and randomisation
Ben Styles
Education Endowment Foundation
June 2014
RCT design
• The ideal trial
• Methods of randomisation
• Power calculations
• Syntax exercise!
A statistician’s ideal trial
• Randomly select eligible pupils from NPD
• No consent!
• Simple randomisation of pupils to intervention
and control groups
• No attrition
• No data matching problems
• No measurement error
BEFORE YOU START!
1. Trial registration: specification of primary and secondary outcomes in addition to subgroup analyses
2. Recruit participants and explain method to stakeholders
3. Select participants according to fixed eligibility criteria
4. Obtain consent
5. Baseline outcome measurement (or use existing administrative data)
6. Randomise eligible participants into groups (evaluator carries out randomisation)
7. Intervention runs in experimental group; control receives ‘business-as-usual’/an alternative activity
8. Administer follow-up measurement (evaluator)
9. Intention-to-treat analysis followed by reporting as per CONSORT guidelines
10. Control receives intervention (under what circumstances?)
Why we depart from the ideal
• Schools manage pupils!
• Nature of the intervention
• Contamination – how serious is the risk?
Restricted randomisation?
• Use simple randomisation where you can
• Timetable considerations in a pupil-randomised
trial → stratify by school
• Important predictor variable with small and
important category → stratify by predictor
• Fewer than 20 schools → minimise
http://minimpy.sourceforge.net/
• Multiple recruitment tranches → blocked
• Pairing → BAD IDEA!
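As a concrete illustration of the school-stratified case, here is a minimal sketch of a pupil randomisation blocked by school. It is written in Python for this transcript (it is not the workshop's syntax file) and the pupil list is invented:

```python
# Minimal sketch: pupil randomisation stratified (blocked) by school.
# The pupil data are invented for illustration.
import random

random.seed(20140602)  # fix the seed so the allocation is reproducible

pupils = [
    {"id": 1, "school": "A"}, {"id": 2, "school": "A"},
    {"id": 3, "school": "A"}, {"id": 4, "school": "A"},
    {"id": 5, "school": "B"}, {"id": 6, "school": "B"},
]

allocation = {}
for school in sorted({p["school"] for p in pupils}):
    ids = [p["id"] for p in pupils if p["school"] == school]
    random.shuffle(ids)
    half = len(ids) // 2              # equal split within each school
    for pid in ids[:half]:
        allocation[pid] = "intervention"
    for pid in ids[half:]:
        allocation[pid] = "control"

print(allocation)
```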
Restricted randomisation
[Figure: simple randomisation vs restricted randomisation allocation diagrams]
Restricted randomisation is more complicated and can go wrong. Take strata into account in the analysis:
http://www.bmj.com/content/345/bmj.e5840
To remember!
If you have restricted your randomisation using
a factor that is associated with the outcome
(e.g. school) THEN
INCLUDE THE FACTOR AS A COVARIATE IN
YOUR ANALYSIS
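A minimal sketch of what that looks like in practice, using an ANCOVA-style model in statsmodels, with invented data and variable names:

```python
# Sketch: include the stratification factor (school) as a covariate,
# alongside the baseline measure. Data and names are invented.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "posttest": [52, 48, 61, 55, 40, 44, 58, 50],
    "treat":    [1, 0, 1, 0, 1, 0, 1, 0],
    "baseline": [50, 47, 60, 56, 42, 43, 55, 51],
    "school":   ["A", "A", "B", "B", "C", "C", "D", "D"],
})

model = smf.ols("posttest ~ treat + baseline + C(school)", data=df).fit()
print(model.params["treat"])  # covariate-adjusted estimate of the effect
```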
Chance imbalance at baseline
• As distinct from bias induced by
measurement attrition
• Can be quite large in small trials, e.g. on the baseline measure
• Include covariate in final analysis
Sample size calculations
• School or pupil-randomised?
• Intra-cluster correlation
• Correlation between covariate and outcome
• Expected effect size
• p(type I error) = 0.05; power = 0.8
• Attrition
Rule of thumb: Lehr, 1992 (worked through below)
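Lehr's rule of thumb gives a quick check on any of these calculations: for a two-sided test at p(type I error) = 0.05 and power = 0.8, the required number per group in an unclustered trial is approximately

$$ n \approx \frac{16}{d^2} $$

where d is the expected effect size in SD units. For d = 0.2, that is roughly 16/0.04 = 400 pupils per group, before any adjustment for clustering or attrition.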
Pupil randomised
• ICC = 0
• Correlation between baseline and outcome:
http://educationendowmentfoundation.org.uk/
uploads/pdf/Pre-testing_paper.pdf and your
previous work
• Effect size: previous evidence; cost-effectiveness; EEF security ratings
• Attrition: EEF allows recruitment to be 15% above the sample size required after attrition
Cluster-randomised
• Same as for pupils, aside from the ICC
• Proportion of total variance that is due to between-cluster variance
• EEF pre-testing paper has some useful
guidance
• Pre-test also reduces ICC e.g. from 0.2 to
0.15 for KS2 baseline, GCSE outcome
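The standard clustering adjustment behind these figures, consistent with the spreadsheet below, is the design effect:

$$ \text{Deff} = 1 + (m-1)\rho, \qquad n_{\text{effective}} = \frac{n}{\text{Deff}} $$

where m is the number of pupils per school. With m = 180 and ρ = 0.15, Deff = 27.85, so the 5,580 pupils per arm in the spreadsheet behave like roughly 200 independent pupils.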
MDES
• Minimum detectable effect size
• EEF require this, calculated on the basis of the real (achieved) parameters, for the security rating
• (avoid retrospective power calculations)
• How good were my estimates?
Sample size spreadsheet
(fill in the highlighted boxes)
Scenario 1
Expected number of pupils per school being sampled: 180
ROH (intra-class correlation: percentage of variance in the outcome being studied attributable to school attended): 0.15
Deff (adjustment for nested design): 27.85
Confidence level (of test we will use to assess effect): 95.0%
Critical t-value: 1.96
Correlation between before and after scores: 0.70
SD of residuals in scores (if scores have SD of 1): 0.71
Expected effect size (in terms of absolute outcome scores): 0.2
Expected effect size (in terms of residual outcome scores): 0.28
n(schools) in intervention: 31
n(schools) in control: 31
n(pupils) in intervention: 5580
n(pupils) in control: 5580
Expected SE of difference between groups (in SDs): 0.10
Power: 80.0%
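The spreadsheet's arithmetic can be checked independently. A minimal sketch in Python; the formulas are the standard ones implied by the outputs above, reconstructed here rather than taken from the spreadsheet itself:

```python
# Sketch reproducing the Scenario 1 spreadsheet arithmetic.
import math
from scipy.stats import norm

m = 180      # pupils sampled per school
rho = 0.15   # intra-class correlation (ROH)
r = 0.70     # correlation between before and after scores
d = 0.20     # expected effect size (absolute outcome SDs)
k = 31       # schools per arm

deff = 1 + (m - 1) * rho          # 27.85: design effect for clustering
sd_res = math.sqrt(1 - r ** 2)    # 0.71: SD of residuals
d_res = d / sd_res                # 0.28: effect size in residual SDs
n = k * m                         # 5580: pupils per arm
se = math.sqrt(2 * deff / n)      # 0.10: SE of difference (residual SDs)
power = norm.cdf(d_res / se - norm.ppf(0.975))  # ~0.80

print(f"Deff={deff:.2f}  d_res={d_res:.2f}  SE={se:.2f}  power={power:.0%}")
```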
[Figure: power curve, power (0-100%) against effect size (0.00-0.30), for n(intervention)=31; n(control)=31.]
Running the randomisation
SYNTAX EXERCISE
• In pairs, explain what each of the steps does
• How many schools were randomised in this
block?
Conclusions
• Always think of any RCT (any quantitative
impact evaluation) as a departure from the
ideal trial
• The design, power calculations, method of
randomisation and analysis all interrelate and
need to be consistent
Maximising power using the NPD
John Jerrim (Institute of Education)
Structure
How much power do EEF trials currently have?
• PISA, power, star ratings and current EEF trials
Exercise
• Work in groups to design an EEF trial
• Goal = Maximise power at minimal cost
My answers
• How might I try to maximise power?
Your answers! / Discussion
Power in context
Effect sizes, PISA rankings and EEF
padlock ratings
How powerful are EEF trials thus far?
[Figure: detectable effect sizes of EEF secondary school trials, as of 01/05/2014. Mean = 0.276; median = 0.25; between 4* and 5* by EEF guidelines.]
Power and the PISA reading rankings
[Figure: PISA reading rankings, from Shanghai-China at the top down through Hong Kong-China, Singapore, Japan, Korea, Finland and others to the United Kingdom's current position, with effect sizes overlaid for scale: 0.50 (EEF 2*), 0.40 (EEF 3*), 0.30 (EEF 4*), MEDIAN EEF TRIAL = 0.25, 0.20 (EEF 5*), 0.10.]
IMPLICATION: effect sizes of 0.20 are damn big... particularly given the pretty small doses we are giving.
Do we currently have a power problem?
- Quite possibly!
- So trying to get more power in future trials is very important...
Exercise
Task: In groups, discuss how you would design the
following trial
Intervention = Teaching children how to play chess
Maximum number of treatment schools = 20 secondary schools
Year group = Year 7
Level of randomisation = School level
Test = One-to-one non-verbal IQ assessment with trained
educationalist (end of year 7)
Control condition = ‘Business as usual’
Study type = ‘Efficacy’ study (proof of concept)
Objective: Maximise power at minimum cost
How would you design this trial to meet these twin
objectives?
What could you do to increase power in this trial?
E.g. would you use a baseline test? If so, what?
My answers
The usual suspects…..
…and less obvious options
The usual suspects…..
1. Use a regression model and include baseline covariates.....
- Adding controls explains variance. Boosts power.
2. Use Key Stage 2 test scores as the “pre-test”....
- The point of baseline covariates is to explain variance
- KS2 scores in maths are likely to be reasonably correlated with the outcome (non-verbal IQ)
- CHEAP! From the NPD.
3. Stratify the sample prior to randomisation
- Potentially reduces error variance. Thus boosts power.
- Additional advantages: balance of baseline characteristics.
4. Really engage with control schools
- Make sure we minimise loss of sample through attrition
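The gain from points 1 and 2 can be made concrete: a baseline covariate with correlation r to the outcome leaves residual variance 1 − r², and the required sample size scales by the same factor:

$$ n_{\text{adjusted}} \approx (1 - r^2)\, n_{\text{unadjusted}} $$

So if KS2 scores correlate with the outcome at around r = 0.7, the required sample roughly halves (a factor of 0.51).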
Less ‘obvious’ options....
Don’t test every child........
- There are around 200 children per secondary school.....
- One-to-one testing is expensive
- Testing more than 50 pupils buys you little additional power
- RANDOMLY SAMPLE PUPILS WITHIN SCHOOLS!
[Figure: detectable effect size (0.35-0.60) against cluster size (0-200). Assumptions: 20 schools; pre/post corr of 0; 80% power; rho = 0.15.]
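The diminishing return in the figure can be reproduced from the standard MDES formula. A sketch, assuming the stated 20 schools are split 10 per arm (my reading of the assumptions) with no pre-test:

```python
# Sketch: detectable effect size against cluster size, under the figure's
# stated assumptions (rho = 0.15, 80% power, no pre-test) plus my
# assumption of 10 schools per arm.
import math

Z = 1.96 + 0.84       # z(alpha = 0.05, two-sided) + z(80% power)
rho, k_per_arm = 0.15, 10

for m in (10, 25, 50, 100, 200):
    deff = 1 + (m - 1) * rho                    # design effect
    se = math.sqrt(2 * deff / (k_per_arm * m))  # SE of group difference
    print(f"cluster size {m:3d}: MDES = {Z * se:.2f}")
# MDES falls from ~0.61 at m = 10 to ~0.51 at m = 50, but only ~0.49
# at m = 200: testing beyond ~50 pupils per school buys little.
```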
.....use an unequal sampling fraction
• We all know that ↑ clusters (k) means ↑ power
• This example: limited to only a small number of treatment schools (20)
• ....but the control condition was non-intrusive and cheap
• So don’t just recruit 20 control schools as well: recruit more!
• Nothing about RCTs means we need equal k for treatment and control
• Power calculation becomes more complex (anybody know it!?)
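For what it's worth, a standard approximation (offered here as an answer to the slide's question, not something from the talk) replaces the equal-arms term with a sum of reciprocals over the two arms:

$$ \mathrm{MDES} \approx (z_{1-\alpha/2} + z_{1-\beta})\sqrt{\big(1+(m-1)\rho\big)\left(\frac{1}{k_T m} + \frac{1}{k_C m}\right)} $$

where $k_T$ and $k_C$ are the numbers of treatment and control clusters and $m$ the pupils per cluster. Holding $k_T = 20$ and doubling $k_C$ from 20 to 40 cuts the bracketed sum from $0.10/m$ to $0.075/m$, shrinking the MDES by a factor of $\sqrt{0.75} \approx 0.87$, i.e. about 13%.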
Use a more homogeneous selection of schools....
[Figure, PISA 2009 data: rho plotted against the percentage of all UK schools included in the population. All UK schools: ρ ≈ 0.30; “worst” 25% of schools only: ρ ≈ 0.05.]
Why does rho decline??
[Figure: within- and between-school variation (sigma) against the percentage of all UK schools in the population. The within-school variation barely changes, while the between-school variation declines substantially.]
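This follows from the definition of the intra-cluster correlation:

$$ \rho = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}} $$

Restricting the population to low-performing schools shrinks the between-school variance in the numerator while leaving the within-school variance almost unchanged, so ρ falls.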
Implications
• As the example is an efficacy study, why not restrict attention to low-performing schools only?
- Boosts power!
- Fits with EEF mandate (close the performance gap)
- Not worried about generalisability
• We implicitly do this anyway (e.g. by doing trials in just one or two LAs)......
• .....but can we do it in a smarter way???
• Little appreciated trade-off between POWER and GENERALISABILITY
- Long-term implications for EEF
- A trial representative of the England population is very hard to achieve
Conclusions
Do we have a “power problem”?
• Quite possibly
• Median detectable effect size = 0.25 in EEF secondary school
trials
• If we were to boost UK reading PISA scores by this amount, we would move above Canada, Taiwan and Finland in the rankings.....
Ways to potentially increase power
• Include baseline covariates (from the NPD where possible)
• Stratify the sample prior to randomisation
• Engage with control schools!
• Do you need to test every child? Practical alternatives?
• Could you increase the number of control schools without adding much to cost (unequal randomisation fraction)?
• Could you restrict your focus to a narrower population (e.g. low-performing schools only)?