Evidence in medicine: getting back to the Hill top
Scientific commonsense and statistics in medicine: getting back to the Hill top
John Worrall
Philosophy, Logic and Scientific Method
London School of Economics
Statistical Science & Philosophy of Science, LSE, June 2010
EBM
• 1. Sketch evolution of EBM
• 2. Seemed initially to endorse a very rigid view on what
counts as evidence
• 3. Show how it has retreated/been clarified to bring it
more into line with scientific commonsense (aka some
basic principles from philosophy of science)
• 4. Add as a historical footnote that EBM seems to have
been gradually rediscovering the much more nuanced
and commonsensical views of Austin Bradford Hill,
whom it venerates as one of its founders.
EBM
• Not critical of RCTs
• But only of inflated claims about their epistemic power
• Indeed …
Evidence based everything!
• “A wise man proportions his belief to the evidence.” – David Hume, Enquiry Concerning Human Understanding
The evolution of EBM
• Word association: EBM → RCT → ‘Gold standard’
• Initial impression: EBM says only evidence from RCTs
counts scientifically
• Blood letting, e.g., may not be just an historical
phenomenon
• Should trust neither ‘patho-physiologic’ rationale nor so-called clinical expertise
• Only trial evidence counts, and only RCTs are free from
bias
The evolution of EBM
• Misinterpretation??
• Certainly but:
• ‘[if] the study was not randomized we’d suggest that you
stop reading it and go on to the next article in your
search’ (Sackett et al Evidence Based Medicine, 3rd
edition, p.108)
The evolution of EBM
• Some pro-EBM cases (grommets for glue ear)
• But more measured voices immediately pointed out:
• Lots of contrary cases:
The evolution of EBM
• thyroxine for myxoedema
• insulin for diabetic ketoacidosis
• vitamin B12 for pernicious anaemia, etc., etc.
• appendicectomy for acute appendicitis, etc.
The evolution of EBM
• Retreat/clarification
• Clinical expertise/‘patho-physiologic rationale’ to be incorporated, not ignored …
• (How?)
• Other types of evidence have some weight
• But RCT retains a very special role
• Evidence Hierarchy
An Evidence Hierarchy
The evolution of EBM
• But only one of many
• A 2002 study found 40 different hierarchies
• A 2006 study added 20 more
• All agree that the RCT remains king (it trumps; no overall evaluation)
• But also differences:
• a. some put meta-analyses top, others omit them
• b. where cohort/case-control studies are placed
The evolution of EBM
• Also
• 1. Explicit concession that RCTs are not needed for
‘dramatic’ effects (Glasziou et al.)
• 2. Fleeting recognition that it seems odd to hold that
some of the most clearly efficacious treatments do not
have ‘best’ evidence
• 3. And odd that various unfortunate a priori judgments
are endorsed
The evolution of EBM
• Also also
• Swing in the frequentist/Bayesian balance
• Finally
• One very influential voice: Sir Michael Rawlins
The evolution of EBM
• Rawlins certainly pro evidence in general sense
• But scathing about ‘anorak EBM’
The evolution of EBM
• 1. Evidence hierarchies are internally unjustified since they overrate RCTs:
• “The notion that evidence can be reliably placed in
hierarchies is illusory. Hierarchies place RCTs on an
undeserved pedestal for .. although the technique has
advantages it also has significant disadvantages.
Observational studies too have defects but they also
have merit.”
The evolution of EBM
• 2. Whole idea a mistake:
• “Hierarchies attempt to replace judgement with an
oversimplistic, pseudo-quantitative, assessment of the
quality of the available evidence. [In fact d]ecision
makers have to incorporate judgements, as part of their
appraisal of the evidence, in reaching their conclusions.”
Philosophy of Science to the rescue?
• Groundhog day?
• The whole rationale for EBM was distrust of judgment…
• Flux →
Philosophy of Science to the rescue?
• Need to start from a more fundamental perspective
• RCT is not the embodiment of the scientific method
• But is resorting to philosophy of science likely to resolve
controversy?
• Many differences in the logic of evidence, but we need only one and a half fundamental and agreed principles
Philosophy of Science to the rescue?
• Everyone agrees that “real” evidence for T not only
• (i) accords with T; but also
• (ii) is ‘unlikely otherwise’
• So in particular evidence is the stronger the more plausible alternatives it rules out
• Popper, Bayes, Mill, Mayo .. all agree that support
depends on p(e,b)
• Popper (and Mayo) – ‘severe test’
• Bayes factor = p(e,h & b)/p(e,b)
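• (A minimal restatement of the shared principle, on the assumption that the notation “p(e, b)” above abbreviates the conditional probability of e given background b alone:)

$$\text{Bayes factor} \;=\; \frac{p(e \mid h \wedge b)}{p(e \mid b)} \;=\; \frac{p(h \mid e \wedge b)}{p(h \mid b)}$$

• So, for fixed p(e | h ∧ b), the support e lends to h is greater the smaller p(e | b) is – that is, the more ‘unlikely otherwise’ the evidence.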
Philosophy of Science to the rescue?
• Principle 1 entails:
• (i) When confronted with an apparently positive trial
result, the question always is: ‘is there a plausible
alternative explanation for this outcome other than the
(superior) effectiveness of the treatment?’
• (ii) You can’t be in a better epistemic position in a trial
than if background knowledge supplies no reason to
think that the experimental and control group are
significantly different.
Philosophy of Science to the rescue?
• These in turn entail:
• 1. Randomizing is neither sufficient nor necessary for
genuine support from a trial outcome
• 2. These plausibility judgments are going (often) to
depend on effect size
• 3. The only solid reason to randomize is that it controls –
not for all possible confounders – but instead for the
specific possible confounder ‘selection bias’
• (BUT if the selection bias can be eliminated by other
means – or at any rate reduced – and the effect size is
large …. )
Randomizing not sufficient
• Despite claims like these:
• “In a randomised trial, the only difference between the
two groups being compared is that of most interest: the
intervention under investigation.” (Mike Clarke)
• “As their name suggests, RCTs involve the random
allocation of different interventions (treatments or
conditions) to subjects. As long as numbers of subjects
are sufficient, this ensures that both known and unknown
confounding factors are evenly distributed between
treatment groups.” (Wikipedia)
Randomizing not sufficient
• Everyone knows the sufficiency claim is false
• Amusing example
• Leibovici et al “Effects of remote, retroactive,
intercessory prayer on outcomes in patients with
bloodstream infection: randomised controlled trial” BMJ
2001
Randomizing not sufficient
• 3393 patients who had had a bloodstream infection while inpatients at the Rabin Medical Centre during 1990–6 were identified.
• In July 2000 a random number generator was used to
divide these patients into two groups and which of these
two became the treatment group was decided by a coin
toss.
• 1691 were randomized to the intervention group and
1702 to the control.
• Checked for ‘baseline imbalances’ with regard to main
risk factors for death and severity of illness.
Randomizing not sufficient
• The names of those in the intervention group were given
to a person ‘who said a short prayer for the well being
and full recovery of the group as a whole.’
• Results: both length of stay in hospital and duration of
fever were significantly shorter in the intervention group
(p = 0.01 and p = 0.04)!
• Conclusion: ‘Remote, retroactive intercessory prayer
said for a group is associated with a shorter stay in
hospital and shorter duration of fever in patients with
bloodstream infection and should be considered for use
in clinical practice.’
• Somewhat tongue-in-cheek of course (‘No patients were
lost to follow up’!!)
Randomizing not sufficient
• Natural reaction shows we are all Bayesian:
• “If the pre-trial probability is infinitesimally low, the results
of the trial will not really change it, and the trial should
not be performed. This, to my mind, turns the article into
a non-study, though the details provided (randomization
done only once, statement of a prayer, analysis, etc) are
correct.”
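• (A back-of-the-envelope illustration of this reaction – the prior and the likelihood ratio below are invented round numbers, not figures from the Leibovici paper or its responses:)

$$\text{posterior odds} \;=\; \text{prior odds} \times \frac{p(e \mid h)}{p(e \mid \neg h)} \;\approx\; 10^{-10} \times 20 \;=\; 2 \times 10^{-9}$$

• I.e. even granting a generous likelihood ratio of 20 for the observed p = 0.01 result, a pre-trial probability of around 10⁻¹⁰ for retroactive intercessory prayer leaves a posterior that is still effectively zero: the ‘significant’ trial outcome cannot rescue a hypothesis that background knowledge makes this improbable.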
Randomizing not necessary
• As now acknowledged by Glasziou et al.:
• If you have an effect size so ‘dramatic’, and no reason to
think that the patients you are treating now are greatly
different from those treated earlier, then there is no
plausible alternative to the theory that the effect is
produced by the treatment.
Philosophy of Science to the rescue?
• Other ‘half’ principle:
• Make sure you’re testing the theory you want tested.
1½. What is evidence, evidence for?
• Typical research reports:
• ‘Efficacy and safety of ustekinumab .. in patients with
psoriasis..’
• ‘Active symptom control with or without chemotherapy in
the treatment of patients with malignant pleural
mesothelioma ..’
1½. What is evidence, evidence for?
• Report (usually randomized) trials on some selected
group of patients
• involving a number of exclusion criteria
• generally using some very precise treatment regimen;
and where
• the treatment is given for some relatively brief period
• Rawlins: ‘Most RCTs, even for interventions that are likely to be used by patients for many years, are of only six to 24 months duration.’
• (N.B. However this is also true of non-RCTs)
1½. What is evidence, evidence for?
• Result may be: D is ‘more effective’ than the comparator
for condition C
• What exact theory has been tested?
• Not really this (vague) claim but rather:
• D when administered in a very particular way to a very
particular set of patients for a particular length of time is
more effective than some comparator treatment
(perhaps placebo).
• RCT provides – let’s say impeccable – evidence for this
1½. What is evidence, evidence for?
• But this is not the claim that the practitioner is interested
in.
• She wants evidence for:
• D is effective (in a wide sense) when prescribed to the
sorts of patients she would like to prescribe it to.
• ‘Target population’ will include the excluded
• Dosage may be adjusted
• For chronic conditions prescribed for a long time
1½. What is evidence, evidence for?
• Usually run in terms of ‘external validity’
• But better as: wrong theory being tested
• NB not a Humean ‘purely philosophical’ issue
• Specific grounds for thinking the target and study populations different
• Bartlett et al looked at RCTs on NSAIDs and Statins and
found older people, women and ethnic minorities all
consistently underrepresented
1½. What is evidence, evidence for?
• When looked at in my phil of sci way clear that
• RCT gives strongest evidence for wrong theory (…)
• By no means entails
• RCT gives strongest evidence for right theory
1½. What is evidence, evidence for?
• Truog argued this in case of ECMO
• (largely on grounds of changing technologies)
• Bluhm extended argument to chronic diseases
• (largely on grounds of short-term nature of trials compared to long-term nature of treatment)
• NB not a case of ‘no RCTs so go down the hierarchy to
the “next best”’
• Rather observational studies arguably give best
evidence - once the right theory has been identified
How much of this was anticipated by Hill
• “Criteria”
• But actually ‘commonsense’ is more important
• A 1950 paper ‘On the controlled trial’
• Two 1965 papers in particular:
• “The Environment and Disease: Association or Causation?”
• “Reflections on the Controlled Trial”
• (Continuing discussion in epidemiology of the “criteria”
and some periodic ‘rediscovery’.. But ..)
How much of this was anticipated by Hill
• Widely regarded as the prime mover re the randomized
clinical trial
• BUT:
• “The history of science … shows that frequently with a new discovery … the pendulum at first swings too far. Has this been so with the clinical trial? Is it true, as Cromie (1963) has suggested, that ‘little or no credence is now given to clinical observations even by experienced investigators’ while there is ‘a blind acceptance of double-blind trials without a clinical evaluation of their shortcomings and their ability to mislead as well as to lead’?”
• And he concludes in fact that
• “Any belief that the controlled trial is the only way would mean not
that the pendulum had swung too far but that it had come right off its
hook.”
How much of this was anticipated by Hill
• 1. We have to use ‘background knowledge’ in order to
interpret any result:
• Need to look at any putatively positive result in whatever
trial with ‘the fundamental question’ in mind
• “That fundamental question [is] – is there any other way of [plausibly] explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect?”
• The Leibovici example shows that sometimes the answer is ‘there must be, but we can’t specify what it is’
How much of this was anticipated by Hill
• Same fundamental question whether the trial was
randomized or not and even if there was no formal trial
at all!
• Enthusiastically endorses Claude Bernard:
• “.. it is imperative that we draw no precise line between observation and experiment. It is just 100 years since the great experimentalist Claude Bernard (1865) wrote: ‘a physician observing a disease in different circumstances, reasoning about the influence of these circumstances and deducing consequences which are controlled by other observations – this physician reasons experimentally even though he makes no experiments.’”
How much of this was anticipated by Hill
• RCTs do have one – but only one – advantage:
• “Faithfully adhered to [randomizing] offers three great advantages: (1) it ensures that our personal feelings or judgments, applied consciously or unconsciously, have not played any part in building up the various treatment groups; from that aspect, therefore, the groups are unbiased; (2) it removes the very real danger, inherent in any allocation which is based upon personal judgments, that believing our judgments may be biased, we endeavour to allow for that bias and in so doing may ‘lean over backwards’ and thus introduce a lack of balance from the other direction; (3) having used such a random allocation we cannot be accused by critics of having set up personally biased groups for comparison.”
How much of this was anticipated by Hill
• And this is plausible only if the apparent effect is small
• Indeed where it is large we not only do not need randomization, we don’t need any formal statistical tests.
• Relates his investigation into sickness patterns in card
rooms in cotton mills in Lancashire.
• The illnesses the card room workers suffered were so much worse than, and so different from, those of workers in other parts of the mill that, Hill argued, the evidence established a causal link with the environmental conditions.
How much of this was anticipated by Hill
• He reports:
• ‘My results were set out for men and women separately and for half
a dozen age groups in 36 tables. So there were plenty of sums. Yet I
cannot find anywhere I thought it necessary to use a test of
significance. The evidence was so clear cut, the differences
between the groups were mainly so large, the contrast between
respiratory and non-respiratory causes of illness so specific, that no
formal tests could really contribute anything of value to the
argument. So why use them?’
How much of this was anticipated by Hill
• Statistics has uses, of course, but again a pendulum
swinging too far:
• “To decline to draw conclusions without standard errors can surely
… be silly? Fortunately I believe we have not yet gone so far as our
friends from the USA where, I am told, some editors of journals will
return an article because tests of significance have not been applied
[!!]. Yet there are innumerable situations in which they are totally
unnecessary – because the difference is grotesquely obvious,
because it is negligible, or because, whether it be formally
significant or not, it is too small to be of any practical importance.”
How much of this was anticipated by Hill
• So always same fundamental question – essentially
“what else can it be?” – whatever results you are looking
at.
• Basic philosophy underlying famous “criteria” of causality
• strength, consistency, specificity, temporality, biological
gradient, plausibility, coherence, experiment, and
analogy
• Want here just to articulate and endorse underlying view
of evidence
How much of this was anticipated by Hill
• 1. Have ‘the facts before us’
• 2. Is the relationship causal or ‘merely’ an association?
• 3. Both answers are deductively consistent with the facts
• 4. Need some further evidential input if we are to be on as safe ground as possible
• 5. Of course, if we had rock-solid evidence of a deterministic mechanism we wouldn’t even need stats.
How much of this was anticipated by Hill
• 6. But such clear cut ‘background’ evidence not available
in cases of interest
• 7. Hill’s main point is that we should not give up:
• Background knowledge may at least assist us by providing certain general constraints
• 8. E.g. background knowledge tells us that many causal mechanisms are linear
• Hence the fact that heavier smokers suffer more cancers supplies further evidence for the causal link (dose-response)
How much of this was anticipated by Hill
• 9. Similarly background knowledge makes certain causal
links plausible and others implausible without supplying
detailed knowledge of the links
• (combination of plausibility and coherence)
• Of course everything is defeasible – but that is no excuse for failing to make decisions in the light of current evidence
How much of this was anticipated by Hill
• Exercising judgment (Rawlins)?
• Certainly not unanalysable judgment
• We know a lot about the world ahead of any statistical
trial
• Fisher’s understandable but egregious error
• Bayesians don’t help by talking about ‘subjective priors’
• Nothing subjective, e.g., about the view that remote
intercessory prayer can have no effect!
How much of this was anticipated by Hill
• Very much aware of 2nd issue too.
• Indeed his very first criticism of many RCTs is that they
address the wrong question:
• “many controlled trials that are published fall lamentably short of
what is really required .. The authors do not appear to have asked
themselves at the outset the deceptively simple but dominating
question ‘what precisely am I trying to find out?’”
• Always need to have an eye to generalising – and generalising to the target population
• Makes a number of challenging points
How much of this was anticipated by Hill
• End by highlighting two:
• 1. Double blinding may be a problem rather than a virtue
• 2. Data mining is often good.
• “There is one feature of the modern controlled trial that frequently
hampers the clinician in making acute and discriminating
observations of his patient – and that is the double-blind procedure.”
• Of course he recognises that not blinding introduces the
possibility of bias: especially in estimating subjective
outcomes.
How much of this was anticipated by Hill
• But even where you can eliminate that by a division of
labour
• In some situations double blinding is ‘injurious to the trial’
• “Such situations arise when it is important, for the sake of a realistic
trial, that the doctor in charge of the patient be able to adjust the
dose of a drug according to the patient’s reactions and according to
his judgment of the patient’s requirements….It may well be asked
therefore in the planning of a trial, which is the more important – for
the doctor to be ignorant of the treatment and unbiased in his
judgment or for him to know what he is doing and to be able to
adjust what he is doing so as to observe closely the results and then
make unbiassed judgments to the best of his ability and conscious
mind?”
How much of this was anticipated by Hill
• NO reason why a controlled trial cannot be aimed at the
question:
• “ If competent clinicians in charge of defined types of patients use
drug X in such varying amounts and for such varying durations of
time, and so forth, as they think advisable for each patient, what
happens?”
How much of this was anticipated by Hill
• 2. Data mining may not be an epistemic sin but is often
essential.
• Essential, even within trials that show an overall marginal effect at best, to try to “identify some sub-group of patients who do tend to respond favourably to the treatment.”
• Of course not just any sub-group – that would be
epistemically sinful – but any sub-group that background
knowledge tells you might plausibly react differently.
How much of this was anticipated by Hill
• Disagrees with Sir John McMichael, who wrote:
• “the aim of a statistical trial is to include all the unpredictable
multitude of factors which can influence the outcome by a
comprehensive sample. Unless the treatment shows a convincing
difference in outcome in the whole group it is not permissible to
separate out afterwards a sub-division of better results. Any subdivisions should be done on other criteria before the trial begins.”
• BUT, Hill:
• “McMichael is .. criticizing the analysis of the MRC report on long-term treatment with anti-coagulants in terms of age and sex; two
features in prognosis which invariably and so often call for divided
attention that there could never be any question of before and
afterwards.”
How much of this was anticipated by Hill
• But in any event
• “.. I can see no argument in favour of his view, either statistical or
logical. If there is an ‘unpredictable multitude of factors which can
influence the outcome’, then it is surely our job, and duty, to see
whether in the analysis we can identify them and thus make them
predictable.”
Conclusion
• Hill had a much more sophisticated (and yet
commonsensical) view of evidence than that initially held
by EBM-ers who thought of themselves as his followers
• His insights are only gradually (and partially) being
rediscovered.
• EBM needs to keep on trying to get back to the Hill top.