GRADE
Toni Tan, Centre for Clinical Practice
GRADE
The Grading of Recommendations
Assessment, Development and
Evaluation
GRADE
“A systematic and explicit approach to making
judgements about the quality of evidence, and
the strength of recommendations can help to
prevent errors, facilitate critical appraisal of these
judgements, and can help to improve
communication of this information.”
Organisations that have adopted GRADE
methodology
Agency for Healthcare Research and Quality (USA)
Agenzia Sanitaria Regionale (Italy)
American College of Chest Physicians (USA)
American College of Physicians (USA)
American Thoracic Society (USA)
Ärztliches Zentrum für Qualität in der Medizin
(Germany)
British Medical Journal (United Kingdom)
BMJ Clinical Evidence (United Kingdom)
COMPUS at The Canadian Agency for Drugs
and Technologies in Health (Canada)
The Cochrane Collaboration (International)
EBM Guidelines (Finland/International)
The Endocrine Society (USA)
European Respiratory Society (Europe)
European Society of Thoracic Surgeons
(International)
Evidence-based Nursing Südtirol (Italy)
German Center for Evidence-based Nursing
“sapere aude” (Germany)
Infectious Diseases Society of America (USA)
Japanese Society for Temporomandibular
Joint (Japan)
Journal of Infection in Developing Countries
(International)
Kidney Disease: Improving Global Outcomes
(International)
Ministry for Health and Long-Term Care,
Ontario (Canada)
National Board of Health and Welfare (Sweden)
National Institute for Health and Care
Excellence (United Kingdom)
Norwegian Knowledge Centre for the Health
Services (Norway)
Polish Institute for EBM (Poland)
SIGN (UK, Scotland)
Society for Critical Care Medicine (USA)
Society for Vascular Surgery (USA)
Spanish Society for Family and Community
Medicine (Spain)
Surviving Sepsis Campaign (International)
University of Pennsylvania Health System
Center for Evidence-Based Practice (USA)
UpToDate (USA)
World Health Organization (International)
‘Traditional’ approach
Checklist system
• Selection bias: randomisation, concealment of allocation, comparable at
baseline
• Performance bias: blinding (patients & care providers), the comparison
groups received the same care apart from the intervention studied.
• Attrition bias: systematic differences between the comparison groups
with respect to participants lost
• Detection bias: appropriate length of follow-up, definition of outcome,
blinding (investigators)
++ All or most of the criteria have been fulfilled. Where they have not been fulfilled the
conclusions of the study or review are thought very unlikely to alter.
+ Some of the criteria have been fulfilled. Those criteria that have not been fulfilled or not
adequately described are thought unlikely to alter the conclusions.
− Few or no criteria fulfilled. The conclusions of the study are thought likely or very
likely to alter.
‘Traditional’ approach
Narrative summary
For example, AIP guideline
Mortality rates
One cluster RCT from the UK investigated the effectiveness of CCOT on
hospital mortality using PAR score……found a significant reduction in
hospital mortality in patients in the intervention wards at cluster level (OR =
0.523, 95% CI 0.322 to 0.849). The cluster RCT from Australia found no
difference in unexpected death (without do-not-resuscitate order)
(secondary outcome) between control group and intervention group (per
1000 admissions: control = 1.18, intervention = 1.06, difference = −0.093
[−0.423 to 0.237], 95% CI: −0.423 to 0.237; adjusted p = 0.752, adjusted
OR = 1.03, 95% CI 0.84 to 1.28).
Evidence statement:
(1+) There were conflicting findings in the two included studies on mortality
rates: the Priestley and coworkers study found a significant reduction in
mortality (but failed to report do-not-resuscitate orders), but MERIT
found no difference between the two arms of the study for this outcome.
GRADE
• Interventional studies of effectiveness
• Currently in development for diagnostic accuracy studies,
prognostic and qualitative studies
• Makes sequential appraisal about:
– The quality of evidence across studies for each critical/important
outcome (instead of individual study)
– Which outcomes are critical to a decision
– The overall quality of evidence across these critical outcomes
– The balance between benefits and harms
• Result is an assessment of:
– quality of the evidence for an outcome
– strength of the recommendations
• Perspective of guideline developers
GRADE profile

Outcome: Change in HbA1c (%; Better indicated by lower values)
Quality assessment
  No of studies: 3 (Allen 1990, Lu 2011, Fontbonne 1989*)
  Design: RCTs
  Risk of bias: S1
  Inconsistency: N
  Indirectness: N
  Imprecision: S2
  Other considerations: none
No of patients
  SMBG: 61
  SMUG: 63
Effect
  Relative (95% CI): -
  Absolute: MD 0.15 higher (0.37 lower to 0.67 higher). See figure 14
Quality: Low
Importance: CRITICAL

Outcome: Change in Fasting Blood Glucose (FBG mmol/L; Better indicated by lower values)
Quality assessment
  No of studies: 2 (Allen 1990, Lu 2011)
  Design: RCTs
  Risk of bias: N
  Inconsistency: N
  Indirectness: N
  Imprecision: S2
  Other considerations: none
No of patients
  SMBG: 61
  SMUG: 63
Effect
  Relative (95% CI): -
  Absolute: MD 0.35 lower (1.45 lower to 0.74 higher). See figure 15
Quality: Moderate
Importance: CRITICAL

Outcome: Change in weight (Kg; Better indicated by lower values)
Quality assessment
  No of studies: 1 (Allen 1990)
  Design: RCT
  Risk of bias: S1
  Inconsistency: NA
  Indirectness: N
  Imprecision: S2
  Other considerations: none
No of patients
  SMBG: 27
  SMUG: 27
Effect
  Relative (95% CI): -
  Absolute: MD 2 higher (0.3 to 3.7 higher)
Quality: Low
Importance: IMPORTANT

1 Downgrade by one level: studies conducted before 1995 when the management of diabetes and other related conditions may have differed compared
with current practice
2 Downgrade by one level: The 95% confidence interval passes through the minimal important difference (MID) which is 0.5% for change in HbA1c levels, 1
mmol/L for fasting blood glucose, 1 mmol/L for postprandial blood glucose, 5 kg for body weight, 3 BMI points and 3 cm for waist circumference. For all other
outcomes a change of 0.5 for continuous outcomes or a relative risk reduction or increase of 25% or more for binary outcomes was considered clinically
important.
Why do we use GRADE in
NICE clinical guidelines?
• Concerns about the sometimes inappropriate
direct link between study design and
recommendation strength
• Anecdotal evidence that recommendations not
based on evidence from trials were being
ignored
• WHO evaluation of the NICE clinical guidelines
programme
• Just being explicit about what we had been
doing anyway!
How GRADE works?
Evidence synthesis (systematic review)
• P, I, C, O
• Outcomes: each rated critical, important, or not
• Summary of findings & estimate of effect for each outcome
• Quality of evidence for each outcome: High, Moderate, Low, Very low
– RCT start high, obs. data start low
– Grade down:
  1. Risk of bias
  2. Inconsistency
  3. Indirectness
  4. Imprecision
  5. Other consideration
– Grade up:
  1. Large effect
  2. Dose response
  3. Confounders
Making recommendations (guidelines)
• Present evidence profile(s) to GDG
• Develop recommendations:
– For or against (direction)
– Strong or weak (strength)
• By considering:
– Relative value of different outcomes
– Quality of evidence
– Trade off - benefits/harms
– Health economics
– Other considerations
• “Offer xyz…”
• “Consider xyz…”
• “Do not use xyz…”
GRADE concept of quality of evidence
– The quality of evidence reflects the extent to
which our confidence (certainty) in an
estimate of the effect is adequate to support a
particular recommendation.
– Guideline panels must make judgements
about the quality of evidence relative to the
specific context for which they are using the
evidence.
How is this achieved?
• Transparent framework to consider confidence
(certainty) of an effect estimate through assessing
o Systematic errors (bias)
o Chance errors (random errors)
• Using criteria
– Systematic errors (bias)
o Limitations, Indirectness, Inconsistency
– Chance errors (random error)
o Imprecision
o Other considerations (any other factors)
GRADE Definitions
High
Further research is very unlikely to change our
confidence in the estimate of effect.
Moderate
Further research is likely to have an important impact
on our confidence in the estimate of effect and may
change the estimate.
Low
Further research is very likely to have an important
impact on our confidence in the estimate of effect and
is likely to change the estimate.
Very low
Any estimate of effect is very uncertain.
GRADE diagram
Grading quality of evidence
What is the methodology of the best available evidence?
• RCT: assume “high quality”
– Factors lowering the quality present?
  If YES – downgrades: become “moderate, low or very low quality”
  If NO – stay “high quality”
• Observational study: assume “low quality”
– Factors lowering the quality present?
  If YES – downgrade to “very low quality”
  If NO – factors increasing the quality?
    If YES – upgrades: become “moderate or high quality”
    If NO – stay “low quality”
• Uncontrolled studies: “very low quality”
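The flowchart above can be sketched in a few lines of Python. This is an illustration only: the function name, the design labels and the idea of passing the number of downgrades and upgrades as integers are assumptions made for the example, and in practice each downgrade or upgrade is a judgement made by the reviewers.

```python
# Quality levels ordered from lowest to highest.
LEVELS = ["very low", "low", "moderate", "high"]

# Starting level by study design, as in the flowchart (labels are hypothetical).
START = {"rct": 3, "observational": 1, "uncontrolled": 0}


def grade_quality(design, downgrades=0, upgrades=0):
    """Return the quality level for one outcome.

    design     -- "rct", "observational" or "uncontrolled"
    downgrades -- total levels removed for factors lowering the quality
    upgrades   -- total levels added for factors increasing the quality
    """
    score = START[design]
    if design == "observational" and downgrades == 0:
        # Upgrading is only considered when nothing lowers the quality.
        score += upgrades
    else:
        score -= downgrades
    # Quality cannot go below "very low" or above "high".
    score = max(0, min(len(LEVELS) - 1, score))
    return LEVELS[score]


print(grade_quality("rct", downgrades=2))
print(grade_quality("observational", upgrades=1))
```

An RCT outcome with two one-level downgrades ends at "low", which is how the HbA1c outcome in the earlier profile is rated.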
Determining the quality of evidence
• Limitations
• Inconsistent results
• Indirectness
• Imprecision
• Other considerations
– Large or very large effect
– Plausible biases underestimate true effect
– Dose-response gradient
– All of above can upgrade 1 level (2 for large magnitude of effect)
Limitations or ‘risk of bias’ - RCTs
limitation: explanation
• Lack of allocation concealment: Those enrolling patients are aware of the group to which the
next enrolled patient will be allocated (major problem in “pseudo” or “quasi” randomized trials
with allocation by day of week, birth date, chart number etc.)
• Lack of blinding: Patient, caregivers, those recording outcomes, those adjudicating outcomes,
or data analysts are aware of the arm to which patients are allocated
• Incomplete accounting of patients and outcome events: Loss to follow-up and failure to adhere
to the intention to treat principle when indicated
• Selective outcome reporting: Reporting of some outcomes and not others on the basis of
the results
• Other limitations: For example:
– stopping early for benefit observed in randomized trials, in particular in the absence of
adequate stopping rules
– use of unvalidated patient-reported outcomes
– carry-over effects in cross-over trials
– recruitment bias in cluster-randomized trials
Risk of bias – observational studies
limitation: explanation
• Failure to develop and apply appropriate eligibility criteria (inclusion of control population):
– under- or over-matching in case-control studies
– selection of exposed and unexposed in cohort studies from different populations
• Flawed measurement of both exposure and outcome:
– differences in measurement of exposure (e.g. recall bias in case-control studies)
– differential surveillance for outcome in exposed and unexposed in cohort studies
• Failure to adequately control confounding:
– failure of accurate measurement of all known prognostic factors
– failure to match for prognostic factors and/or adjustment in statistical analysis
• Incomplete or inadequately short follow-up
Inconsistency
• When heterogeneity exists, but no plausible explanation is
identified, the quality of evidence should be downgraded by
one or two levels, depending on the magnitude of the
inconsistency in the results.
• Inconsistency may arise from differences in:
– populations (e.g. drugs may have larger relative effects
in sicker populations)
– interventions (e.g. larger effects with higher drug doses)
– outcomes (e.g. diminishing treatment effect with time).
• Account for this where possible
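One common way to put a number on heterogeneity is the I² statistic, which also appears in the footnotes of the PDE-5 evidence profile. The sketch below assumes inverse-variance (fixed-effect) weights and uses made-up effect estimates and standard errors. It is not taken from the slides, and the resulting percentage still needs a judgement about whether the inconsistency can be explained.

```python
def i_squared(effects, std_errors):
    """Return I² (%) from study effect estimates and their standard errors."""
    # Inverse-variance weights: more precise studies count for more.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    # I² is the share of Q beyond what chance alone would produce.
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical log risk ratios and standard errors for four studies.
print(round(i_squared([-0.40, -0.10, -0.55, 0.05], [0.15, 0.12, 0.20, 0.18]), 1))
```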
Indirectness
Indirect: Comparison
Question of interest: Relative effectiveness of alendronate and risedronate in osteoporosis.
Source of indirectness: Randomized trials compared alendronate to placebo and risedronate to
placebo, but trials comparing alendronate to risedronate are unavailable.

Indirect: Population
Question of interest: Oseltamivir for prophylaxis of avian flu caused by influenza A (H5N1) virus.
Source of indirectness: Randomized trials of oseltamivir are available for patients with seasonal
influenza, but not for avian influenza.

Indirect: Intervention
Question of interest: Sigmoidoscopic screening for prevention of colon cancer mortality.
Source of indirectness: Randomized trials of fecal occult blood screening provide indirect
evidence bearing on the potential effectiveness of sigmoidoscopy.

Indirect: Comparator
Question of interest: Choice of medication for schizophrenia.
Source of indirectness: A series of trials comparing newer generation neuroleptic agents to fixed
doses of 20 mg of haloperidol provide indirect evidence of how the newer agents would compare
to the lower, flexible doses of haloperidol clinicians typically use.
Indirectness
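Condition: Diabetes
Patient-important outcome(s): Diabetic symptoms, admission, complications (cardiovascular, eye,
renal, neuropathic etc.)
Surrogate outcome(s): Glucose, HbA1C

Condition: Dementia
Patient-important outcome(s): Patient function, behaviour, caregiver burden
Surrogate outcome(s): Cognitive function

Condition: Osteoporosis
Patient-important outcome(s): Fractures
Surrogate outcome(s): Bone density

Condition: ARDS
Patient-important outcome(s): Mortality
Surrogate outcome(s): Oxygenation

Condition: End-stage renal disease
Patient-important outcome(s): Quality of life, mortality
Surrogate outcome(s): Hemoglobin

Condition: Venous thrombosis
Patient-important outcome(s): Symptomatic venous thrombosis
Surrogate outcome(s): Asymptomatic venous thrombosis

Condition: Chronic respiratory disease
Patient-important outcome(s): Quality of life, exacerbations, mortality
Surrogate outcome(s): Pulmonary function, exercise capacity

Condition: Cardiovascular disease/risk
Patient-important outcome(s): Vascular events, mortality
Surrogate outcome(s): Serum lipids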
Imprecision
• Our estimates of the population value are
uncertain/imprecise because we use samples
• GRADE extends the idea of uncertainty to ask whether
the effect estimate reaches the ‘clinical minimal
important difference’ (MID)
Example of MID:
Drug X compared to placebo to reduce severe migraine.
Pain on migraine: measured on a 10-point scale
Mean baseline = 9.5; Mean reduction from baseline = 1.7 (95%CI: 1.2 to 2.3)
But a survey of migraine patients found that a pain reduction of less than 3 points is
meaningless, because it does not improve their overall QoL and daily function.
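The migraine example can be checked mechanically by comparing the confidence interval with the MID. The sketch below assumes the effect is expressed as a beneficial change (here, points of pain reduction), so larger is better. The function name and the wording of the three labels are illustrative and are not GRADE terminology.

```python
def compare_with_mid(ci_low, ci_high, mid):
    """Compare a 95% CI for a beneficial change with the MID.

    ci_low, ci_high -- confidence limits of the improvement
    mid             -- smallest improvement that matters to patients
    """
    if ci_low >= mid:
        # The whole interval is at or beyond the MID.
        return "clinically important and precise"
    if ci_high < mid:
        # The whole interval falls short of the MID.
        return "below the MID and precise"
    # The interval contains the MID, so we cannot tell which side we are on.
    return "imprecise: CI crosses the MID"


# Drug X: mean reduction 1.7 (95% CI 1.2 to 2.3), MID of 3 points.
print(compare_with_mid(1.2, 2.3, 3.0))
```

For Drug X the whole interval lies below 3 points, so the estimate is precise but does not reach a difference that patients consider meaningful.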
Confidence intervals - summary
• Easiest way to approach effect of random
error on evidence quality
• In frequentist approach, 95% CI
represents
– A range constructed so that in repeated
experiments 95% would include the
population value
– Usually interpreted as p=0.95 that the
population value is in the CI
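The ‘repeated experiments’ reading can be illustrated with a small simulation. It draws many samples from a population with a known mean, builds a 95% CI from each sample, and counts how often the interval includes the population value. All parameter values below are arbitrary choices for the illustration, and the population standard deviation is assumed to be known so that a simple z-interval applies.

```python
import random
from statistics import NormalDist


def coverage(true_mean=10.0, sigma=2.0, n=30, repeats=2000, seed=1):
    """Return the share of simulated 95% CIs that contain true_mean."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(0.975)     # about 1.96
    half_width = z * sigma / n ** 0.5   # known-sigma z-interval
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / repeats


print(coverage())
```

The printed proportion should be close to 0.95. Any single interval either contains the population value or it does not.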
Confidence interval width
• Wide confidence intervals imply uncertainty over whether
our observed effect is close to or far away from the real
effect
• Examples
– An RCT of supervised exercise for patellofemoral pain
– Self reported recovery at 12 months
– T: 9/500 vs SC: 2/500 RR=4.50 (1.00 to 20.77)
– We’d probably agree that’s imprecise
– An RCT of drug A for patellofemoral pain
– Self reported recovery at 12 months
– T: 350/500 vs PC: 150/500 RR=2.33 (2.20 to 2.72)
– We’d probably agree that’s precise
What affects imprecision?
• Having larger samples, but particularly
where there is more ‘information’
– Complex relationship between sample size,
numbers of events
• Easiest to play with an example
Control       Treatment     RR, %   RRR, %   Calculated 95%CI
event rate    event rate
2/4           1/4           50      50       -174 to 92
10/20         5/20          50      50       -14 to 79.5
20/40         10/40         50      50       9.5 to 73.4
50/100        25/100        50      50       26.8 to 66.4
500/1000      250/1000      50      50       43.5 to 55.9
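The narrowing of the interval as the numbers grow can be reproduced with a short calculation. The sketch below uses the usual large-sample approximation on the log risk ratio scale. The slides do not say which method produced the tabulated intervals, so this is an assumption, and the results will differ somewhat from the table, most visibly in the smallest rows.

```python
import math


def rrr_with_ci(control_events, control_n, treated_events, treated_n, z=1.96):
    """Return (RRR, lower, upper) in percent, using the log risk ratio method."""
    rr = (treated_events / treated_n) / (control_events / control_n)
    # Standard error of log(RR) for two independent proportions.
    se_log_rr = math.sqrt(
        1 / treated_events - 1 / treated_n
        + 1 / control_events - 1 / control_n
    )
    rr_low = math.exp(math.log(rr) - z * se_log_rr)
    rr_high = math.exp(math.log(rr) + z * se_log_rr)
    # RRR = 1 - RR, so the upper RR limit gives the lower RRR limit.
    return tuple(round(100 * (1 - x), 1) for x in (rr, rr_high, rr_low))


rows = [(2, 4, 1, 4), (10, 20, 5, 20), (20, 40, 10, 40),
        (50, 100, 25, 100), (500, 1000, 250, 1000)]
for control_events, control_n, treated_events, treated_n in rows:
    print(control_n, rrr_with_ci(control_events, control_n,
                                 treated_events, treated_n))
```

The point estimate stays at 50% in every row. Only the width of the interval changes.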
Remember CIs can mislead
• True values will be outside a 95%CI 5/100 times
• CI based on small numbers of events are
unstable
• Early trials tend to be more positive
• Trials stopped early likely to be biased
• So, if you have small trials with a positive effect
and apparently narrow CI, be sceptical
• It would be helpful to have an objective idea of
when we have ‘enough’ information
Optimal information size (OIS)
• We want at least as many observations in a trial as we
would calculate in a sample size calculation
Figure 4: Optimal information size given alpha of 0.05 and beta of 0.2
for varying control event rates and RRR of 20%, 25% and 30%
[Plot: total sample size required (0 to 6000) against control group event rate (0.2 to 1.0), with one line each for RRR=20%, RRR=25% and RRR=30%]
For any chosen line, evidence meets
optimal information size criterion if
above the line
Warning – ‘Power-based’ sample size calculation is for ‘hypothesis
testing’ using p-value, not for estimation of true effect
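A conventional sample size calculation for comparing two proportions gives a rough idea of how such a curve is obtained. The sketch below uses the standard normal-approximation formula with a two-sided alpha. Whether Figure 4 was drawn with exactly this formula is an assumption, and the function name is made up for the example.

```python
import math
from statistics import NormalDist


def optimal_information_size(control_rate, rrr, alpha=0.05, beta=0.2):
    """Return the total sample size (both arms) for a two-proportion comparison.

    control_rate -- event rate expected in the control group
    rrr          -- relative risk reduction to detect, e.g. 0.25 for 25%
    """
    p1 = control_rate
    p2 = control_rate * (1 - rrr)                  # treatment group event rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(1 - beta)        # power = 1 - beta
    per_group = ((z_alpha + z_beta) ** 2
                 * (p1 * (1 - p1) + p2 * (1 - p2))
                 / (p1 - p2) ** 2)
    return 2 * math.ceil(per_group)


for rrr in (0.20, 0.25, 0.30):
    print(rrr, optimal_information_size(0.2, rrr))
```

A body of evidence would meet the criterion when the total number of patients across the pooled studies is at least the value returned for the chosen control event rate and RRR.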
OIS continued
• Thinking of numbers of events may be easier, and we could just
use an arbitrary number if we don’t have the resources to calculate the OIS
Figure 5: Optimal information size presented as number of events
given alpha of 0.05 and beta of 0.2
for varying control event rates and RRR of 20%, 25% and 30%
[Plot: total number of events required (0 to 600) against control group event rate (0.0 to 1.0), with one line each for RRR=20%, RRR=25% and RRR=30%, annotated ‘300 events’]
For any chosen line, evidence meets
optimal information size criterion if
above the line
Summary of suggested approach to imprecision
[Figure: three confidence intervals, numbered 1, 2 and 3, plotted on a scale from −4 (mean pain reduction) to +4 (mean pain increase)]
Red: mean -1 as MID
1 = ‘no effect’ and precise; 2 = ‘no effect’ but not precise; 3 = ‘effective’ and precise
Green: mean -2 as MID
1 = ‘no effect’ and precise; 2 = ‘no effect’ and precise; 3 = ‘effective’ and precise
Blue: mean -3 as MID
1 = ‘no effect’ and precise; 2 = ‘no effect’ and precise; 3 = ‘effective’ and not precise
What if we don’t know a threshold?
• Can use an arbitrary threshold
– For example, GRADE suggests RRR or RRI
of 25%
– Often used in NICE guidelines
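With a default threshold of 25%, the check for a risk ratio is whether its confidence interval crosses 0.75 (an RRR of 25%) or 1.25 (an RRI of 25%). The sketch below is one way of writing that down. Whether crossing one threshold or both leads to a downgrade remains a judgement for the reviewers, and the function name is illustrative.

```python
def crosses_default_threshold(rr_low, rr_high, threshold=0.25):
    """Return True if a risk ratio CI crosses the RRR or RRI threshold."""
    benefit_line = 1 - threshold   # RR of 0.75, i.e. RRR of 25%
    harm_line = 1 + threshold      # RR of 1.25, i.e. RRI of 25%
    crosses_benefit = rr_low < benefit_line < rr_high
    crosses_harm = rr_low < harm_line < rr_high
    return crosses_benefit or crosses_harm


# The two patellofemoral pain examples from the confidence interval width slide.
print(crosses_default_threshold(1.00, 20.77))
print(crosses_default_threshold(2.20, 2.72))
```

The first interval crosses 1.25 and the second does not, which matches the earlier reading of those examples as imprecise and precise.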
Two things to remember about GRADE
• Many judgements are made in appraising
evidence, and there will always be
disagreement. The important thing is to make
the areas of disagreement transparent.
• The concepts we are judging, e.g. imprecision,
are continuous, and dichotomising them (downgrade
or not) can be a close call. Where it is, the
evidence to recommendations section should
discuss it.
PDE-5 inhibitor vs. placebo

Outcome: Erectile Function-International Index of Erectile Function [IIEF] mean score on EF domain (better efficacy is indicated by higher values)
Quality assessment
  No of studies: 9 (Goldstein 2003, Ishii 2006, Ziegler 2006, Boulton 2001, Rendell 1999, Safarinejad 2004, Stuckey 2003, Hatzichristou 2008, Saenz 2002)
  Design: RCTs
  Risk of bias: S5
  Inconsistency: N*
  Indirectness: S1,2
  Imprecision: S3
  Other considerations: none
Number of patients
  Intervention: 1855
  Placebo: 1006
Effect/ outcome: Pooled MD 5.82 higher at endpoint (95% CI 4.75 to 6.89).
Quality: Very low
Importance: Critical

Outcome: Adverse events (headache)
Quality assessment
  No of studies: 8 (Boulton 2001, Goldstein 2003, Ishii 2006, Rendell 1999, Saenz 2002, Safarinejad 2004, Stuckey 2003, Ziegler 2006)
  Design: RCTs
  Risk of bias: S5
  Inconsistency: S4
  Indirectness: S2
  Imprecision: N
  Other considerations: none
Number of patients
  Intervention: 160/1763 (9.1%)
  Placebo: 41/948 (4.3%)
Effect/ outcome: Pooled RR 2.70 (1.16 to 6.28)
Quality: Very low
Importance: Important

Outcome: (Discontinuation for AE)
Quality assessment
  No of studies: 8 (Goldstein 2003, Hatzichristou 2008, Ishii 2006, Rendell 1999, Saenz 2002, Safarinejad 2004, Stuckey 2003, Ziegler 2006)
  Design: RCTs
  Risk of bias: S5
  Inconsistency: N
  Indirectness: S1,2
  Imprecision: N
  Other considerations: none
Number of patients
  Intervention: 42/1753 (2.4%)
  Placebo: 14/1037 (1.4%)
Effect/ outcome: Pooled RR 1.59 (0.84 to 3.02)
Quality: Very low
Importance: Important

Abbreviations: 95%CI, 95% confidence interval; IIEF, International Index of Erectile Function questionnaire; EF, Erectile function domain of IIEF; SEP, Sexual
Encounter Profile (diary questions regarding sexual encounter); GEQ, Global Efficacy Question; QoL, Quality of Life; RR, risk ratio
1 Downgrade by 1 level: 1 study (Hatzichristou 2008) used low doses (2.5mg and 5mg) of tadalafil, which are licensed for use but are recommended in people who
anticipate frequent use of the drug. 10mg is generally recommended (but not for continuous daily use). The other study examining tadalafil (Saenz 2002) used
10mg and 20mg, therefore these arms combined represent a wide range of different doses.
2 Downgrade by 1 level: 2 studies (Stuckey 2003, Ziegler 2006) were conducted solely in men with type 1 diabetes and the mean age in these studies was
generally lower in comparison to the other included studies. One study (Ishii 2006) did not report the proportion of men with type 2 diabetes.
3 Downgrade by 1 level: SDs were not reported in the paper and were calculated using p-values.
4 Downgrade by 1 level: pairwise comparisons of the included studies (direct comparisons) showed an I² of 75% (p=0.0002) for headaches, 68% (p=0.009) for upper
respiratory tract infection and 58% (p<0.00001) for any adverse event. These values indicate substantial heterogeneity which cannot be fully accounted for.
5 Downgrade by 1 level: 2 studies (Saenz 2002, Ishii 2006) do not report allocation concealment to determine if performance bias was present.
* Pairwise comparisons of the included studies (direct comparisons) showed an I² of 46%. Although this may indicate moderate heterogeneity, this inconsistency was
not considered to be important as overall the effect estimates and the confidence intervals were favouring the PDE-5 group.
Evidence to recommendations
• Structured discussion of
– Relative value placed on outcomes
– Trade off between clinical benefits and harms
– Trade off between net health benefits and resource use
– Quality of the evidence
– Other considerations
• Place within pathway of care
• Equalities issues
• Practicalities of implementation e.g. need for training
Strength of recommendation
• Stronger: ‘the GDG is confident that the desirable effects of
adherence to a recommendation outweigh the undesirable effects’
‘Should do ...’
• Weaker: ‘the GDG concludes that the desirable effects of adherence
to a recommendation probably outweigh the undesirable effects, but
is not confident’
‘Should consider ...’
Further information
• http://www.gradeworkinggroup.org/
• Ongoing series of papers in Journal of
Clinical Epidemiology addressing all of
these issues
• [email protected]