Transcript Document
GRADE
Toni Tan, Centre for Clinical Practice

GRADE: The Grading of Recommendations Assessment, Development and Evaluation

"A systematic and explicit approach to making judgements about the quality of evidence, and the strength of recommendations can help to prevent errors, facilitate critical appraisal of these judgements, and can help to improve communication of this information."

Organisations that have adopted GRADE methodology
• Agency for Healthcare Research and Quality (USA)
• Agenzia Sanitaria Regionale (Italy)
• American College of Chest Physicians (USA)
• American College of Physicians (USA)
• American Thoracic Society (USA)
• Ärztliches Zentrum für Qualität in der Medizin (Germany)
• British Medical Journal (United Kingdom)
• BMJ Clinical Evidence (United Kingdom)
• COMPUS at The Canadian Agency for Drugs and Technologies in Health (Canada)
• The Cochrane Collaboration (International)
• EBM Guidelines (Finland/International)
• The Endocrine Society (USA)
• European Respiratory Society (Europe)
• European Society of Thoracic Surgeons (International)
• Evidence-based Nursing Südtirol (Italy)
• German Center for Evidence-based Nursing "sapere aude" (Germany)
• Infectious Diseases Society of America (USA)
• Japanese Society for Temporomandibular Joint (Japan)
• Journal of Infection in Developing Countries (International)
• Kidney Disease: Improving Global Outcome (International)
• Ministry for Health and Long-Term Care, Ontario (Canada)
• National Board of Health and Welfare (Sweden)
• National Institute for Health and Care Excellence (United Kingdom)
• Norwegian Knowledge Centre for the Health Services (Norway)
• Polish Institute for EBM (Poland)
• SIGN (UK, Scotland)
• Society for Critical Care Medicine (USA)
• Society for Vascular Surgery (USA)
• Spanish Society for Family and Community Medicine (Spain)
• Surviving Sepsis Campaign (International)
• University of Pennsylvania Health System Center for Evidence-Based Practice (USA)
• UpToDate (USA)
• World Health Organization (International)

'Traditional' approach
Checklist system
• Selection bias: randomisation, concealment of allocation, comparable at baseline
• Performance bias: blinding (patients & care providers); the comparison groups received the same care apart from the intervention studied
• Attrition bias: systematic differences between the comparison groups with respect to participants lost
• Detection bias: appropriate length of follow-up, definition of outcome, blinding (investigators)

++ All or most of the criteria have been fulfilled. Where they have not been fulfilled, the conclusions of the study or review are thought very unlikely to alter.
+ Some of the criteria have been fulfilled. Those criteria that have not been fulfilled or not adequately described are thought unlikely to alter the conclusions.
− Few or no criteria fulfilled. The conclusions of the study are thought likely or very likely to alter.

'Traditional' approach: narrative summary
For example, the AIP guideline, mortality rates:
One cluster RCT from the UK investigated the effectiveness of CCOT on hospital mortality using the PAR score … found a significant reduction in hospital mortality in patients in the intervention wards at cluster level (OR = 0.523, 95% CI 0.322 to 0.849). The cluster RCT from Australia found no difference in unexpected death (without do-not-resuscitate order) (a secondary outcome) between the control group and the intervention group (per 1000 admissions: control = 1.18, intervention = 1.06, difference = −0.093, 95% CI −0.423 to 0.237; adjusted p = 0.752, adjusted OR = 1.03, 95% CI 0.84 to 1.28).
Evidence statement (1+): There were conflicting findings in the two included studies on mortality rates: the Priestley and coworkers study found a significant reduction in mortality (but failed to report do-not-resuscitate orders), but MERIT found no difference between the two arms of the study for this outcome.
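The "conflicting findings" in the evidence statement above come straight from reading the confidence intervals: an odds ratio whose 95% CI excludes 1 indicates a significant effect, while one whose CI straddles 1 does not. A minimal sketch of that reading (illustrative code only, not part of any guideline tooling):

```python
# Classify an odds ratio by whether its 95% CI excludes 1 (no effect).
# Illustrative helper, not from the source presentation.

def interpret_or(or_value, ci_low, ci_high):
    """Return a plain-language reading of an odds ratio and its 95% CI."""
    if ci_high < 1:
        return "significant reduction"
    if ci_low > 1:
        return "significant increase"
    return "no significant difference"

# Priestley and coworkers (UK cluster RCT): OR 0.523 (95% CI 0.322 to 0.849)
print(interpret_or(0.523, 0.322, 0.849))  # significant reduction
# MERIT (Australia): adjusted OR 1.03 (95% CI 0.84 to 1.28)
print(interpret_or(1.03, 0.84, 1.28))     # no significant difference
```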
GRADE
• Interventional studies of effectiveness
• Currently in development for diagnostic accuracy, prognostic and qualitative studies
• Makes sequential judgements about:
– the quality of evidence across studies for each critical/important outcome (instead of individual studies)
– which outcomes are critical to a decision
– the overall quality of evidence across these critical outcomes
– the balance between benefits and harms
• The result is an assessment of:
– the quality of the evidence for an outcome
– the strength of the recommendations
• Perspective of guideline developers

GRADE profile: SMBG (self-monitoring of blood glucose) vs SMUG (self-monitoring of urine glucose)

Outcome (better indicated by lower values) | Studies | Design | Risk of bias | Inconsistency | Indirectness | Imprecision | Other | SMBG | SMUG | Relative (95% CI) | Absolute effect | Quality | Importance
Change in HbA1c (%) | 3 (Allen 1990, Lu 2011, Fontbonne 1989*) | RCTs | S1 | N | N | S2 | none | 61 | 63 | – | MD 0.15 higher (0.37 lower to 0.67 higher); see figure 14 | Low | CRITICAL
Change in fasting blood glucose (FBG, mmol/L) | 2 (Allen 1990, Lu 2011) | RCTs | N | N | N | S2 | none | 61 | 63 | – | MD 0.35 lower (1.45 lower to 0.74 higher); see figure 15 | Moderate | CRITICAL
Change in weight (kg) | 1 (Allen 1990) | RCT | S1 | NA | N | S2 | none | 27 | 27 | – | MD 2 higher (0.3 to 3.7 higher) | Low | IMPORTANT

S = serious concern (the superscript number refers to the footnotes below); N = no serious concern; NA = not applicable; MD = mean difference.
1 Downgrade by one level: studies conducted before 1995, when the management of diabetes and other related conditions may have differed compared with current practice.
2 Downgrade by one level: the 95% confidence interval passes through the minimal important difference (MID), which is 0.5% for change in HbA1c levels, 1 mmol/L for fasting blood glucose, 1 mmol/L for postprandial blood glucose, 5 kg for body weight, 3 BMI points and 3 cm for waist circumference. For all other outcomes, a change of 0.5 for continuous outcomes or a relative risk reduction or increase of 25% or more for binary outcomes was considered clinically important.
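The quality ratings and footnote 2 in the profile above follow a simple mechanical core: evidence from RCTs starts "High", each serious concern drops the rating one level, and a 95% CI that crosses the minimal important difference counts as serious imprecision. A minimal sketch of that ladder (an assumption-laden simplification; the function names are illustrative, not NICE tooling):

```python
# Sketch of the GRADE quality ladder as used in the profile above.
# Assumption: each serious concern (S) downgrades one level; upgrading
# factors (large effect, dose response, confounders) move the rating up.

LEVELS = ["Very low", "Low", "Moderate", "High"]

def grade_quality(design, downgrades, upgrades=0):
    """RCTs start 'High', observational data start 'Low'; clamp to the scale."""
    start = 3 if design == "RCT" else 1
    return LEVELS[max(0, min(3, start - downgrades + upgrades))]

def crosses_mid(ci_low, ci_high, mid):
    """Footnote 2's rule: serious imprecision when the 95% CI passes
    through the MID on either side of no effect (0)."""
    return ci_low < mid < ci_high or ci_low < -mid < ci_high

# HbA1c row: MD 0.15 (95% CI -0.37 to 0.67) crosses the 0.5% MID, so the
# row carries two downgrades (risk of bias + imprecision) -> Low
print(crosses_mid(-0.37, 0.67, 0.5))       # True
print(grade_quality("RCT", downgrades=2))  # Low

# FBG row: a single downgrade (imprecision) -> Moderate
print(grade_quality("RCT", downgrades=1))  # Moderate
```

The same two-function core reproduces the other profile rows: the weight row (two downgrades) again lands on Low.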
Why do we use GRADE in NICE clinical guidelines?
• Concerns about the sometimes inappropriate direct link between study design and recommendation strength
• Anecdotal evidence that recommendations not based on evidence from trials were being ignored
• The WHO evaluation of the NICE clinical guidelines programme
• Just being explicit about what we had been doing anyway!

How does GRADE work?
1. Frame the question (P I C O) and list the outcomes (1, 2, 3 …), classifying each as critical, important or not important.
2. Synthesise the evidence (systematic review) into a summary of findings and an estimate of effect for each outcome.
3. Grade the quality of evidence for each outcome (High / Moderate / Low / Very low): RCTs start high, observational data start low; grade down for risk of bias, inconsistency, indirectness, imprecision and other considerations; grade up for (1) a large effect, (2) a dose response, (3) confounders.
4. Make recommendations (guidelines): present the evidence profile(s) to the GDG (Guideline Development Group) and develop recommendations for or against (direction) and strong or weak (strength), by considering the relative value of different outcomes, the quality of evidence, the trade-off between benefits and harms, health economics and other considerations. Recommendations are worded "Offer xyz…", "Consider xyz…" or "Do not use xyz…".

GRADE concept of quality of evidence
– The quality of evidence reflects the extent to which our confidence (certainty) in an estimate of the effect is adequate to support a particular recommendation.
– Guideline panels must make judgements about the quality of evidence relative to the specific context for which they are using the evidence.

How is this achieved?
• A transparent framework for considering confidence (certainty) in an effect estimate, by assessing:
– systematic errors (bias)
– chance errors (random error)
• Using criteria:
– systematic errors (bias): limitations, indirectness, inconsistency
– chance errors (random error): imprecision
– other considerations (any other factors)

GRADE definitions
High: Further research is very unlikely to change our confidence in the estimate of effect.
Moderate: Further research is likely to have an important impact on our confidence in the estimate of effect and may change the estimate.
Low: Further research is very likely to have an important impact on our confidence in the estimate of effect and is likely to change the estimate.
Very low: Any estimate of effect is very uncertain.

GRADE diagram: grading quality of evidence
What is the methodology of the best available evidence?
• RCT: assume "high quality". Are factors lowering the quality present? If yes, downgrade to "moderate", "low" or "very low quality"; if no, stay "high quality".
• Observational study: assume "low quality". Are factors lowering the quality present? If yes, downgrade to "very low quality". If no, are factors increasing the quality present? If yes, upgrade to "moderate" or "high quality"; if no, stay "low quality".
• Uncontrolled studies: "very low quality".

Determining the quality of evidence
• Limitations
• Inconsistent results
• Indirectness
• Imprecision
• Other considerations:
– large or very large effect
– plausible biases that would underestimate the true effect
– dose-response gradient
– each of the above can upgrade the evidence by one level (two for a large magnitude of effect)

Limitations or 'risk of bias' – RCTs

Limitation | Explanation
Lack of allocation concealment | Those enrolling patients are aware of the group to which the next enrolled patient will be allocated (a major problem in "pseudo" or "quasi" randomised trials with allocation by day of week, birth date, chart number, etc.)
Lack of blinding | Patients, caregivers, those recording outcomes, those adjudicating outcomes, or data analysts are aware of the arm to which patients are allocated
Incomplete accounting of patients and outcome events | Loss to follow-up and failure to adhere to the intention-to-treat principle when indicated
Selective outcome reporting | Reporting of some outcomes and not others on the basis of the results
Other limitations | For example: stopping early for benefit observed in randomised trials, particularly in the absence of adequate stopping rules; use of unvalidated patient-reported outcomes; carry-over effects in cross-over trials; recruitment bias in cluster-randomised trials

Risk of bias – observational studies

Limitation | Explanation
Failure to develop and apply appropriate eligibility criteria (inclusion of control population) | Under- or over-matching in case-control studies; selection of exposed and unexposed in cohort studies from different populations
Flawed measurement of both exposure and outcome | Differences in measurement of exposure (e.g. recall bias in case-control studies); differential surveillance for outcome in exposed and unexposed in cohort studies
Failure to adequately control confounding | Failure of accurate measurement of all known prognostic factors; failure to match for prognostic factors and/or adjust in statistical analysis
Incomplete or inadequately short follow-up | -

Inconsistency
• When heterogeneity exists but no plausible explanation is identified, the quality of evidence should be downgraded by one or two levels, depending on the magnitude of the inconsistency in the results.
• Inconsistency may arise from differences in:
– populations (e.g. drugs may have larger relative effects in sicker populations)
– interventions (e.g. larger effects with higher drug doses)
– outcomes (e.g. diminishing treatment effect with time)
• Account for this where possible

Indirectness

Source of indirectness | Question of interest | Indirect evidence
Comparison | Relative effectiveness of alendronate and risedronate in osteoporosis | Randomised trials compared alendronate to placebo and risedronate to placebo, but trials comparing alendronate to risedronate are unavailable
Population | Oseltamivir for prophylaxis of avian flu caused by influenza A (H5N1) virus | Randomised trials of oseltamivir are available for patients with seasonal influenza, but not for avian influenza
Intervention | Sigmoidoscopic screening for prevention of colon cancer mortality | Randomised trials of faecal occult blood screening provide indirect evidence bearing on the potential effectiveness of sigmoidoscopy
Comparator | Choice of medication for schizophrenia | A series of trials comparing newer-generation neuroleptic agents to fixed doses of 20 mg of haloperidol provides indirect evidence of how the newer agents would compare to the lower, flexible doses of haloperidol that clinicians typically use

Indirectness: surrogate outcomes

Condition | Patient-important outcome(s) | Surrogate outcome(s)
Diabetes | Diabetic symptoms, admission, complications (cardiovascular, eye, renal, neuropathic, etc.) | Glucose, HbA1c
Dementia | Patient function, behaviour, caregiver burden | Cognitive function
Osteoporosis | Fractures | Bone density
ARDS | Mortality | Oxygenation
End-stage renal disease | Quality of life, mortality | Haemoglobin
Venous thrombosis | Symptomatic venous thrombosis | Asymptomatic venous thrombosis
Chronic respiratory disease | Quality of life, exacerbations, mortality | Pulmonary function, exercise capacity
Cardiovascular disease/risk | Vascular events, mortality | Serum lipids

Imprecision
• Our estimates of the population value are uncertain/imprecise because we use samples
• GRADE extends the term 'uncertainty' to the question of whether the effect estimate reaches the clinical minimal important difference (MID)

Example of an MID: drug X compared to placebo to reduce severe migraine.
Pain on migraine is measured on a 10-point scale. Mean baseline = 9.5; mean reduction from baseline = 1.7 (95% CI: 1.2 to 2.3). But a survey of migraine patients said that a pain reduction of less than 3 points is meaningless, because it does not improve their overall QoL and daily function.

Confidence intervals – summary
• The easiest way to approach the effect of random error on evidence quality
• In the frequentist approach, a 95% CI represents:
– a range constructed so that in repeated experiments 95% would include the population value
– usually (if loosely) interpreted as a probability of 0.95 that the population value is in the CI

Confidence interval width
• Wide confidence intervals imply uncertainty over whether our observed effect is close to or far away from the real effect
• Examples:
– An RCT of supervised exercise for patellofemoral pain; self-reported recovery at 12 months; T: 9/500 vs SC: 2/500; RR = 4.50 (1.00 to 20.77). We'd probably agree that's imprecise.
– An RCT of drug A for patellofemoral pain; self-reported recovery at 12 months; T: 350/500 vs PC: 150/500; RR = 2.33 (2.20 to 2.72). We'd probably agree that's precise.

What affects imprecision?
• Having larger samples helps, but particularly where there is more 'information': there is a complex relationship between sample size and the number of events
• It is easiest to play with an example:

Control event rate | Treatment event rate | RR (%) | RRR (%) | Calculated 95% CI for the RRR (%)
2/4 | 1/4 | 50 | 50 | −174 to 92
10/20 | 5/20 | 50 | 50 | −14 to 79.5
20/40 | 10/40 | 50 | 50 | 9.5 to 73.4
50/100 | 25/100 | 50 | 50 | 26.8 to 66.4
500/1000 | 250/1000 | 50 | 50 | 43.5 to 55.9

Remember: CIs can mislead
• True values will be outside a 95% CI 5 times out of 100
• CIs based on small numbers of events are unstable
• Early trials tend to be more positive
• Trials stopped early are likely to be biased
• So, if you have small trials with a positive effect and an apparently narrow CI, be sceptical
• It would be helpful to have an objective idea of when we have 'enough' information

Optimal information size (OIS)
• We want at least as many observations in a trial as we would calculate in a conventional sample size calculation
[Figure 4: Optimal information size (total sample size required, 0 to 6000) given an alpha of 0.05 and a beta of 0.2, for varying control group event rates (0.2 to 1.0) and RRRs of 20%, 25% and 30%. For any chosen line, evidence meets the optimal information size criterion if it lies above the line.]
Warning: a 'power-based' sample size calculation is for hypothesis testing using a p-value, not for estimation of the true effect.

OIS continued
• Thinking in terms of the number of events may be easier, and one could simply use an arbitrary number if there are no resources to calculate the OIS
[Figure 5: Optimal information size presented as the number of events required (0 to 600) given an alpha of 0.05 and a beta of 0.2, for varying control group event rates (0.0 to 1.0) and RRRs of 20%, 25% and 30%. For any chosen line, evidence meets the optimal information size criterion if it lies above the line.]

Summary of suggested approach to imprecision
[Figure: three effect estimates (1, 2 and 3) plotted on a mean pain change scale running from −4 (mean pain reduction) to +4 (mean pain increase), judged against three candidate MIDs.]
Red (MID = mean reduction of 1): 1 = 'no effect' and precise; 2 = 'no effect' but not precise; 3 = 'effective' and precise
Green (MID = mean reduction of 2): 1 = 'no effect' and precise; 2 = 'no effect' and precise; 3 = 'effective' and precise
Blue (MID = mean reduction of 3): 1 = 'no effect' and precise; 2 = 'no effect' and precise; 3 = 'effective' and not precise

What if we don't know a threshold?
• We can use an arbitrary threshold
– For example, GRADE suggests an RRR or RRI of 25%
– This is often used in NICE guidelines

Two things to remember about GRADE
• Many judgements are made in appraising evidence, and there will always be disagreement. The important thing is to make the areas of disagreement transparent.
• The concepts we are judging, e.g. imprecision, are continuous, and dichotomising them (downgrade or not) can be a close call. Where it is, the evidence-to-recommendations section should discuss it.

GRADE profile: PDE-5 inhibitor vs. placebo

Outcome | Studies | Design | Risk of bias | Inconsistency | Indirectness | Imprecision | Other | Intervention | Placebo | Effect/outcome | Quality | Importance
Erectile function: IIEF mean score on the EF domain (better efficacy is indicated by higher values) | 9 (Goldstein 2003, Ishii 2006, Ziegler 2006, Boulton 2001, Rendell 1999, Safarinejad 2004, Stuckey 2003, Hatzichristou 2008, Saenz 2002) | RCTs | S5 | N* | S1,2 | S3 | none | 1855 | 1006 | Pooled MD 5.82 higher at endpoint (95% CI 4.75 to 6.89) | Very low | Critical
Adverse events (headache) | 8 (Boulton 2001, Goldstein 2003, Ishii 2006, Rendell 1999, Saenz 2002, Safarinejad 2004, Stuckey 2003, Ziegler 2006) | RCTs | S5 | S4 | S1,2 | N | none | 160/1763 (9.1%) | 41/948 (4.3%) | Pooled RR 2.70 (1.16 to 6.28) | Very low | Important
Discontinuation for adverse events | 8 (Goldstein 2003, Hatzichristou 2008, Ishii 2006, Rendell 1999, Saenz 2002, Safarinejad 2004, Stuckey 2003, Ziegler 2006) | RCTs | S5 | N | S1,2 | S | none | 42/1753 (2.4%) | 14/1037 (1.4%) | Pooled RR 1.59 (0.84 to 3.02) | Very low | Important

S = serious concern (the superscript number refers to the footnotes below); N = no serious concern.
Abbreviations: 95% CI, 95% confidence interval; IIEF, International Index of Erectile Function questionnaire; EF, erectile function domain of the IIEF; SEP, Sexual Encounter Profile (diary questions regarding sexual encounters); GEQ, Global Efficacy Question; QoL, quality of life; RR, risk ratio.
1 Downgrade by 1 level: one study (Hatzichristou 2008) used low doses (2.5 mg and 5 mg) of tadalafil, which are licensed for use but are recommended for people who anticipate frequent use of the drug; 10 mg is generally recommended (but not for continuous daily use). The other study examining tadalafil (Saenz 2002) used 10 mg and 20 mg; therefore these arms combined represent a wide range of different doses.
2 Downgrade by 1 level: two studies (Stuckey 2003, Ziegler 2006) were conducted solely in men with type 1 diabetes, and the mean ages in these studies were generally lower in comparison to the other included studies. One study (Ishii 2006) did not report the proportion of men with type 2 diabetes.
3 Downgrade by 1 level: SDs were not reported in the paper and were calculated using p-values.
4 Downgrade by 1 level: pairwise comparisons of the included studies (direct comparisons) showed an I² of 75% (p = 0.0002) for headaches, 68% (p = 0.009) for upper respiratory tract infection and 58% (p < 0.00001) for any adverse event. These values indicate substantial heterogeneity which cannot be fully accounted for.
5 Downgrade by 1 level: two studies (Saenz 2002, Ishii 2006) did not report allocation concealment, so it cannot be determined whether selection bias was present.
* Pairwise comparisons of the included studies (direct comparisons) showed an I² of 46%. Although this may indicate moderate heterogeneity, the inconsistency was not considered important because, overall, the effect estimates and the confidence intervals favoured the PDE-5 group.

(The 'How GRADE works' overview slide is repeated at this point in the presentation.)

Evidence to recommendations
• A structured discussion of:
– the relative value placed on outcomes
– the trade-off between clinical benefits and harms
– the trade-off between net health benefits and resource use
– the quality of the evidence
– other considerations:
• place within the pathway of care
• equalities issues
• practicalities of implementation, e.g.
need for training

Strength of recommendation
• Stronger: 'the GDG is confident that the desirable effects of adherence to a recommendation outweigh the undesirable effects' ('Should do …')
• Weaker: 'the GDG concludes that the desirable effects of adherence to a recommendation probably outweigh the undesirable effects, but is not confident' ('Should consider …')

Further information
• http://www.gradeworkinggroup.org/
• An ongoing series of papers in the Journal of Clinical Epidemiology addresses all of these issues
• [email protected]
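As a closing illustration of the imprecision material above (confidence interval width and the optimal information size), here is a minimal sketch of two standard calculations: the log risk-ratio 95% confidence interval, which narrows as event counts grow, and a conventional two-proportion sample size calculation with an alpha of 0.05 and a beta of 0.2. These are the usual normal approximations, not necessarily the exact methods behind the slides' numbers:

```python
# Sketch (standard formulas, assumed rather than taken from the slides) of
# (1) a log-RR confidence interval and (2) a power-based sample size / OIS.
import math

def rr_ci(events_t, n_t, events_c, n_c):
    """Risk ratio with a 95% CI from the usual log-RR standard error."""
    rr = (events_t / n_t) / (events_c / n_c)
    se = math.sqrt(1/events_t - 1/n_t + 1/events_c - 1/n_c)
    return rr, rr * math.exp(-1.96 * se), rr * math.exp(1.96 * se)

def ois_total(control_rate, rrr, z_alpha=1.96, z_beta=0.8416):
    """Total sample size (both arms) to detect the given RRR with
    two-sided alpha 0.05 and 80% power (normal approximation)."""
    p1 = control_rate              # control event rate
    p2 = control_rate * (1 - rrr)  # treatment event rate under the RRR
    pbar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * pbar * (1 - pbar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return 2 * math.ceil(num / (p1 - p2) ** 2)

# The largest trial in the event-rate table (250/1000 vs 500/1000)
print(rr_ci(250, 1000, 500, 1000))

# Rough OIS for a control event rate of 0.2 and an RRR of 25%
print(ois_total(0.2, 0.25))
```

For the 500/1000 vs 250/1000 row of the event-rate table, this gives an RR of 0.50 with a 95% CI of roughly 0.44 to 0.57, i.e. an RRR interval close to the 43.5 to 55.9 shown in the table; the small-sample rows diverge more because the normal approximation breaks down there.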