Transcript Slide 1
Use and abuse of P values
Clinical Research Methodology Course: Randomized Clinical Trials and the "REAL WORLD"
Emmanuel Lesaffre, Biostatistical Centre, K.U.Leuven, Leuven, Belgium; Dept of Biostatistics, Erasmus MC, Rotterdam, the Netherlands
NY, 14 December 2007

Slide 4 – Contents
1. P-value: What is it?
2. Type I error
3. Multiple testing
4. Type II error
5. Sample size calculation
6. Negative studies
7. Testing at baseline
8. Statistical significance ≠ clinical relevance
9. Confidence interval vs P-value
10. P-value of a clinical trial vs of an epidemiological study
11. Biased set-up & reporting
12. Take home messages

Slide 5 – 1. P-value: What is it?

Slide 6 – 1. P-value: What is it?
Etoricoxib vs Placebo
– WOMAC Pain Subscale: difference in means = -15.07
– What does this result mean?
– What do you expect if etoricoxib = placebo? A difference ≈ 0
– But even if etoricoxib = placebo, the result will vary around 0
– What is a large/small difference?
– What is the play of chance?
The same questions arise for the other scores & comparisons

Slide 7 – 1. P-value: What is it?
Etoricoxib vs Placebo
– Suppose H0: E = P
– P = 0.05 → the result belongs to the 5% most extreme results that could happen under H0 (if H0 is true)
– P = 0.01 → the result belongs to the 5% most extreme results that could happen under H0 (if H0 is true), and only 1% of results are MORE EXTREME
– P < 0.0001 → the result belongs to the 5% most extreme results that could happen under H0 (if H0 is true), and IS VERY EXTREME

Slide 8 – 1. P-value: What is it?
GENERAL RULE
– When P < 0.05 (= significance level α):
  The result is considered TOO EXTREME to believe that H0 is true → H0 is rejected → we do NOT believe that E = P
  Significant at 0.05 (*, **, ***)
– When P ≥ 0.05:
  The result could have happened when H0 is true → H0 is NOT rejected → it is possible that E = P
  The result is ≠ 0, but we believe that this is due to the PLAY OF CHANCE
  NOT significant at 0.05 (NS)
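The "play of chance" and the meaning of a two-sided P-value can be made concrete with a short simulation (a Python sketch added for illustration, not part of the course slides; the group size and SD are made-up numbers): when H0 is true, the observed difference in means still varies around 0, and roughly 5% of trials nevertheless give P < 0.05.

```python
import math
import random

def two_sided_p(diff, se):
    """Two-sided P-value for an observed difference in means,
    using a normal (z-test) approximation."""
    z = abs(diff) / se
    # P(|Z| >= z) for a standard normal Z
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

random.seed(1)
n, sd = 100, 20.0           # hypothetical group size and SD
se = sd * math.sqrt(2 / n)  # standard error of the difference in means

# Simulate many trials in which H0 is true (both groups have mean 0):
# the observed difference varies around 0 purely by chance.
false_pos = 0
for _ in range(2000):
    diff = (sum(random.gauss(0, sd) for _ in range(n)) / n
            - sum(random.gauss(0, sd) for _ in range(n)) / n)
    if two_sided_p(diff, se) < 0.05:
        false_pos += 1

# About 5% of trials come out "significant" even though H0 is true
print(false_pos / 2000)
```

The z-test here assumes the SD is known; the WOMAC comparisons on the slides would use a t-test, but the logic of the P-value is the same.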
Slide 9 – 1. P-value: What is it?
Results for E, C and P
– E vs P, WOMAC Pain: P < 0.0001 → significant at 0.05 (***) → we do NOT believe that E = P
– E vs C, WOMAC Physical Function: P = 0.367 → NS → it could be that E = C; the result is PLAY of CHANCE
– E vs C, Patient Global Assessment: P = 0.051 → NS → it could be that E = C; the result is PLAY of CHANCE

Slide 10 – 1. P-value: What is it?
The previous decision rule = hypothesis testing
– Test H0: E = P versus HA: E ≠ P
– Using a statistical test (t-test, χ²-test, etc.)
– With two-sided significance level α = 0.05
– In the clinical trial setting the above test is interpreted as:
  H0: E ≤ P versus HA: E > P
  at one-sided significance level α/2 = 0.05/2 = 0.025 (2.5%)
  When the result is on the wrong side (E < P) with P < 0.05, the efficacy of E over P is not demonstrated

Slide 11 – 1. P-value: What is it?
What if H0: E = P is true & P = 0.023?
– We will reject H0
– We will make an ERROR = a Type I error
P(Type I error) = false-positive rate = probability that the result belongs to the 5% most extreme results if H0 is true = 0.05

Slide 12 – 2. Type I error
Type I error: practical implications
– Suppose H0 is TRUE
– Risk = 5% → implications: in 100 studies, on average 5 studies reach a wrong conclusion; Prob(at least 1 study with a wrong conclusion) ≈ 1
– Regulatory agencies mandate a strict control of the overall false-positive rate: false-positive trial findings could lead to approval of inefficacious drugs

Slide 13 – 3. Multiple testing
Multiple testing: definition
– Suppose H0 is TRUE
– Test 1 (WOMAC Pain Subscale): risk = 5%
– Test 2 (WOMAC Physical Function Subscale): risk = 5%
– Test 1 & Test 2: risk ≈ 5% + 5% = 10% of claiming that the 2 treatments are different (on one of the tests) when they are not
– If no adjustment is made: the multiple testing problem

Slide 14 – 3. Multiple testing
Multiple testing: typical cases
– 2 treatments are compared for several endpoints
– More than 2 treatments are compared
– 2 treatments are compared in several subgroups
– 2 treatments are compared at several time points

Slide 15 – 3. Multiple testing: example
2 treatments are compared for several endpoints
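The practical implications on the Type I error slide (100 studies → on average 5 wrong conclusions; at least one wrong conclusion almost certain) follow from a one-line formula for independent tests; a sketch for illustration:

```python
def familywise_error(alpha, k):
    """Probability of at least one false-positive result among k
    independent tests at level alpha when H0 is true for all of them."""
    return 1 - (1 - alpha) ** k

# Two endpoints at the 5% level: the combined risk is about 9.75%,
# close to (slightly below) the 5% + 5% = 10% quoted on the slides.
print(familywise_error(0.05, 2))

# 100 independent studies: at least one wrong conclusion is near-certain.
print(familywise_error(0.05, 100))
```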
Slide 16 – 3. Multiple testing: example
More than 2 treatments are compared

Slide 17 – 3. Multiple testing: example
2 treatments are compared in several subgroups
– The treatments were not significantly different overall
– Then the treatments were compared in subgroups:
  Males & Females
  < 60 yrs & ≥ 60 yrs
  Diabetes & no diabetes
  ...
– Suppose that in 1 subgroup P < 0.05; meaning????
  The significant result will be a play of chance

Slide 18 – 3. Multiple testing: example
2 treatments are compared at several time points
A comparison at each time point: PLAY OF CHANCE!

Slide 19 – 3. Multiple testing: example
The protocol specified:
"2.2 Administration of visits: Patients will be examined at baseline (day 0), day 7, day 14 and day 28. At each visit the systolic BP, etc. will be measured."
"9.4 Primary endpoint: The primary endpoint for the comparison of treatments A vs B is systolic BP."

Slide 20 – 3. Multiple testing: example
This "scientific finding" was printed in the Belgian newspapers! It was even stated that those who wake before 7.21 AM have a statistically significantly higher stress level during the day than those who wake after 7.21 AM!

Slide 21 – 3. Multiple testing: example
Signs of the times (Feb 22nd 2007, San Francisco; from The Economist print edition). An interesting finding? PEOPLE born under the astrological sign of Leo are 15% more likely to be admitted to hospital with gastric bleeding than those born under the other 11 signs. Sagittarians are 38% more likely than others to land up there because of a broken arm. Those are the conclusions that many medical researchers would be forced to make from a set of data presented to the American Association for the Advancement of Science by Peter Austin of the Institute for Clinical Evaluative Sciences in Toronto. At least, they would be forced to draw them if they applied the lax statistical methods of their own work to the records of hospital admissions in Ontario, Canada, used by Dr Austin.
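The subgroup example above can be simulated (a Python sketch for illustration; the 10 subgroups and 30 patients per arm are made-up numbers): with two truly identical treatments tested in 10 subgroups, a "significant" subgroup turns up in roughly 40% of trials, purely by the play of chance.

```python
import math
import random

def z_test_p(x, y):
    """Two-sided P-value for the difference of two sample means
    (pooled-SD z approximation)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    pooled_var = (sum((v - mx) ** 2 for v in x)
                  + sum((v - my) ** 2 for v in y)) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    z = abs(mx - my) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

random.seed(7)
trials, hits = 500, 0
for _ in range(trials):
    # 10 subgroups; the treatments are truly identical in every one
    ps = []
    for _ in range(10):
        a = [random.gauss(0, 1) for _ in range(30)]
        b = [random.gauss(0, 1) for _ in range(30)]
        ps.append(z_test_p(a, b))
    if min(ps) < 0.05:
        hits += 1

# At least one "significant" subgroup in roughly 40% of trials
# (1 - 0.95**10 is about 0.40), purely by the play of chance.
print(hits / trials)
```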
Slide 22 – 3. Multiple testing
Multiple testing: solution??
– Choose 1 primary endpoint → risk = 5%
– What if more than one endpoint is needed?
  Construct a combined endpoint based on clinical/statistical reasoning
  Correct for multiple testing
– What about the other (secondary + tertiary) endpoints?
  Call the analyses EXPLORATORY
  Correct for multiple testing

Slide 23 – 3. Multiple testing
Multiple testing: solution??
– Test 1 (WOMAC Pain Subscale): risk = 5% → 2.5%
– Test 2 (WOMAC Physical Function Subscale): risk = 5% → 2.5%
– Test 1 & Test 2: risk = 10% → 5%
– Both tests claim significance if P < 0.05
– Bonferroni adjustment: significance if P < 0.05/2 = 0.025 → family-wise error rate = 0.05
– More sophisticated approaches: Simes, Holm, Hochberg and Hommel, closed testing procedures, ...

Slide 24 – 3. Multiple testing
CPMP guidance document "Points to consider on multiplicity issues in clinical trials" (Sept 19, 2002):
"A clinical study that requires no adjustment of the Type I error is one that consists of two treatment groups, that uses a single primary variable, and has a confirmatory statistical strategy that pre-specifies just one single null hypothesis relating to the primary variable and no interim analysis"

Slide 25 – 4. Type II error
Type I error:
– The result is statistically significant (P < 0.05)
– Risk of making an error when H0 is true = 5%
– (We do NOT know whether H0 is true)
Type II error:
– The result is NOT statistically significant (P ≥ 0.05)
– Risk of making an error when H0 is NOT true = ???
– (We do NOT know whether H0 is NOT true)

Slide 26 – 5. Sample size calculation
P(Type II error) = β; 1 - β = power
– LARGE(R) in small studies
– Can be controlled by adapting the study (sample) size
– Calculation of the sample size:
  Determine the clinically important difference Δ
  Search for information (the % rate in the control group; the SD of the measurements)
  Fix P(Type II) = β ≤ 0.20 → power ≥ 0.80 (80%)
  Look for a statistician ((s)he will look for a computer program)
  Pray
  Let the computer work → sample size

Slide 27 – 5. Sample size calculation: example
power = 0.95, α = 0.05, Δ = 20% → n = 2 × 300

Slide 28 – 5. Sample size calculation: example??
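The sample size example above (power = 0.95, α = 0.05, Δ = 20% → n = 2 × 300) depends on inputs not shown on the slide, such as the control-group rate. For illustration, a sketch of the standard normal-approximation formula for comparing two proportions, with hypothetical rates (so it will not reproduce n = 300):

```python
import math

# Standard normal quantiles for two-sided alpha = 0.05 and power = 0.95
Z_ALPHA = 1.959964  # z_{1 - alpha/2}
Z_BETA = 1.644854   # z_{1 - beta}

def n_per_group(p_control, p_treat, z_alpha=Z_ALPHA, z_beta=Z_BETA):
    """Sample size per group for comparing two proportions
    (normal approximation, unpooled variances)."""
    var = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    delta = p_control - p_treat
    return math.ceil((z_alpha + z_beta) ** 2 * var / delta ** 2)

# Hypothetical rates: 50% of events in the control group, 30% under
# treatment, i.e. a clinically important difference of 20 percentage points.
print(n_per_group(0.50, 0.30))
```

A smaller Δ drives the sample size up quickly: halving the difference roughly quadruples n, which is why trials chasing small effects must be large.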
Slide 29 – 6. Negative studies
A negative study = a non-significant study
– Was a sample size calculation done (power at least 80%)?
  Yes: the difference between the treatments is probably smaller than Δ
  No: message ????
– It DOES NOT imply: NO difference between the treatments

Slide 30 – 6. Negative studies: example
Sample size calculation???? Message????

Slide 31 – 6. Negative studies: "trend"
A trend in the data:
– P > 0.05, but the difference is in the good direction
– One speaks of a "trend in the data"
– OK? No, for a confirmatory study; perhaps, for a pilot or exploratory study

Slide 32 – 7. Testing at baseline
Why no P-values? How many significant (at 0.05) tests would you expect?

Slide 33 – 8. Statistical significance ≠ clinical relevance
Statistical significance:
– P < 0.05
– Message: the two treatments are (probably/possibly) different
Clinical relevance:
– The difference is clinically relevant

Slide 34 – 8. Statistical significance ≠ clinical relevance: example
Compare two treatments
– Response = 10-year mortality
– 2 × 200 patients
– A: 2%, B: 10%
– Chi-square test: P < 0.001
Measures of effect
– ar = 10% - 2% = 8% (absolute risk reduction)
– rr = 10% / 2% = 5 (risk ratio)

Slide 35 – 8. Statistical significance ≠ clinical relevance: example
Compare two treatments
– Response = 10-year mortality
– 2 × 100,000 patients
– A: 0.002%, B: 0.010%
– Chi-square test: P < 0.001
Measures of effect
– ar = 0.010% - 0.002% = 0.008% (absolute risk reduction)
– rr = 0.010% / 0.002% = 5 (risk ratio)

Slide 36 – 8. Statistical significance ≠ clinical relevance: conclusion
Conclusion
– For each (small) Δ (≠ 0), there is a sample size such that H0 is rejected with high probability
Implications
– Clinical trials are often too small to detect rare safety issues
– When the drug is registered and on the market, a safety issue may surface after several years (the Vioxx story)
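The two mortality examples above can be reproduced in a few lines; the number needed to treat (NNT = 1/ar, a standard measure not shown on the slides) makes the contrast even sharper:

```python
def effect_measures(risk_a, risk_b):
    """Absolute risk reduction, risk ratio and number needed to treat
    for two event risks given as fractions (risk_b > risk_a)."""
    arr = risk_b - risk_a
    return {"ar": arr, "rr": risk_b / risk_a, "nnt": 1 / arr}

# 2 x 200 patients: mortality 2% (A) vs 10% (B)
# ar of 8 percentage points, rr of 5, NNT of 12.5
print(effect_measures(0.02, 0.10))

# 2 x 100,000 patients: mortality 0.002% (A) vs 0.010% (B)
# Same rr of 5, but ar of only 0.008 percentage points and an NNT of
# 12,500: statistically significant yet clinically negligible.
print(effect_measures(0.00002, 0.00010))
```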
Slide 37 – 8. Statistical significance ≠ clinical relevance: further reflections
The classical table indicating the two types of errors (the decision-theoretic approach of Neyman-Pearson):

  Conclusion from sample    Reality: Δ = 0    Reality: Δ ≠ 0
  "same"                    OK                Type II error
  "different"               Type I error      OK

The table suggests that we can conclude in practice that the 2 treatments are equally good. But it is not possible in statistics to show that 2 treatments are equally good (that is the non-inferiority talk). We even DO NOT BELIEVE that H0 is TRUE in practice!
Practical conclusions
– Even if the result is not significant, we will NOT conclude that H0 is true
– Why do the significance test, if we don't believe in it?
– Better: estimate the difference in treatment effect + its uncertainty

Slide 38 – 9. Confidence interval vs P-value

Slide 39 – 9. Confidence interval vs P-value
95% confidence interval
– Expresses uncertainty about the true difference
– When it is small → a good idea about the true treatment effect
Examples (WOMAC Pain Subscale)
– E vs C: 95% CI = [-7.02, 0.77] → 0 is possible
– E vs P: 95% CI = [-19.72, -10.41] → E is better
– C vs P: 95% CI = [-16.57, -7.32] → C is better
GENERAL RESULT: P < 0.05 ⇔ the 95% CI does not contain 0

Slide 40 – 9. Confidence interval vs P-value
Two anti-hypertensive drugs
[Figure: 95% confidence intervals (in mmHg, axis from -6 to 12) for five studies A1-A5 comparing the two medications, with P-values: A1 NS, A2 NS, A3 *, A4 **, A5 ***]
The 95% CI gives a clearer message

Slide 41 – 10. P-value of a clinical trial vs of an epidemiological study
Clinical trial
– Randomized
– No confounding
– P < 0.05 → a causal effect of treatment on the patient's condition
Epidemiological study
– Observational
– Possible confounding
– P < 0.05 → at most an association, after correction for confounding

Slide 42 – 10. P-value of a clinical trial vs of an epidemiological study

Slide 43 – 11. Biased set-up & reporting

Slide 44 – 11. Biased set-up & reporting
– Bias in the set-up of studies, e.g. inappropriate doses of the competing drug
– Choice of patient populations, e.g. exclusion of patients who previously did not respond to treatment
– Non-inferiority designs with different thresholds
– Biased reporting, e.g.
minimal information on the negative aspects of the sponsor's drug

Slide 45 – 12. Take home messages
– If possible, take 1 primary endpoint
– Always determine the necessary sample size
– Always WATCH OUT for the problem of multiple testing
– Always and ONLY interpret NS as: NOT possible to show a "difference"
– Always be careful when talking about a "trend"
– Always determine 95% confidence intervals

Slide 46 – Thank you for your attention
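As a closing illustration of the general result on the confidence interval slides (P < 0.05 exactly when the 95% CI excludes 0), a short sketch using the intervals quoted for the WOMAC Pain Subscale; `ci95` is the usual normal-approximation interval:

```python
def ci95(diff, se):
    """95% confidence interval for a difference (normal approximation)."""
    return (diff - 1.96 * se, diff + 1.96 * se)

def excludes_zero(lo, hi):
    """True when a 95% CI excludes 0, which is equivalent to P < 0.05
    for the corresponding two-sided test."""
    return not (lo <= 0.0 <= hi)

# The 95% CIs quoted for the WOMAC Pain Subscale
intervals = {
    "E vs C": (-7.02, 0.77),     # contains 0 -> NS (a 0 difference is possible)
    "E vs P": (-19.72, -10.41),  # excludes 0 -> significant, E better
    "C vs P": (-16.57, -7.32),   # excludes 0 -> significant, C better
}
for label, (lo, hi) in intervals.items():
    verdict = "significant (P < 0.05)" if excludes_zero(lo, hi) else "NS"
    print(label, verdict)
```

Unlike a bare P-value, the interval also shows the size and direction of the effect, which is why the CI gives the clearer message.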