Transcript Slide 1

Use and abuse of P values
Clinical Research Methodology Course
Randomized Clinical Trials and the “REAL WORLD”
Emmanuel Lesaffre
Biostatistical Centre, K.U.Leuven, Leuven, Belgium
Dept of Biostatistics, Erasmus MC, Rotterdam, the Netherlands
NY, 14 December 2007
2
3
4
Contents
1. P-value: What is it?
2. Type I error
3. Multiple testing
4. Type II error
5. Sample size calculation
6. Negative studies
7. Testing at baseline
8. Statistical significance ≠ clinical relevance
9. Confidence interval versus P-value
10. P-value of clinical trial versus epidemiological study
11. Take home messages
5
1. P-value: What is it?
6
1. P-value: What is it?

Etoricoxib versus Placebo
– WOMAC Pain Subscale: difference in means = -15.07
– What does this result mean?
– What do you expect if etoricoxib = placebo?
  ⇒ difference ≈ 0
– But even if etoricoxib = placebo, the result will vary around 0
– What is a large/small difference?
– What is the play of chance?

The same questions for the other scores & comparisons
7
1. P-value: What is it?

Etoricoxib versus Placebo
– Suppose H0: E=P
– P=0.05 ⇒ result belongs to the 5% most extreme results that could happen under H0 (if H0 is true)
– P=0.01 ⇒ result belongs to the 5% most extreme results that could happen under H0 (if H0 is true) and only 1% of results are AS OR MORE EXTREME
– P<0.0001 ⇒ result belongs to the 5% most extreme results that could happen under H0 (if H0 is true) and IS VERY EXTREME
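The idea of "extreme results under H0" can be made concrete with a small simulation. This is a minimal sketch, not the trial's analysis: the observed difference (-4.0) and its standard error (2.0) are hypothetical numbers, and results under H0 are assumed to be normally distributed around 0.

```python
import random


def monte_carlo_p_value(observed_diff, standard_error, n_sim=100_000, seed=1):
    """Two-sided P-value: share of simulated H0 results at least as extreme as observed."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_sim):
        # Under H0 the true difference is 0, so only chance moves the result.
        simulated = rng.gauss(0.0, standard_error)
        if abs(simulated) >= abs(observed_diff):
            extreme += 1
    return extreme / n_sim


# Hypothetical numbers, not taken from the etoricoxib trial.
p = monte_carlo_p_value(observed_diff=-4.0, standard_error=2.0)
print(f"P is approximately {p:.3f}")
```

A result two standard errors away from 0 lands just inside the 5% most extreme results, which is why it is borderline significant at 0.05.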
8
1. P-value: What is it?

GENERAL RULE
– When P < 0.05 (= significance level α):
  • Result is considered to be TOO EXTREME to believe that H0 is true
  • H0 is rejected ⇒ we do NOT believe that E=P
  • Significant at 0.05 (*, **, ***)
– When P ≥ 0.05:
  • Result could have happened when H0 is true
  • H0 is NOT rejected ⇒ it is possible that E=P
  • Result is ≠ 0, but this could be due to PLAY OF CHANCE
  • NOT significant at 0.05 (NS)
9
1. P-value: What is it?

Results ECP
– E P, WOMAC Pain
 P < 0.0001  Significant at 0.05 (***)
 We do NOT believe that E=P
– E C, WOMAC Physical Function
 P = 0.367  NS
 It could be that E=C, result is PLAY of CHANCE
– E C, Patient Global Assessment
 P = 0.051  NS
 It could be that E=C, result is PLAY of CHANCE
10
1. P-value: What is it?

Previous decision rule = hypothesis testing
– Test H0: E=P versus HA: E≠P
– Using a statistical test (t-test, χ²-test, etc.)
– With 2-sided significance level = α = 0.05
– In clinical trial setting:
  • Above test is interpreted as: H0: E ≤ P versus HA: E > P
  • And at 1-sided significance level = α/2 = 0.05/2 = 0.025 (2.5%)
When result is on the wrong side (E < P) with P < 0.05, then efficacy of E over P is not demonstrated
11
1. P-value: What is it?

What if H0: E=P is true & P=0.023?
– We will reject H0
– We will make an ERROR
= Type I error

P(Type I error) = False-positive rate
= Probability that result belongs to the 5% most extreme results if H0 is true
= 0.05
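The 5% false-positive rate can be checked by simulating many trials in which H0 is true. This is a sketch under stated assumptions: both arms are drawn from the same normal distribution, the SD is treated as known (so a z-test is used instead of a t-test), and the trial size and number of simulated trials are arbitrary choices.

```python
import math
import random


def two_sided_p(z):
    """Two-sided P-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulated_type_one_error_rate(n_trials=4000, n_per_arm=30, sd=1.0,
                                  alpha=0.05, seed=7):
    """Fraction of simulated trials that reject H0 although E = P holds."""
    rng = random.Random(seed)
    se = sd * math.sqrt(2.0 / n_per_arm)  # standard error of the difference in means
    rejections = 0
    for _ in range(n_trials):
        # Both arms share the same true mean, so H0 is true by construction.
        mean_e = sum(rng.gauss(0.0, sd) for _ in range(n_per_arm)) / n_per_arm
        mean_p = sum(rng.gauss(0.0, sd) for _ in range(n_per_arm)) / n_per_arm
        if two_sided_p((mean_e - mean_p) / se) < alpha:
            rejections += 1
    return rejections / n_trials


rate = simulated_type_one_error_rate()
print(f"Share of false-positive trials: {rate:.3f}")
```

The printed share should be close to 0.05, up to simulation noise.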
12
2. Type I error

Type I error: Practical implications
– Suppose H0 is TRUE
– Risk = 5% implications:
  • 100 studies ⇒ on average 5 studies with a wrong conclusion
  • Prob(at least 1 study with a wrong conclusion) ≈ 1
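Both numbers follow from simple arithmetic. The sketch below assumes that the studies are independent and that H0 is true in every one of them; the function names are illustrative.

```python
def expected_false_positives(n_studies, alpha=0.05):
    """Average number of studies that wrongly reject a true H0."""
    return n_studies * alpha


def prob_at_least_one_false_positive(n_studies, alpha=0.05):
    """Chance that at least one of n independent studies wrongly rejects H0."""
    return 1.0 - (1.0 - alpha) ** n_studies


expected = expected_false_positives(100)
at_least_one = prob_at_least_one_false_positive(100)
print(f"Expected wrong conclusions in 100 studies: {expected:.1f}")
print(f"Prob(at least 1 wrong conclusion): {at_least_one:.3f}")
```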

Regulatory agencies mandate a strict control of the
overall false-positive rate

False positive trial findings could lead to approval of
inefficacious drugs
13
3. Multiple testing

Multiple testing: Definition
– Suppose H0 is TRUE
– Test 1 (WOMAC pain subscale):
risk = 5%
– Test 2 (WOMAC Physical Function Subscale):
risk = 5%
– Test 1 & Test 2: risk ≈ 5% + 5% = 10% of claiming that the 2 treatments differ (on at least one of the tests) when they do not
– If no adjustment: multiple testing problem
14
3. Multiple testing

Multiple testing: Typical cases
– 2 treatments are compared for several endpoints
– More than 2 treatments are compared
– 2 treatments are compared in several subgroups
– 2 treatments are compared at several time points
15
3. Multiple testing: example

2 treatments are compared for several endpoints
16
3. Multiple testing: example

More than 2 treatments are compared
17
3. Multiple testing: example

2 treatments are compared in several subgroups
– Treatments were not significantly different overall
– Then, treatments were compared in subgroups:
  • Males & Females
  • < 60 yrs & ≥ 60 yrs
  • Diabetes & no-diabetes
  • ....
– Suppose in 1 subgroup: P < 0.05, meaning????
  ⇒ Significant result may well be a play of chance
18
3. Multiple testing: example

2 treatments are compared at several time points
Comparison at each time point: PLAY OF CHANCE!
19
3. Multiple testing: example

Protocol specified:
2.2 Administration of visits
Patients will be examined at baseline (day 0), day 7, day 14
and day 28. At each visit the systolic BP, etc... will be
measured.
9.4 Primary endpoint
The primary endpoint for the comparison of treatment A versus B
is systolic BP.
20
3. Multiple testing: example

This “scientific finding” was printed in the Belgian newspapers!
It was even stated that those who awake before 7.21 AM,
have a statistically significant higher stress level during the
day, than those who awake after 7.21 AM!
21
3. Multiple testing: example
Signs of the times: Feb 22nd 2007 | SAN FRANCISCO
From The Economist print edition
Interesting finding?
PEOPLE born under the astrological sign of Leo are 15% more likely
to be admitted to hospital with gastric bleeding than those born
under the other 11 signs. Sagittarians are 38% more likely than
others to land up there because of a broken arm.
Those are the conclusions that many medical researchers would be
forced to make from a set of data presented to the American
Association for the Advancement of Science by Peter Austin of the
Institute for Clinical Evaluative Sciences in Toronto. At least, they
would be forced to draw them if they applied the lax statistical
methods of their own work to the records of hospital admissions in
Ontario, Canada, used by Dr Austin.
22
3. Multiple testing

Multiple testing: Solution??
– Choose 1 primary endpoint ⇒ risk = 5%
– What if more than one endpoint is needed?
  • Construct combined endpoint based on clinical/statistical reasoning
  • Correct for multiple testing
– What about other (secondary + tertiary) endpoints?
  • Call analyses EXPLORATORY
  • Correct for multiple testing
23
3. Multiple testing

Multiple testing: Solution??
– Test 1 (WOMAC pain subscale): risk = 5% ⇒ 2.5%
– Test 2 (WOMAC Physical Function Subscale): risk = 5% ⇒ 2.5%
– Test 1 & Test 2: risk = 10% ⇒ 5%
– Without adjustment: each test claims significance if P < 0.05
– Bonferroni adjustment: significance if P < 0.05/2 = 0.025
  ⇒ Family-wise error rate ≤ 0.05
– More sophisticated approaches of Simes, Holm, Hochberg and
Hommel, Closed Testing procedures, ...
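As an illustration of how such corrections work, here is a minimal sketch of the Bonferroni rule next to Holm's step-down procedure. The two P-values are hypothetical, and the strict inequality follows the slide's "P < 0.025" wording.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each H0 whose P-value is below alpha divided by the number of tests."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: test the smallest P at alpha/m, the next at alpha/(m-1), etc."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] < alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger P-values are not rejected
    return reject


# Hypothetical P-values for two endpoints (not the trial's results).
p_values = [0.030, 0.020]
print("Bonferroni:", bonferroni_reject(p_values))
print("Holm:      ", holm_reject(p_values))
```

With these hypothetical values Bonferroni rejects only the second hypothesis, while Holm rejects both; both procedures keep the family-wise error rate at or below 0.05.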
24
3. Multiple testing

CPMP guidance document
“Points to consider on multiplicity issues in clinical trials”
(Sept 19, 2002)
“A clinical study that requires no adjustment of the Type I
error is one that consists of two treatment groups, that uses
a single primary variable, and has a confirmatory statistical
strategy that pre-specifies just one single null hypothesis
relating to the primary variable and no interim analysis”
25
4. Type II error

Type I error:
– Result is statistically significant (P < 0.05)
– Risk of making an error when H0 is true = 5%
– (We do NOT know if H0 is true)

Type II error:
– Result is NOT statistically significant (P ≥ 0.05)
– Risk of making an error when H0 is NOT true = ???
– (We do NOT know if H0 is NOT true)
26
5. Sample size calculation

P(Type II error) = 1 − Power
– LARGE(R) in small studies
– Can be controlled by adapting study (sample) size
– Calculation of sample size:
  • Determine clinically important difference Δ
  • Search for information
    – % rate control group
    – SD of measurements
  • Fix P(Type II) ≤ 0.20 ⇒ Power ≥ 0.80 (80%)
  • Look for statistician ((s)he will look for computer program)
  • Pray
  • Let computer work ⇒ sample size
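What the computer program does can be sketched with a common normal-approximation formula for comparing two means, n per group = 2·(z₁₋α/₂ + z_power)²·SD²/Δ². The slides do not say which formula or software was used, so this is an assumption, and the values Δ = 5 and SD = 10 are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_for_means(delta, sd, alpha=0.05, power=0.80):
    """Approximate patients per arm to detect a mean difference delta (2-sided test)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)              # about 0.84 for 80% power
    n = 2.0 * ((z_alpha + z_power) * sd / delta) ** 2
    return math.ceil(n)


# Hypothetical inputs: clinically important difference 5, SD 10.
n_80 = n_per_group_for_means(delta=5.0, sd=10.0, power=0.80)
n_95 = n_per_group_for_means(delta=5.0, sd=10.0, power=0.95)
print(f"Per group for 80% power: {n_80}")
print(f"Per group for 95% power: {n_95}")
```

Higher power or a smaller Δ both increase the required sample size, which is why the choice of Δ has to be defended clinically.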
27
5. Sample size calculation: example
power = 0.95
α = 0.05
Δ = 20%
n = 2 x 300
28
5. Sample size calculation: example??
29
6. Negative studies

Negative study: Not significant study
– Sample size calculation done (power at least 80%)?
– Yes:
  • Difference between treatments is probably smaller than Δ
– No:
  • Message ????
  • DOES NOT imply: NO difference between treatments
30
6. Negative studies: example
Sample size calculation????
Message????
31
6. Negative studies: “Trend”

Trend in the data:
– P > 0.05, but difference is in the good direction
– One speaks of a “trend in the data”
– OK?
  • No, for confirmatory study
  • Perhaps, for pilot study or exploratory studies
32
7. Testing at baseline
Why no P-values?
How many significant (at 0.05) tests would you expect?
33
8. Statistical significance ≠ clinical relevance

Statistical significance:
– P < 0.05
– Message: two treatments are (probably/possibly) different

Clinical relevance:
– Difference is clinically relevant
34
8. Statistical significance ≠ clinical relevance: Example

Compare two treatments
– Response = 10-year mortality
– 2 x 200 patients
– A: 2%, B: 10%
– Chi-square test: P < 0.001

Measures of effect
– ar = 10% − 2% = 8% (abs risk reduction)
– rr = 10%/2% = 5 (risk ratio)
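The numbers on this slide can be reproduced with a few lines. The sketch assumes the uncorrected Pearson chi-square statistic (the slide does not say whether a continuity correction was applied) and converts the percentages into counts of 4/200 and 20/200 deaths.

```python
import math


def pearson_chi_square_2x2(events_a, n_a, events_b, n_b):
    """Uncorrected Pearson chi-square and P-value (1 df) for two proportions."""
    a, b = events_a, n_a - events_a
    c, d = events_b, n_b - events_b
    n = n_a + n_b
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# 2% and 10% mortality in 2 x 200 patients correspond to 4 and 20 deaths.
events_a, events_b, n_arm = 4, 20, 200
chi2, p_value = pearson_chi_square_2x2(events_a, n_arm, events_b, n_arm)
risk_a, risk_b = events_a / n_arm, events_b / n_arm
ar = risk_b - risk_a  # absolute risk reduction
rr = risk_b / risk_a  # risk ratio

print(f"chi-square = {chi2:.2f}, P = {p_value:.5f}")
print(f"ar = {ar:.2%}, rr = {rr:.1f}")
```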
35
8. Statistical significance ≠ clinical relevance: Example

Compare two treatments
– Response = 10-year mortality
– 2 x 100,000 patients
– A: 0.002%, B: 0.010%
– Chi-square test: P < 0.001

Measures of effect
– ar = 0.010% − 0.002% = 0.008% (abs risk reduction)
– rr = 0.010%/0.002% = 5 (risk ratio)
36
8. Statistical significance ≠ clinical relevance: Conclusion

Conclusion
– For each (small) Δ (≠ 0), there is a sample size such that H0 is rejected with high probability
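This conclusion can be illustrated by computing the power of a two-sample test for a fixed, tiny difference at growing sample sizes. The sketch assumes a z-test with known SD, and the difference of 0.05 SD is a hypothetical value chosen to be clinically negligible.

```python
import math
from statistics import NormalDist


def power_two_sample_z(delta, sd, n_per_group, alpha=0.05):
    """Probability that a 2-sided z-test rejects H0 when the true difference is delta."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(delta) / (sd * math.sqrt(2.0 / n_per_group))
    # Rejection can happen in either tail; the second term is usually negligible.
    return norm.cdf(shift - z_alpha) + norm.cdf(-shift - z_alpha)


# Hypothetical tiny effect: 0.05 SD.
powers = {n: power_two_sample_z(delta=0.05, sd=1.0, n_per_group=n)
          for n in (100, 10_000, 100_000)}
for n, power in powers.items():
    print(f"n per group = {n:>7}: power = {power:.3f}")
```

The same negligible difference that is almost never detected with 100 patients per group is almost always "significant" with 100,000 per group.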

Implications
– Clinical trials are often too small to detect rare safety issues
– After registration and several years on the market, a safety issue may appear (Vioxx story)
37
8. Statistical significance ≠ clinical relevance: Further reflections
                                     Reality (treatments)
Conclusion from sample               Δ = 0        Δ ≠ 0
same                                 OK           type II
different                            type I       OK

Classical table indicating two types of errors (decision-theoretic approach of Neyman-Pearson).
The table suggests that we can conclude in practice that the 2 treatments are equally good.
It is not possible in statistics to show that 2 treatments are equally good (non-inferiority talk).
We even DO NOT BELIEVE that H0 is TRUE in practice!

Practical conclusions
– Even if result is not significant, we will NOT conclude that H0 is true
– Why do the significance test if we don’t believe in it?
– Better: estimate the difference in treatment effect + uncertainty
38
9. Confidence interval versus P-value
39
9. Confidence interval versus P-value

95% confidence interval
– Expresses uncertainty about true difference
– When narrow ⇒ good idea about true treatment effect

Examples
– WOMAC Pain Subscale:
  • E versus C: 95% CI = [-7.02, 0.77] ⇒ 0 is possible
  • E versus P: 95% CI = [-19.72, -10.41] ⇒ E is better
  • C versus P: 95% CI = [-16.57, -7.32] ⇒ C is better
GENERAL RESULT: P < 0.05 ⇔ 95% CI does not contain 0
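The link between the interval and the P-value can be checked on the three intervals above. This sketch assumes symmetric intervals built with a normal approximation (estimate ± 1.96 × SE); the trial may have used a t-distribution or an adjusted model, so the recovered P-values are only approximate.

```python
import math
from statistics import NormalDist

Z_975 = NormalDist().inv_cdf(0.975)  # about 1.96


def estimate_and_p_from_ci(lower, upper):
    """Recover the point estimate and an approximate 2-sided P-value from a 95% CI."""
    estimate = (lower + upper) / 2.0
    se = (upper - lower) / (2.0 * Z_975)
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return estimate, p


# WOMAC Pain Subscale intervals as reported on the slide.
intervals = {
    "E vs C": (-7.02, 0.77),
    "E vs P": (-19.72, -10.41),
    "C vs P": (-16.57, -7.32),
}

results = {}
for label, (lower, upper) in intervals.items():
    estimate, p = estimate_and_p_from_ci(lower, upper)
    excludes_zero = lower > 0 or upper < 0
    results[label] = (estimate, p, excludes_zero)
    print(f"{label}: estimate = {estimate:.2f}, P = {p:.4f}, "
          f"CI excludes 0: {excludes_zero}")
```

The midpoint of the E versus P interval agrees with the reported difference in means of -15.07 up to rounding, and in each comparison P < 0.05 exactly when the interval excludes 0.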
40
9. Confidence interval versus P-value
Two anti-hypertensive drugs
[Figure: 95% confidence intervals (mmHg, axis from -6 to 12) for the difference between the two medications in five studies (A 1 to A 5), with significance labels NS, NS, *, **, ***]
95% CI gives a clearer message
41
10. P-value of clinical trial versus epi study

Clinical trial
– Randomized
– No confounding
– P < 0.05 ⇒ causal effect of treatment on patient’s condition

Epidemiological study
– Observational
– Possible confounding
– P < 0.05 ⇒ at most association; correction for confounding needed
42
10. P-value of clinical trial versus epi study
43
11. Biased set up & reporting
44
11. Biased setup & reporting

Bias in set up of studies, e.g. inappropriate doses of
competing drug

Choice of patient populations, e.g. exclusion of patients
who were previously nonresponder to treatment

Noninferiority designs with different thresholds

Biased reporting, e.g. minimal information on negative
aspects of drug of sponsor
45
12. Take home messages

If possible, take 1 primary endpoint

Always determine necessary sample size

Always WATCH OUT for problem of multiple testing

Always interpret NS ONLY as: NOT possible to show a “difference”

Always be careful when talking about “trend”

Always determine 95% confidence intervals
46
Thank you for your attention