Pitfalls of Hypothesis Testing
+ Sample Size Calculations
Hypothesis Testing
The Steps:
1. Define your hypotheses (null, alternative)
2. Specify your null distribution
3. Do an experiment
4. Calculate the p-value of what you observed
5. Reject or fail to reject (~accept) the null hypothesis
Follows the logic: If A then B; not B; therefore, not A.
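To make the five steps concrete, here is a minimal sketch in Python (the coin-flip experiment, the 60-heads result, and the use of scipy’s binomtest are illustrative choices, not from the lecture):

# Steps 1-2: H0: the coin is fair (p = 0.5); under H0 the number of heads
#            in 100 flips follows a Binomial(100, 0.5) null distribution.
# Step 3:    suppose the (hypothetical) experiment yields 60 heads in 100 flips.
# Steps 4-5: compute the two-sided p-value and reject H0 if p < 0.05.
from scipy.stats import binomtest

result = binomtest(k=60, n=100, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")
print("reject H0" if result.pvalue < 0.05 else "fail to reject H0")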
Summary: The Underlying Logic of hypothesis tests…
Follows this logic:
Assume A.
If A, then B.
Not B.
Therefore, Not A.
But throw in a bit of uncertainty… If A, then probably B…
Error and Power
Type-I Error (also known as “α”): Rejecting the null when the effect isn’t real.
Type-II Error (also known as “β”): Failing to reject the null when the effect is real.
POWER (the flip side of type-II error: 1 − β): The probability of seeing a true effect if one exists.
Note the sneaky conditionals…
Think of… Pascal’s Wager

                              The TRUTH
Your Decision     God Exists               God Doesn’t Exist
Reject God        BIG MISTAKE              Correct
Accept God        Correct (Big Pay Off)    MINOR MISTAKE
Type I and Type II Error in a box

Your Statistical Decision                        True state of the null hypothesis
                                                 H0 True                      H0 False
                                                 (example: the drug           (example: the drug
                                                 doesn’t work)                works)
Reject H0 (ex: you conclude that                 Type I error (α)             Correct
the drug works)
Do not reject H0 (ex: you conclude that          Correct                      Type II error (β)
there is insufficient evidence that the
drug works)
Error and Power
Type I error rate (or significance level): the probability of finding an effect that isn’t real (false positive).
  If we require p-value<.05 for statistical significance, this means that 1/20 times we will find a positive result just by chance.
Type II error rate: the probability of missing an effect (false negative).
Statistical power: the probability of finding an effect if it is there (the probability of not making a type II error).
  When we design studies, we typically aim for a power of 80% (allowing a false negative rate, or type II error rate, of 20%).
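To see the type I error rate in action, here is a minimal simulation sketch (the normal data, group sizes, and seed are arbitrary assumptions): when the null is true and we require p < .05, about 1 test in 20 comes out “significant” by chance.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group = 10_000, 50
false_positives = 0
for _ in range(n_sims):
    # Both groups come from the SAME distribution, so the null is true.
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1
print(f"false-positive rate: {false_positives / n_sims:.3f}")  # close to 0.05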
Pitfall 1: over-emphasis on p-values
Clinically unimportant effects may be statistically significant if a study is large (and therefore has a small standard error and extreme precision).
Pay attention to effect size and confidence intervals.
Example: effect size
A prospective cohort study of 34,079 women found that women who exercised >21 MET-hours per week gained significantly less weight than women who exercised <7.5 MET-hours per week (p<.001).
Headlines: “To Stay Trim, Women Need an Hour of Exercise Daily.”
Physical Activity and Weight Gain Prevention. JAMA 2010;303:1173-1179.

[Figure: Mean (SD) differences in weight over any 3-year period by physical activity level, Women’s Health Study, 1992-2007. Lee IM, et al. JAMA 2010;303:1173-1179.]
•What was the effect size? Those who exercised the least gained 0.15 kg (0.33 pounds) more than those who exercised the most over 3 years.
•Extrapolated over the 13 years of the study, the high exercisers gained 1.4 pounds less than the low exercisers!
•Classic example of a statistically significant effect that is not clinically significant (a rough calculation is sketched below).
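A rough sketch of why so tiny a difference can still yield p<.001 at this sample size (the 5-kg standard deviation and the even split into two groups are invented for illustration; the published analysis was more elaborate):

from scipy import stats

# Hypothetical: 0.15-kg mean difference, SD = 5 kg, ~17,000 women per group.
result = stats.ttest_ind_from_stats(mean1=0.15, std1=5.0, nobs1=17_000,
                                    mean2=0.00, std2=5.0, nobs2=17_000)
print(f"p = {result.pvalue:.4f}")  # a clinically trivial difference is "significant"

With huge n, the standard error shrinks until even trivial differences clear the significance bar, which is exactly why effect sizes and confidence intervals matter.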
A picture is worth…
Authors explain: “Figure 2 shows the trajectory of weight gain over time by baseline
physical activity levels. When classified by this single measure of physical activity, all 3
groups showed similar weight gain patterns over time.”
A picture is worth…
But baseline physical activity should predict weight gain in the first
three years…do those slopes look different to you?
Another recent headline
Drinkers May Exercise More Than Teetotalers
Activity levels rise along with alcohol use, survey shows
“MONDAY, Aug. 31 (HealthDay News) -- Here's something to toast:
Drinkers are often exercisers”…
“In reaching their conclusions, the researchers examined data from
participants in the 2005 Behavioral Risk Factor Surveillance
System, a yearly telephone survey of about 230,000
Americans.”…
For women, those who imbibed exercised 7.2 minutes more per
week than teetotalers. The results applied equally to men…
Pitfall 2: association does not equal causation
Statistical significance does not imply a cause-effect relationship.
Interpret results in the context of the study design.
Pitfall 3: multiple comparisons
A significance level of 0.05 means that your false positive rate for one test is 5%.
If you run more than one test, your false positive rate will be higher than 5%.
Data dredging/multiple comparisons
In 1980, researchers at Duke randomized 1073 heart disease patients into two groups, but treated the groups equally.
Not surprisingly, there was no difference in survival.
Then they divided the patients into 18 subgroups based on prognostic factors.
In a subgroup of 397 patients (with three-vessel disease and an abnormal left ventricular contraction), survival of those in “group 1” was significantly different from survival of those in “group 2” (p<.025).
How could this be since there was no treatment?
(Lee et al. “Clinical judgment and statistics: lessons from a simulated randomized
trial in coronary artery disease,” Circulation, 61: 508-515, 1980.)
Multiple comparisons
The difference resulted from the combined effect of small imbalances in the subgroups.
Multiple comparisons
By using a p-value of 0.05 as the criterion for significance, we’re accepting a 5% chance of a false positive (of calling a difference significant when it really isn’t).
If we compare survival of “treatment” and “control” within each of 18 subgroups, that’s 18 comparisons.
If these comparisons were independent, the chance of at least one false positive would be:

    1 − (0.95)^18 ≈ 0.60
Multiple comparisons
With 18 independent
comparisons, we have
60% chance of at least 1
false positive.
Multiple comparisons
With 18 independent
comparisons, we expect
about 1 false positive.
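This calculation is easy to generalize to any number of independent tests (the function name is mine):

def family_wise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(family_wise_error_rate(18))  # ~0.60
print(18 * 0.05)                   # expected false positives: 0.9, i.e., about 1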
Results from a previous class survey…
My research question was to test whether or not being born on odd or even days predicted anything about people’s futures.
I discovered that people who were born on odd days got up later and drank more alcohol than people born on even days; they also had a trend toward doing more homework (p=.04, p<.01, p=.09).
Those born on odd days woke up 42 minutes later (7:48 vs. 7:06 am); drank 2.6 more drinks per week (3.7 vs. 1.1); and did 8 more hours of homework (22 hrs/week vs. 14).
Results from Class survey…
I can see the NEJM article title now… “Being born on odd days predisposes you to alcoholism and laziness, but makes you a better med student.”
Results from Class survey…
Assuming that this difference can’t be explained by astrology, it’s obviously an artifact!
What’s going on?…
Results from Class survey…
After the odd/even day question, I asked 25 other questions…
I ran 25 statistical tests (comparing the outcome variable between odd-day born people and even-day born people).
So, there was a high chance of finding at least one false positive!
P-value distribution for the 25 tests…
Recall: Under the null hypothesis of no associations (which we’ll assume is true here!), p-values follow a uniform distribution…
[Figure: histogram of the 25 observed p-values, with my significant p-values highlighted.]
Compare with…
Next, I generated 25 “p-values”
from a random number generator
(uniform distribution). These
were the results from three
runs…
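A minimal sketch of that exercise (the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(42)
for run in range(1, 4):
    # Under the null, p-values are draws from Uniform(0, 1).
    p_values = rng.uniform(0, 1, size=25)
    print(f"run {run}: {np.sum(p_values < 0.05)} 'significant' at p < .05")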
In the medical literature…
Researchers examined the relationship between intakes of caffeine/coffee/tea and breast cancer, overall and in multiple subgroups (50 tests).
Overall, there was no association:
  Risk ratios were close to 1.0 (ranging from 0.67 to 1.79), indicated protection (<1.0) about as often as harm (>1.0), and showed no consistent dose-response pattern.
But they found 4 “significant” p-values in subgroups:
  Coffee intake was linked to increased risk in those with benign breast disease (p=.08).
  Caffeine intake was linked to increased risk of estrogen/progesterone-negative tumors and tumors larger than 2 cm (p=.02).
  Decaf coffee was linked to reduced risk of breast cancer in postmenopausal hormone users (p=.02).
Ishitani K, Lin J, Manson JE, Buring JE, Zhang SM. Caffeine consumption and the risk of breast cancer in a large prospective cohort of women. Arch Intern Med 2008;168:2022-2031.
Distribution of the p-values from the 50 tests
[Figure: histogram of the 50 p-values. Likely chance findings!]
Also, the effect sizes showed no consistent pattern. The risk ratios:
  - were close to 1.0 (ranging from 0.67 to 1.79)
  - indicated protection (<1.0) about as often as harm (>1.0)
  - showed no consistent dose-response pattern.
Hallmarks of a chance finding:
  • Analyses are exploratory.
  • Many tests have been performed but only a few are significant.
  • The significant p-values are modest in size (between p=0.01 and p=0.05).
  • The pattern of effect sizes is inconsistent.
  • The p-values are not adjusted for multiple comparisons (a simple adjustment is sketched below).
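One common (if conservative) remedy, offered here as my addition rather than the lecture’s, is the Bonferroni correction: compare each p-value to α/k, or equivalently multiply each p-value by the number of tests k. A minimal sketch, taking the class survey’s “p<.01” as 0.008 for illustration:

alpha, k = 0.05, 25
raw_p = [0.04, 0.008, 0.09]  # the class-survey p-values (0.008 stands in for p<.01)
for p in raw_p:
    adjusted = min(1.0, p * k)  # Bonferroni-adjusted p-value
    verdict = "significant" if adjusted < alpha else "not significant"
    print(f"raw p = {p:.3f} -> adjusted p = {adjusted:.3f} ({verdict})")

None of the survey’s findings survive the adjustment.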
Conclusions
Look at the totality of the evidence.
Expect about one marginally significant p-value (.01<p<.05) for every 20 tests run.
Be wary of unplanned comparisons (e.g., subgroup analyses).
Pitfall 4: high type II error (low statistical power)
Results that are not statistically significant should not be interpreted as “evidence of no effect,” but as “no evidence of effect.”
Studies may miss effects if they are insufficiently powered (lack precision).
Example: A study of 36 postmenopausal women failed to find a significant relationship between hormone replacement therapy and prevention of vertebral fracture. The odds ratio and 95% CI were 0.38 (0.12, 1.19), indicating a potentially meaningful clinical effect. Failure to find an effect may have been due to insufficient statistical power for this endpoint.
Ref: Wimalawansa et al. Am J Med 1998, 104:219-226.
Example
“There was no significant effect of treatment (p=0.058), nor treatment by velocity interaction (p=0.19), indicating that the treatment and control groups did not differ in their ability to perform the task.”
P-values >.05 indicate that we have insufficient evidence of an effect; they do not constitute proof of no effect.
Smoking cessation trial
Weight-concerned women smokers were randomly assigned to one of four groups:
  Weight-focused or standard counseling, plus bupropion or placebo
Outcome: biochemically confirmed smoking abstinence
Levine MD, Perkins KS, Kalarchian MA, et al. Bupropion and Cognitive Behavioral Therapy for Weight-Concerned Women
Smokers. Arch Intern Med 2010;170:543-550.
The Results…
Rates of biochemically verified prolonged abstinence at 3, 6, and 12 months
from a four-arm randomized trial of smoking cessation*

Months after    Weight-focused counseling                Standard counseling
quit target   Bupropion   Placebo   P-value,        Bupropion   Placebo   P-value,
date          group       group     bupropion       group       group     bupropion
              (n=106)     (n=87)    vs. placebo     (n=89)      (n=67)    vs. placebo
3             41%         18%       .001            33%         19%       .07
6             34%         11%       .001            21%         10%       .08
12            24%         8%        .006            19%         7%        .05
Counseling methods appear equally effective in the
placebo group.
Clearly, bupropion improves quitting rates in the
weight-focused counseling group.
What conclusion should we draw about the effect of
bupropion in the standard counseling group?
Authors’ conclusions/Media coverage…
“Among the women who received standard counseling, bupropion did not appear to improve quit rates or time to relapse.”
“For the women who received standard counseling, taking bupropion didn’t seem to make a difference.”
Correct take-home message…
Bupropion improves quitting rates over counseling alone.
  Main effect for drug is significant.
  Main effect for counseling type is NOT significant.
  Interaction between drug and counseling type is NOT significant.
Pitfall 5: the fallacy of comparing statistical significance
“The effect was significant in the treatment group, but not significant in the control group” does not imply that the groups differ significantly.
Example
In a placebo-controlled randomized trial of DHA oil for eczema, researchers found a statistically significant improvement in the DHA group but not the placebo group.
The abstract reports: “DHA, but not the control treatment, resulted in a significant clinical improvement of atopic eczema.”
However, the improvement in the treatment group was not significantly better than the improvement in the placebo group, so this is actually a null result.
Misleading “significance comparisons”
The improvement in the DHA group (18%) is not significantly greater than the improvement in the control group (11%).
Koch C, Dölle S, Metzger M, et al. Docosahexaenoic acid (DHA) supplementation in atopic
eczema: a randomized, double-blind, controlled trial. Br J Dermatol 2008;158:786-792.
Within-group vs. between-group tests
Examples of statistical tests used to evaluate within-group effects versus statistical tests used to evaluate between-group effects:

Statistical tests for within-group effects     Statistical tests for between-group effects
Paired t-test                                  Two-sample t-test
Wilcoxon signed-rank test                      Wilcoxon rank-sum test (equivalently, Mann-Whitney U test)
Repeated-measures ANOVA, time effect           ANOVA; repeated-measures ANOVA, group*time effect
McNemar’s test                                 Difference in proportions, chi-square test, or relative risk
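To make the distinction concrete, a minimal sketch with simulated change scores (all numbers invented): the within-group test asks whether one group changed from baseline; the between-group test compares the two groups’ changes.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated change scores (after minus before) for 20 subjects per group.
change_treated = rng.normal(2.0, 3.0, 20)
change_control = rng.normal(0.5, 3.0, 20)

# Within-group: did the treated group change from baseline? (A one-sample
# test on change scores is equivalent to a paired t-test on before/after.)
print(stats.ttest_1samp(change_treated, 0.0).pvalue)

# Between-group: did the treated group change MORE than the control group?
print(stats.ttest_ind(change_treated, change_control).pvalue)

Only the between-group p-value speaks to whether the groups differ.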
Also applies to interactions…
Similarly, “we found a significant effect in subgroup 1 but not subgroup 2” does not constitute proof of an interaction.
  For example, if the effect of a drug is significant in men, but not in women, this is not proof of a drug-gender interaction.
Within-subgroup significance vs. interaction
Rates of biochemically verified prolonged abstinence at 3, 6, and 12 months
from a four-arm randomized trial of smoking cessation*

Months after   Weight-focused counseling                 Standard counseling                      P-value for
quit target   Bupropion    Placebo     P-value,      Bupropion    Placebo     P-value,      interaction between
date          abstinence   abstinence  bupropion     abstinence   abstinence  bupropion     bupropion and
              (n=106)      (n=87)      vs. placebo   (n=89)       (n=67)      vs. placebo   counseling type**
3             41%          18%         .001          33%          19%         .07           .42
6             34%          11%         .001          21%          10%         .08           .39
12            24%          8%          .006          19%          7%          .05           .79

*From Tables 2 and 3: Levine MD, Perkins KS, Kalarchian MA, et al. Bupropion and Cognitive Behavioral Therapy for Weight-Concerned Women Smokers. Arch Intern Med 2010;170:543-550.
**Interaction p-values were newly calculated from logistic regression based on the abstinence rates and sample sizes shown in this table.
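The footnote’s approach can be sketched directly from the table (grouped logistic regression via statsmodels; rounding the rates to whole counts is an approximation). For month 3:

import numpy as np
import statsmodels.api as sm

# Per arm: (bupropion?, weight-focused?, abstinence rate, n), from the table.
arms = [(1, 1, 0.41, 106), (0, 1, 0.18, 87), (1, 0, 0.33, 89), (0, 0, 0.19, 67)]
drug = np.array([a[0] for a in arms])
counseling = np.array([a[1] for a in arms])
successes = np.array([round(a[2] * a[3]) for a in arms])
failures = np.array([a[3] for a in arms]) - successes

# Columns: intercept, drug, counseling, drug*counseling interaction.
X = sm.add_constant(np.column_stack([drug, counseling, drug * counseling]))
fit = sm.GLM(np.column_stack([successes, failures]), X,
             family=sm.families.Binomial()).fit()
print(f"interaction p-value: {fit.pvalues[3]:.2f}")  # should land near .42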
Statistical Power
Statistical power is the probability of finding an effect if it’s real.
Can we quantify how much power we have for given sample sizes?

Study 1: 263 cases, 1241 controls
[Figure: null distribution (difference = 0, SE = 3.3) vs. the clinically relevant alternative (difference = 10%). The rejection region is any value ≥ 6.5 (0 + 3.3 × 1.96); for a 5% significance level the one-tail area is 2.5% (Zα/2 = 1.96). Power = the chance of being in the rejection region if the alternative is true = the area (shown in yellow) to the right of the critical value. Power here > 80%.]
Study 1: 50 cases, 50 controls
[Figure: with the smaller samples the SE grows to 10, so the critical value = 0 + 10 × 1.96 = 20 (Zα/2 = 1.96, one-tail area 2.5%). Power is closer to 20% now.]
Study 2: 18 treated, 72 controls, SD = 2
[Figure: critical value = 0 + 0.52 × 1.96 ≈ 1; clinically relevant alternative: difference = 4 points. Power is nearly 100%!]

Study 2: 18 treated, 72 controls, SD = 10
[Figure: critical value = 0 + 2.59 × 1.96 ≈ 5. Power is about 40%.]

Study 2: 18 treated, 72 controls, effect size = 1.0
[Figure: critical value = 0 + 0.52 × 1.96 ≈ 1; clinically relevant alternative: difference = 1 point. Power is about 50%.]
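These power figures can be reproduced with the normal approximation power ≈ 1 − Φ(Zα/2 − effect/SE), ignoring the negligible far-tail term. A minimal sketch (the function name is mine; the standard errors are read off the slides):

from scipy.stats import norm

def power(effect: float, se: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test."""
    return 1 - norm.cdf(norm.ppf(1 - alpha / 2) - effect / se)

print(power(effect=10, se=3.3))   # study 1, 263 vs. 1241: ~0.86 (>80%)
print(power(effect=10, se=10))    # study 1, 50 vs. 50: ~0.17
print(power(effect=4, se=0.52))   # study 2, SD = 2: ~1.00
print(power(effect=4, se=2.59))   # study 2, SD = 10: ~0.34
print(power(effect=1, se=0.52))   # study 2, 1-point difference: ~0.49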
Factors Affecting Power
1. Size of the effect
2. Standard deviation of the characteristic
3. Sample size
4. Significance level desired
1. Bigger difference from the null mean
[Figure: null vs. clinically relevant alternative distributions of average weight from samples of 100.]
2. Bigger standard deviation
[Figure: the same distributions, wider.]
3. Bigger sample size
[Figure: the same distributions, narrower.]
4. Higher significance level
[Figure: the same distributions with a larger rejection region.]
Sample size calculations
Based on these elements, you can write a formal mathematical equation that relates power, sample size, effect size, standard deviation, and significance level…
Simple formula for difference in proportions
Sample size in each group (assumes equal-sized groups):

    n = 2 · p̄(1 − p̄) · (Zβ + Zα/2)² / (p₁ − p₂)²

where:
  p̄(1 − p̄) = a measure of variability (similar to standard deviation)
  (p₁ − p₂) = the effect size (the difference in proportions)
  Zβ = represents the desired power (typically .84 for 80% power)
  Zα/2 = represents the desired level of statistical significance (typically 1.96)
Simple formula for difference in means
Sample size in each group (assumes equal-sized groups):

    n = 2 · σ² · (Zβ + Zα/2)² / (difference)²

where:
  σ = the standard deviation of the outcome variable
  difference = the effect size (the difference in means)
  Zβ = represents the desired power (typically .84 for 80% power)
  Zα/2 = represents the desired level of statistical significance (typically 1.96)
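Both formulas are easy to implement; a minimal sketch (function names mine), taking p̄ as the average of the two proportions, a common convention:

from scipy.stats import norm

def n_per_group_proportions(p1, p2, power=0.80, alpha=0.05):
    """Sample size per group to detect a difference in proportions."""
    z_beta, z_alpha = norm.ppf(power), norm.ppf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    return 2 * p_bar * (1 - p_bar) * (z_beta + z_alpha) ** 2 / (p1 - p2) ** 2

def n_per_group_means(sigma, difference, power=0.80, alpha=0.05):
    """Sample size per group to detect a difference in means."""
    z_beta, z_alpha = norm.ppf(power), norm.ppf(1 - alpha / 2)
    return 2 * sigma ** 2 * (z_beta + z_alpha) ** 2 / difference ** 2

print(n_per_group_proportions(0.30, 0.20))        # ~294 per group (hypothetical)
print(n_per_group_means(sigma=10, difference=5))  # ~63 per group (hypothetical)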
Sample size calculators on the web…
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/PowerSampleSize
http://calculators.stat.ucla.edu
http://hedwig.mgh.harvard.edu/sample_size/size.html
These sample size calculations are idealized
•They do not account for losses to follow-up (prospective studies); a rough fix is sketched below
•They do not account for non-compliance (for an intervention trial or RCT)
•They assume that individuals are independent observations (not true in clustered designs)
•Consult a statistician!
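One rough adjustment, offered as a sketch and not a substitute for that consultation: inflate the calculated n so that the expected number of completers matches the target (the 20% dropout rate is an arbitrary example).

import math

def inflate_for_dropout(n_per_group: float, dropout_rate: float) -> int:
    """Enroll enough so the expected number completing equals n_per_group."""
    return math.ceil(n_per_group / (1 - dropout_rate))

print(inflate_for_dropout(63, 0.20))  # need 63 completers -> enroll 79 per group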
Review Question 1
Which of the following elements does not increase statistical power?
a. Increased sample size
b. Measuring the outcome variable more precisely
c. A significance level of .01 rather than .05
d. A larger effect size.
Review Question 2
Most sample size calculators ask you to input a value for σ. What are they asking for?
a. The standard error
b. The standard deviation
c. The standard error of the difference
d. The coefficient of deviation
e. The variance
Review Question 3
For your RCT, you want 80% power to detect a reduction of 10 points or more in the treatment group relative to placebo. What is 10 in your sample size formula?
a. Standard deviation
b. Mean change
c. Effect size
d. Standard error
e. Significance level
Homework
Problem Set 5
Reading: The Problem of Multiple Testing; Misleading Comparisons: The Fallacy of Comparing Statistical Significance (on Coursework)
Reading: Chapters 22-29, Vickers
Journal article/article review sheet