Transcript Document
Biostatistics in Practice
Session 4:
Study Size for Precision or Power
Peter D. Christenson
Biostatistician
http://research.LABioMed.org/Biostat
Session 4 Issue
How many subjects?
Session 4 Preparation
We have been using a recent study on hyperactivity in children fed diets with various amounts of food additives to illustrate the concepts in this course. The questions below, based on this paper, are intended to prepare you for Session 4, which is on determining the size of a study.
1. How many children were deemed necessary
to complete the entire study? Use the
second column on the 4th page of the paper.
Session 4 Preparation #1
Session 4 Preparation #2
2. The authors accounted for some children to
start, but not complete the study. What
percentage of "dropouts" did they build into
their calculations?
The statistical requirements are for 80
“evaluable” subjects. They decided on a study
size of 120, so they were allowing up to 40/120 =
33% of subjects to not complete.
Session 4 Preparation #3
3. The authors will perform a test similar to the
t-test we discussed last week, to conclude
whether there is evidence that hyperactivity
differs under Mix A than placebo. There are
two mistakes that they may make in this
decision. What are they?
I. Conclude Mix A ≠ Placebo, but Mix A = Placebo
II. Conclude Mix A = Placebo, but Mix A ≠ Placebo
Session 4 Preparation #4 and #5
4. How large a difference between Mix A and
placebo do they want to detect?
5. Does the value of 0.32 in the study size
description (second column on the 4th page)
refer to a difference? They seem to imply it is a
SD. Based on what we have said about tests
comparing "signal" to "noise", do you think both
a difference and SD are relevant for
determining the study size?
Session 4 Preparation #4 and #5
They want to detect a difference Δ of 0.32 in GHA.
[ Smallest clinically relevant Δ? ]
Both the Δ and SD need to be accounted for.
Effect size = Δ / SD = “# of SDs”.
Remember, reference range = 4 to 6 SDs.
For this study (unusually), GHA is scaled to have an SD
of 1, so Δ = effect size = 0.32.
Session 4 Goals
Review estimating and testing
Δ, SD and N in estimating and testing
False positive and false negative conclusions from
tests
What is needed to determine study size
Software for study size
Review Estimation
Typically:
1. Have sample of N representing “all”.
2. Find mean and SD from the N units.
3. Expect new unit to be within mean ± 2SD.
4. Confident (95%) that mean of all is in
mean ± 2SD/√N.
May have this info for one or multiple groups.
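The four estimation steps above can be sketched in Python. The sample values below are hypothetical illustrative data, and the ±2 multipliers are the usual rough approximations to 1.96:

```python
import math

# Hypothetical sample of N measurements representing "all"
sample = [4.9, 5.3, 6.1, 4.4, 5.8, 5.0, 6.3, 4.7, 5.5, 5.2]
n = len(sample)

# 2. Find mean and SD from the N units
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# 3. Expect a new unit to be within mean +/- 2 SD (reference range)
ref_range = (mean - 2 * sd, mean + 2 * sd)

# 4. 95% confident that the mean of "all" is in mean +/- 2 SD / sqrt(N)
ci = (mean - 2 * sd / math.sqrt(n), mean + 2 * sd / math.sqrt(n))

print(f"mean = {mean:.2f}, SD = {sd:.2f}")
print(f"reference range: {ref_range[0]:.2f} to {ref_range[1]:.2f}")
print(f"95% CI for the mean: {ci[0]:.2f} to {ci[1]:.2f}")
```

Note that the confidence interval for the mean is √N times narrower than the reference range for a single new unit.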
Study Size to Achieve Precision
Precision refers to how well a measure is estimated.
Margin of error = the ± value (half-width) of the 95%
confidence interval.
Lower margin of error ↔ greater precision.
To achieve a specified margin of error, say d, solve the
CI formula for N:
For a mean, d = 2SD/√N, so N = (2SD/d)².
For a proportion p, d = 2√(p(1-p)/N) ≤ 1/√N.
Most polls use N ≈ 1000, so the margin of error on a % is ≈ 3%.
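Solving the CI half-width for N can be sketched in Python. The 2 is the rounded 1.96; the example SD is the MAP pilot value used later in this session:

```python
import math

def n_for_mean(sd, d):
    """Subjects needed so the 95% CI for a mean has half-width d: N = (2*SD/d)^2."""
    return math.ceil((2 * sd / d) ** 2)

def margin_for_proportion(n, p=0.5):
    """95% margin of error for a proportion: d = 2*sqrt(p*(1-p)/n) <= 1/sqrt(n)."""
    return 2 * math.sqrt(p * (1 - p) / n)

# Mean example: SD = 8.16 (MAP pilot), want margin of error d = 2 mm Hg
print(n_for_mean(8.16, 2))                          # N = (2*8.16/2)^2 -> 67

# Poll example: N = 1000 gives a margin of error of about 3%
print(round(100 * margin_for_proportion(1000), 1))  # ~3.2
```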
Review Statistical Tests
1. Calculate a standardized quantity for the
particular test, a “test statistic”:
• Often: t = (Mean – Expected) / SE(Mean)
If 1 group, Mean may be a change score.
If 2 groups, Mean may be the difference between
means for two groups.
Expected = 0 if no effect.
Looking for evidence to contradict “no effect”.
Review Statistical Tests
2. Compare the test statistic to the range of values it
should be if expectations are correct.
Often: The range has approximately a normal bell curve.
3. Declare “effect” if test statistic is too extreme,
relative to this range.
Often: |test statistic| >~2 → Declare effect.
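The three steps above can be sketched for a one-group test. The change scores below are hypothetical illustrative data; "no effect" means an expected mean change of 0:

```python
import math

# Hypothetical change scores for one group; Expected = 0 if no effect
changes = [1.2, 0.8, -0.3, 1.9, 0.5, 1.1, 0.2, 1.4, 0.9, 0.6]
n = len(changes)

# 1. Test statistic: t = (Mean - Expected) / SE(Mean)
mean = sum(changes) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in changes) / (n - 1))
se = sd / math.sqrt(n)
t = (mean - 0) / se

# 2-3. Compare to the range expected under "no effect"; |t| > ~2 is too extreme
print(f"t = {t:.2f} ->", "declare effect" if abs(t) > 2 else "declare no effect")
```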
t-Test
Declare effect if the test statistic t = (mean – expected)/(SD/√N) is "too extreme".
[Figure: bell curve of t expected under "no effect": 95% chance of falling in the central range, 2.5% in each tail. Beyond a tail cutoff → declare effect; within → declare no effect.]
Convention: "Too extreme" means < 5% chance of wrongly declaring an effect.
t-Test
Declare effect if the test statistic is "too extreme".
[Figure: same bell curve under "no effect", 2.5% in each tail, 95% chance in the center; the convention keeps the chance of wrongly declaring an effect below 5%.]
But, what are the chances of wrongly declaring no effect?
t-Test
Declare effect if the test statistic is "too extreme".
[Figure: same bell curve under "no effect".]
But, what are the chances of wrongly declaring no effect? To answer, we need a similar curve for the range of values expected when there is an effect.
Two Possible Errors from t-test
Consider just one possible real effect, the value 3.
[Figure: two bell curves of Δ = effect (difference between group means; just Δ, not t = Δ/SE(Δ)): one centered at 0 (no real effect) and one centered at 3 (real effect = 3). Effect observed in the study = 1.13. Values beyond the cutoff → conclude effect.]
\\\ = Probability: conclude effect, but no real effect (5%).
/// = Probability: conclude no effect, but real effect (41%).
Graphical Representation of t-test
Suppose we need stronger proof; i.e., shift the cutoff to the right.
[Figure: the same two curves (no real effect centered at 0; real effect = 3; effect observed in the study = 1.13), with the cutoff moved rightward.]
Then, the chance of a false positive is reduced to ~1%, but the chance of a false negative is increased to ~60%.
Power of a Study
Statistical power is the sensitivity of a study to detect real effects, if they exist.
In the example two slides back, power = 100% – 41% = 59%.
Two Possible Errors in a Diagnostic Test

Diagnosis:    | Truth: No Disease     | Truth: Disease
No Disease    | Correct (Specificity) | Error
Disease       | Error                 | Correct (Sensitivity)

Specificity ↓ as Sensitivity ↑.
Sensitivity: want high for a screening test. Specificity: need high in a follow-up test.
Analogy with Diagnostic Testing

Study Claims: | Truth: No Effect      | Truth: Effect
No Effect     | Correct (Specificity) | Error (Type II)
Effect        | Error (Type I)        | Correct (Sensitivity = Power)

Typical: set α = 0.05, so Specificity = 95%. Power: maximize; choose N for 80%.
Summary: Factors Related to Study Size
Five factors are inter-related. Fixing four of these specifies the
fifth:
1. Study size, N.
2. Power (often 80% is desirable).
3. p-value cutoff (level of significance, e.g., 0.05).
4. Magnitude of the effect to be detected (Δ).
5. Heterogeneity among subjects (SD).
The next slide shows how these factors (except SD) are
typically presented in a study protocol.
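The inter-relation of these factors can be sketched with the usual normal-approximation formulas. This is a sketch, not the exact t-based calculation done by study-size software; Φ is the standard normal CDF, and 1.96 and 0.8416 are the cutoffs corresponding to α = 0.05 (two-sided) and 80% power:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power_paired(n, delta, sd, z_alpha=1.96):
    """Approximate power of a 1-sample/paired test (normal approximation)."""
    return phi(math.sqrt(n) * delta / sd - z_alpha)

def n_paired(delta, sd, z_alpha=1.96, z_beta=0.8416):
    """Approximate N for a 1-sample/paired test: N = (z_a + z_b)^2 * (SD/delta)^2."""
    return math.ceil((z_alpha + z_beta) ** 2 * (sd / delta) ** 2)

# Fixing Delta, SD, alpha, and power specifies N (hyperactivity example):
print(n_paired(delta=0.32, sd=1))          # ~77; exact t-based software gives 79
# Fixing N instead specifies the power:
print(round(power_paired(77, 0.32, 1), 2))
```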
Quote from Local Protocol Example
The following table presents detectable differences, with p=0.05 and
80% power, for different study sizes.
Total Number | Detectable Difference in Change | Detectable Difference in Change in
of Subjects  | in Mean MAP (mm Hg)(1)          | Mean Number of Vasopressors(2)
     20      |            10.9                 |              0.77
     40      |             7.4                 |              0.49
     60      |             6.0                 |              0.39
     80      |             5.2                 |              0.34
    100      |             4.6                 |              0.30
    120      |             4.2                 |              0.27
Thus, with a total of the planned 80 subjects, we are 80% sure to detect
(p<0.05) group differences if treatments actually differ by at least 5.2 mm
Hg in MAP change, or by a mean 0.34 change in number of vasopressors.
Comments on the Previous Table
• Typically power = 80% and almost always p < 0.05.
• SD was not mentioned. There may be several estimates from other studies (different populations, intervention characteristics such as dosage, time, etc.). Here, a pilot study exactly like the trial was performed by the same investigators.
• Detectable difference refers to the unknown true difference for "all", not the difference that will eventually be seen in the N study subjects.
• N ↑ as detectable difference ↓.
• So, the major consideration is usually a tradeoff between N and the detectable difference.
Free Study Size Software
www.stat.uiowa.edu/~rlenth/Power
Local Protocol Example: Calculations
Pilot data: SD = 8.16 for ΔMAP in 36 subjects.
For p-value < 0.05, power = 80%, N = 40/group, the
detectable Δ of 5.2 in the previous table is found as:
Δ ≈ 2 × SD × √(7.85/total N) = 2 × 8.16 × √(7.85/80) ≈ 5.1,
close to the table's 5.2 (the software uses the exact t distribution).
Hyperactivity Study Size
Study is 1-sample or paired (for each age group).
SD = 1, Δ = 0.32.
Use p-value < 0.05. Want power = 80%.
Solve for N in software to get N = 79.
Study Size for Some Other Study Types
1. Phase I: Dose escalation. Safety, not efficacy; no
power calculation. Use N=3 at a low dose; if safe, N=3 at
a higher dose, etc.
2. Phase II: Small, primarily safety; look for enough
evidence of efficacy to go on to Phase III. Often
staged: e.g., if 3/10 respond, test 10 more, etc.
3. Mortality studies: Patterns of deaths over time can
be used in sample size calculations. Software not
in the online package.
Approximate Formulas for Study Size
1. Two-sample t-test:
Total N ~ 4 × 7.85 × (SD/Δ)²
MAP Example: 4 × 7.85 × (8.16/5.2)² = 77 ~ 80
2. Paired t-test:
N ~ 7.85 × (SD/Δ)²
Hyperactivity Example:
7.85 × (1/0.32)² = 77 ~ 80
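The two approximate formulas can be checked directly in Python; 7.85 is the rounded (1.96 + 0.84)²:

```python
# Approximate study-size formulas; 7.85 is the rounded (1.96 + 0.84)^2
def total_n_two_sample(sd, delta):
    """Total N for a two-sample t-test: 4 * 7.85 * (SD/delta)^2."""
    return 4 * 7.85 * (sd / delta) ** 2

def n_paired_approx(sd, delta):
    """N for a paired/1-sample t-test: 7.85 * (SD/delta)^2."""
    return 7.85 * (sd / delta) ** 2

# MAP example: SD = 8.16, detectable difference 5.2 mm Hg
print(round(total_n_two_sample(8.16, 5.2)))  # 77, close to the planned 80

# Hyperactivity example: SD = 1, Delta = 0.32
print(round(n_paired_approx(1, 0.32)))       # 77, vs. 79 from exact software
```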
Summary: Study Size and Power
1. Power analysis assures that effects of a specified
magnitude can be detected.
2. Five factors including power are inter-related.
Fixing four of these specifies the fifth.
3. For comparing means, need pilot or data from
other studies to estimate SD for the outcome
measure. Comparing %s does not require SD.
4. Power analysis helps support the believability of
studies whose conclusions turn out to be negative.