Multiple Comparisons and Multiple Tests: State of the Practice


A Course in Multiple
Comparisons and Multiple Tests
Peter H. Westfall, Ph.D.
Professor of Statistics, Department of Information
Systems and Quantitative Sciences
Texas Tech University
1
Learning Outcomes
• Elucidate reasons that multiple comparisons procedures (MCPs)
are used, as well as their controversial nature
• Know when and how to use classical interval-based MCPs,
including Tukey, Dunnett, and Bonferroni
• Understand how MCPs affect power
• Elucidate the definition of closed testing procedures (CTPs)
• Understand specific types of CTPs, their benefits and drawbacks
• Distinguish false discovery rate (FDR) from familywise error rate
(FWE)
• Understand general issues regarding Bayesian MCPs
2
Outline of Material
Introduction. Overview of Problems, Issues, and Solutions, Regulatory and Ethical
Perspectives, Families of Tests, Familywise Error Rate, Bonferroni. (pp. 5-21)
Interval-Based Multiple Inferences in the standard linear models framework. One-way
ANOVA and ANCOVA, Tukey, Dunnett, and Monte Carlo Methods, Adjusted p-values, general
contrasts, Multivariate T distribution, Tight Confidence Bands, Treatment×Covariate
Interaction, Subgroup Analysis (pp. 22-55)
Power and Sample Size Determinations for multiple comparisons. (pp. 56-65)
Stepwise and Closed Testing Procedures I: P-value-Based Methods. Closure Method,
Global Tests; Holm, Hommel, Hochberg and Fisher combined methods for p-values (pp. 66-90)
Stepwise and Closed Testing Procedures II: Fixed Sequences, Gatekeepers and I-U tests:
Fixed Sequence tests, Gatekeeper procedures, Multiple hypotheses in a gate, Intersection-union
tests; with application to dose response, primary and secondary endpoints, bioequivalence and
combination therapies (pp. 91-101)
3
Outline (Continued)
Stepwise and Closed Testing Procedures III: Methods that use logical constraints and
correlations. Lehmacher et al. Method for Multiple endpoints; Range-Based and F-based
ANOVA Tests, Fisher’s protected LSD, Free and Restricted Combinations, Shaffer-Type
Methods for dose comparisons and subgroup analysis (pp. 102-118)
Multiple nonparametric and semiparametric tests: Bootstrap and Permutation-based
Closed testing. PROC MULTTEST, examples with multiple endpoints, genetic associations, gene
expression, binary data and adverse events (pp. 119-139)
More complex models and FWE control: Heteroscedasticity, Repeated measures, and large
sample methods. Applications: multiple treatment comparisons, crossover designs, logistic
regression of cure rates (pp. 140-152)
False Discovery Rate: Benjamini and Hochberg's method, comparison with FWE-controlling
methods (pp. 153-158)
Bayesian methods: Simultaneous credible intervals, ranking probabilities and loss functions,
PROC MIXED posterior sampling, Bayesian testing of multiple endpoints (pp. 159-178)
Conclusion, discussion, references (pp. 179-184)
4
Sources of Multiplicity
• Multiple variables (endpoints)
• Multiple timepoints
• Subgroup analysis
• Multiple comparisons
• Multiple tests of the same hypothesis
• Variable and model selection
• Interim analysis
• Hidden multiplicity: file drawers, outliers
5
The Problem:
“Significant” results may fail to replicate.
Documented cases:
• Ioannidis (JAMA 2005)
6
An Example
• Phase III clinical trial
• Three arms – Placebo, AC, Drug
• Endpoints: signs and symptoms
• Measured at weekly visits
• Baseline covariates
7
Example-Continued
• ‘Features’ displayed at trial conclusion:
  • Trends
  • Baseline-adjusted comparisons of raw data
  • Baseline-adjusted % changes
  • Nonparametric and parametric tests
  • Specific endpoints and combinations of endpoints
  • Particular week results
  • AC and Placebo comparisons
• Fact: The features that “look the best” are biased.
8
Example Continued –
Feature Selection
• ‘Effect size’ is a feature
• Effect size = (mean difference)/sd
  • Dimensionless
  • .2 = ‘small’, .5 = ‘medium’, .8 = ‘large’
• Estimated effect sizes: F1, F2, …, Fk
• What if you select max{F1, F2, …, Fk} and publish it?
9
The Scientific Concern
[Scatterplot: effect size of the selected effect in the first study (x-axis, 0 to 1.5) versus replicated effect size (y-axis, 0 to 1.5), in 1000 replicated studies.]
10
Feature Selection Model
• Clinical trials simulation
• Real data used
• Conservative!
• If you must know more:
  • Fj = μj + ej, j = 1, …, 20
  • Error terms ej are N(0, .2²)
  • True effect sizes μj are N(.3, .1²)
  • Features Fj are highly correlated
11
Key Points:
(i) Multiplicity invites Selection
(ii) Selection has an EFFECT
Just like effects due to
• Treatment
• Confounding
• Learning
• Nonresponse
• Placebo
12
Published Guidelines
• ICH Guidelines
• CPMP Points to Consider
• CDRH Statistical Guidance
• ASA Ethical Guidelines
13
Regulatory/Journal/Ethical/Professional
Concerns
• Replicability (good science)
• Fairness
• Regulatory report:
The drug company reported efficacy at p = .047.
We repeated the analysis in several different
ways that the company might have done. In
20 re-analyses of the data, 18 produced p-values
greater than .05. Only one of the 20
re-analyses produced a p-value smaller than .047.
14
Multiple Inferences: Notation
• There is a “family” of k inferences
• Parameters are θ1, …, θk
• Null hypotheses are
H01: θ1 = 0, …, H0k: θk = 0
15
Comparisonwise Error Rate (CER)
Intervals:
CERj = P(Intervalj incorrect)
Tests:
CERj = P(Reject H0j | H0j is true)
Usually CER = α = .05
16
Familywise Error Rate (FWE)
Intervals:
FWE = 1 - P(all intervals are correct)
Tests:
FWE = P(reject at least one true null)
17
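As a quick numerical illustration of why the FWE matters (in Python for convenience, rather than the SAS used elsewhere in these slides): for k independent tests, each run at comparisonwise level α, FWE = 1 − (1 − α)^k.

```python
# FWE for k independent tests, each at comparisonwise level alpha:
# FWE = 1 - P(no Type I error in any test) = 1 - (1 - alpha)^k
def fwe_independent(alpha, k):
    return 1 - (1 - alpha) ** k

# With 10 independent tests at alpha = .05, the familywise error
# rate is already about 40%.
print(round(fwe_independent(0.05, 10), 3))  # 0.401
```

This is the sense in which "significant" findings accumulate false positives as the family grows.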
False Discovery Rate
• FDR = E(proportion of rejections that are incorrect)
• Let R = total # of rejections
• Let V = # of erroneous rejections
• FDR = E(V/R)
(0/0 defined as 0).
• FWE = P(V>0)
18
Bonferroni Method
• Identify the family of inferences
• Identify the number of elements (k) in the family
• Use α/k for all inferences
• Ex: With k = 36, p-values must be less than
0.05/36 = 0.0014 to be “significant”
19
FWE Control for Bonferroni
FWE = P(p0j1 ≤ .05/36 or … or p0jm ≤ .05/36 | H0j1, …, H0jm true)
    ≤ P(p0j1 ≤ .05/36) + … + P(p0jm ≤ .05/36)
    = (.05)m/36 ≤ .05
using the Bonferroni inequality P(A∪B) ≤ P(A) + P(B).
20
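The inequality above can be checked by simulation (a Python sketch, assuming independent Uniform(0,1) p-values under the true nulls — the Bonferroni bound holds under any dependence):

```python
import random

# Monte Carlo check that Bonferroni controls the FWE: simulate m true
# null hypotheses with independent Uniform(0,1) p-values, reject when
# p <= alpha/m, and count how often at least one true null is rejected.
def simulate_fwe(m=36, alpha=0.05, nsim=100_000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(nsim):
        if any(rng.random() <= alpha / m for _ in range(m)):
            hits += 1
    return hits / nsim

# The estimate lands slightly below alpha = .05 (exact value here is
# 1 - (1 - .05/36)^36, about .0488).
print(simulate_fwe())
```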
“Families” in clinical trials1
Efficacy
• Main interest – Primary & Secondary:
  Approval and labeling depend on these.
  Tight FWE control needed.
• Lesser interest – Depending on goals
  and reviewers, FWE-controlling methods
  might be needed.
• Supportive tests – mostly descriptive:
  FWE control not needed.
Safety
• Serious and known treatment-related AEs:
  FWE control not needed.
• All other AEs:
  Reasonable to control FWE (or FDR).
Exploratory tests – investigate new indications; future
trials needed to confirm – do what makes sense.
21
1Westfall, P. and Bretz, F. (2003). Multiplicity in Clinical Trials. Encyclopedia of Biopharmaceutical Statistics, second edition, Shein-Chung Chow, ed.,
Marcel Dekker Inc., New York, pp. 666-673.
Classical Single-Step Testing and
Interval Methods to Control FWE
Simultaneous confidence intervals;
Adjusted p-values
Dunnett method
Tukey’s method
Simulation-based methods for general comparisons
22
“Specificity” and “Sensitivity”
If you want …                                 …then use
• Estimates of effect sizes & error margins → Simultaneous Confidence Intervals
• Confident inequalities                    → Stepwise or closed tests
• Overall test                              → F-test, O’Brien, etc.
23
The Model
Y = Xβ + ε, where ε ~ N(0, σ²I)
• Includes ANOVA, ANCOVA, regression
• For group comparisons, covariate adjustment
• Not valid for survival analysis, binary data, multivariate data
24
Example: Pairwise Comparisons
against Control
Table 3.2: Means and Standard Deviations of Mouse Growth Data

Group      Control      1      2      3      4      5      6
Mean        105.38  95.90  80.48  72.14  91.88  84.68  74.24
Std. Dev.    13.44  23.89  12.68   8.41   9.44  18.35   7.81

Goal: Estimate all mean differences from control
and provide simultaneous 95% error margins:
What c_α to use?
25
Comparison of Critical Values
Mouse Growth Data
ni = 4 mice per group (6 doses and one control)
df = 21.
There are 6 comparisons.
Method         α      α′      c_α
Bonferroni    0.05   0.0083   2.912
Sidak         0.05   0.0085   2.903
Dunnett*      0.05   0.0110   2.790
Unadjusted**  0.05   0.0500   2.080

* α′ calculated from c_α
** Does not control the FWE
26
Results - Dunnett
The GLM Procedure
Dunnett's t Tests for gain
NOTE: This test controls the Type I experimentwise error for comparisons
of all treatments against a control.

Alpha                              0.05
Error Degrees of Freedom             21
Error Mean Square              210.0048
Critical Value of Dunnett's t   2.78972
Minimum Significant Difference   28.586

Comparisons significant at the 0.05 level are indicated by ***.

                 Difference     Simultaneous
g                Between        95% Confidence
Comparison       Means          Limits
1 - 0             -9.48         -38.07   19.11
4 - 0            -13.50         -42.09   15.09
5 - 0            -20.70         -49.29    7.89
2 - 0            -24.90         -53.49    3.69
6 - 0            -31.14         -59.73   -2.55   ***
3 - 0            -33.24         -61.83   -4.65   ***
27
c_α is the 1−α quantile of the distribution of
max_i |Z_i − Z_0| / (2χ²_df/df)^{1/2}, called
Dunnett’s two-sided range distribution.
28
Adjusted p-Values
Definition:
Adjusted p-value = smallest FWE level at which the hypothesis is rejected,
or equivalently,
the FWE level for which the confidence interval has “0” as a boundary.
29
Adjusted p-values for Dunnett
proc glm data=tox;
class g;
model gain=g;
lsmeans g/adjust=dunnett pdiff;
run;
The GLM Procedure
Least Squares Means
Adjustment for Multiple Comparisons: Dunnett

        GAIN          Pr > |T| H0: LSMEAN=CONTROL
G       LSMEAN        adjusted   (unadjusted)
0       105.380000
1        95.900000    0.8608     (0.3654)
2        80.480000    0.1034     (0.0242)
3        72.140000    0.0188     (0.0039)
4        91.880000    0.6090     (0.2019)
5        84.680000    0.2196     (0.0563)
6        74.240000    0.0294     (0.0062)
30
Example: All Pairwise Comparisons
Table 3.2: Means and Standard Deviations of Mouse Growth Data

Group      Control      1      2      3      4      5      6
Mean        105.38  95.90  80.48  72.14  91.88  84.68  74.24
Std. Dev.    13.44  23.89  12.68   8.41   9.44  18.35   7.81

Goal: Estimate all mean differences and provide
simultaneous 95% error margins:
What c_α to use?
31
Comparison of Critical Values
Mouse Growth Data
ni = 4 mice per group (6 doses and one control)
df = 21.
There are 21 pairwise comparisons.
Method         α      α′      c_α
Bonferroni    0.05   0.0024   3.453
Sidak         0.05   0.0024   3.443
Tukey*        0.05   0.0038   3.251
Unadjusted**  0.05   0.0500   2.080

* α′ calculated from c_α
** Does not control the FWE

SAS file:
data;
qval = probmc("RANGE",.,.95,21,7);
c_alpha = qval/sqrt(2); run;
proc print; run;
32
Tukey Comparisons
Alpha= 0.05  df= 21  MSE= 210.0048
Critical Value of Studentized Range= 4.597
Minimum Significant Difference= 33.311
Means with the same letter are not significantly different.

Tukey Grouping     Mean    N    G
      A          105.38    4    0
      A           95.90    4    1
      A           91.88    4    4
      A           84.68    4    5
      A           80.48    4    2
      A           74.24    4    6
      A           72.14    4    3
33
Tukey Adjusted p-Values
General Linear Models Procedure
Least Squares Means
Adjustment for multiple comparisons: Tukey

      GAIN
G    LSMEAN
0   105.380
1    95.900
2    80.480
3    72.140
4    91.880
5    84.680
6    74.240

Pr > |T| H0: LSMEAN(i)=LSMEAN(j)
i/j      1        2        3        4        5        6        7
1        .     0.9641   0.2351   0.0507   0.8364   0.4319   0.0769
2     0.9641      .     0.7391   0.2810   0.9996   0.9227   0.3806
3     0.2351   0.7391      .     0.9808   0.9172   0.9995   0.9958
4     0.0507   0.2810   0.9808      .     0.4860   0.8771   1.0000
5     0.8364   0.9996   0.9172   0.4860      .     0.9910   0.6102
6     0.4319   0.9227   0.9995   0.8771   0.9910      .     0.9438
7     0.0769   0.3806   0.9958   1.0000   0.6102   0.9438      .
34
Tukey Simultaneous Intervals

         Simultaneous     Difference     Simultaneous
         Lower            Between        Upper
i   j    Confidence Limit Means          Confidence Limit
1   2     -23.831013        9.480000      42.791013
1   3      -8.411013       24.900000      58.211013
1   4      -0.071013       33.240000      66.551013
1   5     -19.811013       13.500000      46.811013
1   6     -12.611013       20.700000      54.011013
1   7      -2.171013       31.140000      64.451013
2   3     -17.891013       15.420000      48.731013
2   4      -9.551013       23.760000      57.071013
2   5     -29.291013        4.020000      37.331013
2   6     -22.091013       11.220000      44.531013
2   7     -11.651013       21.660000      54.971013
3   4     -24.971013        8.340000      41.651013
3   5     -44.711013      -11.400000      21.911013
3   6     -37.511013       -4.200000      29.111013
3   7     -27.071013        6.240000      39.551013
4   5     -53.051013      -19.740000      13.571013
4   6     -45.851013      -12.540000      20.771013
4   7     -35.411013       -2.100000      31.211013
5   6     -26.111013        7.200000      40.511013
5   7     -15.671013       17.640000      50.951013
6   7     -22.871013       10.440000      43.751013
35
ca is (1/) {the 1-a quantile
of the distribution of maxi,i’ |Zi-Zi’|/(c2/df)1/2}, which is
called the Studentized range distribution.
36
Unbalanced Designs and/or
Covariates
• Tukey method is conservative when the
design is unbalanced and/or there are
covariates; otherwise exact
• Dunnett method is conservative when there
are covariates; otherwise exact
• “Conservative” means
{True FWE} < {Nominal FWE};
also means “less powerful”
37
Tukey-Kramer Method for all
pairwise comparisons
• Let c_α be the critical value for the balanced case
using Tukey’s method and the correct df.
• Intervals are ȳ_i − ȳ_j ± c_α s (1/n_i + 1/n_j)^{1/2}
• Conservative (Hayter, 1984 Annals)
38
Exact Method for General
Comparisons of Means
39
Multivariate T-Distribution Details
40
Calculation of “Exact” c_α
• Edwards and Berry: Simple simulation
• Hsu and Nelson: Factor analytic control variate (better)
• Genz and Bretz: Integration using lattice methods (best)
Even with simple simulation, the value c_α can be obtained
with reasonable precision.
Edwards, D., and Berry, J. (1987) The efficiency of simulation-based multiple comparisons. Biometrics,
43, 913-928.
Hsu, J.C. and Nelson, B.L. (1998) Multiple comparisons in the general linear model. Journal of Computational
and Graphical Statistics, 7, 23-41.
Genz, A. and Bretz, F. (1999), Numerical Computation of Multivariate t Probabilities with Application
to Power Calculation of Multiple Contrasts, J. Stat. Comp. Simul. 63, pp. 361-378.
41
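The "simple simulation" route is short enough to sketch here (in Python rather than SAS; sample sizes and seed are illustrative). It draws directly from Dunnett's two-sided range distribution as defined on slide 28, using k = 6 comparisons with control and df = 21 as in the mouse growth data:

```python
import random

# Simulate Dunnett's two-sided critical value c_alpha: the 1-alpha
# quantile of max_i |Z_i - Z_0| / (2*chi2_df/df)^(1/2), where Z_0 is
# shared across the k comparisons (that is the Dunnett correlation).
def dunnett_crit(k=6, df=21, alpha=0.05, nsim=50_000, seed=1):
    rng = random.Random(seed)
    stats = []
    for _ in range(nsim):
        z0 = rng.gauss(0, 1)
        zmax = max(abs(rng.gauss(0, 1) - z0) for _ in range(k))
        chi2 = sum(rng.gauss(0, 1) ** 2 for _ in range(df))
        stats.append(zmax / (2 * chi2 / df) ** 0.5)
    stats.sort()
    return stats[int((1 - alpha) * nsim)]

# Lands near the tabled value 2.790, with simulation error around 0.02.
print(round(dunnett_crit(), 2))
```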
Example: ANCOVA with two
covariates
Y      = Diastolic BP
Group  = Therapy (Control, D1, D2, D3)
X1     = Baseline Diastolic BP
X2     = Baseline Systolic BP
Goal: Compare all therapies, controlling for baseline
proc glm data=research.bpr;
class therapy;
model dbp10 = therapy dbp7 sbp7;
lsmeans therapy/pdiff cl adjust=simulate(nsamp=
10000000 cvadjust seed=121011 report);
run;
quit;
42
Results From ANCOVA
Source     DF   Type III SS    Mean Square    F Value   Pr > F
THERAPY     3    677.429172     225.809724      6.05    0.0006
DBP7        1   6832.878653    6832.878653    183.06    <.0001
SBP7        1     51.123459      51.123459      1.37    0.2435

Least Squares Means for Effect THERAPY
          Difference      Simultaneous 95%
          Between         Confidence Limits for
i   j     Means           LSMean(i)-LSMean(j)
1   2      2.832658       -0.424816    6.090130
1   3      1.328481       -2.099566    4.756527
1   4     -2.536262       -5.981471    0.908947
2   3     -1.504178       -4.846403    1.838047
2   4     -5.368920       -8.734744   -2.003097
3   4     -3.864743       -7.398994   -0.330491

Note: “4” is control
43
Details for Quantile Simulation
Random number seed          121011
Comparison type             All
Sample size                 9999938
Target alpha                0.05
Accuracy radius (target)    0.0002
Accuracy radius (actual)    437E-7
Accuracy confidence         99%

Simulation Results
                95% Quantile   Estimated   99% Confidence
Method          Estimate       Alpha       Limits
Simulated       2.594159       0.0500      0.0500  0.0500
Tukey-Kramer    2.594637       0.0499      0.0499  0.0500
Bonferroni      2.669484       0.0411      0.0410  0.0411
Sidak           2.662029       0.0419      0.0418  0.0419
GT-2            2.660647       0.0420      0.0420  0.0421
Scheffe         2.823701       0.0270      0.0270  0.0270
T               1.974017       0.2017      0.2016  0.2018

NOTE: PROCEDURE GLM used: real time 21.23 seconds
44
Results from ANCOVA - Dunnett
                                  H0:LSMean=Control
THERAPY    DBP10 LSMEAN           Pr > |t|
Dose 1     88.8171113             0.1407
Dose 2     85.9844529             0.0002
Dose 3     87.4886307             0.0140
Placebo    91.3533732

Least Squares Means for Effect THERAPY
          Difference      Simultaneous 95%
          Between         Confidence Limits for
i   j     Means           LSMean(i)-LSMean(j)
1   4     -2.536262       -5.675847    0.603323
2   4     -5.368920       -8.436161   -2.301679
3   4     -3.864743       -7.085470   -0.644015
45
Details for Quantile Simulation - Dunnett
Random number seed          121011
Comparison type             Control, two-sided
Sample size                 9999938
Target alpha                0.05
Accuracy radius (target)    0.0002
Accuracy radius (actual)    139E-7
Accuracy confidence         99%

Simulation Results
                         95% Quantile   Estimated   99% Confidence
Method                   Estimate       Alpha       Limits
Simulated                2.364031       0.0500      0.0500  0.0500
Dunnett-Hsu, two-sided   2.364084       0.0500      0.0500  0.0500
Bonferroni               2.417902       0.0437      0.0437  0.0437
Sidak                    2.411491       0.0444      0.0444  0.0444
GT-2                     2.410664       0.0445      0.0445  0.0445
Scheffe                  2.823701       0.0145      0.0145  0.0145
T                        1.974017       0.1229      0.1229  0.1230

NOTE: PROCEDURE GLM used: real time 19.00 seconds
46
More General Inferences
Question: For what values of the covariate is
treatment A better than treatment B?
47
Discussion of (Treatment × Covariate)
Interaction Example
48
The GLIMMIX Procedure
Computes MC-exact simultaneous confidence intervals
and adjusted p-values for any set of linear functions in a
linear model
49
GLIMMIX syntax
proc glimmix data=research.tire;
class make;
model cost = make mph make*mph;
estimate "10" make 1 -1 make*mph 10 -10,
"15" make 1 -1 make*mph 15 -15,
"20" make 1 -1 make*mph 20 -20,
"25" make 1 -1 make*mph 25 -25,
"30" make 1 -1 make*mph 30 -30,
"35" make 1 -1 make*mph 35 -35,
"40" make 1 -1 make*mph 40 -40,
"45" make 1 -1 make*mph 45 -45,
"50" make 1 -1 make*mph 50 -50,
"55" make 1 -1 make*mph 55 -55,
"60" make 1 -1 make*mph 60 -60,
"65" make 1 -1 make*mph 65 -65,
"70" make 1 -1 make*mph 70 -70
/adjust=simulate(nsamp=10000000 report) cl;
run;
50
Output from PROC GLIMMIX
Simultaneous intervals are Estimate ± 2.648 × StdErr

Label   Estimate   StdErr   tValue   AdjLower   AdjUpper
10      -4.1067    0.9143    -4.49   -6.5279    -1.6854
15      -3.4539    0.8084    -4.27   -5.5947    -1.3131
20      -2.8011    0.7101    -3.94   -4.6815    -0.9207
25      -2.1483    0.6230    -3.45   -3.7981    -0.4985
30      -1.4956    0.5524    -2.71   -2.9585    -0.03260
35      -0.8428    0.5054    -1.67   -2.1812     0.4956
40      -0.1900    0.4887    -0.39   -1.4842     1.1042
45       0.4628    0.5054     0.92   -0.8756     1.8012
50       1.1156    0.5524     2.02   -0.3474     2.5785
55       1.7683    0.6230     2.84    0.1185     3.4181
60       2.4211    0.7101     3.41    0.5407     4.3015
65       3.0739    0.8084     3.80    0.9331     5.2147
70       3.7267    0.9143     4.08    1.3054     6.1479

Bonferroni – critical value is t_{16, .05/(2×13)} = 3.377.
51
Other Applications of Linear
Combinations
• Multiple Trend Tests
  • (0,1,2,3), (0,1,2,4), (0,4,6,7) (carcinogenicity)
  • (0,0,1), (0,1,1), (0,1,2) (recessive/dominant/ordinal genotype effects)
• Subgroup Analysis
  • Subgroups define linear combinations (more on next slide)
52
Subgroup Analysis Example
Data: Yijkl, where i = Trt, Cntrl; j = Old, Yng; k = GoodInit, PoorInit.
Model: Yijkl = μijk + eijkl, where μijk = μ + αi + βj + γk + (αβ)ij + (αγ)ik + (βγ)jk
Subgroup Contrasts:
            μ111   μ112   μ121   μ122   μ211   μ212   μ221   μ222
Overall       ¼      ¼      ¼      ¼     −¼     −¼     −¼     −¼
Older         ½      ½      0      0     −½     −½      0      0
Younger       0      0      ½      ½      0      0     −½     −½
GoodInit      ½      0      ½      0     −½      0     −½      0
PoorInit      0      ½      0      ½      0     −½      0     −½
OldGood       1      0      0      0     −1      0      0      0
OldPoor       0      1      0      0      0     −1      0      0
YoungGood     0      0      1      0      0      0     −1      0
YoungPoor     0      0      0      1      0      0      0     −1
53
Subgroup Analysis Results
Label            Estimate   StdErr   tValue   Probt    Adjp     AdjLower
Overall          0.7075     0.1956   3.62     0.0002   0.0015    0.2460
Older            0.9952     0.2673   3.72     0.0002   0.0011    0.3646
Younger          0.4197     0.2847   1.47     0.0717   0.2605   -0.2521
GoodInitHealth   0.5871     0.2878   2.04     0.0219   0.0984   -0.09197
PoorInitHealth   0.8279     0.2644   3.13     0.0011   0.0068    0.2039
OldGood          0.8748     0.3387   2.58     0.0056   0.0295    0.07566
OldPoor          1.1157     0.3231   3.45     0.0004   0.0026    0.3532
YoungGood        0.2993     0.3562   0.84     0.2014   0.5494   -0.5413
YoungPoor        0.5401     0.3338   1.62     0.0544   0.2091   -0.2476
(SAS code available upon request)
54
Summary
• Include only comparisons of interest.
• Utilize correlations to be less conservative.
• The critical values can be computed exactly only in
balanced ANOVA for all pairwise comparisons, or in
unbalanced ANOVA for comparisons with control.
• Simulation-based methods are “exact” if you let the
computer run for a while. This is my general
recommendation.
55
Power Analysis
• Sample size – design of study
• Power is less when you use multiple
comparisons → larger sample sizes
• Many power definitions
• Bonferroni & independence are convenient
(but conservative) starting points
56
Power Definitions
“Complete Power” = P(Reject all H0i that are false)
“Minimal Power” = P(Reject at least one H0i that is false)
“Individual Power” = P(Reject a particular H0i that is false)
“Proportional Power” = Average proportion of false H0i
that are rejected
57
Power Calculations.
Example: H1 and H2 powered individually at 50%; H3 and H4
powered individually at 80%, all tests independent.
Complete Power = P(reject H1 and H2 and H3 and H4)
  = .5 × .5 × .8 × .8 = 0.16.
Minimal Power = P(reject H1 or H2 or H3 or H4)
  = 1 − P(“accept” H1 and H2 and H3 and H4)
  = 1 − (1−.5) × (1−.5) × (1−.8) × (1−.8) = 0.99.
Individual Power = P(reject H3 (say)) = 0.80. (depends on the test)
Proportional Power = (.5 + .5 + .8 + .8)/4 = 0.65
58
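The arithmetic on this slide (independent tests with individual powers .5, .5, .8, .8) can be reproduced directly — a Python illustration, not part of the course's SAS macros:

```python
# Power definitions for independent tests with the slide's
# individual powers .5, .5, .8, .8.
powers = [0.5, 0.5, 0.8, 0.8]

complete = 1.0
for p in powers:
    complete *= p              # must reject every false null

miss_all = 1.0
for p in powers:
    miss_all *= (1 - p)        # probability of rejecting none
minimal = 1 - miss_all         # reject at least one false null

proportional = sum(powers) / len(powers)

# 0.16 0.99 0.65, matching the slide.
print(round(complete, 2), round(minimal, 2), round(proportional, 2))
```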
Sample Size for Adequate
Individual Power - Conservative
Estimate
59
Individual power of two-tail two-sample Bonferroni t-tests
%let MuDiff = 5;    /* Smallest meaningful difference MUx-MUy
                       that you want to detect */
%let Sigma  = 10.0; /* A guess of the population std. dev. */
%let alpha  = .05;  /* Familywise Type I error probability
                       of the test */
%let k      = 4;    /* Number of tests */
options ls=76;
data power;
cer = &alpha/&k;
do n = 2 to 100 by 2;                    *n=sample size for each group*;
  df = n + n - 2;
  ncp = (&Mudiff)/(&Sigma*sqrt(2/n));    * The noncentrality parameter *;
  tcrit = tinv(1-cer/2, df);             * The critical t value *;
  power = 1 - probt(tcrit, df, ncp) + probt(-tcrit,df,ncp);
  output;
end;
proc print data=power; run;
proc plot data=power;
  plot power*n/vpos=30;
run;
60
Graph of Power Function
[Plot of power versus n (per-group sample size, 0 to 100): power rises smoothly from near 0 toward 1; n = 92 gives 80% power.]
61
%IndividualPower macro*
• Uses PROBMC and PROBT (noncentral)
• Assumes that you want to use the single-step
(confidence interval based) Dunnett (one- or two-sided) or Range (two-sided) test
• Less conservative than Bonferroni
• Conservative compared to stepwise procedures
• %IndividualPower(MCP=DUNNETT2, g=4, d=5, s=10);
*Westfall et al (1999), Multiple Comparisons and Multiple Tests Using SAS
62
%IndividualPower Output
63
More general Power - Simulate!
Invocation:
%SimPower(method    = dunnett,
          TrueMeans = (10, 10, 13, 15, 15),
          s         = 10,
          n         = 87,
          seed      = 12345);
Output:
Method=DUNNETT, Nominal FWE=0.05, nrep=1000
True means = (10, 10, 13, 15, 15), n=87, s=10

Quantity              Estimate    ---95% CI----
Complete Power        0.28800     (0.260,0.316)
Minimal Power         0.92900     (0.913,0.945)
Proportional Power    0.65133     (0.633,0.669)
True FWE              0.01900     (0.011,0.027)
Directional FWE       0.01900     (0.011,0.027)
64
Concluding Remarks - Power
• Need a bigger n
• Like to avoid bigger n (see sequential,
gatekeepers methods later)
• Which definition?
• Bonferroni and independence useful
• Simulation useful – especially for the more
complex methods that follow
65
Closed and Stepwise Testing Methods I:
Standard P-Value Based Methods
If you want …                                 …then use
• Estimates of effect sizes & error margins → Simultaneous Confidence Intervals
• Confident inequalities                    → Stepwise or closed tests:
  • Holm’s Method
  • Hommel’s Method
  • Hochberg’s Method
  • Fisher Combination Method
• Overall Test                              → F-test, O’Brien, etc.
66
Closed Testing Method(s)
• Form the closure of the family by including all
intersection hypotheses.
• Test every member of the closed family by a
(suitable) α-level test. (Here, α refers to the comparisonwise error rate.)
• A hypothesis can be rejected provided that
  • its corresponding test is significant at level α, and
  • every other hypothesis in the family that implies it is
  rejected by its α-level test.
67
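The closure principle is mechanical enough to code directly. A Python sketch (not part of the course materials), using a Bonferroni min-p test at every intersection node — which, as shown later, reproduces Holm's method:

```python
from itertools import combinations

# Closure principle: elementary hypothesis H_i is rejected only if
# every intersection hypothesis containing it is rejected at level
# alpha.  Here each intersection of m nulls is tested by the
# Bonferroni min-p rule: reject if m * min(p's) <= alpha.
def closed_test(pvals, alpha=0.05):
    k = len(pvals)
    rejected = []
    for i in range(k):
        ok = True
        for m in range(1, k + 1):
            for subset in combinations(range(k), m):
                if i in subset:
                    if m * min(pvals[j] for j in subset) > alpha:
                        ok = False
        rejected.append(ok)
    return rejected

# Multiple-endpoints example from the following slides:
print(closed_test([0.0121, 0.0142, 0.1986, 0.0191]))
# [True, True, False, True]
```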
Closed Testing – Multiple Endpoints
Level 4: H0: δ1=δ2=δ3=δ4=0
Level 3: H0: δ1=δ2=δ3=0;  H0: δ1=δ2=δ4=0;  H0: δ1=δ3=δ4=0;  H0: δ2=δ3=δ4=0
Level 2: H0: δ1=δ2=0;  H0: δ1=δ3=0;  H0: δ1=δ4=0;  H0: δ2=δ3=0;  H0: δ2=δ4=0;  H0: δ3=δ4=0
Level 1: H0: δ1=0 (p=0.0121);  H0: δ2=0 (p=0.0142);  H0: δ3=0 (p=0.1986);  H0: δ4=0 (p=0.0191)
where δj = mean difference, treatment − control, endpoint j.
68
Closed Testing – Multiple
Comparisons
μ1=μ2=μ3=μ4
μ1=μ2=μ3;  μ1=μ2=μ4;  μ1=μ3=μ4;  μ2=μ3=μ4
{μ1=μ2, μ3=μ4};  {μ1=μ3, μ2=μ4};  {μ1=μ4, μ2=μ3}
μ1=μ2;  μ1=μ3;  μ1=μ4;  μ2=μ3;  μ2=μ4;  μ3=μ4
Note: Logical implications imply that there are only 14 nodes,
not 2^6 − 1 = 63 nodes.
69
Control of FWE with Closed Tests
Suppose H0j1, …, H0jm are all true (unknown to you which ones).
{Reject at least one of H0j1, …, H0jm using CTP} ⊆
{Reject H0j1 ∩ … ∩ H0jm}
Thus, P(reject at least one of H0j1, …, H0jm |
H0j1, …, H0jm all are true)
≤ P(reject H0j1 ∩ … ∩ H0jm |
H0j1, …, H0jm all are true) = α
70
Examples of Closed Testing Methods
When the Composite Test is…        Then the Closed Method is…
• Bonferroni MinP                → Holm’s Method
• Resampling-Based MinP          → Westfall-Young method
• Simes                          → Hommel’s method
• O’Brien                        → Lehmacher’s method
• Simple or weighted test        → Fixed sequence test (a-priori ordered)
• …                              → …
71
P-value Based Methods
• Test global hypotheses using p-value
combination tests
• Benefit – fewer model assumptions: only
need to say that the p-values are valid
• Allows for models other than homoscedastic
normal linear models (like survival analysis)
72
Holm’s Method is Closed Testing
Using the Bonferroni MinP Test
 Reject H0j H0j ...  H0j if
Min (p0j , p0j ,... , p0j )  a/m.
1
2
1
 Or,
m
2
m
Reject H0j H0j ...  H0j if
p* = m  Min (p0j , p0j ,... , p0j )  a.
1
2
m
1
2
m
(Note that p* is a valid p-value for the joint null, comparable to
p-value for Hotellings T2 test.)
73
Holm’s Stepdown Method
H0: δ1=δ2=δ3=δ4=0    minp=0.0121, p*=0.0484
H0: δ1=δ2=δ3=0       minp=0.0121, p*=0.0363
H0: δ1=δ2=δ4=0       minp=0.0121, p*=0.0363
H0: δ1=δ3=δ4=0       minp=0.0121, p*=0.0363
H0: δ2=δ3=δ4=0       minp=0.0142, p*=0.0426
H0: δ1=δ2=0          minp=0.0121, p*=0.0242
H0: δ1=δ3=0          minp=0.0121, p*=0.0242
H0: δ1=δ4=0          minp=0.0121, p*=0.0242
H0: δ2=δ3=0          minp=0.0142, p*=0.0284
H0: δ2=δ4=0          minp=0.0142, p*=0.0284
H0: δ3=δ4=0          minp=0.0191, p*=0.0382
H0: δ1=0             p=0.0121
H0: δ2=0             p=0.0142
H0: δ3=0             p=0.1986
H0: δ4=0             p=0.0191
where δj = mean difference, treatment − control, endpoint j.
74
Shortcut For Holm’s Method
Let H(1), …, H(k) be the hypotheses corresponding
to p(1) ≤ … ≤ p(k).
– If p(1) ≤ α/k, reject H(1) and continue; else stop and
retain all H(1), …, H(k).
– If p(2) ≤ α/(k−1), reject H(2) and continue; else stop
and retain all H(2), …, H(k).
– …
– If p(k) ≤ α, reject H(k).
75
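The shortcut, including the monotonicity step that produces the adjusted p-values on the next slide, fits in a few lines (a Python illustration, mirroring the formula pA(j) = max over i ≤ j of (k−i+1)p(i)):

```python
# Holm step-down adjusted p-values: sort the p-values, multiply the
# i-th smallest by (k-i+1), then enforce monotonicity with a running
# maximum so smaller raw p-values never get larger adjusted ones.
def holm_adjust(pvals):
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    adj = [0.0] * k
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (k - rank) * pvals[i])
        adj[i] = min(1.0, running_max)
    return adj

# Multiple-endpoints example; matches the Holm-Bonferroni worksheet.
print([round(p, 4) for p in holm_adjust([0.0121, 0.0142, 0.1986, 0.0191])])
# [0.0484, 0.0484, 0.1986, 0.0484]
```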
Adjusted p-values for Closed Tests
• The adjusted p-value for H0j is the maximum
of all p-values over all relevant nodes.
• In the previous example,
pA(1)=0.0484, pA(2)=0.0484, pA(3)=0.0484, pA(4)=0.1986.
• General formula for Holm: pA(j) = max_{i≤j} (k−i+1)p(i).
76
Worksheet For Holm’s Method
Holm-Bonferroni Worksheet
Ordered       Unadjusted   Critical    (k-i+1) ×   Adjusted
test number   P-value      Value       p-value     p-value
1             0.0121       0.0125      0.0484      0.0484
2             0.0142       0.0166667   0.0426      0.0484
3             0.0191       0.025       0.0382      0.0484
4             0.1986       0.05        0.1986      0.1986
77
Simes’ Test for Global Hypotheses
• Uses all p-values p1, p2, …, pm, not just the MinP
• Simes’ test rejects H01 ∩ H02 ∩ … ∩ H0m if
p(j) ≤ jα/m for at least one j.
• ⇒ p-value for the joint test is p* = min_j {(m/j)p(j)}
• Uniformly smaller p-value than m × MinP
• Type I error at most α under independence or positive
dependence of p-values
78
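The Simes global p-value is a one-liner (a Python illustration). On the four endpoint p-values it gives 0.0255 — the value that appears at the top node of Hommel's closure tree on slide 80 — versus 0.0484 for Bonferroni's m × min(p):

```python
# Simes' global p-value: p* = min over j of (m/j) * p_(j),
# where p_(1) <= ... <= p_(m) are the ordered p-values.
def simes_p(pvals):
    m = len(pvals)
    ordered = sorted(pvals)
    return min(m * p / (j + 1) for j, p in enumerate(ordered))

print(round(simes_p([0.0121, 0.0142, 0.1986, 0.0191]), 4))  # 0.0255
```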
Rejection Regions
[Plot of the Simes and Bonferroni rejection regions in the (p1, p2) unit square for m = 2.]
P(Simes Reject) = 1 − (1 − α/2)² + (α/2)² = α
P(Bonferroni Reject) = 1 − (1 − α/2)² = α − (α/2)²
79
Hommel’s Method (Closed Simes)
H0: δ1=δ2=δ3=δ4=0    p*=0.0255
H0: δ1=δ2=δ3=0       p*=0.0213
H0: δ1=δ2=δ4=0       p*=0.0191
H0: δ1=δ3=δ4=0       p*=0.0287
H0: δ2=δ3=δ4=0       p*=0.0287
H0: δ1=δ2=0          p*=0.0142
H0: δ1=δ3=0          p*=0.0242
H0: δ1=δ4=0          p*=0.0191
H0: δ2=δ3=0          p*=0.0284
H0: δ2=δ4=0          p*=0.0191
H0: δ3=δ4=0          p*=0.0382
H0: δ1=0             p=0.0121
H0: δ2=0             p=0.0142
H0: δ3=0             p=0.1986
H0: δ4=0             p=0.0191
Where dj = mean difference, treatment -control, endpoint j.
80
Adjusted P-values for Hommel’s
Method
 Again, take the maximum p-value over all
hypotheses that imply the given one.
 In the previous example, the Hommel adjusted p-
values are pA(1)=0.0287, pA(2)=0.0287, pA(3)=0.0382,
pA(4)=0.1986.
 These adjusted p-values are always smaller than the
Holm step-down adjusted p-values.
81
Adjusted P-values for Hommel’s
Method
• They are maxima over relevant nodes
• In the example, the Hommel adjusted p-values are
pA(1)=0.0287, pA(2)=0.0287, pA(3)=0.0382,
pA(4)=0.1986.
• {Hommel adjusted p-value} ≤ {Holm adjusted p-value}
82
Hochberg’s Method
• A conservative but simpler approximation to
Hommel’s method
{Hommel adjusted p-value}
≤ {Hochberg adjusted p-value}
≤ {Holm adjusted p-value}
83
Hochberg’s Shortcut Method
Let H(1), …, H(k) be the hypotheses corresponding
to p(1) ≤ … ≤ p(k).
– If p(k) ≤ α, reject all H(j) and stop; else retain H(k)
and continue.
– If p(k−1) ≤ α/2, reject H(1), …, H(k−1) and stop; else retain
H(k−1) and continue.
– …
– If p(1) ≤ α/k, reject H(1).
Adjusted p-values are pA(j) = min_{i≥j} (k−i+1)p(i).
84
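The step-up shortcut mirrors Holm's but runs from the largest p-value down, taking running minima instead of maxima (a Python illustration of pA(j) = min over i ≥ j of (k−i+1)p(i)):

```python
# Hochberg step-up adjusted p-values: multiply the i-th smallest
# p-value by (k-i+1), then take running minima from the largest
# p-value downward.
def hochberg_adjust(pvals):
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    adj = [0.0] * k
    running_min = 1.0
    for rank in reversed(range(k)):
        i = order[rank]
        running_min = min(running_min, (k - rank) * pvals[i])
        adj[i] = running_min
    return adj

# Matches the Hochberg worksheet on the endpoint p-values.
print([round(p, 4) for p in hochberg_adjust([0.0121, 0.0142, 0.1986, 0.0191])])
# [0.0382, 0.0382, 0.1986, 0.0382]
```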
Worksheet for Hochberg’s Method
Hochberg Adjusted P-Value Worksheet
Ordered       Unadjusted   Critical    (k-i+1) ×   Adjusted
test number   P-value      Value       p-value     p-value
1             0.0121       0.0125      0.0484      0.0382
2             0.0142       0.0166667   0.0426      0.0382
3             0.0191       0.025       0.0382      0.0382
4             0.1986       0.05        0.1986      0.1986
85
Comparison of Adjusted P-Values
p-Values
Test   Raw      Stepdown     Hochberg   Hommel
                Bonferroni
1      0.0121   0.0484       0.0382     0.0286
2      0.0142   0.0484       0.0382     0.0286
3      0.1986   0.1986       0.1986     0.1986
4      0.0191   0.0484       0.0382     0.0382
86
Fisher Combination Test
for Independent p-Values
Reject H01 ∩ H02 ∩ … ∩ H0m if
−2Σ ln(pi) > χ²(1−α, 2m)
87
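A self-contained Python illustration (not the PROC MULTTEST implementation): under the global null with independent p-values, X = −2Σ ln(pi) is chi-square with 2m df, and for even df the chi-square survival function has a closed form, so no statistics library is needed.

```python
import math

# Fisher's combination test for m independent p-values:
# X = -2 * sum(ln p_i) ~ chi-square(2m) under the global null.
# For 2m df: P(X > x) = exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!
def fisher_combination_p(pvals):
    m = len(pvals)
    x = -2 * sum(math.log(p) for p in pvals)
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(m))

# Two modest p-values combine into stronger joint evidence:
print(round(fisher_combination_p([0.04, 0.06]), 4))  # 0.0169
```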
Example: Non-Overlapping
Subgroup* p-values
The Multtest Procedure
p-Values
Test   Raw      Stepdown     Hochberg   Hommel   Fisher
                Bonferroni                       Combination
1      0.0784   0.3918       0.1550     0.1550   0.0784
2      0.0480   0.2883       0.1550     0.1441   0.0480
3      0.0041   0.0325       0.0305     0.0285   0.0053
4      0.0794   0.3918       0.1550     0.1550   0.0794
5      0.0044   0.0325       0.0305     0.0305   0.0056
6      0.0873   0.3918       0.1550     0.1550   0.0873
7      0.1007   0.3918       0.1550     0.1550   0.1007
8      0.1550   0.3918       0.1550     0.1550   0.1550
*Non-overlapping is required by the independence assumption.
88
Power Comparison
Liptak test stat: T = Σ Φ⁻¹(pi) = Σ Zi
89
Concluding Notes
• Closed testing more powerful than single-step (α/m
rather than α/k).
• P-value based methods can be used whenever p-values are valid
• Dependence issues:
  • MinP (Holm) conservative
  • Simes (Hommel, Hochberg) less conservative, rarely
  anti-conservative
  • Fisher combination, Liptak require independence
90
Closed and Stepwise Testing Methods II:
Fixed Sequences and Gatekeepers
Methods Covered:
• Fixed Sequences (hierarchical endpoints, dose
response, non-inferiority → superiority)
• Gatekeepers (primary and secondary analyses)
• Multiple Gatekeepers (multiple endpoints &
multiple doses)
• Intersection-Union tests*
* Doesn’t really belong in this section
91
Fixed Sequence Tests
• Pre-specify H1, H2, …, Hk, and test in this sequence,
stopping as soon as you fail to reject.
• No α-adjustment is necessary for the individual tests.
• Applications:
  • Dose response: High vs. Control, then Mid vs. Control,
  then Low vs. Control
  • Primary endpoint, then Secondary endpoint
92
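The procedure is a short loop (a Python illustration; the p-values are hypothetical): each hypothesis is tested at the full α, in the pre-specified order, and testing stops at the first non-rejection.

```python
# Fixed-sequence testing: test H1, H2, ... in the pre-specified
# order, each at the full alpha; stop at the first p > alpha.
def fixed_sequence(pvals, alpha=0.05):
    n_rejected = 0
    for p in pvals:
        if p > alpha:
            break
        n_rejected += 1
    return n_rejected  # hypotheses rejected, counting from the front

# High-, mid-, low-dose p-values: the first two are rejected, and the
# low-dose comparison is never formally tested.
print(fixed_sequence([0.001, 0.030, 0.210]))  # 2
```

Note that ordering matters: a large p-value early in the sequence blocks every later hypothesis, however small its p-value.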
Fixed Sequence as a Closed Procedure
H123: d1=d=d3 =0
Rej if p1 .05
H12: d1=d=0
Rej if p1 .05
H1: d1=0
Rej if p1 .05
H13: d1=d3 =0
Rej if p1 .05
H2: d =0
Rej if p2 .05
• Rej H1 if p1.05
• Rej H2 if p1.05 and p2.05
• Rej H3 if p1.05 and p2.05 and p3.05
H23: d=d3 =0
Rej if p2 .05
H3: d3 =0
Rej if p3 .05
93
A Seemingly Reasonable But Incorrect
Protocol
1. Test Dose 2 vs Pbo, and Dose 3 vs Pbo
using the Bonferroni method (0.025 level).
2. Test Dose 1 vs Pbo at the unadjusted 0.05
level only if at least one of the first two tests
is significant at the 0.025 level.
94
The problem: FWE  0.075
12
Lower
Mean
Upper
0
Pbo
Low
Mid
High
Moral: Caution needed when there are multiple hypotheses at
some point in the sequence.
95
Correcting the Incorrect Protocol: Use
Closure
H0LMH: rej if p23 < .05
H0LM: rej if p12 < .05;  H0LH: rej if p13 < .05;  H0MH: rej if p23 < .05
H0L: rej if p1 < .05;  H0M: rej if p2 < .05;  H0H: rej if p3 < .05
where pij = 2 min(pi, pj)
96
References –Fixed Sequence and
Gatekeeper Tests
1. Bauer, P. (1991). Multiple Testing in Clinical Trials. Statistics in Medicine 10, 871-890.
2. O’Neill, R.T. (1997). Secondary endpoints cannot be validly analyzed if the primary endpoint
does not demonstrate clear statistical significance. Controlled Clinical Trials 18, 550-556.
3. D’Agostino, R.B. (2000). Controlling alpha in clinical trials: the case for secondary
endpoints. Statistics in Medicine 19, 763-766.
4. Chi, G.Y.H. (1998). Multiple testings: multiple comparisons and multiple endpoints. Drug
Information Journal 32, 1347S-1362S.
5. Bauer, P., Röhmel, J., Maurer, W., and Hothorn, L. (1998). Testing strategies in multi-dose
experiments including active control. Statistics in Medicine 17, 2133-2146.
6. Westfall, P.H. and Krishen, A. (2001). Optimally weighted, fixed sequence, and
gatekeeping multiple testing procedures. Journal of Statistical Planning and Inference 99, 25-40.
7. Chi, G. “Clinical Benefits, Decision Rules, and Multiple Inferences,”
http://www.fda.gov/cder/Offices/Biostatistics/chi_1/sld001.htm
8. Dmitrienko, A., Offen, W., and Westfall, P. (2003). Gatekeeping strategies for clinical trials
that do not require all effects to be significant. Statistics in Medicine 22, 2387-2400.
9. Chen, X., Luo, X., and Capizzi, T. (2005). The application of enhanced parallel gatekeeping
strategies. Statistics in Medicine 24, 1385-1397.
10. Dmitrienko, A., Molenberghs, G., Chuang-Stein, C., and Offen, W. (2005). Analysis of
Clinical Trials Using SAS: A Practical Guide. SAS Press.
11. Wiens, B. and Dmitrienko, A. (2005). The fallback procedure for evaluating a single family
of hypotheses. Journal of Biopharmaceutical Statistics 15(6), 929-942.
12. Dmitrienko, A., Wiens, B., and Westfall, P. (2006). Fallback Tests in Dose Response
Clinical Trials. Journal of Biopharmaceutical Statistics 16, 745-755.
97
Intersection-Union (IU) Tests
 Union-Intersection (UI): Nulls are intersections, alternatives are unions.
   H0: {d1=0 and d2=0} vs. H1: {d1≠0 or d2≠0}
 Intersection-Union (IU): Nulls are unions, alternatives are intersections.
   H0: {d1=0 or d2=0} vs. H1: {d1≠0 and d2≠0}
 IU is NOT a closed procedure. It is just a single test of a different kind of null hypothesis.
98
Applications of I-U
 Bioequivalence: The "TOST" test:
   Test 1. H01: d ≤ -d0 vs. HA1: d > -d0
   Test 2. H02: d ≥ d0 vs. HA2: d < d0
   Can test both at α=.05, but must reject both.
 Combination Therapy:
   Test 1. H01: m12 ≤ m1 vs. HA1: m12 > m1
   Test 2. H02: m12 ≤ m2 vs. HA2: m12 > m2
   Can test both at α=.05, but must reject both.
99
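The IU decision rule itself is trivial to express in code: every component test must reject on its own, with no multiplicity adjustment. A minimal sketch (the function name is illustrative):

```python
def iu_reject(pvalues, alpha=0.05):
    """Intersection-union test: reject the union null H0 (some component
    hypothesis holds) only if EVERY component test rejects at level alpha.
    Equivalently, reject iff max(pvalues) <= alpha."""
    return all(p <= alpha for p in pvalues)
```

This is why no α-adjustment is needed: the rejection region {all pi ≤ α} is contained in every component rejection region.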
Control of Type I Error for IU tests
Suppose d1=0 or d2=0. Then
P(Type I error)
  = P(Reject H0)                           (1)
  = P(p1 ≤ .05 and p2 ≤ .05)               (2)
  ≤ min{P(p1 ≤ .05), P(p2 ≤ .05)}          (3)
  = .05.                                   (4)
Note: The inequality at (3) becomes an approximate equality when the test statistic behind p2 is extremely noncentral.
100
Concluding Notes:
Fixed Sequences and Gatekeepers
• Many times, no adjustment is necessary at all!
• Other times you can gain power by specifying gatekeeping sequences
• However, you must clearly state the method and follow the rules
• There are many "incorrect" no-adjustment methods; use caution
101
Closed and Stepwise Testing Methods III:
Methods that Use Logical Constraints and
Correlations
Method                          Application
Lehmacher et al.                Multiple endpoints
Westfall-Tobias-Shaffer-Royen   General contrasts
102
Lehmacher et al. Method
• Use O’Brien test at each node (incorporates correlations)
• Do closed testing
Note: Possibly no adjustment whatsoever; possibly big
adjustment
103
Calculations for Lehmacher’s
Method
proc standard data=research.multend1 mean=0 std=1 out=stdzd;
   var Endpoint1-Endpoint4;
run;
data combine;
   set stdzd;
   H1234 = Endpoint1+Endpoint2+Endpoint3+Endpoint4;
   H123  = Endpoint1+Endpoint2+Endpoint3;
   H124  = Endpoint1+Endpoint2+Endpoint4;
   H134  = Endpoint1+Endpoint3+Endpoint4;
   H234  = Endpoint2+Endpoint3+Endpoint4;
   H12   = Endpoint1+Endpoint2;
   H13   = Endpoint1+Endpoint3;
   H14   = Endpoint1+Endpoint4;
   H23   = Endpoint2+Endpoint3;
   H24   = Endpoint2+Endpoint4;
   H34   = Endpoint3+Endpoint4;
   H1    = Endpoint1;
   H2    = Endpoint2;
   H3    = Endpoint3;
   H4    = Endpoint4;
run;
proc ttest;
   class treatment;
   var H1234
       H123 H124 H134 H234
       H12 H13 H14 H23 H24 H34
       H1 H2 H3 H4;
   ods output ttests=ttests;
run;
104
Output For Lehmacher’s Method
Obs  Variable  Method  Variances  tValue   DF   Probt
 1   H1234     Pooled  Equal       2.69   109   0.0082
 3   H123      Pooled  Equal       2.59   109   0.0108
 5   H124      Pooled  Equal       3.03   109   0.0031
 7   H134      Pooled  Equal       2.36   109   0.0201
 9   H234      Pooled  Equal       2.51   109   0.0136
11   H12       Pooled  Equal       3.03   109   0.0030
13   H13       Pooled  Equal       2.12   109   0.0365
15   H14       Pooled  Equal       2.68   109   0.0085
17   H23       Pooled  Equal       2.22   109   0.0287
19   H24       Pooled  Equal       2.88   109   0.0047
21   H34       Pooled  Equal       2.03   109   0.0450
23   H1        Pooled  Equal       2.55   109   0.0121
25   H2        Pooled  Equal       2.49   109   0.0142
27   H3        Pooled  Equal       1.29   109   0.1986
29   H4        Pooled  Equal       2.38   109   0.0191
pA1 = max(0.0121, 0.0085, 0.0365, 0.0030, 0.0201, 0.0031, 0.0108, 0.0082) = 0.0365
pA2 = max(0.0142, 0.0047, 0.0287, 0.0030, 0.0136, 0.0031, 0.0108, 0.0082) = 0.0287
pA3 = max(0.1986, 0.0450, 0.0287, 0.0365, 0.0136, 0.0201, 0.0108, 0.0082) = 0.1986
pA4 = max(0.0191, 0.0450, 0.0047, 0.0085, 0.0136, 0.0201, 0.0031, 0.0082) = 0.0450
105
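The pA computations above are just maxima over the closure tree: the adjusted p-value for endpoint i is the largest p-value among all intersection hypotheses that contain i. A small Python sketch (node p-values copied from the output above; the function name is mine):

```python
def closed_adjusted_p(node_p):
    """node_p: {frozenset of endpoint labels: p-value of that intersection test}.
    Closed-testing adjusted p for endpoint i = max p over nodes containing i."""
    endpoints = frozenset().union(*node_p)
    return {i: max(p for K, p in node_p.items() if i in K) for i in endpoints}

# O'Brien-test p-values for each node of the closure tree (from the output)
node_p = {
    frozenset({1, 2, 3, 4}): 0.0082,
    frozenset({1, 2, 3}): 0.0108, frozenset({1, 2, 4}): 0.0031,
    frozenset({1, 3, 4}): 0.0201, frozenset({2, 3, 4}): 0.0136,
    frozenset({1, 2}): 0.0030, frozenset({1, 3}): 0.0365,
    frozenset({1, 4}): 0.0085, frozenset({2, 3}): 0.0287,
    frozenset({2, 4}): 0.0047, frozenset({3, 4}): 0.0450,
    frozenset({1}): 0.0121, frozenset({2}): 0.0142,
    frozenset({3}): 0.1986, frozenset({4}): 0.0191,
}
adj = closed_adjusted_p(node_p)
```

Running this reproduces pA1 = 0.0365, pA2 = 0.0287, pA3 = 0.1986, and pA4 = 0.0450.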
Free and Restricted Combinations
If truth of some null hypotheses logically forces
other nulls to be true, the hypotheses are restricted.
Examples
• Multiple Endpoints, one test per endpoint - free
• All Pairwise Comparisons - restricted
106
Pairwise Comparisons, 3 Groups
H0: m1=m2=m3
H0: {m1=m2, m1=m3}    H0: {m1=m2, m2=m3}    H0: {m1=m3, m2=m3}
H0: m1=m2    H0: m1=m3    H0: m2=m3
Note: The entire middle layer is not needed (any pair of equalities already implies the global null), so Fisher's protected LSD is valid!
107
Pairwise Comparisons, 4 Groups
m1=m2=m3=m4
{m1=m2, m3=m4}    {m1=m3, m2=m4}    {m1=m4, m2=m3}
m1=m2=m3    m1=m2=m4    m1=m3=m4    m2=m3=m4
m1=m2    m1=m3    m1=m4    m2=m3    m2=m4    m3=m4
Note: Logical implications imply that there are only 14 nodes, not 2^6 - 1 = 63 nodes. Also, Fisher protected LSD is not valid.
108
Restricted Combinations Multipliers
(Shaffer* Method 1; Modified Holm)
*Shaffer, J.P. (1986). Modified sequentially rejective multiple test procedures. JASA 81, 826-831.
109
Shaffer’s (1) Adjusted p-values
Comparison  RawP    Multiplier  Critical Value  Shaffer Adjusted p
m1-m2       0.3021  1           0.05            0.3998*
m1-m3       0.0435  3           0.016667        0.1305
m1-m4       0.0002  6           0.008333        0.0012
m2-m3       0.1999  2           0.025           0.3998
m2-m4       0.0109  3           0.016667        0.0327
m3-m4       0.0088  3           0.016667        0.0264
* Monotonicity enforced
110
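With the multipliers in hand, the adjusted p-values are just running maxima of multiplier × raw p down the ordered list. A Python sketch that reproduces the table (the multipliers are supplied by hand here, since deriving them requires the logical-constraint bookkeeping of Shaffer's method):

```python
def shaffer_adjusted(sorted_p, multipliers):
    """sorted_p: raw p-values in ascending order; multipliers: the Shaffer
    multiplier used at each step. Returns adjusted p-values with
    monotonicity enforced (running maximum, capped at 1)."""
    adj, running = [], 0.0
    for p, m in zip(sorted_p, multipliers):
        running = max(running, min(1.0, m * p))
        adj.append(running)
    return adj

# raw p-values sorted ascending, with Shaffer(1) multipliers for 6 pairwise
# comparisons of 4 groups: 6, 3, 3, 3, 2, 1
adj = shaffer_adjusted([0.0002, 0.0088, 0.0109, 0.0435, 0.1999, 0.3021],
                       [6, 3, 3, 3, 2, 1])
```

The running maximum is what produces the starred 0.3998 for the m1-m2 comparison: its own value 1 × 0.3021 is smaller than the preceding 0.3998, so it is raised to preserve monotonicity.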
Westfall/Tobias/Shaffer/Royen*
Method
• Uses actual distribution of MinP instead of
conservative Bonferroni approximation
• Closed testing incorporating logical constraints
• Hard-coded in PROC GLIMMIX
• Allows arbitrary linear functions
*Westfall, P.H. and Tobias, R.D. (2007). Multiple Testing of General Contrasts:
Truncated Closure and the Extended Shaffer-Royen Method,
Journal of the American Statistical Association 102: 487-494.
111
Application of Truncated Closed
MinP to Subgroup Analysis
Compare Treatment with control as follows:
• Overall
• In the Older Patients subgroup
• In the Younger Patients subgroup
• In patients with better initial health subgroup
• In patients with poorer initial health subgroup
• In each of the four (old/young)x(better/poorer) subgroups
• 9 tests overall (better viewed as 1 gatekeeper + 8 follow-up tests)
112
Analysis File
ods output estimates=estimates_logicaltests;
proc glimmix data=research.respiratory;
   class Treatment AgeGroup InitHealth;
   model score = Treatment AgeGroup InitHealth
                 Treatment*AgeGroup Treatment*InitHealth AgeGroup*InitHealth;
   estimate
      "Overall"        treatment 4 -4 treatment*Agegroup 2 2 -2 -2
                       treatment*InitHealth 2 2 -2 -2 (divisor=4),
      "Older"          treatment 2 -2 treatment*Agegroup 2 0 -2 0
                       treatment*InitHealth 1 1 -1 -1 (divisor=2),
      "Younger"        treatment 2 -2 treatment*Agegroup 0 2 0 -2
                       treatment*InitHealth 1 1 -1 -1 (divisor=2),
      "GoodInitHealth" treatment 2 -2 treatment*Agegroup 1 1 -1 -1
                       treatment*InitHealth 2 0 -2 0 (divisor=2),
      "PoorInitHealth" treatment 2 -2 treatment*Agegroup 1 1 -1 -1
                       treatment*InitHealth 0 2 0 -2 (divisor=2),
      "OldGood"        treatment 1 -1 treatment*Agegroup 1 0 -1 0
                       treatment*InitHealth 1 0 -1 0,
      "OldPoor"        treatment 1 -1 treatment*Agegroup 1 0 -1 0
                       treatment*InitHealth 0 1 0 -1,
      "YoungGood"      treatment 1 -1 treatment*Agegroup 0 1 0 -1
                       treatment*InitHealth 1 0 -1 0,
      "YoungPoor"      treatment 1 -1 treatment*Agegroup 0 1 0 -1
                       treatment*InitHealth 0 1 0 -1
      / adjust=simulate(nsamp=10000000 report seed=12321) upper
        stepdown(type=logical report);
run;
proc print data=estimates_logicaltests noobs;
   title "Subgroup Analysis Results - Truncated Closure";
   var label estimate Stderr tvalue probt Adjp;
run;
113
Results – Truncated Closure
Subgroup Analysis Results

Label           Estimate  StdErr  tValue  Probt   adjp_logical  adjp_interval
Overall         0.7075    0.1956  3.62    0.0002  0.0011        0.0015
Older           0.9952    0.2673  3.72    0.0002  0.0011        0.0011
Younger         0.4197    0.2847  1.47    0.0717  0.1049        0.2605
GoodInitHealth  0.5871    0.2878  2.04    0.0219  0.0432        0.0984
PoorInitHealth  0.8279    0.2644  3.13    0.0011  0.0023        0.0068
OldGood         0.8748    0.3387  2.58    0.0056  0.0124        0.0295
OldPoor         1.1157    0.3231  3.45    0.0004  0.0011        0.0026
YoungGood       0.2993    0.3562  0.84    0.2014  0.2014        0.5494
YoungPoor       0.5401    0.3338  1.62    0.0544  0.1049        0.2091

The adjusted p-values for the stepdown tests are mathematically never larger than those of the simultaneous interval-based tests.
114
Example: Stepwise Pairwise vs.
Control Testing
Teratology data set
•Observations are litters
•Response variable = litter weight
•Treatments: 0,5,50,500.
•Covariates: Litter size, Gestation time
115
Analysis File
proc glimmix data=research.litter;
   class dose;
   model weight = dose gesttime number;
   estimate
      "5 vs 0"   dose -1 1 0 0,
      "50 vs 0"  dose -1 0 1 0,
      "500 vs 0" dose -1 0 0 1
      / adjust=simulate(nsamp=10000000 report)
        stepdown(type=logical);
run; quit;
116
Results
Estimates with Simulated Adjustment

Label      Estimate  Standard Error  DF  t Value  Pr > |t|  Adj P
5 vs 0     -3.3524   1.2908          68  -2.60    0.0115    0.0316
50 vs 0    -2.2909   1.3384          68  -1.71    0.0915    0.0915
500 vs 0   -2.6752   1.3343          68  -2.00    0.0490    0.0907
Note: 50-0 and 500-0 not significant at .10 with regular Dunnett
117
Concluding Notes:
 More power is available when combinations
are restricted.
 Power of closed tests can be improved using
correlation and other distributional
characteristics
118
Nonparametric Multiple Testing
Methods
Overview: Use nonparametric tests at each node of the
closure tree
• Bootstrap tests
• Rank-based tests
• Tests for binary data
119
Bootstrap MinP Test
(Semi-Parametric Test)
 The composite hypothesis H1∩H2∩…∩Hk may be tested using the p-value
   p* = P(MinP ≤ minp | H1∩H2∩…∩Hk)
 Westfall and Young (1993) show
   how to obtain p* by bootstrapping the residuals in a multivariate regression model.
   how to obtain all p*'s in the closure tree efficiently.
120
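The resampling idea can be illustrated with a toy version in Python. This sketch is only schematic of the Westfall-Young approach, not their actual algorithm or software: it uses the max |t| statistic in place of MinP (equivalent here, since all endpoints share one null reference distribution), centers each group to impose the global null, and resamples rows with replacement.

```python
import random
import statistics

def max_abs_t(treat, ctrl):
    """Largest absolute two-sample pooled-variance t statistic over endpoints."""
    ts = []
    for j in range(len(treat[0])):
        x = [row[j] for row in treat]
        y = [row[j] for row in ctrl]
        sp2 = (statistics.variance(x) * (len(x) - 1)
               + statistics.variance(y) * (len(y) - 1)) / (len(x) + len(y) - 2)
        se = ((sp2 + 1e-12) * (1 / len(x) + 1 / len(y))) ** 0.5  # guard vs. 0
        ts.append(abs(statistics.mean(x) - statistics.mean(y)) / se)
    return max(ts)

def global_null_p(treat, ctrl, B=500, seed=1):
    """Approximate p* = P(max |T*| >= observed max |t| | global null) by
    centering each group (residuals) and resampling whole rows, which
    preserves the correlation among endpoints."""
    rng = random.Random(seed)
    obs = max_abs_t(treat, ctrl)

    def center(rows):
        means = [statistics.mean(col) for col in zip(*rows)]
        return [[v - m for v, m in zip(row, means)] for row in rows]

    pool = center(treat) + center(ctrl)
    n1 = len(treat)
    count = 0
    for _ in range(B):
        resamp = [rng.choice(pool) for _ in range(len(pool))]
        if max_abs_t(resamp[:n1], resamp[n1:]) >= obs:
            count += 1
    return count / B
```

Resampling entire observation vectors is what lets the method incorporate the correlations among endpoints without modeling them explicitly.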
Multivariate Regression Model
(Next Five slides are from Westfall and Young, 1993)
121
Hypotheses and Test Statistics
122
Joint Distribution of the Test Statistics
123
Testing Subset Intersection Hypotheses
Using the Extreme Pivotals
124
Exact Calculation of pK
Bootstrap Approximation:
125
Bootstrap Tests (PROC MULTTEST)
Node                   p or min p    p*
H0: d1=d2=d3=d4=0      .0121         = .0379
H0: d1=d2=d3=0         .0121         < .0379
H0: d1=d2=d4=0         .0121         < .0379
H0: d1=d3=d4=0         .0121         < .0379
H0: d2=d3=d4=0         .0142         = .0351
H0: d1=d2=0            .0121         < .0379
H0: d1=d3=0            .0121         < .0379
H0: d1=d4=0            .0121         < .0379
H0: d2=d3=0            .0142         < .0351
H0: d2=d4=0            .0142         < .0351
H0: d3=d4=0            .0191         = .0355
H0: d1=0               .0121         < .0379
H0: d2=0               .0142         < .0351
H0: d3=0               .1986         = .1991
H0: d4=0               .0191         < .0355

p* = P(MinP ≤ min p | H0) (computed using bootstrap resampling)
(Recall that for Bonferroni, p* = k · min p.)
126
Permutation Tests for Composite
Hypotheses H0K
Joint p-value = proportion of the n!/(nT!nC!) permutations for which min_{i∈K} P*_i ≤ min_{i∈K} p_i.
127
Problem; Simplification
Problem: There are 2^k - 1 subsets K to be tested.
This might take a while...
Simplification: You need only test k of the 2^k - 1 subsets!
Why? Because
P(min_{i∈K} P*_i ≤ c) ≥ P(min_{i∈K'} P*_i ≤ c) when K ⊇ K'.
Significance for most lower-order subsets is determined by significance of higher-order subsets.
128
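Under independence the claim is easy to verify numerically: the MinP subset p-value has the closed form p_K = 1 − (1 − min_{i∈K} p_i)^|K|, and the full-closure maximum for each hypothesis is attained on the k nested subsets obtained by dropping the smallest p-values one at a time. A Python check (illustrative only; independence is assumed so that p_K is available in closed form):

```python
from itertools import combinations

def subset_p(pvals, K):
    """MinP p-value for intersection hypothesis H_K, assuming independent
    Uniform(0,1) raw p-values under the null:
    P(min_{i in K} U_i <= min_{i in K} p_i) = 1 - (1 - min p)^|K|."""
    c = min(pvals[i] for i in K)
    return 1 - (1 - c) ** len(K)

def full_closure_adjp(pvals):
    """Adjusted p by brute force: max subset p over ALL 2^k - 1 subsets."""
    k = len(pvals)
    adj = []
    for i in range(k):
        worst = 0.0
        for r in range(1, k + 1):
            for K in combinations(range(k), r):
                if i in K:
                    worst = max(worst, subset_p(pvals, K))
        adj.append(worst)
    return adj

def shortcut_adjp(pvals):
    """Same answer using only the k nested subsets {(j), ..., (k)}."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    adj = [0.0] * k
    running = 0.0
    for j, i in enumerate(order):
        running = max(running, subset_p(pvals, order[j:]))
        adj[i] = running
    return adj
```

The agreement holds because any subset K containing hypothesis (j) is dominated by the nested subset that starts at K's smallest ordered p-value.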
MULTTEST PROCEDURE
Tests only the needed subsets (k, not 2^k - 1).
Samples from the permutation distribution.
Only one sample is needed, not k distinct samples, if the joint distribution of MinP is identical under HK and HS.
(Called the "subset pivotality" condition by Westfall and Young, 1993; valid under location-shift and other models.)
129
Great Savings are Possible with
Exact Permutation Tests!
Why?
Suppose you test H12…k using MinP. The joint p-value is
p* = P(MinP ≤ minp)
   ≤ P(P1 ≤ minp) + P(P2 ≤ minp) + … + P(Pk ≤ minp)
Many summands can be zero, others much less than minp.
130
Multiple Binary Adverse Events
Variable  Contrast  Raw     Stepdown Bonferroni  Stepdown Permutation
ae1       t vs c    0.0008  0.0025               0.0020
ae2       t vs c    0.6955  1.0000               1.0000
ae3       t vs c    0.5000  1.0000               1.0000
ae4       t vs c    0.7525  1.0000               1.0000
ae5       t vs c    0.2213  1.0000               0.6274
ae6       t vs c    0.0601  0.3321               0.2608
ae7       t vs c    0.8165  1.0000               1.0000
ae8       t vs c    0.0293  0.1587               0.1328
ae9       t vs c    0.9399  1.0000               1.0000
ae10      t vs c    0.2484  1.0000               0.9273
ae11      t vs c    1.0000  1.0000               1.0000
ae12      t vs c    1.0000  1.0000               1.0000
ae13      t vs c    1.0000  1.0000               1.0000
ae14      t vs c    1.0000  1.0000               1.0000
ae15      t vs c    0.2484  1.0000               0.9273
ae16      t vs c    0.7516  1.0000               1.0000
ae17      t vs c    1.0000  1.0000               1.0000
ae18      t vs c    1.0000  1.0000               1.0000
ae19      t vs c    1.0000  1.0000               1.0000
ae20      t vs c    0.5000  1.0000               1.0000
ae21      t vs c    0.7516  1.0000               1.0000
ae22      t vs c    1.0000  1.0000               1.0000
ae23      t vs c    0.5000  1.0000               1.0000
ae24      t vs c    1.0000  1.0000               1.0000
ae25      t vs c    1.0000  1.0000               1.0000
ae26      t vs c    1.0000  1.0000               1.0000
ae27      t vs c    1.0000  1.0000               1.0000
ae28      t vs c    0.4344  1.0000               0.9400
131
Example: Genetic Associations
Phenotype: 0/1 (diseased or not).
Sample n1 from diseased, n2 from not diseased.
Compare hundreds of genotype frequencies (using dominant and recessive codings) for diseased and non-diseased using multiple Fisher exact tests.
132
PROC MULTTEST Code
proc multtest data=research.gen stepperm n=20000
out=pval hommel fdr;
class y;
test fisher(d1-d100 r1-r100);
contrast "dis v nondis" -1 1;
run;
proc sort data=pval;
by raw_p;
run;
proc print data=pval;
var _var_ raw_p stppermp hom_p;
where raw_p <.05;
run;
133
Results from PROC MULTTEST
Obs  _var_  raw_p     stppermp  hom_p    fdr_p
1    r100   0.000000  0.0000    0.00000  0.00000
2    r30    0.000000  0.0000    0.00004  0.00002
3    d78    0.016130  0.7955    1.00000  0.57465
4    r55    0.018157  0.8220    1.00000  0.57465
5    d62    0.019220  0.8480    1.00000  0.57465
6    r64    0.019220  0.8480    1.00000  0.57465
7    r37    0.020113  0.8520    1.00000  0.57465
8    r33    0.040043  0.9860    1.00000  1.00000
134
Application - Gene Expression
Group 1: Acute Myeloid Leukemia (AML), n1=11
Group 2: Acute Lymphoblastic Leukemia (ALL), n2=27
Data:

OBS  TYPE  G1  G2  G3  …  G7000
1    AML   (gene expression levels)
2    AML   …
…    …
11   AML   …
12   ALL   …
…    …
38   ALL   …
135
PROC MULTTEST code for
exact* closed testing
proc multtest data=research.leuk noprint out=adjp holm
              fdr stepperm n=1000;
   class type;  /* AML or ALL */
   test mean(gene1-gene7123);
   contrast 'AML vs ALL' -1 1;
run;
proc sort data=adjp(where=(raw_p le .0005));
   by raw_p;
proc print;
   var _var_ raw_p stpbon_p fdr_p stppermp;
run;
* modulo Monte Carlo error
136
PROC MULTTEST Output
Variable   Raw p-value  Bonferroni p-value  Permutation p-value
GENE3320   1.4E-10      0.0000              0.0001
GENE4847   2.4E-10      0.0000              0.0001
GENE2020   6.6E-10      0.0000              0.0001
GENE1745   1.0E-08      0.0001              0.0002
GENE5039   1.0E-08      0.0001              0.0002
GENE1834   1.5E-08      0.0001              0.0003
GENE461    3.6E-08      0.0003              0.0005
GENE4196   6.2E-08      0.0004              0.0007
GENE3847   7.2E-08      0.0005              0.0008
GENE2288   8.9E-08      0.0006              0.0010
GENE1249   1.7E-07      0.0012              0.0017
GENE6201   1.8E-07      0.0013              0.0017
GENE2242   2.0E-07      0.0014              0.0019
GENE3258   2.1E-07      0.0015              0.0020
GENE1882   3.2E-07      0.0023              0.0029
GENE2111   3.7E-07      0.0026              0.0033
GENE2121   5.8E-07      0.0041              0.0048
GENE6200   6.2E-07      0.0044              0.0051
GENE6373   8.2E-07      0.0058              0.0065
GENE6539   1.1E-06      0.0080              0.0086

Variable   Raw p-value  Bonferroni p-value  Permutation p-value
GENE2043   1.3E-06      0.0090              0.0095
GENE2759   1.3E-06      0.0093              0.0097
GENE6803   1.4E-06      0.0102              0.0104
GENE1674   1.5E-06      0.0105              0.0106
GENE2402   1.5E-06      0.0108              0.0109
GENE2186   1.7E-06      0.0118              0.0118
GENE6376   2.1E-06      0.0149              0.0142
GENE3605   2.6E-06      0.0181              0.0169
GENE6806   2.6E-06      0.0184              0.0170
GENE1829   2.7E-06      0.0194              0.0177
GENE6797   3.0E-06      0.0214              0.0194
GENE6677   3.4E-06      0.0244              0.0216
GENE4052   3.7E-06      0.0263              0.0231
GENE1394   4.9E-06      0.0350              0.0290
GENE6405   5.4E-06      0.0380              0.0311
GENE248    6.4E-06      0.0453              0.0359
GENE2267   6.5E-06      0.0460              0.0364
GENE6041   7.8E-06      0.0553              0.0429
GENE6005   8.0E-06      0.0569              0.0439
GENE5772   9.0E-06      0.0638              0.0480
137
(1 hour on 2.8 GHz Xeon for 200,000 samples)
Subset Pivotality, PROC MULTTEST
MULTTEST requires the "subset pivotality" condition, which identifies the cases where resampling under the global null is valid.
Valid cases:
• Multivariate regression model (location-shift).
• Multivariate permutation multiple comparisons, one test per variable, assuming a model with exchangeable subsets.
Not valid with:
• Permutation multiple comparisons within a variable, with three or more groups.
• Heteroscedasticity.
Closed testing "by hand" works regardless.
138
Summary: Nonparametric Closed Tests
• Nonparametric closed tests are simple, in principle.
• Robustness gains and power advantages are possible.
139
Further Topics: More Complex
Situations for FWE Control
 Heteroscedasticity
 Repeated Measures
 Large Sample Methods
140
Heteroscedasticity in MCPs
Extreme Example:
data het;
   do g = 1 to 5;
      do rep = 1 to 10;
         input y @@;
         output;
      end;
   end;
datalines;
.01 .02 .08 .03 .04 .01 .08 .01 .01 .02
.11 .12 .13 .08 .03 .09 .11 .11 .13 .14
.21 .22 .23 .28 .23 .29 .21 .21 .23 .25
42 76 . . . . . . . .
45 23 56 . . . . . . .
;
proc glm;
   class g;
   model y = g;
   lsmeans g / adjust=tukey pdiff;
run; quit;
141
Bad Results from Heteroscedastic Data
Level of g  N   Mean (y)     Std Dev
1           10  0.0310000    0.0276687
2           10  0.1050000    0.0320590
3           10  0.2360000    0.0287518
4           2   59.0000000   24.0416306
5           3   41.3333333   16.8027775

RMSE = 6.17

Least Squares Means for effect g
Pr > |t| for H0: LSMean(i)=LSMean(j)
Adjustment for Multiple Comparisons: Tukey-Kramer

i/j  1       2       3       4       5
1            1.0000  1.0000  <.0001  <.0001
2    1.0000          1.0000  <.0001  <.0001
3    1.0000  1.0000          <.0001  <.0001
4    <.0001  <.0001  <.0001          0.0290
5    <.0001  <.0001  <.0001  0.0290
142
Approximate Solution for
Heteroscedasticity Problem
proc glimmix data=het;
   if (g > 3) then y2 = y/20; else y2 = y;  /* overcomes scaling problem */
   class g;
   model y2 = g / noint ddfm=satterth;
   random _residual_ / group=g;
   estimate
      '1 - 2' g 1 -1  0   0   0 ,
      '1 - 3' g 1  0 -1   0   0 ,
      '1 - 4' g 1  0  0 -20   0 ,
      '1 - 5' g 1  0  0   0 -20 ,
      '2 - 3' g 0  1 -1   0   0 ,
      '2 - 4' g 0  1  0 -20   0 ,
      '2 - 5' g 0  1  0   0 -20 ,
      '3 - 4' g 0  0  1 -20   0 ,
      '3 - 5' g 0  0  1   0 -20 ,
      '4 - 5' g 0  0  0  20 -20
      / adjust=simulate(nsamp=1000000)
        stepdown(type=logical) adjdfe=row;
run;
143
Heteroscedastic Results
Estimates with Simulated Adjustment

Label  Estimate  Standard Error  DF     t Value  Pr > |t|  Adj P
1 - 2  -0.07400  0.01339         17.62  -5.53    <.0001    0.0242
1 - 3  -0.2050   0.01262         17.97  -16.25   <.0001    0.0011
1 - 4  -58.9690  17.0000         1      -3.47    0.1787    0.2474
1 - 5  -41.4073  9.7011          2      -4.27    0.0507    0.1374
2 - 3  -0.1310   0.01362         17.79  -9.62    <.0001    0.0198
2 - 4  -58.8950  17.0000         1      -3.46    0.1789    0.2474
2 - 5  -41.2283  9.7011          2      -4.25    0.0512    0.1374
3 - 4  -58.7640  17.0000         1      -3.46    0.1793    0.2474
3 - 5  -41.0973  9.7011          2      -4.24    0.0515    0.1374
4 - 5  17.6667   19.5732         1.669  0.90     0.4778    0.4778

Notes:
• Approximation 1: df's
• Approximation 2: the covariance matrix involving all comparisons is approximate
• Groups 1, 2, 3 differ from one another; 4 and 5 do not (sensible)
144
Repeated Measures and Multiple
Comparisons
 Usually considered quite complicated (wave hands, use Bonferroni)
 PROC GLIMMIX provides a viable solution
 The method is approximate because of its df approximation, and because it treats estimated variance ratios as known.
145
Multiple Comparisons with Mixed Model
Crossover study: Dog heart rates
data Halothane;
   do Dog = 1 to 19;
      do Treatment = 'HA','LA','HP','LP';
         input Rate @@;
         output;
      end;
   end;
datalines;
426 609 556 600
253 236 392 395
359 433 349 357
432 431 522 600
405 426 513 513
324 438 507 539
310 312 410 456
326 326 350 504
375 447 547 548
286 286 403 422
349 382 473 497
429 410 488 547
348 377 447 514
412 473 472 446
347 326 455 468
434 458 637 524
364 367 432 469
420 395 508 531
397 556 645 625
;
H,L = CO2 High/Low
A,P = Halothane absent/present
Source: Johnson and Wichern, Applied Multivariate Statistical Analysis, 5th ed., Prentice Hall
146
GLIMMIX code for analyzing all pairwise
comparisons, main effects, and interactions
simultaneously
proc glimmix data=halothane order=data;
   class treatment dog;
   model rate = treatment / ddfm=satterth;
   random treatment / subject=dog type=chol v=1 vcorr=1;
   estimate
      'HA - LA'     treatment 1 -1  0  0 ,
      'HA - HP'     treatment 1  0 -1  0 ,
      'HA - LP'     treatment 1  0  0 -1 ,
      'LA - HP'     treatment 0  1 -1  0 ,
      'LA - LP'     treatment 0  1  0 -1 ,
      'HP - LP'     treatment 0  0  1 -1 ,
      'Co2'         treatment 1 -1  1 -1 (divisor=2),
      'Halothane'   treatment 1  1 -1 -1 (divisor=2),
      'Interaction' treatment 1 -1 -1  1
      / adjust=simulate(nsamp=1000000)
        stepdown(type=logical) adjdfe=row;
run;
147
Results
Estimates with Simulated Adjustment

Label        Estimate  Standard Error  DF     t Value  Pr > |t|  Adj P
HA - LA      -36.4211  13.8522         17.63  -2.63    0.0172    0.0172
HA - HP      -111.05   14.1127         14.72  -7.87    <.0001    <.0001
HA - LP      -134.68   12.7899         14.66  -10.53   <.0001    <.0001
LA - HP      -74.6316  14.8794         17.94  -5.02    <.0001    0.0002
LA - LP      -98.2632  15.7467         17.99  -6.24    <.0001    <.0001
HP - LP      -23.6316  11.9884         16.22  -1.97    0.0660    0.0660
Co2          -30.0263  8.2683          17.82  -3.63    0.0019    0.0059
Halothane    -104.66   11.1412         15.27  -9.39    <.0001    <.0001
Interaction  -12.7895  19.9438         17.98  -0.64    0.5294    0.5294
148
Cure Rates Example: Multiple
Comparisons of Odds
            Diagnosis
Treatment   Complicated  Uncomplicated
A           73.6%        88.9%
B           90.2%        91.5%
C           59.6%        85.0%

Questions:
(1) Multiple comparisons of cure rates for the Treatments (3 comparisons)
(2) Comparison of cure rates for Complicated vs Uncomplicated Diagnosis.
149
Method
• Use the estimated parameter vector and associated estimate
of covariance matrix from PROC GLIMMIX
• Treat the estimated (asymptotic) covariance matrix as known
• Simulate critical values and p-values (MinP-based) from the
multivariate normal distribution instead of the Multivariate T
distribution
• Controls FWE asymptotically under correct logit model
150
Results
Estimates with Simulated Adjustment

Label        Estimate  Standard Error  DF     t Value  Pr > |t|  Adj P
A-B          -0.9760   0.3311          Infty  -2.95    0.0032    0.0032
A-C          0.5847    0.2641          Infty  2.21     0.0268    0.0268
B-C          1.5608    0.3160          Infty  4.94     <.0001    <.0001
Comp-Uncomp  -0.9616   0.2998          Infty  -3.21    0.0013    0.0027
151
Summary
• Classic, FWE-controlling MCPs that incorporate
alternative covariance structures and non-normal
distributions are easy using PROC GLIMMIX.
• However, be aware of approximations
 Plug-in variance/covariance estimates
 df
152
Further Topics: False Discovery Rate
• FDR = E(proportion of rejections that are incorrect)
• Let R = total # of rejections
• Let V = # of erroneous rejections
• FDR = E(V/R)
(0/0 defined as 0).
• FWE = P(V>0)
153
Example
30 independent tests:
20 null hypotheses are true with pj ~ U(0,1)
10 extremely alternative with pj = 0.
Decision rule: Reject H0j if pj ≤ 0.05
Then:
CERj = P(reject H0j | H0j true) = 0.05.
FWE = P(reject one or more of the 20) = 1 - (.95)^20 = 0.64
FDR = E{V/(V+10)} where V ~ Bin(20, .05),
so FDR = 0.084.
154
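Both numbers in the example can be verified directly, since V is binomial and the 10 extreme alternatives are always rejected. A short Python check (exact binomial expectation, no simulation):

```python
from math import comb

# FWE: chance of at least one false rejection among the 20 true nulls
fwe = 1 - 0.95 ** 20

# FDR: E[V/(V+10)] with V ~ Binomial(20, 0.05), because all 10 extreme
# alternatives (p_j = 0) are rejected, so R = V + 10 with certainty
fdr = sum(comb(20, v) * 0.05 ** v * 0.95 ** (20 - v) * v / (v + 10)
          for v in range(21))
```

Note FDR (0.084) is slightly below E(V)/E(R) = 1/11 ≈ 0.091, by Jensen-type concavity of v/(v+10).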
Benjamini and Hochberg's FDR-Controlling Method
Let H(1), …, H(k) be the hypotheses corresponding to p(1) ≤ … ≤ p(k).
– If p(k) ≤ α, reject all H(j) and stop; else continue.
– If p(k-1) ≤ (k-1)α/k, reject H(1), …, H(k-1) and stop; else continue.
– …
– If p(1) ≤ α/k, reject H(1).
Adjusted p-values: pA(j) = min_{i≥j} (k/i) p(i).
155
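The adjusted-p formula translates directly into code: step down from the largest p-value, taking running minima of (k/i)·p(i). A minimal sketch (function name mine):

```python
def bh_adjusted(pvals):
    """Benjamini-Hochberg adjusted p-values: pA(j) = min over i >= j of (k/i) p(i)."""
    k = len(pvals)
    # process indices from largest to smallest p-value
    order = sorted(range(k), key=lambda i: pvals[i], reverse=True)
    adj = [0.0] * k
    running = 1.0
    for step, i in enumerate(order):
        rank = k - step  # rank of pvals[i] among the ordered p-values (1..k)
        running = min(running, pvals[i] * k / rank)
        adj[i] = running
    return adj
```

Rejecting all hypotheses with adjusted p ≤ α reproduces the step-up rule on the previous slide.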
Comparison with Hochberg’s Method
• A step-up procedure, like Hochberg's method
• Adjusted p-values are pA(j) = min_{i≥j} (k/i) p(i).
• Recall for Hochberg's method, pA(j) = min_{i≥j} (k-i+1) p(i).
• FDR adjusted p-values are uniformly smaller, since k/i ≤ k-i+1
• B-H FDR method uses Simes' critical points.
156
Critical Values – FDR vs FWE
[Figure: alpha levels (critical values) for Hochberg (Hoc) and Benjamini-Hochberg (BH) plotted against the ordered test index with 4000 tests: one panel showing the first 30 ordered tests, one showing all 4000. The BH critical values dominate the Hochberg values over most of the range.]
157
Comments on FDR Control
• Considered better for large numbers of tests, since FWE control is inconsistent there
• Is adaptive
• Has a loose Bayesian correspondence
• Easy to misinterpret the results: given 10 FDR < .10 rejections in a given study, it is tempting to claim that only one can be in error (in an "average" sense). However, this is incorrect, as E(V/R | R > 0) > α.
158
Further Topics: Bayesian Methods
 Simultaneous Credible Intervals
 Probabilities of ranking
 Loss function approach
 Posterior probabilities of null
hypotheses
159
Bayes/Frequentist Comparisons
160
Simultaneous Credible Intervals
• Create intervals Ii for qi, so that P(qi ∈ Ii, all i | Data) = .95
• Implementation in Westfall et al. (1999) assumes
  – Variance components model (includes regular GLM and heteroscedastic GLM as special cases)
  – Jeffreys' priors on variances (vague)
  – Flat prior on means (also vague)
• Uses PROC MIXED to obtain a sample (assumed i.i.d.) from the posterior distribution
• Uses %BayesIntervals to obtain simultaneous credible intervals
161
Bayesian Simultaneous Conf. Band
Obs  _NAME_  Lower     Upper
1    diff10  -6.51586  -1.68863
2    diff15  -5.58315  -1.30912
3    diff20  -4.67692  -0.91457
4    diff25  -3.79373  -0.49875
5    diff30  -2.94835  -0.02991
6    diff35  -2.17939  0.49931
7    diff40  -1.48987  1.09893
8    diff45  -0.87996  1.79373
9    diff50  -0.35969  2.57208
10   diff55  0.10077   3.41950
11   diff60  0.52638   4.30526
12   diff65  0.91615   5.22872
13   diff70  1.29532   6.16164
162
Bayes/Frequentist Correspondence
From Westfall, P.H. (2005). Comment on Benjamini and Yekutieli, ‘False Discovery Rate Adjusted Confidence Intervals for
Selected Parameters,’ Journal of the American Statistical Association 100, 85-89.
163
Bayesian Probabilities for
Rankings
Suppose you observe Ave1 > Ave2 > … > Avek.
What is the probability that m1 > m2 > … > mk ?
Bayesian Solution: Calculate proportion of posterior
samples for which the ranking holds.
164
Results: Comparing Formulations
Solution for Fixed Effects

Effect       formulation  Estimate  Standard Error
formulation  A            12.0500   0.3152
formulation  B            11.0200   0.3152
formulation  C            10.2700   0.3152
formulation  D            9.2700    0.3152
formulation  E            12.1700   0.3152

The MEANS Procedure

Variable             N       Mean
rank_observed_means  500000  0.5554860
Mean5_best           500000  0.6051220
Mean1_best           500000  0.3936600
Mean2_best           500000  0.0012160
Mean3_best           500000  2E-6
Mean4_best           500000  0
165
Waller-Duncan Loss Function Approach
Let dij = mi - mj.
Let Li<j(dij) denote the loss of declaring mi < mj.
Let Li>j(dij) denote the loss of declaring mi > mj.
Let Li~j(dij) denote the loss of declaring mi not significantly different from mj.
W-D loss functions*:
Li~j(dij) = |dij|
Li<j(dij) = 0 if dij < 0; = k·dij otherwise
Li>j(dij) = -k·dij if dij < 0; = 0 otherwise
See Pennello, G. (1997). The k-ratio multiple comparisons Bayes rule for the balanced two-way design. Journal of the American Statistical Association 92, 675-684.
* Equivalent form; see Hochberg and Tamhane (1987, pp. 320-330)
166
Three-Decision Loss Functions
Duncan's Method; k-ratio = 50
[Figure: the three loss functions Loss(-), Loss(0), and Loss(+) plotted against delta.]
167
Implementation
Waller-Duncan in PROC GLM.
More general: simulate from the posterior pdf of the dij, calculate all three losses, average, and choose the decision with the smallest average loss.
168
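The "more general" simulation recipe fits in a few lines. A hypothetical Python sketch (function name mine): given posterior draws of d = mi − mj, average each of the three W-D losses and pick the decision with the smallest average, with k the k-ratio.

```python
def wd_decision(draws, k=100.0):
    """draws: posterior draws of d = mu_i - mu_j. Returns '<', '~', or '>'.
    Average W-D losses: declaring '<' costs k*d when d > 0; declaring '>'
    costs -k*d when d < 0; declaring no difference ('~') costs |d|."""
    n = len(draws)
    loss_lt = sum(k * d for d in draws if d > 0) / n
    loss_gt = sum(-k * d for d in draws if d < 0) / n
    loss_ns = sum(abs(d) for d in draws) / n
    return min([("<", loss_lt), ("~", loss_ns), (">", loss_gt)],
               key=lambda pair: pair[1])[0]
```

Draws concentrated away from zero give a directional decision at zero loss; draws straddling zero make "no significant difference" the cheapest call.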
Sample Output
The MEANS Procedure

Variable  N       Mean         Std Error
Loss1LT3  100000  177.9339488  0.1445577
Loss1NS3  100000  1.7793558    0.0014454
Loss1GT3  100000  0.0016329    0.000596411

Decision: m1 > m3

The MEANS Procedure

Variable  N       Mean        Std Error
Loss1LT5  100000  12.7456041  0.0717432
Loss1NS5  100000  0.3746864   0.000905742
Loss1GT5  100000  24.7230371  0.0967412

Decision: m1 ~ m5
169
Bayesian Multiple Testing
Frequentist univariate testing: calculate p-value = P(data more extreme | H0).
Bayesian univariate testing: calculate P(H0 is true | Data).
Frequentist multiple testing: if H01, H02, …, H0k are all true (or if many are true), then we get a small p-value by chance alone, so use a more conservative rule.
Bayesian multiple testing: express the doubt about many or all H0i being true using a prior distribution; use this to calculate posterior probabilities P(H0i is true | Data).
170
Bayesian Multiple Testing:
Methodology
Find the posterior probability for each of the 2^k models in which each qi is either = 0 or ≠ 0. Then
P(qi = 0 | Z) =
(Sum of posterior probs for all 2^(k-1) models where qi = 0)
/ (Sum of posterior probs for all 2^k models)
Gopalan, R., and Berry, D.A. (1998), Bayesian Multiple Comparisons Using Dirichlet Process Priors,
Journal of the American Statistical Association 93, 1130-1139.
Gönen, M., Westfall, P.H. and Johnson, W.O. (2003). Bayesian multiple testing for two-sample multivariate
endpoints," Biometrics 59, 76-82.
171
The %BayesTests Macro: Priors
• You can specify your level of prior doubt about individual
hypotheses. You can specify either (i) P(H0i is true)
or (ii) P(H0i is true, all i) , or both.
• You can specify (iii) the degree of prior correlation among the
individual hypotheses.
• Specify two out of three of (i), (ii), and (iii). The third is
determined by the other two.
• Specify prior expected effect sizes and prior variances of effect
sizes (default: mean effect size is 2.5, variance= 2.)
172
The %BayesTests Macro: Data
Assumptions, Inputs, and Outputs
• Assume: tests are free combinations (e.g., multiple endpoints); MANOVA; large samples.
• Inputs: t-statistics and their (conditional) large-sample correlation
matrix (this is the partial correlation matrix in the case of
multiple endpoints); priors.
• Outputs: Posterior probabilities P(H0i is true | Data).
173
%BayesTests Example: Multiple
Endpoints in Panic Disorder Study
proc glm data=research.panic;
   class TX;
   model AASEVO PANTOTO PASEVO PHCGIMPO = TX;
   estimate "Treatment vs Control" TX 1 -1;
   manova h=TX / printe;
   ods output Estimates=Estimates
              PartialCorr=PartialCorr;
run;
%macro Estimates;
   use Estimates;
   read all var {tValue} into EstPar;
   use PartialCorr;
   read all var {AASEVO, PANTOTO, PASEVO, PHCGIMPO} into cov;
%mend;
%BayesTests(rho=.5, Pi0=.5);
174
Output from %BayesTests
175
The Effect of Prior Correlation:
Borrowing Strength
Posterior Null Probability as a Function of Prior Correlation
[Figure: posterior null probabilities for H1-H4 plotted against the prior correlation r, from 0 to 1; the vertical axis p runs from 0 to 0.6.]
176
The Bayesian Multiplicity Effect
If the multiple comparisons concern, "What if many or all nulls are true?", is valid, the Bayesian must attach a higher probability to P(H0i is true, all i).
Here is the result of setting P(H0i is true, all i) = .5.
177
"Right" answers: see Westfall, P.H., Krishen, A. and Young, S.S. (1998). "Using Prior Information to Allocate Significance Levels for Multiple Endpoints," Statistics in Medicine 17, 2107-2119.
Summary: Bayesian Methods
Several Bayesian MCPs are available!
• Intervals
• Tests
• Rankings
• Decision theory
Other current research:
• FDR – Bayesian connection (genetics)
• Mixture models and Bayesian MCPs (variable selection)
178
Discussion
 Good methods and software are available
 You can't use the excuse "I don't have to use MCPs because there is no good method available"
 This brings us back to the $100,000,000 question: "When should we use MCPs/MTPs?"
179
When Should You Adjust?
A Scientific View
 When there is substantial doubt concerning the
collection of hypotheses tested
 When you data snoop
 When you play “pick the winner”
 When conclusions require joint validity
180
But What “Family” Should I Use?
• The set over which you play “pick the winner”
• The set of conclusions requiring joint validity
• Not always well-defined
• Better to decide at design stage or simply to
“frame the discussion”
181
Multiplicity Invites Selection;
Selection has an Effect
Variability and probability theory are VERY relevant.
182
Final Words:
α/k
183
References: Books
Hochberg, Y. and Tamhane, A.C. (1987). Multiple Comparison Procedures. John Wiley, New York.
Hsu, J.C. (1996). Multiple Comparisons: Theory and Methods, Chapman and Hall, London.
Westfall, P.H., and Young, S.S. (1993) Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. Wiley,
New York.
Westfall, P.H., Tobias, R.D., Rom, D., Wolfinger, R.D., and Hochberg, Y. (1999). Multiple Comparisons and Multiple Tests Using the
SAS® System, Cary, NC: SAS Institute Inc.
Westfall, P.H. and Tobias, R. (2000). Exercises to Accompany "Multiple Comparisons and Multiple Tests Using the SAS® System" ,
Cary, NC: SAS Institute Inc.
184
References: Journal Articles
•Bauer, P.; George Chi; Nancy Geller; A. Lawrence Gould; David Jordan; Surya Mohanty; Robert O'Neill; Peter H. Westfall (2003). Industry, Government, and
Academic Panel Discussion on Multiple Comparisons in a “Real” Phase Three Clinical Trial. Journal of Biopharmaceutical Statistics, 13(4), 691-701.
•Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57, 289-300.
•Berger, J. O. and Delampady, M. (1987). Testing precise hypotheses. Statistical Science 2, 317-352.
•Cook, R.J. and Farewell, V.T.(1996). Multiplicity considerations in the design and analysis of clinical trials. JRSS-A 159, 93-110.
•Dmitrienko, A, Offen, W. and Westfall, P. (2003). Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Statistics in
Medicine 22, 2387-2400.
•Gönen, M., Westfall, P.H. and Johnson, W.O. (2003). "Bayesian multiple testing for two-sample multivariate endpoints," Biometrics 59, 76-82.
•Hellmich M, Lehmacher W. Closure procedures for monotone bi-factorial dose-response designs. Biometrics 2005;61:269-276.
•Koyama, T., and Westfall, P.H. (2005). Decision-Theoretic Views on Simultaneous Testing of Superiority and Noninferiority, Journal of Biopharmaceutical
Statistics 15, 943-955.
•Lehmacher W., Wassmer G., Reitmeir P.: Procedures for Two-Sample Comparisons with Multiple Endpoints Controlling the Experimentwise Error Rate. Biometrics,
1991, 47: 511-521.
•Marcus, R., Peritz, E. and Gabriel, K.R. (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655-660.
•Shaffer, J.P. (1986). Modified sequentially rejective multiple test procedures. Journal of the American Statistical Association 81, 826-831.
•Westfall, P.H. (1997). "Multiple Testing of General Contrasts Using Logical Constraints and Correlations," Journal of the American Statistical Association 92, 299-306.
•Westfall, P.H. and Wolfinger, R.D.(1997). "Multiple Tests with Discrete Distributions," The American Statistician 51, 3-8.
•Westfall, P.H., Johnson, W.O., and Utts, J.M. (1997). A Bayesian perspective on the Bonferroni adjustment. Biometrika 84, 419-427.
•Westfall,P.H. and Wolfinger, R.D. (2000). "Closed Multiple Testing Procedures and PROC MULTTEST." SAS Observations, July, 2000.
•Westfall, P.H., Ho, S.-Y., and Prillaman, B.A. (2001). "Properties of multiple intersection-union tests for multiple endpoints in combination therapy trials," Journal
of Biopharmaceutical Statistics 11, 125-138.
•Westfall, P.H. and Krishen, A. (2001). "Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures," Journal of Statistical Planning and
Inference 99, 25-40.
•Westfall, P. and Bretz, F. (2003). Multiplicity in Clinical Trials. Encyclopedia of Biopharmaceutical Statistics, second edition, Shein-Chung Chow, ed., Marcel
Dekker, Inc., New York, pp. 666-673.
•Westfall, P.H., Zaykin, D.V., and Young, S.S. (2001). Multiple tests for genetic effects in association studies. Methods in Molecular Biology, vol. 184: Biostatistical
Methods, pp. 143-168. Stephen Looney, Ed., Humana Press, Totowa, NJ.
•Westfall, P.H. and Tobias, R.D. (2007). Multiple Testing of General Contrasts: Truncated Closure and the Extended Shaffer-Royen Method, Journal of the
American Statistical Association 102: 487-494.
185