Multiple Testing in the Survival
Analysis of Microarray Data
José A. Correa, Florida Atlantic University
Sandrine Dudoit, Univ. California Berkeley
Darlene R. Goldstein, École Polytechnique
Fédérale de Lausanne
Contact: [email protected]
Software: http://www.math.fau.edu/correa/
cDNA gene expression data
Data on m genes for n samples

                      mRNA samples
Genes   sample1  sample2  sample3  sample4  sample5  …
1         0.46     0.30     0.80     1.51     0.90   ...
2        -0.10     0.49     0.24     0.06     0.46   ...
3         0.15     0.74     0.04     0.10     0.20   ...
4        -0.45    -1.03    -0.79    -0.56    -0.32   ...
5        -0.06     1.06     1.35     1.09    -1.09   ...

Gene expression level of gene i in mRNA sample j
= (normalized) Log( Red intensity / Green intensity)
Multiple Testing Problem
• Simultaneously test m null hypotheses,
one for each gene j
Hj: no association between expression
level of gene j and the covariate or
response
• Because microarray experiments
simultaneously monitor expression levels
of thousands of genes, there is a large
multiplicity issue
• Would like some sense of how ‘surprising’
the observed results are
Hypothesis Truth vs. Decision

                          Decision
Truth           # not rejected   # rejected   totals
# true H               U           V (F +)      m0
# non-true H           T              S         m1
totals                m-R             R         m
Type I (False Positive) Error Rates
• Per-family Error Rate
PFER = E(V)
• Per-comparison Error Rate
PCER = E(V)/m
• Family-wise Error Rate
FWER = p(V ≥ 1)
• False Discovery Rate
FDR = E(Q), where
Q = V/R if R > 0; Q = 0 if R = 0
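The counts in the truth-vs-decision table and the ratio Q can be tallied for a single experiment as in the minimal sketch below. The function names `error_counts` and `false_discovery_proportion` are illustrative and not from the talk; the error rates themselves (PFER, PCER, FWER, FDR) are expectations or probabilities over repeated experiments, so one realization only gives one draw of V and Q.

```python
def error_counts(null_is_true, rejected):
    """Tally (U, V, T, S) from per-hypothesis truth and decision flags.

    null_is_true[j] is True when H_j is a true null (unknown in practice,
    known here only because the example is simulated).
    """
    U = V = T = S = 0
    for is_null, is_rejected in zip(null_is_true, rejected):
        if is_null and not is_rejected:
            U += 1      # true null, not rejected
        elif is_null and is_rejected:
            V += 1      # true null, rejected: false positive
        elif not is_null and not is_rejected:
            T += 1      # false null, not rejected
        else:
            S += 1      # false null, rejected
    return U, V, T, S


def false_discovery_proportion(V, R):
    """Q = V/R if R > 0, and Q = 0 if R = 0."""
    return V / R if R > 0 else 0.0


# Hypothetical example with m = 5 hypotheses, m0 = 3 true nulls.
truth = [True, True, True, False, False]
decision = [False, True, False, True, False]
U, V, T, S = error_counts(truth, decision)
R = V + S
Q = false_discovery_proportion(V, R)
```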
Strong vs. Weak Control
• All probabilities are conditional on which
hypotheses are true
• Strong control refers to control of the Type
I error rate under any combination of true
and false nulls
• Weak control refers to control of the Type I
error rate only under the complete null
hypothesis (i.e. all nulls true)
• In general, weak control without other
safeguards is unsatisfactory
Comparison of Type I Error Rates
• In general, for a given multiple testing
procedure,
PCER ≤ FWER ≤ PFER,
and
FDR ≤ FWER,
with FDR = FWER under the complete null
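A short derivation of these orderings, written as a sketch using only the definitions of V, R and Q given earlier (the indicator notation is added here for convenience):

```latex
% V is an integer with 0 <= V <= m, so pointwise
\[
\frac{V}{m} \;\le\; \mathbf{1}\{V \ge 1\} \;\le\; V
\quad\Longrightarrow\quad
\underbrace{\frac{E(V)}{m}}_{\mathrm{PCER}}
\;\le\; \underbrace{p(V \ge 1)}_{\mathrm{FWER}}
\;\le\; \underbrace{E(V)}_{\mathrm{PFER}} .
\]
% Q = V/R <= 1 when R > 0, and Q = 0 whenever V = 0, so
\[
Q \;\le\; \mathbf{1}\{V \ge 1\}
\quad\Longrightarrow\quad
\mathrm{FDR} = E(Q) \;\le\; p(V \ge 1) = \mathrm{FWER} .
\]
% Under the complete null every rejection is a false positive (V = R),
% so Q = 1{V >= 1} and FDR = FWER.
```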
Adjusted p-values (p*)
• If interest is in controlling, e.g., the
FWER, the adjusted p-value for
hypothesis Hj is:
pj* = inf {α : Hj is rejected at FWER α}
• Hypothesis Hj is rejected at FWER α
if pj* ≤ α
• Adjusted p-values for other Type I
error rates are similarly defined
Some Advantages of
p-value Adjustment
• Test level (size) does not need to be
determined in advance
• Some procedures most easily described
in terms of their adjusted p-values
• Usually easily estimated using resampling
• Procedures can be readily compared
based on the corresponding adjusted
p-values
A Little Notation
• For hypothesis Hj, j = 1, …, m
observed test statistic: tj
observed unadjusted p-value: pj
• Ordering of observed (absolute) tj: {rj}
such that |tr1| ≥ |tr2| ≥ … ≥ |trm|
• Ordering of observed pj: {rj}
such that pr1 ≤ pr2 ≤ … ≤ prm
• Denote corresponding RVs by upper case
letters (T, P)
Control of the FWER
• Bonferroni single-step adjusted p-values
pj* = min (mpj, 1)
• Holm (1979) step-down adjusted p-values
prj* = max_{k = 1…j} { min ((m-k+1) prk, 1) }
• Hochberg (1988) step-up adjusted
p-values (Simes inequality)
prj* = min_{k = j…m} { min ((m-k+1) prk, 1) }
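The three adjustments above translate directly into code. The following is a minimal pure-Python sketch (function names are illustrative, not from the authors' software); each function returns adjusted p-values in the original gene order.

```python
def bonferroni(p):
    """Single-step: pj* = min(m * pj, 1)."""
    m = len(p)
    return [min(m * pj, 1.0) for pj in p]


def holm(p):
    """Step-down: running maximum of min((m-k+1) * p_rk, 1) over k = 1..j."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])   # r_1, ..., r_m
    adjusted = [0.0] * m
    running_max = 0.0
    for k, j in enumerate(order, start=1):
        running_max = max(running_max, min((m - k + 1) * p[j], 1.0))
        adjusted[j] = running_max
    return adjusted


def hochberg(p):
    """Step-up: running minimum of min((m-k+1) * p_rk, 1) over k = j..m."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])
    adjusted = [0.0] * m
    running_min = 1.0
    for k in range(m, 0, -1):                      # start from the largest p-value
        j = order[k - 1]
        running_min = min(running_min, min((m - k + 1) * p[j], 1.0))
        adjusted[j] = running_min
    return adjusted
```

For the made-up p-values [0.01, 0.04, 0.03, 0.005], Holm gives [0.03, 0.06, 0.06, 0.02] and Hochberg gives [0.03, 0.04, 0.04, 0.02], up to floating-point rounding. Hochberg is never larger than Holm, and both are never larger than Bonferroni.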
Control of the FWER
• Westfall & Young (1993) step-down minP
adjusted p-values
prj* = max_{k = 1…j} { p(min_{l ∈ {rk, …, rm}} Pl ≤ prk | H0C) }
• Westfall & Young (1993) step-down maxT
adjusted p-values
prj* = max_{k = 1…j} { p(max_{l ∈ {rk, …, rm}} |Tl| ≥ |trk| | H0C) }
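A resampling estimate of the step-down maxT adjusted p-values can be sketched as follows. It assumes the caller already has `t_perm`, a list of B rows in which each row holds the m test statistics recomputed on one permuted data set. That input and the name `maxt_adjusted` are assumptions for illustration, not the authors' code.

```python
def maxt_adjusted(t_obs, t_perm):
    """Step-down maxT adjusted p-values estimated from permutation statistics.

    t_obs  : list of m observed statistics
    t_perm : list of B lists, each with m statistics from one permutation
    """
    m = len(t_obs)
    B = len(t_perm)
    # r_1, ..., r_m: genes ordered by decreasing |t_obs|
    order = sorted(range(m), key=lambda j: -abs(t_obs[j]))
    counts = [0] * m
    for perm_stats in t_perm:
        running_max = 0.0
        # Successive maxima over {r_k, ..., r_m}, built from the least significant gene up
        for k in range(m - 1, -1, -1):
            running_max = max(running_max, abs(perm_stats[order[k]]))
            if running_max >= abs(t_obs[order[k]]):
                counts[k] += 1
    # Enforce monotonicity: max over k = 1..j of the estimated probabilities
    adjusted = [0.0] * m
    running = 0.0
    for k in range(m):
        running = max(running, counts[k] / B)
        adjusted[order[k]] = running
    return adjusted
```

The inner loop is what makes maxT cheaper than minP: it needs only the permuted statistics themselves, not a permutation p-value for every gene within every permutation.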
Westfall & Young (1993)
Adjusted p-values
• Step-down procedures: successively
smaller adjustments at each step
• Take into account the joint distribution
of the test statistics
• Less conservative than Bonferroni, Holm,
or Hochberg adjusted p-values
• Can be estimated by resampling but
computer-intensive (especially for minP)
maxT vs. minP
• The maxT and minP adjusted p-values are the
same when the test statistics are identically
distributed (id)
• When the test statistics are not id, maxT
adjustments may be unbalanced (not all tests
contribute equally to the adjustment)
• maxT more computationally tractable than minP
• maxT can be more powerful in ‘small n, large m’
situations
Control of the FDR
• Benjamini & Hochberg (1995): step-up
procedure which controls the FDR under some
dependency structures
prj* = min_{k = j…m} { min ([m/k] prk, 1) }
• Benjamini & Yekutieli (2001): conservative step-up
procedure which controls the FDR under
general dependency structures
prj* = min_{k = j…m} { min ([m Σ_{l = 1…m} (1/l) / k] prk, 1) }
• Yekutieli & Benjamini (1999): resampling based
adjusted p-values for controlling the FDR under
certain types of dependency structures
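The two closed-form FDR adjustments above differ only by the harmonic-sum factor, which the following minimal sketch makes explicit. The helper `_step_up` and the function names are illustrative; the resampling-based procedure of the third bullet is not covered here.

```python
def _step_up(p, factor):
    """Running minimum over k = j..m of min(factor * (m/k) * p_rk, 1)."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])   # r_1, ..., r_m
    adjusted = [0.0] * m
    running_min = 1.0                              # also enforces the cap at 1
    for k in range(m, 0, -1):
        j = order[k - 1]
        running_min = min(running_min, factor * m / k * p[j])
        adjusted[j] = running_min
    return adjusted


def benjamini_hochberg(p):
    """BH (1995) step-up adjusted p-values."""
    return _step_up(p, 1.0)


def benjamini_yekutieli(p):
    """BY (2001) adjusted p-values: BH inflated by sum_{l=1..m} 1/l."""
    harmonic = sum(1.0 / l for l in range(1, len(p) + 1))
    return _step_up(p, harmonic)
```

With m in the thousands, as for the datasets considered here, the harmonic factor is roughly ln(m) + 0.58, so the BY adjustment is close to an order of magnitude more conservative than BH.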
Identification of Genes
Associated with Survival
• Data: survival yi and gene expression xij for
individuals i = 1, …, n and genes j = 1, …, m
• Fit Cox model for each gene singly:
h(t) = h0(t) exp(βj xij)
• For any gene j = 1, …, m, can test Hj: βj = 0
• Complete null H0C: βj = 0 for all j = 1, …, m
• The Hj are tested on the basis of the Wald
statistics tj and their associated p-values pj
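For a single gene, the Wald statistic can be obtained by maximizing the Cox partial likelihood in one parameter. The sketch below is a bare Newton-Raphson iteration in pure Python. It assumes no step-halving is needed, treats tied event times with a Breslow-style risk set, and uses illustrative names (`cox_score_info`, `cox_wald`); it is not the authors' R code, and a production analysis would normally use an established survival library.

```python
import math


def cox_score_info(beta, time, event, x):
    """Score U(beta) and information I(beta) of the one-covariate Cox partial likelihood."""
    U = 0.0
    I = 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                      # censored observations contribute only via risk sets
        s0 = s1 = s2 = 0.0
        for k in range(n):
            if time[k] >= time[i]:        # individual k is still at risk at time[i]
                w = math.exp(beta * x[k])
                s0 += w
                s1 += w * x[k]
                s2 += w * x[k] * x[k]
        mean = s1 / s0
        U += x[i] - mean
        I += s2 / s0 - mean * mean
    return U, I


def cox_wald(time, event, x, tol=1e-10, max_iter=50):
    """Return (beta_hat, Wald statistic, two-sided normal p-value) for one gene."""
    beta = 0.0
    for _ in range(max_iter):
        U, I = cox_score_info(beta, time, event, x)
        step = U / I                      # assumes I > 0 (covariate varies within risk sets)
        beta += step
        if abs(step) < tol:
            break
    _, I = cox_score_info(beta, time, event, x)
    z = beta * math.sqrt(I)               # beta_hat / se, with se = 1 / sqrt(I(beta_hat))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, z, p
```

Applying `cox_wald` to each row of the expression matrix in turn would give the per-gene tj and pj that feed the adjustment procedures.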
Datasets
• Lymphoma (Alizadeh et al.)
40 individuals, 4026 genes
• Melanoma (Bittner et al.)
15 individuals, 3613 genes
• Both available at
http://lpgprot101.nci.nih.gov:8080/GEAW
Results: Lymphoma
Results: Melanoma
Other Proposals from the
Microarray Literature
• ‘Neighborhood Analysis’, Golub et al.
– In general, gives only weak control of FWER
• ‘Significance Analysis of Microarrays
(SAM)’ (2 versions)
– Efron et al. (2000): weak control of PFER
– Tusher et al. (2001): strong control of PFER
• SAM also estimates ‘FDR’, but this ‘FDR’
is defined as E(V|H0C)/R, not E(V/R)
Controversies
• Whether multiple testing methods
(adjustments) should be applied at all
• Which tests should be included in the
family (e.g. all tests performed within a
single experiment; define ‘experiment’)
• Alternatives
– Bayesian approach
– Meta-analysis
Situations where inflated
error rates are a concern
• It is plausible that all nulls may be true
• A serious claim will be made whenever
any p < .05 is found
• Much data manipulation may be
performed to find a ‘significant’ result
• The analysis is planned to be exploratory
but wish to claim ‘sig’ results are real
• Experiment unlikely to be followed up
before serious actions are taken
Discussion (I)
• Lack of significant findings
– Small sample sizes
– FWER-controlling procedures may be too
stringent in microarray applications
– FDR could perhaps be made even more
powerful by taking into account the joint
distribution of gene expression levels
Discussion (II)
• Computational considerations
– All computing done in the R statistical environment
(Ihaka and Gentleman)
– For maxT, Cox model analysis was repeated for each of
100,800 random permutations of survival times
– Exact maximum likelihood calculation took about 60
hours per machine in cluster of 24 PCs, each with 1 GHz
Pentium III and 256 MB memory
– Time can be reduced substantially by using a score
approximation to obtain parameter estimates, and by
calling C language code from within R
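One possible reading of the score approximation is to evaluate the Cox score and information only at β = 0, which needs no iteration. The sketch below builds the permutation matrix of such statistics for all genes. It assumes that survival time and censoring status are permuted together as pairs relative to the expression profiles, and the names `score_statistic` and `permuted_statistics` are hypothetical, not taken from the authors' software.

```python
import math
import random


def score_statistic(time, event, x):
    """Score test statistic U(0) / sqrt(I(0)) for one gene (no iteration needed)."""
    U = 0.0
    I = 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue
        risk = [x[k] for k in range(n) if time[k] >= time[i]]
        mean = sum(risk) / len(risk)
        U += x[i] - mean                                   # observed minus risk-set mean
        I += sum(v * v for v in risk) / len(risk) - mean * mean
    return U / math.sqrt(I) if I > 0 else 0.0


def permuted_statistics(time, event, expr, B, seed=0):
    """B x m matrix of score statistics under random permutation of survival outcomes.

    expr is a list of m genes, each a list of n expression values.
    """
    rng = random.Random(seed)
    idx = list(range(len(time)))
    rows = []
    for _ in range(B):
        rng.shuffle(idx)
        t_b = [time[i] for i in idx]                       # (time, status) pairs move together
        e_b = [event[i] for i in idx]
        rows.append([score_statistic(t_b, e_b, gene) for gene in expr])
    return rows
```

Each call to `score_statistic` costs a single pass over the risk sets, compared with several Newton iterations for the full maximum likelihood fit, which is where a substantial time saving would come from.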
References
• Alizadeh et al. (2000) Distinct types of diffuse large
B-cell lymphoma identified by gene expression
profiling. Nature 403: 503-511
• Benjamini and Hochberg (1995) Controlling the false
discovery rate: a practical and powerful approach to
multiple testing. JRSSB 57: 289-300
• Benjamini and Yekutieli (2001) The control of false
discovery rate in multiple hypothesis testing under
dependency. Annals of Statistics 29: 1165-1188
• Bittner et al. (2000) Molecular classification of
cutaneous malignant melanoma by gene expression
profiling. Nature 406: 536-540
• Efron et al. (2000) Microarrays and their use in a
comparative experiment. Tech report, Stats, Stanford
• Golub et al. (1999) Molecular classification of cancer.
Science 286: 531-537
References
• Hochberg (1988) A sharper Bonferroni procedure for
multiple tests of significance. Biometrika 75: 800-802
• Holm (1979) A simple sequentially rejective multiple
testing procedure. Scand. J Statistics 6: 65-70
• Ihaka and Gentleman (1996) R: A language for data
analysis and graphics. J Comp Graph Stats 5: 299-314
• Tusher et al. (2001) Significance analysis of
microarrays applied to transcriptional responses to
ionizing radiation. PNAS 98: 5116-5121
• Westfall and Young (1993) Resampling-based multiple
testing: Examples and methods for p-value adjustment.
New York: Wiley
• Yekutieli and Benjamini (1999) Resampling based false
discovery rate controlling multiple test procedures for
correlated test statistics. J Stat Plan Inf 82: 171-196
Acknowledgements
• Debashis Ghosh
• Erin Conlon