Statistics for Microarrays
Multiple Testing and Prediction and Variable Selection
Class web site: http://statwww.epfl.ch/davison/teaching/Microarrays/
cDNA gene expression data
Data on G genes for n mRNA samples (rows are genes, columns are mRNA samples)

Gene   sample1  sample2  sample3  sample4  sample5  …
1       0.46     0.30     0.80     1.51     0.90
2      -0.10     0.49     0.24     0.06     0.46
3       0.15     0.74     0.04     0.10     0.20
4      -0.45    -1.03    -0.79    -0.56    -0.32
5      -0.06     1.06     1.35     1.09    -1.09
...     ...      ...      ...      ...      ...

Gene expression level of gene i in mRNA sample j = (normalized) Log( Red intensity / Green intensity )
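As a minimal sketch of the quantity defined above, the log ratio for a single spot could be computed as follows. The function name is hypothetical, base 2 is an assumption (the slide does not state the base of the logarithm), and the normalization step is not shown.

```python
import math

def log_ratio(red, green):
    """Log ratio of red to green intensity for one spot.

    Base 2 is an assumption; normalization is not applied here.
    """
    return math.log2(red / green)

# A spot twice as bright in red as in green, and a spot with equal intensities.
print(log_ratio(200.0, 100.0))
print(log_ratio(150.0, 150.0))
```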
Multiple Testing Problem
• Simultaneously test G null hypotheses, one for each gene j: $H_j$: no association between expression level of gene j and the covariate or response
• Because microarray experiments simultaneously monitor expression levels of thousands of genes, there is a large multiplicity issue
• Would like some sense of how ‘surprising’ the observed results are
Hypothesis Truth vs. Decision
                            Decision
Truth          # not rejected   # rejected   totals
# true H       U                V (F+)       m_0
# non-true H   T                S            m_1
totals         m - R            R            m
Type I (False Positive) Error Rates
• Per-family Error Rate: PFER = E(V)
• Per-comparison Error Rate: PCER = E(V)/m
• Family-wise Error Rate: FWER = p(V ≥ 1)
• False Discovery Rate: FDR = E(Q), where Q = V/R if R > 0; Q = 0 if R = 0
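To make V, R and Q concrete, here is a minimal sketch that counts them for a single set of decisions. The function name and the toy inputs are hypothetical; the error rates themselves are expectations or probabilities of these quantities, which one would approximate by averaging over many simulated experiments.

```python
def error_counts(null_is_true, rejected):
    """Return (V, R, Q) for one set of test decisions.

    null_is_true[j] is True when hypothesis j is a true null;
    rejected[j] is True when hypothesis j was rejected.
    """
    # V: rejected hypotheses whose null is actually true (false positives)
    V = sum(1 for h0, rej in zip(null_is_true, rejected) if h0 and rej)
    # R: total number of rejections
    R = sum(1 for rej in rejected if rej)
    # Q: proportion of false positives among rejections, 0 when nothing is rejected
    Q = V / R if R > 0 else 0.0
    return V, R, Q

truth = [True, True, True, False, False]
decisions = [True, False, False, True, True]
print(error_counts(truth, decisions))
```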
Strong vs. Weak Control
• All probabilities are conditional on which hypotheses are true
• Strong control refers to control of the Type I error rate under any combination of true and false nulls
• Weak control refers to control of the Type I error rate only under the complete null hypothesis (i.e. all nulls true)
• In general, weak control without other safeguards is unsatisfactory
Comparison of Type I Error Rates
• In general, for a given multiple testing procedure, PCER ≤ FWER ≤ PFER, and FDR ≤ FWER, with FDR = FWER under the complete null
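These orderings follow from pointwise bounds on V and Q, using only the definitions of the error rates and the facts that V ≤ m and V ≤ R. The indicator notation is introduced here for illustration and is not from the slides:

```latex
\frac{V}{m} \le \mathbf{1}\{V \ge 1\} \le V
\;\Longrightarrow\;
\frac{E(V)}{m} \le p(V \ge 1) \le E(V),
\qquad
Q \le \mathbf{1}\{V \ge 1\}
\;\Longrightarrow\;
E(Q) \le p(V \ge 1).
```

Under the complete null every rejection is a false positive, so V = R and Q equals the indicator of V ≥ 1, which gives FDR = FWER.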
Adjusted p-values (p*)
• If interest is in controlling, e.g., the FWER, the adjusted p-value for hypothesis $H_j$ is: $p_j^* = \inf\{\alpha : H_j \text{ is rejected at FWER } \alpha\}$
• Hypothesis $H_j$ is rejected at FWER $\alpha$ if $p_j^* \le \alpha$
• Adjusted p-values for other Type I error rates are similarly defined
Some Advantages of p-value Adjustment
• Test level (size) does not need to be determined in advance
• Some procedures are most easily described in terms of their adjusted p-values
• Usually easily estimated using resampling
• Procedures can be readily compared based on the corresponding adjusted p-values
A Little Notation
• For hypothesis $H_j$, $j = 1, \dots, G$: observed test statistic $t_j$; observed unadjusted p-value $p_j$
• Ordering of observed (absolute) $t_j$: $\{r_j\}$ such that $|t_{r_1}| \ge |t_{r_2}| \ge \dots \ge |t_{r_G}|$
• Ordering of observed $p_j$: $\{r_j\}$ such that $p_{r_1} \le p_{r_2} \le \dots \le p_{r_G}$
• Denote corresponding RVs by upper case letters ($T$, $P$)
Control of the FWER
• Bonferroni single-step adjusted p-values: $p_j^* = \min(G p_j, 1)$
• Holm (1979) step-down adjusted p-values: $p_{r_j}^* = \max_{k=1,\dots,j} \{\min((G-k+1)\,p_{r_k}, 1)\}$
• Hochberg (1988) step-up adjusted p-values (Simes inequality): $p_{r_j}^* = \min_{k=j,\dots,G} \{\min((G-k+1)\,p_{r_k}, 1)\}$
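The three adjustments can be sketched directly from the formulas above. This is an illustrative pure-Python version with hypothetical function names, not code from the course; each function returns the adjusted p-values in the original gene order.

```python
def bonferroni(p):
    """Single-step Bonferroni adjusted p-values: min(G * p_j, 1)."""
    G = len(p)
    return [min(G * pj, 1.0) for pj in p]

def holm(p):
    """Holm step-down adjusted p-values (running maximum from the smallest p)."""
    G = len(p)
    order = sorted(range(G), key=lambda j: p[j])  # indices r_1, ..., r_G
    adj = [0.0] * G
    running_max = 0.0
    for k, j in enumerate(order, start=1):
        running_max = max(running_max, min((G - k + 1) * p[j], 1.0))
        adj[j] = running_max
    return adj

def hochberg(p):
    """Hochberg adjusted p-values (running minimum from the largest p)."""
    G = len(p)
    order = sorted(range(G), key=lambda j: p[j])
    adj = [0.0] * G
    running_min = 1.0
    for k in range(G, 0, -1):
        j = order[k - 1]
        running_min = min(running_min, min((G - k + 1) * p[j], 1.0))
        adj[j] = running_min
    return adj

p_values = [0.01, 0.04, 0.03, 0.005]  # toy unadjusted p-values
print(bonferroni(p_values))
print(holm(p_values))
print(hochberg(p_values))
```

On any input the Hochberg adjusted p-values are no larger than the Holm ones, which in turn are no larger than the Bonferroni ones.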
Control of the FWER
• Westfall & Young (1993) step-down minP adjusted p-values: $p_{r_j}^* = \max_{k=1,\dots,j} \{ p(\min_{l \in \{r_k,\dots,r_G\}} P_l \le p_{r_k} \mid H_0^C) \}$
• Westfall & Young (1993) step-down maxT adjusted p-values: $p_{r_j}^* = \max_{k=1,\dots,j} \{ p(\max_{l \in \{r_k,\dots,r_G\}} |T_l| \ge |t_{r_k}| \mid H_0^C) \}$
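A resampling estimate of the maxT adjusted p-values can be sketched as below. This assumes the null statistics have already been generated, for example by recomputing all G statistics after each of B permutations of the sample labels; the function name and the `t_null` layout (one list of G statistics per permutation) are hypothetical.

```python
def maxt_adjusted(t_obs, t_null):
    """Estimate step-down maxT adjusted p-values.

    t_obs:  list of G observed test statistics.
    t_null: list of B lists, each holding G statistics computed under
            the complete null (assumed here to come from permutations).
    """
    G = len(t_obs)
    B = len(t_null)
    # order[k] is the gene with the (k+1)-th largest observed |t|
    order = sorted(range(G), key=lambda j: -abs(t_obs[j]))
    counts = [0] * G
    for t_b in t_null:
        u = 0.0
        # successive maxima over the genes ranked k, k+1, ..., G
        for k in range(G - 1, -1, -1):
            u = max(u, abs(t_b[order[k]]))
            if u >= abs(t_obs[order[k]]):
                counts[k] += 1
    adj = [0.0] * G
    running_max = 0.0
    for k in range(G):
        # the outer max over k = 1..j enforces monotone adjusted p-values
        running_max = max(running_max, counts[k] / B)
        adj[order[k]] = running_max
    return adj

t_observed = [3.0, -1.0, 0.5]
t_permuted = [[0.2, 2.0, 0.1], [3.5, 0.3, 0.4], [0.1, -0.2, 0.6], [1.0, 1.2, -0.3]]
print(maxt_adjusted(t_observed, t_permuted))
```

With only B permutations the smallest attainable estimate is 1/B, so B needs to be large when small adjusted p-values matter.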
Westfall & Young (1993) Adjusted p-values
• Step-down procedures: successively smaller adjustments at each step
• Take into account the joint distribution of the test statistics
• Less conservative than Bonferroni, Holm, or Hochberg adjusted p-values
• Can be estimated by resampling but computer-intensive (especially for minP)
maxT vs. minP
• The maxT and minP adjusted p-values are the same when the test statistics are identically distributed (id)
• When the test statistics are not id, maxT adjustments may be unbalanced (not all tests contribute equally to the adjustment)
• maxT is more computationally tractable than minP
• maxT can be more powerful in ‘small n, large G’ situations
Control of the FDR
• Benjamini & Hochberg (1995): step-up procedure which controls the FDR under some dependency structures: $p_{r_j}^* = \min_{k=j,\dots,G} \{ \min([G/k]\, p_{r_k}, 1) \}$
• Benjamini & Yekutieli (2001): conservative step-up procedure which controls the FDR under general dependency structures: $p_{r_j}^* = \min_{k=j,\dots,G} \{ \min([G \sum_{i=1}^{G} (1/i) / k]\, p_{r_k}, 1) \}$
• Yekutieli & Benjamini (1999): resampling based adjusted p-values for controlling the FDR under certain types of dependency structures
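Both step-up adjustments share the same structure and differ only by the harmonic-sum factor, which the following sketch makes explicit. Function and parameter names are hypothetical.

```python
def benjamini_hochberg(p, dependency_factor=1.0):
    """Step-up FDR adjusted p-values.

    dependency_factor = 1 gives the Benjamini & Hochberg adjustment;
    the Benjamini & Yekutieli adjustment passes the harmonic sum instead.
    """
    G = len(p)
    order = sorted(range(G), key=lambda j: p[j])  # indices r_1, ..., r_G
    adj = [0.0] * G
    running_min = 1.0
    for k in range(G, 0, -1):
        j = order[k - 1]
        running_min = min(running_min, min(dependency_factor * G / k * p[j], 1.0))
        adj[j] = running_min
    return adj

def benjamini_yekutieli(p):
    """Conservative step-up adjustment with factor sum_{i=1}^{G} 1/i."""
    G = len(p)
    harmonic = sum(1.0 / i for i in range(1, G + 1))
    return benjamini_hochberg(p, dependency_factor=harmonic)

p_values = [0.01, 0.04, 0.03, 0.005]  # toy unadjusted p-values
print(benjamini_hochberg(p_values))
print(benjamini_yekutieli(p_values))
```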
Identification of Genes Associated with Survival
• Data: survival $y_i$ and gene expression $x_{ij}$ for individuals $i = 1, \dots, n$ and genes $j = 1, \dots, G$
• Fit Cox model for each gene singly: $h(t) = h_0(t) \exp(\beta_j x_{ij})$
• For any gene $j = 1, \dots, G$, can test $H_j$: $\beta_j = 0$
• Complete null $H_0^C$: $\beta_j = 0$ for all $j = 1, \dots, G$
• The $H_j$ are tested on the basis of the Wald statistics $t_j$ and their associated p-values $p_j$
Datasets
• Lymphoma (Alizadeh et al.): 40 individuals, 4026 genes
• Melanoma (Bittner et al.): 15 individuals, 3613 genes
• Both available at http://lpgprot101.nci.nih.gov:8080/GEAW
Results: Lymphoma
Results: Melanoma
Other Proposals from the Microarray Literature
• ‘Neighborhood Analysis’, Golub et al.
– In general, gives only weak control of FWER
• ‘Significance Analysis of Microarrays (SAM)’ (2 versions)
– Efron et al. (2000): weak control of PFER
– Tusher et al. (2001): strong control of PFER
• SAM also estimates ‘FDR’, but this ‘FDR’ is defined as $E(V \mid H_0^C)/R$, not $E(V/R)$
Controversies
• Whether multiple testing methods (adjustments) should be applied at all
• Which tests should be included in the family (e.g. all tests performed within a single experiment; define ‘experiment’)
• Alternatives
– Bayesian approach
– Meta-analysis
Situations where inflated error rates are a concern
• It is plausible that all nulls may be true
• A serious claim will be made whenever any p < .05 is found
• Much data manipulation is performed to find a ‘significant’ result
• The analysis is planned to be exploratory, but one wishes to claim that ‘sig’ results are real
• The experiment is unlikely to be followed up before serious actions are taken
References
• Alizadeh et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403: 503-511
• Benjamini and Hochberg (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSSB 57: 289-300
• Benjamini and Yekutieli (2001) The control of false discovery rate in multiple hypothesis testing under dependency. Annals of Statistics
• Bittner et al. (2000) Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406: 536-540
• Efron et al. (2000) Microarrays and their use in a comparative experiment. Tech report, Stats, Stanford
• Golub et al. (1999) Molecular classification of cancer. Science 286: 531-537
• Hochberg (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75: 800-802
• Holm (1979) A simple sequentially rejective multiple testing procedure. Scand. J Statistics 6: 65-70
• Ihaka and Gentleman (1996) R: A language for data analysis and graphics. J Comp Graph Stats 5: 299-314
• Tusher et al. (2001) Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. PNAS 98: 5116-5121
• Westfall and Young (1993) Resampling-based multiple testing: Examples and methods for p-value adjustment. New York: Wiley
• Yekutieli and Benjamini (1999) Resampling based false discovery rate controlling multiple test procedures for correlated test statistics. J Stat Plan Inf 82: 171-196
(BREAK)
Prediction and Variable Selection
• Substantial statistical literature on model selection for minimizing prediction error
• Most of the focus is on linear models
• Almost universally assumed (in the statistics literature) that n > (or >>) p, the number of available predictors
• Other fields (e.g. chemometrics) have been dealing with the n << p problem
Model Selection (Generic)
• Select the class of models to be considered (e.g. linear models, regression trees, etc.)
• Use a procedure to compare models in the class
• Search the model space
• Assess prediction error
Model Selection and Assessment
• The generalization performance of a learning method relates to its prediction capability on independent test data
• This performance guides model choice
• Performance is a measure of quality of the chosen model
Bias, Variance, and Model Complexity
• Test error (or generalization error) is the expected prediction error over an independent test sample
• Training error is the average loss over the training sample: $\mathrm{Err} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))$
Prediction Error
Error vs. Complexity
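[Figure: prediction error against model complexity for the training sample and the test sample; low complexity corresponds to high bias and low variance, high complexity to low bias and high variance]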
Using the data
• Ideally, divide data into 3 sets:
– Training set: used to fit models
– Validation set: used to estimate prediction error for model selection
– Test set: used to assess the generalization error for the final model
• How much training data are ‘enough’ depends on signal-noise ratio, model complexity, etc.
• Most microarray data sets are too small for dividing further
Approximate Validation Methods
• Analytic methods include
– Akaike information criterion (AIC)
– Bayesian information criterion (BIC)
– Minimum description length (MDL)
• Sample re-use methods
– Cross-validation
– Bootstrap
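As an illustration of sample re-use, K-fold cross-validation can be sketched as follows. All names are hypothetical, and `fit`, `predict` and `loss` are placeholders supplied by the caller. Folds are taken as contiguous blocks for simplicity; in practice the samples would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validated_error(x, y, fit, predict, loss, k=5):
    """Average held-out loss over k folds."""
    n = len(y)
    total = 0.0
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        # fit on the training part only, then score the held-out part
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        total += sum(loss(y[i], predict(model, x[i])) for i in test_idx)
    return total / n

# Toy example: a "model" that always predicts the training mean.
fit_mean = lambda xs, ys: sum(ys) / len(ys)
predict_mean = lambda model, xi: model
squared = lambda a, b: (a - b) ** 2
print(cross_validated_error([0, 0, 0, 0], [1, 2, 3, 4], fit_mean, predict_mean, squared, k=2))
```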
Some Approaches when n < p
• Some kind of initial screening is essential for many types of models
• Rank genes in terms of variance (or coefficient of variation) across samples, and use only the genes with the largest values
• Dimensionality reduction through principal components: use the first (some number) PCs as variables
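Variance screening can be sketched in a few lines. The function name and the toy matrix are hypothetical, and the number of genes to keep is left to the analyst.

```python
from statistics import variance

def top_variance_genes(expr, n_keep):
    """Return indices of the n_keep genes with the largest sample variance.

    expr[g] is the list of expression values of gene g across samples.
    """
    ranked = sorted(range(len(expr)), key=lambda g: variance(expr[g]), reverse=True)
    return ranked[:n_keep]

toy_expr = [
    [0.0, 0.0, 0.0],    # constant gene, variance 0
    [1.0, 2.0, 3.0],    # variance 1
    [0.0, 10.0, 20.0],  # variance 100
]
print(top_variance_genes(toy_expr, 2))
```

Screening on variance does not use the response, so it does not by itself bias a later estimate of prediction error; screening that uses the response would need to be repeated inside each cross-validation fold.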
Parametric Variable Selection (I)
• Forward selection: start with no variables; add additional variables satisfying some criterion
• Backward elimination: start with all variables; delete variables one at a time according to some criterion until some stopping rule is satisfied
• ‘Stepwise’: after each variable added, test to see if any previously selected variable may be deleted without appreciable loss of explanatory power
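Forward selection can be sketched generically once a criterion is chosen. Here `score` is a hypothetical caller-supplied function returning a value to maximize for a list of variables (for example, a negative cross-validated error), and `min_gain` is an assumed stopping threshold.

```python
def forward_selection(candidates, score, min_gain=0.0):
    """Greedy forward selection.

    Starts with no variables and repeatedly adds the variable giving the
    largest improvement in score, stopping when no addition improves the
    score by more than min_gain.
    """
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        trials = [(score(selected + [v]), v) for v in remaining]
        new_best, var = max(trials, key=lambda pair: pair[0])
        if new_best - best <= min_gain:
            break
        selected.append(var)
        remaining.remove(var)
        best = new_best
    return selected

# Toy additive criterion: each variable contributes a fixed amount.
value = {"a": 3.0, "b": 1.0, "c": 0.0}
toy_score = lambda subset: sum(value[v] for v in subset)
print(forward_selection(["a", "b", "c"], toy_score))
```

Backward elimination follows the same pattern in reverse: start from all candidates and repeatedly remove the variable whose deletion costs the least.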
Parametric Variable Selection (II)
• Sequential replacement: see if any variables can be replaced with another, according to some criterion
• Generating all subsets: provided the number of variables is not too large and the criterion assessing performance is not too difficult or time-consuming to compute
• Branch and bound: divide possible subsets into groups (branches); search of some sub-branches may be avoided if they exceed a bound on some criterion
An Intriguing Approach
• Gabriel and Pun (1979) suggested that when an exhaustive search is infeasible, it may be possible to separate variables into groups for which an exhaustive search is feasible
• For a linear model, grouping would be such that the regression sum of squares is additive for variables in different groups (orthogonal; also under certain other conditions)
• But it is hard to see how to extend this to other types of models, e.g. survival
Tree-based Variable Selection
• Tree-based models are most often used for prediction, with little attention to details of the chosen model
• Trees can be used to identify subsets of variables with good discriminatory power via importance statistics
• An idea is to use bagging to generate a collection of tree predictors and importance statistics for each variable; can then rank variables by their (median, say) importance
• Create a prediction accuracy criterion for inclusion of variables in the final subset
Genomic Computing for Variable Selection
• A type of evolutionary computing algorithm
• Goal is to evolve simple explanatory rules with high explanatory power
• May do better than tree-based methods, where variables are selected on the basis of their individual importance (but bagging may improve this)
The Basic Strategy of Evolutionary Computing
What this course covered
• Biological basics of (mostly cDNA) microarray technology
• Special problems arising, particularly regarding normalization of arrays and multiple hypothesis testing
• Some ways that standard statistical techniques may be useful
• Some ways that more sophisticated techniques have been/may be applied
• Examples of areas where more research is needed
What was left out
• Pathway modeling
– This is a very active field, as there is much interest in picking out genes working together based on expression
– My view is that progress here will not come from generic ‘black box’ methods, but will instead require highly collaborative, directed modeling
• A comprehensive review of methods developed for analysis of microarray data
– Instead, we have covered what are, in my opinion, some of the most important and fundamentally justifiable methods
Perspectives on the future
• Technologies are evolving, don’t get too ‘locked in’ to any particular technology
• Keep an open mind to various problem solving approaches…
• …But that doesn’t mean not to think!
Important Applications Include…
• Identification of therapeutic targets
• Molecular classification of cancers
• Host-parasite interactions
• Disease process pathways
• Genomic response to pathogens
• Many others
Acknowledgements
• Debashis Ghosh
• Erin Conlon
• Sandrine Dudoit
• José Correa