Document

Transcript Document

Statistics for Microarrays

Multiple Testing and Prediction and Variable Selection Class web site: http://statwww.epfl.ch/davison/teaching/Microarrays/

Genes

cDNA gene expression data

1 2 3 4 5 Data on G genes for n samples mRNA samples

sample1 sample2 sample3 sample4 sample5 …

0.46

-0.10 0.15

-0.45

-0.06

0.30

0.49

0.74

-1.03

1.06

0.80

0.24

0.04

-0.79

1.35

1.51

0.06

0.10

-0.56

1.09

0.90

0.46

0.20

-0.32

-1.09

...

Gene expression level of gene

in mRNA sample

(normalized) Log( Red intensity / Green intensity )

Multiple Testing Problem

• Simultaneously test G null hypotheses, one for each gene j H j : no association between expression level of gene j and the covariate or response • Because microarray experiments simultaneously monitor expression levels of thousands of genes, there is a large multiplicity issue • Would like some sense of how ‘surprising’ the observed results are

Hypothesis Truth vs. Decision

Decision Truth # true H # not rejected U # rejected V (F +) totals m 0 # non-true H T totals m - R S R m 1 m

Type I (False Positive) Error Rates

• • • • Per-family Error Rate PFER = E(V) Per-comparison Error Rate PCER = E(V)/m Family-wise Error Rate FWER = p(V ≥ 1) False Discovery Rate FDR = E(Q), where Q = V/R if R > 0; Q = 0 if R = 0

Strong vs. Weak Control

• All probabilities are conditional hypotheses are true on which • Strong control refers to control of the Type I error rate under any combination of true and false nulls • Weak control refers to control of the Type I error rate only under the complete null hypothesis (i.e. all nulls true) • In general, weak control without other safeguards is unsatisfactory

Comparison of Type I Error Rates

• In general, for a given multiple testing procedure, PCER  FWER  PFER, and FDR  FWER, with FDR = FWER under the complete null

Adjusted p-values (p*)

• If interest is in controlling, e.g., the FWER, the adjusted p-value for hypothesis H j is: p j * = inf {  : H j is rejected at FWER  } • Hypothesis H if p j *   j is rejected at FWER • Adjusted p-values for other Type I error rates are similarly defined 

Some Advantages of p-value Adjustment

• Test level determined in advance • Some procedures in terms of their adjusted p-values • Usually • Procedures can be based on the corresponding adjusted p-values (size) does not need to be most easily described easily estimated using resampling readily compared

A Little Notation

• For hypothesis H j , j = 1, …, G observed test statistic: t j observed unadjusted p-value: p j • Ordering of observed (absolute) t j : {r j } such that |t r1 |  |t r2 |  …  |t rG | • Ordering of observed p j : {r j } such that |p r1 |  letters (T, P) |p r2 |  …  |p rG | • Denote corresponding RVs by upper case

Control of the FWER

• Bonferroni single-step adjusted p-values p j * = min (Gp j , 1) • Holm (1979) step-down p rj * = max k = 1…j adjusted p-values {min ((G-k+1)p rk , 1)} • Hochberg (1988) step-down adjusted p-values (Simes inequality) p rj * = min k = j…G {min ((G-k+1)p rk , 1) }

Control of the FWER

• Westfall & Young (1993) step-down minP adjusted p-values p rj * = max k = 1…j { p(max l  {rk…rG} P l  p rk  H 0 C )} • Westfall & Young (1993) step-down maxT adjusted p-values p rj * = max k = 1…j { p(max l  {rk…rG} |T l | ≥ |t rk |  H 0 C )}

Westfall & Young (1993) Adjusted p-values

• Step-down procedures: successively smaller adjustments at each step • Take into account the joint distribution of the test statistics • Less conservative than Bonferroni, Holm, or Hochberg adjusted p-values • Can be estimated by resampling but computer-intensive (especially for minP)

maxT vs. minP

• The maxT and minP adjusted p-values are the same when the test statistics are identically distributed (id) • When the test statistics are not id, maxT adjustments may be unbalanced (not all tests contribute equally to the adjustment) • maxT more computationally tractable than minP • maxT can be more powerful situations in ‘small n, large G’

Control of the FDR

• Benjamini & Hochberg (1995): step-up procedure which controls the FDR under some dependency structures p rj * = min k = j…G { min ([G/k] p rk , 1) } • • Benjamini & Yuketieli (2001): conservative step up procedure which controls the FDR under general dependency structures p rj * = min k = j…G { min (G  j=1 G [1/j]/k] p rk , 1) } Yuketieli & Benjamini (1999): resampling based adjusted p-values for controlling the FDR under certain types of dependency structures

Identification of Genes Associated with Survival

• Data: survival y i and gene expression x ij individuals i = 1, …, n and genes j = 1, …, G for • Fit Cox model for each gene singly: h(t) = h 0 (t) exp(  j x ij ) • For any gene j = 1, …, G, can test H j :  j = 0 • Complete null H 0 C :  j = 0 for all j = 1, …, G • The H j are tested on the basis of the Wald statistics t j and their associated p-values p j

Datasets

• Lymphoma (Alizadeh et al.) 40 individuals, 4026 genes • Melanoma (Bittner et al.) 15 individuals, 3613 genes • Both available at http://lpgprot101.nci.nih.gov:8080/GEAW

Results: Lymphoma

Results: Melanoma

Other Proposals from the Microarray Literature

• ‘Neighborhood Analysis’ , Golub et al.

– In general, gives only weak control of FWER • ‘Significance Analysis of Microarrays (SAM)’ (2 versions) – Efron et al. (2000): weak control of PFER – Tusher et al. (2001): strong control of PFER • SAM also estimates ‘FDR’, but this ‘FDR’ is defined as E(V|H 0 C )/R, not E(V/R)

Controversies

• Whether multiple testing methods (adjustments) should be applied at all • Which tests family should be included in the (e.g. all tests performed within a single experiment; define ‘experiment’) • Alternatives – Bayesian approach – Meta-analysis

Situations where inflated error rates are a concern

• It is plausible that all nulls may be true • A serious claim will be made whenever any p < .05 is found • Much data manipulation performed to find a ‘significant’ result • The analysis is planned to be but wish to claim ‘sig’ results are real • Experiment may be exploratory unlikely to be followed up before serious actions are taken

References

• Alizadeh et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403: 503-511 • Benjamini and Hochberg (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSSB 57: 289-200 • Benjamini and Yuketieli (2001) The control of false discovery rate in multiple hypothesis testing under dependency. Annals of Statistics • Bittner et al. (2000) Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406: 536-540 • Efron et al. (2000) Microarrays and their use in a comparative experiment. Tech report, Stats, Stanford • Golub et al. (1999) Molecular classification of cancer. Science 286: 531-537

References

• Hochberg (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75: 800-802 • Holm (1979) A simple sequentially rejective multiple testing procedure. Scand. J Statistics 6: 65-70 • Ihaka and Gentleman (1996) R: A language for data analysis and graphics. J Comp Graph Stats 5: 299-314 • Tusher et al. (2001) Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. PNAS 98: 5116 -5121 • Westfall and Young (1993) Resampling-based multiple testing: Examples and methods for p-value adjustment. New York: Wiley • Yuketieli and Benjamini (1999) Resampling based false discovery rate controlling multiple test procedures for correlated test statistics. J Stat Plan Inf 82: 171-196

(BREAK)

Prediction and Variable Selection

• Substantial statistical literature on model selection for minimizing prediction error • Most of the focus is on linear models • Almost universally assumed (in the statistics literature) that n > (or >>) p , the number of available predictors • Other fields (e.g. chemometrics) have been dealing with the n << p problem

Model Selection (Generic)

• Select the class of models considered (e.g. linear models, regression trees, etc) to be • Use a procedure to compare models in the class • Search the model space • Assess prediction error

Model Selection and Assessment

• The generalization performance of a learning method relates to its prediction capability on independent test data • This performance guides model choice • Performance is a measure of quality of the chosen model

•

Bias, Variance, and Model Complexity

Test error (or generalization error the expected prediction error over an independent test sample ) is • Training error is the average loss over the training sample Err = (1/n)  n i=1 L(y i , f(x i ))

Prediction Error

Error vs. Complexity

High Bias Low Variance Low Bias High Variance Test sample Training sample Model Complexity

Using the data

• Ideally, divide data into 3 sets: – Training set : used to fit models – – Validation set : used to estimate prediction error for model selection Test set : used to assess the generalization error for the final model • How much training data are ‘enough’ depends on signal-noise ratio, model complexity, etc.

• Most microarray data sets too small for dividing further

Approximate Validation Methods

• Analytic Methods include – Akaike information criterion (AIC) – Bayesian information criterion (BIC) – Minimum description length (MDL) • Sample re-use methods – Cross-validation – Bootstrap

Some Approaches when n < p

• Some kind of initial screening is essential for many types of models • Rank genes in terms of variance (or coefficient of variation) across samples, use only biggest • Dimensionality reduction through principal components , use the first (some number) PCs as variables

Parametric Variable Selection (I)

• • • Forward selection criterion : start with no variables; add additional variables satisfying some Backward elimination : start with all variables; delete variables one at a time according to some criterion until some stopping rule is satisfied ‘Stepwise’ : after each variable added, test to see if any previously selected variable may be deleted without appreciable loss of explanatory power

• • •

Parametric Variable Selection (II)

Sequential replacement some criterion : see if any variables can be replaced with another, according to Generating all subsets : provided the number of variables is not too large and the criterion assessing performance is not too difficult or time-consuming to compute Branch and bound some criterion : divide possible subsets into groups (branches), search of some sub branches may be avoided if exceed bound on

An Intriguing Approach

• Gabriel and Pun (1979): suggested that when an exhaustive search infeasible, may be possible to separate variables into groups for which an exhaustive search is feasible • For linear model, grouping would be such that regression sum of squares is additive under certain other conditions) for variables in different groups (orthogonal; also • But hard to see how to extend to other types of models, e.g. survival

Tree-based Variable Selection

• Tree-based models most often used for prediction , with little attention to details on the chosen model • Trees can be used to statistics • An idea is to use identify subsets of variables with good discriminatory power via importance bagging to generate a collection of tree predictors and importance statistics for each variable; can then rank variables by their (median, say) importance • Create a prediction accuracy criterion for inclusion of variables in the final subset

Genomic Computing for Variable Selection

• A type of evolutionary computing algorithm • Goal is to evolve simple explanatory rules with high explanatory power • May do better than tree-based methods, where variables selected on the basis of their individual importance (but bagging may improve this)

The Basic Strategy of Evolutionary Computing

What this course covered

• Biological basics of (mostly cDNA) microarray technology • Special problems arising, particularly regarding normalization of arrays and multiple hypothesis testing • Some ways that standard statistical techniques may be useful • Some ways that more sophisticated techniques have been/may be applied • Examples of areas where more research is needed

What was left out

• Pathway modeling – This is a modeling very active field, as there is much interest in picking out genes working together based on expression – My view is that progress here will not come from generic ‘black box’ methods, but will instead require highly collaborative, directed • A comprehensive review of methods developed for analysis of microarray data – Instead, we have covered what are, in my opinion, some of the most important and fundamentally justifiable methods

Perspectives on the future

• Technologies are evolving, don’t get too ‘locked in’ to any particular technology • Keep an open mind to various problem solving approaches… • …But that doesn’t mean not to think!

Important Applications Include…

• Identification of therapeutic targets • Molecular classification of cancers • Host-parasite interactions • Disease process pathways • Genomic response to pathogens • Many others

Acknowledgements

• Debashis Ghosh • Erin Conlon • Sandrine Dudoit • José Correa

Document

Transcript Document

Statistics for Microarrays

cDNA gene expression data

Multiple Testing Problem

Hypothesis Truth vs. Decision

Type I (False Positive) Error Rates

Strong vs. Weak Control

Comparison of Type I Error Rates

Adjusted p-values (p*)

Some Advantages of p-value Adjustment

A Little Notation

Control of the FWER

Control of the FWER

Westfall & Young (1993) Adjusted p-values

maxT vs. minP

Control of the FDR

Identification of Genes Associated with Survival

Datasets

Results: Lymphoma

Results: Melanoma

Other Proposals from the Microarray Literature

Controversies

Situations where inflated error rates are a concern

References

References

(BREAK)

Prediction and Variable Selection

Model Selection (Generic)

Model Selection and Assessment

Bias, Variance, and Model Complexity

Error vs. Complexity

Using the data

Approximate Validation Methods

Some Approaches when n < p

Parametric Variable Selection (I)

Parametric Variable Selection (II)

An Intriguing Approach

Tree-based Variable Selection

Genomic Computing for Variable Selection

The Basic Strategy of Evolutionary Computing

What this course covered

What was left out

Perspectives on the future

Important Applications Include…

Acknowledgements

Directory