Design & Analysis of Microarray
Studies for Diagnostic & Prognostic
Classification
Richard Simon, D.Sc.
Chief, Biometric Research Branch
National Cancer Institute
http://linus.nci.nih.gov/brb
http://linus.nci.nih.gov/brb
– Powerpoint presentations
– Reprints & Technical Reports
– BRB-ArrayTools software
– BRB-ArrayTools Data Archive
– Sample Size Planning for Targeted Clinical Trials
Simon R, Korn E, McShane L, Radmacher M, Wright G, Zhao Y. Design and analysis of DNA microarray
investigations, Springer-Verlag, 2003.
Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. Journal of
Computational Biology 9:505-511, 2002.
Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the analysis of DNA microarray data. Journal of the
National Cancer Institute 95:14-18, 2003.
Dobbin K, Simon R. Comparison of microarray designs for class comparison and class discovery, Bioinformatics
18:1462-69, 2002; 19:803-810, 2003; 21:2430-37, 2005; 21:2803-4, 2005.
Dobbin K and Simon R. Sample size determination in microarray experiments for class comparison and prognostic
classification. Biostatistics 6:27-38, 2005.
Dobbin K, Shih J, Simon R. Questions and answers on design of dual-label microarrays for identifying differentially
expressed genes. Journal of the National Cancer Institute 95:1362-69, 2003.
Wright G, Simon R. A random variance model for detection of differential gene expression in small microarray
experiments. Bioinformatics 19:2448-55, 2003.
Korn EL, Troendle JF, McShane LM, Simon R. Controlling the number of false discoveries. Journal of Statistical
Planning and Inference 124:379-398, 2004.
Molinaro A, Simon R, Pfeiffer R. Prediction error estimation: A comparison of resampling methods. Bioinformatics
21:3301-7, 2005.
Simon R. Using DNA microarrays for diagnostic and prognostic prediction. Expert Review of Molecular Diagnostics,
3(5) 587-595, 2003.
Simon R. Diagnostic and prognostic prediction using gene expression profiles in high dimensional microarray data.
British Journal of Cancer 89:1599-1604, 2003.
Simon R and Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical
Cancer Research 10:6759-63, 2004.
Maitournam A and Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine 24:329-339, 2005.
Simon R. When is a genomic classifier ready for prime time? Nature Clinical Practice – Oncology 1:4-5, 2004.
Simon R. An agenda for Clinical Trials: clinical trials in the genomic era. Clinical Trials 1:468-470, 2004.
Simon R. Development and Validation of Therapeutically Relevant Multi-gene Biomarker Classifiers. Journal of the
National Cancer Institute 97:866-867, 2005.
Simon R. A roadmap for developing and validating therapeutically relevant genomic classifiers. Journal of Clinical
Oncology (In Press).
Freidlin B and Simon R. Adaptive signature design. Clinical Cancer Research (In Press).
Simon R. Validation of pharmacogenomic biomarker classifiers for treatment selection. Disease Markers (In Press).
Simon R. Guidelines for the design of clinical studies for development and validation of therapeutically relevant
biomarkers and biomarker classification systems. In Biomarkers in Breast Cancer, Hayes DF and Gasparini G,
Humana Press (In Press).
Myth
• That microarray investigations should be
unstructured data-mining adventures
without clear objectives
• Good microarray studies have clear
objectives, but not generally gene specific
mechanistic hypotheses
• Design and analysis methods should be
tailored to study objectives
Good Microarray Studies Have
Clear Objectives
• Class Comparison
– Find genes whose expression differs among
predetermined classes
• Class Prediction
– Prediction of predetermined class (phenotype)
using information from gene expression profile
• Class Discovery
– Discover clusters of specimens having similar
expression profiles
– Discover clusters of genes having similar
expression profiles
Class Comparison and Class
Prediction
• Not clustering problems
– Global similarity measures generally used for
clustering arrays may not distinguish classes
– Don’t control multiplicity or distinguish the
data used for classifier development from the
data used for classifier evaluation
• Supervised methods
• Require multiple biological samples from
each class
Levels of Replication
• Technical replicates
– RNA sample divided into multiple aliquots and rearrayed
• Biological replicates
– Multiple subjects
– Replication of the tissue culture experiment
• Biological conclusions generally require
independent biological replicates. The
power of statistical methods for microarray
data depends on the number of biological
replicates.
• Technical replicates are useful insurance
to ensure that at least one good quality
array of each specimen will be obtained.
Class Prediction
• Predict which tumors will respond to a
particular treatment
• Predict which patients will relapse after a
particular treatment
• Class prediction methods usually have
gene selection as a component
– The criteria for gene selection for class
prediction and for class comparison are
different
• For class comparison false discovery rate is
important
• For class prediction, predictive accuracy is
important
Clarity of Objectives is Important
• Patient selection
– Many microarray studies developing
classifiers are not “therapeutically relevant”
• Analysis methods
– Many microarray studies use cluster analysis
inappropriately or misleadingly
Microarray Platforms for
Developing Predictive Classifiers
• Single label arrays
– Affymetrix GeneChips
• Dual label arrays using common reference
design
– Dye swaps are unnecessary
Common Reference Design

|       | Array 1 | Array 2 | Array 3 | Array 4 |
|-------|---------|---------|---------|---------|
| RED   | A1      | A2      | B1      | B2      |
| GREEN | R       | R       | R       | R       |

Ai = ith specimen from class A
Bi = ith specimen from class B
R = aliquot from reference pool
• The reference generally serves to control
variation in the size of corresponding spots
on different arrays and variation in sample
distribution over the slide.
• The reference provides a relative measure of
expression for a given gene in a given
sample that is less variable than an absolute
measure.
• The reference is not the object of
comparison.
• The relative measure of expression will be
compared among biologically independent
samples from different classes.
Myth
• For two color microarrays, each sample of
interest should be labeled once with Cy3
and once with Cy5 in dye-swap pairs of
arrays.
Dye Bias
• Average differences among dyes in label
concentration, labeling efficiency, photon
emission efficiency and photon detection
are corrected by normalization procedures
• Gene specific dye bias may not be
corrected by normalization
• Dye swap technical replicates of the same two
RNA samples are rarely necessary.
• Using a common reference design, dye swap
arrays are not necessary for valid comparisons
of classes since specimens labeled with different
dyes are never compared.
• For two-label direct comparison designs for
comparing two classes, it is more efficient to
balance the dye-class assignments for
independent biological specimens than to do
dye swap technical replicates
Balanced Block Design

|       | Array 1 | Array 2 | Array 3 | Array 4 |
|-------|---------|---------|---------|---------|
| RED   | A1      | B2      | A3      | B4      |
| GREEN | B1      | A2      | B3      | A4      |

Ai = ith specimen from class A
Bi = ith specimen from class B
• Detailed comparisons of the effectiveness of
designs:
– Dobbin K, Simon R. Comparison of microarray
designs for class comparison and class discovery.
Bioinformatics 18:1462-9, 2002
– Dobbin K, Shih J, Simon R. Statistical design of
reverse dye microarrays. Bioinformatics 19:803-10,
2003
– Dobbin K, Simon R. Questions and answers on the
design of dual-label microarrays for identifying
differentially expressed genes, JNCI 95:1362-1369,
2003
• Common reference designs are very effective for many
microarray studies. They are robust, permit comparisons
among separate experiments, and permit many types of
comparisons and analyses to be performed.
• For simple two class comparison problems, balanced
block designs require many fewer arrays than common
reference designs.
– Efficiency decreases for more than two classes
– Are more difficult to apply to more complicated class comparison
problems.
– They are not appropriate for class discovery or class prediction.
• Loop designs are less robust, are dominated by either
common reference designs or balanced block designs,
and are not suitable for class prediction or class
discovery.
What We Will Not Discuss Today
• Image analysis
• Normalization
• Clustering methods
• Class comparison
– SAM and other methods of gene finding
– FDR (false discovery rate) and methods for
controlling the number of false positive genes
Class Prediction Model
• Given a sample with an expression profile vector x of
log-ratios or log signals and unknown class.
• Predict which class the sample belongs to
• The class prediction model is a function f which maps
from the set of vectors x to the set of class labels {1,2} (if
there are two classes).
• f generally utilizes only some of the components of x (i.e.
only some of the genes)
• Specifying the model f involves specifying some
parameters (e.g. regression coefficients) by fitting the
model to the data (learning from the data).
Problems With Many
Diagnostic/Prognostic Marker
Studies
• Are not reproducible
– Retrospective non-focused analysis
– Multiplicity problems
– Inter-laboratory assay variation
• Have no impact
– Not therapeutically relevant questions
– Not therapeutically relevant group of patients
– Black box predictors
Components of Class Prediction
• Feature (gene) selection
– Which genes will be included in the model
• Select model type
– E.g. Diagonal linear discriminant analysis,
Nearest-Neighbor, …
• Fitting parameters (regression coefficients)
for model
– Selecting value of tuning parameters
Do Not Confuse Statistical Methods
Appropriate for Class Comparison with
Those Appropriate for Class Prediction
• Demonstrating statistical significance of prognostic
factors is not the same as demonstrating predictive
accuracy.
• Demonstrating goodness of fit of a model to the data
used to develop it is not a demonstration of predictive
accuracy.
• Statisticians are used to inference, not prediction
• Most statistical methods were not developed for p>>n
prediction problems
Feature Selection
• Genes that are differentially expressed among the
classes at a significance level α (e.g. 0.01)
– The α level is selected only to control the number of genes in the
model
t-test Comparisons of Gene
Expression
• xj ~ N(μj1, σj²) for class 1
• xj ~ N(μj2, σj²) for class 2
• H0j: μj1 = μj2
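For concreteness, a minimal Python sketch of this per-gene selection step (illustrative, not from the original presentation; numpy/scipy assumed, with samples in rows and genes in columns):

```python
import numpy as np
from scipy import stats

def select_genes(X1, X2, alpha=0.01):
    """Per-gene two-sample t-tests (normal model with a common within-class
    variance per gene); returns indices of genes significant at level alpha.
    X1, X2: samples-by-genes expression arrays for the two classes."""
    _, p = stats.ttest_ind(X1, X2, axis=0, equal_var=True)
    return np.where(p < alpha)[0]

# On pure-noise data, roughly alpha * (number of genes) are selected by chance:
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(10, 5000)), rng.normal(size=(10, 5000))
print(len(select_genes(X1, X2)))   # around 0.01 * 5000 = 50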
Estimation of Within-Class Variance
• Estimate σj² separately for each gene
– Limited degrees of freedom
– Gene list dominated by genes with small fold changes and
small variances
• Assume all genes have the same variance
– Poor assumption
• Random (hierarchical) variance model
– Wright GW and Simon R. Bioinformatics 19:2448-2455, 2003
– Inverse gamma distribution of residual variances
– Results in exact F (or t) distribution of test statistics with increased
degrees of freedom for error variance
– For any normal linear model
Feature Selection
• Small subset of genes which together give
most accurate predictions
– Combinatorial optimization algorithms
• Genetic algorithms
• Little evidence that complex feature
selection is useful in microarray problems
– Failure to compare to simpler methods
– Some published complex methods for
selecting combinations of features do not
appear to have been properly evaluated
Linear Classifiers for Two
Classes
l(x) = Σi∈F wi xi

x = vector of log-ratios or log signals
F = features (genes) included in model
wi = weight for the ith feature
decision boundary: l(x) > or < d
Linear Classifiers for Two Classes
• Fisher linear discriminant analysis
w = S⁻¹(x̄(1) − x̄(2))
– Requires estimating correlations among all genes selected
for model
– x̄(j) = mean expression vector for class j; S = estimated
covariance matrix of the selected genes
• Diagonal linear discriminant analysis (DLDA)
assumes features are uncorrelated
• Naïve Bayes classifier
• Compound covariate predictor (Radmacher) and
Golub’s method are similar to DLDA in that they can
be viewed as weighted voting of univariate classifiers
Linear Classifiers for Two Classes
• Compound covariate predictor
wi = (x̄i(1) − x̄i(2)) / σ̂i
instead of, for DLDA,
wi = (x̄i(1) − x̄i(2)) / σ̂i²
Linear Classifiers for Two Classes
• Support vector machines with inner
product kernel are linear classifiers with
weights determined to separate the
classes with a hyperplane that minimizes
the length of the weight vector
Support Vector Machine
minimize ‖w‖²
subject to yj (w′x(j) + b) ≥ 1
where yj = ±1 according to the class of case j
Perceptrons
• Perceptrons are neural networks with no hidden
layer and linear transfer functions between input
and output
– Number of input nodes equals number of genes selected
– Number of output nodes equals number of classes minus 1
– Inputs may be the major principal components of all genes
or of informative genes
• Perceptrons are linear classifiers
Naïve Bayes Classifier
• Expression profiles for class j assumed
normal with mean vector mj and diagonal
covariance matrix D
• Likelihood of expression profile vector x is
l(x; mj, D)
• Posterior probability of class j for a case with
expression profile vector x is proportional
to πj l(x; mj, D)
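A hedged sketch of this diagonal-covariance computation in Python (illustrative names, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def naive_bayes_posterior(x, means, sds, priors):
    """Posterior class probabilities under the diagonal-covariance normal
    model: P(j | x) proportional to pi_j * l(x; m_j, D).
    x: gene vector; means: classes-by-genes; sds: shared per-gene SDs
    (the diagonal of D); priors: class prior probabilities pi_j."""
    loglik = norm.logpdf(x, loc=means, scale=sds).sum(axis=1)
    logpost = np.log(priors) + loglik
    logpost -= logpost.max()          # stabilize before exponentiating
    post = np.exp(logpost)
    return post / post.sum()
```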
Compound Covariate Bayes
Classifier
• Compound covariate y = Σi ti xi
– Sum over the genes selected as differentially
expressed
– xi = expression level of the ith selected gene for the
case whose class is to be predicted
– ti = t statistic for testing differential expression of the
ith gene
• Proceed as for the naïve Bayes classifier but
using the single compound covariate as
predictive variable
– GW Wright et al. PNAS 2005.
When p>>n
The Linear Model is Too Complex
• It is always possible to find a set of
features and a weight vector for which the
classification error on the training set is
zero.
• Why consider more complex models?
Myth
• Complex classification algorithms such as
neural networks perform better than
simpler methods for class prediction.
• Artificial intelligence sells to journal
reviewers and peers who cannot
distinguish hype from substance when it
comes to microarray data analysis.
• Comparative studies have shown that
simpler methods work as well or better for
microarray problems because they avoid
overfitting the data.
Other Simple Methods
• Nearest neighbor classification
• Nearest k-neighbors
• Nearest centroid classification
• Shrunken centroid classification
Nearest Neighbor Classifier
• To classify a sample in the validation set as
being in outcome class 1 or outcome class 2,
determine which sample in the training set its
gene expression profile is most similar to.
– Similarity measure used is based on genes
selected as being univariately differentially
expressed between the classes
– Correlation similarity or Euclidean distance
generally used
• Classify the sample as being in the same
class as its nearest neighbor in the training
set
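An illustrative Python sketch of this rule (assuming features were already selected on the training data only):

```python
import numpy as np

def nearest_neighbor_predict(x, X_train, y_train, metric="correlation"):
    """Predict the class of profile x as the class of its most similar
    training profile.  X_train: samples-by-genes array restricted to the
    selected genes; y_train: label array."""
    if metric == "correlation":
        d = np.array([1 - np.corrcoef(x, row)[0, 1] for row in X_train])
    else:                              # Euclidean distance
        d = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(d)]
```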
Other Methods
• Neural networks
• Top-scoring pairs
• CART
• Random Forest
• Genetic algorithm based classification
Apparent Dimension Reduction
Based Methods
• Principal component regression
• Supervised principal component
regression
• Partial least squares
• Stepwise logistic regression
When There Are More Than 2
Classes
• Nearest neighbor type methods
• Decision tree of binary classifiers
Decision Tree of Binary Classifiers
• Partition the set of classes {1,2,…,K} into two disjoint subsets S1 and
S2
• Develop a binary classifier for distinguishing the composite classes
S1 and S2
• Compute the cross-validated classification error for
distinguishing S1 and S2
• Repeat the above steps for all possible partitions in order to find the
partition S1 and S2 for which the cross-validated classification error is
minimized
• If S1 and S2 are not singleton sets, then repeat all of the above steps
separately for the classes in S1 and S2 to optimally partition each of
them
Evaluating a Classifier
• “Prediction is difficult, especially the
future.”
– Niels Bohr
• Fit of a model to the same data used to
develop it is no evidence of prediction
accuracy for independent data.
Evaluating a Classifier
• Fit of a model to the same data used to develop
it is no evidence of prediction accuracy for
independent data
– Goodness of fit vs prediction accuracy
• Demonstrating statistical significance of
prognostic factors is not the same as
demonstrating predictive accuracy
• Demonstrating stability of identification of gene
predictors is not necessary for demonstrating
predictive accuracy
Evaluating a Classifier
• The classification algorithm includes the
following parts:
– Determining what type of classifier to use
– Gene selection
– Fitting parameters
– Optimizing with regard to tuning parameters
• If a re-sampling method such as cross-validation
is to be used to estimate predictive error of a
classifier, all aspects of the classification
algorithm must be repeated for each training set
and the accuracy of the resulting classifier
scored on the corresponding validation set
Split-Sample Evaluation
• Training-set
– Used to select features, select model type, determine
parameters and cut-off thresholds
• Test-set
– Withheld until a single model is fully specified using
the training-set.
– Fully specified model is applied to the expression
profiles in the test-set to predict class labels.
– Number of errors is counted
– Ideally test set data is from different centers than the
training data and assayed at a different time
Leave-one-out Cross Validation
• Omit sample 1
– Develop multivariate classifier from scratch on
training set with sample 1 omitted
– Predict class for sample 1 and record whether
prediction is correct
Leave-one-out Cross Validation
• Repeat analysis for training sets with each
single sample omitted one at a time
• e = number of misclassifications
determined by cross-validation
• Subdivide e for estimation of sensitivity
and specificity
• Cross validation is only valid if the test set is not used in
any way in the development of the model. Using the
complete set of samples to select genes violates this
assumption and invalidates cross-validation.
• With proper cross-validation, the model must be
developed from scratch for each leave-one-out training
set. This means that feature selection must be repeated
for each leave-one-out training set.
• The cross-validated estimate of misclassification error is
an estimate of the prediction error for the model obtained
by applying the specified algorithm to the full dataset
• If you use cross-validation estimates of prediction error
for a set of algorithms indexed by a tuning parameter
and select the algorithm with the smallest cv error
estimate, you do not have a valid estimate of the
prediction error for the selected model
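To make the "from scratch" requirement concrete, here is an illustrative Python sketch of LOOCV with gene selection repeated inside each loop, using the DLDA-style weights sketched earlier (all names and thresholds are assumptions, not the original implementation):

```python
import numpy as np
from scipy import stats

def loocv_error(X, y, alpha=0.001):
    """Leave-one-out cross-validation with the classifier developed from
    scratch -- including gene selection -- on each training set.
    X: samples-by-genes array; y: labels in {1, 2}.  Returns error count e."""
    errors = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i            # training set omits case i
        Xtr, ytr = X[keep], y[keep]
        X1, X2 = Xtr[ytr == 1], Xtr[ytr == 2]
        _, p = stats.ttest_ind(X1, X2, axis=0)
        F = np.where(p < alpha)[0]               # genes chosen WITHOUT case i
        if len(F) == 0:                          # nothing selected: majority vote
            pred = 1 if (ytr == 1).sum() >= (ytr == 2).sum() else 2
        else:
            m1, m2 = X1[:, F].mean(0), X2[:, F].mean(0)
            s2 = (X1[:, F].var(0, ddof=1) + X2[:, F].var(0, ddof=1)) / 2
            w = (m1 - m2) / s2                   # DLDA weights
            d = (w @ m1 + w @ m2) / 2            # midpoint threshold
            pred = 1 if w @ X[i, F] > d else 2
        errors += int(pred != y[i])
    return errors
```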
Prediction on Simulated Null Data
Generation of Gene Expression Profiles
• 14 specimens (Pi is the expression profile for specimen i)
• Log-ratio measurements on 6000 genes
• Pi ~ MVN(0, I6000)
• Can we distinguish between the first 7 specimens (Class 1) and the last 7
(Class 2)?
Prediction Method
• Compound covariate prediction (discussed later)
• Compound covariate built from the log-ratios of the 10 most differentially
expressed genes.
[Figure: proportion of simulated data sets (y-axis, 0.00–1.00) versus number of misclassifications (x-axis, 0–20),
under three strategies: no cross-validation (resubstitution method), cross-validation after gene selection, and
cross-validation prior to gene selection.]
Invalid Criticisms of Cross-Validation
• “You can always find a set of features that
will provide perfect prediction for the
training and test sets.”
– For complex models, there may be many sets
of features that provide zero training errors.
– A modeling strategy that either selects among
those sets or aggregates over those
models will have a generalization error that
is validly estimated by cross-validation.
Simulated Data
40 cases, 10 genes selected from 5000

| Method           | Estimate | Std Deviation |
|------------------|----------|---------------|
| True             | .078     | —             |
| Resubstitution   | .007     | .016          |
| LOOCV            | .092     | .115          |
| 10-fold CV       | .118     | .120          |
| 5-fold CV        | .161     | .127          |
| Split sample 1-1 | .345     | .185          |
| Split sample 2-1 | .205     | .184          |
| .632+ bootstrap  | .274     | .084          |
DLBCL Data

| Method          | Bias  | Std Deviation | MSE  |
|-----------------|-------|---------------|------|
| LOOCV           | -.019 | .072          | .008 |
| 10-fold CV      | -.007 | .063          | .006 |
| 5-fold CV       | .004  | .07           | .007 |
| Split 1-1       | .037  | .117          | .018 |
| Split 2-1       | .001  | .119          | .017 |
| .632+ bootstrap | -.006 | .049          | .004 |
Simulated Data
40 cases

| Method             | Estimate | Std Deviation |
|--------------------|----------|---------------|
| True               | .078     | —             |
| 10-fold            | .118     | .120          |
| Repeated 10-fold   | .116     | .109          |
| 5-fold             | .161     | .127          |
| Repeated 5-fold    | .159     | .114          |
| Split 1-1          | .345     | .185          |
| Repeated split 1-1 | .371     | .065          |
Permutation Distribution of Cross-Validated
Misclassification Rate of a Multivariate Classifier
• Randomly permute class labels and repeat the
entire cross-validation
• Re-do for all (or 1000) random permutations of
class labels
• Permutation p value is fraction of random
permutations that gave as few misclassifications
as e in the real data
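An illustrative sketch of this permutation test, reusing the loocv_error sketch above:

```python
import numpy as np

def permutation_pvalue(X, y, n_perm=1000, alpha=0.001, seed=0):
    """Permutation test for the cross-validated error rate.  The ENTIRE
    cross-validation (including gene selection) is redone for every
    permutation of the class labels; loocv_error is the function from
    the LOOCV sketch above."""
    rng = np.random.default_rng(seed)
    e_obs = loocv_error(X, y, alpha)
    hits = sum(loocv_error(X, rng.permutation(y), alpha) <= e_obs
               for _ in range(n_perm))
    return hits / n_perm   # fraction of permutations with as few errors
```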
Gene-Expression Profiles in
Hereditary Breast Cancer
cDNA Microarrays
Parallel Gene Expression Analysis
• Breast tumors studied:
7 BRCA1+ tumors
8 BRCA2+ tumors
7 sporadic tumors
• Log-ratio measurements of
3226 genes for each tumor
after initial data filtering
RESEARCH QUESTION
Can we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ from
BRCA2– cancers based solely on their gene expression profiles?
BRCA1

| αg   | # of significant genes | # of misclassified samples (m) | % of random permutations with m or fewer misclassifications |
|------|------------------------|--------------------------------|--------------------------------------------------------------|
| 10⁻² | 182                    | 3                              | 0.4                                                          |
| 10⁻³ | 53                     | 2                              | 1.0                                                          |
| 10⁻⁴ | 9                      | 1                              | 0.2                                                          |
BRCA2

| αg   | # of significant genes | m = # of misclassified samples     | % of random permutations with m or fewer misclassifications |
|------|------------------------|------------------------------------|--------------------------------------------------------------|
| 10⁻² | 212                    | 4 (s11900, s14486, s14572, s14324) | 0.8                                                          |
| 10⁻³ | 49                     | 3 (s11900, s14486, s14324)         | 2.2                                                          |
| 10⁻⁴ | 11                     | 4 (s11900, s14486, s14616, s14324) | 6.6                                                          |
Classification of BRCA2 Germline
Mutations

| Classification Method                   | LOOCV Prediction Error |
|-----------------------------------------|------------------------|
| Compound Covariate Predictor            | 14%                    |
| Fisher LDA                              | 36%                    |
| Diagonal LDA                            | 14%                    |
| 1-Nearest Neighbor                      | 9%                     |
| 3-Nearest Neighbor                      | 23%                    |
| Support Vector Machine (linear kernel)  | 18%                    |
| Classification Tree                     | 45%                    |
Common Problems With Internal
Classifier Validation
• Pre-selection of genes using entire dataset
• Failure to consider optimization of tuning
parameter part of classification algorithm
– Varma & Simon, BMC Bioinformatics 2006
• Erroneous use of predicted class in
regression model
Incomplete (incorrect) Cross-Validation
• Publications are using all the data to select
genes and then cross-validating only the
parameter estimation component of model
development
– Highly biased
– Many published complex methods make strong
claims based on incorrect cross-validation
• Frequently seen in complex feature set selection algorithms
• Some software encourages inappropriate cross-validation
Incomplete (incorrect) Cross-Validation
• Let M(b,D) denote a classification model developed on a
set of data D where the model is of a particular type that
is parameterized by a scalar b.
• Use cross-validation to estimate the classification error
of M(b,D) for a grid of values of b; Err(b).
• Select the value of b* that minimizes Err(b).
• Caution: Err(b*) is a biased estimate of the prediction
error of M(b*,D).
• This error is made in some commonly used methods
Complete (correct) Cross-Validation
• Construct a learning set D as a subset of the full set S of
cases.
• Use cross-validation restricted to D in order to estimate
the classification error of M(b,D) for a grid of values of b;
Err(b).
• Select the value of b* that minimizes Err(b).
• Use the model M(b*,D) to predict for the cases in S but
not in D (S-D) and compute the error rate in S-D
• Repeat this full procedure for different learning sets D1,
D2, … and average the error rates of the models M(bi*, Di)
over the corresponding validation sets S−Di
Does an Expression Profile Classifier
Predict More Accurately Than Standard
Prognostic Variables?
• Not an issue of which variables are significant
after adjusting for which others or which are
independent predictors
– Predictive accuracy and inference are different
• The two classifiers can be compared by ROC
analysis as functions of the threshold for
classification
• The predictiveness of the expression profile
classifier can be evaluated within levels of the
classifier based on standard prognostic
variables
Does an Expression Profile Classifier
Predict More Accurately Than Standard
Prognostic Variables?
• Some publications fit a logistic model to the
standard covariates and the cross-validated
predictions of the expression profile classifier:

logit(p) = α + β ŷ(xi | −i) + γ′zi

where ŷ(xi | −i) is the cross-validated prediction for case i and
zi is the vector of standard covariates
• This is valid only with split-sample analysis
because the cross-validated predictions are
not independent
External Validation
• Should address clinical utility, not just predictive
accuracy
– Therapeutic relevance
• Should incorporate all sources of variability likely
to be seen in broad clinical application
– Expression profile assay distributed over time and
space
– Real world tissue handling
– Patients selected from different centers than those
used for developing the classifier
Survival Risk Group Prediction
• Evaluate individual genes by fitting single variable
proportional hazards regression models to log signal or
log ratio for gene
• Select genes based on p-value threshold for single gene
PH regressions
• Compute first k principal components of the selected
genes
• Fit PH regression model with the k PCs as predictors. Let
b1, …, bk denote the estimated regression coefficients
• To predict for a case with expression profile vector x,
compute the k supervised PCs y1, …, yk and the
predictive index ℓ = b1y1 + … + bkyk
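An illustrative Python sketch of this supervised principal components procedure, assuming the third-party lifelines package for the proportional hazards fits (function names and defaults are assumptions, not the BRB-ArrayTools implementation):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter   # assumed third-party dependency

def supervised_pc_index(X, time, event, p_cut=0.001, k=2):
    """Select genes by single-gene Cox p-values, take the first k PCs of
    the selected genes, fit a Cox model on those PCs, and return the
    predictive index b1*y1 + ... + bk*yk for every case.  Inside the LOOCV
    loop on the next slide, this whole function is re-run per training set."""
    pvals = []
    for j in range(X.shape[1]):      # single-gene PH regressions (slow but clear)
        df = pd.DataFrame({"g": X[:, j], "t": time, "e": event})
        pvals.append(CoxPHFitter().fit(df, duration_col="t",
                                       event_col="e").summary.loc["g", "p"])
    F = np.where(np.array(pvals) < p_cut)[0]
    Z = X[:, F] - X[:, F].mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    Y = Z @ Vt[:k].T                 # first k principal component scores
    cols = [f"pc{i}" for i in range(k)]
    df = pd.DataFrame(Y, columns=cols)
    df["t"], df["e"] = time, event
    b = CoxPHFitter().fit(df, duration_col="t", event_col="e").params_[cols].values
    return Y @ b                     # predictive index for each case
```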
Survival Risk Group Prediction
• LOOCV loop:
– Create training set by omitting i’th case
• Develop supervised pc PH model for training set
• Compute cross-validated predictive index for i’th
case using PH model developed for training set
• Compute predictive risk percentile of predictive
index for i’th case among predictive indices for
cases in the training set
Survival Risk Group Prediction
• Plot Kaplan-Meier survival curves for
cases with cross-validated risk percentiles
above 50% and for cases with cross-validated risk percentiles below 50%
– Or for however many risk groups and
thresholds are desired
• Compute the log-rank statistic comparing the
cross-validated Kaplan-Meier curves
Survival Risk Group Prediction
• Repeat the entire procedure for all (or large
number) of permutations of survival times and
censoring indicators to generate the null
distribution of the log-rank statistic
– The usual chi-square null distribution is not valid
because the cross-validated risk percentiles are
correlated among cases
• Evaluate statistical significance of the
association of survival and expression profiles
by referring the log-rank statistic for the
unpermuted data to the permutation null
distribution
Survival Risk Group Prediction
• Other approaches to survival risk group
prediction have been published
• The supervised pc method is implemented
in BRB-ArrayTools
• BRB-ArrayTools also provides for
comparing the risk group classifier based
on expression profiles to one based on
standard covariates and one based on a
combination of both types of variables
Sample Size Planning
References
• K Dobbin, R Simon. Sample size
determination in microarray experiments
for class comparison and prognostic
classification. Biostatistics 6:27-38, 2005
• K Dobbin, R Simon. Sample size planning
for developing classifiers using high
dimensional DNA microarray data.
Biostatistics (In Press)
Sample Size Planning
• GOAL: Identify genes differentially expressed in a comparison of two
pre-defined classes of specimens on dual-label arrays using
reference design or single label arrays
• Compare classes separately by gene with adjustment for multiple
comparisons
• Approximate expression levels (log ratio or log signal) as normally
distributed
• Determine the number of samples n/2 per class to give power 1−β for
detecting mean difference δ at level α
Comparing 2 equal size classes:

n = 4σ²(zα/2 + zβ)² / δ²

where
δ = mean log-ratio difference between classes
σ = within-class standard deviation of biological replicates
zα/2, zβ = standard normal percentiles
• Choose α small, e.g. α = .001
• Use percentiles of the t distribution for improved accuracy
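A quick check of this formula in Python (normal approximation; the slides' t-percentile refinement matters most when the result is small):

```python
from math import ceil
from scipy.stats import norm

def samples_per_class(delta, sigma, alpha=0.001, beta=0.05):
    """n = 4 * sigma**2 * (z_{alpha/2} + z_beta)**2 / delta**2 total samples,
    i.e. n/2 per class (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)
    return ceil(4 * sigma**2 * z**2 / delta**2 / 2)

print(samples_per_class(delta=1, sigma=0.5))   # 13 per class (2-fold, human tissue)
```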
Total Number of Samples for
Two Class Comparison

| α     | β    | δ          | σ                      | Samples per class (t approximation) |
|-------|------|------------|------------------------|-------------------------------------|
| 0.001 | 0.05 | 1 (2-fold) | 0.5 (human tissue)     | 13                                  |
| 0.001 | 0.05 | 1 (2-fold) | 0.25 (transgenic mice) | 6                                   |
Dual Label Arrays With Reference Design
Pools of k Biological Samples

n = 4m(zα/2 + zβ)²(σ²/k + 2τ²/m) / δ²

• m = number of technical reps per sample
• k = number of samples per pool
• n = total number of arrays
• δ = mean difference between classes in log
signal
• σ² = biological variance within class
• τ² = technical variance
• α = significance level e.g. 0.001
• 1−β = power
• z = normal percentiles (use t percentiles for
better accuracy)
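A sketch that reproduces the arrays column of the pooling table below from the reconstructed formula (normal approximation, rounding arrays up):

```python
from math import ceil
from scipy.stats import norm

def arrays_required(delta, sigma2, tau2, k=1, m=1, alpha=0.001, beta=0.05):
    """Total arrays for the reference design with k samples pooled per array
    and m technical reps per sample (normal approximation):
    n = 4*m*(z_{alpha/2} + z_beta)**2 * (sigma2/k + 2*tau2/m) / delta**2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)
    return ceil(4 * m * z**2 * (sigma2 / k + 2 * tau2 / m) / delta**2)

# Settings of the tables below: sigma2 + 2*tau2 = 0.25 and sigma2/tau2 = 4
sigma2, tau2 = 1 / 6, 1 / 24
for k in (1, 2, 3, 4):
    print(k, arrays_required(delta=1, sigma2=sigma2, tau2=tau2, k=k))
# -> 25, 17, 14, 13 arrays (25, 34, 42, 52 samples)
```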
α = 0.001, β = 0.05, δ = 1, σ² + 2τ² = 0.25, σ²/τ² = 4

| Number of samples pooled per array (k) | Number of arrays required | Number of samples required |
|----------------------------------------|---------------------------|----------------------------|
| 1                                      | 25                        | 25                         |
| 2                                      | 17                        | 34                         |
| 3                                      | 14                        | 42                         |
| 4                                      | 13                        | 52                         |

| m technical reps | n arrays required | Samples required |
|------------------|-------------------|------------------|
| 1                | 25                | 25               |
| 2                | 42                | 21               |
| 3                | 60                | 20               |
| 4                | 76                | 19               |
Number of Events Needed to Detect
Gene Specific Effects on Survival
• σ = standard deviation in log2 ratios for each
gene
• δ = hazard ratio (>1) corresponding to 2-fold
change in gene expression
• α = 1/N for 1 expected false positive gene
identified per N genes examined
• β = 0.05 for 5% false negative rate

Number of events = [ (z1−α/2 + z1−β) / (σ log δ) ]²
Number of Events Required to Detect
Gene Specific Effects on Survival
α = 0.001, β = 0.05

| Hazard Ratio δ | σ   | Events Required |
|----------------|-----|-----------------|
| 2              | 0.5 | 26              |
| 1.5            | 0.5 | 76              |
Sample Size Planning for Classifier
Development
• The expected value (over training sets) of
the probability of correct classification,
PCC(n), should be within γ of the maximum
achievable PCC(∞)
Probability Model
• Two classes
• Log expression or log ratio MVN in each class with
common covariance matrix
• m differentially expressed genes
• p-m noise genes
• Expression of differentially expressed genes are
independent of expression for noise genes
• All differentially expressed genes have the same inter-class
mean difference 2δ
• Common variance for differentially expressed genes and
for noise genes
Classifier
• Feature selection based on univariate t-tests for
differential expression at significance level α
• Simple linear classifier with equal weights
(except for sign) for all selected genes.
Power for selecting each of the informative
genes that are differentially expressed by
mean difference 2δ is 1−β(n)
• For 2 classes of equal prevalence, let λ1
denote the largest eigenvalue of the
covariance matrix of the informative genes. Then

PCC(∞) = Φ( δ √(m/λ1) )

PCC(n) = Φ( δ m(1−β) / √( λ1 [ m(1−β) + (p−m)α ] ) )
[Figure: sample size as a function of effect size (2δ/σ: log-base-2 fold change between classes divided by the
standard deviation). Two different tolerances shown, γ = 0.05 and γ = 0.10. Each class is equally represented
in the population; 22,000 genes on an array. Sample size (y-axis, roughly 40–100) versus 2δ/σ (x-axis, 1.0–2.0).]
Optimal significance level cutoffs for gene selection. 50 differentially expressed genes
out of 22,000 genes on the microarrays.
γ = 0.05, p = 22,000 genes, gene standard deviation σ = 0.75

| 2δ/σ | n=10  | n=30    | n=50    |
|------|-------|---------|---------|
| 1    | 0.167 | 0.003   | 0.00068 |
| 1.25 | 0.085 | 0.0011  | 0.00035 |
| 1.5  | 0.045 | 0.00063 | 0.00016 |
| 1.75 | 0.026 | 0.00036 | 0.00006 |
| 2    | 0.015 | 0.0002  | 0.00002 |
| Distance 2δ | Fold change | Sample size n | PCC(n), m=1 | PCCworst(n), m=1, pmin=0.25 | PCC(n), m=5 | PCCworst(n), m=5, pmin=0.25 | Adjusted sample size |
|-------------|-------------|---------------|-------------|------------------------------|-------------|------------------------------|----------------------|
| 0.75        | 1.68        | 106           | 0.64        | 0.34                         | 0.82        | 0.70                         | 212                  |
| 1.125       | 2.18        | 58            | 0.72        | 0.45                         | 0.93        | 0.87                         | 116                  |
| 1.5         | 2.83        | 38            | 0.80        | 0.52                         | 0.98        | 0.95                         | 76                   |
| 1.875       | 3.67        | 28            | 0.85        | 0.63                         | 0.99        | 0.96                         | 56                   |
a) Power example with 60 samples, 2δ/σ = 1/0.71 effect size for differentially expressed genes, α = 0.001
cutoffs for gene selection. As the proportion in the under-represented class gets smaller, the power to identify
differentially expressed genes decreases.
[Figure: power (y-axis, 20–100) versus proportion in under-represented class (x-axis, 0.1–0.5).]
b) PCC(60) as a function of the proportion in the under-represented class. Parameter settings same as a),
with 10 differentially expressed genes among 22,000 total genes. If the proportion in the under-represented
class is small (e.g., <20%), then PCC(60) can decline significantly.
[Figure: PCC(60) (y-axis, 0.75–0.85) versus proportion in under-represented class (x-axis, 0.1–0.5).]
BRB-ArrayTools
• Integrated software package using Excel-based
user interface but state-of-the art analysis
methods programmed in R, Java & Fortran
• Publicly available for non-commercial use
http://linus.nci.nih.gov/brb
Selected Features of BRB-ArrayTools
• Multivariate permutation tests for class comparison to control number
and proportion of false discoveries with specified confidence level
– Permits blocking by another variable, pairing of data, averaging of technical
replicates
• SAM
– Fortran implementation 7X faster than R versions
• Extensive annotation for identified genes
– Internal annotation of NetAffx, Source, Gene Ontology, Pathway information
– Links to annotations in genomic databases
• Find genes correlated with quantitative factor while controlling number
or proportion of false discoveries
• Find genes correlated with censored survival while controlling number
or proportion of false discoveries
• Analysis of variance
– Time course analysis
– Log intensities for non-reference designs
– Mixed models
Selected Features of BRB-ArrayTools
• Gene set enrichment analysis.
– Find Gene Ontology groups and signaling
pathways that are differentially expressed among
classes
• Class prediction
– DLDA, CCP, Nearest Neighbor, Nearest Centroid,
Shrunken Centroids, SVM, Random Forests
– Complete LOOCV, k-fold CV, repeated k-fold,
.632+ bootstrap
– Permutation significance of cross-validated error
rate
Selected Features of BRB-ArrayTools
• Clustering tools for class discovery with
reproducibility statistics on clusters
– Internal access to Eisen’s Cluster and
Treeview
• Visualization tools including rotating 3D
principal components plot exportable to
Powerpoint with rotation controls
• Extensible via R plug-in feature
• Tutorials and datasets
Acknowledgements
• Kevin Dobbin
• Michael Radmacher
• Sudhir Varma
• Yingdong Zhao
• BRB-ArrayTools Development Team
– Amy Lam
– Ming-Chung Li
– Supriya Menezes