Development of omics-based clinical tests: The challenge of achieving statistical robustness and clinical utility
Lisa M. McShane, Ph.D.
Biometric Research Branch
Division of Cancer Treatment and Diagnosis, NCI
University of Pennsylvania 5th Annual Conference on Statistical Issues in Clinical Trials: Emerging Statistical Issues in Biomarker Validation
Philadelphia, PA
April 18, 2012
Omics
• "Omics" is a term encompassing multiple molecular disciplines, which involve the characterization of global sets of biological molecules such as DNAs, RNAs, proteins, and metabolites. (IOM. 2012. Evolution of Translational Omics: Lessons Learned and the Path Forward. Washington, DC: The National Academies Press.)
Genomics
Transcriptomics
Proteomics
Metabolomics
Epigenomics
Example Omic Assays
SKY analysis of AML cells
Mutation sequence surveyor trace
Illumina SNP bead array
cDNA expression microarray
Affymetrix expression GeneChip
MALDI-TOF proteomic spectrum
GOAL: Omic "signature" → Clinical test
• Quantify pattern
  – Pre-process
  – Input to classifier or calculate risk score
• Predict clinical outcome or characteristic (e.g., ER+, N0)
• Inform clinical decision
Examples:
• MammaPrint (70-gene), prognostic (Buyse et al. 2006, J Natl Cancer Inst)
• Oncotype DX 21-gene RS, prognostic, predictive? (Paik et al. 2006, J Clin Oncol)
Definitions
• Analytical validity
  – Does the test accurately and reproducibly measure the analyte or characteristic?
• Clinical/biological validity
  – Does the test identify a biologic difference (e.g., "pos" vs. "neg") that may or may not be clinically useful?
• Clinical utility
  – Do results of the test lead to a clinical decision that has been shown with high level of evidence to improve outcomes?
Teutsch et al. 2009, Genet Med
Simon et al. 2009, J Natl Cancer Inst
Potential roles for omics-based tests in medicine
• Pre-diagnosis
  – Risk
  – Screening
  – Early detection
• Diagnosis
  – Confirmation
  – Staging
  – Subtyping
• Pre-treatment
  – Prognostic
  – Predictive
• Intratreatment
  – Early response or futility
  – Toxicity monitoring
• Post-treatment
  – Early endpoint
  – Recurrence or progression monitoring
FOCUS: Initial therapy selection
Overview
• Distinguishing prognostic vs. predictive
• What makes a test “clinically useful”
• Pitfalls in development of prognostic and
predictive tests from high-dimensional omic
data
• Challenges in evaluation of tests on
retrospective specimen & data sets
• Challenges for prospective evaluation
Prognostic test
• PROGNOSTIC: Measurement associated with clinical outcome in the absence of therapy (natural course) or with standard therapy that all patients are likely to receive
  – Clinical use: Identify patients having highly favorable outcome in the absence of (additional) therapy, or extremely poor outcome regardless of (additional) therapy
  – Research use: Disease biology, identify drug targets, stratification factor in clinical trials
Predictive test
• PREDICTIVE: Measurement associated with benefit or lack of benefit (or potentially even harm) from a particular therapy relative to other available therapy
  – Alternate terms:
    • Treatment stratification biomarker
    • Treatment effect modifier
    • Treatment-guiding biomarker
    • Treatment-selection biomarker
  – Examples:
    • Endocrine therapies for breast cancer will benefit only patients whose tumors express hormone receptors
    • SNPs in the drug-metabolizing gene CYP2D6 may confer high risk of serious toxicities from narcotic analgesics
  – Ideally should be developed synchronously with new therapeutics
When is a prognostic test clinically
useful?
• Is the prognostic information sufficiently strong to influence clinical decisions (absolute risk is important)?
• Does the biomarker provide information beyond standard prognostic factors?
[Figure: Kaplan-Meier curves comparing marker groups in two scenarios, hazard ratio = 0.56 and hazard ratio = 0.18. The good-prognosis group (M−) may forego additional therapy. Is this prognostic information helpful?]
Prognostic vs. predictive distinction:
Importance of control groups
[Figure: hypothetical survival curves by marker status (M+/M−) and treatment arm.
Scenario 1: no survival benefit from the new treatment in either marker group; prognostic but not predictive.
Scenario 2: prognostic and predictive; new treatment for all or for M+ only?*]
(*Different considerations might apply for Standard Treatment → New Treatment)
When is a predictive test clinically
useful?
Treatment-by-biomarker interaction: Is it sufficient?
[Figure: hypothetical survival curves in marker-negative (M−) and marker-positive (M+) subgroups under standard vs. new treatment.]
• Qualitative interaction (prognostic and predictive; new treatment for M+ only):
  – Std Trt better for M− (HR− = 1.36)
  – New Trt better for M+ (HR+ = 0.63)
  – Interaction = 0.63/1.36 = 0.47
• Quantitative interaction (prognostic and predictive; new treatment for all?*):
  – New Trt better for M− (HR− = 0.44)
  – New Trt better for M+ (HR+ = 0.63)
  – Interaction = 0.63/0.44 = 1.45
Interaction = HR+/HR−, where each HR = New/Std within the marker subgroup
(*Different considerations might apply for Standard Treatment → New Treatment)
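As a rough illustration (not part of the original slides), the interaction hazard ratio can be estimated by fitting a Cox model with a treatment-by-marker interaction term; the exponentiated interaction coefficient estimates HR+/HR−. The sketch below uses the lifelines Python package on simulated data, and all column names and parameter values are hypothetical.

```python
# Minimal sketch (simulated data, hypothetical column names): estimating the
# treatment-by-marker interaction hazard ratio with a Cox model (lifelines).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 400
trt = rng.integers(0, 2, n)       # 0 = standard treatment, 1 = new treatment
marker = rng.integers(0, 2, n)    # 0 = M-, 1 = M+

# Simulate event times so the true HR (new/std) is 1.36 in M- and 0.63 in M+.
log_hr = np.log(1.36) * trt * (marker == 0) + np.log(0.63) * trt * (marker == 1)
time = rng.exponential(scale=np.exp(-log_hr))
event = np.ones(n, dtype=int)     # no censoring, for simplicity

df = pd.DataFrame({"time": time, "event": event, "trt": trt,
                   "marker": marker, "trt_x_marker": trt * marker})
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")

b = cph.params_                   # fitted log-hazard-ratio coefficients
print("HR- (new/std in M-):", np.exp(b["trt"]))
print("HR+ (new/std in M+):", np.exp(b["trt"] + b["trt_x_marker"]))
print("Interaction HR+/HR-:", np.exp(b["trt_x_marker"]))
```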
Pitfalls in developing prognostic and
predictive tests from omic data
• Most published papers on omic signatures derived from high-dimensional omic data represent biological explorations or computational exercises in search of statistically significant findings
  – Some biological insights gained, BUT . . .
  – Few signatures have advanced to the point of having established clinical utility
• Unfocused clinical context ("convenience" specimens)
• Clinical vs. statistical significance
Pitfalls in developing prognostic and
predictive tests from omic data
• Many published papers on omic signatures have generated spurious findings or used faulty model development or validation strategies
  – Poor data quality (specimens or assay)
  – Poor experimental design (e.g., confounding with specimen handling or assay batches)
  – Multiple testing & model overfitting
  – Failure to conduct rigorous, independent validation
    • Blinding & separation of training and test sets
    • Biases introduced by non-randomized treatment
    • Pre-specified analyses with appropriate type I error control
    • Lack of statistical power
Development of an omic signature
Model development (training sets):
• Generate raw data from selected specimens; screen out unsuitable data or specimens
• Raw data pre-processing: normalization, calibration, summary measures
• Identify features (e.g., genes, proteins) relevant to a clinical or pathological distinction
• Apply algorithm to develop a classifier or score; INTERNAL VALIDATION
• EXTERNAL VALIDATION on an INDEPENDENT set of specimens/data
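As a minimal sketch of how these development steps can be chained so that pre-processing, feature selection, and model fitting all see only the training specimens, here is a hypothetical scikit-learn pipeline on synthetic data; the particular pre-processing, selection rule, classifier, and parameter values are illustrative assumptions, not the method described in the slides.

```python
# Minimal sketch (synthetic data, illustrative choices): chaining
# pre-processing, feature selection, and a classifier as one unit that is
# fit on training specimens only.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def build_signature_pipeline(n_features=50):
    """Normalize, select features, and fit a classifier / risk score."""
    return Pipeline([
        ("normalize", StandardScaler()),
        ("select", SelectKBest(f_classif, k=n_features)),
        ("classify", LogisticRegression(max_iter=1000)),
    ])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # stand-in expression matrix: 100 specimens x 5000 features
y = rng.integers(0, 2, 100)        # stand-in binary clinical characteristic

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = build_signature_pipeline().fit(X_train, y_train)    # training specimens only
print("held-out accuracy:", model.score(X_test, y_test))    # specimens never used in development
```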
Artifacts
• Omic assays can be exquisitely sensitive to detect subtle biological differences
• BUT, also exquisitely sensitive to
  – Specimen processing and handling differences
  – Assay variation
    • Different technical protocols
    • Between-lab
    • Within-lab, between-batch, between-technician
• BE VIGILANT
  – Check for artifacts & confounding
  – Control for them in the experimental design if possible
Assay batch effects: Gene expression microarrays
[Figure: density estimates of PM probe intensities (Affymetrix CEL files) for 96 NSCLC specimens, colored by batch (red = batch 1, blue = batch 2, purple & green = outliers?), and PCA plots after RMA pre-processing with and without the outlier CEL files.]
Normalized data may depend on the other arrays normalized in the same set.
(Figure 1 from Owzar et al. 2008, Clin Cancer Res, using data from Beer et al. 2002, Nat Med)
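One simple way to look for batch effects of this kind is to project the normalized specimens onto their first principal components and color the points by assay batch; clear separation by batch is a warning sign. The sketch below is illustrative only, with synthetic data standing in for a real expression matrix.

```python
# Minimal sketch (synthetic data standing in for normalized expression):
# PCA of specimens colored by assay batch as a quick batch-effect check.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
batch = np.repeat([1, 2], 48)                      # 96 specimens in two batches
# Add an artificial mean shift to batch 2 to mimic a batch effect.
expr = rng.normal(size=(96, 2000)) + 0.5 * (batch == 2)[:, None]

pcs = PCA(n_components=2).fit_transform(expr - expr.mean(axis=0))
for b, color in [(1, "red"), (2, "blue")]:
    plt.scatter(pcs[batch == b, 0], pcs[batch == b, 1], c=color, label=f"batch {b}")
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend()
plt.title("Specimens separating by batch on PC1/PC2 suggests a batch effect")
plt.show()
```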
Assay batch effects: Sequence data
Batch effects for 2nd generation sequence data from the 1000
Genomes Project. Standardized coverage data represented. Same
facility, same platform. Horizontal lines divide by date.
Figure 2 from Leek et al. 2010, Nature Rev Genet
Development and validation of the
signature model
• Selection of informative features
  – Reduce noise
  – Reduce dimension
• Building a classifier (predictor) model
  – Link signature variations to clinical outcome or biological characteristic
• Check for overfitting of the signature model
  – Internal validation
• External assessment of model performance
Feature selection & data reduction
• Identify "informative" features (e.g., distinguish favorable vs. unfavorable outcome)
  – Control false positives
  – Potentially many distinct, equally informative sets
• Unsupervised dimension reduction
  – Create "super" features (e.g., "super genes", pathways)
  – Example empirical methods:
    • Principal components analysis (PCA), or more generally multidimensional scaling
    • Clustering to produce cluster-level summary features
• Supervised dimension reduction
  – Feature selection followed by dimension reduction
  – Example: Supervised principal components
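To make the distinction concrete, the sketch below contrasts unsupervised "super features" (principal components of all genes) with a simple supervised variant that first screens genes by a univariate t-test and then takes principal components of the surviving genes. The data are synthetic and the screening threshold and component counts are arbitrary illustrations, not recommendations.

```python
# Minimal sketch (synthetic data, arbitrary illustrative parameters):
# unsupervised vs. supervised dimension reduction of an expression matrix.
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 3000))   # 80 specimens x 3000 genes
y = rng.integers(0, 2, 80)        # binary outcome

# Unsupervised: principal components ("super genes") of all genes.
pcs_unsupervised = PCA(n_components=5).fit_transform(X)

# Supervised variant: univariate screen against the outcome, then PCA of
# only the genes passing the screen.
_, p = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
keep = p < 0.001                  # screening threshold (illustrative)
print("genes passing screen:", int(keep.sum()))
if keep.any():
    k = int(min(5, keep.sum()))
    pcs_supervised = PCA(n_components=k).fit_transform(X[:, keep])
    print("supervised PC matrix shape:", pcs_supervised.shape)
```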
Building the molecular signature model
• Construct a classifier function or risk score
  – Linear predictors (e.g., LDA, SVM): L(x) = w1x1 + w2x2 + . . . + wfxf, to which a cutpoint is often applied
  – Distance-based (e.g., nearest neighbor, nearest centroid)
  – Numerous other methods:
    • Decision trees
    • Random forests
    • Completely stochastic or Bayesian model averaging
• No "best" method for all problems
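The sketch below shows, on synthetic data, that a fitted linear classifier's decision rule really is the weighted sum L(x) with a cutpoint applied; the choice of classifier (a linear SVM) and all settings are illustrative assumptions.

```python
# Minimal sketch (synthetic data, illustrative settings): a linear classifier's
# risk score is L(x) = w1*x1 + ... + wf*xf + b, with a cutpoint (here 0) applied.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))    # 100 specimens, 20 selected features
y = (X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
w, b = clf.coef_.ravel(), clf.intercept_[0]

scores = X @ w + b                                 # L(x) for each specimen
predicted = (scores > 0).astype(int)               # cutpoint applied to the score
assert np.array_equal(predicted, clf.predict(X))   # same rule as the fitted classifier
```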
Dangers of model overfitting
• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship (source: http://en.wikipedia.org/wiki/Overfitting)
  – Model is excessively complex, such as having too many parameters relative to the number of observations
  – Overfit model will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data
• In high dimensions, true models are always complex and data are always sparse
• VALIDATION OF MODEL PERFORMANCE IS ESSENTIAL
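A quick synthetic demonstration of this point: with far more features than specimens, even pure noise can be fit almost perfectly on the training data while performance on new data stays at chance. The sketch below is illustrative only; the model and sample sizes are arbitrary.

```python
# Minimal sketch (pure-noise data, arbitrary sizes): more features than
# specimens lets the model fit the noise, so training accuracy is near-perfect
# while accuracy on new data stays near 50%.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n, p = 40, 500                                   # far more features than specimens
X_train, y_train = rng.normal(size=(n, p)), rng.integers(0, 2, n)
X_new, y_new = rng.normal(size=(n, p)), rng.integers(0, 2, n)

model = LogisticRegression(C=1e6, max_iter=5000).fit(X_train, y_train)  # weak regularization
print("training accuracy:", model.score(X_train, y_train))   # typically 1.0
print("new-data accuracy:", model.score(X_new, y_new))       # typically near 0.5
```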
Model validation
• RESUBSTITUTION (plug in training data) estimates of model performance are highly biased and COMPLETELY USELESS in the high-dimensional data setting
• INTERNAL: Within-sample validation
  – Cross-validation (leave-one-out, split-sample, k-fold, etc.)
  – Bootstrap and other resampling methods
  – Method comparisons: Molinaro et al. 2005, Bioinformatics
• EXTERNAL: Independent-sample validation
Simulation of prognostic model resubstitution method
• Survival data on 129 patients from Bild et al. 2006, Nature
• Expression values for 5000 genes generated randomly from N(0, I5000) for each patient
• Data divided randomly into training and validation sets
• Prognostic model developed from the training set and used to classify patients in both the training and validation sets

Simulation   Training p-value   Validation p-value
1            7.0e-05            0.70
2            4.2e-07            0.54
3            2.4e-13            0.60
4            1.3e-10            0.89
5            1.8e-13            0.36
6            5.5e-11            0.81
7            3.2e-09            0.46
8            1.8e-07            0.61
9            1.1e-07            0.49
10           4.3e-09            0.09

(Subramanian and Simon 2010, J Natl Cancer Inst – lung cancer prognostic signatures)
Prognostic model resubstitution
example
[Figure: Kaplan-Meier curves from the JBR.10 training set:
All stages, OBS, n=62: HR=15.02, p<.001, 95% CI=(5.12, 44.04)
Stage IB, OBS, n=34: HR=13.32, p<.001, 95% CI=(2.86, 62.11)
Stage II, OBS, n=28: HR=13.47, p<.001, 95% CI=(3.00, 60.43)]
"A 15-gene signature separated OBS patients into high-risk and low-risk subgroups with significantly different survival (hazard ratio [HR], 15.02; 95% CI, 5.12 to 44.04; P <.001; stage I HR, 13.31; P <.001; stage II HR, 13.47; P <.001)." (Zhu et al. 2010, J Clin Oncol)
Figure 1 legend: "Disease-specific survival outcome based on the 15-gene signature in the JBR.10 training set."
Independent validations (?) of 15-gene
prognostic score
[Figure: Kaplan-Meier validation analyses of the 15-gene prognostic score (x-axes in years).
Microarray data sets: DCC HR=2.36, p=.026; Duke HR=2.01, p=.08; UM HR=3.18, p=.006; NKI HR=2.02, p=.033.
RT-qPCR: JBR.10 OBS HR=2.02, p=.033; JBR.10 ADD HR=2.02, p=.033; 1/9 events.]
"The prognostic effect was verified in the same 62 OBS patients where gene expression was assessed by qPCR. Furthermore, it was validated consistently in four separate microarray data sets (total 356 stage IB to II patients without adjuvant treatment) and additional JBR.10 OBS patients by qPCR (n=19)."
What happened to HR=15.02?
Is this still clinically useful?
Partial resubstitution:
Combining training and test sets
Failure to maintain separation of training and test sets
Lung Metagene Score Predictor (Figure 5A from Potti et al. 2006, N Engl J Med)

Cohort    Set          Fraction Stage IA
Duke      Training     39/89
ACOSOG    Validation   5/25
CALGB     Validation   24/84

Over half (39/68) of the cases used to generate the figure were from the training set used to develop the model, and 39/89 of those training cases were Stage IA.
Internal validation: Leave-one-out
cross-validation (LOOCV)
• Set aside specimen j, keeping specimens 1, 2, . . ., j−1, j+1, . . ., N
• Build the classifier (feature selection, model parameter estimation, etc.) on the remaining specimens
• "Plug in" specimen j and record its predicted class
• Repeat for each j
ALL steps, including feature selection, must be included in the cross-validation loop
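A minimal sketch of this loop on synthetic data is shown below, with feature selection repeated inside every iteration; the selection rule (univariate t-test) and classifier (LDA) mirror the simulation on the next slide, but the code is illustrative rather than the original simulation program.

```python
# Minimal sketch (synthetic data): LOOCV in which feature selection is redone
# inside every fold, so the held-out specimen never influences the model.
import numpy as np
from scipy import stats
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def loocv_accuracy(X, y, alpha=0.001):
    """Leave-one-out CV with univariate t-test feature selection in the loop."""
    n = len(y)
    correct = 0
    for j in range(n):                                    # set aside specimen j
        train = np.delete(np.arange(n), j)
        Xtr, ytr = X[train], y[train]
        _, p = stats.ttest_ind(Xtr[ytr == 0], Xtr[ytr == 1], axis=0)
        keep = p < alpha                                   # selection uses training data only
        if not keep.any():
            keep = p == p.min()                            # fall back to the single best feature
        clf = LinearDiscriminantAnalysis().fit(Xtr[:, keep], ytr)
        correct += int(clf.predict(X[j:j + 1][:, keep])[0] == y[j])
    return correct / n

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6000))       # pure-noise markers
y = np.repeat([0, 1], 50)              # arbitrary group labels
print("LOOCV accuracy:", loocv_accuracy(X, y))   # should hover around 0.5
```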
Simulation of cross-validation
approaches
• 100 specimens, 1000 simulations
• 6000 markers measured on each specimen
• Marker measurements generated as independent Gaussian white noise (i.i.d. N(0,1))
• Artificially separate specimens into two groups (first 50, last 50) so there is NO true relation between markers and group
• Build predictor of class
  – Select markers by univariate t-test, α = 0.001
  – Linear discriminant analysis (LDA)
• TRUE PREDICTION ACCURACY (and misclassification error rate) SHOULD BE 50%
Importance of correct cross-validation
• True accuracy = 50% obtained by "LOOCV Correct"
• "Resubstitution" is the naïve method of testing model performance by "plugging in" the exact same data as were used to build the model
• "LOOCV Wrong" does not reselect features at each iteration of the cross-validation, and it is nearly as biased as the naïve resubstitution estimate
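The contrast can be reproduced in a few lines. The sketch below runs a single pure-noise replicate (the slide used 1000 simulations) and reports resubstitution, a "wrong" LOOCV that selects features once on the full data set, and a correct LOOCV that reselects features inside every fold; the exact numbers will vary, but only the correct version should sit near 50% accuracy.

```python
# Sketch (one synthetic replicate, not the slide's 1000 simulations):
# resubstitution vs. "wrong" LOOCV (features selected once on all data)
# vs. correct LOOCV (features reselected in every fold) on pure noise.
import numpy as np
from scipy import stats
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def select(Xtr, ytr, alpha):
    _, p = stats.ttest_ind(Xtr[ytr == 0], Xtr[ytr == 1], axis=0)
    keep = p < alpha
    return keep if keep.any() else (p == p.min())

rng = np.random.default_rng(6)
X, y = rng.normal(size=(100, 6000)), np.repeat([0, 1], 50)
alpha = 0.001

# Resubstitution: select and fit on all data, then "test" on the same data.
keep_all = select(X, y, alpha)
resub = LinearDiscriminantAnalysis().fit(X[:, keep_all], y).score(X[:, keep_all], y)

hits_ok = hits_wrong = 0
for j in range(len(y)):
    tr = np.delete(np.arange(len(y)), j)
    keep_cv = select(X[tr], y[tr], alpha)                 # correct: reselect each fold
    ok = LinearDiscriminantAnalysis().fit(X[tr][:, keep_cv], y[tr])
    wrong = LinearDiscriminantAnalysis().fit(X[tr][:, keep_all], y[tr])  # wrong: reuse full-data features
    hits_ok += int(ok.predict(X[j:j + 1][:, keep_cv])[0] == y[j])
    hits_wrong += int(wrong.predict(X[j:j + 1][:, keep_all])[0] == y[j])

print(f"resubstitution accuracy:  {resub:.2f}")                # optimistically high
print(f"LOOCV (wrong) accuracy:   {hits_wrong / len(y):.2f}")  # still optimistic
print(f"LOOCV (correct) accuracy: {hits_ok / len(y):.2f}")     # near 0.50
```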
Incorrect validation: Is bias only a problem
with a very large number of markers?
• 100 specimens, 1000 simulations
• M = 10, 50, or 100 markers measured on each specimen
• Markers i.i.d. N(0,1)
• Randomly separate specimens into two groups (first 50, last 50) so there is NO true relation between markers and group
• Build predictor of class
  – Select markers by univariate t-test, α = 0.1
  – Linear discriminant analysis (LDA)
• TRUE PREDICTION ACCURACY (and misclassification error rate) SHOULD BE 50%

Mean % errors:
M       Correct   Wrong   Resub
10      52%       44%     42%
50      51%       37%     32%
100     51%       31%     24%

(Simulations performed by E. Polley)
Limitations of internal validation
• Frequently performed incorrectly (e.g., not including feature selection)
• Cross-validated predictions can be tested in models using typical statistical inference procedures, BUT
  – they are not independent variables
  – conventional testing levels and CI widths are not correct (Lusa et al. 2007, Stat in Med; Jiang et al. 2008, Stat Appl Gen Mol Biol)
• Large variance in estimated accuracy and effects
• Doesn't protect against biases due to selective inclusion/exclusion of samples
• Doesn't protect against built-in biases (e.g., lab batch, variable specimen handling)
EXTERNAL VALIDATION IS ESSENTIAL!!!
Assessment of predictive tests:
Dangers of resubstitution
Is resubstitution acceptable when model was fit using the control
(OBS) arm only? NO! (Fig. 3, Zhu et al. 2010, J Clin Oncol)
[Figure: survival curves for high-risk vs. low-risk groups, assessed by microarray and by RT-qPCR.]
Assessment of predictive tests: Dangers of
nonrandomized treatment, different cohorts
Figure 1. Genomic Decision Algorithm to Predict Sensitivity of Invasive Breast Cancer to Adjuvant Chemotherapy (CT) or Chemoendocrine Therapy (CT + HT) (Hatzis et al. 2011, JAMA)
• Validation Cohort #1 (Figure 2): 35% N−, 65% N+; 62% ER+; AT → HT if ER+
• Validation Cohort #2 (eFigure 6A): 100% N−; ER+ & ER− (%?); no HT & no CT
• Claim: Test is predictive and not prognostic (P=.002 in Fig 2 vs. P=.096 in eFig 6A)
(A = anthracycline, T = taxane)
Prospective trials to evaluate clinical
utility of omic tests
• Comparison of randomized designs (Sargent et al. 2005, J Clin Oncol; Freidlin et al. 2010, J Natl Cancer Inst; Clark and McShane 2011, Stat in Biopharm Res)
  – Enrichment design
  – Completely randomized design
  – Randomized block design
  – Biomarker-strategy design
  – Adaptive designs
• Challenges
  – Big, long, and expensive
  – Inadequate enforcement of regulatory requirements
  – Test may become available before the trial completes accrual
  – Difficulties with blinding & compliance
Quiz
• Detection of model overfitting and biased study designs requires in-depth knowledge of complex statistical approaches to the analysis of high-dimensional omic data.
  – TRUE or FALSE
• Poor model development practices have few adverse consequences because the models will eventually be tested in rigorously designed clinical studies.
  – TRUE or FALSE
Summary remarks
• Need earlier and more intense focus on clinical utility
• Need rigor in omics-based test development and study design
• EXTERNAL VALIDATION is essential
• Need more complete and transparent reporting of omics studies
  – REMARK guidelines (McShane et al. 2005, J Natl Cancer Inst)
  – REMARK Explanation & Elaboration (Altman et al. 2012, BMC Med and PLoS Med)
  – Availability of data and computer code?
• Need multi-disciplinary collaborative teams with ALL relevant expertise included
Acknowledgments
• NCI Cancer Diagnosis Program: Barbara Conley (Director), Tracy Lively
• NCI Biometric Research Branch: Richard Simon (Chief), Ed Korn, Boris Freidlin, Eric Polley, Mei Polley
• Institute of Medicine Committee for Review of Omics-Based Tests for Predicting Patient Outcomes in Clinical Trials (http://iom.edu/Activities/Research/OmicsBasedTests.aspx)
References for images
1. SKY AML image
http://www.nature.com/scitable/topicpage/human-chromosome-translocations-and-cancer23487
2. Mutation sequence surveyor trace (public domain)
http://upload.wikimedia.org/wikipedia/commons/8/89/Mutation_Surveyor_Trace.jpg
3. Illumina SNP bead array
https://www.sanger.ac.uk/Teams/Team67/
4. cDNA expression microarray image (public domain)
http://en.wikipedia.org/wiki/File:Microarray2.gif
5. Affy GeneChip expression array
Source unknown
6. MALDI-TOF proteomic spectrum
Hodgkinson et al. (Cancer Letters, 2010)