Transcript Slide 1

Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 3
Clinical Data Integration
Anna Lapuk, PhD
Module 3 overview
• Part I
– Clinical Data and Biomarkers
• Part II
– Statistical aspects
• Lab:
– Survival analysis
Module 3: Clinical Data Integration
bioinformatics.ca
Learning Objectives
• To understand how clinical information is used
• To understand the process of genomic data integration
with clinical information
• To review current advances in the development of biomarkers and clinical
applications
• To learn how to evaluate a biomarker and perform
survival analysis
• To be able to analyze tumour cohorts with regard to the
association of molecular subgroups with outcome.
What is clinical data?
Usage of clinical data alone: prediction
Adjuvant! Online (breast cancer)
Nomograms (prostate cancer)
Cancer ‘omics data
SEQUENCING
ARRAYS
Transcriptome
Genome
Gene expression
Methylome
Copy number
Alternative splicing
ChIP-Seq
Histone marks
Mutations
Protein binding sites
Rearrangements
CpG methylation
Fusion genes
100s-1000s of aberrations
Goal: use ‘omics data to aid clinical decisions
Example:
SNV
Biomarker
Clinical use
Van Allen, JCO 2013
Biomarkers & therapeutic targets (NCI)
• A biomarker is a biological molecule (or a set thereof) found in
blood, other body fluids, or tissues that is a sign of a normal or
abnormal process, or of a condition or disease. Also called
molecular marker and signature molecule.
• A therapeutic target is a biological molecule (an enzyme, receptor
or other protein) that can be modified by an external stimulus
(drug). The implication is that the molecule is "hit" by a signal
and its behavior is thereby changed.
Biomarker features
• Biomarker comes from alterations:
– Germline/somatic mutation
– Genomic amplification/deletion
– Transcriptional change
– Post-transcriptional modification
• Biomarker types:
– Proteins
– Nucleic acids (mRNA, miRNA, non-coding RNA)
– Cells (Circulating Tumour Cells)
– Peptides
– Individual molecules vs sets (signatures):
• Gene expression (n genes)
• Proteomic (n proteins)
• Metabolomic (n metabolites)
Biomarker features (cont’d)
• Biomarker screening:
– Circulation (blood, serum, plasma)
– Excretions/secretions (stool, urine, etc.)
– Tissues (biopsy + imaging)
• Biomarkers may also be therapeutic targets, but not always:
– HER2 (breast cancer): biomarker and therapeutic target
– PSA (prostate cancer): biomarker. AR – target.
– KRAS mutations (colorectal cancer): biomarker. EGFR – target.
Clinical use of biomarkers
• To diagnose or subclassify the disease state => diagnostic
– BCR-ABL fusion in leukemia (Philadelphia chromosome)
• To make a prognosis about a clinical outcome (survival or
recurrence) => prognostic
– OncotypeDx gene expression (estimates the risk of breast cancer
recurrence)
• To predict the activity of a therapy => predictive
– HER2 and Herceptin (predicts response in breast cancer)
• To identify a subgroup of patients for whom a therapy has shown
benefit => companion diagnostic markers
– BRAF V600E mutation and BRAF inhibitor (confers sensitivity in melanoma)
Examples of use
Biomarkers used in clinic
[Table: biomarkers in clinical use by cancer type (NSC lung, colon, prostate, glioma, breast). Source: NCCN]
Fusion gene biomarker: ALK in NSCLC
• Activating ALK fusions (EML4-ALK) in 2-7% of NSCLC/adenocarcinoma/non-smokers
• Crizotinib – ALK inhibitor
• FDA-approved FISH test for ALK-fusion as a companion test for crizotinib treatment
[Figure: FISH probes ALK-3’ and ALK-5’; response to crizotinib (tumour burden)]
Kwak, NEJM 2010
Oncotype DX: gene chip for breast cancer
• Only 15% of breast cancer patients treated with hormone therapy alone (tamoxifen) recur within 10 years => 85% may not need additional chemotherapy.
• Start: 250 candidate genes from 3 independent studies (447 patients)
• End: 21-gene RT-PCR assay in FFPE samples => recurrence score
• Test for recurrence in node-neg, ER-pos breast tumours treated with tamoxifen
[Figure: 21-gene set; difference in outcome for predicted risk groups (P<0.001)]
Paik, NEJM 2004
Methylation biomarker: CpG in colon cancer
• Colorectal cancer (CRC) – 40% lethal outcome
• CpG island methylator phenotype (CIMP) – subclasses:
– CIMP-high (15-20%)
– CIMP-low (20%-45%)
• CIMP-high CRCs – unclear association with outcome
– Other factors: microsatellite instability (MSI – clinical marker of better prognosis); BRAF mutations
[Figure: methylation profile (CIMP1 high, CIMP low, CIMP2 high) in Cohort 1 and Cohort 2; CIMP-high with MSS – worse outcome (HR>3)]
Dahlin, Clin Can Res 2010
CTC biomarker: breast and NE cancers
• In metastatic breast cancer, the CTC count in a blood sample (>5 or <5) is associated with outcome
• The dynamics of the CTC count are important
• In metastatic NET (neuroendocrine tumours), the CTC count in a blood sample (>1 or <1) is predictive of outcome, HR>6 (compared with NET marker CgA)
Hayes, Imag Diag Prog 2006
Khan, JCO 2013
Biomarker development
• Identification. Discovery approach to identify biomarkers that differ
between cohorts of tumours, using a variety of technologies
– Microarrays/ sequencing/ mass spectrometry.
– Important: careful study design to avoid bias in biomarker discovery
(matched cases and controls)!
• Validation.
– Analytical validity.
• Biomarker assay: reproducibility, sensitivity, specificity.
– Clinical validity
• How reliably the biomarker divides the populations into 2 groups of different outcomes.
Important: validation should be done on independent cohorts of tumours!
– Clinical utility
• Is the biomarker able to improve clinical decision-making? Depends on the strength
of the association of the biomarker with outcome, the size of the effect, the particular disease, and
overall benefits, risks and economics.
Example: a marker identifies 2 subgroups of tumours with very different survival. However, no
treatment options are available => no clinical utility.
Established clinical utility: KRAS mutations in colorectal cancer
• Frequent up-regulation of EGFR in human tumours.
• EGFR – targeted therapy
• Resistance mechanisms:
– EGFR mutations
– Alternative pathways
– Activation of downstream effectors (PI3-K, KRAS, BRAF)
[Figure: EGFR]
Dempke, Antican Res, 2010
Established clinical utility: KRAS mutations in colorectal cancer (cont’d)
• KRAS-mut in 40% of CRC, associated with poor survival
• Screening of KRAS mutations in patients treated with anti-EGFR:
– responders – KRAS-wt
– non-responders – high frequency KRAS-mut
• In vitro studies confirm the role of KRAS-mut in resistance
• 4 prospective clinical trials investigating the effect of KRAS-mut on anti-EGFR therapy gave consistent results
• NCCN, ASCO recommended test for KRAS mutations in metastatic CRC in conjunction with EGFR-treatment.
[Figure: anti-EGFR treatment, KRAS-wt vs KRAS-mut]
Lievre, Can Res, 2006; Benvenuti, Can Res 2007
Allegra, JCO, 2009
No clinical utility: nomograms vs genomic markers in prostate cancer
• Genomic markers – no/little benefit
• Nomograms perform well
[Figure: nomogram alone vs nomogram + gene expression]
• Gene expression c-index – 0.75
• Nomogram c-index – 0.84
• Combined model concordance index – 0.89
Note: c-index is a generalisation of the area under the ROC
curve (AUC); c = 0.5 – no better than chance;
c > 0.5 – successful classification; c = 1 – perfect.
Iremashvili, Onc 2013
Stephenson, Can 2005
Rigorous clinical validation: early stage lung cancer
• 442 lung cancers, 6 collection sites
• 4 institutions profiled gene expression using the same platform
• Uniform sample selection, processing and data pre-processing
• 8 distinct biomarkers developed on training cohorts by 4 institutions; blinded validation on two independent cohorts
• Conclusion: combination of biomarker A (multigene) with clinical info had best performance.
[Figure: hazard ratios, Kaplan-Meier survival and ROC curves for validation set 1 and validation set 2]
Shedden, Nat Med 2008
Biased biomarker: prostate cancer
• Signature serum peptides for discrimination of cancers vs healthy controls
• Prostate cancer cohort: 32 patients (age = 66)
• Control cohort: 33 healthy individuals (age 34, mostly females)
• Biomarkers are related to age/sex, not prostate cancer
Villanueva, J Clin Invest 2006
Statistical aspects of biomarker development
Identification of biomarkers
Supervised analysis: known outcome subgroups => marker identification
(e.g. KRAS mutations in responders vs non-responders)
Unsupervised analysis: unknown outcome subgroups => subgroup discovery => marker identification
Biomarker = classifier
Example
Novel subgroups
classifier
Testing, validation
Curtis, Nature 2012
Classification
methods
Feature
selection
Classification rule
classifier
Prediction
discrimination
Classifier (biomarker) purpose
Classifier development strategy
Learning/(training)
set
V subset
All but V subset
resubstitution error rate
Performance
assessment
classifier
classifier
CV average test set error rate
Test set error rate
Independent
test set
Note: Learning and Test sets have to be
identically distributed
Classifier performance assessment
• How accurate is classifier (confusion matrix, accuracy)
• How well classifier worked on learning set (resubstitution
error rate)
• How well classifier worked on test set (test set error rate)
• Cross validation
• How do different classifiers compare (ROC curves)
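The three error rates named above (resubstitution, cross-validation and independent test set) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are simulated two-gene "expression" values, the classifier is a simple nearest-centroid rule, and the function names are hypothetical. It is not the method used in any of the cited studies.

```python
import random

def make_cohort(n, rng):
    """Simulate n tumours with 2 expression features and a 0/1 outcome label."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        shift = 2.0 * label  # assumption: class 1 is shifted by 2 SD in both genes
        X.append([rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)])
        y.append(label)
    return X, y

def fit(X, y):
    """Nearest-centroid classifier: store the mean profile of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared distance)."""
    def dist2(c):
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))

def error_rate(centroids, X, y):
    wrong = sum(1 for x, lab in zip(X, y) if predict(centroids, x) != lab)
    return wrong / len(y)

def cv_error_rate(X, y, v=5, seed=0):
    """V-fold CV: train on all but one subset, test on it, average the errors."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::v] for i in range(v)]
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors.append(error_rate(model, [X[i] for i in fold], [y[i] for i in fold]))
    return sum(errors) / len(errors)

rng = random.Random(1)
X_learn, y_learn = make_cohort(60, rng)  # learning (training) set
X_test, y_test = make_cohort(40, rng)    # independent test set, same distribution

model = fit(X_learn, y_learn)
resub_err = error_rate(model, X_learn, y_learn)  # resubstitution error rate
cv_err = cv_error_rate(X_learn, y_learn, v=5)    # CV average test set error rate
test_err = error_rate(model, X_test, y_test)     # test set error rate
print(resub_err, cv_err, test_err)
```

The resubstitution error is typically the most optimistic of the three because the classifier is scored on the samples it was trained on.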
Confusion matrix
                                 Actual diagnosis (pathology)
                                 (patient is positive/negative for cancer)
                                 positive           negative
Prediction using    positive     True positive      False positive     Positive predictive value
biomarkers          negative     False negative     True negative      Negative predictive value
                                 Sensitivity        Specificity
                                 (true pos rate)    (true neg rate)
Accuracy
ACC = (TP + TN) / (P + N)
or
ACC = Sensitivity * Fraction_pos + Specificity * Fraction_neg
where Fraction_pos = P / (P + N) and Fraction_neg = N / (P + N)
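As a worked illustration of the confusion matrix quantities, the sketch below computes them for a small made-up cohort of ten patients. The data and the function name `confusion_metrics` are assumptions for illustration only.

```python
def confusion_metrics(actual, predicted):
    """Compute confusion-matrix summaries from 0/1 lists (1 = positive for cancer)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    pos, neg = tp + fn, fp + tn          # P and N: actual positives and negatives
    return {
        "sensitivity": tp / pos,         # true positive rate
        "specificity": tn / neg,         # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (pos + neg),
        "fraction_pos": pos / (pos + neg),
        "fraction_neg": neg / (pos + neg),
    }

# Hypothetical pathology results and biomarker predictions for 10 patients.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

m = confusion_metrics(actual, predicted)
# The two accuracy formulas agree:
acc_alt = (m["sensitivity"] * m["fraction_pos"]
           + m["specificity"] * m["fraction_neg"])
print(m["accuracy"], acc_alt)
```

Here TP = 3, FN = 1, FP = 1 and TN = 5, so sensitivity is 0.75, specificity is 5/6 and accuracy is 0.8 by either formula.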
Example
Best cut-off values of CA125 for preoperative selection of
intermediate- to high-risk, and high-risk diseases (2010)
ROC curves
Definition: the receiver operating characteristic (ROC) curve is a graphical plot of the sensitivity (true positive rate) vs. the
false positive rate (1 − specificity) for a binary classifier system as its discrimination threshold is varied.
Purpose:
- to find the best threshold for discrimination (e.g. the expression value of a gene used as a classifier)
- to compare the performance of different classifiers
Summary: AUC (area under the curve, c-index): c = 0.5 – no better than chance; c > 0.5 – successful classification;
the closer to 1, the better.
[Figure: ROC curves (true positive rate vs false positive rate), with the best method indicated]
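A minimal sketch of how an ROC curve and its AUC can be computed by sweeping the discrimination threshold. The scores and labels are invented, and the use of Youden's J (sensitivity + specificity − 1) to pick a "best" threshold is one common choice assumed here, not something stated on the slide.

```python
def roc_points(scores, labels):
    """Return (false pos rate, true pos rate) pairs as the threshold is lowered."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def best_threshold(scores, labels):
    """Threshold maximising Youden's J = TPR - FPR (an assumed criterion)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    def youden(thr):
        tpr = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 1) / n_pos
        fpr = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 0) / n_neg
        return tpr - fpr
    return max(sorted(set(scores), reverse=True), key=youden)

# Hypothetical biomarker values (e.g. expression of one gene) and true status.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
area = auc(curve)
print(curve)
print(area, best_threshold(scores, labels))
```

For these eight samples, 13 of the 16 positive-negative pairs are ranked correctly, which matches the trapezoidal AUC of 0.8125.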
Survival data – special case
Survival times – time to a given
end point
Survival analysis
Survival analysis
Goal: Estimate the probability of an individual surviving for a given time period (one year)
Technique: Kaplan-Meier survival curve, life table

Goal: Compare the survival experience of two different groups of individuals (drug/placebo)
Technique: Logrank test (comparison of different KM curves)

Goal: Detect clinical/genomic/epidemiologic variables which contribute to the risk (associated with poor outcome)
Technique: Multivariate (univariate) Cox regression model
Survival data
• Survival time is the time from a fixed starting point to an
end point
Starting point      End point
Surgery             Death/Recurrence/Relapse
Diagnosis           Death/Recurrence/Relapse
Treatment           Death/Recurrence/Relapse
• The event of interest is almost never observed in all
subjects (censoring of data)
• Need for special analytical techniques
Censored observations
• Arise whenever the dependent variable of interest represents
the time to a terminal event, and the duration of the study is
limited in time.
• Incomplete observation: the event of interest had not occurred by
the time of the analysis.
Event of interest             Censored observation
Death from the disease        Still alive
Survival of marriage          Still married
Drop-out time from school     Still in school
• Type I and II censoring (time fixed/proportion of subjects
fixed)
• Right and left censoring
Kaplan-Meier Curve
[Figure: Kaplan-Meier curves for Patient Group 1 and Patient Group 2. y-axis: survival probability (0 to 1); x-axis: time (months, 0 to 7); censored observations are marked on the curves.]
r – still at risk
f – failure (reached the end point)
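The Kaplan-Meier estimate can be sketched in a few lines: at each time with at least one failure f, the running survival probability is multiplied by (1 − f/r), where r is the number still at risk. The follow-up data below are invented for illustration, and the function names are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each failure time.

    times  - follow-up time of each patient
    events - 1 if the end point was reached (failure), 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)          # r: still at risk just before time t
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = sum(1 for tt, e in data if tt == t and e == 1)  # f at time t
        leaving = sum(1 for tt, e in data if tt == t)              # failed or censored at t
        if failures:
            surv *= 1 - failures / at_risk
            curve.append((t, surv))
        at_risk -= leaving
        i += leaving
    return curve

def survival_at(curve, t):
    """Estimated probability of surviving beyond time t (step function)."""
    s = 1.0
    for ti, si in curve:
        if ti <= t:
            s = si
        else:
            break
    return s

# Hypothetical cohort of 8 patients, time in months.
times  = [1, 2, 2, 3, 4, 5, 6, 7]
events = [1, 1, 0, 1, 0, 1, 0, 1]

curve = kaplan_meier(times, events)
print(curve)
print(survival_at(curve, 2.5))
```

Censored patients do not cause a drop in the curve; they only reduce the number at risk for later time points.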
Kaplan-Meier Curve
[Figure: Kaplan-Meier curves. y-axis: survival probability (0 to 1); x-axis: time (months, 0 to 7); censored observations are marked on the curves.]
What is the probability of a patient surviving 2.5 months?
P-value?
Logrank test: compare survival experience of two different groups of individuals
Log-rank
k – groups of patients to compare
O – observed number of events (summed over time points)
E – expected number of events (summed over time points)
V – variance of (O-E) (summed over time points)
For two groups, the test statistic is (O-E)^2 / V.
Then compare with the χ2 distribution with (k-1)
degrees of freedom and get the p-value
(Doesn’t tell how different)
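A minimal two-group logrank sketch, assuming the usual form of the test in which O, E and V are accumulated over the distinct failure times and (O − E)²/V is referred to a χ2 distribution with 1 degree of freedom. The survival times are invented, and the p-value uses the identity that the upper tail of χ2 with 1 degree of freedom equals erfc(sqrt(x/2)), so no external library is needed.

```python
import math

def logrank_test(times1, events1, times2, events2):
    """Two-group logrank test. Returns (chi-square statistic, p-value)."""
    all_times = list(times1) + list(times2)
    all_events = list(events1) + list(events2)
    failure_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})

    observed1 = expected1 = variance = 0.0
    for t in failure_times:
        n1 = sum(1 for x in times1 if x >= t)          # at risk in group 1
        n2 = sum(1 for x in times2 if x >= t)          # at risk in group 2
        d1 = sum(1 for x, e in zip(times1, events1) if x == t and e == 1)
        d2 = sum(1 for x, e in zip(times2, events2) if x == t and e == 1)
        n, d = n1 + n2, d1 + d2
        observed1 += d1
        expected1 += d * n1 / n                        # expected failures in group 1
        if n > 1:
            variance += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)

    stat = (observed1 - expected1) ** 2 / variance
    p_value = math.erfc(math.sqrt(stat / 2))           # chi-square, 1 degree of freedom
    return stat, p_value

# Hypothetical survival times (months); 1 = failure, 0 = censored.
group1_times, group1_events = [2, 3, 4, 5, 6, 7], [1, 1, 1, 1, 1, 1]
group2_times, group2_events = [10, 12, 14, 16, 18, 20], [1, 1, 1, 1, 1, 1]

stat, p = logrank_test(group1_times, group1_events, group2_times, group2_events)
print(stat, p)

# Comparing a group with itself gives no evidence of a difference.
same_stat, same_p = logrank_test(group1_times, group1_events,
                                 group1_times, group1_events)
print(same_stat, same_p)
```

The p-value only indicates whether the curves differ; the size of the difference is described by the hazard ratio.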
Hazard ratio
The hazard ratio compares two groups differing in treatments,
prognostic variables, etc. It measures relative survival in the two
groups based on the complete period studied.
R = 0.43: the relative risk (hazard) of poor outcome under the
condition of group 1 is 43% of that of group 2.
R = 2.0: the rate of failure in group 1 is twice the rate in
group 2.
(tells how different)
Cox-proportional hazard model
Used to investigate the effect of several variables on
survival experience.
A multivariable proportional hazards regression model
described by D.R. Cox for modeling survival times. It is
called the proportional hazards model because it estimates the
ratio of the risks (hazard ratio or relative hazard). There are
multiple predictor variables (such as prognostic markers
whose individual contribution to the outcome is being
assessed in the presence of the others) and one outcome
variable (survival time).
Hazard function
h(t) = h0(t) × exp(b1X1 + ... + bpXp)
Prognostic index (PI)
PI = b1X1 + ... + bpXp
• X1...Xp – independent variables of interest
• b1...bp – regression coefficients to be estimated
• Assumption: the effect of the variables is constant over time and
additive on a particular scale
• (Similarly to K-M) The hazard function is the risk of dying at a given
time, assuming survival thus far.
• Cumulative hazard function: H(t) = H0(t) × exp(PI)
• H0(t) – cumulative baseline (underlying) hazard function.
• Probability of surviving to time t is
S(t) = exp[-H(t)]
For every individual with given values of the variables in the model
we can estimate this probability.
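A small sketch of how a survival probability follows from a fitted model, assuming the standard Cox form in which the cumulative hazard is the baseline cumulative hazard H0(t) multiplied by exp(PI). The coefficients, covariate values and the baseline value H0(t) = 0.1 are invented for illustration.

```python
import math

def prognostic_index(coefs, values):
    """PI = b1*X1 + ... + bp*Xp."""
    return sum(b * x for b, x in zip(coefs, values))

def survival_probability(baseline_cum_hazard, coefs, values):
    """S(t) = exp[-H(t)] with H(t) = H0(t) * exp(PI)."""
    cum_hazard = baseline_cum_hazard * math.exp(prognostic_index(coefs, values))
    return math.exp(-cum_hazard)

coefs = [0.5, -0.2]   # hypothetical regression coefficients b1, b2
h0_t = 0.1            # hypothetical baseline cumulative hazard at time t

s_baseline = survival_probability(h0_t, coefs, [0, 0])   # PI = 0
s_high_risk = survival_probability(h0_t, coefs, [2, 0])  # PI = 1.0
s_low_risk = survival_probability(h0_t, coefs, [0, 3])   # PI = -0.6
print(s_baseline, s_high_risk, s_low_risk)
```

A larger prognostic index gives a larger cumulative hazard and therefore a lower estimated probability of surviving to time t.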
Interpretation of the Cox model
Cox regression model fitted to data from PBC trial of azathioprine vs placebo (n=216)
variable               Regression coef (b)   SE(b)      exp(b)   Hazard after increasing the variable by 1 (relative to baseline)
Serum bilirubin        2.510                 0.316      12.31    1231%
Age                    0.00690               0.00162    1.01     101%
Cirrhosis              0.879                 0.216      2.41     241%
Serum albumin          -0.0504               0.0181     0.95     95%
Central cholestasis    0.679                 0.275      1.97     197%
Therapy                0.52                  0.207      1.68     168%
• Coefficient:
• Sign – positive or negative association with poor survival
• Magnitude – the increase in log hazard for an
increase of 1 in the value of the covariate. If the value
changes by 1, the hazard is multiplied by exp(b).
Modified from Altman D, 1991
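The exp(b) column can be reproduced directly from the regression coefficients, as in the sketch below. The 95% interval computed as exp(b ± 1.96 × SE) is a standard Wald-type interval added here as an assumption; it is not shown on the slide. Note that exp(2.510) evaluates to about 12.30, marginally below the tabulated 12.31, presumably because the printed coefficient is rounded.

```python
import math

# (regression coefficient b, standard error SE(b)) as listed in the table.
cox_fit = {
    "Serum bilirubin":     (2.510, 0.316),
    "Age":                 (0.00690, 0.00162),
    "Cirrhosis":           (0.879, 0.216),
    "Serum albumin":       (-0.0504, 0.0181),
    "Central cholestasis": (0.679, 0.275),
    "Therapy":             (0.52, 0.207),
}

hazard_ratios = {}
for name, (b, se) in cox_fit.items():
    hr = math.exp(b)                 # hazard multiplier per unit increase
    lower = math.exp(b - 1.96 * se)  # Wald-type 95% limits (assumption)
    upper = math.exp(b + 1.96 * se)
    hazard_ratios[name] = hr
    direction = "higher" if b > 0 else "lower"
    print(f"{name}: exp(b) = {hr:.2f} ({lower:.2f} to {upper:.2f}), "
          f"{direction} hazard per unit increase")
```

A negative coefficient, as for serum albumin, gives exp(b) below 1, meaning the hazard falls as the variable increases.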
Example of Cox HR: lung cancer
• Higher HR -> higher risk/stronger association with poor outcome
• Multivariate risk estimation is more powerful. Compare .