Transcript Slide 1

Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 3
Statistical Analysis
Paul Boutros
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine Learning
Module 3: Statistical Analysis
bioinformatics.ca
What Are The Outputs of A Microarray
Study?
• Primary Data
– Raw image (.DAT file)
– Quantitation (.CEL file)
– These files can be 10s of GB for a typical Affy study
• Secondary Data
– Normalized data (usually an ASCII text file)
– QA/QC plots
• Tertiary Data
– Statistical analyses
– Global visualization (e.g. heatmaps)
– Downstream analyses (e.g. pathway, dataset-integration)
How Do You Organize These Data?
I recommend you put things on a fast, backed-up network drive
/data/
Organize data by project
/data/Project
Create separate directories for each analysis
/data/Project/raw
/data/Project/QAQC
/data/Project/pre-processing
/data/Project/statistical
/data/Project/pathway
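The layout above can be created from R. This is a minimal sketch, assuming a hypothetical project root under a temporary directory so that it is safe to run; substitute your own backed-up /data/ location.

```r
# hypothetical project root; a temporary directory is used so the sketch is safe to run
project.root <- file.path(tempdir(), "data", "Project");

# one directory per analysis, as in the layout above
analysis.dirs <- c("raw", "QAQC", "pre-processing", "statistical", "pathway");

for (analysis in analysis.dirs) {
    dir.create(
        file.path(project.root, analysis),
        recursive = TRUE,
        showWarnings = FALSE
        );
    }

# list what was created
created <- list.dirs(project.root, full.names = FALSE, recursive = FALSE);
print(created);
```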
How Do You Organize The Scripts?
I recommend you write a separate script for each analysis, and
put those in a standardized (backed-up!) location, mirroring the
directory structure and naming of your dataset directories.
Some sub-structure here is often useful:
/scripts/Project/pre-processing.R
/scripts/Project/statistical-univariate.R
/scripts/Project/statistical-multivariate.R
/scripts/Project/pathway/GOMiner.R
/scripts/Project/pathway/Reactome.R
/scripts/Project/integration/mRNA+CNV.R
/scripts/Project/integration/public-data.R
Why Many Small Scripts?
• Monolithic scripts are hard to maintain
– Easier to make errors
• Accidentally re-using the same variable name
• Harder to debug
– Harder for somebody else to learn
• Small scripts are more flexible
– Quicker to modify/re-run a small part of your analysis
– Easier to re-use the same code on another dataset
• This is akin to the “unix” mindset of systems design
What To Save?
• Everything!!
– All QA/QC plots (common reviewer request)
– All pre-processed data (needed for GEO uploads)
– Gene-wise statistical analyses
• Not just the statistically-significant genes
• Collapse all analyses into one file, though
– All plots/etc
• Using clear filenames is critical
• Disk-space is not usually a critical concern here
– Your raw data will be much larger than your output!
Most Important Points
• Do not delete things:
– Keep all old versions of your scripts by including the date in
the filename (or using source-control)
– Version output files by date
– I have needed to go back to analyses done 7 years prior!
• Make regular (weekly) backups:
– Try to pass this work off to professional sysadmins
– External hard-drives/USBs are okay if you cannot get
access to network drives, but try to automate
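Versioning output files by date can be sketched as follows. The helper dated.filename() and the toy results table are hypothetical, introduced only for illustration, and the file is written to a temporary directory.

```r
# hypothetical helper (not from the slides): prefix a filename with a date
dated.filename <- function(name, date = Sys.Date()) {
    paste0(format(date, "%Y-%m-%d"), "_", name);
    }

# toy results table standing in for a gene-wise statistical analysis
results <- data.frame(
    ProbeSet = c("probe_1", "probe_2"),
    P.Value = c(0.01, 0.50)
    );

# write the full table (not just the significant genes) to a dated file
output.file <- file.path(tempdir(), dated.filename("statistical-univariate.txt"));
write.table(results, file = output.file, sep = "\t", quote = FALSE, row.names = FALSE);
```

Putting the date first in ISO order (year-month-day) means that sorting the directory listing by name also sorts it by time.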
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine-Learning
Not All Experimental Designs Are Simple:
Alternative Questions
• Are a large number of groups different?
• Do two things synergize?
• Are mRNA levels correlated to something?
• Are mRNA levels associated with survival?
• Can we use mRNA levels to predict things?
General Linear Modeling
• The underlying mathematical framework for most
statistical techniques we are familiar with:
– ANOVAs
– Logistic regression
– Linear regression
– Multiple regression
Y = a0 + a1x1 + a2x2 + …
NOT the same as a “Generalized Linear Model”!!!
General Linear Modeling: Special Cases
Y = a0 + a1x1
x1 continuous
Linear Regression
logit(P(Y)) = a0 + a1x1
Y factorial
Logistic Regression (strictly a generalized linear model, fitted on the log-odds of Y)
Y = a0 + a1x1 + a2x2
x1,x2 continuous
Multiple Regression
ANOVAs
Y = a0 + a1x1
x1 factorial
1-way ANOVA
Y = a0 + a1x1 + a2x2 + a3x1x2
x1, x2 two-level factors
2-way ANOVA
ANOVA Experimental Designs Are Common
• Classic one-way ANOVAs:
– Treat a cell-line with 5 drugs – do any of them make a
difference?
– Make 5 different genetic mutations – do any of them alter
gene-expression?
• H0: all group means are equal (H1: the mean of at least one group differs)
• Guesses at the assumptions?
Assumptions Are Similar to T-test
• Normal distribution for the dependent variable
• Samples are independent
• Homoscedasticity
• Independent variables are:
– Not correlated
– Random normal variables
1-Way ANOVAs in R (Part 1)
# read the data normally
raw.data <- ReadAffy();
eset <- expresso(…);
# localize for readability
expression.matrix <- exprs(eset);
# have a list of groups
groups <- as.factor( pData(eset)$x );
1-Way ANOVAs in R (Part 2)
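# pre-allocate one p-value per gene
pvalues <- rep(NA, nrow(expression.matrix));
# loop over each gene
for (i in 1:nrow(expression.matrix)) {
# fit a one-way anova against the group labels
tmp <- aov(expression.matrix[i,] ~ groups);
# extract and store the p-value for this gene
pvalues[i] <- summary(tmp)[[1]][1,5];
}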
But This Is Limited
• A 1-way ANOVA just says that at least one group differs
• Which one? Use post hoc tests
• No microarray-specific aspects here
– Note: connection to multiple-testing
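Both limitations can be illustrated on simulated data. This is a sketch only: the expression matrix is made up, Tukey's HSD is one of several possible post hoc tests, and the Benjamini-Hochberg adjustment is one common choice for multiple testing; the slides do not prescribe either.

```r
set.seed(1);

# simulated expression matrix: 50 genes x 15 samples, 3 groups of 5 (made-up data)
groups <- factor(rep(c("A", "B", "C"), each = 5));
expression.matrix <- matrix(rnorm(50 * 15), nrow = 50);

# make gene 1 truly different in group C
expression.matrix[1, groups == "C"] <- expression.matrix[1, groups == "C"] + 5;

# one-way ANOVA p-value per gene
pvalues <- apply(
    expression.matrix,
    1,
    function(y) { summary(aov(y ~ groups))[[1]][1, 5]; }
    );

# adjust for multiple testing (Benjamini-Hochberg false discovery rate)
qvalues <- p.adjust(pvalues, method = "BH");

# post hoc test on gene 1: which pairs of groups differ?
gene1 <- data.frame(y = expression.matrix[1, ], groups = groups);
posthoc <- TukeyHSD(aov(y ~ groups, data = gene1));
print(posthoc$groups);
```

The post hoc table has one row per pair of groups, so it answers the "which one?" question that the overall ANOVA p-value cannot.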
Sometimes 1-Way ANOVAs are not worth
the Effort
Mutation 1
Wildtype
Mutation 2
1-way ANOVA + post hoc
Or 2 t-tests?
Not Always Testing Raw Data
Vehicle 1
Drug 1
Vehicle 2
Drug 2
Vehicle 3
Drug 3
3 drugs with different controls: 1-way ANOVA on the fold-changes
Two-Way ANOVAs
• Probably even more common than one-way ANOVAs
• Very powerful:
– Synergy?
– Additivity?
– Antagonism?
Y = a0 + a1x1 + a2x2 + a3x1x2
Assumptions?
Assumptions Are Similar to 1-Way ANOVA
• Normal distribution for the dependent variable
• Samples are independent
• Homoscedasticity
• Independent variables are:
– Not correlated
– Random normal variables
Do these treatments interact?
• Standard approach: ANOVA
[Figure: Treatment #1, Treatment #2, and their Interaction]
Example
Radiation Toxicity
• Some people are prone to late-stage radio-toxicity
• Does radiation induce specific patterns of gene-expression in
these people?
[Figure: experimental design with Radiation (0 Gy or 3 Gy) and Radio-Sensitive status]
Solution
• Fit an ANOVA model to each gene
[Figure: number of genes (axis 0 to 120) with Basal, Radiation, and Interaction effects, binned by fold-change (-2.0, -1.4, +1.4, +2.0 Fold)]
Most effects are due to radiation alone, minimal interaction
Two-Way ANOVAs in R
• The limma package is one very good approach for
this
• Alternatively standard model-fitting using the lm()
function can be done for each gene
• We will cover each approach in Tutorial #3
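As a rough illustration of the lm() route for a single gene, here is a sketch on simulated data. The 2x2 design, effect sizes and variable names are all made up; in practice the same fit would be repeated for each gene.

```r
set.seed(2);

# simulated 2x2 design with 4 replicates per cell (made-up data for one gene)
design <- expand.grid(
    replicate = 1:4,
    treatment1 = c("no", "yes"),
    treatment2 = c("no", "yes")
    );
design$treatment1 <- factor(design$treatment1);
design$treatment2 <- factor(design$treatment2);

# each treatment adds 1 unit alone; together they add an extra 3 (synergy)
both <- design$treatment1 == "yes" & design$treatment2 == "yes";
design$y <- 5 +
    1 * (design$treatment1 == "yes") +
    1 * (design$treatment2 == "yes") +
    3 * both +
    rnorm(nrow(design), sd = 0.3);

# Y = a0 + a1x1 + a2x2 + a3x1x2
fit <- lm(y ~ treatment1 * treatment2, data = design);
print(anova(fit));

# a3: the interaction coefficient
interaction.estimate <- coef(fit)["treatment1yes:treatment2yes"];
```

A positive interaction coefficient is consistent with synergy, one near zero with additivity, and a negative one with antagonism.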
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine-Learning
So Far We Have Considered Exactly-Defined Groups
Six cell-lines with differential sensitivity to a drug: 15%, 45%, 80%, 30%, 70%, 85%
What genes are associated with this phenomenon?
Two Basic Approaches
• Correlation metrics
– Correlations
– Mutual Information
• Fit linear models with continuous variables
Y = a0 + a1x1
Y = mRNA abundance
a0 = basal level
a1 = effect of drug
x1 = drug sensitivity
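The linear-model approach can be sketched for one gene as below. The six sensitivities are those of the cell-lines shown earlier; the mRNA abundances are simulated, and the gene name is hypothetical.

```r
# drug sensitivities (%) of the six cell-lines
sensitivity <- c(15, 45, 80, 30, 70, 85);

set.seed(3);
# made-up mRNA abundances for one hypothetical gene that tracks sensitivity
gene.associated <- 6 + 0.05 * sensitivity + rnorm(6, sd = 0.2);

# Y = a0 + a1x1
fit <- lm(gene.associated ~ sensitivity);
coef.table <- summary(fit)$coefficients;

a0 <- coef.table["(Intercept)", "Estimate"];       # basal level
a1 <- coef.table["sensitivity", "Estimate"];       # effect of drug
a1.pvalue <- coef.table["sensitivity", "Pr(>|t|)"];
```

Genes would then be ranked by the p-value (or size) of a1, followed by a multiple-testing adjustment across genes.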
Correlation Basics
• Start from the beginning, univariate statistics:
– Variance = Var(X) = E[(X – μX)^2]
– Standard Deviation = [Var(X)]^0.5
• But if you have two variables, how are they related?
– Covariance = Cov(X,Y) = E[ (X – μX)(Y – μY) ]
– Correlation is a scaled form of the covariance
Basic Properties of Correlations
• Unit-less
– Variance and covariance have squared units
– Standard deviation has normal units
• Range [-1.0,1.0]
– Range is independent of sample-size
– Range is independent of the range of X and Y
• Captures the degree to which two variables change
together
Relationship Types
• Correlation > 0
– Variables positively correlated
– When one goes up, the other one tends to as well
• Correlation < 0
– Variables negatively (inversely) correlated
– When one goes up, the other tends to go down
• Correlation = 0
– No linear relationship
– NB: if variables are independent, then correlation = covariance = 0
– NB: if correlation = covariance = 0, variables are not necessarily independent
Pearson’s Correlation
• Most common correlation metric, R
• Measures linear relationship between two variables
• R = Cov(X, Y) / (σXσY)
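The formula can be checked numerically against R's built-in cor(). The paired measurements below are made up for illustration.

```r
# made-up paired measurements
x <- c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0);
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2);

# R = Cov(X, Y) / (sd(X) * sd(Y))
r.manual <- cov(x, y) / (sd(x) * sd(y));
r.builtin <- cor(x, y, method = "pearson");

# rescaling either variable leaves the correlation unchanged (it is unit-less)
r.rescaled <- cor(1000 * x, y / 60);
```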
Pearson’s R Cannot Capture Non-Linear
Relationships Correctly
Spearman’s Rank-Order Correlation
• Second most-common correlation, ρ (Greek rho)
• Makes no assumption of a linear relationship
between variables (only a monotonic one)
• Simplified version of Pearson’s R
• Works directly on ranks
• di = xi – yi (the differences between ranks)
• ρ = 1 – (6 Σ di^2) / [n (n^2 – 1) ] (exact when there are no tied ranks)
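A small worked check of the rank formula, using made-up data without ties, compared against R's built-in Spearman correlation:

```r
# made-up data with no ties: y mostly increases with x, but not linearly
x <- c(1, 2, 3, 4, 5, 6);
y <- c(1, 8, 27, 20, 125, 216);

# differences between ranks
d <- rank(x) - rank(y);
n <- length(x);

# rho = 1 - (6 * sum(d^2)) / (n * (n^2 - 1))
rho.manual <- 1 - (6 * sum(d^2)) / (n * (n^2 - 1));
rho.builtin <- cor(x, y, method = "spearman");
```

Here only the third and fourth values swap ranks, so the sum of squared differences is 2 and ρ = 1 – 12/210.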
Spearman Example
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine-Learning
Survival Analysis
• A major new area in microarray analysis
• Works with any right-censored data
– Censoring: the value is only partially known
– Right-censoring: the value is at least this large
– Final outcome is not known:
• Patients are still alive at the time of the analysis
• An adverse drug-reaction has not happened yet
• Standard statistical approaches in use
Typical Survival Curve
Key Survival Statistics
• Cox proportional hazards model
– HR = hazard ratio
– P = p-value for the null hypothesis that the hazard ratio is 1.0
• Log-rank test
– P = p-value for the null hypothesis that two survival curves do not differ
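Both statistics can be sketched with the survival package, assuming it is installed (it normally ships with R). The lung dataset bundled with that package is used purely as a stand-in; it is not microarray data.

```r
# assumes the 'survival' package is installed
library(survival);

# 'lung' is an example dataset bundled with the package
# status: 1 = censored, 2 = dead; sex: 1 = male, 2 = female
fit <- coxph(Surv(time, status) ~ sex, data = lung);

# hazard ratio with its 95% confidence interval
hazard.ratio <- summary(fit)$conf.int["sex", c("exp(coef)", "lower .95", "upper .95")];

# Wald-test p-value for H0: hazard ratio = 1
cox.pvalue <- summary(fit)$coefficients["sex", "Pr(>|z|)"];

# log-rank test comparing the two survival curves
logrank <- survdiff(Surv(time, status) ~ sex, data = lung);
logrank.pvalue <- 1 - pchisq(logrank$chisq, df = 1);
```

For a per-gene analysis, the covariate would be the expression of one gene (or a high/low split of it), and the fit would be repeated across genes.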
Example
• Beer and coworkers studied non-small cell lung
cancer using an older Affymetrix microarray:
– 12 samples of normal lung
– 83 samples of non-small cell lung cancer
– ~10,000 genes on their array
• Two questions:
– How many genes are associated with tumour-initiation?
– How many genes are associated with tumour-progression?
Tumour Initiation: per-gene t-tests
ProbeSet     P Value
100_g_at     0.738582421
1000_at      0.00116827
1001_at      2.30041E-10
1002_f_at    0.073332706
1003_s_at    0.007888154
1004_at      0.004095259
1005_at      4.66128E-07
1006_at      0.000765821
1007_s_at    6.20196E-07
1008_f_at    0.85658184
1009_at      0.549685496
101_at       0.030792544
1010_at      0.369870948
[Figure: bar chart of the number of genes (axis 0 to 180) that are Down vs. Up]
More genes repressed
Fewer oncogenes?!
Tumour Progression: per gene Cox models
More genes are involved in helping a
tumour resist treatment and grow larger
than in “making” it in the first place!
Threshold    Progression    Initiation
P < 0.05     733 Genes      230 Genes
P < 0.01     136 Genes      63 Genes
P < 0.001    15 Genes       2 Genes
Warning
• There are several assumptions to a Cox model:
– Semi-parametric
• No assumptions made about “baseline hazard”
– Censoring must be independent of events
• You shouldn’t be more likely to lose follow-up information on
patients who die
– Hazards must be proportional
• The hazard ratio does not change across time
• In general you want to have a statistician around to
ensure you are doing survival analyses correctly.
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine-Learning
Meta-Analysis
• Combining the results of multiple studies that address
related hypotheses
• Often used to merge data from different microarray
platforms
• Very challenging – unclear what the best approaches
are, or how they should be adapted to the
peculiarities of microarray data
Why Do Meta-Analysis?
• Can identify publication biases
• Appropriately weights diverse studies
– Sample-size
– Experimental-reliability
– Similarity of study-specific hypotheses to the overall one
• Increases statistical power
• Condenses information
– A single meta-analysis vs. five large studies
– Provides clearer guidance
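One common way to weight studies is by the inverse of their variance (a fixed-effect meta-analysis). This is offered only as an illustrative sketch with made-up numbers; the slides do not name a specific weighting scheme.

```r
# made-up per-study results for one gene: log2 fold-change and its standard error
studies <- data.frame(
    study = c("A", "B", "C"),
    effect = c(1.2, 0.8, 1.0),
    se = c(0.5, 0.2, 0.4)
    );

# inverse-variance weights: precise (usually larger) studies count for more
weights <- 1 / studies$se^2;

pooled.effect <- sum(weights * studies$effect) / sum(weights);
pooled.se <- sqrt(1 / sum(weights));

# two-sided p-value for H0: pooled effect = 0
pooled.z <- pooled.effect / pooled.se;
pooled.pvalue <- 2 * pnorm(-abs(pooled.z));
```

The pooled standard error is smaller than that of any single study, which is where the gain in statistical power comes from.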
Challenges of Meta-Analysis
• No control for bias
– What happens if most studies are poorly designed?
• File-drawer problem
– Publication bias can be detected, but not explicitly
controlled for
• How homogeneous is the data?
– Can it be fairly grouped?
– Simpson’s Paradox
Simpson’s Paradox
Group-wise correlations are inverted when the groups are merged.
Cautionary note for all meta-analyses!
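The inversion is easy to reproduce with a tiny made-up dataset: two groups that each show a perfectly negative relationship, but sit at different overall levels.

```r
# made-up data: two groups, each with a perfectly negative x-y relationship
group1 <- data.frame(x = 1:5, y = 6 - (1:5));
group2 <- data.frame(x = 11:15, y = 26 - (11:15));

# within-group correlations are both negative
cor.group1 <- cor(group1$x, group1$y);
cor.group2 <- cor(group2$x, group2$y);

# merging the groups flips the sign of the correlation
merged <- rbind(group1, group2);
cor.merged <- cor(merged$x, merged$y);
```

In a meta-analysis the "groups" are studies, so pooling raw data across studies with different baselines can produce exactly this artifact.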
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine-Learning
What are Predictors?
• Predictors: Use information to predict the likelihood
of a future event
– e.g. use historical data to predict economic performance
– e.g. use wind patterns to predict weather
• The process of finding a predictor is usually called
“machine learning”
• Have you used any machine-learning today?
Basic Examples
• If you have a mutation in Brca1 we predict that your
chance of getting breast cancer is elevated.
• Can we find other predictors that can identify people
at risk for disease?
• Can these predictors be comprised of multiple
genes?
This is a hard problem – consider lung cancer again….
Different Groups; Different Genes
[Figure: overlap between the gene lists of Wigle et al (Toronto), Ramaswamy et al, Garber et al (Stanford), Tomida et al (Japan), Bhattacharjee et al (Harvard), Beer et al (Michigan) and the OCI Analysis; counts shown: 17, 3, 29, 1, 38, 2, 1, 3, 22, 18, 25]
What Causes Low Overlap?
• Experimental Technique
– Different platforms, patient cohorts
• Inherent limitations in mRNA data
– Too much noise
– Too little signal
• Statistical Analysis
– Insufficient replication
– Weak analytical methods
Improved Algorithms Give Improved
Results….
[Figure: time axis in Years; P: 9.8 x 10^-6; HR: 4.8 (2.4 - 9.5)]
Caveat: This analysis worked beautifully, but it did take ~4 years to complete.
Developing predictive signatures is a very slow and complex process, and cannot be done quickly.
How are Predictors Identified?
[Figure: workflow from Genes through Gene Subset Selection and a Machine Learning Algorithm to Predictor Training, then Predictor Testing (Validation)]
Common Machine Learning Algorithms
• Support Vector Machines (SVMs)
• Random Forests (RF)
• Prediction Analysis of Microarrays (PAM)
• Naïve Bayes
• Linear Discriminant Analysis (LDA)
• Decision Trees
• Classification and Regression Trees (CART)
• Boltzmann Learning
• Neural Networks
• K-Nearest Neighbours (KNN)
• Maximum Likelihood Estimation (MLE)
• Multiple Discriminant Analysis (MDA)
• Logistic Regression
• Multivariate Adaptive Regression Splines (MARS)
• Flexible Discriminant Analysis
• Gaussian Mixtures
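The training and testing (validation) steps can be sketched with simulated data. Logistic regression is used here only because it is available in base R; the patients, the single informative gene and the 50/50 split are all assumptions made for illustration.

```r
set.seed(4);

# made-up data: 200 patients, one gene shifted by 2 units in cases
n <- 200;
status <- rep(c(0, 1), each = n / 2);
gene <- rnorm(n, mean = 2 * status);
patients <- data.frame(status = status, gene = gene);

# split into training and testing (validation) sets
training.rows <- sample(n, size = n / 2);
training <- patients[training.rows, ];
testing <- patients[-training.rows, ];

# train the predictor on the training set only
predictor <- glm(status ~ gene, data = training, family = binomial);

# evaluate on patients the predictor has never seen
predicted.probability <- predict(predictor, newdata = testing, type = "response");
predicted.status <- as.numeric(predicted.probability > 0.5);
accuracy <- mean(predicted.status == testing$status);
```

The key design point is that the testing set plays no part in fitting, so the accuracy estimates how the predictor behaves on new patients.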
The “No Free Lunch Theorem”
A major theoretical finding in the field of pattern recognition from the 1990s
No method of finding predictors is
generally better than others:
There is always data-set dependency.
Curse of Dimensionality
• Machine-learning for –omic studies is an extremely
active research topic
• One major challenge is the curse of dimensionality
– p >> n
– p = number of dimensions (i.e. number of genes)
– n = number of samples (i.e. number of patients)
– By chance alone, you will find good predictors… but which
ones will generalize to larger patient cohorts?
• Akin to multiple-testing
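The "by chance alone" point can be demonstrated on pure noise. In this sketch the numbers of genes and patients, the group labels and the 0.05 threshold are arbitrary choices for illustration.

```r
set.seed(5);

# pure noise: 2000 "genes" measured in 10 "patients", with arbitrary group labels
p <- 2000;
n <- 10;
noise <- matrix(rnorm(p * n), nrow = p);
labels <- factor(rep(c("good", "poor"), each = n / 2));

# per-gene t-test between the two (meaningless) groups
pvalues <- apply(noise, 1, function(y) { t.test(y ~ labels)$p.value; });

# genes that look like predictors by chance alone
n.by.chance <- sum(pvalues < 0.05);

# the same count after a Benjamini-Hochberg multiple-testing adjustment
n.after.adjustment <- sum(p.adjust(pvalues, method = "BH") < 0.05);
```

None of the genes carries real signal, so any that pass the unadjusted threshold would be expected to fail in a larger cohort.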
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine Learning
Closing Thought
• Meta-Analysis
• Survival-Analysis
• Machine-Learning
The last three topics are highly complex. I strongly recommend you consider
finding experts in them to collaborate with as needed.
We are on a Coffee Break &
Networking Session