Module 3 bioinformatics - Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 3
Advanced Analysis
Paul Boutros
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine Learning
Module 3
bioinformatics.ca
What Are The Outputs of A Microarray
Study?
• Primary Data
– Raw image (.DAT file)
– Quantitation (.CEL file)
– These files can be tens of GB for a typical Affymetrix study
• Secondary Data
– Normalized data (usually an ASCII text file)
– QA/QC plots
• Tertiary Data
– Statistical analyses
– Global visualization (e.g. heatmaps)
– Downstream analyses (e.g. pathway, dataset-integration)
How Do You Organize These Data?
I recommend you put things on a fast, backed-up network drive
/data/
Organize data by project
/data/Project
Create separate directories for each analysis
/data/Project/raw
/data/Project/QAQC
/data/Project/pre-processing
/data/Project/statistical
/data/Project/pathway
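The layout above can be created programmatically. This is a minimal sketch, assuming a hypothetical project root placed under tempdir() so that it is safe to run anywhere; on a real system the root would be the backed-up network drive.

```r
# Hypothetical project root; tempdir() stands in for the network drive
project.root <- file.path(tempdir(), "data", "Project");

# one sub-directory per analysis, using the names from the slide
analyses <- c("raw", "QAQC", "pre-processing", "statistical", "pathway");

for (analysis in analyses) {
    dir.create(
        file.path(project.root, analysis),
        recursive = TRUE,
        showWarnings = FALSE
        );
    }

# list the directories that now exist under the project root
created <- list.dirs(project.root, full.names = FALSE, recursive = FALSE);
print(created);
```

Running the same loop for a /scripts/Project tree would keep the script directories mirrored with the data directories.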
How Do You Organize The Scripts?
I recommend you write a separate script for each analysis, and
put those in a standardized (backed-up!) location, mirroring the
directory structure and naming of your dataset directories.
Some sub-structure here is often useful:
/scripts/Project/pre-processing.R
/scripts/Project/statistical-univariate.R
/scripts/Project/statistical-multivariate.R
/scripts/Project/pathway/GOMiner.R
/scripts/Project/pathway/Reactome.R
/scripts/Project/integration/mRNA+CNV.R
/scripts/Project/integration/public-data.R
Why Many Small Scripts?
• Monolithic scripts are hard to maintain
– Easier to make errors
• Accidentally re-using the same variable name
• Harder to debug
– Harder for somebody else to learn
• Small scripts are more flexible
– Quicker to modify/re-run a small part of your analysis
– Easier to re-use the same code on another dataset
• This is akin to the “unix” mindset of systems design
What To Save?
• Everything!!
– All QA/QC plots (common reviewer request)
– All pre-processed data (needed for GEO uploads)
– Gene-wise statistical analyses
• Not just the statistically-significant genes
• Collapse all analyses into one file, though
– All plots/etc
• Using clear filenames is critical
• Disk-space is not usually a critical concern here
– Your raw data will be much larger than your output!
Most Important Points
• Do not delete things:
– Keep all old versions of your scripts by including the date in
the filename (or using source-control)
– Version output files by date
– I have needed to go back to analyses done 7 years prior!
• Make regular (weekly) backups:
– Try to pass this work off to professional sysadmins
– External hard-drives/USBs are okay if you cannot get
access to network drives, but try to automate
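Versioning output files by date can be automated with a small helper. This is a sketch only: versioned.filename() is a hypothetical function, and the results table holds placeholder values rather than real data.

```r
# Hypothetical helper: prefix a filename with a date (today by default)
versioned.filename <- function(name, date = Sys.Date()) {
    paste0(format(date, "%Y-%m-%d"), "_", name);
    }

# placeholder results table, for illustration only
results <- data.frame(
    ProbeSet = c("geneA", "geneB"),
    P = c(0.5, 0.5)
    );

# write to a date-stamped file so that earlier outputs are never overwritten
output.file <- file.path(tempdir(), versioned.filename("statistical-univariate.txt"));
write.table(results, file = output.file, sep = "\t", quote = FALSE, row.names = FALSE);
print(basename(output.file));
```

Dates in year-month-day order sort chronologically in a directory listing, which is why that format is used here.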
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine-Learning
Not All Experimental Designs Are Simple:
Alternative Questions
• Are a large number of groups different?
• Do two things synergize?
• Are mRNA levels correlated to something?
• Are mRNA levels associated with survival?
• Can we use mRNA levels to predict things?
General Linear Modeling
• The underlying mathematical framework for most
statistical techniques we are familiar with:
– ANOVAs
– Logistic regression
– Linear regression
– Multiple regression
Y = a0 + a1x1 + a2x2 + …
NOT the same as a “Generalized Linear Model”!!!
General Linear Modeling: Special Cases
Y = a0 + a1x1
x1 continuous
Linear Regression
Y = a0 + a1x1
Y factorial
Logistic Regression
Y = a0 + a1x1 + a2x2
x1,x2 continuous
Multiple Regression
ANOVAs
Y = a0 + a1x1
x1 factorial
1-way ANOVA
Y = a0 + a1x1 + a2x2 + a3x1x2
x1, x2 two-level factors
2-way ANOVA
ANOVA Experimental Designs Are Common
• Classic one-way ANOVAs:
– Treat a cell-line with 5 drugs – do any of them make a
difference?
– Make 5 different genetic mutations – do any of them alter
gene-expression?
• H0: all group means are equal (alternative: the mean of at least one group differs)
• Guesses at the assumptions?
Assumptions Are Similar to T-test
• Normal distribution for the dependent variable
• Samples are independent
• Homoscedasticity
• Independent variables are:
– Not correlated
– Random normal variables
1-Way ANOVAs in R (Part 1)
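# load the Bioconductor affy package
library(affy);
# read the data normally
raw.data <- ReadAffy();
eset <- expresso(…);
# localize for readability
expression.matrix <- exprs(eset);
# have a list of groups (assumed to be stored in the phenoData column "x")
groups <- as.factor( pData(eset)$x );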
1-Way ANOVAs in R (Part 2)
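# pre-allocate one p-value per gene
pvalues <- rep(NA, nrow(expression.matrix));
# loop over each gene
for (i in 1:nrow(expression.matrix)) {
    # fit a one-way anova against the group labels from Part 1
    tmp <- aov(expression.matrix[i,] ~ groups);
    # extract the p-value for the group effect and store it
    pvalues[i] <- summary(tmp)[[1]][1,5];
}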
But This Is Limited
• A 1-way ANOVA just says that one group differs
• Which one? → post hoc tests
• No microarray-specific aspects here
– Note: connection to multiple-testing
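Both limitations can be illustrated in base R. The sketch below uses simulated data (not a real microarray study), adjusts the per-gene ANOVA p-values with p.adjust(), and runs Tukey's HSD as one possible post hoc test; the slides do not name a specific post hoc procedure, so that choice is an assumption.

```r
set.seed(1);

# Simulated (hypothetical) data: 100 genes x 15 samples in 3 groups of 5
groups <- factor(rep(c("A", "B", "C"), each = 5));
expression.matrix <- matrix(rnorm(100 * 15), nrow = 100);

# make group C clearly higher for the first gene only
expression.matrix[1, groups == "C"] <- expression.matrix[1, groups == "C"] + 5;

# one-way ANOVA p-value for every gene
pvalues <- apply(
    expression.matrix,
    1,
    function(y) { summary(aov(y ~ groups))[[1]][1, 5]; }
    );

# multiple-testing adjustment (false-discovery rate)
qvalues <- p.adjust(pvalues, method = "fdr");

# post hoc test for the first gene: which pairs of groups differ?
gene1 <- data.frame(y = expression.matrix[1, ], groups = groups);
posthoc <- TukeyHSD(aov(y ~ groups, data = gene1));
print(posthoc$groups);
```

The rows of posthoc$groups are pairwise comparisons (B-A, C-A, C-B), each with an adjusted p-value.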
Sometimes 1-Way ANOVAs are not worth
the Effort
[Figure: Wildtype compared with Mutation 1 and Mutation 2]
1-way ANOVA + post hoc, or 2 t-tests?
Not Always Testing Raw Data
[Figure: 3 drugs with different controls (Vehicle 1 vs. Drug 1, Vehicle 2 vs. Drug 2, Vehicle 3 vs. Drug 3)]
1-way ANOVA on the fold-changes
Two-Way ANOVAs
• Probably even more common than one-way ANOVAs
• Very powerful:
– Synergy?
– Additivity?
– Antagonism?
Y = a0 + a1x1 + a2x2 + a3x1x2
Assumptions?
Assumptions Are Similar to 1-Way ANOVA
• Normal distribution for the dependent variable
• Samples are independent
• Homoscedasticity
• Independent variables are:
– Not correlated
– Random normal variables
Do these treatments interact?
• Standard approach: ANOVA
[Figure: Treatment #1, Treatment #2 and their Interaction]
Example
Radiation Toxicity
• Some people are prone to late-stage radio-toxicity
• Does radiation induce specific patterns of gene-expression in
these people?
[Figure: 2 x 2 design crossing Radiation (0 Gy vs. 3 Gy) with Radio-Sensitive status]
Solution
• Fit an ANOVA model to each gene
[Figure: histogram of gene counts (axis 0 to 120) by fold-change (-2.0, -1.4, +1.4 and +2.0 Fold) for the Basal, Radiation and Interaction terms]
Most effects are due to radiation alone, minimal interaction
Two-Way ANOVAs in R
• The limma package is one very good approach for
this
• Alternatively standard model-fitting using the lm()
function can be done for each gene
• We will cover each approach in Tutorial #3
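As a preview of the lm() route, here is a minimal sketch for a single gene. The 2 x 2 design and the expression values are simulated, and the coefficient names a0 to a3 follow the model Y = a0 + a1x1 + a2x2 + a3x1x2. limma is not used here because it requires a Bioconductor installation.

```r
set.seed(42);

# Hypothetical 2 x 2 design with 4 replicates per cell
design <- expand.grid(replicate = 1:4, x1 = c(0, 1), x2 = c(0, 1));

# Simulated expression for one gene: additive effects only (true a3 = 0)
design$y <- 8 + 1.5 * design$x1 + 0.5 * design$x2 + rnorm(nrow(design), sd = 0.2);

# y ~ x1 * x2 expands to x1 + x2 + x1:x2, i.e. both main effects plus interaction
fit <- lm(y ~ x1 * x2, data = design);
coefs <- summary(fit)$coefficients;
print(coefs);

# the x1:x2 row is the interaction term a3; a small p-value would suggest
# synergy or antagonism rather than additivity
interaction.p <- coefs["x1:x2", "Pr(>|t|)"];
```

For a whole array this fit would be repeated for every row of the expression matrix, followed by a multiple-testing adjustment.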
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine-Learning
So Far We Have Considered Exactly-Defined Groups
[Figure: six cell-lines labelled 15%, 45%, 80%, 30%, 70% and 85%]
Six cell-lines with differential sensitivity to a drug
What genes are associated with this phenomenon?
Two Basic Approaches
• Correlation metrics
– Correlations
– Mutual Information
• Fit linear models with continuous variables
Y = a0 + a1x1
Y = mRNA abundance
a0 = basal level
a1 = effect of drug
x1 = drug sensitivity
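The linear-model approach can be sketched with lm(), fitting Y = a0 + a1x1 to each gene. The six sensitivity values are taken from the cell-line slide (as fractions); the expression matrix is simulated, with the first gene deliberately built to track sensitivity.

```r
set.seed(7);

# drug sensitivity of the six cell-lines (x1)
sensitivity <- c(0.15, 0.45, 0.80, 0.30, 0.70, 0.85);

# Simulated (hypothetical) expression matrix: 50 genes x 6 cell-lines
expression.matrix <- matrix(rnorm(50 * 6, mean = 8), nrow = 50);

# gene 1 is constructed so that a0 = 6 and a1 = 4, plus a little noise
expression.matrix[1, ] <- 6 + 4 * sensitivity + rnorm(6, sd = 0.1);

# fit the model to each gene and keep a0, a1 and the p-value for a1
results <- t(apply(
    expression.matrix,
    1,
    function(y) {
        fit <- summary(lm(y ~ sensitivity))$coefficients;
        c(
            a0 = fit["(Intercept)", "Estimate"],
            a1 = fit["sensitivity", "Estimate"],
            p = fit["sensitivity", "Pr(>|t|)"]
            );
        }
    ));
print(head(results, 3));
```

With only six samples the per-gene tests have little power, so the p-values would still need a multiple-testing adjustment before interpretation.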
Correlation Basics
• Start from the beginning, univariate statistics:
– Variance = Var(X) = E[(X – μX)^2]
– Standard Deviation = [Var(X)]^0.5
• But if you have two variables, how are they related?
– Covariance = Cov(X,Y) = E[ (X – μX)(Y – μY) ]
– Correlation is a scaled form of the covariance
Basic Properties of Correlations
• Unit-less
– Variance and covariance have squared units
– Standard deviation has normal units
• Range [-1.0,1.0]
– Range is independent of sample-size
– Range is independent of the range of X and Y
• Captures the degree to which two variables change
together
Relationship Types
• Correlation > 0
– Variables positively correlated
– When one goes up, the other one tends to as well
• Correlation < 0
– Variables negatively (inversely) correlated
– When one goes up, the other tends to go down
• Correlation = 0
– No linear relationship
– NB: if variables are independent, then correlation = covariance = 0
– NB: if correlation = covariance = 0, variables may or may not be independent
Pearson’s Correlation
• Most common correlation metric, R
• Measures linear relationship between two variables
• R = Cov(X, Y) / (σXσY)
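The definition can be checked numerically against R's built-in cor(). The paired values below are hypothetical. R's cov() and sd() use the sample (n - 1) denominators rather than the population expectation, but the factor cancels in the ratio, so the correlation is the same.

```r
# Hypothetical paired measurements
x <- c(2.1, 3.4, 4.0, 5.8, 6.3, 7.9);
y <- c(1.0, 2.2, 2.9, 4.1, 5.2, 5.8);

# sample covariance, written out from the definition
cov.xy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1);

# Pearson's R = Cov(X, Y) / (sd(X) * sd(Y))
r.manual <- cov.xy / (sd(x) * sd(y));

# built-in equivalent
r.builtin <- cor(x, y, method = "pearson");

print(c(manual = r.manual, builtin = r.builtin));
```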
Pearson’s R Cannot Capture Non-Linear
Relationships Correctly
Spearman’s Rank-Order Correlation
• Second most-common correlation, ρ (Greek rho)
• Does not assume a linear relationship between variables, only a monotonic one
• Equivalent to Pearson’s R computed on the ranks
• Works directly on ranks
• di = rank(xi) – rank(yi) (the differences between ranks)
• ρ = 1 – (6 Σ di^2) / [n (n^2 – 1)] (when there are no tied ranks)
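A short worked example of the rank-difference formula, which uses the squared differences between ranks. The data are hypothetical and contain no ties, so the formula and cor(method = "spearman") should agree.

```r
# Hypothetical data: y increases with x except for one swapped pair
x <- 1:8;
y <- c(1, 4, 9, 16, 30, 25, 49, 64);

# differences between the ranks of each pair
d <- rank(x) - rank(y);
n <- length(x);

# Spearman's rho from the rank-difference formula (valid without ties)
rho.manual <- 1 - (6 * sum(d^2)) / (n * (n^2 - 1));

# built-in Spearman, and Pearson for comparison
rho.builtin <- cor(x, y, method = "spearman");
r.pearson <- cor(x, y, method = "pearson");

print(c(spearman.manual = rho.manual, spearman.builtin = rho.builtin, pearson = r.pearson));
```

Only one pair of ranks is swapped, so the sum of squared differences is 2 and ρ = 1 – 12/504.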
Spearman Example
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine-Learning
Survival Analysis
• A major new area in microarray analysis
• Works with any right-censored data
– Censoring: the value is only partially known
– Right-censoring: the value is at least this large
– Final outcome is not known:
• Patients are still alive at the time of the analysis
• An adverse drug-reaction has not happened yet
• Standard statistical approaches in use
Typical Survival Curve
Key Survival Statistics
• Cox proportional hazards model
– HR = hazard ratio
– P = p-value for the null hypothesis that the hazard ratio is 1.0
• Log-rank test
– Tests whether two survival curves differ
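Both statistics are available in the survival package, which is assumed to be installed (it ships with standard R distributions). The sketch below uses simulated right-censored data; the two groups and their hazards are hypothetical.

```r
library(survival);

set.seed(123);
n <- 100;

# hypothetical binary grouping, e.g. low (0) vs. high (1) expression of one gene
group <- rep(c(0, 1), each = n / 2);

# simulated event times: group 1 has a three-fold higher hazard
event.time <- rexp(n, rate = 0.1 * ifelse(group == 1, 3, 1));

# simulated censoring times, independent of the events
censor.time <- rexp(n, rate = 0.05);

# observed follow-up, and whether the event was seen (1) or right-censored (0)
follow.up <- pmin(event.time, censor.time);
status <- as.numeric(event.time <= censor.time);

# Cox proportional hazards model: hazard ratio and its p-value
cox.fit <- coxph(Surv(follow.up, status) ~ group);
hazard.ratio <- exp(coef(cox.fit))[["group"]];
cox.p <- summary(cox.fit)$coefficients["group", "Pr(>|z|)"];

# log-rank test comparing the two survival curves
logrank <- survdiff(Surv(follow.up, status) ~ group);
logrank.p <- 1 - pchisq(logrank$chisq, df = 1);

print(c(HR = hazard.ratio, cox.p = cox.p, logrank.p = logrank.p));
```

With a single binary covariate the Cox p-value and the log-rank p-value are usually very close, as the log-rank test corresponds to the score test of that Cox model.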
Example
• Beer and coworkers studied non-small cell lung
cancer using an older Affymetrix microarray:
– 12 samples of normal lung
– 83 samples of non-small cell lung cancer
– ~10,000 genes on their array
• Two questions:
– How many genes are associated with tumour-initiation?
– How many genes are associated with tumour-progression?
Tumour Initiation: per-gene t-tests
ProbeSet      P Value
100_g_at      0.738582421
1000_at       0.00116827
1001_at       2.30041E-10
1002_f_at     0.073332706
1003_s_at     0.007888154
1004_at       0.004095259
1005_at       4.66128E-07
1006_at       0.000765821
1007_s_at     6.20196E-07
1008_f_at     0.85658184
1009_at       0.549685496
101_at        0.030792544
1010_at       0.369870948
[Figure: histogram of gene counts (axis 0 to 180) for Down vs. Up genes]
More genes repressed
Fewer oncogenes?!
Tumour Progression: per gene Cox models
More genes are involved in helping a
tumour resist treatment and grow larger
than in “making” it in the first place!
             Progression   Initiation
P < 0.05     733 Genes     230 Genes
P < 0.01     136 Genes     63 Genes
P < 0.001    15 Genes      2 Genes
Warning
• There are several assumptions to a Cox model:
– Semi-parametric
• No assumptions made about the “baseline hazard”
– Censoring must be independent of events
• You shouldn’t be more likely to lose follow-up information on
patients who die
– Hazards must be proportional
• The hazard ratio between groups does not change across time
• In general you want to have a statistician around to
ensure you are doing survival analyses correctly.
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine-Learning
Meta-Analysis
• Combining results of multiple studies that examine
related hypotheses
• Often used to merge data from different microarray
platforms
• Very challenging – unclear what the best approaches
are, or how they should be adapted to the
peculiarities of microarray data
Why Do Meta-Analysis?
• Can identify publication biases
• Appropriately weights diverse studies
– Sample-size
– Experimental-reliability
– Similarity of study-specific hypotheses to the overall one
• Increases statistical power
• Condenses information
– A single meta-analysis vs. five large studies
– Provides clearer guidance
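One common way to weight studies is inverse-variance (fixed-effect) pooling, where more precise studies count for more. The slides do not name a specific method, so this is an illustrative choice, and the per-study estimates and standard errors below are hypothetical numbers.

```r
# Hypothetical per-study effect estimates for one gene (e.g. log2 fold-changes)
estimates <- c(0.80, 0.50, 1.10, 0.65);

# Hypothetical standard errors; larger studies tend to have smaller ones
std.errors <- c(0.40, 0.20, 0.60, 0.25);

# inverse-variance weights
weights <- 1 / std.errors^2;

# pooled estimate and its standard error
pooled.estimate <- sum(weights * estimates) / sum(weights);
pooled.se <- sqrt(1 / sum(weights));

# two-sided p-value from a normal approximation
pooled.z <- pooled.estimate / pooled.se;
pooled.p <- 2 * pnorm(-abs(pooled.z));

print(c(estimate = pooled.estimate, se = pooled.se, p = pooled.p));
```

The pooled standard error is smaller than that of any single study, which is the gain in statistical power described above.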
Challenges of Meta-Analysis
• No control for bias
– What happens if most studies are poorly designed?
• File-drawer problem
– Publication bias can be detected, but not explicitly
controlled for
• How homogeneous is the data?
– Can it be fairly grouped?
– Simpson’s Paradox
Simpson’s Paradox
Group-wise correlations are inverted when the groups are merged.
Cautionary note for all meta-analyses!
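The inversion is easy to reproduce with a small constructed dataset. The two groups below are hypothetical: within each group x and y rise together, but the group with larger x values has smaller y values overall.

```r
# Hypothetical data for two groups (e.g. two studies or two platforms)
group <- rep(c("A", "B"), each = 5);
x <- c(1:5, 11:15);
y <- c((1:5) + 10, (11:15) - 10);

# within each group the correlation is perfectly positive
cor.A <- cor(x[group == "A"], y[group == "A"]);
cor.B <- cor(x[group == "B"], y[group == "B"]);

# after merging the groups the correlation becomes negative
cor.pooled <- cor(x, y);

print(c(A = cor.A, B = cor.B, pooled = cor.pooled));
```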
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine-Learning
What are Predictors?
• Predictors: Use information to predict the likelihood
of a future event
– e.g. use historical data to predict economic performance
– e.g. use wind patterns to predict weather
• The process of finding a predictor is usually called
“machine learning”
• Have you used any machine-learning today?
Basic Examples
• If you have a mutation in BRCA1 we predict that your
chance of getting breast cancer is elevated.
• Can we find other predictors that can identify people
at risk for disease?
• Can these predictors be composed of multiple
genes?
This is a hard problem – consider lung cancer again…
Different Groups; Different Genes
[Figure: overlap between the gene lists reported by Wigle et al (Toronto), Ramaswamy et al, Garber et al (Stanford), Tomida et al (Japan), Bhattacharjee et al (Harvard), Beer et al (Michigan) and the OCI Analysis]
What Causes Low Overlap?
• Experimental Technique
– Different platforms, patient cohorts
• Inherent limitations in mRNA data
– Too much noise
– Too little signal
• Statistical Analysis
– Insufficient replication
– Weak analytical methods
Improved Algorithms Give Improved
Results….
[Figure: plot with a time axis in Years; P: 9.8 x 10^-6, HR: 4.8 (2.4 - 9.5)]
Caveat: This analysis worked beautifully, but it did take ~4 years to complete. Developing predictive signatures is a very slow and complex process, and cannot be done quickly.
How are Predictors Identified?
[Figure: workflow with the stages Genes, Subset Selection, Machine Learning Algorithm, Predictor Training and Predictor Testing (Validation)]
Common Machine Learning Algorithms
• Support Vector Machines (SVMs)
• Random Forests (RF)
• Prediction Analysis of Microarrays (PAM)
• Naïve Bayes
• Linear Discriminant Analysis (LDA)
• Decision Trees
• Classification and Regression Trees (CART)
• Boltzmann Learning
• Neural Networks
• K-Nearest Neighbours (KNN)
• Maximum Likelihood Estimation (MLE)
• Multiple Discriminant Analysis (MDA)
• Logistic Regression
• Multivariate Adaptive Regression Splines (MARS)
• Flexible Discriminant Analysis
• Gaussian Mixtures
The “No Free Lunch Theorem”
A major theoretical finding in the field of pattern-recognition from the 1990s:
No method of finding predictors is
generally better than others.
There is always data-set dependency.
Curse of Dimensionality
• Machine-learning for –omic studies is an extremely
active research topic
• One major challenge is the curse of dimensionality
– p >> n
– p = number of dimensions (i.e. number of genes)
– n = number of samples (i.e. number of patients)
– By chance alone, you will find good predictors… but which
ones will generalize to larger patient cohorts?
• Akin to multiple-testing
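The "by chance alone" point can be demonstrated with pure noise. In this sketch every value is simulated and the class labels are arbitrary, so no gene carries any real signal; the sizes (2000 genes, 20 patients) are hypothetical.

```r
set.seed(2024);

# p >> n: many genes, few patients
n <- 20;
p <- 2000;

# expression matrix of pure noise (genes in rows, patients in columns)
noise <- matrix(rnorm(p * n), nrow = p);

# arbitrary class labels that are unrelated to the data
labels <- factor(rep(c("responder", "non-responder"), each = n / 2));

# per-gene t-test between the two classes
pvalues <- apply(noise, 1, function(y) { t.test(y ~ labels)$p.value; });

# genes that look like "good predictors" purely by chance
n.hits <- sum(pvalues < 0.01);
print(n.hits);
```

Roughly 1% of the genes are expected to pass the threshold even though none of them would generalize to a new cohort, which is why independent validation matters.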
Outline
• Data Organization & Storage
• Two-Level Experimental Designs
• Continuous Variables
• Survival Analysis
• Meta-Analysis
• Machine Learning
Closing Thought
• Meta-Analysis
• Survival-Analysis
• Machine-Learning
The last three topics are highly complex. I strongly recommend you consider finding experts in them to collaborate with as needed.
We are on a Coffee Break &
Networking Session