Transcript Slide 1

Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Lecture 5
Multivariate Analyses II: General Models
MBP1010
Dr. Paul C. Boutros
Winter 2015
DEPARTMENT OF
MEDICAL BIOPHYSICS
Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
This workshop includes material
originally developed by Drs. Raphael Gottardo,
Sohrab Shah, Boris Steipe and others
Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Machine-Learning
• Lecture 7: Sequence Analysis
• Lecture 8: Microarray Analysis I: Pre-Processing
• Lecture 9: Microarray Analysis II: Multiple-Testing
• Final Exam (written)
Lecture 5: Multivariate Analyses II: General Cases
bioinformatics.ca
House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions
Topics For This Week
• Review to date
• Examples
• Assignments
• Attendance
• More on Multivariate Models
Review From Lecture #2
How can you interpret a QQ plot?
Compares two samples or a sample and a
distribution. Straight line indicates identity.
What is hypothesis testing?
Confirmatory data-analysis; test null hypothesis
What is a p-value?
Evidence against the null: the probability of seeing
a value at least as extreme by chance alone, if the
null is true. Its threshold sets the false-positive rate
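The QQ-plot answer in this review can be made concrete by building the plotted points by hand. The sketch below pairs each sorted, standardised observation with the matching quantile of a standard normal distribution. The sample values and the (i - 0.5)/n plotting positions are assumptions chosen for illustration, not material from the lecture.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal QQ plot."""
    ordered = sorted(sample)
    n = len(ordered)
    centre, spread = mean(ordered), stdev(ordered)
    std_normal = NormalDist()
    points = []
    for i, value in enumerate(ordered, start=1):
        # Plotting position (i - 0.5) / n is one common convention (an assumption here).
        theoretical = std_normal.inv_cdf((i - 0.5) / n)
        # Standardise the observation so both axes share a scale.
        points.append((theoretical, (value - centre) / spread))
    return points

# Hypothetical measurements, for illustration only.
sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.6, 5.5, 4.9]
qq_points = normal_qq_points(sample)
for theoretical, observed in qq_points:
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

If the sample really comes from a normal distribution, the printed pairs should fall close to the line observed = theoretical.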
Review From Lecture #2
Parametric vs. non-parametric tests
Parametric tests have distributional assumptions
What is the t-statistic?
Signal:Noise ratio
Assumptions of the t-test?
Data sampled from normal distribution;
independence of replicates; independence of
groups; homoscedasticity
Flow-Chart For Two-Sample Tests
Is Data Sampled From a Normally-Distributed Population?
  Yes → Equal Variance (F-Test)?
          Yes → Homoscedastic T-Test
          No  → Heteroscedastic T-Test
  No  → Sufficient n for CLT (>30)?
          Yes → Equal Variance (F-Test)? (as above)
          No  → Wilcoxon U-Test
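The flow chart for two-sample tests can be written as a small decision function. This is a sketch under one assumption: that the "Yes" branch of the CLT question rejoins the equal-variance check, which is how the chart appears to be laid out. The function name and its arguments are hypothetical.

```python
def choose_two_sample_test(normal_population, equal_variance, n_per_group):
    """Walk the two-sample flow chart and return the name of the suggested test."""
    # Non-normal data can still use a t-test when n is large enough for the CLT (>30).
    if not normal_population and n_per_group <= 30:
        return "Wilcoxon U-test"
    if equal_variance:
        return "homoscedastic t-test"
    return "heteroscedastic t-test"

print(choose_two_sample_test(normal_population=True, equal_variance=False, n_per_group=5))
print(choose_two_sample_test(normal_population=False, equal_variance=True, n_per_group=10))
```

In practice the two boolean inputs come from checks such as a QQ plot and an F-test, so the function only records the decision logic.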
Review From Lecture #3
What is statistical power?
Probability a test will correctly reject a false null
AKA sensitivity or 1 - false-negative rate
What is a correlation?
A relationship between two (random) variables
Common correlation metrics?
Pearson, Spearman, Kendall
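Statistical power is easiest to see in a case where it has a closed form. The sketch below uses a one-sided, one-sample z-test with known standard deviation. That test, the effect size of 0.5 and the sample sizes are all assumptions picked for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test with known sigma."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)   # rejection threshold under the null
    shift = effect * sqrt(n) / sigma           # how far the alternative moves the statistic
    return 1 - std_normal.cdf(critical - shift)

for n in (5, 10, 20, 40):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

With no true effect the function returns alpha, the false-positive rate. Power rises as n grows.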
Lecture #3 Review
• Hypergeometric test
  • Is a sample randomly selected from a fixed population?
• Proportion test
  • Are two proportions equivalent?
• Fisher’s Exact test
  • Are two binary classifications associated?
• (Pearson’s) Chi-Squared Test
  • Are paired observations on two variables independent?
Example #1
You are conducting a study of osteosarcomas using mouse models. You
are using a strain of mice that is naturally susceptible to these tumours at
a frequency of ~20%. You are studying two transgenic lines, one of which
has a deletion of a putative tumour suppressor (TS), the other of which
has an amplification of a putative oncogene (OG). Tumour penetrance in
these two lines is 100%. Your hypothesis: tumours in mice lacking TS will
be smaller than those in mice with amplification of OG, as assessed by
post-mortem volume measurements of the primary tumour. Your data:
TS (cm³)   OG (cm³)
3.9        5.2
7.1        1.9
3.1        5.0
4.4        6.1
5.0        4.5
           4.8
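One candidate analysis for Example #1 is a two-sample t-test, if its assumptions are judged reasonable for groups this small. The sketch below computes only the t-statistic (signal over noise) and the Welch degrees of freedom from the tabulated volumes. Using the heteroscedastic (Welch) form is an assumption made for illustration, and the p-value would still have to be read from a t-distribution.

```python
from math import sqrt
from statistics import mean, variance

ts = [3.9, 7.1, 3.1, 4.4, 5.0]        # tumour volumes, TS-deleted line (cm^3)
og = [5.2, 1.9, 5.0, 6.1, 4.5, 4.8]   # tumour volumes, OG-amplified line (cm^3)

def welch_t(a, b):
    """Return Welch's t-statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)   # signal (difference) over noise (standard error)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(ts, og)
print(f"mean TS = {mean(ts):.2f}, mean OG = {mean(og):.2f}")
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

A negative t would point in the direction of the stated hypothesis (TS smaller than OG). For these data the TS mean is slightly larger than the OG mean, so t is positive.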
Example #2
You are conducting a study of osteosarcomas using mouse models. You
are using a strain of mice that is naturally susceptible to these tumours at
a frequency of ~20%. You are studying two transgenic lines, one of which
has a deletion of a putative tumour suppressor (TS), the other of which
has an amplification of a putative oncogene (OG). Tumour penetrance in
these two lines is 100%. Your hypothesis: mice lacking TS will acquire
tumours sooner than mice with amplification of OG. You test the mice
weekly using ultrasound imaging. Your data:
TS (week of tumour)   OG (week of tumour)
4                     3
2                     6
5                     3
4                     2
4                     4
                      3
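Weekly screening gives coarse, heavily tied data, so a rank-based test such as the Wilcoxon U-test is one candidate for Example #2. The sketch below computes the U statistic and an exact one-sided p-value by enumerating every relabelling of the 11 animals. Taking the exact permutation route, and counting ties as one half, are assumptions made here because the groups are small.

```python
from itertools import combinations

ts = [4, 2, 5, 4, 4]       # week of first detected tumour, TS-deleted line
og = [3, 6, 3, 2, 4, 3]    # week of first detected tumour, OG-amplified line

def u_statistic(a, b):
    """Count pairs where a's value is earlier than b's; ties count one half."""
    return sum(1.0 if x < y else 0.5 if x == y else 0.0 for x in a for y in b)

def exact_one_sided_p(a, b):
    """P(U >= observed) over every way of relabelling the pooled animals."""
    pooled = a + b
    observed = u_statistic(a, b)
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        hits += u_statistic(group_a, group_b) >= observed
        total += 1
    return observed, hits / total

u_obs, p_value = exact_one_sided_p(ts, og)
print(f"U = {u_obs}, one-sided p = {p_value:.3f}")
```

A large U means TS animals tend to develop tumours earlier, which is the direction of the hypothesis. Under the null, U is expected to be half of the 30 possible pairs.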
Example #3
You are conducting a study of osteosarcomas using mouse models. You
are using a strain of mice that is naturally susceptible to these tumours at
a frequency of ~20%. You are studying two transgenic lines, one of which
has a deletion of a putative tumour suppressor (TS), the other of which
has an amplification of a putative oncogene (OG). Tumour penetrance in
these two lines is 100%. Your hypothesis: mice lacking TS are less likely
to respond to a novel targeted therapeutic (DX) than those with
amplification of OG as assessed by a trained pathologist:
TS (pathological response)   OG (pathological response)
Yes                          Yes
No                           Yes
Yes                          Yes
Yes                          Yes
No                           No
                             Yes
Example #4
You are conducting a study of osteosarcomas using mouse models. You
are using a strain of mice that is naturally susceptible to these tumours at
a frequency of ~20%. You are studying two transgenic lines, one of which
has a deletion of a putative tumour suppressor (TS), the other of which
has an amplification of a putative oncogene (OG). Based on your
previous data, you now hypothesize that mice lacking TS will show a
similar molecular response to DX as those with amplification of OG. You
use microarrays to study 20,000 genes in each line, and identify the
following genes as changed between drug-treated and vehicle-treated:
TS (DX-responsive genes): MYC, KRAS, CD53, CDH1, FBW1, SEPT7, MUC1, MUC3, MUC9, RNF3
OG (DX-responsive genes): MYC, KRAS, CD53, CDH1, MUC1, MARCH1, PTEN, IDH3, ESR2, RHEB, CTCF, STK11, MLL3, KEAP1, NFE2L2, ARID1A
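One way to ask whether the two gene lists in Example #4 overlap more than chance would predict is the hypergeometric test from the Lecture #3 review. The sketch below treats the OG list as a fixed set of "marked" genes and the TS list as a random draw from the 20,000 genes on the array. That every gene is equally likely to be drawn is an assumption of this sketch.

```python
from math import comb

ts_genes = {"MYC", "KRAS", "CD53", "CDH1", "FBW1", "SEPT7", "MUC1", "MUC3", "MUC9", "RNF3"}
og_genes = {"MYC", "KRAS", "CD53", "CDH1", "MUC1", "MARCH1", "PTEN", "IDH3", "ESR2",
            "RHEB", "CTCF", "STK11", "MLL3", "KEAP1", "NFE2L2", "ARID1A"}
n_genes = 20000   # genes measured on the array

shared = ts_genes & og_genes

def hypergeom_upper_tail(k, population, successes, draws):
    """P(overlap >= k) when `draws` genes are picked at random from `population`."""
    return sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, min(successes, draws) + 1)
    ) / comb(population, draws)

p_value = hypergeom_upper_tail(len(shared), n_genes, len(og_genes), len(ts_genes))
print(sorted(shared), f"p = {p_value:.2e}")
```

Five genes are shared (CD53, CDH1, KRAS, MUC1, MYC). An overlap of that size from lists of 10 and 16 genes is very unlikely under random selection.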
Review From Lecture #4
Assumptions of linear-modeling
One variable is a response and one a predictor
No adjustment is needed for confounding or
other between-subject variation
Linearity
σ² is constant, independent of x
Predictors are independent of each other
For proper statistical inference (CI, p-values),
errors are normally distributed
Review From Lecture #4
How do we assess the adequacy of a model?
By considering the size of the residuals (R²)
How can we test the quality of a model?
Residual plots; qq plots; prediction accuracy
Compare a one-way ANOVA to a logistic regression
Linear model where x is factorial vs. one
where y is factorial
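The link between residuals and R² can be shown with a one-predictor least-squares fit written out by hand. This is a minimal sketch on hypothetical data. The function name is illustrative, and a real analysis would normally use a fitting routine such as R's lm().

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a0 + a1*x; returns (a0, a1, r_squared)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    residuals = [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)          # variation left unexplained
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)      # total variation in y
    return a0, a1, 1 - ss_res / ss_tot

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]
a0, a1, r2 = fit_line(x, y)
print(f"a0 = {a0:.3f}, a1 = {a1:.3f}, R^2 = {r2:.3f}")
```

Small residuals relative to the total variation in y give an R² close to 1. Plotting the residuals against x is the complementary check on model quality.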
Lots of Analyses Are Linear Regressions
Model            Condition        Analysis
Y = a0 + a1x1    x1 continuous    Linear Regression
Y = a0 + a1x1    Y factorial      Logistic Regression
Y = a0 + a1x1    x1 factorial     1-way ANOVA
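The claim that a comparison between groups is a regression with a factorial x1 can be checked numerically. The sketch below codes the two mouse lines from Example #1 as a 0/1 indicator and fits Y = a0 + a1x1 by least squares. Coding OG as 0 and TS as 1 is an arbitrary choice made for illustration.

```python
from statistics import mean

# Tumour volumes from Example #1, coded with an indicator: OG = 0, TS = 1.
og = [5.2, 1.9, 5.0, 6.1, 4.5, 4.8]
ts = [3.9, 7.1, 3.1, 4.4, 5.0]
x = [0] * len(og) + [1] * len(ts)
y = og + ts

# Least-squares estimates for Y = a0 + a1*x1.
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
a1 = sxy / sxx
a0 = y_bar - a1 * x_bar

print(f"a0 = {a0:.3f} (mean of OG = {mean(og):.3f})")
print(f"a1 = {a1:.3f} (mean TS - mean OG = {mean(ts) - mean(og):.3f})")
```

The intercept equals the mean of the reference group and the slope equals the difference in group means, so testing a1 = 0 asks the same question as the two-group comparison.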
Quick Thoughts on Assignment Code
Tip #1: avoid reserved words (e.g. data)
Tip #2: take advantage of file-handling arguments (shorter code)
Tip #3: consistent indentation (readability)
Attendance Break
When Do We Use Statistics?
• Ubiquitous in modern biology
• Every class I will show a use of statistics in a (very, very)
recent Nature paper.
Advance Online Publication
Cervix Cancer 101
• Disease burden increasing
  • (~380k to ~450k in the last 30 years)
• By age 50, >80% of women have HPV infection
• >75% of sexually active women exposed, only a subset affected
  • Why is almost entirely unknown!
• Tightly Associated with Poverty
HPV Infection Is Associated With Multiple Cancers
Cancer site    HPV-associated cases
Cervix         >99%
Anal           ~85%
Vaginal        ~70%
Vulvar         ~40%
Penile         ~45%
Head & Neck    ~20-30%
Of course not all of these are the HPV subtypes
caught by current vaccines, but a majority are.
Thus many cancers are preventable.
Figure 1 is a Classic Sequencing Figure
Mutation rate vs. histology
But Histology Is Associated With Age
Age Is Associated With Mutation Rate
R² = 0.08; p = 0.005 → Is this meaningful?
4.2/Mbp
1.6/Mbp
P(Wilcoxon) = 0.0095
Perhaps Not in Isolation But...
The Solution: Linear Regression
Mutation Rate = a0 + x1a1 + x2a2
x1 = histology indicator (adeno = 1; squam = 0)
x2 = age in years (continuous)
Mutation Rate = 0.259 - 0.145x1 + 0.006x2
p = 0.045 for a1 (H0: a1 = 0)
p = 0.012 for a2 (H0: a2 = 0)
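The fitted model can be read by evaluating it for two patients who differ only in histology. The sketch below uses the coefficients reported on the slide. The patient ages are hypothetical, and the units or scale of the mutation-rate response are not stated on the slide, so the outputs are only meaningful relative to each other.

```python
# Coefficients as reported on the slide.
A0, A1, A2 = 0.259, -0.145, 0.006

def predicted_mutation_rate(is_adeno, age_years):
    """Evaluate Mutation Rate = a0 + a1*x1 + a2*x2 for one patient."""
    x1 = 1 if is_adeno else 0
    return A0 + A1 * x1 + A2 * age_years

# Hypothetical patients: same age, different histology.
squamous_50 = predicted_mutation_rate(False, 50)
adeno_50 = predicted_mutation_rate(True, 50)
print(f"squamous, age 50: {squamous_50:.3f}")
print(f"adeno, age 50:    {adeno_50:.3f}")
print(f"difference at fixed age: {adeno_50 - squamous_50:.3f}")
```

At a fixed age the two predictions differ by exactly a1, which is what "adjusting for age" means in this model.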
General Linear Modeling
• The underlying mathematical framework for most
statistical techniques we are familiar with:
  • ANOVAs
  • Logistic regression
  • Linear regression
  • Multiple regression
Y = a0 + a1x1 + a2x2 + …
NOT the same as a “Generalized Linear Model”!!!
General Linear Modeling: Special Cases
Model                   Condition            Analysis
Y = a0 + a1x1           x1 continuous        Linear Regression
Y = a0 + a1x1           Y factorial          Logistic Regression
Y = a0 + a1x1 + a2x2    x1, x2 continuous    Multiple Regression
ANOVAs
Model                            Condition                   Analysis
Y = a0 + a1x1                    x1 factorial                1-way ANOVA
Y = a0 + a1x1 + a2x2 + a3x1x2    x1, x2 two-level factors    2-way ANOVA
ANOVA Experimental Designs Are Common
• Classic one-way ANOVAs:
  • Treat a cell-line with 5 drugs – do any of them make a difference?
  • Make 5 different genetic mutations – do any of them alter gene-expression?
• H0: all group means are equal (H1: the mean of at least one group differs)
• Guesses at the assumptions?
Assumptions Are Similar to T-test
• Normal distribution for the dependent variable
• Samples are independent
• Homoscedasticity
• Independent variables are:
  • Not correlated
  • Random normal variables
But This Is Limited
• A 1-way ANOVA just says that at least one group differs
  • Which one? → post hoc tests
  • Often hard to know which post hoc test to use; often worth consulting a statistician here
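The reason a 1-way ANOVA cannot say which group differs is that it reduces all groups to a single F-statistic. The sketch below computes that statistic by hand for three hypothetical treatment groups. The data and the function name are illustrative, and the p-value would come from an F-distribution with the printed degrees of freedom.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F-statistic and its degrees of freedom."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    # Between-group variation: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group variation: scatter of observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical expression values for three treatment groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df_b, df_w = one_way_f(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F says the group means are more spread out than within-group noise would explain, but it carries no information about which pair of groups is responsible. That is the job of the post hoc tests.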
Sometimes 1-Way ANOVAs are not worth the Effort
Groups: Mutation 1, Wildtype, Mutation 2
1-way ANOVA + post hoc, or 2 t-tests?
Not Always Testing Raw Data
Vehicle 1 vs. Drug 1
Vehicle 2 vs. Drug 2
Vehicle 3 vs. Drug 3
3 drugs with different controls → 1-way ANOVA on the fold-changes
Two-Way ANOVAs
• Probably even more common than one-way ANOVAs
• Very powerful:
  • Synergy?
  • Additivity?
  • Antagonism?
Y = a0 + a1x1 + a2x2 + a3x1x2
Assumptions?
Assumptions Are Similar to 1-Way ANOVA
• Normal distribution for the dependent variable
• Samples are independent
• Homoscedasticity
• Independent variables are:
  • Not correlated
  • Random normal variables
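The interaction term a3 in Y = a0 + a1x1 + a2x2 + a3x1x2 is what separates synergy, additivity and antagonism. For two-level factors coded 0/1, the four coefficients can be read directly off the four cell means. The sketch below uses hypothetical responses to show this. It estimates the coefficients only and does not test whether a3 differs from zero.

```python
from statistics import mean

# Hypothetical responses for a 2x2 design, keyed by (x1, x2); 1 = treatment given.
cells = {
    (0, 0): [1.0, 1.2, 0.8],
    (1, 0): [2.1, 1.9, 2.0],
    (0, 1): [3.0, 3.1, 2.9],
    (1, 1): [6.2, 5.8, 6.0],
}
m = {key: mean(values) for key, values in cells.items()}

# With 0/1 coding, the saturated model reproduces the four cell means exactly.
a0 = m[0, 0]
a1 = m[1, 0] - m[0, 0]                        # effect of treatment 1 alone
a2 = m[0, 1] - m[0, 0]                        # effect of treatment 2 alone
a3 = m[1, 1] - m[1, 0] - m[0, 1] + m[0, 0]    # departure from additivity

print(f"a0={a0:.2f} a1={a1:.2f} a2={a2:.2f} a3={a3:.2f}")
```

When both single-treatment effects are positive, a3 > 0 suggests synergy, a3 near 0 suggests additivity, and a3 < 0 suggests antagonism. The two-way ANOVA assesses whether a3 is distinguishable from zero.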
Do these treatments interact?
• Standard approach: ANOVA
[Figure: Treatment #1, Treatment #2 and their Interaction]
Example: Radiation Toxicity
• Some people are prone to late-stage radio-toxicity
• Does radiation induce specific patterns of gene-expression in
these people?
[Figure: samples grouped by radiation dose (0 Gy or 3 Gy) and by radio-sensitivity]
Two-Way ANOVAs in R
• Standard model-fitting uses the lm() function
• For microarray and -omic analyses, the limma package
is one very good approach for this (covered over the next
few weeks)
Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Machine-Learning
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Sequence Analysis
• Final Exam (written)