Transcript Slide 1
Canadian Bioinformatics Workshops
www.bioinformatics.ca

Lecture 5
Multivariate Analyses II: General Models
MBP1010 †
Dr. Paul C. Boutros
Winter 2015
DEPARTMENT OF MEDICAL BIOPHYSICS
† Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others

Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Machine-Learning
• Lecture 7: Sequence Analysis
• Lecture 8: Microarray Analysis I: Pre-Processing
• Lecture 9: Microarray Analysis II: Multiple-Testing
• Final Exam (written)

Lecture 5: Multivariate Analyses II: General Cases bioinformatics.ca

House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions

Topics For This Week
• Review to date
• Examples
• Assignments
• Attendance
• More on Multivariate Models

Review From Lecture #2
How can you interpret a QQ plot?
  Compares two samples, or a sample and a distribution. A straight line indicates identity.
What is hypothesis testing?
  Confirmatory data analysis; testing a null hypothesis.
What is a p-value?
  Evidence against the null: the probability of seeing a value at least as extreme by chance alone if the null is true (equivalently, the false-positive rate if you reject at that threshold).

Review From Lecture #2
Parametric vs. non-parametric tests
  Parametric tests have distributional assumptions.
What is the t-statistic?
  Signal:Noise ratio
Assumptions of the t-test?
  Data sampled from normal distribution; independence of replicates; independence of groups; homoscedasticity

Flow-Chart For Two-Sample Tests
Is data sampled from a normally-distributed population?
  Yes: go to the equal-variance check
  No: sufficient n for CLT (>30)?
    Yes: go to the equal-variance check
    No: Wilcoxon U-test
Equal variance (F-test)?
  Yes: homoscedastic t-test
  No: heteroscedastic t-test

Review From Lecture #3
What is statistical power?
  Probability that a test will correctly reject the null when it is false. AKA sensitivity, or 1 - false-negative rate.
What is a correlation?
  A relationship between two (random) variables
Common correlation metrics?
  Pearson, Spearman, Kendall

Lecture #3 Review
• Hypergeometric test
  • Is a sample randomly selected from a fixed population?
• Proportion test
  • Are two proportions equivalent?
• Fisher’s Exact test
  • Are two binary classifications associated?
• (Pearson’s) Chi-Squared Test
  • Are paired observations on two variables independent?

Example #1
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%.
Your hypothesis: tumours in mice lacking TS will be smaller than those in mice with amplification of OG, as assessed by post-mortem volume measurements of the primary tumour.
Your data:
TS (cm³): 3.9 7.1 3.1 4.4 5.0
OG (cm³): 5.2 1.9 5.0 6.1 4.5 4.8

Example #2
You are conducting a study of osteosarcomas using mouse models.
You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%.
Your hypothesis: mice lacking TS will acquire tumours sooner than mice with amplification of OG. You test the mice weekly using ultrasound imaging.
Your data:
TS (week of tumour): 4 2 5 4 4
OG (week of tumour): 3 6 3 2 4 3

Example #3
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%.
Your hypothesis: mice lacking TS are less likely to respond to a novel targeted therapeutic (DX) than those with amplification of OG, as assessed by a trained pathologist:
TS (pathological response): Yes No Yes Yes No
OG (pathological response): Yes Yes Yes Yes No Yes

Example #4
You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG).
Based on your previous data, you now hypothesize that mice lacking TS will show a similar molecular response to DX as those with amplification of OG.
You use microarrays to study 20,000 genes in each line, and identify the following genes as changed between drug-treated and vehicle-treated:
TS (DX-responsive genes): MYC KRAS CD53 CDH1 FBW1 SEPT7 MUC1 MUC3 MUC9 RNF3
OG (DX-responsive genes): MYC KRAS CD53 CDH1 MUC1 MARCH1 PTEN IDH3 ESR2 RHEB CTCF STK11 MLL3 KEAP1 NFE2L2 ARID1A

Review From Lecture #4
Assumptions of linear-modeling
• One variable is a response and one a predictor
• No adjustment is needed for confounding or other between-subject variation
• Linearity
• σ² is constant, independent of x
• Predictors are independent of each other
• For proper statistical inference (CI, p-values), errors are normally distributed

Review From Lecture #4
How do we assess the adequacy of a model?
  By considering the size of the residuals (R²)
How can we test the quality of a model?
  Residual plots; QQ plots; prediction accuracy
Compare a one-way ANOVA to a logistic regression
  Linear model where x is factorial vs. one where y is factorial

Lots of Analyses Are Linear Regressions
Y = a0 + a1x1, x1 continuous: Linear Regression
Y = a0 + a1x1, Y factorial: Logistic Regression
Y = a0 + a1x1, x1 factorial: 1-way ANOVA

Quick Thoughts on Assignment Code
Tip #1: avoid reserved words (e.g. data)
Tip #2: take advantage of file-handling arguments
Tip #3: consistent indentation
Shorter code readability

Attendance Break

When Do We Use Statistics?
• Ubiquitous in modern biology
• Every class I will show a use of statistics in a (very, very) recent Nature paper.
Advance Online Publication

Cervix Cancer 101
• Disease burden increasing
  • (~380k to ~450k in the last 30 years)
• By age 50, >80% of women have HPV infection
• >75% of sexually active women exposed, only a subset affected
  • Why is nearly totally unknown!
• Tightly associated with poverty

HPV Infection Is Associated With Multiple Cancers
• Cervix: >99%
• Anal: ~85%
• Vaginal: ~70%
• Vulvar: ~40%
• Penile: ~45%
• Head & Neck: ~20-30%
Of course not all of these are the HPV subtypes caught by current vaccines, but a majority are. Thus many cancers are preventable.

Figure 1 is a Classic Sequencing Figure
Mutation rate vs. histology

But Histology Is Associated With Age

Age Is Associated With Mutation Rate
R² = 0.08; p = 0.005
Is this meaningful?
4.2/Mbp 1.6/Mbp
P(Wilcoxon) = 0.0095

Perhaps Not in Isolation But...

The Solution: Linear Regression
Mutation Rate = a0 + x1a1 + x2a2
x1 = histology indicator (adeno = 1; squam = 0)
x2 = age in years (continuous)
Mutation Rate = 0.259 - 0.145x1 + 0.006x2
P(a1 ≠ 0) = 0.045
P(a2 ≠ 0) = 0.012

General Linear Modeling
• The underlying mathematical framework for most statistical techniques we are familiar with:
  • ANOVAs
  • Logistic regression
  • Linear regression
  • Multiple regression
Y = a0 + a1x1 + a2x2 + …
NOT the same as a “Generalized Linear Model”!!!
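A minimal sketch of how a model of the form Y = a0 + a1x1 + a2x2 can be fit by ordinary least squares, with one indicator predictor and one continuous predictor as in the mutation-rate slide. This is an illustration only, not the paper's analysis: the eight samples are hypothetical, generated without noise from the coefficients on the slide, and the helper names solve and fit_linear_model are made up for this example. In the course itself this would be done with R's lm().

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Build the augmented matrix [A | b]
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_linear_model(X, y):
    """Least-squares coefficients from the normal equations (X'X) a = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


# Hypothetical samples (NOT data from the paper)
histology = [1, 0, 1, 0, 1, 0, 0, 1]     # x1: adeno = 1, squam = 0
age = [35, 42, 50, 58, 61, 47, 66, 39]   # x2: age in years

# Noise-free response built from the coefficients shown on the slide
rate = [0.259 - 0.145 * h + 0.006 * a for h, a in zip(histology, age)]

# Design matrix: a column of 1s for the intercept a0, then x1 and x2
X = [[1.0, h, a] for h, a in zip(histology, age)]
coef = fit_linear_model(X, rate)
print("a0, a1, a2 =", [round(c, 3) for c in coef])
```

Because the hypothetical response contains no noise, the fit should recover the slide's coefficients up to floating-point error. With real data the estimates carry uncertainty, and the P(a1 ≠ 0) and P(a2 ≠ 0) values on the slide come from tests on the coefficients that this sketch does not compute.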
General Linear Modeling: Special Cases
Y = a0 + a1x1, x1 continuous: Linear Regression
Y = a0 + a1x1, Y factorial: Logistic Regression
Y = a0 + a1x1 + a2x2, x1, x2 continuous: Multiple Regression

ANOVAs
Y = a0 + a1x1, x1 factorial: 1-way ANOVA
Y = a0 + a1x1 + a2x2 + a3x1x2, x1, x2 two-level factors: 2-way ANOVA

ANOVA Experimental Designs Are Common
• Classic one-way ANOVAs:
  • Treat a cell-line with 5 drugs – do any of them make a difference?
  • Make 5 different genetic mutations – do any of them alter gene expression?
• H0: all group means are equal (alternative: the mean of at least one group differs)
• Guesses at the assumptions?

Assumptions Are Similar to T-test
• Normal distribution for the dependent variable
• Samples are independent
• Homoscedasticity
• Independent variables are:
  • Not correlated
  • Random normal variables

But This Is Limited
• A 1-way ANOVA just says that at least one group differs
• Which one? Use post hoc tests
• Often hard to know which post hoc test to use; often worth consulting a statistician here

Sometimes 1-Way ANOVAs Are Not Worth the Effort
Groups: Mutation 1, Wildtype, Mutation 2
1-way ANOVA + post hoc, or 2 t-tests?

Not Always Testing Raw Data
Groups: Vehicle 1, Drug 1; Vehicle 2, Drug 2; Vehicle 3, Drug 3
3 drugs with different controls: 1-way ANOVA on the fold-changes

Two-Way ANOVAs
• Probably even more common than one-way ANOVAs
• Very powerful:
  • Synergy?
  • Additivity?
  • Antagonism?
Y = a0 + a1x1 + a2x2 + a3x1x2
Assumptions?
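To make the interaction term concrete, here is a small sketch of the model Y = a0 + a1x1 + a2x2 + a3x1x2 for two two-level factors coded 0/1. With that coding and a full 2-by-2 design, the least-squares coefficients are simple differences of the four cell means, so no matrix algebra is needed. The measurements are hypothetical and the function name interaction_coefficients is invented for this example. It returns point estimates only; the ANOVA F-tests and p-values are not computed here.

```python
from statistics import mean


def interaction_coefficients(cells):
    """Coefficients of Y = a0 + a1*x1 + a2*x2 + a3*x1*x2 for 0/1-coded factors.

    `cells` maps (x1, x2) to the list of replicate measurements in that cell.
    """
    m = {key: mean(values) for key, values in cells.items()}
    a0 = m[(0, 0)]                                       # baseline: neither treatment
    a1 = m[(1, 0)] - m[(0, 0)]                           # effect of treatment 1 alone
    a2 = m[(0, 1)] - m[(0, 0)]                           # effect of treatment 2 alone
    a3 = m[(1, 1)] - m[(1, 0)] - m[(0, 1)] + m[(0, 0)]   # departure from additivity
    return a0, a1, a2, a3


# Hypothetical replicate measurements for each treatment combination
cells = {
    (0, 0): [1.0, 1.2, 0.8],   # no treatment
    (1, 0): [2.1, 1.9, 2.0],   # treatment 1 only
    (0, 1): [3.0, 3.2, 2.8],   # treatment 2 only
    (1, 1): [6.9, 7.1, 7.0],   # both treatments
}

a0, a1, a2, a3 = interaction_coefficients(cells)
print("a0 =", round(a0, 2), "a1 =", round(a1, 2),
      "a2 =", round(a2, 2), "a3 =", round(a3, 2))
```

When both single-treatment effects are positive, as in this made-up example, a3 near zero corresponds to additivity, a3 above zero to synergy, and a3 below zero to antagonism. Whether an observed a3 is distinguishable from zero is the question the two-way ANOVA answers.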
Assumptions Are Similar to 1-Way ANOVA
• Normal distribution for the dependent variable
• Samples are independent
• Homoscedasticity
• Independent variables are:
  • Not correlated
  • Random normal variables

Do these treatments interact?
• Standard approach: ANOVA
[Figure: axes Treatment #1 and Treatment #2; label "Interaction"]

Example: Radiation Toxicity
• Some people are prone to late-stage radio-toxicity
• Does radiation induce specific patterns of gene-expression in these people?
[Figure: study design; Radiation: 3 Gy vs. 0 Gy (shown twice); group label: Radio-Sensitive]

Two-Way ANOVAs in R
• Standard model-fitting uses the lm() function
• For microarray and -omic analyses, the limma package is one very good approach for this (covered over the next few weeks)

Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Machine-Learning
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Sequence Analysis
• Final Exam (written)