Transcript Slide 1
Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Lecture 2
Univariate Analyses: Continuous Data
MBP1010H †
Dr. Paul C. Boutros
DEPARTMENT OF
MEDICAL BIOPHYSICS
†
Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
This workshop includes material
originally developed by Drs. Raphael Gottardo,
Sohrab Shah, Boris Steipe and others
Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Microarray Analysis I: Pre-Processing
• Lecture 7: Microarray Analysis II: Multiple-Testing
• Lecture 8: Data Visualization & Machine-Learning
• Lecture 9: Sequence Analysis Basics
• Final Exam (written)
Lecture 2: Univariate Analyses I: Continuous Data
bioinformatics.ca
How Will You Be Graded?
• 9% Participation: 1% per week
• 56% Assignments: 8 x 7% each
• 35% Final Examination: in-class
• For most assignments each individual will get their own,
unique assignment
• Assignments will all be in R, and will be graded largely
according to computational correctness only (i.e. does
your R script yield the correct result when run)
• Final Exam will include multiple-choice and written
answers
House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions
• Pay attention – I will randomly call on people during the
course of each lecture
• State your name when asking/answering Qs please!
• Others?
Review From Last Week
Population vs. Sample
All MBP Students = Population
MBP Students in 1010 = Sample
How do you report statistical information?
P-value, variance, effect-size, sample-size, test
Why don’t we use Excel/spreadsheets?
Spreadsheet errors, reproducibility, wrong results
Topics For This Week
• Introduction to continuous data & probability distributions
• Slightly boring, but necessary!
• Attendance
• Common continuous univariate analyses
• Correlations
• ceRNAs
Continuous vs. Discrete Data
• Definitions?
• Examples of discrete data in biological studies?
• Why does it matter in the first place?
• Areas of discrete mathematics:
• Combinatorics
• Graph Theory
• Discrete Probability Theory (Dice, Cards)
• Number Theory
Exploring Data
• When teaching (or learning new procedures) we usually prefer to work with synthetic data.
• Synthetic data has the advantage that we know what the outcome of the analysis should be.
• Typically one would create values according to a function and then add noise.
• R has several functions to create sequences of values – or you can write your own ...
0:10;
seq(0, pi, 5*pi/180);
rep(1:3, each=3, times=2);
for (i in 1:10) { print(i*i); }
Synthetic Data
Function ...
Noise ...
Noisy Function ...
Explore functions and noise.
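A minimal sketch of the function / noise / noisy-function idea. The sine function, the noise level (sd = 0.2), the seed, and the variable names are illustrative assumptions, not taken from the slide.

```r
set.seed(1)                                    # arbitrary seed, for reproducibility
x <- seq(0, 2 * pi, length.out = 50)           # input values
f <- sin(x)                                    # the "function": the known, true signal
noise <- rnorm(length(x), mean = 0, sd = 0.2)  # the "noise": random normal deviations
y <- f + noise                                 # the "noisy function": synthetic data
plot(x, y, xlab = "x", ylab = "y")             # synthetic observations
lines(x, f, col = "red")                       # true function, for comparison
```

Because the true function is known, any analysis of y can be checked against f. Try other functions and other values of sd.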
Probability Distributions
Normal distribution N(μ, σ²)
μ is the mean and σ² is the
variance.
Extremely important because of
the Central Limit Theorem: if a
random variable is the sum of a
large number of small, independent
random variables, it will be
approximately normally distributed.
The area under the curve between 0 and 2 is the probability of observing a value between 0 and 2.
x <- seq(-4, 4, 0.1)
f <- dnorm(x, mean=0, sd=1)
plot(x, f, xlab="x", ylab="density", lwd=5, type="l")
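As a sketch of the "area under the curve" statement, the probability of a value between 0 and 2 under N(0,1) can be computed in two ways: from the cumulative distribution function pnorm(), and by numerically integrating the density dnorm(). The variable names are illustrative.

```r
# P(0 <= X <= 2) for X ~ N(0, 1), from the cumulative distribution function
p_cdf <- pnorm(2, mean = 0, sd = 1) - pnorm(0, mean = 0, sd = 1)
# The same area, obtained by numerically integrating the density
p_int <- integrate(dnorm, lower = 0, upper = 2, mean = 0, sd = 1)$value
print(c(cdf = p_cdf, integral = p_int))
```

Both values are about 0.477: roughly 48% of the probability lies between the mean and two standard deviations above it.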
Task: Explore line parameters.
Probability Distributions
Random sampling:
Generate 100 observations
from a N(0,1)
Histograms can be used
to estimate densities!
set.seed(100)
x <- rnorm(100, mean=0, sd=1)
hist(x)
# scale the density to counts: 100 observations x bin width (0.5 here) = 50
lines(seq(-3,3,0.1),50*dnorm(seq(-3,3,0.1)), col="red")
Quantiles
(Theoretical) Quantiles:
The p-quantile has the
property that there is a
probability p of getting a
value less than or equal to it.
The 50% quantile is called the median.
90% of the probability (area under the curve) is to the left of the red vertical line.
q90 <- qnorm(0.90, mean = 0, sd = 1);
x <- seq(-4, 4, 0.1);
f <- dnorm(x, mean=0, sd=1);
plot(x, f, xlab="x", ylab="density", type="l", lwd=5);
abline(v=q90, col=2, lwd=5);
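A short sketch of how the quantile function and the cumulative distribution function relate: qnorm() maps a probability to a value, and pnorm() maps it back. The variable names are illustrative.

```r
q90 <- qnorm(0.90, mean = 0, sd = 1)   # theoretical 90% quantile of N(0, 1)
p   <- pnorm(q90, mean = 0, sd = 1)    # probability of a value <= q90
print(c(q90 = q90, p = p))
median_theoretical <- qnorm(0.50)      # the 50% quantile (median) of N(0, 1)
print(median_theoretical)
```

The round trip returns 0.90, and the median of the standard normal is 0.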
Descriptive Statistics
Empirical Quantiles:
The p-quantile has the property that a fraction p of the observations
(100·p%) are less than or equal to it.
Empirical quantiles can be easily obtained in R.
> set.seed(100);
> x <- rnorm(100, mean=0, sd=1);
> quantile(x);
        0%        25%        50%        75%       100%
-2.2719255 -0.6088466 -0.0594199  0.6558911  2.5819589
> quantile(x, probs=c(0.1, 0.2, 0.9));
       10%        20%        90%
-1.1744996 -0.8267067  1.3834892
Descriptive Statistics
We often need to quickly 'quantify' a data set, and this can be
done using a set of summary statistics (mean, median,
variance, standard deviation).
> mean(x);
[1] 0.002912563
> median(x);
[1] -0.0594199
> IQR(x);
[1] 1.264738
> var(x);
[1] 1.04185
> summary(x);
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-2.272000 -0.608800 -0.059420  0.002913  0.655900  2.582000
Exercise: what are the units of variance and standard deviation?
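The summary statistics can also be reproduced from their definitions. This is a minimal sketch that regenerates the same sample (same seed as the slides); the names var_manual, sd_manual and iqr_manual are illustrative.

```r
set.seed(100)
x <- rnorm(100, mean = 0, sd = 1)
n <- length(x)
var_manual <- sum((x - mean(x))^2) / (n - 1)            # sample variance (n - 1 denominator)
sd_manual  <- sqrt(var_manual)                          # standard deviation = sqrt(variance)
iqr_manual <- unname(diff(quantile(x, c(0.25, 0.75))))  # 75% quantile - 25% quantile
print(c(var = var_manual, sd = sd_manual, IQR = iqr_manual))
```

The values should agree with var(x), sd(x) and IQR(x).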
Boxplot
Descriptive statistics can be intuitively summarized in a boxplot.
> boxplot(x)
[Figure: boxplot annotated with the median, the 25% and 75% quantiles, the IQR, and whiskers extending up to 1.5 x IQR beyond the box]
Points more than 1.5 x IQR above the 75% quantile or below the 25% quantile are considered "outliers".
IQR = interquartile range = 75% quantile – 25% quantile
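A sketch of the 1.5 x IQR rule applied by hand, assuming the same sample as in the earlier slides. The fence variables are illustrative names. Note that boxplot() itself uses "hinges" rather than quantile(), so its fences can differ slightly from these; boxplot.stats(x)$out lists the points it actually flags.

```r
set.seed(100)
x <- rnorm(100, mean = 0, sd = 1)
q <- unname(quantile(x, c(0.25, 0.75)))   # 25% and 75% quantiles
iqr <- q[2] - q[1]                        # interquartile range
lower_fence <- q[1] - 1.5 * iqr           # below this: flagged as outlier
upper_fence <- q[2] + 1.5 * iqr           # above this: flagged as outlier
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
print(boxplot.stats(x)$out)               # points flagged by boxplot()
```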
Violinplot
Internal structure of a
data-vector can be
made visible in a violin
plot. The principle is
the same as for a
boxplot, but a width is
calculated from a
smoothed histogram.
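library(ggplot2)
X <- data.frame(x = x)  # ggplot() needs a data frame; assumes x is the sample from the earlier slides
p <- ggplot(X, aes(x = factor(1), y = x))
p + geom_violin()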
plotting data in R
Task:
Explore types of plots.
QQ–plot
One of the first things we may ask about data is whether it
deviates from an expectation e.g. to be normally distributed.
The quantile-quantile plot provides a way to visually verify this.
The QQ-plot shows the theoretical quantiles versus the
empirical quantiles. If the distribution assumed (theoretical one)
is indeed the correct one, we should observe a straight line.
R provides qqnorm() and qqplot().
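To make the construction concrete, here is a sketch that builds a normal QQ-plot by hand: the sorted sample gives the empirical quantiles, and qnorm() evaluated at evenly spread probabilities (from ppoints()) gives the matching theoretical quantiles. The sample and variable names are illustrative.

```r
set.seed(100)
x <- rnorm(100, mean = 0, sd = 1)
n <- length(x)
empirical   <- sort(x)              # empirical quantiles: the ordered sample
theoretical <- qnorm(ppoints(n))    # matching quantiles of N(0, 1)
plot(theoretical, empirical,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(0, 1, col = 2)               # identity line: expected if x ~ N(0, 1)
```

This should produce the same points as qqnorm(x).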
QQ–plot: sample vs. Normal
Only valid for the
normal distribution!
qqnorm(x)
qqline(x, col=2)
QQ–plot: sample vs. Normal
set.seed(100)
t <- rt(100, df=2)
qqnorm(t)
qqline(t, col=2)
Clearly the t distribution with two
degrees of freedom is not Normal.
QQ–plot
Verify the CLT.
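set.seed(101)
# Return n variates, each the sum of Nvar small uniform random numbers
generateVariates <- function(n) {
  Nvar <- 10000
  Vout <- numeric(n)  # preallocate instead of growing the vector with c()
  for (i in 1:n) {
    x <- runif(Nvar, -0.01, 0.01)
    Vout[i] <- sum(x)
  }
  return(Vout)
}
x <- generateVariates(1000)
qqnorm(x)
qqline(x, col=2)  # qqline() takes a single sample; its second argument is not a second sample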
QQ–plot: sample vs. sample
Comparing two samples: are their
distributions the same?
... or ...
compare a sample vs. a synthetic
dataset.
set.seed(100)
x <- rt(100, df=2)
y <- rnorm(100, mean=0, sd=1)
qqplot(x, y)
Exercise: try different values of df for rt() and compare the vectors.
Boxplots
The boxplot function can
be used to display
several variables at a
time.
boxplot(gvhdCD3p)
Exercise: Interpret this plot.
Attendance Break
Hypothesis Testing
Hypothesis testing is confirmatory data analysis, in
contrast to exploratory data analysis.
Concepts:
Null – and Alternative Hypothesis
Region of acceptance / rejection and critical value
Error types
p - value
Significance level
Power of a test (1 - false-negative rate)
Null Hypothesis / Alternative Hypothesis
The null hypothesis H0 states that nothing of
consequence is apparent in the data distribution.
The data corresponds to our expectation. We
learn nothing new.
The alternative hypothesis H1 states that some
effect is apparent in the data distribution. The data
is different from our expectation. We need to
account for something new. Not in all cases will
this result in a new model, but a new model
always begins with the observation that the old
model is inadequate.
Test types
... common types of tests
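A Z–test compares a standardized sample mean with a normal distribution; it assumes the population variance is known.
A t–test compares a standardized sample mean with a t-distribution, which accounts for the
variance being estimated from the sample and is therefore suited to small samples.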
Nonparametric tests can be applied if we have no reasonable model
from which to derive a distribution for the null hypothesis.
Chi–squared tests analyze whether samples are drawn from the same
distribution.
F-tests analyze the variance of populations (ANOVA).
Error Types
Truth    Decision: Accept H0                        Decision: Reject H0
H0       1-α ("Specificity")                        α: "False positive", "Type I error"
H1       β: "False negative", "Type II error"       1-β ("Power")
Type I vs. Type II Errors
what is a p–value?
a) A measure of how much evidence we have against the alternative hypothesis.
b) The probability of making a false-positive.
c) Something that biologists want to be below 0.05 .
d) The probability of observing a value as extreme or more extreme by chance alone.
e) All of the above.
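Option (d) describes a tail probability computed under the null hypothesis. As a hedged illustration, the sketch below computes a two-sided p-value by hand for a one-sample t-test and compares it with t.test(). The sample, its size, and the null value of 0 are illustrative assumptions.

```r
set.seed(100)
x <- rnorm(20, mean = 0.5, sd = 1)             # hypothetical sample; H0: true mean is 0
n <- length(x)
t_stat <- (mean(x) - 0) / (sd(x) / sqrt(n))    # observed test statistic
# Probability, under H0, of a statistic at least as extreme in either tail
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)
p_ttest  <- t.test(x, mu = 0)$p.value
print(c(manual = p_manual, t.test = p_ttest))
```

The two values should be identical, since t.test() performs the same calculation.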
Distributional Assumptions
• A parametric test makes assumptions about the
underlying distribution of the data.
• A non-parametric test makes no assumptions about the
underlying distribution, but may make other
assumptions!
Most Common Statistical Test: The T-Test
One-Sample vs. Two-Sample
One-Sided vs. Two-Sided
Heteroscedastic vs. Homoscedastic
Two-Sample t–test
Test if the means of two distributions are the same.
The datasets yi1, ..., yin are independent and normally distributed
with mean μi and variance σ², N(μi, σ²), where i = 1, 2.
In addition, we assume that the data in the two groups are
independent and that the variance is the same.
H0: μ1 = μ2
H1: μ1 ≠ μ2
two–sample t–test
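A minimal sketch of the equal-variance two-sample t-test computed by hand, under the assumptions stated above (independent, normally distributed groups with a common variance). The simulated groups, their sizes, and the variable names are illustrative assumptions.

```r
set.seed(1)
y1 <- rnorm(12, mean = 10, sd = 1)   # hypothetical group 1
y2 <- rnorm(15, mean = 11, sd = 1)   # hypothetical group 2
n1 <- length(y1)
n2 <- length(y2)
# Pooled estimate of the common variance
sp2 <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2)
# Difference in means divided by its standard error
t_stat <- (mean(y1) - mean(y2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
p_manual <- 2 * pt(-abs(t_stat), df = n1 + n2 - 2)   # two-sided p-value
res <- t.test(y1, y2, var.equal = TRUE)              # same test via t.test()
print(c(t = t_stat, p = p_manual))
print(res)
```

Setting var.equal = TRUE requests the pooled-variance (homoscedastic) test; without it, t.test() runs Welch's test.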
t–test assumptions
Normality: The data need to be sampled from a normal distribution.
If not, one can use a transformation or a non-parametric test. If the
sample size is large enough (n>30), the t-test will work just fine (CLT).
Independence: Usually satisfied. If not independent, more complex
modeling is required.
Independence between groups: In the two sample t- test, the
groups need to be independent. If not, one can sometimes use a
paired t-test instead.
Equal variances: If the variances are not equal in the two groups,
use Welch's t-test (default in R).
How Do We Test These?
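One possible way to check these assumptions in R, offered as a sketch rather than the course's prescribed procedure: the Shapiro-Wilk test (shapiro.test(), an assumption here, not named on the slide) for normality, and the F-test (var.test()) for equal variances. The simulated groups are illustrative.

```r
set.seed(1)
g1 <- rnorm(30, mean = 10, sd = 1)   # hypothetical group 1
g2 <- rnorm(30, mean = 11, sd = 2)   # hypothetical group 2, larger variance
norm1 <- shapiro.test(g1)            # H0: g1 is sampled from a normal distribution
norm2 <- shapiro.test(g2)            # H0: g2 is sampled from a normal distribution
eqvar <- var.test(g1, g2)            # F-test, H0: the two variances are equal
print(c(shapiro_g1 = norm1$p.value, shapiro_g2 = norm2$p.value,
        f_test = eqvar$p.value))
welch   <- t.test(g1, g2)                    # Welch's t-test (default)
student <- t.test(g1, g2, var.equal = TRUE)  # pooled-variance t-test
print(c(welch_df = unname(welch$parameter),
        student_df = unname(student$parameter)))
```

A QQ-plot of each group is a useful visual complement to the normality test. Independence cannot be tested this way; it follows from the study design.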
non–parametric tests
Non-parametric tests constitute a flexible alternative to
t-tests if you don't have a model of the distribution.
In cases where a parametric test would be appropriate,
non-parametric tests have less power.
Several non-parametric alternatives exist, e.g. the
Wilcoxon and Mann-Whitney tests.
Wilcoxon test principle
Consider two random distributions with 25
samples each and slightly different means.
set.seed(53)
n <- 25
M <- matrix(nrow = n+n, ncol=2)
for (i in 1:n) {
M[i,1] <- rnorm(1, 10, 1)
M[i,2] <- 1
M[i+n,1] <- rnorm(1, 11, 1)
M[i+n,2] <- 2
}
plot(M[,1], col=M[,2])
Wilcoxon test principle
o <- order(M[,1])
plot(M[o,1], col=M[o,2])
For each observation in the first sample (a), count the
number of observations in the second sample (b) that
have a smaller rank.
The sum of these counts is the test
statistic.
wilcox.test(M[1:n,1], M[(1:n)+n,1])
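The counting rule can be checked directly. In this sketch the two samples a and b are drawn as plain vectors, so the values differ from the matrix M above even with the same seed; the names a, b, counts and W_manual are illustrative.

```r
set.seed(53)
a <- rnorm(25, mean = 10, sd = 1)   # first sample
b <- rnorm(25, mean = 11, sd = 1)   # second sample
# For each observation in a, count the observations in b that are smaller
counts <- sapply(a, function(ai) sum(b < ai))
W_manual <- sum(counts)             # the sum of the counts is the statistic
W_r <- unname(wilcox.test(a, b)$statistic)
print(c(manual = W_manual, wilcox.test = W_r))
```

With continuous data (no ties), the hand count matches the W statistic reported by wilcox.test().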
Flow-Chart For Two-Sample Tests
Is Data Sampled From a Normally-Distributed Population?
  Yes -> Equal Variance (F-Test)?
  No -> Sufficient n for CLT (>30)?
    Yes -> Equal Variance (F-Test)?
    No -> Wilcoxon U-Test
Equal Variance (F-Test)?
  Yes -> Homoscedastic T-Test
  No -> Heteroscedastic T-Test
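One possible reading of the flow chart as code, offered as a sketch only. The function name choose_two_sample_test, the use of shapiro.test() as the normality check, and the 0.05 threshold are all assumptions, not part of the slide.

```r
# Hypothetical helper: pick a two-sample test along the lines of the flow chart
choose_two_sample_test <- function(x, y, alpha = 0.05) {
  # Normality check (assumption: Shapiro-Wilk test on each group)
  normal <- shapiro.test(x)$p.value > alpha && shapiro.test(y)$p.value > alpha
  # Is n large enough for the CLT (> 30 in each group)?
  large_n <- length(x) > 30 && length(y) > 30
  if (!normal && !large_n) {
    return(wilcox.test(x, y))                # Wilcoxon U-test
  }
  equal_var <- var.test(x, y)$p.value > alpha  # F-test for equal variances
  t.test(x, y, var.equal = equal_var)        # homo- or heteroscedastic t-test
}

set.seed(1)
res_small  <- choose_two_sample_test(rexp(10), rexp(10) + 1)
res_normal <- choose_two_sample_test(rnorm(40, 10, 1), rnorm(40, 11, 1))
print(res_small$method)
print(res_normal$method)
```

Choosing a test from the outcome of preliminary tests on the same data is itself a debatable practice; the sketch only mirrors the decision structure.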
Power, error rates and decision
Power calculation in R:
> power.t.test(n = 5, delta = 1, sd=2,
alternative="two.sided", type="one.sample")
One-sample t test power calculation
n=5
delta = 1
sd = 2
sig.level = 0.05
power = 0.1384528
alternative = two.sided
Other tests are available – see ??power.
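power.t.test() can also be run in the other direction: leave n out and supply a target power to get the required sample size. The 80% target below is an illustrative assumption; the other values repeat the example above.

```r
# Solve for n: one-sample, two-sided t-test with delta = 1 and sd = 2
res <- power.t.test(delta = 1, sd = 2, power = 0.80, sig.level = 0.05,
                    alternative = "two.sided", type = "one.sample")
n_needed <- ceiling(res$n)          # round up to a whole number of observations
# Confirm the power actually achieved with that n
check <- power.t.test(n = n_needed, delta = 1, sd = 2, sig.level = 0.05,
                      alternative = "two.sided", type = "one.sample")
print(c(n = n_needed, power = check$power))
```

With n = 5 the power is only about 14%, so considerably more observations are needed to reach the target.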
Power, error rates and decision
[Figure: distributions centred at μ0 and μ1, with regions labelled Pr(False Negative) = Pr(Type II error) and Pr(False Positive) = Pr(Type I error)]
Problem
When we measure more than
one variable for each
member of a population, a
scatter plot may show us that
the values are not completely
independent: there is a trend
for one variable to increase as
the other increases.
Regression analyzes the
dependence.
Examples:
• Height vs. weight
• Gene dosage vs.
expression level
• Survival analysis:
probability of death vs. age
Correlation
When one variable depends on
the other, the variables are to
some degree correlated.
NB: correlation ≠ causation
In R, the function cov()
measures covariance and cor()
measures the Pearson
coefficient of correlation (a
normalized measure of
covariance).
Pearson's coefficient of
correlation values range
from -1 to 1, with 0 indicating no
correlation.
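A small sketch of what "normalized measure of covariance" means: Pearson's coefficient is the covariance divided by the product of the two standard deviations. The simulated data and the name r_manual are illustrative.

```r
set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)                  # y depends partly on x
r_manual <- cov(x, y) / (sd(x) * sd(y))   # covariance scaled by both spreads
print(c(cov = cov(x, y), manual = r_manual, cor = cor(x, y)))
```

Unlike the covariance, the correlation does not change when x or y is rescaled, e.g. converted to different units.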
Pearson's Coefficient of Correlation
How to interpret the correlation
coefficient:
Explore varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.99;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.9999666
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.8;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.9661111
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.4;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.6652423
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.01;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.01232522
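Note that r in these examples is a mixing weight, not the correlation itself. For y = r*x + (1-r)*e with x and e independent N(0,1), Cov(x,y) = r and Var(y) = r² + (1-r)², so the expected correlation is r / sqrt(r² + (1-r)²). The sketch below tabulates this; the function name expected_cor and the large simulated sample are illustrative assumptions.

```r
# Expected Pearson correlation for y = r*x + (1-r)*e, with x, e ~ N(0, 1) independent
expected_cor <- function(r) r / sqrt(r^2 + (1 - r)^2)
r_values <- c(0.99, 0.8, 0.4, 0.01)
print(data.frame(r = r_values, expected = expected_cor(r_values)))

# Check by simulation with a large sample
set.seed(1)
x <- rnorm(1e5)
y <- 0.4 * x + 0.6 * rnorm(1e5)
print(c(simulated = cor(x, y), expected = expected_cor(0.4)))
```

This is why r = 0.8 gives a correlation near 0.97 rather than 0.8; with only 50 points, the observed values also scatter around the expected ones.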
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x<-runif(50,-1,1)
> r <- 0.9
> # periodic ...
> y <- (r * cos(x*pi)) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.3438495
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x<-runif(50,-1,1)
> r <- 0.9
> # polynomial ...
> y <- (r * x*x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] -0.5024503
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x<-runif(50,-1,1)
> r <- 0.9
> # exponential
> y <- (r * exp(5*x)) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.6334732
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x<-runif(50,-1,1)
> r <- 0.9
> # circular ...
> a <- (r * cos(x*pi)) + ((1-r) * rnorm(50))
> b <- (r * sin(x*pi)) + ((1-r) * rnorm(50))
> plot(a,b); cor(a,b)
[1] 0.04531711
Correlation coefficient
When Do We Use Statistics?
• Ubiquitous in modern biology
• Every class I will show a use of statistics in a recent paper
January 9, 2014
Non-Small Cell Lung Cancer 101
Lung Cancer (15% 5-year survival)
• Non-Small Cell (80% of lung cancer)
  • Adenocarcinomas
  • Squamous Cell Carcinomas
  • Large Cell (and others)
• Small Cell
Non-Small Cell Lung Cancer 102
Stage I: Local Tumour Only
Stage II: Local Lymph Nodes
Stage III: Distal Lymph Nodes
Stage IV: Metastasis
IA = small tumour; IB = large tumour
General Idea: HMGA2 is a ceRNA
What are ceRNAs?
Salmena et al. Cell 2011
Test Multiple Constructs for Activity
What Statistical Analysis Did They Do?
• No information given in main text!
• Figure legend says:
“Values are technical triplicates, have been performed
independently three times, and represent mean +/- standard
deviation (s.d.) with propagated error.”
• In supplementary they say:
“Unless otherwise specified, statistical significance was
assessed by the Student’s t-test”
• So, what would you do differently?