Statistical Concepts and Methodologies for Data Analyses

Benilton Carvalho
Computational Biology and Statistics Group
Department of Oncology
University of Cambridge
FROM RANDOM VARIABLES TO
HYPOTHESIS TESTING
Random Variables
• Function that associates probability to:
– Countable items (discrete random variable);
• Tumor vs. Normal; Yes vs. No; Head vs. Tail;
– Uncountable items (continuous random variable):
• Log-expression; weight; height;
• Characterized by a distribution function:
– Bernoulli; Binomial; Geometric; Negative Binomial; Poisson;
– Normal; Student’s t; Gamma;
Examples – Discrete Distributions
Examples – Continuous Distributions
Common Uses of
Different Distributions
• Bernoulli: probability of success in 1 trial;
• Binomial: probability of K successes in N trials;
• Geometric: probability of K failures before the 1st
success;
• Negative Binomial: probability of K failures
before R successes;
• Poisson: probability of K rare events in a fixed interval;
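The probability functions behind these uses can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the parameterization (p for the success probability, rate for the Poisson mean) are choices made here, not taken from the slides.

```python
from math import comb, exp, factorial


def bernoulli_pmf(k, p):
    """P(X = k) for one trial, where k is 1 (success) or 0 (failure)."""
    return p if k == 1 else 1.0 - p


def binomial_pmf(k, n, p):
    """P(K = k successes in n independent trials)."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def geometric_pmf(k, p):
    """P(K = k failures before the first success)."""
    return (1.0 - p) ** k * p


def negative_binomial_pmf(k, r, p):
    """P(K = k failures before the r-th success)."""
    return comb(k + r - 1, k) * p**r * (1.0 - p) ** k


def poisson_pmf(k, rate):
    """P(K = k events) when events occur at the given average rate."""
    return exp(-rate) * rate**k / factorial(k)
```

With r = 1 the Negative Binomial reduces to the Geometric, which is a quick way to check the two definitions against each other.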
The Questions
• Investigation of populations or groups within a
population leads to questions:
– How does BRCA1 behave across groups?
– Can genotype predict drug response?
– Does transcript abundance change as a function
of time?
The Experiment
• A procedure used to answer the questions;
• Comprised of multiple items:
– Population;
– Sample;
– Hypotheses;
– Test statistic;
– Rejection criteria;
Population
• Superset of subjects of interest;
• Ideally, every subject in the population is
surveyed;
• Issues with the “census approach”;
Sample
• Select some subjects from the population;
• We refer to this subset as sample;
• Subject in a sample can be called replicate;
• Replicate: technical vs. biological;
Hypotheses
• Sets that define the “underlying truth”;
• Null Hypothesis (H0): default situation.
– Cannot be proven;
– Reject (in favor of H1) vs. fail to reject;
• Alternative Hypothesis (H1): alternative (duh!)
– Complements H0 on the parameter space;
– Assists in the definition of the rejection criteria.
Examples of Hypotheses – P1
• Comparing expression: Tumor vs. Normal:
– Expression on tumor is at most as high as on normal;
– Expression on tumor is higher than on normal;
Examples of Hypotheses – P2
• Comparing expression: Tumor vs. Normal:
– Expression on tumor is at least as high as on normal;
– Expression on tumor is lower than on normal;
Examples of Hypotheses – P3
• Comparing expression: Tumor vs. Normal:
– Expressions on tumor and normal are the same;
– Expressions on tumor and normal are different;
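In symbols, the three pairs of hypotheses (P1 to P3) can be written as below. The notation is an assumption made here: $\mu_T$ and $\mu_N$ stand for the mean expression on tumor and on normal tissue, and in each pair the first statement is H0 and the second is H1.

```latex
\begin{align*}
\text{P1:}\quad & H_0: \mu_T \le \mu_N \quad\text{vs.}\quad H_1: \mu_T > \mu_N \\
\text{P2:}\quad & H_0: \mu_T \ge \mu_N \quad\text{vs.}\quad H_1: \mu_T < \mu_N \\
\text{P3:}\quad & H_0: \mu_T = \mu_N \quad\text{vs.}\quad H_1: \mu_T \ne \mu_N
\end{align*}
```

P1 and P2 lead to one-sided tests; P3 leads to a two-sided test.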
Test Statistic
• Summary of the data;
• Built “under H0”;
• Independent of unknown parameters;
• Known distributions;
• Compatibility between data and H0;
Test Statistic
• What the statistician sees…
Rejection Criteria
• Function of three factors:
– Test statistic;
– Hypotheses;
– Type I Error (False Positive), α;
• Determines thresholds used to reject H0:
– One threshold: one-sided tests;
– Two thresholds: two-sided tests;
• Defines what is “extreme” for the experiment;
Rejection Criteria
From Rejection Criteria to P-value!
p-value
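As a rough illustration of how α and the hypotheses turn into thresholds and a p-value, the sketch below assumes a test statistic that is standard Normal under H0. The value of α, the observed statistic `z_observed` and the function name `p_value` are hypothetical choices, not taken from the slides.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, sd 1: assumed distribution under H0
alpha = 0.05               # Type I error rate (hypothetical choice)

# One threshold for a one-sided test, two symmetric thresholds for a two-sided test.
one_sided_threshold = std_normal.inv_cdf(1 - alpha)
two_sided_threshold = std_normal.inv_cdf(1 - alpha / 2)


def p_value(z, alternative="two-sided"):
    """Probability, under H0, of a statistic at least as extreme as z."""
    if alternative == "greater":
        return 1 - std_normal.cdf(z)
    if alternative == "less":
        return std_normal.cdf(z)
    return 2 * (1 - std_normal.cdf(abs(z)))


z_observed = 2.1  # hypothetical observed test statistic
reject_two_sided = abs(z_observed) > two_sided_threshold
```

Rejecting when the statistic crosses the threshold is equivalent to rejecting when the p-value falls below α.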
Sampling and testing
• Box: 10% red balls and 90% blue balls;
• Discrete observations: random sample of 10 balls from the box, #red = 3;
• When do I think that I am not sampling from this box anymore?
• How many reds could I expect to get just by chance alone?
Discrete observations
• Null hypothesis (about the population that is being sampled):
10% red balls and 90% blue balls;
• Sample: random sample of 10 balls from the box;
• Test statistic: #red = 3;
• Rejection criteria: based on your observed sample, do you have
evidence to reject the hypothesis that you sampled from the null
population?
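The box example can be worked through numerically. The sketch below assumes the 10 draws are independent (a very large box, or sampling with replacement), so the number of reds follows a Binomial(10, 0.1) distribution under the null hypothesis; the 5% level is a hypothetical choice.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(K = k reds in n independent draws with red probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_draws = 10        # sample size from the slide
p_red = 0.10        # null hypothesis: 10% red balls
observed_red = 3    # test statistic from the slide

# Number of reds expected by chance alone under the null hypothesis.
expected_red = n_draws * p_red

# One-sided p-value: probability of 3 or more reds if the null is true.
p_value = sum(binomial_pmf(k, n_draws, p_red)
              for k in range(observed_red, n_draws + 1))

alpha = 0.05        # hypothetical Type I error rate
reject_null = p_value < alpha
```

Under these assumptions the p-value is about 0.07, so 3 reds out of 10 is unusual but not extreme enough to reject the null hypothesis at the 5% level.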
Continuous observations
• Sample: 4, 2.3, 5.2, 4.7, 2.1, 3.5, …
• mean = 3, sd = 0.6
• Test statistic;
• Rejection criteria: based on your observed sample, do you have
evidence to reject the hypothesis that you sampled from the null
population?
• Null hypothesis (about the population that is being sampled);
Summary of the Experiment
1) hypotheses
2) sample
3) test statistic
4) decision
Useful Facts
• The Law of Large Numbers guarantees that, as
the sample size grows, the sample average
converges to the actual mean;
• Normality assumption isn’t that important
with large sample size;
• The Central Limit Theorem states that the
average is asymptotically normal;
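A small simulation can make these facts concrete. This is a sketch under assumptions chosen here: draws come from an exponential distribution with mean 1 (clearly not Normal), and the seed, sample sizes and variable names are arbitrary.

```python
import random
from statistics import mean, stdev

rng = random.Random(2024)  # fixed seed so the run is reproducible


def sample_mean(n):
    """Average of n draws from an exponential distribution with mean 1."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n


# Law of Large Numbers: the average of a large sample sits close to the true mean of 1.
small_sample_average = sample_mean(10)
large_sample_average = sample_mean(100_000)

# Central Limit Theorem: averages of n = 50 skewed draws are roughly Normal,
# centered at 1 with standard deviation close to 1 / sqrt(50).
n = 50
averages = [sample_mean(n) for _ in range(2_000)]
center = mean(averages)
spread = stdev(averages)
theoretical_spread = 1 / n**0.5
```

Plotting a histogram of `averages` would show the bell shape, even though the individual draws are strongly skewed.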
Useful Facts
• The Z-score depends on the precise
knowledge of the variance term:
• Estimating the variance changes the
distribution of the test statistic:
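The two formulas did not survive in this transcript. A likely form, assuming a one-sample test of the mean with Normal data, $n$ observations, sample average $\bar{X}$, hypothesized mean $\mu_0$, known standard deviation $\sigma$ and sample standard deviation $S$ (all symbols chosen here), is:

```latex
\begin{align*}
Z &= \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim N(0, 1)
  && \text{(variance known)} \\
T &= \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t_{n-1}
  && \text{(variance estimated)}
\end{align*}
```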
Useful Facts
• The Student’s t distribution is similar to the
Normal distribution, but has heavier tails;
• Larger sample size, more d.f.;
• More d.f., closer to Normal;
Multiple Testing
• We are doing high-throughput experiments;
• Comparing thousands of units simultaneously;
• At this scale, we can observe several instances
of rare events just by chance:
– Event A: 1 in 1000 chance of happening;
– Event B: 999 in 1000 chance of happening;
– And the experiment is tried 20,000 times;
– We expect 20 occurrences of Event A to be
observed, although Event B is much more likely;
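The arithmetic behind this example is short enough to check directly. The sketch below uses the numbers from the slide; the last two lines add a hypothetical significance level of 0.05 to show the same effect for hypothesis tests when every null hypothesis is true.

```python
n_tests = 20_000       # number of times the experiment is tried
p_event_a = 1 / 1000   # chance of the rare Event A in a single try

# Expected number of times the rare event shows up by chance alone.
expected_event_a = n_tests * p_event_a

# Probability of seeing Event A at least once across all tries.
prob_at_least_one = 1 - (1 - p_event_a) ** n_tests

# Same logic for testing: with a hypothetical alpha of 0.05 and all nulls true,
# this many tests are expected to be declared significant by chance.
alpha = 0.05
expected_false_positives = n_tests * alpha
```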
Multiple Testing
• Similar scenario, for example, with DE;
• Most genes are not differentially expressed;
• High-throughput experiments;
• Differential expression is tested for 20K genes;
• Need to protect against false positives;
• Suggestion: use non-specific filtering;
DATA MODELING
What is a model?
Statistical Models
• There is no “correct model”;
• Models are approximations of the truth;
• There is a “useful model”;
• Understand the mechanisms of the system for
better choices of model alternatives;
Revisiting Microarrays
• Scanned images;
• Fluorescence intensities;
• Proportional to target abundances;
• Restricted dynamic range;
• Asymmetrical distribution;
• Log-Intensities behave better;
Revisiting Microarrays
[Figure: distribution of Intensities vs. distribution of Log-Intensities]
Back to Data Modeling
Linear Regression / ANOVA
• Nature of the data: continuous;
• Linear regression often used;
• For subject i, known factors/covariates are
candidates to predict log-intensities of a gene:
• Residuals expected to be Normal;
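The model formula did not survive in this transcript. A minimal sketch of its usual form, assuming a single covariate $x_i$ and writing $y_i$ for the log-intensity of the gene in subject $i$ (symbols chosen here, not taken from the slide):

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2)
```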
Interpreting Coefficients
• Statisticians indicate that a parameter is
estimated by using a “hat” on top of it:
• Assuming that X = 0 for normal tissue:
• Assuming that X = 1 for tumor tissue:
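The three equations are missing from the transcript. Assuming a model with one intercept and one coefficient for the tissue indicator X (notation chosen here), they would read:

```latex
\begin{align*}
\hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 X \\
X = 0 \text{ (normal)}: \quad \hat{y} &= \hat{\beta}_0 \\
X = 1 \text{ (tumor)}: \quad \hat{y} &= \hat{\beta}_0 + \hat{\beta}_1
\end{align*}
```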
Interpreting Coefficients
• Estimated intercept: average log-intensity for normal tissue;
• Estimated slope: change in average log-intensity
associated with the tumor tissue;
• Intercept plus slope: average log-intensity for tumor tissue;
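A small numerical check of this interpretation, using made-up log-intensities (the data and variable names are hypothetical) and the closed-form least-squares estimates for a single 0/1 covariate:

```python
from statistics import mean

# Hypothetical log-intensities of one gene.
normal = [7.1, 6.8, 7.4, 7.0]   # X = 0
tumor = [8.2, 8.6, 7.9, 8.4]    # X = 1

x = [0] * len(normal) + [1] * len(tumor)
y = normal + tumor

# Ordinary least squares for y = b0 + b1 * x.
x_bar, y_bar = mean(x), mean(y)
b1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
b0_hat = y_bar - b1_hat * x_bar

# b0_hat equals the normal-tissue average;
# b0_hat + b1_hat equals the tumor-tissue average.
normal_average = mean(normal)
tumor_average = mean(tumor)
```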
GLM
• Generalized Linear Models;
• Generic framework;
• Accommodates different types of data;
• Special cases: Linear regressions and ANOVAs;
Example – GLM Binomial Family
• Responses: yes/no; dead/alive; sick/healthy;
• Predictors: Gene expression / genotype / age;
• Example:
– Response: Cytogenetic abnormalities (Yes/No);
– Predictors: Log-expression of probeset 1059_at;
Log-Expression vs. Abnormalities
Modeling a Binary Response
• Response in the previous example:
– Observed cytogenetic abnormalities;
– Did not observe cytogenetic abnormalities;
• Linear regression does not work:
[Figure: response (0 to 1) against log-expression of 1059_at (3 to 4)]
Modeling a Binary Response
• Instead of modeling the actual response, we
model the probability of that response;
• Linear regression still fails;
[Figure: Prob(Observe Cytogenetic Abnormalities), on an axis from -100% to
200%, against the predictor (2 to 5); fitted values between 0% and 100% are
marked “Valid Results”, values outside that range “Invalid Results”]
Logistic Regression - Rationale
• Probability is restricted to the [0, 1] interval;
• Linear regression isn’t;
• Need to transform probability;
Logistic Regression - Rationale
• Instead of probability, model the odds:
• Odds range from 0 to Infinity;
• A linear regression approach would still fail;
Logistic Regression - Rationale
• Instead of odds, model the log-odds:
• Log-odds range from -Infinity to Infinity;
• An approach like linear regression, using the
log-odds scale, would work fine;
Back to GLM
• In the previous example:
Link function: logit
Linear Predictor
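The formulas for the odds, the log-odds and the resulting model are missing from the transcript. Writing $p$ for the probability of the response and $x_i$ for the predictor of subject $i$ (notation assumed here), the standard definitions are:

```latex
\begin{align*}
\text{odds} &= \frac{p}{1 - p} \in (0, \infty) \\
\text{log-odds} &= \log\frac{p}{1 - p} \in (-\infty, \infty) \\
\underbrace{\log\frac{p_i}{1 - p_i}}_{\text{logit link}}
  &= \underbrace{\beta_0 + \beta_1 x_i}_{\text{linear predictor}}
\end{align*}
```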
Interpreting Coefficients
on a Logistic Model
• b0: average log-odds for normal tissue;
• b1: average change in log-odds on tumor;
• Suppose b0 = 10.87 and b1 = -3.46:
– How do we interpret?
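One way to approach the question is to move from the log-odds scale back to odds and probabilities. The sketch below takes the two values from the slide; the function names are chosen here, and the predictor x is left generic (a one-unit increase in x changes the log-odds by b1).

```python
from math import exp

b0, b1 = 10.87, -3.46  # coefficients quoted on the slide


def inverse_logit(log_odds):
    """Map a log-odds value back to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-log_odds))


def probability(x):
    """Fitted probability of the response at predictor value x."""
    return inverse_logit(b0 + b1 * x)


# A one-unit increase in x multiplies the odds by exp(b1).
odds_ratio = exp(b1)
```

Here `odds_ratio` is about 0.03, so each one-unit increase in the predictor cuts the odds of the response by roughly 97%.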
Model Selection
• Likelihood measures the probability of
observing the data under a certain model;
• Given two models, M1 and M2 (M2⊃M1):
– Get L1: likelihood of the data under M1;
– Get L2: likelihood of the data under M2;
• LRT = -2 log(L1/L2) has a known distribution under M1
(asymptotically chi-squared);
– Small LRT: choose M1;
– Large LRT: choose M2;
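The comparison can be sketched as follows, assuming M2 has exactly one more parameter than M1 so that the reference distribution is chi-squared with 1 degree of freedom. The log-likelihood values are hypothetical, and the tail probability uses the identity between a chi-squared variable with 1 d.f. and a squared standard Normal.

```python
from math import erfc, sqrt


def lrt_statistic(loglik_m1, loglik_m2):
    """-2 log(L1 / L2), computed from the two log-likelihoods."""
    return -2.0 * (loglik_m1 - loglik_m2)


def chi2_upper_tail_1df(x):
    """P(chi-squared with 1 d.f. > x), via the complementary error function."""
    return erfc(sqrt(x / 2.0))


loglik_m1 = -120.4   # hypothetical log-likelihood under the smaller model M1
loglik_m2 = -117.1   # hypothetical log-likelihood under the larger model M2

lrt = lrt_statistic(loglik_m1, loglik_m2)
p_value = chi2_upper_tail_1df(lrt)

# Small p-value (large LRT): the extra parameter in M2 is worth keeping.
choose_m2 = p_value < 0.05
```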
MODELING STRATEGIES FOR
SEQUENCING DATA
Sequencing – Rationale
Technical Replicate
• In sample j, transcript i is generated at rate λij;
• A fragment attaches to the flow cell with a
(low) probability pij;
• Number of observed tags, yij, is Poisson
distributed with rate proportional to λijpij;
Adapted from notes by Tom Hardcastle
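Poisson
Probability function: $P(Y = y) = e^{-\mu} \mu^{y} / y!$
Analysis method: GLM
• Noise part: observed counts scatter around the expected
count of region i in sample j;
• Deterministic part: the expected count combines the library
size effect with the design matrix times the (differential)
effect for region i;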
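The annotated equation itself is missing from the transcript. A reconstruction consistent with the labels, with symbols assumed here ($\mu_{ij}$ for the expected count of region $i$ in sample $j$, $s_j$ for the library size effect, $x_{jk}$ for the design matrix and $\beta_{ik}$ for the effects of region $i$), would be:

```latex
\begin{align*}
Y_{ij} &\sim \text{Poisson}(\mu_{ij}) && \text{(noise part)} \\
\log \mu_{ij} &= s_j + \sum_k x_{jk}\, \beta_{ik} && \text{(deterministic part)}
\end{align*}
```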
Need to account for extra variability
technical rep. – consistent with Poisson
biol. rep. – not consistent with Poisson
Based on the data of Nagalakshmi et al.
Science 2008; slide adapted from Huber;
Sequencing – Rationale
Biological Replicates
• For subject j, on transcript i:
• Different subjects have different rates, which
we can model through:
• This hierarchy changes the distribution of Y:
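The three formulas are missing from the transcript. The usual construction, assumed here, lets the subject-specific rate follow a Gamma distribution, which turns the Poisson into a Negative Binomial. The symbols $\mu_i$ (mean) and $\alpha_i$ (dispersion) are chosen for this sketch, and library size factors are left out for brevity.

```latex
\begin{align*}
Y_{ij} \mid \lambda_{ij} &\sim \text{Poisson}(\lambda_{ij}) \\
\lambda_{ij} &\sim \text{Gamma}\big(\text{mean } \mu_i,\ \text{variance } \alpha_i \mu_i^2\big) \\
Y_{ij} &\sim \text{Negative Binomial}, \quad
  E[Y_{ij}] = \mu_i, \quad
  \text{Var}(Y_{ij}) = \mu_i + \alpha_i \mu_i^2
\end{align*}
```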
Negative Binomial
Probability function:
Adding an additional
source of variation
smooth dispersion-mean relation α
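The effect of the additional source of variation can be seen in a short simulation. This sketch assumes a Gamma distribution for the subject-specific rates (the standard route to the Negative Binomial); the mean, the dispersion, the seed and the helper `poisson_draw` are all hypothetical choices.

```python
import random
from math import exp
from statistics import mean, variance

rng = random.Random(7)  # fixed seed for reproducibility


def poisson_draw(rate):
    """One Poisson draw by Knuth's multiplication method (fine for moderate rates)."""
    threshold = exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


mu, alpha = 10.0, 0.5   # hypothetical mean count and dispersion
n = 20_000

# Technical replicates: every replicate shares the same rate (pure Poisson).
technical = [poisson_draw(mu) for _ in range(n)]

# Biological replicates: each subject has its own rate, Gamma distributed
# with mean mu and variance alpha * mu**2 (shape 1/alpha, scale mu*alpha).
biological = [poisson_draw(rng.gammavariate(1 / alpha, mu * alpha))
              for _ in range(n)]

technical_variance = variance(technical)     # theory: mu
biological_variance = variance(biological)   # theory: mu + alpha * mu**2
```

Both samples have a mean near 10, but the variance of the biological replicates is several times larger, which a Poisson model cannot accommodate.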
CONSIDERATIONS ON
EXPERIMENT DESIGN
Consideration
• Sample size is crucial. The larger, the better;
• With differential expression, one can observe
this more easily;
• Is RNA-Seq really worth it when we consider:
– Cost,
– Strategies for analysis, and
– Technical requirements?
Differential Expression Across Groups
Flow Cell Confounded With Group
[Figure: Flow Cells 1 to 4, lanes 1 to 8 each; each flow cell holds a
single group (Groups A to D)]
Differential Expression Across Groups
Randomize Samples wrt Flow Cell
[Figure: Flow Cells 1 to 4, lanes 1 to 8 each; samples randomized
across flow cells]
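A randomized layout of this kind can be produced with a few lines of code. The sketch assumes 4 groups of 8 samples and 4 flow cells with 8 lanes each; the sample labels and the seed are hypothetical.

```python
import random

rng = random.Random(42)  # fixed seed so the layout can be reproduced

groups = ["A", "B", "C", "D"]
# 32 hypothetical samples, 8 per group, labelled A1..A8, B1..B8, ...
samples = [f"{group}{i}" for group in groups for i in range(1, 9)]

# Shuffle once, then fill the flow cells lane by lane.
shuffled = samples[:]
rng.shuffle(shuffled)

n_flow_cells, n_lanes = 4, 8
layout = {
    f"Flow Cell {cell + 1}": shuffled[cell * n_lanes:(cell + 1) * n_lanes]
    for cell in range(n_flow_cells)
}
```

Complete randomization like this breaks the link between group and flow cell on average; a blocked design that places two samples of each group on every flow cell would balance the groups exactly.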
Differential Expression Across Groups
Barcoding vs. Lane Effect
[Figure: Flow Cells 1 to 4, lanes 1 to 8 each]
CONSIDERATIONS ON
DATA PROCESSING
Normalization
• Samples are sequenced in different depths:
Gene          Sample 1      Sample 2
Gene 1        500,000       500,000
…             …             …
Gene N        0             500,000
Total Reads   15,000,000    30,000,000
• Genes with higher expression on Sample 2;
• Adjusting by total reads can be misleading;
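A toy calculation shows why. The counts below are hypothetical but match the totals on the slide: 30 genes with 500,000 reads each in both samples, plus 30 further genes that have reads only in Sample 2.

```python
reads_per_gene = 500_000

# Hypothetical counts: genes 1-30 are equal in both samples,
# genes 31-60 are only expressed in Sample 2.
sample_1 = [reads_per_gene] * 30 + [0] * 30
sample_2 = [reads_per_gene] * 60


def per_million(counts):
    """Scale counts by the sample's total reads (counts per million)."""
    total = sum(counts)
    return [count / total * 1_000_000 for count in counts]


cpm_1 = per_million(sample_1)
cpm_2 = per_million(sample_2)

# Gene 1 has identical raw counts in both samples, yet after adjusting
# by total reads it appears to be halved in Sample 2.
apparent_fold_change = cpm_2[0] / cpm_1[0]
```

The apparent change is driven entirely by the extra genes in Sample 2, not by Gene 1 itself.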
Normalization
• Length can affect relative inference of
expression across genes;
• Gene A K-times longer than B is expected to
have K-times more reads than B:
Gene A
Gene B
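One simple way to make genes comparable is to divide read counts by gene length. The sketch below uses hypothetical counts and lengths, and the function name `reads_per_kilobase` is chosen here; it illustrates the idea rather than any specific published measure.

```python
def reads_per_kilobase(read_count, gene_length_bp):
    """Read count divided by gene length expressed in kilobases."""
    return read_count / (gene_length_bp / 1000)


# Hypothetical genes: A is 3 times longer than B and has 3 times more reads.
gene_a = {"reads": 3000, "length_bp": 6000}
gene_b = {"reads": 1000, "length_bp": 2000}

rpk_a = reads_per_kilobase(gene_a["reads"], gene_a["length_bp"])
rpk_b = reads_per_kilobase(gene_b["reads"], gene_b["length_bp"])

# After adjusting for length the two genes look equally expressed.
same_expression = abs(rpk_a - rpk_b) < 1e-9
```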