Statistical Analysis of Microarray Data
By H. Bjørn Nielsen & Hanne Jarmer
The DNA Array Analysis Pipeline
Question
Experimental Design
Array design
Probe design
Sample Preparation
Hybridization
Buy Chip/Array
Image analysis
Normalization
Expression Index Calculation
Comparable Gene Expression Data
Statistical Analysis
Fit to Model (time series)
Advanced Data Analysis
Clustering
Meta analysis
PCA
Classification
Survival analysis
Promoter Analysis
Regulatory Network
What's the question?
Typically we want to identify differentially expressed genes
Example:
alcohol dehydrogenase is expressed at a higher level when
alcohol is added to the media
[Figure: alcohol dehydrogenase expression without alcohol and with alcohol]
However, the measurements contain
stochastic noise
There is no way around it
He’s going to say it
Statistics
You can choose to think of
statistics as a black box
[Diagram: noisy measurements -> statistics -> p-value]
But you still need to understand how to interpret the results
The output of the statistics
P-value
The probability of seeing a difference at least as large as the one observed if the null hypothesis is true
For gene expression analysis we can say:
the chance that a gene which is not differentially expressed would look at least this different by coincidence
The statistics gives us a p-value
for each gene
We can rank the genes according to the p-value
But, we can’t really trust the p-value in a strict
statistical way!
Why not?
For two reasons:
1. We are rarely fulfilling all the
assumptions of the statistical test
2. We have to take multi-testing into
account
The t-test Assumptions
1. The observations in the two categories
must be independent
2. The observations should be normally
distributed
3. If normality is doubtful, the sample size must be ‘large’
(rule of thumb: >30 replicates)
Multi-testing?
In a typical microarray analysis we test thousands of genes
If we use a significance level of 0.05 and test 1000 genes that are not differentially expressed, we expect 50 genes to be significant by chance
1000 x 0.05 = 50
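The arithmetic can be checked with a small simulation. This is only a sketch: it assumes that none of the genes is truly differentially expressed, so every p-value is uniform on [0, 1]. The function name `count_false_positives` is made up for this illustration.

```python
import random

def count_false_positives(n_genes=1000, alpha=0.05, seed=0):
    """Count null genes whose p-value falls at or below alpha.

    Assumption: under the null hypothesis a p-value is uniformly
    distributed on [0, 1], so each gene's p-value is one random draw.
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(n_genes)]
    return sum(p <= alpha for p in p_values)

expected = 1000 * 0.05              # the slide's calculation
observed = count_false_positives()  # one simulated experiment
print("expected by chance:", expected)
print("observed in one simulated experiment:", observed)
```

The observed count varies from run to run (change `seed`), but it stays close to 50.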
Correction for multiple testing
Bonferroni (confidence level of 99%):
$P \le \frac{0.01}{N}$
Benjamini-Hochberg:
$P \le \frac{i}{N} \cdot 0.01$
N = number of genes
i = number of accepted genes
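A minimal sketch of the two corrections, using the 0.01 level from the slide. The function names and the p-values are invented for illustration. In the Benjamini-Hochberg function, i is taken as the rank of a gene's p-value (smallest first), and the largest rank that satisfies the inequality gives the number of accepted genes.

```python
def bonferroni(p_values, alpha=0.01):
    """Accept gene k if p_k <= alpha / N."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def benjamini_hochberg(p_values, alpha=0.01):
    """Accept the i smallest p-values, where i is the largest rank
    with p_(i) <= (i / N) * alpha."""
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])
    n_accepted = 0
    for rank, k in enumerate(order, start=1):
        if p_values[k] <= rank / n * alpha:
            n_accepted = rank  # keep the largest rank that passes
    accepted = [False] * n
    for k in order[:n_accepted]:
        accepted[k] = True
    return accepted

# Made-up p-values for five genes (illustration only).
p_values = [0.0001, 0.004, 0.0039, 0.03, 0.2]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On these numbers Bonferroni accepts one gene and Benjamini-Hochberg accepts three, which shows that Bonferroni is the stricter of the two.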
But really, those methods are too
strict
What we can trust is the ability of the statistical
test to rank the genes according to their
reliability
The number of genes that are needed or can
be handled in downstream processes can be
used to set the cutoff
If we permute the samples we can get an
estimate of the False Discovery Rate (FDR) in
this set
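One way the permutation idea could be sketched: shuffle the sample labels, count how many genes still score at least as high as the cutoff, and divide the average count by the number of selected genes. The scoring function (an absolute t-like statistic), the simulated data and all names below are assumptions made for this example, not the authors' procedure.

```python
import random
import statistics

def score(values, labels):
    """Absolute t-like statistic for one gene (labels are 0 or 1)."""
    a = [v for v, g in zip(values, labels) if g == 0]
    b = [v for v, g in zip(values, labels) if g == 1]
    se = (statistics.variance(a) / len(a)
          + statistics.variance(b) / len(b)) ** 0.5
    return abs(statistics.mean(a) - statistics.mean(b)) / se

def estimate_fdr(data, labels, n_selected, n_permutations=50, seed=0):
    """Estimate the FDR among the n_selected top-ranked genes."""
    rng = random.Random(seed)
    observed = sorted((score(g, labels) for g in data), reverse=True)
    cutoff = observed[n_selected - 1]  # score of the last selected gene
    false_counts = []
    for _ in range(n_permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)          # break the link to the samples
        false_counts.append(sum(score(g, shuffled) >= cutoff for g in data))
    return min(1.0, statistics.mean(false_counts) / n_selected)

# Simulated data: 200 genes, 5 + 5 samples, the first 10 genes shifted.
rng = random.Random(1)
labels = [0] * 5 + [1] * 5
data = [[rng.gauss(0, 1) + (3.0 if g < 10 else 0.0) * lab for lab in labels]
        for g in range(200)]
fdr = estimate_fdr(data, labels, n_selected=10)
print("estimated FDR among the top 10 genes:", fdr)
```

With few samples there are only a limited number of distinct permutations, so the estimate is coarse.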
Volcano Plot
[Figure: volcano plot of P-value against log2 fold change (M)]
What's inside the black box ‘statistics’
t-test or ANOVA
The t-test
Calculate T
Lookup T in a table
The t-test II
The t-test tests for a difference in means (μ)
[Figure: density curves of the intensity of gene x for wt and mutant, with the means μ_wt and μ_mut marked]
The t-test III
The t statistic t is based on the sample mean and variance of each category: the difference between the two means divided by its standard error
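The formula on the original slide is not recoverable from the transcript, so the sketch below uses one common form, the unequal-variance (Welch) statistic: the difference in means divided by the square root of s1²/n1 + s2²/n2. The intensities and the function name are invented for illustration.

```python
import math
import statistics

def t_statistic(a, b):
    """Difference in sample means divided by its standard error.

    Assumption: the unequal-variance (Welch) form is used; the slide
    may have shown the pooled-variance form instead.
    """
    mean_diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return mean_diff / se

# Made-up log intensities of gene x (illustration only).
wt = [7.1, 6.8, 7.4, 7.0]
mutant = [8.9, 9.3, 8.7, 9.1]
t = t_statistic(mutant, wt)
print("t =", round(t, 2))
```

The value of t is then looked up in a t table (or passed to a statistics library) to obtain the p-value.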
ANOVA
ANalysis Of Variance
Very similar to the t-test, but can test
multiple categories
Ex: is gene x differentially expressed
between wt, mutant 1 and mutant 2
Advantage: it has more ‘power’ than the
t-test
ANOVA II
[Figure: density against intensity for each group, showing the variance between groups and the variance within groups]
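The comparison of variance between groups and variance within groups can be sketched as the one-way ANOVA F statistic. The data and the function name are made up for illustration. Turning F into a p-value requires an F table or a statistics library and is left out here.

```python
import statistics

def anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square divided by
    within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum((v - statistics.mean(g)) ** 2
                    for g in groups for v in g)
    ms_between = ss_between / (k - 1)  # variance between groups
    ms_within = ss_within / (n - k)    # variance within groups
    return ms_between / ms_within

# Made-up log intensities of gene x in three categories.
wt = [7.1, 6.8, 7.4, 7.0]
mutant1 = [8.9, 9.3, 8.7, 9.1]
mutant2 = [7.2, 7.0, 6.9, 7.3]
f_three = anova_f([wt, mutant1, mutant2])
f_two = anova_f([wt, mutant1])
print("F (wt, mutant 1, mutant 2):", round(f_three, 2))
print("F (wt, mutant 1):", round(f_two, 2))
```

With only two categories of equal size, F equals the square of the t statistic, which is one way to see how closely the two tests are related.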
Blocks and paired tests
Some undesired factors may influence the experiments. Their effect can be greatly reduced if they are blocked out or if the experiment is paired.
Some possible blocks:
- Dye
- Patient
- Technician
- Batch
- Day of experiment
- Array
Paired t-test
Hypothesis:
There is no difference between the means of BLUE and RED
[Figure: paired BLUE and RED measurements in Block 1, Block 2 and Block 3]
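A minimal sketch of the paired t statistic: the test is run on the within-block differences, so variation between blocks drops out. The three blocks of BLUE and RED values are invented for illustration, as is the function name.

```python
import math
import statistics

def paired_t_statistic(blue, red):
    """Mean of the within-block differences divided by its standard error."""
    diffs = [b - r for b, r in zip(blue, red)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se

# Made-up measurements: one BLUE and one RED value per block.
blue = [5.0, 7.0, 9.0]  # Block 1, Block 2, Block 3
red = [5.5, 7.6, 9.4]
t_paired = paired_t_statistic(blue, red)
print("paired t =", round(t_paired, 2))
```

In this made-up data the blocks differ a lot from each other (5, 7, 9), while the BLUE/RED difference within each block is small but consistent. The paired statistic picks this up; an unpaired test on the same numbers would be dominated by the block-to-block variation.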
An example: (2-way ANOVA with blocks)
Experimental Design:
Everything was
done in batches to
capture the
systematic noise
Strains: A187, CreA
Media (for each strain): Glucose, Ethanol
Replicates: 3x
Example: Batch to batch variation
Within batch variation is lower
than the between batch variation

Example: Data analysis - Blocking
• We can capture the batch variation by blocking
Two-way ANOVA
Effect 1: Ethanol, Glucose
Effect 2: A187, CreA
Batch: A, B, C
Ex: Result of the 2-way ANOVA
(3 p-values)
A genotype p-value
A growth media p-value
An interaction p-value
[Figure: two-way ANOVA layout; media effect (Glucose, Ethanol) against genotype effect (A187, CreA), with batches A, B and C]
Conclusion
• Array data contains stochastic noise
– Therefore statistics is needed to conclude on
differential expression
• We can’t really trust the p-value
• But the statistics can rank genes
• The capacity/needs of downstream
processes can be used to set cutoff
• FDR can be estimated
• t-test is used for two category tests
• ANOVA is used for multiple categories