Transcript Document

Contacts
• Data sets are available at:
http://www.bioinformatica.unito.it/bioinformatics/DAGEL.II/
• Post-workshop support:
– Support for the data analysis course will be provided by
Prof. Calogero; send an email to:
• [email protected] with the Subject: Post DW
support
– If personalized training is needed on Bioconductor
programming please refer to the AIRBB personalized
courses web page:
http://www.bioinformatica.unito.it/bioinformatics/AIRBB_c
ourses/AIRBB_courses.html
Workshop topics
• We will analyse the following topics:
– Experimental design.
– Quality controls.
– Data filtering.
– Statistical inference of differential
expression.
– Annotation.
– Assessing the biological meaning of
differential expression.
Essential issues
• Any analysis of microarray data is useless
if:
– there is no clear biological question to be
investigated;
– biological experiments are not carefully
designed to minimize error sources:
• human intervention,
• reagent lots,
• equipment,
• etc.
Experimental design
• If the biological material is not a limiting factor
“think wide”:
– Experiments should be designed with many replicates
(>3).
– Time course experiments should be designed with
many points (>4).
– Investigate part of the experiment by microarrays
and use the rest for further validations.
– Discuss the experiment with the
statistician/bioinformatician involved in data analysis
Pools versus single samples
• The basic assumption underlying sample
pooling is biological averaging:
– the expression from a pooled sample
averages out the expression from the
individual contributing samples.
Bioinformatics 2004, 20:3318
Pools versus single samples
• Shih shows that the assumption -“expression from a
pooled sample averages out the expression from the
individual contributing samples”- may not hold
especially when the signals are strong.
• The pooling bias appears to be particularly severe for
Affymetrix arrays.
• Furthermore, it is impossible to associate the gene
expression from the pooled sample with the individual
phenotypic information:
– making certain statistical inferences or predictions for
individuals unfeasible.
• Conclusions:
– Researchers have to be cautious when designing a pooled
experiment.
– Pooling of samples is recommended when there is not enough
RNA from each individual sample to run an array.
Experiment setup
• Experiments involving various samples
and conditions need to be carefully
designed to avoid unwanted effects.
[Diagram: allocation of control (C1, C2) and treated (T1, T2) samples across Day 1 and Day 2 under two alternative experimental layouts, one unbalanced and one balanced across days.]
Analysis pipe-line
Array fabrication → Sample preparation → Hybridization → Scanning + image analysis (platform-specific devices) → Quality control → Filtering → Normalization → Statistical analysis → Annotation → Biological knowledge extraction (Bioconductor)
Open source software
• Bioconductor
– is an open source and open development software
project to provide tools for the analysis and
comprehension of genomic data.
• SAM
– is a statistical technique for finding significant
genes in a set of microarray experiments.
• TMEV 4.0
– is an application that allows the viewing of
processed microarray slide representations and the
identification of genes and expression patterns of
interest. A variety of normalization algorithms and
clustering analyses allow the user flexibility in
creating meaningful views of the expression data.
Analysis of microarray
• Bioconductor (www.bioconductor.org) is an
open source and open development software
project to provide tools for the analysis and
comprehension of genomic data.
• Advantages:
– highly dynamic,
– free of charge,
– introductory and advanced courses are available in
Europe/USA every year.
• Disadvantages:
– limited graphical interface,
– limited documentation.
oneChannelGUI
• This is a graphical interface to Bioconductor
libraries devoted to the analysis of data derived
from single channel platforms.
• affylmGUI is a graphical interface to the limma
library, which allows differential expression
detection by means of linear model analysis.
• oneChannelGUI is an extension of affylmGUI
capabilities.
oneChannelGUI
• 3’ IVT/Gene arrays:
– Primary (probe level QC, probe set summary and normalization),
secondary analysis (replicates QC, filtering, statistical analysis,
classification) and data mining (GO enrichment).
• Exon arrays:
– Secondary analysis (replicates QC, filtering, statistical analysis,
classification, basic Splice Index inspection) using expression
console as source of primary data.
• Large data set (i.e. probe set expression in tab delimited
format):
– Secondary analysis (replicates QC, filtering, statistical analysis,
classification) using expression console/GEO/ArrayExpress data
as source of primary data.
Starting R and
oneChannelGUI
Setting the virtual RAM at 2GB:
C:\..\R\R-2.3.0\bin\Rgui.exe --max-mem-size=2048M
Double click on "R" to start.
Click on "Packages" to load Bioconductor packages.
Click on "Load package" to select the oneChannelGUI
package.
Click on "OK" to load the oneChannelGUI package.
Click on "Yes" to start the affylmGUI interface.
Wait a few seconds!
Click on "Yes" to start the oneChannelGUI interface.
Standard affylmGUI menu
Overlaying oneChannelGUI to affylmGUI will
change the default affylmGUI menu to the
oneChannelGUI menu for 3’IVT Affymetrix
arrays
oneChannelGUI menu for 3’IVT arrays
oneChannelGUI
• Analysis steps for 3’IVT arrays:
– Loading .CEL files to be analyzed
– Arrays quality control:
• Raw data plots
• Robust probe-level model library (affyPLM)
– NUSE plot
– RLE plot
Summary of loaded data: none is available since no CEL files have been loaded.
Click on "File", then on "New", to start a new project.
Select 3’IVT arrays.
Select as working dir the folder
containing the .CEL files.
Select the "targets" file,
then press OK to continue.
Targets file is a tab delimited
text file containing the
description of the experiment.
It is made of three columns:
Name: the name you want to
assign to each array.
FileName: the name of the
corresponding .CEL file.
Target: the experimental
condition associated to the
array (e.g. mock, treated, etc).
At least two conditions should
be present.
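Outside the GUI, the same loading step can be sketched in plain R. This is only a rough sketch of what oneChannelGUI does when a project is created, assuming a targets.txt file with the three columns described above and the .CEL files in the working directory:

library(affy)    # ReadAffy()
library(limma)   # readTargets()

targets <- readTargets("targets.txt")          # columns: Name, FileName, Target
raw <- ReadAffy(filenames = targets$FileName)  # AffyBatch holding the raw .CEL data
sampleNames(raw) <- targets$Name
pData(raw)$Target <- targets$Target            # keep the experimental condition with the data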
Define the name of your analysis.
Press OK to continue.
… Now the arrays will be loaded into
a specific R object called an
environment.
Raw data are now loaded and are
ready for normalization.
Widget to create a target for Affy
arrays
Exercise 1 (15 minutes)
• Data set for this exercise:
– spugnini.et.al: 7 CEL files were generated for
two prototypic situations (biological replicates):
– C, untreated MSTO-211H cells:
• baldi.c1_b.CEL, baldi.c2.CEL, baldi.c3.CEL
– T, MSTO-211H cells treated with 780 μM
piroxicam for 48 hours:
• baldi.f1_b.CEL, baldi.f2.CEL, baldi.f3.CEL,
baldi.f4.CEL
See next page
Exercise 1
• Go to the folder spugnini.et.al.
• Create, with Excel, a tab delimited file named
targets.txt:
– Targets file is made of three columns with the
following header:
• Name
• FileName
• Target
– In column Name place a brief name (e.g. c1, c2, etc)
– In column FileName place the name of the
corresponding .CEL file
– In column Target place the experimental conditions
(e.g. control, treatment, etc)
See next page
Exercise 1
• Open R
• Load the oneChannelGUI
• Start a new project:
– Change the working dir in spugnini.et.al
– Load the targets.txt file (previously created)
– Set as project name: ex1
Analysis pipe-line
Quality
control
Normalization
Filtering
Statistical
analysis
Biological
Knowledge
extraction
Annotation
The next steps are a few simple
basic quality controls.
Click on the Quality Control menu.
You can now evaluate:
the intensity histogram for
one array at a time.
You can now evaluate:
the intensity density plot for
one array at a time.
You can now evaluate:
all array intensities as
box plots.
It is possible that the cRNA
concentration in sample sE2
was overestimated and a low
cRNA amount was loaded on
the array.
As a result many signals are
below the value 100
[log2(100) ≈ 6.64]
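For reference, the raw-data plots shown in these screenshots can also be produced directly in R; a minimal sketch, assuming the AffyBatch `raw` created in the loading sketch above:

library(affy)

hist(raw)                                          # intensity density curves, one per array
boxplot(raw, main = "Raw PM intensities (log2)")   # all arrays as box plots
abline(h = log2(100), col = "red", lty = 2)        # reference line at log2(100) ~ 6.64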
Some other basic controls can be
done after the calculation of the
probe set intensity summary,
using a dedicated Bioconductor
library, affyPLM.
Fit the model
(BE PATIENT!!!)
The end of the fitting procedure
is signalled by a message.
Then the NUSE/RLE function is
automatically called.
affyPLM QC library
• affyPLM provides a number of useful tools
based on probe-level modelling
procedures.
• affyPLM package allows arrays quality
controls.
What is a Probe Level Model?
• A Probe Level Model (PLM) is a model
that is fit to probe-intensity data.
• affyPLM fits a model with probe level and
chip level parameters on a probe set by
probe set basis.
• In quality control chip level parameters
are a factor variable with a level for each
array.
What is a PLMset?
• The main function for fitting PLM is the
function fitPLM.
• This function will fit a linear model with an
effect estimated for each chip and an
effect for each probe.
• fitPLM implements iteratively re-weighted
least squares M-estimation regression.
• The fitted model is stored in a PLMset
object containing chip level parameter
estimates and the corresponding standard
errors.
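The fitting and plotting steps that the GUI performs correspond roughly to the following affyPLM calls; a sketch, again assuming the AffyBatch `raw` from the loading sketch:

library(affyPLM)

plm <- fitPLM(raw)            # fit the default probe-level model (can be slow)
par(mfrow = c(1, 2))
NUSE(plm, main = "NUSE")      # boxplots of normalized unscaled standard errors
RLE(plm, main = "RLE")        # boxplots of relative log expression
rm(plm); gc()                 # the PLMset can be very big: delete it after QC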
Default fitted model
log2(PM_kij) = θ_kj + φ_ki + ε_kij
• where θ_kj is the log2 probe set expression value on
array j for probe set k and φ_ki are probe effects.
• To make the model identifiable the constraint
Σ_{i=1..I} φ_ki = 0 is used.
• For this default model, the parameter estimates
given are probe set expression values.
Relative Log Expression (RLE)
• RLE values are computed for each probe set by
comparing the expression value on each array against
the median expression value for that probeset across all
arrays.
• Assuming that most genes do not change in
expression across arrays, ideally most of these
RLE values will be near 0.
• Boxplots of these values, one for each array, provide a
quality assessment tool.
• RLE plots:
– Estimate the expression θ_gi for each gene g on each array i.
– Compute the median value across arrays for each gene.
Relative Log Expression (RLE)
Normalized Unscaled Standard
Errors (NUSE)
• The standard error (SE) measures the uncertainty of each fitted
expression value.
• Normalized Unscaled Standard Errors (NUSE) can also be used for
assessing quality.
• The standard error estimates obtained for each gene on each array
from fitPLM are taken and standardized across arrays so that the
median standard error for that gene is 1 across all arrays.
• This process accounts for differences in variability between genes.
• An array where there are elevated SEs relative to the other arrays is
typically of lower quality.
• Boxplots of these values, separated by array, can be used to compare
arrays.

NUSE(θ̂_gi) = SE(θ̂_gi) / median_i( SE(θ̂_gi) )
Since the fitPLM object can be
very big, it is a good idea to
delete it after quality control.
Before Delete PLM
After Delete PLM
Exercise 2 (20 minutes)
• Starting from the data set you have loaded
– check the Raw data with the available plots
– analyze the data using NUSE and RLE plots.
• Answer the following questions:
– Is there any array characterized by a very
narrow probe intensity distribution?
• YES
NO
– Is there any array which is significantly
different with respect to the others by means of
NUSE or RLE plots?
• YES
NO
See next page
Exercise 2
• If any array has a NUSE very different
from the others create a new target file
without it and load again the remaining
.CEL files.
Analysis pipe-line
Quality
control
Normalization
Filtering
Statistical
analysis
Biological
Knowledge
extraction
Annotation
affylmGUI
• Analysis steps:
– Calculating probe set summaries:
• RMA
• GCRMA
• PLIER
– Normalization:
• Quantile method
Brief summary about probe set
intensity calculation
• RMA methodology (Irizarry et al., 2003) performs background
correction, normalization, and summarization in a modular way.
RMA does not take into account non-specific probe hybridization in
the probe set background calculation.
• GCRMA is a version of RMA with a background correction
component that makes use of probe sequence information (Wu et
al., 2004).
• The PLIER (Probe Logarithmic Error Intensity Estimate) method
produces an improved signal by accounting for experimentally
observed patterns in probe behavior and handling error
appropriately at low and high signal values.
• Methods such as PLIER+16 and GCRMA, which use model-based
background correction, maintain relatively good accuracy without
losing much precision.
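In plain R these probe set summaries correspond roughly to the calls below (a sketch; `raw` is the AffyBatch loaded earlier, and PLIER is provided by the separate Bioconductor plier package):

library(affy)     # rma()
library(gcrma)    # gcrma()

eset.rma   <- rma(raw)     # background correction, quantile normalization, median polish
eset.gcrma <- gcrma(raw)   # like RMA, but with sequence-based (GC) background correction
exprs(eset.rma)[1:5, ]     # normalized log2 expression matrix (probe sets x arrays)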
Yeast DNA on human chips
• G,C stretch length performs better than G + C content.
• The distribution of the 4 G-C strata in the non-specific
sample is well approximated by a log-normal distribution.
Figure: observed log (base 2) expression versus nominal log concentration (in picomolar).
Why Normalization ?
• To remove systematic biases, which include:
– sample preparation,
– variability in hybridization,
– spatial effects,
– scanner settings,
– experimenter bias.
Extracted from D. Hyle presentation, http://www.bioinf.man.ac.uk/microarray
What Normalization Is & What It Isn’t
• Methods and algorithms
• Applied after some image analysis
• Applied before subsequent data analysis
• Allows comparison of experiments
• Not a cure for poor data.
Where Normalization Fits In
Array fabrication → Sample preparation → Hybridization → Scanning + image analysis (spot location, assignment of intensities, background correction, etc.) → Normalization → Data analysis (subsequent analysis, e.g. clustering, uncovering genetic networks)
Quantile normalization
[Figures: array intensity distributions before and after quantile normalization. Extracted from Irizarry presentation at Bioconductor Course (Brixen IT, 2005).]
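Quantile normalization forces all arrays to share the same empirical intensity distribution: each value is replaced by the mean of the values having the same rank across arrays. A toy sketch of the idea follows (production code uses normalize.quantiles from the preprocessCore package, called internally by rma):

set.seed(1)
x <- matrix(rexp(20, rate = 0.1), nrow = 5)    # toy matrix: 5 genes x 4 arrays
x[, 2] <- x[, 2] * 3                           # one array on a different scale

rank.mat   <- apply(x, 2, rank, ties.method = "first")  # rank of each value within its array
sorted.avg <- rowMeans(apply(x, 2, sort))               # mean of the k-th smallest values
x.norm     <- apply(rank.mat, 2, function(r) sorted.avg[r])

apply(x.norm, 2, quantile)                     # all arrays now share the same distribution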
M-A Plots
The M-A plot is a 45° rotation of the standard scatter
plot of log R versus log G:
M = log R – log G
A = ½ [log R + log G]
(M = Minus, A = Add)
Extracted from Irizarry presentation at Bioconductor Course (Brixen IT, 2005)
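For single-channel data the two "channels" of an M-A plot are simply two arrays (or one array versus a reference); a minimal sketch, assuming the normalized ExpressionSet `eset.rma` from the earlier sketch:

e <- exprs(eset.rma)            # log2 expression values
M <- e[, 1] - e[, 2]            # M = log R - log G
A <- (e[, 1] + e[, 2]) / 2      # A = 1/2 (log R + log G)
plot(A, M, pch = ".", main = "M-A plot")
abline(h = 0, col = "red")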
The next step is normalization and
calculation of probe set summary.
Click on the probe set menu
and select the probe set
summary and normalization
option.
Normalization and intensity calculation
come together.
Three normalization/intensity
calculation options are available:
RMA + quantile normalization
GCRMA + quantile normalization
PLIER + quantile normalization
At any time it is possible to check the
structure of the normalized data set.
Many options but slow computation.
Be patient!!!!
Info about the process is shown in
the main R window.
Exercise 3a (20 minutes)
• Data set for following exercises:
• estrogen.IGF1: 18 CEL files were generated for 6
prototypic situations (biological replicates):
– MCF7 ctrl, untreated MCF7:
• M1.CEL, M4.CEL, M7.CEL
– MCF7 E2, MCF7 treated for 3 hours with Estrogen:
• M3.CEL, M6.CEL, M9.CEL
– MCF7 IGF, MCF7 treated for 3 hours with IGF1:
• M2.CEL, M5.CEL, M8.CEL
– SKER3 ctrl, untreated SKER3:
• s1.CEL, s4.CEL, s7.CEL
– SKER3 E2, SKER3 treated for 3 hours with Estrogen:
• s3.CEL, s6.CEL, s9.CEL
– SKER3 IGF, SKER3 treated for 3 hours with IGF1:
• s2.CEL, s5.CEL, s8.CEL
See next page
Exercise 3a
• Go to the folder estrogen.IGF1.
• Create, with Excel, a tab delimited file named targets.txt:
– Targets file is made of three columns with the following header:
• Name
• FileName
• Target
– In column Name place a brief name (e.g. c1, c2, etc)
– In column FileName place the name of the corresponding .CEL
file
– In column Target place the experimental conditions (e.g. control,
treatment, etc)
• Create a target only for MCF7 with/without IGF1
treatment
See next page
Exercise 3a
• Calculate Probe set summaries with
GCRMA for the estrogen.IGF1 set.
• Save the limma project as ex3.
See next page
Expression Console®
(Affymetrix person)
• Presentation of Expression Console®:
– General structure
– Loading exon libraries
– QC for raw data
– Exporting gene level and exon level analysis
Replicates quality control
• To evaluate the quality of sample replicates we
will use a dimension-reduction technique called
Principal Component Analysis (PCA).
Principal component analysis
• Principal component analysis (PCA) involves a
mathematical procedure that transforms a number of
(possibly) correlated variables into a (smaller) number of
uncorrelated variables called principal components.
• The first principal component accounts for as much of
the variability in the data as possible
• Each succeeding component accounts for as much of
the remaining variability as possible.
• The components can be thought of as axes in
n-dimensional space, where n is the number of
components. Each axis represents a different trend in
the data.
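A sample PCA on the probe set summaries can be sketched with base R's prcomp (arrays as observations); `eset.rma` is assumed from the normalization sketch above:

e   <- exprs(eset.rma)                               # probe sets x arrays
pca <- prcomp(t(e), center = TRUE, scale. = FALSE)   # arrays as observations

summary(pca)$importance[, 1:3]                       # variance explained by the first 3 PCs
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2", pch = 19)
text(pca$x[, 1], pca$x[, 2], labels = colnames(e), pos = 3, cex = 0.7)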
PCA
The 2nd PC is orthogonal to the 1st.
In general the first three components
account for nearly all the variability.
Therefore, PCA can be reasonably
represented in a 3D space.
t4 is clearly an outlier!
To perform sample
replicates QC we use
principal component
analysis (PCA)
This check is performed on
probe set summaries!
Exercise 3b
• Use the estrogen.IGF1 data set to
evaluate the quality of replicates by means of
PCA analysis.
• Questions:
– Are CTRL and TRT well separated?
• YES
NO
– Are replicates homogeneous?
• YES
NO
Analysis pipe-line
Quality
control
Normalization
Filtering
Statistical
analysis
Biological
Knowledge
extraction
Annotation
Filtering
• Filtering affects the false discovery rate.
• Researcher is interested in keeping the number
of tests/genes as low as possible while keeping
the interesting genes in the selected subset.
• If the truly differentially expressed genes are
overrepresented among those selected in the
filtering step, the FDR associated with a certain
threshold of the test statistic will be lowered due
to the filtering.
Extracted from: Heydebreck et al. Bioconductor Project Working Papers 2004
Filtering can be performed at
various levels:
• Annotation features:
– Specific gene features (i.e. GO term, presence
of transcriptional regulative elements in
promoters, etc.)
• Signal features:
– % intensities greater of a user defined value
– Interquantile range (IQR) greater of a defined
value
Specific gene feature
• In transcriptional studies focusing on genes
characterized by a specific feature (i.e.
transcription factor elements in promoters) the
best filtering approach is selecting only those
genes linked to the “peculiar feature”.
• For example:
– Identification of genes modulated by estradiol:ER or
IGF1 by direct binding to Estrogen-Responsive
Elements (ERE):
• HGU133plus2:
– 54675 probe sets → 19951 Entrez Genes
• HGU133plus2 with ERE in putative promoter regions:
– 6764 probe sets → 3058 Entrez Genes
Specific gene feature
• Data derived from specifically devoted annotation data
set can be used for functional filtering.
• The Ingenuity Pathways Knowledge Base is the world's
largest curated database of biological networks created
from millions of individually modeled relationships
between:
– proteins, genes, complexes, cells, tissues, drugs, diseases.
• The Ingenuity Pathways Analysis software (IPA)
identifies relations between genes.
• The relations that can be grasped are:
– Regulates
– Regulated by
– Binds
Start an Ingenuity session at:
https://analysis.ingenuity.com/pa/login/login.jsp
Specific classes of proteins can be searched and exported
A key word can also be used to perform a wide search
After selection of the Functions & diseases of
interest genes should be visualized as gene
details before exportation in a file to be used
for filtering expression data
Exporting results in a table
as previously
The Entrez Gene IDs present in this file can be used to extract a
specific subset of genes.
To filter using a list of EGs you need to extract from the
IPA table only the Entrez genes of interest and save them in a
text file without a header.
Exercise 4 (30 minutes)
• Export from Ingenuity a table related to
transcription factors
• Create a file containing only the Entrez
genes without header and use it to filter
the data set of exercise 3.
• Save the affylm project as ex4.TF.lma
• Inspect the sample PCA:
– Are CTRL and TRT well separated?
• YES
NO
Non–specific filtering
• This technique has as its premise the removal of
genes that are deemed to be not expressed or
unchanged according to some specific criterion
that is under the control of the user.
• The aim of non–specific filtering is to remove
genes that, e. g. due to their low overall intensity
or variability, are unlikely to carry information
about the phenotypes under investigation.
Extracted from: Heydebreck et al. Bioconductor Project Working Papers 2004
Intensity distributions
[Figure: intensity distributions of background-level probe sets under RMA and GCRMA.]
How to define the efficacy of a
filtering procedure?
enrichment = 100 × (N_spikein.AfterFiltering / N_probesets.AfterFiltering) × (N_probesets / N_spikein)
• This enrichment is very similar to that used to evaluate the purification
folds of a protein after a chromatographic step:
enrichment = 100 × (E.AfterChrom / gP.AfterChrom) × (gP.BeforeChrom / E.BeforeChrom)
Filtering by genefilter pOverA
(keep if ≥ 25% of the intensities are ≥ log2(100))
– before filtering: 22300 probe sets, 42/42 spike-ins, enrichment 100%
– after filtering: 5553 probe sets, 42/42 spike-ins, enrichment 401%
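The intensity filter above corresponds roughly to genefilter's pOverA function; a sketch, assuming the normalized ExpressionSet `eset.rma` and the thresholds quoted in the figure:

library(genefilter)

f1   <- pOverA(p = 0.25, A = log2(100))        # keep genes with >= 25% of values above log2(100)
keep <- genefilter(exprs(eset.rma), filterfun(f1))
eset.filt <- eset.rma[keep, ]
sum(keep)                                      # number of probe sets surviving the filter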
Filtering by InterQuartile Range (IQR)
The IQR is the range between the 25% and 75% quantiles of a gene’s intensities.
The distribution of all intensity values of a differential expression experiment is the
summary of the distribution of each gene’s expression over the experimental conditions.
How does filtering by genefilter IQR work?
The filter removes genes that show little change across the experimental points.
Filtering by genefilter IQR
(removing probe sets whose intensity IQR is below 0.25 or 0.5)
– no filter: 22300 probe sets, 42/42 spike-ins, enrichment 100%
– IQR ≥ 0.25 kept: 244 probe sets, 42/42 spike-ins, enrichment 9139%
– IQR ≥ 0.5 kept: 68 probe sets, 42/42 spike-ins, enrichment 32794%
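An IQR filter can be sketched the same way: compute the interquartile range of each probe set across arrays and drop the probe sets below the chosen threshold (here 0.25, applied to the filtered set from the previous sketch):

iqr.values <- apply(exprs(eset.filt), 1, IQR)   # per-probe-set interquartile range (log2 scale)
keep.iqr   <- iqr.values > 0.25
eset.iqr   <- eset.filt[keep.iqr, ]
sum(keep.iqr)                                   # probe sets surviving the IQR filter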
In this example only those genes
having an intensity ≥ 100 in at
least 50% of the arrays will be
selected.
Exercise 5 (20 minutes)
• Use the estrogen.IGF1 data normalized with gcrma
(ex3).
• Perform an IQR filter at 0.25 followed by an intensity
filter requiring that 50% of the arrays have an intensity over 100.
• Export the data as tab delimited file.
• Question:
– How many probe sets are left after the first and the second filter?
• …………………………………………………………
– Which are the changes observable at the level of probe sets
distribution after the first and the second filter?
• ……………………………………………………………………………
• Save the affylm project as ex5.lma.
• Make the same filtering on ex4.TF.lma
• Question:
– How many probe sets are left after the first and the second filter?
QC and filtering for exon data
• At the time oneChannelGUI was setup
Bioconductor tools for handling raw data from
Affymetrix exon arrays were not available.
• For this reason the oneChannelGUI uses the
libraries and primary analysis outputs from
Affymetrix Expression Console®.
• Exon raw data quality control is performed using
the Expression Console®.
• Sample QC and filtering are performed in
oneChannelGUI.
Exon arrays on oneChannelGUI
• On oneChannelGUI gene level and exon
level data from Expression Console® are
loaded.
• The user needs to specify where Expression
Console® library files are located each
time a new exon data set is loaded.
When loading an exon array data
set it is necessary to
indicate the organism and
which kind of exon data are
going to be loaded (core,
extended, full).
When loading an exon array data set
it is necessary to specify the
location of the Expression Console
libraries.
Subsequently three files have to be
loaded:
The target file, which has the same
structure previously described.
The tab delimited files containing
GENE and EXON level data exported
from the Expression Console.
A new Menu is then available for exon data
Exon arrays QC on oneChannelGUI
The brain (b) replicates are very poor. The quality is particularly bad for exon data.
However, we have to consider that these data are derived from tissues coming from
different post-mortem donors.
Exon arrays filtering
Since the knowledge on exon data is still
relatively limited we have little empirical
information about background threshold.
Exon/intron housekeeping gene information
available in exon data might be a possible
approach to define it.
Different colored lines indicate the
possible thresholds to be selected.
In black are shown the intensity
density plots for introns and in red
those for exons.
IQR filter works as described
for 3’IVT arrays. However,
any filter done at gene level
will also affect the
corresponding exon data.
Starting condition
After filtering
Intensity filter is instead based on the
threshold previously selected on the
basis of exon/intron HK expression
signals.
In this example we are keeping only the
genes where all samples have a signal
greater than the pre-defined BG.
Exercise 6 (30 minutes)
• Use breast and cerebellum Affymetrix exon data.
• Calculate, with Expression Console, gene/exon
level expression for the CORE sub set.
• Evaluate the sample quality using gene/exon
PCA.
• Question:
– Are replicates homogeneous?
– YES
NO
• Define a BG threshold using introns information.
• Perform an IQR filter at 0.25 followed by
intensity filter at a threshold of your choice.
• Save the affylm project as ex6.lma.
Splice Index
• The Splicing Index captures the basic
metric for the analysis of alternative
splicing.
• It is a measure of how much exon specific
expression (with gene induction factored
out) differs between two samples.
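A toy numeric sketch of one common formulation of that metric (the exon signal normalized by its gene-level signal, then compared between two conditions on the log2 scale); all numbers below are hypothetical:

exon.A <- c(200, 220, 180, 400)   # exon-level signals, condition A (hypothetical)
exon.B <- c(210, 230, 190, 100)   # exon-level signals, condition B (hypothetical)
gene.A <- 250                     # gene-level summary, condition A
gene.B <- 260                     # gene-level summary, condition B

ni.A <- exon.A / gene.A           # normalized intensities (gene induction factored out)
ni.B <- exon.B / gene.B
splice.index <- log2(ni.A / ni.B) # ~0 for constitutive exons, large for exon 4
round(splice.index, 2)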
Defining function-oriented data set
for splice index calculation
Use a set of function-oriented EGs to
select probe set IDs.
Use the selected probe set IDs for
“Filtering using a list of probe sets”.
ATTENTION: this is only a
very rough descriptive
instrument!
Much work needs to be
done on exon analysis!
Splice Index inspection is
performed modelling the
splice index exon profiles
for two experimental
conditions.
Results are saved on a pdf
file in your working dir.
The subset of splice indexes to be inspected is defined using two filters.
Example of one gene output:
This plot gives some advice about
the scattering levels of the Splice
Indexes over the gene under
analysis.
Model of splice indexes over the two experimental conditions.
Red dashed lines indicate the confidence interval of the model.
Plot of the significance p-value of the alternative splicing versus the
average Splice Index values. In this example only one exon seems to be
differentially spliced: 14.
Filtering conditions are shown
over the plot of intensity values
versus exon number.
Exercise 7 (30 minutes)
• Use ex6.lma.
• Perform a filtering on the basis of the probe sets
related to TFs.
– Create in NetAffx a Transcript cluster view containing
only Transcript Cluster IDs
– The gene level probe set list is created by interrogating
NetAffx with the EGs used in exercise 4
• Calculate on the data sub set the splice Index.
• Save the affylm project as ex7.lma.
• Filter for p-value ≤0.01 and mean average Splice
Index differences ≥2.
• Inspect the output plots.
Analysis pipe-line
Quality
control
Normalization
Filtering
Statistical
analysis
Biological
Knowledge
extraction
Annotation
This step is the same for 3’IVT arrays and exon arrays gene
level analyses
Log2(T/C) is frequently used to evaluate
fold change variation
[Figure: treated intensities of 200, 400, 800, 1600, 32000 against a constant control of 100; the t/c ratio grows very quickly, while log2(t/c) compresses large up-regulation values and represents down-regulation symmetrically below 0. Bar plot of log2(t/c) for probes F65439, PT765, AC87654, MN0987, S098345, AA238, L8754, G7599, U6539, H07498.]
Fold change filtering
• The intensity changes between experimental
groups (i.e. control versus treated) are known
as:
– Fold change.
• Frequently an arbitrary threshold
|log2(Trtd/Ctrl)| > 1
is used to define a significant differential
expression.
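Since the summarized data are already on the log2 scale, the threshold can be applied directly to the difference of the group means; a sketch, assuming the filtered ExpressionSet `eset.iqr` from the earlier sketches and the Target labels suggested for the targets file:

e    <- exprs(eset.iqr)                            # filtered log2 expression values
trt  <- pData(eset.iqr)$Target == "treatment"      # assumed label names
ctrl <- pData(eset.iqr)$Target == "control"

log2.fc  <- rowMeans(e[, trt]) - rowMeans(e[, ctrl])   # log2(Trtd/Ctrl) on already-logged data
selected <- abs(log2.fc) > 1
sum(selected)                                      # probe sets passing the fold change threshold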
Fold change filtering
• There are no rules to define the correct fold change (fc)
threshold for differential expression.
• |fc|>1 is an arbitrary threshold.
• Fc threshold estimation is dependent on the % of fc
fluctuations due to experimental reasons.
• Fc threshold estimation can be better appreciated in
time/concentration course experiments.
• “Biologically speaking” many small variations all together
can be functionally important (i.e. |fc|=0.5 for all chr 21
genes induces the Down syndrome)
Statistical analysis
• Intensity changes between
experimental groups (i.e.
control versus treated) are
known as:
– Fold change.
– Ranking genes based on
fold change alone
implicitly assigns equal
variance to every gene.
• Fold change alone is not
sufficient to indicate the
significance of the expression
changes.
• Fold change has to be
supported by statistical
information.
Multiple testing errors
• Performing multiple statistical tests two types of
errors can occur:
– Type I error (False positive)
– Type II error (False negative)
• Reduction of type I errors increases the number
of type II errors.
• It is important to identify an approach that
reduces false positives with the minimum loss of
information (false negative)
The multiple tests problem
• If the number of samples increases, the tails of a
distribution become more populated.
Type I error correction
• Null hypothesis (H0): the mean of treated and the
mean of control for a gene i belong to the same
distribution.
• Type I error: H0 is rejected although it is true (false positive).
• Sidak significance point:
K(g, α) = 1 – (1 – α)^(1/g)
α = acceptance level (e.g. 0.05)
g = n. of independent tests
• H0 is rejected for the genes whose p-values are lower
than K(g, α); all the remaining H0 are considered true.
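A quick numeric illustration of the Sidak threshold (the test count g and the p-values below are made up); base R's p.adjust offers related FWER-controlling corrections:

alpha <- 0.05
g     <- 5000                        # hypothetical number of independent tests
K     <- 1 - (1 - alpha)^(1 / g)     # Sidak per-test significance point (~1.03e-05 here)

p <- runif(g)                        # toy p-values
sum(p < K)                           # genes called with the Sidak threshold
sum(p.adjust(p, method = "holm") < alpha)   # a related FWER correction available in base R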
Type I error correction (FWER)
K(g, α) = 1 – (1 – α)^(1/g)

P of diff. exprs. genes   α’
< 10^-6                   1 – (1 – 0.05)^(1/5) = 0.0102
< 10^-6                   1 – (1 – 0.05)^(1/4) = 0.0127
2 × 10^-5                 1 – (1 – 0.05)^(1/3) = 0.0170
0.047                     1 – (1 – 0.05)^(1/2) = 0.0253
…
Statistical analysis
• The sensitivity of statistical tests is affected by
the number of available replicates.
• Replicates can be:
– Technical
– Biological
• Biological replicates better summarize the
variability of samples belonging to a common
group.
• The minimum number of replicates is an
important issue!
How important are replicates?
Yang YH and Speed T, 2002
Sample size
• Microarray experiments are often performed with a
small number of biological replicates, resulting in
low statistical power for detecting differentially
expressed genes and concomitant high false
positive rates.
• The issue of how many replicates are required in a
typical experimental system needs to be addressed.
• Of particular interest is the difference in required
sample sizes for similar experiments in inbred vs.
outbred populations (e.g. mouse and rat vs.
human).
Assessing sample sizes in
microarray experiments
• Assessment of sample sizes for microarray data
is a tricky exercise.
• The reason why we are performing such
analysis is to have a general feeling on the
ability of our experimental data to robustly detect
differential expression.
• The method implemented in oneChannelGUI is
that proposed by Warnes & Liu and
implemented in the Bioconductor library ssize.
Assessing sample sizes in
microarray experiments
• The key component of Warnes’ method is
the generation of cumulative plot of the
proportion of genes achieving a desired
power as a function of sample size, based
on simple gene-by-gene calculations.
• Its real utility is as a visual tool for helping
users to understand the trade off between
sample size and statistical power.
Assumptions
• A microarray experiment is set up to
compare gene expressions between one
treatment group and one control group.
• Microarray data has been normalized and
transformed so that the data for each gene
is sufficiently close to a normal distribution
that a standard 2-sample pooled-variance
t-test will reliably detect differentially
expressed genes.
• The tested hypothesis for each gene is:
H0: μT = μC versus H1: μT ≠ μC
where μT and μC are the means of gene
expression for the treatment and control group
respectively.
• The analysis is done using the common
variance described in:
– Wei et al. BMC Genomics. 2004, 5:87
Sample size estimation
• The required sample size of an experiment
depends on:
– variance component (σ),
– the desired detectable fold change (δ),
– the power to detect this change (1-β, the likelihood of
detecting the change or the true positive rate),
– a chosen type I error rate (α= false positive).
IMPORTANT:
This implementation of ssize functions uses BH type I
error correction instead of Bonferroni, which is the
default in ssize functions.
β = type II error rate, i.e. false negative rate.
This is not |log2(FC)|
To detect 95% of the differentially expressed genes, characterized by a power of 0.8,
a sample size, FOR EACH GROUP, greater than 20 is needed.
To detect 97% of the differentially expressed genes, characterized by a power of 0.8,
a fold change greater than 10 (log2(10)=3.32) is needed.
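A sketch of the corresponding ssize calls, following the package vignette (the gene-wise standard deviations are taken from a normalized data set such as `eset.rma`; argument names may differ slightly between package versions):

library(ssize)

gene.sd  <- apply(exprs(eset.rma), 1, sd)          # gene-wise standard deviations
all.size <- ssize(sd = gene.sd, delta = 1,         # delta: log2 fold change to detect
                  sig.level = 0.05, power = 0.8)   # alpha and desired power
ssize.plot(all.size)                               # cumulative plot: % of genes vs. sample size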
Assessing sample sizes in
microarray experiments
• Assessment of sample sizes and experimental
power of a microarray experiment using the
method implemented in the Bioconductor library
sizepower.
• The R package, sizepower, is used to calculate
sample size and power in the planning stage of
a microarray study.
• It helps the user to determine how many
samples are needed to achieve a specified
power for a test of whether a gene is
differentially expressed or, in reverse, to
determine the power of a given sample size.
Exercise 8 (20 minutes)
• Use the breast/cerebellum filtered exon
data (ex6.lma).
• Using the ssize implementation
– Check the fold change needed to achieve a
power of 80% on the basis of the available
sample size.
• Using the sizepower implementation
– Check the sample size and the experimental
power selecting various threshold of false
positives and log2(FC)
Comments about experimental
design
• If the biological material is not a limiting
factor “THINK WIDE”:
– Experiments should be designed with many
replicates (>3)
– Time course experiments should be designed
with many time points (>4).
– Investigate part of the experiment by
microarrays and use the rest for further
validations.
Statistical validation
• Statistical validation can be performed using
parametric and non-parametric tests.
• Parametric tests:
– The populations under analysis are normally
distributed.
• Non-parametric tests:
– There is no assumption on the samples’ distribution.
• Non-parametric tests are less sensitive than
parametric tests.
Selecting differentially expressed
genes
Statistical validation
method I
Statistical validation
method II
Differential expression
linked to a specific
biological event.
Statistical validation
method III
Selecting differentially expressed
genes
• Each method grasps some true signals but
not all.
• Each method catches some false signals.
• The trick is to find the best condition to
maximize true signals while minimizing
fakes.
t = (m_T – m_C) / sqrt(σ_T²/n_T + σ_C²/n_C)
[Figure: two overlapping distributions, Population Ctrl (mean C) and Population Trtd (mean T), with a sample mean “s”.]
Less than a 5% chance that the sample with mean s came from population C,
i.e., s is significantly different from “mean C” at the p < 0.05 significance level.
But we cannot reject the hypothesis that the sample came from population T.
• The t-statistic is widely used for assessing
differential expression.
• Unstable variance estimates that arise
when sample size is small can be
corrected using:
– Error fudge factors (SAM)
– Bayesian methods (Limma)
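The limma route (which affylmGUI/oneChannelGUI use internally) can be sketched as a gene-wise linear model followed by empirical Bayes moderation; `eset.iqr` and the Target labels are assumed from the earlier sketches:

library(limma)

groups <- factor(pData(eset.iqr)$Target)     # e.g. "control" / "treatment" (assumed labels)
design <- model.matrix(~ groups)             # intercept + treatment effect

fit <- lmFit(eset.iqr, design)               # gene-wise linear models
fit <- eBayes(fit)                           # empirical Bayes shrinkage of the variances
topTable(fit, coef = 2, adjust.method = "BH", number = 10)   # top genes, BH-adjusted p-values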
SAM
Significance Analysis of
Microarray
SAM
(Significance analysis of microarrays)
(Tusher et al. 2001)
The fudge factor s0 regularizes
the t-statistic
by inflating the
denominator:
d(i) = r(i) / (s(i) + s0), where r(i) is the difference between the group means for gene i.
s(i) is the pooled standard deviation, taking into account differing
gene-specific variation across arrays.
Some SAM designs
Two-class unpaired: to pick out genes whose
mean expression level is significantly different
between two groups of samples (analogous to
between subjects t-test).
Two-class paired: samples are split into two
groups, and there is a 1-to-1 correspondence
between a sample in group A and one in group B
(analogous to a paired t-test).
Multi-class: picks up genes whose mean expression is
different across > 2 groups of samples (analogous to
one-way ANOVA)
Others: ………………………………………………………………
• SAM uses data permutations to define a
set of significant differential expression.
[Diagram: the original class labels (N N N T T T) are repeatedly permuted, e.g. N T N T N T, T N T N T N, …, and the statistic is recomputed for each permutation to build its null distribution.]
FDR is given by p0 × (number of false calls) / (number of called genes).
p0 is the prior probability π0 that a gene is not differentially expressed.
How does SAM calculate the False Discovery
Rate for a specific delta?
[Table: number of genes falsely called in each of the permutations; the mean number of false calls across permutations is used to estimate the FDR.]
SAM analysis can be performed in
Bioconductor using the siggenes library.
Two-class or multi-class analysis is
selected automatically based on the
structure of the Target information.
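A command-line sketch of the same analysis with siggenes (the class labels and the chosen delta are assumptions for illustration):

library(siggenes)

cl  <- ifelse(pData(eset.iqr)$Target == "treatment", 1, 0)   # two-class labels (assumed names)
out <- sam(exprs(eset.iqr), cl, rand = 123)   # permutation-based SAM statistics

print(out)                                    # delta table: FDR versus number of genes called
summary(out, 1.5)                             # genes called at a chosen delta (here 1.5)
plot(out, 1.5)                                # SAM plot for that delta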
The delta table shows the user
the information
related to the amount of
differentially expressed
genes given a certain FDR.
The user selects a delta
value and checks the
behaviour of the differentially
expressed genes.
Subsequently the user
performs a log2(fold change)
filter to produce a table of
differentially expressed
genes.
The table can be saved in a
tab delimited file.
Columns of the SAM output table: relative difference
in gene expression; standard
deviation; fold change; raw p-value; significance measurement
derived from the raw p-value.
Exercise 9 (20 minutes)
• Use the estrogen.IGF1 data set.
• Perform a two class analysis on ctrl versus
IGF1 treatment, after IQR and/or intensity
filter.
• Select a FDR ≤ 0.1 and a |log2(FC)| ≥ 1
• Save results in ex9.xls