Transcript Low-level Analysis of Microarray Data
Low-level Analysis of Microarray Data
Mathematical Statistics Centre for Mathematical Sciences Lund University, Sweden 2004-10-01
2
Why are microarrays important?
Genome projects have identified lots of new genes, but for many we do not know what they are doing.
"Functional Genomics"
Application areas:
Cancer, Inflammation, Infectious diseases Toxicology - drug (side) effects Developmental biology
3
Papers
A) H. Bengtsson, B. Calder, I. S. Mian, M. Callow, E. Rubin, and T. P. Speed.
Identifying Differentially Expressed Genes in cDNA Microarray Experiments: making aging visible
. 2001.
B) H. Bengtsson.
Identification and normalization of plate effect in cDNA microarray data
. 2002.
C) H. Bengtsson.
The R.oo package - object-oriented programming with references using standard R code
. 2003.
D) H. Bengtsson and O. Hössjer.
Methodological study of affine transformations of gene expression data with proposed normalization method
. 2003.
E) H. Bengtsson, G. Jönsson, and J. Vallon-Christersson.
Calibration and assessment of channel-specific biases in microarray data with extended dynamical range
. 2003.
F) H. Bengtsson. aroma -
An R Object-oriented Microarray Analysis environment
. 2004.
G) H. Bengtsson and O. Hössjer.
Affine calibration for microarrays with dilution series or spike-ins
. 2004.
Central Dogma of Molecular Biology
4 Idea of microarrays (and other gene expression techniques):
Measure the amount of mRNA to find genes that are expressed (active)
.
(measuring protein might be better, but is currently much harder)
7
Which genes are differentially transcribed?
same-same tumor-normal
8
Statistics 101:
bias accuracy
9
Central dogma of data analysis
Can always increase sensitivity on the cost of specificity, or vice versa, the art is to find the best trade-off!
X X X X X X X X X
10
Printing / spotting
11
Microarray slide preparation
J. Vallon-Christersson, Dept Oncology, Lund Univ.
RNA
RNA extraction & hybridization
(targets) Tumor sample Reference sample RNA Extract mRNA from samples Reverse transcription of mRNA to cDNA Label with Cy3 or Cy5 fluorescent dye Hybridize labeled cDNA to slide Wash slide cDNA cDNA Hybridize 12 (probes) Figure: Hybridization chamber.
Two-channel scanning
green laser excitation red laser emission overlay images
13 higher frequency, more energy lower frequency, less energy
14
Combined color image for visualization
15
Signal quantification
1. Addressing
Locate spot centers.
2. Segmentation
Classification of pixels either as signal or background (using circles, seeded region growing or other).
3. Signal quantification
a) foreground estimates : R fg , G fg b) background estimates : R bg , G bg c) ... (shape, size etc) Terry Speed et al.
16 cDNA clones printing
Summary
green laser excitation red laser PCR product amplification purification
RNA Test sample Reference sample RNA
emission
cDNA cDNA
overlay images scanning
Hybridize Production
data: (R fg ,G fg ,R bg ,G bg , ...)
17
Sources of variation
amount of RNA in the biopsy efficiencies of -RNA extraction -reverse transcription -labeling -fluorescent detection Systematic o measurements o similar effect on many corrections can be estimated from data probe purity and length distribution spotting efficiency, spot size cross-/unspecific hybridization stray signal Stochastic o plicitely accounted for o too random to be ex remain as “noise”
Normalization Error model
Exploratory data analysis: (R,G)
(log 2 R,log 2 G)
(M,A)
18 R vs G
R
= red channel signal
G
= green channel signal log(R) vs log(G) M vs A
M
= log
2
(
R
/
G
) (log-ratio),
A
= ½ ·log
2
(
R
·
G
) (log-intensity)
19
Exploratory data analysis: the need for normalization
In self-self comparisons, find stochastic as well as systematic deviations from M=0.
Idea
: Do various exploratory plots to understand the nature of these deviations: for example, M vs A, spatial plots, density & boxplots plots, print order plots etc.
20
21
22
23
24
Within-slide normalization
Classical methods:
No normalization Median normalization
Shift of log-ratios.
Curve-fit normalization methods.
A.k.a. lo(w)ess normalization.
etc.
25
Global median normalization
Assumption:
Changes roughly symmetric.
M’
=
M
– median(
M
) aroma: normalizeWithinSlide(ma, method=“m”)
Global curve-fit normalization cont.
curve-fit normalization
done for
all spots together
M i ’
=
M
i – c(
A
i ) (Biased towards the green channel & intensity dependent artifacts) 26 aroma: plot(ma); normalizeWithinSlide(ma, method=“l”); plot(ma)
Print-tip curve-fit normalization
Print-tip curve-fit normalized data {(
A,M
)}
n=1..N
:
M’ i,p = M i,p - c p
(
A i
) ;
p
= print tip 1-16 where
c p
is an
intensity dependent
function for print tip
p .
27 aroma: plot(ma); lowessCurve(ma, groupBy=“printtip”); boxplot(ma, groupBy=“printtip”)
28
Print-tip curve-fit normalization cont.
curve-fit normalization
done for each
print-tip group
separately aroma: normalizeWithinSlide(ma, “p”); boxplot(ma, groupBy=“printtip”); plotSpatial(ma)
29
Across-slides normalization
The
within-slide
normalization makes the red and the green signals comparable. What about signals from different arrays?
Across-slide normalization
attacks this problem.
Idea
: Log-ratios (M values) from different slides should have similar variance.
Example
: If the standard deviation of the log-ratios from one array is 0.50 and from another array 0.78, the log-ratios have to be rescaled to get the same standard deviation.
30
Paper A
H. Bengtsson, B. Calder, I. S. Mian, M. Callow, E. Rubin, and T. P. Speed.
Identifying Differentially Expressed Genes in cDNA Microarray Experiments: making aging visible
. Online report accompanying the discussion forum with the same name. Science SAGE KE, 2001 (12), vp8.
• Illustrates typical normalization of two-color microarray data • Short introduction on how to identify differentially expressed genes.
• Basic study on how many replicated arrays are needed.
• Comparison between two image-analysis methods.
31
Paper B
H. Bengtsson.
Identification and normalization of plate effect in cDNA microarray data.
Preprints in Mathematical Sciences 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2002.
32
Printing of spots
Hmm... why the horizontal stripes?
1
6384 spots printed onto 9 slides in total 399 print turns using 4x4 print-tips...
16
33
Print-order plot of log-ratios
The spots are order according to
when
they were spotted/dipped onto the glass slide(s). Note that it takes hours/days to print all spots on
all
arrays.
plotPrintorder(ma)
34
Remove (constant) plate biases?
Will remove some of the intensity dependent effects...
...and some of the spatial artifacts.
35
Intensity normalization plate by plate?
...plus a print-tip normalization?
Removes the plate effects...
...and most of the spatial artifacts.
Comparison of normalization methods
Median absolute deviation (MAD) for gene
i
with replicates
j=1,2,...,J
:
d i = 1.4826 · median | r ij |
where
r ij = M ij – median M ij
is residual
j
for gene
i
.
Ex: two different genes:
d a
<
d b
The
measure of reproducibility
(small is good) is a scalar defined as the mean of
all
genewise MADs:
M.O.R. =
d i / N
where
N
is the number of genes. 36
Remember that minimizing variance alone is not useful - we also need to consider bias
Results
• Doing platewise intensity dependent normalization lowers the gene variability by another ~10% from print-tip norm . • Background correction increases variability (but might reduce bias) • Using
measure of reproducibility
is helpful in deciding what to do.
Pl
– Constant platewise norm.,
Pl(A)
dep. slidewise norm.,
Pr(A)
– Intensity dep. platewise norm.,
Sl(A)
– Intensity dep. print-tip-wise norm.,
sPr(A)
– Intensity – Scaled intensity dep. print-tip-wise norm.
,
bg
– background corrected data.
40
Paper D
H. Bengtsson and O. Hössjer.
Methodological study of affine transformations of gene expression data with proposed normalization method
. Preprints in Mathematical Sciences 2003:38, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2003.
41
A general model
y c,i = f c (x c,i ) + c,i genes
i
= 1
,
2
, . . . , I
RNA extracts
c
= 1
,
2
, . . . , C x c,i
the
true gene expression level y c,i
c,i the
observed gene expression level
measurement noise, E[ c,i ] = 0 f c (·)
measurement function
is (unknown)
42
Measurement functions model the biochemical and physical measurement procedure
43
Recall the M vs A transform
Especially in two-channel microarray analysis the M vs A transform is useful: where
x R,i
and
x G,i
are the true gene expression levels. Since these are unknown, it is a common practice to plug in the observed expression levels
Linear models and M vs A
In the literature, it common to assume a linear measurement function, either explicitly or implicitly: where b c is the scale factor for channel intensities become
c = R,G
. The observed log-ratios and log 44 where =b R /b G is the relative scale factor.
Systematic effects : The overall shift in the log-ratios is log 2 , which is constant.
Examples of bias with linear measurement functions 45 Balance channels: = b R /b G = 1 ) log 2 = 0 Red too strong: = b R /b G = 4/1 ) log 2 = +2 Green too strong: = b R /b G = 1/4 ) log 2 = -2
Affine measurement functions
Assume that the measurement function is
affine
: where a c and b c are channel-specific bias and scale parameters.
The observed log-ratios become 46 where i = a R a G (x R,i / x G,i ) , =b R /b G , and A i is the observed log intensity for gene
i
.
Systematic effects
:
i)
Constant shift of log2 . ii) intensity-dependent bias.
Note that the second bias term is fold-change dependent! Moreover, if
a R
the second bias term is zero as before.
=
a
G = 0 (linear),
Examples of bias with affine measurement functions Assume different channel biases, i.e.
a
R ≠
a
G and =
b
R /
b
G =1.
47 No bias: (a R ,a G )=(0,0) Small bias in red: (a R ,a G )=(20,0) Small bias in red, large bias in green: (a R ,a G )=(20,200)
Theoretical examples of affine transformations Assume different channel biases, i.e.
a
R ≠
a
G , but same =
b
R /
b
G =0.57.
48 (a R ,a G )=(0,0) = 0.57
(a R ,a G )=(20,200) = 0.57
(a R ,a G )=(80,140) = 0.57
The blue dash-dot curves are non-differentially expressed genes. Colored curves above and below are genes with fold-change log 2 (
x
R,i /
x
G
,
i ) =
±
1
, ±
2
, ...
49
… and with data:
50
Robust affine normalization
Consider a two-channel (
C=
{R,G}) microarray with genes
i=1,...,I
for which we observe With that = a R a G and = b R /b G , we have for non-differentially expressed genes (no noise) The red and green signals,
y
i = (y R,i ,y G,i ), then scatter around the line with intercept and slope . Next, robust estimates of and .
51
IWPCA – iterative reweighted PCA
Define the objective function w i is a weight, where r i ( , ;
y
i ) > 0 is the Euclidean distance between
y
i line L( , ). The estimate of and is then and the With weights w i = 1 we obtain standard PCA (minimization in L 2 ). With weights we minimize in L 1 , if we let
!
0.
Algorithm
: Estimate and . Update weights. Reestimate and and so on.
52
Additional constraints
With estimates of = a R a G and = b R /b G , we impose the additional constraint that after normalization
no negative signal may exists
(less conservative constraint may also be used). In addition, let b R =1. The robust affine normalization is then Example: Estimates (a R ,a G , ) = (21.5,29.3,0.163).
53
Generalization to several channels/arrays
The robust affine normalization generalizes directly • to microarrays with multilple channels, i.e. C=3,4,... . • to multiple arrays, if the experimental design allows, i.e. K=2,3,... .
In addition: • No between-array normalization is needed (”it’s included”).
• Normalization to a common reference is straightforward. That is, if all arrays share the same reference, first all reference channels is normalized toward each other to form a ”baseline channel”, then test channels are normalized toward the baseline channel.
Usage (in aroma):
normalizeAffine(rg)
54
Paper E
H. Bengtsson, G. Jönsson, and J. Vallon-Christersson.
Calibration and assessment of channel-specific biases in microarray data with extended dynamical range.
Preprints in Mathematical Sciences 2003:37, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2003.
55
The microarray technology
(again)
56
Scan protocol
By scanning the
same
array at
various
PMT gains (and
constant
laser power) we can learn about
h c
. When we scan at gains
v
and
w
, we obtain two different measurements of the
unknown
, but
fixed
amount of hybridized DNA (
x c
,
i
): for both channels
c
=
{ R
,
G }
and all genes
i = 1,...,I
.
Observed signals at various gains
zoom 57 Channel:
c
=
G
.
PMT voltage pairs (
v
,
w
): (800,500) (600,500) (700,500) (800,600) (700,600) (800,700) All PMT pairs converge at the same point (
e c ,e
c ) and not at the origin (0,0).
Affine measurement function:
58
General model and estimation
The K observed signals in channel c for each gene i: scatter around the line with The line L(
a
c ,
b
c ) can be estimate robustly using IWPCA as before. Next, define Bias parameters
a
c are uniquely identified by the additional constraint that ||
d
c || ¼ 0, i.e.
59 Backtransform:
Multiscan calibration
Within-channel M vs A:
60
Results
By scanning the same microarray at various sensitivity levels one obtains: • calibrated signals with scanner biases removed • an extended dynamical range • improved signal-to-noise levels • better understanding of the noise structure (of the scanner)
61
Paper G
H. Bengtsson and O. Hössjer.
Affine calibration for microarrays with dilution series or spike ins.
Preprints in Mathematical Sciences 2004:19, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2004.
62
Model
gene
i=1,...,I
replicated spots
j=1,...,J
at various concentrations (dilution series).
channel
c
. First, assume that the concentration of fluorophores
b i,j
replicated spots of a gene
i
is the same: of all
z c,i,j = z c,i
Model the total amount of light
x c,i,j
from spot
(i,j)
as
x c,i,j = b i,j z c,i
63
Model cont.
We observe
y c,i,j = a c + b c x c,i,j +
c,i,j
= a c + b c b i,j z c,i +
c,i,j c,i,j independent zero-mean noise with variance c,i,j 2 .
a c
and
b c
offset and scale parameters for channel
c
In the paper, the variance function c,i,j 2 = c,i 2 =
k c
i 2 is used. The offset parameter
a c
can now be uniquely estimated using
maximum likelihood techniques
utilizing numerical optimization and PCA.
b c
are estimated as before.
64
Probe calibration
Before After C=2, I=432, J=4 ten dilution series are highlighted
65
In addition the method gives...
Estimates of the
relative expression level of the dilution series
follow directly from the maximum likelihood procedure.
66
Result
Arrays should contain multiple spots for the same gene that have different dilutions (instead of identical replicates) Systematic treatment of bias (optical background), expression ratios for the dilution series ("genes"), and noise.
67
Good scientific software is like a good scientific publication
reproducible peer-reviewed easily accessible by other researchers, society
builds on the work of others other will build their work on top of it
commercialize those developments that are successful
68
Why Open Source?
so that you can find out how it is being used so that you can modify what algorithm is being used, and these algorithms to try out new ideas or to accommodate local conditions or needs so that they can be used as components
Transparency
Pursuit of reproducibilty
Efficiency of development
Training
69
Paper C
H. Bengtsson.
The R.oo package - object-oriented programming with references using standard R code.
In Kurt Hornik, Friedrich Leisch, and Achim Zeileis, editors, Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), Vienna, Austria, March 2003
70
R.oo package
Objectives:
• Better and more robust object-oriented design and implementation • More memory efficient and faster implementations • Easy to understand and to extend • User- and developer-friendly interface • Core of the microarray package (
aroma
) •
Methods:
• Single root class
Object
all classes inherits from.
• Provide support for
reference variables.
R Coding Conventions (RCC),
e.g. naming conventions (
MyClassName,myObject, myFunction()
) • Free and open-source (5500 code lines).
• Proof of concept, rich documentation with many examples (~100 pages).
71
Paper F
H. Bengtsson.
aroma - An R Object-oriented Microarray Analysis environment.
Preprints in Mathematical Sciences 2004:18, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2004.
72
aroma package
Objectives:
• Provide simple methods for low-level analysis of microarray data • Wide support for file formats • User- and developer-friendly interface • Easy to understand and easy to extend • Easy to maintain
Methods:
• Make use of R.oo
• Make use of R.classes pkgs (20000 lines of code & > 400 pages of docs) • Follow R Coding Conventions • Free and open-source (18000 code lines) • Easy to install anywhere.
• Rich documentation with examples (>280 pages)
73
A short example
> gpr <- MicroarrayData$read("Slide1.gpr") > gpr [1] "GenePixData: Number of slides: 3. Number of fields: 43. Layout: Grids: 4x4 (=16), spots in grids: 18x18 (=324), total numberof spots: 5184. Spot names are specified. Spot ids are specified." > plotSpatial(gpr) > raw <- getRawData(gpr) > ma <- getSignal(raw) > idx <- (abs(ma$M) > 1.5) > plot(ma) > highlight(ma, idx, col="blue") > normalizeAffine(ma) > ...