Transformations… what for? which one? Wolfgang Huber, Div. Molecular Genome Analysis, DKFZ Heidelberg

Transformations…
what for?
which one?
Wolfgang Huber
Div. Molecular Genome Analysis
DKFZ Heidelberg
Microarray intensities $x_1, \ldots, x_n$
Log-ratio with/without background correction: $\log\frac{x_i - f(x_i, \ldots)}{x_j - f(x_j, \ldots)}$
Shrunken log-ratio (BHM): $\log\frac{x_i + \nu}{x_j + \nu}$
Variance stabilized log-ratio (= generalized log-ratio, “glog”): $\log\frac{x_i + \sqrt{x_i^2 + c_i^2}}{x_j + \sqrt{x_j^2 + c_j^2}}$
o How do you like to think about (interpret) it?
o How do you estimate the parameters?
o What comes out (the “bottom line”)?
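ratios and fold changes
Fold changes are useful to describe continuous changes in expression
[Figure: expression levels of 1000, 1500 and 3000 across conditions A, B, C, i.e. fold changes of ×1.5 and ×3]
But what if the gene is “off” (below detection limit) in one condition?
[Figure: expression levels of 3000, 200 and 0 across conditions A, B, C; the fold changes are marked “?”]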
ratios and fold changes
Many interesting genes will be off in some of the conditions of interest
1. If you want the expression measure (“net normalized spot intensity”) to be an unbiased estimator of abundance
→ many values ≤ 0
→ need something more than the (log-)ratio
2. If you let the expression measure be biased (always > 0)
→ can keep ratios
→ how do you choose the bias?
ratios and fold changes
Ratios are scale-free:
$f(y_i, y_j) = y_i / y_j$ or $\log(y_i / y_j)$
But there is (at least) one absolute scale in the data:
$\sigma_{bg} = \mathrm{sd}(Y_i \mid E(Y_i) = 0)$
Can we use this to construct useful functions $f(y_i, y_j, \sigma_{bg})$?
In the following:
o How to compare microarray intensities with each other?
o How to incorporate measurement uncertainty (“variance”)?
o How to simultaneously and consistently deal with calibration (“normalization”)?
Sources of variation
amount of RNA in the biopsy
efficiencies of
- RNA extraction
- reverse transcription
- labeling
- fluorescent detection
probe purity and length distribution
spotting efficiency, spot size
cross-/unspecific hybridization
stray signal
Systematic
o similar effect on many measurements
o corrections can be estimated from data
→ Calibration
Stochastic
o too random to be explicitly accounted for
o remain as “noise”
→ Error model
modeling ansatz
measured intensity = offset + gain × true abundance
$y_{ik} = a_{ik} + b_{ik} x_k$
$a_{ik} = a_i + \varepsilon_{ik}$
$a_i$: per-sample offset
$\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: “additive noise”
$b_{ik} = b_i b_k \exp(\eta_{ik})$
$b_i$: per-sample normalization factor
$b_k$: sequence-wise probe efficiency
$\eta_{ik} \sim N(0, s_2^2)$: “multiplicative noise”
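As a worked step, the mean and variance implied by this ansatz can be written out. The sketch below assumes that $\varepsilon_{ik}$ and $\eta_{ik}$ are independent, which the slide does not state, and introduces the shorthand $m$ and $u$ purely for illustration.

```latex
\begin{align*}
m &:= b_i b_k x_k, \qquad u := E(y_{ik}) = a_i + m\, e^{s_2^2/2}, \\
\operatorname{Var}(y_{ik})
  &= m^2\, e^{s_2^2}\left(e^{s_2^2} - 1\right) + b_i^2 s_1^2
   = \left(e^{s_2^2} - 1\right)(u - a_i)^2 + b_i^2 s_1^2 .
\end{align*}
```

Under these assumptions the variance is a quadratic function of the mean: a constant part from the additive noise plus a part proportional to $(u - a_i)^2$ from the multiplicative noise.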
The two-component model
[Figure: data on the raw scale and on the log scale, with the regions dominated by “additive” noise and by “multiplicative” noise labelled]
B. Durbin, D. Rocke, JCB 2001
variance stabilizing transformations
$X_u$ a family of random variables with $E X_u = u$, $\operatorname{Var} X_u = v(u)$. Define
$f(x) = \int^x \frac{1}{\sqrt{v(u)}}\, du$
⇒ $\operatorname{Var} f(X_u) \approx$ independent of $u$
derivation: linear approximation
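The linear approximation can be spelled out in three lines. This is the standard first-order (delta method) argument, written here as a sketch; it is accurate only when $X_u$ stays close to $u$ relative to the curvature of $f$.

```latex
\begin{align*}
f(X_u) &\approx f(u) + f'(u)\,(X_u - u) \\
\operatorname{Var} f(X_u) &\approx f'(u)^2\, v(u) \\
f'(u) = \frac{1}{\sqrt{v(u)}} &\;\Rightarrow\; \operatorname{Var} f(X_u) \approx 1
\end{align*}
```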
variance stabilizing transformations
[Figure: transformed scale f(x) (axis from about 8.0 to 11.0) plotted against the raw scale x (axis from 0 to 60000)]
variance stabilizing transformations
$f(x) = \int^x \frac{1}{\sqrt{v(u)}}\, du$
1.) constant variance (‘additive’): $v(u) = s^2 \Rightarrow f \propto u$
2.) constant CV (‘multiplicative’): $v(u) \propto u^2 \Rightarrow f \propto \log u$
3.) offset: $v(u) \propto (u + u_0)^2 \Rightarrow f \propto \log(u + u_0)$
4.) additive and multiplicative: $v(u) \propto (u + u_0)^2 + s^2 \Rightarrow f \propto \operatorname{arsinh}\frac{u + u_0}{s}$
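Case 4 can be checked numerically. The following is a minimal sketch, not the vsn implementation: it simulates intensities with zero offset and unit gain, uses made-up noise levels (`s_add`, `s_mult` are hypothetical values chosen for illustration), and compares the standard deviation on the raw scale with that after the arsinh transformation.

```python
import math
import random
import statistics

S_ADD = 50.0   # hypothetical sd of the additive noise
S_MULT = 0.1   # hypothetical sd of the multiplicative noise (log scale)

def simulate_intensities(abundance, n, rng):
    """Draw n intensities y = abundance * exp(eta) + eps (offset 0, gain 1)."""
    return [
        abundance * math.exp(rng.gauss(0.0, S_MULT)) + rng.gauss(0.0, S_ADD)
        for _ in range(n)
    ]

def glog(y):
    """Integral of 1/sqrt(v) for v(u) = S_MULT^2 * u^2 + S_ADD^2."""
    return math.asinh(S_MULT * y / S_ADD) / S_MULT

rng = random.Random(1)
raw_sd, glog_sd = {}, {}
for abundance in (0.0, 500.0, 5000.0, 50000.0):
    y = simulate_intensities(abundance, 20000, rng)
    raw_sd[abundance] = statistics.stdev(y)
    glog_sd[abundance] = statistics.stdev([glog(v) for v in y])
    print(f"{abundance:8.0f}  raw sd {raw_sd[abundance]:9.1f}"
          f"  glog sd {glog_sd[abundance]:5.2f}")
```

On the raw scale the standard deviation grows with abundance, whereas on the transformed scale it should stay close to 1 across the whole range.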
the “glog” transformation
[Figure: f(x) = log(x) (dashed) and h_s(x) = asinh(x/s) (solid), plotted against intensity from -200 to 1000]
$\operatorname{arsinh}(x) = \log\left(x + \sqrt{x^2 + 1}\right)$
$\lim_{x \to \infty} \left(\operatorname{arsinh} x - \log x - \log 2\right) = 0$
P. Munson, 2001
D. Rocke & B. Durbin, ISMB 2002
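A small sketch of what these two formulas mean in practice. The function `glog_ratio` and the value 100 for the constants `c_i`, `c_j` are hypothetical choices for illustration; in an actual analysis the constants would be estimated from the data.

```python
import math

def glog_ratio(x_i, x_j, c_i=100.0, c_j=100.0):
    """Generalized log-ratio log((x_i + sqrt(x_i^2 + c_i^2)) / (x_j + sqrt(x_j^2 + c_j^2)))."""
    numerator = x_i + math.hypot(x_i, c_i)
    denominator = x_j + math.hypot(x_j, c_j)
    return math.log(numerator / denominator)

# For large intensities the glog-ratio is close to the ordinary log-ratio.
high = glog_ratio(3000.0, 1000.0)
print(high, math.log(3000.0 / 1000.0))

# Unlike the log-ratio, it stays finite when one intensity is zero or negative.
off = glog_ratio(3000.0, 0.0)
negative = glog_ratio(3000.0, -20.0)
print(off, negative)

# arsinh(x) approaches log(x) + log(2) for large x.
gap = math.asinh(1e6) - math.log(1e6) - math.log(2.0)
print(gap)
```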
parameter estimation
measured intensity = offset + gain × true abundance
$y_{ik} = a_{ik} + b_{ik} x_{ik}$
$a_{ik} = a_i + L_{ik} + \varepsilon_{ik}$
$a_i$: per-sample offset
$L_{ik}$: local background provided by image analysis
$\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: “additive noise”
$b_{ik} = b_i b_k \exp(\eta_{ik})$
$b_i$: per-sample normalization factor
$b_k$: sequence-wise labeling efficiency
$\eta_{ik} \sim N(0, s_2^2)$: “multiplicative noise”
$\operatorname{arsinh}\frac{Y_{ki} - a_i}{b_i} = \mu_k + \varepsilon_{ki}, \quad \varepsilon_{ki} \sim N(0, c^2)$
o maximum likelihood estimator: straightforward, but sensitive to deviations from normality
o model holds for genes that are unchanged; differentially transcribed genes act as outliers.
o robust variant of ML estimator, à la Least Trimmed Sum of Squares regression.
o works as long as <50% of genes are differentially transcribed
Least trimmed sum of squares regression
minimize $\sum_{i=1}^{n/2} \left(y_{(i)} - f(x_{(i)})\right)^2$
[Figure: scatter plot of y against x (both axes from 0 to 8) with two fitted lines: least sum of squares and least trimmed sum of squares]
P. Rousseeuw, 1980s
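To make the criterion concrete, here is a minimal sketch of least trimmed squares for a straight line, where the sum runs over the n/2 points with the smallest squared residuals. The search strategy (random two-point starts followed by "concentration steps" that refit on the best-fitting half) is a common heuristic swapped in for illustration, not necessarily what any particular package does. The data are synthetic, with 40% of the points shifted upwards to play the role of outliers.

```python
import random

def ols_line(points):
    """Ordinary least squares fit y = intercept + slope * x (needs distinct x values)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def trimmed_loss(points, intercept, slope, h):
    """Sum of the h smallest squared residuals."""
    squared = sorted((y - intercept - slope * x) ** 2 for x, y in points)
    return sum(squared[:h])

def lts_line(points, n_starts=50, n_csteps=20, seed=0):
    """Approximate LTS fit: keep the best result over several random starts."""
    rng = random.Random(seed)
    h = len(points) // 2  # trimmed sum over n/2 points, as on the slide
    best = None
    for _ in range(n_starts):
        intercept, slope = ols_line(rng.sample(points, 2))
        for _ in range(n_csteps):
            # Concentration step: refit on the h points closest to the current line.
            ranked = sorted(points, key=lambda p: (p[1] - intercept - slope * p[0]) ** 2)
            intercept, slope = ols_line(ranked[:h])
        loss = trimmed_loss(points, intercept, slope, h)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]

# Synthetic data: true line y = 1 + 0.5 x; 40% of the points are shifted up by 3.
rng = random.Random(42)
points = []
for i in range(100):
    x = 0.08 * i
    shift = 3.0 if i % 5 < 2 else 0.0
    points.append((x, 1.0 + 0.5 * x + shift + rng.gauss(0.0, 0.1)))

ols_intercept, ols_slope = ols_line(points)
lts_intercept, lts_slope = lts_line(points)
print("least squares:        ", ols_intercept, ols_slope)
print("least trimmed squares:", lts_intercept, lts_slope)
```

The ordinary fit is pulled towards the shifted points, while the trimmed fit follows the majority, which is the behaviour wanted when differentially transcribed genes act as outliers.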
evaluation: effects of different data transformations
[Figure: difference red-green plotted against rank(average) for different data transformations]
Normality: QQ-plot
evaluation: sensitivity / specificity in detecting differential abundance
o Data: paired tumor/normal tissue from 19 kidney cancers, in color flip duplicates on 38 cDNA slides with 4000 genes each.
o 6 different strategies for normalization and quantification of differential abundance
o Calculate for each gene & each method: t-statistics, permutation p-value
o For threshold $\alpha$, compare the number of genes the different methods find, $\#\{p_i \mid p_i \le \alpha\}$
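For a single gene, the t-statistic and a permutation p-value might be computed as sketched below. The slides do not say which permutation scheme was used; because the design is paired, this sketch assumes a sign-flip scheme on the within-pair differences, and the eight differences are invented for illustration.

```python
import itertools
import math

def t_statistic(diffs):
    """One-sample t-statistic of the paired differences."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

def permutation_p_up(diffs):
    """One-sided (up) p-value: share of sign patterns with t >= observed t.

    Assumes the absolute differences are not all equal, so that the
    variance never vanishes for any sign pattern.
    """
    observed = t_statistic(diffs)
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        t = t_statistic([s * d for s, d in zip(signs, diffs)])
        hits += t >= observed - 1e-12
        total += 1
    return hits / total

# Hypothetical tumor-minus-normal differences on the transformed scale for one gene.
diffs = [0.9, 1.4, 0.3, 1.1, 0.7, 1.6, 0.5, 1.2]
p_up = permutation_p_up(diffs)
print(p_up)
```

Enumerating all sign patterns is only feasible for a handful of pairs; with 19 pairs one would typically sample random sign patterns instead. The one-sided test for down follows by counting patterns with t at most the observed value.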
evaluation: comparison of methods
[Figure: two panels, one-sided test for up and one-sided test for down]
more accurate quantification of differential expression ⇒ higher sensitivity / specificity

evaluation: a benchmark for Affymetrix
genechip expression measures
o Data:
Spike-in series: from Affymetrix 59 x HGU95A,
16 genes, 14 concentrations, complex background
Dilution series: from GeneLogic 60 x HGU95Av2,
liver & CNS cRNA in different proportions and amounts
o Benchmark:
15 quality measures regarding
-reproducibility
-sensitivity
-specificity
Put together by Rafael Irizarry (Johns Hopkins)
http://affycomp.biostat.jhsph.edu
affycomp results (28 Sep 2003)
[Figure: benchmark results, colour-coded from good to bad]
ROC curves
Stratification
position- and sequence-specific effects $w_i(s)$:
$\log Y = \log x + \sum_{i=1}^{25} w_i(s_i) + \varepsilon$
Naef et al., Phys Rev E 68 (2003)
[Figure: $w_i$ plotted against position $i$]
collaboration with R. Irizarry
glog versus "sliding z-score"
[Figure: comparison of glog with the sliding z-score]
Availability
o implementation in R
o open source package vsn on www.bioconductor.org
o Bioconductor is an international collaboration on open source software for bioinformatics and statistical omics
What to do with the gene lists: the functional genomics pipeline @ DKFZ
High-throughput transcriptome sequencing: clones with unannotated full-length ORFs
Neoplastic diseases: association of mRNA profiles with
- genetic aberrations
- histopathology
- clinical behavior
functional characterization
HT functional assays
Library of "unknown" transcripts (S. Wiemann, D. Arlt)
GFP-ORF protein expression clone
BrdU incorporation
automated microscope
Rainer Pepperkok, EMBL
Image segmentation and quantification
- DAPI: identification
- CFP: expression
- BrdU: proliferation
proliferation: + activator, - inhibitor
SMP Cell
Detection of modulators of cell proliferation by automated image analysis
Measurement of fluorescence intensities
[Figure: image panels DAPI, ORF-YFP, Anti-BrdU/Cy5 and overlay YFP – Cy5, annotated with per-cell intensities]
YFP channel: 72.0, 71.0, 119.7, 87.3, 149.5, 70.2, 84.7, 103.1, 81.0, 2621.8, 74.1, 156.8, 169.0, 105.5, 156.0, 76.5, 135.2, 86.2, 77.7, 92.6, 104.6, 481.2, 539.0, 95.0, 156.7
Cy5 channel: 761.0, 684.1, 779.0, 820.2, 645.6, 536.1, 799.5, 912.8, 916.7, 267.6, 766.2, 866.6, 819.8, 757.7, 367.8, 746.2, 731.2, 567.3, 896.3, 1095.4, 633.3, 567.7, 663.9, 726.2, 842.1
Other annotated values: 231.6, -4.8, 68.5, 80.9
Dorit Arlt
Statistical analysis of cellular assay data
detect transfection effect: inh ($p = 10^{-8}$)
[Figure: density of brdU intensity (x axis from 0 to 250, y axis from 0.000 to 0.012) for control cells and transfected cells]
Plate summary plot
[Figure: plate "dorit6", rows A to H by columns 1 to 12, colour scale from -6 to 6 (activation, inhibition)]
Cellular assays: challenges for statisticians
o Image analysis: pattern recognition, classification
o Low-level analysis: what are good models for calibration, “normalization”, data transformation?
o High-level analysis: models for the dependence of cellular processes on over-/underexpression of genes; connect results from different assays, microarray data
Summary
o log-ratio $\Delta\log$: what about genes that are not expressed in some of the conditions of interest?
o generalized log-ratio $\Delta h$: a useful extrapolation
- interpretability
- sensitivity
- specificity
- computational convenience
o what to do with the gene lists? systematic (high throughput) functional assays
Acknowledgements
DKFZ Heidelberg
Molecular Genome
Analysis
MPI Molekulare Genetik
Anja von Heydebreck
Martin Vingron
Annemarie Poustka
Holger Sültmann
Andreas Buneß
Markus Ruschhaupt
Katharina Finis
Jörg Schneider
Klaus Steiner
Uni Heidelberg
Günther Sawitzki
Stefan Wiemann
Dorit Arlt
DFCI Harvard
Robert Gentleman
UMC Leiden
Judith Boer
RZPD
Anke Schroth
Bernd Korn
EMBL
Urban Liebel
...and many more!

Models are never correct, but some are useful
True relationship: a quadratic in $x$ (terms $x$ and $\tfrac{1}{2}x^2$) plus noise $\varepsilon \sim N(0, 0.15^2)$
Model: linear dependence
Model: quadratic dependence
variance stabilization
[Figure: variance on the raw scale, the log scale and the glog scale, split into a constant part and a proportional part]
ratio compression
Yue et al. (Incyte Genomics), NAR (2001) 29, e41
