Transformations… what for? which one?
Wolfgang Huber
Div. Molecular Genome Analysis
DKFZ Heidelberg
Microarray intensities $x_1, \ldots, x_n$
Log-ratio with/without background correction: $\log \frac{x_i}{x_j}$
Shrunken log-ratio (BHM): $\log \frac{x_i + f(x_i, \ldots)}{x_j + f(x_j, \ldots)}$
Variance stabilized log-ratio (= generalized log-ratio, “glog”): $\log \frac{x_i + \sqrt{x_i^2 + c_i^2}}{x_j + \sqrt{x_j^2 + c_j^2}}$
How do you like to think about (interpret) it?
How do you estimate the parameters?
What comes out (the “bottom line”)?
ratios and fold changes
Fold changes are useful to describe continuous changes in expression
[Figure: expression values 3000, 1500, 1000; conditions A, B, C; fold changes x3 and x1.5]
But what if the gene is “off” (below detection limit) in one condition?
[Figure: expression values 3000, 200, 0; conditions A, B, C; fold changes ?, ?]
ratios and fold changes
Many interesting genes will be off in some of the conditions of interest
1. If you want the expression measure (“net normalized spot intensity”) to be an unbiased estimator of abundance:
   - many values ≤ 0
   - need something more than the (log-)ratio
2. If you let the expression measure be biased (always > 0):
   - can keep ratios
   - how do you choose the bias?
ratios and fold changes
Ratios are scale-free: $f(y_i, y_j) = y_i / y_j$ or $\log(y_i / y_j)$
But there is (at least) one absolute scale in the data: $\sigma_{bg} = \mathrm{sd}(Y_i \mid E(Y_i) = 0)$
Can we use this to construct useful functions $f(y_i, y_j, \sigma_{bg})$?
In the following:
o How to compare microarray intensities with each other?
o How to incorporate measurement uncertainty (“variance”)?
o How to simultaneously and consistently deal with calibration (“normalization”)?
Sources of variation
Sources: amount of RNA in the biopsy; efficiencies of RNA extraction, reverse transcription, labeling, fluorescent detection; probe purity and length distribution; spotting efficiency, spot size; cross-/unspecific hybridization; stray signal
Systematic
o similar effect on many measurements
o corrections can be estimated from data
→ Calibration
Stochastic
o too random to be explicitly accounted for
o remain as “noise”
→ Error model
Error model
modeling ansatz
measured intensity = offset + gain × true abundance
$y_{ik} = a_{ik} + b_{ik} x_k$
$a_{ik} = a_i + \varepsilon_{ik}$
  $a_i$: per-sample offset
  $\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: “additive noise”
$b_{ik} = b_i b_k \exp(\eta_{ik})$
  $b_i$: per-sample normalization factor
  $b_k$: sequence-wise probe efficiency
  $\eta_{ik} \sim N(0, s_2^2)$: “multiplicative noise”
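To get a feeling for this ansatz, one can simulate it. The following is a minimal sketch for a single sample and a single probe (so the probe efficiency is absorbed into b). All parameter values (a = 50, b = 1, s1 = 20, s2 = 0.2) and the helper names are made up for illustration and are not taken from the slides.

```python
import math
import random
import statistics

def simulate_intensity(x, rng, a=50.0, b=1.0, s1=20.0, s2=0.2):
    """Draw one measured intensity y = (a + eps) + b * exp(eta) * x.

    Illustrative parameters only: a is the offset, b the gain,
    eps ~ N(0, (b * s1)^2) the additive noise and
    eta ~ N(0, s2^2) the multiplicative noise.
    """
    eps = rng.gauss(0.0, b * s1)   # additive noise
    eta = rng.gauss(0.0, s2)       # multiplicative noise (acts on the gain)
    return a + eps + b * math.exp(eta) * x

def empirical_sd(x, n=20000, seed=1):
    """Standard deviation of n simulated intensities at true abundance x."""
    rng = random.Random(seed)
    return statistics.pstdev([simulate_intensity(x, rng) for _ in range(n)])

# The spread is roughly constant near x = 0 and grows about linearly for large x.
for x in (0.0, 100.0, 1000.0, 10000.0):
    print(x, round(empirical_sd(x), 1))
```

At low abundance the additive term dominates the spread, at high abundance the multiplicative term does.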
The two-component model
[Figure: two panels, raw scale and log scale, with the “additive” noise and “multiplicative” noise regimes marked]
B. Durbin, D. Rocke, JCB 2001
variance stabilizing transformations
$X_u$ a family of random variables with $E X_u = u$, $\operatorname{Var} X_u = v(u)$. Define
$f(x) = \int^x \frac{1}{\sqrt{v(u)}}\, du$
$\Rightarrow \operatorname{var} f(X_u) \approx$ independent of $u$
derivation: linear approximation
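The linear approximation can be spelled out as follows. This is a sketch of the standard delta-method argument; fixing the resulting constant to 1 is a convention and is not stated on the slide.

```latex
\begin{align*}
f(X_u) &\approx f(u) + f'(u)\,(X_u - u) \\
\operatorname{Var} f(X_u) &\approx f'(u)^2 \operatorname{Var} X_u = f'(u)^2\, v(u) \\
f'(u) = \frac{1}{\sqrt{v(u)}} \quad &\Longrightarrow \quad \operatorname{Var} f(X_u) \approx 1
\end{align*}
```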
variance stabilizing transformations
[Figure: transformed scale f(x) (about 8.0 to 11.0) plotted against raw scale x (0 to 60000)]
variance stabilizing transformations
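$f(x) = \int^x \frac{1}{\sqrt{v(u)}}\, du$
1.) constant variance (‘additive’): $v(u) = s^2 \Rightarrow f \propto u$
2.) constant CV (‘multiplicative’): $v(u) \propto u^2 \Rightarrow f \propto \log u$
3.) offset: $v(u) \propto (u + u_0)^2 \Rightarrow f \propto \log(u + u_0)$
4.) additive and multiplicative: $v(u) \propto (u + u_0)^2 + s^2 \Rightarrow f \propto \operatorname{arsinh} \frac{u + u_0}{s}$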
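The additive-and-multiplicative case can be checked numerically by integrating 1/sqrt(v(u)) and comparing with the arsinh closed form. This is a minimal sketch: the values s = 30 and u0 = 10, the lower integration limit 0 and the function names are assumptions chosen for illustration.

```python
import math

def stabilizer(v, x, x0=0.0, steps=20000):
    """Approximate f(x) = integral from x0 to x of du / sqrt(v(u)) (midpoint rule)."""
    h = (x - x0) / steps
    return h * sum(1.0 / math.sqrt(v(x0 + (i + 0.5) * h)) for i in range(steps))

s, u0 = 30.0, 10.0   # illustrative values, not from the slides

def v_add_mult(u):
    """Variance function with an additive and a multiplicative part."""
    return (u + u0) ** 2 + s ** 2

def closed_form(x):
    """arsinh((x + u0) / s), shifted so that it vanishes at x = 0."""
    return math.asinh((x + u0) / s) - math.asinh(u0 / s)

# Numerical integral and closed form should agree closely.
for x in (10.0, 100.0, 1000.0):
    print(x, stabilizer(v_add_mult, x), closed_form(x))
```

The same `stabilizer` helper can be pointed at the other three variance functions to recover the linear and logarithmic cases.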
the “glog” transformation
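[Figure: f(x) = log(x) (dashed line) and h_s(x) = asinh(x/s) (solid line) plotted against intensity, from -200 to 1000]
$\operatorname{arsinh}(x) = \log\left(x + \sqrt{x^2 + 1}\right)$
$\lim_{x \to \infty} \left(\operatorname{arsinh} x - \log x - \log 2\right) = 0$
P. Munson, 2001
D. Rocke & B. Durbin, ISMB 2002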
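A small sketch of how the glog behaves compared with the logarithm. The scale s = 100 and the function names are assumptions for illustration; the intensities 3000, 200 and 0 are the ones from the fold-change example.

```python
import math

def glog(x, s=1.0):
    """h_s(x) = arsinh(x / s); defined for every real x, including x <= 0."""
    return math.asinh(x / s)

def glog_ratio(xi, xj, s):
    """Generalized log-ratio h_s(xi) - h_s(xj)."""
    return glog(xi, s) - glog(xj, s)

# For large x, arsinh(x) - log(2 x) shrinks towards 0.
for x in (1.0, 10.0, 100.0, 1000.0):
    print(x, math.asinh(x) - math.log(2 * x))

# Unlike log(3000 / 0), the glog-ratio stays finite when one intensity is 0.
s = 100.0  # hypothetical scale parameter
print(glog_ratio(3000.0, 200.0, s), glog_ratio(3000.0, 0.0, s))
```

For intensities far above s the glog-ratio is close to the ordinary log-ratio, so fold changes keep their usual reading there.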
parameter estimation
$\operatorname{arsinh} \frac{Y_{ki} - a_i}{b_i} = \mu_k + \varepsilon_{ki}, \quad \varepsilon_{ki} \sim N(0, c^2)$
o maximum likelihood estimator: straightforward, but sensitive to deviations from normality
o model holds for genes that are unchanged; differentially transcribed genes act as outliers.
o robust variant of ML estimator, à la Least Trimmed Sum of Squares regression.
o works as long as <50% of genes are differentially transcribed
measured intensity = offset + gain * true abundance
$y_{ik} = a_{ik} + b_{ik} x_{ik}$
$a_{ik} = a_i + L_{ik} + \varepsilon_{ik}$
  $a_i$: per-sample offset
  $L_{ik}$: local background provided by image analysis
  $\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: “additive noise”
$b_{ik} = b_i b_k \exp(\eta_{ik})$
  $b_i$: per-sample normalization factor
  $b_k$: sequence-wise labeling efficiency
  $\eta_{ik} \sim N(0, s_2^2)$: “multiplicative noise”
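Least trimmed sum of squares regression
minimize $\sum_{i=1}^{n/2} \left(y_{(i)} - f(x_{(i)})\right)^2$, where $(i)$ indexes the points in order of increasing squared residual
[Figure: scatter plot of y against x with two fitted lines, least sum of squares and least trimmed sum of squares]
P. Rousseeuw, 1980s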
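The idea can be sketched for a straight line. This is a simplified illustration, not the estimator used in vsn: it takes f to be a line, starts from random pairs of points and repeatedly refits on the n/2 points with the smallest squared residuals. The toy data, the number of starts and all function names are assumptions.

```python
import random

def fit_line(points):
    """Ordinary least squares fit y = c0 + c1 * x; returns (c0, c1)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return my - slope * mx, slope

def trimmed_cost(points, fit, h):
    """Sum of the h smallest squared residuals under the given fit."""
    c0, c1 = fit
    return sum(sorted((y - c0 - c1 * x) ** 2 for x, y in points)[:h])

def lts_line(points, n_starts=50, n_steps=20, seed=0):
    """Approximate least trimmed squares line, keeping h = n // 2 points."""
    rng = random.Random(seed)
    h = len(points) // 2
    best_fit, best_cost = None, float("inf")
    for _ in range(n_starts):
        subset = rng.sample(points, 2)
        if subset[0][0] == subset[1][0]:
            continue  # cannot fit a line through two points with equal x
        for _ in range(n_steps):
            c0, c1 = fit_line(subset)
            # keep the h points that the current line fits best, then refit
            subset = sorted(points, key=lambda p: (p[1] - c0 - c1 * p[0]) ** 2)[:h]
        fit = fit_line(subset)
        cost = trimmed_cost(points, fit, h)
        if cost < best_cost:
            best_fit, best_cost = fit, cost
    return best_fit

# Toy data: 12 points on y = 1 + 2 x, plus 8 "outliers" shifted upwards by 15.
points = [(float(x), 1.0 + 2.0 * x) for x in range(12)]
points += [(float(x), 1.0 + 2.0 * x + 15.0) for x in range(12, 20)]
print("least sum of squares:        ", fit_line(points))
print("least trimmed sum of squares:", lts_line(points))
```

In the toy data the shifted points play the role of differentially transcribed genes: they pull the ordinary fit away from the majority, while the trimmed fit follows the unchanged points.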
evaluation: effects of different data transformations
[Figure: difference red-green plotted against rank(average)]
[Figure: Normality: QQ-plot]
evaluation: sensitivity / specificity in detecting differential abundance
o Data: paired tumor/normal tissue from 19 kidney cancers, in color flip duplicates on 38 cDNA slides with 4000 genes each.
o 6 different strategies for normalization and quantification of differential abundance
o Calculate for each gene & each method: t-statistics, permutation-p
o For threshold α, compare the number of genes the different methods find, $\#\{p_i \mid p_i \le \alpha\}$
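One way such a per-gene permutation p-value could be computed for paired data is by random sign flips of the paired differences. The slides do not say which permutation scheme was used, so the scheme, the toy data and the function names below are assumptions.

```python
import math
import random
import statistics

def t_statistic(d):
    """One-sample t-statistic of paired differences d (e.g. tumor minus normal)."""
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

def sign_flip_p(d, n_perm=1000, seed=0):
    """Two-sided permutation p-value, obtained by randomly flipping the sign of each difference."""
    rng = random.Random(seed)
    observed = abs(t_statistic(d))
    hits = 0
    for _ in range(n_perm):
        flipped = [x if rng.random() < 0.5 else -x for x in d]
        if abs(t_statistic(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def count_below(p_values, alpha):
    """Number of genes with p_i <= alpha."""
    return sum(p <= alpha for p in p_values)

# Toy differences for 19 pairs: one clearly changed gene, one unchanged gene.
changed = [1.0 + 0.1 * ((i * 7) % 5 - 2) for i in range(19)]
unchanged = [(-1) ** i * (0.5 + 0.1 * (i % 4)) for i in range(19)]
p_values = [sign_flip_p(changed), sign_flip_p(unchanged)]
print(p_values, count_below(p_values, alpha=0.01))
```

Running the same count for each normalization strategy over a range of α gives the kind of comparison described above.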
evaluation: comparison of methods
[Figure: two panels, “one-sided test for up” and “one-sided test for down”]
more accurate quantification of differential expression ⇒ higher sensitivity / specificity
evaluation: a benchmark for Affymetrix genechip expression measures
o Data:
Spike-in series: from Affymetrix, 59 x HGU95A, 16 genes, 14 concentrations, complex background
Dilution series: from GeneLogic, 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts
o Benchmark:
15 quality measures regarding
-reproducibility
-sensitivity
-specificity
Put together by Rafael Irizarry (Johns Hopkins)
http://affycomp.biostat.jhsph.edu
affycomp results (28 Sep 2003)
[Figure: benchmark results, color-coded from good to bad]
ROC curves
Stratification
position- and sequence-specific effects $w_i(s)$:
$\log Y = \log x + \sum_{i=1}^{25} w_i(s_i)$
Naef et al., Phys Rev E 68 (2003)
[Figure: $w_i$ plotted against position $i$]
collaboration with R. Irizarry
glog versus "sliding z-score"
sliding z-score
Availability
o implementation in R
o open source package vsn on www.bioconductor.org
o Bioconductor is an international collaboration on open source software for bioinformatics and statistical omics
What to do with the gene lists: the functional genomics pipeline @ DKFZ
[Diagram with the following elements]
o High-throughput transcriptome sequencing: clones with unannotated full length ORFs
o Library of "unknown" transcripts (S. Wiemann, D. Arlt)
o Neoplastic diseases
o functional characterization
o association of mRNA profiles with
  - genetic aberrations
  - histopathology
  - clinical behavior
o HT functional assays
o GFP-ORF protein expression clone
o BrdU incorporation
o automated microscope (Rainer Pepperkok, EMBL)
o Image segmentation and quantification
  - DAPI: identification
  - CFP: expression
  - BrdU: proliferation (+activator, -inhibitor)
o SMP Cell
Detection of modulators of cell proliferation
by automated image analysis
Measurement of fluorescence intensities
[Figure: microscope images of the same cells (panels: DAPI, ORF-YFP, Anti-BrdU/Cy5, overlay YFP - Cy5), annotated with per-cell fluorescence intensities in the YFP channel and the Cy5 channel]
Dorit Arlt
Statistical analysis of cellular assay data
[Figure: distribution of the BrdU signal (0 to 250) for control cells and transfected cells; detected transfection effect: inh (p = 10^-8)]
Plate summary plot
[Figure: plate map, rows A to H by columns 1 to 12, color scale from activation to inhibition (dorit6)]
Cellular assays: challenges for statisticians
o Image analysis: pattern recognition, classification
o Low-level analysis: what are good models for calibration, “normalization”, data transformation?
o High-level analysis:
  - models for the dependence of cellular processes on over-/underexpression of genes
  - connect results from different assays, microarray data
Summary
o log-ratio Δlog: what about genes that are not expressed in some of the conditions of interest?
o generalized log-ratio Δh: a useful extrapolation
- interpretability
- sensitivity
- specificity
- computational convenience
o what to do with the gene lists? systematic (high throughput) functional assays
Acknowledgements
DKFZ Heidelberg, Molecular Genome Analysis
MPI Molekulare Genetik
Anja von Heydebreck
Martin Vingron
Annemarie Poustka
Holger Sültmann
Andreas Buneß
Markus Ruschhaupt
Katharina Finis
Jörg Schneider
Klaus Steiner
Uni Heidelberg
Günther Sawitzki
Stefan Wiemann
Dorit Arlt
DFCI Harvard
Robert Gentleman
UMC Leiden
Judith Boer
RZPD
Anke Schroth
Bernd Korn
EMBL
Urban Liebel
...and many more!
Models are never correct, but some are useful
True relationship: $y = x + \frac{1}{2} x^2 + \varepsilon$, $\varepsilon \sim N(0, 0.15^2)$
Model: linear dependence
Model: quadratic dependence
variance stabilization
[Figure: variance on the raw scale, log scale and glog scale, split into a constant part and a proportional part]
ratio compression
Yue et al. (Incyte Genomics), NAR (2001) 29 e41