First Steps in microarray analysis


From colored speckles to expression values
o quality control / optimization
o calibration and error modeling
o data transformations
o software
Wolfgang Huber
Dep. of Molecular Genome Analysis
DKFZ Heidelberg
microarrays
samples: mRNA from tissue biopsies, cell lines
probes: gene-specific DNA strands
detection
gene     tissue A   tissue B   tissue C
ErbB2    0.02       1.12       2.12
VIM      1.1        5.8        1.8
ALDH4    2.2        0.6        1.0
CASP4    0.01       0.72       0.12
LAMA4    1.32       1.67       0.67
MCAM     4.2        2.93       3.31
a microarray slide
Slide: 25x75 mm
Spot-to-spot: ca. 150-350 µm
4x4 or 8x4 sectors
17...38 rows and columns per sector
ca. 4600…46000 probes/array
sector: corresponds to one print-tip
Terminology
sample: RNA (cDNA) hybridized to the array, aka
target, mobile substrate.
probe: DNA spotted on the array, aka spot,
immobile substrate.
sector: rectangular matrix of spots printed using
the same print-tip (or pin)
plate: set of 384 (768) spots printed with DNA
from the same microtitre plate of clones
slide, array
channel: data from one color (Cy3 = cyanine 3 =
green, Cy5 = cyanine 5 = red).
batch: collection of microarrays with the same
probe layout and quality
Raw data
scanner signal
2D image:
5 or 10 µm spatial resolution,
16 bit (65536) dynamic range per channel
ca. 30-50 pixels per probe (60 µm spot size)
40 MB per array
Image Analysis
spot intensities
2 numbers per probe (~100-300 kB)
… auxiliaries: background, area, std dev, …
Image analysis
1. Addressing. Estimate
location of spot centers.
2. Segmentation. Classify pixels
as foreground (signal) or
background.
3. Information extraction. For
each spot on the array and each
dye
• foreground intensities;
• background intensities;
• quality measures.
R and G for each spot on the array.
Local background
[Figure: local background regions as drawn by GenePix, QuantArray and ScanAlyze]
Oligonucleotide chips
Affymetrix files
Main software from Affymetrix:
MAS - MicroArray Suite.
DAT file: Image file, ~10^7 pixels, ~50 MB.
CEL file: probe intensities, ~400k numbers
CDF file: Chip Description File. Describes
which probes go in which probe sets
(genes, gene fragments, ESTs).
Affymetrix image analysis
DAT image files → CEL files
Each probe cell: 10x10 pixels.
Gridding: estimate location of cell centers.
Signal:
Remove outer 36 pixels → 8x8 pixels.
Probe cell signal, PM or MM, is the 75th percentile of the 8x8 pixel values.
Background: Average of the lowest 2% probe
cells is taken as the background value and
subtracted.
Compute also quality values.
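The signal and background steps above can be sketched in a few lines. This is a minimal illustration, not the Affymetrix implementation: the function names, the linear-interpolation percentile, and the single global background value are assumptions made for the example.

```python
import statistics

def cell_signal(cell):
    """75th percentile of the inner 8x8 pixels of a 10x10 probe cell."""
    assert len(cell) == 10 and all(len(row) == 10 for row in cell)
    # Drop the 36 border pixels, keep the inner 8x8 block.
    inner = sorted(p for row in cell[1:9] for p in row[1:9])
    # Percentile by linear interpolation between order statistics
    # (the interpolation rule is an assumption of this sketch).
    pos = 0.75 * (len(inner) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(inner) - 1)
    return inner[lo] + (pos - lo) * (inner[hi] - inner[lo])

def subtract_background(signals, fraction=0.02):
    """Subtract the mean of the lowest `fraction` of cell signals from all cells."""
    n_low = max(1, int(len(signals) * fraction))
    background = statistics.mean(sorted(signals)[:n_low])
    return [s - background for s in signals]
```

A call such as `subtract_background([cell_signal(c) for c in cells])` would then give background-corrected PM or MM values for a list of pixel blocks `cells` (a hypothetical input).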
Data and notation
$PM_{ijg}$, $MM_{ijg}$ = intensity for perfect match and mismatch probe $j$ for gene $g$ in chip $i$.
i = 1,…, n: one to hundreds of chips
j = 1,…, J: usually 16 or 20 probe pairs
g = 1,…, G: 8…20,000 probe sets.
Tasks:
calibrate (normalize) the measurements from
different chips (samples)
summarize for each probe set the probe level data,
i.e., 20 PM and MM pairs, into a single
expression measure.
compare between chips (samples) for detecting
differential expression.
spot or probe intensity data
[Figure: data matrix with probes (genes) as rows and conditions (samples) as columns, obtained from n one-color arrays (Affymetrix, nylon) or from two-color spotted arrays]
Spot intensities are not mRNA concentrations
o tissue contamination
o RNA degradation
o amplification efficiency
o reverse transcription efficiency
o hybridization efficiency and specificity
o clone identification and mapping
o PCR yield, contamination
o spotting efficiency
o DNA-support binding
o other array manufacturing related issues
o image segmentation
o signal quantification
o 'background' correction
The point is not that these steps are 'not perfect'; it is that they may vary from array to array, experiment to experiment.
Sources of variation
Systematic
amount of RNA in the biopsy
efficiencies of
-RNA extraction
-reverse transcription
-labeling
-photodetection
o similar effect on many measurements
o corrections can be estimated from data
⇒ Calibration
Stochastic
PCR yield
DNA quality
spotting efficiency, spot size
cross-/unspecific hybridization
stray signal
o too random to be explicitly accounted for
o "noise"
⇒ Error model
Error model
measured intensity of probe k in sample i:
$y_{ik} = a_{ik} + b_{ik} x_{ik}$
$a_{ik}$: unspecific, $b_{ik}$: gain, $x_{ik}$: actual abundance
$a_{ik}$, $b_{ik}$ all unknown: need to approximately determine from data
Implications
$y_{ik} = a_{ik} + b_{ik} x_{ik}$, with $y_{ik}$ the data and $x_{ik}$ the quantity of interest
No non-linear terms:
Saturation can be avoided in the experiments.
Data from well-performed experiments shows no evidence for gross deviations from affine linearity.
The flexibility (and complexity) lies in the number of parameters: twice the number of data points!
⇒ parameters must not all be independent
⇒ make simplifying assumptions, reduce the number of independent parameters to a manageable size
Quality control: verify that assumptions hold for the data at hand
Normalization, calibration: estimate parameters for the data at hand
Error modeling: control the error bars both of measured $y_{ik}$ and of estimated normalization parameters $a_{ik}$, $b_{ik}$
A typical set of assumptions
$y_{ik} = a_{ik} + b_{ik} x_{ik}$
$a_{ik} = a_i + L_{ik} + \varepsilon_{ik}$
$a_i$: per-sample offset
$L_{ik}$: local background, provided by image analysis
$\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: "additive noise"
$b_{ik} = b_i b_k h_{ik}$
$b_i$: per-sample normalization factor
$b_k$: sequence-wise labeling efficiency
$\log h_{ik} \sim N(0, s_2^2)$: "multiplicative noise"
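To make the roles of the two noise terms concrete, the model $y_{ik} = a_i + L_{ik} + \varepsilon_{ik} + b_i b_k h_{ik} x_{ik}$ can be simulated directly. The following is only a sketch: the function name and all parameter values a caller would pass are illustrative, not taken from real data.

```python
import math
import random

def simulate_intensity(x, a_i, b_i, b_k, L_ik, s1, s2, rng):
    """Draw one measured intensity y_ik for true abundance x under the model above."""
    eps = rng.gauss(0.0, b_i * s1)        # additive noise, sd = b_i * s1
    h = math.exp(rng.gauss(0.0, s2))      # multiplicative noise, log h ~ N(0, s2^2)
    return a_i + L_ik + eps + b_i * b_k * h * x
```

Drawing many values at a low and at a high abundance `x` shows the behaviour the model is meant to capture: the spread is roughly constant near zero (additive term dominates) and grows in proportion to the signal at high intensities (multiplicative term dominates).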
Discussion and possible extensions
Sequence-wise factors $b_k$ need not be explicitly determined if only interested in relative expression levels
The assumptions bring down the number of parameters to 2*d ($a_i$ and $b_i$ for each of the d samples) - the rest is modeled as noise.
Array calibration terms ai, bi same for all
probes on array - could extend to include
print-tip or plate effects
Probe affinities bk same for all arrays - could
extend to include batch effects
Calibration
= normalization
= estimation of model parameters
o data heteroskedastic
⇒ variance stabilizing data transformation
o maximum likelihood estimator
o model holds for genes that are unchanged; differentially transcribed genes act as outliers.
o heavy-tailed data distributions
⇒ robust variant of ML estimator, à la Least Trimmed Sum of Squares regression.
o works up to breakdown point of 50%
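Why a variance-stabilizing transformation helps can be seen in a small simulation. The sketch below assumes a simplified single-sample version of the error model, y = a + eps + x * exp(eta), and assumes the parameters a, s1, s2 are known; in practice they have to be estimated from the data (for instance by the robust ML procedure outlined above). The arsinh form follows from integrating 1/sqrt(Var) for a variance that is quadratic in the mean.

```python
import math
import random
import statistics

def glog(y, a, s1, c):
    """Variance-stabilizing transform for Var(y) = s1^2 + c^2 * (E[y] - a)^2."""
    return math.asinh(c * (y - a) / s1)

def simulate(x, a, s1, s2, rng):
    """One intensity from y = a + eps + x * exp(eta), eps ~ N(0, s1^2), eta ~ N(0, s2^2)."""
    return a + rng.gauss(0.0, s1) + x * math.exp(rng.gauss(0.0, s2))

rng = random.Random(0)
a, s1, s2 = 100.0, 10.0, 0.2                 # illustrative values only
c = math.sqrt(math.exp(s2 ** 2) - 1.0)       # coefficient of variation of exp(eta)

def spread(x, transform, n=4000):
    """Standard deviation of transformed intensities at true abundance x."""
    return statistics.stdev([transform(simulate(x, a, s1, s2, rng)) for _ in range(n)])

# Ratio of spreads at high versus zero abundance, before and after transformation.
raw_ratio = spread(10000.0, lambda y: y) / spread(0.0, lambda y: y)
vst_ratio = spread(10000.0, lambda y: glog(y, a, s1, c)) / spread(0.0, lambda y: glog(y, a, s1, c))
```

On the raw scale the spread at high abundance is far larger than at zero; after the transformation the two spreads are of similar size, which is what makes a single error bar meaningful across the whole intensity range.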
Gene expression measures
o average over multiple probes
e.g. Affymetrix GeneChip MAS 4.0: trimmed mean
$\mathrm{AvDiff} = \frac{1}{\#J} \sum_{j \in J} (PM_j - MM_j)$
o sort dj = PMj -MMj
o exclude highest and lowest value
o J := those pairs within 3 standard deviations
of the average
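The trimming rule can be written out as follows. This is one reading of the three steps listed above (mean and standard deviation are computed after dropping the highest and lowest difference, then all pairs within 3 standard deviations of that mean are averaged); the function name and the use of the sample standard deviation are assumptions of the sketch.

```python
import statistics

def avdiff(pm, mm):
    """Trimmed mean of PM - MM differences for one probe set."""
    d = sorted(p - m for p, m in zip(pm, mm))
    core = d[1:-1]                       # exclude the highest and lowest value
    mean = statistics.mean(core)
    sd = statistics.stdev(core)
    # J := pairs whose difference lies within 3 standard deviations of the mean
    kept = [v for v in d if abs(v - mean) <= 3 * sd]
    return statistics.mean(kept)
```

The function needs at least four probe pairs, so that two or more differences remain after the extremes are dropped.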
Gene expression measures: MAS 5.0
Instead of MM, use "repaired" version CT
$CT = MM$ if $MM < PM$
$CT = PM / \text{"typical log-ratio"}$ if $MM \ge PM$
"Signal" = Tukey.Biweight(log(PM - CT))
(… median)
Tukey Biweight: $B(x) = (1 - x^2/c^2)^2$ if $|x| < c$, 0 otherwise
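A one-step Tukey biweight average, using the weight function B above, might look like this. The choice of the median as center, the MAD as scale, and the defaults c = 5 and eps = 1e-4 are assumptions of this sketch rather than something stated on the slide; here x is the distance from the median in units of the MAD, and the code works with u = x/c.

```python
import statistics

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: weighted mean with weights (1 - u^2)^2 for |u| < 1."""
    values = list(values)
    center = statistics.median(values)
    mad = statistics.median(abs(v - center) for v in values)
    weights = []
    for v in values:
        u = (v - center) / (c * mad + eps)   # eps guards against mad == 0
        weights.append((1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Values far from the median get weight zero, so a single outlying probe has no influence on the result, while values close to the median are averaged almost as in an ordinary mean.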
Expression measures:
Li & Wong
dChip fits a model for each gene
$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$, $\varepsilon_{ij} \sim N(0, \sigma^2)$
where
– $\theta_i$: expression index of the gene on chip $i$
– $\phi_j$: probe sensitivity
o estimate $\theta_i$ by maximum likelihood.
o need at least 10 or 20 chips.
Current version works with PMs only.
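With Gaussian noise, maximum likelihood for this multiplicative model is a least-squares rank-one fit, which can be computed by alternating between the two sets of parameters. The sketch below is not dChip itself: it omits dChip's outlier handling, and the function name and the constraint sum(phi_j^2) = J used to fix the scale are assumptions of the example.

```python
def fit_li_wong(d, n_iter=20):
    """Fit d[i][j] ~ theta[i] * phi[j] by alternating least squares.

    d[i][j] is PM - MM for chip i and probe j of one gene.
    Returns (theta, phi) with phi scaled so that sum(phi_j^2) equals J.
    """
    n, J = len(d), len(d[0])
    phi = [1.0] * J
    theta = [0.0] * n
    for _ in range(n_iter):
        # Least-squares update of theta given phi.
        ss_phi = sum(f * f for f in phi)
        theta = [sum(d[i][j] * phi[j] for j in range(J)) / ss_phi for i in range(n)]
        # Least-squares update of phi given theta.
        ss_theta = sum(t * t for t in theta)
        phi = [sum(d[i][j] * theta[i] for i in range(n)) / ss_theta for j in range(J)]
        # Identifiability: rescale so that sum(phi^2) = J, keeping theta * phi unchanged.
        scale = (J / sum(f * f for f in phi)) ** 0.5
        phi = [f * scale for f in phi]
        theta = [t / scale for t in theta]
    return theta, phi
```

Because every probe sensitivity is estimated from all chips, the fit only becomes stable with a reasonable number of arrays, which is the reason for the "10 or 20 chips" remark above.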
RMA method: Irizarry, Speed et al. (2002)
o Estimate one global background value b = mode(MM). Otherwise ignore MM.
o Assume: $PM = Y_{true} + b$. Estimate $Y_{true}$ from PM and b as the conditional expectation $E[Y_{true} \mid PM, b]$.
o Use $\log_2$ of the estimate Y.
o Nonparametric nonlinear calibration ('quantile normalization') across a set of chips.
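Quantile normalization forces every chip to have the same empirical distribution: the r-th smallest value on each chip is replaced by the average of the r-th smallest values over all chips. A minimal sketch, assuming one list of intensities per chip, all of equal length, with ties broken by position:

```python
def quantile_normalize(chips):
    """Return copies of the chips that all share the same set of values."""
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the r-th order statistic across chips.
    reference = [sum(s[r] for s in sorted_chips) / len(chips) for r in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda k: chip[k])   # indices by increasing value
        out = [0.0] * n
        for rank, k in enumerate(order):
            out[k] = reference[rank]
        normalized.append(out)
    return normalized
```

The ranks within each chip are preserved; only the values attached to the ranks change.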
RMA expression measures
AvDiff-type:
$a_i = \frac{1}{\#A} \sum_{j \in A} \log_2(PM_{ij} - BG_{ij})$
with A a set of "suitable" pairs.
Li-Wong-type:
$\log_2(PM_{ij} - BG) = a_i + b_j + \varepsilon_{ij}$
robust estimation of ai: median polish
(iterative: successively remove row and
column medians and accumulate terms until
the process stabilizes).
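The median polish iteration described above can be sketched as follows for one probe set, with rows as chips i and columns as probes j. The function name, the stopping rule, and the split into an overall term plus row and column effects are choices made for this illustration.

```python
import statistics

def median_polish(x, n_iter=10, tol=1e-8):
    """Fit x[i][j] ~ overall + row[i] + col[j] by repeatedly removing medians."""
    n, J = len(x), len(x[0])
    resid = [list(r) for r in x]
    overall, row, col = 0.0, [0.0] * n, [0.0] * J
    for _ in range(n_iter):
        change = 0.0
        # Remove row medians and accumulate them in the row effects.
        for i in range(n):
            m = statistics.median(resid[i])
            row[i] += m
            resid[i] = [v - m for v in resid[i]]
            change += abs(m)
        m = statistics.median(col)
        overall += m
        col = [c - m for c in col]
        # Remove column medians and accumulate them in the column effects.
        for j in range(J):
            m = statistics.median(resid[i][j] for i in range(n))
            col[j] += m
            for i in range(n):
                resid[i][j] -= m
            change += abs(m)
        m = statistics.median(row)
        overall += m
        row = [r - m for r in row]
        if change < tol:      # the process has stabilized
            break
    return overall, row, col, resid
```

With x[i][j] = log2(PM_ij - BG), the expression estimate for chip i is `overall + row[i]`, and the column effects play the role of the probe terms b_j.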
software
www.bioconductor.org
packages:
affy: Affymetrix pre-processing (CEL, CDF)
marrayInput: two-color slides
vsn: calibration and variance stabilization
o open-source ("reproducible research")
o based on R (biggest statistics library in the world)
Quality control
verify the assumptions that underlie the
+ calibration
+ error model
⇒ visualization, diagnostic plots
⇒ some examples of what can go wrong
Background
usual assumption for spotted arrays:
total brightness =
specific part (from labeled sample cDNA)
+ background brightness (adjacent to spot)
?
Background
usual assumption for Affymetrix arrays:
PM intensity =
specific part (from labeled sample cDNA)
+ MM intensity
[Figure: plot of log(PM/MM). From: R. Irizarry et al., Biostatistics 2002]
PCR plates
Scatterplot, colored by PCR-plate
Two RZPD Unigene II filters (cDNA nylon membranes)
PCR plates
PCR plates: boxplots
array batches
print-tip effects
[Figure: "41 (a42-u07639vene.txt) by spotting pin"; one curve per spotting pin (1:1 to 4:4); x-axis: log(fg.green/fg.red)]
spotting pin quality decline
after delivery of 5x10^5 spots
after delivery of 3x10^5 spots
H. Sueltmann DKFZ/MGA
spatial effects
[Figure: spatial images of R, Rb and R-Rb; color scale by rank]
[Figure: another array, print-tip effects; color scale ~ log(G)]
[Figure: spotted cDNA arrays, Stanford-type; color scale ~ rank(G)]
Batches: array to array differences $d_{ij} = \mathrm{mad}_k(h_{ik} - h_{jk})$
[Figure: matrix of $d_{ij}$ for arrays i=1…63; roughly sorted by time]
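The array-to-array distance $d_{ij} = \mathrm{mad}_k(h_{ik} - h_{jk})$ is cheap to compute for all pairs of arrays. In this sketch, h[i] is assumed to be the list of transformed intensities of array i, and the MAD is taken around the median without any rescaling constant; both are assumptions of the example.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def array_distances(h):
    """Matrix d[i][j] = mad over probes k of (h[i][k] - h[j][k])."""
    n = len(h)
    return [[mad(a - b for a, b in zip(h[i], h[j])) for j in range(n)]
            for i in range(n)]
```

Because the MAD is insensitive to a constant shift and to a minority of outlying probes, a block structure in this matrix points to batch effects rather than to a few differentially expressed genes.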
Coefficient of variation
[Figure: cDNA slide, data from H. Sueltmann]
Density representation of the scatterplot
(76,000 clones, RZPD Unigene-II filters)
Three-way comparisons
References
YH Yang, S Dudoit, P Luu, DM Lin, V Peng, J Ngai, TP Speed. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucl. Acids Res. 30(4):e15, 2002.
W Huber, A von Heydebreck, H Sültmann, A Poustka, M Vingron. Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression. Bioinformatics 18, Suppl. 1, S96-S104, 2002.
BP Durbin, JS Hardin, DM Hawkins, DM Rocke. A Variance-Stabilizing Transformation for Gene Expression Microarray Data. Bioinformatics 18, Suppl. 1, S105-S110.
RA Irizarry, B Hobbs, F Collin, YD Beazer-Barclay, KJ Antonellis, U Scherf, TP Speed. Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics, 2002.
http://biosun01.biostat.jhsph.edu/~ririzarr/papers/index.html
A more complete list of references is in:
W Huber, A von Heydebreck, M Vingron. Elementary analysis of microarray gene expression data. Manuscript.
http://www.dkfz-heidelberg.de/abt0840/whuber/
Acknowledgements
DKFZ Heidelberg: Annemarie Poustka, Holger Sültmann, Tim Beißbarth, Frank Bergmann, Andreas Buneß, Katharina Finis, Florian Haller, Patrick Herde, Yvonne Keßler, Jörg Schneider, Anke Schroth, Klaus Steiner, Stephanie Süß, Markus Vogt, Friederike Wilmer, … and many more!
MPI Molekulare Genetik: Anja von Heydebreck, Martin Vingron
Uni Heidelberg: Günther Sawitzki
UMC Leiden: Judith Boer
Bioconductor Project: Robert Gentleman, Sandrine Dudoit, Rafael Irizarry, Laurent Gautier