Low-level Analysis of Microarray Data

Download Report

Transcript Low-level Analysis of Microarray Data

Low-level Analysis of Microarray Data

Mathematical Statistics Centre for Mathematical Sciences Lund University, Sweden 2004-10-01

2 

Why are microarrays important?

Genome projects have identified lots of new genes, but for many we do not know what they are doing.

"Functional Genomics"

Application areas:

Cancer, Inflammation, Infectious diseases Toxicology - drug (side) effects Developmental biology

3

Papers

A) H. Bengtsson, B. Calder, I. S. Mian, M. Callow, E. Rubin, and T. P. Speed.

Identifying Differentially Expressed Genes in cDNA Microarray Experiments: making aging visible

. 2001.

B) H. Bengtsson.

Identification and normalization of plate effect in cDNA microarray data

. 2002.

C) H. Bengtsson.

The R.oo package - object-oriented programming with references using standard R code

. 2003.

D) H. Bengtsson and O. Hössjer.

Methodological study of affine transformations of gene expression data with proposed normalization method

. 2003.

E) H. Bengtsson, G. Jönsson, and J. Vallon-Christersson.

Calibration and assessment of channel-specific biases in microarray data with extended dynamical range

. 2003.

F) H. Bengtsson. aroma -

An R Object-oriented Microarray Analysis environment

. 2004.

G) H. Bengtsson and O. Hössjer.

Affine calibration for microarrays with dilution series or spike-ins

. 2004.

Central Dogma of Molecular Biology

4 Idea of microarrays (and other gene expression techniques):

Measure the amount of mRNA to find genes that are expressed (active)

.

(measuring protein might be better, but is currently much harder)

7 

Which genes are differentially transcribed?

same-same tumor-normal

8

Statistics 101:

bias accuracy

9

Central dogma of data analysis

Can always increase sensitivity on the cost of specificity, or vice versa, the art is to find the best trade-off!

X X X X X X X X X

10

Printing / spotting

11

Microarray slide preparation

J. Vallon-Christersson, Dept Oncology, Lund Univ.

RNA

RNA extraction & hybridization

(targets) Tumor sample Reference sample RNA Extract mRNA from samples Reverse transcription of mRNA to cDNA Label with Cy3 or Cy5 fluorescent dye Hybridize labeled cDNA to slide Wash slide cDNA cDNA Hybridize 12 (probes) Figure: Hybridization chamber.

Two-channel scanning

green laser excitation red laser emission overlay images

13  higher frequency, more energy  lower frequency, less energy

14

Combined color image for visualization

15

Signal quantification

1. Addressing

Locate spot centers.

2. Segmentation

Classification of pixels either as signal or background (using circles, seeded region growing or other).

3. Signal quantification

a) foreground estimates : R fg , G fg b) background estimates : R bg , G bg c) ... (shape, size etc) Terry Speed et al.

16 cDNA clones printing

Summary

green laser excitation red laser PCR product amplification purification

RNA Test sample Reference sample RNA

emission

cDNA cDNA

overlay images scanning

Hybridize Production

data: (R fg ,G fg ,R bg ,G bg , ...)

17 

Sources of variation

amount of RNA in the biopsy efficiencies of -RNA extraction -reverse transcription -labeling -fluorescent detection Systematic o measurements o similar effect on many corrections can be estimated from data probe purity and length distribution spotting efficiency, spot size cross-/unspecific hybridization stray signal Stochastic o plicitely accounted for o too random to be ex remain as “noise”

Normalization Error model

Exploratory data analysis: (R,G)

(log 2 R,log 2 G)

(M,A)

18 R vs G

R

= red channel signal

G

= green channel signal log(R) vs log(G) M vs A

M

= log

2

(

R

/

G

) (log-ratio),

A

= ½ ·log

2

(

R

·

G

) (log-intensity)

19

Exploratory data analysis: the need for normalization

In self-self comparisons, find stochastic as well as systematic deviations from M=0.

Idea

: Do various exploratory plots to understand the nature of these deviations: for example, M vs A, spatial plots, density & boxplots plots, print order plots etc.

20

21

22

23

24

Within-slide normalization

Classical methods:

No normalization Median normalization

Shift of log-ratios.

Curve-fit normalization methods.

A.k.a. lo(w)ess normalization.

etc.

25

Global median normalization

Assumption:

Changes roughly symmetric.

M’

=

M

– median(

M

) aroma: normalizeWithinSlide(ma, method=“m”)

Global curve-fit normalization cont.

curve-fit normalization

done for

all spots together

M i ’

=

M

i – c(

A

i ) (Biased towards the green channel & intensity dependent artifacts) 26 aroma: plot(ma); normalizeWithinSlide(ma, method=“l”); plot(ma)

Print-tip curve-fit normalization

Print-tip curve-fit normalized data {(

A,M

)}

n=1..N

:

M’ i,p = M i,p - c p

(

A i

) ;

p

= print tip 1-16 where

c p

is an

intensity dependent

function for print tip

p .

27 aroma: plot(ma); lowessCurve(ma, groupBy=“printtip”); boxplot(ma, groupBy=“printtip”)

28

Print-tip curve-fit normalization cont.

curve-fit normalization

done for each

print-tip group

separately aroma: normalizeWithinSlide(ma, “p”); boxplot(ma, groupBy=“printtip”); plotSpatial(ma)

29

Across-slides normalization

The

within-slide

normalization makes the red and the green signals comparable. What about signals from different arrays?

Across-slide normalization

attacks this problem.

Idea

: Log-ratios (M values) from different slides should have similar variance.

Example

: If the standard deviation of the log-ratios from one array is 0.50 and from another array 0.78, the log-ratios have to be rescaled to get the same standard deviation.

30

Paper A

H. Bengtsson, B. Calder, I. S. Mian, M. Callow, E. Rubin, and T. P. Speed.

Identifying Differentially Expressed Genes in cDNA Microarray Experiments: making aging visible

. Online report accompanying the discussion forum with the same name. Science SAGE KE, 2001 (12), vp8.

• Illustrates typical normalization of two-color microarray data • Short introduction on how to identify differentially expressed genes.

• Basic study on how many replicated arrays are needed.

• Comparison between two image-analysis methods.

31

Paper B

H. Bengtsson.

Identification and normalization of plate effect in cDNA microarray data.

Preprints in Mathematical Sciences 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2002.

32

Printing of spots

Hmm... why the horizontal stripes? 

1

6384 spots printed onto 9 slides in total 399 print turns using 4x4 print-tips...

16

33

Print-order plot of log-ratios

The spots are order according to

when

they were spotted/dipped onto the glass slide(s). Note that it takes hours/days to print all spots on

all

arrays.

plotPrintorder(ma)

34

Remove (constant) plate biases?

Will remove some of the intensity dependent effects...

...and some of the spatial artifacts.

35

Intensity normalization plate by plate?

...plus a print-tip normalization?

Removes the plate effects...

...and most of the spatial artifacts.

Comparison of normalization methods

Median absolute deviation (MAD) for gene

i

with replicates

j=1,2,...,J

:

d i = 1.4826 · median | r ij |

where

r ij = M ij – median M ij

is residual

j

for gene

i

.

Ex: two different genes:

d a

<

d b

The

measure of reproducibility

(small is good) is a scalar defined as the mean of

all

genewise MADs:

M.O.R. =

d i / N

where

N

is the number of genes. 36

Remember that minimizing variance alone is not useful - we also need to consider bias

Results

• Doing platewise intensity dependent normalization lowers the gene variability by another ~10% from print-tip norm . • Background correction increases variability (but might reduce bias) • Using

measure of reproducibility

is helpful in deciding what to do.

Pl

– Constant platewise norm.,

Pl(A)

dep. slidewise norm.,

Pr(A)

– Intensity dep. platewise norm.,

Sl(A)

– Intensity dep. print-tip-wise norm.,

sPr(A)

– Intensity – Scaled intensity dep. print-tip-wise norm.

,

bg

– background corrected data.

40

Paper D

H. Bengtsson and O. Hössjer.

Methodological study of affine transformations of gene expression data with proposed normalization method

. Preprints in Mathematical Sciences 2003:38, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2003.

41

A general model

y c,i = f c (x c,i ) +  c,i genes

i

= 1

,

2

, . . . , I

RNA extracts

c

= 1

,

2

, . . . , C x c,i

the

true gene expression level y c,i

 c,i the

observed gene expression level

measurement noise, E[  c,i ] = 0 f c (·)

measurement function

is (unknown)

42

Measurement functions model the biochemical and physical measurement procedure

43

Recall the M vs A transform

Especially in two-channel microarray analysis the M vs A transform is useful: where

x R,i

and

x G,i

are the true gene expression levels. Since these are unknown, it is a common practice to plug in the observed expression levels

Linear models and M vs A

In the literature, it common to assume a linear measurement function, either explicitly or implicitly: where b c is the scale factor for channel intensities become

c = R,G

. The observed log-ratios and log 44 where  =b R /b G is the relative scale factor.

Systematic effects : The overall shift in the log-ratios is log 2  , which is constant.

Examples of bias with linear measurement functions 45 Balance channels:  = b R /b G = 1 ) log 2  = 0 Red too strong:  = b R /b G = 4/1 ) log 2  = +2 Green too strong:  = b R /b G = 1/4 ) log 2  = -2

Affine measurement functions

Assume that the measurement function is

affine

: where a c and b c are channel-specific bias and scale parameters.

The observed log-ratios become 46 where  i = a R  a G (x R,i / x G,i ) ,  =b R /b G , and A i is the observed log intensity for gene

i

.

Systematic effects

:

i)

Constant shift of log2  . ii) intensity-dependent bias.

Note that the second bias term is fold-change dependent! Moreover, if

a R

the second bias term is zero as before.

=

a

G = 0 (linear),

Examples of bias with affine measurement functions Assume different channel biases, i.e.

a

R ≠

a

G and  =

b

R /

b

G =1.

47 No bias: (a R ,a G )=(0,0) Small bias in red: (a R ,a G )=(20,0) Small bias in red, large bias in green: (a R ,a G )=(20,200)

Theoretical examples of affine transformations Assume different channel biases, i.e.

a

R ≠

a

G , but same  =

b

R /

b

G =0.57.

48 (a R ,a G )=(0,0)  = 0.57

(a R ,a G )=(20,200) = 0.57

(a R ,a G )=(80,140) = 0.57

The blue dash-dot curves are non-differentially expressed genes. Colored curves above and below are genes with fold-change log 2 (

x

R,i /

x

G

,

i ) =

±

1

, ±

2

, ...

49 

… and with data:

50

Robust affine normalization

Consider a two-channel (

C=

{R,G}) microarray with genes

i=1,...,I

for which we observe With  that = a R  a G and  = b R /b G , we have for non-differentially expressed genes (no noise) The red and green signals,

y

i = (y R,i ,y G,i ), then scatter around the line with intercept  and slope  . Next, robust estimates of  and  .

51

IWPCA – iterative reweighted PCA

Define the objective function w i is a weight, where r i (  ,  ;

y

i ) > 0 is the Euclidean distance between

y

i line L(  ,  ). The estimate of  and  is then and the With weights w i = 1 we obtain standard PCA (minimization in L 2 ). With weights we minimize in L 1 , if we let 

!

0.

Algorithm

: Estimate  and  . Update weights. Reestimate  and  and so on.

52

Additional constraints

With estimates of  = a R  a G and  = b R /b G , we impose the additional constraint that after normalization

no negative signal may exists

(less conservative constraint may also be used). In addition, let b R =1. The robust affine normalization is then Example: Estimates (a R ,a G ,  ) = (21.5,29.3,0.163).

53

Generalization to several channels/arrays

The robust affine normalization generalizes directly • to microarrays with multilple channels, i.e. C=3,4,... . • to multiple arrays, if the experimental design allows, i.e. K=2,3,... .

In addition: • No between-array normalization is needed (”it’s included”).

• Normalization to a common reference is straightforward. That is, if all arrays share the same reference, first all reference channels is normalized toward each other to form a ”baseline channel”, then test channels are normalized toward the baseline channel.

Usage (in aroma):

normalizeAffine(rg)

54

Paper E

H. Bengtsson, G. Jönsson, and J. Vallon-Christersson.

Calibration and assessment of channel-specific biases in microarray data with extended dynamical range.

Preprints in Mathematical Sciences 2003:37, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2003.

55

The microarray technology

(again)

56

Scan protocol

By scanning the

same

array at

various

PMT gains (and

constant

laser power) we can learn about

h c

. When we scan at gains

v

and

w

, we obtain two different measurements of the

unknown

, but

fixed

amount of hybridized DNA (

x c

,

i

): for both channels

c

=

{ R

,

G }

and all genes

i = 1,...,I

.

Observed signals at various gains

zoom  57 Channel:

c

=

G

.

PMT voltage pairs (

v

,

w

): (800,500) (600,500) (700,500) (800,600) (700,600) (800,700) All PMT pairs converge at the same point (

e c ,e

c ) and not at the origin (0,0).

 Affine measurement function:

58

General model and estimation

The K observed signals in channel c for each gene i: scatter around the line with The line L(

a

c ,

b

c ) can be estimate robustly using IWPCA as before. Next, define Bias parameters

a

c are uniquely identified by the additional constraint that ||

d

c || ¼ 0, i.e.

59 Backtransform:

Multiscan calibration

Within-channel M vs A:

60

Results

By scanning the same microarray at various sensitivity levels one obtains: • calibrated signals with scanner biases removed • an extended dynamical range • improved signal-to-noise levels • better understanding of the noise structure (of the scanner)

61

Paper G

H. Bengtsson and O. Hössjer.

Affine calibration for microarrays with dilution series or spike ins.

Preprints in Mathematical Sciences 2004:19, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2004.

62

Model

gene

i=1,...,I

replicated spots

j=1,...,J

at various concentrations (dilution series).

channel

c

. First, assume that the concentration of fluorophores

b i,j

replicated spots of a gene

i

is the same: of all

z c,i,j = z c,i

Model the total amount of light

x c,i,j

from spot

(i,j)

as

x c,i,j = b i,j z c,i

63

Model cont.

We observe

y c,i,j = a c + b c x c,i,j +

 c,i,j

= a c + b c b i,j z c,i +

 c,i,j  c,i,j independent zero-mean noise with variance  c,i,j 2 .

a c

and

b c

offset and scale parameters for channel

c

In the paper, the variance function  c,i,j 2 =  c,i 2 =

k c

 i 2 is used. The offset parameter

a c

can now be uniquely estimated using

maximum likelihood techniques

utilizing numerical optimization and PCA.

b c

are estimated as before.

64

Probe calibration

Before After C=2, I=432, J=4 ten dilution series are highlighted

65

In addition the method gives...

Estimates of the

relative expression level of the dilution series

follow directly from the maximum likelihood procedure.

66

Result

Arrays should contain multiple spots for the same gene that have different dilutions (instead of identical replicates) Systematic treatment of bias (optical background), expression ratios for the dilution series ("genes"), and noise.

67

Good scientific software is like a good scientific publication

  

reproducible peer-reviewed easily accessible by other researchers, society

 

builds on the work of others other will build their work on top of it

commercialize those developments that are successful

68

Why Open Source?

so that you can find out how it is being used so that you can modify what algorithm is being used, and these algorithms to try out new ideas or to accommodate local conditions or needs so that they can be used as components

Transparency

Pursuit of reproducibilty

Efficiency of development

Training

69

Paper C

H. Bengtsson.

The R.oo package - object-oriented programming with references using standard R code.

In Kurt Hornik, Friedrich Leisch, and Achim Zeileis, editors, Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), Vienna, Austria, March 2003

70

R.oo package

Objectives:

• Better and more robust object-oriented design and implementation • More memory efficient and faster implementations • Easy to understand and to extend • User- and developer-friendly interface • Core of the microarray package (

aroma

) •

Methods:

• Single root class

Object

all classes inherits from.

• Provide support for

reference variables.

R Coding Conventions (RCC),

e.g. naming conventions (

MyClassName,myObject, myFunction()

) • Free and open-source (5500 code lines).

• Proof of concept, rich documentation with many examples (~100 pages).

71

Paper F

H. Bengtsson.

aroma - An R Object-oriented Microarray Analysis environment.

Preprints in Mathematical Sciences 2004:18, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2004.

72

aroma package

Objectives:

• Provide simple methods for low-level analysis of microarray data • Wide support for file formats • User- and developer-friendly interface • Easy to understand and easy to extend • Easy to maintain

Methods:

• Make use of R.oo

• Make use of R.classes pkgs (20000 lines of code & > 400 pages of docs) • Follow R Coding Conventions • Free and open-source (18000 code lines) • Easy to install anywhere.

• Rich documentation with examples (>280 pages)

73

A short example

> gpr <- MicroarrayData$read("Slide1.gpr") > gpr [1] "GenePixData: Number of slides: 3. Number of fields: 43. Layout: Grids: 4x4 (=16), spots in grids: 18x18 (=324), total numberof spots: 5184. Spot names are specified. Spot ids are specified." > plotSpatial(gpr) > raw <- getRawData(gpr) > ma <- getSignal(raw) > idx <- (abs(ma$M) > 1.5) > plot(ma) > highlight(ma, idx, col="blue") > normalizeAffine(ma) > ...