
Statistics for Microarrays

Experimental Design, Normalization, and Exploratory Data Analysis


Class web site: http://statwww.epfl.ch/davison/teaching/Microarrays/

Biological question (differentially expressed genes, sample class prediction, etc.)
→ Experimental design
→ Microarray experiment
→ 16-bit TIFF files
→ Image analysis: (Rfg, Rbg), (Gfg, Gbg)
→ Normalization: R, G
→ Estimation, Testing, Clustering, Discrimination
→ Biological verification and interpretation

Some Considerations for cDNA Microarray Experiments (I)

Scientific (Aims of the experiment)
• Specific questions and priorities
• How will the experiments answer the questions

Practical (Logistic)
• Types of mRNA samples: reference, control, treatment, mutant, etc.
• Source and amount of material (tissues, cell lines)
• Number of slides available

Some Considerations for cDNA Microarray Experiments (II)

Other Information
• Experimental process prior to hybridization: sample isolation, mRNA extraction, amplification, labelling, …
• Controls planned: positive, negative, ratio, etc.
• Verification method: Northern, RT-PCR, in situ hybridization, etc.

Aspects of Experimental Design Applied to Microarrays (I)

Array Layout
• Which cDNA sequences are printed
• Spatial position

Allocation of samples to slides
• Design layouts
• A vs B: Treatment vs control
• Multiple treatments
• Factorial
• Time series

Aspects of Experimental Design Applied to Microarrays (II)

Other considerations
• Replication
• Physical limitations: the number of slides and the amount of material
• Sample size
• Extensibility - linking

Layout options

The main issue is the use of reference samples, typically labelled green. Standard statistical design principles can lead to more efficient layouts; use of dye-swaps can also help. Sample size determination is harder than usual, as there are 1,000s of possible changes, each with its own SD.

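Natural design choice

[Figure: two layouts. Left: treatments T1, T2, T3, T4, each compared with the control C. Right: samples T1, T2, ..., Tn-1, Tn, each compared with Ref.]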

Case 1: Meaningful biological control (C) Samples: Liver tissue from four mice treated by cholesterol modifying drugs.

Question 1: Genes that respond differently between the T and the C.

Question 2: Genes that responded similarly across two or more treatments relative to control.

Case 2: Use of universal reference Samples: Different tumor samples.

Question: To discover tumor subtypes.

Treatment vs Control

Two samples, e.g. KO vs. WT or mutant vs. WT

Direct: T and C on the same slides. Estimate: average (log (T/C)). Variance: σ²/2

Indirect: T vs Ref and C vs Ref on separate slides. Estimate: log (T / Ref) – log (C / Ref). Variance: 2σ²

One-way layout: one factor, k levels

                     I) Common reference   II) Common reference   III) Direct comparison
Samples              A, B, C, ref          A, B, C, ref           A, B, C
Number of slides     N = 3                 N = 6                  N = 3
Ave. variance        2                                            0.67
Units of material    A = B = C = 1         A = B = C = 2          A = B = C = 2
Ave. variance                              1                      0.67

For k = 3, efficiency ratio (Design I / Design III) = 3. In general, efficiency ratio = 2k / (k-1). (But may not be achievable due to lack of independence.)
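The efficiency ratio can be checked with exact arithmetic. The sketch below is an illustration, not the course's own code: it assumes independent log ratios with unit variance, that "direct comparison" means every pair of the k samples shares one slide, and that both designs get the same number of slides. The function names are hypothetical.

```python
from fractions import Fraction

def var_common_reference(replicates):
    # A vs B is estimated as (A vs Ref) - (B vs Ref). Each arm is a mean of
    # `replicates` slides, so the variance is 2 / replicates (units of sigma^2).
    return Fraction(2) / replicates

def var_all_pairs_direct(k):
    # Least-squares variance of any pairwise contrast when every pair of the
    # k samples is hybridized together on exactly one slide.
    return Fraction(2, k)

def efficiency_ratio(k):
    # Assumption: equal number of slides, k(k-1)/2, in both designs, so the
    # common reference design has (k-1)/2 replicate slides per sample.
    slides = Fraction(k * (k - 1), 2)
    replicates = slides / k
    return var_common_reference(replicates) / var_all_pairs_direct(k)

for k in (3, 4, 5):
    print(k, efficiency_ratio(k), Fraction(2 * k, k - 1))
```

For k = 3 the ratio is 2 / 0.67 = 3, and for every k the computed ratio agrees with 2k / (k-1).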

Illustration from one experiment

[Figure: box plots of log ratios for Design I (A, B, C vs Ref) and Design III (direct comparison of A, B, C).]

Box plots of log ratios: direct still ahead

Factorial experiments

Treated cell lines: CTL, OSM, EGF, OSM & EGF

Possible experiments

Here interest is not in genes for which there is an O or an E (main) effect, but in genes for which there is an O×E interaction, i.e. in genes for which log(O&E/O) - log(E/C) is large or small.

2 x 2 factorial: some design options

Samples: C, A, B, A.B. Design I is indirect; designs II to IV are a balance of direct and indirect comparisons. [Design diagrams not reproduced.]

Design   # Slides   Main effect A   Main effect B   Int A.B
I)       N = 6      0.5             0.5             1.5
II)      N = 6      0.67            0.43            0.67
III)     N = 6      0.5             0.5             1
IV)      N = 6      NA              0.3             0.67

Table entry: variance (assuming all log ratios uncorrelated)
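The variances 0.5, 0.5 and 1.5 for the indirect design can be reproduced by hand. The sketch below assumes a specific layout that the slide diagrams may or may not match: two replicate slides each of A vs C, B vs C and A.B vs C, with expected log ratios a, b and a + b + ab. The names are illustrative only.

```python
from fractions import Fraction

# Assumption: indirect design with two replicate slides per comparison
# (A vs C, B vs C, A.B vs C), six slides in total.
REPLICATES = 2

def variance(coefficients):
    # Variance of a linear combination of independent log ratios,
    # each with unit variance (units of sigma^2).
    return sum(Fraction(c) ** 2 for c in coefficients)

w = Fraction(1, REPLICATES)      # weight of one slide in a replicate mean
mean_A = [w] * REPLICATES        # estimates the main effect a
mean_B = [w] * REPLICATES        # estimates the main effect b
mean_AB = [w] * REPLICATES       # estimates a + b + ab

var_main_A = variance(mean_A)
var_main_B = variance(mean_B)
# interaction = mean(A.B vs C) - mean(A vs C) - mean(B vs C)
var_interaction = variance(
    mean_AB + [-c for c in mean_A] + [-c for c in mean_B]
)

print(var_main_A, var_main_B, var_interaction)
```

The interaction is three times as variable as a main effect here because it is built from all three indirect comparisons, which is why designs with some direct comparisons do better on that contrast.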

Some Design Possibilities for Detecting Interaction

Samples: treated tumor cell lines at 4 time points (30 minutes, 1 hour, 4 hours, 24 hours)

Question: Which genes contribute to the enhanced inhibitory effect of OSM when it is combined with EGF? Role of time?

[Figure: Design A and Design B, two layouts connecting ctl, OSM, EGF and OSM & EGF.]

Combining Estimates

Different ways of estimating the same contrast, e.g. A compared to P:
• Direct = A-P
• Indirect = (A-M) + (M-P), or (A-D) + (D-P), or -(L-A) - (P-L)

How do we combine these?
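One common answer is to weight each estimate by the inverse of its variance, which is what least squares gives when the estimates are independent. The sketch below assumes exactly that (the direct and indirect paths share no slides) and uses made-up log ratios; `combine` is a hypothetical helper, not a function from any package.

```python
def combine(estimates, variances):
    """Inverse-variance weighted average of independent estimates of one contrast.

    Returns the combined estimate and its variance.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total
    return combined, 1.0 / total

# Hypothetical data: one direct A-P slide (variance 1) and one indirect
# path (A-M) + (M-P) built from two slides (variance 2).
estimate, variance = combine([0.90, 1.20], [1.0, 2.0])
print(estimate, variance)
```

The direct estimate gets twice the weight of the indirect one, and the combined variance (2/3) is smaller than either input. When paths share slides the estimates are correlated and a full linear model fit is needed instead.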

Time Course Experiments

• Number of time points • Which differences are of highest interest (e.g. between initial time and later times, between adjacent times) • Number of slides available

Design choices in time series. Entry: variance

                                     t vs t+1             t vs t+2      t vs t+3
Design                               T1T2   T2T3   T3T4   T1T3   T2T4   T1T4      Ave
N=3
A) T1 as common reference            1      2      2      1      2      1         1.5
B) Direct hybridization              1      1      1      2      2      3         1.67
N=4
C) Common reference                  2      2      2      2      2      2         2
D) T1 as common ref + more           .67    .67    1.67   .67    1.67   1         1.06
E) Direct hybridization choice 1     .75    .75    .75    1      1      .75       .83
F) Direct hybridization choice 2     1      .75    1      .75    .75    .75       .83
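Entries of the time-series variance table can be recomputed from the layout alone. The sketch below treats each slide as one independent log ratio with unit variance and computes the least-squares variance of a contrast between two time points. The edge lists are assumptions about the lost diagrams: A as T1 against each other time, B as the chain T1-T2-T3-T4, E as the loop T1-T2-T3-T4-T1.

```python
from fractions import Fraction

def contrast_variance(n_samples, slides, i, j):
    """Variance (units of sigma^2) of the least-squares estimate of
    sample i minus sample j. `slides` is a list of (sample, sample) pairs.
    Assumes the design is connected."""
    # Laplacian of the design graph equals X'X for the slide design matrix.
    lap = [[Fraction(0)] * n_samples for _ in range(n_samples)]
    for a, b in slides:
        lap[a][a] += 1
        lap[b][b] += 1
        lap[a][b] -= 1
        lap[b][a] -= 1
    # Fix sample j at zero and solve the reduced system for a unit
    # right-hand side at i (Gauss-Jordan elimination, exact fractions).
    keep = [v for v in range(n_samples) if v != j]
    m = [[lap[r][c] for c in keep] + [Fraction(int(r == i))] for r in keep]
    size = len(m)
    for col in range(size):
        pivot = next(r for r in range(col, size) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(size):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return m[keep.index(i)][size]

# Column order of the table: T1T2, T2T3, T3T4, T1T3, T2T4, T1T4.
pairs = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (0, 3)]
designs = {
    "A) T1 as common reference": [(0, 1), (0, 2), (0, 3)],
    "B) direct, chain": [(0, 1), (1, 2), (2, 3)],
    "E) direct, loop": [(0, 1), (1, 2), (2, 3), (3, 0)],
}
table = {
    name: [contrast_variance(4, slides, i, j) for i, j in pairs]
    for name, slides in designs.items()
}
for name, row in table.items():
    print(name, [str(v) for v in row], "ave", float(sum(row) / len(row)))
```

The loop spends one extra slide compared with the chain and brings the worst contrast (T1 vs T4) down from 3 to 0.75.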

Replication

• Why?
  – To reduce variability
  – To increase generalizability
• What is it?
  – Duplicate spots
  – Duplicate slides
  – Technical replicates
  – Biological replicates

Technical Replicates: Labeling

• 3 sets of self-self hybridizations
• Data 1 and Data 2 were labeled together and hybridized on two slides separately
• Data 3 were labeled separately

Sample Size

• Variance of individual measurements (X)
• Effect size(s) to be detected (X)
• Acceptable false positive rate
• Desired power (probability of detecting an effect of at least the specified size)
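These four ingredients fit together in the textbook normal-approximation formula for one gene. The sketch below is a generic calculation, not the course's method: it assumes a two-sided z-test on the mean log ratio of a single gene with known SD, and `replicates_needed` is a hypothetical name.

```python
from math import ceil
from statistics import NormalDist

def replicates_needed(sd, effect, alpha=0.05, power=0.8):
    """Number of replicate log ratios needed to detect a mean log ratio of
    size `effect` for one gene (two-sided z-test, known SD)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # controls the false positive rate
    z_power = z.inv_cdf(power)           # controls the power
    return ceil(((z_alpha + z_power) * sd / effect) ** 2)

# A two-fold change is 1 on the log2 scale.
print(replicates_needed(sd=1.0, effect=1.0))
print(replicates_needed(sd=0.5, effect=1.0))
```

With thousands of genes tested at once, alpha would have to be far smaller than 0.05, and each gene has its own SD, which is what makes the calculation hard in practice.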

Extensibility

• “Universal” common reference for an arbitrary, undetermined number of (future) experiments
• Provides extensibility of the series of experiments (within and between labs)
• Linking experiments necessary if common reference source diminished/depleted

Summary

• Balance of direct and indirect comparisons • Optimize precision of the estimates among comparisons of interest • Must satisfy scientific and physical constraints of the experiment

(BREAK)

Mini-Review: How to make a cDNA microarray

[Figure: printing the array.]
• 384 well plate: contains cDNA probes
• Pins collect cDNA from wells
• Glass slide: array of bound cDNA probes; cDNA clones spotted in duplicate
• 4x4 blocks = 16 print-tip groups (print-tip groups 1 and 6 are labelled)

Building the chip

[Figures: Ngai Lab arrayer, UC Berkeley; print-tip head.]

Microarray Experiment

Hybridization

Binding cDNA samples (targets) to cDNA probes on slide Hybridise for 5-12 hours

Quantification of expression

For each spot on the slide we calculate

Red intensity = Rfg - Rbg
Green intensity = Gfg - Gbg

(fg = foreground, bg = background) and combine them in the log (base 2) ratio

log_2(Red intensity / Green intensity)
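The per-spot calculation can be sketched as a small function. How to treat spots whose background is at least as bright as the foreground is not stated on the slide; returning None for them is an assumption made here, and `log_ratio` is a hypothetical name.

```python
from math import log2

def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Background-corrected log2(Red/Green) for one spot.

    Returns None when either corrected intensity is not positive,
    because the log ratio is then undefined (assumed handling)."""
    red = r_fg - r_bg
    green = g_fg - g_bg
    if red <= 0 or green <= 0:
        return None
    return log2(red / green)

print(log_ratio(1100, 100, 600, 100))   # red 1000, green 500
print(log_ratio(90, 100, 600, 100))     # red channel below background
```

A value of 1 means the red channel is twice the green channel after background correction, and -1 means half.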

Background matters

From Spot From GenePix

Quality Measurements

• Array
  – Correlation between spot intensities
  – Percentage of spots with no signals
  – Distribution of spot signal area
• Spot
  – Signal / Noise ratio
  – Variation in pixel intensities
  – Identification of “bad spot” (spots with no signal)
• Ratio (2 spots combined)
  – Circularity

Affymetrix Oligo Chips

• Only one “color”
• Different technology, different normalization issues
• Affy chip normalization is an active research area – see http://www.stat.berkeley.edu/users/terry/zarray/Affy/affy_index.html

Preprocessing: Data Visualization

• Was the experiment a success? • Are there any specific problems?

• What analysis tools should be used?

Tools for Microarray Normalization and Analysis

• Both commercial and free software • The labs for this course use the R package sma • Upcoming release (29 April 2002) of Bioconductor ( http://www.bioconductor.org/ )

Red/Green overlay images

Co-registration and overlay offers a quick visualization, revealing information on color balance, uniformity of hybridization, spot uniformity, background, and artefacts such as dust or scratches.

Good: low bg, lots of d.e.

Bad: high bg, ghost spots, little d.e.

Scatterplots: always log, always rotate

Left: log_2 R vs log_2 G. Right: M = log_2(R/G) vs A = log_2 √(RG)
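The "rotation" is just a change of coordinates from (log_2 G, log_2 R) to (A, M). A minimal sketch, with `ma_transform` as a hypothetical helper name:

```python
from math import log2

def ma_transform(r, g):
    """Map background-corrected intensities (R, G) to (M, A)."""
    log_r, log_g = log2(r), log2(g)
    m = log_r - log_g            # M = log2(R/G), the log ratio
    a = 0.5 * (log_r + log_g)    # A = log2(sqrt(R*G)), the average log intensity
    return m, a

print(ma_transform(1024, 256))
```

On the MA scale, equal expression sits on the horizontal line M = 0 instead of the diagonal, so intensity-dependent dye bias shows up as curvature that is easy to see.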

Histograms

Signal/Noise = log_2(spot intensity / background intensity)

Boxplots of log 2 R/G Liver samples from 16 mice: 8 WT, 8 ApoAI KO

Spatial plots: background from the two slides

Highlighting extreme log ratios

Top (black) and bottom (green) 5% of log ratios

Pin group (sub-array) effects

Left: Lowess lines through points from pin groups. Right: Boxplots of log ratios by pin group

Boxplots and highlighting pin group effects

Print-tip groups

Clear example of spatial bias

Plate effects

Clearly visible plate effects

KO #8

Probes: ~6,000 cDNAs, including 200 related to lipid metabolism.

Arranged in a 4x4 array of 19x21 sub-arrays.

Time of printing effects

Green channel intensities (log_2 G) plotted against spot number. Printing over 4.5 days.

The previous slide depicts a slide from this print run.

Preprocessing: Normalization

• Why?

To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples.

• How do we know it is necessary?

By examining self-self hybridizations, where no true differential expression is occurring.

We find dye biases which vary with overall spot intensity, location on the array, plate origin, pins, scanning parameters,….

Self-self hybridizations

False color overlay

Boxplots within pin-groups Scatter (MA-)plots

Similar patterns apparent in non self-self hybridizations

From the NCI60 data set (Stanford web site)

From Lawrence Berkeley National Laboratory

Normalization Methods (I)

• Normalization based on a global adjustment

log_2 R/G -> log_2 R/G - c = log_2 R/(kG)

Choices for k or c = log_2 k are c = median or mean of the log ratios over a chosen set of genes. Or, total intensity normalization, where k = ∑R_i / ∑G_i.

• Intensity-dependent normalization

Here, run a line through the middle of the MA plot, shifting the M value of the pair (A,M) by c = c(A), i.e.

log_2 R/G -> log_2 R/G - c(A) = log_2 R/(k(A)G).

One estimate of c(A) is made using the LOWESS function of Cleveland (1979): LOcally WEighted Scatterplot Smoothing.
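Both adjustments can be sketched in a few lines. The smoother below is a simplified stand-in for LOWESS: a local linear fit with tricube weights and no robustness iterations, so it is more sensitive to outliers than Cleveland's procedure. The function names and the `frac` default are assumptions made for this illustration.

```python
from statistics import median

def global_median_normalize(M):
    # Global adjustment: subtract one constant c = median(M) from every spot.
    c = median(M)
    return [m - c for m in M]

def lowess_fit(A, M, frac=0.4):
    """Estimate c(A) at each observed A by a local linear fit with
    tricube weights (simplified LOWESS, no robustness iterations)."""
    n = len(A)
    k = max(2, int(round(frac * n)))          # points in each local window
    fitted = []
    for a0 in A:
        dist = sorted(abs(a - a0) for a in A)
        h = dist[k - 1] or 1e-12              # bandwidth: k-th nearest distance
        w = [(1 - min(abs(a - a0) / h, 1.0) ** 3) ** 3 for a in A]
        sw = sum(w)
        swx = sum(wi * a for wi, a in zip(w, A))
        swy = sum(wi * m for wi, m in zip(w, M))
        swxx = sum(wi * a * a for wi, a in zip(w, A))
        swxy = sum(wi * a * m for wi, a, m in zip(w, A, M))
        det = sw * swxx - swx * swx
        if abs(det) < 1e-12:                  # degenerate window: local mean
            fitted.append(swy / sw)
        else:
            slope = (sw * swxy - swx * swy) / det
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * a0)
    return fitted

def lowess_normalize(A, M, frac=0.4):
    # Intensity-dependent adjustment: M -> M - c(A).
    return [m - c for m, c in zip(M, lowess_fit(A, M, frac))]
```

A global shift leaves any intensity-dependent trend in place, while subtracting c(A) flattens it. Either choice rests on the assumption that most genes at each intensity are not differentially expressed.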

Normalization Methods (II)

• Within print-tip group normalization

In addition to intensity-dependent variation in log ratios, spatial bias can also be a significant source of systematic error.

Most normalization methods do not correct for spatial effects produced by hybridization artefacts or print-tip or plate effects during the construction of the microarrays.

It is possible to correct for both print-tip and intensity dependent bias by performing LOWESS fits to the data within print-tip groups, i.e.

log_2 R/G -> log_2 R/G - c_i(A) = log_2 R/(k_i(A)G),

where c_i(A) is the LOWESS fit to the MA-plot for the ith grid only.

Normalization: Which Spots to use?

The LOWESS lines can be run through many different sets of points, and each strategy has its own implicit set of assumptions justifying its applicability. For example, the use of a global LOWESS approach can be justified by supposing that, when stratified by mRNA abundance, a) only a minority of genes are expected to be differentially expressed, or b) any differential expression is as likely to be up-regulation as down-regulation. Pin-group LOWESS requires stronger assumptions: that one of the above applies within each pin-group. The use of other sets of genes, e.g. control or housekeeping genes, involves similar assumptions.

Use of Control Slides: M vs A Plot

[Figure: MA-plot with M = log R/G = log R - log G and A = (log R + log G)/2. Shown: Lowess curve, blanks, positive controls, negative controls.]

Normalization makes a difference

Global scale, global lowess, pin-group lowess; spatial plot after, smooth histograms of M after

Normalization by controls: Microarray Sample Pool titration series

• Pool the whole library
• Control set to aid intensity-dependent normalization
• Different concentrations in titration series
• Spotted evenly spread across the slide in each pin-group

Comparison of Normalization Schemes (courtesy of Jason Goncalves)

• No consensus on best normalization method
• Experiment done to assess the common normalization methods
• Based on reciprocal labeling experimental data for a series of 140 replicate experiments on two different arrays, each with 19,200 spots

DESIGN OF RECIPROCAL LABELING EXPERIMENT Replicate experiment in which we assess the same mRNA pools but invert the fluors used.

The replicates are independent experiments and are scanned, quantified and normalized as usual

Comparison of Normalization Methods - Using 140 19K Microarrays

[Figure: bar chart, vertical axis from 0.3 to 0.46. Horizontal axis, Normalization Method: Pre Normalized, Global Intensity, Subarray Intensity, Global Ratio, Sub-Array Ratio, Global LOWESS, Subarray LOWESS. One bar is marked ***.]

Scale normalization: between slides

Boxplots of log ratios from 3 replicate self-self hybridizations.

Left panel: before normalization
Middle panel: after within print-tip group normalization
Right panel: after a further between-slide scale normalization.

The “NCI 60” experiments (no bg)

Some scale normalization seems desirable

Scale normalization: another data set

Only small differences in spread apparent. No action required.

One way of taking scale into account

Assumption: All slides have the same spread in M.

True log ratio is m_ij, where i represents different slides and j represents different spots.

Observed is M_ij, where M_ij = a_i m_ij.

Robust estimate of a_i is MAD_i = median_j { |M_ij - median_j(M_ij)| }
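A MAD-based rescaling can be sketched as follows. Dividing each MAD by the geometric mean of all the MADs, so that the scale factors multiply to one, is an assumption added here to fix the overall scale; the slide only gives the MAD itself. The function names are hypothetical, and every slide is assumed to have a non-zero MAD.

```python
from math import exp, log
from statistics import median

def mad(values):
    """Median absolute deviation from the median."""
    centre = median(values)
    return median(abs(v - centre) for v in values)

def scale_normalize(slides):
    """slides: one list of M values per slide.
    Returns rescaled slides that all share the same MAD."""
    mads = [mad(s) for s in slides]
    # Assumption: constrain the scale factors a_i to have geometric mean 1.
    geo_mean = exp(sum(log(m) for m in mads) / len(mads))
    factors = [m / geo_mean for m in mads]
    return [[v / a for v in s] for s, a in zip(slides, factors)]

slides = [[-1.0, 0.0, 1.0, 2.0, -2.0], [-2.0, 0.0, 2.0, 4.0, -4.0]]
rescaled = scale_normalize(slides)
print([round(mad(s), 4) for s in rescaled])
```

The same function applies unchanged to print-tip groups within one slide: pass one list of M values per group instead of per slide.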

A slightly harder normalization problem Global lowess doesn’t do the trick here

Print-tip-group normalization helps

But not completely. Still a lot of scatter in the middle in a WT vs KO comparison.

Effects of previous normalization

Left: before normalization. Right: after print-tip-group normalization.

Within print-tip-group box plots of M after print-tip-group normalization

Taking scale into account, cont.

Assumption: All print-tip-groups have the same spread in M.

True log ratio is m_ij, where i represents different print-tip-groups and j represents different spots.

Observed is M_ij, where M_ij = a_i m_ij.

Robust estimate of a_i is MAD_i = median_j { |M_ij - median_j(M_ij)| }

Effect of location & scale normalization. Clearly care is needed in making decisions like this.

A comparison of three M v A plots

Panels: unnormalized; print-tip normalization; print-tip & scale normalization.

The same normalization on another data set

Panels: before; after.

Normalization: Summary

• Reduces systematic (not random) effects
• Makes it possible to compare several arrays
• Use log ratios (M vs A-plots)
• Lowess normalization (dye bias)
• MSP titration series – composite normalization
• Pin-group location normalization
• Pin-group scale normalization
• Between slide scale normalization
• Control spots
• Normalization introduces more variability
• Outliers (bad spots) are handled with replication

Pre-processed cDNA Gene Expression Data

On p genes for n slides: p is O(10,000), n is O(10-100), but growing.

                       Slides
Genes   slide 1   slide 2   slide 3   slide 4   slide 5   ...
1        0.46      0.30      0.80      1.51      0.90     ...
2       -0.10      0.49      0.24      0.06      0.46     ...
3        0.15      0.74      0.04      0.10      0.20     ...
4       -0.45     -1.03     -0.79     -0.56     -0.32     ...
5       -0.06      1.06      1.35      1.09     -1.09     ...

Gene expression level of gene 5 in slide 4 = log_2(Red intensity / Green intensity)

These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.

First Steps: QQ-Plots

Used to assess whether a sample follows a particular (e.g. normal) distribution (or to compare two samples).

A method for looking for outliers when data are mostly normal.

[Figure: QQ-plot. Sample axis: the marked sample quantile is 0.125. Theoretical axis: the marked point is the value from the Normal distribution which yields a quantile of 0.125.]
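The points of a normal QQ-plot pair each ordered observation with the Normal value at the same quantile. The sketch below uses the plotting positions (i + 0.5) / n, which is one common convention and an assumption here; with four observations the smallest one sits at quantile 0.125. The function name is hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Return (theoretical, observed) pairs for a normal QQ-plot."""
    n = len(sample)
    ordered = sorted(sample)
    # Assumed plotting positions: (i + 0.5) / n for i = 0, ..., n-1.
    probs = [(i + 0.5) / n for i in range(n)]
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf(p) for p in probs]
    return list(zip(theoretical, ordered))

# Hypothetical log ratios for four genes.
for theo, obs in normal_qq_points([0.3, -1.2, 0.8, 2.5]):
    print(round(theo, 3), obs)
```

If the data are roughly normal the points fall near a straight line; genes far off the line at either end are candidate outliers, which in this setting means candidate differentially expressed genes.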

Acknowledgments

Terry Speed (UCB and WEHI) Jean Yee Hwa Yang (UCB) Sandrine Dudoit (UCB) Ben Bolstad (UCB) Natalie Thorne (WEHI) Ingrid Lönnstedt (Uppsala) Henrik Bengtsson (Lund) Jason Goncalves (Iobion) Matt Callow (LLNL) Percy Luu (UCB) John Ngai (UCB) Vivian Peng (UCB) Dave Lin (Cornell)