
GeneChip analysis

Microarrays Statistics Physics Biology

Andrew Harrison, UCL + Essex London Pain Consortium [email protected]

Microarrays (mRNA expression)

Microarrays are a massively-parallel Northern Blot. Each array contains thousands of sets of different nucleotide sequences.

Each set of sequences on the array is complementary to the mRNA nucleotide sequence of a different gene.

Probe cells of an Affymetrix GeneChip contain millions of identical 25-mers

Hybridization between biotin-labelled mRNA and the probes on the chip

A laser causes the biotin to fluoresce, which is then detected by a scanner

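Affymetrix microarrays

Figure: an mRNA reference sequence (5’ to 3’) aligned with its perfect match and mismatch probe cells. The mismatch probe is identical to the perfect match except at the central base (here G in the perfect match, C in the mismatch). The probe cells are actually scattered on the chip.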

Affymetrix probe set

Data for the same gene Perfect Match (PM) Mismatch (MM)

Probe pair

Each Gene Chip contains tens of thousands of probe sets

Each chip emits over a broad range of intensities (a dynamic range of many hundreds)

Chip calibration

Correct Background, Normalise, Correct for Cross Hybridisation, Expression Measure; then High-level analysis and biological interpretation

Background Fluorescence needs to be removed

Chips need to be normalised against each other.

Each chip is a different colour. Normalisation methods include, e.g., invariant genes, lowess, quantiles
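As an illustration of one of the normalisation methods named above, here is a minimal sketch of quantile normalisation. It assumes every chip has the same number of probes and breaks ties by probe order; the function name and the toy intensities are made up for the example and are not taken from any package.

```python
def quantile_normalise(chips):
    """chips: list of equal-length lists of intensities, one list per chip."""
    n_probes = len(chips[0])
    # Sort each chip, then average across chips at each rank.
    sorted_chips = [sorted(chip) for chip in chips]
    rank_means = [
        sum(s[rank] for s in sorted_chips) / len(chips)
        for rank in range(n_probes)
    ]
    normalised = []
    for chip in chips:
        # Probe indices ordered from lowest to highest intensity on this chip.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            # Each probe receives the cross-chip mean for its rank.
            out[probe_index] = rank_means[rank]
        normalised.append(out)
    return normalised


# Toy data: three chips, three probes each (illustrative values only).
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [3.0, 4.0, 8.0]]
normalised = quantile_normalise(chips)
```

After this step every chip has exactly the same distribution of intensities; only the assignment of values to probes differs between chips.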

Expression Measure

The intensities of the multiple probes within a probeset are combined into ONE measure of expression: MAS, RMA, dChip

MAS 5.0 (Signal) takes the Tukey biweight mean of the log of the difference between PM and MM (with an adjusted mismatch value used when MM exceeds PM).
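A minimal sketch of a one-step Tukey biweight mean, the robust average referred to above. The tuning constant `c=5` and the small `epsilon` are the values commonly quoted for MAS 5.0 and should be treated as assumptions here; the input list stands for the log-scale values of the probe pairs in one probe set and is made up for the example.

```python
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight estimate of the centre of `values`."""
    m = median(values)
    # Median absolute deviation from the median: a robust measure of spread.
    mad = median(abs(v - m) for v in values)
    weights = []
    for v in values:
        u = (v - m) / (c * mad + epsilon)
        # Values more than c MADs from the median get zero weight.
        weights.append((1 - u * u) ** 2 if abs(u) <= 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Illustrative log-scale probe values; the last probe is an outlier.
log_probe_values = [7.9, 8.1, 8.0, 8.2, 7.8, 12.5]
signal = tukey_biweight(log_probe_values)
```

The outlying probe is ignored entirely, so the estimate stays close to 8 rather than being pulled towards the ordinary mean.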

1-9 are different chips.

dChip and RMA ‘model’ the systematic hybridisation patterns when calibrating an expression measure.

Chip calibration

Differentially expressed genes are identified using T-tests, Fold Changes and Z-scores

T-statistics: Each gene is studied independently of the other genes

$$ t = \frac{\mathrm{mean}(L4) - \mathrm{mean}(\mathrm{control})}{\sqrt{\dfrac{\mathrm{var}(L4)}{N_{L4}} + \dfrac{\mathrm{var}(\mathrm{control})}{N_{\mathrm{control}}}}} $$

More significant change. Less significant change.
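A minimal sketch of the t-statistic for a single gene, computed from replicate expression values in the L4 and control groups. It assumes the unpooled (Welch) form with sample variances; the replicate values are made up for the example.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(treated, control):
    """Unpooled two-sample t-statistic for one gene."""
    standard_error = sqrt(
        variance(treated) / len(treated) + variance(control) / len(control)
    )
    return (mean(treated) - mean(control)) / standard_error


# Illustrative log-scale expression values for one gene, three chips per group.
l4 = [8.1, 8.3, 8.2]
control = [7.0, 7.2, 7.1]
t = t_statistic(l4, control)
```

Because the variances come from only a handful of chips, a gene with an accidentally tiny variance can score a very large t even when its change in expression is small.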

Once chips have gone through the calibration process, changes in gene expression between conditions or over time can be observed.

The ratio of expression values between two conditions is known as Fold Change.

m = log2(Fold Change), a = log2(Average Intensity)
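A minimal sketch of computing m and a for one gene from its intensities in two conditions. It assumes the average intensity is the geometric mean (the mean of the two log2 intensities), which is the usual convention for MA plots; the slide does not say which average is meant.

```python
from math import log2


def ma_values(intensity_a, intensity_b):
    """Return (m, a): log2 fold change and log2 average intensity."""
    m = log2(intensity_b / intensity_a)
    # Assumption: average on the log scale (geometric mean of intensities).
    a = 0.5 * (log2(intensity_a) + log2(intensity_b))
    return m, a


# Illustrative intensities: a four-fold increase between conditions.
m, a = ma_values(100.0, 400.0)
```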

Variability of fold change is a function of intensity!!

MAS 5 shows large variability at low intensities RMA shows small variability at low intensities

Fold change is NOT the same as significance!!

At least one of these people has tripled in weight in the last two years. Is such a change unusual?

Sliding Z

Quackenbush (2002)

$$ Z = \frac{m - \mathrm{mean}(m)}{\mathrm{standard\ deviation}(m)} $$
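A minimal sketch of a sliding Z-score, assuming the mean and standard deviation of m are taken over a window of genes with similar intensity a. The window here is a fixed number of neighbours in intensity rank, which is one possible choice and not necessarily the one used in the talk; the data are made up.

```python
from statistics import mean, stdev


def sliding_z(m, a, window=5):
    """Z-score each m against the `window` genes closest in intensity rank."""
    n = len(m)
    window = min(window, n)
    # Gene indices ordered from lowest to highest intensity.
    order = sorted(range(n), key=lambda i: a[i])
    half = window // 2
    z = [0.0] * n
    for rank, gene in enumerate(order):
        # Centre the window on this gene, clamped to the ends of the ranking.
        start = max(0, min(rank - half, n - window))
        neighbours = [m[order[j]] for j in range(start, start + window)]
        z[gene] = (m[gene] - mean(neighbours)) / stdev(neighbours)
    return z


# Illustrative data: gene 5 has a large fold change for its intensity.
m_values = [0.1, -0.1, 0.05, -0.05, 0.0, 2.0, 0.1, -0.1]
a_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
z_scores = sliding_z(m_values, a_values)
```

Because each gene is compared only with genes of similar intensity, the larger scatter of fold changes at low intensity is absorbed into the local standard deviation.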

How can you validate the results from statistics?

Perform calibrations where you know the answer to expect (spike-in experiments).

Determine the statistical properties you expect your results to have, and see if the experiments match your assumptions (thought experiments).

Does the biology make sense?

The density of intensities of significant genes produced by T-statistics has a poor overlap with the intensities of all the other genes. The histogram is the population density; the line is the density of significant genes.

The intensity histogram of Sliding Z scores matches very well to the population.

Within Bioconductor (within R), the package “affy” allows a choice of calibration protocol:

3 background corrections

Nothing, MAS, RMA

5 different normalisations

Constant, Invariant Genes, Lowess, Qspline, Quantiles

3 different expression measures

dChip (aka Li-Wong), MAS, RMA

There are 3 x 5 x 3 = 45 different combinations.
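The count of 45 can be checked by enumerating every combination of the three choices listed above. The strings below are plain labels copied from the slide, not the argument names used by the affy package.

```python
from itertools import product

backgrounds = ["Nothing", "MAS", "RMA"]
normalisations = ["Constant", "Invariant Genes", "Lowess", "Qspline", "Quantiles"]
expression_measures = ["dChip (Li-Wong)", "MAS", "RMA"]

# Every calibration protocol is one (background, normalisation, measure) triple.
protocols = list(product(backgrounds, normalisations, expression_measures))
print(len(protocols))  # 45
```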

Which factors lead to certain calibration protocols sharing a large consensus of significant genes?

1, 2 & 3 are identical to themselves (consensus is 100%, colour white). 2 and 3 share the most in common (light grey). 1 and 2 share the least in common (dark grey).

The list of significantly changing genes derived from T-statistics is sensitive (dark) to the choice of calibration protocol (45 possibilities)
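A minimal sketch of how a consensus matrix between calibration protocols could be built from their lists of significant genes. It assumes consensus is measured as shared genes divided by the union of the two lists (the Jaccard index); the talk does not state the exact definition, and the gene names are hypothetical.

```python
def consensus(genes_a, genes_b):
    """Fraction of genes shared between two lists (Jaccard index)."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def consensus_matrix(gene_lists):
    """Square matrix of pairwise consensus between protocols."""
    return [[consensus(x, y) for y in gene_lists] for x in gene_lists]


# Hypothetical significant-gene lists from three calibration protocols.
protocol_1 = ["g1", "g2", "g3", "g4"]
protocol_2 = ["g4", "g5", "g6", "g7"]
protocol_3 = ["g3", "g4", "g5", "g6"]
matrix = consensus_matrix([protocol_1, protocol_2, protocol_3])
```

In this toy example each protocol agrees completely with itself, protocols 2 and 3 share the most genes, and protocols 1 and 2 share the fewest.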

T-statistics: Each gene is studied independently of the other genes

$$ t = \frac{\mathrm{mean}(L4) - \mathrm{mean}(\mathrm{control})}{\sqrt{\dfrac{\mathrm{var}(L4)}{N_{L4}} + \dfrac{\mathrm{var}(\mathrm{control})}{N_{\mathrm{control}}}}} $$

More significant change. Less significant change.

T-tests are very sensitive to the choice of normalisation

Significant: Fold change is small but variance is very small.

Not Significant: Fold change is a little larger but variance is also larger.

Figure: signals in Condition A and Condition B. Normalisation need modify only one signal.

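Penalised t-test

$$ t_{pen} = \frac{\mathrm{mean}(L4) - \mathrm{mean}(\mathrm{control})}{\mathrm{penalty} + \sqrt{\dfrac{\mathrm{var}(L4)}{N_{L4}} + \dfrac{\mathrm{var}(\mathrm{control})}{N_{\mathrm{control}}}}} $$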

Z-scores

Z-scores are less sensitive to the choice of calibration protocol

Clustering the calibration protocol matrix indicates that the major impact on consensus of significant genes for Z-scores is the choice of expression measure. Each cluster corresponds to one expression measure: Li-Wong (dChip), RMA and MAS.

Expression measures: Microarray Suite (MAS), Robust Multichip Average (RMA), Li & Wong (dChip).

45 different calibration protocols: for each expression measure, 3 background corrections x 5 normalisations.

The major uncertainty in the calibration is the choice of expression measure.

Recap

The biggest uncertainty in the calibration of Affymetrix data is how to combine all the multiple probes into one value (mRNA expression per gene).

Fold change is biased in intensity.

T-tests are sensitive to the choice of calibration.

Z-scores of fold changes provide a reliable statistical measure for all intensities.

... but why can’t we use fold changes?

Spike-in measurements of known concentrations

Figure: intensity plotted against log (transcript concentration). The plateau is probably due to cross-hybridisation to the genomic population of mRNA. For RMA (which only uses the PM information) there remains considerable signal at very low concentrations.

The non-linearity means that Fold Change (Intensity) is NOT the same as Fold Change (Transcript). It is difficult to establish when a gene is NOT expressed.

Cross Hybridisation

MAS 5.0 (Affymetrix) corrects for cross-hybridisation by subtracting the MisMatch signal from the Perfect Match.

RMA ignores the mismatches because the mismatch probes also hybridise to the true signal. But the perfect match contains a contribution from cross-hybridisation.

There is a need for a model of the physics of hybridisation (Naef and Magnasco 2003)

GC content is important

A-T base pairs have two hydrogen bonds. G-C base pairs have three hydrogen bonds.

Van der Waals interactions between adjacent bases. H-bond interactions between adjacent bases.

Nearest-neighbour interactions predict duplex kinetics, and so sequence order is important (SantaLucia). The binding energy of GAC (paired with CTG) is not the same as that of CAG (paired with GTC).

The fraction of overlap between transcript and probe depends upon the position along the probe (SantaLucia)

Imagine if all your fragments were of length 20. Imagine dropping the fragments randomly along a line of 25.

Figure: fraction of overlap against position along the probe (axis ticks at 1, 5, 13, 20, 25).

There will also be duplex breathing and a torque between the duplex and the unbound fragment.
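The thought experiment can be sketched numerically. This assumes each fragment of length 20 lies entirely within the 25-base probe and that every allowed start position is equally likely; fragments overhanging the probe ends are ignored, which is a simplification.

```python
def coverage_by_position(probe_length=25, fragment_length=20):
    """Fraction of fragment placements that cover each probe position."""
    # Allowed start positions (1-based) for a fragment fully inside the probe.
    starts = range(1, probe_length - fragment_length + 2)
    coverage = []
    for position in range(1, probe_length + 1):
        covering = sum(
            1 for start in starts if start <= position < start + fragment_length
        )
        coverage.append(covering / len(starts))
    return coverage


coverage = coverage_by_position()
```

Under these assumptions the ends of the probe are covered by only one placement in six, while the central positions are covered by every placement, so the middle of the probe contributes most to binding.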

Biotin labelling interferes with the hybridisation

C & T (pyrimidines) are labelled. So GC* binds less strongly than CG, and AT* binding is weaker than TA.

If the probe contains no C & T, it will hybridise well but with no fluorescence. If you have all C & T, it will have difficulty hybridising.

Size is important

E.g. perfect match #13 = A, so mismatch #13 is T, and the complementary base in mRNA is also T/U.

Pyrimidines (C & T) are small. There will be no steric hindrance between the pyrimidine in the mismatch and the pyrimidine in the mRNA of interest.

Size is important

E.g. perfect match #13 = T, so mismatch #13 is A, and the complementary base in mRNA is also A.

Purines (G & A) are large. There will be a large steric hindrance between the purine in the mismatch and the purine in the mRNA of interest.
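A minimal sketch of how a mismatch probe is derived from a perfect match probe, and of which kind of clash then occurs at the central base. The 25-mer used is the example probe sequence from the earlier probe-cell slide; the function names are made up for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
PURINES = {"A", "G"}


def mismatch_probe(perfect_match):
    """Replace the central base (#13 of a 25-mer) with its complement."""
    middle = len(perfect_match) // 2  # index 12 for a 25-mer
    return (
        perfect_match[:middle]
        + COMPLEMENT[perfect_match[middle]]
        + perfect_match[middle + 1:]
    )


def central_clash(perfect_match):
    """Type of base facing the mismatch base in the mRNA of interest.

    The mRNA base is complementary to the perfect match, so it belongs to
    the same class (purine or pyrimidine) as the mismatch base itself.
    """
    middle = len(perfect_match) // 2
    mm_base = mismatch_probe(perfect_match)[middle]
    return "purine-purine" if mm_base in PURINES else "pyrimidine-pyrimidine"


pm = "GGAATTGGGTCAGAAGGACTGTGGC"
mm = mismatch_probe(pm)
clash = central_clash(pm)
```

A perfect match with a central purine gives a small pyrimidine-pyrimidine clash in the mismatch, while a central pyrimidine gives a bulky purine-purine clash.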

Naef and Magnasco (2003)

The difference in intensity between the PM and MM is sensitive to the choice of the central base of the probe!

There is a lot of physics to consider.

In order to simplify matters, there have been several attempts to generate simple mathematical models that incorporate the key physics.

The parameters in the models are then fitted using the data from many chips.

Zhang, Miles and Aldape (2003)

Their model is named Position Dependent Nearest Neighbour (PDNN). PDNN has 24 weight factors for Gene Specific Binding, 24 factors for Non-Specific Binding and 16 stacking energy parameters. They fit their model with a dataset of ~5,000,000 probe measurements (~40 chips).

Naef and Magnasco (2003)

The model contains only position specific affinities for each base (fitted using ~80 chips). A low order function can be fitted to the hybridisation for a given base at a given position. The total hybridisation for the 25 base sequence is then the sum of the local hybridisations.
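A minimal sketch of a position-specific affinity model of this kind: each base has a low-order polynomial in position, and the probe affinity is the sum over all 25 positions. The coefficients below are hypothetical placeholders chosen only to make the example run; they are not the fitted values from Naef and Magnasco.

```python
def base_affinity(coefficients, position, probe_length=25):
    """Low-order polynomial in position, with position rescaled to [-1, 1]."""
    x = 2.0 * (position - 1) / (probe_length - 1) - 1.0
    return sum(c * x ** k for k, c in enumerate(coefficients))


def probe_affinity(sequence, coefficients_by_base):
    """Sum of the local affinities of every base along the probe."""
    return sum(
        base_affinity(coefficients_by_base[base], position, len(sequence))
        for position, base in enumerate(sequence, start=1)
    )


# HYPOTHETICAL quadratic coefficients per base (constant, linear, quadratic).
HYPOTHETICAL_COEFFICIENTS = {
    "A": [-0.10, 0.0, 0.05],
    "C": [0.20, 0.0, -0.10],
    "G": [0.05, 0.0, -0.02],
    "T": [-0.15, 0.0, 0.07],
}

affinity = probe_affinity("GGAATTGGGTCAGAAGGACTGTGGC", HYPOTHETICAL_COEFFICIENTS)
```

In practice the coefficients would be fitted to the intensities from many chips, as described above.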

Wu and Irizarry report spike in yeast controls on a human chip.

This measures non-specific hybridisation directly. Many unchanging genes do not express!

Theory is comparable to experiment. Not as clean as Naef.

GCRMA (Wu and Irizarry 2004)

Lots of close sequences will hybridise to a given probe. Wu and Irizarry model the variation in hybridisation of these similar sequences using a statistical model.

GCRMA determines the contribution to the PM from Signal and from Non-Specific Hybridisation (stickiness)

GCRMA produces a linear relationship between intensity and concentration

Do the genes identified by statistics make biological sense?

BE CAREFUL: YOU KNOW TOO MUCH!

Biologically relevant genes High-throughput experiments Anatomy, Developmental Biology, Neuroscience, Medicine, Pharmacology, Physiology

Does all this physics and statistics make a difference to the biological interpretation?

What does the output of the genome look like?

Data from comparing Spared Nerve Injury in adult and p10 rats: Lucy Bee, Ramine Hosseini, Andrew Moss, Jonathan Smith and Maria Fitzgerald.

Different analysis strategies can have a dramatic effect on the ranking of significant genes

Analysis Protocol          Rank of Protein X in Experiment Y
MAS and fold change        8900
GCRMA and fold change      485
GCRMA and Z-scores         90

Does all this physics and statistics make a difference to the biological interpretation?

Andrew Harrison (novice biologist) used Z-scores and GCRMA to produce a list of the 100 most significantly up-regulated genes and the 100 most significantly down-regulated genes for Experiment Z.

Stephen McMahon (expert biologist) looked at fold changes produced by MAS.

Intensity distributions for the same data

GCRMA MAS

The major disagreement in fold changes is at low intensities, where MAS produces lots of false positives

Recap

Z-scores appear to be a reliable statistical protocol.

GCRMA is based on physical models of hybridisation and provides a linear relationship between transcript concentration and intensity.

A combination of Z-scores (of fold change) and GCRMA produces a list of statistically significant genes that are biologically relevant.

Academics have created R, and Bioconductor. It is free, available from the web and much of the software is well documented. Most importantly for you, it provides accurate results. Using it will mean that you are never naked in front of your peers. But they may be! www.bioconductor.org

Please go MAD once a month!

Microarray Analysis Discussion meetings are held at UCL, usually on the 1st Friday of every month at 4pm. They are open to everyone and attract people from across the South-East. Please come along!

Contact Jacky Pallas to be included on the mailing list: [email protected]

Many useful analysis papers can be found at www.biochem.ucl.ac.uk/~harry/MAD