Transcript PPT - Biochemistry and Molecular Biology
Affymetrix microarrays
Following discussions with Caroline, Christine, Danielle, Eric, Hugh, Ilhem, Kevin, Lucy, Martin, Martino, Michael, Mike and Paul
Bioinformatix, Mathematix, Physix, Statistix, Transcriptomix Glass Spottix versus Affymetrix
Animal Biology: operation and tissue extraction
Molecular Biology: mRNA preparation; chips are run
Computational Biology: chip calibration; differentially expressed genes are identified
Affymetrix scanners transform message into light.
The Enigma scanner has low noise: SVDFGS is definitely SVDFGS.
Would you prefer to listen to Radio Glasgow with poor reception or Radio Budapest with clear reception?
Affymetrix microarrays
Transcript: 5’ GGTG GGAATTGGGTCA G AAGGACTGTGGC TAGGCGC 3’
Perfect match probe cells: GGAATTGGGTCA G AAGGACTGTGGC
Mismatch probe cells: GGAATTGGGTCA C AAGGACTGTGGC
(The probe cells are actually scattered on the chip.)
Probe cells of an Affymetrix Gene chip contain millions of identical 25-mers
25-mer
Affymetrix Gene chip-Hybridization
Affymetrix Gene chip-Fluorescence
Perfect Match (PM) Mismatch (MM)
Affymetrix probe set
Probe cell (aka feature) Probe pair
Each gene is represented by 16 probe pairs (for chip rgu34a). Each pair has a perfect match (the 25 base oligonucleotide binds to the gene of interest) and a mismatch (the central base is changed).
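A minimal sketch of how a mismatch probe relates to its perfect match, assuming the central base (number 13 of 25) is replaced by its complement, as in the A to T example later in the talk. The function name `mismatch_probe` is my own and not part of any Affymetrix software.

```python
# Watson-Crick complements, used to swap the central base.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(pm: str) -> str:
    """Return the mismatch probe for a 25-base perfect match probe."""
    if len(pm) != 25:
        raise ValueError("expected a 25-base probe")
    middle = len(pm) // 2  # index 12, i.e. base number 13
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]

pm = "GGAATTGGGTCAGAAGGACTGTGGC"  # the perfect match sequence shown earlier
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

Only the central base differs between the two strings; the other 24 bases are identical.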
Outliers?
Chip calibration
Correct background, normalise, correct for cross-hybridisation, compute the expression measure.
Then: high-level analysis and biological interpretation.
Background fluorescence needs to be corrected, e.g. by the MAS and RMA algorithms.
Camel distributions suggest that there are two populations (detected and not detected?).
Chips need to be normalised against each other.
Each chip is a different colour. Normalisation methods include, e.g., invariant genes, lowess and quantiles.
RMA uses quantile normalisation at the probe level.

Original intensities, by probe:
        Chip 1   Chip 2   Chip 3
PA      1        7        5
PB      2        2        3
PC      4        5        4
PD      3        3        2
PE      5        1        9

Order by ranks:
        Chip 1   Chip 2   Chip 3
Rank 1  1        1        2
Rank 2  2        2        3
Rank 3  3        3        4
Rank 4  4        5        5
Rank 5  5        7        9

Average the intensities at each rank:
        Chip 1   Chip 2   Chip 3
Rank 1  1.33     1.33     1.33
Rank 2  2.33     2.33     2.33
Rank 3  3.33     3.33     3.33
Rank 4  4.66     4.66     4.66
Rank 5  7        7        7

Reorder by probe:
        Chip 1   Chip 2   Chip 3
PA      1.33     7        4.66
PB      2.33     2.33     2.33
PC      4.66     4.66     3.33
PD      3.33     3.33     1.33
PE      7        1.33     7
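The rank, average and reorder steps above can be sketched in a few lines of NumPy. This is an illustration of the idea rather than the RMA implementation itself. The function name is my own, and I assume there are no tied intensities within a chip. Note that 14/3 rounds to 4.67, where the slide shows it as 4.66.

```python
import numpy as np

def quantile_normalise(x: np.ndarray) -> np.ndarray:
    """Quantile-normalise the columns (chips) of a probes-by-chips matrix.

    Assumes no tied intensities within a chip.
    """
    order = np.argsort(x, axis=0)                  # row indices that sort each chip
    ranks = np.argsort(order, axis=0)              # rank of every probe within its chip
    rank_means = np.sort(x, axis=0).mean(axis=1)   # average intensity at each rank
    return rank_means[ranks]                       # put the averages back in probe order

# Rows are probes PA..PE, columns are Chip 1..Chip 3 (values from the slide).
intensities = np.array([
    [1, 7, 5],
    [2, 2, 3],
    [4, 5, 4],
    [3, 3, 2],
    [5, 1, 9],
], dtype=float)

normalised = quantile_normalise(intensities)
print(np.round(normalised, 2))
```

After normalisation every chip contains exactly the same set of values, only assigned to different probes.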
Cross Hybridisation
MAS 5.0 (Affymetrix) corrects for cross-hybridisation by subtracting the MisMatch signal from the Perfect Match.
RMA ignores the MisMatch probes because they also hybridise to the signal intended for the Perfect Match.
Expression Measure
The intensities of the multiple probes within a probeset are combined into ONE measure of expression.
MAS 5.0 (Signal) takes the Tukey biweight mean of the log of the difference between PM and MM (with the MM adjusted so that it never exceeds the PM).
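A minimal sketch of a one-step Tukey biweight mean, applied to the per-probe-pair values of one probeset. The tuning constants `c = 5` and `eps = 1e-4` are the values I recall from the Affymetrix description of the algorithm; treat them, and the function name, as assumptions rather than the definitive MAS 5.0 code.

```python
import numpy as np

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a mean that down-weights outlying values."""
    x = np.asarray(values, dtype=float)
    centre = np.median(x)
    mad = np.median(np.abs(x - centre))             # median absolute deviation
    u = (x - centre) / (c * mad + eps)              # scaled distance from the median
    w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)  # weight is zero beyond c MADs
    return float(np.sum(w * x) / np.sum(w))

probe_values = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up values with one gross outlier
robust = tukey_biweight(probe_values)
print(robust, np.mean(probe_values))
```

The outlier at 100 receives zero weight, so the robust estimate stays close to the bulk of the probes while the ordinary mean is dragged upwards.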
1-9 are different chips.
dChip and RMA ‘model’ the systematic hybridisation patterns when calibrating an expression measure.
Once chips have gone through the calibration process, changes in gene expression between conditions or over time can be observed.
m = log2(Fold Change), a = log2(Average Intensity)
The change in expression between two conditions for all the genes on an array can be viewed on an MA plot.
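As a sketch, m and a can be computed from two vectors of calibrated intensities as follows. I assume here that the "average intensity" is the mean of the two log intensities (the log of their geometric mean), which is the usual convention; the function name is illustrative.

```python
import numpy as np

def ma_values(condition_1, condition_2):
    """Return (m, a) for paired expression values from two conditions."""
    log_1 = np.log2(np.asarray(condition_1, dtype=float))
    log_2 = np.log2(np.asarray(condition_2, dtype=float))
    m = log_2 - log_1          # log2 fold change
    a = 0.5 * (log_1 + log_2)  # mean log2 intensity (assumed definition of a)
    return m, a

# Made-up intensities for three genes.
m, a = ma_values([100, 200, 400], [200, 200, 100])
print(m)
print(a)
```

A doubling gives m = 1, no change gives m = 0 and a four-fold drop gives m = -2; plotting m against a gives the MA plot.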
Sliding Z
Quackenbush (2002): Z = (m - mean(m)) / standard deviation(m)
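One way to sketch the sliding Z is to sort the genes by intensity and standardise each m against its neighbours in intensity. The window of 51 neighbouring genes and the simulated data are my own assumptions for illustration, not values from Quackenbush (2002).

```python
import numpy as np

def sliding_z(m, a, window=51):
    """Z-score of m computed within a sliding window of genes of similar a."""
    m = np.asarray(m, dtype=float)
    a = np.asarray(a, dtype=float)
    order = np.argsort(a)              # genes sorted by intensity
    m_sorted = m[order]
    half = window // 2
    z_sorted = np.empty_like(m_sorted)
    for i in range(len(m_sorted)):
        lo = max(0, i - half)
        hi = min(len(m_sorted), i + half + 1)
        local = m_sorted[lo:hi]        # neighbours in intensity (shorter at the edges)
        z_sorted[i] = (m_sorted[i] - local.mean()) / local.std(ddof=1)
    z = np.empty_like(z_sorted)
    z[order] = z_sorted                # restore the original gene order
    return z

# Simulated data: the spread of m is larger at low intensity.
rng = np.random.default_rng(0)
a = rng.uniform(4, 14, size=1000)
m = rng.normal(0.0, 1.0, size=1000) * (16 - a) / 4
z = sliding_z(m, a)
print(z[:5])
```

Because each gene is compared only with genes of similar intensity, the same Z threshold can be applied across the whole intensity range.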
At low intensities, the sd is too low.
Barenco 2003 Spike-in measurements show there remains considerable signal at low concentrations.
The non-linearity means that Fold Change (Intensity) is NOT the same as Fold Change (Transcript).
This causes complications when comparing chips against mathematical models of changes in gene expression.
It is difficult to establish when a gene is NOT expressed.
The statistical space is also non-linear.
Cross Hybridisation
MAS 5.0 (Affymetrix) corrects for cross-hybridisation by subtracting the MisMatch signal from the Perfect Match.
RMA ignores the MisMatch probes because they also hybridise to the signal intended for the Perfect Match.
How can you measure cross-hybridisation without using the MisMatch signal?
There is a need for a model of the physics of hybridisation (Naef and Magnasco 2003)
GC content is important. AT base pairs have two hydrogen bonds; GC base pairs have three.
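A small sketch that counts the GC fraction and the total number of base-pairing hydrogen bonds for a probe, using two bonds per A/T and three per G/C as stated above. The function names are mine, and the sequence is the perfect match example shown earlier in the talk.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def hydrogen_bonds(seq: str) -> int:
    """Hydrogen bonds in a perfectly paired duplex: 3 per G/C, 2 per A/T."""
    return sum(3 if base in ("G", "C") else 2 for base in seq)

probe = "GGAATTGGGTCAGAAGGACTGTGGC"
print(gc_fraction(probe), hydrogen_bonds(probe))
```

For this 25-mer there are 14 G/C and 11 A/T bases, giving 64 hydrogen bonds in total.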
Van der Waals interactions between adjacent bases.
H-bond interactions between adjacent bases.
Nearest-neighbour interactions predict duplex kinetics, so sequence order is important (SantaLucia).
The binding energy of GAC is not the same as that of CAG.
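To illustrate why sequence order matters in a nearest-neighbour picture, this sketch splits a sequence into its adjacent-base pairs. It deliberately contains no energy values; it only shows that GAC and CAG have the same bases but different neighbouring pairs. The function name `stacks` is my own.

```python
def stacks(seq: str) -> list[str]:
    """Return the adjacent-base pairs (nearest neighbours) along a sequence."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]

print(stacks("GAC"))  # ['GA', 'AC']
print(stacks("CAG"))  # ['CA', 'AG']
```

Any model that assigns an energy to each neighbouring pair will therefore give GAC and CAG different totals, even though a pure base-composition model could not tell them apart.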
The fraction of overlap between transcript and probe depends upon the position along the probe (Maibaum and SantaLucia).
Imagine if all your fragments were of length 20. Imagine dropping the fragments randomly along a line of 25.
[Plot: fraction of overlap against position along the line, with positions 1, 5, 13, 20 and 25 marked]
There will also be duplex breathing and a torque between the duplex and the unbound fragment.
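The thought experiment can be sketched by enumerating every placement of a 20-base fragment on a 25-base line and counting how often each position is covered. I assume here that a fragment must land wholly within the line and that all placements are equally likely; the talk does not specify this, so treat it as one possible reading.

```python
from fractions import Fraction

def coverage(probe_len: int = 25, frag_len: int = 20) -> list[Fraction]:
    """Fraction of fragment placements that cover each probe position."""
    starts = range(probe_len - frag_len + 1)  # fragment lies fully on the line
    n_placements = len(starts)
    return [
        Fraction(sum(1 for s in starts if s <= pos < s + frag_len), n_placements)
        for pos in range(probe_len)
    ]

cov = coverage()
for position, fraction in enumerate(cov, start=1):
    print(position, fraction)
```

Under this assumption the ends of the line are covered by only one placement in six, while the central positions are covered by every placement.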
Biotin labelling interferes with the hybridisation.
C & T (pyrimidines) are labelled. So GC* binds less strongly than CG, and AT* binding is weaker than TA.
If the probe contains no C & T, it will hybridise well but with no fluorescence.
If you have all C & T, it will have difficulty hybridising.
C and T within your mRNA fragment but immediately outside your probe will fluoresce and not interfere with hybridisation
Naef and Magnasco 2003 - a key paper
Size is important.
Pyrimidines (C & T) are small. E.g. perfect match #13 = A, so mismatch #13 is T, and the complementary base in the mRNA is also T/U. There will be no steric hindrance between the pyrimidine in the mismatch and the pyrimidine in the mRNA of interest.
Purines (G & A) are large. There will be a large steric hindrance between the purine in the mismatch and the purine in the mRNA of interest.
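A sketch of the bookkeeping behind this argument: given the central base of the perfect match, it reports which two bases face each other at the mismatch position and whether they are both purines or both pyrimidines. The function name is hypothetical and the mRNA base is written with DNA letters (T standing for U).

```python
PURINES = {"A", "G"}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_clash(pm_middle: str) -> tuple[str, str, str]:
    """Bases facing each other at position 13 of the mismatch duplex."""
    mm_middle = COMPLEMENT[pm_middle]    # mismatch base replaces the PM base
    mrna_base = COMPLEMENT[pm_middle]    # base in the mRNA that pairs with the PM base
    kind = "purine-purine" if mm_middle in PURINES else "pyrimidine-pyrimidine"
    return mm_middle, mrna_base, kind

for base in "ACGT":
    print(base, mismatch_clash(base))
```

Perfect match centres of A or G give a small pyrimidine-pyrimidine pairing in the mismatch, while centres of C or T give a bulky purine-purine pairing.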
Naef and Magnasco (2003)
From Mei et al. (2003, PNAS). Hybridisation with respect to A: C is red, G is green, T is yellow.
Affymetrix design their arrays using increasingly sophisticated models of the physical chemistry of hybridisation.
Zhang, Miles and Aldape (2003): their model is named Position Dependent Nearest Neighbour (PDNN).
PDNN has 24 weight factors for Gene Specific Binding, 24 factors for Non-Specific Binding and 16 stacking energy parameters.
They fit their model with a dataset of ~5,000,000 probe measurements (~40 chips).
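As I read it, the 24 weights and 16 stacking energies combine as a weighted sum over the 24 adjacent-base pairs of a 25-mer. The sketch below shows only that structure. The weights and energies are placeholders (all 1.0 and all -1.0), not fitted PDNN parameters, and the function name is my own.

```python
from itertools import product

def pdnn_energy(seq: str, weights: list[float], stacking: dict[str, float]) -> float:
    """Weighted sum of stacking energies over adjacent base pairs."""
    pairs = [seq[k:k + 2] for k in range(len(seq) - 1)]
    if len(weights) != len(pairs):
        raise ValueError("need one weight per adjacent pair")
    return sum(w * stacking[pair] for w, pair in zip(weights, pairs))

# Placeholder parameters for illustration only: NOT fitted values.
placeholder_weights = [1.0] * 24
placeholder_stacking = {a + b: -1.0 for a, b in product("ACGT", repeat=2)}

probe = "GGAATTGGGTCAGAAGGACTGTGGC"
energy = pdnn_energy(probe, placeholder_weights, placeholder_stacking)
print(len(placeholder_stacking), energy)
```

With real parameters, one set of weights would be used for gene specific binding and another for non-specific binding, sharing the same functional form.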
Naef and Magnasco (2003) The model contains only position specific affinities for each base (fitted using ~80 chips) A low order function can be fitted to the hybridisation for a given base at a given position. The total hybridisation for the 25 base sequence is then the sum of the local hybridisations.
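The "sum of local hybridisations" can be sketched as one low-order polynomial in position per base, evaluated at each of the 25 positions and added up. The coefficients below are toy values chosen so the result is easy to check (only A contributes, with a constant 1); they are not the fitted affinities from the paper, and the function name is hypothetical.

```python
import numpy as np

def probe_affinity(seq: str, coeffs: dict[str, list[float]]) -> float:
    """Sum over positions of a per-base polynomial evaluated at that position."""
    total = 0.0
    for position, base in enumerate(seq, start=1):
        # np.polyval takes coefficients with the highest power first.
        total += float(np.polyval(coeffs[base], position))
    return total

# Toy quadratic coefficients [c2, c1, c0] per base: NOT fitted values.
toy_coeffs = {
    "A": [0.0, 0.0, 1.0],
    "C": [0.0, 0.0, 0.0],
    "G": [0.0, 0.0, 0.0],
    "T": [0.0, 0.0, 0.0],
}

probe = "GGAATTGGGTCAGAAGGACTGTGGC"
affinity = probe_affinity(probe, toy_coeffs)
print(affinity)
```

With these toy coefficients the affinity simply counts the As in the probe; fitted coefficients would make the contribution of each base depend on where it sits along the probe.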
If your probe contains lots of As in the centre: there will be lots of AT bonds, which means weak 2-hydrogen bonds. The complementary sequence will contain lots of Ts (biotin interference).
If your probe contains lots of Cs in the centre: there will be lots of GC bonds, which means strong 3-hydrogen bonds. The complementary sequence will contain lots of Gs (no biotin interference).
Wu and Irizarry report spike in yeast controls on a human chip.
This measures non-specific hybridisation directly. Many unchanging genes do not express!
Theory is comparable to experiment. Not as clean as Naef.
Wu and Irizarry (2004) have written GCRMA (which is available now in Bioconductor). As theory is comparable to experiment, it can be used to estimate the intrinsic stochastic uncertainty of the hybridisation process. Lots of close sequences will hybridise to a given probe. Wu and Irizarry model the variation in hybridisation of these similar sequences using a statistical model.
GCRMA determines the contribution to the PM from Signal and from Non-Specific Hybridisation
GCRMA suggests that many probes on the chip do not detect signal.
GCRMA produces a good linear relationship between intensity and concentration
Standard deviation of fold change as a function of intensity, for GCRMA, RMA and MAS. GCRMA is more noisy than RMA because each PM has a noisy cross-hybridisation subtraction.
GCRMA makes the global properties of chips much more comparable. In particular, it is much better than RMA at removing genes with little emission over and above the non-specific hybridisation.
GCRMA produces a linear relationship between light and transcript to much lower concentrations.
The subtraction of cross-hybridisation adds to the noise. However, this noise is much lower than MAS at low-middle concentrations
Can the algorithms be improved further?
Spike-in measurements show that at large intensities there is a non-linear relationship between transcript concentration and fluorescent signal
Hekstra et al. (2003) show that Affymetrix chips follow Langmuir adsorption isotherms, i.e. they chemically saturate at large concentrations in a well understood manner.
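A minimal sketch of a Langmuir-type response, in which intensity rises almost linearly at low concentration and levels off at a saturation value. The parameter names (`saturation`, `k_half`, `background`) and the numbers are illustrative assumptions, not values from Hekstra et al.

```python
import numpy as np

def langmuir_intensity(conc, saturation, k_half, background=0.0):
    """Langmuir isotherm: background + saturation * c / (c + k_half)."""
    conc = np.asarray(conc, dtype=float)
    return background + saturation * conc / (conc + k_half)

# Made-up concentrations spanning the linear and saturated regimes.
concentrations = np.array([0.0, 1.0, 10.0, 1000.0])
signal = langmuir_intensity(concentrations, saturation=100.0, k_half=10.0)
print(signal)
```

At a concentration equal to `k_half` the signal is half the saturation value, and at much higher concentrations a further doubling of transcript produces almost no change in light.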
The affinities show a slight kink, suggesting they can be improved by including saturation effects
The corrections are for non-specific hybridisation, yet some probes will be prone to specific cross-hybridisation from other genes (see talk by Eric).
Outliers will need to be found and removed.
A more detailed physical model may reduce variance.
Comparing the probes in a biological replicate: even after using GCRMA, the variation does not look random at low intensities.
It looks like there is still a systematic bias, or there remains a background contribution to the PM signal
There appear to be two populations of probes: “D” (detected) and “U” (undetected; ignore?).
At present, expression measures (GCRMA, RMA, MAS) combine all the probes within a probeset.
Should all the probes below the peak in variance be ignored?
Can we do better on the image processing?
Affymetrix data Solar system formation
Gung-Ho Conclusions
The calibration of Affymetrix chips is a very active and quickly evolving research area. All the references in my talk are from 2003 or later! GCRMA seems to have all the properties you would expect from a correct calibration protocol. It is available NOW in Bioconductor for FREE and will help biologists and analysts.
Affymetrix calibration requires bioinformatix, physix and statistix to work (and live) in harmony. Transcriptomica?
Our unification will allow us to face the common enemy together ….
Computer Scientists
Dadda (yesterday)
Quantile normalisation assumes the chips have the same underlying distribution of intensities. For some experiments, this is not the case (and what if you wish to compare 1000 chips?)