Bayesian hierarchical modelling of genomic data Sylvia Richardson Centre for Biostatistics Imperial College, London In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s) Helen.

Download Report

Transcript Bayesian hierarchical modelling of genomic data Sylvia Richardson Centre for Biostatistics Imperial College, London In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s) Helen.

Bayesian hierarchical modelling
of genomic data
Sylvia Richardson
Centre for Biostatistics
Imperial College, London
In collaboration with
Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s)
Helen Causton and Tim Aitman (Hammersmith)
Peter Green (Bristol)
Philippe Broët (INSERM, Paris)
BBSRC Exploiting Genomics grant
1
Outline
• Introduction
• A fully Bayesian gene expression index
(BGX)
• Differential expression and array effects
• Mixture models
• Discussion
2
Part 1 Introduction
• Recent developments in genomics have
led to techniques
– Capable of interrogating the genome at
different levels
– Aiming to capture one or several stages of
the biological process
DNA
mRNA
protein
phenotype
3
Fundamental process
Protein-encoding genes are transcribed into mRNA (messenger),
and the mRNA is translated to make proteins
DNA -> mRNA -> protein
4
Pictures from http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html
What are gene expression data ?
DNA Microarrays are used to measure the relative abundance
of mRNA, providing information on gene expression in a particular
cell, under particular conditions
The fundamental principle used to measure the expression is
that of hybridisation between a sample and probes:
– Known sequences of single-stranded DNA representing genes are
immobilised on microarray
–Tissue sample (with unknown concentration of RNA) fluorescently labelled
– Sample hybridised to array
– Array scanned to measure amount of RNA present for each sequence
gene expression measure
The expression level of ten of thousands of probes are
measured on a single microarray !
gene expression profile
5
Variation and uncertainty
Gene expression data (e.g. Affymetrix) is
the result of multiple sources of variability
•
•
•
•
•
condition/treatment
biological
array manufacture
imaging
technical
• gene specific
variability of the
probes for a gene
• within/between array
variation
Structured statistical modelling allows
considering all uncertainty at once
6
Example of within vs between strains
gene variability
• 7 cross-bred strains of mice that differ only by a small
portion of chromosome 1
• Strains have different phenotypes related to
immunological disorders
• For each line, 9 animals used to obtain 3 pooled RNA
extracts from spleen
7 x 3 samples
Excellent experimental design to minimise
“biological variability between replicate animals”
Aim: to tease out differences between expression
profiles of the 7 lines of mice and relate these to
locations on chromosome 1
7
Biological variability is large !
Ratio within/total
Total variance calculated
over the 21 samples
Average (over the 7 groups) of
within strain variance
8
calculated from the 3 pooled samples
1000 genes most variable
between strains: hierarchical
clustering recovers the
cross-bred lines structure
Random set of 1000 genes
9
Common characteristics of genomics
data sets
• High dimensional data (ten of thousands of
genes) and few samples
• Many sources of variability (low signal/noise
ratio)
Common issues
•
•
•
•
Pre-processing and data reduction
Multiple testing
Need to borrow information
Importance to include prior biological knowledge
10
Part 2
• Introduction
• A fully Bayesian gene expression index
(BGX)
– Single array model
– Multiple array model
• Differential expression and array effects
• Mixture models
• Discussion
11
A fully Bayesian Gene eXpression
index for Affymetrix GeneChip arrays
Anne Mette Hein
SR, Helen Causton,
Graeme Ambler, Peter Green
PM
MM
PM
MM
PM
MM
PM
MM
Raw intensities
Background correction
Gene specific variability
(probe)
Gene index BGX
12
Affymetrix GeneChips:
Zoom Image of Hybridised Array
*
*
*
*
*
Hybridised
Spot
Each gene g represented by
probe set:
(J:11-20)
Expressed PM
Perfect match: PMg1,…, PMgJ
Mis-match:
Non-expressed PM
Image of Hybridised Array
Slide courtesy of Affymetrix
MMg1,…, MMgJ
expression measure
for gene g
13
Commonly used methods for estimation
expression levels from GeneChips
MAS5:
• uses PM and MMs. Imputes IMs from MMs to obtain all
PM-MMs positive
• gene expression measure : estimate obtained by applying
Tukey Biweight to the set of log(PM-MM) values in the probe
set
Characteristics: positive, robust, noisy at low levels
RMA:
• uses PMs only.
• Fits an model with additive gene and probe effects to logscale background corrected PMs using median polish
14
Characteristics: positive, robust, attenuated signal detection
Mean (left) and Empirical standard deviation (right)
over 7 conditions (arrays) for 45000 genes estimated by
2 different methods for quantifying gene expression
mean
Variability across conditions is conditioned by the
15
choice of summary measure ! Beware of filtering
Model assumptions and key biological
parameters
• The intensity for the PM measurement for probe
(reporter) j and gene g is due to binding
of labelled fragments that perfectly
match the oligos in the spot
The true Signal Sgj
of labelled fragments that do not
perfectly match these oligos
The non-specific hybridisation Hgj
• The intensity of the corresponding MM measurement is
caused
by a binding fraction Φ of the true signal Sgj
by non-specific hybridisation Hgj
16
BGX single array model:
g=1,…,G (thousands), j=1,…,J (11-20)
PMgj  N( Sgj + Hgj , τ2)
Background
noise, additive
MMgj  N(Φ Sgj + Hgj ,τ2)
fraction
Non-specific
hybridisation
signal
log(Sgj+1)  TN (μg , ξg2)
log(Hgj+1)  TN(λ, η2)
j=1,…,J
Gene specific
error terms:
exchangeable
Gene expression index (BGX):
Array-wide
distribution
qg=median(TN (μg , ξ g2))
“Pools” information over probes j=1,…,J
log(ξ g2)N(a, b2)
Priors: “vague” t2 ~ G(10-3, 10-3) F~ B(1,1),
mg ~ U(0,15)
h2 ~ G(10-3, 10-3), l ~ N(0,103)
“Empirical Bayes”
17
Inference:
• Implemented in WinBugs and C
• allows: - Joint estimation of parameters in full
Bayesian framework
• obtain: - posterior distributions of parameters
(and functions of these) in model:
mean
2.5-97.5% credibility interval
For each gene g:
Log-scale true signals:
Log-scale non-spec. hybr:
BGX: gene expr:
log(Sgj+1):
j=1,…,J
log(Hgj+1):
j=1,…,J
qg:
1 2 3
2 3 4
1.75 2 2.25
NB! A distribution
18
Computational issues
• We found mixing slow for gene specific parameters (μg ,
ξg2) and large autocorrelation
• For low signal (bottom 25%) more variability of Sgj and
Hgj , and less separation
So less information on (μg , ξg2) and longer runs are
needed
• For the full hierarchical model, the convergence of the
hyperparameters for the distribution of ξg2 was
problematic
• We studied sensitivity to a range of plausible values for
those and implemented an “empirical Bayes” version of
the model which was reproducible with sensible run
length
19
Posterior mean of qg using a run
of 30 000 versus those obtained
from runs of 5 000, 10 000 and
20 000 sweeps
Reproducibility
is obtained with
short runs for
large expression
values
Longer runs
are necessary for
low expression
values
20
Single array model performance:
Data set : varying concentrations (geneLogic):
• 14 samples of cRNA from acute myeloid leukemia
(AML) tumor cell line
• In sample k:
sample k:
conc. ck(pM):
each of 11 genes spiked in at concentration ck:
1
2
0.0 0.5
3
4
5 6 7
8
9
10
0.75 1.0 1.5 2.0 3.0 5.0 12.5 25
11 12 13 14
50 75 100 150
• Each sample hybridised to an array
Consider subset consisting of 500 normal genes
+ 11 spike-ins
21
Single array model:
examples of posterior distributions of BGX
expression indices
Each curve
(truncated normal
with median param.)
represents a
gene
Examples with data:
o:
log(PMgj-MMgj)
j=1,…,Jg
(at 0 if not defined)
Mean +- 1SD
22
Comparison with other expression measures
Single array model
performance:
11 genes spiked in at 13
(increasing)
concentrations
BGX index qg
increases with
concentration …..
… except for gene 7
(spiked-in??)
Indication of smooth
& sustained increase
over a wider range of
concentrations
23
2.5 – 97.5 % credibility intervals for the Bayesian expression index
11 spike-in genes at 13 different concentration (data set A)
Note how the variability
is substantially larger
for low expression level
Each colour corresponds to a different spike-in gene
24
Gene 7 : broken red line
Integrated modelling of Affymetrix data
Condition 2
Condition 1
PM
MM
PM
MM
PM
MM
PM
MM
Gene specific variability (probe)
Gene index BGX
Hierarchical model of replicate
(biological) variability and array effect
Distribution of expression
index for gene g , condition 1
PM
MM
PM
MM
PM
MM
PM
MM
Gene specific variability (probe)
Gene index BGX
Hierarchical model of replicate
(biological) variability and array effect
Distribution of expression
index for gene g , condition 2
Distribution of differential expression parameter
25
BGX Multiple array model:
conditions: c=1,…,C, replicates: r = 1,…,Rc
PMgjcr  N( Sgjcr+ Hgjcr , τcr2)
MMgjcr N(ΦSgjcr+ Hgjcr , τcr2)
log(Sgjcr+1)  TN (μgc , ξ gc2)
Gene and condition specific BGX
Background
noise, additive
Array specific
log(Hgjcr+1)  TN(λcr,ηcr2)
Array-specific
distribution of
non-specific
hybridisation
qgc=median(TN(μgc, ξ gc2))
“Pools” information over replicate probe sets
j = 1,…J, r = 1,…,Rc
26
Subset of AffyU133A spike-in data set
(AffyComp)
Consider:
• Six arrays, 1154 genes (every 20th and 42 spike-ins)
• Same cRNA hybridised to all arrays EXCEPT for spike-ins:
`1` `2` `3` … `12` `13` `14`
Spike-in genes:
1-3 4-6
7-9 … 34-36 37-39 40-42
Spike-in conc (pM):
Condition 1 (array 1-3): 0.0 0.25 0.50 … 128
256
512
Condition 2 (array 4-6): 0.25 0.50 1.00 … 256
512
0.00
2
-
Fold change:
-
2
2
…
2
27
BGX: measure of uncertainty provided
Posterior mean +- 1SD credibility intervals
diffg=bgxg,1- bgxg,2
}
Spike in 1113 -1154
above the blue line
28
Blue stars show RMA measure
Part 3
• Introduction
• A fully Bayesian gene expression index
(BGX)
• Differential expression and array effects
– Non linear array effects
– Model checking
• Mixture models
• Discussion
29
Differential expression and array
effects
Alex Lewin
SR, Natalia Bochkina, Tim Aitman
30
Data Set and Biological question
Biological Question
Understand the mechanisms of insulin resistance
Using animal models where key genes are knockout and
comparison made between gene expression of wildtype
(normal) and knockout mice
Data set A (MAS 5) ( 12000 genes on each array)
3 wildtype mice compared with 3 mice with Cd36 knocked out
Data set B (RMA) ( 22700 genes on each array)
8 wildtype mice compared with 8 knocked out mice
31
Condition 2
Condition 1
Start with given point
estimates of expression
Hierarchical model of replicate
Variability and array effect
Hierarchical model of replicate
Variability and array effect
Posterior distribution
Differential expression
parameter
(flat prior)
Mixture modelling
for classification
32
Exploratory analysis of array effect
Mouse data
set A
Condition 1 (3 replicates)
Needs
‘normalisation’
Spline curves
shown
Condition 2 (3 replicates)
33
Model for Differential Expression
• Expression-level-dependent normalisation
• Only few replicates per gene, so share
information between genes to estimate
variability of gene expression between the
replicates
• To select interesting genes:
– Use posterior distribution of quantities of interest,
function of, ranks ….
– Use mixture prior on the differential expression
parameter
34
Bayesian hierarchical model for
differential expression
Data: ygsr = log gene expression for gene g, replicate r
g = gene effect
δg = differential effect for gene g between 2 conditions
r(g)s = array effect (expression-level dependent)
gs2 = gene variance
• 1st level
yg1r  N(g – ½ δg + r(g)1 , g12),
yg2r  N(g + ½ δg + r(g)2 , g22),
Σrr(g)s = 0, r(g)s = function of g , parameters {a} and {b}
• 2nd level
Priors for g , δg, coefficients {a} and {b}
gs2  lognormal (μs, τs)
35
Details of array effects (Normalization)
• Piecewise polynomial with unknown break points:
r(g)s = quadratic in g for ars(k-1) ≤ g ≤ ars(k)
with coeff (brsk(1), brsk(2) ), k =1, … # breakpoints
– Locations of break points not fixed
– Must do sensitivity checks on # break points
• Joint estimation of array effects and differential
expression: In comparison to 2 step method
– More accurate estimates of array effects
– Lower percentage of false positive (simulation study)
36
Mouse Data set A
For this data set, cubic fits well
3 replicate arrays (wildtype
mouse data)
Model: posterior means
E(r(g)s | data) v. E(g | data)
Data: ygsr - E(g | data)
37
Bayesian Model Checking
• Check assumptions on gene variances, e.g.
exchangeable variances, what distribution ?
• Predict sample variance Sg2 new (a chosen checking
function) from the model specification (not using the
data for this)
• Compare predicted Sg2 new with observed Sg2 obs
‘Bayesian p-value’: Prob( Sg2 new > Sg2 obs )
• Distribution of p-values approx Uniform if model is
‘true’ (Marshall and Spiegelhalter, 2003)
38
• Easily implemented in MCMC algorithm
Data set A
39
Possible Statistics for Differential
Expression
δg ≈ log fold change
δg* = δg / (σ2 g1 / R1 + σ2 g2 / R2 )½ (standardised
difference)
• We obtain the posterior distribution of all {δg} and/or
{δ g * }
• Can compute directly posterior probability of genes
satisfying criterion X of interest:
pg,X = Prob( g of “interest” | Criterion X, data)
• Can compute the distributions of ranks
40
Data set A 3 wildtype mice compared to 3 knockout mice (U74A chip) Mas5
Criterion X
Gene is of interest if |log fold change| > log(2) and log (overall expression) > 4
Plot of log fold change versus overall expression level
Genes with
pg,X > 0.5 (green)
# 280
pg,X > 0.8 (red)
# 46
The majority of the genes
have very small pg,X :
90% of genes
have pg,X < 0.2
Genes with low overall expression
have a greater range of fold change
than those with higher expression
pg,X = 0.49
41
Experiment: 8 wildtype mice compared to 8 knockout mice RMA
Criterion X: Gene is of interest if |log fold change| > log (1.5)
Plot of log fold change versus overall expression level
Genes with
pg,X > 0.5 (green)
# 292
pg,X > 0.8 (red)
# 139
The majority of the genes
have very small pg,X :
97% of genes
have pg,X < 0.2
42
Posterior probabilities and log fold change
Data set A : 3 replicates MAS5
Data set B : 8 replicates RMA
43
Credibility intervals for ranks
Data set B
100 genes with lowest
rank (most under/
over expressed)
Low rank,
high uncertainty
Low rank,
low uncertainty
44
Using the posterior distribution of δg*
(standardised difference)
• Compute
Probability ( | δg* | > 2 | data)
Bayesian analogue of a t test !
• Order genes
• Select genes such that
Probability ( | δg* | > 2 | data) > cut-off ( in blue)
By comparison, additional genes selected by a standard
T test with p value < 5% are in red)
45
Part 4
•
•
•
•
Introduction
A fully Bayesian gene expression index
Differential expression and array effects
Mixture models
– Classification for differential expression
– Bayesian estimate of False Discovery Rates
– CGH arrays: models including information on
clones spatial location on chromosome
• Discussion
46
Mixture and Bayesian estimation of
false discovery rates
Natalia Bochkina, Philippe Broët
Alex Lewin, SR
47
Multiple Testing Problem
• Gene lists can be built by computing separately a
criteria for each gene and ranking
• Thousands of genes are considered simultaneously
• How to assess the performance of such lists ?
Statistical Challenge
Select interesting genes without including too many false
positives in a gene list
A gene is a false positive if it is included in the list when it is
truly unmodified under the experimental set up
Want an evaluation of the expected false discovery rate (FDR)
48
Bayesian Estimate of FDR
• Step 1: Choose a gene specific parameter (e.g. δg ) or a
gene statistic
• Step 2: Model its prior (resp marginal) distribution using a
mixture model
-- with one component to model the unaffected genes
(null hypothesis) e.g. point mass at 0 for δg
-- other components to model (flexibly) the alternative
• Step 3: Calculate the posterior probability for any gene of
belonging to the unmodified component : pg0 | data
• Step 4: Evaluate FDR (and FNR) for any list
assuming that all the gene classification are independent
(Broët et al 2004) :
Bayes FDR (list) | data = 1/card(list) Σg  list p49g0
Mixture framework for differential expression
• To obtain a gene list, a commonly used method
(cf Lönnstedt & Speed 2002, Newton 2003, Smyth
2003, …) is to define a mixture prior for δg :
• H0
δg = 0 point mass at 0 with probability p0
δg ~ flexible 2-sided distribution to model
• H1
pattern of differential expression
Classify each gene following its posterior
probabilities of not being in the null: 1- pg0
Use Bayes rule or fix the FDR to get a cutoff
50
Mixture prior for differential expression
• In full Bayesian framework, introduce latent
allocation variable zg to help computations
• Joint estimation of all the mixture parameters
(including p0) avoids plugging-in of values (e.g.
p0) that are influential on the classification
• Sensitivity to prior settings of the alternative
distribution
• Performance has been tested on simulated data
sets
Poster by Natalia Bochkina
51
Performance of the mixture prior
yg1r = g - ½ δg + g1r , r = 1, … R1
yg2r = g + ½ δg + g2r , r = 1, … R2
(For simplification, we assume that the data has been pre
normalised)
Var(gsr ) = σ2gs ~ IG(as, bs)
δg ~ p0δ0 + p1G (1.5, l1) + p2G (1.5, l2)
H0
H1
Dirichlet distribution for (p0, p1, p2)
Exponential hyper prior for l1 and l2
52
Estimation
• Estimation of all parameters combines
information from biological replicates and
between condition contrasts
• s2gs = 1/Rs Σr (ygsr - ygs. )2 ,
s = 1,2
Within condition biological variability
• 1/Rs Σr ygsr = ygs. ,
Average expression over replicates
• ½(yg1.+ yg2.)
Average expression over conditions
• ½(yg1.- yg2.)
Between conditions contrast
53
DAG for the mixture model
zg
p
a 1 , b1
l1 , l2
δg
g
g = 1:G
½(yg1.- yg2.)
½(yg1.+ yg2.)
2g1
s2g1
2g2
s2g2
a2, b2
54
Simulated data
Plot of the true differences
ygr ~ N(δg , σ2g) (8 replicates)
σ2gs ~ IG(1.5, 0.05)
δg ~ (-1)Bern(0.5) G(2,2), g=1:200
δg = 0, g=201:1000
Choice of simulation parameters
inspired by estimates found in
analyses of biological data sets
55
Bayes
rule
True
classification
Null
Alternative
Mixture Prior
Null
Alternative
790 (800)
10
31
169 (200)
Margin
Null
764 (800)
21
FDR (black)
FNR (blue)
as a function of
1- pg0
Observed
and estimated
FDR/FNR
correspond well
Important feature
Post Prob (g  H1) = 1-
pg0
56
Comparison of mixture classification and
posterior probabilities for δg* (standardised differences)
Post Prob (g  H1)
10 = 6%
False
positive
In red, 200
genes with
δg ≠ 0
31 = 4%
False
negative
Probability ( | δg* | > 2 | data)
57
Wrongly classified by
mixture:
truly dif. expressed,
truly not dif.
expressed
Classification errors
are on the borderline:
Confusion between
size of fold change
and biological
variability
58
Another simulation
2628 data points
Many points added
on borderline:
classification
errors in red
Can we improve
estimation
of within condition
biological variability ?
59
DAG for the mixture model
zg
p
a1, b1
l1 , l2
δg
g
½(yg1.- yg2.)
½(yg1.+ yg2.)
2g1
2g2
a 2 , b2
g = 1:G
s2g1
s2g2
The variance
estimates are
influenced by
the mixture
parameters
Use only partial
information from
the replicates
to estimate
2gs and feed
forward
in the mixture ?
60
Mixture, full vs partial
Classification
altered for 57 points:
In
46 data points
with improved
classification when
‘feed back from
mixture is cut’
In
11 data points
with changed
but new incorrect
classification
61
Work in progress
Mixture models in CGH arrays
experiments
• Philippe Broët, SR
• Curie Institute oncology department
CGH = Competitive Genomic Hybridization
between fluorescein- labelled normal and pathologic
samples to an array containing clones designed
to cover certain areas of the genome
62
Aim: study genomic alterations
Loss
Tumor supressor gene
Gain
Oncogene
In oncology, where carcinogenesis is associated with complex
chromosomic alterations, CGH array can be used for detailed
analysis of genomic changes in copy number (gains or loss of
genetic information) in the tumor sample.
Amplification of an oncogene or deletion of a tumor
suppressor gene are considered as important mechanisms for
tumorigenesis
63
Specificity of CGH array experiment
A priori biological knowledge from conventional CGH :
• Limited number of states for a sequence :
- presence, - deletion, - gain(s)
corresponding to different intensity ratios on the slide
Mixture model to capture the underlying discrete states
• Clones located contiguously on chromosomes are
likely to carry alterations of the same type
Use clone spatial location in the allocation model
• Some CGH custom array experiments target
restricted areas of the genome
Large proportion of genomic alterations are expected
64
3 component mixture model with spatial allocation
ygr  N(θg , g2) , normal versus tumoral change, clone g
replicate measure r
θg  wg0N(μ0 ,h02) + wg1N(μ1 ,h12) + wg2N(μ2 ,h22)
presence
deletion
gain
μ0 : known central estimate obtained from reference clones
Introduce centred spatial autoregressive Markov random fields,
{ug0}, {ug1}, {ug2} with nearest neighbours along the chromosomes
x
x x
Spatial neighbours of g
g -1 g g+1
Define mixture proportions to depend on the chromosomic location
via a logistic model: wgk = exp(ugk) / Σm exp(ugm)
favours allocation of nearby clones to same component
Work in progress
65
Curie Institute CGH platform
Focus on Investigating deletion areas on chromosome 1
(tumour suppressor locus)
Data on 190 clones
μ0
Presence ?
Ref value
μ0 = - 0.11
Deletion ?
66
Mixture model posterior
probability p of clone being deleted
Classification with
cut-off at p ≥ 0.8
Short arm
67
Summary
Bayesian gene expression measure (BGX)
Good range of resolution, provides credibility intervals
Differential Expression
Expression-level-dependent normalisation
Borrow information across genes for variance estimation
Gene lists based on posterior probabilities or mixture classification
False Discovery Rate
Mixture gives good estimate of FDR and classifies
Flexibility to incorporate a priori biological features, e.g. dependence on
chromosomic location
Future work
Mixture prior on BGX index, with uncertainty propagated to mixture
parameters, comparison of marginal and prior mixture approaches,
clustering of profiles for more general experimental set-ups
68
Papers and technical reports:
Hein AM., Richardson S., Causton H., Ambler G. and Green P. (2004)
BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data
(to appear in Biostatistics)
Lewin A., Richardson S., Marshall C., Glazier A. and Aitman T. (2003)
Bayesian Modelling of Differential Gene Expression
(under revision for Biometrics)
Broët P., Lewin A., Richardson S., Dalmasso C. and Magdelenat H. (2004)
A mixture model-based strategy for selecting sets of genes in multiclass response
microarray experiments. Bioinformatics 22, 2562-2571.
Broët, P., Richardson, S. and Radvanyi, F. (2002)
Bayesian Hierarchical Model for Identifying Changes in Gene Expression from
Microarray Experiments. Journal of Computational Biology 9, 671-683.
Available at
http ://www.bgx.org.uk/
Thanks
69