Statistical Analysis of Gene Expression Data Sylvia Richardson Centre for Biostatistics Imperial College, London In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s) Tim.

Download Report

Transcript Statistical Analysis of Gene Expression Data Sylvia Richardson Centre for Biostatistics Imperial College, London In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s) Tim.

Statistical Analysis of Gene Expression
Data
Sylvia Richardson
Centre for Biostatistics
Imperial College, London
In collaboration with
Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s)
Tim Aitman (Hammersmith)
Peter Green (Bristol)
Biological Atlas
of Insulin Resistance
BBSRC
www.bgx.org.uk
1
BGX
Statistical modelling and biology
• Extracting the ‘message’ from microarray
data needs statistical as well as biological
understanding
• Statistical modelling – in contrast to data
analysis – gives a framework for formally
organising assumptions about signal and
noise
• Our models are structured, reflecting data
generation process:
– Bayesian hierarchical modelling approach
– Inference based on posterior distribution of
quantities of interest
2
BGX
What are gene expression data ?
****
• DNA Microarrays are used to measure the relative abundance
of mRNA, providing information on gene expression in a
particular cell type, under specific conditions
• Gene expression data (e.g. Affymetrix) results from the
scanning of arrays where hybridisation between a sample and
a large number of probes has taken place:
*
– gene expression measure for each gene
• The expression level of ten of thousands of
probes are measured on a single microarray:
– gene expression profile
• Typically, gene expression profiles are obtained
for
several samples, in a single or related
experiments:
– gene expression data matrix
3
BGX
Common characteristics of data sets
in transcriptomic
• High dimensional data (ten of thousands of genes)
and few samples
• Many sources of variability (low signal/noise ratio)
•
•
•
•
•
condition/treatment
biological
array manufacture
imaging
technical
• within/between array
variation
• gene specific variability
of the probes for a gene
(e.g. for Affymetrix)
4
BGX
Analysing gene expression data
• Gene expression data can be used
in several types of analysis:
Gene expression
data matrix
Samples
-- Comparison of gene expression
under different experimental
conditions, or in different tissues
-- Building a predictive model for
classification or prognosis based
on gene expression measurements
-- Exploration of patterns in gene
expression matrices
Gene expression level
5
BGX
Common statistical issues
• Pre-processing and data reduction
– account for the uncertainty of the signal?
– making arrays comparable: “normalisation”
• Realistic assessment of uncertainty
• Multiplicity: control of “error rates”
• Need to borrow information
• Importance to include prior biological knowledge
Illustrate how structured statistical modelling can
help to tease out signal from noise and strengthen
inference in the context of differential expression
studies
6
BGX
Outline
• Background
• Modelling uncertainty in the signal
• Bayesian hierarchical models for
differential expression experiments
– posterior predictive checks
– use of posterior distribution of parameters
of interest to select genes of interest
• Further structure: mixture models
7
BGX
I – Modelling uncertainty in the signal:
A fully Bayesian Gene expression index for
Affymetrix Gene Chip arrays (Anne Mette Hein)
Data: Affymetrix chip:
- Each gene g is represented by a probe set,
consisting of a number of probe pairs (reporters) j
Perfect match (PM) and Mismatch (MM)
Aim: Formulate a model to combine PM and MM values into a
new expression value for the gene -- BGX
- Base the model on biological assumptions
- Combine good features of Li and Wong (dChip) and
RMA (Robust Multichip Analysis, Irrizarry et al)
Use a flexible Bayesian framework that will allow
• to get a measure of uncertainty of the expression
• to integrate further components of the experimental design
8
BGX
Single array model: Motivation
Key observations:
• PMs and MMs both increase
with spike-in concentration (MMs
slower than PMs)
• Spread of PMs increase with
level
Conclusions:
MMs bind fraction of signal
Multiplicative (and additive)
error; transformation needed
• Considerable variability in
PM (and MM) response within
a probe set
Varying reliability in gene
expression estimation for
different genes
• Probe effects approximately
additive on log-scale
Estimate gene expression
measure from PMs and
MMs on log scale
9
BGX
BGX single array model
PMgj  N( Sgj + Hgj , τ2)
Background
noise, additive
MMgj  N(Φ Sgj + Hgj , τ2)
fraction
Gene and probe specific S and H
(g:1,…,”1000s”, j=1,…,”tens”)
Non-specific hybridisation:
array wide distribution:
j=1,…,J (20), g=1,…,G
Expression measure for
gene g is built from:
j=1,…,J (20)
log(Sgj+1)  TN(μg,σg2)
“BGX”
expression
measure
Shrinkage:
exchangeability
log(Hgj+1)  TN(λ, η2)
log(σg2)N(a, b2)
Remaining priors: “vague”
“Emp. Bayes”
10
BGX
BGX model: inference
Hein et al, Biostatistics, 2005
For each gene g: obtain a distribution for signal (log scale)
BGX: gene expression
PM
MM
mg:
• Implemented in WinBugs and C++ (MCMC)
• All parameters estimated jointly in full Bayesian
framework
• Posterior distributions of parameters (and
functions) obtained
The single array model can be extended to estimate signal
from several biological replicates, as well as differential
signal between conditions
11
BGX
Single array model:
examples of posterior distributions of BGX indices
Each curve
represents a
gene
Examples with data:
o:
log(PMgj-MMgj)
j=1,…,J
(at 0 if not defined)
Mean  1SD
12
BGX
Comparison with other expression measures
11 genes spiked in at 13
(increasing)
concentrations
BGX index μg
increases with
concentration …..
… except for gene 7
(incorrectly spiked-in??)
Indication of smooth
& sustained increase
over a wider range of
concentrations
13
BGX
95% credibility intervals for Bayesian
gene expression index
11 spike-in genes
at 13 different
concentrations
Each colour
corresponds to a
different spike-in
gene
Gene 7 : broken red
line
Note how the variability
is substantially larger
for low expression level
14
BGX
II – Modelling differential expression
Condition 1
Condition 2
Start with given point
estimates of expression
Hierarchical model of replicate
variability and array effect
Differential expression
parameter
Hierarchical model of replicate
variability and array effect
Posterior distribution
(flat prior)
Mixture modelling
for classification
15
BGX
Data Sets and Biological question
Biological Question
• Understand the mechanisms of insulin resistance
• Using animal models where key genes are knockout
A) Cd36 Knock out Data set (MAS 5) 3 wildtype (“normal”)
mice compared with 3 mice with Cd36 knocked out
( 12000 genes on each array )
B) IRS2 Knock out Data set (RMA) 8 wildtype (“normal”) mice
compared with 8 mice with IRS2 gene knocked out
( 22700 genes on each array)
16
BGX
Exploratory analysis showing array effect
Mouse data
set A
Condition 1 (3 replicates)
Needs
‘normalisation’
Spline curves
shown
Condition 2 (3 replicates)
17
BGX
Bayesian hierarchical model for differential
expression (Lewin et al, Biometrics, 2005)
Data: ygcr = log gene expression gene g, replicate r, condition c
g = gene effect
dg = differential effect for gene g between 2 conditions
r(g)c = array effect – modelled as a smooth (spline) function of g
gc2 = gene specific variance
yg1r  N(g – ½ dg + r(g)1 , g12)
yg2r  N(g + ½ dg + r(g)2 , g22)
Σrr(g)c = 0, r(g)c = function of g , parameters {c,d}
• 1st level
• 2nd level
Exchangeable
variances
“Flat” priors for g , dg, {c,d}
gc2  lognormal (ac, bc)
18
BGX
Directed Acyclic Graph for the differential
expression model (no array effect represented)
a 1 , b1
dg
g
½(yg1.- yg2.)
½(yg1.+ yg2.)
2g1
s2g1
2g2
s2g2
a2, b2
19
BGX
Differential expression model
Joint modelling of array effects and differential
expression:
• Performs normalisation simultaneously with
estimation
• Gives fewer false positives
How to check some of the modelling assumptions?
Posterior predictive checks
How to use the posterior distribution of dg to
select genes of interest ?
Decision rules
20
BGX
Bayesian Model Checking
• Check assumptions on gene variances, e.g.
exchangeable variances, what distribution ?
• Predict sample variance sg2 new (a chosen checking
function) from the model specification (not using the
data for this)
• Compare predicted sg2 new with observed sg2 obs
‘Bayesian p-value’: Prob( sg2 new > sg2 obs )
• Distribution of p-values approx Uniform if model is
‘true’ (Marshall and Spiegelhalter, 2003)
21
BGX
• Easily implemented in MCMC algorithm
new
2g1
Bayesian model checking
a1, b1
new
s2g1
dg
g
½(yg1.- yg2.)
½(yg1.+ yg2.)
obs
2g1
s2g1
2g2
s2g2
a 2 , b2
22
BGX
Mouse
Data set A
23
BGX
Use of tail probabilities for selecting
gene lists
dg : log fold change
tg = dg / (σ2 g1 / n1 + σ2 g2 / n2 )½ standardised difference
(n1 and n2 # replicates in each condition)
-- Obtain the posterior distribution of dg and/or tg
-- Compute directly posterior probability of genes satisfying
criterion X of interest, e.g. dg > threshold or tg > percentile
pg,X = Prob( g of “interest” | Criterion X, data)
-- Compute the distributions of ranks, ….
Interesting statistical issues on relative merits and properties
of different selection rules based on tail probabilities
24
BGX
Using the posterior distribution of tg
(standardised difference)
(Natalia Bochkina)
• Compute
Probability ( | tg | > 2 | data)
Bayesian T test
• Order genes
• Select genes such that
Data set B
Probability ( | tg | > 2 | data) > cut-off ( in blue)
By comparison, additional genes selected by a standard
T test with p value < 5% are in red)
25
BGX
Credibility intervals for ranks
100 genes with lowest
rank (most under/
over expressed)
Low rank,
high uncertainty
Low rank,
low uncertainty
26
BGX
III – Mixture and Bayesian estimation
of False Discovery Rates (FDR)
• Mixture models can be used to perform a
model based classification
• Mixture models can be considered at the
level of the data (e.g. clustering time profiles)
or for the underlying parameters
• Mixture models can be used to detect
differentially expressed genes if a model of
the alternative is specified
• One benefit is that an estimate of the
uncertainty of the classification: the False
Discovery Rate is simultaneously obtained
27
BGX
Mixture framework for differential
expression
yg1r = g - ½ dg + g1r , r = 1, … R1
yg2r = g + ½ dg + g2r , r = 1, … R2
(We assume that the data has been pre normalised)
Var(gcr ) = σ2gc ~ IG(ac, bc)
dg ~ p0δ0 + p1G (-x|1.5, 1) + p2G (x|1.5, 2)
H0
H1
Dirichlet distribution for (p0, p1, p2)
Exp(1) hyper prior for 1 and 2
Explicit modelling
of the alternative
28
BGX
Mixture for classification of DE genes
• Calculate the posterior probability for any gene of
belonging to the unmodified component : pg0 | data
• Classify using a cut-off on pg0 :
i.e. declare gene is DE if 1- pg0 > pcut
Bayes rule corresponds to pcut = 0.5
• Bayesian estimate of FDR (and FNR) for any list
(Newton et al 2003, Broët et al 2004) :
Bayes FDR (list) | data = 1/card(list) Σg  list pg0
29
BGX
Performance of the mixture prior
• Joint estimation of all the mixture parameters
(including p0) using MCMC algorithms avoids
plugging-in of values that are influential on the
classification
• Estimation of all parameters combines
information from biological replicates and between
condition contrasts
• Performance has been tested on simulated data
sets
30
BGX
π0 = 0.8, 500 DE
π0 = 0.9, 250 DE
Plot of true
difference in
each case
π0 = 0.80, 500 DE
π0 = 0.95, 125 DE
π0 = 0.99, 25 DE
31
BGX
Examples of
simulated data
for each case
32
BGX
Results averaged over 50 replications
Av. π^0 = 0.80
Av. π^0 = 0.90
Av. π^0 = 0.78
Av. π^0 = 0.95
Good estimates
of p0 = Prob(null)
for each case
Av. π^0 = 0.99
33
BGX
Comparison of estimated (dotted lines) and observed (full) FDR
(black) and FNR (red) rates as cut-off for declaring DE is varied
Bayesian mixture:
• good estimates of
FDR and FNR
• easy way to
choose efficient
classification rule
34
BGX
In summary
Integrated gene expression analysis
• Uses the natural hierarchical structure of the data: e.g.
probes within genes within replicate arrays within
condition to synthesize, borrow information and provide
realistic quantification of uncertainty
• Posterior distributions can be exploited for inference
with few replicates: choice of decision rules
• Framework where biological prior information, e.g. on
the structure of the probes or on chromosomic location,
can be incorporated
• Model based classification, e.g. through mixtures,
provides interpretable output and a structure to deal
with multiplicity
General framework for investigating other questions
35
BGX
Many interesting questions in the analysis of
gene expression data
-- Comparison of gene expression
under different experimental
conditions, or in different tissues
-- Building a predictive model
for classification or prognosis
based on gene expression
measurements, finding
“signatures”
-- Integrated gene
expression analysis
-- Investigate high dimensional
classification rules (prediction
with large number of variables)
and “large p small n” regression
problems (shrinkage or variable
selection)
36
BGX
Association of gene expression with prognosis
Expression plot of 115 prognostic
genes comprising The Ovarian
Cancer Prognostic Profile
Investigate properties of high
dimensional classification rules
(prediction with large number of
variables) and “large p small n”
regression problems (shrinkage
or variable selection)
37
BGX
Other questions ….
-- Comparison of gene expression
under different experimental
conditions, or in different tissues
-- Building a predictive model for
classification or prognosis based on
gene expression measurements,
finding “signatures
-- Exploration of patterns
and association networks in
gene expression matrices
-- Integrated gene
expression analysis
-- Investigate high dimensional
classification rules (prediction with
large number of variables) and
“large p small n” regression
problems (shrinkage or variable
selection)
-- Perform unsupervised
model based clustering
-- Estimate graphical models
38
BGX
Exploration of patterns in gene expression matrices
samples
under different experimental
conditions, or in different tissues
-- Classification of gene expression
genes
-- Comparison of gene expression
profiles and association of gene
expression with other factors, e.g.
prognosis (prediction problem)
Development of central
nervous systems in rats
(9 time points)
Perform unsupervised
model based clustering (e.g.
semi-parametric using basis
functions, mixtures or DP
processes)
39
BGX
Thanks
BBSRC Exploiting Genomics grant
Colleagues
Natalia Bochkina, Anne Mette Hein, Alex Lewin
(Imperial College)
Peter Green (Bristol University)
Philippe Broët (INSERM, Paris)
Papers and technical reports: www.bgx.org.uk/
40
BGX