Transcript Slide 1

Estimating the False Discovery Rate in Multi-class Gene Expression Experiments using a Bayesian Mixture Model

Alex Lewin

1

, Philippe Bro ët

2

and Sylvia Richardson

1

,

1

Department of Epidemiology, Imperial College, London;

2

INSERM, Paris Background

Microarray experiments measure gene expression for thousands of genes at a time. The aim is to identify a smaller number of genes which differentiate between two or more experimental conditions. When considering so many observations, multiple testing must be taken into account.

False Discovery Rate

Suppose N hypotheses are to be tested. The table summarizes the outcome.

True Negative True Positive Total Declare Negative U T N-R Declare Positive V S R Total N 0 N-N 0 N Traditional methods such as Bonferroni control the Family Wise Error Rate (FWER). With thousands of hypothesis being tested at the same time, it is better to use the False Discovery Rate (FDR). FWER = P(V>0) FDR = E(V/R) Conditional on R>0, Storey (2001) showed the FDR can be written as P(True negative | Declare positive). Storey’s method of estimating FDR starts with p values and estimates the probability of the null hypothesis being true. In a Bayesian model, the probability of each gene following the null model is estimated, and these probabilities are used to calculate the estimated FDR.

Storey FDR = P(True negative) P(Declare positive | True negative) / P(Declare positive) = P(null hypothesis true) x P-value x N/R Bayes FDR = mean [P(gene belongs to null), for genes declared positive] The Bayesian estimate can be calculated for

any set of genes

, not just those based on ranking. This is not possible in approaches relying on ordered p-values. This is illustrated later on the Hedenfalk et al. (2001) breast cancer dataset.

Multi-class Microarray Experiments

Multi-class response experiments are those in which more than two experimental conditions are to be compared. Each gene has repeat measurements under several conditions, forming a gene expression profile.

The null hypothesis is no change across conditions (a constant gene profile). Typically there are many different non-constant profiles, leading to more complex alternatives than in two class comparison situations.

We have simulated data for 2 sets of profiles for 500 genes, the first with fairly heterogeneous profiles, the second with more homogeneous profiles. Each gene has 8 repeat measurements under 3 conditions. Profiles for 30 genes are shown here. Each line represents one gene. Black lines are genes with no change across conditions, blue and red lines are those with changes.

Bayesian Mixture Model

We use a Gaussian mixture with unknown number of components to model the null and alternative hypotheses for the D statistics. The model is estimated using reversible jump MCMC.

D g ~ w 0 N( μ 0 , σ 0 2 ) +  j=1:k w j N( μ , j σ j 2 ) μ 0 = 0 μ j ~ Uniform( μ 0 ,upper range) The null hypothesis is modelled using one Gaussian component with mean fixed to zero.

{w} ~ Dirichlet(1,1,…,1) σ j -2 ~ Gamma, exchangeable k ~ Poisson(2) The alternative is modelled

semi parametrically

by k Gaussian components with k allowed to vary. The predictive distribution for the alternative is

integrated over

k.

Latent variables z g = 0,1,…,k with probability w 0 ,w 1 ,…,w k The FDR is calculated from the latent variables {z},

integrated over

all values of k.

For any given group S of R genes, FDR = (1/R)  g in S P(z g = 0)

Simulation Study

We have performed 50 simulations of the 2 sets of gene profiles described previously, and calculated the Bayes and Storey estimates of FDR each time. For the Storey method we start with p-values derived from the F-statistics summarizing the gene profiles.

The Bayesian mixture fit has support for up to 4 components for Case A and up to 3 for Case B (density plots shown above). In both cases the FDR is well estimated by the Bayesian mixture, with true and estimated Bayes curves following each other closely.

The Storey method performs well for the more heterogeneous profiles, but is less good as the overlap between the profiles increases.

Transformation of F-statistics

We summarize each gene profile by an F-statistic. These are transformed to a statistic D which is approximately Gaussian if there is no change across conditions and positively skewed otherwise. Histograms of the D statistics for the simulated data are shown below. The D statistics for the homogeneous profiles are close to those for the Hedenfalk breast cancer data. Predictive densities from the Bayesian mixture model are also shown.

Breast Cancer Data

The publicly available microarray dataset from the Hedenfalk et al. (2001) breast cancer study consists of gene-expression data for three classes of tumour, BCRA1 related cancer, BCRA2-related cancer, and sporadic cases of breast cancer.

We estimate the Bayesian mixture FDR for

pre-defined groups of genes

with known functions: apoptosis, cell cycle regulation and cytoskeleton (26, 21 and 25 genes respectively). The estimated FDR for group of genes involved in cell cycle regulation is considerably smaller than for the other groups, indicating that many of these genes show differences between the different classes of tumour considered here. Apoptosis FDR:70% Cycle regulation FDR:26% Cytoskeleton FDR:66% References: Storey, J.D. (2001).

A direct approach to false discovery rates

. JRSS B,

64

, 479.

Hedenfalk, I. et al. (2001).

Gene-expression profiles in hereditary breast cancer

. N Engl J Med,

344

, 539.

Preprint of this work: Broët, P., Lewin, A., Richardson, S., Dalmasso, C. and Magdelenat, H. (2004).

A model-based approach for detecting distinctive gene expression profiles in multiclass response microarray experiments.

Submitted.

Available at http://www.bgx.org.uk

Email: [email protected]