Transcript Document

The Problem of Detecting
Differentially Expressed Genes
Gene 1
Gene 2
.
.
.
.
Gene N
Sample 2
.
.
.
.
Sample 1
Sample M
Sample 2
.
.
.
.
Sample 1
Gene 1
Gene 2
.
.
.
.
Gene N
Class 1
Class 2
Sample M
Fold Change is the Simplest Method
Calculate the log ratio between the two classes
and consider all genes that differ by more than
an arbitrary cutoff value to be differentially
expressed. A two-fold difference is often
chosen.
Fold change is not a statistical test.
Test of a Single Hypothesis
(1) For gene consider the null hypothesis of no association
between its expression level and its class membership.
(2) Decide on level of significance (commonly 5%).
(3) Perform a test (e.g Student’s t-test) for each gene.
(4) Obtain P-value corresponding to that test statistic.
(5) Compare P-value with the significance level. Then
either reject or retain the null hypothesis.
Two-Sample t-Statistic
Student’s t-statistic:
Wj 
y1 j  y 2 j
2
1j
s
n1  s
2
2j
n2
Two-Sample t-Statistic
Pooled form of the Student’s t-statistic, assumed common
variance in the two classes:
Wj 
y1 j  y 2 j
sj 1 n1  1 n2
Two-Sample t-Statistic
Modified t-statistic of Tusher et al. (2001):
Wj 
y1 j  y 2 j
sj 1 n1  1 n2  a0
Types of Errors in Hypothesis Testing
PREDICTED
Retain Null
TRUE
Null
Non-null
Reject Null
False positive
False negative
Multiplicity Problem
When many hypotheses are tested, the
probability of a false positive increases sharply
with the number of hypotheses.
Further: Genes are co-regulated,
subsequently there is correlation between the
test statistics.
Example
Suppose we measure the expression of 10,000
genes in a microarray experiment.
If all 10,000 genes were not differentially expressed, then
we would expect for:
P= 0.05
for each test, 500 false positives.
P= 0.05/10,000 for each test, .05 false positives.
Controlling the Error Rate
• Methods for controlling false positives e.g.
Bonferroni are too strict for microarray
analyses
• Use the False Discovery Rate instead (FDR)
(Benjamini and Hochberg 1995)
Methods for dealing with the Multiplicity Problem
• The Bonferroni Method
controls the family wise error rate (FWER) i.e.
the probability that at least one false positive error will be
made
 Too strict for gene expression data, tries to make it
unlikely that even one false rejection of the null is made,
may lead to missed findings
• The False Discovery Rate (FDR)
emphasizes the proportion of false positives among the
identified differentially expressed genes.
 Good for gene expression data – says something about
the chosen genes
False Discovery Rate
Benjamini and Hochberg (1995)
# (falsepositives)
FDR 
# (rejectedhypotheses)
The FDR is essentially the expectation of the
proportion of false positives among the identified
differentially expressed genes.
Possible Outcomes for N Hypothesis Tests
Accept Null
Reject Null
Total
Null True
N00
N01
N0
Non-True
N10
N11
N1
Total
N - Nr
Nr
N
N 01
FDR  E{
}
Nr  1
where
Nr  1  max( Nr ,1)
N 10
FNDR  E{
}
( N  Nr )  1
Positive FDR
pFDR  E{N01 / N r | N r  0}
 FDR/pr{Nr  0}
Lindsay, Kettenring, and Siegmund (2004).
A Report on the Future of Statistics.
Statist. Sci. 19.
Controlling FDR
Benjamini and Hochberg (1995)
Key papers on controlling the FDR
• Genovese and Wasserman (2002)
• Storey (2002, 2003)
• Storey and Tibshirani (2003a, 2003b)
• Storey, Taylor and Siegmund (2004)
• Black (2004)
• Cox and Wong (2004)
Benjamini-Hochberg (BH) Procedure
Controls the FDR at level a when the P-values following
the null distribution are independent and uniformly distributed.
(1) Let
p(1)   p( N )
(2) Calculate
be the observed P-values.
kˆ  argmaxk : p( k )  ak / N  .
1 k  N
(3) If
kˆ exists then reject null hypotheses corresponding to
p(1)    p( k ) . Otherwise, reject nothing.
Example: Bonferroni and BH Tests
Suppose that 10 independent hypothesis tests are carried out
leading to the following ordered P-values:
0.00017 0.00448 0.00671 0.00907 0.01220
0.33626 0.39341 0.53882 0.58125 0.98617
(a) With a = 0.05, the Bonferroni test rejects any hypothesis
whose P-value is less than a / 10 = 0.005.
Thus only the first two hypotheses are rejected.
(b) For the BH test, we find the largest k such that P(k) < ka / N.
Here k = 5, thus we reject the first five hypotheses.
q-VALUE
q-value of a gene j is expected proportion of false positives
when calling that gene significant.
P-value is the probability under the null hypothesis of
obtaining a value of the test statistic as or more extreme than
its observed value.
The q-value for an observed test statistic can be viewed as
the expected proportion of false positives among all genes
with their test statistics as or more extreme than the observed
value.
LIST OF SIGNIFICANT GENES
Call all genes significant if pj < 0.05
or
Call all genes significant if qj < 0.05
to produce a set of significant genes so that a proportion
of them (<0.05) is expected to be false (at least for a
large no. of genes not necessarily independent)
BRCA1 versus BRCA2-mutation positive
tumours (Hedenfalk et al., 2001)
BRCA1 (7) versus BRCA2-mutation (8) positive
tumours, p=3226 genes
P=.001
gave
51 genes differentially expressed
P=0.0001 gave 9-11 genes
Using q<0.05, gives 160 genes are taken
to be significant.
It means that approx. 8 of these 160
genes are expected to be false positives.
Also, it is estimated that 33% of the
genes are differentially expressed.
Null Distribution of the Test Statistic
Permutation Method
The null distribution has a resolution on the order of the
number of permutations.
If we perform B permutations, then the P-value will be estimated
with a resolution of 1/B.
If we assume that each gene has the same null distribution and
combine the permutations, then the resolution will be 1/(NB)
for the pooled null distribution.
Using just the B permutations of the class labels for the
gene-specific statistic Wj , the P-value for Wj = wj is assessed as:
pj 
(b )
0j
#{b :| w
|| w j |}
B
where w(b)0j is the null version of wj after the bth permutation
of the class labels.
If we pool over all N genes, then:
B
#{i :| w || w j |, i  1,...,N}
b 1
NB
pj  
(b )
0i
Null Distribution of the Test Statistic: Example
Class 1
Gene 1
Gene 2
A1(1) A2(1) A3(1)
A1(2) A2(2) A3(2)
Class 2
B4(1) B5(1) B6(1)
B4(2) B5(2) B6(2)
Suppose we have two classes of tissue samples, with three samples
from each class. Consider the expressions of two genes, Gene 1 and
Gene 2.
Class 1
Gene 1
Gene 2
A1(1) A2(1) A3(1)
A1(2) A2(2) A3(2)
Class 2
B4(1) B5(1) B6(1)
B4(2) B5(2) B6(2)
To find the null distribution of the test statistic for Gene 1, we
proceed under the assumption that there is no difference between the
classes (for Gene 1) so that:
Gene 1
A1(1) A2(1) A3(1)
A4(1) A5(1) A6(1)
And permute the class labels:
Perm. 1 A1(1) A2(1) A4(1)
...
There are 10 distinct permutations.
A3(1) A5(1) A6(1)
Ten Permutations of Gene 1
A1(1) A2(1) A3(1)
A4(1) A5(1) A6(1)
A1(1) A2(1) A4(1)
A3(1) A5(1) A6(1)
A1(1) A2(1) A5(1)
A3(1) A4(1) A6(1)
A1(1) A2(1) A6(1)
A3(1) A4(1) A5(1)
A1(1) A3(1) A4(1)
A2(1) A5(1) A6(1)
A1(1) A3(1) A5(1)
A2(1) A4(1) A6(1)
A1(1) A3(1) A6(1)
A2(1) A4(1) A5(1)
A1(1) A4(1) A5(1)
A2(1) A3(1) A6(1)
A1(1) A4(1) A6(1)
A2(1) A3(1) A5(1)
A1(1) A5(1) A6(1)
A2(1) A3(1) A4(1)
As there are only 10 distinct permutations here, the
null distribution based on these permutations is too
granular.
Hence consideration is given to permuting the labels
of each of the other genes and estimating the null
distribution of a gene based on the pooled permutations
so obtained.
But there is a problem with this method in that the
null values of the test statistic for each gene does not
necessarily have the theoretical null distribution that
we are trying to estimate.
Suppose we were to use Gene 2 also to estimate
the null distribution of Gene 1.
Suppose that Gene 2 is differentially expressed,
then the null values of the test statistic for Gene 2
will have a mixed distribution.
Class 1
Class 2
Gene 1
Gene 2
A1(1) A2(1) A3(1)
A1(2) A2(2) A3(2)
B4(1) B5(1) B6(1)
B4(2) B5(2) B6(2)
Gene 2
A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
Permute the class labels:
Perm. 1 A1(2) A2(2) B4(2)
...
A3(2) B5(2) B6(2)
Example of a null case: with 7 N(0,1) points and
8 N(0,1) points; histogram of the pooled two-sample
t-statistic under 1000 permutations of the class
labels with t13 density superimposed.
ty
Example of a null case: with 7 N(0,1) points and
8 N(10,9) points; histogram of the pooled two-sample
t-statistic under 1000 permutations of the class
labels with t13 density superimposed.
ty
The SAM Method
Use the permutation method to calculate the null
distribution of the modified t-statistic (Tusher et al., 2001).
B
w0( j )  (1 / B) w
b 1
(b )
0( j )
( j  1,..., N ).
The order statistics t(1), ... , t(N) are plotted against their null
expectations above.
A good test in situations where there are more genes being
over-expressed than under-expressed, or vice-versa.
The FDR and other error rates
PREDICTED
Retain Null
Reject Null
Null
a
Non-null
c (false negative) d
TRUE
b
(false positive)
R
FDR ~
b
R
FNDR ~
c
ac
The FDR and other error rates
PREDICTED
Retain Null
Reject Null
Null
a
Non-null
c (false negative) d
TRUE
b
(false positive)
R
FDR ~
b
R
FNR =
c
cd
Toy Example with 10,000 Genes
Test Result
non-DE
DE
True
non-DE A = 9025
DE
C = 100
Total
9125
B = 475
D = 400
875
Total
9500
500
10000
FDR = B / (B + D) = 475 / 875 = 54%
FNDR = C / (A + C) = 100 / 9125 = 1%
Two-component mixture model
f (wj )   0 f0 (wj )  1 f1(wj )
0
is the proportion of genes that are not
differentially expressed, and
1  1   0
Use of the P-Value as a Summary Statistic
(Allison et al., 2002)
Instead of using the pooled form of the t-statistic, we can work
with the value pj, which is the P-value associated with tj
in the test of the null hypothesis of no difference in expression
between the two classes.
The distribution of the P-value is modelled by the h-component
mixture model
h
f ( pj )   if ( pj;ai1,ai 2) ,
i 1
where a11 = a12 = 1.
Use of the P-Value as a Summary Statistic
Under the null hypothesis of no difference in expression for the
jth gene, pj will have a uniform distribution on the unit interval;
ie the b1,1 distribution.
The ba1,a2 density is given by
a1 1
f (u;a1,a2 )  {u
a2 1
(1  u)
/ B(a1,a2 )}I(0,1) (u),
where
B(a1 ,a 2 )  (a1 )(a 2 ) / (a1  a 2 ).
• Efron B, Tibshirani R, Storey JD, Tusher V
(2001) Empirical Bayes analysis of a microarray
experiment. JASA 96,1151-1160.
• Efron B (2004) Large-scale simultaenous
hypothesis testing: the choice of a null
hypothesis. JASA 99, 96-104.
• Efron B (2004) Selection and Estimation for
Large-Scale Simultaneous Inference.
• Efron B (2005) Local False Discovery Rates.
• Efron B (2006) Correlation and Large-Scale
Simultaneous Significance Testing.
• McLachlan GJ, Bean RW, Ben-Tovim Jones L,
Zhu JX. Using mixture models to detect
differentially expressed genes. Australian Journal
of Experimental Agriculture 45, 859-866.
• McLachlan GJ, Bean RW, Ben-Tovim Jones L. A
simple implentation of a normal mixture approach
to differential gene expression in multiclass
microarrays. Bioinformatics 26. To appear.
Two component mixture model
π0 is the proportion of genes that are not differentially
expressed. The two-component mixture model is:
f (wj )   0 f0 (wj )  1 f1(wj )
Using Bayes’ Theorem, we calculate the posterior
probability that gene j is not differentially
expressed:
 0 (w j ) 
 0 f0 (w j )
f (wj )
If we conclude that gene j is differentially
expressed if:
ˆ0(wj)  c0,
then this decision minimizes the (estimated)
Bayes risk
Risk  (1  c) 0e01  c 1e10
where e01 is the probability of a false positive and
e10 is the probability of a false negative.
where
Estimated FDR
where
Similarly, the false positive rate is given by
N
N
j 1
j 1
FPˆ R  ˆ0 ( w j ) I[ 0,c0 ] (ˆ0 ( w j )) / ˆ0 (w j )
and the false non-discovery rate and false negative
rate by:
N
FNDˆ R  ˆ1 ( w j ) I ( c0 , ) (ˆ0 ( w j )) /( N  N r )
j 1
N
N
j 1
j 1
FNˆ R  ˆ1 (w j ) I (c0 ,) (ˆ0 (w j )) / ˆ1 (w j )
Glonek and Solomon (2003)
F0: N(0,1), π0=0.9
F1: N(1,1), π1=0.1
F0: N(0,1), π0=0.6
F1: N(1,1), π1=0.4
Reject H0 if z≥2
Reject H0 if z≥2
τ0(2) = 0.99972
but FDR=0.17
τ0(2) = 0.251
but FDR=0.177
-4
-2
0
2
4
0.0
0.1
0.2
0.3
0.4
-4
-2
0
2
4
0.0
0.1
0.2
0.3
0.4
Gene Statistics: Two-Sample t-Statistic
Student’s t-statistic
Pooled form of the Student’s
t-statistic, assumed common
variance in the two classes
Modified t-statistic of
Tusher et al. (2001)
wj 
wj 
wj 
y1 j  y 2 j
2
1j
s
n1  s
2
2j
n2
y1 j  y 2 j
sj 1 n1  1 n2
y1 j  y 2 j
sj 1 n1  1 n2  a0
The Procedure
1. Obtain the z-score for each of the genes
zj  
1
1  p 
j
2. Rank the genes on the basis of the z-scores, starting
with the largest ones (the same ordering as with the Pvalues, pj).
3. The posterior probability of non-differential
expression of gene j, is given by  0(zj).
4. Conclude gene j to be differentially expressed if
ˆ 0(zj) < c0
τ0(zj) ≤ c
 0 f0 ( z j )
c
 0 f 0 ( z j )   1 f1 ( z j )
1 c   0 
 

f0 ( z j )
c  1 
f1 ( z j )
If π0/π1 ≤ 9,
then
1 c
9
f0 ( z j )
c
f1 ( z j )
e.g. c = 0.2,
f1 ( z j )
f0 ( z j )
 36
τ0(zj) ≤ c
Suppose π0/ π1 ≥ 0.9
then τ0(zj) ≤ 0.2 if
f1 ( z j )
f0 ( z j )
 36
Much stronger level of evidence against the null
than in standard one-at-a-time testing.
e.g.
Zj ~ N(μ,1)
H0 : μ = 0 vs H1 : μ = 2.8
Rejecting H0 if |Zj| ≥ 1.96 yields a two-sided
test of size 0.05 and power 0.80.
f1(1.96)/f0(1.96) = 4.8.
Z-scores, null case
Z-scores, +2
Z-scores, +1
Z-scores, +3
Use of Wilson-Hilferty Transformation as
in Broet et al. (2004)
zj
j
25
20
15
0
5
10
Frequency
-1
0
1
2
3
4
Transformed t-statistics
Plot of approximate z-scores obtained by
Wilson-Hilferty transformation of simulated values of F-statistic
with 1 and 8 degrees of freedom.
EMMIX-FDR
A program has been written in R which
interfaces with EMMIX to implement the
algorithm described in McLachlan et al
(2006).
We fit a mixture of two normal components
to test statistics calculated from the gene
expression data.
When we equate the sample mean and variance
of the mixture to their population counterparts, we obtain:
z  ˆ0 ˆ 0  ˆ1ˆ1
s  ˆ0ˆ  ˆ ˆ  ˆ0ˆ1 (ˆ 0  ˆ1 )
2
z
2
0
2
1 1
2
When we are working with the theoretical null,
we can easily estimate the mean and variance
of the non-null component with the following
formulae.
ˆ12  {sz2  ˆ0  ˆ0 (1  ˆ0 )ˆ12} /(1  ˆ0 )
ˆ1  z /(1  ˆ0 )
Following the approach of Storey and Tibshirani (2003)
we can obtain an initial estimate for π0 as follows:
 ( ) #{z j : z j  } /{N( )}
( 0)
0
This estimate of π0 is used as input to EMMIX as
part of the initial parameter estimates. Thus no random
or k-means starts are required, as is usually the case.
There are two different versions of EMMIX used, the
standard version for the empirical null and a modified
version for the theoretical null which fixes the mean and
variance of the null component to be 0 and 1 respectively.
Theoretical and empirical nulls
Efron (2004) suggested the use of two kinds
of null component: the theoretical and the
empirical null. In the theoretical case the
null component has mean 0 and variance 1
and the empirical null has unrestricted mean
and variance.
Examples
We examined the performance of EMMIXFDR on three well-known data sets
from the literature.
• Alon colon cancer data (1999)
• Hedenfalk breast cancer data (2001)
• van’t Wout HIV data (2003)
Hedenfalk Breast Cancer Data
Hedenfalk et al. (2001) used cDNA arrays to obtain gene
expression profiles of tumours from carriers of either the
BRCA1 or BRCA2 mutation (hereditary breast cancers), as
well as sporadic breast cancer.
We consider their data set of M = 15 patients, comprising
two patient groups: BRCA1 (7) versus BRCA2 - mutation
positive (8), with N = 3,226 genes.
The problem is to find genes which are differentially
expressed between the BRCA1 and BRCA2 patients.
Hedenfalk et al. (2001) NEJM, 344, 539-547
Two component model for the Breast Cancer Data
Fit
 0 N (0,1)  1N (1 , )
2
1
to the N values of wj (based on pooled two-sample tstatistic)
j th gene is taken to be differentially expressed if:
ˆ0 (wj )  c0
60
Null
20
40
Non–Null
(DE genes)
0
Frequency
80
100
Fitting two component
mixture
model
to Hedenfalk data
Histogram of z-scores
for 3226 Hedenfalk
genes
-4
-2
0
z
2
4
Estimates of π0 for Hedenfalk data
• 0.52 (Broet, 2004)
• 0.64 (Gottardo, 2006)
• 0.61 (Ploner et al, 2006)
• 0.47 (Storey, 2002)
Using a theoretical null, we estimated π0 to be
0.65.
We used
pj = 2 F13(-|wj|)
Similar results where pj obtained by
permutation methods.
Ranking and Selecting the Genes
Gene j
Gene 1
.
.
.
Gene R
.
.
.
.
Gene R+R1
.
.
.
Gene N
Pj
zj
Local FDR
ˆ ( z )
0
j
0.002
.
.
.
0.1
0.12
0.18
.
.
0.20
FDR
= Sum/R
= 0.06
c0 = 0.1
Proportion of
False Negatives
= 1 – Sum1/ R1
Estimated FDR for various levels of c0
c0
R
FDˆ R
0.5
906
0.26
0.4
694
0.21
0.3
518
0.16
0.2
326
0.11
0.1
143
0.06
c0
Nr
FDˆ R
0.1
143
0.06
0.32
0.88
0.004
0.2
338
0.11
0.28
0.73
0.02
0.3
539
0.16
0.25
0.60
0.04
0.4
742
0.21
0.22
0.48
0.08
0.5
971
0.27
0.18
0.37
0.12
FNˆ DR FNˆ R
FPˆ R
Estimated FDR and other error rates for various levels
of threshold c0 applied to the posterior probability of nondifferential
expression for the breast cancer data (Nr=number of selected genes)
Comparison of identified DE genes
Our method (143)
Hedenfalk (175)
24
6
12
29
101
39
8
Storey and Tibshirani (160)
Storey and Tibshirani (2003) PNAS, 100, 9440-9445
Uniquely Identified Genes:
Differentially Expressed between BRCA1 and BRCA2
Gene
UBE2B, DDB2
(UBE2V1)
RAB9, RHOC
ITGB5, ITGA3
PRKCBP1
HDAC3, MIF
KIF5B, spindle body
pole protein
CTCL1
TNAFIP1
HARS, HSD17B7
GO Term
DNA repair
(cell cycle)
small GTPase signal transduction
integrin mediated signalling pathway
regulation of transcription
negative regulation of apoptosis
cytoskeleton organisation
vesicle mediated transport
cation transport
metabolism
The Procedure
1. Obtain the z-score for each of the genes
zj  
1
1  p 
j
2. Rank the genes on the basis of the z-scores, starting
with the largest ones (the same ordering as with the Pvalues, pj).
3. The posterior probability of non-differential
expression of gene j, is given by  0(zj).
4. Conclude gene j to be differentially expressed if
ˆ 0(zj) < c0
100
80
60
0
20
40
Frequency
-4
-2
0
2
4
z-scores
Breast cancer data: plot of fitted two-component
normal mixture model with theoretical N(0,1) null and non-null
components (weighted respectively by the estimated proportion of null and
non-null genes) imposed on histogram of z-scores.
100
80
60
0
20
40
Frequency
-4
-2
0
2
4
z-scores
Hedenfalk breast cancer data:
plot of fitted two-component normal mixture model with empirical null
and non-null components (weighted respectively by the estimated proportion
of null and non-null genes) imposed on histogram of z-scores.
Histogram of N=3,226 z-Values from the Breast Cancer Study.
The theoretical N(0,1) null is much narrower than the central peak,
which has (δ0,σ0)=(-.02,1.58). In this case the central peak seems to
include the entire histogram.
Alon Colon Cancer Data
• Affymetrix arrays, samples from colon tissues from 62
patients
• Dataset of M = 62 patients: 40 tumor vs 22 normal, with
N = 2,000 genes.
Alon et al. (1999) PNAS, 96, 6745-6750
Alon data theoretical null
We estimated π0 to be 0.39 using the
theoretical null. With the empirical null, it
is 0.53.
Six smooth muscle genes are included in the
2,000 genes and each has an estimated
posterior probability of non-DE less than
0.015.
60
50
40
30
0
10
20
Frequency
-6
-4
-2
0
2
4
6
z-scores
Colon cancer data: plot of fitted two-component
normal mixture model with theoretical N(0,1) null and non-null
components (weighted respectively by the estimated proportion of null and
non-null genes) imposed on histogram of z-scores.
Colon cancer data: plot of fitted two-component
normal mixture model with empirical null and non-null
components (weighted respectively by the estimated
proportion of null and non-null genes) imposed on
histogram of z-scores.
van’t Wout HIV Data
• Four experiments using the same RNA preparation on 4
different slides
• Expression levels of transcripts assessed in CD4-T-cell
lines 24 hours after infection with HIV type 1
• Two samples – one corresponds to HIV infected cells and
one to non-infected cells
• Dataset of M = 8 samples: 4 HIV-infected vs 4 noninfected, with N = 7,680 genes. Analysed by Gottardo et al
(2006)
van’t Wout et al (2003), J Virology 77, 1392-1402
Gottardo et al (2006), Biometrics 62, 10-18
350
300
250
200
150
0
50
100
Frequency
-6
-4
-2
0
2
4
6
z-scores
HIV data: plot of fitted two-component
normal mixture model with empirical null and non-null
components (weighted respectively by the estimated proportion of null and
non-null genes) imposed on histogram of z-scores.
Conclusions
• Mixture-model based approach to finding DE genes
can yield new information
• Gives a measure of the posterior probability that a
gene is not DE (i.e. a local FDR rather than global)
• Provides estimates of global rates – the FDR and
the false negative rate FNR (FNR=1-sensitivity)
Mixture Model-Based Approach: Advantages
• Provides a local FDR and a measure of the
FNR, as well as the global FDR for the
selected genes.
• We show that it can yield new information
Allison Mice Simulation
Allison generated data for 10 mice over 3000
genes. The data are generated in six groups of
500 with a value ρ of 0, 0.4, or 0.8 in the offdiagonal elements of the 500 x 500 covariance
matrix used to generate each group.
For a random 20% of the genes, a value d of
0, 4, or 8 is added to the gene expression
levels of the last five mice.
Theoretical null, ρ=0.8, d=4
Empirical null, ρ=0.8, d=4
Theoretical null, ρ=0.8, d=8
Empirical null, ρ=0.8, d=8
Suppose 0(w) is monotonic decreasing in w.
Then
ˆ0( wj )  c0 for wj  w0.
1  F 0( w0)
ˆ
FDR  ˆ 0
1  Fˆ ( w0)
E 0(w) | w  w0
Suppose 0(w) is monotonic decreasing in w. Then
ˆ0( wj )  c0
for
wj  w0.
1  F 0( w0)
ˆ
FDR  ˆ 0
1  Fˆ ( w0)
where
F 0( w0)  ( w0)


ˆ
w
0
1

Fˆ ( w0)  ˆ 0( w0)  ˆ1
 ˆ1 
For a desired control level a, say a = 0.05, define

w
ˆ R(w)  a
w0  argmin FD
If
1  F 0( w)
ˆ 0
1  Fˆ ( w)

(1)
is monotonic in w, then using (1)
to control the FDR [with ˆ 0  1 and Fˆ (w0) taken
to be the empirical distribution function] is
equivalent to using the Benjamini-Hochberg
procedure based on the P-values corresponding
to the statistic wj.