Multiple Testing in Microarray Data Analysis

Mi-Ok Kim
Outline
1. Hypothesis Testing
2. Issues in Multiple Testing in Microarray Analysis
1) Type I Error
2) Power
3) P-values
3. Permutation
1. Hypothesis Testing
• $H_0$ : null hypothesis vs. $H_1$ : alternative hypothesis
• $T$ : test statistic, $C$ : critical value
• If $|T| > C$, $H_0$ is rejected. Otherwise $H_0$ is retained
• Ex) $H_0 : \mu_1 = \mu_2$ vs. $H_1 : \mu_1 \neq \mu_2$, $T = (\bar{x}_1 - \bar{x}_2) / \text{pooled SE}$
If $|T| > z_{1-\alpha/2}$, $H_0$ is rejected at the significance level $\alpha$, i.e. $C = z_{1-\alpha/2}$
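The decision rule in the example can be sketched in Python. This is a minimal illustration, assuming equal variances in the two groups and a normal critical value; the function name `two_sample_z_test` and the sample values are hypothetical and not part of the slides.

```python
import math
from statistics import NormalDist


def two_sample_z_test(x1, x2, alpha=0.05):
    """Return (T, C, reject) for H0: mu1 = mu2 vs. H1: mu1 != mu2."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = sum(x1) / n1, sum(x2) / n2
    # Pooled variance (assumption: both groups share the same variance).
    ss1 = sum((v - mean1) ** 2 for v in x1)
    ss2 = sum((v - mean2) ** 2 for v in x2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    pooled_se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (mean1 - mean2) / pooled_se
    # Critical value C = z(1 - alpha/2) from the standard normal.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return t_stat, critical, abs(t_stat) > critical


t_stat, critical, reject = two_sample_z_test(
    [5.1, 4.9, 5.3, 5.0, 5.2], [3.9, 4.1, 4.0, 4.2, 3.8]
)
print(t_stat, critical, reject)
```

With small samples a t critical value would usually replace the normal one; the normal quantile is used here only because the slide states the rule with z.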
1. Hypothesis Testing
                Hypothesis result
Truth           Retained          Rejected
H0              correct           Type I error
H1              Type II error     correct
• Type I error rate = probability of a false positive ($\alpha$ : significance level)
• Type II error rate = probability of a false negative
• Power = 1 − Type II error rate
• P-value : $p = \inf\{ \alpha \mid H_0 \text{ is rejected at the significance level } \alpha \}$
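As a small illustration of the p-value definition, the sketch below computes a two-sided p-value for a statistic that is assumed to be standard normal under the null. The function name `two_sided_p_value` is hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(t_stat):
    """Smallest significance level at which |T| > z(1 - alpha/2) just rejects H0."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))


# A statistic sitting at the 5% critical value has a p-value of about 0.05.
print(two_sided_p_value(1.959964))
```

Rejecting whenever the p-value is at most the significance level gives the same decision as comparing |T| with the critical value.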
2. Issues in Multiple Comparison
• Q : Given n treatments, which two treatments are
significantly different ? (simultaneous testing)
cf) Is treatment A different from treatment B ?
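• Ex) $n$ treatment means : $\mu_1, \dots, \mu_n$
$H_{ij} : \mu_i = \mu_j$ where $i \neq j$
$T_{ij} = (\bar{x}_i - \bar{x}_j) / \text{pooled SE}$
• Probability of at least one Type I error when $k$ independent comparisons are each tested
at the 0.05 significance level one by one : $1 - (0.95)^k$
• Inflated Type I error, ex) $k = 10$ : $\alpha = 1 - (0.95)^{10} = 0.401263$
• Remedy : Bonferroni method
per-comparison significance level = $\alpha$ / number of comparisons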
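The inflation and the Bonferroni remedy can be checked numerically. This is a minimal sketch that assumes the comparisons are independent; the function names are hypothetical.

```python
def familywise_error(alpha, k):
    """Chance of at least one false positive in k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


def bonferroni_level(alpha, k):
    """Per-comparison level that keeps the overall error rate at or below alpha."""
    return alpha / k


print(familywise_error(0.05, 10))
print(familywise_error(bonferroni_level(0.05, 10), 10))
```

The second value stays just below 0.05, which is the point of dividing the level by the number of comparisons.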
3. Issues in Multiple Testing in Microarray Analysis
• Goal : identification of differentially expressed genes.
ex) a study of differential gene expression in
tumor biopsy specimens from leukemia patients (ALL /
AML) that includes 6,817 genes and 30 samples
• rows : genes ( m )
• columns : samples ( n )
• Hj : jth gene is not differentially expressed
• Simultaneously testing m null hypotheses Hj , j=1, …, m,
to determine which hypotheses to reject while controlling a
suitably defined Type I error and maximizing power
3-1) Type I Error Rates
                Hypothesis result
Truth           # retained     # rejected     Total
H0              U              V              m0
H1              T              S              m1
Total           m − R          R              m

• Per-comparison error rate (PCER) = $E(V) / m$
• Per-family error rate (PFER) = $E(V)$
• Family-wise error rate (FWER) = $\Pr(V \geq 1)$
• False discovery rate (FDR) = $E(Q)$, where $Q = V/R$ if $R > 0$ and $Q = 0$ if $R = 0$
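To make the counts concrete, the following sketch tallies U, V, T, S, R and Q for one set of test results. In practice the truth is unknown, so this is only useful in simulations; the function name and the example vectors are hypothetical.

```python
def rejection_table(is_true_null, is_rejected):
    """Count the four cells of the truth-by-decision table and Q = V/R."""
    U = V = T = S = 0
    for null, rejected in zip(is_true_null, is_rejected):
        if null and not rejected:
            U += 1  # true null, retained
        elif null and rejected:
            V += 1  # true null, rejected (false positive)
        elif not null and not rejected:
            T += 1  # false null, retained (false negative)
        else:
            S += 1  # false null, rejected
    R = V + S
    Q = V / R if R > 0 else 0.0
    return {"U": U, "V": V, "T": T, "S": S, "R": R, "Q": Q}


truth = [True, True, True, False, False]
decision = [False, True, False, True, True]
print(rejection_table(truth, decision))
```

Averaging V, the indicator of V ≥ 1, and Q over many simulated data sets would estimate PFER, FWER and FDR respectively.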
3-1) Type I Error Rates
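Under the complete null hypothesis, each $H_j$ has Type I error rate $\alpha_j$.
• PCER = $E(V)/m = (\alpha_1 + \dots + \alpha_m)/m$
• PFER = $E(V) = \alpha_1 + \dots + \alpha_m$
• FWER = $\Pr(V \geq 1) = 1 - \Pr(\text{no } H_j,\ j = 1, \dots, m, \text{ is rejected})$
• FDR = $E(Q)$ = FWER, because under the complete null every rejection is a false positive, so $V = R$ and $Q = 1$ whenever $R > 0$
• PCER = $(\alpha_1 + \dots + \alpha_m)/m \leq \max(\alpha_1, \dots, \alpha_m) \leq$ FWER = FDR $\leq$ PFER = $\alpha_1 + \dots + \alpha_m$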
3-1) Type I Error Rate
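Assume hypotheses $H_j$, $j = 1, \dots, m$, whose test statistics $T_j$, $j = 1, \dots, m$,
have a multivariate normal (MN) distribution with mean $\mu = (\mu_1, \dots, \mu_m)$ and identity
covariance matrix, so that the $T_j$ are independent. Index the hypotheses so that the $m_0$ true nulls are $H_1, \dots, H_{m_0}$.
Let $R_j = I(H_j \text{ is rejected})$ and let $r_j$ be the observed value of $R_j$.
Let $\alpha_j = \Pr(R_j = 1)$, the probability that $H_j$ is rejected: the Type I error rate for a true null ($j \leq m_0$) and the power for a false null ($j > m_0$).
PFER $= \sum_{j=1}^{m_0} \alpha_j$
PCER $= \sum_{j=1}^{m_0} \alpha_j / m$
FWER $= 1 - \prod_{j=1}^{m_0} (1 - \alpha_j)$
FDR $= \sum_{r_1=0}^{1} \cdots \sum_{r_m=0}^{1} \left( \sum_{j=1}^{m_0} r_j \Big/ \sum_{j=1}^{m} r_j \right) \prod_{j=1}^{m} \alpha_j^{r_j} (1 - \alpha_j)^{1-r_j}$, with $0/0$ taken as $0$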
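For small m the four rates can be evaluated by brute force. The sketch below enumerates every rejection pattern (r1, …, rm), assuming independent tests and assuming the first `m0` entries of `alphas` belong to the true nulls. Both the function name and that ordering convention are assumptions made for the illustration.

```python
from itertools import product
from math import prod


def error_rates(alphas, m0):
    """Return (PCER, FWER, FDR, PFER) for independent tests.

    alphas[j] is the rejection probability of hypothesis j; the first m0
    hypotheses are taken to be the true nulls (illustrative convention).
    """
    m = len(alphas)
    pfer = sum(alphas[:m0])
    pcer = pfer / m
    fwer = 1 - prod(1 - a for a in alphas[:m0])
    fdr = 0.0
    for r in product((0, 1), repeat=m):
        total_rejected = sum(r)
        if total_rejected == 0:
            continue  # Q = 0 when nothing is rejected
        # Probability of this rejection pattern under independence.
        prob = prod(a if rj else 1 - a for a, rj in zip(alphas, r))
        fdr += sum(r[:m0]) / total_rejected * prob
    return pcer, fwer, fdr, pfer


# Complete null with three tests at level 0.05: FDR coincides with FWER.
print(error_rates([0.05, 0.05, 0.05], m0=3))
# Two true nulls and one false null rejected with probability 0.8.
print(error_rates([0.05, 0.05, 0.8], m0=2))
```

The enumeration visits 2^m patterns, so it is only practical for a handful of hypotheses.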
3-2) Strong vs. Weak Control
• Expectations and probabilities are conditional on which
hypotheses are true.
• Strong control : control of the Type I error rate under
any combination of true and false hypotheses, i.e. for any value
of $m_0$
• Weak control : control of the Type I error rate only when all
the null hypotheses are true, i.e. under the complete null
hypothesis $\bigcap_{j=1}^{m} H_j$
• In the microarray setting, where it is very unlikely that no
genes are differentially expressed, it seems particularly
important to have a strong control of the Type I error rate.
3-3) Power
• Within the class of multiple testing procedures that control
a given Type I error rate at an acceptable level $\alpha$, maximize
power, that is, minimize a suitably defined Type II error
rate.
• Any-pair power : $\Pr(S \geq 1)$ = the probability of rejecting
at least one false null hypothesis
• Per-pair power : average power = $E(S) / m_1$
• All-pair power : $\Pr(S = m_1)$ = the probability of rejecting
all false null hypotheses
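The three power definitions can be estimated from simulation output. The sketch below assumes a list holding the value of S (number of correctly rejected false nulls) from each simulated data set; the function name and the example counts are hypothetical.

```python
def power_summaries(s_counts, m1):
    """Estimate any-pair, per-pair and all-pair power from simulated values of S."""
    n_sim = len(s_counts)
    any_pair = sum(1 for s in s_counts if s >= 1) / n_sim   # Pr(S >= 1)
    per_pair = sum(s_counts) / n_sim / m1                   # E(S) / m1
    all_pair = sum(1 for s in s_counts if s == m1) / n_sim  # Pr(S = m1)
    return any_pair, per_pair, all_pair


# Four simulated data sets with m1 = 2 false null hypotheses.
print(power_summaries([0, 1, 2, 2], m1=2))
```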
3-4) Multiple Testing Procedures based on P-values that control the family-wise error rate
• For a single hypothesis $H_1$,
$p_1 = \inf\{ \alpha \mid H_1 \text{ is rejected at the significance level } \alpha \}$
If $p_1 \leq \alpha$, $H_1$ is rejected. Otherwise $H_1$ is retained
• Adjusted p-values for multiple testing ($p^*$)
$p_j^* = \inf\{ \alpha \mid H_j \text{ is rejected at FWER} = \alpha \}$
If $p_j^* \leq \alpha$, $H_j$ is rejected. Otherwise $H_j$ is retained
• Single-step, step-down and step-up procedures
3-4-1) Single-Step Procedure
• For strong control of FWER,
single-step Bonferroni adjusted p-values : $p_j^* = \min(m p_j, 1)$
single-step Sidak adjusted p-values : $p_j^* = 1 - (1 - p_j)^m$
• For weak control of FWER,
single-step minP adjusted p-values
$p_j^* = \Pr( \min_{1 \leq k \leq m} P_k \leq p_j \mid \text{complete null} )$
single-step maxT adjusted p-values
$p_j^* = \Pr( \max_{1 \leq k \leq m} |T_k| \geq C_j \mid \text{complete null} )$, where $C_j = |t_j|$ is the observed statistic
• Under the subset pivotality property, weak control = strong
control
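The two closed-form single-step adjustments are easy to compute directly. This is a minimal sketch with hypothetical function names and example p-values.

```python
def bonferroni_adjust(pvals):
    """Single-step Bonferroni adjusted p-values: min(m * p, 1)."""
    m = len(pvals)
    return [min(m * p, 1.0) for p in pvals]


def sidak_adjust(pvals):
    """Single-step Sidak adjusted p-values: 1 - (1 - p)^m."""
    m = len(pvals)
    return [1 - (1 - p) ** m for p in pvals]


raw = [0.01, 0.04, 0.5]
print(bonferroni_adjust(raw))
print(sidak_adjust(raw))
```

The Sidak values are never larger than the Bonferroni values, so Sidak rejects at least as many hypotheses at the same level.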
3-4-2) Step-Down Procedure
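• Order the observed unadjusted p-values such that $p_{r_1} \leq p_{r_2} \leq \dots \leq p_{r_m}$
• Accordingly, order the hypotheses $H_{r_1}, H_{r_2}, \dots, H_{r_m}$
• Holm’s procedure
$j^* = \min\{ j \mid p_{r_j} > \alpha / (m-j+1) \}$, reject $H_{r_j}$ for $j = 1, \dots, j^*-1$ (reject all if no such $j$ exists)
• Adjusted step-down p-values, each maximum taken over $k = 1, \dots, j$
Holm : $p_{r_j}^* = \max_k \{ \min( (m-k+1)\, p_{r_k}, 1 ) \}$
Sidak : $p_{r_j}^* = \max_k \{ 1 - (1 - p_{r_k})^{m-k+1} \}$
minP : $p_{r_j}^* = \max_k \{ \Pr( \min_{l \in \{r_k, \dots, r_m\}} P_l \leq p_{r_k} \mid \text{complete null} ) \}$
maxT : $p_{r_j}^* = \max_k \{ \Pr( \max_{l \in \{r_k, \dots, r_m\}} |T_l| \geq C_{r_k} \mid \text{complete null} ) \}$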
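The first of the step-down formulas, the Holm adjustment, can be sketched as a running maximum over the sorted p-values. The function name and example values are hypothetical.

```python
def holm_adjust(pvals):
    """Step-down Holm adjusted p-values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices r_1, ..., r_m
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):  # rank = k - 1
        candidate = min((m - rank) * pvals[idx], 1.0)  # (m - k + 1) * p_{r_k}
        running_max = max(running_max, candidate)      # max over k = 1..j
        adjusted[idx] = running_max
    return adjusted


print(holm_adjust([0.01, 0.04, 0.03, 0.005]))
```

Rejecting every hypothesis whose adjusted p-value is at most the chosen level reproduces the step-down decision rule, because the running maximum stops rejections at the first ordered p-value that fails its threshold.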
3-4-3) Step-Up Procedure
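• Order the observed unadjusted p-values such that $p_{r_1} \leq p_{r_2} \leq \dots \leq p_{r_m}$
• Accordingly, order the hypotheses $H_{r_1}, H_{r_2}, \dots, H_{r_m}$
• $j^* = \max\{ j \mid p_{r_j} \leq \alpha / (m-j+1) \}$, reject $H_{r_j}$ for $j = 1, \dots, j^*$
• Adjusted step-up p-values (this rule is Hochberg's procedure)
$p_{r_j}^* = \min_{k = j, \dots, m} \{ \min( (m-k+1)\, p_{r_k}, 1 ) \}$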
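The step-up adjustment can be sketched as a running minimum taken from the largest p-value downward. A step-up rule with these (m − k + 1) multipliers is commonly attributed to Hochberg; that attribution, the function name and the example values are assumptions of this sketch.

```python
def step_up_adjust(pvals):
    """Step-up adjusted p-values with (m - k + 1) multipliers, in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices r_1, ..., r_m
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # start from the largest p-value
        idx = order[rank]
        candidate = min((m - rank) * pvals[idx], 1.0)  # (m - k + 1) * p_{r_k}
        running_min = min(running_min, candidate)      # min over k = j..m
        adjusted[idx] = running_min
    return adjusted


print(step_up_adjust([0.01, 0.04, 0.03, 0.005]))
```

Because the minimum replaces the maximum used in the step-down version, these adjusted values are never larger than the Holm ones for the same input.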
3-5) Resampling Method
• Rows – genes , Columns – samples
• Bootstrap or permutation based method
• Estimate the joint distribution of the test statistics under the
complete null hypothesis by permuting the columns (samples) of the
gene expression data matrix
• For the bth permutation, b = 1, …, B,
compute test statistics $t_{1,b}, \dots, t_{m,b}$
$p_j^* = \sum_{b=1}^{B} I( |t_{j,b}| \geq C_j ) / B$, where $C_j = |t_j|$ is the observed statistic for gene $j$
ex) Golub et al. (1999)
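The permutation scheme can be sketched with the standard library alone. The example below shuffles the sample labels (equivalent to permuting columns) and uses a difference in group means as the statistic. That statistic, the function names and the toy data are illustrative assumptions, not the method of any cited study.

```python
import random


def mean_difference(row, labels):
    """Illustrative statistic: mean of group 1 minus mean of group 0."""
    g1 = [x for x, lab in zip(row, labels) if lab == 1]
    g0 = [x for x, lab in zip(row, labels) if lab == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)


def permutation_p_values(data, labels, n_perm=1000, seed=0):
    """Per-gene permutation p-values: share of permutations with |t_jb| >= |t_j|."""
    rng = random.Random(seed)
    observed = [abs(mean_difference(row, labels)) for row in data]
    counts = [0] * len(data)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)  # one shuffle is shared by all genes
        for j, row in enumerate(data):
            if abs(mean_difference(row, perm)) >= observed[j]:
                counts[j] += 1
    return [c / n_perm for c in counts]


# Rows are genes, columns are samples; the first three samples form group 1.
data = [[10, 11, 12, 1, 2, 3], [1, 2, 3, 1, 2, 3]]
labels = [1, 1, 1, 0, 0, 0]
print(permutation_p_values(data, labels))
```

Applying the same shuffle to every gene keeps the dependence between genes intact, which is what allows the joint null distribution to be estimated.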
3-5) Resampling Method
• Efron et al. (2000) and Tusher et al. (2001)
• Compute a test statistic $t_j$ for each gene $j$ and define the order
statistics $t_{(j)}$ such that $t_{(1)} \geq t_{(2)} \geq \dots \geq t_{(m)}$
• For each permutation $b$, $b = 1, \dots, B$, compute the test statistics
and define the order statistics $t_{(1),b} \geq t_{(2),b} \geq \dots \geq t_{(m),b}$
• From the permutations, estimate the expected value (under
the complete null) of the order statistics by $t^*_{(j)} = \sum_{b=1}^{B} t_{(j),b} / B$
• Form a Q-Q plot of the observed $t_{(j)}$ vs. the expected $t^*_{(j)}$
• Efron et al. - for a fixed threshold $\Delta$, genes with $|t_{(j)} - t^*_{(j)}| \geq \Delta$ are called differentially expressed
• Tusher et al. - for a fixed threshold $\Delta$, let $j^* = \max\{ j : t_{(j)} - t^*_{(j)} \geq \Delta,\ t^*_{(j)} > 0 \}$
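The expected order statistics can be sketched as follows: sort the statistics within each permutation, then average position by position. Only the Efron et al. style cutoff is illustrated. The difference-in-means statistic, the function names and the toy data are assumptions of this sketch.

```python
import random


def mean_difference(row, labels):
    """Illustrative statistic: mean of group 1 minus mean of group 0."""
    g1 = [x for x, lab in zip(row, labels) if lab == 1]
    g0 = [x for x, lab in zip(row, labels) if lab == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)


def expected_order_statistics(data, labels, n_perm=200, seed=0):
    """Return (observed, expected) order statistics, both sorted in decreasing order."""
    rng = random.Random(seed)
    m = len(data)
    observed = sorted((mean_difference(r, labels) for r in data), reverse=True)
    sums = [0.0] * m
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        ordered = sorted((mean_difference(r, perm) for r in data), reverse=True)
        for j in range(m):
            sums[j] += ordered[j]  # accumulate t_(j),b over permutations
    expected = [s / n_perm for s in sums]
    return observed, expected


def threshold_calls(observed, expected, delta):
    """Positions j whose observed and expected order statistics differ by >= delta."""
    return [j for j in range(len(observed))
            if abs(observed[j] - expected[j]) >= delta]


data = [[10, 11, 12, 1, 2, 3], [1, 2, 3, 1, 2, 3], [3, 1, 2, 2, 3, 1]]
labels = [1, 1, 1, 0, 0, 0]
obs, exp = expected_order_statistics(data, labels)
print(threshold_calls(obs, exp, delta=5.0))
```

The returned positions refer to ranks in the ordered list, so mapping them back to gene identifiers would require keeping the sort indices as well.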