Test of significance for small samples Javier Cabrera Director, Biostatistics Institute Rutgers University Dhammika Amaratunga, Johnson & Johnson Pharmaceutical Research & Development.

Download Report

Transcript Test of significance for small samples Javier Cabrera Director, Biostatistics Institute Rutgers University Dhammika Amaratunga, Johnson & Johnson Pharmaceutical Research & Development.

Test of significance for small samples
Javier Cabrera
Director, Biostatistics Institute Rutgers University
Dhammika Amaratunga,
Johnson & Johnson Pharmaceutical Research & Development
1
Outline
•
•
•
•
•
Microarray Experiments and Differential expression
Small sample size issues
Conditional t approach
Comparison with other methods
Extensions
Reference: Exploration and Analysis of DNA
Microarray and Protein Array Data. Wiley.2004.
Amaratunga, Cabrera.
Software: DNAMR and DNAMRweb
http://www.rci.rutgers.edu/~cabrera/DNAMR
2
Genes: A gene is a segment of DNA whose sequence of
bases (nucleotides) codes for a specific protein.
AKAP6: CATCATGCAGCAGGTCAAACAAGG
CATCTCCTAGTATTGCATCCTACA……
The central dogma of molecular biology
A gene is expressed via the process:
DNA 
mRNA
transcription

protein
translation
replication
3
Microarray experiment
cDNA or
oligonucleotide
preparation
Glass slide
Biological sample
mRNA
Reverse
transcribe
and label
Print or
synthesize
Microarray
5k-50k
genes
arrayed in
rectangular
grid; one
spot per
gene
+
Sample
Hybridize, wash and scan
Image
Quantify spot intensities
Gene expression data
4
Differential gene expression
An organism’s genome is the complete
set of genes in each of its cells. Given
an organism, every one of its cells has
a copy of the exact same genome, but
 different cells express different genes
 different genes express under different
conditions
 differential gene expression leads to
altered cell states
5
Differential Expression for small samples
C1
C2
C3
T1
T2
T3
G1 4.67 4.44 4.42 4.73 4.85 4.69
G2 3.13 2.54 1.96 0.97 2.38 3.36
G3 6.22 6.77 5.32 6.40 6.94 6.87
G4 10.74 10.81 10.69 10.75 10.68 10.68
G5 3.76 4.16 5.27 3.05 3.20 2.85
G6 6.95 6.78 6.33 6.81 6.95 7.01
G7 4.98 4.61 4.56 4.57 4.90 4.44
G8 2.72 3.30 3.24 3.22 3.42 3.22
G9 5.29 4.79 5.13 3.31 4.67 5.27
G10 5.12 4.85 3.79 4.13 3.12 4.79
G11 4.67 3.50 4.77 4.09 3.86 2.88
G12 6.22 6.42 5.02 6.38 6.54 6.80
G13 2.88 3.76 2.78 2.98 4.81 4.15
.......
1. Preprocessed data.
2. Perform a t-test for each gene.
3. Select the most significant subset.
6
The pooled variances T-test
The t test statistic for testing for a mean effect is:
1/ 2
Tg  ( X g 2  X g 1 ) /( s g (1/ n1  1/ n2 ) )
where sg, the pooled standard error, is the positive
square root of:
s g2  (( n1  1) s12g  ( n2  1) s22g ) /( n1  n2  2)
If there is no mean effect,
Tg ~ t( n1  n2  2)
(Student / Fisher)
7
Plot t vs sp
Distribution of sp
300 21983
Differentially expressed
genes have smaller sp.
Random Data
Is this effect Statistical or Biological?
8
500 Simulation: 1000 Genes
4 Controls + 4 Treats iid Normal(0, 2)
100 genes are differentially express with mean diff = +1 or -1
2=1 CONSTANT, a0.05
False Discoveries
True Discoveries
T-test
44
22
z-test
43
29
2 from Chi-square(df=3), a0.05
False Discoveries
True Discoveries
T-test
43
28
z-test
53
13
9
The effect of small sample size
Often the sample size per group is small.
 unreliable variances (inferences)
 dependence between the test statistics (tg) and
the standard error estimates (sg)
 borrow strength across genes (LPE/EB)
 regularize the test statistics (SAM)
 work with tg|sg (Conditional t).
10
Analysis results
Top 10 genes (sorted by t-test p-value)
Gene
G6546
G19945
G21586
G18970
G7432
G19057
G17361
G8525
G425
G8524
Fold Dir
2.36
D
3.25
U
1.64
U
2.52
U
3.70
D
1.85
U
4.34
D
5.57
D
18.11
D
4.74
D
p
0.000004
0.000005
0.000008
0.000019
0.000033
0.000046
0.000067
0.000067
0.000078
0.000109
p(Bonf)
0.0964
0.1102
0.1765
0.4220
0.7248
1.0000
1.0000
1.0000
1.0000
1.000011
For each a
a
Tg ( s ) 
SAM: Determining c
rg
a
sg  s
v1 (a) =mad{ Tg}
Tg
cv(a1) s1
cv(a)
v2(a) v3(a) v4(a) v5(a) v6(a) v7(a)
cv(a2) s2
cv(a3) s3
cv(a4) s4
Min
ĉ
cv(a5) s5
cv(a6) s6
cv(a7) s7
sg
12
SAM: Gene selection
1.0
Delta Fal.Pos.50% Fal.Pos.90% Called FDR50% FDR90%
1847
1951
3514
0.526
0.555
0.2
1351
1515
2996
0.451
0.506
0.3
949
1130
2550
0.372
0.443
0.4
645
812
2182
0.296
0.372
0.5
400
538
1787
0.224
0.301
0.6
249
366
1567
0.159
0.233
0.7
143
228
1306
0.109
0.175
0.8
76
135
1112
0.068
0.121
0.9
39
80
931
0.042
0.086
1.0
20
45
746
0.027
0.061
1.1
10
28
628
0.017
0.045
1.2
6
16
537
0.011
0.030
1.3
4
8
446
0.008
0.019
389
0.80.005 1.0
0.014
0.004
0.010
0.6
P(|SAM|> t)
0
0.0
0.2
-5
0.4
5
D
d
0.1
0.8
Tg (cˆ)
-3
-2
-1
0
db
1
2
T( g ) (cˆ)
T( g ) (cˆ) = Expected value of Tg (cˆ)
under permutations
3
1.4
0.0
0.2
2
0.4
5
0.6
1.5
1
3 Sd 311
Pooled
1.6
1
2
269
0.002
0.009
1.7
0
2
226
0.000
0.008
1.8
0
1
186
0.000
0.003
1.9
0
1
154
0.000
0.004
2.0
0
1
139
0.000
0.004
13
Conditional t: Basic Model
 Let Xgij denote the preprocessed intensity
measurement for gene g in array i of
group j.
 Model: Xgij = mgj + g egij
 Effect of interest: tg= mg2 - mg1
 Error model: egij ~ F(location=0, scale=1)
 Gene mean-variance model:(mg1,g2) ~ Fm,
with marginals: mg1 ~ Fm and g2 ~ F 14
Possible approaches
Parametric: Assume functional forms for F and
Fm, and apply either a Bayes or Empirical Bayes
procedure.
Nonparametric:
2
F̂
X
Estimate
F
:
edf
,
,
of
{(
,
s
1.
or Fˆ (t ) is n
m,
g )}
m ,
g1
Estimate F : edf , F̂ , of { ( X gij  X gj ) / sg }
For small samples Fˆ (t ) is
is not
not aagood
goodestimator
estimatorofofFF (t )
Use method of moments = Target estimation
2. Proceed via resampling and estimate the distribution:
t |sp (Conditional t).
15
Procedure
(1) Draw a gene, g, at random from {1, …, G}.
2
Call it g*. ( X g *1 , sg* ) ~ F̂m , .
(2) Take a random sample (with replacement)
*
of size n1+n2 from F̂ : rij ~ Fˆ
(3) Combine these to form pseudo-data:
*
*
X ij  X g *1  sg * rij
(4) Calculate the pooled standard error s* and
*
t test statistic t* for the pseudo-data {Xij }.
16
Procedure (cont.)
(5) Repeat steps (1)-(4) a large number (10,000)
of times.
(6) Given a, estimate the “critical envelope”,
ta(sg), as the (a/2) and (1-a/2) quantile curves
in the tg vs sg relationship.
(7) Genes that fall outside the critical envelope
defined by ta(sg) are deemed significant at level a.
(Overall unconditional Type I error rate = a)
17
Roadblock
Fˆ (t ) is not a good estimator of F (t )
Let {Xij} be a sample from the model with 2 ~ F
and let the variance obtained from the {Xij} be s2
Then Var(s2) > Var(2)
For example, if we assume that F = c32, n=4 and
e ~ N(0,1), then Var(2)=6 and Var(s2)=15.
Fix by target estimation: Method of moments.
not a good
estimator of F (t )
Shrink Fˆ (t ) is
towards
the center
18
Example: Checking for the distribution of g
2
, 2.  2 ~ c 22 , 3.  2 ~ c 62
Compare the distr. of sg vs simulation with: 1.  2 ~ c 0.5
1. Df=0.5
2. Df=2
Case 1
1.0
1.0
0.8
0.6
0.4
0.2
0.2
500
0.4
E7
0.6
E7
1500
1000
Frequency
600
400
0.0
0
0
0.0
200
Frequency
0.8
1
1. Case
Df=0.5
800 1000
E7 Data
Mice
Data
Case 2
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.5
1.0
Sp
2.0
2.5
0.5
1.0
3.0
3.0
0
1
2
3
4
S2
1.0
E7
600
0
1
2
Sp
3
4
0
1
2
3
Sp
4
5
6
0.0
0
0.2
200
0.4
400
Frequency
2.5
0.8
800 1000
500
400
300
200
100
2.0
3.Case
Df=6
3
Case 3
3. Df=6
0
1.5
S1
Sp
2
2.Case
Df=2
Frequency
1.5
0.6
0.0
1
2
3
4
5
S3
19
Another Example
2
, 2.  2 ~ c 32 , 3.  2 ~ c 62
Compare the distr. of sg vs simulation with: 1.  2 ~ c 0.5
Case
Df=6
Df=63
Tox
0.2 0.3
0.0
0.6
0.8
S1
1.0
1.2
1.4
1
2
S2
3
4
mean diff vs Sp
Mean diff
-1 0 1
0.4
Tox
0.2 0.3
500
0
Df=6 3
Case
0.1
Frequency
1500
Frequency
600
1000
2500
Case
2
Df=3
Df=3
0.4
3
Sp
0
0.0
0 200
0.2
0.2 0.4 0.6 0.8 1.0 1.2 1.4
Sp
2
0.5
-2
0.4
-3
0.3
0.5
0.2
0.1
Tox
0.2 0.3
0.1
0.0
0
0.1
Df=3
Case
2
0.4
0.5
0.4
600
Frequency
200
400
Frequency
600
1000
0 200
0.0
Df=0.5
Case
1
0.5
Case 1
Df=0.5
Tox Data
0
1
2
Sp
3
4
0
0
100
200
Sp
300
400
100
200
S3
300
400
0.0
0.1
0.2
0.3
0.4
Sp
20
0.5
Fixing the variance distribution
The idea is to estimate the function h:[0:1][0,1] defined by
h(F(x)) = F̂ (x). Since h is strictly monotonic, it can be inverted
in order to obtain an estimate of F(x).
Procedure:
(1) Assume that F̂ (x) is the true distribution of  and draw a
random sample, s*2, from F̂ .
(2) Take a random sample (with replacement) of size N from F̂ :
rij* ~ Fˆ for i=1,…, n , j=1,2.
j
*
* *
X

s
rij
(3) Combine these to form pseudo-data: ij
21
Fixing the variance distribution (contd)
(4) Calculate the pooled standard error s** for the pseudo-data
{Xij*}.
(5) Repeat steps (B1)-(B4) a large number (say 100,000) of times
and record, for each iteration, the pair of values {(s*2, s**2)}.
(6) Let F̂ * (x) be the empirical distribution of the s**2g’s. Then the
estimator of h is obtained by mapping the empirical distribution
F̂ into F̂ * . More precisely
hˆ( y  Fˆ ( x )) Fˆ * ( Fˆ 1 ( y )) and hˆ 1 ( y ) Fˆ ( Fˆ *1 ( y )) .
Hence the bias-corrected estimator of F is:
F ( x ) Fˆ ( Fˆ *1 ( Fˆ ( x ))) .
Proceed as before …
22
Plot t vs sp
Differentially expressed genes may have large sp
191 22092
23
500 Simulation: 1000 Genes
4 Controls + 4 Treats iid Normal(0, 2)
100 genes are differentially express with mean diff = +1 or -1
2=1 CONSTANT
False Discoveries
True Discoveries
T-test
44
22
z-test
43
29
C-t
45
30
2 from Chi-square(df=3)
False Discoveries
True Discoveries
T-test
43
28
z-test
53
13
C-t
42
38
24
Using 8 iid samples from Khan Data, we make changes to 50 genes
to make them differentially expressed for high level.
T-test
SAM
Ct
25
Generating p-values
To generate p-values, recall that the Ct procedure generates curves,
ca(s).
Start with a set of curves, ca1 ( s g )   ca k ( s g ) , for a set of
prespecified values, a1   a k .
Now consider the relationship between vi=log(-log(ai)) and
ui=log (ca i ( s g ))
To assign an approximate p-value to the gth gene, if |tg |  ca k ( sg ) ,
interpolate the relationship between the {ui} and the {vi}.
26
Extensions
 F test:
- Condition on the sqrt(MSE)
 Multiple comparisons:
- Tukey, Dunnett, Bump.
- Condition on the sqrt(MSE)
 Gene Ontology.
- Test for the significance of groups.
- Use Hypergeometric Statistic,
mean t, mean p-value, or other.
- Condition on log of the number of genes
per group
27
20
10
0
Sqrt(F)
30
Conditional F
0.2
0.4
0.6
0.8
Sqrt(MSE)
1.0
1.2
28
1.5
2.0
GO Ontology: Conditioning on log(n)
0.0
0.5
1.0
|T|
Abs(T)
0
2
4
Sd
Log(n)
6
29
Reference
The Details:
Exploration and Analysis of DNA Microarray
and Protein Array Data. Wiley . Jan 2004.
Amaratunga, Cabrera.
Email
[email protected]
[email protected]
Webpage for DNAMR and DNAMRweb
http://www.rci.rutgers.edu/~cabrera/DNAMR
30
Target Estimation
Target Estimation:
Cabrera, Fernholz (1999)
- Bias Reduction.
- MSE reduction.
Recent Applications:
- Ellipse Estimation (Multivariate Target).
- Logistic Regression:
• Cabrera, Fernholz, Devas (2003)
• Patel (2003) Target Conditional MLE (TCMLE)
Implementation in StatXact (CYTEL) and
logXact Proc’s in SAS(by CYTEL).
31
Target Estimation
E (T)
g( )
T(x1,x2,…,xn)
E (T) = 
32
Target Estimation:
Suppose we have an estimator ˆ  T ( x1 ,..., xn ) of a paramter 
Target estimator  : Solve E (T ) ˆ
h( )  E (T ) then   h 1 (ˆ)
Algorithms:
- Stochastic approximation.
- Simulation and iteration.
- Exact algorithm for TCMLE
33