Transcript Document

Supervised Classification
Selection bias in gene extraction on the
basis of microarray gene-expression data
Ambroise and McLachlan
Proceedings of the National Academy of Sciences
Vol. 99, Issue 10, 6562-6566, May 14, 2002
http://www.pnas.org/cgi/content/full/99/10/6562
Supervised Classification of Tissue Samples
We OBSERVE the CLASS LABELS z1, …, zn where
zj = i if jth tissue sample comes from the ith class
(i=1,…,g).
AIM: TO CONSTRUCT A CLASSIFIER c(y) FOR
PREDICTING THE UNKNOWN CLASS LABEL z
OF A TISSUE SAMPLE y.
e.g.
g = 2 classes
C1 - DISEASE-FREE
C2 - METASTASES
[Figure: gene-expression data matrix of N genes by M tissue samples — each sample (column) has an expression signature over genes 1, …, N, and each gene (row) has an expression profile over samples 1, …, M.]
Supervised Classification (Two Classes)
[Figure: training data matrix of p genes by n tissue samples, with each sample labelled as Class 1 (good prognosis) or Class 2 (poor prognosis).]
Microarray to be used as routine
clinical screen
by C. M. Schubert
Nature Medicine
9, 9, 2003.
The Netherlands Cancer Institute in Amsterdam is to become the first institution
in the world to use microarray techniques for the routine prognostic screening of
cancer patients. Aiming for a June 2003 start date, the center will use a panoply
of 70 genes to assess the tumor profile of breast cancer patients and to
determine which women will receive adjuvant treatment after surgery.
Selection Bias
Bias that occurs when a subset of the variables is selected (dimension
reduction) in some “optimal” way, and the predictive capability of this
subset is then assessed in the usual way, i.e. with an ordinary
error-rate measure for a fixed set of variables.
Selection Bias
Discriminant Analysis:
McLachlan (1992 & 2004, Wiley, Chapter 12)
Regression:
Breiman (1992, JASA)
“This usage (i.e. use of residual SS’s, etc.) has
long been a quiet scandal in the statistical
community.”
Nature Reviews Cancer, Feb. 2005
LINEAR CLASSIFIER

FORM
$c(y) = \beta_0 + \beta^T y = \beta_0 + \beta_1 y_1 + \cdots + \beta_p y_p$
for the prediction of the group label z of a future entity with feature vector y.

FISHER'S LINEAR DISCRIMINANT FUNCTION
$z = \mathrm{sign}\, c(y)$,
where
$\beta = S^{-1}(\bar{y}_1 - \bar{y}_2)$,
$\beta_0 = -\tfrac{1}{2}(\bar{y}_1 + \bar{y}_2)^T S^{-1}(\bar{y}_1 - \bar{y}_2)$,
and $\bar{y}_1$, $\bar{y}_2$, and $S$ are the sample means and pooled sample
covariance matrix found from the training data.
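To make the formulas above concrete, here is a minimal NumPy sketch of Fisher's rule exactly as defined on this slide; the function name is mine, and it assumes the genes have already been reduced to a small subset so that S is invertible (with p >> n it is not).

```python
import numpy as np

def fisher_ldf(Y1, Y2):
    """Fisher's linear discriminant from two training samples.

    Y1, Y2: arrays of shape (n1, p) and (n2, p) holding the feature vectors of
    classes 1 and 2.  Returns (beta0, beta) so that c(y) = beta0 + beta @ y and
    z = sign(c(y)) allocates y to class 1 (+1) or class 2 (-1).
    Assumes p < n1 + n2 - 2 so that the pooled covariance matrix S is invertible.
    """
    n1, n2 = len(Y1), len(Y2)
    ybar1, ybar2 = Y1.mean(axis=0), Y2.mean(axis=0)
    S = ((n1 - 1) * np.cov(Y1, rowvar=False) +
         (n2 - 1) * np.cov(Y2, rowvar=False)) / (n1 + n2 - 2)   # pooled covariance
    beta = np.linalg.solve(S, ybar1 - ybar2)                    # S^{-1}(ybar1 - ybar2)
    beta0 = -0.5 * (ybar1 + ybar2) @ beta                       # -(1/2)(ybar1+ybar2)^T beta
    return beta0, beta

# e.g.  beta0, beta = fisher_ldf(Y_class1, Y_class2); z_hat = np.sign(beta0 + beta @ y_new)
```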
Microarrays also to be used in the
prediction of breast cancer by Mike West
(Duke University) and the Koo
Foundation Sun Yat-Sen Cancer Centre,
Taipei
Huang et al. (2003, The Lancet, Gene
expression predictors of breast cancer).
SUPPORT VECTOR CLASSIFIER
Vapnik (1995)

$C(y) = \beta_0 + \beta_1 y_1 + \cdots + \beta_p y_p$,
where $\beta_0$ and $\beta$ are obtained as follows:
$\min_{\beta,\,\beta_0}\ \tfrac{1}{2}\|\beta\|^2 + \gamma \sum_{j=1}^{n} \xi_j$
subject to
$\xi_j \ge 0, \quad z_j C(y_j) \ge 1 - \xi_j \quad (j = 1, \ldots, n)$,
where $\xi_1, \ldots, \xi_n$ relate to the slack variables
($\gamma = \infty$ gives the separable case).

$\hat{\beta} = \sum_{j=1}^{n} \hat{\alpha}_j z_j y_j$
with non-zero $\hat{\alpha}_j$ only for those observations j for which the
constraints are exactly met (the support vectors).

$C(y) = \sum_{j=1}^{n} \hat{\alpha}_j z_j\, y_j^T y + \hat{\beta}_0
      = \sum_{j=1}^{n} \hat{\alpha}_j z_j\, \langle y_j, y\rangle + \hat{\beta}_0$
Support Vector Machine (SVM)

REPLACE $y$ by $h(y)$:
$C(y) = \sum_{j=1}^{n} \hat{\alpha}_j z_j\, \langle h(y_j), h(y)\rangle + \hat{\beta}_0
      = \sum_{j=1}^{n} \hat{\alpha}_j z_j\, K(y_j, y) + \hat{\beta}_0$,
where the kernel function $K(y_j, y) = \langle h(y_j), h(y)\rangle$
is the inner product in the transformed feature space.
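The same check with a kernel, again a sketch using scikit-learn on simulated data: replacing the inner product by K(yj, y) (here an RBF kernel) reproduces the kernel machine's decision values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, z = make_classification(n_samples=40, n_features=10, random_state=0)
z = 2 * z - 1

clf = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X, z)

# Rebuild C(y) = sum_j alpha_j z_j K(y_j, y) + beta_0 from the fitted pieces,
# with K the RBF kernel evaluated between the support vectors and 5 test points:
K = rbf_kernel(clf.support_vectors_, X[:5], gamma=0.1)
C_y = clf.dual_coef_ @ K + clf.intercept_
print(np.allclose(C_y, clf.decision_function(X[:5])))   # True
```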
HASTIE et al. (2001, Chapter 12)

The Lagrange (primal) function is
$L_P = \tfrac{1}{2}\|\beta\|^2 + \gamma \sum_{j=1}^{n}\xi_j
      - \sum_{j=1}^{n}\alpha_j\bigl[z_j C(y_j) - (1 - \xi_j)\bigr]
      - \sum_{j=1}^{n}\mu_j \xi_j$,    (1)
which we minimize w.r.t. $\beta$, $\beta_0$, and $\xi_j$.
Setting the respective derivatives to zero, we get
$\beta = \sum_{j=1}^{n}\alpha_j z_j y_j$,    (2)
$0 = \sum_{j=1}^{n}\alpha_j z_j$,    (3)
$\alpha_j = \gamma - \mu_j \quad (j = 1, \ldots, n)$,    (4)
with $\alpha_j \ge 0$, $\mu_j \ge 0$, and $\xi_j \ge 0$ $(j = 1, \ldots, n)$.

By substituting (2) to (4) into (1), we obtain the Lagrangian dual function
$L_D = \sum_{j=1}^{n}\alpha_j
      - \tfrac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n}\alpha_j \alpha_k z_j z_k\, y_j^T y_k$.    (5)
We maximize (5) subject to $0 \le \alpha_j \le \gamma$ and $\sum_{j=1}^{n}\alpha_j z_j = 0$.

In addition to (2) to (4), the constraints include
$\alpha_j\bigl[z_j C(y_j) - (1 - \xi_j)\bigr] = 0$,    (6)
$\mu_j \xi_j = 0$,    (7)
$z_j C(y_j) - (1 - \xi_j) \ge 0$,    (8)
for $j = 1, \ldots, n$.
Together these equations (2) to (8) uniquely characterize the solution
to the primal and dual problem.
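For the record, the substitution of (2)–(4) into (1) that yields (5) can be written out as follows (a routine expansion, not taken from the slides): by (4) the slack terms cancel, and by (3) the β0 term vanishes.

```latex
% gamma*sum_j xi_j - sum_j alpha_j*xi_j - sum_j mu_j*xi_j = 0 by (4); then insert (2):
\begin{aligned}
L_D &= \tfrac{1}{2}\,\beta^{T}\beta
      - \sum_{j=1}^{n} \alpha_j \bigl[ z_j(\beta_0 + \beta^{T} y_j) - 1 \bigr] \\
    &= \tfrac{1}{2}\sum_{j}\sum_{k} \alpha_j \alpha_k z_j z_k\, y_j^{T} y_k
      - \sum_{j}\sum_{k} \alpha_j \alpha_k z_j z_k\, y_j^{T} y_k
      - \beta_0 \underbrace{\sum_{j} \alpha_j z_j}_{=\,0 \ \text{by (3)}}
      + \sum_{j} \alpha_j \\
    &= \sum_{j=1}^{n} \alpha_j
      - \tfrac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n} \alpha_j \alpha_k z_j z_k\, y_j^{T} y_k ,
\end{aligned}
```

which is (5).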
Leo Breiman (2001)
Statistical modeling:
the two cultures (with discussion).
Statistical Science 16, 199-231.
Discussants include Brad Efron and David Cox
GUYON, WESTON, BARNHILL &
VAPNIK (2002, Machine Learning)
LEUKAEMIA DATA:
Only 2 genes are needed to obtain a zero
CVE (cross-validated error rate)
COLON DATA:
Using only 4 genes, CVE is 2%
Since p>>n, consideration given to
selection of suitable genes
SVM: FORWARD or BACKWARD (in terms of
magnitude of weight βi)
RECURSIVE FEATURE ELIMINATION (RFE)
FISHER: FORWARD ONLY (in terms of CVE)
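A minimal sketch of SVM-based recursive feature elimination, assuming scikit-learn's RFE wrapper around a linear SVC and a simulated 62 × 2,000 expression matrix (all names and sizes illustrative); genes with the smallest |βi| are discarded at each pass. Note that, per the point of this paper, the error rate of the surviving genes must be estimated with the elimination redone inside each cross-validation fold.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Toy expression matrix: n = 62 tissue samples by p = 2,000 genes, labels z = +1/-1
rng = np.random.default_rng(0)
X = rng.standard_normal((62, 2000))
z = np.repeat([1, -1], [40, 22])

# Recursive feature elimination: repeatedly fit a linear SVM and discard the
# genes with the smallest weights |beta_i| until 4 genes remain.
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=4, step=0.1)
selector.fit(X, z)
print(np.flatnonzero(selector.support_))   # indices of the 4 surviving genes
```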
GUYON et al. (2002)
“The success of the RFE indicates that RFE has a
built in regularization mechanism that we do not
understand yet that prevents overfitting the
training data in its selection of gene subsets.”
Example: Microarray Data
Colon Data of Alon et al. (1999)
n=62 (40 tumours; 22 normals)
tissue samples of
p=2,000 genes in a
2,000 × 62 matrix.
Figure 1: Error rates of the SVM rule with RFE procedure
averaged over 50 random splits of colon tissue samples
Figure 2: Error rates of the SVM rule with RFE procedure
averaged over 50 random splits of leukemia tissue samples
Figure 3: Error rates of Fisher’s rule with stepwise forward
selection procedure using all the colon data
Figure 4: Error rates of Fisher’s rule with stepwise forward
selection procedure using all the leukemia data
Figure 5: Error rates of the SVM rule averaged over 20 noninformative
samples generated by random permutations of the class labels of the
colon tumor tissues
ADDITIONAL REFERENCES
Selection bias ignored:
XIONG et al. (2001, Molecular Genetics and Metabolism)
XIONG et al. (2001, Genome Research)
ZHANG et al. (2001, PNAS)
Aware of selection bias:
SPANG et al. (2001, In Silico Biology)
WEST et al. (2001, PNAS)
NGUYEN and ROCKE (2002)
Error Rate Estimation
Suppose there are two groups G1 and G2
c(y) is a classifier formed from the
data set
(y1, y2, y3, …, yn)
The apparent error is the proportion of
the data set misallocated by c(y).
Cross-Validation
From the original data set, remove y1 to give the reduced set
(y2, y3, …, yn).
Then form the classifier c(1)(y) from this reduced set.
Use c(1)(y1) to allocate y1 to either G1 or G2.
Repeat this process for the second data point, y2, so that y2 is assigned to
either G1 or G2 on the basis of the classifier c(2)(y) formed without it.
And so on, up to yn.
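A sketch of the leave-one-out procedure just described, written with scikit-learn's LeaveOneOut splitter (the helper name is mine):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loocv_error(X, z, make_classifier):
    """Leave-one-out cross-validated error rate: form c^{(j)} without y_j,
    then use it to allocate y_j, for j = 1, ..., n."""
    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        c_j = make_classifier().fit(X[train_idx], z[train_idx])
        errors += int(c_j.predict(X[test_idx])[0] != z[test_idx][0])
    return errors / len(z)

# e.g.  loocv_error(X, z, lambda: SVC(kernel="linear"))
```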
Ten-Fold Cross Validation
[Figure: the n tissue samples are split into 10 roughly equal parts; each part serves in turn as the test set while the remaining nine parts form the training set.]
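This is where the selection bias defined earlier shows up in practice. The hedged sketch below (scikit-learn, simulated noninformative data in the spirit of Figure 5, with a simple ANOVA filter standing in as the gene-selection step) contrasts cross-validation performed after selecting genes once on all the samples with cross-validation in which the selection is redone within every training fold; the former is strongly optimistic, the latter is not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Noninformative data: the class labels are independent of the 2,000 "genes",
# so the true error rate of any rule is 50% (cf. Figure 5).
rng = np.random.default_rng(1)
X = rng.standard_normal((62, 2000))
z = rng.permutation(np.repeat([0, 1], [40, 22]))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# BIASED: pick the 10 "best" genes using ALL the samples, then cross-validate.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, z)
biased = 1 - np.mean([SVC(kernel="linear").fit(X_sel[tr], z[tr]).score(X_sel[te], z[te])
                      for tr, te in cv.split(X_sel, z)])

# CORRECTED: the gene selection is redone inside each training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=10), SVC(kernel="linear"))
corrected = 1 - np.mean([pipe.fit(X[tr], z[tr]).score(X[te], z[te])
                         for tr, te in cv.split(X, z)])

print(f"biased CV error ~ {biased:.2f}; corrected CV error ~ {corrected:.2f}")
# Typically the biased figure is well below 0.5, the corrected one close to 0.5.
```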
BOOTSTRAP APPROACH

Efron's (1983, JASA) .632 estimator:
$B.632 = 0.368 \times AE + 0.632 \times B1$,
where B1 is the bootstrap error when the rule $R_k^*$ is applied to a point not in
the training sample.

A Monte Carlo estimate of B1 is
$B1 = \sum_{j=1}^{n} E_j / n$,
where
$E_j = \sum_{k=1}^{K} I_{jk} Q_{jk} \Big/ \sum_{k=1}^{K} I_{jk}$,
with
$I_{jk} = 1$ if $x_j \notin$ kth bootstrap sample, and $0$ otherwise,
and
$Q_{jk} = 1$ if $R_k^*$ misallocates $x_j$, and $0$ otherwise.
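A Monte Carlo sketch of B1 and the .632 estimator as defined above (the function and variable names are mine; it assumes every bootstrap sample contains both classes, and points that happen to appear in every bootstrap sample contribute zero to B1):

```python
import numpy as np
from sklearn.svm import SVC

def bootstrap_632(X, z, make_classifier, K=200, seed=None):
    """Monte Carlo estimate of Efron's B1 and B.632 as defined above.

    AE is the apparent error of the rule trained on all of (X, z); B1 averages,
    over the points x_j, the error of rules R*_k trained on bootstrap samples
    that do NOT contain x_j.
    """
    rng = np.random.default_rng(seed)
    n = len(z)
    AE = 1 - make_classifier().fit(X, z).score(X, z)   # apparent error
    I = np.zeros((n, K))   # I_jk = 1 if x_j is NOT in the k-th bootstrap sample
    Q = np.zeros((n, K))   # Q_jk = 1 if R*_k misallocates x_j
    for k in range(K):
        idx = rng.integers(0, n, size=n)               # k-th bootstrap sample
        out = np.setdiff1d(np.arange(n), idx)          # points left out of it
        rule = make_classifier().fit(X[idx], z[idx])
        I[out, k] = 1
        Q[out, k] = (rule.predict(X[out]) != z[out]).astype(float)
    E = Q.sum(axis=1) / np.maximum(I.sum(axis=1), 1)   # E_j
    B1 = E.mean()
    return 0.368 * AE + 0.632 * B1, B1, AE

# e.g.  b632, B1, AE = bootstrap_632(X, z, lambda: SVC(kernel="linear"), K=100)
```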
Toussaint & Sharpe (1975) proposed the
ERROR RATE ESTIMATOR
A(w)  (1 - w)AE  wCV2E
where
w  0.5
McLachlan (1977) proposed w=wo where wo is
chosen to minimize asymptotic bias of A(w) in the
case of two homoscedastic normal groups.
Value of w0 was found to range between 0.6
and 0.7, depending on the values of p, , and n1 .
n2
.632+ estimate of Efron & Tibshirani (1997, JASA):
$B.632{+} = (1 - w)\,AE + w\,B1$,
where
$w = \dfrac{.632}{1 - .368\,r}$,
$r = \dfrac{B1 - AE}{\hat{\gamma} - AE}$  (relative overfitting rate),
$\hat{\gamma} = \sum_{i=1}^{g} p_i (1 - q_i)$  (estimate of the no-information error rate).
If r = 0, then w = .632, and so B.632+ = B.632;
if r = 1, then w = 1, and so B.632+ = B1.
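And the corresponding .632+ combination, as a small helper (again a sketch; the clamping of r to [0, 1] follows the usual convention):

```python
def b632_plus(AE, B1, p, q):
    """Efron & Tibshirani's .632+ estimate from the apparent error AE, the
    leave-out bootstrap error B1, the observed class proportions p_i, and the
    proportions q_i of observations assigned to class i by the rule."""
    gamma = sum(pi * (1 - qi) for pi, qi in zip(p, q))     # no-information error rate
    r = (B1 - AE) / (gamma - AE) if gamma > AE else 0.0    # relative overfitting rate
    r = min(max(r, 0.0), 1.0)                              # keep r in [0, 1]
    w = 0.632 / (1 - 0.368 * r)
    return (1 - w) * AE + w * B1

# e.g.  b632_plus(AE=0.05, B1=0.30, p=[0.65, 0.35], q=[0.60, 0.40])
```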
MARKER GENES FOR HARVARD DATA
For an SVM based on 64 genes, and using 10-fold CV, we noted the number of
times a gene was selected.

No. of genes   Times selected
     55              1
     18              2
     11              3
      7              4
      8              5
      6              6
     10              7
      8              8
     12              9
     17             10
tubulin, alpha, ubiquitous
Cluster Incl N90862
cyclin-dependent kinase inhibitor 2C
(p18, inhibits CDK4)
DEK oncogene (DNA binding)
Cluster Incl AF035316
transducin-like enhancer of split 2,
homolog of Drosophila E(sp1)
ADP-ribosyltransferase (NAD+; poly
(ADP-ribose) polymerase)
benzodiazapine receptor (peripheral)
Cluster Incl D21063
galactosidase, beta 1
high-mobility group (nonhistone
chromosomal) protein 2
cold inducible RNA-binding protein
Cluster Incl U79287
BAF53
tubulin, beta polypeptide
thromboxane A2 receptor
H1 histone family, member X
Fc fragment of IgG, receptor,
transporter, alpha
sine oculis homeobox
(Drosophila) homolog 3
transcriptional intermediary
factor 1 gamma
transcription elongation factor
A (SII)-like 1
like mouse brain protein E46
minichromosome maintenance
deficient (mis5, S. pombe) 6
transcription factor 12 (HTF4,
helix-loop-helix transcription
factors 4)
guanine nucleotide binding
protein (G protein), gamma 3,
linked
dihydropyrimidinase-like 2
Cluster Incl AI951946
transforming growth factor,
beta receptor II (70-80kD)
protein kinase C-like 1
Breast cancer data set in van’t Veer et al.
(van’t Veer et al., 2002, Gene Expression Profiling Predicts
Clinical Outcome Of Breast Cancer, Nature 415)
These data were the result of microarray experiments
on three patient groups with different classes of
breast cancer tumours.
The overall goal was to identify a set of genes that
could distinguish between the different tumour
groups based upon the gene expression information
for these groups.
Breast tumours have a genetic signature. The expression
pattern of a set of 70 genes can predict whether a tumour
is going to prove lethal, despite treatment, or not.
“This gene expression profile will outperform all currently
used clinical parameters in predicting disease outcome.”
van ’t Veer et al. (2002), van de Vijver et al. (2002)
Number of   Error Rate for Top 70 Genes   Error Rate for Top 70 Genes   Error Rate for 5422 Genes
Genes       (without correction for       (with correction for          (with correction for
            Selection Bias as Top 70)     Selection Bias as Top 70)     Selection Bias)

    1               0.50                          0.53                          0.56
    2               0.32                          0.41                          0.44
    4               0.26                          0.40                          0.41
    8               0.27                          0.32                          0.43
   16               0.28                          0.31                          0.35
   32               0.22                          0.35                          0.34
   64               0.20                          0.34                          0.35
   70               0.19                          0.33                           -
  128                -                             -                            0.39
  256                -                             -                            0.33
  512                -                             -                            0.34
 1024                -                             -                            0.33
 2048                -                             -                            0.37
 4096                -                             -                            0.40
 5422                -                             -                            0.44
van de Vijver et al. (2002) considered a further 234 breast
cancer tumours but have only made available the data for
the top 70 genes based on the previous study of van ‘t Veer
et al. (2002)
Number of   From 70 genes   From original 24481 genes   From original 24481 genes
Genes                       (set missing values to 0)   (using KNN for missing
                                                         values, k=10)

    1       0.29491525          0.4023327                   0.4199797
    2       0.17288136          0.3850913                   0.3825558
    4       0.20000000          0.3747465                   0.3756592
    8       0.13220339          0.3033469                   0.3061866
   16       0.10508475          0.2314402                   0.2319473
   32       0.08474576          0.2038540                   0.2240365
   64       0.09491525          0.2038540                   0.1915822
   70       0.09491525              -                           -
  128           -               0.1634888                   0.1600406
  256           -               0.1462475                   0.1507099
  512           -               0.1359026                   0.1438134
 1024           -               0.1324544                   0.1496957
 2048           -               0.1521298                   0.1364097
 4096           -               0.1481744                   0.1403651
 8192           -               0.1550710                   0.1605477
16384           -               0.1683570                   0.1738337
24481           -               0.1683570                   0.1772819
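The third column of the table above used k-nearest-neighbour imputation of the missing values with k = 10. A minimal sketch of that preprocessing step, assuming scikit-learn's KNNImputer and a small simulated stand-in for the expression matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Stand-in expression matrix (samples x genes) with ~5% of entries missing;
# the real matrix would hold the 24,481-gene expression data.
rng = np.random.default_rng(0)
X_raw = rng.standard_normal((62, 200))
X_raw[rng.random(X_raw.shape) < 0.05] = np.nan

# Column "using KNN for missing values, k=10": impute each missing entry from
# the 10 most similar samples (rows) with an observed value for that gene.
X_knn = KNNImputer(n_neighbors=10).fit_transform(X_raw)

# Column "set missing values to 0": simply zero out the missing entries.
X_zero = np.nan_to_num(X_raw, nan=0.0)
```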
[Figure: cross-validated error rate (0.0–0.5) plotted against log2(number of genes), comparing the biased estimates with the unbiased (kNN-imputed) estimates.]
Nearest-Shrunken Centroids
(Tibshirani et al., 2002)

The usual estimates $\bar{y}_i$ of the class means are shrunk toward the overall
mean $\bar{y}$ of the data, where
$\bar{y}_i = \sum_{j=1}^{n} z_{ij}\, y_j / n_i$
and
$\bar{y} = \sum_{j=1}^{n} y_j / n$.

The nearest-centroid rule assigns y to the class i that minimizes
$\sum_{v=1}^{p} (y_v - \bar{y}_{iv})^2$,
where $y_v$ is the vth element of the feature vector y and $\bar{y}_{iv} = (\bar{y}_i)_v$.
In the previous definition, we replace the sample mean $\bar{y}_{iv}$ of the vth
gene by its shrunken estimate, obtained by soft-thresholding the standardized
difference between $\bar{y}_{iv}$ and $\bar{y}_v$, where
$m_i = (n_i^{-1} + n^{-1})^{1/2}$.
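A quick way to experiment with this rule is scikit-learn's NearestCentroid, whose shrink_threshold argument implements a nearest-shrunken-centroid classifier; the data below are a simulated stand-in for the Alon et al. dimensions, and the Δ values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import NearestCentroid

# Simulated stand-in for the Alon et al. dimensions: 62 tissues x 2,000 genes
rng = np.random.default_rng(0)
X = rng.standard_normal((62, 2000))
z = np.repeat([0, 1], [40, 22])

# shrink_threshold plays the role of the shrinkage parameter Delta: larger values
# soft-threshold more per-gene centroid components back to the overall mean,
# so fewer genes influence the rule.
for delta in (0.0, 0.5, 1.0, 2.0):
    clf = NearestCentroid(shrink_threshold=delta or None)
    err = 1 - cross_val_score(clf, X, z, cv=10).mean()
    print(f"Delta = {delta:.1f}: 10-fold CV error = {err:.2f}")
```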
Comparison of Nearest-Shrunken Centroids with SVM
Apply (i) nearest-shrunken centroids and
(ii) the SVM with RFE
to colon data set of Alon et al. (1999), with
N = 2000 genes and M = 62 tissues
(40 tumours, 22 normals)
Nearest-Shrunken Centroids applied to Alon data
(a) Overall Error Rates
(b) Class-specific Error Rates
SVM with RFE applied to Alon data
(a) Overall Error Rates
(b) Class-specific Error Rates