Linear Discriminant Analysis
Two approaches – Fisher & Mahalanobis
For two-group discrimination: essentially equivalent to multiple regression
For multiple groups: essentially a special case of canonical correlation
JBR
LDA – Fisher’s Approach
Based on the idea of a discriminant score
A linear combination of the variables that produces maximally different scores across the groups
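A minimal sketch of the discriminant-score idea for two groups and two variables, assuming the usual two-group form w = S_w^{-1}(mean_a - mean_b), where S_w is the pooled within-group scatter. The data values and the function names are hypothetical and only illustrate the computation; this is not the code behind the slides.

```python
# Toy two-variable data for two groups (hypothetical values).
group_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
group_b = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0), (7.0, 6.0)]

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points, m):
    # Entries of the 2x2 scatter matrix: sum of (x - m)(x - m)'
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fisher_direction(a, b):
    ma, mb = mean(a), mean(b)
    axx, axy, ayy = scatter(a, ma)
    bxx, bxy, byy = scatter(b, mb)
    # Pooled within-group scatter S_w
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy
    det = sxx * syy - sxy * sxy
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # w = S_w^{-1} (mean_a - mean_b), using the closed-form 2x2 inverse
    return ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)

def score(w, p):
    # Discriminant score: linear combination of the variables
    return w[0] * p[0] + w[1] * p[1]

w = fisher_direction(group_a, group_b)
scores_a = [score(w, p) for p in group_a]
scores_b = [score(w, p) for p in group_b]
print("direction:", w)
print("group a scores:", scores_a)
print("group b scores:", scores_b)
```

On this toy data the two sets of scores do not overlap, which is what "maximally different scores across the groups" is after.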
LDA – Mahalanobis' Approach
For two groups: uses the idea of finding the locus of points equidistant from the group means
For more than two groups: find the distance to each group centroid and assign each point to the closest centroid
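The nearest-centroid rule can be sketched as follows, assuming two variables and a single covariance matrix shared by all groups. The centroids, the covariance matrix, and the function names are hypothetical illustration values.

```python
def mahalanobis_sq(x, centroid, cov):
    # D^2 = (x - m)' COV^{-1} (x - m) for a 2x2 covariance matrix
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - centroid[0], x[1] - centroid[1]
    # Closed-form 2x2 inverse applied to (dx, dy)
    ix = (d * dx - b * dy) / det
    iy = (-c * dx + a * dy) / det
    return dx * ix + dy * iy

def classify(x, centroids, cov):
    # Assign the point to the group whose centroid is closest
    return min(centroids, key=lambda g: mahalanobis_sq(x, centroids[g], cov))

centroids = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 4.0)}  # hypothetical
pooled_cov = ((1.0, 0.5), (0.5, 2.0))                            # hypothetical
label = classify((3.0, 1.0), centroids, pooled_cov)
print("assigned to group", label)
```

With only two groups, the points where the two distances are equal form the boundary between the groups, which is the "locus of points equidistant from the group means".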
LDA – Iris Data set
Using PROC DISCRIM from SAS
proc discrim data=iris_train out=iris_out_dis
             testdata=iris_test distance manova ncan=2;
  title 'Discriminant Analysis - IRIS data set';
  class species;
  var sepallen sepalwid petallen petalwid;
run;
Discriminant Analysis - IRIS data set
The DISCRIM Procedure
Classification Summary for Test Data: WORK.IRIS_TEST
Classification Summary using Linear Discriminant Function
Generalized Squared Distance Function
\[ D_j^2(X) = (X - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (X - \bar{X}_j) \]
Posterior Probability of Membership in Each species
\[ \Pr(j \mid X) = \frac{\exp(-0.5 \, D_j^2(X))}{\sum_k \exp(-0.5 \, D_k^2(X))} \]
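A small sketch of the posterior formula, taking a set of generalized squared distances as given. The distance values and the function name are hypothetical, and equal priors are assumed, as in the output shown here.

```python
import math

def posteriors(sq_distances):
    # Pr(j|X) = exp(-0.5 D_j^2) / sum_k exp(-0.5 D_k^2)
    # Subtracting the smallest distance first avoids underflow
    # and leaves the ratios unchanged.
    d_min = min(sq_distances.values())
    weights = {j: math.exp(-0.5 * (d - d_min)) for j, d in sq_distances.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

# Hypothetical squared distances from one observation to each group mean
d2 = {"SETOSA": 90.0, "VERSICOLOR": 2.0, "VIRGINICA": 6.0}
post = posteriors(d2)
for species, p in post.items():
    print(species, round(p, 4))
```

The group with the smallest squared distance always receives the largest posterior probability, so under equal priors the two rules agree on the assigned class.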
Hit rate = .9467
Error rate = .0533
With a different training set: hit rate = 1.0
Number of Observations and Percent Classified into species
(SAS output page 30, 07:58 Sunday, November 28, 2004)
From species      SETOSA  VERSICOLOR   VIRGINICA       Total
SETOSA                24           0           0          24
                  100.00        0.00        0.00      100.00
VERSICOLOR             0          23           2          25
                    0.00       92.00        8.00      100.00
VIRGINICA              0           2          24          26
                    0.00        7.69       92.31      100.00
Total                 24          25          26          75
                   32.00       33.33       34.67      100.00
Priors           0.33333     0.33333     0.33333
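The hit rate and error rate quoted for the iris test data follow directly from the classification counts. A short check, with the counts copied from the summary table (rows are the true species, columns the assigned species); the function name is hypothetical:

```python
# Rows: true species; columns: species assigned by the discriminant function.
labels = ["SETOSA", "VERSICOLOR", "VIRGINICA"]
counts = [
    [24, 0, 0],
    [0, 23, 2],
    [0, 2, 24],
]

def hit_rate(matrix):
    # Correct classifications sit on the diagonal
    hits = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return hits / total

rate = hit_rate(counts)
print(f"hit rate = {rate:.4f}, error rate = {1 - rate:.4f}")
# prints: hit rate = 0.9467, error rate = 0.0533
```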
LDA – Microarray Data
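library(MASS)                                           # provides lda()
train <- sample(1:7129, 100)                            # 100 of the 7129 genes, chosen at random
z <- lda(fmat.train[, train], fy)                       # fy: class labels of the training samples
z.predict.test <- predict(z, fmat.test[, train])$class  # same gene columns as the fit
table(fy2, z.predict.test)                              # fy2: true labels of the test samples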
Rows: true class (fy2); columns: predicted class (z.predict.test)

30 of first 60 genes
fy2   ALL AML
  ALL  16   4
  AML  10   4
Hit rate = .5882

30 of all 7129 genes
fy2   ALL AML
  ALL  12   8
  AML   8   6
Hit rate = .5294

First 60 genes
fy2   ALL AML
  ALL  15   5
  AML   6   8
Hit rate = .6765

100 of all 7129 genes
fy2   ALL AML
  ALL  17   3
  AML   5   9
Hit rate = .8235 (as reported; the counts shown give 26/34 = .7647)

30 of all 7129 genes
fy2   ALL AML
  ALL  14   6
  AML   3  11
Hit rate = .7353

First 3000 genes
fy2   ALL AML
  ALL  20   0
  AML   9   5
Hit rate = .7353
Compare LDA to SVM
(1st 3000 Genes)
Rows: predicted class; columns: true class (fy2)
SVM
pred   ALL AML
  ALL   20  13
  AML    0   1
LDA
z.predict.test   ALL AML
  ALL             20   9
  AML              0   5
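To make the comparison concrete, the overall and per-class hit rates can be read off the two tables. A sketch using the counts above, assuming rows are predicted classes and columns are true classes, and assuming the `pred` table is the SVM result (the `z.predict.test` table is the LDA result). The function name is hypothetical.

```python
# Rows: predicted class; columns: true class (fy2), both in the order ALL, AML.
svm = [[20, 13], [0, 1]]
lda = [[20, 9], [0, 5]]

def summarize(table):
    total = sum(sum(row) for row in table)
    overall = (table[0][0] + table[1][1]) / total
    # Column sums give the number of true cases in each class
    all_rate = table[0][0] / (table[0][0] + table[1][0])
    aml_rate = table[1][1] / (table[0][1] + table[1][1])
    return overall, all_rate, aml_rate

svm_overall, svm_all, svm_aml = summarize(svm)
lda_overall, lda_all, lda_aml = summarize(lda)
print(f"SVM: overall {svm_overall:.4f}, ALL {svm_all:.4f}, AML {svm_aml:.4f}")
print(f"LDA: overall {lda_overall:.4f}, ALL {lda_all:.4f}, AML {lda_aml:.4f}")
```

Both classifiers get every ALL case right; the difference is entirely in how many AML cases are recovered.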
LDA - Goodness of fit
Proportional Chance Criterion (PCC)
T-test where t = (observed hits - expected hits)/√(n*h*(1-h)) [h = hit rate associated with the PCC]
Expected # of hits = n(prob 1st group)^2 + n(1 - prob 1st group)^2
For the microarray example (n = 34 test samples: 20 ALL, 14 AML)
Expected # of hits = 17.52899 (.5156 hit rate)
T = 2.5637 (with 25 observed hits, i.e. hit rate .7353)
Gives us a P-value close to .0075
LDA looks to do a sufficient job
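The arithmetic behind these figures can be sketched as below. The group sizes (20 ALL, 14 AML) come from the microarray test tables, and 25 observed hits is the count that matches the .7353 hit rate; the variable names are hypothetical. The results should land close to the slide's values, with small differences from rounding.

```python
import math

n = 34                # test samples (20 ALL + 14 AML)
p_first = 20 / n      # proportion in the first group (ALL)
observed_hits = 25    # hits matching the reported hit rate of .7353

# Hit rate expected under the proportional chance criterion
h = p_first ** 2 + (1 - p_first) ** 2
expected_hits = n * h

# t = (observed hits - expected hits) / sqrt(n * h * (1 - h))
t = (observed_hits - expected_hits) / math.sqrt(n * h * (1 - h))
print(f"h = {h:.4f}, expected hits = {expected_hits:.4f}, t = {t:.4f}")
```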
LDA – Problems
R was nice enough to give me this warning when # of variables was over 36
Warning message:
variables are collinear in: lda.default(x, grouping, ...)
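One likely reason for this warning (an assumption; the slides do not say why it appears) is that the pooled within-group scatter matrix of n samples in g groups has rank at most n - g, so with more variables than that it is singular. If the training set had 38 samples in two groups, which is not stated in the slides, that limit would be 36. A toy sketch with made-up random data and hypothetical function names:

```python
import random

def within_scatter(groups):
    # Pooled within-group scatter: sum over groups of (x - mean)(x - mean)'
    p = len(groups[0][0])
    s = [[0.0] * p for _ in range(p)]
    for rows in groups:
        m = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        for r in rows:
            d = [r[j] - m[j] for j in range(p)]
            for i in range(p):
                for j in range(p):
                    s[i][j] += d[i] * d[j]
    return s

def rank(matrix, tol=1e-9):
    # Rank by Gaussian elimination with partial pivoting
    a = [row[:] for row in matrix]
    n_rows, n_cols = len(a), len(a[0])
    r = 0
    for c in range(n_cols):
        if r == n_rows:
            break
        pivot = max(range(r, n_rows), key=lambda i: abs(a[i][c]))
        if abs(a[pivot][c]) < tol:
            continue
        a[r], a[pivot] = a[pivot], a[r]
        for i in range(r + 1, n_rows):
            f = a[i][c] / a[r][c]
            for j in range(c, n_cols):
                a[i][j] -= f * a[r][j]
        r += 1
    return r

rng = random.Random(0)
n_vars, n_groups, per_group = 5, 2, 3   # 6 samples, 5 variables
groups = [[[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(per_group)]
          for _ in range(n_groups)]
s = within_scatter(groups)
r = rank(s)
limit = n_groups * per_group - n_groups  # n - g
print(f"variables = {n_vars}, rank of within-group scatter = {r}, n - g = {limit}")
```

Here the scatter matrix is 5 x 5 but its rank cannot exceed 4, so it cannot be inverted. That is the situation the collinearity warning points to.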