Support Vector Machines
Classification
Venables & Ripley Section 12.5
CSU Hayward Statistics 6601
Joseph Rickert & Timothy McKusick
December 1, 2004
Support Vector Machine

What is the SVM?
- The SVM is a generalization of the Optimal Hyperplane Algorithm (OHA).

Why is the SVM important?
- It allows the use of more similarity measures than the OHA.
- Through the use of kernel methods it works with non-vector data.
Simple Linear Classifier

X = R^p
f(x) = wᵀx + b
Each x ∈ X is classified into 2 classes labeled y ∈ {+1, -1}:
  y = +1 if f(x) ≥ 0 and
  y = -1 if f(x) < 0
S = {(x1,y1), (x2,y2), ...}
Given S, the problem is to learn f (find w and b).
For each f, check whether all (xi,yi) are correctly classified, i.e. yi f(xi) ≥ 0.
Choose f so that the number of errors is minimized.
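To make the decision rule concrete, here is a minimal R sketch (the training set, w and b below are illustrative values, not from the slides):

# linear decision function f(x) = w'x + b and the classifier it induces
f <- function(x, w, b) sum(w * x) + b
classify <- function(x, w, b) ifelse(f(x, w, b) >= 0, +1, -1)

# toy training set S = {(x1,y1), ...} in R^2 (illustrative values)
X <- rbind(c(1, 2), c(2, 3), c(-1, -1), c(-2, 0))
y <- c(+1, +1, -1, -1)
w <- c(1, 1); b <- -0.5

# a point is correctly classified when yi * f(xi) >= 0
sum(y * apply(X, 1, f, w = w, b = b) < 0)   # number of misclassified points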
But what if the training set is not linearly separable?

f(x) = wᵀx + b defines two half planes {x: f(x) ≥ 1} and {x: f(x) ≤ -1}.
Classify with the "hinge" loss function: c(f,x,y) = max(0, 1 - y f(x))
c(f,x,y) grows with the distance from the correct half plane.
If (x,y) is correctly classified with large confidence, then c(f,x,y) = 0.

  y f(x) ≥ 1:     correct with large confidence
  0 ≤ y f(x) < 1: correct with small confidence
  y f(x) < 0:     misclassified

[Figure: the hinge loss c(f,x,y) plotted against y f(x); the half planes wᵀx + b > 1 and wᵀx + b < -1 bound a margin of width 2/||w||.]
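A one-line R sketch of the hinge loss (illustrative, not from the slides):

# hinge loss: 0 when y*f(x) >= 1, growing linearly as y*f(x) falls below 1
hinge <- function(fx, y) pmax(0, 1 - y * fx)

hinge(fx = c(2.0, 0.4, -1.0), y = c(1, 1, 1))
# 0.0 0.6 2.0  -- large-confidence correct, small-confidence correct, misclassified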
SVMs combine the requirements of a large margin and few misclassifications by solving the problem:

New formulation:
  min ½||w||² + C Σi c(f,xi,yi)   w.r.t. w and b

- C is a parameter that controls the tradeoff between margin and misclassification.
- Large C: small margins, but more samples correctly classified with strong confidence.
- Technical difficulty: the hinge loss function c(f,xi,yi) is not differentiable.

Even better formulation: use slack variables ξi

  min ½||w||² + C Σi ξi   w.r.t. w, ξ and b
  under the constraint ξi ≥ c(f,xi,yi)   (*)

- But (*) is equivalent to
    ξi ≥ 0 and ξi - 1 + yi(wᵀxi + b) ≥ 0   for i = 1,...,n
- Solve this quadratic optimization problem with Lagrange multipliers.
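A minimal R sketch of the (non-differentiable) first objective, reusing the illustrative X, y, w, b from the earlier sketch; C is the tradeoff parameter:

# regularized hinge objective: 1/2 ||w||^2 + C * sum_i max(0, 1 - yi*(w'xi + b))
svm_objective <- function(w, b, X, y, C = 1) {
  margins <- y * (X %*% w + b)          # yi * f(xi) for every training point
  0.5 * sum(w^2) + C * sum(pmax(0, 1 - margins))
}

svm_objective(w = c(1, 1), b = -0.5, X = X, y = y, C = 1)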
Support Vectors

Lagrange multiplier formulation:
- Find α that maximizes
    W(α) = -½ Σi Σj yi yj αi αj xiᵀxj + Σi αi
  under the constraints: Σi αi yi = 0 and 0 ≤ αi ≤ C
- The points with positive Lagrange multipliers, αi > 0, are called support vectors.
- The set of support vectors contains all the information used by the SVM to learn a discrimination function.

[Figure: training points annotated with α = 0, 0 < α < C, and α = C.]
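In the e1071 package used later in these slides, the fitted object exposes the support vectors directly; a small sketch (fitting the same iris model that appears later):

# fit an SVM and inspect which training points became support vectors
library(e1071)
model <- svm(Species ~ ., data = iris)

length(model$index)   # number of support vectors
head(model$index)     # rows of the training data that are support vectors
head(model$coefs)     # the corresponding coefficients (alpha_i times the labels)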
Kernel Methods: data are not represented individually, but only through a set of pairwise comparisons

X: a set of objects (proteins)

S = (aatcgagtcac, atggacgtct, tgcactact)
Each object is represented by a sequence.

      | 1    0.5  0.3 |
K =   | 0.5  1    0.6 |
      | 0.3  0.6  1   |

Each number in the kernel matrix is a measure of the similarity or "distance" between two objects.
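For numeric data such a pairwise kernel matrix can be computed directly; a minimal sketch using the Gaussian (radial) kernel that e1071 uses later in these slides (the three iris rows and the gamma value are arbitrary choices for illustration):

# pairwise radial-kernel similarities K[i,j] = exp(-gamma * ||xi - xj||^2)
rbf_kernel_matrix <- function(X, gamma = 1) {
  D2 <- as.matrix(dist(X))^2      # squared Euclidean distances
  exp(-gamma * D2)
}

K <- rbf_kernel_matrix(as.matrix(iris[1:3, 1:4]), gamma = 0.25)
K   # 3 x 3 symmetric matrix with 1s on the diagonal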
Kernels

Properties of kernels
- Kernels are measures of similarity: K(x,x′) is large when x and x′ are similar.
- Kernels must be:
  - positive definite
  - symmetric
- For every kernel K there exist a Hilbert space F and a mapping Φ: X → F such that
    K(x,x′) = <Φ(x),Φ(x′)> for all x, x′ ∈ X
- Hence all kernels can be thought of as dot products in some feature space.

Advantages of kernels
- Data of very different nature can be analyzed in a unified framework.
- No matter what the objects are, n objects are always represented by an n x n matrix.
- It is often easier to compare objects than to represent them numerically.
- Complete modularity between the function used to represent the data and the algorithm used to analyze it.
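A quick R check of the two required properties on the kernel matrix K built above (symmetry, and positive semi-definiteness via the eigenvalues):

# a valid kernel matrix must be symmetric with no negative eigenvalues
isSymmetric(K)
all(eigen(K, symmetric = TRUE, only.values = TRUE)$values >= -1e-10)   # TRUE up to rounding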
The “Kernel Trick”

- Any algorithm for vector data that can be expressed in terms of dot products can be performed implicitly in the feature space associated with the kernel, by replacing each dot product with the kernel evaluation.
- e.g. For some feature space F let:
    d(x,x′) = ||Φ(x) - Φ(x′)||
- But
    ||Φ(x) - Φ(x′)||² = <Φ(x),Φ(x)> + <Φ(x′),Φ(x′)> - 2<Φ(x),Φ(x′)>
- So
    d(x,x′) = (K(x,x) + K(x′,x′) - 2K(x,x′))^(1/2)
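A short R sketch of the identity above, recovering the feature-space distance from kernel evaluations alone (using the radial-kernel matrix K from the earlier sketch, whose diagonal entries are 1):

# feature-space distance computed purely from the kernel matrix
kernel_distance <- function(K, i, j) sqrt(K[i, i] + K[j, j] - 2 * K[i, j])

kernel_distance(K, 1, 2)   # distance between objects 1 and 2 in the feature space F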
Nonlinear Separation

Nonlinear kernel:
- X is a vector space
- the kernel mapping Φ is nonlinear
- linear separation in the feature space F can be associated with nonlinear separation in X

[Figure: Φ maps the data from X into the feature space F, where a linear boundary separates the classes.]
SVM with Kernel

Final formulation:
- Find α that maximizes
    W(α) = -½ Σi Σj yi yj αi αj k(xi,xj) + Σi αi
  under the constraints: Σi αi yi = 0 and 0 ≤ αi ≤ C
- Find an index i with 0 < αi < C and set:
    b = yi - Σj yj αj k(xj,xi)
- The classification of a new object x ∈ X is then determined by the sign of the function
    f(x) = Σi yi αi k(xi,x) + b
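A small R sketch of the kernel decision function f; the support vectors, multipliers and offset below are illustrative values, not the result of a real fit:

# kernel SVM decision value f(x) = sum_i yi * alpha_i * k(xi, x) + b
rbf <- function(u, v, gamma = 0.25) exp(-gamma * sum((u - v)^2))

decision_value <- function(x, X_sv, y_sv, alpha, b, kernel = rbf) {
  sum(y_sv * alpha * apply(X_sv, 1, kernel, v = x)) + b
}

# illustrative "support vectors" and multipliers
X_sv  <- as.matrix(iris[c(1, 51), 1:4])
y_sv  <- c(+1, -1)
alpha <- c(0.7, 0.7)
b     <- 0.1
sign(decision_value(unlist(iris[2, 1:4]), X_sv, y_sv, alpha, b))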
iris data set (Anderson 1935)
150 cases, 50 each of 3 species of iris
Example from page 48 of The e1071 Package.

First 10 lines of iris:

> iris
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
SVM ANALYSIS OF IRIS DATA

# SVM ANALYSIS OF IRIS DATA SET
# classification mode
# default with factor response:
library(e1071)
model <- svm(Species ~ ., data = iris)
summary(model)

Call:
svm(formula = Species ~ ., data = iris)

Parameters:
   SVM-Type:  C-classification   ("C" is the parameter in the Lagrange formulation)
 SVM-Kernel:  radial             (radial kernel: exp(-gamma * |u - v|^2))
       cost:  1
      gamma:  0.25

Number of Support Vectors:  51
 ( 8 22 21 )

Number of Classes:  3

Levels:
 setosa versicolor virginica
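The reported cost and gamma are e1071 defaults: cost (the C parameter) defaults to 1, and gamma defaults to 1/(data dimension), which gives 0.25 for the four iris predictors (and, later, 1/3000 for the microarray fit):

# e1071's default gamma is 1 / (number of predictor columns)
1 / ncol(subset(iris, select = -Species))   # 0.25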
Exploring the SVM Model

# test with training data
x <- subset(iris, select = -Species)
y <- iris$Species
pred <- predict(model, x)

# Check accuracy:
table(pred, y)

# compute decision values:
pred <- predict(model, x, decision.values = TRUE)
attr(pred, "decision.values")[1:4,]

            y
pred         setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         2
  virginica       0          2        48

     setosa/versicolor setosa/virginica versicolor/virginica
[1,]          1.196000         1.091667            0.6706543
[2,]          1.064868         1.055877            0.8482041
[3,]          1.181229         1.074370            0.6438237
[4,]          1.111282         1.052820            0.6780645
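A one-line addition to turn that confusion table into a training-set accuracy:

# overall training accuracy: proportion of correctly predicted cases
mean(pred == y)   # 146 of 150 correct, per the table above (about 0.973)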
Visualize classes with MDS

cmdscale: multidimensional scaling, or principal coordinates analysis

# visualize (classes by color, SV by crosses):
plot(cmdscale(dist(iris[,-5])),
     col = as.integer(iris[,5]),
     pch = c("o","+")[1:150 %in% model$index + 1])

[Figure: MDS plot of the iris data, cmdscale(dist(iris[, -5]))[,1] vs [,2]; classes by color (black: setosa, red: versicolor, green: virginica); support vectors plotted as "+".]
iris split into training and test sets
(the first 25 cases of each species form the training set)
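The slides do not show how the split is constructed; a minimal sketch of one way to build it (assumed, not from the slides), using the names iris.train, iris.test, fS.TR and fS.TE that appear in the code below:

# first 25 rows of each species -> training set, remaining 25 of each -> test set
library(e1071)
train.idx <- unlist(lapply(split(1:150, iris$Species), head, 25))
iris.train <- data.frame(iris[train.idx, 1:4],  fS.TR = iris$Species[train.idx])
iris.test  <- data.frame(iris[-train.idx, 1:4], fS.TE = iris$Species[-train.idx])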
## SECOND SVM ANALYSIS OF IRIS DATA SET
## classification mode
# default with factor response
# Train with the iris.train data
model.2 <- svm(fS.TR ~ ., data = iris.train)

# output from summary
summary(model.2)

Call:
svm(formula = fS.TR ~ ., data = iris.train)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.25

Number of Support Vectors:  32
 ( 7 13 12 )

Number of Classes:  3

Levels:
 setosa versicolor virginica
iris test results

# test with the iris.test data
x.2 <- subset(iris.test, select = -fS.TE)
y.2 <- iris.test$fS.TE
pred.2 <- predict(model.2, x.2)

# Check accuracy:
table(pred.2, y.2)

# compute decision values and probabilities:
pred.2 <- predict(model.2, x.2, decision.values = TRUE)
attr(pred.2, "decision.values")[1:4,]

            y.2
pred.2       setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         25         0
  virginica       0          0        25

     setosa/versicolor setosa/virginica versicolor/virginica
[1,]          1.253378         1.086341            0.6065033
[2,]          1.000251         1.021445            0.8012664
[3,]          1.247326         1.104700            0.6068924
[4,]          1.164226         1.078913            0.6311566
iris training and test sets

[Figure: MDS plots of the training set (cmdscale(dist(iris.train[, -5]))[,1] vs [,2]) and the test set (cmdscale(dist(iris.test[, -5]))[,1] vs [,2]); classes by color, support vectors plotted as "+".]
Microarray Data
from Golub et al., "Molecular Classification of Cancer: Class Prediction by Gene Expression Monitoring", Science, Vol. 286, 10/15/1999

Expression levels of predictive genes:
- Rows: genes
- Columns: samples
- Expression levels (EL) of each gene are relative to the mean EL for that gene in the initial dataset
- Red if EL > mean
- Blue if EL < mean
- The scale indicates standard deviations above or below the mean
- Top panel: genes highly expressed in ALL
- Bottom panel: genes more highly expressed in AML
Microarray Data Transposed
rows = samples, columns = genes

Training Data
- 38 samples
- 38 x 7129 matrix
- ALL: 27
- AML: 11

Test Data
- 34 samples
- 34 x 7129 matrix
- ALL: 20
- AML: 14
An excerpt of the transposed data (first 15 rows, 10 columns):

       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]  -214 -153  -58   88 -295 -558  199 -176  252   206
 [2,]  -139  -73   -1  283 -264 -400 -330 -168  101    74
 [3,]   -76  -49 -307  309 -376 -650   33 -367  206  -215
 [4,]  -135 -114  265   12 -419 -585  158 -253   49    31
 [5,]  -106 -125  -76  168 -230 -284    4 -122   70   252
 [6,]  -138  -85  215   71 -272 -558   67 -186   87   193
 [7,]   -72 -144  238   55 -399 -551  131 -179  126   -20
 [8,]  -413 -260    7   -2 -541 -790 -275 -463   70  -169
 [9,]     5 -127  106  268 -210 -535    0 -174   24   506
[10,]   -88 -105   42  219 -178 -246  328 -148  177   183
[11,]  -165 -155  -71   82 -163 -430  100 -109   56   350
[12,]   -67  -93   84   25 -179 -323 -135 -127   -2   -66
[13,]   -92 -119  -31  173 -233 -227  -49  -62   13   230
[14,]  -113 -147 -118  243 -127 -398 -249 -228  -37   113
[15,]  -107  -72 -126  149 -205 -284 -166 -185    1   -23
SVM ANALYSIS OF MICROARRAY DATA
classification mode

# default with factor response
y <- c(rep(0,27), rep(1,11))
fy <- factor(y, levels = 0:1)
levels(fy) <- c("ALL","AML")

# compute svm on the first 3000 genes only,
# because of memory overflow problems
model.ma <- svm(fy ~ ., data = fmat.train[,1:3000])

Call:
svm(formula = fy ~ ., data = fmat.train[, 1:3000])

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.0003333333

Number of Support Vectors:  37
 ( 26 11 )

Number of Classes:  2

Levels:
 ALL AML
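The slides never show how fmat.train (and the fmat.test used below) are built; presumably they are the transposed expression matrices stored as data frames. One plausible construction (assumed; golub.train and golub.test are hypothetical genes-by-samples matrices read from the Golub data):

# hypothetical construction: transpose genes x samples so that rows are samples
fmat.train <- data.frame(t(golub.train))   # 38 x 7129
fmat.test  <- data.frame(t(golub.test))    # 34 x 7129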
Visualize Microarray Training Data with Multidimensional Scaling

# visualize Training Data
# (classes by color, SV by crosses)
# multidimensional scaling
pc <- cmdscale(dist(fmat.train[,1:3000]))

plot(pc,
     col = as.integer(fy),
     pch = c("o","+")[1:38 %in% model.ma$index + 1],
     main = "Training Data ALL 'Black' and AML 'Red' Classes")

[Figure: MDS plot of the 38 training samples, pc[,1] vs pc[,2], titled "Training Data ALL 'Black' and AML 'Red' Classes"; support vectors plotted as "+".]
Check Model with Training Data
Predict Outcomes of Test Data

# check the training data
x <- fmat.train[,1:3000]
pred.train <- predict(model.ma, x)

# check accuracy:
table(pred.train, fy)

# classify the test data
y2 <- c(rep(0,20), rep(1,14))
fy2 <- factor(y2, levels = 0:1)
levels(fy2) <- c("ALL","AML")
x2 <- fmat.test[,1:3000]
pred <- predict(model.ma, x2)

# check accuracy:
table(pred, fy2)

Training data correctly classified:

          fy
pred.train ALL AML
       ALL  27   0
       AML   0  11

Test data:

      fy2
pred   ALL AML
  ALL   20  13
  AML    0   1

The model is worthless so far.
Conclusion:

The SVM appears to be a powerful classifier applicable to many different kinds of data.

But:
- Kernel design is a full-time job.
- Selecting model parameters is far from obvious.
- The math is formidable.