Mathematical Programming in Support Vector Machines
Proximal Support Vector Machine Classifiers
KDD 2001
San Francisco August 26-29, 2001
Glenn Fung & Olvi Mangasarian
Data Mining Institute
University of Wisconsin - Madison
Key Contributions
Fast new support vector machine classifier
An order of magnitude faster than standard classifiers
Extremely simple to implement
4 lines of MATLAB code
NO optimization packages (LP,QP) needed
Outline of Talk
(Standard) Support vector machines (SVM)
Classify by halfspaces
Proximal support vector machines (PSVM)
Classify by proximity to planes
Linear PSVM classifier
Nonlinear PSVM classifier
Full and reduced kernels
Numerical results
Correctness comparable to standard SVM
Much faster classification!
2-million points in 10-space in 21 seconds
Compared to over 10 minutes for standard SVM
Support Vector Machines
Maximizing the Margin between Bounding Planes
[Figure: bounding planes x′w = γ + 1 and x′w = γ − 1, with the classes A+ and A− on either side; the margin between the planes is 2/‖w‖₂]
Proximal Support Vector Machines
Fitting the Data Using Two Parallel Bounding Planes
[Figure: the same two parallel planes x′w = γ + 1 and x′w = γ − 1, now fitted so that the A+ and A− points cluster around them; the distance between the planes is 2/‖w‖₂]
Standard Support Vector Machine
Algebra of 2-Category Linearly Separable Case
Given m points in n-dimensional space, represented by an m-by-n matrix A
Membership of each point A_i in class +1 or −1 is specified by an m-by-m diagonal matrix D with ±1 entries
Separate by two bounding planes, x′w = γ ± 1:
  A_i w ≥ γ + 1, for D_ii = +1,
  A_i w ≤ γ − 1, for D_ii = −1.
More succinctly:
  D(Aw − eγ) ≥ e,
where e is a vector of ones.
Standard Support Vector Machine
Formulation
Solve the quadratic program for some ν > 0:

  min_{y, w, γ}  (ν/2)‖y‖₂² + (1/2)‖(w, γ)‖₂²   (QP)
  s.t.  D(Aw − eγ) + y ≥ e,

where D_ii = ±1 denotes A+ or A− membership.
The margin is maximized by minimizing (1/2)‖(w, γ)‖₂²
PSVM Formulation
We have from the QP SVM formulation:

  min_{w, γ}  (ν/2)‖y‖₂² + (1/2)‖(w, γ)‖₂²   (QP)
  s.t.  D(Aw − eγ) + y = e

Solving for y in terms of w and γ gives:

  min_{w, γ}  (ν/2)‖e − D(Aw − eγ)‖₂² + (1/2)‖(w, γ)‖₂²

This simple but critical modification changes the nature of the optimization problem tremendously!
Advantages of New Formulation
Objective function remains strongly convex
An explicit exact solution can be written in terms
of the problem data
PSVM classifier is obtained by solving a single
system of linear equations in the usually small
dimensional input space
Exact leave-one-out-correctness can be obtained in
terms of problem data
Linear PSVM
We want to solve:

  min_{w, γ}  (ν/2)‖e − D(Aw − eγ)‖₂² + (1/2)‖(w, γ)‖₂²

Setting the gradient equal to zero gives a nonsingular system of linear equations.
Solution of the system gives the desired PSVM classifier.
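The gradient step can be written out explicitly (a sketch in the slide's notation, stacking z = (w, γ) and writing H = [A  −e]):

```latex
f(z) \;=\; \frac{\nu}{2}\,\lVert e - DHz \rVert_2^2 \;+\; \frac{1}{2}\,\lVert z \rVert_2^2,
\qquad
z = \begin{bmatrix} w \\ \gamma \end{bmatrix},
\quad
H = \begin{bmatrix} A & -e \end{bmatrix}

\nabla f(z) \;=\; -\nu\, H' D\,(e - DHz) + z \;=\; 0
\quad\Longrightarrow\quad
\Bigl(\frac{I}{\nu} + H'H\Bigr)\, z \;=\; H'De \qquad (\text{since } D^2 = I)
```

This recovers the explicit solution [w; γ] = (I/ν + H′H)⁻¹ H′De used by the algorithm.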
Linear PSVM Solution
  [w; γ] = (I/ν + H′H)⁻¹ H′De
Here, H = [A  −e]
The linear system to solve depends on H′H, which is of size (n + 1) × (n + 1)
n is usually much smaller than m
Linear Proximal SVM Algorithm
Input: A, D
Define: H = [A  −e]
Calculate: v = H′De
Solve: (I/ν + H′H)[w; γ] = v
Classifier: sign(w′x − γ)
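The steps above can be sketched in NumPy (a hypothetical translation of the slide's MATLAB; the function name and toy data are illustrative, not from the talk):

```python
import numpy as np

def linear_psvm(A, d, nu):
    """Linear PSVM: solve (I/nu + H'H)[w; gamma] = H'De.

    A  : m-by-n data matrix
    d  : length-m vector of +1/-1 labels (the diagonal of D)
    nu : positive regularization parameter
    """
    m, n = A.shape
    H = np.hstack([A, -np.ones((m, 1))])       # H = [A  -e]
    v = H.T @ d                                # v = H'De (note De = d)
    r = np.linalg.solve(np.eye(n + 1) / nu + H.T @ H, v)
    return r[:n], r[n]                         # w, gamma

# Classifier sign(w'x - gamma), applied to a toy separable problem
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(2.0, 0.3, (20, 2)), rng.normal(-2.0, 0.3, (20, 2))])
d = np.hstack([np.ones(20), -np.ones(20)])
w, gamma = linear_psvm(A, d, nu=1.0)
pred = np.sign(A @ w - gamma)
print((pred == d).mean())                      # fraction classified correctly
```

Note that only an (n + 1)-dimensional system is solved, matching the slide's claim that cost depends on the input dimension n rather than on m.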
Nonlinear PSVM Formulation
Linear PSVM (linear separating surface: x′w = γ):

  min_{w, γ}  (ν/2)‖y‖₂² + (1/2)‖(w, γ)‖₂²   (QP)
  s.t.  D(Aw − eγ) + y = e

By QP "duality", w = A′Du. Maximizing the margin in the "dual space" gives:

  min_{u, γ}  (ν/2)‖e − D(AA′Du − eγ)‖₂² + (1/2)‖(u, γ)‖₂²

Replace AA′ by a nonlinear kernel K(A, A′):

  min_{u, γ}  (ν/2)‖e − D(K(A, A′)Du − eγ)‖₂² + (1/2)‖(u, γ)‖₂²
The Nonlinear Classifier
The nonlinear classifier:
  K(x′, A′)Du = γ
  K(A, A′) : R^{m×n} × R^{n×m} → R^{m×m}
where K is a nonlinear kernel, e.g. the Gaussian (radial basis) kernel:
  K(A, A′)_ij = exp(−μ‖A_i − A_j‖₂²),  i, j = 1, …, m
The ij-entry of K(A, A′) represents the "similarity" of data points A_i and A_j
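For concreteness, this kernel matrix can be computed as follows (a NumPy sketch; `mu` plays the role of μ):

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """K(A, B')_ij = exp(-mu * ||A_i - B_j||_2^2) for rows A_i of A, B_j of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    return np.exp(-mu * sq)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
K = gaussian_kernel(A, A, mu=1.0)
print(K)   # diagonal entries are 1 (zero distance); off-diagonal exp(-1)
```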
Nonlinear PSVM
Similar to the linear case, setting the gradient equal to zero, we obtain:
  [u; γ] = (I/ν + H′H)⁻¹ H′De
with H defined slightly differently: H = [K(A, A′)  −e]
Here, the linear system to solve is of size (m + 1) × (m + 1)
However, reduced kernel techniques (RSVM) can be used to reduce dimensionality.
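The reduced-kernel idea can be sketched by evaluating the kernel against only a random row subset of A, so the system shrinks from (m + 1) × (m + 1) to (n̄ + 1) × (n̄ + 1) (an illustrative sketch of the RSVM idea, not the authors' code; the Gaussian kernel choice and names like `n_bar` are assumptions):

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

def reduced_psvm(A, d, nu, mu, n_bar, seed=0):
    """PSVM with a rectangular kernel K(A, Abar'), where Abar is a random
    subset of n_bar rows of A; the system is only (n_bar+1)-dimensional."""
    m = A.shape[0]
    idx = np.random.default_rng(seed).choice(m, size=n_bar, replace=False)
    Abar = A[idx]
    K = gaussian_kernel(A, Abar, mu)           # rectangular, m x n_bar
    H = np.hstack([K, -np.ones((m, 1))])
    v = H.T @ d                                # H'De
    r = np.linalg.solve(np.eye(n_bar + 1) / nu + H.T @ H, v)
    return r[:n_bar], r[n_bar], Abar           # u, gamma, kernel rows

# Toy usage: 20 points, kernel evaluated against only 5 of them
A = np.vstack([np.zeros((10, 2)), np.ones((10, 2))])
d = np.hstack([np.ones(10), -np.ones(10)])
u, gamma, Abar = reduced_psvm(A, d, nu=100.0, mu=1.0, n_bar=5)
pred = np.sign(gaussian_kernel(A, Abar, 1.0) @ u - gamma)
```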
Nonlinear Proximal SVM Algorithm
Input: A, D
Define: K = K(A, A′), H = [K  −e]
Calculate: v = H′De
Solve: (I/ν + H′H)[u; γ] = v
Classifier: sign(K(x′, A′)u − γ), where u = Du absorbs D from the dual substitution
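The algorithm above can be sketched end to end in NumPy (illustrative only: the Gaussian kernel, the XOR-style toy data, and the parameter values are my own choices, not from the talk):

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

def nonlinear_psvm(A, d, nu, mu):
    """Nonlinear PSVM: solve (I/nu + H'H)[u; gamma] = H'De, H = [K(A,A')  -e]."""
    m = A.shape[0]
    K = gaussian_kernel(A, A, mu)
    H = np.hstack([K, -np.ones((m, 1))])
    v = H.T @ d                                # H'De (De = d)
    r = np.linalg.solve(np.eye(m + 1) / nu + H.T @ H, v)
    return r[:m], r[m]                         # u, gamma

def classify(X, A_train, u, gamma, mu):
    """Classifier: sign(K(x', A')u - gamma)."""
    return np.sign(gaussian_kernel(X, A_train, mu) @ u - gamma)

# XOR-style toy data: not linearly separable, but the kernel classifier handles it
A = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
u, gamma = nonlinear_psvm(A, d, nu=100.0, mu=2.0)
pred = classify(A, A, u, gamma, mu=2.0)
```

The only change from the linear sketch is that H is built from the m × m kernel matrix instead of A, so the solve is (m + 1)-dimensional, as the slide notes.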
Linear & Nonlinear PSVM MATLAB Code
function [w, gamma] = psvm(A,d,nu)
% PSVM: linear and nonlinear classification
% INPUT: A, d = diag(D), nu.  OUTPUT: w, gamma
% [w, gamma] = psvm(A,d,nu);
[m,n] = size(A); e = ones(m,1); H = [A -e];
v = (d'*H)';                     % v = H'*D*e
r = (speye(n+1)/nu + H'*H)\v;    % solve (I/nu + H'*H) r = v
w = r(1:n); gamma = r(n+1);      % extract w, gamma from r
Linear PSVM
Comparisons with Other SVMs
Much Faster, Comparable Correctness
Data Set           m x n       PSVM             SSVM             SVMlight
WPBC (60 mo.)      110 x 32    68.5 / 0.02      68.5 / 0.17      62.7 / 3.85
Ionosphere         351 x 34    87.3 / 0.17      88.7 / 1.23      88.0 / 2.19
Cleveland Heart    297 x 13    85.9 / 0.01      86.2 / 0.70      86.5 / 1.44
Pima Indians       768 x 8     77.5 / 0.02      77.6 / 0.78      76.4 / 37.00
BUPA Liver         345 x 6     69.4 / 0.02      70.0 / 0.78      69.5 / 6.65
Galaxy Dim         4192 x 14   93.5 / 0.34      95.0 / 5.21      94.1 / 28.33
(Each entry: ten-fold test correctness % / time in seconds.)
Linear PSVM vs LSVM
2-Million Dataset
Over 30 Times Faster
Dataset       Method   Training %   Testing %   Time (sec.)
NDC "Easy"    LSVM     90.86        91.23       658.5
              PSVM     90.80        91.13       20.8
NDC "Hard"    LSVM     69.80        69.44       655.6
              PSVM     69.84        69.52       20.6
Nonlinear PSVM: Spiral Dataset
94 Red Dots & 94 White Dots
Nonlinear PSVM Comparisons
Data Set       m x n       PSVM             SSVM              LSVM
Ionosphere     351 x 34    95.2 / 4.60      95.8 / 25.25      95.8 / 14.58
BUPA Liver     345 x 6     73.6 / 4.34      73.7 / 20.65      73.7 / 30.75
Tic-Tac-Toe    958 x 9     98.4 / 74.95     98.4 / 395.30     94.7 / 350.64
Mushroom *     8124 x 22   88.0 / 35.50     88.8 / 307.66     87.8 / 503.74
(Each entry: ten-fold test correctness % / time in seconds.)
* A rectangular kernel of size 8124 x 215 was used.
Conclusion
PSVM is an extremely simple procedure for
generating linear and nonlinear classifiers
PSVM classifier is obtained by solving a single
system of linear equations in the usually small
dimensional input space for a linear classifier
Comparable test set correctness to standard SVM
Much faster than standard SVMs: typically an order of magnitude less time.
Future Work
Extension of PSVM to multicategory classification
Massive data classification using an incremental
PSVM
Parallel formulation and implementation of PSVM