Mathematical Programming in Support Vector Machines


Proximal Support Vector Machine Classifiers
KDD 2001
San Francisco August 26-29, 2001
Glenn Fung & Olvi Mangasarian
Data Mining Institute
University of Wisconsin - Madison
Key Contributions
Fast new support vector machine classifier
An order of magnitude faster than standard classifiers
Extremely simple to implement
 4 lines of MATLAB code
 NO optimization packages (LP, QP) needed
Outline of Talk
 (Standard) Support vector machines (SVM)
 Classify by halfspaces
 Proximal support vector machines (PSVM)
 Classify by proximity to planes
 Linear PSVM classifier
 Nonlinear PSVM classifier
 Full and reduced kernels
 Numerical results
 Correctness comparable to standard SVM
 Much faster classification!
 2-million points in 10-space in 21 seconds
 Compared to over 10 minutes for standard SVM
Support Vector Machines
Maximizing the Margin between Bounding Planes
[Figure: the bounding planes $x'w = \gamma + 1$ (bounding A+) and $x'w = \gamma - 1$ (bounding A-), separated by the margin $2/\|w\|_2$ along the normal $w$.]
Proximal Support Vector Machines
Fitting the Data Using Two Parallel Bounding Planes
[Figure: the two parallel proximal planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$, around which the points of A+ and A- cluster, pushed apart by the margin $2/\|w\|_2$ along the normal $w$.]
Standard Support Vector Machine
Algebra of 2-Category Linearly Separable Case
 Given m points in n-dimensional space
 Represented by an m-by-n matrix A
 Membership of each point $A_i$ in class +1 or -1 specified by:
 An m-by-m diagonal matrix D with +1 and -1 entries
 Separate by two bounding planes, $x'w = \gamma \pm 1$:
   $A_i w \ge \gamma + 1$, for $D_{ii} = +1$
   $A_i w \le \gamma - 1$, for $D_{ii} = -1$
 More succinctly: $D(Aw - e\gamma) \ge e$, where $e$ is a vector of ones.
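As a concrete instance of this notation, here is a minimal MATLAB sketch; the data A, labels d, and the plane (w, gamma) are invented for illustration:

A = [1 2; 2 3; -1 -2; -2 -1];   % m = 4 points in n = 2 dimensions
d = [1; 1; -1; -1];             % class of each row of A
D = diag(d);                    % the m-by-m diagonal matrix D
e = ones(4,1);
w = [1; 1]; gamma = 0;          % a candidate plane x'w = gamma
all(D*(A*w - e*gamma) >= e)     % returns 1: both bounding planes hold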
Standard Support Vector Machine Formulation
 Solve the quadratic program for some $\nu > 0$:
   $\min_{y,w,\gamma} \; \frac{\nu}{2}\|y\|_2^2 + \frac{1}{2}\|(w,\gamma)\|_2^2$
   s.t. $D(Aw - e\gamma) + y \ge e$   (QP)
 where $D_{ii} = \pm 1$ denotes membership in A+ or A-.
 The margin is maximized by minimizing $\frac{1}{2}\|(w,\gamma)\|_2^2$.
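For contrast with what follows, here is a hedged sketch (not shown in the talk) of solving this (QP) with an optimization package, MATLAB's quadprog; A, d = diag(D), and nu are assumed to be given:

[m,n] = size(A); e = ones(m,1); D = diag(d);
% variable z = [w; gamma; y]; objective (nu/2)*y'*y + (1/2)*(w'*w + gamma^2)
Q = blkdiag(speye(n), 1, nu*speye(m));
f = zeros(n+1+m, 1);
% D(Aw - e*gamma) + y >= e, rewritten as -D*A*w + D*e*gamma - y <= -e
Aineq = [-D*A, D*e, -speye(m)];
z = quadprog(Q, f, Aineq, -e);
w = z(1:n); gamma = z(n+1);

This dependence on an LP/QP package is exactly what PSVM eliminates.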
PSVM Formulation
We have from the QP SVM formulation, with the inequality constraint replaced by an equality:
   $\min_{w,\gamma} \; \frac{\nu}{2}\|y\|_2^2 + \frac{1}{2}\|(w,\gamma)\|_2^2$
   s.t. $D(Aw - e\gamma) + y = e$   (QP)
Solving for $y$ in terms of $w$ and $\gamma$ gives:
   $\min_{w,\gamma} \; \frac{\nu}{2}\|e - D(Aw - e\gamma)\|_2^2 + \frac{1}{2}\|(w,\gamma)\|_2^2$
This simple but critical modification changes the nature of the optimization problem tremendously!
Advantages of New Formulation
 Objective function remains strongly convex
 An explicit exact solution can be written in terms
of the problem data
 PSVM classifier is obtained by solving a single
system of linear equations in the usually small
dimensional input space
 Exact leave-one-out correctness can be obtained in terms of the problem data
Linear PSVM
We want to solve:
   $\min_{w,\gamma} \; \frac{\nu}{2}\|e - D(Aw - e\gamma)\|_2^2 + \frac{1}{2}\|(w,\gamma)\|_2^2$
Setting the gradient equal to zero gives a nonsingular system of linear equations.
Solving that system gives the desired PSVM classifier.
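To fill in the omitted step (a derivation consistent with the solution on the next slide): write $H = [A \;\; -e]$ and $z = (w, \gamma)$, so the objective is $\frac{\nu}{2}\|e - DHz\|_2^2 + \frac{1}{2}\|z\|_2^2$. Its gradient is
   $-\nu H'D(e - DHz) + z = 0,$
and since $DD = I$, this rearranges to
   $\left(\frac{I}{\nu} + H'H\right) z = H'De,$
whose coefficient matrix is a scaled identity plus a positive semidefinite matrix, hence positive definite and nonsingular.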
Linear PSVM Solution
   $\begin{bmatrix} w \\ \gamma \end{bmatrix} = \left(\frac{I}{\nu} + H'H\right)^{-1} H'De$
Here, $H = [A \;\; -e]$.
 The linear system to solve depends on $H'H$, which is of size $(n+1) \times (n+1)$.
 $n$ is usually much smaller than $m$.
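In cost terms (a gloss, not on the slide): forming $H'H$ takes $O(m(n+1)^2)$ operations and solving the $(n+1) \times (n+1)$ system takes $O((n+1)^3)$, so the total work grows only linearly in the number of data points $m$.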
Linear Proximal SVM Algorithm
Input: $A$, $D$
Define: $H = [A \;\; -e]$
Calculate: $v = H'De$
Solve: $\left(\frac{I}{\nu} + H'H\right)\begin{bmatrix} w \\ \gamma \end{bmatrix} = v$
Classifier: $\mathrm{sign}(x'w - \gamma)$
Nonlinear PSVM Formulation
 Linear PSVM (linear separating surface $x'w = \gamma$):
   $\min_{w,\gamma} \; \frac{\nu}{2}\|y\|_2^2 + \frac{1}{2}\|(w,\gamma)\|_2^2$
   s.t. $D(Aw - e\gamma) + y = e$   (QP)
 By QP "duality", $w = A'Du$. Maximizing the margin in the "dual space" gives:
   $\min_{u,\gamma} \; \frac{\nu}{2}\|e - D(AA'Du - e\gamma)\|_2^2 + \frac{1}{2}\|(u,\gamma)\|_2^2$
 Replace $AA'$ by a nonlinear kernel $K(A, A')$:
   $\min_{u,\gamma} \; \frac{\nu}{2}\|e - D(K(A,A')Du - e\gamma)\|_2^2 + \frac{1}{2}\|(u,\gamma)\|_2^2$
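To spell out the substitution (a gloss, not on the slide): with $w = A'Du$, the term $Aw - e\gamma$ becomes $AA'Du - e\gamma$, so the data enter the problem only through the Gram matrix $AA'$; regularizing $\|u\|_2^2$ instead of $\|w\|_2^2 = u'DAA'Du$ is what "maximizing the margin in the dual space" means here, and it is precisely what allows $AA'$ to be swapped for an arbitrary kernel $K(A, A')$.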
The Nonlinear Classifier
 The nonlinear classifier: $K(x', A')Du = \gamma$
   $K(A, A') : R^{m \times n} \times R^{n \times m} \to R^{m \times m}$
 where $K$ is a nonlinear kernel, e.g. the Gaussian (radial basis) kernel:
   $K(A, A')_{ij} = \exp(-\mu \|A_i - A_j\|_2^2), \quad i, j = 1, \dots, m$
 The $ij$-entry of $K(A, A')$ represents the "similarity" of the data points $A_i$ and $A_j$.
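A minimal MATLAB sketch of this kernel (the function name gaussian_kernel and the width parameter mu are my choices, not from the talk); pass B = A' for the full m-by-m kernel, or the transpose of a row subset of A for a rectangular kernel:

function K = gaussian_kernel(A, B, mu)
% K(i,j) = exp(-mu*||A(i,:)' - B(:,j)||_2^2); A is m-by-n, B is n-by-k
sqA = sum(A.^2, 2);                 % m-by-1 squared row norms
sqB = sum(B.^2, 1);                 % 1-by-k squared column norms
K = exp(-mu*(sqA + sqB - 2*A*B));   % implicit expansion to m-by-k
end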
Nonlinear PSVM
 Similar to the linear case, setting the gradient equal to zero, we obtain:
   $\begin{bmatrix} u \\ \gamma \end{bmatrix} = \left(\frac{I}{\nu} + H'H\right)^{-1} H'De$
 with $H$ now defined slightly differently: $H = [K(A, A') \;\; -e]$
 Here, the linear system to solve is of size $(m+1) \times (m+1)$.
 However, reduced kernel techniques (RSVM) can be used to reduce the dimensionality, as sketched below.
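A hedged sketch of that reduction (in the spirit of RSVM), reusing the gaussian_kernel helper sketched above; the subset size kbar and the random choice of rows are assumptions, not details given on the slide:

kbar = round(0.1*m);                   % keep ~10% of the rows (assumed)
Abar = A(randperm(m, kbar), :);        % random row subset of A
K = gaussian_kernel(A, Abar', mu);     % rectangular m-by-kbar kernel
H = [K -e];                            % m-by-(kbar+1) instead of m-by-(m+1)
r = (speye(kbar+1)/nu + H'*H)\(H'*d);  % solve (I/nu + H'H)r = H'De, with De = d
u = r(1:kbar); gamma = r(kbar+1);
% classify a new point x: sign(gaussian_kernel(x', Abar', mu)*u - gamma)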
Nonlinear Proximal SVM Algorithm
Input: $A$, $D$
Define: $K = K(A, A')$, $H = [K \;\; -e]$
Calculate: $v = H'De$
Solve: $\left(\frac{I}{\nu} + H'H\right)\begin{bmatrix} \bar{u} \\ \gamma \end{bmatrix} = v$
Classifier: $\mathrm{sign}(K(x', A')\bar{u} - \gamma)$, where $\bar{u} = Du$
Linear & Nonlinear PSVM MATLAB Code
function [w, gamma] = psvm(A,d,nu)
% PSVM: linear and nonlinear classification
% INPUT: A, d=diag(D), nu. OUTPUT: w, gamma
% [w, gamma] = psvm(A,d,nu);
[m,n]=size(A);e=ones(m,1);H=[A -e];
v=(d'*H)';                    % v=H'*D*e
r=(speye(n+1)/nu+H'*H)\v;     % solve (I/nu+H'*H)r=v
w=r(1:n);gamma=r(n+1);        % getting w,gamma from r
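A hypothetical usage sketch (the data follow the earlier toy example; nu = 1 and mu are assumed choices). The same four executable lines also produce the nonlinear classifier when the kernel matrix is passed in place of A, which is why the header comment says "linear and nonlinear":

% linear: learn (w, gamma), then classify the training points
[w, gamma] = psvm(A, d, 1);
pred = sign(A*w - gamma);                 % should reproduce d on separable data
% nonlinear: pass K(A,A') instead of A; the returned vector (length m)
% plays the role of u-bar = Du in the algorithm slide
K = gaussian_kernel(A, A', mu);
[u, gamma] = psvm(K, d, 1);
pred_x = sign(gaussian_kernel(x', A', mu)*u - gamma);   % new point x (n-by-1)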
Linear PSVM Comparisons with Other SVMs
Much Faster, Comparable Correctness
(ten-fold test correctness %; time in seconds)

Data Set         m x n      PSVM %  Time   SSVM %  Time   SVMlight %  Time
WPBC (60 mo.)    110 x 32   68.5    0.02   68.5    0.17   62.7        3.85
Ionosphere       351 x 34   87.3    0.17   88.7    1.23   88.0        2.19
Cleveland Heart  297 x 13   85.9    0.01   86.2    0.70   86.5        1.44
Pima Indians     768 x 8    77.5    0.02   77.6    0.78   76.4        37.00
BUPA Liver       345 x 6    69.4    0.02   70.0    0.78   69.5        6.65
Galaxy Dim       4192 x 14  93.5    0.34   95.0    5.21   94.1        28.33
Linear PSVM vs. LSVM on a 2-Million-Point Dataset
Over 30 Times Faster

Dataset      Method  Training Correctness %  Testing Correctness %  Time (sec.)
NDC "Easy"   LSVM    90.86                   91.23                  658.5
             PSVM    90.80                   91.13                  20.8
NDC "Hard"   LSVM    69.80                   69.44                  655.6
             PSVM    69.84                   69.52                  20.6
Nonlinear PSVM: Spiral Dataset
94 Red Dots & 94 White Dots
Nonlinear PSVM Comparisons
(ten-fold test correctness %; time in seconds)

Data Set     m x n      PSVM %  Time    SSVM %  Time     LSVM %  Time
Ionosphere   351 x 34   95.2    4.60    95.8    25.25    95.8    14.58
BUPA Liver   345 x 6    73.6    4.34    73.7    20.65    73.7    30.75
Tic-Tac-Toe  958 x 9    98.4    74.95   98.4    395.30   94.7    350.64
Mushroom *   8124 x 22  88.0    35.50   88.8    307.66   87.8    503.74

* A rectangular kernel of size 8124 x 215 was used.
Conclusion
 PSVM is an extremely simple procedure for generating linear and nonlinear classifiers
 The linear PSVM classifier is obtained by solving a single system of linear equations in the usually small dimensional input space
 Comparable test set correctness to standard SVM
 Much faster than standard SVMs: typically an order of magnitude less time
Future Work
 Extension of PSVM to multicategory classification
 Massive data classification using an incremental PSVM
 Parallel formulation and implementation of PSVM