Proximal Plane Classification KDD 2001 San Francisco August 26-29, 2001

Proximal Plane Classification
KDD 2001
San Francisco August 26-29, 2001
Glenn Fung & Olvi Mangasarian
Data Mining Institute
University of Wisconsin - Madison
Second Annual Review
June 1, 2001
Key Contributions
Fast new support vector machine classifier
An order of magnitude faster than standard classifiers
Extremely simple to implement
 4 lines of MATLAB code
 NO optimization packages (LP,QP) needed
Outline of Talk
 (Standard) Support vector machine (SVM) classifiers
 Proximal support vector machines (PSVM) classifiers
 Geometric motivation
 Linear PSVM classifier
 Nonlinear PSVM classifier
 Full and reduced kernels
 Numerical results
 Correctness comparable to standard SVM
 Much faster classification!
 2-million points in 10-space in 21 seconds
 Compared to over 10 minutes for standard SVM
Support Vector Machines
Maximizing the Margin between Bounding Planes

[Figure: the two bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$, with normal $w$, separating the point sets $A+$ and $A-$; the margin between them is $2 / \|[w; \gamma]\|_2$.]
Proximal Support Vector Machines
Fitting the Data Using Two Parallel Proximal Planes

[Figure: the same two parallel planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$, now acting as proximal planes about which the points of $A+$ and $A-$ cluster, with distance $2 / \|[w; \gamma]\|_2$ between them.]
SVM as an
Unconstrained Minimization Problem

Changing to the 2-norm and measuring the margin in $(w, \gamma)$ space:

  $\min_{y \ge 0,\, w, \gamma}\; \frac{\nu}{2}\|y\|_2^2 + \frac{1}{2}\|[w; \gamma]\|_2^2$  s.t. $D(Aw - e\gamma) + y \ge e$   (QP)

At the solution of (QP): $y = (e - D(Aw - e\gamma))_+$, where $(\cdot)_+ = \max\{\cdot, 0\}$.

Hence (QP) is equivalent to:

  $\min_{w, \gamma}\; \frac{\nu}{2}\|(e - D(Aw - e\gamma))_+\|_2^2 + \frac{1}{2}\|[w; \gamma]\|_2^2$
PSVM Formulation

We have from the QP SVM formulation, with the inequality constraint replaced by an equality:

  $\min_{w, \gamma, y}\; \frac{\nu}{2}\|y\|_2^2 + \frac{1}{2}\|[w; \gamma]\|_2^2$  s.t. $D(Aw - e\gamma) + y = e$   (QP)

Solving for $y$ in terms of $w$ and $\gamma$ gives:

  $\min_{w, \gamma}\; \frac{\nu}{2}\|e - D(Aw - e\gamma)\|_2^2 + \frac{1}{2}\|[w; \gamma]\|_2^2$

This simple but critical modification changes the nature of the optimization problem tremendously!
Advantages of New Formulation
 The objective function remains strongly convex
 An explicit exact solution can be written in terms of the problem data
 The PSVM classifier is obtained by solving a single system of linear equations in the usually small dimensional input space
 Exact leave-one-out correctness can be obtained in terms of the problem data
Linear PSVM

We want to solve:

  $\min_{w, \gamma}\; \frac{\nu}{2}\|e - D(Aw - e\gamma)\|_2^2 + \frac{1}{2}\|[w; \gamma]\|_2^2$

Setting the gradient equal to zero gives a nonsingular system of linear equations.
Solving this system gives the desired PSVM classifier.
Linear PSVM Solution

  $\begin{bmatrix} w \\ \gamma \end{bmatrix} = \left(\frac{I}{\nu} + H'H\right)^{-1} H'De$

Here, $H = [A \;\; -e]$.
 The linear system to solve depends on $H'H$, which is of size $(n+1) \times (n+1)$
 $n$ is usually much smaller than $m$
Linear Proximal SVM Algorithm

Input: $A, D$
Define: $H = [A \;\; -e]$
Calculate: $v = H'De$
Solve: $(I/\nu + H'H)\,[w; \gamma] = v$
Classifier: $\mathrm{sign}(w'x - \gamma)$
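The four steps above translate directly into a few lines of numpy. This is a minimal sketch of the algorithm as stated on the slide; the function names and the dense solve are illustrative choices, not the authors' implementation (their MATLAB version appears later in the talk).

```python
import numpy as np

def linear_psvm(A, d, nu):
    """Linear PSVM sketch: solve (I/nu + H'H)[w; gamma] = H'De.

    A  : (m, n) data matrix
    d  : (m,) labels in {+1, -1} (the diagonal of D, so D e = d)
    nu : positive regularization parameter
    """
    m, n = A.shape
    H = np.hstack([A, -np.ones((m, 1))])   # H = [A  -e]
    v = H.T @ d                            # v = H'De, since De = d
    r = np.linalg.solve(np.eye(n + 1) / nu + H.T @ H, v)
    return r[:n], r[n]                     # w, gamma

def linear_classify(X, w, gamma):
    # Classifier: sign(w'x - gamma) applied to each row of X
    return np.sign(X @ w - gamma)
```

Note that the solve involves only an $(n+1) \times (n+1)$ matrix, so the cost is dominated by forming $H'H$, which is linear in the number of points $m$.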
Nonlinear PSVM Formulation

 Linear PSVM (linear separating surface: $x'w = \gamma$):

  $\min_{w, \gamma, y}\; \frac{\nu}{2}\|y\|_2^2 + \frac{1}{2}\|[w; \gamma]\|_2^2$  s.t. $D(Aw - e\gamma) + y = e$   (QP)

 By QP “duality”, $w = A'Du$. Maximizing the margin in the “dual space” gives:

  $\min_{u, \gamma}\; \frac{\nu}{2}\|e - D(AA'Du - e\gamma)\|_2^2 + \frac{1}{2}\|[u; \gamma]\|_2^2$

 Replace $AA'$ by a nonlinear kernel $K(A, A')$:

  $\min_{u, \gamma}\; \frac{\nu}{2}\|e - D(K(A, A')Du - e\gamma)\|_2^2 + \frac{1}{2}\|[u; \gamma]\|_2^2$
The Nonlinear Classifier

 The nonlinear classifier:

  $K(x', A')Du = \gamma$

  $K(A, B): R^{m \times n} \times R^{n \times l} \to R^{m \times l}$

 Where $K$ is a nonlinear kernel, e.g.:
  Polynomial kernel: $(AA' + \mu aa')^{d}_{\bullet}$
  Gaussian (radial basis) kernel: $\varepsilon^{-\mu \|A_i - A_j\|_2^2},\; i, j = 1, \dots, m$
Nonlinear PSVM

 Similar to the linear case, setting the gradient equal to zero, we obtain:

  $\begin{bmatrix} u \\ \gamma \end{bmatrix} = \left(\frac{I}{\nu} + H'H\right)^{-1} H'De$

 with $H$ defined slightly differently: $H = [K(A, A') \;\; -e]$
 Here, the linear system to solve is of size $(m+1) \times (m+1)$.
 However, reduced kernel techniques (RSVM) can be used to reduce dimensionality.
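The reduced-kernel idea can be sketched as follows: replace the full $m \times m$ kernel with a rectangular kernel $K(A, \bar{A}')$ built from a random subset $\bar{A}$ of the rows of $A$, so the linear system shrinks from $(m+1) \times (m+1)$ to $(\bar{m}+1) \times (\bar{m}+1)$. This is a minimal sketch in the spirit of RSVM, not the exact RSVM procedure; the function names and the uniform-random subset choice are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """Entrywise Gaussian kernel: K_ij = exp(-mu * ||A_i - B_j||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-mu * np.maximum(sq, 0.0))

def reduced_psvm(A, d, nu, mu, mbar, rng):
    """Reduced-kernel PSVM sketch: H = [K(A, Abar')  -e], so the solve
    is only (mbar+1) x (mbar+1) instead of (m+1) x (m+1)."""
    m = A.shape[0]
    idx = rng.choice(m, size=mbar, replace=False)   # random row subset
    Abar = A[idx]
    H = np.hstack([gaussian_kernel(A, Abar, mu), -np.ones((m, 1))])
    r = np.linalg.solve(np.eye(mbar + 1) / nu + H.T @ H, H.T @ d)
    return Abar, r[:mbar], r[mbar]                  # Abar, u, gamma
```

The resulting classifier is $\mathrm{sign}(K(x', \bar{A}')u - \gamma)$, evaluated against the retained rows only.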
Nonlinear Proximal SVM Algorithm

Input: $A, D$
Define: $K = K(A, A')$, $H = [K \;\; -e]$
Calculate: $v = H'De$
Solve: $(I/\nu + H'H)\,[u; \gamma] = v$
Classifier: $\mathrm{sign}(K(x', A')u - \gamma)$, where this $u$ plays the role of $Du$ in the formulation above
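The nonlinear algorithm above, sketched in numpy with a Gaussian kernel. As with the linear sketch, the function names and parameter values are illustrative assumptions; only the formulas come from the slides.

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    """Entrywise Gaussian kernel: K_ij = exp(-mu * ||A_i - B_j||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-mu * np.maximum(sq, 0.0))

def nonlinear_psvm(A, d, nu, mu):
    """Nonlinear PSVM sketch: K = K(A, A'), H = [K  -e],
    solve (I/nu + H'H)[u; gamma] = H'De."""
    m = A.shape[0]
    K = gaussian_kernel(A, A, mu)
    H = np.hstack([K, -np.ones((m, 1))])
    r = np.linalg.solve(np.eye(m + 1) / nu + H.T @ H, H.T @ d)
    return r[:m], r[m]                              # u, gamma

def nonlinear_classify(X, A, u, gamma, mu):
    # Classifier: sign(K(x', A') u - gamma) for each row x of X
    return np.sign(gaussian_kernel(X, A, mu) @ u - gamma)
```

Unlike the linear case, the system here is $(m+1) \times (m+1)$, which is why the reduced-kernel variant matters for large $m$.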
PSVM MATLAB Code

function [w, gamma] = psvm(A,d,nu)
% PSVM: linear and nonlinear classification
% INPUT: A, d=diag(D), nu. OUTPUT: w, gamma
% [w, gamma] = psvm(A,d,nu);
[m,n]=size(A); e=ones(m,1); H=[A -e];
v=(d'*H)';                 % v=H'*D*e;
r=(speye(n+1)/nu+H'*H)\v;  % solve (I/nu+H'*H)r=v
w=r(1:n); gamma=r(n+1);    % get w, gamma from r
Linear PSVM
Comparisons with Other SVMs
Much Faster, Comparable Correctness

                              PSVM              SSVM             SVM-light
Data Set          m x n    test %   sec      test %   sec      test %    sec
WPBC (60 mo.)   110 x 32    68.5    0.02      68.5    0.17      62.7     3.85
Ionosphere      351 x 34    87.3    0.17      88.7    1.23      88.0     2.19
Cleveland Heart 297 x 13    85.9    0.01      86.2    0.70      86.5     1.44
Pima Indians    768 x  8    77.5    0.02      77.6    0.78      76.4    37.00
BUPA Liver      345 x  6    69.4    0.02      70.0    0.78      69.5     6.65
Galaxy Dim     4192 x 14    93.5    0.34      95.0    5.21      94.1    28.33

(test % = ten-fold test correctness; sec = running time in seconds)
Linear PSVM
Comparisons on Larger Adult Dataset (Attributes = 123)
Much Faster & Comparable Correctness

Testing correctness % / running time in seconds (best time in red in the original slide):

(Train, Test)       PSVM         LSVM          SSVM         SOR          SMO     SVM-light
(11221, 21341)   84.48 /   2.5  84.84 /  38.9  84.79 / 14.1  84.37 / 18.8   17.0  84.68 /  306.6
(16101, 16461)   84.78 /   3.7  85.01 /  60.5  84.96 / 21.5  84.62 / 24.8   35.3  84.83 /  667.2
(22697,  9865)   85.16 /   5.2  85.35 /  92.0  85.35 / 29.0  85.06 / 31.3   85.7  85.17 / 1425.6
(32562, 16282)   84.56 /   7.4  85.05 / 140.9  84.96 / 83.9  85.02 / 44.5  163.6  85.05 / 2184.0

(SMO: running time only)
Linear PSVM vs LSVM
2-Million Dataset
Over 30 Times Faster

Dataset      Method   Training        Testing         Time
                      Correctness %   Correctness %   Sec.
NDC "Easy"   LSVM        90.86           91.23        658.5
             PSVM        90.80           91.13         20.8
NDC "Hard"   LSVM        69.80           69.44        655.6
             PSVM        69.84           69.52         20.6
Nonlinear PSVM: Spiral Dataset
[Figure: 94 red dots & 94 white dots forming two intertwined spirals]
Nonlinear PSVM Comparisons

                             PSVM              SSVM              LSVM
Data Set        m x n    test %    sec     test %     sec     test %     sec
Ionosphere    351 x 34    95.2     4.60     95.8     25.25     95.8     14.58
BUPA Liver    345 x  6    73.6     4.34     73.7     20.65     73.7     30.75
Tic-Tac-Toe   958 x  9    98.4    74.95     98.4    395.30     94.7    350.64
Mushroom *   8124 x 22    88.0    35.50     88.8    307.66     87.8    503.74

(test % = ten-fold test correctness; sec = running time in seconds)
* A rectangular kernel of size 8124 x 215 was used.
Conclusion
 PSVM is an extremely simple procedure for generating linear and nonlinear classifiers
 The PSVM classifier is obtained by solving a single system of linear equations, in the usually small dimensional input space for a linear classifier
 Comparable test set correctness to standard SVM
 Much faster than standard SVMs: typically an order of magnitude faster
Future Work
 Extension of PSVM to multicategory classification
 Massive data classification using an incremental PSVM
 Parallel extension and implementation of PSVM