Mathematical Programming in Support Vector Machines


RSVM: Reduced Support Vector Machines
Y.-J. Lee & O. L. Mangasarian
University of Wisconsin-Madison
First SIAM International Conference on Data Mining
Chicago, April 6, 2001
Outline of Talk
• What is a support vector machine (SVM) classifier?
• The smooth support vector machine (SSVM)
  • A new SVM solvable without an optimization package
• Difficulties with nonlinear SVMs:
  • Computational: handling the massive m × m kernel matrix
  • Storage: the separating surface depends on almost the entire dataset
• Reduced support vector machines (RSVMs)
  • Reduced kernel: a much smaller rectangular m × m̄ matrix
  • m̄: 1% to 10% of m
  • Speeds computation & reduces storage
• Numerical results
  • e.g., a 32,562-point dataset classified in 17 minutes, compared to 2.15 hours by a standard algorithm (SMO)
What is a Support Vector Machine?
• An optimally defined surface
• Typically nonlinear in the input space
• Linear in a higher-dimensional space
• Implicitly defined by a kernel function

What are Support Vector Machines Used For?
• Classification
• Regression & data fitting
• Supervised & unsupervised learning
(We will concentrate on classification.)
Geometry of the Classification Problem
2-Category Linearly Separable Case

[Figure: the point sets A+ and A−, separated by the bounding planes x′w = γ + 1 and x′w = γ − 1, with normal w]
Support Vector Machines
Maximizing the Margin between Bounding Planes

[Figure: the bounding planes x′w = γ + 1 and x′w = γ − 1 around A+ and A−, with margin 2/‖w‖₂ between them]
Support Vector Machines Formulation
• Solve the quadratic program for some ν > 0:

$$\min_{(w,\gamma,y)} \; \frac{\nu}{2}\|y\|_2^2 + \frac{1}{2}\|(w,\gamma)\|_2^2 \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \;\; y \ge 0 \tag{QP}$$

where D_ii = ±1 denotes A+ or A− membership.
• The margin is maximized by minimizing (1/2)‖(w, γ)‖²₂.
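As a concrete reference, (QP) can be written almost verbatim in a modeling language. Below is a minimal sketch using the cvxpy package (my choice for illustration; the talk itself solves a smoothed reformulation instead). The names svm_qp, A, d, and nu are assumptions of the sketch, with d the ±1 label vector forming the diagonal of D:

```python
import numpy as np
import cvxpy as cp

def svm_qp(A, d, nu=1.0):
    """Sketch of (QP): min nu/2 ||y||^2 + 1/2 ||(w, gamma)||^2
    s.t. D(Aw - e*gamma) + y >= e, y >= 0."""
    m, n = A.shape
    w, gamma, y = cp.Variable(n), cp.Variable(), cp.Variable(m)
    margins = cp.multiply(d, A @ w - gamma)  # D(Aw - e*gamma) with D = diag(d)
    objective = cp.Minimize(nu / 2 * cp.sum_squares(y)
                            + 0.5 * (cp.sum_squares(w) + gamma ** 2))
    cp.Problem(objective, [margins + y >= 1, y >= 0]).solve()
    return w.value, gamma.value
```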
SVM as an
Unconstrained Minimization Problem
$$\min_{y \ge 0,\, w,\, \gamma} \; \frac{\nu}{2}\|y\|_2^2 + \frac{1}{2}\|(w,\gamma)\|_2^2 \quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e \tag{QP}$$

At the solution of (QP): y = (e − D(Aw − eγ))₊, where (·)₊ = max{·, 0}.
Hence (QP) is equivalent to the nonsmooth SVM:

$$\min_{w,\gamma} \; \frac{\nu}{2}\|(e - D(Aw - e\gamma))_+\|_2^2 + \frac{1}{2}\|(w,\gamma)\|_2^2$$
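In code, the nonsmooth reformulation is just a penalized objective over (w, γ); a minimal numpy sketch (function and variable names are mine):

```python
import numpy as np

def nonsmooth_svm_objective(w, gamma, A, d, nu):
    # nu/2 ||(e - D(Aw - e*gamma))_+||^2 + 1/2 ||(w, gamma)||^2,
    # with D = diag(d) and (x)_+ = max(x, 0) applied componentwise
    r = 1.0 - d * (A @ w - gamma)
    return (0.5 * nu * np.sum(np.maximum(r, 0.0) ** 2)
            + 0.5 * (w @ w + gamma ** 2))
```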
SSVM:
The Smooth Support Vector Machine
• Replacing the plus function (·)₊ in the nonsmooth SVM by the smooth function p(·, α) gives our SSVM:

$$\min_{(w,\gamma) \in R^{n+1}} \; \frac{\nu}{2}\|p\big(e - D(Aw - e\gamma),\, \alpha\big)\|_2^2 + \frac{1}{2}\|(w,\gamma)\|_2^2$$

Here, p(·, α) is an accurate smooth approximation of (·)₊, obtained by integrating the sigmoid function of neural networks (sigmoid = smoothed step).
• The solution of SSVM converges to the solution of the nonsmooth SVM as α goes to infinity. (Typically, α = 5.)
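Integrating the sigmoid 1/(1 + e^(−αx)) gives p(x, α) = x + (1/α) log(1 + e^(−αx)), the smoothing used in the SSVM work; a short numpy sketch, using the algebraically identical logaddexp form to avoid overflow:

```python
import numpy as np

def plus(x):
    return np.maximum(x, 0.0)

def p(x, alpha=5.0):
    # p(x, alpha) = x + log(1 + exp(-alpha*x)) / alpha
    #             = logaddexp(0, alpha*x) / alpha  (numerically stable)
    return np.logaddexp(0.0, alpha * x) / alpha

x = np.linspace(-2.0, 2.0, 401)
print(np.max(np.abs(p(x, 5.0) - plus(x))))    # ~0.139 = log(2)/5
print(np.max(np.abs(p(x, 100.0) - plus(x))))  # ~0.007: p -> (.)_+ as alpha grows
```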
Nonlinear Smooth Support Vector Machine
Nonlinear separating surface: K(x′, A′)Du = γ

• Use a nonlinear kernel K(A, A′) in SSVM:

$$\min_{u,\gamma} \; \frac{\nu}{2}\|p\big(e - D(K(A,A')Du - e\gamma),\, \alpha\big)\|_2^2 + \frac{1}{2}\|(u,\gamma)\|_2^2$$

• The kernel matrix K(A, A′) ∈ R^{m×m} is fully dense
• Use a Newton algorithm to solve the problem, as sketched below
• Each iteration solves m+1 linear equations in m+1 variables
• The nonlinear separating surface depends on the entire dataset: K(x′, A′)Du = γ
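Each Newton iteration solves a square linear system whose size is the number of unknowns (m+1 here, m̄+1 after the reduction introduced later). Below is a sketch of one undamped Newton step on the smooth objective ν/2‖p(e − Mz, α)‖²₂ + ½‖z‖²₂, where z = (u, γ) and M stands for the matrix [DK(A, A′)D, −d] assembled from the formulation above; the authors additionally use an Armijo stepsize, omitted here:

```python
import numpy as np

def newton_step(z, M, nu=1.0, alpha=5.0):
    r = 1.0 - M @ z                            # e - Mz
    q = np.logaddexp(0.0, alpha * r) / alpha   # p(r, alpha)
    s = 1.0 / (1.0 + np.exp(-alpha * r))       # p'(r, alpha), the sigmoid
    grad = -nu * (M.T @ (s * q)) + z
    curv = s * s + alpha * q * s * (1.0 - s)   # d^2/dr^2 of (1/2) p(r, alpha)^2
    H = nu * (M.T * curv) @ M + np.eye(len(z))
    return z - np.linalg.solve(H, grad)        # one (m+1) x (m+1) linear solve
```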
Examples of Kernels
$$K(A,B): R^{m \times n} \times R^{n \times l} \longrightarrow R^{m \times l}$$

A ∈ R^{m×n}, a ∈ R^m, μ ∈ R, d is an integer.

• Polynomial kernel: (AA′ + μaa′)^d, with the power taken componentwise
  (Linear kernel AA′: μ = 0, d = 1)
• Gaussian (radial basis) kernel:

$$K(A,A')_{ij} = e^{-\mu\|A_i - A_j\|_2^2}, \quad i,j = 1,\dots,m$$
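A numpy sketch of both kernels; for the polynomial kernel this sketch takes the vector a to be the vector of ones, which reduces μaa′ to a constant shift μ (an assumption, for brevity):

```python
import numpy as np

def polynomial_kernel(A, B, mu=1.0, d=2):
    # (AB' + mu)^d with the power applied componentwise;
    # mu = 0, d = 1 recovers the linear kernel AB'
    return (A @ B.T + mu) ** d

def gaussian_kernel(A, B, mu=1.0):
    # K(A, B)_ij = exp(-mu * ||A_i - B_j||^2)
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T)
    return np.exp(-mu * np.maximum(sq_dists, 0.0))
```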
Difficulties with Nonlinear SVM
for Large Problems
• The nonlinear kernel K(A, A′) ∈ R^{m×m} is fully dense
  • Long CPU time to compute the m² entries
  • Runs out of memory while storing the m × m kernel matrix
• Computational complexity depends on m
  • Complexity of nonlinear SSVM ≈ O((m+1)³)
• The separating surface depends on almost the entire dataset
  • Need to store the entire dataset even after solving the problem
Overcoming Computational & Storage Difficulties
Use a Rectangular Kernel
• Choose a small random sample Ā ∈ R^{m̄×n} of A
• The small random sample Ā is a representative sample of the entire dataset
• Typically Ā is 1% to 10% of the rows of A
• Replace K(A, A′) by K(A, Ā′) ∈ R^{m×m̄}, with corresponding D̄ ⊂ D, in the nonlinear SSVM
• Only need to compute and store m × m̄ numbers for the rectangular kernel
• Computational complexity reduces to O((m̄+1)³)
• The nonlinear separator depends only on Ā
  (Using the small square kernel K(Ā, Ā′) by itself gives lousy results!)
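Forming the rectangular kernel is a small change from the full one. A sketch, reusing gaussian_kernel from above and assuming the data matrix A and ±1 label vector d are in scope (names and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = A.shape[0]
m_bar = max(1, m // 20)                    # e.g. 5% of the rows
idx = rng.choice(m, size=m_bar, replace=False)
A_bar, d_bar = A[idx], d[idx]              # D_bar = diag(d_bar), the matching block of D
K_reduced = gaussian_kernel(A, A_bar)      # m x m_bar instead of m x m
```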
Reduced Support Vector Machine Algorithm
Nonlinear separating surface: K(x′, Ā′)D̄ū = γ

(i) Choose a random subset matrix Ā ∈ R^{m̄×n} of the entire data matrix A ∈ R^{m×n}
(ii) Solve the following problem by the Newton method, with corresponding D̄ ⊂ D:

$$\min_{(\bar u, \gamma) \in R^{\bar m + 1}} \; \frac{\nu}{2}\|p\big(e - D(K(A, \bar A')\bar D \bar u - e\gamma),\, \alpha\big)\|_2^2 + \frac{1}{2}\|(\bar u, \gamma)\|_2^2$$

(iii) The separating surface is defined by the optimal solution (ū, γ) of step (ii):

$$K(x', \bar A')\bar D \bar u = \gamma$$
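Putting the pieces together, here is a compact end-to-end sketch of the algorithm. The talk solves step (ii) with a Newton-Armijo method; this sketch instead hands the same smooth objective and gradient to scipy's L-BFGS-B, which keeps the code short while preserving the structure (all names are mine):

```python
import numpy as np
from scipy.optimize import minimize

def rsvm_train(A, d, kernel, m_bar, nu=1.0, alpha=5.0, seed=0):
    m = A.shape[0]
    idx = np.random.default_rng(seed).choice(m, size=m_bar, replace=False)
    A_bar, d_bar = A[idx], d[idx]        # step (i): random subset and its labels
    K = kernel(A, A_bar)                 # m x m_bar rectangular kernel

    def objective(z):                    # step (ii): smooth reduced problem
        u, gamma = z[:-1], z[-1]
        r = 1.0 - d * (K @ (d_bar * u) - gamma)   # e - D(K(A,Abar')Dbar*u - e*gamma)
        q = np.logaddexp(0.0, alpha * r) / alpha  # p(r, alpha)
        s = 1.0 / (1.0 + np.exp(-alpha * r))      # p'(r, alpha)
        f = 0.5 * nu * (q @ q) + 0.5 * (u @ u + gamma ** 2)
        g_r = nu * s * q
        g_u = -(d_bar * (K.T @ (d * g_r))) + u
        g_gamma = np.sum(d * g_r) + gamma
        return f, np.concatenate([g_u, [g_gamma]])

    res = minimize(objective, np.zeros(m_bar + 1), jac=True, method="L-BFGS-B")
    u, gamma = res.x[:-1], res.x[-1]
    return A_bar, d_bar * u, gamma       # step (iii): surface K(x', Abar')Dbar*u = gamma

def rsvm_predict(X, A_bar, v, gamma, kernel):
    # classify by the sign of K(x', Abar')Dbar*u - gamma, with v = Dbar*u
    return np.sign(kernel(X, A_bar) @ v - gamma)
```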
How to Choose Ā in RSVM?
• Ā is a representative sample of the entire dataset
• Ā need not be a subset of A
• A good selection of Ā may generate a classifier using a very small m̄
• Possible ways to choose Ā (two are sketched below):
  • Choose m̄ random rows from the entire dataset A
  • Choose Ā such that the distance between its rows exceeds a certain tolerance
  • Use k cluster centers of A+ and of A− as Ā
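Two of these rules are easy to sketch; the greedy distance filter and the cluster-center rule below are my reading of the bullets above (scikit-learn's KMeans is assumed available):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_by_distance(A, tol):
    # greedily keep rows whose distance to every kept row exceeds tol
    kept = [0]
    for i in range(1, A.shape[0]):
        if np.min(np.linalg.norm(A[kept] - A[i], axis=1)) > tol:
            kept.append(i)
    return A[kept]

def select_by_clustering(A, d, k):
    # use k cluster centers of A+ and k of A-; note the resulting
    # A_bar need not be a subset of A
    plus = KMeans(n_clusters=k, n_init=10).fit(A[d > 0]).cluster_centers_
    minus = KMeans(n_clusters=k, n_init=10).fit(A[d < 0]).cluster_centers_
    return np.vstack([plus, minus])
```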
A Nonlinear Kernel Application
Checkerboard Training Set: 1000 Points in R²
Separate 486 Asterisks from 514 Dots

[Figure: conventional SVM result on the checkerboard using 50 randomly selected points out of 1000, i.e. K(Ā, Ā′) ∈ R^{50×50}]

[Figure: RSVM result on the checkerboard using the SAME 50 random points out of 1000, i.e. K(A, Ā′) ∈ R^{1000×50}]
RSVM on Moderate Sized Problems
(Best Test Set Correctness %, CPU seconds)
Dataset (m × n), m̄             | K(A, Ā′) m×m̄  | K(A, A′) m×m   | K(Ā, Ā′) m̄×m̄
                                |  %      sec    |  %      sec     |  %      sec
Cleveland Heart (297 × 13), 30  | 86.47   3.04   | 85.92   32.42   | 76.88   1.58
BUPA Liver (345 × 6), 35        | 74.86   2.68   | 73.62   32.61   | 68.95   2.04
Ionosphere (351 × 34), 35       | 95.19   5.02   | 94.35   59.88   | 88.70   2.13
Pima Indians (768 × 8), 50      | 78.64   5.72   | 76.59   328.3   | 57.32   4.64
Tic-Tac-Toe (958 × 9), 96       | 98.75   14.56  | 98.43   1033.5  | 88.24   8.87
Mushroom (8124 × 22), 215       | 89.04   466.20 | N/A     N/A     | 83.90   221.50
RSVM on Large UCI Adult Dataset
Average Test Set Correctness % & Standard Deviation over 50 Runs
(Ā ∈ R^{m̄×123})

Dataset size    | K(A, Ā′) m×m̄       | K(Ā, Ā′) m̄×m̄      |  m̄   | m̄/m
(Train, Test)   | Test %   Std. Dev.  | Test %   Std. Dev.  |      |
(6414, 26148)   | 84.47    0.001      | 77.03    0.014      | 210  | 3.2%
(11221, 21341)  | 84.71    0.001      | 75.96    0.016      | 225  | 2.0%
(16101, 16461)  | 84.90    0.001      | 75.45    0.017      | 242  | 1.5%
(22697, 9865)   | 85.31    0.001      | 76.73    0.018      | 284  | 1.2%
(32562, 16282)  | 85.07    0.001      | 76.95    0.013      | 326  | 1.0%
CPU Times on UCI Adult Dataset
RSVM, SMO and PCGC with a Gaussian Kernel
Adult Dataset : Training Set Size vs. CPU Time in Seconds
Training set size | RSVM   | SMO     | PCGC
3185              | 44.2   | 66.2    | 380.5
4781              | 83.6   | 146.6   | 1137.2
6414              | 123.4  | 258.8   | 2530.6
11221             | 227.8  | 781.4   | 11910.6
16101             | 342.5  | 1784.4  | ran out of memory
22697             | 587.4  | 4126.4  | ran out of memory
32562             | 980.2  | 7749.6  | ran out of memory
CPU Time Comparison on UCI Adult Dataset

[Figure: CPU time in seconds vs. training set size for RSVM, SMO, and PCGC with a Gaussian kernel]
Conclusion
• RSVM: an effective classifier for large datasets
  • The classifier uses 10% or less of the dataset
  • Can handle massive datasets
  • Much faster than other algorithms
• Test set correctness:
  • Same or better than that of the full-kernel classifier K(A, A′)
  • Much better than that of a randomly chosen small square kernel K(Ā, Ā′)
• The rectangular kernel K(A, Ā′):
  • A novel, practical idea
  • Applicable to all nonlinear kernel problems