Mathematical Programming in Support Vector Machines


Olvi L. Mangasarian
University of Wisconsin - Madison
High Performance Computation for Engineering Systems Seminar
MIT, October 4, 2000
What is a Support Vector Machine?
- An optimally defined surface
- Typically nonlinear in the input space
- Linear in a higher dimensional space
- Implicitly defined by a kernel function
What are Support Vector Machines Used For?
- Classification
- Regression & data fitting
- Supervised & unsupervised learning
(We will concentrate on classification.)
Example of Nonlinear Classifier: Checkerboard Classifier
Outline of Talk
- Generalized support vector machines (GSVMs)
  A completely general kernel allows complex classification (no Mercer condition!)
- Smooth support vector machines (SSVMs)
  Smooth the SVM and solve it by a fast Newton method
- Lagrangian support vector machines (LSVMs)
  A very fast, simple iterative scheme; one matrix inversion: no LP, no QP
- Reduced support vector machines (RSVMs)
  Handle large datasets with nonlinear kernels
Generalized Support Vector Machines
2-Category Linearly Separable Case
[Figure: the bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$, with normal $w$, bracketing the point sets A+ and A-]
Generalized Support Vector Machines
Algebra of 2-Category Linearly Separable Case
- Given m points in n-dimensional space
- Represented by the m-by-n matrix A
- Membership of each point $A_i$ in class +1 or -1 specified by an m-by-m diagonal matrix D with +1 & -1 entries
- Separate by two bounding planes, $x'w = \gamma \pm 1$:
  $$A_i w \geq \gamma + 1, \ \text{for} \ D_{ii} = +1; \qquad A_i w \leq \gamma - 1, \ \text{for} \ D_{ii} = -1.$$
- More succinctly:
  $$D(Aw - e\gamma) \geq e,$$
  where e is a vector of ones.
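As a concrete illustration (not from the talk), a few lines of MATLAB build D from a made-up label vector and test the bounding-plane condition; all names and data here are illustrative:

% Sketch: build the diagonal label matrix D and check D(Aw - e*gamma) >= e
% for a candidate plane (w, gamma). Data and labels are made up.
m = 6; n = 2;
A = [randn(3,n)+2; randn(3,n)-2];   % m points in n-dimensional space
d = [1; 1; 1; -1; -1; -1];          % class membership of each A_i
D = diag(d);                        % m-by-m diagonal matrix with +1 & -1 entries
e = ones(m,1);
w = ones(n,1); gamma = 0;           % candidate separating plane x'w = gamma
separated = all(D*(A*w - e*gamma) >= e)   % true iff both bounding planes hold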
Generalized Support Vector Machines
Maximizing the Margin between Bounding Planes
[Figure: the bounding planes $x'w = \gamma + 1$ and $x'w = \gamma - 1$ separating A+ and A-, with the margin $\frac{2}{\|w\|_2}$ between them]
Generalized Support Vector Machines
The Linear Support Vector Machine Formulation
Solve the following mathematical program for some $\nu > 0$:
$$\min_{w,\gamma,y} \ \nu e'y + \|w\|_1 \quad \text{s.t.} \quad D(Aw - e\gamma) + y \geq e, \quad y \geq 0.$$
The nonnegative slack variable y is zero iff:
- The convex hulls of A+ and A- do not intersect
- $\nu$ is sufficiently large
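Since the program above is linear, any LP solver can handle it. A minimal sketch, assuming MATLAB's linprog (Optimization Toolbox), stacking the variables as z = [w; gamma; y; s] with s >= |w| linearizing the 1-norm; the function name is hypothetical:

% Sketch: the linear 1-norm SVM above as an LP; names are illustrative.
function [w, gamma] = linsvm_lp(A, d, nu)
[m, n] = size(A);
D = diag(d); e = ones(m,1);
f = [zeros(n,1); 0; nu*e; ones(n,1)];                   % nu*e'*y + e'*s
Aineq = [-D*A,    D*e,        -eye(m),    zeros(m,n);   % D(Aw-e*gamma)+y >= e
          eye(n), zeros(n,1), zeros(n,m), -eye(n);      %  w - s <= 0
         -eye(n), zeros(n,1), zeros(n,m), -eye(n)];     % -w - s <= 0
bineq = [-e; zeros(2*n,1)];
lb = [-inf(n,1); -inf; zeros(m,1); zeros(n,1)];         % y >= 0, s >= 0
z = linprog(f, Aineq, bineq, [], [], lb);
w = z(1:n); gamma = z(n+1);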
Breast Cancer Diagnosis Application
97% Tenfold Cross-Validation Correctness
780 Samples: 494 Benign, 286 Malignant
Another Application: Disputed Federalist Papers
Bosch & Smith 1998
56 Hamilton, 50 Madison, 12 Disputed
Generalized Support Vector Machine Motivation
(Nonlinear Kernel Without Mercer Condition)
- Linear SVM with linear separating surface $x'w = \gamma$:
  $$\min_{w,\gamma,y} \ \nu e'y + \|w\|_1 \quad \text{s.t.} \quad D(Aw - e\gamma) + y \geq e, \ y \geq 0$$
- Set $w = A'Du$. Resulting linear surface: $x'A'Du = \gamma$:
  $$\min_{u,\gamma,y} \ \nu e'y + \|u\|_1 \quad \text{s.t.} \quad D(AA'Du - e\gamma) + y \geq e, \ y \geq 0$$
- Replace $AA'$ by an arbitrary nonlinear kernel $K(A, A')$. Resulting nonlinear surface: $K(x', A')Du = \gamma$:
  $$\min_{u,\gamma,y} \ \nu e'y + \|u\|_1 \quad \text{s.t.} \quad D(K(A, A')Du - e\gamma) + y \geq e, \ y \geq 0$$
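Once u and gamma are in hand, a new point x is classified by which side of the nonlinear surface it falls on. A one-line sketch (not from the talk), where K(X, B) is any kernel handle returning the kernel matrix between the rows of X and the rows of B, such as the Gaussian kernel sketched later:

% Sketch: classify x by the sign of K(x', A')*D*u - gamma (names illustrative)
classify = @(x, K, A, D, u, gamma) sign(K(x', A)*D*u - gamma);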
SSVM: Smooth Support Vector Machine
(SVM as Unconstrained Minimization Problem)
Changing to the 2-norm and measuring the margin in $(w, \gamma)$ space gives SVM as an unconstrained minimization problem:
$$\min_{w,\gamma} \ \frac{\nu}{2}\,\|(e - D(Aw - e\gamma))_+\|_2^2 + \frac{1}{2}\,\|(w,\gamma)\|_2^2$$
Smoothing the Plus Function: Integrate the Sigmoid Function
SSVM: The Smooth Support Vector Machine
Smoothing the Plus Function
- Integrating the sigmoid approximation to the step function:
  $$s(x, \alpha) = \frac{1}{1 + \varepsilon^{-\alpha x}},$$
  gives a smooth, excellent approximation to the plus function:
  $$p(x, \alpha) = x + \frac{1}{\alpha}\log(1 + \varepsilon^{-\alpha x}), \quad \alpha > 0,$$
  where $\varepsilon$ is the base of natural logarithms ($e$ being reserved for the vector of ones).
- Replacing the plus function in the nonsmooth SVM by the smooth approximation gives our SSVM:
  $$\min_{w,\gamma} \ \Phi_\alpha(w, \gamma) := \frac{\nu}{2}\,\|p(e - D(Aw - e\gamma), \alpha)\|_2^2 + \frac{1}{2}\,\|(w,\gamma)\|_2^2$$
Newton: Minimize a sequence of quadratic approximations to the strongly convex objective function, i.e., solve a sequence of linear equations in n+1 variables. (Small-dimensional input space.)

Armijo: Shorten the distance between successive iterates so as to generate sufficient decrease in the objective function. (In computational reality, not needed!)

Global Quadratic Convergence: Starting from any point, the iterates are guaranteed to converge to the unique solution at a quadratic rate, i.e., the errors get squared. (Typically 6 to 8 iterations, without an Armijo step.)
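A quick numerical sketch (not from the talk) of how tightly p(x, alpha) hugs the plus function; the maximum gap is log(2)/alpha, attained at x = 0:

% Sketch: the smooth approximation vs. the plus function (x)_+ = max(x,0)
p = @(x, alpha) x + log(1 + exp(-alpha*x))/alpha;
x = linspace(-2, 2, 401);
for alpha = [1 5 25]
    gap = max(abs(p(x, alpha) - max(x, 0)));
    fprintf('alpha = %2d, max gap = %.4f (log(2)/alpha = %.4f)\n', ...
            alpha, gap, log(2)/alpha);
end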
SSVM with a Nonlinear Kernel
Nonlinear Separating Surface in Input Space
Examples of Kernels
Generate Nonlinear Separating Surfaces in Input Space
$A \in R^{m \times n}$, $a \in R^m$, $\mu \in R$, $d$ an integer.
- Polynomial kernel: $(AA' + \mu aa')^d$
- Gaussian (radial basis) kernel: $\varepsilon^{-\mu\|A_i - A_j\|^2}, \ i, j = 1, \ldots, m$
- Neural network kernel: $(AA' + \mu aa')_*$, where $(\cdot)_*: R \to \{0, 1\}$ is the step function applied componentwise
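A loop-free sketch of the Gaussian kernel matrix (assumes MATLAB implicit expansion, R2016b+; the function name is illustrative):

% Sketch: K_ij = epsilon^(-mu*||A_i - B_j||^2) for A (m-by-n), B (l-by-n)
function K = gaussian_kernel(A, B, mu)
sqdist = sum(A.^2, 2) + sum(B.^2, 2)' - 2*A*B';   % squared pairwise distances
K = exp(-mu*sqdist);                              % m-by-l kernel matrix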
LSVM: Lagrangian Support Vector Machine
Dual of SVM
Taking the dual of the SVM formulation:
$$\min_{w,\gamma,y} \ \frac{\nu}{2}\,y'y + \frac{1}{2}(w'w + \gamma^2) \quad \text{s.t.} \quad D(Aw - e\gamma) + y \geq e, \ y \geq 0,$$
gives the following simple dual problem:
$$\min_{0 \leq u \in R^m} \ \frac{1}{2}u'\left(\frac{I}{\nu} + D(AA' + ee')D\right)u - e'u.$$
The variables $(w, \gamma, y)$ of SSVM are related to $u$ by:
$$w = A'Du, \qquad y = \frac{u}{\nu}, \qquad \gamma = -e'Du.$$
LSVM: Lagrangian Support Vector Machine
Dual SVM as Symmetric Linear Complementarity Problem
Defining the two matrices:
$$H = D[A \ \ {-e}], \qquad Q = \frac{I}{\nu} + HH',$$
reduces the dual SVM to:
$$\min_{0 \leq u \in R^m} \ f(u) := \frac{1}{2}u'Qu - e'u.$$
The optimality condition for this dual SVM is the LCP:
$$0 \leq u \perp Qu - e \geq 0,$$
which, by Implicit Lagrangian theory, is equivalent to:
$$Qu - e = ((Qu - e) - \alpha u)_+, \quad \alpha > 0.$$
LSVM Algorithm
Simple & Linearly Convergent – One Small Matrix Inversion
$$u^{i+1} = Q^{-1}\big(e + ((Qu^i - e) - \alpha u^i)_+\big), \quad i = 0, 1, \ldots,$$
where:
$$0 < \alpha < \frac{2}{\nu}.$$
Key Idea: The Sherman-Morrison-Woodbury formula allows the inversion of the extremely large m-by-m matrix Q by merely inverting a much smaller (n+1)-by-(n+1) matrix (recall $H = D[A \ \ {-e}]$ has n+1 columns):
$$\left(\frac{I}{\nu} + HH'\right)^{-1} = \nu\left(I - H\left(\frac{I}{\nu} + H'H\right)^{-1}H'\right).$$
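The identity is easy to sanity-check numerically; a sketch with made-up dimensions:

% Sketch: verify inv(I/nu + H*H') = nu*(I - H*inv(I/nu + H'*H)*H')
m = 500; n = 10; nu = 1;
H = randn(m, n+1);                               % H = D[A -e] is m-by-(n+1)
lhs = inv(eye(m)/nu + H*H');                     % large m-by-m inversion
rhs = nu*(eye(m) - H*((eye(n+1)/nu + H'*H)\H')); % small (n+1)-by-(n+1) solve
disp(norm(lhs - rhs, 'fro'))                     % ~1e-12: the two agree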
LSVM Algorithm – Linear Kernel
11 Lines of MATLAB Code
function [it, opt, w, gamma] = svml(A,D,nu,itmax,tol)
% lsvm with SMW for min 1/2*u'*Q*u-e'*u s.t. u=>0,
% Q=I/nu+H*H', H=D[A -e]
% Input: A, D, nu, itmax, tol; Output: it, opt, w, gamma
% [it, opt, w, gamma] = svml(A,D,nu,itmax,tol);
[m,n]=size(A);alpha=1.9/nu;e=ones(m,1);H=D*[A -e];it=0;
S=H*inv(speye(n+1)/nu+H'*H);          % SMW: only an (n+1)-by-(n+1) inversion
u=nu*(1-S*(H'*e));oldu=u+1;           % initial iterate u = Q\e via SMW
while it<itmax & norm(oldu-u)>tol
    z=(1+pl(((u/nu+H*(H'*u))-alpha*u)-1));   % e + ((Q*u-e)-alpha*u)_+
    oldu=u;
    u=nu*(z-S*(H'*z));                % u = Q\z via SMW
    it=it+1;
end;
opt=norm(u-oldu);w=A'*D*u;gamma=-e'*D*u;
function pl = pl(x); pl = (abs(x)+x)/2;      % the plus function (x)_+
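A hypothetical call on random separable data (the data and parameter values here are made up):

% Sketch: exercise svml on two Gaussian clouds
m = 1000; n = 10; nu = 1;
A = [randn(m/2, n) + 2; randn(m/2, n) - 2];
D = diag([ones(m/2,1); -ones(m/2,1)]);
[it, opt, w, gamma] = svml(A, D, nu, 100, 1e-5);
correctness = mean(sign(A*w - gamma) == diag(D))  % fraction correctly classified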
LSVM Algorithm – Linear Kernel
Computational Results
- 2 million random points in 10-dimensional space
  Classified in 6.7 minutes in 6 iterations to 1e-5 accuracy
  (250 MHz UltraSPARC II with 2 gigabytes of memory; CPLEX ran out of memory)
- 32562 points in 123-dimensional space (UCI Adult dataset)
  Classified in 141 seconds & 55 iterations to 85% correctness
  (400 MHz Pentium II with 2 gigabytes of memory)
  SVMlight classified in 178 seconds & 4497 iterations
LSVM – Nonlinear Kernel
Formulation
For the nonlinear kernel:
$$K(A, B): R^{m \times n} \times R^{n \times l} \to R^{m \times l},$$
the separating nonlinear surface is given by:
$$K([x' \ \ {-1}], G')Du = 0,$$
where u is the solution of the dual problem:
$$\min_{0 \leq u \in R^m} \ f(u) := \frac{1}{2}u'Qu - e'u,$$
with Q redefined as:
$$G = [A \ \ {-e}], \qquad Q = \frac{I}{\nu} + DK(G, G')D.$$
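A sketch of the resulting iteration (not the authors' code): since Q is now a dense m-by-m matrix, the SMW shortcut no longer applies and each step solves with Q directly; kfun(X, B) is any kernel handle returning the kernel matrix between the rows of X and the rows of B, e.g. the gaussian_kernel sketched earlier:

% Sketch: LSVM with a nonlinear kernel; Q = I/nu + D*K(G,G')*D, G = [A -e]
function u = lsvm_nonlinear(A, D, nu, kfun, itmax, tol)
[m, ~] = size(A); e = ones(m,1);
G = [A -e];
Q = eye(m)/nu + D*kfun(G, G)*D;     % m-by-m; no SMW shortcut here
alpha = 1.9/nu;                     % must satisfy 0 < alpha < 2/nu
pl = @(x) (abs(x) + x)/2;           % the plus function
u = Q\e; oldu = u + 1; it = 0;
while it < itmax && norm(oldu - u) > tol
    oldu = u;
    u = Q\(e + pl((Q*oldu - e) - alpha*oldu));
    it = it + 1;
end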
LSVM Algorithm – Nonlinear Kernel Application
100 Iterations, 58 Seconds on Pentium II, 95.9% Accuracy
Reduced Support Vector Machines (RSVM)
Large Nonlinear Kernel Classification Problems
Key idea: Use a rectangular kernel $K(A, \bar{A}')$, where $\bar{A}'$ is a small random sample of $A'$.
- Typically $\bar{A}$ has 1% to 10% of the rows of A
$$\min_{\bar{u},\gamma,y} \ \frac{\nu}{2}\,y'y + \frac{1}{2}(\bar{u}'\bar{u} + \gamma^2) \quad \text{s.t.} \quad D(K(A, \bar{A}')\bar{D}\bar{u} - e\gamma) + y \geq e, \ y \geq 0$$
- Two important consequences:
  - RSVM can solve very large problems
  - The nonlinear separator depends on $\bar{A}$ only; the separating surface is $K(x', \bar{A}')\bar{D}\bar{u} = \gamma$
  - Training on the small square kernel $K(\bar{A}, \bar{A}')$ alone gives lousy results
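A sketch of forming the rectangular kernel from a hypothetical 10% sample (indices and names are illustrative; gaussian_kernel as sketched earlier, with A, D, mu assumed in scope):

% Sketch: rectangular RSVM kernel K(A, Abar') from a random row sample of A
mbar = ceil(0.10*m);                  % 1% to 10% of the rows of A
idx = randperm(m, mbar);
Abar = A(idx, :);                     % the reduced set
Dbar = D(idx, idx);                   % labels of the sampled rows
Kbar = gaussian_kernel(A, Abar, mu);  % m-by-mbar rectangular kernel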
Conventional SVM Result on Checkerboard
Using 50 Random Points Out of 1000
RSVM Result on Checkerboard
Using SAME 50 Random Points Out of 1000
RSVM on Large Classification Problems
Standard Error over 50 Runs = 0.001 to 0.002
RSVM Time = 1.24 * (Random Points Time)
Conclusion
Mathematical programming plays an essential role in SVMs:
- Theory
  - New formulations: generalized SVMs
  - New algorithm-generating concepts: smoothing (SSVM), the implicit Lagrangian (LSVM)
- Algorithms
  - Fast: SSVM
  - Massive: LSVM, RSVM
Future Research
- Theory
  - Concave minimization
  - Concurrent feature & data selection
  - Multiple-instance problems
  - SVMs as complementarity problems
  - Kernel methods in nonlinear programming
- Algorithms
  - Chunking for massive classification: $10^8$ points
  - Multicategory classification algorithms
Talk & Papers Available on Web
www.cs.wisc.edu/~olvi