Mathematical Programming in Support Vector Machines


Incremental Support Vector Machine Classification
Second SIAM International Conference on Data Mining
Arlington, Virginia, April 11-13, 2002
Glenn Fung & Olvi Mangasarian
Data Mining Institute
University of Wisconsin - Madison
Key Contributions
Fast incremental classifier based on PSVM
 Proximal Support Vector Machine
Capable of modifying an existing linear classifier by
both adding and retiring data
Extremely simple to implement
 Small memory requirement
 Even for huge problems (1 billion points)
 NO optimization packages (LP, QP) needed
Outline of Talk
 (Standard) Support vector machines (SVM)
 Classification by halfspaces
 Proximal linear support vector machines (PSVM)
 Classification by proximity to planes
The incremental and decremental algorithm
 Option of keeping or retiring old data
Numerical results
1 billion points in 10-dimensional space classified
in less than 3 hours!
Numerical results confirm that algorithm time is
linear in the number of data points
Support Vector Machines
Maximizing the Margin between Bounding
Planes
[Figure: the classes A+ and A- separated by the bounding planes x'w = γ + 1 and x'w = γ - 1 with normal w; the margin between the planes is 2/||w||.]
Proximal Support Vector Machines
Fitting the Data using two parallel
Bounding Planes
[Figure: the classes A+ and A- clustered around the two parallel proximal planes x'w = γ + 1 and x'w = γ - 1 with normal w; the distance between the planes is 2/||w||.]
Standard Support Vector Machine
Algebra of 2-Category Linearly Separable Case
 Given m points in n dimensional space
 Represented by an m-by-n matrix A
 Membership of each A_i in class +1 or -1 specified by:
 An m-by-m diagonal matrix D with +1 & -1 entries
 Separate by two bounding planes, x'w = γ ± 1:
   A_i w ≥ γ + 1, for D_ii = +1;
   A_i w ≤ γ - 1, for D_ii = -1.
 More succinctly:
   D(Aw - eγ) ≥ e,
where e is a vector of ones.
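As a concrete reading of this notation, here is a tiny MATLAB check (the data, candidate plane, and values are illustrative assumptions, not from the talk) that a given (w, γ) satisfies the bounding-plane conditions:

A = [2 2; 3 1; -2 -2; -1 -3];   % four points in 2-D, one row A_i per point
D = diag([1 1 -1 -1]);          % class memberships on the diagonal
w = [1; 1]; gamma = 0;          % a candidate separating plane x'w = gamma
e = ones(4,1);
all(D*(A*w - e*gamma) >= e)     % returns 1: every point lies outside its bounding plane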
Standard Support Vector Machine
Formulation
 Solve the quadratic program for some ν > 0:

   min_{y,w,γ}  (ν/2)||y||² + (1/2)||(w,γ)||²
   s.t.  D(Aw - eγ) + y ≥ e                    (QP)

where D_ii = ±1 denotes A+ or A- membership.
 Margin is maximized by minimizing (1/2)||(w,γ)||²
PSVM Formulation
We have from the standard QP SVM formulation:

  min_{w,γ}  (ν/2)||y||² + (1/2)||(w,γ)||²
  s.t.  D(Aw - eγ) + y = e                     (QP)

Solving for y in terms of w and γ gives the unconstrained problem:

  min_{w,γ}  (ν/2)||e - D(Aw - eγ)||² + (1/2)||(w,γ)||²

This simple but critical modification changes the nature
of the optimization problem tremendously!
Advantages of New Formulation
 Objective function remains strongly convex.
 An explicit exact solution can be written in terms
of the problem data.
 PSVM classifier is obtained by solving a single
system of linear equations in the usually small
dimensional input space.
 Exact leave-one-out correctness can be obtained in
terms of the problem data.
Linear PSVM
We want to solve:

  min_{w,γ}  (ν/2)||e - D(Aw - eγ)||² + (1/2)||(w,γ)||²

Setting the gradient with respect to (w, γ) equal to zero gives a
nonsingular system of linear equations.
Solution of the system gives the desired PSVM
classifier.
Linear PSVM Solution
  [w; γ] = (I/ν + H'H)^(-1) H'De

Here, H = [A  -e].
 The linear system to solve depends on H'H,
which is of size (n + 1) × (n + 1)
 n is usually much smaller than m
Linear Proximal SVM Algorithm
Input: A, D
Define: H = [A  -e]
Calculate: v = H'De
Solve: (I/ν + H'H) [w; γ] = v
Classifier: sign(w'x - γ)
Linear & Nonlinear PSVM MATLAB Code
function [w, gamma] = psvm(A,d,nu)
% PSVM: linear and nonlinear classification
% INPUT: A, d=diag(D), nu. OUTPUT: w, gamma
% [w, gamma] = psvm(A,d,nu);
[m,n]=size(A);e=ones(m,1);H=[A -e];
v=(d'*H)';                 % v=H'*D*e;
r=(speye(n+1)/nu+H'*H)\v;  % solve (I/nu+H'*H)r=v
w=r(1:n);gamma=r(n+1);     % getting w,gamma from r
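A minimal usage sketch (the synthetic data, nu value, and variable names below are illustrative assumptions, not from the talk):

A = [randn(50,2)+1; randn(50,2)-1];   % 100 points in 2-D, two shifted clusters
d = [ones(50,1); -ones(50,1)];        % class labels +1 / -1
nu = 1;                               % weight on the error term
[w,gamma] = psvm(A,d,nu);             % train the linear PSVM
labels = sign(A*w - gamma);           % classify the training points
trainingCorrectness = sum(labels==d)/length(d)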
Incremental PSVM Classification
 Suppose we have two “blocks” of data A^1 ∈ R^(m1 × n) and A^2 ∈ R^(m2 × n).

Define E^1 = [A^1  -e] and E^2 = [A^2  -e], and stack them:

  E = [E^1; E^2],  so that  E'E = (E^1)'E^1 + (E^2)'E^2

and the PSVM solution becomes

  [w; γ] = (I/ν + (E^1)'E^1 + (E^2)'E^2)^(-1) ((E^1)'D^1 e + (E^2)'D^2 e)

 The linear system to solve depends on the compressed blocks (E^1)'E^1 and (E^2)'E^2,
which are of the size (n + 1) × (n + 1)
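A minimal sketch of this two-block update, with a check that it matches training on all of the data at once (block sizes, data, and variable names are illustrative assumptions):

A1 = randn(1000,10); d1 = sign(randn(1000,1));   % hypothetical block 1 and its labels
A2 = randn(1000,10); d2 = sign(randn(1000,1));   % hypothetical block 2 and its labels
nu = 1; n = 10;
E1 = [A1 -ones(1000,1)]; E2 = [A2 -ones(1000,1)];
EtE = E1'*E1 + E2'*E2;            % sum of the (n+1) x (n+1) compressed blocks
v   = (d1'*E1)' + (d2'*E2)';      % = E1'*D1*e + E2'*D2*e
r = (speye(n+1)/nu + EtE)\v;
w = r(1:n); gamma = r(n+1);
[wAll,gAll] = psvm([A1;A2],[d1;d2],nu);   % batch solution for comparison
norm([w;gamma] - [wAll;gAll])             % essentially zero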
Linear Incremental Proximal SVM
Algorithm
Initialization: E'E = 0; d = 0; i = 1

Repeat until i = i_max:
  Read A^i, D^i from disk
  Compute E^i = [A^i  -e] and d^i = (E^i)'D^i e
  Update in memory: E'E = E'E + (E^i)'E^i   (an (n + 1) × (n + 1) matrix)
                    d = d + d^i             (an (n + 1) × 1 vector)
  Discard: A^i, D^i, E^i, d^i;  Keep: E'E, d
  i = i + 1

When i = i_max, compute the output classifier w, γ from E'E and d.
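A minimal sketch of this loop (the block loader, file layout, and variable names are illustrative assumptions; the slides do not prescribe a storage format):

nu = 1; n = 10; imax = 500;
EtE = zeros(n+1); v = zeros(n+1,1);    % all that is kept in memory between blocks
for i = 1:imax
    [Ai, di] = load_block(i);          % hypothetical loader: one block and its +1/-1 labels
    Ei = [Ai -ones(size(Ai,1),1)];
    EtE = EtE + Ei'*Ei;                % (n+1) x (n+1) update
    v   = v + (di'*Ei)';               % (n+1) x 1 update, equals Ei'*Di*e
    clear Ai di Ei                     % discard the block; keep only EtE and v
end
r = (speye(n+1)/nu + EtE)\v;
w = r(1:n); gamma = r(n+1);            % classifier: sign(x'*w - gamma)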
Linear Incremental Proximal SVM
Adding – Retiring Data
 Capable of modifying an existing linear classifier by
both adding and retiring data
 Option of retiring old data is similar to adding new
data (see the sketch after this list)
 Financial Data: old data is obsolete
 Option of keeping old data and merging it with the new
data:
 Medical Data: old data does not obsolesce.
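A minimal sketch of retiring a block, assuming its compressed quantities were saved when the block was added (all names here are illustrative):

% when block j was added, its small compressed quantities were stored:
%   EtE_j = Ej'*Ej;   v_j = (dj'*Ej)';
% retiring block j simply subtracts them back out of the running sums:
EtE = EtE - EtE_j;
v   = v   - v_j;
r = (speye(n+1)/nu + EtE)\v;           % re-solve the small (n+1) x (n+1) system
w = r(1:n); gamma = r(n+1);            % updated classifier without block j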
Numerical Experiments
One-Billion Two-Class Dataset
 Synthetic dataset consisting of 1 billion points in 10-dimensional input space
 Generated by NDC (Normally Distributed Clustered)
dataset generator
Dataset divided into 500 blocks of 2 million points
each.
Solution obtained in less than 2 hours and 26 minutes
 About 30% of the time was spent reading data from
disk.
Testing set correctness: 90.79%
Numerical Experiments
Simulation of Two-month 60-Million Dataset
 Synthetic dataset consisting of 60 million points (1
million per day) in 10-dimensional input space
 Generated using NDC
 At the beginning, we only have data corresponding to
the first month
 Every day:
 The oldest block of data is retired (1 million points)
 A new block is added (1 million points)
 A new linear classifier is calculated daily
 Only an 11 by 11 matrix is kept in memory at the end
of each day. All other data is purged.
Numerical Experiments
Separator changing through time
Numerical Experiments
Normals to the separating hyperplanes corresponding to 5-day intervals
Conclusion
 The proposed algorithm is an extremely simple
procedure for generating linear classifiers in an
incremental fashion for huge datasets.
 The linear classifier is obtained by solving a single
system of linear equations in the small dimensional
input space.
 The proposed algorithm has the ability to retire old
data and add new data in a very simple manner.
 Only a matrix of the size of the input space is kept
in memory at any time.
Future Work
 Extension to nonlinear classification
 Parallel formulation and implementation on
remotely located servers for massive datasets
 Real-time online applications, e.g., fraud detection