Online Learning of Maximum Margin Classifiers


Online Learning of
Maximum p-Norm Margin Classifiers
with Bias
Kohei HATANO
Kyushu University
(Joint work with K. Ishibashi and M. Takeda)
COLT 2008
Plan of this talk
1. Introduction
2. Preliminaries
– ROMMA
3. Our result
– Our new algorithm PUMMA
– Our implicit reduction
4. Experiments
Maximum Margin Classification
• SVMs [Boser et al. 92]
– 2-norm margin
• Boosting [Freund & Schapire 97]
– ∞-norm margin (approximately)
• Why maximum (or large) margin?
– Good generalization [Schapire et al. 98] [Shawe-Taylor et al. 98]
– Formulated as convex optimization problems (QP, LP)
Scaling up Max. Margin Classification
1. Decomposition Methods (for SVMs)
– Break original QP into smaller QPs
– SMO [Platt 99], SVMlight [Joachims 99], LIBSVM [Chang & Lin 01]
– state-of-the-art implementations
2. Online Learning (our approach)
Online Learning
Online Learning Algorithm
For t = 1 to T
1. Receive an instance x_t ∈ R^n
2. Guess a label ŷ_t = sign(w_t · x_t + b_t)
3. Receive the label y_t ∈ {-1, +1}
4. Update (w_{t+1}, b_{t+1}) = UPDATE_RULE(w_t, b_t, x_t, y_t)
end
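The protocol above is short enough to write out; here is a minimal Python sketch. The loop mirrors the slide, while perceptron_update is only a placeholder rule of my own (not PUMMA's update, which appears later in the talk):

```python
import numpy as np

def online_learn(stream, n, update_rule):
    """Generic online learning loop: predict, receive the label, update."""
    w, b = np.zeros(n), 0.0
    mistakes = 0
    for x_t, y_t in stream:                          # x_t in R^n, y_t in {-1, +1}
        y_hat = 1.0 if w @ x_t + b >= 0 else -1.0    # guess a label
        if y_hat != y_t:
            mistakes += 1
        w, b = update_rule(w, b, x_t, y_t)           # UPDATE_RULE from the slide
    return w, b, mistakes

def perceptron_update(w, b, x_t, y_t):
    """Placeholder perceptron-style rule, used here only to make the loop runnable."""
    if y_t * (w @ x_t + b) <= 0:
        return w + y_t * x_t, b + y_t
    return w, b
```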
Advantages of Online Learning
• Simple & easy to implement
• Uses less memory
• Adaptive for changing concepts
Online Learning Algorithms for maximum margin classification
• Max Margin Perceptron [Kowalczyk 00]
• ROMMA [Li & Long 02]
• ALMA [Gentile 01]
• LASVM [Bordes et al. 05]
• MICRA [Tsampouka & Shawe-Taylor 07]
• Pegasos [Shalev-Shwartz et al. 07]
• Etc.
[Figure: a hyperplane with bias vs. a hyperplane through the origin (without bias)]
Most of these online algorithms cannot learn a hyperplane with bias!
Typical Reduction to deal with bias [Cf. Cristianini & Shawe-Taylor 00]
Add an extra dimension corresponding to the bias.

Original space: instance x_j ∈ R^n, hyperplane (u, b), with R = max_j ||x_j||.
Augmented space: x~_j = (x_j, R) ∈ R^{n+1}, u~ = (u, b/R), with R~ = max_j ||x~_j||.

NOTE: u · x_j + b = u~ · x~_j (u~ is equivalent to (u, b)).

Margin (over normalized instances):
γ = min_j y_j (u · x_j + b) / (||u|| R)   ↔   γ~ = min_j y_j (u~ · x~_j) / (||u~|| R~)

This reduction weakens the guarantee on the margin:
γ / 2 ≤ γ~ ≤ γ
→ it might cause a significant difference in generalization!
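A minimal sketch of this standard augmentation (the helper name augment is mine), assuming the instances are the rows of a matrix X:

```python
import numpy as np

def augment(X):
    """Map each x_j in R^n to x~_j = (x_j, R) in R^{n+1}, with R = max_j ||x_j||."""
    R = np.linalg.norm(X, axis=1).max()
    return np.hstack([X, np.full((X.shape[0], 1), R)]), R

# A weight vector u~ learned on the augmented data corresponds to the biased
# hyperplane (u, b) with u = u~[:-1] and b = u~[-1] * R, since u~ . x~_j = u . x_j + b.
```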
Our New Online Learning Algorithm
PUMMA (P-norm Utilizing Maximum Margin Algorithm)
• PUMMA can learn maximum margin classifiers
with bias directly (without using the typical reduction!).
• Margin is defined as p-norm (p≥2)
– For p=2, similar to Perceptron.
– For p=O(ln n) [Gentile ’03], similar to Winnow [Littlestone ‘89].
Fast when the target is sparse.
• Extended to linearly inseparable case (omitted).
– Soft margin with 2-norm slack variables.
Problem of finding the p-norm maximum
margin hyperplane [Cf. Mangasarian 99]
Given: (linearly separable) S=((x1,y1),…,(xT,yT)),
Goal: Find an approximate solution of (w*,b*)
(w*, b*) = argmin_{w,b} (1/2) ||w||_q^2
sub. to: y_j (w · x_j + b) ≥ 1   (j = 1, ..., T)

where q is the dual norm of p (1/p + 1/q = 1), e.g., p = 2 → q = 2, p = ∞ → q = 1.
We want an online algorithm that solves this problem with a small number of updates.
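For the special case p = q = 2 the problem above is an ordinary QP; a quick illustrative sketch using cvxpy (an off-the-shelf solver, not something the talk itself uses):

```python
import cvxpy as cp

def max_margin_p2(X, y):
    """Solve min (1/2)||w||_2^2  s.t.  y_j (w . x_j + b) >= 1 for all j (p = q = 2)."""
    n = X.shape[1]
    w, b = cp.Variable(n), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value, b.value
```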
ROMMA (Relaxed Online Maximum Margin Algorithm) [Li & Long '02]
Given: S = ((x_1, y_1), …, (x_{t-1}, y_{t-1})), x_t,
1. Predict ŷ_t = sign(w_t · x_t), and receive y_t
2. If y_t (w_t · x_t) < 1 - δ (the margin is "insufficient"), update:
   w_{t+1} = argmin_w (1/2) ||w||_2^2
   sub. to: y_t (w · x_t) ≥ 1            (constraint over the last example which causes an update)
            w · w_t ≥ ||w_t||_2^2         (constraint over the last hyperplane)
   (2 constraints only!)
3. Otherwise, w_{t+1} = w_t
NOTE: the bias is fixed to 0.
ROMMA [Li & Long '02] vs. SVM (without bias)

SVM (without bias):
min_w (1/2) ||w||_2^2   sub. to: y_j (w · x_j) ≥ 1   (j = 1, ..., 4)

ROMMA:
min_w (1/2) ||w||_2^2   sub. to: y_t (w · x_t) ≥ 1,   w · w_{t-1} ≥ ||w_{t-1}||_2^2

[Figure: in weight space, the ROMMA iterates w_1, w_2, w_3, w_4 move toward w_SVM, which lies in the feasible region of the SVM problem.]
Solution of ROMMA
The solution of ROMMA is an additive update:
(i) If w_{t+1} · w_t ≥ ||w_t||_2^2 (the constraint over the last hyperplane is inactive), then
    w_{t+1} = α y_t x_t,  where α = 1 / ||x_t||_2^2.
(ii) Otherwise,
    w_{t+1} = α y_t x_t + β w_t,  where
    α = ||w_t||_2^2 (1 - y_t (w_t · x_t)) / (||w_t||_2^2 ||x_t||_2^2 - (w_t · x_t)^2),
    β = (||w_t||_2^2 ||x_t||_2^2 - y_t (w_t · x_t)) / (||w_t||_2^2 ||x_t||_2^2 - (w_t · x_t)^2).
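A small Python sketch of this update, based on my reconstruction of the two cases above (the β formula follows the ROMMA paper and may differ cosmetically from the slide):

```python
import numpy as np

def romma_update(w, x, y, delta=0.0):
    """One ROMMA step: update only when the margin y(w . x) < 1 - delta."""
    if y * (w @ x) >= 1 - delta:
        return w                                   # margin sufficient: keep w
    w_i = (y / (x @ x)) * x                        # case (i): only y(w . x) >= 1 binds
    if w_i @ w >= w @ w:
        return w_i
    denom = (w @ w) * (x @ x) - (w @ x) ** 2       # case (ii): both constraints active
    alpha = (w @ w) * (1 - y * (w @ x)) / denom
    beta = ((w @ w) * (x @ x) - y * (w @ x)) / denom
    return alpha * y * x + beta * w
```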
PUMMA
Given: S = ((x_1, y_1), …, (x_{t-1}, y_{t-1})), x_t,
1. Predict ŷ_t = sign(w_t · x_t + b_t), and receive y_t
2. If y_t (w_t · x_t + b_t) < 1 - δ, update:
   (w_{t+1}, b_{t+1}) = argmin_{w,b} (1/2) ||w||_q^2      (q-norm, 1/p + 1/q = 1)
   sub. to: w · x_t^pos + b ≥ 1,
            w · x_t^neg + b ≤ -1,
            w · f(w_t) ≥ ||w_t||_q^2
3. Otherwise, w_{t+1} = w_t
Here x_t^pos and x_t^neg are the last positive and negative examples which incur updates, and the bias b is optimized (unlike ROMMA, whose corresponding constraint is w · w_t ≥ ||w_t||_2^2 with the bias fixed to 0).
f is the link function [Grove et al. 97]:
   f(w)_i = sign(w_i) |w_i|^(q-1) / ||w||_q^(q-2)
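The link function is easy to compute; a small NumPy sketch:

```python
import numpy as np

def link(w, q):
    """f(w)_i = sign(w_i) |w_i|^(q-1) / ||w||_q^(q-2)  (the link of Grove et al. 97)."""
    norm_q = np.linalg.norm(w, ord=q)
    return np.sign(w) * np.abs(w) ** (q - 1) / norm_q ** (q - 2)

# For q = 2 the link is the identity, so the p = 2 case of PUMMA works on w directly.
```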
Solution of PUMMA
The solution of PUMMA is found numerically:
(i) If w_{t+1} · f(w_t) ≥ ||w_t||_q^2 (the constraint over the last hypothesis is inactive), then
    w_{t+1} = α z_t,  where α = 2 / ||z_t||_2^2 and z_t = x_t^pos - x_t^neg.
(ii) Otherwise,
    w_{t+1} = α z_t + β w_t,  where
    (α, β) = argmin_{α,β} (1/2) ||α z_t + β f(w_t)||_p^2 - 2α - β ||f(w_t)||_p^2,
    which is solved by the Newton method.
In either case,
    b_{t+1} = -(w_{t+1} · x_t^pos + w_{t+1} · x_t^neg) / 2,
where x_t^pos and x_t^neg are the last positive and negative examples which incur updates.

Observation: For p = 2, the solution is the same as that of ROMMA for z_t = x_t^pos - x_t^neg.
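Using this observation, the p = 2 step can be sketched concretely. The closed form for case (ii) below is my own derivation from the two active constraints (w · z_t = 2 and w · w_t = ||w_t||_2^2), so treat it as an illustration rather than the authors' code:

```python
import numpy as np

def pumma2_update(w, x_pos, x_neg):
    """One PUMMA step for p = 2, working on z = x_pos - x_neg, then recovering b."""
    z = x_pos - x_neg
    w_new = (2.0 / (z @ z)) * z                    # case (i): only w . z >= 2 binds
    if w_new @ w < w @ w:                          # case (ii): both constraints active
        denom = (w @ w) * (z @ z) - (w @ z) ** 2
        alpha = (w @ w) * (2.0 - (w @ z)) / denom
        beta = ((z @ z) * (w @ w) - 2.0 * (w @ z)) / denom
        w_new = alpha * z + beta * w
    b_new = -(w_new @ x_pos + w_new @ x_neg) / 2.0  # hyperplane midway between the pair
    return w_new, b_new
```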
Our (implicit) reduction which preserves the margin

Thm. w~ = w*, where

(w*, b*) = argmin_{w,b} (1/2) ||w||_2^2          (hyperplane with bias)
sub. to: w · x_i^pos + b ≥ 1    (i = 1, ..., P)
         w · x_j^neg + b ≤ -1   (j = 1, ..., N)

w~ = argmin_w (1/2) ||w||_2^2                    (hyperplane without bias, over pairs of positive and negative instances)
sub. to: w · (x_i^pos - x_j^neg) ≥ 2   (i = 1, ..., P, j = 1, ..., N)

PUMMA implicitly runs ROMMA over pairs of positive and negative instances in an efficient way!
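The theorem lends itself to a quick numerical sanity check on a toy separable dataset; a sketch for p = 2, again using cvxpy purely for illustration:

```python
import cvxpy as cp
import numpy as np

def with_bias(Xp, Xn):
    """min (1/2)||w||_2^2  s.t.  w . x_i^pos + b >= 1,  w . x_j^neg + b <= -1."""
    w, b = cp.Variable(Xp.shape[1]), cp.Variable()
    cons = [Xp @ w + b >= 1, Xn @ w + b <= -1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), cons).solve()
    return w.value

def over_pairs(Xp, Xn):
    """min (1/2)||w||_2^2  s.t.  w . (x_i^pos - x_j^neg) >= 2 for every pair (i, j)."""
    Z = (Xp[:, None, :] - Xn[None, :, :]).reshape(-1, Xp.shape[1])
    w = cp.Variable(Xp.shape[1])
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), [Z @ w >= 2]).solve()
    return w.value

# On any linearly separable toy set, with_bias and over_pairs should return
# (numerically) the same weight vector, as the theorem states.
```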
Main Result
Thm.
• Suppose that, for the given S = ((x_1, y_1), …, (x_T, y_T)), there exists a linear classifier (u, b) s.t. y_t (u · x_t + b) ≥ 1 for t = 1, …, T.
• (# of updates of PUMMA_p(δ)) ≤ (p - 1) ||u||_q^2 R^2 / δ^2, where R = max_{t=1,…,T} ||x_t||_p
  (similar to the bounds of previous algorithms).
• After (p - 1) ||u||_q^2 R^2 / δ^2 updates, PUMMA_p(δ) outputs a hypothesis with p-norm margin ≥ (1 - δ) γ (γ: the margin of (u, b)).
Experiment over artificial data
• example (x, y)
  – x: n(=100)-dimensional {-1, +1}-valued vector
  – y = f(x), where f(x) = sign(x_1 + x_2 + ⋯ + x_16 + b)
• generate 1000 examples randomly
• 3 datasets (b = 1 (small), 9 (medium), 15 (large))
• Compare with ROMMA (p = 2) and ALMA (p = 2 ln n).
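A sketch of this data generator (the function name is mine, and the sign of b is my reading of the slide, which the extraction left ambiguous):

```python
import numpy as np

def make_dataset(m=1000, n=100, k=16, b=1, seed=0):
    """Random {-1,+1}^n vectors labeled by a halfspace over the first k coordinates."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(m, n))
    y = np.sign(X[:, :k].sum(axis=1) + b)   # target f(x) = sign(x_1 + ... + x_k + b)
    return X, y
```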
Results over Artificial Data
[Figure: margin vs. # of updates. Left panel (p = 2): PUMMA vs. ROMMA. Right panel (p = 2 ln n): PUMMA vs. ALMA. Curves are shown for the datasets with b = 1 and b = 15.]
NOTE 1: the margin is defined over the original space (w/o the reduction).
NOTE 2: We omit the results for b = 9 for clarity.
Computation Time
[Figure: computation time (sec.) vs. bias (from large to small). Left panel (p = 2): PUMMA vs. ROMMA. Right panel (p = 2 ln n): PUMMA vs. ALMA.]
For p = 2, PUMMA is faster than ROMMA.
For p = 2 ln n, PUMMA is faster than ALMA, even though PUMMA uses the Newton method.
Results over UCI Adult data (# of data: 32561)

algorithm      sec.     margin rate (%)
SVMlight       5893     100
ROMMA (99%)    71296    99.03
PUMMA (99%)    44480    99.14

• Fix p = 2.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
Results over MNIST data

algorithm      sec.       margin rate (%)
SVMlight       401.36     100
ROMMA (99%)    1715.57    93.5
PUMMA (99%)    1971.30    99.2

• Fix p = 2.
• Use polynomial kernels.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
Summary
• PUMMA can learn p-norm maximum margin
classifiers with bias directly.
– The # of updates is similar to that of previous algorithms.
– Achieves (1 - δ) times the maximum p-norm margin.
• PUMMA outperforms other online algorithms when the underlying hyperplane has a large bias.
Future work
• Maximizing ∞-norm margin directly.
• Tighter bounds on the # of updates:
– In our experiments, PUMMA is faster especially when the bias is large (like WINNOW).
– Our current bound does not reflect this fact.