Online Learning of Maximum Margin Classifiers
Online Learning of
Maximum p-Norm Margin Classifiers
with Bias
Kohei HATANO
Kyushu University
(Joint work with K. Ishibashi and M. Takeda)
COLT 2008
Plan of this talk
1. Introduction
2. Preliminaries
– ROMMA
3. Our result
– Our new algorithm PUMMA
– Our implicit reduction
4. Experiments
Maximum Margin Classification
• SVMs [Boser et al. 92]
– 2-norm margin
• Boosting [Freund&Schapire 97]
– ∞-norm margin (approximately)
• Why maximum (or large)
margin?
– Good generalization
[Schapire et al. 98]
[Shawe-Taylor et al. 98]
– Formulated as convex
optimization problems (QP, LP)
Scaling up Max. Margin Classification
1. Decomposition Methods (for SVMs)
– Break original QP into smaller QPs
– SMO [Platt 99], SVMlight [Joachims 99], LIBSVM [Chang & Lin 01]
– State-of-the-art implementations
2. Online Learning (our approach)
Online Learning
Online Learning Algorithm
For t=1 to T
1. Receive an instance xt in Rn
2. Guess a label ŷt=sign(wt ∙ xt+bt)
3. Receive the label yt in {-1,1}
4. Update
(wt+1,bt+1)=UPDATE_RULE(wt,bt,xt,yt)
end
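The protocol above can be sketched in a few lines. This is a minimal illustration, not any specific algorithm from the talk: the perceptron-style rule stands in for UPDATE_RULE, and the data and function names are our own.

```python
import numpy as np

def sign(v):
    # Predict +1 on the boundary so labels stay in {-1, +1}.
    return 1.0 if v >= 0 else -1.0

def online_learn(stream, n, update_rule):
    """Run the online protocol: receive x_t, guess, receive y_t, update."""
    w, b = np.zeros(n), 0.0
    mistakes = 0
    for x_t, y_t in stream:
        y_hat = sign(np.dot(w, x_t) + b)     # step 2: guess the label
        if y_hat != y_t:
            mistakes += 1
        w, b = update_rule(w, b, x_t, y_t)   # step 4: update
    return w, b, mistakes

def perceptron_update(w, b, x_t, y_t):
    # Stand-in UPDATE_RULE: classic perceptron with bias.
    if y_t * (np.dot(w, x_t) + b) <= 0:
        return w + y_t * x_t, b + y_t
    return w, b

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(200, 5))
y = np.sign(X[:, 0] + X[:, 1] + 0.5)         # separable toy labels
w, b, m = online_learn(zip(X, y), 5, perceptron_update)
```

Any of the algorithms discussed later (ROMMA, PUMMA, ALMA, …) differ only in their UPDATE_RULE.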
Advantages of Online Learning
• Simple & easy to implement
• Uses less memory
• Adaptive for changing concepts
Online Learning Algorithms
for maximum margin classification
• Max Margin Perceptron [Kowalczyk 00]
• ROMMA [Li & Long 02]
• ALMA [Gentile 01]
• LASVM [Bordes et al. 05]
• MICRA [Tsampouka & Shawe-Taylor 07]
• Pegasos [Shalev-Shwartz et al. 07]
• Etc.
[Figure: a hyperplane with bias vs. a hyperplane without bias (through the origin).]
Most online algorithms cannot learn a hyperplane with bias!
Typical Reduction to deal with bias
[Cf. Cristianini & Shawe-Taylor 00]
Adding an extra dimension corresponding to the bias.

Original space:
  instance x_j ∈ R^n, hyperplane (u, b), R = max_j ||x_j||
Augmented space:
  instance x~_j = (x_j, R) ∈ R^{n+1}, hyperplane u~ = (u, b/R), R~ = max_j ||x~_j||

NOTE: u·x_j + b = u~·x~_j (so u~ is equivalent to (u, b))

margin (over normalized instances):
  γ = min_j y_j(u·x_j + b) / (||u|| R)   ↔   γ~ = min_j y_j(u~·x~_j) / (||u~|| R~)

This reduction weakens the guarantee on the margin:
  γ ≥ γ~ ≥ γ/2
→ it might cause a significant difference in generalization!
Our New Online Learning Algorithm
PUMMA(P-norm Utilizing Maximum Margin Algorithm)
• PUMMA can learn maximum margin classifiers
with bias directly (without using the typical reduction!).
• Margin is defined as p-norm (p≥2)
– For p=2, similar to Perceptron.
– For p=O(ln n) [Gentile ’03], similar to Winnow [Littlestone ‘89].
Fast when the target is sparse.
• Extended to linearly inseparable case (omitted).
– Soft margin with 2-norm slack variables.
Problem of finding the p-norm maximum
margin hyperplane [Cf. Mangasarian 99]
Given: (linearly separable) S=((x1,y1),…,(xT,yT)),
Goal: Find an approximate solution of (w*,b*)
(w*, b*) = argmin_{w,b} (1/2)||w||_q^2
sub. to: y_j(w·x_j + b) ≥ 1   (j = 1,…,T)

q-norm (dual norm): 1/p + 1/q = 1
E.g. p=2 → q=2; p=∞ → q=1.
We want an online alg. solving the problem
with small # of updates.
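Since the dual-norm pairing may be unfamiliar, here is a quick sketch verifying Hölder's inequality |w·x| ≤ ||w||_q ||x||_p for the (p, q) pairs on the slide (random vectors are our own test data):

```python
import numpy as np

# Dual norm pairs with 1/p + 1/q = 1; Hoelder: |w.x| <= ||w||_q ||x||_p.
def dual(p):
    # q = p/(p-1); for p = infinity the dual is the 1-norm.
    return p / (p - 1) if p != np.inf else 1.0

rng = np.random.default_rng(2)
w, x = rng.normal(size=8), rng.normal(size=8)
for p in [2.0, 4.0, np.inf]:
    q = dual(p)
    assert abs(w @ x) <= np.linalg.norm(w, q) * np.linalg.norm(x, p) + 1e-12
```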
ROMMA
(Relaxed Online Maximum Margin Algorithm)[Li&Long,’02]
Given: S=((x1,y1),…,(xt-1,yt-1)), xt,
1. Predict ŷt=sign(wt∙xt), and receive yt
2. If yt(wt ·xt )<1-δ (margin is “insufficient”),
3. update:
wt+1 = argmin_w (1/2)||w||_2^2
sub. to: yt(w·xt) ≥ 1,          (constraint over the last example which causes an update)
         w·wt ≥ ||wt||_2^2     (constraint over the last hyperplane)
— only 2 constraints!
4. Otherwise, wt+1 = wt
NOTE: bias is fixed to 0
ROMMA [Li&Long,’02]
SVM (without bias)
feasible region of SVM
1 2
min. w 2
w
2
sub.to : y j ( w x j ) 1,
weght space
wSVM
w2
4
ROMMA
w3
w1
1
3
0
2
(j 1,...,4)
1
2
min. w 2
w
2
sub.to : y t (w xt ) 1,
w w t-1 w
2
2
Solution of ROMMA
Solution of ROMMA is an additive update:
(i) If wt+1 = α yt xt with α = 1/||xt||_2^2 already satisfies w·wt ≥ ||wt||_2^2,
    then wt+1 = α yt xt.
(ii) Otherwise,
    wt+1 = α yt xt + β wt, where
    α = ||wt||_2^2 (1 − yt(wt·xt)) / (||wt||_2^2 ||xt||_2^2 − (wt·xt)^2),
    β = (||wt||_2^2 ||xt||_2^2 − yt(wt·xt)) / (||wt||_2^2 ||xt||_2^2 − (wt·xt)^2).
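The closed-form update can be written as a short function. A minimal sketch, assuming case (ii) where both constraints are tight, and excluding the degenerate case xt parallel to wt (the function name is ours):

```python
import numpy as np

def romma_update(w, x, y):
    """One relaxed ROMMA update (bias fixed to 0).
    Closed-form solution of
        min (1/2)||w'||_2^2  s.t.  y (w'.x) >= 1,  w'.w >= ||w||_2^2.
    Assumes both constraints are active and x is not parallel to w."""
    if np.allclose(w, 0):
        return (y / np.dot(x, x)) * x          # first update: scaled example
    wx, ww, xx = np.dot(w, x), np.dot(w, w), np.dot(x, x)
    denom = ww * xx - wx ** 2                  # > 0 by Cauchy-Schwarz
    alpha = ww * (1.0 - y * wx) / denom
    beta = (ww * xx - y * wx) / denom
    return alpha * y * x + beta * w

# The new weight vector makes both constraints tight (up to rounding).
w = np.array([0.5, -0.2])
x, y = np.array([1.0, 2.0]), 1.0
w_new = romma_update(w, x, y)
```

After the update, yt(wt+1·xt) = 1 and wt+1·wt = ||wt||_2^2 hold exactly, which is what makes the relaxed problem solvable in closed form.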
PUMMA
Given: S=((x1,y1),…,(xt-1,yt-1)), xt,
1. Predict ŷt = sign(wt∙xt + bt), and receive yt
2. If yt(wt·xt + bt) < 1−δ, update:
   (wt+1, bt+1) = argmin_{w,b} (1/2)||w||_q^2   (q-norm, 1/p + 1/q = 1)
   sub. to: w·xtpos + b ≥ 1,
            −(w·xtneg + b) ≥ 1,
            w·f(wt) ≥ ||wt||_q^2
   (bias is optimized; xtpos, xtneg: the last positive and negative examples which incur updates)
3. Otherwise, (wt+1, bt+1) = (wt, bt)

link function [Grove et al. 97]:
f(w)_i = sign(w_i)|w_i|^(q−1) / ||w||_q^(q−2)
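The link function f can be sketched directly from its definition, together with two sanity checks: w·f(w) = ||w||_q^2 (so the third constraint generalizes ROMMA's w·wt ≥ ||wt||_2^2), and f is the identity for q = 2. The test vector is an arbitrary assumption.

```python
import numpy as np

def link(w, q):
    """p-norm link function f [Grove et al. 97]:
    f(w)_i = sign(w_i) |w_i|^(q-1) / ||w||_q^(q-2)."""
    nq = np.linalg.norm(w, q)
    if nq == 0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (q - 1) / nq ** (q - 2)

w = np.array([3.0, -1.0, 0.5])
assert np.isclose(np.dot(w, link(w, 1.5)), np.linalg.norm(w, 1.5) ** 2)
assert np.allclose(link(w, 2.0), w)   # q = 2: f is the identity
```

A further identity, ||f(w)||_p = ||w||_q, links the primal and dual norms of the same weight vector.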
Solution of PUMMA
The solution of PUMMA is found numerically:
(i) If the hyperplane constraint w·f(wt) ≥ ||wt||_q^2 is inactive,
    wt+1 = f^{-1}(α zt), where α > 0 is chosen so that wt+1·zt = 2
    (for p=2, α = 2/||zt||_2^2), and zt = xtpos − xtneg.
(ii) Otherwise,
    wt+1 = f^{-1}(α zt + β f(wt)), where
    (α, β) = argmin_{α,β} (1/2)||α zt + β f(wt)||_p^2 − 2α − β||f(wt)||_p^2,
    which is solved by the Newton method.
In either case,
    bt+1 = −(wt+1·xtpos + wt+1·xtneg)/2.
(xtpos, xtneg: the last positive and negative examples which incur updates)

Observation:
For p=2, the solution is the same as that of ROMMA for zt = xtpos − xtneg.
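For p = 2 this observation yields a compact update: run the ROMMA-style closed form on zt = xtpos − xtneg with target margin 2, then recover the bias from the two tight margin constraints. A hedged sketch (the function name is ours; it assumes both constraints are active at the optimum):

```python
import numpy as np

def pumma2_step(w, x_pos, x_neg):
    """One PUMMA update for p = 2, via the pairs view:
        min (1/2)||w'||_2^2  s.t.  w'.z >= 2,  w'.w >= ||w||_2^2,
    with z = x_pos - x_neg; assumes both constraints are tight."""
    z = x_pos - x_neg
    if np.allclose(w, 0):
        w_new = (2.0 / np.dot(z, z)) * z
    else:
        wz, ww, zz = np.dot(w, z), np.dot(w, w), np.dot(z, z)
        denom = ww * zz - wz ** 2
        alpha = ww * (2.0 - wz) / denom
        beta = (ww * zz - 2.0 * wz) / denom
        w_new = alpha * z + beta * w
    # Bias from the tight constraints w.x_pos + b = 1, w.x_neg + b = -1.
    b_new = -(np.dot(w_new, x_pos) + np.dot(w_new, x_neg)) / 2.0
    return w_new, b_new

w1, b1 = pumma2_step(np.array([1.0, 0.0]),
                     np.array([0.0, 1.0]),    # last positive example
                     np.array([0.0, -1.0]))   # last negative example
```

For this toy input the new hyperplane puts the positive example at margin +1 and the negative one at margin −1, with the bias optimized rather than fixed to 0.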
Our (implicit) reduction
which preserves the margin

Thm. w~ = w*, where

hyperplane with bias:
(w*, b*) = argmin_{w,b} (1/2)||w||_2^2
sub. to: w·xipos + b ≥ 1    (i = 1,…,P)
         w·xjneg + b ≤ −1   (j = 1,…,N)

hyperplane without bias, over pairs of positive and negative instances:
w~ = argmin_w (1/2)||w||_2^2
sub. to: w·(xipos − xjneg) ≥ 2   (i = 1,…,P, j = 1,…,N)

PUMMA implicitly runs ROMMA over pairs of positive and negative instances, in an efficient way!
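The theorem can be checked numerically on a toy dataset by solving both convex programs and comparing the weight vectors. This is only an illustrative sketch; the data points are our own, and a general-purpose solver (SciPy's SLSQP) stands in for a proper QP solver.

```python
import numpy as np
from scipy.optimize import minimize

# Two positive and two negative instances in R^2 (arbitrary toy data).
pos = np.array([[2.0, 1.0], [2.5, 2.0]])
neg = np.array([[0.0, 0.0], [0.5, -1.0]])

def with_bias():
    # Variables v = (w1, w2, b): max-margin hyperplane WITH bias.
    cons = [{'type': 'ineq', 'fun': (lambda v, x=x: v[:2] @ x + v[2] - 1)}
            for x in pos]
    cons += [{'type': 'ineq', 'fun': (lambda v, x=x: -(v[:2] @ x + v[2]) - 1)}
             for x in neg]
    r = minimize(lambda v: 0.5 * v[:2] @ v[:2], np.zeros(3), constraints=cons)
    return r.x[:2]

def over_pairs():
    # Hyperplane WITHOUT bias over all pos/neg pairs: w.(x_pos - x_neg) >= 2.
    cons = [{'type': 'ineq', 'fun': (lambda w, z=xp - xn: w @ z - 2)}
            for xp in pos for xn in neg]
    r = minimize(lambda w: 0.5 * w @ w, np.zeros(2), constraints=cons)
    return r.x

w_star, w_tilde = with_bias(), over_pairs()
```

Both programs recover the same weight vector, as the theorem states; PUMMA's point is that it achieves this without ever enumerating the P·N pairs explicitly.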
Main Result
Thm.
• Suppose that, given S=((x1,y1),…,(xT,yT)), there exists a linear classifier (u,b) s.t. yt(u·xt+b) ≥ 1 for t=1,…,T.
• (# of updates of PUMMAp(δ)) ≤ (p−1)||u||_q^2 R^2 / δ^2, where R = max_{t=1,…,T} ||xt||_p
  (similar to those of previous algorithms).
• After (p−1)||u||_q^2 R^2 / δ^2 updates, PUMMAp(δ) outputs a hypothesis with p-norm margin ≥ (1−δ)γ (γ: margin of (u,b)).
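To get a feel for the bound, here is a small sketch evaluating (p−1)||u||_q^2 R^2 / δ^2 for a sparse target. The concrete numbers (n=100, 16 relevant variables, δ=0.1) are illustrative assumptions, not values from the talk.

```python
import numpy as np

def update_bound(u, p, R, delta):
    """Evaluate the update bound (p-1) ||u||_q^2 R^2 / delta^2."""
    q = p / (p - 1)                 # dual norm exponent, 1/p + 1/q = 1
    return (p - 1) * np.linalg.norm(u, q) ** 2 * R ** 2 / delta ** 2

n = 100
u = np.zeros(n)
u[:16] = 1.0                        # sparse {0,1} target weights

# p = 2: for x in {-1,+1}^n, R = ||x||_2 = sqrt(n).
b2 = update_bound(u, 2.0, np.sqrt(n), 0.1)

# p = 2 ln n: R = ||x||_p = n^(1/p), so R^2 stays bounded by e.
p = 2 * np.log(n)
bln = update_bound(u, p, n ** (1 / p), 0.1)
```

For p = 2 ln n the R^2 factor stops growing with n, which is the usual reason p-norm algorithms behave like Winnow for sparse targets.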
Experiment over artificial data
• example (x,y)
– x: n(=100)-dimensional {-1,+1}-valued vector
– y = f(x), where f(x) = sign(x_1 + x_2 + … + x_16 + b)
• generate 1000 examples randomly
• 3 datasets (b = 1 (small), 9 (medium), 15 (large))
• Compare with ROMMA (p=2) and ALMA (p=2 ln n).
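To make the setup concrete, this sketch regenerates a dataset of the same form (the function name, seed, and RNG choice are our own assumptions):

```python
import numpy as np

def make_dataset(n_examples=1000, n=100, bias=9, seed=0):
    """Artificial data: x uniform in {-1,+1}^n,
    y = sign(x_1 + ... + x_16 + bias)."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n_examples, n))
    y = np.sign(X[:, :16].sum(axis=1) + bias)
    return X, y

X, y = make_dataset(bias=15)   # "large bias" dataset
```

Note that the sum of 16 values in {-1,+1} is even, so for odd bias values (1, 9, 15) the argument of sign is never zero and every label is well defined.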
Results over Artificial Data
[Two plots of margin vs. # of updates:
 left (p=2): PUMMA(15), PUMMA(1), ROMMA(15), ALMA(1);
 right (p=2 ln n): PUMMA(15), PUMMA(9), PUMMA(1), ALMA(15), ALMA(9), ALMA(1), with # of updates on a log scale.]
NOTE1: margin is defined over the original space (w/o reduction).
NOTE2: We omit the results for b=9 for clarity.
Computation Time
[Two plots of computation time (sec.) vs. bias (large → small):
 left (p=2): PUMMA vs. ROMMA; right (p=2 ln n): PUMMA vs. ALMA.]
For p=2, PUMMA is faster than ROMMA.
For p=2 ln n, PUMMA is faster than ALMA even though PUMMA uses the Newton method.
Results over UCI Adult data

adult (# of data: 32561)

algorithm    | sec.  | margin rate (%)
SVMlight     | 5893  | 100
ROMMA (99%)  | 71296 | 99.03
PUMMA (99%)  | 44480 | 99.14

• Fix p=2.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
Results over MNIST data

MNIST

algorithm    | sec.    | margin rate (%)
SVMlight     | 401.36  | 100
ROMMA (99%)  | 1715.57 | 93.5
PUMMA (99%)  | 1971.30 | 99.2

• Fix p=2.
• Use polynomial kernels.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
Summary
• PUMMA can learn p-norm maximum margin
classifiers with bias directly.
– # of updates is similar to those of previous algs.
– achieves (1-δ) times the maximum p-norm margin.
• PUMMA outperforms other online algs
when the underlying hyperplane has large bias.
Future work
• Maximizing ∞-norm margin directly.
• Tighter bounds of # of updates:
– In our experiments, PUMMA is faster especially
when bias is large (like WINNOW).
– Our current bound does not reflect this fact.