PEGASOS
Shai Shalev-Shwartz, Yoram Singer, Nati Srebro
The Hebrew University Jerusalem, Israel
Support Vector Machines
QP form:
  min_{w,ξ} (λ/2)‖w‖² + (1/m) Σ_i ξ_i   s.t.   ∀i: y_i⟨w, x_i⟩ ≥ 1 − ξ_i,  ξ_i ≥ 0
More "natural" form: regularization term + empirical loss:
  min_w (λ/2)‖w‖² + (1/m) Σ_{(x,y)∈S} ℓ(w; (x,y)),   where   ℓ(w; (x,y)) = max{0, 1 − y⟨w, x⟩}
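As a concrete reading of the "natural" form above, here is a minimal NumPy sketch of the regularized empirical hinge loss; the function name, the array layout (X of shape m×n, labels y in {−1, +1}), and the parameter name lam are illustrative assumptions, not part of the talk.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """f(w) = (lam/2)*||w||^2 + (1/m) * sum_i max(0, 1 - y_i * <w, x_i>)."""
    margins = y * (X @ w)                   # y_i <w, x_i> for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # per-example hinge loss
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```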
Outline
• Previous Work
• The Pegasos algorithm
• Analysis – faster convergence rates
• Experiments – outperforms state-of-the-art
• Extensions
  • kernels
  • complex prediction problems
  • bias term
Previous Work
• Dual-based methods
  • Interior Point methods: Memory m², Time m³ log(log(1/ε))
  • Decomposition methods: Memory m, Time super-linear in m
• Online learning & Stochastic Gradient
  • Memory O(1), Time 1/ε² (linear kernel)
  • Memory 1/ε², Time 1/ε⁴ (non-linear kernel)
Typically, online learning algorithms do not converge to the optimal solution of the SVM problem.
PEGASOS
At each iteration, choose a subset A_t ⊆ S, take a subgradient step on the objective restricted to A_t, and then project onto the ball of radius 1/√λ.
• A_t = S: subgradient method
• |A_t| = 1: stochastic gradient
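A minimal sketch of one such iteration, assuming a linear kernel, NumPy arrays, and uniform sampling of A_t; the step size η_t = 1/(λt) and the projection radius 1/√λ follow the talk, while the function name, argument names, and sampling details are illustrative assumptions.

```python
import numpy as np

def pegasos_step(w, X, y, lam, t, k=1, rng=None):
    """One iteration: subgradient step on a random subset A_t of size k,
    then projection onto the ball of radius 1/sqrt(lam)."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(X.shape[0], size=k, replace=False)   # A_t: random subset of S
    Xa, ya = X[idx], y[idx]
    eta = 1.0 / (lam * t)                                  # step size eta_t = 1/(lambda*t)
    viol = ya * (Xa @ w) < 1.0                             # examples in A_t with active hinge loss
    grad = lam * w - (ya[viol, None] * Xa[viol]).sum(axis=0) / k
    w = w - eta * grad                                     # subgradient step
    radius = 1.0 / np.sqrt(lam)
    norm = np.linalg.norm(w)
    return w * (radius / norm) if norm > radius else w     # projection step
```

Iterating w = pegasos_step(w, X, y, lam, t) for t = 1, …, T from w = 0 gives the mini-batch variant; k = m recovers the deterministic subgradient method and k = 1 the purely stochastic one.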
Run-Time of Pegasos
• Choosing |A_t| = 1 and a linear kernel over ℝⁿ, the run-time required for Pegasos to find an ε-accurate solution w.p. ≥ 1 − δ is Õ(n/(λε)) (up to the dependence on δ)
• Run-time does not depend on the number of examples
• Depends on the "difficulty" of the problem (λ and ε)
Formal Properties
• Definition: w is ε-accurate if f(w) ≤ min_w′ f(w′) + ε
• Theorem 1: Pegasos finds an ε-accurate solution w.p. ≥ 1 − δ after at most Õ(1/(δλε)) iterations
• Theorem 2: Pegasos finds log(1/δ) solutions such that, w.p. ≥ 1 − δ, at least one of them is ε-accurate after Õ(log(1/δ)/(λε)) iterations
Proof Sketch
A second look at the update step: each Pegasos iteration is exactly online gradient descent, with step size 1/(λt), on the instantaneous strongly convex functions f_t(w) = (λ/2)‖w‖² + (1/|A_t|) Σ_{(x,y)∈A_t} ℓ(w; (x,y)).
Proof Sketch
• Lemma (free projection): projecting onto a convex set that contains the optimum does not hurt the analysis
• Logarithmic regret for OCP (Hazan et al. '06)
• Take expectation: E[f(w_r)] − f(w*) ≤ O(log(T)/(λT))
• Since f(w_r) − f(w*) ≥ 0, Markov gives that w.p. ≥ 1 − δ the gap is at most 1/δ times its expectation
• Amplify the confidence by repeating log(1/δ) times
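Spelled out, the sketch chains a regret bound, an expectation argument, and Markov's inequality. Constants and lower-order terms are suppressed, so read this as a schematic of the argument rather than the exact statement from the talk.

```latex
% regret bound for strongly convex OCP (Hazan et al. '06), applied to the instantaneous objectives f_t,
% then expectation over the random subsets A_t and a uniformly chosen iterate w_r,
% then Markov's inequality, using f(w_r) - f(w*) >= 0.
\frac{1}{T}\sum_{t=1}^{T} f_t(w_t) - \frac{1}{T}\sum_{t=1}^{T} f_t(w^\star) \;\le\; O\!\Big(\frac{\log T}{\lambda T}\Big)
\quad\Longrightarrow\quad
\mathbb{E}\big[f(w_r) - f(w^\star)\big] \;\le\; O\!\Big(\frac{\log T}{\lambda T}\Big)
\quad\Longrightarrow\quad
\Pr\!\Big[\, f(w_r) - f(w^\star) \;\ge\; \tfrac{1}{\delta}\, O\!\Big(\tfrac{\log T}{\lambda T}\Big) \Big] \;\le\; \delta .
```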
Experiments
• 3 datasets (provided by Joachims)
  • Reuters CCAT (800K examples, 47k features)
  • Physics ArXiv (62k examples, 100k features)
  • Covertype (581k examples, 54 features)
• 4 competing algorithms
  • SVM-Light (Joachims)
  • SVM-Perf (Joachims '06)
  • Norma (Kivinen, Smola, Williamson '02)
  • Zhang '04 (stochastic gradient descent)
• Source code available online
Training Time (in seconds)
Dataset         Pegasos   SVM-Perf   SVM-Light
Reuters               2         77      20,075
Covertype             6         85      25,514
Astro Physics         2          5          80
Compare to Norma (on Physics)
[Plots: objective value and test error]
Compare to Zhang (on Physics)
But tuning Zhang's parameter is more expensive than the learning itself …
Effect of k = |A_t| when T is fixed
Effect of k = |A_t| when kT is fixed
I want my kernels!
• Pegasos can be seamlessly adapted to employ non-linear kernels while working solely on the primal objective function
• No need to switch to the dual problem
• The number of support vectors is bounded by the number of iterations
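A sketch of how this kernelized variant can look with |A_t| = 1, representing w implicitly through coefficients α over the training examples so that only the primal view is used; the function name, the user-supplied kernel, and the loop structure are illustrative assumptions.

```python
import numpy as np

def kernel_pegasos(X, y, lam, T, kernel, rng=None):
    """Kernelized Pegasos sketch: w_t is kept implicitly as
    (1/(lam*t)) * sum_j alpha[j] * y[j] * phi(x_j), so no dual problem is solved."""
    rng = rng or np.random.default_rng()
    m = X.shape[0]
    alpha = np.zeros(m)                       # alpha[j]: how often example j violated the margin
    for t in range(1, T + 1):
        i = rng.integers(m)                   # sample one example uniformly
        margin = (y[i] / (lam * t)) * sum(alpha[j] * y[j] * kernel(X[j], X[i])
                                          for j in np.flatnonzero(alpha))
        if margin < 1.0:                      # hinge loss active: example i becomes a support vector
            alpha[i] += 1.0
    return alpha                              # support vectors are the examples with alpha[j] > 0
```

Because each iteration increments at most one coefficient, the set of support vectors {j : alpha[j] > 0} can never exceed the number of iterations, matching the bound above.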
Complex Decision Problems
• Pegasos works whenever we know how to calculate a subgradient of the loss function ℓ(w; (x, y))
• Example: structured output prediction
• The subgradient is φ(x, y′) − φ(x, y), where y′ is the maximizer in the definition of ℓ
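A schematic of that subgradient computation, assuming a joint feature map phi(x, y), a label loss gamma(y, y′), and a finite candidate label set; all names are illustrative, and in practice the maximization is a problem-specific loss-augmented decoding step rather than a brute-force loop.

```python
import numpy as np

def structured_subgradient(w, x, y, phi, gamma, label_set):
    """Subgradient of l(w;(x,y)) = max_{y'} [gamma(y, y') + <w, phi(x, y')> - <w, phi(x, y)>]."""
    # loss-augmented decoding: find the maximizing label y'
    y_hat = max(label_set, key=lambda yp: gamma(y, yp) + np.dot(w, phi(x, yp)))
    return phi(x, y_hat) - phi(x, y)          # subgradient is phi(x, y') - phi(x, y)
```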
Bias term
• Popular approach: increase the dimension of x. Con: we "pay" for b in the regularization term
• Calculate subgradients w.r.t. w and w.r.t. b. Con: the convergence rate becomes 1/ε²
• Redefine the loss over the subset A_t with the bias optimized on that subset. Con: |A_t| needs to be large
• Search for b in an outer loop. Con: each evaluation of the objective costs 1/ε²
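A sketch of the second option above (subgradients w.r.t. both w and b) with |A_t| = 1; the function name and argument layout are illustrative assumptions, and, as noted, leaving b unregularized is what degrades the convergence rate to 1/ε².

```python
import numpy as np

def pegasos_step_with_bias(w, b, x, y, lam, t):
    """One stochastic step that also updates an (unregularized) bias b."""
    eta = 1.0 / (lam * t)
    if y * (np.dot(w, x) + b) < 1.0:          # hinge loss is active
        w = (1.0 - eta * lam) * w + eta * y * x
        b = b + eta * y                        # b gets only the loss subgradient, no regularization
    else:
        w = (1.0 - eta * lam) * w
    return w, b
```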
Discussion
• Pegasos: a simple & efficient solver for SVM
• Sample vs. computational complexity
  • Sample complexity: how many examples do we need, as a function of the VC-dimension, the accuracy (ε), and the confidence (δ)?
  • In Pegasos, we aim at analyzing the computational complexity based on λ, ε, and δ (also in Bottou & Bousquet)
• Finding the argmin vs. calculating the min: it seems that Pegasos finds the argmin more easily than it can calculate the min value