Support Vector Machines
Lecturer: Yishay Mansour
Itay Kirshenbaum
Lecture Overview
In this lecture we present in detail one of
the most theoretically well motivated and
practically most effective classification
algorithms in modern machine learning:
Support Vector Machines (SVMs).
Lecture Overview – Cont.
We begin by building the intuition
behind SVMs,
continue by defining SVM as an
optimization problem, and discuss how
to solve it efficiently.
We conclude with an analysis of the
error rate of SVMs using two
techniques: Leave-One-Out and VC-dimension.
Introduction
Support Vector Machine is a supervised
learning algorithm
Used to learn a hyperplane that can
solve the binary classification problem
Among the most extensively studied
problems in machine learning.
Binary Classification Problem
Input space: $X \subseteq \mathbb{R}^n$
Output space: $Y = \{-1, +1\}$
Training data: $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$
$S$ drawn i.i.d. from distribution $D$
Goal: Select hypothesis $h \in H$ that best
predicts other points drawn i.i.d. from $D$
Binary Classification – Cont.
Consider the problem of predicting the
success of a new drug based on a patient's
height and weight
m ill people are selected and treated
This generates m 2d vectors (height and
weight)
Each point is assigned +1 to indicate
successful treatment or -1 otherwise
This can be used as training data
Binary classification – Cont.
Infinitely many ways to classify
Occam’s razor – simple classification
rules provide better results
Linear classifier or hyperplane
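$h \in H$ maps $x \in X$ to $+1$ if $w \cdot x + b \ge 0$, and to $-1$ otherwise
Our class of linear classifiers:
$H = \{ x \mapsto \mathrm{sign}(w \cdot x + b) \mid w \in \mathbb{R}^n, b \in \mathbb{R} \}$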
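A minimal sketch of such a linear classifier in Python with NumPy. The weight vector, the bias, and the two (height, weight) points are made-up values for illustration, and taking sign(0) as +1 is an assumption of this sketch.

```python
import numpy as np

def predict(w, b, X):
    """Return sign(w . x + b) for each row x of X, with sign(0) taken as +1."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# Hypothetical hyperplane and hypothetical (height in m, weight in 100 kg) points.
w = np.array([1.0, -2.0])
b = 0.0
X = np.array([[1.80, 0.70],   # score 1.80 - 1.40 =  0.40
              [1.60, 0.90]])  # score 1.60 - 1.80 = -0.20
labels = predict(w, b, X)
print(labels)
```

Only the side of the hyperplane matters for the prediction; the size of the score is what the margin definitions quantify.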
Choosing a Good Hyperplane
Intuition
Consider two cases of positive
classification:
w*x + b = 0.1
w*x + b = 100
More confident in the decision made by the
latter rather than the former
Choose a hyperplane with maximal
margin
Good Hyperplane – Cont.
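Definition: Functional margin of $(w, b)$ with respect to $S$
$\hat{\gamma}_S = \min_{i \in \{1,\dots,m\}} \hat{\gamma}_i$ with $\hat{\gamma}_i = y_i (w \cdot x_i + b)$
$y_i$ is the classification of $x_i$, and $\hat{\gamma}_i$ is its margin according to $(w, b)$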
A linear classifier:
Maximal Margin
(w,b) can be scaled to increase the functional margin
sign(w*x + b) = sign(5w*x + 5b) for all x
The functional margin of (5w, 5b) is 5 times greater than that of (w,b)
Cope by adding an additional
constraint:
||w|| = 1
Maximal Margin – Cont.
Geometric Margin
Consider the geometric distance between
the hyperplane and the closest points
Geometric Margin
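Definition: Geometric margin of $S$
$\gamma_S = \min_{i \in \{1,\dots,m\}} \gamma_i$ with $\gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right)$
Relation to functional margin
$\hat{\gamma}_i = \|w\| \, \gamma_i$
Both are equal when $\|w\| = 1$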
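The two margin definitions can be compared numerically. The sketch below uses a made-up four-point data set and a made-up hyperplane; it suggests that scaling (w, b) by 5 multiplies the functional margin by 5 and leaves the geometric margin unchanged.

```python
import numpy as np

def functional_margin(w, b, X, y):
    """min_i y_i (w . x_i + b)"""
    return np.min(y * (X @ w + b))

def geometric_margin(w, b, X, y):
    """Functional margin divided by ||w||."""
    return functional_margin(w, b, X, y) / np.linalg.norm(w)

# Toy data (made up): two positive and two negative points.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -7.0

fm = functional_margin(w, b, X, y)           # 7.0
gm = geometric_margin(w, b, X, y)            # 7 / 5 = 1.4
fm5 = functional_margin(5 * w, 5 * b, X, y)  # 35.0
gm5 = geometric_margin(5 * w, 5 * b, X, y)   # 35 / 25 = 1.4
print(fm, gm, fm5, gm5)
```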
The Algorithm
We saw:
Two definitions of the margin
Intuition behind seeking a margin-maximizing
hyperplane
Goal: Write an optimization program
that finds such a hyperplane
We always look for (w,b) maximizing
the margin
The Algorithm – Take 1
First try:
$\max_{\gamma, w, b} \gamma$ s.t. $y_i (w \cdot x_i + b) \ge \gamma$, $i = 1,\dots,m$, $\|w\| = 1$
Idea
Maximize $\gamma$: for each sample the
functional margin is at least $\gamma$
Functional and geometric margin are the
same, since $\|w\| = 1$
Largest possible geometric margin with
respect to the training set
The Algorithm – Take 2
The first try can't be solved by any off-the-shelf optimization software
The $\|w\| = 1$ constraint is non-linear
In fact, it's even non-convex
How can we discard the constraint?
Use geometric margin!
$\max_{\hat{\gamma}, w, b} \frac{\hat{\gamma}}{\|w\|}$ s.t. $y_i (w \cdot x_i + b) \ge \hat{\gamma}$, $i = 1,\dots,m$
The Algorithm – Take 3
We now have a non-convex objective
function, so the problem remains
Remember
We can scale (w,b) as we wish
Force the functional margin to be 1
Objective function: $\max \frac{1}{\|w\|}$
Same as: $\min \frac{1}{2} \|w\|^2$
Factor of 0.5 and power of 2 do not change
the solution of the program; they make things easier
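A small sketch of the rescaling step, on a made-up two-point data set: dividing (w, b) by its functional margin forces the functional margin to 1, after which the geometric margin is 1/||w||. The function name is hypothetical.

```python
import numpy as np

def rescale_to_unit_functional_margin(w, b, X, y):
    """Scale (w, b) so that min_i y_i (w . x_i + b) equals 1.

    Assumes (w, b) separates the data, i.e. the functional margin is positive.
    """
    gamma_hat = np.min(y * (X @ w + b))
    if gamma_hat <= 0:
        raise ValueError("(w, b) does not separate the data")
    return w / gamma_hat, b / gamma_hat

X = np.array([[2.0, 2.0], [0.0, 0.0]])
y = np.array([1, -1])
w, b = np.array([3.0, 3.0]), -6.0        # functional margin 6

w1, b1 = rescale_to_unit_functional_margin(w, b, X, y)
geometric_before = 6.0 / np.linalg.norm(w)
geometric_after = 1.0 / np.linalg.norm(w1)
print(w1, b1, geometric_before, geometric_after)
```

The hyperplane itself does not move under this rescaling, which is why maximizing 1/||w|| under the unit-margin constraints still maximizes the geometric margin.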
The algorithm – Final version
The final program:
$\min_{w, b} \frac{1}{2} \|w\|^2$ s.t. $y_i (w \cdot x_i + b) \ge 1$, $i = 1,\dots,m$
The objective is convex (quadratic)
All constraints are linear
Can solve efficiently using standard
quadratic programming (QP) software
Convex Optimization
We want to solve the optimization
problem more efficiently than generic
QP
Solution – Use convex optimization
techniques
Convex Optimization – Cont.
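Definition: A convex function
$f : X \to \mathbb{R}$ is convex if $\forall x, y \in X$, $\lambda \in [0, 1]$:
$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$
Theorem
Let $f : X \to \mathbb{R}$ be a differentiable convex function. Then
$\forall x, y \in X$: $f(y) \ge f(x) + \nabla f(x) \cdot (y - x)$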
Convex Optimization Problem
Convex optimization problem
Let $f, g_i : X \to \mathbb{R}$, $i = 1,\dots,m$ be convex functions
Find $\min_{x \in X} f(x)$ s.t. $g_i(x) \le 0$, $i = 1,\dots,m$
We look for
a value of $x \in X$ that
minimizes $f(x)$
under the constraints
$g_i(x) \le 0$, $i = 1,\dots,m$
Lagrange Multipliers
Used to find the maxima or minima of a
function subject to constraints
Used to solve our optimization problem
Definition
The Lagrangian $L$ of a function $f$ subject to constraints $g_i$, $i = 1,\dots,m$:
$L(x, \alpha) = f(x) + \sum_{i=1}^{m} \alpha_i g_i(x)$, $\forall x \in X$, $\alpha_i \ge 0$
$\alpha_i$ are called the Lagrange multipliers
Primal Program
Plan
Use the Lagrangian to write a program
called the Primal Program
Equal to $f(x)$ if all the constraints are met
Otherwise: $\infty$
Definition: Primal Program
$P(x) = \max_{\alpha \ge 0} L(x, \alpha)$
Primal Program – Cont.
The constraints are of the form $g_i(x) \le 0$
If they are met, $P(x) = f(x)$
$\sum_{i=1}^{m} \alpha_i g_i(x)$ is maximized when all $\alpha_i$ are 0, and the summation is 0
Otherwise $P(x) = \infty$
$\sum_{i=1}^{m} \alpha_i g_i(x)$ is maximized for $\alpha_i \to \infty$ where $g_i(x) > 0$
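The two cases can be illustrated on a one-dimensional toy problem, chosen here as an assumption: minimize f(x) = x^2 / 2 subject to g(x) = 2 - x <= 0. At a feasible point the Lagrangian is largest at alpha = 0; at an infeasible point it keeps growing with alpha.

```python
import math

def f(x):
    return 0.5 * x * x

def g(x):
    # The constraint g(x) <= 0 encodes x >= 2.
    return 2.0 - x

def lagrangian(x, alpha):
    return f(x) + alpha * g(x)

def primal(x):
    """P(x) = sup over alpha >= 0 of L(x, alpha), written in closed form."""
    return f(x) if g(x) <= 0 else math.inf

# Feasible point x = 3: increasing alpha only lowers L, so the best alpha is 0.
feasible = [lagrangian(3.0, a) for a in (0.0, 1.0, 10.0)]
# Infeasible point x = 1: L grows without bound as alpha grows.
infeasible = [lagrangian(1.0, a) for a in (0.0, 1.0, 10.0, 1000.0)]
print(feasible, infeasible, primal(3.0), primal(1.0))
```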
Primal Program – Cont.
Our convex optimization problem is
now:
$\min_{x \in X} P(x) = \min_{x \in X} \max_{\alpha \ge 0} L(x, \alpha)$
Define $p^* = \min_{x \in X} P(x)$ as the value of
the primal program
Dual Program
We define the Dual Program as:
$D(\alpha) = \min_{x \in X} L(x, \alpha)$
We'll look at
$\max_{\alpha \ge 0} D(\alpha) = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha)$
Same as our primal program
Order of min / max is different
Define the value of our Dual Program
$d^* = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha)$
Dual Program – Cont.
We want to show $d^* = p^*$
If we find a solution to one problem, we
find the solution to the second problem
Start with $d^* \le p^*$
"max min" is never greater than "min max"
$d^* = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha) \le \min_{x \in X} \max_{\alpha \ge 0} L(x, \alpha) = p^*$
Now on to $p^* \le d^*$
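The "max min" versus "min max" inequality can be checked on a finite table. In this sketch the table stands in for L(x, alpha), with rows indexing x and columns indexing alpha; the random table and the 2x2 example are illustrative assumptions.

```python
import numpy as np

def max_min(table):
    """max over columns (alpha) of the min over rows (x)."""
    return table.min(axis=0).max()

def min_max(table):
    """min over rows (x) of the max over columns (alpha)."""
    return table.max(axis=1).min()

rng = np.random.default_rng(0)
random_table = rng.normal(size=(6, 8))
gap_ok = max_min(random_table) <= min_max(random_table)

# A table where the inequality is strict: max-min is 0, min-max is 1.
identity = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
strict = (max_min(identity), min_max(identity))
print(gap_ok, strict)
```

The identity table has no saddle point, which is why the two values differ; the saddle-point condition is what closes the gap.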
Dual Program – Cont.
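Claim
If there exist $x^*$ and $\alpha^* \ge 0$ which are a saddle point, i.e.
$\forall \alpha \ge 0$, $\forall x$ which is feasible: $L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*)$,
then $p^* = d^*$ and $x^*$ is a solution to $P(x)$
Proof
$p^* = \inf_x \sup_{\alpha \ge 0} L(x, \alpha) \le \sup_{\alpha \ge 0} L(x^*, \alpha) = L(x^*, \alpha^*)$
$= \inf_x L(x, \alpha^*) \le \sup_{\alpha \ge 0} \inf_x L(x, \alpha) = d^*$
Conclude
$p^* \le d^*$, and together with $d^* \le p^*$: $d^* = p^*$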
Karush-Kuhn-Tucker (KKT)
conditions
KKT conditions derive a
characterization of an optimal solution
to a convex problem.
Theorem
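Assume that $f$ and $g_i$, $i = 1,\dots,m$ are differentiable and convex.
$x^*$ is a solution to the optimization problem $\iff \exists \alpha^* \ge 0$ s.t.:
1. $\nabla_x L(x^*, \alpha^*) = \nabla_x f(x^*) + \alpha^* \cdot \nabla_x g(x^*) = 0$
2. $\nabla_\alpha L(x^*, \alpha^*) = g(x^*) \le 0$
3. $\alpha^* \cdot g(x^*) = \sum_{i=1}^{m} \alpha_i^* g_i(x^*) = 0$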
KKT Conditions – Cont.
Proof
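For every feasible $x$:
$f(x) - f(x^*) \ge \nabla_x f(x^*) \cdot (x - x^*)$
$= -\sum_{i=1}^{m} \alpha_i^* \nabla_x g_i(x^*) \cdot (x - x^*)$
$\ge -\sum_{i=1}^{m} \alpha_i^* [g_i(x) - g_i(x^*)]$
$= -\sum_{i=1}^{m} \alpha_i^* g_i(x) \ge 0$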
The other direction holds as well
KKT Conditions – Cont.
Example
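Consider the following optimization
problem: $\min \frac{1}{2} x^2$ s.t. $x \ge 2$
We have $f(x) = \frac{1}{2} x^2$, $g_1(x) = 2 - x$
The Lagrangian will be
$L(x, \alpha) = \frac{1}{2} x^2 + \alpha (2 - x)$
$\frac{\partial L}{\partial x} = x - \alpha = 0 \Rightarrow x^* = \alpha$
$L(x^*, \alpha) = \frac{1}{2} \alpha^2 + \alpha (2 - \alpha) = 2\alpha - \frac{1}{2} \alpha^2$
$\frac{\partial}{\partial \alpha} L(x^*, \alpha) = 2 - \alpha = 0 \Rightarrow \alpha^* = 2 = x^*$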
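The example can be checked numerically. This sketch maximizes the dual function over a grid of alpha values (the grid is an assumption made for illustration, not part of the lecture) and compares the dual value with the primal value at x = 2.

```python
import numpy as np

def lagrangian(x, alpha):
    return 0.5 * x**2 + alpha * (2.0 - x)

def dual(alpha):
    # The minimizer over x is x = alpha (set dL/dx = x - alpha to zero).
    return lagrangian(alpha, alpha)

alphas = np.linspace(0.0, 4.0, 41)     # grid over alpha >= 0, step 0.1
alpha_star = alphas[np.argmax(dual(alphas))]
x_star = alpha_star
d_star = dual(alpha_star)              # value of the dual program
p_star = 0.5 * 2.0**2                  # f at the constrained minimizer x = 2
slack = alpha_star * (2.0 - x_star)    # complementary slackness term
print(alpha_star, x_star, d_star, p_star, slack)
```

Both programs reach the value 2, and the complementary slackness term is 0 because the constraint x >= 2 is tight at the optimum.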
Optimal Margin Classifier
Back to SVM
Rewrite our optimization program
$\min_{w, b} \frac{1}{2} \|w\|^2$ s.t. $y_i (w \cdot x_i + b) \ge 1$, $i = 1,\dots,m$
$g_i(w, b) = -y_i (w \cdot x_i + b) + 1 \le 0$
Following the KKT conditions
$\alpha_i > 0$
only for points in the training set with a
margin of exactly 1
These are the support vectors of the
training set
Optimal Margin – Cont.
Optimal margin classifier and its
support vectors
Optimal Margin – Cont.
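Construct the Lagrangian
$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i [y_i (w \cdot x_i + b) - 1]$
Find the dual form
First minimize $L(w, b, \alpha)$ over $w$ and $b$ to get $D(\alpha)$
Do so by setting the derivatives to zero
$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0$
$w^* = \sum_{i=1}^{m} \alpha_i y_i x_i$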
Optimal Margin – Cont.
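Take the derivative with respect to $b$
$\frac{\partial}{\partial b} L(w, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y_i = 0$
Use $w^*$ in the Lagrangian
$L(w^*, b^*, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j) - b \sum_{i=1}^{m} \alpha_i y_i$
We saw the last term is zero
$L(w^*, b^*, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j) = W(\alpha)$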
Optimal Margin – Cont.
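The dual optimization problem
$\max_{\alpha} W(\alpha)$ s.t. $\alpha_i \ge 0$, $i = 1,\dots,m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$
The KKT conditions hold
Can solve by finding $\alpha^*$ that maximizes $W(\alpha)$
Assuming we have $\alpha^*$, define $w^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i$
The solution to the primal problem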
Optimal Margin – Cont.
Still need to find $b^*$
Assume $x_i$ is a support vector
We get
$1 = y_i (w^* \cdot x_i + b^*)$
$y_i = w^* \cdot x_i + b^*$ (multiplying both sides by $y_i$, since $y_i^2 = 1$)
$b^* = y_i - w^* \cdot x_i$
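A sketch of recovering the primal solution from dual variables. The three-point data set is made up, and the multipliers alpha were worked out by hand for it (they are an input here, not the output of a solver); the function names are hypothetical.

```python
import numpy as np

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j (x_i . x_j)"""
    v = (alpha * y) @ X
    return alpha.sum() - 0.5 * (v @ v)

def recover_primal(alpha, X, y, tol=1e-9):
    """w* = sum_i alpha_i y_i x_i and b* = y_i - w* . x_i for a support vector x_i."""
    w = (alpha * y) @ X
    sv = np.flatnonzero(alpha > tol)   # support vectors are the points with alpha_i > 0
    i = sv[0]
    b = y[i] - w @ X[i]
    return w, b, sv

X = np.array([[2.0, 2.0], [0.0, 0.0], [3.0, 3.0]])
y = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.25, 0.25, 0.0])    # hand-derived dual optimum for this toy set

w, b, sv = recover_primal(alpha, X, y)
margins = y * (X @ w + b)              # 1 for support vectors, larger otherwise
dual_value = dual_objective(alpha, X, y)
primal_value = 0.5 * (w @ w)
print(w, b, sv, margins, dual_value, primal_value)
```

The third point has alpha = 0 and a margin of 2, so it plays no role in w; the primal and dual values agree at 0.25.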
Error Analysis Using Leave-One-Out
The Leave-One-Out (LOO) method
Remove one point at a time from the
training set
Calculate an SVM for the remaining points
Test our result using the removed point
Definition
$\hat{R}_{LOO} = \frac{1}{m} \sum_{i=1}^{m} I(h_{S - \{x_i\}}(x_i) \ne y_i)$
The indicator function I(exp) is 1 if exp is
true, otherwise 0
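A sketch of the LOO estimate. To stay self-contained it uses a one-dimensional stand-in learner: for separable 1-D data with all positives to the right of all negatives, the maximal-margin threshold is the midpoint between the two closest opposite-label points. The sample and the function names are made up.

```python
def train_1d_max_margin(points):
    """Hard-margin classifier in one dimension.

    Assumes every positive point lies to the right of every negative point,
    so the max-margin threshold is the midpoint of the two closest
    opposite-label points.
    """
    lo = max(x for x, label in points if label == -1)
    hi = min(x for x, label in points if label == +1)
    threshold = (lo + hi) / 2.0
    return lambda x: 1 if x >= threshold else -1

def loo_error(points, train):
    """Fraction of points misclassified by the model trained without them."""
    mistakes = 0
    for i, (x, label) in enumerate(points):
        h = train(points[:i] + points[i + 1:])
        mistakes += int(h(x) != label)
    return mistakes / len(points)

# Made-up 1-D sample of (x, label) pairs.
S = [(-5.0, -1), (-4.0, -1), (-0.5, -1), (1.0, 1), (4.0, 1), (5.0, 1)]
err = loo_error(S, train_1d_max_margin)
print(err)
```

In this sample the only two LOO mistakes come from removing -0.5 and 1.0, the two points that define the threshold on the full sample.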
LOO Error Analysis – Cont.
Expected error
$E_{S \sim D^m}[\hat{R}_{LOO}] = \frac{1}{m} \sum_{i=1}^{m} E[I(h_{S - \{x_i\}}(x_i) \ne y_i)] = E_{S, x_i}[I(h_{S - \{x_i\}}(x_i) \ne y_i)] = E_{S' \sim D^{m-1}}[error(h_{S'})]$
It follows that the expected LOO error for a
training set of size m is the same as the expected error for a
training set of size m-1
LOO Error Analysis – Cont.
Theorem
$E_{S \sim D^m}[error(h_S)] \le E_{S \sim D^{m+1}}\left[ \frac{N_{SV}(S)}{m+1} \right]$
$N_{SV}(S)$ is the number of support vectors in $S$
Proof
If $h_{S - \{x_i\}}$ classifies $x_i$ incorrectly, then $x_i$ must be a support vector of $h_S$.
Hence, for $|S| = m+1$: $\hat{R}_{LOO} \le \frac{N_{SV}(S)}{m+1}$
Generalization Bounds Using
VC-dimension
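Theorem
Let $S \subseteq \{x : \|x\| \le R\}$. Let $d$ be the VC-dimension of the hyperplane
set $\{ x \mapsto \mathrm{sign}(w \cdot x) : \min_{x \in S} |w \cdot x| = 1, \|w\| \le \Lambda \}$. Then $d \le R^2 \Lambda^2$
Proof
Assume that the set $\{x_1, \dots, x_d\}$ is shattered.
So for every $y \in \{-1, +1\}^d$ there exists $w$ such that $y_i (w \cdot x_i) \ge 1$, $i = 1,\dots,d$. Summing over $i$:
$d \le w \cdot \sum_{i=1}^{d} y_i x_i \le \|w\| \left\| \sum_{i=1}^{d} y_i x_i \right\| \le \Lambda \left\| \sum_{i=1}^{d} y_i x_i \right\|$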
Generalization Bounds Using
VC-dimension – Cont.
Proof – Cont.
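Averaging over the $y$'s with uniform distribution (with $\|w\| \le \Lambda$):
$d \le \Lambda E_y \left\| \sum_{i=1}^{d} y_i x_i \right\| \le \Lambda \left[ E_y \left\| \sum_{i=1}^{d} y_i x_i \right\|^2 \right]^{1/2} = \Lambda \left[ \sum_{i,j} E_y[y_i y_j] (x_i \cdot x_j) \right]^{1/2}$
Since $E_y[y_i y_j] = 0$ when $i \ne j$ and $E_y[y_i y_j] = 1$ when $i = j$
we can conclude that:
$d \le \Lambda \left[ \sum_{i,j} E_y[y_i y_j] (x_i \cdot x_j) \right]^{1/2} = \Lambda \left[ \sum_{i=1}^{d} \|x_i\|^2 \right]^{1/2} \le \Lambda \left[ d R^2 \right]^{1/2}$
Therefore $\sqrt{d} \le \Lambda R$, i.e. $d \le R^2 \Lambda^2$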
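The averaging step relies on the cross terms vanishing, so that the mean of ||sum_i y_i x_i||^2 over all sign vectors y equals sum_i ||x_i||^2. A sketch that checks this by enumerating all 2^d sign vectors for a few random points (the points and the sizes d = 4, n = 3 are arbitrary choices):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 3
X = rng.normal(size=(d, n))            # d made-up points in R^n, one per row

# Average ||sum_i y_i x_i||^2 over all 2^d sign vectors y.
sq_norms = [np.sum((np.array(signs) @ X) ** 2)
            for signs in itertools.product([-1.0, 1.0], repeat=d)]
lhs = np.mean(sq_norms)
rhs = np.sum(X ** 2)                   # sum_i ||x_i||^2
R = np.max(np.linalg.norm(X, axis=1))  # radius of the smallest origin-centered ball containing the points
print(lhs, rhs, d * R**2)
```

The last printed value is the bound d R^2 used in the final inequality of the proof.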