University of Wisconsin–Madison

Support Vector Machine:
An Introduction
Linear Hyper-plane Classifier
[Figure: a separating hyper-plane H in the (x1, x2) plane, with normal vector w, at distance −b/|w| from the origin; a sample x lies at distance r from H.]

Given: {(xi, di); i = 1 to N, di ∈ {+1, −1}}.

A linear hyper-plane classifier is a hyper-plane consisting of points x such that
H = {x | g(x) = wᵀx + b = 0}
g(x): a discriminant function!

For x on the side of the ○ class: wᵀx + b ≥ 0, d = +1;
For x on the side of the other class: wᵀx + b < 0, d = −1.

Distance from x to H: r = wᵀx/|w| − (−b/|w|) = g(x)/|w|
Distance from a Point to a Hyper-plane
The hyper-plane H is characterized by
wᵀx + b = 0        (*)
w: the normal vector perpendicular to H.
(*) says that any vector x on H, projected onto w, has length OA = −b/|w|.

[Figure: the hyper-plane H with normal w; A is the point where the w direction meets H, C is the point corresponding to x*, and B, r illustrate its projection onto w and its distance to H.]

Consider a special point C corresponding to a vector x*. The magnitude of its projection onto w is
wᵀx*/|w| = OA + BC,
or equivalently,
wᵀx*/|w| = −b/|w| + r.
Hence
r = (wᵀx* + b)/|w| = g(x*)/|w|.
If x* is on the other side of H (the same side as the origin), then
r = −(wᵀx* + b)/|w| = −g(x*)/|w|.
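As a quick numerical illustration of this distance formula (a minimal sketch with made-up values, not from the lecture), the signed distance g(x)/|w| is a few lines of MATLAB:

    % Signed distance from a point x to the hyper-plane H: w'*x + b = 0
    w = [3; 4];         % normal vector (assumed example values)
    b = -5;             % bias (assumed example value)
    x = [4; 3];         % query point
    g = w'*x + b;       % discriminant g(x) = w'*x + b, here 19
    r = g / norm(w);    % signed distance to H, here 19/5 = 3.8; the sign tells which side x is on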
Optimal Hyper-plane: Linearly Separable Case
• The optimal hyper-plane should be in the center of the gap.
• Support vectors: the samples on the boundaries. The support vectors alone can determine the optimal hyper-plane.
• Question: How to find the optimal hyper-plane?

[Figure: two linearly separable classes in the (x1, x2) plane; the optimal hyper-plane sits in the center of the gap, with the support vectors on the two boundaries.]

For di = +1: g(xi) = wᵀxi + b ≥ r|w|  ⇔  woᵀxi + bo ≥ 1
For di = −1: g(xi) = wᵀxi + b ≤ −r|w|  ⇔  woᵀxi + bo ≤ −1
Separation Gap
For xi being a support vector:
For di = +1: g(xi) = wᵀxi + b = r|w|  ⇔  woᵀxi + bo = 1
For di = −1: g(xi) = wᵀxi + b = −r|w|  ⇔  woᵀxi + bo = −1
Hence wo = w/(r|w|) and bo = b/(r|w|).
But the distance from xi to the hyper-plane is r = g(xi)/|w|.
Thus wo = w/g(xi), and r = 1/|wo|.
The maximum distance between the two classes is
2r = 2/|wo|.
The objective is to find wo, bo that minimize |wo| (so that r is maximized), subject to the constraints
woᵀxi + bo ≥ 1 for di = +1; and woᵀxi + bo ≤ −1 for di = −1.
Combining these constraints, one has:
di(woᵀxi + bo) ≥ 1
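These normalization relations can be checked numerically. A minimal sketch (the hyper-plane and support vector below are assumed values, not from the lecture):

    % Check wo = w/(r*|w|), r = 1/|wo|, and maximum gap 2r = 2/|wo|
    w = [4; 0];  b = -6;                    % an un-normalized separating hyper-plane (assumed)
    xsv = [2; 0];  dsv = 1;                 % a support vector with label +1 (assumed)
    g = w'*xsv + b;                         % g(xsv), here 2
    r = g / norm(w);                        % distance from the support vector to H, here 0.5
    wo = w / (r*norm(w));  bo = b / (r*norm(w));
    disp([dsv*(wo'*xsv + bo), 1/norm(wo), 2/norm(wo)])   % expect [1, r, 2r] = [1, 0.5, 1]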
Quadratic Optimization Problem Formulation
Given {(xi, di); i = 1 to N}, find w and b such that
Φ(w) = wᵀw/2
is minimized subject to the N constraints
di(wᵀxi + b) − 1 ≥ 0;   1 ≤ i ≤ N.

Method of Lagrange multipliers:
J(W, b, α) = Φ(W) − Σ_{i=1}^{N} αi [di(Wᵀxi + b) − 1]
Set
∂J(W, b, α)/∂W = 0  ⇒  W = Σ_{i=1}^{N} αi di xi
∂J(W, b, α)/∂b = 0  ⇒  Σ_{i=1}^{N} αi di = 0
Optimization (continued)
The solution of the Lagrange multiplier problem is at a saddle point, where the minimum is sought w.r.t. w and b, while the maximum is sought w.r.t. αi.
Kuhn-Tucker condition: at the saddle point,
αi [di(wᵀxi + b) − 1] = 0   for 1 ≤ i ≤ N.
• If xi is NOT a support vector, the corresponding αi = 0!
• Hence, only the support vectors will affect the result of the optimization!
A Numerical Example
(1,1)
(2,1)
(3,1)
3 inequalities:
1w + b  1; 2w + b  +1; 3w + b  + 1
J = w2/2  1(wb1)  2(2w+b1)  3(3w+b1)
J/w = 0  w =  1 + 22 + 33
J/b = 0  0 = 1  2  3
Kuhn-Tucker condition implies:
(a) 1(wb1) = 0 (b) 2(2w+b1) = 0 (c); 3(3w + b 1) = 0
Later, we will see the solution is 1 = 2 = 2 and 3 = 0. This yields
w = 2, b = 3.
Hence the solution of decision boundary is:
2x  3 = 0. or x = 1.5!
This is shown as the dash line in above figure.
(C) 2001-2005 by Yu Hen Hu
8
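The multipliers quoted above can also be obtained numerically. Here is a sketch (not part of the lecture) that solves the dual of this example with quadprog from the MATLAB Optimization Toolbox; maximizing Q(α) is rewritten as minimizing (1/2)αᵀHα − Σαi with H(i,j) = di dj xi xj:

    % Dual QP for the 1-D example: x = [1 2 3], d = [-1 +1 +1]
    x = [1; 2; 3];  d = [-1; 1; 1];
    H = (d*d') .* (x*x');            % H(i,j) = di*dj*xi*xj
    f = -ones(3,1);                  % so that 0.5*a'*H*a + f'*a = -Q(a)
    Aeq = d';  beq = 0;              % equality constraint: sum_i alpha_i*di = 0
    lb = zeros(3,1);                 % alpha_i >= 0
    alpha = quadprog(H, f, [], [], Aeq, beq, lb, []);
    w = sum(alpha .* d .* x);        % w = sum_i alpha_i*di*xi, expect 2
    b = d(1) - w*x(1);               % from the support vector x1: d1*(w*x1 + b) = 1, expect -3
    disp([alpha', w, b])             % expect [2 2 0 2 -3]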
Primal/Dual Problem Formulation
Given a constrained optimization problem with a convex cost function and linear constraints, a dual problem with the Lagrange multipliers providing the solution can be formulated.

Duality Theorem (Bertsekas, 1995)
(a) If the primal problem has an optimal solution, then the dual problem has an optimal solution with the same optimal value.
(b) In order for wo to be an optimal primal solution and αo to be an optimal dual solution, it is necessary and sufficient that wo is feasible for the primal problem and
Φ(wo) = J(wo, bo, αo) = min_w J(w, bo, αo)
Formulating the Dual Problem
J(w, b, α) = (1/2) wᵀw − Σ_{i=1}^{N} αi di wᵀxi − b Σ_{i=1}^{N} αi di + Σ_{i=1}^{N} αi

At the saddle point, we have W = Σ_{i=1}^{N} αi di xi and Σ_{i=1}^{N} αi di = 0. Substituting these relations into the above, we have the

Dual Problem
Maximize  Q(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xiᵀxj
Subject to:  Σ_{i=1}^{N} αi di = 0  and  αi ≥ 0 for i = 1, 2, …, N.

Note: Q(α) = Σ_{i=1}^{N} αi − (1/2) [α1d1 … αNdN] G [α1d1 … αNdN]ᵀ, where G is the N×N Gram matrix with entries Gij = xiᵀxj.
Numerical Example (cont’d)
Q(α) = Σ_{i=1}^{3} αi − (1/2) [−α1  α2  α3] G [−α1  α2  α3]ᵀ,   where G = [1 2 3; 2 4 6; 3 6 9],
or Q(α) = α1 + α2 + α3 − [0.5α1² + 2α2² + 4.5α3² − 2α1α2 − 3α1α3 + 6α2α3],
subject to the constraints: −α1 + α2 + α3 = 0, and
α1 ≥ 0, α2 ≥ 0, and α3 ≥ 0.
Use the Matlab Optimization Toolbox command:
x = fmincon('qalpha', X0, A, B, Aeq, Beq)
The solution is [α1 α2 α3] = [2 2 0], as expected.
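Since fmincon minimizes, the user-supplied 'qalpha' presumably returns −Q(α); the lecture does not show its body or the constraint matrices, so the following is only an assumed sketch of how the call could be set up:

    % --- file qalpha.m (assumed implementation of the lecture's helper) ---
    function f = qalpha(a)
    G = [1 2 3; 2 4 6; 3 6 9];  d = [-1; 1; 1];
    H = (d*d') .* G;                 % H(i,j) = di*dj*xi*xj
    f = 0.5*a'*H*a - sum(a);         % fmincon minimizes, so return -Q(alpha)

    % --- at the command line (argument values assumed) ---
    X0 = zeros(3,1);                 % feasible starting point
    A = -eye(3);  B = zeros(3,1);    % -alpha_i <= 0, i.e. alpha_i >= 0
    Aeq = [-1 1 1];  Beq = 0;        % -alpha1 + alpha2 + alpha3 = 0
    x = fmincon(@qalpha, X0, A, B, Aeq, Beq)   % expect x = [2 2 0]'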
Implication of Minimizing ||w||
Let D denote the diameter of the smallest hyper-ball that encloses all the input training vectors {x1, x2, …, xN}. The set of optimal hyper-planes described by the equation
woᵀx + bo = 0
has a VC-dimension h bounded from above as
h ≤ min{D²/ρ², m0} + 1
where m0 is the dimension of the input vectors, and ρ = 2/||wo|| is the margin of separation of the hyper-planes.
The VC-dimension determines the complexity of the classifier structure, and usually the smaller, the better.
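For a quick illustration with assumed numbers (not from the lecture): if the training vectors fit in a ball of diameter D = 10, the margin is ρ = 2, and the input dimension is m0 = 5, then D²/ρ² = 25 and the bound gives h ≤ min{25, 5} + 1 = 6. Enlarging the margin tightens the bound only once D²/ρ² drops below m0, which is why maximizing ρ (minimizing ||wo||) controls the classifier complexity independently of the input dimension.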
Non-separable Cases
Recall that in the linearly separable case, each training sample pair (xi, di) represents a linear inequality constraint
di(wᵀxi + b) ≥ 1,  i = 1, 2, …, N        (*)
If the training samples are not linearly separable, the constraint can be modified to yield a soft constraint:
di(wᵀxi + b) ≥ 1 − ξi,  i = 1, 2, …, N        (**)
{ξi; 1 ≤ i ≤ N} are known as slack variables.
Note that originally, (*) is a normalized version of di g(xi)/|w| ≥ r. With the slack variable ξi, that equation becomes di g(xi)/|w| ≥ r(1 − ξi). Hence, with the slack variables, we allow some samples xi to fall within the gap. Moreover, if ξi > 1, then the corresponding (xi, di) is mis-classified, because the sample falls on the wrong side of the hyper-plane H.
Non-Separable Case
Since ξi > 1 implies misclassification, the cost function must include a term to minimize the number of samples that are misclassified:
Φ(W, ξ) = WᵀW/2 + λ Σ_{i=1}^{N} I(ξi − 1)
where λ is a Lagrange multiplier and I(ξ) is the indicator function (0 for ξ ≤ 0, 1 for ξ > 0).
But this formulation is non-convex, and a solution is difficult to find using existing nonlinear optimization algorithms. Hence, we may instead use the approximated cost function
Φ(W, ξ) = (1/2) WᵀW + C Σ_{i=1}^{N} ξi
With this approximated cost function, the goal is to maximize the margin (minimize ||W||) while minimizing the slack variables ξi (ξi ≥ 0):
• ξi is not counted if xi is outside the gap and on the correct side;
• 0 < ξi < 1: xi is inside the gap, but on the correct side;
• ξi > 1: xi is on the wrong side of H (inside or outside the gap).
Primal Problem Formulation
Primal Optimization Problem: Given {(xi, di); 1 ≤ i ≤ N}, find w and b such that
Φ(w, ξ) = (1/2) wᵀw + C Σ_{i=1}^{N} ξi
is minimized subject to the constraints
(i) ξi ≥ 0, and
(ii) di(wᵀxi + b) ≥ 1 − ξi for i = 1, 2, …, N.
Using αi and βi as Lagrange multipliers, the unconstrained cost function becomes
J(w, b, ξ) = (1/2) wᵀw + C Σ_{i=1}^{N} ξi − Σ_{i=1}^{N} αi [di(wᵀxi + b) − 1 + ξi] − Σ_{i=1}^{N} βi ξi
Dual Problem Formulation
Note that
∂J(w, b, ξ)/∂w = 0  ⇒  w = Σ_{i=1}^{N} αi di xi
∂J(w, b, ξ)/∂ξi = 0  ⇒  C − αi − βi = 0

Dual Optimization Problem: Given {(xi, di); 1 ≤ i ≤ N}, find the Lagrange multipliers {αi; 1 ≤ i ≤ N} such that
Q(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj xiᵀxj
is maximized subject to the constraints
(i) 0 ≤ αi ≤ C (C: a user-specified positive number), and
(ii) Σ_{i=1}^{N} αi di = 0.
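Compared with the separable case, the only change in the dual is the upper bound C on each αi, which maps directly onto quadprog's box constraints. A minimal sketch (the data, labels, and C below are assumed placeholders, not the lecture's):

    % Soft-margin dual: maximize Q(alpha) s.t. 0 <= alpha_i <= C and sum_i alpha_i*di = 0
    X = [1 1; 2 2; 2 0; 3 3; 4 1];   % N x m training matrix, one sample per row (assumed)
    d = [-1; -1; -1; 1; 1];          % class labels in {-1, +1} (assumed)
    C = 10;                          % user-specified penalty parameter (assumed)
    N = size(X, 1);
    H = (d*d') .* (X*X');            % H(i,j) = di*dj*xi'*xj
    f = -ones(N, 1);                 % 0.5*a'*H*a + f'*a = -Q(a)
    lb = zeros(N, 1);  ub = C*ones(N, 1);
    alpha = quadprog(H, f, [], [], d', 0, lb, ub);   % box constraints give the soft margin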
Solution to the Dual Problem
By the Karush-Kuhn-Tucker conditions, for i = 1, 2, …, N:
(i) αi [di(wᵀxi + b) − 1 + ξi] = 0        (*)
(ii) βi ξi = 0
At the optimal point, αi + βi = C. Thus, one may deduce that
• if 0 < αi < C, then ξi = 0 and di(wᵀxi + b) = 1;
• if αi = C, then ξi ≥ 0 and di(wᵀxi + b) = 1 − ξi ≤ 1;
• if αi = 0, then di(wᵀxi + b) ≥ 1: xi is not a support vector.
Finally, the optimal solutions are:
wo = Σ_{i=1}^{N} αi di xi
bo = [ Σ_{i∈Io} (1 − di woᵀxi) ] / [ Σ_{i∈Io} di ]
where Io = {i; 0 < αi < C}.
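Continuing the quadprog sketch from the previous slide, wo and bo follow directly from these two formulas (the numerical tolerance used to detect 0 < αi < C is an implementation choice, not from the lecture):

    % Recover wo and bo from the dual solution alpha (continues the previous sketch)
    tol = 1e-6;
    wo = X' * (alpha .* d);                        % wo = sum_i alpha_i*di*xi
    Io = find(alpha > tol & alpha < C - tol);      % indices with 0 < alpha_i < C
    bo = sum(1 - d(Io) .* (X(Io,:)*wo)) / sum(d(Io));   % bo as given on this slide
    % A new sample x (column vector) is classified by sign(wo'*x + bo)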
Inner Product Kernels
In general, the input is first transformed via a set of nonlinear functions {φj(x)} and then subjected to the hyper-plane classifier
g(x) = Σ_{j=1}^{p} wj φj(x) + b = Σ_{j=0}^{p} wj φj(x) = wᵀφ,   with b = w0 and φ0(x) = 1.
Define the inner product kernel as
K(x, y) = Σ_{j=0}^{p} φj(x) φj(y) = φ(x)ᵀφ(y);
one may then obtain a dual optimization problem formulation as:
Q(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj di dj K(xi, xj)
Often, the dimension of φ (= p + 1) >> the dimension of x!
Polynomial Kernel
Consider a polynomial kernel
K(x, y) = (1 + xᵀy)² = 1 + 2 Σ_{i=1}^{m} xi yi + 2 Σ_{i=1}^{m} Σ_{j=i+1}^{m} xi yi xj yj + Σ_{i=1}^{m} xi² yi²
Let K(x, y) = φᵀ(x) φ(y); then
φ(x) = [1, x1², …, xm², √2 x1, …, √2 xm, √2 x1x2, …, √2 x1xm, √2 x2x3, …, √2 x2xm, …, √2 x_{m-1}x_m]
     = [1, φ1(x), …, φp(x)]
where p = 1 + m + m + (m−1) + (m−2) + … + 1 = (m+2)(m+1)/2.
Hence, using a kernel, a low dimensional pattern classification problem (with dimension m) is solved in a higher dimensional space (dimension p + 1). But only the φj(x) corresponding to support vectors are used for pattern classification!
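The identity K(x, y) = (1 + xᵀy)² = φᵀ(x)φ(y) is easy to check numerically. A small sketch for m = 2, using the explicit φ(x) that appears on the next slide (the test vectors are arbitrary assumed values):

    % Check (1 + x'*y)^2 == phi(x)'*phi(y) for the m = 2 polynomial kernel
    phi = @(x) [1; x(1)^2; x(2)^2; sqrt(2)*x(1); sqrt(2)*x(2); sqrt(2)*x(1)*x(2)];
    x = [0.3; -1.2];  y = [2.0; 1.0];            % arbitrary test vectors (assumed)
    k_direct  = (1 + x'*y)^2;                    % kernel evaluated directly, 0.16 here
    k_feature = phi(x)' * phi(y);                % kernel via the explicit feature map
    fprintf('%g vs %g\n', k_direct, k_feature)   % the two values agree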
Numerical Example: XOR Problem
Training samples:
(−1, −1; −1), (−1, +1; +1), (+1, −1; +1), (+1, +1; −1)
x = [x1, x2]ᵀ. Using K(x, y) = (1 + xᵀy)², one has
φ(x) = [1, x1², x2², √2 x1, √2 x2, √2 x1x2]ᵀ
For the four samples,
φ(x1)ᵀ = [1, 1, 1, −√2, −√2, +√2]
φ(x2)ᵀ = [1, 1, 1, −√2, +√2, −√2]
φ(x3)ᵀ = [1, 1, 1, +√2, −√2, −√2]
φ(x4)ᵀ = [1, 1, 1, +√2, +√2, +√2]
and the kernel matrix is
K(xi, xj) = φᵀ(xi)φ(xj) =
[ 9  1  1  1 ]
[ 1  9  1  1 ]
[ 1  1  9  1 ]
[ 1  1  1  9 ]
Note dim[φ(x)] = 6 > dim[x] = 2!
Dim(K) = Ns = # of support vectors.
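A one-line check of the kernel matrix above (a minimal sketch):

    % XOR training data and kernel matrix K(xi,xj) = (1 + xi'*xj)^2
    X = [-1 -1; -1 1; 1 -1; 1 1];     % one sample per row
    K = (1 + X*X').^2                 % gives 9 on the diagonal and 1 elsewhere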
XOR Problem (Continued)
Note that K(xi, xj) can be calculated directly, without using φ! E.g.,
K1,1 = (1 + [−1 −1]·[−1; −1])² = 9;   K1,2 = (1 + [−1 −1]·[−1; +1])² = 1.
The corresponding Lagrange multipliers are α = (1/8)[1 1 1 1]ᵀ.
W = Σ_{i=1}^{N} αi di φ(xi) = Φᵀ [α1d1, α2d2, …, αNdN]ᵀ
  = (1/8)[ −φ(x1) + φ(x2) + φ(x3) − φ(x4) ]
  = [ 0, 0, 0, 0, 0, −1/√2 ]ᵀ
Hence the hyper-plane is:
y = Wᵀφ(x) = −x1x2

(x1, x2):       (−1, −1)   (−1, +1)   (+1, −1)   (+1, +1)
y = −x1x2:         −1         +1         +1         −1
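Both the stated multipliers and the resulting decision function can be verified numerically; a sketch using quadprog on the kernelized dual:

    % Kernelized dual for the XOR problem
    X = [-1 -1; -1 1; 1 -1; 1 1];  d = [-1; 1; 1; -1];
    K = (1 + X*X').^2;
    H = (d*d') .* K;
    alpha = quadprog(H, -ones(4,1), [], [], d', 0, zeros(4,1), []);
    disp(alpha')                            % expect [1 1 1 1]/8
    % Decision function y(x) = sum_i alpha_i*di*K(xi, x); b = 0 here by symmetry
    y = @(x) sum(alpha .* d .* (1 + X*x).^2);
    arrayfun(@(i) y(X(i,:)'), 1:4)          % expect [-1 +1 +1 -1], i.e. -x1*x2 at the corners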
Other Types of Kernels
• Polynomial learning machine: K(x, y) = (xᵀy + 1)^p, with p selected a priori.
• Radial basis function: K(x, y) = exp(−||x − y||²/(2σ²)), with σ² selected a priori.
• Two-layer perceptron: K(x, y) = tanh(β0 xᵀy + β1); only some β0 and β1 values are feasible.

What kernel is feasible? It must satisfy Mercer's theorem!
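In code, each of these kernels is a one-line function of two column vectors; a brief sketch (the hyper-parameter values are placeholders):

    % The three kernels from the table, as anonymous functions of column vectors x, y
    p = 3;           Kpoly = @(x, y) (x'*y + 1)^p;                  % polynomial of degree p
    sigma2 = 0.5;    Krbf  = @(x, y) exp(-norm(x-y)^2/(2*sigma2));  % radial basis function
    b0 = 1; b1 = -1; Kmlp  = @(x, y) tanh(b0*(x'*y) + b1);          % two-layer perceptron
    % e.g. Kpoly([1;2],[0;1]), Krbf([1;2],[0;1]), Kmlp([1;2],[0;1])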
Mercer's Theorem
Let K(x, y) be a continuous, symmetric kernel defined on a ≤ x, y ≤ b. K(x, y) admits an eigen-function expansion
K(x, y) = Σ_{i=1}^{∞} λi φi(x) φi(y)
with λi > 0 for each i. This expansion converges absolutely and uniformly if and only if
∫_a^b ∫_a^b K(x, y) ψ(x) ψ(y) dx dy ≥ 0
for all ψ(x) such that
∫_a^b ψ²(x) dx < ∞.
Testing with Kernels
For many types of kernels, φ(x) cannot be explicitly represented or even found. However,
W = Σ_{i=1}^{N} αi di φ(xi) = Φᵀ [α1d1, α2d2, …, αNdN]ᵀ = Φᵀ f
y(x) = Wᵀφ(x) = [Φᵀf]ᵀ φ(x) = fᵀ Φ φ(x) = fᵀ K(xi, x) = K(x, xi)ᵀ f
where K(xi, x) denotes the N×1 vector with entries K(xi, x), i = 1, …, N.
Hence there is no need to know φ(x) explicitly! For example, in the XOR problem, f = (1/8)[−1 +1 +1 −1]ᵀ. Suppose that x = (−1, +1); then
y(x) = K(x, xj)ᵀ f
     = [ (1 + [−1 −1]·[−1; +1])²,  (1 + [−1 +1]·[−1; +1])²,  (1 + [+1 −1]·[−1; +1])²,  (1 + [+1 +1]·[−1; +1])² ] f
     = [1  9  1  1] · [−1/8  1/8  1/8  −1/8]ᵀ = 1
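The same calculation in code, using only kernel evaluations and never forming φ(x) (a brief sketch reusing the XOR data):

    % Classify x = (-1, +1) using only kernel evaluations
    X = [-1 -1; -1 1; 1 -1; 1 1];          % XOR training samples, one per row
    f = (1/8) * [-1; 1; 1; -1];            % f = [alpha_i * d_i]
    x = [-1; 1];                           % test point
    k = (1 + X*x).^2;                      % K(xi, x) for each training sample, here [1 9 1 1]'
    y = k' * f                             % = 1, so x is assigned to class +1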
SVM Using Nonlinear Kernels
[Diagram: the input vector x feeds a bank of kernel evaluations K(x, xj), one per support vector, implementing the nonlinear transform; their outputs are weighted by f and summed to produce the classifier output.]

Using a kernel, low dimensional feature vectors are mapped to a high dimensional (possibly infinite dimensional) kernel feature space, where the data are likely to be linearly separable.

(C) 2001-2005 by Yu Hen Hu