SVMs and the Kernel Trick


Support Vector Machines and the Kernel Trick
William Cohen
3-26-2007
The voted perceptron

[Figure: an instance xi is passed from A to B; B returns a prediction ŷi.]
Compute: ŷi = vk · xi
If mistake: vk+1 = vk + yi xi

[Figure 3a: the guess v2 after the two positive examples, v2 = v1 + x2.
Figure 3b: the guess v2 after the one positive and one negative example, v2 = v1 - x2.
Both panels show the target vector u, its negation -u, the margin γ, and the band of width 2γ.]
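The update rule above is easy to sketch in code. Below is a minimal plain (unvoted) perceptron in Python, assuming numeric feature vectors and labels in {-1, +1}; the voted variant would additionally keep each intermediate vk with a count of how long it survived and take a weighted vote at prediction time. The function names are my own.

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Plain perceptron: X is an (n, d) array, y has entries in {-1, +1}."""
    v = np.zeros(X.shape[1])
    mistakes = []                          # (y_i, x_i) for every mistake made
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(v, xi) <= 0:    # wrong (or zero) score: a mistake
                v = v + yi * xi            # v_{k+1} = v_k + y_i x_i
                mistakes.append((yi, xi))
    return v, mistakes

def perceptron_predict(v, x):
    """Predict with the final weight vector."""
    return 1 if np.dot(v, x) > 0 else -1
```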
Perceptrons vs SVMs
• For the voted perceptron to “work” (in this proof), we
need to assume there is some u with unit norm (u · u = ||u||² = 1)
such that for all i, yi (u · xi) > γ
Perceptrons vs SVMs
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: γ, (x1,y1), (x2,y2), (x3,y3), …
– Find: some w where
• ||w||² = 1 and
• for all i, yi (w · xi) > γ
Perceptrons vs SVMs
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find: some w and γ such that
• ||w|| = 1 and
• for all i, yi (w · xi) > γ
The best possible w and γ
Perceptrons vs SVMs
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Maximize γ under the constraints
• ||w||² = 1 and
• for all i, yi (w · xi) > γ
– Minimize ||w||² under the constraint
• for all i, yi (w · xi) ≥ 1
Units are arbitrary: rescaling w rescales γ by the same
factor, so the two formulations are equivalent
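A short check of that equivalence (my own working, using the notation above): if ||u|| = 1 and yi (u · xi) ≥ γ for all i, set w = u/γ; then

```latex
y_i\,(w \cdot x_i) \;=\; \tfrac{1}{\gamma}\, y_i\,(u \cdot x_i) \;\ge\; 1,
\qquad \|w\| \;=\; \tfrac{1}{\gamma}\,\|u\| \;=\; \tfrac{1}{\gamma}.
```

So a large margin γ for a unit-norm u is the same thing as a small norm ||w|| once the margin is fixed at 1, which is why the second formulation minimizes ||w||².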
SVMs and optimization
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find:

    arg min_w  ½ ||w||²
    such that ∀i, yi (w · xi) ≥ 1

This is a constrained optimization problem: ½ ||w||² is the
objective function, and the inequalities yi (w · xi) ≥ 1 are the
constraints. A famous example of constrained optimization is
linear programming, where the objective function is linear and
the constraints are linear (in)equalities… but here the objective
is quadratic, so you need to use quadratic programming.
SVMs and optimization
• Motivation for SVMs as “better perceptrons”
– learners that minimize w · w under the constraint
that for all i, yi (w · xi) ≥ 1
• Questions:
– What if the data isn’t separable?
• Slack variables
• Kernel trick
– How do you solve this constrained optimization
problem?
SVMs and optimization
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find:

    arg min_w  ½ ||w||²
    such that ∀i, yi (w · xi) ≥ 1

or, with slack variables zi (“SVM with slack variables”):

    arg min_{w,z}  ½ ||w||² + C Σi zi
    such that ∀i, yi (w · xi) + zi ≥ 1 and zi ≥ 0
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
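A minimal training sketch using scikit-learn, whose SVC class wraps the LIBSVM library linked above; the toy data and parameter values are my own illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data with labels in {-1, +1}
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# C is the slack penalty: larger C tolerates fewer margin violations
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # the learned w and bias
print(clf.support_vectors_)         # the examples with nonzero alpha
```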
The Kernel Trick
The voted perceptron

[Figure: an instance xi is passed from A to B; B returns a prediction ŷi.]
Compute: ŷi = vk · xi
If mistake: vk+1 = vk + yi xi
The kernel trick
Remember: vk is a sparse weighted sum of examples

    vk = yi1 xi1 + yi2 xi2 + … + yik xik

where i1, …, ik are the mistakes. (You can think of this as a
weighted sum of all examples with some of the weights being
zero; the non-zero weighted examples are the support vectors.)
So:

    xtest · vk = xtest · (yi1 xi1 + yi2 xi2 + … + yik xik)
              = xtest · yi1 xi1 + xtest · yi2 xi2 + … + xtest · yik xik
              = yi1 (xtest · xi1) + yi2 (xtest · xi2) + … + yik (xtest · xik)
The kernel trick – con’t
Since:

    vk = yi1 xi1 + yi2 xi2 + … + yik xik

where i1, …, ik are the mistakes, then

    xtest · vk = yi1 (xtest · xi1) + yi2 (xtest · xi2) + … + yik (xtest · xik)

Consider a preprocessor that replaces every x with x′
to include, directly in the example, all the pairwise
variable interactions, so what is learned is a vector v′:

    x′test · v′k = yi1 (x′test · x′i1) + yi2 (x′test · x′i2) + … + yik (x′test · x′ik)
                = yi1 K(xtest, xi1) + yi2 K(xtest, xi2) + … + yik K(xtest, xik)

    where K(xtest, xi) = x′test · x′i
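A sketch of this idea in Python: a kernel perceptron that stores the mistakes and predicts with K(xtest, xi) instead of ever forming x′ explicitly. This is my own illustration of the trick described above, not code from the slides.

```python
import numpy as np

def kernel_perceptron_train(X, y, K, epochs=10):
    """Store (y_i, x_i) for every mistake; K(a, b) is any kernel function."""
    mistakes = []
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(yj * K(xj, xi) for yj, xj in mistakes)
            if yi * score <= 0:            # mistake: remember this example
                mistakes.append((yi, xi))
    return mistakes

def kernel_perceptron_predict(mistakes, K, x_test):
    """Prediction uses only kernel evaluations against the stored mistakes."""
    score = sum(yj * K(xj, x_test) for yj, xj in mistakes)
    return 1 if score > 0 else -1

# Example kernel: the quadratic kernel from the next slides, K(u, v) = (u.v + 1)^2
quadratic = lambda u, v: (np.dot(u, v) + 1.0) ** 2
```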
The kernel trick – con’t
    u = ⟨a, b⟩ = ax + by
    v = ⟨c, d⟩ = cx + dy
    u · v = acx + bdy

A voted perceptron over vectors like u, v is a linear
function…

Replacing u with u′ would lead to non-linear functions –
f(x, y, xy, x², …):

    u′ = ⟨a, b, e, f, g, h⟩ = ax + by + ex² + fy² + gxy + h
    v′ = ⟨c, d, l, m, n, p⟩ = cx + dy + lx² + my² + nxy + p
    u′ · v′ = acx + bdy + elx² + fmy² + gnxy + hp
The kernel trick – con’t
But notice… if we replace u · v with (u · v + 1)²:

    (u · v + 1)² = (acx + bdy + 1)²
                = (acx + bdy + 1)(acx + bdy + 1)
                = a²c²·x² + 2abcd·xy + 2ac·x + b²d²·y² + 2bd·y + 1

Compare to

    u′ = ⟨a, b, e, f, g, h⟩ = ax + by + ex² + fy² + gxy + h
    v′ = ⟨c, d, l, m, n, p⟩ = cx + dy + lx² + my² + nxy + p
    u′ · v′ = acx + bdy + elx² + fmy² + gnxy + hp
The kernel trick – con’t
So – up to constants on the cross-product terms –

    (u · v + 1)² ≈ u′ · v′

Why not replace the computation of

    x′test · v′k = yi1 (x′test · x′i1) + yi2 (x′test · x′i2) + … + yik (x′test · x′ik)

with the computation of

    x′test · v′k = yi1 K(xtest, xi1) + … + yik K(xtest, xik)

    where K(x, xi) = (x · xi + 1)² ?
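A quick numeric check of this correspondence (my own example): with the explicit quadratic feature map φ(u) = (u1², u2², √2·u1u2, √2·u1, √2·u2, 1), the preprocessed inner product equals (u · v + 1)² exactly; the "constants on the cross-product terms" are simply absorbed into φ.

```python
import numpy as np

def phi(u):
    """Explicit quadratic feature map whose inner product equals (u.v + 1)^2."""
    a, b = u
    return np.array([a*a, b*b, np.sqrt(2)*a*b, np.sqrt(2)*a, np.sqrt(2)*b, 1.0])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

lhs = (np.dot(u, v) + 1.0) ** 2     # kernel evaluation, no preprocessing
rhs = np.dot(phi(u), phi(v))        # explicit preprocessing, then dot product
print(lhs, rhs)                      # both print 4.0
```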
The kernel trick – con’t
General idea: replace an expensive preprocessor x → x′
and an ordinary inner product with no preprocessor and a
function K(x, xi) where

    K(x, xi) = x′ · x′i

Some popular kernels for numeric vectors x:

    polynomial:     Kd(x, xi) = (x · xi + 1)^d

    Gaussian/RBF:   K(x, xi) = exp( -||x - xi||² / (2σ²) )
Demo with An Applet
http://www.site.uottawa.ca/~gcaron/SVMApplet/SVMApplet.html
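The polynomial and Gaussian/RBF kernels listed above take only a few lines of Python each; the function names and the σ parameterization of the RBF width are my own choices:

```python
import numpy as np

def polynomial_kernel(x, xi, d=2):
    """K_d(x, xi) = (x . xi + 1)^d"""
    return (np.dot(x, xi) + 1.0) ** d

def rbf_kernel(x, xi, sigma=1.0):
    """K(x, xi) = exp(-||x - xi||^2 / (2 sigma^2))"""
    diff = np.asarray(x) - np.asarray(xi)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))
```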
The kernel trick – con’t
Kernels work for other data structures also!
• String kernels:
• x and xi are strings, S = the set of shared substrings, |s| = the length of
string s; by dynamic programming you can quickly compute

    Kλ(x, xi) = Σs∈S λ^|s|

There are also tree kernels, graph kernels, …
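A brute-force sketch of this substring kernel in Python, with no dynamic programming, so it is only practical for short strings; the decay parameter name lam is my own:

```python
def substring_kernel(x, xi, lam=0.5):
    """Sum lam^|s| over the distinct substrings s shared by x and xi."""
    def substrings(s):
        return {s[a:b] for a in range(len(s)) for b in range(a + 1, len(s) + 1)}
    shared = substrings(x) & substrings(xi)
    return sum(lam ** len(s) for s in shared)

print(substring_kernel("william", "willow"))   # shares "w", "wi", "wil", "will", ...
```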
The kernel trick – con’t
[Example: x = "william", j = {1,3,4}, x[j] = "wll", len(x, j) = 4; and, e.g., "wl" < "wll".]

Kernels work for other data structures also!
• String kernels:
• x and xi are strings, S = the set of shared substrings, j, k are subsets
of the positions inside x, xi, len(x, j) is the distance between the
first position in j and the last, s < t means s is a substring of t; by
dynamic programming you can quickly compute

    K(x, xi) = Σs Σj: s = x[j] Σk: s = xi[k] λ^len(x,j) · λ^len(xi,k)
The kernel trick – con’t
Even more general idea: use any function K that is
• Continuous
• Symmetric—i.e., K(u,v)=K(v,u)
• “Positive semidefinite”—i.e., every Gram matrix M[i,j]=K(xi,xj) has only
non-negative eigenvalues (a stronger condition than K(u,v)≥0)
Then by an ancient theorem due to Mercer, K corresponds
to some combination of a preprocessor and an inner
product: i.e.,
    K(x, xi) = x′ · x′i
Terminology: K is a Mercer kernel. The set of all x’ is a reproducing
kernel Hilbert space (RKHS). The matrix M[i,j]=K(xi,xj) is a Gram
matrix.
SVMs and optimization
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find (primal form):

    arg min_{w,z}  ½ ||w||² + C Σi zi
    such that ∀i, yi (w · xi) + zi ≥ 1 and zi ≥ 0

which is equivalent to finding (the Lagrangian dual):

    arg min_α  ½ Σi Σj yi yj αi αj (xi · xj) − Σi αi
    such that Σi yi αi = 0 and ∀i, 0 ≤ αi ≤ C
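A sketch of solving this dual with an off-the-shelf QP solver; cvxpy is my own choice here, not something the slides use, and a tiny ridge is added so the numerical positive-semidefiniteness check passes:

```python
import numpy as np
import cvxpy as cp

def svm_dual(X, y, C=1.0):
    """Solve the dual QP above; returns alpha and the recovered weight vector w."""
    n = X.shape[0]
    Q = np.outer(y, y) * (X @ X.T)          # Q[i,j] = y_i y_j (x_i . x_j)
    Q = Q + 1e-8 * np.eye(n)                # small ridge for numerical stability
    alpha = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.quad_form(alpha, Q) - cp.sum(alpha))
    constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
    cp.Problem(objective, constraints).solve()
    a = alpha.value
    w = (a * y) @ X                         # w = sum_i alpha_i y_i x_i
    return a, w
```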
Lagrange multipliers
maximize f(x,y) = 2 - x² - 2y²
subject to g(x,y) = x² + y² - 1 = 0
Lagrange multipliers
maximize f(x,y) = 2 - x² - 2y²
subject to g(x,y) = x² + y² - 1 = 0

[Figure: contours of f, with the gradients ∇f and ∇g drawn on the circle g = 0.]
Claim: at the constrained maximum, the gradient of f must be
perpendicular to the constraint surface g = 0 (i.e., parallel to ∇g).
Lagrange multipliers
maximize f(x,y) = 2 - x² - 2y²
subject to g(x,y) = x² + y² - 1 = 0

Claim: at the constrained maximum, the gradient of f must be
perpendicular to the constraint surface g = 0 (i.e., parallel to ∇g):

    ∇f(x,y) = λ ∇g(x,y)
    g(x,y) = 0

equivalently  ∇Λ(x,y,λ) = 0,  where  Λ(x,y,λ) = f(x,y) − λ g(x,y)
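Working the example through (my own derivation, using the slide's f and g):

```latex
\nabla f = (-2x,\,-4y), \qquad \nabla g = (2x,\,2y), \qquad
\nabla f = \lambda\,\nabla g \;\Rightarrow\; -2x = 2\lambda x, \quad -4y = 2\lambda y .
```

Either x = 0 (then the constraint forces y = ±1, λ = -2, and f = 0) or λ = -1 (then y = 0, x = ±1, and f = 1). The constrained maximum is f = 1 at (x, y) = (±1, 0).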
SVMs and optimization
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find (primal form):

    arg min_{w,z}  ½ ||w||² + C Σi zi
    such that ∀i, yi (w · xi) + zi ≥ 1 and zi ≥ 0

which is equivalent to the dual:

    arg min_α  ½ Σi Σj yi yj αi αj (xi · xj) − Σi αi
    such that Σi yi αi = 0 and ∀i, 0 ≤ αi ≤ C

Some key points:
• Solving the QP directly (Vapnik's original method) is possible but
expensive.
• The dual form can be expressed as constraints on each example,
e.g., αi = 0 ⇒ yi (w · xi) ≥ 1. These per-example conditions are the
KKT (Karush-Kuhn-Tucker) conditions, also called Kuhn-Tucker
conditions, after Karush (1939) and Kuhn & Tucker (1951).
• The fastest methods for SVM learning ignore most of the constraints,
solve a subproblem containing a few ‘active constraints’, then cleverly
pick a few additional constraints and repeat…
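In practice you rarely solve this QP by hand. A minimal sketch with scikit-learn (my own choice; the slides only link LIBSVM) showing how the resulting αi and the support vectors are exposed after fitting:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

print(clf.support_)      # indices of the examples with alpha_i > 0
print(clf.dual_coef_)    # the products y_i * alpha_i for those examples
print(clf.n_support_)    # number of support vectors per class
```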
More on SVMs and kernels
• Many other types of algorithms can be
“kernelized”
– Gaussian processes, memory-based/nearest
neighbor methods, ….
• Work on optimization for linear SVMs is
very active