SVMs and the Kernel Trick
Support Vector Machines and the Kernel Trick
William Cohen
3-26-2007
The voted perceptron
(Diagram: the learner receives instance xi, predicts ŷi, and is then told the true label yi.)
Compute: ŷi = vk · xi
If mistake: vk+1 = vk + yi xi
(3a) The guess v2 after the two positive examples: v2 = v1 + x2.
(3b) The guess v2 after the one positive and one negative example: v2 = v1 − x2.
(The accompanying figures also show the target unit vector u and the margin γ.)
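To make the update rule concrete, here is a minimal sketch of the mistake-driven perceptron loop in Python; the arrays X, y and the number of epochs are illustrative assumptions, and this is the basic (unvoted) perceptron – the voted version would also keep each intermediate vk and how long it survives.

import numpy as np

def perceptron(X, y, epochs=10):
    """Mistake-driven perceptron: X is (n, d), y is (n,) with labels +1/-1."""
    v = np.zeros(X.shape[1])          # current hypothesis v_k
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = np.sign(v @ x_i)  # compute: sign of v_k . x_i
            if y_hat != y_i:          # if mistake:
                v = v + y_i * x_i     #     v_{k+1} = v_k + y_i x_i
    return v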
Perceptrons vs SVMs
• For the voted perceptron to “work” (in this proof), we
need to assume there is some unit vector u – i.e., u·u = ||u||² = 1 –
such that, for all i, yi u·xi > γ
Perceptrons vs SVMs
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: γ, (x1,y1), (x2,y2), (x3,y3), …
– Find: some w where
• ||w||² = 1 and
• for all i, yi w·xi > γ
Perceptrons vs SVMs
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find: some w and γ such that
• ||w|| = 1 and
• for all i, yi w·xi > γ
(Figure: the best possible w and γ.)
Perceptrons vs SVMs
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Maximize γ under the constraint
• ||w||² = 1 and
• for all i, yi w·xi > γ
– Minimize ||w||² under the constraint
• for all i, yi w·xi > 1
Units are arbitrary: rescaling w rescales the margin γ by the same factor,
so fixing the margin at 1 and minimizing ||w|| gives an equivalent problem, as sketched below.
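To see the equivalence claimed here (a standard rescaling argument, sketched in LaTeX):

\|w\| = 1,\;\; y_i\, w \cdot x_i \ge \gamma \;\;\forall i
\quad\Longleftrightarrow\quad
\tilde{w} = w/\gamma,\;\; y_i\, \tilde{w} \cdot x_i \ge 1 \;\;\forall i,\;\; \|\tilde{w}\| = 1/\gamma

so maximizing \gamma (with \|w\| = 1) is the same as minimizing \|\tilde{w}\|^2 (with margin at least 1).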
SVMs and optimization
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find:
arg min_w (1/2) ||w||²            ← objective function
such that ∀i, yi w·xi ≥ 1         ← constraints
This is a constrained optimization problem. A famous example of constrained
optimization is linear programming, where the objective function is linear and
the constraints are linear (in)equalities… but here the objective is quadratic,
so you need to use quadratic programming.
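As an illustration of solving this small QP numerically, here is a sketch using scipy.optimize.minimize with SLSQP; SciPy and the toy data are assumptions for the example, and a dedicated QP or SVM solver would be the usual choice in practice.

import numpy as np
from scipy.optimize import minimize

# Toy separable data, labels y in {+1, -1}
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

objective = lambda w: 0.5 * np.dot(w, w)                  # (1/2) ||w||^2
constraints = [{'type': 'ineq',                           # y_i w.x_i - 1 >= 0
                'fun': lambda w, i=i: y[i] * np.dot(w, X[i]) - 1.0}
               for i in range(len(y))]

result = minimize(objective, x0=np.zeros(X.shape[1]),
                  method='SLSQP', constraints=constraints)
w = result.x   # maximum-margin separator (no bias term in this sketch)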
SVMs and optimization
• Motivation for SVMs as “better perceptrons”
– learners that minimize w·w under the constraint
that, for all i, yi w·xi > 1
• Questions:
– What if the data isn’t separable?
• Slack variables
• Kernel trick
– How do you solve this constrained optimization
problem?
SVMs and optimization
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find:
arg min_w (1/2) ||w||²
such that ∀i, yi w·xi ≥ 1
– …or, with slack variables:
arg min_{w,z} (1/2) ||w||² + C Σi zi
such that ∀i, yi w·xi ≥ 1 − zi and zi ≥ 0
SVM with slack variables
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
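The slack-variable problem is equivalent to unconstrained minimization of (1/2)||w||² + C Σi max(0, 1 − yi w·xi), since the optimal zi is exactly that hinge value. Below is a minimal subgradient-descent sketch of that objective in NumPy; the step size, iteration count, and absence of a bias term are simplifying assumptions, and this is not the solver libsvm uses.

import numpy as np

def linear_svm_subgradient(X, y, C=1.0, lr=0.01, iters=1000):
    """Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i w.x_i) by subgradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)                  # y_i (w . x_i) for all i
        violated = margins < 1                 # examples with nonzero hinge loss / slack
        grad = w - C * (y[violated][:, None] * X[violated]).sum(axis=0)
        w -= lr * grad
    return w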
The Kernel Trick
The voted perceptron
(Recall: the learner receives instance xi, predicts ŷi, and is then told the true label yi.)
Compute: ŷi = vk · xi
If mistake: vk+1 = vk + yi xi
The kernel trick
Remember: vk is a sparse weighted sum of examples
vk = yi1 xi1 + yi2 xi2 + … + yik xik
where i1, …, ik are the mistakes… so:
xtest · vk = xtest · (yi1 xi1 + yi2 xi2 + … + yik xik)
           = xtest · yi1 xi1 + xtest · yi2 xi2 + … + xtest · yik xik
           = yi1 (xtest · xi1) + yi2 (xtest · xi2) + … + yik (xtest · xik)
(You can think of vk as a weighted sum of all examples, with some of the weights
being zero – the examples with non-zero weight are the support vectors.)
The kernel trick – con’t
Since:
vk = yi1 xi1 + yi2 xi2 + … + yik xik
where i1, …, ik are the mistakes… then
xtest · vk = yi1 (xtest · xi1) + yi2 (xtest · xi2) + … + yik (xtest · xik)
Consider a preprocessor that replaces every x with x' to include, directly in the
example, all the pairwise variable interactions, so what is learned is a vector v':
x'test · v'k = yi1 (x'test · x'i1) + yi2 (x'test · x'i2) + … + yik (x'test · x'ik)
             = yi1 K(xtest, xi1) + yi2 K(xtest, xi2) + … + yik K(xtest, xik)
where K(xtest, xi) = x'test · x'i
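Because the hypothesis is just a weighted sum of the mistake examples, the perceptron can be "kernelized" by storing mistake counts instead of an explicit weight vector. A minimal sketch in Python; the kernel function K and the training-loop details are illustrative assumptions.

import numpy as np

def kernel_perceptron(X, y, K, epochs=10):
    """Kernel perceptron: alpha[j] counts the mistakes made on example j."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # score(x_i) = sum_j alpha_j y_j K(x_j, x_i)  -- the kernelized v_k . x_i
            score = sum(alpha[j] * y[j] * K(X[j], X[i]) for j in range(n))
            if np.sign(score) != y[i]:
                alpha[i] += 1.0      # "add y_i x_i to v_k" becomes "increment alpha_i"
    return alpha

def predict(x_test, X, y, alpha, K):
    # x_test . v_k = sum over mistake examples of y_j K(x_j, x_test)
    return np.sign(sum(alpha[j] * y[j] * K(X[j], x_test) for j in range(len(y))))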
The kernel trick – con’t
u = ⟨a, b⟩   ↦   ax + by
v = ⟨c, d⟩   ↦   cx + dy
u·v   ↦   acx + bdy
A voted perceptron over vectors like u, v is a linear function…
Replacing u with u' would lead to non-linear functions – f(x, y, xy, x², …)
u' = ⟨a, b, …⟩   ↦   ax + by + ex² + fy² + gxy + h
v' = ⟨c, d, …⟩   ↦   cx + dy + lx² + my² + nxy + p
u'·v'   ↦   acx + bdy + elx² + fmy² + gnxy + hp
The kernel trick – con’t
But notice… if we replace u·v with (u·v + 1)² …
(u·v + 1)²   ↦   (acx + bdy + 1)²
            = (acx + bdy + 1)(acx + bdy + 1)
            = a²c²x² + 2abcd·xy + 2ac·x + b²d²y² + 2bd·y + 1
Compare to
u' = ⟨a, b, …⟩   ↦   ax + by + ex² + fy² + gxy + h
v' = ⟨c, d, …⟩   ↦   cx + dy + lx² + my² + nxy + p
u'·v'   ↦   acx + bdy + elx² + fmy² + gnxy + hp
The kernel trick – con’t
So – up to constants on the cross-product terms –
(u·v + 1)²  ≈  u'·v'
Why not replace the computation of
x'test · v'k = yi1 (x'test · x'i1) + yi2 (x'test · x'i2) + … + yik (x'test · x'ik)
with the computation of
x'test · v'k = yi1 K(xtest, xi1) + … + yik K(xtest, xik)
where K(x, xi) = (x · xi + 1)² ?
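A quick numeric check of the "up to constants" claim: with the √2 factors below (exactly those constants on the cross-product and linear terms), the expanded feature map reproduces the kernel value. The specific vectors are made up for the illustration.

import numpy as np

def phi(x):
    """Explicit feature map whose inner product equals (x . z + 1)^2 for 2-d x."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

u = np.array([0.7, -1.2])
v = np.array([2.0, 0.5])
assert np.isclose((u @ v + 1.0) ** 2, phi(u) @ phi(v))   # same value either way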
The kernel trick – con’t
General idea: replace an expensive preprocessor x → x' and an ordinary inner
product with no preprocessor and a function K(x, xi) where
K(x, xi) = x' · x'i
Some popular kernels for numeric vectors x:
polynomial:     Kd(x, xi) = (x · xi + 1)^d
Gaussian/RBF:   K(x, xi) = exp( −||x − xi||² / (2σ²) )
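Both kernels are a few lines of NumPy; sigma plays the role of the bandwidth σ in the Gaussian formula above, and the default parameter values are arbitrary.

import numpy as np

def polynomial_kernel(x, z, d=2):
    return (np.dot(x, z) + 1.0) ** d

def rbf_kernel(x, z, sigma=1.0):
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))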
Demo with An Applet
http://www.site.uottawa.ca/~gcaron/SVMApplet/SVMApplet.html
The kernel trick – con’t
Kernels work for other data structures also!
• String kernels:
• x and xi are strings, S = the set of shared substrings, |s| = the length of
string s, λ a decay parameter; by dynamic programming you can quickly compute
K(x, xi) = Σ_{s∈S} λ^|s|
There are also tree kernels, graph kernels, …..
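For intuition, here is a naive (non-dynamic-programming) sketch of the contiguous-substring version, enumerating shared substrings directly; lam is the decay parameter assumed above, and this brute-force form is only practical for short strings.

def substring_kernel(x, xi, lam=0.5):
    """K(x, xi) = sum over distinct shared substrings s of lam**len(s)."""
    substrings = lambda s: {s[a:b] for a in range(len(s))
                                    for b in range(a + 1, len(s) + 1)}
    shared = substrings(x) & substrings(xi)   # the set S of shared substrings
    return sum(lam ** len(s) for s in shared)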
The kernel trick – con’t
(Example: x = “william”, j = {1,3,4}, x[j] = “wll”, “wl” < “wll”, len(x,j) = 4.)
Kernels work for other data structures also!
• String kernels:
• x and xi are strings, j, k are subsets of the positions inside x, xi, len(x,j) is
the distance spanned from the first position in j to the last, s < t means s is a
substring of t, λ a decay parameter; by dynamic programming you can quickly compute
K(x, xi) = Σ_s Σ_{j: s = x[j]} Σ_{k: s = xi[k]} λ^( len(x,j) + len(xi,k) )
The kernel trick – con’t
Even more general idea: use any function K that is
• Continuous
• Symmetric—i.e., K(u,v)=K(v,u)
• “Positive semidefinite”, i.e., for any x1, …, xn the matrix M[i,j] = K(xi, xj) satisfies c·Mc ≥ 0 for every vector c
Then by an ancient theorem due to Mercer, K corresponds
to some combination of a preprocessor and an inner
product: i.e.,
K(x, xi) = x' · x'i
Terminology: K is a Mercer kernel. The set of all x’ is a reproducing
kernel Hilbert space (RKHS). The matrix M[i,j]=K(xi,xj) is a Gram
matrix.
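A small check that ties these definitions together: build the Gram matrix for a kernel on a handful of points and confirm its eigenvalues are (numerically) non-negative. The RBF kernel and the random data here are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                      # 5 points in R^3

def rbf(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

M = np.array([[rbf(xi, xj) for xj in X] for xi in X])   # Gram matrix M[i,j] = K(xi, xj)
eigenvalues = np.linalg.eigvalsh(M)              # M is symmetric, so use eigvalsh
assert np.all(eigenvalues >= -1e-10)             # positive semidefinite up to rounding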
SVMs and optimization
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find:
arg min_{w,z} (1/2) ||w||² + C Σi zi
such that ∀i, yi w·xi ≥ 1 − zi and zi ≥ 0          ← primal form
…which is equivalent to finding:
arg min_α (1/2) Σi Σj yi yj αi αj (xi · xj) − Σi αi
such that Σi yi αi = 0 and ∀i, 0 ≤ αi ≤ C          ← Lagrangian dual
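A standard consequence of this dual (stated here as a side note rather than taken from the slides): the stationarity condition expresses w as a weighted sum of examples, which is exactly the form the kernel trick needs.

w = \sum_i \alpha_i y_i x_i,
\qquad
f(x) = \mathrm{sign}\Big(\sum_i \alpha_i y_i \,(x_i \cdot x) + b\Big)
\;\longrightarrow\;
\mathrm{sign}\Big(\sum_i \alpha_i y_i \, K(x_i, x) + b\Big)

Only the examples with αi > 0 appear in the sum: the support vectors.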
Lagrange multipliers
maximize f(x,y) = 2 − x² − 2y²
subject to g(x,y) = x² + y² − 1 = 0
Lagrange multipliers
maximize f(x,y) = 2 − x² − 2y²
subject to g(x,y) = x² + y² − 1 = 0
Claim: at the constrained maximum, the gradient of f must be perpendicular
to the constraint curve g = 0 (i.e., parallel to the gradient of g).
(Figure: the level sets of f and the constraint circle g = 0.)
Lagrange multipliers
maximize f(x,y) = 2 − x² − 2y²
subject to g(x,y) = x² + y² − 1 = 0
Claim: at the constrained maximum, the gradient of f must be perpendicular
to the constraint curve g = 0 (i.e., parallel to the gradient of g):
∇f(x,y) = λ ∇g(x,y)
g(x,y) = 0
…or equivalently, ∇Λ(x,y,λ) = 0, where Λ(x,y,λ) = f(x,y) − λ g(x,y)
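Carrying the example through (a quick worked check, not on the original slides):

\nabla f(x,y) = (-2x,\,-4y), \qquad \nabla g(x,y) = (2x,\,2y)

-2x = 2\lambda x \;\Rightarrow\; x = 0 \ \text{or}\ \lambda = -1,
\qquad
-4y = 2\lambda y \;\Rightarrow\; y = 0 \ \text{or}\ \lambda = -2

\text{With } x^2 + y^2 = 1:\quad (x,y) = (0,\pm 1) \Rightarrow f = 0,
\qquad (x,y) = (\pm 1, 0) \Rightarrow f = 1

so the constrained maximum is f = 1 at (±1, 0), with λ = −1.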
SVMs and optimization
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find:
arg min_{w,z} (1/2) ||w||² + C Σi zi
such that ∀i, yi w·xi ≥ 1 − zi and zi ≥ 0          ← primal form
…which is equivalent to finding:
arg min_α (1/2) Σi Σj yi yj αi αj (xi · xj) − Σi αi
such that Σi yi αi = 0 and ∀i, 0 ≤ αi ≤ C          ← Lagrangian dual
SVMs and optimization
• Question: why not use this assumption directly in the
learning algorithm? i.e.
– Given: (x1,y1), (x2,y2), (x3,y3), …
– Find:
arg min_{w,z} (1/2) ||w||² + C Σi zi
such that ∀i, yi w·xi ≥ 1 − zi and zi ≥ 0
arg min_α (1/2) Σi Σj yi yj αi αj (xi · xj) − Σi αi
such that Σi yi αi = 0 and ∀i, 0 ≤ αi ≤ C
Some key points:
• Solving the QP directly (Vapnik’s original method) is possible but expensive.
• The dual form can be expressed as constraints on each example, e.g. αi = 0 ⇒ yi w·xi ≥ 1
(these per-example relations are the KKT (Karush-Kuhn-Tucker) conditions, or
Kuhn-Tucker conditions, after Karush (1939) and Kuhn & Tucker (1951)).
• The fastest methods for SVM learning ignore most of the constraints, solve a
subproblem containing a few ‘active constraints’, then cleverly pick a few
additional constraints & repeat…
More on SVMs and kernels
• Many other types of algorithms can be
“kernelized”
– Gaussian processes, memory-based/nearest
neighbor methods, ….
• Work on optimization for linear SVMs is
very active