Transcript: CS 540 - Fall 2015 (Shavlik), Lecture 23, Week 11 (11/17/15)

Today’s Topics
• Review of the linear SVM with Slack Variables
• Kernels (for non-linear models)
• SVM Wrapup
• Remind me to repeat Q’s for those listening to audio
• Informal Class Poll: Favorite ML Algo? (Domingos's on-line 'five tribes' talk 11/24)
  – Nearest Neighbors
  – D-trees / D-forests
  – Genetic Algorithms
  – Naïve Bayes / Bayesian Nets
  – Neural Networks
  – Support Vector Machines
Recall: Three Key SVM Concepts
• Maximize the Margin
  Don't choose just any separating plane
• Penalize Misclassified Examples
  Use soft constraints and 'slack' variables
• Use the 'Kernel Trick' to get Non-Linearity
  Roughly like 'hardwiring' the input → HU portion of ANNs (so we only need a perceptron)
Recall: 'Slack' Variables
Dealing with Data that is not Linearly Separable

[Figure: a separating plane with margin width 2 / ||w||2; the examples lying on the margin are the support vectors.]

For each wrong example, we pay a penalty, which is the distance we'd have to move it to get on the right side of the decision boundary (ie, the separating plane).

If we deleted any/all of the non-support vectors, we'd get the same answer!
Recall: The Math Program with Slack Vars

  min over w, S, θ of   ||w||1 + μ ||S||1

  (dimension of w = # of input features; dimension of S = # of training examples)

  such that
    w · xposi + Si ≥ θ + 1     for each positive example i
    w · xnegj – Sj ≤ θ – 1     for each negative example j
    Sk ≥ 0                     for all k

Notice we are solving the perceptron task with a complexity penalty (sum of wgts) – Hinton's wgt decay!

The S's are how far we would need to move an example in order for it to be on the proper side of the decision surface. (A small code sketch of this linear program follows.)
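Since the math program above is just a linear program, it can be handed to any off-the-shelf LP solver. Below is a minimal sketch, not from the lecture, assuming NumPy and SciPy (with the 'highs' method, SciPy ≥ 1.6); the helper name train_linear_svm_1norm is illustrative. The absolute values in ||w||1 are handled in the usual way by splitting w into non-negative parts.

```python
import numpy as np
from scipy.optimize import linprog

def train_linear_svm_1norm(X, y, mu=1.0):
    """Sketch of the 1-norm SVM LP:  min ||w||_1 + mu*||S||_1
    s.t.  y_i (w . x_i - theta) + S_i >= 1  and  S_i >= 0.
    X: (n, d) array of examples; y: labels in {+1, -1}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape

    # Variable vector: [w_plus (d), w_minus (d), theta (1), S (n)], with w = w_plus - w_minus
    c = np.concatenate([np.ones(2 * d), [0.0], mu * np.ones(n)])

    # Constraint rows:  -y_i * (x_i . (w+ - w-) - theta) - S_i <= -1
    A_ub = np.zeros((n, 2 * d + 1 + n))
    A_ub[:, :d]         = -y[:, None] * X      # coefficients on w_plus
    A_ub[:, d:2 * d]    =  y[:, None] * X      # coefficients on w_minus
    A_ub[:, 2 * d]      =  y                   # coefficients on theta
    A_ub[:, 2 * d + 1:] = -np.eye(n)           # coefficients on the slacks
    b_ub = -np.ones(n)

    bounds = [(0, None)] * (2 * d) + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w = res.x[:d] - res.x[d:2 * d]
    theta = res.x[2 * d]
    return w, theta
```

A learned model would then predict positive when w · x > theta (and negative otherwise).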
Recall: SVMs and Non-Linear Separating Surfaces

[Figure: + and – examples in the original (f1, f2) space are mapped into a new space whose dimensions are the derived features g(f1, f2) and h(f1, f2).]

Non-linearly map to the new space, then linearly separate in the new space. The result is a non-linear separator in the original space.
Idea #3: Finding Non-Linear Separating Surfaces via Kernels
• Map inputs into a new space, eg
  – ex1 features: x1 = 5, x2 = 4                        ← Old Rep
  – ex1 features: (x1², x2², 2·x1·x2) = (25, 16, 40)    ← New Rep (squares of the old rep)
• Solve the linear SVM program in this new space
  – Computationally complex if many derived features
  – But a clever trick exists!
• SVM terminology (differs from other parts of ML)
  – Input space: the original features
  – Feature space: the space of derived features
Kernels
• Kernels produce non-linear separating surfaces in the original space
• Kernels are similarity functions between two examples, K(exi, exj), like in k-NN
• Sample kernels (many variants exist)
    K(exi, exj) = exi • exj                      ← the linear kernel
    K(exi, exj) = exp{ –||exi – exj||² / σ² }    ← the Gaussian kernel
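As a concrete illustration (a minimal sketch, not from the slides; NumPy is assumed and the function names are my own), the two sample kernels can be written directly as functions of a pair of feature vectors, with sigma a free parameter of the Gaussian kernel:

```python
import numpy as np

def linear_kernel(x, z):
    """K(x, z) = x . z  -- gives back the ordinary linear SVM."""
    return np.dot(x, z)

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / sigma^2)  -- the Gaussian kernel."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / sigma ** 2)
```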
Kernels as Features (bug or feature?)
• Let the similarity between examples be the features!
• Feature j for example i is K(exi, exj)
• Models are of the form
    If ∑j αj K(exi, exj) > θ then + else –
  The α's weight the similarities (we hope many α = 0)
• This is an instance-based learner! So a model is determined by
  (a) finding some good exemplars (those with α ≠ 0; they are the support vectors) and
  (b) weighting the similarity to these exemplars
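A model of this form is just a weighted vote over similarities to the stored exemplars. A minimal sketch (the function name predict_kernel_model and its arguments are illustrative, not from the lecture):

```python
def predict_kernel_model(x_new, exemplars, alphas, theta, kernel):
    """If sum_j alpha_j * K(x_new, ex_j) > theta then '+' else '-'.
    exemplars: the stored examples (ideally only those with alpha != 0,
    ie the support vectors); alphas: their weights; kernel: K(x, z)."""
    score = sum(a * kernel(x_new, ex) for a, ex in zip(alphas, exemplars))
    return '+' if score > theta else '-'
```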
Our Array of 'Feature' Values

[Figure: an examples-by-features array whose entry (i, j) is K(exi, exj), the similarity between examples i and j – just an array of numbers.]

Our models are linear in these features, but will be non-linear in the original features if K is a non-linear function.

Notice that we can compute K(exi, exj) outside the SVM code! So we really only need code for the LINEAR SVM – it doesn't know where the 'rectangle' of data came from.
Concrete Example
Use the 'squaring' kernel, K(exi, exj) = (x • z)², to convert the following set of examples.

Raw Data:
        F1   F2   Output
  Ex1    4    2     T
  Ex2   -6    3     T
  Ex3   -5   -1     F

Derived Features (to be filled in):
        K(exi, ex1)   K(exi, ex2)   K(exi, ex3)   Output
  Ex1       __            __            __          T
  Ex2       __            __            __          T
  Ex3       __            __            __          F
Concrete Example (w/ answers)
Use the 'squaring' kernel, K(exi, exj) = (x • z)², to convert the following set of examples.

Raw Data:
        F1   F2   Output
  Ex1    4    2     T
  Ex2   -6    3     T
  Ex3   -5   -1     F

Derived Features:
        K(exi, ex1)   K(exi, ex2)   K(exi, ex3)   Output
  Ex1      400           324           484          T
  Ex2      324          2025           729          T
  Ex3      484           729           676          F

(Probably want to divide these by 1000 to scale the derived features.)
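These numbers are easy to check; here is a quick sketch (assuming NumPy) that builds the derived-feature array by applying the squaring kernel to every pair of examples:

```python
import numpy as np

X = np.array([[ 4,  2],    # Ex1
              [-6,  3],    # Ex2
              [-5, -1]])   # Ex3

# Squaring kernel applied to every pair: K[i, j] = (x_i . x_j)^2
K = (X @ X.T) ** 2
print(K)
# [[ 400  324  484]
#  [ 324 2025  729]
#  [ 484  729  676]]
```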
A Simple Example of a Kernel Creating a Non-Linear Separation
Assume K(A, B) = –distance(A, B)

[Figure, left: the original feature space with examples ex1 … ex9 and the separating surface in that space (non-linear!). Figure, right: the kernel-produced feature space (only the two dimensions K(exi, ex1) and K(exi, ex6) shown), where a separating plane suffices.]

Model:  if K(exnew, ex1) > –5 then GREEN else RED
Our 1-Norm SVM with Kernels

  min over α, S, θ of   ||α||1 + μ ||S||1

  (We use α instead of w to indicate we're weighting similarities rather than 'raw' features.)

  such that
    ∀ pos ex's:  { ∑j αj K(xj, xposi) } + Si ≥ θ + 1
    ∀ neg ex's:  { ∑j αj K(xj, xnegk) } – Sk ≤ θ – 1
    ∀ m:  Sm ≥ 0

The same linear LP code can be used – simply create the K()'s externally! (See the sketch below.)
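Continuing the earlier sketches (and reusing the hypothetical gaussian_kernel and train_linear_svm_1norm defined above), kernelizing is just a data-preprocessing step: build the array of K values and hand it to the same LP code.

```python
import numpy as np

def kernelize(X, kernel):
    """Build the 'rectangle' of derived features: entry (i, j) is K(x_i, x_j)."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# Hypothetical usage, reusing the earlier sketches:
#   K = kernelize(X, gaussian_kernel)
#   alphas, theta = train_linear_svm_1norm(K, y, mu=1.0)   # same LP code, new features
```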
The Kernel 'Trick'
• The Linear SVM can be written (using the [not-on-final] primal-dual concept of LPs)
    min over α of   ½ ∑i ∑j yi yj αi αj (exi • exj)  –  ∑i αi     (constraints on the α's omitted)
• Whenever we see the dot product exi • exj, we can replace it with a kernel K(exi, exj)
  – this is called the 'kernel trick'
    http://en.wikipedia.org/wiki/Kernel_trick
• This trick is not only for SVMs
  – ie, 'kernel machines' are a broad ML topic
  – can use 'similarity to examples' as features for ANY ML algo!
  – eg, run d-trees with kernelized features
Kernels and Mercer's Theorem
K(x, y)'s that are
  – continuous
  – symmetric: K(x, y) = K(y, x)
  – positive semidefinite
    (the square Hermitian matrix they create has eigenvalues that are all non-negative;
     see en.wikipedia.org/wiki/Positive_semidefinite_matrix)
are equivalent to a dot product in some space:
    K(x, y) = Φ(x) • Φ(y)

Note: we can use any similarity function to create a new 'feature space' and solve with a linear SVM, but the 'dot product in a derived space' interpretation will be lost unless Mercer's Theorem holds.
The New Space for a Sample Kernel
Let K(x, z) = (x • z)² and let # of features = 2

  (x • z)² = (x1z1 + x2z2)²
           = x1x1z1z1 + x1x2z1z2 + x2x1z2z1 + x2x2z2z2
           = <x1x1, x1x2, x2x1, x2x2> • <z1z1, z1z2, z2z1, z2z2>

• This is our new feature space, with 4 dimensions; we're doing a dot product in it!
• Key point: we don't explicitly create the expanded 'raw' feature space, but the result is the same as if we did.
• Note: if we used an exponent > 2, we'd have gotten a much larger 'virtual' feature space for very little cost!

Notation: <a, b, …, z> indicates a vector, with its components explicitly listed.
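A quick numeric check of this identity (a sketch, assuming NumPy; the examples reuse Ex1 and Ex2 from the earlier concrete example): for two 2-dimensional vectors, (x • z)² equals the dot product of the expanded 4-dimensional representations.

```python
import numpy as np

def expand(v):
    """Map a 2-d vector to the 4-d space <v1*v1, v1*v2, v2*v1, v2*v2>."""
    return np.array([v[0]*v[0], v[0]*v[1], v[1]*v[0], v[1]*v[1]])

x = np.array([4.0, 2.0])    # Ex1
z = np.array([-6.0, 3.0])   # Ex2

print(np.dot(x, z) ** 2)              # 324.0 -- kernel computed in the original space
print(np.dot(expand(x), expand(z)))   # 324.0 -- dot product in the expanded space
```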
Review: Matrix Multiplication
A B = C
  Matrix A is M by K
  Matrix B is K by N
  Matrix C is M by N
From (code also there): http://www.cedricnugteren.nl/tutorial.php?page=2
The Kernel Matrix
Let A be our usual e × f array of data:
  – one example per row
  – one (standard) feature per column
A' is 'A transpose' (rotate A around its diagonal), an f × e array:
  – one (standard) feature per row
  – one example per column
The Kernel Matrix is K(A, A').

[Figure: the e × f array A times the f × e array A' yields the e × e Kernel Matrix K.]
The Reduced SVM (Lee & Mangasarian, 2001)
• With kernels, learned models are weighted sums of similarities to some of the training examples
• The kernel matrix is size O(N²), where N = # of ex's
  – With 'big data', squaring can be prohibitive!
• But there is no reason all training examples need to be candidate 'exemplars'
• Can randomly (or cleverly) choose a subset as candidates; size can then scale O(N) (see the sketch below)

[Figure: the examples-by-K(ei, ej) kernel matrix; create (and use) only a subset of its columns.]
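A minimal sketch of that idea (assuming NumPy; the function name is illustrative): instead of the full N × N kernel matrix, compute similarities only against a randomly chosen subset of candidate exemplars, giving an N × m rectangle with m much smaller than N.

```python
import numpy as np

def reduced_kernel_features(X, kernel, m, seed=None):
    """Return an N x m array whose (i, j) entry is K(x_i, c_j), where the c_j are
    m randomly chosen candidate exemplars (ie, a subset of the kernel-matrix columns)."""
    rng = np.random.default_rng(seed)
    candidates = X[rng.choice(len(X), size=m, replace=False)]
    features = np.array([[kernel(x, c) for c in candidates] for x in X])
    return features, candidates
```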
More on Kernels
K(x, z) = tanh(c · (x • z) + d)
  Relates to the sigmoid of ANN's (here the # of HU's is determined by the # of support vectors)
How to choose a good kernel function?
  – Use a tuning set
  – Or just use the Gaussian kernel
  – Some theory exists
  – A sum of kernels is a kernel (and other 'closure' properties hold)
  – We don't want the kernel matrix to be all 0's off the diagonal, since we want to model examples as sums of other ex's
The Richness of Kernels
• Kernels need not solely be similarities computed on numeric data!
• Nor must the 'raw' data sit in a rectangle
• Can define similarity between examples represented as
  – trees (eg, parse trees in NLP) – count common subtrees, say
  – sequences (eg, DNA sequences)
Using Gradient Descent Instead of Linear Programming
• Recall last lecture we said that perceptron training with weight decay is quite similar to SVM training
• This is still the case with kernels; ie, we create a new (kernelized) data set outside the perceptron code and use gradient descent
• So here we get the non-linearity provided by HUs in a 'hard-wired' fashion (ie, by using the kernel to non-linearly compute a new representation of the data)
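A minimal sketch of that alternative (assuming NumPy; all names are illustrative, and the hinge-style update plus weight decay is one reasonable reading of the slide, not the lecture's exact recipe): run perceptron-style gradient descent with weight decay on the kernelized features.

```python
import numpy as np

def kernel_perceptron_gd(K, y, lr=0.01, decay=1e-3, epochs=100):
    """Gradient descent on kernelized features K (an N x m array of K(x_i, c_j)).
    Perceptron-style hinge updates plus weight decay (~ Hinton's wgt decay).
    y: labels in {+1, -1}. Returns the weights (alphas) and the threshold theta."""
    n, m = K.shape
    alphas, theta = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in range(n):
            margin = y[i] * (K[i] @ alphas - theta)
            if margin < 1:                      # wrong or inside the margin: update
                alphas += lr * y[i] * K[i]
                theta  -= lr * y[i]
            alphas -= lr * decay * alphas       # weight decay: shrink the alphas toward 0
    return alphas, theta
```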
SVM Wrapup
• For approximately a decade, SVMs were the 'hottest' topic in ML (Deep NNs now are)
• They formalize nicely the task of finding a simple model with few 'outliers'
• They use hard-wired 'kernels' to do the job done by HUs in ANNs
• Kernels can be used in any ML algo
  – just preprocess the data to create 'kernel' features
  – can handle non-fixed-length feature vectors
• Lots of good theory and empirical results