Transcript: Lecture 22

Today’s Topics
Support Vector Machines (SVMs)
Three Key Ideas
– Max Margins
– Allowing Misclassified Training Examples
– Kernels (for non-linear models; in next lecture)
11/10/15
CS 540 - Fall 2015 (Shavlik©), Lecture 22, Week 10
Three Key SVM Concepts
• Maximize the Margin
Don’t choose just any separating plane
• Penalize Misclassified Examples
Use soft constraints and ‘slack’ variables
• Use the ‘Kernel Trick’ to get Non-Linearity
Roughly like ‘hardwiring’ the input → hidden-unit (HU)
portion of ANNs (so we only need a perceptron)
Support Vector Machines
Maximizing the Margin between Bounding Planes
SVMs define some inequalities we want satisfied. We then use advanced optimization methods (eg, linear programming) to find the satisfying solutions, but in cs540 we’ll do a simpler approximation.
[Figure: two parallel bounding planes with the separating plane between them; the examples lying on the bounding planes are the support vectors, and the margin between the bounding planes is 2 / ||w||2.]
Margins and Learning Theory
Theorems exist that connect learning (‘PAC’) theory to the size of the margin
– Basically, the larger the margin, the better the expected future accuracy
– See, for example, Chapter 4 of Support Vector Machines by N. Cristianini & J. Shawe-Taylor, Cambridge University Press, 2000 (not an assigned reading)
‘Slack’ Variables
Dealing with Data that is not Linearly Separable
For each wrong example, we pay a penalty, which is the distance we’d have to move it to get on the right side of the decision boundary (ie, the separating plane).
[Figure: separating plane with its two bounding planes; the examples on the bounding planes are the support vectors.]
If we deleted any/all of the non-support vectors we’d get the same answer!
SVMs and Non-Linear Separating Surfaces
[Figure: examples plotted in the original (f1, f2) space are not linearly separable. Non-linearly map each example to a new space whose coordinates are g(f1, f2) and h(f1, f2); the examples can be linearly separated in the new space, and the result is a non-linear separator in the original space.]
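As a concrete illustration (not from the slides), here is a minimal sketch of this idea: a hand-picked pair of map functions g and h, chosen purely for the example, turns ring-shaped data that no line can separate into data a line separates in the new space.

```python
# Illustrative sketch (not from the lecture): a hand-picked non-linear map
# g(f1, f2), h(f1, f2) under which ring-shaped data becomes linearly separable.
import numpy as np

def g(f1, f2):
    return f1 ** 2        # first coordinate of the new space

def h(f1, f2):
    return f2 ** 2        # second coordinate of the new space

# '+' examples sit near the origin, '-' examples sit on a ring around them,
# so no line in the original (f1, f2) space separates the two classes ...
pos = np.array([[0.1, 0.2], [-0.3, 0.1], [0.2, -0.2]])
neg = np.array([[2.0, 0.0], [0.0, -2.1], [-1.5, 1.5]])

# ... but in the new (g, h) space the linear rule  g + h < 1  separates them.
for label, examples in (("+", pos), ("-", neg)):
    for f1, f2 in examples:
        print(label, g(f1, f2) + h(f1, f2) < 1.0)
```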
Math Review: Dot Products
X · Y ≡ X1·Y1 + X2·Y2 + … + Xn·Yn
So if X = [4, 5, -3, 7] and Y = [9, 0, -8, 2]
then X · Y = (4)(9) + (5)(0) + (-3)(-8) + (7)(2) = 74
(weighted sums in ANNs are dot products)
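A quick sanity check of the example above (an illustrative snippet, not course code):

```python
# Quick check of the dot-product example above (illustrative snippet).
import numpy as np

X = [4, 5, -3, 7]
Y = [9, 0, -8, 2]

by_hand = sum(x * y for x, y in zip(X, Y))   # 4*9 + 5*0 + (-3)*(-8) + 7*2
print(by_hand, np.dot(X, Y))                 # both print 74
```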
Some Equations
Separating Plane

The separating plane:   W · x = θ
(W is the vector of weights, x the vector of input features, θ the threshold)

[Figure: positive (+) and negative (–) examples on either side of the separating plane.]

For all positive examples:   W · x_pos ≥ θ + 1
For all negative examples:   W · x_neg ≤ θ – 1

These 1’s result from dividing through by a constant for convenience (it is the distance from the dashed lines to the green line in the figure).
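To make the constraints concrete, here is a small hedged sketch (the weights, threshold, and examples are made up for illustration): the learned plane classifies by comparing W · x to θ, while the training constraints push positives to at least θ + 1 and negatives to at most θ – 1.

```python
# Illustrative check of the separating-plane constraints (made-up numbers).
import numpy as np

W, theta = np.array([2.0, -1.0]), 0.5        # hypothetical learned weights/threshold

def classify(x):
    return "+" if np.dot(W, x) >= theta else "-"

x_pos, x_neg = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(np.dot(W, x_pos) >= theta + 1)         # True:  2.0 >= 1.5
print(np.dot(W, x_neg) <= theta - 1)         # True: -1.0 <= -0.5
print(classify(x_pos), classify(x_neg))      # + -
```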
Idea #1: The Margin
(derivation not on final)
[Figure: the bounding planes W · x = θ + 1 and W · x = θ – 1, with a point x_A on the first and a point x_B on the second; the weight vector W is perpendicular to both planes. The green line is the set of all points that satisfy equation (i); ditto for the red line and equation (ii).]

(i)   W · x_A = θ + 1
(ii)  W · x_B = θ – 1

Subtracting (ii) from (i) gives

(iii) W · (x_A – x_B) = 2

By the definition of the dot product,

(iv)  W · (x_A – x_B) = ||W|| · ||x_A – x_B|| · cos(φ)
      where cos(φ) = 1, since the bounding lines are parallel (so x_A – x_B can be taken parallel to W)

Combining (iii) and (iv) we get

      ||x_A – x_B|| = 2 / ||W||
Our Initial ‘Mathematical Program’
min ||w||1     (minimizing over w and θ)

(this is the ‘1-norm’ length of the weight vector, which is the sum of the absolute values of the weights; some SVMs use quadratic programs, but 1-norms have some preferred properties)

such that
    w · xpos ≥ θ + 1      // for ‘+’ ex’s
    w · xneg ≤ θ – 1      // for ‘–’ ex’s
The ‘p’ Norm – Generalization of the
Familiar Euclidean Distance (p=2)
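The slide’s formula isn’t reproduced in this transcript; as a reminder (standard definition, not copied from the slide), the p-norm is ||x||_p = (Σ_i |x_i|^p)^(1/p), so p = 1 gives the sum of absolute values and p = 2 gives Euclidean length. A small illustrative snippet:

```python
# p-norms of a vector: p=1 is the sum of absolute values, p=2 is Euclidean length.
import numpy as np

def p_norm(x, p):
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])
print(p_norm(x, 1))                                # 7.0  (the 1-norm used in our objective)
print(p_norm(x, 2))                                # 5.0  (the familiar Euclidean length)
print(np.linalg.norm(x, 1), np.linalg.norm(x, 2))  # the same values via numpy
```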
Our Mathematical Program (cont.)
Note: w and θ are our adjustable parameters (we could, of course, use the ANN ‘trick’ and move θ to the left side of our inequalities and treat it as another weight)
We can now use existing math-programming optimization s/w to find a solution to our current program (covered in cs525)
Idea #2: Dealing with Non-Separable Data
• We can add what is called a ‘slack’ variable to each example
• This variable can be viewed as
    = 0 if the example is correctly separated
    else = ‘distance’ we need to move the example to get it correct (ie, distance from the decision boundary)
• Note: we are NOT counting #misclassified
    (would be nice to do so, but that becomes [mixed] integer programming, which is much harder)
The Math Program with Slack Variables
(this is the linear-programming version; there is also a quadratic-programming version – in cs540 we won’t worry about the difference)
min ||w||1 + μ ||S||1     (minimizing over w, S, and θ)

such that
    w · xpos_i + S_i ≥ θ + 1
    w · xneg_j – S_j ≤ θ – 1
    S_k ≥ 0

Here w has dimension = # of input features, S has dimension = # of training examples, and θ is a scalar. μ is a scaling constant (use a tuning set to select its value).

The S’s are how far we would need to move an example in order for it to be on the proper side of the decision surface.

Notice we are solving the perceptron task with a complexity penalty (sum of wgts) – Hinton’s wgt decay!
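As a small illustration (made-up numbers, not from the slides) of what an individual slack value is: for a positive example that violates w · x ≥ θ + 1, the slack is just the amount of the violation.

```python
# Slack of a positive example under  w.xpos + S >= theta + 1  (made-up numbers);
# S = 0 when the constraint already holds.
import numpy as np

w, theta = np.array([1.0, 1.0]), 0.0
x_good, x_bad = np.array([2.0, 1.0]), np.array([0.5, -0.5])

def pos_slack(x):
    return max(0.0, (theta + 1) - np.dot(w, x))

print(pos_slack(x_good))   # 0.0 -> already on the proper side of the bounding plane
print(pos_slack(x_bad))    # 1.0 -> how far w.x must move to satisfy the constraint
```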
Slacks and Separability
• If the training data is separable, will all Si = 0 ?
• Not necessarily!
   – Might get a larger margin by misclassifying a few examples (just like in d-tree pruning)
   – This can also happen when using gradient descent to minimize an ANN’s cost function
Brief Intro to Linear Programs (LP’s)
- not on final
• We need to convert our task into
      A z ≥ b
  which is the basic form of an LP (A is a constant matrix, b is a constant vector, z is a vector of variables)
• Note: we can convert inequalities containing ≤ into ones using ≥ by multiplying both sides by -1
      eg, 5x ≤ 15 is the same as -5x ≥ -15
• LPs can also handle = (ie, equalities); we could use ≥ and ≤ to get =, but more efficient methods exist
Brief Intro to Linear Programs (cont.)
- not on final
In addition, we want to
      min c · z
under the linear A z ≥ b constraints. The vector c says how to penalize the settings of the variables in vector z.
[Figure: the yellow region is the set of points that satisfy the constraints; the dotted lines are iso-cost lines.]
Highly optimized s/w for solving LPs exists (eg, CPLEX, COINS [free])
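Here is a minimal sketch of solving such an LP with the free scipy.optimize.linprog solver (an assumption on my part: the lecture names CPLEX and COINS, not scipy). Note that linprog expects A_ub z ≤ b_ub, so the ≥ constraints are multiplied by -1 as described above.

```python
# Tiny LP:  min c.z  subject to  A z >= b  (illustrative numbers).
# linprog expects "A_ub z <= b_ub", so we negate the >= constraints as described above.
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0],      # z1 + z2 >= 2
              [1.0, 0.0]])     # z1      >= 0.5
b = np.array([2.0, 0.5])
c = np.array([1.0, 2.0])       # cost vector: z2 costs twice as much as z1

res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None), (0, None)])
print(res.x, res.fun)          # optimal z = [2, 0], cost = 2
```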
Review: Matrix Multiplication
A B = C
If matrix A is M by K and matrix B is K by N, then matrix C is M by N (the inner dimension K must match).
From (code also there): http://www.cedricnugteren.nl/tutorial.php?page=2
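A quick illustrative check of those shapes in numpy (not the code from the linked tutorial):

```python
# Shape check for C = A @ B: an (M x K) matrix times a (K x N) matrix gives (M x N).
import numpy as np

M, K, N = 2, 3, 4
A = np.arange(M * K).reshape(M, K)
B = np.arange(K * N).reshape(K, N)
C = A @ B
print(A.shape, B.shape, C.shape)   # (2, 3) (3, 4) (2, 4)
```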
Aside: Our SVM as an LP
(not on final)
Let Apos = our positive training examples
Aneg = our negative training examples
(assume 50% pos and 50% neg for notational simplicity; e = # of examples, f = # of features)

                  W      Spos   Sneg    θ     Z
               |  f   |  e/2 |  e/2 |  1  |  f  |
      e/2      [ Apos    1      0     -1    0  ]        [ 1 ]
      e/2      [ -Aneg   0      1      1    0  ]        [ 1 ]
      e/2      [  0      1      0      0    0  ]  z  ≥  [ 0 ]
      e/2      [  0      0      1      0    0  ]        [ 0 ]
      f        [ -1      0      0      0    1  ]        [ 0 ]
      f        [  1      0      0      0    1  ]        [ 0 ]

where z = [ W ; Spos ; Sneg ; θ ; Z ] and the labels on the left give each block’s number of rows. The 1’s are identity matrices (often written as I). (Row block 1 encodes w · xpos + Spos ≥ θ + 1, row block 2 encodes w · xneg – Sneg ≤ θ – 1, blocks 3–4 encode S ≥ 0, and blocks 5–6 encode Z ≥ W and Z ≥ –W.)
Our C Vector
(determines the cost we’re minimizing, also not on final)
min  [ 0  μ  0  1 ] · [ W ; S ; θ ; Z ]
   =  min  μ · S + 1 · Z
   =  min  μ ||S||1 + ||W||1
      (since all the S are non-negative and the Z’s ‘squeeze’ the W’s)

Note we min the Z’s, not the W’s, since only the Z’s are ≥ 0.
Note here: S = Spos concatenated with Sneg.

Aside: we could also penalize θ (but we would need to add more variables, since θ can be negative).
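Putting the pieces together, here is a hedged sketch of the whole LP (not the course’s code): it builds the z = [W; Spos; Sneg; θ; Z] variable vector, the block-matrix constraints, and the cost [0 μ 0 1], then solves with scipy.optimize.linprog in place of CPLEX/COINS. The value of μ and the toy data are illustrative assumptions.

```python
# Hedged sketch (not the course's code) of the 1-norm soft-margin SVM as the LP
# above:  min [0 mu 0 1].z  subject to  A z >= b,  z = [ W ; Spos ; Sneg ; theta ; Z ].
# scipy.optimize.linprog stands in for CPLEX/COINS; it wants "<=" rows, so we
# pass -A and -b.  mu and the toy data are illustrative assumptions.
import numpy as np
from scipy.optimize import linprog

def train_svm_lp(Apos, Aneg, mu=1.0):
    p, q = Apos.shape[0], Aneg.shape[0]   # number of pos / neg examples
    f = Apos.shape[1]                     # number of input features
    e = p + q
    n = f + e + 1 + f                     # length of z = [W; Spos; Sneg; theta; Z]

    I_p, I_q, I_f = np.eye(p), np.eye(q), np.eye(f)
    O = np.zeros                          # shorthand for zero blocks

    # Block rows of A, matching the slide's layout:
    #   [  Apos   I   0  -1   0 ] z >= 1     (W.xpos + Spos >= theta + 1)
    #   [ -Aneg   0   I   1   0 ] z >= 1     (W.xneg - Sneg <= theta - 1)
    #   [   0     I   0   0   0 ] z >= 0     (Spos >= 0)
    #   [   0     0   I   0   0 ] z >= 0     (Sneg >= 0)
    #   [  -I     0   0   0   I ] z >= 0     (Z >=  W)
    #   [   I     0   0   0   I ] z >= 0     (Z >= -W)
    A = np.block([
        [ Apos,      I_p,       O((p, q)), -np.ones((p, 1)), O((p, f))],
        [-Aneg,      O((q, p)), I_q,        np.ones((q, 1)), O((q, f))],
        [ O((p, f)), I_p,       O((p, q)),  O((p, 1)),       O((p, f))],
        [ O((q, f)), O((q, p)), I_q,        O((q, 1)),       O((q, f))],
        [-I_f,       O((f, p)), O((f, q)),  O((f, 1)),       I_f      ],
        [ I_f,       O((f, p)), O((f, q)),  O((f, 1)),       I_f      ],
    ])
    b = np.concatenate([np.ones(e), np.zeros(e + 2 * f)])

    # Cost vector [ 0  mu  0  1 ]: pay mu per unit of slack, 1 per unit of Z.
    c = np.concatenate([np.zeros(f), mu * np.ones(e), [0.0], np.ones(f)])

    res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * n)
    W, theta = res.x[:f], res.x[f + e]
    return W, theta

# Toy usage on separable data
Apos = np.array([[2.0, 2.0], [3.0, 1.0]])
Aneg = np.array([[-1.0, -1.0], [0.0, -2.0]])
W, theta = train_svm_lp(Apos, Aneg)
print("W =", W, "theta =", theta)
```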
Where We are so Far
• We have an ‘objective’ function that we can optimize by Linear Programming
   – min ||w||1 + μ ||S||1 subject to some constraints
   – Free LP solvers exist
   – CS 525 teaches Linear Programming
• We could also use gradient descent
   – Perceptron learning with ‘weight decay’ is quite similar, though it uses SQUARED wgts and SQUARED error (the S is this error)
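For comparison, here is a sketch under my own assumptions (not the lecture’s code) of that gradient-descent alternative: gradient descent on a squared slack/error term plus a squared-weight (‘weight decay’) penalty, as the last bullet describes. The learning rate, decay constant, and toy data are all illustrative.

```python
# A sketch under my own assumptions (not the lecture's code) of the gradient-descent
# alternative: minimize  sum_i max(0, 1 - y_i*(w.x_i - theta))^2  +  lam*||w||_2^2,
# i.e. SQUARED slack/error plus SQUARED weights ("weight decay").
import numpy as np

def train_svm_gd(X, y, lam=0.01, lr=0.01, epochs=500):
    """X: one example per row; y: labels in {+1, -1}."""
    w, theta = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w - theta)             # >= 1 means no error on that example
        slack = np.maximum(0.0, 1.0 - margins)    # plays the role of S in the LP version
        grad_w = -2.0 * (slack * y) @ X + 2.0 * lam * w   # d/dw of squared slack + decay
        grad_theta = 2.0 * np.sum(slack * y)              # d/dtheta of squared slack
        w -= lr * grad_w
        theta -= lr * grad_theta
    return w, theta

# Toy usage
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, theta = train_svm_gd(X, y)
print("predictions:", np.sign(X @ w - theta))     # should match y
```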