
Support Vector Machines
(and Kernel Methods in general)
Machine Learning
1
Last Time
• Multilayer Perceptron/Logistic Regression
Networks
– Neural Networks
– Error Backpropagation
2
Today
• Support Vector Machines
• Note: we’ll rely on some math from optimization theory that we won’t derive.
3
Maximum Margin
• Perceptron (and other linear classifiers) can lead to
many equally valid choices for the decision boundary
Are these really
“equally valid”?
4
Max Margin
• How can we pick
which is best?
• Maximize the size
of the margin.
(Figure: the same data separated by a small-margin boundary and a large-margin boundary. Are these really “equally valid”?)
5
Support Vectors
• Support Vectors
are those input
points (vectors)
closest to the
decision boundary
1. They are vectors
2. They “support” the decision hyperplane
6
Support Vectors
• Define this as a
decision problem
• The decision hyperplane: w·x + b = 0
• No fancy math, just the equation of a hyperplane.
7
Support Vectors
• Aside: Why do some classifiers use targets in {0, 1} and others in {−1, +1}?
– Simplicity of the math and interpretation.
– For probability density function estimation, {0, 1} has a clear correlate.
– For classification, a decision boundary of 0 is more easily interpretable than 0.5.
8
Support Vectors
• Define this as a
decision problem
• The decision hyperplane: w·x + b = 0
• Decision function: D(x) = sign(w·x + b)
9
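A minimal sketch of this decision function in Python (my own illustration, not from the slides; assumes NumPy, a weight vector w, and an offset b):

import numpy as np

def decision(x, w, b):
    """Linear decision function: return +1 or -1 from the sign of w.x + b."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Hypothetical 2-D hyperplane x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(decision(np.array([2.0, 2.0]), w, b))   # +1: above the hyperplane
print(decision(np.array([0.0, 0.0]), w, b))   # -1: below the hyperplane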
Support Vectors
• Define this as a decision
problem
• The decision hyperplane: w·x + b = 0
• Margin hyperplanes: w·x + b = ±1
10
Support Vectors
• The decision hyperplane: w·x + b = 0
• Scale invariance: (cw)·x + cb = 0 defines the same hyperplane for any c > 0
11
Support Vectors
• The decision hyperplane: w·x + b = 0
• Scale invariance: rescale w and b so that the margin hyperplanes are w·x + b = ±1
This scaling does not change the decision hyperplane or the support vector hyperplanes, but it eliminates a variable from the optimization.
13
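A short reconstruction of the scale-invariance argument (standard algebra, not copied from the slide):

\[
(c\,w)\cdot x + c\,b \;=\; c\,(w \cdot x + b) \;=\; 0
\;\iff\; w \cdot x + b = 0 \qquad (c > 0),
\]
so we are free to rescale \(w\) and \(b\), and we choose the scale at which the margin hyperplanes are \(w \cdot x + b = \pm 1\).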
What are we optimizing?
• We will represent the
size of the margin in
terms of w.
• This will allow us to
simultaneously
– Identify a decision
boundary
– Maximize the margin
14
How do we represent the size of the
margin in terms of w?
1. There must be at least one point that lies on each support hyperplane.
Proof outline: if not, we could define a larger-margin support hyperplane that does touch the nearest point(s).
15
How do we represent the size of the
margin in terms of w?
1. There must be at least one point that lies on each support hyperplane.
2. Thus: w·x₁ + b = 1 for some x₁ on the positive margin hyperplane, and w·x₂ + b = −1 for some x₂ on the negative one.
3. And: w·(x₁ − x₂) = 2
17
How do we represent the size of the
margin in terms of w?
• The vector w is
perpendicular to the
decision hyperplane
– If the dot product of two vectors equals zero, the two vectors are perpendicular: for any two points x and x′ on the hyperplane, w·(x − x′) = (w·x + b) − (w·x′ + b) = 0.
19
How do we represent the size of the
margin in terms of w?
• The margin is the projection
of x1 – x2 onto w, the normal
of the hyperplane.
20
Aside: Vector Projection
21
How do we represent the size of the
margin in terms of w?
• The margin is the projection
of x1 – x2 onto w, the normal
of the hyperplane.
Projection: w·(x₁ − x₂) / ||w||
Size of the margin: 2 / ||w||
22
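A reconstruction of the projection step (the slide's own algebra is not in the transcript):

\[
w \cdot x_1 + b = 1, \qquad w \cdot x_2 + b = -1
\;\;\Rightarrow\;\; w \cdot (x_1 - x_2) = 2,
\]
\[
\text{margin} \;=\; \frac{w \cdot (x_1 - x_2)}{\lVert w \rVert} \;=\; \frac{2}{\lVert w \rVert}.
\]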
Maximizing the margin
• Goal: maximize the margin 2 / ||w||
• Subject to linear separability of the data by the decision boundary: yᵢ(w·xᵢ + b) ≥ 1 for all i
23
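In the standard formulation (which I take to be what this slide shows), maximizing 2/||w|| is equivalent to a constrained minimization:

\[
\min_{w,\,b}\; \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 \;\; \text{for all } i,
\]
where the constraints are exactly the linear-separability requirement above.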
Max Margin Loss Function
• For constrained optimization, use Lagrange multipliers.
• Optimize the “Primal”
24
Max Margin Loss Function
• Optimize the “Primal”
Partial wrt b
25
Max Margin Loss Function
• Optimize the “Primal”
Partial wrt w
26
Max Margin Loss Function
• Optimize the “Primal”
Partial wrt w
Now we have to find the αᵢ: substitute back into the loss function.
27
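For reference, a reconstruction of the primal Lagrangian and the two stationarity conditions these slides refer to (standard SVM algebra, not copied from the deck):

\[
L_P(w, b, \alpha) \;=\; \tfrac{1}{2}\lVert w \rVert^2 \;-\; \sum_i \alpha_i \bigl[\, y_i (w \cdot x_i + b) - 1 \,\bigr], \qquad \alpha_i \ge 0,
\]
\[
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0,
\qquad
\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i.
\]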
Max Margin Loss Function
• Construct the “dual”
28
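Substituting those two conditions back into the Lagrangian gives the dual that the next slide optimizes (again, my reconstruction of the standard result):

\[
\max_{\alpha}\; \sum_i \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
\quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_i \alpha_i y_i = 0.
\]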
Dual formulation of the error
• Optimize this quadratic program to identify the Lagrange multipliers, and thus the weights.
There exist (rather) fast approaches to quadratic optimization in C, C++, Python, Java, and R; a sketch using one of them follows below.
29
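As a hedged illustration (not the course's code), here is one way to pose the hard-margin dual as a QP with the cvxopt Python package; names such as fit_hard_margin_svm are my own:

import numpy as np
from cvxopt import matrix, solvers   # cvxopt provides a generic QP solver

def fit_hard_margin_svm(X, y):
    """Solve the hard-margin SVM dual for data X (n x d) and labels y in {-1, +1}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    # cvxopt minimizes (1/2) a'Pa + q'a  subject to  Ga <= h,  Aa = b.
    # Our dual maximizes sum(a) - (1/2) sum_ij a_i a_j y_i y_j x_i.x_j,
    # so P_ij = y_i y_j x_i.x_j and q = -1.
    Yx = y[:, None] * X
    P = matrix(Yx @ Yx.T)
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))            # -a_i <= 0   (alpha_i >= 0)
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1))      # sum_i a_i y_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = Yx.T @ alpha                  # support vector expansion: w = sum_i a_i y_i x_i
    sv = alpha > 1e-6                 # non-zero multipliers mark the support vectors
    b_offset = float(np.mean(y[sv] - X[sv] @ w))   # from y_i (w.x_i + b) = 1 on the SVs
    return w, b_offset, alpha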
Quadratic Programming
• Quadratic program: minimize f(x) = ½ xᵀQx + cᵀx subject to linear constraints on x.
• If Q is positive semidefinite, then f(x) is convex.
• If f(x) is convex, then any local optimum is a global optimum.
30
Support Vector Expansion
• Support vector expansion: w = Σᵢ αᵢ yᵢ xᵢ
• New decision function: f(x) = sign( Σᵢ αᵢ yᵢ (xᵢ·x) + b ), independent of the dimension of x!
• When αᵢ is non-zero, xᵢ is a support vector
• When αᵢ is zero, xᵢ is not a support vector
31
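A small Python sketch of that decision function, using only the stored support vectors (names are mine; assumes NumPy arrays):

import numpy as np

def svm_predict(x_new, X_sv, y_sv, alpha_sv, b):
    """Classify x_new from the support vectors X_sv, their labels y_sv,
    their multipliers alpha_sv, and the offset b."""
    score = np.sum(alpha_sv * y_sv * (X_sv @ x_new)) + b
    return 1 if score >= 0 else -1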
Kuhn-Tucker Conditions
• In constrained optimization, at the optimal solution:
– constraint × Lagrange multiplier = 0, i.e. αᵢ [yᵢ(w·xᵢ + b) − 1] = 0
Only points on the margin hyperplanes have non-zero αᵢ and contribute to the solution!
32
Visualization of Support Vectors
33
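For visualization in practice, scikit-learn exposes the fitted support vectors directly; a hedged sketch (the toy data here is made up):

import numpy as np
from sklearn.svm import SVC

# Tiny, hypothetical 2-D data set with two linearly separable classes
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # large C approximates the hard margin
print(clf.support_vectors_)                   # the points that define the margin
print(clf.dual_coef_)                         # alpha_i * y_i for each support vector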
Interpretability of SVM parameters
• What else can we tell from the alphas?
– If αᵢ is large, the associated data point strongly constrains the solution.
– It is either an outlier or a particularly informative point.
• But this only gives us the best solution for
linearly separable data sets…
34
Basis of Kernel Methods
• The decision process doesn’t depend on the dimensionality of the data.
• We can map the data into a higher-dimensional space.
• Note: data points only appear within a dot product.
• The error is based on the dot product of data points – not the data
points themselves.
35
Basis of Kernel Methods
• Data points only appear within a dot product.
• Thus we can map to another space by replacing the dot product xᵢ·xⱼ with a kernel k(xᵢ, xⱼ) = φ(xᵢ)·φ(xⱼ).
• The error is based on the dot product of data points, not the data points themselves.
36
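A minimal sketch of that replacement, swapping the dot product for a kernel function (an RBF kernel shown here as one common choice; this is my illustration, not the slide's):

import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2): an implicit dot product phi(a).phi(b)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_svm_predict(x_new, X_sv, y_sv, alpha_sv, b, kernel=rbf_kernel):
    """Same decision function as before, with x_i.x replaced by k(x_i, x)."""
    score = sum(a * t * kernel(x_i, x_new)
                for a, t, x_i in zip(alpha_sv, y_sv, X_sv)) + b
    return 1 if score >= 0 else -1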
Learning Theory bases of SVMs
• Theoretical bounds on testing error.
– The upper bound doesn’t depend on the
dimensionality of the space
– The bound is minimized by maximizing the
margin, γ, associated with the decision boundary.
37
Why we like SVMs
• They work
– Good generalization
• Easily interpreted.
– Decision boundary is based on the data in the
form of the support vectors.
• Not so in multilayer perceptron networks
• Principled bounds on testing error from
Learning Theory (VC dimension)
38
SVM vs. MLP
• SVMs have many fewer parameters
– SVM: Maybe just a kernel parameter
– MLP: number and arrangement of nodes, plus the learning rate η
• SVM: Convex optimization task
– MLP: the likelihood is non-convex, so training can get stuck in local minima
39
Soft margin classification
• There can be outliers that fall on the wrong side of the decision boundary, or that force a small margin.
• Solution: introduce slack variables into the constraints and a penalty term into the objective.
40
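A reconstruction of the usual soft-margin program this slide points at, with slack variables ξᵢ and a penalty weight C (not copied from the deck):

\[
\min_{w,\,b,\,\xi}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,
\]
and in the dual the only change is the box constraint \(0 \le \alpha_i \le C\).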
Soft-Margin Dual
Still Quadratic Programming!
41
Soft margin example
• Points are allowed within the margin, but a cost is incurred.
• Hinge loss: max(0, 1 − yᵢ(w·xᵢ + b))
42
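The hinge loss mentioned here, as a short sketch (my own code):

import numpy as np

def hinge_loss(y_true, score):
    """max(0, 1 - y*f(x)): zero outside the margin, linear penalty inside it or when misclassified."""
    return np.maximum(0.0, 1.0 - y_true * score)

print(hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.5, 0.3])))  # [0.  0.5 1.3]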
Probabilities from SVMs
• Support Vector Machines are discriminant
functions
– Discriminant functions: f(x)=c
– Discriminative models: f(x) = argmaxc p(c|x)
– Generative Models: f(x) = argmaxc p(x|c)p(c)/p(x)
• No (principled) probabilities from SVMs
• SVMs are not based on probability distribution
functions of class instances.
43
Efficiency of SVMs
• Not especially fast.
• Training: O(n³)
– quadratic programming efficiency
• Evaluation: O(n)
– need to evaluate against each support vector (potentially n of them)
44
Good Bye
• Next time:
– The Kernel “Trick” -> Kernel Methods
– or
– How can we use SVMs on data that are not linearly separable?
45