1. Stat 231. Lecture 6.
1. Stat 231. A.L. Yuille. Fall 2004
Linear Separation and Margins.
Non-Separable and Slack Variables.
Duality and Support Vectors.
Read 5.10, A.3, 5.11 Duda, Hart, Stork.
Or better: 12.1, 12.2, Hastie, Tibshirani, Friedman.
Lecture notes for Stat 231: Pattern Recognition and Machine Learning
2. Separation by Hyperplanes
Data: {(x_i, y_i) : i = 1, ..., N}, with labels y_i in {-1, +1}.
Hyperplane: {x : w·x + b = 0}.
Linear Classifier: f(x) = sign(w·x + b).
By simple geometry, the signed distance of a point x to
the plane is (w·x + b) / ||w||.
The line through x perpendicular to the plane is: x(t) = x + t w/||w||.
Hits the plane when w·x(t) + b = 0, i.e. t = -(w·x + b)/||w||,
which implies that the distance is |w·x + b| / ||w||.
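As a quick check of this geometry, a minimal NumPy sketch (the point and coefficients are made up for illustration):

```python
import numpy as np

def signed_distance(x, w, b):
    # Signed distance from point x to the hyperplane {x : w.x + b = 0}:
    # positive on the side w points toward, negative on the other.
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Example: w = (3, 4) has norm 5; at x = (3, 4) with b = -5,
# w.x + b = 25 - 5 = 20, so the signed distance is 20 / 5 = 4.
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([3.0, 4.0])
d = signed_distance(x, w, b)   # 4.0
```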
3. Margin, Support Vectors
We introduce new concepts: (I) Margin, (II) Support Vectors.
These will enable us to understand performance in the non-separable
case.
Technical methods: quadratic optimization with linear constraints,
Lagrange multipliers and duality.
Margins will also be important when studying generalization.
Everything in this lecture can be extended beyond
hyperplanes (next lecture).
4. Margin for Separable Data
Assume there is a separating hyperplane: y_i (w·x_i + b) > 0 for all i.
We seek to find the classifier with the biggest margin m:
  max_{w, b: ||w|| = 1} m   subject to   y_i (w·x_i + b) >= m for all i.
5. Margin Non-Separable
Use concept of margin to define optimal criterion for nonseparable data.
For the data samples i = 1, ..., N, define slack variables z_i >= 0.
Seek the hyperplane that maximizes the margin while allowing a
limited total amount K of slack (misclassified data).
One criterion:
  max_{w, b: ||w|| = 1} m
Subject to
  y_i (w·x_i + b) >= m (1 - z_i),  z_i >= 0 for all i,  Σ_i z_i <= K.
6. Margin Non-Separable
If z_j = 0,
then the data point j is correctly classified by the
hyperplane and lies on or outside the margin.
If z_j > 0,
then z_j is the proportional amount by which data point j is on
the wrong side of the margin; z_j > 1 means it is actually
misclassified.
From this criterion, the points closest to the hyperplane are the
ones that most influence its form (more details later). These are
the data points that are hardest to classify. These will become the
support vectors.
By contrast, data points far from the hyperplane are less
important. This differs from probability estimation, where every
data point contributes to the estimate.
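The slack values can be computed directly, assuming the margin has been rescaled to 1 (the convention adopted on the next slide); a minimal NumPy sketch with made-up data:

```python
import numpy as np

def slack_variables(X, y, w, b):
    # z_i = max(0, 1 - y_i (w.x_i + b)), with the margin rescaled to 1.
    # z_i = 0: on or beyond the margin; 0 < z_i <= 1: inside the margin
    # but still correctly classified; z_i > 1: misclassified.
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

# Made-up example with w = (1, 0), b = 0:
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, 1.0])
z = slack_variables(X, y, np.array([1.0, 0.0]), 0.0)   # [0.0, 0.5, 2.0]
```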
7. Quadratic Programming
Remove the restriction that ||w|| = 1; rescale so the margin
constraints become y_i (w·x_i + b) >= 1 - z_i, z_i >= 0.
Define criterion: minimize (1/2)||w||² + C Σ_i z_i subject to these constraints.
This is a quadratic primal problem with linear constraints
(unique soln.). It can be formulated using Lagrange multipliers:
  L_p = (1/2)||w||² + C Σ_i z_i − Σ_i α_i [ y_i (w·x_i + b) − (1 − z_i) ] − Σ_i μ_i z_i.
Variables: w, b, and the z_i; Lagrange multipliers α_i >= 0, μ_i >= 0.
8. Quadratic Programming
Extremize (differentiate) L_p wrt w, b, and z_i; this respectively yields
  w = Σ_i α_i y_i x_i,   Σ_i α_i y_i = 0,   α_i = C − μ_i.
The solution is w = Σ_i α_i y_i x_i.
The solution only depends on the support vectors: the data points
with α_i > 0.
9. Duality
Any quadratic optimization problem L_p with linear constraints
can be reformulated in terms of a dual problem L_d.
The variables of the dual problem are the Lagrange parameters
of the primal problem. In this case
Linear algebra gives:
  L_d = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i · x_j.
Subject to:
  0 <= α_i <= C,   Σ_i α_i y_i = 0.
Standard packages solve the primal and the (easier) dual
problem.
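As an illustration of solving the dual numerically, here is a minimal projected-gradient-ascent sketch. To keep it short it assumes a hyperplane through the origin (b = 0), which drops the Σ_i α_i y_i = 0 constraint; the toy data and step size are made up:

```python
import numpy as np

def svm_dual_pga(X, y, C=1.0, eta=0.01, iters=2000):
    # Maximize L_d = sum(alpha) - 0.5 * alpha^T Q alpha over 0 <= alpha <= C,
    # where Q_ij = y_i y_j x_i . x_j, by gradient ascent with projection
    # (clipping) onto the box constraints. Bias-free case: b = 0.
    Q = (y[:, None] * X) @ (y[:, None] * X).T
    alpha = np.zeros(len(y))
    for _ in range(iters):
        grad = 1.0 - Q @ alpha            # gradient of the dual objective
        alpha = np.clip(alpha + eta * grad, 0.0, C)
    w = (alpha * y) @ X                   # recover w = sum_i alpha_i y_i x_i
    return alpha, w

# Toy separable data (made up): positives near (2, 2), negatives near (-2, -2).
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = svm_dual_pga(X, y)
```

With the bias restored, the equality constraint must be maintained as well, which is why practical solvers use pairwise (SMO-style) updates or off-the-shelf quadratic-programming packages.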
Primal to Dual
To obtain the dual formulation:
Rewrite the primal as
  L_p = (1/2)||w||² + C Σ_i z_i − Σ_i α_i [ y_i (w·x_i + b) − (1 − z_i) ] − Σ_i μ_i z_i.
Extremize w.r.t. w, b, and z_i by setting the derivatives to zero:
  w = Σ_i α_i y_i x_i,   Σ_i α_i y_i = 0,   α_i = C − μ_i.
Substitute these back into L_p. All terms cancel except
  L_d = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i · x_j.
10. Support Vectors
The form of the solution is w = Σ_i α_i y_i x_i.
Now α_i > 0 only for data points on or past the margin.
These
are the Support Vectors. Two types:
(1) those on the margin, for which z_i = 0 and 0 < α_i < C,
(2) those past the margin, for which z_i > 0 and α_i = C.
This follows from the Karush-Kuhn-Tucker conditions.
11. Karush-Kuhn-Tucker
KKT Conditions: the Lagrange multipliers and constraints satisfy, at the solution,
  α_i [ y_i (w·x_i + b) − (1 − z_i) ] = 0,
  μ_i z_i = 0,
  y_i (w·x_i + b) − (1 − z_i) >= 0,
with α_i, μ_i, z_i >= 0.
Use any margin point (one with 0 < α_k < C, hence z_k = 0) to solve for b:
  b = y_k − w·x_k.
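Given a dual solution, the KKT conditions yield b directly; a small sketch (the numbers are made up, with a known separator so the answer can be checked):

```python
import numpy as np

def bias_from_margin_points(X, y, alpha, w, C, tol=1e-6):
    # KKT: margin support vectors have 0 < alpha_k < C, hence z_k = 0 and
    # y_k (w.x_k + b) = 1, so b = y_k - w.x_k. Average over all margin
    # points for numerical stability.
    on_margin = (alpha > tol) & (alpha < C - tol)
    return float(np.mean(y[on_margin] - X[on_margin] @ w))

# Made-up example: w = (1, 0) and true b = 0.5, with two margin points
# satisfying y_k (w.x_k + b) = 1 exactly.
X = np.array([[0.5, 0.0], [-1.5, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.3, 0.3])        # strictly inside (0, C)
b = bias_from_margin_points(X, y, alpha, np.array([1.0, 0.0]), C=1.0)  # 0.5
```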
12. Perceptron and Margins.
The Perceptron rule can be re-interpreted in terms of the margin
and given a formulation in dual space.
Perceptron convergence. The critical quantity for the
convergence of the perceptron is R/m, where
R is the radius of the smallest ball containing the data and
m is the margin.
Define R = max_i ||x_i|| and m = min_i y_i (w*·x_i) for a unit-norm separating w*.
Then the number of Perceptron errors is bounded
above by (R/m)².
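The bound quantities are easy to compute for a toy dataset; a sketch assuming data measured from the origin and a known unit-norm separator (both made up):

```python
import numpy as np

def mistake_bound(X, y, w_star):
    # (R/m)^2 for a unit-norm separator w_star that classifies all data
    # correctly: R is the radius of the smallest origin-centred ball
    # containing the data, m is the separator's margin.
    w_star = w_star / np.linalg.norm(w_star)
    R = np.max(np.linalg.norm(X, axis=1))
    m = np.min(y * (X @ w_star))
    assert m > 0, "w_star does not separate the data"
    return (R / m) ** 2

# Made-up data: R = 2 (points (0, +/-2)), margin m = 1 for w* = (0, 1).
X = np.array([[0.0, 2.0], [1.0, 1.0], [0.0, -2.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
bound = mistake_bound(X, y, np.array([0.0, 1.0]))   # (2/1)^2 = 4.0
```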
13. Perceptron in Dual Space
The Perceptron learning algorithm works by adding misclassified
data examples, multiplied by their labels, to the weights. Set the initial weights to zero.
The weight hypothesis will always be of the form w = Σ_i α_i y_i x_i,
where α_i counts how many times data point i has been misclassified.
Perceptron Rule in Dual Space: update rule for the α_i.
If data point (x_j, y_j)
is misclassified, i.e. y_j Σ_i α_i y_i (x_i · x_j) <= 0, then set α_j → α_j + 1.
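The dual rule can be sketched in a few lines of NumPy, assuming a separator through the origin (no bias term) to keep it short; the toy data are made up. Note the algorithm only ever touches inner products x_i · x_j (the Gram matrix):

```python
import numpy as np

def dual_perceptron(X, y, max_sweeps=100):
    # alpha_j counts how many times example j has been misclassified;
    # the weight vector is implicitly w = sum_i alpha_i y_i x_i.
    G = X @ X.T                     # Gram matrix of inner products
    alpha = np.zeros(len(y))
    for _ in range(max_sweeps):
        mistakes = 0
        for j in range(len(y)):
            # misclassified iff y_j * sum_i alpha_i y_i (x_i . x_j) <= 0
            if y[j] * np.dot(alpha * y, G[:, j]) <= 0:
                alpha[j] += 1       # the dual update rule
                mistakes += 1
        if mistakes == 0:           # converged: a full sweep with no errors
            break
    w = (alpha * y) @ X             # recover the primal weights
    return alpha, w

# Toy separable data (made up):
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = dual_perceptron(X, y)
```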
Summary
Linear Separability and Margins.
Slack Variables z_i – formulation for the non-separable case.
Quadratic optimization with linear constraints.
Primal problem L_p and Dual Problem L_d (standard
techniques for solution). Dual Perceptron Rule (separable case
only).
Solution of form w = Σ_i α_i y_i x_i.
The dual variables α_i determine the support vectors.
Support Vectors – the hard-to-classify data points (no analog in probability estimation).