1. Stat 231. Lecture 6.


1. Stat 231. A.L. Yuille. Fall 2004





Linear Separation and Margins.
Non-Separable and Slack Variables.
Duality and Support Vectors.
Read Sections 5.10, A.3, 5.11 of Duda, Hart, Stork.
Or better: Sections 12.1, 12.2 of Hastie, Tibshirani, Friedman.
Lecture notes for Stat 231: Pattern Recognition and Machine Learning
2. Separation by Hyperplanes
Data: {(x_i, y_i)}, i = 1, ..., N, with labels y_i in {-1, +1}.
Hyperplane: {x : f(x) = w·x + b = 0}.
Linear Classifier: y(x) = sign(w·x + b).
By simple geometry, the signed distance of a point x to the plane is f(x)/||w|| = (w·x + b)/||w||.
The line through x perpendicular to the plane is x(t) = x - t w/||w||.
It hits the plane when w·x(t) + b = 0, i.e. at t = (w·x + b)/||w||,
which implies that the distance is |w·x + b|/||w||.
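A minimal numerical sketch of this formula (not from the original slides); the hyperplane (w, b) and the point x below are hypothetical values chosen only for illustration:

    import numpy as np

    # Signed distance of a point to the hyperplane {x : w.x + b = 0}.
    w = np.array([2.0, 1.0])
    b = -3.0
    x = np.array([4.0, 5.0])

    signed_dist = (w @ x + b) / np.linalg.norm(w)
    print(signed_dist)   # positive on the side w points towards, negative on the other side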
3. Margin, Support Vectors





We introduce two new concepts: (I) the margin, (II) support vectors.
These will enable us to understand performance in the non-separable case.
Technical methods: quadratic optimization with linear constraints, Lagrange multipliers, and duality.
Margins will also be important when studying generalization.
Everything in this lecture can be extended beyond hyperplanes (next lecture).
4. Margin for Separable Data

Assume there is a separating hyperplane: y_i (w·x_i + b) > 0 for all i = 1, ..., N.
We seek to find the classifier with the biggest margin M:
   max_{w, b, ||w|| = 1} M   subject to   y_i (w·x_i + b) >= M,   i = 1, ..., N.
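A quick way to see the maximum-margin classifier on a toy problem is to fit a linear SVM with a very large cost parameter, which approximates the hard-margin (separable) problem. This sketch assumes scikit-learn is available; the data and the value C = 1e6 are hypothetical choices for illustration:

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical linearly separable toy data.
    X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
                  [5.0, 1.0], [6.0, 2.0], [7.0, 0.5]])
    y = np.array([-1, -1, -1, +1, +1, +1])

    # A very large C approximates the hard-margin problem.
    clf = SVC(kernel='linear', C=1e6).fit(X, y)

    w = clf.coef_.ravel()                  # normal vector of the maximum-margin hyperplane
    b = clf.intercept_[0]                  # offset
    print(w, b, 1.0 / np.linalg.norm(w))   # geometric margin in the canonical scaling
    print(clf.support_vectors_)            # the points that achieve the margin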
5. Margin Non-Separable

Use the concept of the margin to define an optimality criterion for non-separable data.
For the data samples i = 1, ..., N, define slack variables z_i >= 0.
Seek the hyperplane that maximizes the margin while allowing a limited total amount K of slack (which bounds the number of misclassified points).
One criterion:
   max_{w, b, ||w|| = 1} M
subject to
   y_i (w·x_i + b) >= M (1 - z_i),   z_i >= 0,   sum_i z_i <= K.



6. Margin Non-Separable




If z_j = 0, then the data point j is correctly classified by the hyperplane and lies on or outside the margin.
If z_j > 0, then z_j is the proportional amount by which data point j is on the wrong side of its margin; z_j > 1 means the point is actually misclassified.
From this criterion, the points closest to the hyperplane are the ones that most influence its form (more details later). These are the data points that are hardest to classify. They will become the support vectors.
By contrast, data points far from the hyperplane are less important. This differs from probability estimation, where every data point influences the estimate.
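A small sketch of what the slack values mean. The hyperplane (w, b) and the points below are hypothetical, written in the canonical scaling introduced on the next slide, where the margin boundaries sit at w·x + b = +1 and -1:

    import numpy as np

    w = np.array([1.0, -1.0])
    b = 0.0
    X = np.array([[2.0, -1.0], [0.3, 0.0], [-1.0, 1.0]])
    y = np.array([+1, +1, +1])

    # z_j = amount by which point j fails the margin constraint y_j (w.x_j + b) >= 1.
    z = np.maximum(0.0, 1.0 - y * (X @ w + b))
    print(z)   # 0: outside the margin;  (0, 1]: inside the margin;  > 1: misclassified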
7. Quadratic Programming




Remove the restriction that ||w|| = 1 by rescaling so that M = 1/||w||; the constraints become y_i (w·x_i + b) >= 1 - z_i.
Define the criterion:
   min_{w, b, z}  (1/2)||w||^2 + C sum_i z_i   subject to   y_i (w·x_i + b) >= 1 - z_i,   z_i >= 0,
where the cost C plays the role of the bound K on the total slack.
This is a quadratic primal problem with linear constraints (unique soln.). It can be formulated using Lagrange multipliers alpha_i >= 0 (for the margin constraints) and mu_i >= 0 (for z_i >= 0):
   L_p = (1/2)||w||^2 + C sum_i z_i - sum_i alpha_i [ y_i (w·x_i + b) - (1 - z_i) ] - sum_i mu_i z_i.
Variables: w, b, z_i (primal), and alpha_i, mu_i (Lagrange multipliers).
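A sketch of the primal problem solved numerically. The toy data, the value C = 1.0, and the use of scipy's general-purpose SLSQP solver (rather than a dedicated quadratic-programming package) are all assumptions made only for illustration:

    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
                  [5.0, 1.0], [6.0, 2.0], [2.5, 2.0]])
    y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
    N, d = X.shape
    C = 1.0

    def unpack(v):
        return v[:d], v[d], v[d + 1:]              # w, b, slack z

    def primal(v):
        w, b, z = unpack(v)
        return 0.5 * w @ w + C * np.sum(z)         # (1/2)||w||^2 + C sum_i z_i

    constraints = [
        # margin constraints: y_i (w.x_i + b) >= 1 - z_i
        {'type': 'ineq',
         'fun': lambda v: y * (X @ unpack(v)[0] + unpack(v)[1]) - 1.0 + unpack(v)[2]},
        # slack constraints: z_i >= 0
        {'type': 'ineq', 'fun': lambda v: unpack(v)[2]},
    ]

    res = minimize(primal, np.zeros(d + 1 + N), method='SLSQP', constraints=constraints)
    w, b, z = unpack(res.x)
    print(w, b, z)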
8. Quadratic Programming

Extremizing (differentiating) L_p with respect to w, b, and the z_i respectively yields
   w = sum_i alpha_i y_i x_i,    sum_i alpha_i y_i = 0,    alpha_i = C - mu_i.
The solution is
   w = sum_i alpha_i y_i x_i,   with 0 <= alpha_i <= C.
The solution only depends on the support vectors: the data points with alpha_i > 0.
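This dependence on the support vectors alone can be checked directly on a fitted linear SVM. The sketch below assumes scikit-learn and the same hypothetical toy data as above; in scikit-learn, dual_coef_ stores alpha_i * y_i for the support vectors only:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
                  [5.0, 1.0], [6.0, 2.0], [2.5, 2.0]])
    y = np.array([-1, -1, -1, +1, +1, +1])

    clf = SVC(kernel='linear', C=1.0).fit(X, y)

    # w = sum_i alpha_i y_i x_i, summed over the support vectors only.
    w_from_sv = clf.dual_coef_ @ clf.support_vectors_
    print(np.allclose(w_from_sv, clf.coef_))   # True: w depends only on the support vectors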
9. Duality





Any quadratic optimization problem L_p with linear constraints can be reformulated in terms of a dual problem L_d.
The variables of the dual problem are the Lagrange parameters of the primal problem; in this case, the alpha_i.
Linear algebra (substituting the stationarity conditions back into L_p) gives:
   L_d = sum_i alpha_i - (1/2) sum_i sum_j alpha_i alpha_j y_i y_j (x_i·x_j),
subject to:
   0 <= alpha_i <= C,   sum_i alpha_i y_i = 0.
Standard packages exist to solve the primal and the (easier) dual problem.
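A sketch of the dual problem solved numerically, again with scipy's generic SLSQP solver standing in for a proper QP package; the toy data and C = 1.0 are the same hypothetical choices as before:

    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
                  [5.0, 1.0], [6.0, 2.0], [2.5, 2.0]])
    y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
    N = len(y)
    C = 1.0

    K = X @ X.T                                # inner products x_i . x_j
    Q = (y[:, None] * y[None, :]) * K          # Q_ij = y_i y_j x_i . x_j

    def neg_dual(alpha):
        # maximizing L_d is the same as minimizing -L_d
        return 0.5 * alpha @ Q @ alpha - np.sum(alpha)

    res = minimize(neg_dual, np.zeros(N), method='SLSQP',
                   bounds=[(0.0, C)] * N,                              # 0 <= alpha_i <= C
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum_i alpha_i y_i = 0
    alpha = res.x

    w = (alpha * y) @ X                        # w = sum_i alpha_i y_i x_i
    print(alpha)                               # nonzero entries mark the support vectors
    print(w)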
Primal to Dual





To obtain the dual formulation:
Rewrite the primal as
   L_p = (1/2)||w||^2 + sum_i alpha_i [ 1 - z_i - y_i (w·x_i + b) ] + sum_i (C - mu_i) z_i.
Extremize w.r.t. w, b, and the z_i, and substitute w = sum_i alpha_i y_i x_i back in.
All terms cancel except
   L_d = sum_i alpha_i - (1/2) sum_i sum_j alpha_i alpha_j y_i y_j (x_i·x_j).
Set alpha_i = C - mu_i; since mu_i >= 0, this gives the box constraints 0 <= alpha_i <= C.
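Spelling out the cancellation (a worked version of the step above, using the stationarity conditions w = sum_i alpha_i y_i x_i, sum_i alpha_i y_i = 0 and alpha_i = C - mu_i from the previous slides):

    \begin{aligned}
    L_p &= \tfrac{1}{2}\|w\|^2
          + \sum_i \alpha_i\bigl[1 - z_i - y_i(w\cdot x_i + b)\bigr]
          + \sum_i (C-\mu_i)\, z_i \\
        &= \tfrac{1}{2}\|w\|^2 + \sum_i \alpha_i
          - \sum_i \alpha_i y_i\, w\cdot x_i
          - b\sum_i \alpha_i y_i
          - \sum_i \alpha_i z_i + \sum_i \alpha_i z_i \\
        &= \sum_i \alpha_i - \tfrac{1}{2}\|w\|^2
         = \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i\cdot x_j
         = L_d .
    \end{aligned}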
10. Support Vectors




The form of the solution is
   w = sum_i alpha_i y_i x_i.
Now alpha_i > 0 only for the data points with y_i (w·x_i + b) <= 1, i.e. those on or inside the margin. These are the Support Vectors. Two types:
(1) those on the margin, for which z_i = 0 and 0 < alpha_i < C;
(2) those past the margin, for which z_i > 0 and alpha_i = C.
This follows from the Karush-Kuhn-Tucker conditions (next slide).
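The two types can be read off the alpha_i of a fitted classifier. This sketch assumes scikit-learn and the same hypothetical data; the alphas could equally come from the dual QP sketched earlier:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
                  [5.0, 1.0], [6.0, 2.0], [2.5, 2.0]])
    y = np.array([-1, -1, -1, +1, +1, +1])
    C = 1.0

    clf = SVC(kernel='linear', C=C).fit(X, y)
    alpha = np.abs(clf.dual_coef_).ravel()      # alpha_i for the support vectors (dual_coef_ = alpha_i y_i)

    on_margin   = clf.support_[alpha < C - 1e-8]    # 0 < alpha_i < C : z_i = 0, exactly on the margin
    past_margin = clf.support_[alpha >= C - 1e-8]   # alpha_i = C     : z_i > 0, inside or past the margin
    print(on_margin, past_margin)                   # indices of the two types of support vectors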
11. Karush-Kuhn-Tucker

KKT Conditions: complementary slackness relates the Lagrange multipliers and the constraints,
   alpha_i [ y_i (w·x_i + b) - (1 - z_i) ] = 0,    mu_i z_i = 0,    y_i (w·x_i + b) - (1 - z_i) >= 0.
For a margin point, 0 < alpha_i < C, so mu_i = C - alpha_i > 0, hence z_i = 0 and y_i (w·x_i + b) = 1.
Use any margin point to solve for b: b = y_i - w·x_i.
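As a tiny numerical illustration (the weight vector w and the margin support vector (x_k, y_k) below are hypothetical values, e.g. taken from the dual solution sketched earlier):

    import numpy as np

    w = np.array([0.5, -0.5])
    x_k = np.array([3.0, 1.0])
    y_k = 1.0

    # On the margin the constraint is tight: y_k (w.x_k + b) = 1, hence
    b = y_k - w @ x_k
    print(b)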
12. Perceptron and Margins.
The Perceptron rule can be re-interpreted in terms of the margin and given a formulation in dual space.
Perceptron convergence. The critical quantity for the convergence of the Perceptron is the ratio R/m, where R is the radius of the smallest ball containing the data and m is the margin.
Define m = min_i y_i (w*·x_i) for the best separating direction w*, with ||w*|| = 1.
Then the number of Perceptron errors (mistaken updates) is bounded above by
   (R/m)^2.
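A small sketch of the quantities in this bound. The data, and the unit-norm separating direction w_star, are hypothetical; the bias term is omitted for simplicity:

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([+1, +1, -1, -1])
    w_star = np.array([1.0, 1.0]) / np.sqrt(2.0)   # assumed unit-norm separator

    R = np.max(np.linalg.norm(X, axis=1))   # radius of the smallest ball (about the origin) containing the data
    m = np.min(y * (X @ w_star))            # margin achieved by this separator
    print(R, m, (R / m) ** 2)               # (R/m)^2 bounds the number of Perceptron errors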

13. Perceptron in Dual Space
The Perceptron learning algorithm works by adding misclassified data examples to the weights. Set the initial weights to zero.
The weight hypothesis will then always be of the form
   w = sum_i alpha_i y_i x_i,
where alpha_i counts how many times example i has been misclassified so far.
Perceptron Rule in Dual Space: update rule for the alpha_i.
If data point (x_j, y_j) is misclassified, i.e. y_j sum_i alpha_i y_i (x_i·x_j) <= 0, then set alpha_j -> alpha_j + 1.
Summary

Linear Separability and Margins.
Slack Variables z_i – formulation for the non-separable case.
Quadratic optimization with linear constraints.
Primal problem L_p and Dual problem L_d (standard techniques for solution). Dual Perceptron Rule (separable case only).
Solution of the form w = sum_i alpha_i y_i x_i.
Dual variables alpha_i determine the support vectors.
Support Vectors – the hard-to-classify data points (no analog in probability estimation).




