
STATISTICAL LEARNING THEORY AND
CLASSIFICATION BASED ON SUPPORT VECTOR
MACHINES
Presentation by Michael Sullivan
Based on:
The Nature of Statistical Learning Theory by V. Vapnik
2009 Presentation by John DiMona
and some slides based on lectures given by Professor Andrew Moore
of Carnegie Mellon University
1
EMPIRICAL DATA MODELING
Observations of a system are collected
Based on these observations a process of induction is used to build up a model of the system
This model is used to deduce responses of the system not yet observed
2
EMPIRICAL DATA MODELING
Data obtained through observation is finite and sampled by nature
Typically this sampling is non-uniform
Due to the high dimensional nature of some problems the data will form only a sparse distribution in the input space
Creating a model from this type of data is an ill-posed problem
3
EMPIRICAL DATA MODELING
[Figure: the target space and hypothesis space, showing the globally optimal model, the best reachable model, and the selected model]
The goal in modeling is to choose a model from the hypothesis space which is closest (with respect to some error measure) to the underlying function in the target space.
4
MODELING ERROR
Approximation Error is a consequence of the hypothesis space not exactly fitting the target space:
- The underlying function may lie outside the hypothesis space
- A poor choice of the model space will result in a large approximation error (model mismatch)
Estimation Error is the error due to the learning procedure converging to a non-optimal model in the hypothesis space
Together these form the Generalization Error
5
EMPIRICAL DATA MODELING
[Figure: generalization error decomposed into approximation error (between the globally optimal model and the best reachable model) and estimation error (between the best reachable model and the selected model)]
The goal in modeling is to choose a model from the hypothesis space which is closest (with respect to some error measure) to the underlying function in the target space.
6
WHAT IS STATISTICAL LEARNING?

Definition: “Consider the learning problem as a problem of finding a desired dependence using a limited number of observations.” (Vapnik, p. 17)
7
MODEL OF SUPERVISED LEARNING
Training: The supervisor takes each generated x value and returns an output value y.
Each (x, y) pair is drawn according to the joint distribution F(x, y) = F(x)F(y|x), and together the pairs form the training set:
(x1, y1), …, (xl, yl)
8
MODEL OF SUPERVISED LEARNING
Goal: We want to choose the learning machine's (LM's) estimation function f(x, α) whose output most closely matches the supervisor's response y for each (x, y) pair.
Once we have the estimation function, we can classify new and unseen data.
9
RISK MINIMIZATION
To find the best function, we need to measure loss.
The loss L(y, f(x, α)) is the discrepancy between the response y given by the supervisor and the response f(x, α) given by the estimation function.
10
RISK MINIMIZATION
To do this, we calculate the risk functional:
R(α) = ∫ L(y, f(x, α)) dF(x, y)
We choose the function f(x, α) that minimizes the risk functional R(α) over the class of functions f(x, α), α ∈ Λ.
Remember, F(x, y) is unknown except for the information contained in the training set.
11
RISK MINIMIZATION WITH PATTERN
RECOGNITION
With pattern recognition, the supervisor's output y can take on only two values, y ∈ {0, 1}, and the loss takes the following values:
L(y, f(x, α)) = 0 if y = f(x, α), and 1 if y ≠ f(x, α)
So the risk functional determines the probability that the supervisor and the estimation function give different answers.
12
RISK MINIMIZATION
The expected value of the loss with respect to some estimation function f(x, α):
R(α) = ∫ L(y, f(x, α)) dP(x, y)
where P(x, y) = P(x)P(y | x)
Problem: We still don't know P(x, y)
13
TO SIMPLIFY THESE TERMS…
From this point on, we'll refer to the training set
{(x1, y1), (x2, y2), …, (xl, yl)} as {z1, z2, …, zl},
and we'll refer to the loss functional L(y, f(x, α)) as Q(z, α).
14
EMPIRICAL RISK MINIMIZATION (ERM)
Instead of measuring the risk over the whole distribution, measure it over just the training set, giving the empirical risk functional
R_emp(α) = (1/l) Σ(i=1 to l) Q(zi, α)
The empirical risk R_emp(α) must converge uniformly (in both directions) to the actual risk R(α) over the set of loss functions Q(z, α), α ∈ Λ
15
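As a concrete illustration (not from the original slides; the threshold classifier f(x, α), the toy data, and the use of NumPy are assumptions for the example), here is a minimal sketch of the empirical risk functional with the 0/1 loss:

```python
# Minimal sketch of the empirical risk R_emp(alpha) with 0/1 loss.
# The threshold classifier f(x, alpha) and the training set are illustrative only.
import numpy as np

def f(x, alpha):
    """Indicator function: classify x as 1 if x >= alpha, else 0."""
    return (x >= alpha).astype(int)

def empirical_risk(alpha, x_train, y_train):
    """R_emp(alpha) = (1/l) * sum of Q(z_i, alpha) over the training set."""
    losses = (f(x_train, alpha) != y_train).astype(float)  # 0/1 loss
    return losses.mean()

x_train = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y_train = np.array([0,   0,   0,    1,   1,   1])

for alpha in [0.2, 0.5, 0.85]:
    print(f"alpha={alpha:.2f}  R_emp={empirical_risk(alpha, x_train, y_train):.3f}")
```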
SO WHAT DOES LEARNING THEORY NEED TO
ADDRESS?

i. What are the (necessary and sufficient) conditions for
consistency of a learning process based on the ERM
principle?

ii. How fast is the rate of convergence of the learning
process?

iii. How can one control the rate of convergence (the
generalization ability) of the learning process?

iv. How can one construct algorithms that can control
the generalization ability?
16
VC DIMENSION (VAPNIK–CHERVONENKIS)

The VC dimension is a scalar value that measures the capacity
of a set of functions.

The VC dimension of a set of functions is responsible for the
generalization ability of learning machines.

The VC dimension of a set of indicator functions Q(z, α), α ∈ Λ, is the maximum number h of vectors z1, …, zh that can be separated into two classes in all possible ways using functions of the set.
17
VC DIMENSION
3 vectors can be shattered, but not 4 since vectors z2, z4
cannot be separated by a line from vectors z1, z3
Rule: The set of linear indicator functions in n-dimensional space has VC dimension h = n + 1.
18
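The shattering claim can be checked numerically. The following sketch (my own illustration, assuming scikit-learn is available; the three points are arbitrary) brute-forces all labelings of 3 non-collinear points and verifies that each labeling is linearly separable, consistent with h = n + 1 = 3 in the plane:

```python
# Sketch: verify that 3 non-collinear points in the plane can be shattered
# by linear indicator functions (consistent with h = n + 1 = 3).
import itertools
import numpy as np
from sklearn.svm import SVC

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

shattered = True
for labels in itertools.product([0, 1], repeat=len(points)):
    if len(set(labels)) < 2:
        continue  # all-0 or all-1 labelings are trivially realized by a constant function
    clf = SVC(kernel="linear", C=1e6)  # very large C approximates a hard margin
    clf.fit(points, list(labels))
    if clf.score(points, list(labels)) < 1.0:
        shattered = False  # some labeling is not linearly separable
print("3 points shattered:", shattered)
```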
UPPER BOUND FOR RISK
It can be shown that, with probability 1 − η,
R(α) ≤ R_emp(α) + Φ
where Φ is the confidence interval, which depends on the VC dimension h, the number of observations l, and η.
ERM only minimizes R_emp(α); the confidence interval Φ is fixed based on the VC dimension of the set of functions, determined a priori.
When implementing ERM one must tune the confidence interval based on the problem to avoid underfitting/overfitting the data.
19
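For illustration only (this is my sketch, not from the slides), here is one commonly quoted form of the VC confidence term; the exact constants vary between statements of the bound, but the behavior is the point: it grows with the VC dimension h and shrinks with the sample size l.

```python
# Sketch of a commonly quoted form of the VC confidence interval:
#   Phi = sqrt( (h * (ln(2l/h) + 1) - ln(eta/4)) / l )
# Exact constants differ between textbook statements of the bound.
import math

def vc_confidence(h, l, eta=0.05):
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

l = 1000  # number of training examples
for h in [5, 50, 200, 500]:
    print(f"h={h:4d}  confidence term ~ {vc_confidence(h, l):.3f}")
```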
STRUCTURAL RISK MINIMIZATION (SRM)

SRM attempts to minimize the right hand side of the
inequality over both terms simultaneously

The first term is dependent upon a specific function’s
error and the second depends on the VC dimension of
the space that function is in

Therefore VC dimension must be a controlling variable
20
STRUCTURAL RISK MINIMIZATION (SRM)
We define our hypothesis space S to be the set of functions Q(z, α), α ∈ Λ.
We say that Sk = {Q(z, α), α ∈ Λk} is the hypothesis space of VC dimension hk, such that:
S1 ⊂ S2 ⊂ … ⊂ Sn ⊂ …   with   h1 ≤ h2 ≤ … ≤ hn
For a set of observations z1, …, zl, SRM chooses the function Q(z, αk) minimizing the empirical risk in the subset Sk for which the guaranteed risk (the upper bound on the actual risk) is minimal.
21
STRUCTURAL RISK MINIMIZATION (SRM)

SRM defines a trade-off between the quality of the
approximation of the given data and the
complexity of the approximating function

As VC dimension increases the minima of the
empirical risks decrease but the confidence
interval increases

SRM is more general than ERM because it uses the subset Sk for which minimizing the empirical risk R_emp(α) yields the best bound on the actual risk R(α)
22
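To make the trade-off concrete, the sketch below (my own; the empirical risks are made-up numbers, not results from the presentation) picks the element of a nested structure that minimizes the guaranteed risk, i.e. empirical risk plus confidence term:

```python
# Sketch of SRM as model selection: choose the subset S_k whose
# (empirical risk + confidence term) is smallest.
# The empirical risks below are made-up numbers for illustration.
import math

def vc_confidence(h, l, eta=0.05):
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

l = 10000
structure = [  # (VC dimension h_k, empirical risk achieved in S_k)
    (5, 0.20), (20, 0.12), (80, 0.07), (300, 0.05), (800, 0.04),
]

best = min(structure, key=lambda hk_r: hk_r[1] + vc_confidence(hk_r[0], l))
for h, r_emp in structure:
    bound = r_emp + vc_confidence(h, l)
    print(f"h={h:4d}  R_emp={r_emp:.2f}  guaranteed risk <= {bound:.3f}")
print("SRM picks the subset with h =", best[0])
```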
SUPPORT VECTOR CLASSIFICATION

Uses the SRM principle to separate two classes by a linear indicator function which is induced from available examples in the training set.
The goal is to produce a classifier that will work well on unseen test examples. We want the classifier with the maximum generalizing capacity, i.e. the lowest risk.
23
SIMPLEST CASE: LINEAR CLASSIFIERS
How would you
classify this
data?
24
SIMPLEST CASE: LINEAR CLASSIFIERS
All of these lines
work as linear
classifiers
Which one is the
best?
25
SIMPLEST CASE: LINEAR CLASSIFIERS
Define the
margin of a
linear classifier
as the width the
boundary can be
increased by
before hitting a
datapoint.
26
SIMPLEST CASE: LINEAR CLASSIFIERS
We want the maximum margin linear classifier.
Support vectors are the datapoints the margin pushes up against.
This is the simplest kind of SVM, called a linear SVM.
27
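A minimal sketch of fitting a maximum-margin linear classifier and reading off its support vectors (assuming scikit-learn; the toy 2-D data is my own illustration, not from the slides):

```python
# Sketch: a linear (maximum-margin) SVM on toy 2-D data; the support vectors
# are the training points the margin "pushes up against".
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],       # class +1
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates the hard-margin case
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("margin width =", 2.0 / np.linalg.norm(clf.coef_[0]))
```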
SIMPLEST CASE: LINEAR CLASSIFIERS
[Figure: the +1 zone and the -1 zone, separated by the plus plane and the minus plane]
We can define these two planes by an intercept term b and a vector w perpendicular to the planes they lie in, so that each plane is the set of points x where the dot product expression (w * x) + b takes a fixed value.
28
THE OPTIMAL SEPARATING HYPERPLANE

But how can we find M in terms of w and b when the planes are defined as:
Positive plane: (w * x) + b = 1
Negative plane: (w * x) + b = -1
Note: Linear classifier plane: (w * x) + b = 0
29
THE OPTIMAL SEPARATING HYPERPLANE
The margin M is defined as the distance from any point on the minus plane to the closest point on the plus plane.
Plus zone: (w * x) + b ≥ 1
Minus zone: (w * x) + b ≤ -1
30
THE OPTIMAL SEPARATING HYPERPLANE
Why? Take any point x⁻ on the minus plane and let x⁺ be the closest point to it on the plus plane.
Since w is perpendicular to both planes, x⁺ = x⁻ + λw for some scalar λ, and the margin is M = ||x⁺ − x⁻|| = λ||w||.
Using (w * x⁺) + b = 1 and (w * x⁻) + b = -1:
(w * (x⁻ + λw)) + b = 1
(w * x⁻) + b + λ(w * w) = 1
-1 + λ(w * w) = 1
So λ = 2 / (w * w) and M = λ||w|| = 2 / ||w||
31–37
THE OPTIMAL SEPARATING HYPERPLANE
So we want to maximize M = 2 / ||w||
Or, equivalently, minimize (1/2)||w||² = (1/2)(w * w)
38
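As a sanity check on this objective (my own sketch, assuming SciPy and NumPy; the data set is a made-up separable example), the primal problem can be solved directly with a general-purpose constrained optimizer: minimize (1/2)||w||² subject to yi((w * xi) + b) ≥ 1.

```python
# Sketch: solve the primal max-margin problem directly with SLSQP.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

def objective(wb):
    w = wb[:2]
    return 0.5 * np.dot(w, w)          # (1/2)||w||^2

constraints = [{"type": "ineq",        # y_i((w * x_i) + b) - 1 >= 0
                "fun": lambda wb, i=i: y[i] * (np.dot(wb[:2], X[i]) + wb[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b, " margin =", 2.0 / np.linalg.norm(w))
```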
GENERALIZED OPTIMAL HYPERPLANE

Possible to extend to non-separable training sets by adding an error (slack) parameter ξi for each example and minimizing:
(1/2)||w||² + C Σi ξi   subject to   yi((w * xi) + b) ≥ 1 − ξi,  ξi ≥ 0
Data can be split into more than two classifications by using successive runs on the resulting classes
39
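A brief sketch of the role of the penalty constant C (my own example, assuming scikit-learn and NumPy; the random blobs are illustrative): a small C tolerates margin violations, while a large C penalizes them heavily.

```python
# Sketch: the soft-margin trade-off via the C parameter.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(2, 2), size=(30, 2)),
               rng.normal(loc=(-2, -2), size=(30, 2))])
y = np.array([1] * 30 + [-1] * 30)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<6}  support vectors: {len(clf.support_)}  "
          f"training accuracy: {clf.score(X, y):.2f}")
```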
QUADRATIC PROGRAMMING
Optimization algorithms used to maximize (or minimize) a quadratic function of some real-valued variables subject to linear constraints.
In the primal (linear) formulation we want to minimize (1/2)||w||².
In the dual formulation we want to maximize:
W(α) = Σi αi − (1/2) Σi Σj αi αj yi yj (xi * xj)
in the nonnegative quadrant αi ≥ 0, i = 1, …, l,
under the constraint Σi αi yi = 0
40
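This sketch (mine, assuming SciPy and NumPy; the data is the same style of toy example as above) maximizes the dual objective W(α) by minimizing its negative under αi ≥ 0 and Σ αi yi = 0, then recovers w = Σ αi yi xi:

```python
# Sketch: solve the dual QP  max W(a) = sum(a) - 0.5 * sum_ij a_i a_j y_i y_j (x_i * x_j)
# subject to a_i >= 0 and sum_i a_i y_i = 0, then recover w = sum_i a_i y_i x_i.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j (x_i * x_j)

def neg_dual(a):
    return 0.5 * a @ Q @ a - a.sum()        # minimize -W(a)

res = minimize(neg_dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                          # w = sum_i a_i y_i x_i
print("alpha =", np.round(alpha, 3))
print("w =", w)
```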
SUPPORT VECTOR MACHINES (SVM)
Maps the input vectors x into a high-dimensional feature space
using a kernel function
In this feature space the optimal separating hyperplane
is constructed
41
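As an illustration of what a kernel buys (my own example, assuming NumPy; the specific quadratic kernel and feature map are not from the slides): the kernel K(x, z) = (x * z)² equals the ordinary dot product of the explicitly mapped vectors φ(x) = (x1², √2·x1·x2, x2²), so the separating hyperplane can be built in feature space without ever forming φ explicitly.

```python
# Sketch: a kernel computes a dot product in feature space without building
# the feature vectors. For 2-D inputs, (x . z)**2 == phi(x) . phi(z) with
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = np.dot(x, z) ** 2
explicit_value = np.dot(phi(x), phi(z))
print(kernel_value, explicit_value)   # both equal 1.0
```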
HOW DO SV MACHINES HANDLE DATA IN
DIFFERENT CIRCUMSTANCES?
Basic one dimensional example?
42
HOW DO SV MACHINES HANDLE DATA IN
DIFFERENT CIRCUMSTANCES?
Easy!
43
HOW DO SV MACHINES HANDLE DATA IN
DIFFERENT CIRCUMSTANCES?
Harder one dimensional example?
44
HOW DO SV MACHINES HANDLE DATA IN
DIFFERENT CIRCUMSTANCES?
Project the lower
dimensional training
points into higher
dimensional space
45
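The one-dimensional situation in these slides can be reproduced with a tiny sketch (mine, assuming scikit-learn and NumPy; the points are illustrative): no single threshold separates the classes in one dimension, but the mapping x → (x, x²) makes them linearly separable.

```python
# Sketch: a 1-D set that no single threshold can separate becomes linearly
# separable after projecting each point x to (x, x^2).
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])   # inner points are class -1

lifted = np.column_stack([x, x**2])          # project into 2-D feature space
clf = SVC(kernel="linear", C=1e6).fit(lifted, y)
print("training accuracy after lifting:", clf.score(lifted, y))   # 1.0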
SV MACHINES

How are SV Machines implemented?
- Polynomial Learning Machines
- Radial Basis Function Machines
- Two-Layer Neural Networks
Each of these methods, and every SV Machine implementation technique, uses a different kernel function.
46
TWO-LAYER NEURAL NETWORK APPROACH
The kernel is a sigmoid function:
K(x, xi) = S(v (x * xi) + c)
Implementing the rule:
f(x, α) = sign( Σi αi S(v (x * xi) + c) + b )
Using this technique the following are found automatically:
i. Architecture of the two-layer machine, determining the number N of units in the first layer (the number of support vectors)
ii. The vectors of the weights wi = xi in the first layer (the support vectors)
iii. The vector of weights for the second layer (the values of αi)
47
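A minimal sketch of this approach (mine, assuming scikit-learn, whose sigmoid kernel is tanh(gamma·(x * xi) + coef0); the synthetic data set is illustrative): the number of support vectors found plays the role of the number N of first-layer units.

```python
# Sketch: an SVM with a sigmoid kernel, tanh(gamma * (x . x_i) + coef0),
# behaves like a two-layer network whose first-layer units are the support vectors.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = SVC(kernel="sigmoid", gamma=0.01, coef0=0.0).fit(X, y)
print("number of first-layer units (support vectors):", len(clf.support_))
```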
TWO-LAYER NEURAL NETWORK APPROACH
48
HANDWRITTEN DIGIT RECOGNITION
Data used from the U.S. Postal Service Database (1990)
Purpose was to experiment on learning the recognition of handwritten digits using different SV machines
- 7,300 training patterns
- 2,000 test patterns collected from real-life zip codes
16×16 pixel resolution of the database → a 256-dimensional input space
49
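The USPS database used in the book is not bundled with common libraries, so this sketch (my own, assuming scikit-learn) uses its smaller built-in 8×8 digits set simply to show the shape of the experiment: a polynomial-kernel SV machine trained and tested on digit images.

```python
# Sketch of the digit-recognition setup using scikit-learn's built-in 8x8 digits
# (a stand-in for the 16x16 USPS database described in the slides).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)              # 64-dimensional input space
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
clf = SVC(kernel="poly", degree=3).fit(X_train, y_train)
print("test error rate:", 1.0 - clf.score(X_test, y_test))
```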
HANDWRITTEN DIGIT RECOGNITION
50
HANDWRITTEN DIGIT RECOGNITION
51
CONCLUDING REMARKS ON SV MACHINES

When implementing, the quality of a learning
machine is characterized by three main
components:
1. How rich and universal is the set of functions that the LM can approximate?
2. How well can the machine generalize?
3. How fast does the learning process for this machine converge?
52
EXAM QUESTION 1

What are the two components of Generalization
Error?
53
EXAM QUESTION 1

What are the two components of Generalization
Error?
Approximation Error
and Estimation Error
54
EXAM QUESTION 2

What is the main difference between Empirical
Risk Minimization and Structural Risk
Minimization?
55
EXAM QUESTION 2

What is the main difference between Empirical
Risk Minimization and Structural Risk
Minimization?
ERM: Keep the confidence interval fixed (chosen a priori) while minimizing empirical risk
SRM: Minimize both the confidence interval and the empirical risk simultaneously
56
EXAM QUESTION 3

What differs between SVM implementations, e.g. polynomial learning machines, radial basis function machines, and neural network LMs?
57
EXAM QUESTION 3

What differs between SVM implementations, e.g. polynomial learning machines, radial basis function machines, and neural network LMs?

The kernel function
58
ANY QUESTIONS?
59