
Statistical Learning Theory: Classification
Using Support Vector Machines
John DiMona
Some slides based on those of Prof. Andrew Moore at CMU: http://www.cs.cmu.edu/~awm/tutorials
(Rough) Outline
1. Empirical Modeling
2. Risk Minimization
i. Theory
ii. Empirical Risk Minimization
iii. Structural Risk Minimization
3. Optimal Separating Hyperplanes
4. Support Vector Machines
5. Example
6. Questions
Empirical Data Modeling
• Observations of a system are collected
• Based on these observations a process of
induction is used to build up a model of the
system
• This model is used to deduce responses of the
system not yet observed
Empirical Data Modeling
• Data obtained through observation is finite and sampled by nature
• Typically this sampling is non-uniform
• Due to the high-dimensional nature of some problems, the data will form only a sparse distribution in the input space
• Creating a model from this type of data is an ill-posed problem
Empirical Data Modeling
[Figure: the target space and the hypothesis space, marking the Globally Optimal Model, the Best Reachable Model, and the Selected Model]
The goal in modeling is to choose a model from the hypothesis space which is closest (with respect to some error measure) to the underlying function in the target space.
Error in Modeling
• Approximation Error is a consequence of the hypothesis space not exactly fitting the target space:
– The underlying function may lie outside the hypothesis space
– A poor choice of the model space will result in a large approximation error (model mismatch)
• Estimation Error is the error due to the learning
procedure converging to a non-optimal model in
the hypothesis space
• Together these form the Generalization Error
Error in Modeling
[Figure: the Globally Optimal, Best Reachable, and Selected Models, illustrating approximation and estimation error]
Pattern Recognition
• Given a system $Y(x) = y$, where:
– $x = (x_1, \ldots, x_n) \in X \subseteq \mathbb{R}^n$ is the item we want to classify, and
– $y \in \{-1, +1\}$ is the true classification of that item
• Develop a model $f(x) = \hat{y}$ that best predicts the behavior of the system for all possible items $x \in X$.
Supervised Learning
i. A generator (G) of a set of vectors $x \in \mathbb{R}^n$, observed independently from the system with a fixed, unknown probability distribution $P(x)$
ii. A supervisor (S) who returns an output value $y$ for every input vector $x$, according to the system's conditional probability function $P(y \mid x)$ (also unknown)
iii. A learning machine (LM) capable of implementing a set of functions $f(x, \alpha)$, $\alpha \in \Lambda$, where $\Lambda$ is a set of parameters
Supervised Learning
1. Training: the generator creates a set $\{x_1, \ldots, x_l\} \subset X$ and the supervisor provides the correct classification of each point to form the training set $\{(x_1, y_1), \ldots, (x_l, y_l)\}$
– The learning machine develops an estimation function using the training data.
2. Use the estimation function to classify new, unseen data.
Risk Minimization
• In order to choose the best estimation function we must have a measure of discrepancy between the true classification of x, $Y(x) = y$, and an estimated classification $f(x, \alpha) = \hat{y}$
• For pattern recognition we use the 0-1 loss:
$$L(y, f(x, \alpha)) = \begin{cases} 0 & \text{if } y = f(x, \alpha) \\ 1 & \text{if } y \neq f(x, \alpha) \end{cases}$$
Risk Minimization
• The expected value of the loss with respect to some estimation function $f(x, \alpha)$:
$$R(\alpha) = \int L(y, f(x, \alpha)) \, dP(x, y)$$
where $P(x, y) = P(x)\,P(y \mid x)$
• Goal: Find the function $f(x, \alpha_0)$ that minimizes the risk $R(\alpha)$ (over all functions $f(x, \alpha)$, $\alpha \in \Lambda$)
• Problem: By definition we don't know $P(x, y)$
To make things clearer…
For the coming discussion we will shorten notation in the following ways:
• The training set $\{(x_1, y_1), \ldots, (x_l, y_l)\}$ will be referred to as $\{z_1, \ldots, z_l\}$
• The loss function $L(y, f(x, \alpha))$ will be written $Q(z, \alpha)$
Empirical Risk Minimization (ERM)
• Instead of measuring risk over the set of all $x \in X$, just measure it over the training set $\{z_1, \ldots, z_l\}$ (a short numerical sketch follows this slide):
$$R_{emp}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q(z_i, \alpha)$$
• The empirical risk $R_{emp}(\alpha)$ must converge uniformly to the actual risk $R(\alpha)$ over the set of loss functions $Q(z, \alpha)$, $\alpha \in \Lambda$, in both directions:
$$\lim_{l \to \infty} R_{emp}(\alpha) = R(\alpha), \qquad \lim_{l \to \infty} \min_{\alpha \in \Lambda} R_{emp}(\alpha) = \min_{\alpha \in \Lambda} R(\alpha)$$
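As a concrete illustration of the quantities above, here is a minimal Python sketch (the data, weights, and bias are made-up illustrative values, not from the slides) that evaluates a fixed linear indicator function $f(x, \alpha) = \mathrm{sign}(w \cdot x + b)$ on a toy training set and computes its empirical risk under the 0-1 loss.

```python
import numpy as np

# Toy training set {(x_i, y_i)}, y in {-1, +1}; values are illustrative only.
X = np.array([[1.0, 2.0], [-0.5, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])

# A fixed estimation function f(x, alpha) = sign(w.x + b); alpha = (w, b).
w = np.array([0.8, 0.6])
b = -0.1

def f(X, w, b):
    """Linear indicator function: returns +1 or -1 for each row of X."""
    return np.where(X @ w + b >= 0, 1, -1)

def zero_one_loss(y_true, y_pred):
    """0-1 loss: 0 if the prediction matches the true label, 1 otherwise."""
    return (y_true != y_pred).astype(float)

# Empirical risk: R_emp(alpha) = (1/l) * sum_i Q(z_i, alpha)
y_hat = f(X, w, b)
R_emp = zero_one_loss(y, y_hat).mean()
print(f"Empirical risk R_emp = {R_emp:.2f}")   # 0.25: one of the four points is misclassified
```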
 

VC Dimension
(Vapnik–Chervonenkis Dimension)
• The VC dimension is a scalar value that measures the capacity of a set of functions.
• The VC dimension of a set of functions is $h$ if and only if there exists a set of points $\{x_i\}_{i=1}^{h}$ such that these points can be separated in all $2^h$ possible configurations, and no set $\{x_i\}_{i=1}^{q}$ with $q > h$ exists satisfying this property.
VC Dimension
(Vapnik–Chervonenkis Dimension)
• Three points in the plane can be shattered by the set of linear indicator functions, whereas four points cannot (see the sketch below)
• The set of linear indicator functions in n-dimensional space has a VC dimension equal to n + 1
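The shattering claim above can be checked numerically. The sketch below is an illustration only: it brute-forces every ±1 labeling of a point set and asks a linear SVM with a very large C (approximating a hard-margin linear indicator function, via scikit-learn, an assumed library choice) whether it can realize that labeling. Three non-collinear points pass all $2^3$ labelings; the four corners of a square fail on the XOR labelings.

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def can_shatter(points):
    """Return True if linear indicator functions can realize every +/-1 labeling."""
    for labels in itertools.product([-1, 1], repeat=len(points)):
        y = np.array(labels)
        if len(set(labels)) == 1:
            continue  # a constant labeling is trivially realizable
        # A very large C approximates a hard-margin linear classifier.
        clf = SVC(kernel="linear", C=1e6).fit(points, y)
        if not np.array_equal(clf.predict(points), y):
            return False
    return True

three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])          # non-collinear
four_points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # square corners

print("3 points shattered:", can_shatter(three_points))   # expected: True
print("4 points shattered:", can_shatter(four_points))    # expected: False (XOR labelings)
```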
Upper Bound for Risk
• It can be shown that
$$R(\alpha_l) \leq R_{emp}(\alpha_l) + \Phi\!\left(\frac{l}{h}\right),$$
where $\Phi(l/h)$ is the confidence interval and $h$ is the VC dimension.
• ERM only minimizes $R_{emp}(\alpha_l)$; the confidence interval $\Phi(l/h)$ is fixed by the VC dimension of the set of functions $f(x, \alpha)$, $\alpha \in \Lambda$, determined a priori.
• When implementing ERM one must tune the confidence interval based on the problem to avoid underfitting/overfitting the data.
Structural Risk Minimization (SRM)
• SRM attempts to minimize the right-hand side of the inequality over both terms simultaneously:
$$R(\alpha_l) \leq R_{emp}(\alpha_l) + \Phi\!\left(\frac{l}{h}\right)$$
• The first term is dependent upon a specific function's error and the second depends on the VC dimension of the space that function is in
• Therefore VC dimension must be a controlling variable
Structural Risk Minimization (SRM)
• We define our hypothesis space $S$ to be the set of functions $Q(z, \alpha)$, $\alpha \in \Lambda$
• We say that $S_k = \{Q(z, \alpha)\}$, $\alpha \in \Lambda_k$, is the hypothesis space of VC dimension $k$, such that:
$$S_1 \subset S_2 \subset \ldots \subset S_n \subset \ldots$$
• For a set of observations $z_1, \ldots, z_l$, SRM chooses the function $Q(z, \alpha_l^k)$ minimizing the empirical risk in the subset $S_k$ for which the guaranteed risk is minimal
Structural Risk Minimization (SRM)
• SRM defines a trade-off between the quality of the approximation of the given data and the complexity of the approximating function
• As the VC dimension increases, the minima of the empirical risks decrease but the confidence interval increases
• SRM is more general than ERM because it uses the subset $S_k$ for which minimizing $R_{emp}(\alpha)$ yields the best bound on $R(\alpha)$ (see the sketch below)
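A minimal sketch of the SRM selection rule, under stated assumptions: the nested structure (VC dimensions $h_k$ and measured empirical risks) is hypothetical, and the confidence term uses a standard Vapnik-style form that the slides do not spell out. SRM then simply picks the subset whose guaranteed risk, $R_{emp} + \Phi(l/h)$, is smallest.

```python
import numpy as np

def confidence_interval(l, h, eta=0.05):
    """Vapnik-style confidence term Phi(l/h) for sample size l and VC dimension h,
    holding with probability 1 - eta (an assumed, standard form of the bound)."""
    return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

l = 200  # number of training observations (illustrative)

# Hypothetical nested structure S_1 c S_2 c ...: (VC dimension h_k, empirical risk in S_k).
# Empirical risk shrinks as capacity grows, while the confidence interval grows.
structure = [(2, 0.30), (5, 0.18), (10, 0.10), (25, 0.06), (60, 0.04)]

guaranteed = [(r_emp + confidence_interval(l, h), h, r_emp) for h, r_emp in structure]
bound, h_star, r_star = min(guaranteed)

for g, h, r in guaranteed:
    print(f"h = {h:3d}  R_emp = {r:.2f}  guaranteed risk <= {g:.2f}")
print(f"SRM picks the subset with h = {h_star} (bound {bound:.2f})")
```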
Support Vector Classification
• Uses the SRM principle to separate two classes by a linear indicator function which is induced from available examples
• The goal is to produce a classifier that will work well on unseen examples, i.e. it generalizes well
Linear Classifiers
[Figure: a training set of points labeled +1 and −1, with several candidate separating lines; the support vectors are the points nearest the chosen line]
Imagine a training set such as this. What is the best way to separate this data?
All of the candidate separators are correct, but which is the best?
The maximum margin classifier maximizes the distance from the hyperplane to the nearest data points (the support vectors).
Defining the Optimal Hyperplane
The optimal hyperplane separates the training set with the largest margin.
[Figure: the separating hyperplane $(w \cdot x) + b = 0$ with the parallel planes $(w \cdot x) + b = 1$ and $(w \cdot x) + b = -1$ on either side of it]
The training points satisfy
$$(w \cdot x_i) + b \geq 1 \quad \text{if } y_i = +1, \qquad (w \cdot x_i) + b \leq -1 \quad \text{if } y_i = -1.$$
The margin is defined as the distance from any point $x^-$ on the minus plane to the closest point $x^+$ on the plus plane:
$$M = \lvert x^+ - x^- \rvert$$
We need to find M in terms of w.
Defining the Optimal Hyperplane
Finding M in terms of w:
$$(w \cdot x^+) + b = +1, \qquad (w \cdot x^-) + b = -1, \qquad x^+ = x^- + \lambda w$$
(the last equality holds because w is perpendicular to the hyperplane). Substituting into the plus-plane equation:
$$w \cdot (x^- + \lambda w) + b = 1 \;\Rightarrow\; \underbrace{(w \cdot x^-) + b}_{=\,-1} + \lambda (w \cdot w) = 1 \;\Rightarrow\; \lambda = \frac{2}{w \cdot w}$$
Therefore the margin is
$$M = \lvert x^+ - x^- \rvert = \lvert \lambda w \rvert = \lambda \sqrt{w \cdot w} = \frac{2\sqrt{w \cdot w}}{w \cdot w} = \frac{2}{\sqrt{w \cdot w}}$$
So we want to maximize $M = \dfrac{2}{\sqrt{w \cdot w}}$, or equivalently minimize
$$\Phi(w) = \frac{1}{2}(w \cdot w)$$
(A numerical check of this derivation follows.)
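The algebra above can be verified numerically. A small sketch (with arbitrary illustrative values of w and b): take a point $x^-$ on the minus plane, step to $x^+ = x^- + \lambda w$ with $\lambda = 2/(w \cdot w)$, and confirm that $x^+$ lies on the plus plane and that $\lvert x^+ - x^- \rvert = 2/\sqrt{w \cdot w}$.

```python
import numpy as np

# Arbitrary hyperplane parameters (illustrative values).
w = np.array([3.0, 4.0])
b = -2.0

# Pick a point x_minus on the minus plane (w . x) + b = -1.
x_minus = -(1.0 + b) * w / (w @ w)
assert np.isclose(w @ x_minus + b, -1.0)

# Step along w by lambda = 2 / (w . w) to reach the plus plane.
lam = 2.0 / (w @ w)
x_plus = x_minus + lam * w
assert np.isclose(w @ x_plus + b, 1.0)

# The margin M = |x_plus - x_minus| equals 2 / sqrt(w . w) = 2 / ||w||.
M = np.linalg.norm(x_plus - x_minus)
print(M, 2.0 / np.linalg.norm(w))   # both print 0.4
```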
Quadratic Programming
• Minimizing $\frac{1}{2}(w \cdot w)$ is equivalent to maximizing
$$W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
in the non-negative quadrant $\alpha_i \geq 0$, $i = 1, \ldots, l$, under the constraint
$$\sum_{i=1}^{l} \alpha_i y_i = 0$$
• This is derived using the Lagrange functional (see the sketch below)
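A hedged sketch of solving this dual problem on a tiny separable toy set. The data and the choice of solver (scipy's general-purpose SLSQP routine rather than a dedicated QP package) are illustrative assumptions; the primal solution is then recovered via $w = \sum_i \alpha_i y_i x_i$ and $b$ from a support vector.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l = len(y)

# Matrix H_ij = y_i y_j (x_i . x_j) appearing in the quadratic term of W(alpha).
H = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(alpha):
    """Negative of W(alpha); SLSQP minimizes, so we maximize W by minimizing -W."""
    return -(alpha.sum() - 0.5 * alpha @ H @ alpha)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * l                              # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(l), bounds=bounds,
               constraints=constraints, method="SLSQP")
alpha = res.x

# Recover w and b; support vectors are the points with alpha_i > 0.
w = (alpha * y) @ X
sv = np.argmax(alpha)          # index of one support vector
b = y[sv] - w @ X[sv]
print("alpha =", np.round(alpha, 3))
print("w =", w, " b =", round(b, 3))
```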

Extensions
• Possible to extend to non-separable training sets by adding an error parameter $\xi_i$ and minimizing:
$$\Phi(w, \xi) = \frac{1}{2}(w \cdot w) + C \sum_{i=1}^{l} \xi_i$$
• Data can be split into more than two classifications by using successive runs on the resulting classes (a brief sketch of both extensions follows)
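A brief scikit-learn-based sketch of both extensions (the library, its penalty parameter name C, and the toy data are assumptions for illustration): C plays the role of the slack penalty in $\Phi(w, \xi)$, and a three-class problem is handled by combining binary classifiers, here with a one-vs-rest scheme as one possible realization of "successive runs".

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# Two overlapping Gaussian classes: not linearly separable, so the slack
# variables xi_i (penalized through C) absorb the unavoidable training errors.
X2 = np.vstack([rng.normal(loc=-1.0, size=(50, 2)), rng.normal(loc=1.0, size=(50, 2))])
y2 = np.array([-1] * 50 + [+1] * 50)
soft = SVC(kernel="linear", C=1.0).fit(X2, y2)   # larger C -> heavier slack penalty

# More than two classes, handled by combining binary classifiers (one-vs-rest here).
X3 = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [4, 0], [0, 4])])
y3 = np.repeat([0, 1, 2], 50)
multi = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X3, y3)

print("two-class training accuracy:", soft.score(X2, y2))
print("three-class training accuracy:", multi.score(X3, y3))
```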
Support Vector (SV) Machines
• Maps the input vectors x into a high-dimensional feature space using a kernel function $(z_i \cdot z) = K(x_i, x)$ (see the sketch below)
• In this feature space the optimal separating hyperplane is constructed
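The identity the kernel provides, $(z_i \cdot z) = K(x_i, x)$, can be checked directly for a simple case. The sketch below uses the homogeneous degree-2 polynomial kernel as an illustrative choice (not something the slide specifies): an explicit feature map $\varphi$ into 3-D feature space gives the same inner product as evaluating K in the 2-D input space.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input: z = (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def K(x, x_prime):
    """Homogeneous degree-2 polynomial kernel K(x, x') = (x . x')^2."""
    return (x @ x_prime) ** 2

x_i = np.array([1.0, 2.0])
x = np.array([3.0, -1.0])

# Inner product in feature space equals the kernel evaluated in input space.
print(phi(x_i) @ phi(x))   # 1.0
print(K(x_i, x))           # 1.0
```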

Optimal Hyperplane in Feature Space
[Figure: mapping from the Input Space to the Feature Space, where the optimal separating hyperplane is constructed]
Support Vector (SV) Machines
1-Dimensional Example
[Figure: 1-D training points, shown in stages]
• Easy! (the two classes can be separated by a single threshold)
• Harder (impossible) (no single threshold separates the classes)
• Project into a higher dimension: $z_k = (x_k, x_k^2)$
• Magic… the projected data are now linearly separable (see the sketch below)
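The projection $z_k = (x_k, x_k^2)$ can be reproduced numerically. In the sketch below (toy 1-D data chosen for illustration, with scikit-learn as an assumed library), the negative class sits between the positives so no single threshold works, yet after the projection a linear SVM separates the classes perfectly.

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data where the negative class sits between the positives: no threshold works.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])

# Project into a higher dimension: z_k = (x_k, x_k^2).
Z = np.column_stack([x, x**2])

clf = SVC(kernel="linear", C=1e6).fit(Z, y)
print(clf.score(Z, y))   # 1.0 -- linearly separable in the projected space
```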

Support Vector (SV) Machines
• Some possible ways to implement SV
machines:
i. Polynomial Learning Machines
ii. Radial Basis Function Machines
iii. Multi-Layer Neural Networks
• These methods all implement different kernel
functions
Two-Layer Neural Network Approach
• Kernel is a sigmoid function: $K(x, x_i) = S\big(v (x \cdot x_i) + c\big)$
• Implements the rule:
$$f(x, \alpha) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i \, S\big(v (x \cdot x_i) + c\big) + b\right)$$
• Using this technique the following are found automatically (see the sketch below):
– The architecture of the two-layer machine, determining the number N of units in the first layer (the number of support vectors)
– The vectors of the weights $w_i = x_i$ in the first layer
– The vector of weights for the second layer (the values of $\alpha$)
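A hedged sketch of this neural-network view using a library SVM with a sigmoid kernel. In scikit-learn the sigmoid kernel is $\tanh(\gamma\,(x \cdot x_i) + c_0)$, so gamma and coef0 play roughly the roles of v and c above; the library choice, data, and parameter values are illustrative assumptions. The fitted support vectors then act as the N first-layer units and the dual coefficients as the second-layer weights.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Toy two-class data (illustrative).
X = np.vstack([rng.normal(loc=-1.5, size=(40, 2)), rng.normal(loc=1.5, size=(40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

# Sigmoid-kernel SVM: K(x, x_i) = tanh(gamma * (x . x_i) + coef0).
clf = SVC(kernel="sigmoid", gamma=0.5, coef0=-1.0).fit(X, y)

# The "architecture" found automatically:
print("N (units in the first layer / support vectors):", len(clf.support_vectors_))
print("first-layer weight vectors w_i = x_i:\n", clf.support_vectors_[:3], "...")
print("second-layer weights (alpha_i y_i):", clf.dual_coef_[0, :3], "...")
```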
Two-Layer Neural Network Approach
[Figure: the two-layer SV machine architecture]
Handwritten Digit Recognition
• Used the U.S. Postal Service database
– 7,300 training patterns
– 2,000 test patterns
– Resolution of the database was 16 × 16 pixels, yielding a 256-dimensional input space
Handwritten Digit Recognition
Classifier             Raw error %
Human performance      2.5
Decision tree, C4.5    16.2
Polynomial SVM         4.0
RBF SVM                4.1
Neural SVM             4.2
Exam Question 1
• What are the two components of Generalization
Error?
Exam Question 1
• What are the two components of Generalization
Error?
Approximation Error and
Estimation Error
Exam Question 2
• What is the main difference between Empirical
Risk Minimization and Structural Risk
Minimization?
Exam Question 2
• What is the main difference between Empirical
Risk Minimization and Structural Risk
Minimization?
• ERM: Keep the confidence interval fixed (chosen
a priori) while minimizing empirical risk
• SRM: Minimize both the confidence interval and
the empirical risk simultaneously
Exam Question 3
• What differs between SVM implementations
(polynomial, radial, NN, etc.)?
Exam Question 3
• What differs between SVM implementations
(polynomial, radial, NN, etc.)?
• The Kernel function.
References
• Vapnik: The Nature of Statistical Learning Theory
• Gunn: Support Vector Machines for Classification and Regression (http://www.dec.usc.es/persoal/cernadas/tc03/mc/SVM.pdf)
• Andrew Moore’s SVM Tutorial:
http://www.cs.cmu.edu/~awm/tutorials
Any Questions?