Transcript cs.uvm.edu

Statistical Learning
Theory & Classifications
Based on Support
Vector Machines
The Nature of Statistical Learning Theory by V. Vapnik
Anders Melen
Table of Contents
• Empirical Data Modeling
• What is Statistical Learning Theory
• Model of Supervised Learning
• Risk Minimization
• Vapnik-Chervonenkis Dimensions
• Structural Risk Minimization (SRM)
• Support Vector Classification
  o Optimal Separating Hyperplane & Quadratic Programming
• Support Vector Machines (SVM)
• Exam Questions
Empirical Data Modeling
• Observations of a system are collected.
• Induction on the observations is used to build a model of the system.
• The model is then used to deduce responses of the system that have not been observed.
• Sampling is typically non-uniform.
• High-dimensional problems form a sparse distribution in the input space.
Modeling Error
• Approximation error is the consequence of the hypothesis space not fitting the target space.
[Diagram: target space containing the globally optimal model, the best reachable model, and the selected model]
Modeling Error
• Estimation error is the error due to the learning procedure converging to a non-optimal model in the hypothesis space.
[Diagram: approximation error separates the globally optimal model from the best reachable model; estimation error separates the best reachable model from the selected model]
● Together these form the generalization error.
Modeling Error
● Goal
  ○ Choose a model from the hypothesis space which is closest (with respect to some error measure) to the target space.
Statistical Learning Theory
Definition: “Consider the learning problem as a problem of finding a desired
dependence using a limited number of observations.” (Vapnik 17)
Model of Supervised Learning
• Training
  o The supervisor takes each generated x value and returns an output value y.
  o Each (x, y) pair becomes part of the training set, drawn from the joint distribution F(x, y) = F(y|x) F(x):
    (x1, y1), (x2, y2), … , (xl, yl)
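The training process above can be sketched in code. The linear target rule and the noise level are illustrative assumptions standing in for the unknown distributions F(x) and F(y|x):

```python
import random

def supervisor(x, rng):
    # Hypothetical target rule plus noise, standing in for the unknown
    # conditional distribution F(y|x).
    return 2.0 * x + rng.gauss(0.0, 0.1)

def draw_training_set(l, rng):
    # The generator draws x from a fixed distribution F(x); the supervisor
    # labels it; the learner sees only the resulting pairs (x_i, y_i).
    xs = [rng.uniform(-1.0, 1.0) for _ in range(l)]
    return [(x, supervisor(x, rng)) for x in xs]

training_set = draw_training_set(10, random.Random(0))
```

The learner never sees `supervisor` itself, only `training_set` — that restriction is the whole point of the model.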
Risk Minimization
• To find the best function, we need to measure loss:
  L(y, F(x, α))
• L is the discrepancy function, based on the y's generated by the supervisor and the ŷ's generated by the estimate functions.
• F is a predictor chosen so that the expected loss (the risk) is minimized.
Risk Minimization
• Pattern Recognition
  o In pattern recognition the supervisor's output y can take on only two values, y ∈ {0, 1}, and the loss is the indicator of disagreement:
    L(y, F(x, α)) = 0 if y = F(x, α), and 1 otherwise
  o So the risk functional gives the probability that the supervisor and the estimation function give different answers.
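For the pattern-recognition loss above, a minimal sketch (the data and the threshold classifier are hypothetical stand-ins for the supervisor and F(x, α)):

```python
def zero_one_loss(y, y_hat):
    # Pattern-recognition loss: 0 if supervisor and estimate agree, 1 otherwise.
    return 0 if y == y_hat else 1

def empirical_risk(pairs, f):
    # Average loss over the training set; for 0/1 loss this is exactly the
    # observed frequency of disagreement with the supervisor.
    return sum(zero_one_loss(y, f(x)) for x, y in pairs) / len(pairs)

pairs = [(0.2, 0), (0.7, 1), (0.9, 1), (0.1, 0), (0.6, 0)]
f = lambda x: 1 if x > 0.5 else 0   # hypothetical threshold classifier
# f disagrees with the supervisor only on (0.6, 0), so the risk is 1/5.
```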
Some Simplifications From Here On
● Training Set
  {(x1, y1), … , (xl, yl)} → {z1, … , zl}
● Loss Function
  L(y, F(x, α)) → Q(z, α)
Empirical Risk Minimization (ERM)
● We want to measure the risk over the training set rather than over the whole distribution:
  Remp(α) = (1/l) Σ Q(zi, α)
Empirical Risk Minimization (ERM)
● The empirical risk Remp(α) must converge to the actual risk R(α) over the set of loss functions.
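The convergence requirement can be illustrated with a simulation (an illustrative assumption, not from the book): a fixed classifier whose true risk is 0.1, with its empirical risk measured on a small and a large sample:

```python
import random

def observed_error_rate(n, rng, true_risk=0.1):
    # Simulate n i.i.d. draws where a fixed classifier errs with
    # probability true_risk; return the observed error frequency,
    # i.e. the empirical risk under 0/1 loss.
    errors = sum(1 for _ in range(n) if rng.random() < true_risk)
    return errors / n

rng = random.Random(1)
small = observed_error_rate(50, rng)      # noisy estimate
large = observed_error_rate(50_000, rng)  # concentrates near 0.1
```

For a single fixed function this is just the law of large numbers; the hard part of ERM theory is requiring the convergence uniformly over the whole set of functions.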
Empirical Risk Minimization (ERM)
● In both directions!
What do we need to address here?
● What are the necessary and sufficient conditions for consistency of a learning process based on the ERM principle?
● At what rate does the learning process converge?
● How can we control the rate of convergence of learning?
Vapnik-Chervonenkis Dimensions
• Let's just call them VC dimensions.
• Developed by Alexey Jakovlevich Chervonenkis & Vladimir Vapnik.
• The VC dimension is a scalar value that measures the capacity of a set of functions.
Vapnik-Chervonenkis Dimensions
• The VC dimension of a set of functions is responsible for the generalization ability of learning machines.
• The VC dimension of a set of indicator functions Q(z, α), α ∈ Λ, is the maximum number h of vectors z1, …, zh that can be separated into two classes in all 2^h possible ways using functions of the set.
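The shattering definition can be checked by brute force. A sketch, assuming a finite grid of candidate lines (a hypothetical construction) rich enough for the points used: three points in general position in the plane can be labeled in all 2^3 = 8 ways by linear indicator functions, while four points cannot (the XOR labeling fails), so the VC dimension of lines in the plane is 3:

```python
from itertools import product

def shatters(points, classifiers):
    # The set shatters the points if every one of the 2^h possible
    # labelings is realized by some classifier in the set.
    realized = {tuple(c(p) for p in points) for c in classifiers}
    return len(realized) == 2 ** len(points)

def candidate_lines():
    # Hypothetical finite grid of indicator functions 1[a*x + b*y + c > 0];
    # coarse, but rich enough for the point sets below.
    vals = [i / 2 - 5 for i in range(21)]  # -5.0, -4.5, ..., 5.0
    return [lambda p, a=a, b=b, c=c: 1 if a * p[0] + b * p[1] + c > 0 else 0
            for a, b, c in product(vals, vals, vals)]

lines = candidate_lines()
three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]              # shattered
four = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # not shattered
```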
Upper Bound For Risk
• It can be shown that
  R(α) ≤ Remp(α) + Φ(l/h)
  where Φ is the confidence interval and h is the VC dimension.
Upper Bound For Risk
• ERM only minimizes Remp(α); the confidence interval Φ is fixed a priori by the VC dimension of the set of functions.
• A learning machine must therefore tune the confidence interval to the problem at hand to avoid overfitting and underfitting.
Structural Risk Minimization (SRM)
• SRM attempts to minimize the right-hand side of the inequality, over both terms simultaneously.
Structural Risk Minimization (SRM)
• The Remp(α) term depends on a specific function's error, while the confidence interval Φ depends on the VC dimension of the space the functions live in.
• The VC dimension is the controlling variable.
Structural Risk Minimization (SRM)
• We define the hypothesis space S to be the set of functions Q(z, α), α ∈ Λ.
• We say that Sk = {Q(z, α)}, α ∈ Λk, is the hypothesis space of VC dimension hk, such that:
  S1 ⊂ S2 ⊂ … ⊂ Sn  and  h1 ≤ h2 ≤ … ≤ hn
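A minimal sketch of such a structure, using nested sets of one-dimensional threshold classifiers (a hypothetical example, not from the book): each Sk is contained in Sk+1, and capacity grows with k.

```python
def S(k):
    # Hypothetical structure element S_k: threshold classifiers whose
    # threshold lies on a dyadic grid of 2**k points in [0, 1).
    # Finer grids contain coarser ones, so S_1 ⊆ S_2 ⊆ ... with
    # non-decreasing capacity.
    thresholds = [i / 2 ** k for i in range(2 ** k)]
    return {t: (lambda x, t=t: 1 if x >= t else 0) for t in thresholds}
```

SRM would pick the element Sk whose best function minimizes empirical risk plus the confidence interval, rather than empirical risk alone.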
Support Vector Classification
• Uses the SRM principle to separate two classes by a linear indicator function induced from examples in the training set.
Support Vector Classification
• Well, all of these lines work as linear classifiers. However, which one is the best choice?
Support Vector Classification
• The margin of a linear classifier is the width by which the boundary can be increased before hitting a data point.
Support Vector Classification
• How about a better vector?
• Linear SVMs are the simplest SVMs.
Support Vector Classification
• The plus and minus planes are defined by an intercept b and a weight vector w perpendicular to the lines they lie on, so that the dot product with w measures perpendicular distance to each plane.
Optimal Separating Hyperplane
Margin M = |x+ − x−|
• (w · x) + b ≥ +1 for points on or beyond the plus plane
• (w · x) + b ≤ −1 for points on or beyond the minus plane
Optimal Separating Hyperplane
M = |x+ − x−|
(w · x+) + b = +1
(w · x−) + b = −1
x+ = x− + λw
Optimal Separating Hyperplane
M = |x+ − x−|
(w · x+) + b = +1
(w · x−) + b = −1
x+ = x− + λw, so
w · (x− + λw) + b = 1
w · x− + b + λ(w · w) = 1
Optimal Separating Hyperplane
(w · x+) + b = +1
(w · x−) + b = −1
x+ = x− + λw
w · (x− + λw) + b = 1
w · x− + b + λ(w · w) = 1
−1 + λ(w · w) = 1
so λ = 2 / (w · w)
Optimal Separating Hyperplane
M = |x+ − x−|
(w · x+) + b = +1
(w · x−) + b = −1
M = |x+ − x−| = |λw| = λ|w| = λ sqrt(w · w) = 2 / sqrt(w · w)
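The derivation above can be checked numerically; the values of w and b are arbitrary illustrative choices:

```python
import math

# Pick w and b, place x- on the minus plane, then x+ = x- + lam*w must land
# exactly on the plus plane when lam = 2 / (w . w), giving margin 2 / |w|.
w = (3.0, 4.0)
b = -2.0
dot = lambda u, v: u[0] * v[0] + u[1] * v[1]

# A point on the minus plane, solved along w's direction: (w . x-) + b = -1
scale = (-1.0 - b) / dot(w, w)
x_minus = (scale * w[0], scale * w[1])

lam = 2.0 / dot(w, w)
x_plus = (x_minus[0] + lam * w[0], x_minus[1] + lam * w[1])

# Margin |x+ - x-| should equal 2 / sqrt(w . w) = 2 / 5 = 0.4 here.
margin = math.hypot(x_plus[0] - x_minus[0], x_plus[1] - x_minus[1])
```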
Optimal Separating Hyperplane
● General optimal hyperplane: minimize (1/2)(w · w) subject to yi[(w · xi) + b] ≥ 1.
● Extend to non-separable training sets by adding an error parameter ξi ≥ 0 to each constraint and minimizing:
  (1/2)(w · w) + C Σ ξi
Quadratic Programming
• In the linearly separable case we want to minimize (1/2)(w · w).
• In the dual formulation we instead want to maximize:
  W(α) = Σ αi − (1/2) Σi Σj αi αj yi yj (xi · xj)
  subject to αi ≥ 0 and Σ αi yi = 0.
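A sketch of evaluating the dual objective on a two-point problem small enough to solve by hand (the data are assumptions for illustration): with x1 = (1, 0), y1 = +1 and x2 = (−1, 0), y2 = −1, the maximizer is α = (0.5, 0.5), which recovers w = Σ αi yi xi = (1, 0) and margin 2/|w| = 2.

```python
def dual_objective(alpha, xs, ys):
    # W(alpha) = sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j (x_i . x_j)
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    n = len(xs)
    quad = sum(alpha[i] * alpha[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1, -1]
best = dual_objective([0.5, 0.5], xs, ys)  # equals (1/2)|w|^2 at the optimum
```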
Support Vector Machines (SVM)
• Map input vectors x into a high-dimensional feature space of vectors z, using a kernel function to compute inner products there:
  (zi · z) = K(xi, x)
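The kernel identity can be verified directly for a degree-2 polynomial kernel on scalars (an illustrative choice): the kernel value equals the inner product of explicit feature vectors.

```python
import math

def K(x, z):
    # Degree-2 polynomial kernel on scalars.
    return (x * z + 1.0) ** 2

def phi(x):
    # Explicit feature map whose inner product reproduces K:
    # phi(x) . phi(z) = 1 + 2xz + (xz)^2 = (xz + 1)^2
    return (1.0, math.sqrt(2.0) * x, x * x)

x, z = 0.7, -1.3
lhs = K(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # lhs == rhs
```

The point of the kernel trick is that K is evaluated without ever constructing phi, which matters when the feature space is very high-dimensional.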
Support Vector Machines (SVM)
• Feature space… optimal hyperplane… what are you talking about?
Support Vector Machines (SVM)
[Animation: data mapped into a higher-dimensional feature space where it becomes linearly separable]
Support Vector Machines (SVM)
● Let's try a basic one-dimensional example!
Support Vector Machines (SVM)
● Aw snap, that was easy!
Support Vector Machines (SVM)
● OK, what about a harder one-dimensional example?
Support Vector Machines (SVM)
● Project the lower-dimensional data into a higher-dimensional space, just like in the animation!
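A minimal one-dimensional example of this projection (the data and the separator are illustrative assumptions): no single threshold separates class +1 at {−1, 1} from class −1 at {0}, but after lifting x → (x, x²) the second coordinate alone does.

```python
# One-dimensional points that no single threshold can separate.
data = [(-1.0, 1), (0.0, -1), (1.0, 1)]

def lift(x):
    # Project into two dimensions: x -> (x, x**2).
    return (x, x * x)

def classify(point, cut=0.5):
    # Hypothetical separating line in the lifted space: label by
    # whether the second coordinate x**2 exceeds the cut.
    return 1 if lift(point)[1] > cut else -1
```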
Support Vector Machines (SVM)
● There are several ways to implement an SVM:
  ○ Polynomial learning machines (like the animation)
  ○ Radial basis function machines
  ○ Two-layer neural networks
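Illustrative forms of the kernels behind the three machine types listed above; the degree, gamma, and sigmoid parameters are assumptions for the sketch, not values fixed by the text.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z, degree=3):
    # Polynomial learning machine kernel.
    return (dot(x, z) + 1.0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # Radial basis function kernel: depends only on distance |x - z|.
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq)

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    # Two-layer neural network kernel: a sigmoid of the inner product.
    return math.tanh(kappa * dot(x, z) + theta)
```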
Simple Neural Network
● Neural networks are computer science models inspired by nature!
● The brain is a massive natural neural network consisting of neurons and synapses.
● Neural networks can be modeled using a graphical model.
Simple Neural Network
● Neurons → Nodes
● Synapses → Edges
[Diagram: molecular form alongside the neural network model]
Two-Layer Neural Network
● The kernel is a sigmoid function.
[Diagram: two-layer network implementing the rules]
Two-Layer Neural Network
● Using this technique the following are found automatically:
  i. The architecture of the two-layer machine
  ii. The number N of units in the first layer (the number of support vectors)
  iii. The vectors of the weights wi = xi in the first layer
  iv. The vector of weights for the second layer (the values of αi)
Optical Character Recognition (OCR)
● Data from the U.S. Postal Service Database (1990)
● 7,300 training patterns
● 2,000 test patterns collected from real-life zip codes
Conclusion
● The quality of a learning machine is characterized by three main components:
  a. How rich and universal is the set of functions that the LM can approximate?
  b. How well can the machine generalize?
  c. How fast does the learning process for this machine converge?
Exam Question #1
• What is the main difference between polynomial, radial basis function, and neural network learning machines? What is that component for the neural network learning machine?
  o The kernel function; for the two-layer neural network it is a sigmoid.
Exam Question #2
• What is empirical data modeling? Give a summary of the main concept and its components.
  o Empirical data modeling uses induction on observations to build a model of a system; the model is then used to deduce responses of the system that have not been observed.
Exam Question #3
• What must Remp(α) do over the set of loss functions?
  o It must converge to R(α), the actual risk.
End
Any questions?