Sparse Kernel Machines
Christopher M. Bishop,
Pattern Recognition and Machine Learning
Outline
Introduction to kernel methods
Support vector machines (SVM)
Relevance vector machines (RVM)
Applications
Conclusions
Supervised Learning
In machine learning, applications in which the training data comprise examples of the input vectors along with their corresponding target vectors are called supervised learning problems
[Figure: an example training set and model. Training pairs (x, t) such as (1, 60, pass), (2, 53, fail), (3, 77, pass), (4, 34, fail), ... are used to learn a model y(x) that maps an input to an output.]
Classification
[Figure: two-class data in the (x1, x2) plane. The decision boundary y = 0 separates the region y > 0, labeled t = +1, from the region y < 0, labeled t = -1.]
Regression
[Figure: regression example. Training points (x, t) with t between -1 and 1 and x between 0 and 1; the fitted curve predicts t at a new input x.]
Linear Models
Linear models for regression and classification:

$y(\mathbf{x}) = w_0 + w_1 x_1 + \cdots + w_D x_D$, where $\mathbf{x} = (x_1, \ldots, x_D)$ is the input and $w_0, \ldots, w_D$ are the model parameters

If we apply feature extraction,

$y(\mathbf{x}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + w_0$
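As a concrete illustration (my addition, not from the original slides), here is a minimal numpy sketch of the feature-based linear model; the Gaussian basis functions, their centers, and all parameter values are illustrative assumptions.

```python
import numpy as np

def phi(x, centers, s=0.5):
    """Gaussian basis functions phi_j(x) = exp(-||x - c_j||^2 / (2 s^2))."""
    return np.exp(-np.sum((x - centers) ** 2, axis=1) / (2 * s ** 2))

# Illustrative values: M-1 = 3 basis centers in a 2-D input space.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
w = np.array([0.5, -1.0, 2.0])  # weights w_1, ..., w_{M-1}
w0 = 0.1                        # bias w_0

x = np.array([1.0, 0.5])
y = w @ phi(x, centers) + w0    # y(x) = w^T phi(x) + w_0
print(y)
```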
Problems with Feature Space
Why feature extraction? Working in high-dimensional feature spaces addresses the problem of expressing complex functions
Problems:
- computational cost (working with very large vectors)
- the curse of dimensionality
Kernel Methods (1)
Kernel function: an inner product in some feature space → a nonlinear similarity measure

$k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$

Examples:
- polynomial: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + c)^d$
- Gaussian: $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2)$
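A minimal sketch of these two kernels in numpy (my addition; the parameter values c, d, and sigma are illustrative):

```python
import numpy as np

def polynomial_kernel(x, xp, c=1.0, d=2):
    """k(x, x') = (x^T x' + c)^d"""
    return (x @ xp + c) ** d

def gaussian_kernel(x, xp, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
xp = np.array([0.5, -1.0])
print(polynomial_kernel(x, xp), gaussian_kernel(x, xp))
```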
Kernel Methods (2)
$k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z})^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^T = \phi(\mathbf{x})^T \phi(\mathbf{z})$
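This identity is easy to check numerically. The sketch below (my addition, with illustrative vectors) verifies that the kernel computed directly in input space matches the inner product under the explicit feature map φ:

```python
import numpy as np

def phi(v):
    # Explicit feature map for the 2-D quadratic kernel derived above.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = (x @ z) ** 2     # kernel evaluated directly in input space
rhs = phi(x) @ phi(z)  # inner product in the 3-D feature space
print(lhs, rhs, np.isclose(lhs, rhs))  # agree up to floating point
```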
Many linear models can be reformulated using a “dual representation” in which the kernel functions arise naturally → they only require inner products between data (input) points
Kernel Methods (3)
We can benefit from the kernel trick:
- choosing a kernel function is equivalent to choosing φ → no need to specify which features are being used
- we can save computation by not explicitly mapping the data to feature space, but instead working out the inner product directly in the data space
Kernel Methods (4)
Kernel methods exploit information about the inner products between data items
We can construct kernels indirectly by choosing a feature space mapping φ, or directly by choosing a valid kernel function
If a badly chosen kernel maps to a space with many irrelevant features, performance suffers, so we need some prior knowledge of the target
Kernel Methods (5)
Two basic modules for kernel methods:
- a general-purpose learning model
- a problem-specific kernel function
Kernel Methods (6)
Limitation: the kernel function k(x_n, x_m) must be evaluated for all possible pairs x_n and x_m of training points when making predictions for new data points
A sparse kernel machine makes predictions using only a subset of the training data points, as the sketch below illustrates
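Here is a small sketch of why sparsity helps at prediction time (my addition; data, coefficients, and kernel width are illustrative): a dense kernel machine must sum over every training point, while a sparse one can skip the terms whose coefficients are exactly zero.

```python
import numpy as np

def gaussian_kernel(x, xn, sigma=0.5):
    return np.exp(-np.sum((x - xn) ** 2) / (2 * sigma ** 2))

def predict(x, X_train, coef, b=0.0):
    # Only the points with nonzero coefficients contribute, so a sparse
    # model evaluates far fewer kernel functions per prediction.
    return sum(c * gaussian_kernel(x, xn)
               for c, xn in zip(coef, X_train) if c != 0.0) + b

X_train = np.array([[0.0], [0.5], [1.0], [1.5]])
coef = np.array([0.0, 2.0, 0.0, -1.0])  # illustrative: two nonzero terms
print(predict(np.array([0.7]), X_train, coef))
```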
Outline
Introduction to kernel methods
Support vector machines (SVM)
Relevance vector machines (RVM)
Applications
Conclusions
Support Vector Machines (1)
Support Vector Machines are a system for efficiently training linear machines in kernel-induced feature spaces, while respecting the insights provided by generalization theory and exploiting optimization theory
Generalization theory describes how to control learning machines to prevent them from overfitting
Support Vector Machines (2)
To avoid overfitting, SVM modifies the error function to a “regularized form”:

$E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$

where the hyperparameter λ balances the trade-off between the two terms
The aim of E_W is to limit the estimated functions to smooth functions
As a side effect, SVM obtains a sparse model
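As a minimal sketch of the regularized form (my addition, using regularized least squares with the quadratic penalty E_W(w) = ||w||^2/2 and illustrative data, rather than the SVM's own error function), the closed-form minimizer shows how λ enters the trade-off:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D data: a noisy sine curve.
X = rng.uniform(0, 1, size=(30, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(30)

# Degree-9 polynomial features: flexible enough to overfit without E_W.
Phi = X ** np.arange(10)

lam = 1e-3  # hyperparameter balancing data fit against weight magnitude
# Minimizer of E(w) = ||t - Phi w||^2 / 2 + lam * ||w||^2 / 2:
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
print(w)
```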
Support Vector Machines (3)
Fig. 1 Architecture of SVM
SVM for Classification (1)
The mechanism to prevent overfitting in classification is the “maximum margin classifier”
The SVM is fundamentally a two-class classifier
Maximum Margin Classifiers (1)
The aim of classification is to find a (D-1)-dimensional hyperplane that separates the data in a D-dimensional space
2D example:
Maximum Margin Classifiers (2)
[Figure: maximum margin classifier. The support vectors lie on the margin boundaries.]
Maximum Margin Classifiers (3)
[Figure: comparison of a small-margin and a large-margin separating boundary.]
Maximum Margin Classifiers (4)
Intuitively it is a “robust” solution:
- if we have made a small error in the location of the boundary, this gives us the least chance of causing a misclassification
The concept of the maximum margin is usually justified using Vapnik’s statistical learning theory
Empirically it works well
SVM for Classification (2)
After the optimization process, we obtain the prediction model:

$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$

where (x_n, t_n) are the N training data points
We find that the a_n are zero except for those of the support vectors → sparse
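As one concrete illustration (my addition; the slides do not prescribe a library), scikit-learn's SVC exposes exactly this sparsity: after training, only the support vectors carry nonzero coefficients a_n. The data here are synthetic and illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
t = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # nonlinear boundary

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, t)

# Predictions use only the support vectors, a subset of the training set.
print(len(clf.support_), "support vectors out of", len(X), "training points")
print(clf.predict([[0.2, 0.1], [2.0, 2.0]]))
```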
SVM for Classification (3)
Fig. 2 Data from two classes in two dimensions, showing contours of constant y(x) obtained from an SVM with a Gaussian kernel function
SVM for Classification (4)
For overlapping class distributions, SVM allows some of the training points to be misclassified → soft margin, with a penalty for points that violate the margin
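In scikit-learn's SVC (my addition as an illustration; the slides do not name this parameter), the penalty is controlled by C: a small C gives a softer margin that tolerates more violations, a large C penalizes them heavily.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two overlapping Gaussian classes.
X = np.vstack([rng.standard_normal((100, 2)) - 1.0,
               rng.standard_normal((100, 2)) + 1.0])
t = np.array([0] * 100 + [1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    # Softer margins (small C) typically leave more points on or inside
    # the margin, hence more support vectors.
    print(C, len(clf.support_))
```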
SVM for Classification (5)
For multiclass problems, there are methods to combine multiple two-class SVMs:
- one-versus-the-rest
- one-versus-one → more training time
Fig. 3 Problems in multiclass classification using multiple SVMs
SVM for Regression (1)
For regression problems, the mechanism to prevent overfitting is the “ε-insensitive error function”
[Figure: the quadratic error function vs. the ε-insensitive error function.]
SVM for Regression (2)
Fig. 4 The ε-tube. Points inside the tube incur no error; a point outside the tube incurs error |y(x) - t| - ε
SVM for Regression (3)
After the optimization process, we obtain the prediction model:

$y(\mathbf{x}) = \sum_{n=1}^{N} (a_n - \hat{a}_n) k(\mathbf{x}, \mathbf{x}_n) + b$

We find that the coefficients are zero except for those of the support vectors → sparse
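A short scikit-learn sketch (my addition; the library choice, data, and parameters are illustrative assumptions): points strictly inside the ε-tube get zero coefficients, so the fitted SVR depends only on its support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)

reg = SVR(kernel="rbf", epsilon=0.1, C=10.0).fit(X, t)

# Only points on or outside the eps-tube become support vectors.
print(len(reg.support_), "support vectors out of", len(X))
print(reg.predict([[0.25]]))
```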
SVM for Regression (4)
Fig. 5 Regression results. The support vectors lie on the boundary of the tube or outside the tube
Disadvantages
It is not sparse enough, since the number of support vectors required typically grows linearly with the size of the training set
Predictions are not probabilistic
Estimating the error/margin trade-off parameters requires cross-validation, which wastes computation
Kernel functions are limited (they must be valid positive-definite kernels)
Multiclass classification requires ad-hoc combinations of two-class SVMs
Outline
Introduction to kernel methods
Support vector machines (SVM)
Relevance vector machines (RVM)
Applications
Conclusions
Relevance Vector Machines (1)
The relevance vector machine (RVM) is a Bayesian sparse kernel technique that shares many of the characteristics of the SVM whilst avoiding its principal limitations
The RVM is based on a Bayesian formulation and provides posterior probabilistic outputs, as well as having much sparser solutions than the SVM
Relevance Vector Machines (2)
The RVM mirrors the structure of the SVM and uses a Bayesian treatment to remove the limitations of the SVM:

$y(\mathbf{x}) = \sum_{n=1}^{N} w_n k(\mathbf{x}, \mathbf{x}_n) + b$

Here the kernel functions are simply treated as basis functions, rather than as dot products in some feature space
Bayesian Inference
Bayesian inference allows one to model
uncertainty about the world and outcomes of
interest by combining common-sense knowledge
and observational evidence.
Relevance Vector Machines (3)
In the Bayesian framework, we use a prior distribution over w to avoid overfitting:

$p(\mathbf{w} \mid \alpha) = \prod_{m=1}^{N} \left(\frac{\alpha}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha}{2} w_m^2\right)$

where α is a hyperparameter which controls the model parameters w
Relevance Vector Machines (4)
Goal: find the most probable α* and β* to compute the predictive distribution over t_new for a new input x_new, i.e.

$p(t_{\text{new}} \mid \mathbf{x}_{\text{new}}, \mathbf{X}, \mathbf{t}, \alpha^*, \beta^*)$

where X and t are the training data and their target values
Maximize the likelihood function to obtain α* and β*:

$p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta)$
Relevance Vector Machines (5)
The RVM utilizes “automatic relevance determination” to achieve sparsity:

$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{m=1}^{N} \left(\frac{\alpha_m}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha_m}{2} w_m^2\right)$

where α_m represents the precision of w_m
In the procedure of finding α_m*, some α_m become infinite, which drives the corresponding w_m to zero → only the relevance vectors remain!
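scikit-learn does not ship an RVM, but its ARDRegression applies the same automatic relevance determination idea to a linear model. The sketch below (my addition, with illustrative data and kernel width) uses a Gaussian kernel matrix as the design matrix, mirroring y(x) = Σ w_n k(x, x_n) + b, and counts the surviving “relevance vectors”.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)

# One Gaussian basis function per training point: Phi[i, n] = k(x_i, x_n).
Phi = np.exp(-(X - X.T) ** 2 / (2 * 0.1 ** 2))

ard = ARDRegression().fit(Phi, t)

# Most alpha_m diverge during fitting, driving the matching w_m to (near)
# zero; the few remaining basis functions act as relevance vectors.
relevant = np.sum(np.abs(ard.coef_) > 1e-3)
print(relevant, "relevance vectors out of", len(X))
```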
Comparisons - Regression
[Figure: RVM regression fit, showing the one-standard-deviation predictive distribution, compared with the SVM fit.]
Comparison - Classification
[Figure: RVM and SVM classification results compared.]
Comparisons
The RVM is much sparser and makes probabilistic predictions
The RVM gives better generalization in regression
The SVM gives better generalization in classification
The RVM is computationally demanding during learning
Outline
Introduction to kernel methods
Support vector machines (SVM)
Relevance vector machines (RVM)
Applications
Conclusions
Applications (1)
SVM for face detection
Applications (2)
Marti Hearst, “Support Vector Machines,” 1998
Applications (3)
In feature-matching based object tracking, SVMs are used to detect false feature matches
Weiyu Zhu et al., “Tracking of Object with SVM Regression,” 2001
Applications (4)
Recovering 3D human poses by RVM
A. Agarwal and B. Triggs, “3D Human Pose from Silhouettes by Relevance Vector Regression,” 2004
Outline
Introduction to kernel methods
Support vector machines (SVM)
Relevance vector machines (RVM)
Applications
Conclusions
Conclusions
The SVM is a learning machine based on kernel methods and generalization theory, which can perform binary classification and real-valued function approximation tasks
The RVM has the same model form as the SVM but provides probabilistic predictions and sparser solutions
References
www.support-vector.net
N. Cristianini and J. Shawe-Taylor, “An Introduction to Support Vector Machines and Other Kernel-based Learning Methods,” Cambridge University Press, 2000
M. E. Tipping, “Sparse Bayesian Learning and the Relevance Vector Machine,” Journal of Machine Learning Research, 2001
Underfitting and Overfitting
[Figure: an underfitted model (too simple) and an overfitted model (too complex), shown against new data. Adapted from http://www.dtreg.com/svm.htm]