Sparse Kernel Machines
Christopher M. Bishop,
Pattern Recognition and Machine Learning
Outline
Introduction to kernel methods
Support vector machines (SVM)
Relevance vector machines (RVM)
Applications
Conclusions
Supervised Learning
In machine learning, applications in which the training data comprise examples of the input vectors along with their corresponding target vectors are called supervised learning problems
[Figure: an example training set and model. Training pairs (x, t) such as (1, 60, pass), (2, 53, fail), (3, 77, pass), (4, 34, fail), ... are used to learn a model y(x) that maps an input to an output.]
Classification
[Figure: two-class data in the (x1, x2) plane. The decision boundary y = 0 separates the region y > 0, labeled t = +1, from the region y < 0, labeled t = -1.]
Regression
[Figure: regression example. Training points (x, t) with t between -1 and 1 and x between 0 and 1; the fitted curve predicts t at a new input x.]
Linear Models
Linear models for regression and classification:

$y(\mathbf{x}) = w_0 + w_1 x_1 + \cdots + w_D x_D$, where $\mathbf{x} = (x_1, \ldots, x_D)$ is the input and $w_0, \ldots, w_D$ are the model parameters

If we apply feature extraction,

$y(\mathbf{x}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + w_0$
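As a concrete illustration (my addition, not from the original slides), here is a minimal numpy sketch of the feature-based linear model; the Gaussian basis functions, their centers, and all parameter values are illustrative assumptions.

```python
import numpy as np

def phi(x, centers, s=0.5):
    """Gaussian basis functions phi_j(x) = exp(-||x - c_j||^2 / (2 s^2))."""
    return np.exp(-np.sum((x - centers) ** 2, axis=1) / (2 * s ** 2))

# Illustrative values: M-1 = 3 basis centers in a 2-D input space.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
w = np.array([0.5, -1.0, 2.0])  # weights w_1, ..., w_{M-1}
w0 = 0.1                        # bias w_0

x = np.array([1.0, 0.5])
y = w @ phi(x, centers) + w0    # y(x) = w^T phi(x) + w_0
print(y)
```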
Problems with Feature Space
Why feature extraction? Working in high-dimensional feature spaces addresses the problem of expressing complex functions
Problems:
- computational cost (working with very large vectors)
- the curse of dimensionality
Kernel Methods (1)
Kernel function: an inner product in some feature space → a nonlinear similarity measure

$k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$

Examples:
- polynomial: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + c)^d$
- Gaussian: $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2)$
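A minimal sketch of these two kernels in numpy (my addition; the parameter values c, d, and sigma are illustrative):

```python
import numpy as np

def polynomial_kernel(x, xp, c=1.0, d=2):
    """k(x, x') = (x^T x' + c)^d"""
    return (x @ xp + c) ** d

def gaussian_kernel(x, xp, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
xp = np.array([0.5, -1.0])
print(polynomial_kernel(x, xp), gaussian_kernel(x, xp))
```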
Kernel Methods (2)
$k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z})^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^T = \phi(\mathbf{x})^T \phi(\mathbf{z})$
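This identity is easy to check numerically. The sketch below (my addition, with illustrative vectors) verifies that the kernel computed directly in input space matches the inner product under the explicit feature map φ:

```python
import numpy as np

def phi(v):
    # Explicit feature map for the 2-D quadratic kernel derived above.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = (x @ z) ** 2     # kernel evaluated directly in input space
rhs = phi(x) @ phi(z)  # inner product in the 3-D feature space
print(lhs, rhs, np.isclose(lhs, rhs))  # agree up to floating point
```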
Many linear models can be reformulated using a “dual representation” in which the kernel functions arise naturally → they only require inner products between data (input) points
Kernel Methods (3)
We can benefit from the kernel trick:
- choosing a kernel function is equivalent to choosing φ → no need to specify which features are being used
- we can save computation by not explicitly mapping the data to feature space, but instead working out the inner product directly in the data space
Kernel Methods (4)
Kernel methods exploit information about the inner products between data items
We can construct kernels indirectly by choosing a feature space mapping φ, or directly by choosing a valid kernel function
If a badly chosen kernel maps to a space with many irrelevant features, performance suffers, so we need some prior knowledge of the target
Kernel Methods (5)
Two basic modules for kernel methods:
- a general-purpose learning model
- a problem-specific kernel function
Kernel Methods (6)
Limitation: the kernel function k(x_n, x_m) must be evaluated for all possible pairs x_n and x_m of training points when making predictions for new data points
A sparse kernel machine makes predictions using only a subset of the training data points, as the sketch below illustrates
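Here is a small sketch of why sparsity helps at prediction time (my addition; data, coefficients, and kernel width are illustrative): a dense kernel machine must sum over every training point, while a sparse one can skip the terms whose coefficients are exactly zero.

```python
import numpy as np

def gaussian_kernel(x, xn, sigma=0.5):
    return np.exp(-np.sum((x - xn) ** 2) / (2 * sigma ** 2))

def predict(x, X_train, coef, b=0.0):
    # Only the points with nonzero coefficients contribute, so a sparse
    # model evaluates far fewer kernel functions per prediction.
    return sum(c * gaussian_kernel(x, xn)
               for c, xn in zip(coef, X_train) if c != 0.0) + b

X_train = np.array([[0.0], [0.5], [1.0], [1.5]])
coef = np.array([0.0, 2.0, 0.0, -1.0])  # illustrative: two nonzero terms
print(predict(np.array([0.7]), X_train, coef))
```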
Outline
Introduction to kernel methods
Support vector machines (SVM)
Relevance vector machines (RVM)
Applications
Conclusions
Support Vector Machines (1)
Support Vector Machines are a system for efficiently training linear machines in kernel-induced feature spaces, while respecting the insights provided by generalization theory and exploiting optimization theory
Generalization theory describes how to control learning machines to prevent them from overfitting
Support Vector Machines (2)
To avoid overfitting, SVM modifies the error function to a “regularized form”:

$E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$

where the hyperparameter λ balances the trade-off between the two terms
The aim of E_W is to limit the estimated functions to smooth functions
As a side effect, SVM obtains a sparse model
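As a minimal sketch of the regularized form (my addition, using regularized least squares with the quadratic penalty E_W(w) = ||w||^2/2 and illustrative data, rather than the SVM's own error function), the closed-form minimizer shows how λ enters the trade-off:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D data: a noisy sine curve.
X = rng.uniform(0, 1, size=(30, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(30)

# Degree-9 polynomial features: flexible enough to overfit without E_W.
Phi = X ** np.arange(10)

lam = 1e-3  # hyperparameter balancing data fit against weight magnitude
# Minimizer of E(w) = ||t - Phi w||^2 / 2 + lam * ||w||^2 / 2:
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
print(w)
```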
Support Vector Machines (3)
Fig. 1 Architecture of SVM
SVM for Classification (1)
The mechanism to prevent overfitting in classification is the “maximum margin classifier”
The SVM is fundamentally a two-class classifier
Maximum Margin Classifiers (1)
The aim of classification is to find a (D-1)-dimensional hyperplane that separates the data in a D-dimensional space
2D example:
Maximum Margin Classifiers (2)
[Figure: maximum margin classifier. The support vectors lie on the margin boundaries.]
Maximum Margin Classifiers (3)
[Figure: comparison of a small-margin and a large-margin separating boundary.]
Maximum Margin Classifiers (4)
Intuitively it is a “robust” solution:
- if we have made a small error in the location of the boundary, this gives us the least chance of causing a misclassification
The concept of the maximum margin is usually justified using Vapnik’s statistical learning theory
Empirically it works well
SVM for Classification (2)
After the optimization process, we obtain the prediction model:

$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$

where (x_n, t_n) are the N training data points
We find that the a_n are zero except for those of the support vectors → sparse
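As one concrete illustration (my addition; the slides do not prescribe a library), scikit-learn's SVC exposes exactly this sparsity: after training, only the support vectors carry nonzero coefficients a_n. The data here are synthetic and illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
t = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # nonlinear boundary

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, t)

# Predictions use only the support vectors, a subset of the training set.
print(len(clf.support_), "support vectors out of", len(X), "training points")
print(clf.predict([[0.2, 0.1], [2.0, 2.0]]))
```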
SVM for Classification (3)
Fig. 2 Data from two classes in two dimensions, showing contours of constant y(x) obtained from an SVM with a Gaussian kernel function
SVM for Classification (4)
For overlapping class distributions, SVM allows some of the training points to be misclassified → soft margin, with a penalty for points that violate the margin
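In scikit-learn's SVC (my addition as an illustration; the slides do not name this parameter), the penalty is controlled by C: a small C gives a softer margin that tolerates more violations, a large C penalizes them heavily.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two overlapping Gaussian classes.
X = np.vstack([rng.standard_normal((100, 2)) - 1.0,
               rng.standard_normal((100, 2)) + 1.0])
t = np.array([0] * 100 + [1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    # Softer margins (small C) typically leave more points on or inside
    # the margin, hence more support vectors.
    print(C, len(clf.support_))
```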
SVM for Classification (5)
For multiclass problems, there are methods to combine multiple two-class SVMs:
- one-versus-the-rest
- one-versus-one → more training time
Fig. 3 Problems in multiclass classification using multiple SVMs
SVM for Regression (1)
For regression problems, the mechanism to prevent overfitting is the “ε-insensitive error function”
[Figure: the quadratic error function vs. the ε-insensitive error function.]
SVM for Regression (2)
Fig. 4 The ε-tube. Points inside the tube incur no error; a point outside the tube incurs error |y(x) - t| - ε
SVM for Regression (3)
After the optimization process, we obtain the prediction model:

$y(\mathbf{x}) = \sum_{n=1}^{N} (a_n - \hat{a}_n) k(\mathbf{x}, \mathbf{x}_n) + b$

We find that the coefficients are zero except for those of the support vectors → sparse
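A short scikit-learn sketch (my addition; the library choice, data, and parameters are illustrative assumptions): points strictly inside the ε-tube get zero coefficients, so the fitted SVR depends only on its support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)

reg = SVR(kernel="rbf", epsilon=0.1, C=10.0).fit(X, t)

# Only points on or outside the eps-tube become support vectors.
print(len(reg.support_), "support vectors out of", len(X))
print(reg.predict([[0.25]]))
```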
SVM for Regression (4)
Fig. 5 Regression results. The support vectors lie on the boundary of the tube or outside the tube
Disadvantages
It is not sparse enough, since the number of support vectors required typically grows linearly with the size of the training set
Predictions are not probabilistic
Estimating the error/margin trade-off parameters requires cross-validation, which wastes computation
Kernel functions are limited (they must be valid positive-definite kernels)
Multiclass classification requires ad-hoc combinations of two-class SVMs
Outline
Introduction to kernel methods
Support vector machines (SVM)
Relevance vector machines (RVM)
Applications
Conclusions
Relevance Vector Machines (1)
The relevance vector machine (RVM) is a Bayesian sparse kernel technique that shares many of the characteristics of the SVM whilst avoiding its principal limitations
The RVM is based on a Bayesian formulation and provides posterior probabilistic outputs, as well as having much sparser solutions than the SVM
Relevance Vector Machines (2)
The RVM mirrors the structure of the SVM and uses a Bayesian treatment to remove the limitations of the SVM:

$y(\mathbf{x}) = \sum_{n=1}^{N} w_n k(\mathbf{x}, \mathbf{x}_n) + b$

Here the kernel functions are simply treated as basis functions, rather than as dot products in some feature space
Bayesian Inference
Bayesian inference allows one to model
uncertainty about the world and outcomes of
interest by combining common-sense knowledge
and observational evidence.
Relevance Vector Machines (3)
In the Bayesian framework, we use a prior distribution over w to avoid overfitting:

$p(\mathbf{w} \mid \alpha) = \prod_{m=1}^{N} \left(\frac{\alpha}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha}{2} w_m^2\right)$

where α is a hyperparameter which controls the model parameters w
Relevance Vector Machines (4)
Goal: find the most probable α* and β* to compute the predictive distribution over t_new for a new input x_new, i.e.

$p(t_{\text{new}} \mid \mathbf{x}_{\text{new}}, \mathbf{X}, \mathbf{t}, \alpha^*, \beta^*)$

where X and t are the training data and their target values
Maximize the likelihood function to obtain α* and β*:

$p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta)$
Relevance Vector Machines (5)
The RVM utilizes “automatic relevance determination” to achieve sparsity:

$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{m=1}^{N} \left(\frac{\alpha_m}{2\pi}\right)^{1/2} \exp\left(-\frac{\alpha_m}{2} w_m^2\right)$

where α_m represents the precision of w_m
In the procedure of finding α_m*, some α_m become infinite, which drives the corresponding w_m to zero → only the relevance vectors remain!
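scikit-learn does not ship an RVM, but its ARDRegression applies the same automatic relevance determination idea to a linear model. The sketch below (my addition, with illustrative data and kernel width) uses a Gaussian kernel matrix as the design matrix, mirroring y(x) = Σ w_n k(x, x_n) + b, and counts the surviving “relevance vectors”.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
t = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)

# One Gaussian basis function per training point: Phi[i, n] = k(x_i, x_n).
Phi = np.exp(-(X - X.T) ** 2 / (2 * 0.1 ** 2))

ard = ARDRegression().fit(Phi, t)

# Most alpha_m diverge during fitting, driving the matching w_m to (near)
# zero; the few remaining basis functions act as relevance vectors.
relevant = np.sum(np.abs(ard.coef_) > 1e-3)
print(relevant, "relevance vectors out of", len(X))
```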
Comparisons - Regression
[Figure: RVM regression fit, showing the one-standard-deviation predictive distribution, compared with the SVM fit.]
Comparison - Classification
[Figure: RVM and SVM classification results compared.]
Comparisons
The RVM is much sparser and makes probabilistic predictions
The RVM gives better generalization in regression
The SVM gives better generalization in classification
The RVM is computationally demanding during learning
Outline
Introduction to kernel methods
Support vector machines (SVM)
Relevance vector machines (RVM)
Applications
Conclusions
Applications (1)
SVM for face detection
Applications (2)
Marti Hearst, “Support Vector Machines,” 1998
Applications (3)
In feature-matching based object tracking, SVMs are used to detect false feature matches
Weiyu Zhu et al., “Tracking of Object with SVM Regression,” 2001
Applications (4)
Recovering 3D human poses by RVM
A. Agarwal and B. Triggs, “3D Human Pose from Silhouettes by Relevance Vector Regression,” 2004
Outline
Introduction to kernel methods
Support vector machines (SVM)
Relevance vector machines (RVM)
Applications
Conclusions
Conclusions
The SVM is a learning machine based on kernel methods and generalization theory, which can perform binary classification and real-valued function approximation tasks
The RVM has the same model form as the SVM but provides probabilistic predictions and sparser solutions
References
www.support-vector.net
N. Cristianini and J. Shawe-Taylor, “An Introduction to Support Vector Machines and Other Kernel-based Learning Methods,” Cambridge University Press, 2000
M. E. Tipping, “Sparse Bayesian Learning and the Relevance Vector Machine,” Journal of Machine Learning Research, 2001
Underfitting and Overfitting
[Figure: an underfitted model (too simple) and an overfitted model (too complex), shown against new data. Adapted from http://www.dtreg.com/svm.htm]