lecture set 9 EE 8591 - Electrical and Computer Engineering


Introduction to Predictive Learning
LECTURE SET 7: Support Vector Machines
Electrical and Computer Engineering
1
OUTLINE
• Objectives
  - explain motivation for SVM
  - describe basic SVM for classification & regression
  - compare SVM vs. statistical & NN methods
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion
2
MOTIVATION for SVM
• Recall ‘conventional’ methods:
  - model complexity ~ dimensionality
  - nonlinear methods → multiple local minima
  - hard to control complexity
• A ‘good’ learning method should have:
  (a) a tractable optimization formulation
  (b) tractable complexity control (1-2 parameters)
  (c) flexible nonlinear parameterization
• Properties (a), (b) hold for linear methods
• SVM: a solution approach that aims to combine (a)-(c)
3
SVM APPROACH
• Linear approximation in Z-space using a special adaptive loss function
• Complexity independent of dimensionality
[Diagram: input x → nonlinear mapping g(x) → feature vector z → linear model (w·z) → output ŷ]
4
Motivation for Nonlinear Methods
1. A nonlinear learning algorithm is proposed using ‘reasonable’ heuristic arguments
   (reasonable ~ statistical or biological).
2. Empirical validation + improvement
3. Statistical explanation (why it really works)
Examples: statistical and neural network methods.
In contrast, SVM methods were originally proposed within the VC-theoretic framework.
5
OUTLINE
• Objectives
• Motivation for margin-based loss
  - Loss functions for regression
  - Loss functions for classification
  - Philosophical interpretation
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion
6
Main Idea
• Model complexity is controlled by a special loss function used for fitting the training data
• Such empirical loss functions may be different from the loss functions used in the learning problem setting
• Such loss functions are adaptive, i.e. they can adapt their complexity to a particular data set
• Different loss functions for different learning problems (classification, regression, etc.)
• Model complexity (VC-dimension) is controlled independently of the number of features
7
Robust Loss Function for Regression
• Squared loss is motivated by:
  - large-sample settings
  - parametric assumptions
  - Gaussian noise
• For practical settings it is better to use linear (absolute-value) loss
8
Epsilon-insensitive Loss for Regression
• Can also control model complexity
L ( y, f (x, ))  max| y  f (x, ) |  ,0
9
Empirical Comparison
• Univariate regression: $y = x + \xi$, $x \in [0,1]$, $\xi \sim N(0, 0.36)$
• Squared, linear and SVM loss (with $\varepsilon = 0.6$):
[Figure: estimates from four training samples; axes x vs. y. Red ~ target function, Dotted ~ estimate using squared loss, Dashed ~ linear loss, Dash-dotted ~ SVM loss]
10
Empirical Comparison (cont’d)
• Univariate regression: $y = x + \xi$, $x \in [0,1]$, $\xi \sim N(0, 0.36)$
• Squared, linear and SVM loss (with $\varepsilon = 0.6$)
• Test error (MSE) estimated for 5 independent realizations of training data (4 training samples):

  Realization   Squared loss   Least-modulus loss   SVM loss (epsilon = 0.6)
  1             0.024          0.134                0.067
  2             0.128          0.075                0.063
  3             0.920          0.274                0.041
  4             0.035          0.053                0.032
  5             0.111          0.027                0.005
  Mean          0.244          0.113                0.042
  St. Dev.      0.381          0.099                0.025
11
Loss Functions for Classification
• Decision rule: $D(\mathbf{x}) = \mathrm{sign}(f(\mathbf{x}, \omega))$
• The quantity $y f(\mathbf{x}, \omega)$ is analogous to residuals in regression
• Common loss functions: 0/1 loss and linear loss
• Properties of a good loss function?
12
Motivation for margin-based loss (1)
• Given: linearly separable data. How to construct a linear decision boundary?
[Figure (a): many linear decision boundaries (that have no errors)]
13
Motivation for margin-based loss (2)
• Given: linearly separable data. Which linear decision boundary is better?
• The model with the larger margin is more robust for future data
14
Largest-margin solution
• All solutions explain the data well (zero error)
• All solutions ~ the same linear parameterization
• Larger margin ~ more confidence (larger falsifiability)
[Figure: largest-margin separating hyperplane, margin $M = 2\Delta$]
15
Margin-based loss for classification
SVM loss or hinge loss: $L_\Delta(y, f(\mathbf{x}, \omega)) = \max(\Delta - y f(\mathbf{x}, \omega),\ 0)$
Minimization of slack variables $\xi_i = \Delta - y_i f(\mathbf{x}_i, \omega)$
[Figure: slack variables $\xi_1$ (a sample with $y = +1$) and $\xi_2$ (a sample with $y = -1$)]
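A minimal sketch (my own, assuming NumPy) of the margin-based (hinge) loss defined above:

```python
import numpy as np

def hinge_loss(y, f_x, delta=1.0):
    """Margin-based (hinge) loss: zero when y*f(x) >= delta, linear otherwise."""
    return np.maximum(delta - y * f_x, 0.0)

# Correctly classified points outside the margin get zero loss; points inside
# the margin or on the wrong side get positive loss (they falsify the model).
y = np.array([+1, +1, -1, -1])
f_x = np.array([2.0, 0.3, -1.5, 0.4])
print(hinge_loss(y, f_x, delta=1.0))   # [0.  0.7 0.  1.4]
```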
16
Margin-based loss for classification:
margin size is adapted to training data
[Figure: two-class data (Class +1, Class -1) with the margin adapted to the training data]
$L_\Delta(y, f(\mathbf{x}, \omega)) = \max(\Delta - y f(\mathbf{x}, \omega),\ 0)$
17
Motivation: philosophical
• Classical view: a good model explains the data + has low complexity
  → Occam’s razor (complexity ~ number of parameters)
• VC theory: a good model explains the data + has low VC-dimension
  → VC-falsifiability (small VC-dimension ~ large falsifiability),
  i.e. the goal is to find a model that can explain the training data but cannot explain other data
• The idea: falsifiability ~ empirical loss function
18
Adaptive Loss Functions
• Both goals (explanation + falsifiability) can be encoded into an empirical loss function where
  - a (large) portion of the data has zero loss
  - the rest of the data has non-zero loss, i.e. it falsifies the model
• The trade-off (between the two goals) is adaptively controlled → adaptive loss function
• For classification, the degree of falsifiability is ~ margin size (see below)
19
Margin-based loss for classification
Margin  2
L ( y, f (x, ))  max  yf (x, ),0
20
Classification: non-separable data
Slack variables: $\xi = \Delta - y f(\mathbf{x}, \omega)$
$L_\Delta(y, f(\mathbf{x}, \omega)) = \max(\Delta - y f(\mathbf{x}, \omega),\ 0)$
21
Margin-based complexity control
• A large degree of falsifiability is achieved by
  - a large margin (classification)
  - a small epsilon (regression)
• For linear classifiers: larger margin → smaller VC-dimension
  $\Delta_1 > \Delta_2 > \Delta_3 > \ldots \ \sim\ h_1 \le h_2 \le h_3 \le \ldots$
22
$\Delta$-margin hyperplanes
• Solutions provided by minimization of the SVM loss can be indexed by the value of the margin $\Delta$
  → SRM structure: for $\Delta_1 > \Delta_2 > \Delta_3 > \ldots$ the VC-dimensions satisfy $h_1 \le h_2 \le h_3 \le \ldots$
• If the data samples belong to a sphere of radius R, then the VC-dimension is bounded by
  $h \le \min(R^2/\Delta^2,\ d) + 1$
• For large-margin hyperplanes, the VC-dimension is controlled independently of the dimensionality d.
23
SVM Model Complexity
• Two ways to control model complexity:
  - via model parameterization $f(\mathbf{x}, \omega)$, using a fixed loss function $L(y, f(\mathbf{x}, \omega))$
  - via an adaptive loss function $L_\Delta(y, f(\mathbf{x}, \omega))$, using a fixed (linear) parameterization $f(\mathbf{x}, \omega) = (\mathbf{w} \cdot \mathbf{x}) + b$
• ~ Two types of SRM structures
• Margin-based loss can be motivated by Popper’s falsifiability
24

Margin-based loss: summary
• Classification: $L_\Delta(y, f(\mathbf{x}, \omega)) = \max(\Delta - y f(\mathbf{x}, \omega),\ 0)$
  falsifiability controlled by margin $\Delta$
• Regression: $L_\varepsilon(y, f(\mathbf{x}, \omega)) = \max(|y - f(\mathbf{x}, \omega)| - \varepsilon,\ 0)$
  falsifiability controlled by $\varepsilon$
• Single-class learning: $L_r(f(\mathbf{x}, \omega)) = \max(\|\mathbf{x} - \mathbf{a}\| - r,\ 0)$
  falsifiability controlled by radius r
NOTE: the same interpretation/motivation of margin-based loss applies to different types of learning problems.
25
OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
  - Primal formulation (linearly separable case)
  - Dual optimization formulation
  - Soft-margin SVM formulation
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion
26
Optimal Separating Hyperplane
Distance between the hyperplane and a sample $\mathbf{x}'$: $|f(\mathbf{x}')| / \|\mathbf{w}\|$
→ Margin $\Delta = 1/\|\mathbf{w}\|$
[Figure: optimal separating hyperplane; shaded points are SVs]
27
Optimization Formulation
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$
• Find parameters $\mathbf{w}$, $b$ of the linear hyperplane $f(\mathbf{x}) = (\mathbf{w} \cdot \mathbf{x}) + b$
  that minimize $\Phi(\mathbf{w}) = 0.5\,\|\mathbf{w}\|^2$
  under constraints $y_i[(\mathbf{w} \cdot \mathbf{x}_i) + b] \ge 1$
• Quadratic optimization with linear constraints, tractable for moderate dimensions d (see the sketch below)
• For large dimensions use the dual formulation:
  - scales better with n (rather than d)
  - uses only dot products
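As a practical aside (not part of the original slides), here is a hedged sketch of fitting a linear maximum-margin classifier with scikit-learn on hypothetical toy data; a very large C approximates the hard-margin formulation above.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable data
X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 2.0],
              [-1.0, -1.0], [-2.0, -1.5], [-2.5, -2.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # large C ~ hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]   # hyperplane parameters
margin = 1.0 / np.linalg.norm(w)         # Delta = 1 / ||w||
print("w =", w, " b =", b, " margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```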
28
From Optimization Theory:
• For a given convex minimization problem with convex inequality constraints there exists an equivalent dual maximization formulation with nonnegative Lagrange multipliers $\alpha_i \ge 0$
• Karush-Kuhn-Tucker (KKT) conditions:
  Lagrange coefficients $\alpha_i^* > 0$ only for samples that satisfy the original constraint
  $y_i[(\mathbf{w} \cdot \mathbf{x}_i) + b] \ge 1$ with equality
  ~ SVs have positive Lagrange coefficients
29
Convex Hull Interpretation of Dual
Find convex hulls for each class. The closest points
to an optimal hyperplane are support vectors
30
Dual Optimization Formulation
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$
• Find parameters $\alpha_i^*$, $b^*$ of the optimal hyperplane
  $D(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i^* y_i (\mathbf{x} \cdot \mathbf{x}_i) + b^*$
  as the solution of the maximization problem
  $L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \rightarrow \max$
  under constraints $\sum_{i=1}^{n} y_i \alpha_i = 0$, $\alpha_i \ge 0$
• Note: data samples with nonzero $\alpha_i^*$ are SVs
• The formulation requires only the inner products $(\mathbf{x} \cdot \mathbf{x}')$, as illustrated below
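A small sketch (assuming scikit-learn, with hypothetical data) showing that the decision function can be recovered from the dual coefficients and the support vectors, as in the expansion above:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]])  # hypothetical data
y = np.array([+1, +1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_times_y = clf.dual_coef_[0]     # alpha_i * y_i, stored for support vectors only
sv = clf.support_vectors_

x_new = np.array([0.5, 0.2])
# D(x) = sum_i alpha_i y_i (x . x_i) + b, summed over the support vectors
D = np.sum(alpha_times_y * (sv @ x_new)) + clf.intercept_[0]
print(D, clf.decision_function([x_new])[0])   # the two values agree
```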
31
Support Vectors
• SV’s ~ training samples with non-zero loss
• SV’s are samples that falsify the model
• The model depends only on SVs
  → SVs ~ robust characterization of the data
WSJ, Feb 27, 2004:
“About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters.”
32
Support Vectors
• SVM test error bound:
  $E[\text{Test error}] \le \dfrac{E[\#\,\text{support vectors}]}{n}$
  → a small number of SVs ~ good generalization
• Can be explained using LOO (leave-one-out) cross-validation
• SVM generalization can be related to data compression
33
Soft-Margin SVM formulation
[Figure: soft-margin SVM with $f(\mathbf{x}) = (\mathbf{w} \cdot \mathbf{x}) + b$; slack variables $\xi_1 = 1 - f(\mathbf{x}_1)$, $\xi_2 = 1 - f(\mathbf{x}_2)$, $\xi_3 = 1 + f(\mathbf{x}_3)$; lines $f(\mathbf{x}) = +1$, $f(\mathbf{x}) = 0$, $f(\mathbf{x}) = -1$]
Minimize: $C\sum_{i=1}^{n}\xi_i + \frac{1}{2}\|\mathbf{w}\|^2 \rightarrow \min$
under constraints $y_i[(\mathbf{w} \cdot \mathbf{x}_i) + b] \ge 1 - \xi_i$, $\xi_i \ge 0$
34
SVM Dual Formulation
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$
• Find parameters $\alpha_i^*$, $b^*$ of the optimal hyperplane
  $D(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i^* y_i (\mathbf{x} \cdot \mathbf{x}_i) + b^*$
  as the solution of the maximization problem
  $L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \rightarrow \max$
  under constraints $\sum_{i=1}^{n} y_i \alpha_i = 0$, $0 \le \alpha_i \le C$
• Note: data samples with nonzero $\alpha_i^*$ are SVs
• The formulation requires only the inner products $(\mathbf{x} \cdot \mathbf{x}')$
35
OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion
36
Nonlinear Decision Boundary
• Fixed (linear) parameterization is too rigid
• A nonlinear (curved) decision boundary may yield a larger margin (falsifiability) and lower error
37
Nonlinear Mapping via Kernels
Nonlinear f(x,w) + margin-based loss = SVM
• Nonlinear mapping to a feature space (z-space)
• Linear in z-space ~ nonlinear in x-space
• But $(\mathbf{z}_i \cdot \mathbf{z}_j) = \sum_{k=1}^{m} g_k(\mathbf{x}_i)\,g_k(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$ ~ a symmetric function
  → compute the dot product via the kernel analytically
[Diagram: input x → nonlinear mapping g(x) → feature vector z → linear model (w·z) → output ŷ]
38
Example of Kernel Function
• 2D input space $\mathbf{x} = (x_1, x_2)$
• Mapping to z-space (2nd-order polynomial):
  $z_1 = 1,\ z_2 = \sqrt{2}\,x_1,\ z_3 = \sqrt{2}\,x_2,\ z_4 = \sqrt{2}\,x_1 x_2,\ z_5 = x_1^2,\ z_6 = x_2^2$
• One can show by direct substitution that for two input vectors $\mathbf{u} = (u_1, u_2)$ and $\mathbf{v} = (v_1, v_2)$ their dot product is calculated analytically:
  $(G(\mathbf{u}) \cdot G(\mathbf{v})) = K(\mathbf{u}, \mathbf{v}) = ((\mathbf{u} \cdot \mathbf{v}) + 1)^2$
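A quick numerical check of this identity (my own sketch, assuming NumPy):

```python
import numpy as np

def feature_map(x):
    """Explicit 2nd-order polynomial mapping g(x) for 2-D input x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

def poly_kernel(u, v):
    """Kernel computed analytically in input space: K(u, v) = ((u . v) + 1)^2."""
    return (np.dot(u, v) + 1.0) ** 2

u = np.array([0.7, -1.2])
v = np.array([2.0, 0.5])
print(np.dot(feature_map(u), feature_map(v)))   # dot product in z-space: 3.24
print(poly_kernel(u, v))                        # same value via the kernel: 3.24
```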
39
SVM Formulation (with kernels)
• Replacing $(\mathbf{z} \cdot \mathbf{z}') = K(\mathbf{x}, \mathbf{x}')$ leads to:
  Find parameters $\alpha_i^*$, $b^*$ of the optimal hyperplane
  $D(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i^* y_i K(\mathbf{x}_i, \mathbf{x}) + b^*$
  as the solution of the maximization problem
  $L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \rightarrow \max$
  under constraints $\sum_{i=1}^{n} y_i \alpha_i = 0$, $0 \le \alpha_i \le C/n$
• Given: the training data $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, an inner-product kernel $K(\mathbf{x}, \mathbf{x}')$, and the regularization parameter C
40
Examples of Kernels
Kernel K x, x is a symmetric function satisfying general
(Mercer’s) conditions.
Examples of kernels for different mappings xz
• Polynomials of degree m
K x, x  x  x'  1
m
2

• RBF kernel
 x  x' 


K x, x   exp

2
(width parameter)





• Neural Networks
K x, x  tanhv(x  x' )  a
v, a
for given parameters
Automatic selection of the number of hidden units (SV’s)
41
More on Kernels
• The kernel (Gram) matrix holds all the information (data + kernel); see the sketch after this list:
  K(1,1) K(1,2) ... K(1,n)
  K(2,1) K(2,2) ... K(2,n)
  ...
  K(n,1) K(n,2) ... K(n,n)
• The kernel defines a distance in some feature space (aka kernel-induced feature space)
• The kernel parameter controls nonlinearity
• Kernels can incorporate a priori knowledge
• Kernels can be defined over complex structures (trees, sequences, sets, etc.)
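As an illustration (assuming scikit-learn and hypothetical data), kernel (Gram) matrices like the one sketched above can be computed directly:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

# Hypothetical data matrix: n samples, d features
X = np.random.default_rng(0).normal(size=(5, 3))

# n x n kernel (Gram) matrices for two of the kernels listed earlier
K_rbf = rbf_kernel(X, gamma=1.0)                      # exp(-gamma * ||x - x'||^2)
K_poly = polynomial_kernel(X, degree=3, coef0=1.0)    # (gamma * (x . x') + 1)^3
print(K_rbf.shape, K_poly.shape)                      # both (5, 5) and symmetric
```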
42
New insights provided by SVM
• Why can linear classifiers generalize?
  (1) the margin is large (relative to R)
  (2) the % of SVs is small
  (3) the ratio d/n is small
• SVM offers an effective way to control complexity (via margin + kernel selection),
  i.e. implementing (1) or (2) or both
• Requires common-sense parameter tuning
43
OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
  - Model Selection
  - Histogram of Projections
  - SVM Extensions and Modifications
• SVM for Regression
• Summary and Discussion
44
SVM Model Selection
• The quality of SVM classifiers depends on proper tuning of model parameters:
  - kernel type (poly, RBF, etc.)
  - kernel complexity parameter
  - regularization parameter C
• Note: the VC-dimension depends on both C and the kernel parameters
• These parameters are usually selected via cross-validation, by searching over a wide range of parameter values (on the log-scale), as in the sketch below
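A hedged sketch of such a log-scale cross-validation search, assuming scikit-learn and hypothetical two-class data (the actual example on the next slide uses Ripley's data set):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Hypothetical two-class data standing in for a set like Ripley's
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Log-scale grid over C and the RBF kernel parameter gamma
param_grid = {"C": [0.1, 1, 10, 100, 1000, 10000],
              "gamma": [2.0 ** k for k in range(-3, 4)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, 1.0 - search.best_score_)   # best (C, gamma), CV error
```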
45
SVM Example 1: Ripley’s data
• Ripley’s data set:
  - 250 training samples
  - SVM using the RBF kernel $K(\mathbf{u}, \mathbf{v}) = \exp(-\gamma\|\mathbf{u} - \mathbf{v}\|^2)$
  - model selection via 10-fold cross-validation
• Cross-validation error table (rows ~ C, columns ~ gamma):

              gamma=2^-3  2^-2    2^-1    2^0     2^1     2^2     2^3
  C = 0.1     98.4%       51.6%   33.2%   28%     20.8%   19.2%   15.6%
  C = 1       23.6%       22%     19.6%   18%     16.4%   14.4%   14%
  C = 10      18.8%       20%     18.8%   16.4%   14%     13.6%   15.6%
  C = 100     20.4%       20%     15.6%   14%     12.8%   15.6%   16.4%
  C = 1000    18.4%       16%     13.6%   12.8%   16%     15.6%   18.4%
  C = 10000   14.4%       14%     14.8%   15.6%   17.2%   16%     18.4%

→ optimal C = 1,000, gamma = 1
Note: there may be multiple optimal parameter values
46
Optimal SVM Model
• RBF SVM with optimal parameters C = 1,000, gamma = 1
• Test error is 9.8% (estimated using 1,000 test samples)
[Figure: RBF SVM decision boundary for Ripley’s data; axes x1 vs. x2]
47
SVM Example 2: Noisy Hyperbolas
• Noisy Hyperbolas data set:
- 100 training samples (50 per class)
- 100 validation samples (used for parameter tuning)
• RBF SVM model vs. Poly SVM model (5th degree):
[Figure: decision boundaries estimated for the Noisy Hyperbolas data with an RBF kernel (left) and a 5th-degree polynomial kernel (right)]
• Which model is ‘better’?
• Model interpretation?
48
SVM Example 3: handwritten digits
• MNIST handwritten digits (5 vs. 8) ~ high-dimensional data
- 1,000 training samples (500 per class)
- 1,000 validation samples (used for parameter tuning)
- 1,866 test samples
• Each sample is a real-valued vector of size 28*28 = 784 (a 28 x 28 pixel grayscale image)
• RBF SVM: optimal parameters C=1,
  28
49
How to visualize high-dim SVM model?
• Histogram of projections for linear SVM:
- project training data onto normal vector w (of SVM model)
- show univariate histogram of projected training samples
• On the histogram: ‘0’~ decision boundary, -1/+1 ~ margins
• Similar histograms can be obtained for nonlinear SVM
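A minimal sketch (assuming scikit-learn and matplotlib, with hypothetical data) of the histogram-of-projections idea for a linear SVM:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Hypothetical two-class training data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (200, 10)), rng.normal(+1, 1, (200, 10))])
y = np.array([-1] * 200 + [+1] * 200)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# For a linear SVM, decision_function(x) = (w . x) + b, i.e. the projection onto
# the normal direction w, scaled so that the margins fall at -1 and +1.
proj = clf.decision_function(X)

plt.hist(proj[y == -1], bins=30, alpha=0.5, label="class -1")
plt.hist(proj[y == +1], bins=30, alpha=0.5, label="class +1")
for v in (-1, 0, +1):
    plt.axvline(v, linestyle="--")   # margins at -1/+1, decision boundary at 0
plt.legend()
plt.show()
```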
50
Histogram of Projections for Digits Data
• Projections of training/test data onto normal
direction of RBF SVM decision boundary:
[Figure: histograms of projections of the training data (left) and test data (right) onto the normal direction of the RBF SVM decision boundary; decision boundary at 0, margins at -1/+1]
51
Practical Issues for SVM Classifiers
• Pre-processing: all inputs pre-scaled to the range [0,1] or [-1,+1]
• Model Selection (parameter tuning)
• SVM Extensions
- multi-class problems
- unbalanced data sets
- unequal misclassification costs
52
SVM for multi-class problems
• Digit recognition ~ a ten-class problem:
  - estimate 10 binary classifiers (one digit vs. the rest)
• For prediction: a test input is applied to all 10 binary SVM classifiers, and the class with the largest SVM output value is selected (see the sketch below)
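A short sketch of the one-vs-rest scheme, assuming scikit-learn and hypothetical 3-class data in place of the 10 digit classes:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical 3-class data standing in for the 10-class digit problem
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 1.0, (50, 5)) for c in (0, 3, 6)])
y = np.repeat([0, 1, 2], 50)

# One binary SVM per class ("one class vs. the rest"); prediction picks the
# class whose binary SVM produces the largest output value.
ovr = OneVsRestClassifier(SVC(kernel="rbf", C=10, gamma=0.1)).fit(X, y)
print(ovr.predict(X[:5]))
print(ovr.decision_function(X[:5]))   # one output column per binary classifier
```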
53
Unbalanced Settings and Unequal Costs
• Unbalanced settings:
- different number of positive and negative samples
- different prior probabilities for training /test data
• Different Misclassification Costs
- two types of errors, FP and FN
- Cost (false_positive) vs. Cost(false_negative)
- Loss function: C  Pfp  C  Pfn  min
- these ‘costs’ need to be specified a priori, based
on application requirements
54
SVM Modifications
• The Problem: how to modify the standard SVM formulation?
• Unbalanced Data + Unequal Costs:
  $C^{+}\sum_{i \in \text{class}+}\xi_i + C^{-}\sum_{i \in \text{class}-}\xi_i + \frac{1}{2}\|\mathbf{w}\|^2 \rightarrow \min$
  where $C^{+} = \text{Cost(false negative)} \cdot S$ and $C^{-} = \text{Cost(false positive)} \cdot S$
• In practice, one needs to specify C and the ratio $C^{+}/C^{-}$
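One practical way to approximate unequal misclassification costs with a standard SVM implementation is per-class penalty weights; the following is my own hedged sketch (assuming scikit-learn), not necessarily the exact modification above:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical two-class data; labels are -1 ('triangles') and +1
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-0.5, 0.5, (100, 2)), rng.normal(+0.5, 0.5, (100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

# Per-class weights multiply C: errors on the positive class are penalized
# 3x more, i.e. an assumed cost ratio C+/C- = 3:1 as in the next example.
clf = SVC(kernel="rbf", C=10, gamma=1.0, class_weight={-1: 1.0, +1: 3.0})
clf.fit(X, y)
# The boundary shifts away from the positive class, trading false negatives
# for more false positives.
print(np.mean(clf.predict(X) == y))
```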
55
Example: SVM with unequal costs
• Ripley’s data set (as before) where
  - negative samples ~ ‘triangles’
  - given misclassification costs $C^{+}/C^{-} = 3{:}1$
• Note: the boundary is shifted away from the positive samples
[Figure: decision boundary of the SVM with unequal costs on Ripley’s data; axes x1 vs. x2]
56
SVM Applications
• Handwritten digit recognition
• Face detection in unrestricted
images
• Text/ document classification
• Image classification and retrieval
• …….
57
Handwritten Digit Recognition (mid-90’s)
• Data set: postal images (zip-code), segmented, cropped;
  ~7K training samples and 2K test samples
• Data encoding: 28x28 grey-scale pixel image
• Original motivation: compare SVM with a custom MLP network (LeNet) designed for this application
• Multi-class problem: one-vs-all approach
  → 10 SVM classifiers (one per digit)
58
Digit Recognition Results
• Summary
- prediction accuracy better than custom NN’s
- accuracy does not depend on the kernel type
- 100 – 400 support vectors per class (digit)
• More details:

  Type of kernel    No. of Support Vectors   Error %
  Polynomial        274                      4.0
  RBF               291                      4.1
  Neural Network    254                      4.2
• ~ 80-90% of SV’s coincide (for different kernels)
• Reduced-set SVM (Burges, 1996) ~ 15 per class
59
Document Classification (Joachims, 1998)
• The Problem: Classification of text documents in
large data bases, for text indexing and retrieval
• Traditional approach: human categorization (i.e. via
feature selection) – relies on a good indexing scheme.
This is time-consuming and costly
• Predictive Learning Approach (SVM): construct a
classifier using all possible features (words)
• Document/ Text Representation:
individual words = input features (possibly weighted)
• SVM performance:
– Very promising (~90% accuracy vs. 80% by other classifiers)
– Most problems are linearly separable → use linear SVM
60
Image Data Mining (Chapelle et al, 1999)
• Example image data:
• Classification of images in
data bases, for image
indexing etc
• DATA SET
Corel photo images:
2670 samples divided into
7 classes: airplanes,
birds, fish, vehicles etc.
Training data: 1375 images; Test data: 1375 images (50%)
MAIN ISSUE: invariant representation/ data encoding
61
OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
  - SV Regression formulation
  - Dual optimization formulation
  - Model selection
  - Example: Boston Housing
• Summary and Discussion
62
General SVM Modeling Approach
1. For the linear model $f(\mathbf{x}, \omega) = (\mathbf{w} \cdot \mathbf{x}) + b$, minimize the SVM functional
   $R_{SVM}(\mathbf{w}, b, Z_n) = \frac{1}{2}\|\mathbf{w}\|^2 + C \cdot R_{emp}(\mathbf{w}, b, Z_n)$
   using the SVM loss suitable for the learning problem at hand.
2. Transform (1) to the dual optimization formulation (using only dot products).
3. Use kernels to obtain a nonlinear version of (2).
Note: this approach is used for all learning problems. However, the tunable parameters of the margin-based loss differ for various types of learning problems.
63
SVM Regression
• For the linear model $f(\mathbf{x}, \omega) = (\mathbf{w} \cdot \mathbf{x}) + b$, minimize the SVM functional
  $R_{SVM}(\mathbf{w}, b, Z_n) = \frac{1}{2}\|\mathbf{w}\|^2 + C \cdot R_{emp}(\mathbf{w}, b, Z_n)$
  where the empirical loss (for regression) is given by
  $L_\varepsilon(y, f(\mathbf{x}, \omega)) = \max(|y - f(\mathbf{x}, \omega)| - \varepsilon,\ 0)$
• Two distinct ways to control model complexity:
  - by the value of C (with fixed epsilon)
  - by the value of epsilon (with fixed large C)
• SVM regression tunes both epsilon and C for optimal performance (see the sketch below)
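A minimal sketch of epsilon-insensitive regression (assuming scikit-learn, with hypothetical 1-D data):

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical 1-D regression data: y = x + Gaussian noise
rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 40).reshape(-1, 1)
y = x.ravel() + rng.normal(0, 0.2, 40)

# epsilon sets the width of the insensitive tube (zero loss inside it);
# C weights the empirical loss term. Both are tunable.
svr = SVR(kernel="linear", C=10.0, epsilon=0.2).fit(x, y)
print("number of support vectors:", len(svr.support_))   # samples on or outside the tube
print("prediction at x = 0.5:", svr.predict([[0.5]])[0])
```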
64
Linear SVM regression
For the linear parameterization $f(\mathbf{x}, \omega) = (\mathbf{w} \cdot \mathbf{x}) + b$, the SVM regression functional is
$R_{SVM}(\mathbf{w}, b, Z_n) = \frac{1}{2}\|\mathbf{w}\|^2 + C \cdot R_{emp}(\omega, Z_n) \rightarrow \min$
where
$R_{emp}(\omega, Z_n) = \frac{1}{n}\sum_{i=1}^{n} L_\varepsilon(y_i, f(\mathbf{x}_i, \omega))$
$L_\varepsilon(y, f(\mathbf{x}, \omega)) = \max(|y - f(\mathbf{x}, \omega)| - \varepsilon,\ 0)$
65
Direct Optimization Formulation
Given training data $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$
Minimize
$\frac{1}{2}(\mathbf{w} \cdot \mathbf{w}) + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)$
under constraints
$y_i - (\mathbf{w} \cdot \mathbf{x}_i) - b \le \varepsilon + \xi_i$
$(\mathbf{w} \cdot \mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^*$
$\xi_i \ge 0,\ \xi_i^* \ge 0, \quad i = 1, \ldots, n$
[Figure: the $\varepsilon$-insensitive tube around the regression estimate; samples outside the tube receive slack $\xi$ or $\xi^*$]
66
Dual Formulation for SVM Regression
Given training data $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, and the values of $\varepsilon$, C,
find coefficients $\alpha_i, \alpha_i^*$, $i = 1, \ldots, n$, which maximize
$L(\alpha, \alpha^*) = -\varepsilon\sum_{i=1}^{n}(\alpha_i^* + \alpha_i) + \sum_{i=1}^{n} y_i(\alpha_i^* - \alpha_i) - \frac{1}{2}\sum_{i,j=1}^{n}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)(\mathbf{x}_i \cdot \mathbf{x}_j)$
under constraints
$\sum_{i=1}^{n}\alpha_i^* = \sum_{i=1}^{n}\alpha_i$, $0 \le \alpha_i \le C$, $0 \le \alpha_i^* \le C$
This yields the following solution:
$f(\mathbf{x}) = \sum_{i \in SV}(\alpha_i^* - \alpha_i)(\mathbf{x}_i \cdot \mathbf{x}) + b^*$
67
 xc 2 

j 
Example: RBF regression
RBF model: $f(x, \mathbf{w}) = \sum_{j=1}^{m} w_j \exp\left(-\dfrac{(x - c_j)^2}{2 \cdot 0.20^2}\right)$
[Figure: training samples and the SVM estimate; axes x vs. y]
RBF estimate (dashed line) using $\varepsilon = 0.16$, $C = 2000$
The SVM model uses only 5 SVs (out of the 40 points)
68
Example: decomposition of RBF model
[Figure: the five weighted RBF kernel functions centered at the support vectors; axes x vs. y]
The weighted sum of the 5 RBF kernel functions gives the SVM model
69
SVM Model Selection: General
• Setting/ tuning of SVM hyper-parameters
- usually performed by experts
- more recently, by non-expert practitioners
• Issues for SVM model selection
(1) parameters controlling the ‘margin’ size
(2) kernel type and kernel complexity
• Strategies for model selection
- exhaustive search in the parameter space (via resampling)
- efficient search using VC analytic bounds
- rule-of-thumb analytic strategies (for a particular type of
learning problem)
70
Model Selection: continued
• Parameters controlling margin size
- for classification, parameter C
- for regression, the value of epsilon
- for single-class learning, the radius
• Complexity control ~ the fraction of SVs ($\nu$-SVM)
  - for classification, replace C with $\nu \in [0,1]$
  - for regression, specify the fraction of points allowed to lie outside the $\varepsilon$-insensitive zone
• For very sparse data (d/n>>1) use linear SVM
71
Parameter Selection for SVM Regression
• Selection of parameter C
  Recall the SVM solution $f(\mathbf{x}) = \sum_{i \in SV}(\alpha_i^* - \alpha_i)(\mathbf{x}_i \cdot \mathbf{x}) + b^*$
  where $0 \le \alpha_i \le C$ and $0 \le \alpha_i^* \le C$, $i = 1, \ldots, n$
  → with bounded kernels (RBF): $C = y_{\max} - y_{\min}$
• Selection of $\varepsilon$
  In general, $\varepsilon \sim$ (noise level), but this does not reflect the dependency on sample size.
  For linear regression the variance of the estimate scales as $\sigma^2/n$, suggesting $\varepsilon \sim \dfrac{\sigma}{\sqrt{n}}$
  The final prescription (computed in the sketch below): $\varepsilon = 3\sigma\sqrt{\dfrac{\ln n}{n}}$
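A small sketch (my own, assuming NumPy) of these analytic prescriptions:

```python
import numpy as np

def epsilon_prescription(noise_sigma, n):
    """Analytic prescription for epsilon: eps = 3 * sigma * sqrt(ln(n) / n)."""
    return 3.0 * noise_sigma * np.sqrt(np.log(n) / n)

def c_prescription(y):
    """Simple prescription for C with bounded (RBF) kernels: the range of y."""
    return np.max(y) - np.min(y)

# Example (hypothetical values): noise std 0.2, sample sizes 50 and 200
print(epsilon_prescription(0.2, 50))                 # ~0.168
print(epsilon_prescription(0.2, 200))                # ~0.098
print(c_prescription(np.array([-0.9, 0.1, 1.1])))    # 2.0
```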
72
Effect of SVM parameters on test error
• Training data: univariate sinc function $t(x) = \dfrac{\sin(x)}{x}$, $x \in [-10, 10]$, with additive Gaussian noise (sigma = 0.2)
  (a) small sample size 50
  (b) large sample size 200
[Figure: prediction risk as a function of epsilon and C/n for the two sample sizes]
73
SVM vs Regularization
• System imitation → SVM
• System identification → regularization
  But their risk functionals ‘look similar’:
  $R_{SVM}(\mathbf{w}, b) = C\sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, \omega)) + \frac{1}{2}\|\mathbf{w}\|^2$
  $R_{reg}(\mathbf{w}, b) = \sum_{i=1}^{n}(y_i - f(\mathbf{x}_i, \omega))^2 + \lambda\|\mathbf{w}\|^2$
• Recent claims: SVM = a special case of regularization
• These claims neglect the role of the margin-based loss
74
Comparison for Classification
• Linear SVM vs. Penalized LDA – the comparison is fair
• Data sets: small (20 samples per class) and large (100 samples per class)
75
Comparison results: classification
• Small sample size:
Linear SVM yields 0.5% - 1.1% error rate
Penalized LDA yields 2.8% - 3% error
• Large sample size:
Linear SVM yields 0.4% - 1.1% error rate
Penalized LDA yields 1.1% - 2.2% error
• Conclusion: margin based complexity
control is better than regularization
76
Comparison for regression
• Linear SVM vs linear ridge regression
Note: Linear SVM has 2 parameters
• Sparse Data Set: 30 noisy samples, using target function
  $t(\mathbf{x}) = 2x_1 + x_2 + 0 \cdot x_3 + 0 \cdot x_4 + 0 \cdot x_5$, $\mathbf{x} \in [0,1]^5$,
  corrupted with Gaussian noise with $\sigma = 0.2$
• Complexity Control:
- for RR vary regularization parameter
- for SVM ~ epsilon and C parameters
77
Control for ridge regression
• Coefficient shrinkage for ridge regression
[Figure: coefficient shrinkage paths w1-w5 for ridge regression as a function of log(lambda)]
78
Complexity control for SVM
• Coefficient shrinkage for SVM:
  (a) vary C (epsilon = 0)
  (b) vary epsilon (C = large)
[Figure: coefficient shrinkage paths w1-w5 for SVM, (a) as a function of log(n/C) and (b) as a function of epsilon]
79
Comparison: ridge regression vs SVM
• Sparse setting: n = 10, noise $\sigma = 0.2$
• Ridge regression: $\lambda$ chosen by cross-validation
• SV Regression: $\varepsilon = 0.2$, C selected by cross-validation
• Ave Risk (100 realizations): 0.44 (RR) vs. 0.37 (SVM)
[Figure: box plots of risk (MSE), log scale, for Ridge vs. SVM]
80
OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion
81
Summary
• Direct approach → different formulations
• Margin-based loss: robust, controls complexity (falsifiability)
• SRM: a new type of structure
• Nonlinear feature selection (~ SVs): incorporated into model estimation
• Appropriate applications:
  - high-dimensional data
  - content-based / content-dependent
82