Transcript Part 2

Part 2:
Support Vector Machines
Vladimir Cherkassky
University of Minnesota
[email protected]
Presented at Tech Tune Ups, ECE Dept, June 1, 2011
Electrical and Computer Engineering
1
SVM: Brief History
1963 Margin (Vapnik & Lerner)
1964 Margin (Vapnik & Chervonenkis)
1964 RBF Kernels (Aizerman)
1965 Optimization formulation (Mangasarian)
1971 Kernels (Kimeldorf and Wahba)
1992-1994 SVMs (Vapnik et al)
1996 – present Rapid growth, numerous apps
1996 – present Extensions to other problems
2
MOTIVATION for SVM
• Problems with ‘conventional’ methods:
  - model complexity ~ dimensionality (# features)
  - nonlinear methods → multiple minima
  - hard to control complexity
• SVM solution approach:
  - adaptive loss function (to control complexity independently of dimensionality)
  - flexible nonlinear models
  - tractable optimization formulation
3
SVM APPROACH
• Linear approximation in Z-space using a special adaptive loss function
• Complexity independent of dimensionality
[Diagram: x → g(x) → z → (w·z) → ŷ, i.e. nonlinear mapping to Z-space followed by a linear model]
4
OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary
5
Example: binary classification
• Given: Linearly separable data
How to construct linear decision boundary?
6
Linear Discriminant Analysis
LDA solution
Separation margin
7
Perceptron (linear NN)
• Perceptron solutions and separation margin
8
Largest-margin solution
• All solutions explain the data well (zero error)
• All solutions ~ the same linear parameterization
• Larger margin ~ more confidence (falsifiability)
Margin M = 2Δ
9
Complexity of Δ-margin hyperplanes
• If data samples belong to a sphere of radius R, then the set of Δ-margin hyperplanes has VC dimension bounded by
  h ≤ min(R²/Δ², d) + 1
• For large-margin hyperplanes, the VC-dimension is controlled independently of dimensionality d.
10
Motivation: philosophical
• Classical view: good model explains the data + low complexity
  → Occam’s razor (complexity ~ # parameters)
• VC theory: good model explains the data + low VC-dimension
  ~ VC-falsifiability: good model explains the data + has large falsifiability
The idea: falsifiability ~ empirical loss function
11
Adaptive loss functions
• Both goals (explanation + falsifiability) can be encoded into an empirical loss function where
  - a (large) portion of the data has zero loss
  - the rest of the data has non-zero loss, i.e. it falsifies the model
• The trade-off (between the two goals) is adaptively controlled → adaptive loss function
• Examples of such loss functions for different learning problems are shown next
12
Margin-based loss for classification
Margin = 2Δ
L_Δ(y, f(x, ω)) = max(Δ − y·f(x, ω), 0)
13
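As a concrete illustration of the Δ-margin loss above, here is a minimal Python sketch (NumPy assumed; Δ and the sample values are arbitrary choices for illustration):

```python
import numpy as np

def margin_loss(y, f, delta=1.0):
    """Delta-margin loss: zero for samples with y*f >= delta, linear otherwise."""
    return np.maximum(delta - y * f, 0.0)

# Example: labels and decision-function values for a few samples.
y = np.array([+1, +1, -1, -1])
f = np.array([2.0, 0.3, -1.5, 0.4])   # last sample is misclassified
print(margin_loss(y, f))              # [0.  0.7  0.  1.4]
```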
Margin-based loss for classification:
margin is adapted to training data
[Figure: margin-based loss as a function of y·f(x, ω), with the margin band between the Class +1 and Class −1 borders]
L_Δ(y, f(x, ω)) = max(Δ − y·f(x, ω), 0)
Epsilon loss for regression
L ( y, f (x, ))  max| y  f (x, ) |  ,0
15
Parameter epsilon is adapted to training data
Example: linear regression y = x + noise
where noise = N(0, 0.36), x ~ [0,1], 4 samples
Compare: squared, linear and SVM loss (eps = 0.6)
[Plot: the four training samples and the compared fits; x ∈ [0, 1], y ∈ [−1, 2]]
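A sketch in the spirit of this example (the exact noise realization and sample locations are assumptions, and SciPy/NumPy are assumed dependencies): it fits y = w·x + b by minimizing squared, absolute, and ε-insensitive loss with ε = 0.6 and prints the resulting coefficients.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)                 # assumed seed
x = rng.uniform(0, 1, 4)                       # 4 samples, x ~ [0, 1]
y = x + rng.normal(0, 0.6, 4)                  # noise std 0.6 (variance 0.36)

def fit(loss):
    """Fit y = w*x + b by minimizing the given pointwise loss."""
    obj = lambda p: np.sum(loss(y - (p[0] * x + p[1])))
    return minimize(obj, x0=[0.0, 0.0], method="Nelder-Mead").x

eps = 0.6
squared = lambda r: r ** 2
absolute = lambda r: np.abs(r)
eps_insensitive = lambda r: np.maximum(np.abs(r) - eps, 0.0)

for name, loss in [("squared", squared), ("absolute", absolute),
                   ("eps-insensitive", eps_insensitive)]:
    w, b = fit(loss)
    print(f"{name:16s} w = {w:+.2f}, b = {b:+.2f}")
```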
OUTLINE
• Margin-based loss
• SVM for classification
- Linear SVM classifier
- Inner product kernels
- Nonlinear SVM classifier
• SVM examples
• Support Vector regression
• Summary
17
SVM Loss for Classification
The continuous quantity y·f(x, w) measures how close a sample x is to the decision boundary
18
Optimal Separating Hyperplane
Distance between the hyperplane and a sample x′:  |f(x′)| / ‖w‖
⇒ Margin Δ = 1/‖w‖
Shaded points are SVs
19
Linear SVM Optimization Formulation
(for separable data)
• Given training data (x_i, y_i), i = 1, ..., n
• Find parameters w, b of the linear hyperplane f(x) = (w · x) + b
  that minimize Φ(w) = 0.5 ‖w‖²
  under constraints y_i [(w · x_i) + b] ≥ 1
• Quadratic optimization with linear constraints, tractable for moderate dimensions d
• For large dimensions use the dual formulation:
  - scales with sample size (n) rather than d
  - uses only dot products (x_i · x_j)
20
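A minimal sketch of this primal formulation using scikit-learn (an assumed dependency; the toy data are made up). A very large C approximates the hard-margin, separable case described on this slide:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable 2-D data set (made up for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

# Large C ~ hard margin: slack is effectively forbidden.
svm = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = svm.coef_[0], svm.intercept_[0]
print("w =", w, "b =", b)
print("margin = 1/||w|| =", 1.0 / np.linalg.norm(w))
print("support vectors:\n", svm.support_vectors_)
```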
Classification for non-separable data
Slack variables:  ξ = Δ − y·f(x, ω)
L_Δ(y, f(x, ω)) = max(Δ − y·f(x, ω), 0)
21
SVM for non-separable data
[Figure: f(x) = (w · x) + b with margin borders f(x) = +1, f(x) = 0, f(x) = −1; slack values ξ₁ = 1 − f(x₁), ξ₂ = 1 − f(x₂), ξ₃ = 1 + f(x₃) for the margin-violating samples x₁, x₂, x₃]
Minimize  C Σ_{i=1}^{n} ξ_i + (1/2) ‖w‖² → min
under constraints  y_i [(w · x_i) + b] ≥ 1 − ξ_i,  ξ_i ≥ 0
22
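Under the soft-margin formulation above, each sample's slack is ξ_i = max(0, 1 − y_i f(x_i)). A sketch recovering the slacks from a fitted scikit-learn model (assumed dependency; the overlapping Gaussian data are made up):

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping classes, so some samples must violate the margin.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(1.5, 1, (30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])

svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Slack variables xi_i = max(0, 1 - y_i * f(x_i)); nonzero slack <=> margin violation.
f = svm.decision_function(X)
xi = np.maximum(0.0, 1.0 - y * f)
print("samples with nonzero slack:", int(np.sum(xi > 1e-8)))
print("objective term C * sum(xi) =", 1.0 * xi.sum())
```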
SVM Dual Formulation
• Given training data (x_i, y_i), i = 1, ..., n
• Find parameters α_i*, b* of an optimal hyperplane as a solution to the maximization problem
  L(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j (x_i · x_j) → max
  under constraints  Σ_{i=1}^{n} y_i α_i = 0,  0 ≤ α_i ≤ C
• Solution:  f(x) = Σ_{i=1}^{n} α_i* y_i (x · x_i) + b*
  where samples with nonzero α_i* are SVs
• Needs only inner products (x · x′)
23
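To connect the dual solution to code: scikit-learn's SVC exposes the products α_i* y_i as dual_coef_, the SVs as support_vectors_, and b* as intercept_, so the decision function can be rebuilt by hand. A sketch (scikit-learn assumed; toy data made up):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (25, 2)), rng.normal(1, 1, (25, 2))])
y = np.hstack([-np.ones(25), np.ones(25)])

svm = SVC(kernel="linear", C=1.0).fit(X, y)

# f(x) = sum_i alpha_i* y_i <x, x_i> + b*, summing only over support vectors.
alpha_y = svm.dual_coef_[0]          # alpha_i* * y_i for each SV
sv = svm.support_vectors_
b = svm.intercept_[0]

x_new = np.array([0.3, -0.2])
f_manual = np.dot(alpha_y, sv @ x_new) + b
print(f_manual, svm.decision_function([x_new])[0])   # the two values agree
```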
Nonlinear Decision Boundary
• Fixed (linear) parameterization is too rigid
• Nonlinear curved margin may yield larger margin
(falsifiability) and lower error
24
Nonlinear Mapping via Kernels
Nonlinear f(x,w) + margin-based loss = SVM
• Nonlinear mapping to feature space Z, i.e.
  x ~ (x₁, x₂) → z ~ (1, x₁, x₂, x₁x₂, x₁², x₂²)
• Linear in z-space ~ nonlinear in x-space
• BUT (z · z′) = H(x, x′) ~ kernel trick
  ⇒ compute the dot product via the kernel analytically
[Diagram: x → g(x) → z → (w·z) → ŷ]
25
SVM Formulation (with kernels)
• Given: the training data (x_i, y_i), i = 1, ..., n, an inner product kernel H(x, x′), and regularization parameter C
• Replacing (z · z′) = H(x, x′) leads to:
  Find parameters α_i*, b* of an optimal hyperplane
  D(x) = Σ_{i=1}^{n} α_i* y_i H(x_i, x) + b*
  as a solution to the maximization problem
  L(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j H(x_i, x_j) → max
  under constraints  Σ_{i=1}^{n} y_i α_i = 0,  0 ≤ α_i ≤ C
26
Examples of Kernels
Kernel H(x, x′) is a symmetric function satisfying general math conditions (Mercer’s conditions)
Examples of kernels for different mappings x → z:
• Polynomials of degree q:  H(x, x′) = ((x · x′) + 1)^q
• RBF kernel:  H(x, x′) = exp(−‖x − x′‖² / γ²)
• Neural networks:  H(x, x′) = tanh(v (x · x′) + a)  for given parameters v, a
  Automatic selection of the number of hidden units (SV’s)
27
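A sketch of the three kernels written as plain functions (NumPy assumed; the parameter values below are arbitrary, and the RBF width parameterization follows the form reconstructed above):

```python
import numpy as np

def poly_kernel(x, x2, q=2):
    """Polynomial kernel of degree q: ((x . x') + 1)^q."""
    return (np.dot(x, x2) + 1.0) ** q

def rbf_kernel(x, x2, gamma=1.0):
    """RBF kernel: exp(-||x - x'||^2 / gamma^2)."""
    return np.exp(-np.sum((x - x2) ** 2) / gamma ** 2)

def nn_kernel(x, x2, v=1.0, a=-1.0):
    """'Neural network' (sigmoid) kernel: tanh(v (x . x') + a)."""
    return np.tanh(v * np.dot(x, x2) + a)

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, xp), rbf_kernel(x, xp), nn_kernel(x, xp))
```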
More on Kernels
• The kernel matrix has all info (data + kernel)
H(1,1) H(1,2)…….H(1,n)
H(2,1) H(2,2)…….H(2,n)
………………………….
H(n,1) H(n,2)…….H(n,n)
• Kernel defines a distance in some feature space (aka kernel-induced feature space)
• Kernels can incorporate a priori knowledge
• Kernels can be defined over complex structures (trees, sequences, sets, etc.)
28
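Since the kernel matrix carries all the information the optimizer needs, an SVM can be trained directly from a precomputed Gram matrix. A minimal scikit-learn sketch (assumed dependency; synthetic data made up):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X_train = rng.normal(size=(40, 2))
y_train = np.sign(X_train[:, 0] + X_train[:, 1])
X_test = rng.normal(size=(5, 2))

# Gram matrix H(i, j) = H(x_i, x_j) over the training data.
K_train = rbf_kernel(X_train, X_train, gamma=0.5)
svm = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)

# Prediction needs kernel values between test and training samples.
K_test = rbf_kernel(X_test, X_train, gamma=0.5)
print(svm.predict(K_test))
```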
Support Vectors
• SV’s ~ training samples with non-zero loss
• SV’s are samples that falsify the model
• The model depends only on SVs
 SV’s ~ robust characterization of the data
WSJ Feb 27, 2004:
About 40% of us (Americans) will vote for a Democrat, even if the
candidate is Genghis Khan. About 40% will vote for a Republican,
even if the candidate is Attila the Hun. This means that the election
is left in the hands of one-fifth of the voters.
• SVM Generalization ~ data compression
29
New insights provided by SVM
• Why can linear classifiers generalize?
  h ≤ min(R²/Δ², d) + 1
(1) Margin is large (relative to R)
(2) % of SV’s is small
(3) ratio d/n is small
• SVM offers an effective way to control
complexity (via margin + kernel selection)
i.e. implementing (1) or (2) or both
• Requires common-sense parameter tuning
30
OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary
31
Ripley’s data set
• 250 training samples, 1,000 test samples
• SVM using RBF kernel K(u, v) = exp(−‖u − v‖² / γ²)
• Model selection via 10-fold cross-validation
32
Ripley’s data set: SVM model
• Decision boundary and margin borders
• SV’s are circled
[Plot: decision boundary and margin borders in the (x1, x2) plane; x1 ∈ [−1.5, 1], x2 ∈ [−0.2, 1.2]]
33
Ripley’s data set: model selection
• SVM tuning parameters: C, γ
• Select optimal parameter values via 10-fold cross-validation
• Results of cross-validation are summarized below:
          C=0.1    C=1      C=10     C=100    C=1000   C=10000
γ=2^-3    98.4%    23.6%    18.8%    20.4%    18.4%    14.4%
γ=2^-2    51.6%    22%      20%      20%      16%      14%
γ=2^-1    33.2%    19.6%    18.8%    15.6%    13.6%    14.8%
γ=2^0     28%      18%      16.4%    14%      12.8%    15.6%
γ=2^1     20.8%    16.4%    14%      12.8%    16%      17.2%
γ=2^2     19.2%    14.4%    13.6%    15.6%    15.6%    16%
γ=2^3     15.6%    14%      15.6%    16.4%    18.4%    18.4%
34
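A sketch of the same kind of grid search over (C, γ) with 10-fold cross-validation using scikit-learn (assumed dependency; a synthetic two-class set stands in for Ripley's data, and note that scikit-learn's gamma multiplies ‖u − v‖² directly, so it is not numerically identical to the width parameter in the table above):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Synthetic stand-in for the 250-sample, 2-D training set.
X, y = make_classification(n_samples=250, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100, 1000, 10000],
              "gamma": [2.0 ** k for k in range(-3, 4)]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)

print("best (C, gamma):", search.best_params_)
print("cross-validation error: %.1f%%" % (100 * (1 - search.best_score_)))
```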
Noisy Hyperbolas data set
• This example shows application of different kernels
• Note: decision boundaries are quite different
[Plots: decision boundaries for the RBF kernel (left) and polynomial kernel (right); both axes span roughly [0.1, 1]]
35
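A short sketch contrasting the two kernels on a curved two-class problem (scikit-learn assumed; make_moons is used as a stand-in for the Noisy Hyperbolas data, which is an assumption):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel, params in [("rbf", {"gamma": 1.0}), ("poly", {"degree": 3})]:
    svm = SVC(kernel=kernel, C=1.0, **params).fit(X, y)
    print(f"{kernel:5s} kernel: {len(svm.support_)} SVs, "
          f"training accuracy {svm.score(X, y):.2f}")
```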
Many challenging applications
• Mimic human recognition capabilities
  - high-dimensional data
  - content-based
  - context-dependent
• Example: read the sentence
  Sceitnitss osbevred: it is nt inptrant how lteters are msspled isnide the word. It is ipmoratnt that the fisrt and lsat letetrs do not chngae, tehn the txet is itneprted corrcetly
• SVM is suitable for sparse high-dimensional data
36
Example SVM Applications
• Handwritten digit recognition
• Genomics
• Face detection in unrestricted
images
• Text/ document classification
• Image classification and retrieval
• …….
37
Handwritten Digit Recognition (mid-90’s)
• Data set:
postal images (zip-code), segmented, cropped;
~ 7K training samples, and 2K test samples
• Data encoding:
16x16 pixel image  256-dim. vector
• Original motivation: Compare SVM with custom
MLP network (LeNet) designed for this application
• Multi-class problem: one-vs-all approach
  → 10 SVM classifiers (one per digit)
38
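A sketch of the one-vs-all setup using scikit-learn's bundled digits data (assumed dependency; these are 8×8 images rather than the 16×16 postal images described above, so the numbers will differ):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

digits = load_digits()                          # 8x8 images -> 64-dim vectors
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data / 16.0, digits.target, test_size=0.25, random_state=0)

# One binary SVM classifier per digit (one-vs-all).
clf = OneVsRestClassifier(SVC(kernel="poly", degree=3, C=10.0)).fit(X_tr, y_tr)
print("test error: %.1f%%" % (100 * (1 - clf.score(X_te, y_te))))
```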
Digit Recognition Results
• Summary
- prediction accuracy better than custom NN’s
- accuracy does not depend on the kernel type
- 100 – 400 support vectors per class (digit)
• More details
Type of kernel    No. of Support Vectors    Error %
Polynomial        274                       4.0
RBF               291                       4.1
Neural Network    254                       4.2
• ~ 80-90% of SV’s coincide (for different kernels)
39
Document Classification (Joachims, 1998)
• The Problem: Classification of text documents in
large databases, for text indexing and retrieval
• Traditional approach: human categorization (i.e. via
feature selection) – relies on a good indexing scheme.
This is time-consuming and costly
• Predictive Learning Approach (SVM): construct a
classifier using all possible features (words)
• Document/ Text Representation:
individual words = input features (possibly weighted)
• SVM performance:
– Very promising (~ 90% accuracy vs 80% by other classifiers)
– Most problems are linearly separable → use linear SVM
40
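A minimal sketch of the "all words as features" approach with a linear SVM (scikit-learn assumed; the toy corpus and labels are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["stock markets fell sharply today",
        "the team won the championship game",
        "investors worry about interest rates",
        "the striker scored twice in the match"]
labels = ["finance", "sports", "finance", "sports"]

# Every word becomes a (weighted) feature; the classifier is a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["rates and markets moved today"]))
```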
OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary
41
Linear SVM regression
Assume linear parameterization  f(x, ω) = (w · x) + b
[Figure: ε-insensitive tube around the regression line in the (x, y) plane, with slack variables ξ₁ and ξ₂* for samples above and below the tube]
L_ε(y, f(x, ω)) = max(|y − f(x, ω)| − ε, 0)
42
Direct Optimization Formulation
Given training data (x_i, y_i), i = 1, ..., n
[Figure: ε-tube with slack variables ξ₁, ξ₂*]
Minimize  (1/2)(w · w) + C Σ_{i=1}^{n} (ξ_i + ξ_i*)
under constraints
  y_i − (w · x_i) − b ≤ ε + ξ_i
  (w · x_i) + b − y_i ≤ ε + ξ_i*
  ξ_i, ξ_i* ≥ 0,  i = 1, ..., n
43
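A sketch of ε-insensitive regression via scikit-learn's SVR (assumed dependency; the noisy linear data are made up). The C and epsilon arguments map directly onto the formulation above:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = x.ravel() + rng.normal(0, 0.1, 40)

svr = SVR(kernel="linear", C=10.0, epsilon=0.2).fit(x, y)
print("w =", svr.coef_[0], "b =", svr.intercept_[0])
print("support vectors (outside the eps-tube):", len(svr.support_))
```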
Example:
SVM regression using RBF kernel
[Plot: training samples and the SVM regression estimate, y vs. x on [0, 1]]
The SVM estimate is shown as a dashed line
The SVM model uses only 5 SV’s (out of the 40 points)
44
 xc 2 

j 
f  x, w    w j exp 
2 
j 1
  0.20  
m
RBF regression model
4
3
2
y
1
0
-1
-2
-3
-4
0
0.1
0.2
0.3
0.4
0.5
x
0.6
0.7
0.8
0.9
1
Weighted sum of 5 RBF kernels gives the SVM model
45
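The same "weighted sum of RBF kernels over the SVs" structure can be checked in code: the manual sum below reproduces the fitted model's prediction (scikit-learn assumed; the data and kernel width are assumptions made for illustration):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 40)

gamma = 1.0 / (2 * 0.20 ** 2)                 # width ~ 0.20, as in the formula above
svr = SVR(kernel="rbf", gamma=gamma, C=10.0, epsilon=0.1).fit(x, y)

# f(x) = sum_j w_j * exp(-gamma * (x - c_j)^2) + b, summed over support vectors c_j.
x_new = np.array([[0.35]])
rbf = np.exp(-gamma * (x_new - svr.support_vectors_.T) ** 2).ravel()
f_manual = np.dot(svr.dual_coef_[0], rbf) + svr.intercept_[0]
print(f_manual, svr.predict(x_new)[0])        # the two values agree
```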
Summary
• Margin-based loss: robust + performs complexity control
• Nonlinear feature selection (~ SV’s): performed automatically
• Tractable model selection – easier than most nonlinear methods
• SVM is not a magic bullet solution
- similar to other methods when n >> h
- SVM is better when n << h or n ~ h
46