Non-linear SVMs & libSVM
Advanced Statistical Methods in NLP
Ling 572
February 23, 2012
Roadmap
Non-linear SVMs:
Motivation: Non-linear data
The kernel trick
Linear → non-linear SVM models
LibSVM:
svm-train & svm-predict
Models
HW #8
Non-Linear SVMs
Problem:
Sometimes data really isn't linearly separable
Approach:
Map data non-linearly into higher dimensional space
Data is separable in the higher dimensional space
(Figure from Hearst et al. '98)
Feature Space
Basic approach:
Original data is not linearly separable
Map data into 'feature space'
Higher dimensional dot product space
Mapping via non-linear map Φ
Compute separating hyperplane
In higher dimensional space
Issues with Feature Space
Mapping idea is simple,
But has some practical problems
Feature space can be very high – infinite? – dimensional
Approach depends on computing similarity (dot product)
Computationally expensive
Approach depends on mapping:
Also possibly intractable to compute
Solution
"Kernel trick":
Use a kernel function $K: X \times X \to \mathbb{R}$:
$K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$
Computes similarity measure on images of data points
Can often compute similarity efficiently even in high (or infinite) dimensional spaces
Choice of K equivalent to selection of Φ
Example (Russell & Norvig)
Original 2-D data: $x = (x_1, x_2)$
Mapping to new values in 3-D feature space via $\Phi(x) = (f_1, f_2, f_3)$:
$f_1 = x_1^2;\ f_2 = x_2^2;\ f_3 = \sqrt{2}\, x_1 x_2$
For $x = (1, 2)$ and $z = (-2, 3)$:
$\Phi(x) = (1, 4, 2\sqrt{2});\ \Phi(z) = (4, 9, -6\sqrt{2})$
$K(x, z) = \langle \Phi(x), \Phi(z) \rangle = 1 \cdot 4 + 4 \cdot 9 + 2\sqrt{2} \cdot (-6\sqrt{2}) = 4 + 36 - 24 = 16$
Example (cont'd)
More generally:
$x = (x_1, x_2);\ z = (z_1, z_2)$
$\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2);\ \Phi(z) = (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2)$
$\langle \Phi(x), \Phi(z) \rangle = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1 z_1 + x_2 z_2)^2 = \langle x, z \rangle^2$
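To make the identity concrete, here is a minimal Python sketch (illustrative, not from the slides) checking numerically that the squared dot product equals the dot product in the mapped feature space:

    import math

    def phi(v):
        """Map a 2-D point to (x1^2, x2^2, sqrt(2)*x1*x2)."""
        x1, x2 = v
        return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

    def dot(a, b):
        """Plain dot product."""
        return sum(ai * bi for ai, bi in zip(a, b))

    x, z = (1, 2), (-2, 3)
    explicit = dot(phi(x), phi(z))  # <Phi(x), Phi(z)>, computed in feature space
    via_kernel = dot(x, z) ** 2     # kernel trick: <x, z>^2, no mapping needed
    print(explicit, via_kernel)     # both print 16 (up to float rounding)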
Kernel Trick: Summary
Avoids explicit mapping to high-dimensional space
Avoids explicit computation of inner products in feature space
Avoids explicit computation of the mapping function
Or even of the feature vectors
Replace all inner products in SVM training/testing with K
Non-Linear SVM Training
Linear version:
Maximize
$\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
subject to
$\alpha_i \ge 0;\quad \sum_i \alpha_i y_i = 0$
Non-linear version:
Maximize
$\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to the same constraints
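Written out in code, the kernelized objective is just the linear one with $\langle x_i, x_j \rangle$ replaced by $K(x_i, x_j)$; a minimal Python sketch (illustrative only; real training hands this objective to a QP optimizer):

    def dual_objective(alphas, ys, xs, kernel):
        """Kernelized dual objective:
        sum_i alpha_i - 1/2 * sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)."""
        n = len(xs)
        total = sum(alphas)
        for i in range(n):
            for j in range(n):
                total -= 0.5 * alphas[i] * alphas[j] * ys[i] * ys[j] * kernel(xs[i], xs[j])
        return total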
Decoding
Linear SVM:
$f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$
Non-linear SVM:
$f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$
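A sketch of the non-linear decision rule in Python (hypothetical helper; assumes the support vectors, their labels, the learned alphas, and b are already available):

    def svm_decode(x, support_vectors, labels, alphas, b, kernel):
        """Non-linear SVM decision: f(x) = sum_i alpha_i * y_i * K(x_i, x) + b.
        Returns the predicted class, +1 or -1."""
        score = b
        for x_i, y_i, a_i in zip(support_vectors, labels, alphas):
            score += a_i * y_i * kernel(x_i, x)
        return 1 if score > 0 else -1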
Common Kernel Functions
Implemented in most packages:
Linear: $K(x, z) = \langle x, z \rangle$
Polynomial: $K(x, z) = (\gamma \langle x, z \rangle + c)^d$
Radial Basis Function (RBF): $K(x, z) = e^{-\gamma \|x - z\|^2}$
Sigmoid: $K(x, z) = \tanh(\gamma \langle x, z \rangle + c)$, where $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

Here, for $x = (x_1, \ldots, x_n)$ and $z = (z_1, \ldots, z_n)$:
$x - z = (x_1 - z_1, \ldots, x_n - z_n)$
$\|x - z\|^2 = (x_1 - z_1)^2 + \ldots + (x_n - z_n)^2$
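These kernels are simple to implement directly; a minimal Python sketch (illustrative, not libSVM's actual code; the parameter defaults are arbitrary):

    import math

    def dot(x, z):
        return sum(xi * zi for xi, zi in zip(x, z))

    def linear_kernel(x, z):
        return dot(x, z)

    def polynomial_kernel(x, z, gamma=1.0, c=0.0, d=3):
        return (gamma * dot(x, z) + c) ** d

    def rbf_kernel(x, z, gamma=1.0):
        sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))  # ||x - z||^2
        return math.exp(-gamma * sq_dist)

    def sigmoid_kernel(x, z, gamma=1.0, c=0.0):
        return math.tanh(gamma * dot(x, z) + c)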
Kernels
Many conceivable kernels:
A function is a valid kernel if it obeys Mercer's theorem:
It is symmetric and continuous, and its Gram matrix is positive semi-definite
Selection of kernel can have huge impact
Dramatic differences in accuracy
Knowledge about the 'shape' of the data can help select
Ironically, linear SVMs perform well on many tasks
Summary
Find the decision hyperplane that maximizes the margin
Employ soft margins to support noisy data
For non-linearly separable data, use non-linear SVMs
Project to higher dimensional space to separate
Use the kernel trick to avoid intractable computation of the projection or of inner products
MaxEnt vs SVM
Modeling: MaxEnt maximizes P(Y|X,λ); SVM maximizes the margin
Training: MaxEnt learns λi for each feature function; SVM learns αi for each training instance
Decoding: MaxEnt calculates P(y|x); SVM calculates sign of f(x)
Things to decide: for MaxEnt, features, regularization, training algorithm; for SVM, kernel, regularization, training algorithm, binarization
Due to F. Xia
LibSVM
Well-known SVM package
Chang & Lin; most recent version 2011
Supports:
Many different SVM variants
Range of kernels
Efficient multi-class implementation
Range of language interfaces, as well as a CLI
Frameworks for tuning: cross-validation, grid search
Why libSVM? SVMs are not included in Mallet
libSVM Information
Main website:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Documentation & examples
FAQ
Installation on patas:
/NLP_TOOLS/ml_tools/svm/libsvm/latest/
Calling either tool with no arguments prints its usage message
libSVM
Main components:
svm-train:
Trains model on provided data
svm-predict:
Applies trained model to test data
svm-train
Usage:
svm-train [options] training_data model_file
Options:
-t: kernel type [0-3]
-d: degree – used by the polynomial kernel
-g: gamma – used by the polynomial, RBF, and sigmoid kernels
-r: coef0 – used by the polynomial and sigmoid kernels
and many others
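For example, training with a degree-2 polynomial kernel might look like this (the file names are placeholders):

    svm-train -t 1 -d 2 -g 1 -r 1 train.data poly2.model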
LibSVM Kernels
Predefined libSVM kernels
Set with the -t switch (default: 2)
0 -- linear: u'*v
1 -- polynomial: (gamma*u'*v + coef0)^degree
2 -- radial basis function: exp(-gamma*|u-v|^2)
3 -- sigmoid: tanh(gamma*u'*v + coef0)
From the libSVM docs
svm-predict
Usage:
svm-predict testing_file model_file output_file
Prints results to output_file
Accuracy printed to stderr
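A matching invocation (placeholder file names, continuing the training example above):

    svm-predict test.data poly2.model sys.out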
HW #8: Implement an SVM decoder using a trained SVM model
Training File Format
Training file:
Sequence of lines:
Each line represents a training instance
Mallet format:
instanceID classLabel f1 v1 …
LibSVM format:
classNumber featidx1:v1 featidx2:v2 …
Feature indices and values are numeric
Sparse: omit an idx:value pair if the value is 0
Pairs appear in increasing order by index
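For example, an instance of class 1 whose features 2, 7, and 31 have nonzero values could be encoded as (made-up indices and values):

    1 2:1 7:2 31:1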
Model File Format Example

svm_type c_svc
kernel_type linear
nr_class 2
total_sv 535
rho 0.281122
label 0 1
nr_sv 272 263
SV
0.004437478408154137 0:1 1:1 2:1 3:1 4:1 5:1 6:1 7:1

The lines before SV are the model parameters; rho corresponds to -b. Each line after SV gives a weight ($\alpha_i y_i$) followed by the corresponding support vector.
Due to F. Xia
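A sketch of pulling the pieces a decoder needs out of such a model file (hypothetical parser; assumes the two-class layout shown above, where a single rho precedes the SV section):

    def parse_model(path):
        """Parse a two-class libSVM model file into (rho, weights, support_vectors).
        Each support vector is a dict mapping feature index -> value."""
        rho, weights, svs = 0.0, [], []
        in_sv = False
        with open(path) as f:
            for line in f:
                parts = line.split()
                if not parts:
                    continue
                if in_sv:
                    weights.append(float(parts[0]))
                    svs.append({int(i): float(v)
                                for i, v in (p.split(":") for p in parts[1:])})
                elif parts[0] == "rho":
                    rho = float(parts[1])
                elif parts[0] == "SV":
                    in_sv = True
        return rho, weights, svs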
Applying the Model File
Classification:
$f(x) = \mathrm{sign}(\langle w, x \rangle + b)$
In papers:
$f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$
where $y_i$ is the class of $x_i$, with c0 = +1 and c1 = -1
In libSVM:
$f(x) = \sum_i \mathrm{weight}_i K(x_i, x) - \rho$
If f(x) > 0, label with class c0; else c1
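A minimal Python sketch of this rule (hypothetical helper; pairs naturally with a model-file parser like the one sketched above):

    def libsvm_decode(x, support_vectors, weights, rho, kernel):
        """libSVM decision rule: f(x) = sum_i weight_i * K(x_i, x) - rho.
        weight_i already encodes alpha_i * y_i; returns class 0 (c0) or 1 (c1)."""
        fx = sum(w * kernel(sv, x) for sv, w in zip(support_vectors, weights)) - rho
        return 0 if fx > 0 else 1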
Output File Format
Output format:
Sequence of lines
One label per line, corresponding to the input instances
Binary classification: classes 0/1 (not -1/+1)
Example output (0 = class c0, 1 = class c1):
0
0
1
1
0
Notation Mapping
Model: papers use $\alpha_i$, $y_i$, $x_i$, $b$; libSVM uses $\mathrm{weight}_i$, $x_i$, $\rho$
Decoding: papers: $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$; libSVM: $f(x) = \sum_i \mathrm{weight}_i K(x_i, x) - \rho$
Labels: papers use +1/-1; libSVM uses 0/1
Due to F. Xia
HW #8
Run libSVM on a binary text classification task
Binary in both senses:
2 classes
Feature values are 0/1
Build an SVM decoder using the model from step 1
Implement different kernel functions