Non-linear SVMs & libSVM
Advanced Statistical Methods in NLP
Ling 572
February 23, 2012
Roadmap
- Non-linear SVMs:
  - Motivation: non-linear data
  - The kernel trick
  - Linear → non-linear SVM models
- LibSVM:
  - svm-train & svm-predict
  - Models
- HW #8
Non-Linear SVMs
- Problem:
  - Sometimes data really isn't linearly separable
- Approach:
  - Map the data non-linearly into a higher-dimensional space
  - The data is separable in the higher-dimensional space
(Figure from Hearst et al. '98)
Feature Space
- Basic approach:
  - Original data is not linearly separable
  - Map the data into 'feature space'
    - A higher-dimensional dot-product space
    - Mapping via a non-linear map Φ
  - Compute the separating hyperplane in the higher-dimensional space
Issues with Feature Space
- The mapping idea is simple, but it has some practical problems:
  - Feature space can be very high-dimensional (possibly infinite-dimensional)
  - The approach depends on computing similarity (dot products), which is computationally expensive
  - The approach also depends on the mapping Φ, which may itself be intractable to compute
Solution
- "Kernel trick":
  - Use a kernel function K: X × X → R
    K(x_i, x_j) = <Φ(x_i), Φ(x_j)>
  - Computes a similarity measure on the images of the data points
  - Can often be computed efficiently, even when the feature space is high- (or infinite-) dimensional
  - Choice of K is equivalent to selection of Φ
Example (Russell & Norvig)
Example (cont'd)
- Original 2-D data: x = (x1, x2)
- Mapping to new values in the 3-D feature space f(x):
  - f1 = x1²; f2 = x2²; f3 = √2·x1·x2
- Worked example:
  x = (1, 2); z = (-2, 3)
  f(x) = (1, 4, 2√2); f(z) = (4, 9, -6√2)
  K(x, z) = <f(x), f(z)>
          = 1·4 + 4·9 + (2√2)·(-6√2)
          = 16
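As a quick sanity check of this arithmetic, here is a minimal Python sketch (not part of the original slides; the function names are illustrative) comparing the dot product in the mapped 3-D space with the squared dot product in the original 2-D space:

    import math

    def phi(p):
        # Map a 2-D point (x1, x2) to the 3-D feature space (x1^2, x2^2, sqrt(2)*x1*x2)
        x1, x2 = p
        return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    x = (1, 2)
    z = (-2, 3)
    print(dot(phi(x), phi(z)))   # ~16 (up to floating-point error)
    print(dot(x, z) ** 2)        # 16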
Example (cont'd)
- More generally:
  x = (x1, x2); z = (z1, z2)
  f(x) = (x1², x2², √2·x1·x2); f(z) = (z1², z2², √2·z1·z2)
  <f(x), f(z)> = x1²·z1² + x2²·z2² + 2·x1·x2·z1·z2
              = (x1·z1 + x2·z2)²
              = <x, z>²
Kernel Trick: Summary
- Avoids explicit mapping to the high-dimensional space
- Avoids explicit computation of the inner product in feature space
- Avoids explicit computation of the mapping function, or even the feature vectors
- Replace all inner products in SVM training/testing with K
Non-Linear SVM Training
- Linear version:
  - Maximize: Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j <x_i, x_j>
  - subject to: α_i ≥ 0; Σ_i α_i y_i = 0
- Non-linear version:
  - Maximize: Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
Decoding
- Linear SVM:
  f(x) = Σ_i α_i y_i <x_i, x> + b
- Non-linear SVM (a small code sketch follows below):
  f(x) = Σ_i α_i y_i K(x_i, x) + b
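To make the non-linear decoding rule concrete, here is a minimal Python sketch (an illustration under assumed inputs, not libSVM's implementation); support_vectors, alphas, labels, the kernel function, and b are assumed to come from a trained model:

    def svm_decision(x, support_vectors, alphas, labels, kernel, b):
        # Non-linear SVM decoding: f(x) = sum_i alpha_i * y_i * K(x_i, x) + b
        score = b
        for x_i, a_i, y_i in zip(support_vectors, alphas, labels):
            score += a_i * y_i * kernel(x_i, x)
        return score   # predict +1 if the score is > 0, else -1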
Common Kernel Functions
- Implemented in most packages (a Python sketch follows below):
  - Linear: K(x, z) = <x, z>
  - Polynomial: K(x, z) = (γ<x, z> + c)^d
  - Radial basis function (RBF): K(x, z) = exp(-γ ||x - z||²)
  - Sigmoid: K(x, z) = tanh(γ<x, z> + c),
    where tanh(x) = (e^x - e^-x) / (e^x + e^-x)
- Notation: for x = (x1, ..., xn) and z = (z1, ..., zn):
  x - z = (x1 - z1, ..., xn - zn)
  ||x - z||² = (x1 - z1)² + ... + (xn - zn)²
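The following Python sketch (illustrative only, not taken from libSVM; the default parameter values are placeholders) implements these four kernels directly from the formulas above:

    import math

    def dot(x, z):
        return sum(xi * zi for xi, zi in zip(x, z))

    def linear_kernel(x, z):
        return dot(x, z)

    def polynomial_kernel(x, z, gamma=1.0, coef0=0.0, degree=3):
        return (gamma * dot(x, z) + coef0) ** degree

    def rbf_kernel(x, z, gamma=1.0):
        # exp(-gamma * ||x - z||^2)
        sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
        return math.exp(-gamma * sq_dist)

    def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
        return math.tanh(gamma * dot(x, z) + coef0)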
Kernels
- Many conceivable kernels:
  - A function is a valid kernel if it satisfies Mercer's condition:
    - It is symmetric and continuous, and its Gram matrix is positive semi-definite
- Selection of the kernel can have a huge impact
  - Dramatic differences in accuracy
- Knowledge about the 'shape' of the data can help in selecting a kernel
- Ironically, linear SVMs perform well on many tasks
Summary
- Find the decision hyperplane that maximizes the margin
- Employ a soft margin to support noisy data
- For non-linearly separable data, use non-linear SVMs
  - Project to a higher-dimensional space to separate the data
  - Use the kernel trick to avoid intractable computation of the projection or the inner products
MaxEnt vs. SVM
- Modeling:
  - MaxEnt: maximize P(Y|X, λ)
  - SVM: maximize the margin
- Training:
  - MaxEnt: learn λ_i for each feature function
  - SVM: learn α_i for each training instance
- Decoding:
  - MaxEnt: calculate P(y|x)
  - SVM: calculate sign(f(x))
- Things to decide:
  - MaxEnt: features, regularization, training algorithm
  - SVM: kernel, regularization, training algorithm, binarization
(Due to F. Xia)
LibSVM
- Well-known SVM package
  - Chang & Lin; most recent version 2011
- Supports:
  - Many different SVM variants
  - A range of kernels
  - An efficient multi-class implementation
  - A range of language interfaces, as well as a command-line interface
  - Frameworks for tuning: cross-validation, grid search
- Why libSVM? SVM is not in Mallet
libSVM Information
- Main website:
  - http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  - Documentation & examples
  - FAQ
- Installation on patas:
  - /NLP_TOOLS/ml_tools/svm/libsvm/latest/
  - Running a tool with no arguments prints its usage message
libSVM
- Main components:
  - svm-train: trains a model on the provided data
  - svm-predict: applies a trained model to test data
svm-train
- Usage (example invocation below):
  - svm-train [options] training_data model_file
- Options:
  - -t: kernel type [0-3]
  - -d: degree (used in the polynomial kernel)
  - -g: gamma (used in the polynomial, RBF, and sigmoid kernels)
  - -r: coef0 (used in the polynomial and sigmoid kernels)
  - lots of others
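For example, a hypothetical invocation (the file names are placeholders) that trains with the RBF kernel (-t 2, per the kernel list on the next slide) and gamma set to 0.5:

    svm-train -t 2 -g 0.5 train.data rbf.model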
LibSVM Kernels
- Predefined libSVM kernels
  - Set with the -t switch; default: 2
    - 0 -- linear: u'*v
    - 1 -- polynomial: (gamma*u'*v + coef0)^degree
    - 2 -- radial basis function: exp(-gamma*|u-v|^2)
    - 3 -- sigmoid: tanh(gamma*u'*v + coef0)
(From the libSVM docs)
svm-predict
- Usage (example invocation below):
  - svm-predict testing_file model_file output_file
- Prints the predicted labels to output_file
- Accuracy is printed to stderr
- HW #8: implement an SVM decoder using an SVM model
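A hypothetical example (placeholder file names), applying the model trained above to held-out data:

    svm-predict test.data rbf.model predictions.out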
Training File Format
- Training file:
  - A sequence of lines, each representing one training instance
- Mallet format:
  - instanceID classLabel f1 v1 …
- LibSVM format (example below):
  - classNumber featidx1:v1 featidx2:v2 …
  - Feature indices and values are numeric
  - Sparse: omit an idx:value pair if the value is 0
  - Pairs appear in increasing order of feature index
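For instance, a hypothetical two-instance file in libSVM format (the feature indices and values are made up for illustration):

    1 1:0.5 3:1 10:2
    0 2:1 3:0.25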
Model File Format Example
- Parameters:
  - svm_type c_svc
  - kernel_type linear
  - nr_class 2
  - total_sv 535
  - rho 0.281122   (corresponds to -b)
  - label 0 1
  - nr_sv 272 263
- SV section, one line per support vector (a parsing sketch follows below), e.g.:
  - 0.004437478408154137 0:1 1:1 2:1 3:1 4:1 5:1 6:1 7:1
    - The first number is the weight α_i·y_i
    - The rest is the corresponding support vector
(Due to F. Xia)
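Here is a minimal Python sketch (illustrative only; it assumes the simple binary, single-rho layout shown above rather than the full libSVM model format) that reads the header fields, the weights, and the support vectors:

    def read_model(path):
        # Returns (header fields, per-SV weights, support vectors as {index: value} dicts)
        params, weights, support_vectors = {}, [], []
        in_sv = False
        with open(path) as f:
            for line in f:
                parts = line.split()
                if not parts:
                    continue
                if not in_sv:
                    if parts[0] == "SV":
                        in_sv = True
                    else:
                        params[parts[0]] = parts[1:]
                else:
                    weights.append(float(parts[0]))
                    vec = {int(i): float(v)
                           for i, v in (p.split(":") for p in parts[1:])}
                    support_vectors.append(vec)
        return params, weights, support_vectors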
Applying the Model File
- Classification:
  - f(x) = sign(<w, x> + b)
- In papers:
  f(x) = Σ_i α_i y_i K(x_i, x) + b
  - where y_i is the class of x_i, with c0 = +1 and c1 = -1
- In libSVM (a decoding sketch follows below):
  f(x) = Σ_i weight_i K(x_i, x) - ρ
- If f(x) > 0, label with class c0; else with c1
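Putting the pieces together, here is a minimal Python sketch of this decoding rule (an illustration, not the HW solution or libSVM's own code); weights, support_vectors, rho, and the kernel function are assumed to have been read from the model file:

    def libsvm_decision(x, weights, support_vectors, rho, kernel):
        # f(x) = sum_i weight_i * K(x_i, x) - rho
        score = sum(w_i * kernel(sv_i, x)
                    for w_i, sv_i in zip(weights, support_vectors))
        return score - rho

    def predict_label(x, weights, support_vectors, rho, kernel):
        # Label with c0 (printed as 0) if f(x) > 0, else c1 (printed as 1)
        return 0 if libsvm_decision(x, weights, support_vectors, rho, kernel) > 0 else 1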
Output File Format
- Output format:
  - A sequence of lines, one predicted label per line, corresponding to the input instances
  - Binary classification: the classes are written as 0/1 (not -1/+1)
- Example output (label 0 corresponds to class c0, label 1 to class c1):
  0
  0
  1
  1
  0
Notation Mapping
- Model:
  - Papers: α_i y_i, x_i, b
  - libSVM: weight_i, x_i, ρ
- Decoding:
  - Papers: f(x) = Σ_i α_i y_i K(x_i, x) + b
  - libSVM: f(x) = Σ_i weight_i K(x_i, x) - ρ
- Labels:
  - Papers: +1, -1
  - libSVM: 0, 1
(Due to F. Xia)
HW #8
- Run libSVM on a binary text classification task
  - Binary in both senses:
    - 2 classes
    - Feature values are 0/1
- Build an SVM decoder using the model from step 1
- Implement different kernel functions