Non-linear SVMs & libSVM
Advanced Statistical Methods in NLP
Ling 572, February 23, 2012
Roadmap
- Non-linear SVMs: motivation (non-linear data); the kernel trick; linear vs. non-linear SVM models
- LibSVM: svm-train & svm-predict; models
- HW #7

Non-Linear SVMs
Problem: sometimes data really isn't linearly separable.
Approach: map the data non-linearly into a higher-dimensional space, where it is separable. (Figure from Hearst et al. '98.)

Feature Space
Basic approach: the original data is not linearly separable, so map it into a "feature space" (a higher-dimensional dot-product space) via a non-linear map $\Phi$, and compute the separating hyperplane in that higher-dimensional space.

Issues with Feature Space
The mapping idea is simple, but it has some practical problems:
- The feature space can be very high-dimensional (infinite, even).
- The approach depends on computing similarity (dot products), which is computationally expensive there.
- It also depends on computing the mapping itself, which may be intractable.

Solution
The "kernel trick": use a kernel function $K: X \times X \to \mathbb{R}$ with
$K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$
- $K$ computes a similarity measure on the images of the data points.
- It can often be computed efficiently, even for high- (or infinite-) dimensional feature spaces.
- The choice of $K$ is equivalent to the selection of $\Phi$.

Example (Russell & Norvig)
Original 2-D data: $x = (x_1, x_2)$.
Map to new values in a 3-D feature space $\Phi(x) = (f_1, f_2, f_3)$:
$f_1 = x_1^2; \quad f_2 = x_2^2; \quad f_3 = \sqrt{2}\, x_1 x_2$
For $x = (1, 2)$ and $z = (-2, 3)$:
$\Phi(x) = (1, 4, 2\sqrt{2}); \quad \Phi(z) = (4, 9, -6\sqrt{2})$
$K(x, z) = \langle \Phi(x), \Phi(z) \rangle = 1 \cdot 4 + 4 \cdot 9 + 2\sqrt{2} \cdot (-6\sqrt{2}) = 4 + 36 - 24 = 16$
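As a quick sanity check of this example (a sketch of mine, not from the slides), the following Python snippet computes the inner product after the explicit mapping and compares it to the kernel value $\langle x, z \rangle^2$; both come out to 16:

    import math

    def phi(x):
        """Map a 2-D point to the 3-D feature space (x1^2, x2^2, sqrt(2)*x1*x2)."""
        x1, x2 = x
        return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

    def dot(u, v):
        """Plain dot product of two equal-length tuples."""
        return sum(ui * vi for ui, vi in zip(u, v))

    x = (1, 2)
    z = (-2, 3)

    explicit = dot(phi(x), phi(z))   # inner product in the 3-D feature space
    kernel   = dot(x, z) ** 2        # kernel trick: <x, z>^2, no mapping needed

    print(explicit, kernel)          # both print 16 (up to floating point)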
Example (cont'd)
More generally, for $x = (x_1, x_2)$ and $z = (z_1, z_2)$:
$\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2); \quad \Phi(z) = (z_1^2, z_2^2, \sqrt{2}\, z_1 z_2)$
$\langle \Phi(x), \Phi(z) \rangle = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1 z_1 + x_2 z_2)^2 = \langle x, z \rangle^2$

Kernel Trick: Summary
- Avoids explicit mapping to the high-dimensional space.
- Avoids explicit computation of the inner product in feature space.
- Avoids explicit computation of the mapping function, or even of the feature vectors.
- Replace all inner products in SVM training/testing with $K$.

Non-Linear SVM Training
Linear version: maximize
$\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.
Non-linear version: maximize
$\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to the same constraints.

Decoding
Linear SVM: $f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$
Non-linear SVM: $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$

Common Kernel Functions
Implemented in most packages:
- Linear: $K(x, z) = \langle x, z \rangle$
- Polynomial: $K(x, z) = (\gamma \langle x, z \rangle + c)^d$
- Radial basis function (RBF): $K(x, z) = e^{-\gamma \|x - z\|^2}$
- Sigmoid: $K(x, z) = \tanh(\gamma \langle x, z \rangle + c)$, where $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
Here, for $x = (x_1, \ldots, x_n)$ and $z = (z_1, \ldots, z_n)$: $x - z = (x_1 - z_1, \ldots, x_n - z_n)$ and $\|x - z\|^2 = (x_1 - z_1)^2 + \ldots + (x_n - z_n)^2$.
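These four kernels are easy to state in code. Below is an illustrative Python sketch of mine (not from the slides); the parameter names gamma, coef0, and degree mirror libSVM's -g, -r, and -d switches introduced later, but the default values here are placeholders of my own choosing:

    import math

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def linear(x, z):
        return dot(x, z)

    def polynomial(x, z, gamma=1.0, coef0=0.0, degree=3):
        return (gamma * dot(x, z) + coef0) ** degree

    def rbf(x, z, gamma=1.0):
        # ||x - z||^2 = sum_i (x_i - z_i)^2
        sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
        return math.exp(-gamma * sq_dist)

    def sigmoid(x, z, gamma=1.0, coef0=0.0):
        return math.tanh(gamma * dot(x, z) + coef0)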
Kernels
Many kernels are conceivable. A function is a kernel if it obeys Mercer's theorem: it is symmetric and continuous, and its kernel matrix is positive definite.
The selection of kernel can have a huge impact, with dramatic differences in accuracy. Knowledge about the "shape" of the data can help select one. Ironically, linear SVMs perform well on many tasks.

Summary
- Find the decision hyperplane that maximizes the margin.
- Employ a soft margin to support noisy data.
- For non-linearly separable data, use non-linear SVMs: project to a higher-dimensional space to separate, and use the kernel trick to avoid intractable computation of the projection or the inner products.

MaxEnt vs. SVM (due to F. Xia)
                    MaxEnt                                 SVM
  Modeling          maximize P(Y|X, λ)                     maximize the margin
  Training          learn λ_i for each feature function    learn α_i for each training instance
  Decoding          calculate P(y|x)                       calculate sign(f(x))
  Things to decide  features, regularization,              kernel, regularization, training
                    training algorithm                     algorithm, binarization

LibSVM
Well-known SVM package (Chang & Lin; most recent version 2011). Supports:
- many different SVM variants
- a range of kernels
- an efficient multi-class implementation
- a range of language interfaces, as well as a CLI
- frameworks for tuning: cross-validation, grid search
Why libSVM? SVM is not in Mallet.

libSVM Information
Main website: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (documentation & examples, FAQ)
Installation on patas: /NLP_TOOLS/ml_tools/svm/libsvm/latest/
Calling a tool with no arguments prints its usage.
Main components:
- svm-train: trains a model on the provided data
- svm-predict: applies a trained model to test data

svm-train
Usage: svm-train [options] training_data model_file
Options:
  -t: kernel type [0-3]
  -d: degree (used in polynomial)
  -g: gamma (used in polynomial, RBF, sigmoid)
  -r: coef0 (used in polynomial, sigmoid)
  (lots of others)

LibSVM Kernels
Predefined libSVM kernels, set with the -t switch (default 2); from the libSVM docs:
  0 -- linear: u'*v
  1 -- polynomial: (gamma*u'*v + coef0)^degree
  2 -- radial basis function: exp(-gamma*|u-v|^2)
  3 -- sigmoid: tanh(gamma*u'*v + coef0)

svm-predict
Usage: svm-predict testing_file model_file output_file
Prints results to output_file; accuracy is printed to stderr.
HW #8: implement an SVM decoder using an SVM model.

Training File Format
A training file is a sequence of lines, each representing one training instance.
Mallet format:  instanceID classLabel f1 v1 ...
LibSVM format:  classNumber featidx1:v1 featidx2:v2 ...
Feature indices and values are numeric. The format is sparse: an idx:value pair is omitted if its value is 0, and pairs appear in increasing order of index. An example follows below.
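To make the format concrete, a tiny (hypothetical, my own) two-instance training file in libSVM format might look like this, with sparse idx:value pairs in increasing index order:

    0 1:1 3:1 7:1
    1 2:1 3:1 5:1

Training with the linear kernel and then decoding a test file in the same format would then look like the following, matching the usage strings above (the file names and the choice of -t 0 are illustrative):

    svm-train -t 0 train.txt model_file
    svm-predict test.txt model_file output_file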
Model File Format
Example (due to F. Xia):

  svm_type c_svc
  kernel_type linear
  nr_class 2
  total_sv 535
  rho 0.281122
  label 0 1
  nr_sv 272 263
  SV
  0.004437478408154137 0:1 1:1 2:1 3:1 4:1 5:1 6:1 7:1
  ...

The lines before SV are the model's parameters; rho corresponds to -b. Each line after SV describes one support vector: the leading number is its weight $\alpha_i y_i$, followed by the corresponding support vector's idx:value pairs.

Applying the Model File
Classification: $f(x) = \mathrm{sign}(\langle w, x \rangle + b)$.
In papers: $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$, where $y_i$ is the class of $x_i$, with $c_0 = +1$ and $c_1 = -1$.
In libSVM: $f(x) = \sum_i \mathrm{weight}_i\, K(x_i, x) - \rho$
If $f(x) > 0$, label with class $c_0$; else $c_1$.

Output File Format
A sequence of lines, one label per line, corresponding to the input instances. For binary classification the classes are 0/1 (not -1/+1), e.g. 0, 0, 1, 1, 0, where 0 marks class $c_0$ and 1 marks class $c_1$.

Notation Mapping (due to F. Xia)
            Papers                                  libSVM
  Model     α_i y_i, x_i, b                         weight_i, x_i, ρ
  Decoding  f(x) = Σ_i α_i y_i K(x_i, x) + b        f(x) = Σ_i weight_i K(x_i, x) - ρ
  Labels    +1, -1                                  0, 1

HW #8
- Run libSVM on a binary text classification task. Binary in both senses: 2 classes, and feature values are 0/1.
- Build an SVM decoder using the model from step 1.
- Implement the different kernel functions.
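Tying the pieces together, here is a minimal decoding sketch in Python, assuming a binary model with a single rho value and a linear kernel, laid out like the example model file above. It is an illustration of mine, not a reference implementation of HW #8, and the function and variable names are hypothetical:

    def sparse_dot(a, b):
        """Linear kernel: dot product of two {index: value} sparse vectors."""
        if len(a) > len(b):
            a, b = b, a
        return sum(v * b.get(i, 0.0) for i, v in a.items())

    def read_model(path):
        """Read rho and the (weight, support-vector) pairs from a model file."""
        rho, svs = 0.0, []
        with open(path) as f:
            lines = iter(f)
            for line in lines:          # header: parameters, up through "SV"
                line = line.strip()
                if line.startswith("rho"):
                    rho = float(line.split()[1])
                elif line == "SV":
                    break
            for line in lines:          # one support vector per line
                fields = line.split()
                if not fields:
                    continue
                weight = float(fields[0])                      # alpha_i * y_i
                vec = {int(i): float(v)
                       for i, v in (p.split(":") for p in fields[1:])}
                svs.append((weight, vec))
        return rho, svs

    def classify(x, rho, svs, kernel=sparse_dot):
        """f(x) = sum_i weight_i * K(x_i, x) - rho; class 0 if f(x) > 0, else 1."""
        fx = sum(w * kernel(sv, x) for w, sv in svs) - rho
        return 0 if fx > 0 else 1

Here a test instance x is a sparse {index: value} dict parsed from the same idx:value format; swapping in the polynomial, RBF, or sigmoid functions from the earlier sketch (with gamma, coef0, and degree read from the model header) would cover the "implement different kernel functions" part of the assignment.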