Transcript Part 2
Slide 1: Part 2: Support Vector Machines
Vladimir Cherkassky, University of Minnesota, [email protected]
Presented at Tech Tune Ups, ECE Dept, June 1, 2011
Electrical and Computer Engineering

Slide 2: SVM: Brief History
• 1963 Margin (Vapnik & Lerner)
• 1964 Margin (Vapnik & Chervonenkis)
• 1964 RBF kernels (Aizerman)
• 1965 Optimization formulation (Mangasarian)
• 1971 Kernels (Kimeldorf and Wahba)
• 1992-1994 SVMs (Vapnik et al.)
• 1996 - present: rapid growth, numerous applications
• 1996 - present: extensions to other problems

Slide 3: MOTIVATION for SVM
• Problems with 'conventional' methods:
  - model complexity ~ dimensionality (# of features)
  - nonlinear methods → multiple minima
  - hard to control complexity
• SVM solution approach:
  - adaptive loss function (to control complexity independently of dimensionality)
  - flexible nonlinear models
  - tractable optimization formulation

Slide 4: SVM APPROACH
• Linear approximation in Z-space using a special adaptive loss function
• Complexity independent of dimensionality
• Mapping: $x \to z = g(x) \to \hat{y} = w \cdot z$

Slide 5: OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary

Slide 6: Example: binary classification
• Given: linearly separable data
• How to construct a linear decision boundary?

Slide 7: Linear Discriminant Analysis
[Figure: LDA solution and its separation margin]

Slide 8: Perceptron (linear NN)
• Perceptron solutions and separation margin
[Figure: several perceptron solutions with different margins]

Slide 9: Largest-margin solution
• All solutions explain the data well (zero error)
• All solutions ~ the same linear parameterization
• Larger margin ~ more confidence (falsifiability)
[Figure: largest-margin separating hyperplane; margin = $2\Delta$]

Slide 10: Complexity of Δ-margin hyperplanes
• If data samples belong to a sphere of radius R, then the set of Δ-margin hyperplanes has VC-dimension bounded by $h \le \min(R^2/\Delta^2, d) + 1$
• For large-margin hyperplanes, the VC-dimension is controlled independently of the dimensionality d.

Slide 11: Motivation: philosophical
• Classical view: a good model explains the data + has low complexity → Occam's razor (complexity ~ # of parameters)
• VC theory: a good model explains the data + has low VC-dimension ~ VC-falsifiability: a good model explains the data + has large falsifiability
• The idea: falsifiability ~ empirical loss function

Slide 12: Adaptive loss functions
• Both goals (explanation + falsifiability) can be encoded into an empirical loss function where:
  - a (large) portion of the data has zero loss
  - the rest of the data has non-zero loss, i.e. it falsifies the model
• The trade-off (between the two goals) is adaptively controlled → adaptive loss function
• Examples of such loss functions for different learning problems are shown next.
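The margin-based loss for classification and the ε-insensitive loss for regression defined on the next slides (13-15) can be written in a few lines of code. The sketch below is a minimal NumPy illustration only; the values Δ = 1 and ε = 0.6 and the sample arrays are example assumptions, not part of the slides.

```python
import numpy as np

def margin_loss(y, f, delta=1.0):
    """Margin-based loss for classification: max(delta - y*f(x), 0).
    y in {-1, +1}, f = real-valued model output f(x, omega)."""
    return np.maximum(delta - y * f, 0.0)

def epsilon_loss(y, f, eps=0.6):
    """Epsilon-insensitive loss for regression: max(|y - f(x)| - eps, 0)."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

# Samples well inside the margin / epsilon-tube get zero loss (they do not falsify the model)
print(margin_loss(np.array([+1, +1, -1]), np.array([2.0, 0.3, 0.5])))   # [0.  0.7 1.5]
print(epsilon_loss(np.array([1.0, 1.0]), np.array([1.2, 0.1])))         # [0.  0.3]
```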
Slide 13: Margin-based loss for classification
$L_\Delta(y, f(x, \omega)) = \max(\Delta - y f(x, \omega),\, 0)$
[Figure: loss plotted against $y f(x, \omega)$; margin = $2\Delta$]

Slide 14: Margin-based loss for classification: margin is adapted to training data
$L_\Delta(y, f(x, \omega)) = \max(\Delta - y f(x, \omega),\, 0)$
[Figure: Class +1 and Class -1 samples, with the margin adapted to the training data]

Slide 15: Epsilon loss for regression
$L_\varepsilon(y, f(x, \omega)) = \max(|y - f(x, \omega)| - \varepsilon,\, 0)$

Slide 16: Parameter epsilon is adapted to training data
• Example: linear regression y = x + noise, where noise ~ N(0, 0.36), x ~ [0, 1], 4 samples
• Compare: squared, linear and SVM loss (ε = 0.6)
[Figure: the three fitted lines over x ∈ [0, 1]]

Slide 17: OUTLINE
• Margin-based loss
• SVM for classification
  - Linear SVM classifier
  - Inner product kernels
  - Nonlinear SVM classifier
• SVM examples
• Support vector regression
• Summary

Slide 18: SVM Loss for Classification
• The continuous quantity $y f(x, w)$ measures how close a sample x is to the decision boundary.

Slide 19: Optimal Separating Hyperplane
• Distance between the hyperplane and a sample x': $f(x') / \|w\|$
• Margin: $1 / \|w\|$
[Figure: optimal separating hyperplane; shaded points are SVs]

Slide 20: Linear SVM Optimization Formulation (for separable data)
• Given training data $(x_i, y_i)$, $i = 1, \dots, n$
• Find parameters $w, b$ of the linear hyperplane $f(x) = w \cdot x + b$ that minimize $\frac{1}{2}\|w\|^2$ under the constraints $y_i(w \cdot x_i + b) \ge 1$
• Quadratic optimization with linear constraints → tractable for moderate dimensions d
• For large dimensions use the dual formulation:
  - scales with sample size (n) rather than d
  - uses only dot products $x_i \cdot x_j$

Slide 21: Classification for non-separable data
$L(y, f(x, \omega)) = \max(\Delta - y f(x, \omega),\, 0)$
[Figure: loss plotted against $y f(x, \omega)$, with slack variables for samples inside the margin or misclassified]

Slide 22: SVM for non-separable data
• Minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$ under the constraints $y_i(w \cdot x_i + b) \ge 1 - \xi_i$
[Figure: $f(x) = w \cdot x + b$ with the lines f(x) = +1, f(x) = 0, f(x) = -1 and slack values $\xi_1 = 1 - f(x_1)$, $\xi_2 = 1 - f(x_2)$, $\xi_3 = 1 + f(x_3)$]

Slide 23: SVM Dual Formulation
• Given training data $(x_i, y_i)$, $i = 1, \dots, n$
• Find parameters $\alpha_i^*, b^*$ of an optimal hyperplane as the solution of the maximization problem
  $L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \to \max$
  under the constraints $\sum_{i=1}^{n} \alpha_i y_i = 0$, $0 \le \alpha_i \le C$
• Solution: $f(x) = \sum_{i=1}^{n} \alpha_i^* y_i (x \cdot x_i) + b^*$, where the samples with nonzero $\alpha_i^*$ are the SVs
• Needs only inner products $x \cdot x'$

Slide 24: Nonlinear Decision Boundary
• Fixed (linear) parameterization is too rigid
• A nonlinear curved boundary may yield a larger margin (falsifiability) and lower error

Slide 25: Nonlinear Mapping via Kernels
• Nonlinear f(x, w) + margin-based loss = SVM
• Nonlinear mapping to feature z-space, e.g. $x = (x_1, x_2) \to z = (1, x_1, x_2, x_1 x_2, x_1^2, x_2^2)$
• Linear in z-space ~ nonlinear in x-space
• BUT $z \cdot z' = H(x, x')$ ~ kernel trick: compute the dot product via the kernel analytically
• Mapping: $x \to z = g(x) \to \hat{y} = w \cdot z$

Slide 26: SVM Formulation (with kernels)
• Given: training data $(x_i, y_i)$, $i = 1, \dots, n$, an inner product kernel $H(x, x')$, and regularization parameter C
• Replacing $z \cdot z' = H(x, x')$ leads to: find parameters $\alpha_i^*, b^*$ of an optimal hyperplane $D(x) = \sum_{i=1}^{n} \alpha_i^* y_i H(x_i, x) + b^*$ as the solution of the maximization problem
  $L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j H(x_i, x_j) \to \max$
  under the constraints $\sum_{i=1}^{n} \alpha_i y_i = 0$, $0 \le \alpha_i \le C$

Slide 27: Examples of Kernels
• A kernel $H(x, x')$ is a symmetric function satisfying general math conditions (Mercer's conditions)
• Examples of kernels for different mappings x → z:
  - Polynomials of degree q: $H(x, x') = (x \cdot x' + 1)^q$
  - RBF kernel: $H(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
  - Neural networks: $H(x, x') = \tanh(v (x \cdot x') + a)$ for given parameters v, a → automatic selection of the number of hidden units (SVs)
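A minimal sketch of how the polynomial and RBF kernels above generate the n-by-n kernel matrix discussed on the next slide. NumPy is used, and the degree q, width σ, and toy data are illustrative assumptions.

```python
import numpy as np

def poly_kernel(X1, X2, q=2):
    """Polynomial kernel of degree q: H(x, x') = (x . x' + 1)^q."""
    return (X1 @ X2.T + 1.0) ** q

def rbf_kernel(X1, X2, sigma=1.0):
    """RBF kernel: H(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# The n x n kernel matrix H[i, j] = H(x_i, x_j) holds all the info (data + kernel)
X = np.random.rand(5, 2)          # 5 samples in 2-d, for illustration
H = rbf_kernel(X, X)              # shape (5, 5), symmetric, H[i, i] = 1
print(H.shape, np.allclose(H, H.T))
```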
Slide 28: More on Kernels
• The kernel matrix has all the info (data + kernel):
  $\begin{pmatrix} H(1,1) & H(1,2) & \cdots & H(1,n) \\ H(2,1) & H(2,2) & \cdots & H(2,n) \\ \vdots & \vdots & \ddots & \vdots \\ H(n,1) & H(n,2) & \cdots & H(n,n) \end{pmatrix}$
• A kernel defines a distance in some feature space (aka kernel-induced feature space)
• Kernels can incorporate a priori knowledge
• Kernels can be defined over complex structures (trees, sequences, sets, etc.)

Slide 29: Support Vectors
• SVs ~ training samples with non-zero loss
• SVs are samples that falsify the model
• The model depends only on the SVs → SVs ~ robust characterization of the data
  WSJ, Feb 27, 2004: "About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters."
• SVM generalization ~ data compression

Slide 30: New insights provided by SVM
• Why can linear classifiers generalize? $h \le \min(R^2/\Delta^2, d) + 1$
  (1) the margin is large (relative to R)
  (2) the % of SVs is small
  (3) the ratio d/n is small
• SVM offers an effective way to control complexity (via margin + kernel selection), i.e. implementing (1) or (2) or both
• Requires common-sense parameter tuning

Slide 31: OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary

Slide 32: Ripley's data set
• 250 training samples, 1,000 test samples
• SVM using RBF kernel $K(u, v) = \exp(-\gamma \|u - v\|^2)$
• Model selection via 10-fold cross-validation

Slide 33: Ripley's data set: SVM model
• Decision boundary and margin borders
• SVs are circled
[Figure: trained SVM model in the (x1, x2) plane]

Slide 34: Ripley's data set: model selection
• SVM tuning parameters: C and the RBF kernel parameter γ
• Select optimal parameter values via 10-fold cross-validation
• Results of cross-validation (error rates) are summarized below:

            C=0.1   C=1     C=10    C=100   C=1000  C=10000
  γ = 2^-3  98.4%   23.6%   18.8%   20.4%   18.4%   14.4%
  γ = 2^-2  51.6%   22%     20%     20%     16%     14%
  γ = 2^-1  33.2%   19.6%   18.8%   15.6%   13.6%   14.8%
  γ = 2^0   28%     18%     16.4%   14%     12.8%   15.6%
  γ = 2^1   20.8%   16.4%   14%     12.8%   16%     17.2%
  γ = 2^2   19.2%   14.4%   13.6%   15.6%   15.6%   16%
  γ = 2^3   15.6%   14%     15.6%   16.4%   18.4%   18.4%

Slide 35: Noisy Hyperbolas data set
• This example shows the application of different kernels
• Note: the decision boundaries are quite different
[Figure: two panels showing the decision boundaries for the RBF kernel and the polynomial kernel]

Slide 36: Many challenging applications
• Mimic human recognition capabilities:
  - high-dimensional data
  - content-based
  - context-dependent
• Example: read the sentence (the scrambled spelling is intentional):
  "Sceitnitss osbevred: it is nt inptrant how lteters are msspled isnide the word. It is ipmoratnt that the fisrt and lsat letetrs do not chngae, tehn the txet is itneprted corrcetly"
• SVM is suitable for sparse high-dimensional data

Slide 37: Example SVM Applications
• Handwritten digit recognition
• Genomics
• Face detection in unrestricted images
• Text/document classification
• Image classification and retrieval
• ...

Slide 38: Handwritten Digit Recognition (mid-90's)
• Data set: postal images (zip code), segmented, cropped; ~7K training samples and ~2K test samples
• Data encoding: 16x16 pixel image → 256-dimensional vector
• Original motivation: compare SVM with a custom MLP network (LeNet) designed for this application
• Multi-class problem: one-vs-all approach → 10 SVM classifiers (one per digit)
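A minimal sketch of the one-vs-all scheme just described, using scikit-learn's SVC as a stand-in (the original study used a custom implementation on 16x16 postal images); the 8x8 digits data and the C, γ values here are illustrative assumptions only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: sklearn's 8x8 digits (the original study used 16x16 postal images)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# One-vs-all: train 10 binary SVM classifiers, one per digit
classifiers = []
for digit in range(10):
    clf = SVC(kernel="rbf", C=10.0, gamma=0.001)        # illustrative parameter values
    clf.fit(X_train, np.where(y_train == digit, 1, -1))  # labels in {-1, +1}
    classifiers.append(clf)

# Classify by the digit whose classifier returns the largest decision value
scores = np.column_stack([clf.decision_function(X_test) for clf in classifiers])
y_pred = scores.argmax(axis=1)
print("test error: %.1f%%" % (100 * np.mean(y_pred != y_test)))
```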
Slide 39: Digit Recognition Results
• Summary:
  - prediction accuracy better than custom NNs
  - accuracy does not depend on the kernel type
  - 100-400 support vectors per class (digit)
• More details:

  Type of kernel    No. of Support Vectors    Error %
  Polynomial        274                       4.0
  RBF               291                       4.1
  Neural Network    254                       4.2

• ~80-90% of the SVs coincide (for different kernels)

Slide 40: Document Classification (Joachims, 1998)
• The problem: classification of text documents in large databases, for text indexing and retrieval
• Traditional approach: human categorization (i.e. via feature selection) - relies on a good indexing scheme; this is time-consuming and costly
• Predictive learning approach (SVM): construct a classifier using all possible features (words)
• Document/text representation: individual words = input features (possibly weighted)
• SVM performance:
  - very promising (~90% accuracy vs. 80% by other classifiers)
  - most problems are linearly separable → use linear SVM

Slide 41: OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary

Slide 42: Linear SVM regression
• Assume linear parameterization $f(x, \omega) = w \cdot x + b$
$L_\varepsilon(y, f(x, \omega)) = \max(|y - f(x, \omega)| - \varepsilon,\, 0)$
[Figure: ε-tube around the regression line, with slack variables $\xi_1$, $\xi_2^*$]

Slide 43: Direct Optimization Formulation
• Given training data $(x_i, y_i)$, $i = 1, \dots, n$
• Minimize $\frac{1}{2}(w \cdot w) + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$
• Under the constraints
  $y_i - (w \cdot x_i) - b \le \varepsilon + \xi_i$
  $(w \cdot x_i) + b - y_i \le \varepsilon + \xi_i^*$
  $\xi_i, \xi_i^* \ge 0$, $i = 1, \dots, n$

Slide 44: Example: SVM regression using RBF kernel
• The SVM estimate is shown as a dashed line
• The SVM model uses only 5 SVs (out of the 40 points)
[Figure: data points and the SVM estimate over x ∈ [0, 1]]

Slide 45: RBF regression model
$f(x, w) = \sum_{j=1}^{m} w_j \exp\left(-\frac{(x - c_j)^2}{2\,(0.2)^2}\right)$
• A weighted sum of 5 RBF kernels gives the SVM model
[Figure: the five weighted RBF kernels and their sum]

Slide 46: Summary
• Margin-based loss: robust + performs complexity control
• Nonlinear feature selection (~ SVs): performed automatically
• Tractable model selection - easier than for most nonlinear methods
• SVM is not a magic-bullet solution:
  - similar to other methods when n >> h
  - SVM is better when n << h or n ~ h
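For reference, a minimal sketch of ε-insensitive SVM regression with an RBF kernel (slides 42-45), using scikit-learn's SVR; the generated data and the C, γ, ε values are illustrative assumptions and do not reproduce the 40-sample example shown above.

```python
import numpy as np
from sklearn.svm import SVR

# Illustrative data: 40 noisy samples of a smooth target on [0, 1]
rng = np.random.RandomState(0)
x = np.sort(rng.rand(40, 1), axis=0)
y = np.sin(2 * np.pi * x).ravel() + 0.1 * rng.randn(40)

# Epsilon-insensitive SVM regression with an RBF kernel (illustrative parameters)
model = SVR(kernel="rbf", C=10.0, gamma=10.0, epsilon=0.2)
model.fit(x, y)

# Samples outside the epsilon-tube become support vectors; the estimate is a
# weighted sum of RBF kernels centered at those SVs
print("number of SVs:", len(model.support_))
print("fit at x = 0.5:", model.predict([[0.5]]))
```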