Transcript: Lecture 22
Today’s Topics
Support Vector Machines (SVMs)
Three Key Ideas
– Max Margins
– Allowing Misclassified Training Examples
– Kernels (for non-linear models; covered in the next lecture)

Three Key SVM Concepts
• Maximize the Margin
  Don’t choose just any separating plane
• Penalize Misclassified Examples
  Use soft constraints and ‘slack’ variables
• Use the ‘Kernel Trick’ to get Non-Linearity
  Roughly like ‘hardwiring’ the input-to-hidden-unit portion of ANNs (so we only need a perceptron)

Support Vector Machines: Maximizing the Margin between Bounding Planes
SVMs define some inequalities we want satisfied. We then use advanced optimization methods (eg, linear programming) to find a satisfying solution, but in cs540 we’ll use a simpler approximation.
[Figure: two parallel bounding planes with the support vectors lying on them; the separation between the planes is 2 / ||w||_2]

Margins and Learning Theory
Theorems exist that connect learning (‘PAC’) theory to the size of the margin
– Basically, the larger the margin, the better the expected future accuracy
– See, for example, Chapter 4 of Support Vector Machines by N. Cristianini & J. Shawe-Taylor, Cambridge Press, 2000 (not an assigned reading)

‘Slack’ Variables: Dealing with Data that is not Linearly Separable
For each wrong example we pay a penalty, which is the distance we’d have to move it to get on the right side of the decision boundary (ie, the separating plane).
If we deleted any/all of the non-support vectors, we’d get the same answer!
[Figure: a separating plane with its support vectors marked and the slack distances shown for the misplaced examples]

SVMs and Non-Linear Separating Surfaces
Non-linearly map the examples to a new space, eg via new features g(f1, f2) and h(f1, f2), then linearly separate them in the new space.
The result is a non-linear separator in the original space.
[Figure: ‘+’ and ‘–’ points that are not linearly separable in the (f1, f2) space become linearly separable in the new space]

Math Review: Dot Products
X · Y = X1 Y1 + X2 Y2 + … + Xn Yn
So if X = [4, 5, -3, 7] and Y = [9, 0, -8, 2]
then X · Y = (4)(9) + (5)(0) + (-3)(-8) + (7)(2) = 36 + 0 + 24 + 14 = 74
(weighted sums in ANNs are dot products)

Some Equations
The separating plane is w · x = θ, where w is the vector of weights, x is the vector of input features, and θ is the threshold.
For all positive examples:  w · x_pos ≥ θ + 1
For all negative examples:  w · x_neg ≤ θ – 1
These 1’s result from dividing through by a constant for convenience (it is the distance from the dashed bounding lines to the green separating line).

Idea #1: The Margin (derivation not on final)
Let x_A be a point on the ‘+1’ bounding plane and x_B a point on the ‘–1’ bounding plane, chosen so that x_A – x_B is perpendicular to the planes (ie, parallel to w). Each bounding plane is the set of all points that satisfy its equation:
(i)  w · x_A = θ + 1
(ii) w · x_B = θ – 1
Subtracting (ii) from (i) gives
(iii) w · (x_A – x_B) = 2
Also, by the definition of the dot product,
(iv) w · (x_A – x_B) = ||w|| ||x_A – x_B|| cos(angle) = ||w|| ||x_A – x_B||, since cos(angle) = 1 for parallel vectors
Combining (iii) and (iv), the margin is
  ||x_A – x_B|| = 2 / ||w||

Our Initial ‘Mathematical Program’
  min over w, θ of  ||w||_1
  such that
    w · x_pos ≥ θ + 1   // for ‘+’ ex’s
    w · x_neg ≤ θ – 1   // for ‘–’ ex’s
(||w||_1 is the ‘1-norm’ length of the weight vector, which is the sum of the absolute values of the weights; some SVMs use quadratic programs, but 1-norms have some preferred properties)

The ‘p’ Norm – a Generalization of the Familiar Euclidean Distance (p = 2)
  ||x||_p = ( |x_1|^p + |x_2|^p + … + |x_n|^p )^(1/p)
so p = 1 gives the sum of absolute values and p = 2 gives ordinary Euclidean length. (A short numeric example of dot products, norms, and the resulting margin appears below.)
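To make the dot-product review and the 2 / ||w|| margin formula concrete, here is a minimal Python sketch; the vectors X and Y come from the example above, while the weight vector w is a made-up value used only for illustration.

```python
import numpy as np

# Dot product example from the slide above.
X = np.array([4, 5, -3, 7])
Y = np.array([9, 0, -8, 2])
print(X @ Y)                      # 4*9 + 5*0 + (-3)*(-8) + 7*2 = 74

def p_norm(v, p):
    """||v||_p = (sum_i |v_i|^p)^(1/p); p=1 is the 1-norm, p=2 the Euclidean norm."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

w = np.array([0.5, -2.0, 1.5])    # hypothetical weight vector (not from the lecture)
print(p_norm(w, 1))               # 1-norm used in the LP objective: 4.0
print(p_norm(w, 2))               # 2-norm: about 2.55
print(2.0 / p_norm(w, 2))         # margin between the bounding planes: 2 / ||w||_2
```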
Our Mathematical Program (cont.)
Note: w and θ are our adjustable parameters (we could, of course, use the ANN ‘trick’ and move θ to the left side of our inequalities and treat θ as another weight).
We can now use existing math-programming optimization s/w to find a solution to our current program (covered in cs525).

Idea #2: Dealing with Non-Separable Data
• We can add what is called a ‘slack’ variable to each example
• This variable can be viewed as
  = 0 if the example is correctly separated
  else = the ‘distance’ we need to move the example to get it correct (ie, its distance from the decision boundary)
• Note: we are NOT counting the number of misclassified examples
  It would be nice to do so, but that becomes [mixed] integer programming, which is much harder

The Math Program with Slack Vars
(this is the linear-programming version; there is also a quadratic-programming version, but in cs540 we won’t worry about the difference)
  min over w, S, θ of  ||w||_1 + μ ||S||_1
  such that
    w · x_pos_i + S_i ≥ θ + 1
    w · x_neg_j – S_j ≤ θ – 1
    S_k ≥ 0
Here w has dimension = # of input features, θ is a scalar, and S has dimension = # of training examples. μ is a scaling constant (use a tuning set to select its value).
The S’s are how far we would need to move an example in order for it to be on the proper side of the decision surface.
Notice we are solving the perceptron task with a complexity penalty (the sum of the wgts) – Hinton’s wgt decay! (A small numeric sketch of computing the slacks and this objective appears below.)

‘Slacks’ and Separability
• If the training data is separable, will all S_i = 0?
• Not necessarily!
  – We might get a larger margin by misclassifying a few examples (just like in d-tree pruning)
  – This can also happen when using gradient descent to minimize an ANN’s cost function
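As a concrete illustration of the slack variables and the objective ||w||_1 + μ ||S||_1, here is a minimal Python sketch; the four training points and the particular w, θ, and μ are made-up values, and the sketch only evaluates the program for fixed parameters (the actual LP optimizes over w, S, and θ).

```python
import numpy as np

# Made-up data: two '+' examples and two '-' examples (rows are feature vectors).
pos = np.array([[2.0, 2.0], [3.0, 1.0]])
neg = np.array([[0.0, 0.0], [1.0, 2.5]])
w, theta, mu = np.array([1.0, 1.0]), 2.5, 0.5   # hand-picked, for illustration only

# Slack = how far an example falls short of its required side of the margin:
# we want w.x_pos >= theta + 1 and w.x_neg <= theta - 1; any shortfall becomes S >= 0.
s_pos = np.maximum(0.0, (theta + 1) - pos @ w)
s_neg = np.maximum(0.0, neg @ w - (theta - 1))
S = np.concatenate([s_pos, s_neg])              # here only the last example needs slack

objective = np.sum(np.abs(w)) + mu * np.sum(S)  # ||w||_1 + mu * ||S||_1
print(S, objective)
```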
Brief Intro to Linear Programs (LP’s) - not on final
• We need to convert our task into
    A z ≥ b
  which is the basic form of an LP (A is a constant matrix, b is a constant vector, and z is a vector of variables)
• Note
  – We can convert inequalities containing ≤ into ones using ≥ by multiplying both sides by -1
    eg, 5x ≤ 15 is the same as -5x ≥ -15
  – We can also handle = (ie, equalities); we could use ≥ and ≤ together to get =, but more efficient methods exist

Brief Intro to Linear Programs (cont.) - not on final
In addition, we want to min c · z. The vector c says how to penalize the settings of the variables in vector z.
[Figure: the feasible region (the points that satisfy the constraints A z ≥ b) shown in yellow, with dotted iso-cost lines for c · z]
Highly optimized s/w for solving LPs exists (eg, CPLEX, COIN-OR [free]).

Review: Matrix Multiplication
A B = C, where matrix A is M by K, matrix B is K by N, and matrix C is M by N:
  C_ij = Σ_k A_ik B_kj
From (code also there): http://www.cedricnugteren.nl/tutorial.php?page=2

Aside: Our SVM as an LP (not on final)
Let A_pos = our positive training examples and A_neg = our negative training examples, as matrices whose rows are examples (assume 50% pos and 50% neg for notational simplicity, with e examples and f input features overall).
The variable vector is z = [ w  S_pos  S_neg  θ  Z ], and the block rows of A z ≥ b encode:
  A_pos w + S_pos – θ 1 ≥ 1    (e/2 rows: the ‘+’ constraints)
  –A_neg w + S_neg + θ 1 ≥ 1   (e/2 rows: the ‘–’ constraints, with ≤ flipped to ≥)
  S_pos ≥ 0 and S_neg ≥ 0      (e rows)
  Z – w ≥ 0 and Z + w ≥ 0      (2f rows, which force Z ≥ |w|)
Written as a block matrix, the coefficients of the slack vectors and of Z are identity matrices (often written as I).

Our C Vector (determines the cost we’re minimizing; also not on final)
  c = [ 0  μ  0  1 ]   over   z = [ w  S  θ  Z ]
where S = S_pos concatenated with S_neg. Note we min the Z’s, not the w’s, since only the Z’s are constrained to be ≥ 0:
  min c · z = min μ · S + 1 · Z = min μ ||S||_1 + ||w||_1
since all the S’s are non-negative and the Z’s ‘squeeze’ the w’s.
Aside: we could also penalize θ, but we would need to add more variables since θ can be negative.

Where We Are So Far
• We have an ‘objective’ function that we can optimize by linear programming
  – min ||w||_1 + μ ||S||_1 subject to some constraints
  – Free LP solvers exist
  – CS 525 teaches linear programming
• We could also use gradient descent
  – Perceptron learning with ‘weight decay’ is quite similar, though it uses SQUARED wgts and SQUARED error (the S is this error)
(A sketch of posing the program above for an off-the-shelf LP solver appears below.)
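Putting the pieces together, here is a minimal sketch of the slack-variable program posed as a linear program and handed to an off-the-shelf solver, using scipy.optimize.linprog rather than the CPLEX / COIN-OR solvers mentioned above. The training points and μ are made-up values, and the variable ordering [w, S, θ, Z] differs slightly from the block layout in the aside, but the constraints are the same construction: margin rows plus the Z ≥ |w| rows that make Σ Z equal ||w||_1 at the optimum.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative (made-up) data: rows are examples, columns are features.
pos = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0]])
neg = np.array([[0.0, 0.0], [1.0, 0.5], [0.0, 1.5]])
X = np.vstack([pos, neg])
y = np.array([+1] * len(pos) + [-1] * len(neg))
e, f = X.shape
mu = 1.0

# Variables, in order: w (f of them), S (e of them), theta, Z (f of them).
# Cost c.z = mu * sum(S) + sum(Z), ie mu*||S||_1 + ||w||_1 once Z >= |w|.
c = np.concatenate([np.zeros(f), mu * np.ones(e), [0.0], np.ones(f)])

# Margin constraints, written in linprog's A_ub @ z <= b_ub form:
#   -y_i * (w.x_i - theta) - S_i <= -1
# which is w.x_pos + S >= theta + 1 and w.x_neg - S <= theta - 1.
A_margin = np.hstack([-y[:, None] * X, -np.eye(e), y[:, None], np.zeros((e, f))])
b_margin = -np.ones(e)

# 1-norm trick: w - Z <= 0 and -w - Z <= 0, so Z >= |w|.
A_abs = np.vstack([
    np.hstack([ np.eye(f), np.zeros((f, e)), np.zeros((f, 1)), -np.eye(f)]),
    np.hstack([-np.eye(f), np.zeros((f, e)), np.zeros((f, 1)), -np.eye(f)]),
])
b_abs = np.zeros(2 * f)

# w and theta are free; S and Z are non-negative.
bounds = [(None, None)] * f + [(0, None)] * e + [(None, None)] + [(0, None)] * f

res = linprog(c, A_ub=np.vstack([A_margin, A_abs]),
              b_ub=np.concatenate([b_margin, b_abs]), bounds=bounds)
w, S, theta = res.x[:f], res.x[f:f + e], res.x[f + e]
print("w =", w, " theta =", theta, " slacks =", S)
```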