Transcript of Slides
Second Order Learning
Koby Crammer, Department of Electrical Engineering
ECML PKDD 2013, Prague

Thanks
• Mark Dredze
• Alex Kulesza
• Avihai Mejer
• Edward Moroshko
• Francesco Orabona
• Fernando Pereira
• Yoram Singer
• Nina Vaitz

Tutorial Context
(figure: the tutorial sits at the intersection of online learning, SVMs, optimization theory, and real-world data)

Outline
• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive
• Second-Order Algorithms:
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad
• Properties:
  – Kernels
  – Analysis
• Empirical Evaluation:
  – Synthetic data
  – Real data

Online Learning
(three motivating figures: Tyrannosaurus rex; Triceratops; Velociraptor vs. Tyrannosaurus rex)

Formal Setting – Binary Classification
• Instances: images, sentences
• Labels: parse trees, names
• Prediction rule: linear prediction rules
• Loss: number of mistakes

Predictions
• Discrete predictions: hard to optimize
• Continuous predictions:
  – Label: the sign of the prediction
  – Confidence: the magnitude of the prediction

Loss Functions
• Natural loss:
  – Zero-one loss: 1 if the prediction differs from the true label, 0 otherwise
• Real-valued-prediction losses (with real-valued prediction p and label y):
  – Hinge loss: max(0, 1 - y·p)
  – Exponential loss (Boosting): exp(-y·p)
  – Log loss (Max Entropy, Boosting): log(1 + exp(-y·p))

Loss Functions
(figure: the hinge loss and the zero-one loss as functions of y·p)

Online Learning
The protocol, repeated every round:
• Maintain a model M
• Get an instance x
• Predict a label ŷ = M(x)
• Get the true label y
• Suffer a loss ℓ(ŷ, y)
• Update the model M

Linear Classifiers
• Any features
• W.l.o.g.
• Binary classifiers of the form f(x) = sign(w·x)

Notation Abuse

Linear Classifiers (cntd.)
• Prediction: ŷ = sign(w·x)
• Confidence in prediction: |w·x|

Linear Classifiers
(figure: the weight vector of the classifier and the input instance to be classified)

Margin
• The margin of an example (x, y) with respect to the classifier w is y(w·x)
• Note: the margin is positive iff the classifier predicts the label correctly
• A set of examples is separable iff there exists a w such that every example has a positive margin with respect to it

Geometrical Interpretation
(four figures illustrating the margin: margin << 0, margin < 0, margin > 0, margin >> 0)

Hinge Loss
(figure)
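To make the notation concrete, here is a minimal Python/NumPy sketch of the margin and the two losses above. It is an illustration added to the transcript, not code from the tutorial; the variable names are my own.

```python
import numpy as np

def margin(w, x, y):
    """Signed margin of example (x, y) w.r.t. the linear classifier w: y * (w . x)."""
    return y * np.dot(w, x)

def zero_one_loss(w, x, y):
    """1 if the prediction sign(w . x) disagrees with y, else 0."""
    return 1.0 if margin(w, x, y) <= 0 else 0.0

def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - y * (w . x)); upper-bounds the zero-one loss."""
    return max(0.0, 1.0 - margin(w, x, y))

# Tiny usage example with made-up numbers.
w = np.array([1.0, -0.5])
x = np.array([0.2, 0.1])
y = +1
print(margin(w, x, y), zero_one_loss(w, x, y), hinge_loss(w, x, y))
```

The hinge loss upper-bounds the zero-one loss, which is what makes it a convenient surrogate for the online updates that follow.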
Why Online Learning?
• Fast
• Memory efficient - processes one example at a time
• Simple to implement
• Formal guarantees – mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive
• But: not as good as a well-designed batch algorithm

Outline (recap)

The Perceptron Algorithm (Rosenblatt, 1958)
• If no mistake: do nothing
• If mistake: update w ← w + y·x
• Margin after the update: y(w·x) + ||x||^2, i.e., the margin grows by ||x||^2

Geometrical Interpretation
(figure: the Perceptron update)

Outline (recap)

Gradient Descent
• Consider the batch problem: minimize the total loss over the training set, L(w) = Σ_i ℓ(w; (x_i, y_i))
• Simple algorithm:
  – Initialize w_0
  – Iterate, for t = 1, 2, …
  – Compute the gradient ∇L(w_{t-1})
  – Set w_t = w_{t-1} - η_t ∇L(w_{t-1})

Stochastic Gradient Descent
• Consider the same batch problem
• Simple algorithm:
  – Initialize w_0
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – Compute the gradient of the loss on example i, ∇ℓ(w_{t-1}; (x_i, y_i))
  – Set w_t = w_{t-1} - η_t ∇ℓ(w_{t-1}; (x_i, y_i))
(figure)

Stochastic Gradient Descent – "Hinge" Loss
• With the "hinge" loss max(0, -y(w·x)), the Perceptron is a stochastic gradient descent algorithm on a sum of such losses, with a specific order of examples
• The gradient is -y·x when y(w·x) ≤ 0, and 0 otherwise
• Simple algorithm:
  – Initialize w_0
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – If y_i(w_{t-1}·x_i) ≤ 0 then set w_t = w_{t-1} + η_t y_i x_i, else set w_t = w_{t-1}

Outline (recap)

Motivation
• Perceptron: no guarantees on the margin after the update
• PA: enforce a minimal non-zero margin after the update
• In particular:
  – If the margin is large enough (at least 1), do nothing
  – If the margin is less than one unit, update so that the margin after the update is exactly one unit

Input Space
(figure)

Input Space vs. Version Space
• Input space:
  – Points are input examples
  – Each constraint is induced by a weight vector
  – Primal space
  – Half-space = all input examples that are classified correctly by a given predictor (weight vector)
• Version space:
  – Points are weight vectors
  – Each constraint is induced by an input example
  – Dual space
  – Half-space = all predictors (weight vectors) that classify a given input example correctly

Weight Vector (Version) Space
(figure: the algorithm forces w to reside in this region)

Passive Step
(figure: nothing to do, w already resides on the desired side)

Aggressive Step
(figure: the algorithm projects w onto the desired half-space)

Aggressive Update Step
• Set w_{t+1} to be the solution of the following optimization problem:
  minimize (1/2)||w - w_t||^2 subject to y_t(w·x_t) ≥ 1
• Solution: w_{t+1} = w_t + α_t y_t x_t with α_t = max(0, 1 - y_t(w_t·x_t)) / ||x_t||^2

Perceptron vs. PA
• Common update: w_{t+1} = w_t + α_t y_t x_t
• Perceptron: α_t = 1 (on mistakes only)
• Passive-Aggressive: α_t = max(0, 1 - y_t(w_t·x_t)) / ||x_t||^2

Perceptron vs. PA
(two figures comparing the update coefficient in three regimes: error; no error, small margin; no error, large margin)
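The following sketch contrasts the Perceptron and Passive-Aggressive updates summarized above. Both add α·y·x to the weight vector and differ only in the coefficient α. This is my own illustration; the margin-1 convention follows the slides, everything else is an assumption.

```python
import numpy as np

def perceptron_step(w, x, y):
    """Perceptron: update only on a mistake, with a fixed coefficient of 1."""
    if y * np.dot(w, x) <= 0:          # mistake (non-positive margin)
        w = w + y * x                  # alpha = 1
    return w

def pa_step(w, x, y):
    """Passive-Aggressive: enforce a margin of 1 after the update."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))    # hinge loss
    if loss > 0:                                # aggressive step
        alpha = loss / np.dot(x, x)             # closed-form solution of the projection
        w = w + alpha * y * x
    return w                                    # passive step: w unchanged

# After an aggressive PA step, the violating example attains margin exactly 1:
w = np.zeros(3)
x = np.array([1.0, 2.0, 0.0])
y = -1
w = pa_step(w, x, y)
print(y * np.dot(w, x))   # -> 1.0
```

After an aggressive step the violated example sits exactly on the margin-1 boundary, which is precisely the constraint the projection enforces.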
Outline (recap)

Geometrical Assumption
• All examples are bounded in a ball of radius R

Separability
• There exists a unit vector u that classifies the data correctly (with some positive margin γ)

Perceptron's Mistake Bound
• The number of mistakes the algorithm makes is bounded by (R/γ)^2
• Simple case: positive points and negative points separated by a hyperplane (figure); the bound is evaluated for this data

Geometrical Motivation
(figure)

SGD on such data
(figure)

Outline (recap)

Second Order Perceptron (Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005)
• Assume all inputs are given in advance
• Compute a "whitening" matrix from the correlation of the inputs
• Run the Perceptron on the "whitened" data
• New "whitening" matrix, updated online as examples arrive

Second Order Perceptron – Bound
• The mistake bound is expressed in terms of the eigenvalues of the data correlation matrix rather than the radius R alone
• Same simple case as before: the bound is evaluated for this data and improves on the Perceptron's

Second Order Perceptron – Algorithm
• If no mistake: do nothing
• If mistake: update both the weight vector and the "whitening" matrix
(a code sketch of this algorithm appears at the end of this block)

SGD on whitened data
(figure)

Outline (recap)

Span-based Update Rules
• The weight vector is a linear combination of the examples: the weight of feature f is w_f = Σ_t α_t y_t x_{t,f}, where α_t is the learning rate, y_t the target label (either -1 or 1), and x_{t,f} the feature value of the input instance
• Two learning-rate schedules (among many others):
  – Perceptron algorithm: conservative
  – Passive-Aggressive
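Below is a rough sketch of the online second-order Perceptron described a few slides back, in the spirit of Cesa-Bianchi, Conconi, and Gentile (2005): keep a regularized correlation matrix of the mistake examples and predict with the "whitened" weight vector. The regularization parameter a and the class interface are my assumptions, not details from the slides.

```python
import numpy as np

class SecondOrderPerceptron:
    """Online second-order Perceptron (sketch). a > 0 regularizes the correlation matrix."""

    def __init__(self, dim, a=1.0):
        self.v = np.zeros(dim)          # weighted sum of mistake examples
        self.S = a * np.eye(dim)        # regularized correlation matrix of mistake examples

    def predict(self, x):
        # Predict with the current example temporarily added to the correlation matrix.
        S_aug = self.S + np.outer(x, x)
        w = np.linalg.solve(S_aug, self.v)     # "whitened" weight vector
        return 1 if np.dot(w, x) >= 0 else -1

    def update(self, x, y):
        if self.predict(x) != y:        # mistake-driven, like the first-order Perceptron
            self.v += y * x
            self.S += np.outer(x, x)

# Usage on a toy stream of (x, y) pairs with highly non-isotropic inputs:
rng = np.random.default_rng(0)
sop = SecondOrderPerceptron(dim=2)
for _ in range(100):
    x = rng.normal(size=2) * np.array([5.0, 0.1])
    y = 1 if x[0] + x[1] >= 0 else -1
    sop.update(x, y)
```

On non-isotropic data such as this toy stream, the whitening step is what lets the bound depend on the eigenvalues of the correlation matrix rather than on the radius R.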
Sentiment Classification (Pang, Lee, Vaithyanathan, EMNLP 2002)
• Example review: "Who needs this Simpsons book? You DOOOOOOOO. This is one of the most extraordinary volumes I've ever encountered … Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … Very highly recommended!"

Sentiment Classification (cntd.)
• Many positive reviews with the word "best" increase the weight w_best
• Later, a negative review arrives: "boring book – best if you want to sleep in seconds"
• A linear update will reduce both w_best and w_boring
• But "best" appeared more often than "boring"
• The model knows more about "best" than about "boring"
• Better to reduce the weights of different words at different rates

Natural Language Processing
• Big datasets, large number of features
• Many features are only weakly correlated with the target label
• Linear classifiers: features are associated with word counts
• Heavy-tailed feature distribution
(figure: feature counts vs. feature rank)

New Prediction Models
• Gaussian distributions over weight vectors: w ~ N(μ, Σ)
• The covariance Σ is either full or diagonal
• In NLP we have many features, so we use a diagonal covariance

Classification
• Given a new example x
• Stochastic:
  – Draw a weight vector w ~ N(μ, Σ)
  – Make a prediction sign(w·x)
• Collective:
  – Average weight vector
  – Average margin
  – Average prediction
  (all three lead to the same decision, sign(μ·x))

The Margin is a Random Variable
• The signed margin y(w·x) is a random 1-d Gaussian: mean y(μ·x), variance x^T Σ x
• Thus the probability of predicting the correct label has a closed form in terms of the Gaussian CDF

Linear Model vs. Distribution over Linear Models
(figure: an example, the mean weight vector, and the distribution over linear models)

Weight Vector (Version) Space
(figure: the algorithm forces most of the probability mass of w to reside in this region)

Passive Step
(figure: nothing to do, most of the weight vectors already classify the example correctly)

Aggressive Step
(figure: the algorithm projects the current Gaussian distribution onto the desired half-space; the mean is moved beyond the mistake line (large margin), and the covariance is shrunk in the direction of the input example)

Projection Update
• Vectors (aka PA): project the current weight vector onto the set of vectors with margin at least 1
• Distributions (new update): project the current Gaussian onto the set of distributions that classify the example correctly with probability at least η (the confidence parameter)

Divergence
• The KL divergence between Gaussians is a sum of two divergences over the parameters:
  – a matrix Itakura-Saito divergence between the covariances
  – a Mahalanobis distance between the means
• Convex in both arguments simultaneously

Constraint
• Probabilistic constraint: Pr_{w ~ N(μ,Σ)}[ y(w·x) ≥ 0 ] ≥ η
• Equivalent margin constraint: y(μ·x) ≥ φ sqrt(x^T Σ x), where φ = Φ^{-1}(η)
• Convex in μ, concave in Σ
• Solutions:
  – Linear approximation
  – Change variables to get a convex formulation
  – Relax (AROW)

Convexity (Dredze, Crammer, Pereira, ICML 2008; Crammer, Dredze, Pereira, NIPS 2008; Crammer, Dredze, Kulesza, NIPS 2009)
• Change variables
• Equivalent convex formulation

AROW (Crammer, Dredze, Kulesza, NIPS 2009)
• PA: squared-distance regularization to the previous weight vector plus the hinge loss on the new one
• CW: KL regularization to the previous Gaussian plus the probabilistic margin constraint
• AROW: relax the constraint into KL regularization plus a squared-hinge loss term and a confidence term
• Similar update form as CW

The Update
• The optimization problem can be solved analytically
• The coefficients of the update depend on the specific algorithm

Definitions and Updates
• Variants: CW (linearization), CW (change of variables), AROW
• The update gives a per-feature learning rate
• The learning rate is reduced over time, as are the eigenvalues of the covariance matrix
(a code sketch of the diagonal AROW update appears later in the transcript, just before the Experimental Design slides)

Diagonal Matrix
• Given a matrix A, define diag(A) to be only the diagonal part of the matrix
• Option 1: make the matrix diagonal
• Option 2: make its inverse diagonal

Outline (recap)

(Back to) Stochastic Gradient Descent
• Consider the batch problem
• Simple algorithm:
  – Initialize w_0
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – Compute the gradient g_t of the loss on example i
  – Set w_t = w_{t-1} - η_t g_t

Adaptive Stochastic Gradient Descent (Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010)
• Consider the batch problem
• Simple algorithm:
  – Initialize w_0 and A_0
  – Iterate, for t = 1, 2, …
  – Pick a random index i
  – Compute the gradient g_t of the loss on example i
  – Set A_t = A_{t-1} + g_t g_t^T (or its diagonal)
  – Set w_t = w_{t-1} - η A_t^{-1/2} g_t
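A minimal sketch of the diagonal adaptive (AdaGrad-style) update just described, applied to the hinge loss. The step size eta and the small constant eps (added for numerical stability) are my choices, not values from the tutorial.

```python
import numpy as np

def adagrad_hinge(X, Y, passes=5, eta=0.5, eps=1e-8):
    """Diagonal AdaGrad on the hinge loss max(0, 1 - y * (w . x)) (sketch)."""
    n, d = X.shape
    w = np.zeros(d)
    G = np.zeros(d)                          # running sum of squared gradient coordinates
    rng = np.random.default_rng(0)
    for _ in range(passes):
        for i in rng.permutation(n):         # pick examples in random order
            x, y = X[i], Y[i]
            if y * np.dot(w, x) < 1.0:       # subgradient of the hinge loss
                g = -y * x
            else:
                g = np.zeros(d)
            G += g * g                       # per-feature accumulation (diagonal A)
            w -= eta * g / (np.sqrt(G) + eps)   # per-feature learning rate
    return w
```

Coordinates that receive many large gradients are damped, while rarely-updated coordinates keep a large effective step size – the per-feature learning-rate behavior the second-order methods share.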
Adaptive Stochastic Gradient Descent (cntd.)
• Very general! Can be used to solve problems with various regularizations
• The matrix A can be either full or diagonal
• Comes with convergence and regret bounds
• Similar performance to AROW

Adaptive Stochastic Gradient Descent
(figures: SGD vs. AdaGrad)

Outline (recap)

Kernels
• The algorithms can be kernelized, as shown next

Proof
• Show that the mean (and the covariance) can be written as combinations of the input examples
• By induction

Proof (cntd.)
• By the update rule, the coefficients after an update are expressed in terms of the previous coefficients and inner products of the inputs
• Thus the updates and predictions can be computed from inner products (kernels) only

Outline (recap)

Statistical Interpretation
• Margin constraint: y(μ·x) ≥ φ sqrt(x^T Σ x)
• Distribution over weight vectors: w ~ N(μ, Σ)
• Equivalent view: assume the input is corrupted with Gaussian noise
(figure: version space and input space, showing the mean weight vector, a good and a bad realization, the input instance, and the linear separator)

Mistake Bound (Orabona and Crammer, NIPS 2010)
• For any reference weight vector, the number of mistakes made by AROW is upper bounded by a quantity that depends on the hinge losses of the reference vector and on the correlation matrix of the inputs, where:
  – one index set contains the examples with a mistake
  – the other contains the examples with an update but not a mistake

Comment I
• Separable case and no non-mistake updates: the bound simplifies

Comment II
• For large r the bound becomes the Perceptron's bound
• When no updates are performed: the Perceptron

Bound for the Diagonal Algorithm (Orabona and Crammer, NIPS 2010)
• The number of mistakes is bounded by a quantity that is low when a feature is either rare or non-informative
• Exactly the situation in NLP …

Outline (recap)

Synthetic Data
• 20 features
• 2 informative features (a rotated, skewed Gaussian)
• 18 noisy features
• Using a single feature is as good as a random prediction

Synthetic Data (cntd.)
(figure: the distribution after 50 examples, feature x1)

Synthetic Data (no noise)
(figures: Perceptron, PA, SOP, CW-full, CW-diag)

Synthetic Data (10% noise)
(figures)

Outline (recap)

Data
• Sentiment
  – Sentiment reviews from 6 Amazon domains (Blitzer et al.)
  – Classify a product review as either positive or negative
• Reuters, pairs of labels
  – Three divisions: Insurance (Life vs. Non-Life), Business Services (Banking vs. Financial), Retail Distribution (Specialist Stores vs. Mixed Retail)
  – Bag-of-words representation with binary features
• 20 Newsgroups, pairs of labels
  – Three divisions: comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware, sci.electronics vs. sci.med, and talk.politics.guns vs. talk.politics.mideast
  – Bag-of-words representation with binary features
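Before the experimental slides, here is the diagonal AROW sketch promised earlier. It implements the closed-form mean/covariance update of Crammer, Dredze, and Kulesza (NIPS 2009) as I recall it, restricted to a diagonal covariance; the parameter r and the choice to update only when the hinge loss is positive are standard but should be checked against the paper.

```python
import numpy as np

class DiagonalAROW:
    """AROW with a diagonal covariance matrix (sketch)."""

    def __init__(self, dim, r=1.0):
        self.mu = np.zeros(dim)      # mean weight vector
        self.sigma = np.ones(dim)    # diagonal of the covariance matrix
        self.r = r                   # regularization / confidence parameter

    def predict(self, x):
        return 1 if np.dot(self.mu, x) >= 0 else -1

    def update(self, x, y):
        margin = y * np.dot(self.mu, x)
        confidence = np.dot(self.sigma * x, x)        # x^T Sigma x for diagonal Sigma
        if margin < 1.0:                              # hinge loss is positive -> update
            beta = 1.0 / (confidence + self.r)
            alpha = (1.0 - margin) * beta
            self.mu += alpha * y * self.sigma * x     # per-feature step along Sigma x
            self.sigma -= beta * (self.sigma * x) ** 2   # shrink variance of active features
```

Frequent features quickly acquire a small variance and therefore small future updates, while rare features keep a large variance – the w_best vs. w_boring behavior that motivated these models.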
Experimental Design
• Online-to-batch:
  – Multiple passes over the training data
  – Evaluate on a separate test set after each pass
  – Compute error/accuracy
• Parameters set using held-out data
• 10-fold cross-validation
• ~2000 instances per problem
• Balanced class labels
(a code sketch of this protocol appears at the end of the transcript)

Results vs. Online Baselines – Sentiment
• StdDev and Variance: always better than the baseline
• Variance: 5/6 significantly better

Results vs. Online Baselines – 20NG + Reuters
• StdDev and Variance: always better than the baseline
• Variance: 4/6 significantly better

Results vs. Batch – Sentiment
• Always better than the batch methods
• 3/6 significantly better

Results vs. Batch – 20NG + Reuters
• 5/6 better than the batch methods
• 3/5 significantly better, 1/1 significantly worse

(three additional result slides, figures only)

Results – Sentiment
(figures: accuracy vs. passes of training data, PA vs. CW for each dataset)
• CW is better (5/6 cases), statistically significant (4/6)
• CW benefits less from many passes

Results – Reuters + 20NG
(figures: accuracy vs. passes of training data, PA vs. CW for each dataset)
• CW is better (5/6 cases), statistically significant (4/6)
• CW benefits less from many passes

Error Reduction by Multiple Passes
• PA benefits more from multiple passes (8/12)
• The amount of benefit is data dependent

Bayesian Logistic Regression (T. Jaakkola and M. Jordan, 1997)
• BLR: maintains a mean and a covariance; based on the variational approximation
• CW/AROW: maintains a mean and a covariance; conceptually decoupled update that is a function of the margin/hinge loss

Algorithms Summary
• First-order algorithms and their second-order counterparts:
  – Perceptron → SOP
  – PA → CW + AROW
  – SGD → AdaGrad
  – Logistic Regression (LR) → BLR
• Different motivations, similar algorithms
• All algorithms can be kernelized
• Work well for data that is NOT isotropic/symmetric
• State-of-the-art results in various domains
• Accompanied by theory
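Finally, a sketch of the online-to-batch evaluation protocol from the Experimental Design slide: several passes of an online learner over the training data, with test accuracy recorded after each pass. The function signature and the learner interface are placeholders of mine; parameter selection on held-out data and 10-fold cross-validation would wrap around this loop, as described in the slides.

```python
import numpy as np

def evaluate(update_fn, w0, X_train, Y_train, X_test, Y_test, passes=10, seed=0):
    """Run `passes` epochs of an online learner and report test accuracy after each pass."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    accuracies = []
    for _ in range(passes):
        for i in rng.permutation(len(Y_train)):      # one online pass over the training data
            w = update_fn(w, X_train[i], Y_train[i])
        preds = np.sign(X_test @ w)                  # evaluate on a separate test set
        accuracies.append(np.mean(preds == Y_test))
    return accuracies
```

For example, `update_fn` could be the `perceptron_step` or `pa_step` function from the earlier sketches.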