Transcript Slide 1

Online Learning Algorithms


Outline

• Online learning framework
• Design principles of online learning algorithms (additive updates)
   Perceptron, Passive-Aggressive and Confidence-Weighted classification
   Classification: binary, multi-class and structured prediction
   Hypothesis averaging and regularization
• Multiplicative updates
   Weighted Majority, Winnow, and connections to Gradient Descent (GD) and Exponentiated Gradient Descent (EGD)

Formal setting – Classification

• Instances
   Images, sentences
• Labels
   Parse tree, names
• Prediction rule
   Linear prediction rule
• Loss
   No. of mistakes

Predictions

• Continuous predictions:
   Label
   Confidence
• Linear classifiers:
   Prediction: sign(w · x)
   Confidence: |w · x|

Loss Functions

• Natural loss:
   Zero-one loss
• Real-valued-predictions loss:
   Hinge loss
   Exponential loss (Boosting)
(formulas sketched below)
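A sketch of these losses in standard form, writing \hat{y} = w \cdot x for the real-valued prediction (notation assumed, not taken from the slide):

    \ell_{0/1}(y, \hat{y}) = \mathbf{1}[\, y\,\hat{y} \le 0 \,]
    \ell_{\mathrm{hinge}}(y, \hat{y}) = \max(0,\ 1 - y\,\hat{y})
    \ell_{\exp}(y, \hat{y}) = \exp(-y\,\hat{y})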

Loss Functions

[Figure: the zero-one loss and the hinge loss plotted as functions of the margin]

Online Framework

• Initialize the classifier
• The algorithm works in rounds
• On round t the online algorithm:
   receives an input instance
   outputs a prediction
   receives a feedback label
   computes the loss
   updates the prediction rule
• Goal: suffer small cumulative loss (formalized in the sketch below)
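A minimal formalization of the goal, with w_t the classifier used on round t and \ell the per-round loss (notation assumed):

    \mathrm{CumLoss}_T = \sum_{t=1}^{T} \ell\big(w_t;\ (x_t, y_t)\big)

Mistake-bound analyses compare this quantity to the cumulative loss of the best fixed classifier u chosen in hindsight.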

Margin

Margin of an example (x, y) with respect to the classifier w (see the sketch below):
• Note: the prediction is correct iff the margin is positive
• The set is separable iff there exists u such that every example has positive margin with respect to u
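A sketch of the corresponding formulas, assuming the linear-classifier notation used throughout:

    \gamma\big(w;\ (x_t, y_t)\big) = y_t\,(w \cdot x_t)
    \text{separable} \iff \exists\, u\ \text{such that}\ y_t\,(u \cdot x_t) > 0\ \ \forall t

(Separability is often stated with a strictly positive margin \gamma > 0 for all examples.)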

Geometrical Interpretation

[Figure: examples with margin << 0, < 0, > 0 and >> 0 relative to the separating hyperplane]

Hinge Loss


Why Online Learning?

• Fast
• Memory efficient: processes one example at a time
• Simple to implement
• Formal guarantees: mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive

Update Rules

• Online algorithms are based on an update rule which defines the new classifier from the current one (and possibly other information)
• Linear classifiers: find the new weight vector from the current one based on the input
• Some update rules:
   Perceptron (Rosenblatt)
   ALMA (Gentile)
   ROMMA (Li & Long)
   NORMA (Kivinen et al.)
   MIRA (Crammer & Singer)
   EG (Littlestone & Warmuth)
   Bregman-based (Warmuth)
   CW (Dredze et al.)

Design Principles of Algorithms

• If the learner suffers non-zero loss at any round, then we want to balance two goals:

   Corrective: change the weights enough so that we don't make this error again (1)

   Conservative: don't change the weights too much (2)

• How do we define "too much"?

Design Principles of Algorithms

• If we use Euclidean distance to measure the change between the old and new weights:
   enforcing (1) and minimizing (2) gives, e.g., the Perceptron, and for the squared loss, Widrow-Hoff (Least Mean Squares)
• Passive-Aggressive algorithms do exactly the same, except that (1) is much stronger: we want a correct classification with a margin of at least 1
• Confidence-Weighted classifiers:
   maintain a distribution over weight vectors
   (1) is the same as in Passive-Aggressive, with a probabilistic notion of margin
   the change is measured by the KL divergence between the two distributions

Design Principles of Algorithms

• If we assume all weights are positive:
   we can use the (unnormalized) KL divergence to measure the change
   this gives the multiplicative update, or EG, algorithm (Kivinen and Warmuth)
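A sketch of the multiplicative (EG) update for positive weights, with learning rate \eta and normalizer Z_t (my notation; compare the additive gradient-descent step w_{t+1} = w_t - \eta\,\nabla\ell):

    w_{t+1,i} = \frac{w_{t,i}\ \exp\!\big(-\eta\,\nabla_i\,\ell(w_t;\ (x_t, y_t))\big)}{Z_t}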

The Perceptron Algorithm

• If no mistake: do nothing
• If mistake: update w_{t+1} = w_t + y_t x_t
• Margin after the update: y_t (w_{t+1} · x_t) = y_t (w_t · x_t) + ||x_t||², i.e., the margin on the current example increases
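A minimal NumPy sketch of the binary Perceptron described above; the data format (pairs of feature vector and label in {-1, +1}) is an assumption.

    import numpy as np

    def perceptron(examples, n_features):
        """Online Perceptron: additive update only on mistakes."""
        w = np.zeros(n_features)
        for x, y in examples:        # x: np.ndarray, y: +1 or -1
            margin = y * (w @ x)     # signed margin under current weights
            if margin <= 0:          # mistake (or zero margin)
                w = w + y * x        # corrective additive update
        return w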

Passive-Aggressive Algorithms


Passive-Aggressive: Motivation

• Perceptron: no guarantee on the margin after the update
• PA: enforce a minimal non-zero margin after the update
• In particular:
   if the margin is large enough (at least 1), then do nothing
   if the margin is less than 1, update such that the margin after the update is enforced to be 1

Aggressive Update Step

• Set the new weight vector to be the solution of the following optimization problem: stay close to the current weights (2), subject to attaining a margin of at least 1 on the current example (1)
• Closed-form update: w_{t+1} = w_t + τ_t y_t x_t, where τ_t is given in the sketch below
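A sketch of the standard Passive-Aggressive problem and its closed form (Crammer et al.); the slide's own notation may differ slightly:

    w_{t+1} = \arg\min_{w}\ \tfrac{1}{2}\,\|w - w_t\|^2 \quad\text{(2)} \qquad \text{s.t.}\quad y_t\,(w \cdot x_t) \ge 1 \quad\text{(1)}

    w_{t+1} = w_t + \tau_t\, y_t\, x_t, \qquad \tau_t = \frac{\max\big(0,\ 1 - y_t\,(w_t \cdot x_t)\big)}{\|x_t\|^2}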

Passive-Aggressive Update


Unrealizable Case


Confidence Weighted Classification


Confidence-Weighted Classification: Motivation

• Many positive reviews contain the word "best" → w_best grows large
• Later, a negative review arrives: "boring book – best if you want to sleep in seconds"
• A linear update will reduce both w_best and w_boring
• But "best" has appeared far more often than "boring"
• How can we adjust different weights at different rates?

Update Rules

• The weight vector is a linear combination of the examples
• Two rate schedules (among others), sketched below:
   Perceptron algorithm (conservative)
   Passive-Aggressive
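A sketch of the two schedules, writing the weight vector as a linear combination of the examples seen so far (standard forms; the coefficients α_i are my notation):

    w_{t+1} = \sum_{i \le t} \alpha_i\, y_i\, x_i
    \text{Perceptron (conservative):}\quad \alpha_i = 1 \text{ on mistake rounds},\ 0 \text{ otherwise}
    \text{Passive-Aggressive:}\quad \alpha_i = \tau_i = \max\big(0,\ 1 - y_i\,(w_i \cdot x_i)\big)\,/\,\|x_i\|^2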

Distributions in Version Space

[Figure: a Gaussian distribution over weight vectors in version space, showing the mean weight vector and an example]

Margin as a Random Variable

• The signed margin is a Gaussian-distributed random variable
• Thus (see the sketch below):
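In the Confidence-Weighted setting the weight vector is drawn from a Gaussian, w ~ N(μ, Σ), so the signed margin is itself Gaussian; a sketch of the resulting distribution:

    M = y\,(w \cdot x) \sim \mathcal{N}\big(\,y\,(\mu \cdot x),\ \ x^{\top}\Sigma\,x\,\big)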

PA-like Update

• PA:
• New update (see the sketch below):
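A sketch of the Confidence-Weighted analogue of the PA step (Dredze, Crammer & Pereira): the unit-margin constraint is replaced by a probabilistic margin constraint with confidence parameter η.

    (\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu,\,\Sigma}\ D_{\mathrm{KL}}\!\big(\mathcal{N}(\mu, \Sigma)\ \big\|\ \mathcal{N}(\mu_t, \Sigma_t)\big)
    \text{s.t.}\quad \Pr_{w \sim \mathcal{N}(\mu, \Sigma)}\big[\,y_t\,(w \cdot x_t) \ge 0\,\big] \ \ge\ \eta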

Weight Vector (Version) Space

Place most of the probability mass in this region

Passive Step

Nothing to do, most weight vectors already classify the example correctly

Aggressive Step

• The mean moved past the mistake line (large margin)
• The covariance is shrunk in the direction of the new example
• Project the current Gaussian distribution onto the half-space

Extensions:

Multi-class and Structured Prediction


Multiclass Representation I

• k prototypes: one weight vector per class
• New instance x
• Compute a score for each class r:

   Class r:   1       2      3      4
   Score:    -1.08   1.66   0.37  -2.09

• Prediction: the class achieving the highest score (here, class 2)
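With a prototype (weight vector) w_r per class, the prediction rule sketched above is:

    \hat{y} = \arg\max_{r \in \{1, \dots, k\}}\ \ w_r \cdot x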

Multiclass Representation II

• Map inputs and labels into a joint feature vector space F, e.g. for sequence labeling:

   F(Estimated volume was a light 2.4 million ounces . ,  B I O B I I I I O) = (0 1 1 0 …)

• Score labelings by projecting the corresponding feature vector onto the weight vector

Multiclass Representation II

• Predict the labeling with the highest score (inference)
• Naïve search is expensive if the set of possible labelings is large:

   Estimated volume was a light 2.4 million ounces .
   B I O B I I I I O

   No. of labelings = 3^(no. of words)
• Efficient Viterbi decoding for sequences!

Two Representations

• Weight vector per class (Representation I)
   intuitive
   improved algorithms
• Single weight vector (Representation II)
   generalizes Representation I:  F(x, 4) = (0  0  0  x  0)
   allows complex interactions between input and output

Margin for Multi Class

• Binary:
• Multi class:

(both margins are sketched below)
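A sketch of the two margin definitions, using the joint feature map F of Representation II for the multiclass case (notation assumed):

    \text{Binary:}\qquad \gamma = y\,(w \cdot x)
    \text{Multiclass:}\qquad \gamma = w \cdot F(x, y)\ -\ \max_{y' \ne y}\ w \cdot F(x, y')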

Margin for Multi Class

• But different mistakes cost differently (this is the loss function), so use it!
• Margin scaled by the loss function (sketched below):
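One common way to scale the margin requirement by the loss ρ(y, y′); some formulations use √ρ instead, and the slide's exact choice is not stated:

    w \cdot F(x, y)\ -\ w \cdot F(x, y')\ \ge\ \rho(y, y') \qquad \forall\, y' \ne y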

Perceptron Multiclass online algorithm

• Initialize
• For each round t:
   receive an input instance
   output a prediction
   receive a feedback label
   compute the loss
   update the prediction rule (see the sketch below)
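A sketch of the multiclass Perceptron update in the joint-feature notation, applied on rounds where the predicted label \hat{y}_t differs from y_t:

    w_{t+1} = w_t + F(x_t, y_t) - F(x_t, \hat{y}_t)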

PA Multiclass online algorithm

• Initialize
• For each round t:
   receive an input instance
   output a prediction
   receive a feedback label
   compute the loss
   update the prediction rule (see the sketch below)
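A sketch of the corresponding PA update; the step size τ_t follows the same pattern as in the binary case, with ℓ_t the (loss-scaled) hinge loss suffered on round t. The exact variant may differ from the slides.

    w_{t+1} = w_t + \tau_t\,\big(F(x_t, y_t) - F(x_t, \hat{y}_t)\big), \qquad
    \tau_t = \frac{\ell_t}{\big\|F(x_t, y_t) - F(x_t, \hat{y}_t)\big\|^2}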

Regularization

• Key idea:
   if an online algorithm works well on a sequence of i.i.d. examples, then an ensemble of the online hypotheses should generalize well
• Popular choices:
   the averaged hypothesis (sketched below)
   the majority vote
   use a validation set to make a choice
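A sketch of the averaged hypothesis over the T rounds of the online run:

    \bar{w} = \frac{1}{T} \sum_{t=1}^{T} w_t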