Transcript Slide 1

Second Order Learning
Koby Crammer
Department of Electrical Engineering
ECML PKDD 2013, Prague
Thanks
• Mark Dredze
• Alex Kulesza
• Avihai Mejer
• Edward Moroshko
• Francesco Orabona
• Fernando Pereira
• Yoram Singer
• Nina Vaitz
Tutorial Context
[Diagram: this tutorial sits at the intersection of online learning, SVMs, optimization theory, and real-world data]
Outline
• Background:
– Online learning + notation
– Perceptron
– Stochastic-gradient descent
– Passive-aggressive
• Second-Order Algorithms
– Second order Perceptron
– Confidence-Weighted and AROW
– AdaGrad
• Properties
– Kernels
– Analysis
• Empirical Evaluation
– Synthetic
– Real Data
4
Online Learning
Tyrannosaurus rex
5
Online Learning
Triceratops
6
Online Learning
Velociraptor
Tyrannosaurus rex
7
Formal Setting – Binary Classification
• Instances
– Images, Sentences
• Labels
– Parse tree, Names
• Prediction rule
– Linear prediction rules
• Loss
– No. of mistakes
8
Predictions
• Discrete Predictions:
– Hard to optimize
• Continuous predictions :
– Label
– Confidence
9
Loss Functions
• Natural Loss:
– Zero-One loss:
• Real-valued-predictions loss:
– Hinge loss:
– Exponential loss (Boosting)
– Log loss (Max Entropy, Boosting)
10
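The loss formulas were equation images in the original slides; below is a minimal sketch of the losses named above, assuming labels y ∈ {−1, +1} and a real-valued score s = w·x (function names are illustrative):

import math

def zero_one_loss(y, s):
    """1 if the sign of the score disagrees with the label, else 0."""
    return 1.0 if y * s <= 0 else 0.0

def hinge_loss(y, s):
    """max(0, 1 - y*s): zero only when the margin is at least one unit."""
    return max(0.0, 1.0 - y * s)

def exp_loss(y, s):
    """exp(-y*s), the loss used by boosting."""
    return math.exp(-y * s)

def log_loss(y, s):
    """log(1 + exp(-y*s)), the loss behind max-entropy / logistic models."""
    return math.log1p(math.exp(-y * s))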
Loss Functions
[Plot: hinge loss and zero-one loss as a function of the signed margin]
Online Learning
Maintain Model M
Get Instance x
Predict Label ŷ = M(x)
Get True Label y
Suffer Loss l(ŷ, y)
Update Model M
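A minimal code sketch of this protocol; the linear model, 0/1 loss, and update rule used here are placeholder assumptions, not a specific algorithm from the tutorial:

import numpy as np

def online_learning(stream, d, update):
    """Run the generic online protocol: predict, get the true label,
    suffer a loss, and update the model, one example at a time."""
    w = np.zeros(d)                    # maintain model M (here: a linear model)
    mistakes = 0
    for x, y in stream:                # get instance x, then its true label y
        y_hat = np.sign(w @ x) or 1.0  # predict label y_hat = M(x)
        mistakes += int(y_hat != y)    # suffer loss l(y_hat, y) (0/1 loss here)
        w = update(w, x, y)            # update model M
    return w, mistakes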
Linear Classifiers
• Any Features
• W.l.o.g.
• Binary Classifiers of the form
Notation abuse
Linear Classifiers (cntd.)
• Prediction :
• Confidence in prediction:
15
Linear Classifiers
[Figure: input instance to be classified, and the weight vector of the classifier]
Margin
• Margin of an example with respect to the classifier:
• Note:
• The set is separable iff there exists a weight vector such that every example has positive margin
Geometrical Interpretation
[Sequence of figures: the separating hyperplane and examples with margin << 0, margin < 0, margin > 0, and margin >> 0]
Hinge Loss
22
Why Online Learning?
• Fast
• Memory efficient – process one example at a time
• Simple to implement
• Formal guarantees – mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive
• Not as good as a well-designed batch algorithm
23
Outline
• Background:
– Online learning + notation
– Perceptron
– Stochastic-gradient descent
– Passive-aggressive
• Second-Order Algorithms
– Second order Perceptron
– Confidence-Weighted and AROW
– AdaGrad
• Properties
– Kernels
– Analysis
• Empirical Evaluation
– Synthetic
– Real Data
24
Rosenblatt, 1958
The Perceptron Algorithm
• If No-Mistake
– Do nothing
• If Mistake
– Update
• Margin after update :
25
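A minimal sketch of the Perceptron rule above, assuming labels in {−1, +1} and NumPy vectors:

import numpy as np

def perceptron_update(w, x, y):
    """If the prediction is a mistake, add y*x to the weights; otherwise do nothing."""
    if y * (w @ x) <= 0:     # mistake (or zero margin)
        w = w + y * x        # update: the margin on (x, y) grows by ||x||^2
    return w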
Geometrical Interpretation
26
Outline
• Background:
– Online learning + notation
– Perceptron
– Stochastic-gradient descent
– Passive-aggressive
• Second-Order Algorithms
– Second order Perceptron
– Confidence-Weighted and AROW
– AdaGrad
• Properties
– Kernels
– Analysis
• Empirical Evaluation
– Synthetic
– Real Data
27
Gradient Descent
• Consider the batch problem
• Simple algorithm:
– Initialize
– Iterate, for
– Compute
– Set
28
Stochastic Gradient Descent
• Consider the batch problem
• Simple algorithm:
– Initialize
– Iterate, for
– Pick a random index
– Compute
– Set
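A sketch of this stochastic-gradient loop for a generic differentiable loss; the objective, gradient routine, and 1/√t step size are illustrative assumptions:

import numpy as np

def sgd(X, Y, grad_loss, T=1000, eta=0.1, seed=0):
    """Minimize (1/n) * sum_i loss(w; x_i, y_i) by following the gradient
    of one randomly picked example at a time."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                      # initialize
    for t in range(1, T + 1):            # iterate
        i = rng.integers(n)              # pick a random index
        g = grad_loss(w, X[i], Y[i])     # compute the (sub)gradient on example i
        w = w - (eta / np.sqrt(t)) * g   # set: step in the negative gradient direction
    return w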
Stochastic Gradient Descent
• “Hinge” loss
• The gradient
• Simple algorithm:
  – Initialize
  – Iterate, for
  – Pick a random index
  – If … then … else …
  – Set
The Perceptron is a stochastic gradient descent algorithm on a sum of “hinge” losses, with a specific order of examples.
Outline
• Background:
– Online learning + notation
– Perceptron
– Stochastic-gradient descent
– Passive-aggressive
• Second-Order Algorithms
– Second order Perceptron
– Confidence-Weighted and AROW
– AdaGrad
• Properties
– Kernels
– Analysis
• Empirical Evaluation
– Synthetic
– Real Data
33
Motivation
• Perceptron: no guarantees on the margin after the update
• PA: enforce a minimal non-zero margin after the update
• In particular:
  – If the margin is large enough (1), then do nothing
  – If the margin is less than one unit, update such that the margin after the update is forced to be one unit
34
Input Space
35
Input Space vs. Version Space
• Input Space:
  – Points are input data
  – One constraint is induced by a weight vector
  – Primal space
  – Half space = all input examples that are classified correctly by a given predictor (weight vector)
• Version Space:
  – Points are weight vectors
  – One constraint is induced by an input example
  – Dual space
  – Half space = all predictors (weight vectors) that classify a given input example correctly
Weight Vector (Version) Space
The algorithm forces the weight vector to reside in this region
37
Passive Step
Nothing to do: the weight vector already resides on the desired side.
Aggressive Step
The algorithm projects the weight vector onto the desired half-space
Aggressive Update Step
• Set the new weight vector to be the solution of the following optimization problem:
• Solution:
40
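A sketch of the closed-form solution of this aggressive step, in its hard-margin PA form (the PA-I/PA-II variants additionally cap the step size); labels are assumed to be in {−1, +1}:

import numpy as np

def pa_update(w, x, y):
    """Passive-Aggressive: smallest change to w that gives example (x, y) unit margin."""
    loss = max(0.0, 1.0 - y * (w @ x))   # hinge loss of the current weights
    if loss == 0.0:
        return w                         # passive step: margin already >= 1
    tau = loss / (x @ x)                 # aggressive step: project onto the half-space
    return w + tau * y * x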
Perceptron vs. PA
• Common Update :
• Perceptron
• Passive-Aggressive
41
Perceptron vs. PA
[Figure: update-step size as a function of the margin, in three regimes: error, no-error with small margin, and no-error with large margin]
Outline
• Background:
– Online learning + notation
– Perceptron
– Stochastic-gradient descent
– Passive-aggressive
• Second-Order Algorithms
– Second order Perceptron
– Confidence-Weighted and AROW
– AdaGrad
• Properties
– Kernels
– Analysis
• Empirical Evaluation
– Synthetic
– Real Data
44
Geometrical Assumption
• All examples are bounded in a ball of radius R
45
Separability
• There exists a unit vector that classifies the data correctly
Perceptron’s Mistake Bound
• The number of mistakes the algorithm makes is bounded by
• Simple case: positive points, negative points
• Separating hyperplane
• Bound is:
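The bound itself was an equation image; assuming all examples lie in a ball of radius R and some unit vector separates them with margin γ, the classical statement is
\[ \#\text{mistakes} \;\le\; \frac{R^2}{\gamma^2}. \]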
Geometrical Motivation
48
SGD on such data
49
Outline
• Background:
– Online learning + notation
– Perceptron
– Stochastic-gradient descent
– Passive-aggressive
• Second-Order Algorithms
– Second order Perceptron
– Confidence-Weighted and AROW
– AdaGrad
• Properties
– Kernels
– Analysis
• Empirical Evaluation
– Synthetic
– Real Data
50
Nicolò Cesa-Bianchi , Alex Conconi , Claudio Gentile, 2005
Second Order Perceptron
• Assume all inputs are given
• Compute “whitening” matrix
• Run the Perceptron on “whitened” data
• New “whitening” matrix
51
Nicolò Cesa-Bianchi , Alex Conconi , Claudio Gentile, 2005
Second Order Perceptron
• Bound:
• Same simple case:
• Thus
• Bound is :
52
Nicolò Cesa-Bianchi , Alex Conconi , Claudio Gentile, 2005
Second Order Perceptron
• If No-Mistake
– Do nothing
• If Mistake
– Update
53
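A sketch of the Second-Order Perceptron in the “whiten, then run the Perceptron” form above, following the standard description of the algorithm; the regularizer a and the NumPy formulation are assumptions:

import numpy as np

def second_order_perceptron(stream, d, a=1.0):
    """On each mistake, store the example: update both the correlation
    matrix (the 'whitening' part) and the Perceptron-style sum of y*x."""
    S = a * np.eye(d)          # a*I + sum of x x^T over stored examples
    v = np.zeros(d)            # sum of y*x over stored examples
    for x, y in stream:
        S_t = S + np.outer(x, x)                          # include the current example
        y_hat = np.sign(v @ np.linalg.solve(S_t, x)) or 1.0
        if y_hat != y:                                    # mistake: store the example
            S = S_t
            v = v + y * x
    return S, v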
SGD on whitened data
54
Outline
• Background:
– Online learning + notation
– Perceptron
– Stochastic-gradient descent
– Passive-aggressive
• Second-Order Algorithms
– Second order Perceptron
– Confidence-Weighted and AROW
– AdaGrad
• Properties
– Kernels
– Analysis
• Empirical Evaluation
– Synthetic
– Real Data
55
Span-based Update Rules
• The weight vector is a linear combination of the examples
  [Annotated equation: weight of feature f, learning rate, target label (either −1 or +1), feature-value of the input instance]
• Two learning-rate schedules:
  – Perceptron algorithm, Conservative
  – Passive-Aggressive
Sentiment Classification
• Who needs this Simpsons book?
You DOOOOOOOO
This is one of the most extraordinary
volumes I've ever encountered … .
Exhaustive, informative, and ridiculously
entertaining, it is the best accompaniment
to the best television show … .
… Very highly recommended!
57
Pang, Lee, Vaithyanathan, EMNLP 2002
Sentiment Classification
• Many positive reviews with the word best (w_best)
• Later, a negative review
  – “boring book – best if you want to sleep in seconds”
• A linear update will reduce both w_best and w_boring
• But best appeared more often than boring
• The model knows more about best than about boring
• Better to reduce the weights of different words at different rates
Natural Language Processing
• Big datasets, large number of features
• Many features are only weakly correlated with the target label
• Linear classifiers: features are associated with word-counts
• Heavy-tailed feature distribution
[Plot: feature counts vs. feature rank]
New Prediction Models
• Gaussian distributions over weight vectors
• The covariance is either full or diagonal
• In NLP we have many features and use a
diagonal covariance
61
Classification
• Given a new example
• Stochastic:
– Draw a weight vector
– Make a prediction
• Collective:
– Average weight vector
– Average margin
– Average prediction
62
The Margin is a Random Variable
• The signed margin is a one-dimensional Gaussian random variable
• Thus:
63
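The “Thus” above pointed to an equation image; with w ∼ N(μ, Σ) and Φ the standard normal CDF, the standard statement is
\[ y\,(w \cdot x) \;\sim\; \mathcal{N}\big(\, y\,(\mu \cdot x),\; x^\top \Sigma\, x \,\big), \qquad \Pr\big[\, y\,(w \cdot x) \ge 0 \,\big] \;=\; \Phi\!\left( \frac{y\,(\mu \cdot x)}{\sqrt{x^\top \Sigma\, x}} \right). \]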
Linear Model vs. Distribution over Linear Models
[Figure: mean weight-vector and an example]
Weight Vector (Version) Space
The algorithm forces most of the values of the weight vector to reside in this region
65
Passive Step
Nothing to do: most of the weight vectors already classify the example correctly
66
Aggressive Step
The algorithm projects the current Gaussian distribution onto the desired half-space.
The mean is moved beyond the mistake-line (large margin).
The covariance is shrunk in the direction of the input example.
67
Projection Update
• Vectors (aka PA):
• Distributions (new update), with a confidence parameter:
68
Divergence
• Sum of two divergences of the parameters:
  – Matrix Itakura-Saito divergence
  – Mahalanobis distance
• Convex in both arguments simultaneously
69
Constraint
• Probabilistic constraint:
• Equivalent margin constraint:
• Convex in the mean, concave in the covariance
• Solutions:
  – Linear approximation
  – Change variables to get a convex formulation
  – Relax (AROW)
70
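The two constraints above were equation images; in the standard CW formulation, with confidence level η ≥ 1/2 and φ = Φ⁻¹(η),
\[ \Pr_{w \sim \mathcal{N}(\mu,\Sigma)}\big[\, y\,(w \cdot x) \ge 0 \,\big] \;\ge\; \eta \quad\Longleftrightarrow\quad y\,(\mu \cdot x) \;\ge\; \phi\, \sqrt{x^\top \Sigma\, x}. \]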
Dredze, Crammer, Pereira. ICML 2008
Crammer, Dredze, Pereira. NIPS 2008
Crammer, Dredze, Kulesza. NIPS 2009
Crammer, Dredze, Pereira. NIPS 2008
Convexity
• Change variables
• Equivalent convex formulation
71
Crammer, Dredze, Kulesza. NIPS 2009
AROW
• PA:
• CW:
• Similar update form to CW
72
The Update
• Optimization update can be solved analytically
• Coefficients depend on specific algorithm
73
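As a concrete instance, the full-covariance AROW update has the following standard closed form (r is the regularization parameter); this is a sketch following the AROW paper, not the slides' exact notation:

import numpy as np

def arow_update(mu, Sigma, x, y, r=1.0):
    """AROW: update the Gaussian (mu, Sigma) on example (x, y).
    The mean moves like PA but scaled by the confidence; the covariance
    shrinks in the direction of x."""
    margin = y * (mu @ x)
    if margin >= 1.0:
        return mu, Sigma                 # no hinge loss: leave the model unchanged
    Sx = Sigma @ x
    v = x @ Sx                           # confidence (variance) of the margin
    beta = 1.0 / (v + r)
    alpha = (1.0 - margin) * beta        # proportional to the hinge loss
    mu = mu + alpha * y * Sx
    Sigma = Sigma - beta * np.outer(Sx, Sx)
    return mu, Sigma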
Definitions
74
Updates
[Table: update equations for CW (linearization), CW (change of variables), and AROW]
Per-feature Learning Rate
• Per-feature learning rate
• Reducing the learning rate and the eigenvalues of the covariance matrix
76
Diagonal Matrix
• Given a matrix, define its diagonal part to be only the diagonal of the matrix
• Make the matrix diagonal
• Make the inverse diagonal
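A small sketch of the two diagonal approximations listed above, assuming a NumPy covariance matrix (keep the diagonal of the matrix itself, or force its inverse to be diagonal):

import numpy as np

def diag_part(A):
    """diag(A): keep only the diagonal of A, zeroing everything else."""
    return np.diag(np.diag(A))

def make_matrix_diagonal(Sigma):
    """Approximate Sigma by its diagonal part."""
    return diag_part(Sigma)

def make_inverse_diagonal(Sigma):
    """Approximate Sigma so that its *inverse* is diagonal:
    take the diagonal part of Sigma^{-1} and invert it back."""
    return np.linalg.inv(diag_part(np.linalg.inv(Sigma)))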
Outline
• Background:
– Online learning + notation
– Perceptron
– Stochastic-gradient descent
– Passive-aggressive
• Second-Order Algorithms
– Second order Perceptron
– Confidence-Weighted and AROW
– AdaGrad
• Properties
– Kernels
– Analysis
• Empirical Evaluation
– Synthetic
– Real Data
78
(Back to)
Stochastic Gradient Descent
• Consider the batch problem
• Simple algorithm:
– Initialize
– Iterate, for
– Pick a random index
– Compute
– Set
79
Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010
Adaptive Stochastic Gradient Descent
• Consider the batch problem
• Simple algorithm:
– Initialize
– Iterate, for
– Pick a random index
– Compute
– Set
– Set
Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010
Adaptive Stochastic Gradient Descent
• Very general! Can be used to solve problems with various regularizations
• The matrix A can be either full or diagonal
• Comes with convergence and regret bounds
• Similar performance to AROW
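A sketch of the diagonal variant of AdaGrad (the full-matrix version replaces the per-coordinate accumulator with the matrix A mentioned above); the step size and epsilon are illustrative:

import numpy as np

def adagrad(X, Y, grad_loss, T=1000, eta=0.1, eps=1e-8, seed=0):
    """Diagonal AdaGrad: per-feature learning rates that shrink with the
    accumulated squared gradient of that feature."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                        # initialize
    G = np.zeros(d)                        # running sum of squared gradients
    for _ in range(T):                     # iterate
        i = rng.integers(n)                # pick a random index
        g = grad_loss(w, X[i], Y[i])       # compute the (sub)gradient
        G += g * g                         # set A (here: its diagonal)
        w -= eta * g / (np.sqrt(G) + eps)  # set w: coordinate-wise adaptive step
    return w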
Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010
Adaptive Stochastic Gradient Descent
[Figure: SGD vs. AdaGrad]
Outline
• Background:
– Online learning + notation
– Perceptron
– Stochastic-gradient descent
– Passive-aggressive
• Second-Order Algorithms
– Second order Perceptron
– Confidence-Weighted and AROW
– AdaGrad
• Properties
– Kernels
– Analysis
• Empirical Evaluation
– Synthetic
– Real Data
86
Kernels
87
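The kernelization argument in the following proof rests on the weight vector staying in the span of the examples, so predictions need only inner products. A minimal kernelized Perceptron sketch of that idea (the RBF kernel is an illustrative assumption; the same reasoning is what the proof extends to the second-order algorithms):

import numpy as np

def kernel_perceptron(stream, kernel):
    """Keep the mistake examples (x_i, y_i); the implicit weight vector is
    sum_i y_i * phi(x_i), so the score of x is sum_i y_i * k(x_i, x)."""
    support = []                          # stored mistake examples and labels
    mistakes = 0
    for x, y in stream:
        score = sum(y_i * kernel(x_i, x) for x_i, y_i in support)
        if y * score <= 0:                # mistake: add the example to the expansion
            support.append((x, y))
            mistakes += 1
    return support, mistakes

# Example kernel (an assumption, not from the slides): Gaussian RBF
def rbf(a, b, gamma=1.0):
    return float(np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2)))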
Proof
• Show that we can write
• Induction
88
Proof (cntd)
• By update rule :
• Thus
89
Proof (cntd)
• By update rule :
90
Proof (cntd)
• Thus
91
Outline
• Background:
– Online learning + notation
– Perceptron
– Stochastic-gradient descent
– Passive-aggressive
• Second-Order Algorithms
– Second order Perceptron
– Confidence-Weighted and AROW
– AdaGrad
• Properties
– Kernels
– Analysis
• Empirical Evaluation
– Synthetic
– Real Data
92
Statistical Interpretation
• Margin Constraint :
• Distribution over weight-vectors :
• Assume input is corrupted with Gaussian noise
94
Statistical Interpretation
[Figure: version space and input space views, showing the mean weight-vector, good and bad realizations of the linear separator, an example, and an input instance]
Orabona and Crammer, NIPS 2010
Mistake Bound
• For any reference weight vector, the number of mistakes made by AROW is upper bounded by
  where
  – set of example indices with a mistake
  – set of example indices with an update but not a mistake
Comment I
• Separable case and no updates:
where
97
Comment II
• For large values of the parameter, the bound becomes:
• When no updates are performed: Perceptron
98
Orabona and Crammer, NIPS 2010
Bound for Diagonal Algorithm
• No. of mistakes is bounded by
• Is low when either a feature is rare or non-informative
• Exactly as in NLP …
Outline
• Background:
– Online learning + notation
– Perceptron
– Stochastic-gradient descent
– Passive-aggressive
• Second-Order Algorithms
– Second order Perceptron
– Confidence-Weighted and AROW
– AdaGrad
• Properties
– Kernels
– Analysis
• Empirical Evaluation
– Synthetic
– Real Data
100
Synthetic Data
• 20 features
• 2 informative (rotated skewed Gaussian)
• 18 noisy
• Using a single feature is as good as a random prediction
101
Synthetic Data (cntd.)
102
[Figure: distribution after 50 examples (x1)]
Synthetic Data (no noise)
[Figure: results for Perceptron, PA, SOP, CW-full, and CW-diag]
Synthetic Data (10% noise)
104
Outline
• Background:
– Online learning + notation
– Perceptron
– Stochastic-gradient descent
– Passive-aggressive
• Second-Order Algorithms
– Second order Perceptron
– Confidence-Weighted and AROW
– AdaGrad
• Properties
– Kernels
– Analysis
• Empirical Evaluation
– Synthetic
– Real Data
105
Data
• Sentiment
– Sentiment reviews from 6 Amazon domains (Blitzer et al)
– Classify a product review as either positive or negative
• Reuters, pairs of labels
– Three divisions:
• Insurance: Life vs. Non-Life; Business Services: Banking vs. Financial; Retail Distribution: Specialist Stores vs. Mixed Retail.
– Bag of words representation with binary features.
• 20 News Groups, pairs of labels
– Three divisions:
comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware, sci.electronics vs. sci.med, and talk.politics.guns vs. talk.politics.mideast.
– Bag of words representation with binary features.
106
Experimental Design
• Online to batch :
– Multiple passes over the training data
– Evaluate on a different test set after each pass
– Compute error/accuracy
• Set parameter using held-out data
• 10-fold cross-validation
• ~2000 instances per problem
• Balanced class-labels
Results vs. Online – Sentiment
108
• StdDev and Variance – always better than baseline
• Variance – 5/6 significantly better
Results vs Online – 20NG + Reuters
109
• StdDev and Variance – always better than baseline
• Variance – 4/6 significantly better
Results vs Batch - Sentiment
110
• always better than batch methods
• 3/6 significantly better
Results vs Batch - 20NG + Reuters
111
• 5/6 better than batch methods
• 3/5 significantly better, 1/1 significantly worse
Results - Sentiment
[Figure: accuracy vs. passes of training data, PA and CW, six datasets]
• CW is better (5/6 cases), statistically significant (4/6)
• CW benefits less from many passes
Results – Reuters + 20NG
[Figure: accuracy vs. passes of training data, PA and CW, six datasets]
• CW is better (5/6 cases), statistically significant (4/6)
• CW benefits less from many passes
Error Reduction by Multiple Passes
117
• PA benefits more from multiple passes (8/12)
• Amount of benefit is data dependent
T. Jaakkola and M. Jordan. 1997
Bayesian Logistic Regression
BLR:
• Covariance
• Mean
(Based on the variational approximation)
CW/AROW:
• Covariance
• Mean
(Conceptually decoupled update; a function of the margin/hinge-loss)
Algorithms Summary
1st Order → 2nd Order:
• Perceptron → SOP
• PA → CW+AROW
• SGD → AdaGrad
• Logistic Regression (LR)

• Different motivation, similar algorithms
• All algorithms can be kernelized
• Work well for data that is NOT isotropic / symmetric
• State-of-the-art results in various domains
• Accompanied with theory