Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Yoshimasa Tsuruoka, Jun’ichi Tsujii, and Sophia Ananiadou
University of Manchester
Log-linear models in NLP
• Maximum entropy models
– Text classification (Nigam et al., 1999)
– History-based approaches (Ratnaparkhi, 1998)
• Conditional random fields
– Part-of-speech tagging (Lafferty et al., 2001),
chunking (Sha and Pereira, 2003), etc.
• Structured prediction
– Parsing (Clark and Curran, 2004), Semantic Role
Labeling (Toutanova et al., 2005), etc.
Log-linear models
• Log-linear (a.k.a. maximum entropy) model
  p(y \mid x; \mathbf{w}) = \frac{1}{Z(x)} \exp\left( \sum_i w_i f_i(x, y) \right)

  (w_i: weight, f_i(x, y): feature function)

  Partition function:  Z(x) = \sum_y \exp\left( \sum_i w_i f_i(x, y) \right)
• Training
– Maximize the conditional likelihood of the training data
  L(\mathbf{w}) = \sum_{j=1}^{N} \log p(y^j \mid x^j; \mathbf{w}) - R(\mathbf{w})
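To make the model above concrete, here is a minimal Python sketch (illustrative only, not from the slides) that evaluates p(y | x; w); representing the feature functions as lists of active feature indices per label is an assumption made for brevity.

    import math

    def conditional_prob(weights, active_features, y):
        """p(y | x; w) for a log-linear model.

        weights: dict feature index -> weight w_i
        active_features: dict label -> indices of features f_i(x, y) that fire
                         (binary feature functions assumed)
        """
        # Unnormalized scores exp(sum_i w_i f_i(x, y)) for every candidate label
        scores = {label: math.exp(sum(weights.get(i, 0.0) for i in feats))
                  for label, feats in active_features.items()}
        z = sum(scores.values())   # partition function Z(x)
        return scores[y] / z

    # Toy usage: two candidate labels, three features
    w = {0: 1.2, 1: -0.4, 2: 0.3}
    feats = {"NP": [0, 2], "VP": [1]}
    print(conditional_prob(w, feats, "NP"))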
Regularization
• To avoid overfitting to the training data
– Penalize the weights of the features
• L1 regularization
  L(\mathbf{w}) = \sum_{j=1}^{N} \log p(y^j \mid x^j; \mathbf{w}) - C \sum_i |w_i|
– Most of the weights become zero
– Produces sparse (compact) models
– Saves memory and storage
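As a small worked sketch of the L1-regularized objective (the numbers below are made up for illustration only):

    def l1_objective(log_likelihoods, weights, C):
        """L(w) = sum_j log p(y^j | x^j; w) - C * sum_i |w_i|."""
        return sum(log_likelihoods) - C * sum(abs(w) for w in weights)

    # Toy example: three training samples, four features (two already zeroed out)
    print(l1_objective([-0.2, -0.9, -0.4], [1.5, 0.0, -0.7, 0.0], C=1.0))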
Training log-linear models
• Numerical optimization methods
– Gradient descent (steepest descent or hill-climbing)
– Quasi-Newton methods (e.g. BFGS, OWL-QN)
– Stochastic Gradient Descent (SGD)
– etc.
• Training can take several hours (or even days),
depending on the complexity of the model, the size
of the training data, etc.
Gradient Descent (Hill Climbing)
[Figure: contour plot of the objective over two weights (w1, w2), showing gradient descent steps toward the optimum]
Stochastic Gradient Descent (SGD)
[Figure: the same objective over (w1, w2); SGD computes an approximate gradient using one training sample at each step]
Stochastic Gradient Descent (SGD)
• Weight update procedure
– very simple (similar to the Perceptron algorithm)
  w_i^{k+1} = w_i^k + \eta_k \frac{\partial}{\partial w_i} \left( \log p(y^j \mid x^j; \mathbf{w}) - \frac{C}{N} \sum_i |w_i| \right)

  \eta_k: learning rate

  The L1 term |w_i| is not differentiable at zero.
Using subgradients
• Weight update procedure
  w_i^{k+1} = w_i^k + \eta_k \frac{\partial}{\partial w_i} \left( \log p(y^j \mid x^j; \mathbf{w}) - \frac{C}{N} \sum_i |w_i| \right)

  with the subgradient of the L1 term:

  \frac{\partial |w_i|}{\partial w_i} =
  \begin{cases}
    +1 & \text{if } w_i > 0 \\
    0  & \text{if } w_i = 0 \\
    -1 & \text{if } w_i < 0
  \end{cases}
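A minimal Python sketch of the naive subgradient update above (illustrative, not the authors' code); the per-sample log-likelihood gradient is assumed to be given as a sparse dict, and note that the penalty loop has to visit every weight, which is exactly the inefficiency discussed on the next slide.

    def sgd_l1_naive_update(w, loglik_grad, eta, C, N):
        """One naive SGD step with the L1 subgradient.

        w: list of weights, updated in place
        loglik_grad: dict feature index -> d/dw_i log p(y^j | x^j; w),
                     nonzero only for features active in the current sample
        eta: learning rate, C: regularization strength, N: number of samples
        """
        for i in range(len(w)):                    # the penalty touches every weight
            grad = loglik_grad.get(i, 0.0)
            subgrad = 1.0 if w[i] > 0 else (-1.0 if w[i] < 0 else 0.0)
            w[i] += eta * (grad - (C / N) * subgrad)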
Using subgradients
  w_i^{k+1} = w_i^k + \eta_k \frac{\partial}{\partial w_i} \left( \log p(y^j \mid x^j; \mathbf{w}) - \frac{C}{N} \sum_i |w_i| \right)
• Problems
– L1 penalty needs to be applied to all features
(including the ones that are not used in the
current sample).
– Few weights become zero as a result of training.
Clipping-at-zero approach
• Carpenter (2008)
• Special case of the FOLOS algorithm (Duchi and
Singer, 2008) and the truncated gradient method
(Langford et al., 2009)
• Enables lazy update
Clipping-at-zero approach
  w_i^{k+\frac{1}{2}} = w_i^k + \eta_k \frac{\partial}{\partial w_i} \log p(y^j \mid x^j; \mathbf{w}^k)

  if w_i^{k+\frac{1}{2}} > 0 then
      w_i^{k+1} = \max\left( 0,\ w_i^{k+\frac{1}{2}} - \frac{C}{N} \eta_k \right)
  else if w_i^{k+\frac{1}{2}} < 0 then
      w_i^{k+1} = \min\left( 0,\ w_i^{k+\frac{1}{2}} + \frac{C}{N} \eta_k \right)
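A minimal Python sketch of the clipping-at-zero update above (illustrative, not the authors' implementation); for brevity it only touches the features that are active in the current sample, with a comment marking where the lazy update of the remaining features would be handled.

    def sgd_l1_clipping_update(w, loglik_grad, eta, C, N):
        """One SGD step with the clipping-at-zero L1 penalty.

        In a full lazy-update implementation, penalties for inactive features
        would be accumulated and applied the next time those features fire.
        """
        penalty = (C / N) * eta
        for i, grad in loglik_grad.items():
            w[i] += eta * grad                     # gradient half-step: w_i^{k+1/2}
            if w[i] > 0:
                w[i] = max(0.0, w[i] - penalty)    # clip at zero from above
            elif w[i] < 0:
                w[i] = min(0.0, w[i] + penalty)    # clip at zero from below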
• Text chunking
                              Number of non-zero features
    Quasi-Newton                                   18,109
    SGD (Naive)                                   455,651
    SGD (Clipping-at-zero)                         87,792

• Named entity recognition
                              Number of non-zero features
    Quasi-Newton                                   30,710
    SGD (Naive)                                 1,032,962
    SGD (Clipping-at-zero)                        279,886

• Part-of-speech tagging
                              Number of non-zero features
    Quasi-Newton                                   50,870
    SGD (Naive)                                 2,142,130
    SGD (Clipping-at-zero)                        323,199
Why it does not produce sparse models
• In SGD, weights are not updated smoothly
[Figure: trajectory of a single weight under SGD; the noisy updates mean it fails to become exactly zero, and the L1 penalty is wasted away]
Cumulative L1 penalty
• The absolute value of the total L1 penalty which
should have been applied to each weight
  u_k = \frac{C}{N} \sum_{t=1}^{k} \eta_t
• The total L1 penalty which has actually been applied
to each weight
  q_i^k = \sum_{t=1}^{k} \left( w_i^{t+1} - w_i^{t+\frac{1}{2}} \right)
Applying L1 with cumulative penalty
• Penalize each weight according to the difference between u_k and q_i^{k-1}
  w_i^{k+\frac{1}{2}} = w_i^k + \eta_k \frac{\partial}{\partial w_i} \log p(y^j \mid x^j; \mathbf{w}^k)

  if w_i^{k+\frac{1}{2}} > 0 then
      w_i^{k+1} = \max\left( 0,\ w_i^{k+\frac{1}{2}} - (u_k + q_i^{k-1}) \right)
  else if w_i^{k+\frac{1}{2}} < 0 then
      w_i^{k+1} = \min\left( 0,\ w_i^{k+\frac{1}{2}} + (u_k - q_i^{k-1}) \right)
Implementation
10 lines of code!
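A rough Python sketch of the cumulative-penalty training loop (an illustration under the assumption that the per-sample log-likelihood gradient is available as a sparse dict; it is not the authors' actual code), maintaining the running totals u and q_i defined on the previous slides:

    def train_sgd_l1_cumulative(samples, num_features, loglik_grad, eta_schedule, C):
        """SGD with the cumulative L1 penalty (sketch).

        samples: training data (a list), iterated once here for simplicity
        loglik_grad(w, sample) -> dict feature index -> gradient of log p(y|x; w)
        eta_schedule(k) -> learning rate for update k
        """
        w = [0.0] * num_features
        q = [0.0] * num_features      # total L1 penalty actually applied to each w_i
        u = 0.0                       # total L1 penalty each weight should have received
        k = 0
        for sample in samples:
            eta = eta_schedule(k)
            u += (C / len(samples)) * eta
            for i, grad in loglik_grad(w, sample).items():
                w[i] += eta * grad                      # half-step w_i^{k+1/2}
                z = w[i]
                if w[i] > 0:
                    w[i] = max(0.0, w[i] - (u + q[i]))  # clip toward zero
                elif w[i] < 0:
                    w[i] = min(0.0, w[i] + (u - q[i]))
                q[i] += w[i] - z                        # record penalty actually applied
            k += 1
        return w

The key point is that q[i] records how much penalty weight i has actually received, so each time a feature fires it catches up to the full cumulative penalty u in one clipped step.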
Experiments
• Model: Conditional Random Fields (CRFs)
• Baseline: OWL-QN (Andrew and Gao, 2007)
• Tasks
– Text chunking (shallow parsing)
• CoNLL 2000 shared task data
• Recognize base syntactic phrases (e.g. NP, VP, PP)
– Named entity recognition
• NLPBA 2004 shared task data
• Recognize names of genes, proteins, etc.
– Part-of-speech (POS) tagging
• WSJ corpus (sections 0-18 for training)
[Figure: CoNLL 2000 chunking task: objective]
[Figure: CoNLL 2000 chunking: number of non-zero features]
CoNLL 2000 chunking
• Performance of the produced model
                                Passes    Obj.   # Features   Time (sec)   F-score
  OWL-QN                           160  -1.583       18,109          598     93.62
  SGD (Naive)                       30  -1.671      455,651        1,117     93.64
  SGD (Clipping + Lazy Update)      30  -1.671       87,792          144     93.65
  SGD (Cumulative)                  30  -1.653       28,189          149     93.68
  SGD (Cumulative + ED)             30  -1.622       23,584          148     93.66
• Training is 4 times faster than OWL-QN
• The model is 4 times smaller than the clipping-at-zero approach
• The objective is also slightly better
NLPBA 2004 named entity recognition
                                Passes    Obj.   # Features   Time (sec)   F-score
  OWL-QN                           160  -2.448       30,710        2,253     71.76
  SGD (Naive)                       30  -2.537    1,032,962        4,528     71.20
  SGD (Clipping + Lazy Update)      30  -2.538      279,886          585     71.20
  SGD (Cumulative)                  30  -2.479       31,986          631     71.40
  SGD (Cumulative + ED)             30  -2.443       25,965          631     71.63
Part-of-speech tagging on WSJ
                                Passes    Obj.   # Features   Time (sec)   Accuracy
  OWL-QN                           124  -1.941       50,870        5,623      97.16
  SGD (Naive)                       30  -2.013    2,142,130       18,471      97.18
  SGD (Clipping + Lazy Update)      30  -2.013      323,199        1,680      97.18
  SGD (Cumulative)                  30  -1.987       62,043        1,777      97.19
  SGD (Cumulative + ED)             30  -1.954       51,857        1,774      97.17
Discussions
• Convergence
– Demonstrated empirically
– The penalties applied at each step are not i.i.d., so standard convergence guarantees do not directly apply
• Learning rate
– The need for tuning can be annoying
– Rule of thumb:
• Exponential decay (passes = 30, alpha = 0.85); a sketch of this schedule follows below
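A small Python sketch of such an exponential-decay schedule, assuming the rate is multiplied by alpha once per pass over the N training samples (eta0 below is an assumed initial value, not taken from the slides):

    def exponential_decay(eta0, alpha, N):
        """Return a schedule eta(k) = eta0 * alpha**(k / N).

        alpha = 0.85 shrinks the learning rate by 15% per pass over N samples.
        eta0 is an assumed initial rate.
        """
        def eta(k):
            return eta0 * alpha ** (k / N)
        return eta

    # Example: the rate at the start of each of the first five passes (N = 10000)
    schedule = exponential_decay(eta0=1.0, alpha=0.85, N=10000)
    print([round(schedule(p * 10000), 3) for p in range(5)])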
Conclusions
• Stochastic gradient descent training for L1-regularized log-linear models
– Force each weight to receive the total L1 penalty
that would have been applied if the true
(noiseless) gradient were available
• 3 to 4 times faster than OWL-QN
• Extremely easy to implement