Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Yoshimasa Tsuruoka, Jun’ichi Tsujii, and Sophia Ananiadou
University of Manchester
Log-linear models in NLP
• Maximum entropy models
– Text classification (Nigam et al., 1999)
– History-based approaches (Ratnaparkhi, 1998)
• Conditional random fields
– Part-of-speech tagging (Lafferty et al., 2001),
chunking (Sha and Pereira, 2003), etc.
• Structured prediction
– Parsing (Clark and Curran, 2004), Semantic Role Labeling (Toutanova et al., 2005), etc.
Log-linear models
• Log-linear (a.k.a. maximum entropy) model
$$p(y \mid x; w) = \frac{1}{Z(x)} \exp\left( \sum_i w_i f_i(x, y) \right)$$

where w_i is a weight and f_i(x, y) a feature function.

Partition function:

$$Z(x) = \sum_y \exp\left( \sum_i w_i f_i(x, y) \right)$$
• Training
– Maximize the conditional likelihood of the training data
$$L(w) = \sum_{j=1}^{N} \log p(y^{(j)} \mid x^{(j)}; w) - R(w)$$
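To make the model concrete, here is a minimal NumPy sketch of the probability computation above. The feature function `f`, the label set, and the example weights are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def log_linear_prob(w, f, x, labels):
    """p(y | x; w) for every label y: exponentiate the weighted feature
    sums and normalize by the partition function Z(x)."""
    scores = np.array([np.dot(w, f(x, y)) for y in labels])  # sum_i w_i f_i(x, y)
    scores -= scores.max()                  # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()    # division by Z(x)

# Toy usage with two labels and three hand-made binary features.
labels = ["NEG", "POS"]

def f(x, y):
    return np.array([float(x["good"] and y == "POS"),
                     float(x["bad"] and y == "NEG"),
                     1.0])                  # bias feature

w = np.array([1.5, 2.0, -0.1])
print(dict(zip(labels, log_linear_prob(w, f, {"good": 1, "bad": 0}, labels))))
```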
Regularization
• To avoid overfitting to the training data
– Penalize the weights of the features
• L1 regularization
$$L(w) = \sum_{j=1}^{N} \log p(y^{(j)} \mid x^{(j)}; w) - C \sum_i |w_i|$$
– Most of the weights become zero
– Produces sparse (compact) models
– Saves memory and storage
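As a small illustration, the L1-regularized objective is just the data log-likelihood minus C times the L1 norm of the weights; `log_likelihood` below is a hypothetical stand-in for the sum over the training data, not a function from the slides.

```python
import numpy as np

def l1_regularized_objective(w, log_likelihood, C):
    """L(w) = sum_j log p(y_j | x_j; w) - C * sum_i |w_i|."""
    return log_likelihood(w) - C * np.sum(np.abs(w))
```

A larger C pushes more weights to exactly zero, which is what makes the trained model sparse.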
Training log-linear models
• Numerical optimization methods
– Gradient descent (steepest descent or hill-climbing)
– Quasi-Newton methods (e.g. BFGS, OWL-QN)
– Stochastic Gradient Descent (SGD)
– etc.
• Training can take several hours (or even days),
depending on the complexity of the model, the size
of training data, etc.
Gradient Descent (Hill Climbing)
(Figure: the objective over the weights w1 and w2.)
Stochastic Gradient Descent (SGD)
(Figure: the objective over the weights w1 and w2. SGD computes an approximate gradient using one training sample.)
Stochastic Gradient Descent (SGD)
• Weight update procedure
– very simple (similar to the Perceptron algorithm)
$$w_i^{k+1} = w_i^k + \eta \frac{\partial}{\partial w_i} \left( \log p(y^{(j)} \mid x^{(j)}; w) - \frac{C}{N} |w_i| \right)$$

|w_i|: not differentiable
η: learning rate
Using subgradients
• Weight update procedure
$$w_i^{k+1} = w_i^k + \eta \frac{\partial}{\partial w_i} \left( \log p(y^{(j)} \mid x^{(j)}; w) - \frac{C}{N} |w_i| \right)$$

with the subgradient of the L1 term:

$$\frac{\partial |w_i|}{\partial w_i} = \begin{cases} 1 & \text{if } w_i > 0 \\ 0 & \text{if } w_i = 0 \\ -1 & \text{if } w_i < 0 \end{cases}$$
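A minimal sketch of this naive update for a single training example (x, y). `grad_log_prob` is a hypothetical helper returning the gradient of log p(y | x; w); it is not part of the slides.

```python
import numpy as np

def sgd_l1_naive_update(w, grad_log_prob, x, y, eta, C, N):
    """One naive SGD step on log p(y|x; w) - (C/N) * sum_i |w_i|,
    using sign(w_i) as the subgradient of |w_i| (0 at w_i = 0)."""
    g = grad_log_prob(w, x, y)                   # gradient of the likelihood term
    return w + eta * (g - (C / N) * np.sign(w))  # penalty touches every weight
```

Because np.sign(w) is dense, every feature weight is touched at every step, which is exactly the efficiency problem noted on the next slide.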
Using subgradients
$$w_i^{k+1} = w_i^k + \eta \frac{\partial}{\partial w_i} \left( \log p(y^{(j)} \mid x^{(j)}; w) - \frac{C}{N} |w_i| \right)$$
• Problems
– L1 penalty needs to be applied to all features
(including the ones that are not used in the
current sample).
– Few weights become zero as a result of training.
Clipping-at-zero approach
• Carpenter (2008)
• Special case of the FOLOS algorithm (Duchi and
Singer, 2008) and the truncated gradient method
(Langford et al., 2009)
• Enables lazy update
Clipping-at-zero approach

$$w_i^{k+\frac{1}{2}} = w_i^k + \eta \frac{\partial}{\partial w_i} \log p(y^{(j)} \mid x^{(j)}; w^k)$$

if $w_i^{k+\frac{1}{2}} > 0$ then
$$w_i^{k+1} = \max\left(0,\ w_i^{k+\frac{1}{2}} - \frac{C}{N}\eta\right)$$
else if $w_i^{k+\frac{1}{2}} < 0$ then
$$w_i^{k+1} = \min\left(0,\ w_i^{k+\frac{1}{2}} + \frac{C}{N}\eta\right)$$
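A sketch of the clipping-at-zero step as I read the update above, with the same hypothetical `grad_log_prob` helper and NumPy arrays for the weights:

```python
import numpy as np

def sgd_l1_clipping_update(w, grad_log_prob, x, y, eta, C, N):
    """Gradient step first, then shrink each weight toward zero by
    eta*C/N, clipping at zero so the weight cannot change sign."""
    w_half = w + eta * grad_log_prob(w, x, y)      # w^{k+1/2}
    penalty = eta * C / N
    pos, neg = w_half > 0, w_half < 0
    w_new = w_half.copy()
    w_new[pos] = np.maximum(0.0, w_half[pos] - penalty)
    w_new[neg] = np.minimum(0.0, w_half[neg] + penalty)
    return w_new
```

Because the shrinkage per step is a fixed eta*C/N for every weight, it can be deferred for weights whose features did not fire and applied in one go when they are next touched; roughly, that deferral is the "lazy update" mentioned on the previous slide.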
Number of non-zero features

• Text chunking
  Quasi-Newton: 18,109
  SGD (Naive): 455,651
  SGD (Clipping-at-zero): 87,792

• Named entity recognition
  Quasi-Newton: 30,710
  SGD (Naive): 1,032,962
  SGD (Clipping-at-zero): 279,886

• Part-of-speech tagging
  Quasi-Newton: 50,870
  SGD (Naive): 2,142,130
  SGD (Clipping-at-zero): 323,199
Why it does not produce sparse models
• In SGD, weights are not updated smoothly
(Figure: a weight fails to become exactly zero; the L1 penalty is wasted away.)
Cumulative L1 penalty
• The absolute value of the total L1 penalty which should have been applied to each weight:

$$u_k = \frac{C}{N} \sum_{t=1}^{k} \eta_t$$

• The total L1 penalty which has actually been applied to each weight:

$$q_i^k = \sum_{t=1}^{k} \left( w_i^{t+1} - w_i^{t+\frac{1}{2}} \right)$$
Applying L1 with cumulative penalty
• Penalize each weight according to the difference between u_k and q_i^{k-1}

$$w_i^{k+\frac{1}{2}} = w_i^k + \eta \frac{\partial}{\partial w_i} \log p(y^{(j)} \mid x^{(j)}; w^k)$$

if $w_i^{k+\frac{1}{2}} > 0$ then
$$w_i^{k+1} = \max\left(0,\ w_i^{k+\frac{1}{2}} - \left(u_k + q_i^{k-1}\right)\right)$$
else if $w_i^{k+\frac{1}{2}} < 0$ then
$$w_i^{k+1} = \min\left(0,\ w_i^{k+\frac{1}{2}} + \left(u_k - q_i^{k-1}\right)\right)$$

Implementation
• 10 lines of code!
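A minimal Python sketch of the cumulative-penalty step, in the spirit of the "10 lines" above. The bookkeeping mirrors the u_k and q_i definitions from the previous slides; `grad_log_prob` and the array-based interface are assumptions, not the authors' code.

```python
import numpy as np

def sgd_l1_cumulative_update(w, q, u, grad_log_prob, x, y, eta, C, N):
    """One SGD step with the cumulative L1 penalty.
    u: total penalty every weight *should* have received so far (scalar).
    q: penalty each weight *has actually* received so far (array)."""
    u += eta * C / N                            # u_k = (C/N) * sum_t eta_t
    w_half = w + eta * grad_log_prob(w, x, y)   # w^{k+1/2}
    z = w_half.copy()
    pos, neg = w_half > 0, w_half < 0
    w_half[pos] = np.maximum(0.0, w_half[pos] - (u + q[pos]))
    w_half[neg] = np.minimum(0.0, w_half[neg] + (u - q[neg]))
    q += w_half - z                             # record what was actually applied
    return w_half, q, u
```

The point of the cumulative bookkeeping is that the penalty only needs to be applied to the weights of features that appear in the current sample; the deferred amount is caught up through u and q when a weight is next touched.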
Experiments
• Model: Conditional Random Fields (CRFs)
• Baseline: OWL-QN (Andrew and Gao, 2007)
• Tasks
– Text chunking (shallow parsing)
• CoNLL 2000 shared task data
• Recognize base syntactic phrases (e.g. NP, VP, PP)
– Named entity recognition
• NLPBA 2004 shared task data
• Recognize names of genes, proteins, etc.
– Part-of-speech (POS) tagging
• WSJ corpus (sections 0-18 for training)
CoNLL 2000 chunking task: objective (figure)

CoNLL 2000 chunking: non-zero features (figure)
CoNLL 2000 chunking
• Performance of the produced model

  Method                         Passes   Obj.     # Features   Time (sec)   F-score
  OWL-QN                         160      -1.583   18,109       598          93.62
  SGD (Naive)                    30       -1.671   455,651      1,117        93.64
  SGD (Clipping + Lazy Update)   30       -1.671   87,792       144          93.65
  SGD (Cumulative)               30       -1.653   28,189       149          93.68
  SGD (Cumulative + ED)          30       -1.622   23,584       148          93.66
• Training is 4 times faster than OWL-QN
• The model is 4 times smaller than that of the clipping-at-zero approach
• The objective is also slightly better
NLPBA 2004 named entity recognition

  Method                         Passes   Obj.     # Features   Time (sec)   F-score
  OWL-QN                         160      -2.448   30,710       2,253        71.76
  SGD (Naive)                    30       -2.537   1,032,962    4,528        71.20
  SGD (Clipping + Lazy Update)   30       -2.538   279,886      585          71.20
  SGD (Cumulative)               30       -2.479   31,986       631          71.40
  SGD (Cumulative + ED)          30       -2.443   25,965       631          71.63
Part-of-speech tagging on WSJ

  Method                         Passes   Obj.     # Features   Time (sec)   Accuracy
  OWL-QN                         124      -1.941   50,870       5,623        97.16
  SGD (Naive)                    30       -2.013   2,142,130    18,471       97.18
  SGD (Clipping + Lazy Update)   30       -2.013   323,199      1,680        97.18
  SGD (Cumulative)               30       -1.987   62,043       1,777        97.19
  SGD (Cumulative + ED)          30       -1.954   51,857       1,774        97.17
Discussions
• Convergence
– Demonstrated empirically
– Penalties applied are not i.i.d.
• Learning rate
– The need for tuning can be annoying
– Rule of thumb:
• Exponential decay (passes = 30, alpha = 0.85)
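One plausible reading of the exponential-decay rule of thumb, assuming the learning rate is scaled by alpha once per pass over the N training samples (the exact schedule and the initial rate eta0 are assumptions beyond what the slide states):

```python
def exponential_decay_eta(eta0, alpha, k, N):
    """Learning rate after k per-sample updates: decays by a factor of
    alpha for every full pass over the N training samples."""
    return eta0 * alpha ** (k / N)

# With alpha = 0.85 (the value on the slide), after 30 passes the rate has
# shrunk to eta0 * 0.85**30, i.e. below 1% of its initial value.
```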
Conclusions
• Stochastic gradient descent training for L1-regularized log-linear models
– Force each weight to receive the total L1 penalty
that would have been applied if the true
(noiseless) gradient were available
• 3 to 4 times faster than OWL-QN
• Extremely easy to implement