MaxEnt: Training,
Smoothing, Tagging
Advanced Statistical Methods in NLP
Ling572
February 7, 2012
Roadmap
Maxent:
Training
Smoothing
Case study:
POS Tagging (redux)
Beam search
Training
Learn λs from training data
Challenge: Usually can’t solve analytically
Employ numerical methods
Main techniques:
Generalized Iterative Scaling (GIS; Darroch & Ratcliff, '72)
Improved Iterative Scaling (IIS; Della Pietra et al., '95)
L-BFGS, …
Generalized Iterative Scaling

GIS setup:

GIS required constraint:
∀(x, y) ∈ (X, Y): Σ_{j=1}^{k} f_j(x, y) = C, where C is a constant

If not, then set
C = max_{(x_i, y_i) ∈ S} Σ_{j=1}^{k} f_j(x_i, y_i)

and add a correction feature function f_{k+1}:
∀(x, y) ∈ (X, Y): f_{k+1}(x, y) = C − Σ_{j=1}^{k} f_j(x, y)

GIS also requires at least one active feature for every event
Default feature functions solve this problem
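As a concrete illustration of the setup above, here is a minimal sketch that computes C and appends the correction feature; the list-of-lists feature representation is an assumption for the example, not from the slides:

```python
# Sketch of the GIS setup: given per-event feature vectors, find the
# constant C and append the correction feature f_{k+1} = C - sum_j f_j.
def add_correction_feature(feature_vectors):
    """feature_vectors[i] = [f_1(x_i, y_i), ..., f_k(x_i, y_i)]."""
    # C = max over training events of the summed feature values
    C = max(sum(fv) for fv in feature_vectors)
    # f_{k+1}(x, y) = C - sum_j f_j(x, y), so every event now sums to C
    return C, [fv + [C - sum(fv)] for fv in feature_vectors]

C, padded = add_correction_feature([[1, 0, 1], [1, 1, 1]])
# every padded event's features sum to exactly C
```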
GIS Iteration

Compute the empirical expectation

Initialization: λ_j^(0); set to 0 or some other value

Iterate until convergence, for each j:
Compute p(y|x) under the current model
Compute the model expectation under the current model
Update model parameters by the weighted ratio of the empirical and model expectations
GIS Iteration

Compute d_j = E_p̃(f_j) = (1/N) Σ_{i=1}^{N} f_j(x_i, y_i)

Initialization: λ_j^(0); set to 0 or some other value

Iterate until convergence:
Compute p^(n)(y|x) = exp(Σ_j λ_j^(n) f_j(x, y)) / Z
Compute E_{p^(n)}(f_j) = (1/N) Σ_{i=1}^{N} Σ_{y∈Y} p^(n)(y|x_i) f_j(x_i, y)
Update λ_j^(n+1) = λ_j^(n) + (1/C) log( d_j / E_{p^(n)}(f_j) )
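The GIS loop can be sketched directly from these formulas. The representation below (a `feat(x, y)` function returning k feature values that already sum to C, i.e. the correction feature is included) is an illustrative assumption:

```python
import math

# Minimal GIS sketch: empirical expectation once, then repeated
# model-expectation + multiplicative-in-log update steps.
def gis(data, labels, feat, k, C, iters=100):
    lam = [0.0] * k
    N = len(data)
    # d_j = E_emp(f_j) = (1/N) * sum_i f_j(x_i, y_i)
    d = [0.0] * k
    for x, y in data:
        for j, v in enumerate(feat(x, y)):
            d[j] += v / N
    for _ in range(iters):
        # E_model(f_j) under the current lambdas
        E = [0.0] * k
        for x, _ in data:
            scores = [math.exp(sum(l * v for l, v in zip(lam, feat(x, yy))))
                      for yy in labels]
            Z = sum(scores)
            for s, yy in zip(scores, labels):
                for j, v in enumerate(feat(x, yy)):
                    E[j] += (s / Z) * v / N
        # lambda_j += (1/C) * log(d_j / E_j); skip features never observed
        lam = [l + math.log(dj / Ej) / C if dj > 0 and Ej > 0 else l
               for l, dj, Ej in zip(lam, d, E)]
    return lam

# toy usage: two labels, indicator features (each event sums to C = 1)
labels = [0, 1]
feat = lambda x, y: [1.0 if y == 0 else 0.0, 1.0 if y == 1 else 0.0]
lam = gis([("a", 0), ("a", 0), ("a", 1)], labels, feat, k=2, C=1)
```

With label-indicator features the model should recover the empirical label distribution, here p(0|a) = 2/3.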
Convergence

Methods have convergence guarantees

However, full convergence may take a very long time

Frequently use a threshold instead:
L(p) = Σ_{(x,y)∈S} p̃(x, y) log p(y|x)
L(p^(n)) = Σ_{(x,y)∈S} p̃(x, y) log p^(n)(y|x)

Stop when L(p^(n+1)) − L(p^(n)) < threshold
or (L(p^(n+1)) − L(p^(n))) / L(p^(n)) < threshold
Calculating LL(p)

LL = 0
For each sample x in the training data
  Let y be the true label of x
  prob = p(y|x)
  LL += 1/N * log(prob)
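The pseudocode above is directly runnable; in this sketch, `p` is assumed to be any callable mapping (x, y) to the model probability p(y|x):

```python
import math

# Runnable version of the slide's average log-likelihood pseudocode.
def avg_log_likelihood(samples, p):
    """samples: list of (x, gold_y); p(x, y) -> model probability p(y|x)."""
    N = len(samples)
    return sum(math.log(p(x, y)) for x, y in samples) / N

# e.g. a fixed toy model assigning p(y|x) = 0.8 to every gold label
ll = avg_log_likelihood([("a", 1), ("b", 1)], lambda x, y: 0.8)
```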
Running Time
For each iteration the running time is O(NPA), where:
N: number of training instances
P: number of classes
A: Average number of active features for instance (x,y)
L-BFGS
Limited-memory version of
Broyden–Fletcher–Goldfarb–Shanno (BFGS) method
Quasi-Newton method for unconstrained optimization
Good for optimization problems with many variables
“Algorithm of choice” for MaxEnt and related models
L-BFGS

References:
Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782.
Liu, D. C.; Nocedal, J. (1989). "On the Limited Memory BFGS Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528.

Implementations:
Java, Matlab, Python via SciPy, R, etc.
See the Wikipedia page
Smoothing
Based on Klein & Manning, 2003; F. Xia
Smoothing

Problems of scale:

Large numbers of features
Some NLP MaxEnt problems have ~1M features
Storage can be a problem

Sparseness problems
Easy to overfit

Optimization problems
Feature weights can grow toward infinity; convergence can take a long time
Smoothing

Consider the coin-flipping problem
(figure: three empirical distributions and the fitted models, from K&M '03)
Need for Smoothing

Two problems

Optimization:
Optimal value of λ? ∞
Slow to optimize

No smoothing:
Learned distribution just as spiky (K&M '03)

From K&M '03
Possible Solutions
Early stopping
Feature selection
Regularization
Early Stopping

Prior use of early stopping
Decision tree heuristics

Similarly here
Stop training after a few iterations
λs will have increased
Guarantees bounded, finite training time
Feature Selection

Approaches:

Heuristic: drop features based on fixed thresholds
e.g., number of occurrences

Wrapper methods:
Add feature selection to the training loop

Heuristic approaches:
Simple, reduce features, but could harm performance
Regularization

In statistics and machine learning, regularization is any method of preventing overfitting of data by a model.

Typical examples of regularization in statistical machine learning include ridge regression, lasso, and the L2 norm in support vector machines.

In this case, we change the objective function:
log P(Y, λ|X) = log P(λ) + log P(Y|X, λ)

From K&M '03, F. Xia
Prior

Possible prior distributions: uniform, exponential

Gaussian prior:
P(λ_i) = (1 / (σ_i √(2π))) exp( −(λ_i − μ_i)² / (2σ_i²) )

log P(Y, λ|X) = log P(λ) + log P(Y|X, λ)
             = Σ_{i=1}^{k} log P(λ_i) + log P(Y|X, λ)
             = −k log(√(2π) σ) − Σ_{i=1}^{k} (λ_i − μ_i)² / (2σ²) + log P(Y|X, λ)
Maximize P(Y|X, λ):
E_p(f_j) = E_p̃(f_j)

Maximize P(Y, λ|X):
E_p(f_j) = E_p̃(f_j) − (λ_j − μ) / σ²

In practice: μ = 0; 2σ² = 1
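With a Gaussian prior (μ = 0), each gradient step is the ordinary expectation-matching gradient plus a term pulling λ_j back toward zero by λ_j/σ². A minimal gradient-ascent sketch; the function name, fixed expectations, and learning rate are illustrative assumptions (σ² = 0.5 matches the slide's 2σ² = 1):

```python
# One gradient-ascent step on log P(Y, lambda | X):
# d/d(lambda_j) = E_emp(f_j) - E_model(f_j) - lambda_j / sigma^2
def map_gradient_step(lam, emp_exp, model_exp, sigma2=0.5, lr=0.1):
    return [l + lr * (d - e - l / sigma2)
            for l, d, e in zip(lam, emp_exp, model_exp)]

# toy expectations: note the prior term shrinks large weights hardest
lam = map_gradient_step([1.0, -2.0], [0.6, 0.3], [0.5, 0.4])
```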
L1 and L2 Regularization

L1 = Σ_i log P(y_i, λ | x_i) − Σ_i |λ_i| / σ
L2 = Σ_i log P(y_i, λ | x_i) − Σ_i λ_i² / σ²
Smoothing: POS Example
Advantages of Smoothing
Smooths distributions
Moves weight onto more informative features
Enables effective use of larger numbers of features
Can speed up convergence
Summary: Training
Many training methods:
Generalized Iterative Scaling (GIS)
Smoothing:
Early stopping, feature selection, regularization
Regularization:
Change objective function – add prior
Common prior: Gaussian prior
Maximizing posterior not equivalent to max ent
MaxEnt POS Tagging
Notation
(Ratnaparkhi, 1996)
h: history x
Word and tag history
t: tag y
POS Tagging Model

P(t_1, …, t_n | w_1, …, w_n)
  = Π_{i=1}^{n} P(t_i | w_1^n, t_1^{i−1})
  ≈ Π_{i=1}^{n} P(t_i | h_i)

p(t | h) = p(t, h) / Σ_{t′∈T} p(t′, h)

where h_i = {w_i, w_{i−1}, w_{i−2}, w_{i+1}, w_{i+2}, t_{i−1}, t_{i−2}}
MaxEnt Feature Set
Example
Feature for ‘about’
Exclude features seen < 10 times
Training
GIS
Training time: O(NTA)
N: training set size
T: number of tags
A: average number of features active for event (h,t)
24 hours on a ‘96 machine
Finding Features

In training, where do features come from?
Where do features come from in testing?
Tag features come from classification of the prior word

            w-1    w0     w-1w0       w+1    t-1   y
x1 (Time)   <s>    Time   <s> Time    flies  BOS   N
x2 (flies)  Time   flies  Time flies  like   N     N
x3 (like)   flies  like   flies like  an     N     V
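A row of the table above can be produced by a small extraction function. This is an illustrative sketch; the dictionary representation and boundary symbols (beyond `<s>` and `BOS`, which appear in the table) are assumptions:

```python
# Extract the table's context features for position i in a sentence.
# `tags` holds the (predicted or gold) tags for words before position i;
# at test time t-1 comes from the classifier's own earlier decision.
def context_features(words, tags, i):
    w_prev = words[i - 1] if i > 0 else "<s>"
    return {
        "w-1": w_prev,
        "w0": words[i],
        "w-1w0": w_prev + " " + words[i],
        "w+1": words[i + 1] if i + 1 < len(words) else "</s>",
        "t-1": tags[i - 1] if i > 0 else "BOS",
    }

# the x2 (flies) row of the table
f = context_features(["Time", "flies", "like"], ["N", "N"], 1)
```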
Decoding
Goal: Identify highest probability tag sequence
Issues:
Features include tags from previous words
Not immediately available
Uses tag history
Just knowing highest probability preceding tag insufficient
Beam Search
Intuition:
Breadth-first search explores all paths
Lots of paths are (pretty obviously) bad
Why explore bad paths?
Restrict to (apparently best) paths
Approach:
Perform breadth-first search, but
Retain only k ‘best’ paths thus far
k: beam width
Beam Search, k=3

(diagram: beam-search trellis over the sentence "<s> time flies like an arrow")
Beam Search

W = {w_1, w_2, …, w_n}: test sentence
s_ij: jth highest prob. sequence up to & incl. word w_i

Generate tags for w_1, keep the top k, set s_1j accordingly
for i = 2 to n:
  For each s_(i−1)j, extend with tags for w_i
  Beam selection:
    Sort sequences by probability
    Keep only the top k sequences, using the pruning on the next slide
Return the highest probability sequence s_n1
Beam Search

Pruning and storage:
W = beam width

For each node, store:
  Tag for w_i
  Probability of the sequence so far: prob_{i,j} = Π_{m=1}^{i} p(t_m | h_m)

For each candidate s_{i,j}:
  Keep the node if prob_{i,j} is in the top K, and
  prob_{i,j} is sufficiently high,
  e.g. lg(prob_{i,j}) + W ≥ lg(max_prob)
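The algorithm and pruning rule above can be sketched together; working in log probabilities avoids underflow. The toy model `toy_p` and all names here are illustrative assumptions, not Ratnaparkhi's actual model:

```python
import math

# Beam-search tagging sketch: keep the top-k sequences per word, and
# prune any sequence more than W below the best in log probability.
def beam_search(words, tags, p, k=3, W=5.0):
    # each hypothesis: (log probability, tag sequence so far)
    beam = [(0.0, [])]
    for w in words:
        # Extension: add every tag for the current word to every hypothesis
        cands = [(lp + math.log(p(t, (w, seq))), seq + [t])
                 for lp, seq in beam for t in tags]
        cands.sort(key=lambda c: c[0], reverse=True)
        best = cands[0][0]
        # Beam selection: top k, within margin W of the best (log-scale)
        beam = [c for c in cands[:k] if c[0] + W >= best]
    return beam[0][1]

# toy model: 'like' is a verb after a noun, everything else prefers noun
def toy_p(t, hist):
    w, seq = hist
    if w == "like" and seq and seq[-1] == "N":
        return 0.9 if t == "V" else 0.1
    return 0.8 if t == "N" else 0.2

seq = beam_search(["time", "flies", "like"], ["N", "V"], toy_p)
```

Note that the t-1 feature works here because each hypothesis carries its own tag history, exactly as the slides describe.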
Decoding
Tag dictionary:
known word: returns tags seen with word in training
unknown word: returns all tags
Beam width = 5
Running time: O(NTAB)
N,T,A as before
B: beam width
POS Tagging
Overall accuracy: 96.3+%
Unseen word accuracy: 86.2%
Comparable to HMM tagging accuracy or TBL
Provides
Probabilistic framework
Better able to model different info sources
Topline accuracy 96-97%
Consistency issues
Beam Search

Beam search decoding:
Variant of breadth-first search
At each layer, keep only the top k sequences

Advantages:
Efficient in practice: beam of 3-5 near optimal
Empirically, beam explores only 5-10% of the search space; prunes 90-95%
Simple to implement
Just extensions + sorting; no dynamic programming

Disadvantage: not guaranteed optimal (or complete)
MaxEnt POS Tagging
Part of speech tagging by classification:
Feature design
word and tag context features
orthographic features for rare words
Sequence classification problems:
Tag features depend on prior classification
Beam search decoding
Efficient, but inexact
Near optimal in practice