A Maximum Entropy Approach to Natural Language Processing
Maximum Entropy (ME)
Maximum Entropy Markov Model
(MEMM)
Conditional Random Field (CRF)
Boltzmann-Gibbs Distribution
Given:
States s1, s2, …, sn
Density p(s) = ps
Maximum entropy principle:
Without any information, one chooses the density p_s to maximize the entropy
H = − Σ_s p_s log p_s
subject to the constraints
Σ_s p_s f_i(s) = D_i,  for all i
Boltzmann-Gibbs (Cnt’d)
Consider the Lagrangian
L = − Σ_s p_s log p_s + Σ_i λ_i ( Σ_s p_s f_i(s) − D_i ) + μ ( Σ_s p_s − 1 )
Taking partial derivatives of L with respect to p_s and setting them to zero, we obtain the Boltzmann-Gibbs density
p_s = exp( Σ_i λ_i f_i(s) ) / Z
where Z is the normalizing factor
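As a numeric sketch of the density just derived: with a toy state space, a single hypothetical feature f(s) = s, and an assumed multiplier λ = 0.5 (none of these values come from the slides), the Boltzmann-Gibbs form normalizes to a proper distribution and puts more mass on states with larger feature values.

```python
import math

# Toy state space and a single feature f(s); lambda is an assumed multiplier.
states = [0, 1, 2, 3]
f = lambda s: float(s)          # hypothetical feature: f(s) = s
lam = 0.5                       # assumed Lagrange multiplier

# Boltzmann-Gibbs density: p_s = exp(lambda * f(s)) / Z
Z = sum(math.exp(lam * f(s)) for s in states)
p = {s: math.exp(lam * f(s)) / Z for s in states}

# p is a proper density: the normalizer Z makes the probabilities sum to 1.
assert abs(sum(p.values()) - 1.0) < 1e-9
```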
Exercise
From the Lagrangian
L = − Σ_s p_s log p_s + Σ_i λ_i ( Σ_s p_s f_i(s) − D_i ) + μ ( Σ_s p_s − 1 )
derive
p_s = exp( Σ_i λ_i f_i(s) ) / Z
Boltzmann-Gibbs (Cnt’d)
Classification Rule
Use Boltzmann-Gibbs as the prior distribution
Compute the posterior for the given observed data and features f_i
Use the optimal posterior to classify
Boltzmann-Gibbs (Cnt’d)
Maximum Entropy (ME)
The posterior is the state probability density
p(s | X), where X = (x1, x2, …, xn)
Maximum entropy Markov model (MEMM)
The posterior consists of transition probability
densities p(s | s´, X)
Boltzmann-Gibbs (Cnt’d)
Conditional random field (CRF)
The posterior consists of both transition
probability densities p(s | s´, X) and
state probability densities
p(s | X)
References
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern
Classification, 2nd Ed., Wiley Interscience, 2001.
T. Hastie, R. Tibshirani, and J. Friedman, The
Elements of Statistical Learning, Springer-Verlag,
2001.
P. Baldi and S. Brunak, Bioinformatics: The
Machine Learning Approach, The MIT Press,
2001.
Maximum Entropy Approach
An Example
Five possible French translations of the English word "in":
dans, en, à, au cours de, pendant
Certain constraints are obeyed:
When "April" follows "in", the proper translation is "en"
How do we choose the proper French translation y for an English context x?
Formalism
Probability assignment p(y|x):
y: French word, x: English context
Indicator function of a context feature f:
f(x, y) = 1 if y = "en" and "April" follows "in"
          0 otherwise
Expected Values of f
The expected value of f with respect to the empirical distribution p̃(x, y):
p̃(f) = Σ_{x,y} p̃(x, y) f(x, y)
The expected value of f with respect to the conditional probability p(y|x):
p(f) = Σ_{x,y} p̃(x) p(y|x) f(x, y)
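The two expectations can be checked on a small example. The data, the model probabilities, and the "April"/"en" feature below are hypothetical toy values, not from the slides; the empirical expectation averages f over observed (x, y) pairs, while the model expectation weights f by p̃(x) p(y|x).

```python
from collections import Counter

# Toy (English-context marker, French translation) pairs (hypothetical data).
data = [("April", "en"), ("April", "en"), ("other", "dans"), ("other", "à")]
f = lambda x, y: 1.0 if (x == "April" and y == "en") else 0.0

# Empirical expectation: p~(f) = sum_{x,y} p~(x, y) f(x, y)
counts = Counter(data)
N = len(data)
p_tilde_f = sum((c / N) * f(x, y) for (x, y), c in counts.items())

# Model expectation: p(f) = sum_{x,y} p~(x) p(y|x) f(x, y),
# under an assumed conditional model p(y|x).
p_model = {"April": {"en": 0.9, "dans": 0.1},
           "other": {"en": 0.1, "dans": 0.5, "à": 0.4}}
px = Counter(x for x, _ in data)
p_f = sum((px[x] / N) * p_model[x][y] * f(x, y)
          for x in px for y in p_model[x])
# Here p~(f) = 0.5 while p(f) = 0.45, so this model violates the constraint.
```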
Constraint Equation
Set the two expected values equal:
p̃(f) = p(f)
or equivalently,
Σ_{x,y} p̃(x, y) f(x, y) = Σ_{x,y} p̃(x) p(y|x) f(x, y)
Maximum Entropy Principle
Given n feature functions f_i, we want p(y|x) to maximize the entropy measure
H(p) = − Σ_{x,y} p̃(x) p(y|x) log p(y|x)
where p is chosen from
C = { p | p(f_i) = p̃(f_i), i = 1, 2, …, n }
Constrained Optimization Problem
The Lagrangian
Λ(p, λ) = H(p) + Σ_i λ_i ( p(f_i) − p̃(f_i) )
Solutions
p(y|x) = (1 / Z(x)) exp( Σ_i λ_i f_i(x, y) )
Z(x) = Σ_y exp( Σ_i λ_i f_i(x, y) )
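A minimal sketch of this conditional model, using a hypothetical label set, one feature, and an assumed weight λ = 2 (none taken from the slides): a positive weight on the ("April", "en") feature pushes probability mass toward "en" in that context.

```python
import math

# Hypothetical labels, feature, and weight for illustration.
labels = ["en", "dans"]
features = [lambda x, y: 1.0 if (x == "April" and y == "en") else 0.0]
lambdas = [2.0]   # assumed weight

def p_cond(x, y, features, lambdas, labels):
    """p(y|x) = exp(sum_i lambda_i f_i(x, y)) / Z(x),
       where Z(x) = sum_y exp(sum_i lambda_i f_i(x, y))."""
    score = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lambdas, features)))
    return score(y) / sum(score(yy) for yy in labels)

# In the "April" context the weighted feature favors "en": e^2 / (e^2 + 1).
p_en = p_cond("April", "en", features, lambdas, labels)
```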
Iterative Solution
Compute the expectation of f_i under the current estimate of the probability function:
p^(n)(f_i) = Σ_{x,y} p̃(x) p^(n)(y|x) f_i(x, y)
Update the Lagrange multipliers:
exp( λ_i^(n+1) − λ_i^(n) ) = p̃(f_i) / p^(n)(f_i)
Update the probability functions:
p^(n+1)(y|x) = (1 / Z^(n+1)(x)) exp( Σ_i λ_i^(n+1) f_i(x, y) )
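The three steps above can be sketched on a toy problem with a single binary feature, where the multiplicative update shown on the slide (a form of iterative scaling) drives the model expectation to the empirical one. The data, contexts, and labels below are hypothetical; with one feature the fixed point can be found in closed form (λ = log 2 here), which makes the convergence easy to verify.

```python
import math

# Toy data: contexts A/B, labels 0/1; one feature f(x,y) = 1 iff x == "A" and y == 1.
data = [("A", 1), ("A", 1), ("A", 0), ("B", 0)]
f = lambda x, y: 1.0 if (x == "A" and y == 1) else 0.0
labels = [0, 1]

N = len(data)
p_tilde_f = sum(f(x, y) for x, y in data) / N            # empirical expectation p~(f)

def model_expectation(lam):
    """p(f) = sum_x p~(x) sum_y p(y|x) f(x,y), with p(y|x) = exp(lam f) / Z(x)."""
    px = {}
    for x, _ in data:
        px[x] = px.get(x, 0.0) + 1.0 / N
    total = 0.0
    for x, w in px.items():
        Z = sum(math.exp(lam * f(x, y)) for y in labels)
        total += w * sum(math.exp(lam * f(x, y)) / Z * f(x, y) for y in labels)
    return total

lam = 0.0
for _ in range(100):
    # The slide's update: exp(lam_new - lam_old) = p~(f) / p^(n)(f)
    lam += math.log(p_tilde_f / model_expectation(lam))

# At convergence the constraint p(f) = p~(f) holds, and lam = log 2 here.
```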
Feature Selection
Motivation:
For a large collection of candidate features,
we want to select a small subset
Incremental growth
Incremental Learning
Adding feature f̂ to S to obtain S ∪ {f̂}:
C(S ∪ f̂) = { p : p(f) = p̃(f) for all f ∈ S ∪ {f̂} }
Consider
The optimal model: P_{S∪f̂} = arg max_{p ∈ C(S∪f̂)} H(p)
Gain: ΔL(S, f̂) = L(P_{S∪f̂}) − L(P_S),
where L is the log-likelihood of the training data
Algorithm: Feature Selection
1. Start with S as an empty set; P_S is uniform
2. For each candidate feature f, compute P_{S∪f} and ΔL(S, f)
3. Check the termination condition (specified by the user)
4. Select f̂ = arg max_f ΔL(S, f)
5. Add f̂ to S
6. Update P_S
7. Go to step 2
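One pass of this greedy loop can be sketched on toy data. Everything below is hypothetical (the data, the candidate names f_A1/f_B0, the number of fitting steps); each single-feature candidate model is fitted with the iterative-scaling update from the earlier slides, and the candidate with the largest log-likelihood gain over the current (uniform) model is selected.

```python
import math

# Hypothetical toy data and two candidate features.
data = [("A", 1), ("A", 1), ("A", 0), ("B", 0), ("B", 1)]
labels = [0, 1]
candidates = {
    "f_A1": lambda x, y: 1.0 if (x == "A" and y == 1) else 0.0,
    "f_B0": lambda x, y: 1.0 if (x == "B" and y == 0) else 0.0,
}

def log_likelihood(feats, lams):
    """Log-likelihood of data under p(y|x) = exp(sum_i lam_i f_i(x,y)) / Z(x)."""
    ll = 0.0
    for x, y in data:
        score = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lams, feats)))
        ll += math.log(score(y) / sum(score(yy) for yy in labels))
    return ll

def fit_single(f, steps=200):
    """Fit one multiplier via the iterative-scaling update (S empty here)."""
    N = len(data)
    p_tilde = sum(f(x, y) for x, y in data) / N
    px = {}
    for x, _ in data:
        px[x] = px.get(x, 0.0) + 1.0 / N
    lam = 0.0
    for _ in range(steps):
        pf = 0.0
        for x, w in px.items():
            Z = sum(math.exp(lam * f(x, yy)) for yy in labels)
            pf += w * sum(math.exp(lam * f(x, yy)) / Z * f(x, yy) for yy in labels)
        lam += math.log(p_tilde / pf)
    return lam

base = log_likelihood([], [])                 # uniform model (empty S)
gains = {}
for name, f in candidates.items():
    lam = fit_single(f)
    gains[name] = log_likelihood([f], [lam]) - base
best = max(gains, key=gains.get)              # feature with the largest gain
```

Here f_B0 is already satisfied by the uniform model (its gain is zero), so the loop selects f_A1.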
Approximation
Computation of the maximum entropy model is costly for each candidate f
Simplification assumption:
The multipliers λ associated with S do not change when f is added to S
Approximation (cnt’d)
The approximate solution for S ∪ f then has the form
P^λ_{S,f}(y|x) = (1 / Z_λ(x)) P_S(y|x) e^{λ f(x, y)}
Z_λ(x) = Σ_y P_S(y|x) e^{λ f(x, y)}
Approximate Solution
The approximate gain is
G_{S,f}(λ) = L(P^λ_{S,f}) − L(P_S) = − Σ_x p̃(x) log Z_λ(x) + λ p̃(f)
The approximate solution is then
P̃_{S∪f} = arg max_{P^λ_{S,f}} G_{S,f}(λ)
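Because the approximate gain depends on a single scalar λ, it can be maximized with a one-dimensional search. The sketch below uses hypothetical toy data with S empty (so P_S is uniform) and finds the maximizer by ternary search, which is valid here since G is concave in λ.

```python
import math

# Hypothetical setup: P_S uniform over two labels; one candidate feature f.
data = [("A", 1), ("A", 1), ("A", 0), ("B", 0)]
labels = [0, 1]
f = lambda x, y: 1.0 if (x == "A" and y == 1) else 0.0
P_S = lambda x, y: 1.0 / len(labels)          # current model (S empty -> uniform)

N = len(data)
px = {}
for x, _ in data:
    px[x] = px.get(x, 0.0) + 1.0 / N
p_tilde_f = sum(f(x, y) for x, y in data) / N

def gain(lam):
    """G_{S,f}(lam) = -sum_x p~(x) log Z_lam(x) + lam * p~(f)."""
    g = lam * p_tilde_f
    for x, w in px.items():
        Z = sum(P_S(x, y) * math.exp(lam * f(x, y)) for y in labels)
        g -= w * math.log(Z)
    return g

# One-dimensional maximization by ternary search (G is concave in lam).
lo, hi = -10.0, 10.0
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if gain(m1) < gain(m2):
        lo = m1
    else:
        hi = m2
best_lam = (lo + hi) / 2
# On this toy problem the maximizer is log 2, matching the exact fit.
```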
Conditional Random Field (CRF)
CRF
The probability of a label sequence y given an observation sequence x is the normalized product of potential functions, each of the form
exp( Σ_j λ_j t_j(y_{i−1}, y_i, x, i) + Σ_k μ_k s_k(y_i, x, i) )
where y_{i−1} and y_i are the labels at positions i−1 and i,
t_j(y_{i−1}, y_i, x, i) is a transition feature function, and
s_k(y_i, x, i) is a state feature function
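The unnormalized log-score of a label sequence is the sum of these potentials over positions. The sketch below assumes a "START" convention for y at position 0 and hypothetical feature weights; the transition and state features mirror the IN/NNP and "September" examples on the next slide.

```python
# Minimal CRF potential-score sketch (hypothetical weights and start convention).
def crf_score(y, x, transition_feats, state_feats, lambdas, mus):
    """Unnormalized log-score:
       sum_i [ sum_j lam_j t_j(y_{i-1}, y_i, x, i) + sum_k mu_k s_k(y_i, x, i) ]."""
    total = 0.0
    for i in range(len(y)):
        y_prev = y[i - 1] if i > 0 else "START"    # assumed start convention
        total += sum(l * t(y_prev, y[i], x, i)
                     for l, t in zip(lambdas, transition_feats))
        total += sum(m * s(y[i], x, i) for m, s in zip(mus, state_feats))
    return total

t1 = lambda yp, y, x, i: 1.0 if (yp == "IN" and y == "NNP") else 0.0
s1 = lambda y, x, i: 1.0 if (x[i] == "September" and y == "NNP") else 0.0

x = ["in", "September"]
score = crf_score(["IN", "NNP"], x, [t1], [s1], [2.0], [1.0])   # favored sequence
```

Normalizing exp(score) over all label sequences would give the CRF probability; here the correct tagging scores 3.0 while the reversed tagging scores 0.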
Feature Functions
Example:
A feature given by
b(x, i) = 1 if the observation sequence at position i is the word "September"
          0 otherwise
Transition function:
t_j(y_{i−1}, y_i, x, i) = 1 if y_{i−1} = IN and y_i = NNP
                          0 otherwise
Difference from MEMM
If the state feature is dropped, we obtain an MEMM
The drawback of MEMM:
The state probabilities are not learned, but inferred
Bias can arise, since the transition features dominate during training
Difference from HMM
HMM is a generative model
In order to define a joint distribution, this
model must enumerate all possible
observation sequences and their
corresponding label sequences
This task is intractable, unless
observation elements are represented as
isolated units
CRF Training Methods
CRF training requires intensive numerical computation
Preconditioned conjugate gradient
Instead of searching along the gradient, conjugate gradient searches along a carefully chosen linear combination of the gradient and the previous search direction
Limited-memory quasi-Newton
Limited-memory BFGS (L-BFGS) is a second-order method that estimates the curvature numerically from previous gradients and updates, avoiding the need for an exact inverse Hessian computation
Voted perceptron
Voted Perceptron
Like the perceptron algorithm, this algorithm scans through the training instances, updating the weight vector λ_t when a prediction error is detected
Instead of taking just the final weight vector, the voted perceptron takes the average of the λ_t
Voted Perceptron (cnt’d)
Let
F(y, x) = Σ_i f_j(y_{i−1}, y_i, x, i)
where f_j is either a state function or a transition function.
For each training instance, the method computes a weight update
λ_{t+1} = λ_t + F(y_k, x_k) − F(ŷ_k, x_k)
in which ŷ_k is obtained from the Viterbi path
ŷ_k = arg max_y λ_t · F(y, x_k)
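The update above can be sketched end to end on a tiny tagging problem. Everything here is hypothetical (the two-tag set, the three features, the single training pair, the epoch count), and at this size the argmax over label sequences is computed by brute-force enumeration, which returns the same ŷ a Viterbi search would.

```python
import itertools

# Averaged ("voted") perceptron sketch for a first-order sequence model.
TAGS = ["IN", "NNP"]

def features(y_prev, y, x, i):
    """f(y_{i-1}, y_i, x, i): one transition feature and two state features."""
    return [
        1.0 if (y_prev == "IN" and y == "NNP") else 0.0,           # transition
        1.0 if (x[i] == "September" and y == "NNP") else 0.0,      # state
        1.0 if (x[i] == "in" and y == "IN") else 0.0,              # state
    ]

def F(y, x):
    """Global feature vector F(y, x) = sum_i f(y_{i-1}, y_i, x, i)."""
    total = [0.0] * 3
    for i in range(len(y)):
        y_prev = y[i - 1] if i > 0 else "START"    # assumed start convention
        for j, v in enumerate(features(y_prev, y[i], x, i)):
            total[j] += v
    return total

def predict(lam, x):
    """argmax_y lam . F(y, x), by brute force (stands in for Viterbi here)."""
    return max(itertools.product(TAGS, repeat=len(x)),
               key=lambda y: sum(l * v for l, v in zip(lam, F(list(y), x))))

train = [(["in", "September"], ["IN", "NNP"])]
lam = [0.0] * 3
avg = [0.0] * 3
updates = 0
for _ in range(5):                      # epochs
    for x, y_true in train:
        y_hat = list(predict(lam, x))
        if y_hat != y_true:             # update only on prediction error
            ft, fh = F(y_true, x), F(y_hat, x)
            lam = [l + a - b for l, a, b in zip(lam, ft, fh)]
        avg = [a + l for a, l in zip(avg, lam)]
        updates += 1
lam_avg = [a / updates for a in avg]    # averaged ("voted") weights
```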
References
A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, A maximum entropy approach to natural language processing
A. McCallum, D. Freitag, and F. Pereira, Maximum entropy Markov models for information extraction and segmentation
H. M. Wallach, Conditional random fields: an introduction
J. Lafferty, A. McCallum, and F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data
F. Sha and F. Pereira, Shallow parsing with conditional random fields