Slide transcript
Conditional Random Fields:
Probabilistic Models for
Segmenting and Labeling
Sequence Data
John Lafferty
Andrew McCallum
Fernando Pereira
Goal: Sequence segmentation
and labeling
Computational biology
Computational linguistics
Computer science
Overview
HMM -> MEMM -> CRF
RF  -> MRF  -> CRF
Generative: HMM
[diagram: chain of states s' -> s, with emissions s -> o]
Discriminative / Conditional: MEMM, CRF
[diagram: states s' -> s conditioned on the observations o]

Task
Evaluation
  Generative (HMM): Find P(o^T | M)
Decoding = Prediction
  Generative (HMM): Find s^T s.t. P(o^T | s^T, M) is maximized
  Conditional (CRF): Find s^T s.t. P(s^T | o^T, M) is maximized
Learning
  Generative (HMM): Given o, find M s.t. P(o | M) is maximized
    (Need EM because s is unknown)
  Conditional (CRF): Given o and s, find M s.t. P(s | o, M) is maximized
    (Simpler maximum-likelihood problem)
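The decoding task above, finding the label sequence that maximizes P(s^T | o^T, M), is solved by the standard Viterbi algorithm on a chain model. A minimal sketch; the transition and emission matrices here are hypothetical log-scores, not values from the slides:

```python
import numpy as np

def viterbi(log_trans, log_emit, obs):
    """Find the highest-scoring state sequence for a chain model.

    log_trans[i, j]: log score of moving from state i to state j
    log_emit[i, o]:  log score of state i paired with observation o
    (A uniform initial state distribution is assumed.)
    """
    n_states = log_trans.shape[0]
    T = len(obs)
    score = np.zeros((T, n_states))
    back = np.zeros((T, n_states), dtype=int)
    score[0] = log_emit[:, obs[0]]
    for t in range(1, T):
        # Score of every (previous state, current state) pair.
        cand = score[t - 1][:, None] + log_trans + log_emit[:, obs[t]][None, :]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # Follow back-pointers from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The same dynamic program applies to HMM, MEMM, and CRF decoding; only the way the log-scores are defined differs.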
CWS Example
(Chinese word segmentation of "我是歌手" = "I am a singer";
tags: S = single-character word, B = word begin, E = word end)

MEMM: label path S S B E over 我 是 歌 手, scored with transition
probabilities P(S|S), P(B|S), P(E|B) and per-character probabilities
P(我|S), P(是|S), P(歌|B), P(手|E)

CRF: same label path S S B E over 我 是 歌 手, but transition features
may condition on the observation as well, e.g. P(B|S, 歌)
Generative Models
HMMs and stochastic grammars
Assign a joint probability to paired observation and
label sequences
Parameters are trained to maximize joint likelihood of
training examples
Generative Models
Need to enumerate all possible
observation sequences
To ensure tractability of inference
problem, must make strong
independence assumptions (i.e.,
conditional independence given labels)
Example: MEMMs
Maximum entropy Markov models
Each source state has an exponential
model that takes the observation
features as input and outputs a
distribution over possible next states
Weakness: Label bias problem
Label Bias Problem
Per-state normalization of transition scores implies
"conservation of score mass"
Bias towards states with fewer outgoing transitions
A state with a single outgoing transition effectively
ignores the observation

Example (after Wallach '02)
Tags: D: determiner, N: noun, V: verb, A: adjective, R: adverb
Obs: "The robot wheels are round."
Lattice paths:
  upper: The/D robot/N wheels/V Fred/N round/R
  lower: The/D robot/N wheels/N are/V round/A
But if P(V|N, wheels) > P(N|N, wheels),
then the upper path is chosen regardless of the observations.
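The effect described above can be shown numerically: with per-state (local) normalization, a state with a single outgoing transition assigns that transition probability 1 no matter what the observation is. A minimal sketch with made-up weights:

```python
import math

def local_probs(weights, obs):
    """Locally normalized transition distribution, MEMM-style:
    P(next | current, obs) proportional to exp(weights[next][obs])."""
    z = sum(math.exp(w[obs]) for w in weights.values())
    return {nxt: math.exp(w[obs]) / z for nxt, w in weights.items()}

# Hypothetical state with ONE outgoing transition; its weights strongly
# prefer observation 0, yet normalization erases that preference.
single_out = {"B": {0: 5.0, 1: -5.0}}
print(local_probs(single_out, 0))  # {'B': 1.0}
print(local_probs(single_out, 1))  # {'B': 1.0}  -- observation is ignored

# A state with two successors does respond to the observation.
two_out = {"V": {0: 5.0, 1: -5.0}, "N": {0: -5.0, 1: 5.0}}
```

The single-successor state passes all its score mass forward regardless of the evidence; this is exactly the bias towards low-entropy states that global normalization removes.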
Solving Label Bias (cont’d)
Start with fully-connected model and let
training procedure figure out a good
structure
Precludes use of prior structure knowledge
Overview
Generative models
Conditional models
Label bias problem
Conditional random fields
Experiments
Conditional Random Fields
Undirected graph (random field)
Construct conditional model p(Y|X)
Does not explicitly model marginal p(X)
Assumption: graph is fixed
Normalization: global, with one normalizer $Z(x)$ per observation
sequence, not per state

CRFs: Distribution
$p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i,k} \lambda_k f_k(y_{i-1}, y_i, x) + \sum_{i,k} \mu_k g_k(y_i, x) \Big)$
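The globally normalized distribution above can be sketched directly for a linear chain: score the given label sequence, then compute $\log Z(x)$ with the forward recursion instead of enumerating all sequences. The weight matrices here are arbitrary illustrative values:

```python
import numpy as np

def crf_log_prob(y, x, trans, emit):
    """log p(y|x) = score(y, x) - log Z(x) for a linear-chain CRF.

    trans[a, b]: weight lambda_k for the transition feature (y_{i-1}=a, y_i=b)
    emit[a, o]:  weight mu_k for the state feature (y_i=a, x_i=o)
    """
    # Unnormalized log-score of the given label sequence.
    score = emit[y[0], x[0]]
    for i in range(1, len(x)):
        score += trans[y[i - 1], y[i]] + emit[y[i], x[i]]
    # log Z(x) by the forward recursion (log-sum-exp over label sequences).
    log_alpha = emit[:, x[0]].astype(float)
    for i in range(1, len(x)):
        log_alpha = np.logaddexp.reduce(
            log_alpha[:, None] + trans, axis=0) + emit[:, x[i]]
    log_z = np.logaddexp.reduce(log_alpha)
    return score - log_z
```

Because normalization is global, the probabilities of all label sequences for a fixed x sum to one, which the test below checks by brute-force enumeration on a tiny model.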
CRFs: Parameter Estimation
Maximize log-likelihood objective
function
N sentences in training corpus
CRF Parameter Estimation
• Iterative Scaling:
• Maximizes likelihood
  $O(\lambda) = \sum_{i=1}^{N} \log p_\lambda(y^{(i)} \mid x^{(i)}) \propto \sum_{x,y} \tilde{p}(x,y) \log p_\lambda(y \mid x)$
  by iteratively updating
  $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$
  $\mu_k \leftarrow \mu_k + \delta\mu_k$
• Define auxiliary function $A$ s.t. $A(\lambda', \lambda) \le O(\lambda') - O(\lambda)$
• Initialize each $\lambda_k$
• Do until convergence:
  Solve $\frac{dA(\lambda', \lambda)}{d\,\delta\lambda_k} = 0$ for each $\delta\lambda_k$
  Update parameter: $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$
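The loop above (initialize, solve for the per-feature increments, update, repeat) can be sketched on a toy non-sequence maxent classifier, where the constant-feature-count case gives the closed-form GIS increment $\delta\lambda_k = \frac{1}{C}\log(\tilde{E}[f_k]/E[f_k])$. The data and feature function below are made up for illustration:

```python
import math

def gis(data, feats, n_feats, C, iters=200):
    """Generalized Iterative Scaling for a toy conditional maxent model
    p(y|x) proportional to exp(sum_k lambda_k f_k(x, y)), assuming every
    (x, y) pair activates exactly C features (the GIS condition)."""
    lam = [0.0] * n_feats
    labels = sorted({y for _, y in data})

    def p(y, x):
        # Conditional probability under the current parameters.
        scores = {yy: math.exp(sum(lam[k] for k in feats(x, yy)))
                  for yy in labels}
        return scores[y] / sum(scores.values())

    # Empirical feature expectations (fixed throughout training).
    emp = [0.0] * n_feats
    for x, y in data:
        for k in feats(x, y):
            emp[k] += 1.0 / len(data)

    for _ in range(iters):
        # Model feature expectations under the current parameters.
        exp_ = [0.0] * n_feats
        for x, _ in data:
            for y in labels:
                py = p(y, x)
                for k in feats(x, y):
                    exp_[k] += py / len(data)
        # GIS increment: (1/C) * log(empirical / expected).
        lam = [l + (1.0 / C) * math.log((e + 1e-12) / (m + 1e-12))
               for l, e, m in zip(lam, emp, exp_)]
    return lam
```

For chain CRFs the expectations E[f_k] must themselves be computed by dynamic programming, which is where Algorithms S and T below come in.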
CRF Parameter Estimation
• For chain CRF, setting $\frac{dA(\lambda', \lambda)}{d\,\delta\lambda_k} = 0$ gives
  $\tilde{E}[f_k] = \sum_{x,y} \tilde{p}(x,y) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)$
  $\phantom{\tilde{E}[f_k]} = \sum_{x,y} \tilde{p}(x)\, p_\lambda(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\big(\delta\lambda_k\, T(x,y)\big)$
• $T(x,y) = \sum_{i,k} f_k(y_{i-1}, y_i, x) + \sum_{i,k} g_k(y_i, x)$ is the total feature count
• Unfortunately, T(x,y) is a global property of (x,y)
• Dynamic programming would have to sum over sequences with
  potentially varying T: inefficient computation of the exponential sums
Algorithm S
(Generalized Iterative Scaling)
• Introduce a global slack feature s.t. T(x,y) becomes a
  constant S for all (x,y):
  $s(x, y) := S - \sum_{i,k} f_k(y_{i-1}, y_i, x) - \sum_{i,k} g_k(y_i, x)$
• Define forward and backward variables
  $\alpha_i(y \mid x) = \sum_{y'} \alpha_{i-1}(y' \mid x) \exp\big( \sum_k \lambda_k f_k(y', y, x) + \sum_k \mu_k g_k(y, x) \big)$
  $\beta_i(y \mid x) = \sum_{y'} \beta_{i+1}(y' \mid x) \exp\big( \sum_k \lambda_k f_k(y, y', x) + \sum_k \mu_k g_k(y', x) \big)$
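The forward and backward variables above let the expected feature counts be computed without enumerating label sequences: the edge marginal at each position is $\alpha_{i-1}$ times the local transition factor times $\beta_i$, normalized by Z. A minimal sketch for a linear chain, with arbitrary illustrative weights:

```python
import numpy as np

def edge_marginals(x, trans, emit):
    """Edge marginals p(y_{i-1}=a, y_i=b | x) for a linear-chain model with
    log-potentials trans[a, b] and emit[a, o], via forward-backward."""
    T, S = len(x), trans.shape[0]
    log_a = np.zeros((T, S))   # forward variables (in log space)
    log_b = np.zeros((T, S))   # backward variables; log_b[T-1] = 0
    log_a[0] = emit[:, x[0]]
    for i in range(1, T):
        log_a[i] = np.logaddexp.reduce(
            log_a[i - 1][:, None] + trans, axis=0) + emit[:, x[i]]
    for i in range(T - 2, -1, -1):
        log_b[i] = np.logaddexp.reduce(
            trans + emit[:, x[i + 1]][None, :] + log_b[i + 1][None, :], axis=1)
    log_z = np.logaddexp.reduce(log_a[-1])
    # Marginal over each edge: alpha * transition factor * beta / Z.
    return [np.exp(log_a[i - 1][:, None] + trans
                   + emit[:, x[i]][None, :] + log_b[i][None, :] - log_z)
            for i in range(1, T)]
```

Summing f_k weighted by these marginals gives E[f_k]; each edge marginal sums to one, mirroring the posterior interpretation of alpha-beta products in HMMs.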
Algorithm S
The update equations become:
$\delta\lambda_k = \frac{1}{S} \log \frac{\tilde{E}[f_k]}{E_\lambda[f_k]}$  (and similarly for $\delta\mu_k$)
where $E_\lambda[f_k]$ is computed from the $\alpha$ and $\beta$ variables.
Note: the product $\alpha_i \beta_i / Z$ is like a posterior, as in HMM
forward-backward.
Rate of convergence governed by S
(a larger slack constant means smaller steps)
Algorithm T
(Improved Iterative Scaling)
The equation we want to solve,
$\tilde{E}[f_k] = \sum_{x,y} \tilde{p}(x)\, p_\lambda(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\big(\delta\lambda_k\, T(x,y)\big)$,
is polynomial in $\exp(\delta\lambda_k)$,
so it can be solved with Newton's method.
Define $T(x) := \max_y T(x, y)$. Then:
$\tilde{E}[f_k] = \sum_{t=0}^{T_{\max}} a_{k,t}\, \beta_k^{\,t}$, with $\beta_k = \exp(\delta\lambda_k)$
Now, let $a_{k,t}$, $b_{k,t}$ be the expected counts of $f_k$, $g_k$
restricted to sequences with T(x) = t:
$a_{k,t} = \sum_{x,y} \tilde{p}(x)\, p_\lambda(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \delta(t, T(x))$
UPDATE: solve the polynomial for its positive root $\beta_k$
and set $\delta\lambda_k = \log \beta_k$
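The polynomial equation $\sum_t a_{k,t}\,\beta^t = \tilde{E}[f_k]$ with nonnegative coefficients has an increasing, convex left-hand side, so Newton's method finds its positive root reliably. A minimal sketch; the coefficients in the test are made up:

```python
def solve_beta(coeffs, target, beta=1.0, iters=50):
    """Newton's method for sum_t coeffs[t] * beta**t = target, beta > 0.

    With nonnegative coefficients the left side is increasing and convex
    on beta > 0, so the iteration converges to the positive root."""
    for _ in range(iters):
        f = sum(c * beta ** t for t, c in enumerate(coeffs)) - target
        df = sum(t * c * beta ** (t - 1)
                 for t, c in enumerate(coeffs) if t > 0)
        beta -= f / df
    return beta
```

The per-feature update is then delta_lambda_k = log(beta_k), recovered from the root.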
Overview
Generative models
Conditional models
Label bias problem
Conditional random fields
Experiments
Experiment 1: Modeling Label
Bias
Generate data from simple HMM that encodes noisy
version of network
Each state emits designated symbol with prob. 29/32
2,000 training and 500 test samples
MEMM error: 42%; CRF error: 4.6%
Experiment 2: More synthetic
data
Five labels: a – e
26 observation values: A – Z
Generate data from a mixed-order HMM
Randomly generate model
For each model, generate sample of
1,000 sequences of length 25
MEMM vs. CRF
MEMM vs. HMM
CRF vs. HMM
Experiment 3: Part-of-speech
Tagging
Each word to be labeled with one of 45 syntactic tags.
50%-50% train-test split
out-of-vocabulary (oov) words: not observed in the
training set
Part-of-speech Tagging
Second set of experiments: add small set of
orthographic features (whether word is capitalized,
whether word ends in –ing, -ogy, -ed, -s, -ly …)
Overall error rate reduced by 25% and oov error
reduced by around 50%
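The orthographic cues mentioned above (capitalization, suffixes) are naturally expressed as boolean feature functions on a word. A small illustrative sketch; the feature names are hypothetical and the paper's actual feature set is larger:

```python
# Suffixes from the slide; the paper also uses further spelling cues.
SUFFIXES = ("ing", "ogy", "ed", "s", "ly")

def orthographic_features(word):
    """Return the names of the orthographic features active on `word`."""
    feats = []
    if word[:1].isupper():
        feats.append("init_cap")
    for suf in SUFFIXES:
        if word.endswith(suf):
            feats.append("suffix_" + suf)
    return feats

print(orthographic_features("Running"))  # ['init_cap', 'suffix_ing']
```

Each active feature name would be paired with a candidate tag to form a g_k(y_i, x) feature in the CRF; such features help especially on out-of-vocabulary words, where no lexical feature fires.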
Part-of-speech Tagging
Usually start training with zero parameter
vector (corresponds to uniform distribution)
Use optimal MEMM parameter vector as
starting point for training corresponding CRF
MEMM+ trained to convergence in around 100
iterations; CRF+ took additional 1,000
iterations
When starting from uniform distribution,
CRF+ had not converged after 2,000
iterations
Further Aspects of CRFs
Automatic feature selection
Start from feature-generating rules and
evaluate the benefit of the generated
features automatically on data
Conclusions
CRFs do not suffer from the label bias
problem!
Parameter estimation guaranteed to
find the global optimum
Limitation: Slow convergence of the
training algorithm
CRFs: Example Features
The corresponding parameters λ and μ play a role similar to
the logarithms of the HMM parameters
p(y'|y) and p(x|y)
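This correspondence can be checked directly: setting λ = log p(y'|y) and μ = log p(x|y) makes the exponentiated chain score equal the HMM joint probability. A small sketch with made-up toy parameters:

```python
import math

# Toy HMM parameters (made up): 2 states, 2 symbols.
p_trans = [[0.7, 0.3], [0.4, 0.6]]   # p(y_i | y_{i-1})
p_emit  = [[0.9, 0.1], [0.2, 0.8]]   # p(x_i | y_i)
p_init  = [0.5, 0.5]

# CRF-style weights: logs of the HMM parameters.
lam = [[math.log(p) for p in row] for row in p_trans]
mu  = [[math.log(p) for p in row] for row in p_emit]

def crf_score(y, x):
    """Chain log-score with lambda = log p(y|y'), mu = log p(x|y),
    plus the initial-state term."""
    s = math.log(p_init[y[0]]) + mu[y[0]][x[0]]
    for i in range(1, len(x)):
        s += lam[y[i - 1]][y[i]] + mu[y[i]][x[i]]
    return s

def hmm_joint(y, x):
    """HMM joint probability p(y, x)."""
    p = p_init[y[0]] * p_emit[y[0]][x[0]]
    for i in range(1, len(x)):
        p *= p_trans[y[i - 1]][y[i]] * p_emit[y[i]][x[i]]
    return p
```

With these particular weights Z(x) sums the joint over label sequences to p(x), so the CRF conditional reduces to the HMM posterior p(y|x); general CRF weights need not come from any probability table.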