Conditional Random Fields
William W. Cohen
CALD
Announcements
• Upcoming assignments:
– Today: Sha & Pereira, Lafferty et al
– Mon 2/23: Klein & Manning, Toutanova et al
– Wed 2/25: no writeup due
– Mon 3/1: no writeup due
– Wed 3/3: project proposal due: personnel + 1-2 pages
– Spring break week, no class
Review: motivation for CMMs
Ideally we would like to use many, arbitrary, overlapping
features of words.
identity of word
ends in “-ski”
is capitalized
is part of a noun phrase
is in a list of city names
is under node X in WordNet
is in bold font
is indented
is in hyperlink anchor
…
S t-1
St
S t+1
…
is “Wisniewski”
part of
noun phrase
…
ends in
“-ski”
O
t -1
Ot
O t +1
Motivation for CMMs
[Same feature list and HMM-style figure as the previous slide.]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state:
Pr(s_t | x_t, s_{t-1}) = …
Implications of the model
• Does this do what we want?
• Q: does Y[i-1] depend on X[i+1]?
– “a node is conditionally independent of its non-descendants given its parents”
Label Bias Problem
• Consider this MEMM:
• P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
  P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
• In the training data, label value 2 is the only label value observed after label value 1, so P(2 | 1) = 1 and hence P(2 | 1 and x) = 1 for all x.
• Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
• Per-state normalization does not allow the required expectation.
Label Bias Problem
• Consider this MEMM, and enough training data to perfectly model it:
Pr(0123|rib) = 1
Pr(0453|rob) = 1
Pr(0123|rob) = Pr(1|0,r)/Z1 · Pr(2|1,o)/Z2 · Pr(3|2,b)/Z3 = 0.5 · 1 · 1 = 0.5
Pr(0453|rib) = Pr(4|0,r)/Z1’ · Pr(5|4,i)/Z2’ · Pr(3|5,b)/Z3’ = 0.5 · 1 · 1 = 0.5
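A tiny numeric check of the computation above (the transition dictionary is read directly off the rib/rob example; state and observation names are as on the slide):
```python
# Local MEMM conditionals read off the rib/rob example above.
# From state 0 on observation 'r' the two branches are equally likely;
# states 1, 2, 4, 5 each have a single outgoing transition, so per-state
# normalization forces probability 1 regardless of the observation.
P = {
    (0, 'r'): {1: 0.5, 4: 0.5},
    (1, 'i'): {2: 1.0}, (1, 'o'): {2: 1.0},
    (2, 'b'): {3: 1.0},
    (4, 'i'): {5: 1.0}, (4, 'o'): {5: 1.0},
    (5, 'b'): {3: 1.0},
}

def memm_path_prob(path, obs):
    """Probability the MEMM assigns to a state path given the observation string."""
    prob = 1.0
    for prev, nxt, o in zip(path, path[1:], obs):
        prob *= P[(prev, o)][nxt]
    return prob

print(memm_path_prob([0, 1, 2, 3], "rib"))  # 0.5
print(memm_path_prob([0, 1, 2, 3], "rob"))  # 0.5 -- same score on the wrong word
print(memm_path_prob([0, 4, 5, 3], "rob"))  # 0.5
print(memm_path_prob([0, 4, 5, 3], "rib"))  # 0.5
```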
How important is label bias?
• Could be avoided in this case by changing structure:
• Our models are always wrong – is this “wrongness” a problem?
• See Klein & Manning’s paper for next week…
Another view of label bias [Sha & Pereira]
So what’s the alternative?
Review of maxent
$$\Pr(x) \;\propto\; \prod_i \alpha_i^{\,f_i(x)} \;=\; \exp\Big(\sum_i \lambda_i f_i(x)\Big), \quad \text{with } \alpha_i = e^{\lambda_i}$$
$$\Pr(x, y) \;\propto\; \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$
$$\Pr(y \mid x) \;=\; \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)}$$
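A minimal sketch of this conditional maxent distribution, assuming features come back as a name→value dict and `weights` holds the λ's (both hypothetical placeholders):
```python
import math

def maxent_conditional(x, labels, weights, features):
    """Pr(y | x) = exp(sum_i lambda_i f_i(x, y)) / sum_{y'} exp(sum_i lambda_i f_i(x, y')).

    `features(x, y)` returns a dict {feature_name: value}; `weights` maps
    feature names to real-valued lambdas. Both are placeholders here.
    """
    scores = {y: sum(weights.get(name, 0.0) * value
                     for name, value in features(x, y).items())
              for y in labels}
    z = sum(math.exp(s) for s in scores.values())        # the denominator over y'
    return {y: math.exp(s) / z for y, s in scores.items()}
```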
Review of maxent/MEMM/CMMs
$$\Pr(y \mid x) \;=\; \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)} \;=\; \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{Z_\lambda(x)}$$
For an MEMM:
$$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) \;=\; \prod_j \Pr(y_j \mid y_{j-1}, x_j) \;=\; \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z_\lambda(x_j)}$$
Details on CMMs
$$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) \;=\; \prod_j \Pr(y_j \mid y_{j-1}, x_j)
\;=\; \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z_\lambda(x_j)}
\;=\; \frac{\exp\big(\sum_i \lambda_i \sum_j f_i(x_j, y_j, y_{j-1})\big)}{\prod_j Z_\lambda(x_j)}
\;=\; \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{\prod_j Z_\lambda(x_j)},
\quad \text{where } F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1}).$$
From CMMs to CRFs
$$\frac{\exp\big(\sum_i \lambda_i \sum_j f_i(x_j, y_j, y_{j-1})\big)}{\prod_j Z_\lambda(x_j)}
\;=\; \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{\prod_j Z_\lambda(x_j)},
\quad \text{where } F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1}).$$
Recall why we’re unhappy: we don’t want local normalization.
New model
$$\frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z_\lambda(x)}$$
What’s the new model look like?
$$\frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z_\lambda(x)}
\;=\; \frac{\exp\big(\sum_i \lambda_i \sum_j f_i(x_j, y_j, y_{j-1})\big)}{Z_\lambda(x)}$$
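A minimal sketch of this globally normalized model. Here Z_λ(x) is computed by brute-force enumeration of all label sequences, which is only feasible for tiny examples; `features` and `weights` are the same hypothetical stand-ins as in the MEMM sketch above:
```python
import math
from itertools import product

def crf_sequence_prob(xs, ys, labels, weights, features):
    """Pr(y | x) = exp(sum_i lambda_i F_i(x, y)) / Z(x), with one global normalizer."""
    def total_score(y_seq):
        # sum_i lambda_i F_i(x, y) = sum over positions j of the local feature scores
        score, prev = 0.0, None
        for x, y in zip(xs, y_seq):
            score += sum(weights.get(name, 0.0) * value
                         for name, value in features(x, y, prev).items())
            prev = y
        return score

    # global normalizer: sum over every possible label sequence of the same length
    z = sum(math.exp(total_score(cand)) for cand in product(labels, repeat=len(xs)))
    return math.exp(total_score(tuple(ys))) / z
```
The only change from the MEMM sketch is the denominator: one sum over whole label sequences instead of a product of per-position normalizers.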
What’s independent?
[Figure: graphical model over label nodes y1, y2, y3 and observation nodes x1, x2, x3.]
What’s the new model look like?
$$\frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z_\lambda(x)}
\;=\; \frac{\exp\big(\sum_i \lambda_i \sum_j f_i(x, y_j, y_{j-1})\big)}{Z_\lambda(x)}$$
What’s independent now??
[Figure: graphical model over label nodes y1, y2, y3 and a single observation node x.]
Hammersley-Clifford
• For positive distributions P(x1, …, xn):
– Pr(xi | x1, …, xi-1, xi+1, …, xn) = Pr(xi | Neighbors(xi))
– Pr(A | B, S) = Pr(A | S), where A and B are sets of nodes and S is a set that separates A and B
– P can be written as a normalized product of “clique potentials”:
$$\Pr(x) \;=\; \frac{1}{Z} \prod_{\text{clique } C} \phi(x_C)$$
So this is very general: any Markov distribution can be written in this form (modulo nits like “positive distribution”).
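As a worked special case (using the chain-structured model from the preceding slides), the cliques are the edges {y_{j-1}, y_j} together with x, so the general form specializes to
$$\Pr(y \mid x) \;=\; \frac{1}{Z(x)} \prod_j \phi_j(y_{j-1}, y_j, x),
\qquad
\phi_j(y_{j-1}, y_j, x) \;=\; \exp\Big(\sum_i \lambda_i f_i(x, y_j, y_{j-1})\Big),
\qquad
Z(x) \;=\; \sum_{y'} \prod_j \phi_j(y'_{j-1}, y'_j, x),$$
which is exactly the globally normalized model on the previous slides.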
Definition of CRFs
X is a random variable over data sequences to be labeled
Y is a random variable over corresponding label sequences
Example of CRFs
Graphical comparison among HMMs, MEMMs and CRFs
[Figure: graphical models for HMM, MEMM, and CRF shown side by side.]
Lafferty et al notation
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is:
$$p_\theta(\mathbf{y} \mid \mathbf{x}) \;\propto\; \exp\Big( \sum_{e \in E,\,k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) \;+\; \sum_{v \in V,\,k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big)$$
x is a data sequence
y is a label sequence
v is a vertex from the vertex set V = set of label random variables
e is an edge from the edge set E over V
fk and gk are given and fixed; gk is a Boolean vertex feature, fk is a Boolean edge feature
k ranges over the features
θ = (λ1, λ2, …, λn; μ1, μ2, …, μn); the λk and μk are parameters to be estimated
y|e is the set of components of y defined by edge e
y|v is the set of components of y defined by vertex v
Conditional Distribution (cont’d)
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
$$p_\theta(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{e \in E,\,k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) \;+\; \sum_{v \in V,\,k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big)$$
Z(x) is a normalization over the data sequence x.
• Learning:
– Lafferty et al’s IIS-based method is rather inefficient.
– Gradient-based methods are faster (the gradient they need is written out below).
– The trickiest bit is computing the normalization, which is a sum over exponentially many y vectors.
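For reference, the standard maximum-likelihood gradient those trainers use compares empirical and expected feature totals (written here for a generic weight λ_k with total feature count F_k; regularized variants, e.g. a Gaussian prior, subtract an extra λ_k/σ² term):
$$\frac{\partial}{\partial \lambda_k} \sum_{(\mathbf{x}, \mathbf{y})} \log p_\theta(\mathbf{y} \mid \mathbf{x})
\;=\; \sum_{(\mathbf{x}, \mathbf{y})} \Big( F_k(\mathbf{x}, \mathbf{y}) \;-\; \sum_{\mathbf{y}'} p_\theta(\mathbf{y}' \mid \mathbf{x})\, F_k(\mathbf{x}, \mathbf{y}') \Big)$$
The inner sum over y′ is the exponentially large expectation; the forward-backward-style recursion on the next slides computes it efficiently.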
CRF learning – from Sha & Pereira
CRF learning – from Sha & Pereira
CRF learning – from Sha & Pereira
Something like forward-backward
Idea:
• Define a matrix of y,y’ “affinities” at stage i
• Mi[y,y’] = “unnormalized probability” of a transition from y to y’ at stage i
• Mi · Mi+1 = “unnormalized probability” of any path through stages i and i+1 (a code sketch follows below)
[Figure: two adjacent stages of the trellis over label values y1, y2, y3, conditioned on x.]
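A minimal sketch of this matrix trick. The start vector and the Mi matrices are assumed to have already been built by exponentiating the weighted features (not shown); the numbers in the example are arbitrary:
```python
import numpy as np

def path_score(start, Ms, path):
    """Unnormalized score of one label path: start affinity times the M_i entries along it."""
    score = start[path[0]]
    for M, prev, nxt in zip(Ms, path, path[1:]):
        score *= M[prev, nxt]
    return score

def partition_function(start, Ms):
    """Z(x): the sum of path_score over all label paths, via matrix products.

    start[y]     -- affinity of starting in label y at the first stage
    Ms[i][y, y'] -- "unnormalized probability" M_i[y, y'] of moving from y to y'
    """
    alpha = np.asarray(start, dtype=float)
    for M in Ms:
        alpha = alpha @ M          # alpha[y'] = sum_y alpha[y] * M[y, y']
    return float(alpha.sum())

# Example with two labels and two transition stages (arbitrary affinities):
start = np.array([1.0, 2.0])
Ms = [np.array([[1.0, 0.5], [2.0, 1.0]]), np.array([[0.5, 1.0], [1.0, 3.0]])]
brute_force = sum(path_score(start, Ms, (a, b, c))
                  for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert abs(partition_function(start, Ms) - brute_force) < 1e-9
```
The assertion checks the point of the slide: multiplying the stage matrices sums over all intermediate labels, so the exponentially many paths never have to be enumerated.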
Forward-backward ideas
[Figure: a two-stage trellis over the labels name and nonName, with edge affinities a, b, c, d in the first stage and e, f, g, h in the second.]
$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}
\begin{pmatrix} e & f \\ g & h \end{pmatrix}
\;=\;
\begin{pmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \end{pmatrix}$$
CRF learning – from Sha & Pereira
CRF learning – from Sha & Pereira
Sha & Pereira results
CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron.
Sha & Pereira results (in minutes, 375k examples)
POS tagging Experiments in Lafferty et al
• Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging
• Each word in a given input sentence must be labeled with one of 45 syntactic tags
• Add a small set of orthographic features: whether a spelling begins with a number or upper-case letter, whether it contains a hyphen, and if it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies (a sketch of these features follows below)
• oov = out-of-vocabulary (not observed in the training set)
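A minimal sketch of extracting these orthographic features for a single word (the function and feature names are just illustrative):
```python
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")

def orthographic_features(word):
    """The small set of Boolean orthographic features listed on the slide."""
    feats = {
        "begins_with_number": word[:1].isdigit(),
        "begins_with_uppercase": word[:1].isupper(),
        "contains_hyphen": "-" in word,
    }
    for suffix in SUFFIXES:
        feats["suffix_" + suffix] = word.endswith(suffix)
    return feats

# e.g. orthographic_features("Fielding") sets begins_with_uppercase and suffix_ing.
```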
POS tagging vs MXPost