Conditional Random Fields
William W. Cohen
CALD

Announcements
• Upcoming assignments:
  – Today: Sha & Pereira, Lafferty et al.
  – Mon 2/23: Klein & Manning, Toutanova et al.
  – Wed 2/25: no writeup due
  – Mon 3/1: no writeup due
  – Wed 3/3: project proposal due: personnel + 1-2 pages
  – Spring break week, no class

Review: motivation for CMMs
Ideally we would like to use many, arbitrary, overlapping features of words:
• identity of word
• ends in "-ski"
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• …
[Figure: HMM graphical model with states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}; example observation features: is "Wisniewski", part of noun phrase, ends in "-ski"]

Motivation for CMMs
(Same features and graphical model as above.)
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state:
  \Pr(s_t \mid x_t, s_{t-1})

Implications of the model
• Does this do what we want?
• Q: does Y[i-1] depend on X[i+1]?
  – "A node is conditionally independent of its non-descendants given its parents."

Label Bias Problem
• Consider this MEMM:
• P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
  P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
• Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
  In the training data, label value 2 is the only label value observed after label value 1. Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x.
• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
• Per-state normalization does not allow the required expectation.

Label Bias Problem
• Consider this MEMM, and enough training data to perfectly model it:
  Pr(0123 | rib) = 1    Pr(0453 | rob) = 1
  Pr(0123 | rob) = Pr(1|0,r)/Z_1 * Pr(2|1,o)/Z_2 * Pr(3|2,b)/Z_3 = 0.5 * 1 * 1
  Pr(0453 | rib) = Pr(4|0,r)/Z_1' * Pr(5|4,i)/Z_2' * Pr(3|5,b)/Z_3' = 0.5 * 1 * 1

How important is label bias?
• Could be avoided in this case by changing the structure:
• Our models are always wrong – is this "wrongness" a problem?
• See Klein & Manning's paper for next week….

Another view of label bias [Sha & Pereira]

So what's the alternative?

Review of maxent
  \Pr(x) \propto \exp(\sum_i \lambda_i f_i(x))
  \Pr(x, y) \propto \exp(\sum_i \lambda_i f_i(x, y))
  \Pr(y \mid x) = \exp(\sum_i \lambda_i f_i(x, y)) / \sum_{y'} \exp(\sum_i \lambda_i f_i(x, y'))

Review of maxent/MEMM/CMMs
  \Pr(y \mid x) = \exp(\sum_i \lambda_i f_i(x, y)) / \sum_{y'} \exp(\sum_i \lambda_i f_i(x, y')) = \exp(\sum_i \lambda_i f_i(x, y)) / Z(x)
for a MEMM:
  \Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \Pr(y_j \mid y_{j-1}, x_j) = \prod_j \exp(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})) / Z(x_j)

Details on CMMs
  \Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \Pr(y_j \mid y_{j-1}, x_j)
    = \prod_j \exp(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})) / Z(x_j)
    = \exp(\sum_i \lambda_i F_i(x, y)) / \prod_j Z(x_j),
  where F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})

From CMMs to CRFs
  \prod_j \exp(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})) / Z(x_j) = \exp(\sum_i \lambda_i F_i(x, y)) / \prod_j Z(x_j),
  where F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})
Recall why we're unhappy: we don't want local normalization.

New model
  \Pr(y \mid x) = \exp(\sum_i \lambda_i F_i(x, y)) / Z(x)

What's the new model look like?
  \exp(\sum_i \lambda_i F_i(x, y)) / Z(x) = \prod_j \exp(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})) / Z(x)
What's independent?
[Figure: chain-structured graph over labels y_1, y_2, y_3 with observations x_1, x_2, x_3]

What's the new model look like?
  \exp(\sum_i \lambda_i F_i(x, y)) / Z(x) = \prod_j \exp(\sum_i \lambda_i f_i(x, y_j, y_{j-1})) / Z(x)
What's independent now??
[Figure: chain over labels y_1, y_2, y_3, each connected to the whole observation sequence x]
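To make the difference between per-position and global normalization concrete, here is a minimal brute-force sketch (not from the lecture; the label set, the two indicator features, and the weights are invented for illustration). The same weighted features are scored either as a locally normalized CMM/MEMM, with a separate Z(x_j) at each position, or as a globally normalized CRF, with one Z(x) summing over every label sequence. Enumerating all sequences is only feasible for toy inputs; the forward-backward material below is what makes Z(x) tractable.

```python
import math
from itertools import product

LABELS = ["name", "other"]   # toy label set, invented for illustration

def local_score(x, j, y_prev, y_cur, w):
    """sum_i lambda_i f_i(x, y_j, y_{j-1}) at position j, using two
    hypothetical indicator features (capitalized word, name-to-name)."""
    s = 0.0
    s += w["cap->name"] * float(x[j][0].isupper() and y_cur == "name")
    s += w["name->name"] * float(y_prev == "name" and y_cur == "name")
    return s

def seq_score(x, y, w):
    """Global score sum_i lambda_i F_i(x, y), i.e. the sum over positions."""
    return sum(local_score(x, j, y[j - 1] if j > 0 else "START", y[j], w)
               for j in range(len(x)))

def memm_prob(x, y, w):
    """Locally normalized CMM/MEMM: a separate normalizer at every position."""
    p = 1.0
    for j in range(len(x)):
        y_prev = y[j - 1] if j > 0 else "START"
        scores = {lab: local_score(x, j, y_prev, lab, w) for lab in LABELS}
        p *= math.exp(scores[y[j]]) / sum(math.exp(s) for s in scores.values())
    return p

def crf_prob(x, y, w):
    """Globally normalized CRF: one Z(x) over all |LABELS|^n label sequences."""
    z = sum(math.exp(seq_score(x, list(cand), w))
            for cand in product(LABELS, repeat=len(x)))
    return math.exp(seq_score(x, y, w)) / z

w = {"cap->name": 1.5, "name->name": 0.8}      # made-up weights
x = ["Dr", "Wisniewski", "spoke"]
y = ["other", "name", "other"]
print(memm_prob(x, y, w), crf_prob(x, y, w))
```

Note that memm_prob multiplies locally normalized factors, so probability mass can only be redistributed among the labels at each position; crf_prob can shift mass between whole label sequences, which is exactly the freedom the label-bias discussion asks for.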
Hammersley-Clifford
• For positive distributions P(x_1, …, x_n):
  – \Pr(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = \Pr(x_i \mid \mathrm{Neighbors}(x_i))
  – \Pr(A \mid B, S) = \Pr(A \mid S), where A, B are sets of nodes and S is a set that separates A and B
  – P can be written as a normalized product of "clique potentials":
      \Pr(x) = (1/Z) \prod_{\mathrm{clique}\ C} \Phi(x_C)
• So this is very general: any Markov distribution can be written in this form (modulo nits like "positive distribution").

Definition of CRFs
• X is a random variable over data sequences to be labeled
• Y is a random variable over corresponding label sequences

Example of CRFs

Graphical comparison among HMMs, MEMMs and CRFs
[Figure: graphical models of an HMM, an MEMM, and a CRF]

Lafferty et al notation
If the graph G = (V, E) of Y is a tree, then by the fundamental theorem of random fields the conditional distribution over the label sequence Y = y, given X = x, is:
  p_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)
• x is a data sequence
• y is a label sequence
• v is a vertex from the vertex set V = the set of label random variables
• e is an edge from the edge set E over V
• f_k and g_k are given and fixed; g_k is a Boolean vertex feature, f_k is a Boolean edge feature
• k is the number of features
• \theta = (\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n); the \lambda_k and \mu_k are parameters to be estimated
• y|_e is the set of components of y defined by edge e
• y|_v is the set of components of y defined by vertex v

Conditional Distribution (cont'd)
• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
  p_\theta(y \mid x) = (1/Z(x)) \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)
  Z(x) is a normalization over the data sequence x
• Learning:
  – Lafferty et al's IIS-based method is rather inefficient.
  – Gradient-based methods are faster.
  – The trickiest bit is computing the normalization, which is a sum over exponentially many y vectors.

CRF learning – from Sha & Pereira
[Three slides of derivations from the paper; the equations were presented as images and are not reproduced here]

Something like forward-backward
Idea:
• Define a matrix of y, y' "affinities" at stage i
• M_i[y, y'] = "unnormalized probability" of the transition from y to y' at stage i
• M_i * M_{i+1} = "unnormalized probability" of any path through stages i and i+1 (a sketch appears at the end of these notes)
[Figure: chain over labels y_1, y_2, y_3 with the full observation sequence x]

Forward backward ideas
[Figure: two-state (name / nonName) trellis with edge weights a–h; entries of the matrix product, such as ae + bg and af + bh, sum the weights over paths]

CRF learning – from Sha & Pereira
[Two further derivation slides from the paper, also presented as images]

Sha & Pereira results
CRF beats MEMM (McNemar's test); MEMM probably beats voted perceptron

Sha & Pereira results
[Table: training times in minutes, 375k examples]

POS tagging: Experiments in Lafferty et al
• Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
• Each word in a given input sentence must be labeled with one of 45 syntactic tags
• Add a small set of orthographic features: whether a spelling begins with a number or an upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
• oov = out-of-vocabulary (not observed in the training set)

POS tagging vs MXPost
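Returning to the "Something like forward-backward" slide, here is a minimal sketch of computing Z(x) from the per-position transition matrices M_j. It assumes a local_score(x, j, y_prev, y_cur, w) like the invented one in the earlier sketch and is not Sha & Pereira's actual implementation; the point is only the cost: a left-to-right pass over the M_j matrices gives Z(x) in O(n·|Y|^2) time rather than summing over |Y|^n label sequences.

```python
import math

LABELS = ["name", "other"]          # toy label set, as in the earlier sketch

def transition_matrix(x, j, w, local_score):
    """M_j[y, y'] = exp(sum_i lambda_i f_i(x, y_j = y', y_{j-1} = y)):
    the 'unnormalized probability' of moving from y to y' at position j."""
    prev_labels = ["START"] if j == 0 else LABELS
    return {yp: {yc: math.exp(local_score(x, j, yp, yc, w)) for yc in LABELS}
            for yp in prev_labels}

def partition_function(x, w, local_score):
    """Z(x) = sum over all label paths of the product of M_j entries,
    accumulated left to right (a 'forward' pass)."""
    alpha = {"START": 1.0}                      # unnormalized mass per state
    for j in range(len(x)):
        M = transition_matrix(x, j, w, local_score)
        alpha = {yc: sum(a * M[yp][yc] for yp, a in alpha.items())
                 for yc in LABELS}
    return sum(alpha.values())

# With x, y, w, local_score, seq_score from the earlier sketch:
#   Z = partition_function(x, w, local_score)
#   math.exp(seq_score(x, y, w)) / Z  ==  crf_prob(x, y, w)   (up to rounding)
```

The same matrices, combined in forward and backward passes, also give the per-edge marginals needed for the gradient, which is the "trickiest bit" referred to in the Learning bullet above.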