Transcript of a PowerPoint presentation
A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance
Zhifei Li and Sanjeev Khudanpur, Johns Hopkins University

JOSHUA: a scalable open-source parsing-based MT decoder
• New! Written in Java
• Chart parsing
• Beam and cube pruning
• K-best extraction over a hypergraph (Chiang, 2007)
• m-gram LM integration
• Parallel decoding
• Distributed LM (Zhang et al., 2006; Brants et al., 2007)
• Equivalent LM state maintenance
• We plan to add more functions soon

Chart Parsing
• Grammar formalism: synchronous context-free grammar (SCFG)
• Chart parsing is bottom-up parsing: it maintains a chart, which contains an array of cells or bins, and each cell maintains a list of items
• The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
• The hypotheses are stored in a hypergraph

Hypergraph
[Figure: an example hypergraph for the source sentence 垫子0 上1 的2 猫3 ("the cat on the mat"). The rules (垫子 上, the mat) and (猫, a cat) build the items X | 0, 2 | the mat | NA and X | 3, 4 | a cat | NA; hyperedges for the rules (X0 的 X1, X0 X1), (X0 的 X1, X0 's X1), (X0 的 X1, X1 of X0), and (X0 的 X1, X1 on X0) combine them into items such as X | 0, 4 | the mat | a cat and X | 0, 4 | a cat | the mat; the rule (X0, X0) then proves the goal item S.]

Hypergraph and Trees
[Figure: the derivation trees packed into the hypergraph above; reading off their target sides yields the alternative translations "the mat 's a cat", "a cat of the mat", and "a cat on the mat".]

How to Integrate an m-gram LM?
• Three functions: accumulate probability, estimate future cost, and state extraction (sketched in code after this section)
[Figure: a worked 3-gram example for the source sentence 奥运会0 将1 在2 中国3 的4 北京5 举行。6 ("the olympic game will be held in beijing of china ."). The rules (奥运会, the olympic game), (中国, china), and (北京, beijing) build X | 0, 1 | the olympic | olympic game, X | 3, 4 | china | NA, and X | 5, 6 | beijing | NA. The rule (X0 的 X1, X1 of X0) combines the latter two into X | 3, 6 | beijing of | of china, introducing the new 3-gram "beijing of china" and the future probability P(beijing of) = 0.01. The rule (将 在 X0 举行。, will be held in X0 .), with probability 0.04 = 0.4 * 0.2 * 0.5, builds X | 1, 7 | will be | china ., introducing the new 3-grams "will be held", "be held in", "held in beijing", and "in beijing of", for an estimated total probability of 0.01 * 0.04 = 0.004. The rule (X0, X0) builds S | 0, 1 | the olympic | olympic game, the rule (S0 X1, S0 X1) combines it with X | 1, 7 into S | 0, 7 | the olympic | china ., and (<s> S0 </s>, <s> S0 </s>) completes the goal item S | 0, 7 | <s> the | . </s>.]

Equivalent State Maintenance: Overview
• In a straightforward implementation, different LM state words lead to different items: applying rules such as (在 X0 的 X1 下, below X1 of X0) and (在 X0 的 X1 下, under X1 of X0) to different antecedent items yields the distinct items
  X | 0, 3 | below cat | some rat
  X | 0, 3 | below cats | many rat
  X | 0, 3 | under cat | some rat
  X | 0, 3 | below cat | many rat
• We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard, e.g. X | 0, 3 | below * | * rat
• By merging items, we can explore a larger hypothesis space in less time
• We only merge items when the length l of the English span satisfies l ≥ m - 1

Back-off Parameterization of m-gram LMs
• LM probability computation: if the m-gram h w is listed, use its probability directly; otherwise back off, P(w | h) = β(h) · P(w | h'), where h' drops the earliest word of the history h
• Observations:
  • A larger m leads to more back-off
  • The default back-off weight is 1: for an m-gram that is not listed, β(·) = 1
• Fragment of an LM (log10 probability, m-gram, optional log10 back-off weight; m-grams with no third column take the default back-off weight of 1):
  -4.250922  party files
  -4.741889  party filled
  -4.250922  party finance      -0.1434139
  -4.741889  party financed
  -4.741889  party finances     -0.2361806
  -4.741889  party financially
  -3.33127   party financing    -0.1119054
  -3.277455  party finished     -0.4362795
  -4.012205  party fired
  -4.741889  party fires
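The back-off computation above can be made concrete. Below is a minimal Java sketch, not JOSHUA's actual code: the NgramTable interface and the OOV floor are assumptions standing in for a real ARPA-table lookup. It applies the back-off rule recursively, charging log10 β = 0 (i.e., β = 1) for histories that are not listed.

    import java.util.Arrays;

    public class BackoffLm {
        /** Assumed ARPA-style table lookup; both methods return null when
         *  the queried m-gram is not listed in the model. */
        interface NgramTable {
            Double logProb(String[] ngram);     // log10 P(last word | preceding words)
            Double logBackoff(String[] ngram);  // log10 beta(ngram)
        }

        static final double OOV_LOG_PROB = -10.0;  // illustrative floor for unknown words

        /** log10 P(w | h) with recursive back-off: a larger m gives more
         *  chances to back off; an unlisted history contributes log10 beta = 0. */
        static double logProb(String[] ngram, NgramTable lm) {
            Double p = lm.logProb(ngram);
            if (p != null) return p;                      // the m-gram is listed
            if (ngram.length == 1) return OOV_LOG_PROB;   // no shorter history left
            String[] history = Arrays.copyOfRange(ngram, 0, ngram.length - 1);
            String[] shorter = Arrays.copyOfRange(ngram, 1, ngram.length);
            Double beta = lm.logBackoff(history);         // null => default beta = 1
            return (beta != null ? beta : 0.0) + logProb(shorter, lm);
        }
    }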
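Similarly, the three functions from the "How to Integrate an m-gram LM?" slide can be sketched as follows. This is a simplified illustration, not JOSHUA's code: it scores the full target word string of a new item (antecedent LM states already substituted in) and ignores nonterminal gaps; the Lm interface is an assumption.

    import java.util.Arrays;

    public class LmIntegration {
        /** Assumed LM query, e.g. the back-off lookup sketched above. */
        interface Lm { double logProb(String[] ngram); }

        /** 1. Accumulate probability: score every complete new m-gram. */
        static double accumulate(String[] words, int m, Lm lm) {
            double total = 0.0;
            for (int end = m; end <= words.length; end++)
                total += lm.logProb(Arrays.copyOfRange(words, end - m, end));
            return total;
        }

        /** 2. Estimate future cost: score the boundary words whose full
         *  left context is not yet known, using shorter histories. */
        static double estimateFutureCost(String[] words, int m, Lm lm) {
            double est = 0.0;
            for (int end = 1; end < m && end <= words.length; end++)
                est += lm.logProb(Arrays.copyOfRange(words, 0, end));
            return est;
        }

        /** 3. State extraction: keep only the first and last m-1 words. */
        static String[][] extractState(String[] words, int m) {
            int k = Math.min(m - 1, words.length);
            return new String[][] {
                Arrays.copyOfRange(words, 0, k),                          // left state
                Arrays.copyOfRange(words, words.length - k, words.length) // right state
            };
        }
    }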
Equivalent State Maintenance: Right-side State
• Why not proceed right to left? Whether a word can be ignored would then depend on both its left and right sides, which complicates the procedure
• For a 4-gram LM, the state words are e_{l-2} e_{l-1} e_l and the future words are e_{l+1} e_{l+2} e_{l+3} ...
• When the back-off weight is one, the probability is independent of e_{l-2}:
  P(e_{l+1} | e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l) β(e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l)
• Elision table (sketched in code at the end of this section):
  State                 IS-A-PREFIX   Equivalent state
  e_{l-2} e_{l-1} e_l   no            * e_{l-1} e_l
  e_{l-1} e_l           no            * * e_l
  e_l                   no            * * *
• Note that IS-A-PREFIX(e_{l-1} e_l) = no implies IS-A-PREFIX(e_{l-1} e_l e_{l+1}) = no

Equivalent State Maintenance: Left-side State
• Why not proceed left to right? For the same reason: whether a word can be ignored would then depend on both its left and right sides, which complicates the procedure
• For a 4-gram LM, the state words are e_1 e_2 e_3 and the future (left-context) words ... e_{-2} e_{-1} e_0 arrive later
• The probability of e_3 can be finalized, since it is independent of the future context except for a back-off weight, which we remember to factor in later:
  P(e_3 | e_0 e_1 e_2) = P(e_3 | e_1 e_2) β(e_0 e_1 e_2)
• Elision table:
  State         IS-A-SUFFIX   Equivalent state
  e_1 e_2 e_3   no            e_1 e_2 *
  e_1 e_2       no            e_1 * *
  e_1           no            * * *
• Fully elided boundary words, for example:
  P(e_1 | e_{-2} e_{-1} e_0) = P(e_1) β(e_0) β(e_{-1} e_0) β(e_{-2} e_{-1} e_0)
  P(e_2 | e_{-1} e_0 e_1) = P(e_2 | e_1) β(e_0 e_1) β(e_{-1} e_0 e_1)

Equivalent State Maintenance: Summary
• Original cost function: finalized probability, estimated probability, and state extraction [equations shown on the slide]
• Modified cost function: the same three parts, restated over equivalent (wildcarded) states [equations shown on the slide]
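The elision tables above reduce to a simple scan, sketched here for the right-side state. This is a minimal sketch, not JOSHUA's code; isPrefixOfListedNgram is an assumed stand-in for the IS-A-PREFIX lookup. A state word can be wildcarded once the state words from it rightward are not a prefix of any listed n-gram: no future word can then complete a listed m-gram through it, and its back-off weight defaults to 1. The left-side state is treated symmetrically with an IS-A-SUFFIX test, dropping words from the right and factoring in the remembered back-off weights later.

    import java.util.Arrays;

    public class EquivalentState {
        /** Assumed LM query: is this word sequence a prefix of any n-gram
         *  listed in the model? */
        interface Lm {
            boolean isPrefixOfListedNgram(String[] words);
        }

        /** Right-side state elision: scan the m-1 state words left to right
         *  and wildcard each word whose right context fails IS-A-PREFIX.
         *  One test per word suffices, because IS-A-PREFIX(e_{l-1} e_l) = no
         *  implies IS-A-PREFIX(e_{l-1} e_l e_{l+1}) = no. */
        static String[] rightEquivalentState(String[] state, Lm lm) {
            String[] s = state.clone();
            for (int i = 0; i < s.length; i++) {
                if (lm.isPrefixOfListedNgram(Arrays.copyOfRange(s, i, s.length)))
                    break;      // this suffix is still informative; stop eliding
                s[i] = "*";     // safe to forget; items differing here can merge
            }
            return s;
        }
    }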
Experimental Results: Decoding Speed
• System training: Chinese-to-English translation task; sub-sampling a bitext of about 3M sentence pairs yields 570k sentence pairs; LM training data: Gigaword and the English side of the bitext
• Decoding: 3M rules, 49M m-grams
• 38 times faster than the baseline!

Experimental Results: Distributed LM
• Eight 7-gram LMs
• Decoding speed: 12.2 sec/sentence

Experimental Results: Equivalent LM States
• Search effort versus search quality with equivalent LM state maintenance
• Sparse LM: a 7-gram LM built on about 19M words
• Dense LM: a 3-gram LM built on about 130M words
[Figure: search quality plotted against beam sizes of 30, 50, 70, 90, 120, 150, and 200.]
• With the dense LM, equivalent LM state maintenance is slower than the regular method: back-off happens less frequently, and the suffix/prefix information lookup is inefficient

Summary
• We described a scalable parsing-based MT decoder
• The decoder has been successfully used to decode millions of sentences in a large-scale discriminative training task
• We proposed a method to maintain equivalent LM states
• The decoder is available at http://www.cs.jhu.edu/~zfli/

Acknowledgements
• Thanks to Philip Resnik for letting me use the UMD Python decoder
• Thanks to the UMD MT group members for very helpful discussions
• Thanks to David Chiang for Hiero and his original implementation in Python
Thank you!

Backup slides

Grammar Formalism
• Synchronous context-free grammar (SCFG):
  • Ts: a set of source-language terminal symbols
  • Tt: a set of target-language terminal symbols
  • N: a shared set of nonterminal symbols
  • A set of rules, each rewriting a nonterminal into a pair of strings over (N ∪ Ts) and (N ∪ Tt) with co-indexed nonterminals [rule schema shown on the slide]
  • A typical rule looks like: X → (X0 的 X1, X1 of X0)

Chart Parsing
• The decoding task is defined as finding the best derivation of the source sentence [equation shown on the slide]
• Chart parsing maintains a chart, which contains an array of cells or bins; each cell maintains a list of items
• The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved; the hypotheses are stored in a structure called a hypergraph
• State of an item: source span, left-hand-side nonterminal symbol, and left/right LM states
• Decoding complexity [equation shown on the slide]

m-gram LM Integration
• Three functions: accumulate probability, estimate future cost, and state extraction
• Cost function: finalized probability, estimated probability, and state extraction [equations shown on the slide]

Parallel and Distributed Decoding
• Parallel decoding:
  • Divide the test set into multiple parts
  • Each part is decoded by a separate thread
  • The threads share the language and translation models in memory
• Distributed language model (DLM):
  • Training: divide the corpora into multiple parts, train an LM on each part, and find the optimal weights among the LMs by maximizing the likelihood of a dev set
  • Decoding: load the LMs onto different servers; the decoder remotely calls the servers to obtain the probabilities and interpolates them on the fly; to save communication overhead, a cache is maintained (a client-side sketch follows after the backup slides)

Hypergraph
• A hypergraph consists of a set of nodes and hyperedges; in parsing, they correspond to items and deductive steps, respectively
• Roughly, a hyperedge can be thought of as a rule with pointers
• State of an item: source span, left-hand-side nonterminal symbol, and left/right LM states
[Figure: the example hypergraph for 垫子0 上1 的2 猫3 again, with items X | 0, 2 | the mat | NA and X | 3, 4 | a cat | NA, hyperedges for the rules (垫子 上, the mat), (猫, a cat), (X0 的 X1, X0 X1), (X0 的 X1, X0 's X1), (X0 的 X1, X1 of X0), and (X0 的 X1, X1 on X0), and the goal item S proved by (X0, X0).]
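As promised above, here is a client-side sketch of the distributed LM. The LmServer stub and the class names are assumptions, not JOSHUA's actual API: one LM per server, remote probability queries, on-the-fly interpolation with weights tuned on a dev set, and a cache to save communication overhead.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class DistributedLm {
        /** Assumed remote stub for one LM server. */
        interface LmServer {
            double prob(String[] ngram);   // remote call returning P(w | history)
        }

        private final List<LmServer> servers;   // one LM per server
        private final double[] weights;         // tuned to maximize dev-set likelihood
        private final Map<String, Double> cache = new HashMap<>();

        DistributedLm(List<LmServer> servers, double[] weights) {
            this.servers = servers;
            this.weights = weights;
        }

        /** Interpolated probability, cached to avoid repeated remote calls. */
        double prob(String[] ngram) {
            String key = String.join(" ", ngram);
            Double hit = cache.get(key);
            if (hit != null) return hit;
            double p = 0.0;
            for (int i = 0; i < servers.size(); i++)
                p += weights[i] * servers.get(i).prob(ngram);  // remote query
            cache.put(key, p);
            return p;
        }
    }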
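Finally, the hypergraph slide maps directly onto two small data structures. The names below are illustrative rather than JOSHUA's actual classes: a node is a chart item identified by its dynamic-programming state (source span, left-hand-side nonterminal, left/right LM states), and a hyperedge is a deductive step, roughly a rule with pointers to the antecedent items it combined.

    import java.util.ArrayList;
    import java.util.List;

    public class Hypergraph {
        /** A node: a chart item, identified by its dynamic-programming state. */
        static class Item {
            int i, j;                 // source span
            String lhs;               // left-hand-side nonterminal, e.g. "X" or "S"
            String[] leftLmState;     // first m-1 target words (may contain "*")
            String[] rightLmState;    // last m-1 target words (may contain "*")
            List<HyperEdge> incoming = new ArrayList<>();  // alternative derivations
        }

        /** A hyperedge: a deductive step, roughly a rule with pointers. */
        static class HyperEdge {
            String rule;              // the SCFG rule applied, e.g. "(X0 的 X1, X1 on X0)"
            List<Item> antecedents;   // the items the rule combined
            double cost;              // model cost accumulated at this step
        }
    }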