PowerPoint Transcript
A Scalable Decoder for Parsing-based Machine
Translation with Equivalent Language Model
State Maintenance
Zhifei Li and Sanjeev Khudanpur
Johns Hopkins University
JOSHUA: a scalable open-source
parsing-based MT decoder
New!
Written in Java
Chart-parsing
Beam and Cube pruning
K-best extraction over a hypergraph (Chiang, 2007)
m-gram LM Integration
Parallel Decoding
Distributed LM (Zhang et al., 2006; Brants et al.,
2007)
Equivalent LM state maintenance
We plan to add more features soon
Chart-parsing
Grammar formalism
Synchronous Context-free Grammar (SCFG)
Chart parsing
Bottom-up parsing
It maintains a chart, which contains an array of cells or bins
A cell maintains a list of items
The parsing process starts from axioms, and proceeds by
applying the inference rules to prove more and more items, until a
goal item is proved.
The hypotheses are stored in a hypergraph.
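The bottom-up procedure above can be sketched as follows. This is a minimal sketch with a hypothetical Item type and cell layout, not the actual JOSHUA code; rule application and scoring are omitted.

```java
import java.util.*;

// Minimal sketch of the bottom-up chart-parsing loop (hypothetical
// Item type and cell layout, not the actual JOSHUA code). The chart
// maps a span "i,j" to the cell (list of items) proved for that span.
public class ChartSketch {
    record Item(int i, int j, String symbol) {}

    static boolean proved(Map<String, List<Item>> chart, int i, int j) {
        return !chart.getOrDefault(i + "," + j, List.of()).isEmpty();
    }

    // Axioms seed one item per source word; inference combines items
    // over adjacent sub-spans until (possibly) a goal item covering
    // the whole input is proved.
    static List<Item> parse(String[] source) {
        Map<String, List<Item>> chart = new HashMap<>();
        for (int i = 0; i < source.length; i++)
            chart.computeIfAbsent(i + "," + (i + 1), k -> new ArrayList<>())
                 .add(new Item(i, i + 1, "X"));
        for (int width = 2; width <= source.length; width++)
            for (int i = 0; i + width <= source.length; i++) {
                int j = i + width;
                List<Item> cell = new ArrayList<>();
                for (int k = i + 1; k < j; k++)
                    if (proved(chart, i, k) && proved(chart, k, j))
                        cell.add(new Item(i, j, "X"));  // prove a new item
                chart.put(i + "," + j, cell);
            }
        return chart.getOrDefault("0," + source.length, List.of());
    }
}
```

In the real decoder each deductive step applies an SCFG rule and the resulting hypotheses are packed into the hypergraph shown next.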
Hypergraph
[Figure: an example hypergraph over the source 垫子0 上1 的2 猫3. The axiomatic items X | 0, 2 | the mat | NA (via the rule (垫子 上, the mat)) and X | 3, 4 | a cat | NA (via (猫, a cat)) are combined by hyperedges applying the rules (X0 的 X1, X0 X1), (X0 的 X1, X0 ’s X1), (X0 的 X1, X1 of X0), and (X0 的 X1, X1 on X0), yielding the items X | 0, 4 | the mat | a cat and X | 0, 4 | a cat | the mat; the rule (X0, X0) then proves the goal item S.]
Hypergraph and Trees
[Figure: the derivation trees packed in the hypergraph. Each tree fixes one hyperedge per item; for example, applying (X0 的 X1, X1 of X0) yields "a cat of the mat", while applying (X0 的 X1, X1 on X0) yields "a cat on the mat".]
How to Integrate an m-gram LM?
Three functions
Accumulate probability
Estimate future cost
State extraction
[Figure: a worked 3-gram example over the source 奥运会0 将1 在2 中国3 的4 北京5 举行。6. The rule (X0 的 X1, X1 of X0) combines X | 3, 4 | china | NA and X | 5, 6 | beijing | NA into X | 3, 6 | beijing of | of china, creating the new 3-gram "beijing of china". The rule (将 在 X0 举行。, will be held in X0 .), with accumulated probability 0.04 = 0.4 × 0.2 × 0.5, then yields X | 1, 7 | will be | china . and the new 3-grams "will be held", "be held in", "held in beijing", and "in beijing of"; with the future prob P(beijing of) = 0.01, the estimated total prob is 0.01 × 0.04 = 0.004. The rules (S0 X1, S0 X1), over X | 0, 1 | the olympic | olympic game (from (奥运会, the olympic game)), and (<s> S0 </s>, <s> S0 </s>) finally produce S | 0, 7 | the olympic | china . and S | 0, 7 | <s> the | . </s>.]
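Two of the three functions can be sketched concretely: enumerating the m-grams that are newly completed when child strings are combined, and extracting an item's LM state. This is a hypothetical helper, not the actual JOSHUA code.

```java
import java.util.*;

// Sketch of LM integration for an m-gram LM (hypothetical helper,
// not the actual JOSHUA code): new m-grams to score, and the item
// state (first and last m-1 words).
public class LmIntegrationSketch {
    // An m-gram is "new" unless it lies entirely inside one child
    // span (those were already scored when the child item was built).
    static List<String> newNgrams(List<String> words, int m, List<int[]> childSpans) {
        List<String> out = new ArrayList<>();
        for (int s = 0; s + m <= words.size(); s++) {
            boolean insideChild = false;
            for (int[] span : childSpans)
                if (span[0] <= s && s + m <= span[1]) insideChild = true;
            if (!insideChild) out.add(String.join(" ", words.subList(s, s + m)));
        }
        return out;
    }

    // State extraction: keep only the leftmost and rightmost m-1 words.
    static String[] extractState(List<String> words, int m) {
        int k = Math.min(m - 1, words.size());
        return new String[] {
            String.join(" ", words.subList(0, k)),
            String.join(" ", words.subList(words.size() - k, words.size()))
        };
    }
}
```

For the example above, combining "will be held in X0 ." with the child string "beijing of china" (occupying word positions 4 to 6) reproduces the slide's new 3-grams and the state "will be | china .".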
Equivalent State Maintenance: overview
In a straightforward implementation, different LM state words lead to different items:
X | 0, 3 | below cat | some rat
X | 0, 3 | below cats | many rat
X | 0, 3 | under cat | some rat
X | 0, 3 | below cat | many rat
derived via rules such as (在 X0 的 X1 下, below X1 of X0) and (在 X0 的 X1 下, under X1 of X0).
We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard:
X | 0, 3 | below * | * rat
By merging items, we can explore a larger hypothesis space in less time.
We only merge items when the length of the English span l ≥ m−1
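The merging itself amounts to hashing items by their (wildcarded) signature. The sketch below uses a hypothetical signature scheme, not the actual JOSHUA data structures.

```java
import java.util.*;

// Sketch of item merging via wildcarded LM-state signatures
// (hypothetical scheme, not the actual JOSHUA code). Items whose span
// and wildcarded left/right LM states coincide share one chart entry,
// so the decoder explores fewer distinct items.
public class MergeSketch {
    // One chart bin: signature -> number of derivations merged into it.
    static Map<String, Integer> bin = new HashMap<>();

    static String add(int i, int j, String leftState, String rightState) {
        String sig = i + "," + j + " | " + leftState + " | " + rightState;
        bin.merge(sig, 1, Integer::sum);  // same signature -> same item
        return sig;
    }
}
```

With this scheme, two derivations whose states both reduce to "below * | * rat" land in a single entry instead of two.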
Back-off Parameterization of m-gram LMs
LM probability computation (standard back-off):
  P(e_l | e_{l-m+1} … e_{l-1}) = the listed probability, if the m-gram e_{l-m+1} … e_l is listed;
  otherwise β(e_{l-m+1} … e_{l-1}) · P(e_l | e_{l-m+2} … e_{l-1})
Observations
A larger m leads to more backoff
Default backoff weight is 1
For an m-gram not listed, β(·) = 1

Example bigram entries (log10 probability, bigram, log10 backoff weight):
  -4.250922  party files
  -4.741889  party filled
  -4.250922  party finance      -0.1434139
  -4.741889  party financed
  -4.741889  party finances     -0.2361806
  -4.741889  party financially
  -3.33127   party financing    -0.1119054
  -3.277455  party finished     -0.4362795
  -4.012205  party fired
  -4.741889  party fires
Equivalent State Maintenance: Right-side
• Why not right to left?
• Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
For the case of a 4-gram LM, the right LM state words e_{l-2} e_{l-1} e_l are followed by future words e_{l+1} e_{l+2} e_{l+3} …
If e_{l-2} e_{l-1} e_l is not the prefix of any listed 4-gram, the backoff weight is one, so
  P(e_{l+1} | e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l) β(e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l),
which is independent of e_{l-2}.

  State                IS-A-PREFIX                              Equivalent state
  e_{l-2} e_{l-1} e_l  IS-A-PREFIX(e_{l-2} e_{l-1} e_l) = no    * e_{l-1} e_l
  * e_{l-1} e_l        IS-A-PREFIX(e_{l-1} e_l) = no            * * e_l
  * * e_l              IS-A-PREFIX(e_l) = no                    * * *

IS-A-PREFIX(e_{l-1} e_l) = no implies IS-A-PREFIX(e_{l-1} e_l e_{l+1}) = no.
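The table reduces to a left-to-right scan. The sketch below assumes a hypothetical IS-A-PREFIX set of listed n-gram prefixes; it is not the actual JOSHUA code.

```java
import java.util.*;

// Sketch of right-side equivalent-state computation (hypothetical
// IS-A-PREFIX set, not the actual JOSHUA code). Scanning left to
// right, each state word is replaced by "*" while the remaining words
// are not the prefix of any listed m-gram, since then the backoff
// weight is one and the word cannot affect future probabilities.
public class RightStateSketch {
    static List<String> equivalentRightState(List<String> state, Set<String> isAPrefix) {
        List<String> out = new ArrayList<>(state);
        for (int i = 0; i < state.size(); i++) {
            String rest = String.join(" ", state.subList(i, state.size()));
            if (isAPrefix.contains(rest)) break;  // listed: keep the rest
            out.set(i, "*");
        }
        return out;
    }
}
```

The break is justified by the monotonicity fact above: once some suffix of the state is a listed prefix, no further word can be dropped.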
Equivalent State Maintenance: Left-side
• Why not left to right?
• Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
For the case of a 4-gram LM, the future words … e_{-2} e_{-1} e_0 will appear to the left of the state words e_1 e_2 e_3.
If e_1 e_2 e_3 is not the suffix of any listed 4-gram, then the finalized probability is
  P(e_3 | e_0 e_1 e_2) = P(e_3 | e_1 e_2) β(e_0 e_1 e_2),
where P(e_3 | e_1 e_2) can be finalized now and the remaining backoff weight is independent of e_3 (remember to factor in the backoff weights later).

  State        IS-A-SUFFIX                      Equivalent state
  e_1 e_2 e_3  IS-A-SUFFIX(e_1 e_2 e_3) = no    e_1 e_2 *
  e_1 e_2 *    IS-A-SUFFIX(e_1 e_2) = no        e_1 * *
  e_1 * *      IS-A-SUFFIX(e_1) = no            * * *

Similarly:
  P(e_1 | e_{-2} e_{-1} e_0) = P(e_1) β(e_0) β(e_{-1} e_0) β(e_{-2} e_{-1} e_0)
  P(e_2 | e_{-1} e_0 e_1) = P(e_2 | e_1) β(e_0 e_1) β(e_{-1} e_0 e_1)
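The left-side table is the mirror-image scan, from the last state word backwards. As before, the IS-A-SUFFIX set is a hypothetical stand-in, not the actual JOSHUA code.

```java
import java.util.*;

// Sketch of left-side equivalent-state computation (hypothetical
// IS-A-SUFFIX set, not the actual JOSHUA code). Scanning right to
// left, the last remaining state word is replaced by "*" while the
// leading words are not the suffix of any listed m-gram; its
// probability is finalized, and only backoff weights (independent of
// the replaced word) remain to be factored in later.
public class LeftStateSketch {
    static List<String> equivalentLeftState(List<String> state, Set<String> isASuffix) {
        List<String> out = new ArrayList<>(state);
        for (int k = state.size(); k >= 1; k--) {
            String head = String.join(" ", state.subList(0, k));
            if (isASuffix.contains(head)) break;  // listed: keep the rest
            out.set(k - 1, "*");
        }
        return out;
    }
}
```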
Equivalent State Maintenance: summary
Original Cost Function
Finalized
probability
Estimated
probability
State
extraction
Modified Cost Function
Experimental Results: Decoding Speed
System Training
Task: Chinese to English translation
Sub-sampling a bitext of about 3M sentence pairs to obtain 570k sentence pairs
LM training data: Gigaword and English side of bitext
Decoding speed
Number of rules: 3M
Number of m-grams: 49M
38 times faster than
the baseline!
Experimental Results: Distributed LM
Distributed Language Model
Eight 7-gram LMs
Decoding speed: 12.2 sec/sent
Experimental Results: Equivalent LM States
Search effort versus search quality
Equivalent LM State Maintenance
Sparse LM: a 7-gram LM built on about 19M words
Dense LM: a 3-gram LM built on about 130M words
[Figure: search effort versus search quality curves for both LMs.]
With the dense LM, the equivalent LM state maintenance is slower than the regular method:
Backoff happens less frequently
Inefficient suffix/prefix information lookup
Summary
We describe a scalable parsing-based MT decoder
The decoder has been successfully used for decoding
millions of sentences in a large-scale discriminative
training task
We propose a method to maintain equivalent LM
states
The decoder is available at
http://www.cs.jhu.edu/~zfli/
Acknowledgements
Thanks to Philip Resnik for letting me use the
UMD Python decoder
Thanks to UMD MT group members for very
helpful discussions
Thanks to David Chiang for Hiero and his
original implementation in Python
Thank you!
Grammar Formalism
Synchronous Context-free Grammar (SCFG)
Ts: a set of source-language terminal symbols
Tt: a set of target-language terminal symbols
N: a shared set of nonterminal symbols
A set of rules of the form X → ⟨γ, α⟩, with γ ∈ (N ∪ Ts)* and α ∈ (N ∪ Tt)*
a typical rule looks like: X → (X0 的 X1, X1 of X0)
Chart-parsing
Grammar formalism
Synchronous Context-free Grammar (SCFG)
Decoding task is defined as
Chart parsing
It maintains a chart, which contains an array of cells or bins
A cell maintains a list of items
The parsing process starts from axioms, and proceeds by
applying the inference rules to prove more and more items, until a
goal item is proved.
The hypotheses are stored in a structure called a hypergraph.
m-gram LM Integration
Three Functions
Accumulate probability
Estimate future cost
State extraction
Cost Function
Finalized
probability
Estimated
probability
State
extraction
Parallel and Distributed Decoding
Parallel Decoding
Divide the test set into multiple parts
Each part is decoded by a separate thread
The threads share the language/translation models in memory
Distributed Language Model (DLM)
Training
Divide the corpora into multiple parts
Train a LM on each part
Find the optimal weights among the LMs
Maximize the likelihood of a dev set
Decoding
Load the LMs into different servers
The decoder remotely calls the servers to obtain the probabilities
The decoder then interpolates the probabilities on the fly
To save communication overhead, a cache is maintained
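The decoding side of the DLM can be sketched as follows. The "servers" here are hypothetical stand-ins (plain functions from an n-gram to a log10 probability) for the remote calls the real decoder makes.

```java
import java.util.*;
import java.util.function.Function;

// Sketch of distributed-LM interpolation with a probability cache
// (hypothetical interfaces, not the actual JOSHUA code). The decoder
// interpolates the component LM probabilities on the fly and caches
// results to save communication overhead.
public class DistributedLmSketch {
    final List<Function<String, Double>> servers;  // remote LM lookups
    final double[] weights;                        // tuned on a dev set
    final Map<String, Double> cache = new HashMap<>();

    DistributedLmSketch(List<Function<String, Double>> servers, double[] weights) {
        this.servers = servers;
        this.weights = weights;
    }

    double prob(String ngram) {
        return cache.computeIfAbsent(ngram, ng -> {
            double p = 0.0;  // interpolate in the probability domain
            for (int i = 0; i < servers.size(); i++)
                p += weights[i] * Math.pow(10, servers.get(i).apply(ng));
            return Math.log10(p);
        });
    }
}
```

Each n-gram triggers at most one round of server calls; repeated queries during cube pruning hit the cache instead.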
Chart-parsing
Decoding task is defined as
Chart parsing
State of an Item
It maintains a chart, which contains an array of cells or bins
A cell maintains a list of items
The parsing process starts from axioms, and proceeds by
applying the inference rules to prove more and more items, until a
goal item is proved.
The hypotheses are stored in a structure called a hypergraph.
Source span, left-side nonterminal symbol, and left/right LM state
Decoding complexity
Hypergraph
A hypergraph consists of a set of nodes and hyperedges;
in parsing, they correspond to items and deductive steps, respectively.
Roughly, a hyperedge can be thought of as a rule with pointers.
State of an item
Source span, left-side nonterminal symbol, and left/right LM state
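The node/hyperedge structure can be sketched directly. These are hypothetical classes, not the actual JOSHUA data structures; the derivation count shows how one node packs multiple trees.

```java
import java.util.*;

// Sketch of the hypergraph data structures (hypothetical classes, not
// the actual JOSHUA code). A node corresponds to an item; a hyperedge
// corresponds to a deductive step: roughly a rule with pointers to the
// antecedent items it combined.
public class HypergraphSketch {
    static class Node {                       // an item
        final String state;                   // span, nonterminal, LM states
        final List<Hyperedge> incoming = new ArrayList<>();
        Node(String state) { this.state = state; }
    }
    static class Hyperedge {                  // a deductive step
        final String rule;                    // e.g. "(X0 的 X1, X1 of X0)"
        final List<Node> antecedents;         // pointers to child items
        Hyperedge(String rule, List<Node> antecedents) {
            this.rule = rule;
            this.antecedents = antecedents;
        }
    }
    // Count the derivation trees packed under a node.
    static long countDerivations(Node n) {
        if (n.incoming.isEmpty()) return 1;   // axiomatic item
        long total = 0;
        for (Hyperedge e : n.incoming) {
            long prod = 1;
            for (Node child : e.antecedents) prod *= countDerivations(child);
            total += prod;
        }
        return total;
    }
}
```

In the example figure, the item over span (0, 4) has several incoming hyperedges (one per rule), so it packs one derivation per rule choice.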
[Figure: the example hypergraph over the source 垫子0 上1 的2 猫3. The axiomatic items X | 0, 2 | the mat | NA (via (垫子 上, the mat)) and X | 3, 4 | a cat | NA (via (猫, a cat)) are combined by hyperedges applying (X0 的 X1, X0 X1), (X0 的 X1, X0 ’s X1), (X0 的 X1, X1 of X0), and (X0 的 X1, X1 on X0) into the items X | 0, 4 | the mat | a cat and X | 0, 4 | a cat | the mat; the rule (X0, X0) then proves the goal item S.]