PowerPoint Transcript

A Scalable Decoder for Parsing-based Machine Translation with Equivalent Language Model State Maintenance
Zhifei Li and Sanjeev Khudanpur
Johns Hopkins University
JOSHUA: a scalable open-source parsing-based MT decoder
• Written in Java
• Chart parsing
• Beam and cube pruning
• k-best extraction over a hypergraph (Chiang, 2007)
• m-gram LM integration
• Parallel decoding (New!)
• Distributed LM (Zhang et al., 2006; Brants et al., 2007) (New!)
• Equivalent LM state maintenance (New!)
• We plan to add more functions soon
Chart-parsing
• Grammar formalism
  • Synchronous Context-Free Grammar (SCFG)
• Chart parsing
  • Bottom-up parsing
  • It maintains a chart, which contains an array of cells or bins
  • A cell maintains a list of items
  • The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
  • The hypotheses are stored in a hypergraph (a minimal data-structure sketch follows below)
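To make the chart bookkeeping concrete, here is a minimal Java sketch (illustrative only; these are not JOSHUA's actual classes or field names) of a chart whose cells are indexed by source span and whose items are keyed by their state, so re-derivations of the same state are folded into one entry.

import java.util.*;

// Illustrative sketch only: a chart indexed by source span [i, j).
// Each cell keeps a list of items; an item is identified by its state
// (span, left-hand-side nonterminal, left/right LM boundary words).
class Item {
    final int i, j;                   // source span
    final String lhs;                 // nonterminal symbol, e.g. "X" or "S"
    final List<String> leftLMState;   // leftmost m-1 target words
    final List<String> rightLMState;  // rightmost m-1 target words
    double bestCost;                  // cost of the best derivation so far

    Item(int i, int j, String lhs, List<String> left, List<String> right, double cost) {
        this.i = i; this.j = j; this.lhs = lhs;
        this.leftLMState = left; this.rightLMState = right;
        this.bestCost = cost;
    }

    // Items with the same signature are the same node in the hypergraph.
    String signature() {
        return i + "|" + j + "|" + lhs + "|" + leftLMState + "|" + rightLMState;
    }
}

class Cell {
    // signature -> item, so a newly proved item either creates a new entry
    // or simply updates the cost of an existing one.
    final Map<String, Item> items = new HashMap<>();
}

class Chart {
    final Cell[][] cells;             // cells[i][j] covers source words i..j-1

    Chart(int sentenceLength) {
        cells = new Cell[sentenceLength][sentenceLength + 1];
        for (int i = 0; i < sentenceLength; i++)
            for (int j = i + 1; j <= sentenceLength; j++)
                cells[i][j] = new Cell();
    }

    // Bottom-up order: items for short spans are proved first, then widened.
    void addItem(Item item) {
        cells[item.i][item.j].items.merge(item.signature(), item,
            (old, neu) -> { old.bestCost = Math.min(old.bestCost, neu.bestCost); return old; });
    }
}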
Hypergraph
[Figure: an example hypergraph for the source 垫子0 上1 的2 猫3. The axioms X | 0, 2 | the mat | NA (rule (垫子 上, the mat)) and X | 3, 4 | a cat | NA (rule (猫, a cat)) are combined by hyperedges carrying the rules (X0 的 X1, X0 X1), (X0 的 X1, X0 ’s X1), (X0 的 X1, X1 on X0), and (X0 的 X1, X1 of X0) into the items X | 0, 4 | the mat | a cat and X | 0, 4 | a cat | the mat (e.g. "a cat on the mat"), which are promoted by the rule (X0, X0) to S items and finally to the goal item. Nodes are items; hyperedges are labeled with rules.]
Hypergraph and Trees
[Figure: the same hypergraph compactly encodes several derivation trees over 垫子0 上1 的2 猫3; choosing a different hyperedge at the top X node yields different translations, e.g. "a cat on the mat", "a cat of the mat", or "the mat ’s a cat".]
How to Integrate an m-gram LM?
• Three functions (see the sketch after the figure)
  • Accumulate probability
  • Estimate future cost
  • State extraction
[Figure: worked 3-gram example for the source 奥运会0 将1 在2 中国3 的4 北京5 举行。6. The rule (X0 的 X1, X1 of X0) combines X | 3, 4 | china | NA and X | 5, 6 | beijing | NA into X | 3, 6 | beijing of | of china, creating the new 3-gram "beijing of china". The rule (将 在 X0 举行。, will be held in X0 .) then yields X | 1, 7 | will be | china ., creating the new 3-grams "will be held", "be held in", "held in beijing", and "in beijing of", whose accumulated probability is 0.04 = 0.4 × 0.2 × 0.5; the future probability of the boundary words "beijing of" is estimated as P(beijing of) = 0.01, and combining the accumulated and future probabilities gives the estimated total probability. Via (奥运会, the olympic game), (X0, X0), and (S0 X1, S0 X1) we obtain S | 0, 1 | the olympic | olympic game and S | 0, 7 | the olympic | china ., and the goal item S | 0, 7 | <s> the | . </s> adds the sentence-boundary m-grams through (<s> S0 </s>, <s> S0 </s>).]
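As a rough illustration of the three functions, the following Java sketch assumes a 3-gram LM stored as a plain map from space-joined n-grams to log probabilities; the class, the -10.0 score floor, and all method names are invented for this example and are not JOSHUA's API.

import java.util.*;

// Illustrative 3-gram LM integration sketch (not JOSHUA's actual code).
class LMIntegration {
    static final int M = 3;
    final Map<String, Double> logProb;   // n-gram -> log probability

    LMIntegration(Map<String, Double> logProb) { this.logProb = logProb; }

    // 1) Accumulate probability: score every word that now has a full
    //    (m-1)-word context, i.e. the "new m-grams" created by the rule.
    double accumulate(List<String> words) {
        double sum = 0.0;
        for (int k = M - 1; k < words.size(); k++)
            sum += logProb.getOrDefault(join(words, k - M + 1, k + 1), -10.0); // assumed floor
        return sum;
    }

    // 2) Estimate future cost: the first m-1 words lack a full context,
    //    so score them with lower-order (unigram/bigram) probabilities.
    double estimateFuture(List<String> words) {
        double sum = 0.0;
        for (int k = 0; k < Math.min(M - 1, words.size()); k++)
            sum += logProb.getOrDefault(join(words, 0, k + 1), -10.0);
        return sum;
    }

    // 3) State extraction: keep only the leftmost and rightmost m-1 words;
    //    the words in between can no longer affect any future m-gram.
    List<List<String>> extractState(List<String> words) {
        int n = words.size();
        List<String> left = new ArrayList<>(words.subList(0, Math.min(M - 1, n)));
        List<String> right = new ArrayList<>(words.subList(Math.max(0, n - M + 1), n));
        return Arrays.asList(left, right);
    }

    private static String join(List<String> w, int from, int to) {
        return String.join(" ", w.subList(from, to));
    }
}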
Equivalent State Maintenance: overview
• In a straightforward implementation, different LM state words lead to different items, e.g. items produced by rules such as (在 X0 的 X1 下, below X1 of X0) and (在 X0 的 X1 下, under X1 of X0):
  X | 0, 3 | below cat | some rat
  X | 0, 3 | below cats | many rat
  X | 0, 3 | under cat | some rat
  X | 0, 3 | below cat | many rat
• We merge multiple items into a single item by replacing some LM state words with an asterisk wildcard, e.g.:
  X | 0, 3 | below * | * rat
• By merging items, we can explore a larger hypothesis space using less time (see the sketch below).
• We only merge items when the length of the English span l ≥ m − 1.
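A small Java sketch of the merging idea (all names are illustrative): items are pooled by a signature built from the span, the nonterminal, and the left/right LM states, so once state words have been replaced by the wildcard, items that previously differed only in those words collapse into a single entry.

import java.util.*;

// Illustrative sketch: items are pooled by signature, so after some LM
// state words are replaced by the "*" wildcard, items that used to differ
// only in those words become one item.
class ItemPool {
    private final Map<String, Double> bestCost = new HashMap<>();

    static String signature(int i, int j, String lhs,
                            List<String> leftState, List<String> rightState) {
        return i + "|" + j + "|" + lhs + "|"
             + String.join(" ", leftState) + "|" + String.join(" ", rightState);
    }

    void add(String sig, double cost) {
        bestCost.merge(sig, cost, Math::min);  // keep the cheaper derivation
    }

    public static void main(String[] args) {
        ItemPool pool = new ItemPool();
        // "below cat | some rat" and "below cats | many rat" both become
        // "below * | * rat" once the affected state words are wildcards.
        pool.add(signature(0, 3, "X", List.of("below", "*"), List.of("*", "rat")), 2.1);
        pool.add(signature(0, 3, "X", List.of("below", "*"), List.of("*", "rat")), 1.7);
        System.out.println(pool.bestCost); // one merged item, cost 1.7
    }
}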
Back-off Parameterization of m-gram LMs
• LM probability computation
• Observations
  • A larger m leads to more backoff
  • The default backoff weight is 1: for an m-gram not listed, β(·) = 1
• Example LM entries (log probability, bigram, backoff weight where listed):
  -4.250922   party files
  -4.741889   party filled
  -4.250922   party finance      -0.1434139
  -4.741889   party financed
  -4.741889   party finances     -0.2361806
  -4.741889   party financially
  -3.33127    party financing    -0.1119054
  -3.277455   party finished     -0.4362795
  -4.012205   party fired
  -4.741889   party fires
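Under these conventions, the back-off lookup can be sketched as below (illustrative Java, not JOSHUA's code): if the full m-gram is listed its probability is used directly; otherwise the history's backoff weight (defaulting to 1, i.e. 0 in log space) is applied and the next-lower-order model is consulted.

import java.util.*;

// Illustrative back-off lookup for an ARPA-style LM:
//   P(w | h) = P_listed(h w)               if "h w" is listed
//            = beta(h) * P(w | shorter h)  otherwise, with beta(h) = 1
//                                          when h has no listed backoff weight.
// Probabilities and backoff weights are stored as log values, as in the table above.
class BackoffLM {
    final Map<String, Double> logProb = new HashMap<>();    // n-gram -> log P
    final Map<String, Double> logBackoff = new HashMap<>(); // n-gram -> log beta

    // words = history followed by the predicted word, e.g. ["party", "finance", "reform"]
    double logP(List<String> words) {
        if (words.size() == 1)
            return logProb.getOrDefault(words.get(0), -99.0);  // assumed unknown-word floor
        String ngram = String.join(" ", words);
        Double p = logProb.get(ngram);
        if (p != null) return p;                               // m-gram listed: no backoff
        String history = String.join(" ", words.subList(0, words.size() - 1));
        double beta = logBackoff.getOrDefault(history, 0.0);   // default backoff weight is 1
        return beta + logP(words.subList(1, words.size()));    // recurse on the shorter context
    }
}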
Equivalent State Maintenance: Right-side
• For the case of a 4-gram LM: the right-side state is the last three words e_{l-2} e_{l-1} e_l, and the future words e_{l+1} e_{l+2} e_{l+3} … will be appended to its right.
• If IS-A-PREFIX(e_{l-2} e_{l-1} e_l) = no, then no listed 4-gram starts with these words, the backoff weight is one, and the probability is independent of e_{l-2}:
  P(e_{l+1} | e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l) β(e_{l-2} e_{l-1} e_l) = P(e_{l+1} | e_{l-1} e_l)
• The state words are therefore checked from left to right (see the sketch after this slide):

  State                 | IS-A-PREFIX               | Equivalent state
  e_{l-2} e_{l-1} e_l   | e_{l-2} e_{l-1} e_l : no  | * e_{l-1} e_l
  * e_{l-1} e_l         | e_{l-1} e_l : no          | * * e_l
  * * e_l               | e_l : no                  | * * *

• Note: IS-A-PREFIX(e_{l-1} e_l) = no implies IS-A-PREFIX(e_{l-1} e_l e_{l+1}) = no.
• Why not right to left? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
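A sketch of the right-side procedure for a general m (illustrative Java; IS-A-PREFIX is abstracted as a predicate that would, in practice, query the LM): scanning the right-side state from left to right, a word is replaced by the wildcard as long as the remaining words are not a prefix of any listed m-gram.

import java.util.*;
import java.util.function.Predicate;

// Illustrative right-side state truncation.
class RightStateTruncation {
    // state = the rightmost m-1 words of the item, left to right.
    static List<String> equivalentRightState(List<String> state,
                                             Predicate<List<String>> isAPrefix) {
        List<String> result = new ArrayList<>(state);
        for (int start = 0; start < result.size(); start++) {
            // If the words from position start onward are not a prefix of any
            // listed m-gram, no future m-gram can use result[start], so it can
            // be replaced by "*" (and the backoff weight is one).
            if (isAPrefix.test(result.subList(start, result.size())))
                break;                      // this word still matters; stop here
            result.set(start, "*");
        }
        return result;
    }
}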
Equivalent State Maintenance: Left-side
• For the case of a 4-gram LM: the left-side state is the first three words e_1 e_2 e_3, and the future words … e_{-2} e_{-1} e_0 will be prepended to its left.
• If IS-A-SUFFIX(e_1 e_2 e_3) = no, then no listed 4-gram ends with these words, so P(e_3 | e_0 e_1 e_2) backs off: the probability P(e_3 | e_1 e_2) can be finalized now, the remaining dependence on the future words is independent of e_3, and we only have to remember to factor in the backoff weights later:
  P(e_3 | e_0 e_1 e_2) = P(e_3 | e_1 e_2) β(e_0 e_1 e_2)
• The state words are therefore checked from right to left (see the sketch after this slide):

  State        | IS-A-SUFFIX       | Equivalent state
  e_1 e_2 e_3  | e_1 e_2 e_3 : no  | e_1 e_2 *
  e_1 e_2 *    | e_1 e_2 : no      | e_1 * *
  e_1 * *      | e_1 : no          | * * *

• Finalized probabilities:
  P(e_3 | e_0 e_1 e_2) = P(e_3 | e_1 e_2) β(e_0 e_1 e_2)
  P(e_2 | e_{-1} e_0 e_1) = P(e_2 | e_1) β(e_0 e_1) β(e_{-1} e_0 e_1)
  P(e_1 | e_{-2} e_{-1} e_0) = P(e_1) β(e_0) β(e_{-1} e_0) β(e_{-2} e_{-1} e_0)
• Why not left to right? Whether a word can be ignored depends on both its left and right sides, which complicates the procedure.
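The left-side procedure mirrors the right-side one, scanning from right to left with the IS-A-SUFFIX test (again an illustrative sketch; the finalization of probabilities and the later factoring-in of backoff weights are only noted in comments).

import java.util.*;
import java.util.function.Predicate;

// Illustrative left-side state truncation.
class LeftStateTruncation {
    // state = the leftmost m-1 words of the item, left to right.
    static List<String> equivalentLeftState(List<String> state,
                                            Predicate<List<String>> isASuffix) {
        List<String> result = new ArrayList<>(state);
        for (int end = result.size(); end > 0; end--) {
            // If the words up to position end are not a suffix of any listed
            // m-gram, P(result[end-1] | ...) no longer depends on the future
            // words: it can be finalized now (backoff weights are factored in
            // later) and the word can be replaced by "*".
            if (isASuffix.test(result.subList(0, end)))
                break;                      // this word still matters; stop here
            result.set(end - 1, "*");
        }
        return result;
    }
}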
Equivalent State Maintenance: summary
• Original cost function (finalized probability, estimated probability, state extraction)
• Modified cost function
Experimental Results: Decoding Speed
• System training
  • Task: Chinese-to-English translation
  • Sub-sampling of a bitext of about 3M sentence pairs to obtain 570k sentence pairs
  • LM training data: Gigaword and the English side of the bitext
• Decoding
  • Number of rules: 3M
  • Number of m-grams: 49M
  • Decoding speed: 38 times faster than the baseline!
Experimental Results: Distributed LM
• Distributed Language Model
  • Eight 7-gram LMs
  • Decoding speed: 12.2 sec/sent
Experimental Results: Equivalent LM States
• Search effort versus search quality
• Equivalent LM state maintenance
  • Sparse LM: a 7-gram LM built on about 19M words
  • Dense LM: a 3-gram LM built on about 130M words
    • The equivalent LM state maintenance is slower than the regular method
      • Backoff happens less frequently
      • Inefficient suffix/prefix information lookup
[Figure: search effort versus search quality, with measurements at 30, 50, 70, 90, 120, 150, and 200.]
Summary
• We describe a scalable parsing-based MT decoder
  • The decoder has been successfully used for decoding millions of sentences in a large-scale discriminative training task
• We propose a method to maintain equivalent LM states
• The decoder is available at http://www.cs.jhu.edu/~zfli/
Acknowledgements
• Thanks to Philip Resnik for letting me use the UMD Python decoder
• Thanks to UMD MT group members for very helpful discussions
• Thanks to David Chiang for Hiero and his original implementation in Python

Thank you!
Grammar Formalism
• Synchronous Context-Free Grammar (SCFG)
  • Ts: a set of source-language terminal symbols
  • Tt: a set of target-language terminal symbols
  • N: a shared set of nonterminal symbols
  • A set of rules of the form
  • A typical rule looks like: X → (X0 的 X1, X1 of X0) (see the sketch below)
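For concreteness, a rule such as X → (X0 的 X1, X1 of X0) might be represented roughly as follows (an illustrative Java sketch, not JOSHUA's rule class).

import java.util.*;

// Illustrative SCFG rule: a left-hand-side nonterminal plus aligned source
// and target sides; nonterminals on the two sides are co-indexed (X0, X1).
class Rule {
    final String lhs;              // e.g. "X"
    final List<String> source;     // e.g. ["X0", "的", "X1"]
    final List<String> target;     // e.g. ["X1", "of", "X0"]
    final double cost;             // model cost of applying the rule

    Rule(String lhs, List<String> source, List<String> target, double cost) {
        this.lhs = lhs; this.source = source; this.target = target; this.cost = cost;
    }
}

// Example: new Rule("X", List.of("X0", "的", "X1"), List.of("X1", "of", "X0"), 0.5);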
Chart-parsing
• Grammar formalism
  • Synchronous Context-Free Grammar (SCFG)
• Decoding task is defined as
• Chart parsing
  • It maintains a chart, which contains an array of cells or bins
  • A cell maintains a list of items
  • The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
  • The hypotheses are stored in a structure called a hypergraph
m-gram LM Integration
• Three functions
  • Accumulate probability
  • Estimate future cost
  • State extraction
• Cost function (finalized probability, estimated probability, state extraction)
Parallel and Distributed Decoding
• Parallel decoding
  • Divide the test set into multiple parts
  • Each part is decoded by a separate thread
  • The threads share the language/translation models in memory
• Distributed Language Model (DLM)
  • Training
    • Divide the corpora into multiple parts
    • Train an LM on each part
    • Find the optimal weights among the LMs by maximizing the likelihood of a dev set
  • Decoding (a sketch of the client side follows below)
    • Load the LMs into different servers
    • The decoder remotely calls the servers to obtain the probabilities
    • The decoder then interpolates the probabilities on the fly
    • To save communication overhead, a cache is maintained
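A rough sketch of the decoding-time client under the assumptions above (the RemoteLM interface and all names are invented for illustration): the decoder asks each server for a probability, interpolates the answers with the tuned weights, and caches the result to avoid repeated remote calls.

import java.util.*;

// Illustrative distributed-LM client: each server holds one LM; the client
// interpolates their probabilities on the fly and caches results to save
// communication overhead.
class DistributedLMClient {
    interface RemoteLM { double prob(String ngram); }   // stands in for an RPC stub

    private final List<RemoteLM> servers;
    private final double[] weights;                     // interpolation weights tuned on a dev set
    private final Map<String, Double> cache = new HashMap<>();

    DistributedLMClient(List<RemoteLM> servers, double[] weights) {
        this.servers = servers; this.weights = weights;
    }

    double prob(String ngram) {
        Double cached = cache.get(ngram);
        if (cached != null) return cached;              // avoid another round of remote calls
        double p = 0.0;
        for (int k = 0; k < servers.size(); k++)        // linear interpolation of the LMs
            p += weights[k] * servers.get(k).prob(ngram);
        cache.put(ngram, p);
        return p;
    }
}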
Chart-parsing
• Decoding task is defined as
• Chart parsing
  • It maintains a chart, which contains an array of cells or bins
  • A cell maintains a list of items
  • The parsing process starts from axioms and proceeds by applying inference rules to prove more and more items, until a goal item is proved
  • The hypotheses are stored in a structure called a hypergraph
• State of an item
  • Source span, left-side nonterminal symbol, and left/right LM state
• Decoding complexity
Hypergraph
• A hypergraph consists of a set of nodes and hyperedges
  • In parsing, they correspond to an item and a deductive step, respectively
  • Roughly, a hyperedge can be thought of as a rule with pointers (a minimal sketch follows below)
• State of an item
  • Source span, left-side nonterminal symbol, and left/right LM state
[Figure: the example hypergraph for 垫子0 上1 的2 猫3 shown earlier, with items such as X | 0, 2 | the mat | NA, X | 3, 4 | a cat | NA, X | 0, 4 | a cat | the mat, and X | 0, 4 | the mat | a cat connected by hyperedges carrying rules such as (垫子 上, the mat), (猫, a cat), (X0 的 X1, X1 on X0), (X0 的 X1, X1 of X0), (X0 的 X1, X0 ’s X1), (X0 的 X1, X0 X1), and (X0, X0), up to the goal item.]
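A minimal Java sketch of that structure (illustrative names only): a node corresponds to an item and records the hyperedges that derive it; a hyperedge is a rule plus pointers to its antecedent nodes.

import java.util.*;

// Illustrative hypergraph: nodes correspond to items, hyperedges to
// deductive steps; a hyperedge is essentially a rule with pointers to
// the antecedent nodes it combined.
class HGNode {
    final String signature;               // span, nonterminal, left/right LM state
    final List<HyperEdge> incoming = new ArrayList<>();  // ways to derive this item
    HGNode(String signature) { this.signature = signature; }
}

class HyperEdge {
    final String rule;                    // e.g. "(X0 的 X1, X1 of X0)"
    final List<HGNode> antecedents;       // the items the rule was applied to
    final double cost;
    HyperEdge(String rule, List<HGNode> antecedents, double cost) {
        this.rule = rule; this.antecedents = antecedents; this.cost = cost;
    }
}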