Transcript ppt1

Machine Translation
Decoder for Phrase-Based SMT
Stephan Vogel
Spring Semester 2011
Decoder
- Decoding issues
- Two-step decoding
  - Generation of translation lattice
  - Best path search
  - With limited word reordering
- Specific Issues (Next Session)
  - Recombination of hypotheses
  - Pruning
  - N-best list generation
  - Future cost estimation
Decoding Issues
- The decoder takes the source sentence and all available knowledge (translation model, distortion model, language model, etc.) and generates a target sentence
- Many alternative translations are possible
  - Too many to explore them all -> pruning is necessary
  - Pruning leads to search errors
- The decoder outputs the model-best translation
  - Ranking of hypotheses according to the model is different from ranking according to an external metric
  - Bad translations get better model scores than good translations -> model errors
- Models see only limited context
  - Different hypotheses become identical under the model
  - -> Hypothesis recombination
Decoding Issues
- Languages have different word order
  - Modeled by distortion models
  - Exploring all possible reorderings is too expensive (essentially O(J!))
  - Need to restrict reordering -> different reordering strategies
- Optimizing the system
  - We use a bunch of models (features) and need to optimize their scaling factors (feature weights)
  - Decoding is expensive
  - Optimize on n-best lists -> need to generate n-best lists
Decoder: The Knowledge Sources
- Translation models
  - Phrase translation table
  - Statistical lexicon and/or manual lexicon
  - Named entities
  - Translation information stored as transducers or extracted on the fly
- Language model: standard n-gram LM
- Distortion model: distance-based or lexicalized
- Sentence length model
  - Typically simulated by a word-count feature
- Other features
  - Phrase count
  - Number of untranslated words
  - …
The Decoder: Two Level Approach
- Build translation lattice
  - Run left to right over the test sentence
  - Search for matching phrases between the source sentence and the phrase table (and other translation tables)
  - For each translation, insert edges into the lattice
- First-best search (or n-best search)
  - Run left to right over the lattice
  - Apply the n-gram language model
  - Combine translation model scores and language model score
  - Recombine and prune hypotheses
  - At sentence end: add the sentence length model score
  - Trace back the best hypothesis (or n-best hypotheses)
- Notice: this separation is convenient for describing the decoder
  - An implementation can interleave both processes
  - The implementation can make a difference due to pruning
Building Translation Lattice
Sentence: ich komme morgen zu dir
Reference: I will come to you tomorrow
- Search in corpus for phrases and their translations
- Insert edges into the lattice
[Lattice figure: nodes 0 … J over the source words ich komme morgen zu dir; edges carry the candidate translations I, I come, I will come, come, tomorrow, to, you, to your office]
Phrase Table in Hash Map
- Store phrase table in hash map (source phrase as key)
- For each n-gram in source sentence access hash map

foreach j = 1 to J-1                  // start position of phrase
  foreach l = 0 to lmax-1             // phrase length
    SourcePhrase = (wj … wj+l)
    TargetPhrases = Hashmap.Get( SourcePhrase )
    foreach TargetPhrase t in TargetPhrases
      create new edge e’ = (j-1, j+l, t )     // add TM scores
- Works fine for sentence input, but too expensive for lattices
  - Lattices from speech recognizer
  - Paraphrases
  - Reordering as preprocessing step
  - Hierarchical transducers
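A minimal Python sketch of the hash-map lookup above. The phrase-table entries and costs are invented for illustration, and 0-based word indices are used, so an edge over words j..j+l runs from node j to node j+l+1.

def build_lattice_edges(source_words, phrase_table, lmax=4):
    """Return lattice edges (start_node, end_node, translation, cost) for one sentence."""
    J = len(source_words)
    edges = []
    for j in range(J):                          # start position of phrase (0-based)
        for l in range(min(lmax, J - j)):       # phrase length minus one
            src = tuple(source_words[j:j + l + 1])
            for translation, cost in phrase_table.get(src, []):
                edges.append((j, j + l + 1, translation, cost))
    return edges

# Toy phrase table: source phrase -> list of (translation, -log probability); values are invented.
phrase_table = {
    ("ich",):          [("I", 0.7)],
    ("ich", "komme"):  [("I come", 1.2), ("I will come", 1.5)],
    ("komme",):        [("come", 0.9)],
    ("morgen",):       [("tomorrow", 0.4)],
    ("zu",):           [("to", 0.8)],
    ("dir",):          [("you", 1.0)],
    ("zu", "dir"):     [("to you", 1.1), ("to your office", 2.3)],
}

edges = build_lattice_edges("ich komme morgen zu dir".split(), phrase_table)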
Example: Paraphrase Lattice
- Large: top-5 paraphrases
- Pruned
Phrase Table as Prefix Tree
[Prefix-tree figure over the source words ja , okay dann Montag bei mir; matched paths emit the translations okay, then, okay on Monday]
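A possible Python sketch of a phrase table stored as a prefix tree (trie). Matching can then be extended one source word at a time, which is what lattice input needs. The class, helper names, and toy entries (taken from the figure above, with invented costs) are assumptions.

class TrieNode:
    def __init__(self):
        self.children = {}        # next source word -> TrieNode
        self.translations = []    # (target phrase, cost) pairs if a source phrase ends here

def insert(root, source_phrase, translation, cost):
    node = root
    for word in source_phrase:
        node = node.children.setdefault(word, TrieNode())
    node.translations.append((translation, cost))

def extend(node, word):
    """Advance a partial match by one source word; returns None if no stored phrase continues."""
    return node.children.get(word)

root = TrieNode()
insert(root, ["ja", ",", "okay"], "okay", 1.0)
insert(root, ["dann"], "then", 0.5)

node = root
for w in ["ja", ",", "okay"]:
    node = extend(node, w)
print(node.translations)          # [('okay', 1.0)]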
Building the Translation Lattice
- Book-keeping: hypothesis h = (n, n, s0, hprev, e)
  - n     – node
  - s0    – initial state in transducer
  - hprev – previous hypothesis
  - e     – edge
- Convert sentence into lattice structure
- At each node n, insert an ‘empty’ hypothesis
  h = (n, n, s0, hprev = nil, e = nil)
  as starting point for phrase search from this position
- Note: previous hypothesis and edge are only needed for hierarchical transducers, to be able to ‘propagate’ partial translations
Algorithm for Building Translation Lattice
foreach node n = 0 to J
  create empty hypothesis h0 = (n, n, s0, NIL, NIL)
  Hyps( n ) = Hyps( n ) + h0
  foreach incoming edge e in n
    w = WordAt( e )
    nprev = FromNode( e )
    foreach hypothesis hprev = (ns, nprev, sprev, hx, ex ) in Hyps( nprev )
      if transducer T has transition ( sprev -> s’ : w )
        if s’ is emitting state
          foreach translation t emitted in s’
            create new edge e’ = (ns, n, t )        // add TM scores
        if s’ is not final state
          create new hypothesis h’ = (ns, n, s’, hprev, e )
          Hyps( n ) = Hyps( n ) + h’
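A hedged Python sketch of this lattice construction for the simple case where the input is a plain sentence (a linear lattice), reusing the TrieNode from the earlier sketch as the transducer. The data layout and names are assumptions, not the actual decoder implementation.

def build_translation_lattice(source_words, root):
    """root: TrieNode phrase table from the previous sketch, used as the transducer."""
    J = len(source_words)
    hyps = {n: [] for n in range(J + 1)}        # hyps[n]: partial phrase matches ending at node n
    edges = []                                   # (start_node, end_node, translation, cost)
    for n in range(J + 1):
        hyps[n].append((n, root))                # empty hypothesis: a new match can start here
        if n == 0:
            continue
        w = source_words[n - 1]                  # the incoming edge of node n carries word w
        for ns, state in hyps[n - 1]:
            nxt = state.children.get(w)          # transducer transition s -> s' : w
            if nxt is None:
                continue
            for t, cost in nxt.translations:     # emitting state: insert translation edges
                edges.append((ns, n, t, cost))
            if nxt.children:                     # not a final state: keep the partial match alive
                hyps[n].append((ns, nxt))
    return edges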
Searching for Best Translation
- We have constructed a graph
  - Directed
  - No cycles
  - Each edge carries a partial translation (with scores)
- Now we need to find the best path
  - Adding additional information (DM, LM, …)
  - Allowing for some reordering
Monotone Search
- Hypotheses describe partial translations
  - Coverage information, translation, scores
- Expand a hypothesis over the outgoing edges (a monotone search sketch follows after this slide)

[Lattice figure as above for ich komme morgen zu dir, with example hypotheses:]
  h: c=0..3, t=I will come tomorrow
  h: c=0..4, t=I will come tomorrow to
  h: c=0..4, t=I will come tomorrow zu
  h: c=0..5, t=I will come tomorrow to your office
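A minimal sketch of monotone search over the edge list produced by the earlier sketches, using only the translation-model costs (the LM and the other models are added in the algorithm on the later "Applying Language Model" slide). Hypotheses reaching the same node are recombined by keeping the cheaper one; all names are assumptions.

from collections import defaultdict

def monotone_best_path(edges, J):
    """edges: (start, end, translation, cost) tuples over nodes 0..J; lower cost is better."""
    by_start = defaultdict(list)
    for s, e, t, c in edges:
        by_start[s].append((e, t, c))
    best = {0: (0.0, [])}                        # node -> (cost so far, partial translation)
    for n in range(J):                           # expand hypotheses strictly left to right
        if n not in best:
            continue                             # node not reachable with the given phrases
        cost, words = best[n]
        for e, t, c in by_start[n]:
            candidate = (cost + c, words + [t])
            if e not in best or candidate[0] < best[e][0]:
                best[e] = candidate              # recombination: keep only the better hypothesis
    return best.get(J)                           # (total cost, translation) or None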
Reordering Strategies
- All permutations
  - Any reordering possible
  - Complexity of the traveling salesman problem -> only possible for very short sentences
- Small jumps ahead – filling the gaps pretty soon
  - Only local word reordering
  - Implemented in the STTK decoder
- Leaving a small number of gaps – filled in at any time
  - Allows for global but limited reordering
  - Similar decoding complexity – exponential in the number of gaps
- IBM-style reordering (described in an IBM patent)
  - Merging neighboring regions with a swap – no gaps at all
  - Allows for global reordering
  - Complexity lower than all permutations, but higher than the ‘small jumps ahead’ and ‘small number of gaps’ strategies
IBM Style Reordering
- Example: first word translated last!

[Figure: coverage of the source sentence after steps 0–7, with regions annotated as gap, another gap, and partially filled]

- Resulting reordering: 2 3 7 8 9 10 11 5 6 4 12 13 14 1
Sliding Window Reordering
- Local reordering within a sliding window of size 6

[Figure: steps 0–8 of the reordering window [ … ] sliding over the sentence, with regions annotated as gap, another gap, partially filled, and new gap]
Coverage Information
- Need to know which source words have already been translated
  - Don’t want to miss some words
  - Don’t want to translate words twice
  - Can compare hypotheses which cover the same words
- Use a coverage vector to store this information (see the sketch after this slide)
  - For ‘small jumps ahead’: position of first gap plus a short bit vector
  - For ‘small number of gaps’: array of positions of uncovered words
  - For ‘merging neighboring regions’: left and right position
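A small sketch of coverage book-keeping with a bit vector, here a Python int used as a bit set (an assumption of this sketch), including the collision check used before expanding a hypothesis over an edge.

def coverage_of(start, end):
    """Bit vector covering source positions start..end-1 (0-based)."""
    return ((1 << (end - start)) - 1) << start

def collides(cov, edge_cov):
    return cov & edge_cov != 0                    # some source word would be translated twice

def extend_coverage(cov, edge_cov):
    return cov | edge_cov

cov = coverage_of(0, 2)                           # words 0 and 1 translated
print(collides(cov, coverage_of(1, 3)))           # True: word 1 is already covered
print(bin(extend_coverage(cov, coverage_of(3, 5))))   # 0b11011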
Limited Distance Word Reordering
- Word and phrase reordering within a given window
  - From the first untranslated source word over the next k positions
  - Window length 1: monotone decoding
- Restrict the total number of reorderings (typically 3 per 10 words)
- Simple ‘jump’ model or lexicalized distortion model
- Use a bit vector for the coverage, e.g. 1001100… = words 1, 4, and 5 translated
- Long sentences give long bit vectors, but only limited reordering is allowed; therefore store
  Coverage = ( first untranslated word, bit vector )
  i.e. 111100110… -> (4, 00110…) (a small sketch of this compression follows after the slide)
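A possible sketch of that compact coverage representation: because reordering is limited to a window, the full bit vector can be replaced by the position of the first untranslated word plus the bits from that position on. Positions are 0-based here, and the list representation is an assumption.

def compact(bits):
    """bits[j] = 1 if source position j is translated; returns (first gap, remaining bits)."""
    first = 0
    while first < len(bits) and bits[first] == 1:
        first += 1
    return first, tuple(bits[first:])             # everything before the first gap is implicitly 1

# 111100110… -> (4, 00110…) as on the slide
print(compact([1, 1, 1, 1, 0, 0, 1, 1, 0]))       # (4, (0, 0, 1, 1, 0))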
Jumping ahead in the Lattice
- Hypotheses describe a partial translation
  - Coverage information, translation, scores
- Expand a hypothesis over an uncovered position (within the window)

[Lattice figure as above for ich komme morgen zu dir, with example hypotheses:]
  h: c=11000, t=I will come
  h: c=11011, t=I will come to your office
  h: c=11111, t=I will come to your office tomorrow
Hypothesis for Search
- Organize the search according to the number of translated source words c
- It is expensive to expand (concatenate) the translation at every step
  - Replace it by back-trace information
  - Generate the full translation only for the best (n-best) final hypotheses
- Book-keeping: hypothesis h = (Q, C, L, i, hprev, e)
  - Q     – total cost (we also keep the cumulative costs of the individual models)
  - C     – coverage information: positions already translated
  - L     – language model state, e.g. the last n-1 words for an n-gram LM
  - i     – number of target words
  - hprev – pointer to the previous hypothesis
  - e     – edge traversed to expand hprev into h
- hprev and e are the back-trace information, used to reconstruct the full translation
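One possible Python rendering of this hypothesis record and of the trace-back over (hprev, e). The namedtuple layout and the edge layout (start, end, translation, cost) from the earlier sketches are assumptions.

from collections import namedtuple

Hyp = namedtuple("Hyp", [
    "Q",        # total cost
    "C",        # coverage bit vector
    "L",        # LM state: tuple of the last n-1 target words
    "i",        # number of target words produced so far
    "hprev",    # back-pointer to the previous hypothesis (None for the initial hypothesis)
    "e",        # edge used to expand hprev into this hypothesis (None for the initial hypothesis)
])

def trace_back(h):
    """Follow the back-pointers and concatenate the edge translations."""
    words = []
    while h.hprev is not None:
        words = h.e[2].split() + words            # e = (start, end, translation, cost)
        h = h.hprev
    return " ".join(words)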
Algorithm for Applying Language Model
for coverage c = 0 to J-1
  foreach h in Hyps( c )
    foreach node n within the reordering window
      foreach outgoing edge e in n
        if no coverage collision between h.C and C(e)
          TMScore = -log p( t | s )              // typically several scores
          DMScore = -log p( jump )               // or lexicalized DM score
          // other scores like word count, phrase count, etc.
          foreach target word tk in t
            LMScore += -log p( tk | Lk-1 )
            Lk = Lk-1 ⊕ tk                       // append tk to the LM history
          endfor
          Q’ = k1*TMScore + k2*LMScore + k3*DMScore + …
          h’ = ( h.Q + Q’, h.C ∪ C(e), L’, h.i + |t|, h, e )
          Hyps( c’ ) += h’                       // c’ = c + number of source words covered by e
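A hedged Python sketch of a single expansion step from this loop, building on the Hyp record and edge layout above. The LM interface lm.logprob(word, history), the simple distance-based distortion cost, and the feature weights are assumptions of the sketch.

def first_gap(cov):
    """Position of the first untranslated source word in a bit-vector coverage."""
    j = 0
    while (cov >> j) & 1:
        j += 1
    return j

def expand(h, edge, lm, weights, lm_order=3):
    start, end, translation, tm_cost = edge
    edge_cov = ((1 << (end - start)) - 1) << start
    if h.C & edge_cov:                              # coverage collision: word would be translated twice
        return None
    dm_cost = abs(start - first_gap(h.C))           # simple distance-based distortion cost
    words = translation.split()
    lm_cost, state = 0.0, h.L
    for w in words:                                 # apply the n-gram LM word by word
        lm_cost += -lm.logprob(w, state)            # assumed LM interface
        state = (state + (w,))[-(lm_order - 1):]    # keep the last n-1 words as the LM state
    q = weights["tm"] * tm_cost + weights["lm"] * lm_cost + weights["dm"] * dm_cost
    return Hyp(h.Q + q, h.C | edge_cov, state, h.i + len(words), h, edge)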
Algorithm for Applying LM cont.
// coverage is now J, i.e. the sentence end is reached
foreach h in Hyps( J )
  SLScore = -log p( h.i | J )             // sentence length model
  LMScore = -log p( </s> | Lh )           // end-of-sentence LM score
  L’ = Lh ⊕ </s>
  Q’ = a*LMScore + b*SLScore
  h’ = ( h.Q + Q’, h.C, L’, h.i, h, e )
  Hyps( J+1 ) += h’
Sort Hyps( J+1 ) according to the total score Q
Trace back over the sequence of (h, e) to construct the actual translation
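A sketch of this sentence-end step in the same Python setting. The sentence length model interface and the weights are assumptions, and each hypothesis keeps its existing back-pointers so that trace_back from the earlier sketch still works.

def finish(hyps_J, J, lm, length_model, weights):
    """Score the end of sentence for all hypotheses that cover the whole input."""
    finished = []
    for h in hyps_J:
        sl_cost = -length_model.logprob(h.i, J)       # -log p( i | J ), assumed interface
        lm_cost = -lm.logprob("</s>", h.L)            # end-of-sentence LM score
        q = weights["lm"] * lm_cost + weights["sl"] * sl_cost
        finished.append(Hyp(h.Q + q, h.C, h.L + ("</s>",), h.i, h.hprev, h.e))
    finished.sort(key=lambda hyp: hyp.Q)              # best (lowest total cost) first
    return finished                                   # trace_back(finished[0]) gives the output string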
Sentence Length Model
- Different languages have different levels of ‘wordiness’
- A histogram of source sentence length vs. target sentence length shows that the distribution is rather flat -> p( I | J ) is not very helpful
- Very simple sentence length model: the more, the better
  - i.e. give a bonus for each word (not a probabilistic model)
  - Balances the shortening effect of the LM
  - Can be applied immediately, as the absolute length is not important
- However: this is insensitive to what is in the sentence
  - It optimizes the length of the translations for the entire test set, not for each sentence
  - Some sentences are made too long to compensate for sentences which are too short
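Since the probabilistic length model is usually simulated by a word-count feature (see "The Knowledge Sources" slide), its contribution is just a per-word bonus; a one-line sketch with an invented weight:

def word_count_cost(num_target_words, weight=0.5):
    return -weight * num_target_words      # negative cost = bonus, so longer outputs are preferred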