Transcript ppt1
Machine Translation
Decoder for Phrase-Based SMT
Stephan Vogel
Spring Semester 2011
Decoder
Decoding issues
Two step decoding
Generation of translation lattice
Best path search
With limited word reordering
Specific Issues (Next Session)
Recombination of hypotheses
Pruning
N-best list generation
Future cost estimation
Decoding Issues
Decoder takes source sentence and all available knowledge
(translation model, distortion model, language model, etc.)
and generates a target sentence
Many alternative translations are possible
Too many to explore them all -> pruning is necessary
Pruning leads to search errors
Decoder outputs model-best translation
Ranking of hypotheses according to the model is different from the ranking
according to an external metric
Bad translations get better model scores than good translations ->
model errors
Models see only limited context
Different hypotheses become identical under the model
-> Hypothesis recombination
Decoding Issues
Languages have different word order
Modeled by distortion models
Exploring all possible reorderings is too expensive (essentially O(J!))
Need to restrict reordering -> different reordering strategies
Optimizing the system
We use a bunch of models (features), need to optimize scaling factors
(feature weights)
Decoding is expensive
Optimize on n-best list -> need to generate n-best lists
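For reference, these scaling factors are presumably the weights of the standard log-linear model of phrase-based SMT, under which the decoder searches for
  \hat{e} = \arg\max_{e} \sum_{m=1}^{M} \lambda_m \, h_m(e, f)
where f is the source sentence, e a candidate translation, h_m(e, f) the feature scores (the models listed on the next slide), and \lambda_m the feature weights tuned on n-best lists (e.g. with MERT).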
Decoder: The Knowledge Sources
Translation models
Phrase translation table
Statistical lexicon and/or manual lexicon
Named entities
Translation information stored as transducers or extracted on the fly
Language model: standard n-gram LM
Distortion model: distance-based or lexicalized
Sentence length model
Typically simulated by word-count feature
Other features
Phrase-count
Number of untranslated words
…
The Decoder: Two Level Approach
Build translation lattice
Run left to right over the test sentence
Search for matching phrases between source sentence and phrase
table (and other translation tables)
For each translation, insert edges into the lattice
First best search (or n-best search)
Run left to right over the lattice
Apply n-gram language model
Combine translation model scores and language model score
Recombine and prune hypotheses
At sentence end: add sentence length model score
Trace back best hypothesis (or n-best hypotheses)
Notice: this two-level view is convenient for describing the decoder
The implementation can interleave both processes
How this is implemented can make a difference due to pruning
Building Translation Lattice
Sentence: ich komme morgen zu dir
Reference: I will come to you tomorrow
Search in corpus for phrases and their translations
Insert edges into the lattice
[Lattice figure: nodes 0, 1, 2, …, J placed between the source words ich komme morgen zu dir, with one edge per matching phrase translation, e.g. ich -> I, ich komme -> I come / I will come, komme -> come, morgen -> tomorrow, zu -> to, dir -> you, zu dir -> to your office]
Phrase Table in Hash Map
Store phrase table in hash map (source phrase as key)
For each n-gram in source sentence access hash map
foreach j = 1 to J            // start position of phrase
  foreach l = 0 to lmax-1     // phrase length (with j+l <= J)
    SourcePhrase = (wj … wj+l)
    TargetPhrases = Hashmap.Get( SourcePhrase )
    foreach TargetPhrase t in TargetPhrases
      create new edge e’ = (j-1, j+l, t )   // add TM scores
Works fine for sentence input, but too expensive for lattices
Lattices from speech recognizer
Paraphrases
Reordering as preprocessing step
Hierarchical transducers
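For the simple sentence-input case, a minimal runnable sketch of the hash-map lookup above; the phrase table and its scores are toy placeholders, not real model values:

    # Toy sketch of the hash-map phrase lookup over a plain sentence.
    phrase_table = {
        ("ich",): [("I", 0.2)],
        ("ich", "komme"): [("I come", 0.9), ("I will come", 1.1)],
        ("komme",): [("come", 0.4)],
        ("morgen",): [("tomorrow", 0.3)],
        ("zu",): [("to", 0.5)],
        ("dir",): [("you", 0.7)],
        ("zu", "dir"): [("to your office", 1.5)],
    }

    def build_lattice_edges(words, table, max_len=4):
        # Returns lattice edges (start_node, end_node, target_phrase, tm_score),
        # with nodes 0..J sitting between the source words.
        edges = []
        for j in range(len(words)):            # start position of the phrase
            for l in range(max_len):           # phrase length minus one
                if j + l >= len(words):
                    break
                source = tuple(words[j:j + l + 1])
                for target, tm_score in table.get(source, []):
                    edges.append((j, j + l + 1, target, tm_score))
        return edges

    for edge in build_lattice_edges("ich komme morgen zu dir".split(), phrase_table):
        print(edge)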
Example: Paraphrase Lattice
[Paraphrase lattice figures: large (top-5 paraphrases) vs. pruned]
Phrase Table as Prefix Tree
[Prefix-tree figure: source phrases over the words ja , okay dann Montag bei mir stored along shared-prefix paths, with translations such as okay, then, and okay on Monday attached where a source phrase ends]
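A minimal sketch of such a prefix tree (trie) over source phrases, which lets phrase matching be extended one word (or lattice edge) at a time instead of re-hashing every substring; the phrases and scores are illustrative:

    # Prefix-tree (trie) phrase table: a toy sketch.
    class TrieNode:
        def __init__(self):
            self.children = {}      # next source word -> TrieNode
            self.translations = []  # (target phrase, tm_score) pairs

        def insert(self, source_words, target, tm_score):
            node = self
            for w in source_words:
                node = node.children.setdefault(w, TrieNode())
            node.translations.append((target, tm_score))

    root = TrieNode()
    root.insert(["ja", ",", "okay"], "okay", 0.3)
    root.insert(["okay", "dann"], "okay then", 0.5)
    root.insert(["dann"], "then", 0.2)

    # Extending a partial match by one word is a single dictionary lookup:
    node = root
    for w in ["okay", "dann"]:
        node = node.children.get(w)
        if node is None:
            break
    print(node.translations if node else "no match")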
Building the Translation Lattice
Book-keeping: hypothesis h = (n, n, s0, hprev, e )
  n – node
  s0 – initial state in transducer
  hprev – previous hypothesis
  e – edge
Convert sentence into lattice structure
At each node n, insert ‘empty’ hypothesis
h = (n, n, s0, hprev = nil, e = nil )
as starting point for phrase search from this position
Note: Previous hyp and edge are only needed for hierarchical
transducers, to be able to ‘propagate’ partial translations
Algorithm for Building Translation Lattice
foreach node n = 0 to J
  create empty hypothesis h0 = (n, n, s0, NIL, NIL)
  Hyps( n ) = Hyps( n ) + h0
  foreach incoming edge e in n
    w = WordAt( e )
    nprev = FromNode( e )
    foreach hypothesis hprev = (nstart, nprev, sprev, hx, ex ) in Hyps( nprev )
      if transducer T has transition (sprev -> s’ : w )
        if s’ is emitting state
          foreach translation t emitted in s’
            create new edge e’ = (nstart, n, t )   // add TM scores
        if s’ is not final state
          create new hypothesis h’ = (nstart, n, s’, hprev, e )
          Hyps( n ) = Hyps( n ) + h’
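A compact, self-contained sketch of this loop for the special case of a linear sentence lattice, with a dictionary-based trie standing in for the transducer; names and data are illustrative, not the STTK implementation:

    # Lattice building with a prefix-tree transducer: a toy sketch.
    # The input "lattice" is linear: node j-1 --word_j--> node j.
    # A partial match is (start_node, trie_node); extending it over an incoming
    # edge follows one trie transition, and any translations stored at the
    # reached trie node become new target-language edges.

    def make_trie(phrase_table):
        root = {"children": {}, "translations": []}
        for source, translations in phrase_table.items():
            node = root
            for w in source:
                node = node["children"].setdefault(w, {"children": {}, "translations": []})
            node["translations"].extend(translations)
        return root

    def build_lattice(words, trie):
        edges = []                       # (from_node, to_node, target, tm_score)
        hyps = {0: [(0, trie)]}          # node -> list of (start_node, trie_node)
        for n in range(1, len(words) + 1):
            hyps[n] = [(n, trie)]        # empty hypothesis starting at node n
            w = words[n - 1]             # word on the single incoming edge
            for start, node in hyps[n - 1]:
                child = node["children"].get(w)
                if child is None:
                    continue
                for target, score in child["translations"]:   # "emitting state"
                    edges.append((start, n, target, score))
                if child["children"]:                          # match can continue
                    hyps[n].append((start, child))
        return edges

    trie = make_trie({
        ("ich",): [("I", 0.2)],
        ("ich", "komme"): [("I will come", 1.1)],
        ("komme",): [("come", 0.4)],
    })
    print(build_lattice("ich komme morgen".split(), trie))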
Searching for Best Translation
We have constructed a graph
Directed
No cycles
Each edge carries a partial translation (with scores)
Now we need to find the best path
Adding additional information (DM, LM, ….)
Allowing for some reordering
Monotone Search
Hypotheses describe partial translations
Coverage information, translation, scores
Expand hypothesis over outgoing edges
[Lattice figure: the same translation lattice for ich komme morgen zu dir as above]
Example expansions:
h: c=0..3, t=I will come tomorrow
h: c=0..4, t=I will come tomorrow to
h: c=0..4, t=I will come tomorrow zu
h: c=0..5, t=I will come tomorrow to your office
Reordering Strategies
All permutations
Any re-ordering possible
Complexity of traveling salesman -> only possible for very short
sentences
Small jumps ahead – filling the gaps pretty soon
Only local word reordering
Implemented in STTK decoder
Leaving small number of gaps – fill in at any time
Allows for global but limited reordering
Similar decoding complexity – exponential in number of gaps
IBM-style reordering (described in IBM patent)
Merging neighboring regions with swap – no gaps at all
Allows for global reordering
Complexity lower than for all permutations, but higher than for the two limited-reordering strategies above
IBM Style Reordering
Example: first word translated last!
[Figure: source positions translated out of order, coverage built up in steps 0 to 7, showing a gap, another gap, and a partially filled gap]
Resulting reordering: 2 3 7 8 9 10 11 5 6 4 12 13 14 1
Sliding Window Reordering
Local reordering within sliding window of size 6
[Figure: decoding steps 0 to 8, each showing the sliding reordering window of size 6 over the source positions, with a gap, another gap, a partially filled gap, and a new gap appearing as the window moves]
Coverage Information
Need to know which source words have already been
translated
Don’t want to miss some words
Don’t want to translate words twice
Can compare hypotheses which cover the same words
Use Coverage vector to store this information
For ‘small jumps ahead’: position of first gap plus short bit vector
For ‘small number of gaps’: array of positions of uncovered words
For ‘merging neighboring regions’: left and right position
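A minimal sketch of the plain bit-vector variant with the two operations the search needs, collision check and coverage update; the helpers are illustrative, not tied to a particular decoder:

    # Coverage as an integer bit vector: bit j set <=> source word j translated.
    def has_collision(coverage, edge_coverage):
        # True if the edge would translate an already covered word.
        return (coverage & edge_coverage) != 0

    def extend(coverage, edge_coverage):
        # Union of the two coverage sets.
        return coverage | edge_coverage

    def edge_bits(start, end):
        # Bits for source positions start .. end-1.
        return ((1 << (end - start)) - 1) << start

    cov = extend(0, edge_bits(0, 2))             # words 0 and 1 translated
    print(has_collision(cov, edge_bits(1, 3)))   # True: word 1 already covered
    print(bin(extend(cov, edge_bits(3, 4))))     # 0b1011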
Limited Distance Word Reordering
Word and phrase reordering within a given window
From the first untranslated source word over the next k positions
Window length 1: monotone decoding
Restrict total number of reorderings (typically 3 per 10 words)
Simple ‘Jump’ model or lexicalized distortion model
Use bit vector 1001100… = words 1, 4, and 5 translated
For long sentences this means long bit vectors, but only limited
reordering is allowed, therefore:
Coverage = ( first untranslated word, bit vector)
i.e. 111100110… -> (4, 00110…)
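A small sketch of this compaction, matching the example above; the helper is illustrative:

    # Compact coverage: (number of leading covered words, remaining bit string).
    def compact(bits):
        # bits: string like "111100110", '1' = translated.
        first_gap = 0
        while first_gap < len(bits) and bits[first_gap] == "1":
            first_gap += 1
        return first_gap, bits[first_gap:]

    print(compact("111100110"))   # (4, '00110')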
Jumping ahead in the Lattice
Hypotheses describe a partial translation
Coverage information, translation, scores
Expand hypothesis over uncovered position (within window)
[Lattice figure: the same translation lattice for ich komme morgen zu dir as above]
Example expansions:
h: c=11000, t=I will come
h: c=11011, t=I will come to your office
h: c=11111, t=I will come to your office tomorrow
Hypothesis for Search
Organize search according to number of translated words c
It is expensive to expand the translation
Replace by back-trace information
Generate full translation only for the best (n-best) final translation
Book-keeping: hypothesis h = (Q, C, L, i, hprev, e)
Q – total cost (we keep also cumulative costs for individual models)
C – coverage information: positions already translated
L – language model state: e.g. last n-1 words for n-gram LM
i – number of target words
hprev – pointer to previous hypothesis
e – edge traversed to expand hprev into h
hprev and e are the back-trace information, used to reconstruct
the full translation
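A minimal sketch of this book-keeping as a data structure; the field names follow the slide, the class itself is illustrative:

    from dataclasses import dataclass
    from typing import Optional, Tuple

    # Search hypothesis: only scores, coverage, LM state and back-trace
    # pointers are stored; the translation is reconstructed at the end.
    @dataclass
    class Hypothesis:
        Q: float                        # total cost (weighted sum of model costs)
        C: int                          # coverage: bit vector of translated positions
        L: Tuple[str, ...]              # LM state, e.g. last n-1 target words
        i: int                          # number of target words so far
        hprev: Optional["Hypothesis"]   # previous hypothesis (back-trace)
        e: Optional[tuple]              # edge traversed to expand hprev into this one

Two hypotheses with the same C and L look identical to all later decisions, which is exactly what makes recombination possible: only the cheaper one needs to be kept.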
Algorithm for Applying Language Model
for coverage c = 0 to J-1
  foreach h in Hyps( c )
    foreach node n within reordering window
      foreach outgoing edge e in n
        if no coverage collision between h.C and C(e)
          TMScore = -log p( t | s )        // typically several scores
          DMScore = -log p( jump )         // or lexicalized DM score
          // other scores like word count, phrase count, etc.
          foreach target word tk in t
            LMScore += -log p( tk | Lk-1 )
            Lk = Lk-1 ⊕ tk
          endfor
          Q’ = k1*TMScore + k2*LMScore + k3*DMScore + …
          h’ = ( h.Q + Q’, h.C ∪ C(e), L’, h.i + |t|, h, e )
          Hyps( c’ ) += h’                 // c’ = c + number of source words covered by e
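A compact runnable sketch of this expansion, monotone for brevity (the reordering window and distortion score are omitted) and with a stub in place of a real n-gram LM; hypotheses are recombined on their LM state within each coverage bucket. Edges, weights and scores are toy placeholders:

    from collections import defaultdict

    # Lattice edges as built earlier: (from_node, to_node, target_phrase, tm_cost).
    edges_from = defaultdict(list)
    for e in [(0, 2, "I will come", 1.1), (2, 3, "tomorrow", 0.3),
              (3, 5, "to your office", 1.5)]:
        edges_from[e[0]].append(e)

    def lm_cost(word, state):
        # Stand-in for -log p(word | state); a real n-gram LM would go here.
        return 0.5

    def expand(J, w_tm=1.0, w_lm=1.0):
        # hyps[node] maps an LM state to the best hypothesis reaching it:
        # (total cost Q, back-trace (prev node, prev LM state), edge).
        hyps = {0: {(): (0.0, None, None)}}
        for c in range(J):
            for state, (Q, _, _) in hyps.get(c, {}).items():
                for e in edges_from[c]:
                    lm, s = 0.0, state
                    for t in e[2].split():
                        lm += lm_cost(t, s)
                        s = (t,)                  # bigram-style LM state: last word
                    Qn = Q + w_tm * e[3] + w_lm * lm
                    bucket = hyps.setdefault(e[1], {})
                    # Recombination: keep only the cheapest hypothesis per LM state.
                    if s not in bucket or Qn < bucket[s][0]:
                        bucket[s] = (Qn, (c, state), e)
        return hyps

    best = min(expand(5)[5].values(), key=lambda v: v[0])
    print(best[0])   # cost of the best complete hypothesis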
Algorithm for Applying LM cont.
// coverage is now J, i.e. sentence end reached
foreach h in Hyps( J )
  SLScore = -log p( h.i | J )        // sentence length model
  LMScore += -log p( </s> | Lh )     // end-of-sentence LM score
  L’ = Lh ⊕ </s>
  Q’ = a*LMScore + b*SLScore
  h’ = ( h.Q + Q’, h.C, L’, h.i, h, e = NIL )
  Hyps( J+1 ) += h’
Sort Hyps( J+1 ) according to total score Q
Trace back over sequence of (h, e) to construct actual translation
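A minimal sketch of the trace-back step, assuming hypotheses store a (prev, edge) pair and edges carry the target phrase, as in the sketches above; the structures are illustrative:

    # Trace back from the best final hypothesis to the empty start hypothesis,
    # collecting the target phrases on the traversed edges in reverse order.
    def trace_back(hyp):
        # hyp is (cost, prev_hypothesis, edge); edge is (from, to, target, ...).
        phrases = []
        while hyp is not None and hyp[2] is not None:
            phrases.append(hyp[2][2])
            hyp = hyp[1]
        return " ".join(reversed(phrases))

    # Example chain: start -> "I will come" -> "tomorrow"
    h0 = (0.0, None, None)
    h1 = (2.6, h0, (0, 2, "I will come", 1.1))
    h2 = (3.4, h1, (2, 3, "tomorrow", 0.3))
    print(trace_back(h2))   # I will come tomorrow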
Sentence Length Model
Different languages have different levels of ‘wordiness’
A histogram of source sentence length vs. target sentence
length shows that the distribution is rather flat -> p( I | J ) is not
very helpful
Very simple sentence length model: the more – the better
i.e. give bonus for each word (not a probabilistic model)
Balances shortening effect of LM
Can be applied immediately, as absolute length is not important
However: this is insensitive to what’s in the sentence
Optimize length of translations for entire test set, not each sentence
Some sentences are made too long to compensate for sentences which are
too short