Transcript ppt
Statistical Machine Translation
Word Alignment
Stephan Vogel
MT Class
Spring Semester 2011
Stephan Vogel - Machine Translation
1
Overview
Word alignment – some observations
Models IBM2 and IBM1: 0th-order position model
HMM alignment model: 1st-order position model
IBM3: fertility
IBM4: plus relative distortion
Alignment Example
Observations:
Mostly 1-1
Some 1-to-many
Some 1-to-nothing
Often monotone
Not always clear-cut
English ‘eight’ is a time
German has ‘acht Uhr’
Could also leave ‘Uhr’ unaligned
Evaluating Alignment
Given some manually aligned data (ref) and automatically aligned data (hyp), links can be:
Correct, i.e. link in hyp matches link in ref: true positive (tp)
Wrong, i.e. link in hyp but not in ref: false positive (fp)
Missing, i.e. link in ref but not in hyp: false negative (fn)
Evaluation measures
Precision: P = tp / (tp + fp) = correct / links_in_hyp
Recall: R = tp / (tp + fn) = correct / links_in_ref
Alignment Error Rate: AER = 1 – F = 1 – 2tp / (2tp + fp + fn)
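These measures are easy to compute when both alignments are represented as sets of (source, target) link pairs. A minimal sketch (the set representation and function name are illustrative, not from the slides):

```python
def alignment_scores(hyp, ref):
    """Compute precision, recall, and AER from sets of (j, i) links."""
    tp = len(hyp & ref)   # link in hyp matches link in ref
    fp = len(hyp - ref)   # link in hyp but not in ref
    fn = len(ref - hyp)   # link in ref but not in hyp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, 1.0 - f1

# Toy example: 3 correct links, 1 wrong, 1 missing
hyp = {(1, 1), (2, 2), (3, 3), (4, 2)}
ref = {(1, 1), (2, 2), (3, 3), (4, 4)}
p, r, aer = alignment_scores(hyp, ref)
```

With tp = 3, fp = 1, fn = 1 this gives P = R = 0.75 and AER = 0.25.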
Sure and Possible Links
Sometimes it is difficult for human annotators to decide
Differentiate between sure and possible links
En: Det Noun – Ch: Noun: don’t align Det, or align it to NULL?
En: Det Noun – Ar: DetNoun: should Det be aligned to DetNoun?
Alignment Error Rate with sure and possible links (Och 2000)
A = generated links
S = sure links (not finding a sure link is an error)
P = possible links (adding a link which is not possible is an error); S ⊆ P
Precision: |A ∩ P| / |A|
Recall: |A ∩ S| / |S|
AER = 1 – (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
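Och’s AER with sure and possible links can be sketched the same way, with A, S, P as sets of link pairs (a sketch; the function name is an assumption):

```python
def och_aer(A, S, P):
    """AER with sure/possible links (Och 2000).
    A = generated links, S = sure links, P = possible links (S subset of P)."""
    a_and_p = len(A & P)  # generated links that are at least possible
    a_and_s = len(A & S)  # generated links that are sure
    precision = a_and_p / len(A)
    recall = a_and_s / len(S)
    aer = 1.0 - (a_and_s + a_and_p) / (len(A) + len(S))
    return precision, recall, aer

# Toy example: one sure link found, one sure link missed, one possible link found
S = {(1, 1), (2, 2)}
P = S | {(3, 3)}
A = {(1, 1), (3, 3)}
p, r, aer = och_aer(A, S, P)
```

Here |A ∩ P| = 2 and |A ∩ S| = 1, so precision = 1.0, recall = 0.5, AER = 1 – 3/4 = 0.25.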
Word Alignment Models
IBM1 – lexical probabilities only
IBM2 – lexicon plus absolute position
IBM3 – plus fertilities
IBM4 – inverted relative position alignment
IBM5 – non-deficient version of model 4
HMM – lexicon plus relative position
BiBr – Bilingual Bracketing, lexical probabilities plus reordering via parallel segmentation
Syntactic alignment models
[Brown et al. 1993, Vogel et al. 1996, Och et al. 2000, Wu 1997, Yamada et al. 2003, and many others]
GIZA++ Alignment Toolkit
All standard alignment models (IBM1 … IBM5, HMM) are
implemented in GIZA++
This toolkit was started (as GIZA) at the Johns Hopkins University workshop 1998
Extended and improved by Franz Josef Och
Now used by many groups
Known problems:
Memory when training on large corpora
Writes many large files (depends on your parameter setting)
Extensions for large corpora (Qin Gao)
Distributed GIZA: run on many machines, I/O bound
Multithreaded GIZA: run on one machine, multiple cores
Notation
Source language
f: source (French) word
J: length of source sentence
j: position in source sentence (target position)
f_1^J = f_1 … f_j … f_J : source sentence
Target language
e: target (English) word
I: length of target sentence
i: position in target sentence (source position)
e_1^I = e_1 … e_i … e_I : target sentence
Alignment: relation mapping source to target positions
i = a_j : position i of the target word e_i to which source position j is aligned
a_1^J = a_1 … a_j … a_J : whole alignment
SMT - Principle
Translate a ‘French’ string
into an ‘English’ string
f_1^J = f_1 … f_j … f_J
e_1^I = e_1 … e_i … e_I
Bayes’ decision rule for translation:
ê_1^I = argmax_{e_1^I} { Pr(e_1^I | f_1^J) }
      = argmax_{e_1^I} { Pr(e_1^I) · Pr(f_1^J | e_1^I) }
Why this inversion of the translation direction?
Decomposition of dependencies: makes modeling easier
Cooperation of two knowledge sources for final decision
Note: IBM paper and GIZA call e source and f target
Alignment as Hidden Variable
‘Hidden alignments’ to capture word-to-word correspondences
Mapping: A ⊆ [1, …, J] × [1, …, I]
Number of connections: J · I (each source word with each target word)
Number of alignments: 2^(J·I) (each connection yes/no)
Summation over all alignments:
Pr(f_1^J | e_1^I) = Σ_A Pr(f_1^J, A | e_1^I)
Too many alignments; summation not feasible
Restricted Alignment
Each source word has one connection
Alignment mapping becomes a function: j → i = a_j
Number of alignments is now: I^J
Sum over all alignments: not possible to enumerate
In some situations full summation possible through Dynamic Programming
In other situations: take only the best alignment and perhaps some alignments close to the best one
Empty Position (Null Word)
Sometimes a word has no correspondence
Alignment function aligns each source word to one target word,
i.e. cannot skip source word
Solution:
Introduce empty position 0 with null word e0
‘Skip’ source word fj by aligning it to e0
Target sentence is extended to: e_0^I = e_0 … e_i … e_I
Alignment is extended to: a_0^J = a_0 … a_j … a_J
Translation Model
Sum over all alignments:
Pr(f_1^J | e_0^I) = Σ_{a_1^J} Pr(f_1^J, a_1^J | e_0^I)
Pr(f_1^J, a_1^J | e_0^I) = Pr(J | e_0^I) · Pr(a_1^J | J, e_0^I) · Pr(f_1^J | a_1^J, J, e_0^I)
3 probability distributions:
Length: Pr(J | e_0^I)
Alignment: Pr(a_1^J | J, e_0^I) = ∏_{j=1}^{J} Pr(a_j | a_1^{j-1}, J, e_0^I)
Lexicon: Pr(f_1^J | a_1^J, J, e_0^I) = ∏_{j=1}^{J} Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I)
Model Assumptions
Decompose interaction into pairwise dependencies
Length: source length only dependent on target length (very weak)
Pr(J | e_0^I) = p(J | I)
Alignment:
Zero order model: target position only dependent on source position
Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | j, J, I)
First order model: target position only dependent on previous target position
Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | a_{j-1}, J, I)
Lexicon: source word only dependent on the aligned target word
Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j | e_{a_j})
Mixture Model
Interpretation as mixture model by direct decomposition
Pr(f_1^J | e_1^I) = p(J | I) · ∏_{j=1}^{J} p(f_j | J, e_1^I)
                 = p(J | I) · ∏_{j=1}^{J} Σ_{i=1}^{I} p(f_j, i | J, e_1^I)
                 = p(J | I) · ∏_{j=1}^{J} Σ_{i=1}^{I} p(i | j, J, I) · p(f_j | e_i)
Again, simplifying model assumptions applied
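The mixture decomposition translates directly into code. A sketch of the IBM2 sentence probability (the function name, callable tables, and toy values below are assumptions for illustration):

```python
def ibm2_sentence_prob(f, e, p_len, p_pos, p_lex):
    """Pr(f|e) = p(J|I) * prod_j sum_i p(i|j,J,I) * p(f_j|e_i)."""
    J, I = len(f), len(e)
    prob = p_len(J, I)
    for j, fj in enumerate(f, start=1):
        # mixture over all target positions i for source position j
        prob *= sum(p_pos(i, j, J, I) * p_lex.get((fj, ei), 0.0)
                    for i, ei in enumerate(e, start=1))
    return prob

# Toy lexicon (hypothetical values); uniform position model recovers IBM1
p_lex = {("das", "the"): 1.0, ("haus", "house"): 1.0}
prob = ibm2_sentence_prob(["das", "haus"], ["the", "house"],
                          lambda J, I: 1.0,            # length model
                          lambda i, j, J, I: 1.0 / I,  # uniform positions
                          p_lex)
```

With the uniform position model each factor is 0.5, so prob = 0.25.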
Training IBM2
Expectation-Maximization (EM) Algorithm
Define posterior weight (i.e. sum over column = 1):
p(i | f_j^s) = p(i | j, J_s, I_s) · p(f_j^s | e_i^s) / Σ_{i'} p(i' | j, J_s, I_s) · p(f_j^s | e_{i'}^s)
Lexicon probabilities: count how often word pairs are aligned
A(f; e) = Σ_s Σ_j Σ_i δ(f, f_j^s) · δ(e, e_i^s) · p(i | f_j^s)
p(f | e) = A(f; e) / Σ_{f'} A(f'; e)
Alignment probabilities: turn counts into probabilities
B(i; j, J, I) = Σ_s δ(I, I_s) · δ(J, J_s) · p(i | f_j^s)
p(i | j, J, I) = B(i; j, J, I) / Σ_{i'} B(i'; j, J, I)
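One EM iteration of this count-and-normalize procedure can be sketched as follows (the function name and dictionary-based tables are assumptions for illustration):

```python
from collections import defaultdict

def ibm2_em_step(corpus, p_lex, p_pos):
    """One IBM2 EM step: accumulate posterior-weighted counts, renormalize.
    corpus: list of (f_words, e_words); p_lex[(f,e)]; p_pos[(i,j,J,I)]."""
    A = defaultdict(float)  # lexicon counts A(f; e)
    B = defaultdict(float)  # alignment counts B(i; j, J, I)
    for f, e in corpus:
        J, I = len(f), len(e)
        for j, fj in enumerate(f, 1):
            # posterior p(i | f_j): normalize over target positions (column sums to 1)
            scores = [p_pos[(i, j, J, I)] * p_lex[(fj, ei)]
                      for i, ei in enumerate(e, 1)]
            total = sum(scores)
            for (i, ei), score in zip(enumerate(e, 1), scores):
                post = score / total
                A[(fj, ei)] += post
                B[(i, j, J, I)] += post
    # turn counts into probabilities
    e_totals = defaultdict(float)
    for (fw, ew), c in A.items():
        e_totals[ew] += c
    new_lex = {(fw, ew): c / e_totals[ew] for (fw, ew), c in A.items()}
    col_totals = defaultdict(float)
    for (i, j, J, I), c in B.items():
        col_totals[(j, J, I)] += c
    new_pos = {(i, j, J, I): c / col_totals[(j, J, I)]
               for (i, j, J, I), c in B.items()}
    return new_lex, new_pos
```

For a one-word source sentence aligned against two target words, the posterior splits the single count between the two positions in proportion to p_pos · p_lex.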
IBM1 Model
Assume uniform probability for position alignment
p(i | j, J, I) = 1 / I
Alignment probability:
Pr(f_1^J | e_1^I) = p(J | I) · ∏_{j=1}^{J} Σ_{i=1}^{I} p(i | j, J, I) · p(f_j | e_i)
                 = p(J | I) · (1 / I^J) · ∏_{j=1}^{J} Σ_{i=1}^{I} p(f_j | e_i)
In training: only collect counts for word pairs
Training for IBM1 Model – Pseudo Code
# Accumulation (over corpus)
For each sentence pair
    For each source position j
        Sum = 0.0
        For each target position i
            Sum += p(fj|ei)
        For each target position i
            Count(fj,ei) += p(fj|ei) / Sum
# Re-estimate probabilities (over count table)
For each target word e
    Sum = 0.0
    For each source word f
        Sum += Count(f,e)
    For each source word f
        p(f|e) = Count(f,e) / Sum
# Repeat for several iterations
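The pseudo code above translates almost line for line into Python. A self-contained sketch on a classic two-sentence toy corpus (uniform initialization over the source vocabulary is one simple choice, an assumption here):

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=5):
    """IBM1 EM training; corpus is a list of (source_words, target_words)."""
    # initialize p(f|e) uniformly over the source vocabulary
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # Count(f, e)
        total = defaultdict(float)  # sum of counts per target word e
        # accumulation over the corpus
        for fs, es in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / norm  # posterior weight
                    count[(f, e)] += c
                    total[e] += c
        # re-estimate probabilities from the count table
        t = defaultdict(float,
                        {(f, e): count[(f, e)] / total[e] for (f, e) in count})
    return t

# Classic toy example: 'das' co-occurs with both 'the' sentences,
# so EM pushes p(das|the) toward 1 over the iterations
corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]
t = train_ibm1(corpus, iterations=10)
```

After a handful of iterations the lexicon disambiguates: p(das|the) approaches 1 while p(haus|the) and p(buch|the) shrink toward 0.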
HMM Alignment Model
Idea: relative position model
Entire word groups (phrases) are moved with respect to the source position
[Figure: alignment path over target vs. source positions]
HMM Alignment
First order model: target position dependent on previous target position (captures movement of entire phrases)
Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | a_{j-1}, J, I)
Alignment probability:
Pr(f_1^J | e_1^I) = p(J | I) · Σ_{a_1^J} ∏_{j=1}^{J} p(a_j | a_{j-1}, I) · p(f_j | e_{a_j})
Maximum approximation:
Pr(f_1^J | e_1^I) ≈ p(J | I) · max_{a_1^J} ∏_{j=1}^{J} p(a_j | a_{j-1}, I) · p(f_j | e_{a_j})
Viterbi Training on HMM Model
# Accumulation (over corpus)
# Find Viterbi path
For each sentence pair
    For each source position j
        For each target position i
            Pbest = 0
            t = p(fj|ei)
            For each target position i’
                Pprev = P(j-1,i’)
                a = p(i|i’,I,J)
                Pnew = Pprev*t*a
                if (Pnew > Pbest)
                    Pbest = Pnew
                    BackPointer(j,i) = i’
            P(j,i) = Pbest
# Update counts along the Viterbi path
i = argmax_i { P(J,i) }
For each j from J downto 1
    Count(f_j, e_i)++
    iprev = BackPointer(j,i)
    Count(i,iprev,I,J)++
    i = iprev
# Renormalize counts into probabilities
…
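The Viterbi-path search can be sketched in runnable form (0-based indexing; the uniform initial-state probability and the function name are assumptions):

```python
def viterbi_align(f, e, p_lex, p_trans):
    """Best HMM alignment path. p_lex[(f_word, e_word)], p_trans[(i, i_prev)].
    Returns a list a with a[j] = aligned target position for source position j."""
    J, I = len(f), len(e)
    # delta[j][i]: best path probability for f[0..j] ending in state i
    delta = [[0.0] * I for _ in range(J)]
    back = [[0] * I for _ in range(J)]
    for i in range(I):
        delta[0][i] = (1.0 / I) * p_lex.get((f[0], e[i]), 0.0)
    for j in range(1, J):
        for i in range(I):
            t = p_lex.get((f[j], e[i]), 0.0)
            best, best_ip = 0.0, 0
            for ip in range(I):
                p = delta[j - 1][ip] * p_trans.get((i, ip), 0.0) * t
                if p > best:
                    best, best_ip = p, ip
            delta[j][i], back[j][i] = best, best_ip
    # backtrace from the best final state
    a = [0] * J
    a[J - 1] = max(range(I), key=lambda i: delta[J - 1][i])
    for j in range(J - 1, 0, -1):
        a[j - 1] = back[j][a[j]]
    return a

# Toy tables (hypothetical values): lexicon strongly prefers the diagonal
p_lex = {("a", "A"): 0.9, ("a", "B"): 0.1, ("b", "A"): 0.1, ("b", "B"): 0.9}
p_trans = {(i, ip): 0.5 for i in range(2) for ip in range(2)}
path = viterbi_align(["a", "b"], ["A", "B"], p_lex, p_trans)
```

With these toy values the best path is the monotone alignment [0, 1].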
HMM Forward-Backward Training
Gamma : Probability to emit fj when in state i in sentence s
Sum over all paths through (j,i)
γ_j^s(i) = Σ_{a_1^J : a_j = i} ∏_{j'=1}^{J} p(a_{j'} | a_{j'-1}, I) · p(f_{j'} | e_{a_{j'}})
HMM Forward-Backward Training
Epsilon: Probability to transit from state i’ into i
Sum over all paths through (j-1,i’) and (j,i), emitting fj
ε_j^s(i', i) = Σ_{a_1^J : a_{j-1} = i', a_j = i} ∏_{j'=1}^{J} p(a_{j'} | a_{j'-1}, I) · p(f_{j'} | e_{a_{j'}})
Forward Probabilities
Defined as:
α_j(i) = Σ_{a_1^j : a_j = i} ∏_{j'=1}^{j} p(a_{j'} | a_{j'-1}, I) · p(f_{j'} | e_{a_{j'}})
Recursion:
α_j(i) = [ Σ_{i'=1}^{I} α_{j-1}(i') · p(i | i', I) ] · p(f_j | e_i)
Initial condition: α_1(i) = p(i | 0, I) · p(f_1 | e_i)
Backward Probabilities
Defined as:
β_j(i) = Σ_{a_j^J : a_j = i} ∏_{j'=j+1}^{J} p(a_{j'} | a_{j'-1}, I) · p(f_{j'} | e_{a_{j'}})
Recursion:
β_j(i) = Σ_{i'=1}^{I} β_{j+1}(i') · p(i' | i, I) · p(f_{j+1} | e_{i'})
Initial condition: β_J(i) = 1
Forward-Backward
Calculate Gamma and Epsilon with Alpha and Beta:
Gammas:
γ_j(i) = α_j(i) · β_j(i) / Σ_{i'=1}^{I} α_j(i') · β_j(i')
Epsilons:
ε_j(i', i) = α_{j-1}(i') · p(i | i', I) · p(f_j | e_i) · β_j(i) / Σ_{ĩ', ĩ} α_{j-1}(ĩ') · p(ĩ | ĩ', I) · p(f_j | e_ĩ) · β_j(ĩ)
Parameter Re-Estimation
Lexicon probabilities:
p(f | e) = [ Σ_{s=1}^{S} Σ_{j: f_j^s = f} Σ_{i: e_i^s = e} γ_j^s(i) ] / [ Σ_{s=1}^{S} Σ_{j=1}^{J_s} Σ_{i: e_i^s = e} γ_j^s(i) ]
Alignment probabilities:
p(i | i') = [ Σ_{s=1}^{S} Σ_{j=1}^{J_s} ε_j^s(i', i) ] / [ Σ_{s=1}^{S} Σ_{j=1}^{J_s} γ_j^s(i') ]
Forward-Backward Training – Pseudo Code
# Accumulation
For each sentence-pair {
Forward. (Calculate Alpha’s)
Backward. (Calculate Beta’s)
Calculate Gamma’s and Epsilon’s.
For each source position j and each target position i {
Increase LexiconCount(f_j|e_i) by Gamma(j,i).
Increase AlignCount(i|i’) by Epsilon(j,i’,i).
}
}
# Update
Normalize LexiconCount to get P(f_j|e_i).
Normalize AlignCount to get P(i|i’).
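The accumulation step can be sketched for a single sentence pair as follows (a minimal sketch without the numerical scaling a real implementation needs for long sentences; the function name and uniform initial-state probability are assumptions):

```python
from collections import defaultdict

def forward_backward_counts(f, e, p_lex, p_trans):
    """Gamma-weighted lexicon counts and epsilon-weighted transition counts
    for one sentence pair; p_lex[(f_word, e_word)], p_trans[(i, i_prev)]."""
    J, I = len(f), len(e)
    # forward pass
    alpha = [[0.0] * I for _ in range(J)]
    for i in range(I):
        alpha[0][i] = (1.0 / I) * p_lex[(f[0], e[i])]
    for j in range(1, J):
        for i in range(I):
            alpha[j][i] = sum(alpha[j - 1][ip] * p_trans[(i, ip)]
                              for ip in range(I)) * p_lex[(f[j], e[i])]
    # backward pass
    beta = [[0.0] * I for _ in range(J)]
    beta[J - 1] = [1.0] * I
    for j in range(J - 2, -1, -1):
        for i in range(I):
            beta[j][i] = sum(beta[j + 1][ip] * p_trans[(ip, i)]
                             * p_lex[(f[j + 1], e[ip])] for ip in range(I))
    total = sum(alpha[J - 1][i] for i in range(I))  # sentence probability
    lex_count = defaultdict(float)
    trans_count = defaultdict(float)
    for j in range(J):
        for i in range(I):
            gamma = alpha[j][i] * beta[j][i] / total
            lex_count[(f[j], e[i])] += gamma
            if j > 0:
                for ip in range(I):
                    eps = (alpha[j - 1][ip] * p_trans[(i, ip)]
                           * p_lex[(f[j], e[i])] * beta[j][i]) / total
                    trans_count[(i, ip)] += eps
    return lex_count, trans_count, total
```

A useful sanity check: the gammas sum to 1 at each source position, so the lexicon counts sum to J and the transition counts to J − 1.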
Example HMM Training