
Statistical Machine Translation
Word Alignment
Stephan Vogel
MT Class
Spring Semester 2011
Overview

- Word alignment – some observations
- Models IBM2 and IBM1: 0th-order position model
- HMM alignment model: 1st-order position model
- IBM3: fertility
- IBM4: plus relative distortion
Alignment Example
Observations:
- Mostly 1-1
- Some 1-to-many
- Some 1-to-nothing
- Often monotone
- Not always clear-cut
  - English ‘eight’ is a time
  - German has ‘acht Uhr’
  - Could also leave ‘Uhr’ unaligned
Evaluating Alignment
- Given some manually aligned data (ref) and automatically aligned data (hyp), links can be
  - Correct, i.e. link in hyp matches link in ref: true positive (tp)
  - Wrong, i.e. link in hyp but not in ref: false positive (fp)
  - Missing, i.e. link in ref but not in hyp: false negative (fn)
- Evaluation measures (see the sketch below)
  - Precision: P = tp / (tp + fp) = correct / links_in_hyp
  - Recall: R = tp / (tp + fn) = correct / links_in_ref
  - Alignment Error Rate: AER = 1 – F = 1 – 2tp / (2tp + fp + fn)
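A minimal sketch of these measures, assuming alignments are represented as sets of (j, i) link pairs; the set representation and the toy example are illustrative assumptions, not from the slides:

    def alignment_scores(hyp, ref):
        """Precision, recall, and AER for alignment link sets (hyp vs. ref)."""
        tp = len(hyp & ref)   # links in hyp that match ref
        fp = len(hyp - ref)   # links in hyp but not in ref
        fn = len(ref - hyp)   # links in ref but not in hyp
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        aer = 1.0 - 2.0 * tp / (2 * tp + fp + fn)
        return precision, recall, aer

    # toy example: 3 correct links, 1 wrong, 1 missing
    hyp = {(1, 1), (2, 2), (3, 3), (4, 2)}
    ref = {(1, 1), (2, 2), (3, 3), (4, 4)}
    print(alignment_scores(hyp, ref))   # (0.75, 0.75, 0.25)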
Sure and Possible Links
- Sometimes it is difficult for human annotators to decide
- Differentiate between sure and possible links
  - En: Det Noun - Ch: Noun, don’t align Det, or align it to NULL?
  - En: Det Noun - Ar: DetNoun, should Det be aligned to DetNoun?
- Alignment Error Rate with sure and possible links (Och 2000), see the sketch below
  - A = generated links
  - S = sure links (not finding a sure link is an error)
  - P = possible links (putting a link which is not possible is an error)

    Precision = |A ∩ P| / |A|
    Recall    = |A ∩ S| / |S|
    AER       = 1 – (|A ∩ P| + |A ∩ S|) / (|A| + |S|)
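The Och (2000) variant is just as direct to compute; again the link sets A, S, P are assumed to be sets of (j, i) pairs, with S a subset of P:

    def aer_sure_possible(A, S, P):
        """Precision, recall, and AER with sure (S) and possible (P) reference links."""
        a_p = len(A & P)   # generated links that are at least possible
        a_s = len(A & S)   # generated links that are sure
        precision = a_p / len(A)
        recall = a_s / len(S)
        aer = 1.0 - (a_p + a_s) / (len(A) + len(S))
        return precision, recall, aer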
Word Alignment Models

- IBM1 – lexical probabilities only
- IBM2 – lexicon plus absolute position
- IBM3 – plus fertilities
- IBM4 – inverted relative position alignment
- IBM5 – non-deficient version of model 4
- HMM – lexicon plus relative position
- BiBr – Bilingual Bracketing, lexical probabilities plus reordering via parallel segmentation
- Syntactical alignment models

[Brown et al. 1993, Vogel et al. 1996, Och et al. 2000, Wu 1997, Yamada et al. 2003, and many others]
GIZA++ Alignment Toolkit
- All standard alignment models (IBM1 … IBM5, HMM) are implemented in GIZA++
- This toolkit was started (as GIZA) at the Johns Hopkins University workshop in 1998
- Extended and improved by Franz Josef Och
- Now used by many groups
- Known problems:
  - Memory when training on large corpora
  - Writes many large files (depends on your parameter settings)
- Extensions for large corpora (Qin Gao)
  - Distributed GIZA: run on many machines, I/O bound
  - Multithreaded GIZA: run on one machine, multiple cores
Notation
- Source language
  - f: source (French) word
  - J: length of source sentence
  - j: position in source sentence (target position)
  - f_1^J = f_1 ... f_j ... f_J : source sentence
- Target language
  - e: target (English) word
  - I: length of target sentence
  - i: position in target sentence (source position)
  - e_1^I = e_1 ... e_i ... e_I : target sentence
- Alignment: relation mapping source to target positions
  - i = a_j: position i of the target word e_i to which source position j is aligned
  - a_1^J = a_1 ... a_j ... a_J : whole alignment
SMT - Principle
- Translate a ‘French’ string f_1^J = f_1 ... f_j ... f_J into an ‘English’ string e_1^I = e_1 ... e_i ... e_I
- Bayes’ decision rule for translation (a toy illustration follows below):

    ê_1^I = argmax_{e_1^I} { Pr(e_1^I | f_1^J) }
          = argmax_{e_1^I} { Pr(e_1^I) · Pr(f_1^J | e_1^I) }

- Why this inversion of the translation direction?
  - Decomposition of dependencies: makes modeling easier
  - Cooperation of two knowledge sources for the final decision
- Note: the IBM papers and GIZA call e the source and f the target
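A toy illustration of the decision rule, assuming we already have scoring functions log_lm(e) ≈ log Pr(e) and log_tm(f, e) ≈ log Pr(f | e) and a small candidate set; all names here are hypothetical:

    def translate(f, candidates, log_lm, log_tm):
        # noisy-channel decision rule: argmax_e Pr(e) * Pr(f | e), in log space
        return max(candidates, key=lambda e: log_lm(e) + log_tm(f, e))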
Alignment as Hidden Variable
- ‘Hidden alignments’ to capture word-to-word correspondences
  - Mapping: A ⊆ [1, …, J] x [1, …, I]
  - Number of connections: J · I (each source word with each target word)
  - Number of alignments: 2^(J·I) (each connection yes/no)
- Summation over all alignments:

    Pr(f | e) = Σ_A Pr(f_1^J, A | e_0^I)

- Too many alignments, summation not feasible
Restricted Alignment
- Each source word has one connection
- Alignment mapping becomes a function: j -> i = a_j
- Number of alignments is now: I^J (see the sketch below)
- Sum over all alignments:
  - Not possible to enumerate
  - In some situations the full summation is possible through Dynamic Programming
  - In other situations: take only the best alignment and perhaps some alignments close to the best one
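To see why enumeration is hopeless, a quick back-of-the-envelope computation; the sentence lengths J = I = 10 are just an illustrative choice:

    J, I = 10, 10
    unrestricted = 2 ** (J * I)   # every source-target connection is on or off
    restricted = I ** J           # each source position picks exactly one target position
    print(f"{unrestricted:.3e}")  # ~1.268e+30 alignments
    print(f"{restricted:.3e}")    # ~1.000e+10 alignments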
Empty Position (Null Word)
- Sometimes a word has no correspondence
- The alignment function aligns each source word to one target word, i.e. it cannot skip a source word
- Solution (small illustration below):
  - Introduce the empty position 0 with the null word e_0
  - ‘Skip’ source word f_j by aligning it to e_0
  - Target sentence is extended to: e_0^I = e_0 ... e_i ... e_I
  - Alignment is extended to: a_1^J = a_1 ... a_j ... a_J with a_j ∈ {0, 1, …, I}
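In code the null word is usually handled by prepending a special token to the target sentence, so that position 0 is available for unaligned source words; the token name below is an arbitrary choice:

    NULL = "<NULL>"                   # hypothetical token name
    e_sent = [NULL, "the", "house"]   # e_0 = NULL, e_1 = "the", e_2 = "house"
    # a_j = 0 now means: source word f_j is aligned to the null word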
Translation Model
- Sum over all alignments:

    Pr(f | e) = Σ_{a_1^J} Pr(f_1^J, a_1^J | e_0^I)

    Pr(f_1^J, a_1^J | e_0^I)
      = Pr(J | e_0^I) · Pr(f_1^J, a_1^J | J, e_0^I)
      = Pr(J | e_0^I) · Pr(a_1^J | J, e_0^I) · Pr(f_1^J | a_1^J, J, e_0^I)

- 3 probability distributions:
  - Length:    Pr(J | e_0^I)
  - Alignment: Pr(a_1^J | J, e_0^I) = ∏_{j=1}^J Pr(a_j | a_1^{j-1}, J, e_0^I)
  - Lexicon:   Pr(f_1^J | a_1^J, J, e_0^I) = ∏_{j=1}^J Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I)
Model Assumptions
Decompose interaction into pairwise dependencies

- Length: source length only dependent on target length (very weak)

    Pr(J | e_0^I) = p(J | I)

- Alignment:
  - Zero-order model: target position only dependent on source position

    Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | j, J, I)

  - First-order model: target position only dependent on previous target position

    Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | a_{j-1}, J, I)

- Lexicon: source word only dependent on the aligned target word

    Pr(f_j | f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j | e_{a_j})
Mixture Model
- Interpretation as mixture model by direct decomposition:

    Pr(f_1^J | e_1^I) = p(J | I) · ∏_{j=1}^J p(f_j | J, e_1^I)
                      = p(J | I) · ∏_{j=1}^J Σ_{i=1}^I p(f_j, i | J, e_1^I)
                      = p(J | I) · ∏_{j=1}^J Σ_{i=1}^I p(i | j, J, I) · p(f_j | e_i)

- Again, simplifying model assumptions applied (see the sketch below)
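A minimal sketch of how this sum-over-products is evaluated for one sentence pair, assuming the alignment table p(i | j, J, I) and the lexicon table p(f | e) are given as dictionaries and the length model is passed in; the data layout is an assumption, not from the slides:

    import math

    def ibm2_log_likelihood(f_sent, e_sent, p_align, p_lex, p_length=lambda J, I: 1.0):
        """log Pr(f_1^J | e_1^I) under the IBM2 mixture decomposition."""
        J, I = len(f_sent), len(e_sent)
        logp = math.log(p_length(J, I))
        for j, f in enumerate(f_sent, start=1):
            # inner sum over target positions i for this source position j
            inner = sum(p_align[(i, j, J, I)] * p_lex[(f, e)]
                        for i, e in enumerate(e_sent, start=1))
            logp += math.log(inner)
        return logp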
Training IBM2
- Expectation-Maximization (EM) Algorithm (see the sketch after this slide)
- Define the posterior weight (i.e. sum over a column = 1):

    p(i | f_j^s) = p(i | j, J_s, I_s) · p(f_j^s | e_i^s) / Σ_{i'} p(i' | j, J_s, I_s) · p(f_j^s | e_{i'}^s)

- Lexicon probabilities: count how often word pairs are aligned, then turn counts into probabilities

    A(f; e) = Σ_s Σ_j Σ_i δ(f, f_j^s) · δ(e, e_i^s) · p(i | f_j^s)

    p(f | e) = A(f; e) / Σ_{f'} A(f'; e)

- Alignment probabilities: same procedure

    B(i; j, J, I) = Σ_s δ(I, I_s) · δ(J, J_s) · p(i | f_j^s)

    p(i | j, J, I) = B(i; j, J, I) / Σ_{i'} B(i'; j, J, I)
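A minimal sketch of one EM iteration for IBM2 over a small corpus, following the count definitions above; the defaultdict layout of the tables p_lex[(f, e)] and p_align[(i, j, J, I)] is an assumption, not from the slides:

    from collections import defaultdict

    def ibm2_em_iteration(corpus, p_lex, p_align):
        """One EM pass. corpus: list of (f_words, e_words) sentence pairs."""
        A = defaultdict(float)   # lexicon counts A(f; e)
        B = defaultdict(float)   # alignment counts B(i; j, J, I)
        for f_sent, e_sent in corpus:
            J, I = len(f_sent), len(e_sent)
            for j, f in enumerate(f_sent, start=1):
                # posterior weight p(i | f_j): normalize over the column
                denom = sum(p_align[(i, j, J, I)] * p_lex[(f, e)]
                            for i, e in enumerate(e_sent, start=1))
                for i, e in enumerate(e_sent, start=1):
                    post = p_align[(i, j, J, I)] * p_lex[(f, e)] / denom
                    A[(f, e)] += post
                    B[(i, j, J, I)] += post
        # turn counts into probabilities
        lex_totals, align_totals = defaultdict(float), defaultdict(float)
        for (f, e), c in A.items():
            lex_totals[e] += c
        for (i, j, J, I), c in B.items():
            align_totals[(j, J, I)] += c
        new_lex = defaultdict(float, {(f, e): c / lex_totals[e] for (f, e), c in A.items()})
        new_align = defaultdict(float,
                                {(i, j, J, I): c / align_totals[(j, J, I)]
                                 for (i, j, J, I), c in B.items()})
        return new_lex, new_align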
IBM1 Model
- Assume uniform probability for the position alignment:

    p(i | j, J, I) = 1 / I

- Alignment probability:

    Pr(f_1^J | e_1^I) = p(J | I) · ∏_{j=1}^J Σ_{i=1}^I p(i | j, J, I) · p(f_j | e_i)
                      = p(J | I) · (1 / I^J) · ∏_{j=1}^J Σ_{i=1}^I p(f_j | e_i)

- In training: only collect counts for word pairs
Training for IBM1 Model – Pseudo Code
# Accumulation (over corpus)
For each sentence pair
    For each source position j
        Sum = 0.0
        For each target position i
            Sum += p(fj|ei)
        For each target position i
            Count(fj,ei) += p(fj|ei)/Sum

# Re-estimate probabilities (over count table)
For each target word e
    Sum = 0.0
    For each source word f
        Sum += Count(f,e)
    For each source word f
        p(f|e) = Count(f,e)/Sum

# Repeat for several iterations
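The pseudo code translates almost line for line into a runnable sketch; the uniform initialization of p(f|e) and the dictionary layout are additions of mine, not part of the slide:

    from collections import defaultdict

    def train_ibm1(corpus, iterations=5):
        """IBM1 EM training. corpus: list of (f_words, e_words) sentence pairs."""
        f_vocab = {f for f_sent, _ in corpus for f in f_sent}
        p_lex = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform start
        for _ in range(iterations):
            count = defaultdict(float)
            # Accumulation (over corpus)
            for f_sent, e_sent in corpus:
                for f in f_sent:
                    norm = sum(p_lex[(f, e)] for e in e_sent)
                    for e in e_sent:
                        count[(f, e)] += p_lex[(f, e)] / norm
            # Re-estimate probabilities (over count table)
            totals = defaultdict(float)
            for (f, e), c in count.items():
                totals[e] += c
            p_lex = defaultdict(float,
                                {(f, e): c / totals[e] for (f, e), c in count.items()})
        return p_lex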
HMM Alignment Model
- Idea: relative position model. Entire word groups (phrases) are moved with respect to the source position.
[Figure: alignment grid, target positions vs. source positions]
HMM Alignment
- First-order model: target position dependent on previous target position (captures movement of entire phrases)

    Pr(a_j | a_1^{j-1}, J, e_0^I) = p(a_j | a_{j-1}, J, I)

- Alignment probability:

    Pr(f_1^J | e_1^I) = p(J | I) · Σ_{a_1^J} ∏_{j=1}^J p(a_j | a_{j-1}, I) · p(f_j | e_{a_j})

- Maximum approximation:

    Pr(f_1^J | e_1^I) ≈ p(J | I) · max_{a_1^J} ∏_{j=1}^J p(a_j | a_{j-1}, I) · p(f_j | e_{a_j})
Viterbi Training on HMM Model
# Accumulation (over corpus)
# find Viterbi path
For each sentence pair
    For each source position j
        For each target position i
            Pbest = 0
            t = p(fj|ei)
            For each target position i'
                Pprev = P(j-1,i')
                a = p(i|i',I,J)
                Pnew = Pprev*t*a
                if (Pnew > Pbest)
                    Pbest = Pnew
                    BackPointer(j,i) = i'
            P(j,i) = Pbest
    # update counts (backtrace along the Viterbi path)
    i = argmax_i{ P(J,i) }
    For each j from J downto 1
        Count(f_j, e_i)++
        iprev = BackPointer(j,i)
        Count(i,iprev,I,J)++
        i = iprev
# renormalize counts into probabilities
…
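A runnable sketch of the Viterbi path computation itself (count collection and renormalization omitted), assuming dictionaries p_lex[(f, e)], p_trans[(i, i_prev)] and an initial distribution p_init[i]; these names and the handling of j = 1 are assumptions, not from the slide:

    def hmm_viterbi_alignment(f_sent, e_sent, p_lex, p_trans, p_init):
        """Best HMM alignment a_1 .. a_J (1-based target positions) for one sentence pair."""
        J, I = len(f_sent), len(e_sent)
        V = [[0.0] * (I + 1) for _ in range(J + 1)]     # V[j][i]: best path prob ending in i at j
        back = [[0] * (I + 1) for _ in range(J + 1)]    # back pointers
        for i in range(1, I + 1):
            V[1][i] = p_init[i] * p_lex[(f_sent[0], e_sent[i - 1])]
        for j in range(2, J + 1):
            for i in range(1, I + 1):
                t = p_lex[(f_sent[j - 1], e_sent[i - 1])]
                best_prev = max(range(1, I + 1),
                                key=lambda ip: V[j - 1][ip] * p_trans[(i, ip)])
                V[j][i] = V[j - 1][best_prev] * p_trans[(i, best_prev)] * t
                back[j][i] = best_prev
        # backtrace from the best final state
        a = [0] * (J + 1)
        a[J] = max(range(1, I + 1), key=lambda i: V[J][i])
        for j in range(J, 1, -1):
            a[j - 1] = back[j][a[j]]
        return a[1:]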
HMM Forward-Backward Training
- Gamma: probability of emitting f_j when in state i, in sentence s
- Sum over all paths through (j, i):

    γ_j^s(i) = Σ_{a_1^J : a_j = i} ∏_{j'=1}^J p(a_{j'} | a_{j'-1}, I) · p(f_{j'} | e_{a_{j'}})
HMM Forward-Backward Training
- Epsilon: probability of transiting from state i' into state i
- Sum over all paths through (j-1, i') and (j, i), emitting f_j:

    ε_j(i', i) = Σ_{a_1^J : a_{j-1} = i', a_j = i} ∏_{j'=1}^J p(a_{j'} | a_{j'-1}, I) · p(f_{j'} | e_{a_{j'}})
Forward Probabilities
- Defined as:

    α_j(i) = Σ_{a_1^j : a_j = i} ∏_{j'=1}^j p(a_{j'} | a_{j'-1}, I) · p(f_{j'} | e_{a_{j'}})

- Recursion:

    α_j(i) = ( Σ_{i'=1}^I α_{j-1}(i') · p(i | i', I) ) · p(f_j | e_i)

- Initial condition:

    α_1(i) = p(i | 0, I) · p(f_1 | e_i)
Backward Probabilities
- Defined as (probability of the remaining emissions f_{j+1} ... f_J, given state i at source position j):

    β_j(i) = Σ_{a_{j+1}^J} ∏_{j'=j+1}^J p(a_{j'} | a_{j'-1}, I) · p(f_{j'} | e_{a_{j'}})   with a_j = i

- Recursion:

    β_j(i) = Σ_{i'=1}^I β_{j+1}(i') · p(i' | i, I) · p(f_{j+1} | e_{i'})

- Initial condition:

    β_J(i) = 1
Forward-Backward
- Calculate Gamma and Epsilon from Alpha and Beta
- Gammas:

    γ_j(i) = α_j(i) · β_j(i) / Σ_{i'=1}^I α_j(i') · β_j(i')

- Epsilons:

    ε_j(i', i) = α_{j-1}(i') · p(i | i', I) · p(f_j | e_i) · β_j(i)
                 / Σ_{ĩ', ĩ} α_{j-1}(ĩ') · p(ĩ | ĩ', I) · p(f_j | e_ĩ) · β_j(ĩ)
Parameter Re-Estimation
- Lexicon probabilities:

    p(f | e) = ( Σ_{s=1}^S Σ_{j=1}^{J_s} Σ_{i : f_j^s = f, e_i = e} γ_j^s(i) ) / ( Σ_{s=1}^S Σ_{j=1}^{J_s} Σ_{i : e_i = e} γ_j^s(i) )

- Alignment probabilities:

    p(i | i') = ( Σ_{s=1}^S Σ_{j=1}^{J_s} ε_j^s(i', i) ) / ( Σ_{s=1}^S Σ_{j=1}^{J_s} γ_j^s(i') )
Forward-Backward Training – Pseudo Code
# Accumulation
For each sentence pair {
    Forward pass (calculate Alphas)
    Backward pass (calculate Betas)
    Calculate Epsilons and Gammas
    For each source position j {
        For each target position i {
            Increase LexiconCount(f_j|e_i) by Gamma(j,i)
            For each target position i' {
                Increase AlignCount(i|i') by Epsilon(j,i,i')
            }
        }
    }
}
# Update
Normalize LexiconCount to get p(f_j|e_i)
Normalize AlignCount to get p(i|i')
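A runnable sketch of the accumulation step for one sentence pair, following the alpha/beta/gamma/epsilon definitions on the previous slides; the table layout (dicts keyed by words and positions) and the initial distribution p_init are assumptions, not from the slides:

    def hmm_forward_backward_counts(f_sent, e_sent, p_lex, p_trans, p_init,
                                    lex_count, align_count):
        """Accumulate expected HMM counts for one sentence pair into the given dicts."""
        J, I = len(f_sent), len(e_sent)
        alpha = [[0.0] * (I + 1) for _ in range(J + 1)]
        beta = [[0.0] * (I + 1) for _ in range(J + 1)]
        # forward pass (calculate alphas)
        for i in range(1, I + 1):
            alpha[1][i] = p_init[i] * p_lex[(f_sent[0], e_sent[i - 1])]
        for j in range(2, J + 1):
            for i in range(1, I + 1):
                s = sum(alpha[j - 1][ip] * p_trans[(i, ip)] for ip in range(1, I + 1))
                alpha[j][i] = s * p_lex[(f_sent[j - 1], e_sent[i - 1])]
        # backward pass (calculate betas), beta_J(i) = 1
        for i in range(1, I + 1):
            beta[J][i] = 1.0
        for j in range(J - 1, 0, -1):
            for i in range(1, I + 1):
                beta[j][i] = sum(beta[j + 1][ip] * p_trans[(ip, i)]
                                 * p_lex[(f_sent[j], e_sent[ip - 1])]
                                 for ip in range(1, I + 1))
        # gammas -> lexicon counts
        for j in range(1, J + 1):
            norm = sum(alpha[j][i] * beta[j][i] for i in range(1, I + 1))
            for i in range(1, I + 1):
                gamma = alpha[j][i] * beta[j][i] / norm
                lex_count[(f_sent[j - 1], e_sent[i - 1])] += gamma
        # epsilons -> transition counts
        for j in range(2, J + 1):
            norm = sum(alpha[j - 1][ip] * p_trans[(i, ip)]
                       * p_lex[(f_sent[j - 1], e_sent[i - 1])] * beta[j][i]
                       for i in range(1, I + 1) for ip in range(1, I + 1))
            for i in range(1, I + 1):
                for ip in range(1, I + 1):
                    eps = (alpha[j - 1][ip] * p_trans[(i, ip)]
                           * p_lex[(f_sent[j - 1], e_sent[i - 1])] * beta[j][i]) / norm
                    align_count[(i, ip)] += eps

The calling code would initialize lex_count and align_count as defaultdict(float), run this function over the whole corpus, and then normalize both count tables, exactly as in the pseudo code above.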
Example HMM Training