Intelligent Integration Technology for Internet Knowledge Information Based on Korean


Statistical Alignment and Machine Translation
Artificial Intelligence Laboratory
정성원
Contents
• Machine Translation
• Text Alignment
– Length-based methods
– Offset alignment by signal processing techniques
– Lexical methods of sentence alignment
• Word Alignment
• Statistical Machine Translation
Different Strategies for MT (1)
[Figure: the MT pyramid –
Interlingua (knowledge representation) – knowledge-based translation
English (semantic representation) <-> French (semantic representation) – semantic transfer
English (syntactic parser) <-> French (syntactic parser) – syntactic transfer
English Text (word string) <-> French Text (word string) – word-for-word]
Different Strategies for MT (2)
• Machine Translation: an important but hard problem
• Why is MT hard?
– word-for-word
• Lexical ambiguity
• Different word order
– syntactic transfer approach
• Can solve problems of word order
• Syntactic ambiguity remains
– semantic transfer approaches
• Can fix cases of syntactic mismatch
• Output can still be unnatural or unintelligible
– interlingua
• Requires a complete, language-independent meaning representation, which is very hard to design
MT & Statistical Methods
• In theory, each of the arrows in the prior figure can be implemented with a probabilistic model.
– In practice, most MT systems are a mix of probabilistic and non-probabilistic components.
• Text alignment
– Used to create lexical resources such as bilingual dictionaries and parallel grammars, which improve the quality of MT
– There has been more work on text alignment than on MT itself in statistical NLP.
Text Alignment
• Parallel texts (bitexts)
– The same content is available in several languages
– Official documents of countries with multiple official languages -> literal, consistent translations
• Alignment
– Paragraph to paragraph, sentence to sentence, word to word
• Uses of aligned text
– Bilingual lexicography
– Machine translation
– Word sense disambiguation
– Multilingual information retrieval
– Assisting tools for translators
Aligning sentences and paragraphs(1)
• Problems
– Not always one sentence to one sentence
– Reordering
– Large pieces of material can disappear
• Methods
– Length-based vs. lexical-content-based
– Match corresponding points vs. form sentence beads
Aligning sentences and paragraphs(2)
Aligning sentences and paragraphs(3)
• Bead: an n:m grouping
– S, T: texts in two languages
– S = (s1, s2, …, si)
– T = (t1, t2, …, tj)
– bead types: 0:1, 1:0, 1:1, 2:1, 1:2, 2:2, 2:3, 3:2, …
– Each sentence can occur in only one bead
– No crossing
[Figure: the sentences s1 … si of S and t1 … tj of T grouped into beads b1 … bk]
Dynamic Programming(1)
[Figure: weighted layered graph with source v0 (= v01), layers v11–v13, v21–v23, v31–v33, v41–v43, and sink v5 (= v51); its edge weights drive the shortest-path computation on the next slide]
Dynamic Programming(2)
• Computing the shortest path

f(P_min) = d_min(v_{01}) = \min_{1 \le i \le 3} \{ d(v_{01}, v_{1i}) + d_min(v_{1i}) \}
d_min(v_{1i}) = \min_{1 \le j \le 3} \{ d(v_{1i}, v_{2j}) + d_min(v_{2j}) \}
d_min(v_{2i}) = \min_{1 \le j \le 3} \{ d(v_{2i}, v_{3j}) + d_min(v_{3j}) \}
d_min(v_{3i}) = \min_{1 \le j \le 3} \{ d(v_{3i}, v_{4j}) + d_min(v_{4j}) \}

Since d_min(v_{4j}) = d(v_{4j}, v_{51}):

d_min(v_{41}) = d(v_{41}, v_{51}) = 4;  d_min(v_{42}) = 6;  d_min(v_{43}) = 3
Table of d_min values, with the arg-min next vertex in parentheses:

j | d_min(v_{0j}) | d_min(v_{1j}) | d_min(v_{2j}) | d_min(v_{3j}) | d_min(v_{4j})
1 | 22 (v12)      | 20 (v21)      | 11 (v32)      | 5 (v43)       | 4 (v51)
2 |               | 14 (v22)      | 12 (v31)      | 6 (v41)       | 6 (v51)
3 |               | 18 (v22)      | 10 (v32)      |               | 3 (v51)

Reading off the table, the shortest path is v01 -> v12 -> v22 -> v31 -> v43 -> v51, with total cost f(P_min) = d_min(v01) = 22.
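A minimal sketch of this backward recurrence in code. The edge weights below are hypothetical stand-ins, since the figure's full edge list is not recoverable here; only the layered structure matters.

```python
def d_min(edges, v, sink, memo=None):
    """Cheapest cost from v to sink, plus the path that achieves it."""
    if memo is None:
        memo = {sink: (0, [sink])}
    if v not in memo:
        # min over successors u of: edge weight + best cost from u onward
        memo[v] = min(
            (w + d_min(edges, u, sink, memo)[0],
             [v] + d_min(edges, u, sink, memo)[1])
            for u, w in edges[v].items()
        )
    return memo[v]

# Hypothetical layered graph: v01 -> layer 1 -> ... -> layer 4 -> v51
edges = {
    "v01": {"v11": 9, "v12": 8, "v13": 3},
    "v11": {"v21": 9, "v22": 5}, "v12": {"v21": 8, "v22": 2},
    "v13": {"v22": 6, "v23": 7}, "v21": {"v31": 8, "v32": 5},
    "v22": {"v31": 7, "v32": 3}, "v23": {"v32": 4, "v33": 6},
    "v31": {"v41": 4, "v43": 2}, "v32": {"v41": 2, "v42": 6},
    "v33": {"v42": 3, "v43": 6}, "v41": {"v51": 4},
    "v42": {"v51": 6}, "v43": {"v51": 3},
}
cost, path = d_min(edges, "v01", "v51")
```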
Length-based methods
• Rationale
– Short sentences tend to translate into short sentences
– Long sentences tend to translate into long sentences
– Ignores richer information, but quite effective
• Length
– number of words or number of characters
• Pros
– Efficient and rapid (for similar languages)
Gale and Church (1)
• Find the alignment A that maximizes P(A | S, T) (S, T: parallel texts)

\arg\max_A P(A \mid S, T) = \arg\max_A P(A, S, T)

• Decompose the aligned texts into a sequence of aligned beads (B_1, …, B_K), assumed to be generated independently:

P(A, S, T) = \prod_{k=1}^{K} P(B_k)

• The method
– lengths of source and translation sentences measured in characters
– assumes similar languages and literal translations
– used on the Union Bank of Switzerland (UBS) corpus
• English, French, German
• aligned at the paragraph level
Gale and Church (2)
• D(i, j): the lowest-cost alignment between sentences s_1, …, s_i and t_1, …, t_j

D(i, j) = \min \begin{cases}
D(i, j-1) + cost(0{:}1 \text{ align } \varnothing, t_j) \\
D(i-1, j) + cost(1{:}0 \text{ align } s_i, \varnothing) \\
D(i-1, j-1) + cost(1{:}1 \text{ align } s_i, t_j) \\
D(i-1, j-2) + cost(1{:}2 \text{ align } s_i, t_{j-1}, t_j) \\
D(i-2, j-1) + cost(2{:}1 \text{ align } s_{i-1}, s_i, t_j) \\
D(i-2, j-2) + cost(2{:}2 \text{ align } s_{i-1}, s_i, t_{j-1}, t_j)
\end{cases}
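A minimal sketch of this dynamic program, assuming a `bead_cost(l1, l2)` function over total character lengths (the 0:1 and 1:0 cases pass a zero length):

```python
def align_cost(src_lens, tgt_lens, bead_cost):
    """src_lens, tgt_lens: sentence lengths in characters. Returns D(I, J)."""
    INF = float("inf")
    I, J = len(src_lens), len(tgt_lens)
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    for i in range(I + 1):
        for j in range(J + 1):
            if i == j == 0:
                continue
            cands = []
            if j >= 1:              # 0:1 bead
                cands.append(D[i][j - 1] + bead_cost(0, tgt_lens[j - 1]))
            if i >= 1:              # 1:0 bead
                cands.append(D[i - 1][j] + bead_cost(src_lens[i - 1], 0))
            if i >= 1 and j >= 1:   # 1:1 bead
                cands.append(D[i - 1][j - 1]
                             + bead_cost(src_lens[i - 1], tgt_lens[j - 1]))
            if i >= 1 and j >= 2:   # 1:2 bead
                cands.append(D[i - 1][j - 2]
                             + bead_cost(src_lens[i - 1],
                                         tgt_lens[j - 2] + tgt_lens[j - 1]))
            if i >= 2 and j >= 1:   # 2:1 bead
                cands.append(D[i - 2][j - 1]
                             + bead_cost(src_lens[i - 2] + src_lens[i - 1],
                                         tgt_lens[j - 1]))
            if i >= 2 and j >= 2:   # 2:2 bead
                cands.append(D[i - 2][j - 2]
                             + bead_cost(src_lens[i - 2] + src_lens[i - 1],
                                         tgt_lens[j - 2] + tgt_lens[j - 1]))
            D[i][j] = min(cands)
    return D[I][J]
```

Keeping backpointers would recover the bead sequence itself; the sketch returns only the total cost.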
Gale and Church (3)
[Figure: two candidate alignments of L1 sentences s1–s4 with L2 sentences t1–t3]
Alignment 1: cost(align(s1, s2; t1)) + cost(align(s3; t2)) + cost(align(s4; t3))
Alignment 2: cost(align(s1; t1)) + cost(align(s2; t2)) + cost(align(s3; ∅)) + cost(align(s4; t3))
Gale and Church (4)
• l1, l2: the lengths in characters of the sentences of each language in the bead
• The ratio of character lengths between the two languages is modeled with a normal distribution with parameters (μ, s²):

\delta = (l_2 - l_1 \mu) / \sqrt{l_1 s^2}

cost(l_1, l_2) = -\log P(\alpha \text{ align} \mid \delta(l_1, l_2; \mu, s^2)) \propto -\log \big( P(\alpha \text{ align}) \, P(\delta \mid \alpha \text{ align}) \big)

– the higher the probability, the lower the cost (a code sketch follows this slide's bullets)
• average 4% error rate
• 2% error rate for 1:1 alignments
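A minimal sketch of the bead cost above, assuming l1 > 0; the parameter values (mu, s2, and the 1:1 prior p_align) are illustrative stand-ins, not the paper's exact estimates:

```python
import math

def bead_cost(l1, l2, mu=1.0, s2=6.8, p_align=0.89):
    # delta: standardized difference between l2 and its expectation l1 * mu
    delta = (l2 - l1 * mu) / math.sqrt(l1 * s2)
    # P(|D| >= |delta|) for a standard normal D: 2 * (1 - Phi(|delta|)),
    # which equals erfc(|delta| / sqrt(2))
    p_delta = math.erfc(abs(delta) / math.sqrt(2))
    # higher probability => lower cost
    return -math.log(p_align) - math.log(max(p_delta, 1e-300))
```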
Other Research
• Brown et al. (1991c)
– Corpus: Canadian Hansard (English, French)
– Method: comparing sentence lengths in words rather than characters
– Goal: produce an aligned subset of the corpus
– Feature: EM algorithm
• Wu (1994)
– Corpus: Hong Kong Hansard (English, Cantonese)
– Method: the Gale and Church (1993) method
– Result: the method's assumptions are not as clearly met when dealing with unrelated languages
– Feature: uses lexical cues
Offset alignment by signal processing techniques
• Shows roughly what offset in one text aligns with what offset in the other.
• Church (1993)
– Background: noisy text (e.g., OCR output)
– Method
• Define cognates at the character-sequence level -> true cognates + proper names + numbers
• dot-plot method over character 4-grams (see the sketch after this list)
– Result: very small error rate
– Drawbacks
• different character sets
• no or extremely few identical character sequences
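A minimal sketch of the dot plot over character 4-grams: concatenate the two texts and put a dot at (x, y) wherever positions x and y start the same 4-gram; a good alignment shows up as a rough diagonal in the cross-text quadrant.

```python
from collections import defaultdict

def dot_plot_points(text, n=4):
    """Dots (x, y) wherever positions x and y start the same n-gram."""
    starts = defaultdict(list)
    for i in range(len(text) - n + 1):
        starts[text[i:i + n]].append(i)
    return [(x, y) for pos in starts.values() for x in pos for y in pos]

# For a bitext, plot the points of text1 + text2 and inspect the
# off-diagonal region where x falls in text1 and y falls in text2.
points = dot_plot_points("aggctt" + "aggcta")
```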
DOT-PLOT
[Figure: dot plots of a character sequence against itself, at the uni-gram and bi-gram level; dots mark positions where the same character or character pair recurs]
Fung and McKeown
• Conditions
– works without having found sentence boundaries
– works on only roughly parallel texts
– works on unrelated languages
• Languages: English and Cantonese
• Method
– arrival vectors
– small bilingual dictionary
• A word's offsets (1, 263, 267, 519) => arrival vector (262, 4, 252) (see the sketch below).
• Choose English-Cantonese word pairs of high similarity => small bilingual dictionary => anchors for text alignment
• A strong signal in a line along the diagonal of the dot plot => good alignment
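A minimal sketch of the arrival-vector computation, reproducing the slide's example:

```python
def arrival_vector(offsets):
    """Gaps between successive occurrences of a word in the text."""
    return tuple(b - a for a, b in zip(offsets, offsets[1:]))

arrival_vector((1, 263, 267, 519))  # -> (262, 4, 252)
```

Word pairs whose arrival vectors look alike across the two texts become the candidates for the small bilingual dictionary.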
Lexical methods of sentence alignment(1)
• Align beads of sentences in a robust way using lexical information
• Kay and Röscheisen (1993)
– Features: lexical cues, a process of convergence
– Algorithm
• Set initial anchors
• Until most sentences are aligned:
– Form an envelope of possible alignments
– Choose pairs of words that tend to co-occur in these potential partial alignments
– Find pairs of source and target sentences that contain many possible lexical correspondences
Lexical methods of sentence alignment(2)
• 96% coverage after four passes on Scientific American articles
• 7 errors after 5 passes on 1000 Hansard sentences
• Drawbacks
– computationally intensive
– pillow-shaped envelope => problems when text has been moved or deleted
Lexical methods of sentence alignment(3)
• Chen (1993)
– Similar to the model of Gale and Church (1993)
– A simple translation model is used to estimate the cost of an alignment.
– Corpora
• Canadian Hansard and European Economic Community proceedings (millions of sentences)
– Estimated error rate: 0.4%
• most errors are due to the sentence-boundary detection method => no further improvement
Lexical methods of sentence alignment(4)
• Haruno and Yamazaki (1996)
– Aligns structurally different languages
– A variant of Kay and Röscheisen (1993)
– Does lexical matching on content words only
• using a POS tagger
– Uses an online dictionary to align short texts
– A knowledge-rich approach
– The combined methods give good results even on short texts between very different languages
Word Alignment
• Uses
– terminology databases, bilingual dictionaries
• Methods
– text alignment -> word alignment
– χ² measure (see the sketch after this list)
– EM algorithm
• Use of existing bilingual dictionaries
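A minimal sketch of the χ² association measure over a sentence-aligned corpus, scoring how strongly an English word and a French word co-occur via a 2x2 contingency table:

```python
def chi_square(pairs, we, wf):
    """pairs: list of (english_words, french_words) aligned sentence pairs."""
    n = len(pairs)
    a = sum(1 for e, f in pairs if we in e and wf in f)      # both occur
    b = sum(1 for e, f in pairs if we in e and wf not in f)  # only we
    c = sum(1 for e, f in pairs if we not in e and wf in f)  # only wf
    d = n - a - b - c                                        # neither
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```

High-scoring pairs become candidate entries for a bilingual dictionary or terminology database.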
Statistical Machine Translation(1)
[Figure: noisy channel model – a language model P(e) generates e, the translation model P(f | e) turns it into f, and the decoder recovers ê = argmax_e P(e | f)]
• Noisy channel model in MT
– Language model
– Translation model
– Decoder
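A minimal sketch of the decoding rule over an assumed finite candidate set; real decoders search an effectively infinite space with stack search instead:

```python
def decode(f, candidates, lm, tm):
    """lm(e) ~ P(e); tm(f, e) ~ P(f | e). P(f) is constant and drops out."""
    return max(candidates, key=lambda e: lm(e) * tm(f, e))
```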
Statistical Machine Translation(2)
• Translation model
– compute P(f | e) by summing the probabilities of all alignments (a code sketch follows the figure)

P(f \mid e) = \frac{1}{Z} \sum_{a_1 = 0}^{l} \cdots \sum_{a_m = 0}^{l} \prod_{j=1}^{m} P(f_j \mid e_{a_j})

– e: English sentence
– l: the length of e in words
– f: French sentence
– m: the length of f in words
– f_j: word j in f
– a_j: the position in e that f_j is aligned with
– e_{a_j}: the word in e that f_j is aligned with
– P(w_f | w_e): translation probability
– Z: normalization constant
[Figure: each French word f_j is linked to the English word e_{a_j} it is aligned with]
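A minimal sketch of the alignment-summed translation probability, assuming a hypothetical dict `t` of word translation probabilities. For this model the sum over all alignments factorizes into a per-position sum, \sum_a \prod_j P(f_j \mid e_{a_j}) = \prod_j \sum_i P(f_j \mid e_i), which the code exploits:

```python
def p_f_given_e(f_words, e_words, t, z=1.0):
    """t[(wf, we)]: P(wf | we); "NULL" stands for position a_j = 0."""
    es = ["NULL"] + list(e_words)
    p = 1.0 / z
    for wf in f_words:
        p *= sum(t.get((wf, we), 0.0) for we in es)
    return p
```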
Statistical Machine Translation(3)
• Decoder

\hat{e} = \arg\max_e P(e \mid f) = \arg\max_e \frac{P(e) \, P(f \mid e)}{P(f)} = \arg\max_e P(e) \, P(f \mid e)

– the search space is infinite => stack search
• Translation probability P(w_f | w_e)
– Assume that we have a corpus of aligned sentences.
– EM algorithm:

Random initialization of P(w_f \mid w_e)
E step: z_{w_f, w_e} = \sum_{(e, f)\ \text{s.t.}\ w_e \in e,\ w_f \in f} P(w_f \mid w_e)
M step: P(w_f \mid w_e) = \frac{z_{w_f, w_e}}{\sum_v z_{v, w_e}}  (v ranges over French words, so that P(· | w_e) is normalized)
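A minimal sketch of this EM loop, assuming `pairs` is a list of (english_words, french_words) sentence pairs; it implements the slide's simplified E and M steps, not full IBM Model 1, and uses a uniform start in place of random initialization:

```python
from collections import defaultdict

def em_translation_probs(pairs, iterations=5, eps=1e-6):
    t = defaultdict(lambda: 1.0)  # uniform start instead of random init
    for _ in range(iterations):
        z = defaultdict(float)
        # E step: z(wf, we) accumulates P(wf | we) over sentence pairs
        # in which we and wf co-occur
        for e_sent, f_sent in pairs:
            for we in set(e_sent):
                for wf in set(f_sent):
                    z[(wf, we)] += t[(wf, we)]
        # M step: renormalize so that sum over wf of P(wf | we) is 1
        total = defaultdict(float)
        for (wf, we), v in z.items():
            total[we] += v
        t = defaultdict(lambda: eps)
        for (wf, we), v in z.items():
            t[(wf, we)] = v / total[we]
    return t
```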
Statistical Machine Translation(4)
• Problems
– distortion
– fertility: the number of French words that one English word generates
• Experiment
– 48% of French sentences were decoded correctly
– the failures were incorrect decodings and ungrammatical decodings
Statistical Machine Translation(5)
• Detailed problems
– model problems
• Fertility is asymmetric
• Independence assumptions
• Sensitivity to training data
• Efficiency
– lack of linguistic knowledge
• No notion of phrases
• Non-local dependencies
• Morphology
• Sparse data problems