02Mar2006-11-734.ppt

Transcript 02Mar2006-11-734.ppt

Using Pivot/Bridge Languages
Matthias Eck
General Problem
 Resources are available between languages A and B
… and between languages B and C
… but not C and A
[Diagram: resources link A with B, and B with C; there is no direct link between C and A]
 How to train translation models between C and A?
1st paper
Multipath Translation Lexicon Induction via Bridge Languages
 Gideon S. Mann and David Yarowsky
 NAACL 2001
 Method for inducing translation lexicons based on
transduction models of cognate pairs via bridge languages
Lexicon via Cognate pairs
Lexicon:
 Mapping of word in source language to words in target
language
Here:
 Lexicon is built between arbitrary languages using models of
cognate pairs and cognate distance
General idea
[Diagram: English (source) connects through a dictionary to a bridge language, and a cognate model connects the bridge to the target. Romance family shown: Spanish, Portuguese, Italian, French, Romanian]
Translation pairs
English   French   Relationship
nephew    neveu    typical cognate pair
father    père     historically related, but now distant
water     eau      not related
 Cognate pairs can make up significant portion of lexicon if
languages are in the same family and close
Cognate string edit distance
 Obvious condition for a good distance D:
$\forall s \in S,\ c, n \in T$:
if $\mathrm{cognate}(s,c) \wedge \mathrm{noncognate}(s,n)$
then $D(s,c) < D(s,n)$
 So we choose
$\hat{t} = \arg\min_{t \in T} D(s,t)$
… as the translation for s
Used distance measures
 L: Levenshtein distance
 Minimum sum of the costs of edit operations required to
transform one string into another
 Deletion, Substitution, Insertion – traditional cost 1
 S: Stochastic transducers
 Probabilistic costs for each possible edit operation
 H: Hidden Markov Model
 Each character has separate edit operation parameters
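A minimal sketch of the L measure with the traditional unit costs, together with the nearest-word selection rule. The function names and the example words are illustrative assumptions, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def nearest_target(s: str, targets: list[str]) -> str:
    """Return the target word t that minimises D(s, t)."""
    return min(targets, key=lambda t: levenshtein(s, t))


# Hypothetical Spanish word matched against a tiny Portuguese word list.
print(nearest_target("noche", ["noite", "leite", "agua"]))
```

The dynamic program keeps only two rows, so memory grows with the length of one word rather than with the product of both lengths.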
Distance Measures
Variants of Levenshtein distance:
 L-V: vowel substitution cost only: 0.5
 L-S/L-A: Filter probabilities obtained by S into 3 classes 0.5, 0.75, 1
 L-S: Each pair separately trained
 L-A: Collectively trained for all Romance languages
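The L-V variant can be sketched as the same dynamic program with a cheaper vowel-to-vowel substitution. The vowel set below is an assumption (plain ASCII vowels only); the paper may treat accented characters differently.

```python
VOWELS = set("aeiou")  # assumed vowel inventory, ASCII only


def levenshtein_v(a: str, b: str, vowel_sub_cost: float = 0.5) -> float:
    """Levenshtein distance where vowel-for-vowel substitution costs 0.5."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            elif ca in VOWELS and cb in VOWELS:
                sub = vowel_sub_cost  # cheap vowel-to-vowel change
            else:
                sub = 1.0
            curr.append(min(prev[j] + 1.0, curr[j - 1] + 1.0, prev[j - 1] + sub))
        prev = curr
    return prev[-1]


print(levenshtein_v("casa", "cosa"))  # one vowel substitution
print(levenshtein_v("casa", "cata"))  # one consonant substitution
```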
Limitation
 The method cannot discover translation pairs that have no
surface-form relationship
 Assumed cognate pairs: Levenshtein edit distance < 3
 Few false positives
Intra Family Translation Lexicon Induction
 Family: Romance languages
 Available: dictionary (English/Bridge language)
 General evaluation algorithm:
1. Select 100 word pairs from dictionary for testing
2. For adaptive metrics: Select hypothesized word pairs (Edit
distance < 3) as cognate pairs and train on them
3. For each word in source language select closest word from
the 100 target words
Results
Source Languages:
 Spanish, French, Italian, Romanian
Target Language:
 Portuguese
 1000 word pairs in dictionary for Spanish/Portuguese
 900 for other language pairs
Results
 Pure Levenshtein distance works surprisingly well
 S gives boost on French-Portuguese
 Reason could be that Spanish and Portuguese are closer
than French and Portuguese
 L-S usually best
Consonant-to-consonant
 Most probable consonant-to-consonant edit operations for French – Portuguese
[Table: columns French, Portuguese; entries in extraction order: n m c p g b g f n v p x s f s c c g t q v d]
Analysis
Analysis - Example
Multiple bridge languages
[Diagram: English (source) connects through dictionaries to several bridge languages, and cognate models connect the bridges to the target. Slavic family shown: Czech, Russian, Ukrainian, Polish, Serbian]
Translation Lexicon Induction
Algorithm (One or more bridge languages)
For each word s ∈ S
    For each bridge language B
        Translate s → b ∈ B
        ∀ t ∈ T, calculate D(b,t)
        Rank t by D(b,t)
    Score t using information from all bridges
    Select highest-scored t
    Map s → t
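The loop above can be sketched as follows. The slide does not say how the per-bridge rankings are combined, so the reciprocal-rank sum used here is an assumption, as are the function names and the toy lexicons.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def induce(source_word, bridge_lexicons, target_vocab, distance=edit_distance):
    """Map one source word to a target word via any number of bridges."""
    scores = defaultdict(float)
    for lexicon in bridge_lexicons.values():
        bridge_word = lexicon.get(source_word)  # translate s -> b
        if bridge_word is None:
            continue  # this bridge has no entry for s
        # Rank every target word by its distance to the bridge word.
        ranked = sorted(target_vocab, key=lambda t: (distance(bridge_word, t), t))
        for rank, t in enumerate(ranked):
            scores[t] += 1.0 / (rank + 1)  # assumed scoring rule
    if not scores:
        return None
    # Highest combined score wins; ties go to the alphabetically first word.
    return max(sorted(scores), key=scores.get)


# Hypothetical English -> Portuguese example with two bridges.
bridges = {"es": {"night": "noche"}, "fr": {"night": "nuit"}}
print(induce("night", bridges, ["noite", "leite", "neto"]))
```

Passing a different `distance` function (for example a vowel-weighted variant) changes the cognate model without touching the bridge logic.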
Results
 One bridge language, but multiple paths
Examples
Different Pathways
 English to Portuguese (via Romance languages)
 English to Norwegian (via Germanic languages)
 English to Ukrainian (via Slavic languages)
 Portuguese to English (via Germanic languages, French)
Results
2nd Paper
Inducing Translation Lexicons via Diverse Similarity Measures
and Bridge Languages
 Charles Schafer and David Yarowsky
 COLING 2002
 Improves results of first paper by introducing additional
similarity scores between candidate translations
Basic Idea
 Decompose:
 P(English|Serbian) = P(English|Czech) x P(Czech|Serbian)
 For any language L close to Czech:
 P(English|L) = P(English|Czech) x P(Czech|L)
 P(Czech|L), as presented, was estimated using similarity on
cognate pairs
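A small sketch of the decomposition as lexicon composition. The slide writes a single product; summing over all bridge words, as done here, is an assumption, and the probabilities are made up for illustration.

```python
from collections import defaultdict


def compose(p_bridge_given_src, p_tgt_given_bridge):
    """Compute P(t|s) = sum over b of P(t|b) * P(b|s)."""
    result = {}
    for s, bridge_dist in p_bridge_given_src.items():
        scores = defaultdict(float)
        for b, p_b in bridge_dist.items():
            # Bridge words missing from the dictionary contribute nothing.
            for t, p_t in p_tgt_given_bridge.get(b, {}).items():
                scores[t] += p_t * p_b
        result[s] = dict(scores)
    return result


# Hypothetical numbers: P(Czech|Serbian) from a cognate model,
# P(English|Czech) from a bilingual dictionary.
p_cz_given_sr = {"prazan": {"prazdny": 0.6, "prizen": 0.4}}
p_en_given_cz = {
    "prazdny": {"empty": 0.5, "blank": 0.5},
    "prizen": {"favor": 1.0},
}
print(compose(p_cz_given_sr, p_en_given_cz))
```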
Covered Languages
Serbian
English
Slovene
Bulgarian
Punjabi
Gujarati
Hindi
Marathi
Nepali
Bengali
Czech
Polish
Ukrainian
Slovak
Resources
Serbian – Czech – English
 Czech – English dictionary: 171k word pairs
 Corpora: English: 192M words, Serbian: 12M (News data from web)
Gujarati – Hindi – English
 Hindi – English dictionary: 74k word pairs
 Corpora: Gujarati: 2M
Problem with Cognate Pairs
[Diagram: columns Serbian, Czech, English. Words shown: prazan, prizen, pazen, prazdny; English translations: favor, grace, patronage, blank, empty; paths labeled "not correct" and "correct"]
Idea
Introduce additional similarity models
 Weighted Levenshtein Similarity
 Context Similarity
 Date distributional Similarity
 Relative frequency Similarity
 Burstiness Similarity and Inverse Document
Frequency
 Use of Additional Bridge Languages
 Combine them with weighted string distance
Weighted Levenshtein Similarity
 First iteration:
Vowel cluster operations have half the cost of single
consonant substitutions, insertions and deletions
 dist(vowel+, vowel+)
 Use highest weighted of the top 2000 to re-estimate edit
weights
 Some high
probability substitutions:
Context Similarity
Compare narrow and wide contexts for candidates
Context: bag of words
(Narrow: radius 1/ Wide: radius 10)
1. Calculate Context on Source Language (Serbian)
2. Translate to English using current estimations
3. Compare with English Contexts via Cosine Similarity
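The three steps can be sketched as below. Splitting a context count evenly across a word's listed translations is an assumption about how the "current estimations" are applied, and all names are illustrative.

```python
import math
from collections import Counter


def context_vector(tokens, word, radius):
    """Bag of words within `radius` positions of each occurrence of `word`."""
    ctx = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            ctx.update(tokens[max(0, i - radius):i])   # left context
            ctx.update(tokens[i + 1:i + 1 + radius])   # right context
    return ctx


def translate_vector(ctx, lexicon):
    """Project source-language counts onto target words (assumed even split)."""
    out = Counter()
    for w, count in ctx.items():
        translations = lexicon.get(w, [])
        for t in translations:
            out[t] += count / len(translations)
    return out


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


tokens = "a b x c d".split()
print(context_vector(tokens, "x", radius=1))   # narrow context
print(context_vector(tokens, "x", radius=10))  # wide context
```

A candidate is then scored by `cosine(translate_vector(source_ctx, lexicon), candidate_ctx)`, once with the narrow and once with the wide radius.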
Context Similarity - Example
Nezavisnost
pravo: 2
suvereniteti: 3
deklaracije: 3
pokrajina: 4
Context in Serbian Corpus with frequencies
Context Similarity - Example
Nezavisnost
pravo: 2
suvereniteti: 3
deklaracije: 3
pokrajina: 4
Translate with Initial Lexicon
[Figure: English translations of the context words with weights. Words: majesty, justice, declaration, sovereignty, country, ornamental. Weights shown: 2, 1.5, 1.5, 1.5, 4, 1.5]
Context Similarity - Example
Context of Candidates in English Corpus
[Figure: the translated context vector of Nezavisnost next to the English-corpus context counts of the candidates Independence and Freedom; further labels shown: expression, religion]
Context Similarity - Example
[Figure: the same context vectors, compared by cosine similarity (COS)]
Cosine Similarity finds correct candidate (Independence)
Date distributional Similarity
 News Data:
 Events are reported in parallel in multiple languages
(+/- 2 days)
 Construct term frequency vectors over time and compare
candidates
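One way to sketch this measure: count a word's occurrences per publication date, spread each count over neighbouring days, and compare the vectors. Using the +/- 2 day tolerance as a smoothing window, and cosine as the comparison, are both assumptions; the slide does not give the exact formula.

```python
import math
from collections import Counter
from datetime import date, timedelta


def date_vector(dated_docs, word):
    """Occurrences of `word` per date; dated_docs yields (date, token list)."""
    counts = Counter()
    for day, tokens in dated_docs:
        n = tokens.count(word)
        if n:
            counts[day] += n
    return counts


def smooth(counts, window=2):
    """Spread each day's count over +/- `window` days to absorb reporting lag."""
    out = Counter()
    for day, n in counts.items():
        for offset in range(-window, window + 1):
            out[day + timedelta(days=offset)] += n
    return out


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def date_similarity(src_counts, tgt_counts, window=2):
    """Compare a smoothed source date vector with a target date vector."""
    return cosine(smooth(src_counts, window), tgt_counts)


# Hypothetical news items: the same event reported one day apart.
src = date_vector([(date(2002, 3, 1), ["izbori", "danas"])], "izbori")
tgt = date_vector([(date(2002, 3, 2), ["election", "today"])], "election")
print(date_similarity(src, tgt) > 0)
```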
Date distributional Similarity
Relative Frequencies
 Word and translation are likely to have similar relative
frequencies
 Modest frequency variations are expected
 Useful to rule out pairings with several orders of magnitude
difference in relative frequency
 Ratio of logs of frequencies correlates well with translational
compatibility
Relative Frequency Similarity
 Correct translation “laud” has higher RF Score than higher
ranked incorrect candidates “calibre”, “quarter” and “class”
Burstiness Similarity
 Define Burstiness to measure differences
Burstiness Similarity
 Burstiness matches better for correct translations “laud” and
“praise”
Combine the different measures
1. Weighted Levenshtein distance to get initial candidate pairs
2. Calculate 8 similarity measures
 Weighted Levenshtein
 Wide bag-of-words context similarity
 Narrow bag-of-words context similarity
 Local News date distribution similarity
 All News date distribution similarity
 IDF similarity
 Burstiness similarity
Combine the different measures
3. Integrate similarity measures into a single similarity function:
1. POS Similarity
Bias in favor of compatible parts of speech (N, V, ADJ)
Penalty for non-matching candidates
2. Sort candidates for each score in decreasing order
Assign Ranks 0,1,… and normalize by count
3. Scoring: Similarity models have associated weights
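Steps 2 and 3 can be sketched as a weighted sum of normalized ranks. The POS bias of step 1 is left out, the weights and scores are made up, and treating the weighted sum as the final score is an assumption about how the weights are applied.

```python
def normalized_ranks(scores):
    """Sort by decreasing score, assign ranks 0, 1, ... and divide by count."""
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    n = len(ordered)
    return {c: rank / n for rank, c in enumerate(ordered)}


def combine(measure_scores, weights):
    """Order candidates by weighted sum of normalized ranks (lower is better).

    Assumes every measure scores the same set of candidates.
    """
    total = {}
    for name, scores in measure_scores.items():
        for c, r in normalized_ranks(scores).items():
            total[c] = total.get(c, 0.0) + weights[name] * r
    return sorted(total, key=lambda c: (total[c], c))


# Hypothetical similarity scores for three English candidates.
measures = {
    "levenshtein": {"laud": 0.9, "class": 0.8, "calibre": 0.7},
    "context": {"laud": 0.6, "class": 0.1, "calibre": 0.2},
}
print(combine(measures, {"levenshtein": 1.0, "context": 2.0}))
```

Working on ranks rather than raw scores keeps measures with very different scales comparable.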
Weight Allocation
Evaluation
3 Evaluation Criteria
 Exact Match Accuracy
 Percentage of correct English translations in the top k ranks
 Median position of the highest-ranked correct translation
per word
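A sketch of the three criteria for a ranked candidate list per word. Leaving words with no correct candidate out of the median is an assumption, and the data structures are illustrative.

```python
from statistics import median


def evaluate(ranked, gold, k=10):
    """ranked: word -> candidate list (best first); gold: word -> set of correct words."""
    exact = top_k = 0
    best_positions = []
    for word, candidates in ranked.items():
        correct = gold[word]
        # 1-based position of the highest-ranked correct translation, if any.
        pos = next((i for i, c in enumerate(candidates, start=1) if c in correct), None)
        if pos is None:
            continue  # assumed: misses do not enter the median
        best_positions.append(pos)
        exact += pos == 1
        top_k += pos <= k
    n = len(ranked)
    return {
        "exact_match": exact / n,
        "top_k": top_k / n,
        "median_position": median(best_positions) if best_positions else None,
    }


ranked = {"w1": ["a", "b"], "w2": ["x", "y", "z"], "w3": ["q"]}
gold = {"w1": {"a"}, "w2": {"z"}, "w3": {"r"}}
print(evaluate(ranked, gold, k=2))
```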
Results
Results
 Improvements with second bridge language
Additional Bridge Language Work
Interlingua based Statistical Machine Translation
 Manuel Kauers, Stephan Vogel, Christian Fügen, Alex Waibel
 ICSLP 2002
 Paper covers SMT from Text to a structured Interlingua
format (IF)
English
IF
 Corpus English/IF is available
…but we also want to translate other languages into IF?
Generalized problem
 Assume we have translation model F to E and G to F
… but we want G to E?
E
G
 Decompose:
 Because:
F
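The equations on this slide did not survive in the transcript. A plausible form, assuming marginalization over intermediate sentences f and assuming that e depends on g only through f (neither is confirmed by the source), would be:

```latex
p(e \mid g) \;=\; \sum_{f} p(e \mid f, g)\, p(f \mid g)
\;\approx\; \sum_{f} p(e \mid f)\, p(f \mid g)
```

The first equality is the law of total probability; the approximation is the conditional-independence assumption that lets the two existing models (F to E and G to F) be chained.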
And just translating…
 Experiments done during PF-STAR project 2003/2004
 Training data: 48k lines of BTEC data
 Test data: 506 lines, Test set for CSTAR 2003
 Translating Chinese → Italian
 Also via a bridge language Chinese → English → Italian
            Ch → It          Ch → En → It
ITC-IRST    0.1769/4.5251    0.1695/4.4343
CMU/UKA     0.2030/4.8210    0.2238/4.9453
