Morphology & Machine Translation
Eric Davis
MT Seminar
02/06/08
Professor Alon Lavie
Professor Stephan Vogel
Outline
Intro
The Issue at Hand
Supervised MA
Unsupervised MA
Integration of Morphology into MT
Papers
Morfessor
Bridging Inflectional Morphological Gap --> Arabic SMT
Unsupervised MA w/ Finnish, Swedish, & Danish
Turkish SMT
Discussion
The Good
The Bad
Future Directions
Q&A
Morfessor
morpheme segmentation & simple morphology induction algorithm
utilized Finnish & English data sets used in the Morpho Challenge
unsupervised method for segmentation of words into morpheme-like units
idea: propose substrings occurring frequently enough in several different word forms as morphs
words = concatenation of morphs
look for optimal balance btwn compactness of morph lexicon & compactness of corpus representation
very compact lexicon = individual letters --> as many morphs as letters in word
very short rep of corpus = whole words --> large lexicon
corpus represented as sequence of pointers to entries in morph lexicon
uses probabilistic framework or MDL to produce segmentation resembling linguistic morpheme segmentation
3 'flavors': Baseline, Categories-ML, Categories-MAP
Morfessor Baseline
context-independent splitting algorithm
optimization criterion = max P(lexicon) P(corpus|lexicon) = ∏ P(α) ∏ P(μ)
lexicon = all distinct morphs spelled out, forming strings of letters
α = strings of letters spelling out the morphs in the lexicon
P(lexicon) = ∏ P(α), where P(α) = product of the probability of each letter in the α string
corpus --> sequence of morphs
morphs --> particular segmentation of words in corpus
prob of segmentation P(corpus|lexicon) = ∏ P(μ) = product of the probability of each morph token μ
letter & morph probs are maximum likelihood estimates (see the sketch at the end of this slide)
3 errors:
1) undersegmentation: freq string stored as whole b/c most concise rep
2) oversegmentation: infreq string best coded in parts
3) morphotactic violations: b/c model is context-independent
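A minimal sketch of this criterion (toy code, not the Morfessor implementation; it uses plain maximum-likelihood letter and morph probabilities and ignores the other terms of the real cost function):

import math
from collections import Counter

def baseline_log_prob(segmented_corpus):
    """Score a candidate segmentation as log P(lexicon) + log P(corpus|lexicon).

    segmented_corpus: list of words, each a list of morph strings,
    e.g. [["open", "ed"], ["open", "s"]].
    """
    morph_tokens = [m for word in segmented_corpus for m in word]
    morph_counts = Counter(morph_tokens)
    lexicon = list(morph_counts)                      # distinct morphs, spelled out

    # P(lexicon): product over the letters spelling out every distinct morph
    letter_counts = Counter(ch for m in lexicon for ch in m)
    n_letters = sum(letter_counts.values())
    log_p_lexicon = sum(math.log(letter_counts[ch] / n_letters)
                        for m in lexicon for ch in m)

    # P(corpus|lexicon): product over the morph tokens of the corpus
    n_tokens = len(morph_tokens)
    log_p_corpus = sum(math.log(morph_counts[m] / n_tokens)
                       for m in morph_tokens)
    return log_p_lexicon + log_p_corpus

# compare two candidate segmentations of a tiny corpus
whole_words = [["opened"], ["opens"], ["opened"]]
split_words = [["open", "ed"], ["open", "s"], ["open", "ed"]]
print(baseline_log_prob(whole_words), baseline_log_prob(split_words))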
Morfessor Categories-ML
introduce morph categories
use HMM:
transition probabilities between categories
emission probabilities of morphs from categories
4 categories, assigned using properties of morphs in the proposed segmentation:
prefix: morph preceding a large # of diff morphs (high right perplexity)
stem: morph that is not very short
suffix: morph following a large # of diff morphs (high left perplexity)
noise: morph that is not an obvious prefix, suffix, or stem in the position it occurs in
use heuristics & the noise category to remove some errors of the baseline:
split redundant morphs in lexicon to reduce undersegmentation
prohibit splitting into 'noise'
join morphs tagged as noise w/ neighbors to reduce oversegmentation
introduce context-sensitivity (HMM) to reduce morphotactic violations (see the sketch below)
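A toy sketch of the HMM idea (all probabilities below are invented for illustration, not values learned by Morfessor): Viterbi decoding assigns each morph of a segmented word one of the four categories from transition and emission probabilities.

import math

CATEGORIES = ["PRE", "STM", "SUF", "NOI"]
trans = {  # P(next category | previous category), '#' = word start (made-up numbers)
    "#":   {"PRE": 0.3, "STM": 0.6, "SUF": 0.0, "NOI": 0.1},
    "PRE": {"PRE": 0.1, "STM": 0.8, "SUF": 0.0, "NOI": 0.1},
    "STM": {"PRE": 0.0, "STM": 0.2, "SUF": 0.7, "NOI": 0.1},
    "SUF": {"PRE": 0.0, "STM": 0.1, "SUF": 0.8, "NOI": 0.1},
    "NOI": {"PRE": 0.2, "STM": 0.4, "SUF": 0.2, "NOI": 0.2},
}
emit = {  # P(morph | category) for the morphs of this toy word (made-up numbers)
    "un":   {"PRE": 0.6, "STM": 0.1, "SUF": 0.0, "NOI": 0.3},
    "open": {"PRE": 0.0, "STM": 0.9, "SUF": 0.0, "NOI": 0.1},
    "ed":   {"PRE": 0.0, "STM": 0.1, "SUF": 0.8, "NOI": 0.1},
}

def viterbi(morphs):
    """Most likely category sequence for a morph sequence (log-space Viterbi)."""
    best = {c: (math.log(trans["#"][c] or 1e-12) +
                math.log(emit[morphs[0]][c] or 1e-12), [c]) for c in CATEGORIES}
    for m in morphs[1:]:
        new = {}
        for c in CATEGORIES:
            score, path = max(
                (best[p][0] + math.log(trans[p][c] or 1e-12) +
                 math.log(emit[m][c] or 1e-12), best[p][1]) for p in CATEGORIES)
            new[c] = (score, path + [c])
        best = new
    return max(best.values())[1]

print(viterbi(["un", "open", "ed"]))   # expected: ['PRE', 'STM', 'SUF']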
Morfessor Categories-MAP
2 probabilities calculated: P(lexicon) & P(representation of corpus conditioned on lexicon)
frequent strings still represented as whole words in lexicon
but frequent strings now have hierarchical representation
morph --> string of letters or 2 sub-morphs
expand morphs into sub-morphs to avoid undersegmentation
do not expand nodes in tree if next level = 'noise', to avoid oversegmentation
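A toy illustration of the hierarchical lexicon (the data structures and entries are assumptions, not Morfessor internals): each morph is either a string of letters or a pair of sub-morphs, and expansion stops when a child would be noise.

# Each lexicon entry is either a plain string (letters) or a pair of sub-morphs;
# categories live in a parallel dict. All entries here are made up.
lexicon = {
    "reopened": ("re", "opened"),   # frequent string kept whole, but with structure
    "opened":   ("open", "ed"),
    "re": "re", "open": "open", "ed": "ed",
}
category = {"re": "PRE", "open": "STM", "ed": "SUF", "opened": "STM", "reopened": "STM"}

def expand(morph):
    """Recursively expand a morph into sub-morphs, but never expand a node
    whose children include a noise morph (avoids oversegmentation)."""
    entry = lexicon[morph]
    if isinstance(entry, str):          # leaf: a plain string of letters
        return [morph]
    left, right = entry
    if category.get(left) == "NOI" or category.get(right) == "NOI":
        return [morph]                  # stop: next level is noise
    return expand(left) + expand(right)

print(expand("reopened"))   # ['re', 'open', 'ed']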
Experiments & Results
baseline entirely unsupervised
ML & MAP not entirely unsupervised:
optimize perplexity threshold separately for the 3 languages
run 3 models on Challenge data
ML & MAP > baseline
baseline did best on English
MAP had much higher precision than other models BUT lower recall
MAP & ML great improvement in recall BUT lower precision
explanation: different complexities of morphology
Turkish/Finnish: high type/token ratio
word formation --> concat of morphemes
So, proportion of frequently occurring word forms is lower
English: word formation --> fewer morphemes
So, proportion of frequently occurring word forms is higher
BAMA & Arabic MT
take advantage of source & target lang context when conducting MA
preprocess data w/ BAMA (Buckwalter Arabic Morphological Analyzer):
morphological analysis at word level
analyzes word --> returns all possible segmentations for word
segmentations --> prefixes, stems, suffixes
built-in word-based heuristics --> rank candidates
gloss info provided by BAMA's manually constructed lexicon
3 methods of analysis:
1) BAMA only
2) BAMA & context
3) BAMA & corresponding match
BAMA & Arabic MT
3 Methods of Analysis
1) BAMA only
Replace each Arabic word w/ the 1st possible split returned by BAMA
2) BAMA & context (see the sketch after this list)
Take full advantage of gloss info provided by BAMA's lexicon
Each split --> particular prefix, stem, suffix existing in lexicon
Set of possible translations (glosses) for each fragment
Select fragment (split for source word) using context
winner = split w/ most target-side matches in translation of full sentence
Save choice of split & use it for all occurrences of surface form of word in training & testing
3) BAMA & corresponding match
Arabic --> info in surface form not present in English
Confusing for word alignment unless fragments assigned to null
Remove fragments w/ lexical info not present in English
Found b/c their English translations in BAMA lexicon are empty
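A rough sketch of the context-based selection used in method 2 (the function and the data are hypothetical, not the authors' code): for each candidate split, count how many of its fragments have a gloss occurring in the English side of the sentence pair, and keep the split with the most matches.

def choose_split(candidate_splits, english_sentence):
    """candidate_splits: list of splits, each a list of (fragment, [glosses])
    pairs as they might come from BAMA's lexicon. Returns the split whose
    glosses match the most tokens on the target side."""
    english_tokens = set(english_sentence.lower().split())

    def matches(split):
        return sum(1 for _, glosses in split
                   if any(g.lower() in english_tokens for g in glosses))

    return max(candidate_splits, key=matches)

# hypothetical example: two possible analyses of one Arabic surface word
splits = [
    [("wa", ["and"]), ("ktb", ["books", "writings"])],
    [("waktb", ["he corresponded"])],
]
print(choose_split(splits, "and the books are on the table"))   # picks the first split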
BAMA & Arabic MT
Data & System
Data --> BTEC IWSLT05 Arabic language data
20,000 Arabic/English sentence pairs (training)
DevSet/Test05 --> 500 Arabic sentences each, w/ 16 reference translations per Arabic sentence
Also evaluated on randomly sampled dev & test sets
worried test & dev sets too similar
Used Vogel system w/ reordering & future cost estimation
baseline --> normalize (merge Alif, ta marbuta, ee)
Trained translation parameters for 10 scores (LM, word & phrase count, & 6 translation models)
Used MERT on dev set
Optimized system (separately) for both BLEU & NIST
Results
NIST scores --> steady improvement w/ better splitting techniques (up to 5% relative)
Improvements statistically significant
Better improvements for NIST than for BLEU
NIST --> sensitivity to correctly translating certain high-gain words in test corpus
unknown word --> inflectional splitting technique --> correct translation --> increased score
Unsupervised MA for Finnish,
Swedish, & Danish SMT
used morphological information found in unsupervised way in SMT
3 languages: Danish, Swedish & Finnish
Danish, Swedish very close to each other
trained system on corpus containing 300,000 sentences from EuroParl
typical IBM model --> trans model & LM
used morphs as tokens NOT words
used Morfessor Categories-MAP to find morpheme-like units
even works w/ agglutinative languages, e.g., Finnish
Reasoning: in speech recognition, using a morph-based vocabulary has been shown to improve results
used MAP because:
1) has better segmentation accuracy than Morfessor Baseline or ML
2) can handle unseen words
word = (PRE* STM SUF*)+
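This word structure can be checked with a small regular expression over the category tags of a word's morphs; a sketch (tag names assumed):

import re

# word = (PRE* STM SUF*)+ over the category tags of its morphs
WORD_PATTERN = re.compile(r"^(PRE )*STM( SUF)*( (PRE )*STM( SUF)*)*$")

def is_well_formed(tags):
    """tags: list like ['PRE', 'STM', 'SUF'] for one word's morphs."""
    return bool(WORD_PATTERN.match(" ".join(tags)))

print(is_well_formed(["PRE", "STM", "SUF"]))   # True
print(is_well_formed(["STM", "SUF", "STM"]))   # True (compound word)
print(is_well_formed(["SUF", "STM"]))          # False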
Language Models & Phrases
used basic n-gram LM --> based on sub-word units NOT words
used varigram model --> gets smaller n-gram model w/o restricting n too much
model grows incrementally & includes longer contexts only when necessary
used 3 types of LM:
2 baseline 3-gram & 4-gram models trained w/ SRILM toolkit
3rd --> varigram model trained w/ VariKN LM toolkit based on (Siivola, 2007)
observed --> trans quality improved by translating sequences of words (phrases)
used Moses --> generalized phrase-based approach to work w/ morphology
used morphs w/o modifications to Moses
phrases constructed from morphs the same way as from words
morphs suitable for translating compound words in parts
morph category info (pre, stm, suf) part of morph label
'+' marks a morph that is not the last morph of its word --> necessary to reconstruct words from morphs
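A small sketch of the reconstruction step (the exact labeling convention is an assumption here, not taken from the paper): a morph whose label ends in '+' is glued to the following morph, any other morph ends a word.

def morphs_to_words(morphs):
    """Rebuild surface words from a morph sequence in which a trailing '+'
    marks 'not the last morph of the word' (assumed convention)."""
    words, current = [], ""
    for morph in morphs:
        if morph.endswith("+"):
            current += morph[:-1]        # word continues
        else:
            words.append(current + morph)
            current = ""
    if current:                          # dangling '+' at the end of the sequence
        words.append(current)
    return words

print(morphs_to_words(["talo+", "ssa", "on", "kissa"]))   # ['talossa', 'on', 'kissa']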
Data & Experiments
ran all experiments on Moses & used BLEU to score
data = European Parliament from 1996-2001 --> stripped bi-texts of XML tags & converted letters to lowercase
test --> last 3 months of 2000
dev --> sessions of September 2000
training --> rest (excluding above)
Trained Morfessor on training set & used to segment dev & test sets
created 2 data sets for each alignment pair --> 1 w/ words, 1 w/ morphs
used training sets for LM training
used dev sets for parameter tuning
Moses cleaning script removed mis-aligned sentences (see the sketch below):
a) 0 tokens
b) too many tokens
c) bad token ratio
test set --> sentences had at least 5 words & at most 15 words
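A rough sketch of this kind of sentence-pair filter (the thresholds are illustrative, not the Moses script's actual defaults):

def keep_pair(src_tokens, tgt_tokens, max_len=80, max_ratio=3.0):
    """Drop pairs with an empty side, too many tokens, or a skewed token ratio."""
    ls, lt = len(src_tokens), len(tgt_tokens)
    if ls == 0 or lt == 0:                 # (a) 0 tokens
        return False
    if ls > max_len or lt > max_len:       # (b) too many tokens
        return False
    return max(ls, lt) / min(ls, lt) <= max_ratio   # (c) token ratio

pairs = [("den gamla mannen".split(), "vanha mies".split()),
         ("hello".split(), [])]
print([keep_pair(s, t) for s, t in pairs])   # [True, False]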
Results
morphs shorter than words --> need longer n-gram to cover same amount of context info
4-gram improves scores over 3-gram LM for morphs & for words (3/4) --> use 4-gram LM
default phrase length in Moses = 7 --> not long enough for morphs --> increased to 10
varigram model --> mixed results
overall --> translation based on morph phrases worse
significantly worse in 2 cases: Finnish-Swedish & Swedish-Finnish
Reasons:
only 1 reference translation --> hurts score
Finnish has fewer words for same text than Swedish or Danish
1 mistake in suffix of word --> whole word counted as error even if understandable
Untranslated Words
word-based translation model only translates words present in training data
data --> morphs have notably lower type count
same vocabulary coverage w/ smaller # more frequently occurring units
reduces OOV problem
results --> morph-based system translated many more sentences fully --> morph-based system translated more words
higher # of compound words & inflected word forms left untranslated by word-based system
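A toy comparison of OOV rates under a word vocabulary versus a morph vocabulary (data invented for the example):

def oov_rate(train_units, test_units):
    """Fraction of test tokens whose type was never seen in training."""
    vocab = set(train_units)
    return sum(1 for u in test_units if u not in vocab) / len(test_units)

# invented data: the compound 'kissatalo' is OOV as a word,
# but both of its morphs were seen in training
train_words  = ["kissa", "talo", "talossa"]
test_words   = ["kissatalo", "talossa"]
train_morphs = ["kissa", "talo", "talo", "ssa"]
test_morphs  = ["kissa", "talo", "talo", "ssa"]

print(oov_rate(train_words, test_words))     # 0.5: compound unseen as a word
print(oov_rate(train_morphs, test_morphs))   # 0.0: all morphs seen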
Performance on Baseforms
translating into Finnish --> both word & morph models have trouble getting grammatical endings right
morph-based model translated more words
restored words to baseform --> does morph-based model improve?
used FINTWOL (a Finnish morphological analyzer) to produce baseforms for each word in the output of Swedish-Finnish translation
3.3% (word), 2.2% (morph), & 1.8% (ref) of words not recognized by the MA --> left unchanged
BLEU scores about 5% higher for modified data
Word-based model still outperformed morph-based model
no test on other language pairs
Quality of Morfessor’s Segmentation
selected 500 words from data randomly & manually segmented them
precision = proportion of morph boundaries proposed by Morfessor agreeing w/ linguistic segmentation
recall = proportion of boundaries in linguistic segmentation found by Morfessor (see the sketch at the end of this slide)
segmentation accuracy for Danish & Swedish very similar
Finnish morphology more challenging --> results worse
precision around 80% for all languages --> 4/5 morph boundaries suggested by Morfessor correct
prefer high precision --> proposed morph boundaries usually correct
lower recall --> words generally undersegmented (segmentation more conservative)
difference btwn standard word representation & Morfessor segmentation smaller than difference btwn words & linguistic morphs
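These boundary precision and recall figures can be computed as in the following sketch (assumed representation: each word is a list of morph strings):

def boundaries(segmentation):
    """Boundary positions (character offsets) implied by a segmentation,
    e.g. ['talo', 'ssa'] -> {4}."""
    positions, offset = set(), 0
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions

def precision_recall(proposed, gold):
    """proposed, gold: parallel lists of segmentations, one per word."""
    tp = fp = fn = 0
    for p, g in zip(proposed, gold):
        pb, gb = boundaries(p), boundaries(g)
        tp += len(pb & gb)
        fp += len(pb - gb)
        fn += len(gb - pb)
    return tp / (tp + fp), tp / (tp + fn)

proposed = [["talo", "ssa"], ["kissatalo"]]
gold     = [["talo", "ssa"], ["kissa", "talo"]]
print(precision_recall(proposed, gold))   # (1.0, 0.5): second word undersegmented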
Closer Look at Segmentation
looked for phrases not spanning entire words
at least 1 phrase boundary = morph boundary w/in word
3 categories:
1) same structure across languages
compound words common in all 3 languages studied
Danish & Swedish similar --> similar morphological structure
parallel structures when translating nouns & verbs to or from Finnish
2) differing structures across languages
morph-based model captures fairly well
need way to re-order phrases
interesting: Finnish (written) turns verbs into nouns
3) lexicalized forms split into phrases
Swedish & Danish: translate phrase piece by piece even though phrases may be very short & not morphologically productive
data: 2/3 of translated sentences btwn Swedish & Danish have at least 1 phrase boundary w/in word --> only 1/3 in Finnish
Conclusion
unsupervised MA flexible --> provides language independence
generalization ability increased through more refined phrases
Improvements:
specialize alignment algorithm for morphs instead of words
rescore translations with word-based LM
combine allomorphs of same morpheme into equivalence classes
use factored translation models to combine them in translation
English-Turkish SMT
looked at sub-lexical structure b/c a Turkish word aligns to a complete phrase on the English side
phrase on English side may be discontinuous
Turkish --> 150 diff suffixes & 30,000 root words
use morphs to alleviate sparseness
Abstract away from word-internal details w/ morph representation
words that appear different on the surface may be similar at the morph level
Turkish has many more distinct word forms (2X Eng) but fewer distinct content words
May overload distortion mechanisms b/c they account for both word-internal morph sequence & sentence-level word ordering
Segmentation of word might not be unique
generate representation with lexical & morphological features for all possible segmentations & interpretations of word
Disambiguate analyses w/ statistical disambiguator using morph features
Exploiting Turkish Morphology
Process docs:
1) improve statistical alignment --> segment words into lexical morphemes to remove differences due to word-internal structure
2) tag English side w/ TreeTagger --> lemma & POS for each word
Remove any tags not implying a morpheme or exceptional form
3) extract sequence of roots for open-class content words from morph-segmented data
Remove all closed-class words as well as tags signaling a morph on an open-class word
Processing --> bolsters training corpus, improves alignment
Goal: align roots w/o additional noise from morphs or function words
Framework & Systems
used monolingual Turkish text of 100,000 sentences & training data for LM
decoded & rescored n-best list
surface words directly recoverable from concatenated representation of segmentation
used word-based representation for the word-based LM used for rescoring
used phrase-based SMT framework (Koehn) & Moses toolkit (Koehn) & SRILM toolkit (Stolcke)
evaluated decoded translations w/ BLEU using single reference translation
3 Systems:
Baseline
Fully morphologically segmented model
Selectively segmented model
Baseline System
Trained model using default Moses parameters w/ word-based training corpus
Decoded English test set w/ default decoder parameters & w/ distortion limit set to unlimited
Also tried distortion weight set to 0.1 to allow for long-distance distortions
Tried MERT but it did not improve scores
Added content word data & trained a 2nd baseline model
Adding content words hurt performance (16.29 vs. 16.13 & 20.16 vs. 19.77)
Fully Morphologically Segmented Model
Trained model w/ morphs, both w/ & w/o adding content words
Used 5-gram morpheme-based LM for decoding
goal: capture local morphotactic constraints & sentence-level ordering of words
2 morphs per word --> covers 2 words
decoded 1000-best lists
Converted 1000-best sentences into words & rescored w/ 4-gram word-based LM
goal: enforce distant word-sequence constraints
Experimented w/ parameters & various linear combos of word-based LM and trans model w/ tuning
Default decoding parameters used by Moses decoder provided bad results
English & Turkish word order very different --> need distortion
Allow longer distortions w/ less penalty --> 7 point BLEU improvement
Add content words --> 6.2% improvement (no rescoring) --> better alignment
Rescored 1000-best sentence output w/ 4-gram word-based LM
4% relative improvement (.79 BLEU points)
Best: allow distortion & rescore --> 1.96 BLEU points (9.4% relative)
Selectively Segmented Model
Analyzed GIZA++ alignment files
certain morphemes on Turkish side almost never aligned w/ anything
Only derivational MA on Turkish side
Nominalization, agreement markers, etc. mostly unaligned
For above cases --> attach morphemes to root (intervening morphs for verbs too)
Case morphemes did align w/ prepositions on English side, so left alone
Trained model w/ added content words & parameters from best-scoring model in last slide
2.43 BLEU points (11% relative) improvement over best model above
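A toy sketch of the selective segmentation step (the morpheme inventory and word are invented; the real lists come from inspecting the alignment files): morphemes that almost never align to anything are glued back onto the preceding unit, while case morphemes stay separate.

# morphemes that, in this invented example, almost never align to English words
REATTACH = {"+sH", "+lAr"}

def selectively_segment(morphs):
    """morphs: a fully segmented word, root first, then '+'-prefixed morphemes.
    Glue the 'never aligned' morphemes back onto the preceding unit; keep the
    rest (e.g. case morphemes such as '+dA') as separate tokens."""
    out = [morphs[0]]
    for m in morphs[1:]:
        if m in REATTACH:
            out[-1] += m.lstrip("+")   # reattach
        else:
            out.append(m)              # keep as its own token
    return out

print(selectively_segment(["ev", "+lAr", "+sH", "+dA"]))   # ['evlArsH', '+dA']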
Model Iteration
used iterative approach to use multiple models --> like post-editing
used selective segmentation model & decoded English training & English test sets to obtain T1 train & T1 test
trained next model on T1 train and T train data
aim: model maps T1 towards T
model applied to T1 train & T1 test --> produces T2 train & T2 test --> repeat (see the sketch at the end of this slide)
did not include content word corpus in these experiments:
Preliminary experiments --> word-based models perform better than morpheme-based models in later iterations
Adding content words for word-based models not helpful
decoded original test data using 3-gram word-based LM
Re-ranked 1000-best outputs using 4-gram LM
2nd iteration --> 4.86 (24% relative) improvement in BLEU over 1st fully morph-segmented model (no rescoring)
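A sketch of the iteration loop (train_model and decode are placeholders standing in for the real Moses training and decoding pipeline, not the authors' tooling):

def train_model(source, target):
    """Placeholder for the real training step: here it just memorizes a
    sentence-level source->target mapping, purely to make the loop runnable."""
    return dict(zip(source, target))

def decode(model, sentences):
    """Placeholder decoder: look each sentence up, fall back to itself."""
    return [model.get(s, s) for s in sentences]

def iterate_models(en_train, en_test, tr_train, rounds=2):
    """Post-editing style iteration: each round trains a model mapping the
    previous round's output back towards the reference Turkish, then re-decodes."""
    model = train_model(en_train, tr_train)                              # initial model
    t_train, t_test = decode(model, en_train), decode(model, en_test)    # T1
    for _ in range(rounds):
        model = train_model(t_train, tr_train)                           # T_i -> reference
        t_train, t_test = decode(model, t_train), decode(model, t_test)  # T_{i+1}
    return t_test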
Errors & Word Repair
Errors in any translated morpheme or in morphotactics --> word counted as incorrect
1-gram precision score --> get almost 65% of root words correct
1-gram precision score only about 52% w/ best model
Mismatches --> poorly formed words --> root correct but morphs not applicable or in wrong position
Many cases --> mismatches only 1 morpheme edit distance away from correct word
Solution:
Utilize morpheme-level 'spelling corrector' operating on segmented representations (see the sketch at the end of this slide)
Corrects forms w/ minor morpheme errors --> form lattice & rescore to select contextually correct form
Used BLEU+ to investigate:
recover all words 1 or 2 morphs away --> raises word BLEU score to 29.86 and 30.48
Oracle scores BUT very close to root word BLEU scores
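A minimal sketch of the morpheme-level edit distance such a corrector could use (implementation details assumed):

def morph_edit_distance(hyp, ref):
    """Levenshtein distance counted over morphemes instead of characters."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a morpheme
                          d[i][j - 1] + 1,         # insert a morpheme
                          d[i - 1][j - 1] + cost)  # substitute a morpheme
    return d[m][n]

# root correct, one suffix wrong: a single morpheme substitution away
print(morph_edit_distance(["ev", "+lAr", "+dA"], ["ev", "+lAr", "+dAn"]))   # 1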
Other Scoring Methods
BLEU very harsh on Turkish & morph-based approach
all-or-none nature of token comparison
Possible to have almost interchangeable words w/ very similar semantics
not an exact match --> BLEU marks as wrong
Solution: use stems & synonyms (METEOR)
alter notion of token similarity --> score increases to 25.08
use root-word synonymy & WordNet --> score increases to 25.45
combine rules & WordNet --> score increases to 25.46
Conclusions
Morphology & rescoring --> significant boost in BLEU score
Other solutions to morphotactics problem:
use skip LM in SRILM toolkit --> content word order directly used by decoder
use posterior probabilities to identify words that are morphologically correct but OOV or assigned low probability by the LM
generate additional 'close' morphological words & construct lattice that can be rescored