Talk Transcript
Universal Morphological Analysis using Structured Nearest Neighbor Prediction
Young-Bum Kim, João V. Graça, and Benjamin Snyder
University of Wisconsin-Madison
28 July 2011

Unsupervised NLP

- Unsupervised learning in NLP has become popular: 27 papers in this year's ACL + EMNLP.
- It relies on an inductive bias, encoded in the model structure or the learning algorithm.
- Example: an HMM for POS induction encodes transitional regularity.
  (diagram: hidden tags "? ? ? ?" above the words "I like to read")

Inductive Biases

- Usually formulated with weak empirical grounding (or left implicit).
- A single, simple bias for all languages leads to low performance, complicated models, fragility, and language dependence.
- Our approach: learn a complex, universal bias using labeled languages, i.e., empirically learn what the space of plausible human languages looks like, and use it to guide unsupervised learning.

Key Idea

1) Collect labeled corpora (non-parallel) for several training languages: (x1, y1), (x2, y2), (x3, y3), where x is a corpus and y its labels. The test language provides only (x, ?).
2) Map each (x, y) pair into a "universal feature space" f(x, y), to allow cross-lingual generalization.
3) Train a scoring function score(·) over the universal feature space, i.e., treat each annotated language as a single data point in a structured prediction problem.
4) Predict the test labels which yield the highest score:

    y* = argmax_y score(f(x, y))

Test Case: Nominal Morphology

- Languages differ in morphological complexity:
  - only 4 English noun tags in the Penn Treebank
  - 154 noun tags in a Hungarian corpus (suffixes encode case, number, and gender)
- Our analysis breaks each noun into a stem, a phonological deletion rule, and a suffix:
  - utiskom → [stem = utisak, del = (..ak# → ..k#), suffix = om]
- Question: can we use morphologically annotated languages to train a universal morphological analyzer?

Our Method

- Universal feature space (8 features):
  - size of the stem, suffix, and deletion rule lexicons
  - entropy of the stem, suffix, and deletion rule distributions
  - percentage of suffix-free words, and percentage of words with phonological deletions
  (a code sketch of this feature map follows this slide)
- Learning algorithm:
  - broad characteristics of morphology are often similar across select language pairs
  - this motivates a nearest neighbor approach
  - in the structured scenario, learning becomes a search problem over the label space
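The eight features above are concrete enough to sketch in code. The following is a minimal sketch, not the authors' implementation: it assumes a corpus analysis is stored as a dict from word type to a (stem, deletion_rule, suffix) triple, with None marking an absent suffix or deletion rule; the function name and data representation are hypothetical.

```python
import math
from collections import Counter

def universal_features(analyses):
    """Map a whole-corpus analysis into the 8-dimensional universal feature space.

    `analyses` is assumed to map each word type to a (stem, deletion_rule,
    suffix) triple, where None marks an absent suffix or deletion rule.
    Only the eight feature definitions come from the talk.
    """
    stems = Counter(stem for stem, _, _ in analyses.values())
    dels = Counter(d for _, d, _ in analyses.values() if d is not None)
    suffixes = Counter(s for _, _, s in analyses.values() if s is not None)

    def entropy(counts):
        # Shannon entropy (bits) of the empirical distribution.
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total)
                    for c in counts.values()) if total else 0.0

    n = len(analyses)
    return [
        len(stems), len(suffixes), len(dels),                          # lexicon sizes
        entropy(stems), entropy(suffixes), entropy(dels),              # distribution entropies
        sum(1 for _, _, s in analyses.values() if s is None) / n,      # fraction of suffix-free words
        sum(1 for _, d, _ in analyses.values() if d is not None) / n,  # fraction of words with deletion
    ]
```

For instance, the utiskom example from the previous slide would be stored as analyses["utiskom"] = ("utisak", "..ak# → ..k#", "om").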
Structured Nearest Neighbor

Main idea: predict the analysis for the test language which brings us closest in feature space to a training language.

1) Initialize the analysis of the test language: y^0.
2) For each training language ℓ, iteratively and greedily update the test-language analysis to bring it closer in feature space: y^(1,ℓ), y^(2,ℓ), y^(3,ℓ), ...
3) After T iterations, choose the training language ℓ* that is closest in feature space:

    ℓ* = argmin_ℓ || f(x, y^(T,ℓ)) − f(x_ℓ, y_ℓ) ||

4) Predict the associated analysis: y* = y^(T,ℓ*).

Illustrated with three training languages (x1, y1), (x2, y2), (x3, y3):
- Initialize the test-language labels: y^0.
- Iterative search: y^0 → y^(1,ℓ), ..., y^(t,ℓ), y^(t+1,ℓ), ..., y^(T,ℓ) for ℓ = 1, 2, 3.
- Predict: here ℓ* = 3, so y* = y^(T,3).

Morphology Search Algorithm

Based on Goldsmith (2005): he minimizes description length; we minimize distance to a training language. The search selects candidate reanalyses in four stages:
- Stage 0: initialization
- Stage 1: reanalyze each word
- Stage 2: find new stems
- Stage 3: find new suffixes

Iterative Search Algorithm

Each word is analyzed as a triple (t_i, f_i, d_i) drawn from a stem set T, a suffix set F, and a deletion rule set D.
- Stage 0: initialize the sets T, F, and D using "character successor frequency".
- Stage 1: greedily reanalyze each word, keeping T and F fixed.
- Stage 2 (find new stems): greedily analyze unsegmented words, keeping F fixed.
- Stage 3 (find new suffixes): greedily analyze unsegmented words, keeping T fixed.
(a code sketch of the full search loop follows this slide)
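A minimal sketch of the outer loop from these slides. It reuses `universal_features` from the earlier sketch; `initialize` (stage 0) and `greedy_step` (one pass of stages 1 through 3) are hypothetical callables standing in for the search procedure above, and plain Euclidean distance stands in for the norm, which the slides leave unspecified.

```python
import math

def structured_nearest_neighbor(x, training, initialize, greedy_step, T=50):
    """Predict an analysis for the unlabeled test corpus `x` by greedily
    moving toward each labeled training corpus in the universal feature
    space, then keeping the analysis that ends up closest to one of them.

    training    -- list of labeled (x_l, y_l) corpora
    initialize  -- stage 0: successor-frequency initialization, y0 = initialize(x)
    greedy_step -- stages 1-3: one greedy update of y toward a feature target
    T           -- number of greedy iterations per language (value assumed)
    """
    def dist(a, b):
        # Euclidean distance between two points in the universal feature space.
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    y0 = initialize(x)
    best_y, best_d = None, float("inf")
    for x_l, y_l in training:
        target = universal_features(y_l)   # f(x_l, y_l), fixed for this language
        y = y0
        for _ in range(T):                 # y^(1,l), ..., y^(T,l)
            y = greedy_step(x, y, target)
        d = dist(universal_features(y), target)
        if d < best_d:                     # l* = argmin_l ||f(x, y^(T,l)) - f(x_l, y_l)||
            best_y, best_d = y, d
    return best_y                          # y* = y^(T, l*)
```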
Experimental Setup

- Corpus: Orwell's Nineteen Eighty-Four (MULTEXT-East V3)
  - Languages: Bulgarian, Czech, English, Estonian, Hungarian, Romanian, Slovene, Serbian
  - 94,725 tokens (English)
  - Slight confound: the data is parallel; the method does not assume or exploit this fact.
  - All words are tagged with a morpho-syntactic analysis.
- Baseline: Linguistica model (Goldsmith 2005), the same search procedure but greedily minimizing description length.
- Upper bound: a supervised model in the structured perceptron framework (Collins 2002).

Aggregate Results

Accuracy is the fraction of word types with the correct analysis, averaged over the 8 languages:
- Linguistica baseline: 64.6
- Our model (train with 7, test on 1): 76.4, an average absolute increase of 11.8, closing 42% of the gap to the supervised upper bound
- Oracle (each language guided using its own gold-standard feature values): 81.1
- Supervised upper bound: 92.8

The oracle's accuracy is still below the supervised model because of (1) search errors and (2) the coarseness of the feature space.

Results by Language

              BG  CS  EN  ET  HU  RO  SL  SR
Linguistica   69  60  81  51  65  66  61  64
Our model     84  83  76  67  69  71  83  79

- Linguistica: best accuracy on English, lowest on Estonian.
- Our model (train with 7, test on 1): biggest improvements for Serbian (15 points) and Slovene (22 points); improvement over the baseline for all languages other than English.

Visualization of Feature Space

[Figure: the feature space reduced to 2D using MDS; three panels: Linguistica, Gold Standard, Our Method]
- Serbian and Slovene: closely related Slavic languages; nearest neighbors under our model's analysis; essentially they "swap places".
- Estonian and Hungarian: highly inflected Uralic languages; they also "swap places".
- English: failed to find a good neighbor; pulled towards Bulgarian, the second least inflected language in the dataset.

Accuracy as Training Languages Are Added

[Figure: accuracy averaged over all language combinations of various sizes]
- Accuracy climbs as training languages are added.
- Worse than the baseline when only one training language is available; better with two or more.

Why does accuracy improve with more languages?

[Figure: resulting distance vs. accuracy for all 56 train-test pairs]
- More training languages ⇒ we find a closer neighbor.
- A closer neighbor ⇒ higher accuracy.

Summary

- Main idea: recast unsupervised learning as cross-lingual structured prediction.
- Test case: morphological analysis of 8 languages.
- Formulated a universal feature space for morphology.
- Developed a novel structured nearest neighbor approach.
- Our method yields substantial accuracy gains.

Future Work

- Shortcoming: uniform weighting of the dimensions of the universal feature space, although some features may be more important than others.
- Future work: learn a distance metric on the universal feature space (a sketch follows the closing slide).

Thank You
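For the metric learning proposed under Future Work, one simple form would replace the uniform distance with a per-dimension weighted one. A minimal sketch, where the weight vector is hypothetical and would be what the learning procedure estimates:

```python
def weighted_distance(a, b, weights):
    """Per-dimension weighted distance over the universal feature space.
    With all weights equal to 1 this reduces to the uniform Euclidean
    distance used in the earlier sketch; learning `weights` (e.g., by
    leave-one-language-out validation on the training languages, an
    assumed setup) is the proposed future work."""
    return sum(w * (u - v) ** 2 for w, u, v in zip(weights, a, b)) ** 0.5
```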