Talk Transcript

Universal Morphological Analysis using
Structured Nearest Neighbor Prediction
Young-Bum Kim, João V. Graça, and Benjamin Snyder
University of Wisconsin-Madison
28 July, 2011
Unsupervised NLP
• Unsupervised learning in NLP has become popular
  - 27 papers in this year's ACL+EMNLP
• Relies on inductive bias, encoded in model structure or learning algorithm
  - Example: HMM for POS induction encodes transitional regularity
[Figure: an HMM over the words "I like to read", with hidden POS tags drawn as "?"]
Inductive Biases
• Formulated with weak empirical grounding (or left implicit)
• Single, simple bias for all languages
⇒ low performance, complicated models, fragility, language dependence
Our approach: learn a complex, universal bias using labeled languages
  - i.e., empirically learn what the space of plausible human languages looks like, to guide unsupervised learning
Key Idea
[Diagram: training languages (x1, y1), (x2, y2), (x3, y3); test language (x, ?)]
1) Collect labeled corpora (non-parallel) for several training languages
   - x = corpus, y = labels
Key Idea
[Diagram: training languages f(x1, y1), f(x2, y2), f(x3, y3); test language f(x, ?)]
2) Map each (x, y) pair into a "universal feature space" f
   - i.e., to allow cross-lingual generalization
Key Idea
[Diagram: training languages f(x1, y1), f(x2, y2), f(x3, y3); scoring function score(·); test language f(x, ?)]
3) Train a scoring function over the universal feature space
   - i.e., treat each annotated language as a single data point in a structured prediction problem
Key Idea
Training languages
Test language
f ( x1 , y1 )
score (·)
f ( x2 , y2 )
f (x, y * )
f ( x3 , y3 )
4) Predict test labels which yield highest score
y*  argmax score ( f (x, y) )
y
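A literal reading of step 4 as code, purely as a conceptual sketch and not the talk's implementation; `candidates`, `f`, and `score` are hypothetical placeholders, and the real label space is far too large to enumerate, which is why the talk turns to greedy search later:

```python
def predict(x, candidates, f, score):
    """Step 4 as literal code: return the labeling with the highest score.
    `candidates`, `f`, and `score` are illustrative stand-ins; the label
    space is exponential, so the talk searches greedily instead."""
    return max(candidates, key=lambda y: score(f(x, y)))
```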
Test Case: Nominal Morphology
• Languages differ in morphological complexity
  - only 4 noun tags in the English Penn Treebank
  - 154 noun tags in the Hungarian corpus (suffixes encode case, number, and gender)
• Our analysis breaks each noun into a stem, a phonological deletion rule, and a suffix
  - utiskom → [ stem = utisak, del = (..ak# → ..k#), suffix = om ]
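As an illustration of this decomposition (the `Analysis` type and the `old>new` rule encoding are my own, not the paper's data format), a minimal sketch that rebuilds the surface form:

```python
from typing import NamedTuple

class Analysis(NamedTuple):
    stem: str       # e.g. "utisak"
    deletion: str   # e.g. "ak>k", meaning stem-final "ak" becomes "k"
    suffix: str     # e.g. "om"

def surface(a: Analysis) -> str:
    """Rebuild the surface form: apply the deletion rule, then attach the suffix."""
    stem = a.stem
    if a.deletion:
        old, new = a.deletion.split(">")
        assert stem.endswith(old)
        stem = stem[: -len(old)] + new
    return stem + a.suffix

# The slide's example: utisak + (..ak# -> ..k#) + om
print(surface(Analysis("utisak", "ak>k", "om")))  # -> "utiskom"
```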
Question: Can we use morphologically annotated languages to train a universal morphological analyzer?
Our Method
• Universal feature space (8 features; a code sketch follows below)
  - sizes of the stem, suffix, and deletion-rule lexicons
  - entropies of the stem, suffix, and deletion-rule distributions
  - percentage of suffix-free words, and of words with phonological deletions
• Learning algorithm
  - broad characteristics of morphology are often similar across select language pairs
  - this motivates a nearest-neighbor approach
  - in the structured scenario, learning becomes a search problem over the label space
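A minimal sketch of how the eight features above might be computed, assuming one (stem, deletion rule, suffix) triple per word type; the exact counting and normalization choices are assumptions, not the paper's specification:

```python
import math
from collections import Counter

def entropy(counts: Counter) -> float:
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def universal_features(analyses):
    """analyses: one (stem, deletion_rule, suffix) triple per word type.
    Empty strings mark 'no deletion' / 'no suffix'."""
    stems = Counter(a[0] for a in analyses)
    dels = Counter(a[1] for a in analyses if a[1])
    sufs = Counter(a[2] for a in analyses if a[2])
    n = len(analyses)
    return [
        len(stems), len(sufs), len(dels),              # lexicon sizes
        entropy(stems), entropy(sufs), entropy(dels),  # distribution entropies
        sum(1 for a in analyses if not a[2]) / n,      # % suffix-free words
        sum(1 for a in analyses if a[1]) / n,          # % words with deletions
    ]
```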
Structured Nearest Neighbor
• Main idea: predict the analysis for the test language which brings us closest in feature space to a training language
1) Initialize the analysis of the test language: y^0
2) For each training language ℓ: iteratively and greedily update the test-language analysis to bring it closer in feature space to (x_ℓ, y_ℓ), giving y^(1,ℓ), y^(2,ℓ), y^(3,ℓ), ...
3) After T iterations, choose the training language ℓ* closest in feature space:
   ℓ* = argmin_ℓ ‖ f(x, y^(T,ℓ)) − f(x_ℓ, y_ℓ) ‖   (x: test corpus; x_ℓ, y_ℓ: training)
4) Predict the associated analysis: y* = y^(T,ℓ*)
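Putting the four steps together, a minimal sketch of the search loop; `search_step` (one greedy update, detailed in the stages on the following slides) is an assumed helper, the iteration budget T is an illustrative default, and Euclidean distance matches the argmin above:

```python
import math

def structured_nn(x, y0, train, f, search_step, T=50):
    """Return y* = y^(T, l*): run T greedy steps toward each training
    language, then keep the analysis that ends up closest in feature space.
    `search_step(x, y, target)` is an assumed helper returning an updated y."""
    best_d, y_star = math.inf, None
    for x_l, y_l in train:              # one search per training language
        target, y = f(x_l, y_l), y0
        for _ in range(T):              # greedily pull f(x, y) toward target
            y = search_step(x, y, target)
        d = math.dist(f(x, y), target)  # Euclidean distance in feature space
        if d < best_d:                  # keep the nearest neighbor's analysis
            best_d, y_star = d, y
    return y_star
```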
Structured Nearest Neighbor
Training languages: (x1, y1), (x2, y2), (x3, y3)
Initialize the test-language labels: y^0
Structured Nearest Neighbor
Iterative Search:
y^0 → y^(1,1) ... y^(t,1)
y^0 → y^(1,2) ... y^(t,2)
y^0 → y^(1,3) ... y^(t,3)
Structured Nearest Neighbor
Iterative Search:
y^0 → y^(1,1) ... y^(t,1), y^(t+1,1)
y^0 → y^(1,2) ... y^(t,2), y^(t+1,2)
y^0 → y^(1,3) ... y^(t,3), y^(t+1,3)
Structured Nearest Neighbor
Iterative Search:
y^0 → y^(1,1) ... y^(t,1), y^(t+1,1) ... y^(T,1)
y^0 → y^(1,2) ... y^(t,2), y^(t+1,2) ... y^(T,2)
y^0 → y^(1,3) ... y^(t,3), y^(t+1,3) ... y^(T,3)
Structured Nearest Neighbor
Predict: ℓ* = 3, so y* = y^(T,3)
Morphology Search Algorithm
Stage 0: Initialization
Stage 1: Reanalyze Each Word
Stage 2: Find New Stems
Stage 3: Find New Suffixes
Based on Goldsmith (2005):
- he minimizes description length
- we minimize distance to the training language
[Diagram: stages 1-3 loop, each selecting candidate analyses against the training language]
Iterative Search Algorithm
[Diagram: stem set T = {t1, ..., tn}, suffix set F = {f1, ..., fn}, deletion-rule set D = {d1, ..., dn}]
• Stage 0: using "character successor frequency," initialize the sets T, F, and D
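A sketch of the successor-frequency heuristic under its usual Harris-style reading: count the distinct characters that can follow each prefix in the vocabulary, and treat a spike as a likely stem boundary. The slides do not give thresholds or the exact seeding of T, F, and D, so those details are left out:

```python
from collections import defaultdict

def successor_frequency(words):
    """Map each prefix to the number of distinct characters that follow it
    in the vocabulary ('#' marks end of word)."""
    succ = defaultdict(set)
    for w in words:
        for i in range(1, len(w)):
            succ[w[:i]].add(w[i])
        succ[w].add("#")
    return {prefix: len(chars) for prefix, chars in succ.items()}

sf = successor_frequency(["read", "reads", "reading", "reader", "real"])
print(sf["rea"], sf["read"])  # 2 vs. 4: the jump after "read" hints at a stem boundary
```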
Iterative Search Algorithm
[Diagram: stem set T, suffix set F, deletion-rule set D]
• Stage 1: Reanalyze Each Word
  - greedily reanalyze each word, keeping T and F fixed
Iterative Search Algorithm
[Diagram: stem set T, suffix set F, deletion-rule set D]
• Stage 2: Find New Stems
  - greedily analyze unsegmented words, keeping F fixed
Iterative Search Algorithm
[Diagram: stem set T, suffix set F, deletion-rule set D]
• Stage 3: Find New Suffixes
  - greedily analyze unsegmented words, keeping T fixed
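To make one greedy update concrete, a rough Stage-1-style step: re-split a single word over the fixed sets T and F, and keep the candidate that moves the corpus feature vector closest to the training language. The helper names (`feats` maps the word-to-analysis table to a feature vector, e.g. via universal_features) and the omission of deletion rules are simplifications of mine, not the talk's procedure:

```python
import math

def reanalyze_word(word, T, F, analyses, feats, target):
    """Try every split of `word` into stem in T + suffix in F and keep the
    split whose resulting corpus features are nearest the target vector.
    `analyses` maps each word to its current (stem, deletion, suffix)."""
    best = analyses[word]
    best_d = math.dist(feats(analyses), target)
    for i in range(1, len(word) + 1):
        stem, suffix = word[:i], word[i:]
        if stem in T and (suffix in F or suffix == ""):
            candidate = {**analyses, word: (stem, "", suffix)}
            d = math.dist(feats(candidate), target)
            if d < best_d:
                best, best_d = (stem, "", suffix), d
    return best
```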
Experimental Setup
• Corpus: Orwell's Nineteen Eighty-Four (MULTEXT-East V3)
  - languages: Bulgarian, Czech, English, Estonian, Hungarian, Romanian, Slovene, Serbian
  - 94,725 tokens (English)
  - slight confound: the data is parallel; the method does not assume or exploit this fact
  - all words tagged with a morpho-syntactic analysis
• Baseline: Linguistica model (Goldsmith 2005)
  - same search procedure, but greedily minimizes description length
• Upper bound: supervised model
  - structured perceptron framework (Collins 2002)
Aggregate Results
• Accuracy: fraction of word types with the correct analysis
• Linguistica baseline: 64.6 (avg. over 8 languages)
Aggregate Results
• Accuracy: fraction of word types with the correct analysis
• Linguistica baseline: 64.6; supervised upper bound: 92.8 (avg. over 8 languages)
Aggregate Results
• Accuracy: fraction of word types with the correct analysis
• Our model (train with 7, test on 1): 76.4, vs. Linguistica 64.6 and supervised 92.8 (avg. over 8 languages)
  - average absolute increase of 11.8 points
  - reduces error by 42% (relative to the supervised upper bound)
Aggregate Results
• Accuracy: fraction of word types with the correct analysis
• Linguistica 64.6 | our model 76.4 | oracle 81.1 | supervised 92.8 (avg. over 8 languages)
• Our model (train with 7, test on 1): average absolute increase of 11.8; reduces error by 42%
• Oracle: each language guided using its own gold-standard feature values
• Oracle accuracy still below supervised, due to:
  (1) search errors
  (2) coarseness of the feature space
Results By Language
Linguistica accuracy by language:
             BG   CS   EN   ET   HU   RO   SL   SR
Linguistica  69   60   81   51   65   66   61   64
• Best accuracy: English
• Lowest accuracy: Estonian
Results By Language
Our Model (train with 7, test on 1) vs. Linguistica:
             BG   CS   EN   ET   HU   RO   SL   SR
Our Model    84   83   76   67   69   71   83   79
Linguistica  69   60   81   51   65   66   61   64
• Biggest improvements for Serbian (15 points) and Slovene (22 points)
• Improvement over the baseline for all languages other than English
Visualization of Feature Space
[Figure: three 2D maps of the languages: Linguistica, Gold Standard, Our Method]
• Feature space reduced to 2D using MDS (a reproduction sketch follows below)
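The 2D reduction could be reproduced along these lines with scikit-learn's MDS (the library choice is mine; the talk does not name an implementation):

```python
import numpy as np
from sklearn.manifold import MDS

# One 8-dimensional universal feature vector per language; random
# placeholders stand in for the real vectors from universal_features().
rng = np.random.default_rng(0)
X = rng.random((8, 8))                      # 8 languages x 8 features
coords = MDS(n_components=2, random_state=0).fit_transform(X)
print(coords.shape)                         # -> (8, 2)
```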
Visualization of Feature Space
[Figure: three 2D maps of the languages: Linguistica, Gold Standard, Our Method]
• Serbian and Slovene:
  - closely related Slavic languages
  - nearest neighbors under our model's analysis
  - essentially they "swap places"
Visualization of Feature Space
[Figure: three 2D maps of the languages: Linguistica, Gold Standard, Our Method]
• Estonian and Hungarian:
  - highly inflected Uralic languages
  - they "swap places"
Visualization of Feature Space
[Figure: three 2D maps of the languages: Linguistica, Gold Standard, Our Method]
• English:
  - failed to find a good neighbor
  - pulled towards Bulgarian (the second least inflected language in the dataset)
Accuracy as Training Languages Added
[Figure: accuracy as a function of the number of training languages]
• Averaged over all language combinations of various sizes
  - accuracy climbs as training languages are added
  - worse than the baseline when only one training language is available
  - better than the baseline when two or more training languages are available
Why does accuracy improve with more languages?
[Figure: distance to the chosen neighbor vs. accuracy, for all 56 train-test pairs]
• Resulting distance vs. accuracy for all 56 train-test pairs
  - more training languages ⇒ a closer neighbor is found
  - a closer neighbor ⇒ higher accuracy
Summary
• Main idea: recast unsupervised learning as cross-lingual structured prediction
• Test case: morphological analysis of 8 languages
• Formulated a universal feature space for morphology
• Developed a novel structured nearest-neighbor approach
• Our method yields substantial accuracy gains
Future Work
• Shortcoming:
  - uniform weighting of dimensions in the universal feature space
  - some features may be more important than others
• Future work: learn a distance metric on the universal feature space
Thank You