Learning Morphological Disambiguation Rules for Turkish

Deniz Yuret
Ferhan Türe
Koç University, İstanbul
Overview
 Turkish morphology
 The morphological disambiguation task
 The Greedy Prepend Algorithm
 Training
 Evaluation
Turkish Morphology
 Turkish is an agglutinative language: many syntactic phenomena expressed by function words and word order in English are expressed by morphology in Turkish.
I will be able to go.
(go) + (able to) + (will) + (I)
git + ebil + ecek + im
Gidebileceğim.
Fun with Turkish Morphology
Avrupalılaştıramadıklarımızdanmışsınız
 Avrupa – Europe
 lı – European
 laş – become
 tır – make
 ama – not able to
 dık – we were
 larımız – those that
 dan – from
 mış – were
 sınız – you
So how long can words be?
 uyu – sleep
 uyut – make X sleep
 uyuttur – have Y make X sleep
 uyutturt – have Z have Y make X sleep
 uyutturttur – have W have Z have Y make X sleep
 uyutturtturt – have Q have W have Z …
…
Morphological Analyzer for Turkish
masalı
 masal+Noun+A3sg+Pnon+Acc (= the story)
 masal+Noun+A3sg+P3sg+Nom (= his story)
 masa+Noun+A3sg+Pnon+Nom^DB+Adj+With (= with tables)
 Oflazer, K. (1994). Two-level description of Turkish morphology. Literary and Linguistic Computing.
 Oflazer, K., Hakkani-Tür, D. Z., and Tür, G. (1999). Design for a Turkish treebank. EACL’99.
 Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI Publications.
Features, IGs and Tags
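masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
 stem: masa
 features: +Noun+A3sg+Pnon+Nom and +Adj+With; each group of features is an inflectional group (IG)
 derivational boundary: ^DB, separating the two IGs
 tag: +Noun+A3sg+Pnon+Nom^DB+Adj+With (everything after the stem)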
 126 unique features
 9129 unique IGs
 ∞ unique tags
 11084 distinct tags observed in 1M word training corpus
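As a rough illustration of the terminology above, the following Python sketch splits an analyzer parse into its stem and inflectional groups. The name split_parse is hypothetical, and the sketch assumes that the stem contains no "+" and that "^DB" is the only boundary marker.

```python
def split_parse(parse):
    """Split a parse string into (stem, list of IGs), each IG a list of features."""
    # Everything before the first "+" is taken to be the stem (assumption).
    stem, _, tag = parse.partition("+")
    # "^DB" marks a derivational boundary between inflectional groups.
    igs = [ig.strip("+").split("+") for ig in tag.split("^DB")]
    return stem, igs


stem, igs = split_parse("masa+Noun+A3sg+Pnon+Nom^DB+Adj+With")
print(stem)  # masa
print(igs)   # [['Noun', 'A3sg', 'Pnon', 'Nom'], ['Adj', 'With']]
```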
Why not just do POS tagging?
[Figure from Oflazer (1999)]
 Inflectional groups can independently act as heads or
modifiers in syntactic dependencies.
 Full morphological analysis is essential for further
syntactic analysis.
Morphological disambiguation
 Ambiguity rare in English:
lives = live+s or life+s
 More serious in Turkish:
42.1% of the tokens ambiguous
1.8 parses per token on average
3.8 parses for ambiguous tokens
 Task: pick correct parse given context
1. masal+Noun+A3sg+Pnon+Acc – Uzun masalı anlat (Tell the long story)
2. masal+Noun+A3sg+P3sg+Nom – Uzun masalı bitti (His long story ended)
3. masa+Noun+A3sg+Pnon+Nom^DB+Adj+With – Uzun masalı oda (Room with long table)
Key Idea
Build a separate classifier for each feature.
Decision Lists
1. If (W = çok) and (R1 = +DA) Then W has +Det
2. If (L1 = pek) Then W has +Det
3. If (W = +AzI) Then W does not have +Det
4. If (W = çok) Then W does not have +Det
5. If TRUE Then W has +Det
 “pek çok alanda” (R1)
 “pek çok insan” (R2)
 “insan çok daha” (R4)
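The way the first matching rule decides each of the three phrases can be sketched in Python as follows. This is only an illustration: the rule encoding, the helper names, and the expansion of the capital letters in suffix patterns (A as a/e, D as d/t, I as ı/i/u/ü) are assumptions, not something the slides specify.

```python
import re

# Assumed meaning of capital letters in suffix patterns such as "+DA".
ARCHIPHONEMES = {"A": "[ae]", "D": "[dt]", "I": "[ıiuü]"}


def suffix_matches(word, pattern):
    """True if `word` ends with the suffix pattern, e.g. "+DA" matches "alanda"."""
    body = pattern.lstrip("+")
    regex = "".join(ARCHIPHONEMES.get(ch, re.escape(ch)) for ch in body)
    return re.search(regex + "$", word) is not None


def holds(condition, window):
    """Check one (position, value) condition against a window of words."""
    position, value = condition
    word = window.get(position, "")
    if value.startswith("+"):
        return suffix_matches(word, value)
    return word == value


# (conditions, "W has +Det") in list order; an empty condition list means TRUE.
DET_RULES = [
    ([("W", "çok"), ("R1", "+DA")], True),
    ([("L1", "pek")], True),
    ([("W", "+AzI")], False),
    ([("W", "çok")], False),
    ([], True),
]


def classify(rules, window):
    """Return (rule number, decision) for the first rule whose conditions hold."""
    for number, (conditions, label) in enumerate(rules, start=1):
        if all(holds(c, window) for c in conditions):
            return number, label


print(classify(DET_RULES, {"L1": "pek", "W": "çok", "R1": "alanda"}))  # (1, True)
print(classify(DET_RULES, {"L1": "pek", "W": "çok", "R1": "insan"}))   # (2, True)
print(classify(DET_RULES, {"L1": "insan", "W": "çok", "R1": "daha"}))  # (4, False)
```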
Greedy Prepend Algorithm
GPA(data)
  dlist = NIL
  default-class = Most-Common-Class(data)
  rule = [If TRUE Then default-class]
  while Gain(rule, dlist, data) > 0
    do dlist = prepend(rule, dlist)
       rule = Max-Gain-Rule(dlist, data)
  return dlist
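A minimal runnable sketch of the loop above, in Python. It assumes that gain means the net increase in correctly classified training instances when a rule is prepended, and that candidate rules test a single attribute; the slides do not spell out either choice, and the data format (a set of attribute strings plus a class per instance) is hypothetical.

```python
from collections import Counter


def predict(dlist, attrs):
    """Class of the first rule whose conditions all hold; None for an empty list."""
    for conditions, label in dlist:
        if conditions <= attrs:
            return label
    return None


def gain(rule, dlist, data):
    """Net change in correct training predictions if `rule` is prepended."""
    conditions, label = rule
    delta = 0
    for attrs, gold in data:
        if conditions <= attrs:  # only instances the new rule would capture
            delta += (label == gold) - (predict(dlist, attrs) == gold)
    return delta


def max_gain_rule(dlist, data):
    """Best rule from an assumed candidate space: one attribute, any class."""
    classes = sorted({gold for _, gold in data})
    attributes = sorted(set().union(*(attrs for attrs, _ in data)))
    candidates = [(frozenset([a]), c) for a in attributes for c in classes]
    return max(candidates, key=lambda rule: gain(rule, dlist, data))


def gpa(data):
    dlist = []
    default_class = Counter(gold for _, gold in data).most_common(1)[0][0]
    rule = (frozenset(), default_class)  # [If TRUE Then default-class]
    while gain(rule, dlist, data) > 0:
        dlist.insert(0, rule)  # prepend
        rule = max_gain_rule(dlist, data)
    return dlist


# Toy data: does "çok" carry the feature in each context?
toy = [
    ({"W=çok", "L1=pek", "R1=+DA"}, True),
    ({"W=çok", "L1=pek"}, True),
    ({"W=çok", "L1=insan"}, False),
    ({"W=çok", "L1=bir"}, False),
    ({"W=çok", "L1=ve"}, False),
]
learned = gpa(toy)
```

On this toy data the default rule predicts the majority class, and a single rule on L1=pek is prepended before no candidate has positive gain. Each prepended rule strictly increases training accuracy, so the loop terminates.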
Training Data
 1M words of news material
 Semi-automatically disambiguated
 Created 126 separate training sets, one for each feature
 Each training set only contains instances which have the corresponding feature in at least one of their parses
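The construction of the per-feature training sets might look roughly like the Python sketch below. The token format (context, candidate parses, correct parse) and the function names are hypothetical, and the sketch assumes the stem and the ^DB marker are not counted as features.

```python
import re
from collections import defaultdict


def parse_features(parse):
    """Features of one parse; the stem and the ^DB marker are dropped (assumption)."""
    _stem, *feats = re.split(r"\^DB\+|\+", parse)
    return set(feats)


def build_training_sets(tokens):
    """Map each feature to a list of (context, correct parse has the feature)."""
    training_sets = defaultdict(list)
    for context, candidates, correct in tokens:
        gold = parse_features(correct)
        # A token is an instance for every feature found in any of its parses.
        seen = set().union(*(parse_features(p) for p in candidates))
        for feature in seen:
            training_sets[feature].append((context, feature in gold))
    return training_sets


context = "Uzun masalı anlat"
parses = [
    "masal+Noun+A3sg+Pnon+Acc",
    "masal+Noun+A3sg+P3sg+Nom",
    "masa+Noun+A3sg+Pnon+Nom^DB+Adj+With",
]
sets = build_training_sets([(context, parses, parses[0])])
print(sets["Acc"])   # [('Uzun masalı anlat', True)]
print(sets["P3sg"])  # [('Uzun masalı anlat', False)]
```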
Input attributes
For a five word window:
 The exact word string (e.g. W=Ali'nin)
 The lowercase version (e.g. W=ali'nin)
 All suffixes (e.g. W=+n, W=+In, W=+nIn, W=+'nIn, etc.)
 Character types (e.g. Ali'nin would be described with W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST)
Average 40 features per instance.
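The attribute types listed above can be sketched in Python as follows. Several details are assumptions made for illustration: the "=" and "~" markers for the exact and lowercase strings (borrowed from the notation in the sample decision lists), the vowel-only archiphoneme mapping in suffixes, and the position names L2 and R2 for the outer words of the window.

```python
# Assumed mapping for suffix attributes: vowels only, consonant alternations ignored.
VOWEL_CLASSES = str.maketrans(
    {"a": "A", "e": "A", "ı": "I", "i": "I", "u": "I", "ü": "I"})


def char_type(ch):
    if ch.isupper():
        return "UPPER"
    if ch.islower():
        return "LOWER"
    if ch == "'":
        return "APOS"
    return "OTHER"


def word_attributes(word):
    """Attributes of a single non-empty word."""
    lower = word.lower()  # note: str.lower() is not Turkish-aware (I/ı, İ/i)
    attrs = {"=" + word, "~" + lower}
    for start in range(1, len(lower)):  # all proper suffixes
        attrs.add("+" + lower[start:].translate(VOWEL_CLASSES))
    attrs.add(char_type(word[0]) + "-FIRST")
    attrs.add(char_type(word[-1]) + "-LAST")
    for ch in word[1:-1]:
        attrs.add(char_type(ch) + "-MID")
    return attrs


def window_attributes(words, i):
    """Attributes of the five-word window centred on words[i]."""
    attrs = set()
    for offset, name in [(-2, "L2"), (-1, "L1"), (0, "W"), (1, "R1"), (2, "R2")]:
        j = i + offset
        if 0 <= j < len(words):
            attrs.update(f"{name}={a}" for a in word_attributes(words[j]))
    return attrs


ali = window_attributes(["Ali'nin"], 0)
phrase = window_attributes(["pek", "çok", "alanda"], 1)
```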
Sample decision lists
+Acc (672 rules):
0
1 W=+InI
1 W=+yI
1 W=UPPER0
1 W=+IzI
1 L1=~bu
1 W=~onu
1 R1=+mAK
1 W=~beni
0 W=~günü
1 W=+InlArI
1 W=~onları
0 W=+olAyI
0 W=~sorunu
…

+Prop (3476 rules):
1
0 W=STFIRST
0 W==Türk
1 W=STFIRST R1=UCFIRST
0 L1==.
0 W=+AnAl
1 R1==,
0 W=+yAD
1 W=UPPER0
0 W=+lAD
0 W=+AK
1 R1=UPPER
0 W==Milli
1 W=STFIRST R1=UPPER0
…
Models for individual features
[Figure: number of rules and accuracy of the decision list learned for each feature]
Combining models
 masal+Noun+A3sg+P3sg+Nom
 masal+Noun+A3sg+Pnon+Acc
 Decision list results and confidence (only distinguishing features necessary):
   P3sg = yes (89.53%)
   Nom = no (93.92%)
   Pnon = no (95.03%)
   Acc = yes (89.24%)
 score(P3sg+Nom) = 0.8953 x (1 – 0.9392)
 score(Pnon+Acc) = (1 – 0.9503) x 0.8924
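The two scores above can be reproduced with a short Python sketch. It assumes, following the two formulas, that a parse is scored by multiplying, over its distinguishing features, the confidence that the feature is present (the confidence itself when the model says yes, one minus it when the model says no). The function names are hypothetical.

```python
def parse_features(parse):
    # Assumes no ^DB boundary, as in the two candidates above.
    return set(parse.split("+")[1:])


def score(features, decisions):
    """Product of the estimated probabilities that each feature is present."""
    total = 1.0
    for feature in features:
        says_yes, confidence = decisions[feature]
        total *= confidence if says_yes else 1.0 - confidence
    return total


def pick_parse(candidates, decisions):
    feature_sets = {p: parse_features(p) for p in candidates}
    shared = set.intersection(*feature_sets.values())  # non-distinguishing features
    return max(candidates, key=lambda p: score(feature_sets[p] - shared, decisions))


decisions = {
    "P3sg": (True, 0.8953),
    "Nom": (False, 0.9392),
    "Pnon": (False, 0.9503),
    "Acc": (True, 0.8924),
}
candidates = ["masal+Noun+A3sg+P3sg+Nom", "masal+Noun+A3sg+Pnon+Acc"]
best = pick_parse(candidates, decisions)
print(best)  # masal+Noun+A3sg+P3sg+Nom
```

With these numbers score(P3sg+Nom) is about 0.0544 and score(Pnon+Acc) is about 0.0444, so the P3sg+Nom parse is chosen.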
Evaluation
 Test corpus: 1000 words, hand-tagged
 Accuracy: 95.87% (conf. int: 94.57-97.08)
 Better than the training data!?
Other Experiments
 Retraining on own output: 96.03%
 Training on unambiguous data: 82.57%
 Forget disambiguation, let’s do tagging with a single decision list: 91.23%, 10000 rules
Contributions
 Learning morphological disambiguation rules using the GPA decision list learner.
 Reducing data sparseness and increasing noise tolerance using separate models for individual output features.
 ECOC, WSD, etc.