Learning Morphological Disambiguation Rules for Turkish
Learning Morphological Disambiguation Rules for Turkish
Deniz Yuret
Ferhan Türe
Koç University, İstanbul
Overview
Turkish morphology
The morphological disambiguation task
The Greedy Prepend Algorithm
Training
Evaluation
Turkish Morphology
Turkish is an agglutinative language: many syntactic phenomena expressed by function words and word order in English are expressed by morphology in Turkish.
I will be able to go.
(go) + (able to) + (will) + (I)
git + ebil + ecek + im
Gidebileceğim.
Fun with Turkish Morphology
Avrupalılaştıramadıklarımızdanmışsınız
Avrupa    Europe
lı        European
laş       become
tır       make
ama       not able to
dık       those that
larımız   we
dan       from
mış       were
sınız     you
(You were apparently among those we could not make European.)
So how long can words be?
uyu – sleep
uyut – make X sleep
uyuttur – have Y make X sleep
uyutturt – have Z have Y make X sleep
uyutturttur – have W have Z have Y make X sleep
uyutturtturt – have Q have W have Z …
…
Morphological Analyzer for Turkish
masalı
masal+Noun+A3sg+Pnon+Acc (= the story)
masal+Noun+A3sg+P3sg+Nom (= his story)
masa+Noun+A3sg+Pnon+Nom^DB+Adj+With (= with tables)
Oflazer, K. (1994). Two-level description of Turkish morphology. Literary and Linguistic Computing.
Oflazer, K., Hakkani-Tür, D. Z., and Tür, G. (1999). Design for a Turkish treebank. EACL'99.
Kenneth R. Beesley and Lauri Karttunen, Finite State
Morphology, CSLI Publications, 2003
Features, IGs and Tags
masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
stem: masa
inflectional groups (IGs): Noun+A3sg+Pnon+Nom and Adj+With, each a group of features, separated by a derivational boundary (^DB)
tag: the complete analysis
126 unique features
9129 unique IGs
∞ unique tags (11084 distinct tags observed in the 1M word training corpus)
Why not just do POS tagging?
from Oflazer (1999)
Why not just do POS tagging?
Inflectional groups can independently act as heads or
modifiers in syntactic dependencies.
Full morphological analysis is essential for further
syntactic analysis.
Morphological disambiguation
Ambiguity is rare in English:
lives = live+s or life+s
It is more serious in Turkish:
42.1% of the tokens are ambiguous
1.8 parses per token on average
3.8 parses per ambiguous token
Morphological disambiguation
Task: pick the correct parse given context
1. masal+Noun+A3sg+Pnon+Acc: "Uzun masalı anlat" (Tell the long story)
2. masal+Noun+A3sg+P3sg+Nom: "Uzun masalı bitti" (His long story ended)
3. masa+Noun+A3sg+Pnon+Nom^DB+Adj+With: "Uzun masalı oda" (Room with long table)
Morphological disambiguation
Task: pick the correct parse given context
1. masal+Noun+A3sg+Pnon+Acc
2. masal+Noun+A3sg+P3sg+Nom
3. masa+Noun+A3sg+Pnon+Nom^DB+Adj+With
Key Idea
Build a separate classifier for each feature.
Decision Lists
1. If (W = çok) and (R1 = +DA) Then W has +Det
2. If (L1 = pek) Then W has +Det
3. If (W = +AzI) Then W does not have +Det
4. If (W = çok) Then W does not have +Det
5. If TRUE Then W has +Det
Examples:
"pek çok alanda" (rule 1 applies)
"pek çok insan" (rule 2 applies)
"insan çok daha" (rule 4 applies)
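A decision list is applied by scanning rules top-down and returning the class of the first rule whose conditions all match; the condition-less final rule serves as the default. A minimal sketch, with the +Det list above encoded as illustrative attribute strings:

```python
def apply_decision_list(rules, attributes):
    """Return the class of the first rule whose conditions all hold.

    rules: list of (conditions, cls) pairs, ordered most-specific first;
    an empty condition set is the default rule (always matches).
    attributes: set of attribute strings describing the instance.
    """
    for conditions, cls in rules:
        if all(c in attributes for c in conditions):
            return cls
    return None  # unreachable if the list ends with a default rule

# The +Det list from the slide (illustrative encoding of conditions):
det_rules = [
    ({"W=çok", "R1=+DA"}, True),   # rule 1
    ({"L1=pek"}, True),            # rule 2
    ({"W=+AzI"}, False),           # rule 3
    ({"W=çok"}, False),            # rule 4
    (set(), True),                 # rule 5: If TRUE Then W has +Det
]

# "pek çok alanda": W=çok, left neighbor pek, right neighbor has +DA
has_det = apply_decision_list(det_rules, {"W=çok", "L1=pek", "R1=+DA"})
```

Here rule 1 fires before rule 4 gets a chance, so the word is assigned +Det despite the lower rule saying otherwise; ordering is what gives decision lists their expressive power.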
Greedy Prepend Algorithm
GPA(data)
  dlist = NIL
  default-class = Most-Common-Class(data)
  rule = [If TRUE Then default-class]
  while Gain(rule, dlist, data) > 0
      do dlist = Prepend(rule, dlist)
         rule = Max-Gain-Rule(dlist, data)
  return dlist
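Under the assumption that Gain is the net reduction in training errors from prepending a candidate rule, the loop above can be sketched as follows. Max-Gain-Rule is simplified here to a search over single-condition candidates; the actual learner considers richer conjunctions.

```python
from itertools import chain

def apply_dlist(dlist, attrs):
    """First matching rule wins; empty condition set always matches."""
    for conds, cls in dlist:
        if conds <= attrs:
            return cls
    return None

def errors(dlist, data):
    return sum(1 for attrs, label in data if apply_dlist(dlist, attrs) != label)

def gain(rule, dlist, data):
    # Net reduction in training errors from prepending `rule`.
    return errors(dlist, data) - errors([rule] + dlist, data)

def gpa(data):
    """Greedy Prepend Algorithm sketch (single-condition candidates)."""
    labels = [label for _, label in data]
    default = max(set(labels), key=labels.count)
    dlist = []
    rule = (set(), default)  # If TRUE Then default-class
    while gain(rule, dlist, data) > 0:
        dlist = [rule] + dlist
        # Max-Gain-Rule: best (condition, class) pair over the data
        attrs_seen = set(chain.from_iterable(a for a, _ in data))
        candidates = [({a}, c) for a in attrs_seen for c in set(labels)]
        rule = max(candidates, key=lambda r: gain(r, dlist, data))
    return dlist

# Toy data: attribute "a" signals class 1, everything else is class 0.
data = [({"a"}, 1), ({"a"}, 1), ({"b"}, 0), ({"c"}, 0), ({"d"}, 0)]
dlist = gpa(data)
```

On the toy data the algorithm first installs the default rule (class 0), then prepends the higher-gain rule "if a then 1", after which no candidate improves training error and the loop stops.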
Training Data
1M words of news material
Semi-automatically disambiguated
Created 126 separate training sets, one for each feature
Each training set contains only the instances that have the corresponding feature in at least one of their parses
Input attributes
For a five word window:
The exact word string (e.g. W=Ali'nin)
The lowercase version (e.g. W=ali'nin)
All suffixes (e.g. W=+n, W=+In, W=+nIn, W=+'nIn, etc.)
Character types (e.g. Ali'nin would be described with W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST)
An average of 40 attributes per instance.
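The surface attributes above can be generated by a small function per window position. This sketch takes suffixes literally rather than modeling the slide's vowel-harmony abstraction (+nIn with capital I), and the attribute names follow the slide's conventions:

```python
def word_attributes(word, prefix="W"):
    """Surface attributes for one window position: exact string,
    lowercase string, all suffixes, and coarse character types."""
    attrs = {f"{prefix}={word}", f"{prefix}={word.lower()}"}
    # All proper suffixes, e.g. Ali'nin -> +li'nin, ..., +nin, +in, +n
    for i in range(1, len(word)):
        attrs.add(f"{prefix}=+{word[i:].lower()}")
    # Character-type attributes
    if word[0].isupper():
        attrs.add(f"{prefix}=UPPER-FIRST")
    if "'" in word[1:-1]:
        attrs.add(f"{prefix}=APOS-MID")
    if any(ch.islower() for ch in word[1:-1]):
        attrs.add(f"{prefix}=LOWER-MID")
    if word[-1].islower():
        attrs.add(f"{prefix}=LOWER-LAST")
    return attrs

attrs = word_attributes("Ali'nin")
```

The same function would be called with prefixes L2, L1, W, R1, R2 for the five window positions; the union of the five attribute sets describes one instance. (Note that correct lowercasing of Turkish I/ı needs locale-aware case mapping, which `str.lower` does not provide.)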
Sample decision lists
Each line below is a class followed by its conditions; the condition-less line is the default rule.

+Acc (672 rules):
0
1 W=+InI
1 W=+yI
1 W=UPPER0
1 W=+IzI
1 L1=~bu
1 W=~onu
1 R1=+mAK
1 W=~beni
0 W=~günü
1 W=+InlArI
1 W=~onları
0 W=+olAyI
0 W=~sorunu
…

+Prop (3476 rules):
1
0 W=STFIRST
0 W==Türk
1 W=STFIRST R1=UCFIRST
0 L1==.
0 W=+AnAl
1 R1==,
0 W=+yAD
1 W=UPPER0
0 W=+lAD
0 W=+AK
1 R1=UPPER
0 W==Milli
1 W=STFIRST R1=UPPER0
…
Models for individual features
[Figure: number of rules (0-7000, left axis) and accuracy (84-100%, right axis) of the decision list learned for each individual feature, e.g. A3sg, Noun, Pnon, Nom, Verb, Adj, Prop, Acc.]
Combining models
Candidate parses:
masal+Noun+A3sg+P3sg+Nom
masal+Noun+A3sg+Pnon+Acc
Decision list results and confidences (only the distinguishing features are necessary):
P3sg = yes (89.53%)
Nom = no (93.92%)
Pnon = no (95.03%)
Acc = yes (89.24%)
score(P3sg+Nom) = 0.8953 × (1 − 0.9392) ≈ 0.0544
score(Pnon+Acc) = (1 − 0.9503) × 0.8924 ≈ 0.0444
The first parse wins.
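The combination scheme treats the feature classifiers as independent: a parse's score is the product, over its distinguishing features, of the probability that the feature is present according to the corresponding decision list. A sketch under that assumption:

```python
def parse_score(parse_features, decisions):
    """Score a candidate parse as the product, over the distinguishing
    features it contains, of P(feature present).

    decisions: {feature: (predicted_present, confidence)} from the
    per-feature decision lists.
    """
    score = 1.0
    for feature in parse_features:
        predicted_present, confidence = decisions[feature]
        # The confidence is attached to the classifier's own prediction;
        # convert it to the probability that the feature IS present.
        p_present = confidence if predicted_present else 1.0 - confidence
        score *= p_present
    return score

decisions = {"P3sg": (True, 0.8953), "Nom": (False, 0.9392),
             "Pnon": (False, 0.9503), "Acc": (True, 0.8924)}

s1 = parse_score({"P3sg", "Nom"}, decisions)  # masal+Noun+A3sg+P3sg+Nom
s2 = parse_score({"Pnon", "Acc"}, decisions)  # masal+Noun+A3sg+Pnon+Acc
# s1 = 0.8953 * (1 - 0.9392) ≈ 0.0544
# s2 = (1 - 0.9503) * 0.8924 ≈ 0.0444
```

s1 exceeds s2, so the P3sg+Nom parse is selected, matching the slide's computation.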
Evaluation
Test corpus: 1000 words, hand-tagged
Accuracy: 95.87% (confidence interval: 94.57-97.08)
Better than the training data!?
Other Experiments
Retraining on its own output: 96.03%
Training on unambiguous data only: 82.57%
Forget disambiguation, let's do tagging with a single decision list: 91.23%, 10000 rules
Contributions
Learning morphological disambiguation rules using the GPA decision list learner.
Reducing data sparseness and increasing noise tolerance by using separate models for individual output features.
ECOC, WSD, etc.