Morphology 2
A case study of developing Bengali
morph analyzer and generator
Sudeshna Sarkar
IIT Kharagpur
Two level morphology


PC-KIMMO is a morphological parser based on Kimmo
Koskenniemi's model of two-level morphology
(Koskenniemi 1983).
Koskenniemi's model of two-level morphology was
based on the traditional distinction that linguists make
between
 morphotactics, which enumerates the inventory of
morphemes and specifies in what order they can
occur, and
 morphophonemics, which accounts for alternate
forms or "spellings" of morphemes according to the
phonological context in which they occur.



For example, the word chased is analyzed morphotactically as the
stem chase followed by the suffix -ed.
However, the addition of the suffix -ed apparently causes the loss of
the final e of chase; thus chase and chas are allomorphs or alternate
forms of the same morpheme.
Koskenniemi's model is "two-level" in the sense that a word is
represented as a direct, letter-for-letter correspondence between its
lexical or underlying form and its surface form. For example, the
word chased is given this two-level representation (where + is a
morpheme boundary symbol and 0 is a null character):
Lexical form: c h a s e + e d
Surface form: c h a s 0 0 e d
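The letter-for-letter correspondence above can be sketched in a few lines of Python. This is only an illustration of the pairing idea, not PC-KIMMO itself; the function names are assumptions made for this example.

```python
# Two-level pairing for "chased": '+' is the morpheme boundary, '0' the null character.
LEXICAL = "chase+ed"
SURFACE = "chas00ed"

def pair_levels(lexical, surface):
    """Pair the two levels symbol by symbol; both strings must have equal length."""
    if len(lexical) != len(surface):
        raise ValueError("lexical and surface forms must align letter for letter")
    return list(zip(lexical, surface))

def surface_word(pairs):
    """Spell out the surface word by dropping the null characters."""
    return "".join(s for _, s in pairs if s != "0")

def morphemes(pairs):
    """Recover the morphemes by splitting the lexical level at the boundary symbol."""
    return "".join(l for l, _ in pairs).split("+")

pairs = pair_levels(LEXICAL, SURFACE)
print(surface_word(pairs))
print(morphemes(pairs))
```

The same list of pairs serves both directions: reading the surface side gives the word, reading the lexical side gives the analysis.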
Main components of Karttunen's KIMMO parser
1. The rules component: two-level rules that account for regular phonological or orthographic alternations, such as chase versus chas.
2. The lexical component (lexicon): lists all morphemes (stems and affixes) in their lexical form and specifies morphotactic constraints.


Englex: a two-level description of English
morphology
Englex consists of a set of orthographic rules, a 20,000-entry lexicon of roots and affixes, and a word grammar.
With Englex and PC-KIMMO, you can morphologically
parse English words and text.
Generative rules and 2-level rules
Two-level rules are similar to the rules of standard generative
phonology, but differ in several crucial ways. Rule R1 is an example
of a generative rule.
R1
t ---> c / ___ i
Rule R2 is the analogous two-level rule.
R2
t:c => ___ i
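A minimal sketch of how a declarative rule such as R2 can be checked against a lexical/surface pairing. It assumes the usual two-level reading of =>, namely that the pair t:c may occur only before i, while t is not forced to surface as c. It also assumes that the context i stands for the pair i:i. The function name and the toy strings are illustrative and do not come from the slides.

```python
def satisfies_r2(lexical, surface):
    """Check R2 (t:c => ___ i) on an aligned lexical/surface pair of strings."""
    pairs = list(zip(lexical, surface))
    for k, pair in enumerate(pairs):
        if pair == ("t", "c"):
            following = pairs[k + 1] if k + 1 < len(pairs) else None
            if following != ("i", "i"):
                return False  # t:c occurred outside the licensed context
    return True

# The rule constrains the correspondence itself, so the same check serves
# both generation (lexical to surface) and recognition (surface to lexical).
print(satisfies_r2("ati", "aci"))  # t:c before i
print(satisfies_r2("ata", "aca"))  # t:c before a
print(satisfies_r2("ati", "ati"))  # no t:c pair at all
```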

Generative rules
 Transformational rules
 Sequential application
 Unidirectional
Two-level rules
 Declarative – talk about correspondences
 They apply in parallel
 Bidirectional
Hindi Morphology
Hindi noun analysis
A. Noun analysis
Nouns are categorised into 20 different paradigms based
on the following criteria:
1. Vowel ending.
2. Valid suffix of a word.
3. Gender, Number, Person and Case information.
A snapshot of the analysis is shown in table 2.1.
There are 20,000 nouns classified into 20 such paradigms.
Hindi verb analysis
B. Verb Analysis
The Verb Group represents the following grammatical properties:
1. Tense : Present, Past and Future.
2. Aspect: Durative, Stative, Infinitive, Habitual and Perfective etc.
3. Modal: Abilitive, Deontic, Probabilitative etc.
4. Gender: Male, Female, Dual.
5. Person: 1st , 2nd and 3rd.
These values formed the basis to list Verb Groups according
to their TAM-GNP values. A TAM-GNP matrix having all
possible VGs is developed.
IITB morph analyzer: presently there are 622 unique
paradigms in the TAM-GNP matrix.
Bengali Morphology
Morphology: Verb
Attribute 1: Root
Val 0: root word of the given surface form of the word
Attribute 2: Category
Val 0: verb (v)
Attribute 3: Person
Val 0: first, Val 1: second normal, Val 2: second familiar, Val 3: third normal, Val 4:
formal (second/third)
Attribute 4: Tense
Val 0: Present, Val 1: Past, Val 2: Future
Attribute 5: Aspect
Val 0: simple, Val 1: continuous, Val 2: perfect
Attribute 6: Modality
Attribute 8: Specificity
Val 0: non-specific, Val 1: specific
Attribute 9: Emphasizer
Val 0: none, Val 1: only, Val 2: also
Attribute 10: Polarity
Val 0: positive, Val 1: negative
Attributes & Values (Verb) :
Person:
 First Person-(1),Ami
 Second Formal-(2),Apani
 Second Normal-(3),tumi
 Second Familiar-(4),tui
 Third Normal-(5),se
 Third Formal-(6),tini
 Unspecified
Attributes & Values (Verb) :
Tense:
Present-(1),kari
Past-(2),karalAma
Future-(3),karaba
Overall-(4)
Attributes & Values (Verb) :
Aspect:
Simple-(1),karalAma
Habitual-(2),karatAma
Continuous-(3),karachhe
Perfect-(4),karechhi
Indefinite-(5),kari
Attributes & Values (Verb) :
Modality:
 Indicative-(1),kara
 Imperative-(2),kar
 Subjunctive-(3),karale
Attributes & Values (Verb) :
Polarity:
Positive-(1),kari
Negative-(2),karini
INFORMATION:VERBS



Total number of categories (based on syllabic structure): 20
Rules: 214 per category
Total number of rules: 214 × 20 = 4280 (approx.)
Bengali Verb Paradigms
Bengali Verb morphology for one of the paradigms
Classification : Nouns








Morphological Classification Based on Different Types of Nouns:
1.Animate (example: mAnuSha)
2.Inanimate(example: mATi)
3.Abstract/Qualitative(example: daYA)
4.Verbal(example : bhojana)
5.Collective(example: pAla)
6.The Singular (example: chandra)
7.Compounded(example: riksAoYAlA)
Sub Classification :Nouns









Sub Classification based on “Root Endings”:
1.a-ending root (animate “mAnusha”)
2.A- ending root (animate “bAlikA”)
3.i- ending root (animate “pAkhi”)
4.I- ending root (animate “khukI”)
5.e- ending root (animate “chhele”)
6.o- ending root (animate “myA;o”)
7.u-ending root (animate “shishu”)
8.U- ending root (animate “badhU”)
Classification :Pronouns
Morphological Analysis Based on Different Natures of Pronouns:
1.Personal (Ami,Apani,-)
2.Inclusive (saba,sakala,ubhaYa,-)
3.Relative(ye,yAhA,-)
4.Interrogative(ke,ki,-)
5.Denoting Others (anya,para,-)
6.Near Demonstrative (e,ihA,-)
7.Far Demonstrative (o,uhA,-)
8.Reflexive (nija,nijenije,-)
9.Indefinite (keu,kichhu,-)
Morphology : Pronoun
Attributes:
 Number
 Val 0: singular, Val 1: plural, Val 2: honorary plural
 Form
 Val 0: direct, Val 1: oblique
 Specificity
 Val 0: non-specific, Val 1: specific
 Case
 Val 0: Nom., Val 1: Acc., Val 2: Genitive, Val 3: Locative
 Emphatic Marker
 Val 0: none, Val 1: only, Val 2: also
 Ellipses
 Val 0: false, Val 1: true
 Nature
 Types
Bengali POS Categories (Noun)
Bengali Noun has the following attributes:
Number, Specificity, Ellipses, Form, Case and Emphasizer






Number has 2 values (Singular and Plural)
Specificity has 2 values (Specific and non_specific)
Ellipses has 2 values (Elliptic and non_elliptic)
Form has 2 values (Direct and Oblique)
Case has 5 values (Nominative, Accusative, Genitive, Locative,
Instrumental)
Emphasizer has 3 values (None, Only, Also)
Adjective Morphology
Root
Val 0: root word of the given surface form of the word
Specificity
 Val 0: non-specific, Val 1: specific
Emphasizer
 Val 0: none, Val 1: only, Val 2: also
Degree
 Val 0: normal, Val 1: superlative, Val 2: Comparative
Gender
 Val 0: masculine, Val 1: feminine, Val 2: neuter

Adverb Morphology
Root
Val 0: root word of the given surface form of the word
Emphasizer
 Val 0: none, Val 1: only, Val 2: also
Degree
 Val 0: normal, Val 1: superlative, Val 2: Comparative

Postposition Morphology
Root
Val 0: root word of the given surface form of the word
Emphasizer
 Val 0: none, Val 1: only, Val 2: also

Morphological Generator
Developed at
IIT Kharagpur
Introduction
Morphological Generator uses certain linguistic resources and
generates the surface form from a given input.
The following linguistic resources are required
 Root Dictionary
 Morphological Rules
 Rule/Attribute Type Declaration (RATD)
 Morphotactics
 Paradigm Tables
 Orthographic Rewrite Rules
 Exception List
Format of the root dictionary
<root_word>:<category, paradigm_no;>+
 root_word: The root word in UTF-8
 category: Part-of-speech category
 paradigm_no: A specific non-negative number referring to the
paradigm table to be used for generation of the surface form for the
root_word, when used as a particular POS-category.
 +: denotes one or more occurrences of <category, paradigm_no;>
Example for Hindi:
 कर: NN,0; VM,1;
 आम: NN,1; JJ, 0;
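A root dictionary line in this format can be read with a few string operations. The sketch below is one possible reading of the format, not the generator's actual loader; the function name is hypothetical and the leading bullet of the slide is assumed not to be part of the entry.

```python
def parse_root_entry(line):
    """Parse '<root_word>:<category, paradigm_no;>+' into (root, [(category, paradigm_no), ...])."""
    root, _, rest = line.partition(":")
    entries = []
    for chunk in rest.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the final ';'
        category, paradigm_no = (part.strip() for part in chunk.split(","))
        entries.append((category, int(paradigm_no)))
    if not entries:
        raise ValueError("a root needs at least one <category, paradigm_no;> pair")
    return root.strip(), entries

print(parse_root_entry("कर: NN,0; VM,1;"))
print(parse_root_entry("आम: NN,1; JJ, 0;"))
```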
RATD
The first line of the RATD is
<#categories> <cat_tag >+
 #categories: The total number of distinct categories, for which
morphological generation is required.
 cat_tag: The category tag as used in the root dictionary, for which the
generation is required.

Example:
3 NN QC VM
RATD
This is followed by the declarations for each of the #categories categories. The
declaration for each category consists of a meta declaration line followed by
#morphotactics lines specifying the morphotactic rules. The meta
declaration for a category is as follows:
<cat_tag> <file_name> <#paradigms> <#morphotactics> <#attributes>
<#values_for_attribute>+
 cat_tag: As defined above
 file_name: The name of the file that contains the morphotactics, paradigm
tables and rewrite rules of the particular category.
 #paradigms: Total number of paradigms for the category
 #morphotactics: Total number of linear morphotactic rules for the category
 #attributes: Total number of attributes that govern the morphology
 #values_for_attribute: The number of values for each of the attributes.
Example
NN nn.txt 5 1 2 2 2
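Reading the example above field by field: NN is the tag, nn.txt the file, 5 the number of paradigms, 1 the number of morphotactic rules, 2 the number of attributes, and the final "2 2" the number of values of each attribute. A small parser along these lines might look as follows; the class and function names are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CategoryDecl:
    cat_tag: str
    file_name: str
    n_paradigms: int
    n_morphotactics: int
    values_per_attribute: List[int]

def parse_meta_declaration(line):
    """Parse '<cat_tag> <file_name> <#paradigms> <#morphotactics> <#attributes> <#values>+'."""
    fields = line.split()
    cat_tag, file_name = fields[0], fields[1]
    n_paradigms, n_morphotactics, n_attributes = (int(f) for f in fields[2:5])
    values = [int(f) for f in fields[5:]]
    if len(values) != n_attributes:
        raise ValueError("expected one value count per declared attribute")
    return CategoryDecl(cat_tag, file_name, n_paradigms, n_morphotactics, values)

decl = parse_meta_declaration("NN nn.txt 5 1 2 2 2")
print(decl)
```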
Morphotactics
The morphotactics are specified linearly in the following format
{ ‘(’ { attribute_id, }+ ‘)’ }+
 For example, the morphotactic rule (0, 2)(3)(1, 4) means that the suffix marking for
the features 0 and 2 is followed by the suffix marking feature 3 and then the suffix
marking the features 1 and 4.
 We assume a linear morphology
 We assume that inflections are in the form of suffixes only (i.e. no prefix or infix)
 In the above example, it is not possible to split the suffixes marking for features 0 and
2, and 1 and 4. In other words, the suffixes for these features are fusional as far as
(0,2) or (1,4) feature combinations are considered, but the morphology is
agglutinative in general.
 There can be more than one morphotactic rule for a category in a language. In that
case, the first rule is taken as the default one, whereas the other rules are triggered
only under special circumstances, which are to be specified with the rule by assigning
some specific value to the feature, like (0, 2=5)(3)(1, 4) implies that the rule is
triggered only when Attribute 2 has a value of 5.
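The rule notation above is simple enough to parse with a regular expression. The sketch below turns a rule string into a list of suffix slots plus any triggering conditions; the function name and the returned data shape are assumptions made for illustration.

```python
import re

def parse_morphotactic_rule(rule):
    """Return (slots, conditions).

    slots      -- list of tuples of attribute ids, one tuple per suffix position
    conditions -- dict mapping an attribute id to the value that triggers the rule
    """
    slots, conditions = [], {}
    for group in re.findall(r"\(([^()]*)\)", rule):
        ids = []
        for item in group.split(","):
            item = item.strip()
            if not item:
                continue
            attr, sep, value = item.partition("=")
            ids.append(int(attr))
            if sep:  # an 'attribute=value' item is a trigger condition
                conditions[int(attr)] = int(value)
        slots.append(tuple(ids))
    return slots, conditions

print(parse_morphotactic_rule("(0, 2)(3)(1, 4)"))
print(parse_morphotactic_rule("(0, 2=5)(3)(1, 4)"))
```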
Morphotactics example
Bengali noun morphology
 Attribute 0: Number
Val 0: singular, Val 1: plural
 Attribute 1: Obliqueness Val 0: direct, Val 1: oblique
 Attribute 2: Specificity
Val 0: non-specific, Val 1: specific
 Attribute 3: Case
Val 0: Nom., Val 1: Acc., Val 2: Genitive, Val 3: Locative
 Attribute 4: Emphasizer
Val 0: none, Val 1: only, Val 2: also
 Attribute 5: Ellipses
Val 0: false, Val 1: true
Bengali nouns follow one of the following two morphotactics
 (0,1,2)(3)(4)
 (0,1,2)(5=1)(0,1,2)(3)(4)
The second rule is triggered only in the case of ellipses.

Paradigm Table



The category specific files (e.g. nn.txt in the earlier
example) store the paradigm tables and orthographic
rewrite rules.
There is a paradigm table for every paradigm number and for each
feature or feature combination in the morphotactics. Thus, if there are
#paradigms paradigms for Bengali nouns, then there are
4 × #paradigms paradigm tables. The 4 tables per
paradigm correspond to (0,1,2), (3), (4), and (5).
However, several paradigms might share some of the
tables. Therefore, in the declaration, a particular table
can stand for more than one paradigm.
A paradigm table contains the list of suffixes for a particular combination
of attributes.
<ParadigmTable
<Attributes a1, a2>
<ParadigmNumber x1, x2, x3>
<Suffixes s11, s12, s13,…, s21, s22, s23,…>
The number of suffixes in a table equals the product of the numbers of
values of the attributes in that combination.
Example: if the combination is (0,1), the 1st attribute has 10 values and the
2nd attribute has 3 values, the table for the combination (0,1) will
contain 10×3 = 30 suffixes (some of which may be NULL).
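The table size and the position of a suffix inside a table can be computed generically. The sketch below assumes that the last attribute of a combination varies fastest, which is the ordering the slides use for the Bengali verb table (Person 0 Tense 0, Person 0 Tense 1, ...). The function names are illustrative.

```python
from math import prod

def table_size(value_counts):
    """Number of suffix slots for a combination of attributes."""
    return prod(value_counts)

def flat_index(values, value_counts):
    """0-based position of a value combination; the last attribute varies fastest."""
    index = 0
    for value, count in zip(values, value_counts):
        if not 0 <= value < count:
            raise ValueError("attribute value out of range")
        index = index * count + value
    return index

print(table_size([10, 3]))              # size of the (0,1) table in the example
print(flat_index((3, 2), (6, 10)) + 1)  # 1-based entry number for Person 3, Tense 2
```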
Orthographic Rules
Orthographic rules are specified as rewrite rules of the following form:
input → output / left_context, right_context
We also have provision for two-layer rules, where the top
layer specifies the rule on strings and the bottom layer indicates the
features.
Thus, a rule of the type
input → output / left_context, right_context
[att1]           [root],       [att2]
means that when the suffix corresponding to the attribute att1 has the
pattern input, and it is immediately preceded by the pattern
left_context, which belongs to the root, and followed by the pattern
right_context, which belongs to another suffix corresponding to
some attribute att2, then input should be replaced by the pattern
output.
RATD for Bengali













11 NN QC VM PN AV AJ PS OT UT QF QO
NN nn.txt nn_rule.txt mean_noun.txt 1 1 6 2 2 2 2 5 3
QC qc.txt qc_rule.txt mean_card.txt 1 1 4 4 2 2 3
VM vm.txt vm_rule.txt mean_verb.txt 1 2 5 6 10 3 2 2
PN pn.txt pn_rule.txt mean_pron.txt 1 2 7 2 2 2 2 2 5 3
AV av.txt av_rule.txt mean_adv.txt 1 1 2 3 3
AJ aj.txt aj_rule.txt mean_adj.txt 1 1 2 3 3
PS ps.txt ps_rule.txt mean_psp.txt 1 1 1 3
OT ot.txt ot_rule.txt mean_oth.txt 1 1 1 1
UT ut.txt ut_rule.txt mean_quot.txt 1 1 1 3
QF qf.txt qf_rule.txt mean_quan.txt 1 1 2 2 3
QO qo.txt qo_rule.txt mean_ord.txt 1 1 1 3
symbols: aAbcdDeghiIjklmn.;NoprsStTuUyY
Orthographic Rules
The format is similar to two-level morphological rules. Each rule has 4 parts:
input:output/left_context,right_context
Here input is changed to output provided that input is preceded by left_context and followed by
right_context. The root and the suffix are separated by ^, and the suffix is terminated by #.
Example:
“give^ing# = giving” can be written as the rule
Rule 1:
e^:NULL/giv,ing#
If we say all “e-ending” words are inflected like “give” then we can write the rule
Rule 2:
e^:NULL/*,ing#
If we say all “a-ending” and “o-ending” words are simply concatenated when added with “ing#” we
can write
Rule 3:
^:NULL/*~,ing#
(Where ~ symbol means either ‘a’ or ‘o’)
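Rules 1 to 3 can be tried out with a short Python sketch. Here ordinary regular expressions stand in for the generator's own rule matcher, which is an assumption of this example: * is read as "any string", class symbols such as ~ are supplied through a dictionary, and only the first matching rule is applied. The function names and the toy words "go" and "walk" are not from the slides.

```python
import re

def compile_rule(rule, classes=None):
    """Turn 'input:output/left_context,right_context' into (regex, replacement)."""
    classes = classes or {}
    change, _, context = rule.partition("/")
    source, _, target = change.partition(":")
    left, _, right = context.partition(",")

    def to_regex(pattern):
        parts = []
        for ch in pattern:
            if ch == "*":
                parts.append(".*")                      # wildcard: any string
            elif ch in classes:
                parts.append("[" + re.escape(classes[ch]) + "]")
            else:
                parts.append(re.escape(ch))             # literal symbol, including ^ and #
        return "".join(parts)

    regex = re.compile(f"^({to_regex(left)}){to_regex(source)}({to_regex(right)})$")
    return regex, "" if target == "NULL" else target

def apply_rules(form, rules, classes=None):
    """Apply the first triggered rule; if none triggers, the suffix is simply concatenated."""
    for rule in rules:
        regex, replacement = compile_rule(rule, classes)
        match = regex.match(form)
        if match:
            form = match.group(1) + replacement + match.group(2)
            break
    return form.replace("^", "").replace("#", "")

RULES = ["e^:NULL/*,ing#", "^:NULL/*~,ing#"]
print(apply_rules("give^ing#", RULES, {"~": "ao"}))
print(apply_rules("go^ing#", RULES, {"~": "ao"}))
print(apply_rules("walk^ing#", RULES, {"~": "ao"}))
```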
Orthographic Rules Contd..
Orthographic rules are best implemented as deterministic finite-state machines (FSMs).
The FSM decides whether a rule is satisfied by the input word. If the input word,
following the FSM, reaches the final state, we say the rule is triggered; finding the
portion to be replaced is then straightforward.
If no orthographic rule is triggered, the suffix is simply concatenated.
Building FSM
Example FSM for Rule 2:
e^:NULL/*,ing#
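[Figure: state diagram of the deterministic FSM for Rule 2. States: S, A, B, C, D, E, F, G, H. Transition labels: e, ^, i, n, g, #, *, *-e, *-e-^, *-i, *-n, *-g, *-#.]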
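A deterministic recogniser for Rule 2 can be sketched directly in Python. This is a reconstruction under assumptions, not the automaton of the figure: states are numbered 0 to 6 (the number of pattern symbols matched so far) instead of using the figure's state names, and the rule is taken to trigger when the input ends in e^ing#. Because all six symbols of the pattern are distinct, a mismatch can only fall back to state 1 (on e) or state 0.

```python
PATTERN = "e^ing#"  # input 'e^' followed by the right context 'ing#'

def rule2_triggered(form):
    """Run the DFA over the form; the rule is triggered if the final state is reached at the end."""
    state = 0
    for ch in form:
        if state < len(PATTERN) and ch == PATTERN[state]:
            state += 1   # next expected symbol seen
        elif ch == PATTERN[0]:
            state = 1    # an 'e' may start a fresh match
        else:
            state = 0    # any other symbol resets the match
    return state == len(PATTERN)

print(rule2_triggered("give^ing#"))
print(rule2_triggered("go^ing#"))
```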
Orthographic Rules for Bengali Verb





















^y:;i/*,#
^a:;o/*,#
AWA^aie:eWe/*,#
A^aie:e/*,#
yAoYA^aie:giYe/*,*
AoYA^aie:eYe/X,#
eoYA^aie:iYe/*,#
Ano^aie:iYe/*,#
oYA^aie:uYe/*$,#
AWA^aie:eWe/*,#
A^:NULL/B,~*
A^:a/B,$*
no^:ch/*A,chh*
oYA^:ch/*A,chh*
Ano^:iY/*,echh*
no^:NULL/*A,E*
no^F:NULL/*A,G*
oYA^F:NULL/*A,G*
no^:NULL/*A,iK
oYA^:NULL/*A,iK
no^L:o/*A,*


















oYA^L:o/*A,*
no^e:Ya/*A,K
oYA^e:Ya/*A,K
AoYA^:eY/X,echh*
yAoYA^:giY/*,echh*
AoYA^:e/X,M*
oYA^:NULL/*A,b*
AoYA^:e/y,t*
yAoYA^:ge/*,l*
eoYA^:iY/*,echh*
eoYA^:i;/*,iK
eoYA^:ich/*,chh*
eoYA^:i/*,P*
oYA^:NULL/*e,Q*
eoYA^u:i/*,*
eoYA^:NULL/*,R*
eoYA^:A/*,o*
eoYA^a:Ao/*,ni


















eoYA^e:eYa/*,K
oYA^:uY/*$,echh*
oYA^:uch/*$,chh*
oYA^:u/*$,V*
oYA^i:u/*$,sa*
YA^:;/*$o,o#
YA^a:;o/*$o,ni*
YA^e:NULL/*$o,naK
YA^u:NULL/*$o,*
A^e:a/*$oY,K
^y:;i/*,#
^a:;o/*,#
AWA^aie:eWe/*,#
yAoYA^aie:giYe/*,*
AoYA^aie:eYe/X,#
eoYA^aie:iYe/*,#
Ano^aie:iYe/*,#
oYA^aie:uYe/*$,#
Input Format
The input to the Morphological Generator starts with the root of the word,
followed by the POS category and then the attribute names with their values.
Example:
karA VM Person 3 Tense 2 Emp 2
In Bengali, Person and Tense combine to give one suffix, which is
added first; the Emphasizer gives another suffix, which is
added next.
See Morphotactic for Bengali Verb.
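An input line of this shape can be split into root, category and feature values as sketched below. The function name and the choice of a dictionary for the features are assumptions for illustration.

```python
def parse_generator_input(line):
    """Split 'root CATEGORY (attribute value)*' into (root, category, features)."""
    tokens = line.split()
    if len(tokens) < 2 or len(tokens) % 2 != 0:
        raise ValueError("expected: root category followed by attribute/value pairs")
    root, category = tokens[0], tokens[1]
    features = {name: int(value) for name, value in zip(tokens[2::2], tokens[3::2])}
    return root, category, features

print(parse_generator_input("karA VM Person 3 Tense 2 Emp 2"))
```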
Input Format Contd.
In Bengali, Person can have 6 values and Tense (which is actually TAM) can have 10
values. The suffixes in the paradigm table are arranged in the following way:
First entry is Person 0 Tense 0
Second entry is Person 0 Tense 1
Third entry is Person 0 Tense 2 …
10th entry is Person 0 Tense 9
11th entry is Person 1 Tense 0
So Person 3 Tense 2 will be the entry number
(Person input) × (number of TAM values) + (TAM input) + 1
= 3 × 10 + 2 + 1 = 33
Get the 33rd entry from the paradigm table for (0,1) and apply the orthographic rules to get the
correct word.
Bengali Verb Paradigms and Morphotactics
<ParadigmTable
<Attributes 1 2 > /* 1 indicates Person and 2 indicates TAM */
<suffixes
i chhi echhi lAma chhilAma echhilAma ba tAma NULL ini
isa chhisa echhisa li chhili echhili bi tisa NULL isani
o chha echha le chhile echhile be te NULL ani
ena chhena echhena lena chhilena echhilena bena tena una enani
e chhe echhe la chhila echhila be ta uka eni
ena chhena echhena lena chhilena echhilena bena tena una enani
>>
Morphotactic rule


(0,1)(2)(3)
(3=2)(2)
<ParadigmTable
<Attributes 3 >
/*Case*/
<suffixes
NULL i o
>>
Bengali Noun Paradigms and Morphotactics
<ParadigmTable
<Attributes 0 1 2 > /* Number, Specificity, Ellipses 2×2×2 = 8 entries*/
<suffixes
NULL eraTA TA NULL gulo guloraTA NULL NULL
>>
<ParadigmTable
<Attributes 3 4 >
/* Form, Case 2 × 5 = 10 entries */
<suffixes
NULL ke NULL ete ete NULL NULL era NULL NULL
>>
<ParadigmTable
<Attributes 5 > /* Emphasizer 3 entries */
<suffixes
NULL i o
>>
Morphotactic rule
(0,1,2)(3,4)(5)
Example (Bengali Verb)
Example: the Input is
balA Verb Person 1 TAM 1 Case 0
The first morphotactic rule is triggered.
Person can have 6 values and TAM can have 10 values, so the
entry number of the suffix in the paradigm table for attributes (1,2) is
10×(Person value) + (TAM value) + 1 = 10×1 + 1 + 1 = 12
i.e., chhisa is to be added first.
From the paradigm table (3), the extracted suffix is NULL,
i.e., NULL is to be added next.
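The lookup can be reproduced with the Person/TAM suffix table from the verb paradigm slide, laid out with one row of 10 TAM entries per Person value. The variable and function names are assumptions made for this sketch.

```python
# Person x TAM suffixes from the Bengali verb paradigm table (6 x 10 = 60 entries).
SUFFIXES = """
i chhi echhi lAma chhilAma echhilAma ba tAma NULL ini
isa chhisa echhisa li chhili echhili bi tisa NULL isani
o chha echha le chhile echhile be te NULL ani
ena chhena echhena lena chhilena echhilena bena tena una enani
e chhe echhe la chhila echhila be ta uka eni
ena chhena echhena lena chhilena echhilena bena tena una enani
""".split()

N_PERSON, N_TAM = 6, 10

def verb_suffix(person, tam):
    """Return the suffix at entry number N_TAM * person + tam + 1 (1-based)."""
    entry_number = N_TAM * person + tam + 1
    suffix = SUFFIXES[entry_number - 1]
    return "" if suffix == "NULL" else suffix

print(verb_suffix(1, 1))  # entry 12
print(verb_suffix(3, 2))  # entry 33
```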
Example Contd.
Now balA^chhisa# is the input, for which a suitable
orthographic rule is searched.
Suppose there is an orthographic rule
A^:a/B,$*
where B: *-Y (any symbol except Y) and $: consonant.
Then the FSM for this rule brings the input to the final state, i.e., the
rule is triggered. Now “A^” is replaced by “a” and the output is
“balachhisa”
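The final step can be checked with a small sketch in which a regular expression stands in for the rule's FSM. Two assumptions are made here: B is read as "any symbol except Y", and the consonant class $ is taken to be the consonant letters of the symbol list in the Bengali RATD.

```python
import re

# Assumed consonant class: the symbol list without the vowels a A e i I o u U and the marks '.' and ';'.
CONSONANTS = "bcdDghjklmnNprsStTyY"

# Rule A^:a/B,$*  (left context: a symbol other than Y; right context: a consonant, then anything)
RULE = re.compile(r"^(.*[^Y])A\^([" + CONSONANTS + r"].*)$")

def apply_a_rule(form):
    """Replace 'A^' by 'a' when the rule is triggered, then drop the ^ and # markers."""
    match = RULE.match(form)
    if match:
        form = match.group(1) + "a" + match.group(2)
    return form.replace("^", "").replace("#", "")

print(apply_a_rule("balA^chhisa#"))
```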
Exception List:
Words that do not follow the orthographic changes of other words,
or that change completely when inflected, are treated as exceptions.
Adding these words to the orthographic rules would produce a large number
of highly complex rules.
We handle them in a separate file that lists each exception word
along with all its inflections.
Morph Analyzer
