Morphology and Finite-State
Transducers
by Mathias Creutz
31 October 2001
Chapter 3, Jurafsky & Martin
Contents

Morphology: morphemes, inflection and derivation, allomorphs
Morphological Parsing: finite-state automata, two-level morphology
Finite-State Transducers: rules, combination of FSTs, lexicon-free FSTs
Human Morphological Processing
Exercise
Morphology

Morphology is the study of the way
words are built up from smaller
meaning-bearing units, morphemes.


e.g. talo + ssa + ni + kin
Two broad classes of morphemes,
stems and affixes:

the stem is the ”main morpheme” of the
word, supplying the main meaning, e.g.
talo in talo+ssa+ni+kin
Affixes


Affixes add ”additional” meanings.
Concatenative morphology uses the
following types of affixes:



prefixes, e.g. epä- in epä+olennainen
suffixes, e.g. -ssa in talo+ssa
circumfixes, e.g. German ge- -t in
ge+sag+t ([have] said)
Non-concatenative Morphology

In non-concatenative morphology the
stem morpheme is split up. The following
types of affixes are used:


infixes, e.g. Yurok (California): sepolah (field),
se+ge+polah (fields)
transfixes, e.g. Hebrew, l+a+m+a+d (he
studied), l+i+m+e+d (he taught), l+u+m+a+d
(he was taught)

This type of non-concatenative morphology is called
templatic or root-and-pattern morphology.
Inflection and Derivation

There are two broad classes of ways to
form words from morphemes:
inflection and derivation.
Inflection

Inflection is the combination of a word stem
with a grammatical morpheme, usually
resulting in a word of the same class as the
original stem, and usually filling some
syntactic function, e.g. plural of nouns.

talo (singular), talo+t (plural)

Inflection is productive.

talo, talo+t vs. auto, auto+t vs. metsä, metsä+t

The meaning of the resulting word is easily
predictable.
Derivation

Derivation is the combination of a word stem
with a grammatical morpheme, usually
resulting in a word of a different class, often
with a meaning hard to predict exactly.

e.g. järki, järje+st+ää, järje+st+ö,
järje+st+ell+ä, järje+st+el+mä,
järje+st+el+mä+lli+nen,
järje+st+el+mä+lli+syys

Not always productive.

järki, järje+st+ää vs. metsä, metsä+st+ää vs.
talo, talo+st+aa?
Allomorphs

A group of allomorphs make up one
morpheme class. An allomorph is a
special variant of a morpheme.


e.g. Finnish illative ending:
+<vowel_lengthening>n, +h<vowel>n,
+seen, +siin → talo+on, metsä+än,
talo+i+hin, huonee+seen, huone+i+siin
e.g. Finnish stem variation: käsi, käde+n,
kät+tä, käte+en
Why Allomorphs?

Phonological constraints
e.g. vowel harmony, talo+ssa vs. metsä+ssä

Morphological paradigms
e.g. käsi, käde+n vs. kasi, kasi+n,
Swedish leta, leta+de vs. heta, het+te

Irregularities
e.g. cat, cat+s vs. goose, geese

Orthographic constraints, i.e. spelling rules
e.g. cat, cat+s vs. city, citi+es
Morphological Parsing


Parsing means taking an input and
producing some sort of structure for it.
Morphological parsing means breaking
down a word form into its constituent
morphemes.

e.g. talossa → talo +ssa

Mapping a word form to its base form is
called stemming.

e.g. talossa → talo
Finite-State Morphological Parsing

In order to build a parser we need the
following:



a lexicon containing the stems and affixes,
morphotactics, i.e. the model of morpheme
ordering, e.g. talo+ssa+ni instead of talo+ni+ssa,
a set of rules (orthographic, etc.), i.e. the model
of changes that occur in a word, usually when two
morphemes combine, e.g. city + s → cities.
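The three components can be sketched in a few lines of Python. This is a toy illustration, not a real parser: the names `LEXICON`, `SLOT`, `is_well_formed` and `spell` are hypothetical, the lexicon mixes a few Finnish and English morphemes from the examples above, and the only rule implemented is a simplified y-to-ie replacement before -s.

```python
# Lexicon: morpheme -> class. A toy inventory (assumption), not a real lexicon.
LEXICON = {
    "talo": "stem", "city": "stem", "cat": "stem",
    "ssa": "case", "ni": "possessive", "kin": "clitic",
    "s": "plural",
}

# Morphotactics: each class occupies a slot; slots must strictly increase.
SLOT = {"stem": 0, "plural": 1, "case": 1, "possessive": 2, "clitic": 3}


def is_well_formed(morphemes):
    """True if all morphemes are in the lexicon and appear in slot order."""
    if not morphemes or any(m not in LEXICON for m in morphemes):
        return False
    slots = [SLOT[LEXICON[m]] for m in morphemes]
    return slots[0] == 0 and all(a < b for a, b in zip(slots, slots[1:]))


def spell(morphemes):
    """Concatenate morphemes, applying one spelling rule (y -> ie before -s)."""
    word = morphemes[0]
    for affix in morphemes[1:]:
        if affix == "s" and word.endswith("y"):
            word = word[:-1] + "ie"  # simplified: ignores vowel + y stems
        word += affix
    return word


print(is_well_formed(["talo", "ssa", "ni"]))  # ordering allowed
print(is_well_formed(["talo", "ni", "ssa"]))  # ordering rejected
print(spell(["city", "s"]))
```

Keeping the lexicon, the ordering model and the spelling rules in separate structures mirrors the division of labour listed above; a finite-state implementation would encode each of them as an automaton or transducer instead.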
Finite-State Automaton for Inflection of
English Verbs
[Figure: finite-state automaton with states q0–q3 and arcs labeled reg-verb-stem, irreg-verb-stem, irreg-past-verb-form, preterite (-ed), past-participle (-ed), progressive (-ing), 3-singular (-s)]
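An automaton over morpheme classes like this one can be written as a transition table. The sketch below is only an illustration: the exact arc layout of the figure is not recoverable from this transcript, so the layout in `TRANSITIONS`, the choice of final states and the helper name `accepts` are assumptions.

```python
# Assumed arc layout: regular stems take all four endings,
# irregular stems only take -ing and -s, irregular past forms stand alone.
TRANSITIONS = {
    ("q0", "reg-verb-stem"): "q1",
    ("q0", "irreg-verb-stem"): "q2",
    ("q0", "irreg-past-verb-form"): "q3",
    ("q1", "preterite"): "q3",
    ("q1", "past-participle"): "q3",
    ("q1", "progressive"): "q3",
    ("q1", "3-singular"): "q3",
    ("q2", "progressive"): "q3",
    ("q2", "3-singular"): "q3",
}
FINAL = {"q1", "q2", "q3"}  # assumption: a bare stem is also a valid word


def accepts(labels):
    """Run the automaton over a sequence of morpheme-class labels."""
    state = "q0"
    for label in labels:
        state = TRANSITIONS.get((state, label))
        if state is None:  # no arc for this label: reject
            return False
    return state in FINAL


print(accepts(["reg-verb-stem", "preterite"]))    # e.g. talk + ed
print(accepts(["irreg-verb-stem", "preterite"]))  # e.g. *sing + ed
```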
Finite-State Automaton for Inflection of
the Verbs ’talk’, ’test’ and ’sing’
[Figure: letter-level finite-state automaton (states q0–q3 labeled) whose arcs spell out the inflected forms of talk, test and sing]
Two-Level Morphology

Two-level morphology represents a word as a
correspondence between a lexical level, which
represents a simple concatenation of morphemes
making up a word, and the surface level, which
represents the actual spelling of the final word.
Lexical:  s i n g +V +PROG
Surface:  s i n g i n g
Finite-State Transducer



A transducer maps between one set of symbols and
another; a finite-state transducer does this via a finite
automaton.
Where an FSA accepts a language stated over a finite
alphabet of single symbols, e.g. Σ={a, b, c, ...}, an
FST accepts a language stated over pairs of
symbols, e.g. Σ={a:a, b:b, a:c, a:ε, ...}
In two-level morphology, we call pairs like a:a
default pairs, and refer to them by the single symbol
a.

An FST can be seen as a recognizer, generator,
translator or a set relator.
Finite-State Transducer for Inflection of
the Verbs ’talk’, ’test’ and ’sing’
[Figure: letter-level finite-state transducer (states q0 and q3 labeled). Arc labels: default letter pairs, i:u, i:a, +V:ε, +PRET:e, +PRET:ε, +PSTPCP:e, +PSTPCP:ε, +PROG:i, +3SG:s, ε:d, ε:n, ε:g]
Examples
Lexical form        Surface form
talk +V             talk
sing +V +3SG        sings
test +V +PROG       testing
talk +V +PRET       talked
sing +V +PRET       sang
talk +V +PSTPCP     talked
sing +V +PSTPCP     sung
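The same lexical/surface pairs can be run through a small transducer in Python. This is a minimal sketch under stated assumptions: it encodes only the seven example pairs (not the full transducer from the figure), builds a trie-shaped FST from hand-aligned symbol pairs, and the names `build`, `transduce`, `generate` and `analyze` are hypothetical.

```python
EPS = ""  # the empty symbol (epsilon)


def default(word):
    """Default pairs: every letter maps to itself."""
    return [(c, c) for c in word]


# One aligned path of lexical:surface pairs per example (alignment is an assumption).
PATHS = [
    default("talk") + [("+V", EPS)],
    default("sing") + [("+V", EPS), ("+3SG", "s")],
    default("test") + [("+V", EPS), ("+PROG", "i"), (EPS, "n"), (EPS, "g")],
    default("talk") + [("+V", EPS), ("+PRET", "e"), (EPS, "d")],
    [("s", "s"), ("i", "a"), ("n", "n"), ("g", "g"), ("+V", EPS), ("+PRET", EPS)],
    default("talk") + [("+V", EPS), ("+PSTPCP", "e"), (EPS, "d")],
    [("s", "s"), ("i", "u"), ("n", "n"), ("g", "g"), ("+V", EPS), ("+PSTPCP", EPS)],
]


def build(paths):
    """Build a trie-shaped FST: state -> {(lexical, surface): next_state}."""
    transitions, finals, next_state = {}, set(), 1
    for path in paths:
        state = 0
        for pair in path:
            arcs = transitions.setdefault(state, {})
            if pair not in arcs:
                arcs[pair] = next_state
                next_state += 1
            state = arcs[pair]
        finals.add(state)
    return transitions, finals


def transduce(fst, symbols, side):
    """side=0 reads the lexical side (generation), side=1 the surface side (analysis)."""
    transitions, finals = fst
    results, stack = set(), [(0, 0, ())]  # (state, input position, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(symbols) and state in finals:
            results.add(out)
        for pair, nxt in transitions.get(state, {}).items():
            src, dst = pair[side], pair[1 - side]
            new_out = out + ((dst,) if dst != EPS else ())
            if src == EPS:  # epsilon on the input side: consume nothing
                stack.append((nxt, pos, new_out))
            elif pos < len(symbols) and symbols[pos] == src:
                stack.append((nxt, pos + 1, new_out))
    return results


def generate(fst, lexical):
    """Map a lexical form such as 'sing +V +PRET' to its surface forms."""
    symbols = []
    for token in lexical.split():
        symbols.extend([token] if token.startswith("+") else list(token))
    return {"".join(out) for out in transduce(fst, symbols, side=0)}


def analyze(fst, surface):
    """Map a surface form such as 'talked' to its lexical forms."""
    results = transduce(fst, list(surface), side=1)
    return {"".join(" " + s if s.startswith("+") else s for s in out) for out in results}


FST = build(PATHS)
print(generate(FST, "sing +V +PRET"))
print(analyze(FST, "talked"))
```

Running the same machine in both directions shows the generator and recognizer views of an FST; the surface form talked is ambiguous between the preterite and the past participle, so analysis returns two lexical forms.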
Useful FST Operations


Inversion: Switch input and output labels,
e.g. Σ(T)={a:b, c:d} → Σ(inv(T))={b:a, d:c}
Intersection: Only sequences of pairs
accepted by both transducer T1 and
transducer T2 are accepted by transducer
T1^T2.

Composition: The output of transducer T1
serves as input to T2. This is marked as
T1ºT2 or T2(T1).
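Viewing a transducer as a set relator, the three operations have direct set-level counterparts. The sketch below assumes each transducer is given as the finite set of (input, output) string pairs it relates; this is an illustration of what the operations mean, not of how they are computed on automata, and the function names are hypothetical.

```python
def invert(relation):
    """Inversion: swap the input and output side of every pair."""
    return {(out, inp) for inp, out in relation}


def intersect(r1, r2):
    """Intersection: keep only the pairs related by both transducers."""
    return r1 & r2


def compose(r1, r2):
    """Composition: feed every output of r1 into r2 as input."""
    return {(x, z) for x, y in r1 for y2, z in r2 if y == y2}


T = {("a", "b"), ("c", "d")}
print(invert(T))                            # pairs b:a and d:c
print(compose({("a", "b")}, {("b", "c")}))  # pair a:c
```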
Spelling Rules and FSTs
Name                Description of Rule                                  Example
Consonant doubling  1-letter consonant doubled before -ing/-ed           beg/begging
E deletion          Silent e dropped before -ing and -ed                 make/making
E insertion         e added after -s, -z, -x, -ch, -sh before -s         watch/watches
Y replacement       -y changes to -ie before -s, and to -i before -ed    try/tries
K insertion         verbs ending with vowel + -c add -k                  panic/panicked
Three levels

Add an intermediate level between the
lexical and surface levels
Lexical:       k i s s +V +3SG
Intermediate:  k i s s ^ s #
Surface:       k i s s e s
FST for the E-insertion Rule
Rule: ε → e / {x, s, z} ^ __ s #
[Figure: transducer with states q0–q5; arc labels: z, s, x; z, x; s; ^:ε; ε:e; #; other]
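The effect of the E-insertion rule (insert e after x, s or z at a morpheme boundary ^ when a word-final s follows) can be imitated on intermediate-level strings with a regular expression. This is a sketch that reproduces the rule's input/output behaviour, not the transducer itself; the function name `e_insertion` is hypothetical.

```python
import re


def e_insertion(intermediate):
    """Map an intermediate form such as 'kiss^s#' to its surface form."""
    # Insert e at a boundary ^ that follows x, s or z and precedes word-final s.
    with_e = re.sub(r"(?<=[xsz])\^(?=s#)", "^e", intermediate)
    # Delete the morpheme boundary ^ and the word boundary #.
    return with_e.replace("^", "").replace("#", "")


print(e_insertion("kiss^s#"))
print(e_insertion("cat^s#"))
```

As in the rule stated here, only x, s and z trigger the insertion; the -ch and -sh contexts from the spelling-rule table would need an extended left context.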
Combination of FSTs (1)
Lexical:       k i s s +V +3SG
               (Lexicon-FST)
Intermediate:  k i s s ^ s #
               (Rule1-FST ... RuleN-FST)
Surface:       k i s s e s
Combination of FSTs (2)
Lexical:       k i s s +V +3SG
               (Lexicon-FST)
Intermediate:  k i s s ^ s #
               (Intersect: Rule1-FST ... RuleN-FST)
Surface:       k i s s e s
Combination of FSTs (3)
Lexical:       k i s s +V +3SG
               (Compose: Lexicon-FST with the intersected rule FSTs)
Intermediate:  k i s s ^ s #
               (Intersect: Rule1-FST ... RuleN-FST)
Surface:       k i s s e s
Intersection and Composition

For each state qi in transducer T1 and state
qj in transducer T2, create a new state qij.

Intersection: For any pair a:b, if T1 transitions
from qi to qn with a:b, and T2 transitions from qj
to qm with a:b, then T1^T2 transitions from qij to
qnm with a:b.
Composition: If T1 transitions from qi to qn with
the pair a:b, and T2 transitions from qj to qm with
the pair b:c, then T1ºT2 transitions from qij to
qnm with the pair a:c.
Lexicon-Free FSTs


Used in information retrieval
E.g. the Porter algorithm, which is based
on a series of simple cascaded rewrite rules:

ATIONAL → ATE (relational → relate)
ING → ε if stem contains vowel (motoring →
motor)
Errors occur:

organization → organ, doing → doe, university →
universe
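The two rewrite rules quoted above can be chained in a few lines. This is a sketch of the cascading idea only, not the Porter algorithm: the real algorithm has many more rules and extra conditions on the stem, and the names `RULES` and `stem` are hypothetical.

```python
import re

VOWEL = re.compile(r"[aeiou]")

# (suffix, replacement, condition on the remaining stem); applied in order.
RULES = [
    ("ational", "ate", lambda base: True),
    ("ing", "", lambda base: VOWEL.search(base) is not None),
]


def stem(word):
    """Apply each rewrite rule in turn; the output of one rule feeds the next."""
    for suffix, replacement, condition in RULES:
        if word.endswith(suffix):
            base = word[: len(word) - len(suffix)]
            if condition(base):
                word = base + replacement
    return word


print(stem("relational"))
print(stem("motoring"))
print(stem("sing"))  # unchanged: the remaining stem "s" contains no vowel
```

No lexicon is consulted at any point, which is why such stemmers are fast and also why they can conflate unrelated words.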
Human Morphological Processing (1)

How are multi-morphemic words represented in the
minds of human speakers?


full-listing hypothesis vs. minimum redundancy
hypothesis
Experiments:


Stanners et al. 1979: a word is recognized faster if it
has been seen before (priming): lifting → lift, burned
→ burn, but selective ↛ select, i.e. different
representations for inflection and derivation.
Marslen-Wilson et al. 1994: spoken derived words can
prime their stems, but only if their meaning is close:
government → govern, but department ↛ depart
Human Morphological Processing (2)

Speech errors: Speakers mix up the
order of words...


e.g. if you break it, it’ll drop
... and also attach affixes to the wrong
stems:


e.g. it’s not only we who have screw looses
(for ”screws loose”)
e.g. easy enoughly (for ”easily enough”)
Exercise (1/3)

Your task is to create a finite-state transducer that can analyze
the following Finnish word forms:
Surface form    Lexical form
talo            talo +NOM
taloon          talo +ILL
talomme         talo +NOM +POS1PL
taloomme        talo +ILL +POS1PL
metsä           metsä +NOM
metsään         metsä +ILL
metsämme        metsä +NOM +POS1PL
metsäämme       metsä +ILL +POS1PL
Exercise (2/3)

The morphological tags have the following meaning:
+NOM = nominative; +ILL = illative; +POS1PL =
possessive, 1st person plural.


Take a look at Fig 3.16, 3.17 and 3.18 in Jurafsky &
Martin. Create three separate finite-state transducers
that you finally combine into one:
a) Create a transducer that operates between the
intermediate and surface level. This transducer handles
the vowel lengthening that is necessary for the illative
form: talo +ILL → talo|on vs. metsä +ILL → metsä|än.
Exercise (3/3)

b) Create a transducer that operates between the
intermediate and surface level. This transducer handles
the deletion of n in front of a possessive ending:
talo + mme → talo|mme vs.
talo|on + mme → talo|o|mme.
c) Create a transducer that operates between the lexical
and the intermediate level. This transducer maps
morphological tags onto endings.
d) Combine all the transducers into one.
Present your transducers as graphs or tables (cf. Fig. 3.15
in Jurafsky & Martin)