
Morphological Recognition
• We take the sub-lexicon of each stem class and expand
each arc (e.g. the reg-noun arc) with all the morphemes
that make up the set of stems in that word class (e.g.
reg-noun).
• The result is an FSA that can be used for
morphological recognition.
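The expansion described above can be sketched in a few lines of Python. This is a minimal illustration, not the construction from the slides: the stems, the class names other than reg-noun, and the state numbering are all assumptions chosen for the example.

```python
# Hypothetical sub-lexicons; the stems are illustrative only.
LEXICON = {
    "reg-noun": {"cat", "dog", "fox"},
    "irreg-sg-noun": {"goose", "mouse"},
    "irreg-pl-noun": {"geese", "mice"},
    "plural": {"s"},
}

# Morphotactic FSA over word classes: state -> {class: next state}.
ARCS = {
    0: {"reg-noun": 1, "irreg-sg-noun": 2, "irreg-pl-noun": 2},
    1: {"plural": 2},
}
FINAL = {1, 2}


def recognize(word, state=0):
    """Return True if some sequence of morphemes spells `word`
    and ends in a final state."""
    if not word:
        return state in FINAL
    # Expand each class arc with every morpheme in its sub-lexicon.
    for cls, nxt in ARCS.get(state, {}).items():
        for morpheme in LEXICON[cls]:
            if word.startswith(morpheme) and recognize(word[len(morpheme):], nxt):
                return True
    return False
```

With this toy lexicon, recognize("cats") and recognize("geese") succeed while recognize("gooses") fails. Note that recognize("foxs") also succeeds, because no spelling rules are applied at this stage.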
Two-level Morphology
• Ideally, for morphological parsing we would like to input a
word and get as output its stem together with morphological
information, e.g. cats -> cat + N + PL
• Two-level morphology represents a word as the
correspondence between the lexical and the surface level.
Finite State Transducer (FST)
• An FST is an automaton that we use for performing the
mapping between the two levels.
• An FST is an automaton with two tapes that recognizes or
generates pairs of strings; it therefore defines a relation
between strings.
• Another view of an FST is as a machine that reads one
string and generates another string.
Formal FST definition
• Extension of the FSA definition
– Q: a finite set of states (q0, q1, q2, …)
– Σ: a finite alphabet of complex symbols, i.e. i:o pairs where i is a
symbol from the input alphabet and o a symbol from the output
alphabet (ε may be part of both the input and output alphabets)
– q0: the start state (first state)
– F: the set of final states (a subset of Q)
– δ(q, i:o): the transition function from states and complex
symbols to states. Given a state q and a complex symbol i:o, it
returns a new state q’.
• e.g. Σ = {a:a, b:b, !:!, a:!, a:b, b:a, a:ε, ε:!}
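One possible way to write the five-part definition down as a data structure is sketched below. The class name FST, its field names, and the toy machine are assumptions made for illustration; i:o pairs are represented as Python tuples (i, o), and ε could be represented by the empty string.

```python
from dataclasses import dataclass


@dataclass
class FST:
    states: set      # Q
    alphabet: set    # Σ, a set of (i, o) pairs
    start: str       # q0
    finals: set      # F, a subset of Q
    delta: dict      # δ: (state, (i, o)) -> state

    def accepts(self, pairs):
        """Check whether a sequence of i:o pairs leads from the
        start state to a final state."""
        state = self.start
        for pair in pairs:
            nxt = self.delta.get((state, pair))
            if nxt is None:  # no transition defined for this pair
                return False
            state = nxt
        return state in self.finals


# Hypothetical toy machine: swaps a and b, then requires a final "!".
toy = FST(
    states={"q0", "q1"},
    alphabet={("a", "b"), ("b", "a"), ("!", "!")},
    start="q0",
    finals={"q1"},
    delta={
        ("q0", ("a", "b")): "q0",
        ("q0", ("b", "a")): "q0",
        ("q0", ("!", "!")): "q1",
    },
)
```

Here toy.accepts([("a", "b"), ("b", "a"), ("!", "!")]) holds, i.e. the machine relates the string ab! to ba!, while a sequence containing the pair a:a is rejected because δ is undefined for it.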
Useful FST Properties
• Inversion: The inversion of a transducer simply switches
the input and output labels of the transducer (the two
tapes). Therefore it is very easy to transform an FST from a
parser into a generator.
• Composition: Given two FSTs, T1 that maps from I to C
and T2 that maps from C to O, their composition is a new
transducer T1 o T2 that maps from I to O. Therefore, if we
have a number of FSTs that run serially, it is possible to
build a single FST that maps from the initial input to the
final output.
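Both properties can be sketched over a simple arc-list representation. The sketch below assumes transducers are given as sets of (source, input, output, target) arcs and that neither machine uses ε, since ε arcs need extra care in composition. The function names and the toy machines T1 and T2 are hypothetical.

```python
def invert(arcs):
    """Swap the input and output label on every arc."""
    return {(p, o, i, q) for (p, i, o, q) in arcs}


def compose(arcs1, arcs2):
    """Product construction for ε-free transducers: an arc of T1 pairs
    with an arc of T2 when T1's output symbol equals T2's input symbol."""
    return {
        ((p1, p2), i, o, (q1, q2))
        for (p1, i, c1, q1) in arcs1
        for (p2, c2, o, q2) in arcs2
        if c1 == c2
    }


def transduce(arcs, start, finals, text):
    """Return the set of all output strings paired with `text`."""
    results = set()

    def walk(state, pos, out):
        if pos == len(text):
            if state in finals:
                results.add(out)
            return
        for (p, i, o, q) in arcs:
            if p == state and i == text[pos]:
                walk(q, pos + 1, out + o)

    walk(start, 0, "")
    return results


# Toy machines: T1 rewrites a and b as b, T2 rewrites b as c.
T1 = {(0, "a", "b", 0), (0, "b", "b", 0)}
T2 = {(0, "b", "c", 0)}
T12 = compose(T1, T2)
```

Running transduce(T12, (0, 0), {(0, 0)}, "ab") gives {"cc"}, the same result as running T1 and then T2. Inverting T1 and feeding it "bb" gives four outputs (aa, ab, ba, bb), which shows that the inverse of a transducer can be ambiguous even when the original is not.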
Finite State Transducers
• It is convenient to view an FST as having two tapes.
– The upper or lexical tape
– The lower or surface tape
• Each symbol a:b in the FST alphabet expresses how a
symbol from one tape is mapped to a symbol on the other
tape.
• Symbols such as a:a are called default pairs and are
represented simply as a.
FST Morphotactics
FST for English plural formation. ^ marks a morpheme
boundary and # a word boundary.
FST Lexicon
Combining FST Lexicon and Morphotactics
• The two FSTs for lexicon and morphotactics can be
cascaded, i.e. the input is run through the lexicon FST and
then the output is run through the morphotactics FST.
• Based on the composition property, it is possible to
compose these two FSTs into a single FST that maps
directly from the lexical to the surface level (without any
reference to word classes).
Orthographic Rules
• The previous FST will accept the word foxs and reject the
word foxes.
• We need a way to deal with the spelling changes that often
take place at morpheme boundaries. This is done by
introducing orthographic rules. E.g. for English
– e is inserted after -s, -z, -x, -ch, -sh before -s.
– -y becomes -ie before -s.
• Formal rule notation: a -> b / c __ d means “rewrite a as b
when it occurs between c and d”.
– ε -> e / {x,s,z}^ __ s#
(insert e after a morpheme-final x, s or z and before a word-final s)
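As a rough illustration, the e-insertion rule can be imitated with a regular-expression substitution over an intermediate form such as fox^s#. This is a sketch of the rule's effect, not the FST itself. The function names are hypothetical, and only the x, s, z contexts of the formal rule are covered (not -ch or -sh).

```python
import re


def e_insertion(intermediate):
    """Apply ε -> e / {x,s,z}^ __ s# to an intermediate form.

    The left context (x, s or z plus the boundary ^) is captured and
    re-emitted; the right context s# is checked with a lookahead so it
    is not consumed.
    """
    return re.sub(r"([xsz])\^(?=s#)", r"\1^e", intermediate)


def strip_boundaries(intermediate):
    """Delete the morpheme (^) and word (#) boundary symbols."""
    return intermediate.replace("^", "").replace("#", "")
```

For example, e_insertion("fox^s#") yields fox^es#, which strip_boundaries turns into foxes, while cat^s# is left unchanged and becomes cats.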
Orthographic Rules and FST
• The spelling rule can be seen as taking a simple
concatenation of morphemes (intermediate level) and
producing the surface form of the word.
• The previous orthographic rule can be represented as an
FST.
• Transition table for the previous FST.
State/Input   s:s   x:x   z:z   ^:ε   ε:e   #    other
q0:           1     1     1     0     -     0    0
q1:           1     1     1     2     -     0    0
q2:           5     1     1     0     3     0    0
q3            4     -     -     -     -     -    -
q4            -     -     -     -     -     0    -
q5            1     1     1     2     -     -    0
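The transition table can be run directly as a table-driven recognizer over aligned intermediate:surface pairs. The sketch below makes two assumptions that the slides do not state explicitly: the colon after q0, q1 and q2 is taken to mark final states, and the "other" column is taken to cover any identity pair (such as f:f) that has no column of its own. All names are hypothetical.

```python
EPS = ""  # stand-in for ε

# Column order of the table; "other" is the last column.
COLUMNS = [("s", "s"), ("x", "x"), ("z", "z"), ("^", EPS), (EPS, "e"), ("#", "#")]
OTHER = len(COLUMNS)

# One row per state, copied from the table; None stands for "-".
TABLE = {
    0: [1, 1, 1, 0, None, 0, 0],
    1: [1, 1, 1, 2, None, 0, 0],
    2: [5, 1, 1, 0, 3, 0, 0],
    3: [4, None, None, None, None, None, None],
    4: [None, None, None, None, None, 0, None],
    5: [1, 1, 1, 2, None, None, 0],
}
FINAL = {0, 1, 2}  # assumption: states written with a colon are final


def column(pair):
    """Map an i:o pair to its table column, or None if it has none."""
    if pair in COLUMNS:
        return COLUMNS.index(pair)
    i, o = pair
    # Assumption: "other" means any remaining identity pair.
    return OTHER if i == o and i != EPS else None


def accepts(pairs):
    """Run the table over a sequence of intermediate:surface pairs."""
    state = 0
    for pair in pairs:
        col = column(pair)
        if col is None or TABLE[state][col] is None:
            return False
        state = TABLE[state][col]
    return state in FINAL


def identity(text):
    """Pair every symbol of `text` with itself."""
    return [(ch, ch) for ch in text]


# fox^s# aligned with foxes#, and with the incorrect foxs#.
foxes = identity("fox") + [("^", EPS), (EPS, "e")] + identity("s#")
foxs = identity("fox") + [("^", EPS)] + identity("s#")
cats = identity("cat") + [("^", EPS)] + identity("s#")
```

Under these assumptions accepts(foxes) and accepts(cats) hold, while accepts(foxs) fails: after x:x and ^:ε the machine is in q2, s:s leads to q5, and q5 has no transition on #.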
Combining FST Lexicon and Rules
• First the lexicon FST maps between the lexical level and
the intermediate level which is just a concatenation of
morphemes.
• Then, a number of spelling rule FSTs run in parallel (or as
a cascade) mapping from the intermediate level to the
surface level.
• The lexicon FST and the orthographic rules FST form a
cascade. This can be run top-down (generation) or
bottom-up (parsing).
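The top-down (generation) direction of this cascade can be sketched with ordinary functions standing in for the FSTs. Everything named here is an assumption made for illustration: the stems, the tag strings, and the use of regular expressions in place of rule transducers. The two spelling rules are the ones stated earlier (e-insertion, and -y becoming -ie before -s), applied as a cascade.

```python
import re

# Hypothetical lexicon and suffix table.
STEMS = {"cat", "fox", "pony"}
SUFFIXES = {"+N+SG": "#", "+N+PL": "^s#"}


def lexical_to_intermediate(lexical):
    """Lexicon step: e.g. fox+N+PL -> fox^s#."""
    stem, _, tags = lexical.partition("+")
    tags = "+" + tags
    if stem not in STEMS or tags not in SUFFIXES:
        raise ValueError(f"unknown lexical form: {lexical}")
    return stem + SUFFIXES[tags]


def e_insertion(form):
    """ε -> e / {x,s,z}^ __ s#"""
    return re.sub(r"([xsz])\^(?=s#)", r"\1^e", form)


def y_replacement(form):
    """-y becomes -ie before -s, exactly as the rule is stated above."""
    return re.sub(r"y\^(?=s#)", "ie^", form)


RULES = [e_insertion, y_replacement]


def intermediate_to_surface(form):
    """Spelling-rule step: apply each rule in turn, then drop boundaries."""
    for rule in RULES:
        form = rule(form)
    return form.replace("^", "").replace("#", "")


def generate(lexical):
    """Run the whole cascade from the lexical to the surface level."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))
```

With these definitions generate("fox+N+PL") gives foxes, generate("pony+N+PL") gives ponies, and generate("cat+N+SG") gives cat.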
FST Parsing
• Parsing is more complicated than generation because of
ambiguity. E.g. foxes may be parsed both as fox+V+3SG
and as fox+N+PL. Disambiguation cannot be performed
at the lexical level, so both parses should be returned by the
FST.
• Ambiguities also occur during parsing due to ε arcs or
multiple possible paths. This is similar to the case
of NFSAs, and similar search techniques must be
employed.
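The search involved can be sketched with an agenda, in the same way as for a non-deterministic FSA. The transducer below is a small hand-built example that reads the surface tape and writes the lexical tape for the single stem fox. Its arcs and state numbers are hypothetical and are not derived from the lexicon and rule FSTs above. The sketch also assumes there are no cycles of ε arcs, otherwise the loop would not terminate.

```python
EPS = ""  # stand-in for ε

# Arcs: (source, surface input, lexical output, target).
ARCS = [
    (0, "f", "f", 1), (1, "o", "o", 2), (2, "x", "x", 3),
    (3, EPS, "+N", 4), (3, EPS, "+V", 5),    # ε arcs: guess the category
    (4, EPS, "+SG", 8),                      # fox   -> fox+N+SG
    (4, "e", EPS, 6), (6, "s", "+PL", 8),    # foxes -> fox+N+PL
    (5, "e", EPS, 7), (7, "s", "+3SG", 8),   # foxes -> fox+V+3SG
]
FINAL = {8}


def parse(surface):
    """Return every lexical string the transducer pairs with `surface`."""
    results = set()
    agenda = [(0, 0, "")]  # search states: (FST state, input position, output)
    while agenda:
        state, pos, out = agenda.pop()
        if state in FINAL and pos == len(surface):
            results.add(out)
        for src, inp, outp, dst in ARCS:
            if src != state:
                continue
            if inp == EPS:
                # ε arc: move without consuming input.
                agenda.append((dst, pos, out + outp))
            elif pos < len(surface) and surface[pos] == inp:
                agenda.append((dst, pos + 1, out + outp))
    return results
```

Here parse("foxes") returns both fox+N+PL and fox+V+3SG, parse("fox") returns only fox+N+SG, and parse("foxs") returns the empty set. Choosing between the two analyses of foxes is left to a later component.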