Writing Lexical Transducers Using xfst Overview of Transduction Review of xfst Rules Creating Two-Level Lexicons Putting it All Together Beesley 2000
Slide 1
Writing Lexical Transducers Using
xfst
Overview of Transduction
Review of xfst Rules
Creating Two-Level Lexicons
Putting it All Together
Beesley 2000
Slide 2
Theory-Neutral Morphological Analysis
[Diagram: Analyses (top), Black-Box Morphological Analyzer (middle), Words (bottom)]
Slide 3
Finite-State Transducers (FSTs)
• An FST encodes a Regular Relation, i.e. a relation
between two regular languages.
• FSTs can be used for morphological analysis, if
• The set of surface words (strings) to be analyzed is a regular
language, and
• The “analyses” are also defined to be a regular language, i.e.
just another set of strings
[Diagram: an FST relating the Analysis String Language (top) to the Surface String Language (bottom)]
Slide 4
What Do the Two Languages Look
Like?
• In commercial natural-language processing
• The surface language (e.g. French words written in the standard French
orthography) is usually a given.
• Periodic official spelling reforms may require fixes to your analyzer.
• You may have to worry about national variations.
• In contrast, the analysis-language strings must be designed by the
linguist. In the most common Xerox convention, each analysis
string consists of the traditional dictionary-citation baseform
followed by multicharacter-symbol “tags”.
cantar+Verb+PInd+1P+Sg
canto+Noun+Masc+Sg
alto+Adj+Fem+Pl
Slide 5
Non-Commercial (Lesser-Studied)
Languages
1. All normal human beings speak a natural language, but there is
nothing necessary or natural about reading and writing.
2. An orthography is a set of symbols, and conventions for using
them, for “making language visible”.
3. Orthographies are technologies, like agriculture or metalworking.
4. Most languages have never been written, i.e. there is no standard
orthography; or linguists and governments may have proposed
several competing orthographies.
5. When working with lesser-studied languages, you may have to
choose (or devise) a surface orthography for use in your
morphological analyzer.
Slide 6
Two Main Tasks to Morphology
• Morphotactics
• Describe the structure/grammar of words
• Classic finite-state operations required
– Concatenation of one morpheme to the next
– Union of morphemes within classes
• Some languages require other finite-state operations
– Arabic stems require intersection
– Malay requires special algorithms for reduplication
• Phonological/Orthographical Alternation
• Union and concatenation by themselves tend to build abstract
morphophonemic strings
• Use finite-state rules to map from underlying (or “lexical”)
morphophonemic strings to surface strings
Slide 7
Describing Morphotactics Using
Regular Expressions
Some very simple morphotactics can be described using just union,
concatenation and perhaps optionality.
Simple Esperanto Verbs
Opt. Prefix: ne, mal
Req. Root: don, dir, pens, ir, ...
Opt. Aspect: ad
Req. Verb Ending: as, is, os, us, u, i
Slide 8
Esperanto Verb Morphotactics
xfst[]: read regex
(n e | m a l)
[d o n | d i r | p e n s | i r]
(a d)
[a s | i s | o s | u s | u | i] ;
• Each morpheme class is a unioned list of morphemes.
• Optional classes are surrounded with parentheses.
• Then morpheme classes are concatenated together, in the right
order.
Slide 9
Esperanto Verb Morphotactics,
Version 2 (xfst script)
xfst[]: define Prefix n e | m a l ;
xfst[]: define Root d o n | d i r | p e n s | i r ;
xfst[]: define Aspect a d ;
xfst[]: define VSuff a s | i s | o s | u s | u | i ;
xfst[]: read regex (Prefix) Root (Aspect) VSuff ;
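The morphotactics above can be sketched outside xfst as plain set construction. The following is a rough Python illustration, not xfst code: the list names mirror the slide's definitions, the empty string stands in for an optional class that is skipped, and the function name esperanto_verbs is invented for this sketch.

```python
from itertools import product

# Morpheme classes copied from the slide; "" models an optional class being skipped.
PREFIX = ["", "ne", "mal"]
ROOT = ["don", "dir", "pens", "ir"]
ASPECT = ["", "ad"]
VSUFF = ["as", "is", "os", "us", "u", "i"]


def esperanto_verbs():
    """Return every string in the language (Prefix) Root (Aspect) VSuff."""
    # product() concatenates one choice per class, in the fixed class order.
    return {"".join(parts) for parts in product(PREFIX, ROOT, ASPECT, VSUFF)}


verbs = esperanto_verbs()
print("malpensadus" in verbs)  # membership test on the enumerated language
print("donmal" in verbs)       # morphemes in the wrong order are rejected
```

Enumerating the strings only works because this toy language is finite; the compiled network represents the same set without listing it.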
Slide 10
Morphophonological/Orthographical
Alternations
• If simple concatenation doesn’t produce valid words,
then we need to handle alternations.
• In today’s exercises, we will use Replace Rules, e.g. if
Spanish pluralization is done by concatenating [ %+ s]
to a noun, we will need to fix cases like the following:
[Diagram: the lexical string pez+s is composed (.o.) with the rule below; the resulting FST maps upper pez+s to lower peces]
z %+ -> c e || _ s .#.
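The downward effect of this one rule can be approximated with an ordinary regular-expression substitution. This is a minimal Python sketch under stated assumptions: it models only the downward direction of the rule z %+ -> c e || _ s .#., and the function name pluralize_surface is hypothetical.

```python
import re


def pluralize_surface(lexical):
    """Rewrite 'z+' as 'ce' when it is followed by a word-final 's'."""
    # The lookahead plays the role of the right context '_ s .#.'.
    return re.sub(r"z\+(?=s\Z)", "ce", lexical)


print(pluralize_surface("pez+s"))   # the case from the slide
print(pluralize_surface("gato+s"))  # no 'z+', so this rule leaves it alone
```

A string such as gato+s passes through unchanged, so in a full grammar further rules would still be needed to deal with the remaining + boundary.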
Slide 11
The Simplest Xerox Replace Rules
Schema:
upper -> lower || left _ right
where upper, lower, left and right are regular expressions denoting
regular languages (not relations!)
Remember to use regular-expression syntax. Replace Rules are
regular expressions! The overall Replace Rule denotes a relation.
E.g.
s -> z || [ a | e | i | o | u ] _ [ a | e | i | o | u ]
A context can be left empty, which is equivalent to a context of ?*
E.g.
s -> z || _ m
p -> m || m _
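As a rough illustration of how the left and right contexts constrain a replacement, the intervocalic rule s -> z || [ a | e | i | o | u ] _ [ a | e | i | o | u ] can be imitated in Python with lookbehind and lookahead assertions. The names VOWELS and voice_s are assumptions made for this sketch, and only the downward direction is modeled.

```python
import re

VOWELS = "aeiou"


def voice_s(upper):
    """Replace s with z only between two vowels; contexts are not consumed."""
    # Lookbehind = left context, lookahead = right context. Both are checked
    # against the input (upper-side) string, never against the output.
    pattern = rf"(?<=[{VOWELS}])s(?=[{VOWELS}])"
    return re.sub(pattern, "z", upper)


print(voice_s("casa"))
print(voice_s("pasta"))
```

Because the assertions do not consume characters, overlapping contexts such as the two s symbols in asasa are both matched.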
Slide 12
The Simplest Replace Rules II
Referring to the beginning or the end of a word:
z -> s || _ .#.
e -> i || _ (s) .#.
e -> i || .#. p _ r
A rule may be unconditioned, with no context at all:
c h -> %$
s s -> s
Do not write “ss” or “ch” in regular expressions unless you want
them to be treated as single symbols. Remember to “unspecialize”
special symbols with a preceding percent sign (as in %$) when you
want a literal dollar sign, etc.
Slide 13
Rule Abbreviations
Instead of two rules:
e -> i || _ (s) .#.
o -> u || _ (s) .#.
You can write:
e -> i , o -> u || _ (s) .#.
A comma separates the “left-hand sides” of the rule.
Instead of two rules:
e -> i || _ (s) .#.
e -> i || .#. p _ r
You can write:
e -> i || _ (s) .#. , .#. p _ r
A comma separates the “right-hand sides” of the rule.
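Both abbreviations can be imitated loosely in Python, assuming only the downward direction matters. In this sketch the names RAISE, raise_final_vowel, E_TO_I and e_to_i are invented; a lookup table stands in for the comma-separated replacements, and a regex alternation stands in for the comma-separated contexts.

```python
import re

# e -> i , o -> u || _ (s) .#.   (two replacements sharing one context)
RAISE = {"e": "i", "o": "u"}


def raise_final_vowel(upper):
    """Raise e or o when it is word-final or followed only by a final s."""
    return re.sub(r"[eo](?=s?\Z)", lambda m: RAISE[m.group()], upper)


# e -> i || _ (s) .#. , .#. p _ r   (one replacement with two contexts)
E_TO_I = re.compile(r"e(?=s?\Z)|(?<=^p)e(?=r)")


def e_to_i(upper):
    """Rewrite e as i in either of the two contexts."""
    return E_TO_I.sub("i", upper)


print(raise_final_vowel("gatos"))
print(e_to_i("peres"))
```

A single pass over the input is used in both functions so that the replacements apply side by side rather than one after another.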
Slide 14
Simple Replace-Rule Semantics
upper -> lower || leftcontext _ rightcontext
• The overall rule denotes a finite-state relation (not an algorithm)
• The upper-side language of a -> relation is the universal language (?*)
• By default, all symbols on the upper side are mapped to the same symbol on the
lower side
• But IF a string on the upper side contains a designated “upper” string, in the
designated context, then it is mapped to a string (or strings) on the lower side
where the matched substring is replaced by the designated “lower” string.
• The context must “match” on the upper side string
• A right-arrow -> rule has a downward orientation.
Slide 15
Understanding Replace Rules
xfst> read regex a -> b ;
xfst> apply down a
xfst> apply down aaa
xfst> apply down dog
xfst> apply up b
xfst> apply up bbb
xfst> apply up dog

xfst> read regex a:b ;
xfst> apply down a
xfst> apply down aaa
xfst> apply up b
xfst> apply up bbb
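The behaviour of the a -> b session can be reasoned about with a small Python model of the relation. This is a sketch under the semantics described on the previous slide (the upper side is the universal language and every a is obligatorily mapped to b); the function names apply_down and apply_up only borrow the xfst command names and are not xfst itself.

```python
from itertools import product


def apply_down(upper):
    """Lower-side strings for 'a -> b': every a is rewritten as b."""
    return {upper.replace("a", "b")}


def apply_up(lower):
    """Upper-side strings for 'a -> b'."""
    # No upper string can yield a lower string that still contains an a.
    if "a" in lower:
        return set()
    # Each lower b may come from an upper a or from an upper b; every other
    # symbol maps to itself.
    choices = [("a", "b") if ch == "b" else (ch,) for ch in lower]
    return {"".join(combo) for combo in product(*choices)}


print(apply_down("aaa"))
print(sorted(apply_up("b")))
print(len(apply_up("bbb")))
```

By contrast, a:b relates only the single string a to the single string b, so applying it down to aaa or up to bbb yields no result.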
Slide 16
Review of Notations for Transducers
The cross-product operator:
[ u p p e r .x. l o w e r ]
In general, for any two regular expressions A and B
denoting languages:
A .x. B
For convenience, we can also write
a:b                equivalent to   [ a .x. b ]
%+Tag:{ing}        equivalent to   [ %+Tag .x. i n g ]
{upper}:{lower}    equivalent to   [ u p p e r .x. l o w e r ]
Slide 17
Esperanto Verb Morphotactics,
Version 3; A Lexicon with Two Levels
xfst[]: define Prefix Neg%+:{ne} | Op%+:{mal} ;
xfst[]: define Root d o n | d i r | p e n s | i r ;
xfst[]: define Aspect %+Cont:{ad} ;
xfst[]: define VSuff %+Pres:{as} | %+Past:{is} | %+Fut:{os} |
%+Cond:{us} | %+Subj:u | %+Inf:i ;
xfst[]: read regex (Prefix) Root (Aspect) VSuff ;
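The two-level lexicon can be pictured as a set of (upper, lower) string pairs built by concatenating one pair per morpheme class. The Python sketch below is an illustration only: it enumerates the pairs explicitly, which is feasible because this toy lexicon is finite, and the names build_lexicon, LEXICON and analyze are hypothetical.

```python
from itertools import product

# (upper, lower) pairs per class, following the Version 3 definitions.
# ("", "") models an optional class being skipped.
PREFIX = [("", ""), ("Neg+", "ne"), ("Op+", "mal")]
ROOT = [(r, r) for r in ("don", "dir", "pens", "ir")]
ASPECT = [("", ""), ("+Cont", "ad")]
VSUFF = [("+Pres", "as"), ("+Past", "is"), ("+Fut", "os"),
         ("+Cond", "us"), ("+Subj", "u"), ("+Inf", "i")]


def build_lexicon():
    """Map each surface (lower) string to the set of its analyses (upper)."""
    lexicon = {}
    for parts in product(PREFIX, ROOT, ASPECT, VSUFF):
        upper = "".join(u for u, _ in parts)
        lower = "".join(l for _, l in parts)
        lexicon.setdefault(lower, set()).add(upper)
    return lexicon


LEXICON = build_lexicon()


def analyze(surface):
    """Rough analogue of 'apply up': look up the analyses of a surface word."""
    return LEXICON.get(surface, set())


print(analyze("malpensadus"))
print(analyze("maldonadis"))
```

Looking up a surface word plays the role of apply up; the inverse lookup, from analysis string to surface word, would correspond to apply down.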
Slide 18
Esperanto Verb Transducer
[Diagram: state network of the Esperanto verb transducer, with arcs labeled by tags, letters and 0]
Apply up: malpensadus
Slide 19
The Usual Strategy: Define a dictionary and alternation rules
Upper: Op+don+Cont+Past
Lower: maldonadis
[Diagram: Dictionary Transducer .o. Alternation Rules, yielding the Final FST]
As necessary, apply alternation rules via composition.
Slide 20
The Bambona Language
Review the Xerox regular-expression syntax.
Review the difference between
• regular expression file
– contains a single regular expression, ends with a semicolon and
newline
– xfst[]: read regex < myfile.regex
• script file
– contains a list of commands to xfst (including perhaps “define” and
“read regex” commands)
– xfst[]: source myfile.script
Read the description carefully (not just the final test data).
Describe the morphotactics using union and concatenation.
Handle the variations using replace rules.