Finite-State Methods in Natural Language Processing Lauri Karttunen LSA 2005 Summer Institute


Finite-State Methods in Natural
Language Processing
Lauri Karttunen
LSA 2005 Summer Institute
July 20, 2005
Course Outline
July 18:
Intro to computational morphology
XFST
Readings
Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J.
Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
Karttunen and Beesley, “25 Years of Finite-State Morphology”
Chapter 1: “Gentle Introduction” (B&K)
July 20:
Regular expressions
More on XFST
Readings
Chapter 2: “Systematic Introduction”
Chapter 3: “The XFST interface”
July 25:
Concatenative morphotactics
Constraining non-local dependencies
Readings
Chapter 4. “The LEXC Language”
Chapter 5. “Flag Diacritics”
July 27:
Non-concatenative morphotactics
Reduplication, interdigitation
Readings
Chapter 8. “Non-Concatenative Morphotactics”
August 1:
Realizational morphology
Readings
Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm
Structure. Cambridge U. Press. 2001. (An excerpt)
Lauri Karttunen, “Computing with Realizational Morphology”, Lecture
Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.),
205-216, Springer Verlag. 2003.
August 3:
Optimality theory
Readings
Paul Kiparsky “Finnish Noun Inflection” Generative Approaches to Finnic
and Saami Linguistics, Diane Nelson and Satu Manninen (eds.),
pp.109-161, CSLI Publications, 2003.
Nine Elenbaas and René Kager. "Ternary rhythm and the lapse
constraint". Phonology 16. 273-329.
Scripting xfst
xfst -l myscript
    Start XFST, execute myscript, then wait for more commands
    from the command line.
xfst -f myscript
    Execute myscript and exit.
xfst -e "echo Welcome" \
     -e "regex a b c;" \
     -e "save foo" \
     -stop
    Execute the commands in the given order. The commands
    must be on the same line. The -stop at the end is required
    to make xfst quit.
Numeral Script
# This script constructs the language of English
# numerals from "one" to "ninety-nine".
# This is a comment.
# From "one" through "nine":
define OneToNine [{one} | {two} | {three} | {four} |
{five} | {six} | {seven} | {eight} |
{nine}];
# It is convenient to define a set of prefixes that
# can be followed either by "teen" or by "ty".
define TeenTyStem [{thir} | {fif} | {six} |
                   {seven} | {eigh} | {nine}];
Numeral Script (Continued)
# From "ten" to "nineteen"
define Teens [{ten} | {eleven} | {twelve} |
[TeenTyStem | {four}] {teen}];
# Let's define stems that can be followed by "ty".
define TyStem [TeenTyStem | {twen} | {for}];
# TyStem is followed either by "ty" or by "ty-"
# and a number from OneToNine.
define Tens [TyStem [{ty} | {ty-} OneToNine]];
define OneToNinetyNine [ OneToNine | Teens | Tens ];
push OneToNinetyNine
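As a cross-check on the script above, here is a minimal Python sketch (not xfst) that builds the same language as plain sets of strings. The Python names are illustrative assumptions that mirror the defined networks; only the stems themselves come from the script.

```python
# Each list mirrors one "define" in the numeral script.
ONE_TO_NINE = ["one", "two", "three", "four", "five",
               "six", "seven", "eight", "nine"]

# Prefixes that can be followed either by "teen" or by "ty".
TEEN_TY_STEM = ["thir", "fif", "six", "seven", "eigh", "nine"]

# From "ten" to "nineteen".
TEENS = ["ten", "eleven", "twelve"] + [
    stem + "teen" for stem in TEEN_TY_STEM + ["four"]
]

# Stems that can be followed by "ty".
TY_STEM = TEEN_TY_STEM + ["twen", "for"]

# A TyStem followed by "ty", or by "ty-" and a number from ONE_TO_NINE.
TENS = [stem + "ty" for stem in TY_STEM] + [
    stem + "ty-" + unit for stem in TY_STEM for unit in ONE_TO_NINE
]

# Union of the three sublanguages.
ONE_TO_NINETY_NINE = set(ONE_TO_NINE) | set(TEENS) | set(TENS)
```

The set has 99 members, one per numeral. Sharing TEEN_TY_STEM between the teens and the tens is what keeps forms such as "fourty" out of the language while still admitting "fourteen" and "forty".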
Number to Numeral
[Figure: analysis and generation for the number 105. Numerals shown: "hundred five", "hundred and five", "one hundred and five".]
NumberToNumeral script
# This script constructs a transducer that relates the
# English numerals "one", "two", ..., "ninety-nine",
# to the corresponding numbers "1", "2", ..., "99".
define OneToNine [1:{one} | 2:{two} | 3:{three} |
4:{four} |5:{five} | 6:{six} |
7:{seven} | 8:{eight} | 9:{nine}];
define TeenTyStem [3:{thir} | 5:{fif} | 6:{six}|
7:{seven} | 8:{eigh} | 9:{nine}];
define Teens [1:0 [{0}:{ten} | 1:{eleven} | 2:{twelve} |
[TeenTyStem | 4:{four}] 0:{teen}]];
NumberToNumeral (Continued)
define TyStem [2:{twen} | TeenTyStem | 4:{for}];
# TyStem is followed either by "ty" paired with a zero
# or by "ty-" mapped to an epsilon and followed by a
# number. Note that {0} means zero and not epsilon.
define Tens [TyStem [{0}:{ty} | 0:{ty-} OneToNine]];
define OneToNinetyNine [ OneToNine | Teens | Tens ];
push OneToNinetyNine
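The relation defined by this script can be imitated, for illustration only, with Python dictionaries that pair digit strings with numerals. This is a sketch under the assumption that a finite mapping is an acceptable stand-in for the transducer; the dictionary names are hypothetical and merely echo the defined networks.

```python
# Digit-to-numeral pairs, mirroring the a:b pairs in the xfst script.
ONE_TO_NINE = {"1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight",
               "9": "nine"}

TEEN_TY_STEM = {"3": "thir", "5": "fif", "6": "six",
                "7": "seven", "8": "eigh", "9": "nine"}

# Teens: a leading "1" on the number side has no counterpart in the numeral.
TEENS = {"10": "ten", "11": "eleven", "12": "twelve"}
for digit, stem in {**TEEN_TY_STEM, "4": "four"}.items():
    TEENS["1" + digit] = stem + "teen"

TY_STEM = {"2": "twen", **TEEN_TY_STEM, "4": "for"}

# Tens: "ty" pairs with a literal zero, "ty-" with nothing,
# and is then followed by a unit.
TENS = {}
for digit, stem in TY_STEM.items():
    TENS[digit + "0"] = stem + "ty"
    for unit_digit, unit in ONE_TO_NINE.items():
        TENS[digit + unit_digit] = stem + "ty-" + unit

# Generation direction (number to numeral) and its inverse (analysis).
NUMBER_TO_NUMERAL = {**ONE_TO_NINE, **TEENS, **TENS}
NUMERAL_TO_NUMBER = {numeral: number
                     for number, numeral in NUMBER_TO_NUMERAL.items()}
```

Looking up NUMBER_TO_NUMERAL corresponds to applying the transducer downward, and NUMERAL_TO_NUMBER to applying it upward. Unlike the dictionaries, a single transducer serves both directions without being inverted explicitly.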
Xerox RE Operators
$         containment
=>        restriction
-> @->    replacement
Make it easier to describe complex languages
and relations without extending the formal
power of finite-state systems.
Containment
$a
Equivalent expression
[?* a ?*]
[Figure: network for $a, with arcs labeled a and ?]
Restriction
a => b _ c
“Any a must be preceded by b and followed by c.”
Equivalent expression
~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
[Figure: network for a => b _ c, with arcs labeled a, b, c and ?]
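The restriction and its equivalent expression can be compared on small inputs with a short Python sketch. Both functions below are illustrative assumptions, not part of xfst: the first states the constraint directly, the second follows the two negated conjuncts of the equivalent expression.

```python
from itertools import product


def restricted(s: str) -> bool:
    """Direct reading: every 'a' is preceded by 'b' and followed by 'c'."""
    return all(
        i > 0 and s[i - 1] == "b" and i + 1 < len(s) and s[i + 1] == "c"
        for i, ch in enumerate(s)
        if ch == "a"
    )


def restricted_by_complement(s: str) -> bool:
    """Reading that follows ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]."""
    # First conjunct fails if some 'a' has a prefix that does not end in 'b'.
    bad_left = any(
        s[i] == "a" and not s[:i].endswith("b") for i in range(len(s))
    )
    # Second conjunct fails if some 'a' has a suffix that does not start with 'c'.
    bad_right = any(
        s[i] == "a" and not s[i + 1:].startswith("c") for i in range(len(s))
    )
    return not bad_left and not bad_right


def all_strings(alphabet: str, max_len: int):
    """Yield every string over the alphabet up to the given length."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)
```

Checking that the two functions agree on every string from all_strings("abcx", 4) is a quick sanity test of the equivalence. It is not a proof.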
Replacement
a b -> b a
“Replace ‘ab’ by ‘ba’.”
Equivalent expression
[[~$[a b] [[a b] .x. [b a]]]* ~$[a b]]
[Figure: transducer for a b -> b a, with arcs labeled a, b, ?, a:b and b:a]
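The equivalent expression factors the input into stretches that contain no "ab", separated by instances of "ab" that are mapped to "ba". A minimal Python sketch of that factorization follows; the function name is a hypothetical one chosen for this example.

```python
def swap_ab(text: str) -> str:
    """Imitate 'a b -> b a' by factoring the input around each 'ab'."""
    # split() yields the unchanged stretches, none of which contains "ab";
    # join() puts "ba" where each "ab" used to be.
    unchanged_stretches = text.split("ab")
    return "ba".join(unchanged_stretches)
```

The replacement is a single pass over the input: swap_ab("aab") gives "aba", and the "ab" that appears in the output is not replaced again.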
Marking
a|e|i|o|u -> %[ ... %]
p o t a t o
p[o]t[a]t[o]
[Figure: transducer for the marking rule, with arcs labeled 0:[, 0:], [, ], a, e, i, o, u and ?]
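For comparison, the same marking can be sketched with Python's standard re module. This only imitates the effect of the rule on a string; the function name is an assumption made for the example.

```python
import re


def mark_vowels(text: str) -> str:
    """Wrap every vowel in square brackets, like a|e|i|o|u -> %[ ... %]."""
    # The matched vowel plays the role of "..." in the xfst rule.
    return re.sub(r"[aeiou]", lambda m: "[" + m.group(0) + "]", text)
```

Applied to "potato", this yields "p[o]t[a]t[o]", matching the slide.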
Multiple Results
a b | b | b a | a b a -> x
(a) b (a) -> x
applied to “aba”

a[b]a    a[b a]    [a b]a    [a b a]
a x a    a x       x a       x

Four factorizations of the input string.
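The four results can be enumerated with a small Python sketch. It assumes the usual reading of the unconditional replace operator: the input is split into replaced instances of the pattern and unchanged stretches, and no unchanged stretch may itself contain an instance of the pattern. The function and its parameters are hypothetical, and the pattern is given as a Python regular expression rather than an xfst one.

```python
import re


def replace_all_ways(pattern: str, replacement: str, text: str) -> set:
    """Return every output of a nondeterministic replacement."""
    pat = re.compile(pattern)
    results = set()

    def walk(pos: int, gap_start: int, out: str) -> None:
        # text[gap_start:pos] is the current unchanged stretch.
        if pos == len(text):
            tail = text[gap_start:]
            if not pat.search(tail):      # stretch must contain no match
                results.add(out + tail)
            return
        # Option 1: leave text[pos] unchanged and move on.
        walk(pos + 1, gap_start, out)
        # Option 2: replace a pattern instance that starts at pos.
        gap = text[gap_start:pos]
        if pat.search(gap):
            return
        for end in range(pos + 1, len(text) + 1):
            if pat.fullmatch(text, pos, end):
                walk(end, end, out + gap + replacement)

    walk(0, 0, "")
    return results
```

Calling replace_all_ways("a?ba?", "x", "aba") returns the four strings from the slide: "axa", "ax", "xa" and "x".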
Directed Replace Operators
guarantee a unique result by constraining
the factorization of the input string by
Direction of the match (rightward or leftward)
Length (longest or shortest)
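@-> Left-to-right, Longest-match Replacement
(a) b (a) @-> x
applied to “aba”

a[b]a    a[b a]    [a b]a    [a b a]
a x a    a x       x a       x

Only the last factorization starts at the leftmost possible position and takes the longest match, so the unique result is x.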
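A left-to-right, longest-match strategy can be sketched in Python as follows. This is an illustrative approximation rather than the xfst implementation: the function name is hypothetical, the pattern is a Python regular expression, and the longest match is found by trying end positions from the far end inward.

```python
import re


def replace_leftmost_longest(pattern: str, replacement: str, text: str) -> str:
    """Scan left to right; at each position replace the longest match."""
    pat = re.compile(pattern)
    out = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so the first hit is the longest.
        for end in range(len(text), pos, -1):
            if pat.fullmatch(text, pos, end):
                out.append(replacement)
                pos = end
                break
        else:
            # No match starts here: copy one symbol unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)
```

With the pattern "a?ba?" and the input "aba", the only result is "x", because the match starting at the first symbol covers the whole string.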
Conditional Replacement
A -> B         L _ R
Replacement    Context
The relation that replaces A by B between L and R leaving
everything else unchanged.
Sources of complexity:
    Replacements and contexts may overlap
    Alternative ways of interpreting “between left and right.”
A -> B || L _ R    both contexts on the input
A -> B // L _ R    left context on the output
A -> B \\ L _ R    right context on the output
Vowel shortening after a long vowel
Left context on the input side (Slovak)
V %: -> V || V %: C* _
v o l + a: v + a: m e:
v o l + a: v + a m e
'we call often'
Left context on the output side (Gidabal)
V %: -> V // V %: C* _
g u n u: m + b a: + d a: ng + b e: +
g u n u: m + b a + d a: ng + b e +
'is certainly right on the stump'
Shortening script
define V [ a | e | i | o | u ];
define C [ b | c | d | f | g | h | j | k | l |
           m | n | p | q | r | s | t | v |
           x | y | z ];
define SlovakShortening %: -> 0 || V %: C* V _ ;
define GidabalShortening %: -> 0 // V %: C* V _ ;
push SlovakShortening
down vola:va:me:
vola:vame
push GidabalShortening
down gunu:mba:da:ngbe:
gunu:mbada:ngbe
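The difference between the two shortening rules can be imitated with a short Python sketch. It is a simulation written for this one rule, not a general model of the || and // operators: the scan goes left to right and checks the left context either against the input consumed so far or against the output produced so far. The function name and the flag are assumptions introduced for the example.

```python
import re

V = "aeiou"
C = "bcdfghjklmnpqrstvxyz"

# Left context of the script's rules: V %: C* V, ending right before the colon.
LEFT_CONTEXT = re.compile(f"[{V}]:[{C}]*[{V}]$")


def shorten(word: str, left_context_on_output: bool) -> str:
    """Delete the length mark ':' of a vowel that follows a long vowel."""
    out = []
    for i, ch in enumerate(word):
        # Slovak style (||) inspects the input; Gidabal style (//) the output.
        seen = "".join(out) if left_context_on_output else word[:i]
        if ch == ":" and LEFT_CONTEXT.search(seen):
            continue                      # shorten: drop the length mark
        out.append(ch)
    return "".join(out)
```

With the context on the input side, "vola:va:me:" becomes "vola:vame". With the context on the output side, "gunu:mba:da:ngbe:" becomes "gunu:mbada:ngbe": the third long vowel survives because the vowel before it has already been shortened on the output side.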
Palatalization and Vowel Raising
Palatalization    tim    --> cim
Vowel Raising     memi   --> mimi
Interaction       temi   --> cimi
                  tememi --> cimimi
Vowel Raising & Palatalization
define C [ b | c | d | f | g | h | j | k | l |
m | n | p | q | r | s | t | v |
x | y | z ];
define Raising e -> i \\ _ C* i ;
define Palatalization t -> c || _ i;
regex Raising .o. Palatalization;
down memi
mimi
down tim
cim
down temi
cimi
down tememi
cimimi
[Figure: the three levels for tememi]
t e m e m i
t i m i m i
c i m i m i
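The cascade can be imitated in Python to see why the right context of Raising has to be checked on the output side. This is a sketch for these two rules only, under the assumption that scanning right to left reproduces the effect of the \\ operator here; the function names are hypothetical.

```python
import re

CONSONANTS = set("bcdfghjklmnpqrstvxyz")


def raise_vowels(word: str) -> str:
    """e -> i before C* i, where the i may itself be a raised e."""
    out = list(word)
    # Scanning right to left lets a raised vowel trigger the next raising.
    for i in range(len(out) - 1, -1, -1):
        if out[i] != "e":
            continue
        j = i + 1
        while j < len(out) and out[j] in CONSONANTS:
            j += 1
        if j < len(out) and out[j] == "i":
            out[i] = "i"
    return "".join(out)


def palatalize(word: str) -> str:
    """t -> c before i."""
    return re.sub(r"t(?=i)", "c", word)


def raise_then_palatalize(word: str) -> str:
    """Apply the rules in the order of 'Raising .o. Palatalization'."""
    return palatalize(raise_vowels(word))
```

For "tememi" the intermediate form is "timimi" and the final form is "cimimi". Palatalization applies only because Raising has already turned the first e into i.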
Making a lexical transducer
[Figure: Morphotactics are written as a lexicon regular expression and alternations as rule regular expressions. The compiler turns them into a lexicon FST and rule FSTs, which composition combines into the lexical transducer (a single FST).]
Finnish Gradation Script
define Stems [ {tukka}| {kakku} | {pappi} | {tippa} |
{katto} | {juttu} |{tikka} | {huppu} |
{rotta} | {nahka} |{lika} | {maku} |
{rako} | {tuke} | {halko} | {jalka} |
{virka} | {lanka} | {linko} | {puku} |
{suku} | {tiuku} | {raaka} |{ripa} |
{sopu} | {tapa} | {kampa} | {rumpu} |
{sampe} | {sota} | {pata} | {kita} |
{rinta} | {kanto} | {ranta} | {ilta} |
{kulta} | {parta} | {kerta} ];
define Case [ "+Part":a | "+Gen":n ];
define Finnish [Stems Case];
Auxiliary definitions
define V [a | e | i | o | u | y | ä | ö];
define C [b | c | d | f | g | h | j | k | l | m | n |
p | q | r | s | t | v | w | x | z];
define Coda [ C [C | .#.] ];
define ClosedSyll [V Coda] ;
Weak form of k
define WeakK
k -> ' || V a _ a Coda, V u _ u Coda
.o.
k -> j || r _ e Coda
.o.
k -> v || u _ u Coda
.o.
k -> g || n _ V Coda
.o.
k -> 0 || \[s|h] _ V Coda ;    # kiskon 'rail',
                               # nahkan 'skin'
Weak form of p
define WeakP
p -> m || m _ V Coda
.o.
p -> v || \[s|p] _ V Coda
.o.
p -> 0 || p _ V Coda;
# piispan 'bishop'
Weak form of t
define WeakT
t -> n || n _ V Coda
.o.
t -> l || l _ V Coda
.o.
t -> r || r _ V Coda
.o.
t -> d || \[s|t] _ V Coda     # koston 'revenge'
.o.
t -> 0 || t _ V Coda ;
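The effect of the WeakT cascade on surface strings can be approximated in Python with one substitution per rule, applied in the same order as the composition. This is a sketch, not a translation of the transducer: it works on the lower-side strings only (for example the stem plus the genitive n), and the names below are assumptions made for the example.

```python
import re

V = "aeiouyäö"
C = "bcdfghjklmnpqrstvwxz"

# Right context shared by all five rules: V Coda, with Coda = C [C | .#.]
CLOSED_SYLLABLE = f"(?=[{V}][{C}](?:[{C}]|$))"

# (pattern, replacement) pairs in the order of the .o. cascade.
WEAK_T_RULES = [
    (f"(?<=n)t{CLOSED_SYLLABLE}", "n"),
    (f"(?<=l)t{CLOSED_SYLLABLE}", "l"),
    (f"(?<=r)t{CLOSED_SYLLABLE}", "r"),
    (f"(?<=[^st])t{CLOSED_SYLLABLE}", "d"),   # \[s|t] on the left
    (f"(?<=t)t{CLOSED_SYLLABLE}", ""),
]


def weak_t(word: str) -> str:
    """Apply the five t rules one after another."""
    for pattern, replacement in WEAK_T_RULES:
        # Lookarounds are evaluated on the string as it stands before this
        # rule applies, like contexts on the input side (||).
        word = re.sub(pattern, replacement, word)
    return word
```

For instance, weak_t("rintan") gives "rinnan" and weak_t("katton") gives "katon", while the partitive "sotaa" is left alone because its final syllable is open.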
Putting it all together
define Gradation
WeakK .o. WeakP .o. WeakT;
regex Finnish .o. Gradation;
print lower-words
echo *** Size of Finnish .o. Gradation
print size
echo *** Size of Finnish
push Finnish
print size
echo *** Size of Gradation
push Gradation
print size
Syllabification
define C [ b | c | d | f ...
define V [ a | e | i | o | u ];
[C* V+ C*] @-> ... "-" || _ [C V]
“Insert a hyphen after the longest instance of the
C* V+ C* pattern in front of a C V pattern.”
s t r u k t u r a l i s m i
s t r u k - t u - r a - l i s - m i
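The syllabification rule can be approximated with a Python regular expression. This is a sketch under two assumptions: the consonant set below is a guess, since the slide elides the full definition of C, and Python's greedy matching is taken to coincide with left-to-right longest match for this particular pattern.

```python
import re

V = "aeiou"
C = "bcdfghjklmnpqrstvwxyz"   # assumed; the slide's definition is elided

# C* V+ C* followed by C V; the lookahead keeps the context unconsumed.
SYLLABLE = re.compile(f"([{C}]*[{V}]+[{C}]*)(?=[{C}][{V}])")


def syllabify(word: str) -> str:
    """Insert a hyphen after each C* V+ C* match that precedes C V."""
    return SYLLABLE.sub(r"\1-", word)
```

Applied to "strukturalismi", this yields "struk-tu-ra-lis-mi", as on the slide. The last syllable gets no hyphen because nothing follows it.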