Finite-State Methods in Natural Language Processing Lauri Karttunen LSA 2005 Summer Institute

August 3, 2005
August 1
Non-concatenative morphotactics
Reduplication, interdigitation
Realizational morphology
Readings
Chapter 8. “Non-Concatenative Morphotactics”
Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure.
Cambridge U. Press. 2001. (An excerpt)
Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in
Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer
Verlag. 2003.
August 3
Optimality theory
Readings
Paul Kiparsky “Finnish Noun Inflection” Generative Approaches to Finnic and
Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp.109-161, CSLI
Publications, 2003.
Nine Elenbaas and René Kager. "Ternary rhythm and the lapse constraint".
Phonology 16. 273-329.
Background
• Two old strains of finite-state (morpho)phonology
    rewrite rules (Chomsky & Halle 1968)
    two-level constraints (Koskenniemi 1983)
• Optimality theory (Prince & Smolensky 1993)
    two-level model with ranked, violable constraints
• Formal Power
    OT is not a finite-state system if it involves unlimited counting of constraint violations (Ellison 1994, Eisner 1997, Frank & Satta 1998).
    But a finite-state model can be useful for OT.
Optimality theory
Prince & Smolensky 1993
eliminate: rules, derivations
introduce: violable ranked constraints
Instant success!
Brief Introduction to OT
Input: A language of underlying lexical forms.
GEN: A function that generates alternate surface realizations for each input form, possibly an infinite set.
Constraints: A finite set of principles, preferably universal, that filter out unwanted realizations.
Ranking: A language-specific ordering of the constraints.
Computational perspective
• Ellison 1994
    OT deals with regular sets and relations: a finite-state system
    constraint transducers mark violations; the marks are sorted and counted
• Tesar 1995
    dynamic programming algorithm for optimal path computations
• Eisner 1997
    two-level typology of optimality constraints: restrict, prohibit
    “FootForm Decomposed”, MIT Working Papers in Linguistics 31:115-143
    proposes Primitive Optimality Theory (no generalized alignment)
• Karttunen 1998
    introduces lenient composition
• Frank & Satta 1998
    prove that OT is regular if the number of violations is bounded
Comparisons
                          Application               Merging
rewrite rules             composition               composition
two-level constraints     intersecting composition  intersection
optimality constraints    lenient composition       lenient composition
Finnish OT Prosody
Lauri Karttunen
CLS-41
April 7, 2005
Finnish Prosody: basic facts
• The nucleus of a Finnish syllable must consist of a short vowel, a long vowel, or a diphthong.
• Main stress is always on the first syllable, secondary stress occurs on non-initial syllables.
• Adjacent syllables are never stressed.
• Stressed syllable is initial in the foot.
ilmoittautuminen   (íl.moit).(tàu.tu).(mì.nen)   ‘registering’ (Nom Sg)
Ternary feet in Finnish
Stress that would fall on a light syllable shifts onto the following heavy syllable, creating a ternary foot.
(ká.las).te.(lèm.me)              ‘we are fishing’
(íl.moit).(tàu.tu).mi.(sès.ta)    ‘registering’ (Ela Sg)
(rá.kas).ta.(jàt.ta).ri.(àn.sa)   ‘his mistresses’ (Par Pl)
Can we get these facts to come out “for free”, from the interaction of independently motivated principles?
Yes!
Paul Kiparsky “Finnish Noun Inflection” Generative Approaches to
Finnic and Saami Linguistics, Diane Nelson and Satu Manninen
(eds.), pp.109-161, CSLI Publications, 2003.
Nine Elenbaas and René Kager. "Ternary rhythm and the lapse
constraint". Phonology 16. 273-329.
Non-OT and OT solutions
It is possible to define a cascade of replace rules that produce the desired result.
http://www.stanford.edu/~laurik/fsmbook/examples/FinnishProsody.html
But, following Kiparsky, we are going to do OT today, and in a more elegant way than is shown at
http://www.stanford.edu/~laurik/fsmbook/examples/FinnishOTProsody.html
Prelude: Built-in Functions in fst
Case conversion
UpCase(
DownCase(
Cap(
AnyCase(
OptUpCase(
OptDownCase(
OptCap(
Cap({hello}) is equivalent to {Hello}
OptUpCase(a:b, L) is equivalent to [a:B | a:b] ;
Symbol manipulation
Explode(
Implode(
regex Explode("+Test") is equivalent to regex {+Test};
Functions: User-defined
The function definition is attached to a symbol ending with (
The definition is any regular expression.
There may be any number of arguments.
define Redup(X) [X X];
define Apply(X, Y) [X .o. Y].l ;
When the function is used in a regular expression, the
arguments are bound and the function is evaluated.
regex Apply({abc}, a -> x || _ b);
print words
xbc
The definition of a function may contain other functions.
Pig Latin
# This script creates a function for translating from
# English to Pig Latin:
# pig -> igpay, brown -> ownbray, script -> iptscray
define C [b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|y|z];
define V [a|e|i|o|u] ;
define Redup(X) [X "." X];
define DelCons(X) [X .o. C+ @-> 0 || .#. _ ];
define TailToAy(X) [X .o. V ?* @-> {ay} || "." C* _ ];
define DelMiddle(X) [X .o. "." -> 0];
define Pig(X) [DelMiddle(TailToAy(DelCons(Redup(X))))];
Demo!
fst -l piglatin.script
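The four-step cascade can also be traced outside fst. What follows is a minimal Python sketch, not the xfst implementation: ordinary regular-expression substitution stands in for the replace rules, and the names redup, del_cons, tail_to_ay, del_middle and pig are chosen here only to mirror the definitions in the script.

```python
import re

CONS = "bcdfghjklmnpqrstvwxyz"   # same inventory as C in the script
VOWELS = "aeiou"                 # same inventory as V in the script

def redup(word):
    """Redup: copy the word, with "." between the two copies."""
    return f"{word}.{word}"

def del_cons(s):
    """DelCons: delete the longest run of consonants at the very start."""
    return re.sub(rf"^[{CONS}]+", "", s)

def tail_to_ay(s):
    """TailToAy: after "." and any consonants, replace the rest with "ay"."""
    return re.sub(rf"(\.[{CONS}]*)[{VOWELS}].*$", r"\1ay", s)

def del_middle(s):
    """DelMiddle: remove the "." separator."""
    return s.replace(".", "")

def pig(word):
    # Same nesting order as Pig(X) in the script.
    return del_middle(tail_to_ay(del_cons(redup(word))))

for w in ("pig", "brown", "script"):
    print(w, "->", pig(w))
# pig -> igpay
# brown -> ownbray
# script -> iptscray
```

For pig the intermediate strings are pig.pig, ig.pig, ig.pay and igpay, one per function in the cascade.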
Computing with OT
Input language .o. GEN
Compose the input language with GEN to produce a mapping from each input form to all of its output candidates.
Constraint 1
Constraint 2
Eliminate suboptimal candidates by applying constraints in the ranked order. At least one output candidate always survives. By what finite-state operation?
Priority union .P.
R = { a:x, b:y }
Q = { b:z, c:w }
R .P. Q = { a:x, b:y, c:w }
All pairs from R and those pairs from Q that do not conflict with the mapping established by R.
R .P. Q = [ R | [~[R.u] .o. Q] ]
Kaplan 1987
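For finite relations the definition can be checked by hand. The sketch below is an illustration only, assuming a relation is represented as a Python set of (upper, lower) pairs; the function name priority_union is introduced here and is not part of any finite-state toolkit.

```python
def priority_union(r, q):
    """R .P. Q for finite relations given as sets of (upper, lower) pairs."""
    r_upper = {upper for upper, _ in r}   # plays the role of R.u
    # Keep all of R, plus the pairs of Q whose upper side R does not map.
    return r | {(u, l) for (u, l) in q if u not in r_upper}

R = {("a", "x"), ("b", "y")}
Q = {("b", "z"), ("c", "w")}
print(sorted(priority_union(R, Q)))
# [('a', 'x'), ('b', 'y'), ('c', 'w')]
```

The pair b:z from Q is dropped because R already maps b; c:w survives because R says nothing about c.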
Lenient Composition .O.
Let R be a relation that maps each input string to
one or more outputs.
Let C be a constraint that eliminates some outputs.
R .O. C is the relation that maps each input string
that can meet the constraint C to the outputs that
meet C and leaves the rest of the relation R
unchanged. (Karttunen 1998)
R .O. C = [ [R .o. C] .P. R ]
Is constraint ranking rule ordering in disguise? Yes.
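The definition of lenient composition can be tried out on a toy relation. This is a sketch under stated assumptions: relations are finite Python sets of (input, output) pairs, the constraint C is modelled as a predicate on output strings, and the names lenient_compose, compose_with_constraint and has_foot, as well as the second input x, are hypothetical.

```python
def priority_union(r, q):
    """R .P. Q: all of R, plus pairs of Q whose input R does not map."""
    r_upper = {u for u, _ in r}
    return r | {(u, l) for (u, l) in q if u not in r_upper}

def compose_with_constraint(r, constraint):
    """R .o. C, with C given as a predicate on output strings."""
    return {(u, l) for (u, l) in r if constraint(l)}

def lenient_compose(r, constraint):
    """R .O. C = [R .o. C] .P. R"""
    return priority_union(compose_with_constraint(r, constraint), r)

def has_foot(output):
    """Toy constraint: the output must contain a foot bracket."""
    return "(" in output

# Hypothetical relation: two inputs with two output candidates each.
R = {("kala", "(ká.la)"), ("kala", "ká.la"), ("x", "p"), ("x", "q")}
print(sorted(lenient_compose(R, has_foot)))
# [('kala', '(ká.la)'), ('x', 'p'), ('x', 'q')]
```

The input kala keeps only the candidate that satisfies the constraint. The input x has no satisfying candidate, so both of its outputs pass through unchanged, which is what guarantees that at least one candidate always survives.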
Need a prolific GEN
kala ‘fish’ (Nom Sg) 33 candidates
ka.la      ka.lá      ka.là      ka.(là)      ka.(lá)
ká.la      ká.lá      ká.là      ká.(là)      ká.(lá)
kà.la      kà.lá      kà.là      kà.(là)      kà.(lá)
(kà).la    (kà).lá    (kà).là    (kà).(là)    (kà).(lá)
(ká).la    (ká).lá    (ká).là    (ká).(là)    (ká).(lá)
(kà.là)    (kà.lá)    (kà.la)
(ká.là)    (ká.lá)    (ká.la) ☜
(ka.là)    (ka.lá)
Basic definitions 1
Using Parc/XRCE regular expression syntax:
define C [b | c | d | f | g | h | j | k | l | m |
          n | p | q | r | s | t | v | w | x | z];        # Consonant
define HighV [u | y | i];                                # High vowel
define MidV [e | o | ö];                                 # Mid vowel
define LowV [a | ä] ;                                    # Low vowel
define USV [HighV | MidV | LowV];                        # Unstressed vowel
define MSV [á | é | í | ó | ú | "y´" | "ä´" | "ö´"];
define SSV [à | è | ì | ò | ù | "y`" | "ä`" | "ö`"];
define SV [MSV | SSV];                                   # Stressed vowel
define V [USV | SV] ;                                    # Vowel
Basic definitions 2
define P [V | C];            # Phone
define B [[\P+] | .#.];      # Boundary
define E .#. | ".";          # Edge
define Light [C* V];         # Light syllable
define Heavy [Light P+];     # Heavy syllable
define S [Heavy | Light];    # Syllable
define SS [S & $SV];         # Stressed syllable
define US [S & ~$SV];        # Unstressed syllable
define MSS [S & $MSV] ;      # Syllable with main stress
GEN 1
define MarkNonDiphthongs [
  [. .] -> "." || [HighV | MidV] _ LowV,    # i.a, e.a
                  LowV _ MidV,              # a.e
                  i _ [MidV - e],           # i.o, i.ö
                  u _ [MidV - o],           # u.e
                  y _ [MidV - ö],           # y.e
                  $V i _ e,                 # poiki.en
                  $V u _ o,                 #
                  $V y _ ö ];               #
Insert a syllable boundary between vowels that cannot form
a diphthong: i.a, e.a, a.e, i.o, u.e, y.e, etc.
define Syllabify C* V+ C* @-> ... "." || _ C V ;
Insert a syllable boundary after a maximal C* V+ C* pattern that is followed by C V.
For example, strukturalismi -> struk.tu.ra.lis.mi.
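The effect of the Syllabify rule can be approximated with an ordinary regular expression. This is only a sketch, assuming unstressed input: a lookahead stands in for the right context, and MarkNonDiphthongs is not modelled, so adjacent vowels are always kept in one syllable. The names SYLLABLE and syllabify are introduced here.

```python
import re

C = "bcdfghjklmnpqrstvwxz"   # consonants, as in the definition of C
V = "aeiouyäö"               # unstressed vowels only (USV)

# C* V+ C* followed by C V. Greedy matching with backtracking gives the
# longest match at each position because C and V are disjoint classes.
SYLLABLE = re.compile(rf"[{C}]*[{V}]+[{C}]*(?=[{C}][{V}])")

def syllabify(word):
    """Insert "." after each leftmost-longest match, scanning left to right."""
    return SYLLABLE.sub(lambda m: m.group(0) + ".", word)

print(syllabify("strukturalismi"))
# struk.tu.ra.lis.mi
```

The final syllable never receives a boundary because nothing that follows it can satisfy the C V context.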
GEN 2
define Stress a (->) á|à, e (->) é|è, i (->) í|ì,
o (->) ó|ò, u (->) ú|ù, y (->) "y´"|"y`",
ä (->) "ä´"|"ä`", ö (->) "ö´"|"ö`";
Optionally stress any vowel with a primary or secondary stress.
define Scan [[S ("." S ("." S)) & $SS] (->) "(" ... ")" || E _ E] ;
Optionally group syllables into unary, binary, or ternary feet when there
is at least one stressed syllable.
define Gen [MarkNonDiphthongs .o. Syllabify .o.
Stress .o. Scan];
Demo!
fst -utf8 -l gen.script
regex {kala} .o. Gen ;      (compose)
print lower-words           (show output candidates)
print size                  (count them)
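The size of the candidate set can be cross-checked by brute-force enumeration. The sketch below is not the Gen transducer: it assumes the input is already syllabified, handles only the vowels a, e, i, o, u (the multi-character stressed symbols for y, ä, ö are left out), and uses the hypothetical helper names stress_variants, footings and gen.

```python
from itertools import product

PRIMARY = {"a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú"}
SECONDARY = {"a": "à", "e": "è", "i": "ì", "o": "ò", "u": "ù"}
STRESSED = set(PRIMARY.values()) | set(SECONDARY.values())

def stress_variants(word):
    """Stress: each vowel stays as is or takes primary or secondary stress."""
    options = [(ch, PRIMARY[ch], SECONDARY[ch]) if ch in PRIMARY else (ch,)
               for ch in word]
    return ["".join(choice) for choice in product(*options)]

def footings(syllables):
    """Scan: optionally group 1-3 syllables into a foot with a stressed syllable."""
    if not syllables:
        return [[]]
    # Option 1: leave the first syllable unfooted.
    results = [[syllables[0]] + rest for rest in footings(syllables[1:])]
    # Option 2: start a unary, binary, or ternary foot here.
    for size in (1, 2, 3):
        group = syllables[:size]
        if len(group) == size and any(ch in STRESSED for s in group for ch in s):
            foot = "(" + ".".join(group) + ")"
            results += [[foot] + rest for rest in footings(syllables[size:])]
    return results

def gen(syllabified):
    """All stress and footing candidates for a syllabified word."""
    return {".".join(parse)
            for variant in stress_variants(syllabified)
            for parse in footings(variant.split("."))}

print(len(gen("ka.la")))
# 33
```

Under the same simplifications the enumeration for ka.las.te.le.mi.nen comes to 70653, which agrees with the candidate count quoted for kalasteleminen in the Conclusion.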
Kiparsky's nine constraints
Clash
AlignLeft
MainStress
FootBin
Lapse
NonFinal
StressToWeight
Parse
AllFeetFirst
Counting constraint violations
We use asterisks to mark constraint violations. We need a way
to prefer candidates with the fewest violation marks.
define Viol ${*};
define Viol0 ~Viol;        # No violations
define Viol1 ~[Viol^2];    # At most one violation
define Viol2 ~[Viol^3];    # At most two violations
define Viol3 ~[Viol^4];
This eliminates the violation marks after the candidate set has
been pruned by a constraint.
define Pardon {*} -> 0;
Defining OT Constraints
Three types:
Unviolable constraints
Primary stress in Finnish
Ordinary violable constraints
Lapse
Gradient alignment constraints
All-Feet-First
Strategy:
We define an evaluation template for each of the three types
and then define the individual constraints with the help
of the templates.
Evaluation Template for Unviolable Constraints
define Unviolable(Candidates, Constraint) [
Candidates
.o.
Constraint
];
Example:
define MainStress(X) Unviolable(X, B MSS ~$MSS);
# B is the left edge of the word or "(".
# MSS is a syllable with a primary stress.
Evaluation Template for Ordinary Constraints
define Eval(Candidates, Violation, Left, Right) [
Candidates
.o.
Violation -> ... {*} || Left _ Right
.O.
Viol3 .O. Viol2 .O. Viol1 .O. Viol0
.o.
Pardon ];
where Viol0 is ~${*}, Viol1 is ~[[${*}]^2], etc., and
Pardon is {*} -> 0, deleting all violation marks.
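The three stages of the template (mark, prune leniently, pardon) can be imitated on a handful of candidates for a single input. This is a rough sketch with hypothetical names (lenient_filter, evaluate, mark_parse); the marking function imitates the Parse constraint by starring every syllable that stands outside a foot.

```python
import re

def lenient_filter(candidates, ok):
    """Keep the candidates that pass; if none pass, keep them all."""
    passed = {c for c in candidates if ok(c)}
    return passed or candidates

def evaluate(candidates, mark, max_viol=3):
    """Mark violations with "*", prune from Viol3 down to Viol0, then pardon."""
    marked = {mark(c) for c in candidates}
    for bound in range(max_viol, -1, -1):
        marked = lenient_filter(marked, lambda c, b=bound: c.count("*") <= b)
    return {c.replace("*", "") for c in marked}   # Pardon: delete the marks

def mark_parse(candidate):
    """Put "*" after each syllable bounded by "." or a word edge on both sides."""
    return re.sub(r"(?:(?<=\.)|^)([^.()]+)(?=\.|$)", r"\1*", candidate)

print(evaluate({"(ká.la)", "(ká).la", "ká.la"}, mark_parse))
# {'(ká.la)'}
print(evaluate({"(ká).la", "ká.la"}, mark_parse))
# {'(ká).la'}
```

In the second call no candidate is free of violations, so the Viol0 step lets the best remaining one through instead of emptying the set. Candidates with more than max_viol marks cannot be told apart, which is the bounded counting that keeps the construction finite-state.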
Evaluation Template for Left-Oriented Gradient Alignment
define EvalGradientLeft(Candidates, Violation, Left, Right) [
Candidates .o.
Violation -> {*} ... || .#. Left _ Right
.o.
Violation -> {*}^2 ... || .#. Left^2 _ Right
.o.
Violation -> {*}^3... || .#. Left^3 _ Right
.o.
Violation -> {*}^4 ... || .#. Left^4 _ Right
.o.
Violation -> {*}^5 ... || .#. Left^5 _ Right
.o.
Violation -> {*}^6 ... || .#. Left^6 _ Right
.o.
Violation -> {*}^7 ... || .#. Left^7 _ Right
.o.
Violation -> {*}^8 ... || .#. Left^8 _ Right
.O.
Viol12 .O. Viol11 .O. Viol10 .O. Viol9 .O. Viol8 .O. Viol7 .O.
Viol6 .O. Viol5 .O. Viol4 .O. Viol3 .O. Viol2 .O. Viol1 .O.
Viol0 .o. Pardon ];
Clash, AlignLeft, MainStress
Clash
No stress on adjacent syllables.
define Clash(X) Eval(X, SS, SS B, ?*);
Align-Left
The stressed syllable is initial in the foot.
define AlignLeft(X) Eval(X, SV, .#. ~[?* "(" C*], ?*);
Main Stress
The primary stress in Finnish is on the first syllable.
define MainStress(X) Unviolable(X, B MSS ~$MSS);
FootBin, Lapse, NonFinal
Foot-Bin
Feet are minimally bimoraic and maximally bisyllabic.
define FootBin(X) Eval(X, "(" Light ")" | "(" S ["." S]^>1,
?*, ?*);
Lapse
Every unstressed syllable must be adjacent to a stressed syllable or to the word edge.
define Lapse(X) Eval(X, US, [B US B], [B US B]);
Non-Final
The final syllable is not stressed.
define NonFinal(X) Eval(X, SS, ?*, ~$S .#.);
StressToWeight, Parse, AllFeetFirst
Stress-To-Weight
Stressed syllables are heavy.
define StressToWeight(X) Eval(X, SS & Light, ?*, ")"| E);
License-σ
Syllables are parsed into feet.
define Parse(X) Eval(X, S, E, E);
All-Ft-Left
The left edge of every foot coincides with the left edge of some prosodic word.
define AllFeetFirst(X) [
EvalGradientLeft(X, "(", $".", ?*) ];
Finnish Prosody
Kiparsky 2003:
define FinnishProsody(Input) [
AllFeetFirst( Parse( StressToWeight(
NonFinal( Lapse( FootBin( MainStress(
AlignLeft( Clash( Input .o. Gen)))))))))];
FinnWords
regex FinnishProsody( {kalastelet} | {kalasteleminen} |
{ilmoittautuminen} | {järjestelmättömyydestänsä} |
{kalastelemme} | {ilmoittautumisesta} |
{järjestelmällisyydelläni} | {järjestelmällistämätöntä} |
{voimisteluttelemasta} | {opiskelija} | {opettamassa} |
{kalastelet} | {strukturalismi} | {onnittelemanikin} |
{mäki} | {perijä} | {repeämä} | {ergonomia} |
{puhelimellani} | {matematiikka} | {puhelimistani} |
{rakastajattariansa} | {kuningas} | {kainostelijat} |
{ravintolat} | {merkonomin} ) ;
Demo!
Result
(ér.go).(nò.mi).a
(íl.moit).(tàu.tu).mi.(sès.ta)
(íl.moit).(tàu.tu).(mì.nen)
(ón.nit).(tè.le).(mà.ni).kin
(ó.pis).(kè.li).ja
(ó.pet).ta.(màs.sa)
(vói.mis).te.(lùt.te).le.(màs.ta)
(strúk.tu).ra.(lìs.mi)
(rá.vin).(tò.lat)
(rá.kas).ta.(jàt.ta).ri.(àn.sa)
(ré.pe).(ä`.mä)
(pé.ri).jä
(pú.he).li.(mèl.la).ni
(pú.he).li.(mìs.ta).ni
(mä’.ki)
(má.te).ma.(tìik.ka)
(mér.ko).(nò.min)
(kái.nos).(tè.li).jat
(ká.las).te.(lèm.me)
(ká.las).te.(lè.mi).nen
(ká.las).(tè.let)
(kú.nin).gas
(jä’r.jes).tel.(mä`l.li).syy.(dèl.lä).ni
(jä’r.jes).(tèl.mät).tö.(my`y.des).(tä`n.sä)
(jä’r.jes).(tèl.mäl).(lìs.tä).mä.(tö`n.tä)
Two Errors
(ká.las).te.(lè.mi).nen
(jä´r.jes).tel.(mä`l.li).syy.(dèl.lä).ni
The interaction of Lapse and StressToWeight does
not produce the desired result in these cases.
What is wrong?
define Debug(Input) [
DebugStressToWeight(
NonFinal( Lapse( FootBin( MainStress( AlignLeft(
Clash( Input .o. Gen))))))) ];
regex Debug({kalasteleminen});
(ká*.las).te.(lè*.mi).nen                       <-- actual winner
(ká*.las).(tè*.le).(mì*.nen)                    <-- desired output
(jä´r.jes).tel.(mä`l.li).syy.(dèl.lä).ni        <-- actual winner
(jä’r.jes).(tèl.mäl).li.(sy`y.del).(lä`*.ni)    <-- desired output
The StressToWeight constraint eliminates some of the desired
winning candidates.
Nine Elenbaas
A unified account of binary and ternary stress. Ph.D. dissertation.
University of Utrecht. 1999. Based on Kiparsky&Hanson 1996. The
only difference is that Elenbaas has a special constraint *(L’ H) or
AntiLStressH( in place of Kiparsky’s more general StressToWeight
constraint.
define FinnishProsody(Input) [
AllFeetFirst( Parse( AntiLStressH(
NonFinal( Lapse( AlignLeft( FootBin(
MainStress( Clash( Input .o. Gen))))))))) ];
define AntiLStressH(X) Eval(X, SS & Light, "(" , "." Heavy);
Result
(ér.go).(nò.mi).a
(íl.moit).(tàu.tu).mi.(sès.ta)
(íl.moit).(tàu.tu).(mì.nen)
(ón.nit).(tè.le).(mà.ni).kin
(ó.pis).(kè.li).ja
(ó.pet).ta.(màs.sa)
(vói.mis).te.(lùt.te).le.(màs.ta)
(strúk.tu).ra.(lìs.mi)
(rá.vin).(tò.lat)
(rá.kas).ta.(jàt.ta).ri.(àn.sa)
(ré.pe).(ä`.mä)
(pé.ri).jä
(pú.he).li.(mèl.la).ni
(pú.he).li.(mìs.ta).ni
(mä’.ki)
(má.te).ma.(tìik.ka)
(mér.ko).(nò.min)
(kái.nos).(tè.li).jat
(ká.las).te.(lèm.me)
(ká.las).te.(lè.mi).nen
(ká.las).(tè.let)
(kú.nin).gas
(jä’r.jes).(tèl.mäl).li.(sy`y.del).(lä`.ni)
(jä’r.jes).(tèl.mät).tö.(my`y.des).(tä`n.sä)
(jä’r.jes).(tèl.mäl).(lìs.tä).mä.(tö`n.tä)
Did She Know?
Six syllables (Appendix of Elenbaas thesis)
XXLLLL
áterìanàni, áteriànani    'meal (Ess 1SG)'
érgonòmiàna               'ergonomics (Ess)'
káinostèlijàna            'shy person (Ess)'
káinostèlijàni            'shy person (Nom 1SG)'
kúnnallìsenàni            'council (Ess 1SG)'
kúnnallìsiàni             'councils (Part 1SG)'
kúnnallìsinàni            'councils (Ess 1SG)'
mérkonòmiàni              'degree in economics (Part 1SG)'
mérkonòminàni             'degree in economics (Ess 1SG)'
ópiskèlijàni              'student (Nom 1SG)'
púhelìmenàni              'telephone (Ess 1SG)'
púhelìmiàni               'telephone (Part 1SG)'
Missing pattern: X X L L L H
Conclusion
Can we get ternary feet in Finnish “for free”, from the
interaction of independently motivated principles?
We don’t know.
We know that the Kiparsky and Elenbaas accounts fail.
Optimality Prosody is computationally very difficult.
The number of initial candidates is huge:
kalasteleminen              70653
järjestelmällisyydelläni    21767579
Simple tableau methods do not work.
Finite-state implementation guards against errors made by a
human GEN and EVAL.
But even when an error can be pinpointed, the fix is not obvious.
Debugging OT constraints is as hard as debugging two-level
rules, in practice more difficult than rewrite systems.
Final Thoughts
Morphology is a regular relation.
The composition of words (morphosyntax), morphological
alternations, and prosody can be described in finite-state terms.
A complex relation can be decomposed in different ways.
There are many flavors of finite-state morphology: Item-and-Arrangement, Rewrite rules, Two-level rules, Realizational
Morphology, Classical optimality constraints.
Computing with finite-state tools is fun and easy.
We have a sophisticated formalism for describing regular relations,
efficient compilers, and runtime software.
‘Pen-and-pencil’ morphology badly needs computational support.
It is difficult to get globally correct results relying on a handful of
interesting words, rules, and constraints.