Building Finite-State Machines
600.465 - Intro to NLP - J. Eisner
Finite-State Toolkits

- In these slides, we'll use Xerox's regexp notation.
- Their tool is XFST; a free version is called FOMA.
- Usage:
  - Enter a regular expression; it builds an FSA or FST.
  - Now type in an input string.
    - FSA: it tells you whether the string is accepted.
    - FST: it tells you all the output strings (if any).
  - Can also invert the FST to let you map outputs back to inputs.
- Could hook it up to other NLP tools that need finite-state processing of their input or output.
- There are other tools for weighted FSMs (Thrax, OpenFST).
Common Regular Expression Operators (in XFST notation)

  EF              concatenation
  E*, E+          iteration
  E|F             union
  E&F             intersection
  ~E, \x, E-F     complementation, minus
  E .x. F         crossproduct
  E .o. F         composition
  E.u             upper (input) language ("domain")
  E.l             lower (output) language ("range")
Common Regular Expression Operators (in XFST notation)

concatenation: EF

  EF = {ef : e ∈ E, f ∈ F}

ef denotes the concatenation of 2 strings.
EF denotes the concatenation of 2 languages.
- To pick a string in EF, pick e ∈ E and f ∈ F and concatenate them.
- To find out whether w ∈ EF, look for at least one way to split w into two "halves," w = ef, such that e ∈ E and f ∈ F.

A language is a set of strings.
It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings.
If E and F denote regular languages, then so does EF.
(We will have to prove this by finding the FSA for EF!)
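As an added illustration of the definition above (not part of the original slides): a minimal Python sketch that checks w ∈ EF for finite languages E and F by trying every split point. The names E, F, and in_concat are invented for this example.

  # Membership test for the concatenation EF of two (finite) languages,
  # following the definition EF = {ef : e in E, f in F}.
  def in_concat(w: str, E: set[str], F: set[str]) -> bool:
      # Try every way to split w into two halves w = e + f.
      return any(w[:i] in E and w[i:] in F for i in range(len(w) + 1))

  E = {"ab", "a"}
  F = {"c", "bc"}
  print(in_concat("abc", E, F))   # True  ("ab"+"c" or "a"+"bc")
  print(in_concat("ac", E, F))    # True  ("a"+"c")
  print(in_concat("b", E, F))     # False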
Common Regular Expression Operators (in XFST notation)

iteration: E*, E+

  E* = {e1e2 … en : n ≥ 0, e1 ∈ E, …, en ∈ E}

- To pick a string in E*, pick any number of strings in E and concatenate them.
- To find out whether w ∈ E*, look for at least one way to split w into 0 or more sections, e1e2 … en, all of which are in E.

  E+ = {e1e2 … en : n > 0, e1 ∈ E, …, en ∈ E} = EE*
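Again as an added illustration (not from the slides): a small Python sketch that decides w ∈ E* for a finite E by dynamic programming over split points; in_star and the sample sets are invented names.

  # Membership test for E* = {e1 e2 ... en : n >= 0, each ei in E},
  # for a finite language E, via dynamic programming over prefixes.
  def in_star(w: str, E: set[str]) -> bool:
      n = len(w)
      ok = [False] * (n + 1)
      ok[0] = True                      # the empty split (n = 0 pieces)
      for j in range(1, n + 1):
          # w[:j] is in E* iff some earlier prefix is in E* and the rest is one piece of E
          ok[j] = any(ok[i] and w[i:j] in E for i in range(j))
      return ok[n]

  E = {"ab", "c"}
  print(in_star("", E))        # True  (n = 0)
  print(in_star("abcab", E))   # True  ("ab" "c" "ab")
  print(in_star("ba", E))      # False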
Common Regular Expression Operators (in XFST notation)

union: E|F

  E | F = {w : w ∈ E or w ∈ F} = E ∪ F

- To pick a string in E | F, pick a string from either E or F.
- To find out whether w ∈ E | F, check whether w ∈ E or w ∈ F.
Common Regular Expression Operators (in XFST notation)

intersection: E&F

  E & F = {w : w ∈ E and w ∈ F} = E ∩ F

- To pick a string in E & F, pick a string from E that is also in F.
- To find out whether w ∈ E & F, check whether w ∈ E and w ∈ F.
Common Regular Expression Operators (in XFST notation)

complementation, minus: ~E, \x, E-F

  ~E = {e : e ∉ E} = Σ* - E
  E - F = {e : e ∈ E and e ∉ F} = E & ~F
  \E = Σ - E   (any single character not in E)

Σ is the set of all letters; so Σ* is the set of all strings.
Regular Expressions

A language is a set of strings.
It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings.
If E and F denote regular languages, then so do EF, etc.

Regular expression: EF*|(F & G)+

Syntax: the expression parses as a tree, with | at the root; its left child is the concatenation of E and F*, and its right child is (F & G)+.

Semantics: it denotes a regular language.
As usual, we can build the semantics compositionally, bottom-up.
E, F, G must be regular languages.
As a base case, e denotes {e} (a language containing a single string), so ef*|(f&g)+ is regular.
Regular Expressions for Regular Relations

A language is a set of strings.
It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings.
If E and F denote regular languages, then so do EF, etc.

A relation is a set of pairs - here, pairs of strings.
It is a regular relation if there exists an FST that accepts all the pairs in the relation, and no other pairs.
If E and F denote regular relations, then so do EF, etc.

  EF = {(ef, e'f') : (e, e') ∈ E, (f, f') ∈ F}

Can you guess the definitions for E*, E+, E | F, E & F when E and F are regular relations?
Surprise: E & F isn't necessarily regular in the case of relations, so it is not supported.
Common Regular Expression Operators (in XFST notation)

crossproduct: E .x. F

  E .x. F = {(e, f) : e ∈ E, f ∈ F}

- Combines two regular languages into a regular relation.
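For intuition (an added sketch, not from the slides): with finite languages, .x. is just a Cartesian product of sets of strings; the names below are invented.

  # Crossproduct of two finite languages, following E .x. F = {(e, f) : e in E, f in F}.
  E = {"cat", "dog"}
  F = {"chat", "chien"}
  cross = {(e, f) for e in E for f in F}
  print(sorted(cross))
  # [('cat', 'chat'), ('cat', 'chien'), ('dog', 'chat'), ('dog', 'chien')]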
Common Regular Expression Operators (in XFST notation)

composition: E .o. F

  E .o. F = {(e, f) : ∃m. (e, m) ∈ E, (m, f) ∈ F}

- Composes two regular relations into a regular relation.
- As we've seen, this generalizes ordinary function composition.
Common Regular Expression Operators (in XFST notation)

upper (input) language: E.u  ("domain")

  E.u = {e : ∃m. (e, m) ∈ E}
Common Regular Expression Operators (in XFST notation)

lower (output) language: E.l  ("range")

  E.l = {m : ∃e. (e, m) ∈ E}
Function from strings to ...
(slide courtesy of L. Karttunen, modified)

                   Acceptors (FSAs)       Transducers (FSTs)
  Unweighted       {false, true}          strings
  Weighted         numbers                (string, num) pairs

[Diagrams: one small example machine over the arcs a, e, c, shown in all four variants:
unweighted FSA; weighted FSA with arcs a/.5, e/.5, c/.7 and final weight .3; unweighted
FST with arcs a:x, e:y, c:z; weighted FST with arcs a:x/.5, e:y/.5, c:z/.7.]
How to implement?

  EF              concatenation
  E*, E+          iteration
  E|F             union
  ~E, \x, E-F     complementation, minus
  E&F             intersection
  E .x. F         crossproduct
  E .o. F         composition
  E.u             upper (input) language ("domain")
  E.l             lower (output) language ("range")
Concatenation
(example courtesy of M. Mohri)

[Diagram: two FSAs and the FSA for their concatenation; in the standard construction,
ε-arcs run from the final states of the first machine to the initial state of the second.]
Union
(example courtesy of M. Mohri)

[Diagram: two FSAs and the FSA for their union; in the standard construction, a new
initial state has ε-arcs to the initial states of both machines.]
Closure
(example courtesy of M. Mohri; this example has outputs too)

[Diagram: a transducer and the machine for its closure E*.]

The loop creates (red machine)+. Then we add a state to get ε | (red machine)+.
Why do it this way? Why not just make state 0 final?
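A hedged sketch (not from the slides) of this closure construction on a simple NFA data structure; all names (closure, the tuple layout) are invented for illustration, and ε is written as the empty string.

  # NFA as (states, start, finals, arcs) with arcs = set of (src, label, dst); "" = epsilon.
  # Closure construction as described above: loop the finals back to the old start to get E+,
  # then add a fresh final start state (reaching the old start by epsilon) to also accept "".
  def closure(nfa):
      states, start, finals, arcs = nfa
      new_start = max(states) + 1                     # fresh state id (states assumed numeric)
      arcs = set(arcs)
      arcs |= {(f, "", start) for f in finals}        # the E+ loop
      arcs.add((new_start, "", start))                # new start feeds the old machine
      return (states | {new_start}, new_start, finals | {new_start}, arcs)

  E = ({0, 1}, 0, {1}, {(0, "a", 1)})                 # accepts just "a"
  print(closure(E))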
Upper language (domain): .u
(example courtesy of M. Mohri)

[Diagram: a transducer E and the acceptor for E.u, obtained by keeping only the input
(upper) side of each arc label.]

Similarly construct the lower language .l.
These are also called the input and output languages.
Reversal: .r
(example courtesy of M. Mohri)

[Diagram: a machine and its reversal, which accepts each string (or pair of strings) read backwards.]
Inversion: .i
(example courtesy of M. Mohri)

[Diagram: a transducer and its inverse, obtained by swapping the input and output symbols on every arc.]
Complementation

- Given a machine M, represent all strings not accepted by M.
- Just change final states to non-final and vice versa.
- Works only if the machine has been determinized and completed first (why?).
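As an added illustration: a minimal Python sketch of this flip-the-finals step on a complete DFA; the representation and names (complement_dfa, accepts) are invented, and the code assumes determinization and completion have already been done.

  # Complete DFA as (states, alphabet, start, finals, delta) with delta[(q, a)] -> q'.
  # Complementation just swaps final and non-final states; this is valid only because the
  # machine is deterministic and complete (every state has an arc for every symbol).
  def complement_dfa(dfa):
      states, alphabet, start, finals, delta = dfa
      return (states, alphabet, start, states - finals, delta)

  def accepts(dfa, w):
      states, alphabet, start, finals, delta = dfa
      q = start
      for a in w:
          q = delta[(q, a)]
      return q in finals

  # DFA over {a, b} accepting strings that contain "ab".
  delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 2, (2, "a"): 2, (2, "b"): 2}
  M = ({0, 1, 2}, {"a", "b"}, 0, {2}, delta)
  print(accepts(M, "aab"), accepts(complement_dfa(M), "aab"))   # True False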
Intersection
(example adapted from M. Mohri)

[Diagram: two weighted FSAs over the words fat, pig, eats, sleeps and their intersection.
The first machine has arcs fat/0.5, pig/0.3, eats/0, sleeps/0.6 and a final state 2/0.8;
the second has arcs fat/0.2, pig/0.4, eats/0.6, sleeps/1.3 and a final state 2/0.5.
The intersection's states are pairs: 0,0  0,1  1,1  2,0/0.8  2,2/1.3, with arcs
fat/0.7  pig/0.7  eats/0.6  sleeps/1.9; matching arcs pair up and their weights add.]

Built step by step, pairing up accepting paths from the two machines:
- Paths 00 and 01 both accept fat. So must the new machine: along path 0,0 0,1.
- Paths 01 and 11 both accept pig. So must the new machine: along path 0,1 1,1.
- Paths 12 and 12 both accept sleeps. So must the new machine: along path 1,1 2,2.
- Paths 0012 and 0110 both accept fat pig eats. So must the new machine: along path 0,0 0,1 1,1 2,0.
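An added sketch (not from the slides) of this weighted intersection as a product construction in the tropical semiring, where weights add along paired arcs and at paired final states; all names and the tuple encoding are invented, and unreachable state pairs are not pruned.

  # Weighted FSA: (start, finals, arcs) with finals = {state: stop_weight}
  # and arcs = {(src, word): (dst, weight)} (deterministic, for simplicity).
  def intersect(m1, m2):
      (s1, f1, a1), (s2, f2, a2) = m1, m2
      arcs, finals = {}, {}
      for (q1, word), (r1, c1) in a1.items():
          for (q2, word2), (r2, c2) in a2.items():
              if word == word2:
                  arcs[((q1, q2), word)] = ((r1, r2), c1 + c2)   # weights add
      for q1, c1 in f1.items():
          for q2, c2 in f2.items():
              finals[(q1, q2)] = c1 + c2
      return ((s1, s2), finals, arcs)

  M1 = (0, {2: 0.8}, {(0, "fat"): (1, 0.5), (1, "pig"): (2, 0.3), (2, "sleeps"): (2, 0.6)})
  M2 = (0, {2: 0.5}, {(0, "fat"): (1, 0.2), (1, "pig"): (2, 0.4), (2, "sleeps"): (2, 1.3)})
  print(intersect(M1, M2)[2][((0, 0), "fat")])   # ((1, 1), 0.7)  (up to float rounding)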
What Composition Means

[Diagram: a transducer f maps the input ab?d to the intermediate strings abcd, abed, abjd
with weights 3, 2, 6; a second transducer g maps abcd and abed onward to the outputs
abgd, abed, abd with weights 4, 2, 8 (and has no output for abjd).]
What Composition Means

Relation composition: f ∘ g

[Diagram: composing f and g collapses out the intermediate strings: ab?d maps directly
to abgd (weight 3+4), abed (weight 2+2), and abd (weight 6+8), ...]
Relation = set of pairs

  f = { ab?d → abcd,  ab?d → abed,  ab?d → abjd, … }
  g = { abcd → abgd,  abed → abed,  abed → abd, … }

Note: g does not contain any pair of the form abjd → …

[Diagram: the same weighted f and g as on the previous slides.]
Relation = set of pairs

  f = { ab?d → abcd,  ab?d → abed,  ab?d → abjd, … }
  g = { abcd → abgd,  abed → abed,  abed → abd, … }

  f ∘ g = { ab?d → abgd,  ab?d → abed,  ab?d → abd, … }

  f ∘ g = {x → z : ∃y (x → y ∈ f and y → z ∈ g)},  where x, y, z are strings.

[Diagram: the composed machine maps ab?d directly to abgd, abed, abd, with weights summed along the way.]
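An added sketch of this definition for finite relations represented as Python sets of pairs (names invented):

  # f o g = {(x, z) : there is a y with (x, y) in f and (y, z) in g}
  def compose(f, g):
      return {(x, z) for (x, y) in f for (y2, z) in g if y == y2}

  f = {("ab?d", "abcd"), ("ab?d", "abed"), ("ab?d", "abjd")}
  g = {("abcd", "abgd"), ("abed", "abed"), ("abed", "abd")}
  print(sorted(compose(f, g)))
  # [('ab?d', 'abd'), ('ab?d', 'abed'), ('ab?d', 'abgd')]
  # abjd drops out: g has no pair of the form (abjd, ...).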
Intersection vs. Composition

Intersection: an arc pig/0.3 in one FSA and an arc pig/0.4 in the other combine into an
arc pig/0.7 from state 0,1 to state 1,1.

Composition: an arc Wilbur:pig/0.3 in one FST and an arc pig:pink/0.4 in the other
combine into an arc Wilbur:pink/0.7 from state 0,1 to state 1,1.
Intersection vs. Composition

Intersection mismatch: an arc pig/0.3 in one FSA and an arc elephant/0.4 in the other
do not match, so the arc pig/0.7 from 0,1 to 1,1 is not created.

Composition mismatch: an arc Wilbur:pig/0.3 in one FST and an arc elephant:gray/0.4 in
the other do not match on the intermediate symbol, so the arc Wilbur:gray/0.7 from 0,1
to 1,1 is not created.
Composition, step by step
(example courtesy of M. Mohri)

[Diagrams: composing two transducers arc by arc; at each step an arc of the first machine
pairs with an arc of the second when the first's output symbol matches the second's input symbol:]

  a:b .o. b:b = a:b
  a:b .o. b:a = a:a
  a:b .o. b:a = a:a
  b:b .o. b:a = b:a
  a:b .o. b:a = a:a
  a:a .o. a:b = a:b
  b:b .o. a:b = nothing (since the intermediate symbol doesn't match)
  b:b .o. b:a = b:a
  a:b .o. a:b = a:b
Composition in Dyna

  start = &pair( start1, start2 ).

  final(&pair(Q1,Q2)) :- final1(Q1), final2(Q2).

  edge(U, L, &pair(Q1,Q2), &pair(R1,R2))
    min= edge1(U, Mid, Q1, R1)
       + edge2(Mid, L, Q2, R2).
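An added sketch (not from the course materials) of the same product construction in Python over explicit arc lists; it mirrors the Dyna edge and start rules above in the tropical (min, +) semiring. All names are invented, ε-arcs are ignored, and the final-state rule is omitted for brevity.

  # edges1/edges2: lists of (upper, lower, src, dst, weight).
  def compose_arcs(start1, edges1, start2, edges2):
      start = (start1, start2)                           # start = &pair(start1, start2)
      arcs = {}                                          # (U, L, pairQ, pairR) -> min weight
      for (u, mid, q1, r1, w1) in edges1:
          for (mid2, low, q2, r2, w2) in edges2:
              if mid == mid2:                            # intermediate symbols must match
                  key = (u, low, (q1, q2), (r1, r2))
                  w = w1 + w2                            # weights add ...
                  arcs[key] = min(arcs.get(key, float("inf")), w)   # ... and min-aggregate
      return start, arcs

  edges1 = [("Wilbur", "pig", 0, 1, 0.3)]
  edges2 = [("pig", "pink", 0, 1, 0.4)]
  print(compose_arcs(0, edges1, 0, edges2))
  # ((0, 0), {('Wilbur', 'pink', (0, 0), (1, 1)): 0.7})  (up to float rounding)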
Relation = set of pairs (recap)

  f = { ab?d → abcd,  ab?d → abed,  ab?d → abjd, … }
  g = { abcd → abgd,  abed → abed,  abed → abd, … }

  f ∘ g = { ab?d → abgd,  ab?d → abed,  ab?d → abd, … }

  f ∘ g = {x → z : ∃y (x → y ∈ f and y → z ∈ g)},  where x, y, z are strings.
3 Uses of Set Composition

- Feed a string into the Greek transducer:
    {abed → abed} .o. Greek = {abed → abed, abed → abd}
    {abed} .o. Greek = {abed → abed, abed → abd}
    [{abed} .o. Greek].l = {abed, abd}
- Feed several strings in parallel:
    {abcd, abed} .o. Greek = {abcd → abgd, abed → abed, abed → abd}
    [{abcd, abed} .o. Greek].l = {abgd, abed, abd}
- Filter the result via Noe = {abgd, abd, …}:
    {abcd, abed} .o. Greek .o. Noe = {abcd → abgd, abed → abd}
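A small added illustration of these three uses with finite sets of pairs, reusing the earlier compose sketch and treating a language as the identity relation on its strings; all names are invented.

  def compose(f, g):
      return {(x, z) for (x, y) in f for (y2, z) in g if y == y2}

  def ident(language):               # a language used as a relation = identity pairs
      return {(w, w) for w in language}

  def lower(relation):               # .l : the output (lower) language
      return {z for (_, z) in relation}

  Greek = {("abcd", "abgd"), ("abed", "abed"), ("abed", "abd")}

  print(compose(ident({"abed"}), Greek))          # feed one string in
  print(lower(compose(ident({"abed"}), Greek)))   # {'abed', 'abd'}
  Noe = ident({"abgd", "abd"})                    # filter: keep only these outputs
  print(compose(compose(ident({"abcd", "abed"}), Greek), Noe))
  # {('abcd', 'abgd'), ('abed', 'abd')}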
What are the "basic" transducers?

- The operations on the previous slides combine transducers into bigger ones.
- But where do we start?
    a:ε  for a ∈ Σ
    ε:x  for x ∈ Δ
- Q: Do we also need a:x?  How about ε:ε?
Some Xerox Extensions
(slide courtesy of L. Karttunen, modified)

  $        containment
  =>       restriction
  -> @->   replacement

These make it easier to describe complex languages and relations without extending the
formal power of finite-state systems.
Containment

  $[ab*c]      "Must contain a substring that matches ab*c."

Accepts xxxacyy; rejects bcba.

Equivalent expression:  ?* [ab*c] ?*

[Diagram: the corresponding acceptor, with loops over a, b, c, ?]

Warning: ? in regexps means "any character at all."  But ? in machines means "any
character not explicitly mentioned anywhere in the machine."
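An added, rough analogue in Python's re (a backtracking regex engine rather than a compiled FSA); the containment operator corresponds to an unanchored search. The function name is invented.

  import re

  # $[ab*c] -- "contains a substring matching ab*c" -- is just an unanchored search,
  # i.e. the same as matching the whole string against ?* [ab*c] ?*.
  def contains(s: str) -> bool:
      return re.search(r"ab*c", s) is not None

  print(contains("xxxacyy"))   # True
  print(contains("bcba"))      # False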
Restriction
(slide courtesy of L. Karttunen, modified)

  a => b _ c      "Any a must be preceded by b and followed by c."

Accepts bacbbacde; rejects baca.

Equivalent expression:  ~[ ~[?* b] a ?* ] & ~[ ?* a ~[c ?*] ]

[Diagram: the corresponding acceptor, with arcs over a, b, c, ?]
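An added Python check of the same constraint on strings (every a preceded by b and followed by c), just to make the semantics concrete; the function name is invented.

  # a => b _ c : every occurrence of "a" must have "b" immediately before it
  # and "c" immediately after it.
  def satisfies_restriction(s: str) -> bool:
      return all(0 < i < len(s) - 1 and s[i-1] == "b" and s[i+1] == "c"
                 for i, ch in enumerate(s) if ch == "a")

  print(satisfies_restriction("bacbbacde"))   # True
  print(satisfies_restriction("baca"))        # False (the final "a" is not followed by "c")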
Replacement
(slide courtesy of L. Karttunen, modified)

  a b -> b a      "Replace 'ab' by 'ba'."

Transduces abcdbaba to bacdbbaa.

Equivalent expression:  [ ~$[a b] [[a b] .x. [b a]] ]* ~$[a b]

[Diagram: the corresponding transducer, with arcs a:b and b:a plus identity arcs over other symbols.]
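An added analogue using plain Python string replacement, which, like the simple replace rule here, rewrites non-overlapping occurrences left to right:

  # a b -> b a : rewrite each occurrence of "ab" as "ba".
  print("abcdbaba".replace("ab", "ba"))   # bacdbbaa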
Replacement is Nondeterministic

  a b -> b a | x      "Replace 'ab' by 'ba' or 'x', nondeterministically."

Transduces abcdbaba to {bacdbbaa, bacdbxa, xcdbbaa, xcdbxa}.
Replacement is Nondeterministic

  [ a b -> b a | x ] .o. [ x => _ c ]

"Replace 'ab' by 'ba' or 'x', nondeterministically," then require any x to be followed by c.

Transduces abcdbaba to {bacdbbaa, bacdbxa, xcdbbaa, xcdbxa}; the restriction filters out
the outputs in which an x is not followed by c.
Replacement is Nondeterministic
(slide courtesy of L. Karttunen, modified)

  a b | b | b a | a b a -> x      applied to "aba"

Four overlapping substrings match; we haven't told it which one to replace, so it chooses
nondeterministically:

  a b a -> a x a   (replacing b)
  a b a -> a x     (replacing b a)
  a b a -> x a     (replacing a b)
  a b a -> x       (replacing a b a)
More Replace Operators
(slide courtesy of L. Karttunen)

- Optional replacement:  a b (->) b a
- Directed replacement: guarantees a unique result by constraining the factorization of
  the input string by
  - direction of the match (rightward or leftward)
  - length (longest or shortest)
@-> Left-to-right, Longest-match Replacement
(slide courtesy of L. Karttunen)

  a b | b | b a | a b a @-> x      applied to "aba"

Of the four overlapping matches on the previous slide, left-to-right longest match keeps
only  a b a -> x.

  @->   left-to-right, longest match
  @>    left-to-right, shortest match
  ->@   right-to-left, longest match
  >@    right-to-left, shortest match
Using "..." for marking
(slide courtesy of L. Karttunen, modified)

  a|e|i|o|u -> [ ... ]

  p o t a t o   ->   p[o]t[a]t[o]

[Diagram: the corresponding transducer, which inserts [ (arc 0:[) before each vowel and
] (arc 0:]) after it, copying everything else.]

Note: you actually have to write this as  -> %[ ... %]  or  -> "[" ... "]"
since [ and ] are parentheses in the regexp language.
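An added rough analogue with Python's re, where the matched vowel plays the role of "...":

  import re

  # a|e|i|o|u -> [ ... ]  : wrap each vowel in brackets, leaving everything else alone.
  print(re.sub(r"[aeiou]", r"[\g<0>]", "potato"))   # p[o]t[a]t[o]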
Using "..." for marking
(slide courtesy of L. Karttunen, modified)

  a|e|i|o|u -> [ ... ]

Which way does the FST transduce potatoe?

  p o t a t o e  ->  p[o]t[a]t[o][e]
     vs.
  p o t a t o e  ->  p[o]t[a]t[o e]

How would you change it to get the other answer?
Example: Finnish Syllabification
(slide courtesy of L. Karttunen)

  define C [ b | c | d | f ...
  define V [ a | e | i | o | u ];

  [C* V+ C*] @-> ... "-" || _ [C V]

"Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern."  (why?)

  s t r u k t u r a l i s m i   ->   s t r u k - t u - r a - l i s - m i
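An added, rough approximation with Python's re (leftmost matching with greedy quantifiers approximates left-to-right longest match for this example); the character classes are abbreviated and all names are invented.

  import re

  C = "[bcdfghjklmnpqrstvz]"   # consonants (abbreviated; adjust for real Finnish)
  V = "[aeiouy]"               # vowels (abbreviated)

  # [C* V+ C*] @-> ... "-" || _ [C V]
  # i.e. insert "-" after a C*V+C* chunk whenever a C V follows (lookahead).
  pattern = re.compile(f"({C}*{V}+{C}*)(?={C}{V})")
  print(pattern.sub(r"\1-", "strukturalismi"))   # struk-tu-ra-lis-mi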
Conditional Replacement
(slide courtesy of L. Karttunen)

  A -> B || L _ R
  (replacement: A -> B;  context: L _ R)

The relation that replaces A by B between L and R, leaving everything else unchanged.

Sources of complexity:
- Replacements and contexts may overlap.
- There are alternative ways of interpreting "between left and right."
Hand-Coded Example: Parsing Dates
(slide courtesy of L. Karttunen)

Best result:
  Today is [Tuesday, July 25, 2000].

Bad results:
  Today is Tuesday, [July 25, 2000].
  Today is [Tuesday, July 25], 2000.
  Today is Tuesday, [July 25], 2000.
  Today is [Tuesday], July 25, 2000.

Need left-to-right, longest-match constraints.
Source code: Language of Dates
(slide courtesy of L. Karttunen)

  Day   = Monday | Tuesday | ... | Sunday
  Month = January | February | ... | December
  Date  = 1 | 2 | 3 | ... | 3 1
  Year  = %0To9 (%0To9 (%0To9 (%0To9))) - %0?*      from 1 to 9999

  AllDates = Day | (Day ", ") Month " " Date (", " Year)
Object code: All Dates from 1/1/1 to 12/31/9999
(slide courtesy of L. Karttunen)

[Diagram: the compiled FSA, with arcs for the day names (Mon ... Sun), the month names
(Jan ... Dec), the digits 0-9, commas, and spaces.]

13 states, 96 arcs
29 760 007 date expressions
Parser for Dates
(slide courtesy of L. Karttunen, modified)

  AllDates @-> "[DT " ... "]"      (Xerox left-to-right replacement operator)

Compiles into an unambiguous transducer (23 states, 332 arcs).

Today is [DT Tuesday, July 25, 2000] because yesterday was [DT Monday] and it was
[DT July 24] so tomorrow must be [DT Wednesday, July 26] and not [DT July 27] as it
says on the program.
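An added rough Python approximation of this marking step, built from the AllDates grammar two slides back; the regex names are invented, and Python's re gives leftmost matching with ordered alternation rather than true longest match, which happens to suffice here.

  import re

  DAY   = r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
  MONTH = r"(?:January|February|March|April|May|June|July|August|September|October|November|December)"
  DATE  = r"(?:3[01]|[12][0-9]|[1-9])"
  YEAR  = r"[1-9][0-9]{0,3}"

  # AllDates = Day | (Day ", ") Month " " Date (", " Year)
  ALLDATES = rf"(?:{DAY}, )?{MONTH} {DATE}(?:, {YEAR})?|{DAY}"

  text = "Today is Tuesday, July 25, 2000 because yesterday was Monday."
  print(re.sub(ALLDATES, lambda m: "[DT " + m.group(0) + "]", text))
  # Today is [DT Tuesday, July 25, 2000] because yesterday was [DT Monday].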
Problem of Reference
(slide courtesy of L. Karttunen)

Valid dates:
  Tuesday, July 25, 2000
  Tuesday, February 29, 2000
  Monday, September 16, 1996

Invalid dates:
  Wednesday, April 31, 1996
  Thursday, February 29, 1900
  Tuesday, July 26, 2000
Refinement by Intersection
(slide courtesy of L. Karttunen, modified)

ValidDates is the intersection of:
  AllDates
  MaxDaysInMonth:   " 31" => Jan|Mar|May|… _
                    " 30" => Jan|Mar|Apr|… _      (Xerox contextual restriction operator)
  LeapYears:        Feb 29, => _ …
  WeekdayDate

Q: Why do these rules start with spaces?  (And is it enough?)
Q: Why does the LeapYears rule end with a comma?
Q: Can we write the whole rule?
Defining Valid Dates
(slide courtesy of L. Karttunen)

  AllDates & MaxDaysInMonth & LeapYears & WeekdayDates  =  ValidDates

  AllDates:   13 states, 96 arcs;     29 760 007 date expressions
  ValidDates: 805 states, 6472 arcs;   7 307 053 date expressions
Parser for Valid and Invalid Dates
(slide courtesy of L. Karttunen)

  [AllDates - ValidDates] @-> "[ID " ... "]" ,
            ValidDates    @-> "[VD " ... "]"

The comma creates a single FST (2688 states, 20439 arcs) that does left-to-right longest
match against either pattern.

  Today is [VD Tuesday, July 25, 2000],      (valid date)
  not [ID Tuesday, July 26, 2000].           (invalid date)