CS60057 Speech & Natural Language Processing
Autumn 2007
Lecture 4
1 August 2007
MORPHOLOGY
Finite State Machines

- FSAs are equivalent to regular languages.
- FSTs are equivalent to regular relations (over pairs of regular languages).
- FSTs are like FSAs but with complex labels.
- We can use FSTs to transduce between the surface and lexical levels.
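The slides give no code, but the FSA side of this equivalence is easy to sketch. Below is a minimal deterministic recognizer in Python; the transition table, state numbers, and the toy language ("cat" optionally followed by "s") are all invented for illustration, not taken from the course.

def fsa_accepts(transitions, accepting, string, start=0):
    """Deterministic FSA recognition: does `string` reach an accept state?"""
    state = start
    for ch in string:
        if (state, ch) not in transitions:
            return False          # no transition: reject
        state = transitions[(state, ch)]
    return state in accepting

# FSA for the regular language "cat" optionally followed by "s"
fsa = {(0, "c"): 1, (1, "a"): 2, (2, "t"): 3, (3, "s"): 4}
print(fsa_accepts(fsa, {3, 4}, "cats"))  # True
print(fsa_accepts(fsa, {3, 4}, "catz"))  # False

An FST transition would carry a pair of symbols instead of one; the sketches after the later slides build on exactly this representation.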
Simple Rules
Adding in the Words
Derivational Rules
Parsing/Generation vs. Recognition

- Recognition is usually not quite what we need.
  - Usually, if we find some string in the language, we need to find the structure in it (parsing).
  - Or we have some structure and we want to produce a surface form (production/generation).
- Example: from “cats” to “cat +N +PL” and back.
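A toy way to see the two directions, using only the example pair from this slide: the same lexical/surface relation can be read surface-to-lexical (parsing) or lexical-to-surface (generation). The pair set, the singular entry, and the function names below are hypothetical.

# The relation as a set of (lexical, surface) pairs -- illustration only.
pairs = {("cat +N +PL", "cats"), ("cat +N +SG", "cat")}

def generate(lexical):   # structure -> surface form
    return [s for (l, s) in pairs if l == lexical]

def parse(surface):      # surface form -> structure
    return [l for (l, s) in pairs if s == surface]

print(generate("cat +N +PL"))  # ['cats']
print(parse("cats"))           # ['cat +N +PL']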
Morphological Parsing

- Given the input cats, we’d like to output cat +N +Pl, telling us that cat is a plural noun.
- Given the Spanish input bebo, we’d like to output beber +V +PInd +1P +Sg, telling us that bebo is the present indicative first person singular form of the Spanish verb beber, ‘to drink’.
Morphological Analyser

To build a morphological analyser we need:
- lexicon: the list of stems and affixes, together with basic information about them
- morphotactics: the model of morpheme ordering (e.g., the English plural morpheme follows a noun rather than a verb)
- orthographic rules: these spelling rules model the changes that occur in a word, usually when two morphemes combine (e.g., fly+s = flies)
Lexicon & Morphotactics

- Typically, the list of word parts (lexicon) and the model of ordering can be combined into an FSA which will recognise all the valid word forms.
- For this to be possible, the word parts must first be classified into sublexicons.
- The FSA defines the morphotactics (ordering constraints).
Sublexicons
To classify the list of word parts:

reg-noun   irreg-pl-noun   irreg-sg-noun   plural
cat        mice            mouse           -s
fox        sheep           sheep
           geese           goose
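A sketch of how the sublexicons and the morphotactics might be combined in code. The word lists come straight from the table above; the control flow mimics the noun FSA on the next slide (a regular noun optionally followed by the plural -s, irregular forms accepted whole). The function name and the endswith shortcut are my own simplifications.

sublexicon = {
    "reg-noun":      {"cat", "fox"},
    "irreg-pl-noun": {"mice", "sheep", "geese"},
    "irreg-sg-noun": {"mouse", "sheep", "goose"},
}

def accepts_noun(word):
    if word in sublexicon["irreg-sg-noun"] | sublexicon["irreg-pl-noun"]:
        return True   # irregular singular/plural accepted as a whole word
    if word in sublexicon["reg-noun"]:
        return True   # bare regular noun
    if word.endswith("s") and word[:-1] in sublexicon["reg-noun"]:
        return True   # regular noun + plural morpheme -s
    return False

print(accepts_noun("cats"))   # True
print(accepts_noun("geese"))  # True
# Note: at this level the plural of fox is fox + -s; the e-insertion
# spelling rule introduced later maps fox^s to surface "foxes".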
FSA Expresses Morphotactics (ordering model)
Towards the Analyser

- We can use lexc or xfst to build such an FSA.
- To augment this to produce an analysis, we must create a transducer Tnum which maps between the lexical level and an "intermediate" level that is needed to handle the spelling rules of English.
Three Levels of Analysis
1. Tnum: Noun Number Inflection

- multi-character symbols
- morpheme boundary ^
- word boundary #
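The real Tnum is built with lexc/xfst; as a rough stand-in, here is a plain string rewrite covering only the regular-noun part of the mapping (irregular nouns would need their own arcs in the transducer). The tag spellings follow the slides; the function itself is an assumption.

def tnum(lexical):
    """Lexical level -> intermediate level, regular nouns only."""
    if lexical.endswith("+N+Pl"):
        return lexical[: -len("+N+Pl")] + "^s#"  # plural: -s attached at a ^ boundary
    if lexical.endswith("+N+Sg"):
        return lexical[: -len("+N+Sg")] + "#"    # singular: tags realized as nothing
    return lexical + "#"

print(tnum("cat+N+Pl"))  # cat^s#
print(tnum("fox+N+Pl"))  # fox^s#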
Intermediate Form to Surface

- The reason we need an intermediate form is that funny things happen at morpheme boundaries, e.g.
    cat^s -> cats
    fox^s -> foxes
    fly^s -> flies
- The rules which describe these changes are called orthographic rules or "spelling rules".
More English Spelling Rules

- consonant doubling: beg / begging
- y replacement: try / tries
- k insertion: panic / panicked
- e deletion: make / making
- e insertion: watch / watches

Each rule can be stated in more detail ...
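Each of these rules can be approximated as a regular-expression rewrite over the intermediate form. The patterns below are deliberately simplified illustrations of the five rules, not the course's actual rules (real consonant doubling, for instance, is sensitive to stress, so this pattern over-applies).

import re

RULES = [
    (r"([bdglmnprt])\^(ing|ed)", r"\1\1\2"),  # consonant doubling: beg^ing -> begging
    (r"y\^s", "ies"),                         # y replacement: try^s -> tries
    (r"([aeiou]c)\^(ed|ing)", r"\1k\2"),      # k insertion: panic^ed -> panicked
    (r"e\^(ing|ed)", r"\1"),                  # e deletion: make^ing -> making
    (r"(ch|sh|x|s|z)\^s", r"\1es"),           # e insertion: watch^s -> watches
]

def apply_spelling(form):
    for pattern, repl in RULES:
        form = re.sub(pattern, repl, form)
    return form.replace("^", "").replace("#", "")  # drop boundary symbols

print(apply_spelling("beg^ing"))  # begging
print(apply_spelling("fox^s"))    # foxes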
Spelling Rules

- Chomsky & Halle (1968) invented a special notation for spelling rules.
- A very similar notation is embodied in the "conditional replacement" rules of xfst:

    E -> F || L _ R

  which means: replace E with F when it appears between left context L and right context R.
A Particular Spelling Rule

This rule does e-insertion:

    ^ -> e || x _ s#
e insertion over 3 levels

The rule corresponds to the mapping between the surface and intermediate levels.
e insertion as an FST
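The FST in the figure can be simulated by a left-to-right scan over the intermediate tape. The sketch below collapses the multi-state machine into one loop with lookahead; that is a simplification of the real transducer, but it computes the same ^ -> e || x _ s# mapping on these examples.

def e_insertion(intermediate):
    """Intermediate tape -> surface tape for the e-insertion rule."""
    out = []
    for i, ch in enumerate(intermediate):
        if ch == "^":
            left = intermediate[i - 1] if i > 0 else ""
            right = intermediate[i + 1 : i + 3]
            # write e only between x and s#; otherwise the boundary is erased
            out.append("e" if left == "x" and right == "s#" else "")
        elif ch == "#":
            pass  # the word boundary symbol is not written to the surface
        else:
            out.append(ch)
    return "".join(out)

print(e_insertion("fox^s#"))  # foxes
print(e_insertion("cat^s#"))  # cats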
Incorporating Spelling Rules

- Spelling rules, each corresponding to an FST, can be run in parallel provided that they are "aligned".
- The set of spelling rules is positioned between the surface level and the intermediate level.
- Parallel execution of FSTs can be carried out:
  - by simulation: in this case the FSTs must first be aligned
  - by first constructing a single FST corresponding to their intersection
Putting it all together

(figure: execution of the FSTi takes place in parallel)
Kaplan and Kay: The Xerox View

(figure: FSTi aligned but separate vs. FSTi intersected together)
Finite State Transducers

- The simple story:
  - add another tape
  - add extra symbols to the transitions
- On one tape we read “cats”; on the other we write “cat +N +PL”, or the other way around.
FSTs
English Plural

surface   lexical
cat       cat+N+Sg
cats      cat+N+Pl
foxes     fox+N+Pl
mice      mouse+N+Pl
sheep     sheep+N+Pl
sheep     sheep+N+Sg
Transitions

c:c  a:a  t:t  +N:ε  +PL:s

- c:c means read a c on one tape and write a c on the other
- +N:ε means read a +N symbol on one tape and write nothing on the other
- +PL:s means read +PL and write an s
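Written out as data, the arcs above might look like the following minimal sketch (state numbering and function name invented): read the lexical tape symbol by symbol and write the surface tape.

arcs = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", "", 4),    # +N:eps -- tag consumed, nothing written
    (4, "+PL", "s", 5),  # +PL:s  -- plural tag realized as s
]
accepting = {5}

def transduce(symbols, state=0):
    out = []
    for sym in symbols:
        for (src, inp, outp, dst) in arcs:
            if src == state and inp == sym:
                out.append(outp)
                state = dst
                break
        else:
            return None  # no matching arc: reject
    return "".join(out) if state in accepting else None

print(transduce(["c", "a", "t", "+N", "+PL"]))  # cats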
Typical Uses

- Typically, we’ll read from one tape using the first symbol on each machine transition (just as in a simple FSA).
- And we’ll write to the second tape using the other symbol on the transitions.
Ambiguity

- Recall that in non-deterministic recognition, multiple paths through a machine may lead to an accept state.
  - It didn’t matter which path was actually traversed.
- In FSTs the path to an accept state does matter, since different paths represent different parses, and different outputs will result.
Ambiguity

- What’s the right parse for “unionizable”?
  - union-ize-able
  - un-ion-ize-able
- Each represents a valid path through the derivational morphology machine.
Ambiguity

- There are a number of ways to deal with this problem:
  - simply take the first output found
  - find all the possible outputs (all paths) and return them all, without choosing (see the sketch below)
  - bias the search so that only one or a few likely paths are explored
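The "return them all" option corresponds to exhausting the nondeterministic search. A small recursive sketch, reusing the arc-list format from the Transitions sketch above; the ambiguous toy arcs are invented.

def all_outputs(arcs, accepting, symbols, state=0, out=""):
    """Collect the output string of every accepting path."""
    if not symbols:
        return [out] if state in accepting else []
    results = []
    sym, rest = symbols[0], symbols[1:]
    for (src, inp, outp, dst) in arcs:
        if src == state and inp == sym:
            results += all_outputs(arcs, accepting, rest, dst, out + outp)
    return results

# Two competing arcs for the same input symbol: both parses come back.
toy_arcs = [(0, "a", "X", 1), (0, "a", "Y", 1)]
print(all_outputs(toy_arcs, {1}, ["a"]))  # ['X', 'Y']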
The Gory Details

- Of course, it’s not as easy as “cat +N +PL” <-> “cats”.
- As we saw earlier, there are geese, mice and oxen.
- But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes:
  - cats vs. dogs
  - fox and foxes
Multi-Tape Machines

- To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next.
- So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols.
Generativity

- Nothing is really privileged about the directions.
- We can write from one tape and read from the other, or vice versa.
- One way is generation, the other way is analysis.
Multi-Level Tape Machines

- We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape.
Lexical to Intermediate Level
Intermediate to Surface

- The “add an e” rule, as in fox^s# <-> foxes#
Foxes
Note

- A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply: they are written out unchanged to the output tape.
- It turns out the multiple tapes aren’t really needed; they can be compiled away.
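One way to read "compiled away": cascading the levels is just composition of the relations, so the two machines can in principle be combined into a single transducer. A toy sketch, assuming the tnum and apply_spelling functions from the earlier sketches are in scope.

def compose(f, g):
    """Run f, then feed its output to g -- a function-level stand-in for FST composition."""
    return lambda x: g(f(x))

generate_surface = compose(tnum, apply_spelling)  # lexical -> surface in one step
print(generate_surface("fox+N+Pl"))  # foxes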
Overall Scheme

- We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity).
  - lexical level to intermediate forms
- We have a larger set of machines that capture orthographic/spelling rules.
  - intermediate forms to surface forms
Overall Scheme
http://nltk.sourceforge.net/index.php/Documentation