CSC 9010 Natural Language Processing Lecture 3: Morphology, Finite State Transducers Paula Matuszek Mary-Angela Papalaskari Presentation slides adapted from: Marti Hearst (some following Dorr and Habash) http://www.sims.berkeley.edu/courses/is290-2/f04/ Jim.


CSC 9010
Natural Language Processing
Lecture 3: Morphology, Finite State
Transducers
Paula Matuszek
Mary-Angela Papalaskari
Presentation slides adapted from:
Marti Hearst (some following Dorr and Habash) http://www.sims.berkeley.edu/courses/is290-2/f04/
Jim Martin: http://www.cs.colorado.edu/~martin/csci5832.html
CSC 9010- NLP - 3: Morphology, Finite State Transducers
Today
Elementary Morphology
Computational morphology
Finite State Transducers
Lexicon-only schemes
Rule-only schemes
Lab: Introduction to NLTK
Morphology
Morphology:
The study of the way words are built up from smaller meaning
units.
Morphemes:
The smallest meaningful unit in the grammar of a language.
Contrasts:
Derivational vs. Inflectional
Regular vs. Irregular
Concatenative vs. Templatic (root-and-pattern)
A useful resource:
Glossary of linguistic terms by Eugene Loos
http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm
Examples (English)
“unladylike”
3 morphemes, 4 syllables
un- ‘not’
lady ‘(well behaved) female adult human’
-like ‘having the characteristics of’
Can’t break any of these down further without distorting the meaning of the units
“technique”
1 morpheme, 2 syllables
“dogs”
2 morphemes, 1 syllable
-s, a plural marker on nouns
Morpheme Definitions
Root
The portion of the word that:
– is common to a set of derived or inflected forms, if any, when all
affixes are removed
– is not further analyzable into meaningful elements
– carries the principal portion of meaning of the words
Stem
The root or roots of a word, together with any derivational affixes,
to which inflectional affixes are added.
Affix
A bound morpheme that is joined before, after, or within a root or
stem.
Clitic
a morpheme that functions syntactically like a word, but does not
appear as an independent phonological word
– Spanish: un beso, las aguas
– English: Hal’s (genitive marker)
– Proto-Indo-European: kwe → -que (Latin), te (Greek), and -ca (Sanskrit)
Inflectional vs. Derivational
Word Classes
Parts of speech: noun, verb, adjectives, etc.
Word class dictates how a word combines with morphemes to
form new words
Inflection:
Variation in the form of a word, typically by means of an
affix, that expresses a grammatical contrast.
– Doesn’t change the word class
– Usually produces a predictable, non-idiosyncratic change of
meaning.
Derivation:
The formation of a new word or inflectable stem from another
word or stem.
Inflectional Morphology
Adds:
tense, number, person, mood, aspect
Word class doesn’t change
Word serves new grammatical role
Examples
come is inflected for person and number:
The pizza guy comes at noon.
las and rojas are inflected for agreement with
manzanas in grammatical gender by -a and in
number by –s
las manzanas rojas
(‘the red apples’)
Derivational Morphology
Nominalization (formation of nouns from other parts of speech,
primarily verbs in English):
computerization
appointee
killer
fuzziness
Formation of adjectives (primarily from nouns)
computational
clueless
embraceable
Difficult cases:
building → from which sense of “build”?
Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
hope+ing → hoping
hop+ing → hopping
Affixes
Prefixes: Anti-dis-establishmentarianism
Suffixes: Antidisestablish-ment-arian-ism
Infixes: hingi (borrow) – humingi (borrower) in Tagalog
Circumfixes: sagen (say) – gesagt (said) in German
Agglutinative Languages
uygarlaştıramadıklarımızdanmışsınızcasına (Turkish)
uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
Behaving as if you are among those whom we could not cause to become civilized
Templatic Morphology
Roots and Patterns
Example: Hebrew verbs
Root:
– Consists of 3 consonants CCC
– Carries basic meaning
Template:
– Gives the ordering of consonants and vowels
– Specifies semantic information about the verb
    • Active, passive, middle voice
Example:
– lmd (to learn or study)
    • CaCaC -> lamad (he studied)
    • CiCeC -> limed (he taught)
    • CuCaC -> lumad (he was taught)
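The root-and-pattern idea can be sketched in a few lines of Python. This is only an illustration, assuming that each C in a template is a slot for the next root consonant and that every other symbol is copied as-is; the function name apply_template is hypothetical.

```python
def apply_template(root: str, template: str) -> str:
    """Fill each 'C' slot in the template with the next consonant of the root."""
    consonants = iter(root)
    filled = []
    for symbol in template:
        if symbol == "C":
            filled.append(next(consonants))  # take the next root consonant
        else:
            filled.append(symbol)            # template vowels are copied unchanged
    return "".join(filled)


# The three templates from the slide, applied to the root lmd.
for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template("lmd", template))
```

The point of the sketch is that the root and the template are stored separately and interleaved, rather than concatenated as in the previous slide.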
Nouns and Verbs (in English)
Nouns have simple inflectional morphology
cat
cat+s, cat+’s
Verbs have more complex morphology
Nouns and Verbs (in English)
Nouns
Have simple inflectional morphology
Cat/Cats
Mouse/Mice, Ox/Oxen, Goose/Geese
Verbs
More complex morphology
Walk/Walked
Go/Went, Fly/Flew
Regular (English) Verbs
Morphological Form Classes     Regularly Inflected Verbs
Stem                           walk      merge     try       map
-s form                        walks     merges    tries     maps
-ing form                      walking   merging   trying    mapping
Past form or -ed participle    walked    merged    tried     mapped
Irregular (English) Verbs
Morphological Form Classes     Irregularly Inflected Verbs
Stem                           eat       catch      cut
-s form                        eats      catches    cuts
-ing form                      eating    catching   cutting
Past form                      ate       caught     cut
-ed participle                 eaten     caught     cut
“To love” in Spanish
Syntax and Morphology
Phrase-level agreement
Subject-Verb
– John studies hard (STUDY+3SG)
Noun-Adjective
– Las vacas hermosas
Sub-word phrasal structures
‫שבספרינו‬
‫נו‬+‫ים‬+‫ספר‬+‫ב‬+‫ש‬
That+in+book+PL+Poss:1PL
Which are in our books
Phonology and Morphology
Script Limitations
Spoken English has 14 vowels
– heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough
The English alphabet has 5 vowel letters
– Use vowel combinations: far fair fare
– Consonantal doubling (hopping vs. hoping)
Computational Morphology
Approaches
Lexicon only
Rules only
Lexicon and Rules
– Finite-state Automata
– Finite-state Transducers
Systems
WordNet’s morphy
PCKimmo
– Named after Kimmo Koskenniemi, much work done by Lauri Karttunen,
Ron Kaplan, and Martin Kay
– Accurate but complex
– http://www.sil.org/pckimmo/
Two-level morphology
– Commercial version available from InXight Corp.
Background
Chapter 3 of Jurafsky and Martin
A short history of Two-Level Morphology
– http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/
Computational Morphology
WORD       STEM (+FEATURES)*
cats       cat +N +PL
cat        cat +N +SG
cities     city +N +PL
geese      goose +N +PL
ducks      (duck +N +PL) or (duck +V +3SG)
merging    merge +V +PRES-PART
caught     (catch +V +PAST-PART) or (catch +V +PAST)
FSAs and the Lexicon
First we’ll capture the morphotactics
The rules governing the ordering of affixes in a
language.
Then we’ll add in the actual words
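A minimal sketch of this two-step plan for English noun inflection, in Python. The state names, the morpheme class labels (reg-noun, irreg-sg-noun, irreg-pl-noun, plural-s) and the toy word list are assumptions made for illustration; the automaton runs over morpheme classes first, and the words are plugged in through a separate table.

```python
# Step 1: morphotactics, an FSA whose arcs are morpheme classes rather than letters.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-s"): "q2",
}
ACCEPTING = {"q1", "q2"}

# Step 2: the actual words, each assigned to a class (a toy lexicon).
MORPHEME_CLASS = {
    "cat": "reg-noun", "dog": "reg-noun", "fox": "reg-noun",
    "goose": "irreg-sg-noun", "mouse": "irreg-sg-noun",
    "geese": "irreg-pl-noun", "mice": "irreg-pl-noun",
    "-s": "plural-s",
}


def accepts(morphemes):
    """Return True if the morpheme sequence is a legal noun under the FSA."""
    state = "q0"
    for morpheme in morphemes:
        morpheme_class = MORPHEME_CLASS.get(morpheme)
        state = TRANSITIONS.get((state, morpheme_class))
        if state is None:          # no arc for this class from this state
            return False
    return state in ACCEPTING


print(accepts(["cat", "-s"]))    # regular noun plus plural
print(accepts(["geese", "-s"]))  # irregular plural cannot take another -s
```

Keeping the ordering constraints separate from the word list means new nouns can be added without touching the automaton.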
Simple Rules
Adding the Words
Derivational Rules
Parsing/Generation vs. Recognition
Recognition is usually not quite what we need.
Usually if we find some string in the language we need to
find the structure in it (parsing)
Or we have some structure and we want to produce a
surface form (production/generation)
Example
From “cats” to “cat +N +PL” and back
Morphological analysis
Finite State Transducers
The simple story
Add another tape
Add extra symbols to the transitions
On one tape we read “cats”, on the other we write
“cat +N +PL”, or the other way around.
FSTs
Transitions
c:c
a:a
t:t
+N:ε
+PL:s
c:c means read a c on one tape and write a c on the other
+N:ε means read a +N symbol on one tape and write nothing
on the other
+PL:s means read +PL and write an s
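The transitions above can be tried out with a small Python sketch. It is an illustration under stated assumptions, not a full FST toolkit: the state numbers, the ARCS table layout, the transduce function and the extra +SG:ε arc (taken from the "cat +N +SG" analysis earlier in the lecture) are all introduced here. The same arcs are used in both directions by swapping which tape is read.

```python
EPS = ""  # epsilon: nothing is read or written on that tape

# Each arc: (from_state, lexical_symbol, surface_symbol, to_state)
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),
    (4, "+PL", "s", 5),
    (4, "+SG", EPS, 5),   # assumed arc for the singular
]
FINAL = {5}


def transduce(symbols, read_lexical=True, state=0):
    """Yield every output symbol list the FST produces for the input symbols."""
    if not symbols and state in FINAL:
        yield []
    for src, lexical, surface, dst in ARCS:
        if src != state:
            continue
        inp, out = (lexical, surface) if read_lexical else (surface, lexical)
        if inp == EPS:
            rest = symbols                     # epsilon on the input tape: consume nothing
        elif symbols and symbols[0] == inp:
            rest = symbols[1:]                 # consume one input symbol
        else:
            continue
        for tail in transduce(rest, read_lexical, dst):
            yield ([out] if out != EPS else []) + tail


# Generation: lexical tape in, surface tape out.
print(["".join(p) for p in transduce(["c", "a", "t", "+N", "+PL"])])
# Parsing: surface tape in, lexical tape out.
print([" ".join(p) for p in transduce(list("cats"), read_lexical=False)])
```

Because the function yields every successful path, it also returns several analyses when an input is ambiguous.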
Ambiguity
Recall that in non-deterministic recognition multiple
paths through a machine may lead to an accept state.
Didn’t matter which path was actually traversed
In FSTs the path to an accept state does matter since
different paths represent different parses and different
outputs will result
Ambiguity
What’s the right parse for
Unionizable
Union-ize-able
Un-ion-ize-able
Each represents a valid path through the derivational
morphology machine.
Ambiguity
There are a number of ways to deal with this
problem
Simply take the first output found
Find all the possible outputs (all paths) and return
them all (without choosing)
Bias the search so that only one or a few likely paths
are explored
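The first two strategies can be illustrated with a small Python sketch that enumerates every segmentation of a word into known morphs. This is a simplification, not the derivational machine itself: it ignores affix ordering, and the morph inventory is hypothetical ("iz" stands in for -ize with its final e dropped before -able).

```python
# Hypothetical surface morphs for the "unionizable" example.
MORPHS = {"un", "ion", "union", "iz", "able"}


def segmentations(word):
    """Return every way to split word into known morphs (all paths)."""
    if not word:
        return [[]]                      # one way to segment the empty string
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in MORPHS:
            for rest in segmentations(word[end:]):
                results.append([prefix] + rest)
    return results


all_parses = segmentations("unionizable")   # strategy: return all outputs
first_parse = all_parses[0]                 # strategy: take the first output found
for parse in all_parses:
    print("-".join(parse))
```

Taking the first parse is cheap but arbitrary, since it depends only on the search order; returning all parses pushes the choice to a later component.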
The Gory Details
Of course, it’s not as easy as
“cat +N +PL” <-> “cats”
As we saw earlier there are geese, mice and oxen
But there are also a whole host of
spelling/pronunciation changes that go along with
inflectional changes
Cats vs Dogs
Multi-tape machines
Multi-Level Tape Machines
We use one machine to transduce between the lexical and the
intermediate level, and another to handle the spelling changes
to the surface tape
Lexical to Intermediate Level
Intermediate to Surface
The “add an e” rule, as in fox^s# <-> foxes
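A rough Python sketch of this intermediate-to-surface step, using a regular expression in place of a transducer. It assumes that ^ marks a morpheme boundary and # a word boundary, and that the e is inserted after x, s, or z; the slide only shows the fox example, so the exact set of triggering letters is an assumption. The function name is hypothetical.

```python
import re


def intermediate_to_surface(form: str) -> str:
    """Apply e-insertion, then erase the boundary symbols."""
    # Replace the morpheme boundary with e when it sits between x/s/z and a word-final s.
    form = re.sub(r"(?<=[xsz])\^(?=s#)", "e", form)
    # Any remaining boundary symbols have no surface realisation.
    return form.replace("^", "").replace("#", "")


for intermediate in ("fox^s#", "cat^s#", "fox#"):
    print(intermediate, "->", intermediate_to_surface(intermediate))
```

A regular expression only covers the generation direction; a real transducer for the rule can also be run from surface back to intermediate form.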
Foxes
FST Review
FSTs allow us to take an input and deliver a structure
based on it
Or… take a structure and create a surface form
Or take a structure and create another structure
In many applications it’s convenient to decompose the
problem into a set of cascaded transducers, where
the output of one feeds into the input of the next.
We’ll see this scheme again for deeper semantic
processing.
Overall Plan
Lexicon-only Morphology
• The lexicon lists all surface level and lexical level pairs
• No rules …
• Analysis/Generation is easy
• Very large for English
• What about
    • Arabic or
    • Turkish or
    • Chinese?

Surface level    Lexical level
acclaim          acclaim $N$
acclaim          acclaim $V+0$
acclaimed        acclaim $V+ed$
acclaimed        acclaim $V+en$
acclaiming       acclaim $V+ing$
acclaims         acclaim $N+s$
acclaims         acclaim $V+s$
acclamation      acclamation $N$
acclamations     acclamation $N+s$
acclimate        acclimate $V+0$
acclimated       acclimate $V+ed$
acclimated       acclimate $V+en$
acclimates       acclimate $V+s$
acclimating      acclimate $V+ing$
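Why analysis and generation are "easy" in a lexicon-only scheme can be seen from a short Python sketch: both directions are plain table lookups. The pairs are a subset of those listed on this slide; the names PAIRS, analyze and generate are hypothetical.

```python
from collections import defaultdict

# (surface form, lexical form) pairs, copied from the slide's listing.
PAIRS = [
    ("acclaim", "acclaim $N$"),
    ("acclaim", "acclaim $V+0$"),
    ("acclaimed", "acclaim $V+ed$"),
    ("acclaimed", "acclaim $V+en$"),
    ("acclaiming", "acclaim $V+ing$"),
    ("acclaims", "acclaim $N+s$"),
    ("acclaims", "acclaim $V+s$"),
]

analyses = defaultdict(list)   # surface form -> all lexical forms
surfaces = {}                  # lexical form -> surface form
for surface, lexical in PAIRS:
    analyses[surface].append(lexical)
    surfaces[lexical] = surface


def analyze(word):
    """Analysis: look the surface form up; unknown words get no analysis."""
    return analyses.get(word, [])


def generate(lexical):
    """Generation: look the lexical form up; None if it is not listed."""
    return surfaces.get(lexical)


print(analyze("acclaims"))
print(generate("acclaim $V+ing$"))
print(analyze("acclaimer"))   # not in the table, so nothing comes back
```

The last call shows the cost of having no rules: any form that was not listed in advance is simply unknown, which is why the table grows so large.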
Stemming vs Morphology
Sometimes you just need to know the stem of a word
and you don’t care about the structure.
In fact you may not even care if you get the right
stem, as long as you get a consistent string.
This is stemming… it most often shows up in IR
applications
Stemming in IR
Run a stemmer on the documents to be indexed
Run a stemmer on users’ queries
Match
This is basically a form of hashing
Example: Computerization
-ization -> -ize: computerize
-ize -> ε: computer
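The index-then-match pipeline can be sketched in Python as follows. The stemmer here is a deliberately tiny toy that knows only the two rewrites from the Computerization example; the document collection and all function names are made up for illustration.

```python
from collections import defaultdict


def toy_stem(word):
    """Apply only the two rewrites from the example: -ization -> -ize, then -ize -> nothing."""
    word = word.lower()
    if word.endswith("ization"):
        word = word[: -len("ization")] + "ize"
    if word.endswith("ize"):
        word = word[: -len("ize")]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query terms the same way and intersect their posting sets."""
    hits = [index.get(toy_stem(term), set()) for term in query.split()]
    return set.intersection(*hits) if hits else set()


docs = {
    1: "computerization of records",
    2: "a new computer",
    3: "office records",
}
index = build_index(docs)
print(search(index, "computerize"))
print(search(index, "computer records"))
```

The stem works like a hash key: it only has to be the same string for related words on both the document side and the query side, not a real word.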
Porter Stemmer
Step 4: Derivational Morphology I: Multiple Suffixes
Condition  Suffix     Replacement    Example
(m>0)      ATIONAL -> ATE            relational     -> relate
(m>0)      TIONAL  -> TION           conditional    -> condition
                                     rational       -> rational
(m>0)      ENCI    -> ENCE           valenci        -> valence
(m>0)      ANCI    -> ANCE           hesitanci      -> hesitance
(m>0)      IZER    -> IZE            digitizer      -> digitize
(m>0)      ABLI    -> ABLE           conformabli    -> conformable
(m>0)      ALLI    -> AL             radicalli      -> radical
(m>0)      ENTLI   -> ENT            differentli    -> different
(m>0)      ELI     -> E              vileli         -> vile
(m>0)      OUSLI   -> OUS            analogousli    -> analogous
(m>0)      IZATION -> IZE            vietnamization -> vietnamize
(m>0)      ATION   -> ATE            predication    -> predicate
(m>0)      ATOR    -> ATE            operator       -> operate
(m>0)      ALISM   -> AL             feudalism      -> feudal
(m>0)      IVENESS -> IVE            decisiveness   -> decisive
(m>0)      FULNESS -> FUL            hopefulness    -> hopeful
(m>0)      OUSNESS -> OUS            callousness    -> callous
(m>0)      ALITI   -> AL             formaliti      -> formal
(m>0)      IVITI   -> IVE            sensitiviti    -> sensitive
(m>0)      BILITI  -> BLE            sensibiliti    -> sensible
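A sketch of how this one step might be coded in Python. The slide does not define m; the code assumes Porter's "measure", the number of vowel-consonant sequences in the stem that remains after the suffix is removed, and assumes that only the longest matching suffix is tested. It covers this step alone, not the whole stemmer, and the function names are hypothetical.

```python
STEP4_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]


def is_consonant(word, i):
    """a, e, i, o, u are vowels; y counts as a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m


def step4(word):
    """Rewrite the longest matching suffix if the remaining stem has m > 0."""
    for suffix, replacement in sorted(STEP4_RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for word in ("relational", "conditional", "rational", "vietnamization"):
    print(word, "->", step4(word))
```

The m>0 condition is what keeps rational unchanged: removing -ational leaves only "r", which contains no vowel-consonant sequence.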
Porter
No lexicon needed
Basically a set of staged sets of rewrite rules that
strip suffixes
Handles both inflectional and derivational suffixes
Doesn’t guarantee that the resulting stem is really a
stem (see first bullet)
Lack of guarantee doesn’t matter for IR
Porter Stemmer
Errors of Omission
European          Europe
analysis          analyzes
matrices          matrix
noise             noisy
explain           explanation
Errors of Commission
organization      organ
doing             doe
generalization    generic
numerical         numerous
university        universe
Soundex
You work as the Villanova telephone operator.
Someone calls looking for:
Dr Papalarsky
or
Dr Matuzka
What do you type as your query string?
Soundex
1. Keep the first letter
2. Drop non-initial occurrences of vowels, h, w and y
3. Replace the remaining letters with numbers
according to group (e.g., b, f, p, and v -> 1)
4. Replace strings of identical numbers with a single
number (333 -> 3)
5. Drop any numbers beyond a third one
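The five steps can be followed literally in a short Python sketch. The slide only gives the first letter group; the remaining groups below are the ones used in standard Soundex and are an addition here. Standard Soundex also pads short codes with zeros and treats letters separated by a vowel slightly differently, so codes from this sketch will not always match official ones.

```python
# Letter groups: only "bfpv" -> 1 is on the slide; the rest follow standard Soundex.
GROUPS = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
CODE = {letter: digit for letters, digit in GROUPS.items() for letter in letters}


def soundex(name):
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]            # step 1: keep the first letter
    kept = [ch for ch in rest if ch not in "aeiouhwy"]   # step 2: drop vowels, h, w, y
    digits = [CODE[ch] for ch in kept if ch in CODE]     # step 3: letters -> group numbers
    collapsed = []
    for digit in digits:                             # step 4: collapse runs of one number
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    return first.upper() + "".join(collapsed[:3])    # step 5: keep at most three numbers


for name in ("Matuszek", "Matuzka", "Papalaskari", "Papalarsky"):
    print(name, soundex(name))
```

With these steps Matuszek and Matuzka receive the same code, so the misspelled query finds the right entry. Papalaskari and Papalarsky do not, because the transposed r and s fall within the first three numbers; Soundex tolerates vowel and doubled-letter variation better than reordered consonants.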
Soundex
Effect is to map (hash) all similar sounding
transcriptions to the same code.
Structure your directory so that it can be accessed by
code as well as by correct spelling
Used for census records, phone directories, author
searches in libraries etc.