CSC 9010 Natural Language Processing
Lecture 3: Morphology, Finite State Transducers
Paula Matuszek, Mary-Angela Papalaskari

Presentation slides adapted from:
Marti Hearst (some following Dorr and Habash): http://www.sims.berkeley.edu/courses/is290-2/f04/
Jim Martin: http://www.cs.colorado.edu/~martin/csci5832.html

Today
Elementary Morphology
Computational morphology
Finite State Transducers
Lexicon-only schemes
Rule-only schemes
Lab: Introduction to NLTK

Morphology
Morphology: the study of the way words are built up from smaller meaning units.
Morphemes: the smallest meaningful units in the grammar of a language.
Contrasts:
– Derivational vs. inflectional
– Regular vs. irregular
– Concatenative vs. templatic (root-and-pattern)
A useful resource: Glossary of linguistic terms by Eugene Loos
http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm

Examples (English)
“unladylike”: 3 morphemes, 4 syllables
– un-: ‘not’
– lady: ‘(well behaved) female adult human’
– -like: ‘having the characteristics of’
– Can’t break any of these down further without distorting the meaning of the units
“technique”: 1 morpheme, 2 syllables
“dogs”: 2 morphemes, 1 syllable
– -s, a plural marker on nouns

Morpheme Definitions
Root: the portion of the word that
– is common to a set of derived or inflected forms, if any, when all affixes are removed
– is not further analyzable into meaningful elements
– carries the principal portion of meaning of the words
Stem: the root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.
Affix: a bound morpheme that is joined before, after, or within a root or stem.
Clitic: a morpheme that functions syntactically like a word, but does not appear as an independent phonological word
– Spanish: un beso, las aguas
– English: Hal’s (genitive marker)
– Proto-Indo-European: *kwe, giving -que (Latin), te (Greek), and -ca (Sanskrit)

Inflectional vs. Derivational
Word classes
– Parts of speech: noun, verb, adjective, etc.
– Word class dictates how a word combines with morphemes to form new words
Inflection: variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
– Doesn’t change the word class
– Usually produces a predictable, non-idiosyncratic change of meaning.
Derivation: the formation of a new word or inflectable stem from another word or stem.

Inflectional Morphology
Adds: tense, number, person, mood, aspect
Word class doesn’t change
Word serves new grammatical role
Examples
– come is inflected for person and number: The pizza guy comes at noon.
– las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s: las manzanas rojas (‘the red apples’)

Derivational Morphology
Nominalization (formation of nouns from other parts of speech, primarily verbs in English):
– computerization
– appointee
– killer
– fuzziness
Formation of adjectives (primarily from nouns):
– computational
– clueless
– embraceable
Difficult cases:
– building: from which sense of “build”?

Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
– hope+ing -> hoping
– hop+ing -> hopping
Affixes
– Prefixes: antidisestablishmentarianism (anti-, dis-)
– Suffixes: antidisestablishmentarianism (-ment, -arian, -ism)
– Infixes: hingi (borrow) – humingi (borrower) in Tagalog
– Circumfixes: sagen (say) – gesagt (said) in German
Agglutinative languages
– uygarlaştıramadıklarımızdanmışsınızcasına (Turkish)
– uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
– Behaving as if you are among those whom we could not cause to become civilized

Templatic Morphology
Roots and patterns
Example: Hebrew verbs
Root:
– Consists of 3 consonants CCC
– Carries basic meaning
Template:
– Gives the ordering of consonants and vowels
– Specifies semantic information about the verb: active, passive, middle voice
Example:
– lmd (to learn or study)
– CaCaC -> lamad (he studied)
– CiCeC -> limed (he taught)
– CuCaC -> lumad (he was taught)

Nouns and Verbs (in English)
Nouns have simple inflectional morphology
– cat
– cat+s, cat+’s
Verbs have more complex morphology

Nouns and Verbs (in English)
Nouns
– Have simple inflectional morphology
– Cat/Cats
– Mouse/Mice, Ox/Oxen, Goose/Geese
Verbs
– More complex morphology
– Walk/Walked
– Go/Went, Fly/Flew

Regular (English) Verbs
Morphological Form Classes    Regularly Inflected Verbs
Stem                          walk     merge    try     map
-s form                       walks    merges   tries   maps
-ing form                     walking  merging  trying  mapping
Past form or -ed participle   walked   merged   tried   mapped

Irregular (English) Verbs
Morphological Form Classes    Irregularly Inflected Verbs
Stem                          eat      catch     cut
-s form                       eats     catches   cuts
-ing form                     eating   catching  cutting
Past form                     ate      caught    cut
-ed participle                eaten    caught    cut

“To love” in Spanish
[Figure]

Syntax and Morphology
Phrase-level agreement
– Subject-Verb: John studies hard (STUDY+3SG)
– Noun-Adjective: Las vacas hermosas
Sub-word phrasal structures
– שבספרינו
– נו+ים+ספר+ב+ש
– That+in+book+PL+Poss:1PL
– Which are in our books

Phonology and Morphology
Script limitations
Spoken English has 14 vowels
– heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough
English alphabet has 5
– Use vowel combinations: far fair fare
– Consonantal doubling (hopping vs. hoping)

Computational Morphology
Approaches
– Lexicon only
– Rules only
– Lexicon and rules: finite-state automata, finite-state transducers
Systems
– WordNet’s morphy
– PCKimmo: named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay; accurate but complex; http://www.sil.org/pckimmo/
– Two-level morphology: commercial version available from InXight Corp.
Background
– Chapter 3 of Jurafsky and Martin
– A short history of Two-Level Morphology: http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/

Computational Morphology
WORD      STEM (+FEATURES)*
cats      cat +N +PL
cat       cat +N +SG
cities    city +N +PL
geese     goose +N +PL
ducks     (duck +N +PL) or (duck +V +3SG)
merging   merge +V +PRES-PART
caught    (catch +V +PAST-PART) or (catch +V +PAST)

FSAs and the Lexicon
First we’ll capture the morphotactics
– The rules governing the ordering of affixes in a language.
Then we’ll add in the actual words

Simple Rules
[Figure]

Adding the Words
[Figure]

Derivational Rules
[Figure]

Parsing/Generation vs. Recognition
Recognition is usually not quite what we need.
– Usually if we find some string in the language we need to find the structure in it (parsing)
– Or we have some structure and we want to produce a surface form (production/generation)
Example
– From “cats” to “cat +N +PL” and back
Morphological analysis

Finite State Transducers
The simple story
– Add another tape
– Add extra symbols to the transitions
– On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around.

FSTs
[Figure]

Transitions
c:c  a:a  t:t  +N:ε  +PL:s
– c:c means read a c on one tape and write a c on the other
– +N:ε means read a +N symbol on one tape and write nothing on the other
– +PL:s means read +PL and write an s

Ambiguity
Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
– Didn’t matter which path was actually traversed
In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result

Ambiguity
What’s the right parse for unionizable?
– Union-ize-able
– Un-ion-ize-able
Each represents a valid path through the derivational morphology machine.
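
The two-tape idea and the ambiguity it produces can be tried out in a few lines of Python. This is only a minimal sketch, not the machine from the slides: the one-word lexicon ("duck", treated as both noun and verb as in the WORD/STEM table), the +SG:ε arc, the state numbering, and the names build_arcs, transduce, analyze and generate are all assumptions made for illustration.

```python
EPS = ""  # epsilon: read or write nothing on that tape

def build_arcs(stem):
    """Arcs (state, lexical symbol, surface symbol, next state) for one stem.

    Hypothetical toy lexicon entry: the stem is treated as both noun and verb.
    """
    arcs = [(i, ch, ch, i + 1) for i, ch in enumerate(stem)]  # d:d u:u c:c k:k
    end = len(stem)
    noun, verb, final = end + 1, end + 2, end + 3
    arcs += [
        (end, "+N", EPS, noun),      # +N:ε
        (noun, "+SG", EPS, final),   # +SG:ε (assumed: singular adds nothing)
        (noun, "+PL", "s", final),   # +PL:s
        (end, "+V", EPS, verb),      # +V:ε
        (verb, "+3SG", "s", final),  # +3SG:s
    ]
    return arcs, final

def transduce(symbols, arcs, final, surface_to_lexical):
    """Follow every path that reads `symbols` on one tape; collect the other tape."""
    results = []
    stack = [(0, 0, [])]  # (state, symbols consumed, output so far)
    while stack:
        state, pos, out = stack.pop()
        if state == final and pos == len(symbols):
            results.append(out)
        for src, lex, surf, dst in arcs:
            if src != state:
                continue
            read, write = (surf, lex) if surface_to_lexical else (lex, surf)
            if read == EPS:  # epsilon arc: move on without consuming input
                stack.append((dst, pos, out + [write]))
            elif pos < len(symbols) and symbols[pos] == read:
                stack.append((dst, pos + 1, out + [write]))
    return results

def analyze(word, arcs, final):
    """Surface form -> sorted list of all lexical parses."""
    paths = transduce(list(word), arcs, final, surface_to_lexical=True)
    return sorted(
        "".join(" " + s if s.startswith("+") else s for s in path) for path in paths
    )

def generate(stem, features, arcs, final):
    """Lexical form (stem plus feature symbols) -> list of surface forms."""
    paths = transduce(list(stem) + features, arcs, final, surface_to_lexical=False)
    return ["".join(path) for path in paths]

ARCS, FINAL = build_arcs("duck")
print(analyze("ducks", ARCS, FINAL))                 # both parses are returned
print(generate("duck", ["+N", "+PL"], ARCS, FINAL))  # back to the surface form
```

Here analyze returns every accepting path rather than choosing one, so "ducks" comes back with both the noun and the verb reading; which reading is right has to be decided elsewhere.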

Ambiguity
There are a number of ways to deal with this problem
– Simply take the first output found
– Find all the possible outputs (all paths) and return them all (without choosing)
– Bias the search so that only one or a few likely paths are explored

The Gory Details
Of course, it’s not as easy as “cat +N +PL” <-> “cats”
As we saw earlier there are geese, mice and oxen
But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
– Cats vs. dogs
Multi-tape machines

Multi-Level Tape Machines
We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape
[Figure]

Lexical to Intermediate Level
[Figure]

Intermediate to Surface
The “add an e” rule, as in fox^s# <-> foxes
[Figure]

Foxes
[Figure]

FST Review
FSTs allow us to take an input and deliver a structure based on it
Or take a structure and create a surface form
Or take a structure and create another structure
In many applications it’s convenient to decompose the problem into a set of cascaded transducers, where the output of one feeds into the input of the next.
– We’ll see this scheme again for deeper semantic processing.

Overall Plan
[Figure]

Lexicon-only Morphology
• The lexicon lists all surface level and lexical level pairs
• No rules …
• Analysis/generation is easy
• Very large for English
• What about Arabic, Turkish, or Chinese?

Surface level    Lexical level
acclaim          acclaim $N$
acclaim          acclaim $V+0$
acclaimed        acclaim $V+ed$
acclaimed        acclaim $V+en$
acclaiming       acclaim $V+ing$
acclaims         acclaim $N+s$
acclaims         acclaim $V+s$
acclamation      acclamation $N$
acclamations     acclamation $N+s$
acclimate        acclimate $V+0$
acclimated       acclimate $V+ed$
acclimated       acclimate $V+en$
acclimates       acclimate $V+s$
acclimating      acclimate $V+ing$

Stemming vs Morphology
Sometimes you just need to know the stem of a word and you don’t care about the structure.
In fact you may not even care if you get the right stem, as long as you get a consistent string.
This is stemming; it most often shows up in IR applications

Stemming in IR
Run a stemmer on the documents to be indexed
Run a stemmer on users’ queries
Match
– This is basically a form of hashing
Example: computerization
– -ization -> -ize: computerize
– -ize -> ε: computer

Porter Stemmer
Step 4: Derivational Morphology I: Multiple Suffixes
Condition  Suffix     Replacement   Example
(m>0)      ATIONAL -> ATE           relational -> relate
(m>0)      TIONAL  -> TION          conditional -> condition
                                    rational -> rational
(m>0)      ENCI    -> ENCE          valenci -> valence
(m>0)      ANCI    -> ANCE          hesitanci -> hesitance
(m>0)      IZER    -> IZE           digitizer -> digitize
(m>0)      ABLI    -> ABLE          conformabli -> conformable
(m>0)      ALLI    -> AL            radicalli -> radical
(m>0)      ENTLI   -> ENT           differentli -> different
(m>0)      ELI     -> E             vileli -> vile
(m>0)      OUSLI   -> OUS           analogousli -> analogous
(m>0)      IZATION -> IZE           vietnamization -> vietnamize
(m>0)      ATION   -> ATE           predication -> predicate
(m>0)      ATOR    -> ATE           operator -> operate
(m>0)      ALISM   -> AL            feudalism -> feudal
(m>0)      IVENESS -> IVE           decisiveness -> decisive
(m>0)      FULNESS -> FUL           hopefulness -> hopeful
(m>0)      OUSNESS -> OUS           callousness -> callous
(m>0)      ALITI   -> AL            formaliti -> formal
(m>0)      IVITI   -> IVE           sensitiviti -> sensitive
(m>0)      BILITI  -> BLE           sensibiliti -> sensible

Porter
No lexicon needed
Basically a set of staged sets of rewrite rules that strip suffixes
Handles both inflectional and derivational suffixes
Doesn’t guarantee that the resulting stem is really a stem (see first bullet)
Lack of guarantee doesn’t matter for IR

Porter Stemmer
Errors of Omission (related words that are not conflated)
European        Europe
analysis        analyzes
matrices        matrix
noise           noisy
explain         explanation
Errors of Commission (unrelated words that are conflated)
organization    organ
doing           doe
generalization  generic
numerical       numerous
university      universe

Soundex
You work as the Villanova telephone operator. Someone calls looking for:
Dr Papalarsky or Dr Matuzka ????????
What do you type as your query string?

Soundex
1. Keep the first letter
2. Drop non-initial occurrences of vowels, h, w and y
3. Replace the remaining letters with numbers according to group (e.g., b, f, p, and v -> 1)
4. Replace strings of identical numbers with a single number (333 -> 3)
5. Drop any numbers beyond a third one

Soundex
Effect is to map (hash) all similar sounding transcriptions to the same code.
Structure your directory so that it can be accessed by code as well as by correct spelling
Used for census records, phone directories, author searches in libraries, etc.
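
The five Soundex steps can be sketched directly in Python. This is a minimal sketch that follows the simplified steps on the slide in the order given, not a reference implementation: the slide names only the first letter group, so the remaining groups below are the usual Soundex ones, and the function name soundex and the choice not to pad short codes are assumptions.

```python
# Letter groups: the slide lists only b, f, p, v -> 1; the rest are the usual Soundex groups.
GROUPS = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
CODE = {letter: digit for letters, digit in GROUPS.items() for letter in letters}

def soundex(name):
    """Simplified Soundex code for `name`, following the five steps in slide order."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]            # step 1: keep the first letter
    # Steps 2 and 3: vowels, h, w and y have no code, so they are dropped here.
    digits = [CODE[ch] for ch in rest if ch in CODE]
    collapsed = []
    for digit in digits:                             # step 4: collapse repeated digits
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    return first.upper() + "".join(collapsed[:3])    # step 5: at most three digits

for spelling in ["Matuszek", "Matuzka", "Papalaskari", "Papalarsky"]:
    print(spelling, soundex(spelling))
```

With these rules Matuszek and Matuzka both map to M32, so the caller's spelling finds the right entry. Papalaskari and Papalarsky come out as P142 and P146: the misplaced r changes the third digit, a reminder that the code tolerates some misspellings but not all.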