Part of Speech (POS) Tagging Lab CSC 9010: Special Topics. Natural Language Processing. Paula Matuszek, Mary-Angela Papalaskari Spring, 2005 Examples taken from the Bird, Klein.

Part of Speech (POS) Tagging Lab

CSC 9010: Special Topics. Natural Language Processing.

Paula Matuszek, Mary-Angela Papalaskari
Spring, 2005

Examples taken from Bird, Klein and Loper: NLTK Tutorial, Tagging, nltk.sourceforge.net/tutorial/tagging/index.html

CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari 1

Simple Taggers

• Three simple taggers in NLTK
  – Default tagger
  – Regular expression tagger
  – Unigram tagger
• All start with tokenized text.

>>> from nltk.tokenizer import *
>>> text_token = Token(TEXT="John saw 3 polar bears .")
>>> WhitespaceTokenizer().tokenize(text_token)
>>> print text_token
<[<John>, <saw>, <3>, <polar>, <bears>, <.>]>
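
The tokenization step can also be sketched in plain Python, without NLTK. This is only an illustration of splitting on whitespace, not the `WhitespaceTokenizer` implementation; the variable names are chosen for this example.

```python
# Plain-Python sketch of whitespace tokenization (not NLTK's implementation).
text = "John saw 3 polar bears ."

# str.split() with no arguments splits on any run of whitespace.
tokens = text.split()

print(tokens)
```

Note that the final period is a separate token only because the input text puts a space before it.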

Default Tagger

• Assigns the same tag to every token.

• We create an instance of the tagger and give it the desired tag.

>>> from nltk.tagger import *
>>> my_tagger = DefaultTagger('nn')
>>> my_tagger.tag(text_token)
>>> print text_token
<[<John/nn>, <saw/nn>, <3/nn>, <polar/nn>, <bears/nn>, <./nn>]>

• We’ve just labeled everything as a noun.

• 20-30% accuracy (terrible), but useful as an adjunct to other taggers.
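
To make the idea concrete, here is a minimal plain-Python sketch of what a default tagger does. The function name `default_tag` is hypothetical and this is not NLTK's `DefaultTagger` code; it only shows the behavior described above.

```python
def default_tag(tokens, tag="nn"):
    """Pair every token with the same tag, whatever the token is."""
    return [(token, tag) for token in tokens]


tokens = "John saw 3 polar bears .".split()
tagged = default_tag(tokens)
print(tagged)
```

Every token, including the number and the period, comes back tagged `nn`, which is why the accuracy is so low.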


Regular Expression Tagger

• Takes a list of regular expressions and tags to assign when they match.

>>> NN_CD_tagger = RegexpTagger([(r'^[0-9]+(\.[0-9]+)?$', 'cd'), (r'.*', 'nn')])
>>> NN_CD_tagger.tag(text_token)
>>> print text_token
<[<John/nn>, <saw/nn>, <3/cd>, <polar/nn>, <bears/nn>, <./nn>]>

• This tags cardinal numbers as CD and everything else as nouns.

• Still pretty poor, but may be a useful step in conjunction with other taggers.
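
The matching logic can be sketched with Python's standard `re` module. The names `PATTERNS` and `regexp_tag` are hypothetical, and this is an illustration of "first matching pattern wins", not NLTK's `RegexpTagger` implementation. The decimal point is escaped here so that it matches only a literal ".".

```python
import re

# (pattern, tag) pairs, tried in order; the first pattern that matches wins.
PATTERNS = [
    (re.compile(r"^[0-9]+(\.[0-9]+)?$"), "cd"),  # cardinal numbers: 3, 3.5
    (re.compile(r".*"), "nn"),                   # catch-all: everything else
]


def regexp_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern it matches."""
    tagged = []
    for token in tokens:
        for pattern, tag in patterns:
            if pattern.match(token):
                tagged.append((token, tag))
                break
    return tagged


tagged = regexp_tag("John saw 3 polar bears .".split())
print(tagged)
```

Because the catch-all pattern comes last, the order of the list matters: putting `.*` first would tag every token as `nn`.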


Unigram Tagger

• Assigns each word its most frequent tag.
• Must be trained to determine frequency.
• Will assign “none” as a tag to any word not seen in the training set.
• About 90% accurate.

• Example training case (from Brown corpus):

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
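
A minimal plain-Python sketch of the unigram idea follows: count how often each word carries each tag, then keep the most frequent one. The function names `train_unigram` and `unigram_tag` are hypothetical and this is not NLTK's `UnigramTagger` code; the training data is the start of the Brown sentence above.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_words):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def unigram_tag(model, tokens):
    """Tag each token with its most frequent training tag, or None if unseen."""
    return [(token, model.get(token)) for token in tokens]


# The first words of the Brown example sentence, as (word, tag) pairs.
training = [
    ("The", "at"), ("Fulton", "np-tl"), ("County", "nn-tl"),
    ("Grand", "jj-tl"), ("Jury", "nn-tl"), ("said", "vbd"),
    ("Friday", "nr"), ("an", "at"), ("investigation", "nn"),
]
model = train_unigram(training)
result = unigram_tag(model, ["The", "Jury", "said", "nothing"])
print(result)
```

The word "nothing" never appears in the training pairs, so it receives `None`, which mirrors the behavior described for unseen words.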

Train the Unigram Tagger

>>> from nltk.tagger import *
>>> from nltk.corpus import brown

# Tokenize ten texts from the Brown Corpus
>>> train_tokens = []
>>> for item in brown.items()[:10]:
...     train_tokens.append(brown.read(item))

# Initialise and train a unigram tagger
>>> mytagger = UnigramTagger(SUBTOKENS='WORDS')
>>> for tok in train_tokens: mytagger.train(tok)


And Then Tag New Text

>>> text_token = Token(TEXT="John saw the book on the table")
>>> WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(text_token)
>>> mytagger.tag(text_token)
>>> print text_token

<[<John/…>, <saw/…>, <the/…>, <book/…>, <on/…>, <the/…>, <table/…>]>

Testing a Tagger

• So how well does the tagger do?

• Split up the inputs into training and testing sets.

>>> train_tokens = []
>>> for item in brown.items()[:10]:     # texts 0-9
...     train_tokens.append(brown.read(item))
>>> unseen_tokens = []
>>> for item in brown.items()[10:12]:   # texts 10-11
...     unseen_tokens.append(brown.read(item))


Train And Test

>>> for tok in train_tokens: mytagger.train(tok)
>>> acc = tagger_accuracy(mytagger, unseen_tokens)
>>> print 'Accuracy = %4.1f%%' % (100 * acc)
Accuracy = 64.6%
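
The accuracy figure is the fraction of test tokens whose predicted tag equals the hand-assigned (gold) tag. A plain-Python sketch of that computation is below; the function `accuracy`, the stand-in tagger `tag_all_nn`, and the four-word gold sample are all hypothetical toy material for illustration, not NLTK's `tagger_accuracy` or real Brown data.

```python
def accuracy(tagger, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    words = [word for word, _ in gold]
    predicted = tagger(words)
    correct = sum(1 for (_, p), (_, g) in zip(predicted, gold) if p == g)
    return correct / len(gold)


def tag_all_nn(words):
    """Stand-in tagger for the demo: labels every word as a noun."""
    return [(word, "nn") for word in words]


# Toy gold-standard sample (made up for this example).
gold = [("John", "np"), ("saw", "vbd"), ("the", "at"), ("book", "nn")]
acc = accuracy(tag_all_nn, gold)
print("Accuracy = %4.1f%%" % (100 * acc))  # prints "Accuracy = 25.0%"
```

The gold sample must be text the tagger was not trained on; scoring on the training texts would overstate how well the tagger handles new input.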

More in NLTK

• Error analysis
• Higher order taggers
  – Bigram
  – Nth-order
• Combining taggers
• Brill tagger
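
One way to read "combining taggers" is as a backoff chain: try the most specific tagger first and fall back to simpler ones when it has no answer. The sketch below assumes that reading; `backoff_tag` and the toy `unigram_model` dictionary are hypothetical names for this example, and the code is not NLTK's own mechanism for combining taggers.

```python
import re

NUMBER = re.compile(r"^[0-9]+(\.[0-9]+)?$")


def backoff_tag(tokens, unigram_model, default="nn"):
    """Try the unigram model, then a number pattern, then a default tag."""
    tagged = []
    for token in tokens:
        tag = unigram_model.get(token)            # 1. most frequent known tag
        if tag is None and NUMBER.match(token):
            tag = "cd"                            # 2. regular expression rule
        if tag is None:
            tag = default                         # 3. default tag
        tagged.append((token, tag))
    return tagged


# Toy unigram model (made up for this example).
unigram_model = {"John": "np", "saw": "vbd", "the": "at"}
tagged = backoff_tag("John saw 3 polar bears .".split(), unigram_model)
print(tagged)
```

This is where the weak default and regular expression taggers earn their keep: they supply a guess for words the trained tagger has never seen.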

For Lab/Homework

• Complete the tagger tutorial from the NLTK tutorial page.

• Tutorial exercises 1, 3, 4, 5 and 10.
• 8.2 (we will compare next time)
• 8.9 (using the NLTK and any higher order tagger)
