Part of Speech (POS) Tagging Lab
CSC 9010: Special Topics. Natural Language Processing.
Paula Matuszek, Mary-Angela Papalaskari. Spring, 2005. Examples taken from Bird, Klein and Loper: NLTK Tutorial, Tagging, nltk.sourceforge.net/tutorial/tagging/index.html
CSC 9010: Special Topics, Natural Language Processing. Spring, 2005. Matuszek & Papalaskari 1
Simple Taggers
• Three simple taggers in NLTK:
  – Default tagger
  – Regular expression tagger
  – Unigram tagger
• All start with tokenized text.
>>> from nltk.tokenizer import *
>>> text_token = Token(TEXT="John saw 3 polar bears .")
>>> WhitespaceTokenizer().tokenize(text_token)
>>> print text_token
<[
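The `Token`/`WhitespaceTokenizer` API above belongs to the old NLTK 1.x release used in this course and no longer exists in modern NLTK; the underlying operation, though, is just splitting on whitespace. A minimal pure-Python sketch (the `tokenize` name here is ours, not NLTK's):

```python
def tokenize(text):
    # Split a string into word tokens on runs of whitespace,
    # mirroring what WhitespaceTokenizer did in NLTK 1.x.
    return text.split()

tokens = tokenize("John saw 3 polar bears .")
# tokens is now ['John', 'saw', '3', 'polar', 'bears', '.']
```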
Default Tagger
• Assigns the same tag to every token.
• We create an instance of the tagger and give it the desired tag.
>>> from nltk.tagger import *
>>> my_tagger = DefaultTagger('nn')
>>> my_tagger.tag(text_token)
>>> print text_token
<[
• 20-30% accuracy (terrible), but useful as an adjunct to other taggers.
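The behaviour of `DefaultTagger` is simple enough to sketch in a few lines of plain Python (illustrative only; `default_tag` is our name, not NLTK's):

```python
def default_tag(tokens, tag='nn'):
    # Assign the same tag to every token, as DefaultTagger does.
    return [(tok, tag) for tok in tokens]

tagged = default_tag(['John', 'saw', '3', 'polar', 'bears', '.'])
# Every token is tagged 'nn', right or wrong.
```

Since nouns are the most common open-class tag, tagging everything `nn` is the best single-tag guess, which is why it still helps as a fallback.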
Regular Expression Tagger
• Takes a list of regular expressions and the tags to assign when they match.
>>> NN_CD_tagger = RegexpTagger([(r'^[0-9]+(\.[0-9]+)?$', 'cd'), (r'.*', 'nn')])
>>> NN_CD_tagger.tag(text_token)
>>> print text_token
<[
• Still pretty poor, but may be a useful step in conjunction with other taggers.
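A pure-Python sketch of the same idea, independent of the old NLTK API (`regexp_tag` and `PATTERNS` are our names): the first pattern that matches a token decides its tag, and the catch-all `.*` at the end plays the role of the default tagger.

```python
import re

# Ordered (pattern, tag) pairs; first match wins.
PATTERNS = [
    (r'^[0-9]+(\.[0-9]+)?$', 'cd'),  # integers and decimals -> cardinal number
    (r'.*', 'nn'),                   # everything else -> noun
]

def regexp_tag(tokens, patterns=PATTERNS):
    tagged = []
    for tok in tokens:
        for pattern, tag in patterns:
            if re.match(pattern, tok):
                tagged.append((tok, tag))
                break
    return tagged
```

So `regexp_tag(['John', 'saw', '3', 'bears'])` tags `'3'` as `cd` and the rest as `nn`.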
Unigram Tagger
• Assigns each word its most frequent tag.
• Must be trained to determine frequency.
• Will assign “none” as a tag to any word not seen in the training set.
• About 90% accurate.
• Example training case (from the Brown corpus):
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
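The training step amounts to counting, for each word, how often it carries each tag, then remembering the winner. A self-contained sketch of that logic (`train_unigram` and `unigram_tag` are illustrative names, not NLTK functions):

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    # Count how often each word carries each tag, keep the most frequent.
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def unigram_tag(tokens, model):
    # Most frequent tag per word; None for words never seen in training.
    return [(tok, model.get(tok)) for tok in tokens]

model = train_unigram([[('The', 'at'), ('jury', 'nn'), ('said', 'vbd')],
                       [('The', 'at'), ('jury', 'nn'), ('took', 'vbd')]])
```

With this toy model, `unigram_tag(['said', 'xyzzy'], model)` returns `vbd` for the seen word and `None` for the unseen one, matching the behaviour described above.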
Train the Unigram Tagger
>>> from nltk.tagger import *
>>> from nltk.corpus import brown
# Tokenize ten texts from the Brown Corpus
>>> train_tokens = []
>>> for item in brown.items()[:10]:
...     train_tokens.append(brown.read(item))
# Initialise and train a unigram tagger
>>> mytagger = UnigramTagger(SUBTOKENS='WORDS')
>>> for tok in train_tokens: mytagger.train(tok)
And Then Tag New Text
>>> text_token = Token(TEXT="John saw the book on the table")
>>> WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(text_token)
>>> mytagger.tag(text_token)
>>> print text_token
<[
Testing a Tagger
• So how well does the tagger do?
• Split the inputs into training and testing sets:
>>> train_tokens = []
>>> for item in brown.items()[:10]:    # texts 0-9
...     train_tokens.append(brown.read(item))
>>> unseen_tokens = []
>>> for item in brown.items()[10:12]:  # texts 10-11
...     unseen_tokens.append(brown.read(item))
Train And Test
>>> for tok in train_tokens: mytagger.train(tok)
>>> acc = tagger_accuracy(mytagger, unseen_tokens)
>>> print 'Accuracy = %4.1f%%' % (100 * acc)
Accuracy = 64.6%
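The `tagger_accuracy` helper belongs to the old NLTK release; the measurement itself is plain per-token accuracy over held-out data. A self-contained sketch against a most-frequent-tag lookup (all names here are ours):

```python
def accuracy(model, tagged_test):
    # Fraction of test tokens whose looked-up tag matches the gold tag.
    correct = total = 0
    for sentence in tagged_test:
        for word, gold_tag in sentence:
            if model.get(word) == gold_tag:
                correct += 1
            total += 1
    return correct / total

model = {'the': 'at', 'saw': 'vbd'}
test = [[('the', 'at'), ('saw', 'nn'), ('dog', 'nn')]]
acc = accuracy(model, test)
# 'the' is tagged correctly; 'saw' gets the wrong tag and 'dog' is unseen.
print('Accuracy = %4.1f%%' % (100 * acc))
```

Unseen words count as errors here, which is exactly why accuracy drops on held-out texts (64.6% above) compared to the ~90% a unigram tagger reaches on vocabulary it has seen.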
More in NLTK
• Error analysis
• Higher-order taggers
  – Bigram
  – Nth-order
• Combining taggers
• Brill tagger
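A common way to combine taggers (and the reason a 20-30% default tagger is still worth having) is backoff: each tagger handles only the tokens the previous ones left untagged. A hypothetical sketch of that control flow, not NLTK's API:

```python
def backoff_tag(tokens, taggers):
    # Ask each tagger in turn; the first non-None answer wins.
    tagged = []
    for tok in tokens:
        tag = None
        for tagger in taggers:
            tag = tagger(tok)
            if tag is not None:
                break
        tagged.append((tok, tag))
    return tagged

unigram = {'saw': 'vbd'}.get        # tiny most-frequent-tag model; None if unseen
default = lambda tok: 'nn'          # default tagger as the last resort
result = backoff_tag(['John', 'saw'], [unigram, default])
# 'saw' comes from the unigram model; 'John' falls back to 'nn'.
```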
For Lab/Homework
• Complete the tagger tutorial from the NLTK tutorial page.
• Tutorial exercises 1, 3, 4, 5 and 10.
• 8.2 (we will compare next time)
• 8.9 (using the NLTK and any higher-order tagger)