Transcript: Evaluation.pptx






Based on probability theory.
First we'll introduce the simple "most-frequent-tag" algorithm.
Most-frequent-tag is another baseline algorithm: no one would use it if they really wanted some data tagged, but it's useful as a comparison.
P(Verb) is the probability of a randomly selected word being a verb.
P(Verb|race) is "what's the probability of a word being a verb, given that it's the word 'race'?"
Race: noun or verb. It's more likely to be a noun.
P(Verb|race) can be estimated by looking at some corpus and asking "out of all the times we saw 'race', how many were verbs?"
In the Brown corpus, P(Noun|race) = 96/98 = .98
P(Verb | race) = Count(race is a verb) / total Count(race)
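
To make the estimate concrete, here is a minimal Python sketch of this count-and-divide computation. It assumes the tagged corpus is available as a list of (word, tag) pairs; the toy data and the function name are illustrative, not the actual Brown corpus.

```python
# Toy stand-in for a tagged corpus: a list of (word, tag) pairs.
tagged_corpus = [("the", "DT"), ("race", "NN"), ("to", "TO"),
                 ("race", "VB"), ("a", "DT"), ("race", "NN")]

def p_tag_given_word(tag, word, corpus):
    """Estimate P(tag | word) = Count(word with tag) / total Count(word)."""
    total = sum(1 for w, t in corpus if w == word)
    with_tag = sum(1 for w, t in corpus if w == word and t == tag)
    return with_tag / total if total else 0.0

print(p_tag_given_word("NN", "race", tagged_corpus))  # 2/3 on this toy corpus
```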
would/MD prohibit/VB a/DT suit/NN for/IN refund/NN
of/IN section/NN 381/CD (/( a/NN )/) ./.


We could count in a corpus.
Counts for the word "a" from the Brown Corpus, part-of-speech tagged at U Penn:
21830 DT
6 NN
3 FW

For each word:
◦ Create a dictionary with each possible tag for the word
◦ Take a tagged corpus
◦ Count the number of times each tag occurs for that word

Given a new sentence:
◦ For each word, pick the most frequent tag for that word from the corpus.

NOTE: The dictionary comes from the corpus (see the sketch below).
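
Here is a minimal Python sketch of this training and tagging procedure, again assuming the tagged corpus is a list of (word, tag) pairs. The function names and the default tag for unseen words are my own choices, not part of the slides.

```python
from collections import defaultdict, Counter

def train_most_frequent_tag(tagged_corpus):
    """For each word in the corpus, count its tags and keep the most frequent one."""
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0]
            for word, counts in tag_counts.items()}

def tag_sentence(words, most_frequent_tag, unknown_tag="NN"):
    """Tag each word of a new sentence with its most frequent tag from the corpus.

    Words never seen in the corpus fall back to `unknown_tag` (an assumption;
    the slides only cover words that appear in the dictionary)."""
    return [(w, most_frequent_tag.get(w, unknown_tag)) for w in words]
```

Keeping a Counter per word means the dictionary of possible tags and the counts needed to pick the most frequent one both come out of a single pass over the corpus.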


The/DT City/NNP Purchasing/NNP Department/NNP ,/, the/DT jury/NN said/VBD ,/, is/VBZ lacking/VBG in/IN experienced/VBN clerical/JJ personnel/NNS …
From this sentence, the dictionary is (see the usage sketch after the list):
clerical
department
experienced
in
is
jury
…
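
As a usage example, running the `train_most_frequent_tag` sketch from the previous slide on just this one tagged sentence would produce one entry per word (this assumes the function defined in that sketch):

```python
jury_sentence = [("The", "DT"), ("City", "NNP"), ("Purchasing", "NNP"),
                 ("Department", "NNP"), (",", ","), ("the", "DT"),
                 ("jury", "NN"), ("said", "VBD"), (",", ","), ("is", "VBZ"),
                 ("lacking", "VBG"), ("in", "IN"), ("experienced", "VBN"),
                 ("clerical", "JJ"), ("personnel", "NNS")]

dictionary = train_most_frequent_tag(jury_sentence)
print(dictionary["jury"])      # NN
print(dictionary["clerical"])  # JJ
```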





How do we know how well a tagger does?
Gold Standard: a test sentence, or a set of test sentences, already tagged by a human.
We could run a tagger on this set of test sentences and see how many of the tags we got right.
This is called "tag accuracy" or "tag percent correct".




We take a set of test sentences.
Hand-label them for part of speech.
The result is a "Gold Standard" test set.
Who does this?
◦ Brown corpus: done by U Penn
◦ Grad students in linguistics

Don't they disagree?
◦ Yes! But on about 97% of tags there are no disagreements.
◦ And if you let the taggers discuss the remaining 3%, they often reach agreement.


But we can't train our frequencies on the test set sentences. (Why not?)
So for testing the Most-Frequent-Tag algorithm (or any other stochastic algorithm), we need two things:
◦ A hand-labeled training set: the data that we compute frequencies from, etc.
◦ A hand-labeled test set: the data that we use to compute our % correct.
Of all the words in the test set, for what percent of them did the tag chosen by the tagger equal the human-selected tag?

%correct = (# of words tagged correctly in test set) / (total # of words in test set)

Human tag set: (“Gold Standard” set)
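
A minimal Python sketch of this metric, assuming the tagger's output and the human (gold standard) tags are parallel lists over the same test words; the function name is illustrative.

```python
def tag_percent_correct(predicted_tags, gold_tags):
    """% of test-set words whose tagger-chosen tag equals the human-selected tag."""
    assert len(predicted_tags) == len(gold_tags)
    correct = sum(1 for p, g in zip(predicted_tags, gold_tags) if p == g)
    return 100.0 * correct / len(gold_tags)

# Toy example: 3 of the 4 tags match the gold standard -> 75.0
print(tag_percent_correct(["DT", "NN", "VBZ", "JJ"],
                          ["DT", "NN", "VBZ", "NN"]))
```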



Often they come from the same labeled corpus!
We just use 90% of the corpus for training and save out 10% for testing.
Even better: cross-validation (sketched below)
◦ Take 90% training, 10% test, get a % correct
◦ Now take a different 10% test, 90% training, get a % correct
◦ Do this 10 times and average
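
A minimal sketch of 10-fold cross-validation over a list of hand-labeled sentences. The `train` and `evaluate` arguments stand for functions like the ones sketched earlier (train builds a tagger from the training sentences, evaluate returns its % correct on the test sentences); the names are illustrative, not from the slides.

```python
def cross_validate(labeled_sentences, train, evaluate, folds=10):
    """Average % correct over `folds` different train/test splits of the corpus."""
    fold_size = len(labeled_sentences) // folds
    scores = []
    for i in range(folds):
        test = labeled_sentences[i * fold_size:(i + 1) * fold_size]
        training = (labeled_sentences[:i * fold_size]
                    + labeled_sentences[(i + 1) * fold_size:])
        model = train(training)               # e.g. most-frequent-tag counts
        scores.append(evaluate(model, test))  # % correct on the held-out 10%
    return sum(scores) / folds
```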


Does the same evaluation metric work for rule-based taggers?
Yes!
◦ Rule-based taggers don't need the training set.
◦ But they still need a test set to see how well the rules are working.

Baseline: 91%

Rule-based: not reported

TBL:
◦ 97.2% accuracy when trained on 600,000 words
◦ 96.7% accuracy when trained on 64,000 words