Number Sense Disambiguation

Stuart Moore
Supervised by:
Anna Korhonen (Computer Lab)
Sabine Buchholz (Toshiba CRL)

Number Sense Disambiguation

- Similar to Word Sense Disambiguation
- Seek to classify numbers into different senses
  - e.g. year, time, telephone number...

Applications

- Speech Synthesis
  - 1990: "nineteen ninety" or "one thousand, nine hundred and ninety"
  - 2015: "two thousand and fifteen" or, read as the time 20:15, "eight fifteen p.m."
- Information Retrieval
- Parsing

Aim

- To successfully classify numbers into sense categories
- To use a semi-supervised method
  - Avoids the need for a large, human-annotated training set
  - Allows economical adaptation to different languages and domains

Differences with Word Sense Disambiguation

- There are infinitely many numbers: you will almost certainly come across 'digit strings' you have not seen in training data.
- Intuitively, the models for 2007 and 2008 should be similar
  - But the model for 5, or 2007.4, should be different
- There is no resource equivalent to a dictionary, enumerating all possible senses of a number.

Previous System

- The report Normalization of Non-Standard Words (Sproat et al., 2001) defines a taxonomy of 13 'senses' for numbers
- They annotated 4 corpora, the largest of which is a subsection of the North American News Text Corpus: newswire text from 1994-97
- They used this to create a decision tree classifier
- The main focus of the report was the performance when expanding abbreviations; numbers are not examined in detail.

Number Sense Categories

Label   Description                     Examples                                Count
NUM     Number (Cardinal)               12, 45, 1/2, 0.6                        21253 (56.53%)
NYER    Year(s)                         1998, 80s, 1900s, 2003                  7659 (20.37%)
NORD    Number (Ordinal)                May 7, 3rd, Bill Gates III              3264 (8.68%)
MONEY   Money (US or other)             $3.45, HK$300, Y20,000, $200K           2909 (7.74%)
NIDE    Identifier                      747, 386, I5, pc110, 3A                 1027 (2.73%)
NTEL    Telephone number (or part of)   212 555-4523                            507 (1.35%)
NTIME   A (compound) time               3:20, 11:45                             440 (1.17%)
NDATE   A (compound) date               2/2/99, 14/03/87, (or US) 03/14/87      307 (0.82%)
NDIG    Number as digits                Room 101                                74 (0.20%)
NADDR   Number as street address        45 North Street, 5000 Pennsylvania Ave  69 (0.18%)
NZIP    Zip code or PO box              91020                                   66 (0.18%)

(Counts are from the training data of the North American News Text Corpus)

Overview of my system

- Based on work by Yarowsky (1995) investigating decision lists for Word Sense Disambiguation
- Takes a few annotated 'seed examples', together with a large, unannotated corpus.
- Generates one model using the seed examples, and applies this to the unannotated corpus.
- This is used as input to generate another model.
- The process can be iterated many times

Overview of my system

[System overview diagram]
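
The iterative scheme above can be sketched as follows. The `train` and `classify` functions are placeholders for the rule-learning and rule-application steps described on the following slides, not the report's actual code.

```python
def bootstrap(seeds, unlabelled, train, classify, iterations=3):
    """Yarowsky-style bootstrapping sketch.

    seeds: list of (example, sense) pairs; unlabelled: list of examples;
    train: labelled data -> model; classify: (model, example) -> sense or None.
    """
    model = train(list(seeds))  # the first model comes from the seeds alone
    for _ in range(iterations):
        labelled = list(seeds)  # the seed labels are kept in every iteration
        for example in unlabelled:
            sense = classify(model, example)
            if sense is not None:  # only confident classifications are added
                labelled.append((example, sense))
        model = train(labelled)  # retrain on the enlarged labelled set
    return model
```

Examples the model cannot classify simply stay unlabelled until a later iteration produces a rule that covers them.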

Features

- The context of each number is examined for a list of features.
- Local context: ±5 tokens from the number
  - Punctuation, words, word stems, number features
  - Specific location (e.g. token following number)
- Wider context: ±15 tokens from the number
  - Words and word stems only
  - Bag of words (anywhere within the window)
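
A minimal sketch of the two context windows; the feature-name encoding here is illustrative, not the report's actual representation.

```python
def context_features(tokens, i, local=5, wide=15):
    """Extract context features for the number at position i in tokens."""
    feats = set()
    # Local context (±5 tokens): the feature records the token's position
    # relative to the number
    for off in range(-local, local + 1):
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats.add(f"local[{off:+d}]={tokens[j].lower()}")
    # Wider context (±15 tokens): bag of words, position is ignored
    for off in range(-wide, wide + 1):
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats.add(f"bag={tokens[j].lower()}")
    return feats
```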

Rules

- Each rule is conditional on the presence of one or two features
  - Consider all possible combinations of features that occur together at least five times in the training corpus.
- Based on Yarowsky's rules, but more powerful
  - He had 'bag of words' rules, and some rules combining two words in the local area
  - He did not have any specific numeric or punctuation features.
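
The rule space can be enumerated along these lines. The five-occurrence threshold is from the slide; the frozenset representation of a rule is just for illustration.

```python
from collections import Counter
from itertools import combinations

def candidate_rules(examples, min_count=5):
    """examples: one feature set per training instance.
    Returns rules conditional on one or two features that occur
    (together) at least min_count times."""
    counts = Counter()
    for feats in examples:
        for f in feats:
            counts[frozenset([f])] += 1           # single-feature rules
        for pair in combinations(sorted(feats), 2):
            counts[frozenset(pair)] += 1          # two-feature rules
    return {rule for rule, n in counts.items() if n >= min_count}
```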

Ranking Rules

- Follows Yarowsky (1995)
- For each rule, count the number of examples for each number sense
- Calculate the log-likelihood:

    LogLike = log( Count(positive examples) / (Count(negative examples) + α) )

  - α is a parameter that can be varied to change the effect of negative examples on the model
- Rank rules according to log-likelihood
- When classifying, use the first rule that matches the target sentence
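
A sketch of the scoring and first-match classification. The exact placement of α is an assumption here: it is added to the negative count as smoothing, matching its described role of damping the effect of negative examples.

```python
import math
from collections import Counter

def score_rule(sense_counts, alpha=0.1):
    """sense_counts: Counter of sense -> number of matching training examples.
    Returns (best_sense, log_likelihood) for this rule."""
    best, positives = sense_counts.most_common(1)[0]
    negatives = sum(sense_counts.values()) - positives
    # Assumed form: alpha smooths the (possibly zero) negative count
    return best, math.log(positives / (negatives + alpha))

def classify(decision_list, feats, cutoff=0.0):
    """decision_list: (rule_features, sense, loglike) triples sorted by
    descending loglike. The first matching rule above the cut-off wins."""
    for rule, sense, loglike in decision_list:
        if loglike >= cutoff and rule <= feats:
            return sense
    return None  # unclassified (a default sense can be applied instead)
```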

Performance as a fully supervised system

- We applied the method to the entire training set, and investigated its performance on the training and test sets
  - This gives an idea of the 'upper bound' of performance of the system

Performance on training data

[Chart: classification outcome (correct / default makes correct / default makes incorrect / incorrect) against log-likelihood cut-off, on the training data; peak accuracy 97.2%]

Performance on test data

[Chart: classification outcome against log-likelihood cut-off, on the test data; 66.0% correct, rising to 81.2% when the default sense is used for unclassified examples]

Performance as a fully supervised system - Summary

- Accuracy is 66.0% on test data
- Using the most common number type for unclassified examples increases accuracy to 81.2%
- The Sproat et al. system achieves an accuracy of 97.6% on the same task
  - Uses decision trees instead of decision lists
  - Decision trees generally classify everything, which is less suitable for an iterative process.

Performance as a fully supervised system - Summary

- A large proportion of the test data, approximately 25%, was unclassified.
- By adding unlabelled data to the training set, we hope to increase coverage of the rules, and thereby boost accuracy
  - (experiment not yet performed)

Performance as a semi-supervised system

- Concept: provide a small number of seed examples, from which rules are extrapolated over various iterations.
- It is important to have high precision in the first iteration
  - (Recall can be low, as long as it's not too low)
- Future iterations aim to improve recall

Performance as a semi-supervised system

- After experimenting with a few different strategies for the first iteration, the following was found to perform best:
  - Rank all rules based on their scores from the seed examples
  - For each number type, take the three highest-scoring rules (more if several had an equal score)
  - Apply these rules to the unlabelled data.
  - If a number is matched by rules from more than one number type, do not classify it
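
The first-iteration strategy above might look like this; the data structures are illustrative.

```python
def first_iteration_labels(rules_by_type, unlabelled, top_n=3):
    """rules_by_type: {sense: [(rule_features, score), ...]};
    unlabelled: list of feature sets. Returns {example_index: sense}."""
    kept = {}
    for sense, rules in rules_by_type.items():
        ranked = sorted(rules, key=lambda r: r[1], reverse=True)
        # Keep the top_n scores, plus any rules tied with the last one
        cutoff = ranked[min(top_n, len(ranked)) - 1][1]
        kept[sense] = [rule for rule, score in ranked if score >= cutoff]
    labels = {}
    for i, feats in enumerate(unlabelled):
        matches = {sense for sense, rules in kept.items()
                   if any(rule <= feats for rule in rules)}
        if len(matches) == 1:  # classify only when exactly one type matches
            labels[i] = matches.pop()
    return labels
```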

How many seed examples are needed?

- Equal numbers of seed examples for each number type
- Seed examples were randomly picked from the training data
- Definite improvement seen going up to 40 seed examples
- Limited improvement after this point

[Chart: precision (% of those assigned where the category is correct) against the number of seed examples per number type, from 20 to 60]

Performance of the second iteration - training data

[Chart: classification outcome against log-likelihood cut-off, on the training data; peak accuracy 84.84% at LogLike >= 5.0, against a baseline of 56.24%]

Performance of the second iteration - test data

[Chart: classification outcome against log-likelihood cut-off, on the test data; peak accuracy 75.2% at LogLike >= 5.2; using the previous peak value, cut-off = 5.0, gives 74.93% accuracy]

Future Work

- Error analysis of the data
- More sophisticated features
  - Part of Speech tags, or a parser
- More sophisticated rules
  - Try to allow more than two features per rule, without creating too many rules to be handled.
  - Different rule strategies
    - Closer to a decision tree
- Other machine learning methods?

Future Work

- Increase coverage
  - Investigate use of document-level features, using the method from Stevenson et al., 2008
- Investigate different strategies for picking the seed examples
  - Distribute according to relative frequency of categories, rather than a set number per category
- Investigate the effects of more unannotated data
  - Can use sections of the North American News Text Corpus that haven't been annotated.

Future Work

- Consider modifying the number classes
  - Should some categories be combined?
  - Would moving the categories into a tree structure improve performance?
  - Are different classes needed for different domains (e.g. financial, biomedical) or languages?
- Investigate corpus for consistency
  - A few inconsistent examples have been identified

Number Features

- Does the number start with a leading zero?
- Is the number an integer?
- How many digits are in the number?
- The real value of the number
- The number rounded to one significant figure
  - So 1500 ≤ x < 2500 maps to 2000
- The token with all digits removed
  - 1st becomes st, 70mph becomes mph
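
The features above could be computed along these lines; the feature names and token handling are illustrative and may differ from the report's implementation.

```python
import math
import re

def number_features(token):
    """Numeric features for a single token such as '1990', '0.6' or '70mph'."""
    feats = set()
    digits = re.sub(r"[^0-9.]", "", token)  # keep only digits and the point
    feats.add(f"leading_zero={digits.startswith('0')}")
    feats.add(f"integer={'.' not in digits}")
    feats.add(f"n_digits={sum(c.isdigit() for c in digits)}")
    if digits.strip("."):
        value = float(digits)
        feats.add(f"value={value}")
        if value > 0:
            # Round to one significant figure: 1500 <= x < 2500 maps to 2000
            magnitude = 10 ** math.floor(math.log10(value))
            feats.add(f"rounded={math.floor(value / magnitude + 0.5) * magnitude}")
    feats.add("stripped=" + re.sub(r"[0-9]", "", token))  # 1st -> st, 70mph -> mph
    return feats
```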