A large list of confusion sets for spellchecking assessed against a corpus of real-word errors Jenny Pedler, Roger Mitton LREC 2010

Some real-word errors
The sand-eel is the principle food for many
birds and animals.
Our teacher tort us to spell.
Henley Regatta comes near the top of the
English social calender.
Spellchecker-induced real-word errors
The Wine Bar Company is opening a chain of
brassieres.
The nightwatchman threw the switch and
eliminated the backyard.
Cupertino, California
... to encourage cooperation and ...
Cupertino
co-operation
....
The original Cupertinos
"reinforcing bilateral and multilateral Cupertino"
"South Asian Association for regional Cupertino"
Confusion sets
{cite, sight, site}
{form, from}
{passed, past}
{peace, piece}
{principal, principle}
{quiet, quite, quit}
{their, there, they're}
{weather, whether}
{you're, your}
He had quiet a young girl staying with him
of 17 named Ethel Monticue.
quiet: quite? quit?
The confusion-set approach has been
demonstrated to work with
(a) a short list of confusion sets,
(b) artificial test data.
To assess its potential for real, unrestricted
text, we need:
(1) a realistically-sized list of confusion sets,
(2) a corpus of running text containing
genuine real-word errors.
A list of confusion sets
• Tuned string-to-string edit-distance
• ~ 6000 sets
• Headword (confusables)
– wright (right, write)
– right (rite, write)
– write (right, rite, writ)
• Inflected forms
• Proper nouns
• Usage errors – e.g. <fewer, less>
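The headword/confusables layout above can be sketched as a simple lookup table. This is only an illustration: the three entries are the ones shown on the slide, while the names `CONFUSION_SETS`, `candidates` and `is_symmetric` are hypothetical and not taken from the authors' resources.

```python
# Hypothetical in-memory form of the list: headword -> confusables.
# Only the three entries shown on the slide are included here.
CONFUSION_SETS = {
    "wright": ["right", "write"],
    "right": ["rite", "write"],
    "write": ["right", "rite", "writ"],
}


def candidates(word):
    """Return the confusables to consider if `word` is a headword, else []."""
    return CONFUSION_SETS.get(word.lower(), [])


def is_symmetric(a, b):
    """True if a lists b as a confusable and b also lists a."""
    return b in candidates(a) and a in candidates(b)
```

Note that the entries are not symmetric: `wright` lists `right`, but `right` does not list `wright`, so a lookup keyed on the headword is a closer fit than a plain set of interchangeable words.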
A corpus of real-word errors
Sentences                        675
Words                          12024
Total errors (tokens)            833
Distinct errors (types)          428
Distinct error/target pairs      495
E.g. quit|quiet and quit|quite: one distinct error, two error/target pairs
Corpus mark-up example
The collation of the information was
<ERR targ = really> relay </ERR>
<ERR targ = quite> quit </ERR> easy to do.
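A minimal sketch of how mark-up in this style might be read, assuming every error is wrapped exactly as in the example (an unquoted `targ` attribute, optional spaces around the error). The regular expression and the function names `extract_errors`, `corrected` and `profile` are illustrative assumptions, not part of the downloadable resources.

```python
import re

# Matches <ERR targ = TARGET> ERROR </ERR>, tolerating the spacing on the slide.
ERR_RE = re.compile(r"<ERR\s+targ\s*=\s*([^>]+?)\s*>\s*(.*?)\s*</ERR>")


def extract_errors(text):
    """Return (error, target) pairs in order of appearance."""
    return [(m.group(2), m.group(1)) for m in ERR_RE.finditer(text)]


def corrected(text):
    """Replace each marked-up error with its intended target."""
    return ERR_RE.sub(lambda m: m.group(1), text)


def profile(pairs):
    """Count error tokens, distinct errors (types) and distinct error/target pairs."""
    return {
        "tokens": len(pairs),
        "types": len({error for error, _ in pairs}),
        "pairs": len(set(pairs)),
    }


sample = (
    "The collation of the information was "
    "<ERR targ = really> relay </ERR> "
    "<ERR targ = quite> quit </ERR> easy to do."
)
```

Here `extract_errors(sample)` yields the pairs relay|really and quit|quite, and `profile` applies the same token/type/pair distinction used in the corpus statistics.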
Corpus profile: Frequent errors
Error|target pair    Frequency
there|their          35
form|from            20
to|too               19
their|there          19
a|an                 18
its|it's             17
your|you're          15
weather|whether      12
cant|can't           10
collage|college       9
Corpus profile: Homophone errors
Homophone set            N. Occs
there, their, they're    38
to, too, two             23
its, it's                17
your, you're             15
weather, whether         12
herd, heard               5
witch, which              4
hear, here                3
wile, while               3
14% of distinct error/target pairs
Corpus profile: Simple errors
Error Type                            N.Errors   % Errors
Omission (e.g. ether, either)         142        29%
Substitution (e.g. vary, very)        104        21%
Insertion (e.g. bellow, below)         56        11%
Transposition (e.g. dose, does)        12         2%
All simple                            314        63%
All error pairs                       495       100%
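The four categories can be illustrated with a small classifier. This sketch assumes a "simple" error is one that differs from its target by a single-letter omission, substitution or insertion, or by swapping two adjacent letters, which is how the slide's examples read; the function name `simple_error_type` is hypothetical.

```python
def simple_error_type(error, target):
    """Classify an error/target pair as one of the four simple types, or None."""
    if error == target:
        return None
    le, lt = len(error), len(target)
    if le == lt:
        diffs = [i for i in range(le) if error[i] != target[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and error[diffs[0]] == target[diffs[1]]
                and error[diffs[1]] == target[diffs[0]]):
            return "transposition"
        return None
    if lt == le + 1:
        # The error lacks one letter of the target.
        if any(target[:i] + target[i + 1:] == error for i in range(lt)):
            return "omission"
    if le == lt + 1:
        # The error has one letter more than the target.
        if any(error[:i] + error[i + 1:] == target for i in range(le)):
            return "insertion"
    return None
```

Pairs such as martial|material or pads|passed fall outside all four categories and return `None`, corresponding to the remainder of the error pairs that are not simple.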
How would our list cope with our corpus?
                                            Types   Tokens
Detectable and correctable                  44%     58%
  E.g. shod (should)
Detectable but not correctable              16%     12%
  E.g. martial (material)
Not detectable (inflection error)           23%     17%
  E.g. friend (friends), take (taken)
Not detectable (other)                      17%     13%
  E.g. pads (passed)
Total (100%)                                495     833
Non-detectable/non-correctable

Error not a headword (“non-detectable”)
Pair              Frequency
a, an             17
the, they          4
is, his            2
is, it             2
i, it              2
u, your            2

Target not a candidate (“non-correctable”)
Pair              Frequency
an, a              4
cause, because     3
as, has            2
easy, easily       2
for, from          2
in, is             2
mouths, months     2
none, non          2
no, know           2
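The two definitions above (error not a headword, target not a candidate) suggest a simple way to sort an error/target pair into a coverage category. The sketch below is an assumption-laden illustration: the function name `coverage` is hypothetical, and the toy list `TOY_LIST` is invented for the example apart from the facts stated on the slides (shod offers should; martial is a headword that does not offer material; pads is not a headword).

```python
def coverage(error, target, confusion_sets):
    """Say whether a confusion-set list could detect and correct this error."""
    if error not in confusion_sets:
        # Error is not a headword, so it would never be questioned.
        return "not detectable"
    if target in confusion_sets[error]:
        return "detectable and correctable"
    # Error is a headword, but the intended word is not among its confusables.
    return "detectable but not correctable"


# Toy list for illustration only; the confusables given for "martial"
# are an assumption, not the entry in the authors' list.
TOY_LIST = {
    "shod": ["should"],
    "martial": ["marital"],
}
```

Separating the inflection errors from the other non-detectable cases would need morphological information, which this sketch does not attempt.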
Using the list for spellchecking
• Rules based on surrounding context
• May be unreliable
– 25% of errors have another error within 2 words
– 9% have another real-word error within 2 words
• Syntax-based methods
– Easiest to implement
– Shown to have good performance
Syntax-based rules: potential
Tagsets                          Types   Tokens
Distinct                         58%     68%
  bellow (NN1, VVB, VVI)
  below (AV0, PRP)
Overlapping                      31%     25%
  pray (VVB, VVI, AV0)
  prey (NN1, VVB, VVI)
Matching                         11%      7%
  confirm (VVI, VVB)
  conform (VVI, VVB)
Total errors (=100%)             299     580
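The three-way split can be expressed as a comparison of the sets of part-of-speech tags each word may take. A minimal sketch, using the tag codes exactly as they appear on the slide; the function name `tagset_relation` is hypothetical.

```python
def tagset_relation(tags_a, tags_b):
    """Compare the possible part-of-speech tags of two confusable words."""
    a, b = set(tags_a), set(tags_b)
    if a == b:
        return "matching"      # syntax alone cannot tell the words apart
    if a & b:
        return "overlapping"   # syntax helps only in some contexts
    return "distinct"          # the words never share a tag


examples = {
    ("bellow", "below"): tagset_relation(["NN1", "VVB", "VVI"], ["AV0", "PRP"]),
    ("pray", "prey"): tagset_relation(["VVB", "VVI", "AV0"], ["NN1", "VVB", "VVI"]),
    ("confirm", "conform"): tagset_relation(["VVI", "VVB"], ["VVI", "VVB"]),
}
```

On this reading, pairs with distinct tagsets are the ones where a syntax-based rule has the best chance, while matching pairs such as confirm/conform would need some other kind of evidence.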
Resources available for download
www.dcs.bbk.ac.uk/~jenny/resources.html