Phrasal Spelling Correction for Web Queries


Spelling Correction for Advertising:
How “Noise” Can Help
Silviu Cucerzan
Microsoft Research
Text Mining Search and Navigation
NISS Workshop on Computational Advertising
November 2009
Buying Cheap(er) on eBay
Canon 30d
Not good for the sellers.
Not good for most buyers.
Not good for the middle man.
Cannon 30d
Good Ads for Bad Queries
espresso machines:
  epresso machines, espesso machines, espreso machines, espressomachines, esspreso machines, esspresso machines, expresso machines, exspresso machines
cingular wireless:
  singular wireless, cingulair wireless, cigular wireless, cingulare wireless, cingullar wireless, cinguilar wireles, cingluarwireless, circular wireless
Is a Trusted Dictionary Enough?
• Search:
  max payne chats and codes (chats → cheats)
  new humwee pics
• Music:
  selin dion color of my love (selin → celine, color → colour)
  cristina aquillara (→ christina aguilera)
• Shopping:
  pansonic dvd reorders (pansonic → panasonic, reorders → recorders)
  brita water filer (filer → filter)
• Help and Support:
  printer divers for window vista (divers → drivers, window → windows)
  insert flash flies into power point (flies → files, power point → powerpoint)
Web Query Logs as Corpora
• Web Search: over 1 billion queries per day!
• 10-15% of the queries contain spelling errors
• highly dynamic domain:
many new names and concepts become popular every day
e.g.:
divx, ecard, ipod, korn, xbox, zune,
naboo, nimh, nsync, shrek, 5dmkii, tsx
extremely difficult to maintain a high-coverage lexicon
• difficult to define what a valid web query is
Problems To Be Handled
Context-sensitive correction of out-of-lexicon words
  video crd → video card
  power crd → power cord
Context-sensitive correction of in-lexicon words
  chicken sop → chicken soup
  sop opera → soap opera
Concatenate and split
  cheese cake factory → cheesecake factory
  chat inspanich → chat in spanish
Recognize out-of-lexicon valid words
  amd processors → amd processors
Change in-lexicon words to out-of-lexicon words
  gun dam fighter → gundam fighter
An HMM Architecture for Spelling Correction
input query: brita water filer
states: all alternative spellings from the query log
  brita: brita, brit, brit., brits, briat, rita
  water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
  filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
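A minimal sketch of how a best path through such a lattice of alternative spellings could be decoded. The bigram counts, the flat change penalty, and the function names (`viterbi`, `emit_score`, `trans_score`) are assumptions made for illustration; the slides do not specify the model's actual weights.

```python
import math

def viterbi(candidates, emit_score, trans_score):
    """Return the highest-scoring sequence, one candidate per query position."""
    # best maps each candidate at the current position to
    # (score of the best path ending there, that path).
    best = {w: (emit_score(0, w), [w]) for w in candidates[0]}
    for pos in range(1, len(candidates)):
        step = {}
        for w in candidates[pos]:
            prev_score, prev_path = max(
                ((s + trans_score(p, w), path) for p, (s, path) in best.items()),
                key=lambda item: item[0],
            )
            step[w] = (prev_score + emit_score(pos, w), prev_path + [w])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

query = ["brita", "water", "filer"]
# A small subset of the candidate states listed on the slide.
candidates = [
    ["brita", "brit", "rita"],
    ["water", "waiter", "later"],
    ["filer", "filter", "file", "fiber"],
]
# Hypothetical bigram counts standing in for query-log statistics.
bigram_counts = {
    ("brita", "water"): 40,
    ("water", "filter"): 400,
    ("water", "filer"): 2,
    ("later", "file"): 5,
}

def emit_score(pos, word):
    # Assumed flat penalty for changing the typed word.
    return 0.0 if word == query[pos] else -2.0

def trans_score(prev, word):
    # Add-one smoothed log count of the bigram (a score, not a probability).
    return math.log(bigram_counts.get((prev, word), 0) + 1)

best_path = viterbi(candidates, emit_score, trans_score)
print(" ".join(best_path))
```

With these toy numbers the context word `water` pulls `filer` toward `filter`, even though `filer` is itself a valid word.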
What about terrible misspellings?
• input: arnol shwartzeggar
• desired output: arnold schwarzenegger
unweighted edit distance: 5
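For reference, a short sketch of unweighted (Levenshtein) edit distance, where each insertion, deletion, and substitution costs 1. The function name `levenshtein` is illustrative, and this is the textbook dynamic program rather than the speller's own weighted distance.

```python
def levenshtein(a, b):
    """Unweighted edit distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

for typed, intended in [("arnol", "arnold"), ("shwartzeggar", "schwarzenegger")]:
    print(typed, intended, levenshtein(typed, intended))
```

A corrector that only considers candidates within one or two edits of each typed word cannot reach the surname in a single step, which motivates repeating the correction.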
An Iterative Approach
Misspelled query: arnol shwartzeggar
Speller output:
  First iteration: arnold schwartzneggar
  Second iteration: arnold schwartzenegger
  Third iteration: arnold schwaxrzenegger
  Fourth iteration: arnold schwarzenegger
  no more changes
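The control flow of the iteration can be sketched as a loop that reapplies a single-pass speller until its output stops changing. The names `correct_until_stable` and `replay_pass` are hypothetical, and the single pass here is only a stand-in that replays the outputs listed on the slide; a real pass would search the query log for nearby, more frequent forms.

```python
def correct_until_stable(query, correct_once, max_passes=10):
    """Apply correct_once repeatedly; return every intermediate query."""
    history = [query]
    for _ in range(max_passes):
        revised = correct_once(history[-1])
        if revised == history[-1]:  # fixed point: no more changes
            break
        history.append(revised)
    return history

# Stand-in for one speller pass, copied from the slide's iteration outputs.
slide_passes = {
    "arnol shwartzeggar": "arnold schwartzneggar",
    "arnold schwartzneggar": "arnold schwartzenegger",
    "arnold schwartzenegger": "arnold schwaxrzenegger",
    "arnold schwaxrzenegger": "arnold schwarzenegger",
}

def replay_pass(query):
    # Queries not in the table are returned unchanged.
    return slide_passes.get(query, query)

history = correct_until_stable("arnol shwartzeggar", replay_pass)
for step, text in enumerate(history):
    print(step, text)
```

The `max_passes` cap is a safety bound added for the sketch, so that a pass which oscillates between two forms cannot loop forever.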
Some Intuition
Search Query Log Statistics
Forms: honemoon, honemoons, honeybeemon, honeymonn, hunny moon, honeymoon, honeymoon's, honeymooner, honeymooner's, honeymooners, honeymooning, honeymoonitis, honeymoons, honneymoon, honneymoons, honnymoon, honoeymoon, honymoon, huneymoon, honey moon, honey moon's, honey mooners, honey moons, honney moon, hony moon
Frequencies (in figure order): 8, 3, 3, 14, 19019, 12, 3, 6, 771, 29, 6, 5259, 6, 9, 4, 3, 19, 10, 333, 5, 34, 136, 6, 4
Iterative spelling correction process → honeymoon
Basic Assumptions about the “Noise”
• query logs contain a lot of different misspellings for most words
• the better spelled a word form, the more frequent it is
• the correct forms are much more frequent than their misspellings
Another Example
albert einstein      4834
albert einstien      525
albert einstine      149
albert einsten       27
albert einsteins     25
albert einstain      11
albert einstin       10
albert eintein       9
albeart einstein     6
aolbert einstein     6
alber einstein       4
albert einseint      3
albert einsteirn     3
albert einsterin     3
albert eintien       3
alberto einstein     3
albrecht einstein    3
alvert einstein      3
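As a quick check of the assumption that correct forms dominate their misspellings, the counts on this slide can be tallied. This sketch assumes each form pairs with the count in the same list position, which is how the two columns line up.

```python
# Query-log counts from the slide, paired by position (assumed alignment).
counts = {
    "albert einstein": 4834, "albert einstien": 525, "albert einstine": 149,
    "albert einsten": 27, "albert einsteins": 25, "albert einstain": 11,
    "albert einstin": 10, "albert eintein": 9, "albeart einstein": 6,
    "aolbert einstein": 6, "alber einstein": 4, "albert einseint": 3,
    "albert einsteirn": 3, "albert einsterin": 3, "albert eintien": 3,
    "alberto einstein": 3, "albrecht einstein": 3, "alvert einstein": 3,
}

total = sum(counts.values())
most_frequent = max(counts, key=counts.get)
share = counts[most_frequent] / total

print(most_frequent, f"{share:.1%} of {total} queries")
```

Under that pairing, the most frequent form is the correctly spelled one, and it accounts for the large majority of the listed queries.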
Concatenation and Splitting
Store word unigrams and bigrams in the same searchable trie structure.
Find alternative spellings for the input words in this common structure.
s0: britenetspear inconcert (l0: 2 words)
s1: britneyspears in concert (l1: 3 words)
s2: britney spears in concert (l2: 4 words)
s3: britney spears in concert
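One way to picture the shared structure is a character trie that holds both unigrams and bigrams, with the space in a bigram stored as an ordinary character. A fuzzy lookup can then turn a run-together token into a bigram (or the reverse) at the cost of one edit. The sketch below makes that assumption; the `Trie` and `fuzzy_lookup` names and the tiny vocabulary are hypothetical.

```python
class Trie:
    def __init__(self):
        self.children = {}
        self.terminal = False

    def insert(self, entry):
        node = self
        for ch in entry:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True


def fuzzy_lookup(root, text, max_dist):
    """Return {entry: distance} for stored entries within max_dist edits of text."""
    results = {}
    first_row = list(range(len(text) + 1))

    def walk(node, ch, prefix, prev_row):
        # One Levenshtein row per trie node: row[i] is the distance
        # between the node's prefix and text[:i].
        row = [prev_row[0] + 1]
        for i in range(1, len(text) + 1):
            cost = 0 if text[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1, prev_row[i] + 1, prev_row[i - 1] + cost))
        if node.terminal and row[-1] <= max_dist:
            results[prefix] = row[-1]
        if min(row) <= max_dist:  # prune branches that cannot recover
            for nxt, child in node.children.items():
                walk(child, nxt, prefix + nxt, row)

    for ch, child in root.children.items():
        walk(child, ch, ch, first_row)
    return results


root = Trie()
for entry in ["britney", "spears", "britney spears", "in", "concert",
              "in concert", "cheesecake", "cheese", "cake"]:
    root.insert(entry)

print(fuzzy_lookup(root, "inconcert", 1))      # split candidate
print(fuzzy_lookup(root, "britneyspears", 1))  # split candidate
print(fuzzy_lookup(root, "cheese cake", 1))    # concatenation candidate
```

Because splitting and concatenation are each a single space edit in this representation, they need no separate handling from ordinary character edits.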
Avoid Changing the User’s Intent
input query: brita water filer
  brita: brita, brit, brit., brits, briat, rita
  water: water, eater, hater, later, mater, oater, rater, wader, wafer, wager, waiter, walter, waster, waters, watery, waver
  filer: filer, fiber, fifer, file, filed, filers, files, filet, filler, filner, filter, finer, firer, fiver, fixer, flier
Modified Viterbi Search – Fringes
e.g.: water filer → waiter file
[Figure: trellis of candidate words w1 ... w7 linked by transitions a11 ... a27, with k1, k2, k3 candidates per query position; states are grouped into in-lexicon words and unknown words]
k1·k2 → k1 + k2 paths
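The path reduction can be illustrated with a small count. The sketch assumes the fringe constraint means that two adjacent in-lexicon words may not both be changed in the same pass; that reading, the `transition_allowed` helper, and the toy lexicon are assumptions, since the slide gives only the diagram.

```python
def transition_allowed(typed_a, cand_a, typed_b, cand_b, lexicon):
    """Reject transitions that change two adjacent in-lexicon words at once."""
    both_in_lexicon = typed_a in lexicon and typed_b in lexicon
    both_changed = cand_a != typed_a and cand_b != typed_b
    return not (both_in_lexicon and both_changed)

# Candidate lists from the "brita water filer" slide.
water_cands = ["water", "eater", "hater", "later", "mater", "oater", "rater",
               "wader", "wafer", "wager", "waiter", "walter", "waster",
               "waters", "watery", "waver"]
filer_cands = ["filer", "fiber", "fifer", "file", "filed", "filers", "files",
               "filet", "filler", "filner", "filter", "finer", "firer",
               "fiver", "fixer", "flier"]
lexicon = {"water", "filer"}  # assumed: both typed words are in the lexicon

all_pairs = [(a, b) for a in water_cands for b in filer_cands]
kept = [(a, b) for a, b in all_pairs
        if transition_allowed("water", a, "filer", b, lexicon)]

print(len(all_pairs), "unconstrained transitions")
print(len(kept), "transitions after the constraint")
```

With k1 = k2 = 16, the unconstrained trellis has k1·k2 transitions between the two positions, while the constrained one keeps k1 + k2 - 1, because the pair that leaves both words unchanged is counted once. A change such as `water filer` to `waiter file` is excluded, while `water filer` to `water filter` remains available.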
Modified Viterbi Search – Stop words
e.g.: lord of teh rigs → lord of the rings
[Figure: trellis of candidate words w1 ... w7 linked by transitions a11 ... a27, with k1, k2, k3 candidates per query position; states labelled as stop words]
Evaluation
Configuration        All queries   Valid   Misspelled
Nr. queries          1044          864     180
Full system          81.8          84.8    67.2
No lexicon           70.3          72.2    61.1
No query log         77.0          82.1    52.8
All edits equal      80.4          83.3    66.1
Unigrams only        54.7          57.4    41.7
1 iteration only     80.9          88.0    47.2
2 iterations only    81.3          84.4    66.7
No fringes           80.6          83.3    67.2
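The three columns are related: the score on all 1044 queries should be the size-weighted mean of the scores on the 864 valid and 180 misspelled queries. A short check of that arithmetic, using only the figures on this slide (the `pooled` helper name is illustrative):

```python
N_VALID, N_MISSPELLED = 864, 180

# (all queries, valid, misspelled) for each configuration, from the slide.
results = {
    "Full system":       (81.8, 84.8, 67.2),
    "No lexicon":        (70.3, 72.2, 61.1),
    "No query log":      (77.0, 82.1, 52.8),
    "All edits equal":   (80.4, 83.3, 66.1),
    "Unigrams only":     (54.7, 57.4, 41.7),
    "1 iteration only":  (80.9, 88.0, 47.2),
    "2 iterations only": (81.3, 84.4, 66.7),
    "No fringes":        (80.6, 83.3, 67.2),
}

def pooled(valid, misspelled):
    """Size-weighted mean of the two per-class scores."""
    return (N_VALID * valid + N_MISSPELLED * misspelled) / (N_VALID + N_MISSPELLED)

for name, (overall, valid, misspelled) in results.items():
    print(f"{name:18s} reported {overall:5.1f}  pooled {pooled(valid, misspelled):6.2f}")
```

Small gaps between the reported and pooled values are expected, since the per-class scores are themselves rounded to one decimal. The weighting also explains why running a single iteration barely moves the overall score despite a large drop on misspelled queries: valid queries make up most of the test set.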
A Closer Look at the Results
• 81.8% overall agreement with the annotators
  annotator inter-agreement rate: 91.3%
• Errors:
  – alternative queries suggested for valid queries
    many false positives are reasonable suggestions
    e.g. cowboy robes → cowboy ropes
  – alternative queries suggested for misspelled queries
    some suggestions could be valid (user’s intent not known)
    e.g. massanger → massager / messenger
Evaluation – When we “know” user’s intent
(audio flie, audio file) → audio file
(bueavista, buena vista) → buena vista
(carrabean nooms, carrabean rooms) → caribbean rooms

368 queries
Full system          73.1
No lexicon           59.2
No query log         44.9
All edits equal      69.9
Unigrams only        43.0
1 iteration only     45.5
2 iterations only    68.2
No fringes           71.0
Learning Curve
[Figure: line chart; x-axis: 1 month, 2 months, 3 months, 4 months; y-axis: 65 to 85]
All queries, plotted values: 80.7, 81.2, 81.6, 81.8
Misspelled queries, plotted values: 66.1, 67.2, 68.9, 69.4
Silviu Cucerzan and Eric Brill – “Spelling correction as an iterative process
that exploits the collective knowledge of web users”, EMNLP 2004