Transcript irf97c4
This Class
How stemming is used in IR
Stemming algorithms
Frakes: Chapter 8
Kowalski: pages 67-76
Stemming algorithms
Affix removing stemmers
Dictionary lookup stemmers
n-gram stemmers
Successor variety stemmers
Stemming
Conflation - combining morphological term variants
Done manually or automatically
Automatic algorithms are called stemmers
Stemming algorithms
Conflation methods
  Manual
  Automatic
    Affix Removal
      Longest Match
      Simple Removal
    Successor Variety
    Dictionary Lookup
    n-grams
Stemming is used to:
Enhance query formulation (and improve recall) by providing term variants
Reduce the size of index files by combining term variants into a single index term
Stemming during indexing
Index terms are stemmed words
Saves dictionary space
One inverted index list for all variants
Saves inverted index file space when position information in the document is not included
Query terms are also stemmed
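A minimal sketch of indexing with stemmed terms. The suffix stripper `toy_stem` is a hypothetical stand-in, not one of the stemmers covered in this class; it only serves to show that all variants share one inverted list and that the query term must pass through the same stemmer.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical toy stemmer: strip the first matching suffix, keep >= 2 letters."""
    for suffix in ("ers", "er", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

def build_index(docs):
    """Map each stem to the set of document ids containing any of its variants."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[toy_stem(word)].add(doc_id)
    return index

def search(index, term):
    """Stem the query term the same way the index terms were stemmed."""
    return sorted(index.get(toy_stem(term.lower()), set()))

docs = {1: "system users", 2: "the user used it", 3: "using systems"}
index = build_index(docs)
print(search(index, "users"))   # documents containing user, users, used or using
print(search(index, "system"))  # documents containing system or systems
```

No position information is stored here, so the index cannot answer phrase queries; that is the trade-off the slide describes.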
Index is not stemmed
In this case the index contains words
No compression is achieved
No information is lost
Enables wild card searches
Enables long phrase searches when position information is included
Providing term variants during search
A stemming algorithm generates term variants
Term variants are added to the query automatically (query expansion)
or
The user is provided with term variants and decides which ones to include
Example
A user searching for "system users" is provided in the CATALOG system with term variants for "users" and "system"
Example (cont.)
Search term: users
   Term     Occurrences
1. user     15
2. users    1
3. used     3
4. using    2
User selects variants to include in query
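A rough sketch of how such a variant list could be produced: every vocabulary term that shares a stem with the search term is offered to the user together with its occurrence count. The stemmer `toy_stem` and the count for "system" are hypothetical; the other counts are the ones in the table above. How the CATALOG system actually computes its variants is not stated here.

```python
def toy_stem(word):
    """Hypothetical toy stemmer: strip the first matching suffix, keep >= 2 letters."""
    for suffix in ("ers", "er", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

def term_variants(term, occurrences):
    """Return the vocabulary terms (with counts) that share a stem with `term`."""
    stem = toy_stem(term)
    return {w: n for w, n in occurrences.items() if toy_stem(w) == stem}

# "system": 7 is an invented count, added only to show a non-matching term.
occurrences = {"user": 15, "users": 1, "used": 3, "using": 2, "system": 7}
for word, count in term_variants("users", occurrences).items():
    print(word, count)
```

For automatic query expansion the returned terms would simply be added to the query instead of being shown to the user.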
Stemmer correctness
A stemmer can be incorrect by either
– Under-stemming or by
– Over-stemming
Over-stemming can reduce precision
Under-stemming can reduce recall
Over-stemming
Terms with different meanings are conflated
"considerate", "consider" and "consideration" should not be stemmed to "con", with "contra", "contact", etc.
Under-Stemming
Prevents related terms from being conflated
Under-stemming "consideration" to "considerat" prevents conflating it with "consider"
Evaluating stemmers
In information retrieval, stemmers are evaluated by their:
– effect on retrieval and
– compression rate, and
– not linguistic correctness
Evaluating stemmers
Studies have shown that stemming has a positive effect on retrieval.
Performance of the algorithms is comparable
Results vary between test collections
Affix removal stemmers
Remove suffixes and/or prefixes from terms, leaving a stem
Affix removal stemmers
In English, stemmers are suffix removers
In other languages, for example Hebrew, both prefixes and suffixes are removed
Affix removal stemmers
Most affix removal stemmers in use are:
– iterative - for example, "consideration" stemmed first to "considerat", then to "consider"
– longest match stemmers using a set of stemming rules.
A simple stemmer
Harman
– experimented and concluded that minimal stemming is helpful
Her simple stemmer changes:
– Plural to singular
– Third person to first person
A simple stemmer
Algorithm changes:
"skies" to "sky": ies->y
"retrieves" to "retrieve": es->e, and
"doors" to "door": s->NULL
(leaves "corpus" or "wellness")
A simple stemmer
1. word ends in "ies" but not "eies" or "aies": change end to "y"
2. word ends in "es" but not "aes", "ees" or "oes": change to "e"
3. word ends in "s" but not "us" or "ss": remove "s"
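The three rules can be sketched as a short function. This follows the commonly cited form of Harman's simple ("S") stemmer, in which only the first rule whose condition holds is applied; the function name `s_stem` is an assumption made for this illustration.

```python
def s_stem(word):
    """Sketch of the three-rule simple stemmer; the first matching rule wins."""
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies"
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

for w in ("skies", "retrieves", "doors", "corpus", "wellness"):
    print(w, "->", s_stem(w))
```

The exception lists are what keep words such as "corpus" and "wellness" unchanged.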
The Paice/Husk stemmer
Uses a table of rules grouped into sections
Section for each last letter of a suffix (rules for forms ending in a, then b, etc.)
A form is any word or part of a word considered for stemming
The Paice/Husk stemmer
Each rule specifies a deletion or a replacement of an ending
The order of the rules in each section is important.
Rules are tried until one can be applied, and the current form is updated
Rule structure
Each rule contains 5 parts (2 are optional):
An ending (one or more characters in reverse order)
An optional intact flag "*" denoting form not yet stemmed
Rule structure
A digit (>=0) specifying the number of characters to remove
An optional string to append (after removal)
A final symbol: a rule ending with ">" denotes stemming should continue, "." terminates the stemming process
Examples of rules
"sei3y>"
if form ends in "ies" then replace the last 3 letters by "y" and continue stemming
("-ries" becomes "-ry")
Examples of rules
"mu*2."
if form ends with "um" and word is intact, remove 2 last letters and terminate stemming.
"maximum" is stemmed to "maxim", but "presum" from "presumably" remains unchanged
Examples of rules
"ylp0." - if word terminates in "ply", terminate. Next rule "yl2>" does not remove "ly" from "multiply"
"nois4j>" causes "sion" to be replaced by "j"
"j" acts as dummy ending
"provision" converted to "provij" and then to "provid"
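To make the rule notation concrete, here is a small sketch that parses a rule string (reversed ending, optional "*" intact flag, digit, optional append string, and ">" or "." as the final symbol) and applies it to a form. The names `Rule`, `parse_rule` and `apply_rule` are assumptions for this illustration, and the acceptability conditions are deliberately left out.

```python
import re
from typing import NamedTuple, Optional

class Rule(NamedTuple):
    ending: str    # ending in normal reading order (stored reversed in the rule)
    intact: bool   # True if the rule only applies to a not-yet-stemmed form
    remove: int    # number of trailing characters to delete
    append: str    # string appended after the deletion (may be empty)
    cont: bool     # True for ">", False for "." (terminate)

RULE_RE = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.])$")

def parse_rule(text: str) -> Rule:
    m = RULE_RE.match(text)
    if m is None:
        raise ValueError(f"not a rule: {text!r}")
    rev, star, digit, app, symbol = m.groups()
    return Rule(rev[::-1], star == "*", int(digit), app, symbol == ">")

def apply_rule(rule: Rule, form: str, intact: bool) -> Optional[str]:
    """Return the new form, or None if the rule does not match."""
    if not form.endswith(rule.ending):
        return None
    if rule.intact and not intact:
        return None
    return form[:len(form) - rule.remove] + rule.append

print(parse_rule("sei3y>"))
print(apply_rule(parse_rule("mu*2."), "maximum", intact=True))
print(apply_rule(parse_rule("nois4j>"), "provision", intact=True))
```

A full stemmer would loop over the section for the form's last letter, stop at the first matching rule, and continue or terminate according to `cont`.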
Acceptability conditions
Rule not applied unless conditions satisfied
Attempt to prevent over-stemming
Without them "rent", "rant", "rice", "rate", "ration", "river" reduce to "r"
There are 2 rules:
Acceptability conditions
If a form starts with a vowel then at least 2 letters must remain (owed/owing->ow but not ear->e)
If a form starts with a consonant then at least 3 letters must remain, and at least one must be a vowel or "y"
(saying->say, crying->cry, but not string->str, meant->me, or cement->ce)
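The two conditions translate almost directly into a predicate on the candidate stem. This is a sketch under the assumption that the vowels are a, e, i, o, u and that "y" counts only for the "at least one vowel or y" test; the function name `acceptable` is hypothetical.

```python
VOWELS = set("aeiou")

def acceptable(stem):
    """Check whether a candidate stem satisfies the two acceptability conditions."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        # Starts with a vowel: at least 2 letters must remain.
        return len(stem) >= 2
    # Starts with a consonant: at least 3 letters, one of them a vowel or "y".
    return len(stem) >= 3 and any(c in VOWELS or c == "y" for c in stem)

for candidate in ("ow", "e", "say", "cry", "str", "me", "ce"):
    print(candidate, acceptable(candidate))
```

A rule whose result fails this check is skipped, and the next rule in the section is tried.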
Acceptability conditions
These rules cause errors in the stemming of some short-rooted words (doing, dying, being).
These could be dealt with separately with a table lookup
Example with Paice stemming
"separately" - use "y" section
mismatch ylb1>, yli3y>, ylp0.
match yl2>. Form becomes "separate"
use rule "e1>" in "e" section
form changes to "separat" - use t section
mismatch with "tacilp4y.", match with "ta2>", change form to "separ"
use r section, match with "ra2." So "sep"
Other examples
Word          Rule        Result
preparation   nois4j>     fails
              noix4ct.    fails
              noi3>       preparat
              ta2>        prepar
              ra2.        prep
prepare       e1>         prepar
              ra2.        prep
prepared      de2>        prepar
              ra2.        prep
n-grams
Fixed length consecutive series of "n" characters
Bigrams:
– Sea colony -> (se ea co ol lo on ny)
Trigrams:
– Sea colony -> (sea col olo lon ony), or
– -> (#se sea ea# #co col olo lon ony ny#)
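The bigram and trigram examples can be reproduced with a few lines. This sketch assumes that n-grams are taken word by word on lower-cased text and that "#" is the padding symbol for word boundaries, as in the second trigram example; the function name `ngrams` is hypothetical.

```python
def ngrams(text, n, pad=False):
    """Return the n-grams of each word in `text`, in order of occurrence."""
    grams = []
    for word in text.lower().split():
        if pad:
            word = f"#{word}#"   # mark the word boundaries
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(ngrams("Sea colony", 2))
print(ngrams("Sea colony", 3))
print(ngrams("Sea colony", 3, pad=True))
```

Words shorter than n simply contribute no n-grams in this version.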
Usage of n-grams
Used in World War II by cryptographers
Spell checking
Text compression
Signature files
Stemming
n-gram stemmers
Adamson and Boreham (1974)
Method for grouping term variants
Language independent
n-gram stemmers
Each term is transformed to n-grams
A similarity value is generated between any pair of terms in the database, resulting in a similarity matrix
n-gram stemmers
A clustering method (single link) groups highly similar terms into clusters
Most matrix elements had value 0.
Used a cutoff value of 0.6 for their
clustering algorithm
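With a fixed cutoff, single-link clustering amounts to finding the connected components of the graph that links every pair of terms whose similarity reaches the cutoff. The sketch below assumes the Dice coefficient over unique bigrams as the similarity measure and uses a small union-find structure; the term list and function names are made up for the illustration, and this is not a reconstruction of the original authors' program.

```python
from itertools import combinations

def bigrams(term):
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice(a, b):
    """Dice coefficient over the sets of unique bigrams of two terms."""
    A, B = bigrams(a), bigrams(b)
    if not A or not B:
        return 0.0
    return 2 * len(A & B) / (len(A) + len(B))

def single_link_clusters(terms, cutoff=0.6):
    """Group terms into connected components of the 'similarity >= cutoff' graph."""
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]   # path halving
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        if dice(a, b) >= cutoff:
            parent[find(a)] = find(b)       # merge the two clusters

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return sorted(sorted(c) for c in clusters.values())

terms = ["statistics", "statistical", "statistician", "colony", "colonies"]
print(single_link_clusters(terms))
```

Computing all pairs is quadratic in the vocabulary size, which is acceptable for a sketch but costly for a large database.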
Dice Coefficient
Many formulas for computing set similarity
Dice coefficient:
S = 2|A ∩ B| / (|A| + |B|)
0 ≤ S ≤ 1
S = 1 if A = B, S = 0 if A ∩ B = ∅
Sets of Unique Bigrams
Let A and B denote the sets of unique bigrams associated with two terms, and let C = A ∩ B
statistics -> (st ta at ti is st ti ic cs)
Set of unique bigrams for statistics:
A = {at cs ic is st ta ti}, |A| = 7
n-gram stemmers
statistical -> (st ta at ti is st ti ic ca al)
Set of unique bigrams for statistical:
B = {al at ca ic is st ta ti}, |B| = 8
C = {at ic is st ta ti}, |C| = 6
S = 2|C|/(|A|+|B|) = 2x6/(7+8) = 0.8
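The worked example can be checked with a short sketch. The helper names `unique_bigrams` and `dice` are assumptions for this illustration; returning 0 when either term has no bigrams is a choice made here to avoid dividing by zero.

```python
def unique_bigrams(term):
    """Set of distinct adjacent character pairs in a term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice(term_a, term_b):
    """S = 2|A n B| / (|A| + |B|) over the unique bigram sets A and B."""
    A, B = unique_bigrams(term_a), unique_bigrams(term_b)
    if not A or not B:
        return 0.0
    return 2 * len(A & B) / (len(A) + len(B))

A = unique_bigrams("statistics")
B = unique_bigrams("statistical")
print(sorted(A), len(A))
print(sorted(B), len(B))
print(sorted(A & B), len(A & B))
print(dice("statistics", "statistical"))
```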
Table lookup method
Ideally, a table is constructed with a stem for every word
Stemming - look up the word, find the stem
There is no such data for English
Systems use a combination of dictionary lookup and conflation rules
Dictionary lookup method
INQUERY uses Kstem
Kstem is a morphological analyzer that conflates word variants to a root form
Dictionary lookup method
Tries to avoid collapsing words with different meanings to the same root
The original word or a stemmed version is looked up in a dictionary and replaced by the best stem
Successor variety stemmer
Based on work in structural linguistics (Hafer and Weiss)
Performed less well than affix removing stemmers
Given a set of words, the successor variety (SV) of a string is the number of different characters that follow it in words in the set
Successor variety stemmers
Terms: {able, axle, accident, ape, about, apply, application, applies}
The SV of "ap" is 2
"ap" is followed by "e" in "ape" and by "p" in "apply", "application" and "applies"
The SV of "a" is 4
"a" is followed in the set by "b", "x", "c", and "p"
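A small sketch of the successor variety computation for the term set {able, axle, accident, ape, about, apply, application, applies}. The function name `successors` is hypothetical, and "#" is assumed here as a marker for the end of a word (the "blank" successor).

```python
WORDS = ["able", "axle", "accident", "ape", "about",
         "apply", "application", "applies"]

def successors(prefix, words, end_marker="#"):
    """Set of distinct characters that follow `prefix` in the given words."""
    found = set()
    for word in words:
        if word.startswith(prefix):
            # A word equal to the prefix is followed by a blank (end of word).
            found.add(word[len(prefix)] if len(word) > len(prefix) else end_marker)
    return found

for prefix in ("a", "ap", "app", "appl", "apply"):
    letters = successors(prefix, WORDS)
    print(prefix, len(letters), sorted(letters))
```

The successor variety of a prefix is the size of the returned set.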
SVs for "apply" and "applies"

"apply"
Prefix     SV   Letters
a          4    b, x, c, p
ap         2    e, p
app        1    l
appl *     2    y, i
apply      1    blank

"applies"
Prefix     SV   Letters
a          4    b, x, c, p
ap         2    e, p
app        1    l
appl *     2    y, i
appli      2    e, c
applie     1    s
applies    1    blank

* denotes a break point at peak
SV for "application"
Prefix        SV   Letters
a             4    b, x, c, p
ap            2    e, p
app           1    l
appl          2    y, i
appli *       3    c, y, e
applic        1    a
applica       1    t
applicat      1    i
applicati     1    o
applicatio    1    n
application   1    blank
Segmenting words
4 ways:
– Cut-off SV is reached
– SV peaks
– A substring of a word is equal to another word in the set: "readable" breaks into "read" and "able"
– Entropy based method
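As an illustration of the peak method, the sketch below computes the successor variety of every prefix of a word and breaks after any prefix whose SV is strictly greater than that of both its neighbours. The term set is the one used in the successor variety example; the strict comparison, the "#" end-of-word marker and the function names are assumptions of this sketch.

```python
WORDS = ["able", "axle", "accident", "ape", "about",
         "apply", "application", "applies"]

def successor_varieties(word, words, end_marker="#"):
    """SV of each prefix of `word` (length 1 up to the full word)."""
    svs = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        letters = {w[i] if len(w) > i else end_marker
                   for w in words if w.startswith(prefix)}
        svs.append(len(letters))
    return svs

def peak_segments(word, words):
    """Break after each interior prefix whose SV exceeds both neighbours."""
    svs = successor_varieties(word, words)
    segments, start = [], 0
    for i in range(1, len(svs) - 1):
        if svs[i] > svs[i - 1] and svs[i] > svs[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments

print(successor_varieties("apply", WORDS))
print(peak_segments("apply", WORDS))
```

A cut-off variant would instead break wherever the SV reaches a fixed threshold.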
Selecting a stem
First segment is selected if it occurs in at most 12 words,
otherwise the second segment is selected (3 segments are unlikely)
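The selection rule can be sketched as follows. "Occurs in a word" is interpreted here as "the word begins with the segment", which is an assumption; the function name `select_stem` and the sample vocabulary are also hypothetical.

```python
def select_stem(segments, vocabulary, max_words=12):
    """Pick the first segment unless it begins more than `max_words` words."""
    first = segments[0]
    count = sum(1 for word in vocabulary if word.startswith(first))
    if count <= max_words or len(segments) == 1:
        return first
    # A very frequent first segment is probably a prefix, so take the second.
    return segments[1]

vocabulary = ["readable", "reading", "reads", "react", "real"]
print(select_stem(["read", "able"], vocabulary))
print(select_stem(["re", "act"], vocabulary, max_words=2))
```

The second call lowers the threshold only so that the small sample vocabulary can trigger the "otherwise" branch.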
Summary
All automatic stemmers are sometimes incorrect
n-gram method can be used for different languages
In general affix removing stemmers are more "correct"
Longest match stemming does not always generate satisfactory word stems