Lecture 6: Linguistic Methods for Searching Stemming Thesaurus

CS5286 Search Engine Technology and Algorithms/Xiaotie Deng
Lecture 6: Linguistic Methods
for Searching


Stemming
Thesaurus


Online resources
Automatic construction of thesaurus
Outline of Stemming Methods


Goal of Stemming Process
Algorithm





Affix Removal (Porter’s Algorithm)
Dictionary Look-up Stemmers
Successor Variety
n-Gram Stemming
Applications
The advantage

Stemming was originally designed to improve
performance by reducing the
demand on system resources.

With the continued significant increase in
storage and computing power, the use of
stemming for performance reasons is no
longer as important.
Other Potentials

Stemming may improve recall.
There may be an associated decline in precision.
System designers decide for themselves whether to
include stemming.
  Google does not use stemming.
  HotBot offers word stemming as a user option.
Porter Stemming Algorithm



The Porter algorithm is the most commonly
accepted stemming algorithm.
It is based on a set of conditions on the stem,
suffix, and prefix, with an action associated with
each condition.
See, e.g., http://www.tartarus.org/~martin/PorterStemmer/
Porter Stemming (Condition)


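m, the measure of a stem, counts the vowel-consonant
sequences in the stem. The vowels are a, e, i, o, u,
and y when it follows a consonant.
Every stem has the form [C](VC)^m[V], where C is a run of
consonants, V is a run of vowels, the initial C and final V
are optional, and m is the number of times VC repeats.

Measure   Examples
m=0       free, why
m=1       frees, whose
m=2       prologue, compute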
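The measure can be computed by counting the places where a vowel is directly followed by a consonant. The following is a minimal sketch; the function names `is_vowel` and `measure` are illustrative and not part of Porter's published code.

```python
def is_vowel(word: str, i: int) -> bool:
    """Return True if word[i] acts as a vowel under Porter's convention."""
    ch = word[i]
    if ch in "aeiou":
        return True
    # 'y' counts as a vowel only when it follows a consonant.
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)


def measure(stem: str) -> int:
    """Count the VC repetitions m in the form [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        vowel = is_vowel(stem, i)
        # Each vowel-to-consonant transition closes one VC block.
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m


for w in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(w, measure(w))
```

Run on the six example words, the sketch reproduces the measures listed in the table.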
Porter Stemming (Condition)




*<X>   the stem ends with the letter X
*v*    the stem contains a vowel
*d     the stem ends in a double consonant
*o     the stem ends with a consonant-vowel-consonant
       sequence, where the final consonant is not w, x, or y
Rules
Step  Condition  Suffix  Replacement  Example
1a    NULL       sses    ss           stresses -> stress
1b    *v*        ing     NULL         making -> mak
1b1   NULL       at      ate          inflat(ed) -> inflate
1c    *v*        y       i            happy -> happi
Rules (continued)
Step  Condition            Suffix  Replacement    Example
2     m>0                  aliti   al             formaliti -> formal
3     m>0                  icate   ic             duplicate -> duplic
4     m>1                  able    NULL           adjustable -> adjust
5a    m>1                  e       NULL           inflate -> inflat
5b    m>1 and *d and *<L>  NULL    single letter  controll -> control
Example

duplicatable
  -> duplicat    (rule 4)
  -> duplicate   (rule 1b1)
  -> duplic      (rule 3)
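The trace above can be replayed with a small table-driven sketch. It encodes only the three rules used in this example, in the order in which the example applies them, so it is not the full Porter algorithm. The names `RULES`, `measure`, and `trace` are illustrative assumptions.

```python
def measure(stem: str) -> int:
    """Count VC repetitions; 'y' is a vowel only after a consonant."""
    m, prev_vowel = 0, False
    for ch in stem.lower():
        vowel = ch in "aeiou" or (ch == "y" and not prev_vowel and m + prev_vowel >= 0 and stem[0] != ch and False)
        vowel = ch in "aeiou"  # none of the words traced here contain 'y'
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m


# (rule name, measure the stem must exceed, suffix, replacement)
# A bound of -1 means "no condition" (NULL in the rule table).
RULES = [
    ("4", 1, "able", ""),
    ("1b1", -1, "at", "ate"),
    ("3", 0, "icate", "ic"),
]


def trace(word: str) -> list:
    """Apply each rule once, in order, and record every intermediate form."""
    steps = [word]
    for _name, min_m, suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > min_m:
                word = stem + replacement
                steps.append(word)
    return steps


print(trace("duplicatable"))
```

The simplified `measure` here ignores the letter y, which is enough for the words in this trace but not for general input.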
Dictionary Look-Up Stemmer


A dictionary contains the pairing of each word
with its stem, for all the words.
The structure of the dictionary should be well
designed to speed up the search.

TERM         STEM
computer     comput
compute      comput
computation  comput
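A hash table is one structure that gives fast look-up. The sketch below uses a Python dict holding the three pairs from the table; returning the word unchanged when it is missing from the dictionary is an assumption made for illustration, not something the slides specify.

```python
# Term-to-stem pairs taken from the table above.
STEM_TABLE = {
    "computer": "comput",
    "compute": "comput",
    "computation": "comput",
}


def lookup_stem(term: str, table: dict = STEM_TABLE) -> str:
    """Return the stored stem, or the lower-cased term if it is not listed."""
    key = term.lower()
    return table.get(key, key)


print(lookup_stem("Computation"))
print(lookup_stem("search"))
```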
Successor Variety Stemming


Hafer and Weiss (1974), “Word segmentation by
letter successor varieties”, Information Storage
and Retrieval 10, 371-385.
Main idea: determine word and morpheme
boundaries based on the distribution of phonemes
in a large body of utterances.
Note

Morpheme: the smallest meaningful part into which
a word can be divided
  run-s contains two morphemes
  un-like-ly contains three morphemes
Phoneme: a unit of the system of sounds in a
language
  English has 24 consonant phonemes
Overall approach

Hafer and Weiss use


letters in place of phonemes
texts in place of phonemically transcribed utterances
Formal Definition






Let w be a word of length n.
w_i is the length-i prefix of w.
Let D be a collection of words.
D(w_i) is the subset of D containing the terms whose
first i letters match w_i exactly.
S(w_i), the successor variety of w_i, is the number
of distinct letters that occupy the (i+1)st
position of words in D(w_i).
A test word of length n has n successor varieties:
S(w_1), S(w_2), ..., S(w_n).
Informal Definition


The successor variety of a string in a collection
D of words is the number of different characters
that follow it in D.
That is, it depends on
  the string
  the collection D of words under consideration
An example


D = {able, axle, accident, ape, about, be}
The successor variety for
  a: 4 (b, x, c, p)
  ap: 1 (e)
  app: 0
  ab: 2 (l, o)
  b: 1 (e)
Using a trie, the successor variety of a string is the
number of children of the node that the string
reaches in the trie (a terminal node is treated as
having one child).
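The counts in this example can be checked with a direct scan over D instead of a trie. This is a minimal sketch: the name `successor_variety` is illustrative, and the sketch counts letters only, so it does not apply the convention that a terminal node has one child.

```python
def successor_variety(prefix: str, words) -> int:
    """Number of distinct letters that follow `prefix` among the words."""
    n = len(prefix)
    return len({w[n] for w in words if len(w) > n and w.startswith(prefix)})


D = ["able", "axle", "accident", "ape", "about", "be"]
for p in ["a", "ap", "app", "ab", "b"]:
    print(p, successor_variety(p, D))
```

A trie gives the same numbers with a single walk from the root, which matters when D is a large corpus.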
Trie for the corpus of data D
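[Figure: trie for D. The root branches on a and b; the a node
branches on b, x, c, and p; the ab node branches on l and o.
Leaves: able, about, accident, ape, axle, be.]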
Segment in Words


In a large body of text, the successor variety of a
substring usually decreases as each character
is added, until a segment boundary is reached.
Consider the following example:
D = {able, ape, beatable, fixable, read, readable,
reading, reads, red, rope, ripe}

Prefix  Successor variety  Letters
r       3                  (e, i, o)
re      2                  (a, d)
rea     1                  (d)
read    3                  (a, i, s)

read is a segment (or stem).
Selecting segments of words




Cutoff method
  A boundary is identified whenever the successor variety
  reaches some cutoff value.
Peak and plateau method
  A segment break is made after a character whose
  successor variety is larger than that of both the
  character immediately before it and the character
  immediately after it.
Complete word method
  A break is made after a segment if the segment is a
  complete word in the corpus.
Entropy method
  The cutoff method applied to an entropy defined for words.
Peak and Plateau Method

D = {able, ape, beatable, fixable, read, readable,
reading, reads, red, rope, ripe}

Prefix    Successor variety  Letters
r         3                  (e, i, o)
re        2                  (a, d)
rea       1                  (d)
read      3                  (a, i, s)
reada     1                  (b)
readab    1                  (l)
readabl   1                  (e)
readable  1                  (blank)

The successor variety of "read" is 3, larger than
that of both "rea" and "reada", so a break is made after "read".
Peak and Plateau Method



Input: a document of many terms.
Output: each term, segmented.
For example, the output for readable is read-able.
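The peak test can be sketched as follows. The sketch breaks only at strict peaks (a prefix whose successor variety exceeds that of both neighbouring prefixes) and does not treat plateaus specially; it also counts letters only, so the whole word "readable" gets variety 0 here instead of 1 for the blank. The function names are illustrative.

```python
def successor_variety(prefix: str, words) -> int:
    """Number of distinct letters that follow `prefix` among the words."""
    n = len(prefix)
    return len({w[n] for w in words if len(w) > n and w.startswith(prefix)})


def peak_segments(word: str, words) -> list:
    """Split `word` after every prefix whose successor variety is a strict peak."""
    # sv[i] is the successor variety of the prefix of length i + 1.
    sv = [successor_variety(word[: i + 1], words) for i in range(len(word))]
    segments, start = [], 0
    for i in range(1, len(sv) - 1):
        if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]:
            segments.append(word[start : i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


D = ["able", "ape", "beatable", "fixable", "read", "readable",
     "reading", "reads", "red", "rope", "ripe"]
print("-".join(peak_segments("readable", D)))
```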
Stem method of Hafer and Weiss



Determine the successor varieties of a word.
Use this information to segment the word using one of
the previous methods (say, peak and plateau).
Choose one of the segments as the stem:

if (first segment occurs in <= 12 words in the corpus)
    first segment is stem
else
    // the first segment is probably a prefix
    second segment is stem
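The selection rule translates almost directly into code. In this sketch, "occurs in" is read as "is a prefix of", which is an assumption; the name `choose_stem` and the `max_words` parameter are also illustrative.

```python
def choose_stem(segments, words, max_words: int = 12) -> str:
    """Pick the first segment unless it is so frequent that it looks like a prefix."""
    first = segments[0]
    # Assumption: a segment "occurs in" a word when the word starts with it.
    count = sum(1 for w in words if w.startswith(first))
    if count <= max_words or len(segments) == 1:
        return first
    return segments[1]


D = ["able", "ape", "beatable", "fixable", "read", "readable",
     "reading", "reads", "red", "rope", "ripe"]
print(choose_stem(["read", "able"], D))
```

With this small corpus "read" starts only four words, so it is kept as the stem; lowering `max_words` below four would select "able" instead.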
Stem method of Hafer and Weiss



Input: a segmented word
Output: the stem of the word
For example:
  read-able is the input
  read is the output
  (able may be the output instead, depending on the
  outcome of the test in the algorithm)
Accessor Variety Method in
Chinese



The notion was introduced by Feng, Chen, Zheng, and Deng
for Chinese word extraction.
The idea is similar to successor variety.
It is used for Chinese text segmentation, since it
is difficult to separate words in Chinese text. In
comparison, English words are separated by a space
symbol in text.
Definition: Accessor Variety




We treat each Chinese character as a letter.
For each string (a potential word) consisting of several
characters, we define the successor variety as in English.
Symmetrically, we also define a predecessor variety for
each string.
A string is considered a word if it has a large successor
variety and a large predecessor variety.
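Both varieties can be counted in one pass over an unsegmented text. The sketch below is illustrative only: the function names, the threshold of 3, and the toy Latin-letter text are all assumptions, and the cited paper's exact criteria may differ.

```python
def accessor_varieties(s: str, text: str):
    """Return (predecessor variety, successor variety) of `s` in `text`."""
    preds, succs = set(), set()
    start = text.find(s)
    while start != -1:
        if start > 0:
            preds.add(text[start - 1])      # character just before this occurrence
        end = start + len(s)
        if end < len(text):
            succs.add(text[end])            # character just after this occurrence
        start = text.find(s, start + 1)
    return len(preds), len(succs)


def looks_like_word(s: str, text: str, threshold: int = 3) -> bool:
    """Hypothetical test: both varieties must reach the threshold."""
    return min(accessor_varieties(s, text)) >= threshold


toy_text = "xabyzabqcabw"   # made-up text with no word separators
print(accessor_varieties("ab", toy_text), looks_like_word("ab", toy_text))
```

The same functions work unchanged on Chinese text, because Python strings are indexed by character.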
Testing Results



The accessor variety method turns out to be a very simple yet
efficient way to recognize Chinese words when
combined with some simple grammar rules.
For details, see our paper:
http://www.cs.cityu.edu.hk/~deng/5286/feng.pdf
Word similarity

N-gram method:
  Break a word of length n into (n-1) digrams, each a
  substring of two consecutive characters of the word.
  Count the number of distinct digrams.
  Let A (B) be the number of distinct digrams in
  word 1 (word 2). Let C be the number of distinct
  digrams shared by word 1 and word 2.
  The similarity of the two words is
    S = 2C/(A+B)
Example of Word similarity





statistics: st, ta, at, ti, is, st, ti, ic, cs
  its distinct digrams: at, cs, ic, is, st, ta, ti
statistical: st, ta, at, ti, is, st, ti, ic, ca, al
  its distinct digrams: al, at, ca, ic, is, st, ta, ti
A=7, B=8, C=6
Similarity = 2x6/(7+8) = 12/15 = 4/5 = 80%
One may build a similarity matrix of all words in a
corpus, calculated as above, and then apply a
cutoff value (set an entry to 0 if it is less than a certain
value, and to 1 otherwise).
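The digram similarity S = 2C/(A+B) is short to compute with sets. This is a minimal sketch; the function names are illustrative, and returning 0.0 for words too short to have any digram is an assumption.

```python
def digrams(word: str) -> set:
    """Set of distinct two-character substrings of the word."""
    word = word.lower()
    return {word[i : i + 2] for i in range(len(word) - 1)}


def digram_similarity(word1: str, word2: str) -> float:
    """S = 2C / (A + B), with A, B, C counted over distinct digrams."""
    a, b = digrams(word1), digrams(word2)
    if not a and not b:
        return 0.0  # assumption: no digrams means no measurable similarity
    return 2 * len(a & b) / (len(a) + len(b))


print(len(digrams("statistics")), len(digrams("statistical")))
print(digram_similarity("statistics", "statistical"))
```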
Thesaurus


Vocabulary control in an information
retrieval system
Thesaurus construction


Manual construction
Automatic construction
Vocabulary control

Standard vocabulary for both indexing
and searching (for the constructors of
the system and the users of the system)
Objectives of vocabulary control


To promote the consistent representation of
subject matter by indexers and
searchers, thereby avoiding the dispersion of
related materials.
To facilitate the conduct of a comprehensive
search on some topic by linking together
terms whose meanings are related
paradigmatically.
Thesaurus

Unlike a common dictionary, which gives words with their
explanations, a thesaurus
  may contain the words of a whole language,
  or only the words of a specific domain,
  with a lot of other information, especially the
  relationships between words:
    classification of the words in the language
    word relationships such as synonyms and antonyms
On-Line Thesaurus



http://www.thesaurus.com
http://www.dictionary.com/
http://www.cogsci.princeton.edu/~wn/
Dictionary vs. Thesaurus
Look up "information" using http://www.thesaurus.com

Dictionary
  in·for·ma·tion (ĭn′fər-mā′shən) n.
  Knowledge derived from study, experience,
  or instruction.
  Knowledge of specific events or situations that
  has been gathered or received by
  communication; intelligence or news.
  See Synonyms at knowledge.
  ......

Thesaurus
  [Nouns] information, enlightenment,
  acquaintance ……
  [Verbs] tell; inform, inform of; acquaint,
  acquaint with; impart, ……
  [Adjectives] informed; communique;
  reported; published
Use of Thesaurus


To control the terms used in indexing: for a
specific domain, use only the terms in
the thesaurus as indexing terms.
To assist users in forming proper queries,
using the help information contained in the
thesaurus.
Construction of Thesaurus


Stemming can be used to reduce the
size of the thesaurus.
A thesaurus can be constructed either manually or
automatically.
WordNet: manually constructed

WordNet® is an online lexical reference
system whose design is inspired by
current psycholinguistic theories of
human lexical memory. English nouns,
verbs, adjectives and adverbs are
organized into synonym sets, each
representing one underlying lexical
concept. Different relations link the
synonym sets.
Relations in WordNet
Automatic Thesaurus Construction


A variety of methods can be used in
constructing a thesaurus.
Term similarity can be used for
constructing the thesaurus.
Complete Term Relation Method
        Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
Doc1    0      4      0      0      0      2      1      3
Doc2    3      1      4      3      1      2      0      1
Doc3    3      0      0      0      3      0      3      0
Doc4    0      1      0      3      0      0      2      0
Doc5    2      2      2      3      1      4      0      2

The term-document relationship can be calculated using a variety of methods,
such as tf-idf.
Term similarity can be calculated based on the term-document relationship,
for example:

Sim(Term_i, Term_j) = \sum_{k \in \text{all documents}} DocTerm_{k,i} \cdot DocTerm_{k,j}
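The similarity of two terms is the dot product of their columns in the term-document table. A minimal sketch using the five documents above follows; the names `DOC_TERM` and `term_similarity` are illustrative, and terms are indexed from 0, so Term1 is column 0.

```python
# Rows are Doc1..Doc5, columns are Term1..Term8 (values from the table above).
DOC_TERM = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]


def term_similarity(doc_term, i: int, j: int) -> int:
    """Sum over all documents k of DocTerm[k][i] * DocTerm[k][j]."""
    return sum(row[i] * row[j] for row in doc_term)


print(term_similarity(DOC_TERM, 0, 1))  # Term1 vs Term2
print(term_similarity(DOC_TERM, 0, 2))  # Term1 vs Term3
```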
Complete Term Relation Method
        Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
Term1   -      7      16     15     14     14     9      7
Term2   7      -      8      12     3      18     6      17
Term3   16     8      -      18     6      16     0      8
Term4   15     12     18     -      6      18     6      9
Term5   14     3      6      6      -      6      9      3
Term6   14     18     16     18     6      -      2      16
Term7   9      6      0      6      9      2      -      3
Term8   7      17     8      9      3      16     3      -

Set threshold to 10
Complete Term Relation Method
[Figure: graph on the terms T1 to T8]

Group
  T1, T3, T4, T6
  T1, T5
  T2, T4, T6
  T2, T6, T8
  T7
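The groups listed here coincide with the maximal cliques of the graph that links two terms when their similarity exceeds the threshold of 10. The sketch below reproduces them from the term-document counts with a basic Bron-Kerbosch search; reading the method as clique finding is an interpretation of the slide, and all names are illustrative.

```python
# Rows are Doc1..Doc5, columns are Term1..Term8.
DOC_TERM = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]
THRESHOLD = 10


def neighbours(doc_term, threshold):
    """Link terms i and j when their dot-product similarity exceeds the threshold."""
    n = len(doc_term[0])
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sum(row[i] * row[j] for row in doc_term) > threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(sorted(clique))
            return
        for v in sorted(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return sorted(found)


groups = [[f"T{i + 1}" for i in clique]
          for clique in maximal_cliques(neighbours(DOC_TERM, THRESHOLD))]
print(groups)
```

A term with no link above the threshold, such as T7, forms a group by itself.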