Transcript Slide 1
Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 8:
Natural Language Processing and IR.
Synonymy, Morphology, and Stemming
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
Parallel computing can improve
response time for each query and/or
throughput: the number of queries processed per unit time
Document partitioning is simple
good for distributed computing
Term partitioning is good for some data structures
Distributed computing is MIMD computing with slow communication
SIMD machines are good for Signature files
Both are out of favor now
2
Previous Chapter: Research topics
How to evaluate the speedup
New algorithms
Adaptation of existing algorithms
Merging the results is a bottleneck
Meta search engines
Creating large collections with relevance judgements
Is recall important?
3
Problem
Recall image retrieval:
Find images similar in color, size, ...
Find photos of the Korean President?
Find nice girls? (Don't show ugly ones!)
Looks very stupid
Lacks understanding
Too difficult
Text retrieval is no exception
Find stories with sad beginning and happy end ?
Lacks understanding
Difficult but possible
4
Possible?
Text is intended to facilitate understanding
Supposedly, even partial understanding should help
Degrees of understanding:
Character strings (what is used now): well, geese, him
Words (often used now): goose, he
Concepts: hole in the ground (well), Roh Moo-Hyun
Complex concepts: oil well, hot dog
Situations (sentences, paragraphs)
The story (direct meaning)
The message (pragmatics, intended impact)
5
Easy?
Main problems:
Multiple ways to say the same
• Query does not match the doc
• Difficult to specify all variants
Ambiguity of the text
• False alarms in matching
The computer's lack of implicit knowledge
• The computer “does not understand” the message
• Difficult to make inferences
Natural Language Processing tries to solve them
6
Solutions
Multiple ways to say the same?
Normalizing: transforming to a “standard” variant
Ambiguity of the text?
Ambiguity resolution
Normalizing to one of the variants
Perhaps the main problem in natural language processing
The computer's lack of implicit knowledge?
Dictionaries, grammars
Knowledge of language structure is needed in all tasks
Knowledge of the world is useful for advanced tasks
Knowledge of language use is a substitute
7
Synonymy
Multiple ways to say the same
Or at least cases where the difference does not matter
Can be substituted in any (or at least many) contexts
Lexical synonymy
Woman / female, professor / teacher
Dictionaries
Phrase-level or sentence-level synonymy
They gave me a book / I was given a book by them
Syntactic analyzers
Semantic-level synonymy
Reasoning
8
Not only synonymy
Multiple ways to say
the same (synonymy)
saying less: more general (hypernymy)
saying more: more specific (hyponymy)
Complete synonyms are rare
professor / teacher
Abbreviations are usually (almost) complete synonyms
When the differences do not matter, this can be treated as synonymy
But: different data structures and methods
9
Lexical-level synonymy
Lexical synonymy
Woman / female
Mixed-type synonymy: USA / United States
Morphology is a kind of synonymy (actually hyponymy)
‘geese’ = ‘goose’ + ‘many’
Russian ‘knigu’ = ‘kniga’ + ‘accusative role’
The “second” part of the meaning is either not important or is another term
Morphology is a very common problem in IR
10
Lexical synonymy
Woman / female
Dictionaries
Synonym dictionaries
WordNet
Automatic learning of synonymy
Clustering of contexts
If the contexts are very similar, the words are possible synonyms
Problem: is meaning preserved? Monday / Tuesday
An interesting solution: compare dictionary definitions
11
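The context-clustering idea above can be sketched as follows: represent each word by counts of its neighboring words and compare the count vectors with cosine similarity. This is a minimal illustration, not a real system; the toy corpus and all function names are invented for this sketch.

```python
from collections import Counter
from math import sqrt

def context_vector(word, corpus, window=2):
    """Count the words appearing within `window` positions of `word`."""
    ctx = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == word:
                lo, hi = max(0, i - window), i + window + 1
                ctx.update(sent[lo:i] + sent[i + 1:hi])
    return ctx

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy corpus (invented): similar contexts suggest possible synonymy
corpus = [
    "the professor taught the class".split(),
    "the teacher taught the class".split(),
    "the dog chased the cat".split(),
]
sim = cosine(context_vector("professor", corpus),
             context_vector("teacher", corpus))
```

Note the slide's caveat: the contexts of Monday and Tuesday are also very similar, so high context similarity does not guarantee that meaning is preserved.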
Uses in IR
Query expansion
Add synonyms of the word to the query and process normally
Flexible, slow
Best for lexical synonymy: few synonyms, doubtful
Reducing at index time
When reading the documents, reduce each word to a “standard” synonym
Fast, rigid
Best for morphology: many synonyms, less doubtful
Hierarchical indexing
12
Hierarchical indexing
(Gelbukh, Sidorov, Guzman-Arenas 2002)
Tree of concepts:
Living things
  Animals
    1. a. cat, b. cats
    2. a. dog, b. dogs
  Persons
    3. a. professor, b. professors
    4. a. student, b. students
Order the vocabulary by the order of the leaves of the tree
Query expansion is done by ranges:
cat: 1, living things: 1–4
13
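The range scheme above can be sketched as: number the leaves in depth-first order, and record for each concept the span of leaf numbers under it. The tree literal mirrors the slide; the function name and representation are my own assumptions, not the published data structure.

```python
# Concept tree from the slide: {} marks a leaf
tree = {
    "living things": {
        "animals": {"cat": {}, "dog": {}},
        "persons": {"professor": {}, "student": {}},
    }
}

def leaf_ranges(node, counter=None, ranges=None):
    """Map every concept to its (first, last) leaf-number range (DFS order)."""
    if counter is None:
        counter, ranges = [0], {}
    for name, children in node.items():
        if not children:                     # a leaf: assign the next number
            counter[0] += 1
            ranges[name] = (counter[0], counter[0])
        else:                                # internal node: span of its leaves
            start = counter[0] + 1
            leaf_ranges(children, counter, ranges)
            ranges[name] = (start, counter[0])
    return ranges

ranges = leaf_ranges(tree)
# ranges["cat"] == (1, 1); ranges["living things"] == (1, 4)
```

A query for a general concept then expands to a contiguous range of vocabulary entries ("living things": 1–4) rather than an explicit list of terms.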
Morphology
One of the major concerns in IR
Can be done
precisely
approximately (quick-and-dirty)
Level of generalization
inflection: student – students
derivation: study – student
Ambiguity
all variants
one variant
14
... morphology
Result is
The unique ID
The dictionary form
A “stem”: part of the same string
15
Morphological analyzers
Precise analysis
Ambiguous
Give all variants
Tables: to table or the table?
Spanish charlas: charla ‘talk’ or charlar ‘to talk’
Russian dush: dush ‘shower’ or dusha ‘soul’
Common in languages with developed morphology
For short words, some 3 – 5 – 10 variants
Dictionaries are used
16
Morphological system
Dictionary specifies:
Stem: bak-, ask-
POS (part of speech): verb
Inflection class (what endings it accepts): 1, 2
Tables of endings specify
Paradigms:
1. -e -es -ed -ed -ing
2. -, -s -ed -ed -ing
Meanings: participle, ...
17
... morphological system
Algorithm
Decompose the word into an existing stem and ending
Check compatibility of stem and ending
Give the stem ID and ending meaning
Ambiguous
Many variants of decompositions
Many stems with different IDs
Many endings with different meanings
• -ed: past or participle
Problem: words absent in dictionary
18
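The decompose-and-check algorithm above can be sketched with a toy version of the slide's dictionary. The stem list, class numbers, and ending meanings here are invented for illustration; a real analyzer would have a full dictionary and ending tables.

```python
# Toy morphological system: stems with an inflection class,
# and per-class ending tables (as on the previous slide)
STEMS = {"bak": 1, "ask": 2}          # stem -> inflection class
ENDINGS = {
    1: {"e": "base", "es": "3sg", "ed": "past/participle", "ing": "gerund"},
    2: {"": "base", "s": "3sg", "ed": "past/participle", "ing": "gerund"},
}

def analyze(word):
    """Return all (stem, ending-meaning) decompositions of `word`."""
    analyses = []
    for i in range(len(word) + 1):
        stem, ending = word[:i], word[i:]
        cls = STEMS.get(stem)
        # compatibility check: the ending must belong to the stem's class
        if cls is not None and ending in ENDINGS[cls]:
            analyses.append((stem, ENDINGS[cls][ending]))
    if not analyses:
        raise KeyError(f"word absent from dictionary: {word}")
    return analyses
```

The ambiguity of -ed ("past or participle") is kept in the ending meaning, and the final `raise` marks the slide's open problem: words absent from the dictionary.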
Stemming
Substitute for real analysis
Both inflection and derivation
Quick-and-dirty
Only one variant
Result: a part of the string
• gene, genial → gen-
Cheap development
bad results
simple description. Standard
Often used in academic research
Formerly used in real systems, but less so now
19
Porter stemmer
Martin Porter, 1980
Standard stemmer
Provides an equal basis for evaluating different IR programs
Uses “measure” m:
[C](VC){m}[V].
m=0 TR, EE, TREE, Y, BY.
m=1 TROUBLE, OATS, TREES, IVY.
m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
20
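The measure m can be computed by reducing a word to its consonant/vowel form and counting the VC pairs in [C](VC){m}[V]. A sketch, following Porter's rule that y counts as a vowel when it follows a consonant and as a consonant otherwise:

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Porter's definition: y after a consonant counts as a vowel."""
    c = word[i]
    if c in VOWELS:
        return False
    if c == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Count m in the decomposition [C](VC){m}[V]."""
    # Build the C/V form, collapsing runs of the same type
    form = ""
    for i in range(len(word)):
        t = "C" if is_consonant(word, i) else "V"
        if not form or form[-1] != t:
            form += t
    return form.count("VC")

# measure("tree") == 0, measure("trouble") == 1, measure("troubles") == 2
```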
... Porter stemmer
Step 1a
SSES -> SS   caresses -> caress
IES  -> I    ponies -> poni, ties -> ti
SS   -> SS   caress -> caress
S    ->      cats -> cat
21
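Step 1a can be implemented by trying the suffix rules longest-first, applying only the first rule whose suffix matches. A minimal sketch:

```python
def step1a(word):
    """Porter step 1a (plural endings); rules tried longest-first."""
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS  caresses -> caress
    if word.endswith("ies"):
        return word[:-2]          # IES  -> I   ponies -> poni
    if word.endswith("ss"):
        return word               # SS   -> SS  caress -> caress
    if word.endswith("s"):
        return word[:-1]          # S    ->     cats -> cat
    return word
```

The ordering matters: without it, "caresses" would match the plain S rule first and lose only one letter.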
... Porter stemmer
Step 1b
(m>0) EED -> EE   feed -> feed, agreed -> agree
(*v*) ED  ->      plastered -> plaster, bled -> bled
(*v*) ING ->      motoring -> motor, sing -> sing
22
... Porter stemmer
If the 2nd or 3rd rule was successful:
AT -> ATE   conflat(ed) -> conflate
BL -> BLE   troubl(ed) -> trouble
IZ -> IZE   siz(ed) -> size
(*d and not (*L or *S or *Z)) -> single letter
• hopp(ing) -> hop
• tann(ed) -> tan
• fall(ing) -> fall
• hiss(ing) -> hiss
• fizz(ed) -> fizz
(m=1 and *o) -> E
• fail(ing) -> fail
• fil(ing) -> file
23
... Porter stemmer
Step 1c
(*v*) Y -> I
• happy -> happi
• sky -> sky
24
... Porter stemmer
Step 2
(m>0) ATIONAL -> ATE relational -> relate
(m>0) TIONAL -> TION conditional -> condition rational -> rational
(m>0) ENCI -> ENCE valenci -> valence
(m>0) ANCI -> ANCE hesitanci -> hesitance
(m>0) IZER -> IZE digitizer -> digitize
(m>0) ABLI -> ABLE conformabli -> conformable
(m>0) ALLI -> AL radicalli -> radical
(m>0) ENTLI -> ENT differentli -> different
(m>0) ELI -> E vileli -> vile
(m>0) OUSLI -> OUS analogousli -> analogous
(m>0) IZATION -> IZE vietnamization -> vietnamize
(m>0) ATION -> ATE predication -> predicate
(m>0) ATOR -> ATE operator -> operate
(m>0) ALISM -> AL feudalism -> feudal
(m>0) IVENESS -> IVE decisiveness -> decisive
(m>0) FULNESS -> FUL hopefulness -> hopeful
(m>0) OUSNESS -> OUS callousness -> callous
(m>0) ALITI -> AL formaliti -> formal
(m>0) IVITI -> IVE sensitiviti -> sensitive
(m>0) BILITI -> BLE sensibiliti -> sensible
25
... Porter stemmer
Step 3
(m>0) ICATE -> IC triplicate -> triplic
(m>0) ATIVE -> formative -> form
(m>0) ALIZE -> AL formalize -> formal
(m>0) ICITI -> IC electriciti -> electric
(m>0) ICAL -> IC electrical -> electric
(m>0) FUL -> hopeful -> hope
(m>0) NESS -> goodness -> good
26
... Porter stemmer
Step 4
(m>1) AL -> revival -> reviv
(m>1) ANCE -> allowance -> allow
(m>1) ENCE -> inference -> infer
(m>1) ER -> airliner -> airlin
(m>1) IC -> gyroscopic -> gyroscop
(m>1) ABLE -> adjustable -> adjust
(m>1) IBLE -> defensible -> defens
(m>1) ANT -> irritant -> irrit
(m>1) EMENT -> replacement -> replac
(m>1) MENT -> adjustment -> adjust
(m>1) ENT -> dependent -> depend
(m>1 and (*S or *T)) ION -> adoption -> adopt
(m>1) OU -> homologou -> homolog
(m>1) ISM -> communism -> commun
(m>1) ATE -> activate -> activ
(m>1) ITI -> angulariti -> angular
(m>1) OUS -> homologous -> homolog
(m>1) IVE -> effective -> effect
(m>1) IZE -> bowdlerize -> bowdler
27
... Porter stemmer
Step 5a
(m>1) E -> probate -> probat, rate -> rate
(m=1 and not *o) E -> cease -> ceas
Step 5b
(m > 1 and *d and *L) -> single letter
• controll -> control
• roll -> roll
28
Statistical stemmers
Take a list of words
Construct a model of language that “generates” it
The “best” one
The simplest one? How to find?
List of stems, list of endings
Determine their probabilities
Usage statistics
Decompose any input string into a stem and an ending
Take the most probable variant
29
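The "most probable variant" step above can be sketched as: score every split of the word as P(stem) × P(ending) and keep the best. The probability tables here are invented for illustration; a real statistical stemmer would estimate them from the word list and its usage statistics.

```python
# Invented probability tables (illustration only)
P_STEM = {"study": 0.02, "studi": 0.015, "s": 0.001}
P_END = {"": 0.4, "es": 0.1, "ed": 0.1, "tudies": 0.0001}

def best_split(word):
    """Return the (stem, ending) split maximizing P(stem) * P(ending)."""
    best, best_p = None, 0.0
    for i in range(1, len(word) + 1):
        stem, ending = word[:i], word[i:]
        p = P_STEM.get(stem, 0.0) * P_END.get(ending, 0.0)
        if p > best_p:
            best, best_p = (stem, ending), p
    return best
```

Unlike a rule-based stemmer, the same machinery handles any word the model assigns nonzero probability to, and returns only one variant: the most probable one.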
Research topics
Construction and application of ontologies
Building morphological dictionaries
Treatment of unknown words by morphological analyzers
Development of better stemmers
Statistical stemmers?
30
Conclusions
Reducing synonyms can help IR
Better matching
Ontologies are used. WordNet
Morphology is a variant of synonymy
widely used in IR systems
Precise analysis: dictionary-based analyzers
Quick-and-dirty analysis: stemmers
Rule-based stemmers. Porter stemmer
Statistical stemmers
31
Thank you!
Till May 24? 25?, 6 pm
32