Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 8:
Natural Language Processing and IR.
Synonymy, Morphology, and Stemming
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
 Parallel computing can improve
 response time for each query and/or
 throughput: number of queries processed at the same speed
 Document partitioning is simple
 good for distributed computing
 Term partitioning is good for some data structures
 Distributed computing is MIMD computing with slow communication
 SIMD machines are good for Signature files
 Both are out of favor now
2
Previous Chapter: Research topics
How to evaluate the speedup
New algorithms
Adaptation of existing algorithms
Merging the results is a bottleneck
 Meta search engines
 Creating large collections with judgements
 Is recall important?
3
Problem
 Recall image retrieval:
 Find images similar in color, size, ...
 Find photos of Korean President ?
 Find nice girls ? (Don’t show ugly ones!)
 Looks very stupid
 Lacks understanding
 Too difficult
 Text retrieval is no exception
 Find stories with sad beginning and happy end ?
 Lacks understanding
 Difficult but possible
4
Possible?
 Text is intended to facilitate understanding
 Supposedly, even partial understanding should help
 Degrees of understanding:
Character strings (what is used now): well, geese, him
Words (often used now): goose, he
Concepts: hole in the ground (well), Roh Moo-Hyun
Complex concepts: oil well, hot dog
Situations (sentences, paragraphs)
The story (direct meaning)
The message (pragmatics, intended impact)
5
Easy?
 Main problems:
 Multiple ways to say the same
• Query does not match the doc
• Difficult to specify all variants
 Ambiguity of the text
• False alarms in matching
 Lack of implicit knowledge of the computer
• The computer “does not understand” the message
• Difficult to make inferences
 Natural Language Processing tries to solve them
6
Solutions
 Multiple ways to say the same?
 Normalizing: transforming to a “standard” variant
 Ambiguity of the text?
 Ambiguity resolution
 Normalizing to one of the variants
 Perhaps the main problem in natural language processing
 Lack of implicit knowledge of the computer?
Dictionaries, grammars
Knowledge of language structure is needed in all tasks
Knowledge of the world is useful for advanced tasks
Knowledge of language use is a substitute
7
Synonymy
 Multiple ways to say the same
 Or at least when the difference does not matter
 Can be substituted in any (many?) context
 Lexical synonymy
 Woman / female, professor / teacher
 Dictionaries
 Phrase-level or sentence-level synonymy
 They gave me a book / I was given a book by them
 Syntactic analyzers
 Semantic-level synonymy
 Reasoning
8
Not only synonymy
 Multiple ways to say
 the same (synonymy)
 less: more general (hypernymy)
 more: more specific (hyponymy)
 Complete synonyms are rare
 professor  teacher
 Abbreviations are usually (almost) complete synonyms
 When the differences do not matter, can be treated as synonymy
 But: different data structures and methods
9
Lexical-level synonymy
 Lexical synonymy
 Woman / female
 Mixed-type synonymy: USA / United States
 Morphology is a kind of synonymy (actually hyponymy)
 ‘geese’ = ‘goose’ + ‘many’
 Russian ‘knigu’ = ‘kniga’ + ‘accusative role’
 the “second” part of the meaning is either not important or is another term
 Morphology is a very common problem in IR
10
Lexical synonymy
 Woman / female
 Dictionaries
 Synonym dictionaries
 WordNet
 Automatic learning of synonymy
Clustering of contexts
If the contexts are very similar, then possible synonyms
Problem: preserves meaning? Monday / Tuesday
An interesting solution: compare dictionary definitions
11
Uses in IR
 Query expansion
 Add synonyms of the word to the query and process normally
 Flexible, slow
 Best for lexical synonymy: few synonyms, doubtful
 Reducing at index time
 When reading the documents, reduce each word to a “standard” synonym
 Fast, rigid
 Best for morphology: many synonyms, less doubtful
 Hierarchical indexing
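The first two options above can be contrasted in a minimal sketch. The tables `SYNONYMS` and `NORMAL_FORM` and all function names are hypothetical, invented for this illustration; a real system would take them from a synonym dictionary and a morphological analyzer.

```python
# Hypothetical toy resources, assumed for illustration only.
SYNONYMS = {"woman": {"female"}, "female": {"woman"}}
NORMAL_FORM = {"geese": "goose", "cats": "cat"}

def expand_query(terms):
    """Query expansion: add the synonyms of each query word at query time."""
    expanded = set(terms)
    for t in terms:
        expanded |= SYNONYMS.get(t, set())
    return expanded

def build_index(docs):
    """Reducing at index time: store each word under its 'standard' variant."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(NORMAL_FORM.get(word, word), set()).add(doc_id)
    return index

def search(index, terms):
    """OR-search; query terms are reduced the same way as document words."""
    hits = set()
    for t in terms:
        hits |= index.get(NORMAL_FORM.get(t, t), set())
    return hits
```

Expansion leaves the index untouched, so the synonym list can change per query (flexible, but more terms to look up); reduction fixes the decision once when the index is built (fast, but rigid).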
12
Hierarchical indexing
(Gelbukh, Sidorov, Guzman-Arenas 2002)
 Tree of concepts
Living things
• Animals
1. a. Cat, b. cats
2. a. Dog, b. dogs
• Persons
3. a. Professor, b. professors
4. a. Student, b. students
 Order vocabulary by the order of the leaves of tree
 Query expansion is done by ranges:
cat: 1, living things: 1-4
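The range idea can be sketched as follows. This is only an illustration built from the tree on the slide, not the implementation from the cited paper; the data layout and the names `number_leaves`, `RANGES`, `WORD_IDS` and `matches` are assumptions.

```python
# Concept tree from the slide: inner nodes hold sub-concepts,
# leaf concepts hold the word forms that share one vocabulary number.
TREE = ("living things", [
    ("animals", [("cat", ["cat", "cats"]), ("dog", ["dog", "dogs"])]),
    ("persons", [("professor", ["professor", "professors"]),
                 ("student", ["student", "students"])]),
])

def number_leaves(node, ranges, word_ids, counter):
    """Number the leaves left to right; record each concept's (first, last)."""
    name, children = node
    first = counter[0] + 1
    if isinstance(children[0], str):      # leaf concept: children are word forms
        counter[0] += 1
        for form in children:
            word_ids[form] = counter[0]
    else:                                 # inner concept: recurse into sub-concepts
        for child in children:
            number_leaves(child, ranges, word_ids, counter)
    ranges[name] = (first, counter[0])

RANGES, WORD_IDS = {}, {}
number_leaves(TREE, RANGES, WORD_IDS, [0])

def matches(doc_words, concept):
    """A document matches a concept if any word number falls in its range."""
    lo, hi = RANGES[concept]
    return any(lo <= WORD_IDS.get(w, 0) <= hi for w in doc_words)
```

Because the vocabulary is ordered by the leaves of the tree, expanding a query to a whole subtree costs one range comparison instead of a list of alternative terms.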
13
Morphology
 One of the major concerns in IR
 Can be done
 precisely
 approximately (quick-and-dirty)
 Level of generalization
 inflection: student – students
 derivation: study – student
 Ambiguity
 all variants
 one variant
14
... morphology
 Result is
 The unique ID
 The dictionary form
 A “stem”: part of the same string
15
Morphological analyzers
 Precise analysis
 Ambiguous
Give all variants
Tables: to table or the table?
Spanish charlas: charla ‘talk’ or charlar ‘to talk’
Russian dush: dush ‘shower’ or dusha ‘soul’
Common in languages with developed morphology
For short words, some 3, 5, or even 10 variants
 Dictionaries are used
16
Morphological system
 Dictionary specifies:
Stem: bak-, ask-
POS (part of speech): verb
Inflection class (what endings it accepts): 1, 2
 Tables of endings specify
Paradigms:
1. -e -es -ed -ed -ing
2. -, -s -ed -ed -ing
Meanings: participle, ...
17
... morphological system
 Algorithm
 Decompose the word into an existing stem and ending
 Check compatibility of stem and ending
 Give the stem ID and ending meaning
 Ambiguous
 Many variants of decompositions
 Many stems with different IDs
 Many endings with different meaning
• -ed: past or participle
 Problem: words absent in dictionary
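The algorithm above can be sketched with the two stems and two paradigms from the previous slide. The stem IDs and the meaning labels other than "past" and "participle" are assumptions made for this example; the slides do not name them.

```python
# Dictionary: stem -> (stem ID, part of speech, inflection class).
# The IDs 1 and 2 are made up for this sketch.
STEMS = {
    "bak": (1, "verb", 1),
    "ask": (2, "verb", 2),
}

# Tables of endings: inflection class -> list of (ending, meaning).
# Meaning labels are assumed; -ed appears twice because it is ambiguous.
PARADIGMS = {
    1: [("e", "base form"), ("es", "3rd person singular"),
        ("ed", "past"), ("ed", "participle"), ("ing", "gerund")],
    2: [("", "base form"), ("s", "3rd person singular"),
        ("ed", "past"), ("ed", "participle"), ("ing", "gerund")],
}

def analyze(word):
    """Return every (stem ID, POS, ending meaning) reading of the word."""
    readings = []
    for cut in range(len(word) + 1):          # try every stem/ending split
        stem, ending = word[:cut], word[cut:]
        entry = STEMS.get(stem)
        if entry is None:                     # stem must exist in the dictionary
            continue
        stem_id, pos, infl_class = entry
        for e, meaning in PARADIGMS[infl_class]:
            if e == ending:                   # ending must be compatible with the stem
                readings.append((stem_id, pos, meaning))
    return readings
```

For example, `analyze("baked")` yields two readings (past and participle), while `analyze("baks")` yields none, because the ending -s is not accepted by inflection class 1. A word whose stem is absent from the dictionary also yields nothing, which is the problem noted in the last bullet.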
18
Stemming
 Substitute for real analysis
 Both inflection and derivation
 Quick-and-dirty
 Only one variant
 Result: a part of the string
• gene, genial -> gen-
 Cheap development
 bad results
 simple description. Standard
 Often used in academic research
 Formerly common in real systems, but less so now
19
Porter stemmer
 Martin Porter, 1980
 Standard stemmer
 Provides equal basis for evaluation of different IR programs
 Uses “measure” m:
 [C](VC){m}[V].
 m=0 TR, EE, TREE, Y, BY.
 m=1 TROUBLE, OATS, TREES, IVY.
 m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
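A minimal sketch of how the measure m can be computed, assuming Porter's usual definition of a consonant: any letter other than A, E, I, O, U, and other than Y preceded by a consonant. The function names are chosen for this example.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Y counts as a vowel only when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC sequences in the form [C](VC){m}[V]."""
    stem = stem.lower()
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:   # a vowel run just ended in a consonant
            m += 1
        prev_is_vowel = not cons
    return m
```

Counting the places where a vowel is followed by a consonant gives the same m as grouping the word into consonant and vowel runs, so no explicit grouping is needed.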
20
... Porter stemmer
 Step 1a
SSES -> SS caresses -> caress
IES -> I ponies -> poni ties -> ti
SS -> SS caress -> caress
S -> cats -> cat
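Step 1a has no conditions, so it can be sketched as a chain of suffix tests tried from the longest suffix to the shortest. The function name `step1a` is chosen for this example.

```python
def step1a(word):
    """Apply the first matching rule of Step 1a (longest suffix first)."""
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES -> I
    if word.endswith("ss"):
        return word           # SS -> SS (blocks the S rule below)
    if word.endswith("s"):
        return word[:-1]      # S -> (nothing)
    return word
```

The SS -> SS rule changes nothing by itself; its role is to stop "caress" from reaching the final S rule.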
21
... Porter stemmer
 Step 1b
 (m>0) EED -> EE feed -> feed agreed -> agree
 (*v*) ED -> plastered -> plaster bled -> bled
 (*v*) ING -> motoring -> motor sing -> sing
22
... Porter stemmer
 If the 2nd or 3rd rule of Step 1b was successful:
AT -> ATE conflat(ed) -> conflate
BL -> BLE troubl(ed) -> trouble
IZ -> IZE siz(ed) -> size
(*d and not (*L or *S or *Z)) -> single letter
• hopp(ing) -> hop
• tann(ed) -> tan
• fall(ing) -> fall
• hiss(ing) -> hiss
• fizz(ed) -> fizz
 (m=1 and *o) -> E
• fail(ing) -> fail
• fil(ing) -> file
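Step 1b together with this clean-up can be sketched as below. The sketch assumes Porter's usual reading of the conditions: *v* means the stem contains a vowel, *d means it ends in a double consonant, and *o means it ends consonant-vowel-consonant where the last consonant is not W, X or Y. The helper names are chosen for this example, and the measure helper is repeated so that the block runs on its own.

```python
VOWELS = set("aeiou")

def is_consonant(w, i):
    if w[i] in VOWELS:
        return False
    if w[i] == "y":                       # Y is a vowel only after a consonant
        return i == 0 or not is_consonant(w, i - 1)
    return True

def measure(stem):                        # m: number of vowel->consonant switches
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def has_vowel(stem):                      # condition *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):          # condition *d
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):                       # condition *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def tidy(stem):
    """Clean-up applied after ED or ING has been removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem

def step1b(word):
    if word.endswith("eed"):              # (m>0) EED -> EE
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):          # (*v*) ED -> , (*v*) ING ->
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return tidy(stem) if has_vowel(stem) else word
    return word
```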
23
... Porter stemmer
 Step 1c
 (*v*) Y -> I
• happy -> happi
• sky -> sky
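A short sketch of Step 1c, assuming that *v* is tested on the part of the word before the final Y and that Y itself counts as a vowel only after a consonant. Function names are chosen for this example.

```python
VOWELS = set("aeiou")

def is_consonant(w, i):
    if w[i] in VOWELS:
        return False
    if w[i] == "y":                       # Y is a vowel only after a consonant
        return i == 0 or not is_consonant(w, i - 1)
    return True

def has_vowel(stem):                      # condition *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def step1c(word):
    """(*v*) Y -> I: replace a final Y when the stem contains a vowel."""
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word
```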
24
... Porter stemmer
 Step 2
(m>0) ATIONAL -> ATE relational -> relate
(m>0) TIONAL -> TION conditional -> condition rational -> rational
(m>0) ENCI -> ENCE valenci -> valence
(m>0) ANCI -> ANCE hesitanci -> hesitance
(m>0) IZER -> IZE digitizer -> digitize
(m>0) ABLI -> ABLE conformabli -> conformable
(m>0) ALLI -> AL radicalli -> radical
(m>0) ENTLI -> ENT differentli -> different
(m>0) ELI -> E vileli -> vile
(m>0) OUSLI -> OUS analogousli -> analogous
(m>0) IZATION -> IZE vietnamization -> vietnamize
(m>0) ATION -> ATE predication -> predicate
(m>0) ATOR -> ATE operator -> operate
(m>0) ALISM -> AL feudalism -> feudal
(m>0) IVENESS -> IVE decisiveness -> decisive
(m>0) FULNESS -> FUL hopefulness -> hopeful
(m>0) OUSNESS -> OUS callousness -> callous
(m>0) ALITI -> AL formaliti -> formal
(m>0) IVITI -> IVE sensitiviti -> sensitive
(m>0) BILITI -> BLE sensibiliti -> sensible
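Since every rule of Step 2 has the same shape, the step can be sketched as a table plus one loop. The sketch assumes Porter's convention that only the longest matching suffix is considered, and that nothing happens if its condition fails. `STEP2_RULES` and `step2` are names chosen for this example, and the measure helper is repeated so that the block runs on its own.

```python
VOWELS = set("aeiou")

def is_consonant(w, i):
    if w[i] in VOWELS:
        return False
    if w[i] == "y":                       # Y is a vowel only after a consonant
        return i == 0 or not is_consonant(w, i - 1)
    return True

def measure(stem):                        # m: number of vowel->consonant switches
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

# (suffix, replacement) pairs copied from the slide; all require m > 0.
STEP2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]

def step2(word):
    # Try longer suffixes first so that ATIONAL wins over TIONAL, etc.
    for suffix, replacement in sorted(STEP2_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word
```

Steps 3 and 4 follow the same pattern with their own tables and conditions.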
25
... Porter stemmer
 Step 3
(m>0) ICATE -> IC triplicate -> triplic
(m>0) ATIVE -> formative -> form
(m>0) ALIZE -> AL formalize -> formal
(m>0) ICITI -> IC electriciti -> electric
(m>0) ICAL -> IC electrical -> electric
(m>0) FUL -> hopeful -> hope
(m>0) NESS -> goodness -> good
26
... Porter stemmer
 Step 4
(m>1) AL -> revival -> reviv
(m>1) ANCE -> allowance -> allow
(m>1) ENCE -> inference -> infer
(m>1) ER -> airliner -> airlin
(m>1) IC -> gyroscopic -> gyroscop
(m>1) ABLE -> adjustable -> adjust
(m>1) IBLE -> defensible -> defens
(m>1) ANT -> irritant -> irrit
(m>1) EMENT -> replacement -> replac
(m>1) MENT -> adjustment -> adjust
(m>1) ENT -> dependent -> depend
(m>1 and (*S or *T)) ION -> adoption -> adopt
(m>1) OU -> homologou -> homolog
(m>1) ISM -> communism -> commun
(m>1) ATE -> activate -> activ
(m>1) ITI -> angulariti -> angular
(m>1) OUS -> homologous -> homolog
(m>1) IVE -> effective -> effect
(m>1) IZE -> bowdlerize -> bowdler
27
... Porter stemmer
 Step 5a
 (m>1) E -> probate -> probat rate -> rate
 (m=1 and not *o) E -> cease -> ceas
 Step 5b
 (m > 1 and *d and *L) -> single letter
• controll -> control
• roll -> roll
28
Statistical stemmers
 Take a list of words
 Construct a model of language that “generates” it
 The “best” one
 The simplest one? How to find?
 List of stems, list of endings
 Determine their probabilities
 Usage statistics
 Decompose any input string into a stem and an ending
 Take the most probable variant
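A toy sketch of this idea, not a real statistical stemmer: every split of every training word adds one count to a stem and to an ending, and a new string is split where the product of the two counts is largest. Using raw counts instead of probabilities, limiting endings to three letters, and breaking ties in favor of the longer ending are all simplifying assumptions of this example, as are the function names.

```python
from collections import Counter

def train(words, max_ending=3):
    """Collect stem and ending counts from all splits of all words."""
    stems, endings = Counter(), Counter()
    for w in words:
        for k in range(min(max_ending, len(w) - 1) + 1):
            stems[w[:len(w) - k]] += 1
            endings[w[len(w) - k:]] += 1
    return stems, endings

def best_split(word, stems, endings, max_ending=3):
    """Return the (stem, ending) split with the highest score."""
    best = None
    for k in range(min(max_ending, len(word) - 1) + 1):
        stem, ending = word[:len(word) - k], word[len(word) - k:]
        # The totals are the same for every split, so the product of counts
        # ranks the splits exactly as the product of relative frequencies would.
        score = stems[stem] * endings[ending]
        if best is None or (score, k) > (best[0], best[1]):
            best = (score, k, stem, ending)
    return best[2], best[3]
```

With the training list walk, walks, walked, walking, talk, talks, talked, talking, the string "walked" is split into "walk" + "ed". A string whose stem never occurred in training gets a score of zero for every split, so a usable model would need smoothing or some other treatment of unknown stems.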
29
Research topics
 Construction and application of ontologies
 Building of morphological dictionaries
 Treatment of unknown words with morphological analyzers
 Development of better stemmers
 Statistical stemmers?
30
Conclusions
 Reducing synonyms can help IR
 Better matching
 Ontologies are used. WordNet
 Morphology is a variant of synonymy
 widely used in IR systems
 Precise analysis: dictionary-based analyzers
 Quick-and-dirty analysis: stemmers
 Rule-based stemmers. Porter stemmer
 Statistical stemmers
31
Thank you!
Till May 24? 25?, 6 pm
32