IR Through the Ages
Intelligent Information Retrieval
CS 336 – Lecture 3: Text Operations
Xiaoyan Li
Spring 2006
Topics
• Five-step document preprocessing
• Porter stemming algorithm
• Text compression
Five-Step Document Preprocessing
• Lexical analysis of the text
– How to treat digits, hyphens, punctuation marks, the case of
letters
• Elimination of stopwords
– Words with low discrimination values
• Stemming
– Removing prefixes and suffixes
• Selection of index terms
– Determine which words/stems will be used as indexing
elements
• Construction of term categorization structures
– e.g., a thesaurus
Step 1: Lexical analysis of the text
• Converting the text of a document (a large
string or a stream of characters) to a stream
of words
– Word separators (English, Chinese)
• How to deal with digits, punctuation marks,
hyphens, and the case of letters
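A minimal sketch of this step, assuming English-like text where whitespace and punctuation separate words. The function name `tokenize` and its two switches are illustrative, not part of the lecture. Languages without explicit word separators, such as Chinese, need a real segmenter instead.

```python
import re

def tokenize(text, keep_digits=False, split_hyphens=True):
    """Convert a character stream into a list of lowercase word tokens.

    Illustrative policy choices (assumptions, not the lecture's rules):
      - case of letters: everything is folded to lowercase
      - punctuation: always treated as a separator
      - digits: dropped unless keep_digits is True
      - hyphens: split the word unless split_hyphens is False
    """
    chars = "a-z0-9" if keep_digits else "a-z"
    word = f"[{chars}]+"
    # When hyphens are kept, allow word-hyphen-word chains as one token.
    pattern = word if split_hyphens else f"{word}(?:-{word})*"
    return re.findall(pattern, text.lower())

sample = "State-of-the-art IR, since 2006!"
print(tokenize(sample))
print(tokenize(sample, keep_digits=True))
print(tokenize(sample, split_hyphens=False))
```

Each switch corresponds to one of the decisions listed on the slide; a real system would choose them per collection.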
Step 2: Elimination of stopwords
• Frequent words in the collection
• Not good discriminators
– Filtered out as potential index terms
• Elimination of stopwords reduces the size of
the indexing structure considerably.
– By 40% or more
• Examples
– Articles, prepositions, conjunctions, etc.
– Even some verbs, adverbs and adjectives
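Stopword elimination can be sketched as a simple set lookup. The list below is a tiny hand-picked sample for illustration only; real systems use much longer lists, often derived from collection frequencies.

```python
# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "are"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set, keeping the original order."""
    return [t for t in tokens if t not in stopwords]

tokens = ["the", "porter", "algorithm", "is", "a", "stemmer"]
print(remove_stopwords(tokens))
```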
Step 3: Stemming
• Problem with exact matching:
– One query word “connect” and its multiple variants “connected”,
“connecting”, “connects” in different documents
• Stemming: Reduce variants of the same root word to
a common concept
• Stemming also reduces the number of distinct index
terms
• The Porter Algorithm
Stemming Approaches
• Table lookup
– Generation is complex
– Final tables are often incomplete
• Affix removal
– Suffix vs. prefix (e.g. mega-volt)
– Doesn’t always work, esp. not in German
• Successor variety stemming
– More complex than suffix removal
– Uses (e.g.) linguistic approaches and techniques from
morphology
• N-grams
– General clustering approach which can also be used for
stemming
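To make the n-gram idea concrete, here is a small sketch that scores two words by the overlap of their character bigrams using the Dice coefficient. The choice of bigrams and of Dice is an assumption made for illustration; words with high scores would then be clustered together and treated as variants of one term.

```python
def bigrams(word):
    """Return the set of distinct adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice coefficient over the distinct bigrams of two words (0.0 to 1.0)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

print(round(dice("connect", "connected"), 3))  # high: many shared bigrams
print(dice("connect", "policy"))               # no shared bigrams
```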
Step 4: Selection of index terms
• Full text representation vs. selected set of
terms as index terms
• Many distinct automatic approaches
• The identification of noun groups (Inquery
system)
– Most of the semantics is carried by the noun
words in a sentence
– Combine nearby nouns into noun groups.
Step 5: Construction of term categorization
structures
• A thesaurus
– A standard vocabulary for indexing and searching
– Relationships among indexed terms
– Assist users with locating terms for proper query formulation
• An example of an entry in Roget’s thesaurus
– Cowardly adjective
– Ignobly lacking in courage: cowardly turncoats
– Syns: chicken (slang), chicken-hearted, craven, dastardly,
faint-hearted, gutless, lily-livered, pusillanimous, unmanly,
yellow (slang), yellow-bellied (slang).
Thesauri
• Indexed terms
– Each denotes a concept, the basic semantic unit
– Can be individual words, groups of words, or
phrases
– Terms are basically nouns
– Terms can also be verbs in gerund form whenever
they are used as nouns. (teaching, acting etc.)
• Relationships
– The set of terms related to an entry is mostly
composed of synonyms or near-synonyms.
The Use of Thesauri in IR
• Selecting related terms from a thesaurus to reformulate
a query when the initial query words are erroneous or
improper.
• Unfortunately, this approach does not work well in
general.
– Relationships captured in a thesaurus are often not valid in the
local context of a given query.
• An alternative: determine thesaurus-like relationships
at query time
– Challenging for web search: cannot afford the effort for each
individual query
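Thesaurus-based query reformulation can be sketched as a dictionary lookup. The entry below reuses a few synonyms from the Roget's example; the function name `expand_query` and the `max_related` cutoff are hypothetical. The sketch also shows the weakness noted above: related terms are added regardless of the query's local context.

```python
# Toy thesaurus: one entry, with synonyms taken from the Roget's example.
THESAURUS = {
    "cowardly": ["craven", "dastardly", "gutless", "pusillanimous"],
}

def expand_query(terms, thesaurus=THESAURUS, max_related=2):
    """Append up to max_related thesaurus terms after each query term."""
    expanded = []
    for term in terms:
        expanded.append(term)
        for related in thesaurus.get(term, [])[:max_related]:
            if related not in expanded:
                expanded.append(related)
    return expanded

print(expand_query(["cowardly", "turncoats"]))
```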
The Porter Algorithm
• Special algorithm for the English language
based on suffix removal
• Five distinct phases, applied to each word
one after another
• Example: remove plural ‘s’ and ‘sses’
– Rules: sses -> ss, s -> NIL (apply in this order!)
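The importance of rule order can be seen in a short sketch. This covers only the two plural rules from the example, not the full Porter algorithm; the name `strip_plural` is illustrative.

```python
# Rules are (suffix, replacement) pairs, tried in order; the first match wins.
PLURAL_RULES = [("sses", "ss"), ("s", "")]

def strip_plural(word, rules=PLURAL_RULES):
    """Apply the first rule whose suffix matches; return the word unchanged otherwise."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(strip_plural("classes"))  # sses -> ss fires first
print(strip_plural("cats"))     # only s -> NIL matches

# Reversing the order lets s -> NIL fire on "classes" and gives a worse stem.
print(strip_plural("classes", [("s", ""), ("sses", "ss")]))
```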
Porter Algorithm
• Conventions
– C: consonant, V: vowel, L: consonant or vowel
– Combination of C, V, L to define patterns
– Operators “+” and “*” to form complex patterns
• *: zero or more repetitions of a given pattern: (V*C)
• +: one or more repetitions of a given pattern: (C)*((V)+(C)+)+(V)*
• Statements/commands
– Rule-based statements
• Single rule: If (*V*L) then ed -> NIL (remove ed)
• Multiple rules:
– Select the rule with the longest matching suffix {
sses -> ss;
ies -> i;
ss -> ss;
s -> NIL
}
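The conventions above can be tried out in a short sketch, which is not the full Porter algorithm. It maps a word to its C/V form, applies the conditional "ed" rule (read here as "remove ed only if the remaining stem contains a vowel"), and picks the rule with the longest matching suffix. All function names are illustrative, and treating "y" as a vowel only after a consonant is an assumption borrowed from Porter's usual definition.

```python
import re

def cv_form(word):
    """Map each letter to 'V' (vowel) or 'C' (consonant).

    Assumption: 'y' counts as a vowel only when it follows a consonant.
    """
    form = []
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou" or (ch == "y" and i > 0 and form[i - 1] == "C"):
            form.append("V")
        else:
            form.append("C")
    return "".join(form)

def matches_vc_pattern(word):
    """Check the word against the pattern (C)*((V)+(C)+)+(V)*."""
    return re.fullmatch(r"C*(V+C+)+V*", cv_form(word)) is not None

def remove_ed(word):
    """Conditional rule: ed -> NIL, but only if the stem contains a vowel."""
    stem = word[:-2]
    if word.endswith("ed") and "V" in cv_form(stem):
        return stem
    return word

# The multiple-rule block: among matching suffixes, the longest one wins.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_longest(word, rules=RULES):
    """Apply the matching rule with the longest suffix, if any."""
    matching = [(s, r) for s, r in rules if word.endswith(s)]
    if not matching:
        return word
    suffix, replacement = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + replacement

print(cv_form("connect"), matches_vc_pattern("connect"))
print(remove_ed("played"), remove_ed("bled"))
print([apply_longest(w) for w in ["classes", "ponies", "class", "cats"]])
```

Note how the rule ss -> ss protects "class": without it, the shorter rule s -> NIL would strip the final letter.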
Try Porter Algorithm
• Played
• Classes
• Policy
• Position
• Capability
• Active, actively, activity
The Porter Algorithm: advantages &
disadvantages
• Advantage: Easy algorithm with good
results
– abate, abated, abatement, abatements, abates --> abat
• Disadvantage: Not always correct, e.g.
– Same root for police / policy, execute / executive,
…
– Different root for european / europe, search /
searcher
Next Lecture:
• Compression. Ch. 7