WXGB6106/WXGB5009 : INFORMATION RETRIEVAL


WMES3103 : INFORMATION RETRIEVAL
TEXT OPERATIONS
INTRODUCTION
Not all words in a document are equally significant for representing its contents/meaning
Some words carry more meaning than others
Nouns or groups of nouns = most representative of a document's content
Therefore, need to preprocess the text of the documents in a collection to determine the terms to be used as index terms
Using the set of all words in a collection to index documents = too much noise for the retrieval task
Reduce noise = reduce the number of words which can be used to refer to the document
Preprocessing = process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms
Preprocessing is expected to improve information retrieval performance
However, some search engines on the Web omit preprocessing
In that case, every word in the document is an index term
This is supposed to make the retrieval task simpler and easier for the user
DOCUMENT PREPROCESSING
Text operations = text transformations
5 main operations :
a. Lexical analysis of the text - treatment of digits, hyphens, punctuation marks, and the case of letters
b. Elimination of stopwords - filter out words which are not useful in the retrieval process
c. Stemming of the remaining words - remove affixes (prefixes and suffixes)
d. Selection of index terms - choose words/stems (or groups of words) to be used as indexing terms
e. Construction of term categorization structures such as a thesaurus, or extraction of structure directly represented in the text, to allow expansion of the original query with related terms
a - d = production of a set of good index terms
e = building of categorization hierarchies to capture relationships among terms
LEXICAL ANALYSIS OF TEXT
Convert the text of the documents into words to be adopted as index terms
Objective - identify words in the text
Four cases need care : digits, hyphens, punctuation marks, case of letters
Digits - numbers alone are usually not good index terms because they are vague without context (e.g. 1910, 1999), but some are specific enough to keep (e.g. 510 B.C.)
Hyphens - usually break up hyphenated words (e.g. state-of-the-art = state of the art), but some words require their hyphens (e.g. gilt-edged, B-49)
Punctuation marks - remove entirely unless significant (e.g. in program code, x.id and xid are different)
Case of letters - usually not important, so convert all text to either upper or lower case
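The four cases above can be sketched as a small tokenizer. This is a minimal illustration, not a method given in the slides: the function name `tokenize`, the exception set `KEEP_HYPHENATED`, and the `drop_numbers` flag are all assumptions made for the example. It also treats every punctuation mark as a separator, so it does not handle the program-code case (x.id vs. xid).

```python
import re

# Hypothetical exception list: hyphenated forms that should stay whole.
KEEP_HYPHENATED = {"gilt-edged", "b-49"}

# A token is a run of letters/digits, optionally joined by single hyphens.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text, drop_numbers=True):
    """Lowercase the text, discard punctuation, split hyphenated words
    (unless listed as exceptions) and optionally drop pure numbers."""
    tokens = []
    for match in TOKEN_RE.finditer(text.lower()):
        word = match.group()
        if "-" in word and word not in KEEP_HYPHENATED:
            parts = word.split("-")   # state-of-the-art -> state, of, the, art
        else:
            parts = [word]
        for part in parts:
            if drop_numbers and part.isdigit():
                continue              # e.g. 1910, 1999
            tokens.append(part)
    return tokens


print(tokenize("State-of-the-art design; gilt-edged B-49, 1999."))
# ['state', 'of', 'the', 'art', 'design', 'gilt-edged', 'b-49']
```

A real lexical analyser would need a rule for numbers worth keeping (such as 510 B.C.); here that decision is reduced to a single flag.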
ELIMINATION OF STOPWORDS
A word which occurs in 80% of the documents in a collection = useless for retrieval = stopword, filtered out as a potential index term (e.g. articles, prepositions, conjunctions)
Reduces size of indexing structure
Indexing structure can be compressed by 40%
Some verbs, adverbs and adjectives can also be treated as stopwords
A list of 425 stopwords is given in W.B. Frakes and R. Baeza-Yates. Information retrieval : data structures & algorithms. Englewood Cliffs : Prentice Hall, 1992.
Programs in C for lexical analysis are also provided there
Elimination of stopwords might reduce recall (e.g. "To be or not to be" - all words eliminated except "be", so retrieval returns nothing or irrelevant documents)
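The 80% rule and the recall problem can both be shown in a few lines. This is a sketch under assumptions: the helper names `find_stopwords` and `remove_stopwords`, the toy documents, and the three-word stoplist are invented for illustration and are not the 425-word list cited above.

```python
def find_stopwords(docs, threshold=0.8):
    """Return the words that occur in at least `threshold` of the documents."""
    doc_freq = {}
    for doc in docs:
        for word in set(doc.lower().split()):   # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w for w, df in doc_freq.items() if df / len(docs) >= threshold}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stopwords."""
    return [t for t in tokens if t not in stopwords]


# Toy collection (made up for this example).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
    "the fish swam in the pond",
    "the sun set over the hill",
]
print(find_stopwords(docs))   # {'the'}

# Recall problem: with a small illustrative stoplist, only "be" survives.
query = "to be or not to be".split()
print(remove_stopwords(query, {"to", "or", "not"}))   # ['be', 'be']
```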
STEMMING
Stem = the portion of a word which is left after the removal of its affixes (i.e. prefixes and suffixes)
Reduces variants of the same root to a
common concept
Reduces size of indexing structure because
number of distinct index terms is reduced
Many Web search engines do not use
stemming
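A toy suffix stripper shows how variants of one root collapse to a single index term. This is a deliberately naive sketch, not the Porter algorithm or any stemmer named in the slides: the suffix list, the `min_stem` length guard, and the function name `naive_stem` are assumptions, and prefixes are not handled at all.

```python
# Illustrative suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ions", "ing", "ion", "ed", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


variants = ["connect", "connected", "connecting", "connection", "connections"]
print({naive_stem(w) for w in variants})   # {'connect'}
```

Five distinct words become one index term, which is where the reduction in index size comes from. A rule set this crude will also conflate unrelated words, which is one reason production stemmers use more careful rules.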
INDEX TERM SELECTION
If a full text representation of the text is adopted, then all words in the text are used as index terms = full text indexing
Otherwise, need to select the words to be used as index terms
Not all words will be selected
Bibliographic sciences - selection is done by a specialist
Alternative method is automatic selection
THESAURI
Consists of :
a. a precompiled list of important words in a given discipline
b. for each word, a set of related words
Words and concepts
Aim :
a. to provide a standard vocabulary for indexing and searching
b. to assist users with locating terms for proper query formulation
c. to provide classified hierarchies that allow the broadening and narrowing of the current request according to user needs
Main components of a thesaurus - index terms, relationships among terms (BT = broader term, NT = narrower term, RT = related term) and a layout design for the term relationships, sometimes a definition or explanation (e.g. seal (animal) and seal (document))
Controlled vocabulary for indexing and searching - useful for an established body of knowledge with established terms
Web - thesaurus or free-text searching?
e.g. Yahoo - presents the user with a term classification hierarchy that reduces the space to be searched
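One way to picture the BT/NT/RT structure is as a small lookup table used for query expansion. Everything in this sketch is hypothetical: the `Entry` class, the `expand_query` helper, and the sample terms around the two senses of "seal" are invented for illustration and do not come from any published thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT
    note: str = ""                               # optional definition or explanation


# Illustrative entries only; the qualifier in parentheses separates the senses.
thesaurus = {
    "seal (animal)": Entry(
        broader={"marine mammals"},
        narrower={"harp seal"},
        related={"sea lion"},
        note="the animal",
    ),
    "seal (document)": Entry(
        broader={"authentication marks"},
        related={"signature"},
        note="a mark placed on a document",
    ),
}


def expand_query(terms, thesaurus, relations=("related",)):
    """Append thesaurus terms reachable through the chosen relations."""
    expanded = list(terms)
    for term in terms:
        entry = thesaurus.get(term)
        if entry is None:
            continue                  # term not in the controlled vocabulary
        for relation in relations:
            for other in sorted(getattr(entry, relation)):
                if other not in expanded:
                    expanded.append(other)
    return expanded


# Broaden the request by following BT links.
print(expand_query(["seal (animal)"], thesaurus, relations=("broader",)))
# ['seal (animal)', 'marine mammals']
```

Passing `relations=("narrower",)` narrows the request instead, which corresponds to the third aim listed above.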
OTHERS
Document clustering - group similar or related documents into classes; an operation on all documents in the collection, not an operation on the text of a single document
Text compression - ways to represent the data in fewer bits and bytes; greatly reduces the amount of space needed to store text on computers and the time needed to transmit it; the original text can be reconstructed from the compressed text
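The compress-then-reconstruct round trip can be demonstrated with Python's standard `zlib` module. The slides do not name a compression method, so the choice of zlib (DEFLATE) and the repeated sample sentence are assumptions made purely for illustration.

```python
import zlib

# Repetitive sample text (made up); natural-language text also compresses well.
text = ("Information retrieval systems index large text collections. " * 50).encode("utf-8")

compressed = zlib.compress(text)        # fewer bytes to store or transmit
restored = zlib.decompress(compressed)  # lossless: the original comes back exactly

print(len(text), "bytes before,", len(compressed), "bytes after")
print("round trip ok:", restored == text)
```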