Indexing
• Overview
• Approaches to indexing
• Automatic indexing
• Information extraction
Overview
Indexing: the transformation of documents into searchable data structures.
• May be manual or automatic.
• Creates the basis for direct search, or for search through index files.
• Historically performed by professional indexers associated with library organizations.
• A critical process: a user's ability to find documents on a particular subject is limited by the indexer creating index terms for that subject.
• Initial computerization still relied on human indexers, but encouraged using more index terms (index cards no longer being required for each index term).
Changes in Objectives of Indexing Due to Full Text Availability
• Indexing defines the major concepts of the source documents.
• The use of a controlled vocabulary (the domain of the index) helps standardize the choice of terms.
• Controlled vocabularies slow the indexing process, but aid users because they know the domain the indexer had to use.
• With the availability of full text, the need for manual indexing is diminishing:
  – Source information (citation data) can easily be extracted.
  – Every word of a document (after appropriate normalization) may be used as a term.
  – Thesauri compensate for the lack of controlled vocabularies.
• Hence, the importance of manual indexing shifts to its ability to:
  – Perform abstractions and determine additional related terms.
  – Judge the value of the information (e.g., more difficult to “cheat”).
Approaches: Scope
• Exhaustivity: the extent to which concepts are indexed.
  – Should we index only the most important concepts, or also more minor concepts?
  – In a 10-page document, should a 2-sentence discussion of some subject be indexed?
• Specificity: the preciseness of the index terms used.
  – Should we use general indexing terms or more specific terms?
  – Should we use the term “computer”, “personal computer”, or “IBM Aptiva Model M61”?
• Main effect:
  – Low exhaustivity has an adverse effect on recall.
  – Low specificity has an adverse effect on precision.
• Related issues:
  – Index the title and abstract only, or the entire document?
  – Should index terms be weighted?
Approaches: Pre-coordination
• Post-coordination: when a query uses a set of terms linked by AND, it is the query that links these terms together.
• Pre-coordination: links among terms are specified in the index. Pre-coordination improves retrieval for post-coordinated queries.
• Example: a document discusses the drilling of oil wells in Mexico by CITGO and the introduction of oil refineries in Peru by the U.S.
1. No pre-coordination of terms:
   • oil, wells, Mexico, CITGO, refineries, Peru, U.S.
   • Document retrieved if a query links “oil”, “Mexico” and “Peru”.
2. Simple pre-coordination:
   • (oil wells, Mexico, CITGO)
   • (oil refineries, Peru, U.S.)
   • Document not retrieved if a query links “oil”, “Mexico” and “Peru”.
Example (cont.)
3. Pre-coordination with position indicating role:
   • (CITGO, drill, oil wells, Mexico)
   • (U.S., introduce, oil refineries, Peru)
   This discriminates which country introduces refineries into the other country.
4. Pre-coordination with modifier indicating role (see the sketch after this example):
   • (Subject: CITGO, Action: drill, Object: oil wells, Modifier: in Mexico)
   • (Subject: U.S., Action: introduce, Object: oil refineries, Modifier: in Peru)
   If the document discussed the U.S. introducing refineries in Peru, Bolivia, and Argentina, one entry is used with three Modifier fields.
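
To make variant 4 concrete, here is a minimal sketch of how role-based pre-coordinated entries could be stored and matched against a post-coordinated query; the entry structure and the matches helper are illustrative assumptions, not taken from any particular system.

```python
# Sketch: role-based pre-coordinated index entries (illustrative structure).
# Each entry ties a subject, action and object together, with one or more
# location modifiers, so "oil" in Mexico cannot be confused with "oil" in Peru.
entries = [
    {"subject": "CITGO", "action": "drill", "object": "oil wells",
     "modifiers": ["in Mexico"]},
    {"subject": "U.S.", "action": "introduce", "object": "oil refineries",
     "modifiers": ["in Peru"]},
]

def matches(entry, terms):
    """True if every query term appears within this single entry."""
    text = " ".join([entry["subject"], entry["action"], entry["object"],
                     *entry["modifiers"]]).lower()
    return all(term.lower() in text for term in terms)

# Post-coordinated query "oil AND Mexico AND Peru": no single entry
# contains all three terms, so the document is (correctly) not retrieved.
print(any(matches(e, ["oil", "Mexico", "Peru"]) for e in entries))  # False

# "oil AND Mexico" matches the first entry, so the document is retrieved.
print(any(matches(e, ["oil", "Mexico"]) for e in entries))          # True
```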
Automatic Indexing
• The system automatically determines the index terms assigned to documents.
• Relative advantages
  – Human indexing:
    • Ability to determine concept abstractions.
    • Ability to judge the value of concepts.
  – Automatic indexing:
    • Reduced cost: once the initial hardware cost is amortized, operational cost is cheaper than compensation for human indexers.
    • Reduced processing time: at most a few seconds vs. at least a few minutes.
    • Improved consistency: algorithms select index terms much more consistently than humans.
Weighted and Unweighted Indexes
• Unweighted indexing:
  – No attempt is made to determine the value of the different terms assigned to a document. It is not possible to distinguish between major topics and casual references.
  – All retrieved documents are equal in value.
  – Typical of commercial systems through the 1980s.
• Weighted indexing:
  – An attempt is made to place a value on each term as a description of the document.
  – This value is related to the frequency of occurrence of the term in the document (higher is better), but also to the number of collection documents that use this term (lower is better); see the sketch after this list.
  – Query weights and document weights are combined into a value describing the likelihood that a document matches a query, and a threshold value limits the number of documents returned.
  – Typically used only with automatic indexing.
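
The weighting rule just described (higher in-document frequency is better, lower collection frequency is better) is essentially tf-idf. A minimal sketch, assuming a simple tf·idf formula; real systems may normalize differently, and the toy collection is invented.

```python
# Sketch of tf-idf style term weighting: term frequency in the document
# (higher is better) times inverse document frequency in the collection
# (lower collection frequency is better).
import math
from collections import Counter

docs = [
    "oil wells drilled in mexico".split(),
    "oil refineries introduced in peru".split(),
    "coffee exports from peru".split(),
]

def weights(doc, collection):
    n = len(collection)
    tf = Counter(doc)
    out = {}
    for term, freq in tf.items():
        df = sum(1 for d in collection if term in d)        # collection frequency
        out[term] = (freq / len(doc)) * math.log(n / df)    # tf * idf
    return out

for d in docs:
    print(weights(d, docs))
```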
Automatic Indexing by Term and by Concept
• Indexing by term: the item is represented by terms extracted from the item.
  – The vector model
  – The Bayesian model
  – Natural language processing
• Indexing by concept: the document is represented by concepts not necessarily used in the document.
Indexing by Term: The Vector Model
• The SMART system, developed by Salton at Cornell University.
  – Each document is stored as a vector of weights.
  – Each vector position represents a term in the database domain (the dimension of these vectors is the size of the vocabulary).
  – The query is represented by a similar vector.
  – The search involves calculating the vector distance between the query vector and each document vector, as sketched below.
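
A minimal sketch of this search step, with cosine similarity standing in for the vector-distance computation (the slide does not specify the exact measure); the vocabulary and weights are invented for illustration.

```python
# Sketch: documents and the query as weight vectors over a fixed vocabulary,
# ranked by cosine similarity (one common choice of "vector distance").
import math

vocabulary = ["oil", "wells", "refineries", "mexico", "peru"]

doc_vectors = {
    "doc1": [0.8, 0.6, 0.0, 0.5, 0.0],   # illustrative weights
    "doc2": [0.7, 0.0, 0.6, 0.0, 0.5],
}
query_vector = [1.0, 0.0, 0.0, 1.0, 0.0]  # query: "oil mexico"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

ranked = sorted(doc_vectors.items(),
                key=lambda kv: cosine(query_vector, kv[1]), reverse=True)
print(ranked)  # doc1 ranks above doc2 for this query
```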
Indexing by Term: The Bayesian Model
• Bayes' rule of conditional probability:
  – P(A|B) = P(A,B)/P(B) = P(A)·P(B|A)/P(B)
• Bayesian methods can be used to determine the processing tokens and their weights.
• Principle: calculate the (posterior) probability that a given document contains concept C, given the presence of features (words) F1, …, Fm in the document (see the sketch after this list).
• To calculate this probability we need to know:
  – The prior probability that the document is relevant to the concept C.
  – The conditional probability that the features Fi are present in a document, given that the document is relevant to the concept C.
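
A small numerical sketch of this posterior computation, assuming independence between the features (a naive-Bayes assumption added here, not stated above); all probabilities are made-up illustration values.

```python
# Sketch: P(C | F1..Fm) via Bayes' rule with a naive independence assumption
# over the features; prior and conditional probabilities are illustrative.

prior_c = 0.1                      # P(C): prior that a document is about concept C
p_feature_given_c    = {"oil": 0.8, "refinery": 0.6}   # P(Fi | C)
p_feature_given_notc = {"oil": 0.2, "refinery": 0.1}   # P(Fi | not C)

def posterior(features):
    """Posterior probability of concept C given the observed features."""
    like_c, like_notc = prior_c, 1.0 - prior_c
    for f in features:
        like_c    *= p_feature_given_c[f]
        like_notc *= p_feature_given_notc[f]
    return like_c / (like_c + like_notc)   # normalize by P(F1..Fm)

print(posterior(["oil", "refinery"]))      # ~0.73 for these numbers
```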
Indexing by Term: Natural Language Processing
• The DR-LINK system.
  – Enhances indexing by using semantic information (in addition to statistical information).
  – Processes the language, rather than treating each word as an independent entity.
  – Processes documents at different levels: morphological, lexical, semantic, syntactic, and discourse (beyond the sentence).
Indexing by Concept
• There are many ways to represent the same idea, and increased retrieval performance comes from using a single representation.
• Hence, a single canonical set of concepts is determined and is used for indexing all documents.
• The MatchPlus system (see the sketch after this list):
  – A set of n features (concepts) is selected.
  – For each word stem, a context vector of dimension n is built, describing how strongly the stem reflects each feature.
  – The context vectors for the word stems are combined with a weighted sum to create a single context vector for the entire document.
  – This vector represents the document in terms of the concepts.
  – Queries go through the same analysis to determine their vector representations.
  – During search, the query vector is compared to the document vectors.
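
A minimal sketch of this construction of a document context vector from per-stem context vectors via a weighted sum (here simple term counts serve as weights); the features, stems, and vector values are invented and do not come from MatchPlus.

```python
# Sketch: combine per-stem context vectors into one document context vector
# by a weighted sum (weights here are term counts); all vectors are invented.
from collections import Counter

N_FEATURES = 3                      # n concept features

# Context vector per word stem: how strongly the stem reflects each feature.
context_vectors = {
    "oil":    [0.9, 0.1, 0.0],
    "refin":  [0.7, 0.2, 0.1],
    "peru":   [0.0, 0.1, 0.9],
}

def document_vector(stems):
    counts = Counter(stems)
    vec = [0.0] * N_FEATURES
    for stem, count in counts.items():
        cv = context_vectors.get(stem)
        if cv is None:
            continue                 # unknown stems contribute nothing here
        for i in range(N_FEATURES):
            vec[i] += count * cv[i]  # weighted sum of stem context vectors
    return vec

doc_stems = ["oil", "refin", "refin", "peru"]
print(document_vector(doc_stems))    # the document represented in concept space
```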
Information Extraction
• Two processes related to indexing:
  – Extraction of facts (e.g., when building indexes automatically).
  – Document summarization.
• Extraction of facts into a database:
  – Extract specific types of information using extraction criteria (whereas indexing attempts to represent the entire document).
  – Recall now refers to how much information was extracted from a document (vs. how much should have been extracted); see the sketch after this list.
  – Precision now refers to the proportion of the extracted information which is accurate.
  – Experiments show that automatic extraction performs much worse than human extraction (55% precision and recall vs. about 80%), but operates about 20 times faster.
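
The redefined recall and precision can be computed directly from the set of extracted facts and the set of facts that should have been extracted; a minimal sketch with invented fact sets:

```python
# Sketch: recall and precision for fact extraction, per the definitions above.
extracted = {"CITGO drills oil wells", "oil wells in Mexico", "refineries in Chile"}
expected  = {"CITGO drills oil wells", "oil wells in Mexico", "refineries in Peru",
             "U.S. introduces refineries"}

correct = extracted & expected
recall = len(correct) / len(expected)      # how much of what should be extracted was
precision = len(correct) / len(extracted)  # how much of what was extracted is accurate
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.50 precision=0.67
```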
Information Extraction (cont.)
• Document summarization:
  – Extract the most important ideas, while reducing the size significantly.
  – Example: the abstract of a document.
  – “True summarization” is not feasible.
  – Instead, most summarization techniques extract the “most significant” subsets (e.g., sentences) and concatenate them.
  – Each sentence is assigned a score, and the highest scoring sentences are extracted.
  – No guarantee of a coherent narrative.
  – Heuristic algorithms, with no overall theory; for example (see the sketch after this list):
    • Consider sentences over 5 words in length.
    • Look for “cues”; e.g., “in conclusion”.
    • Focus on the first 10 and last 5 paragraphs.
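
A small sketch of such a heuristic, extractive summarizer: score sentences with the cues listed above (length, cue phrases, position) and keep the top scorers in their original order; the scoring weights are arbitrary illustration choices, not from the source.

```python
# Sketch: extractive summarization with the heuristics listed above.
# Scoring weights are arbitrary illustration values.
import re

CUES = ("in conclusion", "in summary", "importantly")

def summarize(text, k=2):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for pos, sent in enumerate(sentences):
        words = sent.split()
        if len(words) <= 5:                 # ignore very short sentences
            continue
        score = 0.0
        if any(cue in sent.lower() for cue in CUES):
            score += 2.0                    # cue phrase bonus
        if pos == 0 or pos == len(sentences) - 1:
            score += 1.0                    # position bonus (first / last sentence)
        score += min(len(words), 25) / 25   # mild preference for longer sentences
        scored.append((score, pos, sent))
    top = sorted(scored, reverse=True)[:k]
    # Concatenate the selected sentences in their original document order.
    return " ".join(sent for _, pos, sent in sorted(top, key=lambda t: t[1]))

text = ("Indexing turns documents into searchable structures. "
        "Manual indexing is slow but captures abstract concepts well. "
        "Automatic indexing is cheaper and far more consistent than people. "
        "In conclusion, full text availability is shifting indexing toward automation.")
print(summarize(text))
```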