Subject Analysis - University of the Philippines Mindanao


SUBJECT ANALYSIS AND
REPRESENTATION
Presented by
GARRY L. BASTIDA
INTRODUCTION/REVIEW
One of the major functions of an
information retrieval system is to match
the contents of documents with users'
queries.
The system personnel have to prepare a
surrogate for every document, and all such
surrogates must be maintained in an
organized manner (indexing).
INTRODUCTION/REVIEW
TASK: analyze the content of the given document and
represent this analysis by some content identifiers or
keywords.
Lancaster: indexing involves two quite distinct
steps: conceptual analysis and representation.
In subject classification, the basic objective of which is
to arrange documents according to their subject
contents, the result of the conceptual analysis is
represented by some artificial language or notational
symbol.
Subject
Analysis
What’s it all about,
Garry?
What is it?
Subject analysis
• Examination of a bibliographic item by a trained
subject specialist to determine the most specific
subject heading(s) or descriptor(s) that fully describe
its content, to serve in the bibliographic record as
access points in a subject search of a library catalog,
index, abstracting service, or bibliographic
database. When no applicable subject heading can
be found in the existing headings list or thesaurus of
indexing terms, a new one must be created.
What is it?
It means the presence, identification
and expression of subject matter in
document texts, databases, controlled
and natural languages, information
requests and search strategies.
Say what?
Why do all that?
If we don’t, we can’t find stuff!
• “Subject analysis is [essentially] all methods
and processes which can be described as
representation for retrieval of information by
its subjects, be they names, geographic
locations, or topical subjects.”
• Quoted from Williamson, N. J. (1997). The Importance of
Subject Analysis in Library and Information Science
Education. Technical Services Quarterly 15(1/2): 67-87,
by Pamela Hill in LS 500 Organization of Information,
Tuesday, February 24, 2004.
Why use a standardized list?
Why Subject Headings?
• Subject headings often indicate the contents of books in
terms that their titles, which are often nondescriptive or
very general, do not use. Subject headings in online
databases are often referred to as descriptors, but they
serve the same purpose in locating valuable resources.
• Along with their subdivisions, subject headings provide a
clear and systematic way of scanning the catalog for what is
needed. Assigned headings are usually the dominant, and
most important, subjects of a given item.
• Subject headings bring like materials together, reducing
reliance on the wide variety of synonymous terms that may
be used to describe a single concept (teen, youth, adolescent,
young adult, etc.).
• Using Subject Headings in PantherCat
BS 6529 factors in choosing
subject of document
Does the document deal with a specific
product, condition or phenomenon?
Does the subject contain an action
concept, an operation or a process?
Is the object or patient affected by the
action identified?
Does the document deal with the agent of
this action?
BS 6529 factors in choosing
subject of document
Does it refer to a particular means for
accomplishing the action?
Were these factors considered in the
context of a particular location or
environment?
Are any independent or dependent
variables identified?
Was the subject considered from a
special viewpoint not normally associated
with that field of study?
SUBJECT INDEXING
Subject indexing is the act of describing a document by index terms to
indicate what the document is about or to summarize
its content. Indexes are constructed, separately, on
three distinct levels: terms in a document such as a
book; objects in a collection such as a library; and
documents (such as books and articles) within a field of
knowledge.
Subject indexing systems have been classified broadly
as pre-coordinate and post-coordinate systems. The
major objective of any indexing system is to represent
the contents of documents through keywords or
descriptors.
Exhaustivity and Specificity
An exhaustive index is one which lists all possible index terms.
Greater exhaustivity gives higher recall, that is, a greater likelihood
that all the relevant articles are retrieved; however, this comes at
the expense of precision. The user may retrieve a
larger number of irrelevant documents, or documents which
deal with the subject only in little depth. In a manual system, greater
exhaustivity also brings greater cost, as more man-hours
are required.
Specificity describes how closely the index terms match the
topics they represent. An index is said to be specific if the
indexer uses descriptors that parallel the concepts of the document
and reflect those concepts precisely.
Recall vs Precision
\[
\text{Recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Number of relevant documents in the collection}}
\]
\[
\text{Precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of documents retrieved}}
\]
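The two ratios can be tried out on a small example. The sketch below is illustrative only: the function name and the document numbers are hypothetical, and it assumes the relevant documents for a query are already known.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a single search.

    retrieved: document numbers returned by the search
    relevant:  document numbers in the collection that are relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search: 4 relevant documents exist, 6 documents are retrieved.
relevant_docs = {1, 2, 3, 4}
retrieved_docs = {2, 3, 4, 5, 6, 7}
recall, precision = recall_and_precision(retrieved_docs, relevant_docs)
print(f"Recall = {recall:.2f}, Precision = {precision:.2f}")
```

Here three of the four relevant documents are found (recall 0.75), but only three of the six retrieved documents are relevant (precision 0.50).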
Manual indexing
Analysis of subject
Identification of keywords
Standardization of keywords
Choice of an indexing system
• If the chosen system is a post-coordinate
one, then:
Preparation of entries under each term with reference
to the document identification number.
Preparation of reference entries.
Manual indexing
• If the chosen system is a pre-coordinate
one, then:
Preparation of an entry (main entry) using all the
keywords organized in a way prescribed by the system.
Preparation of index entries by using each significant
term as an entry element and the full entry (main entry)
as the context, or by rotation/permutation of the
significant terms in the main entry according to the
rules prescribed by the system chosen.
Preparation of reference entries.
Filing of entries.
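The pre-coordinate steps above can be sketched in a few lines. This is a minimal illustration of one option only (each significant term used as the entry element, with the full main entry as context); it does not follow the rules of any particular system. The function name, the " : " separator, the sample keywords and the document number are all assumptions.

```python
def index_entries(main_entry, doc_id):
    """Build one index entry per significant term of the main entry.

    main_entry: keywords in the order prescribed by the (assumed) system
    doc_id:     document identification number
    Returns (lead term, context, doc_id) tuples in filing (alphabetical) order.
    """
    context = " : ".join(main_entry)  # the full main entry serves as context
    entries = [(term, context, doc_id) for term in main_entry]
    return sorted(entries, key=lambda entry: entry[0].lower())


# Hypothetical main entry for document 101.
for lead, context, doc_id in index_entries(
    ["Libraries", "Cataloguing", "Subject headings"], 101
):
    print(f"{lead}\n    {context}  [{doc_id}]")
```

Sorting the entries by lead term stands in for the filing step.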
STEPS IN
MANUAL
INDEXING
SYSTEM
Pre-coordinate indexing system
Chain indexing
Dr. S.R. Ranganathan developed chain indexing, a
method of pre-coordinate indexing. It
attempts to represent, in natural
language, the chain of concepts that
constitutes a subject.
Pre-coordinate indexing system
Basic steps in chain indexing may be
represented as follows:
• Take the class number prepared for the given
document.
• Consult the corresponding classification schedule
and write the notation at each step and the
corresponding term or phrase (from the
schedule). This will produce a chain of concepts
from the general to the specific.
Basic steps in chain indexing may be represented as
follows:
Identify the sought, unsought, and false
links. Sought links denote the concepts
that the user is likely to use as access
points; unsought links are those that are
not likely to be used as access points, and
false links are those that really do not
represent any valid concepts.
Invert the chain, and this will generate
the index entries.
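The inversion step can be illustrated with a small sketch. It is a simplified reading of the steps above, not Ranganathan's full rules: every sought link becomes an entry, qualified by the sought links above it, read from the specific back to the general. The notations and terms in the sample chain are invented for illustration and do not come from any real classification schedule.

```python
from typing import NamedTuple


class Link(NamedTuple):
    notation: str  # notation at this step of the class number
    term: str      # corresponding term from the schedule
    kind: str      # "sought", "unsought" or "false"


def chain_index_entries(chain):
    """Invert a general-to-specific chain into index entries.

    Each sought link yields one entry; unsought and false links
    are skipped both as entry points and as context.
    """
    entries = []
    for i in range(len(chain) - 1, -1, -1):  # start at the most specific link
        link = chain[i]
        if link.kind != "sought":
            continue
        upper = [l.term for l in reversed(chain[:i]) if l.kind == "sought"]
        entries.append((", ".join([link.term] + upper), link.notation))
    return entries


# Hypothetical chain (invented notation), general to specific.
chain = [
    Link("L", "Literature", "sought"),
    Link("L1", "English", "sought"),
    Link("L1,", "", "false"),              # connecting symbol, no concept
    Link("L1,2", "Drama", "sought"),
    Link("L1,2x", "Collections", "unsought"),
]
for heading, notation in chain_index_entries(chain):
    print(f"{heading}  {notation}")
```

With this chain the entries come out as "Drama, English, Literature", "English, Literature" and "Literature", each followed by its notation.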
Pre-coordinate indexing system
Relational indexing
• J.E.L. Farradane devised this scheme. The system
was first developed in the early 1950s and has
been modified several times since then. The
latest changes may be noted from Farradane’s
own papers that appeared in 1980. According to
Farradane, any subject can be represented by
identifying the relationship between each pair of
its constituent concepts and expressing it in the form of what
he called analets (pairs of terms linked by an
operator). He suggested that
any possible relationship can be represented by
one of nine relational operators.
Pre-coordinate indexing system
PRECIS – PREserved Context Index System.
• Developed by Derek Austin; it first came out in 1974.
Major tasks:
• Analysing the document concerned and identifying key concepts.
• Organizing the concepts into a subject statement based on the principle of
context dependency.
• Assigning codes (operators) which signify the syntactical function of each
term.
• Deciding which terms should be the access points and which terms would
be in other positions in the index entries, and assigning further codes to
achieve these results.
• Adding further prepositions, auxiliaries or phrases which would result in
clarity and expressiveness of the resulting index entries.
• Making supporting reference entries from semantically related terms taken
from a thesaurus.
Pre – coordinate indexing system
POPSI, Postulated – based Permuted Subject
Indexing
 Developed by Bhattacharyya. It uses the anyticosynthetic method for string formulation and permutation
of the constituent terms in order to satisfy different
approach points to the document.
 There are two parts- the lead heading, which contains the
index term or the access term, the context heading, which
generally appears in the line following the lead heading
and contains the subject words, with auxiliary words,
denoting the context in which the lead term has been
discussed in the given document.
Rules that govern POPSI
A manifestation of property follows immediately the
manifestation in relation to which it is a property.
A manifestation of action follows immediately the
manifestation in relation to which it is an action.
Property and action can have another property and/or
action directly related.
A species or part follows immediately the
manifestation in relation to which it is a species or part;
part is used to denote the whole-part relationship.
A modifier follows immediately the manifestation in
relation to which it is a modifier.
Post-coordinate indexing system
Uniterm
• Developed by Mortimer Taube in 1953. A card is
prepared for each term that is considered to be an
appropriate index term for a given document, and the
document's number is recorded on that card. Searching relies on
the ability of the searcher to notice matching numbers on
the cards that are retrieved.
Optical coincidence/peek-a-boo cards
• Developed to overcome the problem of manual searching.
Each card is divided into small
numbered squares, each square bearing a specific number,
and a document number is punched at the appropriate
position on the card.
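The post-coordinate idea (one card per term, coordinated only at search time) can be sketched as follows. This is an illustration, not a description of Taube's actual card layout: the term cards are modelled as sets of document numbers, and matching numbers across cards are found by set intersection. The function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_term_cards(documents):
    """Prepare one 'card' per index term, listing document numbers.

    documents: mapping of document number -> list of index terms
    """
    cards = defaultdict(set)
    for doc_number, terms in documents.items():
        for term in terms:
            cards[term].add(doc_number)
    return cards


def coordinate_search(cards, *terms):
    """Return the document numbers that appear on every requested card."""
    pulled = [cards.get(term, set()) for term in terms]
    if not pulled:
        return []
    return sorted(set.intersection(*pulled))


# Hypothetical collection of three documents.
documents = {
    101: ["libraries", "indexing"],
    102: ["indexing", "computers"],
    103: ["libraries", "computers", "indexing"],
}
cards = build_term_cards(documents)
print(coordinate_search(cards, "libraries", "indexing"))
print(coordinate_search(cards, "libraries", "computers"))
```

On peek-a-boo cards the same intersection is done optically: the cards are stacked and light passes only through positions punched on all of them.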
PROBLEMS OF MANUAL
INDEXING
Salton, and Salton and McGill, point out two major
shortcomings:
• It is not quite clear that all the complexities and
refinements, exemplified by the categorization of
terms and assignment of relations between terms,
are really beneficial.
• Even if the indexing process is carried out
accurately, and at the right level of detail, it is not
possible to maintain consistency, since more than
one indexer will be needed in practice.
Theory of indexing
1st level: concordance, which consists of references to all words
in the original text arranged in alphabetical order.
2nd level: information-theoretical level, which calculates the
likelihood of a word being chosen for indexing based on its
frequency of occurrence in a given text document.
3rd level: linguistic level, which attempts to explain how
meaningful words are extracted from large units of text.
4th level: textual or skeletal framework; the text is prepared by
the author in an organized manner and held together by a
skeletal structure.
5th level: inferential level; an indexer should be able to make
inferences about the relationships between words and phrases by
observing the sentence and paragraph structure, and by
stripping the sentence of extraneous details.
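The first two levels lend themselves to a short sketch. The code below is a simplified illustration: the concordance records word positions in alphabetical order, and the frequency count is only the raw input from which a likelihood of selection could be calculated (the calculation itself is not shown). The function names, the tokenising pattern and the sample text are assumptions.

```python
import re
from collections import Counter, defaultdict

WORD = re.compile(r"[a-z']+")  # assumed, deliberately simple tokeniser


def concordance(text):
    """1st level: every word, alphabetically, with its positions in the text."""
    positions = defaultdict(list)
    for position, word in enumerate(WORD.findall(text.lower())):
        positions[word].append(position)
    return dict(sorted(positions.items()))


def word_frequencies(text):
    """2nd level (input only): frequency of occurrence of each word."""
    return Counter(WORD.findall(text.lower()))


sample = "Indexing describes a document. A document has a subject."
for word, where in concordance(sample).items():
    print(f"{word}: {where}")
print(word_frequencies(sample).most_common(2))
```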
Fugmann proposes a theory based
on axioms
Axiom of definability: compiling information
relevant to a topic can only be accomplished to the degree to
which the topic can be defined.
Axiom of order: any compilation of information
relevant to a topic is an order-creating process.
Axiom of the sufficient degree of order: the demands made on
the degree of order increase as the size of a collection and
the frequency of searches increase.
Axiom of predictability: the success of any directed search for
relevant information hinges on how readily predictable or
reconstructible the modes of expression for concepts and
statements in the search file are.
Axiom of fidelity: the success of any directed search for
relevant information depends on the fidelity with which concepts and
statements are expressed in the search file.