Computer Aided Indexing
Overview of the key features of software candidates
Mary Huxlin
Australia
Human vs. Computer-Aided Indexing

Human indexing
 logical
 controllable
 resource intensive
 inconsistent - subject knowledge, human error
 not scalable to the same degree as automated indexing

Computer-Aided Indexing
 limited accuracy
 difficult to train
 lack of control
 consistent
 speeds up the process; greater throughput
Content management technologies
 Use various algorithms for statistical analysis, semantic processing, NLP and neural networks (AI) - most often combining them.
 Offer automatic or semi-automatic (hybrid) solutions for:
– extracting relevant content from a document or web page
– mapping documents into one or multiple pre-defined or customer-specific hierarchies of categories (taxonomies)
– machine-aided indexing
– thesaurus development
Definitions

Thesaurus - one type of controlled vocabulary: a collection of terms referring to specific concepts, together with variants and conceptual relationships.

Taxonomy (from Greek taxis, meaning arrangement or division, and nomos, meaning law) is the science of classification according to a pre-determined system that divides a subject area hierarchically into progressively smaller subdivisions.

The integration of classification and thesauri in an automated environment results in the construction of a TAXONOMY.
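A minimal sketch (my own illustration, not from the presentation) of how these two structures might be represented in code; the class names and example labels are assumptions for illustration only:

```python
from dataclasses import dataclass, field

# A thesaurus entry: a preferred term plus its variants (non-preferred or
# "forbidden" terms) and conceptual relationships to other terms.
@dataclass
class ThesaurusTerm:
    preferred: str
    variants: list = field(default_factory=list)
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)

# A taxonomy node: a category that divides a subject area hierarchically
# into progressively smaller subdivisions.
@dataclass
class TaxonomyNode:
    label: str
    children: list = field(default_factory=list)

# Hypothetical examples of both structures.
safety = ThesaurusTerm(preferred="Reactor safety",
                       variants=["reactor protection"],
                       broader=["Safety"])
root = TaxonomyNode("Engineering", children=[
    TaxonomyNode("Nuclear engineering",
                 children=[TaxonomyNode("Reactor safety")]),
])
```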
Taxonomy process
[Diagram reproduced from Delphi Group White Paper]

Taxonomy for the Semantic Web
 defines topics and their relationships - improves user and technology efficiency
[Screenshot: the whatis.com word list for General Computing Terms, showing the entry "taxonomy" filed under All Categories > Computing Fundamentals > General Computing Terms]
taxonomy
Taxonomy (from Greek taxis meaning arrangement or division and nomos meaning
law) is the science of classification according to a pre-determined system, with the
resulting catalog used to provide a conceptual framework for discussion, analysis, or
information retrieval. In theory, the development of a good taxonomy takes into account
the importance of separating elements of a group (taxon) into subgroups (taxa) that are
mutually exclusive, unambiguous, and taken together, include all possibilities. In
practice, a good taxonomy should be simple, easy to remember, and easy to use.
One of the best known taxonomies is the one devised by the Swedish scientist, Carl
Linnaeus, whose classification for biology is still widely used (with modifications). In
Web portal design, taxonomies are often created to describe categories and
subcategories of topics found on the Web site. The categorization of words on
whatis.com is similar to any Web portal taxonomy.
Statistical Text Analysis and Clustering
measures:
 Co-occurrence of words.
– If "Java" is used in connection with "Starbucks", the document more likely relates to coffee than to the programming language.
 Relative placement of words.
– Words in the first lines of a document or in the title are likely more important than information contained in the copyright section.
 Word frequency, placement and grouping, as well as the distance between words in a document.
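A small sketch of the co-occurrence idea (my own illustration; the window size and tokenisation are arbitrary choices, not taken from the presentation):

```python
from collections import Counter

# Count how often pairs of words appear within a small window of each other -
# a simple co-occurrence measure of the kind described above.
def cooccurrences(tokens, window=5):
    pairs = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            pairs[tuple(sorted((word, other)))] += 1
    return pairs

tokens = "java beans roasted at starbucks pair well with a java applet".split()
print(cooccurrences(tokens).most_common(3))
```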
Advanced statistical text analysis and clustering
 Bayesian probability uses statistical models built from words in training sets, together with pattern analysis, to assign the probability of correlation.
– If a given document contains the words "cerium" and "gadolinium", it is more than likely about Rare Earths, which leads to the assumption that other metal nouns such as "neodymium" or "ytterbium" will occur.
 Pattern analysis improves the precision of statistical analysis and clustering by resolving ambiguous or multiple meanings of words, learned through an iterative process.
– "SHRIMP" in a document may refer to a method of analysis (sensitive high-resolution ion microprobe) or to a crustacean.
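A minimal sketch of Bayesian classification, using scikit-learn's multinomial naive Bayes as a stand-in for the vendors' proprietary models; the training documents and category labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: documents already indexed by a human.
train_docs = [
    "cerium and gadolinium anomalies in weathered granite",
    "neodymium and ytterbium recovery from monazite",
    "reactor coolant pump safety inspection",
    "safety standards for pressure vessels",
]
train_labels = ["Rare earths", "Rare earths", "Reactor safety", "Reactor safety"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# The classifier assigns the category whose word statistics best match the new text.
print(model.predict(["ytterbium and cerium in carbonatite"]))  # -> ['Rare earths']
```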
Semantic and Linguistic Clustering
 Linguistic (natural language processing) software
– analyses the structure of sentences, identifying the subjects, verbs and objects
– applies sentence structure analysis to extract the meaning
– uses stemming, i.e. reducing a word to its root (prone to overkill!)
 Documents are clustered or grouped depending on the meaning of words
– using a thesaurus / knowledge base, probabilistic grammar, recognition of idioms, verb chain recognition, and noun phrase identifiers
 Only a slight improvement over statistically generated phrases
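A quick illustration of stemming and its "overkill" risk, using NLTK's Porter stemmer (my choice of tool; the presentation does not name one):

```python
from nltk.stem import PorterStemmer  # assumes the nltk package is installed

stemmer = PorterStemmer()

# Stemming conflates inflected forms, which helps clustering...
print([stemmer.stem(w) for w in ["indexes", "indexing", "indexed"]])
# ['index', 'index', 'index']

# ...but is prone to overkill: distinct words can collapse to the same root.
print([stemmer.stem(w) for w in ["university", "universe", "universal"]])
# ['univers', 'univers', 'univers']
```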
Rule-based classification/indexing

 Rules enable the system to "think" like humans
 Identity rules identify concepts that match Thesaurus terms or are equivalent in meaning to them (forbidden terms)
 Context rules (proximity, case of letters, location in the document)
 "If-Then" or "If-EndIf" rules for when word meanings are ambiguous
 Rules can be a powerful and flexible means of automatically classifying content, based not just on the content itself but on the metadata that describes the content (e.g. subject categories, journal title).
 The downsides of rule-based systems:
– human experts (= expensive) have to write and maintain the rules
– rules can be complex and thus prone to failure
Rule-based indexing
"If rules"
Text to match: SAFETY
IF (NEAR "reactor") WITHIN 3 WORDS
  USE Reactor safety
ENDIF
IF (WITH "standard") WITHIN SENTENCE
  USE Safety standards
ENDIF
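A minimal sketch (my own illustration, not the syntax or engine of any reviewed product) of how these two rules might be applied in code:

```python
import re

def index_safety(text):
    """Apply the two illustrative rules for the match word SAFETY."""
    terms = set()
    tokens = re.findall(r"\w+", text.lower())

    # Rule 1: "safety" NEAR "reactor" WITHIN 3 WORDS -> Reactor safety
    for i, tok in enumerate(tokens):
        if tok == "safety" and "reactor" in tokens[max(0, i - 3): i + 4]:
            terms.add("Reactor safety")

    # Rule 2: "safety" WITH "standard" WITHIN SENTENCE -> Safety standards
    for sentence in re.split(r"[.!?]", text.lower()):
        if "safety" in sentence and "standard" in sentence:
            terms.add("Safety standards")

    return terms

print(index_safety("The reactor safety review cited international safety standards."))
# -> {'Reactor safety', 'Safety standards'}
```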
Machine learning
An iterative process
 Identifies patterns in manually indexed sample texts (the training set) and makes predictions about unseen text - also called computational linguistics.
 Improves its performance based on experience (see the sketch below).
 Requires a large number of documents.
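A sketch of that iterative loop under assumed conditions (hypothetical documents and labels, with scikit-learn naive Bayes standing in for a vendor's learner):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Start from a small, manually indexed training set (labels are hypothetical).
docs   = ["uranium enrichment plant licensing",
          "gold mineralisation in greenstone belts"]
labels = ["Nuclear fuel cycle", "Economic geology"]

model = make_pipeline(CountVectorizer(), MultinomialNB())

# Iterative loop: predict unseen text, let a human indexer confirm or correct the
# suggestion, fold the corrected example back into the training set, and retrain.
for new_doc in ["thorium fuel cycle options", "epithermal gold deposits"]:
    model.fit(docs, labels)
    suggestion = model.predict([new_doc])[0]
    corrected = suggestion            # in practice a human reviews this suggestion
    docs.append(new_doc)
    labels.append(corrected)

print(list(zip(docs, labels)))
```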
Linguistic DNA
 Statistical & linguistic processing (extraction)
– key concepts extracted have full semantic meaning on their own
 Enhanced with Logico-Deductive Reasoning and Fuzzy Logic techniques (manipulation)
– fuzzy concepts are highly context-dependent
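A toy illustration of the fuzzy-logic idea (entirely my own; the scoring rule and cue words are assumptions): instead of a hard yes/no decision, a document receives a degree of membership in a concept.

```python
# Degree of membership (0.0 to 1.0) of a document in a concept, driven here by
# how often assumed cue words occur; real systems use far richer evidence.
def fuzzy_membership(text, cues, saturation=3):
    tokens = text.lower().split()
    hits = sum(tokens.count(cue) for cue in cues)
    return min(1.0, hits / saturation)

doc = "reactor safety standards and safety culture in plant operations"
print(fuzzy_membership(doc, ["safety", "reactor", "hazard"]))  # -> 1.0
```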
Application Programming Interfaces
 Software applications such as portals, content management systems, knowledge management systems, search and retrieval software, data extraction, and data mining can all benefit from automatically or semi-automatically generated taxonomies.
 Practically all of the software under consideration is sold with Application Programming Interfaces (APIs) for integration into existing local applications.
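A purely hypothetical sketch of what such an integration call might look like from a local application; the endpoint URL, request fields and response shape are invented for illustration and do not correspond to any vendor's actual API:

```python
import requests  # third-party HTTP client

def suggest_index_terms(document_text):
    # Placeholder endpoint and payload; a real vendor API will differ.
    response = requests.post(
        "https://cai.example.org/api/categorise",
        json={"text": document_text, "taxonomy": "corporate-thesaurus"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("suggested_terms", [])

# The calling application (e.g. a content management system) can then present
# the suggested terms to a human indexer for confirmation.
```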
Assessment
 Most of the companies profiled are very technology-centric and spend much of their marketing effort trying to convince us of the advantages of their approach or methodology.
 None of the products works "out of the box"; all require a close relationship between the user and the supplier of the technology.
 The bottom line is to understand how these differences affect system performance in the only environment that matters: our unique data environment.
Assessment (cont.)
 The main business of the reviewed software candidates is to automatically categorise information, identifying where documents belong within a taxonomy - a solution to the "infoglut" problem.
 They are market driven; they have, or could develop, a CAI capability if there is a sizeable market for it.
 Some of the products analysed have an advanced CAI component, warranting further investigation.
How can we measure CAI software performance?
 Recall = proportion of the correct index terms that are actually generated
 Precision = proportion of the generated index terms that are correct
 Overindexing = proportion of the generated index terms that are incorrect
 Cost benefit (ROI)
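The first three measures can be computed directly once a human-indexed reference set exists; a small sketch under that assumption (function and variable names are mine):

```python
# Compare the terms the software generated against the terms a human indexer
# considers correct for the same document.
def evaluate(generated, correct):
    generated, correct = set(generated), set(correct)
    hits = generated & correct
    recall = len(hits) / len(correct)        # share of correct terms found
    precision = len(hits) / len(generated)   # share of generated terms that are correct
    overindexing = 1 - precision             # share of generated terms that are wrong
    return recall, precision, overindexing

print(evaluate(
    generated=["Reactor safety", "Safety standards", "Rare earths"],
    correct=["Reactor safety", "Safety standards"],
))  # -> (1.0, 0.666..., 0.333...)
```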
ISSUES
 Training
 Integration with WinFIBRE (or similar products)
 Tools for authoring and maintenance
 Could we maintain the quality of the database?
 Ensure the user can navigate from need to …
“...Control exercised by machines, far from enslaving human beings, will liberate them for tasks only they can perform…”

Wellisch, H.H. (1998). Indexing after the millennium 3: the indexer as helmsman. The Indexer, 21(2), 89.