CS 430 / INFO 430
Information Retrieval
Lecture 26
Classification 1
Course Administration
Classification and Categorization
[Diagram relating terms and documents to pre-defined classes and empirically-defined classes. Labels: thesaurus, classification, text categorization, document clustering.]
Text categorization
Text categorization
• Problem is to classify documents by whether they
belong to a fixed set of pre-determined categories.
• Each document may belong to many categories.
• Each category is taken as a separate binary
classification problem.
Classification
• Problem is to assign each document to exactly one
pre-determined category.
Text categorization
Outline
• Select a subject domain.
• Choose a corpus of documents that cover the domain.
• Obtain a training set of documents that have been assigned to a
set of categories.
• Create a vocabulary by extracting terms, normalization,
precoordination of phrases, etc.
• Use the terms in a document as a feature set that describes the
document. Scale the terms using idf or a similar measure.
• Use machine learning methods (e.g., support vector machine) to
train an automatic classifier.
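The feature-building steps of the outline can be sketched roughly as follows. This is only an illustration: the toy training set, the whitespace tokenizer, and the function names are assumptions, and the classifier itself (e.g., a support vector machine) is left out. It shows idf-scaled term features and the split of a multi-category training set into one binary problem per category.

```python
import math
from collections import Counter

# Hypothetical toy training set: (text, set of assigned categories).
TRAINING = [
    ("canal locks and drainage ditches", {"hydrography"}),
    ("museum catalog of drinking vessels", {"art"}),
    ("drainage canal shown in a museum map", {"hydrography", "art"}),
]

def tokenize(text):
    """Very crude term extraction: lowercase and split on whitespace."""
    return text.lower().split()

def idf_weights(docs):
    """idf(t) = log(N / df(t)), where df(t) counts the documents containing t."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def features(text, idf):
    """Term frequency scaled by idf; terms unseen in training are dropped."""
    tf = Counter(tokenize(text))
    return {term: tf[term] * idf[term] for term in tf if term in idf}

def binary_problems(training):
    """One yes/no labelling of the training set per category."""
    categories = sorted(set().union(*(cats for _, cats in training)))
    return {c: [c in cats for _, cats in training] for c in categories}

idf = idf_weights([text for text, _ in TRAINING])
vectors = [features(text, idf) for text, _ in TRAINING]
labels = binary_problems(TRAINING)
```

Each entry of `labels` pairs with `vectors` to form the training data for one binary classifier.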
Lexicon and Thesaurus
Lexicon contains information about words, their
morphological variants, and their grammatical usage.
Thesaurus relates words by meaning:
ship, vessel, sail; craft, navy, marine, fleet, flotilla
book, writing, work, volume, tome, tract, codex
search, discovery, detection, find, revelation
(From Roget's Thesaurus, 1911)
Thesaurus in Information Retrieval
Use of a thesaurus in indexing (precoordination)
A. Manual
A human indexer assigns standard terms and associations.
computer-aided instruction
see also education
UF (used for) teaching machines
BT (broader term) educational computing
TT (top term) computer applications
RT (related term) education
RT (related term) teaching
From: INSPEC Thesaurus
Thesaurus in Information Retrieval
Use of a thesaurus in indexing (precoordination)
B. Automatic
Divide terms into thesaurus classes. Replace similar terms by a
thesaurus class.
408 dislocation, junction, minority-carrier, n-p-n, p-n-p, point-contact, recombine, transition, unijunction
409 blast-cooled, heat-flow, heat-transfer
410 anneal, strain
From: Salton and McGill
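A minimal sketch of replacing terms by their thesaurus class during indexing. The class membership follows the Salton and McGill example as best it can be read from the slide; the function name and the sample term list are assumptions.

```python
# Thesaurus classes as read from the Salton and McGill example.
THESAURUS_CLASSES = {
    408: ["dislocation", "junction", "minority-carrier", "n-p-n", "p-n-p",
          "point-contact", "recombine", "transition", "unijunction"],
    409: ["blast-cooled", "heat-flow", "heat-transfer"],
    410: ["anneal", "strain"],
}

# Invert to a term -> class number lookup.
TERM_TO_CLASS = {
    term: class_id
    for class_id, terms in THESAURUS_CLASSES.items()
    for term in terms
}

def index_terms(terms):
    """Replace each term that belongs to a thesaurus class by its class number.

    Terms outside every class are kept unchanged.
    """
    return [TERM_TO_CLASS.get(term, term) for term in terms]

indexed = index_terms(["heat-flow", "strain", "laser", "heat-transfer"])
```

Here "heat-flow" and "heat-transfer" both map to class 409, so a query using one term can match a document indexed under the other.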
Desirable Properties for Information Retrieval
• Thesaurus is specific to a subject area. Contains only
terms of interest for identification within that subject
area.
• Ambiguous terms are coded only for the senses
important for that field.
• Target is that each thesaurus class should include
terms of moderate frequency. Ideally the classes
should have similar frequency.
Alexandria Thesaurus: Example
canals
A feature type category for places such as the Erie Canal.
Used for:
The category canals is used instead of any of the following.
canal bends
ditches
canalized streams
drainage canals
ditch mouths
drainage ditches
Broader Terms:
Canals is a sub-type of hydrographic structures.
... more ...
Alexandria Thesaurus: Example
(continued)
canals (continued)
Related Terms:
The following is a list of other categories related to canals (nonhierarchical relationships).
channels
locks
transportation features
tunnels
Scope Note:
Manmade waterway used by watercraft or for drainage, irrigation,
mining, or water power. » Definition of canals.
Art and Architecture Thesaurus
• Controlled vocabulary for describing and retrieving information:
fine art, architecture, decorative art, and material culture.
• Almost 128,000 terms for objects, textual materials, images,
architecture and culture from all periods and all cultures.
• Used by archives, museums, and libraries to describe items in their
collections.
• Used as a database accessed via a Web site.
• Used by computer programs, for information retrieval, and natural
language processing.
http://www.getty.edu/research/conducting_research/vocabularies/aat/
A project of the J. Paul Getty Trust
Art and Architecture Thesaurus
Provides the terminology for objects, and the vocabulary necessary to
describe them, such as style, period, shape, color, construction, or
use, and scholarly concepts, such as theories, or criticism.
Concept:
a cluster of terms, one of which is established as the preferred term,
or descriptor.
Categories:
associated concepts, physical attributes, styles and periods, agents,
activities, materials, and objects.
Art and Architecture Thesaurus:
Sample Record
Record ID: 198841
Descriptor: rhyta
Note: Refers to vessels from Ancient Greece, eastern Europe, or
the Middle East that typically have a closed form with two
openings, one at the top for filling and one at the base so that
liquid could stream out. They are often in the shape of a horn or
an animal's head, and were typically used as a drinking cup or for
pouring wine into another vessel.
Hierarchy:
Containers [TQ]
...<containers by function or context>
...........<culinary containers>
...................<containers for serving and consuming food>
Art and Architecture Thesaurus:
Sample Record (continued)
Terms:
rhyta
rhyton (alternate, singular)
protomai
protome
rhea
rheon
rheons
Related concepts:
stirrup cups
sturzbechers
drinking vessels
ceremonial vessels
Medical Subject Headings (MeSH)
National Library of Medicine's controlled vocabulary thesaurus
The library provides MeSH subject headings for each article in the
4,800 journals that it indexes every year, and for every book, etc.,
acquired by the library.
• 23,000 primary headings.
• Additional thesaurus of about 151,000 chemical terms.
• Terms are organized in a hierarchy.
• 135,000 cross-references.
Intended for experts who understand the field and are able to
formulate queries using MeSH terms and the MeSH structure.
MeSH hierarchy
Biological Sciences [G]
Biological Sciences [G01] +
Health Occupations [G02] +
Environment and Public Health [G03] +
Biological Phenomena, Cell Phenomena, and Immunity [G04] +
Genetics [G05] +
Biochemical Phenomena, Metabolism, and Nutrition [G06] +
Physiological Processes [G07] +
Reproductive and Urinary Physiology [G08] +
Circulatory and Respiratory Physiology [G09] +
Digestive, Oral, and Skin Physiology [G10] +
Musculoskeletal, Neural, and Ocular Physiology [G11] +
Chemical and Pharmacologic Phenomena [G12] +
MeSH hierarchy (continued)
Physiological Processes [G07]
Adaptation, Physiological [G07.062] +
Aging [G07.168] +
Body Constitution [G07.265] +
Body Temperature [G07.315]
Body Temperature Regulation [G07.315.232] +
Skin Temperature [G07.315.753]
Chronobiology [G07.450] +
Electrophysiology [G07.453] +
Fluid Shifts [G07.503]
Growth and Embryonic Development [G07.553] +
Homeostasis [G07.621] +
Tensile Strength [G07.900]
Tropism [G07.950] +
MeSH hierarchy (continued)
MeSH Heading: Body Temperature
Tree Number: E01.370.600.120
Tree Number: G07.315
Entry Term: Organ Temperature
See Also: Fever
See Also: Thermography
See Also: Thermometers
Allowable Qualifiers: DE GE IM PH RE
Unique ID: D001831
Automatic Thesaurus Construction
Outline
• Select a subject domain.
• Choose a corpus of documents that cover the domain.
• Create vocabulary by extracting terms, normalization,
precoordination of phrases, etc.
• Devise a measure of similarity between terms and
thesaurus classes.
• Cluster terms into thesaurus classes, using a cluster
method that generates compact clusters (e.g., complete
linkage).
Normalization of vocabulary
Normalization rules map variant forms into base
expressions. Typical normalization rules for manual
thesaurus construction are:
(a) Nouns only, or nouns and noun phrases.
(b) Singular nouns only.
(c) Spelling (e.g., U.S.).
(d) Capitalization, punctuation (e.g., hyphens), initials (e.g.,
IBM), abbreviations (e.g., Mr.).
Usually, many possible decisions can be made, but they
should be followed consistently.
Which of these can be carried out automatically with
reasonable accuracy?
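As a rough illustration of the rules that are easiest to automate, the sketch below applies capitalization and punctuation rules and a naive singular-form rule. The function name and the specific choices (lowercasing, dropping periods and hyphens, stripping a final "s") are assumptions for illustration, not rules taken from the lecture.

```python
import re

def normalize(term):
    """Map a variant form to a base expression (illustrative rules only)."""
    term = term.lower()                        # capitalization
    term = re.sub(r"[.\-]", "", term)          # punctuation: drop periods and hyphens
    term = re.sub(r"\s+", " ", term).strip()   # collapse whitespace
    # Naive singular form: strip a final "s" (but not "ss") from longer words.
    # This is wrong for many words, e.g. it turns "ditches" into "ditche".
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        term = term[:-1]
    return term

examples = {t: normalize(t) for t in ["Canals", "U.S.", "IBM", "Heat-Transfer", "glass"]}
```

The capitalization and punctuation rules behave predictably, while the singular-form rule shows why restricting the vocabulary to singular nouns is harder to automate accurately without a lexicon.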
Phrase construction
In a precoordinated thesaurus, term classes may contain phrases.
Informal definitions:
pair-frequency (i, j) is the frequency with which a given pair of words,
i and j, occur in context (e.g., in succession within a sentence)
phrase is a pair of words, i and j, that occur in context with a
higher frequency than would be expected from their overall
frequency
$$ \mathrm{cohesion}(i, j) = \frac{\text{observed pair-frequency}(i, j)}{\text{expected pair-frequency if } i, j \text{ independent}} $$
Phrase construction: simple case
Example: corpus of n terms
$p_{i,j}$ is the observed frequency with which terms i and j occur in
succession.
$f_i$ is the number of occurrences of term i in the corpus.
There are n-1 pairs. If the terms are independent, the probability
that a given pair begins with term i and ends with term j is
$$ (f_i/n)(f_j/n) $$
so the expected pair-frequency is $(n-1) f_i f_j / n^2$, and
$$ \mathrm{cohesion}(i, j) = \frac{n^2 \, p_{i,j}}{(n-1) \, f_i f_j} $$
Phrase construction
Salton and McGill algorithm
1. Compute pair-frequency for all terms.
2. Reject all pairs that fall below a certain frequency threshold.
3. Calculate cohesion values.
4. If cohesion is above a threshold value, consider the word pair as a
phrase.
There is promising research on phrase identification using
methods of computational linguistics.
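The four steps can be sketched as below, using the simple-case cohesion formula n^2 p_ij / ((n-1) f_i f_j) for pairs of words in succession. The function name, the threshold values, and the sample text are assumptions; the sketch also ignores sentence boundaries, which a real implementation would likely respect.

```python
from collections import Counter

def find_phrases(tokens, min_pair_freq=2, min_cohesion=2.0):
    """Return {(i, j): cohesion} for adjacent word pairs that pass both thresholds.

    The default thresholds are placeholders, not values from the lecture.
    """
    n = len(tokens)
    term_freq = Counter(tokens)                    # f_i
    pair_freq = Counter(zip(tokens, tokens[1:]))   # step 1: p_ij over the n-1 pairs
    phrases = {}
    for (i, j), p_ij in pair_freq.items():
        if p_ij < min_pair_freq:                   # step 2: frequency threshold
            continue
        # step 3: observed pair-frequency over expected pair-frequency
        cohesion = (n * n * p_ij) / ((n - 1) * term_freq[i] * term_freq[j])
        if cohesion > min_cohesion:                # step 4: cohesion threshold
            phrases[(i, j)] = cohesion
    return phrases

tokens = ("information retrieval uses a thesaurus and "
          "information retrieval uses an index").split()
phrases = find_phrases(tokens)
```

With this sample text only the pairs (information, retrieval) and (retrieval, uses) occur twice, so they are the only candidates that survive step 2.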
Similarities
The vocabulary consists of a set of elements, each of which
can be a single term or a phrase.
The next step is to calculate a measure of similarity
between elements.
One measure of similarity is the number of documents that
have terms j and k in common:
$$ S(t_j, t_k) = \sum_{i=1}^{n} t_{ij} t_{ik} $$
where $t_{ij}$ = 1 if document i contains term j and 0 otherwise, and n is
the number of documents.
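Similarities: Incidence array
[Table: incidence array of documents D1 to D4 against the terms alpha to golf; a 1 marks a term that occurs in a document. The individual entries are not recoverable. Row n gives the number of documents containing each term.]
     alpha  bravo  charlie  delta  echo  foxtrot  golf
n      3      2       2       3     2       3      3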
Term similarity matrix
          alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha              1       1       3      1      2       3
bravo       1              2       1      2      2       1
charlie     1      2               1      2      2       1
delta       3      1       1              1      2       3
echo        1      2       2       1             2       1
foxtrot     2      2       2       2      2              2
golf        3      1       1       3      1      2
Using count of documents that have two terms in common
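The counts in the term similarity matrix can be computed directly from an incidence array. In the sketch below the incidence array is hypothetical: it is one arrangement of four documents chosen to be consistent with the per-term document counts and the similarity counts on the slides, not necessarily the array shown in the lecture. The function names are also assumptions.

```python
TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

# Hypothetical incidence array: rows are documents, columns follow TERMS.
INCIDENCE = [
    [1, 1, 1, 1, 1, 1, 1],  # D1
    [1, 0, 0, 1, 0, 0, 1],  # D2
    [0, 1, 1, 0, 1, 1, 0],  # D3
    [1, 0, 0, 1, 0, 1, 1],  # D4
]

def common_documents(incidence, j, k):
    """S(t_j, t_k) = sum over documents i of t_ij * t_ik."""
    return sum(row[j] * row[k] for row in incidence)

def similarity_matrix(incidence):
    """Symmetric matrix of pairwise counts; the diagonal is left as None."""
    n_terms = len(incidence[0])
    return [
        [common_documents(incidence, j, k) if j != k else None
         for k in range(n_terms)]
        for j in range(n_terms)
    ]

counts = similarity_matrix(INCIDENCE)
```

For example, `counts[0][3]` is the number of documents that contain both alpha and delta.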
Similarity measures
Improved similarity measures can be generated by:
• Using term frequency matrix instead of incidence matrix
• Weighting terms by frequency:
cosine measure
$$ S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} t_{ik}}{|t_j| \, |t_k|} $$
dice measure
$$ S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} t_{ik}}{\sum_{i=1}^{n} t_{ik} + \sum_{i=1}^{n} t_{ij}} $$
Term similarity matrix
          alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha             0.2     0.2     0.5    0.2   0.33     0.5
bravo      0.2            0.5     0.2    0.5   0.4      0.2
charlie    0.2    0.5             0.2    0.5   0.4      0.2
delta      0.5    0.2     0.2            0.2   0.33     0.5
echo       0.2    0.5     0.5     0.2          0.4      0.2
foxtrot    0.33   0.4     0.4     0.33   0.4            0.33
golf       0.5    0.2     0.2     0.5    0.2   0.33
Using incidence matrix and dice weighting
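A small sketch of the cosine and dice measures applied to the columns of an incidence array. The array is hypothetical (one arrangement consistent with the similarity values on the slides), and the helper names are assumptions. The dice measure is coded as written in the lecture, without the factor of 2 in the numerator that appears in the usual Dice coefficient; this is the form that reproduces the values in the matrix. The code assumes every term occurs in at least one document, so no denominator is zero.

```python
import math

# Hypothetical incidence array: rows are D1..D4, columns are alpha..golf.
INCIDENCE = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 1],
]

def column(matrix, j):
    """Vector t_j: the entries for term j across all documents."""
    return [row[j] for row in matrix]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(matrix, j, k):
    """Dot product of t_j and t_k divided by the product of their lengths."""
    tj, tk = column(matrix, j), column(matrix, k)
    return dot(tj, tk) / (math.sqrt(dot(tj, tj)) * math.sqrt(dot(tk, tk)))

def dice(matrix, j, k):
    """Dot product of t_j and t_k divided by the sum of their entries."""
    tj, tk = column(matrix, j), column(matrix, k)
    return dot(tj, tk) / (sum(tj) + sum(tk))

ALPHA, BRAVO, DELTA, FOXTROT = 0, 1, 3, 5
dice_alpha_delta = dice(INCIDENCE, ALPHA, DELTA)
dice_bravo_foxtrot = dice(INCIDENCE, BRAVO, FOXTROT)
cosine_alpha_delta = cosine(INCIDENCE, ALPHA, DELTA)
```

The same functions accept a term frequency matrix in place of the incidence array, which is the first improvement listed in the lecture.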
Clustering terms to form concepts
The final stage is to group similar terms together into concepts.
This is done by cluster analysis. Cluster analysis is the topic of the
next lecture.
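As a preview, here is a minimal sketch of grouping the seven example terms with complete linkage, the compact-cluster method named in the outline for automatic thesaurus construction. The similarity values are the dice values from the term similarity matrix; the threshold of 0.5, the function name, and the tie-breaking behaviour are assumptions, and the method as presented in the next lecture may differ.

```python
TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

# Dice similarities from the slides (the diagonal is a filler and is never read).
DICE = [
    [1.0, 0.2, 0.2, 0.5, 0.2, 0.33, 0.5],
    [0.2, 1.0, 0.5, 0.2, 0.5, 0.4, 0.2],
    [0.2, 0.5, 1.0, 0.2, 0.5, 0.4, 0.2],
    [0.5, 0.2, 0.2, 1.0, 0.2, 0.33, 0.5],
    [0.2, 0.5, 0.5, 0.2, 1.0, 0.4, 0.2],
    [0.33, 0.4, 0.4, 0.33, 0.4, 1.0, 0.33],
    [0.5, 0.2, 0.2, 0.5, 0.2, 0.33, 1.0],
]

def complete_linkage(terms, sim, threshold):
    """Agglomerative clustering; stops when no merge reaches the threshold."""
    clusters = [[i] for i in range(len(terms))]

    def linkage(a, b):
        # Complete linkage: cluster similarity is that of the least similar pair.
        return min(sim[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, (x, y) = max(
            (linkage(a, b), (x, y))
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]   # merge y into x (x < y)
        del clusters[y]
    return [[terms[i] for i in cluster] for cluster in clusters]

concepts = complete_linkage(TERMS, DICE, threshold=0.5)
```

Because every member of a cluster must be similar to every other member, foxtrot stays on its own at this threshold: its similarity to each of the other terms is below 0.5.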