CS 430: Information Discovery Cluster Analysis 2 Thesaurus Construction Lecture 23

CS 430: Information Discovery
Lecture 23
Cluster Analysis 2
Thesaurus Construction
1
Course Administration
Next week
• Guest lecture on Thursday, Thorsten Joachims.
Final examination
• The final examination will include questions on all
lectures, including the guest lectures, and the readings
for the discussion classes.
• Examination date: Wednesday, December 18, 12:00
noon - 1:30 p.m.
• Early examination: Thursday December 12, 12:00 noon
- 1:30 p.m. Contact Anat Nidar-Levi
([email protected]) if you plan to take the early
examination.
2
Example 2: Concept Spaces for
Scientific Terms
Large-scale searches can only match terms specified by the
user to terms appearing in documents. Cluster analysis can
be used to provide information retrieval by concepts, rather
than by terms.
Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph B.
Hardin, Ann P. Bishop (University of Illinois), Hsinchun Chen
(University of Arizona), Federating Diverse Collections of
Scientific Literature, IEEE Computer, May 1996.
3
Concept Spaces: Methodology
Concept space:
A similarity matrix based on co-occurrence of terms.
Approach:
Use cluster analysis to generate "concept spaces" automatically,
i.e., clusters of terms that embrace a single semantic concept.
Arrange concepts in a hierarchical classification.
4
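A minimal sketch of a co-occurrence similarity matrix used for term suggestion. It assumes document-level co-occurrence and the Jaccard coefficient as the similarity measure; the slides do not say which measure the cited work used, and the function names and toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_similarity(docs):
    """Map each co-occurring term pair (a, b), a < b, to a similarity score.

    Assumption: similarity is the Jaccard coefficient of the sets of
    documents that contain each term.
    """
    doc_freq = Counter()   # number of documents containing each term
    pair_freq = Counter()  # number of documents containing both terms
    for doc in docs:
        terms = sorted(set(doc))
        doc_freq.update(terms)
        pair_freq.update(combinations(terms, 2))
    return {
        (a, b): n_ab / (doc_freq[a] + doc_freq[b] - n_ab)
        for (a, b), n_ab in pair_freq.items()
    }


def suggest(term, sim, k=3):
    """Return up to k terms most similar to `term`, best first."""
    scored = []
    for (a, b), score in sim.items():
        if a == term:
            scored.append((score, b))
        elif b == term:
            scored.append((score, a))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]


# Hypothetical toy collection: each document is a list of index terms.
docs = [
    ["bridge", "wind", "fluid dynamics"],
    ["cable", "water", "fluid dynamics"],
    ["bridge", "wind", "structure"],
]
sim = cooccurrence_similarity(docs)
suggestions = suggest("bridge", sim)
```

Here `suggest("bridge", sim)` ranks the terms that share documents with "bridge", which is the term-suggestion use of a concept space in miniature.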
Concept Spaces: INSPEC Data
Data set 1: All terms in 400,000 records from INSPEC, containing
270,000 terms with 4,000,000 links.
computer-aided instruction
see also education
UF teaching machines
BT educational computing
TT computer applications
RT education
RT teaching
[24.5 hours of CPU on 16-node Silicon Graphics supercomputer.]
5
Concept Space: Compendex Data
Data set 2:
(a) 4,000,000 abstracts from the Compendex database covering all
of engineering as the collection, partitioned along classification
code lines into some 600 community repositories.
[ Four days of CPU on 64-processor Convex Exemplar.]
(b) In the largest experiment, 10,000,000 abstracts were divided
into sets of 100,000, and the concept space for each set was generated
separately. The sets were selected by the existing classification
scheme.
6
Objectives
• Semantic retrieval (using concept spaces for term suggestion)
• Semantic interoperability (vocabulary switching across subject domains)
• Semantic indexing (concept identification of document content)
• Information representation (information units for uniform manipulation)
7
Use of Concept Space: Term Suggestion
8
Future Use of Concept Space:
Vocabulary Switching
"I'm a civil engineer who designs bridges. I'm interested in
using fluid dynamics to compute the structural effects of
wind currents on long structures. Ocean engineers who
design undersea cables probably do similar computations
for the structural effects of water currents on long
structures. I want you [the system] to change my civil
engineering fluid dynamics terms into the ocean
engineering terms and search the undersea cable literature."
9
Example 3: Visual thesaurus for browsing
large collections of geographic images
Methodology:
• Divide images into small regions.
• Create a similarity measure based on properties of these images.
• Use cluster analysis tools to generate clusters of similar images.
• Provide alternative representations of clusters.
Marshall Ramsey, Hsinchun Chen, Bin Zhu, A Collection of
Visual Thesauri for Browsing Large Collections of Geographic
Images, May 1997.
(http://ai.bpa.arizona.edu/~mramsey/papers/visualThesaurus/visualThesaurus.html)
10
11
Information Visualization
The human eye is excellent at identifying patterns in graphical
data:
• Trends in time-dependent data.
• Broad patterns in complex data.
• Anomalies in scientific data.
• Visualizing information spaces for browsing.
12
Pad++
Concept. A large collection of information viewed at many
different scales. Imagine a collection of documents spread out
on an enormous wall.
Zoom. Zoom out and see the whole collection with little detail.
Zoom in part way to see sections of the collection. Zoom in to
see every detail.
Semantic Zooming. Objects change appearance when they
change size, so as to be most meaningful. (Compare maps.)
Performance. Rendering operations timed so that the frame
refresh rate remains constant during pans and zooms.
13
Pad++ File Browser
14
Pad++ File Browser
15
Pad++ File Browser
16
Example: Tilebars
The figure represents a set of hits
from a text search.
Each large rectangle represents a
document or section of text.
Each row represents a search term or
subquery.
The density of each small square
indicates the frequency with which a
term appears in a section of a
document.
Hearst 1995
17
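A rough text-only sketch of the TileBars idea: one row per search term, one cell per section of the document, with a denser character for a higher term frequency. The shading characters, the whitespace tokenization, and the sample sections are assumptions made for illustration, not part of Hearst's design.

```python
def tilebar(sections, terms, shades=" .:#"):
    """Return one string per search term; each character is one section.

    A count of 0 maps to shades[0], 1 to shades[1], and so on; counts
    beyond the last shade are clamped to the densest character.
    """
    rows = []
    for term in terms:
        counts = [section.lower().split().count(term) for section in sections]
        rows.append("".join(shades[min(c, len(shades) - 1)] for c in counts))
    return rows


# Hypothetical document split into three sections.
sections = [
    "cluster analysis groups terms",
    "thesaurus terms and cluster cluster cluster methods",
    "no match here",
]
rows = tilebar(sections, ["cluster", "thesaurus"])
for term, row in zip(["cluster", "thesaurus"], rows):
    print(f"{term:10s} [{row}]")
```

Reading down a column shows which sections contain several query terms together, which is the pattern the graphical display is meant to expose.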
Self Organizing
Maps (SOM)
18
Automatic Thesaurus Construction
Approach
• Select a subject domain.
• Choose a corpus of documents that cover the domain.
• Create a vocabulary by extracting terms, normalizing them,
precoordinating phrases, etc.
• Devise a measure of similarity between terms and
thesaurus classes.
• Cluster terms into thesaurus classes, using complete
linkage or another clustering method that generates compact
clusters.
19
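The clustering step can be sketched as follows: a minimal complete-linkage procedure in plain Python, assuming a term-to-term similarity function and a stopping threshold. The similarity values, the threshold, and the names `complete_linkage` and `toy_sim` are hypothetical.

```python
def complete_linkage(terms, sim, threshold):
    """Agglomerative clustering with complete linkage.

    The similarity of two clusters is the similarity of their LEAST
    similar pair of members, which keeps clusters compact. Merging stops
    when no pair of clusters reaches `threshold`.
    """
    clusters = [[t] for t in terms]

    def link(c1, c2):
        return min(sim(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        best, i, j = max(
            (link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical similarities; unlisted pairs count as 0.0.
TOY_SIM = {
    frozenset(["car", "automobile"]): 0.9,
    frozenset(["car", "vehicle"]): 0.7,
    frozenset(["automobile", "vehicle"]): 0.6,
    frozenset(["banana", "fruit"]): 0.8,
}


def toy_sim(a, b):
    return TOY_SIM.get(frozenset([a, b]), 0.0)


terms = ["car", "automobile", "vehicle", "banana", "fruit"]
classes = complete_linkage(terms, toy_sim, threshold=0.5)
```

Each resulting list is a candidate thesaurus class. Because a merge needs every cross-cluster pair to reach the threshold, one weak link is enough to keep two groups apart.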
Decisions in creating a thesaurus
1. Which terms should be included in the thesaurus?
2. How should the terms be grouped?
20
Terms to include
• Only terms that are likely to be of interest for content
identification
• Ambiguous terms should be coded for the senses likely
to be important in the document collection
• Each thesaurus class should have approximately the
same frequency of occurrence
• Terms of negative discrimination should be eliminated
after Salton and McGill
21
Discriminant value
Discriminant value is the degree to which a term is
able to discriminate between the documents of a
collection:
DV(k) = (average document similarity without term k)
- (average document similarity with term k)
Good discriminators decrease the average document
similarity, so removing one raises the average and its
discriminant value is positive.
Note that this definition uses the document similarity.
22
Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
     alpha  bravo  charlie  delta  echo  foxtrot  golf  total
D1     1      1       1       1      1      1       1      7
D2     1      0       0       1      0      0       1      3
D3     0      1       1       0      1      1       0      4
D4     1      0       0       1      0      1       1      4
23
Document similarity matrix
      D1     D2     D3     D4
D1    -      0.65   0.76   0.76
D2    0.65   -      0.00   0.87
D3    0.76   0.00   -      0.25
D4    0.76   0.87   0.25   -
Average similarity = 0.55
24
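The slide does not name the similarity measure. The sketch below assumes cosine similarity on binary incidence vectors (shared terms divided by the square root of the product of the two term counts), which agrees with the figures shown for the four example documents. The function names are hypothetical.

```python
from math import sqrt

DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}


def incidence(docs):
    """Binary incidence: each document becomes the set of its distinct terms."""
    return {name: set(text.split()) for name, text in docs.items()}


def cosine(a, b):
    """Cosine similarity of two binary vectors given as term sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def average_similarity(sets):
    """Mean similarity over all unordered pairs of distinct documents."""
    names = sorted(sets)
    pairs = [(x, y) for i, x in enumerate(names) for y in names[i + 1:]]
    return sum(cosine(sets[x], sets[y]) for x, y in pairs) / len(pairs)


sets = incidence(DOCS)
for x in sorted(sets):
    print(x, [round(cosine(sets[x], sets[y]), 2) for y in sorted(sets) if y != x])
print("average:", round(average_similarity(sets), 2))
```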
Discriminant value
Average similarity = 0.55
term      average similarity without term    DV
alpha     0.53                               -0.02
bravo     0.56                               +0.01
charlie   0.56                               +0.01
delta     0.53                               -0.02
echo      0.56                               +0.01
foxtrot   0.52                               -0.03
golf      0.53                               -0.02
By the definition of discriminant value, bravo, charlie, and echo
(positive DV) are the good discriminators; alpha, delta, foxtrot,
and golf (negative DV) are not.
25
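A short sketch that recomputes the discriminant values for the four example documents, assuming cosine similarity on binary incidence vectors (an assumption; the slides do not name the measure). The helper names are hypothetical.

```python
from itertools import combinations
from math import sqrt

DOCS = [
    "alpha bravo charlie delta echo foxtrot golf",  # D1
    "golf golf golf delta alpha",                   # D2
    "bravo charlie bravo echo foxtrot bravo",       # D3
    "foxtrot alpha alpha golf golf delta",          # D4
]


def avg_sim(docs, drop=None):
    """Average pairwise cosine similarity, optionally ignoring term `drop`."""
    sets = [set(d.split()) - {drop} for d in docs]
    total, pairs = 0.0, 0
    for a, b in combinations(sets, 2):
        pairs += 1
        if a and b:
            total += len(a & b) / sqrt(len(a) * len(b))
    return total / pairs


def discriminant_values(docs):
    """DV(k) = (average similarity without k) - (average similarity with k)."""
    base = avg_sim(docs)
    vocab = sorted({w for d in docs for w in d.split()})
    return {term: avg_sim(docs, drop=term) - base for term in vocab}


dv = discriminant_values(DOCS)
for term, value in dv.items():
    print(f"{term:8s} without: {avg_sim(DOCS, drop=term):.2f}  DV: {value:+.3f}")
```

The "without" averages printed here round to the values in the table. The DV column on the slide appears to be the difference of the two rounded averages, so the unrounded differences can differ from it by 0.01 in the last digit while keeping the same sign.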
Phrase construction
In a thesaurus, term classes may contain phrases.
Informal definitions:
pair-frequency (i, j) is the frequency with which a pair of words
occurs in context (e.g., in succession within a sentence)
A phrase is a pair of words, i and j, that occur in context with a
higher frequency than would be expected from their overall
frequency
cohesion (i, j) = pair-frequency (i, j) / (frequency(i) * frequency(j))
26
Phrase construction
Salton and McGill algorithm
1. Compute pair-frequency for all terms.
2. Reject all pairs that fall below a certain threshold.
3. Calculate cohesion values.
4. If the cohesion is above a threshold value, consider the word pair
a phrase.
Automatic phrase construction by statistical methods is rarely
used in practice. There is promising research on phrase
identification using methods of computational linguistics.
27
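The four steps above can be sketched as follows. This is a minimal illustration that takes "in context" to mean adjacent words within a sentence and uses whitespace tokenization. The threshold values, the sample sentences, and the function name `find_phrases` are assumptions made for illustration.

```python
from collections import Counter


def find_phrases(sentences, min_pair_freq, min_cohesion):
    """Return {(i, j): cohesion} for word pairs accepted as phrases."""
    word_freq = Counter()
    pair_freq = Counter()
    # Step 1: count words and adjacent word pairs within each sentence.
    for sentence in sentences:
        words = sentence.lower().split()
        word_freq.update(words)
        pair_freq.update(zip(words, words[1:]))

    phrases = {}
    for (i, j), pf in pair_freq.items():
        # Step 2: reject pairs whose pair-frequency is below the threshold.
        if pf < min_pair_freq:
            continue
        # Step 3: cohesion = pair-frequency / (frequency(i) * frequency(j)).
        cohesion = pf / (word_freq[i] * word_freq[j])
        # Step 4: keep the pair if its cohesion is above the threshold.
        if cohesion > min_cohesion:
            phrases[(i, j)] = cohesion
    return phrases


# Hypothetical sample text.
sentences = [
    "information retrieval uses term weighting",
    "term weighting improves information retrieval",
    "the term is the unit of information",
]
phrases = find_phrases(sentences, min_pair_freq=2, min_cohesion=0.3)
```

With these inputs the pairs ("information", "retrieval") and ("term", "weighting") each occur twice and have cohesion 2 / (3 * 2), so both pass. Raising `min_cohesion` above one third rejects them.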