CS 430 / INFO 430
Information Retrieval
Lecture 27
Classification 2
1
Course Administration
2
Cluster Analysis
Methods that divide a set of n objects into m nonoverlapping subsets.
For information discovery, cluster analysis is applied to
• terms for thesaurus construction
• documents to divide into categories (sometimes called
automatic classification, but classification usually
requires a pre-determined set of categories).
3
Cluster Analysis Metrics
• Documents clustered on the basis of a similarity
measure calculated from the terms that they contain.
• Documents clustered on the basis of co-occurring
citations.
• Terms clustered on the basis of the documents in which
they co-occur.
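As an illustration of the first bullet, the sketch below computes a cosine similarity between two documents from their term counts. Cosine is only one possible choice; the slide does not name a measure. The sample documents and the function name `cosine_similarity` are made up for illustration.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-count vectors of two documents."""
    counts_a = Counter(doc_a.split())
    counts_b = Counter(doc_b.split())
    # Dot product over the terms that the two documents share.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical sample documents.
same = cosine_similarity("cluster analysis of terms", "cluster analysis of terms")
disjoint = cosine_similarity("cluster analysis", "information retrieval")
partial = cosine_similarity("cluster analysis", "cluster methods")
print(same, disjoint, partial)
```

Identical documents score 1, documents with no shared terms score 0, and partial overlap falls in between.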
4
Non-hierarchical and Hierarchical Methods
Non-hierarchical methods
Elements are divided into m non-overlapping sets where m is
predetermined.
Hierarchical methods
m is varied progressively to create a hierarchy of solutions.
Agglomerative methods
m is initially equal to n, the total number of elements, where
every element is considered to be a cluster with one element.
The hierarchy is produced by incrementally combining
clusters.
5
Simple Hierarchical Methods:
Single Link
Concept
[Figure: points marked x, grouped into clusters]
Similarity between clusters is similarity
between most similar elements
6
Single Link
A simple agglomerative method.
Initially, each element is its own cluster with one element.
At each step, calculate the similarity between each pair
of clusters as the similarity between the most similar pair
of elements, one from each cluster. Merge the two clusters
that are most similar.
May lead to long, straggling clusters (chaining).
Very simple computation.
7
Similarities: Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

     alpha  bravo  charlie  delta  echo  foxtrot  golf
D1   1      1      1        1      1     1        1
D2   1      0      0        1      0     0        1
D3   0      1      1        0      1     1        0
D4   1      0      0        1      0     1        1
n    3      2      2        3      2     3        3
8
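The incidence array can be rebuilt from the four sample documents. This is a minimal sketch; the variable names `docs`, `terms`, `incidence` and `n` are chosen here for illustration.

```python
# The four sample documents from the slide.
docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

terms = sorted({t for text in docs.values() for t in text.split()})

# incidence[d][t] is 1 if term t occurs in document d (repeats are ignored).
incidence = {
    d: {t: int(t in text.split()) for t in terms}
    for d, text in docs.items()
}

# n[t] is the number of documents that contain term t.
n = {t: sum(incidence[d][t] for d in docs) for t in terms}

for d in docs:
    print(d, [incidence[d][t] for t in terms])
print("n ", [n[t] for t in terms])
```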
Term similarity matrix
Using incidence matrix and dice weighting

         alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha    -      0.2    0.2      0.5    0.2   0.33     0.5
bravo           -      0.5      0.2    0.5   0.4      0.2
charlie                -        0.2    0.5   0.4      0.2
delta                           -      0.2   0.33     0.5
echo                                   -     0.4      0.2
foxtrot                                      -        0.33
golf                                                  -
9
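The values in the matrix equal the number of documents two terms share, divided by the sum of their document counts (for example alpha and bravo share only D1, giving 1 / (3 + 2) = 0.2). This is half of the usual Dice coefficient 2c / (n_i + n_j), so the maximum is 0.5. The sketch below recomputes the matrix on that reading; the names `postings` and `term_similarity` are chosen here for illustration.

```python
from itertools import combinations

# Documents containing each term, read off the incidence array.
postings = {
    "alpha":   {"D1", "D2", "D4"},
    "bravo":   {"D1", "D3"},
    "charlie": {"D1", "D3"},
    "delta":   {"D1", "D2", "D4"},
    "echo":    {"D1", "D3"},
    "foxtrot": {"D1", "D3", "D4"},
    "golf":    {"D1", "D2", "D4"},
}

def term_similarity(s, t):
    """Shared documents divided by the sum of the two document counts."""
    shared = len(postings[s] & postings[t])
    return shared / (len(postings[s]) + len(postings[t]))

# One entry per unordered pair of terms, rounded as on the slide.
similarity = {
    (s, t): round(term_similarity(s, t), 2)
    for s, t in combinations(sorted(postings), 2)
}

for (s, t), value in similarity.items():
    print(f"{s:8} {t:8} {value}")
```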
Example -- single link
[Figure: dendrogram under construction, merge level 1 shown. Leaves, left to right: alpha, delta, golf, bravo, echo, charlie, foxtrot.]
Agglomerative: step 1
10
Example -- single link
[Figure: dendrogram under construction, merge levels 1 and 2 shown. Leaves, left to right: alpha, delta, golf, bravo, echo, charlie, foxtrot.]
Agglomerative: step 2
11
Example -- single link
[Figure: dendrogram under construction, merge levels 1 to 3 shown. Leaves, left to right: alpha, delta, golf, bravo, echo, charlie, foxtrot.]
Agglomerative: step 3
12
Example -- single link
[Figure: completed dendrogram, merge levels 1 to 6. Leaves, left to right: alpha, delta, golf, bravo, echo, charlie, foxtrot.]
This style of diagram is called a dendrogram.
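The single link procedure can be sketched on the term similarity matrix as follows. This is an illustration, not the lecturer's code: the names `SIM`, `sim` and `single_link` are chosen here, terms are abbreviated to their first letters, and ties are broken by taking the first pair found, so the order of the merges at similarity 0.5 may differ from the order drawn in the figure.

```python
from itertools import combinations

# Term similarities from the slide (a = alpha, b = bravo, ..., g = golf).
SIM = {
    "ab": .2, "ac": .2, "ad": .5, "ae": .2, "af": .33, "ag": .5,
    "bc": .5, "bd": .2, "be": .5, "bf": .4, "bg": .2,
    "cd": .2, "ce": .5, "cf": .4, "cg": .2,
    "de": .2, "df": .33, "dg": .5,
    "ef": .4, "eg": .2,
    "fg": .33,
}

def sim(x, y):
    return SIM[min(x, y) + max(x, y)]

def single_link(elements):
    """Return the merges as (cluster, cluster, similarity) tuples."""
    clusters = [(e,) for e in elements]
    merges = []
    while len(clusters) > 1:
        best = None
        for p, q in combinations(clusters, 2):
            # Single link: cluster similarity is that of the MOST similar pair.
            s = max(sim(x, y) for x in p for y in q)
            if best is None or s > best[2]:
                best = (p, q, s)
        p, q, s = best
        merges.append(best)
        # Put the merged cluster where p was, so ties resolve left to right.
        clusters = [p + q if c == p else c for c in clusters if c != q]
    return merges

merges = single_link("abcdefg")
for p, q, s in merges:
    print("merge", "".join(p), "+", "".join(q), "at", s)
```

Six merges reduce the seven terms to one cluster, which corresponds to the six levels of the dendrogram.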
13
Simple Hierarchical Methods:
Complete Linkage
Concept
[Figure: points marked x, grouped into clusters]
Similarity between clusters is similarity
between least similar elements
14
Complete linkage
A simple agglomerative method.
Initially, each element is its own cluster with one element.
At each step, calculate the similarity between each pair
of clusters as the similarity between the least similar pair
of elements in the two clusters. Merge the two clusters
that are most similar.
Generates small, tightly bound clusters.
15
Term similarity matrix
Using incidence matrix and dice weighting

         alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha    -      0.2    0.2      0.5    0.2   0.33     0.5
bravo           -      0.5      0.2    0.5   0.4      0.2
charlie                -        0.2    0.5   0.4      0.2
delta                           -      0.2   0.33     0.5
echo                                   -     0.4      0.2
foxtrot                                      -        0.33
golf                                                  -
16
Example -- complete linkage
Least similar pair / similarity, by cluster elements

    a       b       c       d       e       f       g
a   -       ab/.2   ac/.2   ad/.5   ae/.2   af/.33  ag/.5
b           -       bc/.5   bd/.2   be/.5   bf/.4   bg/.2
c                   -       cd/.2   ce/.5   cf/.4   cg/.2
d                           -       de/.2   df/.33  dg/.5
e                                   -       ef/.4   eg/.2
f                                           -       fg/.33
g                                                   -

Step 1. Merge clusters {a} and {d}
17
Example -- complete linkage
Least similar pair / similarity, by cluster elements

      a,d     b       c       e       f       g
a,d   -       ab/.2   ac/.2   ae/.2   df/.33  ag/.5
b             -       bc/.5   be/.5   bf/.4   bg/.2
c                     -       ce/.5   cf/.4   cg/.2
e                             -       ef/.4   eg/.2
f                                     -       fg/.33
g                                             -

Step 2. Merge clusters {a,d} and {g}
18
Example -- complete linkage
Least similar pair / similarity, by cluster elements

        a,d,g   b       c       e       f
a,d,g   -       ab/.2   ac/.2   ae/.2   af/.33
b               -       bc/.5   be/.5   bf/.4
c                       -       ce/.5   cf/.4
e                               -       ef/.4
f                                       -

Step 3. Merge clusters {b} and {c}
19
Example -- complete linkage
Least similar pair / similarity, by cluster elements

        a,d,g   b,c     e       f
a,d,g   -       ab/.2   ae/.2   af/.33
b,c             -       be/.5   bf/.4
e                       -       ef/.4
f                               -

Step 4. Merge clusters {b,c} and {e}
20
Example -- complete linkage
[Figure: dendrogram with merge levels labelled Step 1 to Step 6. Leaves, left to right: alpha, delta, golf, bravo, charlie, echo, foxtrot.]
21
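The complete linkage example can be replayed with a short sketch. The names `SIM`, `sim`, `complete_link_similarity` and `agglomerate` are chosen here for illustration, and ties are broken by taking the first pair found. Under that assumption the first four merges match Steps 1 to 4 above.

```python
from itertools import combinations

# Term similarities from the slide (a = alpha, b = bravo, ..., g = golf).
SIM = {
    "ab": .2, "ac": .2, "ad": .5, "ae": .2, "af": .33, "ag": .5,
    "bc": .5, "bd": .2, "be": .5, "bf": .4, "bg": .2,
    "cd": .2, "ce": .5, "cf": .4, "cg": .2,
    "de": .2, "df": .33, "dg": .5,
    "ef": .4, "eg": .2,
    "fg": .33,
}

def sim(x, y):
    return SIM[min(x, y) + max(x, y)]

def complete_link_similarity(p, q):
    # Complete linkage: cluster similarity is that of the LEAST similar pair.
    return min(sim(x, y) for x in p for y in q)

def agglomerate(elements, cluster_similarity):
    """Merge the most similar pair of clusters until one cluster remains."""
    clusters = [(e,) for e in elements]
    merges = []
    while len(clusters) > 1:
        # max() returns the first of several equally similar pairs.
        p, q = max(combinations(clusters, 2),
                   key=lambda pair: cluster_similarity(*pair))
        merges.append(("".join(p), "".join(q), cluster_similarity(p, q)))
        # The merged cluster takes the place of p in the list.
        clusters = [p + q if c == p else c for c in clusters if c != q]
    return merges

steps = agglomerate("abcdefg", complete_link_similarity)
for number, (p, q, s) in enumerate(steps, start=1):
    print(f"Step {number}. Merge {{{p}}} and {{{q}}} at similarity {s}")
```

The last merge joins the two remaining clusters at similarity 0.2, the value of their least similar pair.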
Non-Hierarchical Methods: K-means
1. Define a distance measure between any two points in the
space (e.g., square of distance).
2. Choose k points as initial group centroids.
3. Assign each object to the group that has the closest
centroid.
4. When all objects have been assigned, recalculate the
positions of the k centroids.
5. Repeat Steps 3 and 4 until the centroids no longer move.
This produces a separation of the objects into groups from
which the metric to be minimized can be calculated.
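The steps above can be sketched in a few lines. This is a minimal illustration, assuming squared Euclidean distance, points given as tuples of numbers, and initial centroids supplied by the caller. The sample points, the function names and the `max_iterations` guard are not from the slides.

```python
def squared_distance(p, q):
    # Step 1: the measure used here is the square of Euclidean distance.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centroids, max_iterations=100):
    """Steps 3 to 5: assign, recompute, repeat until centroids stop moving."""
    for _ in range(max_iterations):
        # Step 3: assign each object to the group with the closest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        # Step 4: recalculate each centroid as the mean of its group.
        # An empty group keeps its old centroid (an assumption of this sketch).
        new_centroids = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        # Step 5: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups

# Hypothetical data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
# Step 2: choose k = 2 initial centroids (here simply two of the points).
centroids, groups = k_means(points, [points[0], points[3]])
print(centroids)
print(groups)
```

With other initial centroids the same data may settle into a different grouping.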
22
K-means
• Iteration converges under a very general set of conditions
• Results depend on the choice of the k initial centroids
• Methods can be used to generate a sequence of solutions
for k increasing from 1 to n. Note that, in general, the
results will not be hierarchical.
23
Problems with cluster analysis in
information retrieval
• Selection of attributes on which items are
clustered
• Choice of similarity measure and algorithm
• Computational resources
• Assessing validity and stability of clusters
• Updating clusters as data changes
• Method for using the clusters in information
retrieval
24
Example 1: Concept Spaces for
Scientific Terms
Large-scale searches can only match terms specified by the
user to terms appearing in documents. Cluster analysis can
be used to provide information retrieval by concepts, rather
than by terms.
Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph B.
Hardin, Ann P. Bishop (University of Illinois), Hsinchun Chen
(University of Arizona), Federating Diverse Collections of
Scientific Literature, IEEE Computer, May 1996.
25
Concept Spaces: Methodology
Concept space:
A similarity matrix based on co-occurrence of terms.
Approach:
Use cluster analysis to generate "concept spaces"
automatically, i.e., clusters of terms that embrace a single
semantic concept.
Arrange concepts in a hierarchical classification.
26
Concept Spaces: INSPEC Data
Data set 1: All terms in 400,000 records from INSPEC, containing
270,000 terms with 4,000,000 links.
computer-aided instruction
see also education
UF teaching machines
BT educational computing
TT computer applications
RT education
RT teaching
[24.5 hours of CPU on 16-node Silicon Graphics supercomputer.]
27
Concept Space: Compendex Data
Data set 2:
(a) 4,000,000 abstracts from the Compendex database covering all
of engineering as the collection, partitioned along classification
code lines into some 600 community repositories.
[Four days of CPU on 64-processor Convex Exemplar.]
(b) In the largest experiment, 10,000,000 abstracts were divided
into sets of 100,000 and the concept space for each set generated
separately. The sets were selected by the existing classification
scheme.
28
Objectives
• Semantic retrieval (using concept spaces for term suggestion)
• Semantic interoperability (vocabulary switching across subject domains)
• Semantic indexing (concept identification of document content)
• Information representation (information units for uniform manipulation)
29
Use of Concept Space: Term Suggestion
30
Future Use of Concept Space:
Vocabulary Switching
"I'm a civil engineer who designs bridges. I'm interested in
using fluid dynamics to compute the structural effects of
wind currents on long structures. Ocean engineers who
design undersea cables probably do similar computations
for the structural effects of water currents on long
structures. I want you [the system] to change my civil
engineering fluid dynamics terms into the ocean
engineering terms and search the undersea cable literature."
31
Example 2: Visual thesaurus for
geographic images
Methodology:
• Divide images into small regions.
• Create a similarity measure based on properties of these images.
• Use cluster analysis tools to generate clusters of similar images.
• Provide alternative representations of clusters.
Marshall Ramsey, Hsinchun Chen, Bin Zhu, A Collection of
Visual Thesauri for Browsing Large Collections of Geographic
Images, May 1997.
http://ai.bpa.arizona.edu/~mramsey/papers/visualThesaurus/visualThesaurus.html
32
33
The End
[Figure labels: Search index, Return hits, Scan results, Browse content, Return objects]
34