Clustering… in General


In vector space, clusters are vectors found within a distance ε of a cluster
vector, with different techniques for determining the cluster vector and ε.
Clustering is unsupervised pattern classification.

Unsupervised means there is no correct answer or feedback.
Patterns typically are samples of feature vectors or matrices.
Classification means collecting the samples into groups of similar members.
Clustering Decisions

Pattern representation
  feature selection (e.g., stop word removal, stemming)
  number of categories
Pattern proximity
  distance measure on pairs of patterns
Grouping
  characteristics of clusters (e.g., fuzzy, hierarchical)
Clustering algorithms embody different assumptions
about these decisions and the form of clusters.
Formal Definitions



Feature vector x is a single datum of d
measurements.
Hard clustering techniques assign a
class label to each cluster; members of
clusters are mutually exclusive.
Fuzzy clustering techniques assign a
fractional degree of membership to
each label for each x.
Proximity Measures


Generally, use Euclidean distance or mean
squared distance.
In IR, use similarity measure from
retrieval (e.g., cosine measure for TFIDF).
[Jain, Murty & Flynn]
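
As a quick illustration of these two proximity measures, here is a minimal
Python sketch (assuming NumPy is available; the TFIDF vectors and the 4-term
vocabulary are made up for illustration):

import numpy as np

def euclidean_distance(x1, x2):
    # Standard Euclidean distance between two feature vectors.
    return float(np.linalg.norm(np.asarray(x1, float) - np.asarray(x2, float)))

def cosine_similarity(d1, d2):
    # Cosine of the angle between two TFIDF vectors (higher = more similar).
    denom = np.linalg.norm(d1) * np.linalg.norm(d2)
    return float(np.dot(d1, d2) / denom) if denom else 0.0

# Illustrative TFIDF vectors over a 4-term vocabulary
doc_a = np.array([0.0, 0.4, 0.7, 0.1])
doc_b = np.array([0.1, 0.5, 0.6, 0.0])
print(cosine_similarity(doc_a, doc_b))          # near 1.0 => similar documents
print(euclidean_distance((21, 15), (26, 25)))   # ~11.2, as in the proximity matrix below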
Taxonomy of Clustering
Clustering
  Hierarchical
    Single Link
    Complete Link
    HAC
  Partitional
    Square Error (k-means)
    Graph Theoretic
    Mixture Resolving (Expectation-Maximization)
    Mode Seeking
Clustering Issues
Agglomerative: begin with each sample in its own cluster and merge.
Divisive: begin with a single cluster and split.
Hard: mutually exclusive cluster membership.
Fuzzy: degrees of membership in clusters.
Deterministic vs. stochastic.
Incremental: samples may be added to clusters.
Batch: clusters created over the entire sample space.
Hierarchical Algorithms
[Figure: dendrogram in which documents D1, D3, D2, D4 merge into clusters
C1,3 ; C1,3,2 ; C1,3,2,4 at levels 0.99, 0.29, and 0.00]

Produce a hierarchy of classes (a taxonomy) from singleton clusters up to
just one cluster.
Select a level for extracting the cluster set.
The representation is a dendrogram.
Complete-Link Revisited


Used to create a statistical thesaurus
Agglomerative, hard, deterministic, batch
1. Start with 1 cluster per sample
2. Find the two clusters with the lowest distance
3. Merge the two clusters and add them to the hierarchy
4. Repeat from 2 until a termination criterion is met or all clusters have merged
   (a minimal code sketch of this loop follows)
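
The slides give no code for this loop, so the following is only a sketch of
the steps above in Python (NumPy assumed; the function name complete_link and
the fixed-cluster-count termination criterion are my own choices), run here on
the eight points from the proximity-matrix example further down:

import numpy as np

def complete_link(points, num_clusters):
    # Naive complete-link agglomerative clustering: repeatedly merge the pair
    # of clusters whose *maximum* pairwise sample distance is smallest.
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]        # 1. one cluster per sample

    def cluster_distance(a, b):
        # complete link: cluster distance = max distance over all sample pairs
        return max(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:                  # 4. termination criterion
        # 2. find the two clusters with the lowest distance
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]          # 3. merge the two clusters
        del clusters[b]
    return clusters

# The eight points used in the proximity-matrix example
pts = [(21, 15), (26, 25), (29, 22), (31, 15), (21, 27), (23, 32), (29, 26), (33, 21)]
print(complete_link(pts, num_clusters=2))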
Single-Link

Like Complete-Link except it uses the minimum of distances between all pairs
of samples in the two clusters (complete-link uses the maximum).
Single-link has a chaining effect with elongated clusters, but it can
construct more complex shapes.
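
If SciPy is available, the two variants can be compared directly; this is just
a sketch using scipy.cluster.hierarchy (the choice of cutting the hierarchy
into two clusters is arbitrary):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([(21, 15), (26, 25), (29, 22), (31, 15),
                (21, 27), (23, 32), (29, 26), (33, 21)], dtype=float)

# 'single' merges on the minimum pairwise distance between clusters,
# 'complete' on the maximum.
for method in ("single", "complete"):
    Z = linkage(pts, method=method)                  # full merge hierarchy (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the hierarchy into 2 clusters
    print(method, labels)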
Example: Plot

[Figure: scatter plot of the sample points, both axes running from 0 to 50]
Example: Proximity Matrix
         21,15  26,25  29,22  31,15  21,27  23,32  29,26  33,21
21,15      0    11.2   10.6   10.0   12.0   17.1   13.6   13.4
26,25             0     4.2   11.1    5.4    7.6    3.2    8.1
29,22                    0     7.3    9.4   11.7    4.0    4.1
31,15                          0     15.6   18.8   11.2    6.3
21,27                                 0      5.4    8.1   13.4
23,32                                        0      8.5   14.9
29,26                                               0      6.4
33,21                                                      0

(Upper-triangular matrix of Euclidean distances between the eight sample points.)
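
The matrix above is just the pairwise Euclidean distances between the eight
points; a short NumPy sketch that reproduces it (rounding to one decimal as on
the slide):

import numpy as np

pts = np.array([(21, 15), (26, 25), (29, 22), (31, 15),
                (21, 27), (23, 32), (29, 26), (33, 21)], dtype=float)

# Pairwise Euclidean distance matrix via broadcasting
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
print(np.round(dist, 1))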
Complete-Link Solution
[Figure: complete-link dendrogram over the 16 sample points (1,28), (9,16),
(4,9), (13,18), (21,15), (21,27), (29,26), (23,32), (26,25), (31,15), (46,30),
(29,22), (33,21), (35,35), (45,42), (42,45), with merge nodes C1 through C15]
Single-Link Solution
[Figure: single-link dendrogram over the same 16 sample points, with merge
nodes C1 through C15]
Hierarchical Agglomerative Clustering (HAC)

Agglomerative, hard, deterministic, batch
1. Start with 1 cluster per sample and compute a proximity matrix between
   pairs of clusters.
2. Merge the most similar pair of clusters and update the proximity matrix.
3. Repeat 2 until all clusters are merged.
The difference among HAC variants is in how the proximity matrix is updated,
giving the ability to combine benefits of both the single- and complete-link
algorithms.
HAC for IR

Intra-cluster Similarity

Sim(X) = \sum_{d \in X} cos(d, c),   where   c = (1/|S|) \sum_{d \in S} d

where S is the set of TFIDF vectors for the documents in cluster X, c is the
centroid of cluster X, and d is a document vector.
Proximity is the similarity of all documents to the cluster centroid.
Select the pair of clusters that produces the smallest decrease in similarity,
e.g., if merge(X,Y) => Z, then choose max[Sim(Z) - (Sim(X) + Sim(Y))].
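
A small NumPy sketch of this criterion (the function names are mine; document
vectors are assumed to be nonzero TFIDF rows of a matrix):

import numpy as np

def centroid(S):
    # c = (1/|S|) * sum of the document vectors in S
    return np.asarray(S, dtype=float).mean(axis=0)

def intra_cluster_similarity(S):
    # Sim(X) = sum over documents d in X of cos(d, c), c the cluster centroid
    S = np.asarray(S, dtype=float)
    c = centroid(S)
    c_norm = np.linalg.norm(c)
    return sum(float(np.dot(d, c) / (np.linalg.norm(d) * c_norm)) for d in S)

def merge_score(X, Y):
    # Change in similarity if X and Y were merged into Z; pick the pair that
    # maximizes this, i.e., the smallest decrease in similarity.
    Z = np.vstack([np.asarray(X, float), np.asarray(Y, float)])
    return intra_cluster_similarity(Z) - (intra_cluster_similarity(X) + intra_cluster_similarity(Y))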

HAC for IR - Alternatives

Centroid Similarity

Sim(X, Y) = cos(c_X, c_Y),   where   c = (1/|S|) \sum_{d \in S} d

i.e., the cosine similarity between the centroids of the two clusters.

UPGMA

Sim(X, Y) = ( \sum_{d_1 \in X, d_2 \in Y} cos(d_1, d_2) ) / (|X| * |Y|)

i.e., the average pairwise cosine similarity between documents in the two clusters.
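
The two alternatives, again as rough NumPy sketches (function names are mine):

import numpy as np

def cos_sim(d1, d2):
    return float(np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2)))

def centroid_similarity(X, Y):
    # Sim(X, Y) = cos(c_X, c_Y): cosine between the two cluster centroids
    c_x = np.asarray(X, float).mean(axis=0)
    c_y = np.asarray(Y, float).mean(axis=0)
    return cos_sim(c_x, c_y)

def upgma_similarity(X, Y):
    # Sim(X, Y): average pairwise cosine similarity across the two clusters
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    return sum(cos_sim(d1, d2) for d1 in X for d2 in Y) / (len(X) * len(Y))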
Partitional Algorithms

Results in a set of unrelated clusters.
Issues:
  how many clusters is enough?
  how to search the space of possible partitions?
  what is an appropriate clustering criterion?
K Means

Number of clusters is set by the user to be k.
Non-deterministic.
Clustering criterion is squared error:

e(S, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} || x_i^{(j)} - c_j ||^2

where S is the document set, L is a clustering, K is the number of clusters,
x_i^{(j)} is the ith document in the jth cluster, and c_j is the centroid of
the jth cluster.
k-Means Clustering Algorithm
1. Randomly select k samples as cluster centroids.
2. Assign each pattern to the closest cluster centroid.
3. Recompute centroids.
4. If a convergence criterion (e.g., minimal decrease in error or no change in
   cluster composition) is not met, return to 2.
   (A minimal code sketch of these steps follows.)
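
A minimal NumPy sketch of these four steps (this is not the lecture's code;
the convergence test used here is "no change in cluster composition", and an
empty cluster simply keeps its old centroid):

import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # 1. Randomly select k samples as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # 2. Assign each pattern to the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        # 4. Convergence criterion: no change in cluster composition.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3. Recompute centroids.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

pts = [(21, 15), (26, 25), (29, 22), (31, 15), (21, 27), (23, 32), (29, 26), (33, 21)]
labels, centroids = k_means(pts, k=2)
print(labels, centroids, sep="\n")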
Example: K-Means Solutions

[Figure: k-means solutions plotted over the sample points, both axes running
from 0 to 50]
k-Means Sensitivity to Initialization

[Figure: seven points labelled A through G clustered with K=3; the red
solution started with centroids A, D, F and the yellow solution with A, B, C]
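
The sensitivity can be demonstrated with scikit-learn, if it is available, by
running k-means with a single random initialization per seed; the seven 2-D
points below only stand in for A..G and are made up:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates standing in for the points A..G on the slide
pts = np.array([(1, 1), (2, 1), (1.5, 2), (8, 8), (9, 8), (8.5, 9), (5, 15)], dtype=float)

# With a single random initialization per run (n_init=1), different seeds can
# converge to different local minima of the squared-error criterion.
for seed in (0, 1, 2):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(pts)
    print(seed, km.labels_, round(km.inertia_, 2))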
k-Means for IR

Update centroids incrementally.
Calculate centroids as with the hierarchical methods.
Can be refined into a divisive hierarchical method by starting with a single
cluster and splitting with k-means until it forms k clusters with the highest
summed similarities (bisecting k-means; see the sketch below).
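
A rough sketch of bisecting k-means using scikit-learn for the 2-way splits;
the rule of always splitting the largest remaining cluster is an assumption
(Steinbach et al. also consider splitting by lowest similarity):

import numpy as np
from sklearn.cluster import KMeans

def bisecting_k_means(points, k, seed=0):
    points = np.asarray(points, dtype=float)
    clusters = [np.arange(len(points))]          # start with a single cluster
    while len(clusters) < k:
        # choose the largest cluster and split it in two with ordinary k-means
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        target = clusters.pop(idx)
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(points[target])
        clusters.append(target[km.labels_ == 0])
        clusters.append(target[km.labels_ == 1])
    return clusters

pts = [(21, 15), (26, 25), (29, 22), (31, 15), (21, 27), (23, 32), (29, 26), (33, 21)]
print(bisecting_k_means(pts, k=3))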
Other Types of Clustering Algorithms

Graph Theoretic: construct a minimal spanning tree and delete the edges with
the largest lengths.
Expectation-Maximization (EM): assume clusters are drawn from distributions;
use maximum likelihood to estimate the parameters of the distributions.
Nearest Neighbors: iteratively assign each sample to the cluster of its
nearest labelled neighbor, so long as the distance is below a set threshold.
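
A sketch of the graph-theoretic variant with SciPy's spanning-tree and
connected-components routines; the cut rule of deleting the k-1 largest MST
edges to obtain k clusters follows the description above (k must be at least 2
here), and everything else is illustrative:

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

pts = np.array([(21, 15), (26, 25), (29, 22), (31, 15),
                (21, 27), (23, 32), (29, 26), (33, 21)], dtype=float)

# Complete graph of pairwise Euclidean distances
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

k = 2                                            # desired number of clusters
mst = minimum_spanning_tree(dist).toarray()      # minimal spanning tree edge weights
cutoff = np.sort(mst[mst > 0])[-(k - 1)]         # weight of the (k-1)-th largest edge
mst[mst >= cutoff] = 0                           # delete the largest edges
n_components, labels = connected_components(mst, directed=False)
print(n_components, labels)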
Comparison of Clustering Algorithms [Steinbach et al.]

Implemented 3 versions of HAC and 2 versions of k-means.
Compared performance on documents hand-labelled as relevant to one of a set
of classes.
Used well-known data sets (TREC).
Found that UPGMA is the best of the hierarchical methods, but bisecting
k-means seems to do better when considered over many runs.

M. Steinbach, G. Karypis, and V. Kumar. A Comparison of Document Clustering
Techniques. KDD Workshop on Text Mining, 2000.
Evaluation Metrics 1

Evaluation: how to measure cluster quality?

Entropy:

E_j = - \sum_i p_{ij} log(p_{ij})

E_CS = \sum_{j=1}^{m} (n_j / n) * E_j

where p_ij is the probability that a member of cluster j belongs to class i,
n_j is the size of cluster j, m is the number of clusters, n is the number of
docs, and CS is a clustering solution.
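
A small NumPy sketch of this entropy measure; each cluster is represented
simply as the list of true class labels of its members (function names are
mine):

import numpy as np

def cluster_entropy(member_classes):
    # E_j = -sum_i p_ij log(p_ij), p_ij = fraction of cluster j in class i
    _, counts = np.unique(np.asarray(member_classes), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def clustering_entropy(clusters):
    # E_CS = sum_j (n_j / n) * E_j; lower entropy means purer clusters
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)

# Toy example: two clusters, each listed as the true classes of its members
clusters = [["sports", "sports", "politics"], ["politics", "politics"]]
print(clustering_entropy(clusters))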
Comparison Measure 2

F measure: combines precision and recall.
Treat each cluster as the result of a query and each class as the relevant
set of docs.

Recall(i, j) = n_ij / n_i
Precision(i, j) = n_ij / n_j

F(i, j) = 2 * Recall(i, j) * Precision(i, j) / (Precision(i, j) + Recall(i, j))

F = \sum_i (n_i / n) * max_j[ F(i, j) ]

where n_ij is the number of members of class i in cluster j, n_j is the number
of members in cluster j, n_i is the number of members in class i, and n is the
number of docs.
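
And a matching sketch of the F measure over a whole clustering solution, given
the true class label and the assigned cluster label of every document (again,
function and variable names are my own):

import numpy as np

def clustering_f_measure(class_labels, cluster_labels):
    # F = sum_i (n_i / n) * max_j F(i, j), treating each cluster as a query
    # result and each class as the relevant document set.
    class_labels = np.asarray(class_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(class_labels)
    total = 0.0
    for i in np.unique(class_labels):
        n_i = np.sum(class_labels == i)
        best_f = 0.0
        for j in np.unique(cluster_labels):
            n_j = np.sum(cluster_labels == j)
            n_ij = np.sum((class_labels == i) & (cluster_labels == j))
            if n_ij == 0:
                continue
            recall = n_ij / n_i
            precision = n_ij / n_j
            best_f = max(best_f, 2 * recall * precision / (precision + recall))
        total += (n_i / n) * best_f
    return total

# Toy example: six documents with true classes and assigned clusters
print(clustering_f_measure(["a", "a", "a", "b", "b", "b"], [0, 0, 1, 1, 1, 1]))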