AMCS/CS229: Machine Learning
Clustering 2
Xiangliang Zhang
King Abdullah University of Science and Technology
Cluster Analysis
1. Partitioning Methods + EM algorithm
2. Hierarchical Methods
3. Density-Based Methods
4. Clustering quality evaluation
5. How to decide the number of clusters?
6. Summary
Xiangliang Zhang, KAUST AMCS/CS229: Machine Learning
The Quality of Clustering
• For supervised classification we have a variety of measures
to evaluate how good our model is
– Accuracy, precision, recall
• For cluster analysis, the analogous question is how to
evaluate the “goodness” of the resulting clusters?
• But “clusters are in the eye of the beholder”!
• Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Measures of Cluster Validity
Numerical measures for judging various aspects of cluster validity are classified into two types:
External Index: Used to measure the extent to which
cluster labels match externally supplied class labels.
• Purity, Normalized Mutual Information
Internal Index: Used to measure the goodness of a
clustering structure without respect to external
information.
• Sum of Squared Error (SSE)
• Cophenetic correlation coefficient, silhouette
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
Cluster Validity: External Index
The class labels are externally supplied (q classes)
Purity:
Larger purity values indicate better clustering solutions.
• Purity of each cluster $C_r$ of size $n_r$:

$$P(C_r) = \frac{1}{n_r}\max_i\, n_r^i$$

where $n_r^i$ is the number of points of class $i$ assigned to cluster $C_r$.

• Purity of the entire clustering:

$$\mathrm{Purity}(C) = \sum_{r=1}^{k}\frac{n_r}{n}\,P(C_r) \quad\text{or}\quad \mathrm{Purity}(C) = \frac{1}{k}\sum_{r=1}^{k}P(C_r)$$
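As a sketch, the purity computation can be written in a few lines of NumPy; the toy labelings below are hypothetical, chosen so the majority-class counts are 5, 4, and 3 over 17 points, matching the worked example:

```python
import numpy as np

def purity_per_cluster(cluster_ids, class_ids):
    """P(C_r) = max_i n_r^i / n_r for each cluster r."""
    purities = {}
    for r in np.unique(cluster_ids):
        members = class_ids[cluster_ids == r]
        purities[r] = np.bincount(members).max() / len(members)
    return purities

def overall_purity(cluster_ids, class_ids):
    """Weighted purity: sum_r (n_r/n) P(C_r) = (1/n) sum_r max_i n_r^i."""
    n = len(cluster_ids)
    total = sum(np.bincount(class_ids[cluster_ids == r]).max()
                for r in np.unique(cluster_ids))
    return total / n

# Hypothetical toy data: 3 clusters of sizes 6, 6, 5 (17 points in total)
clusters = np.array([0]*6 + [1]*6 + [2]*5)
classes  = np.array([0]*5 + [1]          # cluster 0: 5 points of class 0
                    + [1]*4 + [0] + [2]  # cluster 1: 4 points of class 1
                    + [2]*3 + [0] + [1]) # cluster 2: 3 points of class 2
print(overall_purity(clusters, classes))  # 12/17 ≈ 0.706
```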
Cluster Validity: External Index
Purity example (three clusters, 17 points in total, with majority-class counts 5, 4, and 3):

$$\mathrm{Purity}(C) = \sum_{r=1}^{k}\frac{n_r}{n}\,P(C_r) = \frac{1}{17}\,(5+4+3) = \frac{12}{17} \approx 0.71$$
Cluster Validity: External Index
The class labels are externally supplied (q classes)
NMI (Normalized Mutual Information):

$$\mathrm{NMI}(C,T) = \frac{I(C,T)}{\big(H(C)+H(T)\big)/2}$$

where $I$ is the mutual information between the clustering $C$ and the class labels $T$,

$$I(C,T) = \sum_{r=1}^{k}\sum_{l=1}^{q} \frac{|C_r \cap T_l|}{N}\,\log\frac{N\,|C_r \cap T_l|}{|C_r|\,|T_l|},$$

and $H$ is the entropy:

$$H(C) = -\sum_{r=1}^{k} \frac{|C_r|}{N}\log\frac{|C_r|}{N}, \qquad H(T) = -\sum_{l=1}^{q} \frac{|T_l|}{N}\log\frac{|T_l|}{N}$$
Cluster Validity: External Index
NMI (Normalized Mutual Information):
Larger NMI values indicate better clustering solutions.

$$\mathrm{NMI}(X,Y) = \frac{I(X,Y)}{\big(H(X)+H(Y)\big)/2}$$
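The NMI formulas above can be sketched directly in NumPy (the toy labelings are hypothetical; the natural log is used, which cancels in the ratio):

```python
import numpy as np

def entropy(labels):
    # H = -sum_k (n_k/N) log(n_k/N)
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_info(c, t):
    # I(C,T) = sum_{r,l} (n_rl/N) log(N * n_rl / (n_r * n_l))
    N = len(c)
    mi = 0.0
    for r in np.unique(c):
        for l in np.unique(t):
            n_rl = np.sum((c == r) & (t == l))
            if n_rl > 0:
                mi += (n_rl / N) * np.log(N * n_rl / (np.sum(c == r) * np.sum(t == l)))
    return mi

def nmi(c, t):
    # NMI = I(C,T) / ((H(C) + H(T)) / 2)
    return mutual_info(c, t) / ((entropy(c) + entropy(t)) / 2)

c = np.array([0, 0, 0, 1, 1, 1])  # hypothetical cluster labels
t = np.array([0, 0, 1, 1, 1, 1])  # hypothetical class labels
print(round(nmi(c, t), 4))
```

A perfect clustering (clusters matching classes exactly, up to renaming) gives NMI = 1.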
Internal Measures: SSE
Internal Index: Used to measure the goodness of a
clustering structure without respect to external information
SSE is good for comparing two clustering results:
• average SSE
• SSE curves w.r.t. various K
It can also be used to estimate the number of clusters.
[Figure: left, a scatter plot of a 2-D data set; right, the SSE curve as a function of the number of clusters K (K = 2, …, 30). The "elbow" of the SSE curve suggests a good value of K.]
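A sketch of the SSE-vs-K curve, assuming scikit-learn is available (`KMeans` exposes the SSE as `inertia_`); the three-blob data set below is made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated 2-D blobs of 50 points each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

sse = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # SSE: sum of squared distances to the closest centroid

for k, v in sse.items():
    print(k, round(v, 1))
# SSE decreases with K; the sharp drop should flatten after K=3 (the "elbow")
```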
Internal Measures: Cophenetic correlation coefficient
Cophenetic correlation coefficient:
a measure of how faithfully a dendrogram preserves the
pairwise distances between the original data points.
Compare two hierarchical clusterings of the data
[Figure: a dendrogram over the points A, B, C, D, E, F, with merge heights 0.5, 0.71, 1.00, 1.41, 2.50.]

Compute the correlation coefficient between the original pairwise distances (Dist) and the cophenetic distances (CP):

$$r_{X,Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X\,\sigma_Y}$$
Matlab functions: cophenet
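SciPy's `cophenet` mirrors the Matlab function named above; a small sketch on hypothetical two-blob data, comparing how faithfully different linkage methods preserve the original distances:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Hypothetical data: two well-separated 2-D blobs
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])

dist = pdist(X)  # original pairwise distances ("Dist")
results = {}
for method in ("single", "average", "complete"):
    Z = linkage(X, method=method)
    c, coph_dists = cophenet(Z, dist)  # c = cophenetic correlation with "CP"
    results[method] = c
    print(method, round(c, 3))
```

A value of `c` close to 1 means the dendrogram preserves the pairwise distances well; average linkage often scores highest on this criterion.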
Cluster Analysis
1. Partitioning Methods + EM algorithm
2. Hierarchical Methods
3. Density-Based Methods
4. Clustering quality evaluation
5. How to decide the number of clusters?
6. Summary
Internal Measures: Cohesion and Separation
• Cluster cohesion measures how closely related the objects within a cluster are
= SSE, or the sum of the weights of all links within a cluster.
• Cluster separation measures how distinct or well-separated a cluster is from the other clusters
= the sum of the weights of links between nodes in the cluster and nodes outside the cluster.

[Figure: a graph in which within-cluster edges illustrate cohesion and between-cluster edges illustrate separation.]
Internal Measures: Silhouette Coefficient
• Silhouette Coefficient combines ideas of both cohesion and
separation
• For an individual point $i$:
– Calculate $a_i$ = the average distance of $i$ to the points in its own cluster
– Calculate $b_i$ = the minimum, over the other clusters, of the average distance of $i$ to the points in that cluster
– The silhouette coefficient of the point is then

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}\,, \quad\text{equivalently}\quad s_i = 1 - \frac{a_i}{b_i} \;\text{ when } a_i < b_i$$

o It ranges from −1 to 1; values are typically between 0 and 1 for reasonable clusterings.
o The closer to 1, the better.
• Can calculate the Average Silhouette width for a cluster or a
clustering
Matlab functions: silhouette
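A silhouette sketch using scikit-learn's `silhouette_score` (the Python counterpart of the Matlab function named above), computing the average silhouette width for several values of K on made-up three-blob data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated 2-D blobs of 40 points each
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in ([0, 0], [4, 0], [2, 4])])

# Average silhouette width for each K; the maximum suggests the number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # with three well-separated blobs this should be 3
```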
Determine number of clusters by Silhouette Coefficient
Compare different clusterings by their average silhouette values:
• K=3: mean(silh) = 0.526
• K=4: mean(silh) = 0.640 (maximum, so choose K=4)
• K=5: mean(silh) = 0.527
Determine the number of clusters
1. Select the number of clusters K as the one maximizing the average silhouette value over all points
2. Optimize an objective criterion
– e.g., the gap statistic on the decrease of SSE w.r.t. K
3. Model-based methods:
• optimize a global criterion (e.g., the maximum likelihood of the data)
4. Use clustering methods that do not require setting K, e.g., DBSCAN
5. Prior knowledge…
Cluster Analysis
1. Partitioning Methods + EM algorithm
2. Hierarchical Methods
3. Density-Based Methods
4. Clustering quality evaluation
5. How to decide the number of clusters?
6. Summary
Clustering vs. Classification
Problems and Challenges
• Considerable progress has been made in scalable clustering
methods
Partitioning: k-means, k-medoids, CLARANS
Hierarchical: BIRCH, ROCK, CHAMELEON
Density-based: DBSCAN, OPTICS, DenClue
Grid-based: STING, WaveCluster, CLIQUE
Model-based: EM, SOM
Spectral clustering
Affinity Propagation
Frequent pattern-based: Bi-clustering, pCluster
• Current clustering techniques do not address all of the requirements adequately; this is still an active area of research
Cluster Analysis
Open issues in clustering
1. Clustering quality evaluation
2. How to decide the number of clusters?
What you should know
• What is clustering?
• How does k-means work?
• What is the difference between k-means and k-medoids?
• What is the EM algorithm? How does it work?
• What is the relationship between k-means and EM?
• How to define inter-cluster similarity in hierarchical clustering? What options do you have?
• How does DBSCAN work?
What you should know
• What are the advantages and disadvantages of DBSCAN?
• How to evaluate the clustering results?
• How is the number of clusters usually decided?
• What are the main differences between clustering and classification?