Topic 6
Clustering and Unsupervised Learning
credits:
Padhraic Smyth lecture notes
Hand, et al Chapter 9
David Madigan lecture notes
Data Mining - Volinsky - 2011 - Columbia University
Clustering Outline
• Introduction to Clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Clustering
• "automated detection of group structure in data"
  – Typically: partition N data points into K groups (clusters) such that the points in each group are more similar to each other than to points in other groups
  – a descriptive technique (contrast with predictive)
  – identify "natural" groups of data objects and qualitatively describe groups of the data
• often useful, if a bit reductionist
  – for real-valued vectors, clusters can be thought of as clouds of points in p-dimensional space
  – also called unsupervised learning
Clustering
Sometimes easy
Sometimes impossible
and usually in between
What is Cluster Analysis?
• A good cluster analysis produces objects that are
  – similar (close) to one another within the same cluster
  – dissimilar (far) from the objects in other clusters
• In other words
  – high intra-cluster similarity (low intra-cluster variance)
  – low inter-cluster similarity (high inter-cluster variance)
Example
[Figure: example of clusters in a 2-D scatterplot]
Example
[Figure: ANEMIA PATIENTS AND CONTROLS - Red Blood Cell Volume (x-axis) vs. Red Blood Cell Hemoglobin Concentration (y-axis)]
Why is Clustering useful?
• "Discovery" of new knowledge from data
  – Contrast with supervised classification (where labels are known)
  – Can be very useful for summarizing large data sets
    • for large n and/or high dimensionality
• Applications of clustering
  – WWW
    • clustering of documents produced by a search engine (e.g., Google News)
  – Customer segmentation
  – Spatial data analysis
    • geographical clusters of events: cancer rates, sales, etc.
  – Clustering of genes with similar expression profiles
  – many more
General Issues in Clustering
• No golden truth!
  – the answer is often subjective
• Cluster representation:
  – What types or "shapes" of clusters are we looking for? What defines a cluster?
• Other issues
  – The distance function D[x(i), x(j)] is a critical aspect of clustering, both for
    • distances between individual pairs of objects
    • distances between individual objects and clusters
  – How is K selected?
Clustering Outline
• Introduction to Clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Distance Measures
• In order to cluster, we need some kind of "distance" between points.
• Sometimes distances are not obvious, but we can create them:

case  sex  glasses  moustache  smile  hat
1     0    1        0          1      0
2     1    0        0          1      0
3     0    1        0          0      0
4     0    0        0          0      0
5     0    0        0          1      0
6     0    0        1          0      1
7     0    1        0          1      0
8     0    0        0          1      0
9     0    1        1          1      0
10    1    0        0          0      0
11    0    0        1          0      0
12    1    0        0          0      0
Some Distances
• Euclidean distance (L2): d(x,y) = \sqrt{\sum_i (x_i - y_i)^2}
  – The most common notion of "distance."
• Manhattan distance (L1): d(x,y) = \sum_i |x_i - y_i|
  – The distance if you had to travel along coordinates only.
Examples of Euclidean Distances
x = (5,5), y = (9,8)
L2: dist(x,y) = \sqrt{4^2 + 3^2} = 5
L1: dist(x,y) = 4 + 3 = 7
[Figure: right triangle between x and y with legs 4 and 3 and hypotenuse 5]
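A quick check of these two distances on the example points above; NumPy is not part of the slides, this is just a small sketch:

```python
# Minimal sketch: L2 (Euclidean) and L1 (Manhattan) distances for x = (5,5), y = (9,8).
import numpy as np

x = np.array([5.0, 5.0])
y = np.array([9.0, 8.0])

l2 = np.sqrt(np.sum((x - y) ** 2))   # Euclidean: sqrt(4^2 + 3^2) = 5
l1 = np.sum(np.abs(x - y))           # Manhattan: 4 + 3 = 7

print(l2, l1)  # 5.0 7.0
```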
Non-Euclidean Distances
• Some observations are not appropriate for Euclidean distance:
  – Binary vectors: 10011 vs. 11000
  – Strings: "Statistics" vs. "sadistics"
  – Ordinal variables: "M.S." vs. "B.A."
  – Categorical: blue vs. green
• How do we calculate distances for variables like these?
Distances for Binary Vectors
• A = 101110; B = 100111
  – Hamming distance: # of changes to get from A to B
    • Hamming(A,B) = 2
    • Can be normalized by the length of the string: 2/6
  – Jaccard similarity: intersection over union
    • Intersection: # of 1s in common = 3
    • Union: # of positions with at least one 1 = 5
    • Jaccard similarity = 3/5
    • Jaccard distance = 1 - 3/5 = 2/5
  – Both of these are metrics => they satisfy the triangle inequality
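A small sketch of the binary-vector measures above, using the slide's vectors A = 101110 and B = 100111 (NumPy assumed):

```python
# Hamming and Jaccard for the binary vectors from the slide.
import numpy as np

A = np.array([1, 0, 1, 1, 1, 0])
B = np.array([1, 0, 0, 1, 1, 1])

hamming = np.sum(A != B)                     # positions that differ -> 2
hamming_norm = hamming / len(A)              # normalized by length -> 2/6

intersection = np.sum((A == 1) & (B == 1))   # 1s in common -> 3
union = np.sum((A == 1) | (B == 1))          # positions with at least one 1 -> 5
jaccard_sim = intersection / union           # 3/5
jaccard_dist = 1 - jaccard_sim               # 2/5

print(hamming, hamming_norm, jaccard_sim, jaccard_dist)
```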
Cosine Distance (similarity)
• Think of a point as a vector from the origin (0,0,…,0) to its location.
• Two points' vectors make an angle; the cosine of this angle is a measure of similarity.
  – Recall cos(0) = 1; cos(90°) = 0
  – Also: the cosine is the normalized dot-product of the vectors: cos(θ) = (p1 · p2) / (||p1|| ||p2||)
  – Example: p1 = 00111; p2 = 10011
  – cos(θ) = 2/3; θ is about 48 degrees.
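A minimal sketch of the cosine example above (NumPy assumed; the vector names follow the slide):

```python
# Cosine similarity and angle for p1 = 00111 and p2 = 10011.
import numpy as np

p1 = np.array([0, 0, 1, 1, 1])
p2 = np.array([1, 0, 0, 1, 1])

cos_sim = p1 @ p2 / (np.linalg.norm(p1) * np.linalg.norm(p2))  # 2 / (sqrt(3)*sqrt(3)) = 2/3
angle_deg = np.degrees(np.arccos(cos_sim))                     # about 48 degrees

print(cos_sim, angle_deg)  # 0.666..., ~48.19
```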
Cosine-Measure Diagram
[Figure: vectors p1 and p2 with angle θ between them; the projection of p1 onto p2 has length (p1 · p2) / ||p2||]
Edit Distance for strings
• Edit distance for strings: the number of inserts and deletes of characters needed to turn one string into the other.
• Equivalently: d(x,y) = |x| + |y| - 2|LCS(x,y)|
  – LCS = longest common subsequence = the longest string obtainable both by deleting from x and by deleting from y.
Example
• x = statistics; y = sadistic.
• Turn x into y by deleting t, deleting t, then inserting d, and deleting s.
  – Edit distance = 4.
• Or, LCS(x,y) = "saistic".
• |x| + |y| - 2|LCS(x,y)| = 10 + 8 - 14 = 4.
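A sketch of the LCS identity above, computed by dynamic programming (the helper lcs_length is illustrative, not from the slides):

```python
# Insert/delete edit distance via d(x, y) = |x| + |y| - 2*|LCS(x, y)|.
def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

x, y = "statistics", "sadistic"
lcs = lcs_length(x, y)                 # 7 ("saistic")
edit_dist = len(x) + len(y) - 2 * lcs  # 10 + 8 - 14 = 4
print(lcs, edit_dist)
```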
Categorical Variables
• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: Simple matching
  – m: # of matches, p: total # of variables
  – d(i,j) = (p - m) / p

        Hat    Coat   Shoes  Belt
Alice   Brown  Black  Black  Brown
Bob     Brown  Gray   Red    Red
Craig   None   Black  Black  Brown
Dave    None   Black  Brown  None

  – Distance(Alice, Craig) = (4 - 3)/4 = 1/4
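A small sketch of the simple-matching distance for the Alice/Craig rows of the table (the dictionaries are just a convenient encoding, not from the slides):

```python
# Simple-matching distance d(i, j) = (p - m) / p for Alice vs. Craig.
alice = {"Hat": "Brown", "Coat": "Black", "Shoes": "Black", "Belt": "Brown"}
craig = {"Hat": "None",  "Coat": "Black", "Shoes": "Black", "Belt": "Brown"}

p = len(alice)                                # total number of variables = 4
m = sum(alice[k] == craig[k] for k in alice)  # number of matches = 3
d = (p - m) / p                               # (4 - 3) / 4 = 1/4
print(d)  # 0.25
```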
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Pretend they are interval scaled:
  – replace x_if by its rank r_if in {1, …, M_f}
  – map the range of each variable onto [0, 1], e.g., z_if = (r_if - 1) / (M_f - 1)
  – compute the dissimilarity using Euclidean or other distances
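A minimal sketch of the rank-to-[0,1] mapping, assuming the common (r - 1)/(M - 1) convention and a hypothetical three-level scale:

```python
# Replace ordinal values by ranks and map them onto [0, 1].
levels = ["B.A.", "M.S.", "Ph.D."]               # hypothetical ordered categories
rank = {lvl: i + 1 for i, lvl in enumerate(levels)}
M = len(levels)

def to_unit_interval(value):
    # rank r in {1, ..., M} mapped to (r - 1) / (M - 1)
    return (rank[value] - 1) / (M - 1)

print(to_unit_interval("B.A."), to_unit_interval("M.S."), to_unit_interval("Ph.D."))
# 0.0 0.5 1.0
```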
Clustering Methods
• Enough about distances!
• Now we have an (n x n) matrix of distances.
• Two major types of clustering algorithms:
  – partitioning
    • partitions the set into clusters with defined boundaries
    • place each point in its nearest cluster
  – hierarchical
    • agglomerative: each point starts in its own cluster; iteratively combine
    • divisive: all data start in one cluster; iteratively dissect
Clustering Outline
• Introduction to Clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
k-Means Algorithm(s)
• Assumes Euclidean space.
• Start by picking k, the number of clusters.
• Initialize clusters by picking one point per cluster.
  – typically, k random points
K-Means Algorithm
1. Arbitrarily select K objects from the data (e.g., K customers) to be the initial cluster centers.
2. For each of the remaining objects: assign the object to the cluster whose center it is closest to.
[Figure: 2-D scatterplot with two initial cluster centers marked]
K-Means Algorithm
Then repeat the following steps until the clusters converge (no change in clusters):
1. Compute the new center of each current cluster.
[Figure: 2-D scatterplot with recomputed cluster centers]
K-Means Algorithm
2. Assign each object to the cluster whose center it is closest to.
[Figure: 2-D scatterplot with points reassigned to the nearest center]
3. Go back to step 1, or stop if the centers do not change.
The K-Means Clustering Method
• Example
[Figure: successive k-means iterations on a 2-D scatterplot (multiple panels)]
K-means
1. Decide on the number of clusters k (e.g., k = 5).
(Example is courtesy of Andrew Moore, CMU)

K-means
1. Decide on the number of clusters k (e.g., k = 5).
2. Randomly guess k cluster center locations.

K-means
1. Decide on the number of clusters k (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each data point finds out which center it is closest to. (Thus each center "owns" a set of data points.)

K-means
1. Decide on the number of clusters k (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each data point finds out which center it is closest to.
4. Each center finds the centroid of the points it owns.

K-means
1. Decide on the number of clusters k (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each data point finds out which center it is closest to.
4. Each center finds the centroid of the points it owns.
5. New centers => new boundaries.
6. Repeat until no change.
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=3, m2=4
• Solve for the rest ...
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=3, m2=4
• K1={2,3}, K2={4,10,12,20,30,11,25}; m1=2.5, m2=16
• K1={2,3,4}, K2={10,12,20,30,11,25}; m1=3, m2=18
• K1={2,3,4,10}, K2={12,20,30,11,25}; m1=4.75, m2=19.6
• K1={2,3,4,10,11,12}, K2={20,30,25}; m1=7, m2=25
• Stop, as the clusters with these means are unchanged.
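A sketch of the same 1-D example run with Lloyd's algorithm (plain NumPy; the loop structure is an assumption, not code from the slides):

```python
# k-means on {2,4,10,12,3,20,30,11,25} with k = 2 and initial means m1 = 3, m2 = 4.
import numpy as np

x = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
means = np.array([3.0, 4.0])

while True:
    # assignment step: each point goes to its nearest mean
    labels = np.argmin(np.abs(x[:, None] - means[None, :]), axis=1)
    # update step: recompute each cluster mean
    new_means = np.array([x[labels == k].mean() for k in range(2)])
    if np.allclose(new_means, means):
        break
    means = new_means

print(sorted(x[labels == 0]), sorted(x[labels == 1]), means)
# expected clusters: {2,3,4,10,11,12} and {20,25,30}, with means 7 and 25
```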
Getting k Right
• Hard! Often done subjectively (by feel).
• Try different k, looking at the change in the average distance to centroid as k increases.
• Look for a balance between within-cluster variance and between-cluster variance.
  – Calinski index: CH(k) = [B(k)/(k-1)] / [W(k)/(n-k)], where B = between-cluster and W = within-cluster sum of squares
• The average distance falls rapidly until the right k is reached, then changes little.
[Figure: average distance to centroid vs. k, with an elbow at the best value of k]
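A sketch of comparing several values of k on synthetic data, using scikit-learn's KMeans and the Calinski-Harabasz score (the data and parameter choices are hypothetical):

```python
# Compare k = 2..8 by average distance to the assigned centroid and by the Calinski-Harabasz index.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # synthetic data with 4 true clusters

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    avg_dist = np.sqrt(km.inertia_ / len(X))        # RMS distance to the assigned centroid
    ch = calinski_harabasz_score(X, km.labels_)     # between-/within-cluster variance ratio
    print(k, round(avg_dist, 3), round(ch, 1))
# the average distance typically drops sharply until the "right" k, then flattens
```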
Comments on the K-Means Method
• Strengths
  – Relatively efficient: easy to implement; often comes up with good, if not the best, solutions
  – Intuitive
• Weaknesses
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
  – Not suitable for discovering clusters with non-convex shapes
  – Quite sensitive to initial starting points; will find a local optimum. Run it several times and see how much the results change.
Variations on k-means
• Make it more robust by using k-modes or k-medoids
  – K-medoids: the medoid is the most centrally located object in a cluster.
• Make the initialization better
  – Take a small random sample and cluster it to find a starting point
  – Pick k points on a grid
  – Do several runs with different starting points
Clustering Outline
• Introduction to Clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Simple example of hierarchical clustering
Hierarchical Clustering
• Does not require the number of clusters k as an input.
• Two extremes
  – All data in one cluster
  – Each data point in its own cluster
[Figure: points a-e merged step by step (a,b -> ab; d,e -> de; c,de -> cde; ab,cde -> abcde), read left-to-right for agglomerative clustering and right-to-left for divisive clustering]
Hierarchical Clustering
• Representation: tree of nested clusters
• Greedy algorithm
  – Find the two most similar points
  – Join them
  – Repeat
• Can also run backwards
  – divisive
• Effective visualization via "dendrograms"
  – shows the nesting structure
  – merges or splits = tree nodes
• The algorithm requires a distance measure for the distance between clusters, or between a point and a cluster
Distances Between Clusters
• Single link:
  – smallest distance between points
  – nearest neighbor
  – can be outlier sensitive
• Complete link:
  – largest distance between points
  – enforces "compactness"
• Average link:
  – mean: gets average behavior
  – centroid: more robust
• Ward's measure
  – Merge the clusters that minimize the increase in within-cluster distances:
  – D(i,j) = SS(C_{i+j}) - SS(C_i) - SS(C_j)
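A sketch of agglomerative clustering with the linkage options above, using SciPy on hypothetical 2-D data (scipy.cluster.hierarchy is an assumed tool, not mentioned in the slides):

```python
# Agglomerative clustering under different between-cluster distances.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])   # two loose clouds

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                    # merge tree: one row per agglomeration
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes under each linkage
# scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding dendrogram
```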
Dendrograms
• By cutting the dendrogram at the desired level, each connected component forms a cluster.
[Figure: dendrogram over points A-I; one cut gives the clusters ABC, DEF, G, and HI]
• Plots can be drawn so that the height of each cross-bar shows the change in within-cluster SS.
[Figure: dendrogram of the Old Faithful data]
Dendrogram Using Single-Link Method
Old Faithful Eruption Duration vs. Wait Data
Notice how single-link tends to "chain".
(dendrogram y-axis = cross-bar's distance score)

Dendrogram Using Ward's SSE Distance
Old Faithful Eruption Duration vs. Wait Data
More balanced than single-link.
Hierarchical Clustering
• Pros
  – don't have to specify k beforehand
  – visual representation of various cluster characteristics from the dendrogram
• Cons
  – different linkage options give very different results
Clustering Outline
• Introduction to Clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering
Estimating Probability Densities
• Using probability densities is one way to describe data.
• Finite mixtures of probability densities can be viewed as clusters.
• Because we have a probability model, the log-likelihood can be used to evaluate it:
  S_L(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta)
Mixture Models
Example: weekly credit card usage, modeled as a two-component mixture:
  f(x) = p \frac{\lambda_1^{x} e^{-\lambda_1}}{x!} + (1-p) \frac{\lambda_2^{52-x} e^{-\lambda_2}}{(52-x)!}
Two-stage model:
• Assign data points to clusters
• Assess the fit of the model
Mixture Models and EM
• How do we find the models to mix over?
• EM (Expectation / Maximization) is a widely used technique that converges to a solution for finding mixture models.
• Assume multivariate normal components. To apply EM:
  – take an initial solution
  – calculate the probability that each point comes from each component and assign it (E-step)
  – re-estimate the parameters of the components based on the new assignments (M-step)
  – repeat until convergence.
• Results in probabilistic membership to clusters.
• Can be slow to converge; can find local maxima.
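A sketch of fitting a Gaussian mixture by EM with scikit-learn on synthetic data (the library and data are assumptions; predict_proba exposes the probabilistic memberships mentioned above):

```python
# Two-component Gaussian mixture fit by EM, with soft cluster memberships.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (100, 2))])  # synthetic data

gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
resp = gm.predict_proba(X)        # E-step output: P(point i belongs to component k)
print(gm.weights_, gm.means_)     # fitted mixing weights and component means
print(resp[:3].round(3))          # soft memberships, not a hard partition
```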
The E (Expectation) Step
[Diagram: n data points and the current K clusters and parameters]
E step: compute p(data point i is in group k) for each point i and each group k.
The M (Maximization) Step
[Diagram: n data points and the new parameters for the K clusters]
M step: compute the new parameters θ, given the n data points and their memberships.
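A minimal sketch of the E and M steps for a one-dimensional two-component Gaussian mixture (plain NumPy/SciPy; the notation and synthetic data are assumptions, not the slide's example):

```python
# Hand-rolled EM for a 1-D mixture of two Gaussians.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 150), rng.normal(5, 1, 100)])  # synthetic data

w = np.array([0.5, 0.5])          # mixing weights
mu = np.array([-1.0, 1.0])        # initial means
sd = np.array([1.0, 1.0])         # initial standard deviations

for _ in range(50):
    # E step: responsibility of each component for each point
    dens = np.stack([w_k * norm.pdf(x, m, s) for w_k, m, s in zip(w, mu, sd)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M step: re-estimate weights, means, and standard deviations
    n_k = r.sum(axis=0)
    w = n_k / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)

print(w.round(3), mu.round(3), sd.round(3))  # should recover roughly (0.6, 0.4), (0, 5), (1, 1)
```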
Comments on Mixtures and EM Learning
• Probabilistic assignment to clusters … not a partition
• K-means is a special case of EM
  – Gaussian mixtures with isotropic (diagonal, equi-variance) covariance matrices Σ_k
  – Approximate the E-step by choosing the most likely cluster (instead of using membership probabilities)
• EM can be used more broadly to estimate generic distributions
[Figure: ANEMIA PATIENTS AND CONTROLS - Red Blood Cell Volume (x-axis) vs. Red Blood Cell Hemoglobin Concentration (y-axis)]
[Figure: EM ITERATION 1 - same axes]
[Figure: EM ITERATION 3 - same axes]
[Figure: EM ITERATION 5 - same axes]
[Figure: EM ITERATION 10 - same axes]
[Figure: EM ITERATION 15 - same axes]
[Figure: EM ITERATION 25 - same axes]
[Figure: ANEMIA DATA WITH LABELS - same axes]
LOG-LIKELIHOOD AS A FUNCTION OF EM ITERATIONS
[Figure: log-likelihood (y-axis, roughly 400 to 490) vs. EM iteration (x-axis, 0 to 25)]
Selecting K in mixture models
• Cannot just choose the K that maximizes the likelihood
  – the likelihood L(θ) is always larger for larger K
• Model selection alternatives for choosing K:
  – 1) In-sample: penalize complexity
    • e.g., BIC = 2 L(θ) - d log n, where d = # of parameters (Bayesian Information Criterion)
    • easy to implement; asymptotically correct
  – 2) Bayesian: compute posteriors p(k | data)
    • p(k | data) requires computation of p(data | k), the marginal likelihood
    • can be tricky to compute for mixture models
  – 3) Out-of-sample: (cross-)validation
    • split the data into train and validation sets
    • score different models by the likelihood of the held-out data, log p(X_test | θ)
    • can be noisy on small data (log L is sensitive to outliers)
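A sketch of option 1 (BIC) on synthetic data with scikit-learn. Note that GaussianMixture.bic() uses the lower-is-better sign convention (-2 log L + d log n), the opposite sign of the slide's formula:

```python
# Choose the number of mixture components by BIC (lowest value wins in sklearn's convention).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])  # two true components

for k in range(1, 7):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    print(k, round(gm.bic(X), 1))   # expect the minimum near the true k = 2
```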
Example of BIC Score for Red-Blood Cell Data
[Figure: BIC score vs. number of mixture components; the true number of classes (2) is selected by BIC]
Model Based Clustering
f(x) = \sum_{k=1}^{K} w_k f_k(x; \theta_k)
• Mixture of K multivariate Gaussians
• Optimal complexity fit via in-sample penalties
• EM or Bayesian methods used to fit clusters
• 'mclust' in R

Name  Distribution  Volume    Shape     Orientation
EII   Spherical     equal     equal     NA
VII   Spherical     variable  equal     NA
EEI   Diagonal      equal     equal     coordinate axes
VEI   Diagonal      variable  equal     coordinate axes
VVI   Diagonal      variable  variable  coordinate axes
EEE   Ellipsoidal   equal     equal     equal
EEV   Ellipsoidal   equal     equal     variable
VEV   Ellipsoidal   variable  equal     variable
VVV   Ellipsoidal   variable  variable  variable
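R's mclust is referenced above; as a rough, approximate analogue (not a one-to-one mapping), scikit-learn's covariance_type restricts the Gaussian components in a similar spirit:

```python
# Compare constrained covariance families by BIC, loosely mirroring mclust's model names:
# 'spherical' ~ EII/VII, 'diag' ~ the diagonal models, 'tied' ~ EEE, 'full' ~ VVV.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 2, (200, 2))])  # synthetic data

for cov in ["spherical", "diag", "tied", "full"]:
    gm = GaussianMixture(n_components=2, covariance_type=cov, random_state=0).fit(X)
    print(cov, round(gm.bic(X), 1))
```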
mclust output
[Figure: example mclust output]