A few notes on cluster analysis
Basics of clustering
• A data structuring tool, generally used as an exploratory rather than a confirmatory technique.
• Organizes data into meaningful taxonomies in which groups are relatively homogeneous with respect to a specified set of attributes.
• That is, it maximizes the association between objects in the same group while minimizing the association between groups.
• Two major types: hierarchical and partitioning.
Basics of clustering
• Based on the concepts of dissimilarity or distance in n-dimensional space.
• In multi-dimensional attribute space, distance refers to how dissimilar the attributes of one observation are from those of another.
• Classic example: remote sensing bands (see image).
Distance
• When attribute variables are on different numeric scales, it is often necessary to standardize the data so that no one variable is overly weighted.
• Distance can be measured as Euclidean (straight-line) distance (Eq. 1), squared Euclidean distance (Eq. 2), Manhattan (city-block) distance (Eq. 3), or Chebychev distance (Eq. 4), among many other approaches.
• Eq. 1 (Euclidean): $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Eq. 2 (squared Euclidean): $d(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2$
• Eq. 3 (Manhattan): $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• Eq. 4 (Chebychev): $d(x, y) = \max_{i} |x_i - y_i|$
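A minimal sketch of these four measures, assuming NumPy and two hypothetical three-attribute observations x and y (in practice the raw attributes would be standardized first, e.g. as z-scores):

```python
import numpy as np

# Two hypothetical observations in 3-dimensional attribute space
x = np.array([2.0, 4.0, 6.0])
y = np.array([1.0, 1.0, 2.0])

euclidean    = np.sqrt(np.sum((x - y) ** 2))  # Eq. 1: straight-line distance
sq_euclidean = np.sum((x - y) ** 2)           # Eq. 2: squared Euclidean
manhattan    = np.sum(np.abs(x - y))          # Eq. 3: city-block distance
chebychev    = np.max(np.abs(x - y))          # Eq. 4: largest single-attribute gap

print(euclidean, sq_euclidean, manhattan, chebychev)
# ~5.10, 26.0, 8.0, 4.0
```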
Hierarchical methods
• This approach is good for a posteriori data exploration, allowing the user to interpret cluster relationships based on patterns of branching.
• Methods are either divisive (dividing) or agglomerative (aggregating).
• Agglomerative clustering starts by treating every data observation as a separate group in itself and then merges those observations into larger groups until there is a single group; it sequentially lowers the threshold for uniqueness.
• Dissimilarity or similarity is represented on the "height" axis.
[Figure: dendrogram of car models, from Acura Legend through Subaru Justy, with the height axis running from 0 to 15000.]
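A minimal agglomerative sketch, assuming SciPy (plus Matplotlib for drawing the dendrogram) and a hypothetical data matrix X:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))        # 30 hypothetical observations, 4 attributes

Z = linkage(X, method="ward")       # agglomerate: 30 singletons merged up to 1 group
dendrogram(Z)                       # branching pattern; the y-axis is the "height"
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 groups
```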
Partitioning
• An a priori approach, used when the analyst has some prior expectations about group structure.
• The clusters can then be analyzed for systematic differences in the distributions of the variables.
• Example: marketing clusters for over-50 consumers, based on responses to a survey.
Source: http://www.utalkmarketing.com/Pages/Article.aspx?ArticleID=1920&Title=Jo_Rigby:_Understanding_the_over_50%27s_consumer
Partitioning
• Partitioning iteratively creates clusters by assigning observations to the nearest cluster centroid.
• The most common partitioning method is k-means, popularized by Hartigan, which works by randomly generating k clusters, determining the location of each cluster center, assigning each point to the nearest cluster center, and then iteratively recomputing the cluster centers until convergence, generally signaled when point-to-cluster assignments no longer change.
• Another method is Partitioning Around Medoids (PAM).
• Rather than minimizing distances, as k-means does, PAM minimizes dissimilarities, a more robust measure of difference recorded in a dissimilarity matrix. The dissimilarity matrix allows PAM to perform clustering with respect to any distance metric and allows for a flexible definition of distance.
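A minimal k-means sketch, assuming scikit-learn; the data matrix and the choice of k = 4 are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))       # hypothetical observations

km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(X)          # assign / recompute until assignments stabilize
centers = km.cluster_centers_       # final cluster centroids
```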
PAM: Advantages
• The use of medoids rather than centroids makes PAM less sensitive to outliers.
• Plots show how well the data cluster based on the model and the variables used.
• In the silhouette plot, each silhouette represents a cluster (composed of a horizontal line for each observation), and the width of each line represents the strength of that observation's membership in its cluster.
• Observations are sorted by width, so those with narrow widths fall near the next cluster, which also means they do not clearly belong to one cluster or another.
[Silhouette plot: silhouette widths ranging from -0.4 to 1.0; average silhouette width: 0.53.]
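A minimal PAM sketch. This assumes the third-party scikit-learn-extra package, whose KMedoids estimator implements the classic PAM algorithm, and hypothetical data; because PAM works from a dissimilarity matrix, any distance metric can be plugged in:

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn_extra.cluster import KMedoids   # third-party: scikit-learn-extra

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # hypothetical observations

D = pairwise_distances(X, metric="manhattan")   # any dissimilarity works here
pam = KMedoids(n_clusters=3, metric="precomputed", method="pam", random_state=0)
labels = pam.fit_predict(D)
medoids = pam.medoid_indices_   # medoids are actual observations, not averages
```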
PAM: Silhouette Plots
• Observations with negative silhouette widths (bars extending to the left of zero) are flagged as likely misclassified, or as outliers that cannot be classified.
• A value of zero means an observation falls on the border between two clusters.
• The average silhouette width, given at the bottom, represents the overall strength of group membership. The table below gives the usual rules of thumb for interpreting it.
• The clusplot (the page 1 tab of the plots) shows the cluster overlap, using the two principal components that explain most of the variance. The more overlap there is, the less clear the structure of the clustering. Ideally, you would like the clusters to be far apart, but when working with a large data set like this, that is unlikely to happen.
Rule of thumb for silhouette score
RANGE OF SC   INTERPRETATION
0.71–1.00     A strong structure has been found
0.51–0.70     A reasonable structure has been found
0.26–0.50     The structure is weak and could be artificial; try additional methods of data analysis
≤ 0.25        No substantial structure has been found
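A minimal sketch of computing these widths with scikit-learn, on hypothetical two-cluster data, to be read against the table above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
# Two hypothetical, well-separated groups of 50 observations each
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 5.0)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
widths = silhouette_samples(X, labels)  # one width per observation, in [-1, 1]
avg = silhouette_score(X, labels)       # overall strength of group membership
print(round(avg, 2))                    # here roughly 0.7+, a reasonable-to-strong structure
```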
Also produces a “Clusplot”
[Clusplot: clusters plotted against Component 1 and Component 2; an annotation marks areas where membership is ambiguous. Caption: "These two components explain 97.03 % of the point variability."]
New approaches
• Artificial Neural Networks: NNs essentially use a nonlinear and flexible regression technique that does not require prior assumptions about the distribution of the data in order to classify it. NN methods have the advantage of evaluating similarities based on a set of multi-dimensional criteria, as opposed to traditional clustering algorithms, which generally use a single measure of dissimilarity.
• Multivariate Divisive Partitioning (MDP): an analyst chooses a dependent variable or behavior they wish to model and then conducts a stepwise process to determine which variables, and which breaks in the values of those variables, best divide a single segment into two segments with the greatest difference in that behavior. Splitting then continues iteratively until a threshold of similarity in the dependent variable is reached.
• PCA is often used as a data-reduction tool in highly complex clustering and segmentation. The resulting components are standardized linear combinations of the original variables. Generally, the first principal component explains the largest share of the original variation in the variables. Because the components are orthogonal, each subsequent component is uncorrelated with the previous ones, and hence explains less variance. Thus, while the number of principal components equals the number of variables, only a few of them need be used because they explain most of the variance.
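A minimal sketch of PCA as a pre-clustering data-reduction step, assuming scikit-learn; the data and the choice of two retained components are hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 200 hypothetical observations, 10 raw variables

Xz = StandardScaler().fit_transform(X)   # standardize so no variable dominates
pca = PCA(n_components=2)
scores = pca.fit_transform(Xz)           # orthogonal components, in decreasing variance order
print(pca.explained_variance_ratio_)     # share of variance each retained component explains

# `scores` can now be fed to k-means, PAM, or a hierarchical method in place of X.
```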