DSS Chapter 1 - Hossam Faris


Cluster Analysis for Data Mining

Used for automatic identification of natural groupings of things
Part of the machine-learning family
Employs unsupervised learning
Learns the clusters of things from past data, then assigns new instances
There is no output variable
Also known as segmentation
Copyright © 2011 Pearson Education, Inc. Publishing as Prentice Hall

Clustering results may be used to:
Identify natural groupings of customers
Identify rules for assigning new cases to classes for targeting/diagnostic purposes
Provide characterization, definition, and labeling of populations
Decrease the size and complexity of problems for other data mining methods
Identify outliers in a specific domain (e.g., rare-event detection)

Advantages and disadvantages

Clustering allows a user to group data in order to determine patterns in the data.
Advantage: clustering is useful when the data set is defined and a general pattern needs to be determined from it. You can create a specific number of groups, depending on your business needs.
One defining benefit of clustering over classification is that every attribute in the data set is used to analyze the data.
Disadvantage: the user must know ahead of time how many groups to create. For a user without any real knowledge of the data, this might be difficult.

Analysis methods

Statistical methods
Neural networks
Fuzzy logic (e.g., fuzzy c-means algorithm)
Genetic algorithms
Divisive: all items start in one cluster and are broken apart.
Agglomerative: all items start in individual clusters and the clusters are joined together.
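
The agglomerative idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how the distance between two clusters is measured, so the sketch assumes single linkage (the smallest gap between any two members) on one-dimensional values, and the function names are hypothetical.

```python
def single_link_distance(cluster_a, cluster_b):
    """Smallest absolute gap between any member of a and any member of b.

    Assumption: single linkage on 1-D values; the slides name no linkage rule.
    """
    return min(abs(x - y) for x in cluster_a for y in cluster_b)


def agglomerative(values, target_clusters):
    """Start with one cluster per item and join the closest pair repeatedly."""
    clusters = [[v] for v in values]  # every item starts in its own cluster
    while len(clusters) > target_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # join the two closest clusters
        del clusters[j]
    return clusters


result = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], target_clusters=3)
print(result)
```

A divisive method would run in the opposite direction, starting from one cluster that holds every item and splitting it.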

k-Means Clustering Algorithm

k: predetermined number of clusters

Algorithm (Step 0: determine the value of k)
Step 1: Randomly generate k points as initial cluster centers
Step 2: Assign each point to the nearest cluster center
Step 3: Recompute the new cluster centers
Repetition step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable)
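
The steps above can be sketched in plain Python. This is a minimal sketch, not a reference implementation. It assumes squared Euclidean distance and, as a simplification of Step 1, picks the initial centers from the existing data points rather than generating arbitrary random points. The function names, the `max_iter` cap, and the `seed` parameter are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centers, assignment) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1 (simplified, an assumption): use k distinct data points as centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the index of its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Convergence criterion: the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # leave the center of an empty cluster unchanged
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = k_means(points, k=2)
print(centers, labels)
```

On this toy data the two well-separated groups end up in different clusters. Which group gets which cluster index depends on the random starting centers.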

K-Means
Predetermined number of clusters
Start with seed clusters of one element (the seeds)
Assign instances to clusters
Find new centroids
New clusters
k-Means Clustering Algorithm (figure with panels labeled Step 1, Step 2, and Step 3)

Example - BMW dealership

The dealership has kept track of how people walk through the dealership and the showroom, what cars they look at, and how often they ultimately make purchases.
The dealership needs to mine this data by finding patterns and using clusters to determine whether certain customer behaviors emerge.
There are 100 rows of data in this sample, and each column describes a step that customers reached in their BMW experience: a column holds 1 (they made it to this step or looked at this car) or 0 (they didn't reach this step).
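
With 0/1 columns like these, the mean of a column within one cluster is the fraction of that cluster's customers who reached the step, which makes cluster centers easy to read. The sketch below only illustrates that reading. The column names, the four rows, and the cluster labels are hypothetical toy values, not the dealership's data or any actual clustering output.

```python
from collections import defaultdict


def cluster_profiles(rows, labels):
    """Return, for each cluster label, the mean of every 0/1 column."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in grouped.items()
    }


# Hypothetical columns and toy rows, invented for illustration only.
columns = ["entered_showroom", "looked_at_car", "purchased"]
rows = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
labels = [0, 0, 1, 1]  # assumed cluster assignment from some clustering run

profiles = cluster_profiles(rows, labels)
for label in sorted(profiles):
    print(label, dict(zip(columns, profiles[label])))
```

In this toy case, half of the customers in cluster 0 purchased and none in cluster 1 did.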
Weka results