Clustering Case Study
Thanks to AT&T Intern:
Chao Han
Virginia Tech University
Data Mining - Volinsky - 2011 - Columbia
Intro
• Study of AT&T mobile customers
  – Can we characterize their usage to learn about them?
• Apply clustering algorithms to find user groups that share similar patterns of usage intensity over time.
• All data is analyzed anonymously.
Data
• Anonymized CDRs collected over a 60-day period (11/29/09~01/27/10) from 35 cell towers located within 5 miles of the center of Morristown.
• 152,178 users have transactions in both months.
• Aggregate voice and SMS usage separately into bins; each bin represents a particular hour of the day and day of the week.
• Aggregate daily bins by weekdays (M-F) and weekends (Sat/Sun), resulting in 24 (hours) × 2 (weekday vs. weekend) × 2 (Voice vs. SMS) = 96 bins.
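The binning scheme above can be sketched in a few lines. This is a minimal illustration, not the study's actual pipeline, and the function name is hypothetical:

```python
from datetime import datetime

def bin_index(ts: datetime, channel: str) -> int:
    """Map a transaction timestamp to one of the 96 bins:
    24 hours x 2 (weekday vs. weekend) x 2 (Voice vs. SMS)."""
    weekend = 1 if ts.weekday() >= 5 else 0   # Mon=0 .. Sun=6
    sms = 0 if channel == "voice" else 1
    return ts.hour + 24 * weekend + 48 * sms

# A Saturday 10pm SMS: hour 22, weekend, SMS -> bin 22 + 24 + 48 = 94
print(bin_index(datetime(2009, 12, 5, 22, 30), "sms"))  # 94
```

Each user's 96-dimensional vector is then just the count (or duration) of transactions falling into each bin over the two months.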
August 11, 2011
[Figure: heat maps of aggregated Voice and SMS usage bins]
Goal
 Assume the data set D is an n×p matrix, D = {x1, x2, …, xn}, where each data point xi has p dimensions and n is the number of objects in the data set.
 Partition the data set D into K groups, where the integer K (1 ≤ K ≤ n) is the number of clusters: D = {C1, C2, …, CK}.
 Each of the K groups Ck is a cluster, such that data points belonging to the same cluster are more similar in usage patterns than data in different clusters.
 Unsupervised learning: there is no a priori knowledge of any individual's cluster membership.
Algorithms used
• K-means
• PAM (Partitioning Around Medoids)
  – Like k-means but uses medoids (actual data points) as cluster centers, so more robust to outliers
• Trimmed k-means
– Another robust method
• Hierarchical clustering
– Average, complete, and Ward linkage
• DBSCAN
– Density-Based Spatial Clustering of Applications with Noise
– Can find arbitrarily shaped clusters
• Other ‘soft’ clustering methods
– MCLUST
– Fuzzy C-means
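A minimal scikit-learn sketch of three of these algorithms on synthetic data (a stand-in for the real 96-bin usage vectors); PAM, trimmed k-means, and MCLUST are R packages (cluster, trimcluster, mclust) and are not shown, and the DBSCAN eps below is tuned to this synthetic scale only:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(0)
# Two synthetic groups standing in for 96-dimensional usage vectors
X = np.vstack([rng.normal(0, 1, (100, 96)), rng.normal(5, 1, (100, 96))])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
ward_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
db_labels = DBSCAN(eps=14.0, min_samples=5).fit_predict(X)  # eps chosen by eye

print(np.bincount(km_labels))  # sizes of the two recovered clusters
```

On real usage data the input would be each user's binned counts, and K is unknown, which is the subject of the next slides.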
Evaluating Cluster solutions
• Calinski Index
– Higher Calinski index indicates better cluster separation
– Used to compare different clustering techniques
• How many clusters?
– Silhouette: s(i) = (b(i) - a(i)) / max(a(i), b(i))
  · a(i): avg. dissimilarity of i to other points in the same cluster
  · b(i): avg. dissimilarity of i to points in the nearest other cluster
• Other methods:
  – Gap, Hartigan, DB, Dunn, etc.
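A sketch of scanning candidate K with two of these indices in scikit-learn (which calls the Calinski index "Calinski-Harabasz"); the data here are three well-separated synthetic blobs, so the indices should favor K = 3:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(1)
# Three well-separated 2-D blobs standing in for usage vectors
X = np.vstack([rng.normal(m, 0.5, (80, 2)) for m in (0, 4, 8)])

sil, cal = {}, {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    sil[k] = silhouette_score(X, labels)   # mean of (b(i)-a(i))/max(a(i),b(i))
    cal[k] = calinski_harabasz_score(X, labels)
    print(k, round(sil[k], 3), round(cal[k], 1))
```

On the real usage data the indices disagree, as the next slide shows.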
Determining the number of clusters K
Fig 2: Validity Index vs. Cluster Number k. Index name (suggested k):
Calinski (4), Silhouette (4), KL (4), Weighted Gap (4), Hartigan (5), DB (3), Gap (7), Dunn (7).
• No good answer!
  – We have 4 big, relatively well-separated clusters, but there are more interesting smaller clusters inside the bigger ones.
Determining the number of clusters K (cont'd)
Fig 3: Heat maps of cluster centers based on K-means for high-volume users.
Cluster 5: Users with many Voice and SMS activities, with the heaviest usage just after work hours.
Cluster 6: Late-night callers.
Cluster 7: Before-business-hours callers.
How many clusters (cont)
• Another idea: look at the cluster centers using MDS (multidimensional scaling) to see if they are well separated.
Evaluation of Clustering Algorithms
Even though there is no 'response', we can still do cross-validation!
Two-fold cross-validation:
Randomly sample 50% of the data to use as the training set for clustering.
For each user in the test set, check whether the cluster they are assigned to in December matches the cluster they are assigned to in January.
Each test vector is assigned to the cluster whose center, derived from clustering the training set, is nearest.
Training Set (50%) clusters: C1, C2, C3, C4, C5, C6, C7.
A user in the Test Set (50%) has a December usage vector Xi-Dec and a January usage vector Xi-Jan; each is assigned to its nearest training cluster center.

Concordance counts (worked example, 50 test users):

            Jan: Yes   Jan: No   Total
Dec: Yes        20         5       25
Dec: No         10        15       25
Total           30        20       50

Agreement measure: (20 + 15) / 50 = 0.7
Prob(Yes by chance): (30/50) × (25/50) = 0.3
Prob(No by chance): (20/50) × (25/50) = 0.2
Cohen's Kappa: (0.7 - (0.3 + 0.2)) / (1 - (0.3 + 0.2)) = 0.4
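The worked kappa example can be verified with a short function; a minimal sketch, not tied to any particular library:

```python
def cohens_kappa(table):
    """Cohen's kappa from a K x K agreement table (rows: Dec, cols: Jan)."""
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(len(table))) / n         # observed agreement
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_tot, col_tot)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# The Dec-vs-Jan concordance table from the slide
print(round(cohens_kappa([[20, 5], [10, 15]]), 3))  # 0.4
```

Kappa corrects raw agreement (0.7 here) for the agreement expected by chance (0.5), which is why it is the preferred score for comparing algorithms in the next slide.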
Evaluation of Clustering Algorithms (cont’d)
 Slicing: rank users by their usage volume, then split them into 10 slices, with around 15,000 users in each slice.
Fig 5: Cohen's Kappa is a way to measure agreement between two categorical vectors.
[Figure: Kappa by usage slice; slice volume thresholds: >17.7hr, >7.5hr, >3.9hr, >2.2hr, >1.4hr, >0.9hr, >0.5hr, >0.3hr, >0.2hr, <0.2hr]
Low Volume Users
 Low-volume users generate very few calls or text messages; their usage vectors are quite sparse, typically consisting of isolated spikes.
Fig 6: K-means clustering for the users from the bottom 10% quantile of the usage distribution.
 The Euclidean metric does not take into account the correlations between the bins, which is valuable information.
 Solutions:
  o Dimension reduction: cluster on the principal components
  o Change distance metric: cosine distance, Earth Mover's Distance (EMD)
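Both fixes are easy to sketch with scikit-learn; the data here are synthetic sparse counts, and the component and cluster counts are arbitrary. For the cosine metric, unit-normalizing rows lets ordinary k-means stand in for cosine-distance clustering, since ||a - b||² = 2 - 2·cos(a, b) for unit vectors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Sparse, spiky counts standing in for low-volume users' 96-bin vectors
X = rng.poisson(0.3, (200, 96)).astype(float)

# Fix 1: cluster on the leading principal components
Z = PCA(n_components=10, random_state=2).fit_transform(X)
labels_pc = KMeans(n_clusters=4, n_init=10, random_state=2).fit_predict(Z)

# Fix 2: unit-normalize rows; k-means then approximates cosine-distance clustering
Xn = normalize(X)
labels_cos = KMeans(n_clusters=4, n_init=10, random_state=2).fit_predict(Xn)
print(Z.shape, len(set(labels_pc)), len(set(labels_cos)))
```

Normalization removes overall volume, so clusters reflect the shape of the usage pattern rather than its magnitude, which is the point for low-volume users.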
Earth Mover’s Distance
[Figure: example Voice and SMS usage histograms]
• EMD allows us to define a topology on the bins,
introducing the notion of nearness of bins.
Consider neighboring bins in time as ‘close’.
• EMD = (amount of mass moved) × (effort it takes to move it)
• Effort is usually a function of distance
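In one dimension (e.g. the 24 hourly bins of a single day type and channel), EMD is the Wasserstein distance, available in SciPy. A small sketch with made-up profiles: moving mass from 6pm to 7pm costs far less than moving it from 8am to 6pm, which is exactly the nearness-of-bins notion described above:

```python
import numpy as np
from scipy.stats import wasserstein_distance

hours = np.arange(24)
morning = np.zeros(24); morning[8] = 10    # 10 calls at 8am
evening = np.zeros(24); evening[18] = 10   # 10 calls at 6pm
late = np.zeros(24);    late[19] = 10      # 10 calls at 7pm

# Positions are the hours; the usage counts act as (unnormalized) weights
d_far = wasserstein_distance(hours, hours, morning, evening)
d_near = wasserstein_distance(hours, hours, evening, late)
print(d_far, d_near)  # 8am->6pm moves mass 10 hours; 6pm->7pm only 1 hour
```

Under Euclidean distance these three profiles would all be equally far apart; EMD sees the evening and late profiles as near neighbors.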
EMD Performance
Fig 8: Kappa for K-means (Euclidean), K-means (Cosine), Hierarchical (EMD), ORCLUS, HDDC, and K-means (PC31).
Fig 9: Hierarchical (EMD) for the users from the bottom 10% quantile of the usage distribution.
August 11, 2011
Data Mining - Volinsky - 2011 - Columbia
Page 15