Transcript Document 7396207

CACTUS
Clustering Categorical Data
Using Summaries
Venkatesh Ganti
joint work with
Johannes Gehrke and Raghu Ramakrishnan
(University of Wisconsin-Madison)
1
Introduction
 Most research on clustering has focused on
n-dimensional numeric data
 e.g., BIRCH [ZRL96], CURE [GRS98], the clustering
framework of [BFR98], WaveCluster [SCZ98], etc.
 Many datasets also contain categorical
attributes
 e.g., the UC-Irvine collection of datasets
 Problem: similarity functions are not
defined for categorical data
2
CACTUS
 Goal: Fast scalable algorithm for
discovering well-defined clusters
 Similarity: use attribute value co-occurrence (STIRR [GKR98])
 Speed and scalability: exploit the
small domain sizes of categorical
attributes
3
Preliminaries and Notation
 Set of n categorical attributes with
domains D1,…,Dn
 A tuple consists of a value from each
domain, e.g., (a1,b2,c1)
 Dataset: a set of tuples
[Figure: three attributes A, B, C with values a1–a4, b1–b4, c1–c4]
Note: Sizes of D1,…,Dn are
typically very small
4
Similarity between attributes
[Figure: attributes A, B, C; a1 and b1 are strongly connected, the C-values are not]

"similarity" between a1 and b1:
support(a1,b1) = #tuples containing (a1,b1)
a1 and b1 are strongly connected if
support(a1,b1) is higher than expected
{a1,a2,a3,a4} and {b1,b2} are strongly
connected if all pairs are
5
Similarity within an attribute
 simA(b1,b2): number of values of A
which are strongly connected with
both b1 and b2
[Figure: attributes A, B, C as before]

sim*(B)    (b1,b2)  (b1,b3)  (b1,b4)  (b2,b3)  (b2,b4)
thru A        4        0        0        0        0
thru C        2        2        0        2        0
6
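The "thru A" row above can be reproduced with a minimal sketch (not the authors' code) of sim_A(b1, b2): the number of A-values strongly connected with both b1 and b2. Here `connected_ab` is an illustrative set of strongly connected (A-value, B-value) pairs.

```python
# Sketch: sim_A(b1, b2) = |{a : (a, b1) and (a, b2) both strongly connected}|

def sim_via(connected_ab, b1, b2):
    a_with_b1 = {a for (a, b) in connected_ab if b == b1}
    a_with_b2 = {a for (a, b) in connected_ab if b == b2}
    return len(a_with_b1 & a_with_b2)

# Toy data mirroring the slide: a1..a4 each connected to both b1 and b2.
connected = {(a, b) for a in ("a1", "a2", "a3", "a4") for b in ("b1", "b2")}
print(sim_via(connected, "b1", "b2"))  # 4, as in the "thru A" row
print(sim_via(connected, "b1", "b3"))  # 0
```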
Definitions
 Support(ai,bk) is the number of tuples
that contain both ai and bk
 ai and bk are strongly connected if
support(ai,bk) >> expected value
 Si and Sk are strongly connected if
every pair of values in Si x Sk is
strongly connected.
7
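The support test above can be sketched in a few lines. This is a hedged illustration, assuming attribute independence as the null model, so the expected support of (ai, bk) in uniform random data is |T| / (|Di| * |Dk|); the factor `alpha` is an illustrative stand-in for the ">>" threshold.

```python
# Sketch of support and the strong-connection test (alpha is illustrative).

def support(tuples, i, k, ai, bk):
    return sum(1 for t in tuples if t[i] == ai and t[k] == bk)

def strongly_connected(tuples, i, k, ai, bk, dom_i, dom_k, alpha=3.0):
    expected = len(tuples) / (dom_i * dom_k)  # uniform-independence baseline
    return support(tuples, i, k, ai, bk) > alpha * expected

# Toy dataset: (a1, b1) co-occurs far more often than chance predicts.
data = [("a1", "b1")] * 10 + [("a2", "b2")]
print(strongly_connected(data, 0, 1, "a1", "b1", 2, 2))  # True
print(strongly_connected(data, 0, 1, "a2", "b2", 2, 2))  # False
```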
An Example
[Figure: dense region {a1,a2} x {b1,b2} x {c1,c2} among attributes A, B, C]
Intuitively, a cluster is a
high-density region
Region: {a1,a2} x {b1,b2} x {c1,c2}
Note: Dense regions lead to strongly connected sets
8
Cluster Definition
 Region: a cross-product of sets of
attribute values: C1 x … x Cn
 C = C1 x … x Cn is a cluster iff
1. Ci and Cj are strongly connected, for all i, j
2. Ci is maximal, for all i
3. Support(C) >> expected
Ci: cluster projection of C on Ai
9
CACTUS: Outline
 Idea: compute and use data summaries for
clustering
 3 phases
 Summarization
 Compute summaries of data
 Clustering
 Using the summaries to compute
candidate clusters
 Validation
 Validate the set of candidate clusters
from the clustering phase
10
Summaries
Two types of summaries
 Inter-attribute summaries
 Intra-attribute summaries
11
Inter-Attribute Summaries
 Supports of all strongly connected attribute
value pairs from different attributes
 Similar in nature to "frequent" 2-itemsets
 So is the computation
[Figure: attributes A, B, C as before]

IJ(A,B)    IJ(A,C)    IJ(B,C)
(a1,b1)    (a1,c1)    (b1,c1)
(a1,b2)    (a1,c2)    (b1,c2)
(a2,b1)    (a2,c1)    (b2,c1)
(a2,b2)    (a2,c2)    (b2,c2)
(a3,b1)    …          (b3,c1)
…                     …
12
Intra-attribute summaries
 simA(B): similarities thru A of
attribute value pairs of B
[Figure: attributes A, B, C as before]

sim*(B)    (b1,b2)  (b1,b3)  (b1,b4)  (b2,b3)  (b2,b4)
thru A        4        0        0        0        0
thru C        2        2        0        2        0
13
Computing Intra-attribute Summaries
 SQL query to compute simA(B):

Select T1.B, T2.B, count(*)
From IJ(A,B) as T1(A,B), IJ(A,B) as T2(A,B)
Where T1.B < T2.B and T1.A = T2.A
Group By T1.B, T2.B
Having count(*) > 0;
 Note: Inter-attribute summaries are
sufficient
 Dataset is not accessed!
14
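The self-join in that query can be rendered in Python as a sketch. Here `ij_ab` is an illustrative data structure standing in for the inter-attribute summary IJ(A,B): a set of strongly connected (A-value, B-value) pairs. As the slide notes, the dataset itself is never touched.

```python
from collections import Counter

# Sketch: intra-attribute summary sim_A(B) via a self-join on IJ(A,B).
def intra_summary(ij_ab):
    counts = Counter()
    for (a1, b1) in ij_ab:
        for (a2, b2) in ij_ab:
            if a1 == a2 and b1 < b2:  # join on A, ordered B-value pair
                counts[(b1, b2)] += 1
    return counts

# Toy summary: a1..a4 each strongly connected to b1 and b2.
ij = {(a, b) for a in ("a1", "a2", "a3", "a4") for b in ("b1", "b2")}
print(intra_summary(ij)[("b1", "b2")])  # 4
```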
Memory Requirements for Summaries
 Attribute domains are small
 Typically less than 100
 E.g., the largest attribute value domain in
the UC-Irvine collection is 100 (Pendigits
dataset)
 With 50 attributes and domain sizes of 100,
the summaries fit in 100 MB of main memory
 Only one scan of the dataset for
computing inter-attribute summaries
15
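A back-of-the-envelope check (my arithmetic, not from the slides) shows why 100 MB suffices: the inter-attribute summaries need at most one counter per attribute pair per value pair.

```python
# Worst-case size of inter-attribute summaries (illustrative arithmetic).
n, d = 50, 100                          # attributes, domain size
attribute_pairs = n * (n - 1) // 2      # 1225 attribute pairs
counters = attribute_pairs * d * d      # 12,250,000 counters
megabytes = counters * 8 / 10**6        # assuming 8-byte counts
print(megabytes)  # 98.0, within the 100 MB budget
```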
CACTUS
 Summarization
 Clustering Phase
 Validation
16
Clustering Phase
1. Compute cluster projections on each
attribute
2. Join cluster projections across
attributes—candidate cluster
generation
Identify the cluster projections:
{a1,a2}, {b1,b2}, {c1,c2};
Then identify the cluster:
{a1,a2} x {b1,b2} x {c1,c2}
[Figure: attributes A, B, C with cluster projections {a1,a2}, {b1,b2}, {c1,c2}]
17
Computing Cluster Projections
From (A,B): S_A^B = {a1, a2, a3, a4}
From (A,C): S_A^C = {a1, a2}
{a1, a2} = S_A^B ∩ S_A^C = {a1, a2, a3, a4} ∩ {a1, a2}

[Figure: attributes A, B, C as before]
Lemma: Computing all projections of
clusters on attribute pairs is NP-complete
18
Distinguishing Set Assumption
 Each cluster projection Ci on Ai is
distinguished by a small set of
attribute values
 The size of the distinguishing set is bounded
by k (the distinguishing number)
 Values for k are typically small
19
Distinguishing Set Assumption
[Figure: attributes A, B, C with cluster {a1,a2} x {b1,b2} x {c1,c2}]
• Cluster: {a1,a2} x {b1,b2} x {c1,c2};
{a1} (or {a2}) distinguishes {a1,a2};
• Approach: Compute distinguishing sets
and extend them to cluster projections
20
Candidate Cluster Generation
 Cluster projections S1,…,Sn on A1,…,An
 Cross product S1 x … x Sn
 Level-wise synthesis: S1 x S2, prune,
then add S3 and so on.
 May contain some dubious clusters!
C’
C1
C2
C3
S1={C’,C1}; S2={C2}; S3={C3};
C’ x C2 x C3: not a cluster
21
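The level-wise step can be sketched as follows. This is an illustrative reconstruction, not the authors' code: candidates grow one attribute at a time, and an extension survives only if every pairwise combination passes `pair_ok`, a hypothetical predicate standing in for the 2-attribute cluster check. As the slide warns, pairwise checks alone can admit dubious candidates, which validation later removes.

```python
# Sketch of level-wise synthesis of cluster projections into candidates.
def synthesize(projections, pair_ok):
    candidates = [(s,) for s in projections[0]]
    for j in range(1, len(projections)):
        candidates = [
            c + (sj,)
            for c in candidates
            for sj in projections[j]
            if all(pair_ok(c[i], sj) for i in range(len(c)))  # prune early
        ]
    return candidates

# S1 = {C', C1}; S2 = {C2}; S3 = {C3} as on the slide (sets named by string).
s = [["C'", "C1"], ["C2"], ["C3"]]
print(synthesize(s, lambda x, y: True))  # both 3-attribute candidates survive
```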
The CACTUS Algorithm
 Summarize
 inter-attribute summaries: scans dataset
 intra-attribute summaries
 Clustering phase
 Compute cluster projections
 Level-wise synthesis of cluster projections
to form candidate clusters
 Validation
 Requires a scan of the dataset
22
STIRR [GKR98]
 An iterative dynamical system
 Weighted nodes in the graph
 In each iteration, weights are propagated
between connected nodes (determined by
tuples in the dataset)
 Each iteration requires a dataset scan
 Iteration stops when the fixed point
is reached
 Similar nodes have similar weights
23
Experimental Evaluation
 Compare CACTUS with STIRR
 Synthetic datasets
 Quasi-random data [GKR98:STIRR]
 Fix domain of each attribute
 Randomly generate tuples from these
domains
 Identify clusters and plant additional (5%)
data within the clusters
24
Synthetic Datasets: Cactus and STIRR
{0,…,9} x {0,…,9}
{10,…,19} x {10,…,19}

[Figure: two planted clusters on attribute domains {0,…,99}]

Both CACTUS and STIRR identified
the two clusters exactly
25
Synthetic Dataset (contd.)
{0,…,9} x {0,…,9} x {0,…,9}
{10,…,19} x {10,…,19} x {10,…,19}
{0,…,9} x {10,…,19} x {10,…,19}
[Figure: three planted clusters on attribute domains {0,…,99}]

CACTUS identifies the 3 clusters
STIRR returns:
{0,…,9} x {0,…,19} x {0,…,9}
{10,…,19} x {0,…,19} x {10,…,19}
26
Scalability with #Tuples
[Chart: time (seconds) vs. #tuples (1–5 million) for CACTUS and STIRR]
#Attributes: 10
Domain Size: 100
CACTUS is 10 times faster
27
Scalability with #Attributes
[Chart: time (seconds) vs. #attributes (4–50) for CACTUS and STIRR]
1 million tuples
Domain size: 100
28
Scalability with Domain Size
[Chart: time (seconds) vs. domain size (50–1000 attribute values) for CACTUS and STIRR]
1 million tuples
#attributes: 4
29
Bibliographic Data
 Database and theory bibliographic entries
[Wie]: 38,500 entries
 Attributes: first author, second author,
conference/journal, and year
 Example cluster projections on the
conference attribute:
(1). ACM Sigmod, VLDB, ACM TODS, ICDE, ACM Sigmod Record
(2). ACMTG, CompGeom, FOCS, Geometry, ICALP, IPL, JCSS, …
(3). PODS, Algorithmica, FOCS, ICALP, INFCTRL, IPL, JCSS, …
30
Conclusions
 Formal definition of a cluster
 A scalable fast summarization-based
clustering algorithm for categorical
data
 Outperforms an earlier algorithm
(STIRR) by almost an order of magnitude
 Subspace clustering
31
Extensions
 Dealing with large attribute value
domains
 In some rare cases, the inter-attribute
or intra-attribute summaries may not fit
in main memory
 Clusters in subspaces when the
number of attributes is large
33
Related Work
 Conceptual Clustering (e.g., [Fisher87]),
EM [DLR77]
 Assume that datasets fit in main memory
 Recent scalable clustering algorithms
for clustering categorical data
 STIRR [GKR98]
 ROCK [GRS99]
 However, their definition of clusters is not clear
34
Limitations
 The cluster definition may be too
strong for certain applications
 We require every pair of attribute
values across attributes to be strongly
connected
 Consequence: a large number of
clusters
35
Outline of the talk
 Notion of similarity
 Cluster Definition
 The CACTUS Algorithm
 Experimental Evaluation
 Extensions to CACTUS
 Conclusions
36
Validation
 Scan the dataset once more
 Compute supports of candidate
clusters
 Retain only those with significantly
high support
37
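The validation pass above can be sketched directly, assuming (illustratively) that each candidate is a tuple of sets of attribute values and that the support threshold is given:

```python
# Sketch of validation: one scan over the data, counting each candidate's
# support, then keeping candidates at or above an illustrative threshold.
def validate(tuples, candidates, threshold):
    support = [0] * len(candidates)
    for t in tuples:  # the single extra dataset scan
        for idx, c in enumerate(candidates):
            if all(t[i] in ci for i, ci in enumerate(c)):
                support[idx] += 1
    return [c for idx, c in enumerate(candidates) if support[idx] >= threshold]

cands = [({"a1", "a2"}, {"b1", "b2"}), ({"a3"}, {"b3"})]
data = [("a1", "b1"), ("a2", "b2"), ("a1", "b2"), ("a3", "b4")]
print(validate(data, cands, 3))  # keeps only the first candidate
```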
Computing Cluster Projections:
Algorithm
 For the attribute A1, compute cluster
projections from clusters on (A1,A2),
(A1,A3), …, (A1,An): S1^2, …, S1^n
 Intersection join on S1^2, …, S1^n:
S1 = {s : (∃s' ∈ S1^2 : s ∈ s'), …, (∃s' ∈ S1^n : s ∈ s')}
38
Computing Cluster Projections
From (A,B): S_A^B = {a1, a2, a3, a4}
From (A,C): S_A^C = {a1, a2}
{a1, a2} = S_A^B ∩ S_A^C = {a1, a2, a3, a4} ∩ {a1, a2}

[Figure: attributes A, B, C as before]
 Lemma: Let C = C1 x … x Cn be a cluster.
Then Ci is the intersection of {Ci' :
(Ci', Ck) is a cluster on (Ai, Ak)}
39
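The lemma's intersection step can be sketched as follows, using the slide's own example; the function name is illustrative.

```python
from functools import reduce

# Sketch: the cluster projection on A1 is the intersection of the
# projections derived from each attribute pair (per the lemma).
def project(pairwise_projections):
    return reduce(set.intersection, (set(p) for p in pairwise_projections))

# The slide's example: S_A^B = {a1..a4} from (A,B), S_A^C = {a1,a2} from (A,C).
print(sorted(project([{"a1", "a2", "a3", "a4"}, {"a1", "a2"}])))  # ['a1', 'a2']
```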