CACTUS
Clustering Categorical Data
Using Summaries
Venkatesh Ganti
joint work with
Johannes Gehrke and Raghu Ramakrishnan
(University of Wisconsin-Madison)
1
Introduction
Most research on clustering has focused on
n-dimensional numeric data
e.g., BIRCH [ZRL96], CURE [GRS98], the clustering
framework of [BFR98], WaveCluster [SCZ98], etc.
Many datasets also contain categorical
attributes
e.g., the UC-Irvine collection of datasets
Problem: similarity functions are not
defined for categorical data
2
CACTUS
Goal: Fast scalable algorithm for
discovering well-defined clusters
Similarity: use attribute value co-occurrence (as in STIRR [GKR98])
Speed and scalability: exploit the
small domain sizes of categorical
attributes
3
Preliminaries and Notation
Set of n categorical attributes with
domains D1,…,Dn
A tuple consists of a value from each
domain, e.g., (a1,b2,c1)
Dataset: a set of tuples
[Figure: attributes A, B, C with domains {a1,…,a4}, {b1,…,b4}, {c1,…,c4}]
Note: Sizes of D1,…,Dn are
typically very small
4
Similarity between attributes
[Figure: value graph over A, B, C; one edge marked “Not strongly connected”]
“similarity” between a1 and b1:
support(a1,b1) = #tuples containing (a1,b1)
a1 and b1 are strongly connected if
support(a1,b1) is higher than expected
{a1,a2,a3,a4} and {b1,b2} are strongly
connected if all pairs are
5
Similarity within an attribute
simA(b1,b2): number of values of A
which are strongly connected with
both b1 and b2
[Figure: value graph over A, B, C]

sim*(B)    thru A    thru C
(b1,b2)      4         2
(b1,b3)      0         2
(b1,b4)      0         0
(b2,b3)      0         2
(b2,b4)      0         0
6
Definitions
Support(ai,bk) is the number of tuples
that contain both ai and bk
ai and bk are strongly connected if
support(ai,bk) >> expected value
Si and Sk are strongly connected if
every pair of values in Si x Sk is
strongly connected.
7
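The definitions above can be sketched in Python. The toy dataset, the threshold `alpha`, and the helper names are illustrative; the slides only require that support exceed its expected value by a user-chosen margin.

```python
# Toy dataset over attributes A, B, C; each domain has 4 values.
tuples = [
    ("a1", "b1", "c1"), ("a1", "b2", "c2"),
    ("a2", "b1", "c1"), ("a2", "b2", "c2"),
    ("a3", "b3", "c3"), ("a4", "b4", "c4"),
]

def support(dataset, x, y):
    """support(x, y): number of tuples containing both values."""
    return sum(1 for t in dataset if x in t and y in t)

def expected_support(dataset, dom_x_size, dom_y_size):
    """Expected support if attribute values were independent and
    uniform: |dataset| / (|Dx| * |Dy|)."""
    return len(dataset) / (dom_x_size * dom_y_size)

def strongly_connected(dataset, x, y, dom_x_size, dom_y_size, alpha=2.0):
    """x and y are strongly connected when their support exceeds the
    expected value by the factor alpha (an illustrative user threshold)."""
    return support(dataset, x, y) > alpha * expected_support(
        dataset, dom_x_size, dom_y_size)
```

Here support(a1,b1) = 1 against an expected 6/16 = 0.375, so a1 and b1 are strongly connected at alpha = 2.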
An Example
[Figure: attributes A, B, C; the region {a1,a2} x {b1,b2} x {c1,c2} is dense]
Intuitively, a cluster is a
high-density region
Region: {a1,a2} x {b1,b2} x {c1,c2}
Note: Dense regions lead to strongly connected sets
8
Cluster Definition
Region: a cross-product of sets of
attribute values: C1 x … x Cn
C=C1 x … x Cn is a cluster iff
1. Ci and Cj are strongly connected, for all i,j
2. Ci is maximal, for all i
3. Support(C) >> expected
Ci: cluster projection of C on Ai
9
CACTUS: Outline
Idea: compute and use data summaries for
clustering
3 phases
Summarization
Compute summaries of data
Clustering
Using the summaries to compute
candidate clusters
Validation
Validate the set of candidate clusters
from the clustering phase
10
Summaries
Two types of summaries
Inter-attribute summaries
Intra-attribute summaries
11
Inter-Attribute Summaries
Supports of all strongly connected attribute
value pairs from different attributes
Similar in nature to “frequent’’ 2-itemsets
So is the computation
[Figure: value graph over A, B, C]

IJ(A,B)    IJ(A,C)    IJ(B,C)
(a1,b1)    (a1,c1)    (b1,c1)
(a1,b2)    (a1,c2)    (b1,c2)
(a2,b1)    (a2,c1)    (b2,c1)
(a2,b2)    (a2,c2)    (b2,c2)
(a3,b1)    …          (b3,c1)
…                     …
12
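The inter-attribute summaries can be computed in one scan, much as 2-itemset counts are. A sketch with illustrative names; the algorithm stores only the strongly connected pairs, while this simplification keeps all pair counts:

```python
from collections import Counter
from itertools import combinations

def inter_attribute_summary(dataset):
    """One scan of the tuples: count the support of every pair of
    attribute values from different attributes, keyed by the two
    attribute positions and the two values."""
    counts = Counter()
    for t in dataset:
        for (i, x), (j, y) in combinations(enumerate(t), 2):
            counts[(i, j, x, y)] += 1
    return counts

tuples = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b2", "c1")]
ij = inter_attribute_summary(tuples)
```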
Intra-attribute summaries
simA(B): similarities thru A of
attribute value pairs of B
[Figure: value graph over A, B, C]

sim*(B)    thru A    thru C
(b1,b2)      4         2
(b1,b3)      0         2
(b1,b4)      0         0
(b2,b3)      0         2
(b2,b4)      0         0
13
Computing Intra-attribute Summaries
SQL query to compute simA(B)
Select T1.B, T2.B, count(*)
From IJ(A,B) as T1(A,B), IJ(A,B) as T2(A,B)
Where T1.B < T2.B and T1.A = T2.A
Group By T1.B, T2.B
Having count(*) > 0;
Note: Inter-attribute summaries are
sufficient
Dataset is not accessed!
14
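The same self-join can be sketched in Python over the materialized IJ(A,B) pairs; the function name and the encoding of IJ(A,B) as (a, b) tuples are illustrative:

```python
from collections import Counter, defaultdict
from itertools import combinations

def sim_through_A(ij_ab):
    """Given IJ(A,B) as (a, b) pairs of strongly connected values,
    compute sim_A(b1, b2): how many A-values are strongly connected
    with both b1 and b2 (the self-join performed by the SQL query)."""
    bs_per_a = defaultdict(set)
    for a, b in ij_ab:
        bs_per_a[a].add(b)
    sim = Counter()
    for bs in bs_per_a.values():
        for b1, b2 in combinations(sorted(bs), 2):
            sim[(b1, b2)] += 1
    return sim

ij_ab = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2"), ("a3", "b1")]
sim = sim_through_A(ij_ab)
```

As on the slide, only the inter-attribute summary is read; the dataset itself is never touched.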
Memory Requirements for Summaries
Attribute domains are small
Typically less than 100
E.g., the largest attribute value domain in
the UC-Irvine collection is 100 (Pendigits
dataset)
E.g., with 50 attributes of domain size 100,
the summaries fit in 100 MB of main memory
Only one scan of the dataset for
computing inter-attribute summaries
15
CACTUS
Summarization
Clustering Phase
Validation
16
Clustering Phase
1. Compute cluster projections on each
attribute
2. Join cluster projections across
attributes—candidate cluster
generation
Identify the cluster projections:
{a1,a2}, {b1,b2}, {c1,c2};
Then identify the cluster:
{a1,a2} x {b1,b2} x {c1,c2}
[Figure: values of A, B, C with the cluster {a1,a2} x {b1,b2} x {c1,c2} highlighted]
17
Computing Cluster Projections
From (A,B): S_A^B = {a1, a2, a3, a4}
From (A,C): S_A^C = {a1, a2}
S_A^B ∩ S_A^C = {a1, a2, a3, a4} ∩ {a1, a2} = {a1, a2}
[Figure: value graph over A, B, C]
Lemma: Computing all projections of
clusters on attribute pairs is NP-complete
18
Distinguishing Set Assumption
Each cluster projection Ci on Ai is
distinguished by a small set of
attribute values
The size of a distinguishing set is
bounded by k (the distinguishing number)
19
Distinguishing Set Assumption
[Figure: values of A, B, C with the cluster {a1,a2} x {b1,b2} x {c1,c2} highlighted]
• Cluster: {a1,a2} x {b1,b2} x {c1,c2};
{a1} (or {a2}) distinguishes {a1,a2};
• Approach: Compute distinguishing sets
and extend them to cluster projections
20
Candidate Cluster Generation
Cluster projections S1,…,Sn on A1,…,An
Cross product S1 x … x Sn
Level-wise synthesis: S1 x S2, prune,
then add S3 and so on.
May contain some dubious clusters!
[Figure: cluster projections C’, C1, C2, C3]
S1={C’,C1}; S2={C2}; S3={C3};
C’ x C2 x C3: not a cluster
21
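The level-wise synthesis can be sketched as follows; `is_candidate` stands in for the support check against the summaries, and all names are illustrative:

```python
from itertools import product

def synthesize(projections_per_attr, is_candidate):
    """Level-wise synthesis of candidate clusters: start from pairs
    over the first two attributes, prune with is_candidate, then
    extend by one attribute at a time, pruning after each extension."""
    candidates = [(c1, c2)
                  for c1, c2 in product(projections_per_attr[0],
                                        projections_per_attr[1])
                  if is_candidate((c1, c2))]
    for projections in projections_per_attr[2:]:
        candidates = [c + (p,)
                      for c in candidates for p in projections
                      if is_candidate(c + (p,))]
    return candidates

s1 = [frozenset({"a1", "a2"})]
s2 = [frozenset({"b1", "b2"})]
s3 = [frozenset({"c1", "c2"})]
candidates = synthesize([s1, s2, s3], lambda c: True)
```

As the slide notes, pruning by summaries alone can leave dubious candidates, which is why a validation scan follows.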
The CACTUS Algorithm
Summarize
inter-attribute summaries: scans dataset
intra-attribute summaries
Clustering phase
Compute cluster projections
Level-wise synthesis of cluster projections
to form candidate clusters
Validation
Requires a scan of the dataset
22
STIRR [GKR98]
An iterative dynamical system
Weighted nodes in the graph
In each iteration, weights are propagated
between connected nodes (determined by
tuples in the dataset)
Each iteration requires a dataset scan
Iteration stops when the fixed point
is reached
Similar nodes have similar weights
23
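One such iteration can be roughly sketched with a simple additive combiner; this is a simplification for intuition, not the exact operator or normalization of [GKR98]:

```python
import math

def stirr_iteration(tuples, domains, weights):
    """One iteration of a STIRR-like dynamical system: each attribute
    value's new weight is the sum, over the tuples containing it, of
    the other values' weights in that tuple; weights are then rescaled
    to unit Euclidean norm within each attribute."""
    new_w = {v: 0.0 for v in weights}
    for t in tuples:
        for i, v in enumerate(t):
            new_w[v] += sum(weights[u] for j, u in enumerate(t) if j != i)
    for dom in domains:
        norm = math.sqrt(sum(new_w[v] ** 2 for v in dom)) or 1.0
        for v in dom:
            new_w[v] /= norm
    return new_w

tuples = [("a1", "b1"), ("a2", "b2")]
domains = [{"a1", "a2"}, {"b1", "b2"}]
w = stirr_iteration(tuples, domains, {v: 1.0 for d in domains for v in d})
```

Note that each call corresponds to one scan of the dataset, which is why iterating to a fixed point is costly.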
Experimental Evaluation
Compare CACTUS with STIRR
Synthetic datasets
Quasi-random data [GKR98:STIRR]
Fix domain of each attribute
Randomly generate tuples from these
domains
Identify clusters and plant additional (5%)
data within the clusters
24
Synthetic Datasets: Cactus and STIRR
Planted clusters: {0,…,9} x {0,…,9} and {10,…,19} x {10,…,19}
[Figure: 2-d grid over domains {0,…,99} with the two clusters marked]
Both CACTUS and STIRR identified
the two clusters exactly
25
Synthetic Dataset (contd.)
Planted clusters:
{0,…,9} x {0,…,9} x {0,…,9}
{10,…,19} x {10,…,19} x {10,…,19}
{0,…,9} x {10,…,19} x {10,…,19}
[Figure: grid over domains {0,…,99} with the clusters marked]
CACTUS identifies the 3 clusters
STIRR returns:
{0,…,9} x {0,…,19} x {0,…,9}
{10,…,19} x {0,…,19} x {10,…,19}
26
Scalability with #Tuples
[Plot: time (seconds) vs. #tuples (1–5 million) for CACTUS and STIRR; #attributes: 10, domain size: 100]
CACTUS is 10 times faster
27
Scalability with #Attributes
[Plot: time (seconds) vs. #attributes (4–50) for CACTUS and STIRR; 1 million tuples, domain size: 100]
28
Scalability with Domain Size
[Plot: time (seconds) vs. domain size (50–1000 attribute values) for CACTUS and STIRR; 1 million tuples, 4 attributes]
29
Bibliographic Data
Database and theory bibliography
entries [Wie]: 38,500 entries
Attributes: first author, second author,
conference/journal, and year
Example cluster projections on the
conference attribute
(1). ACM Sigmod, VLDB, ACM TODS, ICDE, ACM Sigmod Record
(2). ACMTG, CompGeom, FOCS, Geometry, ICALP, IPL, JCSS, …
(3). PODS, Algorithmica, FOCS, ICALP, INFCTRL, IPL, JCSS, …
30
Conclusions
Formal definition of a cluster
A fast, scalable, summarization-based
clustering algorithm for categorical
data
Outperforms an earlier algorithm
(STIRR) by almost an order of magnitude
Subspace clustering
31
Extensions
Dealing with large attribute value
domains
In some rare cases, the inter-attribute
or intra-attribute summaries may not fit
in main memory
Clusters in subspaces when the
number of attributes is large
33
Related Work
Conceptual Clustering (e.g., [Fisher87]),
EM [DLR77]
Assume that datasets fit in main memory
Recent scalable clustering algorithms
for clustering categorical data
STIRR [GKR98]
ROCK [GRS99]
the definition of clusters is not clear
34
Limitations
The cluster definition may be too
strong for certain applications
We require every pair of attribute
values across attributes to be strongly
connected
Consequence: a large number of
clusters
35
Outline of the talk
Notion of similarity
Cluster Definition
The CACTUS Algorithm
Experimental Evaluation
Extensions to CACTUS
Conclusions
36
Validation
Scan the dataset once more
Compute supports of candidate
clusters
Retain only those with significantly
high support
37
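A sketch of the validation scan; `min_support` is an illustrative threshold (the slides retain clusters whose support is significantly higher than expected):

```python
def validate(dataset, candidates, min_support):
    """One more scan of the dataset: count each candidate cluster's
    support (tuples falling inside the region) and retain only the
    candidates that reach the threshold."""
    counts = [0] * len(candidates)
    for t in dataset:
        for i, region in enumerate(candidates):
            if all(value in side for value, side in zip(t, region)):
                counts[i] += 1
    return [c for c, n in zip(candidates, counts) if n >= min_support]

data = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a3", "b3")]
candidate = ({"a1", "a2"}, {"b1", "b2"})
kept = validate(data, [candidate], min_support=3)
```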
Computing Cluster Projections:
Algorithm
For the attribute A1, compute cluster
projections from clusters on (A1,A2),
(A1,A3),…,(A1,An): S_1^2, …, S_1^n
Intersection join on S_1^2, …, S_1^n:
S_1 = {s : (∃s′∈S_1^2, s ⊆ s′), …, (∃s′∈S_1^n, s ⊆ s′)}
38
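A value-level simplification of the intersection join, with illustrative names: a value of A1 survives only if some projection computed against every other attribute covers it.

```python
def intersect_projections(projections_per_pair):
    """For attribute A1: projections_per_pair[k] holds the candidate
    cluster projections of A1 computed against one other attribute
    (each projection a set of A1-values). Keep the values covered by
    at least one projection for every other attribute."""
    kept = None
    for projections in projections_per_pair:
        covered = set().union(*projections) if projections else set()
        kept = covered if kept is None else kept & covered
    return kept if kept is not None else set()

# The running example: {a1,a2,a3,a4} from (A,B) and {a1,a2} from (A,C).
s1_from_B = [{"a1", "a2", "a3", "a4"}]
s1_from_C = [{"a1", "a2"}]
result = intersect_projections([s1_from_B, s1_from_C])
```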
Computing Cluster Projections
From (A,B): S_A^B = {a1, a2, a3, a4}
From (A,C): S_A^C = {a1, a2}
S_A^B ∩ S_A^C = {a1, a2, a3, a4} ∩ {a1, a2} = {a1, a2}
[Figure: value graph over A, B, C]
Lemma: Let C = C1 x … x Cn be a cluster.
Then Ci is the intersection of {Ci’:
(Ci’,Ck) is a cluster on Ai, Ak}
39