MOSAIC: A Proximity Graph Approach
for Agglomerative Clustering
Jiyeon Choo, Rachsuda Jiamthapthaksin, Chun-shen Chen, Ulvi
Celepcikay, Christian Guisti, and Christoph F. Eick
Department of Computer Science, University of Houston
Organization
1. Motivation
 • Scope of the research
   – Region Discovery
   – Traditional Clustering
 • Clustering with Plug-In Fitness Functions
 • Shape-aware Clustering Algorithms
 • Ideas of MOSAIC
2. Background
3. The MOSAIC Algorithm
4. Experimental Evaluation
5. Related Work
6. Conclusion and Future Work
Ch. Eick et al.: MOSAIC…, DaWaK, Regensburg 2007
1.1 Motivation: Examples of Region Discovery
Application 1: Hot-spot Discovery [EVDW06]
Application 2: Find Interesting Regions with respect to a Continuous Variable
Application 3: Find “representative” regions (Sampling)
Application 4: Regional Co-location Mining
Application 5: Regional Association Rule Mining [DEWY06]
Application 6: Regional Association Rule Scoping [EDYKN07]
[Figure: wells in Texas, with RD-Algorithm results shown for b=1.01 and b=1.04. Green: safe well with respect to arsenic; red: unsafe well.]
Region Discovery Framework
The algorithms we currently investigate solve the following problem:
Given:
A dataset O with a schema R
A distance function d defined on instances of R
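A fitness function q(X) that evaluates a clustering X={c1,…,ck} as follows:
$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} \mathrm{interestingness}(c) \cdot \mathrm{size}(c)^{b}$ with $b > 1$
Objective:
Find $c_1, \ldots, c_k \subseteq O$ such that:
1. $c_i \cap c_j = \emptyset$ if $i \neq j$
2. $X = \{c_1, \ldots, c_k\}$ maximizes $q(X)$
3. All clusters $c_i \in X$ are contiguous
4. $c_1 \cup \ldots \cup c_k \subseteq O$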
5. c1,…,ck are usually ranked based on the reward each cluster receives,
and low reward clusters are frequently not reported
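The reward-based fitness function above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the 0/1 encoding of wells, the function `hotspot_interestingness`, and its `threshold` parameter are assumptions made up for the example; only the form of q(X) comes from the slide.

```python
from typing import Callable, List, Sequence

Cluster = Sequence[int]  # hypothetical encoding: 1 = unsafe well, 0 = safe well


def hotspot_interestingness(cluster: Cluster, threshold: float = 0.5) -> float:
    """Hypothetical interestingness: share of unsafe wells above a threshold."""
    if not cluster:
        return 0.0
    unsafe_share = sum(cluster) / len(cluster)
    return max(0.0, unsafe_share - threshold)


def q(clustering: List[Cluster],
      interestingness: Callable[[Cluster], float],
      b: float = 1.01) -> float:
    """q(X) = sum over clusters c of interestingness(c) * size(c)**b, with b > 1."""
    if b <= 1:
        raise ValueError("b must be greater than 1")
    return sum(interestingness(c) * len(c) ** b for c in clustering)


X = [[1, 1, 1, 0], [0, 0, 0]]
print(q(X, hotspot_interestingness, b=2.0))  # 0.25 * 4**2 + 0 = 4.0
```

Because b > 1, the size term grows faster than linearly, so two equally interesting clusters earn more reward merged than apart.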
1.2 Clustering with Plug-In Fitness Functions
[Figure: clustering algorithms grouped by their use of fitness functions]
Clustering algorithms
• No fitness function: DBSCAN, Hierarchical Clustering
• Implicit fitness function: K-Means
• Fixed fitness function: PAM
• Provides plug-in fitness function: CHAMELEON, MOSAIC
1.3 Shape-aware Clustering
• Shape is a significant characteristic in traditional
clustering and region discovery
• Examples
Fig. 1: some chain-like patterns in the Volcano dataset
Fig. 2: arbitrarily shaped regions of high (low) arsenic concentration in Texas wells
1.4 Ideas Underlying MOSAIC
• MOSAIC provides a generic framework that
integrates representative-based clustering,
agglomerative clustering, and proximity graphs,
and that approximates arbitrary-shape clusters
using unions of small convex polygons
Fig. 6: An illustration of MOSAIC’s approach: (a) input, (b) output
Talk Organization
1. Motivation
2. Background
 • Representative-based clustering
 • Agglomerative clustering
 • Proximity Graphs
3. The MOSAIC Algorithm
4. Experimental Evaluation
5. Related Work
6. Conclusion and Future Work
2.1 Representative-based Clustering
[Figure: plot of Attribute1 against Attribute2 with labels 1, 2, 3, 4]
Objective: Find a set of objects $O_R$ such that the clustering X
obtained by using the objects in $O_R$ as representatives minimizes q(X).
Properties: Cluster shapes are convex polygons
Popular Algorithms: K-means, K-medoids, SCEC
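A minimal sketch of how a set of representatives induces a clustering, assuming Euclidean distance and objects given as coordinate tuples (both are assumptions for illustration; the function name `assign_to_representatives` is hypothetical). Each object joins its nearest representative, so every cluster is the Voronoi cell of its representative, which is why cluster shapes are convex polygons.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def assign_to_representatives(objects: Sequence[Point],
                              representatives: Sequence[Point]) -> List[List[Point]]:
    """Return one cluster per representative; each object goes to the nearest one."""
    clusters: List[List[Point]] = [[] for _ in representatives]
    for obj in objects:
        nearest = min(range(len(representatives)),
                      key=lambda r: math.dist(obj, representatives[r]))
        clusters[nearest].append(obj)
    return clusters


objects = [(0, 0), (1, 0), (9, 9), (10, 10)]
representatives = [(0, 0), (10, 10)]
print(assign_to_representatives(objects, representatives))
```

Algorithms such as K-means and K-medoids differ mainly in how they search for the representatives; the assignment step is the same.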
2.2 MOSAIC and Agglomerative Clustering
Advantages of MOSAIC over traditional agglomerative
clustering:
• Wider search—considers all neighboring clusters
• Plug-in fitness function
• Clusters are always contiguous
• Expensive algorithm is only run for 20-1000 iterations
• Highly generic algorithm
2.3 Proximity Graphs
• How to identify neighboring clusters for
representative-based clustering
algorithms?
$\mathrm{NNG} \subseteq \mathrm{MST} \subseteq \mathrm{RNG} \subseteq \mathrm{GG} \subseteq \mathrm{DT}$
• Proximity graphs provide various
definitions of “neighbour”
NNG = Nearest Neighbour Graph
MST = Minimum Spanning Tree
RNG = Relative Neighbourhood Graph
GG = Gabriel Graph
DT = Delaunay Triangulation (neighbours of a 1NN-classifier)
Proximity Graphs: Delaunay
• The Delaunay Triangulation is the
dual of the Voronoi diagram
• Three points are each other's
neighbours if their tangent sphere
contains no other points
• Complete: captures all neighbouring
clusters
• Expensive to compute in high
dimensions
Proximity Graphs: Gabriel
• The Gabriel graph is a
subset of the Delaunay
Triangulation (some
decision boundaries might
be missed)
• Points are neighbours only
if their (diametral) sphere
of influence is empty
• Can be computed more
efficiently: $O(k^3)$
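The empty-sphere test can be written as a brute-force check over all pairs and all third points, which gives the cubic cost quoted above. The sketch below is an illustration under assumptions: Euclidean distance, points as coordinate tuples, and the convention that a third point blocks an edge only if it lies strictly inside the diametral sphere. The function names are hypothetical.

```python
from typing import Sequence, Set, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gabriel_graph(points: Sequence[Point]) -> Set[Tuple[int, int]]:
    """Return Gabriel edges as index pairs (i, j) with i < j.

    A point r lies strictly inside the sphere with diameter (p, q)
    exactly when d(p,r)^2 + d(q,r)^2 < d(p,q)^2.
    """
    k = len(points)
    edges: Set[Tuple[int, int]] = set()
    for i in range(k):
        for j in range(i + 1, k):
            d_ij = sq_dist(points[i], points[j])
            sphere_is_empty = all(
                sq_dist(points[i], points[r]) + sq_dist(points[j], points[r]) >= d_ij
                for r in range(k) if r != i and r != j
            )
            if sphere_is_empty:
                edges.add((i, j))
    return edges


representatives = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1)]
print(sorted(gabriel_graph(representatives)))  # [(0, 2), (1, 2)]
```

In the example, point 2 sits between points 0 and 1, so the edge (0, 1) is dropped while both edges to point 2 are kept.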
3. MOSAIC
Fig. 10: Gabriel graph for clusters generated by
a representative-based clustering algorithm
Pseudo Code MOSAIC
1. Run a representative-based clustering algorithm to create a
large number of clusters.
2. Read the representatives of the obtained clusters.
3. Create a merge candidate relation using proximity graphs.
4. WHILE there are merge-candidates (Ci, Cj) left
   BEGIN
     Merge the pair of merge-candidates (Ci, Cj) that
     enhances fitness function q the most into a new cluster C’
     Update merge-candidates:
       $\forall C$: Merge-Candidate(C’, C) $\Leftrightarrow$ Merge-Candidate(Ci, C) $\lor$ Merge-Candidate(Cj, C)
   END
RETURN the best clustering X found.
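The merge loop in step 4 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' code: it assumes the initial clusters and the merge-candidate pairs (for example from a Gabriel graph) are already given, recomputes q from scratch for every candidate, and uses a made-up fitness `example_q` on 0/1 labels purely for the demonstration.

```python
from typing import Callable, Iterable, List, Tuple

Clustering = List[List[int]]


def mosaic(initial_clusters: Clustering,
           merge_candidates: Iterable[Tuple[int, int]],
           q: Callable[[Clustering], float]) -> Tuple[Clustering, float]:
    """Greedily merge candidate pairs; return the best clustering seen and its fitness."""
    clusters = {i: list(c) for i, c in enumerate(initial_clusters)}
    candidates = {tuple(sorted(p)) for p in merge_candidates}
    best = [list(c) for c in clusters.values()]
    best_q = q(best)
    next_id = len(clusters)

    def after_merge(i: int, j: int) -> Clustering:
        rest = [c for key, c in clusters.items() if key not in (i, j)]
        return rest + [clusters[i] + clusters[j]]

    while candidates:
        # Pick the candidate pair whose merge yields the highest fitness.
        i, j = max(sorted(candidates), key=lambda p: q(after_merge(*p)))
        merged = clusters[i] + clusters[j]
        del clusters[i], clusters[j]
        clusters[next_id] = merged
        # The new cluster C' inherits the neighbours of Ci and Cj.
        updated = set()
        for a, b in candidates:
            if {a, b} == {i, j}:
                continue
            a2 = next_id if a in (i, j) else a
            b2 = next_id if b in (i, j) else b
            updated.add(tuple(sorted((a2, b2))))
        candidates = updated
        next_id += 1
        current = list(clusters.values())
        current_q = q(current)
        if current_q > best_q:
            best, best_q = [list(c) for c in current], current_q
    return best, best_q


def example_q(clustering: Clustering) -> float:
    """Hypothetical fitness: (share of 1s above 0.5) * size**2, summed over clusters."""
    return sum(max(0.0, sum(c) / len(c) - 0.5) * len(c) ** 2 for c in clustering)


best, best_q = mosaic([[1, 1], [1, 1], [0, 0]], {(0, 1), (1, 2)}, example_q)
print(best, best_q)  # [[0, 0], [1, 1, 1, 1]] 8.0
```

Note that the loop keeps merging until no candidates remain, even when fitness drops, and reports the best intermediate clustering, as the RETURN line of the pseudocode suggests.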
Complexity MOSAIC
Let
n be the number of objects in the dataset
k be the number of clusters returned by the representative-based algorithm
Complexity MOSAIC: $O(k^3 + k^2 \cdot O(q(X)))$
Remarks:
• The above formula assumes that fitness is computed from
scratch when a new clustering is obtained
• Lower complexities can be obtained by incrementally
reusing results of previous fitness computations
• Our current implementation assumes that only additive
fitness functions are used
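For an additive fitness function, where q(X) is a sum of per-cluster rewards, merging Ci and Cj changes only three terms, so the effect of a merge can be evaluated without recomputing q for the whole clustering. The sketch below illustrates this; the `reward` function is a made-up example on 0/1 labels, and `merge_gain` is a hypothetical helper name.

```python
from typing import Callable, List

Cluster = List[int]


def reward(cluster: Cluster) -> float:
    """Hypothetical per-cluster reward: (share of 1s above 0.5) * size**2."""
    return max(0.0, sum(cluster) / len(cluster) - 0.5) * len(cluster) ** 2


def q(clustering: List[Cluster]) -> float:
    """Additive fitness: the sum of the per-cluster rewards."""
    return sum(reward(c) for c in clustering)


def merge_gain(reward_fn: Callable[[Cluster], float], ci: Cluster, cj: Cluster) -> float:
    """Change in q when ci and cj are merged: reward(ci + cj) - reward(ci) - reward(cj)."""
    return reward_fn(ci + cj) - reward_fn(ci) - reward_fn(cj)


X = [[1, 1], [1, 1], [0, 0]]
gain = merge_gain(reward, X[0], X[1])
X_merged = [X[0] + X[1], X[2]]
print(gain, q(X) + gain, q(X_merged))  # 4.0 8.0 8.0
```

With cached per-cluster rewards, each candidate evaluation needs one new reward computation instead of a pass over all clusters.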
4. Experimental Evaluation for Traditional Clustering
• Compared MOSAIC with DBSCAN and K-means
• Used silhouette as q(X) when running MOSAIC;
Silhouette considers cohesion and separation
(measured as the distance to the nearest cluster).
• Used 9-Diamonds, Volcano, Diabetes, Ionosphere,
and Vehicle datasets in the experimental
evaluation
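For reference, the silhouette of a clustering can be computed as sketched below. This follows the standard textbook definition (mean intra-cluster distance a, mean distance b to the nearest other cluster, score (b - a) / max(a, b)); whether the experiments used exactly this variant is an assumption. Euclidean distance and coordinate tuples are also assumed.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def silhouette(clusters: Sequence[Sequence[Point]]) -> float:
    """Mean silhouette over all points; needs at least two non-empty clusters."""
    non_empty = [c for c in clusters if c]
    if len(non_empty) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    scores: List[float] = []
    for ci, cluster in enumerate(non_empty):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # common convention for singleton clusters
                continue
            # Cohesion: mean distance to the other members of the same cluster.
            a = sum(math.dist(p, o) for o in cluster) / (len(cluster) - 1)
            # Separation: mean distance to the nearest other cluster.
            b = min(sum(math.dist(p, o) for o in other) / len(other)
                    for cj, other in enumerate(non_empty) if cj != ci)
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
bad = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]
print(silhouette(good) > silhouette(bad))  # True
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.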
Experimental Results
• Finding good parameter settings for DBSCAN
turned out to be problematic for the 9-Diamonds
and Volcano spatial datasets.
• Neither DBSCAN nor MOSAIC was able to
identify all chain-like patterns in the Volcano
dataset.
• We compared MOSAIC and K-means for the
Ionosphere, Diabetes, and Vehicle high-dimensional
datasets. Cluster quality was
measured using Silhouette. MOSAIC outperformed
K-means on these datasets.
Volcano Dataset Result MOSAIC
Volcano Dataset Result DBSCAN
Open Issues: What is a Good Fitness Function
for Traditional Clustering?
• The use of plug-in fitness functions within traditional
clustering algorithms is not very common.
• Using existing cluster evaluation measures, such as
cohesion, separation, and silhouette, as fitness
functions does not lead to very good
clusterings when confronted with arbitrary-shape
clusters [Choo07].
Question: Can we find better cluster evaluation
measures or is finding good evaluation measures
for traditional clustering a hopeless project?
5. Related Work
• CURE integrates a partitioning algorithm with an
agglomerative hierarchical algorithm [GRS98].
• CHAMELEON [KHK99] provides a sophisticated
two-phase clustering algorithm: a multilevel
graph partitioning algorithm and an agglomerative
clustering algorithm on a sparse k-nearest-neighbour graph.
Related Work Continued
• Lin and Zhong [LC02 and ZG03] propose hybrid
clustering algorithms that combine representative-based
clustering and agglomerative clustering
methods.
• Surdeanu [STA05] proposes a hybrid clustering
approach that combines an agglomerative clustering
algorithm with the Expectation Maximization (EM)
algorithm.
6. Conclusion
• A new clustering algorithm was introduced that
approximates arbitrary shape clusters through unions of
convex polygons
• The algorithm performs a wider search by considering “all”
neighboring clusters as merge candidates. Gabriel graphs
are used to determine neighboring clusters
• The algorithm is generic in that it can be used with any
initial merge candidate relation, any fitness function, and
any representative-based algorithm
• MOSAIC can also be seen as a generalization of
agglomerative grid-based clustering algorithms.
• We mainly use MOSAIC in the region discovery project
mentioned earlier.
Future Work: Learn fitness function based on feedback
Idea: employ machine learning techniques to learn a
fitness function by using the feedback of a domain
expert.
– Pros:
– It provides a more adaptive approach, giving the chance to tailor the
fitness function to the domain expert’s requirements.
– The process of finding an appropriate fitness function is automatic.
– Cons:
– Feature selection is non-trivial
– Learning the function is a difficult machine learning task