Development Of Graph Clustering Algorithms By Sushrut S. Karanjkar


Development Of Graph Clustering Algorithms
By Sushrut S. Karanjkar
Overview
• Problem Definition
• Methodology
• Objective Functions
• Developed Schemes
• Experimental Results and Discussion
• Conclusions
Problem Definition
• Given a graph G = (V, E) and an input parameter θ_input, we want to find a clustering solution such that the number of clusters in the solution is minimized and, if C is a cluster, then (see the code sketch after this slide)

$$\forall v \in C:\quad \frac{|\mathrm{adj}(v) \cap C|}{|C|} \;\ge\; \theta_{\mathrm{input}}$$
• Applications
– Data Mining
– CAD-VLSI (synthesis, partitioning, …)
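As a minimal sketch of the cluster-validity constraint above (not code from the thesis; the adjacency-dict representation and function name are assumptions):

```python
# Hypothetical representation: adj maps each vertex to {neighbor: edge weight}.

def is_valid_cluster(cluster, adj, theta_input):
    """True iff every v in `cluster` satisfies |adj(v) & C| / |C| >= theta_input."""
    size = len(cluster)
    return all(len(cluster & adj[v].keys()) / size >= theta_input
               for v in cluster)

# Example: in a triangle {a, b, c}, every vertex has 2 of its neighbors inside
# the cluster, so the ratio is 2/3: valid for theta_input = 0.6, not for 0.7.
adj = {"a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1}}
print(is_valid_cluster({"a", "b", "c"}, adj, 0.6))  # True
```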
Methodology
Multilevel Paradigm
• A proven paradigm for graph partitioning problems.
• Schemes based on the multilevel paradigm produce high-quality partitions.
• Multilevel partitioning schemes are extremely fast.
• They incorporate both global and local information.
• E.g., Metis is an extremely fast, high-quality graph partitioning algorithm based on the multilevel paradigm.
Ingredients Of Multilevel Paradigm
• Coarsening
– A sequence of smaller graphs is constructed, each with fewer vertices than the previous one.
– A set of vertices from the finer graph is collapsed together to form a single vertex of the coarser graph.
– The collapsing criterion should facilitate attaining the desired objective.
Ingredients Of Multilevel Paradigm
• Initial Processing
– The coarsest and most manageable graph is processed in the desired fashion.
– This results in the initial solution for the problem.
• Uncoarsening & Refinement
– The initial solution is projected through the sequence of finer graphs.
– At each stage, finer optimizations are performed to improve the quality of the solution.
Use of Multilevel Paradigm For Clustering
• Coarsening Scheme
– Obtain the sequence of coarser graphs.
• Initial Clustering
– Process the coarsest graph to get the initial clustering solution.
• Refinement of Clusters
– Project the initial clustering solution through the sequence of finer graphs, with finer optimizations at each stage to improve the quality of the clustering solution.
– The quality of the clustering solution is determined by the desired objective function.
Clustering Objectives
• Primary Objective
– Based on θ_input.
– If C is a cluster, then

$$\forall v \in C:\quad \frac{|\mathrm{adj}(v) \cap C|}{|C|} \;\ge\; \theta_{\mathrm{input}}$$

• Other Objectives
– Based on information about edges.
– Objective 1
– Objective 2
Objective #1
• Maximize the connectivity of items within the clusters.
• If C^k = {C_1, C_2, …, C_k} is the clustering solution, then maximize

$$\theta_{\mathrm{avg}}(C^k) \;=\; \frac{\sum_{i=1}^{k} \theta_{\mathrm{avg}}(C_i)}{k}$$

where

$$\theta_{\mathrm{avg}}(C_i) \;=\; \frac{\sum_{v_p \in C_i} \theta(v_p)}{|C_i|} \qquad\text{and}\qquad \theta(v) \;=\; \frac{|\mathrm{adj}(v) \cap C_i|}{|C_i|}$$
Objective #2
• Maximize the weighted connectivity of items within the clusters.
• If C^k = {C_1, C_2, …, C_k} is the clustering solution, then maximize

$$\theta Wt_{\mathrm{avg}}(C^k) \;=\; \frac{\sum_{i=1}^{k} \theta Wt_{\mathrm{avg}}(C_i)}{k}$$

where

$$\theta Wt_{\mathrm{avg}}(C_i) \;=\; \frac{\sum_{v_p \in C_i} \theta Wt(v_p)}{|C_i|} \qquad\text{and}\qquad \theta Wt(v) \;=\; \frac{\mathrm{Weight}(\mathrm{adj}(v) \cap C_i)}{|C_i|}$$
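To make the two objectives concrete, here is a small sketch under the same assumed adjacency-dict representation as before; the helper names (`theta`, `theta_wt`, `objective`) are illustrative, not from the thesis:

```python
def theta(v, cluster, adj):
    """theta(v) = |adj(v) & C_i| / |C_i|  (ingredient of Objective #1)."""
    return len(cluster & adj[v].keys()) / len(cluster)

def theta_wt(v, cluster, adj):
    """thetaWt(v) = Weight(adj(v) & C_i) / |C_i|  (ingredient of Objective #2)."""
    return sum(w for u, w in adj[v].items() if u in cluster) / len(cluster)

def objective(clusters, adj, per_vertex=theta):
    """Average over clusters of the per-cluster average of `per_vertex`."""
    avgs = [sum(per_vertex(v, c, adj) for v in c) / len(c) for c in clusters]
    return sum(avgs) / len(avgs)

# objective(clusters, adj, theta)    evaluates Objective #1 (theta_avg)
# objective(clusters, adj, theta_wt) evaluates Objective #2 (thetaWt_avg)
```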
Coarsening Scheme
• Compute a matching:
– Make an entry for a pair of directly connected nodes only if, after collapsing, the θ for each node (of the original graph) in the resultant cluster is ≥ θ_input.
– Sort all the entries based on θ_avg for the resultant cluster to form a list. Break ties using the edge weight.
– Construct a matching by traversing this list serially.
• Collapse all the matched pairs of nodes in the graph to generate the coarser graph.
– Only pairs of nodes are allowed for collapsing.
• Continue coarsening until no more collapsing can be done.
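A sketch of one such coarsening pass, assuming coarse nodes are kept as frozensets of original vertices so that θ can be checked against the original graph; `edge_wt` and the other names are illustrative, and sorting in descending order of θ_avg is an assumption (the slides do not specify the direction):

```python
from itertools import combinations

def edge_wt(a, b, adj):
    """Total weight of original edges crossing between vertex groups a and b."""
    return sum(w for v in a for u, w in adj[v].items() if u in b)

def theta_avg(cluster, adj):
    n = len(cluster)
    return sum(len(cluster & adj[v].keys()) / n for v in cluster) / n

def coarsen_once(coarse_nodes, adj, theta_input):
    entries = []
    for a, b in combinations(coarse_nodes, 2):
        w = edge_wt(a, b, adj)
        if w == 0:                       # only directly connected pairs
            continue
        merged = a | b
        n = len(merged)
        # theta for every original-graph node in the resultant cluster
        if all(len(merged & adj[v].keys()) / n >= theta_input for v in merged):
            entries.append((theta_avg(merged, adj), w, a, b))
    # sort on theta_avg of the resultant cluster, break ties on edge weight
    entries.sort(key=lambda e: (e[0], e[1]), reverse=True)
    matched, out = set(), []
    for _, _, a, b in entries:           # traverse the sorted list serially
        if a not in matched and b not in matched:
            matched.update((a, b))       # only pairs of nodes are collapsed
            out.append(a | b)
    out.extend(c for c in coarse_nodes if c not in matched)
    return out
```

Repeating `coarsen_once` until the number of coarse nodes stops shrinking yields the sequence of coarser graphs.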
Initial Clustering Phase
• No processing is done.
• The set of vertices of the coarsest graph is the initial clustering solution.
Uncoarsening Scheme
• Projection Phase
– Project the solution of the coarser level back to the finer level.
• Refinement Phase
– Minimization Phase
• Minimize the number of clusters in the solution.
– Maximization Phase
• Maximize the clustering objective.
Minimization Phase
• Compute a matching for the existing clusters.
– Make an entry for a pair of parent clusters of directly connected nodes for merging, only if the θ for each node (of the original graph) in the resultant cluster is ≥ θ_input.
– Sort all the entries based on θ_avg of the resultant cluster to form a list. Break ties using the edge weight.
– Construct a matching by traversing this list serially.
• Collapse all the pairs of matched clusters.
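Structurally this is the same matching as in the coarsening sketch, applied to clusters instead of coarse nodes, so under the same assumptions the minimization phase can simply reuse it:

```python
# Clusters play the role of coarse nodes; valid pairs of directly
# connected clusters are merged in one pass.
clusters = coarsen_once(clusters, adj, theta_input)
```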
Maximization Phase
• For all possible moves, calculate the gain based on the objective function.

$$\mathrm{Gain}_1(v, C_m, C_n) = \theta_{\mathrm{avg}}(C_m \cup \{v\}) - \theta_{\mathrm{avg}}(C_m) + \theta_{\mathrm{avg}}(C_n \setminus \{v\}) - \theta_{\mathrm{avg}}(C_n)$$

$$\mathrm{Gain}_2(v, C_m, C_n) = \theta Wt_{\mathrm{avg}}(C_m \cup \{v\}) - \theta Wt_{\mathrm{avg}}(C_m) + \theta Wt_{\mathrm{avg}}(C_n \setminus \{v\}) - \theta Wt_{\mathrm{avg}}(C_n)$$

• Make the move only if the gain is positive.
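A sketch of the gain test, reusing `theta_avg` from the coarsening sketch; Gain2 is obtained by substituting a weighted average built on `theta_wt`. The names are illustrative:

```python
def gain1(v, c_m, c_n, adj):
    """Change in theta_avg if v moves from cluster Cn into cluster Cm."""
    return (theta_avg(c_m | {v}, adj) - theta_avg(c_m, adj)
            + theta_avg(c_n - {v}, adj) - theta_avg(c_n, adj))

def maybe_move(v, c_m, c_n, adj):
    """Make the move only if the gain is positive (and Cn does not empty)."""
    if len(c_n) > 1 and gain1(v, c_m, c_n, adj) > 0:
        return c_m | {v}, c_n - {v}
    return c_m, c_n
```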
Refinement Phase
• Structure of a Refinement Phase
– Minimization of the number of clusters
– Maximization of the objective function
• Repeat Minimization followed by Maximization until no changes occur.
• This results in an improved clustering solution for that level.
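Putting the two phases together for one level, a sketch of the loop; the serial sweep order in the maximization pass is an assumption (the slides do not specify a traversal order), and `clusters` is assumed to be a list of frozensets of original vertices:

```python
def maximization_pass(clusters, adj):
    """One greedy sweep of positive-gain vertex moves between clusters."""
    clusters = [frozenset(c) for c in clusters]
    for j in range(len(clusters)):
        for v in list(clusters[j]):
            for i in range(len(clusters)):
                if i != j and v in clusters[j]:
                    clusters[i], clusters[j] = maybe_move(
                        v, clusters[i], clusters[j], adj)
    return clusters

def refinement_phase(clusters, adj, theta_input):
    """Repeat minimization followed by maximization until no changes occur."""
    while True:
        before = set(clusters)
        clusters = coarsen_once(clusters, adj, theta_input)  # minimization
        clusters = maximization_pass(clusters, adj)          # maximization
        if set(clusters) == before:                          # no changes
            return clusters
```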
Problem With Objective Function #2
• Race for strong edges: strong edges are pulled more by clusters with light edges than by clusters that already have strong edges.
• May potentially lead to bad solutions.
• Solution
– Hierarchical approach
– Introduce edge weights uniformly to the clustering scheme.
Hierarchical Clustering
• Introduce the edges into the clustering problem in a uniform way.
• Why does it solve the problem?
– Strongly connected nodes get a preference for being in the same cluster over less strongly connected nodes.
– Results in "strong" clusters.
– Balanced competition for edges.
Scheme A
• Non-hierarchical scheme.
• Minimize the number of clusters.
• Maximize θ_avg of the resulting solution (Objective Function #1).
• Maximization Phase based on the Gain1 function.
Scheme B
• Hierarchical scheme.
• Introduce edges at a rate of 25% each time.
• Maximize θWt_avg of the clustering solution (Objective Function #2).
• Maximization Phase based on the Gain2 function.
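One plausible reading of the 25% schedule, as a sketch: partition the edge list into four batches and hand them to the clustering machinery one at a time. Introducing the strongest edges first is an assumption made here for illustration (the slides only fix the 25% rate), and vertex ids are assumed orderable:

```python
def edge_batches(adj, fraction=0.25):
    """Split the undirected edge list into batches of `fraction` each."""
    edges = sorted(
        {(min(u, v), max(u, v), w) for u in adj for v, w in adj[u].items()},
        key=lambda e: -e[2])                     # strongest edges first
    step = max(1, round(len(edges) * fraction))
    return [edges[i:i + step] for i in range(0, len(edges), step)]

# Scheme B would then rerun the coarsen/refine passes after each batch is
# added to the working graph, keeping the competition for edges balanced.
```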
Experimental Results and Discussion
• Experimental Setup
– Data Sets
• Web Documents Data
• S & P 500 Stocks Data
• Evaluation of results
– Graph Theoretical Context
– Data Mining Context
• Conclusions
Graph Theoretical Context
• Parameters of Interest
– θ_avg for the clustering solution
– θWt_avg for the clustering solution
– Number of non-trivial clusters (|C| ≥ 3)
– Percentage of nodes spanned by non-trivial clusters
• Desired behavior of the parameters (for a given θ_input)
– Smallest number of non-trivial clusters and highest percentage of spanned nodes.
– Highest θ_avg (for Scheme A) and highest θWt_avg (for Scheme B).
Discussion Of Results : Graph
Theoretical Context
• Does refinement improve the quality of the clustering solution?
– Number of Non-trivial Clusters before and after refinement
– Percentage of nodes spanned by non-trivial clusters before and after refinement
• Comparison Of Scheme A with Scheme B
Scheme A : Number Of Non-trivial Clusters
[Figure: ratio of the number of non-trivial clusters (coarsening vs. refined results) across data sets j1–j9, k1, min1, min2, normal.]
Scheme A : Percentage Of Nodes Spanned by Non-trivial Clusters
[Figure: ratio of the percentage of nodes covered by non-trivial clusters (coarsening vs. refined results) across the same data sets.]
Scheme B : Number Of Non-trivial Clusters
[Figure: ratio of the number of non-trivial clusters (coarsening vs. refined results) across the same data sets.]
Scheme B : Percentage Of Nodes Spanned by Non-trivial Clusters
[Figure: ratio of the percentage of nodes covered by non-trivial clusters (coarsening vs. refined results) across the same data sets.]
Comparison : Number of Non-trivial Clusters
[Figure: ratio of the number of non-trivial clusters, Scheme A results relative to Scheme B, across the data sets.]
Comparison : Percentage Of Nodes Spanned by Non-trivial Clusters
[Figure: ratio of the percentage of nodes covered by non-trivial clusters, Scheme A results relative to Scheme B, across the data sets.]
Data Mining Context
• Labels of the data items are known a priori.
• Types of Clusters
– Pure clusters: non-trivial clusters consisting of data items of only one label.
– Almost pure clusters: non-trivial clusters consisting of data items of at most two labels.
– Impure clusters: clusters other than pure and almost pure clusters.
• Parameters of Interest
– Number of pure + number of almost pure clusters
– Percentage of nodes spanned by pure + almost pure clusters
– Entropy of the clustering solution
Discussion Of Results : Data Mining
Context
• Does refinement improve the quality of the clustering solution?
– Number of Pure + Almost Pure Clusters before and after refinement
– Percentage of nodes spanned by Pure + Almost Pure Clusters before and after refinement
• Comparison Of Scheme A with Scheme B
Scheme A : Number Of Pure + Almost Pure Clusters
[Figure: ratio of the number of pure + almost pure clusters (coarsening vs. refined results) across the data sets.]
Scheme A : Percentage Of Nodes Spanned by Pure + Almost Pure Clusters
[Figure: ratio of the percentage of nodes covered by pure + almost pure clusters (coarsening vs. refined results) across the data sets.]
Scheme B : Number Of Pure + Almost Pure Clusters
[Figure: ratio of the number of pure + almost pure clusters (coarsening vs. refined results) across the data sets.]
Scheme B : Percentage Of Nodes Spanned by Pure + Almost Pure Clusters
[Figure: ratio of the percentage of nodes covered by pure + almost pure clusters (coarsening vs. refined results) across the data sets.]
Comparison : Number of Pure + Almost Pure Clusters
[Figure: ratio of the number of pure + almost pure clusters, Scheme A results relative to Scheme B, across the data sets.]
Comparison : Percentage Of Nodes Spanned by Pure + Almost Pure Clusters
[Figure: ratio of the percentage of nodes covered by pure + almost pure clusters, Scheme A results relative to Scheme B, across the data sets.]
Entropy
• Measure for cohesiveness among items in clusters.

$$\mathrm{cluster\_entropy}(C_i) = -\sum_{j=1}^{m} \frac{\mathrm{count}_i(\mathrm{item}_j)}{\mathrm{Total}(C_i)} \cdot \log_2\!\left(\frac{\mathrm{count}_i(\mathrm{item}_j)}{\mathrm{Total}(C_i)}\right)$$

$$\mathrm{entropy} = \sum_{i=1}^{k} \frac{\mathrm{Total}(C_i)}{\mathrm{TOTAL}} \cdot \mathrm{cluster\_entropy}(C_i)$$
• Lower Entropy indicates better cohesiveness.
• Misleading Measure
– Entropy goes up as the number of clusters goes down.
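A sketch of the measure with assumed names: `labels[v]` is the a priori label of item v, Total(C_i) = |C_i|, and TOTAL is the number of items covered by the clusters being scored:

```python
from collections import Counter
from math import log2

def cluster_entropy(cluster, labels):
    """Entropy of the label distribution inside one cluster."""
    total = len(cluster)
    counts = Counter(labels[v] for v in cluster)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def entropy(clusters, labels):
    """Size-weighted average of the per-cluster entropies."""
    total = sum(len(c) for c in clusters)
    return sum((len(c) / total) * cluster_entropy(c, labels) for c in clusters)
```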
Comparison Based On Entropy
• Entropy calculations are done only over the set of non-trivial clusters.
• Comparison of Scheme A with Scheme B
• Comparison of Scheme A and Scheme B with hMetis and a two-level refinement scheme (2-LR Scheme)
Scheme A : Entropy
[Figure: ratio of entropy (coarsening vs. refined results) across the data sets.]
Scheme B : Entropy
[Figure: ratio of entropy (coarsening vs. refined results) across the data sets.]
Comparison Based on Entropy : A Vs. B
[Figure: ratio of entropy, Scheme A results relative to Scheme B, across the data sets.]
Comparison Based On Entropy : A Vs. Best of hMetis and 2-LR Scheme
[Figure: ratio of entropy, Scheme A relative to the best result of hMetis and the 2-LR scheme, for data sets j4, j6, j7, j8, j9.]
Comparison Based On Entropy : B Vs. Best of hMetis and 2-LR Scheme
[Figure: ratio of entropy, Scheme B relative to the best result of hMetis and the 2-LR scheme, for data sets j4, j6, j7, j8, j9.]
Conclusions
• Two schemes were developed for clustering based on the multilevel paradigm.
• Results were analyzed in the graph-theoretical and data-mining contexts.
• Scheme B seems to perform better in most cases in both contexts.
• Need for
– Better tie-breaking strategies
– A better definition of a cluster
– Better clustering objectives