Development Of Graph Clustering Algorithms By Sushrut S. Karanjkar


Development Of Graph Clustering Algorithms
By Sushrut S. Karanjkar
Overview
• Problem Definition
• Methodology
• Objective Functions
• Developed Schemes
• Experimental Results and Discussion
• Conclusions
Problem Definition
• Given a graph G = (V, E) and an input parameter θ_input, we want to find a clustering solution such that the number of clusters in the solution is minimized and, if C is a cluster, then (see the code sketch after this slide)

$$\forall v \in C:\quad \frac{|\mathrm{adj}(v) \cap C|}{|C|} \;\ge\; \theta_{\mathrm{input}}$$
• Applications
– Data Mining
– CAD-VLSI (synthesis, partitioning, …)
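As a minimal sketch of the cluster-validity constraint above (not code from the thesis; the adjacency-dict representation and function name are assumptions):

```python
# Hypothetical representation: adj maps each vertex to {neighbor: edge weight}.

def is_valid_cluster(cluster, adj, theta_input):
    """True iff every v in `cluster` satisfies |adj(v) & C| / |C| >= theta_input."""
    size = len(cluster)
    return all(len(cluster & adj[v].keys()) / size >= theta_input
               for v in cluster)

# Example: in a triangle {a, b, c}, every vertex has 2 of its neighbors inside
# the cluster, so the ratio is 2/3: valid for theta_input = 0.6, not for 0.7.
adj = {"a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1}}
print(is_valid_cluster({"a", "b", "c"}, adj, 0.6))  # True
```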
Methodology
Multilevel Paradigm
• A proven paradigm for graph partitioning problems.
• Schemes based on the multilevel paradigm produce high-quality partitions.
• Multilevel partitioning schemes are extremely fast.
• They incorporate both global and local information.
• E.g., Metis is an extremely fast, high-quality graph partitioning algorithm based on the multilevel paradigm.
Ingredients Of Multilevel Paradigm
• Coarsening
– A sequence of smaller graphs is constructed, each with fewer vertices than the previous one.
– A set of vertices from the finer graph is collapsed together to form a single vertex of the coarser graph.
– The collapsing criterion should facilitate attaining the desired objective.
Ingredients Of Multilevel Paradigm
• Initial Processing
– The coarsest and most manageable graph is processed in the desired fashion.
– This results in the initial solution for the problem.
• Uncoarsening & Refinement
– The initial solution is projected through the sequence of finer graphs.
– At each stage, finer optimizations are performed to improve the quality of the solution.
Use of Multilevel Paradigm For Clustering
• Coarsening Scheme
– Obtain the sequence of coarser graphs.
• Initial Clustering
– Process the coarsest graph to get the initial clustering solution.
• Refinement of Clusters
– Project the initial clustering solution through the sequence of finer graphs, with finer optimizations at each stage to improve the quality of the clustering solution.
– The quality of the clustering solution is determined by the desired objective function.
Clustering Objectives
• Primary Objective
– Based on θ_input.
– If C is a cluster, then

$$\forall v \in C:\quad \frac{|\mathrm{adj}(v) \cap C|}{|C|} \;\ge\; \theta_{\mathrm{input}}$$

• Other Objectives
– Based on information about edges.
– Objective 1
– Objective 2
Objective #1
• Maximize the connectivity of items within the clusters.
• If C^k = {C_1, C_2, …, C_k} is the clustering solution, then maximize

$$\theta_{\mathrm{avg}}(C^k) \;=\; \frac{\sum_{i=1}^{k} \theta_{\mathrm{avg}}(C_i)}{k}$$

where

$$\theta_{\mathrm{avg}}(C_i) \;=\; \frac{\sum_{v_p \in C_i} \theta(v_p)}{|C_i|} \qquad\text{and}\qquad \theta(v) \;=\; \frac{|\mathrm{adj}(v) \cap C_i|}{|C_i|}$$
Objective #2
• Maximize the weighted connectivity of items within the clusters.
• If C^k = {C_1, C_2, …, C_k} is the clustering solution, then maximize

$$\theta Wt_{\mathrm{avg}}(C^k) \;=\; \frac{\sum_{i=1}^{k} \theta Wt_{\mathrm{avg}}(C_i)}{k}$$

where

$$\theta Wt_{\mathrm{avg}}(C_i) \;=\; \frac{\sum_{v_p \in C_i} \theta Wt(v_p)}{|C_i|} \qquad\text{and}\qquad \theta Wt(v) \;=\; \frac{\mathrm{Weight}(\mathrm{adj}(v) \cap C_i)}{|C_i|}$$
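To make the two objectives concrete, here is a small sketch under the same assumed adjacency-dict representation as before; the helper names (`theta`, `theta_wt`, `objective`) are illustrative, not from the thesis:

```python
def theta(v, cluster, adj):
    """theta(v) = |adj(v) & C_i| / |C_i|  (ingredient of Objective #1)."""
    return len(cluster & adj[v].keys()) / len(cluster)

def theta_wt(v, cluster, adj):
    """thetaWt(v) = Weight(adj(v) & C_i) / |C_i|  (ingredient of Objective #2)."""
    return sum(w for u, w in adj[v].items() if u in cluster) / len(cluster)

def objective(clusters, adj, per_vertex=theta):
    """Average over clusters of the per-cluster average of `per_vertex`."""
    avgs = [sum(per_vertex(v, c, adj) for v in c) / len(c) for c in clusters]
    return sum(avgs) / len(avgs)

# objective(clusters, adj, theta)    evaluates Objective #1 (theta_avg)
# objective(clusters, adj, theta_wt) evaluates Objective #2 (thetaWt_avg)
```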
Coarsening Scheme
• Compute a matching:
– Make an entry for a pair of directly connected nodes only if, after collapsing, the θ for each node (of the original graph) in the resultant cluster is ≥ θ_input.
– Sort all the entries based on θ_avg for the resultant cluster to form a list. Break ties using the edge weight.
– Construct a matching by traversing this list serially.
• Collapse all the matched pairs of nodes in the graph to generate the coarser graph.
– Only pairs of nodes are allowed for collapsing.
• Continue coarsening until no more collapsing can be done.
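A sketch of one such coarsening pass, assuming coarse nodes are kept as frozensets of original vertices so that θ can be checked against the original graph; `edge_wt` and the other names are illustrative, and sorting in descending order of θ_avg is an assumption (the slides do not specify the direction):

```python
from itertools import combinations

def edge_wt(a, b, adj):
    """Total weight of original edges crossing between vertex groups a and b."""
    return sum(w for v in a for u, w in adj[v].items() if u in b)

def theta_avg(cluster, adj):
    n = len(cluster)
    return sum(len(cluster & adj[v].keys()) / n for v in cluster) / n

def coarsen_once(coarse_nodes, adj, theta_input):
    entries = []
    for a, b in combinations(coarse_nodes, 2):
        w = edge_wt(a, b, adj)
        if w == 0:                       # only directly connected pairs
            continue
        merged = a | b
        n = len(merged)
        # theta for every original-graph node in the resultant cluster
        if all(len(merged & adj[v].keys()) / n >= theta_input for v in merged):
            entries.append((theta_avg(merged, adj), w, a, b))
    # sort on theta_avg of the resultant cluster, break ties on edge weight
    entries.sort(key=lambda e: (e[0], e[1]), reverse=True)
    matched, out = set(), []
    for _, _, a, b in entries:           # traverse the sorted list serially
        if a not in matched and b not in matched:
            matched.update((a, b))       # only pairs of nodes are collapsed
            out.append(a | b)
    out.extend(c for c in coarse_nodes if c not in matched)
    return out
```

Repeating `coarsen_once` until the number of coarse nodes stops shrinking yields the sequence of coarser graphs.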
Initial Clustering Phase
• No processing is done.
• The set of vertices of the coarsest graph is the initial clustering solution.
Uncoarsening Scheme
• Projection Phase
– Project the solution of the coarser level back to the finer level.
• Refinement Phase
– Minimization Phase
• Minimize the number of clusters in the solution.
– Maximization Phase
• Maximize the clustering objective.
Minimization Phase
• Compute a matching for the existing clusters.
– Make an entry for a pair of parent clusters of directly connected nodes for merging, only if the θ for each node (of the original graph) in the resultant cluster is ≥ θ_input.
– Sort all the entries based on θ_avg of the resultant cluster to form a list. Break ties using the edge weight.
– Construct a matching by traversing this list serially.
• Collapse all the pairs of matched clusters.
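Structurally this is the same matching as in the coarsening sketch, applied to clusters instead of coarse nodes, so under the same assumptions the minimization phase can simply reuse it:

```python
# Clusters play the role of coarse nodes; valid pairs of directly
# connected clusters are merged in one pass.
clusters = coarsen_once(clusters, adj, theta_input)
```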
Maximization Phase
• For all possible moves, calculate the gain based on the objective function.

$$\mathrm{Gain}_1(v, C_m, C_n) = \theta_{\mathrm{avg}}(C_m \cup \{v\}) - \theta_{\mathrm{avg}}(C_m) + \theta_{\mathrm{avg}}(C_n \setminus \{v\}) - \theta_{\mathrm{avg}}(C_n)$$

$$\mathrm{Gain}_2(v, C_m, C_n) = \theta Wt_{\mathrm{avg}}(C_m \cup \{v\}) - \theta Wt_{\mathrm{avg}}(C_m) + \theta Wt_{\mathrm{avg}}(C_n \setminus \{v\}) - \theta Wt_{\mathrm{avg}}(C_n)$$

• Make the move only if the gain is positive.
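A sketch of the gain test, reusing `theta_avg` from the coarsening sketch; Gain2 is obtained by substituting a weighted average built on `theta_wt`. The names are illustrative:

```python
def gain1(v, c_m, c_n, adj):
    """Change in theta_avg if v moves from cluster Cn into cluster Cm."""
    return (theta_avg(c_m | {v}, adj) - theta_avg(c_m, adj)
            + theta_avg(c_n - {v}, adj) - theta_avg(c_n, adj))

def maybe_move(v, c_m, c_n, adj):
    """Make the move only if the gain is positive (and Cn does not empty)."""
    if len(c_n) > 1 and gain1(v, c_m, c_n, adj) > 0:
        return c_m | {v}, c_n - {v}
    return c_m, c_n
```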
Refinement Phase
• Structure of a Refinement Phase
– Minimization of the number of clusters
– Maximization of the objective function
• Repeat Minimization followed by Maximization until no changes occur.
• This results in an improved clustering solution for that level.
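Putting the two phases together for one level, a sketch of the loop; the serial sweep order in the maximization pass is an assumption (the slides do not specify a traversal order), and `clusters` is assumed to be a list of frozensets of original vertices:

```python
def maximization_pass(clusters, adj):
    """One greedy sweep of positive-gain vertex moves between clusters."""
    clusters = [frozenset(c) for c in clusters]
    for j in range(len(clusters)):
        for v in list(clusters[j]):
            for i in range(len(clusters)):
                if i != j and v in clusters[j]:
                    clusters[i], clusters[j] = maybe_move(
                        v, clusters[i], clusters[j], adj)
    return clusters

def refinement_phase(clusters, adj, theta_input):
    """Repeat minimization followed by maximization until no changes occur."""
    while True:
        before = set(clusters)
        clusters = coarsen_once(clusters, adj, theta_input)  # minimization
        clusters = maximization_pass(clusters, adj)          # maximization
        if set(clusters) == before:                          # no changes
            return clusters
```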
Problem With Objective Function #2
• Race for strong edges: strong edges are pulled more by clusters with light edges than by clusters that already have strong edges.
• May potentially lead to bad solutions.
• Solution
– Hierarchical approach
– Introduce edge weights uniformly to the clustering scheme.
Hierarchical Clustering
• Introduce the edges into the clustering problem in a uniform way.
• Why does it solve the problem?
– Strongly connected nodes get a preference for being in the same cluster over less strongly connected nodes.
– Results in "strong" clusters.
– Balanced competition for edges.
Scheme A
• Non-hierarchical scheme.
• Minimize the number of clusters.
• Maximize θ_avg of the resulting solution (Objective Function #1).
• Maximization Phase based on the Gain1 function.
Scheme B
• Hierarchical scheme.
• Introduce edges at a rate of 25% each time.
• Maximize θWt_avg of the clustering solution (Objective Function #2).
• Maximization Phase based on the Gain2 function.
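One plausible reading of the 25% schedule, as a sketch: partition the edge list into four batches and hand them to the clustering machinery one at a time. Introducing the strongest edges first is an assumption made here for illustration (the slides only fix the 25% rate), and vertex ids are assumed orderable:

```python
def edge_batches(adj, fraction=0.25):
    """Split the undirected edge list into batches of `fraction` each."""
    edges = sorted(
        {(min(u, v), max(u, v), w) for u in adj for v, w in adj[u].items()},
        key=lambda e: -e[2])                     # strongest edges first
    step = max(1, round(len(edges) * fraction))
    return [edges[i:i + step] for i in range(0, len(edges), step)]

# Scheme B would then rerun the coarsen/refine passes after each batch is
# added to the working graph, keeping the competition for edges balanced.
```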
Experimental Results and Discussion
• Experimental Setup
– Data Sets
• Web Documents Data
• S & P 500 Stocks Data
• Evaluation of results
– Graph Theoretical Context
– Data Mining Context
• Conclusions
Graph Theoretical Context
• Parameters of Interest
– θ_avg for the clustering solution
– θWt_avg for the clustering solution
– Number of non-trivial clusters (|C| ≥ 3)
– Percentage of nodes spanned by non-trivial clusters
• Desired behavior of the parameters (for a given θ_input)
– Smallest number of non-trivial clusters and highest percentage of spanned nodes.
– Highest θ_avg (for Scheme A) and highest θWt_avg (for Scheme B).
Discussion Of Results : Graph
Theoretical Context
• Does refinement improve the quality of the clustering solution?
– Number of Non-trivial Clusters before and after refinement
– Percentage of nodes spanned by non-trivial clusters before and after refinement
• Comparison Of Scheme A with Scheme B
Scheme A : Number Of Non-trivial Clusters
[Figure: ratio of the number of non-trivial clusters (coarsening vs. refined results) across data sets j1–j9, k1, min1, min2, normal.]
Scheme A : Percentage Of Nodes Spanned by Non-trivial Clusters
[Figure: ratio of the percentage of nodes covered by non-trivial clusters (coarsening vs. refined results) across the same data sets.]
Scheme B : Number Of Non-trivial Clusters
[Figure: ratio of the number of non-trivial clusters (coarsening vs. refined results) across the same data sets.]
Scheme B : Percentage Of Nodes Spanned by Non-trivial Clusters
[Figure: ratio of the percentage of nodes covered by non-trivial clusters (coarsening vs. refined results) across the same data sets.]
Comparison : Number of Non-trivial Clusters
[Figure: ratio of the number of non-trivial clusters, Scheme A results relative to Scheme B, across the data sets.]
Comparison : Percentage Of Nodes Spanned by Non-trivial Clusters
[Figure: ratio of the percentage of nodes covered by non-trivial clusters, Scheme A results relative to Scheme B, across the data sets.]
Data Mining Context
• Labels of the data items are known a priori.
• Types of Clusters
– Pure clusters: non-trivial clusters consisting of data items of only one label.
– Almost pure clusters: non-trivial clusters consisting of data items of at most two labels.
– Impure clusters: clusters other than pure and almost pure clusters.
• Parameters of Interest
– Number of pure + number of almost pure clusters
– Percentage of nodes spanned by pure + almost pure clusters
– Entropy of the clustering solution
Discussion Of Results : Data Mining
Context
• Does refinement improve the quality of the clustering solution?
– Number of Pure + Almost Pure Clusters before and after refinement
– Percentage of nodes spanned by Pure + Almost Pure Clusters before and after refinement
• Comparison Of Scheme A with Scheme B
Scheme A : Number Of Pure + Almost Pure Clusters
[Figure: ratio of the number of pure + almost pure clusters (coarsening vs. refined results) across the data sets.]
Scheme A : Percentage Of Nodes Spanned by Pure + Almost Pure Clusters
[Figure: ratio of the percentage of nodes covered by pure + almost pure clusters (coarsening vs. refined results) across the data sets.]
Scheme B : Number Of Pure + Almost Pure Clusters
[Figure: ratio of the number of pure + almost pure clusters (coarsening vs. refined results) across the data sets.]
Scheme B : Percentage Of Nodes Spanned by Pure + Almost Pure Clusters
[Figure: ratio of the percentage of nodes covered by pure + almost pure clusters (coarsening vs. refined results) across the data sets.]
Comparison : Number of Pure + Almost Pure Clusters
[Figure: ratio of the number of pure + almost pure clusters, Scheme A results relative to Scheme B, across the data sets.]
Comparison : Percentage Of Nodes Spanned by Pure + Almost Pure Clusters
[Figure: ratio of the percentage of nodes covered by pure + almost pure clusters, Scheme A results relative to Scheme B, across the data sets.]
Entropy
• Measure for cohesiveness among items in clusters.

$$\mathrm{cluster\_entropy}(C_i) = -\sum_{j=1}^{m} \frac{\mathrm{count}_i(\mathrm{item}_j)}{\mathrm{Total}(C_i)} \cdot \log_2\!\left(\frac{\mathrm{count}_i(\mathrm{item}_j)}{\mathrm{Total}(C_i)}\right)$$

$$\mathrm{entropy} = \sum_{i=1}^{k} \frac{\mathrm{Total}(C_i)}{\mathrm{TOTAL}} \cdot \mathrm{cluster\_entropy}(C_i)$$
• Lower Entropy indicates better cohesiveness.
• Misleading Measure
– Entropy goes up as the number of clusters goes down.
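A sketch of the measure with assumed names: `labels[v]` is the a priori label of item v, Total(C_i) = |C_i|, and TOTAL is the number of items covered by the clusters being scored:

```python
from collections import Counter
from math import log2

def cluster_entropy(cluster, labels):
    """Entropy of the label distribution inside one cluster."""
    total = len(cluster)
    counts = Counter(labels[v] for v in cluster)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def entropy(clusters, labels):
    """Size-weighted average of the per-cluster entropies."""
    total = sum(len(c) for c in clusters)
    return sum((len(c) / total) * cluster_entropy(c, labels) for c in clusters)
```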
Comparison Based On Entropy
• Entropy calculations are done only over the set of non-trivial clusters.
• Comparison of Scheme A with Scheme B
• Comparison of Scheme A and Scheme B with hMetis and a two-level refinement scheme (2-LR Scheme)
Scheme A : Entropy
[Figure: ratio of entropy (coarsening vs. refined results) across the data sets.]
Scheme B : Entropy
[Figure: ratio of entropy (coarsening vs. refined results) across the data sets.]
Comparison Based on Entropy : A Vs. B
[Figure: ratio of entropy, Scheme A results relative to Scheme B, across the data sets.]
Comparison Based On Entropy : A Vs. Best of hMetis and 2-LR Scheme
[Figure: ratio of entropy, Scheme A relative to the best result of hMetis and the 2-LR scheme, for data sets j4, j6, j7, j8, j9.]
Comparison Based On Entropy : B Vs. Best of hMetis and 2-LR Scheme
[Figure: ratio of entropy, Scheme B relative to the best result of hMetis and the 2-LR scheme, for data sets j4, j6, j7, j8, j9.]
Conclusions
• Two schemes were developed for clustering based on the multilevel paradigm.
• Results were analyzed in the graph-theoretical and data-mining contexts.
• Scheme B seems to perform better in most cases in both contexts.
• Need for
– Better tie-breaking strategies
– A better definition of a cluster
– Better clustering objectives