Transcript Full Talk
Community Detection Algorithm and Community Quality Metric
Mingming Chen & Boleslaw K. Szymanski
Department of Computer Science
Rensselaer Polytechnic Institute
Community Structure
Many networks display community structure
Groups of nodes within which connections are
denser than between them
Community detection algorithms
Community quality metrics
Two Related Community Detection Topics
Community detection algorithm
LabelRank: a stabilized label propagation
community detection algorithm (Xie and Szymanski, 2013).
LabelRankT: extended algorithm for dynamic
networks based on LabelRank (Xie, Chen, and Szymanski, 2013).
A new community quality metric solving two
problems of Modularity
(M. E. J. Newman, 2006; Newman and Girvan, 2004).
LabelRank Algorithm
Four operators applied to the labels
Label propagation operator
Inflation operator
Cutoff operator
Conditional update operator
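Example (figure): Question: NP = P?
Initial answers: Node 1: No; Node 2: No; Node 3: No; Node 4: Yes.
In one panel, P1(No) = 3/4 and P1(Yes) = 1/4, so Node 1 answers No.
In the other panel (the figure shows a weight of 97), P1(No) = 3/100 and P1(Yes) = 97/100, so Node 1 answers Yes.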
Label Propagation Operator
W × P
where W is the n × n weighted adjacency matrix and P is the
n × n label probability distribution matrix, which is
composed of n (1 × n) row vectors $P_i$, one for each node.
Each element $P_i(c)$ holds the current estimate of the
probability of node i observing label c ∈ C, where C is
the set of labels (here, suppose C = {1, 2, …, n}).
Ex. $P_i$ = (0.1, 0.2, …, 0.05, …)
To initialize P, each node is assigned a distribution of
probabilities over all incoming edges:
$$P_i(c) = \frac{w_{ic}}{\sum_{k \in Nb(i)} w_{ik}}, \quad \forall c \in C \ \text{s.t.}\ w_{ic} \neq 0.$$
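The initialization above amounts to row-normalizing W. A minimal sketch, assuming a dense NumPy matrix and a small made-up graph (the function name `init_label_distribution`, the weights, and the self-loops on the diagonal are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def init_label_distribution(W):
    """Row-normalize W so that P[i, c] = W[i, c] / sum_k W[i, k]."""
    W = np.asarray(W, dtype=float)
    row_sums = W.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("every node needs at least one weighted edge")
    return W / row_sums

# Hypothetical 3-node weighted graph; the diagonal holds assumed self-loops.
W = np.array([[1.0, 3.0, 0.0],
              [3.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
P = init_label_distribution(W)
print(P[0])  # distribution of node 1 over labels 1..3
```

Labels c with zero weight simply receive probability 0, so each row is a distribution over the node's own neighborhood.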
Label Propagation Operator
Each node receives the label probability distribution
from its neighbors and computes the new distribution
$$P_i'(c) = \frac{\sum_{j \in Nb(i)} w_{ij} P_j(c)}{\sum_{k \in Nb(i)} w_{ik}}, \quad \forall c \in C.$$
Example, before propagation:
P1 = (0.25, 0.25, 0.25, 0.25, 0, 0, 0, 0, 0, 0)
P2 = (0.25, 0.25, 0, 0, 0.25, 0.25, 0, 0, 0, 0)
P3 = (0.25, 0, 0.25, 0, 0, 0, 0.25, 0.25, 0, 0)
P4 = (0.25, 0, 0, 0.25, 0, 0, 0, 0, 0.25, 0.25)
After propagation:
P1 = (0.25, 0.125, 0.125, 0.125, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625)
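The example vectors can be reproduced with one matrix product. The sketch below assumes unit weights, a self-loop on every node (the listed vectors only add up if node 1 averages over itself and nodes 2, 3, and 4), and a tree in which nodes 5 to 10 attach only to their parent. The edge list is inferred from the nonzero entries of P1 to P4, not given explicitly on the slide.

```python
import numpy as np

def propagate(W, P):
    """One propagation step: P'[i] = sum_j W[i, j] * P[j] / sum_k W[i, k]."""
    W = np.asarray(W, dtype=float)
    return (W @ P) / W.sum(axis=1, keepdims=True)

n = 10
W = np.eye(n)  # assumed self-loops with weight 1
# Edge list inferred from the nonzero entries of P1..P4 (1-based node ids).
edges = [(1, 2), (1, 3), (1, 4), (2, 5), (2, 6), (3, 7), (3, 8), (4, 9), (4, 10)]
for a, b in edges:
    W[a - 1, b - 1] = W[b - 1, a - 1] = 1.0

P_before = W / W.sum(axis=1, keepdims=True)  # initialization from edge weights
P_after = propagate(W, P_before)
print(P_after[0])  # new distribution of node 1
```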
Inflation Operator
Each element $P_i(c)$ is raised to the in-th power:
$$\Gamma_{in} P_i(c) = \frac{P_i(c)^{in}}{\sum_{j \in C} P_i(j)^{in}}$$
It increases the probabilities of labels that already have high
probability and decreases those of labels with low probability during
label propagation.
Example: applying the inflation operator with in = 2 to
P1 = (0.25, 0.125, 0.125, 0.125, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625)
gives
P1 = (0.129, 0.0323, 0.0323, 0.0323, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806)
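A minimal sketch of the inflation step, assuming the normalizer is the sum over all labels of the node (the function name `inflate` is illustrative). With this normalizer the example vector becomes roughly (0.471, 0.118, 0.118, 0.118, 0.0294, …). That keeps the same 4:1 ratios as the values printed on the slide, which sum to about 0.27, so the slide may have used a different normalizer.

```python
import numpy as np

def inflate(P, exponent):
    """Raise each probability to `exponent` and renormalize each row to sum to 1."""
    P = np.asarray(P, dtype=float)
    powered = P ** exponent
    return powered / powered.sum(axis=-1, keepdims=True)

p1 = np.array([0.25, 0.125, 0.125, 0.125] + [0.0625] * 6)
inflated = inflate(p1, 2)
print(inflated.round(4))
```

The gap between the strongest label and the weakest ones widens from 4:1 to 16:1, which is the sharpening effect described above.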
Cutoff Operator
The cutoff operator $\Phi_r$ on P removes labels whose probability is
below the threshold r ∈ [0, 1], with help from the inflation
operator, which decreases the probabilities of low-probability labels
during propagation.
$\Phi_r$ efficiently reduces the space complexity from
quadratic to linear.
Example: applying the cutoff operator with r = 0.1 to
P1 = (0.129, 0.0323, 0.0323, 0.0323, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806)
gives
P1 = (0.129)
With r = 0.1, the average number of labels in each node is less than 3.
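The cutoff can be sketched as a filter that keeps only the surviving labels in a sparse mapping, which is where the space saving comes from. The dictionary representation and the name `cutoff` are assumptions for illustration; the input is the vector from the example above.

```python
def cutoff(distribution, r):
    """Keep labels whose probability is at least r, as a {label: probability} dict."""
    return {label: p for label, p in enumerate(distribution, start=1) if p >= r}

p1 = [0.129, 0.0323, 0.0323, 0.0323,
      0.00806, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806]
kept = cutoff(p1, 0.1)
print(kept)
```

Storing only the retained labels means each node holds a handful of entries instead of a full row of length n.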
Conditional Update Operator
At each iteration, it updates a node i only when it is
significantly different from its incoming neighbors in
terms of labels:
$$\sum_{j \in Nb(i)} \mathrm{isSubset}(C_i^*, C_j^*) \le q k_i,$$
where $C_i^*$ is the set of maximum-probability labels at
node i at the last step, isSubset($s_1$, $s_2$) returns 1 if $s_1 \subseteq s_2$
and 0 otherwise, $k_i$ is the node degree, and q ∈ [0, 1].
isSubset can be viewed as a measure of similarity
between two nodes.
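The update test can be sketched as follows, assuming each node stores its previous label distribution as a dictionary and that $k_i$ is the length of the neighbor list. The helper names and the toy data are hypothetical.

```python
def max_labels(distribution):
    """Set of labels tied for the highest probability in a {label: prob} dict."""
    if not distribution:
        return set()
    top = max(distribution.values())
    return {label for label, p in distribution.items() if p == top}

def should_update(i, neighbors, previous, q):
    """True when at most q * k_i neighbors already contain node i's top labels."""
    own = max_labels(previous[i])
    similar = sum(1 for j in neighbors[i] if own <= max_labels(previous[j]))
    return similar <= q * len(neighbors[i])

# Toy data: node 0 has three neighbors, only one of which shares its top label.
neighbors = {0: [1, 2, 3]}
previous = {0: {1: 0.6, 2: 0.4}, 1: {1: 0.9}, 2: {2: 1.0}, 3: {3: 1.0}}
print(should_update(0, neighbors, previous, q=0.5))
```

A larger q makes updates more frequent; a smaller q freezes nodes sooner once their neighborhood agrees with them.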
Effect of Conditional Update Operator
Running time of LabelRank
O(Tm): m is the number of edges and T is the number
of iterations.
LabelRank is a linear algorithm
Performance of LabelRank
LabelRankT
It is LabelRank with one extra conditional update rule,
by which only nodes involved in changes are updated.
Changes are handled by comparing the neighbors of node i
at two consecutive steps, $Nb_{t-1}(i)$ and $Nb_t(i)$.
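One way to read the extra rule is as a set comparison per node between two snapshots. The sketch below assumes adjacency is stored as a dictionary of neighbor lists and that any difference in the neighbor set counts as a change; both are illustrative choices, and `changed_nodes` is a hypothetical name.

```python
def changed_nodes(prev_adj, curr_adj):
    """Nodes whose neighbor set differs between two consecutive snapshots."""
    nodes = set(prev_adj) | set(curr_adj)
    return {
        i for i in nodes
        if set(prev_adj.get(i, ())) != set(curr_adj.get(i, ()))
    }

# Snapshot t-1 and snapshot t: node 4 joins and links to node 3.
prev_adj = {1: [2], 2: [1, 3], 3: [2]}
curr_adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(changed_nodes(prev_adj, curr_adj))
```

Only the returned nodes would need their label distributions recomputed at step t; all others keep the labels from step t-1.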
Two Problems of Modularity Maximization
Split large communities
Favor small communities
Resolution limit problem
Modularity optimization may fail to discover
communities smaller than a scale even in cases
where communities are unambiguously defined.
This scale depends on the total number of edges in
the network and the degree of interconnectedness
of the communities.
Favor large communities
Fortunato et al., 2008; Li et al., 2008; Arenas et al., 2008; Berry et al., 2009;
Good et al., 2010; Ronhovde et al., 2010; Fortunato, 2010; Lancichinetti et al., 2011;
Traag et al., 2011; Darst et al., 2013.
Modularity
Modularity (Q): the fraction of edges falling within
communities minus the expected value in an equivalent
network with edges placed at random
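$$Q = \frac{1}{2|E|} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2|E|} \right] \delta_{c_i, c_j},$$
$$\delta_{c_i, c_j} = \begin{cases} 1 & \text{if nodes } i \text{ and } j \text{ are in the same community,} \\ 0 & \text{otherwise.} \end{cases}$$
M. E. J. Newman, 2006.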
Equivalent definition
Newman and Girvan, 2004.
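$$Q = \sum_{c_i \in C} \left[ \frac{|E_{c_i}^{in}|}{|E|} - \left( \frac{2|E_{c_i}^{in}| + |E_{c_i}^{out}|}{2|E|} \right)^2 \right],$$
$|E_{c_i}^{in}|$: the number of intra-community edges of community $c_i$;
$|E_{c_i}^{out}|$: the number of inter-community edges of community $c_i$.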
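The two definitions can be checked against each other on a small graph. This is a sketch for an undirected, unweighted simple graph given as an edge list; the function names and the test graph (two triangles joined by one edge) are illustrative choices.

```python
from itertools import product

def modularity_pairwise(edges, community):
    """Q from the sum over ordered node pairs (i, j)."""
    m = len(edges)
    degree, adjacent = {}, set()
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        adjacent.update({(u, v), (v, u)})
    total = 0.0
    for i, j in product(degree, repeat=2):
        if community[i] == community[j]:
            a_ij = 1.0 if (i, j) in adjacent else 0.0
            total += a_ij - degree[i] * degree[j] / (2 * m)
    return total / (2 * m)

def modularity_by_community(edges, community):
    """Q from per-community intra-edge and inter-edge counts."""
    m = len(edges)
    e_in, e_out = {}, {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            e_in[cu] = e_in.get(cu, 0) + 1
        else:
            e_out[cu] = e_out.get(cu, 0) + 1
            e_out[cv] = e_out.get(cv, 0) + 1
    q = 0.0
    for c in set(community.values()):
        q += e_in.get(c, 0) / m - ((2 * e_in.get(c, 0) + e_out.get(c, 0)) / (2 * m)) ** 2
    return q

# Two triangles joined by a single edge, one community per triangle.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity_pairwise(edges, community), modularity_by_community(edges, community))
```

Both functions should agree up to floating-point error; for this graph the value is 5/14.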
Modularity with Split Penalty
Modularity (Q): the modularity of the community
detection result
$$Q = \frac{1}{2|E|} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2|E|} \right] \delta_{c_i, c_j}.$$
Split penalty (SP): the fraction of edges that connect
nodes of different communities
$$SP = \frac{1}{2|E|} \sum_{ij} A_{ij} (1 - \delta_{c_i, c_j}).$$
Qs = Q – SP addresses Modularity's problem of favoring small communities:
$$Q_s = Q - SP = \frac{1}{2|E|} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2|E|} \right] \delta_{c_i, c_j} - \frac{1}{2|E|} \sum_{ij} A_{ij} (1 - \delta_{c_i, c_j}).$$
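A short sketch of Q, SP, and Qs on an edge list. It uses the per-community form of Q and assumes an undirected, unweighted simple graph. The test graph, two 4-cliques joined by two edges, is an assumption: the original figures are not preserved, but this graph reproduces the Q, SP, and Qs values listed in the "Two Weakly Connected Communities" example (0.357, 0.143, 0.214).

```python
from itertools import combinations

def q_sp_qs(edges, community):
    """Return (Q, SP, Qs) for an undirected simple graph."""
    m = len(edges)
    e_in, e_out = {}, {}
    inter_edges = 0
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            e_in[cu] = e_in.get(cu, 0) + 1
        else:
            inter_edges += 1
            e_out[cu] = e_out.get(cu, 0) + 1
            e_out[cv] = e_out.get(cv, 0) + 1
    q = sum(
        e_in.get(c, 0) / m - ((2 * e_in.get(c, 0) + e_out.get(c, 0)) / (2 * m)) ** 2
        for c in set(community.values())
    )
    # Each inter-community edge appears twice in the sum over ij,
    # so SP reduces to (number of inter edges) / |E|.
    sp = inter_edges / m
    return q, sp, q - sp

edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2))
edges += [(0, 4), (1, 5)]  # two edges between the cliques
community = {node: (0 if node < 4 else 1) for node in range(8)}
q, sp, qs = q_sp_qs(edges, community)
print(round(q, 3), round(sp, 3), round(qs, 3))
```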
Qs with Community Density
Resolution limit: Modularity optimization may fail to
detect communities smaller than a scale
Intuitively, put density into Modularity and Split Penalty
to solve the resolution limit problem
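$$Q_{ds} = \frac{1}{2|E|} \sum_{ij} \left[ A_{ij} d_{c_i} - \frac{k_i k_j}{2|E|} d_{c_i}^2 \right] \delta_{c_i, c_j} - \frac{1}{2|E|} \sum_{ij} A_{ij}\, d_{c_i, c_j} (1 - \delta_{c_i, c_j}),$$
$$d_{c_i} = \frac{|E_{c_i}^{in}|}{|c_i| (|c_i| - 1) / 2}, \qquad d_{c_i, c_j} = \frac{|E_{c_i, c_j}|}{|c_i| |c_j|}.$$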
Equivalent definition
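$$Q_{ds} = \sum_{c_i \in C} \left[ \frac{|E_{c_i}^{in}|}{|E|} d_{c_i} - \left( \frac{2|E_{c_i}^{in}| + |E_{c_i}^{out}|}{2|E|} d_{c_i} \right)^2 - \sum_{c_j \in C,\, c_j \neq c_i} \frac{|E_{c_i, c_j}|}{2|E|} d_{c_i, c_j} \right].$$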
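The per-community form of Qds can be sketched directly from edge counts and community sizes. Assumptions in this sketch: an undirected, unweighted simple graph; a single-node community is given density 0 (the formula leaves that case undefined); and the test graph is two 4-cliques joined by two edges, which reproduces the Qds values 0.339 and 0.25 listed in the "Two Weakly Connected Communities" example even though the original figure is not preserved.

```python
from itertools import combinations

def qds(edges, community):
    """Modularity density Qds from per-community edge counts and sizes."""
    m = len(edges)
    size, e_in, e_out, e_between = {}, {}, {}, {}
    for node, c in community.items():
        size[c] = size.get(c, 0) + 1
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            e_in[cu] = e_in.get(cu, 0) + 1
        else:
            e_out[cu] = e_out.get(cu, 0) + 1
            e_out[cv] = e_out.get(cv, 0) + 1
            pair = frozenset((cu, cv))
            e_between[pair] = e_between.get(pair, 0) + 1
    total = 0.0
    for c, n_c in size.items():
        ein, eout = e_in.get(c, 0), e_out.get(c, 0)
        # Internal density; set to 0 for a singleton community (assumption).
        d_c = ein / (n_c * (n_c - 1) / 2) if n_c > 1 else 0.0
        total += ein / m * d_c - ((2 * ein + eout) / (2 * m) * d_c) ** 2
    for pair, count in e_between.items():
        ci, cj = tuple(pair)
        d_ij = count / (size[ci] * size[cj])
        # The double sum visits each unordered community pair twice.
        total -= 2 * count / (2 * m) * d_ij
    return total

edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2))
edges += [(0, 4), (1, 5)]
two = {node: (0 if node < 4 else 1) for node in range(8)}
one = {node: 0 for node in range(8)}
print(round(qds(edges, two), 3), round(qds(edges, one), 3))
```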
Example of Two Well-Separated Communities
                 Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities    0.5              0                    0.5           0.5
1 community      0                0                    0             0.245
Example of Two Weakly Connected Communities
                 Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities    0.357            0.143                0.214         0.339
1 community      0                0                    0             0.25
Ambiguity between One and Two Communities
                 Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities    0.3              0.2                  0.1           0.263
1 community      0                0                    0             0.249
Ambiguity between One and Two Communities
                 Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities    0.25             0.25                 0             0.188
1 community      0                0                    0             0.245
Example of One Well Connected Community
                 Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities    0.167            0.333                -0.167        0.0417
1 community      0                0                    0             0.23
Example of One Very Well Connected Community
                 Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities    0.0455           0.455                -0.409        -0.239
1 community      0                0                    0             0.168
Example of One Complete Graph
Community Quality on a complete graph with 8 nodes
                 Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities    -0.0714          0.571                -0.643        -0.643
1 community      0                0                    0             0
Modularity Has Nothing to Do with #Nodes
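$$Q(\text{clique}) = Q(\text{tree}) = 2 \times \left[ \frac{12}{26} - \left( \frac{13}{26} \right)^2 \right] = 0.4231;$$
$$Q_s(\text{clique}) = Q_s(\text{tree}) = 2 \times \left[ \frac{12}{26} - \left( \frac{13}{26} \right)^2 - \frac{1}{26} \right] = 0.3462;$$
$$Q_{ds}(\text{clique}) = 2 \times \left[ \frac{12}{26} \times 1 - \left( \frac{13}{26} \times 1 \right)^2 - \frac{1}{26} \times \frac{1}{4 \times 4} \right] = 0.4183;$$
$$Q_{ds}(\text{tree}) = 2 \times \left[ \frac{12}{26} \times \frac{2}{7} - \left( \frac{13}{26} \times \frac{2}{7} \right)^2 - \frac{1}{26} \times \frac{1}{7 \times 7} \right] = 0.2214.$$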
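The four values can be re-derived numerically. The sketch assumes the setup implied by the fractions above: two identical communities joined by one edge, each with 6 internal edges, built either as a 4-node clique or as a 7-node tree. The function name and argument names are illustrative.

```python
def two_identical_communities(e_in, e_between, n):
    """Q, Qs, Qds for two identical communities of n nodes and e_in internal
    edges each, connected to one another by e_between edges."""
    m = 2 * e_in + e_between                   # total number of edges
    d = e_in / (n * (n - 1) / 2)               # internal density
    d_between = e_between / (n * n)            # density between the two communities
    share = (2 * e_in + e_between) / (2 * m)   # community's share of total degree
    q = 2 * (e_in / m - share ** 2)
    qs = q - e_between / m
    qds = 2 * (e_in / m * d - (share * d) ** 2 - e_between / (2 * m) * d_between)
    return q, qs, qds

clique = two_identical_communities(e_in=6, e_between=1, n=4)
tree = two_identical_communities(e_in=6, e_between=1, n=7)
print([round(x, 4) for x in clique])
print([round(x, 4) for x in tree])
```

Q and Qs depend only on edge counts, so they cannot tell the clique from the tree; Qds differs because the densities use the number of nodes.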
5-clique Example
                  Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
30 communities    0.8758           0.09091              0.7848        0.8721
15 communities    0.8879           0.04545              0.8424        0.4305
∆Qs = (0.8424 - 0.7848) = 0.0576 > ∆Q = (0.8879 - 0.8758) = 0.0121
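The table values are consistent with a ring of thirty 5-cliques in which adjacent cliques share a single edge, partitioned either one community per clique or one community per pair of adjacent cliques. That reading is an assumption, since the figure is not preserved. Under it, a closed-form sketch for k identical communities arranged in a ring is:

```python
def ring_metrics(k, e_in, n):
    """Q, SP, Qs, Qds for k identical communities in a ring, each with n nodes,
    e_in internal edges, and one edge to each of its two ring neighbors."""
    m = k * e_in + k                       # k internal blocks plus k ring edges
    e_out = 2                              # edges leaving each community
    d = e_in / (n * (n - 1) / 2)           # internal density
    d_between = 1 / (n * n)                # one edge between adjacent communities
    share = (2 * e_in + e_out) / (2 * m)
    q = k * (e_in / m - share ** 2)
    sp = k / m
    qds = k * (e_in / m * d - (share * d) ** 2 - 2 * (1 / (2 * m)) * d_between)
    return q, sp, q - sp, qds

per_clique = ring_metrics(k=30, e_in=10, n=5)   # one community per 5-clique
merged = ring_metrics(k=15, e_in=21, n=10)      # two cliques plus their link
print([round(x, 4) for x in per_clique])
print([round(x, 4) for x in merged])
```

Q and Qs both score the merged partition higher, while Qds scores the per-clique partition higher.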
Thanks!
Q&A
Example of Two Weakly Connected Communities
                 Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities    0.309            0.25                 0.0586        0.264
1 community      -0.00586         0.125                -0.131        0.202