Detecting Community Structure in Network


2004 Summer Intensive Studies on Complex Networks
Detecting Community Structure in Network
Seung Woo Son
KAIST
2004. 8. 11.
http://cnrl.snu.ac.kr/
Clustering of data

Partitional clustering methods

An important technique in data analysis: divide the data according to
natural classes. Used in pattern recognition, learning, astrophysics,
and network analysis.

Input: N multivariable data points $x_i$, $i = 1, 2, \dots, N$, in a
D-dimensional vector space equipped with a metric.
On a network

N vertices (nodes), no prior information. We only know the edge (link)
connectivity: the structural information.

How can we divide the network into several parts?
= How can we find the "community" structure?

Applications: Web pages on the same topic, hidden social relationships,
distributing processes to processors in a parallel computer, etc.
Community, cluster

Functional modules in cellular and genetic networks.
Cultural societies, or an important source of a person's identity, in
social networks.
A bundle of web pages on common topics, etc.

P. Holme, M. Huss, and H. Jeong, Bioinformatics 19, 532 (2003).
D. Wilkinson and B. A. Huberman, Proc. Natl. Acad. Sci. USA 10.1073/pnas.0307740100 (2004).
A. Vespignani, Nature Genetics 35, 118 (2003).
J. Scott, Social Network Analysis: A Handbook, 2nd ed., Sage Publications (2000).

Community, module, (cohesive) subgroup, cluster, clique, etc.
Computer science, mathematics, sociology, biology, and physics all meet
in this community-finding problem.
Structural definition of community

Groups of vertices within which connections are dense, but between
which connections are sparser. We use a structural definition because
we don't have any prior information about the network.
Modularity

$Q = \sum_i \left( e_{ii} - \sum_{jk} e_{ij} e_{ki} \right) = \mathrm{Tr}\, \mathbf{e} - \| \mathbf{e}^2 \|$

$e_{ij}$ : the fraction of edges in the original network that connect
vertices in group i to those in group j.
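As a minimal illustration (not from the talk), modularity can be computed directly from the mixing matrix e; the two-group matrix below is a made-up example:

```python
def modularity(e):
    """Q = sum_i (e_ii - a_i^2), where a_i = sum_j e_ij is the fraction
    of edge ends attached to group i, and e[i][j] is the fraction of
    edges joining group i to group j."""
    a = [sum(row) for row in e]                      # row sums a_i
    return sum(e[i][i] - a[i] ** 2 for i in range(len(e)))

# Hypothetical two-group mixing matrix: 80% of edges internal, 20% between.
e = [[0.4, 0.1],
     [0.1, 0.4]]
print(modularity(e))  # ≈ 0.3: denser inside the groups than expected at random
```

A uniform mixing matrix (no community structure) gives Q = 0.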
Key points (highlights)

What property or measure of the network is used in each algorithm or
method, and what is its physical meaning?
Eigenvalues and eigenvectors: the spectrum of the adjacency matrix.
Edge betweenness, information centrality.
Distance, dissimilarity index, edge clustering coefficient, etc.

Is the method agglomerative or divisive?

What prior information is required?
Whether there is a community structure or not.
How many modules there are.

Performance of the partitioning results, and computational complexity.

We will review about 11 different recently studied methods.
If you are bored, ask me a question.
1. Spectral bisection (old one)

M. Fiedler, Czech. Math. J. 23, 298 (1973)
A. Pothen, H. Simon, and K.-P. Liou, SIAM J. Matrix Anal. Appl. 11, 430 (1990)
F. R. K. Chung, Spectral Graph Theory, Amer. Math. Soc. (1997)
http://www.cs.berkeley.edu/~demmel/cs267/lecture20/lecture20.html
Laplacian L of an n-vertex undirected graph G:

$L = D - A$

- D is the diagonal matrix of vertex degrees k.
- A is the adjacency matrix.

Example: vertices 1-5 in two components, with edges 1-2, 3-4, 3-5:

$L = \begin{pmatrix} 1 & -1 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 2 & -1 & -1 \\ 0 & 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 0 & 1 \end{pmatrix}$

Eigenvalues: $E = (0, 0, 1, 2, 3)$. The uniform vector
$\mathbf{1} = (1, 1, 1, 1, 1)$ is always an eigenvector with eigenvalue 0;
here 0 has multiplicity 2 because the graph has two components. The
eigenvectors (columns, ordered by eigenvalue) are

$V = \begin{pmatrix}
1/\sqrt{2} & 0 & 0 & 1/\sqrt{2} & 0 \\
1/\sqrt{2} & 0 & 0 & -1/\sqrt{2} & 0 \\
0 & 1/\sqrt{3} & 0 & 0 & 2/\sqrt{6} \\
0 & 1/\sqrt{3} & 1/\sqrt{2} & 0 & -1/\sqrt{6} \\
0 & 1/\sqrt{3} & -1/\sqrt{2} & 0 & -1/\sqrt{6}
\end{pmatrix}$
Bisect !

Now add the edge 1-3 to connect the two components (edges 1-2, 1-3, 3-4, 3-5):

$L = \begin{pmatrix} 2 & -1 & -1 & 0 & 0 \\ -1 & 1 & 0 & 0 & 0 \\ -1 & 0 & 3 & -1 & -1 \\ 0 & 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 0 & 1 \end{pmatrix}$

Eigenvalues: $E = (0.00, 0.52, 1.00, 2.31, 4.17)$, with eigenvectors

$V \approx \begin{pmatrix}
0.45 & 0.34 & 0.00 & 0.70 & 0.44 \\
0.45 & 0.70 & 0.00 & -0.54 & -0.14 \\
0.45 & -0.20 & 0.00 & 0.32 & -0.81 \\
0.45 & -0.42 & 0.71 & -0.24 & 0.26 \\
0.45 & -0.42 & -0.71 & -0.24 & 0.26
\end{pmatrix}$

The eigenvector corresponding to the second-lowest eigenvalue (here
0.52) must have both positive and negative elements, because it is
orthogonal to the uniform eigenvector. The signs of its elements split
the vertices into the two groups: {1, 2} (positive) and {3, 4, 5}
(negative).
The spectral bisection method is reasonably fast.

Algebraic connectivity (the second-smallest eigenvalue $\lambda_2$):
it measures how good the split is, with smaller values corresponding to
better splits.

For a general n-by-n matrix, finding the eigenvectors takes O(n^3)
time. For a sparse matrix, however, the Lanczos method reduces this to
approximately $O(m / (\lambda_3 - \lambda_2))$.

G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins
University Press, Baltimore, MD (1989)
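A minimal pure-Python sketch of spectral bisection (my illustration, not code from the talk): find the Fiedler vector by power iteration on $cI - L$, projecting out the uniform eigenvector at every step, then split vertices by sign. It assumes a connected graph and a start vector not orthogonal to the Fiedler vector:

```python
def fiedler_vector(L, iters=2000):
    """Approximate the eigenvector of the second-smallest eigenvalue of
    the Laplacian L by power iteration on c*I - L, projecting out the
    constant eigenvector (eigenvalue 0) at every step."""
    n = len(L)
    c = 2 * max(L[i][i] for i in range(n))  # c >= largest eigenvalue of L
    v = [(-1) ** i for i in range(n)]       # arbitrary start vector
    for _ in range(iters):
        w = [c * v[i] - sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
        mean = sum(w) / n                   # remove the (1,...,1) component
        w = [x - mean for x in w]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Connected 5-vertex example: edges 1-2, 1-3, 3-4, 3-5.
L = [[ 2, -1, -1,  0,  0],
     [-1,  1,  0,  0,  0],
     [-1,  0,  3, -1, -1],
     [ 0,  0, -1,  1,  0],
     [ 0,  0, -1,  0,  1]]
v = fiedler_vector(L)
group = [i + 1 for i, x in enumerate(v) if x * v[0] > 0]
print(group)  # vertices on the same side of the split as vertex 1
```

On this graph the sign pattern separates {1, 2} from {3, 4, 5}; a dedicated sparse eigensolver would replace the power iteration in practice.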
2. The Kernighan-Lin (KL) algorithm

B. W. Kernighan and S. Lin, Bell System Technical Journal 49, 291 (1970)
http://www.cs.berkeley.edu/~demmel/cs267/lecture18/lecture18.html

Benefit function Q: the number of edges that lie within the two groups
minus the number that lie between them.

Bisect !
1. Specify the sizes of the two groups, N(A) and N(B).
2. Calculate ΔQ for all possible exchange pairs from A and B.
3. Choose the pair that maximizes the change in Q (greedy algorithm).
4. Repeat 2 & 3 until all vertices have been swapped once
   (a vertex that has been swapped is never swapped again).
5. Go back over the sequence of swaps and find the state with the highest Q.

- This algorithm requires knowing a priori what the sizes of the groups will be.
- It runs moderately quickly, in worst-case time O(n^2). However, if we
  don't know the sizes, this increases to O(n^3).
- Without the size constraint, the best values of Q are always achieved
  for very asymmetric, trivial divisions.
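The pass structure above can be sketched in Python (my simplified illustration, not the original implementation; it minimizes the cut size, which for fixed group sizes is equivalent to maximizing the benefit function Q):

```python
def kl_bisection(adj, A, B):
    """One pass of the Kernighan-Lin heuristic (simplified sketch).
    adj: dict vertex -> set of neighbours; A, B: the two initial groups,
    whose sizes stay fixed, as the algorithm requires."""
    A, B = set(A), set(B)

    def cut(A, B):
        # number of edges crossing between the two groups
        return sum(1 for u in A for v in adj[u] if v in B)

    best, best_cut = (set(A), set(B)), cut(A, B)
    swapped = set()
    for _ in range(min(len(A), len(B))):
        # greedily pick the unswapped pair whose exchange cuts the fewest edges
        a, b = min(
            ((a, b) for a in A - swapped for b in B - swapped),
            key=lambda p: cut(A - {p[0]} | {p[1]}, B - {p[1]} | {p[0]}),
        )
        A = A - {a} | {b}
        B = B - {b} | {a}
        swapped |= {a, b}
        c = cut(A, B)
        if c < best_cut:          # step 5: remember the best state seen
            best, best_cut = (set(A), set(B)), c
    return best, best_cut

# Two triangles joined by the edge 2-3, starting from a deliberately bad split.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
(A, B), cut_size = kl_bisection(adj, {0, 1, 3}, {2, 4, 5})
print(sorted(A), sorted(B), cut_size)  # recovers the two triangles
```

A full implementation repeats such passes until no pass improves the cut.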
3. Newman fast algorithm

M. E. J. Newman, cond-mat/0309508 (PRE in press)

Modularity:

$Q = \sum_i \left( e_{ii} - \sum_{jk} e_{ij} e_{ki} \right) = \mathrm{Tr}\, \mathbf{e} - \| \mathbf{e}^2 \|$

Maximize Q by a greedy algorithm!

Generally, the number of ways to divide n vertices into g non-empty
groups is given by the Stirling number of the second kind S(n, g), and
hence the number of distinct community divisions is
$\sum_{g=1}^{n} S(n, g)$, far too many to search exhaustively.

1. Start with each vertex alone in its own community (n communities).
2. Calculate the increase in Q for all possible pairs of communities.
3. Choose the merge with the greatest increase in Q.
4. Repeat 2 & 3 until the modularity Q reaches its maximal value.

Time complexity: O(mn), i.e. O(n^2) on a sparse graph.

An agglomerative hierarchical clustering method!
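A brute-force sketch of the greedy agglomeration (my illustration; Newman's actual algorithm updates ΔQ incrementally instead of recomputing Q for every candidate merge, which is what gives the O(mn) bound):

```python
def greedy_modularity(edges, n):
    """Start with every vertex in its own community, repeatedly merge the
    pair of communities giving the largest Q, and keep the partition with
    the highest Q seen along the way."""
    m = len(edges)
    comm = {v: v for v in range(n)}                 # vertex -> community label

    def Q(comm):
        labels = set(comm.values())
        e = {c: {d: 0.0 for d in labels} for c in labels}
        for u, v in edges:
            e[comm[u]][comm[v]] += 0.5 / m          # each edge contributes half
            e[comm[v]][comm[u]] += 0.5 / m          # in each direction
        return sum(e[c][c] - sum(e[c].values()) ** 2 for c in labels)

    best, best_q = dict(comm), Q(comm)
    while len(set(comm.values())) > 1:
        labels = sorted(set(comm.values()))
        merges = []
        for i, c in enumerate(labels):
            for d in labels[i + 1:]:
                trial = {v: (c if x == d else x) for v, x in comm.items()}
                merges.append((Q(trial), trial))
        q, comm = max(merges, key=lambda t: t[0])
        if q > best_q:
            best, best_q = dict(comm), q
    return best, best_q

# Two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
best, best_q = greedy_modularity(edges, 6)
print(best_q)  # ≈ 0.357 for the two-triangle partition
```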
4. q-state Potts method or RB method (Reichardt-Bornholdt method)

J. Reichardt and S. Bornholdt, cond-mat/0402349 (2004)

q-state Potts model on the network. Hamiltonian:

$H = -J \sum_{(i,j) \in E} \delta_{\sigma_i, \sigma_j} + \gamma \sum_{s=1}^{q} \frac{n_s (n_s - 1)}{2}$

The first term is the nearest-neighbour ferromagnetic interaction of
the Potts model, favouring a homogeneous distribution of spins. The
second term (diversity) is a global anti-ferromagnetic interaction. The
coupling is chosen as $\gamma = c \cdot p$ with $p = 2m / (N(N-1))$.
q = N/5 is reasonable for applications.

The ground state is found with a Monte Carlo heat-bath algorithm and
simulated annealing. The magnetization is

$m = \frac{q\, n_{\max}/N - 1}{q - 1}$ with $n_{\max} = \max\{ n_1, n_2, \dots, n_q \}$,

and the susceptibility is $\chi = \frac{N}{T} \left( \langle m^2 \rangle - \langle m \rangle^2 \right)$.

Partition quality is measured by

$Q = \sum_{s=1}^{q} ( e_{ss} - a_s^2 )$ with $a_s = \sum_i e_{si}$

Test: 128-node computer-generated network (proposed by Newman), 4
groups of 32 nodes each, average of 16 links per node (z_in + z_out = 16).
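A toy evaluation of this Hamiltonian (my illustration; the actual method minimizes H by heat-bath Monte Carlo rather than comparing hand-picked configurations). Here γ = p, i.e. c = 1, and the graph is a made-up pair of triangles joined by one edge:

```python
from collections import Counter

def potts_energy(edges, spins, J=1.0, gamma=0.5):
    """H = -J * (# edges whose endpoints share a spin)
           + gamma * sum_s n_s (n_s - 1) / 2   (global diversity term)."""
    ferro = -J * sum(1 for i, j in edges if spins[i] == spins[j])
    sizes = Counter(spins.values())             # n_s for each spin state s
    anti = gamma * sum(n * (n - 1) / 2 for n in sizes.values())
    return ferro + anti

edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
p = 2 * len(edges) / (6 * 5)                    # gamma = c * p with c = 1
aligned = potts_energy(edges, {v: 0 for v in range(6)}, gamma=p)
split = potts_energy(edges, {v: v // 3 for v in range(6)}, gamma=p)
print(aligned, split)  # the two-community configuration has lower energy
```

With c = 1 the all-aligned state has energy 0, so any negative-energy configuration reveals community structure.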
5. Hierarchical clustering

Output: a dendrogram.

1. Define a metric: a measure of similarity $x_{ij}$ between pairs (i, j) of vertices.
2. Join clusters by single linkage, complete linkage, or average linkage.

Structural equivalence: two vertices are said to be structurally
equivalent if they have the same set of neighbours, i.e. if they have
the same friends.

Euclidean distance:

$x_{ij} = \sum_{k \ne i,j} ( A_{ik} - A_{jk} )^2$

Pearson correlation:

$x_{ij} = \frac{ \frac{1}{n} \sum_k ( A_{ik} - \mu_i )( A_{jk} - \mu_j ) }{ \sigma_i \sigma_j }$

K-components: two vertices in the same community have at least k
independent paths between them; count the edge-independent paths
(max-flow) between vertices.

Time complexity: max( O(mn), O(n^2 log n) ), because of the sorting of
the n^2 similarities.
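The two similarity measures can be sketched directly from the adjacency matrix (my illustration, on a made-up six-vertex graph of two triangles joined by one edge):

```python
from math import sqrt

def euclidean_dissimilarity(A, i, j):
    """Structural-equivalence distance between rows i and j of the
    adjacency matrix, skipping positions i and j themselves."""
    return sum((A[i][k] - A[j][k]) ** 2
               for k in range(len(A)) if k not in (i, j))

def pearson_similarity(A, i, j):
    """Pearson correlation between rows i and j of the adjacency matrix."""
    n = len(A)
    mu_i, mu_j = sum(A[i]) / n, sum(A[j]) / n
    cov = sum((A[i][k] - mu_i) * (A[j][k] - mu_j) for k in range(n)) / n
    s_i = sqrt(sum((x - mu_i) ** 2 for x in A[i]) / n)
    s_j = sqrt(sum((x - mu_j) ** 2 for x in A[j]) / n)
    return cov / (s_i * s_j)

# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
print(euclidean_dissimilarity(A, 4, 5))  # 0: structurally equivalent
print(euclidean_dissimilarity(A, 2, 4))  # 3: dissimilar rows
```

Feeding such a matrix of pairwise scores into single, complete, or average linkage produces the dendrogram.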
6. Zhou dissimilarity index method

H. Zhou, Phys. Rev. E 67, 061901 (2003)
H. Zhou, Phys. Rev. E 67, 041908 (2003)

The distance $d_{ij}$ from vertex i to vertex j is defined as the
average number of steps needed for a Brownian particle on this network
to move from vertex i to vertex j.

Transfer matrix (jumping probability):

$P_{ij} = \frac{ A_{ij} }{ \sum_{l=1}^{N} A_{il} }$

Distance:

$d_{i,j} = \sum_{l=1}^{N} \left[ ( I - B(j) )^{-1} \right]_{il}$

where I is the N-by-N identity matrix and B(j) equals P except that
$B_{lj}(j) = 0$ for all l. Equivalently, the distances solve the linear
system

$( I - B(j) ) \begin{pmatrix} d_{1,j} \\ \vdots \\ d_{N,j} \end{pmatrix} = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$

Dissimilarity index:

$\Lambda(i,j) = \frac{ \sqrt{ \sum_{k \ne i,j} ( d_{i,k} - d_{j,k} )^2 } }{ N - 2 }$
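A sketch of the distance computation (my illustration): build B(j), then solve (I − B(j)) d = 1 by Gaussian elimination. It assumes every vertex has at least one edge:

```python
def brownian_distances(A, j):
    """Average number of random-walk steps d_{i,j} to reach vertex j,
    from the linear system (I - B(j)) d = (1, ..., 1)."""
    n = len(A)
    k = [sum(row) for row in A]                       # vertex degrees
    # M = I - B(j): transfer matrix P_il = A_il / k_i with column j zeroed
    M = [[(1.0 if i == l else 0.0) - (A[i][l] / k[i] if l != j else 0.0)
          for l in range(n)] for i in range(n)]
    d = [1.0] * n
    # Gaussian elimination with partial pivoting on the system (M | d)
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        d[col], d[piv] = d[piv], d[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n):
                M[r][c] -= f * M[col][c]
            d[r] -= f * d[col]
    for col in range(n - 1, -1, -1):                  # back substitution
        d[col] = (d[col] - sum(M[col][c] * d[c]
                               for c in range(col + 1, n))) / M[col][col]
    return d

# Path graph 0-1-2: average number of steps to reach vertex 0.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
print(brownian_distances(A, 0))  # d_{1,0} = 3, d_{2,0} = 4
```

The dissimilarity index Λ(i, j) then just compares two such distance vectors entry by entry.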
7. Girvan-Newman (GN) algorithm

M. Girvan and M. E. J. Newman, PNAS 99, 7821 (2002)
M. E. J. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004)

The few edges that lie between communities can be thought of as forming
"bottlenecks" between the communities.

Betweenness and edge betweenness: the number of geodesic (i.e.,
shortest) paths between vertex pairs that run along the edge in
question, summed over all vertex pairs.

Edge removal: after calculating the betweenness of all edges in the
network, remove the one with the highest betweenness. Recalculate after
each removal and repeat until the modularity Q is maximal.

Time complexity: O(m^2 n)
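Edge betweenness for an unweighted graph can be computed with Brandes' algorithm; below is a compact sketch (my illustration, counting each unordered pair twice, which does not affect the ranking):

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every edge, summed over all ordered
    vertex pairs (Brandes' accumulation along the BFS dag)."""
    bet = {}
    for s in adj:
        dist = {s: 0}                       # BFS from s, counting paths
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        order, preds = [], {v: [] for v in adj}
        q = deque([s])
        while q:
            u = q.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {v: 0.0 for v in adj}       # back-propagate dependencies
        for w in reversed(order):
            for u in preds[w]:
                c = sigma[u] / sigma[w] * (1 + delta[w])
                edge = frozenset((u, w))
                bet[edge] = bet.get(edge, 0.0) + c
                delta[u] += c
    return bet

# Two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bet = edge_betweenness(adj)
bridge = max(bet, key=bet.get)
print(sorted(bridge))  # the inter-community edge 2-3 scores highest
```

One GN step then deletes `bridge` from `adj` and recomputes; repeating this while tracking Q traces out the full algorithm.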
8. Tyler-Wilkinson-Huberman (TWH) method

J. R. Tyler, D. M. Wilkinson, and B. A. Huberman, cond-mat/0303264 (2003)

A variation of the Girvan-Newman algorithm to improve the calculation
speed. Tyler et al. suggest that, instead of summing over all vertices,
only a subset of vertices i be summed over, giving partial betweenness
scores for all edges; if a random sample is chosen, this gives a Monte
Carlo estimate of the betweenness. The number of vertices sampled is
chosen so as to make the betweenness of at least one edge in the
network greater than a certain threshold.

This stochastic approach reduces the time complexity from O(m^2 n) to O(m^2).
9. RCCLP method or Parisi method
(Radicchi-Castellano-Cecconi-Loreto-Parisi method)

F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, PNAS 101, 2658 (2004)

Definition of community in a strong sense:

$k_i^{in}(V) > k_i^{out}(V), \quad \forall i \in V$

Definition of community in a weak sense:

$\sum_{i \in V} k_i^{in}(V) > \sum_{i \in V} k_i^{out}(V)$

Edge clustering coefficient:

$C_{i,j}^{(3)} = \frac{ z_{i,j}^{(3)} + 1 }{ \min[ (k_i - 1), (k_j - 1) ] }$

where $z_{i,j}^{(3)}$ is the number of triangles built on that edge.

Edge clustering coefficient of order g:

$C_{i,j}^{(g)} = \frac{ z_{i,j}^{(g)} + 1 }{ s_{i,j}^{(g)} }$

The edge clustering coefficient is strongly negatively correlated with
edge betweenness, so iteratively removing the edges with the smallest
coefficient splits the network along inter-community edges.

Time complexity: O(m^4/n^2) ~ O(n^2)

This algorithm relies on the presence of triangles in the network.
Clearly, if a network has few triangles in the first place, the edge
clustering coefficient will be small for all edges, and the algorithm
will fail to find the communities.
10. Information centrality method (Fortunato-Latora-Marchiori method)

S. Fortunato, V. Latora, and M. Marchiori, cond-mat/0402522 (2004)

Network efficiency E:

$E[G] = \frac{1}{N(N-1)} \sum_{i \ne j} \frac{1}{d_{ij}}$

Information centrality $C^I$ of an edge k: the relative drop in
efficiency when that edge is removed,

$C_k^I = \frac{\Delta E}{E} = \frac{ E[G] - E[G_k'] }{ E[G] }$

The method iteratively removes the edges with the highest information
centrality.

Time complexity: O(m^3 n)

Test: 64-node computer-generated network, 256 edges, 4 groups of 16
nodes each.
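Both quantities can be sketched directly (my illustration; BFS gives the shortest-path distances, and disconnected pairs contribute 1/∞ = 0):

```python
from collections import deque

def efficiency(adj):
    """E[G] = (1 / (N(N-1))) * sum_{i != j} 1 / d_ij."""
    n, total = len(adj), 0.0
    for s in adj:
        dist = {s: 0}                       # BFS distances from s
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        total += sum(1.0 / d for v, d in dist.items() if v != s)
    return total / (n * (n - 1))

def information_centrality(adj, edge):
    """Relative drop in network efficiency when `edge` is removed."""
    u, v = edge
    cut = {x: set(nbrs) for x, nbrs in adj.items()}
    cut[u].discard(v)
    cut[v].discard(u)
    e0 = efficiency(adj)
    return (e0 - efficiency(cut)) / e0

# Two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(information_centrality(adj, (2, 3)))  # largest: removing it splits the graph
print(information_centrality(adj, (0, 1)))  # much smaller for an internal edge
```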
11. Flake's max-flow method (Flake-Lawrence-Giles-Coetzee method)

G. W. Flake, S. R. Lawrence, C. L. Giles, and F. M. Coetzee, IEEE Computer 35, 66 (2002)

Web communities: starting from a seed page or seed Web sites, find the
boundary of the community using max-flow and min-cut, without the text
information, using only the link information.
Ex) PageRank, Hyperlink-Induced Topic Search (HITS)

Simple example of max-flow, min-cut.
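A simple max-flow / min-cut example can be sketched with the Edmonds-Karp algorithm (my illustration, with unit capacities on an undirected toy graph, not the authors' Web-graph setup):

```python
from collections import deque, defaultdict

def max_flow(cap, s, t):
    """Edmonds-Karp max-flow; cap[u][v] is the capacity of arc u -> v.
    By the max-flow min-cut theorem, the result equals the min-cut capacity."""
    flow = defaultdict(lambda: defaultdict(int))
    total = 0
    while True:
        parent = {s: None}                  # BFS for an augmenting path
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in cap[u]:
                if v not in parent and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return total
        path, v = [], t                     # walk back to recover the path
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] - flow[u][v] for u, v in path)
        for u, v in path:
            flow[u][v] += push
            flow[v][u] -= push
        total += push

# Two triangles joined by the edge 2-3, unit capacity in both directions.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
cap = defaultdict(dict)
for u, v in edges:
    cap[u][v] = cap[v][u] = 1
print(max_flow(cap, 0, 5))  # 1: the bridge 2-3 is the min cut
```

In the Web-community setting, the source is the seed page, the sink an artificial vertex attached to the rest of the graph, and the min cut marks the community boundary.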
Spectral analysis: eigenvalues and eigenvectors (the spectrum) of the
Laplacian or transfer matrix.
Edge removal: betweenness, information centrality, clustering
coefficient, etc.
Optimization approach: Hamiltonian, benefit function, or modularity Q.
Hierarchical clustering: metric (Euclidean, correlation, similarity, etc.).
12. ESMS method or K. Sneppen method (Eriksen-Simonsen-Maslov-Sneppen method)
K. A. Eriksen, I. Simonsen, S. Maslov, and K. Sneppen, Phys. Rev. Lett. 90, 148701 (2003)

13. CSCC method or Capocci method (Capocci-Servedio-Caldarelli-Colaiori method)
A. Capocci, V. D. P. Servedio, G. Caldarelli, and F. Colaiori, cond-mat/0402499

14. Donetti-Muñoz (DM) method
L. Donetti and M. A. Muñoz, cond-mat/0404652

15. Wu-Huberman (WH) method
F. Wu and B. A. Huberman, cond-mat/0310600

16. Costa's hub-based flooding method
L. F. Costa, cond-mat/0405022 (2004)