Community Detection


Outline:
• Definition: Community Detection
• Girvan-Newman Approach
• Hierarchical Clustering
Community Detection
Definition: Communities are groups of nodes that show (1) similar behavior among nodes inside the group and (2) dissimilar behavior toward nodes outside the group, i.e., nodes that are members of different groups. Community detection is very important in social network analysis.
Another definition: given a network (a graph G = (V, E)), a COMMUNITY is a subgraph of the network whose nodes are more tightly connected with each other than with nodes outside the subgraph.
Some practical applications of community detection:
• Useful for tracking group dynamics in social networks.
• Helps in identifying functional units in neural networks.
• Helps in categorizing web pages in search engines.
• Simplifies the visualization of networks and aids the analysis of complex graphs.
Community Detection: Girvan-Newman Approach
Girvan-Newman Approach: Girvan and Newman introduced a divisive approach for community detection.
• It removes edges based on their betweenness values.
• It uses the network modularity score to find an optimized division.
Pseudo code:
Step 1. Calculate the edge betweenness of all edges.
Step 2. Remove the edge with the highest betweenness.
Step 3. Recalculate betweenness.
Step 4. Repeat until all edges are removed, or until the modularity function is optimized (depending on the variation).
Time complexity = O(n^3), where n = the number of nodes in the graph.
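A minimal runnable sketch of this loop in Python, assuming the networkx library (edge_betweenness_centrality, connected_components, and the community modularity helper are real networkx functions; the karate-club graph is an illustrative choice, not from the source):

    import networkx as nx
    from networkx.algorithms import community as nx_comm

    G = nx.karate_club_graph()                       # illustrative input graph
    work = G.copy()
    best_q, best_parts = -1.0, None
    while work.number_of_edges() > 0:
        eb = nx.edge_betweenness_centrality(work)    # Step 1 / Step 3
        u, v = max(eb, key=eb.get)                   # Step 2: highest betweenness
        work.remove_edge(u, v)
        parts = list(nx.connected_components(work))  # current communities
        q = nx_comm.modularity(G, parts)             # scored on the original graph
        if q > best_q:
            best_q, best_parts = q, parts            # Step 4: keep the best split
    print(round(best_q, 3), len(best_parts))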
Girvan-Newman Approach: Modularity
Modularity: To assess the quality of a partitioning, the Girvan-Newman algorithm uses the concept of modularity, also known as the quality index for partitioning a network into communities. It is given as:
Q = \sum_{s=1}^{m} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right]   -- (1)
where the sum runs over the m modules of the partition, l_s is the number of links inside module s, L is the total number of links in the network, and d_s is the total degree of the nodes in module s.
From Eq. (1), it is clear that the modularity of any partition depends on the number of links in the network. A higher modularity value indicates greater similarity and tighter connection of nodes inside the community relative to nodes outside the community.
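As a worked illustration of Eq. (1), consider a toy graph of two triangles joined by one bridge edge (the graph and its two-community partition are illustrative assumptions, not from the source); a short Python check, assuming networkx:

    import networkx as nx

    G = nx.Graph([(0, 1), (1, 2), (0, 2),    # triangle: community s = 1
                  (3, 4), (4, 5), (3, 5),    # triangle: community s = 2
                  (2, 3)])                   # one bridge edge between them
    L = G.number_of_edges()                  # L = 7
    Q = 0.0
    for s in ({0, 1, 2}, {3, 4, 5}):
        ls = sum(1 for u, v in G.edges() if u in s and v in s)  # links inside s
        ds = sum(G.degree(n) for n in s)                        # total degree of s
        Q += ls / L - (ds / (2 * L)) ** 2
    print(round(Q, 3))   # 2 * (3/7 - (7/14)^2) ≈ 0.357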
Variations of the Girvan-Newman Algorithm
Vertex betweenness centrality (first proposed by Freeman):
• A classical vertex-importance measure on a network.
• Defined as the total number of shortest paths that pass through each vertex in the network.
• There is a possible ambiguity with this definition.
Edge betweenness centrality:
• Girvan and Newman's generalization of vertex betweenness to edges.
• The number of shortest paths that pass through a given edge.
• "If there is more than one shortest path between a pair of vertices, each path is given equal weight such that the total weight of all the paths is unity."
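A small illustration of why edge betweenness exposes inter-community edges, assuming Python with networkx (barbell_graph and edge_betweenness_centrality are real networkx functions; the graph choice is an assumption for the example):

    import networkx as nx

    G = nx.barbell_graph(5, 0)    # two 5-cliques joined by a single bridge edge
    eb = nx.edge_betweenness_centrality(G, normalized=False)
    bridge = max(eb, key=eb.get)
    print(bridge, eb[bridge])     # (4, 5): the bridge carries all 25 cross-clique shortest paths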
Hierarchical Clustering Algorithms
Important:
1. The clustering is obtained by cutting the dendrogram at a desired level.
2. Each connected component then forms a cluster.
3. Basically, we use some similarity threshold to get clusters of the desired quality (see the sketch below).
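A minimal sketch of cutting a dendrogram at a distance threshold, assuming Python with SciPy (linkage and fcluster are real scipy.cluster.hierarchy functions; the points and the threshold are illustrative assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # one tight group
                    [5.0, 5.0], [5.1, 5.2]])              # a second group far away
    Z = linkage(pts, method="average")                 # build the dendrogram
    labels = fcluster(Z, t=1.0, criterion="distance")  # 'cut' at distance 1.0
    print(labels)                                      # e.g. [1 1 1 2 2]: two clusters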
Hierarchical Clustering Algorithms
Hierarchical clustering algorithms can be divided into (1) agglomerative and (2) divisive hierarchical clustering algorithms.
Agglomerative (bottom-up): starts with each node/point as a single cluster; eventually all nodes/points belong to the same cluster.
Divisive (top-down): starts with all nodes/points belonging to the same cluster; eventually each node forms a cluster on its own.
NOTE: Neither agglomerative nor divisive clustering algorithms require the number of clusters k in advance. They use a termination condition based on some similarity threshold.
Hierarchical Agglomerative Clustering (HAC) Algorithm:
Pseudo Code:
Step 1. Start with all instances in their own cluster.
Step 2. Until there is only one cluster: among the current clusters, determine the two clusters, c_i and c_j, that are most similar.
Step 3. Replace c_i and c_j with the single cluster c_i ∪ c_j, and repeat.
Variations of Agglomerative Clustering - 1
Single Link Agglomerative Clustering:
1. Uses the maximum similarity of pairs, i.e.
   sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)
2. Suffers from the chaining effect, i.e., it generally results in long and thin clusters.
3. After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, can be calculated as:
   sim(c_i \cup c_j, c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))
Figure: Single Link Clustering
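To see the chaining effect concretely, here is a sketch with SciPy's real single-linkage implementation on assumed toy data (a line of evenly spaced points): single link keeps merging through the nearest endpoints, so the whole line collapses into one elongated cluster.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    chain = np.array([[float(i), 0.0] for i in range(6)])  # points 1 unit apart on a line
    Z = linkage(chain, method="single")   # cluster distance = min pairwise distance (max similarity)
    print(fcluster(Z, t=1.5, criterion="distance"))   # [1 1 1 1 1 1]: one long, thin cluster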
Variations of Agglomerative Clustering - 2
Complete Link Agglomerative Clustering (important points):
1. Uses the minimum similarity of pairs:
   sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)
2. Makes "tighter", spherical clusters that are typically preferable.
3. After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, can be calculated as:
   sim(c_i \cup c_j, c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))
Figure: Complete Link Agglomerative Clustering
Note: Due to the use of the minimum similarity of pairs, the method is highly sensitive to outliers.
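For contrast, the same assumed chain of points under SciPy's complete linkage: because the cluster distance is the maximum pairwise distance, the chain is broken into compact pieces instead of one long cluster.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    chain = np.array([[float(i), 0.0] for i in range(6)])
    Z = linkage(chain, method="complete")   # cluster distance = max pairwise distance
    print(fcluster(Z, t=1.5, criterion="distance"))   # e.g. [1 1 2 2 3 3]: three tight pairs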
Variations of Agglomerative Clustering - 3
Group Average Agglomerative Clustering (GAAC): uses the average similarity across all pairs within the merged cluster to measure the similarity of two clusters. In this scheme, the average similarity between two clusters (say, c_i and c_j) can be computed as:
sim(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in (c_i \cup c_j)} \; \sum_{y \in (c_i \cup c_j),\, y \neq x} sim(x, y)
where sim(x, y) = the count of co-occurring words in x and y.
Important Points:
(1) Does not suffer from the problem of elongated clusters, as single-link clustering does.
(2) Reduces the effect of outliers, which complete-link clustering suffers from.
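A direct Python transcription of this average-similarity formula (the helper name avg_sim, the toy documents, and the word-overlap sim() are illustrative assumptions; the source only defines sim(x, y) as the count of co-occurring words):

    def avg_sim(ci, cj, sim):
        merged = ci | cj                  # c_i ∪ c_j
        n = len(merged)
        # sum sim(x, y) over all ordered pairs with y != x in the merged cluster
        total = sum(sim(x, y) for x in merged for y in merged if y != x)
        return total / (n * (n - 1))

    docs = {1: {"graph", "node", "edge"}, 2: {"graph", "edge"},
            3: {"cat", "dog"}, 4: {"dog", "bird"}}
    sim = lambda x, y: len(docs[x] & docs[y])   # co-occurring word count
    print(avg_sim({1}, {2}, sim))               # 2.0: they share "graph" and "edge"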
Strengths & Weaknesses
Strengths of Hierarchical Clustering:
• Does not require the number of clusters in advance. Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
Weaknesses:
• Not efficient: the complexity is at least O(n^2) (the naïve algorithm below is O(N^3)).
• Once a decision is made to combine two clusters, it cannot be undone.
• No objective function is directly minimized.
Pseudo code – Agglomerative Clustering
Pseudo code: naïve algorithm with O(N^3) time complexity

    Set active = InputPoints;
    while (active.size() > 1) {
        double bestD = infinity;
        Cluster left = null, right = null;
        // scan every pair of active clusters for the closest two
        foreach A in active do {
            foreach B in active do {
                if ((A != B) and (d(A, B) < bestD)) {
                    bestD = d(A, B);
                    left = A;
                    right = B;
                }
            }
        }
        // merge the closest pair into a new cluster
        active.remove(left);
        active.remove(right);
        active.add(new Cluster(left, right));
    }
Source: Walter et al., Fast Agglomerative Clustering for Rendering.
Pseudo code:

    worklist = new Set(input_points);
    kdtree = new KDTree(input_points);
    newWork = new Set();
    while (true) {
        for each Element p in worklist do {
            if (/* p already clustered */) continue;
            q = kdtree.findNearest(p);
            if (q == null) break;          // stop if p is the last element
            r = kdtree.findNearest(q);
            if (p == r) {                  // p and q are mutual nearest neighbors:
                Element e = cluster(p, q); // create new cluster e that contains p and q
                newWork.add(e);
            } else {                       // can't cluster yet, try again later
                newWork.add(p);            // add p back to the worklist
            }
        }
        if (newWork.size() == 1)           // we have a single cluster
            break;
        worklist.addAll(newWork);          // add new nodes to the worklist
        kdtree.clear();
        kdtree.addAll(newWork);
        newWork.clear();
    }
Pseudo code for Agglomerative Clustering, based on KD-Tree
References
• P.-N. Tan, M. Steinbach, and V. Kumar. "Introduction to Data Mining." Pearson Addison Wesley, 2005.
• B. Walter, K. Bala, M. Kulkarni, and K. Pingali. "Fast Agglomerative Clustering for Rendering." IEEE Symposium on Interactive Ray Tracing, 2008.
• M. Girvan and M. E. J. Newman. "Community structure in social and biological networks." Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002).
• A. Lancichinetti and S. Fortunato. "Limits of modularity maximization in community detection." Physical Review E 84: 066122 (2011). arXiv:1107.1155.
• Survey article "Communities in Networks." Notices of the American Mathematical Society.