2nd SNA-KDD Workshop


Overcoming Resolution Limits in MDL Community Detection

L. Karl Branting The MITRE Corporation

2nd SNA-KDD Workshop 24 Aug 2008

Outline

Utility functions in community detection

Resolution limits

MDL-based community detection

– Previous: RB and AP
– New: SGE

Experimental Evaluation

Lessons


Utility functions in community detection

Two components of community detection algorithms

– Utility function: quality criterion to be optimized
– Search strategy: procedure for finding optimal partition

Examples

Girvan & Newman (2003)

Utility function: modularity

Search strategy: greedy divisive hierarchical clustering (iteratively remove highest betweenness edge)

Newman (2003)

Utility function: modularity

Search strategy: greedy agglomerative hierarchical clustering (iteratively choose highest modularity merge)

Tasgin & Bingol (2006)

Utility function: modularity

Search strategy: genetic algorithm
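The split between utility function and search strategy can be made concrete with a minimal sketch, in which a greedy agglomerative search repeatedly applies whichever merge most increases modularity. This is a naive illustration that recomputes modularity from scratch for every candidate merge, not Newman's efficient algorithm; the function names and the toy graph are assumptions made for the example.

```python
def modularity(edges, group_of):
    """Standard modularity of a partition; each undirected edge is listed once."""
    l = len(edges)
    internal, degree_sum = {}, {}
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        degree_sum[gu] = degree_sum.get(gu, 0) + 1
        degree_sum[gv] = degree_sum.get(gv, 0) + 1
        if gu == gv:
            internal[gu] = internal.get(gu, 0) + 1
    return sum(internal.get(g, 0) / l - (degree_sum[g] / (2 * l)) ** 2
               for g in degree_sum)


def greedy_agglomerative(edges):
    """Start with one group per vertex; apply the best merge until none helps."""
    vertices = sorted({v for e in edges for v in e})
    group_of = {v: v for v in vertices}
    best_q = modularity(edges, group_of)
    while True:
        # Only groups joined by at least one edge are merge candidates.
        candidates = {tuple(sorted((group_of[u], group_of[v])))
                      for u, v in edges if group_of[u] != group_of[v]}
        best_trial, best_trial_q = None, best_q
        for a, b in sorted(candidates):
            trial = {v: (a if g == b else g) for v, g in group_of.items()}
            q = modularity(edges, trial)
            if q > best_trial_q:
                best_trial, best_trial_q = trial, q
        if best_trial is None:          # no merge improves the utility
            return group_of, best_q
        group_of, best_q = best_trial, best_trial_q


# Toy graph (hypothetical): two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
groups, q = greedy_agglomerative(edges)
print(groups, q)
```

Swapping `modularity` for another utility function leaves the search loop unchanged, which is the sense in which the two components are independent.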


Utility functions in community detection

Other search strategies used with modularity

Rattigan, Maier, Jensen (2007)

Utility function: modularity

Search strategy: greedy divisive hierarchical clustering using a Network Structured Index to approximate edge betweenness

Donetti & Munoz (2004)

Utility function: modularity

Search strategy: greedy agglomerative hierarchical clustering with spectral division


Utility functions in community detection

Statistical Approaches

Zhang, Qiu, Giles, Foley, & Yen (2007)

Utility function: log-likelihood (LDA parameters)

Search strategy: fixed-point iteration

Compression-Based Approaches

Rosvall & Bergstrom (2007)

Utility function: Minimum Description Length

Search strategy: simulated annealing

Chakrabarti (2004)

Utility function: Minimum Description Length

Search strategy: exhaustive search for k, hill-climbing given k

Utility function implicit in search strategy

– Raghavan, Albert, & Kumara (2007): marker passing
– Cliques, cores, etc.


Modularity

$$\sum_{i=1}^{m} \left[ \frac{w(D_{ii})}{l} - \left( \frac{l_i}{l} \right)^2 \right]$$

– w(D_ii) = number of edges internal to group i
– l_i = number of edges incident to vertices in group i
– l = total number of edges

Intuitive – expresses intuition that the ratio of internal to external edges is greater for groups than for non-groups

Popular

Imperfect

– Fortunato & Barthelemy (2007): resolution limit; groups conflated if number of vertices less than $\sqrt{2l}$
– Rosvall & Bergstrom (2007): biased towards same-sized groups
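As a rough illustration of the modularity definition on this slide, the sketch below scores a partition of a small undirected graph. It uses the standard Newman-Girvan normalization, in which a group's share of edges is its summed vertex degree divided by 2l; reading the slide's l_i/l term this way is an assumption, as are the function name and the toy graph.

```python
from collections import defaultdict


def modularity(edges, group_of):
    """Modularity of a partition of an undirected, unweighted graph.

    edges    -- list of (u, v) pairs, each undirected edge listed once
    group_of -- dict mapping every vertex to its group label
    """
    l = len(edges)                   # total number of edges
    internal = defaultdict(int)      # w(D_ii): edges inside group i
    degree_sum = defaultdict(int)    # summed vertex degrees of group i
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        degree_sum[gu] += 1
        degree_sum[gv] += 1
        if gu == gv:
            internal[gu] += 1
    # Assumption: a group's expected edge share is degree_sum / (2 * l).
    return sum(internal[g] / l - (degree_sum[g] / (2 * l)) ** 2
               for g in degree_sum)


# Toy graph (hypothetical): two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "all" for v in range(6)}
print(modularity(edges, two_groups))   # one group per triangle
print(modularity(edges, one_group))    # everything in a single group
```

Placing all vertices in one group scores zero, while splitting at the bridge scores higher, which matches the intuition that groups have proportionally more internal edges than non-groups.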


Resolution Limit

Ring graph R_{15,4}

15 communities

4 nodes per community

Community structure that maximizes modularity conflates groups
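The conflation can be checked numerically with a small sketch. It assumes that each community in R_{15,4} is a clique and that neighbouring communities are joined by a single edge (the ring-of-cliques construction; the slide does not spell this out), and it uses the standard degree-based form of modularity. All function names are illustrative.

```python
def ring_of_cliques(m, c):
    """Edges of m cliques of c vertices each, joined in a ring by single edges."""
    edges = []
    for k in range(m):
        base = k * c
        # All edges inside clique k.
        for i in range(c):
            for j in range(i + 1, c):
                edges.append((base + i, base + j))
        # One edge from the last vertex of clique k to the first vertex of the next.
        edges.append((base + c - 1, ((k + 1) % m) * c))
    return edges


def modularity(edges, group_of):
    """Standard modularity; each undirected edge is listed once."""
    l = len(edges)
    internal, degree_sum = {}, {}
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        degree_sum[gu] = degree_sum.get(gu, 0) + 1
        degree_sum[gv] = degree_sum.get(gv, 0) + 1
        if gu == gv:
            internal[gu] = internal.get(gu, 0) + 1
    return sum(internal.get(g, 0) / l - (degree_sum[g] / (2 * l)) ** 2
               for g in degree_sum)


m, c = 15, 4
edges = ring_of_cliques(m, c)
# One group per clique (the intended community structure).
true_groups = {v: v // c for v in range(m * c)}
# Neighbouring cliques merged two at a time; the last clique stays alone.
merged_groups = {v: (v // c) // 2 for v in range(m * c)}
q_true = modularity(edges, true_groups)
q_merged = modularity(edges, merged_groups)
print(f"one group per clique:    {q_true:.4f}")
print(f"cliques merged in pairs: {q_merged:.4f}")
```

Under these assumptions the pair-merged partition scores slightly higher than the one-group-per-clique partition, so a search that maximizes modularity prefers the conflated structure.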


Approaches to modularity’s resolution limit

Apply recursively to large communities (Ruan & Zhang 2007)

Apply locally (Clauset 2005)

Choose a different utility function


Description Length

Utility of community structure is sum of bits needed to represent

– Community structure +
– Graph given community structure

Search strategy attempts to minimize description length

There is no unique bit count

Undecidability of Kolmogorov complexity

Previous approaches

Rosvall & Bergstrom (2007): RB

Handles group size skew better than modularity

Chakrabarti (2004): AP

Comparison

– Similar breakdown of bits
– Different calculation


Components of Description

Components (details in paper)

1. Bits to represent number of nodes in graph (ignored because not specific to community structure)
2. Bits to represent number of groups
3. Bits to represent mapping between nodes and groups
4. Bits needed for number of group-to-group edges
5. Bits needed for adjacencies between nodes

Purpose

– 2, 3, 4: represent group structure
– 1, 5: represent graph as a whole


Surprising Experimental Result

RB, AP, and modularity compared as utility functions

Applied to ring graphs R_{m,c} for 4 ≤ m ≤ 16 and 3 ≤ c ≤ 9

Search strategy: greedy divisive hierarchical clustering (iteratively remove highest betweenness edge)

Unsurprising result: modularity led to conflated groups for

– m > 8 and c = 3
– m > 10 and c = 4
– m > 11 and c = 5
– m > 13 and c = 6, 7

Surprising result: both RB and AP conflated at least one pair of groups in every R_{m,c}!


Hypothesis

Both RB and AP require at least one bit per pair of groups in term 4

Perhaps this estimation causes group conflation

– Term 4 grows as the square of the number of groups
– If graph is sparse, conflating groups may save more in term 4 reduction than it costs in term 5 increase

Components

1. Bits to represent number of nodes in graph (ignored because not specific to community structure)
2. Bits to represent number of groups
3. Bits to represent mapping between nodes and groups
4. Bits needed for number of group-to-group edges
5. Bits needed for adjacencies between nodes


SGE (Sparse Graph Encoding)

Components

1. Bits to represent number of nodes in graph
   – Ignored, as in RB and AP
2. Bits to represent number of groups
   – Follows RB
3. Bits to represent mapping between nodes and groups
   – Similar to AP
4. Bits needed for number of group-to-group edges
   – Split into 2 terms
     - Which pairs of groups are connected (much less than one bit per pair if pairs sparsely or densely connected)
     - Number of edges between connected groups
   – Grows as number of connected pairs, not total number of pairs
5. Bits needed for adjacencies between nodes
   – Follows RB
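To see why naming the connected pairs can cost much less than one bit per pair, consider the following sketch. It is not the SGE formula from the paper: it uses a generic enumerative code (first the number of connected pairs, then which subset of that size), counts only pairs of distinct groups, and all function names are illustrative assumptions.

```python
from math import comb, log2


def bits_one_per_pair(num_groups):
    """Cost if every unordered pair of distinct groups is charged one bit."""
    return num_groups * (num_groups - 1) // 2


def bits_connected_subset(num_groups, num_connected_pairs):
    """Bits to state which pairs of groups are connected.

    Enumerative code: encode how many pairs are connected, then which
    subset of that size it is among all possible subsets.
    """
    pairs = num_groups * (num_groups - 1) // 2
    count_bits = log2(pairs + 1)                      # count lies in 0..pairs
    subset_bits = log2(comb(pairs, num_connected_pairs))
    return count_bits + subset_bits


m = 15  # ring graph with 15 groups: each group touches only its two neighbours
print(bits_one_per_pair(m))              # grows as the square of m
print(bits_connected_subset(m, m))       # sparse case: m connected pairs
print(bits_connected_subset(m, 105))     # dense case: every pair connected
```

Both the sparse and the fully connected case come out well below one bit per pair, because the subset count is small when almost none or almost all of the pairs are connected.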


Performance of SGE on Ring Graphs

Correct community structure found for every R_{m,c} for 4 ≤ m ≤ 16 and 3 ≤ c ≤ 9 except

– R_{4,3}
– R_{13,3}

Results confirm hypothesis that resolution limit in RB and AP is result of over-counting term 4: the bits needed for group-to-group edges

Significance

– Ring graphs rare in real world
– How does SGE compare on more realistic graphs?


Uniform random graph


Similar to graphs in Rosvall & Bergstrom (2007)

Test set

– 32 vertices
– 4 groups
– average degree 6
– size ratio {1.0, 1.25, 1.5, 1.75, 2.0}
– proportion internal edges {0.6, 0.75, 0.9}

Example:

– 32 vertices
– 4 groups
– average degree 6
– size ratio 1.25
– proportion internal edges 0.67
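A test graph of this kind could be generated along the following lines. This is a sketch under stated assumptions rather than the generator used in the experiments: the slide does not define how "size ratio" maps to group sizes, so the group sizes are passed in explicitly, and the edge counts are fixed exactly instead of being drawn at random. The function name and parameters are illustrative.

```python
import random
from itertools import combinations


def planted_graph(group_sizes, avg_degree, internal_fraction, seed=0):
    """Random graph with a fixed number of within-group and between-group edges."""
    rng = random.Random(seed)
    group_of = {}
    for g, size in enumerate(group_sizes):
        for _ in range(size):
            group_of[len(group_of)] = g          # vertices are numbered 0..n-1
    n = len(group_of)
    num_edges = round(n * avg_degree / 2)
    num_internal = round(num_edges * internal_fraction)
    within, between = [], []
    for u, v in combinations(range(n), 2):
        (within if group_of[u] == group_of[v] else between).append((u, v))
    # rng.sample raises ValueError if more edges are requested than pairs exist.
    edges = (rng.sample(within, num_internal)
             + rng.sample(between, num_edges - num_internal))
    return edges, group_of


# 32 vertices, 4 equal groups, average degree 6, 75% internal edges.
edges, group_of = planted_graph([8, 8, 8, 8], avg_degree=6, internal_fraction=0.75)
print(len(edges))
```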

Embedded Barabasi-Albert Graphs

Test set

4 communities separately generated by preferential attachment

In each community

4 initial vertices

2-4 edges added per time step

20 time steps

Example

– 4 communities
– 4 initial vertices
– 3 edges added per time step
– 20 time steps
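A single community of this kind can be grown with a short preferential-attachment sketch. Several details are assumptions, since the slide does not give them: the initial vertices start as a clique, each time step adds one new vertex, and its edges go to distinct existing vertices. How the separately generated communities are then embedded in one graph is not stated here and is not modelled.

```python
import random


def preferential_attachment(initial_vertices, edges_per_step, steps, seed=0):
    """Grow one community; returns a list of undirected edges (u, v) with u < v."""
    rng = random.Random(seed)
    # Assumption: the initial vertices form a clique, so all degrees are nonzero.
    edges = [(i, j) for i in range(initial_vertices)
             for j in range(i + 1, initial_vertices)]
    # Each vertex appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` picks a vertex with probability proportional to its degree.
    stubs = [v for e in edges for v in e]
    for step in range(steps):
        new = initial_vertices + step
        targets = set()
        while len(targets) < edges_per_step:     # keep drawing until distinct
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.append((t, new))
            stubs.extend((t, new))
    return edges


# 4 initial vertices, 3 edges added per time step, 20 time steps.
community = preferential_attachment(4, 3, 20)
print(len(community))
```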


Evaluation Criteria

Rand index (Rand 1971)

Adjusted Rand index (Hubert & Arabie 1985)

F-measure – based on same-cluster pairs

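$$\text{Recall} = \frac{|\text{proposedPairs} \cap \text{actualPairs}|}{|\text{actualPairs}|}$$

$$\text{Precision} = \frac{|\text{proposedPairs} \cap \text{actualPairs}|}{|\text{proposedPairs}|}$$

$$\text{F-measure} = \frac{2 \cdot \text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}}$$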
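The pair-based criteria can be sketched as follows, treating actualPairs and proposedPairs as the sets of vertex pairs placed in the same cluster by the true and the proposed partition. The function names and the four-vertex example are illustrative assumptions; the unadjusted Rand index is included for comparison.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Set of index pairs (i, j), i < j, that share a cluster label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


def pairwise_scores(actual, proposed):
    """Recall, precision and F-measure over same-cluster pairs."""
    actual_pairs = same_cluster_pairs(actual)
    proposed_pairs = same_cluster_pairs(proposed)
    shared = len(actual_pairs & proposed_pairs)
    recall = shared / len(actual_pairs) if actual_pairs else 0.0
    precision = shared / len(proposed_pairs) if proposed_pairs else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


def rand_index(actual, proposed):
    """Fraction of vertex pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(actual)), 2))
    agree = sum((actual[i] == actual[j]) == (proposed[i] == proposed[j])
                for i, j in pairs)
    return agree / len(pairs)


actual = [0, 0, 1, 1]      # true clusters: {0, 1} and {2, 3}
proposed = [0, 0, 0, 1]    # proposed clusters: {0, 1, 2} and {3}
recall, precision, f_measure = pairwise_scores(actual, proposed)
print(recall, precision, f_measure, rand_index(actual, proposed))
```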


Results: Uniform random graph


Results: Embedded Barabasi-Albert


Summary of Evaluation

Random graphs

– Community structure is weak
  - Group sizes are balanced: modularity is best
  - Group sizes are imbalanced: RB is best (as per Rosvall & Bergstrom 2007)
– Community structure is strong
  - Group sizes are balanced: not much difference
  - Group sizes are imbalanced: modularity is particularly bad (as per Rosvall & Bergstrom 2007), SGE slightly better than RB and AP

EBA graphs

– Sparse: AP and SGE weaker than modularity and RB
– Dense: essentially identical accuracy


Conclusion

Narrow

– Conflation of groups by MDL in sparse graphs (e.g., ring graphs) can be avoided by adjusting group-to-group edge counts.
– This change doesn’t hurt performance in more common types of graphs.

Compression-based clustering works well, but requires tinkering

Modularity detects weak structure well when graph not too big and groups not too imbalanced

Broad

– Still unclear what utility function is best overall
– Needed: theory relating graph typology to utility functions
