Approximating the Community Structure of the Long Tail Problem Statement

Download Report

Transcript Approximating the Community Structure of the Long Tail Problem Statement

Approximating the Community
Structure of the Long Tail
Akshay Java, Anupam Joshi, Tim Finin
Problem Statement
Community Structure of the Long Tail
Original Adjacency Matrix
“Approximate the membership of the blogs in the long tail
using only a small portion of the entire graph”
Motivation
It is expensive to compute the
community structure for large
networks.
Blog graphs are large, but
extremely sparse and exhibit the
long tail/ core-periphery structure.
Head
Long Tail
CORE
PERIPHERY
Intuition
Communities emerge around the core (A). The
membership of the blogs in the long tail (B) can be
approximated by the structure of the core and how the
long tail links to the core.
Detecting Communities
Approximation Methods
A community is a set of nodes that has more connections
within the set than outside it.
Normalized Cut
 1
1 
NCut(A,B)  Cut(A,B)


Vol(A) Vol(B)
L  DW  I D

2
*W * D

Singular Value Decomposition
W  U r r VrT
(reduced rank)
Sampling-based Technique
By Nyström Approximation (Fowlkes et al.)
 A B   A
L   T   T
B C  B

The graph can be partitioned using the eigenspectrum of
the Laplacian. (Shi and Malik)
1
Sorted (degree) Adjacency Matrix
1

2
A  UU T
B   U 
T 1 T
 
U A U B
B T A 1B  B TU1 
Thus eigenvectors of L can be easily
approximated by sampling.
Heuristic Method
The second smallest eigenvector of the graph Laplacian

is the Fiedler vector.
Use the head of the distribution to find the
initial communities.
The graph can be recursively
partitioned using the sign of
the values in its Fielder vector.
Approximate membership of the tail by
using number of links to each community
in the head.
QuickTime™ and a
decompressor
are needed to see this picture.
Evaluation
Conclusions and Ongoing Research
Conclusion
Community structure of the entire network can be well
approximated by sampling and heuristic methods.
Advantage
Achieve fast, accurate approximation using much smaller
sample of the entire graph.
Modularity Score
Ongoing Research
A measure of the quality of community detection
Q   eii  a
i
2
i
Q = (fraction of intra-community edges)(expected fraction, disregarding communities)
Applying the approximation on temporal graphs
Clustering graph, tag information simultaneously