Discovering Overlapping Groups in Social Media


Discovering Overlapping Groups
in Social Media
Xufei Wang, Lei Tang, Huiji Gao, and
Huan Liu
[email protected]
Arizona State University
Social Media
• Facebook
– 500 million active users
– 50% of users log on to Facebook every day
• Twitter
– 100 million users
– 300,000 new users every day
– 55 million tweets every day
• Flickr
– 12 million members
– 5 billion photos
Activities in Social Media
• Connect with others to form “Friends”
• Interact with others (comment, discussion, messaging)
• Bookmark websites/URLs (StumbleUpon, Delicious)
• Join groups where they explicitly exist (Flickr, YouTube)
• Write blogs (WordPress, Myspace)
• Update status (Twitter, Facebook)
• Share content (Flickr, YouTube, Delicious)
Community Structure
• Studying behavior
– Individual level? Too many users
– Site level? Loses too much detail
– Community level? Yes: it provides information at varying granularity
Overlapping Communities
[Figure: overlapping communities labeled Neighbors, Colleagues, and Family]
Related Work
• Disjoint Community Detection
– Modularity maximization
– Based on link structure (how can the resulting groups be interpreted?)
• Overlapping Community Detection
– Soft clustering (cluster memberships are dense)
– CFinder (efficiency and scalability issues)
• Co-clustering
– Produces disjoint clusters
– Groups can be understood through words (tags)
Problem Statement
• Given a User-Tag subscription matrix M and the number of clusters k, find k overlapping
communities that consist of both users and tags.
[Figure: bipartite user-tag graph with users u1 to u5 and tags t1 to t4]
Our Contributions
• Extracting overlapping communities that better
reflect reality
• Clustering on a user-tag graph. Tags are
informative in identifying user interests
• Understanding groups by looking at tags within
each group
Edge-centric View
• Cluster edges instead of nodes into disjoint
groups
– One node can belong to multiple groups
– One edge belongs to one group
[Figure: edge clustering on the user-tag graph with users u1 to u5 and tags t1 to t4]
Edge-centric View
• In an edge-centric view, each edge is a row with a 1 in the columns of its user and its tag

edge  u1  u2  u3  u4  u5  t1  t2  t3  t4
e1     1   0   0   0   0   1   0   0   0
e2     1   0   0   0   0   0   1   0   0
e3     0   1   0   0   0   1   0   0   0
e4     0   1   0   0   0   0   1   0   0
e5     0   0   1   0   0   0   1   0   0
e6     0   0   1   0   0   0   0   1   0
e7     0   0   0   1   0   0   0   1   0
e8     0   0   0   1   0   0   0   0   1
e9     0   0   0   0   1   0   0   1   0
e10    0   0   0   0   1   0   0   0   1
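The edge rows e1 to e10 above can be generated mechanically from the list of user-tag links. The following is a minimal sketch; the names `links`, `users`, `tags`, and `edge_features` are chosen here for illustration and do not come from the slides.

```python
# Toy data: the ten user-tag links e1..e10 of the example graph.
users = ["u1", "u2", "u3", "u4", "u5"]
tags = ["t1", "t2", "t3", "t4"]
links = [("u1", "t1"), ("u1", "t2"), ("u2", "t1"), ("u2", "t2"), ("u3", "t2"),
         ("u3", "t3"), ("u4", "t3"), ("u4", "t4"), ("u5", "t3"), ("u5", "t4")]


def edge_features(links, users, tags):
    """Return one row per edge with a 1 in the column of its user and of its tag."""
    column = {name: i for i, name in enumerate(users + tags)}
    rows = []
    for user, tag in links:
        row = [0] * len(column)
        row[column[user]] = 1
        row[column[tag]] = 1
        rows.append(row)
    return rows


X = edge_features(links, users, tags)
for name, row in zip(("e%d" % i for i in range(1, 11)), X):
    print(name, row)
```

Every row has exactly two non-zero entries, so the representation stays sparse even when the number of users and tags is large.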
Clustering Edges
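• Any clustering algorithm (e.g., k-means) can be used to group similar edges together by
maximizing the similarity S_c between each edge x_j and the centroid c_i of its cluster C_i
\[ \arg\max_{C} \frac{1}{k} \sum_{i=1}^{k} \sum_{x_j \in C_i} S_c(x_j, c_i) \]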
• Different similarity schemes
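As a rough illustration of edge clustering, the sketch below runs a similarity-based k-means on the ten edges of the toy graph and then reads off overlapping node memberships. It assumes cosine similarity as S_c and fixed, hypothetical seed edges (e1 and e10); neither choice is prescribed by the slides, and a different seed can give a different partition.

```python
import math

# Toy data: the ten user-tag links e1..e10 of the example graph.
links = [("u1", "t1"), ("u1", "t2"), ("u2", "t1"), ("u2", "t2"), ("u3", "t2"),
         ("u3", "t3"), ("u4", "t3"), ("u4", "t4"), ("u5", "t3"), ("u5", "t4")]
nodes = sorted({n for link in links for n in link})
# Edge-centric representation: one indicator vector per edge.
X = [[1.0 if n in link else 0.0 for n in nodes] for link in links]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_edges(X, seeds, max_iter=20):
    """Assign each edge to its most similar centroid, then recompute centroids."""
    centroids = [list(X[s]) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(len(centroids)),
                          key=lambda i: cosine(x, centroids[i])) for x in X]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        for i in range(len(centroids)):
            members = [x for x, lab in zip(X, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


labels = cluster_edges(X, seeds=[0, 9])  # hypothetical seeds: e1 and e10

# Each edge is in one cluster; a node joins every cluster of its edges.
membership = {}
for (user, tag), label in zip(links, labels):
    membership.setdefault(user, set()).add(label)
    membership.setdefault(tag, set()).add(label)
print(labels)
print(membership)
```

With these seeds the edges split into e1 to e6 and e7 to e10, and tag t3 ends up in both groups because its edges fall on both sides.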
Defining Edge Similarity
• Similarity between two edges e = (u_i, t_p) and e' = (u_j, t_q) can be defined, for
example, as
\[ S_e(e, e') = \alpha\, S_u(u_i, u_j) + (1 - \alpha)\, S_t(t_p, t_q) \]
• α is set to 0.5, which gives equal importance to users and tags
• It remains to define the user-user similarity S_u and the tag-tag similarity S_t
Independent Learning
• Assume users are independent, tags are
independent
\[ S_e(e, e') = \frac{1}{2}\bigl(\delta(u_i, u_j) + \delta(t_p, t_q)\bigr) \]
\[ \delta(m, n) = \begin{cases} 1, & m = n \\ 0, & m \neq n \end{cases} \]
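The weighted edge similarity and its independent-learning special case can be sketched as follows. The function names and the tuple representation of an edge as `(user, tag)` are illustrative choices, not taken from the slides.

```python
def delta(m, n):
    """Indicator similarity: 1 if the two nodes are identical, else 0."""
    return 1.0 if m == n else 0.0


def edge_similarity(e1, e2, alpha=0.5, user_sim=delta, tag_sim=delta):
    """alpha * S_u(u_i, u_j) + (1 - alpha) * S_t(t_p, t_q) for two (user, tag) edges."""
    (ui, tp), (uj, tq) = e1, e2
    return alpha * user_sim(ui, uj) + (1 - alpha) * tag_sim(tp, tq)


print(edge_similarity(("u1", "t1"), ("u1", "t2")))  # same user, different tag
print(edge_similarity(("u1", "t1"), ("u2", "t1")))  # same tag, different user
print(edge_similarity(("u1", "t1"), ("u5", "t4")))  # nothing shared
```

Passing other functions as `user_sim` and `tag_sim` swaps in a different similarity scheme without changing the combination rule.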
Normalized Learning
• Differentiate nodes with varying degrees by
normalizing each node with its nodal degree
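\[ e(u_i, t_p) = \left(0, \ldots, 0, \frac{1}{d_{u_i}}, 0, \ldots, 0, \frac{1}{d_{t_p}}, 0, \ldots, 0\right) \]
\[ S_e(e, e') = \frac{d_{t_p} d_{t_q}\, \delta(u_i, u_j) + d_{u_i} d_{u_j}\, \delta(t_p, t_q)}{\sqrt{d_{u_i}^2 + d_{t_p}^2}\; \sqrt{d_{u_j}^2 + d_{t_q}^2}} \]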
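A small sketch of the degree-normalized similarity on the toy graph follows. It assumes that the similarity is the cosine between the degree-normalized edge vectors, which is how the closed form is read here; the function and variable names are illustrative.

```python
import math

# Toy data: the ten user-tag links e1..e10 of the example graph.
links = [("u1", "t1"), ("u1", "t2"), ("u2", "t1"), ("u2", "t2"), ("u3", "t2"),
         ("u3", "t3"), ("u4", "t3"), ("u4", "t4"), ("u5", "t3"), ("u5", "t4")]

# Nodal degree: number of links each user or tag takes part in.
deg = {}
for user, tag in links:
    deg[user] = deg.get(user, 0) + 1
    deg[tag] = deg.get(tag, 0) + 1


def delta(m, n):
    return 1.0 if m == n else 0.0


def normalized_similarity(e1, e2, deg):
    """Closed form of the cosine between degree-normalized edge vectors."""
    (ui, tp), (uj, tq) = e1, e2
    num = deg[tp] * deg[tq] * delta(ui, uj) + deg[ui] * deg[uj] * delta(tp, tq)
    den = math.sqrt(deg[ui] ** 2 + deg[tp] ** 2) * math.sqrt(deg[uj] ** 2 + deg[tq] ** 2)
    return num / den


# Sharing the less popular tag t1 (degree 2) counts more than sharing t2 (degree 3).
print(normalized_similarity(("u1", "t1"), ("u2", "t1"), deg))
print(normalized_similarity(("u1", "t2"), ("u2", "t2"), deg))
```

Under independent learning both pairs would score 0.5; the normalization lowers the score of the pair that only shares the higher-degree tag.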
Correlational Learning
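• Tags can be semantically close
– The tags cars, automobile, autos, and car reviews are all used to
describe a blog written by sid0722 on BlogCatalog
[Figure: the u × t user-tag matrix mapped to a u × k latent representation]
• Compute user-user and tag-tag cosine
similarity in the latent space
\[ S_e(e, e') = \frac{1}{2}\left(\frac{\tilde{u}_i \cdot \tilde{u}_j}{\|\tilde{u}_i\|\,\|\tilde{u}_j\|} + \frac{\tilde{t}_p \cdot \tilde{t}_q}{\|\tilde{t}_p\|\,\|\tilde{t}_q\|}\right) \]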
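The slides do not spell out how the latent vectors with tildes are obtained. One common choice, assumed here purely for illustration, is a rank-k truncated SVD of the user-tag matrix M; the symbols U_k, Σ_k, V_k, n_u, and n_t are introduced for this sketch only.

```latex
\begin{align*}
M &\approx U_k \Sigma_k V_k^{T}, \qquad
  U_k \in \mathbb{R}^{n_u \times k},\; V_k \in \mathbb{R}^{n_t \times k} \\
\tilde{u}_i &= \text{row } i \text{ of } U_k \Sigma_k = M V_k, \qquad
\tilde{t}_p = \text{row } p \text{ of } V_k \Sigma_k = M^{T} U_k
\end{align*}
```

Under this assumption, two tags that are never used by the same user can still have a high cosine similarity if they load on the same latent dimensions.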
Spectral Clustering Perspective
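• Graph partitioning can be solved as a generalized eigenvalue problem (taking the
eigenvectors with the smallest eigenvalues)
\[ L z = \lambda D z, \qquad L = D - W \]
\[ L = \begin{bmatrix} D_1 & -M \\ -M^T & D_2 \end{bmatrix}, \quad
   W = \begin{bmatrix} 0 & M \\ M^T & 0 \end{bmatrix}, \quad
   D = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}, \quad
   z = \begin{bmatrix} U \\ V \end{bmatrix} \]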
Spectral Clustering Perspective
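• Plugging in L and z, with the degree matrix diag(D_1, D_2) on the right-hand side, we obtain
\[ \begin{bmatrix} D_1 & -M \\ -M^T & D_2 \end{bmatrix} \begin{bmatrix} U \\ V \end{bmatrix}
   = \lambda \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix} \begin{bmatrix} U \\ V \end{bmatrix} \]
\[ M V = (1 - \lambda)\, D_1 U, \qquad M^T U = (1 - \lambda)\, D_2 V \]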
• U and V correspond to the left and right singular vectors for the k largest singular
values of the (degree-normalized) user-tag matrix M
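The step from the two coupled equations (M V = (1 − λ) D_1 U and M^T U = (1 − λ) D_2 V) to a singular value decomposition can be made explicit. The following sketches the standard argument from bipartite spectral co-clustering (Dhillon, KDD '01), assuming D_1 and D_2 are nonsingular; the rescaled variables X and Y and the normalized matrix \hat{M} are notation introduced here, not on the slides.

```latex
\begin{align*}
X &= D_1^{1/2} U, \qquad Y = D_2^{1/2} V, \qquad
  \hat{M} = D_1^{-1/2} M D_2^{-1/2} \\
\hat{M}\, Y &= (1 - \lambda)\, X, \qquad
  \hat{M}^{T} X = (1 - \lambda)\, Y
\end{align*}
```

These are the defining equations of left and right singular vectors of \hat{M} with singular value σ = 1 − λ, so the smallest eigenvalues λ correspond to the largest singular values, and U = D_1^{-1/2} X and V = D_2^{-1/2} Y recover the user and tag embeddings.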
Synthetic Data Sets
• Synthetic data sets
– Number of clusters, users, and tags
– Inner-cluster density and inter-cluster density (1%
of total user-tag links)
– Evaluation metric: Normalized Mutual Information (NMI)
• Between 0 and 1
• The higher, the better
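For reference, a minimal sketch of Normalized Mutual Information between two flat labelings is given below. It assumes the common normalization by the geometric mean of the two entropies; the slides do not state which NMI variant was used, and the function name `nmi` is illustrative.

```python
import math
from collections import Counter


def nmi(a, b):
    """Normalized mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mi = sum(c / n * math.log(c * n / (count_a[x] * count_b[y]))
             for (x, y), c in count_ab.items())
    if h_a == 0 or h_b == 0:  # a labeling with a single cluster has zero entropy
        return 1.0 if h_a == h_b else 0.0
    return mi / math.sqrt(h_a * h_b)


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed clusters
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent partitions
```

The score depends only on how items are grouped, not on the cluster ids, which is why a relabeled but identical partition still scores 1.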
Synthetic Performance
• We fix the number of users, tags, and density,
but vary the number of clusters
Synthetic Performance
• We fix the number of users, tags, and
clusters, but vary the inner-cluster density
Social Media Data Sets
• BlogCatalog
– Tags describing each blog
– Category predefined by BlogCatalog for each blog
• Delicious
– Tags describing each bookmark
– Select the top 10 most frequently used tags for
each person
Inferring Personal Interests
• Category information reveals personal
interests; group affiliations are used as features to
infer personal interests, evaluated via cross-validation
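Turning group affiliations into features can be sketched as below: each user gets a binary vector with one entry per cluster. The edge labels used here are hypothetical and chosen so that one user overlaps two groups; the classifier and cross-validation steps are left out.

```python
# Toy data: the ten user-tag links e1..e10 and a hypothetical edge clustering.
links = [("u1", "t1"), ("u1", "t2"), ("u2", "t1"), ("u2", "t2"), ("u3", "t2"),
         ("u3", "t3"), ("u4", "t3"), ("u4", "t4"), ("u5", "t3"), ("u5", "t4")]
edge_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # hypothetical cluster id per edge


def affiliation_features(links, edge_labels, k):
    """One row per user: entry i is 1 if any of the user's edges is in cluster i."""
    features = {}
    for (user, _tag), label in zip(links, edge_labels):
        features.setdefault(user, [0] * k)[label] = 1
    return features


features = affiliation_features(links, edge_labels, k=2)
print(features)
```

These vectors can then be fed to any standard classifier to predict a user's category labels.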
Connectivity Study
• We study the correlation between the number of co-occurrences of two users in different
affiliations and their connectivity in the real network.
• The more often two users co-occur, the
more likely they are to be connected
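The connectivity study can be sketched as counting shared affiliations for every user pair and then measuring the fraction of connected pairs at each count. The affiliations and friendships below are made-up toy inputs, and the function names are illustrative.

```python
from itertools import combinations


def co_occurrence(affiliations):
    """Map each user pair to the number of groups both users belong to."""
    return {(a, b): len(affiliations[a] & affiliations[b])
            for a, b in combinations(sorted(affiliations), 2)}


def link_rate_by_co_occurrence(affiliations, friendships):
    """Fraction of connected pairs among all pairs with a given co-occurrence."""
    stats = {}  # co-occurrence count -> [connected pairs, all pairs]
    for pair, count in co_occurrence(affiliations).items():
        entry = stats.setdefault(count, [0, 0])
        entry[0] += pair in friendships
        entry[1] += 1
    return {count: connected / total for count, (connected, total) in stats.items()}


# Hypothetical toy inputs: group ids per user and a set of friendship links.
affiliations = {"a": {0, 1}, "b": {0, 1}, "c": {1}, "d": {2}}
friendships = {("a", "b"), ("a", "c")}  # pairs stored in sorted order
rates = link_rate_by_co_occurrence(affiliations, friendships)
print(rates)
```

On real data the same tabulation would show whether the link rate rises with the number of shared affiliations.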
Understanding Groups via Tag Cloud
• Tag cloud for Category Health
Understanding Groups via Tag Cloud
• Tag cloud for Cluster Health
Understanding Groups via Tag Cloud
• Tag cloud for Cluster Nutrition
Conclusions and Future Work
• Overlapping communities on a User-Tag graph
• Propose an edge-centric view and define edge
similarity
– Independent Learning
– Normalized Learning
– Correlational Learning
• Evaluate results on synthetic and real data sets
• Future work: applications such as link prediction, and scalability
References
• I. S. Dhillon, “Co-clustering documents and words using bipartite spectral graph partitioning,”
in KDD ’01, NY, USA.
• L. Tang and H. Liu, “Scalable learning of collective behavior based on sparse social
dimensions,” in CIKM ’09, NY, USA.
• L. Tang and H. Liu, “Community Detection and Mining in Social Media,” Morgan & Claypool
Publishers, Synthesis Lectures on Data Mining and Knowledge Discovery, 2010.
• G. Palla, I. Derényi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure
of complex networks in nature and society,” Nature, vol. 435, no. 7043, p. 814, 2005.
• K. Yu, S. Yu, and V. Tresp, “Soft clustering on graphs,” in NIPS, 2005.
• U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4,
pp. 395–416, 2007.
• M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,”
Phys. Rev. E, vol. 69, no. 2, p. 026113, Feb. 2004.
• S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5,
pp. 75–174, 2010.
Contact the Authors
• Xufei Wang
– [email protected]
– Arizona State University
• Lei Tang
– [email protected]
– Yahoo! Labs