Transcript slides - Minas Gjoka
Multigraph Sampling of Online Social Networks
Minas Gjoka, Carter Butts, Maciej Kurant, Athina Markopoulou Multigraph sampling 1
Outline
• Multigraph sampling
  – Motivation
  – Sampling method
  – Internet Measurements
  – Conclusion
Problem statement
• Obtain a representative sample of OSN users by exploration of the social graph.
[Figure: example social graph with nodes A to I]
Motivation for multiple relations
• Principled methods for graph sampling
  – Metropolis-Hastings Random Walk
  – Re-weighted Random Walk
  “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs,” INFOCOM ‘10
• But graph characteristics affect mixing and convergence
  – fragmented social graph
  – highly clustered areas
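As a reminder of what the first method does, here is a minimal sketch of a Metropolis-Hastings random walk that targets the uniform distribution over nodes. It assumes an undirected graph stored as an adjacency dictionary; the function name `mhrw` and the toy graph are illustrative and are not taken from the slides.

```python
import random

def mhrw(adj, start, steps, rng):
    """Metropolis-Hastings random walk with a uniform target over nodes.

    adj maps each node to the list of its neighbors (undirected graph,
    every node has at least one neighbor).
    """
    current = start
    samples = []
    for _ in range(steps):
        candidate = rng.choice(adj[current])
        # Accept with probability min(1, deg(current) / deg(candidate));
        # otherwise stay at the current node for this step.
        if rng.random() < len(adj[current]) / len(adj[candidate]):
            current = candidate
        samples.append(current)
    return samples

# Hypothetical toy graph, used only to exercise the function.
toy = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A", "D"], "D": ["A", "C"]}
walk = mhrw(toy, "A", 1000, random.Random(0))
```

The rejection step is what removes the bias toward high-degree nodes, at the cost of repeated samples when a proposal is rejected.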
Fragmented social graph
[Figure: Largest Connected Component and Other Connected Components under Friendship, Event attendance, Group membership, and their Union]
Highly clustered social graph
[Figure: Friendship, Event attendance, and their Union]
Proposal
• Graph exploration using multiple user relations
  – perform random walk
  – re-weighting at the end of the walk
  – online convergence diagnostics applicable
• Theoretical benefits
  – faster mixing
  – discovery of isolated components
• Open questions
  – how to combine relations
  – implementation efficiency
  – evaluation of sampling benefits in a realistic scenario
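The "re-weighting at the end of the walk" step can be sketched as follows. A simple random walk visits nodes with probability proportional to their degree, so each sample is weighted by the inverse of its degree (a Hansen-Hurwitz style estimator). The function name `reweighted_mean` and the toy numbers are assumptions made for illustration; on a multigraph, the degree used here would be the node's total degree across all relations.

```python
def reweighted_mean(samples, degree, value):
    """Estimate the uniform population mean of `value` from degree-biased samples.

    samples: node ids collected by a simple random walk
    degree:  node id -> degree in the graph that was walked
    value:   node id -> numeric attribute to average
    """
    weighted_sum = sum(value[v] / degree[v] for v in samples)
    normalizer = sum(1.0 / degree[v] for v in samples)
    return weighted_sum / normalizer

# Hypothetical example: each node appears in proportion to its degree,
# which is what a long random walk would produce.
degree = {"A": 3, "B": 1, "C": 2, "D": 2}
value = {"A": 10, "B": 20, "C": 30, "D": 40}
samples = ["A"] * 3 + ["B"] + ["C"] * 2 + ["D"] * 2

naive = sum(value[v] for v in samples) / len(samples)
corrected = reweighted_mean(samples, degree, value)
```

In this toy case the plain average of the samples is 23.75, while the re-weighted estimate recovers the uniform mean of 25 (up to floating-point rounding).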
[Figure: nodes A to K shown under three relations: Friends, Events, and Groups]
[Figure: the same nodes and relations (Friends, Events, Groups), with nodes I and J repeated]
Combination of multiple relations
• G* = Friends + Events + Groups (G* is a union multigraph)
• G = Friends + Events + Groups (G is a union graph)
[Figure: nodes A to K with the edges of all three relations combined; annotation: deg(F, green) = 4]
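The difference between the union multigraph G* and the union graph G shows up in node degrees: G* keeps one edge per relation, while G collapses parallel edges into one. The sketch below illustrates this with hypothetical edge lists (they are not read off the slide's figure), and the helper names are illustrative.

```python
# Hypothetical per-relation edge lists around a node "F".
relations = {
    "Friends": [("F", "G")],
    "Events": [("F", "G"), ("F", "H"), ("F", "I"), ("F", "J")],
    "Groups": [("F", "H"), ("F", "I"), ("F", "K")],
}

def multigraph_degree(relations, node):
    """Degree in the union multigraph G*: parallel edges are counted separately."""
    return sum(
        1
        for edges in relations.values()
        for u, w in edges
        if node in (u, w)
    )

def union_graph_degree(relations, node):
    """Degree in the union graph G: each distinct neighbor is counted once."""
    neighbors = set()
    for edges in relations.values():
        for u, w in edges:
            if u == node:
                neighbors.add(w)
            elif w == node:
                neighbors.add(u)
    return len(neighbors)

deg_star = multigraph_degree(relations, "F")
deg_union = union_graph_degree(relations, "F")
```

With these edge lists, F has degree 8 in G* (1 + 4 + 3) but only 5 distinct neighbors in G.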
Implementation efficiency
• Probability of selecting each relation in the example:
  – p(Friends) = 1/8
  – p(Events) = 4/8
  – p(Groups) = 3/8
• Degree information available without enumeration
• Take advantage of pages functionality
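One way to read this slide is as a two-stage step: pick a relation with probability proportional to the node's degree in that relation (1/8, 4/8, 3/8 in the slide's example), then pick one neighbor uniformly within it, fetching only the page that contains that neighbor. The sketch below follows that reading. `PAGE_SIZE`, `fetch_page`, and the neighbor lists are hypothetical stand-ins for whatever the crawled site exposes.

```python
import random

PAGE_SIZE = 2  # hypothetical number of neighbors listed per page

# Hypothetical neighbor lists for the current node, one list per relation.
neighbors = {
    "Friends": ["G"],
    "Events": ["G", "H", "I", "J"],
    "Groups": ["H", "I", "K"],
}
# Per-relation degrees, assumed to be available without listing neighbors.
degrees = {relation: len(nodes) for relation, nodes in neighbors.items()}

def fetch_page(relation, page_index):
    """Stand-in for a paged request: returns one page of the neighbor list."""
    start = page_index * PAGE_SIZE
    return neighbors[relation][start:start + PAGE_SIZE]

def step(degrees, fetch_page, rng):
    """One random-walk step on the union multigraph.

    Stage 1: choose a relation with probability degree / total degree.
    Stage 2: choose a neighbor index uniformly and fetch only its page.
    """
    names = list(degrees)
    weights = [degrees[name] for name in names]
    relation = rng.choices(names, weights=weights, k=1)[0]
    index = rng.randrange(degrees[relation])
    page = fetch_page(relation, index // PAGE_SIZE)
    return relation, page[index % PAGE_SIZE]

relation, next_node = step(degrees, fetch_page, random.Random(0))
```

Each edge of the multigraph is chosen with equal probability, so the step is equivalent to a simple random walk on G* without ever enumerating the full neighbor list.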
Internet Measurements
• Last.fm, an Internet radio service
  – social networking features
  – multiple relations
  – fragmented graph components and highly clustered users expected
• Last.fm relations used
  – Friends
  – Groups
  – Events
  – Neighbors
Data Collection
• Crawling using Last.fm API and HTML scraping
• Sampled node information: userID, country, age, registration time, …
Summary of datasets (Last.fm, July 2010)
Crawl type                         # Total Users   % Unique Users
Friends                            5x50K           71%
Events                             5x50K           58%
Groups                             5x50K           74%
Neighbors                          5x50K           53%
Friends-Events-Groups-Neighbors    5x50K           76%
UNI                                500K            99%
Comparison to UNI
[Figure: % of Subscribers]
Last.fm Charts Estimation
• Application of sampling
[Figure: Artist Charts]
Related Work
• Fastest mixing Markov Chain
  – Boyd et al. - SIAM Review 2004
• Sampling in fragmented graphs
  – Ribeiro et al. - Frontier Sampling - IMC 2010
• Last.fm studies
  – Konstas et al. - SIGIR ‘09
  – Schifanella et al. - WSDM ‘10
Conclusion
• Introduced multigraph sampling
  – simple and efficient
  – discovers isolated components
  – better approximation of distributions and means
  – multigraph dataset planned for public release
• Future work on multigraph sampling
  – selection of relations
  – weighted relations
Thank you Questions?