slides - Minas Gjoka

Multigraph Sampling of Online Social Networks

Minas Gjoka, Carter Butts, Maciej Kurant, Athina Markopoulou Multigraph sampling 1

Outline
• Multigraph sampling
  – Motivation
  – Sampling method
  – Internet Measurements
  – Conclusion

Problem statement
• Obtain a representative sample of OSN users by exploration of the social graph.

[Figure: example social graph with nodes A through I]

Motivation for multiple relations
• Principled methods for graph sampling
  – Metropolis-Hastings Random Walk
  – Re-weighted Random Walk
  “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs,” INFOCOM ‘10
• But graph characteristics affect mixing and convergence
  – fragmented social graph
  – highly clustered areas

Fragmented social graph
[Figure: Largest Connected Component and Other Connected Components under Friendship, Event attendance, Group membership, and their Union]

Highly clustered social graph
[Figure: Friendship, Event attendance, and their Union]

Proposal
• Graph exploration using multiple user relations
  – perform random walk
  – re-weighting at the end of the walk
  – online convergence diagnostics applicable
• Theoretical benefits
  – faster mixing
  – discovery of isolated components
• Open questions
  – how to combine relations
  – implementation efficiency
  – evaluation of sampling benefits in a realistic scenario
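The "random walk, then re-weighting at the end" idea can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a simple random walk on an undirected graph, which visits nodes in proportion to their degree, and corrects for that with a ratio estimator that weights each sample by 1/degree. The toy graph `ADJ` and the function names are hypothetical.

```python
import random

# Hypothetical toy graph (undirected): A-B, A-C, B-C, C-D.
ADJ = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

def random_walk(adj, start, steps, rng):
    """Simple random walk: at each step move to a uniformly chosen neighbor."""
    node = start
    visited = []
    for _ in range(steps):
        node = rng.choice(adj[node])
        visited.append(node)
    return visited

def reweighted_mean(samples, degree, value):
    """Estimate the uniform mean of `value` from degree-biased samples.

    Each sample is weighted by 1/degree, which cancels the walk's
    bias toward high-degree nodes.
    """
    numerator = sum(value[v] / degree[v] for v in samples)
    denominator = sum(1.0 / degree[v] for v in samples)
    return numerator / denominator

if __name__ == "__main__":
    walk = random_walk(ADJ, "A", 10_000, random.Random(0))
    degree = {v: len(nbrs) for v, nbrs in ADJ.items()}
    # Estimate the average degree of the graph from the walk.
    print(reweighted_mean(walk, degree, degree))
```

On a multigraph, the same estimator would apply with `degree` taken as the node's degree in the multigraph, since that is the quantity the walk is biased by.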

[Figure: users A through K under three relations: Friends, Events, Groups]

[Figure: users A through K with the Friends, Events, and Groups relations shown together]

Combination of multiple relations
• G* = Friends + Events + Groups (G* is a union multigraph)
• G = Friends + Events + Groups (G is a union graph)

[Figure: users A through K in the union multigraph G* and the union graph G; annotation: deg(F, green) = 4]
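The difference between the union multigraph G* and the union graph G shows up in node degrees: G* keeps one edge per relation, so a pair connected in two relations counts twice, while G collapses them into a single edge. The sketch below illustrates this under assumptions: the edge lists in `RELATIONS` are made up (the edges on the slide are not recoverable), and the function names are hypothetical.

```python
# Hypothetical per-relation neighbor lists for a single user "F".
RELATIONS = {
    "Friends": {"F": ["G"]},
    "Events": {"F": ["G", "H", "I", "J"]},
    "Groups": {"F": ["H", "I", "K"]},
}

def union_multigraph_degree(relations, node):
    """Degree in G*: edges from different relations are all kept."""
    return sum(len(adj.get(node, [])) for adj in relations.values())

def union_graph_degree(relations, node):
    """Degree in G: neighbors shared across relations are counted once."""
    neighbors = set()
    for adj in relations.values():
        neighbors.update(adj.get(node, []))
    return len(neighbors)

if __name__ == "__main__":
    print(union_multigraph_degree(RELATIONS, "F"))  # 1 + 4 + 3 = 8
    print(union_graph_degree(RELATIONS, "F"))       # {G, H, I, J, K} -> 5
```

Computing the degree in G requires the full neighbor lists so duplicates can be removed, whereas the degree in G* needs only the per-relation counts.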

Implementation efficiency
• p(Friends) = 1/8
• p(Events) = 4/8
• p(Groups) = 3/8
• Degree information available without enumeration
• Take advantage of pages functionality

[Figure: node degrees d( ) and d*( )]
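The three probabilities on this slide sum to 1 and have per-relation counts 1, 4, and 3 over a total of 8, which reads like a two-stage step: first pick a relation in proportion to the node's degree in it, then pick a neighbor uniformly within that relation. The sketch below assumes that reading; it is not taken from the authors' implementation, and the data and function names are hypothetical.

```python
import random

# Hypothetical per-relation neighbor lists for user "F" (degrees 1, 4, 3).
RELATIONS = {
    "Friends": {"F": ["G"]},
    "Events": {"F": ["G", "H", "I", "J"]},
    "Groups": {"F": ["H", "I", "K"]},
}

def relation_probabilities(degrees):
    """Turn per-relation degrees into selection probabilities."""
    total = sum(degrees.values())
    return {name: d / total for name, d in degrees.items()}

def next_hop(relations, node, rng):
    """Two-stage step: choose a relation by degree, then a neighbor uniformly.

    Only the per-relation degree is needed for the first stage. Here it is
    len() of a list; a crawler could read the count from a page instead
    (an assumption). Raises ValueError if the node has no edges at all.
    """
    names = list(relations)
    weights = [len(relations[name].get(node, [])) for name in names]
    relation = rng.choices(names, weights=weights, k=1)[0]
    neighbor = rng.choice(relations[relation][node])
    return relation, neighbor

if __name__ == "__main__":
    degrees = {name: len(adj["F"]) for name, adj in RELATIONS.items()}
    print(relation_probabilities(degrees))
    print(next_hop(RELATIONS, "F", random.Random(0)))
```

Choosing a relation first and then a neighbor within it picks each edge of the multigraph with equal probability, without building the combined edge list.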

Internet Measurements
• Last.fm, an Internet radio service
  – social networking features
  – multiple relations
  – fragmented graph components and highly clustered users expected
• Last.fm relations used
  – Friends
  – Events
  – Groups
  – Neighbors

Data Collection
• Crawling using Last.fm API and HTML scraping
• Sampled node information: userID, country, age, registration time, …

Summary of datasets (Last.fm, July 2010)

Crawl type                        # Total Users   % Unique Users
Friends                           5x50K           71%
Events                            5x50K           58%
Groups                            5x50K           74%
Neighbors                         5x50K           53%
Friends-Events-Groups-Neighbors   5x50K           76%
UNI                               500K            99%

Comparison to UNI
[Figure: % of Subscribers]

Last.fm Charts Estimation
• Application of sampling

Last.fm Charts Estimation
[Figure: Artist Charts]

Related Work
• Fastest mixing Markov Chain
  – Boyd et al. - SIAM Review 2004
• Sampling in fragmented graphs
  – Ribeiro et al. - Frontier Sampling - IMC 2010
• Last.fm studies
  – Konstas et al. - SIGIR ‘09
  – Schifanella et al. - WSDM ‘10

Conclusion
• Introduced multigraph sampling
  – simple and efficient
  – discovers isolated components
  – better approximation of distributions and means
  – multigraph dataset planned for public release
• Future work on multigraph sampling
  – selection of relations
  – weighted relations

Thank you Questions?
