Partitioning Social Networks for Time-dependent Queries Berenice Carrasco, Yi Lu and Joana M.

Partitioning Social Networks for
Time-dependent Queries
Berenice Carrasco, Yi Lu and Joana
M. F. da Trindade
- University of Illinois -
EuroSys11 – Workshop on Social Network Systems
My colleague’s Facebook home page!
• What is visible to Joana?
– Messages in her two-hop network
[Figure: two-hop network with the nodes Joana, Adarsh, Nandana, Naseer and Jona]
Why is partitioning important?
• Different types of queries in Social Networks
– photo tags, marketplace, news feed
Most common query
• Retrieve small records (personalized content)
• Multiple records from different users
• Time-dependent
– Home page refresh at Facebook
Existing approaches
• Partition based solely on friendship (1-hop network)
– Power-law degree distribution
• Highly interconnected data
• Small fraction of nodes with very large degrees
– General approach: horizontal partitioning + replication
Existing approaches
• Hash-based horizontal partitioning
– Key: user name
– Multiple records in different servers
– Bad response time
– Inefficient network usage
– High packet overhead for such small data
[Figure: users Joana, Adarsh, Nandana, Naseer and Jona hashed by user name across partitions p1, p2 and p3]
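To see why hashing on the user name scatters a home page query, here is a minimal Python sketch. The friendship edges, the use of MD5 as the hash function, and the helper names `partition_of`, `two_hop_network` and `partitions_touched` are illustrative assumptions, not details taken from the slides.

```python
import hashlib

NUM_PARTITIONS = 3  # assumption: three partitions, like p1, p2, p3 in the figure

def partition_of(user, num_partitions=NUM_PARTITIONS):
    """Map a user name (the partitioning key) to a partition index."""
    digest = hashlib.md5(user.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

def two_hop_network(user, friends):
    """Return the user plus everyone within two friendship hops."""
    one_hop = set(friends.get(user, ()))
    two_hop = set()
    for friend in one_hop:
        two_hop.update(friends.get(friend, ()))
    return {user} | one_hop | two_hop

def partitions_touched(user, friends):
    """Partitions that must be contacted to read the user's two-hop data."""
    return {partition_of(u) for u in two_hop_network(user, friends)}

# Hypothetical friendship edges, for illustration only.
friends = {
    "Joana": ["Adarsh", "Nandana"],
    "Adarsh": ["Joana", "Naseer"],
    "Nandana": ["Joana", "Jona"],
    "Naseer": ["Adarsh"],
    "Jona": ["Nandana"],
}

touched = partitions_touched("Joana", friends)
print(f"Joana's query touches {len(touched)} of {NUM_PARTITIONS} partitions")
```

Because the hash ignores who interacts with whom, nothing keeps the records of a two-hop network on the same server.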
Existing approaches
• Replication
– Requires a large amount of extra storage
Existing approaches
• Query-based partitioning
– Assumes queries do not change with time
Curino et al., “Schism: a Workload-Driven Approach to Database Replication and Partitioning”, 2010
The challenge for Social Networks
• Friendship-based or query-based partitioning does not work well
• Underlying network varies over time
– Added/deleted friends
– Interaction level changes
Only 30% of Facebook user pairs interact consistently from one month to the next
Our approach
• Partition not only along the friendship network but also along the time dimension
– Interaction: activity network
• weighted links: strong vs. weak
• power-law with much lighter tail
– Maximal degree around 100
– This partitioning results in:
• Fewer cross-edges
• Reduced need for replication
– Goal: Provide frequent users with high data locality
• Faster response to queries
Our algorithm
• Differentiate between: 1) the period used for prediction and 2) the current period to partition
• Look at the interaction and predict the strength of the relationship
• Then, look at this strength and determine what data can be accessed together
1. Construct an Activity Prediction Graph (APG)
– Identifies links from past traces and captures relationships with strong activity
2. Compute the cost of local partitions
– Assigns a cost that determines how costly it would be to cut one edge or another
3. Partition the APG with KMETIS
4. Greedy algorithm for partitioning the current period
Our algorithm
• We propose a way to compute weights in this APG
• User nodes
• Message nodes
• Two-hop network
Our algorithm
• We propose a way to compute weights in this APG
• Message node weights
• User node weights
• Decay factor
• Number of messages exchanged
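The weight formulas themselves did not survive in this transcript. As a rough illustration of how a decay factor and the number of messages exchanged could combine, the Python sketch below accumulates a decayed message count per user pair. The geometric decay, the default `decay=0.5`, and the function name `interaction_strength` are assumptions made for this example, not the authors' definitions.

```python
from collections import defaultdict

def interaction_strength(traces, current_period, decay=0.5):
    """Accumulate decayed message counts per user pair.

    traces: iterable of (period, sender, receiver), with period < current_period
    decay:  hypothetical decay factor in (0, 1]; older periods count less
    """
    strength = defaultdict(float)
    for period, sender, receiver in traces:
        age = current_period - period              # periods since the message
        pair = tuple(sorted((sender, receiver)))   # treat interaction as undirected
        strength[pair] += decay ** (age - 1)       # latest past period counts as 1
    return dict(strength)

# Hypothetical message trace: (period, sender, receiver).
traces = [
    (1, "Joana", "Adarsh"),
    (2, "Joana", "Adarsh"),
    (2, "Adarsh", "Joana"),
    (1, "Joana", "Naseer"),
]

strength = interaction_strength(traces, current_period=3, decay=0.5)
print(strength)  # {('Adarsh', 'Joana'): 2.5, ('Joana', 'Naseer'): 0.5}
```

Pairs that exchanged many recent messages end up with the strongest links, which matches the slide's distinction between strong and weak relationships.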
Our algorithm
• Cost of local partitions
• Message node weights
• User node weights
• Edge weights
[Figure: Partition 1 and Partition 2, showing the messages accessible to user X and the remote message weights]
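The cost expression is not recoverable from the transcript. One plausible reading of "remote message weights" is sketched below in Python: the cost for user X is the total weight of the messages accessible to X that are stored outside X's own partition. The function `remote_cost` and all of the data are hypothetical.

```python
def remote_cost(user, user_partition, accessible, message_partition, message_weight):
    """Sum the weights of messages the user can see that live in other partitions."""
    home = user_partition[user]
    return sum(
        message_weight[m]
        for m in accessible[user]
        if message_partition[m] != home
    )

# Hypothetical placement: user X lives in partition 1.
user_partition = {"X": 1}
accessible = {"X": ["m1", "m2", "m3"]}          # messages accessible to user X
message_partition = {"m1": 1, "m2": 2, "m3": 2}
message_weight = {"m1": 1.0, "m2": 0.5, "m3": 0.25}

cost = remote_cost("X", user_partition, accessible, message_partition, message_weight)
print(cost)  # 0.75
```

Under this reading, a partitioning is better for X when heavily weighted messages sit in X's own partition.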
Evaluation: Graph Partitioning
• Data set:
– Facebook New Orleans network
• Jan 2005 to Dec 2006
• 8643 users and 69836 wall posts
• APG: Jan 2005 to Nov 2006
• Fixed period: Dec 2006, with 13948 wall posts
Evaluation of Data Locality
• We mimic real Facebook page downloads for all wall posts in Dec 2006
– Each query requests the 6 most recent wall posts in the user’s two-hop network
• We compare our algorithm to two hash-based horizontal partitioning algorithms
– Hash_p1
– Hash_p1_p2
• Number of partitions used: up to 20
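The locality metric described here can be sketched in Python as follows: for each query, collect the partitions that hold the 6 most recent posts in the user's two-hop network, then report the proportion of queries served by at most a given number of partitions. The data layout and the names `partitions_accessed` and `locality` are assumptions for illustration; the actual evaluation harness is not shown in the slides.

```python
def partitions_accessed(user, two_hop, posts, post_partition, k=6):
    """Partitions holding the k most recent posts in the user's two-hop network."""
    visible = [p for p in posts if p["author"] in two_hop[user]]
    recent = sorted(visible, key=lambda p: p["time"], reverse=True)[:k]
    return {post_partition[p["id"]] for p in recent}

def locality(queries, two_hop, posts, post_partition, max_partitions=1, k=6):
    """Proportion of queries answered by at most max_partitions partitions."""
    hits = sum(
        len(partitions_accessed(u, two_hop, posts, post_partition, k)) <= max_partitions
        for u in queries
    )
    return hits / len(queries)

# Hypothetical two-hop networks, wall posts and placement.
two_hop = {
    "Joana": {"Joana", "Adarsh", "Nandana"},
    "Naseer": {"Naseer", "Adarsh"},
}
posts = [
    {"id": 1, "author": "Adarsh", "time": 10},
    {"id": 2, "author": "Nandana", "time": 11},
    {"id": 3, "author": "Naseer", "time": 12},
    {"id": 4, "author": "Joana", "time": 13},
]
post_partition = {1: 0, 2: 1, 3: 0, 4: 0}
queries = ["Joana", "Naseer"]

print(locality(queries, two_hop, posts, post_partition, max_partitions=1))  # 0.5
print(locality(queries, two_hop, posts, post_partition, max_partitions=3))  # 1.0
```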
Evaluation of Data Locality
• Proportion of queries that access only 1 partition
Evaluation of Data Locality
• Proportion of queries that access at most 3 partitions
Conclusion and Future Work
• Our algorithm partitions social network data according to interaction levels at different times
• Our activity prediction graph significantly improved data locality compared to hashing
• Future work: placement of data across different periods
Backup Slides
Existing approaches
• Hash-based horizontal partitioning
– Gizzard: range partitioning
– Cassandra: consistent hashing
– Dynamo: modified consistent hashing
Our approach
• Replication with time-dependency
Greedy Algorithm
• Use a greedy algorithm for messages in the non-predicted month (Dec 2006), which fall into three cases:
– The initiator and receiver of the message both exist in the APG but have no previous interaction
– Exactly one of the initiator and receiver of the message exists in the APG
– Neither the initiator nor the receiver exists in the APG
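The slide lists the cases but not the placement rule used in each one. The Python sketch below shows one possible greedy rule, purely as an assumption: reuse the partition of an endpoint that is already placed (preferring the less loaded one), and otherwise fall back to the least-loaded partition. The function `place_message` and the load-based tie-breaking are hypothetical.

```python
def place_message(initiator, receiver, user_partition, load):
    """Greedily choose a partition for one new message.

    user_partition: partition of each user already placed via the APG (mutated)
    load:           current number of messages per partition (mutated)
    """
    known = [user_partition[u] for u in (initiator, receiver) if u in user_partition]
    if known:
        # At least one endpoint is in the APG.
        # Assumption: reuse an endpoint's partition, preferring the lighter one.
        target = min(known, key=lambda p: load[p])
    else:
        # Neither endpoint is in the APG.
        # Assumption: fall back to the least-loaded partition.
        target = min(load, key=load.get)
    load[target] += 1
    # Remember where previously unseen users were placed.
    for u in (initiator, receiver):
        user_partition.setdefault(u, target)
    return target

# Hypothetical state after partitioning the APG.
user_partition = {"Joana": 0, "Adarsh": 1}
load = {0: 5, 1: 3, 2: 1}

a = place_message("Joana", "Adarsh", user_partition, load)   # both in the APG
b = place_message("Joana", "Naseer", user_partition, load)   # exactly one in the APG
c = place_message("Jona", "Nandana", user_partition, load)   # neither in the APG
print(a, b, c)  # 1 0 2
```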