Transcript Slides

Partitioning Social Networks for
Fast Retrieval of Time-dependent Queries
Mindi Yuan, David Stein, Berenice Carrasco, Joana Trindade, Yi Lu
University of Illinois at Urbana-Champaign
Presenter:
Sameh Elnikety
Microsoft Research
Email questions, praise and complains to Prof. Yi Lu <[email protected]>
1
Online Social Networking (OSN)
• OSNs differ from traditional web applications significantly
– Handle highly personalized content
• Retrieved pages differ from user to user; also vary over short time intervals
when new activities occur
– Highly interconnected data
• Immediate neighborhood and extended neighborhood (friend’s friend)
– Community
• Strong community structure among end users, albeit varying with time
• Scaling OSNs is a problem
– Interconnected nature
– Astounding growth rate
• Twitter grew by 10 times in one month, forced to redesign and re-implement its
architecture several times
2
Existing Approaches
• Conventional vertical scaling
– Upgrade existing hardware
– Replicate fully on each machine
• Problems
– Expensive because of the cost of high performance servers
– Even infeasible
• Facebook requires multiple hundreds of Terabytes of memory
• Horizontal scaling
– Use a higher number of cheap commodity servers
– Partition the load among them
– Good for stateless traditional web applications
• Problematic with data back-end layer that maintains state
– OSN data cannot be partitioned into disjoint components
– Leads to costly inter-server communication
3
Inter-server communication
• Distributed queries are shown to reduce performance compared to
local queries [Pujol et. al. 2010][Curino et. al. 2010]
• Curino et. al. reported that local queries double throughput
• Question to consider:
How much can we minimize inter-server communication and at what
cost?
4
Overview
1. Problem
2. Previous work
3. Our approach
4. Evaluation
5
Two extremes of the spectrum
• Cassandra
–
–
–
–
–
Random partitioning with distributed hashing
Consistency is easy to maintain since only one replica is available
Inter-server traffic increases at retrieval
Slow down responses
Lead to “multi-get” hole problem with e.g. Facebook, where increasing the
number of servers actually reduces throughput
• Replicating all users’ data on multiple servers
–
–
–
–
Eliminates inter-server traffic but increases replication overhead
Impossible to hold all data in memory on each server
Delay in writing due to updates of all replicas
Network traffic for maintaining consistency across replicas
6
Little Engines
The little engine(s) that could: scaling online social networks. Pujol et. al. Sigcomm, 2010.
• Assumption
– Most of operations are based on data of a user and her neighbors
– The assumption is debatable: True for Twitter data, not true for Facebook data
where a two-hop network is accessed for most operations (expanded on next
slide)
• Guarantees that for all users in an OSN, their direct neighbors’ data are colocated on the same server
– Application developers can assume local semantics
– Scalability achieved with servers with low memory and network I/O
• Drawback
– Popular data replicated on all servers
– Extension to two-hop network will require a much larger number of replicas
7
Two-hop network
• SPAR proposed in Little Engines considers one-hop network
– Mainly focus on twitter-type models: a user follows all contents of a group
of senders
– Facebook messages were tested upon by SPAR, but they are still modeled
as a user following all his friends, instead of the real-world scenario where
a user can access related activities of friends’ friends
• Two-hop networks
– For instance, Joana can accesses all activities Between Jona and Nandana
Joana
Adarsh
Nandana
Naseer
Jona
8
Overview
1. Problem
2. Previous work
3. Our approach
4. Evaluation
10
Our Approach
• Goal: have most requests handled locally while having much fewer
replicas than the approach in “little engines”?
– To have an approach scalable with two-hop networks
– To reduce the cost of maintaining consistency among replicas
• Main idea:
– Instead of partitioning the social network, which contains all users that will
potentially send a message
– We focus on the activity network, which contains users actually sending
messages
– Partitioning is based on actual activities in the network so that most requests
can be served locally, which can be achieved with much fewer replicas than a
strong locality guarantee
11
Activity Network 1/2
• Contains much fewer edges than the social network
– More amenable to partitioning and replication
• In a two-hop neighborhood, even if a user does not participate in much
activity with his direct neighbors, there can still be abundant activity in
the two-hop neighborhood
• Facebook data for Dec 2006 New Orleans network
– While most user pairs only have 1 post, the top 8% of users receiver > 100
posts in their two-hop neighborhood
12
Activity Network 2/2
• There is strong time correlation among wall posts between the same
pairs of users
– This motivates a local, dynamic partitioning algorithm that adapts to the
temporal correlation
• Computation of auto-correlation
– Gather all messages in the period m(1), … m(k)
– Compute the auto-correlation function
13
Activity Prediction Graph
• The example has 6 users in the network
• There is a message node between each pair of user nodes
• Each edge is assigned a weight that reflects
– Read frequency of a user (assume data collected)
– The likelihood of retrieving a message from this particular edge when the
most recent N messages are retrieved from the two-hop neighborhood
14
Partitioning Algorithm
• Periodic partitioning
– Use K-Metis to repartition the graph at fixed time epochs
– In between epochs, new users and new interactions are added to the graph
to minimize partitioning cost
• Local adaptive partitioning
– Takes advantage of strong time correlation between messages
– Triggered when a retrieval results in remote accesses
– Keep movements of nodes across partitions small
15
Overview
1. Problem
2. Previous work
3. Our approach
4. Evaluation
16
Evaluation Scenario
• Each retrieval involves the most recent 6 messages in a two-hop
neighborhood
• We chose 6 as the dataset available is relatively small
– A total of 13948 messages with 8640 active users
• We consider algorithms
–
–
–
–
–
Random hashing based on sender (hash_p1)
Random hashing based on sender-receiver pair (hash_p2)
Periodic partitioning
Local adaptive partitioning
Retrospective algorithm: instead use the APG, which uses past data to predict
the activity for the current month, use the actual activity in the month. It is
the best one can do given the limitations of the partitioning algorithm
17
Results
• Proportion of retrievals that accesses only one partition for all 6 messages
• The periodic algorithm performs much better than random hashing
• Close to retro, so APG does provide relatively accurate prediction
18
Results
• Proportion of retrievals that accesses at most 3 partitions for all 6
messages
• Most retrievals touch at most 3 partitions
19
Results
• The local adaptive algorithm further improves over periodic partitioning
• At least 6.4 times more local queries than random hashing
• Without causing large movements of nodes as in the periodic case
20
Results
• We also compare the periodic partitioning on the APG vs a graph using
weights in one-hop neighborhood only
• The retrievals are performed in the two-hop neighborhood
• There is clearly an advantage in considering activities in the two-hop
neighborhood when constructing the graph
• The advantage is more significant as the number of partitions increases
21
Conclusion
• We proposed a dynamic, adaptive algorithm to predict activities on an
OSN and partition the network for fast retrieval of data
• The algorithm scales when information needs to be retrieved from the
two-hop neighborhood
• Further work is done to improve data locality for most retrievals with a
small number of replicas
22