Social Networks
Seminar on Advanced Internet
Applications and Services
Ilana Dreizis
Eyal Bellisha
•2
Outline
Introduction to UGC (User Generated Content)
systems
Analyzing statistics in UGC systems
— Compare UGC systems with standard VoD systems
— Analyze the popularity distributions from various categories of
UGC services
— Measure the level of content aliasing and of illegal content
Efficiency proposals for UGC systems
— Utilizing P2P
— Using caches
•3
Introduction to UGC systems
In the past:
— Video-on-Demand (VoD) systems were supplied by a limited number of
producers
— Content popularity was controllable through professional marketing campaigns
— Niche products were hard to reach and often required a great deal of interest
and motivation, on the consumer’s part, to be accessed
Nowadays:
— Hundreds of millions of Internet users are self-publishing consumers
— UGC popularity is not well predicted using the traditional prediction models
— There is an enormous variability in video contents
Having a better understanding of the popularity characteristics would help:
— Overcome a few of the bottlenecks residing in today’s networks (poor search
and recommendation engines)
— Affect the strategies of marketing, advertising and search engines
•4
Outline
Introduction to UGC (User Generated Content) systems
Analyzing statistics in UGC systems
— Compare UGC systems with standard VoD systems
— Analyze the popularity distributions from various categories of
UGC services
— Measure the level of content aliasing and of illegal content
Efficiency proposals for UGC systems
— Utilizing P2P
— Using caches
•5
Data Collection
Our dataset:
— The world’s largest UGC site (2 categories: ‘Entertainment’ &
‘Science & Technology’)
— The most popular UGC service in Korea (has streaming videos at
rates as high as 800 kb/s)
— A popular online video rental store
— Europe’s largest online DVD rental store
— An online movie guide
•6
UGC vs. non-UGC
Content production rate:
— IMDB carries 1,597,407 titles of movies and TV episodes produced during 1888-2015
— YouTube has 65,000 daily new uploads
It only takes 24 days for YouTube to produce the same number of videos!
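The comparison above is a single division, sketched here with the two figures quoted on the slide. The variable names are illustrative only.

```python
# Figures quoted on the slide.
imdb_titles = 1_597_407          # movies and TV episodes listed by IMDB
youtube_daily_uploads = 65_000   # new YouTube videos per day

# Days of YouTube uploads needed to match the whole IMDB catalogue.
days_to_match = imdb_titles / youtube_daily_uploads
print(f"{days_to_match:.1f} days")
```

The result lands between 24 and 25 days, which the slide quotes as 24.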
The average number of posts per publisher is similar for UGC and non-UGC:
— 90% of film directors publish fewer than 10 movies
— 90% of UGC publishers upload fewer than 30 videos on YouTube
Length of videos:
— Average length on YouTube is 2-4 minutes
— Average length of a film is 94 minutes
User participation:
— Strong linear correlation: $\frac{\#\,\text{rankings}}{\#\,\text{users who watched}} \approx 0.8\%$
•7
Power Law
The power law model has been increasingly used to explain
various statistics appearing in computer science and
networking applications
In systems where many people are free to choose between
many options, a small subset of the whole will get a
disproportionate amount of traffic (or attention, or income), even
if no members of the system actively work towards such an
outcome
The very act of choosing, spread widely enough and freely
enough, creates a power law distribution
•8
Power Law Examples
Distribution of inlinks to 100,000 random web pages:
•9
Power Law Examples
Several hundred blogs ranked by number of inbound links
Wealth of investors in the Forbes 400 list of 2003 vs. their ranks (rich-get-richer)
•10
Power Law
Many distributions whose underlying mechanism is power law fail to show
power law patterns at the two ends of the distribution:
— Most popular items
— Least popular items
This could be the result of bottlenecks
The Netflix data shows a pattern which fits the power law distribution only for
ranks 1 to 100
In this case there is an information bottleneck: users
cannot easily discover niche content because it is not properly categorized
•11
How niche-centric is YouTube?
The 10% most popular videos account for almost 80% of views
Requests on YouTube are highly skewed towards popular videos
Suggestion: caching the 10% of long-term popular videos can serve 80% of
the requests
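As a rough illustration of how skewed such a distribution is, the sketch below computes the share of views captured by the top 10% of videos under a pure power law. The catalogue size and the exponent `alpha = 1.0` are assumptions chosen for illustration, not values fitted to the YouTube traces.

```python
def top_share(num_videos: int, alpha: float, top_fraction: float) -> float:
    """Fraction of all views going to the most popular `top_fraction` of videos
    when the video at rank r receives views proportional to r ** -alpha."""
    weights = [rank ** -alpha for rank in range(1, num_videos + 1)]
    top_n = max(1, int(num_videos * top_fraction))
    return sum(weights[:top_n]) / sum(weights)


# Assumed catalogue of 100,000 videos with a Zipf-like exponent of 1.
share = top_share(num_videos=100_000, alpha=1.0, top_fraction=0.10)
print(f"top 10% of videos receive {share:.0%} of views")
```

With these assumed parameters the share comes out at roughly 80%, in line with the figure on the slide. A different exponent or catalogue size would shift it.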
•12
Popular Content Analysis
All 4 popularity distributions analyzed exhibit power law behavior across more
than 2 orders of magnitude (straight line)
Best fit: Power law with an exponential cutoff
•13
Popular Content Analysis
Most categories (such as Daum Food) showed power law distributions with an
exponential cutoff
Yule process in YouTube can explain the power law distribution:
If k users have already watched a video then the rate of the other users
watching the video will be proportional to k
What could account for the exponential cutoff in the most popular
videos?
Aging (network of actors)
— Unlikely, since traces show that 80% of video requests on a given day are
for videos older than a month
Information filtering: a user can only receive information from a limited number
of sources
— On the contrary, highly popular videos are prominently featured in VoD services
to attract more viewers
“Fetch at most once” behavior
— Viewers are not likely to watch the same video multiple times
•14
“Fetch At Most Once” in YouTube?
R – Average number of requests per user
U – Fixed number of users
V – Number of videos
Tail truncation is affected by
R and the number of videos
per category
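The effect of "fetch at most once" on the head of the distribution can be sketched with a small simulation using the slide's R, U and V. The Zipf-like popularity weights, the parameter values and the choice to simply drop repeat requests are assumptions for illustration.

```python
import random
from itertools import accumulate


def simulate(num_users, num_videos, requests_per_user, fetch_at_most_once, seed=0):
    """Return per-video view counts, sorted from most to least viewed."""
    rng = random.Random(seed)
    videos = list(range(num_videos))
    # Assumed popularity: the video at rank r is requested with weight 1/r.
    cum_weights = list(accumulate(1.0 / (rank + 1) for rank in videos))
    views = [0] * num_videos
    for _ in range(num_users):
        seen = set()
        for _ in range(requests_per_user):
            video = rng.choices(videos, cum_weights=cum_weights)[0]
            if fetch_at_most_once and video in seen:
                continue  # a repeat request by the same user is suppressed
            seen.add(video)
            views[video] += 1
    return sorted(views, reverse=True)


U, V, R = 200, 500, 50  # users, videos, requests per user (illustrative values)
free = simulate(U, V, R, fetch_at_most_once=False)
capped = simulate(U, V, R, fetch_at_most_once=True)
print("most viewed video without the rule:", free[0])
print("most viewed video with the rule:   ", capped[0])
```

Under the rule no video can collect more than U views, so the top ranks flatten out. The larger R is relative to V, the more ranks hit that ceiling.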
•15
The Long Tail - Intro
Problems with traditional retail:
— An average movie theater will not show a film unless it can attract at least 1,500
people over a two-week run
— An average record store needs to sell at least 4 copies of a CD per year to
make it worth the rent for 1.3” of shelf space
— Same goes for DVD rental shops, video-game stores, booksellers, radio
stations etc.
The reason for this problem is: a limited local population
In the previous century there was a clear solution to this problem: hits
Theaters focus on blockbusters, CD stores focus on the top 100 singles charts,
news stands focus on the top 30 newspapers and magazines etc.
•16
The Long Tail - Intro
Why is this a problem?
Today, for example, the average Barnes & Noble superstore carries around
100,000 titles. Yet, more than 25% of Amazon’s book sales come from outside
the top 100,000
20 of the top-selling albums of all time were produced between
1996-2000. The next 5 years produced only 2, ranked 92 & 95
Album sales dropped by 25% between 2001-2005,
while hit album sales dropped by nearly 50% during the same years
•17
The Long Tail - Intro
Yet, another Example
If we take a look at Rhapsody’s (streaming service by RealNetworks) monthly
statistics, we get a demand curve that looks like the one of any record store:
(page 19)
All the action seems to appear in a tiny number of tracks on the left – The hits
•18
The Long Tail - Intro
After a century of staring at the left-hand side of the curve, let’s have a look at
the right-hand side: it’s not exactly zero!
Not only that, but songs are being downloaded at an average of 250 a month
Since there are so many of them, their total sales quickly add up: 22 million DLs
– almost 25% of Rhapsody’s total business
If we look even closer: 16 million DLs – a little more than 15% of DLs a month
•19
The Long Tail - Intro
•20
The Long Tail
Can UGC services benefit from the Long Tail?
The truncated tail best fits power law with exp. cutoff
A number of reasons for the truncated tail:
— Natural shape: most videos are of low interest to most users
— Sampling Biases or pre-filters: users tend to publish their most interesting
videos, leaving the private ones unreachable
— Information Filtering: search engines tend to favor a small number of popular
items
Removal of these bottlenecks would allow users to discover rare niche videos
and offer new potential business opportunities
•21
The Long Tail
Potential gains:
The benefit is reduced when the number of videos is smaller since videos can
be found easier
These benefits may not hold if truncation is due to natural user behavior
•22
Popularity Distribution vs. Age
After a day, 90% of videos are watched at least once
40% are watched over 10 times
If a video did not get enough requests during its first days, it is unlikely it will get
many requests in the future (very slow decay in the horiz. log-scale axis)
Is it possible to predict near-future popularity?
YouTube Sci trace
•23
Predicting Near-Future Popularity
Main reasons
— Service providers may populate videos within multiple proxies or caches
— Content owners will have a fast feedback on their contents (e.g. movie trailers)
Correlation coefficient of video views in two snapshots
and the number of videos analyzed
Even 2 days old videos provide high correlation results after 3 months
How easy or hard is it for a video to become popular as a function
of its age?
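The correlation coefficient between two snapshots can be sketched as a plain Pearson correlation over per-video view counts. The view counts below are made up for illustration and are not taken from the traces.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical view counts of the same six videos at two snapshots.
views_early = [120, 15, 3, 980, 40, 7]         # e.g. when the videos are 2 days old
views_later = [2100, 160, 20, 15500, 900, 65]  # e.g. three months later
r = pearson(views_early, views_later)
print(f"correlation between snapshots: {r:.3f}")
```

A value close to 1 means the early view counts already order the videos much as the later counts do, which is what makes near-future prediction plausible.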
•24
Popularity Shifts
Observations:
— Young videos can change many rank positions very fast (unlike older videos):
ranking classification for old videos is more stable
— Old videos are able to become popular after a long time (maybe good
recommendation engines)
— The gap between the max. and the top 99% reflects that only a few young
videos make large rank changes
— There is a consistent min. line at about -4000 across all age groups (these
videos did not receive any requests but were pushed back in ranking by popular
videos)
•25
Popularity Shifts
Observation:
— Videos that get many requests can get a minor rank change
— Videos that get very few requests can have a large rank change
Conclusion: considering the change in ranks is not enough
There still are drastic popularity shifts for young videos (log-scale)
Most old videos did not receive any significantly large number of requests
•26
Outline
Introduction to UGC (User Generated Content) systems
Analyzing statistics in UGC systems
— Compare UGC systems with standard VoD systems
— Analyze the popularity distributions from various categories of
UGC services
— Measure the level of content aliasing and of illegal content
Efficiency proposals for UGC systems
— Utilizing P2P
— Using caches
•27
Efficient UGC System Design
YouTube is estimated to carry 60% of all videos online
YouTube serves 100 million distinct videos daily
Goal: investigate the benefits for alternate distribution schemes: caching and
P2P
Data used: daily traces of 6 consecutive days for 263,847 YouTube Sci. videos
•28
Better Use Of Caching
Assumptions:
— Caches always redirect users to the right video
— No assumptions are made about the location or the size of the caches
— Each time a video is viewed the cache holds the full length video (even if the
user chose to watch it partially)
Types of caches:
— Static finite cache – at day 0 the cache is filled with long-term popular videos and
never changes
— Dynamic infinite cache – at day 0 the cache is filled with all videos requested
before day 0, and afterwards stores any additional requested video
— Hybrid finite cache – like the static cache, but can also hold the daily most popular
videos
The static and the hybrid caches hold at day 0 16% of the Sci. videos
The hybrid cache also holds extra space for the daily top 10,000 videos
The 6 day trace is replayed under each one of these caches, calculating hits
and misses
Cache size is determined by the number of videos cached (video length does
not vary much across files)
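The replay described above can be sketched as follows. This is a toy model under stated assumptions: the trace is a list of days, each a list of requested video ids; the hybrid cache's extra slots are refilled with the previous day's most requested videos (the slides do not say how the daily top list is chosen); and the video ids are invented.

```python
from collections import Counter


def hit_ratio(trace, initial, policy, daily_top_n=0):
    """Replay `trace` against a cache and return hits / total requests.

    policy: "static"  - cache content never changes
            "dynamic" - every missed video is added (infinite cache)
            "hybrid"  - static content plus the previous day's top-N videos
    """
    cache = set(initial)
    extra = set()  # hybrid only: slots for the daily most popular videos
    hits = total = 0
    for day in trace:
        for video in day:
            total += 1
            if video in cache or video in extra:
                hits += 1
            elif policy == "dynamic":
                cache.add(video)
        if policy == "hybrid":
            extra = {v for v, _ in Counter(day).most_common(daily_top_n)}
    return hits / total


# Toy three-day trace with invented video ids.
trace = [["a", "b", "c", "a", "d"], ["a", "d", "d", "e"], ["d", "e", "a", "f"]]
static = hit_ratio(trace, initial={"a"}, policy="static")
hybrid = hit_ratio(trace, initial={"a"}, policy="hybrid", daily_top_n=1)
dynamic = hit_ratio(trace, initial={"a", "b"}, policy="dynamic")
print(static, hybrid, dynamic)
```

On this toy trace the hit ratio grows from static to hybrid to dynamic, mirroring the trade-off between cache size and server load that the slides evaluate.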
•29
Better Use Of Caching
Results:
— The static cache uses 84% less space than the dynamic cache, but still saves about
75% of the server’s load
— The hybrid cache improves the static cache by about 10%
•30
P2P VoD
Consider a peer-assisted VoD distribution where users stream videos from VoD
servers as well as from each other
P2P is effective only when there are enough online peers sharing content
How many files benefit from this approach?
How much server workload can be lowered?
•31
P2P VoD
P2P session cases:
— A user shares only while watching a
video
— A user shares a video for the entire
duration time he spends on YouTube
— A user shares for 1 extra hour after he
is done watching
— A user shares for 1 extra day
Average session time is currently 28
minutes
Assumption: within a single day
requests are exponentially distributed
Variables:
— Intensity of requests: $\lambda$
— System time of a user: $t$
— Number of concurrent users: $\lambda t$
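A minimal sketch of how these variables combine, assuming the number of concurrent users is the product of the request intensity and the time each user stays in the system (Little's law). The request volume and the 3-minute video length are illustrative assumptions; the 28-minute session time is the figure from the slide.

```python
def concurrent_users(requests_per_day: float, sharing_minutes: float) -> float:
    """Expected number of users sharing a video at any moment:
    request intensity (per minute) times the time each user keeps sharing."""
    rate_per_minute = requests_per_day / (24 * 60)
    return rate_per_minute * sharing_minutes


video_minutes = 3        # assumed video length
requests_per_day = 1440  # assumed: one request per minute on average

cases = {
    "while watching": video_minutes,
    "whole session (28 min)": 28,
    "watching + 1 hour": video_minutes + 60,
    "watching + 1 day": video_minutes + 24 * 60,
}
for label, minutes in cases.items():
    print(f"{label}: {concurrent_users(requests_per_day, minutes):.1f} peers")
```

Longer sharing times raise the number of simultaneous peers per video, so more of the catalogue can benefit from peer assistance.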
•32
Content Aliasing
In UGC there often exist multiple identical or very similar copies for a single pop.
event
Problem:
— Multiple copies of a certain video dilute the popularity of a single event
— This could affect the recommendation engine
Data: a sample of 216 videos from the top 10,000 of YouTube’s Ent.
category
Most videos have 1-4 aliases, while the maximum is 89
•33
Content Aliasing
Time intervals of aliases:
There is little or no decrease in the number of views over time
•34
Illegal Uploads
Videos derived from copyrighted content raise a serious legal dilemma for UGC
service providers
Nearly 10% of videos in YouTube are uploaded without the permission of the
content owner according to Vidmeter’s report
In order to measure the extent of illegal video content, the same list of videos is
sampled at two different times: the discrepancy represents the deleted videos
From the first set of 1,687,506 videos (YouTube Ent.) only about 5% were
deleted due to a violation of the copyright law
A far smaller result than that of Vidmeter’s
•35
Appendix
Example for the network of actors:
•36
Papers
I Tube, You Tube, Everybody Tubes: Analyzing the
World’s Largest User Generated Content Video
System
— Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, YongYeol Ahn and Sue Moon
Classes Of Behavior Of Small-World Networks
— L. Amaral, A. Scala, M. Barthelemy and H.E. Stanley
The Long Tail: Why the Future of Business Is
Selling Less of More
— C. Anderson
•37
Outline
Introduction and basic concepts
Introduce the general ideas of BUBBLE algorithm
Introduce centralised community detection algorithms
Evaluate the BUBBLE algorithm
Show the possibility of a distributed implementation for
BUBBLE
•38
Introduction - Goals
To improve our understanding of human mobility in terms of
social structures
— Community
— Centrality
To use these structures in the design of forwarding algorithms
for Pocket Switched Networks(PSNs)
— In PSN, we do not try to find or build end-to-end paths. Data is
forwarded hop-by-hop, taking advantage of any opportunities in
the course of device mobility (local/global network connectivity)
• Whenever two PSN nodes come into contact, they must detect each
other and determine what to transfer in each direction
— PSN falls under the more general space of Delay Tolerant
Networks(DTN)
•39
DTN
The existing TCP/IP model operates on a number of key
assumptions:
— an end-to-end path exists between a data source and its peer(s)
— the maximum round-trip time between any node pairs in the
network is not excessive
— the end-to-end packet drop probability is small
Challenged networks characteristics:
— very significant link delay
— non-existence of end-to-end routing paths
— lack of large memory at end nodes
In a DTN, routing is performed over time to achieve eventual
delivery by employing long-term storage at the intermediate
nodes
•40
Introduction
Some DTN routing algorithms provide forwarding by building
and updating routing tables whenever mobility occurs
Not cost effective for a PSN:
— mobility is often unpredictable
— topology changes can be rapid
Rather than exchange much control traffic to create unreliable routing
structures, it is preferable to search for some characteristics of the network which
are less volatile than mobility
A PSN is formed by people. Those people’s social relationships
may vary much more slowly than the topology
•41
Outline
Introduction and basic concepts
Introduce the general ideas of BUBBLE algorithm
Introduce centralised community detection algorithms
Evaluate the BUBBLE algorithm
Show the possibility of a distributed implementation for
BUBBLE
•42
Why BUBBLE Algorithm?
Previous work presented “labeling strategy”:
— Each node has a label that informs other nodes of its affiliation
— Next-hop nodes are selected if they carry the same label as
the destination
— Very little state information, merely an affiliation label, can already
bring significant improvement in forwarding performance :
• delivery ratio
• delay
• cost
— This is a beginning of social based forwarding in PSN:
• without a concise concept of community
• lack of mechanisms to move messages away from the source
when the destinations are socially far away
•43
Why BUBBLE Algorithm?
BUBBLE combines the knowledge of community structure with
the knowledge of node centrality to make forwarding decisions
Two intuitions behind this algorithm:
— People have varying roles and popularities in society, and these
should be true also in the network
— People form communities in their social lives, and this should
also be observed in the network layer
•44
BUBBLE RAP Algorithm
Forwarding is carried out as follows:
The source node first bubbles the message up the hierarchical
ranking tree using the global ranking, until it reaches a node which is
in the same community as the destination node
The local ranking system is used instead of the global ranking, and
the message continues to bubble up through the local ranking tree
until the destination is reached or the message expires
Requires every node to be able to compare the ranking of all other nodes
in the system
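The two-phase rule above can be sketched as a single forwarding predicate. This is a simplified illustration, not the authors' implementation: each node is assumed to belong to exactly one community, rankings are plain numbers, and the `Node` type and `should_forward` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    community: str      # simplification: a single community per node
    global_rank: float  # centrality across the whole system
    local_rank: float   # centrality within the node's own community


def should_forward(carrier: Node, encountered: Node, destination: Node) -> bool:
    """Decide whether `carrier` hands the message to `encountered`."""
    if encountered.name == destination.name:
        return True  # direct delivery
    carrier_inside = carrier.community == destination.community
    encountered_inside = encountered.community == destination.community
    if carrier_inside:
        # Local phase: stay inside the community and climb the local ranking.
        return encountered_inside and encountered.local_rank > carrier.local_rank
    # Global phase: enter the destination's community when possible,
    # otherwise climb the global ranking.
    return encountered_inside or encountered.global_rank > carrier.global_rank
```

A carrier outside the destination's community hands the message to any node inside it, or to a globally more central node. Once inside, only locally more central members receive it.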
•45
Illustration of the BUBBLE algorithm
•46
Data Collection
Our dataset:
— Infocom05 - the iMotes were distributed to students attending the Infocom
student workshop
— Hong-Kong - the people carrying the iMotes were chosen independently in a
Hong-Kong bar
— Cambridge - the iMotes were distributed to students from University of Cambridge
Computer Laboratory
— Infocom06 - the same as in Infocom05 except that the scale is larger
— Reality – smart phones were deployed to students and staff at MIT
Experimental data set            Infocom05   HongKong    Cambridge   Infocom06   Reality
Device                           iMote       iMote       iMote       iMote       Phone
Network type                     Bluetooth   Bluetooth   Bluetooth   Bluetooth   Bluetooth
Duration (days)                  3           5           11          3           246
Number of experimental devices   41          37          54          98          97
Number of internal contacts      22,459      560         10,873      191,336     54,667
Average # contacts/pair/day      4.6         0.084       0.345       6.7         0.024
•47
Frequency of nodes as relays
Shows the number of times a node falls on the shortest paths between all other
node pairs (centrality of a node in the system)
In order to design a more efficient forwarding strategy, we prefer to choose
popular nodes as relays rather than unpopular ones
•48
Outline
Introduction and basic concepts
Introduce the general ideas of BUBBLE algorithm
Introduce centralised community detection algorithms
Evaluate the BUBBLE algorithm
Show the possibility of a distributed implementation for
BUBBLE
•49
Are communities of nodes
detectable in PSN traces?
This requires a community detection algorithm
Criteria for choosing the algorithm:
— Ability to uncover overlapping communities
— A high degree of automation
•50
Are communities of nodes
detectable in PSN traces?
K-CLIQUE method suits
— but was designed for binary graphs, thus we must threshold
the edges of the contact graphs in order to use it and it is
difficult to choose an optimum threshold manually
Weighted network analysis (WNA) can work on weighted
graphs directly
— but it cannot detect overlapping communities
Chose to use both K-CLIQUE and WNA
•51
K-CLIQUE Community Definition
Union of all k-cliques (complete sub graphs of size k) reachable
through a series of adjacent k-cliques [Palla et al]
Two k-cliques are adjacent if they share k − 1 nodes
A community corresponds to a maximal union of k-cliques in
which we can reach any k-clique from any other k-clique
through series of k-clique adjacencies
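The definition above can be turned into a brute-force sketch for small unweighted graphs: enumerate all k-cliques, then merge those that share k − 1 nodes. This is an illustration of the definition only, not the algorithm of Palla et al. or the one used on the traces, and the example graph is invented.

```python
from itertools import combinations


def k_clique_communities(edges, k):
    """Return k-clique communities of an undirected, unweighted graph."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # All complete subgraphs of size k (brute force; small graphs only).
    cliques = [
        frozenset(group)
        for group in combinations(sorted(neighbours), k)
        if all(v in neighbours[u] for u, v in combinations(group, 2))
    ]

    # Union-find over cliques: adjacent cliques share exactly k - 1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(communities.values(), key=sorted)


# Triangles abc and bcd share an edge; triangle def only shares node d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("d", "e"), ("d", "f"), ("e", "f")]
communities = k_clique_communities(edges, k=3)
print(communities)
```

In this example node d ends up in both communities, which is the overlapping property that motivated the choice of K-CLIQUE.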
•52
K-CLIQUE Community Definition
Overlapping feature (a node can belong to several different k-clique clusters at the same time)
As we increase k, the k-clique-communities shrink, but on the
other hand become more cohesive since their member nodes
have to be part of at least one k-clique
Was designed for binary graphs (undirected, unweighted)
•53
K-CLIQUE Community Detection
Communities based on contact durations with weight threshold = 388800s
(4.5days), 648000s (7.5days) and k=3,4 (Reality)
•54
Weighted Network Analysis
Communities detected by applying WNA on four datasets.
Infocom06 – Qmax is low, agrees with the fact that in a conference the community
boundary becomes blurred
Cambridge – the two communities exactly matched the two groups (1st year and
2nd year) of students selected for the experiment
Reality - Qmax is high, reflects the more diverse campus environment
•55
Centralised community detection algorithms
Give us rich information about the human social clustering
Useful for offline data analysis on mobility traces collected
Useful for exploring structures in the data and hence design
useful forwarding strategies, security measures
•56
Outline
Introduction and basic concepts
Introduce the general ideas of BUBBLE algorithm
Introduce centralised community detection algorithms
Evaluate the BUBBLE algorithm
Show the possibility of a distributed implementation for
BUBBLE
•57
Evaluations of Different Forwarding Algorithms
Comparison metrics:
— Delivery ratio - the proportion of messages that have been
delivered out of the total unique messages created
— Delivery cost - the total number of messages (including
duplicates) transmitted across the air. To normalize this, we
divide it by the total number of unique messages created
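Both metrics can be computed from a simple event log, sketched below. The log layout (a set of created message ids, a list of over-the-air transmissions and a set of delivered ids) is an assumed format, and the data are invented.

```python
def delivery_ratio(created: set, delivered: set) -> float:
    """Share of unique created messages that reached their destination."""
    return len(delivered & created) / len(created)


def delivery_cost(created: set, transmissions: list) -> float:
    """Transmissions (duplicates included) per unique created message."""
    return len(transmissions) / len(created)


# Hypothetical log: (message id, sender, receiver) for every copy sent.
created = {"m1", "m2", "m3", "m4"}
transmissions = [
    ("m1", "A", "B"), ("m1", "B", "C"), ("m1", "A", "D"),
    ("m2", "A", "B"),
    ("m3", "C", "D"), ("m3", "D", "E"),
]
delivered = {"m1", "m3"}

print("delivery ratio:", delivery_ratio(created, delivered))
print("delivery cost: ", delivery_cost(created, transmissions))
```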
•58
Evaluations of Different Forwarding Algorithms
WAIT - Hold on to a message until the sender encounters the recipient
directly (paths with single hop)
— lower bound for delivery and cost
FLOOD - Messages are flooded throughout the entire system (length of
the path is unlimited)
— upper bound for delivery and cost
MCP - Multiple-Copy-Multiple-Hop. We use 4-copy-4-hop MCP scheme
in most of the cases
— paths of up to four hops are used (corresponding to a flooding algorithm with a
Time-To-Live of 4 hops)
LABEL - Messages are only forwarded to the nodes in the same
community as the destination
•59
Two-Community Case
Cambridge data can be divided into two communities :
— undergraduate year 1 (Group A)
— year 2 (Group B)
Centrality of nodes within each group:
— traffic is created only between members of the same community
— only members in the same community are chosen as relays for
messages
•60
Two-Community Case
Figure (a) shows the individual node centrality when traffic is created from
one group to another
Figure (b) shows the correlation of node centrality within an individual group
and inter-group centrality
Points lie more or less
around the diagonal line:
— the inter- and intra- group centralities are quite well correlated
— active nodes in a group are also active nodes for inter-group communication
•61
Two-Community Case
Comparisons of several algorithms on Cambridge dataset, delivery
and cost:
— BUBBLE achieves almost the same delivery success rate as the 4-copy-4-hop MCP but with only 45% of its cost
•62
Multiple-Community Case
Use the Reality dataset
— There are 8 groups in total within the whole dataset
— Within each individual group, the node centralities demonstrate diversity
similar to the Cambridge case
— First isolate just one group, consisting of 16 nodes, single group case:
• BUBBLE performs very similarly to MCP most of the time and even outperforms
MCP when the message TTL is set to be longer than 1 week (delivery success ratio)
• BUBBLE only has 55% of the cost of MCP
•63
Multiple-Community Case
Comparisons of several algorithms on Reality dataset, all
groups:
— flooding achieves the best for delivery ratio, but the cost is :
• 2.5 times that of MCP
• 5 times that of BUBBLE
— BUBBLE is very close in performance to MCP and even outperforms it
when the time TTL of the messages is allowed to be larger than 2 weeks
— BUBBLE cost is only 50% that of MCP
•64
Multiple-Community Case
PROPHET - Uses the history of encounters and transitivity to calculate the
probability that a node can deliver a message to a particular destination
Comparisons of BUBBLE and PROPHET on Reality dataset:
— BUBBLE achieves a similar delivery ratio to PROPHET, but with only half of the cost
— Similar significant improvements by using BUBBLE are also observed in other datasets,
these demonstrate the generality of the BUBBLE algorithm
•65
Outline
Introduction and basic concepts
Introduce the general ideas of BUBBLE algorithm
Introduce centralised community detection algorithms
Evaluate the BUBBLE algorithm
Show the possibility of a distributed implementation
for BUBBLE
•66
DiBuBB Algorithm
For practical applications we need BUBBLE to be implemented in a
distributed way
— Each device should be able to:
• detect its own community
• calculate its centrality values
Use the distributed K-CLIQUE algorithm to detect local community
(detecting accuracy up to 85% of the centralised one)
Use C-Window to approximate its own global and local centrality
values
— C-Window – cumulative window: calculate the average value
over all previous windows, such as from yesterday to now
Besides that, it operates exactly like BUBBLE
•67
Distributed BUBBLE RAP
Trace analysis conclusions:
— Total degree (unique nodes seen by a node throughout
the experiment period) is not a good approximation of
the node centrality
— The degree per unit time (for example the number of
unique nodes seen per 6 hours) and the node centrality
have a high correlation value
•68
Approximating Centrality
For evaluation of distributed centrality compare RANK,
S-Window and C-Window:
— RANK - a component of BUBBLE, using only centrality information.
Messages are pushed to nodes which have a higher ranking than the current
node, until either they reach the destinations or they expire
— S-Window - when two nodes meet each other, they compare how many
unique nodes they have met in the previous unit-time slot (e.g. 6 hours)
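The window-based estimates can be sketched from one node's contact log. The log format (timestamp in seconds, peer id), the helper names and the sample contacts are assumptions. S-Window is taken as the unique-peer count of the most recent window and C-Window as the average over all windows so far, following the descriptions in these slides.

```python
def unique_per_window(contacts, window_seconds):
    """Unique peers seen in each consecutive window, starting at time 0."""
    if not contacts:
        return []
    last = max(t for t, _ in contacts)
    windows = [set() for _ in range(int(last // window_seconds) + 1)]
    for t, peer in contacts:
        windows[int(t // window_seconds)].add(peer)
    return [len(w) for w in windows]


def s_window(counts):
    """Single window: unique peers met in the most recent window."""
    return counts[-1]


def c_window(counts):
    """Cumulative window: average unique peers over all windows so far."""
    return sum(counts) / len(counts)


SIX_HOURS = 6 * 3600
# Hypothetical contact log of one node: (seconds since start, peer id).
contacts = [(100, "a"), (200, "b"), (300, "a"),
            (22_000, "c"),
            (50_000, "a"), (50_010, "d"), (50_020, "e")]
counts = unique_per_window(contacts, SIX_HOURS)
print(counts, s_window(counts), c_window(counts))
```

When two nodes meet they would compare these values, and the message moves to the node with the higher estimate. The averaged C-Window reacts more slowly to a single busy window than S-Window does.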
•69
Approximating Centrality
S-Window achieves a maximum of 4% improvement in delivery
ratio over RANK, but at double the cost
C-Window does not achieve as good a delivery ratio as RANK (no
more than 10% lower), but it also has lower
cost
•70
Approximating Centrality
C-Window is easy to implement in reality and has similar
delivery and cost to RANK (pre-calculated centrality), which is
why it was chosen for DiBuBB
S-Window, and C-Window can approximate the pre-calculated
centrality quite well
Running a set of RANK emulations on more datasets, but using
the centrality values of the Multiple-Community Case showed
that the delivery ratio and cost of RANK on the new datasets is
as good as in the original dataset
These results imply some level of human mobility predictability,
and show empirically that past contact information can be used
in the future
•71
CONCLUSIONS
It is possible to detect characteristic properties of social grouping
in a decentralised fashion from a diverse set of real world traces
Demonstrated that such characteristics can be effectively used
in forwarding decisions
BUBBLE algorithm has similar delivery ratio, but much lower
resource utilisation than flooding, controlled flooding, and
PROPHET
•72
Papers
BUBBLE Rap: Social-based Forwarding in Delay Tolerant Networks
— Pan Hui, Jon Crowcroft, Eiko Yoneki
Analysis of weighted networks
— M. E. J. Newman
Uncovering the overlapping community structure of complex networks in nature
and society
— Gergely Palla, Imre Derényi, Illés Farkas & Tamás Vicsek
Pocket Switched Networks and Human Mobility in Conference Environments
— Pan Hui, Augustin Chaintreau, James Scott, Richard Gass, Jon Crowcroft and Christophe Diot
How Small Labels create Big Improvements
— Pan Hui, Jon Crowcroft
A Delay-Tolerant Network Architecture for Challenged Internets
— Kevin Fall
Routing in a Delay Tolerant Network
— Sushant Jain, Kevin Fall, Rabin Patra