Trends in Social Network

By: Pranay Agarwal (2008CS50220)
M. Tech Project Presentation
Guides: Amitabha Bagchi, Maya Ramanath







Introduction
Why Twitter?
Filter model for tweets
Social Graph construction
Evolving graphs for topics
Results and Conclusion
Future work
Outline
• 2.4 B online
• Facebook: 1.2 B
• Twitter: 200 M [1]
Age-old query:
“What are people talking about?”
Introduction : Trends




Search Engines
News Media
Online Content
E-Commerce




200 million users, 400 million tweets
every day. [2]
All major news sources, govt. offices, etc.
Fast and real-time
Data exposed via the Twitter API.
Why Twitter
Needle in Haystack
Model Features
Tweet based: Word count, Mentions count, Hash tags count, URLs count
Meta data: Retweet count, Followers count, Source (mobile, web)
NLP tags: NERs, POS
S(t) = Σ_i ( W_i ∗ F_i(t) )
S(t) = score of tweet t
W_i = weight of feature i
F_i(t) = value of feature i for tweet t
Filter Model for tweets
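The weighted sum above can be sketched in a few lines of Python. This is a minimal illustration only: the feature names, weights and values are hypothetical, and the slides do not say how the weights W_i were chosen.

```python
from typing import Mapping


def tweet_score(features: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """S(t) = sum over i of W_i * F_i(t); a feature with no weight contributes 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


# Hypothetical weights and feature values, chosen only for illustration.
weights = {"word_count": 0.1, "urls_count": 1.0, "retweet_count": 0.5}
features = {"word_count": 12, "urls_count": 1, "retweet_count": 4}
score = tweet_score(features, weights)
```

Keeping features and weights as name-keyed mappings makes it easy to add or drop a feature without touching the scoring function.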
Very short tweets (<= 2 words) are always
“chat” tweets.
40% of the “chat” tweets contained at least one stop
word, while only 2% of the “informative”
tweets did.
Stop words = (I, me, mine, you, yours, etc.)

Two Phase model
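The two observations above could be turned into a cheap first pass, sketched below. The slides do not spell out how the two phases are divided, so this assumes phase one applies the hard length rule and stop-word presence is kept as a feature for later scoring. The function names are hypothetical, and the stop-word set contains only the words named on the slide.

```python
# Only the stop words named on the slide; the original list ends in "etc.".
STOP_WORDS = {"i", "me", "mine", "you", "yours"}


def tokenize(text):
    """Lower-case whitespace tokens with surrounding punctuation stripped."""
    return [w.strip(".,!?:;\"'()").lower() for w in text.split()]


def is_too_short(text, max_words=2):
    """True for tweets of at most max_words words."""
    return len(text.split()) <= max_words


def has_stop_word(text):
    """Feature: does the tweet contain any of the listed stop words?"""
    return any(w in STOP_WORDS for w in tokenize(text))


def phase_one_label(text):
    """Return "chat" when the length rule settles it, else None (undecided)."""
    if is_too_short(text):
        return "chat"
    return None
```

Tweets left undecided by `phase_one_label` would then go to whatever second-phase scoring is used, with `has_stop_word` available as one of its inputs.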
Goal :
“Enhance Timesense (Yahoo! proprietary)
capabilities using twitter.”
Tasks :


Marcelo Filter model
Trend prediction using Social Graph.
Yahoo! Proposal
Evolving graph for a topic
The hypothesis was that as the graph evolves, its structure
and topology give rise to patterns that could act as distinct
features to distinguish “trending” topics from “non-trending” topics.



Collecting data from Twitter API.
Storing and processing Social Graph.
Topic-wise clusters of tweets.
Spatio Temporal Analysis of Topic Popularity in
Twitter





Wrote a Python library to communicate with the API.
Collecting friend and follower relations.
Several node instances making calls to the Twitter API
under the normal rate limit. [3]
Frequent outages in the API service caused further delays
and blocking.
Resolve relations first for “good” users, who are likely to be involved in
creating and sharing “informative” tweets.
Twitter API
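The "good users first" ordering under a limited call budget might look like the sketch below. It does not use the Twitter API or the author's library: `fetch_relations`, `is_good` and `budget` are hypothetical stand-ins supplied by the caller, and an outage is modelled as a `ConnectionError`.

```python
import heapq


def crawl_order(user_ids, is_good):
    """Yield user ids with "good" users first, keeping input order within each group."""
    heap = [(0 if is_good(uid) else 1, pos, uid) for pos, uid in enumerate(user_ids)]
    heapq.heapify(heap)
    while heap:
        _, _, uid = heapq.heappop(heap)
        yield uid


def resolve_relations(user_ids, is_good, fetch_relations, budget):
    """Resolve relations for at most `budget` users (e.g. one rate-limit window)."""
    resolved = {}
    for uid in crawl_order(user_ids, is_good):
        if len(resolved) >= budget:
            break
        try:
            # fetch_relations is a hypothetical callable wrapping the real API call.
            resolved[uid] = fetch_relations(uid)
        except ConnectionError:
            # Outage: skip this user for now; a later pass can retry.
            continue
    return resolved
```

Passing the fetch function in as a parameter keeps the ordering logic testable without any network access.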



Node : Each node is a user and also stores several
other details from the user's profile. We also label the users for which
we have resolved all relations.
Edge : A directed edge edge(u → v) represents that user “u”
follows user “v”; edges are stored in an adjacency list.
Index : We need to run many queries that, given a user,
return all of its followers and friends. To make
these queries fast and efficient, we indexed the graph by a
unique key “uid” on all user nodes. This “uid” is
the same as the Twitter user id, which is already present in the
tweet object.
Graph Database Neo4j [4]
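To make the node, edge and index description concrete, here is a plain-Python, in-memory stand-in. It is not the Neo4j API and not the project's schema; the class and method names are hypothetical. It only illustrates adjacency lists keyed by "uid" with a "resolved" label per user.

```python
from collections import defaultdict


class FollowGraph:
    """In-memory follower graph keyed by Twitter user id ("uid")."""

    def __init__(self):
        self.profiles = {}                 # uid -> profile details
        self.resolved = set()              # uids whose relations are fully resolved
        self.friends = defaultdict(set)    # uid -> users that uid follows
        self.followers = defaultdict(set)  # uid -> users following uid

    def add_user(self, uid, **profile):
        self.profiles.setdefault(uid, {}).update(profile)

    def add_follow(self, u, v):
        """Add the directed edge u -> v, meaning user u follows user v."""
        self.add_user(u)
        self.add_user(v)
        self.friends[u].add(v)
        self.followers[v].add(u)

    def mark_resolved(self, uid):
        self.resolved.add(uid)

    def followers_of(self, uid):
        return set(self.followers.get(uid, ()))

    def friends_of(self, uid):
        return set(self.friends.get(uid, ()))
```

Storing both directions doubles the memory for edges but turns "who follows uid" and "whom does uid follow" into single dictionary lookups.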

30 million nodes and 60 million edges.
Our graph is only 10% of the whole Twitter
graph.
Validation
• Almost 95% of the top celebs are present
• Around 60% of the users of the second set
are present in our graph

This is a strong indication that our graph mostly
contains “active” and “good” users, while a
significant fraction of Twitter users could be “inactive”.


“Timesense”, a Yahoo! proprietary service, returns a
list of topic search queries along with a “buzz” score that
indicates their “trendiness”.
The search queries returned by Timesense are not
clustered together, which means different search queries
related to the same event are returned as separate queries:
mark appel houston mlb draft
mark appel and pat appel
mark appel 2013 mlb draft
mark appel contract
what high school did mark appel go to
mark appel major stanford
mark appel stanford baseball baseball player
mark appel of stanford
Tweets Clustering



Users on Twitter use different variants of the same topic.
We implemented a bi-gram matching algorithm to
cluster together search queries like these.
Make one pass over all the public tweets and fetch a tweet if it
contains any of the bi-gram term pairs.
N gram Matching
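A minimal sketch of the bi-gram idea follows. It is one plausible reading, not the author's implementation: queries are grouped when they share at least one bi-gram, and a tweet is fetched when both words of some bi-gram occur anywhere in it. All function names are hypothetical.

```python
def tokens(text):
    return text.lower().split()


def bigrams(text):
    """Set of adjacent word pairs in the text."""
    toks = tokens(text)
    return set(zip(toks, toks[1:]))


def cluster_queries(queries):
    """Group queries that share at least one bi-gram (transitively).

    Returns a list of (bigram_set, query_list) pairs.
    """
    clusters = []
    for q in queries:
        grams = bigrams(q)
        merged_grams, merged_queries = set(grams), [q]
        remaining = []
        for c_grams, c_queries in clusters:
            if c_grams & grams:
                merged_grams |= c_grams
                merged_queries = c_queries + merged_queries
            else:
                remaining.append((c_grams, c_queries))
        remaining.append((merged_grams, merged_queries))
        clusters = remaining
    return clusters


def tweet_matches(tweet, grams):
    """True if both terms of any bi-gram appear in the tweet."""
    words = set(tokens(tweet))
    return any(a in words and b in words for a, b in grams)
```

With the Timesense queries listed earlier, every query containing "mark appel" lands in one cluster because they all share that bi-gram.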
“Topics that are going to become very popular
witness intense discussion within communities at
first. When the level of intensity rises then the
users who bridge communities enter the
discussion in a big way causing a merging of
what were earlier disjoint discussions.”
Evolving Graphs



The vertex set V_0^t of G_0^t comprises the users who tweet about t
in window 0 (the edge set is empty).
The vertex set V_i^t of G_i^t is the set of all users who have tweeted
about t in windows 0 through i.
An edge(u ← v) is added to E_i^t if u ∈ V (the current set) and v has
tweeted about t in window i.
Window = 30 minutes.
Algorithm
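The construction above can be sketched as follows. Two assumptions go beyond the slide: edge(u ← v) is read as "v follows u", matching the edge notation used for the social graph, and the "current set" is taken to include users who tweet in the current window. Function and parameter names are hypothetical.

```python
WINDOW_SECONDS = 30 * 60  # 30-minute windows, as on the slide


def evolving_graphs(tweets, follows, window=WINDOW_SECONDS):
    """Build one (vertices, edges) snapshot per window for a single topic t.

    tweets:  iterable of (user, timestamp_in_seconds) pairs about t.
    follows: set of (follower, followee) pairs from the social graph.
    """
    tweets = sorted(tweets, key=lambda t: t[1])
    if not tweets:
        return []
    start = tweets[0][1]
    by_window = {}
    for user, ts in tweets:
        by_window.setdefault(int((ts - start) // window), set()).add(user)

    vertices, edges, snapshots = set(), set(), []
    for i in range(max(by_window) + 1):
        active = by_window.get(i, set())
        vertices |= active
        if i > 0:  # window 0 has an empty edge set
            for v in active:
                for u in vertices:
                    # Assumption: edge(u <- v) means v follows u.
                    if u != v and (v, u) in follows:
                        edges.add((v, u))
        snapshots.append((frozenset(vertices), frozenset(edges)))
    return snapshots
```

The inner loop is quadratic in the worst case; for large topics one would iterate over each active user's friend list instead, which the sketch omits for brevity.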
Topic no. | Trend Type | Topic
1 | High | IRS Scandal of Obama administration
2 | High | Angelina Jolie going through mastectomy
3 | Low | Cannes Film Festival 2013
4 | Low | The Great Gatsby (Movie 2013)
Experiments
[Figure: first row shows high trends; second row shows low trends.]



Largest component size increases for all the topics, but the
increase is much more significant for topics 1
and 2.
Topics 1 and 2 contain most of their nodes in the largest component.
The low C1/C2 ratio for topics 3 and 4 shows that there are many
small independent clusters of communities discussing among
themselves without merging into a large component.
Topic | C1 | C1/C2
1 | 1732 | 346
2 | 974 | 44
3 | 127 | 8
4 | 22 | 3
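The slides do not define C1 and C2. The sketch below assumes they are the sizes of the largest and second-largest connected components, with edge direction ignored (weak connectivity), and shows how both could be computed from a graph snapshot. Names are hypothetical.

```python
from collections import defaultdict, deque


def component_sizes(vertices, edges):
    """Sizes of weakly connected components, largest first.

    Assumes every edge endpoint is also listed in `vertices`.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    seen, sizes = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def c1_and_ratio(vertices, edges):
    """Return (C1, C1/C2); the ratio is None when there is only one component."""
    sizes = component_sizes(vertices, edges)
    c1 = sizes[0] if sizes else 0
    ratio = c1 / sizes[1] if len(sizes) > 1 else None
    return c1, ratio
```

Under this reading, a large C1/C2 means one component dominates, while a ratio close to 1 means several similarly sized groups are discussing the topic separately.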


Users who bridge communities enter the discussion.
Bridge users serve as a barometer of the topic's rising
popularity.
External edges = |{(u → v) : u ∈ S, v ∈ V \ S}|
Total edges = |{(u → v) : u ∈ S}|
φ(S) = External edges / Total edges
Conductance
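The conductance definition above translates directly into code. This sketch assumes edges are given as (u, v) pairs with every endpoint in V, and returns 0.0 when S has no outgoing edges, which is a convention chosen here rather than something stated in the slides.

```python
def conductance(edges, S):
    """phi(S) = |{(u -> v): u in S, v not in S}| / |{(u -> v): u in S}|."""
    S = set(S)
    total = sum(1 for u, _ in edges if u in S)
    external = sum(1 for u, v in edges if u in S and v not in S)
    # Convention (an assumption): no outgoing edges from S gives conductance 0.
    return external / total if total else 0.0


# Small illustrative graph: two of the three edges leaving S = {1, 2} exit S.
example_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
phi = conductance(example_edges, {1, 2})
```

As more of a community's followers join the discussion, external edges become internal ones and φ(S) falls, which is the signal the bridge-user argument relies on.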
Topic No. | Nature | Topic
1 | Very Low | Mattapoisett Car accident
2 | Mild | Jeep Patriot new model car
3 | Trend | Mayor Bloomberg
4 | High trend | Mark Appel
• Collecting tweet clusters for these topics using the above algorithm
• Resolving all relations for all the authors of these tweets.
Resolve Missing Edges
[Figure: first row shows low trends; second row shows high trends.]


The table shows a higher conversion of external to internal
edges for trending topics, which means more behavioral
influence and spreading to followers in the case of trends.
The largest connected component contains around 15% and 35%
of all users for topics 1 and 2 respectively, while
for topic 3 it is 80% and for topic 4 more than 90%. This
strongly supports the hypothesis that, for trending
topics, users form a large connected community.
Topic | E/N | L/N % | Fall in conductance
1 | 0.09 | 8.7 | 0.3
2 | 0.38 | 36.7 | 0.1
3 | 0.83 | 82.1 | 0.5
4 | 0.98 | 97.98 | 1.2

Resolve all relations?
Maybe not.
[Figure annotations: 0.6 drop; 0.1 drop]
Limitations





Identify good “sensors” or users
Resolve as many relations as possible
Better topic detection and clustering from
tweets.
Efficient graph-processing data structure
Interrelation of topics
Future Work
Thanks





[1] http://www.internetworldstats.com/stats.htm
[2] Twitter official blog.
[3] Twitter API. 2013.
[4] Neo4j. http://www.neo4j.org. 2013.
S. Ardon, A. Bagchi, A. Mahanti, A. Ruhela, A. Seth, R. M. Tripathy, and S. Triukose. Spatiotemporal analysis of topic popularity in Twitter. CoRR, abs/1111.2904, 2011.
References