PPT - NetLab@SYSU


TwitterRank: Finding Topic-sensitive Influential Twitterers
Presenter: 吴伟涛
Outline
1. Introduction
2. Dataset
3. Topic modeling and Homophily in Twitter
4. TwitterRank
5. Experiment and results
6. Conclusions
Introduction
Motivation
The number of followers is the main metric used to identify
influential twitterers, but a twitterer's influence may vary
across topics.
Solution
Identify influential twitterers taking both the topical
similarity between users and the link structure into
account.
Introduction
Two contributions of this paper:
1. First to report homophily in Twitter.
2. Introduces TwitterRank to measure the topic-sensitive influence of twitterers.
Twitter Dataset
1. Obtain a set of the top-1000 Singapore-based twitterers. Denote the set as S; |S| = 996.
2. Crawl all the followers and friends of each s ∈ S and store them in set S'.
3. Let S'' = S ∪ S', and S* = {s | s ∈ S'' and s is from Singapore}; |S*| = 6748. For each s ∈ S*, crawl all the tweets she has published so far. Denote the set of tweets as T; |T| = 1,021,039.
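The set construction above can be sketched with Python sets. This is only an illustration: the user IDs, the `followers` and `friends` lookups, and the `location` field are hypothetical stand-ins for crawled data, not part of the original study.

```python
# Toy stand-ins for crawled data (all names and values are hypothetical).
followers = {"a": {"c", "d"}, "b": {"a"}}   # user -> users who follow them
friends = {"a": {"b"}, "b": {"e"}}          # user -> users they follow
location = {"a": "Singapore", "b": "Singapore", "c": "Singapore",
            "d": "Jakarta", "e": "Singapore"}

S = {"a", "b"}                               # seed set of top twitterers

# S': all followers and friends of every seed user.
S_prime = set()
for s in S:
    S_prime |= followers.get(s, set()) | friends.get(s, set())

# S'' = S ∪ S'; S* keeps only the Singapore-based users.
S_double_prime = S | S_prime
S_star = {s for s in S_double_prime if location.get(s) == "Singapore"}
print(sorted(S_star))
```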
Tweet Distribution
Reciprocity in Following Relationships
 72.4% of the twitterers follow more than
80% of their followers
 80.5% of the twitterers have 80% of their
friends follow them back
 Casual following or homophily?
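A reciprocity figure like the second one above could be computed roughly as follows. This is a minimal sketch on a hypothetical edge list; the function name `follow_back_ratio` and the toy data are assumptions for illustration only.

```python
# Toy follow edges (follower, friend); hypothetical data for illustration.
edges = {("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("b", "c"), ("d", "a")}

def follow_back_ratio(user, edges):
    """Fraction of `user`'s friends who also follow `user` back."""
    friends = {v for (u, v) in edges if u == user}
    if not friends:
        return None  # undefined for users who follow nobody
    back = {v for v in friends if (v, user) in edges}
    return len(back) / len(friends)

users = {u for edge in edges for u in edge}
ratios = {u: follow_back_ratio(u, edges) for u in users}

# Share of users for whom at least 80% of friends follow back.
defined = [r for r in ratios.values() if r is not None]
share_high = sum(r >= 0.8 for r in defined) / len(defined)
```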
Homophily in Twitter
Q1: Are twitterers with “following” relationships
more similar than those without according to
the topics they are interested in?
Q2: Are twitterers with reciprocal “following”
relationships more similar than those without
according to the topics they are interested in?
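1. Topic modeling
2. Define the distance: dist(i, j)
3. Compute the mean distances: $\mu_{follow}$ and $\mu_{nofollow}$; $\mu_{sym}$ and $\mu_{asym}$
4. Test: is $\mu_{follow} < \mu_{nofollow}$? Is $\mu_{sym} < \mu_{asym}$?
5. Conclusion: homophily?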
Topic Modeling
Goal:
Automatically identify the topics that twitterers are
interested in based on the tweets they published.
The Latent Dirichlet Allocation (LDA) model is applied.
Topic Modeling
LDA-based generative process for generating a document:
1. For each word in the document, pick a topic from the document's distribution over topics.
2. Sample a word from the distribution over words associated with the chosen topic.
3. Repeat the process for all the words in the document.
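The three steps can be sketched in a few lines of Python. This is only an illustration of the sampling loop: the vocabulary, the two topics, and the function name `generate_document` are hypothetical, and the distributions are fixed by hand here, whereas full LDA draws them from Dirichlet priors.

```python
import random

def generate_document(doc_topic_dist, topic_word_dists, vocab, n_words, rng):
    """Generate one document by the LDA-style process described above."""
    words = []
    topics = list(range(len(doc_topic_dist)))
    for _ in range(n_words):
        # Step 1: pick a topic from the document's distribution over topics.
        z = rng.choices(topics, weights=doc_topic_dist, k=1)[0]
        # Step 2: sample a word from the chosen topic's distribution over words.
        w = rng.choices(vocab, weights=topic_word_dists[z], k=1)[0]
        words.append(w)
    # Step 3 is the loop itself: repeat for every word in the document.
    return words

vocab = ["goal", "match", "vote", "party"]
topic_word_dists = [
    [0.5, 0.5, 0.0, 0.0],  # hypothetical "sports" topic
    [0.0, 0.0, 0.5, 0.5],  # hypothetical "politics" topic
]
rng = random.Random(0)
# A document that is entirely about the first topic.
doc = generate_document([1.0, 0.0], topic_word_dists, vocab, 5, rng)
```

Topic modeling runs this process in reverse: given only the observed words, it infers the topic assignments and distributions.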
Topic Modeling Results
1. DT: a D×T matrix
D: the number of users
T: the number of topics
$DT_{ij}$: the number of times a word in user $s_i$'s tweets has been assigned to topic $t_j$.
Topic Modeling
We first row-normalize the DT matrix as DT' such that $\|DT'_{i\cdot}\|_1 = 1$ for each row $DT'_{i\cdot}$. Each row of DT' is then the probability distribution of twitterer $s_i$'s interest over the T topics, i.e., each element $DT'_{ij}$ captures the probability that twitterer $s_i$ is interested in topic $t_j$.
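As a small worked example of the row normalization, the sketch below uses a hypothetical 2×3 count matrix. The handling of an all-zero row is an assumption added for robustness, not something the slides specify.

```python
def row_normalize(dt):
    """Return DT' where each row of DT is divided by its L1 norm."""
    normalized = []
    for row in dt:
        total = sum(row)
        if total == 0:
            # Assumption: a user with no topic assignments gets a uniform row.
            normalized.append([1.0 / len(row)] * len(row))
        else:
            normalized.append([count / total for count in row])
    return normalized

# Hypothetical counts: 2 users, 3 topics.
dt = [[8, 2, 0], [1, 1, 2]]
dt_prime = row_normalize(dt)
```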
Topic Difference
Definition 1: the topical difference between two twitterers $s_i$ and $s_j$ can be calculated as:
$dist(i, j) = \sqrt{2 \cdot D_{JS}(i, j)}$
$D_{JS}(i, j)$ is the Jensen-Shannon divergence between the two probability distributions $DT'_{i\cdot}$ and $DT'_{j\cdot}$, which is defined as:
$D_{JS}(i, j) = \frac{1}{2}\left(D_{KL}(DT'_{i\cdot} \,\|\, M) + D_{KL}(DT'_{j\cdot} \,\|\, M)\right)$
Topic Difference
M is the average of the two probability distributions, i.e.
$M = \frac{1}{2}\left(DT'_{i\cdot} + DT'_{j\cdot}\right)$
$D_{KL}$ is the Kullback-Leibler divergence, which defines the divergence from distribution Q to P as:
$D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$
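The definitions of the topical difference can be put together as in the sketch below, taking dist(i, j) as the square root of twice the Jensen-Shannon divergence. The natural logarithm is an assumption, since the slides do not state the base, and the function names are illustrative.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); terms with P(i) = 0 add 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence via the average distribution M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))

def topical_difference(p, q):
    """dist = sqrt(2 * D_JS); clamp at 0 to guard against float round-off."""
    return math.sqrt(max(0.0, 2 * js_divergence(p, q)))

same = topical_difference([0.5, 0.5], [0.5, 0.5])      # identical interests
disjoint = topical_difference([1.0, 0.0], [0.0, 1.0])  # no shared topics
```

Because M is positive wherever either distribution is, the KL terms never divide by zero, which is one practical reason to prefer the Jensen-Shannon form over raw KL divergence.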
Hypothesis Testing
* Note that hypothesis testing, as well as topic distillation, is applied to the set of twitterers who have published more than 10 tweets in total. We denote this set as $S^*_u$; $|S^*_u| = 4050$.
Hypothesis Testing (I)
Formalize Q1 as a two-sample t-test:
$\mu_{follow}$: the mean topical difference of the pairs of users with a “following” relationship.
$\mu_{nofollow}$: the mean topical difference of those without.
$H_0: \mu_{follow} = \mu_{nofollow}$
$H_1: \mu_{follow} < \mu_{nofollow}$
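For intuition, the test statistic for such a comparison might be computed as below. This sketch uses Welch's unequal-variance form of the two-sample t-test, which is an assumption, since the slides do not say which variant was used. The two samples are made-up topical differences, and the function name `welch_t` is illustrative.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Hypothetical topical differences for pairs with and without "following".
dist_follow = [0.2, 0.3, 0.25, 0.35]
dist_nofollow = [0.6, 0.7, 0.65, 0.75]
t_stat, dof = welch_t(dist_follow, dist_nofollow)
```

A strongly negative t statistic supports the one-sided alternative that the mean difference is smaller for "following" pairs; turning it into a p-value requires the t distribution with `dof` degrees of freedom.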
Hypothesis Testing (I)
Result:
The null hypothesis $H_0$ is rejected at significance level $\alpha = 0.01$.
Hypothesis Testing (II)
Formalize Q2 as a two-sample t-test:
$\mu_{sym}$: the mean topical difference of the pairs of users with a reciprocal following relationship.
$\mu_{asym}$: the mean topical difference of the pairs of users with only a one-directional relationship.
$H_0: \mu_{sym} = \mu_{asym}$
$H_1: \mu_{sym} < \mu_{asym}$
Hypothesis Testing (II)
Result:
The null hypothesis $H_0$ is rejected at significance level $\alpha = 0.01$.
Implication
The homophily phenomenon does exist:
- The answer to Q1 is yes.
- The answer to Q2 is also yes.
- There are twitterers who are serious about whom they follow.
Topic-specific TwitterRank

- A topic-specific random walk model is applied to calculate each user's influence score.
- The transition matrix for topic t is denoted as $P_t$. The transition probability of the surfer from follower $s_i$ to friend $s_j$ is:
$P_t(i, j) = \frac{|T_j|}{\sum_{a:\, s_i \text{ follows } s_a} |T_a|} \cdot sim_t(i, j)$
$sim_t(i, j) = 1 - |DT'_{it} - DT'_{jt}|$
where $|T_j|$ is the number of tweets published by $s_j$.
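The transition probabilities can be sketched as follows, reading |T_j| as the number of tweets published by s_j and summing over the users that s_i follows. The three-user graph, the tweet counts, and the function name `transition_matrix` are hypothetical.

```python
def transition_matrix(follows, tweet_counts, dt_prime, topic):
    """Build P_t, where follows[i] is the set of users that user i follows."""
    n = len(tweet_counts)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Total tweets published by all of user i's friends.
        total = sum(tweet_counts[a] for a in follows[i])
        if total == 0:
            continue  # user i follows nobody (or only silent users)
        for j in follows[i]:
            # Topical similarity of i and j on this topic.
            sim = 1.0 - abs(dt_prime[i][topic] - dt_prime[j][topic])
            p[i][j] = tweet_counts[j] / total * sim
    return p

follows = [{1, 2}, {2}, set()]          # user 0 follows 1 and 2; user 1 follows 2
tweet_counts = [10, 30, 10]
dt_prime = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
p0 = transition_matrix(follows, tweet_counts, dt_prime, topic=0)
```

Friends who tweet more and who share the follower's interest in the topic receive a larger share of the transition probability. Because of the similarity factor, the rows of this matrix generally sum to less than 1.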
Topic-specific TwitterRank

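- Topic-specific teleportation:
$E_t = DT''_{\cdot t}$, the t-th column of the column-normalized DT matrix $DT''$.
- The influence scores of twitterers are calculated iteratively:
$TR_t = \gamma P_t \times TR_t + (1 - \gamma) E_t$
where $\gamma \in [0, 1]$ controls the probability of teleportation.
- Aggregation of topic-specific TwitterRank:
$TR = \sum_t r_t \cdot TR_t$
where $r_t$ is the weight assigned to topic t.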
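A minimal sketch of the iteration and the aggregation is given below. Several choices are assumptions rather than details from the slides: the value `gamma=0.85`, a fixed iteration count instead of a convergence check, and applying the transpose of P_t so that score flows from each follower to the friends they follow. The function names and toy inputs are hypothetical.

```python
def twitterrank(p, e, gamma=0.85, iterations=100):
    """Iterate TR_t = gamma * P_t^T TR_t + (1 - gamma) * E_t, starting at E_t."""
    n = len(e)
    tr = list(e)
    for _ in range(iterations):
        # User j collects score from every follower i, weighted by p[i][j].
        tr = [
            gamma * sum(p[i][j] * tr[i] for i in range(n)) + (1 - gamma) * e[j]
            for j in range(n)
        ]
    return tr

def aggregate(topic_scores, weights):
    """TR = sum_t r_t * TR_t for per-topic score vectors and weights r_t."""
    n = len(topic_scores[0])
    return [sum(r * tr[j] for r, tr in zip(weights, topic_scores)) for j in range(n)]

# Hypothetical topic-specific inputs for three users.
p_t = [[0.0, 0.6, 0.125], [0.0, 0.0, 0.7], [0.0, 0.0, 0.0]]
e_t = [0.5, 0.3, 0.2]
tr_t = twitterrank(p_t, e_t)
overall = aggregate([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])
```

In this toy graph the third user ends up with the highest score despite the smallest teleportation weight, because both other users pass score to it.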
Comparison with other Algorithms
- Comparison to:
  - In-degree
  - PageRank
  - Topic-sensitive PageRank
- Comparison in a recommendation scenario.
Recommendation task
[Figure: the recommendation task, with labels S_t, s_f, s_0, and L]
Evaluation

Assume A is a ranked list recommended by any of the algorithms. Let $A(s_i)$ be the rank of $s_i$ in A. The quality of the recommendation, Q(A), is measured as
$Q(A) = |\{s_i \mid s_i \in S_t \text{ and } A(s_i) < A(s_f)\}|$
The lower the value of Q(A), the higher the quality of the corresponding algorithm.
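The measure Q(A) amounts to counting how many members of S_t are ranked ahead of s_f. The sketch below assumes every user in S_t, as well as s_f, appears in the ranked list; the user names and the function name are made up.

```python
def recommendation_quality(ranked, s_t, s_f):
    """Q(A): number of users in S_t ranked strictly ahead of s_f in list A."""
    rank = {user: position for position, user in enumerate(ranked)}
    return sum(1 for s in s_t if rank[s] < rank[s_f])

# Hypothetical ranked list produced by some algorithm.
ranked = ["u3", "u1", "sf", "u2"]
q = recommendation_quality(ranked, s_t={"u1", "u2", "u3"}, s_f="sf")
```

A value of 0 would mean the algorithm ranked s_f ahead of every user in S_t.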
Criteria to generate L set

- The number of followers that $s_f$ has.
- The number of tweets that $s_f$ has published.
- The topical difference between $s_0$ and $s_f$.
- Whether the relationship between $s_0$ and $s_f$ is reciprocal.
Experiment Results
All algorithms perform better in $L_{df}$ than in $L_{dh}$:
- There are twitterers who “follow” because of the topical similarity between them and their friends. This supports the homophily phenomenon.
TR is outperformed in $L_{fh}$, $L_{tl}$, and $L_{dh}$:
- InD performs the best in $L_{fh}$. This is because twitterers' “following” behaviors are already biased toward those with more followers.
Experiment Results
- TR performs the worst in $L_{tl}$, because LDA-based topic distillation needs more content to achieve reasonable accuracy.
- TR outperforms all the other algorithms except InD in $L_{dh}$. There still exist some twitterers who do not “follow” based on topical similarity, although homophily is observed.
Conclusion and future work

Homophily does exist:
- Not all users “follow” at random.
Future work:
- Make the algorithm more robust to manipulation, e.g., purposely publishing a large number of tweets.
- Classify different categories of users by studying their following behaviors more closely.
- Incremental topic distillation and event detection.
Thank you