A Unified Model for Stable and Temporal Topic Detection


A Unified Model for Stable and Temporal Topic Detection from Social Media Data

Hongzhi Yin†, Bin Cui†, Hua Lu‡, Yuxin Huang† and Junjie Yao†
†Peking University  ‡Aalborg University
Outline

- Motivation
- Problem Formulation
- A Basic Solution: A User-Temporal Mixture Model
- Enhancement of the Basic Solution
  - Regularization Technique
  - Burst-Weighted Boosting
- Experiments
- Q/A
Motivation

- Two different types of topics are mixed together on social media platforms such as Twitter, Weibo and Delicious.
- Temporal topics are temporally coherent, meaningful themes. They are time-sensitive and often concern popular real-life events or hot spots, i.e., breaking events in the real world.
- Stable topics often concern users' regular interests and their daily routine discussions, e.g., their moods and statuses.
One Example in Twitter
Temporal Topic: Dead pigs in Shanghai
Stable Topic: Big Data

Another Example in Twitter
Stable Topic: Animal Adoption
Temporal Topic: Independence Day

We can tell the difference between temporal and stable topics from their temporal distributions and their description words.
Problem Formulation

- A user-time-associated document d is a text document associated with a time stamp and a user.
- A temporal topic is a temporally coherent theme. In other words, words that emerge close together in the time dimension are clustered into one topic.
- An example of temporal topics: given a collection of user-time-associated tweets, the desired temporal topics are the events happening at different times.
- Formally, a temporal/stable topic is represented by a word distribution φ over the vocabulary, where Σ_w p(w | φ) = 1.
Problem Formulation (Cont.)

- A topic distribution in the time dimension is the distribution of topics given a specific time interval. Formally, p(z | t) is the probability of temporal topic z given time interval t.
- A topic distribution in the user space is the distribution of topics given a specific user. Formally, p(z | u) is the probability of stable topic z given user u.
Problem Formulation (Cont.)

A User-Time-Keyword Matrix M is a hypermatrix whose three dimensions refer to user, time
and keyword. A cell in M[u, t, w] stores the
frequency of word w generated by user u within
time interval t.

Given a collection of user-time-associated
documents C, we first formulate matrix M
 Detecting
Temporal Topics
 Extracting
13 / 38
Stable Topics
Task 1
Task 2
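The construction of M can be sketched in a few lines. The (user, timestamp, words) tuple format, the fixed-length interval discretization, and all names below are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

def build_matrix(docs, interval):
    """Build the user-time-keyword matrix M as a sparse dict keyed by
    (user, time interval, word). `docs` is assumed to be an iterable of
    (user, timestamp, words) tuples and `interval` the interval length
    in the same unit as the timestamps.
    """
    M = defaultdict(int)
    for user, ts, words in docs:
        t = ts // interval          # discretize the time stamp into an interval index
        for w in words:
            M[(user, t, w)] += 1    # frequency of word w by user u within interval t
    return M

# Toy usage: two users, time intervals of 10 units.
docs = [("alice", 3, ["flu", "news"]),
        ("alice", 4, ["flu"]),
        ("bob", 12, ["flu"])]
M = build_matrix(docs, interval=10)
# M[("alice", 0, "flu")] == 2 and M[("bob", 1, "flu")] == 1
```

A dict keyed by (u, t, w) keeps M sparse, which matters because most user/interval/word combinations never occur.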
Problem Formulation (Cont.)

Detecting a set of temporal topics that are eventdriven.
 Detecting
bursty events, such as disaster (e.g.,
earthquakes), politics (e.g., election), and public
events (e.g., Olympics)
 Analyzing

topic trends
Extracting a set of stable topics that are interestdriven.
 Finding
user intrinsic interests and better
modeling user preference
14 / 38
A User-Temporal Mixture Model

Main Insights
- To find both temporal and stable topics in a unified manner, we propose a topic model that simultaneously captures two observations:
  - Words generated around the same time are more likely to belong to the same event-driven temporal topic.
  - Words generated by the same user are more likely to belong to the same interest-driven stable topic.
- The former helps find event-driven temporal topics, while the latter helps identify interest-driven stable topics.

Combine user and time information

- We assume that when a user u generates a word w at time t, he/she is probably influenced by two factors: the breaking news/events occurring at time t, and his/her intrinsic interests.
- Breaking events are modeled by temporal topics, and user intrinsic interests are modeled by stable topics.

The likelihood that user u generates word w at time t is as follows:

  p(w | u, t) = λ_T Σ_z p(z | t) p(w | z) + λ_S Σ_z p(z | u) p(w | z)

Parameters λ_T and λ_S (with λ_T + λ_S = 1) are mixing weights controlling the choice of motivation factor; they also denote the proportions of temporal topics and stable topics in the dataset. It is worth mentioning that they are learnt automatically, instead of being fixed.
Parameter Estimation

- The log-likelihood of the whole user-time-associated document collection C is

  L(C) = Σ_u Σ_t Σ_w M[u, t, w] log p(w | u, t)

- An E-M algorithm is used to estimate the parameters:
  - E-Step: compute the expectation Q(Θ; Θ^n).
  - M-Step: maximize it; a closed-form solution exists.
- Please refer to the details of the E-M algorithm in Section 4.2 of the paper.
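One such E-M iteration can be sketched as below, assuming dense numpy arrays and K topics of each type. The variable names (pzt, pzu, phiT, phiS, lam) are illustrative, and the exact update formulas are those of Section 4.2 of the paper, not this simplification:

```python
import numpy as np

def em_step(M, pzt, pzu, phiT, phiS, lam):
    """One E-M iteration of the user-temporal mixture model (sketch).

    M:    (U, T, W) word counts, i.e., the user-time-keyword matrix.
    pzt:  (T, K) temporal topic distributions p(z|t).
    pzu:  (U, K) stable topic distributions p(z|u).
    phiT: (K, W) word distributions of temporal topics.
    phiS: (K, W) word distributions of stable topics.
    lam:  mixing weight of the temporal component (learnt, not fixed).
    """
    # E-step: responsibility of each temporal/stable topic for each (u, t, w) cell.
    rT = lam * pzt[None, :, None, :] * phiT.T[None, None, :, :]          # (1, T, W, K)
    rS = (1.0 - lam) * pzu[:, None, None, :] * phiS.T[None, None, :, :]  # (U, 1, W, K)
    norm = rT.sum(-1) + rS.sum(-1) + 1e-12                               # p(w|u,t), (U, T, W)
    rT = rT / norm[..., None]
    rS = rS / norm[..., None]
    # M-step: closed-form updates from expected counts.
    cT = M[..., None] * rT
    cS = M[..., None] * rS
    pzt = cT.sum(axis=(0, 2)) + 1e-12
    pzt /= pzt.sum(axis=1, keepdims=True)
    pzu = cS.sum(axis=(1, 2)) + 1e-12
    pzu /= pzu.sum(axis=1, keepdims=True)
    phiT = cT.sum(axis=(0, 1)).T + 1e-12
    phiT /= phiT.sum(axis=1, keepdims=True)
    phiS = cS.sum(axis=(0, 1)).T + 1e-12
    phiS /= phiS.sum(axis=1, keepdims=True)
    lam = cT.sum() / M.sum()  # proportion of counts explained by temporal topics
    return pzt, pzu, phiT, phiS, lam
```

Iterating this step keeps every distribution normalized and lets lam drift toward the actual share of event-driven words in the data.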
Spatial Regularization

Intuitions
- If two users are connected in the social network space, they are more likely to enjoy the same or similar interests/topics.
- A topic is interest-coherent if the people interested in this topic are also close in the network space.

[Figure: a user whose neighbors are all DB people. Is he/she more likely to be a DB person or an IR person? Intuition: users' interests are similar to their neighbors'.]
Spatial Regularization (Cont.)

Topic Model with Spatial Regularization
- A regularized data likelihood is defined by adding a spatial regularizer to the log-likelihood. The spatial regularizer plays the role of spatial smoothing for user interests.

Parameter Estimation
- E-Step: compute the expectation Q(Θ; Θ^n) of the regularized complete log-likelihood.
- M-Step: maximize, using Newton-Raphson.
- In each iteration, a user's interests are smoothed by his/her spatial neighbors.
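The smoothing effect of the regularizer can be sketched as a simple neighbor-averaging update. This is an illustrative simplification (the paper optimizes the regularized likelihood with Newton-Raphson), and `mu` and all names are assumptions:

```python
def smooth_interests(pzu, neighbors, mu=0.5):
    """Smooth each user's stable-topic distribution p(z|u) toward the
    average distribution of his/her neighbors in the social network.
    A sketch of the regularizer's effect, not the paper's exact update.

    pzu:       dict user -> list of K topic probabilities
    neighbors: dict user -> list of neighboring users
    mu:        smoothing strength in [0, 1] (assumption)
    """
    smoothed = {}
    for u, dist in pzu.items():
        nbrs = [v for v in neighbors.get(u, []) if v in pzu]
        if not nbrs:
            smoothed[u] = list(dist)  # isolated users keep their own interests
            continue
        K = len(dist)
        avg = [sum(pzu[v][k] for v in nbrs) / len(nbrs) for k in range(K)]
        smoothed[u] = [(1.0 - mu) * dist[k] + mu * avg[k] for k in range(K)]
    return smoothed

# A user surrounded by DB-interested neighbors drifts toward the DB topic.
pzu = {"u1": [0.5, 0.5], "u2": [0.9, 0.1], "u3": [0.9, 0.1]}
out = smooth_interests(pzu, {"u1": ["u2", "u3"]}, mu=0.5)
# out["u1"] == [0.7, 0.3]
```

Because the update is a convex combination of valid distributions, each smoothed interest vector still sums to one.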
Insights

- In topic models, words with a high occurrence rate, i.e., popular words, have high probabilities of appearing at the top positions of each discovered topic.
- These popular words are mostly general words denoting abstract concepts. In stable topics, they can illustrate the domain of a topic at first glimpse.
- However, in temporal topics, words with a notable bursty feature are superior at expressing temporal information, since users are more interested in bursty words than in abstract concepts when browsing temporal topics.
Example: Michael Jackson's Death

In this temporal topic, we expect the bursty words "mj", "michael jackson" and "moonwalk" to become the dominant words, rather than the general words "world", "news" and "death". But the general words cannot simply be removed as stop words, since they help illustrate the stable topics.
Burst-Weighted Boosting

- We implement a bursty boosting step to escalate the probability of bursty words during the procedure of detecting temporal topics.
  - We first compute the bursty degree of each word in each time interval (Yao et al., ICDE 2010).
  - A boosting step is then taken after every few E-M iterations.
  - In this step, a word w has its generation probability boosted in a temporal topic only if w's bursty period overlaps with that of the topic.
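The boosting step above can be sketched as follows. The overlap test, the `beta` strength and the renormalization are assumptions of this sketch; bursty degrees are assumed to come from the separate burst-detection step of Yao et al.:

```python
def boost_temporal_topic(phi, burst_degree, burst_period, topic_period, beta=1.0):
    """Escalate the generation probability of bursty words in a temporal
    topic, then renormalize so phi stays a distribution (sketch).

    phi:          dict word -> p(w|z) in the temporal topic
    burst_degree: dict word -> bursty degree (>= 0)
    burst_period: dict word -> (start, end) of the word's bursty period
    topic_period: (start, end) of the temporal topic
    beta:         boosting strength (assumption)
    """
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    boosted = {}
    for w, p in phi.items():
        # Boost w only if its bursty period overlaps the topic's period.
        if w in burst_degree and overlaps(burst_period[w], topic_period):
            boosted[w] = p * (1.0 + beta * burst_degree[w])
        else:
            boosted[w] = p
    z = sum(boosted.values())
    return {w: p / z for w, p in boosted.items()}

# "mj" bursts inside the topic's period and overtakes the general word "news".
phi = {"news": 0.6, "mj": 0.4}
out = boost_temporal_topic(phi, {"mj": 3.0}, {"mj": (5, 9)}, topic_period=(4, 8))
# out["mj"] > out["news"]
```

Renormalizing after the multiplicative boost keeps the topic a proper word distribution while shifting mass from general words to bursty ones.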
Data Sets

- Twitter (Mar. 2009 to Oct. 2009): People on this platform often discuss social events and their daily life. The data set contains 9,884,640 tweets posted by 456,024 users; each user published at least 200 posts. We first removed all the stop words.
- Delicious (Feb. 2008 to Dec. 2009): Delicious is a collaborative tagging system on which users can upload and tag web pages. We collected 200,000 users and their tagging behaviors; the data set contains 7,103,622 tags. Topics on technology and electronics cover more than half of the tags. Breaking news also co-exists.
- Sina Weibo (2011)
Compared Methods

- Our models:
  - BUT is the basic model.
  - EUTS is the model enhanced with spatial regularization.
  - EUTB is the model enhanced with both spatial regularization and burst-weighted boosting.
- PLSA Model on Time Slices (Mei et al., KDD'05)
- Individual Detection Method (Wang et al., KDD'07)
- Topics over Time Model (TOT) (Wang et al., KDD'06)
- TimeUserLDA (Diao et al., ACL'12)
Time Stamp Prediction Comparison

[Bar chart: time-stamp prediction accuracy (0 to 0.8) of the compared methods Individual Detection, TimeUserLDA, TOT, BUT, EUTS and EUTB]
Topic Quality Comparison

Rating scale:
- Excellent: a nicely presented temporal topic
- Good: a topic containing bursty features
- Poor: a topic without obvious bursty features
Stable Topics Detected in Delicious

T 10: windows 0.107, tools 0.048, Freeware 0.038, firefox 0.038, Google 0.029, security 0.015
T 16: resources 0.096, education 0.031, interactive 0.020, Teaching 0.020, science 0.019, tools 0.084
T 27: news 0.034, latest 0.102, Current 0.099, World 0.094, events 0.084, newspaper 0.061
T 55: u.s. 0.049, news 0.081, politics 0.076, Democrats 0.068, international 0.064, obama 0.011
T 8: programming 0.028, python 0.019, Ruby 0.016, javascript 0.015, software 0.014, tutorial 0.016
T 33: food 0.034, recipe 0.033, Cooking 0.030, Dessert 0.026, Shopping 0.021, Home 0.028
Temporal Topics Detected in Delicious

T 77 (1.12-1.31): obama 0.144, inauguration 0.106, bush 0.059, president 0.021, gaza 0.017, whitehouse 0.012
T 78 (6.15-6.27): moon 0.090, Space 0.068, apollo11 0.032, apollo 0.023, nasa 0.018, competition 0.015
T 87 (4.24-5.6): flu 0.158, swineflu 0.078, pandemic 0.062, swine 0.050, health 0.020, disease 0.010
T 89 (5.27-6.6): google 0.061, googlewave 0.059, wave 0.042, bing 0.040, apps 0.040, realtime 0.038
Stable Topics Detected in Twitter

T 5: free 0.020, market 0.011, money 0.010, People 0.007, check 0.007, help 0.004
T 6: free 0.007, iphone 0.006, video 0.006, photo 0.006, camera 0.004, Apple 0.004
T 11: day 0.104, travel 0.009, hotel 0.008, Check 0.006, site 0.004, Golf 0.01
T 53: assassin 0.039, attempt 0.034, wound 0.024, level 0.020, reach 0.016, Account 0.006
T 39: god 0.015, day 0.013, follow 0.010, free 0.009, look 0.008, check 0.006
T 22: teeth 0.035, white 0.027, mom 0.027, yellow 0.023, trick 0.022, free 0.021
Temporal Topics Detected in Twitter

T 63 (7.6-7.15): july 0.012, free 0.010, summer 0.008, live 0.007, potter 0.006, harry 0.006
T 86 (7.1-7.6): july 0.035, happy 0.020, day 0.016, firework 0.009, independ 0.006, celebrate 0.005
T 66 (10.7-10.15): free 0.012, nobel 0.012, prize 0.011, peace 0.008, win 0.008, obama 0.008
T 70 (6.24-6.30): michael 0.038, jackson 0.036, rip 0.007, farrah 0.007, dead 0.005, sad 0.005
Temporal Topic Trends Analysis

[Figures: trend curves of the detected temporal topics over time]
Thank You!
Any Questions?
Email: [email protected]