Transcript slides

Inferring User Political Preferences from Streaming Communications
Svitlana Volkova1, Glen Coppersmith2 and Benjamin Van Durme1,2
1Center for Language and Speech Processing
2Human Language Technology Center of Excellence
ACL 2014, Baltimore
Motivation
• Personalized, diverse and timely data
• Can reveal user interests, preferences and opinions
DemographicsPro – http://www.demographicspro.com/
WolframAlpha Analytics – http://www.wolframalpha.com/facebook/
Applications
• Large-scale passive polling and real-time live polling
• Online advertising
• Healthcare analytics
• Personalized recommendation systems and search
User Attribute Prediction
Communications
Political Preference: Rao et al., 2010; Conover et al., 2011; Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013
Age: Rao et al., 2010; Zamal et al., 2012; Cohen and Ruths, 2013; Nguyen et al., 2011, 2013
…
Gender: Garera and Yarowsky, 2009; Rao et al., 2010; Burger et al., 2011; Van Durme, 2012; Zamal et al., 2012; Bergsma and Van Durme, 2013
Existing Approaches ~1K Tweets*
Tweets as a document
Does an average Twitter user produce thousands of tweets?
*Rao et al., 2010; Conover et al., 2011; Pennacchiotti and Popescu,
2011a; Burger et al., 2011; Zamal et al., 2012; Nguyen et al., 2013
How Active are Twitter Users?
http://www.digitalbuzzblog.com/visualizing-twitter-statistics-x100/
Real-World Predictions
Active users (1,000+ tweets): 10%
Average Twitter users (median = 10 tweets per day): 50%
Private users (no content): 20%
Not active users (no or limited content): 20%
Our Approach
1. Take advantage of user local neighborhoods (real-world batch predictions)
2. Incremental dynamic real-time predictions (streaming predictions)
Attributed Social Network
User Local Neighborhoods a.k.a. Social Circles
Twitter Network Data
Code, data and trained models for gender, age, political preference prediction
http://www.cs.jhu.edu/~svitlana/
Twitter Social Graph
I. Candidate-Centric: 1,031 users of interest
II. Geo-Centric: 270 users
III. Politically Active*: 371 users
Neighbor types: friend, follower, reply, hashtag, retweet, @mention
10 - 20 neighbors of each type per user
~50K nodes, ~60K edges
What types of neighbors lead to the best attribute prediction for a given user?
*Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013
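The typed neighborhoods on this slide can be pictured as an adjacency map keyed by user and relation. The sketch below is only an illustration under that assumption: the class name `SocialGraph`, its methods and the relation label `mention` (for @mention) are hypothetical and are not taken from the released data or code.

```python
from collections import defaultdict

# Relation types named on the slide; "mention" stands for @mention.
RELATIONS = {"friend", "follower", "reply", "hashtag", "retweet", "mention"}

class SocialGraph:
    """Hypothetical container for typed user neighborhoods."""

    def __init__(self):
        # user id -> relation -> neighbor ids, kept in insertion order
        self._adj = defaultdict(lambda: defaultdict(list))

    def add_edge(self, user, neighbor, relation):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        if neighbor not in self._adj[user][relation]:
            self._adj[user][relation].append(neighbor)

    def neighbors(self, user, relation, limit=None):
        """Return up to `limit` neighbors of one type (all if limit is None)."""
        found = self._adj.get(user, {}).get(relation, [])
        return list(found if limit is None else found[:limit])

# Toy usage with made-up user ids.
graph = SocialGraph()
graph.add_edge("user_1", "user_2", "friend")
graph.add_edge("user_1", "user_3", "friend")
graph.add_edge("user_1", "user_4", "retweet")
friend_circle = graph.neighbors("user_1", "friend", limit=10)
```

Keeping one list per relation makes it cheap to compare neighborhood types for the same user, which is the question the slide poses.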
Experiments
• Log-linear binary unigram models: (I) Users vs. (II) Neighbors and (III) Both
• Evaluate the relative utility of different neighborhood types:
– varying neighborhood size n=[1, 2, 5, 10] and content amount t=[5, 10, 15, 25, 50, 100, 200]
– 10-fold cross validation with 100 random restarts for every n and t parameter combination
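A minimal sketch of what a log-linear model over binary unigram features could look like, with the three feature settings (user, neighbor, both) and the n and t sampling knobs. Everything here is an assumption made for illustration: whitespace tokenization, a vocabulary shared between user and neighbor content, plain SGD with L2 regularization, and all function names. It is not the authors' released code, and cross validation is left out for brevity.

```python
import math
import random

def binary_unigrams(tweets):
    """Set of lowercased whitespace tokens: presence only, no counts."""
    return {tok for tweet in tweets for tok in tweet.lower().split()}

def sample_neighborhood(tweets_by_neighbor, n, t, rng):
    """Pick up to n neighbors, then up to t tweets from each of them."""
    ids = sorted(tweets_by_neighbor)
    content = []
    for nb in rng.sample(ids, min(n, len(ids))):
        tweets = tweets_by_neighbor[nb]
        content.extend(rng.sample(tweets, min(t, len(tweets))))
    return content

def features(user_tweets, neighbor_tweets, mode):
    """mode is 'user', 'neighbor' or 'both' (settings I, II, III)."""
    feats = set()
    if mode in ("user", "both"):
        feats |= binary_unigrams(user_tweets)
    if mode in ("neighbor", "both"):
        feats |= binary_unigrams(neighbor_tweets)
    return feats

def sigmoid(z):
    # Split on the sign of z so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train(examples, epochs=50, lr=0.5, l2=1e-3, seed=0):
    """Fit logistic regression by SGD; examples are (feature_set, label in {0, 1})."""
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for feats, label in data:
            err = label - sigmoid(bias + sum(weights.get(f, 0.0) for f in feats))
            bias += lr * err
            for f in feats:
                w = weights.get(f, 0.0)
                weights[f] = w + lr * (err - l2 * w)
    return weights, bias

def predict_proba(model, feats):
    """Probability of label 1 under the fitted model."""
    weights, bias = model
    return sigmoid(bias + sum(weights.get(f, 0.0) for f in feats))

# Toy usage with made-up tweets; labels 1 and 0 stand for the two classes.
rng = random.Random(1)
neighbors = {"n1": ["lower taxes now", "taxes again"], "n2": ["cut taxes"]}
train_set = [
    (features(["vote red"], sample_neighborhood(neighbors, 2, 1, rng), "both"), 1),
    (features(["vote blue"], ["expand healthcare"], "both"), 0),
]
model = train(train_set)
```

Re-running `sample_neighborhood` with different seeds is one way to mimic random restarts for each (n, t) combination.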
Neighborhood Comparison
[Figure: Accuracy (y-axis, 0.50 to 0.75) vs. log(Tweets Per Neighbor) (x-axis, 5 to 200) in two panels, 1 Neighbor and 10 Neighbors. Series: Friend, Follower, Hashtag, Usermention, Retweet, Reply, User.]
Optimizing Twitter API Calls
Cand-Centric Graph: Friend Circle
Summary: Batch Real-World Predictions with Limited User Data
More data is better. How to get it?
• More neighbors per user > additional content from the existing neighbors
What kind of data?
• Follower, friend, @mention, retweet
• Users who recently joined Twitter
• No or limited access to user tweets
Real-world predictions: no or very limited content!
Our Approach
1. Take advantage of user local neighborhoods
2. Incremental dynamic real-time predictions (streaming predictions)
Iterative Bayesian Predictions
Time axis: t_0 (?), t_1, t_2, …, t_{k-1}, t_k
P_{t_1}(R \mid T_{t_1}) = 0.52
P_{t_2}(R \mid T_{t_2}) = 0.62
P_{t_{k-1}}(R \mid T_{t_{k-1}}) = 0.65
P_{t_k}(R \mid T_{t_k}) = 0.77
\[
P(a = D \mid T)
= \frac{P(T \mid a = D) \, P(a = D)}{\sum_{y \in A} P(T \mid a = y) \, P(a = y)}
= \frac{\prod_{k} P(t_k \mid a = D) \, P(a = D)}{\sum_{y \in A} \prod_{k} P(t_k \mid a = y) \, P(a = y)}
\]
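The update above can be applied one tweet at a time: the posterior after tweet t_k becomes the prior for tweet t_{k+1}. The sketch below assumes the per-tweet likelihoods P(t_k | a = y) are already available (for example from a unigram model); the function names and the toy likelihood values are hypothetical.

```python
def update_posterior(prior, likelihoods):
    """One Bayes step over attribute values.

    prior[y] is P(a = y) before the tweet; likelihoods[y] is P(t_k | a = y)
    for the newly observed tweet t_k.
    """
    unnorm = {y: prior[y] * likelihoods[y] for y in prior}
    total = sum(unnorm.values())
    return {y: value / total for y, value in unnorm.items()}

def stream_posteriors(prior, tweet_likelihoods):
    """Yield the posterior after each tweet in the stream."""
    belief = dict(prior)
    for likelihoods in tweet_likelihoods:
        belief = update_posterior(belief, likelihoods)
        yield belief

# Toy usage: two attribute values and made-up per-tweet likelihoods.
prior = {"R": 0.5, "D": 0.5}
tweets = [{"R": 0.6, "D": 0.4}, {"R": 0.6, "D": 0.4}, {"R": 0.3, "D": 0.7}]
beliefs = list(stream_posteriors(prior, tweets))
```

Normalizing after every tweet gives the same result as the batch product in the formula, and it keeps the numbers from underflowing on long streams.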
Cand-Centric Graph: Belief Updates
[Figure: p(Republican|T) (y-axis, 0.0 to 1.0) against the Tweet Stream (T) (x-axis, 0 to 60) for example users, with a time axis t_0, t_1, t_2, …, t_{k-1}.]
Cand-Centric Graph: Prediction Time
[Figure: Users (y-axis) vs. Time in Weeks (x-axis; one panel uses weeks on a log scale). Series: User Stream, User-Neighbor. Graphs: Cand, Geo, Active. Annotations: 75%, 95%, 100 users, 75% confidence.]
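Prediction time can be read as the first moment at which the posterior for the most probable attribute value crosses a confidence threshold such as 75% or 95%. The sketch below assumes a stream of (time, belief) pairs; the function name is hypothetical, the R values are the ones shown on the Iterative Bayesian Predictions timeline, and the week stamps and the complementary D values are invented for illustration.

```python
def time_to_confidence(timed_beliefs, threshold=0.75):
    """Return (time, label) at the first point where the most probable
    attribute value reaches the threshold, or None if it never does."""
    for when, belief in timed_beliefs:
        label, prob = max(belief.items(), key=lambda item: item[1])
        if prob >= threshold:
            return when, label
    return None

# R posteriors from the timeline slide; week stamps are made up.
stream = [
    (0.5, {"R": 0.52, "D": 0.48}),
    (1.0, {"R": 0.62, "D": 0.38}),
    (2.0, {"R": 0.65, "D": 0.35}),
    (3.0, {"R": 0.77, "D": 0.23}),
]
at_75 = time_to_confidence(stream, threshold=0.75)
at_95 = time_to_confidence(stream, threshold=0.95)
```

A stricter threshold can only delay the decision or leave the user unclassified, which matches the gap between the 75% and 95% curves in the figure.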
Batch vs. Online Performance
[Figure: accuracy (y-axis, 0 to 1) on the Cand, Geo and Active graphs for four models: User Batch, Neighbor Batch, User Stream, User-Neighbor Stream. Labels on the chart: 0.72, 0.75, 0.57; +0.27, +0.28, +0.31, +0.27, +0.03, +0.1, +0.25, +0.14, +0.11.]
Summary
• Neighborhood content is useful*
• Neighborhoods constructed from friends, usermentions and retweets are most effective
• Signal is distributed in the neighborhood
• Streaming models > batch models
*Pennacchiotti and Popescu, 2011a, 2011b; Conover et al., 2011a, 2011b; Golbeck et al., 2011; Zamal et al., 2012
Thank you!
Labeled Twitter network data for gender, age, political preference prediction:
http://www.cs.jhu.edu/~svitlana/
Code and pre-trained models available upon request: [email protected]