Transcript slides
Inferring User Political Preferences from Streaming Communications
Svitlana Volkova1, Glen Coppersmith2 and Benjamin Van Durme1,2
1Center for Language and Speech Processing
2Human Language Technology Center of Excellence
ACL 2014, Baltimore

Motivation
• Personalized, diverse and timely data
• Can reveal user interests, preferences and opinions
DemographicsPro – http://www.demographicspro.com/
WolframAlpha Analytics – http://www.wolframalpha.com/facebook/

Applications
• Large-scale passive polling and real-time live polling
• Online advertising
• Healthcare analytics
• Personalized recommendation systems and search

User Attribute Prediction
Communications …
• Political Preference: Rao et al., 2010; Conover et al., 2011; Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013
• Age: Rao et al., 2010; Zamal et al., 2012; Cohen and Ruths, 2013; Nguyen et al., 2011, 2013
• Gender: Garera and Yarowsky, 2009; Rao et al., 2010; Burger et al., 2011; Van Durme, 2012; Zamal et al., 2012; Bergsma and Van Durme, 2013

Existing Approaches
• ~1K tweets*
• Tweets as a document
• Does an average Twitter user produce thousands of tweets?
*Rao et al., 2010; Conover et al., 2011; Pennacchiotti and Popescu, 2011a; Burger et al., 2011; Zamal et al., 2012; Nguyen et al., 2013

How Active are Twitter Users?
http://www.digitalbuzzblog.com/visualizing-twitter-statistics-x100/

Real-World Predictions
• Active users (1,000+ tweets): 10%
• Average Twitter users (median = 10 tweets per day): 50%
• Private users (no content): 20%
• Not active users (no or limited content): 20%

Our Approach
1. Take advantage of user local neighborhoods (real-world batch predictions)
2. Incremental dynamic real-time predictions (streaming predictions)

Attributed Social Network
User Local Neighborhoods, a.k.a.
Social Circles

Twitter Network Data
Code, data and trained models for gender, age, political preference prediction: http://www.cs.jhu.edu/~svitlana/

Twitter Social Graph
• I. Candidate-Centric: 1,031 users of interest
• II. Geo-Centric: 270 users
• III. Politically Active*: 371 users
• Neighbor types: friend, follower, hashtag, @mention, retweet, reply
• 10-20 neighbors of each type per user
• ~50K nodes, ~60K edges
• What types of neighbors lead to the best attribute prediction for a given user?
*Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013

Experiments
• Log-linear binary unigram models: (I) Users vs. (II) Neighbors and (III) Both
• Evaluate the relative utility of different neighborhood types:
– varying neighborhood size n=[1, 2, 5, 10] and content amount t=[5, 10, 15, 25, 50, 100, 200]
– 10-fold cross validation with 100 random restarts for every n and t parameter combination

Neighborhood Comparison
[Figure: two panels, "1 Neighbor" and "10 Neighbors". x-axis: log(Tweets per Neighbor); y-axis: Accuracy (0.50 to 0.75). Series: Friend, Follower, Hashtag, Usermention, Retweet, Reply, User.]

Optimizing Twitter API Calls
Cand-Centric Graph: Friend Circle
[Figure]

Summary: Batch Real-World Predictions with Limited User Data
More data is better.
How to get it?
• More neighbors per user > additional content from the existing neighbors
What kind of data?
• Follower, friend, @mention, retweet
• Users who recently joined Twitter
• No or limited access to user tweets
Real-world predictions: no or very limited content!

Our Approach
1. Take advantage of user local neighborhoods
2. Incremental dynamic real-time predictions (streaming predictions)

Iterative Bayesian Predictions
Timeline: t_0, t_1, t_2, …, t_{k-1}, t_k
• t_0: ?
• P_{t_1}(R \mid T_{t_1}) = 0.52
• P_{t_2}(R \mid T_{t_2}) = 0.62
• P_{t_{k-1}}(R \mid T_{t_{k-1}}) = 0.65
• P_{t_k}(R \mid T_{t_k}) = 0.77

\[
P(a = D \mid T)
= \frac{P(T \mid a = D)\, P(a = D)}{\sum_{y \in A} P(T \mid a = y)\, P(a = y)}
= \frac{\prod_{k} P(t_k \mid a = D)\, P(a = D)}{\sum_{y \in A} \prod_{k} P(t_k \mid a = y)\, P(a = y)}
\]

Cand-Centric Graph: Belief Updates
[Figure: p(Republican|T) plotted against the Tweet Stream (T) over time t_0, t_1, t_2, …, t_{k-1}.]

Cand-Centric Graph: Prediction Time
[Figure: Users vs. Time in Weeks for User Stream and User-Neighbor at 75% and 95% confidence; Weeks (log scale) for the Cand, Geo and Active graphs (100 users, 75% confidence).]

Batch vs. Online Performance
[Figure: bar chart (y-axis 0 to 1) comparing User Batch, User Stream, Neighbor Batch and User-Neighbor Stream on the Cand, Geo and Active graphs.]

Summary
• Neighborhood content is useful*
• Neighborhoods constructed from friends, usermentions and retweets are most effective
• Signal is distributed in the neighborhood
• Streaming models > batch models
*Pennacchiotti and Popescu, 2011a, 2011b; Conover et al., 2011a, 2011b; Golbeck et al., 2011; Zamal et al., 2012

Thank you!
Labeled Twitter network data for gender, age, political preference prediction: http://www.cs.jhu.edu/~svitlana/
Code and pre-trained models available upon request: [email protected]
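
The update rule on the "Iterative Bayesian Predictions" slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`update_posterior`, `stream_posteriors`) are hypothetical, and the per-tweet likelihoods P(t_k | a = y) are made-up numbers standing in for whatever model would supply them.

```python
def update_posterior(prior, likelihood):
    """One Bayesian step: posterior(y) is proportional to P(t | a=y) * prior(y)."""
    unnormalized = {y: likelihood[y] * prior[y] for y in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all labels have zero probability for this tweet")
    return {y: value / total for y, value in unnormalized.items()}


def stream_posteriors(prior, tweet_likelihoods):
    """Apply the update once per incoming tweet; return the belief after each tweet."""
    belief = dict(prior)
    history = []
    for likelihood in tweet_likelihoods:
        belief = update_posterior(belief, likelihood)
        history.append(belief)
    return history


# Hypothetical example: uniform prior over A = {D, R} and two tweets whose
# likelihoods are invented for illustration only.
prior = {"D": 0.5, "R": 0.5}
tweet_likelihoods = [
    {"D": 0.2, "R": 0.3},
    {"D": 0.1, "R": 0.4},
]
history = stream_posteriors(prior, tweet_likelihoods)
for step, belief in enumerate(history, start=1):
    print(f"after tweet {step}: P(R | T) = {belief['R']:.3f}")
```

Because each step renormalizes over all labels in A, the belief after the last tweet equals the batch posterior computed from the product of all per-tweet likelihoods, which is the second form of the equation on the slide. A stopping rule such as "predict once P(R | T) exceeds 0.75 or 0.95" could be layered on top of `history`.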