Relational Non-parametric models for Analyzing Influences on

Workshop on Social Computing
IIT Kharagpur, Oct 5-6 2012
Dynamic Multi-Relational Chinese Restaurant Process for Analyzing Influences on Users in Social Media*
Indrajit Bhattacharya
Research Scientist
IBM Research, Bangalore
*Collaboration with Himabindu Lakkaraju and Chiranjib Bhattacharyya
Social Media Analysis: Motivation
Microblogs: Twitter, Facebook, MySpace
Understanding and analyzing topics & trends
Influences on users
Variety of stakeholders
Business
Government
Social scientists
2
Social Media Analysis: Challenges
Network and Influences on Users
User personality: personal preferences, global and geographic trends, social circle in the network [Yang WSDM 2011]
Dynamic nature
Topics & user personalities evolve over time
Volume of data
Existing approaches fall short
3
Social Media Analysis: State of the Art
Content Analysis
Ramage ICWSM 2010, Hong SOMA 2010
Variants of LDA
Inferring User Interests
Ahmed KDD 2011, Wen KDD 2010
Individual features such as user activity or network
Patterns in Temporal Evolution
Yang et al WSDM 2011
4
Bayesian Non-parametric Models
Choosing the number of components in a mixture model
Particularly severe problem for large data volumes, such as social media data
Bayesian solution
Infinite-dimensional prior
Allows the number of mixture components to grow with data size
Existing models cannot capture the richness of social media data
Algorithms often not scalable
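To make concrete how an infinite-dimensional prior lets the number of components grow with data size, the sketch below assumes a Dirichlet process (Chinese Restaurant Process) prior with concentration parameter alpha. The function name and the value alpha = 1 are illustrative choices, not taken from the talk. Under that assumption the expected number of occupied components after n items is the sum of alpha / (alpha + i) for i = 0..n-1, which grows roughly like alpha * log(n).

```python
def expected_num_components(n, alpha):
    """Expected number of occupied components after n items under a CRP(alpha) prior."""
    # Item i+1 (0-based i) opens a new component with probability alpha / (alpha + i).
    return sum(alpha / (alpha + i) for i in range(n))


# The count keeps growing with n, but only logarithmically.
for n in (10, 1_000, 100_000):
    print(n, round(expected_num_components(n, alpha=1.0), 2))
```

A larger alpha (an assumed tuning knob here) yields more components for the same amount of data.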
5
Talk Outline
Background: Chinese Restaurant Processes
CRP with multiple relationships: (RelCRP, MRelCRP)
Dynamic MRelCRP
Multi-threaded Online Inference Algorithm
Experimental Results
8
Dirichlet Process (Informal)
10
Dirichlet Process: Properties
12
Chinese Restaurant Process (CRP)
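As a minimal sketch of the standard CRP seating rule: customer i+1 joins an occupied table k with probability n_k / (i + alpha) and opens a new table with probability alpha / (i + alpha). The names sample_crp, alpha and seed are chosen here for illustration and do not come from the talk.

```python
import random


def sample_crp(n, alpha, seed=0):
    """Draw table assignments for n customers from a CRP with concentration alpha."""
    rng = random.Random(seed)
    counts = []       # counts[k] = number of customers seated at table k
    assignments = []  # assignments[i] = table index of customer i
    for _ in range(n):
        # Unnormalised weights: occupied tables by size, plus alpha for a new table.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new table
        else:
            counts[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, counts


assignments, counts = sample_crp(n=20, alpha=1.0)
print(assignments, counts)
```

Used as a prior for a mixture model, each table corresponds to one mixture component (here, a topic) and each customer to one post.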
14
Relational Chinese Restaurant Process (RelCRP)
16
Influence of World-wide Factors
18
Influence of Personal Preferences
20
Influence of Friend Network
22
Influence of Geography
Regions shown: China, India, UK
24
Aggregating Influences
RelCRP is exchangeable like the CRP
Useful as a prior for infinite mixture model
RelCRP captures influence of one relation on posts
Influences act simultaneously on any user
Aggregated influence pattern is user specific
Different users affected differently by same
combination of world-wide and geographic factors
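The slides with the MRelCRP definition did not survive in this transcript, so the following is only one plausible reading of "aggregated influence pattern is user specific": each relation (world-wide, geography, friends) keeps its own CRP-style topic counts, and each user mixes the per-relation predictive distributions with personal weights. The function name, the weight dictionaries and the mixture form itself are assumptions made for illustration.

```python
NEW = "<new topic>"  # sentinel key: probability of opening a brand-new topic


def aggregated_topic_probs(relation_counts, user_weights, alpha=1.0):
    """Mix per-relation CRP predictives with user-specific weights (assumed form)."""
    probs = {NEW: 0.0}
    for relation, counts in relation_counts.items():
        weight = user_weights[relation]      # how strongly this user follows the relation
        total = sum(counts.values())
        for topic, n in counts.items():
            probs[topic] = probs.get(topic, 0.0) + weight * n / (total + alpha)
        probs[NEW] += weight * alpha / (total + alpha)
    return probs


# Hypothetical topic counts per relation, shared by both users below.
counts = {
    "world": {"sports": 8, "music": 2},
    "geography": {"sports": 1, "politics": 9},
    "friends": {"music": 5},
}
alice = {"world": 0.7, "geography": 0.2, "friends": 0.1}
bob = {"world": 0.1, "geography": 0.2, "friends": 0.7}
p_alice = aggregated_topic_probs(counts, alice)
p_bob = aggregated_topic_probs(counts, bob)
print(p_alice)
print(p_bob)
```

With weights that sum to one, the result is a proper distribution, and two users exposed to the same world-wide and geographic factors end up with different topic probabilities.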
Multi-Relational CRP (MRelCRP)
28
Evolving Patterns in Social Media
Number of Topics
Topics die and new ones are born
User Personalities
Susceptibility to influence by world-wide, geographic and friends’ preferences
Existing Topic Distributions
Words go out of fashion, new ones enter vocabulary
Topic Characters
Popularity of a topic changes world-wide, in user preferences, sub-networks and geographies
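One common way to let topics die and new ones be born across epochs is to carry topic counts forward with a decay factor and drop topics whose remaining mass becomes negligible. The sketch below illustrates that idea only; the talk's actual dynamics are on slides that are not preserved here, and carry_over, run_epochs, decay and min_mass are hypothetical names and parameters.

```python
def carry_over(prev_counts, decay=0.5, min_mass=0.1):
    """Decay last epoch's topic counts; topics whose mass falls below min_mass die."""
    carried = {topic: decay * n for topic, n in prev_counts.items()}
    return {topic: mass for topic, mass in carried.items() if mass >= min_mass}


def run_epochs(epoch_counts, decay=0.5, min_mass=0.1):
    """Return the prior topic mass available at the start of each epoch."""
    prior, history = {}, []
    for counts in epoch_counts:
        history.append(dict(prior))
        merged = dict(prior)
        for topic, n in counts.items():   # topics not seen before are born here
            merged[topic] = merged.get(topic, 0.0) + n
        prior = carry_over(merged, decay, min_mass)
    return history


# Hypothetical per-epoch post counts for two topics.
epochs = [{"wave": 4}, {"fifa": 6}, {"fifa": 2}, {}]
for t, prior in enumerate(run_epochs(epochs, decay=0.5, min_mass=1.0)):
    print(t, prior)
```

The same decayed counts could be kept per relation (world-wide, per user, per sub-network, per geography), which is one way to track how a topic's popularity changes in each of them.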
31
Dynamic MultiRelCRP
32
User Personality Trends
33
Evolving Topic Distributions
34
Topic Character Trends
35
Inference and Estimation Tasks
37
Online Algorithm
Traditional iterative framework does not scale for social media data
Sequential Monte Carlo methods [Canini AISTATS 2009] that rejuvenate some old labels are also infeasible
Online sampling [Banerjee SDM 2007] does not revisit old labels at all; initial batch phase
Adapt online sampling for the non-parametric setting
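A rough sketch of single-pass online sampling in a non-parametric setting: each post is labelled once, either with an existing topic or with a new one, and the label is never revisited. The likelihood used here (a collapsed Dirichlet-multinomial over words with a CRP prior over topics) is a simplifying assumption, much simpler than the model in the talk, and all function and parameter names are illustrative.

```python
import math
import random
from collections import Counter


def log_predictive(words, topic_counts, topic_total, vocab_size, beta):
    """Log-probability of a post's words under a topic (collapsed Dirichlet-multinomial)."""
    seen, lp = Counter(), 0.0
    for j, w in enumerate(words):
        lp += math.log((topic_counts.get(w, 0) + seen[w] + beta)
                       / (topic_total + j + vocab_size * beta))
        seen[w] += 1
    return lp


def online_assign(posts, vocab_size, alpha=1.0, beta=0.1, seed=0):
    """Assign each post to a topic in one pass; old labels are never resampled."""
    rng = random.Random(seed)
    topics, labels = [], []   # topics[k] = [word Counter, number of posts]
    for words in posts:
        log_w = [math.log(n) + log_predictive(words, c, sum(c.values()), vocab_size, beta)
                 for c, n in topics]
        # Last option: open a new topic, weighted by alpha (the non-parametric part).
        log_w.append(math.log(alpha) + log_predictive(words, {}, 0, vocab_size, beta))
        m = max(log_w)
        weights = [math.exp(x - m) for x in log_w]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(topics):
            topics.append([Counter(), 0])
        topics[k][0].update(words)
        topics[k][1] += 1
        labels.append(k)
    return labels, topics


posts = [["cricket", "india"], ["nips", "paper"], ["cricket", "score"], ["nips", "poster"]]
labels, topics = online_assign(posts, vocab_size=6)
print(labels)
```

Because every post is touched exactly once, the cost is linear in the number of posts, at the price of never correcting early labelling mistakes.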
38
Multi-threaded Implementation
Sequential online implementation does not scale
Iterative Gibbs sampling algorithms have been parallelized for hierarchical Bayesian models [Asuncion NIPS 2008, Smola VLDB 2010]
Our algorithm is parallel, online and non-parametric
Explicit consolidation by master thread at the end of each iteration
Only new topics are consolidated
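The control flow described above (workers label shards in parallel, then a master thread consolidates the newly created topics) might look roughly like the toy sketch below. It is written in Python for brevity, whereas the talk's framework is Java-based. The word-overlap rule standing in for the sampler, and the rule that merges new topics sharing a word, are placeholders invented for this illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def overlap(counts, words):
    """Toy affinity: how often the topic has already seen these words."""
    return sum(counts[w] for w in words)


def worker(shard, snapshot):
    """Label one shard against a read-only snapshot of the global topics."""
    deltas, new_topics = {}, []
    for words in shard:
        best = max(snapshot, key=lambda k: overlap(snapshot[k], words), default=None)
        if best is not None and overlap(snapshot[best], words) > 0:
            deltas.setdefault(best, Counter()).update(words)
            continue
        for topic in new_topics:               # reuse a topic born earlier in this shard
            if overlap(topic, words) > 0:
                topic.update(words)
                break
        else:
            new_topics.append(Counter(words))  # topic born locally in this thread
    return deltas, new_topics


def run_iteration(global_topics, shards, n_threads=4):
    """One iteration: parallel labelling, then consolidation by the master."""
    snapshot = {k: Counter(c) for k, c in global_topics.items()}
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(lambda shard: worker(shard, snapshot), shards))
    born = []
    for deltas, new_topics in results:
        for k, added in deltas.items():        # existing topics: plain count merge
            global_topics[k].update(added)
        for topic in new_topics:               # only new topics need consolidation
            match = next((b for b in born if overlap(b, topic) > 0), None)
            if match is None:
                born.append(Counter(topic))
            else:
                match.update(topic)
    next_id = max(global_topics, default=-1) + 1
    for topic in born:
        global_topics[next_id] = topic
        next_id += 1
    return global_topics


topics = {0: Counter({"cricket": 3})}
shards = [[["cricket", "india"], ["nips", "paper"]], [["nips", "poster"], ["cricket"]]]
print(run_iteration(topics, shards))
```

Restricting consolidation to new topics keeps the master's work small: counts for existing topics merge by simple addition, and only topics that two threads may have created independently need to be compared.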
39
Datasets and Baselines
Datasets
Twitter: 360 million tweets (Jun-Dec 2009)
Facebook: 300,000 posts (public profiles, 3 months)
Baselines
Latent Dirichlet Allocation (LDA) [Hong SOMA 2010]
Labeled LDA (L-LDA): hashtags as topics [Ramage ICWSM 2010]
Timeline: dynamic non-parametric topic model [Ahmed UAI 2010]
41
1 Model Goodness
Perplexity: Ability to generalize to unseen data
Perplexity
Model      Twitter   Facebook
DMRelCRP   1188.29   1562.34
Timeline   1582.86   1802.9
L-LDA      1982.76   -
LDA        2932.06   3602
Both network and dynamics are important for
modeling social media data
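For reference, perplexity is conventionally the exponential of the negative average log-likelihood per held-out word, so lower values mean better generalization. The helper below is a minimal sketch of that standard definition; how the talk's models produce the per-word log-probabilities is not shown in this transcript and is left as an input.

```python
import math


def perplexity(word_log_probs):
    """exp of the negative mean log-probability over held-out words (natural log)."""
    if not word_log_probs:
        raise ValueError("need at least one held-out word")
    return math.exp(-sum(word_log_probs) / len(word_log_probs))


# Sanity check: a model that spreads probability uniformly over 50 words
# has a perplexity of about 50 on any held-out text.
print(perplexity([math.log(1 / 50)] * 200))
```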
42
2 Quality of Discovered Topics
Label assigned to each post indicating category
Distribution over words indicating semantics
A. Clustering posts using topic labels
B. Prediction using topic labels
Predicting post authorship & user commenting activity
C. Major event detection
43
2A Post Clustering using Topics
Use hashtags as gold standard (for Twitter)
16K posts tagged #NIPS2009, #ICML2009, #bollywood, etc.
Clustering accuracy (Twitter)
Model      nMI    R-Index   F1
DMRelCRP   0.93   0.88      0.86
Timeline   0.81   0.72      0.73
L-LDA      1      1         1
LDA        0.55   0.52      0.48
DMRelCRP close to L-LDA without using hashtags
DMRelCRP produces ‘finer-grained’ clusters
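Two of the clustering scores can be sketched as follows, assuming "R-Index" denotes the Rand index and that nMI is normalised by the geometric mean of the two entropies (several normalisations are in use, and the talk does not say which one it applies). Topic labels play the role of predicted clusters and hashtags the role of gold clusters.

```python
import math
from collections import Counter
from itertools import combinations


def rand_index(pred, gold):
    """Fraction of post pairs on which the two clusterings agree (same vs. different)."""
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] == pred[j]) == (gold[i] == gold[j]) for i, j in pairs)
    return agree / len(pairs)


def nmi(pred, gold):
    """Mutual information divided by sqrt(H(pred) * H(gold))."""
    n = len(pred)

    def entropy(labels):
        return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

    cp, cg = Counter(pred), Counter(gold)
    mi = sum(c / n * math.log(c * n / (cp[p] * cg[g]))
             for (p, g), c in Counter(zip(pred, gold)).items())
    hp, hg = entropy(pred), entropy(gold)
    if hp == 0 and hg == 0:
        return 1.0   # both clusterings are a single cluster
    if hp == 0 or hg == 0:
        return 0.0   # one side carries no information
    return mi / math.sqrt(hp * hg)


gold = ["#NIPS2009", "#NIPS2009", "#bollywood", "#bollywood"]
print(rand_index([0, 0, 1, 1], gold), nmi([0, 0, 1, 1], gold))
print(rand_index([0, 1, 2, 2], gold), nmi([0, 1, 2, 2], gold))
```

The second call splits one hashtag into two topics; scores drop below 1 even though no post is mixed with a different hashtag, which is how finer-grained clusters are penalised against a hashtag gold standard.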
44
2B Prediction Using Topics
Authorship: given a post and a user, predict whether the user is the author
Commenting activity: given a post and a (non-author) user, predict whether the user comments on that post
Model
DMRelCRP
Timeline
L-LDA
LDA
Authorship
Commenting
Twitter Facebook Twitter Facebook
0.793
0.734
0.683
0.648
0.718
0.669
0.582
0.579
0.521
0.432
0.429
0.482
0.647
0.542
-
DMRelCRP topics lead to more accurate prediction
45
2C Major Event Detection
47
3 Analysis of Influences
49
3A Global Personality Trends
50
3A Global Personality Trends
Events annotated: FIFA WC, Michael Jackson’s death, Google Wave
51
3B Geo-specific Personality Trends
Personality trends very similar in UK and US
Geographic influences high at different epochs
53
3B Geo-specific Personality Trends
India: world-wide and geographic influences weaker
China: world-wide influence weak, geographic influence strong; stable pattern
54
3C Topic Character Trends
55
Scaling with Data Size
Java-based multi-threaded framework; 7 threads
8-core machine with 32 GB RAM
Scales largely because of multi-threading
58
Summary
First attempt at studying user influences in social media data
New non-parametric model that captures multiple relationships and temporal evolution
Multi-threaded online Gibbs sampling algorithm
Extensive evaluation on large real datasets
Topics lead to better clustering and prediction
Insights on user influence patterns
59