Transcript Document

Modeling Political Blog Posts
with Response
Tae Yano
Carnegie Mellon University
[email protected]
IBM SMiLe Open House
Yorktown Heights, NY
October 8, 2009
This talk is about:
How we are designing topic models for online political discussion
Political blogs
Why (should we) study political blogs?
• An influential social phenomenon.
• An important venue for civil discourse.
• Blog text is relatively understudied.
• Interest in text analysis from social/political science researchers
• Monroe et al., 2009; Hopkins and King, 2009; many others
Political blogs
Why (should we) study political blogs?
A different and interesting type of text that we don’t usually deal with in NLP
• Spontaneous text: often ungrammatical, with copious misspellings and colloquialisms
• Elusive information needs (“popularity”, “influence”, “trustworthiness”).
• Difficult and costly to handle with a classical supervised approach.
• The text is composed of a mixture of diverse linguistic styles.
Political blogs - Illustration
Political blogs - Illustration
Posts are often coupled with comment sections.
Comment style is casual, creative, and less carefully edited.
Political blogs - Illustration
Comments often meander across several themes:
• On topic: “If the President gets health care”, “Taxes and Fee”
• Tangent: “The rock that keeps things off the table”
• Ranting?
Political blogs - Illustration
Posts tend to discuss
multiple themes
House Republicans?
Government neglect?
Energy policy?
Oil companies?
Political blogs - Illustration
Comments can be constructive and formal:
“I am in total agreement … In contrast … My understanding is….”
…or subjective and conversational:
“Iowa-Shiowa”
Political blogs - Illustration
Comments can be very long
…or quite terse
“Absurd”
Political blogs - Illustration
How should we approach this sort of data?
Our approach is to treat it as an instance of topic modeling:
Latent Dirichlet Allocation, or LDA (Blei, Ng, and Jordan, 2003)
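As a reference point, the generative story of vanilla LDA can be sketched as below. This is a standard textbook formulation following Blei, Ng, and Jordan (2003), not the slides' exact notation.

\begin{align*}
&\text{for each document } d = 1, \dots, D: \quad \theta_d \sim \mathrm{Dirichlet}(\alpha) \\
&\qquad \text{for each word position } i = 1, \dots, N_d: \quad z_i \sim \mathrm{Multinomial}(\theta_d), \;\; w_i \sim \mathrm{Multinomial}(\beta_{z_i})
\end{align*}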
Topic modeling
What does this approach buy us?
• Naturally expresses the idea that a text is composed of several distinct components:
• A post and its reactions (comments)
• A mixture of different themes within one post
• Diverse personal styles and pet peeves
• A convenient choice when we are uncertain about the structure of the corpus
• We can encode hypotheses, and have the model learn from data.
• Modularity makes it easy to change the model
Modeling political blogs
Our proposed political blog model:
CommentLDA
[Plate diagram of the model]
z, z' = topic; w = word (in the post); w' = word (in the comments); u = user
D = # of documents; N = # of words in the post; M = # of words in the comments
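Read off the plate diagram (and the published description in Yano, Cohen, and Smith, 2009), the generative story is roughly as sketched below; γ_k is a symbol introduced here, not taken from the slide, for the per-topic distribution over commenting users, and β'_k is the per-topic comment-word distribution.

\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture shared by post and comments} \\
z_i \sim \theta_d, \quad w_i &\sim \beta_{z_i} && \text{post side, } i = 1, \dots, N \\
z'_j \sim \theta_d, \quad u_j \sim \gamma_{z'_j}, \quad w'_j &\sim \beta'_{z'_j} && \text{comment side, } j = 1, \dots, M
\end{align*}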
Modeling political blogs
Our proposed political blog model: CommentLDA
The left-hand side is vanilla LDA.
[Plate diagram: α, θ_d, z_i, w_i, β, with plates of size N_d and D]
D = # of documents; N = # of words in the post; M = # of words in the comments
Modeling political blogs
Our proposed political blog model: CommentLDA
The right-hand side captures the generation of the reaction separately from the post body.
The two chambers share the same topic mixture, but use two separate sets of word distributions.
D = # of documents; N = # of words in the post; M = # of words in the comments
Modeling political blogs
Our proposed political blog model: CommentLDA
The user IDs of the commenters are generated as a part of the comment text, together with the words in the comment section.
D = # of documents; N = # of words in the post; M = # of words in the comments
Modeling political blogs
Three variations on user ID generation:
• “Verbosity” (the original model): M = # of words in all comments; L = 1
• “Comment frequency”: M = # of comments on the post; L = # of words in the comment
• “Response”: M = # of participants in the post’s comment section; L = # of words by one participant
Think of this as encoding a hypothesis about which type of user ought to weigh more! (A small counting sketch follows.)
[Illustration: the same comment section (“Liberty”, “Democracy”, “Fraternity”, “Equality”, “Whatever”, …) counted under the verbosity, comment-frequency, and response schemes]
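As a minimal sketch of how the three counting schemes change what the comment plate sees: the toy comment section below is a made-up list of (user, words) pairs, and each scheme yields the multiset of user-ID tokens (of size M) that would be generated.

from collections import Counter

# A toy comment section for one post: (user, word tokens) per comment.
# User names and comment words are invented for illustration.
comments = [
    ("alice", ["liberty", "democracy", "fraternity", "equality", "whatever"]),
    ("bob",   ["absurd"]),
    ("alice", ["taxes", "and", "fees"]),
]

def user_tokens(comments, scheme):
    """Return the user-ID tokens the comment plate would generate
    under one counting scheme; M is the length of the returned list."""
    if scheme == "verbosity":          # one user token per comment word
        return [u for u, words in comments for _ in words]
    if scheme == "comment_frequency":  # one user token per comment
        return [u for u, _ in comments]
    if scheme == "response":           # one user token per distinct participant
        return sorted({u for u, _ in comments})
    raise ValueError("unknown scheme: %s" % scheme)

for scheme in ("verbosity", "comment_frequency", "response"):
    tokens = user_tokens(comments, scheme)
    print(scheme, "M =", len(tokens), dict(Counter(tokens)))

Under verbosity, alice contributes eight user tokens and bob one; under comment frequency, two and one; under response, each contributes exactly one.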
Modeling political blogs
Another model we tried: take out the words from the comment section!
This model is agnostic to the words in the comment section.
D = # of documents; N = # of words in the post; M = # of words in the comments
Modeling political blogs
Another model we tried: LinkLDA (Erosheva et al., 2004)
The model is structurally (but not semantically) equivalent to LinkLDA from Erosheva et al. (2004) and Nallapati and Cohen (2008).
D = # of documents; N = # of words in the post; M = # of words in the comments
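For contrast with the CommentLDA sketch above, the comment plate of this LinkLDA-style variant keeps the topic and user draws but drops the comment-word emission; a sketch under the same assumed notation (with γ as introduced earlier):

\[
z'_j \sim \theta_d, \qquad u_j \sim \gamma_{z'_j}, \qquad j = 1, \dots, M
\]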
Topic discovery
What topics did the models discover?
What differences are there between the post and comments?
• Data sets: 5 major US blogs collected over a year; this data is available on our website (http://www.ark.cs.cmu.edu/blog-data).
• Each site has 1,000 to 2,000 training posts; details about the data sets are in Yano, Cohen, and Smith (2009).
• Inference is implemented with Gibbs sampling (a minimal sampler for vanilla LDA is sketched after this list).
• The following are some topics from the Matthew Yglesias site.
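For concreteness, here is a minimal collapsed Gibbs sampler for vanilla LDA in the style of Griffiths and Steyvers (2004). It is an illustrative Python sketch, not the HBC-generated sampler used for the actual experiments, and all names are made up.

import random

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for vanilla LDA.
    docs: list of documents, each a list of word ids in [0, V).
    Returns (document-topic counts, topic-word counts)."""
    rng = random.Random(seed)
    z = [[rng.randrange(K) for _ in doc] for doc in docs]   # topic assignment per token
    ndk = [[0] * K for _ in docs]                           # document-topic counts
    nkw = [[0] * V for _ in range(K)]                       # topic-word counts
    nk = [0] * K                                            # tokens assigned to each topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the token's current assignment from the counts
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional: p(z = k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

# Toy usage: two documents over a 4-word vocabulary, two topics.
# print(gibbs_lda([[0, 1, 1, 2], [2, 3, 3, 0]], K=2, V=4, iters=50))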
Topic discovery
[Three slides of example topic tables from the Matthew Yglesias site; the tables are not reproduced in this transcript]
Comment prediction
A guessing game:
Can we predict which users will react to an unseen post?
• Infer the topic mixture for each test post using the fitted model
• Rank users according to p(user | post, model) (a scoring sketch follows this list)
• Envisioned as useful for personalized blog filtering or recommendation systems
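A minimal sketch of the ranking step, assuming a fitted model: theta_hat stands for the topic mixture inferred for the test post and gamma for the per-topic user distributions; both names are illustrative rather than the model's published notation.

def rank_users(theta_hat, gamma, top_n=10):
    """Rank users by p(user | post) ~= sum_k theta_hat[k] * gamma[k][user].
    theta_hat: topic -> probability, inferred for the test post.
    gamma: topic -> {user: probability}, fitted user emissions per topic."""
    scores = {}
    for k, p_topic in theta_hat.items():
        for user, p_user in gamma.get(k, {}).items():
            scores[user] = scores.get(user, 0.0) + p_topic * p_user
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# e.g. rank_users({0: 0.7, 1: 0.3},
#                 {0: {"alice": 0.6, "bob": 0.4}, 1: {"carol": 1.0}}, top_n=2)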
Comment prediction
[Bar charts: precision at the top 5, 10, 20, and 30 predicted users for the MY and RS sites; from left to right, LinkLDA (-v, -r, -c) and CommentLDA (-v, -r, -c). Values visible in the chart residue: 27.54, 25.19, 20.54, 14.83, 12.56 near CommentLDA (R, C); 16.92, 12.14, 9.82 near LinkLDA (R).]
CommentLDA performs consistently better for the MY site, while LinkLDA is a much better option for RS. Does our model lack the expressive power to reflect site differences?
Our models perform at least as well as a word-based naïve Bayes (NB) baseline.
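The metric behind these charts, precision at the top n predicted users, can be computed as in this small sketch; the function and argument names are illustrative.

def precision_at_n(ranked_users, actual_commenters, n):
    """Fraction of the top-n predicted users who actually commented on the post."""
    hits = sum(1 for u in ranked_users[:n] if u in actual_commenters)
    return hits / float(n)

# e.g. precision_at_n(["alice", "bob", "carol"], {"alice", "dave"}, n=3)  # -> 0.33...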
Comment prediction
[Bar charts: verbosity vs. response, for CommentLDA on the MY site and LinkLDA on the RS site; from left to right, cut-offs n = 5, 10, 20, and 30 top-ranked users]
Variation in user counting does make a difference: giving more weight to verbose users does not help for this task.
Future work
What forecasting tasks can our model perform?
Using CommentLDA to predict the topics of a post given its comments: useful for automatic text categorization or text search when the post has no searchable text.
Future work
Can we automatically adjust how much the words influence the topics, given the site?
• Better comment prediction?
• Inferential questions involving multiple sites
Future work
Can we guess which posts will collect more responses (number of comments, volume of comments)?
• A variant of sLDA (Blei and McAuliffe, 2007) with comments
• A LinkLDA-type model is also possible.
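As a hedged sketch of what such a variant could look like: an sLDA-style response (Blei and McAuliffe, 2007) would regress the comment volume y_d of post d on the post's empirical topic proportions; the Gaussian link below is an illustrative assumption, not a design stated on the slide.

\[
\bar{z}_d = \frac{1}{N_d} \sum_{i=1}^{N_d} z_{d,i},
\qquad
y_d \sim \mathcal{N}\!\left(\eta^{\top} \bar{z}_d,\; \sigma^2\right)
\]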
Summary
Political blogs are an exciting new domain for language and learning
research.
Topic modeling is a viable framework for analyzing the text of online
political discussions.
It is convenient and competitive in tasks that have potential uses in real
applications.
End of presentation
References
• Our published version of this work includes a detailed profile of our
data set, as well as more experiments.
http://www.aclweb.org/anthology/N/N09/N09-1054.pdf
• Please refer back to the original LDA paper for the complete picture.
http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
• Gibbs sampling for LDA is detailed in Griffiths and Steyvers, 2004.
http://www.pnas.org/cgi/reprint/0307752101v1.pdf
• Hierarchical Bayesian Compiler (HBC) used for Gibbs sampling:
http://www.cs.utah.edu/~hal/HBC
Comment prediction
[Bar charts: precision at the top 10 predicted users for the MY, RS, and CB sites; from left to right, LinkLDA (-v, -r, -c), CommentLDA (-v, -r, -c), and baselines (Freq, NB). Best visible values: 20.54% for CommentLDA (R) on MY, 16.92% for LinkLDA (R) on RS, and 32.06% for LinkLDA (C) on CB.]
Modest performance (16% to 32% precision), but it compares favorably to the naïve Bayes baseline.