User Classification in Twitter


KDD ‘11
Utku Şirin
Marco Pennacchiotti
Research Scientist at Yahoo! Labs
• PhD from the University of Rome
• Studied at Saarland University
• Large-Scale Text Mining, Information Extraction, and Natural Language Processing
Ana-Maria Popescu
Research Scientist at Yahoo! Labs
• Graduated from the University of Washington
• Social Media Research and Analytics, User Modelling, and Sentiment Analysis
A rapidly growing phenomenon
• Everyone is there, the conservatives and the revolutionaries!
• As data miners, what interests us is the very large amount of data available about social media users
• A basic and important task: classification of the users
Authoritative user extraction
• Post reranking in web search (KDDCUP ‘12, Track
• User recommendation
How to do the classification?
The starting point is to fill in incomplete user attributes by classifying each user with respect to the missing attribute.
Most users do not explicitly state their political views, for example.
There are various methods for solving the user classification problem.
What do we have in the social media domain?
Users have many attributes, such as age, gender, etc.
Based on these attributes, a classifier may be trained
Social network
Users have friends that they follow
How can we define the classification task so that it combines these two types of information, user attributes and social network structure?
A novel architecture combining user-centric information and social network information
• User-centric information is the attributes of the users, which we call features hereafter
• Social network information is the information about the friends of the users
Main contribution of the paper
Use the Gradient Boosted Decision Trees (GBDT) framework as the classification algorithm
Train the GBDT on the given labeled input data
Label the users with the trained classifier
Then apply the same classifier model to the friends of the users and label the friends as well
Lastly, update each user’s label with respect to her friends’ labels using an update formula
User-centric information is represented as features. The feature set is very large and mainly comprises four parts
Profile features (PROF)
User name, use of avatar picture, date of account creation, etc.
Tweeting behavior features (BEHAV)
Average number of tweets per day, number of replies, etc.
Linguistic content features
Richest feature set, comprising five sub-feature sets
Uses Latent Dirichlet Allocation (LDA) as the language model
Prototypical words (LING-WORD):
Proto words are words that are iconic for a class of users
Found probabilistically from the data
First partition the users into n classes, then find the most frequent words for each class and take the k most used words for each class
Prototypical hashtags (LING-HASH):
Hashtags (#) denote topics
Same technique as for proto words
Generic LDA (LING-GLDA):
Topics are extracted with the LDA model, and each user is represented as a distribution over topics
LDA is trained on the full set of users
Domain-specific LDA (LING-DLDA):
Same as Generic LDA, but trained on a task-specific training set, such as users that are only Democrats and Republicans
Sentiment words (LING-SENT):
A small, manually collected set of terms, e.g. Ronald Reagan: good or bad?
The OpinionFinder tool gives the sentiment as positive, negative, or neutral
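The proto-word step described above (partition users into classes, count words per class, keep the top k) can be sketched roughly as follows. This is only an illustration under simplifying assumptions: it uses raw frequency counts and whitespace tokenization, whereas the slides say the words are found probabilistically; the function name `proto_words` and the input layout are hypothetical.

```python
from collections import Counter

def proto_words(tweets_by_class, k):
    """Return the k most frequent words for each class.

    tweets_by_class maps a class label to a list of tweet texts
    (hypothetical layout). Ties are broken alphabetically so the
    result is deterministic.
    """
    result = {}
    for label, tweets in tweets_by_class.items():
        # Naive tokenization: lowercase and split on whitespace.
        counts = Counter(word for tweet in tweets for word in tweet.lower().split())
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        result[label] = [word for word, _ in ranked[:k]]
    return result

toy_tweets = {
    "class_a": ["healthcare reform now", "healthcare for all"],
    "class_b": ["tax cuts now", "lower tax"],
}
top_words = proto_words(toy_tweets, k=1)
```

The same function could be applied to hashtags by feeding it only the `#`-prefixed tokens, which matches the note that hashtags use the same technique.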
Social Network Features
Combination of two different features
Friend accounts (SOC-FRIE):
Indicates which friends are shared among users with different labels, such as Democrats and Republicans
Prototypical replied (SOC-REP) and retweeted (SOC-RET) users:
Find the most frequently mentioned (@) and retweeted (RT) users for users with different labels
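As a rough illustration of finding the most frequently mentioned (@) and retweeted (RT) accounts per class, the sketch below parses tweet text with two regular expressions. The patterns, the function name `top_accounts`, and the account names are assumptions made for the example, not the paper's extraction rules.

```python
import re
from collections import Counter

# Assumed surface patterns: "@name" for a mention, "RT @name" for a retweet.
MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"\bRT @(\w+)")

def top_accounts(tweets_by_class, pattern, k):
    """Return the k accounts matched most often by `pattern` in each class."""
    result = {}
    for label, tweets in tweets_by_class.items():
        counts = Counter()
        for tweet in tweets:
            counts.update(name.lower() for name in pattern.findall(tweet))
        # Sort by descending count, then by name, for a deterministic order.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        result[label] = [name for name, _ in ranked[:k]]
    return result

toy_tweets = {
    "class_a": ["RT @alice: hi", "thanks @alice and @bob"],
    "class_b": ["RT @carol: news", "@carol agreed @dave"],
}
most_mentioned = top_accounts(toy_tweets, MENTION, k=1)
most_retweeted = top_accounts(toy_tweets, RETWEET, k=1)
```

Note that the mention pattern also matches the account inside "RT @name", so in this sketch a retweet counts as a mention too.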
That’s all for user-centric information
A lot of features, indeed…
Now each user in the test set is labeled by the classifier that is trained with the features just described
The label update step then updates the labels with respect to the labels of the users' friends. This is done as follows:
• Label each user and all of her friends using the trained classifier. Labels are numbers in [-1, +1]; higher values show a higher confidence level
• Then update the labels of the users with the following formula for user u_i:
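$$\mathrm{label}'(u_i) = \lambda \cdot \mathrm{label}(u_i) + (1 - \lambda) \cdot \sum_{j \in F_i} w_{ij} \cdot \mathrm{label}(u_j)$$
where F_i is the set of friends of u_i, λ is the mixing weight, and the weight w_ij is either mention-based
$$w_{ij} = \frac{\mathrm{mentions}_{ij}}{\sum_{k \in F_i} \mathrm{mentions}_{ik}}$$
or based on the follower-to-friend ratio
$$w_{ij} = \frac{\mathrm{numberOfFollowers}_j / \mathrm{numberOfFriends}_j}{\sum_{k \in F_i} \mathrm{numberOfFollowers}_k / \mathrm{numberOfFriends}_k}$$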
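A minimal sketch of the label update, assuming the mention-based weights: a user's own classifier score is mixed with the weighted average of her friends' scores. The names `mention_weights`, `update_label`, and `lam` (the mixing weight), the default `lam=0.5`, and the uniform fallback when a user has no mentions are all assumptions for illustration, not details given in the slides.

```python
def mention_weights(mentions):
    """Normalize one user's mention counts per friend so the weights sum to 1.

    mentions maps friend id -> number of times the user mentioned that friend.
    """
    total = sum(mentions.values())
    if total == 0:
        # Assumed fallback (not from the slides): uniform weights.
        n = len(mentions)
        return {friend: 1.0 / n for friend in mentions} if n else {}
    return {friend: count / total for friend, count in mentions.items()}

def update_label(own_label, friend_labels, weights, lam=0.5):
    """Mix a user's own label with the weighted labels of her friends."""
    if not weights:
        # A user without friends keeps her classifier label (assumption).
        return own_label
    neighborhood = sum(w * friend_labels[friend] for friend, w in weights.items())
    return lam * own_label + (1 - lam) * neighborhood

weights = mention_weights({"friend_a": 3, "friend_b": 1})
new_label = update_label(
    own_label=0.2,
    friend_labels={"friend_a": 1.0, "friend_b": -1.0},
    weights=weights,
    lam=0.5,
)
```

With `lam` close to 1 the update leaves the classifier label almost unchanged; with `lam` close to 0 the label is driven almost entirely by the friends.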
Three binary classification tasks:
Detecting political affiliation
Democrat or Republican
• 5169 Democrats and 5169 Republicans
• 1.2 million friends
Detecting ethnicity
African American or Not
• 3000 African Americans and 3000 Not African Americans
• 508K friends
Following a business
Following Starbucks or Not
• 5000 Starbucks followers and 5000 non-followers
• 981K friends
The combined HYBRID model achieves the best result among the three tasks; however, the increase over the ML model alone is not significant
Social network features are very successful. This is because users with a particular political view tend to be friends with users holding similar views. Supporting this, the graph-based label update is also very successful on its own
The social graph update is not as successful as in the political affiliation task, since following Starbucks does not create friendships
Profile features are very successful alone
Linguistic features are also successful alone
The HYBRID method still does not improve significantly on the ML system alone
The HYBRID method fails: it performs worse than the ML model alone
Social network features perform very poorly!
As in the Starbucks task, ethnicity does not form a community. Hence, social network features and the graph-based update give very low results
The best single-feature results come from the linguistic features. Linguistic features always have a point!
#1 The ML method alone is mostly good enough, and the update part of the architecture does not bring significant improvement. If the task allows users to form a community, the update function works; otherwise, it may even hurt the ML system alone, as in the ethnicity task
#2 Linguistic features are always reliable
The novelty of combining the two types of information is attractive; however, there are serious points that should be criticized.
First of all, the classifier does only binary classification, and nothing is said about multi-class classification. Doing multi-class classification with binary classifiers is time-consuming and weakens the claim about scalability.
As said, the idea of the novel architecture is attractive; however, the results show that the label update does not work well. Why? They did not give any appreciable comment on why the label update does not work well. This, I believe, shows that the feature set and the novel architecture are not well studied.
There are too many features, but the reasons why these features were selected are not given.
Moreover, applying the same ML model to the users and their friends replicates the information.
Obviously, connected users will have some common and some different attributes; what is the point?
The social graph should be used more effectively. I think it should not be used to update the labels but as a heavily weighted feature in the ML model. This is because we should superpose different information types instead of using one to compensate for the other. You can see the difference by thinking in terms of a vector space: updating means spanning the same vector again, while superposing means using both vectors concurrently. For example, proto words could have been extracted using the network, somehow.
They mentioned Gradient Boosted Decision Trees (GBDT) but explained nothing about this classification algorithm; an explanation of GBDT is expected, at least in principle. The same holds for the Latent Dirichlet Allocation (LDA) language model. This is the first time I have heard of this language model, and they said nothing about LDA. It is only said that LDA is used as the language model and is associated with topics. But what is LDA, and how is it associated with topics?
There is no data analysis, a very crucial shortcoming of the paper; everything is data! They only gave the number of users used in training, but what about the test set? The development set? Any other statistics about the data? Moreover, they used a different number of samples for each task. The success of the label update is much lower for the ethnicity task than for the political affiliation task; however, there are 1.2M friends for the political affiliation task but almost half as many, 508K, for the ethnicity task. Hence the cross-task comments are not reliable.
The system they built has a strong constraint, indeed: it is language dependent (English). For example, the features based on frequencies of proto words will not work for Turkish due to its agglutinative nature, with many inflected forms of the same word: masayı, masada, masanın, masalardakilerin, etc. (A stemmer will most probably be needed.)
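The point about inflected forms can be made concrete with a toy count. The sketch below compares word frequencies before and after a deliberately crude stemmer; `TOY_SUFFIXES` is a hypothetical, hand-picked list that only covers three of the example forms and is in no way a real Turkish stemmer.

```python
from collections import Counter

# Hypothetical suffix list, chosen only to cover the forms in this example.
TOY_SUFFIXES = ("nın", "da", "yı")

def toy_stem(word):
    """Strip the first matching suffix from the toy list, if any."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

tokens = ["masayı", "masada", "masanın"]

# Without stemming, every inflected form is a separate word with count 1.
surface_counts = Counter(tokens)

# With stemming, all three forms collapse onto the same stem.
stem_counts = Counter(toy_stem(token) for token in tokens)
```

On surface forms no word occurs more than once, so a frequency-based proto-word list would miss "masa" entirely; after stemming it is counted three times.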
The experiments are not done in a structured way. They have just run the experiments and shown the results. There is no useful commentary. Besides, they did not explain why they chose these experiments. For example, I would want to see the performance of feature subsets: since single features mostly give very good results on their own, some subset may increase the overall HYBRID result.