Analyzing Social Media Systems
CHI Course 2013
Emre Kıcıman, Shelly Farnham
[email protected], [email protected]
[Title image credit: Douglas Wray - http://instagr.am/p/nm695/ @ThreeShipsMedia]

What is social media? People talking and interacting, publicly or semi-publicly, often about the quotidian, but not necessarily.

Why? Learn About the World
People (1) observe and (2) tweet. But people are not perfect sensors, for many reasons.

Let's Learn About Donuts
[Bar chart: Where do people get donuts?]
[Bar chart: What do people drink with donuts?]
[Bar chart: What kinds of donuts do people eat? Kinds include potato, powdered, chinese, maple, jam, maple bacon, old fashioned, glazed, jelly.]
Data for all three charts: one week of tweets mentioning "donut" or "doughnut", ~180k tweets during the week of Feb 6-12, 2012.

Beyond Donuts…
Drugs, diseases, and contagions:
* "You Are What You Tweet: Analyzing Twitter for Public Health", Paul and Dredze, 2011 -> symptoms and medication usage, tracking illness over time, behavioral risk factors
* "Predicting Disease Transmission from Geo-Tagged Micro-Blog Data", Sadilek, Kautz and Silenzio, 2012 -> studies disease transmission in the physical world based on the location traces of sick & healthy people
Public sentiment: political and election indices, market insights. Everyday life.
Why use social media? It is cross-domain / open-domain, and large-scale, fine-grained, naturalistic.

Why? Learn How People Interact with Each Other
* What are the common conventions in interactions? Ex. "Unsupervised Modeling of Twitter Conversations", Ritter, Cherry and Dolan, 2010
* How do people's interactions impact each other? How do norms form? Ex. "The Birth of Retweeting Conventions on Twitter", Cha, Gummadi, Kooti, Mason and Yang, 2012
* How do communities organize themselves? Ex. social media usage in the context of war, disasters and crises: Starbird et al. 2010; Al-Ani, Mark & Semaan 2010; Monroy-Hernandez et al. 2012

Why? Learn How the System Influences People
Ex. "Feed Me: Motivating Newcomer Contribution in Social Network Sites", Burke, Marlow and Lento, 2009. Plus, a case study a bit later on return visits from first-time users.

Recap: Social Media Analyses
Social media captures people talking and interacting with each other, publicly, on a wide variety of topics. We would like to study it to learn about the world… about how people interact with each other… and about the role of the system in influencing these interactions.

Preliminaries

Speaker Bio: Shelly Farnham
* Specializes in social technologies: social networks, community, identity, mobile social
* Early-stage innovation: extremely rapid R&D cycles -> study, brainstorm, design, prototype, deploy, evaluate (repeat)
* Convergent evaluation methodologies: usage analysis, interviews, questionnaires
* Career: PhD in Social Psychology from UW; 7 years at Microsoft Research (Virtual Worlds, Social Computing, Community Technologies); 4 years in the startup world (Waggle Labs consulting, Pathable); 2 years at Yahoo!; FUSE Labs, Microsoft Research (Personal Map)

Speaker Bio: Emre Kıcıman
* Specializes in social data analytics: social media, analytics and search
* Focus: 1. improving our analytical capabilities; 2. extracting information about the world from social media; 3. reasoning about social media biases and reinforcing useful signals
* Career: Ph.D. in computer science from Stanford University, '05; 7 years at the Internet Services Research Center, Microsoft Research
Outline
Part I: Introduction and conceptual framework
1. Introduction and preliminaries
2. Basic model for interaction through social media
3. A processing pipeline for analyzing social media
Part II: Case studies
4. Social responses and engagement on So.Cl
5. Population biases in political tweets
6. Studying self-reporting bias by comparing tweet rates to ground truth
7. Annotating graph structures with discussion context to interpret high-level graph analysis results

Basic Model of Social Media Interaction
Bob sends Alice a message. The effect on Alice might be:
* "Wow, Bob is ______!"
* "I should donate money to _____!"
* Write back to Bob...
* Forward Bob's message to others…
The effect depends on many factors:
* the relationship <Alice, Bob>
* the content of the message
* Alice's environment -> social context, current tasks

From Bob's Point of View
Which message (Message1? Message2? Message3?) will have the desired effect on Alice? Bob needs feedback!

A Bit More Formally…
Assumption: a source writes a message to effect a recipient. The effect might be to project a persona, to elicit engagement or action, or to build social capital [cite communications theory; social capital; etc.]. (Of course, this is not always true; sometimes messages are written for their effect on the writer, e.g., cathartic writing.)

Let the effect be a function of relationship, message and context:

    $E = F(\langle s, r \rangle, m, e)$

* $\langle s, r \rangle$ represents the relationship between a source and a recipient, e.g., "close friends", "authority/expert"
* $m$ represents the message content and style
* $e$ represents the environment in which the message is received, e.g., Facebook or LinkedIn; it also includes broader social norms, etc.

Then a rational source, trying to achieve an effect $E^*$, will select messages based on:

    $\arg\min_{m \in M} \, |E^* - \hat{F}(\langle s, r \rangle, m, e)|$

where $\langle s, r \rangle$ and $e$ are fixed and $\hat{F}$ is the source's approximation of $F$. More complex versions take into account multiple recipients, thresholds of cost & utility, etc.

Messages About the Real World…
Let's assume for simplicity that style is constant, and that content is what an author is actively choosing. Then, given a set of observations $W$ about real-world events, the author chooses a $w$:

    $\arg\min_{w \in W} \, |E^* - \hat{F}(\langle s, r \rangle, w, c)|$

If no $w$ achieves the goal $E^*$ within some threshold, the author writes nothing. The weather-bias case study is essentially investigating the relationship between features of $w$ and whether $|E^* - \hat{F}| > t$.

Bob is not the only actor, and messaging is not the only action. [Diagrams: Charlie and Justin also send messages and experience effects; Alice and Bob's messages pass through the social media system itself.]

Social Media System Role
* Alice: $E_{Alice} = F(\langle Bob, Alice \rangle, m, e)$
* Bob: $\arg\min_{m \in M} \, |E^*_{Bob} - F(\langle Bob, Alice \rangle, m, e)|$
* The social media system is trying to align parameters to achieve its own effect $E^*_{Sys}$: e.g., align the relationships $\langle S, R \rangle$ and the environment $e$, as well as the space of messages $M$ that Bob selects from.
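To make the selection rule concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption, not part of the course material: the numeric "effect" scale, the made-up approximation of F, and the candidate messages are all invented.

    # Toy sketch of the rational-source model: the author picks the message
    # whose predicted effect is closest to the desired effect E*.

    def f_hat(relationship, message, environment):
        # The source's (made-up) approximation of the true effect function F:
        # a score combining message enthusiasm and tie strength.
        enthusiasm = message.count("!") + (1 if "fun" in message else 0)
        closeness = {"close friend": 2.0, "acquaintance": 1.0}[relationship]
        return enthusiasm * closeness

    def choose_message(desired_effect, relationship, environment, candidates):
        # argmin over m in M of |E* - F_hat((s, r), m, e)|
        return min(candidates,
                   key=lambda m: abs(desired_effect - f_hat(relationship, m, environment)))

    candidates = ["Hiked Tiger Mountain.",
                  "Had fun hiking Tiger Mountain!",
                  "BEST HIKE EVER!!!"]
    print(choose_message(4.0, "close friend", "twitter", candidates))
    # -> "Had fun hiking Tiger Mountain!" (predicted effect 4.0 matches E* exactly)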
Reinforcing Real-World Signals

    $\arg\min_{w \in W} \, |E^* - \hat{F}(\langle s, r \rangle, w, c)|$

Ex. reinforcing useful signals: if we want to increase the likelihood that the world events $W^+$ we care about are reported, this model indicates several possible directions:
* investigate and optimize $E^*$, $\langle s, r \rangle$ and $c$ such that $W^+$ is reported
* design feedback that improves the approximation $\hat{F}$ so as to reinforce $W^+$

Recap: How Might This Model Help?
Basic model of social media system interactions:
* Messages have an effect on recipients, conditioned on various factors.
* Authors choose their messages to have some desired effect.
* Social media systems today can (and do) play an active role.
Having a generation process in the back of our mind helps us recognize the biases and limitations of the data. Having a sense of the basic knobs can help us think about how to improve social media systems.

A Processing Pipeline for Analyzing Social Media

Some Basics
Clarity:
* Be clear about your purpose and the real-world problem you wish to address.
* State your question as a hypothesis.
Testability:
* In some cases, the hypothesis is tested through the social data.
* Other times, validation must lie outside the social data.
Passive and active experiments:
* If we look at questions of causality, we often need to do active experimentation.
Data analysis is one tool of many:
* user surveys, interviews, mockups, prototypes, etc.

Defining the Research Question
The amount of data is overwhelming; the more defined your question, the easier the analysis.
* What real-world problem are you trying to explore? Avoid the pitfall of technology for technology's sake.
* What argument do you want to be able to make?
* State your problem as a hypothesis.

Introducing a Running Example
Studying the relationship between activities and locations: What do people do? And where do they do it? (Note: the research question here is about the accuracy of the results, and their use in applications.)

Processing Pipeline
Collect content & interactions -> cleaning and feature extraction -> define the key context -> extract core relationships, followed by higher-level statistical, graph and machine learning analyses.

Example: "I had fun hiking Tiger Mountain last weekend" – Alice said on Monday, at 10am.
* Extracted features: Location: Tiger Mountain; Mood: Happy; Activity: Hiking; Name: Alice; Gender: Female; Post Time: Mon 10am; Activity Time: {Sat-Sun}
* Core relationship: Location: Tiger Mountain <-> Activity: Hiking, annotated with the remaining features
* Followed by higher-level graph and machine learning analyses on the combined structure and context…

1. Collection
Instrumentation:
* Avoid the tendency to collect everything without organization.
* Validate logging -> untested instrumentation is prone to bugs.
* Design for key scenarios: make it easy to get data for key questions up front.
Streaming and search APIs:
* Easy to use; appropriate for many experiments.
* Often rate-limited, but can build large-scale data over time (see the sketch below).
Crawling:
* More effort, but can grab historical data.
* Some sites will block crawlers.
In all cases, do consider user privacy and expectations.
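As a concrete illustration of "rate-limited, but can build large-scale data over time", here is a minimal collection-loop sketch. The `search_api` function is a hypothetical stand-in for whatever client library or endpoint you actually use; only the keep-state, persist-raw-data, back-off loop structure is the point.

    import json
    import time

    def search_api(query, since_id):
        # Hypothetical stand-in for a rate-limited search endpoint.
        # Returns (list_of_messages, new_since_id); raises on rate-limit errors.
        raise NotImplementedError

    def collect(query, out_path, window_seconds=60):
        # Build a large-scale dataset over time despite rate limits:
        # poll, persist raw JSON, remember the high-water mark, back off on errors.
        since_id = None
        with open(out_path, "a", encoding="utf-8") as out:
            while True:
                try:
                    messages, since_id = search_api(query, since_id)
                except Exception:
                    time.sleep(window_seconds)  # back off, then retry after the rate window
                    continue
                for msg in messages:
                    out.write(json.dumps(msg) + "\n")  # keep raw data; clean later
                time.sleep(window_seconds)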
To consider during collection and signal extraction:
* Filters: time span, type of person, type of actions
* Sampling: random selection; snowballing, to get a complete picture of a person's social experience
* Consider your research questions and how you want to generalize

2. Cleaning & Feature Extraction
Clean once: remove irrelevant raw data (what counts as irrelevant depends on your research question):
* spammy users, people who were never active
* geographic or temporal filtering
* When you remove a user, message, or action, think about whether to remove the associated data (e.g., you might want to keep a spammy user's interactions with other, non-outlier users).
Feature extraction: derive cleaner feature values from the raw data:
* absolute date-time stamp -> HourOfWeek, isWeekDay, …
* entity recognition from text
* user classification, …
This is also a good stage to bring in external data.
Implication: while developing your analysis process and feature extraction, you have to be inspecting the raw data and the feature results.
Clean again: look for outliers and remove feature values that are not dependable. Keep samples of raw data for distinct feature values to make inspection easier.

3. Pick the Defining Context of Relationships
After extracting features from our social media, we want to reason about the relationships among those features. What defines a relationship?
One common choice: context == co-occurrence within the same message. For example, all the features of "I had fun hiking Tiger Mountain last weekend" (Location: Tiger Mountain; Gender: Female; Mood: Happy; Activity: Hiking; Name: Alice; Post Time: Mon 10am; Activity Time: {Sat-Sun}) are related to each other, while Bob's features (Gender: Male; Name: Bob; Post Time: Fri 3pm) are not related to Alice's.
Other common choices:
* User as defining context -> two things are related if they are associated with the same user. Common in recommender systems. Ex. the Livehoods study of neighborhood boundaries.
* Location as defining context -> two things (users, actions, …) are related if they co-occur at the same physical location. Ex. Sadilek's study of disease transmission.

4. Extracting Core Relationships
* Focus on the core relationships among the domains of interest (e.g., Location: Tiger Mountain <-> Activity: Hiking).
* Strength is defined by how frequently items co-occur in the key context.
* The statistical distribution of the other features (e.g., gender) annotates the core relationships.
* Iterate on the "core relationships". (A sketch of this step appears at the end of this section.)

Higher-Level Algorithms
* Statistical tests: e.g., test that relationships are statistically significant, or that two items are statistically different from one another.
* Graph analyses: e.g., clique finding, graph clustering, path algorithms, network centrality, …
* Machine learning: e.g., classifiers, clustering, etc., based on the graph relationships or annotations.

Some Usage Scenarios
* Build "profiles" of things based on the words and sentiment used around them.
* Build demographics of places and concepts based on who is talking about them.
* Build a co-mention graph among entities, people, places, etc.
* Build profiles of users based on what they talk about and how they express themselves.
* Include "time" in the projection, and see how profiles change over time.
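A minimal sketch of steps 3 and 4, assuming co-occurrence within the same message as the key context. The feature dictionaries below are invented examples in the spirit of the Tiger Mountain running example.

    from collections import Counter, defaultdict
    from itertools import combinations

    # Each record is the feature dictionary extracted from one message.
    messages = [
        {"location": "Tiger Mountain", "activity": "hiking", "gender": "f", "mood": "happy"},
        {"location": "Tiger Mountain", "activity": "hiking", "gender": "m"},
        {"location": "Pike Place", "activity": "shopping", "gender": "f"},
    ]

    edge_strength = Counter()            # how often two feature values co-occur
    edge_context = defaultdict(Counter)  # distribution of the other features on each edge

    for feats in messages:
        items = sorted(feats.items())    # sort so each edge has a canonical order
        for (k1, v1), (k2, v2) in combinations(items, 2):
            edge = ((k1, v1), (k2, v2))
            edge_strength[edge] += 1
            for k, v in feats.items():   # remaining features annotate the edge
                if k not in (k1, k2):
                    edge_context[edge][(k, v)] += 1

    edge = (("activity", "hiking"), ("location", "Tiger Mountain"))
    print(edge_strength[edge])   # 2 co-occurrences -> relationship strength
    print(edge_context[edge])    # gender/mood distribution annotating the edge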
Part II: Case Studies

Selection of Case Studies
* Social responses and engagement on So.Cl: a clear & simple study of user interactions.
* Population biases in political tweets: extracting basic features from tweets; demonstrates the complexity of population biases.
* Studying self-reporting bias by comparing tweet rates to ground truth: an example of building a domain classifier; a methodology for studying reporting bias.
* Annotating graph structures with discussion context to interpret high-level graph analysis results: applies higher-level graph analyses to graphs of discussion topics; shows how discussion context can be useful at different layers.
* (Bonus) Statistical language modeling to analyze language differences across user populations.

Case Study: Usage Analysis of So.Cl
So.cl is an experimental web site that allows people to connect around their interests by integrating search tools with social networking.
Study: How important are social interactions in encouraging users to become engaged with an interest network?
We'll start with some background on So.Cl, and then dive into the case study.

So.Cl: reimagining search as social from the ground up. Search + sharing + networking = informal discovery and learning.
History:
* Oct 2011: pre-release deployment study
* Dec 2011: private, invitation-only beta
* May 2012: removed invitation restrictions
* Nov 2012: over 300K registered users, 13K active per month
Try it now! http://www.so.cl

So.Cl as Interest Network
Find others around common interests; be inspired by new interests; learn from each other through these shared interests. Search & post.

How It Works
Feed, feed filters, people. Try it now! http://www.so.cl – use the facsumm tag.
Search (Bing) + post building experience:
* Step 1: Perform a search.
* Step 2: Click on items in the results to add them to a post.
* Step 3: Add a message.
* Step 4: Tag.
[Screenshots: post builder and results.]

So.Cl as Research Platform
* Collect, create, consume: interests, visual post builder, stream, video parties, adding links, Explore page, search interests, Interest page
* Connect: interest network, following, simple profiles, wall messages, people list
* Collaborate: riffing, liking, commenting
* Discovery network, social search
Goal: increasing engagement, community, learning, innovation.

So.cl Research Dataset Program
Access to public So.cl behavioral data for research purposes, to foster research in interest networking, social search, and community development: http://fuse.microsoft.com/research/srd

Case study hypothesis: If people receive a social response when they first join So.cl, they are more likely to become engaged.
Measuring social/behavioral constructs:
* "When they first join": first session = from the time of the first action to the time of the last action prior to an hour of inactivity.
* Social responses: someone follows the user, likes the user's post(s), or comments on the user's post(s).
* Engagement = coming back: a second session = any action occurring 60 minutes or more after the first session.
Restating the hypothesis: If people receive follows, likes, and comments in their first session, they are more likely to come back for a second session.

Pipeline stage: Collect content & interactions
A simple, common instrumentation schema, kept in a database:
* Users table: one row per user; includes creation time and other metadata.
* Content table: one row per content item; includes text, URLs, etc.
* Actions table: one row per action; filter out non-meaningful, non-user-generated actions. Actions capture user interactions and context.
Always look at your raw data: play with it, ask yourself if it makes sense, test!

Pipeline stage: Cleaning and feature extraction
* Filters: time span, type of person, type of actions.
* Sampling: random selection; snowballing, to get a complete picture of a person's social experience.
* Consider your research questions and how you want to generalize.
For this study:
* filtered out administrators/community managers
* new users only
* date range: Sept 28 to Oct 13
* 100% sample for that time span: 2,462 people

Systematic biases in social systems #1: If you want to understand your "typical" users, keep in mind that a large percentage generally never become active or return; this "kicking the tires" unduly biases averages. A common reporting format: X% performed behavior Y, and of those, they averaged Z times each. For example: 5% commented on a post in their first session, averaging 5 comments each.

Outliers: filtered out 13 outlier people with z > 4 in number of actions (among users who did more than sign in).

Systematic biases in social systems #2:
* A small percentage are "hyper-active" users (avid users, spammers, trolls, administrators) and can unduly bias averages -> remove outliers.
* A substantial percentage are consumers but not producers ("lurkers"); there is often no signal for lurkers. So.cl has about 75% lurkers. (Custom instrumentation logs sign-ins; web analytics capture clicks.)

It is very important to spend time examining the data: descriptives, frequencies, correlations, graphs. Use a tool that easily generates graphs and correlations. Does it make sense? If not, really chase it down; it is often a bug or a misinterpretation of the data. (A sessionization sketch follows below.)
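A minimal sketch of the session and outlier definitions above (sessions split on a 60-minute inactivity gap; users dropped at z > 4 on action count). The data layout, per-user lists of epoch-second timestamps, is an assumption for illustration.

    from statistics import mean, stdev

    GAP = 60 * 60  # 60 minutes of inactivity ends a session

    def sessions(action_times):
        # Split one user's sorted action timestamps (epoch seconds) into sessions.
        out, current = [], [action_times[0]]
        for t in action_times[1:]:
            if t - current[-1] >= GAP:
                out.append(current)
                current = []
            current.append(t)
        out.append(current)
        return out

    def drop_hyperactive(users, counts, z_max=4.0):
        # Remove outlier users with z > 4 on number of actions.
        mu, sd = mean(counts), stdev(counts)
        if sd == 0:
            return list(users)
        return [u for u, c in zip(users, counts) if (c - mu) / sd <= z_max]

    times = [0, 600, 1200, 50_000, 50_300]   # gap of ~13.5 hours -> two sessions
    print(len(sessions(times)))              # 2
    print(sessions(times)[0])                # first session: [0, 600, 1200]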
Feature: Active sessions
An active session = a period of (public) activity with a 60-minute gap of no activity before and after.
* 91% of users had only one active session.
* Sessions were, on average, 34.6 hours apart.
* The first session averaged 1.6 minutes.

Feature: User actions
Number of posts in the first session: 8% created a post in their first session; of those, they averaged 1.5 posts each. [Chart: actions in the first session.]

Feature: Coming back
9.1% came back for another active session (~25% including inactive sessions), on average 35 hours later.

Aggregation: merging down for summarization
What is your level of analysis? Person, group, network, content types. If the person is the unit of analysis, aggregate measures to the person level (e.g., in SPSS: one line per person). It is very important to have the appropriate unit of analysis, to avoid bias in the statistics. [Slide: SPSS aggregation syntax.]

Preliminary correlations: always ask, does this pattern make sense?

In the first session: How often is the user the target of social behavior? 23% received some response before their 2nd session: 3% of those who did not create a post, 37% of those who did. [Charts: responses *during* the first session and *in between* the 1st and 2nd sessions.]

Predictors of coming back
Social responses inspire people to return to the site, especially if they occur during the first session. [Chart groups: N = 2273, N = 179, N = 1942, N = 510.] Social responses to a user: following, commenting on a post, liking a post, liking a comment, riffing.

Which response matters
Logistic regression: any response predicts coming back.

    Predictor                       B      S.E.   Sig.
    Created post in first session   .71    .20    .000
    Response during first session   1.12   .21    .000
    Response after first session    .60    .17    .000

Logistic regression: which responses predict coming back.

    Predictor                       B      Sig.
    Created post in first session   .95    .000
    Followed                        .92    .003
    Commented on                    .38    ns
    Post liked                      .87    .02
    Comment liked                   -.09   ns
    Messaged                        -.09   ns
    Riffed                          .00    ns

(A statsmodels sketch of this kind of regression appears at the end of this case study.)

Identifying subgroups
Factor analysis of associated behaviors (principal components, varimax rotation, meaning the components are forced to be orthogonal) finds three types of usage: creating, socializing, browsing.

    Component matrix     Creators   Socialites   Browsers
    (% of variance)      32%        12%          9%
    Created post         .86        .17          .10
    Invited              .01        -.16         .63
    Followed             -.03       .10          .37
    Added item to post   .83        .08          -.06
    Searched             .81        .03          .17
    Commented            .36        .64          .09
    Liked post           .15        .58          .32
    Liked comment        .13        .80          .06
    Messaged             -.09       .50          -.08
    Viewed person        .22        .47          .48
    Navigated to All     .51        .37          .53
    Joined party         .17        .09          .68

The factors about equally predict whether a user comes back:

    Factor        Beta   t      Sig.
    Creating      .14    5.28   .000
    Socializing   .07    2.61   .000
    Browsing      .19    7.20   .000

Browsing is a stronger predictor of overall activity level:

    Factor        Beta   t      Sig.
    Creating      0.20   7.89   0.00
    Socializing   0.17   6.58   0.00
    Browsing      0.29   9.07   0.00
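The regressions above were run in SPSS; a roughly equivalent sketch in Python with pandas and statsmodels might look like the following. The file and column names are assumptions for illustration.

    import pandas as pd
    import statsmodels.api as sm

    # One row per user (person as the unit of analysis); column names assumed.
    df = pd.read_csv("socl_users.csv")
    y = df["came_back"]                     # 1 if the user had a second session
    X = sm.add_constant(df[["created_post_first_session",
                            "response_during_first_session",
                            "response_after_first_session"]])

    model = sm.Logit(y, X).fit()
    print(model.summary())                  # B, S.E., and significance per predictor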
Case Study: Population Biases in Political Tweets
There was a significant amount of political discussion on Twitter during the US election season in Summer/Fall 2012.
Case study question: Is the population of tweeters representative of US demographics along two demographic axes, gender and geography?
Why this case study? It is a good illustration of simple extractors for gender and location, and of simple methods for identifying topics. More fundamentally, it highlights the challenges of dealing with population biases.

Pipeline stage: Collect raw social media
Collected all tweets during August – November 2012 that mentioned "Obama", "Romney" or other politician names. Inspecting the raw data, we removed some common names and issue phrases from the collection.

Feature: Gender
A simple gender classifier based on the first name in the Twitter user's profile.
* Approach: look up the first name in a weighted gender map built from census data and other sources. (A sketch of this lookup appears at the end of this pipeline walkthrough.)
* Practical results: ad hoc inspection is positive; coverage is 60-70%, depending on the domain. The remainder are organizations and ambiguous names.
* Still requires: an accuracy evaluation based on ground-truth data.

Feature: Location
Map from self-declared user profile locations to lat-lon regions.
* Approach: use a mapping learned from the small % of tweets that are geocoded, and cluster the mapped geo-locations into city-size areas.
* Practical results: maps to metropolitan-area-size regions; learns official location names as well as abbreviations, nicknames, etc.; automatically identifies non-specific locations. Coverage is 60-70%, depending on the domain; the remainder have non-specific locations or "tail" locations not covered in the training set.

Example results:

    Location cluster                            Example members
    New York                                    "NYC", "Yonkers", "manhattan", "NY,NY", "Nueva York", "N Y C", "The Big Apple"
    Los Angeles                                 "Laguna beach", "long beach", "LosAngeles,CA", "West Los Angeles, CA", "Downtown Los Angeles", "LAX"
    Filtered out due to ambiguity (large area)  "World", "everywhere", "USA", "California", …

Location detection alternatives:
1. Use geo-tagged tweets. Most appropriate when you need fine-grained locations per tweet (e.g., user tracking), but the trade-off is that a very small % of tweets are geocoded.
2. Much recent research on location inference. The state of the art uses textual references to known locations to identify user location. Our mapping technique is a little coarser-grained, but simpler.

Feature: Politician mention
* Approach: exact match on well-known, unambiguous politician names.
* Still needs: domain classification and/or stronger entity linking to recognize ambiguous names. For example, "Mitt" is likely Mitt Romney in a political context, but not otherwise.

Pipeline stage: Define the key context
The key context is the tweet itself. We will assume a relationship among features if they co-occur in the same tweet; the relationship is stronger if they co-occur across many tweets.
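Returning to the gender feature above: a minimal sketch of the first-name lookup, assuming a name -> P(female) map. The tiny map and the 0.9 threshold here are stand-ins; the real map was built from census data and other name sources.

    # Tiny stand-in for a census-weighted first-name gender map:
    # first name -> probability that the name belongs to a female person.
    GENDER_MAP = {"mary": 0.99, "james": 0.01, "taylor": 0.55}

    def classify_gender(profile_name, threshold=0.9):
        # Use the first token of the profile name; return 'f', 'm',
        # or None for organizations and ambiguous names (no coverage).
        first = profile_name.strip().split()[0].lower() if profile_name.strip() else ""
        p_female = GENDER_MAP.get(first)
        if p_female is None:
            return None           # not a known first name (e.g., an organization)
        if p_female >= threshold:
            return "f"
        if p_female <= 1 - threshold:
            return "m"
        return None               # ambiguous name (e.g., "Taylor")

    print(classify_gender("Mary Smith"))   # 'f'
    print(classify_gender("Acme News"))    # None: not in the name map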
Pipeline stage: Extract core relationships
For example, we extract two sets of relationships:
1. Politician mentions per day: the strength of the relationship indicates the volume of discussion about a given politician on a given day, and the discussion context summarizes gender and location for each day.
2. Politician mentions over all time: the discussion context summarizes gender and location over all time.

Gender bias
[Chart: gender distribution of the authors of tweets mentioning Obama, 8/29/12 – 10/13/12.] The gender distribution equalizes during high-volume events like the DNC.

Geographic bias in political tweets
Geographic distribution of tweets mentioning Obama during the 2012 elections:

    Metro-area        Tweets    % of tweets   Actual population
    New York, NY      141,878   10%           22,000,000
    Washington, DC    135,347   9%            8,500,000
    Los Angeles       68,676    5%            12,800,000
    Chicago           47,130    3%            9,800,000
    Atlanta, GA       45,475    3%            5,200,000
    Houston, TX       35,956    2%            2,100,000
    Boston, MA        34,363    2%            7,600,000

[Chart: Election 2012 moods over time for Obama.]

Case Study: Studying Self-Reporting Bias by Comparing Tweet Rates to Ground Truth
Background: the frequency of discussion about events does not directly reflect the real-world frequency of occurrence. We may assume that the bias is constant for a given kind of event, but not across different kinds of events, so we can make few inferences about the relationship between distinct events through social media analysis alone.
Study: compare tweet rates about the weather to ground-truth data about the weather.
Why this case study? It is an easy example of domain identification and ambiguity resolution in the cleaning stage, and a good illustration of self-reporting bias.

Self-reporting bias
We study reporting bias by comparing tweet rates about the weather to ground-truth weather data. Does the weather's extremeness, its changes, or its unexpectedness affect tweet rates? [Kıcıman, ICWSM 2012]

[Chart: weather-related tweet rate (daily tweet count, log scale) and daily maximum temperature (C) in San Diego, CA from Sep. 1 to Oct. 15, 2010, with a thunderstorm and the hottest day annotated.]

Pipeline stage: Collect raw social media
Collected 12 months of tweets that mentioned weather-related words (e.g., "rain", "snow", "sun", "heat", …). The word list was built by hand from weather glossaries, dictionaries, etc.

Example "weather" tweets:
* "Woke up to a sunny 63F (17C) morning. It's going to be a good day :)"
* "The rainy season has started."
* "The inside of our house looks like a tornado came through it."
* "Japan, Germany hail U.N. Iran sanctions resolution"
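The last two examples show why raw keyword matching over-collects ("tornado" as a metaphor, "hail" as a verb). A minimal sketch of the keyword collection pass, with a tiny stand-in for the hand-built 179-word weather list:

    # Tiny stand-in for the 179-word weather list from glossaries and dictionaries.
    WEATHER_WORDS = {"rain", "rainy", "snow", "sun", "sunny", "heat", "hail", "tornado"}

    def mentions_weather(text):
        # Keyword pass: keep a tweet if any token matches the word list.
        # Over-collects: "Japan, Germany hail U.N. ..." matches "hail".
        tokens = {w.strip(".,!?:;()\"'").lower() for w in text.split()}
        return bool(tokens & WEATHER_WORDS)

    print(mentions_weather("Japan, Germany hail U.N. Iran sanctions resolution"))  # True

The domain classifier described next is what resolves this ambiguity.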
Domain Classifier
We used a language-based classifier with a simple Bayes model, scoring each tweet by

    $\frac{1}{|T|} \sum_{t \in T} P(\mathrm{weather} \mid t)$

where $T$ is the set of features (all pairs of co-occurring words within a tweet, regardless of order), and

    $P(\mathrm{weather} \mid t) = \frac{1 + C_{\mathrm{weather}}(t)}{1 + C(t)}$

We also apply simple stemming of words, removing '-s' and '-ing' suffixes. (A sketch of this scoring appears at the end of this case study.)

Domain classifier: labeling
We labeled 2,000 tweets manually (2 labelers) to create a "gold" training/test set. The main challenge of labeling was maintaining strong, consistent criteria. For example: what about "incidental" mentions of the weather, or mentions of the weather someplace else? Slightly less complicated: mentions of the weather in proverbs ("when it rains it pours").

Domain classifier results
The classifier achieves an F-score of 0.83, with a precision of 0.80 and a recall of 0.85. Is this good? In general, the precision/recall will depend heavily on the domain and on the collection criteria for the tweets.

Feature: Location
Extracted as described in the politics case study.

Pipeline stage: Feature extraction
Add derived weather features from external (non-social) data: extremeness, expectation, and change, calculated from the weather station nearest to the median location within each metropolitan area.

Data preparation
* 12 months of tweets, June 2010 – June 2011
* 130M tweets include a weather-related word (179 words from weather glossaries, etc.)
* 71M tweets pass a Bayesian classifier trained on 2k labeled tweets
* 8M tweets geolocated to 56 US cities, using geo-tagged tweets to learn a mapping from profile locations

Pipeline stage: Define the key context
The key context in this case is the location-day pair; this also defines the core relationship. What we are most interested in is the count of tweets per location-day and the weather features per location-day.

Correlation analysis
Linear regression on the derived features with L2 regularization:

    Model features        Global R2 correlation   Local R2 correlation
    Basic weather         0.30                    0.45
    Expectation + basic   0.33                    0.70
    Change + basic        0.35                    0.71
    Extreme + basic       0.40                    0.70

Granger analysis
Percentage of cities where each feature Granger-causes tweet rates: Extreme 98.2%, Basic 85.7%, Expectation 66.1%, Change 57.1%.
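Before moving on, a sketch of the domain classifier's scoring rule described earlier in this case study: features are unordered pairs of co-occurring words, lightly stemmed, each scored by smoothed counts. The tokenization and stemming details here are assumptions; the counts would come from the 2k labeled tweets.

    import re
    from itertools import combinations

    def word_pairs(text):
        # Features: all unordered pairs of co-occurring words in a tweet,
        # with simple stemming (strip -s / -ing suffixes).
        words = [re.sub(r"(ing|s)$", "", w) for w in re.findall(r"[a-z']+", text.lower())]
        return set(combinations(sorted(set(words)), 2))

    def p_weather(pair, weather_counts, total_counts):
        # P(weather | t) = (1 + C_weather(t)) / (1 + C(t)), with smoothed counts
        return (1 + weather_counts.get(pair, 0)) / (1 + total_counts.get(pair, 0))

    def score(text, weather_counts, total_counts):
        # Average P(weather | t) over the tweet's feature set T.
        pairs = word_pairs(text)
        if not pairs:
            return 0.0
        return sum(p_weather(p, weather_counts, total_counts) for p in pairs) / len(pairs)

    # weather_counts / total_counts would be built by counting each word-pair
    # feature in the weather-labeled tweets and in all labeled tweets.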
Case Study: Activities & Locations
Question: Given a set of related locations inferred from social media, what can we tell about why they are related?
Why this case study? It is an introduction to higher-level analyses, and to using context to interpret them.

Extracting features:
* Activities: exact match on activity names derived from search queries.
* Locations: exact match on unambiguous location names from Wikipedia articles.
Key context == the tweet.
Extract core relationships among locations, then look at the contextual statistics of the discussions. For two pseudo-cliques of related NYC locations, a "tourist" clique and a "midtown worker" clique:

    Context            NYC tourist clique   NYC midtown-worker clique
    Gender: Male       49%                  63%
    Gender: Female     33%                  23%
    Metro-area: NYC    33%                  54%
    Metro-area: Other  67%                  46%
    Mood: Joviality    56%                  49%
    Mood: Fear         14%                  13%
    Mood: Sadness      11%                  15%
    Mood: Guilt        8%                   6%
    Mood: Fatigue      3%                   6%
    Mood: Serenity     3%                   4%
    Mood: Hostility    2%                   4%

Recap: Basic model of interaction
Bob sends Alice a message $m$: $\arg\min_{m \in M} \, |E^* - \hat{F}(\langle s, r \rangle, m, e)|$

Recap: Processing framework
Collect raw social media -> feature extraction (e.g., from "I had fun hiking Tiger Mountain last weekend" – Alice, Monday 10am: Location: Tiger Mountain; Mood: Happy; Activity: Hiking) -> define the key context -> extract core relationships (Location: Tiger Mountain <-> Activity: Hiking, annotated with Name: Alice; Gender: Female; Post Time: Mon 10am; Activity Time: {Sat-Sun}), followed by higher-level graph and machine learning analyses on the combined structure and context.

Recap: Case studies
* Social responses and engagement on So.Cl: a clear & simple study of user interactions.
* Population biases in political tweets: extracting basic features from tweets; demonstrates the complexity of population biases.
* Studying self-reporting bias by comparing tweet rates to ground truth: an example of building a domain classifier; a methodology for studying reporting bias.
* Annotating graph structures with discussion context to interpret high-level graph analysis results: applies higher-level graph analyses to graphs of discussion topics; shows how discussion context can be useful at different layers.

Summary
Social media data provides a fine-grained and large-scale representation of people's discussions and interactions with each other. We can extract information about the real world, study people's interactions with each other, and study how system design influences those interactions. But be careful: social media is generated through a complicated system, and it has many biases!

Questions? E-mail: Emre Kiciman, [email protected], http://research.microsoft.com/~emrek/

Selected dataset resources:
* So.cl dataset: http://fuse.microsoft.com/research/srd
* ICWSM datasets: http://icwsm.org/2013/datasets/datasets/
* MyPersonality project: http://mypersonality.org

Extra
(Bonus) Statistical Language Modeling to Analyze Language Differences Across User Populations

What we're doing: building and comparing language models of tweets, conditioned on various metadata features such as geography and number of followers.
Why we're doing it:
1. It is interesting in itself to find and quantify the differences in style and topic among different groups of users.
2. Analysis and information extraction from tweets is important; more accurate language models may improve algorithms for word segmentation, NER, …

Metadata classes (explicit and inferred signals):
* Geography: time zone, GPS coordinates, user-reported location
* User metadata: number of followers, number followed, total tweet count, age of account, gender, interests
* Message metadata: message length, retweet, contains URL, number of user references, time of day, #topic, well-capitalized

Twitter data set: 72M tweets gathered over ~3 days; 90% training, 10% test; these experiments focus on English tweets.

Approach:
* Partition by metadata feature (e.g., group messages by whether there's a link in them).
* Build 1- to 3-gram LMs per partition (smoothed LMs with a closed vocabulary).
* Compute cross-entropy among all partitions.
* Analyze the differences in term likelihoods among the LMs.
(A minimal sketch follows at the end of this section.)

Cross-entropy across time zones
Perplexity of the bi-gram model learned for each time zone with respect to the others (each row and column is a time zone; the diagonal, a time zone against itself, is lowest):

                Hawaii  Alaska  Pacific  Mountain  Central  Eastern  Quito  Brasilia  Greenland  London  Jakarta  Osaka  Tokyo
    Hawaii        1573    3078     3623      2795     3018     3294   3094      5591       4228    2027     6623   5051   3529
    Alaska        3506    1500     3238      2641     2866     3182   2892     11005       6496    2907     6004  12610   6477
    Pacific       2775    1894     1303      1825     2040     2222   2226     11676       6501    2493     2769  11591   5611
    Mountain      4619    4379     5263      1360     2362     2742   2824     13384       7465    2874    17897  13453   7023
    Central       4941    4655     5969      1774     1185     2009   1838     13244       7368    2695    24610  14107   6740
    Eastern       5586    5208     7244      2053     1943     1216   1767     15560       8475    2648    31850  14535   6953
    Quito         5042    4689     6539      2324     2200     2241   1153      8234       6061    2810    26049  13806   7197
    Brasilia      8063    8279    10229      5674     6230     6528   6666       724       5810    4909    28775  11331   7465
    Greenland     4437    4776     5966      3642     4006     4170   4030      1932       1536    2868    14817  11179   5962
    London        5013    5573     7160      3478     4065     4115   4266     10621       6561     917    21472  15561   7342
    Jakarta       5631    4896     5494      5298     5761     6200   6138     17000       9690    4461     1338  12107   7407
    Osaka         8276    8086     9359      6599     6944     7340   7252     16236      10461    5444    19994   1598   4495
    Tokyo         5682    5546     6589      4521     5006     5043   5222      8904       6811    3635    13864   2386   1265

Differences across time zones come in 3 kinds: geographic locations, topic variance, and dialect or spelling differences.

Cross-entropy across number of followers:

                0≤x<10   0≤x<100   0≤x<1000   x≥1000
    0≤x<10        922      2413       4528     7831
    0≤x<100      1166      1071       2477     4811
    0≤x<1000     1682      1341       1216     2317
    x≥1000       3345      2421       2804     1544

* The language models for the <10, <100, and <1000 follower groups are similar to each other.
* Differences appear for authors with more than 1000 followers.

Differences across number of followers
[Chart: relative likelihood of the pronouns "I", "my", "me", "you", "your" in comparison to a global model, for the follower buckets 0≤x<10, 10≤x<100, 100≤x<1000, and x≥1000.]
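A minimal sketch of the partition-and-compare methodology: train an add-one-smoothed bigram model per partition, then measure the cross-entropy of one partition's text under another partition's model. The smoothing and vocabulary handling here are simplifications of the closed-vocabulary smoothed LMs described above, and the example sentences are invented.

    import math
    from collections import Counter

    def train_bigram(sentences):
        # Add-one smoothed bigram counts over the partition's vocabulary.
        unigrams, bigrams = Counter(), Counter()
        for s in sentences:
            toks = ["<s>"] + s.lower().split()
            unigrams.update(toks)
            bigrams.update(zip(toks, toks[1:]))
        return unigrams, bigrams

    def cross_entropy(sentences, model):
        # Bits per token of `sentences` under `model` (lower = more similar language).
        unigrams, bigrams = model
        V = len(unigrams)
        log_prob, n = 0.0, 0
        for s in sentences:
            toks = ["<s>"] + s.lower().split()
            for prev, cur in zip(toks, toks[1:]):
                p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)  # add-one smoothing
                log_prob += math.log2(p)
                n += 1
        return -log_prob / max(n, 1)

    pacific = train_bigram(["hella traffic on the bridge", "hella sunny today"])
    eastern_test = ["the subway is packed today"]
    print(cross_entropy(eastern_test, pacific))  # high: Eastern text, Pacific model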