Analyzing Social Media Systems
CHI Course 2013
Emre Kıcıman, Shelly Farnham
[email protected], [email protected]
[Title image credit: Douglas Wray - http://instagr.am/p/nm695/ @ThreeShipsMedia]

What is social media? People talking and interacting, publicly or semi-publicly, often about the quotidian, but not necessarily.

Why? Learn About the World
People (1) observe and (2) tweet. But people are not perfect sensors, for many reasons.

Let's Learn About Donuts
[Bar chart: Where do people get donuts?]
[Bar chart: What do people drink with donuts?]
[Bar chart: What kinds of donuts do people eat? Kinds include potato, powdered, chinese, maple, jam, maple bacon, old fashioned, glazed, jelly.]
Data for all three charts: one week of tweets mentioning "donut" or "doughnut", ~180k tweets during the week of Feb 6-12, 2012.

Beyond Donuts…
Drugs, diseases, and contagions:
* "You Are What You Tweet: Analyzing Twitter for Public Health", Paul and Dredze, 2011 -> symptoms and medication usage, tracking illness over time, behavioral risk factors
* "Predicting Disease Transmission from Geo-Tagged Micro-Blog Data", Sadilek, Kautz and Silenzio, 2012 -> studies disease transmission in the physical world based on the location traces of sick & healthy people
Public sentiment: political and election indices, market insights. Everyday life.
Why use social media? It is cross-domain / open-domain, and large-scale, fine-grained, naturalistic.

Why? Learn How People Interact with Each Other
* What are the common conventions in interactions? Ex. "Unsupervised Modeling of Twitter Conversations", Ritter, Cherry and Dolan, 2010
* How do people's interactions impact each other? How do norms form? Ex. "The Birth of Retweeting Conventions on Twitter", Cha, Gummadi, Kooti, Mason and Yang, 2012
* How do communities organize themselves? Ex. social media usage in the context of war, disasters and crises: Starbird et al. 2010; Al-Ani, Mark & Semaan 2010; Monroy-Hernandez et al. 2012

Why? Learn How the System Influences People
Ex. "Feed Me: Motivating Newcomer Contribution in Social Network Sites", Burke, Marlow and Lento, 2009. Plus, a case study a bit later on return visits from first-time users.

Recap: Social Media Analyses
Social media captures people talking and interacting with each other, publicly, on a wide variety of topics. We would like to study it to learn about the world… about how people interact with each other… and about the role of the system in influencing these interactions.

Preliminaries

Speaker Bio: Shelly Farnham
* Specializes in social technologies: social networks, community, identity, mobile social
* Early-stage innovation: extremely rapid R&D cycles -> study, brainstorm, design, prototype, deploy, evaluate (repeat)
* Convergent evaluation methodologies: usage analysis, interviews, questionnaires
* Career: PhD in Social Psychology from UW; 7 years at Microsoft Research (Virtual Worlds, Social Computing, Community Technologies); 4 years in the startup world (Waggle Labs consulting, Pathable); 2 years at Yahoo!; FUSE Labs, Microsoft Research (Personal Map)

Speaker Bio: Emre Kıcıman
* Specializes in social data analytics: social media, analytics and search
* Focus: 1. improving our analytical capabilities; 2. extracting information about the world from social media; 3. reasoning about social media biases and reinforcing useful signals
* Career: Ph.D. in computer science from Stanford University, '05; 7 years at the Internet Services Research Center, Microsoft Research
Outline
Part I: Introduction and conceptual framework
1. Introduction and preliminaries
2. Basic model for interaction through social media
3. A processing pipeline for analyzing social media
Part II: Case studies
4. Social responses and engagement on So.Cl
5. Population biases in political tweets
6. Studying self-reporting bias by comparing tweet rates to ground truth
7. Annotating graph structures with discussion context to interpret high-level graph analysis results

Basic Model of Social Media Interaction
Bob sends Alice a message. The effect on Alice might be:
* "Wow, Bob is ______!"
* "I should donate money to _____!"
* Write back to Bob...
* Forward Bob's message to others…
The effect depends on many factors:
* the relationship <Alice, Bob>
* the content of the message
* Alice's environment -> social context, current tasks

From Bob's Point of View
Which message (Message1? Message2? Message3?) will have the desired effect on Alice? Bob needs feedback!

A Bit More Formally…
Assumption: a source writes a message to effect a recipient. The effect might be to project a persona, to elicit engagement or action, or to build social capital [cite communications theory; social capital; etc.]. (Of course, this is not always true; sometimes messages are written for their effect on the writer, e.g., cathartic writing.)

Let the effect be a function of relationship, message and context:

    $E = F(\langle s, r \rangle, m, e)$

* $\langle s, r \rangle$ represents the relationship between a source and a recipient, e.g., "close friends", "authority/expert"
* $m$ represents the message content and style
* $e$ represents the environment in which the message is received, e.g., Facebook or LinkedIn; it also includes broader social norms, etc.

Then a rational source, trying to achieve an effect $E^*$, will select messages based on:

    $\arg\min_{m \in M} \, |E^* - \hat{F}(\langle s, r \rangle, m, e)|$

where $\langle s, r \rangle$ and $e$ are fixed and $\hat{F}$ is the source's approximation of $F$. More complex versions take into account multiple recipients, thresholds of cost & utility, etc.

Messages About the Real World…
Let's assume for simplicity that style is constant, and that content is what an author is actively choosing. Then, given a set of observations $W$ about real-world events, the author chooses a $w$:

    $\arg\min_{w \in W} \, |E^* - \hat{F}(\langle s, r \rangle, w, c)|$

If no $w$ achieves the goal $E^*$ within some threshold, the author writes nothing. The weather-bias case study is essentially investigating the relationship between features of $w$ and whether $|E^* - \hat{F}| > t$.

Bob is not the only actor, and messaging is not the only action. [Diagrams: Charlie and Justin also send messages and experience effects; Alice and Bob's messages pass through the social media system itself.]

Social Media System Role
* Alice: $E_{Alice} = F(\langle Bob, Alice \rangle, m, e)$
* Bob: $\arg\min_{m \in M} \, |E^*_{Bob} - F(\langle Bob, Alice \rangle, m, e)|$
* The social media system is trying to align parameters to achieve its own effect $E^*_{Sys}$: e.g., align the relationships $\langle S, R \rangle$ and the environment $e$, as well as the space of messages $M$ that Bob selects from.
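To make the selection rule concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption, not part of the course material: the numeric "effect" scale, the made-up approximation of F, and the candidate messages are all invented.

    # Toy sketch of the rational-source model: the author picks the message
    # whose predicted effect is closest to the desired effect E*.

    def f_hat(relationship, message, environment):
        # The source's (made-up) approximation of the true effect function F:
        # a score combining message enthusiasm and tie strength.
        enthusiasm = message.count("!") + (1 if "fun" in message else 0)
        closeness = {"close friend": 2.0, "acquaintance": 1.0}[relationship]
        return enthusiasm * closeness

    def choose_message(desired_effect, relationship, environment, candidates):
        # argmin over m in M of |E* - F_hat((s, r), m, e)|
        return min(candidates,
                   key=lambda m: abs(desired_effect - f_hat(relationship, m, environment)))

    candidates = ["Hiked Tiger Mountain.",
                  "Had fun hiking Tiger Mountain!",
                  "BEST HIKE EVER!!!"]
    print(choose_message(4.0, "close friend", "twitter", candidates))
    # -> "Had fun hiking Tiger Mountain!" (predicted effect 4.0 matches E* exactly)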
Reinforcing Real-World Signals

    $\arg\min_{w \in W} \, |E^* - \hat{F}(\langle s, r \rangle, w, c)|$

Ex. reinforcing useful signals: if we want to increase the likelihood that the world events $W^+$ we care about are reported, this model indicates several possible directions:
* investigate and optimize $E^*$, $\langle s, r \rangle$ and $c$ such that $W^+$ is reported
* design feedback that improves the approximation $\hat{F}$ so as to reinforce $W^+$

Recap: How Might This Model Help?
Basic model of social media system interactions:
* Messages have an effect on recipients, conditioned on various factors.
* Authors choose their messages to have some desired effect.
* Social media systems today can (and do) play an active role.
Having a generation process in the back of our mind helps us recognize the biases and limitations of the data. Having a sense of the basic knobs can help us think about how to improve social media systems.

A Processing Pipeline for Analyzing Social Media

Some Basics
Clarity:
* Be clear about your purpose and the real-world problem you wish to address.
* State your question as a hypothesis.
Testability:
* In some cases, the hypothesis is tested through the social data.
* Other times, validation must lie outside the social data.
Passive and active experiments:
* If we look at questions of causality, we often need to do active experimentation.
Data analysis is one tool of many:
* user surveys, interviews, mockups, prototypes, etc.

Defining the Research Question
The amount of data is overwhelming; the more defined your question, the easier the analysis.
* What real-world problem are you trying to explore? Avoid the pitfall of technology for technology's sake.
* What argument do you want to be able to make?
* State your problem as a hypothesis.

Introducing a Running Example
Studying the relationship between activities and locations: What do people do? And where do they do it? (Note: the research question here is about the accuracy of the results, and their use in applications.)

Processing Pipeline
Collect content & interactions -> cleaning and feature extraction -> define the key context -> extract core relationships, followed by higher-level statistical, graph and machine learning analyses.

Example: "I had fun hiking Tiger Mountain last weekend" – Alice said on Monday, at 10am.
* Extracted features: Location: Tiger Mountain; Mood: Happy; Activity: Hiking; Name: Alice; Gender: Female; Post Time: Mon 10am; Activity Time: {Sat-Sun}
* Core relationship: Location: Tiger Mountain <-> Activity: Hiking, annotated with the remaining features
* Followed by higher-level graph and machine learning analyses on the combined structure and context…

1. Collection
Instrumentation:
* Avoid the tendency to collect everything without organization.
* Validate logging -> untested instrumentation is prone to bugs.
* Design for key scenarios: make it easy to get data for key questions up front.
Streaming and search APIs:
* Easy to use; appropriate for many experiments.
* Often rate-limited, but can build large-scale data over time (see the sketch below).
Crawling:
* More effort, but can grab historical data.
* Some sites will block crawlers.
In all cases, do consider user privacy and expectations.
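As a concrete illustration of "rate-limited, but can build large-scale data over time", here is a minimal collection-loop sketch. The `search_api` function is a hypothetical stand-in for whatever client library or endpoint you actually use; only the keep-state, persist-raw-data, back-off loop structure is the point.

    import json
    import time

    def search_api(query, since_id):
        # Hypothetical stand-in for a rate-limited search endpoint.
        # Returns (list_of_messages, new_since_id); raises on rate-limit errors.
        raise NotImplementedError

    def collect(query, out_path, window_seconds=60):
        # Build a large-scale dataset over time despite rate limits:
        # poll, persist raw JSON, remember the high-water mark, back off on errors.
        since_id = None
        with open(out_path, "a", encoding="utf-8") as out:
            while True:
                try:
                    messages, since_id = search_api(query, since_id)
                except Exception:
                    time.sleep(window_seconds)  # back off, then retry after the rate window
                    continue
                for msg in messages:
                    out.write(json.dumps(msg) + "\n")  # keep raw data; clean later
                time.sleep(window_seconds)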
To consider during collection and signal extraction:
* Filters: time span, type of person, type of actions
* Sampling: random selection; snowballing, to get a complete picture of a person's social experience
* Consider your research questions and how you want to generalize

2. Cleaning & Feature Extraction
Clean once: remove irrelevant raw data (what counts as irrelevant depends on your research question):
* spammy users, people who were never active
* geographic or temporal filtering
* When you remove a user, message, or action, think about whether to remove the associated data (e.g., you might want to keep a spammy user's interactions with other, non-outlier users).
Feature extraction: derive cleaner feature values from the raw data:
* absolute date-time stamp -> HourOfWeek, isWeekDay, …
* entity recognition from text
* user classification, …
This is also a good stage to bring in external data.
Implication: while developing your analysis process and feature extraction, you have to be inspecting the raw data and the feature results.
Clean again: look for outliers and remove feature values that are not dependable. Keep samples of raw data for distinct feature values to make inspection easier.

3. Pick the Defining Context of Relationships
After extracting features from our social media, we want to reason about the relationships among those features. What defines a relationship?
One common choice: context == co-occurrence within the same message. For example, all the features of "I had fun hiking Tiger Mountain last weekend" (Location: Tiger Mountain; Gender: Female; Mood: Happy; Activity: Hiking; Name: Alice; Post Time: Mon 10am; Activity Time: {Sat-Sun}) are related to each other, while Bob's features (Gender: Male; Name: Bob; Post Time: Fri 3pm) are not related to Alice's.
Other common choices:
* User as defining context -> two things are related if they are associated with the same user. Common in recommender systems. Ex. the Livehoods study of neighborhood boundaries.
* Location as defining context -> two things (users, actions, …) are related if they co-occur at the same physical location. Ex. Sadilek's study of disease transmission.

4. Extracting Core Relationships
* Focus on the core relationships among the domains of interest (e.g., Location: Tiger Mountain <-> Activity: Hiking).
* Strength is defined by how frequently items co-occur in the key context.
* The statistical distribution of the other features (e.g., gender) annotates the core relationships.
* Iterate on the "core relationships". (A sketch of this step appears at the end of this section.)

Higher-Level Algorithms
* Statistical tests: e.g., test that relationships are statistically significant, or that two items are statistically different from one another.
* Graph analyses: e.g., clique finding, graph clustering, path algorithms, network centrality, …
* Machine learning: e.g., classifiers, clustering, etc., based on the graph relationships or annotations.

Some Usage Scenarios
* Build "profiles" of things based on the words and sentiment used around them.
* Build demographics of places and concepts based on who is talking about them.
* Build a co-mention graph among entities, people, places, etc.
* Build profiles of users based on what they talk about and how they express themselves.
* Include "time" in the projection, and see how profiles change over time.
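A minimal sketch of steps 3 and 4, assuming co-occurrence within the same message as the key context. The feature dictionaries below are invented examples in the spirit of the Tiger Mountain running example.

    from collections import Counter, defaultdict
    from itertools import combinations

    # Each record is the feature dictionary extracted from one message.
    messages = [
        {"location": "Tiger Mountain", "activity": "hiking", "gender": "f", "mood": "happy"},
        {"location": "Tiger Mountain", "activity": "hiking", "gender": "m"},
        {"location": "Pike Place", "activity": "shopping", "gender": "f"},
    ]

    edge_strength = Counter()            # how often two feature values co-occur
    edge_context = defaultdict(Counter)  # distribution of the other features on each edge

    for feats in messages:
        items = sorted(feats.items())    # sort so each edge has a canonical order
        for (k1, v1), (k2, v2) in combinations(items, 2):
            edge = ((k1, v1), (k2, v2))
            edge_strength[edge] += 1
            for k, v in feats.items():   # remaining features annotate the edge
                if k not in (k1, k2):
                    edge_context[edge][(k, v)] += 1

    edge = (("activity", "hiking"), ("location", "Tiger Mountain"))
    print(edge_strength[edge])   # 2 co-occurrences -> relationship strength
    print(edge_context[edge])    # gender/mood distribution annotating the edge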
Part II: Case Studies

Selection of Case Studies
* Social responses and engagement on So.Cl: a clear & simple study of user interactions.
* Population biases in political tweets: extracting basic features from tweets; demonstrates the complexity of population biases.
* Studying self-reporting bias by comparing tweet rates to ground truth: an example of building a domain classifier; a methodology for studying reporting bias.
* Annotating graph structures with discussion context to interpret high-level graph analysis results: applies higher-level graph analyses to graphs of discussion topics; shows how discussion context can be useful at different layers.
* (Bonus) Statistical language modeling to analyze language differences across user populations.

Case Study: Usage Analysis of So.Cl
So.cl is an experimental web site that allows people to connect around their interests by integrating search tools with social networking.
Study: How important are social interactions in encouraging users to become engaged with an interest network?
We'll start with some background on So.Cl, and then dive into the case study.

So.Cl: reimagining search as social from the ground up. Search + sharing + networking = informal discovery and learning.
History:
* Oct 2011: pre-release deployment study
* Dec 2011: private, invitation-only beta
* May 2012: removed invitation restrictions
* Nov 2012: over 300K registered users, 13K active per month
Try it now! http://www.so.cl

So.Cl as Interest Network
Find others around common interests; be inspired by new interests; learn from each other through these shared interests. Search & post.

How It Works
Feed, feed filters, people. Try it now! http://www.so.cl – use the facsumm tag.
Search (Bing) + post building experience:
* Step 1: Perform a search.
* Step 2: Click on items in the results to add them to a post.
* Step 3: Add a message.
* Step 4: Tag.
[Screenshots: post builder and results.]

So.Cl as Research Platform
* Collect, create, consume: interests, visual post builder, stream, video parties, adding links, Explore page, search interests, Interest page
* Connect: interest network, following, simple profiles, wall messages, people list
* Collaborate: riffing, liking, commenting
* Discovery network, social search
Goal: increasing engagement, community, learning, innovation.

So.cl Research Dataset Program
Access to public So.cl behavioral data for research purposes, to foster research in interest networking, social search, and community development: http://fuse.microsoft.com/research/srd

Case study hypothesis: If people receive a social response when they first join So.cl, they are more likely to become engaged.
Measuring social/behavioral constructs:
* "When they first join": first session = from the time of the first action to the time of the last action prior to an hour of inactivity.
* Social responses: someone follows the user, likes the user's post(s), or comments on the user's post(s).
* Engagement = coming back: a second session = any action occurring 60 minutes or more after the first session.
Restating the hypothesis: If people receive follows, likes, and comments in their first session, they are more likely to come back for a second session.

Pipeline stage: Collect content & interactions
A simple, common instrumentation schema, kept in a database:
* Users table: one row per user; includes creation time and other metadata.
* Content table: one row per content item; includes text, URLs, etc.
* Actions table: one row per action; filter out non-meaningful, non-user-generated actions. Actions capture user interactions and context.
Always look at your raw data: play with it, ask yourself if it makes sense, test!

Pipeline stage: Cleaning and feature extraction
* Filters: time span, type of person, type of actions.
* Sampling: random selection; snowballing, to get a complete picture of a person's social experience.
* Consider your research questions and how you want to generalize.
For this study:
* filtered out administrators/community managers
* new users only
* date range: Sept 28 to Oct 13
* 100% sample for that time span: 2,462 people

Systematic biases in social systems #1: If you want to understand your "typical" users, keep in mind that a large percentage generally never become active or return; this "kicking the tires" unduly biases averages. A common reporting format: X% performed behavior Y, and of those, they averaged Z times each. For example: 5% commented on a post in their first session, averaging 5 comments each.

Outliers: filtered out 13 outlier people with z > 4 in number of actions (among users who did more than sign in).

Systematic biases in social systems #2:
* A small percentage are "hyper-active" users (avid users, spammers, trolls, administrators) and can unduly bias averages -> remove outliers.
* A substantial percentage are consumers but not producers ("lurkers"); there is often no signal for lurkers. So.cl has about 75% lurkers. (Custom instrumentation logs sign-ins; web analytics capture clicks.)

It is very important to spend time examining the data: descriptives, frequencies, correlations, graphs. Use a tool that easily generates graphs and correlations. Does it make sense? If not, really chase it down; it is often a bug or a misinterpretation of the data. (A sessionization sketch follows below.)
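A minimal sketch of the session and outlier definitions above (sessions split on a 60-minute inactivity gap; users dropped at z > 4 on action count). The data layout, per-user lists of epoch-second timestamps, is an assumption for illustration.

    from statistics import mean, stdev

    GAP = 60 * 60  # 60 minutes of inactivity ends a session

    def sessions(action_times):
        # Split one user's sorted action timestamps (epoch seconds) into sessions.
        out, current = [], [action_times[0]]
        for t in action_times[1:]:
            if t - current[-1] >= GAP:
                out.append(current)
                current = []
            current.append(t)
        out.append(current)
        return out

    def drop_hyperactive(users, counts, z_max=4.0):
        # Remove outlier users with z > 4 on number of actions.
        mu, sd = mean(counts), stdev(counts)
        if sd == 0:
            return list(users)
        return [u for u, c in zip(users, counts) if (c - mu) / sd <= z_max]

    times = [0, 600, 1200, 50_000, 50_300]   # gap of ~13.5 hours -> two sessions
    print(len(sessions(times)))              # 2
    print(sessions(times)[0])                # first session: [0, 600, 1200]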
Feature: Active sessions
An active session = a period of (public) activity with a 60-minute gap of no activity before and after.
* 91% of users had only one active session.
* Sessions were, on average, 34.6 hours apart.
* The first session averaged 1.6 minutes.

Feature: User actions
Number of posts in the first session: 8% created a post in their first session; of those, they averaged 1.5 posts each. [Chart: actions in the first session.]

Feature: Coming back
9.1% came back for another active session (~25% including inactive sessions), on average 35 hours later.

Aggregation: merging down for summarization
What is your level of analysis? Person, group, network, content types. If the person is the unit of analysis, aggregate measures to the person level (e.g., in SPSS: one line per person). It is very important to have the appropriate unit of analysis, to avoid bias in the statistics. [Slide: SPSS aggregation syntax.]

Preliminary correlations: always ask, does this pattern make sense?

In the first session: How often is the user the target of social behavior? 23% received some response before their 2nd session: 3% of those who did not create a post, 37% of those who did. [Charts: responses *during* the first session and *in between* the 1st and 2nd sessions.]

Predictors of coming back
Social responses inspire people to return to the site, especially if they occur during the first session. [Chart groups: N = 2273, N = 179, N = 1942, N = 510.] Social responses to a user: following, commenting on a post, liking a post, liking a comment, riffing.

Which response matters
Logistic regression: any response predicts coming back.

    Predictor                       B      S.E.   Sig.
    Created post in first session   .71    .20    .000
    Response during first session   1.12   .21    .000
    Response after first session    .60    .17    .000

Logistic regression: which responses predict coming back.

    Predictor                       B      Sig.
    Created post in first session   .95    .000
    Followed                        .92    .003
    Commented on                    .38    ns
    Post liked                      .87    .02
    Comment liked                   -.09   ns
    Messaged                        -.09   ns
    Riffed                          .00    ns

(A statsmodels sketch of this kind of regression appears at the end of this case study.)

Identifying subgroups
Factor analysis of associated behaviors (principal components, varimax rotation, meaning the components are forced to be orthogonal) finds three types of usage: creating, socializing, browsing.

    Component matrix     Creators   Socialites   Browsers
    (% of variance)      32%        12%          9%
    Created post         .86        .17          .10
    Invited              .01        -.16         .63
    Followed             -.03       .10          .37
    Added item to post   .83        .08          -.06
    Searched             .81        .03          .17
    Commented            .36        .64          .09
    Liked post           .15        .58          .32
    Liked comment        .13        .80          .06
    Messaged             -.09       .50          -.08
    Viewed person        .22        .47          .48
    Navigated to All     .51        .37          .53
    Joined party         .17        .09          .68

The factors about equally predict whether a user comes back:

    Factor        Beta   t      Sig.
    Creating      .14    5.28   .000
    Socializing   .07    2.61   .000
    Browsing      .19    7.20   .000

Browsing is a stronger predictor of overall activity level:

    Factor        Beta   t      Sig.
    Creating      0.20   7.89   0.00
    Socializing   0.17   6.58   0.00
    Browsing      0.29   9.07   0.00
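The regressions above were run in SPSS; a roughly equivalent sketch in Python with pandas and statsmodels might look like the following. The file and column names are assumptions for illustration.

    import pandas as pd
    import statsmodels.api as sm

    # One row per user (person as the unit of analysis); column names assumed.
    df = pd.read_csv("socl_users.csv")
    y = df["came_back"]                     # 1 if the user had a second session
    X = sm.add_constant(df[["created_post_first_session",
                            "response_during_first_session",
                            "response_after_first_session"]])

    model = sm.Logit(y, X).fit()
    print(model.summary())                  # B, S.E., and significance per predictor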
Case Study: Population Biases in Political Tweets
There was a significant amount of political discussion on Twitter during the US election season in Summer/Fall 2012.
Case study question: Is the population of tweeters representative of US demographics along two demographic axes, gender and geography?
Why this case study? It is a good illustration of simple extractors for gender and location, and of simple methods for identifying topics. More fundamentally, it highlights the challenges of dealing with population biases.

Pipeline stage: Collect raw social media
Collected all tweets during August – November 2012 that mentioned "Obama", "Romney" or other politician names. Inspecting the raw data, we removed some common names and issue phrases from the collection.

Feature: Gender
A simple gender classifier based on the first name in the Twitter user's profile.
* Approach: look up the first name in a weighted gender map built from census data and other sources. (A sketch of this lookup appears at the end of this pipeline walkthrough.)
* Practical results: ad hoc inspection is positive; coverage is 60-70%, depending on the domain. The remainder are organizations and ambiguous names.
* Still requires: an accuracy evaluation based on ground-truth data.

Feature: Location
Map from self-declared user profile locations to lat-lon regions.
* Approach: use a mapping learned from the small % of tweets that are geocoded, and cluster the mapped geo-locations into city-size areas.
* Practical results: maps to metropolitan-area-size regions; learns official location names as well as abbreviations, nicknames, etc.; automatically identifies non-specific locations. Coverage is 60-70%, depending on the domain; the remainder have non-specific locations or "tail" locations not covered in the training set.

Example results:

    Location cluster                            Example members
    New York                                    "NYC", "Yonkers", "manhattan", "NY,NY", "Nueva York", "N Y C", "The Big Apple"
    Los Angeles                                 "Laguna beach", "long beach", "LosAngeles,CA", "West Los Angeles, CA", "Downtown Los Angeles", "LAX"
    Filtered out due to ambiguity (large area)  "World", "everywhere", "USA", "California", …

Location detection alternatives:
1. Use geo-tagged tweets. Most appropriate when you need fine-grained locations per tweet (e.g., user tracking), but the trade-off is that a very small % of tweets are geocoded.
2. Much recent research on location inference. The state of the art uses textual references to known locations to identify user location. Our mapping technique is a little coarser-grained, but simpler.

Feature: Politician mention
* Approach: exact match on well-known, unambiguous politician names.
* Still needs: domain classification and/or stronger entity linking to recognize ambiguous names. For example, "Mitt" is likely Mitt Romney in a political context, but not otherwise.

Pipeline stage: Define the key context
The key context is the tweet itself. We will assume a relationship among features if they co-occur in the same tweet; the relationship is stronger if they co-occur across many tweets.
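Returning to the gender feature above: a minimal sketch of the first-name lookup, assuming a name -> P(female) map. The tiny map and the 0.9 threshold here are stand-ins; the real map was built from census data and other name sources.

    # Tiny stand-in for a census-weighted first-name gender map:
    # first name -> probability that the name belongs to a female person.
    GENDER_MAP = {"mary": 0.99, "james": 0.01, "taylor": 0.55}

    def classify_gender(profile_name, threshold=0.9):
        # Use the first token of the profile name; return 'f', 'm',
        # or None for organizations and ambiguous names (no coverage).
        first = profile_name.strip().split()[0].lower() if profile_name.strip() else ""
        p_female = GENDER_MAP.get(first)
        if p_female is None:
            return None           # not a known first name (e.g., an organization)
        if p_female >= threshold:
            return "f"
        if p_female <= 1 - threshold:
            return "m"
        return None               # ambiguous name (e.g., "Taylor")

    print(classify_gender("Mary Smith"))   # 'f'
    print(classify_gender("Acme News"))    # None: not in the name map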
Pipeline stage: Extract core relationships
For example, we extract two sets of relationships:
1. Politician mentions per day: the strength of the relationship indicates the volume of discussion about a given politician on a given day, and the discussion context summarizes gender and location for each day.
2. Politician mentions over all time: the discussion context summarizes gender and location over all time.

Gender bias
[Chart: gender distribution of the authors of tweets mentioning Obama, 8/29/12 – 10/13/12.] The gender distribution equalizes during high-volume events like the DNC.

Geographic bias in political tweets
Geographic distribution of tweets mentioning Obama during the 2012 elections:

    Metro-area        Tweets    % of tweets   Actual population
    New York, NY      141,878   10%           22,000,000
    Washington, DC    135,347   9%            8,500,000
    Los Angeles       68,676    5%            12,800,000
    Chicago           47,130    3%            9,800,000
    Atlanta, GA       45,475    3%            5,200,000
    Houston, TX       35,956    2%            2,100,000
    Boston, MA        34,363    2%            7,600,000

[Chart: Election 2012 moods over time for Obama.]

Case Study: Studying Self-Reporting Bias by Comparing Tweet Rates to Ground Truth
Background: the frequency of discussion about events does not directly reflect the real-world frequency of occurrence. We may assume that the bias is constant for a given kind of event, but not across different kinds of events, so we can make few inferences about the relationship between distinct events through social media analysis alone.
Study: compare tweet rates about the weather to ground-truth data about the weather.
Why this case study? It is an easy example of domain identification and ambiguity resolution in the cleaning stage, and a good illustration of self-reporting bias.

Self-reporting bias
We study reporting bias by comparing tweet rates about the weather to ground-truth weather data. Does the weather's extremeness, its changes, or its unexpectedness affect tweet rates? [Kıcıman, ICWSM 2012]

[Chart: weather-related tweet rate (daily tweet count, log scale) and daily maximum temperature (C) in San Diego, CA from Sep. 1 to Oct. 15, 2010, with a thunderstorm and the hottest day annotated.]

Pipeline stage: Collect raw social media
Collected 12 months of tweets that mentioned weather-related words (e.g., "rain", "snow", "sun", "heat", …). The word list was built by hand from weather glossaries, dictionaries, etc.

Example "weather" tweets:
* "Woke up to a sunny 63F (17C) morning. It's going to be a good day :)"
* "The rainy season has started."
* "The inside of our house looks like a tornado came through it."
* "Japan, Germany hail U.N. Iran sanctions resolution"
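The last two examples show why raw keyword matching over-collects ("tornado" as a metaphor, "hail" as a verb). A minimal sketch of the keyword collection pass, with a tiny stand-in for the hand-built 179-word weather list:

    # Tiny stand-in for the 179-word weather list from glossaries and dictionaries.
    WEATHER_WORDS = {"rain", "rainy", "snow", "sun", "sunny", "heat", "hail", "tornado"}

    def mentions_weather(text):
        # Keyword pass: keep a tweet if any token matches the word list.
        # Over-collects: "Japan, Germany hail U.N. ..." matches "hail".
        tokens = {w.strip(".,!?:;()\"'").lower() for w in text.split()}
        return bool(tokens & WEATHER_WORDS)

    print(mentions_weather("Japan, Germany hail U.N. Iran sanctions resolution"))  # True

The domain classifier described next is what resolves this ambiguity.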
Domain Classifier
We used a language-based classifier with a simple Bayes model, scoring each tweet by

    $\frac{1}{|T|} \sum_{t \in T} P(\mathrm{weather} \mid t)$

where $T$ is the set of features (all pairs of co-occurring words within a tweet, regardless of order), and

    $P(\mathrm{weather} \mid t) = \frac{1 + C_{\mathrm{weather}}(t)}{1 + C(t)}$

We also apply simple stemming of words, removing '-s' and '-ing' suffixes. (A sketch of this scoring appears at the end of this case study.)

Domain classifier: labeling
We labeled 2,000 tweets manually (2 labelers) to create a "gold" training/test set. The main challenge of labeling was maintaining strong, consistent criteria. For example: what about "incidental" mentions of the weather, or mentions of the weather someplace else? Slightly less complicated: mentions of the weather in proverbs ("when it rains it pours").

Domain classifier results
The classifier achieves an F-score of 0.83, with a precision of 0.80 and a recall of 0.85. Is this good? In general, the precision/recall will depend heavily on the domain and on the collection criteria for the tweets.

Feature: Location
Extracted as described in the politics case study.

Pipeline stage: Feature extraction
Add derived weather features from external (non-social) data: extremeness, expectation, and change, calculated from the weather station nearest to the median location within each metropolitan area.

Data preparation
* 12 months of tweets, June 2010 – June 2011
* 130M tweets include a weather-related word (179 words from weather glossaries, etc.)
* 71M tweets pass a Bayesian classifier trained on 2k labeled tweets
* 8M tweets geolocated to 56 US cities, using geo-tagged tweets to learn a mapping from profile locations

Pipeline stage: Define the key context
The key context in this case is the location-day pair; this also defines the core relationship. What we are most interested in is the count of tweets per location-day and the weather features per location-day.

Correlation analysis
Linear regression on the derived features with L2 regularization:

    Model features        Global R2 correlation   Local R2 correlation
    Basic weather         0.30                    0.45
    Expectation + basic   0.33                    0.70
    Change + basic        0.35                    0.71
    Extreme + basic       0.40                    0.70

Granger analysis
Percentage of cities where each feature Granger-causes tweet rates: Extreme 98.2%, Basic 85.7%, Expectation 66.1%, Change 57.1%.
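Before moving on, a sketch of the domain classifier's scoring rule described earlier in this case study: features are unordered pairs of co-occurring words, lightly stemmed, each scored by smoothed counts. The tokenization and stemming details here are assumptions; the counts would come from the 2k labeled tweets.

    import re
    from itertools import combinations

    def word_pairs(text):
        # Features: all unordered pairs of co-occurring words in a tweet,
        # with simple stemming (strip -s / -ing suffixes).
        words = [re.sub(r"(ing|s)$", "", w) for w in re.findall(r"[a-z']+", text.lower())]
        return set(combinations(sorted(set(words)), 2))

    def p_weather(pair, weather_counts, total_counts):
        # P(weather | t) = (1 + C_weather(t)) / (1 + C(t)), with smoothed counts
        return (1 + weather_counts.get(pair, 0)) / (1 + total_counts.get(pair, 0))

    def score(text, weather_counts, total_counts):
        # Average P(weather | t) over the tweet's feature set T.
        pairs = word_pairs(text)
        if not pairs:
            return 0.0
        return sum(p_weather(p, weather_counts, total_counts) for p in pairs) / len(pairs)

    # weather_counts / total_counts would be built by counting each word-pair
    # feature in the weather-labeled tweets and in all labeled tweets.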
Case Study: Activities & Locations
Question: Given a set of related locations inferred from social media, what can we tell about why they are related?
Why this case study? It is an introduction to higher-level analyses, and to using context to interpret them.

Extracting features:
* Activities: exact match on activity names derived from search queries.
* Locations: exact match on unambiguous location names from Wikipedia articles.
Key context == the tweet.
Extract core relationships among locations, then look at the contextual statistics of the discussions. For two pseudo-cliques of related NYC locations, a "tourist" clique and a "midtown worker" clique:

    Context            NYC tourist clique   NYC midtown-worker clique
    Gender: Male       49%                  63%
    Gender: Female     33%                  23%
    Metro-area: NYC    33%                  54%
    Metro-area: Other  67%                  46%
    Mood: Joviality    56%                  49%
    Mood: Fear         14%                  13%
    Mood: Sadness      11%                  15%
    Mood: Guilt        8%                   6%
    Mood: Fatigue      3%                   6%
    Mood: Serenity     3%                   4%
    Mood: Hostility    2%                   4%

Recap: Basic model of interaction
Bob sends Alice a message $m$: $\arg\min_{m \in M} \, |E^* - \hat{F}(\langle s, r \rangle, m, e)|$

Recap: Processing framework
Collect raw social media -> feature extraction (e.g., from "I had fun hiking Tiger Mountain last weekend" – Alice, Monday 10am: Location: Tiger Mountain; Mood: Happy; Activity: Hiking) -> define the key context -> extract core relationships (Location: Tiger Mountain <-> Activity: Hiking, annotated with Name: Alice; Gender: Female; Post Time: Mon 10am; Activity Time: {Sat-Sun}), followed by higher-level graph and machine learning analyses on the combined structure and context.

Recap: Case studies
* Social responses and engagement on So.Cl: a clear & simple study of user interactions.
* Population biases in political tweets: extracting basic features from tweets; demonstrates the complexity of population biases.
* Studying self-reporting bias by comparing tweet rates to ground truth: an example of building a domain classifier; a methodology for studying reporting bias.
* Annotating graph structures with discussion context to interpret high-level graph analysis results: applies higher-level graph analyses to graphs of discussion topics; shows how discussion context can be useful at different layers.

Summary
Social media data provides a fine-grained and large-scale representation of people's discussions and interactions with each other. We can extract information about the real world, study people's interactions with each other, and study how system design influences those interactions. But be careful: social media is generated through a complicated system, and it has many biases!

Questions? E-mail: Emre Kiciman, [email protected], http://research.microsoft.com/~emrek/

Selected dataset resources:
* So.cl dataset: http://fuse.microsoft.com/research/srd
* ICWSM datasets: http://icwsm.org/2013/datasets/datasets/
* MyPersonality project: http://mypersonality.org

Extra
(Bonus) Statistical Language Modeling to Analyze Language Differences Across User Populations

What we're doing: building and comparing language models of tweets, conditioned on various metadata features such as geography and number of followers.
Why we're doing it:
1. It is interesting in itself to find and quantify the differences in style and topic among different groups of users.
2. Analysis and information extraction from tweets is important; more accurate language models may improve algorithms for word segmentation, NER, …

Metadata classes (explicit and inferred signals):
* Geography: time zone, GPS coordinates, user-reported location
* User metadata: number of followers, number followed, total tweet count, age of account, gender, interests
* Message metadata: message length, retweet, contains URL, number of user references, time of day, #topic, well-capitalized

Twitter data set: 72M tweets gathered over ~3 days; 90% training, 10% test; these experiments focus on English tweets.

Approach:
* Partition by metadata feature (e.g., group messages by whether there's a link in them).
* Build 1- to 3-gram LMs per partition (smoothed LMs with a closed vocabulary).
* Compute cross-entropy among all partitions.
* Analyze the differences in term likelihoods among the LMs.
(A minimal sketch follows at the end of this section.)

Cross-entropy across time zones
Perplexity of the bi-gram model learned for each time zone with respect to the others (each row and column is a time zone; the diagonal, a time zone against itself, is lowest):

                Hawaii  Alaska  Pacific  Mountain  Central  Eastern  Quito  Brasilia  Greenland  London  Jakarta  Osaka  Tokyo
    Hawaii        1573    3078     3623      2795     3018     3294   3094      5591       4228    2027     6623   5051   3529
    Alaska        3506    1500     3238      2641     2866     3182   2892     11005       6496    2907     6004  12610   6477
    Pacific       2775    1894     1303      1825     2040     2222   2226     11676       6501    2493     2769  11591   5611
    Mountain      4619    4379     5263      1360     2362     2742   2824     13384       7465    2874    17897  13453   7023
    Central       4941    4655     5969      1774     1185     2009   1838     13244       7368    2695    24610  14107   6740
    Eastern       5586    5208     7244      2053     1943     1216   1767     15560       8475    2648    31850  14535   6953
    Quito         5042    4689     6539      2324     2200     2241   1153      8234       6061    2810    26049  13806   7197
    Brasilia      8063    8279    10229      5674     6230     6528   6666       724       5810    4909    28775  11331   7465
    Greenland     4437    4776     5966      3642     4006     4170   4030      1932       1536    2868    14817  11179   5962
    London        5013    5573     7160      3478     4065     4115   4266     10621       6561     917    21472  15561   7342
    Jakarta       5631    4896     5494      5298     5761     6200   6138     17000       9690    4461     1338  12107   7407
    Osaka         8276    8086     9359      6599     6944     7340   7252     16236      10461    5444    19994   1598   4495
    Tokyo         5682    5546     6589      4521     5006     5043   5222      8904       6811    3635    13864   2386   1265

Differences across time zones come in 3 kinds: geographic locations, topic variance, and dialect or spelling differences.

Cross-entropy across number of followers:

                0≤x<10   0≤x<100   0≤x<1000   x≥1000
    0≤x<10        922      2413       4528     7831
    0≤x<100      1166      1071       2477     4811
    0≤x<1000     1682      1341       1216     2317
    x≥1000       3345      2421       2804     1544

* The language models for the <10, <100, and <1000 follower groups are similar to each other.
* Differences appear for authors with more than 1000 followers.

Differences across number of followers
[Chart: relative likelihood of the pronouns "I", "my", "me", "you", "your" in comparison to a global model, for the follower buckets 0≤x<10, 10≤x<100, 100≤x<1000, and x≥1000.]
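A minimal sketch of the partition-and-compare methodology: train an add-one-smoothed bigram model per partition, then measure the cross-entropy of one partition's text under another partition's model. The smoothing and vocabulary handling here are simplifications of the closed-vocabulary smoothed LMs described above, and the example sentences are invented.

    import math
    from collections import Counter

    def train_bigram(sentences):
        # Add-one smoothed bigram counts over the partition's vocabulary.
        unigrams, bigrams = Counter(), Counter()
        for s in sentences:
            toks = ["<s>"] + s.lower().split()
            unigrams.update(toks)
            bigrams.update(zip(toks, toks[1:]))
        return unigrams, bigrams

    def cross_entropy(sentences, model):
        # Bits per token of `sentences` under `model` (lower = more similar language).
        unigrams, bigrams = model
        V = len(unigrams)
        log_prob, n = 0.0, 0
        for s in sentences:
            toks = ["<s>"] + s.lower().split()
            for prev, cur in zip(toks, toks[1:]):
                p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)  # add-one smoothing
                log_prob += math.log2(p)
                n += 1
        return -log_prob / max(n, 1)

    pacific = train_bigram(["hella traffic on the bridge", "hella sunny today"])
    eastern_test = ["the subway is packed today"]
    print(cross_entropy(eastern_test, pacific))  # high: Eastern text, Pacific model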