
Network Analytics meets Text Mining for Social Media Analysis

Dr. Rosaria Silipo

Social Media Data: Water, Water Everywhere, and not a drop to drink

What companies do with it:
• Download and keep
• Topic [Shift] Detection (email content routing, detect market interest shift, clinical studies, query non structured DBs, ...)
• Sentiment Analysis (marketing, polls, elections, ...)
• Connection Analysis (influencers, risk analysis, ...)
• ....

Social Media Data: The Analysis Tools

• Web Crawlers
• Visual Exploration
• Topic Detection (Text Mining, NLP, Ontologies)
• Sentiment Score (Text Mining, NLP)
• Influence Score (Network Analytics)
• Find Groups (Predictive Analytics)

Case Study Example: Slashdot Data

[Figure: a Slashdot post and its comments]

Basic Numbers:
• 24532 users
• 491 threads with
  • 15 – 843 responses
  • 12 – 507 users
• 113505 posts
• 60 main topics
• Selected Topic: Politics

Case Study Example: Slashdot

• Very rich data sources about customers!
• We want to establish:
  • How users feel about the discussed topic
  • Whether it matters how users feel
  • A more general abstraction of the results

Sentiment Analysis

Workflow annotations:
- Remove anonymous users, group by PostID
- Words Tagging (MPQA Corpus)
- Positive words / Negative words
- Total Attitude by User
- User Bins
- Word cloud for selected users
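The workflow steps above can be sketched outside KNIME in a few lines of Python. This is only an illustration of the idea, not the workflow itself: the tiny word sets stand in for the MPQA lexicon, the "Anonymous Coward" label for anonymous posts and the binning rule (sign of positive minus negative word counts) are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy stand-ins for the MPQA subjectivity lexicon (assumption: the real
# lexicon would be loaded from file; these words are illustrative only).
POSITIVE = {"good", "great", "freedom", "support"}
NEGATIVE = {"bad", "corrupt", "fail", "hate"}


def score_posts(posts):
    """Count positive and negative lexicon words per user.

    posts: iterable of (user, text) pairs.
    Anonymous posts are dropped (assumed label: "Anonymous Coward").
    Returns {user: (positive_count, negative_count)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for user, text in posts:
        if not user or user.lower() == "anonymous coward":
            continue
        entry = counts[user]  # users without lexicon words still get (0, 0)
        # No sentence splitting and no negation handling: plain word lookup.
        for word in text.lower().split():
            word = word.strip(".,;:!?\"'()")
            if word in POSITIVE:
                entry[0] += 1
            elif word in NEGATIVE:
                entry[1] += 1
    return {user: (pos, neg) for user, (pos, neg) in counts.items()}


def bin_user(pos, neg):
    """Bin a user by total attitude (assumed rule: positive minus negative)."""
    attitude = pos - neg
    if attitude > 0:
        return "positive"
    if attitude < 0:
        return "negative"
    return "neutral"
```

Under this assumed rule, the word counts reported in the slides for dada21 (2838 positive, 1725 negative) fall into the positive bin and those for pNutz (43 positive, 109 negative) into the negative bin.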

Slashdot – Text Mining

[Word cloud] Most Negative User: pNutz

[Word cloud] Most Positive User: dada21

Slashdot – Sentiment Analysis

• 16016 positive users
• 7107 negative users
• Most positive user: dada21 (2838 positive / 1725 negative words)
• Most negative user: pNutz (43 positive / 109 negative words)
• Which topics do positive users have in common?
  – Government
  – People
  – Law/s
  – Money
  – Market
  – Parties

Network Creation

[Figure: network graph connecting User1, User2, User3, User4, User5, User6]

Topic Graphs

Topic Graph: NASA

Topic Graph: Sci-Fi

Hubs & Authorities

• Hubs = Followers
• Authorities = Leaders

Workflow annotations:
- Filtering anonymous users and creating network
- Centrality index to define hub weight and authority weight
- Users with hub and authority weights and other features
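For readers who want to see what a hub and an authority weight are, the following is a minimal sketch of the standard HITS power iteration in plain Python. It is not the KNIME node used in the workflow. The edge direction (from the commenting user to the user being answered) and the scaling of weights to a maximum of 1 are assumptions made for the illustration.

```python
def hits(edges, iterations=50):
    """Compute hub and authority weights with the HITS power iteration.

    edges: list of (source, target) pairs of a directed graph.
    Assumption: an edge goes from the commenting user to the user answered.
    Returns (hub, authority) dicts, each scaled so the largest weight is 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}, {}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the nodes pointing at you.
        auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub weight: sum of authority weights of the nodes you point at.
        hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Rescale both score sets to keep the numbers bounded.
        for scores in (auth, hub):
            top = max(scores.values()) or 1.0
            for node in scores:
                scores[node] /= top
    return hub, auth
```

With this edge direction, a user who receives many comments from active commenters gets a high authority weight (a "leader"), and a user who comments on such users gets a high hub weight (a "follower").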

Hubs & Authorities

[Figure: hub and authority weights by user; labeled users: dada21, Carl Bialik from the WSJ, pNutz, Tube Steak, Doc Ruby, 99BottlesOfBeerInMyF]

KNIME: Bringing it all together

[Workflow: a Network Analysis branch and a Text Analysis branch are joined. Network Analysis yields users with hub and authority weights and other features; Text Analysis yields user bins: positive, negative, neutral]

[Figure: combined network and sentiment results; labeled users: Carl Bialik from the WSJ, Catbeller, dada21, Tube Steak, Doc Ruby, 99BottlesOfBeerInMyF, pNutz]

What we have found ...

- The positive leaders
- The neutral leaders
- The negative leaders
- The inactive users

What identifies each group?

How do I identify a new user?

How do I handle each user?


Why Clustering?

- No a priori knowledge (not even on a subset of users)
- Prediction and interpretation capabilities required

k-Means algorithm

Re-sampling the Training Set

k = 10
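As a reminder of what k-Means does, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not the KNIME k-Means node: the random initialisation from the data points, the fixed seed, and the handling of empty clusters are assumptions of this sketch, and which user features go into each point (for example hub weight, authority weight, attitude) is left to the caller.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equally long tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Plain k-Means (Lloyd's algorithm) on a list of numeric tuples.

    Returns (centroids, labels); labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return centroids, labels
```

A call such as `k_means(user_feature_vectors, k=10)` would mirror the k = 10 setting on the slide, where `user_feature_vectors` is a hypothetical list of per-user feature tuples.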

The k-Means Clusters

[Figure: k-Means clusters; labeled groups: Neutral users, Negative users, Fans, Superfans]

Additional Discoveries

• There are only very few real leaders! Authority and hub scores identify active participants rather than leaders.
• Superfans can be found in cluster_3.
• Negative and (sigh!) active users are collected in cluster_1.
• Neutral users are usually inactive (cluster_2, cluster_7, and cluster_8).
• Positive users with different degrees of activity are scattered across the remaining clusters.

The operational Workflow

[Workflow: Pre-processing, Cluster Extraction, Assignment of new data]
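The "Assignment of new data" step can be pictured as a nearest-centroid lookup after the new user has gone through the same pre-processing as the training data. The sketch below assumes min-max normalisation with ranges saved from training; the slides do not say which normalisation the workflow uses, and all names and values here are hypothetical.

```python
def normalise(features, mins, maxs):
    """Min-max scale a feature tuple with ranges saved from the training set.

    Assumption: the training data was scaled the same way.
    """
    return tuple(
        (x - lo) / (hi - lo) if hi > lo else 0.0
        for x, lo, hi in zip(features, mins, maxs)
    )


def assign_cluster(features, centroids):
    """Return the index of the centroid closest to the feature tuple."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        range(len(centroids)),
        key=lambda c: squared_distance(features, centroids[c]),
    )


# Hypothetical centroids and training ranges, for illustration only.
centroids = [(0.1, 0.1), (0.9, 0.8)]
mins, maxs = (0.0, 0.0), (10.0, 100.0)
new_user = normalise((9.0, 75.0), mins, maxs)
cluster = assign_cluster(new_user, centroids)
```

Once a new user is mapped to a cluster, the interpretation found for that cluster (for example negative users or superfans) tells how to handle the user.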

Notes

• MPQA Corpus: publicly available Subjectivity Lexicon (http://www.cs.pitt.edu/mpqa/lexicons.html)
• User Characterization is Sum -> Mean
• NLP: No sentence splitting, no negation identification.
• For a more refined syntax-based sentiment analysis -> "External Tool" node

External Tool Node

The "External Tool" node executes any external program from the command line:
1. Writes the input data to an input file
2. Calls the tool with the input file and command-line options; the tool writes its results to an output file
3. Reads the output file and presents the data at the output port
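The same three-step pattern can be sketched in Python to show what happens around the external program. This is an analogy, not the node's implementation: the CSV file format, the convention of passing the input and output paths as the last two arguments, and the uppercase stand-in tool are all assumptions made for the illustration.

```python
import csv
import subprocess
import sys
import tempfile
from pathlib import Path


def run_external_tool(rows, command):
    """Write rows to a file, run a command-line tool on it, read the result.

    rows: list of lists of strings.
    command: argument list; the input and output paths are appended
             (assumed calling convention).
    """
    with tempfile.TemporaryDirectory() as tmp:
        in_path = Path(tmp) / "input.csv"
        out_path = Path(tmp) / "output.csv"
        # 1. Write the input data to an input file.
        with in_path.open("w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        # 2. Call the tool; it is expected to write the output file.
        subprocess.run(command + [str(in_path), str(out_path)], check=True)
        # 3. Read the output file and return the data.
        with out_path.open(newline="") as handle:
            return list(csv.reader(handle))


# Stand-in "tool": a Python one-liner that uppercases the input file.
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())",
]
result = run_external_tool([["pnutz", "bad"], ["dada21", "good"]], UPPERCASE_TOOL)
```

The pattern only works with tools that run non-interactively, since nothing answers prompts while the tool is running.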

Alternative Sentiment Analysis

• Free, non-interactive command-line tools for sentiment analysis: none found
• SentiStrength v2.2 (still interactive)
• External Tool and Generic Web Service Client

Community Web Crawler Node

Web Crawling Workflow

XML Parsing Nodes


Next Steps

- Integrate topic information
- Integrate user demographic and behavioural information
- Discover [time series] patterns for early detection of negative users and superfans
- Try other techniques, maybe even on manually segmented data, to discover new user segments

Where do I find more?

Whitepaper: [email protected]

Complete Workflows + Data (www.knime.com):
- text mining
- network mining
- combined analysis
- clustering
(Note: the first three process huge data and require 16G memory.)

Open Source Software: KNIME, www.knime.com

Next Appointment

User Day US Boston (free)

October 22nd 2013, 10:00 - 17:00

Microsoft New England R&D Center (NERD), One Memorial Drive, Suite 100, Cambridge

http://www.knime.com/user-day-boston-2013

Hands-on Session

1. Download KNIME from www.knime.com

Hands-on Session

2. Install Extensions

Help -> Install New Software

Select:
• KNIME & Extensions

In KNIME Labs Extensions, select:
• KNIME Network Mining
• KNIME Textprocessing

Hands-on Session

3. Get workflows and Slashdot data

• Get workflows from USB stick (KNIMEBoston2013.zip):
  • Text Mining
  • Network Analytics
  • Text and Network Mining
  • Social Media Clustering
• Slashdot Raw Data is included in the downloaded workflows
• A smaller data set, Slashdot Reduced Data, is available for lower memory requirements
• Both data sets are available from the USB stick

Hands-on Session

3. Import Workflows

Hands-on Session

Memory Increase in knime.ini

-startup
plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20110502
-vmargs
-Xmx2G
-XX:MaxPermSize=256m
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UnsyncloadClass
-Dknime.enable.fastload=true
-Djava.library.path=C:\Users\rosy\Documents\R\win-library\2.15\rJava\jri\x64

Hands-on Session

5. Improve Workflows: Text Mining

Workflow annotations:
- Data Reading
- Data Preprocessing
- Tagging Words
- Scoring and Tag Cloud
- Reading Tag Corpus
- BoW

Hands-on Session

6. Improve Workflows: Network Analytics

Workflow annotations:
- Data Reading and preprocessing
- Create Network Object
- Clean up Network
- Visualize Network

zoomba

nahdude812