Network Analytics meets Text Mining for Social Media Analysis Dr. Rosaria Silipo Social Media Data Water Water Everywhere, and not a drop to drink.
Network Analytics meets Text Mining for Social Media Analysis
Dr. Rosaria Silipo
Social Media Data
Water Water Everywhere, and not a drop to drink.
What companies do with it:
• Download and keep
• Topic [Shift] Detection (email content routing, detect market interest shift, clinical studies, query non structured DBs, ...)
• Sentiment Analysis (marketing, polls, elections, ...)
• Connection Analysis (influencers, risk analysis, ...)
• ...
Social Media Data
The Analysis Tools:
• Web Crawlers
• Visual Exploration
• Topic Detection (Text Mining, NLP, Ontologies)
• Sentiment Score (Text Mining, NLP)
• Influence Score (Network Analytics)
• Find Groups (Predictive Analytics)
Case Study Example: Slashdot Data
[Figure labels: Post, Comments]
Basic Numbers:
• 24532 users
• 491 threads with
  • 15 – 843 responses
  • 12 – 507 users
• 113505 posts
• 60 main topics
• Selected Topic: Politics
Case Study Example: Slashdot
• Very rich data sources about customers!
• We want to establish:
  • How users feel about the discussed topic
  • Whether it matters how users feel
  • A more general abstraction of the results
Sentiment Analysis
Workflow annotations:
• Remove anonymous users, group by PostID
• Words Tagging
• MPQA Corpus
• Positive words
• Negative words
• Total Attitude by User
• User Bins
• Word cloud for selected users
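The scoring steps listed above can be sketched outside KNIME as follows. This is a minimal sketch under stated assumptions, not the workflow itself: the tiny word sets stand in for the MPQA lexicon, the anonymous user label, the function names, and the bin threshold are hypothetical, and the attitude is assumed to be the positive word count minus the negative word count per user.

```python
from collections import defaultdict

# Stand-ins for the MPQA subjectivity lexicon (assumption: the real
# lexicon is much larger and would be loaded from a file).
POSITIVE = {"good", "great", "freedom", "support"}
NEGATIVE = {"bad", "corrupt", "fail", "hate"}

ANONYMOUS = "Anonymous Coward"  # assumed label for anonymous posts


def attitude_by_user(posts):
    """Return {user: (positive_count, negative_count, attitude)}.

    `posts` is an iterable of (user, text) pairs. Attitude is assumed
    to be positive word count minus negative word count.
    """
    counts = defaultdict(lambda: [0, 0])
    for user, text in posts:
        if user == ANONYMOUS:
            continue  # drop anonymous users
        tally = counts[user]  # registers the user even without lexicon hits
        for word in text.lower().split():
            word = word.strip(".,!?;:\"'()")
            if word in POSITIVE:
                tally[0] += 1
            elif word in NEGATIVE:
                tally[1] += 1
    return {u: (p, n, p - n) for u, (p, n) in counts.items()}


def user_bin(attitude, threshold=0):
    """Bin a user as positive, negative, or neutral (hypothetical threshold)."""
    if attitude > threshold:
        return "positive"
    if attitude < -threshold:
        return "negative"
    return "neutral"


posts = [
    ("alice", "Great support for freedom!"),
    ("bob", "Corrupt and bad."),
    ("Anonymous Coward", "I hate this."),
]
scores = attitude_by_user(posts)
bins = {user: user_bin(attitude) for user, (_, _, attitude) in scores.items()}
```

Like the workflow, this sketch counts single words only: it does no sentence splitting and does not detect negation.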
Slashdot – Text Mining
Most Negative User: pNutz
Slashdot – Text Mining
Most Positive User: dada21
Slashdot – Sentiment Analysis
• 16016 positive users
• 7107 negative users
• Most positive user: dada21 (2838 positive / 1725 negative words)
• Most negative user: pNutz (43 positive / 109 negative words)
• Which topics do positive users have in common?
  – Government
  – People
  – Law/s
  – Money
  – Market
  – Parties
Network Creation
[Figure: example network with nodes User1, User2, User3, User4, User5, User6]
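One way to picture the network creation step is the sketch below. The slides do not define what an edge is, so this assumes a directed edge from a commenter to the author being replied to, weighted by the number of replies; `build_network` and the anonymous user label are hypothetical names used only for illustration.

```python
from collections import defaultdict

ANONYMOUS = "Anonymous Coward"  # assumed label for anonymous posts


def build_network(replies):
    """Build a weighted directed graph as {source: {target: weight}}.

    `replies` is an iterable of (replier, author) pairs. Anonymous
    users and self-replies are skipped.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for replier, author in replies:
        if ANONYMOUS in (replier, author) or replier == author:
            continue
        graph[replier][author] += 1
    # Convert nested defaultdicts to plain dicts for easy inspection.
    return {source: dict(targets) for source, targets in graph.items()}


replies = [
    ("User2", "User1"),
    ("User3", "User1"),
    ("User2", "User1"),
    ("Anonymous Coward", "User1"),
    ("User4", "User4"),
]
network = build_network(replies)
```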
Topic Graphs
Topic Graph: NASA
Topic Graph: Sci-Fi
Hubs & Authorities
• Hubs = Followers
• Authorities = Leaders
Workflow annotations:
• Users with hub and authority weights and other features
• Filtering anonymous users and creating network
• Centrality index to define hub weight and authority weight
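Hub and authority weights are commonly computed with the HITS power iteration. The sketch below shows that technique in plain Python. It is an illustration under assumptions, not necessarily what the KNIME node does internally: the edge direction (follower points to leader), the iteration count, and the function name `hits` are all hypothetical choices.

```python
import math


def hits(edges, iterations=50):
    """Compute hub and authority weights by HITS power iteration.

    `edges` is a list of (source, target) pairs of a directed graph.
    Returns two dicts: hub weights and authority weights.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of nodes pointing to it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub weight: sum of authority weights of nodes it points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return hub, auth


edges = [
    ("User2", "User1"),
    ("User3", "User1"),
    ("User4", "User1"),
    ("User4", "User5"),
]
hub, auth = hits(edges)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In this toy graph User1 receives the most replies and ends up with the highest authority weight, while User4, who replies to the most authors, gets the highest hub weight.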
Hubs & Authorities
[Figure: labeled users dada21, Carl Bialik from the WSJ, pNutz, Tube Steak, Doc Ruby, 99BottlesOfBeerInMyF]
KNIME: Bringing it all together
• Network Analysis: users with hub and authority weights and other features
• Text Analysis: user bins (positive, negative, neutral)
[Figure: labeled users Carl Bialik from the WSJ, Catbeller, dada21, Tube Steak, Doc Ruby, 99BottlesOfBeerInMyF, pNutz]
What we have found ...
- The positive leaders
- The neutral leaders
- The negative leaders
- The inactive users
What identifies each group?
How do I identify a new user?
How do I handle each user?
Why Clustering?
- No a priori knowledge (not even on a subset of users)
- Prediction and interpretation capabilities required
k-Means algorithm
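For readers unfamiliar with k-Means, here is a minimal plain-Python sketch of the standard algorithm (Lloyd's iteration). It is not the KNIME node: the feature layout (for example hub weight, authority weight, attitude), the random initialization, and the names `kmeans` and `nearest` are assumptions made for illustration.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )


def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Returns (centroids, assignment), where assignment[i] is the
    cluster index of points[i].
    """
    rng = random.Random(seed)
    # Initialize centroids with k distinct training points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(iterations):
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Two well-separated toy groups of users in a 2-feature space.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centroids, assignment = kmeans(points, k=2)
```

Because no labels are needed, the resulting centroids can be inspected afterwards to interpret each group, which matches the two requirements listed above.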
Re-sampling the Training Set
k = 10
The k-Means Clusters
[Figure labels: Neutral users, Negative users, Fans, Superfans]
Additional Discoveries
• There are only very few real leaders! Authority and hub scores identify active participants rather than leaders.
• Superfans can be found in cluster_3.
• Negative and (sigh!) active users are collected in cluster_1.
• Neutral users are usually inactive (cluster_2, cluster_7, and cluster_8).
• Positive users with different degrees of activity are scattered across the remaining clusters.
The operational Workflow
• Pre-processing
• Cluster Extraction
• Assignment of new data
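The assignment step can be sketched as a nearest-centroid lookup. The slides do not say what the pre-processing consists of, so this sketch assumes min-max normalization learned on the training data; the function names and the example centroids are hypothetical.

```python
def fit_min_max(rows):
    """Learn per-column (min, max) ranges from the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(row, ranges):
    """Scale one row to [0, 1] per column using the training ranges."""
    return [
        (value - lo) / (hi - lo) if hi > lo else 0.0
        for value, (lo, hi) in zip(row, ranges)
    ]


def assign_cluster(row, centroids, ranges):
    """Return the index of the centroid nearest to the scaled row."""
    scaled = apply_min_max(row, ranges)
    distances = [
        sum((a - b) ** 2 for a, b in zip(scaled, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


# Training data defines the scaling; centroids come from cluster extraction.
training_rows = [(0.0, 0.0), (10.0, 100.0)]
ranges = fit_min_max(training_rows)
centroids = [[0.1, 0.1], [0.9, 0.9]]  # hypothetical, already in scaled space

active_user = assign_cluster((9.0, 80.0), centroids, ranges)
quiet_user = assign_cluster((1.0, 5.0), centroids, ranges)
```

The important design point is that new data is scaled with the ranges learned during training, not with ranges recomputed on the new data.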
Notes
• MPQA Corpus: publicly available Subjectivity Lexicon (http://www.cs.pitt.edu/mpqa/lexicons.html)
• User Characterization is Sum -> Mean
• NLP: No sentence splitting, no negation identification.
• For a more refined syntax-based sentiment analysis -> "External Tool" node
External Tool Node
The "External Tool" node executes any external program from the command line:
1. Writes the input data to an input file
2. Calls the tool to run on the input file with the command line options and to write its results to an output file
3. Reads the output file and presents the data at the output port
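The same three-step pattern can be reproduced in a few lines of Python. This is only a sketch of the pattern, not the node's implementation: the convention of appending the input and output paths to the command, the file names, and the stand-in tool (a Python one-liner that upper-cases its input) are assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_external_tool(command, rows):
    """Run `command` on `rows` through an input file and an output file.

    `command` is a list of arguments; the input and output file paths
    are appended to it (an assumed calling convention).
    """
    with tempfile.TemporaryDirectory() as tmp:
        in_path = Path(tmp) / "input.txt"
        out_path = Path(tmp) / "output.txt"
        # 1. Write the input data to an input file.
        in_path.write_text("\n".join(rows) + "\n", encoding="utf-8")
        # 2. Call the tool on the input file; it writes the output file.
        subprocess.run(command + [str(in_path), str(out_path)], check=True)
        # 3. Read the output file and return its rows.
        return out_path.read_text(encoding="utf-8").splitlines()


# Stand-in "tool": upper-cases every line of the input file.
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys; open(sys.argv[2], 'w', encoding='utf-8')"
    ".write(open(sys.argv[1], encoding='utf-8').read().upper())",
]

result = run_external_tool(UPPERCASE_TOOL, ["good", "bad"])
```

Note that this only works with tools that run non-interactively, which is the limitation discussed for sentiment analysis tools.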
Alternative Sentiment Analysis
• No free, non-interactive command line tools for sentiment analysis were found
• SentiStrength v2.2 (still interactive)
• External Tool and Generic Web Service Client
Community Web Crawler Node
Web Crawling Workflow
XML Parsing Nodes
Next Steps
- Integrate topic information
- Integrate user demographic and behavioural information
- Discover [time series] patterns for early detection of negative users and superfans
- Try other techniques, maybe even on manually segmented data, to discover new user segments
Where do I find more?
Whitepaper: [email protected]
Complete Workflows + Data (www.knime.com):
- text mining
- network mining
- combined analysis
  (note: the above 3 process huge data and require 16G memory)
- clustering
Open Source Software: KNIME (www.knime.com)
Next Appointment
User Day US Boston (free)
October 22nd 2013, 10:00 - 17:00
Microsoft New England R&D Center (NERD)
One Memorial Drive, Suite 100, Cambridge
http://www.knime.com/user-day-boston-2013
Hands-on Session
1. Download KNIME from www.knime.com
Hands-on Session
2. Install Extensions
Help -> Install New Software
Select:
• KNIME & Extensions
In KNIME Labs Extensions, select:
• KNIME Network Mining
• KNIME Textprocessing
Hands-on Session
3. Get workflows and Slashdot data
• Get workflows from USB stick (KNIMEBoston2013.zip)
  • Text Mining
  • Network Analytics
  • Text and Network Mining
  • Social Media Clustering
• Slashdot Raw Data is included in the downloaded workflows
• A smaller set of data is available, Slashdot Reduced Data, for lower memory requirements
• Both data sets are available from USB Stick
Hands-on Session
3. Import Workflows
Hands-on Session
Memory Increase in knime.ini
-startup
plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20110502
-vmargs
-Xmx2G
-XX:MaxPermSize=256m
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UnsyncloadClass
-Dknime.enable.fastload=true
-Djava.library.path=C:\Users\rosy\Documents\R\win-library\2.15\rJava\jri\x64
Hands-on Session
5. Improve Workflows: Text Mining
Workflow annotations:
• Data Reading
• Data Preprocessing
• Tagging Words
• Scoring and Tag Cloud
• Reading Tag Corpus
• BoW
Hands-on Session
6. Improve Workflows: Network Analytics
Workflow annotations:
• Data Reading and preprocessing
• Create Network Object
• Clean up Network
• Visualize Network
zoomba
nahdude812