Text Analytics Workshop
Tom Reamy
Chief Knowledge Architect
KAPS Group
Program Chair – Text Analytics World
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
Introduction – State of Text Analytics
– Text Analytics Features
– Information / Knowledge Environment – Taxonomy, Metadata, Information Technology
– Value of Text Analytics
– Quick Start for Text Analytics
Development – Taxonomy, Categorization, Faceted Metadata
Text Analytics Applications
– Integration with Search and ECM
– Platform for Information Applications
Questions / Discussions
Introduction: KAPS Group
Knowledge Architecture Professional Services – Network of Consultants
Applied Theory – Faceted taxonomies, complexity theory, natural
categories, emotion taxonomies
Services:
– Strategy – IM & KM - Text Analytics, Social Media, Integration
– Taxonomy/Text Analytics development, consulting, customization
– Text Analytics Quick Start – Audit, Evaluation, Pilot
– Social Media: Text based applications – design & development
Partners – Smart Logic, Expert System, SAS, SAP, IBM, FAST,
Concept Searching, Attensity, Clarabridge, Lexalytics
Clients:
– Genentech, Novartis, Northwestern Mutual Life, Financial Times,
Hyatt, Home Depot, Harvard Business Library, British Parliament,
Battelle, Amdocs, FDA, GAO, World Bank, etc.
Presentations, Articles, White Papers – www.kapsgroup.com
Introduction: Text Analytics
History – academic research, focus on NLP
Inxight – out of Xerox PARC
– Moved TA from academic and NLP to auto-categorization, entity extraction, and Search-Metadata
Explosion of companies – many based on Inxight extraction with some analytical-visualization front ends
– Half from 2008 are gone – lucky ones got bought
Focus on enterprise text analytics – shift to sentiment analysis: easier to do, obvious payoff (customers, not employees)
– Backlash – Real business value?
Enterprise search down – 10 years of effort for what?
– Need Text Analytics to work
Text Analytics is slowly growing – time for a jump?
Current State of Text Analytics
Current Market: 2012 – exceeds $1 billion for text analytics (10% of total analytics)
Growing 20% a year
Search is 33% of total market
Other major areas:
– Sentiment and Social Media Analysis, Customer Intelligence
– Business Intelligence, Range of text-based applications
Fragmented marketplace – full platform, low level, specialty
– Embedded in content management, search. No clear leader.
Big Data – Big Text is bigger, text into data, data for text
– Watson – ensemble methods, pun module
Current State of Text Analytics: Vendor Space
Taxonomy Management – SchemaLogic, Pool Party
From Taxonomy to Text Analytics
– Data Harmony, Multi-Tes
Extraction and Analytics
– Linguamatics (Pharma), Temis, whole range of companies
Business Intelligence – Clear Forest, Inxight
Sentiment Analysis – Attensity, Lexalytics, Clarabridge
Open Source – GATE
Stand alone text analytics platforms – IBM, SAS, SAP, Smart
Logic, Expert System, Basis, Open Text, Megaputer, Temis,
Concept Searching
Embedded in Content Management, Search
– Autonomy, FAST, Endeca, Exalead, etc.
Future Directions: Survey Results
Important Areas:
– Predictive Analytics & text mining – 90%
– Search & Search-based Apps – 86%
– Business Intelligence – 84%
– Voice of the Customer – 82%, Social Media – 75%
– Decision Support, KM – 81%
– Big Data – other – 70%, Finance – 61%
– Call Center, Tech Support – 63%
– Risk, Compliance, Governance – 61%
– Security, Fraud Detection – 54%
Future Directions: Survey Results
28% just getting started, 11% not yet
What factors are holding back adoption of TA?
– Lack of clarity about value of TA – 23.4%
– Lack of knowledge about TA – 17.0%
– Lack of senior management buy-in – 8.5%
– Don’t believe TA has enough business value – 6.4%
Other factors
– Financial Constraints – 14.9%
– Other priorities more important – 12.8%
Lack of articulated strategic vision – by vendors, consultants, advocates, etc.
Introduction: Future Directions
What is Text Analytics Good For?
What is Text Analytics?
Text Mining – NLP, statistical, predictive, machine learning
Semantic Technology – ontology, fact extraction
Extraction – entities – known and unknown, concepts, events
– Catalogs with variants, rule based
Sentiment Analysis – Objects / Products and phrases
– Statistics, catalogs, rules – Positive and Negative
Auto-categorization
– Training sets, Terms, Semantic Networks
– Rules: Boolean – AND, OR, NOT
– Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE
– Disambiguation – Identification of objects, events, context
– Build rule-based categorization, not simply a Bag of Individual Words
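A minimal sketch of why proximity operators go beyond a bag of words. The function names and the exact semantics are assumptions for illustration (unordered distance for DIST, ordered distance for ORDDIST); real products define these operators in their own rule languages.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(tokens, a, b):
    """True if both terms occur anywhere in the document."""
    return a in tokens and b in tokens

def dist(tokens, a, b, n):
    """Assumed DIST semantics: a and b within n tokens, in either order."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

def orddist(tokens, a, b, n):
    """Assumed ORDDIST semantics: b follows a within n tokens."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(0 < j - i <= n for i in pos_a for j in pos_b)

tokens = tokenize("The bank raised rates. Later we walked along the river.")
print(bag_of_words(tokens, "bank", "river"))  # both words present somewhere
print(dist(tokens, "bank", "river", 3))       # but not near each other
print(dist(tokens, "bank", "rates", 3))       # these two are close together
```

The bag-of-words test fires on "bank" and "river" even though they are unrelated in this text; the proximity test does not, which is the kind of context a rule can use for disambiguation.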
Case Study – Categorization & Sentiment
Case Study – Taxonomy Development
Text Analytics Workshop: Information Environment
Building an Infrastructure
Semantic Layer = Taxonomies, Metadata, Vocabularies + Text
Analytics – adding cognitive science, structure to unstructured
Modeling users/audiences
Technology Layer
– Search, Content Management, SharePoint, Intranets
Publishing process, multiple users & info needs
– SharePoint – taxonomies but
• Folksonomies – still a bad idea
Infrastructure – Not an Application
– Business / Library / KM / EA – not IT
Building on the Foundation
– Info Apps (Search-based Applications)
Foundation of foundation – Text Analytics
Text Analytics Workshop: Information Environment
TA & Taxonomy – Complementary Information Platform
Taxonomy provides a consistent and common vocabulary
– Enterprise resource – integrated not centralized
Text Analytics provides a consistent tagging
– Human indexing is subject to inter- and intra-individual variation
Taxonomy provides the basic structure for categorization
– And candidate terms
Text Analytics provides the power to apply the taxonomy
– And metadata of all kinds
Text Analytics and Taxonomy Together – Platform
– Consistent in every dimension
– Powerful and economic
Text Analytics Workshop: Information Environment
Metadata - Tagging
How do you bridge the gap – taxonomy to documents?
Tagging documents with taxonomy nodes is tough
– And expensive – central or distributed
Library staff – experts in categorization, not subject matter
– Too limited, narrow bottleneck
– Often don’t understand business processes and business uses
Authors – Experts in the subject matter, terrible at categorization
– Intra and Inter inconsistency, “intertwingleness”
– Choosing tags from taxonomy – complex task
– Folksonomy – almost as complex, wildly inconsistent
– Resistance – not their job, cognitively difficult = non-compliance
Text Analytics is the answer(s)!
Text Analytics Workshop: Information Environment
Mind the Gap – Manual-Automatic-Hybrid
All require human effort – issue of where and how effective
Manual – human effort is tagging (difficult, inconsistent)
– Small, high value document collections, trained taggers
Automatic – human effort is prior to tagging – auto-categorization rules and/or NLP algorithm effort
Hybrid Model – before (like automatic) and after
– Build on expertise – librarians on categorization, SME’s on subject terms
Facets – Requires a lot of Metadata – Entity Extraction feeds facets – more automatic, feedback by design
Manual – Hybrid – Automatic is a spectrum – depends on context
Benefits of Text Analytics
Why Text Analytics?
– Enterprise search has failed to live up to its potential
– Enterprise Content management has failed to live up to its potential
– Taxonomy has failed to live up to its potential
– Adding metadata, especially keywords, has not worked
What is missing?
– Intelligence – human level categorization, conceptualization
– Infrastructure – Integrated solutions, not technology, software
Text Analytics can be the foundation that (finally) drives success
– search, content management, and much more
Costs and Benefits
IDC study – quantify cost of bad search
Three areas:
– Time spent searching
– Recreation of documents
– Bad decisions / poor quality work
Costs
– 50% search time is bad search = $2,500 a year per person
– Recreation of documents = $5,000 a year per person
– Bad quality (harder) = $15,000 a year per person
Per 1,000 people = $22.5 million a year
– 30% improvement = $6.75 million a year
– Add own stories – especially cost of bad information
– Human measure – # of FTE’s, savings passed on to customers, etc.
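The arithmetic behind the totals can be reproduced with a short sketch. The per-person figures are the ones quoted on the slide; the function and variable names are illustrative only, and the headcount and improvement rate are parameters you would replace with your own estimates.

```python
# Per-person annual costs in dollars, as quoted on the slide.
COST_PER_PERSON = {
    "bad_search": 2_500,
    "document_recreation": 5_000,
    "bad_quality": 15_000,
}

def annual_cost(headcount, costs=COST_PER_PERSON):
    """Total yearly cost of poor information access for an organization."""
    return headcount * sum(costs.values())

def annual_savings(headcount, improvement_pct, costs=COST_PER_PERSON):
    """Yearly savings for a given percentage improvement (integer percent)."""
    return annual_cost(headcount, costs) * improvement_pct // 100

print(annual_cost(1_000))         # 22500000
print(annual_savings(1_000, 30))  # 6750000
```

The three per-person costs sum to $22,500, which gives $22.5 million for 1,000 people and $6.75 million at a 30% improvement.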
Need for a Quick Start
Text Analytics is weird, a bit academic, and not very practical
• It involves language and thinking and really messy stuff
On the other hand, it is really difficult to do right (Rocket Science)
Organizations don’t know what text analytics is and what it is for
TAW Survey shows - need two things:
• Strategic vision of text analytics in the enterprise
• Business value, problems solved, information overload
• Text Analytics as platform for information access
• Real life functioning program showing value and demonstrating
an understanding of what it is and does
Quick Start – Strategic Vision – Software Evaluation – POC / Pilot
Text Analytics Vision & Strategy
Strategic Questions – why, what value from the text analytics, how are you going to use it
– Platform or Applications?
What are the basic capabilities of Text Analytics?
What can Text Analytics do for Search?
– After 10 years of failure – get search to work?
What can you do with smart search-based applications?
– RM, PII, Social
ROI for effective search – difficulty of believing
– Problems with metadata, taxonomy
Quick Start Step One – Knowledge Audit
Ideas – Content and Content Structure
– Map of Content – Tribal language silos
– Structure – articulate and integrate
– Taxonomic resources
People – Producers & Consumers
– Communities, Users, Central Team
Activities – Business processes and procedures
– Semantics, information needs and behaviors
– Information Governance Policy
Technology
– CMS, Search, portals, text analytics
– Applications – BI, CI, Semantic Web, Text Mining
Quick Start Step One – Knowledge Audit
Info Problems – what, how severe
Formal Process – Knowledge Audit
– Contextual & Information interviews, content analysis, surveys, focus groups, ethnographic studies, Text Mining
Informal for smaller organizations, specific application
Category modeling – Cognitive Science – how people think
– Panda, Monkey, Banana
Natural level categories mapped to communities, activities
• Novices prefer higher levels
• Balance of informativeness and distinctiveness
Strategic Vision – Text Analytics and Information/Knowledge
Environment
Quick Start Step Two - Software Evaluation
Varieties of Taxonomy/ Text Analytics Software
Software is more important to text analytics
– No spreadsheets for semantics
Taxonomy Management – extraction
Full Platform
– SAS, SAP, Smart Logic, Concept Searching, Expert System, IBM, Linguamatics, GATE
Embedded – Search or Content Management
– FAST, Autonomy, Endeca, Vivisimo, NLP, etc.
– Interwoven, Documentum, etc.
Specialty / Ontology (other semantic)
– Sentiment Analysis – Attensity, Lexalytics, Clarabridge, lots of others
– Ontology – extraction, plus ontology
Quick Start Step Two - Software Evaluation
Different Kind of software evaluation
Traditional Software Evaluation – Start
– Filter One – Ask Experts – reputation, research – Gartner, etc.
• Market strength of vendor, platforms, etc.
• Feature scorecard – minimum, must have, filter to top 6
– Filter Two – Technology Filter – match to your overall scope and capabilities – Filter not a focus
– Filter Three – In-Depth Demo – 3-6 vendors
Reduce to 1-3 vendors
Vendors have different strengths in multiple environments
– Millions of short, badly typed documents, build application
– Library 200-page PDF, enterprise & public search
Quick Start Step Two - Software Evaluation
Design of the Text Analytics Selection Team
IT – Experience with software purchases, needs assessment, budget
– Search/Categorization is unlike other software, deeper look
Business – understand business, focus on business value
They can get executive sponsorship, support, and budget
– But don’t understand information behavior, semantic focus
Library, KM – Understand information structure
Experts in search experience and categorization
– But don’t understand business or technology
Interdisciplinary Team, headed by Information Professionals
Much more likely to make a good decision
Create the foundation for implementation
Quick Start Step Three – Proof of Concept / Pilot Project
POC use cases – basic features needed for initial projects
Design - Real life scenarios, categorization with your content
Preparation:
– Preliminary analysis of content and users information needs
• Training & test sets of content, search terms & scenarios
– Train taxonomist(s) on software(s)
– Develop taxonomy if none available
Four-week POC – 2 rounds of develop, test, refine / not out of the box (OOB)
Need SME’s as test evaluators – also to do an initial
categorization of content
Majority of time is on auto-categorization
POC Design: Evaluation Criteria & Issues
Basic Test Design – categorize test set
– Score – by file name, human testers
Categorization & Sentiment – Accuracy 80-90%
– Effort Level per accuracy level
Combination of scores and report
Operators (DIST, etc.), relevancy scores, markup
Development Environment – Usability, Integration
Issues:
– Quality of content & initial human categorization
– Normalize among different test evaluators
– Quality of taxonomy – structure, overlapping categories
Quick Start for Text Analytics
Proof of Concept – Value of POC
Selection of best product(s)
Identification and development of infrastructure elements –
taxonomies, metadata – standards and publishing process
Training by doing – SME’s learning categorization,
Library/taxonomist learning business language
Understand effort level for categorization, application
Test suitability of existing taxonomies for range of applications
Explore application issues – example – how accurate does
categorization need to be for that application – 80-90%
Develop resources – categorization taxonomies, entity extraction
catalogs/rules
POC and Early Development: Risks and Issues
CTO Problem – This is not a regular software process
Semantics is messy, not just complex
– 30% accuracy isn’t 30% done – could be 90%
Variability of human categorization
Categorization is iterative, not “the program works”
– Need realistic budget and flexible project plan
Anyone can do categorization
– Librarians often overdo, SME’s often get lost (keywords)
Meta-language issues – understanding the results
– Need to educate IT and business in their language
Development
Text Analytics Development: Categorization Process
Start with Taxonomy and Content
Starter Taxonomy
– If no taxonomy, develop (steal) initial high level
• Textbooks, glossaries, Intranet structure
• Organization Structure – facets, not taxonomy
Analysis of taxonomy – suitable for categorization
– Structure – not too flat, not too large
– Orthogonal categories
Content Selection
– Map of all anticipated content
– Selection of training sets – if possible
– Automated selection of training sets – taxonomy nodes as first categorization rules – apply and get content
Text Analytics Development: Categorization Process
First Round of Categorization Rules
Term building – from content – basic set of terms that
appear often / important to content
Add terms to rule, apply to broader set of content
Repeat for more terms – get recall-precision “scores”
Repeat, refine, repeat, refine, repeat
Get SME feedback – formal process – scoring
Get SME feedback – human judgments
Test against more, new content
Repeat until “done” – 90%?
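The recall-precision “scores” in this cycle can be computed with a few lines. This is a sketch using the standard definitions; the document IDs and the idea of comparing rule output against an SME-judged set are illustrative assumptions, not a description of any particular product's scoring.

```python
def precision_recall(predicted, relevant):
    """Score one category rule against human (SME) judgments.

    predicted: documents the rule assigned to the category
    relevant:  documents the SMEs say belong in the category
    """
    predicted, relevant = set(predicted), set(relevant)
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical round: the rule tagged four documents, SMEs approved three.
rule_output = {"d1", "d2", "d3", "d4"}
sme_judgment = {"d2", "d3", "d5"}
precision, recall = precision_recall(rule_output, sme_judgment)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Adding terms to a rule usually raises recall and lowers precision; tightening it with operators does the opposite, which is why the process is a repeated refine-and-test loop.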
Text Analytics Development: Entity Extraction Process
Facet Design – from Knowledge Audit, K Map
Find and Convert catalogs:
– Organization – internal resources
– People – corporate yellow pages, HR
– Include variants
– Scripts to convert catalogs – programming resource
Build initial rules – follow categorization process
– Differences – scale, threshold – application dependent
– Recall – Precision – balance set by application
– Issue – disambiguation – Ford company, person, car
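A sketch of the kind of conversion script mentioned above: it turns a people catalog into an extraction lookup that includes name variants. The CSV layout, the variant patterns, and all function names are assumptions for illustration; a real catalog (HR system, corporate directory) would need its own mapping.

```python
import csv
import io
import re

# Hypothetical catalog export; real column names will differ.
CATALOG_CSV = """first,middle,last
Henry,X.,Ford
Ada,,Lovelace
"""

def name_variants(first, middle, last):
    """Generate a few common written forms of one person's name."""
    variants = {f"{first} {last}", f"{first[0]}. {last}", f"{last}, {first}"}
    if middle:
        variants.add(f"{first} {middle} {last}")
    return variants

def build_lookup(csv_text):
    """Map each lowercase variant to a canonical 'First Last' entry."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        canonical = f"{row['first']} {row['last']}"
        for variant in name_variants(row["first"], row["middle"], row["last"]):
            lookup[variant.lower()] = canonical
    return lookup

def extract_people(text, lookup):
    """Return canonical names whose variants appear as whole phrases."""
    lowered = text.lower()
    found = set()
    for variant, canonical in lookup.items():
        pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(canonical)
    return found

lookup = build_lookup(CATALOG_CSV)
print(extract_people("Memo from H. Ford to Lovelace, Ada.", lookup))
```

A catalog lookup like this does not resolve the disambiguation issue (Ford the company, the person, or the car); that step needs context rules of the kind used in categorization.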
Text Analytics Development: Demo
BioPharma – scientific vocabulary / articles
Case Study - Background
Inxight Smart Discovery
Multiple Taxonomies
– Healthcare – first target
– Travel, Media, Education, Business, Consumer Goods,
Content – 800+ Internet news sources
– 5,000 stories a day
Application – Newsletters
– Editors using categorized results
– Easier than full automation
Case Study - Approach
Initial High Level Taxonomy
– Auto generation – very strange – not usable
– Editors High Level – sections of newsletters
– Editors & Taxonomy Pro’s – Broad categories & refine
Develop Categorization Rules
– Multiple Test collections
– Good stories, bad stories – close misses – terms
Recall and Precision Cycles
– Refine and test – taxonomists – many rounds
– Review – editors – 2-3 rounds
Repeat – about 4 weeks
Case Study – Issues & Lessons
Taxonomy Structure: Aggregate vs. independent nodes
– Children Nodes – subset – rare
Trade-off of depth of taxonomy and complexity of rules
No best answer – taxonomy structure, format of rules
– Need custom development
– Recall more important than precision – editors’ role
Combination of SME and Taxonomy pros
– Combination of Features – Entity extraction, terms, Boolean, filters, facts
Training sets and find similar are weakest
Plan for ongoing refinement
Enterprise Environment – Case Studies
A Tale of Two Taxonomies
– It was the best of times, it was the worst of times
Basic Approach
– Initial meetings – project planning
– High level K map – content, people, technology
– Contextual and Information Interviews
– Content Analysis
– Draft Taxonomy – validation interviews, refine
– Integration and Governance Plans
Enterprise Environment – Case One – Taxonomy, 7 facets
Taxonomy of Subjects / Disciplines:
– Science > Marine Science > Marine microbiology > Marine toxins
Facets:
– Organization > Division > Group
– Clients > Federal > EPA
– Facilities > Division > Location > Building X
– Content Type – Knowledge Asset > Proposals
– Instruments > Environmental Testing > Ocean Analysis > Vehicle
– Methods > Social > Population Study
– Materials > Compounds > Chemicals
Enterprise Environment – Case One – Taxonomy, 7 facets
Project Owner – KM department – included RM, business process
Involvement of library – critical
Realistic budget, flexible project plan
Successful interviews – build on context
– Overall information strategy – where taxonomy fits
Good draft taxonomy and extended refinement
– Software, process, team – train library staff
– Good selection and number of facets
Developed broad categorization and one deep – Chemistry
Final plans and hand off to client
Enterprise Environment – Case Two – Taxonomy, 4 facets
Taxonomy of Subjects / Disciplines:
– Geology > Petrology
Facets:
– Organization > Division > Group
– Process > Drill a Well > File Test Plan
– Assets > Platforms > Platform A
– Content Type > Communication > Presentations
Enterprise Environment – Case Two – Taxonomy, 4 facets
Environment & Project Issues
Value of taxonomy understood, but not the complexity and scope
– Under-budgeted, under-staffed
Location – not KM – tied to RM and software
– Solution looking for the right problem
Importance of an internal library staff
– Difficulty of merging internal expertise and taxonomy
Project mind set – not infrastructure
– Rushing to meet deadlines doesn’t work with semantics
Importance of integration – with team, company
– Project plan more important than results
Enterprise Environment – Case Two – Taxonomy, 4 facets
Research and Design Issues
Research Issues
– Not enough research – and wrong people
– Misunderstanding of research – wanted tinker toy connections
• Interview 1 leads to taxonomy node 2
Design Issues
– Not enough facets
– Wrong set of facets – business not information
– Ill-defined facets – too complex internal structure
Enterprise Environment – Case Two – Taxonomy, 4 facets
Conclusion: Risk Factors
Political-Cultural-Semantic Environment
– Not simple resistance - more subtle
• Re-interpretation of specific conclusions and sequence of conclusions / relative importance of specific recommendations
Access to content and people
– Enthusiastic access
Importance of a unified project team
– Working communication as well as weekly meetings
Applications
Building on the Foundation
Text Analytics: Create the Platform – CM & Search
– New Electronic Publishing Process
• Use text analytics to tag, new hybrid workflow
– New Enterprise Search
• Build faceted navigation on metadata, extraction
Enhance Information Access in the Enterprise – InfoApps
– Governance, Records Management, Doc duplication, Compliance
– Applications – Business Intelligence, CI, Behavior Prediction
– eDiscovery, litigation support, Fraud detection
– Productivity / Portals – spider and categorize, extract
Information Platform: Content Management
Hybrid Model – Internal Content Management
– Publish Document -> Text Analytics analysis -> suggestions for categorization, entities, metadata -> present to author
– Cognitive task is simple -> react to a suggestion instead of select from head or a complex taxonomy
– Feedback – if author overrides -> suggestion for new category
External Information - human effort is prior to tagging
– More automated, human input as specialized process –
periodic evaluations
– Precision usually more important
– Target usually more general
Text Analytics and Search
Multi-dimensional and Smart
Faceted Navigation has become the basic / norm
– Facets require huge amounts of metadata
– Entity / noun phrase extraction is fundamental
– Automated with disambiguation (through categorization)
Taxonomy – two roles – subject/topics and facet structure
– Complex facets and faceted taxonomies
Clusters and Tag Clouds – discovery & exploration
Auto-categorization – aboutness, subject facets
– This is still fundamental to search experience
– InfoApps only as good as fundamentals of search
People – tagging, evaluating tags, fine tune rules and taxonomy
Integrated Facet Application
Design Issues - General
What is the right combination of elements?
– Dominant dimension or equal facets
– Browse topics and filter by facet, search box
– How many facets do you need?
Scale requires more automated solutions
– More sophisticated rules
Issue of disambiguation:
– Same person, different name – Henry Ford, Mr. Ford, Henry X. Ford
– Same word, different entity – Ford and Ford
Number of entities and thresholds per results set / document
– Usability, audience needs
Relevance Ranking – number of entities, rank of facets
Text Analytics Workshop : Applications
Text and Data: Two Way Street
New types of applications
– New ways to make sense of data, enrich data
Harvard – Analyzing Text as Data
– Detecting deception, Frame Analysis
Narrative Science – take data (baseball statistics, financial data)
and turn into a story
Political campaigns using Big Data, social media, and text
analytics
Watson for healthcare – help doctors keep up with massive
information overload
Text Analytics Workshop : Applications
Social Media: Beyond Simple Sentiment
Beyond Good and Evil (positive and negative)
– Social Media is approaching next stage (growing up)
– Where is the value? How to get better results?
Importance of Context – around positive and negative words
– Rhetorical reversals – “I was expecting to love it”
– Issues of sarcasm (“Really Great Product”), slanguage
Granularity of Application
– Early Categorization – Politics or Sports
Limited value of Positive and Negative
– Degrees of intensity, complexity of emotions and documents
– Addition of focus on behaviors – why someone calls a support center – and likely outcomes
Text Analytics Workshop : Applications
Social Media: Beyond Simple Sentiment
Two basic approaches [Limited accuracy, depth]
– Statistical Signature of Bag of Words
– Dictionary of positive & negative words
Essential – need full categorization and concept extraction
New Taxonomies – Appraisal Groups – Adjective and modifiers – “not very good”
– Supports more subtle distinctions than positive or negative
Emotion taxonomies – Joy, Sadness, Fear, Anger, Surprise, Disgust
– New Complex – pride, shame, confusion, skepticism
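To illustrate how an appraisal group such as “not very good” carries more information than a positive/negative word list, here is a toy scorer. The word lists, weights, and the rule that negating an intensified adjective yields a milder opposite are all hypothetical choices made for this sketch, not an established model.

```python
# Hypothetical lexicons; a real appraisal taxonomy would be far richer.
ADJECTIVES = {"good": 1.0, "great": 1.0, "bad": -1.0, "terrible": -1.0}
INTENSIFIERS = {"very": 1.5, "really": 1.5, "slightly": 0.5}
NEGATORS = {"not", "never"}

def appraisal_score(phrase):
    """Score a short adjective group, reading modifiers left to right."""
    negated = False
    strength = 1.0
    for token in phrase.lower().split():
        if token in NEGATORS:
            negated = not negated
        elif token in INTENSIFIERS:
            strength *= INTENSIFIERS[token]
        elif token in ADJECTIVES:
            base = ADJECTIVES[token]
            # Assumption: "not very good" is milder than "not good".
            return -base / strength if negated else base * strength
    return 0.0  # no known adjective found

for phrase in ("good", "very good", "not good", "not very good"):
    print(phrase, round(appraisal_score(phrase), 2))
```

A plain dictionary lookup would score all four phrases as positive because each contains “good”; treating the adjective and its modifiers as one unit separates them.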
Text Analytics Workshop: Applications
Expertise Analysis
Expertise Analysis
– Experts think & write differently – process, chunks
Expertise Characterization for individuals, communities, documents, and
sets of documents
Applications:
– Business & Customer intelligence, Voice of the Customer
– Deeper understanding of communities, customers – better models
– Security, threat detection – behavior prediction, Are they experts?
– Expertise location – generate automatic expertise characterization
Crowd Sourcing – technical support to wikis
Political – conservative and liberal minds/texts
– Disgust, shame, cooperation, openness
Text Analytics Workshop: Applications
Behavior Prediction – Telecom Customer Service
Problem – distinguish customers likely to cancel from mere threats
Basic Rule
– (START_20, (AND, (DIST_7,"[cancel]", "[cancel-what-cust]"),
(NOT,(DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
Examples:
– customer called to say he will cancell his account if the does not stop receiving a call from the ad agency.
– cci and is upset that he has the asl charge and wants it off or her is going to cancel his act
More sophisticated analysis of text and context in text
Combine text analytics with Predictive Analytics and traditional behavior
monitoring for new applications
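The basic rule above can be approximated in a few lines to show how it separates the two example notes. This is a sketch under stated assumptions: the members of each term class ([cancel], [cancel-what-cust], [one-line], [restore], [if]) are not given on the slide and are invented here, and START_20 is read as “the cancel term falls within the first 20 tokens”.

```python
import re

# Hypothetical term classes; the slide names them but does not list members.
CANCEL = {"cancel", "cancell", "cancels", "canceling", "cancelling"}
CANCEL_WHAT = {"account", "acct", "act", "service"}
ONE_LINE = {"line"}
RESTORE = {"restore", "restored", "reinstate"}
IF = {"if", "unless"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def positions(tokens, terms, limit=None):
    """Indexes of tokens in `terms`, optionally only in the first `limit`."""
    span = tokens if limit is None else tokens[:limit]
    return [i for i, t in enumerate(span) if t in terms]

def within(pos_a, pos_b, n):
    """True if any pair of positions is at most n tokens apart."""
    return any(abs(a - b) <= n for a in pos_a for b in pos_b)

def likely_to_cancel(note):
    tokens = tokenize(note)
    cancel = positions(tokens, CANCEL, limit=20)            # START_20
    target = positions(tokens, CANCEL_WHAT)
    hedges = positions(tokens, ONE_LINE | RESTORE | IF)
    return (within(cancel, target, 7)                       # DIST_7
            and not within(cancel, hedges, 10))             # NOT DIST_10

threat = ("customer called to say he will cancell his account if the does "
          "not stop receiving a call from the ad agency.")
intent = ("cci and is upset that he has the asl charge and wants it off or "
          "her is going to cancel his act")
print(likely_to_cancel(threat), likely_to_cancel(intent))
```

Under these assumptions the first note is treated as a conditional threat because “if” sits close to the cancel term, while the second is flagged as a likely cancellation.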
Text Analytics Workshop: Applications
Variety of New Applications
Essay Evaluation Software - Apply to expertise characterization
– Avoid gaming the system – multi-syllabic nonsense
• Model levels of chunking, procedure words over content
Legal Review
– Significant trend – computer-assisted review (manual = too many)
– TA – categorize and filter to smaller, more relevant set
– Payoff is big – One firm with 1.6 M docs – saved $2M
Financial Services
– Trend – using text analytics with predictive analytics – risk and fraud
– Combine unstructured text (why) and structured transaction data (what)
– Customer Relationship Management, Fraud Detection
– Stock Market Prediction – Twitter, impact articles
Text Analytics Workshop: Applications
Pronoun Analysis: Fraud Detection; Enron Emails
Patterns of “Function” words reveal wide range of insights
Function words = pronouns, articles, prepositions, conjunctions, etc.
– Used at a high rate, short and hard to detect, very social, processed
in the brain differently than content words
Areas: sex, age, power-status, personality – individuals and groups
Lying / Fraud detection: Documents with lies have
– Fewer and shorter words, fewer conjunctions, more positive emotion
words
– More use of “if, any, those, he, she, they, you”, less “I”
– More social and causal words, more discrepancy words
Current research – 76% accuracy in some contexts
Text Analytics can improve accuracy and utilize new sources
Data analytics (standard AML) can improve accuracy
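As a rough illustration of what “patterns of function words” means in practice, the sketch below measures how often the marker words listed above appear relative to “I”. The function name and output format are invented for this example, and the rates are only raw features: the sketch makes no claim about detecting deception on its own.

```python
import re

# Marker words quoted on the slide; "I" is tracked separately.
MARKER_WORDS = {"if", "any", "those", "he", "she", "they", "you"}

def function_word_rates(text):
    """Share of tokens that are marker words, and share that are 'I'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"marker_rate": 0.0, "first_person_rate": 0.0}
    markers = sum(1 for t in tokens if t in MARKER_WORDS)
    first_person = sum(1 for t in tokens if t == "i")
    return {
        "marker_rate": markers / len(tokens),
        "first_person_rate": first_person / len(tokens),
    }

print(function_word_rates("If they ask you"))
print(function_word_rates("I think I saw it"))
```

Features like these would normally be combined with other signals (word length, conjunctions, emotion words) and with structured data before any scoring is attempted.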
Conclusions
Text Analytics and Taxonomy are partners – enrich each other
Text Analytics can mind the gap – between taxonomies and
documents
Text Analytics needs strategic vision and quick start
– Need to approach as platform – deep context – understand information environment
Text Analytics is a platform for huge range of applications:
– Search and Content Management and Basic productivity apps
– New kinds of applications – social, data, Info Apps of all kinds
Want to learn more – come to Text Analytics World in Boston in Fall!
– Early Bird Registration – www.textanalyticsworld.com
Questions?
Tom Reamy
[email protected]
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Resources
Books
– Women, Fire, and Dangerous Things
• George Lakoff
– Knowledge, Concepts, and Categories
• Koen Lamberts and David Shanks
– Formal Approaches in Categorization
• Ed. Emmanuel Pothos and Andy Wills
– The Mind
• Ed. John Brockman
• Good introduction to a variety of cognitive science theories, issues, and new ideas
– Any cognitive science book written after 2009
Resources
Conferences – Web Sites
– Text Analytics World – All aspects of text analytics
• Oct 2-3, Boston
– http://www.textanalyticsworld.com
– Text Analytics Summit
– http://www.textanalyticsnews.com
– Semtech
– http://www.semanticweb.com
Resources
Blogs
– SAS – http://blogs.sas.com/text-mining/
Web Sites
– Taxonomy Community of Practice: http://finance.groups.yahoo.com/group/TaxoCoP/
– LinkedIn – Text Analytics Summit Group: http://www.LinkedIn.com
– Whitepaper – CM and Text Analytics: http://www.textanalyticsnews.com/usa/contentmanagementmeetstextanalytics.pdf
– Whitepaper – Enterprise Content Categorization strategy and development – http://www.kapsgroup.com
Resources
Articles
– Malt, B. C. 1995. Category coherence in cross-cultural perspective. Cognitive Psychology 29, 85-148
– Rifkin, A. 1985. Evidence for a basic level in event taxonomies. Memory & Cognition 13, 538-56
– Shaver, P., J. Schwartz, D. Kirson, C. O’Connor 1987. Emotion knowledge: further exploration of a prototype approach. Journal of Personality and Social Psychology 52, 1061-1086
– Tanaka, J. W. & M. E. Taylor 1991. Object categories and expertise: is the basic level in the eye of the beholder? Cognitive Psychology 23, 457-82
Resources
LinkedIn Groups:
– Text Analytics World
– Text Analytics Group
– Data and Text Professionals
– Sentiment Analysis
– Metadata Management
– Semantic Technologies
Journals
– Academic – Cognitive Science, Linguistics, NLP
– Applied – Scientific American Mind, New Scientist