TextAnalyst presentation

Text Mining
Applications and technologies
Sergei Ananyan
Megaputer Intelligence, Inc.
www.megaputer.com
© 2001 Megaputer Intelligence, Inc.
Outline

Definitions and application fields

Text mining functionality

Case study

Technology

Future developments
Text Mining
Text Mining is a process of


extracting new, valid, and actionable
knowledge dispersed throughout text
documents and
utilizing this knowledge to better
organize information for future reference.
Tasks addressed by TM







Search and retrieval
Semantic analysis
Clustering
Categorization
Feature extraction
Ontology building
Dynamic focusing
DM and TM comparison

Object of investigation
  Data Mining: Numerical and categorical data
  Text Mining: Texts
Object structure
  Data Mining: Relational databases
  Text Mining: Free form texts
Goal
  Data Mining: Predict outcomes of future situations
  Text Mining: Retrieve relevant information, distill the meaning, categorize and target-deliver
Methods
  Data Mining: Machine learning: SKAT, DT, NN, GA, MBR, MBA
  Text Mining: Indexing, special neural network processing, linguistics, ontologies
Current market size
  Data Mining: 100,000 analysts at large and midsize companies
  Text Mining: 100,000,000 corporate workers and individual users
Maturity
  Data Mining: Broad implementation since 1994
  Text Mining: Broad implementation starting 2000
TM tasks in detail

Information search and retrieval

Index-based
•

Ontology-based
•
•

HotBot, dt-Search
Semantics and linguistics enhanced
•

Yahoo, Lycos
Megaputer – ontology building
Boolean search + stemming
•

Excite, Alta Vista
Megaputer
Dynamic focusing
•
Megaputer
TM tasks in detail (continued)

Semantic analysis

Neural network and customized dictionaries
• Megaputer
Bayesian inference
• Autonomy
Clustering and categorization
• Megaputer, Microsystems
Linguistics
• Megaputer
Feature extraction
• SRA, Megaputer, IBM
Possible applications





Search engines
Enterprise portals
Knowledge management systems
e-Business systems
Vertical applications:



e-mail categorization and routing
Call center notes categorization
CRM systems
Typical setups

Venture capitalist




Search and retrieval
Estimation of relevance
Summarization and navigation
Investment or Insurance company




Categorization of incoming messages
Target-sharing information with employees
Structured fragments extraction (numbers)
Feature extraction (who owns whom)
Typical setups (continued)

Government agency

Intelligent information retrieval
Chain of events tracing
Supplement documents with their summaries for
more efficient reference
e-Business



Match resource description to a user query
Learn visitor interests by analyzing the content
browsed
Match interests to available resources
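
The e-Business setup above (learn visitor interests from browsed content, then match them to resources) can be illustrated with a minimal sketch. It assumes plain word counts and cosine similarity, which are stand-ins chosen for illustration and not the method used by any Megaputer product; the function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Count words longer than three characters (a crude content filter)."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return Counter(w for w in words if len(w) > 3)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def learn_interests(pages_browsed):
    """Accumulate one interest profile from the text of every page visited."""
    profile = Counter()
    for page in pages_browsed:
        profile.update(term_vector(page))
    return profile


def rank_resources(profile, resources):
    """Order resource names by similarity of their description to the profile."""
    return sorted(
        resources,
        key=lambda name: cosine(profile, term_vector(resources[name])),
        reverse=True,
    )


# Hypothetical sample data.
browsed = [
    "Mutual funds and bond markets today",
    "Bond yields rise as markets rally",
]
resources = {
    "travel_deals": "Cheap flights and hotel deals",
    "finance_newsletter": "Weekly bond and stock markets digest",
}
ranking = rank_resources(learn_interests(browsed), resources)
```

Here the profile is built silently from what the visitor reads, and the ranking step is the same operation whether the "query" is a typed search or a learned profile.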
Text and the Web

99% of analytical information on the
Web exists in the form of texts

The Web is the place where users
routinely encounter new texts

99% of e-Businesses today do not
leverage the competitive advantage
provided by their content-rich
websites because they do not utilize
text mining to the extent they should
Example: nytimes.com







Extremely rich content
Large audience: 10+ million e-mails
Generates revenue from advertisers
Uses an anonymous survey for login
Does a very good job tracking
individual pages accessed
Can furnish a demographic profile
of the visitors to any page
But does not utilize text mining.
Cannot see a customer-centered view.
Example: nytimes.com (continued)

Could significantly increase the value
of each visitor to advertisers by doing
individualized marketing
Rich content and high visitor loyalty
are ideal for learning visitors’ interests
through text mining
This silent surveying is done
unobtrusively
Privacy is preserved
Potential result: increased revenue
Megaputer text mining

TextAnalyst*


Tech: combination of n-grams and Neural Networks
Scope: Analyst’s desktop solution
*  Microsystems Ltd., a Megaputer business partner. Megaputer has exclusive distribution rights for TextAnalyst.

Textractor


Tech: Morphological analysis, Semantic analysis
(WordNet and its extensions), Statistical and
Fuzzy Logic analysis
Scope: Enterprise solution
TextAnalyst*
Overview
TextAnalyst

TextAnalyst is a tool for semantic
analysis, navigation, and search of
unstructured texts.

TextAnalyst is available as


Standalone application
SDK of COM components for easy integration
TextAnalyst functionality







Distilling the meaning (Semantic Network)
Navigation
Summarization
Topic explication
Clustering
Dynamic focusing
Categorization (TextAnalyst COM)
TextAnalyst
Customer base: 300+ installations
Sample customers
Ask Jeeves (USA)
Pfizer (USA)
IMS Health (USA)
TRW (USA)
The Gallup Organization (USA)
McKinsey & Company (USA)
Centers for Disease Control (USA)
Liberty Mutual (USA)
Best Buy (USA)
Logicon (USA)
France Telecom (France)
Net Shepherd (Canada)
Skila.com (USA)
Dept of Environmental Protection (Australia)
US Navy (USA)
KPN Research (Netherlands)
Dow Chemical (USA)
Talkie.com (USA)
Clontech (USA)
NICE Systems (Israel)
TextAnalyst
Underlying
Technology
Text image

Semantic Network - a list of the
most important concepts (words and
word combinations) and relations
between them
[Figure: example semantic network. Concept nodes with semantic weights: temperature (95), Temperature fusion (100), nuclear reactions (98), papers (86), nuclear (100), heat (99), cell (98), Peterson (96). Link weights shown: 70, 59, 78, 52, 37, 29, 46, 28, 63.]
Semantic network creation


Text is a string of characters: letters,
spaces, punctuation marks
Steps for building Semantic Network





Break text into words and sentences
Push through an n-character window
Feed patterns to a Recurrent Hierarchical Neural
Network and record frequencies
Identify relations between concepts (joint
occurrence in a sentence)
Carry out preliminary semantic network
renormalization (Hopfield-like Neural Network) to assign semantic weights
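
The sentence splitting, n-character window and joint-occurrence steps listed above can be sketched roughly as follows. This is only an illustration: plain frequency counters stand in for the Recurrent Hierarchical Neural Network, whose internals the slides do not describe, and all function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Break text into sentences at '.', '!' and '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def char_windows(text, n=3):
    """Slide an n-character window over the text and count each pattern."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cooccurrence(sentences):
    """Count how often two distinct words occur in the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        words = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
        pairs.update(combinations(words, 2))
    return pairs


sentences = split_sentences("Heat drives fusion. Fusion needs heat.")
patterns = char_windows("abcab", n=2)
relations = cooccurrence(sentences)
```

Word pairs are stored in alphabetical order, so ("fusion", "heat") is the same relation regardless of which word came first in the sentence.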
General & Text-specific tasks




Parse and reorganize input into
sequences of words joined by
concatenation and separation signs
Recognize and remove auxiliary words
and flective morphemes
Recognize, count and store stem
morphemes
Identify words sharing stem
morphemes
Hierarchical Recurrent NN
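
As a rough illustration of the tasks listed above (removing auxiliary words and flective morphemes, counting stems, and identifying words that share a stem), here is a minimal sketch. It assumes a hand-written auxiliary word list and a fixed set of endings, whereas the slides describe a neural network that learns these from text; both lists and the function names are hypothetical.

```python
from collections import defaultdict

# Assumed, hand-picked lists; TextAnalyst is described as learning these.
AUXILIARY = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}
ENDINGS = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching ending, keeping at least three characters."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word


def group_by_stem(text):
    """Return (stem -> set of word forms, stem -> occurrence count)."""
    forms = defaultdict(set)
    counts = defaultdict(int)
    for raw in text.lower().split():
        word = raw.strip(".,;:!?")
        if not word or word in AUXILIARY:
            continue  # auxiliary words carry grammatical, not lexical, meaning
        key = stem(word)
        forms[key].add(word)
        counts[key] += 1
    return dict(forms), dict(counts)


forms, counts = group_by_stem("The reactor heats and the heated cell is heating")
```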
General & Text-specific tasks

Identify relationships



Text - joint occurrence in sentences
Preliminary SN renormalization:
optimization task similar to Hopfield
network
Association of concepts in SN with
sentences and context in original text
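
The slides do not specify the renormalization procedure beyond calling it an optimization task similar to a Hopfield network. The sketch below is therefore only a loose analogy: a simple iterative relaxation in which each concept's weight is pulled toward the weights of the concepts it co-occurs with, then rescaled so the strongest concept scores 100. The update rule, the `damping` parameter and the sample numbers are all assumptions made for illustration.

```python
def renormalize(freq, links, iterations=20, damping=0.5):
    """freq: concept -> raw frequency; links: (a, b) -> co-occurrence count.

    Returns concept -> integer weight on a 0..100 scale.
    """
    top_freq = max(freq.values(), default=0) or 1
    top_link = max(links.values(), default=0) or 1
    base = {c: f / top_freq for c, f in freq.items()}
    weights = dict(base)
    for _ in range(iterations):
        updated = {}
        for concept in weights:
            # Support received from neighbours, scaled by link strength.
            support = 0.0
            for (a, b), count in links.items():
                if concept == a:
                    support += (count / top_link) * weights.get(b, 0.0)
                elif concept == b:
                    support += (count / top_link) * weights.get(a, 0.0)
            updated[concept] = (1 - damping) * base[concept] + damping * support
        top = max(updated.values(), default=0) or 1
        weights = {c: v / top for c, v in updated.items()}
    return {c: round(100 * w) for c, w in weights.items()}


# Hypothetical frequencies and co-occurrence counts.
semantic_weights = renormalize(
    {"fusion": 10, "heat": 8, "papers": 2},
    {("fusion", "heat"): 5, ("heat", "papers"): 1},
)
```

The point of such a step is that a concept's final weight depends on its place in the network and not only on its raw frequency.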
Case study


IRLP provides R&D assistance and
information services to Indiana’s small
businesses and governmental units
IRLP searches SBIR and the Commerce
Business Daily to identify research funding
opportunities for its clients.
“TextAnalyst was able to find the
necessary matches even for those
clients where existing search program
was incompatible.”
-- Cindy Moore, Marketing Coordinator, IRLP
Customer
quotes
TextAnalyst supports medical research
at Centers for Disease Control
Eleanor McLellan
Data Manager / Analyst
Centers for Disease Control
Atlanta, GA
"TextAnalyst is able to
efficiently handle
numerous and often large
(90+ pages apiece) text
files without any problem.
Furthermore, the program
is extremely user-friendly."
TextAnalyst helps process texts
at Clontech
Nikolai Kalnin, Ph.D.
Team Leader
Bioinformatics Group
CLONTECH Laboratories, Inc.
Palo Alto, CA
"TextAnalyst has been
selected as the only text
analysis tool capable of
establishing relations
between terms. It is
reasonably priced, easy to
install and operate."
TextAnalyst saves time and resources
for CaseBank
Kalyan Gupta, Ph.D.
Director, Research
CaseBank Technologies Inc.
Brampton, Ontario
"TextAnalyst is used at
CaseBank to identify and
assess the contents of
electronic repositories of
troubleshooting and
maintenance information.
It saves case preparation
time and allows CaseBank
to be more responsive to
its customer's knowledge
retrieval needs."
Future developments



Text categorization (now implemented
in TextAnalyst COM)
Thesaurus-based text retrieval
Integration with Web technologies
TextAnalyst evaluation
We invite you to download a FREE
evaluation copy of TextAnalyst from
www.megaputer.com
and enjoy using it hands-on following
the provided step-by-step lessons, or
exploring your own data.
Textractor™
Technology and Applications
Textractor capabilities






Key senses extraction
Hierarchical clustering
Categorization
Summarization
Intelligent search
Feature extraction
Textractor applications

General

Automated email categorization and routing
(categories can be provided by the user or determined by the system)

Knowledge extraction from call center notes
(example: occupational hazard determination)

Knowledge-based executive reporting system
(one-glance knowledge visualization)

Flexible searching for support documentation
(semantic relations between terms: synonyms, hyponyms, meronyms)


Competitive intelligence
Insurance

Clustering of claims and ontology building
(hierarchical organization of textual data)

Automated feature extraction and claim tagging
Textractor analysis steps



Morphological analysis
Syntactic analysis
Semantic analysis - WordNet filtering
(synonymy, antonymy, hyper/hyponymy and holo/meronymy)

Statistical analysis
(frequency of terms against background frequencies)

Context Analysis
(polysemy resolution and term collocations)

Semantic Network comparison
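
The statistical analysis step (frequency of terms against background frequencies) can be illustrated with a minimal sketch. It assumes a smoothed log-ratio of relative frequencies as the score; the slides do not say which statistic Textractor uses, and the function name and sample texts are hypothetical.

```python
import math
from collections import Counter


def salient_terms(doc_tokens, background_tokens, top_k=3):
    """Rank document terms by how much more frequent they are than in the background."""
    doc = Counter(doc_tokens)
    background = Counter(background_tokens)
    n_doc = sum(doc.values())
    n_bg = sum(background.values())
    vocab = len(set(doc) | set(background))
    scores = {}
    for term, count in doc.items():
        # Add-one smoothing keeps unseen background terms from dividing by zero.
        p_doc = (count + 1) / (n_doc + vocab)
        p_bg = (background[term] + 1) / (n_bg + vocab)
        scores[term] = math.log(p_doc / p_bg)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


claim = "the claim reports a collision and the collision caused injury".split()
general = "the a and the of a the and report the a".split()
top_terms = salient_terms(claim, general)
```

Words such as "the" are frequent in the document but even more frequent in the background, so they score low and drop out of the result.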
WordNet

WordNet is a comprehensive semantically
organized lexical database for English
www.cogsci.princeton.edu/~wn

Textractor provides an ability to expand and
edit WordNet for a specific application field.
Semantic term relationships

Synonyms
Accident – Collision – Wreck

Hyper/Hyponyms
Bird (hypernym) : Eagle, Hawk, Pigeon (hyponyms)

Holo/Meronyms
Car (holonym) :: Motor, Windshield, Tire (meronyms)

Antonyms
Cold <> Hot,
Deep <> Shallow

Polysemy
Commercial Bank vs. River Bank
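
One way these relationships can support flexible searching is query expansion. The sketch below uses a tiny hand-built lexicon containing only the slide's own examples; it does not call WordNet or any Textractor API, and the data structures and function name are hypothetical.

```python
# Toy lexicon built from the examples on this slide (not real WordNet data).
SYNSETS = [{"accident", "collision", "wreck"}]
HYPONYMS = {"bird": {"eagle", "hawk", "pigeon"}}
MERONYMS = {"car": {"motor", "windshield", "tire"}}


def expand_query(terms, use_hyponyms=True, use_meronyms=False):
    """Return the query terms plus related terms from the toy lexicon."""
    expanded = set()
    for term in terms:
        term = term.lower()
        expanded.add(term)
        for synset in SYNSETS:
            if term in synset:
                expanded |= synset  # synonymy is symmetric
        if use_hyponyms:
            expanded |= HYPONYMS.get(term, set())
        if use_meronyms:
            expanded |= MERONYMS.get(term, set())
    return expanded


query = expand_query(["Collision", "bird"])
```

Hyponym expansion is on by default because a search for "bird" usually should match "eagle", while part-of expansion is opt-in because a search for "car" should not always match "tire".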
Textractor architecture
[Diagram. Blocks shown: Data sources; Filters and DW interfaces; Text Mining Engines; Core TM engines; Application-oriented TM engines. Components shown: Morphological Analysis; Semantic Analysis; WordNet; Field-specific WordNet Extensions; WordNet Extension Editor; Stored Indices; Syntactic Analysis; Link Parser.]
Textractor text mining engines
[Diagram. Groups shown: Core TM engines; Application-oriented TM engines. Engines shown: Intelligent Searcher; Text indexer; Text Categorizer; Formal search query creator (synonyms, hyper/hyponyms, term proximity, frequencies); Text Clusterizer; Document tagging; Key senses extractor; Feature extractor; Database enrichment and mining.]
Any Questions?
Call Megaputer at
(812) 330-0110
or write
120 W Seventh Street, Suite 310
Bloomington, IN 47404 USA
[email protected]
Appendix A
TextAnalyst technology
details
Two aspects of text

Sequence of characters characterized
by patterns that represent information
recognized by humans

Structured sequence of lexical units
organized together according to
morphological and syntactic rules
(morphemes, auxiliary lexical units, syntactic
members, sentences, etc.)
Semantics of text



Humans rely on multimodal
associations for creating semantic
models
Standalone text - semantics is formal,
but still useful
Meaning of a concept - collection of
relations of this concept to other
concepts in the text (constructive definition)
Lexical vs. Grammatical


Lexical meaning of a word - determined by
the stem morpheme (word
combinations - chains of morphemes)
Grammatical meaning - determined by
morphemes (prefixes, endings, etc.) and
auxiliary semantic units (articles,
prepositions, etc.)

Grammatical chains - word sequences
with stem morphemes extracted - frames for contents
Semantic structure of texts



Single text - semantic analysis can be
performed, but is not sufficient: need a
knowledge base against which the text
can be analyzed
Analysis of a large number of texts
from diverse fields => Grammatical
structure of the language
Analysis of a large number of texts
from the field of interest =>
Knowledge Base
Grammatical + Lexical = Semantic



Grammatical dictionaries of
morphemes and auxiliary words of a
language: threshold transformation
applied to a NN trained on a large
corpus of texts from diverse fields
The trained “grammatical NN” acts as a filter.
The “lexical” NN is connected to its
output.
Combining elements from both NNs
yields a list of concepts for the Semantic
Network (after relational
renormalization)
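
The filter idea can be illustrated without a neural network. In the sketch below, a word that appears in at least a threshold share of texts from diverse fields is treated as grammatical, and only the remaining words of a field text are counted as lexical concepts. Document-frequency counting is a deliberate simplification of the threshold transformation applied to a trained NN; the threshold value, function names and sample texts are assumptions.

```python
from collections import Counter


def grammatical_dictionary(diverse_corpus, threshold=0.8):
    """Words present in at least `threshold` of the diverse-field documents."""
    if not diverse_corpus:
        return set()
    doc_freq = Counter()
    for doc in diverse_corpus:
        doc_freq.update(set(doc.lower().split()))
    total = len(diverse_corpus)
    return {word for word, n in doc_freq.items() if n / total >= threshold}


def lexical_concepts(field_text, grammar_words):
    """Count the words of a field text that pass the grammatical filter."""
    return Counter(
        word for word in field_text.lower().split() if word not in grammar_words
    )


# Hypothetical mini-corpus drawn from unrelated fields.
corpus = [
    "the price of oil",
    "the rules of chess",
    "the history of jazz",
    "a map of the city",
]
grammar = grammatical_dictionary(corpus)
concepts = lexical_concepts("the fusion of nuclei and the heat of fusion", grammar)
```

With only four short texts the dictionary is tiny and misses words like "and"; the slide's point is that a large, diverse corpus is what makes this filter reliable.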