Overview of Text Data Mining (CS591-CXZ Text Data Mining Seminar) ChengXiang Zhai


Overview of Text Data Mining
(CS591-CXZ Text Data Mining Seminar)
Sept. 1, 2004
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Most Data are Unstructured (Text) or Semi-Structured…
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
• Customer complaint letters
• Contracts
• Transcripts of phone calls with customers
• Technical documents
•…
The more data we have, the more likely we can find patterns in data
(Adapted from J. Dorre et al. “Text Mining: Finding Nuggets in Mountains of Textual Data”)
Text Management Applications
• Access: Select information
• Mining: Create Knowledge
• Organization: Add Structure/Annotations
Elements of Text Info Management Technologies
[Diagram of technology components. Labels: Retrieval Applications; Mining Applications; Information Access; Information Organization; Knowledge Acquisition; Search; Filtering; Summarization; Visualization; Categorization; Clustering; Extraction; Mining; Natural Language Content Analysis; Text]
What Is Text Mining?
“The objective of Text Mining is to exploit
information contained in textual documents in
various ways, including …discovery of patterns
and trends in data, associations among entities,
predictive rules, etc.” (Grobelnik et al., 2001)
“Another way to view text data mining is as a
process of exploratory data analysis that leads to
heretofore unknown information, or to answers
for questions for which the answer is not
currently known.” (Hearst, 1999)
(Slide from Rebecca Hwa’s “Intro to Text Mining”)
Text Mining vs. NLP, IR, DM…
• How does it relate to data mining in general?
• How does it relate to computational linguistics?
• How does it relate to information retrieval?
[Table: rows are the data type; columns are Finding Patterns and Finding “Nuggets” (Novel, Non-Novel)]
• Non-textual data: General data-mining (Finding Patterns); Exploratory Data Analysis (Novel “Nuggets”); Database queries (Non-Novel “Nuggets”)
• Textual data: Computational Linguistics (Finding Patterns); Information Retrieval (Non-Novel “Nuggets”)
(Adapted from Rebecca Hwa’s “Intro to Text Mining”)
Challenges in Text Mining
• Data collection is “free text”
– Data is not well-organized
• Semi-structured or unstructured
– Natural language text contains ambiguities on many levels
• Lexical, syntactic, semantic, and pragmatic
– Learning techniques for processing text typically need annotated training examples
• Consider bootstrapping techniques
• What to mine?
(Adapted from Rebecca Hwa’s “Intro to Text Mining”)
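One way to read the "bootstrapping" suggestion above is self-training: start from a few annotated examples, label the unannotated text the current model is most confident about, and retrain. The sketch below is only an illustration under that assumption; the slides do not name a specific method. The function names, the tiny Naive Bayes scorer (add-one smoothing, uniform class prior), and the `min_margin` threshold are all hypothetical choices.

```python
from collections import Counter
import math


def tokenize(text):
    return text.lower().split()


def train(labeled):
    """Collect per-label word counts from (text, label) pairs."""
    counts = {}
    for text, label in labeled:
        counts.setdefault(label, Counter()).update(tokenize(text))
    return counts


def predict(counts, text):
    """Return (best_label, margin); margin is the log-score gap to the runner-up.

    Assumes at least two labels and a uniform class prior.
    """
    vocab = set().union(*counts.values())
    scores = {}
    for label, c in counts.items():
        total = sum(c.values()) + len(vocab)  # add-one smoothing
        scores[label] = sum(math.log((c[w] + 1) / total) for w in tokenize(text))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[0], scores[ranked[0]] - scores[ranked[1]]


def self_train(seed, unlabeled, min_margin=0.5, max_rounds=5):
    """Grow the labeled set with confident predictions, then retrain."""
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(max_rounds):
        counts = train(labeled)
        confident, rest = [], []
        for text in pool:
            label, margin = predict(counts, text)
            (confident if margin >= min_margin else rest).append((text, label))
        if not confident:  # nothing new to learn from; stop early
            break
        labeled.extend(confident)
        pool = [text for text, _ in rest]
    return train(labeled), labeled
```

A caveat on the design: self-training can reinforce its own early mistakes, so the confidence threshold and the quality of the seed examples matter more than the choice of classifier.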
Applications of Text Mining
• Direct applications
– Domain-dependent (Bioinformatics, Business Intelligence, etc.)
– Data-dependent (WWW, literature, email, customer reviews, etc.)
• Indirect applications
– Assist information access
– Assist information organization
Text Mining for Hypertext Creation
[Diagram: a concept map connecting a general topic to Subtopic 1, ..., Subtopic i, ..., Subtopic M, and hypertext connecting Doc 1, Doc 2, ..., Doc N]
Type of Links
• Term-Term Links
• Doc-Term Links
• Term-Doc Links
• Doc-Doc Links
[Diagram: the same layout of a general topic, Subtopic 1, ..., Subtopic i, ..., Subtopic M, and Doc 1, Doc 2, ..., Doc N]
Examples of Linkages in Text
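The slides name the link types (term-term, doc-term, term-doc, doc-doc) but do not say how the links are computed. As a rough sketch under assumed definitions, term-doc links can come from an inverted index, term-term links from co-occurrence counts, and doc-doc links from cosine similarity of word-count vectors. All function names and thresholds below are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def term_doc_links(docs):
    """Map each term to the set of doc ids containing it.

    Doc-term links are the same relation read in the other direction.
    """
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def term_term_links(index, min_shared=2):
    """Link two terms that co-occur in at least min_shared documents."""
    links = {}
    for a, b in combinations(sorted(index), 2):
        shared = len(index[a] & index[b])
        if shared >= min_shared:
            links[(a, b)] = shared
    return links


def doc_doc_links(docs, min_cosine=0.3):
    """Link two documents whose word-count vectors have cosine >= min_cosine."""
    vecs = {d: Counter(t.lower().split()) for d, t in docs.items()}
    norms = {d: math.sqrt(sum(v * v for v in c.values())) for d, c in vecs.items()}
    links = {}
    for a, b in combinations(sorted(vecs), 2):
        dot = sum(vecs[a][w] * vecs[b][w] for w in vecs[a])
        denom = norms[a] * norms[b]
        cosine = dot / denom if denom else 0.0
        if cosine >= min_cosine:
            links[(a, b)] = cosine
    return links
```

On a real collection one would normally remove stop words and weight terms (for example with TF-IDF) before linking; raw counts are used here only to keep the sketch short.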
Related Areas/Conferences
• Natural Language Processing (NLP): ACL, EMNLP, COLING
• Information Retrieval: SIGIR, CIKM
• Machine Learning: ICML, NIPS, UAI
• Data Mining & Knowledge Discovery: SIGKDD
• World Wide Web: WWW
• Bioinformatics: ISMB, PSB
Candidate Papers – SIGKDD 04
• Probabilistic Author-Topic Models for Information Discovery
Authors: Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi,
Thomas Griffiths
• Mining Reference Tables for Automatic Text Segmentation
Authors: Eugene Agichtein, Venkatesh Ganti
• Exploiting Dictionaries in Named Entity Extraction:
Combining Semi-Markov Extraction Processes and Data
Integration Methods
Authors: William Cohen, Sunita Sarawagi
• Mining and Summarizing Customer Reviews
Authors: Minqing Hu, Bing Liu
• Cluster-based Concept Invention for Statistical Relational
Learning
Authors: Alexandrin Popescul, Lyle Ungar
Candidate Papers – WWW 04
• Unsupervised Learning of Soft Patterns for Generating Definitions from Online News (page 90)
H. Cui, M.-Y. Kan, T.-S. Chua, National University of Singapore, WWW 2004
• Web-Scale Information Extraction in KnowItAll (Preliminary Results) (page 100)
O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates, University of Washington, WWW 2004
• LiveClassifier: Creating Hierarchical Text Classifiers through Web Corpora (page 184)
C.-C. Huang, S.-L. Chuang, Academia Sinica
L.-F. Chien, Academia Sinica, National Taiwan University, WWW 2004
• Towards the Self-Annotating Web (page 462)
P. Cimiano, S. Handschuh, University of Karlsruhe
S. Staab, University of Karlsruhe, Ontoprise GmbH
• Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty (page 482)
E. Gabrilovich, Technion, Microsoft Research
S. Dumais, E. Horvitz, Microsoft Research
• A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results (page 658)
K. Kummamuru, R. Lotlikar, S. Roy, IBM India Research Lab
K. Singal, IIT-Guwahati
R. Krishnapuram, IBM India Research Lab, WWW 2004
Candidate Papers – PSB & ISMB 03/04
• Biological Nomenclatures: A Source of Lexical Knowledge and Ambiguity, O. Tuason, L. Chen, H. Liu, J.A. Blake, and C. Friedman; Pacific Symposium on Biocomputing 9:238-249 (2004).
• Playing Biology's Name Game: Identifying Protein Names in Scientific Text, D. Hanisch, J. Fluck, HT. Mevissen, R. Zimmer; Pacific Symposium on Biocomputing 8:403-414 (2003).
• Mining Terminological Knowledge in Large Biomedical Corpora, H. Liu, C. Friedman; Pacific Symposium on Biocomputing 8:415-426 (2003).
• A Biological Named Entity Recognizer, M. Narayanaswamy, K. E. Ravikumar, K. Vijay-Shanker; Pacific Symposium on Biocomputing 8:
• A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, A.S. Schwartz, M.A. Hearst; Pacific Symposium on Biocomputing 8:451-462 (2003).
• Evaluation of Text Data Mining for Database Curation: Lessons Learned from the KDD Challenge Cup, Alexander Yeh, Lynette Hirschman, Alexander Morgan; ISMB 2003.
• Extracting Synonymous Gene and Protein Terms from Biological Literature, Hong Yu and Eugene Agichtein; ISMB 2003.
• Mining MEDLINE for Implicit Links between Dietary Substances and Diseases, Padmini Srinivasan (University of Iowa), Bisharah Libbus (National Library of Medicine); ISMB 2004.
• Protein Names Precisely Peeled Off Free Text, Sven Mika (Columbia University), Burkhard Rost (CUBIC/C2B2/NESG, Dept Biochemistry and Molecular Biophysics, Columbia University); ISMB 2004.