Scalable Text Mining with Sparse Generative Models
Antti Puurula
PhD Thesis Presentation
University of Waikato, New Zealand, 8th June 2015
Introduction
◦ This thesis presents a framework for probabilistic text mining based on sparse generative models
◦ Models developed in the framework show state-of-the-art effectiveness in both text classification and retrieval tasks
◦ The proposed sparse inference for these models improves scalability, enabling text mining on very large-scale tasks
Major Contributions of the Thesis
◦ Formalizing multinomial modeling of text
◦ Smoothing as two-state Hidden Markov Models
◦ Fractional counts as probabilistic data
◦ Weighted factors as log-linear models
◦ Scalable inference on text
◦ Sparse inference using inverted indices for statistical models
◦ Tied Document Mixture, a model benefiting from sparse inference
◦ Extensive evaluation using a combined experimental setup for
classification and retrieval
Defining Text Mining
◦ “Knowledge Discovery in Textual databases” (KDT) [Feldman and Dagan,
1995]
◦ “Text Mining as Integration of Several Related Research Areas” [Grobelnik
et al., 2000]
◦ Definition used in this thesis:
◦ Text mining is an interdisciplinary field of research on the automatic
processing of large quantities of text data for valuable information
Related Fields and Application Domains
Volume of Text Mining Publications
References per year found for related
fields using academic search engines
Scale of Text Data
◦ Existing collections:
◦ Google Books, 30M books (2013)
◦ Twitter, 200M users, 400M messages per day (2013)
◦ WhatsApp, 430M users, 50B messages per day (2014)
◦ Available research collections:
◦ English Wikipedia, 4.5M articles (2014)
◦ Google n-grams, 5-grams estimated from 1T words (2007)
◦ Annotated English Gigaword, 4B words with metadata (2012)
◦ TREC KBA, 394M annotated documents for classification (2014)
Text Mining Methodology in a Nutshell
◦ Normalize and map documents into a structured representation, such as a vector of word counts (see the sketch after this list)
◦ Segment a problem into machine learning tasks
◦ Solve the tasks using algorithms, most commonly linear models
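A minimal sketch of the first step, mapping raw text to bag-of-words count vectors; the tokenization here is deliberately simple and only illustrative:

```python
import re
from collections import Counter

def to_word_counts(text):
    """Normalize a document and map it to a bag-of-words count vector."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, keep alphanumeric tokens
    return Counter(tokens)

# Toy usage
docs = ["Text mining at scale.", "Sparse generative models for text."]
vectors = [to_word_counts(d) for d in docs]
```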
Linear Models for Text Mining
Multi-class linear scoring function: score each label $m$ with its own weight vector and bias, $f(\mathbf{x}, m) = \mathbf{w}_m^\top \mathbf{x} + b_m$, and predict $\hat{m} = \arg\max_m f(\mathbf{x}, m)$
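A minimal sketch of this scorer with sparse feature and weight dictionaries (the names are illustrative):

```python
def linear_score(features, weights, bias):
    """w_m . x + b_m for one label, with sparse dicts for features and weights."""
    return bias + sum(value * weights.get(name, 0.0) for name, value in features.items())

def predict(features, weights_by_label, bias_by_label):
    """Multi-class prediction: argmax over the per-label linear scores."""
    return max(weights_by_label,
               key=lambda m: linear_score(features, weights_by_label[m], bias_by_label[m]))
```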
Multinomial Naive Bayes
◦ Bayes model with multinomials conditioned on label variables: $p(m \mid \mathbf{d}) \propto p(m) \prod_w p(w \mid m)^{d_w}$, where $d_w$ is the count of word $w$ in document $\mathbf{d}$
◦ Priors $p(m)$ are categorical, label-conditionals $p(w \mid m)$ are multinomial, and the normalizer $p(\mathbf{d})$ is constant
◦ Directed generative graphical model
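In log space this scoring rule is straightforward; a minimal sketch (variable names are illustrative, and the label-conditionals are assumed to be already smoothed so document words never get zero probability):

```python
def mnb_score(doc_counts, log_prior, log_cond):
    """log p(m) + sum_w d_w * log p(w|m) for one label.

    doc_counts: dict word -> count d_w
    log_prior:  log p(m)
    log_cond:   dict word -> log p(w|m), assumed smoothed
    """
    return log_prior + sum(count * log_cond[word] for word, count in doc_counts.items())

def mnb_classify(doc_counts, log_priors, log_conds):
    """Predict the label maximizing the MNB posterior score."""
    return max(log_priors, key=lambda m: mnb_score(doc_counts, log_priors[m], log_conds[m]))
```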
Formalizing Smoothing of Multinomials
◦ All smoothing methods for multinomials can be expressed as $p_s(w \mid m) = (1 - \alpha)\, p_u(w \mid m) + \alpha\, p_b(w)$, where $p_u(w \mid m)$ is an unsmoothed label-conditional model, $p_b(w)$ is the background model, and $\alpha$ is the smoothing weight
◦ Discounting of counts $c_{mw}$ by discounts $\delta_{mw}$ is applied to $p_u(w \mid m)$
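A minimal sketch of this formalization, assuming interpolation with an optional absolute discount (parameter values and names are illustrative):

```python
def smoothed_conditional(counts_m, counts_bg, alpha=0.1, discount=0.0):
    """Return p_s(w|m) = (1 - alpha) * p_u(w|m) + alpha * p_b(w).

    counts_m:  dict word -> count c_mw for label m (optionally discounted)
    counts_bg: dict word -> count pooled over all labels (background model)
    """
    total_m = sum(max(c - discount, 0.0) for c in counts_m.values())
    total_bg = sum(counts_bg.values())

    def p_s(word):
        p_u = max(counts_m.get(word, 0.0) - discount, 0.0) / total_m if total_m else 0.0
        p_b = counts_bg.get(word, 0.0) / total_bg
        return (1.0 - alpha) * p_u + alpha * p_b

    return p_s
```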
Two-State HMM Formalization of Smoothing
◦ Replace the multinomial with a 0th order categorical state-emission HMM with two hidden states
◦ Component $s{=}2$ is shared between the 2-state HMMs for each label
Two-State HMM Formalization of Smoothing (2)
◦ Label-conditionals can be rewritten: $p(w \mid m) = \sum_{s=1}^{2} p(s \mid m)\, p(w \mid s, m)$
◦ Choosing $p(s{=}1 \mid m) = 1 - \alpha$, $p(w \mid s{=}1, m) = p_u(w \mid m)$, and $p(w \mid s{=}2, m) = p_b(w)$ implements the smoothed multinomials $p_s(w \mid m)$
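Substituting these choices, with $p(s{=}2 \mid m) = \alpha$, makes the equivalence with the interpolation formula explicit:

```latex
\begin{align*}
p(w \mid m) &= p(s{=}1 \mid m)\, p(w \mid s{=}1, m) + p(s{=}2 \mid m)\, p(w \mid s{=}2, m) \\
            &= (1 - \alpha)\, p_u(w \mid m) + \alpha\, p_b(w) \\
            &= p_s(w \mid m)
\end{align*}
```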
Two-State HMM Formalization of Smoothing (3)
◦ Maximum likelihood estimation is difficult, due to a sum over hidden-state terms inside the log-likelihood
◦ Given a prior distribution over component assignments, expected log-likelihood estimation decouples into separate estimates for the two components
Formalizing Fractional Counts
◦ Fractional counts are undefined for categorical and multinomial models
◦ Formalization is possible with probabilistic data
◦ A weight sequence $\mathbf{v} = (v_1, \ldots, v_N)$ matching a word sequence $\mathbf{w} = (w_1, \ldots, w_N)$ can be interpreted as probabilities of the words occurring in the data
◦ Expected log-likelihoods and log-probabilities given expected counts reproduce the results from using fractional counts
Formalizing Fractional Counts (2)
◦ Estimation with expected log-likelihood, using the expected counts $\mathbb{E}[d_w] = \sum_{n : w_n = w} v_n$ in place of integer counts
Formalizing Fractional Counts (3)
◦ Inference with expected log-probability, weighting each word's log-probability by its expected count: $\sum_w \mathbb{E}[d_w] \log p(w \mid m)$
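A minimal sketch of both steps (the helper names are hypothetical): expected counts are the sums of per-occurrence weights, and inference weights each word's log-probability by its expected count:

```python
from collections import defaultdict

def expected_counts(words, weights):
    """E[d_w] = sum of the occurrence probabilities v_n of word w."""
    counts = defaultdict(float)
    for word, weight in zip(words, weights):
        counts[word] += weight
    return counts

def expected_log_probability(exp_counts, log_cond):
    """sum_w E[d_w] * log p(w|m), reproducing fractional-count inference."""
    return sum(c * log_cond[w] for w, c in exp_counts.items())
```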
Extending MNB with Scaled Factors
◦ MNB with scaled factors for label priors and document lengths
◦ The label prior and document length factors are raised to scaling weights and renormalized, giving a log-linear combination of the factors
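One way such a model might be written (a sketch only; the scaling weights $a_1$, $a_2$ and the length model $p(l_{\mathbf{d}} \mid m)$ are generic placeholders, not necessarily the thesis's exact parameterization):

```latex
p(m \mid \mathbf{d}) \;\propto\;
  \frac{p(m)^{a_1}}{\sum_{m'} p(m')^{a_1}}
  \cdot
  \frac{p(l_{\mathbf{d}} \mid m)^{a_2}}{\sum_{m'} p(l_{\mathbf{d}} \mid m')^{a_2}}
  \cdot
  \prod_{w} p_s(w \mid m)^{d_w}
```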
Sparse Inference for MNB
◦ Naive MNB posterior inference has complexity $O(M\, l_{\mathbf{d}})$: every label is scored against every distinct document word
◦ Sparse inference uses an inverted index with precomputed values: each document word contributes a background term shared by all labels, plus sparse label-specific corrections stored in the word's postings list
◦ This reduces the time complexity to roughly $O(l_{\mathbf{d}} + \sum_{w \in \mathbf{d}} |I_w|)$, where $M$ is the number of labels, $l_{\mathbf{d}}$ the number of distinct document words, and $|I_w|$ the length of the postings list for word $w$
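A minimal sketch of the idea, assuming Jelinek-Mercer smoothing with a label-independent weight alpha (the data structures and names here are illustrative, not the thesis implementation):

```python
import math
from collections import defaultdict

def build_index(label_counts, bg_probs, alpha):
    """Inverted index: word -> list of (label, correction), only for labels where
    p_u(w|m) > 0.  correction = log((1-a) p_u(w|m) + a p_b(w)) - log(a p_b(w))."""
    index = defaultdict(list)
    for m, counts in label_counts.items():
        total = sum(counts.values())
        for w, c in counts.items():
            p_u = c / total
            corr = math.log((1 - alpha) * p_u + alpha * bg_probs[w]) - math.log(alpha * bg_probs[w])
            index[w].append((m, corr))
    return index

def sparse_scores(doc_counts, log_priors, bg_probs, index, alpha):
    """Posterior scores that only touch nonzero index entries for document words.
    Assumes every document word has nonzero background probability."""
    background = sum(c * math.log(alpha * bg_probs[w]) for w, c in doc_counts.items())
    scores = {m: lp + background for m, lp in log_priors.items()}
    for w, c in doc_counts.items():
        for m, corr in index.get(w, ()):
            scores[m] += c * corr
    return scores
```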
Sparse Inference for Structured Models
◦ Extension to hierarchically smoothed sequence models
◦ Complexity is reduced from dense computation over all labels to sparse computation over the nonzero index entries
Sparse Inference for Structured Models (2)
◦ A hierarchically smoothed sequence model interpolates a node's model with its ancestors in the smoothing hierarchy
◦ With Jelinek-Mercer smoothing, marginalization over the hierarchy can be performed sparsely, reducing the marginalization complexity from dense over all nodes to sparse over the nonzero counts
Tied Document Mixture
◦ Replace the label-conditional in MNB with a mixture over hierarchically smoothed document models, with one mixture component per training document of the label
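A sketch of one such mixture in the notation above (the exact smoothing hierarchy used in the thesis may differ):

```latex
p(\mathbf{d} \mid m) \;=\; \frac{1}{|D_m|} \sum_{d' \in D_m} \prod_{w} p_s(w \mid d')^{d_w},
\qquad
p_s(w \mid d') \;=\; (1 - \alpha)\, p_u(w \mid d') + \alpha\, p_s(w \mid m)
```

Here $D_m$ is the set of training documents with label $m$, and each document model is smoothed towards its label model, which is itself smoothed towards the collection.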
Experiments
◦ Experiments on 16 text classification and 13 ranked retrieval datasets
◦ Development and evaluation segments used, both further split into
training and testing segments
◦ Classification evaluated with micro-averaged F-score, retrieval with MAP and NDCG
◦ Models optimized for the evaluation measures using a Gaussian random search on the development test set (a sketch of the search follows below)
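The Gaussian random search itself is simple; a minimal sketch (the objective, step size, and iteration budget are placeholders):

```python
import random

def gaussian_random_search(objective, init_params, sigma=0.1, iters=100, seed=0):
    """Maximize objective(params) by sampling Gaussian perturbations of the best point."""
    rng = random.Random(seed)
    best_params = dict(init_params)
    best_score = objective(best_params)
    for _ in range(iters):
        candidate = {k: v + rng.gauss(0.0, sigma) for k, v in best_params.items()}
        score = objective(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score
```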
[Figure: Power-law Discounting (vertical axis, 0 to 1) against Dirichlet Prior (horizontal axis, 0 to 0.5)]
Evaluated Modifications
◦ MNB, TDM, VSM, LR, and SVM models with modifications compared
◦ Generalized TF-IDF weighting used, with a parameter for length scaling and a parameter for IDF lifting
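As an illustration only, a TF-IDF weighting with explicit length-scaling and IDF-lifting knobs could look like the sketch below; the exact generalized form from the thesis is not reproduced here, and the parameters a and b are hypothetical:

```python
import math

def tf_idf_vector(doc_counts, doc_freq, num_docs, a=0.5, b=0.0):
    """TF-IDF with a length-scaling exponent a and an additive IDF lift b (illustrative).

    doc_counts: dict word -> term frequency in the document
    doc_freq:   dict word -> number of documents containing the word
    num_docs:   total number of documents in the collection
    """
    length = sum(doc_counts.values())
    weights = {}
    for w, tf in doc_counts.items():
        idf = math.log(num_docs / doc_freq[w]) + b  # lifted IDF
        weights[w] = (tf / length ** a) * idf       # length-scaled TF
    return weights
```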
Scalability Experiments
◦ Large English Wikipedia dataset for multi-label classification, segmented
into 2.34M training documents and 23.6k test documents
◦ Pruned by features (10 to 100000), documents (10 to 100000) and
labelsets (1 to 100000) into smaller sets
◦ Scalability of naive vs. sparse inference evaluated on MNB and TDM
◦ Maximum of 4 hours of computing time allowed for each condition
Summary of Experiment Results
◦ Effectiveness improvements to MNB:
◦ Choice of smoothing – small effect
◦ Feature weighting and scaled factors – large effect
◦ Tied Document Mixture – very large effect
◦ Outperformed BM25 for ranking, and came close to highly optimized SVMs for classification
◦ Scalability from sparse inference:
◦ 10× inference time reduction in the largest completed case
Conclusion
◦ Modified Bayes models are strong models for text mining tasks: sentiment
analysis, spam classification, document categorization, ranked retrieval, …
◦ Sparse inference enables scalability for new types of tasks and models
◦ Possible future applications of the presented framework
◦ Text clustering
◦ Text regression
◦ N-gram language modeling
◦ Topic models
Conclusion (2)
◦ Thesis statement:
◦ “Generative models of text combined with inference using inverted indices
provide sparse generative models for text mining that are both versatile
and scalable, providing state-of-the-art effectiveness and high scalability
for various text mining tasks.”
◦ Truisms in theory that should be reconsidered:
◦ Naive Bayes as the “punching bag of machine learning”
◦ “the curse of dimensionality” and “… is optimal time complexity”