Scalable Text Mining with Sparse Generative Models
Antti Puurula, PhD Thesis Presentation
University of Waikato, New Zealand, 8th June 2015

Introduction
◦ This thesis presents a framework for probabilistic text mining based on sparse generative models
◦ Models developed in the framework show state-of-the-art effectiveness in both text classification and retrieval tasks
◦ The proposed sparse inference for these models improves scalability, enabling text mining for very large-scale tasks

Major Contributions of the Thesis
◦ Formalizing multinomial modeling of text
◦ Smoothing as two-state Hidden Markov Models
◦ Fractional counts as probabilistic data
◦ Weighted factors as log-linear models
◦ Scalable inference on text
◦ Sparse inference using inverted indices for statistical models
◦ Tied Document Mixture, a model benefiting from sparse inference
◦ Extensive evaluation using a combined experimental setup for classification and retrieval

Defining Text Mining
◦ “Knowledge Discovery in Textual databases” (KDT) [Feldman and Dagan, 1995]
◦ “Text Mining as Integration of Several Related Research Areas” [Grobelnik et al., 2000]
◦ Definition used in this thesis:
◦ Text mining is an interdisciplinary field of research on the automatic processing of large quantities of text data for valuable information

Related Fields and Application Domains
[Figure]

Volume of Text Mining Publications
◦ References per year found for related fields using academic search engines
[Figure]

Scale of Text Data
◦ Existing collections:
◦ Google Books, 30M books (2013)
◦ Twitter, 200M users, 400M messages per day (2013)
◦ WhatsApp, 430M users, 50B messages per day (2014)
◦ Available research collections:
◦ English Wikipedia, 4.5M articles (2014)
◦ Google n-grams, 5-grams estimated from 1T words (2007)
◦ Annotated English Gigaword, 4B words with metadata (2012)
◦ TREC KBA, 394M annotated documents for classification (2014)

Text Mining Methodology in a Nutshell
◦ Normalize and map documents into a structured representation, such as a vector of word counts
◦ Segment a problem into machine learning tasks
◦ Solve the tasks using algorithms, most commonly linear models

Linear Models for Text Mining
◦ Multi-class linear scoring function: $\hat{y} = \arg\max_{l} \mathbf{w}_l^{\top} \mathbf{x}$

Multinomial Naive Bayes
◦ Bayes model with multinomials conditioned on label variables: $p(l \mid d) \propto p(l) \prod_{w} p(w \mid l)^{c_w}$, where $c_w$ is the count of word $w$ in document $d$
◦ Priors are categorical, label-conditionals are multinomial, and the normalizer is constant
◦ Directed generative graphical model

Formalizing Smoothing of Multinomials
◦ All smoothing methods for multinomials can be expressed as $p_s(w \mid l) = (1-\alpha)\,p_u(w \mid l) + \alpha\,p_b(w)$, where $p_u$ is an unsmoothed label-conditional model, $p_b$ is the background model, and $\alpha$ is the smoothing weight
◦ Discounting of counts is applied to the unsmoothed model $p_u$

Two-State HMM Formalization of Smoothing
◦ Replace the multinomial with a 0th-order categorical state-emission HMM with M=2 hidden states: $p(w \mid l) = \sum_{m=1}^{2} p(m \mid l)\, p(w \mid m, l)$
◦ Component m=2 is shared between the 2-state HMMs for each label

Two-State HMM Formalization of Smoothing (2)
◦ Label-conditionals can be rewritten as the two-component mixture above
◦ Choosing $p(m{=}2 \mid l) = \alpha$, $p(w \mid m{=}1, l) = p_u(w \mid l)$, and $p(w \mid m{=}2) = p_b(w)$ implements the smoothed multinomials

Two-State HMM Formalization of Smoothing (3)
◦ Maximum likelihood estimation is difficult, due to a sum over hidden component assignments
◦ Given a prior distribution over component assignments, expected log-likelihood estimation decouples
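To make the two-state formalization above concrete, here is a minimal sketch, not from the thesis, in which the array names and the fixed weight alpha are illustrative assumptions. It checks that a two-component mixture with a shared background component reproduces interpolation smoothing of a label-conditional multinomial:

```python
import numpy as np

# Illustrative sketch: smoothing a label-conditional multinomial as a
# two-component mixture with a shared background component.
# Names (label_counts, background_counts, alpha) are assumptions, not thesis code.

def smoothed_multinomial(label_counts, background_counts, alpha=0.1):
    """Interpolation smoothing: (1 - alpha) * p_u(w|l) + alpha * p_b(w)."""
    p_u = label_counts / label_counts.sum()            # unsmoothed label-conditional
    p_b = background_counts / background_counts.sum()  # shared background model
    return (1.0 - alpha) * p_u + alpha * p_b

def two_state_mixture(label_counts, background_counts, alpha=0.1):
    """Equivalent mixture view: state 1 emits from the label model,
    the shared state 2 emits from the background model."""
    weights = np.array([1.0 - alpha, alpha])           # p(m=1|l), p(m=2|l)
    emissions = np.vstack([
        label_counts / label_counts.sum(),
        background_counts / background_counts.sum(),
    ])
    return weights @ emissions                          # marginalize over states

if __name__ == "__main__":
    label = np.array([5.0, 3.0, 0.0, 2.0])
    background = np.array([10.0, 10.0, 5.0, 25.0])
    assert np.allclose(smoothed_multinomial(label, background),
                       two_state_mixture(label, background))
```

The same identity holds for any smoothing method that can be written in the interpolated form above.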
Formalizing Fractional Counts
◦ Fractional counts are undefined for categorical and multinomial models
◦ A formalization is possible with probabilistic data
◦ A weight sequence matching a word sequence can be interpreted as probabilities of the words occurring in the data
◦ Expected log-likelihoods and log-probabilities given expected counts reproduce the results from using fractional counts

Formalizing Fractional Counts (2)
◦ Estimation with expected log-likelihood

Formalizing Fractional Counts (3)
◦ Inference with expected log-probability

Extending MNB with Scaled Factors
◦ MNB with scaled factors for label priors and document lengths
◦ The label prior and document length factors are scaled and renormalized

Sparse Inference for MNB
◦ Naive MNB posterior inference has complexity proportional to the number of labels times the document length
◦ Sparse inference uses an inverted index with precomputed values
◦ Time complexity is reduced to depend on the postings of the document's words in the index

Sparse Inference for Structured Models
◦ Extension to hierarchically smoothed sequence models
◦ Complexity of inference is similarly reduced

Sparse Inference for Structured Models (2)
◦ A hierarchically smoothed sequence model
◦ With Jelinek-Mercer smoothing, sparse marginalization is possible
◦ Marginalization complexity is reduced compared to naive inference

Tied Document Mixture
◦ Replace the label-conditional in MNB with a mixture over hierarchically smoothed document models

Experiments
◦ Experiments on 16 text classification and 13 ranked retrieval datasets
◦ Development and evaluation segments used, both further split into training and testing segments
◦ Classification evaluated with Micro-Fscore, retrieval with MAP and NDCG
◦ Models optimized for the evaluation measures using a Gaussian random search on the development test set

[Figure: Power-law Discounting vs. Dirichlet Prior]

Evaluated Modifications
◦ MNB, TDM, VSM, LR, and SVM models with modifications compared
◦ Generalized TF-IDF used, with parameters for length scaling and IDF lifting

Scalability Experiments
◦ Large English Wikipedia dataset for multi-label classification, segmented into 2.34M training documents and 23.6k test documents
◦ Pruned by features (10 to 100000), documents (10 to 100000), and labelsets (1 to 100000) into smaller sets
◦ Scalability of naive vs. sparse inference evaluated on MNB and TDM
◦ Maximum of 4 hours of computing time allowed for each condition

Summary of Experiment Results
◦ Effectiveness improvements to MNB:
◦ Choice of smoothing: small effect
◦ Feature weighting and scaled factors: large effect
◦ Tied Document Mixture: very large effect
◦ Outperformed BM25 for ranking, close to a highly optimized SVM for classification
◦ Scalability from sparse inference:
◦ 10× reduction in inference time in the largest completed case

Conclusion
◦ Modified Bayes models are strong models for text mining tasks: sentiment analysis, spam classification, document categorization, ranked retrieval, …
◦ Sparse inference enables scalability for new types of tasks and models
◦ Possible future applications of the presented framework:
◦ Text clustering
◦ Text regression
◦ N-gram language modeling
◦ Topic models

Conclusion (2)
◦ Thesis statement:
◦ “Generative models of text combined with inference using inverted indices provide sparse generative models for text mining that are both versatile and scalable, providing state-of-the-art effectiveness and high scalability for various text mining tasks.”
◦ Truisms in theory that should be reconsidered:
◦ Naive Bayes as the “punching bag of machine learning”
◦ “the curse of dimensionality” and “… is optimal time complexity”
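As a concrete illustration of the sparse inference summarized in the thesis statement, the following is a minimal sketch assuming an interpolation-smoothed Naive Bayes model and a simple dict-based inverted index; the function and variable names are illustrative, not the thesis implementation. The idea is to decompose each log-probability into a dense background term plus sparse per-label corrections, so scoring a document iterates only over the postings of its words:

```python
import math
from collections import defaultdict

# Sketch of inverted-index ("sparse") scoring for a smoothed Naive Bayes model.
# Assumed decomposition: log p(w|l) = log(alpha * p_b(w)) + correction(w, l),
# where the correction is non-zero only for (word, label) pairs seen in training.
# All document words are assumed to be in the background vocabulary.

def build_index(label_word_probs, background_probs, alpha):
    """Precompute sparse corrections as an inverted index: word -> [(label, value)]."""
    index = defaultdict(list)
    for label, word_probs in label_word_probs.items():
        for word, p_u in word_probs.items():
            smoothed = (1 - alpha) * p_u + alpha * background_probs[word]
            correction = math.log(smoothed) - math.log(alpha * background_probs[word])
            index[word].append((label, correction))
    return index

def score_document(doc_counts, index, background_probs, alpha, priors):
    """Score all labels while touching only the postings of the document's words."""
    base = sum(c * math.log(alpha * background_probs[w]) for w, c in doc_counts.items())
    scores = {label: math.log(p) + base for label, p in priors.items()}
    for word, count in doc_counts.items():
        for label, correction in index.get(word, ()):
            scores[label] += count * correction
    return scores

if __name__ == "__main__":
    background = {"the": 0.5, "ball": 0.25, "vote": 0.25}
    labels = {"sports": {"the": 0.5, "ball": 0.5}, "politics": {"the": 0.5, "vote": 0.5}}
    priors = {"sports": 0.5, "politics": 0.5}
    idx = build_index(labels, background, alpha=0.1)
    print(score_document({"the": 1, "ball": 2}, idx, background, 0.1, priors))
```

Labels whose training data never contained a document word contribute only through the precomputed background term, which illustrates the kind of saving that sparse inference with inverted indices provides.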