Scalable Text Classification with Sparse Generative Modeling
Antti Puurula, University of Waikato

Sparse Computing for Big Data
• "Big Data" – machine learning for processing vast datasets
• Current solution: parallel computing – processing more data is as expensive, or more so
• Alternative solution: sparse computing – scalable solutions, less expensive

Sparse Representation
• Example: document vector
  – Dense: word count vector w = [w1, …, wN]
    • w = [0, 14, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0]
  – Sparse: vectors [v, c] of indices v and nonzero counts c
    • v = [2, 10, 14, 17], c = [14, 2, 1, 3]
• Complexity: |w| vs. s(w), the number of nonzeros

Sparse Inference with MNB
• Multinomial Naive Bayes
• Input word vector w, output label 1 ≤ m ≤ M
• Sparse representation for the parameters pm(n)
  – Jelinek-Mercer interpolation: αps(n) + (1−α)pum(n)
  – Estimation: represent pum with a hashtable
  – Inference: represent pum with an inverted index

Sparse Inference with MNB
• Dense representation: pm(n)
  – [Figure: dense parameter matrix pm(n), M classes × N features]
  – Time complexity: O(s(w) M)

Sparse Inference with MNB
• Sparse representation: αps(n) + (1−α)pum(n)
  – [Figure: shared smoothing vector ps(n) plus a sparse class-conditional matrix pum(n), most entries zero]
  – Time complexity: O(s(w) + Σm,n: pum(n)>0 1) – a code sketch follows below
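To make the inverted-index idea concrete, here is a minimal Python sketch of sparse MNB scoring in log space. It is an illustration of the decomposition described on the slides, not the SGM toolkit's actual code: the data structures `postings`, `p_s` and `log_prior`, and the function name itself, are assumptions made for the example.

```python
import math
from collections import defaultdict

def sparse_mnb_scores(doc, log_prior, p_s, postings, alpha):
    """Score all classes for one sparse document.

    doc:       sparse document, a list of (word_id, count) pairs
    log_prior: dict class_id -> log p(m)
    p_s:       dict word_id -> smoothing (background) probability ps(n)
    postings:  inverted index, dict word_id -> list of (class_id, pum_n)
               pairs holding only the nonzero unsmoothed conditionals pum(n)
    alpha:     Jelinek-Mercer interpolation weight
    """
    # Shared background term: identical for every class, costs O(s(w)).
    background = sum(c * math.log(alpha * p_s[n]) for n, c in doc)

    # Class-specific corrections: only classes that occur in the posting
    # lists of the document's words deviate from the background score,
    # so this loop costs one update per matched posting.
    correction = defaultdict(float)
    for n, c in doc:
        log_bg = math.log(alpha * p_s[n])
        for m, pu in postings.get(n, ()):
            correction[m] += c * (
                math.log(alpha * p_s[n] + (1.0 - alpha) * pu) - log_bg)

    # Combine (building the full score dict is an extra O(M) pass).
    return {m: lp + background + correction.get(m, 0.0)
            for m, lp in log_prior.items()}
```

Because each correction is non-negative, an argmax only needs to inspect the corrected classes plus the class with the largest prior, so the decision itself stays within the O(s(w) + matched postings) budget.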
Sparse Inference with MNB
• [Figures over three slides: the sparse score illustrated on the example matrices – the shared smoothed term costs O(s(w)), the class-specific term costs O(Σm,n: pum(n)>0 1), one update per nonzero posting]

Multi-label Classifiers
• Multi-label classification
  – binary label vector l = [l1, …, lM] instead of a single label m
  – 2^M possible label vectors, not directly solvable
• Solved with multi-label extensions:
  – Binary Relevance (Godbole & Sarawagi 2004)
  – Label Powerset (Boutell et al. 2004)
  – Multi-label Mixture Model (McCallum 1999)

Multi-label Classifiers
• Feature normalization: TF-IDF
  – length normalization by s(w) ("L0-norm")
  – TF log-transform of counts, corrects "burstiness"
  – IDF transform, unsmoothed Croft-Harper IDF

Multi-label Classifiers
• Classifier modification with meta-parameters a
  – a1: Jelinek-Mercer smoothing of the conditionals pm(n)
  – a2: count pruning in training with a threshold
  – a3: prior scaling, replacing p(l) by p(l)^a3
  – a4: class pruning in classification with a threshold

Multi-label Classifiers
• Direct optimization of a with random search
  – target f(a): development set F-score
• Parallel random search (a rough sketch follows below)
  – iteratively sample points around the current max f(a)
  – generate points by dynamically adapted steps
  – sample f(a) in I iterations of J parallel points
    • I = 30, J = 50 → 1500 configurations of a sampled
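The search procedure on the slide can be sketched as a simple adaptive hill-climber. Everything concrete below – the bound clipping, the grow/shrink step adaptation, and the function names – is an illustrative assumption rather than the toolkit's actual implementation; f(a) stands for a full train-and-evaluate run that returns development-set F-score.

```python
import random

def random_search(f, a0, lower, upper, iters=30, points=50,
                  step=0.5, grow=1.1, shrink=0.9, seed=0):
    """Random search for meta-parameters a = [a1, a2, a3, a4], maximizing f(a)."""
    rng = random.Random(seed)
    best_a, best_f = list(a0), f(a0)
    for _ in range(iters):
        # One iteration proposes J candidates around the current best point.
        # In a parallel implementation these would be evaluated concurrently.
        candidates = []
        for _ in range(points):
            cand = [min(upper[d], max(lower[d],
                        best_a[d] + rng.uniform(-step, step) * (upper[d] - lower[d])))
                    for d in range(len(best_a))]
            candidates.append((f(cand), cand))
        top_f, top_a = max(candidates, key=lambda x: x[0])
        if top_f > best_f:
            best_f, best_a = top_f, top_a
            step *= grow      # widen the steps while improving
        else:
            step *= shrink    # otherwise narrow the search around the best point
    return best_a, best_f
```

With iters=30 and points=50 this evaluates the 1500 configurations mentioned on the slide; in practice each batch of 50 candidates would be dispatched to parallel workers.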
Experiment Datasets

Experiment Results

Conclusion
• New idea: sparse inference
  – reduces the time complexity of probabilistic inference
  – demonstrated for multi-label classification
  – applicable with different models (KNN, SVM, …) and uses (clustering, ranking, …)
• Code available, with a Weka wrapper:
  – Weka package manager: SparseGenerativeModel
  – http://sourceforge.net/projects/sgmweka/

Multi-label classification
• Binary Relevance (Godbole & Sarawagi 2004)
  – each label decision is an independent binary problem
  – positive multinomial vs. negative multinomial
    • negatives approximated with a background multinomial
  – threshold parameter for improved accuracy
  + fast, simple, easy to implement
  - ignores label correlations, poor performance

Multi-label classification
• Label Powerset (Boutell et al. 2004)
  – each labelset seen in training is mapped to a class
  – hashtable for converting classes back to labelsets
  + models label correlations, good performance
  - takes memory, cannot classify new labelsets

Multi-label classification
• Multi-label Mixture Model
  – mixture for the prior
  – classification with greedy search (McCallum 1999)
    • complexity: q times the MNB complexity, where q is the maximum labelset size s(l) seen in training
  + models labelsets, generalizes to new labelsets
  - assumes a uniform linear decomposition

Multi-label classification
• Multi-label Mixture Model
  – related models: McCallum 1999, Ueda 2002
  – like Label Powerset, but labelset conditionals decompose into a mixture of label conditionals:
    • pl(n) = 1/s(l) Σm lm pm(n)

Classifier optimization
• Model modifications
  – 1) Jelinek-Mercer smoothing of the conditionals: pm(n) = αps(n) + (1−α)pum(n)
  – 2) Count pruning: max 8M conditional counts; on each count update, online pruning with running IDF estimates and meta-parameter a2

Classifier optimization
• Model modifications
  – 3) Prior scaling: replace p(l) by p(l)^a3
    • equivalent to LM scaling in speech recognition
  – 4) Pruning in classification: sort classes by rank and stop classification on a threshold a4
    • sort by:
    • prune:

Sparse Inference with MNB
• Multinomial Naive Bayes
  – Bayes: p(w,m) = p(m) pm(w)
  – Naive: p(w,m) = p(m) Πn pm(wn, n)
  – Multinomial: p(w,m) ∝ p(m) Πn pm(n)^wn
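One step the slide leaves implicit is why s(w) governs the cost of the multinomial score. Taking logs of the last line makes it explicit that only the nonzero counts contribute to the sum:

```latex
\log p(w, m) \;=\; \log p(m) \;+\; \sum_{n:\, w_n > 0} w_n \log p_m(n) \;+\; C(w)
```

Here C(w) is the log multinomial coefficient, which does not depend on m and can be dropped when ranking classes. Substituting the Jelinek-Mercer interpolated conditional pm(n) = αps(n) + (1−α)pum(n) into this sum is what the inverted-index decomposition on the earlier slides rearranges into an O(s(w)) shared term plus per-posting corrections.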