Integrated Instance- and Class-based Generative Modeling for Text Classification
Antti Puurula (University of Waikato), Sung-Hyon Myaeng (KAIST)
5/12/2013, Australasian Document Computing Symposium
Instance vs. Class-based Text Classification
• Class-based learning
  • Multinomial Naive Bayes, Logistic Regression, Support Vector Machines, …
  • Pros: compact models, efficient inference, accurate with text data
  • Cons: document-level information discarded
• Instance-based learning
  • K-Nearest Neighbors, Kernel Density Classifiers, …
  • Pros: document-level information preserved, efficient learning
  • Cons: data sparsity reduces accuracy
Instance vs. Class-based Text Classification 2
• Proposal: Tied Document Mixture
  • integrated instance- and class-based model
  • retains benefits from both types of modeling
  • exact linear time algorithms for estimation and inference
• Main ideas:
  • replace Multinomial class-conditional in MNB with a mixture over documents
  • smooth document models hierarchically with class and background models
Multinomial Naive Bayes
• Standard generative model for text classification
• Result of simple generative assumptions
  • Bayes
  • Naive
  • Multinomial
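The three assumptions can be illustrated with a minimal sketch: Bayes rule combines a class prior with a class-conditional likelihood, the naive assumption treats words as independent given the class, and the multinomial assumption models word counts. The function names, the toy data, and the Laplace smoothing constant `alpha` are assumptions made for illustration, not details taken from the slides.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log class priors and Laplace-smoothed log word probabilities."""
    vocab = {w for d in docs for w in d}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, l in zip(docs, labels):
        word_counts[l].update(d)
    log_prior = {l: math.log(c / len(docs)) for l, c in class_docs.items()}
    log_cond = {}
    for l in class_docs:
        total = sum(word_counts[l].values()) + alpha * len(vocab)
        log_cond[l] = {w: math.log((word_counts[l][w] + alpha) / total) for w in vocab}
    return log_prior, log_cond

def classify_mnb(doc, log_prior, log_cond):
    """Pick the class maximizing log p(l) + sum_n w_n log p_l(n); unseen words are skipped."""
    def score(l):
        return log_prior[l] + sum(log_cond[l][w] for w in doc if w in log_cond[l])
    return max(log_prior, key=score)

# Toy data (hypothetical).
docs = [["cheap", "pills", "cheap"], ["meeting", "agenda"],
        ["cheap", "offer"], ["project", "meeting"]]
labels = ["spam", "ham", "spam", "ham"]
log_prior, log_cond = train_mnb(docs, labels)
print(classify_mnb(["cheap", "pills"], log_prior, log_cond))
```

Working in log space avoids underflow when documents are long.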
Multinomial Naive Bayes 2
Tied Document Mixture
• Replace the Multinomial in MNB by a mixture over all documents of the class: $p(\mathbf{w}|l) = \frac{1}{M_l} \sum_{m} p(\mathbf{w}|m)$
• Document models are smoothed hierarchically
• Class models are estimated by averaging the documents
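A dense, unoptimized sketch of this likelihood may help make the construction concrete. It assumes that each document model is a relative-frequency estimate, that the weight `a1` goes to the class model and `a2` to the background model (the remainder to the document itself), and that every query word has a nonzero background probability. All names and values are illustrative.

```python
import math
from collections import Counter

def tdm_log_likelihood(doc, class_docs, background, a1=0.3, a2=0.1):
    """log p(w|l) for one class: uniform mixture over its training documents.

    a1 and a2 are assumed weights for the class and background models;
    the remaining 1 - a1 - a2 goes to the document's own model.
    """
    query = Counter(doc)
    # Unsmoothed document models as relative frequencies.
    doc_models = []
    for d in class_docs:
        c = Counter(d)
        n = sum(c.values())
        doc_models.append({w: k / n for w, k in c.items()})
    # Class model: average of the document models.
    class_model = Counter()
    for m in doc_models:
        for w, p in m.items():
            class_model[w] += p / len(doc_models)
    log_terms = []
    for m in doc_models:
        lp = 0.0
        for w, k in query.items():
            p = ((1 - a1 - a2) * m.get(w, 0.0)
                 + a1 * class_model.get(w, 0.0)
                 + a2 * background[w])
            lp += k * math.log(p)
        log_terms.append(lp)
    # log of (1/M_l) * sum_m p(w|m), computed stably via log-sum-exp.
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms)) - math.log(len(log_terms))

# Toy data (hypothetical) with a uniform background model.
train = {"spam": [["cheap", "pills", "cheap"], ["cheap", "offer"]],
         "ham": [["meeting", "agenda"], ["project", "meeting"]]}
vocab = sorted({w for ds in train.values() for d in ds for w in d})
background = {w: 1 / len(vocab) for w in vocab}
total = sum(len(ds) for ds in train.values())
scores = {l: math.log(len(ds) / total) + tdm_log_likelihood(["cheap", "pills"], ds, background)
          for l, ds in train.items()}
prediction = max(scores, key=scores.get)
```

This version visits every document for every query, so it costs time proportional to the number of documents times the query length.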
Tied Document Mixture 2
Tied Document Mixture 3
• Can be described as constraints on a two-level mixture
• Document level mixture:
  • Number of components = $M_l$
  • Components assigned to instances
  • Component weights = $1/M_l$
• Word level mixture:
  • Number of components = 3 (hierarchy depth)
  • Components assigned to hierarchy
  • Component weights = $1 - a_1 - a_2$, $a_1$, and $a_2$
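Written out, the two constraints could combine as follows. This is a sketch under assumptions: the slides give the three word-level weights but do not say which hierarchy level each belongs to, so the assignment below (document, class, background, in that order) and the symbols $p_l(n)$ for the class model and $p(n)$ for the background model are illustrative choices.

```latex
p(\mathbf{w} \mid l)
  = \frac{1}{M_l} \sum_{m=1}^{M_l} p(\mathbf{w} \mid m),
\qquad
p(\mathbf{w} \mid m)
  = \prod_{n} \Big[ (1 - a_1 - a_2)\, p^{u}_{m}(n) + a_1\, p_{l}(n) + a_2\, p(n) \Big]^{w_n}
```

The outer sum is the document level mixture with tied weights $1/M_l$; the bracket is the word level mixture over the three hierarchy levels.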
Tied Document Mixture 4
• Can be described as a class-smoothed Kernel Density Classifier
• Document mixture equivalent to a Multinomial kernel density
• Hierarchical smoothing corresponds to mean shift or data sharpening with class-centroids
Hierarchical Sparse Inference
• Reduces complexity from $O(NM)$ to $O(\sum_{n: w_n \neq 0} (1 + \sum_{m: p^u_m(n) \neq 0} 1))$
• Same complexity as K-Nearest Neighbors based on inverted indices (Yang, 1994)
Hierarchical Sparse Inference 2
• $p(\mathbf{w}|m)$ decomposes: [equation not recoverable]
• Precompile values for $p(l)$, $U_n$, $p^s_l(n)$ and $p^u_m(n)$
• Store $p'^u_m(n)$ and $p'^s_l(n)$ in inverted indices
Hierarchical Sparse Inference 3
• Compute first $U_n$
• Update by $p^s_l(n)$
• Update by $p^u_m(n)$ to get $p(\mathbf{w}|m)$
• Compute $p(\mathbf{w}|l) = \frac{1}{M_l} \sum_{m} p(\mathbf{w}|m)$
• Bayes rule: $p(l|\mathbf{w}) \propto p(l)\, p(\mathbf{w}|l)$
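The steps above can be sketched with two inverted indices. This is one possible reading rather than the authors' implementation: it assumes a background-only score is computed first, that class-level and document-level entries are stored as precomputed log-ratio updates, and that the smoothing weights `A1` (class) and `A2` (background) are fixed constants. All names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

A1, A2 = 0.3, 0.1  # assumed smoothing weights for class and background models

def build_index(train, background):
    """Precompute inverted indices of log-ratio updates (names are illustrative)."""
    class_index = defaultdict(dict)  # word -> {class: update over background-only score}
    doc_index = defaultdict(dict)    # word -> {(class, doc id): update over class-level score}
    class_sizes = {}
    for l, docs in train.items():
        class_sizes[l] = len(docs)
        models = []
        for d in docs:
            c = Counter(d)
            n = sum(c.values())
            models.append({w: k / n for w, k in c.items()})
        class_model = Counter()
        for m in models:
            for w, p in m.items():
                class_model[w] += p / len(models)
        for w, pl in class_model.items():
            base = A2 * background[w]
            class_index[w][l] = math.log((A1 * pl + base) / base)
        for i, m in enumerate(models):
            for w, pu in m.items():
                smooth = A1 * class_model[w] + A2 * background[w]
                doc_index[w][(l, i)] = math.log(((1 - A1 - A2) * pu + smooth) / smooth)
    return class_index, doc_index, class_sizes

def sparse_log_likelihoods(doc, index, background):
    """Return log p(w|l) for every class, touching only nonzero index entries."""
    class_index, doc_index, class_sizes = index
    query = Counter(doc)
    # Step 1: background-only score shared by every document and class.
    base = sum(k * math.log(A2 * background[w]) for w, k in query.items())
    # Step 2: class-level updates. Step 3: document-level updates.
    class_upd, doc_upd = defaultdict(float), defaultdict(float)
    for w, k in query.items():
        for l, u in class_index.get(w, {}).items():
            class_upd[l] += k * u
        for key, u in doc_index.get(w, {}).items():
            doc_upd[key] += k * u
    # Step 4: average p(w|m) over the class; untouched documents contribute exp(0) = 1 each.
    sums = {l: float(n) for l, n in class_sizes.items()}
    for (l, _), u in doc_upd.items():
        sums[l] += math.exp(u) - 1.0
    return {l: base + class_upd[l] + math.log(sums[l] / class_sizes[l]) for l in class_sizes}

# Toy data (hypothetical) with a uniform background model.
train = {"spam": [["cheap", "pills", "cheap"], ["cheap", "offer"]],
         "ham": [["meeting", "agenda"], ["project", "meeting"]]}
vocab = sorted({w for ds in train.values() for d in ds for w in d})
background = {w: 1 / len(vocab) for w in vocab}
index = build_index(train, background)
loglik = sparse_log_likelihoods(["cheap", "pills"], index, background)
# Step 5: Bayes rule, up to a normalizing constant.
total = sum(len(ds) for ds in train.values())
posterior_scores = {l: math.log(len(train[l]) / total) + loglik[l] for l in train}
prediction = max(posterior_scores, key=posterior_scores.get)
```

Documents that share no word with the query are never visited: they all have the same class-level likelihood, so they enter the average through the class size alone. A production version would also guard the `math.exp` call against overflow on long documents.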
Experimental Setup
• 14 classification datasets used:
  • 3 spam classification
  • 3 sentiment analysis
  • 5 multi-class classification
  • 3 multi-label classification
• Scripts and datasets in LIBSVM format:
  • http://sourceforge.net/projects/sgmweka/
Experimental Setup 2
• Classifiers compared:
  • Multinomial Naive Bayes (MNB)
  • Tied Document Mixture (TDM)
  • K-Nearest Neighbors (KNN) (Multinomial distance, distance-weighted vote)
  • Kernel Density Classifier (KDC) (Smoothed multinomial kernel)
  • Logistic Regression (LR, LR+) (L2-regularized)
  • Support Vector Machine (SVM, SVM+) (L2-regularized L2-loss)
  • LR+ and SVM+ weighted feature vectors by TFIDF
• Smoothing parameters optimized for Micro-Fscore on held-out development sets using Gaussian Random Searches
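For reference, the micro-averaged F-score used as the optimization target can be computed by pooling true positives, false positives and false negatives over all documents and labels. The sketch below uses the standard definition; the function name and the toy label sets are illustrative and not taken from the toolkit.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over documents, each with a set of true and predicted labels."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical multi-label example: 2 true positives, 1 false positive, 2 false negatives.
true_sets = [{"a"}, {"a", "b"}, {"c"}]
pred_sets = [{"a"}, {"a"}, {"b"}]
score = micro_f1(true_sets, pred_sets)
```

With single-label data every set has one element, and the micro-averaged F-score reduces to accuracy.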
Results
• Training times for MNB, TDM, KNN and KDC linear
  • At most 70 s for MNB on OHSU-TREC, 170 s for the others
• SVM and LR require iterative algorithms
  • At most 936 s, for LR on Amazon12
  • Did not scale to multi-label datasets in practical times
• Classification times for instance-based classifiers higher
  • At most mean 226 ms for TDM on OHSU-TREC, compared to 70 ms for MNB
  • (with 290k terms, 196k labels, 197k documents)
Results 2
• TDM significantly improves on MNB, KNN and KDC
• Across comparable datasets, TDM is on par with SVM+
  • SVM+ is significantly better on multi-class datasets
  • TDM is significantly better on spam classification
Results 3
• TDM reduces classification errors compared to MNB by:
  • >65% in spam classification
  • >26% in sentiment analysis
• Some correlation between error reduction and number of instances/class
• Task types form clearly separate clusters
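As a reminder of how a relative error reduction is read, the sketch below assumes the error is one minus the evaluation score. The scores in the example are hypothetical and are not results from the paper.

```python
def relative_error_reduction(baseline_score, new_score):
    """Fraction of the baseline's errors removed by the new classifier."""
    baseline_error = 1.0 - baseline_score
    new_error = 1.0 - new_score
    return (baseline_error - new_error) / baseline_error

# Hypothetical scores: errors drop from 10% to 3.5%, a 65% relative reduction.
reduction = relative_error_reduction(0.90, 0.965)
```

A large relative reduction can therefore correspond to a small absolute gain when the baseline is already accurate.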
Conclusion
• Tied Document Mixture
  • Integrated instance- and class-based model for text classification
  • Exact linear time algorithms, with same complexities as KNN and KDC
  • Accuracy substantially improved over MNB, KNN and KDC
  • Competitive with optimized SVM, depending on task type
  • Many improvements to the basic model possible
• Sparse inference scales to hierarchical mixtures of >340k components
• Toolkit, datasets and scripts available:
  • http://sourceforge.net/projects/sgmweka/
Sparse Inference
• Sparse Inference (Puurula, 2012)
  • Use inverted indices to reduce the complexity of computing the joint $p(\mathbf{w}, l)$ given $\mathbf{w}$
  • Instead of computing $p(\mathbf{w}, l)$ as dot products, compute the smoothing part from $p^s(n)$ and update by $p_l(n)$ for each $p^u_l(n) \neq 0$ from the inverted index
  • Reduces joint inference time complexity from dense $O(NL)$ to $O(\sum_{n: w_n \neq 0} (1 + \sum_{l: p^u_l(n) \neq 0} 1))$
Sparse Inference 2
• Dense representation: $p_m(n)$
  • $M$ = number of classes, $N$ = number of features
  • [Figure: dense 12 × 9 matrix of $p_m(n)$ values, all entries nonzero]
  • Time complexity: $O(NL)$
Sparse Inference 3
• Sparse representation: $p_m(n)$
  • [Figure: vector of 12 smoothing probabilities $p^s(n)$ and a sparse 12 × 9 matrix of $p^u_m(n)$ values, most entries zero]
  • Time complexity: $O(\sum_{n: w_n \neq 0} (1 + \sum_{l: p^u_l(n) \neq 0} 1))$
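The dense and sparse representations can be compared with a small sketch for a single-level model. It assumes a linearly interpolated class model $p_l(n) = (1 - a)\, p^u_l(n) + a\, p^s(n)$, which is one common smoothing choice and not necessarily the one used in the toolkit. The index stores a log-ratio update only where $p^u_l(n) \neq 0$. All names, the weight `a`, and the toy counts are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(class_counts, background, a=0.2):
    """word -> {class: log-ratio update}; only nonzero unsmoothed entries are stored."""
    index = defaultdict(dict)
    for l, counts in class_counts.items():
        total = sum(counts.values())
        for w, k in counts.items():
            smooth = a * background[w]
            index[w][l] = math.log(((1 - a) * k / total + smooth) / smooth)
    return index

def sparse_joint(doc, index, log_prior, background, a=0.2):
    """log p(w, l) for all classes: one shared smoothing score plus sparse updates."""
    query = Counter(doc)
    base = sum(k * math.log(a * background[w]) for w, k in query.items())
    scores = {l: lp + base for l, lp in log_prior.items()}
    for w, k in query.items():
        for l, u in index.get(w, {}).items():
            scores[l] += k * u
    return scores

def dense_joint(doc, class_counts, log_prior, background, a=0.2):
    """Reference computation that evaluates every class for every query word."""
    scores = {}
    for l, counts in class_counts.items():
        total = sum(counts.values())
        scores[l] = log_prior[l] + sum(
            math.log((1 - a) * counts.get(w, 0) / total + a * background[w]) for w in doc)
    return scores

# Toy word counts per class (hypothetical) with a uniform smoothing distribution.
class_counts = {"spam": Counter({"cheap": 3, "pills": 1, "offer": 1}),
                "ham": Counter({"meeting": 2, "agenda": 1, "project": 1})}
vocab = sorted({w for c in class_counts.values() for w in c})
background = {w: 1 / len(vocab) for w in vocab}
log_prior = {"spam": math.log(0.5), "ham": math.log(0.5)}
index = build_inverted_index(class_counts, background)
sparse = sparse_joint(["cheap", "meeting", "cheap"], index, log_prior, background)
dense = dense_joint(["cheap", "meeting", "cheap"], class_counts, log_prior, background)
```

Both functions return the same scores; the sparse version only visits the index entries for words that occur in the query.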