Scalable Text Classification with Sparse Generative Modeling
Antti Puurula
University of Waikato
Sparse Computing for Big Data
• “Big Data”
– machine learning for processing vast datasets
• Current solution: Parallel computing
– processing more data is as expensive, or more so
• Alternative solution: Sparse computing
– scalable solutions, less expensive
Sparse Representation
• Example: document vector
– Dense: word count vector w = [w1, …, wN]
• w = [0, 14, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0]
– Sparse: a pair of vectors [v, c]: indices v and nonzero counts c
• v = [2, 10, 14, 17], c = [14, 2, 1, 3]
• Complexity: |w| vs. s(w), number of nonzeros
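To make the dense/sparse contrast concrete, here is a minimal Python sketch (variable names are illustrative, not from the talk) of converting the dense count vector above into the index/count pair:

```python
# Dense word-count vector from the slide (word ids 1..20, 1-based).
w_dense = [0, 14, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0]

# Sparse representation: parallel vectors of indices and nonzero counts.
v = [i + 1 for i, count in enumerate(w_dense) if count != 0]  # word indices
c = [count for count in w_dense if count != 0]                # their counts

print(v)  # [2, 10, 14, 17]
print(c)  # [14, 2, 1, 3]

# Work now scales with s(w) = len(v) nonzeros rather than |w| = len(w_dense).
```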
Sparse Inference with MNB
• Multinomial Naive Bayes
• Input: word vector w, output: label 1 ≤ m ≤ M
• Sparse representation for the parameters pm(n)
– Jelinek-Mercer interpolation: pm(n) = α ps(n) + (1−α) pum(n)
– estimation: represent pum with a hashtable
– inference: represent pum with an inverted index (sketched below)
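As a rough illustration of those two layouts (a sketch with toy data and assumed structure names, not the released SGM code), the unsmoothed conditionals pum can be estimated into per-class hashtables of counts and then flipped into a word-keyed inverted index for inference:

```python
from collections import defaultdict

# Toy training data: (class label m, sparse word-count dict) pairs.
docs = [(1, {"goal": 3, "match": 1}), (1, {"match": 2}), (2, {"stock": 4, "market": 1})]

# Estimation: per-class hashtable of counts -> unsmoothed conditionals pum(n).
counts, totals = defaultdict(lambda: defaultdict(int)), defaultdict(int)
for m, doc in docs:
    for n, cnt in doc.items():
        counts[m][n] += cnt
        totals[m] += cnt
p_u = {m: {n: cnt / totals[m] for n, cnt in row.items()} for m, row in counts.items()}

# Inference layout: inverted index keyed by word n, listing (m, pum(n)) nonzeros.
inverted = defaultdict(list)
for m, row in p_u.items():
    for n, prob in row.items():
        inverted[n].append((m, prob))

print(inverted["match"])  # [(1, 0.5)] -- only classes where the word actually occurred
```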
Sparse Inference with MNB
• Dense representation: pm(n)
[Figure: a dense table of class-conditional probabilities, with rows p1(n), …, p12(n) (the M classes) and columns pm(1), …, pm(9) (the N features); every entry is stored explicitly.]
• Time complexity: O(s(w) · M)
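For contrast, a dense scorer must visit every class for every observed word, which is where the O(s(w) · M) cost comes from; a toy sketch (illustrative values, not the talk's code):

```python
import math

# Dense parameter table: pm(n) stored for every class m and word n (toy values).
p = {m: {n: 0.2 for n in range(1, 6)} for m in range(1, 4)}   # 3 classes x 5 words
prior = {1: 0.5, 2: 0.3, 3: 0.2}
w = {2: 14, 4: 1}                                             # sparse document: word id -> count

scores = {m: math.log(prior[m]) for m in p}
for n, cnt in w.items():     # s(w) document words ...
    for m in p:              # ... times all M classes -> O(s(w) * M)
        scores[m] += cnt * math.log(p[m][n])
print(max(scores, key=scores.get))  # -> 1 (uniform conditionals, so the largest prior wins)
```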
Sparse Inference with MNB
• Sparse representation: pm(n) = α ps(n) + (1−α) pum(n)
[Figure: the smoothing distribution ps(n), shared across classes, plus a sparse table of unsmoothed conditionals pu1(n), …, pu12(n); most entries are zero, so only the nonzeros need to be stored.]
• Time complexity: O(s(w) + ∑m |{n : wn > 0, pum(n) > 0}|)
Sparse Inference with MNB
[Figure: the two terms of the sparse inference cost highlighted in the score computation.]
• O(s(w)): the shared smoothing term, computed once over the document's nonzero words
• O(∑m |{n : wn > 0, pum(n) > 0}|): the class-specific corrections reached through the inverted index
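Putting the pieces together, here is a minimal sketch of the sparse scorer (toy structures and an assumed α, not the released SGM implementation): the α·ps(n) background term is accumulated once in O(s(w)), and the inverted index supplies (1−α)·pum(n) corrections only where they are nonzero:

```python
import math

alpha = 0.5
priors = {1: 0.5, 2: 0.3, 3: 0.2}                               # p(m), toy values
p_s = {"goal": 0.02, "match": 0.03, "stock": 0.01}              # smoothing distribution ps(n)
# Inverted index: word n -> list of (class m, unsmoothed pum(n)), nonzeros only.
inverted = {"goal": [(1, 0.15)], "match": [(1, 0.05), (2, 0.02)], "stock": [(3, 0.20)]}

def classify(doc):
    """doc: sparse word-count dict; returns the highest-scoring class."""
    # O(s(w)): background score shared by every class (the prior loop is kept dense for brevity).
    base = sum(cnt * math.log(alpha * p_s[n]) for n, cnt in doc.items())
    scores = {m: base + math.log(pm) for m, pm in priors.items()}
    # Sparse corrections: only (m, n) pairs with pum(n) > 0 are ever touched.
    for n, cnt in doc.items():
        for m, pu in inverted.get(n, []):
            smoothed = alpha * p_s[n] + (1 - alpha) * pu
            scores[m] += cnt * (math.log(smoothed) - math.log(alpha * p_s[n]))
    return max(scores, key=scores.get)

print(classify({"goal": 2, "match": 1}))  # -> 1
```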
Multi-label Classifiers
• Multi-label classification
– binary label vector l = [l1, …, lM] instead of a single label m
– 2^M possible label vectors (e.g. 8 for M = 3), not directly solvable
• Solved with multi-label extensions:
– Binary Relevance (Godbole & Sarawagi 2004)
– Label Powerset (Boutell et al. 2004)
– Multi-label Mixture Model (McCallum 1999)
Multi-label Classifiers
• Feature normalization: TFIDF
– s(w) length normalization (“L0-norm”)
– TF log-transform of counts, corrects “burstiness”
– IDF transform, unsmoothed Croft-Harper IDF
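A rough sketch of that normalization pipeline (the exact transforms are assumptions here; unsmoothed Croft-Harper IDF is taken as log((N − df)/df)):

```python
import math

def tfidf_normalize(doc, df, n_docs):
    """doc: word -> count; df: word -> document frequency; n_docs: corpus size."""
    feats = {}
    for n, cnt in doc.items():
        tf = math.log(1.0 + cnt)                    # TF log-transform, corrects "burstiness"
        idf = math.log((n_docs - df[n]) / df[n])    # unsmoothed Croft-Harper IDF
        feats[n] = tf * idf
    # Length normalization: rescale so the vector sums to s(w), its number of nonzeros ("L0-norm").
    scale = len(feats) / sum(feats.values())
    return {n: val * scale for n, val in feats.items()}

print(tfidf_normalize({"goal": 3, "match": 1}, {"goal": 10, "match": 200}, 1000))
```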
Multi-label Classifiers
• Classifier modification with meta-parameters a
– a1: Jelinek-Mercer smoothing of the conditionals pm(n)
– a2: count pruning in training with a threshold
– a3: prior scaling, replacing p(l) by p(l)^a3
– a4: class pruning in classification with a threshold
Multi-label Classifiers
• Direct optimization of a with random search
– target f(a): development set F-score
• Parallel random search
– iteratively sample points around the current max of f(a)
– generate points with dynamically adapted step sizes
– sample f(a) in I iterations of J parallel points
• I = 30, J = 50 → 1500 configurations of a sampled
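A minimal sketch of that kind of search (written sequentially for clarity; the target f, the bounds and the step adaptation rule are illustrative assumptions):

```python
import random

def random_search(f, init, iters=30, points=50, step=0.5):
    """Maximize f(a): sample `points` candidates around the current best in each
    of `iters` iterations, adapting the step size as the search progresses."""
    best_a, best_f = list(init), f(init)
    for _ in range(iters):
        improved = False
        for _ in range(points):                        # the J evaluations could run in parallel
            cand = [x + random.uniform(-step, step) for x in best_a]
            f_cand = f(cand)
            if f_cand > best_f:
                best_a, best_f, improved = cand, f_cand, True
        step = step * 1.5 if improved else step * 0.5  # dynamically adapted step
    return best_a, best_f

# Toy stand-in for development-set F-score as a function of meta-parameters a1..a4.
toy_f = lambda a: -sum((x - 0.3) ** 2 for x in a)
print(random_search(toy_f, [0.5, 0.5, 0.5, 0.5]))
```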
Experiment Datasets
Experiment Results
Conclusion
• New idea: sparse inference
– reduces time complexity of probabilistic inference
– demonstrated for multi-label classification
– applicable with different models (KNN, SVM, …) and uses (clustering, ranking, …)
• Code available, with Weka wrapper:
– Weka package manager: SparseGenerativeModel
– http://sourceforge.net/projects/sgmweka/
Multi-label classification
• Binary Relevance (Godbole & Sarawagi 2004)
– each label decision is an independent binary problem
– positive multinomial vs. negative multinomial
• negatives approximated with a background multinomial
– threshold parameter for improved accuracy
+ fast, simple, easy to implement
- ignores label correlations, poor performance
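A minimal sketch of the Binary Relevance decision rule described above (toy models and a hypothetical floor value for unseen words):

```python
import math

def mnb_log_prob(doc, probs, floor=1e-6):
    """Log-likelihood of a sparse count dict under a multinomial (unseen words floored)."""
    return sum(cnt * math.log(probs.get(n, floor)) for n, cnt in doc.items())

def binary_relevance(doc, positive, background, threshold=0.0):
    """One independent binary decision per label: positive multinomial vs. the shared
    background multinomial, accepting a label if the log-odds clear the threshold."""
    return [m for m, probs in positive.items()
            if mnb_log_prob(doc, probs) - mnb_log_prob(doc, background) > threshold]

positive = {"sports": {"goal": 0.2, "match": 0.1}, "finance": {"stock": 0.3}}
background = {"goal": 0.01, "match": 0.02, "stock": 0.02}
print(binary_relevance({"goal": 2, "match": 1}, positive, background))  # ['sports']
```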
Multi-label classification
• Label Powerset (Boutell et al. 2004)
– each labelset seen in training is mapped to a class
– hashtable for converting classes to labelsets
+ models label correlations, good performance
- takes memory, cannot classify new labelsets
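A small sketch of the Label Powerset bookkeeping (illustrative labelsets): each distinct training labelset becomes one class, and a hashtable maps predicted classes back to labelsets:

```python
# Distinct labelsets observed in training become the classes of a single-label problem.
train_labelsets = [frozenset({"sports"}), frozenset({"sports", "betting"}), frozenset({"finance"})]

class_to_labelset = dict(enumerate(sorted(set(train_labelsets), key=sorted)))
labelset_to_class = {ls: c for c, ls in class_to_labelset.items()}

# A single-label classifier (e.g. MNB) is trained over these classes; at test time
# the predicted class id is converted back to its labelset via the hashtable.
predicted_class = labelset_to_class[frozenset({"sports", "betting"})]
print(class_to_labelset[predicted_class])  # the {'sports', 'betting'} labelset
```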
Multi-label classification
• Multi-label Mixture Model
– mixture for the prior:
– classification with greedy search (McCallum 1999)
• complexity: q times the MNB complexity, where q is the maximum labelset size s(l) seen in training
+ models labelsets, generalizes to new labelsets
- assumes a uniform linear decomposition
Multi-label classification
• Multi-label Mixture Model
– related models: McCallum 1999, Ueda 2002
– like Label Powerset, but the labelset conditionals decompose into a mixture of label conditionals (see the sketch below):
• pl(n) = (1/s(l)) ∑m lm pm(n)
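A one-function sketch of that decomposition (toy conditionals):

```python
def labelset_conditional(labelset, p, n):
    """p_l(n) = (1 / s(l)) * sum of pm(n) over the labels m present in l."""
    return sum(p[m].get(n, 0.0) for m in labelset) / len(labelset)

p = {"sports": {"goal": 0.2, "match": 0.1}, "betting": {"odds": 0.3, "match": 0.05}}
print(labelset_conditional({"sports", "betting"}, p, "match"))  # (0.1 + 0.05) / 2
```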
Classifier optimization
• Model modifications
– 1) Jelinek-Mercer smoothing of the conditionals: pm(n) = α ps(n) + (1−α) pum(n)
– 2) Count pruning: at most 8M conditional counts; on each count update, online pruning with running IDF estimates and meta-parameter a2
Classifier optimization
• Model modifications
– 3) Prior scaling: replace p(l) by p(l)^a3
• equivalent to LM scaling in speech recognition
– 4) Pruning in classification: sort classes by rank and stop classification at a threshold a4
• sort by:
• prune:
Sparse Inference with MNB
• Multinomial Naive Bayes
• Bayes: p(w, m) = p(m) pm(w)
• Naive: p(w, m) = p(m) ∏n pm(wn, n)
• Multinomial: p(w, m) ∝ p(m) ∏n pm(n)^wn