
A Survey on Text Categorization with Machine Learning
Chikayama Lab.

Dai Saito

Introduction: Text Categorization
- Many digital texts are available: e-mail, online news, blogs, ...
- The need for automatic text categorization is increasing
  - Works without human labor, saving time and cost

Introduction: Text Categorization
- Applications
  - Spam filter
  - Topic categorization

Introduction: Machine Learning
- Builds categorization rules automatically from features of the text
- Types of machine learning (ML)
  - Supervised learning: labeling
  - Unsupervised learning: clustering

Introduction: flow of ML
1. Prepare training text data with labels
2. Learn from the features of the text
3. Categorize new text (Label1 or Label2?)

Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion

Number of labels
- Binary-label
  - True or false (ex. spam or not)
  - Binary methods can be applied to the other label types
- Multi-label
  - Many labels, but each text has exactly one label
- Overlapping-label
  - One text may have several labels

Types of labels
- Topic categorization
  - Basic task
  - Compares individual words
- Author categorization
- Sentiment categorization
  - Ex) reviews of products
  - Needs more linguistic information

Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion

Feature of Text
- How do we express the features of a text?

- "Bag of Words"
  - Ignores word order and structure
  - Ex) "I like this car." | "I don't like this car."
  - These two sentences share almost the same words, so "Bag of Words" will not work well when structure carries the meaning
- A document d (text) is represented as a vector of weights over its terms t (words)
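To make the representation concrete, here is a minimal bag-of-words sketch in Python (the tokenizer and names are illustrative assumptions, not from the survey):

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to term counts, ignoring word order entirely."""
    # naive tokenizer: lowercase, drop periods, split at apostrophes
    tokens = text.lower().replace(".", "").replace("'", " ").split()
    return Counter(tokens)

d1 = bag_of_words("I like this car.")
d2 = bag_of_words("I don't like this car.")
print(d1)  # Counter({'i': 1, 'like': 1, 'this': 1, 'car': 1})
print(d2)  # almost the same counts, despite the opposite meaning
```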

Preprocessing
- Remove stop words
  - "the", "a", "for", ...
- Stemming
  - relational -> relate, truly -> true
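A minimal preprocessing sketch in Python; the stop-word list and suffix rules are toy assumptions, so its stems are cruder than the slide's "relational -> relate" example (real systems use a proper stemmer such as Porter's):

```python
STOP_WORDS = {"the", "a", "for", "of", "and", "to", "in"}  # toy list

def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ional", "ly", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(tokens):
    """Drop stop words, then stem what remains."""
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess(["the", "relational", "truly"]))  # ['relat', 'tru']
```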

Term Weighting
- Term frequency (tf)
  - Number of occurrences of a term in a document
  - Terms frequent in a document seem important for categorization
- tf-idf
  - Terms appearing in many documents are not useful for categorization
  - tfidf(t, d) = tf(t, d) * log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing t
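A sketch of tf-idf weighting under the definition above (the survey does not fix an exact variant; the documents below are invented):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term in each document by tf(t, d) * log(N / df(t))."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each term once per doc
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

docs = [["cheap", "pills", "cheap"], ["meeting", "agenda"], ["cheap", "flights"]]
print(tf_idf(docs)[0])  # 'cheap' is down-weighted: it appears in 2 of 3 docs
```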

Sentiment Weighting
- For sentiment classification, weight a word as positive or negative
- Constructing a sentiment dictionary from WordNet [04 Kamps et al.]
  - WordNet: a synonym database
  - Score a word by its graph distance from 'good' and from 'bad'
  - Ex) d(good, happy) = 2, d(bad, happy) = 4, so 'happy' leans positive
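A toy sketch of the distance idea: breadth-first search over a hand-made synonym graph stands in for real WordNet links (the edges below are invented for illustration):

```python
from collections import deque

# Invented synonym edges; the real method of [04 Kamps et al.] walks WordNet.
SYNONYMS = {
    "good": ["fine", "superb"],
    "fine": ["good", "happy"],
    "happy": ["fine", "glad"],
    "glad": ["happy"],
    "superb": ["good"],
    "bad": ["poor"],
    "poor": ["bad", "inferior"],
    "inferior": ["poor", "glad"],  # some path must connect the two poles
}

def distance(src, dst):
    """Shortest path length between two words via synonym links (BFS)."""
    queue, seen = deque([(src, 0)]), {src}
    while queue:
        word, d = queue.popleft()
        if word == dst:
            return d
        for nxt in SYNONYMS.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

# A word is scored positive when it lies closer to 'good' than to 'bad'.
print(distance("good", "happy"), distance("bad", "happy"))  # 2 4
```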

Dimension Reduction
- The feature matrix has (#terms) x (#documents) entries
  - #terms ≒ size of the dictionary
  - High calculation cost
  - Risk of overfitting: best for training data ≠ best for real data
- Choosing effective features improves accuracy and calculation cost

Dimension Reduction
- df-threshold
  - Terms appearing in very few documents (ex. only one) are not important
- Scoring each term t against each category c_j
  - One common score is mutual information: Score(t, c_j) = log( P(t, c_j) / (P(t) * P(c_j)) )
  - If t and c_j are independent, the score is equal to zero
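A sketch of that independence-based score, computed from document counts (the counts and names below are invented; the mutual-information form is one common choice):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Pointwise mutual information between a term t and a category c_j.

    n11: docs containing t with label c_j, n10: t without c_j,
    n01: c_j without t, n00: neither.  The score is 0 when t and c_j
    are independent, so low-scoring terms can be dropped.
    """
    n = n11 + n10 + n01 + n00
    p_t = (n11 + n10) / n
    p_c = (n11 + n01) / n
    p_tc = n11 / n
    return math.log(p_tc / (p_t * p_c))

# 'cheap' occurs mostly in spam documents -> high score, keep the term.
print(mutual_information(40, 10, 10, 40))   # ~0.47 (dependent)
print(mutual_information(25, 25, 25, 25))   # 0.0  (independent)
```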

Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion

Learning Algorithm
- Many (almost all?) algorithms have been applied to text categorization
- Simple approaches
  - Naïve Bayes
  - k-Nearest Neighbor
- High-performance approaches
  - Boosting
  - Support Vector Machine
  - Hierarchical learning

Naïve Bayes
- Bayes rule: P(c|d) = P(c) * P(d|c) / P(d)
- P(d|c) is hard to calculate directly
- Assumption: each term occurs independently
  - P(d|c) = Π_i P(t_i|c)
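A compact multinomial Naïve Bayes sketch with add-one smoothing, one common realization of the assumption above (the survey does not pin down a variant; data is invented):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.term_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        def log_posterior(label):
            # log P(c) + sum_i log P(t_i | c)
            total = sum(self.term_counts[label].values())
            score = math.log(self.label_counts[label] /
                             sum(self.label_counts.values()))
            for t in doc:
                score += math.log((self.term_counts[label][t] + 1) /
                                  (total + len(self.vocab)))
            return score
        return max(self.label_counts, key=log_posterior)

docs = [["cheap", "pills"], ["cheap", "offer"], ["meeting", "agenda"]]
nb = NaiveBayes().fit(docs, ["spam", "spam", "ham"])
print(nb.predict(["cheap", "meeting"]))  # 'spam' (term evidence dominates)
```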

k-Nearest Neighbor
- Define a "distance" between two texts
  - Ex) Sim(d1, d2) = (d1 · d2) / (|d1| |d2|) = cos θ
- Check the k texts with highest similarity (ex. k = 3) and categorize by majority vote
- All training texts must be stored and searched, so memory and search costs grow with the size of the data
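A minimal k-NN sketch using the cosine similarity above (the data and names are invented):

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Sim(d1, d2) = d1 . d2 / (|d1| |d2|) on term-count vectors."""
    dot = sum(d1[t] * d2[t] for t in d1 if t in d2)
    norm = math.sqrt(sum(v * v for v in d1.values())) * \
           math.sqrt(sum(v * v for v in d2.values()))
    return dot / norm if norm else 0.0

def knn_predict(train, query, k=3):
    """Majority vote over the k training texts most similar to the query."""
    ranked = sorted(train, key=lambda dl: cosine(dl[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [(Counter(["cheap", "pills"]), "spam"),
         (Counter(["cheap", "offer", "now"]), "spam"),
         (Counter(["meeting", "agenda"]), "ham"),
         (Counter(["agenda", "minutes"]), "ham")]
print(knn_predict(train, Counter(["cheap", "pills", "now"])))  # 'spam'
```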

Boosting
- BoosTexter [00 Schapire et al.]
- AdaBoost
  - Builds many "weak learners" with different parameters
  - The k-th weak learner checks the performance of learners 1..k-1 and tries to classify correctly the training data they scored worst on
- BoosTexter uses a Decision Stump as the weak learner
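A stripped-down AdaBoost sketch with word-presence decision stumps; binary +1/-1 labels simplify BoosTexter's actual multi-label setting, and all names and data are invented:

```python
import math

def train_adaboost(docs, labels, rounds=3):
    """AdaBoost over stumps of the form: +1 if a term is present, else -1."""
    n = len(docs)
    weights = [1.0 / n] * n
    vocab = {t for d in docs for t in d}
    ensemble = []  # list of (term, alpha)
    for _ in range(rounds):
        def error(term):  # weighted error of the stump for this term
            return sum(w for d, y, w in zip(docs, labels, weights)
                       if (1 if term in d else -1) != y)
        term = min(vocab, key=error)          # best stump this round
        err = max(error(term), 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((term, alpha))
        # re-weight: misclassified examples get more attention next round
        weights = [w * math.exp(-alpha * y * (1 if term in d else -1))
                   for d, y, w in zip(docs, labels, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble

def predict(ensemble, doc):
    score = sum(a * (1 if t in doc else -1) for t, a in ensemble)
    return 1 if score >= 0 else -1

docs = [{"cheap", "pills"}, {"cheap", "offer"}, {"meeting"}, {"agenda"}]
labels = [1, 1, -1, -1]  # +1 = spam
print(predict(train_adaboost(docs, labels), {"cheap"}))  # 1
```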

Simple example of Boosting
- [figure: three boosting rounds on a set of + and - training points; each round a new weak learner is fit, emphasizing the points misclassified so far]

Support Vector Machine
- Text categorization with SVM [98 Joachims]
- Maximize the margin between the two classes
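For reference, the standard hard-margin formulation behind "maximize margin" (a textbook statement, not the exact notation of [98 Joachims]):

```latex
% Separating hyperplane w . x + b = 0 with the largest margin:
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^2
\qquad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1,\quad i = 1,\dots,n
```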

Text Categorization with SVM
- SVM works well for text categorization
  - Robust in high dimensions
  - Robust against overfitting
- Most text categorization problems are linearly separable
  - All of OHSUMED (MEDLINE collection)
  - Most of Reuters-21578 (news collection)

Comparison of these methods
- [02 Sebastiani], on Reuters-21578 (2 versions; the difference is the number of categories)

  Method        Ver.1 (90)   Ver.2 (10)
  k-NN          .860         .823
  Naïve Bayes   .795         .815
  Boosting      .878         -
  SVM           .870         .920
Hierarchical Learning
- TreeBoost [06 Esuli et al.]
  - Boosting algorithm for hierarchical labels
  - Training data: a hierarchy of labels plus texts with labels
  - Applies AdaBoost recursively at each node of the hierarchy
  - Better classifier than 'flat' AdaBoost
    - Accuracy: 2-3% up
    - Time: both training and categorization time go down
- Hierarchical SVM [04 Cai et al.]

TreeBoost
- [figure: example label hierarchy; the root has children L1..L4, L1 has L11 and L12, L4 has L41..L43, and L42 has L421 and L422]
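A hedged sketch of the recursive idea: one local classifier per internal node of the label hierarchy, routing a document from root to leaf. The keyword rule below is a toy stand-in for the AdaBoost classifier TreeBoost would train at each node; labels echo the figure, everything else is invented:

```python
class Node:
    def __init__(self, label, children=(), keywords=()):
        self.label = label
        self.children = list(children)
        self.keywords = set(keywords)  # toy stand-in for a trained model

    def score(self, doc):
        """How strongly this node's toy classifier claims the document."""
        return len(self.keywords & doc)

def categorize(node, doc):
    """Walk root -> leaf, making one local decision per level."""
    while node.children:
        node = max(node.children, key=lambda c: c.score(doc))
    return node.label

root = Node("root", children=[
    Node("L1", keywords={"sports"}, children=[
        Node("L11", keywords={"soccer"}),
        Node("L12", keywords={"tennis"}),
    ]),
    Node("L4", keywords={"politics"}, children=[
        Node("L41", keywords={"election"}),
        Node("L42", keywords={"budget"}),
    ]),
])
print(categorize(root, {"sports", "tennis"}))  # 'L12'
```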

Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion

Conclusion
- Overview of text categorization with machine learning
  - Feature of Text
  - Learning Algorithm
- Future work
  - Natural language processing with machine learning, especially in Japanese
  - Calculation cost