Transcript Multi-word
A comparative study of TF*IDF, LSI and multi-words for text classification
Presenter: JIAN-REN CHEN
Authors: Wen Zhang, Taketoshi Yoshida, Xijin Tang
2011. ESWA
Intelligent Database Systems Lab

Outline
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments

Motivation
Although TF*IDF, LSI and multi-words were proposed long ago, there is no comparative study of these indexing methods, and no results have been reported on their classification performance.

Objectives
• A comparative study of TF*IDF, LSI and multi-words for text classification.
  - information retrieval
  - text categorization
• Indexing term:
  - semantic quality
  - statistical quality

Methodology - TF*IDF
1) w_{i,j}: the weight of term i in document j
2) N: the number of documents in the collection
3) tf_{i,j}: the term frequency of term i in document j
4) df_i: the document frequency of term i in the collection
[Figure: term-document matrix, labeled with the terms (keywords) of the document collection and the documents]

Methodology - LSI
Given a term-document matrix X = [x_1, x_2, ..., x_n] ∈ R^(m×n), and supposing the rank of X is r, LSI decomposes X using SVD as follows:
1. X = U Σ V^T
2. X_k = U_k Σ_k V_k^T
[Figure: term-document matrix, labeled with the terms (keywords) of the document collection and the documents]

Methodology - Multi-word
- its occurrence frequency should be at least twice in a document.
- the length of the multi-word should be between 2 and 6

Experiments - Datasets
Chinese corpus: TanCorpV1.0
• 14150 documents, 20 categories
• Select 1200 documents: agriculture, history, politics, economy
• 219,115 sentences
• 5,468,301 individual words
English corpus: Reuters-22173 distribution 1.0
• 22173 documents, 135 categories
• Select 2032 documents: crude (520), agriculture (574), trade (514), interest (424)
• 50,837 sentences
• 281,111 individual words

Experiments - Evaluation

Experiments - Chinese

Experiments - English

Experiments - t-test

Comparison
[Table: TF*IDF, LSI and multi-word compared on information retrieval and text categorization (Chinese and English) and on computation complexity]
Computation complexity:
- TF*IDF: O(nm) (best)
- LSI: O(n^2 r^3)
- multi-word: O(m s^2)

Conclusions
• LSI produces indexing with better discriminative power.
• LSI and multi-word have better semantic quality than TF*IDF, and TF*IDF has better statistical quality than the other two methods.
• The number of dimensions is still a decisive factor for indexing when different indexing methods are used for classification.

Comments
• Advantages
  - Compares TF*IDF, LSI and multi-words
• Disadvantage
  - semantic quality and statistical quality are judged merely by intuition instead of theory
• Applications
  - text mining
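
Sketch for the TF*IDF slide: the slide defines w_{i,j}, N, tf_{i,j} and df_i, but the weighting formula itself is not preserved in this transcript. The sketch below assumes the common form w_{i,j} = tf_{i,j} * log(N / df_i); the natural logarithm, the lack of normalization, the whitespace tokenizer, the function name tfidf_weights and the toy documents are all assumptions made for illustration, not details taken from the slides.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting (not confirmed by the slides):
        w[i][j] = tf[i][j] * log(N / df[i])
    """
    n_docs = len(docs)  # N: number of documents in the collection
    tokenized = [doc.lower().split() for doc in docs]
    # df[i]: number of documents that contain term i at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf[i][j]: raw count of term i in document j
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy collection
docs = [
    "crude oil price rises",
    "oil trade talks",
    "interest rate rises",
]
w = tfidf_weights(docs)
```

Under this assumed form, a term that occurs in every document gets weight 0, since log(N / N) = 0, while a term found in a single document gets the largest idf factor.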
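
Sketch for the Multi-word slide: the two rules stated there (a frequency of at least two within a document, and a length between 2 and 6) can be read as a filter over contiguous word sequences. The code below is a simplified illustration of that filter only. The slides do not describe the authors' extraction procedure, so the sentence splitting on periods, the whitespace tokenizer and the function name multiword_candidates are assumptions, and any linguistic filtering is omitted.

```python
from collections import Counter

def multiword_candidates(text, min_len=2, max_len=6, min_freq=2):
    """Return {n-gram tuple: count} for n-grams that satisfy both slide rules."""
    # Naive sentence split so that candidates never cross a sentence boundary
    sentences = [s.split() for s in text.lower().split(".") if s.strip()]
    counts = Counter()
    for words in sentences:
        for n in range(min_len, max_len + 1):        # rule: length between 2 and 6
            for start in range(len(words) - n + 1):
                counts[tuple(words[start:start + n])] += 1
    # rule: the candidate occurs at least twice in the document
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Hypothetical single document
doc = "Crude oil price rose today. Analysts expect the crude oil price to fall."
found = multiword_candidates(doc)
```

Note that shorter sequences nested inside a longer repeated one (here "crude oil" inside "crude oil price") are kept as separate candidates; whether to keep only the longest match is a design choice this sketch leaves open.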