A comparative study of TF*IDF, LSI and
multi-words for text classification
Presenter: JIAN-REN CHEN
Authors: Wen Zhang, Taketoshi Yoshida, Xijin Tang
2011, ESWA
Intelligent Database Systems Lab
Outline
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments
Motivation
Although TF*IDF, LSI and multi-words were proposed long ago,
there has been no comparative study of these indexing
methods, and no results have been reported on their
classification performance.
Objectives
• A comparative study of TF*IDF, LSI and multi-words
for text classification.
- information retrieval
- text categorization
• indexing term:
- semantic quality
- statistical quality
Methodology - TF*IDF
1)
2)
3)
4)
$w_{i,j}$: the weight of term i in document j
$N$: the number of documents in the collection
$tf_{i,j}$: the term frequency of term i in document j
$df_i$: the document frequency of term i in the collection
[Figure: term-document matrix; rows are terms (keywords) of the document collection, columns are documents]
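The formulas for steps 1) to 4) did not survive in this transcript. As a rough illustration only, the sketch below uses one commonly used form of the weighting, $w_{i,j} = tf_{i,j} \cdot \log(N / df_i)$, built from the symbols defined above. This form is an assumption; the paper's exact variant (log base, normalization) may differ. The function name `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed weighting (not confirmed by the slides): w_ij = tf_ij * log(N / df_i).
    """
    n_docs = len(docs)  # N: number of documents in the collection
    # df_i: number of documents that contain term i at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: raw count of term i in document j
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy collection, invented for illustration
docs = [["oil", "price", "oil"], ["trade", "price"], ["oil", "trade", "tariff"]]
w = tfidf_weights(docs)
```

A term that appears in every document gets weight 0 under this form, which is the usual intuition behind the IDF factor: such a term does not help tell documents apart.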
Methodology - LSI
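Given a term-document matrix $X = [x_1, x_2, \ldots, x_n]$ with each $x_i \in \mathbb{R}^m$,
and supposing the rank of $X$ is $r$, LSI decomposes $X$ using SVD as follows:
1. $X = U \Sigma V^T$
2. $X_k = U_k \Sigma_k V_k^T$, keeping only the $k$ largest singular values ($k \le r$)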
[Figure: term-document matrix; rows are terms (keywords) of the document collection, columns are documents]
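The two SVD steps can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the helper name `lsi_rank_k`, the choice of `k`, and the toy matrix are all assumptions made for the example.

```python
import numpy as np

def lsi_rank_k(X, k):
    """Return the rank-k approximation of X and the k-dimensional document coordinates."""
    # Step 1: thin SVD, X = U @ diag(s) @ Vt, singular values in descending order
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Step 2: keep only the k largest singular values and their vectors
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    Xk = Uk @ np.diag(sk) @ Vtk
    # Each column is one document expressed in the k-dimensional latent space
    doc_coords = np.diag(sk) @ Vtk
    return Xk, doc_coords

# Toy term-document matrix (4 terms x 3 documents), invented for illustration
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
Xk, doc_coords = lsi_rank_k(X, k=2)
```

Documents are then compared or classified using the columns of `doc_coords` instead of the original term vectors.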
Methodology - Multi-word
• A multi-word must occur at least twice in a document.
• The length of a multi-word should be between 2 and 6.
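The two constraints above can be illustrated with a simple candidate filter. This sketch assumes that length is counted in words, that a multi-word is a contiguous word sequence, and that the document is already tokenized. The slides do not describe the paper's actual extraction procedure, so the function `multiword_candidates` and the sample sentence are hypothetical.

```python
from collections import Counter

def multiword_candidates(tokens, min_len=2, max_len=6, min_freq=2):
    """Count contiguous word sequences of min_len..max_len tokens in one document
    and keep those that occur at least min_freq times."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_freq}

# Toy document, invented for illustration
tokens = "crude oil price rose as crude oil price fell".split()
cands = multiword_candidates(tokens)
# cands == {"crude oil": 2, "oil price": 2, "crude oil price": 2}
```

Note that this naive filter also keeps sub-sequences nested inside a longer candidate ("crude oil" inside "crude oil price"); a real extractor would likely need an extra rule to handle that.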
Experiments - Datasets
Chinese corpus: TanCorpV1.0
• 14,150 documents, 20 categories
• Selected: 1,200 documents in 4 categories: agriculture, history, politics, economy
• 219,115 sentences; 5,468,301 individual words
English corpus: Reuters-22173 distribution 1.0
• 22,173 documents, 135 categories
• Selected: 2,032 documents in 4 categories: Crude (520), agriculture (574), Trade (514), Interest (424)
• 50,837 sentences; 281,111 individual words
Experiments - Evaluation
Experiments - Chinese
Experiments - English
Experiments - t-test
Comparison
Method       information retrieval   text categorization   computation complexity
TF*IDF       Chinese                 -                     O(nm)
LSI          English                 best                  O(n^2 r^3)
multi-word   -                       -                     O(m s^2)
Conclusions
• LSI can produce indexing with better discriminative
power.
• LSI and multi-word have better semantic quality
than TF*IDF, and TF*IDF has better statistical quality
than the other two methods.
• The number of dimensions is still a decisive factor for
indexing when different indexing methods are used for
classification.
Comments
• Advantages
- Compares TF*IDF, LSI and multi-words
• Disadvantage
- Semantic quality and statistical quality are judged
merely by the authors' intuition rather than by theory
• Applications
- text mining