
A Comparative Study on Feature
Selection in Text Categorization
(Proc. 14th International Conference on Machine Learning, 1997)
Paper By:
Yiming Yang, CMU
Jan O. Pedersen, Verity, Inc.
Presented By:
Prerak Sanghvi
Computer Science and Engineering Department
State University of New York at Buffalo
Introduction
• This paper is a comparative study of feature
selection methods in statistical learning of text
categorization.
• Five methods were evaluated:
– Document Frequency (DF)
– Information Gain (IG)
– Mutual Information (MI)
– χ² test (CHI)
– Term Strength (TS)
Document Frequency (DF)
• Document Frequency is the number of documents
in which a term occurs.
• Terms whose document frequency is below a
predetermined threshold are removed from the
feature space.
• The basic assumption is that rare terms are either
non-informative for category prediction or not
influential in global performance. However, this
contradicts a widely held assumption in information
retrieval that low-DF terms can be relatively
informative, so DF thresholding must be handled
carefully.
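
A minimal Python sketch of DF thresholding; the toy corpus and the min_df value here are hypothetical illustrations, not from the paper:

```python
from collections import Counter

def df_filter(docs, min_df=2):
    """Keep terms whose document frequency is at least min_df."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))        # count each term once per document
    return {t for t, n in df.items() if n >= min_df}

docs = [["wheat", "farm", "export"],
        ["wheat", "price", "export"],
        ["oil", "price", "market"]]
print(sorted(df_filter(docs)))     # ['export', 'price', 'wheat']
```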
Information Gain (IG)
• IG measures the number of bits of
information obtained for category prediction
by knowing the presence or absence of a
term in a document.
• For a term t and a set of classes {c_1, …, c_m}:
G(t) = − Σ_{i=1..m} Pr(c_i) log Pr(c_i)
+ Pr(t) Σ_{i=1..m} Pr(c_i | t) log Pr(c_i | t)
+ Pr(t̄) Σ_{i=1..m} Pr(c_i | t̄) log Pr(c_i | t̄)
where t̄ denotes the absence of term t.
Information Gain (IG)…
• Given a training corpus, IG is computed for each
unique term, and terms whose IG is below a
predetermined threshold are removed from the
feature space.
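
A minimal Python sketch of this computation, directly following the formula above (the function names and the use of natural logarithms are my choices; the paper does not fix a log base):

```python
import math
from collections import Counter

def information_gain(docs, labels, term):
    """G(t) as defined on the previous slide; docs are token lists and
    labels[i] is the class of docs[i]."""
    n = len(docs)
    has_t = [term in set(d) for d in docs]
    p_t = sum(has_t) / n                      # Pr(t)

    def sum_p_log_p(counts, total):
        # Σ Pr(c_i | ·) log Pr(c_i | ·); zero when no documents qualify
        return sum((k / total) * math.log(k / total)
                   for k in counts.values()) if total else 0.0

    c_all = Counter(labels)
    c_with = Counter(l for l, h in zip(labels, has_t) if h)
    c_without = Counter(l for l, h in zip(labels, has_t) if not h)

    return (-sum_p_log_p(c_all, n)
            + p_t * sum_p_log_p(c_with, sum(c_with.values()))
            + (1 - p_t) * sum_p_log_p(c_without, sum(c_without.values())))
```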
Mutual Information (MI)
• Each word is ranked according to its mutual
information with respect to the class labels.
• The mutual information criterion is defined as:
I(t, c) = log [ Pr(t ∧ c) / (Pr(t) · Pr(c)) ]
• Category-specific scores are often combined as:
I_avg(t) = Σ_{i=1..m} Pr(c_i) · I(t, c_i)
I_max(t) = max_{i=1..m} I(t, c_i)
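
A minimal Python sketch using the same count-based estimates (the function names are mine; note how a single zero co-occurrence drives I_avg to −inf, one facet of MI's fragility on rare terms):

```python
import math

def mutual_information(docs, labels, term, cls):
    """I(t, c) = log[ Pr(t ∧ c) / (Pr(t) · Pr(c)) ], estimated from counts."""
    n = len(docs)
    has_t = [term in set(d) for d in docs]
    p_t = sum(has_t) / n
    p_c = sum(l == cls for l in labels) / n
    p_tc = sum(h and l == cls for h, l in zip(has_t, labels)) / n
    return math.log(p_tc / (p_t * p_c)) if p_tc else float("-inf")

def mi_combined(docs, labels, term):
    """I_avg and I_max over the category-specific scores."""
    classes = set(labels)
    pr_c = {c: sum(l == c for l in labels) / len(labels) for c in classes}
    scores = {c: mutual_information(docs, labels, term, c) for c in classes}
    return (sum(pr_c[c] * scores[c] for c in classes),  # I_avg
            max(scores.values()))                       # I_max
```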
χ² statistic (CHI)
• The χ² statistic measures the lack of
independence between a term t and a category c.
• The χ² statistic is known to be unreliable for
low-frequency terms.
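
The paper estimates χ²(t, c) from the 2×2 contingency table of term presence versus category membership, via N · (AD − CB)² / ((A+C)(B+D)(A+B)(C+D)). A minimal sketch; returning 0.0 for a degenerate table is my choice:

```python
def chi_square(docs, labels, term, cls):
    """χ²(t, c) from the 2×2 table:
    A: t present & in c,   B: t present & not in c,
    C: t absent  & in c,   D: t absent  & not in c."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present, in_cls = term in set(doc), label == cls
        if present and in_cls:
            a += 1
        elif present:
            b += 1
        elif in_cls:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0
```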
Term Strength (TS)
• This method estimates term importance based on
how commonly a term is likely to appear in
‘closely-related’ documents.
• It uses a training set of documents to derive
document pairs whose similarity is above a
threshold.
• This criterion is based on document clustering,
assuming that documents with many shared words
are related, and that terms in the heavily
overlapping area of related documents are
relatively informative.
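
Term strength is defined as s(t) = Pr(t ∈ y | t ∈ x) over related document pairs (x, y). A minimal Python sketch; the cosine measure on binary term sets and the 0.25 threshold are illustrative assumptions, since the paper tunes the relatedness threshold on the training set:

```python
from itertools import combinations

def term_strength(docs, term, sim_threshold=0.25):
    """Estimate s(t) = Pr(t in y | t in x) over related pairs (x, y)."""
    def cosine(x, y):
        if not x or not y:
            return 0.0
        return len(x & y) / (len(x) ** 0.5 * len(y) ** 0.5)

    sets = [set(d) for d in docs]
    cond = total = 0
    for x, y in combinations(sets, 2):
        if cosine(x, y) < sim_threshold:
            continue                             # not a related pair
        for first, second in ((x, y), (y, x)):   # relatedness is symmetric
            if term in first:
                total += 1
                cond += term in second
    return cond / total if total else 0.0
```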
Conclusion
• IG and CHI were found most effective for
aggressive term removal without loss of
categorization accuracy, in experiments with kNN
and LLSF (Linear Least Squares Fit) classifiers on
the Reuters-22173 and OHSUMED collections.
• DF was found comparable to IG and CHI with up to
90% term removal, while TS was comparable only
up to 50–60% removal.
• MI showed inferior performance, owing to its bias
toward rare terms and its sensitivity to probability
estimation errors.