A Probabilistic Analysis of the
Rocchio Algorithm with TFIDF
for Text Categorization
Thorsten Joachims
Carnegie Mellon University
Presented by Ning Kang
Outline
Summary
Text Categorization
Learning Methods for Text Categorization
PrTFIDF: A Probabilistic Classifier Derived
from TFIDF
Experiments and Results
Conclusions
Summary
A probabilistic analysis of the Rocchio relevance feedback algorithm is presented in a text categorization framework. The analysis results in a probabilistic version of the Rocchio classifier and offers an explanation for the TFIDF word weighting heuristic. The Rocchio classifier, its probabilistic variant, and a standard naive Bayes classifier are compared on three text categorization tasks.
Text Categorization
The goal of text categorization is the
classification of documents into a fixed
number of predefined categories.
The working definition used throughout
this paper assumes that each document
d is assigned to exactly one category.
The formal definition used throughout this paper is as follows.
Formal definition
• A set of classes C and a set of training documents D.
• A target concept T : D → C which maps documents to a class. T(d) is known for the documents in the training set.
• Through supervised learning, the information contained in the training examples can be used to find a model H : D → C which approximates T.
• H(d) is the class to which the learned hypothesis assigns document d; it can be used to classify new documents.
• The objective is to find a hypothesis which maximizes accuracy.
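To make the formal setup concrete, here is a minimal sketch of the objects just defined; all names are illustrative and not from the paper.

# Sketch of the text categorization setup: a target concept T labels the
# training documents, a learned hypothesis H maps documents to classes, and
# the objective is to maximize accuracy. Names are illustrative only.
from typing import Callable, Dict, List

Document = str                                  # a document d
Category = str                                  # a class C
Hypothesis = Callable[[Document], Category]     # H : D -> C

def accuracy(hypothesis: Hypothesis,
             documents: List[Document],
             target: Dict[Document, Category]) -> float:
    """Fraction of documents for which H(d) agrees with the target concept T(d)."""
    correct = sum(1 for d in documents if hypothesis(d) == target[d])
    return correct / len(documents)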
Learning Methods for Text
Categorization
Learning algorithms:
• TFIDF Classifier
• Naive Bayes Classifier
Bag-of-words representation
Feature selection: a combination of the three methods below is used to find a subset of words which helps to discriminate between classes.
• Pruning of infrequent words.
• Pruning of high-frequency words (removes non-content words).
• Choosing words that have high mutual information with the target concept.
Mutual information
E(X) is the entropy of the random variable X. Pr(T(d)=C) is the probability that an arbitrary article d is in category C. Pr(T(d)=C, w=1) and Pr(T(d)=C, w=0) are the probabilities that article d is in category C and does or does not contain the word w.
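The mutual-information formula itself does not survive in this transcript; in the standard information-theoretic form it is I(T; w) = E(T) - [Pr(w=1)·E(T | w=1) + Pr(w=0)·E(T | w=0)], the reduction in class entropy obtained by observing whether w occurs in a document. Below is a minimal sketch, with illustrative names and thresholds, of the three-step feature selection described on the previous slide: prune infrequent words, prune very frequent (non-content) words, then keep the words with the highest mutual information with the target concept.

import math
from collections import Counter

def entropy(counts):
    """Entropy of a discrete distribution given by raw (non-negative) counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def select_features(docs, labels, min_df=3, max_df_ratio=0.5, k=100):
    """docs: list of token lists; labels: parallel list of class labels.
    The thresholds min_df, max_df_ratio and k are illustrative choices."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    # Steps 1 and 2: prune infrequent words and very frequent (non-content) words.
    vocab = [w for w, c in df.items() if c >= min_df and c / n <= max_df_ratio]

    class_counts = Counter(labels)
    base_entropy = entropy(class_counts.values())            # E(T)

    scored = []
    for w in vocab:
        with_w = Counter(lab for d, lab in zip(docs, labels) if w in d)
        without_w = class_counts - with_w
        p_w = sum(with_w.values()) / n
        cond = p_w * entropy(with_w.values()) + (1 - p_w) * entropy(without_w.values())
        scored.append((base_entropy - cond, w))               # I(T; w) = E(T) - E(T | w)

    # Step 3: keep the k words with the highest mutual information.
    return [w for _, w in sorted(scored, reverse=True)[:k]]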
Top 15 words with the highest mutual
information for topic “wheat” in Reuters
wheat, tonnes, agriculture, grain, usda, washington, department, soviet, export, corn, crop, cts, inc, winter, company
TF-IDF Classifier
The TFIDF classifier is based on the relevance feedback algorithm introduced by Rocchio [Rocchio, 1971] for the vector space retrieval model [Salton, 1991]. It is a nearest-neighbor learning method with prototype vectors, using the TFIDF [Salton, 1991] word weighting heuristic.
TF-IDF Classifier
Term frequency TF(wi, d) is the number of times word wi occurs in document d.
Document frequency DF(wi) is the number of documents in which word wi occurs at least once.
The inverse document frequency IDF(wi) can be calculated from the document frequency:
IDF(wi) = log( |D| / DF(wi) )
Word weight: d(i) = TF(wi, d) · IDF(wi)
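As a concrete illustration (the function name and corpus format are mine, not the paper's), these quantities can be computed for a tokenized corpus as follows:

import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: TF*IDF weight} dict per document."""
    n_docs = len(docs)
    # DF(w): number of documents in which w occurs at least once.
    df = Counter(w for d in docs for w in set(d))
    # IDF(w) = log(|D| / DF(w))
    idf = {w: math.log(n_docs / c) for w, c in df.items()}
    vectors = []
    for d in docs:
        tf = Counter(d)                                # TF(w, d)
        vectors.append({w: tf[w] * idf[w] for w in tf})
    return vectors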
TF-IDF
This word weighting means that a word wi is an important indexing term for document d if it occurs frequently in it (its TF is high). On the other hand, words which occur in many documents are rated as less important indexing terms due to their low IDF.
Prototype vector
Learning is achieved by combining the document vectors into a prototype vector c⃗ for each class Cc:
c⃗ = Σ_{d ∈ Cc} d⃗
The resulting set of prototype vectors, one vector for each class Cc, represents the learned model.
TF-IDF Classifier
To classify a new document d', the cosine of its document vector with the prototype vector of each class is calculated. The new document is assigned to the class with which its document vector has the highest cosine.
The cosine measures the angle between the vector of the document being classified and the prototype vector of each of the classes.
H_TFIDF(d') = argmax_{Cc} cos(d⃗', c⃗)
TF-IDF Classifier: Summary
Decision rule of this classifier: H_TFIDF(d') = argmax_{Cc} cos(d⃗', c⃗)
Naïve Bayes Classifier
The naive Bayes classifier is based on the bag-of-words representation. This algorithm uses probabilistic models to estimate the likelihood that a given document is in a class, and then uses these probability estimates for decision making.
Naïve Bayes Classifier
Assumption: words are assumed to occur independently of the other words in the document.
Bayes' rule [James, 1985] says that to achieve the highest classification accuracy, d' should be assigned to the class for which Pr(C | d') is highest:
H_BAYES(d') = argmax_{Cc} Pr(C | d')
Naïve Bayes Classifier
Decision rule for the naive Bayes classifier:
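The concrete decision rule and its probability estimates are not reproduced in the transcript. A common form, consistent with the word-independence assumption above, scores each class by its prior times the product of per-word probabilities; the Laplace-smoothed estimator below is an illustrative assumption, not necessarily the paper's exact one.

import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """docs: list of token lists. Returns class priors Pr(C) and smoothed Pr(w | C)."""
    class_docs = Counter(labels)
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    word_counts = defaultdict(Counter)
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    vocab = {w for d in docs for w in d}
    cond = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        # Laplace smoothing so unseen words do not zero out the product.
        cond[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond

def classify_bayes(doc, priors, cond):
    """argmax_C [log Pr(C) + sum_w TF(w, d) * log Pr(w | C)], skipping out-of-vocabulary words."""
    def score(c):
        s = math.log(priors[c])
        for w in doc:
            if w in cond[c]:
                s += math.log(cond[c][w])
        return s
    return max(priors, key=score)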
PrTFIDF: A Probabilistic Classifier Derived from TFIDF
PrTFIDF is the TFIDF classifier analyzed in a probabilistic framework, which offers an elegant way to distinguish between a document and its representation.
• A function Θ maps the document to its representation.
• The classifier uses this representation for decision making.
H_PrTFIDF(d') = argmax_{Cc} Pr(C | d', Θ)
PrTFIDF Algorithm
Pr(C|ď,) can be written in two parts.
Pr(C | d ' , )   Pr(C | x) Pr(x | d ' , )
x
Pr(x|ď,) maps document ď to its representation x with a
certain probability according to .Pr(C|x) is the probability
that document with representation x in class C. In
particular,documents will be represented by single words in
design choice with documents representation mapping .
So when x=w, Pr(x|ď,) = Pr(w|ď,)
The PrTFIDF Algorithm
The resulting decision rule for PrTFIDF is obtained by expanding the sum above, where Pr(C | w) is computed by using Bayes' theorem.
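The resulting formula is not reproduced in the transcript. A minimal sketch of the structure just described follows: each word of d' contributes with probability Pr(w | d') = TF(w, d') / |d'|, and Pr(C | w) is obtained via Bayes' theorem from per-class word distributions and class priors (for example those returned by the train_naive_bayes sketch above); names and estimators are illustrative assumptions, not the paper's exact definitions.

from collections import Counter

def classify_prtfidf(doc, priors, cond):
    """Score(C) = sum over words w of Pr(w | d') * Pr(C | w), where
    Pr(w | d') = TF(w, d') / |d'| and, by Bayes' theorem,
    Pr(C | w) = Pr(w | C) * Pr(C) / sum_C' Pr(w | C') * Pr(C')."""
    tf = Counter(doc)
    length = len(doc)

    def score(c):
        s = 0.0
        for w, count in tf.items():
            evidence = sum(cond[c2].get(w, 0.0) * priors[c2] for c2 in priors)
            if evidence > 0:
                s += (count / length) * cond[c].get(w, 0.0) * priors[c] / evidence
        return s

    return max(priors, key=score)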
The equivalence between TFIDF and
PrTFIDF
Assumptions to achieve the equivalence:
• Uniform class priors: each class contains an equal number of documents.
• There exists a constant α such that, for all classes, the Euclidean length of the prototype vector is a linear function of the number of words in that class.
Assumptions to achieve the equivalence
A refined version of IDF(w), denoted IDF'(w), is suggested by the PrTFIDF algorithm. Differences from the standard IDF:
• It uses the relative term frequency instead of the mere occurrence of the word.
• It uses the square root instead of the logarithm; both functions are similar in shape and reduce the impact of high document frequencies.
The Connection between TFIDF and
PrTFIDF
The PrTFIDF decision rule can be transformed into a formula which has the form of the TFIDF decision rule.
Implication of the Analysis
The analysis shows how and under which preconditions the TFIDF classifier fits into a probabilistic framework. The close relationship to the probabilistic classifier offers a theoretical justification for the vector space model and the TFIDF word weighting heuristic. Refinements suggested by the analysis:
• Use of prior probabilities Pr(C).
• Use of IDF'(w) for word weighting instead of IDF(w).
• Use of the number of words for normalization instead of the Euclidean length.
Experiments
Newsgroups dataset: 20 newsgroups with 1000 documents each, for a total of 20,000 documents in this collection. The results reported on this dataset are averaged over a number of random test/training splits, using binomial sign tests to estimate significance. In each experiment, 33% of the data was used for testing.
Reuters dataset
The collection contains 21,450 articles which are classified into 135 topic categories. Each article can have multiple category labels:
• 31% have no category label
• 57% have exactly one label
• 12% have more than one and up to 12 class labels assigned
Reuters dataset
The distribution of members over the classes is very uneven. The 20 most frequently used topics are shown on this slide.
“acq” & “wheat” in Reuters
The category “acq” is the one with the second most documents in it.
The “wheat” category has a very narrow definition; there is a small number of words which are very good clues as to whether a document is in this category or not.
14,704 documents are used for training and 6,746 for testing.
Experimental Results
Maximum accuracy in percent for the best parameter settings:

           20 Newsgroups   Reuters “acq”   Reuters “wheat”
PrTFIDF    90.3            89.3            95.6
Bayes      88.6            89.3            95.6
TFIDF      82.3            87.9            94.0
Results: 20 Newsgroups (accuracy vs. training set size)
Results: Reuters “acq” (accuracy vs. training set size)
Results: Reuters “wheat” (accuracy vs. training set size)
Results: 20 Newsgroups (accuracy vs. feature vector size)
Results: Reuters “acq” (accuracy vs. feature vector size)
Results: Reuters “wheat” (accuracy vs. feature vector size)
Result analysis
• How does the number of training examples influence accuracy?
--As expected, the accuracy increases with the number of training examples. PrTFIDF, Bayes and TFIDF differ in how quickly the accuracy increases.
--For the newsgroup data, PrTFIDF performs well for small numbers of training examples, in contrast to Bayes. The accuracy of the TFIDF classifier increases less quickly than for the probabilistic methods.
--For the Reuters category “acq”, Bayes and PrTFIDF are nearly identical; TFIDF is significantly below those two probabilistic methods. There is no big difference between the three methods for the Reuters category “wheat”.
Number of Features vs. Accuracy
What is the influence of the number of features on the accuracy?
--For the newsgroup data, keeping the number of training examples at the maximum, the performance of the system is higher the more features are used. PrTFIDF and Bayes are significantly above TFIDF. The overall highest performance is achieved using PrTFIDF with the largest feature set.
--The findings on the Reuters category “acq” are similar.
--The Reuters category “wheat” shows different characteristics: for the probabilistic methods the accuracy does not rise with the number of words used. The highest performance is achieved by PrTFIDF and Bayes when only the minimum number of 10 words is used.
Special findings for “wheat”
--The findings for the “wheat” category are probably due to the different properties of this task. Since the definition of the category “wheat” is narrower than the definitions of the other categories, the single word “wheat” is a nearly perfect predictor of class membership. This explains why a small number of words can achieve maximum performance: adding more words adds noise, since those words have lower predictive power.
Which method is most robust for small numbers of training examples?
With a rising number of training examples, the performance of Bayes approaches that of PrTFIDF. Bayes becomes less accurate for large word-vector sizes. Experiments on the newsgroups data have shown that the smaller the size of the word vector, the fewer training examples are needed for Bayes to exceed the performance of PrTFIDF. This is because Bayes is very sensitive to the inaccurate probability estimates which arise with a low number of training examples.
Conclusion
Although the TFIDF method showed reasonable accuracy on all classification tasks, the two probabilistic methods, Bayes and PrTFIDF, showed substantial performance improvements on all three tasks. These empirical results suggest that a probabilistically founded model is preferable to the heuristic TFIDF model. The probabilistic methods are also preferable from a theoretical viewpoint, since a probabilistic framework allows the simplifying assumptions that are made to be stated clearly and understood more easily.