Text Categorization
Moshe Koppel
Lecture 2: Naïve Bayes
Slides based on Manning, Raghavan and Schutze
Naïve Bayes: Why Bother?
• Tightly tied to text categorization
• Interesting theoretical properties
• A simple example of an important class of learners based on generative models that approximate how data is produced
• For certain special cases, NB is the best thing you can do
Bayes’ Rule
P(C , X )  P(C | X ) P( X )  P( X | C ) P(C )
P( X | C ) P(C )
P(C | X ) 
P( X )
Maximum a Posteriori Hypothesis

h_MAP = argmax_{h ∈ H} P(h | D)
      = argmax_{h ∈ H} P(D | h) P(h) / P(D)
      = argmax_{h ∈ H} P(D | h) P(h)      (as P(D) is constant)
Maximum Likelihood Hypothesis
If all hypotheses are a priori equally likely, we only need to consider the P(D | h) term:

h_ML = argmax_{h ∈ H} P(D | h)
Naive Bayes Classifiers
Task: Classify a new instance D, described by a tuple of attribute values D = ⟨x_1, x_2, …, x_n⟩, into one of the classes c_j ∈ C.

c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, …, x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j) / P(x_1, x_2, …, x_n)
      = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) P(c_j)
Naïve Bayes Classifier: Naïve Bayes Assumption
• P(c_j)
  – Can be estimated from the frequency of classes in the training examples.
• P(x_1, x_2, …, x_n | c_j)
  – O(|X|^n · |C|) parameters
  – Could only be estimated if a very, very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption:
• Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(x_i | c_j), as written out below.
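Written out in the notation above, the assumption is:

  P(x_1, x_2, …, x_n | c_j) = ∏_i P(x_i | c_j)

This cuts the number of parameters to estimate from O(|X|^n · |C|) down to O(n · |X| · |C|).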
Smoothing to Avoid Overfitting
P̂(x_i | c_j) = ( N(X_i = x_i, C = c_j) + 1 ) / ( N(C = c_j) + k )

where k is the number of values of X_i.

• Somewhat more subtle version:

P̂(x_{i,k} | c_j) = ( N(X_i = x_{i,k}, C = c_j) + m·p_{i,k} ) / ( N(C = c_j) + m )

where p_{i,k} is the overall fraction of the data in which X_i = x_{i,k}, and m controls the extent of smoothing.
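A minimal sketch of the two estimators in Python, assuming the counts have already been collected (the function and argument names are illustrative, not from the slides):

    def laplace_estimate(n_xi_cj, n_cj, k):
        # Add-one smoothing: k = number of values attribute X_i can take.
        return (n_xi_cj + 1) / (n_cj + k)

    def m_estimate(n_xik_cj, n_cj, p_ik, m):
        # p_ik = overall fraction of the data where X_i = x_ik;
        # m controls the extent of smoothing.
        return (n_xik_cj + m * p_ik) / (n_cj + m)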
Naive Bayes for Text Categorization
• Attributes are text positions, values are words.
c NB  argmax P (c j ) P ( xi | c j )
c jC
i
 argmax P (c j ) P ( x1 " our"| c j )  P ( xn " text"| c j )
c jC
• Still too many possibilities
• Assume that classification is independent of the
positions of the words
– Use same parameters for each position
– Result is bag of words model (over tokens not
types)
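As a concrete illustration (with a made-up five-token string), a bag of words over tokens is just a multiset of counts:

    from collections import Counter

    tokens = "our text is our text".split()
    bag = Counter(tokens)
    # Counter({'our': 2, 'text': 2, 'is': 1}) -- positions are discarded,
    # but every token occurrence still counts.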
Naïve Bayes: Learning
• From training corpus, extract Vocabulary
• Calculate required P(c_j) and P(x_k | c_j) terms
  – For each c_j in C do
    • docs_j ← subset of documents for which the target class is c_j
    • P(c_j) ← |docs_j| / |total # documents|
    • Text_j ← single document containing all docs_j
    • for each word x_k in Vocabulary
      – n_k ← number of occurrences of x_k in Text_j
      – P(x_k | c_j) ← (n_k + α) / (n + α·|Vocabulary|), where n is the total number of word occurrences in Text_j
Naïve Bayes: Classifying
• positions  all word positions in current document
which contain tokens found in Vocabulary
• Return cNB, where
cNB  argmaxP(c j )
c jC
 P( x | c )
i positions
i
j
Underflow Prevention
• Multiplying lots of probabilities, which are
between 0 and 1 by definition, can result in
floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to
perform all computations by summing logs of
probabilities rather than multiplying probabilities.
• Class with highest final un-normalized log
probability score is still the most probable.
cNB  argmaxlog P(c j ) 
c jC
 log P( x | c )
i positions
i
j
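A matching classification sketch that scores in log space, reusing the model returned by the training sketch earlier (names are illustrative):

    import math

    def classify_nb(doc, vocabulary, priors, cond_probs):
        # doc: list of tokens; only positions whose tokens are in Vocabulary count.
        positions = [w for w in doc if w in vocabulary]
        best_class, best_score = None, float("-inf")
        for c in priors:
            score = math.log(priors[c]) + sum(math.log(cond_probs[c][w]) for w in positions)
            if score > best_score:
                best_class, best_score = c, score
        return best_class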
Naïve Bayes as Stochastic Language Models
• Model probability of generating strings (each word in turn) in the language (commonly all strings over ∑). E.g., unigram model:

  Model M
  the     0.2
  a       0.1
  man     0.01
  woman   0.01
  said    0.03
  likes   0.02
  …

  s:  the    man    likes   the    woman
      0.2    0.01   0.02    0.2    0.01

  Multiplying: P(s | M) = 0.00000008
Naïve Bayes as Stochastic Language Models
• Model probability of generating any string

  word       Model M1   Model M2
  the        0.2        0.2
  class      0.01       0.0001
  sayst      0.0001     0.03
  pleaseth   0.0001     0.02
  yon        0.0001     0.1
  maiden     0.0005     0.01
  woman      0.01       0.0001

  s:    the     class    pleaseth   yon      maiden
  M1:   0.2     0.01     0.0001     0.0001   0.0005
  M2:   0.2     0.0001   0.02       0.1      0.01

P(s | M2) > P(s | M1)
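Working through the products for the example string "the class pleaseth yon maiden":

  P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005 = 1 × 10⁻¹⁴
  P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01 = 4 × 10⁻¹⁰

so the string is roughly 40,000 times more probable under M2.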
Unigram and higher-order models
Writing a four-word string as w1 w2 w3 w4:

P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)

• Unigram Language Models
  P(w1 w2 w3 w4) ≈ P(w1) P(w2) P(w3) P(w4)
  Easy. Effective!
• Bigram (generally, n-gram) Language Models
  P(w1 w2 w3 w4) ≈ P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
Smoothing and Backoff
• Suppose we’re using a trigram model. We need to estimate P(w3 | w1, w2).
• It will often be the case that the trigram w1, w2, w3 is rare or non-existent in the training corpus. (Similar to the problem we saw above with unigrams.)
• First resort: backoff. Estimate P(w3 | w1, w2) using P(w3 | w2) instead (see the sketch below).
• Alternatively, use some very large backup corpus.
• Various combinations have been tried.
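A rough sketch of the backoff idea in Python, assuming the n-gram counts live in plain dictionaries (this ignores the discounting and normalization that a full scheme such as Katz backoff would add):

    def backoff_prob(w1, w2, w3, tri, bi, uni, n_tokens):
        # Estimate P(w3 | w1, w2); tri, bi, uni map n-grams/words to counts.
        if tri.get((w1, w2, w3), 0) > 0 and bi.get((w1, w2), 0) > 0:
            return tri[(w1, w2, w3)] / bi[(w1, w2)]   # trigram estimate
        if bi.get((w2, w3), 0) > 0 and uni.get(w2, 0) > 0:
            return bi[(w2, w3)] / uni[w2]             # back off to the bigram
        return uni.get(w3, 0) / n_tokens              # back off to the unigram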
Multinomial Naïve Bayes = class-conditional language model

[Figure: a class node Cat with arrows to word nodes w1 … w6]

• Think of w_i as the i-th word in the document
• Effectively, the probability of a document given each class is computed by a class-specific unigram language model
But Wait! Another Approach

[Figure: the same graphical model, a class node Cat with word nodes w1 … w6]

• Now think of w_i as the i-th word in the dictionary (not the document)
• Each value is either 1 (in the doc) or 0 (not)

This is very different from the multinomial method. McCallum and Nigam (1998) observed that the two were often confused.
Binomial Naïve Bayes
• One feature X_w for each word in the dictionary
• X_w = true in document d if w appears in d (a tiny illustration follows below)
• Naive Bayes assumption: given the document’s topic, the appearance of one word in the document tells us nothing about the chances that another word appears
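A tiny illustration of the 0/1 representation (the five-word dictionary is made up for the example):

    # Hypothetical dictionary; each document becomes a vector of 0/1 features X_w.
    vocab = ["the", "class", "pleaseth", "yon", "maiden"]
    doc = "the class pleaseth the class".split()
    x = [1 if w in doc else 0 for w in vocab]
    # x == [1, 1, 1, 0, 0]; repeated occurrences of "the" and "class" don't matter.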
Parameter Estimation
• Binomial model:
  P̂(X_w = t | c_j) = fraction of documents of topic c_j in which word w appears
• Multinomial model:
  P̂(X_i = w | c_j) = fraction of times in which word w appears across all documents of topic c_j
  – Can create a mega-document for topic j by concatenating all documents in this topic
  – Use frequency of w in mega-document
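The two estimators side by side, as a minimal Python sketch; topic_docs is assumed to be a list of token lists for one topic, and the function names are illustrative:

    def binomial_estimate(word, topic_docs):
        # P(X_w = true | c_j): fraction of documents of the topic containing the word.
        return sum(1 for d in topic_docs if word in d) / len(topic_docs)

    def multinomial_estimate(word, topic_docs):
        # P(X_i = w | c_j): frequency of the word in the "mega-document"
        # built by concatenating all documents of the topic.
        mega_document = [w for d in topic_docs for w in d]
        return mega_document.count(word) / len(mega_document)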
Experiment: Multinomial vs Binomial
• M&N (1998) did some experiments to see which
is better
• Determine if a university web page is {student,
faculty, other_stuff}
• Train on ~5,000 hand-labeled web pages
– Cornell, Washington, U.Texas, Wisconsin
• Crawl and classify a new site (CMU)
Multinomial vs. Binomial
Conclusions
• Multinomial is better
• For Binomial, it’s really important to do
feature filtering
• Other experiments bear out these
conclusions
Feature Filtering
• If irrelevant words mess up the results, let’s
try to use only words that might help
• In training set, choose k words which best
discriminate the categories.
• Best way to choose: for each category build
a list of j most discriminating terms
Infogain
• Use terms with maximal Mutual Information with the classes:

  I(w, c) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} p(e_w, e_c) log [ p(e_w, e_c) / ( p(e_w) p(e_c) ) ]

  – for each word w and each category c
(This is equivalent to the usual two-class Infogain formula.)
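A sketch of the computation from the four document counts (the argument names are illustrative; zero cells are skipped since 0·log(·) terms contribute nothing):

    import math

    def infogain(n_pres_in, n_pres_out, n_abs_in, n_abs_out):
        # n_pres_in: docs containing w and in class c; n_abs_out: docs without w
        # and outside c; etc. Implements the double sum over e_w, e_c above.
        n = n_pres_in + n_pres_out + n_abs_in + n_abs_out
        cells = [
            (n_pres_in,  n_pres_in + n_pres_out, n_pres_in + n_abs_in),
            (n_pres_out, n_pres_in + n_pres_out, n_pres_out + n_abs_out),
            (n_abs_in,   n_abs_in + n_abs_out,   n_pres_in + n_abs_in),
            (n_abs_out,  n_abs_in + n_abs_out,   n_pres_out + n_abs_out),
        ]
        return sum((n_wc / n) * math.log(n * n_wc / (n_w * n_c))
                   for n_wc, n_w, n_c in cells if n_wc > 0)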
Chi-Square Feature Selection
2×2 contingency table of term and category:

  A = # documents in the category that contain the term
  B = # documents in the category that do not contain the term
  C = # documents not in the category that contain the term
  D = # documents not in the category that do not contain the term

X² = N (AD − BC)² / ( (A+B) (A+C) (B+D) (C+D) ),  where N = A + B + C + D

For complete independence of term and category: AD = BC
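The same statistic as a small Python helper, with argument names following the A, B, C, D cells above:

    def chi_square(a, b, c, d):
        # a, b, c, d are the four contingency-table counts; N = a + b + c + d.
        n = a + b + c + d
        denominator = (a + b) * (a + c) * (b + d) * (c + d)
        return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0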
Feature Selection
• Many other measures of differentiation have
been tried.
• Empirical tests suggest Infogain works best.
• Simply eliminating rare terms is easy and usually doesn’t do much harm.
• Be sure not to use test data when you do
feature selection. (This is tricky when
you’re using k-fold cross-validation.)
Naïve Bayes: Conclusions
• Classification results of naïve Bayes (the class with
maximum posterior probability) are usually fairly
accurate, though not nearly as good as, say, SVM.
• However, due to the inadequacy of the conditional independence assumption, the actual posterior probability estimates are not.
  – Output probabilities are generally very close to 0 or 1.
Some Good Things about NB
• Theoretically optimal if the independence
assumptions hold
• Fast
• Sort of robust to irrelevant features (but not really)
• Very good in domains with many equally important
features
• Probably the only method useful for very short test documents (Why?)