Transcript of Lecture Slides

Text Categorization
PengBo
10/31/2010
Outline of This Lecture

Text Categorization

- Problem definition
- Build a Classifier
  - Naïve Bayes Classifier
  - K-Nearest Neighbor Classifier
- Evaluation
Definition

Given:

- An instance x ∈ X, where X is the instance language or instance space.
  - Issue: how to represent text documents.
- A fixed set of categories C = {c1, c2, …, cn}

Determine:

- The category of x: c(x) ∈ C, where c(x) is a categorization function.
- We want to know how to build categorization functions ("classifiers").
Text Categorization Examples
Assign labels to each document or web-page:

- Labels are most often topics such as Yahoo-categories
  - e.g., "finance," "sports," "news>world>asia>business"
- Labels may be genres
  - e.g., "editorials," "movie-reviews," "news"
- Labels may be opinions
  - e.g., "like," "hate," "neutral"
- Labels may be domain-specific binary
  - e.g., "interesting-to-me" : "not-interesting-to-me"
  - e.g., "spam" : "not-spam"
  - e.g., "contains adult language" : "doesn't"
Classification Methods

- Manual classification
  - Used by Yahoo!, Looksmart, about.com, ODP, Medline
  - Accurate but expensive to scale
- Automatic document classification
  - Rule-based: hand-coded rule-based systems
    - Spam mail filters, …
  - Supervised learning of a document-label assignment function
    - No free lunch: requires hand-classified training data
- Note that many commercial systems use a mixture of methods
Think about it…

- How to represent text documents and categories?
  - Vectors & Regions
  - Strings & Language (models)
- How to build categorization functions?
  - Closeness/similarity to regions
  - Probability of generating the string under a language model
K-Nearest Neighbors
Classes in a Vector Space
[Figure: training documents as points in a vector space, grouped into regions labeled Government, Science, and Arts]
Classification Using Vector Spaces



- Each training doc is a point (vector) labeled by its topic (= class)
- Hypothesis: docs of the same class form a contiguous region of space
- We define surfaces to delineate classes in space
Test Document = Government

Is the similarity hypothesis true in general?

[Figure: a test document located inside the Government region of the vector space, with regions labeled Government, Science, and Arts]
k Nearest Neighbor Classification





To classify document d into class c:

- Define the k-neighborhood N as the k nearest neighbors of d
- Count the number i of documents in N that belong to c
- Estimate P(c|d) as i/k
- Choose as class argmax_c P(c|d)  [= majority class]

Example: k = 6 (6NN). What is P(science | test doc)?

[Figure: a test document and its 6 nearest neighbors in the vector space, with regions labeled Government, Science, and Arts]
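A minimal sketch of this voting rule in Python (not from the slides; it assumes documents are already represented as sparse term-weight dictionaries and uses the cosine metric discussed on a later slide):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts mapping term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(d, train, k=3):
    """train: list of (vector, label) pairs. Returns (majority class, P(c|d) estimates)."""
    neighbors = sorted(train, key=lambda ex: cosine(d, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    p_est = {c: n / k for c, n in votes.items()}   # P(c|d) estimated as i/k
    return votes.most_common(1)[0][0], p_est
```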
Nearest-Neighbor Learning Algorithm

- Learning is just storing the representations of the training examples in D.
- Testing instance x:
  - Compute the similarity between x and all examples in D.
  - Assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes.
- Also called:
  - Case-based learning
  - Memory-based learning
  - Lazy learning
Why K?

Using only the closest example to determine the categorization is subject to errors due to:

- A single atypical example.
- Noise (i.e., an error) in the category label of a single training example.

A more robust alternative is to find the k most similar examples and return the majority category of these k examples. The value of k is typically odd to avoid ties; 3 and 5 are most common.
kNN decision boundaries
Boundaries are in principle arbitrary surfaces, but usually polyhedra.

[Figure: piecewise-linear kNN decision boundaries between the Government, Science, and Arts regions]
Similarity Metrics

The nearest-neighbor method depends on a similarity (or distance) metric.

- Simplest for a continuous m-dimensional instance space: Euclidean distance.
- Simplest for an m-dimensional binary instance space: Hamming distance (the number of feature values that differ).
- For text, cosine similarity of tf.idf-weighted vectors is typically most effective (see the sketch below).
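A short sketch of the text case, using one common tf.idf variant (weighting schemes differ between systems, and the helper names here are illustrative):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns a dict (term -> tf.idf weight) per doc."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    def weight(tf, t):
        return (1 + math.log(tf)) * math.log(n / df[t])
    return [{t: weight(tf, t) for t, tf in Counter(doc).items()} for doc in docs]

def unit_length(v):
    """Normalize a sparse vector to unit (L2) length."""
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()} if norm else v
```

With unit-length vectors, cosine similarity reduces to a plain dot product, which is what the inverted-index trick below exploits.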
Illustration of 3 Nearest Neighbors for a Text Vector Space
Nearest Neighbor with an Inverted Index

- Naively finding nearest neighbors requires a linear search through the |D| documents in the collection.
- But determining the k nearest neighbors is the same as determining the top-k best retrievals, using the test document as a query against a database of training documents.
- So use standard vector-space inverted-index methods to find the k nearest neighbors.
- Testing time: O(B|Vt|), where B is the average number of training documents in which a test-document word appears.
  - Typically B << |D|
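A toy sketch of this retrieval view (an in-memory postings list with unweighted dot-product scoring; real systems add tf.idf weighting, length normalization, and compression):

```python
from collections import Counter, defaultdict

def build_index(train_docs):
    """train_docs: list of (token_list, label). Returns term -> [(doc_id, tf), ...]."""
    postings = defaultdict(list)
    for doc_id, (tokens, _) in enumerate(train_docs):
        for term, tf in Counter(tokens).items():
            postings[term].append((doc_id, tf))
    return postings

def knn_via_index(query_tokens, postings, train_docs, k=3):
    """Score only training docs sharing a term with the query, then majority-vote."""
    scores = Counter()
    for term, qtf in Counter(query_tokens).items():
        for doc_id, tf in postings.get(term, []):
            scores[doc_id] += qtf * tf
    top = [doc_id for doc_id, _ in scores.most_common(k)]
    return Counter(train_docs[d][1] for d in top).most_common(1)[0][0]
```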
kNN: Discussion



- No training necessary
- No feature selection necessary
- Scales well with a large number of classes
  - Don't need to train n classifiers for n classes
- Classes can influence each other
  - Small changes to one class can have a ripple effect
- Scores can be hard to convert to probabilities
Naïve Bayes
Bayes Classifiers
Task: classify a new instance D, described by a tuple of attribute values D = ⟨x1, x2, …, xn⟩, into one of the classes cj ∈ C:

$$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n) = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)} = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)$$
Naïve Bayes Assumption

- P(cj)
  - Can be estimated from the frequency of classes in the training examples.
- P(x1, x2, …, xn | cj)
  - O(|X|^n · |C|) parameters
  - Could only be estimated if a very, very large number of training examples was available.
Conditional Independence Assumption
[Figure: naive Bayes network with class Flu and features X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle ache]

- Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).
- Features detect term presence and are independent of each other given the class:

$$P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)$$
Learning the Model
[Figure: naive Bayes network with class C and features X1, …, X6]

First attempt: maximum likelihood estimates. Simply use the frequencies in the data:

$$\hat{P}(c_j) = \frac{N(C = c_j)}{N} \qquad \hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$$
Problem with Max Likelihood
[Figure: naive Bayes network with class Flu and features X1 = runny nose, …, X5 = muscle ache]

$$P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)$$

What if we have seen no training cases where the patient had muscle aches but not the flu?

$$\hat{P}(X_5 = t \mid C = nf) = \frac{N(X_5 = t, C = nf)}{N(C = nf)} = 0$$

Zero probabilities cannot be conditioned away, no matter the other evidence!

$$\ell = \arg\max_c \hat{P}(c) \prod_i \hat{P}(x_i \mid c)$$
Smoothing to Eliminate Zeros
$$\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k} \qquad k = \text{number of values of } X_i$$

- Add-one smoothing (Laplace smoothing)
- Acts as a uniform prior (each attribute value occurs once for each class) that is then updated as evidence from the training data comes in.
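A numeric illustration (with assumed counts: a binary feature, so k = 2, and 10 training documents of class nf, none of which has X5 = t):

```python
def laplace(n_xi_c, n_c, k):
    """Add-one (Laplace) smoothed estimate of P(xi | c); k = number of values of Xi."""
    return (n_xi_c + 1) / (n_c + k)

# The zero-count case from the previous slide: 1/12 instead of 0.
print(laplace(0, 10, 2))   # 0.0833...
```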
Document Generative Model



Example: "Love is patient, love is kind."

Basic representation: bag of words.

[Figure: a bag containing the words "Love", "is", "patient", "kind"]

- A binary independence model
  - Multivariate binomial generation
  - Feature Xi is a term
  - Value Xi = 1 or 0, indicating whether term Xi is present in the doc or not
- A multinomial unigram language model
  - Multinomial generation
  - Feature Xi is a term position
  - Value of Xi = the term at position i
  - Position independent
Bernoulli Naive Bayes Classifiers

Multivariate binomial model:

- One feature Xw for each word in the dictionary
- Value Xw = true in document d if w appears in d
- Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears

[Figure: the example document generated as independent word-presence events]
Multinomial Naive Bayes Classifiers II

Multinomial = class-conditional unigram

- One feature Xi for each word position in the document
  - The feature's values are all words in the dictionary
  - Value of Xi is the word in position i
- Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about the words in other positions

$$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j) = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{"our"} \mid c_j) \cdots P(x_n = \text{"text"} \mid c_j)$$
Multinomial Naive Bayes Classifiers



- Still too many possibilities
- Assume that classification is independent of the positions of the words
- Second assumption: word appearance does not depend on position:

$$P(X_i = w \mid c) = P(X_j = w \mid c) \quad \text{for all positions } i, j, \text{ words } w, \text{ and classes } c$$

- Just have one multinomial feature predicting for all words
- Use the same parameters for each position
Parameter estimation

- Binomial model:
  $$\hat{P}(X_w = t \mid c_j) = \text{fraction of documents of topic } c_j \text{ in which word } w \text{ appears}$$
- Multinomial model:
  $$\hat{P}(X_i = w \mid c_j) = \text{fraction of times word } w \text{ appears across all documents of topic } c_j$$
Naive Bayes algorithm (Multinomial model)
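The pseudocode on this slide was not captured in the transcript; the following is a minimal Python sketch, assuming Laplace-smoothed estimates and log-space scoring as developed above:

```python
import math
from collections import Counter, defaultdict

def train_multinomial(docs):
    """docs: list of (token_list, class_label). Returns priors, cond. probs, vocab."""
    class_counts = Counter(c for _, c in docs)
    prior = {c: n / len(docs) for c, n in class_counts.items()}
    term_counts = defaultdict(Counter)          # class -> term frequencies
    for tokens, c in docs:
        term_counts[c].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    condprob = {c: {t: (term_counts[c][t] + 1) /
                       (sum(term_counts[c].values()) + len(vocab))
                    for t in vocab}
                for c in prior}
    return prior, condprob, vocab

def apply_multinomial(prior, condprob, vocab, tokens):
    """Pick argmax_c of log P(c) + sum over token positions of log P(t|c)."""
    scores = {c: math.log(p) + sum(math.log(condprob[c][t])
                                   for t in tokens if t in vocab)
              for c, p in prior.items()}
    return max(scores, key=scores.get)
```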
Naive Bayes algorithm (Bernoulli model)
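Likewise a sketch for the Bernoulli model (again not the slide's own pseudocode): training counts document frequencies rather than term frequencies, and scoring iterates over the entire vocabulary, so absent words also contribute:

```python
import math
from collections import Counter, defaultdict

def train_bernoulli(docs):
    """docs: list of (token_list, class_label); add-one smoothing over 2 values."""
    class_counts = Counter(c for _, c in docs)
    prior = {c: n / len(docs) for c, n in class_counts.items()}
    doc_freq = defaultdict(Counter)             # class -> doc frequency of terms
    for tokens, c in docs:
        doc_freq[c].update(set(tokens))         # presence, not term counts
    vocab = {t for tokens, _ in docs for t in tokens}
    condprob = {c: {t: (doc_freq[c][t] + 1) / (class_counts[c] + 2) for t in vocab}
                for c in prior}
    return prior, condprob, vocab

def apply_bernoulli(prior, condprob, vocab, tokens):
    """Every vocabulary term votes: present -> P(t|c), absent -> 1 - P(t|c)."""
    present = set(tokens)
    scores = {}
    for c, p in prior.items():
        s = math.log(p)
        for t in vocab:
            pt = condprob[c][t]
            s += math.log(pt if t in present else 1.0 - pt)
        scores[c] = s
    return max(scores, key=scores.get)
```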
NB Example

c(5) = ?

[Table: training documents 1 to 4 with their class labels and test document 5; the table itself was not captured in the transcript]

Multinomial NB Classifier

- Feature likelihood estimates
- Posterior
- Result: c(5) = China
NB Example

c(5) = ?

Bernoulli NB Classifier

- Feature likelihood estimates
- Posterior
- Result: c(5) ≠ China
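The example's document tables are missing from the transcript, but the two results shown match the standard worked example in IIR Ch. 13 (see Readings); assuming that data, the arithmetic is:

```python
# Assumed data (IIR Ch. 13): training docs d1 "Chinese Beijing Chinese" (China),
# d2 "Chinese Chinese Shanghai" (China), d3 "Chinese Macao" (China),
# d4 "Tokyo Japan Chinese" (not China); test d5 "Chinese Chinese Chinese Tokyo Japan".

# Multinomial (Laplace smoothing, |V| = 6; 8 tokens in China, 3 in not-China):
p_china = (3/4) * (3/7)**3 * (1/14) * (1/14)            # ≈ 0.0003
p_not   = (1/4) * (2/9)**3 * (2/9)  * (2/9)             # ≈ 0.0001
print(p_china > p_not)                                   # True  -> c(5) = China

# Bernoulli (document frequencies; absent vocabulary words contribute (1 - P)):
b_china = (3/4) * (4/5) * (1/5) * (1/5) * (1 - 2/5)**3   # ≈ 0.005
b_not   = (1/4) * (2/3) * (2/3) * (2/3) * (1 - 1/3)**3   # ≈ 0.022
print(b_china > b_not)                                    # False -> c(5) ≠ China
```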
Classification

Multinomial vs. multivariate binomial?

- The multinomial model is in general better.
Classification Evaluation
Let’s think about it…

How do we run evaluation experiments for classifiers?

- Dataset
  - Training set
  - Test set
- Measures
  - Recall
  - Precision
  - F1
  - Accuracy
- Generalization performance
Classic Reuters Data Set




- The most (over)used data set
- 21,578 documents
- 9,603 training, 3,299 test articles (ModApte split)
- 118 categories
  - An article can be in more than one category
  - Learn 118 binary category distinctions
- Average document: about 90 types, 200 tokens
- Average number of classes assigned: 1.24 for docs with at least one category
- Only about 10 out of 118 categories are large
Common categories (#train, #test):

- Earn (2877, 1087)
- Acquisitions (1650, 179)
- Money-fx (538, 179)
- Grain (433, 149)
- Crude (389, 189)
- Trade (369, 119)
- Interest (347, 131)
- Ship (197, 89)
- Wheat (212, 71)
- Corn (182, 56)
Measuring Classification
Figures of Merit

- Accuracy of classification
  - The main evaluation criterion in academia
- Speed of training the statistical classifier
  - Some methods are very cheap; some very costly
- Speed of classification (docs/hour)
  - No big differences for most algorithms
  - Exceptions: kNN, complex preprocessing requirements
- Effort in creating the training set / hand-built classifier
  - Human hours per topic
Measuring Classification
Figures of Merit

In the real world, economic measures matter. Your choices are:

- Do no classification
  - That has a cost (hard to compute)
- Do it all manually
  - Has an easy-to-compute cost if you are doing it that way now
- Do it all with an automatic classifier
  - Mistakes have a cost
- Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases

Commonly the last method is the most cost-efficient and is adopted.
Per class evaluation measures
[Confusion matrix over classes A, B, C; entry cij = number of documents of actual class i that were assigned class j]

- Recall: fraction of docs in class i classified correctly:
  $$R_i = \frac{c_{ii}}{\sum_j c_{ij}}$$
- Precision: fraction of docs assigned class i that are actually about class i:
  $$P_i = \frac{c_{ii}}{\sum_j c_{ji}}$$
- "Correct rate" (1 − error rate): fraction of docs classified correctly:
  $$\frac{\sum_i c_{ii}}{\sum_i \sum_j c_{ij}}$$
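A small sketch of these formulas, with the confusion matrix as a nested list and c[i][j] = number of documents of actual class i assigned class j:

```python
def per_class_measures(c):
    """Per-class precision and recall, plus the overall correct rate."""
    n = len(c)
    recall    = [c[i][i] / sum(c[i]) for i in range(n)]
    precision = [c[i][i] / sum(c[j][i] for j in range(n)) for i in range(n)]
    correct   = sum(c[i][i] for i in range(n)) / sum(map(sum, c))
    return precision, recall, correct
```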
Measuring Classification

- Overall error rate
  - Not a good measure for small classes. Why?
- Precision/recall for classification decisions
  - F1 measure: 1/F1 = ½ (1/P + 1/R)
- Correct estimate of the size of a category
  - Why is this different?
- Stability over time / category drift
- Utility
  - Costs of false positives / false negatives may differ
  - For example, cost = tp − 0.5 fp
Generalization Performance

- Results can vary based on sampling error due to different training and test sets.
- Average results over multiple training and test sets (splits of the overall data) for the best results.
- Ideally, test and training sets are independent on each trial.
  - But this would require too much labeled data.
Good practice department

N-Fold Cross-Validation

- Partition the data into N equal-sized disjoint segments.
- Run N trials, each time using a different segment of the data for testing, and training on the remaining N−1 segments.
- This way, at least the test sets are independent.
- Report the average classification accuracy over the N trials.
- Typically, N = 10.
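A minimal sketch in plain Python (`train_and_classify` is a stand-in for any routine that trains on labeled data and returns a prediction function, e.g. the NB sketches above):

```python
import random

def n_fold_cv(data, train_and_classify, n=10, seed=0):
    """data: list of (instance, label). Returns accuracy averaged over n folds."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]        # n disjoint segments
    accuracies = []
    for i in range(n):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        predict = train_and_classify(train)
        accuracies.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(accuracies) / n
```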
Good practice department II

Learning Curves

- We would like to know how performance varies with the number of training instances.
- Learning curves plot classification accuracy on independent test data (Y axis) versus the number of training examples (X axis).
- One can do both of the above, producing learning curves averaged over multiple trials from cross-validation.
How to Combine Multiple Measures?

- If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging:
  - Compute performance for each class, then average.
- Microaveraging:
  - Collect decisions for all classes, compute the contingency table, evaluate.
Micro- vs. Macro-Averaging: Example
Class 1:

                   Truth: yes   Truth: no
  Classifier: yes      10           10
  Classifier: no       10          970

Class 2:

                   Truth: yes   Truth: no
  Classifier: yes      90           10
  Classifier: no       10          890

Micro-averaging table (pooled counts):

                   Truth: yes   Truth: no
  Classifier: yes     100           20
  Classifier: no       20         1860

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83

Why this difference? Microaveraging is dominated by the frequent class (Class 2), while macroaveraging weights every class equally.
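A short sketch that reproduces these numbers from per-class (tp, fp) counts:

```python
def macro_micro_precision(per_class):
    """per_class: list of (tp, fp) pairs, one per class's contingency table."""
    macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
    tp_sum = sum(tp for tp, _ in per_class)
    fp_sum = sum(fp for _, fp in per_class)
    return macro, tp_sum / (tp_sum + fp_sum)

# Slide example: Class 1 (tp=10, fp=10), Class 2 (tp=90, fp=10)
print(macro_micro_precision([(10, 10), (90, 10)]))   # (0.7, 0.8333...)
```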
Exercise

Federalist Papers

- Who was the author?
- 77 short essays published under a pseudonym in 1787-1788 by Hamilton, Jay, and Madison to persuade New York to support the US Constitution.
- The authorship of 12 of the papers is disputed.

Author Identification

It's a text categorization problem. In 1964, Mosteller and Wallace* solved it.

- Feature selection: they identified 70 function words as good candidates for authorship analysis.
- Classifier: using statistical inference, they concluded the author was Madison.

* Mosteller, Frederick and Wallace, David L. 1964. Inference and Disputed Authorship: The Federalist.

Function Words for Author Identification
Summary

- Definition
  - The category of x: c(x) ∈ C
- K-Nearest Neighbor
- Naïve Bayes
  - Bayesian methods
  - Bernoulli NB classifier
  - Multinomial NB classifier:
    $$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j) = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{"our"} \mid c_j) \cdots P(x_n = \text{"text"} \mid c_j)$$
- Categorization Evaluation
  - Training data / test data
  - Over-fitting & generalization
Thank You!
Q&A
Readings


[1] IIR, Ch. 13 and Ch. 14.2.
[2] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), 1999.
Bernoulli trial

A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, "success" or "failure".
Binomial Distribution

The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
Multinomial Distribution

The multinomial distribution is a generalization of the binomial distribution:

- Each trial results in one of some fixed finite number k of possible outcomes, with probabilities p1, …, pk.
- There are n independent trials.
- We can use a random variable Xi to indicate the number of times outcome i was observed over the n trials.
Bayes’ Rule
$$\underbrace{P(h \mid D)}_{\text{posterior}} = \frac{\overbrace{P(D \mid h)}^{\text{likelihood}} \times \overbrace{P(h)}^{\text{prior}}}{P(D)}$$
Use Bayes Rule to Gamble


- Someone draws an envelope at random and offers to sell it to you. How much should you pay?
- Before deciding, you are allowed to see one bead drawn from the envelope. Suppose it's red: how much should you pay?
Prosecutor's fallacy


You win the lottery jackpot. You are then charged with having cheated, for instance with having bribed lottery officials. At the trial, the prosecutor points out that winning the lottery without cheating is extremely unlikely, and that therefore your being innocent must be comparably unlikely.
Maximum a posteriori Hypothesis
$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h)$$

(the last step holds because P(D) is constant)
Maximum likelihood Hypothesis
If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:

$$h_{ML} = \arg\max_{h \in H} P(D \mid h)$$
Likelihood


- A likelihood function is a conditional probability function considered as a function of its second argument, with its first argument held fixed.
- Given a parameterized family of probability density functions x ↦ f(x | θ), where θ is the parameter, the likelihood function is

  $$L(\theta \mid x) = f(x \mid \theta)$$

  where x is the observed outcome of an experiment.
- When f(x | θ) is viewed as a function of x with θ fixed, it is a probability density function; when viewed as a function of θ with x fixed, it is a likelihood function.
Reuters Text Categorization data set (Reuters-21578): sample document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981"
NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off
tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining
industry positions on a number of issues, according to the National Pork Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the
future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate
whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC
said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the
industry, the NPPC added. Reuter
&#3;</BODY></TEXT></REUTERS>
New Reuters: RCV1: 810,000 docs

[Figure: top topics in Reuters RCV1]
北大天网 (PKU Tianwang): Chinese Web Page Classification

- By recruiting dozens of students from different majors, a large-scale sample set of Chinese web pages, organized in a hierarchical model, was manually compiled.
- It includes 12,336 training web-page instances and 3,269 test instances, distributed over 12 top-level categories and 733 categories in total; each category has on average 17 training instances and 4.5 test instances.
- Tianwang provides the sample set free of charge to interested peers (Yanqiong product number: YQ-WEBENCH-V0.8).
- Chinese information retrieval forum: www.cwirf.org
- Classification evaluations are run at SEWM, the national workshop on search engines and web mining.
北大天网 (PKU Tianwang): Chinese Web Page Classification

No.   | Category                               | #Classes | #Train | #Test
 1    | 人文与艺术 (Humanities & Arts)          |    24    |   419  |  110
 2    | 新闻与媒体 (News & Media)               |     7    |   125  |   19
 3    | 商业与经济 (Business & Economy)         |    48    |   839  |  214
 4    | 娱乐与休闲 (Entertainment & Leisure)    |    88    |  1510  |  374
 5    | 计算机与因特网 (Computers & Internet)   |    58    |   925  |  238
 6    | 教育 (Education)                        |    18    |   286  |   85
 7    | 各国风情 (World Regions & Cultures)     |    53    |   891  |  235
 8    | 自然科学 (Natural Science)              |   113    |  1892  |  514
 9    | 政府与政治 (Government & Politics)      |    18    |   288  |   84
10    | 社会科学 (Social Science)               |   104    |  1765  |  479
11    | 医疗与健康 (Health & Medicine)          |   136    |  2295  |  616
12    | 社会与文化 (Society & Culture)          |    66    |  1101  |  301
Total |                                        |   733    | 12336  | 3269
Concept Drift


- Categories change over time
- Example: "president of the united states"
  - 1999: "clinton" is a great feature
  - 2002: "clinton" is a bad feature
- One measure of a text classification system is how well it protects against concept drift.
- Feature selection can be bad at protecting against concept drift.
Think about it…



- Describe the process: can you tell what it is?
- How do you do it?
- Why do we talk about it here? What does it mean for Information Overloading?

[Figure: photos of an eagle (鹰) and a broiler chicken (肉鸡), as a classification exercise]
Recall: Vector Space Representation



- Each document is a vector, with one component for each term (= word).
- Normalize to unit length.
- High-dimensional vector space:
  - Terms are axes
  - 10,000+ dimensions, or even 100,000+
  - Docs are vectors in this space