Text Categorization

Categorization


Problem: Given a universe of objects and a predefined set of classes, or categories, assign each object to its correct class.

Examples:

Problem                   Objects            Categories
Tagging                   words in context   POS tag
WSD                       words in context   word sense
PP attachment             sentences          parse trees
Language identification   text               language
Text categorization       text               topic
Overview

Definition of Text Categorization

Techniques
– Decision Trees
– Maximum Entropy Modeling
– k-Nearest Neighbor Classification
Text categorization

• Classification (= Categorization)
  – Task of assigning objects to classes or categories
• Text categorization
  – Task of classifying the topic or theme of a document

Statistical classification
• Training set of objects
• Data representation model
• Model class
• Training procedure
• Evaluation
Training set of objects

• A set of objects, each labeled by one or more classes
• Example from Reuters:
<REUTERS TOPICS="YES" NEWID="2005">
<DATE> 5-MAR-1987 09:22:57.75</DATE>
<TOPICS><D>earn</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<TEXT>&#2;
<TITLE>NORD RESOURCES CORP &lt;NRD> 4TH QTR NET</TITLE>
<DATELINE> DAYTON, Ohio, March 5 - </DATELINE>
<BODY>Shr 19 cts vs 13 cts
Net 2,656,000 vs 1,712,000
Revs 15.4 mln vs 9,443,000
Avg shrs 14.1 mln vs 12.6 mln
Shr 98 cts vs 77 cts
Net 13.8 mln vs 8,928,000
Revs 58.8 mln vs 48.5 mln
Avg shrs 14.0 mln vs 11.6 mln
NOTE: Shr figures adjusted for 3-for-2 split paid Feb 6, 1987.
Reuter &#3;</BODY></TEXT>
</REUTERS>
Data Representation Model

• The training set is encoded via a data representation model
• Typically, each object in the training set is represented by a pair (x, c), where:
  – x: a vector of measurements
  – c: class label

Data Representation

• For text categorization:
  – use words that are frequent in “earnings” documents
  – the 20 most representative words are: vs, mln, cts, loss, &, 000, profit, dlrs, pct, etc.
• Each document j is represented as a vector

  \vec{x}_j = (s_{1j}, \ldots, s_{Kj})

  where

  s_{ij} = \mathrm{round}\left( 10 \cdot \frac{1 + \log(\mathrm{tf}_{ij})}{1 + \log(l_j)} \right)

  and tf_{ij} is the number of occurrences of word i in document j and l_j is the length of document j.
Example weights s_{ij} for the 20 feature words:

word     s_{ij}
vs       5
mln      5
cts      3
;        3
&        3
000      4
loss     0
'        0
"        0
3        4
profit   0
dlrs     3
1        2
pct      0
is       0
s        0
that     0
net      3
lt       2
at       0
Model Class and Training Procedure

• Model class
  – A parameterized family of classifiers
  – e.g. a model class for binary classification: g(x) = w·x + w0
    • if g(x) > 0, choose class c1, else c2
• Training procedure
  – Algorithm to select one classifier from this family
  – i.e., to select proper parameter values (e.g. w, w0)
Evaluation

• Borrowing from IR, NLP systems are evaluated by precision, recall, etc.
• Example: for text categorization, given a set of documents of which a subset is in a particular category (say, “earnings”), the system classifies some other subset of the documents as belonging to the “earnings” category.
• The results of the system are compared with the actual results as follows:

                Correct               Incorrect
Assigned        tp (true positive)    fp (false positive)
Not Assigned    fn (false negative)   tn (true negative)
Evaluation measures

Precision = \frac{tp}{tp + fp}

Recall = \frac{tp}{tp + fn}

Accuracy = \frac{tp + tn}{tp + tn + fp + fn}

Error = \frac{fp + fn}{tp + tn + fp + fn}
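These four measures follow directly from the contingency table above; the helper below is a minimal sketch (the function name and example counts are illustrative, not from the slides).

```python
def evaluation_measures(tp, fp, fn, tn):
    """Precision, recall, accuracy, and error from one contingency table."""
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
    }

print(evaluation_measures(tp=10, fp=10, fn=10, tn=970))
```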


Evaluation of text categorization

• macro-averaging
  – Compute an evaluation measure for each contingency table separately and average over categories
  – gives equal weight to each category
  – macro-averaged precision =

    \frac{1}{3}\left( \frac{a_1}{a_1 + b_1} + \frac{a_2}{a_2 + b_2} + \frac{a_3}{a_3 + b_3} \right)

• micro-averaging
  – Make a single contingency table for all categories by summing the scores in each cell, then compute the evaluation measure for the whole table
  – gives equal weight to each object
  – micro-averaged precision =

    \frac{a_1 + a_2 + a_3}{(a_1 + a_2 + a_3) + (b_1 + b_2 + b_3)}

(here a_i and b_i are the true positives and false positives of category i)
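A quick sketch of the difference (the (tp, fp) pair representation is my own, matching a_i and b_i above):

```python
def macro_precision(per_category):
    """per_category: list of (tp, fp) pairs, one contingency table per category."""
    return sum(tp / (tp + fp) for tp, fp in per_category) / len(per_category)

def micro_precision(per_category):
    """Pool the counts into a single table, then compute precision once."""
    tp_sum = sum(tp for tp, _ in per_category)
    fp_sum = sum(fp for _, fp in per_category)
    return tp_sum / (tp_sum + fp_sum)

tables = [(10, 10), (90, 10), (1, 9)]
print(macro_precision(tables))  # 0.5  -- every category counts equally
print(micro_precision(tables))  # ~0.78 -- large categories dominate
```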
Classification Techniques

• Naïve Bayes
• Decision Trees
• Maximum Entropy Modeling
• Support Vector Machines
• k-Nearest Neighbor
Bayesian Classifiers

Bayesian Methods

• Learning and classification methods based on probability theory
• Bayes' theorem plays a critical role in probabilistic learning and classification
• Build a generative model that approximates how data is produced
• Uses prior probability of each category given no information about an item
• Categorization produces a posterior probability distribution over the possible categories given a description of an item
Bayes' Rule

P(C, X) = P(C \mid X)\, P(X) = P(X \mid C)\, P(C)

P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}

where P(C) is the prior probability and P(C | X) is the posterior probability.
Maximum a Posteriori Hypothesis

h_{MAP} = \arg\max_{h \in H} P(h \mid D)
        = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)}
        = \arg\max_{h \in H} P(D \mid h)\, P(h)

Maximum Likelihood Hypothesis

If all hypotheses are a priori equally likely, we need only consider the P(D | h) term:

h_{ML} = \arg\max_{h \in H} P(D \mid h)
Naïve Bayes Classifiers

Task: Classify a new instance based on a tuple of attribute values (x_1, x_2, \ldots, x_n).

c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)
        = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)}
        = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)
Naïve Bayes Classifier: Assumptions

• P(c_j)
  – Can be estimated from the frequency of classes in the training examples.
• P(x_1, x_2, …, x_n | c_j)
  – O(|X|^n · |C|)
  – Could only be estimated if a very, very large number of training examples was available.

Conditional Independence Assumption:
• Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.
The Naïve Bayes Classifier

Example: class Flu, with features X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle-ache.

Conditional Independence Assumption: features are independent of each other given the class:

P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)
Learning the Model

(Graphical model: class C with conditionally independent features X1, …, X6)

Common practice: maximum likelihood
– simply use the frequencies in the data

\hat{P}(c_j) = \frac{N(C = c_j)}{N}

\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}
Problem with Max Likelihood

(Flu example from above)

P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdots P(X_5 \mid C)

• What if we have seen no training cases where a patient had muscle aches but no flu?

  \hat{P}(X_5 = t \mid C = nf) = \frac{N(X_5 = t, C = nf)}{N(C = nf)} = 0

• Zero probabilities cannot be conditioned away, no matter the other evidence!

  c = \arg\max_{c} \hat{P}(c) \prod_i \hat{P}(x_i \mid c)
Smoothing to Avoid Overfitting

Add one to each count (k = number of values of X_i):

\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}

Somewhat more subtle version, where p_{i,k} is the overall fraction of the data in which X_i = x_{i,k} and m controls the extent of smoothing:

\hat{P}(x_{i,k} \mid c_j) = \frac{N(X_i = x_{i,k}, C = c_j) + m\, p_{i,k}}{N(C = c_j) + m}
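A minimal sketch of both estimators (the function names and example counts are mine, not from the slides):

```python
def smoothed_estimate(n_xi_cj, n_cj, k):
    """Add-one estimate of P(X_i = x_i | C = c_j).

    n_xi_cj : training examples with X_i = x_i and C = c_j
    n_cj    : training examples with C = c_j
    k       : number of possible values of X_i
    """
    return (n_xi_cj + 1) / (n_cj + k)

def m_estimate(n_xi_cj, n_cj, p_ik, m):
    """More subtle version: p_ik is the overall fraction of the data with
    X_i = x_ik, and m controls the extent of smoothing."""
    return (n_xi_cj + m * p_ik) / (n_cj + m)

# With no observed (no-flu, muscle-ache) cases the raw estimate would be 0,
# but the smoothed estimates stay strictly positive:
print(smoothed_estimate(0, 50, k=2))       # 1/52
print(m_estimate(0, 50, p_ik=0.3, m=5))    # 1.5/55
```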
Text Classification Using Naïve Bayes: Basic method

• Attributes are text positions, values are words.

  c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)
         = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{"our"} \mid c_j) \cdots P(x_n = \text{"text"} \mid c_j)

• Naive Bayes assumption is clearly violated.
  – Example?
• Still too many possibilities
• Assume that classification is independent of the positions of the words
  – Use the same parameters for each position
Text Classification Algorithms: Learning

• From training corpus, extract Vocabulary
• Calculate required P(c_j) and P(x_k | c_j) terms
  – For each c_j in C do
    • docs_j ← subset of documents for which the target class is c_j
    • P(c_j) = |docs_j| / |total # documents|
    • Text_j ← single document containing all docs_j
    • n ← total number of word positions in Text_j
    • for each word x_k in Vocabulary
      – n_k ← number of occurrences of x_k in Text_j
      – P(x_k | c_j) = (n_k + 1) / (n + |Vocabulary|)
Text Classification Algorithms: Classifying

• positions ← all word positions in the current document which contain tokens found in Vocabulary
• Return c_NB, where

  c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)
Naive Bayes Time Complexity

• Training Time: O(|D| L_d + |C||V|), where L_d is the average length of a document in D
  – Assumes V and all D_i, n_i, and n_ij pre-computed in O(|D| L_d) time during one pass through all of the data.
  – Generally just O(|D| L_d) since usually |C||V| < |D| L_d
• Test Time: O(|C| L_t), where L_t is the average length of a test document
• Very efficient overall, linearly proportional to the time needed to just read in all the data
Underflow Prevention

• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow
• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities
• Class with highest final un-normalized log probability score is still the most probable
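Continuing the training sketch above (again my own illustration), classification sums log probabilities over the in-vocabulary positions:

```python
def classify_naive_bayes(tokens, log_prior, log_cond, vocabulary):
    """Return the class with the highest un-normalized log probability score."""
    best_class, best_score = None, float("-inf")
    for c in log_prior:
        # Summing logs instead of multiplying probabilities avoids underflow.
        score = log_prior[c] + sum(log_cond[c][w] for w in tokens if w in vocabulary)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```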
Naïve Bayes Posterior Probabilities

• Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate
• However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not
  – Output probabilities are generally very close to 0 or 1
Two Models

• Model 1: Multivariate binomial
  – One feature X_w for each word in the dictionary
  – X_w = true in document d if w appears in d
  – Naive Bayes assumption:
    • Given the document's topic, appearance of one word in the document tells us nothing about chances that another word appears

Two Models

• Model 2: Multinomial
  – One feature X_i for each word position in the document
    • feature's values are all words in the dictionary
  – Value of X_i is the word in position i
  – Naïve Bayes assumption:
    • Given the document's topic, the word in one position in the document tells us nothing about the value of words in other positions
  – Second assumption:
    • word appearance does not depend on position:

      P(X_i = w \mid c) = P(X_j = w \mid c)

      for all positions i, j, word w, and class c
Parameter estimation

• Binomial model:

  \hat{P}(X_w = t \mid c_j) = fraction of documents of topic c_j in which word w appears

• Multinomial model:

  \hat{P}(X_i = w \mid c_j) = fraction of times in which word w appears across all documents of topic c_j

  – create a mega-document for topic j by concatenating all documents in this topic
  – use the frequency of w in the mega-document
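The two estimators differ only in what gets counted; here is a minimal, unsmoothed sketch (my own illustration) for the documents of a single topic c_j:

```python
from collections import Counter

def bernoulli_estimates(docs):
    """docs: list of token lists, all belonging to topic c_j.
    P_hat(X_w = true | c_j) = fraction of documents containing w."""
    doc_freq = Counter(w for tokens in docs for w in set(tokens))
    return {w: doc_freq[w] / len(docs) for w in doc_freq}

def multinomial_estimates(docs):
    """P_hat(X_i = w | c_j) = frequency of w in the concatenated mega-document."""
    mega_document = [w for tokens in docs for w in tokens]   # concatenate all docs
    term_freq = Counter(mega_document)
    total = len(mega_document)
    return {w: term_freq[w] / total for w in term_freq}
```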
Feature selection via Mutual Information

• We might not want to use all words, but just reliable, good discriminators
• In the training set, choose the k words which best discriminate the categories.
• One way is in terms of Mutual Information:

  I(w, c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} p(e_w, e_c) \log \frac{p(e_w, e_c)}{p(e_w)\, p(e_c)}

  – for each word w and each category c
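In practice the probabilities are estimated from document counts; the sketch below (the count names n11, n10, n01, n00 are my own shorthand) computes I(w, c) for one word–category pair:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(w, c) from document counts:
    n11 = docs with w and in c, n10 = with w not in c,
    n01 = in c without w,      n00 = neither."""
    n = n11 + n10 + n01 + n00
    cells = [
        (n11, (n11 + n10) / n, (n11 + n01) / n),   # e_w = 1, e_c = 1
        (n10, (n11 + n10) / n, (n10 + n00) / n),   # e_w = 1, e_c = 0
        (n01, (n01 + n00) / n, (n11 + n01) / n),   # e_w = 0, e_c = 1
        (n00, (n01 + n00) / n, (n10 + n00) / n),   # e_w = 0, e_c = 0
    ]
    mi = 0.0
    for count, p_ew, p_ec in cells:
        p_joint = count / n
        if p_joint > 0:                            # 0 * log(0) treated as 0
            mi += p_joint * math.log(p_joint / (p_ew * p_ec))
    return mi
```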
Feature selection via MI (2)

• For each category we build a list of the k most discriminating terms.
• For example (on 20 Newsgroups):
  – sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …
  – rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …
• Greedy: does not account for correlations between terms
• In general feature selection is necessary for binomial NB, but not for multinomial NB
Evaluating Categorization

• Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
• Classification accuracy: c/n where n is the total number of test instances and c is the number of test instances correctly classified by the system.
• Results can vary based on sampling error due to different training and test sets.
• Average results over multiple training and test sets (splits of the overall data) for the best results.
Example: AutoYahoo!

• Classify 13,589 Yahoo! webpages in the “Science” subtree into 95 different topics (hierarchy depth 2)

Example: WebKB (CMU)

• Classify webpages from CS departments into:
  – student, faculty, course, project
WebKB Experiment

• Train on ~5,000 hand-labeled web pages
  – Cornell, Washington, U.Texas, Wisconsin
• Crawl and classify a new site (CMU)
• Results:

Category   Extracted   Correct   Accuracy
Student    180         130       72%
Faculty    66          28        42%
Person     246         194       79%
Project    99          72        73%
Course     28          25        89%
Depart.    1           1         100%
NB Model Comparison
Sample Learning Curve (Yahoo Science Data)
Importance of Conditional Independence

Assume a domain with 20 binary (true/false) attributes A1, …, A20, and two classes c1 and c2.
Goal: for any case A = A1, …, A20 estimate P(A, ci).

A) No independence assumptions:
Computation of 2^21 parameters (one for each combination of values)!
• The training database will not be so large!
• Huge memory requirements / processing time.
• Error prone (small sample error).

B) Strongest conditional independence assumptions (all attributes independent given the class) = Naive Bayes:
P(A, ci) = P(A1, ci) P(A2, ci) … P(A20, ci)
Computation of 20 · 2 · 2 = 80 parameters.
• Space and time efficient.
• Robust estimations.
• What if the conditional independence assumptions do not hold?

C) More relaxed independence assumptions
Tradeoff between A) and B)
Conditions for Optimality of Naïve Bayes

Fact: Sometimes NB performs well even if the conditional independence assumptions are badly violated.

Question: WHY? And WHEN?

Hint: Classification is about predicting the correct class label and NOT about accurately estimating probabilities.

Assume two classes c1 and c2. A new case A arrives. NB will classify A to c1 if:
P(A, c1) > P(A, c2)

                              P(A, c1)   P(A, c2)   Class of A
Actual probability            0.1        0.01       c1
Estimated probability by NB   0.08       0.07       c1

Despite the big error in estimating the probabilities, the classification is still correct.

Correct estimation ⇒ accurate prediction,
but NOT
accurate prediction ⇒ correct estimation
Naïve Bayes is Not So Naïve

• Naïve Bayes: First and Second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms
  – Goal: financial services industry direct mail response prediction model: predict if the recipient of mail will actually respond to the advertisement – 750,000 records.
• Robust to irrelevant features
  – Irrelevant features cancel each other without affecting results
  – Decision Trees and Nearest-Neighbor methods, by contrast, can suffer heavily from this.
• Very good in domains with many equally important features
  – Decision Trees suffer from fragmentation in such cases – especially if there is little data
• A good dependable baseline for text classification (but not the best)!
• Optimal if the independence assumptions hold:
  – If the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• Very fast:
  – Learning with one pass over the data; testing linear in the number of attributes and document collection size
• Low storage requirements
• Handles missing values
Interpretability of Naïve Bayes
(From R.Kohavi, Silicon Graphics MineSet Evidence Visualizer)
Naïve Bayes Drawbacks

• Doesn't do higher order interactions
  – Typical example: chess end games
    • Each move completely changes the context for the next move
    • C4.5 → 99.5% accuracy; NB → 87% accuracy
• What if you have BOTH high order interactions AND few training data?
• Doesn't model features that do not equally contribute to distinguishing the classes.
  – If only a few features mostly determine the class, additional features usually decrease the accuracy.
  – Because NB gives the same weight to all features.
Decision Trees

Decision Trees

Example: decision whether to assign documents to the category "earnings"

node 1: 7681 articles, P(c|n1) = 0.300, split on 'cts' at value 2
  cts < 2 → node 2: 5977 articles, P(c|n2) = 0.116, split on 'net' at value 1
    net < 1 → node 3: 5436 articles, P(c|n3) = 0.050
    net ≥ 1 → node 4: 541 articles, P(c|n4) = 0.649
  cts ≥ 2 → node 5: 1704 articles, P(c|n5) = 0.943, split on 'vs' at value 2
    vs < 2 → node 6: 301 articles, P(c|n6) = 0.694
    vs ≥ 2 → node 7: 1403 articles, P(c|n7) = 0.996
Decision Trees - Training procedure (1)

• Growing a tree with training data
  – splitting criterion
    • for finding the feature and its value on which to split
    • e.g. maximum information gain
  – stopping criterion
    • determines when to stop splitting
    • e.g. all elements at a node have the same category
• Pruning it back to reasonable size
  – to avoid overfitting the training set
    • e.g. 'dlrs' and 'pct' occurring in just one document
  – to optimize performance
Maximum Information Gain

• Information gain:

  H(t) - H(t|a) = H(t) - (p_L H(t_L) + p_R H(t_R))

  where:
  – a is the attribute we split on
  – t is the distribution of the node we split
  – p_L and p_R are the proportions of elements passed on to the left and right nodes
  – t_L and t_R are the distributions of the left and right nodes
• Choose the attribute which maximizes the information gain

Example:
  H(n1) = -0.3 log(0.3) - 0.7 log(0.7) = 0.881
  H(n2) = 0.518
  H(n5) = 0.315
  H(n1) - H(n1 | 'cts') = 0.881 - (5977/7681) · 0.518 - (1704/7681) · 0.315 = 0.408
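The example numbers can be reproduced with a small sketch (base-2 entropy assumed, since that matches the 0.881 value; function names are mine):

```python
import math

def entropy(p):
    """Binary entropy (base 2) of a node whose positive-class probability is p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(n_parent, p_parent, n_left, p_left, n_right, p_right):
    """H(t) - (p_L * H(t_L) + p_R * H(t_R)) for one candidate split."""
    return entropy(p_parent) - (n_left / n_parent) * entropy(p_left) \
                             - (n_right / n_parent) * entropy(p_right)

# The 'cts' split from the example tree (node 1 -> nodes 2 and 5):
print(information_gain(7681, 0.300, 5977, 0.116, 1704, 0.943))  # ~0.41
```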
Decision Trees – Pruning (1)

• At each step, drop the node considered least helpful
• Find the best tree using validation on a validation set
  – validation set: portion of the training data held out from training
• Find the best tree using cross-validation
  – I. Determine the optimal tree size
    • 1. Divide the training data into N partitions
    • 2. Grow using N-1 partitions, and prune using the held-out partition
    • 3. Repeat step 2 N times
    • 4. Take the average pruned tree size as the optimal tree size
  – II. Train on the total training data, and prune back to the optimal size
Decision Trees - Pruning (2)

• Effect of pruning on accuracy (figure)
• Optimal performance on the test set is reached after pruning 951 nodes
Decision Trees Summary

• Useful for non-trivial classification tasks (for simple problems, use simpler methods)
• Tend to split the training set into smaller and smaller subsets:
  – may lead to poor generalizations
  – not enough data for reliable prediction
  – accidental regularities
• Volatile: very different model from slightly different data
• Can be interpreted easily
  – easy to trace the path
  – easy to debug one's code
  – easy to understand a new domain
Maximum Entropy Modeling

Maximum Entropy Modeling

• Maximum Entropy Modeling
  – The model with maximum entropy of all the models that satisfy the constraints
  – desire to preserve as much uncertainty as possible
• Model class: log-linear model
• Training procedure: generalized iterative scaling
Maximum Entropy Modeling (2)

• Model class: log-linear model

  p(x, c) = \frac{1}{Z} \prod_{i=1}^{K+1} \alpha_i^{f_i(x, c)}

  f_i(x_j, c) = \begin{cases} 1 & \text{if } s_{ij} > 0 \text{ and } c = 1 \\ 0 & \text{otherwise} \end{cases}

  – \alpha_i : weight for the i-th feature
  – Z : normalizing constant

• Class of a new document
  – compute p(x, 0) and p(x, 1)
  – choose the class label with the greater probability

word      log(\alpha_i)
vs         0.613
mln       -0.110
cts        1.298
;         -0.432
&         -0.429
000       -0.413
loss      -0.332
'         -0.085
"          0.202
3         -0.463
profit     0.360
dlrs      -0.202
1         -0.211
pct       -0.260
is        -0.546
s         -0.490
that      -0.285
net       -0.300
lt         1.016
at        -0.465
f_{K+1}    0.009
Maximum Entropy Modeling (3)

• Training procedure: generalized iterative scaling
  – Expected value of f_i under p:

    E_p f_i = \sum_{x, c} p(x, c)\, f_i(x, c)

  – The maximum entropy distribution p* satisfies E_{p^*} f_i = E_{\tilde{p}} f_i

• Algorithm
  1. Initialize \alpha_i^{(1)}. Compute E_{\tilde{p}} f_i. Set n = 1.
  2. Compute p^{(n)}(x, c) for each training example.
  3. Compute E_{p^{(n)}} f_i.
  4. Update \alpha_i^{(n+1)}.
  5. If converged, stop; otherwise set n = n + 1 and go to step 2.
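Step 4 is left unspecified on the slide; in standard generalized iterative scaling the multiplicative update below is used. The sketch assumes that form, with C the constant from the correction feature defined on the next slide:

```python
def gis_update(alpha, expected_empirical, expected_model, C):
    """One GIS step:
    alpha_i^(n+1) = alpha_i^(n) * (E_ptilde[f_i] / E_p(n)[f_i]) ** (1 / C).
    All three lists are indexed by feature i."""
    return [a * (e_emp / e_mod) ** (1.0 / C)
            for a, e_emp, e_mod in zip(alpha, expected_empirical, expected_model)]
```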
Maximum Entropy Modeling (4)

• Define the (K+1)-th feature, for the constraint that the sum of the f_i is equal to C:

  f_{K+1}(x, c) = C - \sum_{i=1}^{K} f_i(x, c), \quad C = \max_{x, c} \sum_{i=1}^{K} f_i(x, c)

• The expected value of f_i under p is defined as

  E_p f_i = \sum_{x, c} p(x, c)\, f_i(x, c)

• The expected value under the empirical distribution is computed as

  E_{\tilde{p}} f_i = \sum_{x, c} \tilde{p}(x, c)\, f_i(x, c) = \frac{1}{N} \sum_{j=1}^{N} f_i(x_j, c_j)

• The expected value under p is approximately computed as

  E_p f_i \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{c} p(c \mid x_j)\, f_i(x_j, c)
GIS Algorithm (full)

1. Initialize {\alpha_i^{(1)}}.
Maximum Entropy Modeling (6)

• Application to text categorization
  – trained on 9603 articles, 500 iterations
  – test result: 88.6% accuracy
  – learned feature weights log(\alpha_i) as shown in the table in Maximum Entropy Modeling (2)
Maximum Entropy Modeling (7)

• Shortcomings of MEM
  – restricted to binary features
    • low performance in some situations
  – computationally expensive: slow convergence
  – the lack of smoothing can cause problems
• Strengths of MEM
  – can specify all possible relevant information
    • complex features can be defined
  – can use heterogeneous features and feature weighting
  – an integrated framework for feature selection & classification
    • a very large number of features can be brought down to a manageable size during the training procedure
Vector Space Classifiers

Vector Space Representation

• Each document is a vector, one component for each term (= word).
• Normalize to unit length.
• Properties of vector space
  – terms are axes
  – n docs live in this space
  – even with stemming, may have 10,000+ dimensions, or even 1,000,000+
Classification Using Vector Spaces

• Each training doc is a point (vector) labeled by its class
• Similarity hypothesis: docs of the same class form a contiguous region of space. Or: similar documents are usually in the same class.
• Define surfaces to delineate classes in space
Classes in a Vector Space

(Figure: regions for Government, Science, Arts – is the similarity hypothesis true in general?)

Given a Test Document

• Figure out which region it lies in
• Assign the corresponding class

Test Document = Government

(Figure: the test document falls in the Government region)
Binary Classification

• Consider 2-class problems
• How do we define (and find) the separating surface?
• How do we test which region a test doc is in?

Separation by Hyperplanes

• Assume linear separability for now:
  – in 2 dimensions, can separate by a line
  – in higher dimensions, need hyperplanes
• Can find a separating hyperplane by linear programming, or iteratively (e.g. with the perceptron):
  – the separator can be expressed as ax + by = c
Linear Programming / Perceptron

Find a, b, c, such that
  ax + by ≥ c for red points
  ax + by ≤ c for blue points.

Relationship to Naïve Bayes?
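A minimal perceptron sketch for the 2-D case above (my own illustration, not code from the slides; labels are +1 for red and -1 for blue, and the learning rate and epoch count are arbitrary choices):

```python
def train_perceptron(points, labels, epochs=100, lr=1.0):
    """Fit g(x, y) = a*x + b*y - c with the perceptron update rule.
    points: list of (x, y) pairs; labels: +1 or -1 per point."""
    a = b = c = 0.0
    for _ in range(epochs):
        for (x, y), t in zip(points, labels):
            if t * (a * x + b * y - c) <= 0:   # misclassified (or on the boundary)
                a += lr * t * x
                b += lr * t * y
                c -= lr * t                    # bias update shifts the threshold
    return a, b, c

def classify(point, a, b, c):
    x, y = point
    return +1 if a * x + b * y - c > 0 else -1
```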
Linear Classifiers

• Many common text classifiers are linear classifiers
• Despite this similarity, large performance differences
  – For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
  – What to do for non-separable problems?

Which Hyperplane?

In general, lots of possible solutions for a, b, c
Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one (e.g., perceptron)
• Most methods find an optimal separating hyperplane
• Which points should influence optimality?
  – All points
    • Linear regression
    • Naïve Bayes
  – Only “difficult points” close to the decision boundary
    • Support vector machines
    • Logistic regression (kind of)
Hyperplane: Example

• Class: “interest” (as in interest rate)
• Example features of a linear classifier (SVM):

  w_i    t_i             w_i     t_i
  0.70   prime           -0.71   dlrs
  0.67   rate            -0.35   world
  0.63   interest        -0.33   sees
  0.60   rates           -0.25   year
  0.46   discount        -0.24   group
  0.43   bundesbank      -0.24   dlr
More Than Two Classes

• One-of classification: each document belongs to exactly one class
  – How do we compose separating surfaces into regions?
• Any-of or multiclass classification
  – For n classes, decompose into n binary problems
• Vector space classifiers for one-of classification
  – Use a set of binary classifiers
  – Centroid classification
  – k nearest neighbor classification

Composing Surfaces: Issues

(Figure: regions marked with “?”)
Set of Binary Classifiers

• Build a separator between each class and its complementary set (docs from all other classes).
• Given a test doc, evaluate it for membership in each class.
• For one-of classification, declare membership in the class with:
  – maximum score
  – maximum confidence
  – maximum probability
• Why different from multiclass classification?

Negative Examples

• Formulate as above, except negative examples for a class are added to its complementary set.

(Figure: positive examples vs. negative examples)
Centroid Classification

• Given training docs for a class, compute their centroid
• Now have a centroid for each class
• Given a query doc, assign it to the class whose centroid is nearest
• Compare to Rocchio

Example

(Figure: centroids and regions for Government, Science, Arts)
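A minimal sketch of centroid classification (my own illustration; it assumes X is a NumPy array of length-normalized document vectors, y a NumPy array of class labels, and uses cosine similarity for "nearest"):

```python
import numpy as np

def train_centroids(X, y):
    """X: array of shape (n_docs, n_terms); y: class label per row.
    Returns one centroid vector per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_centroid(doc, centroids):
    """Assign the document to the class whose centroid is most similar."""
    def cosine(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))
```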
k-Nearest Neighbor

k Nearest Neighbor Classification

• To classify document d into class c
• Define the k-neighborhood N as the k nearest neighbors of d
• Count the number of documents l in N that belong to c
• Estimate P(c|d) as l/k

Example: k = 6 (6NN)

(Figure: P(science | test doc)? with regions for Government, Science, Arts)
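The same estimate in code (a sketch under the same assumptions as the centroid example above: X holds the training document vectors as a NumPy array, y their labels, and cosine similarity stands in for "nearest"):

```python
import numpy as np
from collections import Counter

def knn_class_probability(doc, X, y, target_class, k=6):
    """Estimate P(c | d) as l/k: the fraction of the k nearest training
    documents that belong to the target class."""
    sims = X @ doc / (np.linalg.norm(X, axis=1) * np.linalg.norm(doc))
    neighbors = np.argsort(-sims)[:k]          # indices of the k most similar docs
    l = sum(1 for i in neighbors if y[i] == target_class)
    return l / k

def knn_classify(doc, X, y, k=6):
    """Return the majority class among the k nearest neighbors."""
    sims = X @ doc / (np.linalg.norm(X, axis=1) * np.linalg.norm(doc))
    neighbors = np.argsort(-sims)[:k]
    return Counter(y[i] for i in neighbors).most_common(1)[0][0]
```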
Cover and Hart 1967

• Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate.
• Assume: the query point coincides with a training point.
• Both the query point and the training point contribute error → 2 times the Bayes rate
• In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
kNN Classification: Discussion

• 1NN for “earnings”
  – 95.3% accuracy
• kNN classification
  – no training
  – needs an effective similarity measure
    • cosine, Euclidean distance, Value Difference Metric
    • performance is very dependent on the right similarity metric
  – is computationally expensive
  – is a robust and conceptually simple method
kNN vs. Regression

• Bias/variance tradeoff
• Variance ≈ capacity
• kNN has high variance and low bias.
• Regression has low variance and high bias.
• Consider: Is an object a tree? (Burges)
• Too much capacity/variance, low bias
  – Botanist who memorizes
  – Will always say “no” to a new object (e.g., different # of leaves)
• Not enough capacity/variance, high bias
  – Lazy botanist
  – Says “yes” if the object is green
kNN: Discussion

• Classification time linear in training set
• No feature selection necessary
• Scales well with large number of classes
  – Don't need to train n classifiers for n classes
• Classes can influence each other
  – Small changes to one class can have a ripple effect
• Scores can be hard to convert to probabilities
• No training necessary
  – Actually: not true. Why?

Number of Neighbors

(Figure)
Support Vector Machines

Recall: Which Hyperplane?

• In general, lots of possible solutions for a, b, c.
• Support Vector Machine (SVM) finds an optimal solution.

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Quadratic programming problem
• Text classification method du jour

(Figure: support vectors lying on the margin; maximize the margin)
Maximum Margin: Formalization

• w: hyperplane normal
• x_i: data point i
• y_i: class of data point i (+1 or -1)

Constrained optimization formalization:

(1) y_i (w \cdot x_i + b) \geq 1 for all i
(2) maximize margin: 2 / \|w\|
Quadratic Programming

• One can show that the hyperplane w with maximum margin is:

  w = \sum_i \alpha_i y_i x_i

  – \alpha_i: Lagrange multipliers
  – x_i: data point i
  – y_i: class of data point i (+1 or -1)

• where the \alpha_i are the solution to maximizing:

  L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, (x_i \cdot x_j)

• Most \alpha_i will be zero.
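Once a quadratic programming solver has produced the alphas and the bias b (the slides treat that step as a black box), the classifier only needs the support vectors, i.e. the points with non-zero alpha. A minimal sketch of that decision function (my own illustration, assuming NumPy arrays):

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b):
    """f(x) = sum_i alpha_i * y_i * (x_i . x) + b over the support vectors."""
    return sum(a * y * float(sv @ x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

def svm_classify(x, support_vectors, alphas, labels, b):
    """The predicted class is the sign of the decision function."""
    return +1 if svm_decision(x, support_vectors, alphas, labels, b) > 0 else -1
```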
Building an SVM Classifier

• Now we know how to build a separator for two linearly separable classes
• What about classes whose exemplary docs are not linearly separable?

Not Linearly Separable

Find a line that penalizes points on “the wrong side”

Penalizing Bad Points

Define a distance for each point with respect to the separator ax + by = c:
  (ax + by) - c for red points
  c - (ax + by) for blue points.
Negative for bad points.

Solve Quadratic Program

• Solution gives a “separator” between the two classes: choice of a, b
• Given a new point (x, y), can score its proximity to each class:
  – evaluate ax + by
  – set a confidence threshold
Performance of SVM

• SVMs are seen as the best-performing method by many
• Statistical significance of most results is not clear
• There are many methods that perform about as well as SVM
• Example: regularized regression (Zhang & Oles)
• Example of a comparison study: Yang & Liu

Yang & Liu: SVM vs. Other Methods

Yang & Liu: Statistical Significance

Yang & Liu: Small Classes

Results for Kernels (Joachims)
SVM: Summary

• SVMs have optimal or close to optimal performance
• Kernels are an elegant and efficient way to map data into a better representation
• SVMs can be expensive to train (quadratic programming)
• If efficient training is important, and slightly suboptimal performance is ok, don't use SVMs
• For text, a linear kernel is common
• So most SVMs are linear classifiers (like many others), but find a (close to) optimal separating hyperplane

SVM: Summary (cont.)

• Model parameters based on a small subset (the support vectors)
• Based on structural risk minimization:

  R(\alpha) \leq R_{emp}(\alpha) + \sqrt{\frac{h(\log(2l/h) + 1) - \log(\eta/4)}{l}}

• Supports kernels: the dot product x_i \cdot x_j can be replaced by a kernel K(x_i, x_j) in

  L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, (x_i \cdot x_j)
Resources

• Manning and Schütze. Foundations of Statistical Natural Language Processing. Chapter 16. MIT Press.
• Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
• Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery (1998).
• R. M. Tong, L. A. Appelbaum, V. N. Askman, J. F. Cunningham. Conceptual Information Retrieval using RUBRIC. Proc. ACM SIGIR, 247-253 (1987).
• S. T. Dumais. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), Jul/Aug 1998.
• Yiming Yang, S. Slattery and R. Ghani. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, Volume 18, Number 2, March 2002.
• Yiming Yang, Xin Liu. A re-examination of text categorization methods. 22nd Annual International SIGIR (1999).
• Tong Zhang, Frank J. Oles. Text Categorization Based on Regularized Linear Classification Methods. Information Retrieval 4(1): 5-31 (2001).