Partially Supervised Classification of Text Documents

Authors:
Bing Liu
Philip S. Yu
Wee Sun Lee
Xiaoli Li
Presented by: Swetha Nandyala
CIS 525: Neural Computation
Overview
Introduction
Theoretical Foundation
Background Methodology
  NB-C
  EM-Algorithm
Proposed Strategy
Evaluation Measures & Experiments
Conclusion
Text Categorization
… the activity of labeling natural language texts with thematic categories from a predefined set [Sebastiani, 2002]
Text categorization is the task of automatically assigning to a text document d, from a given domain D, a category label c selected from a predefined set of category labels C.
[Figure: the categorization system takes documents from domain D and assigns each one a category label c1, c2, ..., cj, ..., ck from the predefined set C]
Text Categorization (contd.)
Standard supervised learning problem
  Bottleneck: a very large number of labeled training documents is needed to build an accurate classifier
Goal: identify a particular class of documents from a set of mixed, unlabeled documents
  Standard supervised classification is inapplicable
  Partially supervised classification is used instead
Theoretical foundations
AIM: to show that partially supervised classification (PSC) is a constrained optimization problem
Fixed distribution D over X × Y, where Y = {0, 1}
  X, Y: the sets of possible documents and class labels
Two sets of documents:
  P: documents labeled positive, of size n1, drawn from D_{X|Y=1}
  U: unlabeled documents, of size n2, drawn independently from D_X
GOAL: Find the positive documents in U
Theoretical foundations
Learning algorithm: selects a function f ∈ F (a class of functions f: X → {0, 1}) to classify the unlabeled documents
Probability of error: Pr[f(X) ≠ Y] is the sum of the "false positive" and "false negative" probabilities, i.e.
Pr[f(X) ≠ Y] = Pr[f(X) = 1 ∧ Y = 0] + Pr[f(X) = 0 ∧ Y = 1]
After transforming:
Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
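The transformation omitted on the slide follows by expanding the first joint probability (written here in LaTeX for readability):

```latex
\begin{align*}
\Pr[f(X)=1 \wedge Y=0] &= \Pr[f(X)=1] - \Pr[f(X)=1 \wedge Y=1] \\
                       &= \Pr[f(X)=1] - \bigl(\Pr[Y=1] - \Pr[f(X)=0 \wedge Y=1]\bigr), \\
\text{so}\quad \Pr[f(X)\neq Y] &= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \wedge Y=1] \\
                       &= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1].
\end{align*}
```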
Theoretical foundations (contd.)
Recall: Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
Note that Pr[Y = 1] is constant
Approximation: if Pr[f(X) = 0 | Y = 1] is kept small, then
  error ≈ Pr[f(X) = 1] − Pr[Y = 1] ≈ Pr[f(X) = 1] − const
i.e. minimizing Pr[f(X) = 1] ≈ minimizing the error
⇒ minimize Pr_U[f(X) = 1] while keeping Pr_P[f(X) = 1] ≥ r
This is nothing but a constrained optimization problem
⇒ learning is possible
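Written out, the learning problem from this slide is (with r the required fraction of P that must still be classified positive):

```latex
\min_{f \in F} \; \Pr_U[f(X) = 1]
\qquad \text{subject to} \qquad
\Pr_P[f(X) = 1] \ge r
```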
Naïve Bayesian Text Classification
Let D be the set of training documents
C = {c1, c2, ..., c|C|}: the predefined classes; here only c1 and c2
For each di ∈ D, the posterior probabilities Pr[cj | di] are calculated
In the NB model, the class with the highest Pr[cj | di] is assigned to the document
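A minimal sketch of this classification step using scikit-learn's multinomial naive Bayes; the toy documents, labels, and variable names below are illustrative, not taken from the paper:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training documents for the two classes c1 and c2 (illustrative only)
train_docs = ["neural networks learn text representations",
              "stock prices fell sharply today",
              "naive bayes for document classification",
              "the football match ended in a draw"]
train_labels = ["c1", "c2", "c1", "c2"]

vectorizer = CountVectorizer()                  # bag-of-words features
X_train = vectorizer.fit_transform(train_docs)

nb = MultinomialNB()                            # multinomial NB with Laplace smoothing
nb.fit(X_train, train_labels)

# Posterior probabilities Pr[cj | di] for a new document; the arg-max class is assigned
X_new = vectorizer.transform(["learning representations of documents"])
print(nb.predict_proba(X_new))                  # [[Pr[c1|d], Pr[c2|d]]]
print(nb.predict(X_new))                        # class with the highest posterior
```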
The EM-Algorithm
An iterative algorithm for maximum likelihood estimation in problems with incomplete data
Two-step method:
  Expectation step: fills in the missing data
  Maximization step: re-estimates the parameters after the missing data has been filled in
Proposed Strategy
Step 1: Re-initialization
  Iterative EM (I-EM): apply the EM algorithm over P and U
  Identify a set of reliable negative documents from the unlabeled set by introducing spies
Step 2: Building and selecting a classifier
  Spy-EM (S-EM): build a set of classifiers iteratively
  Select a good classifier from the set of classifiers constructed above
Iterative EM with NB-C
Assign every document in P(ositive) to class c1 and every document in U(nlabeled) to class c2:
  Pr[c1 | di] = 1 and Pr[c2 | di] = 0 for each di in P
  Pr[c2 | dj] = 1 and Pr[c1 | dj] = 0 for each dj in U
After this initial labeling, an NB-C is built and used to classify the documents in U:
  the posterior probabilities of the documents in U are revised
Using the revised posterior probabilities, a new NB-C is built
  the iterative process continues until EM converges
Setback: strongly biased towards positive documents
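A compact sketch of I-EM as just described, assuming bag-of-words features and scikit-learn's MultinomialNB. P_docs and U_docs are placeholder lists of raw document strings, soft labels are encoded as weighted duplicate rows, and the simple change-in-posterior convergence test is an assumption rather than the paper's exact criterion:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def i_em(P_docs, U_docs, max_iter=20, tol=1e-4):
    """Iterative EM with a naive Bayes classifier (I-EM)."""
    X = CountVectorizer().fit_transform(P_docs + U_docs)
    X_P, X_U = X[:len(P_docs)], X[len(P_docs):]

    # Initial labeling: Pr[c1|d] = 1 for d in P, Pr[c1|d] = 0 (i.e. c2) for d in U
    post_U = np.zeros(X_U.shape[0])            # current Pr[c1 | d] for documents in U
    nb = None

    for _ in range(max_iter):
        # M-step: build an NB-C from P (class c1 = 1) plus two weighted copies of U,
        # one per class, weighted by the current posterior probabilities.
        X_train = sp.vstack([X_P, X_U, X_U])
        y_train = np.concatenate([np.ones(X_P.shape[0]),      # P -> c1
                                  np.ones(X_U.shape[0]),      # U copies labeled c1
                                  np.zeros(X_U.shape[0])])    # U copies labeled c2
        w_train = np.concatenate([np.ones(X_P.shape[0]), post_U, 1.0 - post_U])
        nb = MultinomialNB().fit(X_train, y_train, sample_weight=w_train)

        # E-step: revise Pr[c1 | d] for every document in U
        new_post = nb.predict_proba(X_U)[:, list(nb.classes_).index(1.0)]
        converged = np.max(np.abs(new_post - post_U)) < tol
        post_U = new_post
        if converged:
            break

    return nb, post_U                           # final NB-C and Pr[c1 | d] on U
```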
Step 1: Re-initialization
Sample a certain percentage of the positive examples, call them S, and put them into the unlabeled set to act as "spies"
The I-EM algorithm is then run as before, but the unlabeled set U now also contains the spy documents
After EM completes, the probabilistic labels of the spies are used to decide which documents are most likely negative (LN)
A threshold t is used for the decision (see the sketch below):
  if Pr[c1 | dj] < t, dj is put into LN (likely negative)
  if Pr[c1 | dj] > t, dj remains in U (unlabeled)
  t is chosen from the posterior probabilities of the spies in S
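A sketch of this re-initialization step built on the i_em function above. The 10% spy rate and the choice of t as the smallest spy posterior (so that essentially all spies stay above the threshold) are illustrative assumptions; the paper's exact rule for setting t from the spies may differ:

```python
import random

def reinitialize_with_spies(P_docs, U_docs, spy_frac=0.10, seed=0):
    """Step 1: plant spies from P into U, run I-EM, and split U into LN and the rest."""
    rng = random.Random(seed)
    spies = rng.sample(P_docs, max(1, int(spy_frac * len(P_docs))))
    P_minus_spies = [d for d in P_docs if d not in spies]

    # Run I-EM with the spies hidden inside the unlabeled set
    nb, post = i_em(P_minus_spies, U_docs + spies)
    post_U, post_spies = post[:len(U_docs)], post[len(U_docs):]

    # Threshold t derived from the spies' posterior probabilities
    t = min(post_spies)                                        # illustrative choice

    LN = [d for d, p in zip(U_docs, post_U) if p < t]          # likely negative
    U_rest = [d for d, p in zip(U_docs, post_U) if p >= t]     # stays unlabeled
    return LN, U_rest
```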
Step-1 effect
[Figure: composition of the document sets - BEFORE: P (positive), spies, U (unlabeled, positives and negatives); AFTER: P (positive), some spies, LN (likely negative), U (unlabeled)]
Before: U = P ∪ N, with no clue which documents are positive and which are negative; spies from P are added to U
After: with the help of the spies, most positives in U stay in the unlabeled set, while most negatives go into LN; the purity of LN is higher than that of U
Step 2: S-EM
Apply EM over P, LN and U; the algorithm proceeds as follows:
1. put all spies S back into P (where they were before)
2. each di ∈ P is assigned to c1 (i.e. Pr[c1 | di] = 1); fixed throughout the iterations
3. each dj ∈ LN is assigned to c2 (i.e. Pr[c2 | dj] = 1); may change during EM
4. each dk ∈ U is initially assigned no label (labels are assigned after the first EM iteration, EM(1))
5. run EM using P, LN and U until it converges
The final classifier is produced when EM stops (a sketch follows below).
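A sketch of this step, reusing the imports and the i_em / reinitialize_with_spies helpers from the earlier blocks. The convergence test is replaced by a fixed iteration budget for brevity, and the function returns the whole sequence of classifiers so that the selection rule on the next slide can be applied; all names are illustrative:

```python
def s_em(P_docs, U_docs, max_iter=20):
    """Step 2: EM over P (labels fixed), LN, and the remaining unlabeled documents."""
    LN, U_rest = reinitialize_with_spies(P_docs, U_docs)   # Step 1; spies go back to P

    vec = CountVectorizer().fit(P_docs + LN + U_rest)
    X_P, X_LN, X_U = vec.transform(P_docs), vec.transform(LN), vec.transform(U_rest)

    post_LN = np.zeros(X_LN.shape[0])       # Pr[c1 | d] for LN, initialized to c2
    post_U = None                           # U gets probabilistic labels after EM(1)
    classifiers = []

    for _ in range(max_iter):
        # M-step: P is always c1; LN (and later U) enter as weighted copies per class
        rows = [X_P, X_LN, X_LN]
        ys   = [np.ones(X_P.shape[0]), np.ones(X_LN.shape[0]), np.zeros(X_LN.shape[0])]
        ws   = [np.ones(X_P.shape[0]), post_LN, 1.0 - post_LN]
        if post_U is not None:
            rows += [X_U, X_U]
            ys   += [np.ones(X_U.shape[0]), np.zeros(X_U.shape[0])]
            ws   += [post_U, 1.0 - post_U]
        nb = MultinomialNB().fit(sp.vstack(rows), np.concatenate(ys),
                                 sample_weight=np.concatenate(ws))
        classifiers.append(nb)

        # E-step: revise Pr[c1 | d] for LN and U; labels of P stay fixed
        c1 = list(nb.classes_).index(1.0)
        post_LN = nb.predict_proba(X_LN)[:, c1]
        post_U = nb.predict_proba(X_U)[:, c1]

    return vec, classifiers
```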
Selecting a Classifier
Recall: Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1]
S-EM generates a set of classifiers, but the classification is not necessarily improving from iteration to iteration
Remedy: stop iterating EM at some point, by estimating the change in the probability of error between iterations i and i+1:
  Δi = Pr[fi+1(X) ≠ Y] − Pr[fi(X) ≠ Y]
  if Δi > 0 for the first time, the i-th classifier produced is the final classifier
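Expanding both error terms with the identity recalled above, the standalone Pr[Y = 1] terms cancel, which suggests how Δi can be estimated from quantities measurable on U and P (Pr[f(X) = 1] from the unlabeled set, Pr[f(X) = 0 | Y = 1] from the positive set). This is only the algebra implied by the slides; the paper's exact estimator is not shown here:

```latex
\begin{align*}
\Delta_i &= \Pr[f_{i+1}(X) \neq Y] - \Pr[f_i(X) \neq Y] \\
         &= \bigl(\Pr[f_{i+1}(X)=1] - \Pr[f_i(X)=1]\bigr) \\
         &\quad + 2\,\Pr[Y=1]\,\bigl(\Pr[f_{i+1}(X)=0 \mid Y=1] - \Pr[f_i(X)=0 \mid Y=1]\bigr)
\end{align*}
```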
Evaluation measures

                    Truth
                    Yes     No
System   Yes         a       b
         No          c       d
Accuracy (of a classifier): A = m/(m+i), where m and i are the numbers of correct and incorrect decisions, respectively (in terms of the table, A = (a+d)/(a+b+c+d))
F-score: F = 2pr/(p+r), a classification performance measure, where
  recall r = a/(a+c)
  precision p = a/(a+b)
The F-value reflects the average (harmonic-mean) effect of precision and recall
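A tiny worked example; the counts a = 40, b = 10, c = 20, d = 30 are made up purely for illustration:

```python
a, b, c, d = 40, 10, 20, 30                 # hypothetical confusion-matrix counts

precision = a / (a + b)                     # 40/50 = 0.8
recall    = a / (a + c)                     # 40/60 ≈ 0.667
f_score   = 2 * precision * recall / (precision + recall)   # ≈ 0.727
accuracy  = (a + d) / (a + b + c + d)       # 70/100 = 0.7

print(precision, recall, f_score, accuracy)
```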
Experiments
30 datasets were created from 2 large document corpora
Objective: recover the positive documents that were placed into the mixed (unlabeled) sets
For each experiment (see the sketch below):
  the full positive set is divided into two subsets, P and R
  P: the positive set used by the algorithm, containing a% of the full positive set
  R: the set of remaining positive documents, of which b% are put into U (not all of R is put into U)
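A sketch of this split; the function name, the default percentages a and b, and the pos_docs / neg_docs inputs are placeholders, not the settings used in the paper:

```python
import random

def make_experiment(pos_docs, neg_docs, a=0.5, b=0.9, seed=0):
    """Split the full positive set into P and R, and build the mixed unlabeled set U."""
    rng = random.Random(seed)
    pos = list(pos_docs)
    rng.shuffle(pos)

    n_P = int(a * len(pos))
    P, R = pos[:n_P], pos[n_P:]               # P is used by the algorithm, R is held back

    hidden_pos = R[:int(b * len(R))]          # b% of R is hidden inside U
    U = hidden_pos + list(neg_docs)
    rng.shuffle(U)
    return P, U, hidden_pos                   # hidden_pos = positives to recover from U
```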
Experiments (contd…)
Techniques compared:
NB-C: applied directly to P (as c1) and U (as c2) to build a classifier, which is then used to classify the documents in U
I-EM: applies EM to P and U until it converges (no spies yet); the final classifier is applied to U to identify its positives
S-EM: spies are used to re-initialize; I-EM builds the final classifier; the threshold t is used
Experiments (contd…)
S-EM dramatically outperforms NB and I-EM in F-score
S-EM outperforms NB and I-EM in accuracy as well
Comment: the datasets are skewed, so accuracy is not a reliable measure of classifier performance
Experiments (contd…)
The results show the strong effect of re-initialization with spies:
  S-EM outperforms I-EMbest
Re-initialization is not, however, the only factor of improvement:
  S-EM outperforms S-EM4
Conclusion: both Step 1 (re-initializing) and Step 2 (selecting the best model) are needed!
Conclusion
Gives an overview of the theory of learning with positive and unlabeled examples
Describes a two-step strategy for learning that produces highly accurate classifiers
Partially supervised classification is most helpful when the initial model is insufficiently trained
Questions?