Text classification: Stan Matwin School of Information Technology and Engineering
Download
Report
Transcript Text classification: Stan Matwin School of Information Technology and Engineering
Text classification:
In Search of a Representation
Stan Matwin
School of Information Technology
and Engineering
University of Ottawa
[email protected]
1
Matwin 1999
Outline
Supervised learning=classification
ML/DM at U of O
Classical approach
Attempt at a linguistic representation
N-grams – how to get them?
Labelling and co-learning
Next steps?…
2
Matwin 1999
Supervised learning
(classification)
Given:
a set of training instances T={et}, where each
t is a class label : one of the classes C1,…Ck
a concept with k classes C1,…Ck (but the
definition of the concept is NOT known)
Find:
a description for each class which will perform
well in determining (predicting) class
membership for unseen instances
3
Matwin 1999
Classification
Prevalent practice:
examples are represented as vectors of
values of attributes
Theoretical wisdom,
confirmed empirically: the more
examples, the better predictive accuracy
4
Matwin 1999
ML/DM at U of O
Learning from imbalanced classes:
applications in remote sensing
a relational, rather than propositional
representation: learning the maintainability
concept
Learning in the presence of background
knowledge. Bayesian belief networks and how
to get them. Appl to distributed DB
5
Matwin 1999
Why text classification?
Automatic file saving
Internet filters
Recommenders
Information extraction
…
6
Matwin 1999
Text classification: standard approach
1. Remove stop words and markings
2. remaining words are all attributes
3. A document becomes a vector
<word, frequency>
4. Train a boolean classifier for each
class
5. Evaluate the results on an unseen
sample
7
Matwin 1999
Text classification: tools
RIPPER
A “covering”learner
Works well with large sets of binary
features
Naïve Bayes
Efficient (no search)
Simple to program
Gives “degree of belief”
8
Matwin 1999
“Prior art”
Yang: best results using k-NN:
82.3% microaveraged accuracy
Joachim’s results using Support
Vector Machine + unlabelled data
SVM insensitive to high
dimensionality, sparseness of
examples
9
Matwin 1999
SVM in Text classification
SVM
Transductive SVM
Maximum separation
Margin for test set
Training with 17 examples in 10
most frequent categories gives test
performance of 60% on 3000+ test
cases available during training
10
Matwin 1999
Problem 1: aggressive feature selection
AI
“Machine”: 50%
“Learning”: 75%
“Machine Learning”: 50%
RIPPER (B.O.W.):
“Machine”: 4%
“Learning”: 75%
“Machine Learning”: 0%
FLIPPER (Cohen):
EP
MT
“Machine”: 80%
“Learning”: 5%
“Machine Learning”: 0%
machine & learning = AI
machine & learning & near & after = AI
•
•
RIPPER (Phrases):
“machine learning” = AI
•
11
Matwin 1999
Problem 2: semantic relationships are missed
weapon
knife
gun
dagger
sword
rifle
slingshot
• Semantically related words may
be sparsely distributed through
many documents
• Statistical learner may be able to
pick up these correlations
• Rule-based learner is
disadvantaged
12
Matwin 1999
Proposed solution (Sam Scott)
Get noun phrases and/or key
phrases (Extractor) and add to the
feature list
Add hypernyms
13
Matwin 1999
Hypernyms - WordNet
weapon
“instance of”
“is a”
knife
gun
“Synset”
pistol,
revolver
“synset”
=> SYNONYM
“is a”
=> HYPERNYM
“instance of” => HYPONYM
14
Matwin 1999
Evaluation (Lewis)
•Vary the “loss ratio” parameter
• For each parameter value
• Learn a hypothesis for each
class (binary classification)
• Micro-average the confusion
matrices (add component-wise)
• Compute precision and recall
• Interpolate (or extrapolate) to
find the point where microaveraged precision and recall are
equal
15
Matwin 1999
Results
Micro-averaged b.e.
BW
BWS
NP
NPS
KP
KPS
H0
H1
NPW
Reuters
.821
.810
.827
.819
.817
.816
.741e
.734e
.823
DigiTrad
.359
.360
.357
.356
.288e
.297e
.283
.281
N/A
No gain over BW in
alternative
representations
But…
Comprehensibility…
16
Matwin 1999
Combining classifiers
#
1
3
5
Reuters
representations
NP
BW, NP, NPS
BW, NP, NPS, KP, KPS
b.e.
.827
.845
.849
DigiTrad
representations
BWS
BW, BWS, NP
BW, BWS, NP, KPS, KP
b.e.
.360
.404e
.422e
Comparable to best known results (Yang)
17
Matwin 1999
Other possibilities
Using hypernyms with a small
training set (avoids ambiguous
words)
Use Bayes+Ripper in a cascade
scheme (Gama)
Other representations:
18
Matwin 1999
Collocations
Do not need to be noun phrases,
just pairs of words possibly
separated by stop words
Only the well discriminating ones are
chosen
These are added to the bag of
words, and…
Ripper
19
Matwin 1999
N-grams
N-grams are substrings of a given length
Good results in Reuters [Mladenic, Grobelnik]
with Bayes; we try RIPPER
A different task: classifying text files
Attachments
Audio/video
Coded
From n-grams to relational features
20
Matwin 1999
How to get good n-grams?
We use Ziv-Lempel for frequent
substring detection (.gz!)
abababa
a
a
b
a
b
b
a
21
Matwin 1999
N-grams
Counting
Pruning:
substring occurrence ratio < acceptance
threshold
Building relations: string A almost always
precedes string B
Feeding into relational learner (FOIL)
22
Matwin 1999
Using grammar induction
(text files)
Idea: detect patterns of substrings
Patterns are regular languages
Methods of automata induction: a
recognizer for each class of files
We use a modified version of RPNI2
[Dupont, Miclet]
23
Matwin 1999
What’s new…
Work with marked up text (Word,
Web)
XML with semantic tags: mixed
blessing for DM/TM
Co-learning
Text mining
24
Matwin 1999
Co-learning
How to use unlabelled data? Or How
to limit the number of examples that
need be labelled?
Two classifiers and two redundantly
sufficient representations
Train both, run both on test set,
add best predictions to training set
25
Matwin 1999
Co-learning
Training set grows as…
…each learner predicts independently
due to redundant sufficiency (different
representations)
would also work with our learners if we
used Bayes?
Would work with classifying emails
26
Matwin 1999
Co-learning
Mitchell experimented with the task
of classifying web pages (profs,
students, courses, projects) – a
supervised learning task
Used
Anchor text
Page contents
Error rate halved (from 11% to 5%)
27
Matwin 1999
Cog-sci?
Co- learning seems to be cognitively
justified
Model: students learning in groups
(pairs)
What other social learning
mechanisms could provide models
for supervised learning?
28
Matwin 1999
Conclusion
A practical task, needs a solution
No satisfactory solution so far
Fruitful ground for research
29
Matwin 1999