Atul Kumar, Megha Jain
Speaker: Shau-Shiang Hung
Date: 99/3/16
Introduction
Problem Statement
Learning Methods
Experiment
Conclusion
With the increase in the volume of information on the Internet, we need an efficient tool to help better manage this information.
Automatic text classification can be formulated as follows: given a training set $D_{train} = \{(d_1, L_1), \ldots, (d_n, L_n)\}$, each document $d_i$ belongs to a document set $D$, and its label $L_i = L(d_i)$ is drawn from a predefined set of categories $C = \{c_1, \ldots, c_m\}$.
The goal is to devise a learning algorithm that, given the training set $D_{train}$ as input, generates a classifier.
The most popular classifier is Naïve Bayes.
For very similar categories, its results are very poor, because it does not consider the conceptual and contextual meaning of the words.
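Below is a minimal sketch of this setup, assuming scikit-learn; the documents, labels, and category names are illustrative, not the paper's data.

```python
# Minimal sketch of the D_train -> classifier setup (illustrative data,
# not the paper's corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# D_train = {(d_1, L_1), ..., (d_n, L_n)}
docs = ["wheat exports rose sharply", "corn harvest delayed by rain"]
labels = ["wheat", "corn"]            # each L_i is drawn from C

vectorizer = CountVectorizer()        # each d_i becomes a word-count vector
X_train = vectorizer.fit_transform(docs)

clf = MultinomialNB()                 # the Naive Bayes classifier
clf.fit(X_train, labels)

X_new = vectorizer.transform(["rain delayed the corn harvest"])
print(clf.predict(X_new))             # -> ['corn']
```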
• Focus on comparing the performance of the Naïve Bayes and K-Nearest Neighbor classification algorithms.
• Using two datasets:
– one with mostly different categories,
– the other with somewhat similar categories.
• Add contextual and conceptual information:
– synonyms of the frequent words of documents,
– titles of the documents,
– phrases instead of single words in the document.
1. Preprocessing
Removal of HTML/SGML tags and commonly occurring words, and word stemming.
2. Indexing
Each document is represented as a vector of words.
Weighting techniques:
Boolean weight
Word frequency weight
tf*idf weight
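A short sketch of these three weighting schemes, again assuming scikit-learn; the sample documents are made up for illustration.

```python
# Sketch of the three weighting schemes (illustrative documents).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["grain prices rise", "grain exports fall", "oil prices rise"]

bool_X  = CountVectorizer(binary=True).fit_transform(docs)  # Boolean weight
freq_X  = CountVectorizer().fit_transform(docs)             # word frequency weight
tfidf_X = TfidfVectorizer().fit_transform(docs)             # tf*idf: tf(t,d) * idf(t)
```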
3. Dimensionality Reduction
Document frequency thresholding
Mutual information
Information gain (IG)
IG is a term-goodness criterion: it measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document.
$$G(t) = -\sum_{i=1}^{m} \Pr(c_i) \log_2 \Pr(c_i) + \Pr(t) \sum_{i=1}^{m} \Pr(c_i \mid t) \log_2 \Pr(c_i \mid t) + \Pr(\bar{t}) \sum_{i=1}^{m} \Pr(c_i \mid \bar{t}) \log_2 \Pr(c_i \mid \bar{t})$$
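The following is a direct sketch of computing $G(t)$ from the formula above, assuming a binary term-document matrix; the function and variable names are hypothetical, not the paper's code.

```python
# Sketch of the information-gain criterion G(t), computed from a binary
# term-document matrix X (documents x terms) and category labels y.
import numpy as np

def information_gain(X, y):
    X = np.asarray(X) > 0                      # presence/absence of each term
    classes, y_idx = np.unique(y, return_inverse=True)
    m, n_docs = len(classes), X.shape[0]

    pc = np.bincount(y_idx) / n_docs           # Pr(c_i)
    entropy = -(pc * np.log2(pc)).sum()        # -sum_i Pr(c_i) log2 Pr(c_i)

    gains = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        present = X[:, j]
        cond = 0.0                             # Pr(t)*sum(...) + Pr(!t)*sum(...)
        for mask in (present, ~present):
            p = mask.mean()                    # Pr(t) or Pr(!t)
            if p == 0:
                continue
            pct = np.bincount(y_idx[mask], minlength=m) / mask.sum()
            pct = pct[pct > 0]
            cond += p * (pct * np.log2(pct)).sum()
        gains[j] = entropy + cond              # G(t)
    return gains

# e.g. pick the 10 terms with the largest gain:
# top10 = np.argsort(information_gain(X, y))[-10:]
```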
4. Data set
Using the Reuters-21578 data set.
Opted for the category Grain.
Used 11 categories:
6 categories not related to the category Grain
5 categories related to Grain
One third of the documents formed the test set; the remaining two thirds formed the training set.
Novel tagging techniques
1. Prefixing the words of the title with "title_".
2. Synthesizing new features from the original set of features by grouping consecutive words into phrases (bigram and trigram models).
3. Adding synonyms of the 10 most important words (determined on the basis of mutual information).
Ran K-NN and Naïve Bayes on the two datasets.
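A hypothetical end-to-end sketch of this baseline run with scikit-learn; the tiny corpus below only stands in for the Reuters-21578 subset.

```python
# Baseline comparison of Naive Bayes and K-NN on tf*idf features
# (illustrative corpus, not the Reuters data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

docs = ["wheat exports rise", "corn harvest delayed", "wheat crop expands",
        "corn prices fall", "oil output climbs", "gold futures slip",
        "oil supply tightens", "gold demand grows"]
labels = ["grain", "grain", "grain", "grain",
          "other", "other", "other", "other"]

X = TfidfVectorizer().fit_transform(docs)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                          random_state=0)

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("K-NN", KNeighborsClassifier(n_neighbors=3))]:
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```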
Apply the bigram and trigram models on the data.
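One way to synthesize the phrase features, assuming scikit-learn's built-in n-gram support; the example sentence is illustrative.

```python
# Sketch of phrase synthesis: keep the original words and add bigram and
# trigram features.
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 3))   # unigrams, bigrams, trigrams
vec.fit(["grain exports fell sharply"])
print(vec.get_feature_names_out())
# ['exports' 'exports fell' 'exports fell sharply' 'fell' 'fell sharply'
#  'grain' 'grain exports' 'grain exports fell' 'sharply']
```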
Select the top 10 words having the largest information gain for each category.
We tag each word in the title with the prefix "title_".
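A minimal sketch of this tagging step; tag_title is a hypothetical helper, not the paper's code.

```python
# Prefix every title word with "title_" so that title terms become
# features distinct from the same words in the body.
def tag_title(title, body):
    tagged = " ".join("title_" + w for w in title.lower().split())
    return body + " " + tagged

print(tag_title("Grain Exports", "exports of grain fell"))
# exports of grain fell title_grain title_exports
```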
First, select the 10 words having the largest information gain among all the documents. Then obtain all the synonyms of those words and replace those words with their synonyms in each document.
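A sketch of this synonym expansion, assuming NLTK's WordNet interface (the corpus must be downloaded once with nltk.download('wordnet')); the word lists are illustrative.

```python
# Sketch of synonym replacement via WordNet (illustrative, not the paper's code).
from nltk.corpus import wordnet

def synonyms(word):
    return {lemma.name() for syn in wordnet.synsets(word)
            for lemma in syn.lemmas()} - {word}

def expand(doc, important_words):
    out = []
    for w in doc.split():
        # replace an important word with all of its synonyms,
        # keep every other word unchanged
        out.extend(sorted(synonyms(w)) if w in important_words else [w])
    return " ".join(out)

print(expand("the grain shipment arrived", {"grain"}))
```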
For similar categories, the Naïve Bayes and KNN classifiers perform poorly.
After adding contextual and conceptual information, title replication and synonym inclusion increase the accuracy by a significant amount for the Naïve Bayes classifier.
The bigram, trigram, and information-gain techniques do not increase the accuracy for either classifier.