
Atul Kumar, Megha Jain
Speaker: Shau-Shiang Hung
Date: 99/3/16
• Introduction
• Problem Statement
• Learning Methods
• Experiment
• Conclusion
• With the increase in the volume of information on the Internet, we need efficient tools to help us better manage this information.
• Automatic text classification can be formulated as follows: given a training set $D_{train} = \{(d_1, L_1), \ldots, (d_n, L_n)\}$, where each document $d_i$ belongs to a document set $D$ and each label $L_i = L(d_i)$ is drawn from a predefined set of categories $C = \{c_1, \ldots, c_m\}$.
• The goal is to devise a learning algorithm that, given the training set $D_{train}$ as input, generates a classifier.
• The most popular classifier is Naïve Bayes.
• For highly similar categories, its results are very poor.
• It does not consider the conceptual and contextual meaning of words.
• Focus on comparing the performance of the Naïve Bayes and K-Nearest Neighbor classification algorithms.
• Using two datasets:
– One with mostly different categories
– The other with somewhat similar categories
• Add contextual and conceptual information:
– Synonyms of the frequent words of documents
– Titles of the documents
– Considering phrases instead of words in the document
1. Preprocessing
• Removal of HTML/SGML tags and commonly occurring words, and word stemming.
2. Indexing
• Each document is represented as a vector of words.
• Weighting techniques (a sketch follows this list):
– Boolean weight
– Word frequency weight
– tf*idf weight
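To make the three weighting schemes concrete, here is a minimal Python sketch; the toy corpus and the idf variant (log(N/df)) are illustrative assumptions, not the paper's exact setup.

```python
import math
from collections import Counter

# Toy corpus of tokenized documents (illustrative only).
docs = [["wheat", "crop", "export"],
        ["wheat", "price", "rise"],
        ["oil", "price", "rise"]]

vocab = sorted({w for d in docs for w in d})
N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(w for d in docs for w in set(d))

def boolean_weight(doc):
    # 1 if the term occurs in the document, else 0.
    return [1 if w in doc else 0 for w in vocab]

def word_freq_weight(doc):
    # Raw count of the term in the document.
    tf = Counter(doc)
    return [tf[w] for w in vocab]

def tfidf_weight(doc):
    # tf * idf with idf = log(N / df) -- one common variant of the scheme.
    tf = Counter(doc)
    return [tf[w] * math.log(N / df[w]) for w in vocab]

for d in docs:
    print([round(x, 3) for x in tfidf_weight(d)])
```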
3. Dimensionality Reduction
• Document frequency threshold
• Mutual information
• Information gain (IG)
– A term-goodness criterion.
– Measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document:

$G(t) = -\sum_{i=1}^{m}\Pr(c_i)\log_2\Pr(c_i) + \Pr(t)\sum_{i=1}^{m}\Pr(c_i \mid t)\log_2\Pr(c_i \mid t) + \Pr(\bar{t})\sum_{i=1}^{m}\Pr(c_i \mid \bar{t})\log_2\Pr(c_i \mid \bar{t})$
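A minimal Python sketch of this formula, computing G(t) from binary term presence; the toy documents and labels at the end are illustrative only.

```python
import math
from collections import Counter

def information_gain(term, docs, labels):
    """G(t) per the formula above: entropy of the category distribution
    minus the conditional entropy given presence/absence of `term`.
    `docs` are token lists, `labels` the category of each document."""
    n = len(docs)
    pos = [i for i in range(n) if term in docs[i]]
    neg = [i for i in range(n) if term not in docs[i]]
    p_t = len(pos) / n

    def plogp_sum(idxs):
        # sum over categories of Pr(c) * log2 Pr(c), estimated on idxs.
        counts = Counter(labels[i] for i in idxs)
        total = sum(counts.values())
        if not total:
            return 0.0
        return sum((c / total) * math.log2(c / total) for c in counts.values())

    return (-plogp_sum(range(n))           # -sum Pr(c) log2 Pr(c)
            + p_t * plogp_sum(pos)         # + Pr(t) sum Pr(c|t) log2 Pr(c|t)
            + (1 - p_t) * plogp_sum(neg))  # + Pr(~t) sum Pr(c|~t) log2 Pr(c|~t)

# Toy example: "wheat" perfectly separates the two categories,
# so its gain equals the full category entropy (~0.918 bits).
docs = [["wheat", "export"], ["wheat", "crop"], ["oil", "price"]]
labels = ["grain", "grain", "crude"]
print(information_gain("wheat", docs, labels))
```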
4. Data set
• Used the Reuters-21578 data set.
• Opted for the category Grain.
• Used 11 categories:
– 6 categories not related to the category Grain
– 5 categories related to Grain
• 1/3 of the documents formed the test set; the remaining 2/3 formed the training set (a split sketch follows).
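A sketch of such a selection and split, using NLTK's copy of the Reuters-21578 corpus (the ApteMod version). The slide does not name the 11 categories, so the related/unrelated lists below are hypothetical guesses from the corpus's real topic labels, and the random split is an assumption since the original split procedure is not specified.

```python
import random
from nltk.corpus import reuters  # NLTK's ApteMod copy of Reuters-21578
# (requires: nltk.download("reuters"))

# Hypothetical category lists -- the slide does not name them.
related = ["wheat", "corn", "barley", "oat", "rice"]               # 5 related to Grain
unrelated = ["acq", "crude", "earn", "money-fx", "ship", "trade"]  # 6 unrelated

# Collect unique document ids (a document may carry several topic labels).
doc_ids = sorted({fid
                  for cat in ["grain"] + related + unrelated
                  for fid in reuters.fileids(categories=cat)})

# 1/3 test, 2/3 training as on the slide.
random.seed(0)
random.shuffle(doc_ids)
cut = len(doc_ids) // 3
test_ids, train_ids = doc_ids[:cut], doc_ids[cut:]
print(f"train={len(train_ids)}, test={len(test_ids)}")
```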
• Novel tagging techniques:
1. Prepending the words of the title with "title_".
2. Synthesizing new features from the original set of features by grouping consecutive words into phrases (bigram and trigram models).
3. Adding synonyms of the 10 most important words (determined on the basis of mutual information).
• Ran K-NN and Naïve Bayes on the two datasets (a sketch follows).
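A minimal sketch of such a run, assuming scikit-learn; the slide does not specify k for K-NN or the exact document weighting, so n_neighbors=5 and tf*idf vectors are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def run_classifiers(train_texts, train_labels, test_texts, test_labels):
    # tf*idf document vectors, as in the indexing step.
    vec = TfidfVectorizer(stop_words="english")
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)

    for name, clf in [("Naive Bayes", MultinomialNB()),
                      ("K-NN", KNeighborsClassifier(n_neighbors=5))]:
        clf.fit(X_train, train_labels)
        acc = accuracy_score(test_labels, clf.predict(X_test))
        print(f"{name}: accuracy = {acc:.3f}")
```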
• Applied the bigram and trigram models to the data (see the sketch below).
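For illustration, scikit-learn's CountVectorizer can synthesize such phrase features; the example sentence is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["the quick brown fox jumps"]

# Bigram model: consecutive word pairs become features.
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(text)
print(bigrams.get_feature_names_out())
# ['brown fox' 'fox jumps' 'quick brown' 'the quick']

# Trigram model: consecutive word triples become features.
trigrams = CountVectorizer(ngram_range=(3, 3)).fit(text)
print(trigrams.get_feature_names_out())
# ['brown fox jumps' 'quick brown fox' 'the quick brown']
```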
• Selected the top 10 words having the largest information gain with each category (sketched below).
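One way to sketch this selection, assuming scikit-learn: mutual_info_classif on binary term features measures the mutual information between each term and a one-vs-rest category label, which matches the IG criterion above up to the log base (nats rather than bits), leaving the ranking unchanged.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

def top_terms_per_category(texts, labels, k=10):
    # Binary presence/absence features, matching the IG definition above.
    vec = CountVectorizer(binary=True)
    X = vec.fit_transform(texts)
    terms = vec.get_feature_names_out()
    labels = np.asarray(labels)

    for cat in np.unique(labels):
        # One-vs-rest: information gain of each term for this category.
        mi = mutual_info_classif(X, labels == cat, discrete_features=True)
        top = np.argsort(mi)[::-1][:k]
        print(cat, [terms[i] for i in top])
```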
• Tagged each word in the title with the prefix "title_".
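A minimal sketch of this tagging step; the helper name tag_title_words and the example strings are hypothetical.

```python
def tag_title_words(title, body):
    # Prefix each title token with "title_" so the classifier can
    # distinguish title words from the same words in the body.
    title_tokens = ["title_" + w for w in title.lower().split()]
    return title_tokens + body.lower().split()

print(tag_title_words("Grain Exports Rise", "wheat exports rose sharply"))
# ['title_grain', 'title_exports', 'title_rise',
#  'wheat', 'exports', 'rose', 'sharply']
```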
• First selected the 10 words having the largest information gain among all the documents, then retrieved all the synonyms of those words and replaced those words with their full synonym sets in each document.
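One way to realize the synonym step, assuming WordNet via NLTK as the synonym source (the slide does not say which thesaurus was used); expand_document is a hypothetical helper.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def synonyms(word):
    # All WordNet lemma names across the word's synsets.
    return {lemma.name().replace("_", " ")
            for syn in wn.synsets(word)
            for lemma in syn.lemmas()}

def expand_document(tokens, important_words):
    # Replace each high-IG word with its full synonym set,
    # keeping every other token unchanged.
    out = []
    for w in tokens:
        if w in important_words:
            out.extend(sorted(synonyms(w)) or [w])
        else:
            out.append(w)
    return out
```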
• For similar categories, the Naïve Bayes and KNN classifiers perform poorly.
• After adding contextual and conceptual information, title replication and synonym inclusion increase the accuracy by a significant amount for the Naïve Bayes classifier.
• The bigram, trigram, and information gain techniques do not increase the accuracy for either classifier.