Neural Text Categorizer for Exclusive Text Categorization
Journal of Information Processing Systems, Vol.4, No.2,
June 2008
Taeho Jo*
Presenter: 林昱志
Outline
 Introduction
 Related Work
 Method
 Experiment
 Conclusion
Introduction
 Two types of approaches to text categorization
 Rule based - Rules are defined manually in the form of if-then-else statements
 Advantage
1) High precision
 Disadvantages
1) Poor recall
2) Poor flexibility
Introduction
 Machine learning - Using sample labeled documents
 Advantage
1) Much higher recall
 Disadvantages
1) Slightly lower precision than rule based
2) Poor flexibility
Introduction
 Focuses on the machine learning based approach, discarding the rule based one
 All the raw data should be encoded into numerical vectors
 Encoding documents leads to two main problems (see the sketch after this list)
1) Huge dimensionality
2) Sparse distribution
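
A minimal sketch of these two problems, using a made-up three-document corpus (all data here is illustrative, not from the paper):

corpus = [
    "neural networks classify text",
    "support vector machines classify documents",
    "naive bayes filters spam mail",
]

# The vocabulary is every distinct word in the corpus; on a real corpus it
# grows to tens of thousands of dimensions (huge dimensionality).
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def encode(doc: str) -> list[int]:
    """Encode a document as a word-frequency vector over the vocabulary."""
    words = doc.split()
    return [words.count(term) for term in vocabulary]

vector = encode(corpus[0])
nonzero = sum(1 for value in vector if value > 0)
print(f"dimensions: {len(vector)}, nonzero entries: {nonzero}")
# Most entries are zero (sparse distribution): a single document contains
# only a small fraction of the vocabulary.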
Introduction
 Proposes two approaches
1) String vector –
Provides more transparency in classification
2) NTC (Neural Text Categorizer) –
Classifies documents with sufficient robustness
Solves the huge dimensionality problem
Related Work
 Machine learning algorithms applied to text categorization
1) KNN (K Nearest Neighbor)
2) NB (Naïve Bayes)
3) SVM (Support Vector Machine)
4) BP (Back Propagation)
Related Work
 KNN was evaluated as a simple algorithm competitive with the Support Vector Machine by Sebastiani in 2002
 Disadvantage
1) Very time-consuming when classifying objects (see the sketch below)
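
A sketch of why KNN classification is slow: every query is compared against the whole training set. The cosine similarity and data layout are illustrative assumptions, not the paper's setup.

import math
from collections import Counter

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def knn_classify(query: list[float],
                 training: list[tuple[list[float], str]], k: int = 3) -> str:
    """Scans all training examples per query, hence the high classification cost."""
    neighbors = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]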
Related Work
 Mladenic and Grobelnik evaluated feature selection methods within an application of NB in 1999
 Androutsopoulos used NB in 2000 to implement a spam mail filtering system, a real system based on text categorization
 Requires encoding documents into numerical vectors
Related Work
 SVM has become more popular than the KNN and NB machine learning algorithms
 Defines a hyper-plane as the boundary between classes
 Applicable only to linearly separable distributions of training examples
 Optimizes the weights of the inner products of training examples and the input vector, called Lagrange multipliers (see the sketch below)
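
A sketch of the decision function this describes: the optimized Lagrange multipliers weight the inner products between the input vector and the training examples. The function signature is an assumption for illustration.

def svm_decision(x: list[float],
                 support_vectors: list[list[float]],
                 labels: list[int],         # +1 / -1
                 alphas: list[float],       # learned Lagrange multipliers
                 bias: float) -> int:
    """Classify x by the sign of the weighted sum of inner products."""
    score = bias
    for sv, y, alpha in zip(support_vectors, labels, alphas):
        score += alpha * y * sum(a * b for a, b in zip(sv, x))  # inner product
    return 1 if score >= 0 else -1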
Related Work
 Defines two hyper-planes as the boundary between two classes with a maximal margin (Figure 1)
Figure 1.
Related Work
 Advantage
1) Tolerant to the huge dimensionality of numerical vectors
 Disadvantages
1) Applicable only to binary classification
2) Fragile in representing documents as numerical vectors
Related Work
 Ruiz and Srinivasan used a hierarchical combination of BPs, called HME (Hierarchical Mixture of Experts), instead of a single BP in 2002
 Observed that HME is the better combination of BPs
 Disadvantages
1) Costs much time; very slow
2) Not practical
Study Aim
 Two problems
1) Huge dimensionality
2) Sparse distribution
 Two successful methods
1) String vectors
2) A new neural network
Method
 Numerical Vectors
Figure 2.
Method
The TF-IDF weight of a word wk in a document:
weight(wk) = tf(wk) × log( N / df(wk) )
tf(wk) : frequency of the word wk in the document
N : total number of documents in the corpus
df(wk) : number of documents including the word wk in the corpus
Figure 3.
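
A direct computation of this weight, assuming the standard TF-IDF form given above:

import math

def tf_idf(tf_k: int, N: int, df_k: int) -> float:
    """TF-IDF weight: term frequency scaled by inverse document frequency."""
    return tf_k * math.log(N / df_k)

# A word occurring 3 times in a document, in a 20,000-document corpus
# where 500 documents contain it:
print(tf_idf(3, 20_000, 500))  # ~11.07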
Method
 Encoding a document into its string vector
Figure 4.
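
One plausible reading of this encoding, assuming a string vector is the list of a document's d most frequent words (both the ranking criterion and d are assumptions here):

from collections import Counter

def to_string_vector(doc: str, d: int = 3) -> list[str]:
    """Encode a document as its d most frequent words, kept as strings."""
    counts = Counter(doc.lower().split())
    return [word for word, _ in counts.most_common(d)]

print(to_string_vector("the net learns the text the net sees"))
# ['the', 'net', 'learns'] -- words instead of numbers, so no huge
# numerical dimensionality and no sparse zero entries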
Method
 Text Categorization Systems
 Proposed neural network (NTC)
 Consists of three layers
1) Input layer
2) Output layer
3) Learning layer
Method
 Input Layer - Each node corresponds to a word in the string vector
 Learning Layer - Nodes correspond to the predefined categories
 Output Layer - Generates categorical scores; its nodes also correspond to the predefined categories
Figure 5.
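
A minimal skeleton of this three-layer structure. Storing the learning layer as one weight table per category, keyed by word, is an implementation assumption; the paper only specifies weights between words and categories.

from collections import defaultdict

class NTC:
    def __init__(self, categories: list[str]):
        # output layer: one node (score) per predefined category
        self.categories = categories
        # learning layer: one weight per (category, word) pair; the input
        # layer is simply the words of the incoming string vector
        self.weights = {c: defaultdict(float) for c in categories}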
Method
 A string vector is denoted by x = [t1, t2, ..., td], where each ti, 1 ≤ i ≤ d, is a word
 The predefined categories are denoted by C = [c1, c2, ..., c|C|], with cj, 1 ≤ j ≤ |C|
 wji denotes the weight between the category cj and the word ti
Figure 6.
Method
 Oj : the output node corresponding to the category cj
 Its value gives the membership of the given input vector x in the category cj
Figure 7.
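
A sketch of this computation, assuming (as the architecture suggests) that the score Oj is the sum of the weights wji over the words ti appearing in the input string vector x:

def output_score(x: list[str], weights_j: dict[str, float]) -> float:
    """Oj: membership of the string vector x in the category cj."""
    return sum(weights_j.get(t_i, 0.0) for t_i in x)

def classify(x: list[str], weights: dict[str, dict[str, float]]) -> str:
    """weights maps each category cj to its word weights wji."""
    return max(weights, key=lambda c_j: output_score(x, weights[c_j]))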
Method
 Each string vector in the training set has its own target label cj
 If its classified category ck is identical to the target category cj, the weights are left unchanged
Figure 8.
Method
 Weights are inhibited for the misclassified category
 This minimizes the classification error
Figure 9.
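
A sketch of this learning rule. The fixed inhibition rate, and the matching reinforcement of the target category, are assumptions here; the exact update formulas are in Figures 8 and 9.

from collections import defaultdict

def train_example(x: list[str], target: str,
                  weights: dict[str, defaultdict], rate: float = 0.1) -> None:
    """weights: {category: defaultdict(float) mapping word -> weight}."""
    scores = {c: sum(w.get(t, 0.0) for t in x) for c, w in weights.items()}
    predicted = max(scores, key=scores.get)
    if predicted == target:
        return                            # correct: weights left unchanged
    for t in x:
        weights[predicted][t] -= rate     # inhibit the misclassified category
        weights[target][t] += rate        # reinforce the target (an assumption)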
Experiment
 Evaluates the five approaches on a test bed called ‘20NewsGroups’
 Each category contains an identical number of test documents
 The test bed consists of 20 categories and 20,000 documents
 Uses micro-averaged and macro-averaged evaluation methods (see the sketch below)
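
A sketch of the two averaging schemes, assuming the F1 measure over per-category (tp, fp, fn) counts: micro-averaging pools the counts across categories first, while macro-averaging averages the per-category measures.

def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    return 2 * precision * recall / denominator if denominator else 0.0

def micro_f1(counts: list[tuple[int, int, int]]) -> float:
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1(tp, fp, fn)

def macro_f1(counts: list[tuple[int, int, int]]) -> float:
    return sum(f1(*c) for c in counts) / len(counts)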
Experiment
 Back propagation is the best approach
 NB is the worst approach with the decomposition of the task
Figure 10. Evaluation of the five text classifiers on 20NewsGroups with decomposition
Experiment
 The classifier answers each test document by assigning one of the 20 categories
 Two groups exist
1) Better group - BP and NTC
2) Worse group - NB and KNN
Figure 11. Evaluation of the five text classifiers on 20NewsGroups without decomposition
Conclusion
 Used a full inverted index as the basis for operations on string vectors, instead of a restricted-size similarity matrix
 Notes the trade-off between the two bases for operations on string vectors
 NB and BP are considered for modification into versions adaptable to string vectors, but this may be insufficient for modifying other algorithms
 Future research: modifying other machine learning algorithms to operate on string vectors