Neural Text Categorizer for Exclusive Text Categorization
Journal of Information Processing Systems, Vol.4, No.2,
June 2008
Taeho Jo*
Presenter: 林昱志
Outline
Introduction
Related Work
Method
Experiment
Conclusion
Introduction
Two types of approaches to text categorization
Rule based - Rules are defined manually in the form of if-then-else
Advantage
1) High precision
Disadvantages
1) Poor recall
2) Poor flexibility
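The rule based approach above can be sketched as a chain of hand-written if-then-else rules. The keywords and category names below are hypothetical illustrations, not taken from the paper:

```python
# Minimal sketch of rule-based text categorization: hand-written
# if-then-else rules (keywords and categories are hypothetical).
def classify_rule_based(text):
    words = set(text.lower().split())
    if {"goalkeeper", "penalty"} & words:
        return "sports"     # a rule that fires tends to be precise
    if {"senate", "election"} & words:
        return "politics"
    return None             # no rule fires: the source of poor recall
```

Documents matching no rule go unclassified, which is why precision is high but recall and flexibility suffer.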
Machine learning - Using sample labeled documents
Advantage
1) Much higher recall
Disadvantages
1) Slightly lower precision than rule based
2) Poor flexibility
Focuses on the machine learning based approach, discarding the rule based one
All the raw data should be encoded into numerical vectors
Encoding documents leads to two main problems
1) Huge dimensionality
2) Sparse distribution
Proposes two solutions
1) String vector –
Provides more transparency in classification
2) NTC (Neural Text Categorizer) –
Classifies documents with sufficient robustness
Solves the huge-dimensionality problem
Related Work
Machine learning algorithms applied to text categorization
1) KNN (K Nearest Neighbor)
2) NB (Naïve Bayes)
3) SVM (Support Vector Machine)
4) BP (Back Propagation)
KNN was evaluated as a simple algorithm competitive with
Support Vector Machines by Sebastiani in 2002
Disadvantage
1) Classification is very time-consuming, since each test object must be compared against all training examples
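A minimal sketch of KNN classification, assuming documents have already been encoded as numerical vectors; cosine similarity and k=3 are illustrative choices. Note the scan over the whole training set on every query, which is where the classification-time cost comes from:

```python
# Sketch of KNN text classification over numerical vectors.
# Every query is compared against ALL training examples, so
# classification time grows linearly with the corpus size.
from collections import Counter
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, training, k=3):
    # training: list of (vector, label) pairs; O(len(training)) per query
    scored = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```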
Mladenic and Grobelnik evaluated feature selection methods within the application of NB in 1999
Androutsopoulos used NB in 2000 to implement a spam mail filtering system as a real application of text categorization
NB requires encoding documents into numerical vectors
SVM has become more popular than the KNN and NB machine
learning algorithms
Defines a hyper-plane as the boundary between classes
Applicable only to a linearly separable distribution of training
examples
Optimizes the weights of the inner products of training examples
and the input vector, called Lagrange multipliers
Defines two hyper-planes as the boundary between two classes with a
maximal margin, shown in Figure 1.
Figure 1.
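The SVM decision rule described above can be sketched directly: the class of an input vector x is the sign of a weighted sum of inner products with the training examples, where the weights are the Lagrange multipliers found during training. The numeric values in the test are illustrative, not trained:

```python
# Sketch of the SVM decision function:
#   f(x) = sum_i alpha_i * y_i * <x_i, x> + b ;  class = sign(f(x))
# support_vectors: training examples with nonzero Lagrange multipliers
# alphas: their Lagrange multipliers; labels: their classes (+1 / -1)
def svm_decide(x, support_vectors, alphas, labels, bias):
    score = sum(a * y * sum(xi * xj for xi, xj in zip(sv, x))
                for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if score + bias >= 0 else -1
```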
Advantage
1) Tolerant to huge dimensionality of numerical vectors
Disadvantages
1) Applicable only to binary classification
2) Fragile in representing documents as numerical vectors
Ruiz and Srinivasan used a hierarchical combination of BPs, called
HME (Hierarchical Mixture of Experts), instead of a single BP
in 2002
Observed that HME performs better than a single BP
Disadvantages
1) Training costs much time and is slow
2) Not practical
Study Aim
Two problems
1) Huge dimensionality
2) Sparse distribution
Two successful methods
1) String vectors
2) A new neural network
Method
Numerical Vectors
Figure 2.
- Frequency of the word wk in the document
- Total number of documents in the corpus
- The number of documents including the word wk in the corpus
Figure 3.
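The three quantities above are the ingredients of a TF-IDF style weight. The exact formula in the paper's Figure 3 may differ, but a common way they combine is:

```python
# Sketch of a TF-IDF weight built from the three quantities above.
import math

def tfidf_weight(tf_wk, n_docs, df_wk):
    # tf_wk : frequency of the word wk in the document
    # n_docs: total number of documents in the corpus
    # df_wk : number of documents in the corpus containing wk
    return tf_wk * math.log(n_docs / df_wk)
```

A word appearing in every document gets weight zero (log of 1), which is how the scheme discounts uninformative words.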
Encoding a document into its string vector
Figure 4.
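A minimal sketch of the string-vector encoding: instead of numerical weights, the vector holds the document's most relevant words themselves. Here relevance is approximated by raw frequency, which is an assumption; the paper's Figure 4 may use a different selection criterion:

```python
# Sketch of encoding a document into a string vector of its d most
# frequent words, keeping the representation small and interpretable.
from collections import Counter

def to_string_vector(text, d=3):
    words = text.lower().split()
    return [w for w, _ in Counter(words).most_common(d)]
```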
Text Categorization Systems
Proposed neural network (NTC)
Consists of three layers
1) Input layer
2) Output layer
3) Learning layer
Input Layer - Each node corresponds to a word in the string vector
Learning Layer - Nodes correspond to the predefined categories
Output Layer - Generates categorical scores; its nodes correspond to
the predefined categories
Figure 5.
A string vector is denoted by x = [t1, t2, ..., td], where ti (1 ≤ i ≤ d) is a word
The predefined categories are denoted by C = {c1, c2, ..., c|C|}, with 1 ≤ j ≤ |C|
Wji denotes the weight between input node i and output node j
Figure 6.
Oj: output node corresponding to the category Cj
It computes the membership of the given input vector, x, in the category Cj
Figure 7.
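The scoring in Figure 7 can be sketched as follows, under the assumption (consistent with the weight definition above) that a category's score sums the weights Wji of the words actually present in the input string vector:

```python
# Sketch of NTC output scoring: category c_j's score is the sum of
# learned weights W_ji for the words t_i in the input string vector;
# words the category has no weight for contribute nothing.
def category_score(string_vector, weights_j):
    # weights_j: dict mapping word -> W_ji for category c_j
    return sum(weights_j.get(word, 0.0) for word in string_vector)

def classify(string_vector, weights):
    # weights: dict category -> {word: W_ji}; pick the max-scoring category
    return max(weights, key=lambda c: category_score(string_vector, weights[c]))
```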
Each string vector in the training set has its own target label, Cj
If its classified category, Ck, is identical to the target category, Cj
Figure 8.
Inhibits the weights for a misclassified category
to minimize the classification error
Figure 9.
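A hedged sketch of the learning rule described above: when a training string vector is misclassified, weights to the winning (wrong) category are inhibited and weights to the target category are reinforced, while a correct classification leaves the weights unchanged. The fixed learning rate is an illustrative assumption, not the paper's exact update in Figures 8 and 9:

```python
# Sketch of an NTC-style weight update on misclassification:
# reinforce the target category, inhibit the wrongly predicted one.
def update(weights, string_vector, target, predicted, lr=0.1):
    if predicted == target:
        return                      # correct: weights unchanged
    for word in string_vector:
        weights[target][word] = weights[target].get(word, 0.0) + lr
        weights[predicted][word] = weights[predicted].get(word, 0.0) - lr
```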
Experiment
Evaluates the five approaches on a test bed called '20NewsGroups'
Each category contains an identical number of test documents
The test bed consists of 20 categories and 20,000 documents
Uses micro-averaged and macro-averaged evaluation methods
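The two averaging schemes differ in when the averaging happens: micro-averaging pools the per-category counts before computing the rate, while macro-averaging computes a rate per category and then averages. A sketch for a simple correct/total rate (the paper may apply the same idea to precision, recall, or F1):

```python
# Sketch of micro- vs macro-averaging over per-category results.
def micro_macro(per_category):
    # per_category: list of (correct, total) pairs, one per category
    micro = sum(c for c, _ in per_category) / sum(t for _, t in per_category)
    macro = sum(c / t for c, t in per_category) / len(per_category)
    return micro, macro
```

With the balanced 20NewsGroups test bed the two coincide; they diverge only when category sizes differ.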
Back propagation is the best approach
NB is the worst approach with the decomposition of the task
Figure 10. Evaluation of the five text classifiers on 20NewsGroups with decomposition
The classifier answers each test document with one of the 20 categories
Two groups exist
1) Better group - BP and NTC
2) Worse group - NB and KNN
Figure 11. Evaluation of the five text classifiers on 20NewsGroups without decomposition
Conclusion
Used a full inverted index as the basis for operations on string
vectors, instead of a restricted-size similarity matrix
Note the trade-off between the two bases for operations on string
vectors
NB and BP are considered for modification into versions adaptable
to string vectors, but this may be insufficient for other algorithms
Future research: modifying other machine learning algorithms