CATEGORIZATION OF NEWS ARTICLES USING NEURAL TEXT CATEGORIZER
FUZZ-IEEE 2009, Korea, August 20-24, 2009
Taeho Jo, Inha University
Reporter: 洪紹祥
Adviser: 鄭淑真
OUTLINE
 Introduction
 Previous Works
 Framework
 Empirical Results
 Conclusions
INTRODUCTION(1/2)
 Text categorization is necessary for managing textual data as efficiently as possible.
 Text categorization requires two manual preliminary tasks:
• The predefinition of categories.
• The preparation of labeled sample documents.
INTRODUCTION(2/2)
 Traditional text categorization requires encoding documents into numerical vectors.
• This causes two main problems:
• Huge dimensionality.
• Sparse distribution.
 To solve these two main problems:
• Documents are encoded into string vectors.
• Different from numerical vectors, words are given as feature values.
• A neural network, called NTC (Neural Text Categorizer), is proposed to work directly on string vectors (see the encoding sketch below).
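As a rough illustration of the difference between the two encodings, here is a minimal Python sketch. The feature definition (taking the document's most frequent words as the string-vector values) and the toy document and vocabulary are assumptions for illustration, not necessarily the paper's exact scheme.

```python
# Minimal sketch (assumption: the string vector holds the document's most
# frequent words as feature values; the paper's exact features may differ).
from collections import Counter
import re

def numerical_vector(text, vocabulary):
    """Traditional encoding: one term-frequency entry per vocabulary word.
    The vector is as long as the vocabulary, so it is huge and mostly zeros."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts.get(word, 0) for word in vocabulary]

def string_vector(text, size=10):
    """String-vector encoding: the feature values are words themselves,
    here simply the `size` most frequent words of the document."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    words = [word for word, _ in counts.most_common(size)]
    return words + [""] * (size - len(words))   # pad short documents

doc = "The central bank raised interest rates, and markets reacted to the bank's decision."
vocab = ["bank", "decision", "election", "goal", "markets", "match", "rates", "vote"]
print(numerical_vector(doc, vocab))   # [2, 1, 0, 0, 1, 0, 1, 0] -- mostly zeros at real vocabulary sizes
print(string_vector(doc, size=5))     # ['the', 'bank', 'central', 'raised', 'interest']
```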
PREVIOUS WORKS(1/2)
 Popular approaches for text categorization:
• KNN (K Nearest Neighbor)
• NB (Naïve Bayes)
• SVM (Support Vector Machine)
• Neural Networks
• These approaches encode documents into numerical vectors, which causes two main problems (a baseline sketch is given below):
• Huge dimensionality.
• Sparse distribution.
 Using a string kernel in SVM
• Failed to improve the performance.
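For reference, here is a minimal sketch of the traditional numerical-vector pipeline using scikit-learn (not the paper's own setup; the tiny corpus and category labels are invented for illustration). It makes the two problems visible: the document-term matrix has one column per vocabulary word and is almost entirely zeros.

```python
# Baseline sketch (not the paper's setup): bag-of-words + Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "the central bank raised interest rates",
    "the striker scored a late goal in the match",
    "voters went to the polls in the national election",
]
train_labels = ["business", "sport", "politics"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)     # sparse document-term matrix
print(X.shape)                               # (3, vocabulary size) -> one column per word
print(X.nnz / (X.shape[0] * X.shape[1]))     # fraction of non-zero entries -> sparse

clf = MultinomialNB().fit(X, train_labels)
test = vectorizer.transform(["the bank cut rates again"])
print(clf.predict(test))                     # -> ['business']
```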
PREVIOUS WORKS(2/2)
 String kernel
• Receives two raw texts as inputs and computes the syntactical similarity between them.
• Advantages
• Documents do not need to be encoded into numerical vectors.
• More transparent than numerical vectors.
• Easier to trace why each document is classified.
• Disadvantage
• Computing the similarity costs too much time (see the sketch below).
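To make the cost concrete, here is a minimal sketch of a string-kernel-style similarity: a character p-spectrum count of shared substrings. This is a simplified stand-in, not necessarily the kernel the authors refer to; the point it illustrates is that similarity is computed pairwise on raw texts, so an SVM over n training documents needs on the order of n^2 such computations for its kernel matrix.

```python
# Simplified stand-in for a string kernel: count shared substrings of length p.
from collections import Counter

def p_spectrum_kernel(s, t, p=3):
    """Similarity of two raw texts as the number of shared length-p substrings.
    No numerical vector is built, but every document pair needs a fresh pass."""
    grams_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    grams_t = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(grams_s[g] * grams_t[g] for g in grams_s if g in grams_t)

a = "the central bank raised interest rates"
b = "the bank cut interest rates again"
print(p_spectrum_kernel(a, b))   # higher value = more shared 3-character substrings

# With n training documents, the SVM needs the full n x n kernel matrix,
# i.e. O(n^2) pairwise similarity computations over raw text.
```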
FRAMEWORK(1/2)
[Figure: Bag of Words document representation]
FRAMEWORK(2/2)
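The two framework slides are figures in the original deck (document encoding and the NTC model). As a rough illustration only, and an assumption rather than the paper's exact architecture or learning rule, the sketch below shows one way a classifier over string vectors could work: each (word, category) pair carries a weight, a category's score for a document is the sum of the weights of the document's string-vector words, and training reinforces the true category and penalises wrong guesses.

```python
# Rough illustration only -- an assumption, not the paper's exact NTC model:
# one weight per (word, category) pair; scores are sums over string-vector words.
from collections import defaultdict

class WordCategoryClassifier:
    def __init__(self, categories, learning_rate=0.1):
        self.categories = list(categories)
        self.lr = learning_rate
        self.weights = defaultdict(float)          # (word, category) -> weight

    def score(self, string_vector, category):
        return sum(self.weights[(word, category)] for word in string_vector if word)

    def predict(self, string_vector):
        return max(self.categories, key=lambda c: self.score(string_vector, c))

    def train(self, labeled_vectors, epochs=10):
        for _ in range(epochs):
            for string_vector, label in labeled_vectors:
                guess = self.predict(string_vector)
                for word in string_vector:
                    if not word:
                        continue
                    self.weights[(word, label)] += self.lr       # reinforce the true category
                    if guess != label:
                        self.weights[(word, guess)] -= self.lr   # penalise the wrong guess

# Toy usage with invented string vectors (the top words of each document):
data = [(["bank", "rates", "interest"], "business"),
        (["goal", "match", "striker"], "sport")]
clf = WordCategoryClassifier(["business", "sport"])
clf.train(data)
print(clf.predict(["interest", "bank", "loan"]))   # -> 'business'
```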
EMPIRICAL RESULTS(1/3)
 The test collection of news articles is called NewsPage.
 News articles are encoded into:
• 500-dimensional numerical vectors.
• 50-dimensional string vectors.
EMPIRICAL RESULTS(2/3)
 The configuration of participating approaches
EMPIRICAL RESULTS(3/3)
 The Results of This Set of Experiments
CONCLUSIONS
 Four contributions are considered as the significance of this research:
1. According to the results of this set of experiments, the research proposes a practical approach.
2. It solves the two main problems:
• the huge dimensionality
• the sparse distribution
3. It creates a new neural network, called NTC.
4. It makes it potentially easier to trace why each document is classified as it is.