Qirui Zhang, Jinghua Tan, Huaying Zhou, Weiye Tao, Kejing He, "Machine
Learning Methods for Medical Text Categorization," 2009 Pacific-Asia
Conference on Circuits, Communications and Systems (PACCS), pp. 494-497, 2009
Speaker: Shau-Shiang Hung (洪紹祥)
Adviser: Shu-Chen Cheng (鄭淑真)
Date: 99/05/04
Outline
• Introduction
• Document indexing
• Classification Algorithm
• Experiments
• Conclusion
Introduction
• Text categorization (TC) is the process of automatically assigning one or more predefined category labels to text documents.
• Digital medical information is increasing rapidly as networks develop.
• Organizing and handling this information effectively is a problem in the field of medical informatics.
Document indexing
• Because classifiers cannot directly interpret documents, the documents must first be transformed into a form that classifiers can process.
• The vector space model (VSM) is a well-known statistical model. Each document is represented as a vector of term weights:
$d_j = (w_{1j}, w_{2j}, w_{3j}, \ldots, w_{ij}, \ldots, w_{|T|j})$
Article 1 = (historic site, tourism, service, ...)
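Document indexing
A. Standard Term Frequency Inverse Document Frequency (TFIDF)
$tfidf(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)}$
where $\#(t_k, d_j) = t_k / t_{all\_j}$ is the number of occurrences of term $t_k$ in document $d_j$ divided by the total number of terms in $d_j$, $|Tr|$ is the number of training documents, and $\#_{Tr}(t_k)$ is the number of training documents that contain $t_k$.
Example: $tfidf(\text{historic site}, \text{Article 1}) = \frac{10}{100} \times \log \frac{1000}{100}$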
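The worked example above can be checked with a short Python sketch. The function and parameter names are illustrative, and the logarithm base is an assumption (the slide does not state one); base 10 is used here, which gives a value of about 0.1 for the example.

```python
import math


def tfidf(term_count, doc_length, n_train_docs, n_docs_with_term, log_base=10):
    """TFIDF in the form used on the slide.

    tf  = term_count / doc_length            (t_k / t_all_j)
    idf = log(|Tr| / #Tr(t_k))               (log base is an assumption)
    """
    tf = term_count / doc_length
    idf = math.log(n_train_docs / n_docs_with_term, log_base)
    return tf * idf


# Numbers from the slide: the term occurs 10 times in a 100-term article,
# and 100 of the 1000 training documents contain it.
example = tfidf(term_count=10, doc_length=100,
                n_train_docs=1000, n_docs_with_term=100)
print(example)
```

With a different base the weight is rescaled by a constant factor, so the ranking of terms does not change.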
Document indexing
• So that the weights fall in the [0,1] interval and the documents are represented by vectors of equal length, the weights resulting from tfidf are often normalized by cosine normalization:
$w_{kj} = \frac{tfidf(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|} (tfidf(t_s, d_j))^2}}, \quad 0 \le w_{kj} \le 1$
The sum in the denominator adds up the squared TFIDF values of all keywords in Article 1.
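A minimal Python sketch of cosine normalization follows. The input weights are made-up illustrative values, not data from the paper, and returning an all-zero vector for an all-zero input is an assumption added to avoid division by zero.

```python
import math


def cosine_normalize(weights):
    """Divide each weight by the Euclidean length of the weight vector."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0:
        # Assumption: a document with no weighted terms stays all-zero.
        return [0.0 for _ in weights]
    return [w / norm for w in weights]


# Hypothetical TFIDF weights for the keywords of one article.
raw_weights = [0.1, 0.3, 0.2]
normalized = cosine_normalize(raw_weights)
print(normalized)
```

After normalization every document vector has unit length, so non-negative weights fall in [0,1] and documents of different sizes become comparable.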
Document indexing
B. Improvement
• Term Frequency, Inverted Document Frequency and Inverted Entropy (TFIDFIE)
• In text classification, the importance of a term depends not only on its term frequency but also on its contribution to classification.
• For example: Term 1 "guest room" and Term 2 "scenery" have the same weight.
Document indexing
• To bring out the relation between terms and categories, the term weighting also takes into account how the documents containing a term are distributed across the categories. This distribution can be measured by the information entropy H.
$H(t_k, d_j) = -\sum_{l=1}^{|C|} \frac{DF_{kl}}{\#_{Tr}(t_k)} \log \frac{DF_{kl}}{\#_{Tr}(t_k)}$
$tfidfie(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#_{Tr}(t_k)} \cdot \frac{1}{H(t_k, d_j)}$
$w_{kj} = \frac{tfidfie(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|} (tfidfie(t_s, d_j))^2}}$
Example: $H(\text{guest room}, \text{Article 1}) = -\sum_{l=1}^{4} \frac{DF_{kl}}{100} \log \frac{DF_{kl}}{100}$, with $\#_{Tr}(\text{guest room}) = 100$.
Classification Algorithm
A. K-Nearest Neighbor (KNN)
B. Support Vector Machine (SVM)
C. Naïve Bayes (NB)
D. Clonal Selection Algorithm Based on Antibody Density (CSABAD)
• Because the nature of an immune algorithm is to distinguish between self and non-self, it can be used in text categorization.
Classification Algorithm
• CSABAD
In text categorization:
• Antigen: a training text.
• B cell: an individual of the classifier.
• Antibody: the affinity between the individual and the training documents.
The cosine value of two vectors is used to measure the affinity f(xi,dj) between B cell xi and antigen dj.
The antibody affinity f(xi) of B cell xi and N antigens is defined as the average value of all N affinities.
The selection probability P(xi) is defined as follows:
$P(x_i) = \frac{\sum_{j=1}^{M} |f(x_i) - f(x_j)|}{\sum_{i=1}^{M} \sum_{j=1}^{M} |f(x_i) - f(x_j)|}$
• The final classifier is composed of many memory B cells.
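The affinity and selection-probability definitions on this slide can be sketched in Python as follows. This covers only those two formulas, not the full clonal selection procedure. The vectors are made-up two-dimensional examples, and the uniform fallback when all affinities are equal is an assumption, since the slide does not say how that case is handled.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def antibody_affinity(b_cell, antigens):
    """f(x_i): average cosine affinity between one B cell and all N antigens."""
    return sum(cosine(b_cell, d) for d in antigens) / len(antigens)


def selection_probabilities(affinities):
    """P(x_i): share of the total pairwise affinity difference held by x_i."""
    diffs = [sum(abs(fi - fj) for fj in affinities) for fi in affinities]
    total = sum(diffs)
    if total == 0:
        # Assumption: identical affinities give a uniform distribution.
        return [1.0 / len(affinities)] * len(affinities)
    return [d / total for d in diffs]


# Illustrative data: three B cells and two antigens (training texts).
b_cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
antigens = [[1.0, 0.0], [1.0, 1.0]]
affinities = [antibody_affinity(x, antigens) for x in b_cells]
probs = selection_probabilities(affinities)
print(affinities, probs)
```

A B cell whose affinity differs most from the rest of the population receives the largest selection probability, which is how the density idea enters the formula.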
Experiments
A. Data collection
• OHSUMED is a bibliographical document collection.
• The experiments use OHSCAL, a single-label subset of OHSUMED that consists of 11162 documents in 10 categories.
Experiments
B. Experiment results and analysis
• The OHSCAL dataset is randomly divided into a training set and a test set in the proportion of 2:1.
• To reduce the influence of chance on the experimental results, we ran ten independent experiments on OHSCAL.
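The evaluation protocol can be sketched in Python as ten independent random 2:1 splits. The document ids are placeholders for the 11162 OHSCAL documents, and the use of seeds 0 to 9 and of a plain (non-stratified) shuffle are assumptions; the slide does not say how the splits were drawn.

```python
import random


def split_two_to_one(doc_ids, seed):
    """Shuffle the ids and cut them into training and test sets at 2:1."""
    rng = random.Random(seed)
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Placeholder ids standing in for the 11162 OHSCAL documents.
doc_ids = range(11162)

# Ten independent experiments, one random split per seed.
splits = [split_two_to_one(doc_ids, seed) for seed in range(10)]
for train_ids, test_ids in splits:
    print(len(train_ids), len(test_ids))
```

Averaging a classifier's score over the ten splits gives a result that depends less on any single random partition.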
Conclusion
• In this paper, we propose an improved term weighting approach called TFIDFIE. It takes into account how the training documents in which a term occurs are distributed across categories.
• The experiments show that SVM and CSABAD significantly outperform kNN and Naive Bayes, and that TFIDFIE is more effective than TFIDF.
• Considering the characteristics of professional medical words, we will study feature selection for medical text classification in future work.