Qirui Zhang, Jinghua Tan, Huaying Zhou, Weiye Tao, Kejing He, "Machine Learning Methods for Medical Text Categorization," pp. 494-497, 2009 Pacific-Asia Conference on Circuits, Communications and Systems (PACCS), 2009
Speaker: Shau-Shiang Hung (洪紹祥)
Adviser: Shu-Chen Cheng (鄭淑真)
Date: 99/05/04

Outline
Introduction
Document indexing
Classification algorithms
Experiments
Conclusion

Introduction
Text categorization (TC) is the process of automatically assigning one or more predefined category labels to text documents. Digital medical information is growing rapidly with the development of the network, and how to process and organize it effectively is an open problem in medical informatics.

Document indexing
Because classifiers cannot interpret documents directly, documents must be transformed into a form that classifiers can work with. The vector space model (VSM) is a well-known statistical model: each document $d_j$ is represented as a vector of term weights

$d_j = (w_{1j}, w_{2j}, w_{3j}, \ldots, w_{kj}, \ldots, w_{|T|j})$

Example: Document 1 → (historic site, tourism, service, ...)

Document indexing
A. Standard Term Frequency Inverse Document Frequency (TFIDF)

$\mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{\#Tr(t_k)}$

where $\#(t_k, d_j)$ is the frequency of term $t_k$ in document $d_j$, $|Tr|$ is the number of training documents, and $\#Tr(t_k)$ is the number of training documents containing $t_k$. Worked example:

$\mathrm{tfidf}(\text{historic site}, \text{Document 1}) = \frac{10}{100} \log \frac{1000}{100}$

Document indexing
In order for the weights to fall in the $[0,1]$ interval and for the documents to be represented by vectors of equal length, the weights resulting from tfidf are often normalized by cosine normalization:

$w_{kj} = \frac{\mathrm{tfidf}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|} \mathrm{tfidf}(t_s, d_j)^2}}, \quad 0 \le w_{kj} \le 1$

(The denominator is the square root of the sum of the squared TFIDF values of all keywords of Document 1.)

Document indexing
B. Improvement: Term Frequency, Inverted Document Frequency and Inverted Entropy (TFIDFIE)
In text classification, the importance of a term depends not only on its term frequency but also on its contribution to classification. For example, the terms "guest room" and "scenery" may receive the same TFIDF weight even though they discriminate between categories differently.

Document indexing
To bring out the relation between terms and categories, we also consider how the documents containing a term are distributed over the categories when weighting terms. This distribution can be weighted by the information entropy H.
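The TFIDF weighting and cosine normalization described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the toy corpus, token lists, and function names are invented for the example, and term frequency is normalized by document length as in the slide's worked example.

```python
import math

def tfidf(term, doc, corpus):
    """TFIDF as on the slides: term frequency (here normalized by
    document length) times log(|Tr| / #Tr(t))."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # #Tr(t): docs containing term
    return tf * math.log(len(corpus) / df) if df else 0.0

def cosine_normalize(weights):
    """Scale a document's weight vector to unit Euclidean length,
    which keeps every weight in the [0, 1] interval."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

# Hypothetical toy corpus: lists of tokens stand in for documents.
corpus = [
    ["historic", "site", "tourism", "service"],
    ["tourism", "service", "hotel"],
    ["historic", "museum"],
]
doc = corpus[0]
weights = {t: tfidf(t, doc, corpus) for t in set(doc)}
unit = cosine_normalize(weights)  # normalized weight vector for doc
```

After normalization the vector has unit length, so documents of different lengths become directly comparable, matching the motivation given on the slide.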
$H(t_k) = -\sum_{l=1}^{|C|} \frac{DF_{kl}}{\#Tr(t_k)} \log \frac{DF_{kl}}{\#Tr(t_k)}$

where $DF_{kl}$ is the number of training documents in category $c_l$ that contain $t_k$, and $|C|$ is the number of categories. Worked example, with "guest room" occurring in 100 training documents spread over five categories:

$H(\text{guest room}, \text{Document 1}) = -\left(\frac{20}{100}\log\frac{20}{100} + \frac{25}{100}\log\frac{25}{100} + \frac{25}{100}\log\frac{25}{100} + \frac{15}{100}\log\frac{15}{100} + \frac{15}{100}\log\frac{15}{100}\right)$

The TFIDFIE weight multiplies TFIDF by the inverted entropy, and the weights are again cosine-normalized:

$\mathrm{tfidfie}(t_k, d_j) = \#(t_k, d_j) \cdot \log\frac{|Tr|}{\#Tr(t_k)} \cdot \frac{1}{H(t_k, d_j)}$

$w_{kj} = \frac{\mathrm{tfidfie}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|} \mathrm{tfidfie}(t_s, d_j)^2}}$

Classification Algorithm
A. K-Nearest Neighbor (KNN)
B. Support Vector Machine (SVM)
C. Naïve Bayes (NB)
D. Clonal Selection Algorithm Based on Antibody Density (CSABAD)
Because the nature of an immune algorithm is to distinguish between self and non-self, it can be applied to text categorization.

Classification Algorithm
CSABAD. In text categorization:
Antigen: a training text.
B cell: an individual of the classifier.
Antibody: the affinity between the individual and the training documents.
The cosine value of two vectors is used to measure the affinity $f(x_i, d_j)$ between B cell $x_i$ and antigen $d_j$. The antibody affinity $f(x_i)$ of B cell $x_i$ over $N$ antigens is defined as the average of all $N$ affinities. The selection probability $P(x_i)$ is defined as follows:

$P(x_i) = \frac{\sum_{j=1}^{M} |f(x_i) - f(x_j)|}{\sum_{i=1}^{M}\sum_{j=1}^{M} |f(x_i) - f(x_j)|}$

The final classifier is composed of many memory B cells.

Experiments
A. Data collection
OHSUMED is a bibliographical document collection. A single-label subset of OHSUMED, called OHSCAL, consists of 11162 documents in 10 categories.

Experiments
B. Experiment results and analysis
The OHSCAL dataset was randomly divided into a training set and a test set in the proportion 2:1. To eliminate chance in the experimental results, we ran ten independent experiments on OHSCAL.

Conclusion
In this paper, we propose an improved weighting approach called TFIDFIE. It considers how the training documents in which a term occurs are distributed over the categories. The experiments show that SVM and CSABAD significantly outperform kNN and Naïve Bayes, and that TFIDFIE is more effective than TFIDF.
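The entropy term at the heart of TFIDFIE can be illustrated directly. This is a minimal sketch assuming the distribution-over-categories reading above; the function name and the concrete document counts are hypothetical, chosen only to contrast an even spread with a concentrated one.

```python
import math

def term_entropy(category_doc_counts):
    """Entropy H(t) of a term's document distribution over categories:
    H = -sum p_l * log(p_l), where p_l = DF_l / #Tr(t)."""
    total = sum(category_doc_counts)
    return -sum((c / total) * math.log(c / total)
                for c in category_doc_counts if c > 0)

# Hypothetical counts: a term appearing in 100 training documents.
h_even = term_entropy([20, 25, 25, 15, 15])  # spread over 5 categories
h_skew = term_entropy([96, 1, 1, 1, 1])      # concentrated in 1 category

# A term concentrated in one category carries more class information,
# so its entropy is lower and the inverted-entropy factor 1/H is larger.
assert h_skew < h_even
```

This is why TFIDFIE boosts category-indicative terms like "guest room" over evenly spread terms, even when their raw TFIDF weights coincide.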
Considering the characteristics of specialized medical vocabulary, we will study feature selection for medical text classification in future work.