Part-of-Speech Tagging using Neural Networks
Ankur Parikh
LTRC, IIIT Hyderabad
[email protected]

Outline
1. Introduction
2. Background and Motivation
3. Experimental Setup
4. Preprocessing
5. Representation
6. Single-neuro tagger
7. Experiments
8. Multi-neuro tagger
9. Results
10. Discussion
11. Future Work

Introduction
POS tagging is the process of assigning a part-of-speech tag to each word of natural-language text, based on both the word's definition and its context.
Uses: parsing of sentences, machine translation, information retrieval, word sense disambiguation, speech synthesis, etc.
Methods:
1. Statistical approach
2. Rule-based approach

Background: Previous Approaches
A lot of work has been done for Hindi using various machine learning methods, such as TnT and CRF.
Trade-off: performance versus training time.
- Lower precision affects the later stages of the pipeline.
- For a new domain or a new corpus, parameter tuning is a non-trivial task.

Background: Previous Approaches & Motivation
- Empirically chosen context.
- Effective handling of corpus-based features.
Need of the hour:
- Good performance
- Less training time
- Multiple contexts
- Effective exploitation of corpus-based features
This talk presents two approaches to word-level tagging and their comparison with TnT and CRF.

Experimental Setup: Corpus Statistics
Tag set of 25 tags.

Corpus        Size (in words)   Unseen words (%)
Training      187,095           -
Development   23,565            5.33%
Testing       23,281            8.15%

Experimental Setup: Tools and Resources
Tools:
- CRF++
- TnT
- Morfessor Categories-MAP
Resources:
- Universal Word - Hindi dictionary
- Hindi WordNet
- Morph analyzer

Preprocessing
The XC tag is removed (Gadde et al., 2008), leaving 24 tags.
Lexicon:
- For each unique word w of the training corpus, build ENTRY(t1, ..., t24),
- where tj = c(posj, w) / c(w), i.e., the fraction of occurrences of w tagged as posj.
(A sketch of the lexicon construction appears at the end of this section.)

Representation: Encoding & Decoding
Each word w is encoded as an n-element vector INPUT(t1, t2, ..., tn), where n is the size of the tag set.
INPUT(t1, t2, ..., tn) comes from the lexicon if the training corpus contains w.
If w is not in the training corpus:
- N(w) = number of possible POS tags for w
- tj = 1/N(w) if posj is a candidate, and 0 otherwise
For each word w, the desired output is encoded as D = (d1, d2, ..., dn), where
- dj = 1 if posj is the desired output, and 0 otherwise.
In testing, an n-element vector OUTPUT(o1, ..., on) is returned for each word w, and
- Result = posj such that oj = max(OUTPUT).
(An encoding/decoding sketch appears at the end of this section.)

Single-neuro tagger: Structure

Single-neuro tagger: Training & Tagging
- Error back-propagation learning algorithm
- Weights are initialized with random values
- Sequential (online) mode
- Momentum term
- Learning rate Eta = 0.4 and momentum Alpha = 0.1
- In tagging, it can give multiple outputs or a sorted list of all tags.
(A training-loop sketch appears at the end of this section.)

Experiments: Development Data

Features                        Precision
Corpus-based and contextual     93.19%
Root of the word                93.38%
Length of the word              94.04%
Handling of unseen words        95.62%

Handling of unseen words: Root -> Dictionary -> WordNet -> Morfessor, with
tj = (c(posj, s) + c(posj, p)) / (c(s) + c(p))
(A sketch of this estimate follows.)
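A minimal sketch of the lexicon construction described under Preprocessing, in Python. It assumes the training corpus is available as a list of (word, tag) pairs and that the tag set is an ordered list; both the corpus format and the names here are illustrative, not taken from the talk.

    from collections import Counter, defaultdict

    def build_lexicon(tagged_corpus, tagset):
        """ENTRY(t1, ..., tn) per word, with tj = c(posj, w) / c(w)."""
        word_counts = Counter()                  # c(w)
        word_tag_counts = defaultdict(Counter)   # c(posj, w)
        for word, tag in tagged_corpus:
            word_counts[word] += 1
            word_tag_counts[word][tag] += 1
        return {word: [word_tag_counts[word][t] / c_w for t in tagset]
                for word, c_w in word_counts.items()}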
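A sketch of the encoding and decoding scheme from the Representation section. Here candidate_tags, the lookup that yields the possible POS tags of an unknown word (e.g., from the dictionary, WordNet, or the morph analyzer), is a hypothetical helper.

    def encode(word, lexicon, tagset, candidate_tags):
        """Return INPUT(t1, ..., tn) for word w."""
        if word in lexicon:
            return lexicon[word]
        cands = candidate_tags(word)       # N(w) possible POS tags for unseen w
        return [1.0 / len(cands) if t in cands else 0.0 for t in tagset]

    def decode(output, tagset):
        """Result = posj such that oj = max(OUTPUT)."""
        return tagset[output.index(max(output))]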
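A sketch of the single-neuro tagger's training loop: one hidden layer, error back-propagation in sequential (per-pattern) mode with a momentum term, using the stated Eta = 0.4 and Alpha = 0.1. The sigmoid activation, initial weight range, and epoch count are assumptions; the talk does not spell them out.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train(patterns, n_in, n_hidden, n_out, epochs=50, eta=0.4, alpha=0.1):
        rng = np.random.default_rng(0)
        W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in))   # random initial weights
        W2 = rng.uniform(-0.5, 0.5, (n_out, n_hidden))
        dW1_prev = np.zeros_like(W1)
        dW2_prev = np.zeros_like(W2)
        for _ in range(epochs):
            for x, d in patterns:                # sequential (online) mode
                h = sigmoid(W1 @ x)              # forward pass
                o = sigmoid(W2 @ h)
                delta_o = (d - o) * o * (1 - o)  # back-propagate the error
                delta_h = (W2.T @ delta_o) * h * (1 - h)
                dW2 = eta * np.outer(delta_o, h) + alpha * dW2_prev  # momentum
                dW1 = eta * np.outer(delta_h, x) + alpha * dW1_prev
                W2 += dW2
                W1 += dW1
                dW2_prev, dW1_prev = dW2, dW1
        return W1, W2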
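A sketch of the unseen-word estimate above, assuming s and p are segments of the word (e.g., the suffix and prefix produced by Morfessor) and that segment-level tag counts were gathered from the training corpus; both readings are assumptions, since the slide leaves them implicit.

    def unseen_entry(s, p, seg_tag_counts, seg_counts, tagset):
        """tj = (c(posj, s) + c(posj, p)) / (c(s) + c(p)), using
        assumed segment-level counts from the training corpus."""
        denom = seg_counts.get(s, 0) + seg_counts.get(p, 0)
        if denom == 0:
            return [1.0 / len(tagset)] * len(tagset)   # fall back to uniform
        return [(seg_tag_counts.get(s, {}).get(t, 0)
                 + seg_tag_counts.get(p, {}).get(t, 0)) / denom
                for t in tagset]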
Development of the system

Multi-neuro tagger: Structure

Multi-neuro tagger: Training

Multi-neuro tagger: Learning curves

Multi-neuro tagger: Results

Structure    Context   Development   Test
97-48-24     3         95.44%        91.87%
121-48-24    4_prev    95.64%        92.05%
121-48-24    4_next    95.66%        91.95%
145-72-24    5         95.55%        92.15%
169-72-24    6_prev    95.56%        92.14%
169-72-24    6_next    95.54%        92.14%
193-96-24    7         95.46%        92.07%

Multi-neuro tagger: Comparison
Precision after voting: 92.19%

Tagger               Development   Test     Training Time
TnT                  95.18%        91.58%   1-2 seconds
Multi-neuro tagger   95.78%        92.19%   13-14 minutes
CRF                  96.05%        92.92%   2-2.5 hours

Conclusion
- Single versus multi-neuro tagger
- Multi-neuro tagger versus TnT and CRF
- Corpus- and dictionary-based features
- More parameters need to be tuned
- A 5-word context would need 24^5 = 7,962,624 n-gram parameters, while the multi-neuro tagger uses only 250,560 weights
- Well suited for Indian languages

Future Work
- Better voting schemes (confidence-point based)
- Finding the right context (probability based)
- Various structures and algorithms:
  - Sequential neural network
  - Convolutional neural network
  - Combination with SVM

Queries?
Thank you!