POS Tagger and Chunker for Tamil
Guided by
Dr.K.P.Soman
Head, CEN
Amrita University.
Presented by
Dr.S.Rajendran
Head, Dept.Linguistics
Tamil University.
CEN
Amrita Vishwa Vidyapeetham
Coimbatore.
V.Dhanalakshmi
M.Anand Kumar
CEN, Amrita.
Overview
Introduction
Tamil POS Tagging
AMRITA Tagset
Tamil POS Tagging
SVMTool
Chunking
Yamcha
Results
Conclusion
Introduction
• Part-of-speech (POS) tagging, also called grammatical tagging, is the process of assigning a POS tag to every word in a sentence.
• Each tag names the word's grammatical category, such as noun, verb, adjective, or adverb.
• The step after POS tagging is chunking, which divides a sentence into non-recursive, inseparable phrases, i.e. phrases with only one head each.
Introduction
• Many tools are available for POS tagging and chunking.
• We used SVM-based tools for Tamil POS tagging and chunking:
  - SVMTool for POS tagging
  - YamCha for chunking
Introduction
• POS tagging and chunking are important steps in speech recognition, natural language parsing, information retrieval, and machine translation.
• Here, the POS tagging problem is cast as a classification problem.
POS Tagging
• INPUT: a string of words (a sentence)
• OUTPUT: a single best tag for each word (a POS-tagged sentence)
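To make the classification view concrete, here is a minimal Python sketch in which every word of a sentence becomes one classification instance, described by the words in a small window around it. The window size, the feature names, the padding symbol, and the helpers `window_features` and `to_instances` are illustrative assumptions; they are not the feature set used by the authors or by SVMTool.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sentence


def window_features(words: List[str], i: int, size: int = 2) -> Dict[str, str]:
    """Describe the word at position i by itself and its neighbors."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        # Keys look like "w[-2]", "w[-1]", "w[+0]", "w[+1]", "w[+2]".
        feats[f"w[{offset:+d}]"] = words[j] if 0 <= j < len(words) else PAD
    return feats


def to_instances(words: List[str]) -> List[Dict[str, str]]:
    """Build one feature dictionary (one classification instance) per word."""
    return [window_features(words, i) for i in range(len(words))]


sentence = ["இந்த", "ஆண்டில்", "3500", "பஸ்கள்", "வாங்கப்படும்", "."]
instances = to_instances(sentence)
print(instances[2])
```

A classifier trained on such instances then predicts one tag per word, which matches the INPUT/OUTPUT description of this slide.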
Example of Tamil POS Tagging
• Assigning a grammatical category to each word in a sentence.
<Six feet tall bell is in the temple>
Example of POS Tagging
NN CRD NN ADJ NN VF
<Six feet tall bell is in the temple>
LEXICAL AMBIGUITY IN TAMIL.
• Assign POS tags to the words in a sentence, taking their lexical ambiguity into account.
[Figure: each word shown with its candidate tags, drawn from NN, NNP, CRD, ADJ and VF]
<Six feet tall bell is in the temple>
POS Tagging Example
• Assigning a grammatical category to each word, taking its lexical ambiguity into account.
[Figure: each word shown with its candidate (ambiguity) tags, drawn from NN, NNP, CRD, ADJ and VF]
Six feet tall bell is in the temple.
COMPLEXITY IN TAMIL POS TAGGING
• Tamil is a morphologically rich, agglutinative language.
• We mostly depend on syntactic function or context to decide whether a word is a noun, adjective, adverb, or postposition.
• Example: <varum> can be <VF> or <VNAJ>.
• This is what makes POS tagging complex for Tamil.
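One way to see this ambiguity in data is to collect, for every word form, the set of tags it has been observed with. The Python sketch below does that for a toy list of (word, tag) pairs. The helper name `ambiguity_classes`, the toy observations, and the transliteration "maNi" are hypothetical illustrations; only the `varum` / VF / VNAJ case comes from the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def ambiguity_classes(tagged: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each word form to the set of tags it was seen with."""
    classes: Dict[str, Set[str]] = defaultdict(set)
    for word, tag in tagged:
        classes[word].add(tag)
    return dict(classes)


# Toy observations (illustrative only): "varum" seen once as VF and once as VNAJ.
observations = [("varum", "VF"), ("varum", "VNAJ"), ("maNi", "NN")]
classes = ambiguity_classes(observations)

# Keep only the word forms that carry more than one possible tag.
ambiguous = {w: sorted(t) for w, t in classes.items() if len(t) > 1}
print(ambiguous)  # {'varum': ['VF', 'VNAJ']}
```

A tagger has to use the surrounding context to choose among the tags in each such set.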
AMRITA TAGSET
Considering the lexical ambiguities and syntactic complexities, we created a new tagset, the AMRITA tagset, to tag our corpus for the SVM-based POS tagger for Tamil.
AMRITA TAGSET
• We followed the guidelines from "Annotating Corpora Guidelines for POS and Chunk Annotation for Indian Languages" [IIIT, Hyderabad] while creating our AMRITA tagset:
1. The tags should be simple.
2. Simplicity is maintained for ease of learning and consistency in annotation.
3. POS tagging is not a replacement for a morphological analyzer.
4. A 'word' in a text carries a grammatical category and grammatical features such as case, tense, person, number, gender, etc. The POS tag should be based on the 'category' of the word; the features can be acquired from the morphological analyzer.
AMRITA Tagset
• The tagset is simple.
• It is based on the 'category' of the word and does not consider the word's grammatical features.
• Tagset size: 32 tags
AMRITA Tag set for Tamil
Corpus development:
• We developed a corpus of 2.5 lakh (250,000) words, collected from the Dinamani newspaper, Yahoo Tamil news, That's Tamil, online Tamil short stories, etc.
• Three stages in corpus development:
  - Pre-editing
  - Manual tagging
  - Tagging using SVMTagger
• Corpus size: 2.5 lakh words
SVM (Support Vector Machine)
• A support vector machine is a learning algorithm that derives classification and regression rules from data.
• SVM is based on the idea of structural risk minimization, a principled technique for selecting the model that minimizes the generalization error.
• SVMs are increasingly used in NLP tasks.
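For reference, the standard soft-margin SVM for binary classification can be written as the optimization problem below. This is the textbook formulation, given as background rather than as the exact variant used by the tools in this work. The symbols are the usual ones: $x_i$ is a feature vector, $y_i \in \{-1, +1\}$ its label, $w$ the weight vector, $b$ the bias, $\xi_i$ the slack variables, and $C$ the regularization constant.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n
```

Keeping $\lVert w\rVert$ small widens the margin, while the slack term penalizes training errors. This trade-off between model complexity and fit is the practical form that structural risk minimization takes in an SVM.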
SVMTool
• SVMTool is an implementation based on the principle of Support Vector Machines (SVM).
• The tool was developed by Jesús Giménez and Lluís Màrquez.
• It trains efficiently and solves real NLP problems such as POS tagging.
• SVMTool is freely available at http://www.lsi.upc.es/~nlp/SVMTool
Training Data Format
…….
இந்த <DET>
ஆண்டில் <NN>
3500 <CRD>
பஸ்கள் <NN>
வாங்கப்படும் <VF>
. <DOT>
இதில் <PRP>
சென்னை <NNP>
…..
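A small Python sketch of how lines in this layout could be read into (word, tag) pairs is shown below. It assumes the layout exactly as it appears on the slide (one token per line, with the tag in angle brackets); SVMTool's own input files may be laid out differently, and the function name `parse_tagged_lines` is hypothetical.

```python
import re
from typing import List, Tuple

# One token per line: a word, optional spaces, then the tag in angle brackets.
LINE_RE = re.compile(r"^(?P<word>\S+?)\s*<(?P<tag>[A-Z]+)>$")


def parse_tagged_lines(text: str) -> List[Tuple[str, str]]:
    """Return (word, tag) pairs; lines that do not match are skipped."""
    pairs = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            pairs.append((match.group("word"), match.group("tag")))
    return pairs


sample = """இந்த <DET>
ஆண்டில் <NN>
3500 <CRD>
பஸ்கள் <NN>
வாங்கப்படும்<VF>
. <DOT>
இதில் <PRP>
சென்னை <NNP>"""

pairs = parse_tagged_lines(sample)
print(pairs[:3])
```

The pattern tolerates a missing space between the word and the tag, so a line such as `வாங்கப்படும்<VF>` is still parsed.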
Tagger Implementation
[Flow diagram]
Training: Corpus -> Tokenization -> Tagging -> Training -> SVMTagger
Tagging: Untagged words -> SVMTagger -> Tagged words
CHUNKING
• The step after tagging identifies basic structural relations between groups of words. This is usually referred to as phrase chunking.
• Input: a word sequence with POS tags
• Output: a single best chunk tag for each word, along with its POS tag
Chunking in Tamil
• Tamil, being an agglutinative language, has a complex morphological and syntactic structure.
• It is a relatively free word order language, but in phrasal and clausal constructions it behaves like a fixed word order language.
• Chunking in Tamil is less complex than POS tagging.
EXAMPLE
Assigning chunk tags to the words in a sentence.
[Figure: example sentence with each word labeled B-NP, I-NP or B-VP]
Chunk tagset
S.No | Chunk Tag | Tag Name | Possible POS Tags
1 | NP | Noun Phrase | NN, NNP, NNPC, NNC, NNQ, PRP, QTF, DET, CRD, ORD, ADJ, INT
2 | AJP | Adjectival Phrase | CRD, ADJ
3 | AVP | Adverbial Phrase | ADV, INT, CRD
4 | VFP | Verb Finite Phrase | VF, VAX
5 | VNP | Verb Nonfinite Phrase | VNAJ, VNAV, VINT, CVB
6 | VGP | Verb Gerund Phrase | VBG
7 | CJP | Conjunctional | CNJ
8 | COMP | Complementizer | COM
9 | . ? | Symbols | O
Chunk Tagset
• IOB tags: the IOB tags mark the boundaries of each chunk.
  - B: the current word is the beginning of a chunk, which may be followed by another chunk.
  - I: the current word is inside a chunk.
  - O: the current token is outside any chunk; here it marks the sentence boundary (the sentence-final symbols).
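The IOB scheme can be decoded back into chunk spans with a short loop. The Python sketch below is a minimal illustration of that idea; the function name `iob_to_chunks` and the example tag sequence are assumptions made for this example and are not taken from the authors' data.

```python
from typing import List, Tuple


def iob_to_chunks(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert IOB tags to (chunk_type, start, end) spans; end is exclusive."""
    chunks = []
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" closes the last chunk
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == kind
        if start is not None and not continues:
            chunks.append((kind, start, i))
            start, kind = None, None
        # "B" always opens a chunk; a stray "I" with no open chunk also opens one.
        if prefix == "B" or (prefix == "I" and start is None):
            start, kind = i, label
    return chunks


tags = ["B-NP", "I-NP", "B-NP", "I-NP", "B-VFP", "O"]
print(iob_to_chunks(tags))  # [('NP', 0, 2), ('NP', 2, 4), ('VFP', 4, 5)]
```

Two adjacent noun phrases stay separate because the second one starts with its own B tag, which is the reason the scheme needs B in addition to I.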
Yamcha
• YamCha is a generic, customizable, open-source text chunker.
• YamCha uses a state-of-the-art machine learning algorithm, Support Vector Machines (SVMs), introduced by Vapnik in 1995.
TRAINING AND TEST FILE FORMAT
• Both the training file and the test file must be in a particular format for YamCha to work properly.
• The training and test files must consist of multiple tokens.
• A token consists of multiple columns, and the number of columns is fixed. Tokens simply correspond to words. Each token must be on one line, with the columns separated by white space (spaces or tab characters). A sequence of tokens forms a sentence. An empty line marks the boundary between sentences.
TRAINING AND TEST FILE FORMAT
• We can use as many columns as we like, but the number of columns must be the same for every token.
• The columns carry some "semantics". For example, the first column is the word, the second column is the POS tag, the third column is the chunk tag, and so on.
• The last column holds the true answer tag, which YamCha is trained to predict.
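The format rules above can be checked mechanically. The Python sketch below reads column-formatted text, splits it into sentences at empty lines, and raises an error if a token has a different number of columns from the first one. The function name `read_columns` is hypothetical, and the chunk tags in the sample are illustrative guesses based on the chunk tagset, not lines from the authors' training file.

```python
from typing import List

Sentence = List[List[str]]  # a sentence is a list of tokens; a token is a list of columns


def read_columns(text: str) -> List[Sentence]:
    """Split column-formatted text into sentences and check the column count."""
    sentences: List[Sentence] = []
    current: Sentence = []
    width = None
    for line_no, line in enumerate(text.splitlines(), start=1):
        columns = line.split()  # splits on spaces and tabs
        if not columns:  # an empty line is a sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        if width is None:
            width = len(columns)
        elif len(columns) != width:
            raise ValueError(
                f"line {line_no}: expected {width} columns, got {len(columns)}"
            )
        current.append(columns)
    if current:
        sentences.append(current)
    return sentences


# Columns: word, POS tag, chunk tag (the last column is the answer tag).
sample = """இந்த DET B-NP
ஆண்டில் NN I-NP
3500 CRD B-NP
பஸ்கள் NN I-NP
வாங்கப்படும் VF B-VFP
. DOT O

இதில் PRP B-NP
"""

sentences = read_columns(sample)
print(len(sentences), len(sentences[0]))
```

Validating the column count before training catches the most common formatting mistake, a token line with a missing tag.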
Training data - sample
Tagger Implementation
[Flow diagram]
Training: POS-tagged corpus -> Manual tagging -> YamCha training -> Trained model
Chunking: POS-tagged input -> Trained model -> Chunked output
CONCLUSION
• Chunking plays an important role in various natural language processing applications.
• A chunked corpus can be used for parsing, which provides important syntactic information for machine translation.
• Possible future work is to increase the corpus size, i.e. to build an annotated corpus for Tamil.
REFERENCES
• Giménez, J. and Màrquez, L. "Fast and Accurate Part-of-Speech Tagging: The SVM Approach Revisited". In Proceedings of the Fourth RANLP, 2003.
• Rajendran, S. "Parsing in Tamil: Present State of Art". Language in India, Volume 6:8, August 2006.
• Abney, S. "Parsing by Chunks". In Principle-Based Parsing, Kluwer Academic Publishers, Dordrecht, pp. 257-278, 1991.
• Sobha, L. and Vijay Sundar Ram, R. "Noun Phrase Chunking in Tamil". In Proceedings of MSPIL-06, Indian Institute of Technology, Bombay, pp. 194-198.
• Kudo, T. 2003. CRF++: Yet Another CRF Toolkit. http://chasen.org/~taku/software/CRF++/.
நன்றி
THANK YOU