Development of NE Wordnet:
An Integrated Wordnet for Languages of the
North-East India
Assamese & Bodo
by
Utpal Saikia
Biswajit Brahma
Dibyajyoti Sarmah
Dept. of Computer Science & Information Technology
Gauhati University
INTRODUCTION

The NE Wordnet Project for Assamese & Bodo started in 2009.

The Assamese and Bodo wordnets have been developed using the
expansion approach: each synset is built against the IDs and
concepts of the original Hindi Wordnet, following its structure.
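
As a rough illustration of the expansion approach described above, the sketch below creates a target-language synset that reuses the ID and part of speech of a Hindi synset. The `Synset` class, the `expand` helper, the ID value and the romanized words are all hypothetical placeholders chosen for illustration; they are not the project's actual schema or data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Synset:
    """Hypothetical, simplified synset record (not the real NE Wordnet schema)."""
    synset_id: int
    pos: str
    gloss: str
    words: List[str] = field(default_factory=list)


def expand(source: Synset, target_words: List[str], target_gloss: str) -> Synset:
    """Build a target-language synset that keeps the source synset's ID and POS.

    Sharing the ID is what links the same concept across languages.
    """
    return Synset(source.synset_id, source.pos, target_gloss, list(target_words))


# Illustrative placeholder data only.
hindi = Synset(101, "noun", "<gloss in Hindi>", ["paani", "jal"])
assamese = expand(hindi, ["pani"], "<gloss in Assamese>")
bodo = expand(hindi, ["dwi"], "<gloss in Bodo>")
```

Because all three records carry the same `synset_id`, a concept can be followed from one language to another without any extra alignment step, which is the main attraction of building by expansion.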
NE Wordnet Development Project outcomes till now
Validation

All the Assamese and Bodo Wordnet activities have been
reviewed by the Professors of the Department of
Assamese, Modern Indian Language & Bodo of Gauhati
University as well as other invited resource persons.
Contd….

The developed NE Wordnet, structured as a database and integrated
with an interactive interface, is ready for NLP research and
development. Several NLP applications and research works have
already started using the NE Wordnet.

Automatic Bilingual Dictionary Construction: Assamese-Bodo
dictionary; prototype developed at Gauhati University.

Web-based Automatic Multilingual Dictionary Construction:
Assamese-Bodo-Nepali-Hindi-English dictionary; full web-based
system ready, by the Gauhati University team.

Intelligent Document Categorizing System: prototype developed and
tested at Gauhati University; research paper already accepted for
GWA-2010.
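
The automatic dictionary construction listed above can be pictured as a join on shared synset IDs. The sketch below is one plausible way to do this, not the Gauhati University implementation; the function name, the dictionary layout and the romanized sample words are assumptions made for illustration.

```python
from typing import Dict, List


def build_bilingual_dictionary(
    source_wordnet: Dict[int, List[str]],
    target_wordnet: Dict[int, List[str]],
) -> Dict[str, List[str]]:
    """Map each source word to the target words that share a synset ID with it."""
    dictionary: Dict[str, List[str]] = {}
    for synset_id, source_words in source_wordnet.items():
        target_words = target_wordnet.get(synset_id)
        if not target_words:
            continue  # synset not yet linked in the target language
        for word in source_words:
            entry = dictionary.setdefault(word, [])
            for target in target_words:
                if target not in entry:  # avoid duplicates across synsets
                    entry.append(target)
    return dictionary


# Illustrative placeholder data: synset ID -> words (hypothetical IDs and words).
assamese = {101: ["pani", "jol"], 102: ["gos"]}
bodo = {101: ["dwi"], 103: ["no"]}

assamese_to_bodo = build_bilingual_dictionary(assamese, bodo)
```

Synsets that exist in only one of the two wordnets are skipped, so the size of the resulting dictionary depends directly on how many synsets have been linked in both languages.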
NE Wordnet Development Project outcomes till now
Following are the glosses completed in the Assamese language
till now:

Common Synset completed = 11579
Pan Indian Synset all completed
Universal Synset completed = 7147 (Total = 7168)
Adjective Synset completed = 2376 (Total = 3605)
Adverb Synset completed = 174 (Total = 209)
Verb Synset completed = 1588 (Total = 1798)
Language Specific completed = 127 (Total = 1000)
Total linked number = 24,338
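
To make the Assamese figures above easier to compare, the short sketch below converts each completed/total pair into a completion percentage. The counts are copied from the list above; the category labels and the `completion_percent` helper are introduced here only for illustration.

```python
# (completed, total) pairs taken from the Assamese figures above.
assamese_progress = {
    "Universal": (7147, 7168),
    "Adjective": (2376, 3605),
    "Adverb": (174, 209),
    "Verb": (1588, 1798),
    "Language specific": (127, 1000),
}


def completion_percent(completed: int, total: int) -> float:
    """Return the completed share as a percentage rounded to one decimal place."""
    return round(100 * completed / total, 1)


percentages = {
    category: completion_percent(done, total)
    for category, (done, total) in assamese_progress.items()
}

for category, percent in percentages.items():
    print(f"{category}: {percent}%")
```

Read this way, the Universal synsets are nearly complete, while the adjective and language-specific categories are where most of the remaining work lies.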
NE Wordnet Development Project outcomes till now
Following are the glosses completed in the Bodo language
till now:

Common Synset completed = 11522
Pan Indian Synset all completed
Universal Synset completed = 7143 (Total = 7168)
Adverb Synset completed = 192 (Total = 209)
Adjective Synset completed = 2473 (Total = 3605)
Verb Synset completed = 1752 (Total = 1798)
Synset Ranker = 34264 (34378)
Language Specific = 74
Total linked number = 24,493
Problems Faced During Development for Assamese

Synset related: Among the synsets common to Assamese and Bodo, a few
have no suitable Assamese word to represent them, so they have not
been entered yet. These remaining synsets have been sent to the expert
committee for review.

Expansion from Hindi/English: The main challenge in the expansion
approach is one-to-one mapping.
Problems Faced During Development for Bodo

Challenges in Expansion
Bodo is a developing language. Its linguistic resources are not yet
strong, and its literature resources are very limited. The
vocabulary is still growing, with new words continually being
discovered, coined and added. As a result, the development of the
Bodo Wordnet faces frequent problems, and overcoming them to
accommodate expansion of the Hindi Wordnet with one-to-one
mapping has been a big challenge.
Workshops/conferences organized and participated in by the member groups:
1. Global Wordnet Conference in IIT, Mumbai from 31st Jan.-4th Feb. 2010
2. Indo Wordnet Conference in Amrita University, Coimbatore, in June,
2010
3. NE Wordnet Workshop, Guwahati, Assam, 2010
4. Indo Wordnet Workshop, IIT Kharagpur, 2010
5. Attended Spell checker training, C-DAC Pune, 2010
6. Indo Wordnet Workshop, Shillong, 2011
7. CLIA developers workshop, C-DAC Pune, 2011
8. Multiword Expression Workshop, University of Kashmir, Srinagar, 2011
Tools, Applications & Research

During this period, language-specific tools have been developed.
Language specific Synset creation tools interface
Multi_lingual_dictionary
[Online Bodo, Assamese and Hindi Language]:
Step 1: First select the language.
Step 2: Type a word in that language.
Step 3: When the word appears automatically, select it.
Step 4: Then search for the word.
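
The four steps above (choose a language, type, pick a suggestion, search) can be sketched as a small lookup class with prefix-based suggestions. This is only an assumed design for illustration, not the code behind the online dictionary; the `DictionaryLookup` class, its method names and the romanized sample entries are all hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List

# Layout assumed here: language -> word -> {other language: [translations]}
Entries = Dict[str, Dict[str, Dict[str, List[str]]]]


class DictionaryLookup:
    def __init__(self, entries: Entries) -> None:
        self.entries = entries
        # Keep a sorted word list per language so prefixes can be found quickly.
        self.sorted_words = {lang: sorted(words) for lang, words in entries.items()}

    def suggest(self, language: str, prefix: str, limit: int = 5) -> List[str]:
        """Steps 1-3: return up to `limit` words in `language` starting with `prefix`."""
        words = self.sorted_words[language]
        suggestions: List[str] = []
        for word in words[bisect_left(words, prefix):]:
            if not word.startswith(prefix) or len(suggestions) == limit:
                break
            suggestions.append(word)
        return suggestions

    def search(self, language: str, word: str) -> Dict[str, List[str]]:
        """Step 4: return the translations of the selected word, if any."""
        return self.entries[language].get(word, {})


# Illustrative placeholder entries only.
lookup = DictionaryLookup({
    "assamese": {
        "pani": {"bodo": ["dwi"], "hindi": ["paani"]},
        "pakhi": {"bodo": ["<bodo word>"], "hindi": ["<hindi word>"]},
        "gos": {"bodo": ["<bodo word>"], "hindi": ["<hindi word>"]},
    }
})

suggestions = lookup.suggest("assamese", "pa")
translations = lookup.search("assamese", "pani")
```

Keeping the word list sorted lets the suggestion step jump straight to the first candidate with a binary search, which matters once the word list grows to the size of a full wordnet.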
Published papers in conferences/journals/workshops
1. A Novel Approach for Document Classification using Assamese
WordNet, Jumi Sarmah, Navanath Saharia and Shikhar K. Sarma,
Global Wordnet Conference (GWC), Japan, 2012
2. Assamese Vocabulary and Assamese Wordnet Building: An Analysis,
Shikhar Kr. Sarma, Utpal Saikia, Mayashree Mahanta, Himadri
Bharali, Global Wordnet Conference (GWC), Japan, 2012
3. Foundation and Structure of Developing an Assamese Wordnet,
Shikhar Kr. Sarma, Moromi Gogoi, Rakesh Medhi, Utpal Saikia,
Global Wordnet Conference, IIT Bombay, 2010
4. A Wordnet for Bodo Language: Structure and Development, Shikhar
Kr. Sarma, Moromi Gogoi, Biswajit Brahma, Mane Bala Ramchiary,
Global Wordnet Conference, IIT Bombay, 2010
5. Kinship Terms in Assamese Language, Shikhar Kumar Sarma, Utpal
Saikia, Mayashree Mahanta, Indo Wordnet Workshop, IIT Kharagpur,
2010
6. Formation of Kinship Terms in Bodo Language, Shikhar Kr. Sarma,
Biswajit Brahma, Mane Bala Ramchiary, Indo Wordnet Workshop, IIT
Kharagpur, 2010
7. Architecture of a Spell Checker for an Indo-Aryan Language: Assamese,
Ambeswar Gogoi, Shikhar Kr. Sarma and Kishore Baishya, International
Journal of Computational Linguistics, Volume (1): Issue (1), 2009
8. A Case Study of Dictionary Annotation as a Pre-processing Task to Develop
Assamese Spell Checker, Ambeswar Gogoi and Kishore Baishya, Making
of Electronic Dictionary, Linguistic Data Consortium for Indian Languages,
CIIL Mysore, 2009
Conclusion

Integration and collaboration of manpower in the fields of
Linguistics and Computing; trained manpower development in
the field of NLP; local language technology development.

Through this project a new breed of researchers in language
technologies has been trained in the proper skills and knowledge
sets. In these local languages, formal education in linguistics
and literature has minimal computational linkage and offers no
training or exposure in interlinking linguistics and computing,
so the project helps develop a team of interdisciplinary
researchers. The project has contributed to expertise development
and awareness creation in the latest developments in machine
translation, lexical semantics, cross-lingual IR, etc.
THANK YOU