Transcript 1. Porting of a Morphological Analyzer cum Spell Checker
CALTS, UNIV. OF HYDERABAD. SAP, LANGUAGE TECHNOLOGY
CALTS has been in NLP for over a decade.
It has participated in the following major projects:
1. NLP-TTP, DOE Govt. of India.
2. IPDA, DOE Govt. of India.
3. TRCT, TDIL, MCIT.
4. English-Telugu, T2TMT UPE, UGC, UOH.
1.
Morphological Analyzer cum Spell Checker for Telugu
• A robust Morphological analyzer cum Spell Checker for Telugu.
• With a 97% recognition rate.
• Tested on 5 million word corpora.
• For users of Windows and Linux.
2.
A Multilingual Encyclopedic Electronic thesaurus for translators, MEET, a Web based linguistic application.
• MEET enables quick access to various synonyms.
• Provides equivalents in other Indian languages and English.
• Also provides grammatical and Semantic information.
• A useful application for translators.
• Provides access to information in Indian languages on the web.
• Currently includes only Marathi, Hindi, Bangla, Konkani and English.
• The 2nd phase proposes to include Telugu, Kannada and Oriya.
• Word net for individual languages may be linked to the system.
3.
Telugu Hyper Grammar.
• The Telugu Hyper Grammar is designed as a dynamically accessed and non-linearly organized grammar of Telugu.
• A user can access information at a particular module from any other module.
• Provides access to a Morphological Analyzer, Generator and a Chunker.
• Can access various bilingual and bi-directional digital lexica of Telugu and other Indian Languages like Hindi, Kannada, Tamil, Marathi, Oriya, Malayalam and English.
4.
English-Telugu Parallel Corpora.
• Parallel Corpora are a set of thematically corresponding digital texts of some selected works.
• The use of Parallel Corpora has revolutionized recent work in Machine Translation.
• Parallel Corpora provide a way to discover similarities and differences between a pair of languages.
• A program for aligning parallel texts in English and Telugu has been developed and is being tested.
• Selected parallel texts in Telugu, Kannada, Tamil, Marathi and Malayalam have been digitized.
5.
English-Telugu T2T Machine Translation System
• An English-Telugu Machine Translation System is being built at CALTS in collaboration with IIIT, Hyderabad; Telugu University, Hyderabad; and Osmania University, Hyderabad.
• Uses an English-Telugu MAT lexicon of 42K.
• A wordform synthesizer for Telugu is developed and incorporated.
• It incorporates an evolutionary semantic lexicon.
• It handles English sentences of varying complexity.
6.
MAT Lexica.
• Bilingual and Multidirectional.
• Machine Readable Dictionaries for Telugu-Hindi, Telugu-Kannada, Telugu-Tamil, Telugu-Marathi, Telugu-Oriya, Telugu-Bangla, Telugu-Malayalam, of 10K are being developed in collaboration with the Telugu Academy.
• The entries were based on the frequency of their occurrence in the corpus of Telugu.
• The Dictionaries of Telugu-Hindi, Telugu-Kannada and Telugu-Tamil are already completed.
• A major part of these dictionaries has been developed by realigning the lexical resources existing at CALTS.
7.
Collocations in Indian Languages.
• Collocations, or specialized word sequences, play a crucial role in a language. They are extremely difficult to identify and translate effectively, and they present one of the most challenging tasks in Natural Language Processing.
• In the first phase, Telugu data was collected and analyzed.
• A long list of collocations was collected and examined to check whether the existing criteria are valid.
• These collocations are compared against other specialized word sequences in the language to understand their functional and distributional properties.
8.
Machine Readable Dictionary of Idioms (Telugu-English).
• Idioms are extremely important and ubiquitous, yet they are among the less understood categories of language.
• Machine-readable idioms in English and their equivalents in Telugu, along with the mechanics of their recognition and transfer rules, are being developed.
• The Machine Readable text will be implemented in XML so that access and retrieval become easier and faster.
9.
Electronic Adult Literacy Primer for Telugu
• This is developed as part of CALTS's participation in Arohan (a literacy campaign adopted by the university).
• Aimed at teaching the script or the written form of the language rather than the language itself.
• Based on frequency of characters in the written texts.
• Learning a small number of the most frequent characters first ensures greater coverage in character recognition.
• Special features include characters with animation and speech.
• Special attention is given to the presentation of allographs.
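The frequency-based ordering described above can be illustrated with a minimal sketch. It is not the primer's actual implementation: the function names `coverage_order` and `chars_needed` are hypothetical, and counting single Unicode code points is a simplifying assumption (a real Telugu primer would more likely count written syllable units).

```python
from collections import Counter


def coverage_order(text):
    """Return (character, cumulative coverage) pairs, most frequent first.

    Assumption: each non-space code point counts as one 'character'.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    ordered = []
    running = 0
    for ch, n in counts.most_common():
        running += n
        ordered.append((ch, running / total))
    return ordered


def chars_needed(text, target):
    """Smallest number of top-frequency characters whose coverage reaches target."""
    for i, (_, coverage) in enumerate(coverage_order(text), start=1):
        if coverage >= target:
            return i
    return 0  # empty text: nothing to learn


# Toy input: 'a' and 'b' together cover 6 of the 7 characters.
print(chars_needed("aaaabbc", 0.8))  # 2
```

Run over a large corpus, the same ordering would suggest which characters to teach first and how much of running text a learner can already recognize after each lesson.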
10.
A generic system for morphological generation for Indian languages
• Morphological generators for various Indian languages particularly for Telugu, Kannada, Tamil, Malayalam, Bangla and Oriya are in different stages of development.
• A generic framework for wordform synthesis for Indian languages.
• Includes a testing module to measure the efficiency and coverage of the system.
11.
Telugu-Tamil Machine translation system
• Using the available resources at CALTS, a Telugu-Tamil MT system is being developed.
• Uses the Telugu Morphological analyzer.
• Uses the Tamil generator developed at CALTS.
• Uses the Telugu-Tamil dictionary developed as part of MAT Lexica.
• Uses a verb sense disambiguator based on the verbs' argument structure.
12.
Word Sense Disambiguation using Argument Structure:
• A system based on the argument structure of Telugu verbs.
• Uses a feature-based semantic lexicon.
• Efficiently disambiguates polysemous verbs in context.
• Is incorporated in the Telugu-Tamil MT system.
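As a rough illustration of disambiguation by argument structure, the sketch below picks the verb senses whose required semantic features are satisfied by the verb's arguments. Everything in it is hypothetical: the lexicon entries, feature names, role labels and the function `disambiguate` are toy English stand-ins, not the CALTS lexicon or algorithm.

```python
# Hypothetical feature-based semantic lexicon: noun -> semantic features.
NOUN_FEATURES = {
    "teacher": {"animate", "human"},
    "song": {"abstract", "sound"},
    "ball": {"concrete", "inanimate"},
}

# Hypothetical verb senses: each sense lists the features it requires
# of the noun filling each argument role.
VERB_SENSES = {
    "play": [
        {"sense": "perform_music",
         "args": {"subject": {"human"}, "object": {"sound"}}},
        {"sense": "play_game",
         "args": {"subject": {"animate"}, "object": {"concrete"}}},
    ],
}


def disambiguate(verb, arguments):
    """Return the senses of `verb` compatible with the given arguments.

    `arguments` maps a role name (e.g. "subject") to a noun.
    A sense matches when every role's required features are a subset
    of the features of the noun filling that role.
    """
    matches = []
    for entry in VERB_SENSES.get(verb, []):
        compatible = True
        for role, required in entry["args"].items():
            features = NOUN_FEATURES.get(arguments.get(role), set())
            if not required <= features:
                compatible = False
                break
        if compatible:
            matches.append(entry["sense"])
    return matches


print(disambiguate("play", {"subject": "teacher", "object": "song"}))
# ['perform_music']
```

Returning a list rather than a single sense keeps residual ambiguity visible: if several senses survive, a later stage can rank them, and if none does, the lexicon is probably missing a feature.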
13.
A case-sensitive roman transliteration for Indian languages as an overall pattern
• A roman transliteration scheme for unwritten languages of India has been developed.
• A common transliteration scheme for the scripts of Brahmi derivates and non-Brahmi derivates has been developed.
• Suprasegmental characters are mapped onto roman.
• No non-unique character mapping.
• Allows complete conversion between various languages.
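The link between unique mapping and complete conversion can be shown with a small sketch: if no two source characters share a roman symbol, the table can be inverted and the conversion reversed without loss. The table `TO_ROMAN` below is a hypothetical four-vowel fragment chosen only to show case sensitivity (short versus long vowels); it is not the CALTS scheme, and it ignores consonants and vowel signs, which a real scheme must handle.

```python
# Hypothetical fragment: four Telugu independent vowels, with letter case
# distinguishing short and long vowels.
TO_ROMAN = {"అ": "a", "ఆ": "A", "ఇ": "i", "ఈ": "I"}


def invert(mapping):
    """Invert a mapping, rejecting any non-unique target symbol."""
    inverse = {}
    for src, dst in mapping.items():
        if dst in inverse:
            raise ValueError(
                f"non-unique mapping: {inverse[dst]!r} and {src!r} both map to {dst!r}"
            )
        inverse[dst] = src
    return inverse


def convert(text, mapping):
    """Convert character by character; unmapped characters pass through."""
    return "".join(mapping.get(ch, ch) for ch in text)


roman = convert("అఆఇఈ", TO_ROMAN)
print(roman)                              # aAiI
print(convert(roman, invert(TO_ROMAN)))   # అఆఇఈ
```

Because `invert` raises an error on any shared roman symbol, a table that passes it round-trips every text it covers, which is the property conversion between scripts relies on.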