1. Porting of a Morphological Analyzer cum Spell Checker


CALTS, UNIV. OF HYDERABAD. SAP, LANGUAGE TECHNOLOGY

CALTS has been in NLP for over a decade.

It has participated in the following major projects:

1. NLP-TTP, DOE Govt. of India.

2. IPDA, DOE Govt. of India.

3. TRCT, TDIL, MCIT.

4. English-Telugu, T2TMT UPE, UGC, UOH.

Morphological Analyzer cum Spell Checker for Telugu

• A robust Morphological analyzer cum Spell Checker for Telugu.

• With 97% recognition rate.

• Tested on corpora of 5 million words.

• For users of Windows and Linux.
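
As a rough illustration of how a morphological analyzer can double as a spell checker, the minimal sketch below accepts a word when it can be split into a known stem plus a known suffix, and reports a recognition rate over a word list. The stem list, the suffix table and the romanized forms are toy assumptions made for this example; they are not the CALTS lexicon or algorithm, and real Telugu morphology involves sound changes that simple concatenation does not capture.

```python
# Toy data: hypothetical romanized stems and suffixes, not the CALTS resources.
STEMS = {"kukka": "noun", "bomma": "noun"}
SUFFIXES = {"": "singular", "lu": "plural", "ki": "dative"}


def analyze(word):
    """Return every (stem, category, suffix feature) split licensed by the tables."""
    analyses = []
    for cut in range(1, len(word) + 1):
        stem, suffix = word[:cut], word[cut:]
        if stem in STEMS and suffix in SUFFIXES:
            analyses.append((stem, STEMS[stem], SUFFIXES[suffix]))
    return analyses


def is_recognized(word):
    """Spell-check view: a word is accepted if it has at least one analysis."""
    return bool(analyze(word))


def recognition_rate(words):
    """Share of tokens in a word list that receive an analysis."""
    if not words:
        return 0.0
    return sum(is_recognized(w) for w in words) / len(words)
```

A recognition rate such as the one quoted above would then be the value of `recognition_rate` over the tokens of a test corpus.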

A Multilingual Encyclopedic Electronic thesaurus for translators, MEET, a Web based linguistic application.

• MEET enables quick access to various synonyms.

• Provides equivalents in other Indian languages and English.

• Also provides grammatical and Semantic information.

• A useful application for translators.

• Provides access to information in Indian languages on the web.

• Currently includes only Marathi, Hindi, Bangla, Konkani and English.

• The 2nd phase proposes to include Telugu, Kannada and Oriya.

• Word net for individual languages may be linked to the system.

Telugu Hyper Grammar.

• The Telugu Hyper Grammar is designed as a dynamically accessed and non-linearly organized grammar of Telugu.

• A user can access information in a particular module from any other module.

• Provides access to a Morphological Analyzer, Generator and a Chunker.

• Can access various bilingual and bi-directional digital lexica of Telugu and other Indian languages (Hindi, Kannada, Tamil, Marathi, Oriya, Malayalam) and English.

English-Telugu Parallel Corpora.

• Parallel Corpora are sets of thematically corresponding digital texts of selected works.

• The use of Parallel Corpora has revolutionized recent work in Machine Translation.

• Parallel Corpora offer a way to discover similarities and differences between a pair of languages.

• A program for aligning parallel texts in English and Telugu has been developed and is being tested.

• Selected parallel texts in Telugu, Kannada, Tamil, Marathi and Malayalam have been digitized.
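
The slides do not describe how the English-Telugu aligner works. To give a sense of what sentence alignment involves, here is a minimal sketch of a simplified length-based approach using dynamic programming. It is a stand-in technique, not the CALTS program: the function name, the penalties and the use of raw character lengths are all assumptions, and a real aligner for two different scripts would at least rescale lengths between the languages.

```python
def align(src, tgt, skip_penalty=10, merge_penalty=2):
    """Align two sentence lists by character length with dynamic programming.

    Returns a list of (source_indices, target_indices) beads.
    """
    # Allowed bead shapes: (source sentences consumed, target sentences consumed).
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                if di == 0 or dj == 0:
                    # Unmatched sentence: fixed penalty plus its length.
                    step = skip_penalty + s_len + t_len
                else:
                    # Matched bead: length mismatch, plus a charge for merging.
                    step = abs(s_len - t_len) + merge_penalty * (di + dj - 2)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the beads in order.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]
```

Each bead pairs one or two source sentences with one or two target sentences, or marks a sentence as unmatched when one side of the bead is empty.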

English-Telugu T2T Machine Translation System

• An English-Telugu Machine Translation System is being built at CALTS in collaboration with IIIT, Hyderabad; Telugu University, Hyderabad; and Osmania University, Hyderabad.

• Uses an English-Telugu MAT lexicon of 42K.

• A wordform synthesizer for Telugu has been developed and incorporated.

• It incorporates an evolutionary semantic lexicon.

• It handles English sentences of varying complexity.

MAT Lexica.

• Bilingual and Multidirectional.

• Machine Readable Dictionaries of 10K for Telugu-Hindi, Telugu-Kannada, Telugu-Tamil, Telugu-Marathi, Telugu-Oriya, Telugu-Bangla and Telugu-Malayalam are being developed in collaboration with the Telugu Academy.

• The entries were selected based on their frequency of occurrence in the Telugu corpus.

• The Telugu-Hindi, Telugu-Kannada and Telugu-Tamil dictionaries are already completed.

• A major part of these dictionaries was developed by realigning the lexical resources existing at CALTS.

Collocations in Indian Languages.

• Collocations, or specialized word sequences, play a crucial role in a language. They are extremely difficult to identify and translate effectively, and they present one of the most challenging tasks in Natural Language Processing.

• In the first phase, Telugu data was collected and analyzed.

• A long list of collocations was collected and used to check whether the existing criteria are valid.

• These collocations are compared against other specialized word sequences in the language to understand their functional and distributional properties.
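
The slides do not name the criteria that were checked. One widely used statistical criterion for collocation candidates is pointwise mutual information (PMI), which asks whether two adjacent words co-occur more often than their individual frequencies would predict. The sketch below is only an illustration of that idea; the function name and the `min_count` threshold are assumptions, and it is not claimed to be the method used in this project.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: -item[1])
```

Pairs with a high positive score are candidates for manual inspection; comparing them against other fixed word sequences, as the last bullet describes, still requires linguistic judgment.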

Machine Readable Dictionary of Idioms (Telugu-English).

• Idioms are among the most important and ubiquitous, yet least understood, categories of language.

• Machine-readable idioms in English and their equivalents in Telugu, along with the mechanics of their recognition and transfer rules, are being developed.

• The Machine Readable text will be implemented in XML so that access and retrieval become easier and faster.
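
A minimal sketch of how an XML idiom dictionary might be stored and queried is given below. The element names (`idioms`, `entry`, `english`, `gloss`, `telugu`) are hypothetical, since the slides do not give the schema, and the Telugu field holds a placeholder string, not real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry layout; the Telugu field holds a placeholder, not real data.
SAMPLE = """
<idioms>
  <entry id="1">
    <english>kick the bucket</english>
    <gloss>to die</gloss>
    <telugu>TELUGU_EQUIVALENT_PLACEHOLDER</telugu>
  </entry>
</idioms>
"""


def build_index(xml_text):
    """Parse the XML once and index entries by their English idiom."""
    root = ET.fromstring(xml_text)
    index = {}
    for entry in root.findall("entry"):
        english = entry.findtext("english", default="").strip().lower()
        index[english] = {
            "gloss": entry.findtext("gloss", default=""),
            "telugu": entry.findtext("telugu", default=""),
        }
    return index


def lookup(index, phrase):
    """Return the stored record for an idiom, or None if it is not listed."""
    return index.get(phrase.strip().lower())
```

Building the index once and looking phrases up in a dictionary keeps retrieval fast, which is in line with the access and retrieval goal stated above.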

Electronic Adult Literacy Primer for Telugu

• This was developed as part of CALTS's participation in Arohan (a literacy campaign adopted by the university).

• Aimed at teaching the script or the written form of the language rather than the language itself.

• Based on frequency of characters in the written texts.

• Learning a small number of the most frequent characters first ensures greater coverage of written text when learning to recognize characters.

• Special features include characters with animation and speech.

• Special attention is given to the presentation of allographs.
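
The frequency argument above can be made concrete with a coverage curve: rank characters by how often they occur and see how much of a text the top few account for. The sketch below is an assumption-laden illustration, not the primer's actual procedure. It counts individual Unicode code points, whereas Telugu text would first need to be segmented into the written units a learner actually sees.

```python
from collections import Counter


def coverage_curve(text):
    """Return [(character, cumulative share of text covered)] by falling frequency."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    curve, covered = [], 0
    # Ties are broken by character so the ordering is deterministic.
    for ch, count in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        covered += count
        curve.append((ch, covered / total))
    return curve


def characters_needed(text, target):
    """Smallest number of top-frequency characters whose coverage reaches `target`."""
    for rank, (_, share) in enumerate(coverage_curve(text), start=1):
        if share >= target:
            return rank
    return 0
```

A primer could then introduce characters in the order given by `coverage_curve`, so that early lessons cover as much running text as possible.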

10. A generic system for morphological generation for Indian languages

• Morphological generators for various Indian languages, particularly Telugu, Kannada, Tamil, Malayalam, Bangla and Oriya, are at different stages of development.

• A generic framework for wordform synthesis for Indian languages.

• Includes a testing module to measure the efficiency and coverage of the system.
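
One way to picture a generic framework is a single generation routine driven by per-language tables, with a small testing module that compares output against known forms. The sketch below follows that idea under stated assumptions: the table layout, the `toy-language` entries and the coverage and accuracy measures are all hypothetical, and plain suffix concatenation stands in for the real wordform synthesis rules.

```python
# Hypothetical paradigm tables; a real system would load one table per language.
PARADIGMS = {
    "toy-language": {
        ("noun", "plural"): "lu",
        ("noun", "dative"): "ki",
    }
}


def generate(language, stem, category, feature):
    """Attach the suffix that the language's table lists for (category, feature)."""
    suffix = PARADIGMS.get(language, {}).get((category, feature))
    return None if suffix is None else stem + suffix


def evaluate(language, gold):
    """Testing module: compare generated forms with gold forms.

    `gold` is a list of (stem, category, feature, expected_form) tuples.
    Returns (coverage, accuracy): share of items generated, share generated correctly.
    """
    generated = correct = 0
    for stem, category, feature, expected in gold:
        form = generate(language, stem, category, feature)
        if form is not None:
            generated += 1
            correct += form == expected
    total = len(gold)
    coverage = generated / total if total else 0.0
    accuracy = correct / generated if generated else 0.0
    return coverage, accuracy
```

Adding a language in this design means adding a table, not new code, which is the sense in which the routine is generic.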

11. Telugu-Tamil Machine translation system

• A Telugu-Tamil MT system is being developed using the resources available at CALTS.

• Uses the Telugu Morphological analyzer.

• Uses the Tamil generator developed at CALTS.

• Uses the Telugu-Tamil dictionary developed as part of MAT Lexica.

• Uses a verb sense disambiguator based on the argument structure of verbs.

12. Word Sense Disambiguation using Argument Structure:

• A system based on the argument structure of Telugu verbs.

• Uses a feature-based semantic lexicon.

• Efficiently disambiguates polysemous verbs in context.

• Is incorporated in the Telugu-Tamil MT system.
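
The general idea of disambiguating a verb from its arguments can be sketched as follows: each sense of a verb lists the semantic features it expects of an argument, and the sense whose expectations are met by the actual argument is chosen. Everything in this example is an assumption made for illustration. English glosses stand in for Telugu verbs, only the subject is checked, and the feature names and lexicon entries are invented; the actual CALTS lexicon and frames are not shown in the slides.

```python
# Hypothetical feature-based lexicon, with English glosses standing in for Telugu.
NOUN_FEATURES = {
    "boy": {"animate", "human"},
    "river": {"inanimate", "liquid"},
    "machine": {"inanimate", "artifact"},
}

VERB_SENSES = {
    "run": [
        {"sense": "move-on-foot", "subject": {"animate"}},
        {"sense": "flow", "subject": {"liquid"}},
        {"sense": "operate", "subject": {"artifact"}},
    ]
}


def disambiguate(verb, subject):
    """Pick the first sense whose subject restriction is met by the noun's features."""
    features = NOUN_FEATURES.get(subject, set())
    for frame in VERB_SENSES.get(verb, []):
        # The frame matches when all required features are present on the noun.
        if frame["subject"] <= features:
            return frame["sense"]
    return None
```

In a translation pipeline, the selected sense would then determine which target-language verb the dictionary lookup returns.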

13. A case-sensitive roman transliteration for Indian languages as overall pattern

• A roman transliteration scheme for unwritten languages of India has been developed.

• A common transliteration scheme for Brahmi-derived and non-Brahmi-derived scripts has been developed.

• Suprasegmental characters are mapped onto roman.

• No non-unique character mappings.

• Allows complete conversion between various languages.
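
The properties listed above (case sensitivity, unique mappings, full convertibility) can be illustrated with a tiny table-driven sketch. The four-letter tables and their roman values are assumptions chosen for the example, not the scheme described in the slides; lower and upper case are used here to separate unaspirated from aspirated consonants, and the roman form acts as a pivot between two scripts.

```python
# Hypothetical four-letter tables; the roman values are illustrative assumptions.
# Telugu letters ka, kha, ga, gha and their Devanagari counterparts.
TELUGU_TO_ROMAN = {"\u0c15": "k", "\u0c16": "K", "\u0c17": "g", "\u0c18": "G"}
DEVANAGARI_TO_ROMAN = {"\u0915": "k", "\u0916": "K", "\u0917": "g", "\u0918": "G"}


def invert(table):
    """Build the reverse table, refusing any mapping that is not one-to-one."""
    reverse = {}
    for source, roman in table.items():
        if roman in reverse:
            raise ValueError(f"non-unique mapping for {roman!r}")
        reverse[roman] = source
    return reverse


def convert(text, table):
    """Map each character through the table, leaving unknown characters as-is."""
    return "".join(table.get(ch, ch) for ch in text)


def telugu_to_devanagari(text):
    """Convert via the roman pivot: Telugu -> roman -> Devanagari."""
    roman = convert(text, TELUGU_TO_ROMAN)
    return convert(roman, invert(DEVANAGARI_TO_ROMAN))
```

Because every roman value here is a single character and no two source letters share one, the conversion can be reversed without loss. A fuller scheme with multi-character roman values would need a tokenizing step when converting back.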
