A PROPOSAL FOR
CREATION OF A LINGUISTIC DATA CONSORTIUM FOR INDIAN LANGUAGES (LDC-IL)
Focus: linguistic data
What is ‘Linguistic Data’?
• Printed words – in different scripts, fonts, and platforms
• Domain-specific texts (e.g. the 90-odd domains in current Indian language corpora)
• Samples of spoken corpus – telephone talk, public lectures, formal discussions, in-group conversations, radio talks, natural language conversations
• Hand-written samples
• Ritualistic use of languages – scriptures, chanting, etc.
• Language of performance – readings, recitations, etc.
But this data is of use only if it comes in a processed form – it must be tagged and aligned to be of use.
THAT’S WHAT CREATES AN
IMPORTANT ROLE FOR
LINGUISTS IN THIS ENTERPRISE
How the Idea of an LDC Evolved
• The Brown University text corpus was adopted to build statistical language models.
• The TI-46 & TI DIGITS databases of Texas Instruments (early 80's) were widely distributed.
• The LDC at U-Penn was established in 1992.
CIIL houses 45-million-word corpora in 15 Indian languages, built with DoE-TDIL support. CIIL has been distributing them to R&D groups the world over.
Now converted into Unicode jointly with the University of Lancaster, and with another 45-million-word corpus from five Indian languages coming in under the EMILLE project, this was released in early 2004.
CIIL is now working with the University of Uppsala on corpora of lesser-known languages of India; see www.ciiluppsalaspokencorpus.net
SO WHAT made us PROPOSE an LDC-IL?
• The giant strides in IT that India has made.
• Because demands were made by several software and telecom giants – Reliance, IBM, HP Labs, Modular Systems & Infosys.
• Due to suggestions of the Hindi Committee.
• As decided in the 1st ILPC meeting, 2004.
The proposal evolved through discussions held with many institutions in India and abroad.
August 13, 2003: 1st presentation at the
MHRD, with the then ES in the chair, and
FA, AS, J.S.(L), Director (L) and experts
from C-DAC and IIT-Kanpur.
August 17 and 18, 2003: An International Workshop on
LDC was held at the CIIL, Mysore in collaboration with
IIIT-Hyderabad and HPLabs, India. It was inaugurated by
Smt. Kumud Bansal (the then AS & now Secretary,
Elementary Ed), and attended by the J.S. (L). Those who created the LDC in the USA also participated.
August 19, 2003: a follow up meeting of a smaller group was held
at the Indian Institute of Science to thrash out further details. A
Project Committee was set up.
• The Committee had top NLP specialists and linguists, with the Director, CIIL as its Chair.
• Five experts from IIT-B, IIT-M, IISc, IIIT-Hyd & CIIL, with inputs from other experts.
• All changes were made through email chats and exchanges, after four rounds of revision during Sept–Oct 2003.
• Nov 18, 2003: Modified draft sent to the Ministry.
• Dec 19, 2003:
representatives of lead
Institutes met in
Mysore to discuss the
draft sent to the
Ministry. Prof. Aravind
Joshi also participated.
• January, 2004: With
additional inputs, the
proposal was modified.
• Feb 24, 2004: A number of suggestions made (see minutes) during the 2nd Presentation for ES, AS, JS(L) & IFD.
• April 16, 2004: After the
TDIL Advisory Comm.,
DoE offers full support.
• The importance of creation of a large data archive of Indian languages is undeniable. In fact, it is this realization that resulted in the government’s plan for corpora development in Indian languages.
• Indian languages often pose a difficult challenge for the specialists in AI/NLP.
• The technology developers building mass-application tools/products have for long been calling for availability of linguistic data on a large scale.
• However, the data should be collected, organized and stored in a manner that suits different groups of technology developers.
• These issues require us to involve a number of
disciplines like linguistics, statistics, & CS.
• Further, this data must be of high quality.
• Resources must be shared, so that all R&D
groups are benefited.
• All these are possible with a data consortium.
Spoken language data &
importance of phoneticians
• Numerous Indian languages,
each with so many sound
patterns identified/studied by
phoneticians for centuries.
• The inventory of the IPA is invaluable for a spoken language corpus, but identification of these sounds from speech data requires trained experts.
• For speech technology, we have to create both phonetic and acoustic models of languages.
• Even when it is now aided and
eased by Visual Phonetics
technology, as available in CIIL
or TIFR labs, what we need in
addition is trained phoneticians.
•An ideal model of Consortium could be
seen if we consider the Linguistic Data
Consortium (LDC) hosted by the
University of Pennsylvania.
•LDC (USA) is an open consortium of
universities, companies & government
R&D labs that creates, collects and
distributes speech and text databases,
lexicons, and other resources for R&D.
• This ‘LDC’ has 100-plus agencies as its active users and members. It includes some non-western languages: Arabic, Chinese, Korean.
• The core operations of the LDC became self-supporting after ten years.
• The activities include maintaining the data archives, producing and
distributing CD-ROMs, and arranging networked data distribution, etc.
• All these have provided a great impetus to R&D in the field of
language technology for English and other European languages.
• It is proposed to adopt a similar approach in the Indian context.
Who funded LDC in US?
• LDC was supported initially by
US Govt grant IRI-9528587
from the Information and
Intelligent Systems division of the NSF
• Also by a grant 9982201 from
the Human Computer
Interaction Program of the
National Science Foundation
• Powered in part by Academic Equipment Grant 7826-990237-US from Sun Microsystems.
• No member institution could afford to produce this data on its own.
Who will set up LDC-IL in India?
What will it do actually?
• The Ministry of HRD through the Central Institute of Indian
Languages (CIIL), Mysore along with other institutions
working on Indian Languages technology like Indian
Institute of Science, Bangalore, Indian Institutes of
Technology at Mumbai and Chennai, as well as the
International Institute of Information Technology, Hyderabad
propose to set up this LDC-IL.
• It is proposed that they will be the Lead Institutions in this
initiative, with CIIL as the coordinating body.
•LDC-IL will be an archive plus.
•Besides data, tools and standards of data representation
and analysis must be developed.
•It will create, analyze, segment, tag, align, and upload
different kinds of linguistic resources.
•It will accept electronic resources from authors,
newspapers, publishers, film, TV, radio & process them for
use of the community.
Potential Participants /
Institutions in India
All academic institutes, research organizations and corporate R&D groups from India and abroad working on Indian languages will be encouraged to participate in LDC-IL. The following have already shown interest:
•All Indian Institutes of Technology;
•IIITs at Hyderabad and elsewhere;
•Universities like U of Hyderabad; DU; JNU; NEHU
•HP Labs India;
•IBM; Infosys; Reliance Infocom;
•Language institutions like CIEFL, KHS, NCPUL & RSKS;
Major areas of Linguistic Resource
Development as proposed
• Speech Recognition
• Character Recognition
• Creation of different
kinds of Corpora
• By-products: word finders, lexicons of different kinds, thesauri, usage compilations, etc.
Other possible applications
• Collocational restrictions
• TTS: statistical approaches
• Building speech recognition systems
• Developing tree-bank tools
• Forming a basis for MAT or MT
IN A WAY, ALL THESE WILL BE COMPLEMENTARY TO WHAT IS BEING PLANNED / ENCOURAGED BY TDIL of MCIT.
Funding & Management
• The core funding will come from the Government of India and will span two plan periods.
• All activities will be in a project mode and through
CIIL’s PL account.
• All staff will be on contract.
• All receipts and payments through internet
gateways, or through conventional means, will go
to this special bank account.
• Will attempt to leverage expertise already available
to cut avoidable cost and delay.
• As the nodal agency, CIIL will further distribute the
relevant funding for specific sub-components of the
scheme to other academic institutions.
• An annual progress report will be submitted to the Ministry.
LDC-IL : Open to institutions, Research Organizations, and
Corporate sector from all over the world.
Will encourage members to contribute databases and
share revenues from sale of the data they contribute.
The databases will be available for R&D purposes to all
members and non-members on payment of the
appropriate fee, with a license for use only.
General membership will entitle all members to a large chunk of tagged/aligned data free of charge. However, for specialized parts, depending on the data contributors, they will have to pay additional amounts.
Member organizations will be asked to sign a License Agreement stating that the databases will not be redistributed to others, either free or for a fee.
The IP and the copyright of any product developed as a
result of such an R&D activity shall lie with the
organization that has created the product.
PAC of LDC-IL
1. The LDC–IL will have a Project Advisory Committee (PAC).
2. Permanent members: Directors or nominees of lead institutions.
3. The PAC may be expanded later.
4. Lead institutions may be made expandable, with major
enterprises joining by putting in a major corpus grant.
5. It is to be understood that even if institutions from abroad
join this Consortium the administration/governance of it will
remain with Indian members only.
6. An official of the language Bureau nominated by the
MHRD and a nominee of the MCIT will be members of the
PAC. The FA of MHRD will also be a member.
7. The Director, Central Institute of Indian Languages will be
the Head of the LDC-IL. He will be assisted by a Project
Director nominated/ appointed for the purpose.
8. One expert in IPR matters, normally drawn from institutions like the National Law School of India University, Bangalore, will also be a member.
Differential rate of annual fee
India:
• 1. Individual researchers: Rs. 2,000 per annum
• 2. Educational institutions: Rs. 20,000 per annum
• 3. Software and related industry: Rs. 2,00,000 per annum
Other countries:
• 1. Individual researchers: $2,000 per annum
• 2. Educational institutions: $20,000 per annum
• 3. Software and related industry: $50,000 per annum
IT GOES WITHOUT SAYING THAT THIS WOULD REQUIRE CONSTANT UPDATING AND UPGRADING, AS WELL AS EXPANSION OF OUR DATA / TOOLS / PRODUCTS
• It is estimated that by the third year, LDC-IL will have 50 institutional members from India, and 200 Indian scholars as individual members, contributing Rs. 12 lakhs annually.
• In addition, it is estimated to have at least 20 researchers from abroad as individual members, contributing $40,000, or about Rs. 20 lakhs, annually.
• The attempt will be to secure
industrial support from the IT
sector internationally to raise at
least 10 institutional
memberships initially, creating a
corpus of $ 200,000 annually
by/during the third year. Should
that happen, it will generate a
substantial amount for LDC-IL.
Budget: A broad indication*
Rs. 221.60 lakhs per year. Total: Rs. 1772.8 lakhs for the next 8 years.
• 1. Human resources:
• 2. Tasks:
• 3. Events (meetings, workshops, seminars & training programs): 50,00,000
• 4. Equipment & maintenance: 27,00,000
• 5. IPR costs & publications:
Total: Rs. 2,21,60,000
•NB: The Director, CIIL, on the advice of the Project Advisory Committee of the
LDC-IL may be authorized to re-appropriate funds from among the heads
indicated here, without exceeding the overall budget.
•In case people in service in the Government or Autonomous Institutions in a substantive capacity are selected, their service and salary will be protected.
Project Director (1): Rs. 30,000 (variable) x 12 months
Scientist A (3): 29,000 x 3 persons x 12 man-months
Scientist B (4): 21,000 x 8 x 12 m
Scientist C (5): 14,000 x 6 x 12 m
Scientist D (8): 11,000 x 8 x 12 m
Project technicians: Rs. 5,000 x 20 x 12 m
Maint. personnel – Accounts: Rs. 11,000 x 1 x 12 m
Maint. personnel – Sales & Promo: Rs. 7,000 x 1 x 12 m
Maint. personnel – General: Rs. 7,000 x 1 x 12 m
Tasks at various participating institutions
Academic meetings in different institutes x 2
LDC-IL PAC meetings at CIIL x 2
Seminars & events in different institutes – 7 every year in all:
Seminars (national) in different institutes x 2
Seminars (regional) in different institutes x 4
Seminars (international), rotating among participating institutes, x 1 per year
(Prod) Workshops for production (6)
Training Programmes x 4 per year
Travel & Incidentals
Equipments & Maintenance
Maintenance of LDC-IL
IPR/Copyright payments (variable)
Publications, incl E-pub (10 a year)
Resource Generation- Details
The first 2 years of the project are
incubation years. It would take time
to set up and test-run tools and deliverables, and to advertise.
It is estimated that from the third year onwards, the annual revenue may be 8% to 10% of the annual investment, i.e. Rs. 17.73 lakhs to Rs. 22.16 lakhs, contributing to the corpus fund.
From the 6th year on, it will be around 25% to 30% of the amount invested, i.e. Rs. 55.4 lakhs to Rs. 66.48 lakhs.
At the end of eight years, there will be
at least Rs. 201.66 lakhs to Rs. 243.76
lakhs plus interests in corpus funds.
Hopefully, there will be new lead institutions to contribute further to the corpus fund, once LDC-IL works in full swing.
Core operations to be self-supporting
• Beyond eight years, Govt may
support only events (Rs.50 lakhs
from CIIL’s OC-Plan), tasks of
software development (Rs.64.76
lakhs from our OE-Plan), and
maintenance of equipments
(Rs.15.24 lakhs from OE-Non-Plan),
i.e. Rs.130 lakhs a year.
• The services of the personnel and the IPR costs will be paid from 6% interest on the corpus funds (Rs.14.63 lakhs) plus the anticipated annual income of Rs.66.48 lakhs, i.e. Rs.81.11 lakhs generated annually.
With Rs.130 lakhs as above, the
total comes to Rs.211.11 lakhs
Speech Recognition and Synthesis: Objectives
Primarily to build speech recognition and synthesis systems.
Although there are ASR & TTS systems for many western languages, commercially viable speech systems are unavailable for Indian languages.
Voice User Interfaces for IT applications and services will be useful especially in semi-urban and rural areas.
If such technology is available in Indian languages, people in various semi-urban and rural parts of India will be able to use telephones and the Internet to access a wide range of services and information on health, agriculture, travel, etc.
However, for this a computer has to be able to accept speech input in the user’s language and provide natural speech output.
Speech technology will be even more valuable in India if coupled with translation systems between the various Indian languages.
The main obstacle to customizing this technology for various Indian languages is the lack of appropriate annotated speech databases.
Focus: (i) to collect data that can be used for building speech enabled
systems in Indian languages and (ii) to develop tools that facilitate collection of
high quality speech data.
Goals – long & short term
Long Term Goal:
The grand vision of this project is to collect data to provide speech-to-speech translation from each and every language
to each and every other language spoken in India (including Indian English). Such a system would include unlimited
vocabulary speech synthesis and recognition systems for every Indian language coupled with machine translation systems
between those languages. The block diagram given below describes the basic architecture of such a system.
[Block diagram: Speech in Language A → Speech Recognition → Recognized Text in Language A → Machine Translation (A to B) → Translated Text in Language B → Text-to-Speech conversion → Speech in Language B]
Short Term Goal:
To create databases for building (a) bi-directional speech to speech translation system of read speech for a pair of
Indian languages, namely, Hindi-Telugu, (b) a speech recognition system for Indian English. Further, it is desired to collect
large vocabulary isolated data for the 22 Scheduled Indian languages.
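The cascaded architecture described above can be sketched in code. The three stage functions below are illustrative stubs of my own devising, not components of the proposed system:

```python
# A minimal sketch of the cascaded speech-to-speech architecture:
# ASR (language A) -> MT (A to B) -> TTS (language B).
# All three stages are placeholder stubs for illustration only.

def recognize_speech(audio, lang):
    """ASR stage: audio in `lang` -> recognized text (stub)."""
    return audio["transcript"]            # a real system decodes the waveform

def translate(text, src, tgt):
    """MT stage: text in `src` -> translated text in `tgt` (stub)."""
    return f"[{src}->{tgt}] {text}"       # a real system applies an MT model

def synthesize_speech(text, lang):
    """TTS stage: text in `lang` -> synthesized audio (stub)."""
    return {"lang": lang, "text": text}   # a real system generates a waveform

def speech_to_speech(audio, src, tgt):
    text_src = recognize_speech(audio, src)
    text_tgt = translate(text_src, src, tgt)
    return synthesize_speech(text_tgt, tgt)

out = speech_to_speech({"transcript": "namaste"}, "Hindi", "Telugu")
```

Each stage can be developed and evaluated independently, which is why the proposal treats ASR, MT and TTS data collection as separate tasks.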
Data collection Effort for Automatic Speech Recognition (ASR)
Data required: Read speech corpora for two Indian languages and Indian English.
Channels: 1. Close-talking microphone, on a desktop or laptop.
2. Telephone, both landline and mobile.
Annotation: The data will be annotated at phoneme, syllable, word and sentence levels.
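The four annotation levels can be pictured as time-aligned tiers over the same recording. The utterance, segment boundaries and labels below are invented for illustration:

```python
# A sketch of multi-level annotation for read speech: each tier covers
# the same stretch of audio at a different granularity (times in seconds).
# The utterance and all boundary values below are invented examples.

utterance = {
    "sentence": [(0.00, 1.10, "namaste")],
    "word":     [(0.00, 1.10, "namaste")],
    "syllable": [(0.00, 0.35, "na"), (0.35, 0.70, "mas"), (0.70, 1.10, "te")],
    "phoneme":  [(0.00, 0.18, "n"), (0.18, 0.35, "a"), (0.35, 0.47, "m"),
                 (0.47, 0.58, "a"), (0.58, 0.70, "s"), (0.70, 0.90, "t"),
                 (0.90, 1.10, "e")],
}

def tier_is_contiguous(tier):
    """True if the segments tile the tier with no gaps or overlaps."""
    return all(abs(prev[1] - cur[0]) < 1e-9
               for prev, cur in zip(tier, tier[1:]))

ok = all(tier_is_contiguous(t) for t in utterance.values())
```

A consistency check like `tier_is_contiguous` is useful during manual annotation, since misaligned tier boundaries are a common source of errors in speech databases.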
Data Collection for Isolated Speech Recognition
Channels: 1. Close talking microphone, on a desktop or laptop
2. Telephone, both landline and mobile
Demography: 10,000 words from 300 speakers (150 male, 150 female)
Data Collection for Text to Speech Synthesis
Data Required: Data will be collected in the form of read-out phonetically balanced text which will ensure
coverage of all speech sounds of the language concerned in different prosodic and phonological contexts. The
phonetically balanced text will be extracted from a huge text corpus.
Channels: Speech Synthesis requires high quality recording in an anechoic chamber using high quality
microphones and recording equipment.
Demography: 6 speakers: 3 males and 3 females per language.
Annotation: Data to be annotated at phone, phoneme, syllable, word, and phrase level.
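One common way to extract a phonetically balanced prompt set from a huge corpus is greedy selection: repeatedly pick the sentence that adds the most uncovered phones. A sketch, with invented sentence identifiers and phone sets:

```python
# Greedy selection of a phonetically balanced prompt set: keep choosing
# the sentence that covers the most phones not yet covered.
# `sentences` is a list of (sentence_id, set_of_phones) pairs (invented).

def select_balanced(sentences, target_phones):
    covered, chosen = set(), []
    while not target_phones <= covered:
        best = max(sentences, key=lambda s: len(s[1] - covered))
        gain = best[1] - covered
        if not gain:
            break                # remaining phones do not occur in the corpus
        chosen.append(best[0])
        covered |= gain
    return chosen, covered

corpus = [("sent1", {"k", "a", "m"}),
          ("sent2", {"t", "e", "a"}),
          ("sent3", {"p", "i"})]
chosen, covered = select_balanced(corpus, {"k", "a", "m", "t", "e", "p", "i"})
```

In practice the target set would also include phones in different prosodic and phonological contexts, as the data requirement above specifies, but the greedy covering idea is the same.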
• Speech to Speech translation for a pair of Indian
languages, namely, Hindi and Telugu.
• Command and control applications.
• Multimodal interfaces to the computer in Indian languages.
• E-mail readers over the telephone.
• Readers for the visually disadvantaged.
• Speech enabled Office Suite.
The effort for both Speech Recognition and Speech Synthesis will be repeated
across all 22 Scheduled languages. For Speech Recognition, spontaneous speech
data will be collected along with read speech. For speech synthesis, data will be
collected from professional speakers, with very good voice quality. Additional
speech data will be collected to come out with models for prosody (intonation,
duration, etc.) to improve the naturalness of synthesized speech. A database
(lexicon) of proper names (of Indian origin) will be created, with the equivalent
phonetic representation for each of the names.
Character Recognition refers to the conversion of printed or
handwritten characters to a machine-interpretable form.
”Online” handwriting recognition or Online HWR refers to the
interpretation of handwriting captured dynamically using a handheld
or tablet device. It allows the creation of more natural handwriting-based alternatives to keyboards for data entry in Indian scripts, and also for imparting handwriting skills using computers.
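Dynamically captured handwriting arrives as a time-ordered sequence of pen positions per stroke; a common preprocessing step before recognition is resampling each stroke to a fixed number of points. A sketch with invented coordinates:

```python
# Online HWR input: each stroke is a time-ordered list of (x, y) pen
# positions from the handheld or tablet device. Resampling to a fixed
# length is a typical preprocessing step. Coordinates are invented.

def resample(stroke, n):
    """Keep n points spread evenly along the captured point sequence."""
    step = (len(stroke) - 1) / (n - 1)
    return [stroke[round(i * step)] for i in range(n)]

stroke = [(0, 0), (1, 2), (2, 3), (4, 4), (6, 4), (8, 3)]
fixed = resample(stroke, 4)
```

Fixed-length strokes make it straightforward to compare handwriting samples of different writing speeds, which matters given the variety of writing styles noted above.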
“Offline” handwriting recognition or Offline HWR refers to the
interpretation of handwriting captured statically as an image.
Optical character recognition or OCR refers to the interpretation of
printed text captured as an image. It can be used for conversion of
printed or typewritten material such as books and documents into electronic form.
These different areas of language technology require different
algorithms and linguistic resources.
They are all hard research problems because of the variety of writing
styles and fonts encountered.
Of these, OCR has seen some research in a few Indian scripts because
of support from the TDIL program. However, the technology is not yet
mature and there is only one commercial offering.
1. Handwriting Interface to Computers
Indian scripts are complex and not suitable for keyboard-based entry. Replacing the
keyboard with a simpler and more natural interface based on handwriting would make
computers much more accessible to the common man and to educators in particular. The
solution would also need to support numerals, punctuation, and editing gestures, and
functionally replace the keyboard.
2. Handwriting Tutor
3. Multilingual Digital Libraries for Education
A wealth of literature and other education material in Indian languages is trapped in
books, which require storage and are subject to physical decay. Online books may be
easily made available to students all over in their schools, homes or hostels.
The proposed solution will use a complete OCR pipeline for converting scanned images of book pages into electronic form, with search in the local language using either spoken (using Speech Recognition) or written (using Online HWR) queries.
4. Automatic Forms Processing/Educational Testing
With millions of application forms filled in every year in Indian languages especially in the
education sector, a solution for automatically reading handwriting from scanned images of
forms is valuable.
The proposed solution is a complete forms-processing system.
The interpreted results can be stored into a database (for applications) or compared with
correct responses (for educational testing).
Natural Language Processing
Electronic dictionaries are a primary requisite for developing any software in NLP.
ED 1 Monolingual/bilingual dictionaries
25,000 words per year (per language)
ED 2. Transfer Lexicon and Grammar(TransLexGram) (per language)
The Transfer Lexicon and Grammar above involves developing a language resource which would contain:
o English headwords
o Their grammatical category
o Their various senses in Hindi
o The corresponding sense in the other Indian language
o An example sentence in English for each sense of a word
o The corresponding translation in the concerned Indian language
o In case of verbs, parallel verb-frames from English to Indian language.
As is obvious from the above, TransLexGram will be a rich lexicon which will contain not only word-level information but also the crucial information of verb-argument structure and the vibhaktis associated with specific senses of a verb.
The resource, once created, will be a parallel resource not only between English and Indian languages but also across all Indian languages.
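The fields listed above can be pictured as one nested entry per headword. All values below are invented placeholders, not real lexicon content or a prescribed schema:

```python
# A sketch of a single TransLexGram entry with the fields described
# above. Every value here is an illustrative placeholder.

entry = {
    "headword": "bank",
    "category": "noun",
    "senses": [{
        "hindi_sense": "<Hindi gloss for the 'river bank' sense>",
        "other_language_sense": "<corresponding gloss in the other Indian language>",
        "english_example": "We sat on the bank of the river.",
        "example_translation": "<translation in the concerned Indian language>",
    }],
    "verb_frames": [],   # parallel English-to-Indian-language frames (verbs only)
}
```

Keeping each sense self-contained, with its own example and translation, is what makes the resource reusable as a parallel lexicon across language pairs.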
Creation of Corpora
Domain Specific Corpora:
Apart from this basic text corpora creation, an attempt will be made to create domain-specific corpora in the following areas:
• Child language corpus
• Pathological speech/language data
• Speech error data
• Historical/inscriptional databases of Indian languages – among the most important resources, both as living documents of Indian history and for the historical linguistics of Indian languages
• Comparative/descriptive/reference grammars, to be considered as corpus databases
• Morphological analyzers and morphological generators
POS tagged corpora
• Part-of-speech (or POS) tagged corpora are collections of texts
in which part of speech category for each word is marked.
• To be developed in a bootstrapping manner.
• First, manual tagging will be done on some amount of text.
• Then, a POS tagger which uses learning techniques will be used
to learn from the tagged data.
• After the training, the tool will automatically tag another set of
the raw corpus.
• Automatically tagged corpus will then be manually validated
which will be used as additional training data for enhancing the
performance of the tool.
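The bootstrapping procedure above can be sketched as a loop. The toy unigram tagger and the automatic `validate` function below are stand-ins for the real learning tool and the manual validation step:

```python
# A sketch of bootstrapped POS tagging: train on tagged data, auto-tag
# a raw batch, validate it, and fold it back into the training data.

class UnigramTagger:
    """Toy stand-in for a trainable tagger: one tag per word form."""
    def __init__(self):
        self.lexicon = {}
    def train(self, tagged_sentences):
        for sent in tagged_sentences:
            for word, tag in sent:
                self.lexicon[word] = tag
    def tag(self, sentence):
        return [(w, self.lexicon.get(w, "UNK")) for w in sentence]

def bootstrap(tagger, seed_tagged, raw_batches, validate):
    training = list(seed_tagged)
    for raw in raw_batches:
        tagger.train(training)                     # learn from tagged data
        auto = [tagger.tag(sent) for sent in raw]  # auto-tag raw corpus
        training.extend(validate(auto))            # manual validation step
    return training

seed = [[("the", "DET"), ("dog", "NOUN")]]
raw_batches = [[["the", "cat"]]]
fix_unknowns = lambda auto: [[(w, "NOUN" if t == "UNK" else t) for w, t in s]
                             for s in auto]
data = bootstrap(UnigramTagger(), seed, raw_batches, fix_unknowns)
```

Each pass grows the training data, so the tool makes fewer errors on the next raw batch and the manual validation burden shrinks over time.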
Other kinds of Corpora
Chunked corpora:
• The chunked corpora will
also be prepared in a
manner similar to the POS
tagging. Here also the initial
training set will be a
complete manual effort.
Thereafter, it will be a man-machine effort. That is why the target in the first year is less and doubles in the successive years. Chunked corpora are a useful resource for various applications.
Semantically tagged corpora:
The real challenge in any NLP and
text information processing
application is the task of
disambiguating senses. In spite of
long years of R & D in this area,
fully automatic WSD with 100%
accuracy has remained an elusive
goal. One of the reasons for this
shortcoming is understood to be
the lack of appropriate and
adequate lexical resources and
tools. One such resource is the
"semantically tagged corpora".
Syntactic tree bank:
Preparation of this resource
requires higher level of linguistic
expertise and needs more human
effort. First, experts will manually
tag the data for syntactic parsing.
A crucial point related to this task is to arrive at a consensus regarding the tags, the degree of fineness in analysis, and the methodology to be followed. This calls for discussions among scholars from varying fields such as Sanskritists, linguists and computer scientists. It will be achieved through workshops and meetings.
Parallel aligned corpora:
A text available in multiple languages constitutes parallel corpora.
NBT & Sahitya Akademi are some of the official agencies that develop parallel texts in different languages through translation. Such institutions have given permission to CIIL to use their works for creating electronic versions as parallel corpora.
Literary magazines and newspapers with multiple language editions will have to be approached for parallel corpora.
Computer programmes have to be written for creating: [I] aligned texts; [II] aligned sentences; and [III] aligned chunks.
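Sentence alignment is commonly done with length-based methods (in the spirit of Gale & Church). The sketch below handles only 1:1 links, flagging pairs whose length ratio looks implausible; the sentence pairs are invented:

```python
# A sketch of 1:1 length-based sentence alignment: pair sentences in
# order and flag pairs whose character-length ratio is implausible.
# Real aligners also handle 1:2 and 2:1 links via dynamic programming.

def align_one_to_one(src_sents, tgt_sents, max_ratio=2.0):
    pairs = []
    for s, t in zip(src_sents, tgt_sents):
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        pairs.append((s, t, ratio <= max_ratio))   # True = plausible pair
    return pairs

src = ["A short sentence.", "Another one here."]
tgt = ["Ek chhota vaakya.", "Ek aur yahaan."]
pairs = align_one_to_one(src, tgt)
```

The length heuristic works surprisingly well because translated sentences tend to have proportional lengths; flagged pairs are then sent for manual checking.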
• 1. Tools for Transfer Lexicon Grammar (including creation of an interface for building the Transfer Lexicon Grammar)
• 2. Spellchecker and corrector tools
• 3. Tools for POS tagging (a trainable tagging tool + an interface for editing POS-tagged corpora)
• 4. Tools for chunking (rule-based language-independent chunkers)
• 5. Interface for chunking (an interface for editing and validating chunked output)
• 6. Tools for syntactic tree bank, including an interface for developing the syntactic tree bank
• 7. Tools for semantic tagging, with the Indian language WordNets as the basic resource: a browser with two windows – the text in one, and the senses (i.e., synsets) from the WordNet appearing in the other, from which a manual selection of the sense can be done
• 8. A (semi-)automatic tagger based on statistical NLP (a preliminary version of which is ready at IITB)
• 9. Tools for text alignment, including a text alignment tool, sentence alignment tool and chunk alignment tool, as well as an interface for aligning corpora