Transcript Document

Dogri: 0.22
Manipuri: 0.14
Konkani: 0.24
Sindhi: 0.25
Nepali: 0.28
Kashmiri: 0.54
Santali: 0.63
Bodo: 0.13
Sanskrit: Negligible
Non-Scheduled languages: 3.44
Maithili: 1.18
Assamese: 1.28
Punjabi: 2.83
Oriya: 3.21
Hindi: 41.03
Malayalam: 3.21
Kannada: 3.69
Gujarati: 4.48
Urdu: 5.01
Tamil: 5.91
Marathi: 6.99
Telugu: 7.19
Bengali: 8.11
Indian Languages - 2001
Adi
Garo
Kolami
Malto
Rengma
Afghani / Kabuli / Pashto
Gondi
Kom
Maram
Sangtam
Anal
Halabi
Konda
Maring
Savara
Angami
Halam
Konyak
Miri / Mishing
Sema
Ao
Hmar
Korku
Mishmi
Sherpa
Arabic / Arbi
Ho
Korwa
Mogh
Shina
Balti
Jatapu
Koya
Monpa
Simte
Bhili / Bhilodi
Juang
Kui
Munda
Tamang
Bhotia
Kabui
Kuki
Mundari
Tangkhul
Bhumij
Karbi / Mikir
Kurukh / Oraon
Nicobarese
Tangsa
Bishnupuriya
Khandeshi
Ladakhi
Nissi / Dafla
Thado
Chakhesang
Kharia
Lahauli
Nocte
Tibetan
Chakru / Chokri
Khasi
Lahnda
Paite
Tripuri
Chang
Khezha
Lakher
Parji
Tulu
Coorgi / Kodagu
Khiemnungan
Lalung
Pawi
Vaiphei
Deori
Khond / Kondh
Lepcha
Persian
Wancho
Dimasa
Kinnauri
Liangmei
Phom
Yimchungre
English
Kisan
Limbu
Pochury
Zeliang
Gadaba
Koch
Lotha
Rabha
Gangte
Koda / Kora
Lushai / Mizo
Rai
Zemi
Zou
Mission Statement
Annotated, quality language data (both text and speech)
and tools in Indian languages, to individuals,
institutions, industry etc., for Research and Development.
Created in house, through outsourcing, and through acquisition.
Objectives
• A repository of linguistic resources in all Indian
languages in the form of text, speech and lexical corpora.
• Facilitating creation of such databases by different
organizations.
• Setting standards for data collection and storage of
corpora for different research and development
activities.
• Supporting development and sharing of tools for data
collection and management.
• Facilitating training through workshops, seminars etc. in
technical as well as process related issues.
• Creating and maintaining the LDC-IL website that would
be the primary gateway for accessing LDC-IL resources.
• Designing or providing help in creation of appropriate
language technology for mass use.
• Providing the necessary linkages between academic
institutions, individual researchers and the masses.
Participating Institutions in India
All academic institutes, research organizations and
Corporate R&D groups from India and abroad working on
Indian languages will be encouraged to participate in
LDC-IL. The following have already shown interest:
• IISc Bangalore;
• All Indian Institutes of Technology;
• IIITs at Hyderabad and elsewhere;
• ISI Calcutta/Hyderabad/Bangalore;
• C-DAC, Pune;
• TIFR Mumbai;
• Universities like HCU, DU, JNU and NEHU;
• HP Labs India;
• IBM;
• Language institutions like KHS, NCPUL & RSKS; and,
• Of course, the MCIT-TDIL.
Funding & Management
• The core funding will come from the Government of India.
• All activities will be in a project mode.
• Will attempt to leverage expertise already available to
cut avoidable cost and delay.
• All staff will be on contract.
• All receipts and payments through internet gateways,
or through conventional means, will go to the
Consolidated Fund.
• However, the Government will release grants
to the Consortium as required. If need be, the support
will be extended beyond the initial six-year period.
• As the nodal agency, CIIL will further distribute the
relevant funding for specific sub-components of the
scheme to other academic institutions.
• An annual progress report will be submitted to the
government.
Arrangements
1. LDC-IL will be open to all institutions, research
   organizations, and the corporate sector from all over
   the world.
2. Members will be encouraged to contribute
   databases and share revenues from sale of the
   data they contribute.
3. The databases will be available for R&D purposes
   to all members and non-members on payment of
   the appropriate fee, with a license for use only.
4. The organization will be asked to sign a License
   Agreement that the databases will not be
   distributed by it to others either free or for a fee.
5. The IP and the copyright of any product developed
   as a result of such an R&D activity shall lie with
   the organization that has created the product.
Tasks
• Establishing standards
• Creating language resources
• Annotating language data
• Building systems/helping system
building
• Creating human resources
• Co-ordinating language resource
developing activities
Major Areas
Linguistic Resource Development
• Creation of different kinds of Corpora
including Pathological speech, Historical/
Inscriptional databases
• Natural Language Processing
• Speech Recognition and Synthesis
• Character Recognition
• By-products like word finders, lexicons of
different kinds, thesauri, usage
compilations etc.
Text Corpora - Monolingual / Parallel Corpora (SL)
Sl. No.  Language   1st Year  2nd Year  3rd Year  4th Year  5th Year  Total
1        Assamese   2         2         2         2         2         10
2        Bengali    2         2         2         2         2         10
3        Bodo       0.6       0.6       0.6       0.6       0.6       3
4        Dogri      0.6       0.6       0.6       0.6       0.6       3
5        Gujarati   2         2         2         2         2         10
6        Hindi      2         2         2         2         2         10
7        Kannada    2         2         2         2         2         10
8        Kashmiri   1         1         1         1         1         5
9        Konkani    1         1         1         1         1         5
10       Maithili   1         1         1         1         1         5
11       Malayalam  2         2         2         2         2         10
12       Manipuri   1         1         1         1         1         5
13       Marathi    2         2         2         2         2         10
14       Nepali     2         2         2         2         2         10
15       Oriya      2         2         2         2         2         10
16       Punjabi    2         2         2         2         2         10
17       Sanskrit   0.4       0.4       0.4       0.4       0.4       2
18       Santali    0.6       0.6       0.6       0.6       0.6       3
19       Sindhi     0.6       0.6       0.6       0.6       0.6       3
20       Tamil      2         2         2         2         2         10
21       Telugu     2         2         2         2         2         10
22       Urdu       2         2         2         2         2         10
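The Total column in the table above appears to be five times the yearly target for each language. The sketch below is only an illustrative cross-check of that arithmetic. The slide does not state the unit of the figures, so the numbers are treated as unit-less, and the names TARGETS, check_totals and grand_total are assumptions made for this example.

```python
import math

# Yearly target and stated total per language, copied from the table.
# The unit is not given on the slide, so values are kept unit-less.
TARGETS = {
    "Assamese": (2, 10), "Bengali": (2, 10), "Bodo": (0.6, 3),
    "Dogri": (0.6, 3), "Gujarati": (2, 10), "Hindi": (2, 10),
    "Kannada": (2, 10), "Kashmiri": (1, 5), "Konkani": (1, 5),
    "Maithili": (1, 5), "Malayalam": (2, 10), "Manipuri": (1, 5),
    "Marathi": (2, 10), "Nepali": (2, 10), "Oriya": (2, 10),
    "Punjabi": (2, 10), "Sanskrit": (0.4, 2), "Santali": (0.6, 3),
    "Sindhi": (0.6, 3), "Tamil": (2, 10), "Telugu": (2, 10),
    "Urdu": (2, 10),
}


def check_totals(targets, years=5):
    """Return languages whose stated total differs from years * yearly target."""
    return [
        language
        for language, (per_year, total) in targets.items()
        # isclose avoids false alarms from floating-point rounding (0.6 * 5).
        if not math.isclose(per_year * years, total)
    ]


def grand_total(targets):
    """Sum the stated five-year totals over all languages."""
    return sum(total for _, total in targets.values())


if __name__ == "__main__":
    print("Inconsistent rows:", check_totals(TARGETS))
    print("Grand total:", grand_total(TARGETS))
```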
Tools for Corpora Management & Analysis
Frequency analyzers for character, word, sentence.
KWIC and KWOC retrievers.
Tool for automatic transliteration from Indian
language scripts to Roman and vice versa: Kannada,
Tamil, Telugu, Assamese, Bengali, Manipuri,
Malayalam, Punjabi, Oriya, Gujarati.
Parallel corpora tools for text alignment, including
sentence alignment tool and chunk alignment tool as
well as an interface for aligning corpora.
Tools for
• Morphological analysis
• POS tagging
• Semantic tagging
• Syntactic tree bank
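As an illustration of what a frequency analyzer for characters, words and sentences might compute, here is a minimal sketch. It is not the LDC-IL tool: the function names are hypothetical, tokenization is a plain whitespace split, characters are counted as Unicode code points (so vowel signs in Indic scripts are counted separately from their base consonants), and sentences are split on ".", "!", "?" and the danda.

```python
import re
from collections import Counter

# Sentence terminators: Latin punctuation plus the Devanagari danda (U+0964).
SENTENCE_END = re.compile(r"[.!?\u0964]+")
EDGE_PUNCTUATION = ".,;:!?\u0964\"'()"


def char_frequencies(text):
    """Count every non-whitespace code point in the text."""
    return Counter(ch for ch in text if not ch.isspace())


def word_frequencies(text):
    """Count whitespace-separated words after trimming edge punctuation."""
    words = (token.strip(EDGE_PUNCTUATION) for token in text.split())
    return Counter(word for word in words if word)


def sentence_length_frequencies(text):
    """Count how many sentences have each length, measured in words."""
    sentences = (part.split() for part in SENTENCE_END.split(text))
    return Counter(len(words) for words in sentences if words)


if __name__ == "__main__":
    sample = "the cat sat. the cat ran!"
    print(word_frequencies(sample).most_common(2))
    print(char_frequencies(sample).most_common(3))
    print(sentence_length_frequencies(sample))
```

A production analyzer would need script-aware segmentation (for example grapheme clusters rather than code points), which this sketch deliberately leaves out.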
Computational Grammars for Indian Languages
Task 1: Hierarchical POS Tag set
Task 2: Dictionary - (a) closed class words and
        (b) open class words
Task 3: Morphological analyzer and generator
Task 4: Manual POS annotation and
        development of an automatic tagger
Task 5: Semantic tagging
Task 6: Chunker
Task 7: Tree banking
Task 8: Shallow parser, which will eventually
        turn into a deep parser
Linguistic Research
• Lexical studies
• Semantics
• Pragmatics & Discourse analysis
• Sociolinguistics
• Dialectology & Variation studies
• Stylistics
• Language teaching
• Historical linguistics
• Psycholinguistics
• Social psychology
• Cultural studies
Speech Corpora






• Develop tools that facilitate collection of
high quality speech data.
• Collect data that can be used for building
speech recognition and speech synthesis systems, and for
providing speech-to-speech translation from
one language to another language spoken
in India (including Indian English).
• As in the area of text corpora, the main efforts in
speech corpora are on the engineering side.
Apart from these applications, efforts shall also be made to collect:
  • Child language corpora
  • Pathological speech/language data
  • Speech error data
Applications
• Speech Recognition and Speech
Synthesis
• Speech to Speech translation for a pair
of Indian languages
• Command and control applications
• Multimodal interfaces to the computer
in Indian languages
• E-mail readers over the telephone
• Readers for the visually disadvantaged
• Speech enabled Office Suite etc
Speech Dataset
1. Phonetically Balanced Vocabulary
2. Phonetically Balanced Sentences
3. Connected Text created using
phonetically balanced vocabulary
4. Date Format
5. Command and Control Words
6. Proper Nouns: 500 place and 500 person
names
7. Most Frequent Words: 1000
8. Form and Function Words
9. News domain: news, editorial, essay; each text not less than 500 words
Number of Speakers
• Data will be collected from a minimum of 300 (150
male and 150 female) speakers of each
language. In addition, natural discourse
data from various domains shall also be collected
for Indian languages for research into spoken
language.
• Data for speech synthesis shall be collected
from a limited number of speakers (3 male and 3
female) in a studio environment. They shall
have very good voice quality and be
professional voice givers/media announcers.
Annotation of data:
1. Data to be used for speech recognition
shall be annotated at phoneme, syllable,
word and sentence levels
2. Data to be used for speech synthesis
shall be annotated at phone, phoneme,
syllable, word, and phrase level.
Annotation tools:
Tools will be developed for
semiautomatic annotation of speech
data. These tools will also be useful for
annotating speech synthesis databases.
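One possible way to hold such multi-level annotation in memory is a set of named tiers, each a list of time-aligned segments. The sketch below is an assumption for illustration only, not the format of the LDC-IL annotation tools: the tier names come from the two lists above, while Segment, Annotation and their methods are hypothetical.

```python
from dataclasses import dataclass, field

# Tier names taken from the annotation levels listed on the slide.
RECOGNITION_TIERS = ("phoneme", "syllable", "word", "sentence")
SYNTHESIS_TIERS = ("phone", "phoneme", "syllable", "word", "phrase")


@dataclass
class Segment:
    start: float  # start time in seconds
    end: float    # end time in seconds
    label: str    # transcription for this stretch of audio


@dataclass
class Annotation:
    # Maps a tier name to the list of segments on that tier.
    tiers: dict = field(default_factory=dict)

    def add(self, tier, start, end, label):
        """Append a segment to a tier, rejecting empty or reversed spans."""
        if end <= start:
            raise ValueError("segment end must be after its start")
        self.tiers.setdefault(tier, []).append(Segment(start, end, label))

    def missing_tiers(self, required):
        """Return the required tiers that have no segments yet."""
        return [tier for tier in required if not self.tiers.get(tier)]

    def overlaps(self, tier):
        """Return consecutive segment pairs on a tier that overlap in time."""
        segments = sorted(self.tiers.get(tier, []), key=lambda s: s.start)
        return [(a, b) for a, b in zip(segments, segments[1:]) if b.start < a.end]


if __name__ == "__main__":
    ann = Annotation()
    ann.add("word", 0.00, 0.50, "namaste")
    ann.add("phoneme", 0.00, 0.08, "n")
    print(ann.missing_tiers(RECOGNITION_TIERS))
```

Checks such as missing_tiers and overlaps are the kind of validation a semiautomatic tool could run before a human annotator reviews the file.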
Coverage of languages
I Year        II Year        III Year       IV Year
1. Bengali    7. Manipuri    13. Maithili   19. Sindhi
2. Hindi      8. Malayalam   14. Dogri      20. Oriya
3. Tamil      9. Punjabi     15. Bodo       21. Marathi
4. Telugu     10. Urdu       16. Konkani    22. Khasi
5. Assamese   11. Kannada    17. Santali    23. Tulu
6. Nepali     12. Gujarati   18. Kashmiri   24. Kodava
Indian Sign Language corpora
Northern India: Delhi
Southern India: Mysore
North-eastern India: Shillong
Western India: Ichalkaranji
Eastern India: Kolkata

Data type          1st to 5th year
Lexical items      15000
Sentences          2500
Production data    50
Character Recognition
• Development of standards, tools and
linguistic resources (datasets) for the fields
of Online HWR, Offline HWR and OCR.
• Promotion of development of these
technologies.
• Promotion of development of important
and challenging applications of these
technologies in the context of Indic
languages and scripts.
By-products like lexicon, thesauri,
WordNet etc
• Creation of frequency dictionaries - five per
year
  • First year: Bengali, Hindi, Kannada, Manipuri, Urdu.
  • Second year: Bodo, Dogri, Maithili, Nepali, Konkani.
  • Third year: Assamese, Gujarati, Oriya, Punjabi, Tamil.
  • Fourth year: Kashmiri, Malayalam, Marathi, Sanskrit,
    Santali.
  • Fifth year: other languages
• Multilingual multi directional dictionary - an
ongoing process
• Aiding wordnet creation and collaborating
with others for the same - an ongoing
process
Licensing Policy
Licensing is an important issue for
LDC-IL. The draft policy for licensing
shall be evolved through
discussions within one year. The
same shall be finalized within
another year, by the time the
annotated data is available for
delivery purposes.
Evaluation
The data that LDC-IL creates and
obtains has to be evaluated. For each
kind of data, tool etc., metrics have to
be evolved. Benchmarks, good
standards etc. have to be developed. Within
a one-year time frame, this shall be
accomplished for the first set of tools. In the
following year(s), the same shall be developed
for other data and tools.
Beyond Roadmap
Above all, and in addition to what it
has projected in the roadmap, LDC-IL
will respond positively to the specific
language data needs of individuals,
institutions and industry by taking up
their requests on a priority basis for
licensing purposes. In the beginning, the
derivatives of the databases shall be
licensed; after all the licensing issues
are resolved, the databases themselves shall also be
licensed.
Monolingual Text Corpora
Sl. No.  Language  Word Count
1.       Bengali   50,42,724
2.       Bodo      6,37,801
3.       Dogri     8,24,443
4.       English   21,15,461
5.       Hindi     3,45,85,882
6.       Kannada   71,84,702
7.       Kodava    1,83,322
8.       Konkani   15,69,906
9.       Maithili  83,92,505
10.      Manipuri  16,37,104
11.      Nepali    21,58,324
12.      Tamil     4,67,096
13.      Urdu      22,80,782
14.      Yarava    13,904
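The word counts above use the Indian digit grouping (the last three digits, then groups of two), so 3,45,85,882 is 34,585,882 in Western grouping. A minimal sketch for reading and writing such figures follows. The function names are hypothetical, and the parser deliberately rejects strings that do not follow the grouping, which can help catch damaged digits in extracted tables.

```python
import re

# One to three digits alone, or a leading group of 1-2 digits, any number
# of 2-digit groups, and a final 3-digit group.
INDIAN_GROUPING = re.compile(r"^(\d{1,3}|\d{1,2}(,\d{2})*,\d{3})$")


def parse_indian_number(text):
    """Convert a string such as '3,45,85,882' to an int."""
    cleaned = text.strip()
    if not INDIAN_GROUPING.match(cleaned):
        raise ValueError(f"not in Indian digit grouping: {text!r}")
    return int(cleaned.replace(",", ""))


def format_indian_number(value):
    """Format a non-negative int with Indian digit grouping."""
    if value < 0:
        raise ValueError("only non-negative values are supported")
    digits = str(value)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two-digit groups off the right of the leading part.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


if __name__ == "__main__":
    hindi = parse_indian_number("3,45,85,882")
    print(hindi, format_indian_number(hindi))
```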
Parallel Text Corpora
Sl. No.  Language  Texts  Word Count
1        English   05     1,26,828
         Bengali          93,952
2        English   04     88,025
         Dogri            93,293
3        English   73     17,57,736
         Hindi            17,53,235
4        English   32     7,79,258
         Kannada          4,76,855
5        English   07     1,59,419
         Maithili         1,36,421
6        English   11     2,63,256
         Nepali           2,02,157
Speech Data Set Details
                         Assamese  Bengali  Gujarati  Hindi  Kannada
Phon. Bal. Vocabulary    439       561      689       800    390
Phon. Bal. Sentences     200       200      200       500    150
Connected Texts          6         6        6         6      6
Command & Control Words  250       238      296       250    82
Proper Nouns             841       823      902       824    1018
Most Frequent Words      -         1000     -         1000   1000
Form & Function Words    265       178      232       200    432
News Domain texts        150       150      150       150    150
Speech Data Set Details
                         Maithili  Manipuri  Nepali  Tamil  Urdu
Phon. Bal. Vocabulary    509       374       421     565    775
Phon. Bal. Sentences     208       200       200     228    195
Connected Texts          6         6         6       6      6
Command & Control Words  187       243       74      369    141
Proper Nouns             824       825       834     908    500
Most Frequent Words      1000      1000      1000    1000   1000
Form & Function Words    243       189       190     598    380
News Domain texts        150       150       150     150    150
Other languages to be completed before March 31, 2009:
Malayalam, Punjabi
Speech Corpora
Language   Informants        Duration
           Male    Female    Minutes   Hours
Assamese   81      81        1985      33.05
Bengali    238     234       14850     247.50
Gujarati   77      83        3769      62.49
Hindi      314     316       11483     191.23
Kannada    82      82        2940      49.00
Maithili   82      82        3340      55.40
Manipuri   82      82        2602      43.22
Nepali     48      60        3307      55.07
Tamil      78      71        5127      85.45
Other languages to be completed before March 31, 2009:
Malayalam, Punjabi, Urdu
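The Minutes and Hours figures for each language can be cross-checked with simple arithmetic. Reading them language by language, most Hours values match an "hours.minutes" notation (1985 minutes is 33 hours 05 minutes), while the Bengali and Tamil values match decimal hours instead (14850 / 60 = 247.50, 5127 / 60 = 85.45). The sketch below shows both conversions. The function names and the DURATIONS table layout are assumptions for this example, and the pairing of each Minutes figure with its Hours figure is itself a reading of the slide.

```python
# Minutes and stated Hours per language, as read from the slide.
DURATIONS = {
    "Assamese": (1985, "33.05"), "Bengali": (14850, "247.50"),
    "Gujarati": (3769, "62.49"), "Hindi": (11483, "191.23"),
    "Kannada": (2940, "49.00"), "Maithili": (3340, "55.40"),
    "Manipuri": (2602, "43.22"), "Nepali": (3307, "55.07"),
    "Tamil": (5127, "85.45"),
}


def minutes_to_hours_minutes(total_minutes):
    """Format minutes as 'H.MM', where MM is whole minutes (0-59)."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}.{minutes:02d}"


def minutes_to_decimal_hours(total_minutes):
    """Format minutes as decimal hours with two decimal places."""
    return f"{total_minutes / 60:.2f}"


def notation_used(total_minutes, stated_hours):
    """Report which notation reproduces the stated Hours figure."""
    as_hm = minutes_to_hours_minutes(total_minutes) == stated_hours
    as_decimal = minutes_to_decimal_hours(total_minutes) == stated_hours
    if as_hm and as_decimal:
        return "both"
    if as_hm:
        return "hours.minutes"
    if as_decimal:
        return "decimal hours"
    return "neither"


if __name__ == "__main__":
    for language, (minutes, stated) in DURATIONS.items():
        print(f"{language:10s} {minutes:6d} {stated:>7s} {notation_used(minutes, stated)}")
```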
Frequency Dictionaries:
Most frequent 5000 words
Published
Sl. No.  Language
1.       Bengali
2.       Hindi
3.       Kannada
4.       Manipuri
To be published by March 31, 2009
Sl. No.  Language
1.       Nepali
2.       Urdu
Development of Tools
Corpora management packages developed:
1. Word Frequency Analyser
2. N-Gram (Bi-Gram, Tri-Gram) for word and
character
3. Speech Annotation Manual prepared and
published
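For readers unfamiliar with the second package, an n-gram counter for words and characters can be sketched in a few lines. This is an illustrative sketch only, not the LDC-IL package: the function names are hypothetical, words are split on whitespace, and character n-grams are taken over Unicode code points rather than script-aware units.

```python
from collections import Counter


def ngrams(items, n):
    """Return all contiguous n-item windows of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def word_ngram_counts(text, n):
    """Count word n-grams (n=2 for bi-grams, n=3 for tri-grams)."""
    return Counter(ngrams(text.split(), n))


def char_ngram_counts(text, n):
    """Count character n-grams over code points, spaces included."""
    return Counter(ngrams(list(text), n))


if __name__ == "__main__":
    sample = "to be or not to be"
    print(word_ngram_counts(sample, 2).most_common(1))
    print(char_ngram_counts("banana", 3).most_common(1))
```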
The following packages will be developed:
1. KWIC and KWOC Retriever
2. Tool for Semi Automatic Annotation of
Speech Data.
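A KWIC (key word in context) retriever lists each occurrence of a keyword with a window of words on either side, and a KWOC (key word out of context) listing shows the keyword as a heading beside the full context line. The sketch below illustrates the idea under simple assumptions: whitespace tokenization, case-insensitive matching, and hypothetical function names. It is not the planned LDC-IL package.

```python
def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = text.split()
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def kwoc(text, keyword, width=3):
    """Return (keyword heading, full context line) for each occurrence."""
    return [
        (kw, " ".join(part for part in (left, kw, right) if part))
        for left, kw, right in kwic(text, keyword, width)
    ]


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    for left, kw, right in kwic(sample, "the", width=2):
        # Right-align the left context so the keywords line up in a column.
        print(f"{left:>10s} [{kw}] {right}")
```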