4. India_Country Report_OCOCOSDA2010

Recent Developments in Speech Corpora in Indian Languages
Country Report - India
O-COCOSDA 2010
Kathmandu, Nepal, Nov. 25, 2010
S S Agrawal
Advisor, C-DAC, Noida & Executive Director, KIIT, Gurgaon, India
[email protected], [email protected]
Speech Corpora Development
Continuing activities

CDAC-NOIDA A-STAR/U-STAR Project

250 Hindi sentences, 2 utterances, 2 channels, 70 speakers, 4 age groups, S/N 50 dB, 16 bit, 44.1 kHz. Used for development of Hindi ASR.

1250 phonetically rich Hindi sentences by one male speaker, phonetically labeled using the HTK toolkit (manually corrected), for developing Hindi TTS.

50 hrs of annotated speech corpora for six Indian languages - TDIL (DIT/MCIT).

Hindi, Marathi, Punjabi by CDAC, Noida
Bengali, Assamese & Manipuri by CDAC, Kolkata
Tamil, Telugu, Malayalam and Kannada by CDAC, Thiru

Phonetically rich database from a large number of multilingual speakers in 3 languages: Hindi, Marathi and Indian-accented English. 50k phonetically rich Hindi sentences by TIFR.

Multi-channel, multilingual database of 100 speakers in contemporary/non-contemporary situations - CFSL, Chandigarh

Database for dialectal and emotional variations in Hindi - KIIT, AMU, IPU

Speech database for Bodo (Assamese) language, 1000 phonetically rich sentences - IIT-G
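
As a rough sizing aid for the first corpus on this slide (250 Hindi sentences, 2 utterances, 2 channels, 70 speakers, 16 bit, 44.1 kHz), the sketch below computes the number of recordings and the uncompressed PCM data rate. It assumes that every speaker read every sentence twice and that both channels were sampled at the same rate; the slide states neither point, and the function names are illustrative only.

```python
def utterance_count(sentences: int, repetitions: int, speakers: int) -> int:
    """Recordings per channel, assuming each speaker reads every sentence
    `repetitions` times (an assumption, not stated in the report)."""
    return sentences * repetitions * speakers


def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Data rate of uncompressed linear PCM audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


if __name__ == "__main__":
    print("utterances per channel:", utterance_count(250, 2, 70))
    print("bytes per second:", pcm_bytes_per_second(44_100, 16, 2))
```

Multiplying the data rate by the recorded duration gives the storage needed; the report does not give durations for this corpus, so none is assumed here.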
CIIL-LDCIL
Size of Speech Corpora

(-: value not given in the source)

Language                    Speakers  Female (hours)  Male (hours)  Total (hours)
Assamese                    306       34.5            44.5          79
Bengali                     469       60              62            122
English (Bengali accent)    53        16.5            12            28.5
English (Kannada accent)    54        -               -             -
English (Malayalam accent)  10        8               9             17
English (Tamil accent)      10        8.5             8.5           17
Gujarati                    235       33.5            27.5          61
Hindi                       433       51.5            51.5          103
Kannada                     492       68              65            133
Konkani                     107       15.5            15.5          31
Maithili                    150       -               -             -
Malayalam                   160       27.5            25.5          53
Manipuri                    162       -               -             -
Marathi                     150       -               -             -
Nepali                      196       26.5            30            56.5
Oriya                       162       18              19.5          37.5
Punjabi                     156       14.5            15            29.5
Tamil                       156       29.5            36            65.5
Telugu                      150       -               -             -
Urdu                        169       20              20            40
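
The hour figures in the table above can be cross-checked mechanically: for each language, female and male hours should add up to the stated total. The following is a minimal sketch of that check, with the values copied from the table; languages whose hour figures are missing in the source are left out, and the helper names are illustrative only.

```python
# (language, female_hours, male_hours, total_hours), copied from the table above.
# Rows without hour figures in the source are omitted.
ROWS = [
    ("Assamese", 34.5, 44.5, 79.0),
    ("Bengali", 60.0, 62.0, 122.0),
    ("English (Bengali accent)", 16.5, 12.0, 28.5),
    ("English (Malayalam accent)", 8.0, 9.0, 17.0),
    ("English (Tamil accent)", 8.5, 8.5, 17.0),
    ("Gujarati", 33.5, 27.5, 61.0),
    ("Hindi", 51.5, 51.5, 103.0),
    ("Kannada", 68.0, 65.0, 133.0),
    ("Konkani", 15.5, 15.5, 31.0),
    ("Malayalam", 27.5, 25.5, 53.0),
    ("Nepali", 26.5, 30.0, 56.5),
    ("Oriya", 18.0, 19.5, 37.5),
    ("Punjabi", 14.5, 15.0, 29.5),
    ("Tamil", 29.5, 36.0, 65.5),
    ("Urdu", 20.0, 20.0, 40.0),
]


def inconsistent_rows(rows):
    """Return the languages where female + male hours differ from the total."""
    return [name for name, female, male, total in rows
            if abs(female + male - total) > 1e-9]


def total_hours(rows):
    """Sum of the stated totals over the rows that report hours."""
    return sum(total for _, _, _, total in rows)


if __name__ == "__main__":
    print("inconsistent rows:", inconsistent_rows(ROWS))
    print("hours over rows with data:", total_hours(ROWS))
```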
KIIT - Mobile Text & Speech Database Collection in Hindi and Indian Spoken English
(Contracted by Nokia Research Center, China)

Collection of 2 million words (approx. 200,000 mobile messages in each of Hindi and Indian Spoken English).
Cleaning and expansion of raw data with reference to grammar rules and context.
Creation of 13 prompt sheets containing 630 phonetically rich sentences in each language, based on the text data collected.
Recording of the prompt-sheet sentences by 100 speakers through three channels simultaneously.
Audio annotation of the recorded sentences.

Database for Emotional Speech - AMU, IPU, KIIT

Recognition of Emotional Speech using ANN and Human Perception
Emotions: Happiness, Anger, Sadness, Fear, Neutral
600 sentences: 6 students of a drama club, 5 sentences, 4 times each, in 5 emotional conditions.

Database for Transformation of Emotions (Hindi Speech)
Emotions: Happiness, Sadness, Anger, Surprise, Neutral
1500 sentences: 15 speakers, 20 sentences, 5 emotions.
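
The sizes quoted for the two emotional-speech databases follow from multiplying the design factors listed with them. A minimal sketch of that arithmetic is given below; it assumes a fully crossed design (every speaker records every sentence in every condition), which the slide implies but does not state, and the function name is illustrative.

```python
from math import prod


def corpus_size(*factors: int) -> int:
    """Number of recordings in a fully crossed design: the product of all factors."""
    return prod(factors)


# Recognition database: 6 speakers x 5 sentences x 4 repetitions x 5 emotions
recognition_db = corpus_size(6, 5, 4, 5)

# Transformation database: 15 speakers x 20 sentences x 5 emotions
transformation_db = corpus_size(15, 20, 5)

if __name__ == "__main__":
    print(recognition_db, transformation_db)
```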
Speech Recognition and databases
Continuing ASR Systems:


LVSR and language models for Tamil and Telugu speech recognition - Anna University.

Telephone speech recognition system for Hindi - IBM (I) Research Lab.

Speech-to-text system for Hindi, Shrutlekhan (prototype) - CDAC Pune

Indian language speech recognition systems - H.P. Labs India.

Manner-based, lexically driven Bengali speech recognition system - CDAC Kolkata
Consortium Project on Speech Recognition

Speech-based access for agricultural commodities

Wired or wireless communication-based enquiry: telephone or mobile phones

Six languages in first phase: Hindi - IIT Kanpur; Assamese - IIT Guwahati; Bengali - CDAC Kolkata; Marathi - TIFR + IIT Mumbai; Telugu - IIIT Hyderabad; Tamil - IIT Madras

A database of sentences from 3000 farmers in each language.
Text to Speech Synthesis - Consortium Project

Collection of Text & Speech Corpora - Festival based

Language    Lexicon   No. of syllables  Speech (hrs)  Institute
Hindi       0.1 M     4300              6             IIT-M
Tamil       0.45 M    4200              6             IIT-M
Telugu      0.1 M     3000              45            IIIT-H
Marathi     0.11 M    9200              10            CDAC-Mum
Malayalam   0.05 M    6561              6             CDAC-TRV
Bengali     0.023 M   7400              10            IIT-Kh
Other projects:

Vaachak: Hindi, work ongoing for Indian English, to be followed by other Indian languages - SAPI compliant - Prologix Software

Hindi Vani: TTS for Hindi based on Klatt's formant synthesizer - version - CEERI

Bangla Vani: Concatenative Bangla and Nepali (ESNOLA-based) TTS - developing using E/rock - CDAC Kolkata

Subhasini: TTS for Malayalam, based on diphone concatenation - supports ISCII, ISFOC & UNICODE - CDAC, Thiruvananthapuram
