A new generation of Indonesian dictionaries


AFNLP 2008 Meeting

Indonesia Country Report

Hammam Riza [email protected]

Agency for the Assessment and Application of Technology (BPPT), Ministry of Research and Technology, Republic of Indonesia

TOC

Past Activities

Activities in 2007

Activities Plan 2008, 2009

National Language Year 2008

Past NLP Research Projects in Indonesia

• Indonesian Text-To-Speech (BPPT, ITB, UI)
• GDA/MMA/Linguistic-DS MPEG-7 (Multimedia Annotation)
• Cross-Linguistic Portal (dictionaries, corpus, tools)
• Web translator (WebTRans)
• Standard Indonesian Language Corpus (SILC)
• Indonesian Language Dictionaries Project (KBBI)
• English-Indonesia Parallel Corpus (INCI)
• Speech recognition/synthesis system (Bandung Institute of Technology / Telkom RDC / University of Indonesia)
• Information retrieval (ITB and University of Indonesia)
• Text/Image processing tools (Gadjah Mada University)
• Computational lexicon (National Language Center)
• Computational morphology (Atmajaya University)

Promotion of Language Technologies (2007)

• National Language Congress XII in Solo, introducing a toolkit to build speech databases for endangered languages, and the Atmajaya Language Workshop (June 2007) in Jakarta, on promoting local computing policy and speech technologies (both keynote speeches by Dr. Hammam Riza)
• Promotion of the Context Sensitive Dictionary Project for the Speech Translation Corpus for the Aceh Tsunami Region (Indonesian-Acehnese, bidirectional)

Activities in Machine Translation (2006-2007)

• Rule-based Indonesian-English translator (started in 2006), launched to the market in June 2007 by ITB
  – This translator is combined with English TTS (Windows) and Indonesian TTS (proprietary)
• Experiment in Statistical MT using the Pharaoh decoder (Eng-Indo parallel corpus) by

Current Activities in Speech Tech

• Telkom RDC & BPPT collaboration on Speech Recognition and Summarization
• Indonesia Goes Open Source (IGOS) speech recognition system (funded by Ministry of Research and Technology)
• Speech recognition system for Bahasa Indonesia (University of Indonesia)
  – Transcribing speech data that contains broadcast TV and radio news
  – Applications:
    • sending short message service (SMS)
    • IVR (health and tourism services)
• Research for “intonation by example” and “automatic prosody pattern extractor” using Artificial Neural Network (ANN)
• Text to Speech system for local languages (ITB/UI)

100th Year of Bahasa Indonesia – National Language Year 2008

• Series of events culminating in the International Conference on Bahasa Indonesia (Oct 2008)
• Importance of Indonesian: its roles and functions in national life & development (policy making, business, media, education)
• Language planning (shaping change)
• 6 keynote speakers from AFNLP will be invited by the Indonesian government throughout the year

Major Activities for 2008

• Local Language Resource Projects (Language Center)
• Indonesian and Local Languages - Wordnet
• MALINDO (Malaysia-Indonesia) joint projects
• Speech to speech translation for Asian languages (A-STAR)
• Speech database Telkom RDC/BPPT (APT support)
• Language Resources and Translation English-Indonesia (collaboration with PAN Localization)
• Speech Corpus for Local Languages (Endangered Languages) – using BLARK (ELDA)

Activities Plan for 2008-2009

• Speech Recognition and Phrase-based Statistical Machine Translation (SMT) system for bidirectional Indonesian-English and Indonesian-Japanese
• Mapping and SMT for Indonesian-Regional Languages (Bahasa Nusantara) and for German, French, Chinese and Arabic (cross-border languages)
• Information Retrieval (cross-language speech retrieval)
  – Searching and retrieving Indonesian speech data
• Topic Detection and Tracking (TDT)
  – Identifying topics in speech data collection
  – Classifying new data to the existing topics in the collection
• Speech Synthesis
• Speech Summarization
  – Summarize the Indonesian speech documents

E-dictionary project National Language Center

• Size & Comprehensiveness:
  – 200,000 entries
  – many subject areas are covered
• Method:
  – corpus-based, primary data for largest print dictionary
• Usefulness:
  – find the words you need
  – definitions and examples are helpful
• Users:
  – writers, journalists, editors, scientists, academics, teachers, students, business people, lawyers, etc.

Kamus Besar Bahasa Indonesia (KBBI), 3rd ed.

Echols & Shadily’s Eng-Ind. dictionary.

Indonesia has at least 13 major local languages with one million or more speakers each:

• Javanese (75,200,000)
• Sundanese (27,000,000)
• Malay (20,000,000)
• Madurese (13,694,000)
• Minangkabau (6,500,000)
• Batak (5,150,000)
• Buginese (4,000,000)
• Balinese (3,800,000)
• Acehnese (3,000,000)
• Sasak (2,100,000)
• Makassarese (1,600,000)
• Lampung (1,500,000)
• Rejang (1,000,000)

ACEH – 32 local languages

EAST JAVA – 6 local languages

LOCAL & CROSS-BORDER LANGUAGES

[Chart: South East Asia, by country (bn, id, kh, la, my, mm, ph, sg, th, tp, vn); vertical axis 0% to 100%; series: % Local Languages, % English, % Other Cross-Border Languages]

Note: Cross-Border Languages in Indonesia: English, Arabic, Chinese, French, German, Dutch, Japanese, etc.

Language Digital Divide Language Preservation

• Survey of indigenous local languages
• Local computing policy will be developed for major local languages
• Endangered languages are identified and preserved by means of ICT
• Language resources collection for official and major local languages

Thank You

Please send any comments to [email protected]