
Towards Universalisation of Creativity

Dr. Om Vikas
Department of Information Technology
Ministry of Communications and Information Technology, Government of India
E-mail: [email protected]

ICDL-2004

Is there gain in knowledge or loss of Knowledge?

• From an estimated 10,000 world languages in 1900, about 6,700 languages survived in 2000. Two percent of the world's languages are becoming extinct every year.

• There is worldwide, unquantifiable erosion of cultural participation, knowledge and innovation.

• With the loss of a language, we lose art and ideas, scientific information and technological innovation capacity.

• World-level literacy is improving. More people can read than ever before, but fewer people create stories.

• There is a tendency to shift from being creators to being consumers at a time when technology could have amplified our creative capacities.

• UNESCO study (1999) of 65 languages: 49 of the languages (75%) had experienced a real decline in the number of works translated from these languages into other languages.

• The proportion for English rose from 43 percent in 1980 to over 57 percent in 1994.

• The share held by the top four translated languages (English, Spanish, French and German) rose from 65 percent in 1980 to 81 percent in 1994.

• According to a UNESCO study of the world's 140 most published authors, 90 out of 140 were English writers in 1994, compared to 64 out of 140 in 1980.

• There is a collapse in authorship, translation and quality in other languages.

Erosion of Language and Culture !!


Is the technology to divide or to unite?

• Latin alphabet users, 39% of the global population, enjoy 84% of access to the Internet.
• Hanzi users (CJK), 22% of the global population, enjoy 13% of Internet access.
• Arabic script users, 9% of the population, have 1.2% of Internet access.
• Users of Brahmi-origin scripts in South-east Asia and of Indic scripts, 22% of the world population, have just 0.3% of Internet access.

• More than 80% content on Internet is in English.

• ICT penetration in India and other developing countries is lower.


ICT Indicators            Teledensity    Cellphone Density    PC Penetration
Advanced Nations          50-70%         30-75%               30-60%
Developing Nations        20-30%         4-7%                 0.5-2%
Underdeveloped Nations

Sprawling Digital Divide!

Digital Divide as They Behold

Perception: Developed Countries versus Developing Countries

Why discussed? Policy. Results. Consumer nature.

Developed Countries:
• Desire to capture larger markets
• Information explosion
• Increasing use of English and thrust of western culture
• "Substitute the old" [consumerism-centric]
• IPR-centric

Developing Countries:
• Fear of lagging behind in the economic race
• Localization
• Preservation of local language and culture
• "Upgrade the old"

Technology development: Low cost PC $400; Open source technology; less than $40
Reason: PPP: (15:1), GNP: (75:1); 34260 (USA), 24260; 2400 (India), 460

Focus:
• Digital divide: access to information, wider control
• Digital unite: share the knowledge, small is beautiful

Low affordability means low ICT penetration and a sprawling Digital Divide.

e-Content & Universal Access

UNESCO identifies challenges in multilingualism and universal access to information:

• Generally affordable worldwide access
• Hardware and software, Web and Internet features
• Availability of accessible websites and Internet access devices
• Accessibility of multiple languages
• Development of content in native languages, and its placement on the Internet
• Appropriate design of software for users

Potential use of non-English languages on the Internet will increase drastically by 2010.

[Chart: Internet users (0 to 500 Mn) by language in 2003 and 2010, for English, Japanese, Chinese, French, Spanish, German and Indian languages]

65% of information on the Internet is in English. Source: IBM's Web Fountain

New Order of Knowledge based Society :

• Universalization of Creativity
• Rise, Raise & Race

Raise to Rise & Race to Limits

Liberalisation is the advice that advanced nations give to the rest for creating a conducive environment for technology acquisition and absorption, and thus for expanding their markets. The mindset needs to change to help the underdeveloped nations catch up in technology absorption and participate in knowledge generation. The following is an example of providing a high-tech solution in a low-tech environment.

A group of engineer volunteers in the USA designed and built a rugged, low-cost, bicycle-powered computer and wireless network for the villagers of Phon Kham in Laos, which had no electricity or phone service. There was no way to call relatives living abroad or even in the next town. This is a project to bridge the digital divide.

Innovation follows from stretching our imagination to its limits. As we noticed, the constrained environment of a village in Laos led to the development of a new operating system, a cycle-powered PC, etc.

Heterogeneity of communities opens up new opportunities for innovation and integration skills. Time is a critical factor in the context of ICT. Let all communities the world over catch up to the basic technology absorption capability and use it for improving the quality of life of the people at large.

Digital Knowledge Resources:

• Electronic information is being created in many forms and formats and stored in many repositories.
• Ever-improving Information Technology makes sharing of knowledge resources economical and universally accessible.

World Scenario of Digital Library Initiatives

Digital libraries are a form of information technology in which social impact matters as much as technological advancements.

DLI in USA

Six major projects were launched during 1994-1998 under DLI (Digital Library Initiative) funded by the NSF, DARPA and NASA in the USA.

Digital Libraries Initiative Phase 2 (DLI-2) is an NSF-led initiative that builds on the successes of DLI-1. DLI-2 is supported by many funding agencies, such as NSF, DARPA, the National Library of Medicine, the Library of Congress and the National Endowment for the Humanities.

DLI-2 will investigate digital libraries as human-centered systems.

DARPA's Information Management program (www.dapra.mil/ito/research/in) addresses core digital library issues requiring revolutionary research technology:

• Federated repositories. The organisation of distributed repositories into a coherent virtual collection is fundamental.
• Scalability. Managing billions of digital objects and millions of sources poses challenges in identifying, categorizing, indexing, summarizing and extracting content.
• Interoperability. Digital libraries require semantic interoperability among heterogeneous repositories distributed across the network.
• Collaboration. Analysts work in distributed teams, building on each other's knowledge, experience and resources.
• Communication. Timely dissemination of research results is the focus of D-Lib.

The Illinois D-Lib project (http://dli.grainger.uiuc.edu) takes SGML directly from the publishers' collections, converts it into a canonical format for federated searching and transforms tags into a standard set.
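The idea of transforming publisher tags into a standard set before searching can be sketched roughly as follows. This is only an illustration and not the Illinois project's code: the tag names in `TAG_MAP`, the sample records and the helper names are all hypothetical, and well-formed XML stands in for SGML so that Python's standard parser can be used.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from publisher-specific tag names to one standard set.
TAG_MAP = {
    "atl": "title", "arttitle": "title",
    "au": "author", "aut": "author",
}

def to_canonical(markup):
    """Parse one record and rename every known tag to its standard name."""
    root = ET.fromstring(markup)
    for elem in root.iter():
        elem.tag = TAG_MAP.get(elem.tag, elem.tag)
    return root

def search(records, field, term):
    """Return the records whose standard `field` contains `term`."""
    term = term.lower()
    return [r for r in records
            if any(term in (e.text or "").lower() for e in r.iter(field))]

# Two publishers using different tags for the same kind of content.
publisher_a = "<article><atl>Digital Libraries</atl><au>Smith</au></article>"
publisher_b = "<article><arttitle>Library Federation</arttitle><aut>Jones</aut></article>"

records = [to_canonical(publisher_a), to_canonical(publisher_b)]
hits = search(records, "title", "librar")
print(len(hits))  # both records match once their tags are unified
```

Once the tags agree, a single query on `title` reaches both collections, which is the point of a canonical format in a federated search.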

Federating the search at a semantic level is an area of active research in the digital library community. Statistical approaches lead toward scalable semantics: indexing deeper than text word search that is computable on large real collections.

The Journal Storage project (JSTOR) started at the University of Michigan with a grant from the Andrew W. Mellon Foundation. The JSTOR database totals 450,000 articles and 2.7 million pages, created via a combination of page images and full text at a rate of 100,000 pages. The www.jstor.org URL links to three server machines: two at the University of Michigan and a third at Princeton University.

Distributed mirrors offer increased reliability, accessibility, and capacity.

The Informedia Project at Carnegie Mellon University has created a terabyte digital video library in which automatically derived descriptors for the video are used for indexing, segmenting, and accessing the library contents.

Artificial Intelligence techniques have been used to create metadata - the data that describes video content.

Powerful browsing capabilities are essential in a multimedia information retrieval system.

The Carnegie Mellon DLI project searched multimedia, particularly video segments, by generating text indexes using speech understanding. The Stanford DLI project searched across different engines using multiprotocol gateways.

Other even harder issues remain untouched, such as multicultural search across context and meaning.


DLI in Europe

The importance of D-Lib research is spreading beyond the US.

European research in Digital Libraries is funded by the European Union as well as national sources. DL projects have been supported by the Information Engineering (www.echo.lu/ie), Language Engineering (www.echo.lu/langeng/en/lehome.html), and Esprit (www.cordis.lu/esprit) programs in Europe.

Under NSF-EU collaboration, five working groups have been formed in the key technical areas of interoperability, metadata, IPR, resource indexing and discovery, and multilingual information access.

DLI in Asia

Since 1995, D-Lib research has become a national grand challenge in several countries in Asia. Most projects can be classified into the following categories:

• Nationwide D-Lib initiatives and special-purpose digital libraries: for example, the Library 2000 Project in Singapore (to link all library resources) and the Financial Digital Library at the University of Hong Kong (to serve the needs of the HK stock market and users).
• Digital museums and historical document digitization: for example, the Digital Museum Project of the National Taiwan University and the digitization of the art collection of the Palace Museum in Taipei by IBM.
• Local language and multilingual information retrieval: for example, the Net Compass Project of Tsinghua University in China, Chinese Information Retrieval at the Academia Sinica, Taiwan, and New Zealand's multilingual project.

Local language processing and historical cultural content could be the most immediate Asian contribution to the international DL community. An Asia Digital Library consortium is fostering long-term collaboration and projects in DL-related topics in Asia (www.cyberlib.net/adl).

The New Zealand D-Lib (http://www.nzdl.org) currently offers about 20 collections, varying in size from a few documents up to 10 million documents and several gigabytes of text. The documents are written in many different languages, including English, French, German, Arabic, Maori, Portuguese and Swahili.

The D-Lib provides interfaces to the collections in several languages.

To accommodate blind users (with speech synthesizers) and partially sighted users (with large-font displays), the NZ D-Lib provides a text-only version of the interface for each language.

iv. Digital Library of India Initiative

Broad Objectives:

• To digitize and index the heritage knowledge.
• To promote lifelong learning in society (a necessity of the knowledge-based society).
• To promote collaborative creativity and the building of knowledge teams across borders.
• To participate in world initiatives on digital libraries such as UDL.

[It is to be noted that India has multiple languages, multiple scripts, manuscripts in different forms, books using various fonts, a vast tacit knowledge resource of vanishing scholars, and multiple commentaries on a text. This forms a vast treasure of heritage knowledge.]

• Mobile Digital Library – Knowledge at doorsteps: to let users surf, access, print and take away a book of choice anywhere and anytime
• 20 DL Centers with 106 high-resolution scanners
• 4 Megacenters (to be set up)

Issues pertaining to digitization

Multilingual Issues

• Character Sets (UNICODE?)
• Representations
• Multilingual Navigation
• Translation Assistance

Policy Challenges

• Convenient quality displays
• What to digitize first?
• Use of copyrighted material
• Economics (Who pays? Who gets?)
• Privacy
• Reliability of information
• Authentication of text from multiple versions
• Digital Library Act
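To make the character set and representation issues concrete, the short sketch below looks at how one Devanagari word is held in Unicode. The word chosen ("Hindi") and the variable names are only illustrative; the code points and names come from Python's standard `unicodedata` module.

```python
import unicodedata

# "Hindi" written in Devanagari, spelled with escapes so the source stays ASCII.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

code_points = [f"U+{ord(ch):04X}" for ch in word]
names = [unicodedata.name(ch) for ch in word]

# The same abstract characters take different byte lengths per encoding.
utf8_bytes = word.encode("utf-8")
utf16_bytes = word.encode("utf-16-le")

for cp, name in zip(code_points, names):
    print(cp, name)
print(len(word), "code points,", len(utf8_bytes), "UTF-8 bytes,",
      len(utf16_bytes), "UTF-16 bytes")
```

The character set (which code points exist) and the representation (how they are serialized as bytes) are separate choices, and a digitization project has to fix both.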


Need for Indian Digital Library Act

Issues to tackle may include: compulsory licensing; digital pack book (incentive: 10% tax deduction on lifetime revenue); deemed out of print (donate electronic rights); a concept shift in royalty from per copy to per preview; public lending rights (as in Japan); 4Cs (Consortium for Compensation for Creative Content), a formula to respect the content creator and pay compensation (min. Rs. 100/- to max. Rs. 1 lakh); and inclusion of books, music and movies with higher and higher privacy value.

Linguistic Scenario in India

• The eighteen constitutional Indian languages are listed below with their scripts in parentheses: Hindi (Devanagari), Konkani (Devanagari), Marathi (Devanagari), Nepali (Devanagari), Sanskrit (Devanagari), Sindhi (Devanagari/Urdu), Kashmiri (Devanagari/Urdu), Assamese (Assamese), Manipuri (Manipuri), Bangla (Bengali), Oriya (Oriya), Gujarati (Gujarati), Punjabi (Gurumukhi), Telugu (Telugu), Kannada (Kannada), Tamil (Tamil), Malayalam (Malayalam) and Urdu (Urdu). There are 10 Indic scripts in vogue.

• Interestingly, Indian languages owe their origin to Sanskrit, hence they have in common a rich cultural heritage and treasure of knowledge. Indic scripts have originated from the Brahmi script. Less than 5 percent of people can read or write English. Over 95 percent of the population is normally deprived of the benefits of English-based Information Technology.

Characteristics of Indian Languages

• What You Speak Is What You Write (WYSIWYW)
• Script grammar: transformation rules
• Relatively word order free
• Common phonetic based alphabet
• Common concept terms (from Sanskrit)
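The common phonetic alphabet has a practical consequence in Unicode, where the major Indic script blocks are laid out in parallel, so a letter usually sits at the same offset in each block. The sketch below relies on that layout for a naive script-to-script conversion. It is an illustration only, not a TDIL tool: the function and table names are hypothetical, and because not every position is assigned in every script (Tamil, for example, has fewer letters), real converters need per-script exception rules.

```python
# Start of each 128-code-point Unicode block for some Indic scripts.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80

def convert_script(text, source, target):
    """Shift each character of `source` script to the same offset in `target`."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            out.append(chr(cp - src + dst))
        else:
            out.append(ch)  # spaces, digits, Latin text etc. pass through
    return "".join(out)

# The letters KA, MA, LA in Devanagari, converted to Bengali and Gurmukhi.
kamal = "\u0915\u092e\u0932"
print(convert_script(kamal, "devanagari", "bengali"))
print(convert_script(kamal, "devanagari", "gurmukhi"))
```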

Indian Language Technology Map

[Map labels: CoILTech; IETE – New Delhi; G.G. Univ. Bilaspur]

Major Achievements in ILT

• Information Dissemination
• Localization of LINUX
• Translation Support Systems
• Human Machine Interface systems
• Standardization
• Knowledge Tools
• Knowledge Resources

Translation Support Systems (MAT)

• English to Hindi (Angla-Bharati), http://anglahindi.iitk.ac.in (very satisfactory; above 85% consistently okay)
• Indian Languages to Hindi (in the process of development)
• Hindi to English (in the process of development)

Human Machine Interface Systems

• Optical Character Recognition (OCR): accuracy above 97% for 7 ILs, viz. Hindi, Marathi, Bangla, Tamil, Telugu, Gurumukhi and Malayalam. OCRs in other ILs are in the process of development.
• Text to Speech system (TTS): Hindi, Bangla
• Continuous Speech Recognition (CSR): Hindi

Major Achievements in ILT (continued)

Knowledge Resources

Bilingual dictionaries (over 30,000 words):

• English - Hindi
• English - Telugu - Hindi
• English - Tamil - Hindi
• English - Kannada - Hindi
• English - Bangla - Hindi
• English - Punjabi - Hindi
• English - Oriya - Hindi
• English - Malayalam - Hindi
• English - Sanskrit - Hindi

Parallel Corpora: A one-million-page parallel corpus is under development [600 thousand pages ready]. The development of the parallel corpora is one of the unique achievements of the TDIL programme and is appreciated worldwide.

Major Achievements in ILT (continued)

Standardization

• UNICODE: DIT is a voting member of the Unicode Consortium. Proposed changes to the Unicode Standard were finalized in consultation with the respective State Governments and the Indian IT industry and presented at the Unicode Technical Committee meeting. Some of the proposed changes have been incorporated in Unicode version 4.0.
• INdian Scripts FOnt Code (INSFOC) standards have been developed.
• Indian Script to Romanization Tables (INSROT) are ready.

Knowledge Tools

Morph analyzer, syntactic analyzer, spell checker, messaging system, authoring systems, word processors and code conversion utilities have been developed.

Major Achievements in ILT (continued)

Localization of LINUX systems

INDIX system: The localized INDIX-2 supports 5 ILs, viz. Hindi, Marathi, Gujarati, Tamil and Bangla. A LINUX operating system with support for other Indian languages is in the process of development.

Information Dissemination: TDIL Web-site http://tdil.mit.gov.in

This Web site contains information on various TDIL activities and achievements, and provides access to a variety of content and downloadables in Hindi and other Indian languages.

Free Downloads: Indian language keyboard drivers and fonts, and other tools, corpus, content, conversion utilities and Machine Aided Translation systems.

Quarterly Language Technology Flash: Vishwabharat@tdil

Language Technology HRD

• Post Graduate Programs in the Domains of Computational Linguistics & Knowledge Engineering.

• All the Bachelors and Masters Programmes in Computer Science Engineering will also cover the Multilingual Computing aspect.

• School curricula include basics of multilingual computing.

Typical illustration of Indian Language OCRs

[Figure: Hindi OCR, input and output] Efficiency 96.8%, working for font sizes from 12 to 36.

Gurmukhi to Shahmukhi Transliteration

[Figure: Gurmukhi text and its Shahmukhi transliteration]

Machine Translation (MAT) – English to Hindi http://anglahindi.iitk.ac.in

Illustration of online MAT system

Simple Sentences.
sarala vaakya.
सरल वाक्य.

Welcome to London.
landana men aapakaa svaagata hai.
लन्दन में आपका स्वागत है.

There are some cases which are still pending.
vahaan kuc'ha kesa hain jo abhii bhii nilamibata hain.
वहाँ कुछ केस हैं जो अभी भी निलम्बित हैं.

Machine Translation (MAT) – Hindi to English


Innovating to Innovate

"Researchers always want to go for that last 2% of performance. But it's better to get a sufficient solution out fast and then continue to enhance it."

Mark Dean, IBM (Source: Harvard Business Review, Aug 2002)

Hence the TDIL Program emphasizes:

• Collaborative development of language technology.
• Taking language technology products out to market rapidly for feedback and refinement.

Media Lab Asia : another initiative

• World Computer (Low-cost PC): Rural Operating Systems; Speech Interfaces for Local Dialects; Visual Language; Interfaces for All; Interlingua Web; Multi-Literate Interface; Literacy Learning Through Pictures

• Bits for All (Universal Connectivity): Rural WiFi, DakNet, Digital Gangetic Plain, Off-Line Internet Access, Rural VoIP

• Tomorrow's Tools (Language Interfaces): Mapping for the Masses, Community Access to Sustainable Health (Ca:sh), Building Robots Creating Science (BRICS), Digital Craft Revival, Digital Human Body, Digital Music, InfoSculpture, Suchik, Polysensors, Complex RF Impedance Analyzers, UV-VIS Spectrometer, Power Sensors, Think Cycle

• Digital Village (Consolidation in delivering value to the masses): Sustainable Access in Rural India, Community Connection, Digital Mandi, InfoThela

Trends in Language Technology

Intelligent Human Computer Interaction

The aim is to support more sophisticated and natural input and output, promising knowledge-based or agent-based dialogue in which the interface gracefully handles errors and interruptions and dynamically adapts to the current context.

Typical properties:

• Multimodal input - They process potentially ambiguous, imprecise combinations of mixed input such as written text, spoken language, gestures (e.g., mouse, pen, dataglove) and gaze.
• Multimodal output - They design coordinated presentations of, e.g., text, speech, graphics, and gestures, which may be presented via conventional displays or animated, life-like agents.
• Interaction management - They support mixed-initiative interactions that are context dependent, based on system models of the discourse, user, and task.

Machine Translation

• 1970s: Narrow domain, rule-based approach.
• 1980s: Practical MT systems, example-based approach, Interlingua and Transfer methods.
• 1990s: Multilingual MT, simultaneous interpretation, example-based approach revisited, corpus-based and statistics-based approaches.
• 2000s: MT through NL understanding, language resources.

Speech Technology Development:

• Speech technology is a field of Interactive Technologies.
• There is an ongoing shift from research on speech components to research on integrated Speech Systems. Together with speech, there are other modalities that make up full natural human-human communication (e.g., gesture, lip movements, facial expression, gaze, bodily posture), leading towards multimodal interactive systems.
• 1970s: Speech synthesis systems used rule-based formant systems. (Formants are the resonant frequencies of the vocal tract transfer function.) 1990s: Concatenated speech synthesis systems use small pieces of pre-recorded speech.
• There is a trend towards cross-project collaboration, synergy, critical mass, and deployable and scalable technologies.

Trends in Digital Library Technologies

• Multi-modal Input: Scanning, Smartizing (value addition), Content, Multi-lingual, Multi-media
• Standardization: Character Code, Font Code, Semantic Indexing, DOI, XML, SCORM
• Navigation: Browsing, Finding, Searching, Zooming, Reality, Aboutness, Searching Navigation, Translation Assistance
• Architecture: Hyperbolic Tree, Virtual Mathematics, Multilingual Interoperability, Multi-lingual Information Access, Metadata, Resource Indexing & Discovery in Globally Distributed Digital Library
• IPR Issues: 4Cs (Consortium for Compensation for Creative Content)
• Knowledge Generation Capacity

Focus in the 20th century: capitalistic and monopolistic trend in publication and dissemination.

Focus in the 21st century: Universalization of Creativity.

Future Knowledge Networks

The Interspace represents the third wave in the ongoing evolution of the Global Information Infrastructure, driven by rapid advances in computing and Information Technology during .

The wave pattern roughly describes four distinct phases of functionality: fundamental research (trough), development of prototype systems (ascent), emergence of commercial systems (crest), and mass propagation (descent).

Scalable Semantics

Future knowledge networks will rely on scalable semantics, that is, on automatically indexing the community collections. The knowledge networks of the Interspace will be connected via switching machines that switch concepts.

Connectivity and training continue to be the principal barriers to integrating the global network of libraries.

Interspace focuses on scalable technologies for semantic indexing that work generally across all subject domains.

We can use concept spaces (collections of abstract concepts generated from concrete objects) to boost searches by interactively suggesting alternative terms. We can use category maps to boost navigation by interactively browsing clusters of related documents. Scalable semantics is used to index the semantics of document contents on large collections. Concept spaces use text documents as the objects and noun phrases as the concepts.
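A concept space of this kind can be approximated, in a very reduced form, by counting which noun phrases occur together in the same documents and suggesting the most frequent companions of a search term. The sketch below is an assumption-laden illustration rather than the Interspace method: the noun phrases are taken as already extracted, the sample documents are invented, and plain co-occurrence counts stand in for the statistical weighting a real system would use.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_concept_space(docs):
    """docs: one list of (already extracted) noun phrases per document."""
    cooc = defaultdict(Counter)
    for phrases in docs:
        for a, b in combinations(sorted(set(phrases)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc

def suggest(cooc, term, k=3):
    """Suggest up to k alternative terms, most frequent co-occurrence first."""
    related = cooc.get(term, Counter())
    ranked = sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))
    return [phrase for phrase, _ in ranked[:k]]

# Invented sample collection: noun phrases per document.
docs = [
    ["digital library", "metadata", "semantic indexing"],
    ["digital library", "metadata", "interoperability"],
    ["digital library", "semantic indexing"],
    ["machine translation", "parallel corpus"],
]

space = build_concept_space(docs)
print(suggest(space, "digital library", k=2))
print(suggest(space, "parallel corpus"))
```

Because the counts are computed from the documents alone, the same procedure applies to any subject domain, which is what makes this style of indexing scalable.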


Summing up the Challenges Ahead

• ML Open Source Software
- Shareable software
- Standards database and updating
- Support service and help line
- Consortium approach
- GPL with performance, else Garbage In Garbage Out

• Benchmarking & Standards
- Testing against international standards
- Active participation in evolving standards

• Information Technology Culture
- Awareness: IT Clinic, workshops, media
- BIPK (Basic Information Processing Kit) with user-friendly, easy-to-use, affordable, scalable, interoperable and re-usable tools. BIPK may consist of a StarOffice-like processing facility, fonts, KB driver, spell checker, dictionary and conversion utility.
- Entrepreneurship: Gyanaudyog workshops.

Challenges Ahead (continued)

• Cross-lingual Information Access
- Search engine, Web crawler, on-line machine translation.

• Localization
- Localization of software and content into local languages
- Enlarging share in localization outsourcing ($8 Bn by 2006: IDC)

• International Collaboration in Language Informatics
- Industry-academia cooperation in joint research and technology development projects.
- Exchange of faculty and students
- HRD programs in Knowledge Engineering and Computational Linguistics

• Rise, Raise & Race
- Possess basic language technologies
- Promote collectivistic culture
- Think globally and act locally
- Collaborate for innovation

Digital Library is a means to an end: the objective of Universalization of Creativity.

Nothing is so pious as knowledge.

न हि ज्ञानेन सदृशं पवित्रमिह विद्यते।

(Bhagwadgita: 4.38)

शांति: (Shaantih)