Transcript Document
Towards Universalisation of Creativity
Dr. Om Vikas
Department of Information Technology
Ministry of Communications and Information Technology
Government of India
E-mail: [email protected]
ICDL-2004
Is there a gain in knowledge or a loss of knowledge?
• From an estimated 10,000 world languages in 1900, about 6,700 languages survived in 2000. Two percent of the world's languages are becoming extinct every year.
• There is worldwide, unquantifiable erosion of cultural participation, knowledge and innovation.
• With the loss of a language, we lose art and ideas, scientific information and technological innovation capacity.
• World-level literacy is improving. More people can read than ever before, but fewer people create stories.
• There is a shift from being creators to being consumers at a time when technology could have amplified our creative capacities.
• UNESCO study (1999) of 65 languages: 49 of the languages (75%) had experienced a real decline in the number of works translated from these languages into other languages.
• The proportion for English rose from 43 percent in 1980 to over 57 percent in 1994.
• The share held by the top four translated languages (English, Spanish, French and German) rose from 65 percent in 1980 to 81 percent in 1994.
• According to a UNESCO study of the world's 140 most published authors, 90 out of 140 were English writers in 1994, compared to 64 out of 140 in 1980.
• There is a collapse in authorship, translation and quality in other languages.
Erosion of Language and Culture !!
Is technology to divide or to unite?
• Latin alphabet users, 39% of the global population, enjoy 84% of access to the Internet.
• Hanzi users (CJK), 22% of the global population, enjoy 13% of Internet access.
• Arabic script users, 9% of the population, have 1.2% of Internet access.
• Users of Brahmi-origin scripts in South-east Asia and of Indic scripts, 22% of the world population, have just 0.3% of Internet access.
• More than 80% content on Internet is in English.
• ICT penetration in India and other developing countries is lower.
ICT Indicators
                         Teledensity   Cellphone Density   PC Penetration
Advanced Nations         50-70 %       30-75 %             30-60 %
Developing Nations       20-30 %       4-7 %               0.5-2 %
Underdeveloped Nations: sprawling Digital Divide!
Digital Divide as They Behold
Perception in developed and developing countries (why discussed, policy, results, focus)

Developed Countries
• Consumer nature; desire to capture larger markets
• Information explosion
• Increasing use of English and thrust of western culture
• "Substitute the old" [consumerism-centric]
• IPR-centric
• Focus: digital divide, access to information, wider control

Developing Countries
• Fear of lagging behind in the economic race
• Localization; preservation of local language and culture
• "Upgrade the old"
• Technology development: low-cost PC ($400), open source technology (less than $40)
• Reason: PPP (15:1), GNP (75:1); 34260 (USA) 24260, 2400 (India) 460
• Focus: digital unite, share the knowledge, small is beautiful

Low affordability means low ICT penetration and a sprawling Digital Divide.
e-Content & Universal Access
UNESCO identifies challenges in multilingualism and universal access to information:
• General affordable worldwide access
• Hardware and software, Web and Internet features
• Availability of accessible websites and Internet access devices
• Accessibility of multiple languages
• Development of content in native languages, and its placement on the Internet
• Appropriate design of software for users
Potential use of non-English languages on the Internet will increase drastically by 2010.
[Chart: Internet users (0 to 500 Mn) by language, 2003 vs. 2010: English, Japanese, Chinese, French, Spanish, German, Indian languages]
65% of information on the Internet is in English. Source: IBM's Web Fountain
New Order of Knowledge-based Society:
• Universalization of Creativity
• Rise, Raise & Race
Raise to Rise & Race to Limits
Liberalisation is the advice of advanced nations to the rest for creating a conducive environment for technology acquisition and absorption, and thus for expanding their markets. The mindset needs to change to help underdeveloped nations catch up in technology absorption and participate in knowledge generation. The following is an example of providing a high-tech solution in a low-tech environment.
A group of volunteer engineers in the USA designed and built a rugged, low-cost, bicycle-powered computer and wireless network for the villagers of Phon Kham in Laos, which had no electricity or phone service. There was no way to call relatives living abroad or even in the next town. This is a project to bridge the digital divide.
Innovation follows from stretching our imagination to its limits. As we noticed, the constrained environment of a village in Laos led to the development of a new operating system, a cycle-powered PC, etc.
Heterogeneity of communities opens up new opportunities for innovation and integration skills. Time is a critical factor in the context of ICT. Let all communities the world over catch up to the basic technology absorption capability and use it for improving the quality of life of the people at large.
Digital Knowledge Resources:
• Electronic information is being created in many forms and formats and stored in many repositories.
• Ever-improving Information Technology makes sharing of knowledge resources economical and universally accessible.
World Scenario of Digital Library Initiatives
Digital libraries are a form of information technology in which social impact matters as much as technological advancements.
DLI in USA
Six major projects were launched during 1994-1998 under DLI (Digital Library Initiative) funded by the NSF, DARPA and NASA in the USA.
Digital Libraries Initiative Phase 2 (DLI-2) is an NSF-led initiative that builds on the successes of DLI-1. DLI-2 is supported by many funding agencies, such as NSF, DARPA, the National Library of Medicine, the Library of Congress and the National Endowment for the Humanities. DLI-2 will investigate digital libraries as human-centered systems.
DARPA's Information Management program (www.dapra.mil/ito/research/in) addresses core digital library issues requiring revolutionary research technology:
Federated repositories. The organisation of distributed repositories into a coherent virtual collection is fundamental.
Scalability. Managing billions of digital objects and millions of sources poses challenges in identifying, categorizing, indexing, summarizing and extracting content.
Interoperability. Digital libraries require semantic interoperability among heterogeneous repositories distributed across the network.
Collaboration. Analysts work in distributed teams, building on each other's knowledge, experience and resources.
Communication. Timely dissemination of research results is the focus of D-Lib.
The Illinois D-Lib project (http://dli.grainger.uiuc.edu) takes SGML directly from the publishers' collections, converts it into a canonical format for federated searching and transforms tags into a standard set.
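The tag normalisation step can be sketched roughly as follows. This is an illustration only, not the Illinois project's code: the publisher names, the tag names and the CANONICAL_TAGS table are hypothetical, and XML stands in for SGML because Python's standard library parses XML only.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-publisher tag vocabularies mapped onto one canonical set.
CANONICAL_TAGS = {
    "publisher_a": {"atl": "title", "au": "author", "abs": "abstract"},
    "publisher_b": {"ArticleTitle": "title", "Creator": "author", "Summary": "abstract"},
}

def to_canonical(xml_text, publisher):
    """Parse a record and rename publisher-specific tags to canonical ones."""
    mapping = CANONICAL_TAGS[publisher]
    root = ET.fromstring(xml_text)
    for element in list(root.iter()):
        # Tags without a mapping are left unchanged.
        element.tag = mapping.get(element.tag, element.tag)
    return root

def federated_search(records, field, term):
    """Search one canonical field across records from different publishers."""
    term = term.lower()
    hits = []
    for root in records:
        for element in root.iter(field):
            if term in (element.text or "").lower():
                hits.append(element.text)
    return hits

# Invented sample records from two publishers with different tag sets.
record_a = to_canonical(
    "<rec><atl>Digital Libraries in Asia</atl><au>First Author</au></rec>",
    "publisher_a",
)
record_b = to_canonical(
    "<rec><ArticleTitle>Federated Search</ArticleTitle><Creator>Second Author</Creator></rec>",
    "publisher_b",
)
print(federated_search([record_a, record_b], "title", "digital"))
```

Once every record uses the same tag set, one query on "title" or "author" can run over all collections without knowing which publisher a record came from.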
Federating the search at a semantic level is an area of active research in the digital library community. Statistical approaches lead toward scalable semantics: indexing deeper than text word search that is computable on large real collections.
The Journal Storage project (JSTOR) started at the University of Michigan with a grant from the Andrew W. Mellon Foundation. The JSTOR database totals 450,000 articles and 2.7 million pages, created via a combination of page images and full text at a rate of 100,000 pages. The www.jstor.org URL links to three server machines: two at the University of Michigan and a third at Princeton University. Distributed mirrors offer increased reliability, accessibility, and capacity.
The Informedia Project at Carnegie Mellon University has created a terabyte digital video library in which automatically derived descriptors for the video are used for indexing, segmenting, and accessing the library contents.
Artificial Intelligence techniques have been used to create metadata - the data that describes video content.
Powerful browsing capabilities are essential in a multimedia information retrieval system.
The Carnegie Mellon DLI project searched multimedia, particularly video segments, by generating text indexes using speech understanding. The Stanford DLI project searched across different engines using multiprotocol gateways.
Other even harder issues remain untouched, such as multicultural search across context and meaning.
DLI in Europe
The importance of D-Lib research is spreading beyond the US.
European research in Digital Libraries is funded by the European Union as well as national sources. DL projects have been supported by the Information Engineering (www.echo.lu/ie), Language Engineering (www.echo.lu/langeng/en/lehome.html), and Esprit (www.cordis.lu/esprit) programs in Europe.
Under NSF-EU collaboration, five working groups have been formed in the key technical areas of interoperability, metadata, IPR, resource indexing and discovery, and multilingual information access.
DLI in Asia
Since 1995, D-Lib research has become a national grand challenge in several countries in Asia. Most projects can be classified into the following categories:
• Nationwide D-Lib initiatives and special-purpose digital libraries: for example, the Library 2000 Project in Singapore (to link all library resources) and the Financial Digital Library at the University of Hong Kong (to serve the needs of the HK stock market and users).
• Digital museums and historical document digitization: for example, the Digital Museum Project of the National Taiwan University and the digitization of the art collection of the Palace Museum in Taipei by IBM.
• Local language and multilingual information retrieval: for example, the Net Compass Project of Tsinghua University in China, Chinese Information Retrieval at the Academia Sinica, Taiwan, and New Zealand's multilingual project.
Local language processing and historical cultural content could be the most immediate Asian contribution to the international DL community. An Asia Digital Library consortium is fostering long-term collaboration and projects in DL-related topics in Asia (www.cyberlib.net/adl).
The New Zealand D-Lib (http://www.nzdl.org) currently offers about 20 collections, varying in size from a few documents up to 10 million documents and several gigabytes of text. The documents are written in many different languages, including English, French, German, Arabic, Maori, Portuguese and Swahili.
The D-Lib provides interfaces to the collections in several languages.
To accommodate blind users (with speech synthesizers) and partially sighted users (with large-font displays), the NZ D-Lib provides a text-only version of the interface for each language.
Digital Library of India Initiative
Broad Objectives:
• To digitize and index the heritage knowledge.
• To promote lifelong learning in society (a necessity of the knowledge-based society).
• To promote collaborative creativity and the building of knowledge teams across borders.
• To participate in world initiatives on digital libraries, such as UDL.
[Note that India has multiple languages, multiple scripts, manuscripts in different forms, books using various fonts, a vast tacit knowledge resource of vanishing scholars, and multiple commentaries on a text. This forms a vast treasure of heritage knowledge.]
• Mobile Digital Library – Knowledge at doorsteps: to let users surf, access, print, and take away a book of choice anywhere and anytime
• 20 DL Centers with 106 high-resolution scanners
• 4 Megacenters (to be set up)
Issues pertaining to digitization
Multilingual Issues
• Character sets (UNICODE?)
• Representations
• Multilingual navigation
• Translation assistance
Policy Challenges
• Convenient quality displays
• What to digitize first?
• Use of copyrighted material
• Economics (Who pays? Who gets?)
• Privacy
• Reliability of information
• Authentication of text from multiple versions
• Digital Library Act
Need for Indian Digital Library Act
Issues to tackle may include: compulsory licensing; digital pack book (incentive: 10% tax deduction on lifetime revenue); deemed out of print (donate electronic rights); concept shift in royalty from per copy to per preview; public lending rights (as in Japan); 4Cs (Consortium for Compensation for Creative Content), a formula to respect the content creator and pay compensation (min. Rs. 100/- to max. Rs. 1 lakh); and inclusion of books, music and movies with higher and higher privacy value.
Linguistic Scenario in India
• The eighteen constitutional Indian languages are listed below with their scripts in parentheses: Hindi (Devanagari), Konkani (Devanagari), Marathi (Devanagari), Nepali (Devanagari), Sanskrit (Devanagari), Sindhi (Devanagari/Urdu), Kashmiri (Devanagari/Urdu); Assamese (Assamese), Manipuri (Manipuri), Bangla (Bengali), Oriya (Oriya), Gujarati (Gujarati), Punjabi (Gurumukhi), Telugu (Telugu), Kannada (Kannada), Tamil (Tamil), Malayalam (Malayalam) and Urdu (Urdu). There are 10 Indic scripts in vogue.
• Interestingly, Indian languages owe their origin to Sanskrit, hence they have in common a rich cultural heritage and treasure of knowledge. Indic scripts have originated from the Brahmi script. Fewer than 5 percent of people can read or write English. Over 95 percent of the population is normally deprived of the benefits of English-based Information Technology.
Characteristics of Indian Languages
• What You Speak Is What You Write (WYSIWYW)
• Script grammar: transformation rules
• Relatively free word order
• Common phonetic-based alphabet
• Common concept terms (from Sanskrit)
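One practical consequence of the common phonetic-based alphabet shows up in Unicode: the blocks for the major Indic scripts are 128 code points each and place corresponding letters at the same offset within the block. The Python sketch below uses that layout for a rough script-to-script conversion. The function name convert_script and the sample word are illustrative choices, and the approach is only approximate, since not every position is assigned in every script and script-specific spelling rules are ignored.

```python
import unicodedata

# Start of each script's 128-code-point block in the Unicode Standard.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80

def convert_script(text, source, target):
    """Map each character to the same offset in the target script's block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    converted = []
    for ch in text:
        offset = ord(ch) - src
        if 0 <= offset < BLOCK_SIZE:
            candidate = chr(dst + offset)
            # Keep the original character if the target position is unassigned.
            converted.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            # Characters outside the source block (spaces, digits, Latin) pass through.
            converted.append(ch)
    return "".join(converted)

# The Devanagari word "kamala" rendered in Gujarati script.
print(convert_script("\u0915\u092e\u0932", "devanagari", "gujarati"))
```

A production converter would add per-script exception tables on top of this offset rule.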
Indian Language Technology Map
[Map of CoILTech centres; labels surviving from the slide: IETE – New Delhi, G.G. Univ. Bilaspur]
Major Achievements in ILT
• Information Dissemination
• Localization of LINUX
• Translation Support Systems
• Human Machine Interface Systems
• Standardization
• Knowledge Tools
• Knowledge Resources
Translation Support Systems (MAT)
• English to Hindi (Angla-Bharati), http://anglahindi.iitk.ac.in (very satisfactory; above 85% consistently okay)
• Indian Languages to Hindi (in the process of development)
• Hindi to English (in the process of development)
Human Machine Interface Systems
• Optical Character Recognition (OCR): accuracy above 97% for 7 ILs, viz. Hindi, Marathi, Bangla, Tamil, Telugu, Gurumukhi and Malayalam; OCRs in other ILs are in the process of development
• Text to Speech system (TTS): Hindi, Bangla
• Continuous Speech Recognition (CSR): Hindi
Major Achievements in ILT…
Knowledge Resources
Bilingual dictionaries (over 30,000 words):
• English - Hindi
• English - Telugu - Hindi
• English - Tamil - Hindi
• English - Kannada - Hindi
• English - Bangla - Hindi
• English - Punjabi - Hindi
• English - Oriya - Hindi
• English - Malayalam - Hindi
• English - Sanskrit - Hindi
Parallel Corpora – A one-million-page parallel corpus is under development [600 thousand pages ready]. The development of the parallel corpora is one of the unique achievements of the TDIL programme and is appreciated worldwide.
Major Achievements in ILT…
Standardization
• UNICODE: DIT is a voting member of the Unicode Consortium. Proposed changes to the Unicode Standard were finalized in consultation with the respective State Governments and the Indian IT industry and presented at the Unicode Technical Committee meeting. Some of the proposed changes have been incorporated in Unicode version 4.0.
• INdian Scripts FOnt Code (INSFOC) standards have been developed.
• Indian Script to Romanization Tables (INSROT) are ready.
Knowledge Tools
Morph analyzer, syntactic analyzer, spell checker, messaging system, authoring systems, word processors and code conversion utilities have been developed.
Major Achievements in ILT…
Localization of LINUX systems
INDIX system: Localized INDIX-2 supports 5 ILs, viz. Hindi, Marathi, Gujarati, Tamil and Bangla. LINUX operating system support for other Indian languages is in the process of development.
Information Dissemination: TDIL website http://tdil.mit.gov.in
This website contains information on various TDIL activities and achievements, and provides access to a variety of content and downloads in Hindi and other Indian languages.
– Free downloads: Indian language keyboard drivers and fonts, other tools, corpus, content, conversion utilities, machine-aided translation systems.
– Quarterly language technology flash: Vishwabharat@tdil
Language Technology HRD
• Postgraduate programs in the domains of Computational Linguistics and Knowledge Engineering.
• All Bachelor's and Master's programmes in Computer Science Engineering will also cover multilingual computing.
• School curricula include basics of multilingual computing.
Typical illustration of Indian Language OCRs
[Figure: Hindi OCR input and OCR output]
Efficiency 96.8%, working for font sizes from 12 to 36
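The slide does not say how the 96.8% figure was measured. One common way to score OCR output, shown here purely as an assumption, is character accuracy: one minus the edit distance between the OCR output and the reference text, divided by the reference length. The Python sketch below counts Unicode code points, so a Devanagari vowel sign or virama counts as one character; the sample strings are invented.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        current = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]

def character_accuracy(reference, hypothesis):
    """Return 1 - (edit distance / reference length)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)

# Invented example: one wrong consonant in a short Hindi phrase.
reference = "\u0938\u094d\u0935\u093e\u0917\u0924 \u0939\u0948"
ocr_output = "\u0938\u094d\u0935\u093e\u0917\u0928 \u0939\u0948"
print(f"{character_accuracy(reference, ocr_output):.1%}")
```

Other scoring units, such as whole words or Devanagari syllable clusters, would give different percentages for the same output.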
Gurmukhi to Shahmukhi Transliteration
[Figure: sample text in Gurmukhi and its Shahmukhi transliteration]
Machine Translation (MAT) – English to Hindi http://anglahindi.iitk.ac.in
Illustration of online MAT system
Simple Sentences.
sarala vaakya.
सरल वाक्य.
Welcome to London.
landana men aapakaa svaagata hai.
लन्दन में आपका स्वागत है.
There are some cases which are still pending.
vahaan kuc'ha kesa hain jo abhii bhii nilamibata hain.
वहाँ कुछ केस हैं जो अभी भी निलम्बित हैं.
Machine Translation (MAT) – Hindi to English
Innovating to Innovate
"Researchers always want to go for that last 2% of performance. But it's better to get a sufficient solution out fast and then continue to enhance it."
… Mark Dean, IBM (Source: Harvard Business Review, Aug 2002)
Hence the TDIL Programme emphasizes:
• Collaborative development of language technology, and
• Taking language technology products out to market rapidly for feedback and refinement.
Media Lab Asia: another initiative
• World Computer (Low-cost PC): Rural Operating Systems; Speech Interfaces for Local Dialects; Visual Language; Interfaces for All; Interlingua Web; Multi-Literate Interface; Literacy Learning Through Pictures
• Bits for All (Universal Connectivity): Rural WiFi, DakNet, Digital Gangetic Plain, Off-Line Internet Access, Rural VoIP
• Tomorrow's Tools (Language Interfaces): Mapping for the Masses, Community Access to Sustainable Health (Ca:sh), Building Robots Creating Science (BRICS), Digital Craft Revival, Digital Human Body, Digital Music, InfoSculpture, Suchik, Polysensors, Complex RF Impedance Analyzers, UV-VIS Spectrometer, Power Sensors, Think Cycle
• Digital Village (Consolidation in delivering value to the masses): Sustainable Access in Rural India, Community Connection, Digital Mandi, InfoThela
Trends in Language Technology
• Intelligent Human Computer Interaction
To support more sophisticated and natural input and output that promise knowledge- or agent-based dialogue, in which the interface gracefully handles errors and interruptions and dynamically adapts to the current context.
Typical properties:
Multimodal input - They process potentially ambiguous, imprecise combinations of mixed input such as written text, spoken language, gestures (e.g., mouse, pen, dataglove) and gaze.
Multimodal output - They design coordinated presentations of, e.g., text, speech, graphics, and gestures, which may be presented via conventional displays or animated, life-like agents.
Interaction management - Mixed-initiative interactions that are context dependent, based on system models of the discourse, user, and task.
• Machine Translation
1970s: Narrow domain, rule-based approach.
1980s: Practical MT systems, example-based approach, interlingua and transfer methods.
1990s: Multilingual MT, simultaneous interpretation, example-based revisited, corpus-based and statistics-based approaches.
2000s: MT through NL understanding, language resources.
• Speech Technology Development:
• Speech technology is the field of Interactive Technologies.
• There is an ongoing shift from speech component research to research on integrated speech systems. Together with speech, other modalities (e.g., gesture, lip movements, facial expression, gaze, bodily posture) constitute full natural human-human communication, leading towards multimodal interactive systems.
• 1970s: Speech synthesis systems used rule-based formant systems. (Formants are the resonant frequencies of the vocal tract transfer function.) 1990s: Concatenated speech synthesis systems use small pieces of pre-recorded speech.
• There is a trend towards cross-project collaboration, synergy, critical mass, and deployable and scalable technologies.
Trends in Digital Library Technologies
• Multi-modal Input: Scanning, Smartizing (Value Addition), Content, Multi-lingual, Multi-media
• Standardization: Character Code, Font Code, Semantic Indexing, DOI, XML, SCORM
• Navigation: Browsing, Finding, Searching, Zooming, Reality, Aboutness, Searching Navigation, Translation Assistance
• Architecture: Hyperbolic Tree, Virtual Mathematics, Multilingual Interoperability, Multi-lingual Information Access, Metadata, Resource Indexing & Discovery in Globally Distributed Digital Library
• IPR Issues: 4Cs (Consortium for Compensation for Creative Content)
• Knowledge Generation Capacity:
Focus in 20th century: Capitalistic & monopolistic trend in publication & dissemination.
Focus in 21st century: Universalization of creativity.
Future Knowledge Networks
The Interspace represents the third wave in the ongoing evolution of the Global Information Infrastructure, driven by rapid advances in computing and Information Technology during .
The wave pattern roughly describes four distinct phases of functionality: fundamental research (trough), development of prototype systems (ascent), emergence of commercial systems (crest), and mass propagation (descent).
Scalable Semantics
Future knowledge networks will rely on scalable semantics, on automatically indexing the community collections. The knowledge networks of the Interspace will be connected via switching machines that switch concepts.
Connectivity and training continue to be the principal barriers to integrating the global network of libraries.
Interspace focuses on scalable technologies for semantic indexing that work generally across all subject domains.
We can use concept spaces (collections of abstract concepts generated from concrete objects) to boost searches by interactively suggesting alternative terms. We can use category maps to boost navigation by interactively browsing clusters of related documents. Scalable semantics is used to index the semantics of document contents on large collections. Concept spaces use text documents as the objects and noun phrases as the concepts.
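The concept-space idea can be illustrated with a small Python sketch: count how often noun phrases occur together in the same document, then suggest the most frequent companions of a search term as alternative terms. Several things here are assumptions made for the example: the noun phrases are taken as already extracted, the sample documents are invented, and raw co-occurrence counts stand in for whatever statistical weighting a real system would use.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_concept_space(documents):
    """Count, for each noun phrase, the phrases that share a document with it."""
    cooccurrence = defaultdict(Counter)
    for phrases in documents:
        for a, b in combinations(sorted(set(phrases)), 2):
            cooccurrence[a][b] += 1
            cooccurrence[b][a] += 1
    return cooccurrence

def suggest_terms(concept_space, term, limit=3):
    """Return up to `limit` phrases that co-occur most often with `term`."""
    related = concept_space.get(term, {})
    # Highest count first; ties broken alphabetically so results are stable.
    ranked = sorted(related.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:limit]]

# Each document is represented by its (already extracted) noun phrases.
documents = [
    ["digital library", "metadata", "semantic indexing"],
    ["digital library", "semantic indexing", "machine translation"],
    ["machine translation", "parallel corpus"],
]
space = build_concept_space(documents)
print(suggest_terms(space, "digital library"))
```

Because the counting needs no subject-specific dictionary, the same procedure applies to a collection in any domain, which is the sense in which such indexing is scalable.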
Summing up the Challenges Ahead
• ML Open Source Software
- Shareable software
- Standards database and updating
- Support service & help line
- Consortium approach
- GPL with performance, else Garbage In Garbage Out
• Benchmarking & Standards
- Testing against international standards
- Active participation in evolving standards
• Information Technology Culture
- Awareness: IT clinic, workshops, media
- BIPK (Basic Information Processing Kit) with user-friendly, easy-to-use, affordable, scalable, interoperable and re-usable tools. BIPK may consist of a StarOffice-like processing facility, fonts, KB driver, spell checker, dictionary and conversion utility.
- Entrepreneurship: Gyanaudyog workshops.
.... Challenges Ahead
• Cross-lingual Information Access
- Search engine, Web crawler, on-line machine translation.
• Localization
- Localization of software and content into local languages
- Enlarging share in localization outsourcing ($8 Bn by 2006: IDC)
• International Collaboration in Language Informatics
- Industry-academia cooperation in joint research & technology development projects.
- Exchange of faculty and students
- HRD programs in Knowledge Engineering & Computational Linguistics
• Rise, Raise & Race
- Possess basic language technologies
- Promote collectivistic culture
- Think globally & act locally
- Collaborate for innovation
Digital Library is a means to meet the end: the objective of Universalization of Creativity.
Nothing is so pious as knowledge.
न हि ज्ञानेन सदृशं पवित्रमिह विद्यते।
(Bhagwadgita: 4.38)
शांति: (Shaantih)