
Digitization Practices in India: Issues and Challenges
V.N. Shukla
C-DAC, NOIDA UNIT
C-DAC MISSION
• Natural Language Processing and Interfaces
• Infrastructure and Support Services
• Human Resource Development in Hi-Tech Areas
AREAS OF COMPETENCE (NOIDA)
• Graphical Display System
• NLP
• E-Governance
• Internet on CATV & E-Commerce
• Solar Energy System
• Security Systems
• Embedded System
• System Engineering and Consultancy
Digital Library Activities : CDAC Noida
•Digital Library Projects
•Mega Centre for Digital Library
•Mobile Digital Library : Dware Dware Gyan Sampada
•Digital Library at President’s House
•Digital Library at Nagari Pracharini Sabha Varanasi
•Digital Library at Uttaranchal
•GyanNidhi : Multilingual Parallel Corpus in Indian Languages
•Digital Library at Gujarat Vidyapeeth, Ahmedabad
•Digitization of Libraries
Digital Library Mission
To organize the information and make it universally
accessible and useful.
Online Content
Offline Content
Billions of web pages
Billions of items still unindexed
DL Initiatives
Only ~15% of books are in print
~85% of books are out of print
and/or out of copyright – these
books are only found in libraries
GOAL: Create a comprehensive virtual card catalog of all
books in all languages, while respecting publishers’ rights
Source: Google
Digital Libraries
Users
Hyperlinks
Index
Metadata
Search
DL creation &
processes
Traditional Libraries
INDEX
A Typical Library Collection
The value is in the middle
~15%: In-Print
~65% or more: Unclear copyright status
• May be in copyright, but not for sale
• Rights may have reverted to author
• May be in the public domain
Less than 20%**: Public Domain
92% of the world's books are neither generating revenue for the
copyright holder nor easily accessible to potential readers.*
*Source: Covey, Denise Troll. "Global Cooperation for Global Access: The Million Book Project"
**OCLC analysis of the Google Books Library Project: http://www.dlib.org/dlib/september05/lavoie/09lavoie.html
DIGITAL LIBRARY DEFINITION

Digital Library (DL) may be seen as
“Collection of intelligent creations by human
beings through their own language and
culture. It also reflects cultural heritage
besides providing archive and generating
many research issues pertaining to Natural
Language Processing”
Digital Library ?
Sun Microsystems defines a digital library as the electronic extension of
functions users typically perform and the resources they access in a
traditional library.
These information resources can be translated into digital form, stored
in multimedia repositories, and made available through Web-based
services.
According to another definition, digital libraries are
“Organizations that provide the resources, including the specialized
staff, to select, structure, offer intellectual access to, interpret,
distribute, preserve the integrity of, and ensure the persistence over
time of collections of digital works so that they are readily available for
use by a defined community or set of communities”.
What is a Digital Library?
• A Service? An Architecture?
• A set of Information Resources?
• A set of tools to locate, search, retrieve information?
• Possibly the tools to create such resources and
services also fall within the purview of DLs
• Digital face of traditional libraries
• Include both digital collections and traditional
• Backbone and nervous system of libraries.
Digital library Vs traditional library
•Efficient & qualitative services by collecting, organizing, storing,
disseminating, retrieving and preserving the information.
•Preservation benefits besides making information retrieval & delivery more
comfortable.
•Online access to historical and cultural documents whose existence is
endangered due to physical decay.
Digital libraries necessarily include a strong focus on the management of
digital content, just as traditional libraries have focused for long on the
management of content in physical forms.
Digital Content Management
Most of the digital content that is being managed includes:
• Human language in various forms: character-coded electronic text, scanned
images, printed or handwritten text, or human speech.
• Language technology helps in managing digital content
• Learning from past experience also helps in managing content
The major areas for great exploitation are:
• Information retrieval,
• multimedia,
• database,
• data mining,
• data warehouse,
• on-line information repositories,
• image processing, hypertext,
• World Wide Web and wide area information services (WAIS).
A few advantages of digital libraries
• Access anywhere
• Reducing delays
• Distributed storage – central access
• Better cataloguing
• Cross references to other documents
• Full text search
• Protected information source
• Wide exploration and exploitation of the information
The information explosion, the wide bandwidth data networks and the potential
of Internet-based technologies - such as the Web - make digital libraries one of
the important application areas of computer science.
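The "Full text search" advantage listed above is usually built on an inverted index, which maps each word to the documents that contain it. The following is only a minimal sketch under stated assumptions: the document ids and sample texts are hypothetical, and the tokenizer is a simple regular expression, not what any particular digital library system uses.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample collection, for illustration only.
documents = {
    "book-001": "Palm leaf manuscripts of ancient India",
    "book-002": "Digital preservation of palm leaf manuscripts",
    "book-003": "Solar energy systems",
}
index = build_index(documents)
matches = search(index, "palm leaf")
```

A real system would add stemming, ranking and proper handling of Indian-language scripts; this sketch only shows the lookup idea.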
Process of Digital Preservation
[Flowchart: Book scanning status check (Yes/No; reject the book) → Scanned image in TIFF format → Software to divide even & odd pages → Batch cropping & cleaning → OCR → Conversion to TXT/RTF/HTML → XML metadata file creation using Dublin Core standard → Uploading → Centralized server]
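The XML metadata file creation step in the process above can be sketched as follows. This is only an illustration, not the tool used at C-DAC: the wrapper element name `metadata`, the function name and all field values are assumptions, while the namespace and the element names (title, creator, language, format, identifier) come from the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def make_dublin_core_record(fields):
    """Build an XML string with one dc: element per (name, value) pair."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values for one scanned book.
record = make_dublin_core_record({
    "title": "Example Book Title",
    "creator": "Example Author",
    "language": "hi",
    "format": "image/tiff",
    "identifier": "book-0001",
})
print(record)
```

One such record per book, stored next to the scanned images, is what lets the centralized server catalogue and search the uploaded content.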
Goals of DL

First-generation digital library
Focused on digitization technology, metadata
schemes, data management techniques, and digital
preservation.

Second-generation digital library
Exploring new opportunities and developing new
competencies.

Third-generation digital library
Focusing instead on fully integrating digital material into the
library’s collections through a modular systems
architecture.
Ingredients for DLs

Hardware
The minimum machinery to do the job

Software
The programs for handling data

Digital Objects
Articles, Conference Papers, Theses, …

Basic Skills
Things one has to learn
Hardware

A Server



You’ll need access to a web server
A good PC
Scanners
Flatbed – Auto feed, Back to back
MF
Book Scanner
Software

Open Source Software (OSS)
DSpace, EPrints, Fedora, GSDL, …

Proprietary software you can’t avoid
Image Editing and Optical Character Recognition Software
have to be purchased
Content is King
The information content is
more important than the
systems used for its storage,
management and retrieval
Objects should not be “locked”
in specific DLs or archives
Creating DLs …

Six steps

Selecting
Acquiring
Digitization
Creation of Metadata
Organizing
Archiving

Providing Access





Possible Delivery Formats




Pure image formats: TIFF, JPEG
Open encoded formats: XML, HTML, ASCII, and
Unicode
Hybrid formats: PDF, DjVu – can contain both image and
text
Proprietary formats: Microsoft Word, WordPerfect
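The four format families above can be told apart mechanically when files are prepared for delivery. The sketch below assumes that a file's extension reliably indicates its format; the extension lists and the function name are illustrative assumptions, not part of the original slides.

```python
from pathlib import Path

# Format families from the list above; the extensions are assumed examples.
FORMAT_FAMILIES = {
    "pure image": {".tif", ".tiff", ".jpg", ".jpeg"},
    "open encoded": {".xml", ".html", ".htm", ".txt"},
    "hybrid": {".pdf", ".djvu"},
    "proprietary": {".doc", ".wpd"},
}


def format_family(filename):
    """Return the delivery format family for a file name, or 'unknown'."""
    suffix = Path(filename).suffix.lower()
    for family, suffixes in FORMAT_FAMILIES.items():
        if suffix in suffixes:
            return family
    return "unknown"


family = format_family("page_0001.TIFF")
```

A check like this can flag proprietary or unknown files before they enter an archive, in line with the point that objects should not be locked into specific systems.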
Digitization: Issues





Copyright
Access copy and archive copy
File size
Storage media (CD, hard disk…)
File format (TIFF, JPEG…)
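The file size issue can be made concrete with a rough estimate of an uncompressed scan: pixel width times pixel height times bits per pixel, divided by 8. The page size (8.5 x 11 inches), resolution (300 dpi) and colour depths below are assumptions chosen for illustration, not figures from the slides, and the function name is hypothetical.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the size in bytes of one uncompressed scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed example: an 8.5 x 11 inch page scanned at 300 dpi.
colour_page = uncompressed_scan_bytes(8.5, 11, 300, 24)   # 24-bit colour
bitonal_page = uncompressed_scan_bytes(8.5, 11, 300, 1)   # 1-bit black and white

print(colour_page)   # 25245000 bytes, roughly 25 MB per page
print(bitonal_page)  # 1051875 bytes, roughly 1 MB per page
```

The gap between the two figures is one reason to keep a high-quality archive copy separately from a smaller, compressed access copy.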
Challenges in Digitization

Building digital collections of national importance from
existing texts, documents, images . . .

Creating new digital documents & linking them

Subject portals: Selecting and maintaining open source
digital resources

Developing / adapting management tools for digital
collections

Providing access to digital collections
Challenges in Digitization (continued)

Integrating digital & other library collections

incl. integration of OPACs, subscribed e-resources and
subject portals


Establishing services for digital libraries

online access & offline support

education & training of users and librarians
Addressing social, legal, policy issues
Challenges in Publishing

Preservation of layout

Searchability of content and metadata

Efficient image compression

Easy browsing of books

Accommodating low bandwidth user

Multilingual text support

Multipaging
Digital Library Support in India
Funding
• Ministry of Communication & Information Technology (MIT)
• Ministry of Human Resource Development (MHRD)
• Manuscript Mission of India
• Department of Scientific & Industrial Research (DSIRTRP)
• All India Council for Technical Education (AICTE)
• University Grants Commission (UGC)
Digital Library Initiatives in India










Library Consortium in India
Scholarly Science Journals
Theses & Dissertations
Institutional E-Print Archives
Books (out of copyright)
Manuscripts
Newspapers
Online Courseware
Open Access at Metadata Level
Portal and Gateway Services
Government of India
• Min. of C&IT: Universal Digital Library
• Min. of Culture: National Manuscript Library
• Others: CSIR E-Journals Consortium, INDEST-AICTE Consortium, UGC Infonet Consortium, FORSA Consortium, IIM Libraries Consortium
Participating centers of DLI
PTU-1
PTU-2
PTU-3
Rashtrapati Bhavan
CDAC Noida
ERNET
IIIT-Allahabad
Digital Library of India
CDAC Kolkata
MIDC
Pune University
IIIT-H
State & City
Central Library
University of Hyderabad
Goa University
IISc
TTD Tirupati
Sringeri Mutt
IISc, IIAP,
ASR Melkote
PoornaPragya
AKCE
Anna University
Kanchi Mutt
SASTRA
Mega Scanning Centres at
IIITH, IIITA,
CDAC-Noida and Kolkata
Digital Library Initiatives in India
Some Examples
Digital Library of India
http://www.dli.ernet.in/
April 20, 2009
Workshop on Institutional Repositories
http://www.ias.ac.in/
http://www.insa.ac.in/
http://medind.nic.in/
Manuscripts

India has the largest collection of manuscripts in the world
(approximately 5 million).

India is the repository of an astounding wealth of ancient knowledge
belonging to different periods of history, going back thousands of
years. Most of this knowledge, belonging to different areas of
intellectual activity such as religion, philosophy, science, arts and
literature, is preserved in the form of manuscripts. Composed in
different Indian languages and scripts, they are preserved on materials
such as birch bark, palm leaf, cloth, wood, stone and paper.

The National Manuscript Mission was launched as a five-year programme in
Feb. 2003 by the Ministry of Human Resource Development, Govt. of
India, to collect and conserve all the manuscripts.
http://namami.nic.in/
Archives of Indian Labour
V.V. Giri National Labour Institute
Heritage of Indian Working Class
Commissions on Labour
Oral History Collections
Trade Union Collections
Regional Collections
Strike Collections
Powered by Greenstone Digital Library
http://www.indialabourarchives.org/
Digital Libraries Benefits : Individual
• Gain access to the holdings of libraries worldwide through
automated catalogs. Locate both physical and digitized
versions of scholarly articles and books.
• Optimize searches: simultaneously search the Internet,
commercial databases, and library collections.
• Save search results and conduct additional processing to
narrow or qualify results.
• From search results, click through to access the digitized
content or locate additional items of interest.
All of these capabilities are available from the desktop or
another Web-enabled device such as a personal digital
assistant or cellular telephone.

Conclusion

Digital Libraries are redefining the role of libraries in
society & the role of librarians & information specialists

National level mechanism is essential to promote and
coordinate open access and public domain digital library
systems





Improve awareness of open access
Regular training – tools, processes, standards
Support setting up of working models, services
National Resource Centre for open access publishing
International agencies like UNESCO, ICSU, ICSTI,
CODATA need to actively promote and support
developing country initiatives
Thank You