Chapter One - Center for the Study of Digital Libraries

Planning a digital library
How to Build a Digital Library
Ian H. Witten and David Bainbridge
Planning a Digital Library
  Responsibilities
  Technology to be used
    Greenstone, DSpace, Fedora, Eprints
  Metadata standard to be used
    Dublin Core, METS, etc.
  Types of access
  Retrospective or Born Digital?
Responsibilities
  Legal Issues
    Distributing information carries responsibilities
    Copyright
  Social Issues
    Respect customs of the community
    Both source and use communities
  Ethical issues
Ideology
  Ideology – a clear conception of what you plan to achieve with the collection of information
  Ideology of a Collection:
    Purpose
    Objectives
    Principles
  These guide what is to be included in the collection
  Placed in the introduction to the digital library
Document versus Work
  Work
    The disembodied content of a message
    Pure information
  Document
    Traditional library: a physical object that embodies the work
    Digital library: a particular electronic encoding of a work
  How are distinctions made between different manifestations of a single work?
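The work/document distinction can be sketched as a small data model. This is only an illustration: the class names, fields, and identifiers below are assumptions made for the example, not part of any particular digital library system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """A particular electronic encoding of a work (for example, one PDF file)."""
    identifier: str  # hypothetical local identifier
    encoding: str    # e.g. "PDF", "HTML", "plain text"


@dataclass
class Work:
    """The disembodied content: pure information, independent of any encoding."""
    title: str
    documents: List[Document] = field(default_factory=list)

    def add_manifestation(self, document: Document) -> None:
        # Each document is one manifestation of the same underlying work.
        self.documents.append(document)


work = Work(title="Example Report")
work.add_manifestation(Document("doc-001", "PDF"))
work.add_manifestation(Document("doc-002", "HTML"))
encodings = [d.encoding for d in work.documents]
print(encodings)
```

Keeping one `Work` record with many `Document` records is one possible answer to the question of how manifestations of a single work are told apart: each manifestation carries its own identifier and encoding.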
Converting an Existing Library
  Digitizing an existing paper-based collection is the most expensive kind of project
  Consider whether it is worth the effort and expense
  16th Century Mexican Library
    Incunabula
    Broadsides
Advantages of Digital Libraries
  Easier to access remotely than conventional libraries
  Powerful search and browsing
  Easier to add additional services
  Easier to organize and reorganize
  Easier to maintain?
  Easier to preserve?
  Does your collection have these advantages?
Questions to Address
  Will the digital library coexist with an existing physical one?
  What is the collection’s growth rate?
  How dynamic is the collection?
  Should you consider outsourcing the whole digital library operation?
  Could user needs be satisfied in alternative ways?
Prioritizing Materials
  Special collections and unique materials
    Rare books and manuscripts
  High-use items
    Research and teaching materials
  Low-use items
Criteria for Digital Conversion
  Intellectual content
    Scholarly value
    Desire to enhance access to information
  Educational value
    Classroom support
    Background reading
    Distance education
  Institutional
    Funding available
    Resource sharing
    Promote strengths of an institution
    Reduce handling of fragile originals
    Cost and space savings
Building a New Collection
  New material
    The copyright holder may be the best one to create a digital collection
  Metadata
    Where will it come from?
Bibliographic Entities
  Documents
  Works
    Distinction between document and work
  Editions
    Electronic documents use terms such as version, release and revision
  Authors
    Authority control – standardized names for authors
  Titles
    Attributes of works
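Authority control for author names can be illustrated with a lookup from name variants to one standardized form. This is a minimal sketch: the authority file below is a hypothetical hand-made table, and real authority files are far larger and handle many cases this normalization ignores.

```python
import re

# Hypothetical authority file: normalized name variants -> one standard form.
AUTHORITY_FILE = {
    "witten ian h": "Witten, Ian H.",
    "ian h witten": "Witten, Ian H.",
    "i h witten": "Witten, Ian H.",
    "bainbridge david": "Bainbridge, David",
    "david bainbridge": "Bainbridge, David",
}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def authorized_form(name: str) -> str:
    """Return the standardized name, or the input unchanged if no variant matches."""
    return AUTHORITY_FILE.get(normalize(name), name)


print(authorized_form("I. H. Witten"))
print(authorized_form("David Bainbridge"))
```

With every variant mapped to one heading, a browse-by-author list shows a single entry per person rather than one per spelling.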
Bibliographic Entities
  Subjects
    Two approaches to assigning subjects automatically:
      Key-phrase extraction
      Key-phrase assignment
    Literary and artistic works
      Style, form, content, genre
    Library of Congress Subject Headings (LCSH)
      Controlled vocabularies: 30,000 pages, 2,000,000 entries
      Hierarchical relationship of broader and narrower topics
  Subject classifications
    Traditional libraries have a linear arrangement
    Digital collection can be rearranged at the click of a mouse
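The difference between key-phrase extraction and key-phrase assignment can be shown with a toy example. Extraction picks phrases that literally occur in the document, while assignment picks headings from a controlled vocabulary, whether or not the heading's wording appears in the text. The stopword list, the frequency heuristic, and the small vocabulary with cue words are all assumptions made for this sketch, not a description of any production method.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "by"}

# Hypothetical controlled vocabulary: heading -> cue words that suggest it.
CONTROLLED_VOCABULARY = {
    "Digital libraries": {"digital", "library", "libraries"},
    "Optical character recognition": {"ocr", "recognition"},
    "Copyright": {"copyright"},
}


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def extract_keyphrases(text, top_n=2):
    """Extraction: return the most frequent two-word phrases found in the text."""
    words = tokenize(text)
    bigrams = [
        f"{a} {b}"
        for a, b in zip(words, words[1:])
        if a not in STOPWORDS and b not in STOPWORDS
    ]
    return [phrase for phrase, _ in Counter(bigrams).most_common(top_n)]


def assign_keyphrases(text):
    """Assignment: return vocabulary headings whose cue words occur in the text."""
    words = set(tokenize(text))
    return [h for h, cues in CONTROLLED_VOCABULARY.items() if words & cues]


sample = (
    "Scanning and OCR turn paper pages into a digital library. "
    "A digital library allows full-text searching."
)
print(extract_keyphrases(sample, top_n=1))
print(assign_keyphrases(sample))
```

Note that the assigned heading "Optical character recognition" never appears verbatim in the sample; that is the practical difference between the two approaches.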
Digitizing Documents
  Digitization
    The process of taking traditional library materials and converting them to electronic form
    Allows storage and manipulation by a computer
    The process is time-consuming and expensive
Stages of Digitization
  Scanning
    Creates a digitized image of each page
    Usually presented to the user
  Optical Character Recognition (OCR)
    Creates an encoded representation of the textual content of the pages
    Necessary for full-text indexing
    Allows searching
Decisions in Scanning
  Black-and-white, grayscale or color
  Resolution
    Number of pixels per linear unit
  Bits per pixel
    Monochrome display: 16 or 256 levels of gray
    Color display: up to 24 or 32 bpp
  Quality
    Higher quality increases storage space and time to access
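The trade-off between scan quality and storage can be made concrete with a little arithmetic. Uncompressed size is the number of pixels times the bits per pixel. The page size (8.5 by 11 inches) and the 300 dpi resolution used below are assumptions chosen for illustration, and real image files are usually compressed, so actual sizes will be smaller.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size of one uncompressed page image in bytes.

    width_in, height_in: page dimensions in inches
    dpi: resolution in pixels per linear inch
    bits_per_pixel: 1 for black-and-white, 8 for 256 gray levels, 24 for color
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed example: a letter-size page scanned at 300 dpi.
for label, bpp in [("black-and-white", 1), ("grayscale", 8), ("color", 24)]:
    size = uncompressed_size_bytes(8.5, 11, 300, bpp)
    print(f"{label:16s} {bpp:2d} bpp  {size / 1_000_000:6.2f} MB")
```

Under these assumptions a color scan needs 24 times the space of a black-and-white scan of the same page, and doubling the resolution quadruples the pixel count.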
Optical Character Recognition
  Manual cleanup is necessary
  Less efficient than manual keying when accuracy drops below 95 percent
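One way to check a sample page against the 95 percent threshold is to compare OCR output with a hand-keyed reference and compute character accuracy. Measuring accuracy as one minus the edit distance divided by the reference length is an assumption of this sketch; the slides do not say how the rate is measured.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(truth: str, ocr_output: str) -> float:
    """Fraction of reference characters recovered, floored at zero."""
    if not truth:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1 - edit_distance(truth, ocr_output) / len(truth))


def ocr_worth_correcting(truth: str, ocr_output: str, threshold: float = 0.95) -> bool:
    """True if accuracy meets the threshold, so cleanup beats rekeying."""
    return character_accuracy(truth, ocr_output) >= threshold


truth = "digital library"
ocr_output = "digita1 1ibrary"  # two letters misread as the digit 1
print(round(character_accuracy(truth, ocr_output), 3))
print(ocr_worth_correcting(truth, ocr_output))
```

In practice the check would be run on a representative sample of pages before committing a whole collection to OCR or to manual keying.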
Interactive OCR
  Optical character recognition should be done as an interactive process
  Acquisition
    Input from scanner or read a file
  Cleanup
    Filtering, deskewing and manual cleanup of unwanted areas
  Page analysis
    Examine layout
  Recognition
    The “OCR” part
  Checking
  Saving
    Plain text, HTML, RTF, PDF, MS Word
Page Handling
  Unbinding
  Microfiche or microfilm
  Two most expensive parts
    Handling the paper
    OCR
Planning a Digitization Project
  Outsourcing
  Cost
    $1 to $2 for scanning and OCR
  Quality control
  Verification
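The cost figure above can be turned into a rough budget range. The slide does not state the unit for the $1 to $2 figure; this sketch assumes it is a per-page rate, and the collection size in the example is invented for illustration.

```python
def digitization_cost_range(pages, low_rate=1.0, high_rate=2.0):
    """Return (low, high) cost estimates in dollars.

    Assumes the $1 to $2 figure is a per-page rate for scanning plus OCR;
    the source does not state the unit explicitly.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * low_rate, pages * high_rate


# Assumed example: 500 books averaging 300 pages each.
total_pages = 500 * 300
low, high = digitization_cost_range(total_pages)
print(f"{total_pages} pages: ${low:,.0f} to ${high:,.0f}")
```

An estimate like this covers only scanning and OCR; quality control and verification would add to it.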