Chronicling America and the
National Digital Newspaper
Program:
Technical Aspects
Part 1: Newspapers and Microfilm
Challenges
USNP
Part 2: Technical Details
Image views
Text searching
Indexing
Part 3: Managing a newspaper digitization
project
PIALA 2010
UH Manoa Hamilton Library
Challenges
Newspapers are a difficult medium
Never meant to last, made for daily use
and disposal
Pages crumble and acid corrodes the
materials
Tracking serial publications over time
Patron demand increased, storage space
grew scarce, binding costs rose
Microfilm
Adopted in the 1920s as a standard
Turns newspaper from a storage
nightmare to a relatively easy medium
to handle
Libraries had to decide what to do
with the hardcopy
Keep in holdings?
Deaccession?
United States Newspaper
Program (USNP) Began in
1982
Funded by National Endowment for the
Humanities, managed by the Library of
Congress
University of Hawai’i with Hawaiian Historical
Society, Hawai’i State Archives and State
Library contributed for Hawai’i
As of the mid-2000s, the USNP had received over $54
million in NEH support & non-federal
contributions of approx. $19.6 million
Bibliographic records for over 140,000
newspaper titles; access to 70 million pages of
newsprint in microfilm
USNP
Goal: Locate, catalog, and microfilm
newspapers
Hawai’i microfilmed 260,000 pages
and cataloged 476 titles
Program ended in 2007
USNP Preservation
Microfilming
Guidelines
Optimum legibility
Image orientation & reduction ratios to fill frame
& obtain greatest degree of legibility in public
use copies
Quality
Each roll of first generation film shall be inspected
frame-by-frame by both the filming agency and
the project for density and resolution and to
determine that the film is free of emulsion
scratches, abrasions, fingerprints, spots, fog,
and other defects
http://www.loc.gov/preserv/usnpguidelines.html
USNP Preservation
Microfilming Guidelines
Density
• No less than five readings at start, middle & end of
each reel with a transmission densitometer
calibrated daily
• Maximum (Dmax) density measurements taken on
exposed image with no words or graphics
• Background densities no lower than .80 & no
higher than 1.20; lower densities preferred for
older pages & to facilitate production of reader-printer & enlargement prints.
• Base-plus-fog density (Dmin) on the master
negative shall not exceed .10
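The density limits above can be expressed as a small check. This is only an illustrative sketch: the function name and input shape are assumptions, and the five-reading minimum is interpreted here as applying to the reel as a whole, which is one possible reading of the guideline. The thresholds themselves come from the slide.

```python
from typing import List

BACKGROUND_MIN = 0.80   # lowest acceptable background density
BACKGROUND_MAX = 1.20   # highest acceptable background density
DMIN_MAX = 0.10         # base-plus-fog ceiling for the master negative
MIN_READINGS = 5        # assumed here to apply per reel


def check_reel_density(background: List[float], dmin: float) -> List[str]:
    """Return a list of problems; an empty list means the reel passes."""
    problems = []
    if len(background) < MIN_READINGS:
        problems.append(f"only {len(background)} readings, need {MIN_READINGS}")
    for value in background:
        if not BACKGROUND_MIN <= value <= BACKGROUND_MAX:
            problems.append(f"background density {value:.2f} out of range")
    if dmin > DMIN_MAX:
        problems.append(f"Dmin {dmin:.2f} exceeds {DMIN_MAX:.2f}")
    return problems


print(check_reel_density([0.85, 0.90, 1.00, 1.10, 1.20], dmin=0.08))
```

Returning a list of problems rather than a single pass/fail flag makes it easier to record the reason a reel was rejected.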
National Endowment for
the Humanities and Library
of Congress created NDNP
No single US collection of newspapers
Every institution focusing on particular
themes relating to their collecting plans
Thousands of volumes of newspapers
spread across the country
Enhance access to newspapers, building
on the foundation of the United States
Newspaper Program
NDNP Overview
2-Year awards to state projects,
renewable
Digitize 100,000 pages of microfilmed
newspaper
Newspapers picked must have been published
between 1836 and 1922
Historical essays on each newspaper
Collation and Quality Control on all
papers
NDNP Goals
20-year span with phased, sustainable development
of 30 million page database
Establish technical conversion specs & practices for
efficient basic discovery & access
Develop production tools to ensure good digital
objects that can be managed & preserved long-term
Provide public access to and take preservation
responsibility for the digitized newspapers
Create a national resource of historically significant
newspapers from all the states and U.S. territories
NDNP Microfilm-related
Challenges
Where are the master reels?
Copyright issues (Who filmed the
newspapers and owns the master
microfilm)
Technical specifications (poorly filmed,
low density readings, etc.)
Microfilm standards applied vary widely
No universally accepted
metadata standard for
historical newspapers
Online historical newspapers
produced by public or private sector
existed as discrete systems,
metadata structures not designed for
interoperability
Titles, issues, pages and reels all
need to be represented as different
yet related classes of objects
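One way to picture titles, issues, pages and reels as "different yet related classes of objects" is a set of linked record types. The Python sketch below is illustrative only; the class and field names are assumptions and are not taken from the NDNP metadata specification, and the LCCN and title in the usage lines are made up.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Reel:
    reel_number: str             # identifier of the microfilm reel


@dataclass
class Page:
    sequence: int                # position of the page within its issue
    reel: Optional[Reel] = None  # the reel the page was scanned from


@dataclass
class Issue:
    issue_date: date
    edition: int = 1
    pages: List[Page] = field(default_factory=list)


@dataclass
class Title:
    lccn: str                    # Library of Congress Control Number
    name: str
    issues: List[Issue] = field(default_factory=list)

    def page_count(self) -> int:
        """Total number of pages across all issues of this title."""
        return sum(len(issue.pages) for issue in self.issues)


# Hypothetical values, for illustration only.
reel = Reel("00001")
issue = Issue(date(1900, 1, 1), pages=[Page(1, reel), Page(2, reel)])
title = Title("sn00000000", "Example Gazette", [issue])
print(title.page_count())
```

Pages point to both their issue (through containment) and their reel, which mirrors the two ways a page is located: bibliographically and physically.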
NDNP
Digital Deliverables
Images scanned at 300-400 dpi
• Three formats:
Grayscale, uncompressed TIFF 6.0 images
Compressed JPEG2000 images
PDF Image with hidden text
Accompanying structural and
technical metadata
OCR text for all pages
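A delivered batch can be checked to confirm that every page has all of its files. The sketch below assumes, purely for illustration, that each page's files share a stem and differ only by extension (.tif, .jp2, .pdf, and .xml for the OCR text); the slides do not state the naming scheme, and the function name is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, Set

# Assumed extensions for the per-page files (three image formats plus OCR XML).
REQUIRED = {".tif", ".jp2", ".pdf", ".xml"}


def missing_deliverables(paths: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each page stem to the required extensions it lacks."""
    found = defaultdict(set)
    for p in paths:
        path = PurePosixPath(p)
        found[str(path.with_suffix(""))].add(path.suffix.lower())
    return {stem: REQUIRED - exts for stem, exts in found.items()
            if REQUIRED - exts}


files = [
    "issue1/0001.tif", "issue1/0001.jp2", "issue1/0001.pdf", "issue1/0001.xml",
    "issue1/0002.tif", "issue1/0002.jp2",
]
print(missing_deliverables(files))
```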
NDNP Scanning
specifications
De-skew images with a skew of greater
than 3 degrees
Crop to visible edge of page
Capture grayscale preservation microfilm
targets
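The 3-degree rule can be illustrated with a little trigonometry. In this sketch the skew is estimated from two points along the top edge of a page; how the points are found is outside its scope, and the function names are assumptions rather than part of any NDNP tool.

```python
import math

SKEW_LIMIT_DEGREES = 3.0  # threshold from the scanning specification


def skew_angle(x1: float, y1: float, x2: float, y2: float) -> float:
    """Angle in degrees between a page's top edge and the horizontal."""
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def needs_deskew(angle_degrees: float) -> bool:
    """True when the skew exceeds the limit in either direction."""
    return abs(angle_degrees) > SKEW_LIMIT_DEGREES


# An edge that rises 100 pixels over 1000 pixels is skewed by about 5.7 degrees.
print(needs_deskew(skew_angle(0, 0, 1000, 100)))
```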
NDNP OCR
specifications
Conform to ALTO XML schema
• ALTO (Analyzed Layout and Text Object)
is an XML (Extensible Markup Language)
Schema that details technical metadata
for describing the layout and content of
physical text resources
Bounding box coordinate data
• Each column is sectioned and
coordinates are used to place words
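To show how bounding box coordinates place words on a page, the sketch below reads word positions from a small ALTO-style fragment. The fragment is made up and omits the namespace that real ALTO files declare; the attribute names (CONTENT, HPOS, VPOS, WIDTH, HEIGHT) follow the ALTO String element, while the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A made-up, namespace-free fragment in the style of ALTO; real files
# declare a namespace and carry much more layout detail.
SAMPLE = """
<alto>
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Honolulu" HPOS="120" VPOS="340" WIDTH="210" HEIGHT="38"/>
            <String CONTENT="harbor" HPOS="345" VPOS="341" WIDTH="150" HEIGHT="37"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def word_boxes(alto_xml: str):
    """Return (word, left, top, width, height) for every String element."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for s in root.iter("String"):
        boxes.append((s.get("CONTENT"),
                      float(s.get("HPOS")), float(s.get("VPOS")),
                      float(s.get("WIDTH")), float(s.get("HEIGHT"))))
    return boxes


def find_word(alto_xml: str, term: str):
    """Boxes of words matching the term, ignoring case."""
    return [b for b in word_boxes(alto_xml) if b[0].lower() == term.lower()]


print(find_word(SAMPLE, "harbor"))
```

A viewer can use boxes like these to highlight a search hit on the page image.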
NDNP
Metadata requirements
(Metadata is Information about Information)
METS (Metadata Encoding and Transmission
Standard) format records preservation
metadata
Structural metadata to relate pages to title,
date, and edition; sequence pages within issue
or section; and to identify image and OCR files
Technical metadata to support the functions
of the Library of Congress repository
XML Rules
Single, unique root element
Matching open/close tags
Consistent capitalization
Correctly nested elements (no overlapping elements)
Attribute values enclosed in quotes
No repeating attributes in an element
Provides international, vendor independent standard
for describing information
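The rules above are what an XML parser enforces as "well-formedness". As a rough illustration, the following Python sketch feeds a few made-up snippets to the standard library parser; the snippets and the helper name are assumptions chosen to break one rule each.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


examples = {
    "ok": '<issue date="1900-01-01"><page/></issue>',
    "two roots": "<issue/><issue/>",
    "unclosed tag": "<issue><page></issue>",
    "case mismatch": "<Issue></issue>",
    "overlapping": "<a><b></a></b>",
    "unquoted attribute": "<issue date=1900/>",
    "repeated attribute": '<issue date="a" date="b"/>',
}

for label, text in examples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first snippet is accepted. Well-formedness is separate from validity against a schema such as METS or ALTO, which needs a schema-aware validator.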
Family of XML data
standards includes:
METS – Metadata Encoding and
Transmission Standard
MODS – Metadata Object
Description Schema
PREMIS – PREservation Metadata
Implementation Strategies
EAD – Encoded Archival
Description
METS
(Metadata Encoding and
Transmission Standard)
XML Schema for the purpose of
creating XML files that define:
• the hierarchical structure of digital
library objects (images, text files,
etc.)
• the names and locations of the files
• the associated metadata (e.g., MODS)
Metadata Object
Description Schema
(MODS)
An XML Schema designed for expressing
bibliographic data
(Think of it as an alternative to the MARC
format)
Sections of a METS file
<mets>
<metsHdr/> - METS header (document talks about itself)
<dmdSec/> - Descriptive metadata (MODS, etc.)
<amdSec/> - Administrative metadata (copyright info., etc.)
<fileSec/> - File section (names and locations of files)
<structMap/> - Structural map (relationships of the parts)
<structLink/> - Linking information
<behaviorSec/> - Binding executables/actions to object
</mets>
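The section layout above can be generated programmatically. The following Python sketch builds an empty skeleton with one element per section, using the METS namespace URI; it is not a schema-valid METS record (required attributes and child elements are omitted), and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

SECTIONS = ["metsHdr", "dmdSec", "amdSec", "fileSec",
            "structMap", "structLink", "behaviorSec"]


def mets_skeleton() -> ET.Element:
    """Build an empty METS document with one element per section."""
    root = ET.Element(f"{{{METS_NS}}}mets")
    for name in SECTIONS:
        ET.SubElement(root, f"{{{METS_NS}}}{name}")
    return root


def section_names(root: ET.Element):
    """List the local names of the top-level sections, in document order."""
    return [child.tag.split("}", 1)[1] for child in root]


doc = mets_skeleton()
print(ET.tostring(doc, encoding="unicode"))
```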
Title METS
Combines bibliographic and holdings data
in a single title record, converted from
MARC to MARC XML format
Titles digitized will have additional data
• descriptive essays, more precise geographic
coverage data
• which is put in a Metadata Object Description
Schema (MODS) object within the larger METS
document
Issue and Reel METS
Issue METS
• Issue Data
• Page Data
Reel METS
• Reel Data
• Target Data
WHY?
XML structure used by software for creation of
multiple outputs:
• HTML/XHTML for Web display; PDF for printing
Ease of editing (single records or batches of
records)
Ability to validate data
Ease of data management and publishing
Interoperability
• Repository submission and OAI harvesting
All that coding pays off
for the user when
SEARCHING
Geographic
metadata
Title metadata
Date metadata
Keyword searching
OCR/OWR does not yield article
“transcriptions”; text OCR’d from images of
newspapers is used for searching purposes
Several options
• ANY of the words, ALL of the words
• EXACT PHRASE
• Proximity search
– Look for words within 5, 10, 50 or 100
words of one another
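The idea behind a proximity search can be sketched in a few lines of Python. This is a naive illustration over a single string of OCR text, not a description of how Chronicling America implements its search; the function names and the sample sentence are assumptions.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within_proximity(text: str, first: str, second: str, distance: int) -> bool:
    """True if the two words occur within `distance` words of one another."""
    words = tokenize(text)
    first_at = [i for i, w in enumerate(words) if w == first.lower()]
    second_at = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= distance for i in first_at for j in second_at)


ocr_text = "The steamer arrived in Honolulu harbor late on Tuesday evening"
print(within_proximity(ocr_text, "steamer", "harbor", 5))
print(within_proximity(ocr_text, "steamer", "evening", 5))
```

Because OCR text contains recognition errors, a real search may miss words that a human reader would find on the page image.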
Page thumbnail view
Click on thumbnail or description of page to view larger version
Page view
A different format can be selected with one click
Browse Issues
A calendar view indicating which issues have been digitized
Can change which year you’re viewing
Browse First Pages
Project Management
From Microfilm to Digital Images
Managing a Newspaper Conversion Project
NDNP
&
University of Hawai’i
UH first grant began in July 2008,
running until June 2010
Grant renewed: July 2010-June 2012
Utilizing the microfilm created under the
USNP
Excellent quality microfilm (in theory)
Fewer problems with cataloging/description,
acquiring 2N duplicates (in theory)
Project Management
Request for Proposals (RFP)
• Include all LC technical specifications
Position Description(s)
• Coordinator, students
Hiring and Training
Project components
Microfilm identification and duplication
Digitization
Metadata creation & Validation
Microfilm selection
Choose what is important to your institution(s) if
possible
Copyright
• Reels created by or for your institution
• Reels by ProQuest, etc.: you may have to ask for permission
and pay much higher duplication fees
Decide
• Complete runs of a few titles, or many short/incomplete runs
of a lot of titles
Vendors
iArchives
• Leaders in the field
• Lots of experience
OCLC/BSLW (Backstage Library Works)
Apex CoVantage
Northern Micrographics (NMT)
Local or national microfilm duplication
companies
Equipment
Ten 500 GB external hard drives (Western
Digital MyBooks) and Pelican cases
1 PC with double monitor
Software: Library of Congress’ Digital
Viewer and Validator (DVV)
Densitometer
Microfilm reader/scanner
Our Stuff
Densitometer
Pelican Cases
Microfilm scanner
PC with 2 monitors & portable HDs (red)
Staffing
Project Coordinator
• Quality Control Technician
Graduate students
Advisory Board
Subject/history/newspaper specialists
Metadata Collection
Density readings
Recorded onto a spreadsheet
Preparing the Microfilm:
Metadata
Data from OCLC MARC record & local
holdings
Preparing the Microfilm:
Collation
Review use copy of reel
• Missing issues or pages
• Duplicate issues or pages
• Mutilated pages
• Other abnormalities (e.g., pages out of
order, incorrect dates)
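Part of the collation check, finding missing and duplicate issues, can be sketched once issue dates have been recorded. The example below assumes a daily newspaper with exactly one issue per day, which will not hold for weeklies or for papers that skipped Sundays; the function name and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta
from typing import List, Tuple


def collate(issue_dates: List[date]) -> Tuple[List[date], List[date]]:
    """Return (missing, duplicated) dates, assuming one issue per day."""
    counts = Counter(issue_dates)
    duplicated = sorted(d for d, n in counts.items() if n > 1)
    missing = []
    if counts:
        day, last = min(counts), max(counts)
        while day <= last:
            if day not in counts:
                missing.append(day)
            day += timedelta(days=1)
    return missing, duplicated


dates = [date(1900, 1, 1), date(1900, 1, 2), date(1900, 1, 2), date(1900, 1, 4)]
missing, duplicated = collate(dates)
print("missing:", missing)
print("duplicated:", duplicated)
```

Mutilated pages and pages out of order still need a person looking at the film; a check like this only flags gaps in the recorded dates.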
Preparing the Microfilm:
Collation
Review use copy, record data on spreadsheet
iArchives Digitization Workflow
[Diagram. Stages shown: Film Scanning; Split, De-Skew, Crop; Image Processing; OCR Framework; Post Process; Customer Deliverables, with QC steps between stages. Supporting components: Shared Storage (NAS), Image Metadata, Page/Reel Metadata, Workflow Manager DB, Automated Processing Cloud. Key: automatic process (image processing, OCR, ...); manual process (image + page metadata); quality control.]
iArchives OWR Framework
[Diagram. OWR draws on 3 leading OCR software programs, a 2,000,000-word dictionary, and a 2,000,000-name dictionary.]
Post-vendor validation
Once the hard drive is returned, we
verify/validate the batch using the DVV
program
Verification compares the metadata listed in the
master XML file to the metadata found in the issue
XML files for correctness
Validation is done if a new master XML file needs to
be created. It creates checksums for each file and
records them in the subsequent metadata
Copy contents of hard drive onto our
server
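Checksums are also useful when copying a batch from the hard drive to a server, since they show whether every file arrived unchanged. The sketch below is not the DVV's implementation: the function names are hypothetical, and SHA-256 is an assumption, as the slides do not say which checksum algorithm is used.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def batch_checksums(root: Path, algorithm: str = "sha256") -> Dict[str, str]:
    """Map each file's path relative to root to its checksum."""
    return {p.relative_to(root).as_posix(): file_checksum(p, algorithm)
            for p in sorted(root.rglob("*")) if p.is_file()}


def changed_files(source: Path, copy: Path) -> List[str]:
    """Relative paths that are missing from the copy or differ in content."""
    original = batch_checksums(source)
    copied = batch_checksums(copy)
    return sorted(name for name, value in original.items()
                  if copied.get(name) != value)
```

Calling `changed_files(Path("/media/drive/batch"), Path("/srv/ndnp/batch"))` (both paths hypothetical) would return an empty list when the copy matches the source.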
Quality Control
Image quality
Too dark? Too light? Skewed?
Correct image?
Compare digitized image to microfilmed
image
No Missing Issue/Page tags
Review metadata
Dates
LCCN #
Locations
Thumbnail View
can use DVV or any
graphics program
Quality Control
LC Digital Viewer
and Validator (DVV)
Metadata Viewer
OCR
Headers
Title Essays - 500 words
Describes newspaper’s history
• Date of establishment
• Editors
• Type of news reported
• Political viewpoint
• Where is the paper today?
Published to Chronicling America
Links
Chronicling America:
http://chroniclingamerica.loc.gov/
Library of Congress: http://www.loc.gov/ndnp/
National Endowment for the Humanities:
http://www.neh.gov/projects/ndnp.html
Hawai’i Newspapers: a union list
http://evols.library.manoa.hawaii.edu/handle/10524/2089
Using <METS> and <MODS> to Create XML
Standards-based Digital Library Applications
http://www.loc.gov/standards/mods/presentations/mets-mods-morgan-ala07/
Thank You!
Mahalo!
Kinisou Chapur!
Questions? Comments?
Email us at:
♦ [email protected]
♦ [email protected]
https://sites.google.com/a/hawaii.edu/ndnp-hawaii/