Digitisation OCR and File Formats


NAMASKARA
Digital Archiving – A Workflow
K P Raghuraman
National Centre for Science Information
Indian Institute of Science, Bangalore
Acknowledgements
Organizers
Mr. Francis Jayakant
Mr. Filbert Minj
Friends who supported me in the
effort
Internet
17 July 2015
Archives and Publication Cell, IISc
Digital Archiving
What is a Digital Archive?
Documented Information & storage system
Holds permanent, fixed data for a long time
(?) in a structured and easily accessible way
Employs information architecture
configured to assure trustworthiness and
long term retention
Digital Archiving – Need
A practical task for keeping
documents intact for future use
Improved access to information
resources, preservation and
dissemination as required
Any time; anywhere and any place
Digital Archiving – Benefits
Digitisation contributes to
Conservation of physical resources
Enables effective sharing of information and contributes to knowledge flow
Unlocks information that was previously difficult to access in paper form
Use of digital surrogates reduces wear and tear on originals and can make them more legible
Negates the need to use originals
Access to information can be restricted even with remote access
Provide customizable user interface for collaborative working environment
Faster support regarding any query & question
Cost saving on paper & Time saving in finding information
Digital Archiving – Advantages
Improved searching mechanisms
Metadata search - Full text search - Boolean
search
Support simultaneous searching in a
standardised form, across a range of resource
categories.
Information, rather than media, can be collated
to support a query, regardless of the original
source material type.
Space saving
The contents of 3000 kg of paper could be stored on a DVD
Data can be recombined for manipulation and
compressed for various applications
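The Boolean metadata search mentioned above can be illustrated with a minimal sketch. The Record class, its fields and the sample records are hypothetical and are not taken from any particular archive system.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A catalogued item; the field names are illustrative only."""
    title: str
    keywords: set = field(default_factory=set)


def boolean_search(records, must_have=(), any_of=(), none_of=()):
    """Return records matching AND (must_have), OR (any_of), NOT (none_of)."""
    hits = []
    for rec in records:
        kw = {k.lower() for k in rec.keywords}
        if not all(term.lower() in kw for term in must_have):
            continue  # an AND term is missing
        if any_of and not any(term.lower() in kw for term in any_of):
            continue  # none of the OR terms is present
        if any(term.lower() in kw for term in none_of):
            continue  # a NOT term is present
        hits.append(rec)
    return hits


# Made-up sample records covering different resource categories.
records = [
    Record("Council minutes 1932", {"minutes", "council", "text"}),
    Record("Campus photograph", {"photograph", "campus", "image"}),
    Record("Convocation speech", {"speech", "audio", "council"}),
]

result = boolean_search(records, must_have=["council"], none_of=["audio"])
print([r.title for r in result])
```

The same query runs across text, image and audio records, which is the point of searching metadata rather than media.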
Digital Archiving – Technology and Process
The digital record is a mirror image of the original
analogue/paper-based file in terms of
Page layout and number of pages
Hand written text, graphics & logos
Colour of original document
These images are then rendered into the desired format
(e.g. PDF) for archiving, printing and distribution
Creation of Metadata – used for search and index
Additional metadata providing contextual information
Who uses the records
How will they be used
When will they be used
Access codes to prevent unauthorized access
Digitisation
Crude definition
Scan
Save
Is it just Scan and Save?
Is there a workflow?
Are there guidelines for the whole process?
Digitisation
Definition
Converting written and printed information
into electronic form
Creation of a computerised version of a printed
analogue.
Contents
Contents – text, image, audio or a combination
of these (multimedia)
Objective of Digitisation
Create content of databases
Facilitate access
Preservation
Dissemination of information resources
Digitisation Process
Output
Electronic Document
Tagged Image File Format (TIFF)
Portable Document Format (PDF)
Useful for hosting information on the intranet
Platform independent
PDF readers are available as free
downloads
Digitisation - Objects and Process
Image
Text
Audio
Video
Scanner captures images.
Software analyses images and creates
texts and images
Software converters convert raw Audio and
raw Video to standard digital format
Digitisation - Issues
Hardware
• Computer
• Scanner
Software
• Communication software PC – Scanner – TWAIN compliant
• Image processing – Photoshop, Macromedia Fireworks etc.
• Enable text material to be converted to text, i.e. OCR (Optical Character Recognition) – ABBYY, OmniPage
Suitable Policy
• Consistent quality threshold for scanned images.
• Choosing appropriate image format – TIFF, JPEG etc.
• Choosing an appropriate file name scheme.
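One possible file name scheme is sketched below. The pattern (collection slug, four-digit item number, three-digit page number) and the function name archive_filename are assumptions made for illustration; the slides do not prescribe a particular scheme.

```python
import re


def archive_filename(collection, item_no, page_no, ext="tif"):
    """Build a predictable, sortable name for one scanned page."""
    # Lower-case the collection name and replace anything that is not a
    # letter or digit with a hyphen, so the name is safe on most systems.
    slug = re.sub(r"[^a-z0-9]+", "-", collection.lower()).strip("-")
    if not slug:
        raise ValueError("collection name must contain letters or digits")
    # Zero-padding keeps files in page order when sorted by name.
    return f"{slug}_{item_no:04d}_p{page_no:03d}.{ext}"


print(archive_filename("Council Minutes", 12, 3))
```

Whatever pattern is chosen, applying it through a single function keeps names consistent across operators and scanning sessions.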
Scanners
Flat bed scanners
Normal Desktop scanner
Sheet fed scanners
Same as above but here document moves and
scan-head is immobile
Handheld scanner
Used to capture text – size of a pen.
Drum scanner
Used in publishing industries
Planetary Scanner
Scanning books
Types of Images
1-bit black and white – either black or white
Used for printed text or line graphics
Unsuitable for images
8-bit grey scale – 256 grey scales
Black and white photographs
Non-color documents
8-bit color – 256 colors
low quality images
24-bit color – 16.8 million shades of color
Ideal archival quality images
For color photo printing
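The colour counts quoted for each image type follow from the bit depth: n bits per pixel can encode 2 to the power n distinct values. A short sketch (the function name is illustrative):

```python
def colours_for_bit_depth(bits):
    """Number of distinct values one pixel can take at this bit depth."""
    return 2 ** bits


for bits in (1, 8, 24):
    print(f"{bits}-bit: {colours_for_bit_depth(bits):,} values per pixel")
```

For 24 bits this gives 16,777,216 values, the figure rounded to 16.8 million above.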
Resolution
Measurement in dots per inch (dpi)
The higher the dpi, the larger the file size
Resolution (dpi)        400      300      200      100
1-bit black and white   20 K     11 K     5 K      1 K
8-bit B&W or color      158 K    89 K     39 K     9 K
24-bit color            475 K    267 K    118 K    29 K
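The relationship between resolution, bit depth and file size can be worked out directly. The sketch below assumes an uncompressed image and ignores headers and row padding. The figures in the table above are close to what this gives for one square inch, although the slide does not state the area it used.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate raw size of a scan: pixel count times bits per pixel."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# One square inch at the resolutions and bit depths from the table.
for bits in (1, 8, 24):
    sizes = [uncompressed_bytes(1, 1, dpi, bits) for dpi in (400, 300, 200, 100)]
    print(f"{bits:>2}-bit:", sizes)

# A full 8.5 x 11 inch page at 300 dpi in 24-bit colour.
print(uncompressed_bytes(8.5, 11, 300, 24))
```

Doubling the dpi quadruples the pixel count, which is why the sizes grow much faster than the resolution figure itself.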
Image - Size
Image size is measured in pixels
Image size varies with the scanning resolution
Modification of image size is called
resampling
Image pixels are displayed on the
pixels of the screen
One screen pixel contains one image pixel
and can have any RGB value
800 x 600 pixels – 14” monitor
1024 x 768 pixels – 16” monitor
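Resampling can be illustrated with nearest-neighbour sampling, the simplest of several methods. This is a sketch on a tiny grid of pixel values; image software normally offers smoother methods such as bilinear or bicubic interpolation.

```python
def resample_nearest(pixels, new_width, new_height):
    """Resize a 2-D grid of pixel values by picking the nearest source pixel."""
    old_height = len(pixels)
    old_width = len(pixels[0])
    return [
        [
            pixels[y * old_height // new_height][x * old_width // new_width]
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]


image = [[1, 2],
         [3, 4]]

bigger = resample_nearest(image, 4, 4)    # upsample 2x2 -> 4x4
smaller = resample_nearest(bigger, 2, 2)  # downsample back to 2x2
for row in bigger:
    print(row)
```

Upsampling adds no new detail; each source pixel is simply repeated, which is why scanning at an adequate resolution matters more than enlarging afterwards.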
Image – File Formats
Some standard image formats
TIFF – Tagged Image File Format
JPEG – Joint Photographic Experts
Group
DjVu – déjà vu (a free file format)
GIF – Graphics Interchange Format
PNG – Portable Network Graphics
TIFF
Multiple images and data in the same file
Tags in file header (information on size,
compression)
Loss-less format, useful for archival
images
Platform independent
Format useful for future modification –
can be edited without compression loss
Disadvantage
File size is very large
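The tags mentioned above are located through a fixed 8-byte header at the start of every TIFF file: a byte-order mark ("II" for little-endian, "MM" for big-endian), the number 42, and the offset of the first tag directory. The sketch below parses that header with the Python standard library only; the sample bytes are constructed for illustration rather than read from a real scan.

```python
import struct


def parse_tiff_header(data):
    """Read the byte order and first directory offset from a TIFF header."""
    if len(data) < 8:
        raise ValueError("a TIFF header is 8 bytes long")
    order = data[:2]
    if order == b"II":
        endian = "<"   # little-endian
    elif order == b"MM":
        endian = ">"   # big-endian
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")
    magic, first_ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file: magic number is not 42")
    return {"byte_order": order.decode("ascii"),
            "first_ifd_offset": first_ifd_offset}


# A hand-built little-endian header whose tag directory starts at byte 8.
sample = b"II" + struct.pack("<HI", 42, 8)
print(parse_tiff_header(sample))
```

A check like this can be used to confirm that files in a preservation store really are TIFF files before any further processing.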
JPEG
Strongest format for web images and
printing images
Superior quality can be produced
Variety of compression capability
Best method for online viewing
Disadvantage
Lossy compression format
GIF
Very old format
Lossless compression format
Less storage space
Strong candidate for graphic art and
drawing.
Disadvantage
Limited to 256 colors.
DjVu
File format to save scanned images
especially with text.
Advanced technology for image layer
separation of text and images.
High quality readable images, stored
in minimum space – useful for web.
Progressive loading – useful for web.
Format used for the Million Book Project
PNG
A new format
Created to improve on GIF format
Supports 24-bit color or greyscale
Provides for variety of transparency
Lossless data compression
Disadvantage
Being new, it is not supported by older software
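What "lossless" means can be shown with a round trip through zlib, which implements DEFLATE, the compression method used inside PNG. The sketch compresses made-up pixel data rather than a real PNG file, so it illustrates the principle only.

```python
import zlib


def lossless_round_trip(raw):
    """Compress and decompress; return (compressed size, data unchanged?)."""
    packed = zlib.compress(raw, 9)
    restored = zlib.decompress(packed)
    return len(packed), restored == raw


# 100 identical scan lines of 8-bit greyscale: mostly white with a dark edge.
row = bytes([255] * 90 + [0] * 10)
raw = row * 100

packed_size, identical = lossless_round_trip(raw)
print(len(raw), "bytes raw,", packed_size, "bytes compressed, identical:", identical)
```

The restored bytes match the original exactly, which is the property that makes lossless formats suitable for masters. A lossy format such as JPEG would return data that is only visually similar.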
File Formats
Audio
WAV
Microsoft, IBM audio file format.
Lossless storage method – large files.
MP3 – MPEG-1 Audio Layer 3
Popular digital audio encoding.
Lossy compression format so smaller files.
Still can produce good reproduction of original.
RealAudio – ram
Variety of audio codecs from low-bitrate to high-fidelity
formats
Streaming audio format
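The size difference between uncompressed and lossy audio can be estimated with simple arithmetic. The parameters below (44.1 kHz, 16-bit, stereo for WAV; a constant 128 kbit/s for MP3) are common values chosen as assumptions for this sketch and do not come from the slides.

```python
def pcm_bytes(seconds, sample_rate=44_100, bits_per_sample=16, channels=2):
    """Size of uncompressed linear PCM audio, excluding the file header."""
    return seconds * sample_rate * channels * bits_per_sample // 8


def cbr_bytes(seconds, kilobits_per_second=128):
    """Size of a constant-bitrate stream such as a typical MP3."""
    return seconds * kilobits_per_second * 1000 // 8


one_minute_wav = pcm_bytes(60)
one_minute_mp3 = cbr_bytes(60)
print(one_minute_wav, one_minute_mp3, round(one_minute_wav / one_minute_mp3, 1))
```

Under these assumptions one minute of audio takes about 10.6 MB uncompressed and just under 1 MB as MP3, roughly an elevenfold saving.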
File Formats
Video
MPEG-21
Defines “Rights Expression Language” standard
– Sharing digital rights/permissions/restrictions for
content from content creator to consumer
XML based file system
– Can communicate machine readable license
information in a "ubiquitous, unambiguous and secure"
manner.
The main objective of the MPEG-21 is to define the
technology needed to support users to exchange,
access, consume, trade or manipulate Digital Items in
an efficient and transparent way.
OCR
Optical Character Recognition
Goal – Recreate text and other
elements like tables and layout so as
to edit in popular word-processors
Requirement – Scanner and text
conversion software (OCR)
Technology – Examines patterns of
dots and recognizes them and writes
them as alphabetic characters and
numbers
OCR - Process
The scanner or camera produces a
TIFF image
The software cleans noise from the
image and starts recognizing patterns
Recognized patterns are written out as
letters and numbers
Unrecognized patterns are kept as images
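The idea of recognizing patterns of dots can be illustrated with a toy template matcher. This is a deliberately simplified sketch: the 3 x 5 glyph templates, the mismatch tolerance and the function names are invented for illustration, and commercial OCR engines use far more sophisticated methods.

```python
# Each template is a 3x5 grid: '#' is an inked dot, '.' is blank paper.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def mismatches(glyph, template):
    """Count the dots that differ between a scanned glyph and a template."""
    return sum(
        a != b
        for row_g, row_t in zip(glyph, template)
        for a, b in zip(row_g, row_t)
    )


def recognise(glyph, max_mismatches=1):
    """Return the best-matching character, or None if nothing is close enough."""
    best_char, best_score = None, None
    for char, template in TEMPLATES.items():
        score = mismatches(glyph, template)
        if best_score is None or score < best_score:
            best_char, best_score = char, score
    # Unrecognized patterns return None and would be kept as an image.
    return best_char if best_score <= max_mismatches else None


noisy_l = ("#..", "#..", "#..", "#..", "##.")   # one dot lost to noise
blob = ("###", "###", "###", "###", "###")       # ink blot, not a letter

print(recognise(noisy_l), recognise(blob))
```

The tolerance is what lets slightly damaged characters still be read; setting it too high would turn blots into wrong letters.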
Widely used settings
24-bit color
600 dpi (although 300 or 400 dpi are
popular for text)
TIFF Rev 6 without compression or with LZW
compression
(PNG is currently becoming popular)
Photographs to be scanned at twice their size
B&W photographs in grey scale
Text can also use the above settings and can
be stored as PDF or DjVu
Popular Practices Followed
Initially Preservation Masters are
created.
Should be uncompressed to retain
archival integrity
For long-term storage purposes.
Compressed web files are created as
surrogate files for the repository or for
the web site
Specific File Formats
Original – Preservation Master – Surrogates
Image – TIFF – JPEG, DjVu
Text – TIFF – JPEG, DjVu, PDF
Audio – Linear WAV – MP3 / RealAudio (ram)
Video – MPEG 21
OCR - Accuracy
Depends on
Color of paper
Characters should be reasonably well
formed
The font should be one of the popular ones.
99% accuracy achieved with
Bleached white paper
10pt character size
1.5 line spacing
Computer based printouts
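An accuracy figure such as 99% can be measured by comparing OCR output with a hand-checked reference text. The sketch below uses character-level edit distance, one common way to do this; the slides do not say how their figure was obtained, and the sample strings are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Share of reference characters the OCR got right, from 0.0 to 1.0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


reference = "digital archive"
ocr_output = "digita1 archive"   # the letter l misread as the digit 1
print(f"{character_accuracy(reference, ocr_output):.1%}")
```

Even 99% accuracy leaves about one wrong character in every hundred, so a page of 2,000 characters would still contain around 20 errors.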
OCR - Issues
Dealing with archival material
Old text printed during the hand-press
period
Gothic and exotic fonts used
Paper color is yellow
Characters are often broken and not
well-formed due to age and
environment factors
Best Practice
First scan and store as TIFF files
Run OCR on the TIFF files
Depending on the application and
size, convert them into PDF or any
other format
Depending on the accuracy of the OCR, use
the TIFF or the OCR copies for the PDF
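The last step, choosing between the TIFF image and the OCR text as the source for the PDF, can be expressed as a simple rule. The 0.99 threshold is an assumption borrowed from the 99% figure on the accuracy slide, and the function name and return values are hypothetical.

```python
def pdf_source(ocr_accuracy, threshold=0.99):
    """Pick what goes into the PDF based on measured OCR accuracy (0.0 to 1.0)."""
    if not 0.0 <= ocr_accuracy <= 1.0:
        raise ValueError("accuracy must be between 0.0 and 1.0")
    # Good OCR: use the recognized text. Poor OCR: fall back to the page image.
    return "ocr_text" if ocr_accuracy >= threshold else "tiff_image"


for accuracy in (0.995, 0.93):
    print(accuracy, "->", pdf_source(accuracy))
```

In either case the TIFF master is kept, so the decision can be revisited when better OCR becomes available.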
OCR – Software
ABBYY FineReader – Very
popular
OmniPage – High-end OCR tool
Readiris – A competitor to ABBYY
and OmniPage
MODI – Microsoft Office Document
Imaging (introduced in Office XP and
exports to Word)
The camera
produces a raw,
uncorrected
color photo of
each page
The software
cleans up the
image and saves
as Hi-Res TIFF
image
Using OCR it can be
converted to
editable text
Summary
Digitization is a process
Large number of analogue items like
image, text, audio and video are
captured into digital form
Understand the variables and tasks in
the process
Methods of capturing images
Conversion process performed
Summary
Document the workflow
This will lead to life history for each
digitized item
Help Create Consistency and
Reliability
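A life history for each digitized item can be kept as a simple list of timestamped events. The sketch below is one possible shape for such a record; the class name, field names and sample steps are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ItemHistory:
    """Log of every workflow step applied to one digitized item."""
    item_id: str
    events: list = field(default_factory=list)

    def record(self, step, detail=""):
        # Store the time in UTC so logs from different machines compare cleanly.
        self.events.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "step": step,
            "detail": detail,
        })

    def steps(self):
        return [event["step"] for event in self.events]


history = ItemHistory("council-minutes_0012")
history.record("scan", "600 dpi, 24-bit colour")
history.record("save", "TIFF, uncompressed")
history.record("ocr", "text layer created")
print(history.steps())
```

Recording the settings with each step is what makes results reproducible when the same collection is revisited later.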
New Definition
Is this the end of digitization?
Are we through with the work?
As in every other job, here too
sustainability and maintenance are
necessary
Long term maintenance
Technology is changing rapidly
Obstacles that may need to be overcome
Lack of awareness in general about how
such resources may be exploited
effectively for scholarly purposes
Lack of relevant IT skills and/or
analytical methods
Lack of appropriate user support.
Strategies for preserving data
Preserving the data and the hardware and
software platforms from which they are
originally made accessible.
Refreshing data by copying them
periodically onto new storage media.
Migrating data through changing technical
regimes by rendering them into an
appropriate standard interchange format.
Emulating the look and feel of the original
data on successive generations of
hardware and software platforms.
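Refreshing, the second strategy above, is only safe if each copy is verified. A minimal sketch is given below, assuming SHA-256 checksums are used for the comparison; the function names are illustrative and the demonstration runs in a temporary directory.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large masters do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(source, destination):
    """Copy a file to new storage and confirm the copy is bit-identical."""
    before = sha256_of(source)
    shutil.copy2(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise OSError(f"checksum mismatch after copying {source}")
    return after


with tempfile.TemporaryDirectory() as folder:
    old_medium = Path(folder) / "master_old.tif"
    new_medium = Path(folder) / "master_new.tif"
    old_medium.write_bytes(b"stand-in for image data" * 1000)
    print(refresh_copy(old_medium, new_medium))
```

Storing the checksum alongside the file also allows later audits to detect silent corruption between refresh cycles.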
Points to ponder
Unlike paper, parchment and other traditional forms of
recording medium, electronic systems and their data are not
durable. Digital materials have very different preservation
requirements to analogue materials, which may last for
many decades through storage in optimal environmental
conditions.
The other difficulty with electronic data and files is that they
require the intervention of other systems to facilitate
readability or usability. This innate dependency makes the
files themselves very fragile. A problem in any of the
supporting components can render the information useless.
It is not enough to physically preserve the storage medium
or present the bitstream. Without the commensurate tools
to decode and present the bitstream, a future user will be
met with gibberish.
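The gibberish problem can be reproduced on a small scale with text encodings: the stored bytes stay intact, but reading them with the wrong decoder gives a wrong result. The sketch below uses UTF-8 and Latin-1 purely as an example of a mismatch between how data was written and how it is read.

```python
original = "d\u00e9j\u00e0 vu"           # the phrase "déjà vu"
bitstream = original.encode("utf-8")    # what is actually kept on the medium

with_right_tool = bitstream.decode("utf-8")
with_wrong_tool = bitstream.decode("latin-1")   # decoder does not match writer

print(with_right_tool)
print(with_wrong_tool)
print(with_wrong_tool.encode("latin-1") == bitstream)   # bytes were never damaged
```

The bytes are perfectly preserved in both cases. Only the knowledge of how to interpret them differs, which is why format documentation has to be preserved along with the data.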
Digitization – Next Step
Will mean preservation of materials
that are ‘born digital’ .
Migration
Electronic data transferred from one data
format to another.
Emulation
Attempts to use current and future
technologies to emulate the tools and logic
used when the records and files were
originally created
Informative web sites
Irish Virtual Research Library and Archive Project Workbook
http://www.ucd.ie/ivrla/workbook/wdigpreservation.html
The Arts and Humanities Data Service
(AHDS) is a UK national service aiding the
discovery, creation and preservation of
digital resources in and for research,
teaching and learning in the arts and
humanities
http://ahds.ac.uk/about/publications/index.htm