Manual for Scanning Books - Guru Nanak Dev Engineering

Universal Library Project
User Manual
Indian Institute of Science
Bangalore
[email protected]
Process Involved
Identification of Books -> Pre-Scanning Process -> Scanning Process -> Image Processing -> Conversion Process
Copyright 2002, ULIB Project , IISc
Identification of Books
Books are selected for scanning only if:
No copyright law is violated.
They are of relevance to the selected user group.
They are available for scanning.
Pre-Scanning Process
Pre-scanning verification: check whether the book has already been scanned elsewhere.
Create a directory with the name of the book to be scanned, e.g. Discoveries_And_Inventions_Of_The_Twentieth_Century
Create the metadata file (meta.xml) for the book.
Create the following folders within that directory:
– OTIFF – will contain the original scanned TIF images
– PTIFF – will contain the processed TIF images
– HTM – will contain the HTM pages after OCR
– RTF – will contain the RTF pages after OCR
– TXT – will contain the TXT pages after OCR
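The directory layout above can also be created by a small script rather than by hand. The following is a minimal sketch in Python, not part of the ULIB tools. The function name create_book_layout and the rule of joining the words of the title with underscores are assumptions made for illustration, based on the example directory name above.

```python
from pathlib import Path

# Folder names taken from the pre-scanning slide.
SUBFOLDERS = ("OTIFF", "PTIFF", "HTM", "RTF", "TXT")


def create_book_layout(base_dir, book_title):
    """Create <base_dir>/<Book_Title>/ with the five working folders.

    Assumption: the directory name is the book title with the words
    joined by underscores, as in the example in the manual.
    """
    dir_name = "_".join(book_title.split())
    book_dir = Path(base_dir) / dir_name
    for name in SUBFOLDERS:
        # parents=True also creates the book directory itself;
        # exist_ok=True makes the script safe to run twice.
        (book_dir / name).mkdir(parents=True, exist_ok=True)
    return book_dir
```

For example, create_book_layout("D:/Books", "Discoveries And Inventions Of The Twentieth Century") would create the book directory and its five folders under D:/Books (the drive and base folder here are placeholders).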
Metadata Creation
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dublincore>
<title>Proceedings of Indian Academy of Sciences, Volume 1</title>
<creator>Editorial Board, Indian Academy of Science</creator>
<subject>Science</subject>
<publisher>Indian Academy of Sciences, Bangalore</publisher>
<contributor>Indian Institute of Sciences</contributor>
<date>1934</date>
<documenttype>Journal</documenttype>
<format />
<identifier />
<source />
<language>English</language>
<relation />
<coverage />
<rights />
<copyrightowner>Indian Academy Science</copyrightowner>
<copyrightexpirydate />
<scanningcentre>SERC,Indian Institute of Science,Bangalore</scanningcentre>
<scannerno>SE-2639</scannerno>
<digitalrepublisher>Indian Institute of Science</digitalrepublisher>
<digitalpublicationdate>09/07/02</digitalpublicationdate>
</dublincore>
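A metadata file of this shape can be generated by a script, which avoids unclosed or mistyped tags. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. It is not the project's own tool; the function name build_metadata is hypothetical, and the field list is simply copied from the sample above.

```python
import xml.etree.ElementTree as ET

# Element names in the order used by the sample meta.xml above.
FIELDS = (
    "title", "creator", "subject", "publisher", "contributor", "date",
    "documenttype", "format", "identifier", "source", "language",
    "relation", "coverage", "rights", "copyrightowner",
    "copyrightexpirydate", "scanningcentre", "scannerno",
    "digitalrepublisher", "digitalpublicationdate",
)


def build_metadata(values):
    """Return meta.xml content as bytes encoded in ISO-8859-1.

    `values` maps field names to text; fields that are missing or
    empty are written as empty elements, as in the sample.
    """
    root = ET.Element("dublincore")
    for field in FIELDS:
        child = ET.SubElement(root, field)
        text = values.get(field)
        if text:
            child.text = text
    return ET.tostring(root, encoding="ISO-8859-1", xml_declaration=True)
```

The returned bytes can be written to meta.xml in the book directory, for example with open(path, "wb").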
File Formats
Image file formats used to store scanned images
– Tagged Image File Format (TIFF or TIF) – Recommended
– Windows Bitmap (BMP)
– Joint Photographic Experts Group (JPEG or JPG)
Text file formats used to store the OCR output
– Rich Text Format (RTF) – Recommended
– Hypertext Markup Language (HTML) – Recommended
– ASCII text (TXT) – Recommended
– Portable Document Format (PDF)
Scanning Process
Scanning

Scanner – Minolta PS 7000 Book Scanner
– Connects to a PC running Windows 2000 Professional
– Scans books face up
– Scan speed – 6 pages/min
– Transfers scanned images through a SCSI interface
About Scanning software
QuickScan is the software used for scanning.
PixTools/QuickScan is a high-performance Microsoft Windows utility
application that provides an integrated image acquisition environment that
allows you to scan, view, print, annotate, store, and perform image
processing on documents.
Scanning
QuickScan uses Pixel Translations ISIS (Image and Scanner Interface
Specification) libraries to support more than 125 scanners from many
manufacturers.
ISIS drivers enable scanning at the full rated speed of your scanner.
QuickScan also incorporates support for full control of your scanner's
capabilities, allowing you to adjust brightness, contrast, scan resolution, scan
mode, dithering, and any other settings available in your scanner.
Viewing
QuickScan is also a high-performance image viewer that includes many features to
make it easy to display and manipulate images.
Features include a main viewer, a thumbnail viewer, fast scaling and rotation,
background preloading, annotations, scale-to-gray conversion for binary images, and
a pan window.
Printing
QuickScan prints images using any standard Windows-supported printer. Options for
convenient image printing include Fit Page, Actual Pixels, and Actual Size.
Saving
QuickScan saves acquired images in a variety of popular image file formats and
compression schemes.
By using TIFF Group 4, you can achieve compression ratios of 35:1 to 50:1,
depending on the type of image and the quality of the scan.
Color and gray scale capabilities vary according to your scanner and the image
file format in use.
QuickScan supports color and gray scale file storage, scanning, viewing, and
printing.
Image Processing
Image processing is a technology designed to improve the quality and readability of
your images.
Image processing can be used to clean up "dirty" images, straighten crooked images
caused by paper misalignment during scanning, remove black circles on images
scanned from punch holes, etc.
It can also be used to recognize barcodes and patch codes on the image.
QuickScan includes a suite of image processing filters, or IP filters, each designed to
perform a specific enhancement task.
These filters can be used individually or in sequence, and they can be configured to
deliver the optimal results depending on your parameters.
Image processing can be done on existing images opened in QuickScan. It can also be
accomplished dynamically during a batch scan.
Launching QuickScan Software
Starting the QuickScan software
Go to START > PROGRAMS > PixTools > Products > QUICK SCAN
Scanner Selection
Go to the File menu and click Select Scanner. In the Scanner Selection window, select Minolta PS 7000 and click OK.
Note – This must be done the first time you run the software.
Scanner Setting
Go to Options > Scanner Settings. The Scanner Settings window opens.
Make the following settings.
Mode – Black and white.
Dither – None.
Dots Per Inch – 600 (for scanning books).
Brightness – Manual. It can be set darker or lighter depending on the quality of the book, on a scale of 1 to 9.
– In general it can be set to normal (5 in scale).
– For very old books the scale is set to 3.
Contrast – Automatic.
Resolution
Scanner Resolution
Resolution is the number of pixels used to represent one inch of the scanned image.
It is measured in dots per inch (dpi).
Higher resolution gives better quality, but it also produces more data and needs more memory.
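The link between resolution, data and memory can be worked out with simple arithmetic. The sketch below is illustrative only: the 6 x 9 inch page size is an assumption, not a figure from this manual, and the compressed sizes are rough estimates obtained by applying the 35:1 to 50:1 TIFF Group 4 ratios quoted in the QuickScan description.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Size in bytes of an uncompressed scan of one page.

    A black-and-white (binary) scan uses 1 bit per pixel.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


def estimate_group4_range(raw_bytes, best_ratio=50, worst_ratio=35):
    """Rough (smallest, largest) compressed size for the quoted ratios."""
    return raw_bytes // best_ratio, raw_bytes // worst_ratio


# Assumed page size: 6 x 9 inches, scanned in black and white.
raw_600 = uncompressed_bytes(6, 9, 600)   # 3600 x 5400 pixels
raw_300 = uncompressed_bytes(6, 9, 300)   # 1800 x 2700 pixels
print(raw_600, raw_300)                   # 2430000 607500
print(estimate_group4_range(raw_600))     # (48600, 69428)
```

Doubling the resolution from 300 to 600 dpi quadruples the number of pixels, so the uncompressed page needs four times the memory.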
Scanner Setting
Click the “More” option in the Scanner Settings window. The Minolta PS 7000 Special Features window opens.
Make sure that all the options in the Minolta PS 7000 Special Features window are set correctly for the book being scanned.
Mode of Scanning:
Spread – for scanning journals, newspapers and drawing sheets.
Single – for scanning a single page of a book, either the left or the right page.
Book Split – for scanning both the left and right pages of a book and saving each page as a separate file.
Select Book Split for scanning books.
Enable the following options:
Under Image Edit, select “Book Compositions”, “Frame Masking”, “Finger Masking” and “Centring”, in that order.
Select “Centre-line erase” and “Automatic detection”, again in that order.
Finally, for Scan Method, select “Front Panel”.
Press “OK” to exit the Scanner Settings window.
For Single mode, enable the above options.
For Spread mode, disable the above options.
Setting for Scanning Old Books
When scanning old books, where the page background is yellowish in most cases, take care of the following.
Check that the mirror is clean.
The distance between scanner bed and stopper should be set to 10mm/Auto in the control panel. To do this, go to User mode -> Book sheet in the control panel, set it to 10mm/Auto and click OK.
Contrast can also be adjusted in the control panel if the scanned images are blurred. To do this, go to Qlty -> contrast in the control panel.
These settings can be made according to the book, as the quality of the page background varies from book to book.
Usually brightness is set to maximum dark and contrast to 1 while scanning old books, but this varies from book to book and even from page to page. Find suitable settings by trial and error.
The error “Bind size not detected” occurs more often while scanning such books, mostly because the book edges are dirty.
Setting for scanning New Books
When scanning new books, where the page background is white, use the following settings.
The distance between scanner bed and stopper should be set to 10mm/Auto in the control panel. To do this, go to User mode -> Book sheet in the control panel.
Brightness is Automatic
Contrast is Automatic
When scanning such books, the error “Bind size not detected” occurs rarely and may never occur.
Scanning
To scan a batch of pages:
From the File menu, choose Scan Batch to File > Create New Batch. The Scanned Document Name dialog appears.
In the File Name text box, type a file name (without an extension), and from the Drives and Directories lists, choose the drive and directory where you would like your file saved.
From the List Files of Type list, choose the appropriate file type.
From the File Format Properties group box, choose the appropriate Color Format and Compression options.
From the Schema Activation group box, choose Use Schema.
For this project, make the following settings in the Scanned Document Name window.
In the File Name text box, type a file name (without an extension) based on the schema conventions described in the next two slides. The file name can be a maximum of 16 characters and should be relevant to the book that you scan.
Under List Files of Type, select “TIFF(*.tiff)”.
Under Drives, pick the drive where you plan to store the scanned images.
Under Directory, pick the “OTIFF” folder which was created in the Pre-Scanning stage.
On the right side of this screen, under Schema Activation, select “USE SCHEMA”.
Note – “Multi-Page” should not be selected, since it saves all pages as a single file.
Schema Convention
Naming schemas let you automatically name large batches of scanned pages in such a way that they are easily identifiable.
You can choose from a list of common naming schemas or you can create your own to meet the requirements of your job.
Schemas can be thought of as “pictures” of how you want your documents’ file names to look, including automatically generated batch numbers, page numbers, and side identifiers.
Schema selection
In Schema Activation, click Edit Schema. The File Naming Schema window opens.
Schema Conventions
$ sign – Represents one letter from the set {a-z, A-Z}.
# sign – Represents one numeral from the set {0-9}.
; (separator) – Ends the file name.
Example – $$$$####;
$$$$ – Represents the file name that you specified in the File Name text box.
#### – Represents the automatically generated numbers.
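To see what file names a schema such as $$$$####; produces, the convention can be imitated in a few lines of Python. This is only an illustration of the convention as described above, not QuickScan's actual implementation. The function name expand_schema is hypothetical, and taking the first letters of the typed file name for each run of $ signs is an assumption.

```python
import re


def expand_schema(schema, name, number):
    """Expand a naming schema into one file name (without extension).

    Each run of '$' takes that many leading characters of `name`,
    each run of '#' becomes `number` zero-padded to that many digits,
    and ';' ends the file name.
    """
    body = schema.split(";", 1)[0]  # ignore everything after the separator
    parts = []
    for match in re.finditer(r"\$+|#+|[^$#]+", body):
        token = match.group(0)
        if token[0] == "$":
            parts.append(name[:len(token)])
        elif token[0] == "#":
            parts.append(str(number).zfill(len(token)))
        else:
            parts.append(token)  # any other character is copied as it is
    return "".join(parts)


print(expand_schema("$$$$####;", "disc", 7))    # disc0007
print(expand_schema("$$$$####;", "disc", 123))  # disc0123
```

With four # signs the numbering stays in sorted order up to 9999 pages, which is why the numerals are zero-padded.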
Rescanning
To scan and insert pages into your current batch, replacing pages that were not scanned properly:
Place the pages you want to insert in your batch in the scanner.
If a page was badly scanned, select it in the Thumbnail Viewer and delete it.
In the Thumbnail Viewer, do one of the following:
– Select a thumbnail and place the cursor to the right of the image to insert the scanned page after the selected image.
– Select a thumbnail and place the cursor to the left of the image to insert the scanned page before the selected image.
From the File menu, choose Scan Batch to File > Insert Pages. The Insert Into Current Batch dialog appears.
Choose Binary as the Color Format and CCITT Group 4 as the Compression, and click OK. The Prepare Scanner dialog appears.
Click Start Scanning. You can rescan the page until it is scanned properly.
Tips during Scanning
Take proper care while placing the book.
The book should not go beyond the stopper.
Keep the steel mirror clean by wiping it with a soft cloth.
Clean the books before placing them, so that dust does not pass from the books to the scanner bed.
Place the scanner away from direct light sources.
Do not place any magnetic material or liquid near the scanner.
Do not look at the light source when scanning.
Always make sure you switch off the scanner when you have completed scanning.
Take a break of 5 minutes after every hour of scanning.
Image Processing
Image Processing using Java
Enhances the quality of the scanned Images by removing noise.
Reduces File size.
Tools used
Despeckle – Removes isolated black pixels.
Deskew – Detects and removes the skew
Crop – Removes the extra white spaces
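What despeckle and crop do can be shown on a tiny binary image. The following Python sketch is for illustration only and is not the project's Java cropper. It assumes the page is a list of rows in which 1 is a black pixel and 0 is a white pixel, and it treats a black pixel as isolated when none of its eight neighbours is black. Deskew is not shown.

```python
def despeckle(img):
    """Return a copy of img with isolated black pixels turned white."""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            if img[y][x] != 1:
                continue
            # Look at the up to eight surrounding pixels.
            has_black_neighbour = any(
                img[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                out[y][x] = 0
    return out


def crop(img):
    """Cut away all-white rows and columns around the black content."""
    rows = [y for y, row in enumerate(img) if any(row)]
    if not rows:
        return []  # blank page: nothing to keep
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],  # a single speck in the corner
]
print(crop(despeckle(page)))  # [[1, 1], [1, 1]]
```

Removing the speck before cropping matters: with the speck left in, the crop would have to keep the whole width of the page. Smaller cropped images also explain the reduction in file size.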
Execution Procedure
Install JDK version 1.2.1 or above.
Set the path to C:\crop\debug
Run the command for a single book:
java cropper1 "D:\Books\Book1"
For multiple books:
java cropperm "D:\Books\Book1"
Execute the batch file Cropper.bat
After completion, execute the batch file Clearup.bat
Cropping
Cropping Version 1.3 – Release August 2002
Select the base folder; all the books in it are cropped.
Conversion Process
Processed Image File -> OCR -> RTF, HTML, TXT
Optical Character Recognizer
The scanned images are in Tagged Image File Format (TIFF). To extract the text for further processing, we use ABBYY FineReader.
Launching ABBYY FineReader 6.0
Software used for character recognition
START > PROGRAMS > ABBYY FineReader > ABBYY FineReader 6.0 Professional
ABBYY FineReader Setting
Go to Tools > Options > Recognition and set the following options.
Under the Recognition tab
– Recognition language – English (optional)
– Document type – Automatic
– Training – Do not use user patterns
Under the Formatting tab
– Retain layout – Enable
– Retain full page layout.
– Keep pictures.
Click Formats > RTF/DOC – Enable
– Default paper size – A4.
– Keep page breaks.
– Keep line breaks.
– Retain text color.
In the HTML tab – Enable
– Keep line breaks.
– Retain text color.
– Use solid line as page break.
– Format – Auto.
In the Text tab – Enable
– Code page – Automatic.
– Code page type – Windows.
– Keep line breaks.
– Append to the end of file.
– Use page break character (#12) as page separator.
– Use blank line as paragraph separator.
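With the Text tab settings above, pages in the TXT output are separated by character #12 (the form feed character) and paragraphs by a blank line. The following Python sketch shows how such a file could be split back into pages and paragraphs. It is an illustration under those two settings only; the function name split_ocr_text is hypothetical and the code is not part of FineReader.

```python
PAGE_BREAK = "\f"  # form feed, character code 12


def split_ocr_text(text):
    """Split OCR text into pages, and each page into paragraphs."""
    # Files saved on Windows use CR+LF line endings; normalise them first.
    text = text.replace("\r\n", "\n")
    pages = []
    for page in text.split(PAGE_BREAK):
        # A blank line (two consecutive newlines) separates paragraphs.
        paragraphs = [p.strip() for p in page.split("\n\n") if p.strip()]
        pages.append(paragraphs)
    return pages


sample = "Title\r\n\r\nFirst paragraph.\fSecond page."
print(split_ocr_text(sample))
# [['Title', 'First paragraph.'], ['Second page.']]
```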
Click the Open and Read button.
In the Open dialog, select the “PTIFF” directory of the book you are working with.
Click on one of the files, and then press Ctrl+A to select all the TIF files.
Press “Open”, which starts the OCR process. After a few moments of seeming inactivity, you will see the program reading pages.
Processing
Reading the pages takes some time.
Output of the recogniser: the page image alongside the recognised text.
Saving Processed files
Saving the output in RTF, HTM and TXT
Research Areas
Enhancements to the image processing tool.
Error-free character recognizer for Indian languages.
Software to identify various fonts of different languages and to create meta files that can be used for accessing the information.
Most of the Indian languages are not available in printed form. In such cases, speech technologies to convert speech to text in Indian languages, and photographic scanner technologies, may have to be developed.
Trouble Shooting
– Scanning
– Recognition
Trouble Shooting during scanning
The following are frequently encountered problems.
Error message: Bind size is too high
Reason: The size of the original is more than 65mm above the book table.
Action to be taken: Press down the book table. Lower the original so that it is within range of the book table.
Error message: Bind size is not detected
Reason: A tag or bookmark is attached to the original.
Action to be taken: Reset the book or change the sheet. Take off tags and bookmarks.
Trouble Shooting during Recognition
If an image contains a table that is not recognized as a table, do the following.
Select the page that has the table layout. Select the block, set the block type to Table, select Analyze Table and then recognize the page.
You can also draw horizontal and vertical lines if the block does not have lines.
Palm Leaves Setting
Scanning Palm Leaves
The brightness and contrast were varied to find the settings that give the best results. The variations tried are listed below; each was shown with its original image and cropped image.
– Brightness – Normal (5 in scale) and contrast – 0 (in scale)
– Brightness – Light (10 in scale) and contrast – 0 (in scale)
– Brightness – Normal (5 in scale) and contrast – 9 (in scale)
– Brightness – Dark (1 in scale) and contrast – 9 (in scale)
– Brightness – Light (7 in scale) and contrast – 9 (in scale)
Scanning Errors
Interruptions occur during scanning for various reasons:
A tag or bookmark is present in the book.
The book is not properly placed, i.e. it is placed or projects beyond the stopper.
The steel mirror is dirty.
The book exceeds the height of the steel mirror (greater than 65mm).
An error message is displayed when any of the above occurs.
A badly scanned image.
A well scanned image.
Recognition errors due to bad scanning.
Recognition is good when the document is scanned properly.
For further queries / suggestions
mailto: [email protected]