Recognition and Retrieval from Document Images

Recognition and Retrieval from
Document Image Collections
Million Meshesha
(Roll No.: 200299004)
Advisor: Dr. C. V. Jawahar
Centre for Visual Information Technology,
International Institute of Information Technology,
Hyderabad, India
Introduction
• Global effort to digitize and archive large collections of
multimedia data
– Most of them are printed books
• Emergence of large Digital
Libraries like UDL, DLI, etc.
– One million book archival
activities at Mega Scanning
center – IIIT-H
• Involvement of Google,
Yahoo, Microsoft in massive
digitization project

The aim of digitization is easier preservation and making
documents freely accessible worldwide.
Efficient means of accessing the content need to be designed.
The Direct Approach
• Recognition-based access to documents
– Easy to integrate into a standard IR framework
– Success of text image retrieval mainly depends on the
performance of OCRs
Optical Character Recognition
Document Images → Preprocessing and Segmentation → Feature Extraction → Classification → Postprocessing → Text Documents → Database
Query side: Textual Query (optionally cross-lingual) → Search engine over the Database → Retrieval of Text Documents
Challenges
• The state-of-the-art OCR engines recognize
documents printed in Latin and some Oriental
scripts
– with few errors in each page for high quality images
• Unavailability of robust OCRs for indigenous scripts
of African and Indian languages.
• Challenges in developing OCRs for scripts with
complex shape and large number of characters.
• Lack of specialized recognizers for large document
image collections.
• Diversity and quantity of documents archived in
digital libraries.
Alternate Approach: Recognition-Free
Document Images → Preprocessing and Segmentation → Feature Extraction → Clustering and Indexing → Database
Query side: Textual Query (optionally cross-lingual) → Rendering → Search engine over the Database → Retrieval of Document Images
Comparison of the Two Approaches
Recognition-based | Recognition-free
Needs recognition before retrieval | Retrieve without explicit recognition
e.g. Text search engines | e.g. CBIR, CBVR
Less offline processing (excluding recognition) | High offline processing
Fast and efficient algorithms | Slow & inefficient schemes
Compact representation | Bulky representation
Content/language dependent | More of content/language independent
Challenging to build (because of recognizers) | Relatively easy to build with certain level of acceptable performance
Review of OCR Systems
• Conventional OCRs follow sequential steps:
– Preprocessing: thresholding, normalization, skew detection/correction, noise removal algorithms
– Document Layout Analysis: text/image block identification, geometric layout analysis
– Segmentation: line segmentation, word segmentation, component analysis
– Feature Extraction: structural features like shape, contour etc.; transformation domain features like DFT, DCT; global and local features
– Classification: Bayesian statistical classifier, SVM classifier, neural network
– Post Processing: lexical information, dictionary and punctuation rules, statistical information
H. Baird, “Anatomy of a Versatile Page Reader”, Proc. of IEEE, Vol. 80, no. 7, July 1992.
M. Bokser, “Omnidocument Technologies”, Proc. of IEEE, Vol. 80, no. 7, July 1992.
Review of Recognition-Free
• Manmatha et al.:
– Proposed the word spotting idea for matching word images from handwritten historical manuscripts.
– Used dynamic time warping (DTW) for word image matching.
– Selected profile features for matching handwritten word images.
• Chaudhury et al.:
– Exploited the structural characteristics of the Indian scripts to access them at word level.
– Employed geometric features, and suffix trees for indexing.
• Srihari et al.:
– Spotting words from document images of Devanagari, Arabic and Latin.
– Used Gradient, Structural and Concavity (GSC) features.
– Implemented a correlation similarity measure for word spotting.
• Trenkle and Vogt:
– Experimented on word level image matching.
– Extracted features at the baseline, concavities, line segments, junctions, dots and stroke directions and computed a distance metric.
• A. K. Jain and Anoop M. Namboodiri:
– Employed DTW based word-spotting for indexing and retrieval of on-line documents.
– Extracted features such as the height of the sample point, direction and curvature of strokes.
T. Rath and R. Manmatha, "Word Image Matching Using Dynamic Time Warping", Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2, pp. 521-527, 2003.
Santanu Chaudhury, Geetika Sethi, Anand Vyas and Gaurav Harit, "Devising Interactive Access Techniques for Indian Language Document Images", Proc. of the Seventh International Conference on Document Analysis and Recognition (ICDAR), 2003, pp. 885-889.
S. N. Srihari, H. Srinivasan, C. Huang and S. Shetty, "Spotting Words in Latin, Devanagari and Arabic Scripts," Vivek: Indian Journal of Artificial Intelligence.
J. M. Trenkle and R. C. Vogt, "Word Recognition for Information Retrieval in the Image Domain", Symposium on Document Analysis and Information Retrieval, pp. 105-122, 1993.
A. K. Jain and Anoop M. Namboodiri, "Indexing and Retrieval of On-line Handwritten Documents", Proc. of the Seventh International Conference on Document Analysis and Recognition (ICDAR), 2003, pp. 655-659.
Major Contributions
1. Study indigenous African scripts for document understanding
• First attempt to introduce the challenges of recognizing and
retrieving indigenous African scripts.
2. Design an OCR for recognizing printed Amharic documents
• Tested on real-life document images (books, magazines and newspapers).
3. Propose an architecture for a self-adaptable book recognizer
• Demonstrated on document images of books.
4. Propose efficient matching and feature extraction schemes
• Performance analyzed on datasets of word-form variants, degradations
and printing variations in word images.
5. Construct an indexing scheme by applying IR principles for
efficient searching in document images
• Efficiency evaluated on document images of books and newspapers.
African Scripts
• Africa is the 2nd largest continent in the world, next to Asia.
• There are around 2500 languages spoken in Africa, which
are either:
– Installed by conquerors of the past and use a modification
of the Latin and Arabic scripts.
– Indigenous languages with their own scripts.
E.g. Amharic (Ethiopia), Vai (West Africa), Bassa
(Liberia), Mende (Sierra Leone), etc.
• Most are not used as official languages
• Their existence is not known by most researchers
• Characters are complex in shape
• Document image analysis and understanding research is
very limited for indigenous African scripts.
– Few attempts are available for Amharic scripts.
– Other indigenous scripts are not yet studied
[Figure: sample characters of the Bassa, Vai and Mende scripts]
Amharic Language/Script
• Large number of characters
– More than 300 characters
• Vowel formation
• Existence of visually similar
characters
• Frequently occurring
characters
• Amharic word morphology
– Have rich word morphology
• Amharic (like Hindi) is a verb-final language; modifiers
usually precede the nouns they modify.
– the word order in English sentences is Subject-Verb-Object
– the word order in Amharic and Hindi is Subject-Object-Verb
Recognition from “A” Document Image
Amharic OCR is developed on top of an OCR for
Indian Languages.
• Preprocessing
– Binarization: convert gray pixels into binary.
– Skew detection and correction: ensure that the page is aligned properly.
– Noise removal: remove artifacts in the image.
• Segmentation
– Line segmentation: identify lines in a text.
– Word segmentation: identify words in a text line.
– Character segmentation: detect each character from a segmented word.
• Feature extraction
– Consider the entire component image as a feature.
– PCA: used for dimensionality reduction; reduces to the character/connected components sub-space.
– LDA: extracts optimal discriminant vectors and reduces to the classification sub-space.
• Classification
– DDAG based architecture for multi-class SVMs.
– Support Vector Machines (SVMs) at each node.
[Figure: DDAG for four characters 1, 2, 3 and 4. Root node (1,4); second level (1,3) and (2,4); third level (1,2), (2,3) and (3,4); leaves 1, 2, 3, 4]
D. H. Foley and J. W. Sammon. An optimal set of discriminant vectors. IEEE Trans. on Computing, 24:271-278, 1975.
C. V. Jawahar, MNSSK Pavan Kumar, SS Ravi Kiran: A Bilingual OCR for Hindi-Telugu Documents and its Applications. ICDAR 2003: 408-412
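The DDAG evaluation mentioned above can be sketched as follows. This is a minimal illustration, not the system's actual code: the names `ddag_predict` and `make_toy_pairwise` are hypothetical, and a nearest-centroid rule on one feature stands in for the trained pairwise SVM at each node.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

# One binary decision per DDAG node: given a sample, return the winning label.
PairwiseClassifier = Callable[[Sequence[float]], int]


def ddag_predict(x: Sequence[float],
                 classes: List[int],
                 pairwise: Dict[Tuple[int, int], PairwiseClassifier]) -> int:
    """Walk the DDAG: each node compares the first and last remaining class
    and eliminates the loser, so N classes need only N - 1 evaluations."""
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        winner = pairwise[(first, last)](x)
        if winner == first:
            remaining.pop()       # 'last' is eliminated
        else:
            remaining.pop(0)      # 'first' is eliminated
    return remaining[0]


def make_toy_pairwise(centroids: Dict[int, float]) -> Dict[Tuple[int, int], PairwiseClassifier]:
    """Assumption for illustration: a nearest-centroid rule on one feature
    replaces the SVM that would normally sit at each node."""
    def build(a: int, b: int) -> PairwiseClassifier:
        def decide(x: Sequence[float]) -> int:
            return a if abs(x[0] - centroids[a]) <= abs(x[0] - centroids[b]) else b
        return decide
    labels = sorted(centroids)
    return {(a, b): build(a, b) for a, b in combinations(labels, 2)}


centroids = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}
pairwise = make_toy_pairwise(centroids)
predicted = ddag_predict([2.1], [1, 2, 3, 4], pairwise)
```

With four classes the walk visits nodes (1,4), (2,4) and (3,4) for this sample, which is three binary decisions instead of the six needed to evaluate all pairs.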
Experimental Results
LaserJet printouts
Document | Accuracy (%)
Fonts (PowerGeez, Visual Geez, Agafari, Alpas) | 96.51
Sizes (10, 12, 14, 16) | 98.49
Styles (Normal, Bold, Italic) | 95.65
Real-life documents
Document | Accuracy (%)
Books | 91.45
Newspapers | 88.23
Magazines | 90.37
[Figure: typical degradations in real-life documents: blob, cut, merge]
Comments
• Present day OCRs do not improve
the performance over time.
– Performance on the first and last
pages of the book are statistically
identical.
• OCRs are designed to convert a
single document image into a textual
representation.
• Omni-font OCRs are rare even for
English.
– Performance degrades with quality,
unseen fonts, etc.
OCR for a collection (e.g. a book) has to be different
from OCR designed for an isolated single page.
Can we design a recognizer for document image
collections, say a book recognizer?
Our Strategy
• Enable the OCR to learn from its experience, using feedback from
the postprocessor during normal operation.
– The conventional open-loop system of classifier followed by postprocessor is closed.
• Learns from both correctly classified and misclassified
examples.
• Extends knowledge gained from one page to other pages
– Iterates and perfects on a page (a set of pages).
• Improves its performance over time on document image
collections that vary in font, size, style and quality.
Apply machine learning procedures to build an
intelligent OCR
Comparison
Conventional OCRs
• Designed for a single page
• No feedback; top-down serial process
• Failures are costly: any error at an intermediate level results in
wrong system output
• Offline training
• Performance is static or declines
Our new approach: Book recognizer
• Designed for multiple pages
• Feedback based flexible design
• Any error at an intermediate level can be corrected by using
proper feedback.
• More of online learning
• Performance improves over time
Self adaptable OCR Design
[Block diagram: Document Images → Recognizer (Classifier, Model Base) → Post Processor → Recognized Texts, with a feedback loop through Sampler, Validator, Labeler and Sample Database. Arrow labels: selected samples, filtered samples, rejected samples, labeled samples, refined samples.]
• Recognizer: incremental learning; new samples are passed for training.
• Post Processor: produces error-corrected words; such words are
candidates for feedback.
• Validator: detection of outliers; validation in image space.
• Labeler: labels unlabelled data; adds samples to their proper class.
Learning online
• Experiment on a poor quality book
[Figure: accuracy over iterations. Initial accuracy = 65.24%; 2nd iteration = 88.24%; more iterations = 91.08% and 94.82%; final accuracy = 95.26%]
• Initial accuracy was very low, less than 70%.
• Within a few iterations of learning, the recognition accuracy
improved to nearly 96%.
Results on font and style variations
Further Issue
• OCR is a long-term solution.
– Needs some time to come up with a workable
system.
• But our problem is immediate.
– A number of documents are already archived and
ready for use.
Can we access the content of document images
without explicit recognition?
Word Spotting
Query: University
Collection | Matching Score
Professor | 10.38
Alexander | 14.44
Smith | 12.21
until | 9.32
recently | 16.43
head | 17.34
chemistry | 14.56
Columbia | 15.10
University | 0.51
American | 18.71
Chemical | 14.32
Society | 12.13
died | 19.11
native | 18.10
Word Search by Word Spotting
Query (e.g. "Christian") → Render → Feature Extraction → Matching
Efficient Matching Scheme
• Matching techniques:
– Cross Correlation
– Dynamic Time Warping (DTW)
• Aligns and finds the best match between pairs of
word images with different size.
• Trace back to identify the optimal warping path (OWP)
Method | Recall | Precision | F-score
DTW | 89.58% | 90.81% | 90.19%
Cross correlation | 76.43% | 78.83% | 77.61%
Performance analysis shows that DTW outperforms
Cross correlation
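A minimal sketch of DTW matching with trace-back of the optimal warping path, assuming each word image has already been reduced to a sequence of per-column feature vectors. The function name `dtw_match`, the Euclidean local cost and the three-neighbour step pattern are assumptions for illustration, not details given in the slides.

```python
import numpy as np


def dtw_match(a, b):
    """Align two feature sequences of possibly different length.

    a, b: arrays of shape (n, d) and (m, d), or 1-D sequences of scalars.
    Returns (score, path): the total cost normalized by the length of the
    optimal warping path, and the path as (index_in_a, index_in_b) pairs.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)

    # Local cost: Euclidean distance between every pair of columns.
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # Cumulative cost with an extra border row/column set to infinity.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = local[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])

    # Trace back from the end to the start to recover the warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(D[i - 1, j - 1], i - 1, j - 1),
                      (D[i - 1, j], i - 1, j),
                      (D[i, j - 1], i, j - 1)]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return D[n, m] / len(path), path


score_same, path_same = dtw_match([1, 2, 3], [1, 2, 2, 3])   # same shape, stretched
score_diff, path_diff = dtw_match([0, 0], [1, 1])            # constant offset of 1
```

Because the sequences are warped against each other, a word and a horizontally stretched copy of it still match at zero cost, which is the property that makes DTW tolerant to size differences.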
Challenges in Word Image Search
• Degradation of documents
– Cuts, blobs, salt and pepper, erosion of border pixels, etc.
• Print variations
– A word image may vary in size, style, font and quality.
• Morphological variation
– A word may have different variants.
“Stemming” of Word Images
• Two possible variants of a word:
(i) formed by adding a prefix and/or suffix to the root word,
e.g. ‘connect’: ‘connects’, ‘connecting’, ‘reconnect’, …
(ii) synonymous words, e.g. ‘connect’: ‘join’, ‘attach’, …
• It is observed that most word-form variations
take place either at the beginning or at the end.
• This needs a matching algorithm which can “penalize”
mismatches at the beginning or at the end.
We propose a novel DTW-based partial matching
scheme
DTW-based Morphological Matcher
Partition the OWP (with length L) into beginning, middle
and end regions of length k (L/3) each
for i = 1 to k do
    if there is matching cost concentration at the beginning
        reduce extra cost from the total matching score
    else break
end for
for i = L down to 2k do
    if there is matching cost concentration at the end
        reduce extra cost from the total matching score
    else break
end for
Normalize the matching score by the length of the
optimal warping path.
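The steps above can be sketched in Python under two assumptions that the slide does not specify: "matching cost concentration" is read as a run of steps whose local cost exceeds a fixed `threshold`, and "reduce extra cost" is read as subtracting those steps' costs entirely. The function name `partial_match_score` is hypothetical.

```python
def partial_match_score(step_costs, threshold):
    """Score a match from the local costs along the optimal warping path.

    step_costs: local matching cost at each step of the path (length L).
    threshold:  assumed cut-off above which a step counts as a mismatch.
    High-cost runs touching the very beginning or very end of the path are
    discounted; mismatches in the middle region are left untouched.
    """
    L = len(step_costs)
    if L == 0:
        return 0.0
    k = L // 3
    total = float(sum(step_costs))

    # Beginning region: walk forward while the cost stays concentrated.
    for i in range(k):
        if step_costs[i] > threshold:
            total -= step_costs[i]
        else:
            break

    # End region: walk backward from the last step.
    for i in range(L - 1, L - 1 - k, -1):
        if step_costs[i] > threshold:
            total -= step_costs[i]
        else:
            break

    # Normalize by the length of the optimal warping path.
    return total / L


prefix_suffix = partial_match_score([5, 5, 0, 0, 0, 0, 0, 0, 4], threshold=1)
middle_only = partial_match_score([0, 0, 0, 9, 0, 0, 0, 0, 0], threshold=1)
```

In this toy run a word that differs only by a prefix and suffix scores as a perfect match, while the same amount of mismatch in the middle of the word is still counted.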
Performance of partial matching
Item | Before: Recall / Precision / F-score | After: Recall / Precision / F-score
Font | 83.35 / 91.83 / 86.95 | 95.90 / 98.20 / 97.03
Size | 87.38 / 91.39 / 89.30 | 96.80 / 99.42 / 98.09
Style | 75.62 / 80.25 / 77.84 | 88.94 / 94.73 / 91.69
Degradation | 85.82 / 88.49 / 87.04 | 91.74 / 96.26 / 93.92
Degraded Words
Complex script
Salt and Pepper
Blobs
Cuts
Historic documents
Degradation Modeling
• Cuts and breaks
• Blobs
• Salt and pepper
• Erosion of boundary pixels
We built datasets using our degradation models for
English, Hindi and Amharic.
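A minimal sketch of the four degradation types on a binary word image (1 = ink, 0 = background). These are simple stand-ins written for illustration, not the degradation models used to build the datasets; all function names and parameters are assumptions.

```python
import numpy as np


def salt_and_pepper(img, fraction, rng):
    """Flip a random fraction of pixels (ink becomes background and vice versa)."""
    out = img.copy()
    mask = rng.random(img.shape) < fraction
    out[mask] = 1 - out[mask]
    return out


def add_cut(img, row, thickness=1):
    """Erase a horizontal band of ink, imitating a cut or break."""
    out = img.copy()
    out[row:row + thickness, :] = 0
    return out


def add_blob(img, center, radius):
    """Paint a filled disc of ink, imitating an ink blob."""
    out = img.copy()
    cy, cx = center
    yy, xx = np.ogrid[:img.shape[0], :img.shape[1]]
    out[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 1
    return out


def erode_boundary(img):
    """Keep an ink pixel only if its four neighbours are also ink."""
    p = np.pad(img, 1, constant_values=0)
    return img & p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]


# A 5x5 ink square inside a 7x7 image serves as a toy "word".
word = np.zeros((7, 7), dtype=np.uint8)
word[1:6, 1:6] = 1

rng = np.random.default_rng(0)
noisy = salt_and_pepper(word, fraction=0.05, rng=rng)
cut = add_cut(word, row=3)
blob = add_blob(np.zeros((7, 7), dtype=np.uint8), center=(3, 3), radius=1)
eroded = erode_boundary(word)
```

Each function returns a new image, so the degradations can be chained to produce several degraded variants of the same clean word.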
Invariant Feature Selection
• Investigate various features:
– Profiles (upper, lower, projection, transition)
– Statistical moments (mean, standard deviation, skew)
– Region-based moments (zero-order moment, first-order moment,
central moment)
– Transform-domain (Fourier) representations
• Global vs. Local Features
– Global features: compute a single value.
– Local features: compute 1D representation following vertical strips of a
word.
– Local features perform better than global features
Features | Recall | Precision | F-score
Global features | 53.32% | 50.24% | 51.73%
Local features | 82.92% | 80.53% | 81.71%
• For better performance combine local features of profiles,
moments and transform domain representations
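A minimal sketch of local profile features, computed one vertical strip (here, one pixel column) at a time on a binary word image with 1 = ink. The function name `profile_features` and the values used for ink-free columns are assumptions for illustration.

```python
import numpy as np


def profile_features(img):
    """Return an array of shape (width, 4): one feature vector per column.

    Columns of the result: upper profile (row of the first ink pixel from
    the top), lower profile (row of the last ink pixel), projection profile
    (number of ink pixels) and transition profile (number of ink/background
    changes). Assumption: an ink-free column gets upper = height and
    lower = -1.
    """
    img = np.asarray(img).astype(bool)
    h, _ = img.shape
    has_ink = img.any(axis=0)

    upper = np.where(has_ink, img.argmax(axis=0), h)
    lower = np.where(has_ink, h - 1 - img[::-1].argmax(axis=0), -1)
    projection = img.sum(axis=0)
    transitions = (img[1:] != img[:-1]).sum(axis=0)

    return np.stack([upper, lower, projection, transitions], axis=1).astype(float)


toy = np.array([[0, 1, 0],
                [0, 1, 1],
                [0, 0, 1],
                [0, 1, 0]])
features = profile_features(toy)
```

The result is a sequence of feature vectors ordered left to right, which is the form a sequence matcher such as DTW expects; a global feature would instead collapse the whole image into a single value.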
Invariant Feature Selection
• To test the performance of combined features the
DTW matching algorithm is modified
• Combined local features of profiles and moments
are invariant to degradations and printing
variations.
Test result on degraded word images
Degradation | Hindi: Recall / Precision / F-score | Amharic: Recall / Precision / F-score | English: Recall / Precision / F-score
Cuts | 92.34 / 92.41 / 92.37 | 93.72 / 94.93 / 94.32 | 93.76 / 88.15 / 90.87
Salt & pepper | 93.28 / 93.17 / 93.20 | 96.88 / 97.11 / 96.99 | 96.56 / 96.02 / 96.29
Blobs | 85.95 / 92.33 / 89.03 | 89.46 / 93.48 / 91.43 | 89.79 / 86.43 / 88.08
Erosion | 92.77 / 92.58 / 92.67 | 94.91 / 95.72 / 95.31 | 92.38 / 93.29 / 92.83
Information Retrieval from
“Document Images”
• Users expect more than just searching for
documents that contain their query word.
– Expectations are shaped by the popularity of text search.
• Retrieve relevant documents in ranked order.
• Remove effects of stopwords in the retrieval
process.
• Fast search and efficient delivery of documents.
• How can we meet users' requirements?
Construct an indexing scheme to organize word
images following IR principles.
Mapping IR techniques for
Document Image Retrieval
Modules | Purpose | Algorithm(s) used: Text search engine | Algorithm(s) used: Current work
Stemming | Group word variants | Language modeling, e.g. Porter algorithm | Morphological matching using DTW
Stopword detection | Remove common words | Stop word list | Inverse document frequency (IDF)
Relevance measurement | Rank documents | Term frequency (TF) | Modified TF/IDF
Clustering | Group index terms | --- | Improved hierarchical clustering
Indexing data structure | Organize index lists | Inverted index and signature file | Inverted index
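A minimal sketch of how stopword detection, relevance ranking and an inverted index fit together, assuming word images have already been grouped into clusters so that each page is a list of cluster identifiers. It uses the textbook TF/IDF weighting, not the modified TF/IDF of this work (whose exact form is not given here); `build_inverted_index` and `min_idf` are hypothetical names.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs, min_idf=0.1):
    """Build term -> [(doc_id, weight), ...] sorted by decreasing weight.

    docs:    mapping doc_id -> list of index terms (cluster identifiers).
    min_idf: assumed cut-off; terms with IDF below it occur in almost every
             document and are dropped as stopwords.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))

    index = defaultdict(list)
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            if idf < min_idf:
                continue                      # stopword: too common to be useful
            weight = (count / len(terms)) * idf
            index[term].append((doc_id, weight))

    # Rank the postings of every term by relevance.
    for postings in index.values():
        postings.sort(key=lambda posting: posting[1], reverse=True)
    return dict(index)


# Toy pages; readable strings stand in for word-image cluster identifiers.
pages = {
    "p1": ["the", "ocr", "ocr"],
    "p2": ["the", "dtw"],
    "p3": ["the", "ocr", "dtw", "dtw"],
}
index = build_inverted_index(pages)
ranked_for_ocr = [doc_id for doc_id, _ in index["ocr"]]
```

A query then only needs to be matched against one template per cluster; the postings list of the matched cluster gives the ranked pages directly.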
Indexing Document Images
Word Images → IR Measures and Clustering (Stemming, Stopword Detection, Relevance Measure) → Template (Keywords) as index terms → Index list → Inverted Indexing
Clustered English Words
Clustered words vary in: fonts, sizes, styles, forms, quality
Clustered Amharic Words
Test results on datasets of the
various fonts, sizes and styles
Type | Hindi: Recall / Precision / F-score | Amharic: Recall / Precision / F-score
Fonts | 91.28 / 92.85 / 91.88 | 93.05 / 93.87 / 93.45
Styles | 83.29 / 84.01 / 83.64 | 89.09 / 89.80 / 89.44
Sizes | 94.59 / 96.94 / 95.74 | 95.99 / 96.34 / 96.26
Styles: Normal, Bold, Italic
Sizes: 10, 12, 14, 16
Fonts: PowerGeez, VisualGeez, Agafari, Alpas
Performance: Precision vs. Recall
graph
• The graph shows the effectiveness of our scheme
• It increases both precision and recall, moving the entire curve up and out
to the right.
Concluding Remarks
• African scripts
– Introduced indigenous African scripts for the first time
– Initial attempt to recognize Amharic documents gave good results, which can be
extended to other indigenous African scripts.
– Engineering effort is needed to make it applicable to real-life situations
• Recognizer design
– New attempt to propose a self-adaptable recognizer for document image collections
with the help of machine learning algorithms
– Encouraging results for developing a recognizer for large document image
collections
– Further work is needed to extend the framework to many of the complex Indian
and African scripts
• Document image indexing and retrieval
– Proposed a DTW-based partial matching scheme to perform morphological matching
– Designed a feature extraction scheme invariant to degradation and printing variations
– Applied IR principles, and constructed a clustering and indexing scheme.
– System-related issues need to be solved for practical online retrieval from a large corpus
Million Meshesha and C. V. Jawahar, “Indigenous Scripts of African Languages”, African Journal of Indigenous Knowledge Systems, Vol. 6, No 2, pp. 132 - 142, 2007.
Million Meshesha and C. V. Jawahar, “Optical Character Recognition of Amharic Documents”, African Journal of Information and Communication Technology, Vol. 3, No. 2, pp. 53 - 66, June 2007.
Million Meshesha and C. V. Jawahar, “Self-Adaptable Recognizer for Document Image Collections”, In Proc. of Int. Conf. on Pattern Recognition and Machine Intelligence (LNCS), 2007.
Million Meshesha and C. V. Jawahar, “Matching Word Images for Content-based Retrieval from Printed Document Images”, International Journal of Document Analysis and Recognition (IJDAR) (in press).
Million Meshesha and C. V. Jawahar, “Indexing Word Images for Recognition-free Retrieval from Printed Document Databases”, Information Sciences: An International Journal (revised & submitted).
Scope for Future Work
• Develop an online system for searching hundreds
of books over the Web
• Recognition and retrieval of complex documents
(such as camera-based, handwritten, etc.).
• Apply advanced image preprocessing techniques
to enhance image quality for large collection of
document images.
• Retrieval of documents in presence of OCR errors
and scope for hybrid approaches.
Publications: Conference Papers
• Million Meshesha and C. V. Jawahar, “Self-Adaptable Recognizer for
Document Image Collections", In Proc. of Int. Conf. on Pattern Recognition
and Machine Intelligence (LNCS), 2007.
• A. Balasubramanian, Million Meshesha, C. V. Jawahar, “Retrieval from
Document Image Collections", In Proceedings of 7th IAPR Workshop on
Document Analysis Systems (DAS), Nelson, New Zealand, (LNCS 3872),
2006, pp 1-12.
• Sachin Rawat, K. S. Sesh Kumar, Million Meshesha, Indiraneel Deb
Sikdar, A. Balasubramanian and C. V. Jawahar, “Semi-automatic Adaptive
OCR for Digital Libraries", In Proceedings of 7th IAPR Workshop on
Document Analysis Systems (DAS), Nelson, New Zealand, (LNCS 3872),
2006, pp 13-24.
• K. Pramod Sankar, Million Meshesha, C. V. Jawahar, “Annotation of
Images and Videos based on Textual Content without OCR", In Workshop
on Computation Intensive Methods for Computer Vision, Part of 9th
European Conference on Computer Vision (ECCV), Austria, 2006.
• Million Meshesha and C. V. Jawahar, “Recognition of Printed Amharic
Documents", In Proceedings of 8th International Conference of Document
Analysis and Recognition (ICDAR), Seoul, Korea, Sep 2005, Volume 1, pp
784-788
• C. V. Jawahar, Million Meshesha, A. Balasubramanian, “Searching in
Document Images", In Proceedings of Indian Conference on Computer
Vision, Graphics and Image Processing (ICVGIP), 2004, pp. 622-627.
Publications: Journal Articles
• Million Meshesha and C. V. Jawahar, “Matching Word Images for
Content-based Retrieval from Printed Document Images”, International
Journal of Document Analysis and Recognition (IJDAR) (in press).
• C. V. Jawahar, A. Balasubramanian, Million Meshesha and Anoop
Namboodiri, “Retrieval of Online Handwriting by Synthesis and Matching”,
Pattern Recognition (in press).
• Million Meshesha and C. V. Jawahar, “Optical Character Recognition of
Amharic Documents”, African Journal of Information and Communication
Technology, Vol. 3, No. 2, pp. 53 - 66, June 2007.
• Million Meshesha and C. V. Jawahar, “Indigenous Scripts of African
Languages”, African Journal of Indigenous Knowledge Systems, Vol. 6,
No 2, pp. 132 - 142, 2007.
• Million Meshesha and C. V. Jawahar, “Indexing Word Images for
Recognition-free Retrieval from Printed Document Databases”,
Information Sciences: An International Journal (revised & submitted).
Thank you
