Slides - CVIT - IIIT Hyderabad

Content Level Access to Digital
Library of India Pages
Praveen Krishnan, Ravi Shekhar, C.V. Jawahar
CVIT, IIIT Hyderabad
IIIT Hyderabad
Digital Library of India (DLI)
http://www.dli.iiit.ac.in/
Vision: To enhance access to information and knowledge for the masses.
• Partner in the Million Book Universal Digital Library Programme.
Information for people
Dataset for researchers
Vamshi Ambati, N. Balakrishnan, Raj Reddy, Lakshmi Pratha, C. V. Jawahar: The Digital Library of India Project: Process, Policies and Architecture, ICDL, 2007.
Digital Library of India (DLI)
Content
Statistics
• #Books: 4 lakh
• #Pages: 134 million
• #Words: 26 billion
Languages
• 41 different languages
• Includes
- Hindi, Telugu, Marathi, ...
- English, French, Greek, ...
Source: http://www.new1.dli.ernet.in/
Digital Library of India (DLI)
Metadata search
• Supports metadata-based search.
• No content-level access.
Example query: Indian freedom struggle and independence
Digital Library of India (DLI)
• Need content-level access
• Content + metadata
• Reliable text representation?
Example query: Indian freedom struggle and independence
Goal
Digital Library of India Search
• Build a search engine with support for Indian languages.
• Word Spotting
Goal
Indian Language Document Search Engine
Example query: शिवाजी और मराठा साम्राज्य ("Shivaji and the Maratha Empire")
• Text query support
• Multi-keyword support
• Ranks based on number of occurrences
• Semantically related words
• Seamless scaling to billions of word images
• Sub-second retrieval
Text from OCR
Sample pages:
- Hindi: Title - Praachin Bhaartiy Vichaar Aur Vibhutiyaan, Published in 1624
- Telugu: Title - Andhra Vagmayaramba Dasha, Published in 1960
Issues highlighted on the pages:
• Cuts
• Merges
• Variations in script, font and typesetting
Text from OCR
[Charts: Char %, Word % and Search % for Hindi and Telugu [1]]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.
BoVW for Image Retrieval
[Figure: text retrieval and image recognition side by side; a query image yields ranked retrieved results]
• Fixed-length representation
• Invariant to common deformations
Josef Sivic, Andrew Zisserman: Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
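A minimal sketch of the fixed-length representation mentioned above, assuming a visual vocabulary (cluster centres) has already been learned. Toy 2-D points stand in for real local descriptors, and the names `nearest_word`, `bovw_histogram` and `cosine_similarity` are illustrative, not taken from the slides.

```python
import math
from collections import Counter

def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster centre) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(descriptor, vocabulary[i]))

def bovw_histogram(descriptors, vocabulary):
    """Quantise each descriptor, count visual words, and L1-normalise the counts."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    total = sum(counts.values()) or 1
    return [counts.get(i, 0) / total for i in range(len(vocabulary))]

def cosine_similarity(h1, h2):
    """Cosine similarity between two histograms of equal length."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0

# Assumption: a 3-word vocabulary and 2-D descriptors, chosen only for brevity.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
query = bovw_histogram([(0.1, 0.0), (0.9, 0.1), (1.1, -0.1)], vocabulary)
database = {
    "word_a": bovw_histogram([(0.0, 0.1), (1.0, 0.1), (0.9, 0.0)], vocabulary),
    "word_b": bovw_histogram([(0.1, 0.9), (0.0, 1.1)], vocabulary),
}
ranked = sorted(database,
                key=lambda name: cosine_similarity(query, database[name]),
                reverse=True)
print(ranked)
```

Every image, whatever its number of keypoints, ends up as a vector with one entry per visual word, which is what allows text-retrieval machinery to be reused.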
BoVW for Document Image Retrieval
[Figure: histogram of visual words for a clean word image, a word image with cuts, and a word image with merges]
R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
BoVW for Document Image Retrieval
• Robust against degradation
• Lost geometry
• Use spatial verification
– SIFT based.
– Longest common subsequence alignment.
[Figure: visual words (V1, V2, V4, ...) plotted on x-y axes for Clean, Cuts and Merge word images]
R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
I. Z. Yalniz and R. Manmatha. An Efficient Framework for Searching Text in Noisy Document Images. In DAS, 2012.
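One way to picture subsequence-based spatial verification is to reduce each word image to its visual-word IDs read left to right and score a candidate by the longest common subsequence it shares with the query. The sketch below makes that assumption; the keypoint positions and IDs are hypothetical, and the normalisation by query length is an illustrative choice rather than the method of the cited papers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def ordered_words(keypoints):
    """keypoints: list of (x_position, visual_word_id); return IDs sorted left to right."""
    return [word for _, word in sorted(keypoints)]

def spatial_score(query_kps, candidate_kps):
    """Fraction of the query's visual words matched in the same left-to-right order."""
    q, c = ordered_words(query_kps), ordered_words(candidate_kps)
    return lcs_length(q, c) / max(len(q), 1)

# Hypothetical keypoints: (x position, visual word id).
clean = [(0.5, 1), (1.0, 2), (1.5, 6), (2.0, 4), (2.5, 8), (3.0, 9)]
cut = [(0.4, 1), (1.1, 2), (2.1, 4), (2.4, 8), (3.1, 9)]                 # one word lost to a cut
shuffled = [(0.5, 9), (1.0, 8), (1.5, 4), (2.0, 6), (2.5, 2), (3.0, 1)]  # same words, wrong order

print(spatial_score(clean, cut), spatial_score(clean, shuffled))
```

A plain histogram cannot tell `clean` from `shuffled`, since both contain the same visual words; the ordering check restores the geometry that the histogram discards.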
Query Expansion
[Figure: the histogram of the query image is used to query the database, returning results Rank 1 to Rank 6; a refined histogram then serves as the query histogram and gives better results]
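The slides do not state how the refined histogram is computed. A common choice, assumed here, is to average the query histogram with the histograms of its top-ranked results and query again. The database contents and the `top_k` value below are hypothetical.

```python
import math

def cosine(h1, h2):
    """Cosine similarity between two histograms of equal length."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0

def search(query_hist, database, top_k):
    """Return the names of the top_k database entries most similar to the query."""
    ranked = sorted(database,
                    key=lambda name: cosine(query_hist, database[name]),
                    reverse=True)
    return ranked[:top_k]

def expand_query(query_hist, database, top_k=2):
    """Average the query histogram with the histograms of its top_k results."""
    top = search(query_hist, database, top_k)
    hists = [query_hist] + [database[name] for name in top]
    return [sum(column) / len(hists) for column in zip(*hists)]

# Hypothetical data: a1, a2, a3 show the query word; b1 is a different word.
database = {
    "a1": [2, 1, 0, 0],
    "a2": [1, 1, 1, 0],
    "a3": [0, 1, 1, 0],
    "b1": [1, 0, 0, 3],
}
query = [1, 0, 0, 0]

first_pass = search(query, database, top_k=4)
refined = expand_query(query, database, top_k=2)
second_pass = search(refined, database, top_k=4)
print(first_pass, second_pass)
```

In this toy case the degraded instance `a3` shares no visual word with the original query, but it does share words with the top results, so it moves above `b1` after expansion.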
Text Query Support
• Originally formulated in a “query by example” setting.
• Need text queries.
[Figure: input query image → histogram; input text query → text query histogram]
Observations
• Are the results of OCR and BoVW complementary?
[Figure: examples comparing OCR and BoVW results]
• mAP vs. word length
[Chart: mAP against number of characters]
• “OCR system has a high precision while BoVW approach has a high recall.”
• Example: #GT = 5
OCR output list: Precision = 1; Recall = 0.4
BoVW output list: Precision = 0.8; Recall = 1
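The contrast in the example can be reproduced with a few lines. The result lists below are hypothetical, since the slides give only the resulting numbers. The two-item OCR list reproduces the slide's values exactly; the six-item BoVW list gives a precision of 5/6 rather than the slide's 0.8, because no whole-number list has precision 0.8 and recall 1 when #GT = 5.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against the ground-truth set."""
    relevant = set(relevant)
    true_positives = sum(1 for item in retrieved if item in relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

ground_truth = ["w1", "w2", "w3", "w4", "w5"]       # #GT = 5
ocr_list = ["w1", "w2"]                             # hypothetical: few results, all correct
bovw_list = ["w1", "w2", "w3", "x1", "w4", "w5"]    # hypothetical: all hits plus one false positive

print(precision_recall(ocr_list, ground_truth))
print(precision_recall(bovw_list, ground_truth))
```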
Fusion
• Fusion techniques: Naïve fusion
– Concatenating OCR results with BoVW results
[Chart: mAP for OCR and BoVW]
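A minimal sketch of naïve fusion, assuming the OCR results are placed first (they tend to be the more precise list) and that BoVW results already returned by OCR are skipped. Both the ordering and the de-duplication are assumptions; the slides only say the lists are concatenated.

```python
def naive_fusion(ocr_results, bovw_results):
    """Concatenate OCR results with BoVW results, keeping the first occurrence of each item."""
    seen = set()
    fused = []
    for item in list(ocr_results) + list(bovw_results):
        if item not in seen:
            seen.add(item)
            fused.append(item)
    return fused

# Hypothetical ranked lists of word-image identifiers.
fused = naive_fusion(["w3", "w1"], ["w1", "w2", "w3", "w7"])
print(fused)
```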
Fusion
• Fusion techniques: Edit distance based fusion
– Reordering BoVW results
– BoVW score
– Modified edit distance cost
[Chart: mAP for OCR and BoVW]
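The slides do not define the modified edit distance cost or how it is combined with the BoVW score. The sketch below is one plausible reading, under stated assumptions: each BoVW result carries the OCR text of its word image, the plain Levenshtein distance is used in place of the modified cost, and the two signals are mixed with a hypothetical weight `alpha`. Latin-script strings are used for readability.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (0 if equal)
        prev = curr
    return prev[-1]

def rerank(query_text, bovw_results, alpha=0.5):
    """Reorder BoVW results; each result is (image_id, bovw_score in [0, 1], ocr_text)."""
    def combined(result):
        _, score, text = result
        distance = edit_distance(query_text, text) / max(len(query_text), len(text), 1)
        return alpha * score + (1 - alpha) * (1 - distance)
    return [result[0] for result in sorted(bovw_results, key=combined, reverse=True)]

# Hypothetical BoVW output for the text query "maratha".
results = [
    ("img1", 0.90, "marble"),
    ("img2", 0.85, "maratha"),
    ("img3", 0.80, "maratna"),   # OCR confused one character
]
reordered = rerank("maratha", results)
print(reordered)
```

The visually similar but textually distant `img1` drops to the bottom, while `img3` survives a single OCR error.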
Fusion
• Fusion techniques: Hybrid fusion
– Re-querying BoVW using OCR retrieved results.
– Using rank aggregation techniques.
[Chart: mAP for OCR and BoVW]
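The slides do not name the rank aggregation technique. As an illustration only, the sketch below re-queries a BoVW index with each OCR-retrieved word image and merges the resulting rankings with a Borda count. The `bovw_search` callable and the toy index are hypothetical.

```python
from collections import defaultdict

def borda_aggregate(ranked_lists):
    """Combine rankings: an item at position p in a list of length n earns n - p points."""
    points = defaultdict(int)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, item in enumerate(ranking):
            points[item] += n - position
    # Highest points first; ties are broken by item name to keep the output deterministic.
    return sorted(points, key=lambda item: (-points[item], item))

def hybrid_fusion(ocr_results, bovw_search):
    """Re-query BoVW with every OCR-retrieved image and aggregate the rankings.

    bovw_search is a hypothetical callable mapping an image id to a ranked list of image ids.
    """
    rankings = [bovw_search(image_id) for image_id in ocr_results]
    return borda_aggregate(rankings)

# Toy stand-in for a BoVW index: image id -> ranked neighbours.
toy_index = {
    "w1": ["w1", "w2", "w3"],
    "w2": ["w2", "w1", "w4"],
}
fused = hybrid_fusion(["w1", "w2"], toy_index.__getitem__)
print(fused)
```

Because the OCR hits are usually correct, using them as image queries lets BoVW reach degraded instances that the OCR text search missed.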
Experimental Results
Experimental Details
• OCR [1]
• Feature Detector
– Harris Interest point detection. [2]
• Feature Descriptor
– SIFT [2]
• Indexing
– Lucene [3]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati, “Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts,” in ICDAR MOCR Workshop, 2011.
[2] http://www.vlfeat.org
[3] http://lucene.apache.org/
Test Bed
DLI Corpus
[Figure: sample word images]
Language       #Books   #Pages   #Words      Annotation
Hindi (HS1)    11       1,000    362,593     Yes
Hindi (HS2)    52       10,196   4,290,864   No
Telugu (TS1)   11       1,000    161,276     Yes
Telugu (TS2)   69       13,871   2,531,069   No
• In addition, we used the fully annotated HP1 and TP1 datasets.
Evaluation Measures
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
TP = True Positive
FP = False Positive
FN = False Negative
• mAP (Mean Average Precision): mean of the area under the precision-recall curve over all queries.
• Precision @ 10: shows how accurate the top 10 retrieved results are.
[Figure: precision-recall curve]
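The ranking measures can be sketched as follows. Average precision is computed here as the mean of the precision values at each relevant hit, a common way to approximate the area under the precision-recall curve; the slides do not say which variant was used, and the ranked lists are hypothetical.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant (k is the denominator)."""
    relevant = set(relevant)
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values measured at the rank of each relevant result."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_items) pairs, one per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

# Hypothetical results for two queries.
run_1 = (["a", "x", "b", "y", "c"], {"a", "b", "c"})
run_2 = (["d"], {"d"})

print(average_precision(*run_1))
print(precision_at_k(run_1[0], run_1[1], k=2))
print(mean_average_precision([run_1, run_2]))
```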
BoVW Search
                         BoVW               BoVW + Query Expansion
Language       #Query    mAP     Prec@10    mAP     Prec@10
Hindi (HP1)    100       62.54   81.30      66.09   83.86
Telugu (TP1)   100       71.13   78         73.08   79.89
Comparison of naïve BoVW with BoVW + Query Expansion
BoVW Search
                         BoVW               BoVW using Text Queries
Language       #Query    mAP     Prec@10    mAP     Prec@10
Hindi (HP1)    100       62.54   81.30      56.32   73.89
Telugu (TP1)   100       71.13   78         69.06   78.83
Comparison of naïve BoVW with BoVW + Text Query Support
                         Naïve              Edit Distance      Hybrid
Language       #Query    mAP     Prec@10    mAP     Prec@10    mAP     Prec@10
Hindi (HP1)    100       75.66   90.7       79.58   90.8       80.37   91.4
Telugu (TP1)   100       76.02   81.2       78.01   81.4       80.23   83.7
Comparative performance of different fusion techniques on HP1 & TP1
                         OCR                BoVW               Fusion
Language       #Query    mAP     Prec@10    mAP     Prec@10    mAP     Prec@10
Hindi (HS1)    100       14.95   62.60      60.55   95.5       68.81   95.6
Telugu (TS1)   100       27.03   62.10      74.38   90.6       78.41   91.9
Performance statistics on DLI Annotated Corpus
Language       #Query    Precision @ N   OCR     BoVW    Fusion
Hindi (HS2)    50        Prec@10         82.03   96.94   97.11
                         Prec@20         75.16   94.83   95.42
                         Prec@30         71.12   92.82   93.16
Telugu (TS2)   50        Prec@10         90.85   99.14   99.14
                         Prec@20         85.42   98.00   98.85
                         Prec@30         80.76   96.38   96.57
Performance statistics on DLI Un-Annotated Corpus
Retrieved Results
Failure Cases
• The word images shown in the figure fail in both OCR and BoVW.
• Reasons:
– (a) A short word image containing a character that is no longer in use.
– (b) A highly degraded word image.
Implementation Details
• Search Engine Development
– An elegant web-based search and retrieval interface.
[Figure: sample retrieved page]
[Chart: Lucene scalability; axis labels "No of Images", "No of Visual Words" and "Time in milliseconds"]
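Indexing visual words with a text search engine works because each word image can be treated as a document whose terms are visual-word IDs. The toy inverted index below illustrates that idea in plain Python. It is not the Lucene API, and the class name, the histogram-intersection scoring and the sample data are all assumptions.

```python
from collections import Counter, defaultdict

class VisualWordIndex:
    """Toy inverted index: visual word id -> {image id: occurrence count}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, image_id, visual_words):
        """Index one word image given the list of visual-word IDs found in it."""
        for word, count in Counter(visual_words).items():
            self.postings[word][image_id] = count

    def search(self, query_words, top_k=10):
        """Score only images sharing a visual word with the query (histogram intersection)."""
        scores = Counter()
        for word, query_count in Counter(query_words).items():
            for image_id, count in self.postings.get(word, {}).items():
                scores[image_id] += min(query_count, count)
        return [image_id for image_id, _ in scores.most_common(top_k)]

index = VisualWordIndex()
index.add("img1", [3, 3, 7, 9])
index.add("img2", [3, 8])
index.add("img3", [1, 2])

hits = index.search([3, 3, 9])
print(hits)
```

Only the posting lists of the query's visual words are read, so images with no word in common (here `img3`) are never touched, which is what keeps lookups fast as the collection grows.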
Search Architecture (Ongoing)
[Figure: architecture diagram with components Search Query, Delegator, OCR Index (web service), BoVW Index (web service), Partial Scores, Fusion, Query Expansion, Ranking (web service) and Ranked Results]
Ongoing Work
• Learn to improve from annotated datasets
– Use of a visual confusion matrix, built from annotated datasets, to improve BoVW results.
• Necessity of costly features for re-ranking
– The images shown in the failure cases would require costly features to be retrieved.
– Use of machine learning algorithms.
• Exploration of features better than SIFT.
Thank You