Document Image Analysis CSE 717 An Introduction
Document Image Analysis
DIA is the theory and practice of recovering the symbol structures of digital images scanned from paper or produced by computer.
DIA is a subfield of digital image processing.
Digital images of natural objects (X-rays, fingerprints, faces, scenery, etc.) are NOT part of DIA.
Digital images of symbolic objects (postal addresses, printed articles, forms, music sheets, engineering drawings, topographic maps) belong to DIA.
Source: Scanners, printers, fax machines, hand!
Incidental text in photos and video: license plates, billboards, subtitles. WWW ??
DIA’s grand goal is to take us to the land of the paperless office.
Document Image Analysis
  Textual Processing
    Optical Character Recognition: Text
    Page Layout Analysis: Skew, blocks, paragraphs
  Graphical Processing
    Line Processing: Lines, curves, corners
    Region and Symbol Processing: Filled regions
Processing: Pixels, Primitives, Structures, Documents, Corpus
Document Image Analysis
Text Preprocessing
Representation, Noise removal, binarization, skew, script id, font id
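Binarization, listed above, can be illustrated with a global threshold. The sketch below uses Otsu's method, which is a swapped-in choice (the slides do not name a specific technique). It assumes an 8-bit grayscale page stored as a list of rows, with dark ink on a light background; the function names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes between-class variance.

    `pixels` is a flat iterable of integer gray levels in 0..255.
    Pixels <= t form the dark class, pixels > t the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1/total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map a grayscale image (list of rows) to 1 = ink, 0 = background."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


if __name__ == "__main__":
    page = [[10, 12, 200],
            [11, 205, 210]]
    t = otsu_threshold(p for row in page for p in row)
    print(binarize(page, t))
```

A single global threshold is only a reasonable first step for evenly lit scans; pages with shading or bleed-through usually need a locally adaptive threshold instead.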
Graphics Preprocessing
Representation, Noise removal, binarization, thinning, vectorization
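Thinning reduces strokes to one-pixel-wide skeletons before vectorization. As a minimal sketch, the code below uses the Zhang-Suen algorithm, a common choice that the slides do not specifically name. It assumes a binary image given as a list of rows with 1 = ink and 0 = background; the function name `zhang_suen_thin` is hypothetical.

```python
def zhang_suen_thin(binary):
    """Return a thinned copy of a binary image (1 = ink, 0 = background)."""
    img = [list(row) for row in binary]
    rows = len(img)
    cols = len(img[0]) if rows else 0
    # Neighbour offsets P2..P9, clockwise starting from north.
    offsets = [(-1, 0), (-1, 1), (0, 1), (1, 1),
               (1, 0), (1, -1), (0, -1), (-1, -1)]

    def neighbours(y, x):
        # Pixels outside the image are treated as background.
        return [img[y + dy][x + dx]
                if 0 <= y + dy < rows and 0 <= x + dx < cols else 0
                for dy, dx in offsets]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for y in range(rows):
                for x in range(cols):
                    if img[y][x] != 1:
                        continue
                    n = neighbours(y, x)
                    p2, p3, p4, p5, p6, p7, p8, p9 = n
                    b = sum(n)  # number of ink neighbours
                    # Number of 0 -> 1 transitions going once around the pixel.
                    a = sum(1 for i in range(8)
                            if n[i] == 0 and n[(i + 1) % 8] == 1)
                    if step == 0:
                        cond = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        cond = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_clear.append((y, x))
            # Deletions are applied only after the whole sub-iteration.
            for y, x in to_clear:
                img[y][x] = 0
            if to_clear:
                changed = True
    return img


if __name__ == "__main__":
    bar = [[0] * 9] + [[0] + [1] * 7 + [0] for _ in range(3)] + [[0] * 9]
    for row in zhang_suen_thin(bar):
        print("".join("#" if v else "." for v in row))
```

The skeleton can then be traced into line and curve segments, which is the vectorization step named on the same slide.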
Glyph Recognition
Connected components, strokes, punctuation marks, words
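Connected components are the usual starting point for glyph recognition. The following is a minimal sketch of component labeling by breadth-first flood fill, assuming a binary image given as a list of rows with 1 = ink. The function name `connected_components` and the bounding-box format `(top, left, bottom, right)` are illustrative choices, not taken from the slides.

```python
from collections import deque


def connected_components(binary, connectivity=8):
    """Label ink regions; return (label image, list of bounding boxes)."""
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    if connectivity == 8:
        neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                      (0, 1), (1, -1), (1, 0), (1, 1)]
    else:
        neighbours = [(-1, 0), (0, -1), (0, 1), (1, 0)]

    labels = [[0] * cols for _ in range(rows)]  # 0 means "unlabeled"
    boxes = []
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or labels[r][c]:
                continue
            # Start a new component at the first unlabeled ink pixel.
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 1 and labels[ny][nx] == 0):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return labels, boxes


if __name__ == "__main__":
    page = [[1, 1, 0, 0, 0],
            [1, 0, 0, 1, 0],
            [0, 0, 0, 1, 0],
            [0, 0, 0, 0, 1]]
    _, boxes = connected_components(page)
    print(boxes)
```

With 8-connectivity, diagonal strokes stay in one component; 4-connectivity splits them, which matters for thin or broken characters.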
Text Recognition
Word segmentation, text line reconstruction, table analysis, linguistics
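One simple way to find text lines is a horizontal projection profile: count ink pixels per row and cut where the count drops to zero. This is a sketch of that one technique, not necessarily the method used in the course. It assumes a deskewed binary page (1 = ink) given as a list of rows; the function name `segment_text_lines` and the `min_ink` parameter are hypothetical.

```python
def segment_text_lines(binary, min_ink=1):
    """Return (first_row, last_row) spans whose rows contain ink.

    A row belongs to a text line when it has at least `min_ink` ink pixels.
    """
    profile = [sum(row) for row in binary]  # ink count per row
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                       # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))    # the text line ended on the row above
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines


if __name__ == "__main__":
    page = [[0, 0, 0],
            [1, 1, 0],
            [0, 1, 0],
            [0, 0, 0],
            [1, 0, 1]]
    print(segment_text_lines(page))
```

Applying the same idea to column sums within each line span gives a rough word or character segmentation, provided the gaps are wide enough.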
Primitive Recognition
Straight lines, curve segments, junctions, nodes, loops, characters
Structure Recognition
Text fields, legends, labels, dimensions, graphics symbols
Page Layout Analysis
Text versus non-text, physical component analysis, logical component analysis, functional component analysis, compression
Interpretation
Component recognition, connectivity analysis, CAD layer separation, Database attribute extraction, Compression
Information Retrieval
Document Classification, indexing, search, security, authentication, privacy
Database, CAD
Validation, search, update
Postal Examples
Sender’s Address
Linear Code
Endorsement: "In Case of Undeliverable as Addressed Return to Sender"
Delivery Address
Meter Mark
Digital Post Mark
Forms
Unconstrained Text
Graphics Documents
References
Handbook of Character Recognition and Document Image Analysis, H. Bunke and P. S. P. Wang (editors), World Scientific Press
Document Image Analysis, O'Gorman and Kasturi, IEEE Computer Society Press
International Conference on Document Analysis and Recognition proceedings
International Workshop on Document Analysis Systems proceedings
Symposium on Document Image Understanding Technology
• OCR Features and Systems
  – Script ID, Devanagari OCR, Tamil OCR, MP versus HW
• Handwriting Recognition
  – Postal applications, Arabic Documents
• Classifiers and Learning
  – Multi-classifier systems
• Layout Analysis
  – Skew correction, geometric methods, text/graphics separation, logical labeling
• Tables and Forms
  – Detecting tables in HTML documents, use of graph grammars, semantics
• Document Engineering
  – Processing of historical documents (palm leaf manuscripts)
• Camera Based DIA
  – Locating and reading Barcodes
• New Applications
  – CAPTCHA