UN Workshop on Data Capture, Minsk
Session 15
Data Capture Process with
Optical Character Recognition
Image Character Recognition
Intelligent Recognition
Christoph Steinl
Vice Director International
Enterprise Content Management
© Beta Systems Software AG 2008
Agenda
OCR: Optical Character Recognition
ICR: Image Character Recognition
DFR: Dynamic Form Recognition
11/7/2015
OCR = optical character recognition
Technology was first invented in 1929
Gustav Tauschek obtained a patent on OCR in Germany
Mechanical device that used templates
First commercial system was installed at Reader's Digest in 1955
Years later donated to the Smithsonian Institution
Today
Recognition of machine-written text is now considered largely a solved problem
Accuracy rates exceed 99%
OCR
Beta Systems is well experienced with these recognition engines in banks
in Germany: OCR-A, with the special characters ⑁ (Chair), ⑀ (Hook), ⑂ (Fork)
in Austria: OCR-B, with the special character + (Plus)
ICR Image Character Recognition
The technique is far ahead of OCR because of ongoing development of ICR
Handwriting recognition system
Allows different styles of handwriting to be learned by a computer during / before processing to improve accuracy and recognition rates
ICR Process:
Capturing the image with scanners
Processing by ICR and/or OCR
Segmentation is a very important step
Decision whether the regions meeting the homogeneity criteria belong to the foreground or to the background
Human editors can do that depending on the context
Comparable to computed tomography: from the different results of X-ray measurements taken at different angles, the computer can reconstruct the picture
With the first step only a suitable starting point (sets of pixels) is possible
The growing process links all neighbouring pixels (computation of valleys and peaks with high degree of confidence)
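The segmentation idea above (a starting point that grows by linking neighbouring pixels) can be sketched as a simple seeded region growing. This is only an illustration under assumptions: the grey-value tolerance, the 4-neighbourhood and the toy image are not taken from the slides, and a production engine will use more elaborate criteria.

```python
from collections import deque

def grow_region(image, seed, tolerance):
    """Return the set of (row, col) pixels connected to `seed` whose grey
    value differs from the seed's value by at most `tolerance`."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Link the 4-connected neighbours that satisfy the homogeneity criterion.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and abs(image[nr][nc] - seed_value) <= tolerance):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region

# Toy 4x5 grey image: dark stroke (values near 20) on light paper (near 230).
image = [
    [230, 228,  25, 231, 229],
    [229,  22,  20, 230, 228],
    [231, 230,  18, 229, 230],
    [228, 229,  21, 227, 231],
]
stroke = grow_region(image, seed=(1, 2), tolerance=15)  # assumed tolerance
```

Everything inside `stroke` is treated as foreground (ink); everything else is background.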
ICR Process:
Pre-processing
Deskew
Shift, rotate
Stretch
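One common way to find the angle for deskewing is a projection profile: try several candidate angles and keep the one that packs the dark pixels into the fewest rows. The sketch below assumes this technique; the slides do not say which method the product uses, and the candidate range and synthetic page are made up for illustration.

```python
import math
from collections import Counter

def estimate_skew(points, candidates):
    """Return the candidate angle (degrees) whose rotation best aligns the
    dark pixels `points` [(x, y), ...] into horizontal rows."""
    def sharpness(angle_deg):
        a = math.radians(angle_deg)
        # Row index of each pixel after rotating the page back by `angle_deg`.
        rows = Counter(round(y * math.cos(a) - x * math.sin(a)) for x, y in points)
        # Well-aligned text concentrates pixels in few rows: large sum of squares.
        return sum(n * n for n in rows.values())
    return max(candidates, key=sharpness)

# Synthetic page: three text baselines skewed by 2 degrees.
true_skew = math.radians(2.0)
points = [(x, round(base + x * math.tan(true_skew)))
          for base in (20, 50, 80) for x in range(200)]
candidates = [i * 0.5 for i in range(-10, 11)]  # -5.0 ... 5.0 degrees, assumed range
angle = estimate_skew(points, candidates)
```

The image is then rotated back by `angle` before recognition.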
Recognition – Image Pre-processing
Skewed document ...
…after alignment
ICR Process:
Enhance
Less / More Contrast
Clean up (de-noise, halftone removal)
to enable the recognition engine to give best results
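As an illustration of the clean-up step, the sketch below removes isolated ink pixels from a binary image. It is a deliberately simple de-noising rule chosen for this example (an ink pixel with no ink neighbour is treated as dirt), not the filter used by any particular engine.

```python
def remove_isolated_pixels(bitmap):
    """Return a copy of a binary bitmap (1 = ink, 0 = paper) in which ink
    pixels without any ink neighbour (8-connectivity) are cleared."""
    rows, cols = len(bitmap), len(bitmap[0])
    cleaned = [row[:] for row in bitmap]
    for r in range(rows):
        for c in range(cols):
            if not bitmap[r][c]:
                continue
            has_neighbour = any(
                bitmap[nr][nc]
                for nr in range(max(0, r - 1), min(rows, r + 2))
                for nc in range(max(0, c - 1), min(cols, c + 2))
                if (nr, nc) != (r, c)
            )
            if not has_neighbour:
                cleaned[r][c] = 0
    return cleaned

noisy = [
    [0, 0, 0, 0, 1],   # single speck in the corner
    [0, 1, 0, 0, 0],   # vertical stroke in column 1
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
clean = remove_isolated_pixels(noisy)
```

The speck disappears while the stroke is left untouched.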
Recognition: Noise and box removal
ICR Process:
Classification
A one was written:
90 % = 1
8 % = 7
2 % = 4
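How such a list of alternatives might be turned into a decision can be sketched as follows. The acceptance threshold of 0.85 and the second, ambiguous example are assumptions for illustration; only the 90 / 8 / 2 % split comes from the slide.

```python
def decide(candidates, accept_threshold=0.85):
    """Pick the most probable character; return None (reject) when its
    confidence is below `accept_threshold` so that a human can key it."""
    best_char, best_conf = max(candidates.items(), key=lambda item: item[1])
    return best_char if best_conf >= accept_threshold else None

clear_one = {"1": 0.90, "7": 0.08, "4": 0.02}   # values from the slide
ambiguous = {"1": 0.55, "7": 0.43, "4": 0.02}   # made-up unclear case

accepted = decide(clear_one)
rejected = decide(ambiguous)
```

A rejected character is the only part of the form that needs manual correction.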
ICR Algorithm:
Using:
Neural Network
kNN: k-Nearest Neighbour
SVM: Support Vector Machine
SVMs simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers
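Of the three approaches, k-Nearest Neighbour is the easiest to show in a few lines. The sketch below is a generic kNN with majority vote; the two features per character and all training values are hypothetical and only serve to make the example runnable.

```python
import math
from collections import Counter

def knn_classify(sample, training_set, k=3):
    """Label `sample` by majority vote among its k nearest training samples."""
    nearest = sorted(training_set, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical features: (height/width ratio, share of ink in the upper half).
training_set = [
    ((3.0, 0.50), "1"), ((2.8, 0.52), "1"), ((3.2, 0.48), "1"),
    ((1.6, 0.70), "7"), ((1.5, 0.72), "7"), ((1.7, 0.68), "7"),
]
label_a = knn_classify((2.9, 0.51), training_set)
label_b = knn_classify((1.55, 0.71), training_set)
```

Real engines work on far richer feature vectors, but the voting principle is the same.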
ICR Process:
After the different classification alternatives, the appropriate confidence is provided
Recognition: limitation to the most probable characters only
e.g. if only the characters 3, 6, 0 are possible, the engine can also be limited to this set and the results are much better
Voting Machine
Usability: security, efficiency and accuracy
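The effect of limiting the engine to a known character set can be sketched as a filter over the candidate list. The raw confidences below are invented, and rescaling the remaining confidences to sum to 1 is an assumption of this sketch, not a documented behaviour of the product.

```python
def restrict(candidates, allowed):
    """Keep only candidates from `allowed` and rescale their confidences
    so that they sum to 1."""
    kept = {ch: conf for ch, conf in candidates.items() if ch in allowed}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {ch: conf / total for ch, conf in kept.items()}

raw = {"8": 0.40, "3": 0.35, "6": 0.15, "0": 0.10}   # made-up engine output
limited = restrict(raw, allowed={"3", "6", "0"})
best = max(limited, key=limited.get)
```

Without the restriction the engine would have reported "8"; with it, the impossible candidate drops out and "3" wins.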
Dynamic Field Recognition
No fixed position is required
If only ½ of the form is available, that ½ is still readable
No special forms are required
No timing tracks are necessary on the forms for OMR, but OMR results are still available at the same time
No cleaning of LEDs in the scanner necessary
Robust against vertical / horizontal stretching, shrinking and displacement (e.g. variation in printing)
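Robustness against stretching, shrinking and displacement can be illustrated with two reference features found on the scan: from them a scale and an offset per axis are derived and applied to every field box of the form definition. The anchor positions, the field box and the function names are hypothetical; the slides do not describe how the product computes this.

```python
def fit_axis(t0, t1, s0, s1):
    """Scale and offset that map template coordinates t0, t1 onto the
    coordinates s0, s1 measured on the scan."""
    scale = (s1 - s0) / (t1 - t0)
    return scale, s0 - scale * t0

def locate_field(box, template_anchors, scan_anchors):
    """Map a field box (left, top, right, bottom) from template to scan."""
    (tx0, ty0), (tx1, ty1) = template_anchors
    (sx0, sy0), (sx1, sy1) = scan_anchors
    kx, dx = fit_axis(tx0, tx1, sx0, sx1)
    ky, dy = fit_axis(ty0, ty1, sy0, sy1)
    left, top, right, bottom = box
    return (kx * left + dx, ky * top + dy, kx * right + dx, ky * bottom + dy)

template_anchors = [(100, 100), (900, 1300)]   # positions in the form definition
scan_anchors = [(130, 90), (970, 1350)]        # same features on a stretched, shifted print
age_box = (200, 400, 300, 450)                 # hypothetical field
found = locate_field(age_box, template_anchors, scan_anchors)
```

Here the print is 5 % larger and shifted, and the field box moves accordingly.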
Dynamic Field Recognition
Recognizes:
features (word as pixel cloud), boxes, lines and symbols
Hardware / Software Requirements
Hardware
Scanner
PC
Network
Disc storage: necessary for re-processing and if images are needed for audit purposes
Software
Scan software
One recognition and voting software for OMR, OCR, ICR, Barcode
OMR
Cost Comparatives in general
Item: OMR/ICR from image / OMR/ICR from dedicated OMR scanner
Forms design: same
Forms production: - / up to 50% more
Enumerator training: - / up to double the cost
Scanners: - / up to double the cost
PC: low-cost PC
PC operators: same
Servers: same
Cost of more/new flexibility: low / high
ICR Advantages
Better than:
Manual keying
90 % (plus) correct keys
Manual = higher substitution rate than automated recognition
Time consuming
Deliberate manipulation possible
OMR, because OMR is space consuming
OCR, because OCR is machine written and therefore of limited use
ICR Advantages
Clear accuracy for OMR because of dirt removal by software, depending on the mark size and figure
Can detect line and can ignore dirt
Clear result
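Reading OMR marks from the image, while ignoring dirt, can be sketched as a decision on the share of dark pixels inside a mark box. Both thresholds are assumptions chosen for the example; in practice they depend on the mark size and figure, as the slide notes.

```python
def mark_state(box_pixels, empty_below=0.10, marked_above=0.30):
    """Classify a mark box from its share of dark pixels (1 = dark)."""
    total = sum(len(row) for row in box_pixels)
    dark = sum(sum(row) for row in box_pixels)
    ratio = dark / total
    if ratio < empty_below:
        return "empty"      # a few dirt specks are ignored
    if ratio >= marked_above:
        return "marked"
    return "unclear"        # left for an operator to decide

dirty_box = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
line_box = [
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
]
dirty_result = mark_state(dirty_box)
line_result = mark_state(line_box)
```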
ICR Advantages
Barcode, OMR, OCR and ICR recognition with one software
ICR Advantages
Pro:
Only rejected characters/fields need correction
Rest of the form untouched
With new technologies open for future: faster, better quality
With standardized correction mode
Handwriting of the corresponding country will be recognized
The previously mentioned advantages do not have to be repeated here again
ICR Advantages : Capture Process
SORM: Scan Once Read Multiple
Images are scanned once and stored for re-processing (disk space is cheap).
In several serial sessions, parts of the data are collected from the image (important fields first).
Example:
SORM Session 1: Fields Age, Sex and Nationality -> provisional partial results
SORM Session 2: All other numeric fields
SORM Session 3: Alphanumeric fields that need more manual coding (Occupation -> Occupation Code)
Each session updates the data files / database until all data is captured.
Faster preliminary results. Less political stress.
Faster data for PES planning.
Analysis of Session 1 results is possible in parallel to recognition, coding and editing of Session 2.
Data lifting on different batching levels is possible (EA, settlement).
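The SORM idea can be sketched as repeated passes over the same stored images, each pass reading a different group of fields and updating the same database. The form identifier, the field `household_size` and the stand-in `read_fields` function are hypothetical; real recognition would of course read the values from the image.

```python
SESSIONS = [
    ("Session 1", ["age", "sex", "nationality"]),
    ("Session 2", ["household_size"]),        # stands for "all other numeric fields"
    ("Session 3", ["occupation"]),
]

def read_fields(image, field_names):
    """Stand-in for the recognition step: here the 'image' is simply a dict
    that already holds the field values."""
    return {name: image[name] for name in field_names}

def run_session(stored_images, database, field_names):
    """Read the given fields from every stored image and update the database."""
    for form_id, image in stored_images.items():
        database.setdefault(form_id, {}).update(read_fields(image, field_names))

stored_images = {   # scanned once, kept on disk
    "EA01-0001": {"age": 34, "sex": "F", "nationality": "BY",
                  "household_size": 4, "occupation": "teacher"},
}
database = {}
run_session(stored_images, database, SESSIONS[0][1])
provisional = dict(database["EA01-0001"])     # snapshot after Session 1
for _, fields in SESSIONS[1:]:
    run_session(stored_images, database, fields)
```

After Session 1 the provisional results can already be analysed while the later sessions are still running.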
Process Stages of Census Surveys
Christoph J. Steinl, Vice Director int. ECM
December 2008, Minsk
Capture Process
Store In (EA Batch Header Creation – EA Paper store Database)
Scanning
Recognition
Verifying Processes
The solution
Data capture
Census Process internal
Census data flow
Quality assurance
Scanning
Kleindienst SC80HC
Scanning
Simultaneous creation of up to six images
Optical lens > 10 mm, depth of field 3 mm
Optical and ultrasonic double feed control
Energy-saving / life-cycle-extending mode
Consistent jam handling: no document is lost or double captured due to physical jams
Cleanliness check program: detects white and black dirt spots
Scanning
Why pockets 2 – 12? To sort documents according to:
whether the document is scanned skewed or de-skewed
whether the very important questions are filled in / readable (if OMR, OCR, Barcode)
whether there are fingerprints on the questionnaire or not
whether the Barcode/OCR/OMR numbers (not ICR numbers) are in the given range
whether there are double entries (we check the unique number)
whether (colour) copies are used
whether there are mismatches in quantity: batch header shows 50 and only 40 are scanned
Transport stop can be programmed to clarify the issue
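Three of the checks above (quantity mismatch, double entries, number range) are easy to express in code. The sketch below is an illustration only: the function name, the message texts and the sample numbers are assumptions, and how the scanner software actually reports such problems is not described in the slides.

```python
def check_batch(expected_count, form_numbers, valid_range):
    """Return a list of problems found in one scanned batch."""
    problems = []
    if len(form_numbers) != expected_count:
        problems.append(
            f"count mismatch: header says {expected_count}, scanned {len(form_numbers)}")
    seen = set()
    for number in form_numbers:
        if number in seen:
            problems.append(f"double entry: {number}")
        seen.add(number)
        if number not in valid_range:
            problems.append(f"out of range: {number}")
    return problems

# Batch header announces 5 forms; unique numbers must lie in 1000..1099.
problems = check_batch(5, [1001, 1002, 1002, 1999], range(1000, 1100))
```

Any non-empty result could trigger a transport stop or route the document to a separate pocket.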
Scanning
Customer:
"I have a printer and have printed my own questionnaire for a long time ... I learnt from the internet that it is just a matter of software ..."
Before printing we should be consulted so that we can give the best advice; we will test and optimize.
Single-sided printing or higher-grade paper is necessary (show-through factor = opacity)
Paper should be white without any spots inside
Discuss different methods before making big investments
Census Process internal
Process diagram; recoverable stage labels:
Form Type Analysis
Structure Analysis
ICR voting
Batch Job Processing
ICR 1
ICR 2
Editing / Coding
Output Assembly
Logical Result Analysis
ICR Result Analysis
The Path to Recognition
Analyze the structure of documents for identification
The Path to Recognition
Perform proper clean-up and image pre-processing
Analyze individual page layout
Dynamically locate fields of interest
Character recognition: numeric handwriting, voting of two ICR engines and also with OMR
Compile results
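Voting between two ICR engines can be sketched with a simple agreement rule: a character is accepted only when both engines report the same character with sufficient confidence. The rule, the threshold and the sample results are assumptions for illustration; the actual voting logic of the product is not given in the slides.

```python
def vote(result_a, result_b, min_confidence=0.80):
    """Accept a character only if both engines agree and both are confident;
    otherwise return None so the character is rejected for correction."""
    char_a, conf_a = result_a
    char_b, conf_b = result_b
    if char_a == char_b and min(conf_a, conf_b) >= min_confidence:
        return char_a
    return None

def vote_field(engine_a, engine_b):
    """Vote character by character over one numeric field."""
    return [vote(a, b) for a, b in zip(engine_a, engine_b)]

icr1 = [("1", 0.97), ("9", 0.91), ("7", 0.88)]   # made-up output of engine 1
icr2 = [("1", 0.95), ("4", 0.90), ("7", 0.92)]   # made-up output of engine 2
field = vote_field(icr1, icr2)
```

The middle character is rejected because the engines disagree, which lowers the substitution rate at the price of one manual correction.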
Verifying Processes
Unique number – double scan check
Double feed check
Check if copy
Trace of all editing work
Logical checks
Completeness checks
Reports
The Solution: SC80HC + FC Census
Solution diagram; recoverable labels: FC Census recognition; DevInfo; Data + Images; Work Data Storage & DB; Preparation: cut & jog; CSPro; Batch Header; Archive; Paper Archive; Data Storage & DB; Redatam; Form x; TAPE; Editing; Local reports
Data capture
Data Processing Centres in different locations
Peak period 3 shifts, average 1-2 shifts
Local operators trained by our supervisors
Supervisors local
Central support from Lab
Training & documentation realised in advance
Help to design the documents
Thank you for your attention