Handwritten Character
Recognition using
Hidden Markov Models
Quantifying the marginal benefit
of exploiting correlations between
adjacent characters and words
Optical Character Recognition
Rich field of research with many applicable domains
Off-line vs. On-line (includes time-sequence info)
Handwritten vs. Typed
Cursive vs. Hand-printed
Cooperative vs. Random Writers
Language-specific differences of grammar and dictionary size
We focus on an off-line, mixed-modal English data set with mostly
hand-printed and some cursive data
Observation is monochrome bitmap representation of each letter
with segmentation problem already solved for us (but poorly)
Pre-processing of dataset for noise filtering and normalizations of
scale also assumed done
Common Approaches to OCR
Statistical Grammar Rules and Dictionaries
Feature Extraction of observations
Global features: Moments and invariants of image (e.g.,
percentage of pixels in certain region, measuring curvature)
Local features: Group windows around image pixels
Hidden Markov Models
Used mostly in cursive domain for easy training and to
avoid segmentation issues
Most HMMs use very large models with words as states,
combined with above approaches, which is more applicable
to domains of small dictionary size with other restrictions
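As an illustration of the global features mentioned above (percentage of pixels in a certain region), here is a minimal sketch. The function name `zone_densities` and the 4x4 zone size are assumptions made for the example, not something the slides specify.

```python
from typing import List

Bitmap = List[List[int]]  # rows of 0/1 pixels


def zone_densities(bitmap: Bitmap, zone_rows: int = 4, zone_cols: int = 4) -> List[float]:
    """Return the fraction of 'on' pixels in each zone, scanned row-major.

    The zone size is a hypothetical choice; any size that evenly divides
    the bitmap works.
    """
    height, width = len(bitmap), len(bitmap[0])
    if height % zone_rows or width % zone_cols:
        raise ValueError("bitmap size must be a multiple of the zone size")
    features = []
    for top in range(0, height, zone_rows):
        for left in range(0, width, zone_cols):
            on = sum(
                bitmap[r][c]
                for r in range(top, top + zone_rows)
                for c in range(left, left + zone_cols)
            )
            features.append(on / (zone_rows * zone_cols))
    return features
```

For a 16-row by 8-column bitmap this yields 8 features, one per 4x4 zone.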
Visualizing the Dataset
Data Collected from 159
subjects with varying styles,
printed and cursive
First letter of each
word removed so capital
letters need not be handled
Each character represented
by 16x8 array of bits
Character meta-data
includes correct labels and
end-of-word boundaries
Pre-processed into 10
cross-validation folds
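One possible in-memory layout for the data described above is sketched here. The class and field names (`CharExample`, `word_end`, `fold`) are hypothetical, and the sketch assumes characters are stored in reading order.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class CharExample:
    pixels: List[int]  # 128 values (16x8 bitmap flattened), each 0 or 1
    label: str         # correct lowercase letter
    word_end: bool     # True if this is the last character of its word
    fold: int          # cross-validation fold index, assumed 0..9


def group_into_words(examples: Iterable[CharExample]) -> List[List[CharExample]]:
    """Split a character stream into words using the end-of-word flags."""
    words, current = [], []
    for ex in examples:
        current.append(ex)
        if ex.word_end:
            words.append(current)
            current = []
    if current:  # trailing characters with no end-of-word marker
        words.append(current)
    return words


def split_fold(examples: List[CharExample], test_fold: int) -> Tuple[list, list]:
    """Hold out one fold for testing and train on the rest."""
    train = [e for e in examples if e.fold != test_fold]
    test = [e for e in examples if e.fold == test_fold]
    return train, test
```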
Our Approach: HMMs
Primary Goal: Quantify the impact of correlations between
adjacent letters and words
Secondary Goal: Learn an accurate classifier for our data set
Our Approach: Use an HMM and compare to other algorithms
26 states of HMM each represent letter of alphabet
Supervised learning of model with labeled data
Prior probabilities and transition matrix learned by frequency of
letters in training
Learning algorithm for emission probabilities uses Naive Bayes
assumption (i.e., pixels conditionally independent given the letter)
Viterbi algorithm predicts most probable sequence of states given
the observed character pixel maps
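The approach on this slide can be sketched roughly as follows. This is not the authors' code. It assumes add-one smoothing of all counts, priors taken from the first letter of each training sequence, and Bernoulli pixel emissions. The names `train_hmm`, `log_emission`, and `viterbi` are made up for the example.

```python
import math
from string import ascii_lowercase as LETTERS


def train_hmm(words, n_pixels):
    """Supervised training. words: list of sequences of (pixels, letter) pairs."""
    start = {a: 1.0 for a in LETTERS}                       # add-one smoothing (assumption)
    trans = {a: {b: 1.0 for b in LETTERS} for a in LETTERS}
    on = {a: [1.0] * n_pixels for a in LETTERS}             # smoothed count of pixel == 1
    seen = {a: 2.0 for a in LETTERS}                        # smoothed count of each letter
    for word in words:
        start[word[0][1]] += 1
        for (_, prev), (_, cur) in zip(word, word[1:]):
            trans[prev][cur] += 1
        for pixels, letter in word:
            seen[letter] += 1
            for i, bit in enumerate(pixels):
                on[letter][i] += bit
    total_start = sum(start.values())
    row_total = {a: sum(trans[a].values()) for a in LETTERS}
    log_start = {a: math.log(start[a] / total_start) for a in LETTERS}
    log_trans = {a: {b: math.log(trans[a][b] / row_total[a]) for b in LETTERS}
                 for a in LETTERS}
    p_on = {a: [c / seen[a] for c in on[a]] for a in LETTERS}
    return log_start, log_trans, p_on


def log_emission(pixels, p_on_letter):
    """Naive Bayes: pixels are conditionally independent given the letter."""
    return sum(math.log(p if bit else 1.0 - p) for bit, p in zip(pixels, p_on_letter))


def viterbi(observations, log_start, log_trans, p_on):
    """Most probable letter sequence for a list of pixel vectors (log space)."""
    best = {a: log_start[a] + log_emission(observations[0], p_on[a]) for a in LETTERS}
    back = []
    for pixels in observations[1:]:
        step, pointers = {}, {}
        for cur in LETTERS:
            prev = max(LETTERS, key=lambda a: best[a] + log_trans[a][cur])
            step[cur] = best[prev] + log_trans[prev][cur] + log_emission(pixels, p_on[cur])
            pointers[cur] = prev
        best = step
        back.append(pointers)
    state = max(LETTERS, key=best.get)
    path = [state]
    for pointers in reversed(back):  # follow back-pointers to recover the path
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Working in log space avoids numeric underflow from multiplying 128 small pixel probabilities per character.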
Algorithms and Optimizations
Learning algorithms implemented and tested:
Baseline Algorithm: Naïve Bayes Classifier (no HMM)
Algorithm 2: NB with maximum probable classification over a set
of shifted observations
Algorithm 3: HMM with NB assumption
Fix for incomplete data: Examples ‘hallucinated’ prior to training
Algorithm 4: Optimized HMM with NB assumption
Motivation was to compensate for correlations between adjacent
pixels not included in Naïve Bayes assumption
Ignore effects of inter-word transitions when learning HMM
Algorithm 5: Dictionary Creation and Lookup with NB assumption
(no HMM)
Geared toward specific data set with small dictionary size, but less
generalizable to less constrained data sets with larger dictionaries
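A rough sketch of the dictionary idea in Algorithm 5 follows. It assumes the dictionary is simply the set of words seen in training, and that a word is scored by summing per-character Naive Bayes log-likelihoods. Both function names are hypothetical, and the slides do not give the actual matching rule.

```python
from typing import Dict, Iterable, List, Optional, Set


def build_dictionary(label_sequences: Iterable[Iterable[str]]) -> Set[str]:
    """Collect every distinct word (sequence of letter labels) seen in training."""
    return {"".join(labels) for labels in label_sequences}


def best_dictionary_word(
    char_log_scores: List[Dict[str, float]],
    dictionary: Iterable[str],
) -> Optional[str]:
    """Pick the dictionary word of matching length with the highest total score.

    char_log_scores[i][letter] is the log-likelihood of the i-th observed
    character bitmap under that letter. Letters missing from a position's
    dict are treated as impossible.
    """
    best_word, best_score = None, float("-inf")
    for word in dictionary:
        if len(word) != len(char_log_scores):
            continue
        score = sum(
            scores.get(ch, float("-inf"))
            for scores, ch in zip(char_log_scores, word)
        )
        if score > best_score:
            best_word, best_score = word, score
    return best_word  # None when no dictionary word fits
```

Because the whole word is scored at once, a single misread character can be overridden by the other characters, which is one plausible reason this style of lookup does well when the dictionary is small.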
Alternative Algorithms and
Experimental Setup
Other variants considered but not implemented:
Joint Bayes parameter estimation (too many probabilities
to learn, 2^128 vs. 3,328)
HMM with 2nd-order Markov assumption (exponential in
number of Viterbi paths)
Training Naïve Bayes over a set of shifted and overlaid
observations (preprocessing to create thicker boundary)
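The parameter counts quoted for joint estimation can be checked with a few lines. The sketch assumes 26 letters and 128 binary pixels per character, and reads the 2^128 figure as the number of distinct bitmaps per letter.

```python
N_LETTERS = 26
N_PIXELS = 16 * 8  # 128 binary pixels per character

# Naive Bayes: one P(pixel is on | letter) per pixel and per letter.
naive_bayes_params = N_LETTERS * N_PIXELS

# Full joint model: one probability per possible bitmap for each letter.
bitmaps_per_letter = 2 ** N_PIXELS

print(naive_bayes_params)        # 3328
print(bitmaps_per_letter > 10 ** 38)  # True
```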
All experiments run with 10-fold cross-validation
Results given as averages with standard deviations
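A minimal sketch of such an evaluation loop is shown here. The names `cross_validate`, `train_fn`, and `accuracy_fn` are placeholders, and the use of the sample standard deviation is an assumption, since the slides do not say which variant was reported.

```python
from statistics import mean, stdev


def cross_validate(folds, train_fn, accuracy_fn):
    """Train on all folds but one, test on the held-out fold, and repeat.

    folds:       list of lists of examples (10 lists for 10-fold CV)
    train_fn:    callable(train_examples) -> model
    accuracy_fn: callable(model, test_examples) -> accuracy in [0, 1]
    Returns (mean accuracy, sample standard deviation) across folds.
    """
    accuracies = []
    for i, test_fold in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)
        accuracies.append(accuracy_fn(model, test_fold))
    return mean(accuracies), stdev(accuracies)
```

Any of the classifiers being compared can be plugged in through `train_fn` and `accuracy_fn`, so that all algorithms are scored on identical splits.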
Experimental Results
[Figure: bar chart of Test Accuracy (axis from 50% to 100%) by
Classification Algorithm: Naïve Bayes, NB w/ Shifted Observations,
Standard HMM, Optimized HMM, Dictionary Creation]
Conclusions
Naïve Bayes classifier performed well on its own (62.7%
accuracy - 15x better than a random classifier!)
Classification on shifted data did worse since we lost data on
edges!
Small dictionary size of dataset affected results:
Optimized HMM w/ NB achieves 71% accuracy
Optimizations only marginally significant because of dataset
Simpler and more flexible approach that may still achieve impressive
results on other datasets
Dictionary approach is almost perfect with 99.3% accuracy!
Demonstrates additional benefit of exploiting domain constraints,
grammatical or syntactic rules
Not always feasible: dictionary may be unknown, too large, or the
data may not be predictable