sin_Group2 – Sinhala OCR

SINHALA LANGUAGE OCR
Coursework for ISSALE - 2014
Project Demonstration
● Kasun Perera
● Chamila Liyanage
● Tharaka Viswakula
● Laksri Wijerathna
Sinhala Script
Sinhala script consists of:
● 18 vowels
● 40 consonants
● 18 modifiers
● other symbols (rakaranshaya, yansaya)
Selected characters
Font: Abhaya
Font size: 12
Code  Character    Code  Character
700   අ            708   ල්
701   ැ            709   න්
702   නි           710   ණ
703   ර            711   සි
704   ස            712   ත්
705   ත            713   යි
706   ක්           714   එ
707   කි           708   ල්
Document Image
The document image has 16 different character types and 11 samples of each character type.
Line and Main Body segmentation
● All lines were segmented correctly
o Number of lines in input image: 9
o Program outputs 9 line segments
o 100% accuracy
● All main bodies were segmented correctly (no diacritics)
o 100% accuracy
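The slides do not say how the lines were separated. One common approach is a horizontal projection profile: count the ink pixels in each pixel row and cut wherever the count drops to zero. The sketch below illustrates that idea only; the function name `segment_lines`, the 0/1 list-of-rows image format, and the tiny synthetic page are all assumptions made for illustration, not the project's code.

```python
def segment_lines(image):
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    `image` is a list of pixel rows; each row is a list of 0/1 values
    where 1 means ink. This input format is an assumption.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                  # first inked row of a new line
        elif ink == 0 and start is not None:
            lines.append((start, y))   # a blank row closes the current line
            start = None
    if start is not None:              # a line that touches the bottom edge
        lines.append((start, len(profile)))
    return lines


# Synthetic page (hypothetical data): two text lines split by a blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(segment_lines(page))
```

On a real scan the zero-ink test would usually be replaced by a small threshold, since noise rarely leaves a row completely blank.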
Decision Tree Recognition results
● Creation of training (35) and test (15) data
● Decision tree created with Weka from the training data
● Accuracy tested on the test data
Overall accuracy: 70%
Poorly recognized characters:
702 - නි / 708 - ල් / 711 - සි / 712 - ත්
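An overall accuracy figure and a list of weak classes like the ones above can be derived from pairs of true and predicted labels. The following is a minimal sketch of that bookkeeping, not the Weka evaluation used in the project; the function name `accuracy_report`, the 0.6 cut-off, and the sample predictions are hypothetical.

```python
from collections import defaultdict


def accuracy_report(pairs):
    """Return (overall_accuracy, per_class_accuracy) from (true, predicted) pairs."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for true_label, predicted in pairs:
        total[true_label] += 1
        if predicted == true_label:
            correct[true_label] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_class = {label: correct[label] / total[label] for label in total}
    return overall, per_class


# Hypothetical predictions for three character codes, NOT the project's data.
sample = [
    ("700", "700"), ("700", "700"),
    ("702", "706"), ("702", "702"),
    ("708", "709"), ("708", "709"),
]
overall, per_class = accuracy_report(sample)

# Flag classes below an arbitrary 60% cut-off as poorly recognized.
weak = sorted(label for label, acc in per_class.items() if acc < 0.6)
print(overall, weak)
```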
Tesseract Recognition results
Overall accuracy: 93.181%
Complete OCR - Decision Tree (DT) Method
Overall accuracy: 28%
Complete OCR - Tesseract
Overall accuracy: 92.8%
Tesseract Output File
Conclusion
Test dataset (15)
● Tesseract accuracy: 93%
● DT accuracy: 70%
Document Image
● Tesseract accuracy: 92.8%
● DT accuracy: 28%
ස්තුතියි...!
(Thank you...!)