
Degradation Specific OCR

Master's Thesis, Boise State University, December 2010

Subramaniam Venkatraman

Abstract:
Optical Character Recognition (OCR) is the mechanical or electronic translation of
scanned images of handwritten, typewritten or printed text into machine-encoded
text. OCR has many applications, such as making a physical text document
editable or enabling computer search of a text that was originally in printed
form. OCR engines are widely used to digitize
text documents so that they can be digitally stored for remote access, mainly for
websites. This makes these invaluable resources instantly available,
no matter the geographical location of the end user. Large OCR misclassification
errors can occur when an OCR engine is used to digitize a degraded document.
The degradation may have various causes, including aging of the
paper, incompletely printed characters on the original document, and ink blots
on the document, to name a few. In this thesis, the degradation due to scanning of
text documents was considered. To improve OCR performance, it is vital to
train the classifier on a large training set that has a significant number of data
points similar to the degraded real-life characters. In this thesis, characters with
varying degrees of blurring and binarization thresholds were generated and used to
calculate edge spread degradation parameters. These parameters were then used
to divide the training data set of the OCR engine into more homogeneous sets. The
classification accuracy resulting from training on these smaller sets was analyzed.
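
As an illustration of how characters with varying blur and binarization threshold might be generated, the following is a minimal sketch, not the thesis's actual generator. It assumes a Gaussian point spread function whose standard deviation plays the role of the blur width, and the parameter ranges, the rectangular stand-in glyph, and the names `gaussian_kernel`, `blur`, and `degrade` are all hypothetical.

```python
import numpy as np

def gaussian_kernel(width):
    """1-D Gaussian kernel (std = width, in pixels), normalized to sum to 1."""
    radius = max(1, int(np.ceil(3 * width)))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / width) ** 2)
    return k / k.sum()

def blur(image, width):
    """Separable Gaussian blur: convolve every row, then every column."""
    k = gaussian_kernel(width)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def degrade(glyph, width, threshold, noise_std, rng):
    """Blur a clean binary glyph (1 = ink), add Gaussian noise, then binarize."""
    blurred = blur(glyph.astype(float), width)
    noisy = blurred + rng.normal(0.0, noise_std, size=glyph.shape)
    return (noisy >= threshold).astype(np.uint8)

rng = np.random.default_rng(0)

# Stand-in for a rendered character: a filled rectangle (assumption).
glyph = np.zeros((32, 32), dtype=np.uint8)
glyph[8:24, 12:20] = 1

# Random blur width and threshold; the ranges are illustrative only.
width = rng.uniform(0.5, 2.0)
threshold = rng.uniform(0.2, 0.8)
degraded = degrade(glyph, width, threshold, noise_std=0.02, rng=rng)
```

In this sketch a low threshold thickens strokes and a high threshold thins them, which is the kind of variation that degradation parameters are meant to capture.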

The training data set consisted of 100,000 data points of 300 DPI, 12 point Sans
Serif font lower case characters ‘c’ and ‘e’. These characters were generated with
random values of threshold and blur width, with random Gaussian noise added. To
group similarly degraded characters together, clustering was performed using the
Isodata clustering algorithm. Values of two edge spread parameters were then
estimated to fit the cluster boundaries: DC, calculated on isolated edges, and MDC,
calculated on edges in close proximity to account for interference effects.
These values were then used to divide the training data, and a Bayesian classifier
was used for recognition. It was verified that MDC is slightly better than DC as
a division parameter. Dividing the data set into either 2 or 3 partitions was found
to work best. An experimental way to estimate the best boundary
for dividing the data set was determined and verified by testing.
Both crisp and fuzzy approaches to classifier training and testing were implemented,
and various combinations were tried. Crisp training combined with fuzzy
testing was the best approach, giving a 98% classification rate compared with
94% for classification of the data set with no divisions.
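
The combination of crisp training and fuzzy testing can be sketched roughly as follows. This is a hedged illustration under several assumptions rather than the thesis's implementation: the classifier is a diagonal-covariance Gaussian Bayes model, the fuzzy membership is a sigmoid around a single boundary, the data are synthetic, and the names `GaussianBayes`, `train_crisp`, `predict_fuzzy`, and `softness` are hypothetical. The scalar `d` stands in for a degradation parameter such as MDC.

```python
import numpy as np

class GaussianBayes:
    """Bayes classifier with one diagonal-covariance Gaussian per class."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.means = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.vars = np.array([X[y == c].var(axis=0) + 1e-6 for c in self.classes])
        self.priors = np.array([(y == c).mean() for c in self.classes])
        return self

    def posteriors(self, X):
        # log p(x | c) + log p(c), then normalize over classes.
        diff = X[:, None, :] - self.means[None, :, :]
        log_lik = -0.5 * (diff ** 2 / self.vars + np.log(2 * np.pi * self.vars)).sum(axis=2)
        log_post = log_lik + np.log(self.priors)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

def train_crisp(X, y, d, boundary):
    """Crisp training: each sample belongs to exactly one partition, chosen by d."""
    low = d < boundary
    return GaussianBayes().fit(X[low], y[low]), GaussianBayes().fit(X[~low], y[~low])

def predict_fuzzy(models, X, d, boundary, softness):
    """Fuzzy testing: blend both partitions' posteriors by a soft membership in d.

    Assumes both partitions saw the same set of classes during training.
    """
    m_high = 1.0 / (1.0 + np.exp(-(d - boundary) / softness))
    post = ((1.0 - m_high)[:, None] * models[0].posteriors(X)
            + m_high[:, None] * models[1].posteriors(X))
    return models[0].classes[post.argmax(axis=1)]

# Synthetic stand-in data: the feature drifts with the degradation parameter d.
rng = np.random.default_rng(0)
n = 2000
d = rng.uniform(0.0, 1.0, n)
y = rng.integers(0, 2, n)
X = (y + 2.0 * d + rng.normal(0.0, 0.3, n))[:, None]

train = np.arange(n) < n // 2
test = ~train
models = train_crisp(X[train], y[train], d[train], boundary=0.5)
pred = predict_fuzzy(models, X[test], d[test], boundary=0.5, softness=0.05)
accuracy = (pred == y[test]).mean()
```

The design point the sketch tries to show is that each partition's classifier sees a narrower range of degradation, while samples near the boundary draw on both classifiers at test time instead of being forced into one.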

Thesis in Scholarworks