Master's Thesis, Boise State University, December 2004
By Hok Sum (Johnny) Yam
It is desirable to convert paper text documents to a computer-readable and searchable form. With current technology, the combination of a scanner and Optical Character Recognition (OCR) software brings us closer to this goal. However, the degradations introduced to digital images during the scanning process significantly reduce the accuracy of OCR. To address this problem, a degradation model has been developed to predict how a document image will look after being subjected to a given scanning process. Methods exist to calibrate the model parameters using specialized charts such as large wedges. With a calibrated model, controlled experiments can be conducted to generate a large set of synthetic characters. These characters can then be used as training data for the OCR software, which should increase the accuracy of the classification process.
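As a rough illustration of what such a degradation model does, the following minimal sketch degrades a one-dimensional cross-section of a character stroke. It assumes a blur-and-threshold model, with a Gaussian point spread function of width `w` followed by a binarization threshold `theta`; the abstract does not spell out the form of the model, so these two parameters, the Gaussian shape, and the helper names `gaussian_kernel` and `degrade_row` are all assumptions made for the example.

```python
import math


def gaussian_kernel(w):
    """Normalized 1-D Gaussian weights with standard deviation w (in pixels).

    The Gaussian point spread function is an assumption of this sketch.
    """
    radius = max(1, int(math.ceil(3 * w)))
    weights = [math.exp(-0.5 * (k / w) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [v / total for v in weights]


def degrade_row(ideal, w, theta):
    """Blur a row of 0 (paper) / 1 (ink) pixels, then binarize at theta.

    Hypothetical helper: w and theta stand in for the model parameters
    that a calibration procedure would have to estimate.
    """
    kernel = gaussian_kernel(w)
    radius = len(kernel) // 2
    n = len(ideal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # replicate border pixels
            acc += weight * ideal[j]
        out.append(1 if acc >= theta else 0)
    return out


# An ideal stroke 6 pixels wide, surrounded by paper.
ideal = [0] * 10 + [1] * 6 + [0] * 10
thin = degrade_row(ideal, w=2.0, theta=0.7)   # high threshold erodes the stroke
thick = degrade_row(ideal, w=2.0, theta=0.3)  # low threshold dilates the stroke
print(sum(ideal), sum(thin), sum(thick))
```

Under these assumptions, the same blur produces a thinner or thicker stroke depending on the threshold, which is the kind of parameter-dependent change that calibration would need to recover from the scanned image.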
Character corners serve as a potential source for calibrating the degradation model. In this thesis, the acute corners in sans-serif text images of the characters kvwxyzAKMVWXYZ are used to characterize the degradation model. These characters are more readily available than large wedges. If character corners can estimate the parameters of the degradation model with a high level of confidence, they may allow us to calibrate a scanner without specialized calibration images.
Many aspects of character corners are examined. The disadvantage of using character corners is that they are much smaller than large wedges. Misusing the information from character corners increases the probability of a poor estimate. Synthetically generated large wedges are therefore investigated first and used to estimate the degradation model in order to determine the constraints on the wedges. These results provide guidelines for using character corners robustly. They also allow us to judge the accuracy of the estimation results from character corners, since large wedges should in theory give the best estimates of the degradation model parameters because their signal-to-noise ratio is high.
The quantity of character corners on a typical page is very limited. Experiments are conducted that estimate the degradation model from a limited quantity of synthetically generated characters at different resolutions. The estimation results from high-resolution (1200 dpi) characters are better than those from low-resolution (600 dpi) characters. The high-resolution characters provide estimation results that are comparable to those from large wedges.
Printed character images are scanned and used to estimate the degradation model. Large quantities of corners from characters are used to investigate how their contribution affects the mean and the standard deviation of the parameter estimators. The relationship between the angles of the corners used in estimation and the estimation results is studied. The number of typical pages of character images required to give a reasonable result is also examined. Experimentation shows that using selected angles from characters improves the estimation results. Some characters contribute more to the estimators than others.
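To make the idea of tracking the mean and standard deviation of an estimator against the number of corners concrete, the following sketch runs a small Monte Carlo experiment. It is not the procedure used in the thesis: it assumes each corner yields an independent, unbiased parameter estimate with Gaussian noise, and that the per-corner estimates are combined by a plain average. The names `estimate_from_corners` and `estimator_spread`, the true value 1.0, and the noise level 0.5 are hypothetical.

```python
import random
import statistics


def estimate_from_corners(n_corners, true_value, noise_sd, rng):
    """Average n_corners noisy per-corner estimates of one model parameter.

    Assumption: per-corner estimates are independent and Gaussian around
    the true value; real corner estimates need not behave this way.
    """
    per_corner = [rng.gauss(true_value, noise_sd) for _ in range(n_corners)]
    return statistics.mean(per_corner)


def estimator_spread(n_corners, trials=2000, true_value=1.0, noise_sd=0.5, seed=0):
    """Return (mean, standard deviation) of the estimator over repeated trials."""
    rng = random.Random(seed)
    estimates = [
        estimate_from_corners(n_corners, true_value, noise_sd, rng)
        for _ in range(trials)
    ]
    return statistics.mean(estimates), statistics.stdev(estimates)


# Compare the spread of the estimator for small and large numbers of corners.
for n in (4, 16, 64):
    mean, sd = estimator_spread(n)
    print(n, round(mean, 3), round(sd, 3))
```

Under the independence assumption, the standard deviation of the averaged estimator shrinks roughly as one over the square root of the number of corners while its mean stays near the true value. Corners that carry more noise, or biased information, would break this simple picture, which is one reason selecting corners by angle or by character could matter.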