Current BSU Projects
- Melville Marginalia Online Research
- Document Imaging Defect Analysis
- Image Binarization Ground Truth Analysis
- Line Drawing Restoration
- Open Source OCR Tools
- Style Effects on OCR Errors (Barney Smith & Andersen)
- Improvements to adaptive thresholding algorithms to binarize poorly illuminated documents (Barney Smith & Andersen)
- Using Neural Networks for Automatic Document Segmentation (Andersen)
Document Image Analysis aims to develop algorithms and processes through which machines (computers) can automatically read and develop some basic understanding of documents. Documents include:
- Machine-printed documents – such as memos, letters, technical reports, and books.
- Handwritten documents – personal letters, addresses on postal mail, notes in the margins of documents.
- On-line handwritten documents – writing on PDAs or tablet PCs.
- Video documents – annotating videos based on text in the video clips.
- Music scores – turning sheet music into MIDI or other electronic music formats.
Congress’s Joint Intelligence Committee has required the FBI to review all files collected since January 1993 to determine what counterintelligence information they hold and what is being shared. An estimated 30 to 90 million documents will need to be reviewed, the majority of which are in paper form and must be converted to computer-readable, searchable form. The growth of the World Wide Web has made it easier to make information publicly available, but for that information to be useful it must be in computer-readable form so it can be searched and the items of interest retrieved. Documents are converted to computer-readable form through the process of Document Image Analysis (DIA), which encompasses Optical Character Recognition (OCR). An automated OCR system can reduce the time needed to convert a document to computer-readable form to 25% of the time a human needs to hand-enter the same data. Although much effort has been dedicated to developing methods for automatically converting paper documents into electronic form, many documents that are easy for humans to read are still recognized with only 92% accuracy. This is too low to remove the human from the process, which increases the time and cost of document conversion.
Low accuracy rates are most common in documents degraded by printing, scanning, photocopying, and/or FAXing. These four operations all share the processes of spatial and intensity quantization, which are the primary sources of change in the appearance of bilevel images such as characters and line drawings. To date, the most common way to overcome these degradations is to provide the classifier with a wide enough variety of training samples that it can recognize the degraded characters. However, by understanding the degradation and estimating its characteristics for each document, a more effective method of recognizing the characters can be developed.
Document Image Analysis can be applied to many applications beyond the desktop OCR package that comes with most commercial scanners. Some applications include:
- Reading books and documents for the visually impaired
- Conversion of books to digital libraries
- Signature verification
- Reading license plate or cargo container numbers
- PDA or tablet PC technology
- Sorting of large document datasets (legal, historical, security)
- Search engines on the Web
Melville Marginalia Online
(Barney Smith and Olsen-Smith)
Dr. Olsen-Smith has developed the Melville Marginalia Online project. He has had the pages of books once owned by Herman Melville scanned and presents them online. Dr. Olsen-Smith’s team is identifying locations where Melville hand-wrote notes, markings, or underlines in the text. These give him and other scholars insight into the process and thoughts Melville used while writing his own books. We are assisting him by applying the open-source Tesseract OCR engine to convert the machine-printed text to searchable form, and by adding other document analysis techniques to improve the content of his system.
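As a rough illustration, the Tesseract step can be driven from its command-line interface. The function name and image path below are hypothetical, and the sketch assumes the `tesseract` binary is installed on the system:

```python
# Sketch only: OCR one scanned page by invoking the Tesseract CLI.
# The function name and image path are hypothetical examples.
import shutil
import subprocess

def ocr_page(image_path: str) -> str:
    """Run Tesseract on a scanned page image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    # "stdout" tells Tesseract to write the recognized text to standard output.
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

The recognized text returned by such a function can then be indexed for searching alongside the page images.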
Document Imaging Defect Analysis
Scanning
A model for the degradation caused by scanning has been developed. It consists primarily of a variable for the optics (the point spread function, or PSF) and a variable for the threshold level, plus additive noise. Each combination of parameters affects the resulting character differently.
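A minimal sketch of such a degradation model, assuming a Gaussian PSF and a global threshold (the specific kernel and parameter values here are illustrative, not the lab's calibrated model):

```python
# Hedged sketch of the scanning degradation model described above:
# blur with a point spread function (PSF), add noise, then threshold.
# The Gaussian PSF and default parameter values are illustrative assumptions.
import math
import random

def degrade(image, psf_width=1.0, threshold=0.5, noise_sigma=0.05, seed=0):
    """Apply blur + additive noise + global threshold to a 2-D gray image
    (list of rows of floats in [0, 1]); returns a bilevel image of 0/1."""
    rng = random.Random(seed)
    radius = max(1, int(3 * psf_width))
    # 1-D Gaussian kernel, applied separably (rows, then columns).
    kernel = [math.exp(-(i / psf_width) ** 2 / 2) for i in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [k / total for k in kernel]

    def conv1d(row):
        n = len(row)
        # Replicate the border pixels when the kernel overhangs the edge.
        return [sum(kernel[j + radius] * row[min(max(i + j, 0), n - 1)]
                    for j in range(-radius, radius + 1)) for i in range(n)]

    blurred = [conv1d(r) for r in image]                 # blur rows
    cols = [conv1d(list(c)) for c in zip(*blurred)]      # blur columns
    blurred = [list(r) for r in zip(*cols)]
    # Add noise, then binarize at the threshold level.
    return [[1 if v + rng.gauss(0, noise_sigma) > threshold else 0 for v in r]
            for r in blurred]
```

Varying `psf_width`, `threshold`, and `noise_sigma` in a sketch like this shows how each parameter combination changes the resulting bilevel character image.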
Along with the modeling, we have developed several methods for estimating the model parameters from bilevel images. This enables us to answer the question “Which degradation is accurate for a given scanner?” We are working on how to use this to improve scanner development or recognition of the printed characters. While scanner degradations represent only one facet of document image degradation sources, they are the most accessible. Once estimation of scanning parameters can be done accurately and efficiently, other types of degradations can be addressed.
Printing
The printing process is usually assumed to produce characters with the smooth boundaries people think they see. In reality, printing processes such as laser and inkjet printers spread toner or ink around the image boundaries.
Photocopying and FAXing
A photocopier is essentially a scanner coupled with a printer, and a FAX machine is essentially a low-resolution copy machine. When the printer and scanner models above are merged, models of copiers and FAX machines can be developed.
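The merged model can be sketched as a composition of the two stages. Both stage models below are deliberately simplified stand-ins (a one-pixel ink spread for the printer, a neighborhood count for the scanner's blur-plus-threshold), not the calibrated models described above:

```python
# Hedged sketch: a photocopier modeled as a printer stage (ink spread at
# boundaries) followed by a scanner stage (blur + threshold).
def print_stage(bilevel):
    """Spread ink: a background pixel touching a foreground pixel darkens."""
    h, w = len(bilevel), len(bilevel[0])
    out = [row[:] for row in bilevel]
    for y in range(h):
        for x in range(w):
            if bilevel[y][x] == 0:
                # 4-neighborhood check for adjacent ink.
                if any(0 <= y + dy < h and 0 <= x + dx < w and bilevel[y + dy][x + dx]
                       for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))):
                    out[y][x] = 1
    return out

def scan_stage(bilevel, threshold=2):
    """Crude blur + threshold: a pixel is ink if enough 3x3 neighbors are ink."""
    h, w = len(bilevel), len(bilevel[0])
    return [[1 if sum(bilevel[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))) >= threshold
             else 0 for x in range(w)] for y in range(h)]

def copier(bilevel):
    """One photocopy generation: print, then scan."""
    return scan_stage(print_stage(bilevel))
```

Iterating `copier` simulates multiple photocopy generations; lowering the resolution of the stages would move the sketch toward a FAX-machine model.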
To improve the performance of DIA, four major themes are being investigated:
- Model the nonlinear systems of printing, scanning, photocopying and FAXing, and multiple combinations of these, that produce degraded images, and develop methods to calibrate these models. From a calibrated model one can predict how a document will look after being subjected to these processes. This can be used to develop products that degrade text images less.
- Statistically validate these models. This will give other researchers the confidence to use these models to create large training sets of synthetic characters, with which they can conduct controlled DIA and OCR experiments.
- Estimate the parameters of these models from a short character string, to allow continuous calibration that accounts for spatially-variant systems.
- Determine how these models and parameters can best be used to improve OCR accuracy by partitioning the training set based on modeled degradations and matching the appropriate partition to the test data at hand.
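The last theme can be sketched as a nearest-partition lookup; the partition centers below are hypothetical PSF widths, not values from the lab's experiments:

```python
# Sketch of the partitioning idea: group training data by a modeled
# degradation parameter and select the partition nearest the estimate
# from the test document. The partition centers are hypothetical.
PARTITION_CENTERS = [0.5, 1.0, 2.0, 4.0]  # assumed PSF widths, one per partition

def choose_partition(estimated_width: float) -> int:
    """Index of the training partition whose modeled PSF width is closest
    to the width estimated from the test document."""
    return min(range(len(PARTITION_CENTERS)),
               key=lambda i: abs(PARTITION_CENTERS[i] - estimated_width))
```

A classifier trained only on the matching partition then sees characters degraded similarly to the test data, rather than a mixture of all degradations.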
- Printer Modeling: A model was developed by Yi at the University of Idaho for the amount of electrostatic charge on the charge roller of a laser printer. A method to convert this charge density into a measure of the average toner coverage on a piece of paper has been completed as an MS thesis in this lab. The coverage is a function of the number of toner particles available to be distributed, the size of the particles, and the laser trace pattern. Simulations visually match magnified samples and averages of simulated toner placement. Current work is confirming that averages of printed samples match the expected average coverage, by comparing model outputs to test samples printed on a printer with special controls, and that populations of individual printed samples are statistically similar to populations of samples generated by our model. The coverage will then be converted to expected reflectance, and a qualitative measure of how printing degradations affect character images will follow.
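A toy Monte Carlo version of the coverage computation might look like the following; the particle counts, sizes, and trace geometry are illustrative assumptions, not values derived from the charge model:

```python
# Hedged sketch: estimate average toner coverage by scattering circular
# toner particles around a laser trace and measuring the covered fraction
# of a pixel grid. All parameter values are illustrative assumptions.
import random

def coverage_fraction(n_particles=200, particle_radius=1.5,
                      width=32, height=32, seed=0):
    """Fraction of the grid covered by at least one toner particle."""
    rng = random.Random(seed)
    covered = [[False] * width for _ in range(height)]
    trace_y = height / 2  # a single horizontal laser trace
    for _ in range(n_particles):
        # Particle centers scatter along the trace with some vertical spread.
        cx = rng.uniform(0, width)
        cy = rng.gauss(trace_y, 2.0)
        for y in range(height):
            for x in range(width):
                if (x - cx) ** 2 + (y - cy) ** 2 <= particle_radius ** 2:
                    covered[y][x] = True
    return sum(v for row in covered for v in row) / (width * height)
```

Increasing the particle count or radius raises the coverage fraction, mirroring how coverage depends on the number and size of toner particles in the model above.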
- Scanner Model Parameter Estimation: Estimation of the parameters of the scanner defect model can be performed using features available in common textual images. We have selected several characters that are suitable for this estimation, and have estimated how many of them are needed to produce a ‘good’ estimate. A method to determine when the measurements change enough to indicate that the model has changed, either from page to page or within a given page, is currently being explored.
- Statistical Validation: Statistical validation of these models is one of the major themes of this research grant. Code has been written to create a flexible platform through which these validation experiments can be run, including configuration to run under a grid computing framework. Validation will consider the choice of PSF in the model, and we hope to compare our model with other degradation models. Source data is currently being collected to enable these experiments.
- Training Methods: In a paper by Barney Smith and Qiu, regions of the degradation space were found where characters are statistically similar by multiple metrics. Training sets are currently being prepared in these regions and evaluation of the effect on OCR will follow.
- Noise Effects: Noise affects images, and quantifying that effect is the focus of this project. Additive noise affects a bilevel image differently depending on the nature of the gray-level image to which the noise is added. If we assume the gray-level image is formed by blurring a high-contrast image, as in the scanning process, and the image is thresholded after the noise is added, then the blur width and the binarization threshold have a large effect on how many pixels are affected and on how far from an edge the affected pixels lie. This in turn can affect the ability to fit a line or shape to an edge. We have defined a metric called Noise Spread to capture this effect.
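One plausible formalization of such a measurement is sketched below for a one-dimensional blurred edge; the lab's exact definition of Noise Spread may differ:

```python
# Hedged sketch of a "Noise Spread"-style measurement: threshold a blurred
# step edge with and without additive noise, find the pixels that flipped,
# and report how far (on average) the flips lie from the noise-free edge.
# The linear edge profile and parameter values are illustrative assumptions.
import random

def noise_spread(width=64, blur=4.0, threshold=0.5, sigma=0.2, seed=0):
    rng = random.Random(seed)
    center = width / 2
    # A 1-D blurred step edge: intensity ramps linearly around the center,
    # with the ramp width controlled by the blur parameter.
    gray = [min(1.0, max(0.0, 0.5 + (x - center) / (2 * blur)))
            for x in range(width)]
    clean = [v > threshold for v in gray]
    noisy = [v + rng.gauss(0, sigma) > threshold for v in gray]
    flips = [x for x in range(width) if clean[x] != noisy[x]]
    if not flips:
        return 0.0
    return sum(abs(x - center) for x in flips) / len(flips)
```

Widening the blur widens the ramp of near-threshold pixels, so flips occur farther from the edge; this is the blur/threshold interaction the project aims to quantify.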
- Comparing Models: Features that often appear in degraded documents have been identified and are used by many researchers. These include the number of touching characters, small speckles, broken characters, etc. in a document. Several metrics to measure these factors exist. We are examining how these factors relate to the parameters of our degradation model.
- Human Comparative Studies: We have proposed that the amount of Edge Spread in a document is a good metric for how degraded the document appears. This is the foundation of the “Training Methods” project described above. We also want to see whether this or other parameters of our model are correlated with how humans rank image quality, or the lack thereof.
Open Source OCR Software under the GAMERA platform
(Barney Smith & Andersen)
Most OCR researchers focus on one portion of the OCR problem, such as thresholding, segmentation, filtering, recognition, or retrieval. To test any one portion of the OCR process, all the other portions must also be developed, usually in-house, even though they are not the focus of the current research and have already been developed, optimized, and fine-tuned by other research groups.
We aim to create an open-source repository of OCR tools and data sets that researchers can use to confirm others’ results and to reduce the amount of infrastructure each DIA researcher must build to get their research tested. To provide access to all these tools, a graphical interface based on the GAMERA framework is being developed. One student is developing the graphical interface and another is populating it with several tools. Contributions to this project from the community are welcome.