Extending the Page Segmentation Algorithms of the Ocropus Document Layout Analysis System for Mixed-Layout Document Processing

Master's Thesis, Boise State University, August 2010

Amy Winder

Abstract:

With the advent of more powerful personal computers, inexpensive memory, and digital cameras, curators around the world are working towards preserving historical documents on computers. Since many of the organizations for which they work have limited funds, there is worldwide interest in a low-cost solution for obtaining these digital records in computer-readable form. An open-source layout analysis system called OCRopus is being developed for this purpose. In its original state, though, it could not process documents that contained information other than text. Segmenting the page into text and non-text regions is the first step in analyzing a mixed-content document, but this capability did not exist in OCRopus. Therefore, the goal of this thesis was to add it so that OCRopus could process a full spectrum of documents.

By default, the RAST page segmentation algorithm processed text-only documents at a target resolution of 300 DPI. In a separate module, the Voronoi algorithm divided the page into regions, but did not classify them. Additionally, it tended to oversegment non-text regions and was tuned to a resolution of 300 DPI. Therefore, the RAST algorithm was improved to recognize non-text regions and the Voronoi algorithm was extended to classify text and non-text regions and merge non-text regions appropriately. Finally, both algorithms were modified to perform at a range of resolutions.
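As an illustration only, the following minimal Python sketch shows one way a pixel threshold tuned at 300 DPI might be rescaled for other resolutions and then used to merge oversegmented non-text regions. It is not the code from the thesis or from OCRopus: the function names, the bounding-box representation, and the 15-pixel gap are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Bounding box as (x0, y0, x1, y1) in pixels. Hypothetical representation.
Box = Tuple[int, int, int, int]

REFERENCE_DPI = 300  # resolution the original algorithms were tuned to


def scale_to_dpi(value_at_300: float, dpi: float) -> float:
    """Rescale a pixel distance tuned at 300 DPI to another resolution."""
    return value_at_300 * dpi / REFERENCE_DPI


def boxes_close(a: Box, b: Box, gap: float) -> bool:
    """True if the boxes overlap or lie within `gap` pixels of each other."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def merge_nontext(boxes: List[Box], dpi: float,
                  gap_at_300: float = 15.0) -> List[Box]:
    """Repeatedly merge non-text boxes that are close to each other.

    `gap_at_300` is an assumed threshold, not a value from the thesis.
    """
    gap = scale_to_dpi(gap_at_300, dpi)
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if boxes_close(merged[i], merged[j], gap):
                    a, b = merged[i], merged[j]
                    # Replace the pair with their enclosing box.
                    merged[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```

Under these assumptions, two fragments of one image that sit 10 pixels apart at 300 DPI are merged into a single region, while a distant region is left alone. Expressing the gap relative to a reference resolution keeps the same physical distance on the page when the scan resolution changes.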

Testing on a set of documents of different types showed an improvement of 15-40% for the RAST algorithm, giving it an average segmentation accuracy of about 80%. Partially due to the representation of the ground truth, the Voronoi algorithm did not perform as well as the improved RAST algorithm, averaging around 70% overall. Depending on the layout of the historical documents to be digitized, though, either algorithm could be sufficiently accurate to be used.

Thesis in Scholarworks