A very large and prestigious Western Canadian university had been considering the digital conversion of over thirty thousand thesis documents stored as contributed, hardcover books. The collection covered seven decades and numbered approximately five million pages. There were many reasons to consider such a conversion. The collection consumed a considerable amount of physical room, space that could easily be repurposed. The paper was aging well, but it would not last forever in its native state. Most importantly, thesis documents contributed more recently in digital format were proving to be a very valuable research tool. Could access to the historical thesis documents provide the same scholarly value? The answer was yes, but many factors had to be taken into account.
With thirty thousand bound books to process, a streamlined conversion protocol would need to be established. Could the books be scanned without removing the bindings? If so, what type of scan device would be required? If the bindings were to be removed, how could it be done without sacrificing written words on pages? What about scanning five million pages? That’s a significant amount of work. Should it be done in house, or could an outside service bureau be used? After scanning, which OCR product would provide the best data extraction results? Again, would it be better to perform this work in house, or farm it out? How long would the project take?
In early 2008, our firm was contacted to consult on this project, as we had a very good long-term relationship with the university. After familiarizing ourselves with the scope and goals of the task, we examined each portion of the workflow to determine what would make the most sense from the standpoints of productivity and cost effectiveness.
We quickly identified the binding issue as a critical part of this assignment. While we have equipment that can capture extremely high quality, high resolution images from bound books, the time required to process individual pages in this manner would inflate the costs of image capture to an unrealistic extreme.
We had extensive experience in accurately slicing the bindings from bound books so that the valuable pages inside remained intact, with healthy margins. We had acquired an ancient guillotine from a commercial printing operation in the early 1980s just for this purpose. Removing the bindings and scanning the loose pages was therefore the clear choice.
With this decision made, we turned our attention to the fastest way to scan approximately five million pages. Micro Com Systems (MCS) has always been very partial to Fujitsu scanners, and this was an excellent opportunity to test their newest model at the time, the fi-5900. It was fairly fast at 120 pages per minute, featured advanced double feed detection, and was mated to the latest version of Kofax VRS (Virtual ReScan, a technology that would assist us in obtaining the best possible image the first time, significantly reducing rescans). We ordered a fi-5900 and aggressively evaluated it for a week using actual client documents. The results were very encouraging.
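To put the scanning volume in perspective, the arithmetic can be sketched as follows. This is a rough estimate rather than the project's actual schedule: the rated speed is treated as pages per minute (the way production scanner speeds are normally quoted), and the 50% utilization figure is a purely hypothetical allowance for loading, jams and rescans.

```python
def scanning_hours(total_pages: int, pages_per_minute: float,
                   utilization: float = 1.0) -> float:
    """Hours of scanner time needed to feed total_pages through one scanner.

    utilization is the fraction of the rated speed actually sustained
    (1.0 means the scanner never stops feeding).
    """
    if pages_per_minute <= 0 or not 0 < utilization <= 1:
        raise ValueError("speed must be positive and utilization in (0, 1]")
    effective_ppm = pages_per_minute * utilization
    return total_pages / effective_ppm / 60


TOTAL_PAGES = 5_000_000   # approximate collection size from the article
RATED_PPM = 120           # rated scanner speed, taken as pages per minute

rated = scanning_hours(TOTAL_PAGES, RATED_PPM)
# Hypothetical assumption: only half the rated speed is sustained in practice.
realistic = scanning_hours(TOTAL_PAGES, RATED_PPM, utilization=0.5)

print(f"{rated:.0f} h at rated speed, {realistic:.0f} h at 50% utilization")
```

Even under the pessimistic assumption, raw scanner time is a small part of a multi-year project, which suggests that handling, quality control and the client's capacity to receive material matter more than feed speed.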
Our workflow model was binding removal, document preparation (very little was required), scanning, quality control (second pass scans as required), indexing (author name and thesis title) and OCR.
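The workflow above can be pictured as a fixed sequence of stages that every volume passes through. The sketch below is only an illustration of that ordering: the `Volume` record, its fields and the `process` function are hypothetical names, not part of any software described here.

```python
from dataclasses import dataclass, field

# Stage names follow the workflow model described in the text.
STAGES = [
    "binding removal",
    "document preparation",
    "scanning",
    "quality control",
    "indexing",
    "OCR",
]


@dataclass
class Volume:
    author: str                      # index field: author name
    title: str                       # index field: thesis title
    pages: int
    completed: list = field(default_factory=list)
    rescanned_pages: int = 0


def process(volume: Volume, failed_pages: int = 0) -> Volume:
    """Walk one volume through every stage in order.

    failed_pages is the number of pages that quality control sends back
    for a second-pass scan (hypothetical input for illustration).
    """
    for stage in STAGES:
        if stage == "quality control" and failed_pages:
            volume.rescanned_pages += failed_pages
        volume.completed.append(stage)
    return volume


thesis = process(Volume("A. Author", "An Example Thesis", pages=160), failed_pages=2)
print(thesis.completed[-1], thesis.rescanned_pages)
```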
The scanner evaluation indicated that a scan resolution of 300 DPI (dots per inch) allowed us to achieve the best data extraction results through the OCR (optical character recognition) process. We have used a variety of OCR engines throughout the years, but we settled on ABBYY Recognition Server, as its speed and accuracy eclipsed all the other products we tested.
There were a few ancillary details to iron out. The university needed to ensure that the scanned images would be cleansed of any personal author details (other than name) to comply with Freedom of Information and Personal Privacy legislation. We have developed a fairly comprehensive set of image-related tools over the years for a variety of clients. One of these was a redaction utility that could be used to completely obscure any details (such as address or signature) that were deemed to be sensitive.
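The internals of the redaction utility are not described here, but the underlying idea is simple: overwrite the pixels in the sensitive region so the original content no longer exists in the image, rather than hiding it behind a removable overlay. A minimal sketch, assuming a page represented as a grid of grayscale values and hypothetical box coordinates:

```python
def redact(page, boxes, fill=0):
    """Return a copy of page with each box overwritten by the fill value.

    page  -- list of rows, each a list of grayscale pixel values
    boxes -- iterable of (left, top, right, bottom); right/bottom exclusive
    """
    height = len(page)
    width = len(page[0]) if page else 0
    out = [row[:] for row in page]          # leave the original scan untouched
    for left, top, right, bottom in boxes:
        # Clamp each box to the page so out-of-range coordinates are harmless.
        for y in range(max(top, 0), min(bottom, height)):
            for x in range(max(left, 0), min(right, width)):
                out[y][x] = fill
    return out


# Tiny all-white "page" and a hypothetical signature region for illustration.
page = [[255] * 8 for _ in range(6)]
signature_box = (2, 3, 6, 5)
clean = redact(page, [signature_box])
```

Working on a copy keeps the unredacted scan available for internal quality control while only the redacted version is delivered.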
We needed to arrive at a very competitive, all-inclusive price to accomplish the tasks required for this conversion: binding removal, scanning, data extraction and redaction. The component parts of the pricing formula were the equipment and technology investments, staff labour, software licensing and project duration.
Finally, we had to find a method whereby the images and corresponding full text data for each thesis could be transferred to the university. In this case, external USB hard drives were swapped back and forth.
We came to terms with the university in late 2008 and began a small pilot project as a proof of concept. Approximately 500 volumes were converted end to end and uploaded to the university research website. The feedback was very strong and grant monies were obtained to begin the task of processing the balance of the material.
The university had limited resources to monitor and upload the material we were creating, so it was determined that a project duration of roughly two-and-a-half years would allow for the correct utilization of their staff.
The project proceeded as planned and was a complete success. Today, students around the world can access these theses over the internet as an invaluable research tool.
Craig Hollingum has been in the document imaging business for well over half of his life. He has been involved in Micro Com Systems Ltd. on an evolutionary path as an employee, partner and sole owner since 1982.