1. The Challenge
The Definitive Logic (DL) Business Intelligence and Data Science (BIDS) Team was delivering work to a customer when another need surfaced: the customer needed a way to extract valuable information from static Portable Document Format (PDF) files. A large volume of hard-copy legacy forms had been scanned into PDF document images, but the information inside those documents was not accessible, searchable, or available for analytics. The template formats were inconsistent and contained both typed and handwritten information. Manually processing that many items was impractical due to resource, time, and
cost constraints. In short, valuable information was locked in more than 44,000 document pages that could not be readily accessed or utilized.
2. The Solution
Members of the BIDS Team researched and tested combinations of potential solutions before designing a custom one. The Optical Character Recognition (OCR) process runs on a serverless, on-demand cloud architecture and has proven capable of extracting more than 28 million pages' worth of document images at roughly 24 pages per second. It not only created a repository of dynamically searchable PDFs, but also stored the extracted textual data elements in a document engine designed for fast search and retrieval. The technical process of the image and document processing architecture, built on Amazon Web Services (AWS), is described below:
2.1 Process Flow
The following steps describe the data extraction process:
- Upload all non-machine-readable documents that need processing to Amazon S3.
- Send one Amazon SQS message per document, containing that document's S3 path; the number of messages in the queue therefore represents the document work remaining.
- A Lambda function reads each message from the SQS queue and submits a request to Amazon Textract to process the referenced document (see the sketch after this list).
- Amazon Textract processes the document and returns the raw text together with coordinate information (where each piece of text sits on the page) and confidence scores.
- When a job finishes, Amazon Textract sends a completion notification, which invokes another Lambda function.
- That Lambda function retrieves the Textract results, generates a searchable PDF, and indexes the extracted text into Elasticsearch.
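The case study itself contains no code, but a minimal Python/boto3 sketch of the enqueue-and-submit portion of this flow might look like the following. Bucket names, the queue URL, and the SNS topic and IAM role (passed here as environment variables) are illustrative placeholders, not the team's actual configuration.

```python
# Hypothetical sketch only: one helper that fills the SQS queue and one Lambda
# handler that starts an asynchronous Textract job per queued document.
import json
import os

import boto3

textract = boto3.client("textract")

# SNS topic and IAM role Textract uses to announce job completion
# (assumed to be provisioned separately; names are placeholders).
SNS_TOPIC_ARN = os.environ["TEXTRACT_SNS_TOPIC_ARN"]
SNS_ROLE_ARN = os.environ["TEXTRACT_SNS_ROLE_ARN"]


def enqueue_documents(bucket: str, prefix: str, queue_url: str) -> None:
    """Send one SQS message per scanned document so the queue depth reflects remaining work."""
    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            sqs.send_message(
                QueueUrl=queue_url,
                MessageBody=json.dumps({"bucket": bucket, "key": obj["Key"]}),
            )


def handler(event, context):
    """Triggered by SQS: each record's body holds the S3 location of one document."""
    for record in event["Records"]:
        msg = json.loads(record["body"])  # e.g. {"bucket": "...", "key": "scans/form-0001.pdf"}
        job = textract.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": msg["bucket"], "Name": msg["key"]}},
            NotificationChannel={"SNSTopicArn": SNS_TOPIC_ARN, "RoleArn": SNS_ROLE_ARN},
        )
        # The JobId ties the eventual completion notification back to this document.
        print(f"Started Textract job {job['JobId']} for s3://{msg['bucket']}/{msg['key']}")
```

The pattern matters more than the specifics: the queue decouples ingestion from submission, so throughput is governed by Lambda concurrency rather than by any long-running server.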
3. Results
This solution produced an effective, efficient, and economical data extraction method for static documents. Data elements once locked in static documents are now discoverable, usable assets that can be accessed through standard reporting tools. The architecture is built entirely on publicly available Amazon services and processes an average of 24 document pages per second.
4. Lessons Learned
We chose Amazon Textract because it offered the best combination of throughput and text extraction accuracy among the options we evaluated. The other libraries and algorithms we tested were the Tesseract Open Source OCR engine and Character Region Awareness for Text Detection (CRAFT) combined with Scene Text Recognition (STR).
Our test set of images included scans of pristine documents as well as scans of documents more than 70 years old with visible degradation from age or damage. The Tesseract OCR engine performed well on pristine documents but struggled to extract text from documents with noisy backgrounds. We had to write several image preprocessing steps to raise Tesseract's accuracy, and those extra steps slowed the overall document throughput of the system.
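As an illustration only (not the team's production code), the kind of preprocessing Tesseract needed on noisy scans usually amounts to grayscale conversion, denoising, and binarization, sketched here with OpenCV and pytesseract:

```python
# Illustrative sketch: cleaning up a noisy scan before Tesseract OCR.
import cv2
import pytesseract


def ocr_with_preprocessing(image_path: str) -> str:
    """Grayscale, denoise, and binarize a scanned page, then run Tesseract on it."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Median blur suppresses the speckle noise typical of aged or damaged paper.
    denoised = cv2.medianBlur(gray, 3)
    # Otsu's method picks a global threshold that separates ink from background.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)
```

Every extra pass over the image adds per-page latency, which is how this approach cost us throughput at scale.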
CRAFT and STR performed well on our test set. These algorithms could extract text from noisy images as well as from documents with slightly askew text. However, document throughput was poor: in most cases, our implementation was an order of magnitude slower than Tesseract and Amazon Textract, because CRAFT and STR rely on deep learning methods and are resource-intensive.
Our Amazon Textract solution required no image preprocessing before submission. Because Amazon Textract is a fully managed service, we did not have to host or maintain an OCR system, which both a Tesseract and a CRAFT/STR solution would have required. Finally, Amazon Textract achieved a document throughput that exceeded both Tesseract and CRAFT/STR while extracting text with accuracy on par with Tesseract plus its image preprocessing steps.
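For contrast, here is a hedged sketch of what "no preprocessing" looks like against Textract's synchronous API (suitable for single images; the production flow in Section 2.1 used the asynchronous job API instead). The file name is a placeholder:

```python
# Illustrative sketch: raw image bytes go straight to Textract, and text,
# confidence scores, and geometry come back in a single response.
import boto3

textract = boto3.client("textract")

with open("scanned_form.png", "rb") as f:
    response = textract.detect_document_text(Document={"Bytes": f.read()})

for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(f'{block["Confidence"]:.1f}%  {block["Text"]}')
```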
DL is a management and technology consulting firm known for delivering outcomes and ROI for agencies’ most complex business challenges. To find out how the DL Business Intelligence and Analytics Team can help you, visit our Data Management and Strategy page.