A Training-Free and Layout-Agnostic Approach to Extract Key-Value Pairs from Business Documents
Business documents are a cornerstone of any business transaction. Invoices, receipts, lease agreements, and bills of lading are some examples. The information contained in these documents is critical for maintaining the integrity of the underlying business transaction and for downstream processes. Until recently, extracting the data from such documents and entering it into the business system was a manual process. In recent years, machine learning models have been developed to extract this data automatically: a quick fix to a laborious endeavor. However, there is no free lunch! Such models are data hungry, and thousands of annotated documents must be provided for model training. This creates a bottleneck for business users, who have to wait a long time to accumulate enough documents and prepare them for training. This requirement typically puts a damper on efforts to automate the manual process.
To learn more about the need for business document processing and the information extraction process, read on. Hopefully, by the end of the post, I will have convinced you of the proposed approach.
Note 1: All the content depicted in the bill-of-lading figures in this blog post is entirely fictitious. It was created based on an actual bill of lading while keeping the same alphanumeric word formats.
Note 2: This blog post is meant to illustrate a viable information extraction approach, not to detail all the particulars or gotchas within the process. Some of the steps discussed below invariably need image processing expertise to resolve.
There is no doubt that information is the oil of businesses worldwide. The common way of sharing information is in the form of documents, images, and videos. No matter what business you are in, there is no getting away from physical documents in your business process. At least not yet.
Until a few years ago, information extraction from images was rarely heard of in industrial process automation. With image processing algorithms and hardware that can process data faster than ever before, we are increasingly asking whether we can extract information from unstructured documents like images and PDFs and tabulate it in a structured format to facilitate downstream processes.
Layouts, content, and information structure vary widely across documents, especially when multiple business partners are involved in the business process. There is an enormous amount of information locked in unstructured form which, when rightly used, could drive the informed, data-driven decisions that a company can benefit from. According to a study conducted by International Data Corporation (IDC), 80% of worldwide data will be unstructured by 2025. Most organizations are already at this point. This raises the question: how do we design the information extraction process to effectively filter out the unwanted information (noise) from the surge of unstructured data that companies are going to receive over time?
How do corporations extract information today?
To extract information from documents, companies traditionally resort to manual processes: reviewing and analyzing the documents, identifying relevant fields, and entering data into computer systems. While solutions exist to extract information from semi-structured ‘True-PDF’ documents (e.g., e-invoices), these solutions do not cater to fully unstructured or ‘Image-Only’ document types.
A few examples below present the magnitude of the problem:
A global food corporation that deals in sourcing, storing, and trading grains receives thousands of trucks a day during harvest season. The trucks carry grains and other agricultural produce from several parts of the U.S. to the company’s storage plants. The truck drivers carry scale tickets, bills of lading, rate confirmation documents, etc. The storage plant personnel collect these documents and cross-check the vendor information and the agreed-upon rates to pay the drivers. A storage plant receives thousands of such documents on a given day, and they need to be processed manually and in a timely manner to make the payments. This is a herculean task.
Another example: a pharmaceutical company procures raw material for manufacturing medicines. Upon delivery, truck drivers hand the material over to plant personnel, who collect documents such as scale tickets, certificates of origin, certificates of analysis, and records of the material’s moisture content and chemical composition, which might change over the duration of the trip. The information from these documents needs to be processed immediately to decide whether to accept the delivery. Yet this time-critical processing is currently manual.
Yet another example is an oil and gas company that regularly purchases biofuel to be blended with other petroleum fuels. During this transaction, the company is provided with a product transfer document (PTD for short) that authenticates the transfer of fuel between parties. A PTD may include bills of lading, invoices, contracts, meter tickets, rail inventory sheets, etc. These documents are subject to Environmental Protection Agency (EPA) inspections and need to be processed and captured accurately. Thousands of these documents are collected and processed on a daily basis.
The processes above are only some examples of companies across industries having to process enormous amounts of information from different types of documents. Manual keying is a time-consuming, error-prone, and mundane task, and sometimes it is also a legal obligation.
As we have seen from the examples above and in a plethora of other cases, despite living in digital times, companies still use tons of physical documents and rely on human labor to extract information from them. Though some business processes use digital copies of scanned documents and PDFs, there are no automated tools that extract the information without substantial training. This is especially true where the business has to deal with myriad document types and layouts. All these challenges lead to bottlenecks in service or payment delivery, increase the risk of a poor organizational response, and consume time that should be spent on core business and value-added tasks. Organizations struggling to keep up with the data deluge are turning to a relatively new set of technology solutions to handle the rush of unstructured data.
The solution is the digitization of documents and the automation of data flows to lower costs, improve operational efficiency and quality, and increase flexibility. Companies now understand that digital transformation is no longer an option; it is imperative. Capturing documents (via imaging) or capturing data (via electronic forms, or e-forms) at the point of origin or receipt gets the information into the process faster, enabling process workers to access it far sooner.
In order to create an automation workflow, one needs to extract information from document images that are characterized by heterogeneity in document formats or layouts. This makes it extremely difficult for a single algorithm to address the problem of data extraction from any arbitrary document. The ubiquity of business documents in industry and the challenge described above make the problem of data extraction extremely attractive for researchers and practitioners.
A training-based approach
Unstructured documents such as invoices and bills of lading have a certain order of word or term occurrence for entities such as headers and key-value pairs like date, total amount, tax, transaction ID, etc. Modern natural language processing models use supervised learning techniques such as Convolutional Neural Networks and Recurrent Neural Networks to identify and extract specific entities of interest. As the majority of documents are unstructured in nature, the variability in document sources and layouts makes it challenging to define ways to extract relevant content. The recent breakthroughs in automatic content extraction, propelled by advancements in deep neural networks, are hugely data hungry and computationally expensive. These de facto methods pose challenges to users or customers, who are required to provide:
- thousands of sample documents to train the extraction model
- annotations for the fields of interest, which are not commonly available and must be created explicitly at great cost, and
- a complete re-training of the model whenever they need to add an additional ‘key’ for ‘value’ extraction.
So, how do you teach machines to extract information without annotation and training data?
The answer is to use Optical Character Recognition (OCR) coupled with regular expression definitions to extract the relevant content from unstructured documents.
A training-free approach
I would like to emphasize that the goal is to demonstrate that, depending on the problem we are trying to solve, good performance can also be achieved with alternative, training-free methods. Specifically, I focus on extracting key-value pairs from unstructured business documents such as scanned bills of lading.
Let’s say you were given a physical bill-of-lading document and asked to build a model to extract key-value pairs: in this case, ‘ticket type’, ‘shipment date’, ‘gross weight’, ‘tare weight’, and ‘net weight’. You were told that your model must extract these fields from documents coming from various vendors, with widely varying layouts that you have not seen so far.
Clearly, in this case you do not have enough data to teach a machine what to extract. Enter the Key-Value Pair Extractor!
The steps below will walk you through the solution approach.
Define the keys to be extracted and the regular expressions that describe their associated values.
Depending on the need, you can optionally add ‘synonyms’ for a particular key in the application’s configuration pane. Synonyms share the value’s regular expression among multiple keys, either within the same document or across different documents.
Example: The key entities ‘Customer’ and ‘Vendor’ may share the same value expression.
On saving the configuration, the application creates a JSON file with all the specified fields, which is then used in the information extraction process.
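To make the configuration step more concrete, here is a minimal sketch of what such a JSON file and its lookup could look like. The schema (`keys`, `name`, `synonyms`, `value_regex`), the sample patterns, and the `build_lookup` helper are all assumptions made for illustration; the application's actual file format is not shown in this post.

```python
import json
import re

# Hypothetical configuration: each key carries the regular expression for its
# value, plus optional synonyms that share that expression.
config = {
    "keys": [
        {"name": "Shipment Date", "synonyms": ["Ship Date"],
         "value_regex": r"\d{1,2}/\d{1,2}/\d{2,4}"},
        {"name": "Gross Weight", "synonyms": ["Gross Wt"],
         "value_regex": r"\d{1,3}(?:,\d{3})*(?:\.\d+)?"},
        {"name": "Customer", "synonyms": ["Vendor"],
         "value_regex": r"[A-Z][A-Za-z&. -]+"},
    ]
}


def build_lookup(cfg):
    """Map every key name and synonym (lower-cased) to its compiled regex."""
    lookup = {}
    for entry in cfg["keys"]:
        pattern = re.compile(entry["value_regex"])
        for label in [entry["name"], *entry.get("synonyms", [])]:
            lookup[label.lower()] = pattern
    return lookup


# Round-trip through JSON, as if the file had been saved and loaded again.
serialized = json.dumps(config, indent=2)
lookup = build_lookup(json.loads(serialized))

# 'Customer' and 'Vendor' resolve to the same value expression.
print(lookup["vendor"] is lookup["customer"])
print(bool(lookup["ship date"].fullmatch("03/14/2021")))
```

Keeping the expression on the key entry and listing synonyms next to it means a new vendor label only needs one extra string in the file, with no change to the extraction logic.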
Capture a picture of the physical bill-of-lading document using a mobile phone or a scanner.
Once captured, the document may not always be in a shape or form suitable for extracting information. In this case, you can see that the document is tilted, inclined, and of poor contrast. It is transformed into a workable format using corner detection, image registration, and local histogram equalization techniques.
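As a rough illustration of the contrast part of this step, the sketch below applies histogram equalization independently to small tiles of a grayscale image held as a plain list of lists. This is a simplified stand-in and not the application's implementation: a production pipeline would normally rely on an image processing library, and corner detection and image registration are left out here. The function names and the tile size are assumptions.

```python
from collections import Counter


def equalize(pixels, levels=256):
    """Histogram-equalize a flat list of integer gray values in [0, levels)."""
    counts = Counter(pixels)
    total = len(pixels)
    cdf, running = {}, 0
    for value in sorted(counts):
        running += counts[value]
        cdf[value] = running          # cumulative count up to this gray value
    cdf_min = min(cdf.values())
    if total == cdf_min:              # flat region: nothing to stretch
        return list(pixels)
    scale = (levels - 1) / (total - cdf_min)
    return [round((cdf[p] - cdf_min) * scale) for p in pixels]


def equalize_local(image, tile=8):
    """Equalize each tile x tile block of a 2-D list of gray values on its own."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            cols = range(left, min(left + tile, width))
            block = [image[r][c] for r in rows for c in cols]
            flat = iter(equalize(block))
            for r in rows:
                for c in cols:
                    out[r][c] = next(flat)
    return out


# A tiny low-contrast patch: gray values squeezed into 100..103.
low_contrast = [[100, 101], [102, 103]]
stretched = equalize_local(low_contrast, tile=2)
print(stretched)
```

Equalizing per tile, instead of over the whole page, helps when a phone photo has uneven lighting, because each region is stretched using only its own histogram.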
Entity location and centroid determination
- Extract all words and word locations from the transformed document image. In this case, the open-source optical character recognition engine ‘Tesseract’ is used to extract the words and their coordinates.
- Check if the keys have multiple words. If so, identify and merge the multi-word keys to form a single entity.
- Calculate the centroid of every word or entity location, as shown in the example below. A centroid is the arithmetic mean of all the positions in a given shape (in this case, the rectangular bounding box). The red star in the center of the box in the following figure represents the centroid of that entity.
- Based on the resolution of the input document, set the radius of influence (e.g., one quarter of the document width in the x-direction) and draw a hypothetical circle centered on the key’s centroid. The respective value is searched for within this circle.
- Within the circle of influence around the key, identify the word or words near the key that satisfy the associated regular expression. It is not uncommon to find multiple results in this search.
- To narrow down the results, set a pre-defined priority based on the angle or direction between the key’s centroid and the identified value’s centroid. It is common to find the value either to the right of or below the key location. In this case, the priority angle ranges from 360° to 1° in a clockwise direction.
- Calculate pair-wise angles between the key and all the identified values that satisfied the regular expression.
- Identify the first value that satisfies the criteria based on the priority angle that is set in step 3.
- Break the process once the value has been identified and associate the output with the corresponding key. Note that in some cases the value may consist of multiple words. If only a few words are expected (e.g., entities of fewer than 5 words), they can be extracted using the regular expression and the distance between adjacent words. If more than a handful of words or multiple lines need to be extracted (e.g., entities of more than 5 words), they can be extracted using text-blob analysis, which uses morphological image processing techniques to determine the text blob.
- Continue the above process for the next key until you extract all the key-value pairs.
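The steps above can be sketched in plain Python. Treat this as one possible reading of the procedure, not the application's code. The `Word` class, `merge_key`, `find_value`, and the `tolerance` parameter are hypothetical names. The sketch assumes OCR words arrive in reading order with pixel bounding boxes, measures angles clockwise from the rightward direction (0° is directly right, 90° is directly below, since image y grows downward), and prefers the smallest such angle.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: float
    top: float
    width: float
    height: float

    @property
    def centroid(self):
        # Arithmetic mean of the rectangle: its center point.
        return (self.left + self.width / 2, self.top + self.height / 2)


def merge_key(words, phrase):
    """Merge consecutive OCR words that spell a multi-word key into one entity."""
    tokens = phrase.lower().split()
    for i in range(len(words) - len(tokens) + 1):
        run = words[i:i + len(tokens)]
        if [w.text.lower().rstrip(":") for w in run] == tokens:
            left = min(w.left for w in run)
            top = min(w.top for w in run)
            right = max(w.left + w.width for w in run)
            bottom = max(w.top + w.height for w in run)
            return Word(phrase, left, top, right - left, bottom - top)
    return None


def find_value(key, words, pattern, page_width,
               radius_fraction=0.25, tolerance=5.0):
    """Return the regex-matching word inside the circle with the best angle."""
    radius = radius_fraction * page_width      # radius of influence
    kx, ky = key.centroid
    hits = []
    for word in words:
        if not re.fullmatch(pattern, word.text):
            continue
        cx, cy = word.centroid
        distance = math.hypot(cx - kx, cy - ky)
        if distance > radius:
            continue
        angle = math.degrees(math.atan2(cy - ky, cx - kx)) % 360
        if angle >= 360 - tolerance:           # slightly above the line: treat as "right"
            angle = 0.0
        hits.append((angle, distance, word))
    if not hits:
        return None
    return min(hits, key=lambda hit: hit[:2])[2]   # smallest angle, then nearest


words = [
    Word("Gross", 100, 200, 50, 20), Word("Weight:", 155, 200, 60, 20),
    Word("45,320", 240, 201, 60, 20),
    Word("Tare", 100, 240, 40, 20), Word("Weight:", 145, 240, 60, 20),
    Word("12,100", 240, 241, 60, 20),
]
weight = r"\d{1,3}(?:,\d{3})*"
gross = merge_key(words, "Gross Weight")
tare = merge_key(words, "Tare Weight")
print(find_value(gross, words, weight, page_width=1000).text)
print(find_value(tare, words, weight, page_width=1000).text)
```

Both weights fall inside both circles here, so the angle priority is what separates them: each key picks the number on its own row and ignores the one above or below it.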
Below are some screenshots of the application that was built to define configurations, upload a document, and run and display the extracted content.
Application home page that displays filter, document upload options, and the information extracts
Information extracted from the document image along with the location of the extracts
The above techniques have been tested on a variety of documents, including bills of lading, consignment orders, and product transfer documents, with varying numbers of key fields for value extraction.
The performance is calculated by comparing the extracted values to the actual values present in the documents for the given ‘key’ field. The comparison is based on two text similarity metrics, namely normalized Levenshtein distance and Jaro-Winkler (JW) distance. Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to convert one string into another. Jaro-Winkler distance builds on the Jaro similarity, which counts the characters that two strings share within a limited window of each other and the transpositions among those matches, so swapping two nearby characters costs less than moving a character far from its original position. In addition, JW rewards a shared prefix, so differences at the start of a string are penalized more than differences at the end. The scores for both metrics are normalized such that 1 means an exact match and 0 means there is no similarity. It is worth noting that the error rates are a function of both the OCR (‘Tesseract’) outcome and the value identified by the above methods for a given key. All the error rates are then averaged by error type to demonstrate the overall effectiveness at the document level.
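For readers who want to reproduce this kind of scoring, here is a small self-contained sketch of both metrics as similarities in the range 0 to 1. Dividing the Levenshtein distance by the longer string's length is an assumption, since the post does not state which normalization was used. The prefix weight of 0.1 and the four-character prefix cap are the conventional Jaro-Winkler defaults.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """1.0 for identical strings, 0.0 when every character differs."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def jaro(a, b):
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, prefix_scale=0.1):
    """Jaro similarity boosted by the length of the shared prefix (max 4)."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


# A typical OCR slip: the digit 0 read as the letter O.
truth, extracted = "45,320", "45,32O"
print(round(levenshtein_similarity(truth, extracted), 3))
print(round(jaro_winkler(truth, extracted), 3))
```

Because the mistake sits at the end of the string, Jaro-Winkler scores this pair higher than the normalized Levenshtein similarity does. The same slip in the first character would get no prefix bonus.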
This work focuses on the development of an information extraction algorithm inspired by real-world business challenges, i.e., extracting relevant content as key-value pairs from unstructured documents, especially when there are too few document samples to train a machine learning model. The proposed algorithm is quite general in the sense that the methods discussed can be used irrespective of document type and layout. The results from several sets of experiments demonstrated the effectiveness of this approach on a variety of document types. Furthermore, because it requires minimal human intervention and integrates easily into business systems, this application is relevant wherever decisions depend on business data locked in unstructured form.