SAP announced Data Intelligence last year; here is an overview of what it is all about.



One key thing to note here is that the set of pre-trained models (Image OCR, Image Classification, Object Detection, etc.) that SAP offered earlier under SAP Leonardo Machine Learning Foundation - Functional Services has been deprecated and is now included with SAP Data Intelligence.

SAP now also offers business services for artificial intelligence and machine learning, catering to specific business cases, under the SAP AI Business Services portfolio on SAP Cloud Platform. Take a look at the excellent blog by joni.liu on using the Document Information Extraction service trial. This service intelligently extracts information via OCR from certain business documents, such as invoices and payment advices.

In this blog post I will touch upon

  • a simple scenario that I implemented using the ML Functional Services OCR API before it was deprecated and moved into SAP Data Intelligence, and

  • how I use MLFS within SAP Data Intelligence for the same scenario.


Scenario:

As we all know, companies process and extract data from huge volumes of scanned documents like invoices, RFPs, payment advices, and CoAs, and this is mostly done manually. Naturally, this consumes many man-hours and is repetitive and exhausting, hence error-prone and cost-intensive. It is of course not a revolutionary thought to ask: what if we could automate this process to free people up for other, more complex tasks? Enter RPA + ML. This is now a reality: automated data extraction and post-processing of PDF documents has been implemented by a multitude of businesses.

The How?

I am going to keep SAP iRPA out of scope for this blog post. For automated document processing in this scenario, I use the Optical Character Recognition API to extract data from a document and process it so that it can be saved and put to use downstream in the business process.

The main challenge, however, is that companies receive and process many unstructured documents each day from different sources. Each document looks different and varies in structure. This makes it increasingly difficult to determine the type of a document, how to process the extracted data, and how to weed out unnecessary data to get exactly the required information in a meaningful way. This is where the Document Information Extraction service is extremely useful: based on the document type, it is intelligent enough to know exactly what information needs to be extracted and how.

However, what about the document types that are not supported by this API? Traditionally, this problem has often been solved by using zonal OCR.

What is zonal OCR? Zonal OCR is a way of using OCR to read specific zones in a document. Zones of interest are defined in the document using coordinates; OCR then reads the characters in each zone, and the text is correlated with the business information that needs to be extracted. The assumption, of course, is that a specific type of document always follows the same template and that the information to be read always stays within the predefined area. Defining the zones is a one-time effort for each template.

Part 1: With this background, let's take a quick look at how I implemented zonal OCR using the SAP OCR API.


Take a brief look here at the SAP MLFS OCR API definition.

This API used the tesseract-ocr engine underneath. Tesseract OCR provides multiple output formats, including plain text and hOCR. Accordingly, the SAP MLFS OCR API also exposes two output types: 1. plain text and 2. xml (which is the hOCR). hOCR is a predefined open standard format that provides not only the text identified through OCR but also the positional information of that text. You can take a look at the format on the wiki here. With this background, let's look at the approach with a diagram:



Let's get a better understanding of the diagram.
Step 1.1 Convert the PDF to an image

There are many tools out there to convert a PDF to an image, e.g. ImageMagick or Ghostscript. I used ImageMagick on my Mac to convert the PDF into a PNG. Make sure your image is of good resolution and set the density/dpi to 300.
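For reference, here is a minimal sketch of this conversion step in Python, using the Wand binding for ImageMagick. The file names are illustrative, and I am assuming Wand plus Ghostscript are installed; the CLI equivalent would be something like convert -density 300 invoice.pdf invoice.png.

# Minimal sketch of the PDF-to-image step, assuming the Wand binding
# for ImageMagick (pip install Wand) and Ghostscript are installed.
# File names are illustrative.
from wand.image import Image

# resolution=300 corresponds to the recommended 300 dpi density
with Image(filename="invoice.pdf", resolution=300) as pdf:
    with pdf.convert("png") as png:
        # a multi-page PDF would be saved as invoice-0.png, invoice-1.png, ...
        png.save(filename="invoice.png")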
Step 1.2 Define zones in the PDF in the form of bounding boxes.

I used VoTT to define zones in the form of keys and bounding boxes. See an example screenshot below.
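VoTT exports its annotations as JSON. As an illustration, here is a hypothetical helper that maps a VoTT-style region (left/top/width/height) to the posLeft/posTop/posRight/posBottom format my parser expects later in this post; the exact export schema is an assumption and may differ between VoTT versions.

# Hypothetical converter from a VoTT-style bounding box
# (left/top/width/height) to the parser's bbox format.
# The VoTT export field names are assumptions.
def vott_region_to_bbox(region):
    bb = region["boundingBox"]
    return {
        "posLeft": int(bb["left"]),
        "posTop": int(bb["top"]),
        "posRight": int(bb["left"] + bb["width"]),
        "posBottom": int(bb["top"] + bb["height"]),
    }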


Step 2 Call SAP MLFS OCR API to get hOCR output

Here are the input parameters I used to call the ML OCR API:

{"lang": "en,de", "outputType": "xml", "pageSegMode": "1", "modelType": "lstmStandard"}

Note that I used the 'xml' output type to get the hOCR; the rest of the parameters are based on the structure of my PDF file.
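For illustration, a sketch of such a call in Python is shown below. The endpoint URL, header name, form field names, and response shape are assumptions that depend on your subscription and the API definition linked above, so adapt them accordingly.

# Sketch of the OCR call with the requests library. The endpoint URL,
# the APIKey header, the form field names, and the response shape are
# assumptions - check the API definition for your landscape.
import requests

OCR_URL = "https://<mlfs host>/ocr"  # hypothetical endpoint
API_KEY = "<your api key>"

options = '{"lang": "en,de", "outputType": "xml", "pageSegMode": "1", "modelType": "lstmStandard"}'

with open("invoice.png", "rb") as image_file:
    response = requests.post(
        OCR_URL,
        headers={"APIKey": API_KEY},
        files={"files": image_file},
        data={"options": options},
    )
response.raise_for_status()
hocr_xml = response.text  # the hOCR/xml payload; exact shape is an assumption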

Here is a snapshot of how my xml output looked:


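In essence, every recognised word in that hOCR output is an ocrx_word span that carries its coordinates in a title attribute, roughly like this (a hand-written sample for illustration, not my actual output):

<span class='ocrx_word' id='word_1_12' title='bbox 279 1101 792 1218; x_wconf 96'>Product</span>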

Note: When converting the PDF to an image for defining zones, the image should be the same size as the page dimensions received from the API in the xml/hOCR output. See the highlighted rectangle above for the dimensions; here it is 2550 x 3300 px.
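If the annotation image and the OCR page size do not match, the zone coordinates have to be rescaled. A small sketch of that correction (the function name is mine):

# Rescale a zone from the annotation image size to the page size
# reported in the hOCR output (e.g. 2550 x 3300 px).
def scale_bbox(bbox, src_w, src_h, dst_w, dst_h):
    sx, sy = dst_w / src_w, dst_h / src_h
    return {
        "posLeft": int(bbox["posLeft"] * sx),
        "posTop": int(bbox["posTop"] * sy),
        "posRight": int(bbox["posRight"] * sx),
        "posBottom": int(bbox["posBottom"] * sy),
    }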



Step 3 Use the zones defined for the document, and parse the xml/hOCR output to get the data within those zones.

I wrote a custom parser (in my case in Python) and hosted it on my SCP Cloud Foundry account.

Parser endpoint: https://<scp cf url>/parseBBoxOCR

Parser Input Params:

  • hOCR : the hOCR output received from the MLFS OCR API in the previous step.

  • bbox : JSON input of regions for which values should be extracted. Example:


{
  "regions": [
    {
      "id": "product",
      "key": {
        "id": "product_key",
        "boundingBox": {
          "posLeft": 279,
          "posTop": 1101,
          "posRight": 792,
          "posBottom": 1218
        }
      },
      "value": {
        "id": "product_value",
        "boundingBox": {
          "posLeft": 796,
          "posTop": 1103,
          "posRight": 1295,
          "posBottom": 1219
        }
      }
    },
    {
      "id": "customer",
      "key": {
        "id": "customer_key",
        "boundingBox": {
          "posLeft": 1290,
          "posTop": 1102,
          "posRight": 1799,
          "posBottom": 1226
        }
      },
      "value": {
        "id": "customer_value",
        "boundingBox": {
          "posLeft": 1796,
          "posTop": 1105,
          "posRight": 2299,
          "posBottom": 1226
        }
      }
    }
  ]
}
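To give an idea of what the parser does internally, here is a simplified sketch, not my exact production code: it walks the hOCR, reads each word's bounding box from the title attribute, and collects the words whose centre falls inside a region's value box.

# Simplified sketch of the zonal parsing logic. In hOCR, each
# recognised word is a span with class 'ocrx_word' and a title
# attribute like "bbox x0 y0 x1 y1".
import re
from xml.etree import ElementTree as ET

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

def extract_zones(hocr_xml, regions):
    root = ET.fromstring(hocr_xml)
    results = {r["id"]: [] for r in regions}
    for elem in root.iter():
        if elem.get("class") != "ocrx_word":
            continue
        match = BBOX_RE.search(elem.get("title", ""))
        text = (elem.text or "").strip()
        if not match or not text:
            continue
        x0, y0, x1, y1 = map(int, match.groups())
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # word centre
        for region in regions:
            bb = region["value"]["boundingBox"]
            if (bb["posLeft"] <= cx <= bb["posRight"]
                    and bb["posTop"] <= cy <= bb["posBottom"]):
                results[region["id"]].append(text)
    return {rid: " ".join(words) for rid, words in results.items()}

Called with the bbox JSON above, extract_zones(hocr_xml, bbox["regions"]) would return something like {"product": "...", "customer": "..."}.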

Quick Demo Flow


Here's a quick demo flow to show the final outcome:

  1. Select template




2. Apply OCR on the template and request the output type as xml/hOCR. As you can see from the screenshot below, the API provides the extracted text along with additional attributes, like the bounding box location on the document.



3. Call the parser with the hOCR output and the pre-defined zones. As you can see from the screenshot below, the data within each zone can be extracted using this approach.




With this blog post I have tried to give some insight into one of the approaches I took for OCR and some of the challenges that came along. There are certainly other ways; I would be interested in hearing about them and would appreciate it if readers posted other approaches as comments on this blog.


In my next blog post, I will talk about:

  • using the OCR Functional Service inside Data Intelligence,

  • calling the same parser hosted on my SCP from inside DI, and

  • exposing this complete execution as an Open API using Swagger.

