Jerome GRONDIN

SAP Intelligent RPA 2.0: Integrate the Document Information Extraction service to automatically extract data from documents

Hello RPA fellows!

 

In the previous blog post, I presented how you could integrate the SAP Document Information Extraction service (also called DOX) with SAP Intelligent RPA to extract data from PDF documents. Now that SAP Intelligent RPA 2.0 is officially released, I will show you how to do it again with the low-code approach of the new version of our solution.

 

To set up the service, please read the previous blog post. Once that is done, let’s dive into the real topic!

API service key

As a reminder, the service key for the DOX service should have the following structure:

{
	"url": "",
	"uaa": {
		"uaadomain": "",
		"tenantmode": "",
		"sburl": "",
		"clientid": "",
		"verificationkey": "",
		"apiurl": "",
		"xsappname": "",
		"identityzone": "",
		"identityzoneid": "",
		"clientsecret": "",
		"tenantid": "",
		"url": ""
	}
}

Let’s save the url, uaa.clientid, uaa.clientsecret and uaa.url values, as we will need them later.
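As a minimal sketch, the four values can be picked out of the parsed service key in one place. The helper name pickDoxCredentials, the property names of the returned object and the placeholder values are assumptions made for this illustration; only the service key attributes come from the structure above.

```javascript
// Hypothetical helper: pick the four values needed later from a parsed service key.
function pickDoxCredentials(serviceKey) {
    return {
        serviceUrl: serviceKey.url,                 // base URL of the DOX service
        clientId: serviceKey.uaa.clientid,          // used to request the token
        clientSecret: serviceKey.uaa.clientsecret,  // used to request the token
        authUrl: serviceKey.uaa.url                 // base URL of the token endpoint
    };
}

// Placeholder values only, not a real service key
const exampleKey = {
    url: 'https://dox.example',
    uaa: {
        clientid: 'my-client',
        clientsecret: 'my-secret',
        url: 'https://auth.example'
    }
};
const creds = pickDoxCredentials(exampleKey);
```

In a real automation these values would rather come from environment variables than from a script.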

Overview

How to extract data from a document?

First, let’s recall the main steps of the process when we extract data from a document using DOX:

  • First, we get the access token required to use the service.
  • Then we upload the document to the service. The service sends back the document ID, which will be used later.
  • Last, we try to access the document. If the service has finished processing it, we get the status DONE. Otherwise, the status is still PENDING and we need to wait a bit before trying again.

How to use a web service call activity?

The HTTP Call activity is designed to take an options object as input.

To make it easier, let’s create this options object with a Custom Script activity. The inputs and outputs of this activity are displayed in the screenshot below:

No input is needed, just an output of type Any (equivalent to an object). In the script editor, we just need to insert the following:

return {
    method: 'GET|POST',
    url:'',
    resolveBodyOnly: true,
    headers:{
        Authorization:''
    }
};

Note: With a POST query, the options might be more complex. Don’t hesitate to read the documentation for more details.

Note: To build this options object you might need to use data from previous activities. In that case, feel free to add input parameters.
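As an illustration of the note above, here is a sketch of the same options object built from input parameters. The function wrapper and the parameter names method, url and authorization are assumptions for this example. In a Custom Script activity, only the return statement would be needed, with the inputs declared on the activity.

```javascript
// Hypothetical helper mirroring a Custom Script activity with three input parameters.
function buildOptions(method, url, authorization) {
    return {
        method: method,            // 'GET' or 'POST'
        url: url,                  // endpoint to call
        resolveBodyOnly: true,     // return the service payload directly
        headers: {
            Authorization: authorization
        }
    };
}

// Placeholder values only
const options = buildOptions('GET', 'https://example.invalid/resource', 'Bearer abc');
```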

 

Important note: The resolveBodyOnly attribute lets us directly retrieve the result sent by the service as an object.

When it is set to false, the HTTP Call activity returns an object which is wrapped into another one. To be more precise, all the data returned by the service are contained in the body attribute of the output of the Call activity. Note that the content of this attribute is a stringified object. So to get it, let’s create another Custom Script activity, where the input is the response of the Call activity:

And in the script editor, we insert the following:

let json = JSON.parse(response);
return json;

In that case json would be an object, containing all the data sent by the web service activity. But depending on the case, we can also return something else (such as json.somedata).

But again, this last step is only needed when resolveBodyOnly is set to false.
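The two cases can also be handled by a single script. The sketch below assumes, as described above, that the wrapped output exposes the payload as a string in its body attribute. The helper name readPayload is hypothetical, and a payload that itself contains a top-level body attribute would be misread by this check.

```javascript
// Hypothetical helper: return the parsed payload whether or not the body was already resolved.
function readPayload(response) {
    // Wrapped output (resolveBodyOnly: false): the payload sits in response.body
    const isWrapped = response !== null && typeof response === 'object' && 'body' in response;
    const raw = isWrapped ? response.body : response;
    // A stringified payload needs JSON.parse; an object is returned as is
    return typeof raw === 'string' ? JSON.parse(raw) : raw;
}

// Placeholder payloads only
const fromWrapped = readPayload({ body: '{"id":"42"}' });
const fromObject = readPayload({ status: 'DONE' });
```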

 

At this point, each time we need to use the HTTP Call activity, we will implement the following structure:

 

Create the automation

OK. Now that we have a better understanding of the way we need to perform calls to web services, we can implement it in our context.

To keep things simple, all URLs, credentials and paths are hard-coded here. In real life, you should definitely create a configurable automation with environment variables to protect all sensitive information.
The path of the file can be set as an input of the automation so you can reuse it.

Generate the authentication token

First, let’s generate the authentication token which will be used to call DOX. As explained before, we have the following activities:

The script to generate the token options is detailed below:

return {
    method: 'GET',
    url:'https://xxxxx/oauth/token?grant_type=client_credentials',
    headers:{
        Authorization:'Basic xxxxx'
    }
};

where xxxxx in the URL is the uaa.url value mentioned in the first part of this blog post, and xxxxx in the Authorization header is the Base64-encoded string uaa.clientid:uaa.clientsecret.
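As a sketch, the Authorization value can be computed as follows. This uses the Node.js Buffer API. Whether Buffer is available inside a Custom Script activity is not confirmed here, so the value may need to be computed beforehand and stored in a variable. The helper name and the credentials are placeholders.

```javascript
// Hypothetical helper: build the Basic Authorization header value (Node.js Buffer API).
function buildBasicAuthHeader(clientId, clientSecret) {
    const pair = clientId + ':' + clientSecret;
    return 'Basic ' + Buffer.from(pair, 'utf8').toString('base64');
}

// Placeholder credentials only
const basicHeader = buildBasicAuthHeader('user', 'pass'); // 'Basic dXNlcjpwYXNz'
```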

Tip: as we will use the token several times, we can create a string variable and store the token in it. See below:

The value would be:
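A minimal sketch of such a value, assuming the token call returns the standard OAuth 2.0 attribute access_token and that the service expects a Bearer authorization header (both are assumptions, not confirmed by this post). The helper name buildBearerValue is hypothetical.

```javascript
// Hypothetical helper: turn the token response into an Authorization header value.
// Assumes the standard OAuth 2.0 attribute access_token is present in the response.
function buildBearerValue(tokenResponse) {
    return 'Bearer ' + tokenResponse.access_token;
}

// Placeholder response only
const token = buildBearerValue({ access_token: 'abc', token_type: 'bearer' });
```

Storing the full header value in the variable would let the later scripts use Authorization: token directly.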

Upload the document

To upload the document, we will use the same pattern:

  • Generate the options
  • Make the HTTP call using a POST request
  • Get the document ID which is sent back by DOX

To generate the options, we are using the following code:

return {
    method: 'POST',
    url:'https://aiservices-dox.cfapps.eu10.hana.ondemand.com/document-information-extraction/v1/document/jobs',
    headers:{
        Authorization: token
    },
    metadata:[
        {
            name:'file',
            file:'C:/Temp/invoice.pdf',
            type:'application/pdf'
        },{
            name:'options',
            value:'{"extraction":{"headerFields":["documentNumber","taxId","taxName","purchaseOrderNumber","shippingAmount","netAmount","grossAmount","currencyCode","receiverContact","documentDate","taxAmount","taxRate","receiverName","receiverAddress","deliveryDate","paymentTerms","deliveryNoteNumber","senderBankAccount","senderAddress","senderName"],"lineItemFields":["description","netAmount","quantity","unitPrice","materialNumber"]},"clientId":"c_00","documentType":"invoice","receivedDate":"2020-02-17","enrichment":{"sender":{"top":5,"type":"businessEntity","subtype":"supplier"},"employee":{"type":"employee"}}}',
            type:'text/json'
        }
    ]
};

Note: In the metadata attribute, you need to provide the file path of the document and its type (in this case, as it is a PDF document, we are using the application/pdf type).
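The long options string is easier to maintain when it is generated from an object. Here is a sketch under that assumption: the helper name buildOptionsPart is hypothetical, and the reduced field lists are only an illustration. Whether the service accepts this reduced set has not been checked here.

```javascript
// Hypothetical helper: build the 'options' entry of the metadata array from a plain object.
function buildOptionsPart(extractionOptions) {
    return {
        name: 'options',
        value: JSON.stringify(extractionOptions), // the value must be a JSON string
        type: 'text/json'
    };
}

// Reduced field lists, for illustration only
const optionsPart = buildOptionsPart({
    extraction: {
        headerFields: ['documentNumber', 'grossAmount', 'currencyCode'],
        lineItemFields: ['description', 'quantity', 'unitPrice']
    },
    clientId: 'c_00',
    documentType: 'invoice'
});
```

The resulting object can replace the second entry of the metadata array shown above.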

Retrieve the data from the document

At this point, the token to use the service is generated, and the document is uploaded. Now the fun part begins!

We know that the service might take a while to process the document, but we do not know exactly how long. The only solution we have is to periodically ask the service about the status of the processing: if it is PENDING, then we need to wait a few seconds and retry. Else (if it is DONE) that means the service has extracted the data, and we can retrieve them.

 

But first things first: let’s create a datatype with 2 attributes:

  • a string named Status to store the status of the processing of the document
  • a complex object named Data (type = Any) to store the result of the processing of the document

As we know it might take a while, let’s set the Status to PENDING first.

Then, to implement the wait & retry feature, we need to insert a Forever activity where the condition would be:

if (dtDox.Status !== 'PENDING'){
    // break loop
} else {
    // wait and retry
}

So we have:

In the activity Generate get options, we have the following code:

return {
    method: 'GET',
    url:'https://aiservices-dox.cfapps.eu10.hana.ondemand.com/document-information-extraction/v1/document/jobs/' + docId + '?clientId=c_00',
    responseType: 'json', 
    resolveBodyOnly: true,
    headers:{
        Authorization: token,
        'Cache-Control': 'no-cache'
    }
};

where docId is the document ID returned by the upload in the previous section, and token is… well, you get the idea!

To get the result, we can store result.status and result.extraction in the corresponding attributes of the instance of the datatype we created before (result being the name of this instance of the datatype).

Now, if the status is DONE, we know that result.extraction will contain the data from the document (see this documentation for more details).
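The wait & retry logic can be sketched in plain JavaScript as follows. The callbacks getJob and wait are hypothetical stand-ins for the HTTP Call activity and a wait step, and maxAttempts is an addition of this sketch to avoid looping forever (the Forever activity itself has no such limit).

```javascript
// Hypothetical sketch of the loop: poll until the status is no longer PENDING.
function pollUntilProcessed(getJob, wait, maxAttempts) {
    // Same shape as the datatype: Status starts as PENDING, Data is empty
    const dtDox = { Status: 'PENDING', Data: null };
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const result = getJob();              // stands in for the GET call
        dtDox.Status = result.status;
        if (dtDox.Status !== 'PENDING') {
            dtDox.Data = result.extraction;   // filled once processing is over
            break;                            // break loop
        }
        wait();                               // wait and retry
    }
    return dtDox;
}
```

If the limit is reached, the returned Status is still PENDING, so the caller can decide whether to fail or to keep waiting.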

Note: according to the documentation, you will be able to access result.extraction.headerFields and result.extraction.lineItems. Both are arrays, so you can loop over each of them to display the name and the value of each extracted field.
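Such a loop could look like the sketch below. It assumes that every field carries a name and a value attribute and that each line item is itself an array of fields. Both points should be checked against the documentation. The helper name describeExtraction and the sample data are made up for this example.

```javascript
// Hypothetical helper: list the name and value of every extracted field.
function describeExtraction(extraction) {
    const lines = [];
    for (const field of extraction.headerFields) {
        lines.push(field.name + ': ' + field.value);
    }
    extraction.lineItems.forEach(function (item, index) {
        // Assumption: each line item is an array of fields
        for (const field of item) {
            lines.push('item ' + (index + 1) + ' / ' + field.name + ': ' + field.value);
        }
    });
    return lines;
}

// Made-up sample data
const sampleLines = describeExtraction({
    headerFields: [{ name: 'documentNumber', value: 'INV-1' }],
    lineItems: [[{ name: 'description', value: 'Widget' }, { name: 'quantity', value: 2 }]]
});
```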

Final result

And voilà! Here is what you should have:

Of course, after the loop you can log the content of the result if you want to.

Conclusion

With some experience, building this automation should not take more than half an hour, which is far less than what was needed with the previous version of SAP Intelligent RPA. But what is important here is that you did not have to write lots of code to complete this automation (only the options for each HTTP call)!

 

Also…

Don’t forget to check out the SAP Document Information Extraction documentation, as there are some new features since my last blog post (it now supports the JPEG and PNG formats!). You might be interested in it!

Last, you can find a sample on the Store.

 


      9 Comments
      Pierre COL

      Well done Jérôme!

      Hallo Nadine Hoffmann & Jana Wuerth 😉

      Jana Wuerth

      Nice how-to guide Jérôme! 🙂

      Nadine Hoffmann

      Thanks for sharing all the insights Jerome GRONDIN - good to see the interaction of SAP AI Business Services and SAP Intelligent RPA 2.0!

      Ramichetty Mahesh

      A small clarification: why is a REST service required to get the details of the invoice? If the input parameters being extracted change, then a change in the REST service is mandatory. And if the invoices come from different clients in different formats, then the respective REST service needs to be invoked. Over time this may become legacy. Instead, OCR, which is configuration-based, could be used for the same purpose when creating the automation process.

      Tomasz Janasz

      Hi Ramichetty, I am not sure I am getting your point correctly. Document Information Extraction features a global model for invoices (and also payment advice) with standard capabilities (header fields, line items, languages). OCR is an integral part of the service. The API reference can be found here: https://help.sap.com/viewer/5fa7265b9ff64d73bac7cec61ee55ae6/SHIP/en-US/ded7d34e60f1422ba2e04e892a7f0e25.html

      Best!

      Ramichetty Mahesh

      Thanks for sharing the URL of the REST API that has been built to get the invoice details (parameters). The invoices will be in different formats, e.g.:

      name:'options',
                  value:'{"extraction":{"headerFields":["documentNumber","taxId","taxName","purchaseOrderNumber","shippingAmount","netAmount","grossAmount","currencyCode","receiverContact","documentDate","taxAmount","taxRate","receiverName","receiverAddress","deliveryDate","paymentTerms","deliveryNoteNumber","senderBankAccount","senderAddress","senderName"],"lineItemFields":["description","netAmount","quantity","unitPrice","materialNumber"]},"clientId":"c_00","documentType":"invoice","receivedDate":"2020-02-17","enrichment":{"sender":{"top":5,"type":"businessEntity","subtype":"supplier"},"employee":{"type":"employee"}}}',
                  type:'text/json'

      Not all invoices will have the same header values. If a centralized RPA bot has to process invoices with different fields (for example Material ID in place of the material number), then the same REST service will not work, as it may throw a "Null Pointer Exception" because the expected field would not have the expected getter value. Why can't these values be configuration-driven? This is how we have customized our platform, which is being released towards the end of Feb 21.

      Noah Weiprecht

      Hi, I am having problems uploading the PDF. The error is "irpa_core.error.RequestError: Response Code 400(Bad Request)" and I did not change anything in your code in Generate The Options module. The authentication with the access token works and the input and output parameters of Generate The Options are correct. But on the next call web service nothing is output on the variable obj. Are there any other parameters or modules I could add to make the upload work?

      Jerome GRONDIN
      Blog Post Author

      Hello,

       

      Please have a look at the "options" value. Something might have changed on the service side, explaining why your call is failing.

       

      Regards,

      J.

      Indrajit shah

      Hi Jerome GRONDIN

       

      Can you look at this issue on Extract Data With Template,

      https://answers.sap.com/questions/13499519/how-to-work-with-multiple-extract-data-with-templa.html