Jerome GRONDIN

Combining Document Information Extraction and Intelligent RPA to automatically extract data from PDF documents

Reading a PDF to extract and structure the data it contains can be tedious work, especially when you have hundreds of files to process, and it can lead to many errors, with all the consequences they entail.

In this blog post, we will see how it is possible to combine Intelligent RPA and Document Information Extraction to automate the processing of your PDF documents.

If you have never heard of Document Information Extraction, here is a short introduction.

Document Information Extraction (also commonly called DOX) is a service you can use to process documents that have content in headers and tables. Typically, you can use it to extract data from invoices, or payment notes. With such a service you can upload a PDF document and get the extracted data as a JSON object.

You can find all kinds of useful information regarding the service on this page.

But first things first: let’s set up the service so we can use it.

Set up the DOX service

You need an SAP Cloud Platform (SAP CP) global account and a CPEA (Cloud Platform Enterprise Agreement) license.

  1. Create a subaccount in your global account using the SAP Cloud Platform Cockpit (of course, you can use the same subaccount as the one you’re using for Intelligent RPA)
  2. Create a Space
  3. Create a Service Instance of DOX (you might need to add a Service Plan for this service to your subaccount first)
  4. Create a Service Key (note that you can find more information about this procedure here)
  5. Once the Service Key is created, you should have a JSON object with the following structure:
{
	"url": "",
	"uaa": {
		"uaadomain": "",
		"tenantmode": "",
		"sburl": "",
		"clientid": "",
		"verificationkey": "",
		"apiurl": "",
		"xsappname": "",
		"identityzone": "",
		"identityzoneid": "",
		"clientsecret": "",
		"tenantid": "",
		"url": ""
	}
}

Let’s save the url, uaa.clientid, uaa.clientsecret and uaa.url as we will need them later.
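
The four values can also be read programmatically from the Service Key. The following is a minimal sketch in plain JavaScript (outside the Desktop SDK), assuming the Service Key has been parsed into an object with the structure shown above. The function name pickDoxSettings and the returned property names are hypothetical.

```javascript
// Pick the four values the bot needs out of a parsed Service Key.
// Assumption: serviceKey has the structure shown above (url + uaa.*).
function pickDoxSettings(serviceKey) {
  const uaa = serviceKey.uaa || {};
  const settings = {
    url: serviceKey.url,            // base URL of the DOX service
    clientid: uaa.clientid,         // OAuth client ID
    clientsecret: uaa.clientsecret, // OAuth client secret
    tokenUrl: uaa.url               // base URL of the token endpoint
  };

  // Fail early if one of the values is empty or missing.
  Object.keys(settings).forEach(function (name) {
    if (!settings[name]) {
      throw new Error('Service Key is missing a value for: ' + name);
    }
  });
  return settings;
}
```

Failing early on an empty value makes a copy/paste mistake in the Service Key visible before the first web service call is made.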

Create a bot

Create a new project and a new workflow.

Set up the variables

  1. Create input/output parameters in the context of the scenario
  2. Insert the Set Context activities as shown below
  3. For each one of them, set the value in the dedicated variable (see below):
    • Set the credentials (see below for the properties).
      Don’t forget to replace uaa.clientid and uaa.clientsecret with the values from the Service Key you created before.
    • Set the tokenURL variable: don’t forget to replace uaa.url with the value from the Service Key.
    • Set the doxURL variable: replace the url value with the one from the Service Key. The last part was inserted as requested by the documentation.
    • Lastly, set the path of a PDF file. Ex:

Web services to use DOX

  1. Insert a web service activity, whose properties are shown below:
  2. Insert another web service activity. This one will be used to send the PDF document to the DOX service. The properties of this activity are shown below:
  3. Finally, insert one last web service activity, and provide the following properties:

 

You should have the workflow below:

Now, build the solution. We’ll have to make some adjustments in the JavaScript code which is generated.

Note: in this case, the values of the clientid and clientsecret were hard-coded in the property value of the Set Context activity. For better security, you can of course store these values in a Factory credentials variable and retrieve them in the workflow, as explained in this blog.
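
The token request sends the credentials variable as its Authorization header. The following is a minimal sketch of how such a value could be built, assuming the token endpoint expects HTTP Basic authentication with the client ID and client secret (a common choice for the OAuth client-credentials grant, but an assumption here). It uses the Node.js Buffer API, which is also an assumption and is not part of the Desktop SDK; the function name is hypothetical.

```javascript
// Build an HTTP Basic Authorization header value from the Service Key values.
// Assumption: the token endpoint accepts "Basic base64(clientid:clientsecret)".
function buildBasicAuthorization(clientId, clientSecret) {
  const encoded = Buffer.from(clientId + ':' + clientSecret, 'utf8').toString('base64');
  return 'Basic ' + encoded;
}
```

Whatever way the value is built, keeping the client ID and secret out of the workflow definition, as suggested in the note above, avoids shipping them with the project.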

Update the code

Since we cannot provide all the required options through the web service activities, we need to add them manually in the generated code.

  1. Update the web service call that generates the token so that it looks as shown below:
    // ----------------------------------------------------------------
    //   Step: Generate_token
    // ----------------------------------------------------------------
    GLOBAL.step({ Generate_token: function(ev, sc, st) {
    	var rootData = sc.data;
    	ctx.workflow('ExtractDataFromPDF', '7e436b4c-226a-4a50-93c3-4948023e62db') ;
    	// Generate token
    	ctx.ajax.call({
    	  url: rootData.WS.Input.ServiceKey.tokenURL +  '/oauth/token?grant_type=client_credentials',
    	  method: e.ajax.method.get,
    	  contentType: e.ajax.content.json,
    		header:{
    			Accept: e.ajax.content.json,
    			Authorization: rootData.WS.Input.ServiceKey.client
    		},
    	  success: function(res, status, xhr) {
    	    sc.localData.token = 'Bearer ' + ctx.get(res, 'access_token');
    		sc.endStep(); // Upload_file
    		return;
    	  },
    	  error: function(xhr, error, statusText) {
    	    ctx.log(' ctx.ajax.call  error: ' + statusText);
    	  }
    	});
    }});

    The data attribute has been removed and the header has been inserted. Also, in the success callback, we need to extract only the information we need to set it in the sc.localData.token variable.

  2. When we upload the PDF file, we should use the following code instead of the automatically generated one:
    // ----------------------------------------------------------------
    //   Step: Upload_file
    // ----------------------------------------------------------------
    GLOBAL.step({ Upload_file: function(ev, sc, st) {
    	var rootData = sc.data;
    	ctx.workflow('ExtractDataFromPDF', '0a1f4176-8907-4544-a807-65095d30ea36') ;
    	// Upload file
    	ctx.ajax.call({
    	  url: rootData.WS.Input.ServiceKey.doxURL +  '/document/jobs',
    	  method: e.ajax.method.post,
    	  formData: [{
    			file:rootData.WS.Input.filePath,
    			type:e.ajax.content.pdf,
    			name:'file'
    		},{
    			value:ctx.json.stringify({"extraction":{"headerFields":["documentNumber","taxId","taxName","purchaseOrderNumber","shippingAmount","netAmount","senderAddress","senderName","grossAmount","currencyCode","receiverContact","documentDate","taxAmount","taxRate","receiverName","receiverAddress"],"lineItemFields":["description","netAmount","quantity","unitPrice","materialNumber"]},"clientId":"c_00","documentType":"invoice","enrichment":{"sender":{"top":5,"type":"businessEntity","subtype":"supplier"},"employee":{"type":"employee"}}}),
    			type:e.ajax.content.jsonText,
    			name:'options'
    		}],
    		header:{
    			Accept:e.ajax.content.json,
    			Authorization: sc.localData.token
    		},
    	  contentType: e.ajax.content.json,
    	  success: function(res, status, xhr) {
    	    rootData.WS.Output.docId = ctx.get(res, 'id');
    		sc.endStep(); // Retrieve_extracted_da
    		return;
    	  },
    	  error: function(xhr, error, statusText) {
    	    ctx.log(' ctx.ajax.call  error: ' + statusText);
    	  }
    	});
    }});

    We provide the inputs as formData, composed of the path of the PDF file and the options expected by the DOX service. We also need to provide the Authorization header so that the service allows us to use it. When the upload of the document is done, the service generates a unique id. We will use this id to retrieve the data extracted from the document.

  3. Now comes the fun part: the service might take a while to process the document and to send back the data. While the service is processing the document, it will send PENDING as the result status. When the processing of the document is complete, we will get the status DONE. As we do not want to block the bot while it is waiting for the result from DOX, let’s add a loop so that it can periodically ask for the results.
    • First, insert a new link between the last step and itself in the definition of the scenario:
      // ----------------------------------------------------------------
      //   Scenario: ExtractDataFromPDF
      // ----------------------------------------------------------------
      GLOBAL.scenario({ ExtractDataFromPDF: function(ev, sc) {
      	var rootData = sc.data;
      
      	sc.setMode(e.scenario.mode.clearIfRunning);
      	sc.setScenarioTimeout(600000); // Default timeout for global scenario.
      	sc.onError(function(sc, st, ex) { sc.endScenario(); }); // Default error handler.
      	sc.onTimeout(30000, function(sc, st) { sc.endScenario(); }); // Default timeout handler for each step.
      	sc.step(GLOBAL.steps.Set_credentials, GLOBAL.steps.Set_tokenURL);
      	sc.step(GLOBAL.steps.Set_tokenURL, GLOBAL.steps.Set_doxURL);
      	sc.step(GLOBAL.steps.Set_doxURL, GLOBAL.steps.Set_file_path);
      	sc.step(GLOBAL.steps.Set_file_path, GLOBAL.steps.Generate_token);
      	sc.step(GLOBAL.steps.Generate_token, GLOBAL.steps.Upload_file);
      	sc.step(GLOBAL.steps.Upload_file, GLOBAL.steps.Retrieve_extracted_da);
      	sc.step(GLOBAL.steps.Retrieve_extracted_da, GLOBAL.steps.Retrieve_extracted_da, 'loop');
      	sc.step(GLOBAL.steps.Retrieve_extracted_da, null);
      }}, ctx.dataManagers.rootData).setId('57146a42-30d7-47af-8aad-9844f008f7d8') ;

      Using the name 'loop', we will be able to make the bot go to the step Retrieve_extracted_da (which is, in our case, the same step it was in previously).

    • Then, update the code of the last web service call so that you get:
      // ----------------------------------------------------------------
      //   Step: Retrieve_extracted_da
      // ----------------------------------------------------------------
      GLOBAL.step({ Retrieve_extracted_da: function(ev, sc, st) {
      	var rootData = sc.data;
      	ctx.workflow('ExtractDataFromPDF', '6ae7f1e0-85e1-410d-9ce6-f9509d1d5242') ;
      	// Retrieve extracted data
      	ctx.ajax.call({
      	  url: rootData.WS.Input.ServiceKey.doxURL +  '/document/jobs/' + rootData.WS.Output.docId + '?clientId=c_00&timestamp=' + new Date().getTime(),
      	  method: e.ajax.method.get,
      	  header:{
      			Accept:e.ajax.content.json,
      			Authorization: sc.localData.token
      		},
      	  contentType: e.ajax.content.json,
      	  success: function(res, status, xhr) {
      	    if (res.status == 'DONE') {
      	      rootData.WS.Output.data = res;
      	      sc.endStep(); // end Scenario
      	      return;
      	    } else if (res.status == 'PENDING') {
      	      ctx.wait(function() {
      	        sc.endStep('loop');
      	      }, 5000);
      	    }
      	  },
      	  error: function(xhr, error, statusText) {
      	    ctx.log(' ctx.ajax.call  error: ' + statusText);
      	  }
      	});
      }});

      As described above, when the status is PENDING, we will loop over this step after a short delay (here, 5000 ms). This can be achieved with the sc.endStep('loop') instruction.
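
The polling idea described above can also be sketched independently of the Desktop SDK. The following is a minimal sketch in plain JavaScript, not the generated bot code: pollJob, fetchStatus, and the maxAttempts option are hypothetical names introduced for illustration. Unlike the step above, it also stops after a fixed number of attempts and on any status other than PENDING, so the caller is not left waiting if the job never reaches DONE.

```javascript
// Poll a job until it leaves the PENDING state or the attempts run out.
// fetchStatus(callback): hypothetical function that calls callback(statusString).
// options.wait(fn, ms): scheduler, defaults to setTimeout (injectable for tests).
function pollJob(fetchStatus, onFinished, options) {
  const opts = options || {};
  const delayMs = opts.delayMs !== undefined ? opts.delayMs : 5000;
  const maxAttempts = opts.maxAttempts !== undefined ? opts.maxAttempts : 20;
  const wait = opts.wait || setTimeout;
  let attempts = 0;

  function attempt() {
    attempts += 1;
    fetchStatus(function (status) {
      if (status === 'PENDING' && attempts < maxAttempts) {
        wait(attempt, delayMs); // not ready yet: ask again after the delay
        return;
      }
      // DONE, an unexpected status, or too many attempts: stop polling.
      onFinished({
        status: status,
        attempts: attempts,
        timedOut: status === 'PENDING'
      });
    });
  }

  attempt();
}
```

Passing the scheduler in as an option keeps the loop easy to test: a test can supply a function that runs the callback immediately instead of waiting five seconds between attempts.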

Important note: if we take a look at the code, we can see that we pass a clientId parameter in the URL when we retrieve the data. This parameter is also present in the options we send to DOX when we upload the document. In our case, this parameter has the value c_00. The documentation of the service API states that this parameter is mandatory. To make sure that you can use the service with this client id, you need to create a client using the Client API (details here).
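
Because the same client id has to appear both in the upload options and in the retrieval URL, it can help to derive both from a single value. The following is a minimal sketch in plain JavaScript; the helper names withClientId and buildResultUrl are hypothetical, and the URL layout simply mirrors the one used in the Retrieve_extracted_da step above.

```javascript
// Return a copy of the upload options with the given client id set.
function withClientId(options, clientId) {
  return Object.assign({}, options, { clientId: clientId });
}

// Build the URL used to retrieve the result of a job for the same client id.
// The timestamp parameter mirrors the one appended in the step above.
function buildResultUrl(doxUrl, docId, clientId, timestamp) {
  const ts = timestamp !== undefined ? timestamp : new Date().getTime();
  return doxUrl + '/document/jobs/' + encodeURIComponent(docId) +
    '?clientId=' + encodeURIComponent(clientId) +
    '&timestamp=' + ts;
}
```

With these two helpers, changing the client id means changing one variable instead of editing both the options string and the URL.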

Conclusion

At this point, the data extracted from the document is stored in the rootData.WS.Output.data variable, so you can pass it as a parameter to another scenario, for example to process invoices.

As detailed in the documentation, the result is a JSON object. Each extracted value comes with a confidence score (a number between 0 and 1), which can be used in the scenario when you need to work with this data.

One might imagine a scenario where an error is raised if there is a value with a confidence score lower than 0.8 for example.
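
Such a check could look like the following minimal sketch in plain JavaScript. It assumes that the result contains an extraction object with a headerFields array and a lineItems array of arrays, where each entry has name, value and confidence properties. This shape is an assumption to verify against the service documentation, and the function name findLowConfidenceFields is hypothetical.

```javascript
// Collect every extracted field whose confidence is below the threshold.
// Assumed result shape (verify against the DOX documentation):
//   result.extraction.headerFields: [{ name, value, confidence }, ...]
//   result.extraction.lineItems:    [[{ name, value, confidence }, ...], ...]
function findLowConfidenceFields(result, threshold) {
  const extraction = (result && result.extraction) || {};
  const low = [];

  (extraction.headerFields || []).forEach(function (field) {
    if (field.confidence < threshold) {
      low.push({ scope: 'header', name: field.name, confidence: field.confidence });
    }
  });

  (extraction.lineItems || []).forEach(function (item, index) {
    item.forEach(function (field) {
      if (field.confidence < threshold) {
        low.push({ scope: 'lineItem ' + index, name: field.name, confidence: field.confidence });
      }
    });
  });

  return low;
}
```

A scenario could call findLowConfidenceFields(rootData.WS.Output.data, 0.8) and raise an error, or route the document to manual review, whenever the returned list is not empty.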

 

And more…

You can find this content as a webinar or see this page to find it in the list as well as other webinar offerings.

And you can even download the sample directly from the Store in the Factory!

Lastly, you can watch this video presenting an end-to-end use case.

What’s next?

Now… it’s up to you to build powerful bots combining Intelligent RPA and other services. You know what to do!

      21 Comments
      DAVIDE BRAMATI

      Hi Jerome, thanks for the blog, nice reading!

      Since the recent 2004 release, RPA has been able to read data from a PDF document using the PDF library. In a scenario where the customer wants to retrieve the “Order Number” or the “Amount to be paid” from a PDF file like an invoice or a payment note, why should the customer use DOX? Can the customer use the new PDF activities instead of buying another service (DOX)?

      Thank you and best regards

      Davide

      Tim Nusch

      Hi Davide,

      the PDF library as described in https://blogs.sap.com/2020/05/12/how-to-extract-data-from-text-searchable-pdf-documents-in-sap-intelligent-robotic-process-automation/ will be able to extract some fields from text-based PDFs only.

      Document Information Extraction accepts any kind of PDF file (textual or image-based) and in future will also support further formats such as image formats, Excel, CSV, Text, Email among others.

      Please find more information in these blogs:

      • https://blogs.sap.com/2020/03/06/simplify-business-document-processing-with-sap-ai-business-services/
      • https://www.linkedin.com/pulse/business-document-processing-ai-markus-noga/

      Best regards,

      Tim

      Jerome GRONDIN
      Blog Post Author

      Hi,

       

      Also, another reason is that you might have to process multiple formats of invoices. If you use the built-in PDF library, it might be difficult as you would need to think of all these formats when you're implementing the bot.

      On the other hand, using the DOX service you won't have to worry about this aspect

       

      Regards,

      J.

      kratika varshney

      Hi Jerome GRONDIN ,

      Thanks for this nice blog. It was very helpful in implementing my scenario. I have a question about it: if we have PDFs in different formats, what do we have to do? Can we change the fields below in the upload document step?

      {"extraction":{"headerFields":["documentNumber","taxId","taxName","purchaseOrderNumber","shippingAmount","netAmount","senderAddress","senderName","grossAmount","currencyCode","receiverContact","documentDate","taxAmount","taxRate","receiverName","receiverAddress"],"lineItemFields":["description","netAmount","quantity","unitPrice","materialNumber"]},"clientId":"c_00","documentType":"invoice","enrichment":{"sender":{"top":5,"type":"businessEntity","subtype":"supplier"},"employee":{"type":"employee"}}}

       

      Thanks

       

      Jerome GRONDIN
      Blog Post Author

      Hi,

       

      Even when you have different formats, you don't have to change anything (assuming that all your documents are invoices, for example).

       

      Regards,

      J.

      Ramesh Yuvashree

      A very helpful blog for understanding this usecase. I am trying to achieve something similar through a UI5 app. I am trying to upload an image/pdf to this API. Would the formdata/ajax post call be something similar to the one mentioned in step 2 of code?

      Thanks!

      Dheeraj Agrawal

      Hi Everyone ,

      Thanks for the blog...

       

      When we are trying this we are unable to find how to set up the variable in custom activity, as we are not getting option in desktop studio.

      Please find attached screen shot

      Jerome GRONDIN
      Blog Post Author

      Hello,

       

      just curious : what is the version of the Desktop Studio ?

       

      This part (retrieve data from Factory and set variable) is quite common. Maybe you can copy/paste the code of these steps directly from the sample mentioned in the blog post.

       

      Regards,

      J.

      Vijay Sharma

      Hi Dheeraj Agrawal,

      You need to use "Set Context" Activity instead of Set credentials.

      Thanks

      Vijay

      Vilas Salunke

      Hi,

      I have followed the same process as you described in this blog, but I am getting a timeout error in the generate token method, as shown below.

      Can you please help me resolve this error?

      Jerome GRONDIN
      Blog Post Author

      Hi,

       

      The timeout might have different causes. You should try to set a breakpoint in your code to read the exact message of the exception thrown in the error callback of ctx.ajax.call.

       

      Regards,

      J.

      Francisco Martinez

      Hi Jerome, I have the same issue around here. I have checked traces, logs, and everything I could find in the Desktop Studio but I couldn't find further detail about the error.

      I'd appreciate any hint about this. Thank you!

      Francisco Martinez

      Hi again,

       

      I would appreciate any comments about the timeout issue. Thank you.

      Sandeep Sharma

      Hi Vilas Salunke

       

      Were you able to fix the timeout error? I have the exact same issue. Your input is appreciated.

       

      Thanks

      Sandeep

      Francisco Martinez

      I have the same issue around here. I have checked traces, logs, and everything I could find in the Desktop Studio but I couldn't find further detail about the error.

      Does anybody have any update about this error?

      Fatma El Zahraa Samir

      Did you solve this error?

      Ganesh Jagtap

      Hi,

      I want to use the 'documentNumber' value from the extracted data in my next scenario. How can I read 'documentNumber' from res, i.e. in success: function(res, status, xhr)?

      Jerome GRONDIN
      Blog Post Author

      Hello,

       

      Have a look at the sample mentioned in the blog post. There is an example of how to access the data you retrieve from the service.

       

      Regards,

      J.

      İrfan DEMİRCİOĞLU

      Hello,

       

      First of all, thank you for your post. It is very useful. I also watched one of the webinar videos on YouTube: https://www.youtube.com/watch?v=qAf7WGkJ-8w&ab_channel=SAPCommunity

      When I follow the same steps, I get an error. Error screenshots are below. Could you please help me?

       

      Best Regards,

      Jerome GRONDIN
      Blog Post Author

      Hello,

       

      The errors do not seem to be related to DOX, but to the use of credentials in an environment variable. I suggest you test this in a simpler script to make sure you can retrieve credentials this way.

       

      Regards,

      J.

      İrfan DEMİRCİOĞLU

      Hello,

       

      I tried with a simple script, it works.

       

      Thank you.