Combining Document Information Extraction and Intelligent RPA to automatically extract data from PDF documents

Reading a PDF to extract and structure the data it contains can be tedious work, especially when you have hundreds of files to process. Doing it by hand is also error-prone, with all the consequences that such errors can have.

In this blog post, we will see how it is possible to combine Intelligent RPA and Document Information Extraction to automate the processing of your PDF documents.

If you have never heard of Document Information Extraction, here is a short introduction.

Document Information Extraction (also commonly called DOX) is a service you can use to process documents that have content in headers and tables. Typically, you can use it to extract data from invoices, or payment notes. With such a service you can upload a PDF document and get the extracted data as a JSON object.

You can find all kinds of useful information about the service on this page.

But first things first: let’s set up the service so we can use it.

Set up the DOX service

You need an SAP Cloud Platform (SAP CP) global account and a CPEA license.

  1. Create a subaccount in your global account using the SAP Cloud Platform Cockpit (you can of course reuse the subaccount you already use for Intelligent RPA)
  2. Create a Space
  3. Create a Service Instance of DOX (you might need to add a Service Plan for this service to your subaccount first)
  4. Create a Service Key (note that you can find more information about this procedure here)
  5. Once the Service Key is created, you should have a JSON object with the following structure:
{
	"url": "",
	"uaa": {
		"uaadomain": "",
		"tenantmode": "",
		"sburl": "",
		"clientid": "",
		"verificationkey": "",
		"apiurl": "",
		"xsappname": "",
		"identityzone": "",
		"identityzoneid": "",
		"clientsecret": "",
		"tenantid": "",
		"url": ""
	}
}

Save the values of url, uaa.clientid, uaa.clientsecret and uaa.url, as we will need them later.
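
As a minimal sketch of which values matter, the snippet below picks these four fields out of a service key object. The sample key, its placeholder values and the function name pickDoxSettings are assumptions made for illustration; they are not part of the service or of the Intelligent RPA SDK.

```javascript
// Hypothetical service key with placeholder values, shaped like the JSON above.
var serviceKey = {
  url: 'https://dox.example.com',
  uaa: {
    clientid: 'my-client-id',
    clientsecret: 'my-client-secret',
    url: 'https://auth.example.com'
  }
};

// Keep only the four values the bot needs later on.
function pickDoxSettings(key) {
  return {
    doxUrl: key.url,                    // base URL of the DOX service
    clientId: key.uaa.clientid,         // used to request a token
    clientSecret: key.uaa.clientsecret, // used to request a token
    tokenUrl: key.uaa.url               // base URL of the token endpoint
  };
}

var settings = pickDoxSettings(serviceKey);
console.log(settings.tokenUrl);
```

Note that the object contains two different url properties: the top-level one points to the DOX service, while uaa.url points to the authentication service.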

Create a bot

Create a new project and a new workflow.

Set up the variables

  1. Create input/output parameters in the context of the scenario
  2. Insert the Set Context activities as shown below
  3. For each one of them, set the value in the dedicated variable (see below):
    • Set the credentials (see below for the properties).
      Don’t forget to replace uaa.clientid and uaa.clientsecret with the values from the Service Key you created before.
    • Set the tokenURL variable: don’t forget to replace uaa.url with the value from the Service Key.
    • Set the doxURL variable: replace the url value with the one from the Service Key. The last part of the URL was added as required by the documentation.
    • Last, set the path of a PDF file. Ex:

Web services to use DOX

  1. Insert a web service activity, whose properties are shown below. It will be used to generate the access token:
  2. Insert another web service activity. This one will be used to send the PDF document to the DOX service. The properties of this activity are shown below:
  3. Finally, insert a last web service activity, which will be used to retrieve the extracted data, and provide the following properties:

 

You should now have the workflow below:

Now, build the solution. We will have to make some adjustments to the generated JavaScript code.

Note: in this case, the values of clientid and clientsecret were hard-coded in the property value of the Set Context activity. For better security, you can store these values in a Factory credentials variable and retrieve them in the workflow, as explained in this blog.
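
The exact value stored in the credentials variable is not reproduced in the text above. A common choice for a client credentials token request is an HTTP Basic header built from uaa.clientid and uaa.clientsecret, and the sketch below assumes that format. It runs in Node.js (Buffer is a Node.js API, so inside the Desktop Studio you would need an equivalent Base64 helper), and buildBasicAuth is a hypothetical helper name.

```javascript
// Assumption: the token endpoint accepts HTTP Basic authentication.
// Builds "Basic <base64(clientid:clientsecret)>".
function buildBasicAuth(clientId, clientSecret) {
  var raw = clientId + ':' + clientSecret;
  return 'Basic ' + Buffer.from(raw, 'utf8').toString('base64');
}

// Placeholder credentials, not real ones.
var authorization = buildBasicAuth('my-client-id', 'my-client-secret');
console.log(authorization);
```

Whatever the storage location (hard-coded or Factory credentials variable), the resulting string is what ends up in the Authorization header of the token request.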

Update the code

Since the web service activities do not let us provide all the required options, we need to add them manually in the generated code.

  1. Update the web service call that generates the token so that it looks like this:
    // ----------------------------------------------------------------
    //   Step: Generate_token
    // ----------------------------------------------------------------
    GLOBAL.step({ Generate_token: function(ev, sc, st) {
    	var rootData = sc.data;
    	ctx.workflow('ExtractDataFromPDF', '7e436b4c-226a-4a50-93c3-4948023e62db') ;
    	// Generate token
    	ctx.ajax.call({
    	  url: rootData.WS.Input.ServiceKey.tokenURL +  '/oauth/token?grant_type=client_credentials',
    	  method: e.ajax.method.get,
    	  contentType: e.ajax.content.json,
    		header:{
    			Accept: e.ajax.content.json,
    			Authorization: rootData.WS.Input.ServiceKey.client
    		},
    	  success: function(res, status, xhr) {
    	    sc.localData.token = 'Bearer ' + ctx.get(res, 'access_token');
    		sc.endStep(); // Upload_file
    		return;
    	  },
    	  error: function(xhr, error, statusText) {
    	    ctx.log(' ctx.ajax.call  error: ' + statusText);
    	  }
    	});
    }});

    Compared with the generated code, the data attribute has been removed and the header has been added. In the success callback, we extract only the access token and store it, prefixed with 'Bearer ', in the sc.localData.token variable.

  2. When we upload the PDF file, we should use the following code instead of the automatically generated one:
    // ----------------------------------------------------------------
    //   Step: Upload_file
    // ----------------------------------------------------------------
    GLOBAL.step({ Upload_file: function(ev, sc, st) {
    	var rootData = sc.data;
    	ctx.workflow('ExtractDataFromPDF', '0a1f4176-8907-4544-a807-65095d30ea36') ;
    	// Upload file
    	ctx.ajax.call({
    	  url: rootData.WS.Input.ServiceKey.doxURL +  '/document/jobs',
    	  method: e.ajax.method.post,
    	  formData: [{
    			file:rootData.WS.Input.filePath,
    			type:e.ajax.content.pdf,
    			name:'file'
    		},{
    			value:ctx.json.stringify({"extraction":{"headerFields":["documentNumber","taxId","taxName","purchaseOrderNumber","shippingAmount","netAmount","senderAddress","senderName","grossAmount","currencyCode","receiverContact","documentDate","taxAmount","taxRate","receiverName","receiverAddress"],"lineItemFields":["description","netAmount","quantity","unitPrice","materialNumber"]},"clientId":"c_00","documentType":"invoice","enrichment":{"sender":{"top":5,"type":"businessEntity","subtype":"supplier"},"employee":{"type":"employee"}}}),
    			type:e.ajax.content.jsonText,
    			name:'options'
    		}],
    		header:{
    			Accept:e.ajax.content.json,
    			Authorization: sc.localData.token
    		},
    	  contentType: e.ajax.content.json,
    	  success: function(res, status, xhr) {
    	    rootData.WS.Output.docId = ctx.get(res, 'id');
    		sc.endStep(); // Retrieve_extracted_da
    		return;
    	  },
    	  error: function(xhr, error, statusText) {
    	    ctx.log(' ctx.ajax.call  error: ' + statusText);
    	  }
    	});
    }});

    We provide the inputs as formData, composed of the path of the PDF file and the options expected by the DOX service. We also need to provide the Authorization header so that the service accepts the request. Once the document is uploaded, the service generates a unique id, which we will use to retrieve the data extracted from the document.

  3. Now comes the fun part: the service might take a while to process the document and send back the data. While it is processing the document, it returns PENDING as the result status. When processing is complete, we get the status DONE. As we do not want to block the bot while it waits for the result from DOX, let’s add a loop so that it can periodically ask for the results.
    • First, insert a new link between the last step and itself in the definition of the scenario:
      // ----------------------------------------------------------------
      //   Scenario: ExtractDataFromPDF
      // ----------------------------------------------------------------
      GLOBAL.scenario({ ExtractDataFromPDF: function(ev, sc) {
      	var rootData = sc.data;
      
      	sc.setMode(e.scenario.mode.clearIfRunning);
      	sc.setScenarioTimeout(600000); // Default timeout for global scenario.
      	sc.onError(function(sc, st, ex) { sc.endScenario(); }); // Default error handler.
      	sc.onTimeout(30000, function(sc, st) { sc.endScenario(); }); // Default timeout handler for each step.
      	sc.step(GLOBAL.steps.Set_credentials, GLOBAL.steps.Set_tokenURL);
      	sc.step(GLOBAL.steps.Set_tokenURL, GLOBAL.steps.Set_doxURL);
      	sc.step(GLOBAL.steps.Set_doxURL, GLOBAL.steps.Set_file_path);
      	sc.step(GLOBAL.steps.Set_file_path, GLOBAL.steps.Generate_token);
      	sc.step(GLOBAL.steps.Generate_token, GLOBAL.steps.Upload_file);
      	sc.step(GLOBAL.steps.Upload_file, GLOBAL.steps.Retrieve_extracted_da);
      	sc.step(GLOBAL.steps.Retrieve_extracted_da, GLOBAL.steps.Retrieve_extracted_da, 'loop');
      	sc.step(GLOBAL.steps.Retrieve_extracted_da, null);
      }}, ctx.dataManagers.rootData).setId('57146a42-30d7-47af-8aad-9844f008f7d8') ;

      Using the name 'loop', we can make the bot go to the step Retrieve_extracted_da (which is, in our case, the same step it was already in).

    • Then, update the code of the last web service call so that it looks like this:
      // ----------------------------------------------------------------
      //   Step: Retrieve_extracted_da
      // ----------------------------------------------------------------
      GLOBAL.step({ Retrieve_extracted_da: function(ev, sc, st) {
      	var rootData = sc.data;
      	ctx.workflow('ExtractDataFromPDF', '6ae7f1e0-85e1-410d-9ce6-f9509d1d5242') ;
      	// Retrieve extracted data
      	ctx.ajax.call({
      	  url: rootData.WS.Input.ServiceKey.doxURL +  '/document/jobs/' + rootData.WS.Output.docId + '?clientId=c_00&timestamp=' + new Date().getTime(),
      	  method: e.ajax.method.get,
      	  header:{
      			Accept:e.ajax.content.json,
      			Authorization: sc.localData.token
      		},
      	  contentType: e.ajax.content.json,
      	  success: function(res, status, xhr) {
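      	    if (res.status == 'DONE') {
      	      // Extraction finished: keep the whole result and leave the step
      	      rootData.WS.Output.data = res;
      	      sc.endStep(); // end Scenario
      	      return;
      	    } else if (res.status == 'PENDING') {
      	      // Not ready yet: wait 5 seconds, then re-enter this step through the 'loop' link
      	      ctx.wait(function() {
      	        sc.endStep('loop');
      	      }, 5000);
      	    }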
      	  },
      	  error: function(xhr, error, statusText) {
      	    ctx.log(' ctx.ajax.call  error: ' + statusText);
      	  }
      	});
      }});

      As described above, when the status is PENDING, the bot re-enters this step after a short delay (here, 5000 ms). This is what the sc.endStep('loop') instruction does.
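
Outside the SDK, the same polling pattern can be sketched in plain JavaScript. This is an illustration under assumptions, not the code generated by the Desktop Studio: pollJob, getStatus and the options object are hypothetical names. The sketch also adds two things the step above does not have, namely a maximum number of attempts and an error path for any status other than DONE or PENDING.

```javascript
// Polls getStatus until the job is DONE, retrying while it is PENDING.
// getStatus(callback) must call callback(res) with an object holding a status.
function pollJob(getStatus, onDone, onError, options) {
  var schedule = options.schedule || setTimeout; // injectable, handy for tests
  var attempts = 0;

  function attempt() {
    attempts += 1;
    getStatus(function (res) {
      if (res.status === 'DONE') {
        onDone(res);
      } else if (res.status === 'PENDING' && attempts < options.maxAttempts) {
        schedule(attempt, options.delayMs); // ask again after the delay
      } else {
        onError(new Error('Gave up with status ' + res.status +
          ' after ' + attempts + ' attempt(s)'));
      }
    });
  }

  attempt();
}

// Usage with a fake service that answers PENDING twice, then DONE.
var answers = ['PENDING', 'PENDING', 'DONE'];
var outcome = null;
pollJob(
  function (callback) { callback({ status: answers.shift() }); },
  function (res) { outcome = res.status; },
  function (err) { outcome = err.message; },
  { delayMs: 0, maxAttempts: 10, schedule: function (fn) { fn(); } }
);
console.log(outcome);
```

Bounding the number of attempts keeps the caller from waiting forever if a document never leaves the PENDING state.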

Important note: looking at the code, you can see that we pass a clientId parameter in the URL when we retrieve the data. This parameter is also present in the options we send to DOX when we upload the document. In our case, it has the value c_00. According to the API documentation of the service, this parameter is mandatory. To make sure that you can use the service with this client id, you need to create a client using the Client API (details here).
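
Since the same client id has to appear both in the upload options and in the query string of the retrieval URL, one way to keep them in sync is to build both from a single value. The sketch below assumes exactly the option names and URL shape used in the steps above. buildOptions and buildJobUrl are hypothetical helper names, the base URL and document id are placeholders, and the enrichment block is left out for brevity.

```javascript
var CLIENT_ID = 'c_00'; // must match a client created through the Client API

// JSON string sent as the 'options' part of the upload form data.
function buildOptions(clientId, headerFields, lineItemFields) {
  return JSON.stringify({
    extraction: { headerFields: headerFields, lineItemFields: lineItemFields },
    clientId: clientId,
    documentType: 'invoice'
  });
}

// URL used to ask for the job result. The timestamp parameter mirrors the
// step above; it is presumably there to avoid cached responses.
function buildJobUrl(doxUrl, docId, clientId) {
  return doxUrl + '/document/jobs/' + encodeURIComponent(docId) +
    '?clientId=' + encodeURIComponent(clientId) +
    '&timestamp=' + new Date().getTime();
}

var options = buildOptions(CLIENT_ID, ['documentNumber', 'grossAmount'], ['description', 'quantity']);
var jobUrl = buildJobUrl('https://dox.example.com', 'abc-123', CLIENT_ID);
console.log(options);
console.log(jobUrl);
```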

Conclusion

At this point, the data extracted from the document is stored in the rootData.WS.Output.data variable. You can pass it as a parameter to another scenario, for example to process invoices.

As detailed in the documentation, the result is a JSON object. Each extracted value comes with a confidence score (a number between 0 and 1), which you can use in the scenario when you work with the data.

One might imagine a scenario where an error is raised if there is a value with a confidence score lower than 0.8 for example.
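
Such a check could look like the sketch below. It assumes that the result exposes the header fields as an array of objects with name, value and confidence properties under extraction.headerFields; verify this against the service documentation and your own responses. The sample result and the helpers getHeaderField and findLowConfidence are made up for illustration.

```javascript
// Hypothetical, shortened result; a real response carries more properties.
var result = {
  status: 'DONE',
  extraction: {
    headerFields: [
      { name: 'documentNumber', value: 'INV-001', confidence: 0.97 },
      { name: 'grossAmount', value: 118.5, confidence: 0.62 }
    ]
  }
};

// Returns the header field with the given name, or null if it is absent.
function getHeaderField(res, name) {
  var fields = (res.extraction && res.extraction.headerFields) || [];
  for (var i = 0; i < fields.length; i++) {
    if (fields[i].name === name) { return fields[i]; }
  }
  return null;
}

// Returns the names of the header fields whose confidence is below the threshold.
function findLowConfidence(res, threshold) {
  var fields = (res.extraction && res.extraction.headerFields) || [];
  var names = [];
  for (var i = 0; i < fields.length; i++) {
    if (fields[i].confidence < threshold) { names.push(fields[i].name); }
  }
  return names;
}

var lowConfidence = findLowConfidence(result, 0.8);
if (lowConfidence.length > 0) {
  console.log('Manual review needed for: ' + lowConfidence.join(', '));
}
```

In the bot, the same helpers could be applied to rootData.WS.Output.data, for example to read a single value with getHeaderField before passing it to the next scenario.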

 

And more…

You can find this content as a webinar or see this page to find it in the list as well as other webinar offerings.

And you can even download the sample directly from the Store in the Factory!

Last, you can watch this video presenting an end-to-end use case.

What’s next?

Now… it’s up to you to build powerful bots combining Intelligent RPA and other services. You know what to do!

15 Comments
  • Hi Jerome, thanks for the blog, nice reading!

    Since the recent 2004 release, RPA is able to read data from a PDF document using the PDF library. In a scenario where the customer wants to retrieve the “Order Number” or the “Amount to be paid” from a PDF file like an invoice or a payment note, why should the customer use DOX? Can the customer use the new PDF activities instead of buying another service (DOX)?

    Thank you and best regards

    Davide

  • Hi Jerome GRONDIN ,

    Thanks for this nice blog. It was very helpful in implementing my scenario. I have a question, though: if we have PDFs in different formats, what do we have to do? Can we change the fields below in the upload document step?

    {"extraction":{"headerFields":["documentNumber","taxId","taxName","purchaseOrderNumber","shippingAmount","netAmount","senderAddress","senderName","grossAmount","currencyCode","receiverContact","documentDate","taxAmount","taxRate","receiverName","receiverAddress"],"lineItemFields":["description","netAmount","quantity","unitPrice","materialNumber"]},"clientId":"c_00","documentType":"invoice","enrichment":{"sender":{"top":5,"type":"businessEntity","subtype":"supplier"},"employee":{"type":"employee"}}}

     

    Thanks

     

    • Hi,

       

      Even when you have different formats, you don’t have to change anything (assuming that all your documents are invoices, for example).

       

      Regards,

      J.

  • A very helpful blog for understanding this use case. I am trying to achieve something similar through a UI5 app, uploading an image/PDF to this API. Would the formData/ajax POST call be similar to the one shown in step 2 of the code?

    Thanks!

  • Hi Everyone ,

    Thanks for the blog…

    When we try this, we cannot find how to set up the variable in the custom activity, as the option is not available in the Desktop Studio.

    Please find the attached screenshot.

    • Hello,

       

      Just curious: which version of the Desktop Studio are you using?

       

      This part (retrieve data from Factory and set variable) is quite common. Maybe you can copy/paste the code of these steps directly from the sample mentioned in the blog post.

       

      Regards,

      J.

  • Hi,

    I have followed the same process as described in this blog, but I am getting a timeout error in the Generate_token step, as shown below.

    Can you please help me resolve this error?

  • Hi,

    I want to use the 'documentNumber' value from the extracted data in my next scenario. How can I read 'documentNumber' from res, i.e. in success: function(res, status, xhr)?

    • Hello,

       

      Have a look at the sample mentioned in the blog post. There is an example of how to access the data you retrieve from the service.

       

      Regards,

      J.