Combining Document Information Extraction and Intelligent RPA to automatically extract data from PDF documents
Reading a PDF to extract and structure the data it contains can be tedious work, especially when you have hundreds of files to process, and it can lead to many errors, with all the consequences they entail.
In this blog post, we will see how it is possible to combine Intelligent RPA and Document Information Extraction to automate the processing of your PDF documents.
If you have never heard of Document Information Extraction, let me introduce it.
Document Information Extraction (also commonly called DOX) is a service you can use to process documents that have content in headers and tables. Typically, you can use it to extract data from invoices or payment notes. With such a service you can upload a PDF document and get the extracted data as a JSON object.
You can find all kinds of useful information regarding the service on this page.
But first things first: let’s set up the service so we can use it.
Set up the DOX service
You need to have a SAP CP global account and a CPEA license.
- Create a subaccount in your global account using the SAP Cloud Platform Cockpit (of course you can use the same subaccount as the one you’re using for Intelligent RPA)
- Create a Space
- Create a Service Instance of DOX (you might need to add a Service Plan for this service to your subaccount first)
- Create a Service Key (note that you can find more information about this procedure here)
- Once the Service Key is created, you should have a JSON object with the following structure:
{
  "url": "",
  "uaa": {
    "uaadomain": "",
    "tenantmode": "",
    "sburl": "",
    "clientid": "",
    "verificationkey": "",
    "apiurl": "",
    "xsappname": "",
    "identityzone": "",
    "identityzoneid": "",
    "clientsecret": "",
    "tenantid": "",
    "url": ""
  }
}
Let’s save the url, uaa.clientid, uaa.clientsecret and uaa.url as we will need them later.
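As a minimal sketch of this step, the four values can be picked out of the Service Key with a few lines of plain JavaScript. The helper name pickServiceKeyValues and the sample values are hypothetical and only serve as an illustration; they are not part of the Intelligent RPA SDK.

```javascript
// Hypothetical helper: keeps only the four Service Key values used later in this post.
function pickServiceKeyValues(serviceKey) {
  var uaa = serviceKey.uaa || {};
  var picked = {
    url: serviceKey.url,            // base URL of the DOX service
    clientid: uaa.clientid,
    clientsecret: uaa.clientsecret,
    tokenUrl: uaa.url               // base URL of the authentication service
  };
  // Fail early if one of the values is missing or empty.
  Object.keys(picked).forEach(function (name) {
    if (!picked[name]) {
      throw new Error('Missing value in Service Key: ' + name);
    }
  });
  return picked;
}

// Placeholder values for illustration only.
var sampleKey = {
  url: 'https://dox.example.com',
  uaa: {
    clientid: 'my-client',
    clientsecret: 'my-secret',
    url: 'https://auth.example.com'
  }
};
var keyValues = pickServiceKeyValues(sampleKey);
```

Failing early on an empty value makes a copy/paste mistake visible before the bot sends its first request.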
Create a bot
Create a new project and a new workflow.
Set up the variables
- Create input/output parameters in the context of the scenario
- Insert the Set Context activities as shown below
- For each one of them, set the value in the dedicated variable (see below):
- Set the credentials (see below for the properties)
Don’t forget to replace uaa.clientid and uaa.clientsecret with the values from the Service Key you created before.
- Set the tokenURL variable:
Don’t forget to replace uaa.url with the value from the Service Key.
- Set the doxURL variable:
Replace the url value with the one from the Service Key. The last part was inserted as requested by the documentation.
- Last, set the path of a PDF file. Ex:
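The values set by these activities could be assembled as in the sketch below. It rests on two assumptions: that the credentials value is an HTTP Basic authorization header built from clientid and clientsecret (the usual form for an OAuth client credentials request), and that the code runs in Node.js, where Buffer is available. The helper names, the apiPath argument and the sample values are hypothetical; the actual path to append to the url is the one given in the service documentation.

```javascript
// Assumption: the credentials are sent as "Basic base64(clientid:clientsecret)".
function buildBasicAuthHeader(clientId, clientSecret) {
  var raw = clientId + ':' + clientSecret;
  // Buffer is a Node.js API; in a browser, btoa(raw) plays the same role.
  return 'Basic ' + Buffer.from(raw, 'utf8').toString('base64');
}

// Removes trailing slashes so that URL parts can be concatenated safely.
function stripTrailingSlash(url) {
  return url.replace(/\/+$/, '');
}

// Hypothetical helper mirroring the structure read later as rootData.WS.Input.
// apiPath stands for the path segment requested by the service documentation.
function buildInput(serviceKey, apiPath, filePath) {
  return {
    ServiceKey: {
      client: buildBasicAuthHeader(serviceKey.uaa.clientid, serviceKey.uaa.clientsecret),
      tokenURL: stripTrailingSlash(serviceKey.uaa.url),
      doxURL: stripTrailingSlash(serviceKey.url) + apiPath
    },
    filePath: filePath
  };
}

// Placeholder values for illustration only.
var input = buildInput(
  {
    url: 'https://dox.example.com/',
    uaa: { clientid: 'my-client', clientsecret: 'my-secret', url: 'https://auth.example.com' }
  },
  '/api-path-from-documentation',
  'C:\\temp\\invoice.pdf'
);
```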
Web services to use DOX
- Insert a web service activity, whose properties are shown below:
- Insert another web service activity. This one will be used to send the PDF document to the DOX service. Properties of this activity are shown below:
- Finally, insert a last web service activity, and provide the following properties:
Now, build the solution. We’ll have to make some adjustments to the generated JavaScript code.
Note: in this case the values of clientid and clientsecret were hard-coded in the property value of the Set Context activity. For better security, you can of course store these values in a Factory credentials variable and retrieve them in the workflow, as explained in this blog.
Update the code
Since we cannot provide all the required options in the web service activities, we need to add them manually in the code.
- Update the WS call that generates the token so that it looks like this:
// ----------------------------------------------------------------
//   Step: Generate_token
// ----------------------------------------------------------------
GLOBAL.step({ Generate_token: function(ev, sc, st) {
    var rootData = sc.data;
    ctx.workflow('ExtractDataFromPDF', '7e436b4c-226a-4a50-93c3-4948023e62db') ;
    // Generate token
    ctx.ajax.call({
        url: rootData.WS.Input.ServiceKey.tokenURL + '/oauth/token?grant_type=client_credentials',
        method: e.ajax.method.get,
        contentType: e.ajax.content.json,
        header: {
            Accept: e.ajax.content.json,
            Authorization: rootData.WS.Input.ServiceKey.client
        },
        success: function(res, status, xhr) {
            sc.localData.token = 'Bearer ' + ctx.get(res, 'access_token');
            sc.endStep(); // Upload_file
            return;
        },
        error: function(xhr, error, statusText) {
            ctx.log(' ctx.ajax.call error: ' + statusText);
        }
    });
}});
The data attribute has been removed and the header has been inserted. Also, in the success callback, we extract only the access token from the response and store it, prefixed with 'Bearer ', in the sc.localData.token variable.
- When we upload the PDF file, we should use the following code instead of the automatically generated one:
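Outside the SDK, the same extraction could look like the following sketch. The function name toBearerHeader is hypothetical, and the code assumes the token response is either a JSON string or an already parsed object that carries an access_token property.

```javascript
// Builds the Authorization header value from the token response.
// Assumption: res is either a JSON string or an already parsed object.
function toBearerHeader(res) {
  var body = (typeof res === 'string') ? JSON.parse(res) : res;
  if (!body || !body.access_token) {
    throw new Error('No access_token in token response');
  }
  return 'Bearer ' + body.access_token;
}

// Placeholder response for illustration only.
var authorization = toBearerHeader('{"access_token":"abc123","token_type":"bearer"}');
```

Raising an error when access_token is absent avoids sending a header such as 'Bearer undefined' to the service.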
// ----------------------------------------------------------------
//   Step: Upload_file
// ----------------------------------------------------------------
GLOBAL.step({ Upload_file: function(ev, sc, st) {
    var rootData = sc.data;
    ctx.workflow('ExtractDataFromPDF', '0a1f4176-8907-4544-a807-65095d30ea36') ;
    // Upload file
    ctx.ajax.call({
        url: rootData.WS.Input.ServiceKey.doxURL + '/document/jobs',
        method: e.ajax.method.post,
        formData: [{
            file: rootData.WS.Input.filePath,
            type: e.ajax.content.pdf,
            name: 'file'
        },{
            value: ctx.json.stringify({
                "extraction": {
                    "headerFields": ["documentNumber","taxId","taxName","purchaseOrderNumber","shippingAmount","netAmount","senderAddress","senderName","grossAmount","currencyCode","receiverContact","documentDate","taxAmount","taxRate","receiverName","receiverAddress"],
                    "lineItemFields": ["description","netAmount","quantity","unitPrice","materialNumber"]
                },
                "clientId": "c_00",
                "documentType": "invoice",
                "enrichment": {
                    "sender": {"top": 5, "type": "businessEntity", "subtype": "supplier"},
                    "employee": {"type": "employee"}
                }
            }),
            type: e.ajax.content.jsonText,
            name: 'options'
        }],
        header: {
            Accept: e.ajax.content.json,
            Authorization: sc.localData.token
        },
        contentType: e.ajax.content.json,
        success: function(res, status, xhr) {
            rootData.WS.Output.docId = ctx.get(res, 'id');
            sc.endStep(); // Retrieve_extracted_da
            return;
        },
        error: function(xhr, error, statusText) {
            ctx.log(' ctx.ajax.call error: ' + statusText);
        }
    });
}});
We provide the inputs as formData, composed of the path of the PDF file and the options expected by the DOX service. We also need to provide the Authorization header so that the service accepts the request. When the upload of the document is done, a unique id is generated by the service. We will use this id to retrieve the data extracted from the document.
- Now comes the fun part: the service might take a while to process the document and to send back the data. While the service is processing the document, it will send PENDING as the result status. When the processing of the document is complete, we will get the status DONE. As we do not want to block the bot while it is waiting for the result from DOX, let’s add a way to loop so it can periodically ask for the results.
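To keep the options readable, they could be built by a small helper, as in this sketch. The name buildUploadOptions is hypothetical; the property names (extraction, headerFields, lineItemFields, clientId, documentType, enrichment) are the ones used in the upload step above, and the sketch assumes that the enrichment part may be left out.

```javascript
// Hypothetical helper: builds the 'options' part of the upload as a JSON string.
function buildUploadOptions(clientId, documentType, headerFields, lineItemFields, enrichment) {
  var options = {
    extraction: {
      headerFields: headerFields,
      lineItemFields: lineItemFields
    },
    clientId: clientId,
    documentType: documentType
  };
  // Assumption: enrichment is optional, so it is only added when provided.
  if (enrichment) {
    options.enrichment = enrichment;
  }
  return JSON.stringify(options);
}

// Example with a reduced list of fields.
var optionsText = buildUploadOptions(
  'c_00',
  'invoice',
  ['documentNumber', 'grossAmount', 'currencyCode'],
  ['description', 'quantity', 'unitPrice']
);
```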
- First, insert a new link between the last step and itself in the definition of the scenario:
// ----------------------------------------------------------------
//   Scenario: ExtractDataFromPDF
// ----------------------------------------------------------------
GLOBAL.scenario({ ExtractDataFromPDF: function(ev, sc) {
    var rootData = sc.data;
    sc.setMode(e.scenario.mode.clearIfRunning);
    sc.setScenarioTimeout(600000); // Default timeout for global scenario.
    sc.onError(function(sc, st, ex) { sc.endScenario(); }); // Default error handler.
    sc.onTimeout(30000, function(sc, st) { sc.endScenario(); }); // Default timeout handler for each step.
    sc.step(GLOBAL.steps.Set_credentials, GLOBAL.steps.Set_tokenURL);
    sc.step(GLOBAL.steps.Set_tokenURL, GLOBAL.steps.Set_doxURL);
    sc.step(GLOBAL.steps.Set_doxURL, GLOBAL.steps.Set_file_path);
    sc.step(GLOBAL.steps.Set_file_path, GLOBAL.steps.Generate_token);
    sc.step(GLOBAL.steps.Generate_token, GLOBAL.steps.Upload_file);
    sc.step(GLOBAL.steps.Upload_file, GLOBAL.steps.Retrieve_extracted_da);
    sc.step(GLOBAL.steps.Retrieve_extracted_da, GLOBAL.steps.Retrieve_extracted_da, 'loop');
    sc.step(GLOBAL.steps.Retrieve_extracted_da, null);
}}, ctx.dataManagers.rootData).setId('57146a42-30d7-47af-8aad-9844f008f7d8') ;
Using the name 'loop', we will be able to make the bot go to the step Retrieve_extracted_da (which is, in our case, the same step it was in previously).
- Then, update the code of the last web service call, so you get:
// ----------------------------------------------------------------
//   Step: Retrieve_extracted_da
// ----------------------------------------------------------------
GLOBAL.step({ Retrieve_extracted_da: function(ev, sc, st) {
    var rootData = sc.data;
    ctx.workflow('ExtractDataFromPDF', '6ae7f1e0-85e1-410d-9ce6-f9509d1d5242') ;
    // Retrieve extracted data
    ctx.ajax.call({
        url: rootData.WS.Input.ServiceKey.doxURL + '/document/jobs/' + rootData.WS.Output.docId + '?clientId=c_00&timestamp=' + new Date().getTime(),
        method: e.ajax.method.get,
        header: {
            Accept: e.ajax.content.json,
            Authorization: sc.localData.token
        },
        contentType: e.ajax.content.json,
        success: function(res, status, xhr) {
            if (res.status == 'DONE') {
                rootData.WS.Output.data = res;
                sc.endStep(); // end Scenario
                return;
            } else if (res.status == 'PENDING') {
                ctx.wait(function() {
                    sc.endStep('loop');
                }, 5000);
            }
        },
        error: function(xhr, error, statusText) {
            ctx.log(' ctx.ajax.call error: ' + statusText);
        }
    });
}});
As described above, when the status is PENDING, we will loop over this step after a short delay (here, 5000 ms). This can be achieved with the sc.endStep('loop') instruction.
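The same polling idea can be sketched in plain JavaScript, independently of the SDK. Everything here is hypothetical: pollUntilDone, the callback-style fetchStatus function standing in for the web service call, and the maxAttempts limit, which the step above does not have. Any status other than DONE or PENDING is reported as an error instead of being silently ignored.

```javascript
// Generic polling sketch. fetchStatus(callback) is a hypothetical function that
// performs one status request and calls back with (error, result).
function pollUntilDone(fetchStatus, options, onDone, onError) {
  var schedule = options.schedule || setTimeout; // can be replaced, e.g. in tests
  var attempts = 0;

  function attempt() {
    attempts += 1;
    fetchStatus(function (err, res) {
      if (err) {
        onError(err);
        return;
      }
      if (res.status === 'DONE') {
        onDone(res);
        return;
      }
      if (res.status === 'PENDING' && attempts < options.maxAttempts) {
        // Ask again after a short delay instead of blocking.
        schedule(attempt, options.delayMs);
        return;
      }
      onError(new Error('Gave up with status ' + res.status + ' after ' + attempts + ' attempt(s)'));
    });
  }

  attempt();
}

// Usage with a fake service that answers PENDING twice, then DONE.
var fakeAnswers = ['PENDING', 'PENDING', 'DONE'];
var fakeCalls = 0;
function fakeFetchStatus(callback) {
  var status = fakeAnswers[Math.min(fakeCalls, fakeAnswers.length - 1)];
  fakeCalls += 1;
  callback(null, { status: status });
}
pollUntilDone(fakeFetchStatus, { delayMs: 10, maxAttempts: 10 },
  function (res) { console.log('Finished with status ' + res.status); },
  function (err) { console.log('Failed: ' + err.message); });
```

Capping the number of attempts is a design choice: it turns an endless wait into an explicit error that the scenario can handle.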
Important note: if we take a look at the code, we can see that we give a clientId parameter in the URL when we retrieve the data. This parameter is also present in the options we send to DOX when we upload the document. In our case, this parameter has the value c_00. The API documentation of the service states that this parameter is mandatory. To make sure that you can use the service with this client id, you need to create a client using the Client API (details here).
Conclusion
At this point, the data extracted from the document is stored in the rootData.WS.Output.data variable, so you can pass it as a parameter to another scenario, for example to process invoices.
As detailed in the documentation, the result is a JSON object. Each extracted value comes with a confidence score (a number between 0 and 1), which the scenario can use when working with the data.
One might imagine a scenario where an error is raised if any value has a confidence score lower than 0.8, for example.
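Such a check could be sketched as follows. The result shape is an assumption to verify against the service documentation: header fields in extraction.headerFields, line items in extraction.lineItems (one array of fields per line), each field carrying name, value and confidence. The helper names and the sample data are hypothetical.

```javascript
// Assumed result shape (check the service documentation):
// { extraction: { headerFields: [{ name, value, confidence }],
//                 lineItems: [[{ name, value, confidence }]] } }

// Returns the value of a header field, or undefined if it was not extracted.
function getHeaderValue(result, fieldName) {
  var fields = (result && result.extraction && result.extraction.headerFields) || [];
  for (var i = 0; i < fields.length; i++) {
    if (fields[i].name === fieldName) {
      return fields[i].value;
    }
  }
  return undefined;
}

// Lists the names of all fields whose confidence is below the threshold.
function findLowConfidenceFields(result, threshold) {
  var extraction = (result && result.extraction) || {};
  var low = [];
  (extraction.headerFields || []).forEach(function (field) {
    if (field.confidence < threshold) {
      low.push(field.name);
    }
  });
  (extraction.lineItems || []).forEach(function (item, index) {
    item.forEach(function (field) {
      if (field.confidence < threshold) {
        low.push('lineItems[' + index + '].' + field.name);
      }
    });
  });
  return low;
}

// Sample data for illustration only.
var sampleResult = {
  status: 'DONE',
  extraction: {
    headerFields: [
      { name: 'documentNumber', value: 'INV-001', confidence: 0.97 },
      { name: 'grossAmount', value: 120.5, confidence: 0.62 }
    ],
    lineItems: [
      [
        { name: 'description', value: 'Widget', confidence: 0.91 },
        { name: 'quantity', value: 2, confidence: 0.45 }
      ]
    ]
  }
};
var lowFields = findLowConfidenceFields(sampleResult, 0.8);
if (lowFields.length > 0) {
  console.log('Fields to review: ' + lowFields.join(', '));
}
```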
And more…
You can find this content as a webinar, or see this page to find it in the list along with other webinar offerings.
Lastly, you can watch this video presenting an end-to-end use case.
What’s next?
Now… it’s up to you to build powerful bots combining Intelligent RPA and other services. You know what to do!
Hi Jerome, thanks for the blog, nice reading!
As of the recent 2004 release, RPA is able to read data from a PDF document using the PDF library. In a scenario where the customer wants to retrieve the “Order Number” or the “Amount to be paid” from a PDF file like an invoice or a payment note, why should the customer use DOX? Can the customer use the new PDF activities instead of buying another service (DOX)?
Thank you and best regards
Davide
Hi Davide,
the PDF library as described in https://blogs.sap.com/2020/05/12/how-to-extract-data-from-text-searchable-pdf-documents-in-sap-intelligent-robotic-process-automation/ will be able to extract some fields from text-based PDFs only.
Document Information Extraction accepts any kind of PDF file (textual or image-based) and in the future will also support further formats such as image formats, Excel, CSV, text and email, among others.
Please find more information in these blogs:
Best regards,
Tim
Hi,
Also, another reason is that you might have to process multiple formats of invoices. If you use the built-in PDF library, it might be difficult as you would need to think of all these formats when you're implementing the bot.
On the other hand, using the DOX service you won't have to worry about this aspect.
Regards,
J.
Hi Jerome GRONDIN ,
Thanks for this nice blog. It was very helpful in implementing my scenario. I have a question about it: if we have PDFs in different formats, what do we have to do? Can we change the fields below in the upload document step?
Thanks
Hi,
Even when you have different formats, you don't have to change anything (assuming that all your documents are invoices, for example).
Regards,
J.
A very helpful blog for understanding this use case. I am trying to achieve something similar through a UI5 app. I am trying to upload an image/PDF to this API. Would the formData/ajax POST call be similar to the one mentioned in step 2 of the code?
Thanks!
Hi everyone,
Thanks for the blog...
When we try this, we cannot find how to set up the variable in the custom activity, as we are not getting the option in Desktop Studio.
Please find attached screen shot
Hello,
Just curious: what is the version of your Desktop Studio?
This part (retrieve data from Factory and set variable) is quite common. Maybe you can copy/paste the code of these steps directly from the sample mentioned in the blog post.
Regards,
J.
Hi Dheeraj Agrawal,
You need to use "Set Context" Activity instead of Set credentials.
Thanks
Vijay
Hi,
I have followed the same process as described in this blog, but I am getting a timeout error in the generate token method, as shown below.
Can you please help me resolve this error?
Hi,
The timeout might have different causes. You should try to set a breakpoint in your code to read the exact message of the exception thrown in the ctx.ajax.call error callback.
Regards,
J.
Hi Jerome, I have the same issue around here. I have checked traces, logs, and everything I could find in the Desktop Studio but I couldn't find further detail about the error.
I'd appreciate any hint about this. Thank you!
Hi again,
I would appreciate any comments about the timeout issue. Thank you.
Hi Vilas Salunke,
Were you able to fix the timeout error? I have the exact same issue. Your input is appreciated.
Thanks
Sandeep
I have the same issue around here. I have checked traces, logs, and everything I could find in the Desktop Studio but I couldn't find further detail about the error.
Does anybody have any update about this error?
Did you solve this error?
Hi,
I want to use the 'documentNumber' value from the extracted data in my next scenario. How can I read 'documentNumber' from res, i.e. in success: function(res, status, xhr)?
Hello,
Have a look at the sample mentioned in the blog post. There is an example of how to access the data you retrieve from the service.
Regards,
J.
Hello,
First of all, thank you for your post. It is very useful. I also watched one of the webinar videos on YouTube: https://www.youtube.com/watch?v=qAf7WGkJ-8w&ab_channel=SAPCommunity
When I follow the same steps I get an error. Error screenshots are below; could you please help me?
Best Regards,
Hello,
The errors do not seem to be related to DOX, but to the use of credentials in an environment variable. I suggest you try this in a simpler script to make sure you can retrieve credentials this way.
Regards,
J.
Hello,
I tried with a simple script, it works.
Thank you.
Hi Expert,
Can we create the same example in the trial account?
Regards,
Padmindra