Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.

Background:


SAP Intelligent Robotic Process Automation is an automation tool in which software bots mimic manual human actions, such as mouse clicks and filling in transaction details, to automate mundane, repeatable business processes [1].


Most businesses deal with unstructured data (in PDF or JPG form) every day, for example invoices, scanned documents, and handwritten scanned documents. In SAP Intelligent RPA, we may need to process these unstructured documents to fetch data such as the invoice number, PO number, and date, and feed it into SAP transactions.


In this blog post, we are going to see how we can handle unstructured data in SAP Intelligent RPA and how to extract information from these documents.

Data Description:

Most unstructured invoices are in PDF, image, or handwritten format. The approach below will help you process all three types of data. I have chosen Vietnamese, which is one of the toughest languages to handle in OCR. For illustration purposes, I have created a dummy Vietnamese-text PDF that contains the Company Name, Address, Telephone Number, Invoice Number, and Account Number. We are going to extract the page from the PDF, convert it into an image, and then apply OCR to the image.
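The pipeline described above (PDF page to image, then OCR on the image) can be sketched in Python as follows. This is a minimal sketch and not necessarily the code in the repository used later in this post: it assumes the pdf2image and pytesseract packages together with local Poppler and Tesseract installations, and the function names, the 300 DPI setting, and the output file naming are hypothetical choices for illustration.

```python
from pathlib import Path


def page_image_name(pdf_path: str, page_number: int) -> str:
    """Build a PNG file name for one page, e.g. invoice.pdf -> invoice_page1.png."""
    return f"{Path(pdf_path).stem}_page{page_number}.png"


def ocr_first_page(pdf_path: str, lang: str = "vie", dpi: int = 300) -> str:
    """Render the first PDF page to an image and run OCR on it."""
    # Imported lazily so the helper above works without the OCR stack installed.
    from pdf2image import convert_from_path  # needs Poppler available on PATH
    import pytesseract  # needs the Tesseract binary installed

    # Convert only page 1; the result is a list of PIL images.
    pages = convert_from_path(pdf_path, dpi=dpi, first_page=1, last_page=1)
    image = pages[0]

    # Keep the rendered page next to the PDF so the OCR input can be inspected.
    image.save(Path(pdf_path).with_name(page_image_name(pdf_path, 1)))

    # lang selects the <lang>.traineddata model, here Vietnamese.
    return pytesseract.image_to_string(image, lang=lang)
```

Calling ocr_first_page with the path of the dummy invoice would return the recognized text as a single string. A higher DPI generally gives the OCR engine more detail to work with, at the cost of slower rendering.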

 



Implementation:


Steps in SAP Intelligent RPA:


1. Create a new project in SAP Intelligent RPA Desktop Studio.



2. Go to Workflow and create a new workflow.



3. Drag the Custom activity from the Activities bar and drop it next to Start in the workflow. You will find Custom in the Activities bar in the bottom-right corner of Desktop Studio.



4. After step 3, your workflow shows the Start node followed by the Custom activity.



5. Now, click Build to build the project.



6. Once the project is built, the code is generated.



Approach for OCR:


We have built the project in Desktop Studio; now let's move on to the OCR side. I used the pre-trained deep learning model vie.traineddata from tessdata (https://github.com/tesseract-ocr/tessdata) to extract fields from the above Vietnamese PDF. Tessdata offers models for more than 120 languages. If you want to use a different language, download <<your language>>.traineddata from the above link, place it in the tessdata folder, and follow the same procedure as below.
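Before running OCR, it can help to confirm that the language model file is actually in the tessdata folder. The helper below is a minimal sketch of that check; the function name is hypothetical, and it assumes Tesseract is driven with a --tessdata-dir option, for example through pytesseract.

```python
from pathlib import Path


def tessdata_config(tessdata_dir: str, lang: str) -> str:
    """Return a Tesseract config string after checking that <lang>.traineddata exists."""
    model = Path(tessdata_dir) / f"{lang}.traineddata"
    if not model.is_file():
        raise FileNotFoundError(f"Missing language model: {model}")
    # --tessdata-dir tells Tesseract where to look for the .traineddata files.
    return f'--tessdata-dir "{Path(tessdata_dir)}"'
```

If pytesseract is used, the returned string could be passed as the config argument, as in pytesseract.image_to_string(image, lang="vie", config=tessdata_config("tessdata", "vie")). Failing early with a clear message is easier to debug than a Tesseract error raised in the middle of a bot run.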

Steps:


1. Clone the repository (https://github.com/Abinaya23/ocr_extraction.git) inside your SAP Intelligent RPA project folder. I have added all the files required for this blog post to this repository.

2. Once you have cloned it, the ocr_extraction folder appears inside your project folder.



3. Install Poppler from this link (http://blog.alivate.com.au/poppler-windows/) and add its bin folder to the PATH environment variable (for example, C:\Program Files\poppler-0.68.0\bin).

4. Install Python by following this link (https://phoenixnap.com/kb/how-to-install-python-3-windows). Make sure the Python path has been added to the PATH environment variable.

5. Now, we have to install the libraries required to run the Python file.

6. In the command prompt, change directory to the ocr_extraction folder and execute the command below.






pip install -r requirements.txt

7. Inside the Python file, set the appropriate paths for all your files.



8. Our invoice PDF contains only the Company Name, Address, Telephone Number, Invoice Number, and Account Number, so we extract only that information. Later, you can modify this code to extract whatever fields you require from your invoices.
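To give an idea of what the field extraction in step 8 can look like, here is a minimal sketch that pulls labelled values out of OCR text with regular expressions and writes them to a JSON file. It is not the code from the repository: the English label patterns, field names, and function names are hypothetical, and you would replace the patterns with the labels that actually appear on your invoice.

```python
import json
import re

# Hypothetical label patterns; replace them with the labels printed on your invoice.
FIELD_PATTERNS = {
    "invoice_number": r"Invoice\s*(?:No|Number)\s*[:.]?\s*(\S+)",
    "account_number": r"Account\s*(?:No|Number)\s*[:.]?\s*(\S+)",
    "tel_number": r"Tel\s*[:.]?\s*([+\d][\d\s().-]*\d)",
}


def extract_fields(ocr_text: str) -> dict:
    """Return the first match for each pattern, or None when a field is not found."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, ocr_text, flags=re.IGNORECASE)
        fields[name] = match.group(1).strip() if match else None
    return fields


def save_fields(fields: dict, json_path: str) -> None:
    """Write the fields as UTF-8 JSON; ensure_ascii=False keeps Vietnamese text readable."""
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(fields, handle, ensure_ascii=False, indent=2)
```

Returning None for a missing field, instead of raising an error, lets the calling bot decide how to handle an incomplete invoice.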

Steps to do in SAP Intelligent RPA:


1. In SAP Intelligent RPA, we can execute our Python script as a shell command, as described in this link (https://contextor.eu/dokuwiki2/doku.php?id=lib:ctx:ctx.language#exec_command_timeout_callback).


 

2. Once the above script is executed, a JSON file is created in the data folder inside ocr_extraction.



3. After this, we have to read the JSON file to extract the data.



4. For illustration purposes, I have printed all the variables extracted from the PDF.



5. Check the output in the debugger.
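Because step 1 runs the Python script as a shell command, it is convenient if the script accepts its inputs as command-line arguments, so the bot can pass the PDF path and the JSON output path without editing the file. The sketch below shows one way to do this with argparse; the argument names and the default output path are hypothetical and are not taken from the repository.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments passed by the calling shell command."""
    parser = argparse.ArgumentParser(
        description="Extract invoice fields from a PDF via OCR."
    )
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument(
        "--out",
        default="data/output.json",  # hypothetical default location
        help="path of the JSON file to write",
    )
    parser.add_argument("--lang", default="vie", help="Tesseract language code")
    return parser.parse_args(argv)
```

With this in place, the shell command would take the form python <script>.py invoice.pdf --out data/output.json, where the script name depends on your project.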



Yay! We have successfully extracted the data from the unstructured files. After this, you can use these extracted fields to process the invoice/PO in SAP GUI or Ariba.

Related Articles:


You can check a few articles related to this blog post below:

Getting Started with SAP Intelligent RPA: https://help.sap.com/viewer/product/IRPA/Cloud/en-US

Understanding SAP Intelligent RPA: https://eursap.eu/2019/11/05/sap-intelligent-robotic-process-automation-rpa/

SAP Intelligent RPA in SAP GUI: https://blogs.sap.com/2020/02/05/mass-deletion-of-users-in-sap-gui-using-sap-intelligent-rpa-challen...

 

References:


[1] https://open.sap.com/courses/rpa1