How to Build Custom OCR in SAP Intelligent RPA
SAP Intelligent Robotic Process Automation (SAP Intelligent RPA) is an automation tool in which software mimics manual human actions, such as mouse clicks and filling in details in transactions, to automate mundane and repeatable business processes.
Most businesses handle unstructured data (in PDF or JPG form) every day, for example invoices, scanned documents, and handwritten scanned documents. In SAP Intelligent RPA, we may need to process these unstructured documents to fetch data such as the invoice number, PO number, and date, and feed it into SAP transactions.
In this blog post, we are going to see how to handle unstructured data in SAP Intelligent RPA and how to extract information from these documents.
Most unstructured invoices come as PDFs, images, or handwritten scans. The approach below will help you process all three types. I am going to use Vietnamese, which is one of the toughest languages to handle in OCR. For illustration purposes, I have created a dummy Vietnamese PDF that contains the Company Name, Address, Telephone Number, Invoice Number, and Account Number. We are going to extract the page from the PDF, convert it into an image, and then apply OCR to the image.
Steps in SAP Intelligent RPA:
1. Create a new project in SAP Intelligent RPA Desktop Studio.
2. Go to Workflow and create a new workflow.
3. Drag Custom from the activities bar and drop it next to Start in the workflow. You will find Custom under the activities bar in the bottom right corner of Desktop Studio.
4. After step 3, your workflow looks like the screenshot below.
5. Now, press Build to build the project.
6. Once the project is built, the code for the workflow is generated.
Approach for OCR:
We built the project in Desktop Studio; now let's move on to the OCR side. I used the pre-trained deep learning model vie.traineddata from tessdata (https://github.com/tesseract-ocr/tessdata) to extract fields from the Vietnamese PDF described above. Tessdata offers models for more than 120 languages. If you want to use a different language, download <<your language>>.traineddata from the link above, place it in the tessdata folder, and follow the same procedure as below.
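Before running OCR, it can help to confirm that the language model is where Tesseract will look for it. The following is a minimal sketch, not part of the repository: the function name find_traineddata and the relative folder name "tessdata" are assumptions made for illustration.

```python
from pathlib import Path


def find_traineddata(tessdata_dir, lang):
    """Return the path to <lang>.traineddata inside tessdata_dir, or raise."""
    model_path = Path(tessdata_dir) / f"{lang}.traineddata"
    if not model_path.is_file():
        raise FileNotFoundError(
            f"{model_path} not found; download {lang}.traineddata "
            "from the tessdata repository and place it in this folder"
        )
    return model_path


if __name__ == "__main__":
    # "tessdata" is an assumed folder name relative to the current directory.
    try:
        print("Using model:", find_traineddata("tessdata", "vie"))
    except FileNotFoundError as err:
        print(err)
```

For another language, pass its code instead of "vie", for example "eng" for English.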
1. Clone the repository (https://github.com/Abinaya23/ocr_extraction.git) inside your SAP Intelligent RPA project folder. I have added all the files required for this blog post to this repository.
2. Once you have cloned it, your file structure looks like this:
3. Install poppler from this link (http://blog.alivate.com.au/poppler-windows/). Add the poppler bin folder to the PATH environment variable (for example, C:\Program Files\poppler-0.68.0\bin).
4. Install Python by following this link (https://phoenixnap.com/kb/how-to-install-python-3-windows). Make sure the Python path has been added to PATH.
5. Now, we have to install the libraries required to run the Python file.
6. In the command prompt, change directory to the ocr_extraction folder and execute the command below.
pip install -r requirements.txt
7. Inside the Python file, set the appropriate file paths for all your files.
8. Our invoice PDF only contains the Company Name, Address, Tel Num, Invoice Number, and Account Number, so we extract only those fields. Later, you can modify this code to extract whatever fields you require from your invoices.
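The PDF-to-image-to-OCR flow and the field extraction can be sketched roughly as follows. This is an illustrative sketch, not the script from the repository: the function names, the JSON keys, and the Vietnamese labels in FIELD_LABELS are assumptions, and it presumes each field sits on its own "Label: value" line. It uses convert_from_path from pdf2image and PyTessBaseAPI from tesserocr.

```python
import json
import unicodedata
from pathlib import Path

# Hypothetical labels: the real ones depend on how your invoice is worded.
FIELD_LABELS = {
    "company_name": "Công ty",
    "address": "Địa chỉ",
    "tel_num": "Điện thoại",
    "invoice_number": "Số hóa đơn",
    "account_number": "Số tài khoản",
}


def ocr_first_page(pdf_path, tessdata_dir, lang="vie"):
    """Render page 1 of the PDF to an image and return the OCR text."""
    # Imported here so the helpers below work without these packages.
    from pdf2image import convert_from_path  # needs poppler on the PATH
    from tesserocr import PyTessBaseAPI

    page = convert_from_path(pdf_path, dpi=300, first_page=1, last_page=1)[0]
    with PyTessBaseAPI(path=str(tessdata_dir), lang=lang) as api:
        api.SetImage(page)
        return api.GetUTF8Text()


def _norm(value):
    """Normalize accents and case so OCR text and labels compare equal."""
    return unicodedata.normalize("NFC", value).strip().casefold()


def parse_fields(text, labels=FIELD_LABELS):
    """Pick out 'Label: value' lines; fields that are not found stay None."""
    lookup = {_norm(label): key for key, label in labels.items()}
    fields = {key: None for key in labels}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        key = lookup.get(_norm(name))
        if sep and key is not None and fields[key] is None:
            fields[key] = value.strip()
    return fields


def save_fields(fields, json_path):
    """Write the fields as UTF-8 JSON, keeping Vietnamese characters readable."""
    json_path = Path(json_path)
    json_path.parent.mkdir(parents=True, exist_ok=True)
    json_path.write_text(
        json.dumps(fields, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

To extract other fields, add entries to FIELD_LABELS. Real invoices rarely follow one clean layout, so the line-based matching here would likely need to be adapted to your documents.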
Steps to do in SAP Intelligent RPA:
1. In SAP Intelligent RPA, we can execute our Python script as a shell command, as described in this link (https://contextor.eu/dokuwiki2/doku.php?id=lib:ctx:ctx.language#exec_command_timeout_callback).
2. Once the script is executed, a JSON file is created under the data folder inside ocr_extraction.
3. After this, we have to read the JSON file in order to extract the data.
4. For illustration purposes, I have printed all the variables extracted from the PDF.
5. The output from the debugger.
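To check the "run the script, then read the JSON" sequence outside the bot, the same two steps can be reproduced in plain Python. This is only a test-harness sketch under assumptions: run_and_load is a hypothetical helper, and the script and JSON file names you pass in depend on your copy of the repository. Inside the bot itself, the script is still launched as a shell command as described in step 1.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_and_load(script_path, json_path, timeout=120):
    """Run a Python script in its own folder, then load the JSON it wrote."""
    script = Path(script_path).resolve()
    result = subprocess.run(
        [sys.executable, str(script)],
        cwd=script.parent,        # so relative paths inside the script resolve
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        # Surface the script's own error output instead of failing silently.
        raise RuntimeError(f"{script.name} failed:\n{result.stderr}")
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)
```

If the script fails, for example because a library is missing, the raised error carries the script's stderr, which is usually the quickest way to see what went wrong before involving the bot.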
Yay! We have successfully extracted the data from the unstructured files. After this, you can use these extracted fields to process the invoice/PO in SAP GUI or Ariba.
You can check a few articles related to this blog post below:
Getting Started with SAP Intelligent RPA: https://help.sap.com/viewer/product/IRPA/Cloud/en-US
Understanding SAP Intelligent RPA: https://eursap.eu/2019/11/05/sap-intelligent-robotic-process-automation-rpa/
SAP Intelligent RPA in SAP GUI: https://blogs.sap.com/2020/02/05/mass-deletion-of-users-in-sap-gui-using-sap-intelligent-rpa-challenge-submission/
Thanks for your useful post!
Please tag it with SAPIntelligentRPA_2020TutorialChallenge if you want to take part in our Intelligent RPA Tutorials Challenge.
Hi Pierre COL,
Thanks. Sure will do it.
You do not need to put #, SAPIntelligentRPA_2020TutorialChallenge is OK.
Thanks for the nice post. Really useful.
I am getting an error in step 6 when trying to execute pip install requirements.txt. The error is ERROR: tesserocr-2.4.0-cp37-cp37m-win_amd64.whl is not a supported wheel on this platform.
Can you please help me here?
Please find the tesserocr packages at this link: https://github.com/simonflueckiger/tesserocr-windows_build/releases. Based on your platform (Python version), you can put the matching tesserocr wheel link in requirements.txt. Otherwise, you can manually install the tesserocr package using the below command after you download the <package_name>.whl,
Let me know if this helps.
Thank you. I will check your response and revert.
Hello Abinaya Seenivasan,
Thank you for this post! I am missing something here. Where can I locate the 'tesserocr..' file? I got a syntax error, and I guess it is because I couldn't locate this file properly. Could you please help me?
Please find the tesserocr package inside requirements.txt, which you can find in the GitHub repo here: https://github.com/Abinaya23/ocr_extraction.git
Hello Abinaya Seenivasan
Thank you. But I downloaded it manually because I also got an error in the 6th step, like the other commenter.
Here is a screenshot of the folder structure. Is something wrong? Thank you.
I think you need to install the packages. You can find the tesserocr package at this link: https://pypi.org/project/tesserocr/
You have to get the latest .whl package for Windows that matches your Python version here: https://github.com/simonflueckiger/tesserocr-windows_build/releases
After downloading the tesserocr wheel, install it using pip install <<tesserocr.whl>>
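The "not a supported wheel on this platform" error usually means the tags in the wheel file name do not match the running Python. A rough way to compare them is sketched below. The helper names are hypothetical, the check assumes CPython on Windows, and it is much simpler than pip's real compatibility logic.

```python
import struct
import sys


def interpreter_summary():
    """Describe the running interpreter in the terms used by wheel file names."""
    major, minor = sys.version_info[:2]
    return {
        "python_tag": f"cp{major}{minor}",   # e.g. cp37 for CPython 3.7
        "bits": struct.calcsize("P") * 8,    # 32 or 64
    }


def wheel_matches(wheel_name, summary):
    """Rough check: does the wheel file name carry this interpreter's tags?"""
    arch = "win_amd64" if summary["bits"] == 64 else "win32"
    has_python_tag = f"-{summary['python_tag']}-" in wheel_name
    return has_python_tag and wheel_name.endswith(f"-{arch}.whl")


if __name__ == "__main__":
    info = interpreter_summary()
    print(info)
    print(wheel_matches("tesserocr-2.4.0-cp37-cp37m-win_amd64.whl", info))
```

If the second line printed is False, pick the release whose file name carries the python_tag and architecture shown in the first line.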
I hope this helps. If you need any more information, we can set up a call and discuss the issue.
Thanks and Regards,
Hi Abinaya Seenivasan,
I'm facing an issue when I use pip install -r requirements.txt. I have added poppler to the Path and changed the file_path. Can you please help me correct it? Thanks.
Hi Rongxian Lin,
The command was not run in the directory that contains requirements.txt. You have to change directory to the path where requirements.txt exists.
Let me know if this helps.
Thanks, it installed successfully, but it didn't extract the data.
Updated the picture.
Hello Abinaya Seenivasan,
I ran into the same problem as above. I have installed the tesserocr package successfully, but I get an error when I run the bot. Could you please help check? Thanks so much!
Solved!! Installing pdf2image fixed it, thanks.
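A quick way to spot this kind of problem before running the bot is to check that the Python packages and the poppler tool can be found. The sketch below is an assumption-laden helper, not part of the repository: missing_requirements is a hypothetical name, and pdftoppm is the poppler executable that pdf2image relies on.

```python
import importlib.util
import shutil


def missing_requirements(modules=("pdf2image", "tesserocr"), tools=("pdftoppm",)):
    """Return the Python modules and command-line tools that cannot be found."""
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    missing += [tool for tool in tools if shutil.which(tool) is None]
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All requirements found.")
```

A missing module points to pip install, while a missing pdftoppm points to the poppler bin folder not being on the PATH.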
What is the workaround? Please explain.
I am getting the error at step 6. I have installed both new and old versions of tesserocr, and I have Python 2.9 and pip 2.3.3.
The later code, which reads the default JSON file from the folder, runs fine, but I want to parse my own PDF. Please help; I am also developing a POC for a client.