Abinaya Rajeshkannan

How to Build Custom OCR in SAP Intelligent RPA

Background:

SAP Intelligent Robotic Process Automation (SAP Intelligent RPA) is an automation tool in which software bots mimic manual human actions, such as mouse clicks and filling in details in transactions, to automate mundane, repeatable business processes [1].

Most businesses handle unstructured data (PDF or JPG files) in their day-to-day work, for example invoices, scanned documents, and scanned handwritten documents. In SAP Intelligent RPA, we may need to process these unstructured documents to fetch data such as the invoice number, PO number, and date, and feed it into SAP transactions.

In this blog post, we are going to see how to handle unstructured data in SAP Intelligent RPA and how to extract information from these documents.

Data Description:

Most unstructured invoices come as PDFs, images, or scanned handwritten documents. The approach below helps you process all three types. I chose Vietnamese, which is one of the toughest languages to handle in OCR. For illustration purposes, I created a dummy Vietnamese PDF that contains a company name, address, telephone number, invoice number, and account number. We are going to extract the page from the PDF, convert it into an image, and then apply OCR to the image.
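The PDF-to-image-to-OCR flow described above can be sketched in Python roughly as follows. This is a minimal sketch and not necessarily the code used in this post: it assumes the pdf2image and tesserocr packages are installed, that Poppler is on the PATH, and that vie.traineddata sits in a local tessdata folder. The function names and the ./tessdata path are illustrative only.

```python
def pdf_page_to_image(pdf_path, page=1, dpi=300):
    """Render one PDF page to a PIL image with pdf2image (needs Poppler)."""
    # Imported lazily so this module still loads where pdf2image is absent.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path, dpi=dpi, first_page=page, last_page=page)
    return pages[0]


def image_to_text(image, lang="vie", tessdata_dir="./tessdata"):
    """Run Tesseract on a PIL image through tesserocr."""
    from tesserocr import PyTessBaseAPI

    # tessdata_dir is an assumed location; depending on the Tesseract
    # version the path may need a trailing separator.
    with PyTessBaseAPI(path=tessdata_dir, lang=lang) as api:
        api.SetImage(image)
        return api.GetUTF8Text()


def ocr_pdf_page(pdf_path, page=1, render=pdf_page_to_image, recognize=image_to_text):
    """Chain the two steps: render the page, then recognize its text."""
    image = render(pdf_path, page)
    return recognize(image)
```

Keeping the render and recognize steps as separate, replaceable functions makes it possible to try the flow on a plain image (skipping the PDF step) or to swap in another OCR engine without touching the rest.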

 

Implementation:

Steps in SAP Intelligent RPA:

1. Create a new project in SAP Intelligent RPA Desktop Studio.

2. Go to Workflow and create a new workflow.

3. Drag Custom from the activities bar and drop it next to Start in the workflow. You will find Custom under the activities bar in the bottom right corner of Desktop Studio.

4. After step 3, your workflow looks like the screenshot below.

5. Now, press Build to build the project.

6. Once the project is built, the code for the workflow has been generated.

Approach for OCR:

We built the project in Desktop Studio; now let's move on to the OCR side. I used a pre-trained deep learning model, vie.traineddata, from tessdata (https://github.com/tesseract-ocr/tessdata) to extract fields from the Vietnamese PDF described above. tessdata offers models for more than 120 languages. If you want to use a different language, download <<your language>>.traineddata from the link above, place it in the tessdata folder, and follow a procedure similar to the one below.
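Before running OCR for a new language, it can help to confirm that the matching .traineddata file is actually in place. The following is a small sketch using only the Python standard library; the helper names are made up for illustration, and the tessdata folder location is whatever you chose on your machine.

```python
from pathlib import Path


def traineddata_path(tessdata_dir, lang):
    """Return the expected model file for a language code such as 'vie'."""
    return Path(tessdata_dir) / f"{lang}.traineddata"


def missing_languages(tessdata_dir, langs):
    """List the language codes whose model file is not in tessdata_dir."""
    return [lang for lang in langs
            if not traineddata_path(tessdata_dir, lang).is_file()]
```

For example, missing_languages("tessdata", ["vie"]) returns an empty list when vie.traineddata is present in the tessdata folder, and ["vie"] when it still needs to be downloaded.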

Steps:

1. Clone the repository (https://github.com/Abinaya23/ocr_extraction.git) inside your SAP Intelligent RPA project folder. I have added all the files required for this blog post to this repository.

2. Once you have cloned it, your file structure looks like this:

3. Install Poppler from this link (http://blog.alivate.com.au/poppler-windows/). Add the Poppler bin folder to the PATH environment variable (for example, C:\Program Files\poppler-0.68.0\bin).

4. Install Python by following this link (https://phoenixnap.com/kb/how-to-install-python-3-windows). Make sure the Python path has been added to PATH.

5. Now, we have to install the libraries required to run the Python file.

6. In the command prompt, change directory to the ocr_extraction folder and execute the command below.

pip install -r requirements.txt

7. Inside the Python file, set the appropriate file paths for all your files.

8. Our invoice PDF contains only the company name, address, telephone number, invoice number, and account number, so we extract only that information. Later, you can modify the code to extract whatever fields you require from your invoices.
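To give an idea of what the field extraction in step 8 could look like, here is a minimal sketch that pulls labelled values out of OCR text with regular expressions and writes them to a JSON file. It is not the code from the repository: the English label patterns, the key names, and the helper names are all hypothetical and would need to be replaced with the labels printed on your own invoices.

```python
import json
import re
from pathlib import Path

# Hypothetical label patterns; replace them with the labels on your invoice.
FIELD_PATTERNS = {
    "company_name": r"Company Name\s*:\s*(.+)",
    "address": r"Address\s*:\s*(.+)",
    "tel_num": r"Tel\s*:\s*([\d +()-]+)",
    "invoice_number": r"Invoice Number\s*:\s*(\S+)",
    "account_number": r"Account Number\s*:\s*(\S+)",
}


def extract_fields(text):
    """Return a dict of field name to value; None where no label matched."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        fields[name] = match.group(1).strip() if match else None
    return fields


def save_fields(fields, out_path):
    """Write the fields as UTF-8 JSON, creating the folder if needed."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps Vietnamese characters readable in the file.
    out.write_text(json.dumps(fields, ensure_ascii=False, indent=2),
                   encoding="utf-8")
    return out
```

Returning None for unmatched fields, instead of raising an error, lets the bot decide later whether a missing value should stop the process or just be flagged.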

Steps to do in SAP Intelligent RPA:

1. In SAP Intelligent RPA, we can execute our Python script as a shell command, as described in this link (https://contextor.eu/dokuwiki2/doku.php?id=lib:ctx:ctx.language#exec_command_timeout_callback).

 

2. Once the script above has executed, a JSON file is created under the data folder inside ocr_extraction.

3. After this, we have to read the JSON file in order to extract the data.

4. For illustration purposes, I have printed all the variables extracted from the PDF.

5. The output from the debugger.
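Before wiring the JSON file into the workflow, it can be useful to check it from Python first. The sketch below reads the file and reports which fields came back empty. The key names are assumptions for illustration, so use whatever keys your script actually writes; the bot itself would read the file with its own scripting, which is not shown here.

```python
import json
from pathlib import Path

# Hypothetical key names; adjust to the keys in your JSON file.
EXPECTED_KEYS = ("company_name", "address", "tel_num",
                 "invoice_number", "account_number")


def load_fields(json_path, expected=EXPECTED_KEYS):
    """Return (fields, missing) for the JSON file written by the OCR script."""
    fields = json.loads(Path(json_path).read_text(encoding="utf-8"))
    # A key counts as missing when it is absent or holds an empty value.
    missing = [key for key in expected if not fields.get(key)]
    return fields, missing


def describe(fields, missing):
    """Build printable lines, one per extracted field, flagging any gaps."""
    lines = [f"{key}: {value}" for key, value in fields.items()]
    if missing:
        lines.append("missing: " + ", ".join(missing))
    return lines
```

Printing the lines from describe(*load_fields(path)) gives a quick view of what the OCR step produced, similar in spirit to inspecting the variables in the debugger.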

Yay! We have successfully extracted the data from the unstructured files. After this, you can use these extracted fields to process the invoice/PO in SAP GUI or Ariba.

Related Articles:

A few articles related to this blog post:

Getting Started with SAP Intelligent RPA: https://help.sap.com/viewer/product/IRPA/Cloud/en-US

Understanding SAP Intelligent RPA: https://eursap.eu/2019/11/05/sap-intelligent-robotic-process-automation-rpa/

SAP Intelligent RPA in SAP GUI: https://blogs.sap.com/2020/02/05/mass-deletion-of-users-in-sap-gui-using-sap-intelligent-rpa-challenge-submission/

 

References:

[1] https://open.sap.com/courses/rpa1


      19 Comments
      Pierre COL

      Hi Abinaya,

      Thanks for your useful post!

      Please tag it with SAPIntelligentRPA_2020TutorialChallenge if you want to take part in our Intelligent RPA Tutorials Challenge.

      Kind regards,

      Abinaya Seenivasan
      Blog Post Author

      Hi Pierre COL,

       

      Thanks. Sure will do it.

       

      Best Regards,

      Abi

      Pierre COL

      You do not need to put #, SAPIntelligentRPA_2020TutorialChallenge is OK.

      Maximiliano Gonzales
      Thank you very much for the example.
      I had doubts about how to use OCR.
      Regards!
      Prasad Sundara Raghavan

      Hi

      Thanks for the nice post. Really useful.

      I am getting an error in step 6 when trying to execute pip install requirements.txt. The error is: ERROR: tesserocr-2.4.0-cp37-cp37m-win_amd64.whl is not a supported wheel on this platform.

      Can you please help me here?

      Thanks

      Prasad

      Abinaya Seenivasan
      Blog Post Author

      Hi Prasad,

      Please find the tesserocr packages at this link: https://github.com/simonflueckiger/tesserocr-windows_build/releases. Based on your platform (Python version), you can put the matching tesserocr wheel link in requirements.txt. Otherwise, you can download the <package_name>.whl and install the tesserocr package manually using the command below:

      pip install <<path of package.whl>>/<package_name>.whl

      Let me know if this helps.

       

      
      

       

      Prasad Sundara Raghavan

      Hi

      Thank you. I will check your response and get back to you.

      Thanks

      Prasad

      Bengu Alan

      Hello Abinaya Seenivasan ,

      Thank you for this post! I am missing something here. Where can I locate the 'tesserocr..' file? I got a syntax error; I guess it is because I couldn't locate this file properly. Could you please help me?

      Regards,

      Bengu

      Abinaya Seenivasan
      Blog Post Author

      Hi Bengu,

       

      Please find the tesserocr package listed in requirements.txt, which you can find in the GitHub repo here: https://github.com/Abinaya23/ocr_extraction.git

       

      Thanks,

      Abinaya

      Bengu Alan

      Hello Abinaya Seenivasan 

      Thank you. But I downloaded it manually because I also got an error in step 6, like the other commenter.

      Here is the screenshot of the folder structure. Is something wrong? Thank you.

      Regards,

      Bengu

      Abinaya Seenivasan
      Blog Post Author

      I think you need to install the packages. You can find the tesserocr package at this link: https://pypi.org/project/tesserocr/

      You have to get the latest Windows whl package that matches your Python version from here: https://github.com/simonflueckiger/tesserocr-windows_build/releases

      You can either download the tesserocr package or install it using pip install <<tesserocr.whl>>

      I hope this helps. If you need any more information, we can set up a call and discuss the issue.

       

      Thanks and Regards,

      Abinaya

      Rongxian Lin

      Hi Abinaya Seenivasan,

      I'm facing an issue when I use pip install -r requirements.txt. I have added Poppler to the PATH and changed the file_path. Can you please help me correct it? Thanks.

       

       

      Abinaya Seenivasan
      Blog Post Author

      Hi Rongxian Lin,

      You ran the command from a directory that does not contain requirements.txt. You have to change directory to the path where requirements.txt exists.

      Let me know if this helps.

      Thanks,

      Abi

      Rongxian Lin

      Thanks, it installed successfully, but it didn't extract the data.

      Rongxian Lin

      updated the pic

       

       

      Wei Ran

      Hello Abinaya Seenivasan ,

      I ran into the same problem as above. I have installed the tesserocr package successfully, but when I run the bot I get an error. Could you please help to check? Thanks so much!

      Wei Ran

      Solved! Installing pdf2image fixed it, thanks.

      Pooja Mittal

      What is the workaround? Please explain.

      Pooja Mittal

      Hi,

       

      I am getting the error at step 6. I have installed both new and old versions of tesserocr, and I have Python 2.9 and pip 2.3.3.

      The later code, which reads the default JSON file from the folder, runs fine, but I want to parse my own PDF. Please help; I am also developing a POC for a client.

       

      Thanks

      Pooja