Simardeep Singh

PDF data extraction in Intelligent RPA – Part 1

This blog post is part of the SAP Intelligent RPA 2.0 Best Practices Series.

Introduction

PDF (Portable Document Format) is one of the formats most widely used by individuals and organizations to exchange information. It is commonly used to create business-related documents and therefore plays an important part in most process automations.

Sharing information is very convenient with PDF, but extracting the information from the documents can be a tedious task. Most Intelligent RPA bots require input data to execute a process, and this data can be supplied from a document. An example of such a scenario is extracting data from a Purchase Order and performing actions in an ERP (Enterprise Resource Planning) system based on that data.

To solve this problem, Intelligent RPA 2.0 introduced the PDF SDK, which allows you to extract data from documents with the help of user-friendly and convenient activities. It is part of the Cloud Studio and can extract text from machine-readable/generated PDFs.

PDF Activities are divided into 4 modules:

[Image: PDF SDK Modules]

 

Sample Document

The Application Form PDF above will be used to demonstrate the activities.

Mandatory Activities

The Open PDF activity and the Close PDF and Release Resources activity are mandatory. Use them before and after the PDF extraction activities, respectively, as shown in the image below.

The Open PDF activity accepts the PDF path and password as input parameters and extracts the data that is used by the activities that follow.
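The open, extract, close order can be pictured with a small Python sketch. This is only an illustration of the pattern and not the PDF SDK itself: the FakePdf class, the open_pdf helper, the file name and the placeholder text are all hypothetical names made up for this example.

```python
from contextlib import contextmanager


class FakePdf:
    """Hypothetical stand-in for an opened PDF document (not an SAP API)."""

    def __init__(self, path, password=None):
        self.path = path
        self.password = password
        self.is_open = True
        # Placeholder content; a real implementation would parse the file.
        self._text = "Job Situation Employed"

    def get_text(self):
        # Extraction is only allowed while the document is open.
        if not self.is_open:
            raise RuntimeError("PDF must be opened before extracting text")
        return self._text

    def close(self):
        # Mark the document as closed so resources count as released.
        self.is_open = False


@contextmanager
def open_pdf(path, password=None):
    """Open the document, hand it to the caller, and always close it afterwards."""
    pdf = FakePdf(path, password)
    try:
        yield pdf
    finally:
        pdf.close()


# Every extraction step sits between the open and the close.
with open_pdf("application_form.pdf") as pdf:
    text = pdf.get_text()

print(text)
print(pdf.is_open)
```

The try/finally inside the helper is the design point: the close step runs even if an extraction step fails, which mirrors why the closing activity is mandatory.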

Core Activities

Core activities are simple activities that return the result as text rather than in a complex format. One such core activity is Get Text, which returns the complete text of the PDF, as shown in the image below.

 

One of the most useful activities is Get Text After, which allows users to fetch the text after a specified search string. The activity lets you control the number of words to be extracted using the numWords parameter. In the image below, the activity searches for the string Job Situation and retrieves the value that follows it.
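To make the idea of a search string plus a word count more concrete, here is a minimal Python sketch that works on text which has already been extracted. It is not how the activity is implemented in the SDK. The function name, the num_words argument (mirroring numWords) and the sample text are assumptions made for illustration.

```python
def get_text_after(text: str, search_string: str, num_words: int = 1) -> str:
    """Return up to num_words words that follow the first match of search_string."""
    index = text.find(search_string)
    if index == -1:
        return ""  # search string not found
    # Everything to the right of the match, split on whitespace.
    remainder = text[index + len(search_string):]
    words = remainder.split()
    return " ".join(words[:num_words])


# Made-up text standing in for the output of a text extraction step.
sample_text = "Name John Doe Job Situation Employed full time Country Germany"

print(get_text_after(sample_text, "Job Situation", num_words=1))
print(get_text_after(sample_text, "Job Situation", num_words=3))
```

With num_words set to 1 the sketch returns only the word right after the label, and a larger value widens the window, which is the kind of control the numWords parameter is described as giving.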

 

Another useful activity is Extract Text with Regular Expression. It extracts text by using a regular expression and returns the text that matches it.
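As a rough illustration of regular-expression extraction, the following Python sketch uses the standard re module on a made-up piece of text. The helper name, the sample text and the date pattern are assumptions for this example and say nothing about the regular expression syntax or return format used by the activity itself.

```python
import re
from typing import List


def extract_with_regex(text: str, pattern: str) -> List[str]:
    """Return every substring of text that matches pattern, in order of appearance."""
    return [match.group(0) for match in re.finditer(pattern, text)]


# Made-up text standing in for the output of a text extraction step.
sample_text = "Application date: 2021-03-15. Renewal date: 2022-03-15."

# Four digits, a dash, two digits, a dash, two digits.
dates = extract_with_regex(sample_text, r"\d{4}-\d{2}-\d{2}")
print(dates)
```

A pattern is a good fit when the value has a predictable shape (dates, order numbers, amounts) but no fixed label in front of it.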

There are a few more Core Activities that can be used to retrieve text from a PDF.

  • Total Pages in PDF – It returns the total number of pages in a PDF document.
  • Get Page Dimensions – It returns the dimensions of a page in PDF document.
  • Get Text Before – It is similar to the Get Text After activity, but it retrieves the text before the search string.
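The Get Text Before idea can be sketched in Python in the same spirit: find the search string and keep the words to its left. This is an illustration on already-extracted text, not the SDK's implementation. The function name, the num_words argument and the sample text are hypothetical.

```python
def get_text_before(text: str, search_string: str, num_words: int = 1) -> str:
    """Return up to num_words words that precede the first match of search_string."""
    index = text.find(search_string)
    if index == -1 or num_words <= 0:
        return ""  # nothing to return if there is no match or no words are requested
    # Everything to the left of the match, split on whitespace.
    words = text[:index].split()
    return " ".join(words[-num_words:])


# Made-up text standing in for the output of a text extraction step.
sample_text = "Name John Doe Job Situation Employed"

print(get_text_before(sample_text, "Job Situation", num_words=2))
```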

Conclusion

By reading this blog post, you learned about the new PDF SDK and its features. In addition, you got a basic overview of the core activities that can be used to extract data from PDF documents.

In the following blog posts, we will go into more detail on how to use filters and extract data from tables. We will also present invoice activities that can extract common fields from most invoices.

Thanks for reading and feel free to leave a comment with questions or feedback 🙂

 

Find more information on SAP Intelligent RPA:

Exchange knowledge: SAP Community | Q&A | Blog

Learn more: Webinars | Help Portal | openSAP

Explore: Product Information | Successful Use Cases

Try SAP Intelligent RPA for Free: Trial Version | Pre-built Bots

Follow us on: LinkedIn, Twitter and YouTube

 

      Comments
      Werner Jacobs

      Hi Simardeep

      Thanks for the nice blog. Looking forward to the next one.

      I tried the Get Total Pages and Get Text After activities with an EWA Report (PDF).

      When displaying the extracted text from the Get Text After activity in a Log Message, it also displays the page number and searchString instead of only the required text after the searchString. Is this normal behavior?

       

      [Image: PDF_EXTRACT]

       

      Simardeep Singh
      Blog Post Author

      Hi,

      Thanks for reading the blog post. This is not the expected behavior. Could you verify the output of the Get Text After activity in test mode? It should not display the page number.

      Werner Jacobs

      Hi Simardeep

      Not sure if this is what you requested.

      Regards

      Sujata Jena

      Hi Simardeep,
      Thank you for the blog.

      I tried simple activities, such as getting the total pages and the purchase order number, with a PDF.

      While running the automation, I get the error "error to download dependencies from factory", even though I have added irpa_pdf in the dependency column. Can you please help me with this?
