PDF data extraction in Intelligent RPA - Part 1

former_member659766 · ‎12-22-2020

This blog post is part of the SAP Intelligent RPA 2.0 Best Practices Series.

Introduction

PDF (Portable Document Format) is one of the formats most widely used by individuals and organizations to exchange information. It is commonly used for business documents and therefore plays an important part in most process automations.

Sharing information as PDF is convenient, but extracting information from the documents can be a tedious task. Most Intelligent RPA bots require input data to execute a process, and that data is often supplied in a document. One example is extracting data from a purchase order and then performing actions in an ERP (Enterprise Resource Planning) system based on that data.

To solve this problem, Intelligent RPA 2.0 introduced the PDF SDK, which allows you to extract data from documents with the help of user-friendly activities. It is part of the Cloud Studio and can extract text from machine-readable (machine-generated) PDFs.

PDF Activities are divided into 4 modules:

Sample Document

The Application Form PDF above is used to demonstrate the activities.

Mandatory Activities

Open PDF and Close PDF and Release Resources are the two mandatory activities. Use Open PDF before the PDF extraction activities and Close PDF and Release Resources after them, as shown in the image below.

The Open PDF activity accepts the PDF path and password as input parameters and extracts the data that is used by the subsequent activities.

Core Activities

Core activities are simple activities that return their result as plain text rather than in a complex format. One such activity is Get Text, which returns the complete text of the PDF, as shown in the image below.

One of the most useful activities is Get Text After, which fetches the text that follows a specified search string. You control the number of words to extract with the numWords parameter. In the image below, the activity searches for the string Job Situation and retrieves the value that follows it.
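The behavior described for Get Text After can be illustrated with a minimal Python sketch. This is not the PDF SDK's implementation: the function name get_text_after, the num_words argument (mirroring numWords), the whitespace-based word splitting, and the sample text are all assumptions made for illustration.

```python
def get_text_after(text: str, search: str, num_words: int = 1) -> str:
    """Return up to num_words whitespace-separated words that follow search.

    Hypothetical helper that only mimics the idea of the activity.
    """
    start = text.find(search)
    if start == -1:
        return ""  # search string not found
    remainder = text[start + len(search):]
    # split() with no arguments treats spaces and line breaks alike
    return " ".join(remainder.split()[:num_words])


# Made-up sample text, not taken from the Application Form PDF.
sample = "Name John Doe\nJob Situation Employed\nCountry Germany"
print(get_text_after(sample, "Job Situation", num_words=1))
```

Limiting the result by word count keeps the extracted value short even when the label and its value are followed by unrelated text on the same page.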

Another useful activity is Extract Text with Regular Expression, which extracts text by using a regular expression. The activity returns the text that matches the regular expression.

There are a few more core activities that can be used to retrieve text from a PDF.
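To show what matching text with a regular expression looks like, here is a small Python sketch using the standard re module. It only illustrates the technique, not the activity itself: the helper name extract_with_regex, the choice to return every match as a list, the patterns, and the sample text are assumptions.

```python
import re


def extract_with_regex(text: str, pattern: str) -> list[str]:
    """Return every substring of text that matches pattern (hypothetical helper)."""
    return [match.group(0) for match in re.finditer(pattern, text)]


# Made-up sample text, not taken from the Application Form PDF.
sample = (
    "Applicant: Jane Roe\n"
    "Email: jane.roe@example.com\n"
    "Date of birth: 1990-04-15"
)

# Dates in YYYY-MM-DD form
dates = extract_with_regex(sample, r"\d{4}-\d{2}-\d{2}")
# A deliberately simple e-mail pattern, not a full validator
emails = extract_with_regex(sample, r"[\w.+-]+@[\w-]+\.[\w.]+")

print(dates)
print(emails)
```

A regular expression is a good fit when the value has a recognizable shape (dates, e-mail addresses, order numbers) but no fixed label in front of it.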

Total Pages in PDF - It returns the total number of pages in a PDF document.

Get Page Dimensions - It returns the dimensions of a page in a PDF document.

Get Text Before - It is similar to the Get Text After activity, but it retrieves the text before the search string.
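The idea behind Get Text Before can be sketched in Python in the same way. This is an illustration under assumptions, not the SDK's implementation: the function name get_text_before, the num_words argument (assumed by analogy with Get Text After), and the sample text are hypothetical.

```python
def get_text_before(text: str, search: str, num_words: int = 1) -> str:
    """Return up to num_words whitespace-separated words that precede search.

    Hypothetical helper that only mimics the idea of the activity.
    """
    end = text.find(search)
    if end == -1 or num_words <= 0:
        return ""  # nothing to return if the search string is missing
    words = text[:end].split()
    # Take the last num_words words before the search string
    return " ".join(words[-num_words:])


# Made-up sample text, not taken from the Application Form PDF.
sample = "Name John Doe\nJob Situation Employed"
print(get_text_before(sample, "Job Situation", num_words=2))
```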

Conclusion

By reading this blog post, you learned about the new PDF SDK and its features. You also got a basic overview of the core activities that can be used to extract data from PDF documents.

In the following blog posts, we will go into more detail on how to use filters and extract data from tables. We will also present invoice activities that can extract common fields from most invoices.

Thanks for reading and feel free to leave a comment with questions or feedback 🙂

Find more information on SAP Intelligent RPA: