PDF data extraction in Intelligent RPA – Part 1
This blog post is part of the SAP Intelligent RPA 2.0 Best Practices Series.
Introduction
PDF (Portable Document Format) is one of the most widely used formats for exchanging information among individuals and organizations. It is commonly used for business documents and therefore plays an important part in most process automations.
Sharing information as PDF is convenient, but extracting information from the documents can be a tedious task. Most Intelligent RPA bots require input data to execute a process, and that data can also be supplied from a document. An example of such a scenario is extracting data from a purchase order and performing actions in an ERP (Enterprise Resource Planning) system based on that data.
To solve this problem, Intelligent RPA 2.0 introduced the PDF SDK, which allows you to extract data from documents with the help of user-friendly activities. It is part of the Cloud Studio and can extract text from machine-readable (machine-generated) PDFs.
PDF Activities are divided into 4 modules:
Sample Document
The Application Form PDF above will be used to demonstrate the activities.
Mandatory Activities
The Open PDF and Close PDF and Release Resources activities are mandatory and should be used before and after the PDF extraction activities, as shown in the image below.
The Open PDF activity accepts the PDF path and password as input parameters and extracts the data that is used by the activities that follow.
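The open, extract, close order described above can be illustrated outside of Cloud Studio. The following is only a minimal sketch in plain Python, not the PDF SDK itself: the `PdfSession` class, its `extract` method, and the file name `application_form.pdf` are hypothetical names made up for this illustration.

```python
class PdfSession:
    """Hypothetical stand-in for an opened PDF; this is not the PDF SDK."""

    def __init__(self, path: str, password: str = "") -> None:
        self.path = path
        self.password = password
        self.is_open = False

    def __enter__(self) -> "PdfSession":
        # Mirrors the opening step: the document must be opened first.
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Mirrors the closing step: runs even if an extraction step failed.
        self.is_open = False

    def extract(self) -> str:
        # Extraction is only allowed between the open and close steps.
        if not self.is_open:
            raise RuntimeError("PDF must be opened before extraction")
        return f"text extracted from {self.path}"


with PdfSession("application_form.pdf") as pdf:
    text = pdf.extract()

print(text)
print(pdf.is_open)
```

The point of the sketch is the bracketing: extraction happens only between the open and close steps, and the close step always runs so that resources are released.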
Core Activities
One of the most useful activities is Get Text After, which allows users to fetch the text after a specified search string. The activity allows you to control the number of words to be extracted using the numWords parameter. In the image below, the activity searches for the string Job Situation and retrieves the value that follows it.
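To make the idea behind Get Text After more concrete, here is a minimal sketch in plain Python of how such a lookup could work on text that has already been extracted. It is not the SDK's implementation: the function name `get_text_after`, the whitespace-based word splitting, and the sample text are assumptions made for this illustration.

```python
def get_text_after(text: str, search_string: str, num_words: int = 1) -> str:
    """Return up to num_words words that follow the first match of search_string."""
    index = text.find(search_string)
    if index == -1:
        return ""  # search string not found in the text
    # Everything after the first occurrence of the search string.
    remainder = text[index + len(search_string):]
    words = remainder.split()
    # A zero or negative word count yields an empty result.
    return " ".join(words[:max(num_words, 0)])


# Made-up text standing in for the content of a form; not the real sample PDF.
sample_text = "Name John Smith Job Situation Employed full time Country Germany"

print(get_text_after(sample_text, "Job Situation", num_words=1))
print(get_text_after(sample_text, "Job Situation", num_words=3))
```

With this made-up text, a larger word count simply returns more of the words that follow the search string, which is the role the numWords parameter plays in the activity.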
There are a few more Core Activities that can be used to retrieve text from a PDF.
- Total Pages in PDF – Returns the total number of pages in a PDF document.
- Get Page Dimensions – Returns the dimensions of a page in a PDF document.
- Get Text Before – Similar to the Get Text After activity, but it retrieves the text before the search string.
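The idea behind Get Text Before can be sketched in the same spirit. The snippet below is plain Python operating on already extracted text, not the SDK's implementation. The function name `get_text_before`, the `num_words` parameter (assumed here by analogy with Get Text After), and the sample text are all assumptions for illustration.

```python
def get_text_before(text: str, search_string: str, num_words: int = 1) -> str:
    """Return up to num_words words that precede the first match of search_string."""
    index = text.find(search_string)
    if index == -1 or num_words <= 0:
        return ""  # nothing to return if there is no match or no words requested
    # Everything before the first occurrence of the search string.
    words = text[:index].split()
    # Take the last num_words words, i.e. the ones closest to the match.
    return " ".join(words[-num_words:])


# Made-up text standing in for the content of a form; not the real sample PDF.
sample_text = "Name John Smith Job Situation Employed full time"

print(get_text_before(sample_text, "Job Situation", num_words=2))
```

Note that the words are taken from the end of the preceding text, so the result stays in reading order while remaining adjacent to the search string.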
Conclusion
In this blog post, you learned about the new PDF SDK and its features. You also got a basic overview of the core activities that can be used to extract data from PDF documents.
In the following blog posts, we will go into more detail on how to use filters and how to extract data from tables. We will also present invoice activities that can extract common fields from most invoices.
Thanks for reading and feel free to leave a comment with questions or feedback 🙂
Hi Simardeep
Thanks for the nice blog. Looking forward to the next one.
I tried the Get Total Pages and Get Text After activities with an EWA Report (PDF).
When I display the extracted text from the Get Text After activity in a Log Message, it also shows the page number and the searchString instead of only the required text after the searchString. Is this normal behavior?
Hi,
Thanks for reading the blog post. This is not the expected behavior. Could you verify the output of the Get Text After activity in test mode? It should not display the page number.
Hi Simardeep
Not sure if this is what you requested.
Regards
Hi Simardeep,
Thank you for the blog.
I tried simple activities, such as get total pages and get purchase order number, with a PDF.
While running the automation, I get the error "error to download dependencies from factory", although I have added irpa_pdf in the dependency column. Can you please help me with this?