PDF data extraction in SAP Intelligent RPA – Part 2 (Advanced Activities + Filters)
The PDF SDK can be used to extract data from machine-readable documents using various convenience activities. As discussed in the previous blog post, core activities are simple activities that return the result as plain text rather than in a complex format.
In this article, we will discuss Advanced activities, which return output in a complex format, and Filters, which can be used to restrict text extraction to a specific part of a page.
We have seen previously that fields can be extracted using simple activities, but problems arise when the data is in a complex format such as a table or matrix. Advanced activities solve such problems conveniently by providing customizable parameters.
The problem of extracting table data is solved using an advanced activity called Get Table Column Entries. This activity allows the extraction of table data using different configurable parameters.
The column header serves as the starting point of the extraction, which in our case is Answer, and the text below the table is the stopping point.
The text below the table is optional and is only required when there is more text after the table. To make the extraction more precise, add a left and right offset.
As seen in the above example, the Answer column is extracted with all the cell values inside it. The textBelowTable parameter is empty because there is no text below the table.
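The idea behind this kind of column extraction can be sketched in plain Python. This is not the SAP Intelligent RPA SDK: the `TextItem` class, the coordinate fields and the function `get_table_column_entries` are hypothetical names chosen for illustration, and the sketch assumes that y grows downward from the top of the page.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    # Hypothetical text item: a word plus its bounding box on the page.
    text: str
    x: float       # left edge
    y: float       # top edge (assumed to grow downward)
    width: float
    height: float


def get_table_column_entries(items, column_header, text_below_table="",
                             left_offset=0.0, right_offset=0.0):
    """Collect the cell texts located below a column header.

    Extraction starts under the header and stops at the first item equal to
    text_below_table (if one is given and found below the header).
    """
    header = next((it for it in items if it.text == column_header), None)
    if header is None:
        return []

    # Horizontal band of the column, widened by the offsets.
    left = header.x - left_offset
    right = header.x + header.width + right_offset

    # Vertical stop position; infinity means "until the end of the page".
    stop_y = float("inf")
    if text_below_table:
        below = [it.y for it in items
                 if it.text == text_below_table and it.y > header.y]
        if below:
            stop_y = min(below)

    cells = [it for it in items
             if header.y < it.y < stop_y
             and it.x >= left
             and it.x + it.width <= right]
    cells.sort(key=lambda it: it.y)
    return [it.text for it in cells]


items = [
    TextItem("Question", 50, 100, 60, 10),
    TextItem("Answer", 200, 100, 50, 10),
    TextItem("Q1", 50, 120, 20, 10),
    TextItem("Yes", 200, 120, 25, 10),
    TextItem("Q2", 50, 140, 20, 10),
    TextItem("No", 200, 140, 20, 10),
    TextItem("Total", 200, 160, 35, 10),
]

answers = get_table_column_entries(items, "Answer", text_below_table="Total",
                                   left_offset=5, right_offset=5)
answers_without_stop = get_table_column_entries(items, "Answer",
                                                left_offset=5, right_offset=5)
```

With the stop text set, only the two cells under Answer are collected. Without it, the word Total is picked up as well, which is why the stopping text matters when something follows the table.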
Another advanced activity is Search Text Items. It returns a list of all text items that match the search string. A text item contains word properties such as dimensions, page number and index. These properties can be used to create a custom area filter that restricts the extraction to a specific area of the page.
In the above example, it grabs all the text items containing the search string Name. Since a text item is a complex return type, it provides different values associated with the text item, such as word, width, height and other properties.
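To illustrate what working with such a complex return type looks like, here is a minimal sketch in plain Python. It does not use the SAP SDK. The `TextItem` fields `x` and `y`, and the helpers `search_text_items` and `area_right_of`, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    # Hypothetical structure mirroring the properties mentioned above.
    word: str
    x: float
    y: float
    width: float
    height: float
    page: int
    index: int


def search_text_items(items, search_string):
    """Return every text item whose word contains the search string."""
    return [it for it in items if search_string in it.word]


def area_right_of(item, extra_width):
    """Build a rectangle (x, y, width, height) directly to the right of an item.

    Such a rectangle could serve as a custom area for a later extraction.
    """
    return (item.x + item.width, item.y, extra_width, item.height)


items = [
    TextItem("FirstName:", 40, 80, 60, 10, 1, 0),
    TextItem("John", 105, 80, 30, 10, 1, 1),
    TextItem("LastName:", 40, 100, 58, 10, 1, 2),
    TextItem("Doe", 105, 100, 25, 10, 1, 3),
]

matches = search_text_items(items, "Name")
value_area = area_right_of(matches[0], 100)
```

Here `matches` holds the two label items, and `value_area` describes the region where the value next to the first label is expected.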
There are a few more Advanced activities that can be used to retrieve text from a PDF.
- Get Text Items – Retrieves list of text items
- Get Text Items in Area – Retrieves text items in a given area
- Get Text in Area – Retrieves text in a given area
- Get Text After Multiple Search Strings – Returns the text after a search string; multiple search strings can be provided
- Get Text Before Multiple Search Strings – Returns the text before a search string; multiple search strings can be provided
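The difference between the two area-based activities in the list above (one returns text items, the other returns text) can be sketched as follows. This is plain Python and not the SDK: `TextItem`, `Area` and both functions are hypothetical, and the sketch assumes an item counts as inside an area only when its bounding box lies completely within it.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    word: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class Area:
    x: float
    y: float
    width: float
    height: float


def get_text_items_in_area(items, area):
    """Return the items whose bounding box lies fully inside the area."""
    return [it for it in items
            if it.x >= area.x
            and it.y >= area.y
            and it.x + it.width <= area.x + area.width
            and it.y + it.height <= area.y + area.height]


def get_text_in_area(items, area):
    """Return the words inside the area as one string, in reading order."""
    inside = sorted(get_text_items_in_area(items, area),
                    key=lambda it: (it.y, it.x))
    return " ".join(it.word for it in inside)


items = [
    TextItem("Invoice", 40, 50, 45, 10),
    TextItem("No.", 90, 50, 20, 10),
    TextItem("4711", 115, 50, 30, 10),
    TextItem("Total", 40, 300, 35, 10),
]

header_area = Area(30, 40, 200, 30)
header_items = get_text_items_in_area(items, header_area)
header_text = get_text_in_area(items, header_area)
```

The item variant keeps the positions available for further processing, while the text variant is convenient when only the string is needed.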
Filters restrict extraction to a specific part of the page. To limit activities, whether core or advanced, to a particular area, apply a filter just before them; the activities that follow will then extract only from that part of the page.
In the above example, the Filter Upper Half of Page activity restricts the extraction to the upper half of the page, so all the activities below it work only on the filtered part of the page.
Filters can be cleared by using the Clear Filters activity.
There are a few more Filters that can be used to restrict text extraction.
- Filter Pages – Restrict extraction based on page selection
- Lower Half of Page – Restrict extraction to lower half of page
- Left Side of Page – Restrict extraction to left side of page
- Right Side of Page – Restrict extraction to right side of page
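One way to picture how filters affect the activities that follow them is a list of conditions that every text item must pass until the filters are cleared. The sketch below is plain Python, not the SAP SDK. The class `PdfSession` and its methods are hypothetical, and it assumes that several filters can be active at once and that y is measured from the top of the page.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    word: str
    y: float    # top edge, assumed to grow downward
    page: int


class PdfSession:
    """Hypothetical model: filters stay active until they are cleared."""

    def __init__(self, items, page_height):
        self.items = items
        self.page_height = page_height
        self._filters = []

    def filter_upper_half_of_page(self):
        self._filters.append(lambda it: it.y < self.page_height / 2)

    def filter_lower_half_of_page(self):
        self._filters.append(lambda it: it.y >= self.page_height / 2)

    def filter_pages(self, pages):
        allowed = set(pages)
        self._filters.append(lambda it: it.page in allowed)

    def clear_filters(self):
        self._filters.clear()

    def get_text(self):
        # Every extraction only sees items that pass all active filters.
        visible = [it for it in self.items
                   if all(f(it) for f in self._filters)]
        return " ".join(it.word for it in visible)


session = PdfSession(
    [TextItem("Header", 50, 1), TextItem("Footer", 700, 1),
     TextItem("Appendix", 60, 2)],
    page_height=800,
)

session.filter_upper_half_of_page()
upper_text = session.get_text()

session.filter_pages([1])
upper_page_one_text = session.get_text()

session.clear_filters()
all_text = session.get_text()
```

After the upper-half filter only Header and Appendix remain visible, adding the page filter leaves Header, and clearing the filters makes all three words available again.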
In this blog post, you learned about the advanced activities and filters. In the following blog post, we will go deeper into field extraction from invoices and purchase orders.
Thanks for reading and feel free to leave a comment with questions or feedback 🙂