How to extract data from text-searchable PDF Docum...

former_member406613 · ‎05-12-2020

At times, we want to pull some useful information out of a PDF document for further processing. Let's say, for example, you want to retrieve the "Order Number" and the "Amount to be paid" from a PDF file.

Previously, reading such data meant consuming external services through an API call and writing a considerable amount of JavaScript code to extract the relevant information.

No more worries now! 🙂 As of the 2004 release of SAP Intelligent Robotic Process Automation, reading data from a PDF document takes only a few simple, straightforward steps, which drastically reduces the development effort.

Recommendation: Unlike the Microsoft Outlook or Microsoft Excel libraries, the PDF library does not need to be added explicitly.

As soon as you drop a PDF activity onto the workspace of the Desktop Studio Workflow Perspective, the PDF library is automatically included in your project. It is therefore recommended to use the Open PDF activity rather than a Custom Activity, and to finish with the Release PDF activity, which releases all resources associated with the document.

Note: The PDF files must be smaller than 15MB and contain fewer than 100 pages or 10,000 words. Failure to comply with these limits results in an error.

Example Code Snippet:

A scenario to open the PDF document and read the Order Number only from Page 2.

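ctx.pdf.openPdf('..//Invoice.pdf', function(error) {
    if (error) {
        ctx.log("FAILURE: Opening Invoice.pdf failed");
        return; // nothing to read if the file could not be opened
    }
    // Restrict the search to page 2
    var filter = ctx.pdf.createFilter('2');
    // Extract the order number using a regular expression
    var orderNumber = ctx.pdf.extract(/Order Number: ([A-Z0-9]+)/, filter);
    ctx.log("Order Number: " + orderNumber);
    // Release all resources associated with the PDF
    ctx.pdf.release();
});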

Step 1: As shown above, the PDF is opened using the syntax

ctx.pdf.openPdf(filePath, function(error) {}, password)

If the PDF file is password protected, provide the password as a string. The callback function is executed as soon as the PDF is opened. If the PDF is not found, an error is returned in the error parameter, which is why the callback is mandatory.
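To try out the callback handling outside Desktop Studio, a small stand-in object can mimic the three parameters described above. This is only a sketch: createFakePdf and its knownFiles map are hypothetical names, and the assumptions that the callback runs synchronously, receives an Error object, and fails on a wrong password are ours, not documented SDK behavior.

```javascript
// Hypothetical stand-in for ctx.pdf, for dry runs outside Desktop Studio only.
// It mimics the parameter order openPdf(filePath, callback, password).
function createFakePdf(knownFiles) {
    return {
        openPdf: function (filePath, callback, password) {
            var entry = knownFiles[filePath];
            if (!entry) {
                // Assumption: a missing file is reported through the error parameter
                callback(new Error('File not found: ' + filePath));
            } else if (entry.password && entry.password !== password) {
                // Assumption: a wrong or missing password is also reported as an error
                callback(new Error('Wrong or missing password'));
            } else {
                callback(null);
            }
        }
    };
}

var fakePdf = createFakePdf({ 'Invoice.pdf': { password: 'secret' } });
var results = [];

// Builds a callback that records whether the open attempt succeeded
function record(label) {
    return function (error) {
        results.push(label + ': ' + (error ? 'error' : 'opened'));
    };
}

fakePdf.openPdf('Missing.pdf', record('missing'));
fakePdf.openPdf('Invoice.pdf', record('no password'));
fakePdf.openPdf('Invoice.pdf', record('with password'), 'secret');
```

Only the last call reaches the "opened" branch, which is the point at which a real workflow would go on to create a filter and extract data.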

Step 2: It is possible to create a filter and thus search in a specified area in the PDF document. The syntax is as follows,

ctx.pdf.createFilter(Pages, {Object});

The first parameter (mandatory) specifies which pages to search, given as page numbers and page ranges. The second parameter (optional) is an object containing four properties of type number that define the search area:

top - the vertical offset from the top edge of the page

left - the horizontal offset from the left edge of the page

width - the width of the area or bounding box

height - the height of the area or bounding box

Example:

ctx.pdf.createFilter("1,2,10-14", { top: 100, left: 100, width: 200, height: 10 });
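To make the page syntax concrete, here is a minimal sketch of how a specification such as "1,2,10-14" can be read. The helper expandPageSpec is a hypothetical name used for illustration, and it assumes that a range like 10-14 includes both end pages. It is not the SDK's own parser.

```javascript
// Illustration only: expands a page specification string into the list of
// page numbers it denotes. Assumes ranges such as "10-14" are inclusive.
function expandPageSpec(spec) {
    var pages = [];
    spec.split(',').forEach(function (part) {
        var bounds = part.trim().split('-');
        var first = parseInt(bounds[0], 10);
        // A single page such as "2" is treated as the range 2-2
        var last = bounds.length > 1 ? parseInt(bounds[1], 10) : first;
        for (var page = first; page <= last; page++) {
            pages.push(page);
        }
    });
    return pages;
}

var pages = expandPageSpec('1,2,10-14');
// pages is [1, 2, 10, 11, 12, 13, 14]
```

Read this way, the filter in the example above covers seven pages, with the search restricted to the same bounding box on each of them.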

Step 3: You can now read the text, or extract data, from this filtered area only, using one of the following calls.

a) ctx.pdf.getText(Filter)

b) ctx.pdf.extract(regex, Filter)

Example:

var regex = /Order Number: ([A-Z0-9]+)/;

var extractedTextRegEx = ctx.pdf.extract(regex, Filter);
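Before running the pattern against a real document, it can help to check it against plain text. The sketch below uses standard JavaScript string matching only. The sample text is invented for illustration, as is the second pattern for the "Amount to be paid" field. Whether ctx.pdf.extract returns the full match or only the captured group is not stated here, so both are shown.

```javascript
// Invented sample text standing in for the text content of an invoice page
var sampleText = 'Invoice\nOrder Number: AB12345\nAmount to be paid: 250.00 EUR';

var orderRegex = /Order Number: ([A-Z0-9]+)/;
// Hypothetical pattern for the amount: digits with optional separators
var amountRegex = /Amount to be paid: ([0-9.,]+)/;

var orderMatch = sampleText.match(orderRegex);
var amountMatch = sampleText.match(amountRegex);

// Index 0 holds the whole matched text, index 1 the first captured group
var fullMatch = orderMatch ? orderMatch[0] : null;     // 'Order Number: AB12345'
var orderNumber = orderMatch ? orderMatch[1] : null;   // 'AB12345'
var amount = amountMatch ? amountMatch[1] : null;      // '250.00'
```

The null checks matter: match returns null when the pattern is not found, for example when the label is spelled differently in the document.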

Step 4: The final step is mandatory. Release the resources using the following syntax.

ctx.pdf.release();

Hope you liked it! 🙂

How to extract data from text-searchable PDF Documents in SAP Intelligent Robotic Process Automation?
