Groovy to extract text from PDF in CPI

samarjitsingha · 11-30-2022

Introduction

This blog post shows how to meet a custom requirement: extracting text from a PDF in SAP CPI with the help of a Groovy script.

Note: This script does not work well on PDFs with rich formatting (images, bullet points, workflows).

Current Scenario: At the time of writing, I could not find any blog on extracting text from a PDF in SAP CPI.

Why are we doing this?

It gives us the flexibility to work with PDF files. Most of the time, the content coming into SAP Cloud Platform Integration is XML, JSON, CSV, or EDI. When that same content arrives inside a PDF, this script extracts it as plain text, and the rest of the transformation can be done as per the scenario.

PROCEDURE:

STEP 1: Download the pdfbox and fontbox JAR files and upload them to your iFlow.

Download the pdfbox JAR file from the following link

Download the fontbox JAR file from the following link.

Upload both JAR files in the Resources tab of your iFlow.
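The script in this post imports PDFTextStripper from org.apache.pdfbox.util, which is where the class lives in PDFBox 1.8.x; in PDFBox 2.x it moved to org.apache.pdfbox.text. If you are unsure which version you uploaded, a small check like the one below can tell you which class name resolves. This is only a sketch: firstAvailableClass is a hypothetical helper, and it assumes the uploaded JARs are visible to the thread context class loader, which may not hold in every runtime.

```groovy
// Hypothetical helper: returns the first class name in the list that can be
// loaded by the given class loader, or null if none of them can be loaded.
String firstAvailableClass(List<String> candidates,
                           ClassLoader loader = Thread.currentThread().contextClassLoader) {
    return candidates.find { String name ->
        try {
            loader.loadClass(name)
            return true
        } catch (ClassNotFoundException | NoClassDefFoundError ignored) {
            return false
        }
    }
}

// PDFTextStripper lives in a different package depending on the PDFBox version
def stripperClass = firstAvailableClass([
    'org.apache.pdfbox.util.PDFTextStripper',   // PDFBox 1.8.x
    'org.apache.pdfbox.text.PDFTextStripper'    // PDFBox 2.x
])
println(stripperClass ?: 'PDFBox is not on the classpath')
```

Whichever name is printed is the one to use in the import statement of the script.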

STEP 2: Take a sample payload for PDF conversion.

This is the sample CSV payload used for conversion.

Material_Name,Material_ID,Material_Number
Iron,KAU145,240
Copper,KAU146,800
Zinc,KAU222,180
Cobalt,KAU338,546

PDF screenshot of the above CSV file

STEP 3: Use Groovy Script in your iFlow to extract text from PDF.

iFlow Explanation:

We use an HTTP adapter to trigger the integration flow with the PDF file.

Then we use a "Groovy Script" step to extract the content of the PDF.

After that, we use the "CSV to XML Converter" to convert the extracted CSV text to XML.
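Text extracted from a PDF can carry trailing spaces, blank lines, or Windows line endings, which may confuse a CSV converter. The sketch below shows one way to tidy the text between the Groovy Script step and the converter. normalizeExtractedText is a hypothetical helper, and the sample string is made up to mimic such output; whether your PDFs need this cleanup is an assumption you should verify on real data.

```groovy
// Hypothetical helper: trims every line of the extracted text and drops
// empty lines, so that only the CSV rows are passed on to the converter.
String normalizeExtractedText(String text) {
    if (text == null) {
        return ''
    }
    List<String> trimmed = text.readLines().collect { it.trim() }
    List<String> nonEmpty = trimmed.findAll { it }
    return nonEmpty.join('\n')
}

// Made-up sample that mimics extracted text with CRLF endings and blank lines
def raw = 'Material_Name,Material_ID,Material_Number\r\n\r\nIron,KAU145,240 \r\n\r\n'
println normalizeExtractedText(raw)
```

In the script, this could be applied to the extracted text just before message.setBody is called.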

Postman Configuration:

Configure Postman as shown in the image below.

Groovy Script:

import com.sap.gateway.ip.core.customdev.util.Message
import org.apache.pdfbox.pdmodel.PDDocument
// PDFBox 1.8.x package; in PDFBox 2.x the class is org.apache.pdfbox.text.PDFTextStripper
import org.apache.pdfbox.util.PDFTextStripper

// Extracts the plain text of a PDF supplied as an InputStream.
// Errors are not swallowed, so a failed extraction fails the iFlow step
// instead of silently setting a null body.
static String readFromPDF(InputStream input) {
    PDDocument pd = null
    try {
        pd = PDDocument.load(input)
        PDFTextStripper stripper = new PDFTextStripper()
        stripper.setStartPage(1)    // Start page (1-based)
        // stripper.setEndPage(1)   // End page; uncomment to stop after this page
        return stripper.getText(pd)
    } finally {
        // Always release the document, even if extraction fails
        if (pd != null) {
            pd.close()
        }
    }
}

def Message processData(Message message) {
    // Fetch the incoming PDF body as an InputStream
    InputStream pdfStream = message.getBody(InputStream)
    String res = readFromPDF(pdfStream)
    // Replace the body with the extracted text
    message.setBody(res)
    return message
}

Groovy Script Explanation:

First, we fetch the message body as an InputStream.

Then we call the readFromPDF function and pass it that stream.

If the PDF has multiple pages, or you only want the content of certain pages, set the range with the stripper.setStartPage(1) and stripper.setEndPage(1) methods. Page numbers start at 1.

Finally, we use the stripper.getText(pd) method to read the content.
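If the page range should be configurable instead of hardcoded, it could be read from a map of properties, for example the one returned by message.getProperties(). The sketch below is one possible approach, not part of the original script: applyPageRange is a hypothetical helper, and pdfStartPage and pdfEndPage are assumed property names that you would have to set yourself in the iFlow.

```groovy
// Hypothetical helper: applies an optional page range to any object that
// exposes setStartPage(int) and setEndPage(int), such as PDFTextStripper.
def applyPageRange(stripper, Map props) {
    // 'pdfStartPage' and 'pdfEndPage' are assumed names, not SAP-defined ones
    Integer start = (props.pdfStartPage ?: 1) as Integer
    stripper.setStartPage(start)
    if (props.pdfEndPage != null) {
        Integer end = props.pdfEndPage as Integer
        if (end < start) {
            throw new IllegalArgumentException(
                "End page ${end} is before start page ${start}")
        }
        stripper.setEndPage(end)
    }
    return stripper
}

// Possible use inside readFromPDF, assuming the properties map is passed in:
//     PDFTextStripper stripper = applyPageRange(new PDFTextStripper(), props)
//     String text = stripper.getText(pd)
```

When no end page is supplied, the stripper keeps its default and reads to the end of the document.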

Output:

<?xml version='1.0' encoding='UTF-8'?>
<Record>
	<root>
		<Material_Name>Iron</Material_Name>
		<Material_ID>KAU145</Material_ID>
		<Material_Number>240</Material_Number>
	</root>
	<root>
		<Material_Name>Copper</Material_Name>
		<Material_ID>KAU146</Material_ID>
		<Material_Number>800</Material_Number>
	</root>
	<root>
		<Material_Name>Zinc</Material_Name>
		<Material_ID>KAU222</Material_ID>
		<Material_Number>180</Material_Number>
	</root>
	<root>
		<Material_Name>Cobalt</Material_Name>
		<Material_ID>KAU338</Material_ID>
		<Material_Number>546</Material_Number>
	</root>
</Record>

More Sample PDFs and their Output:

850 EDI PDF to 850 EDI file

In the above image, you can see that I have used an 850 EDI document in PDF format. All of its text can be easily extracted with this script and then used in CPI as per your requirement.

You can use the same EDI payload from this link.
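Once the EDI text is extracted, a later step could break it into segments and elements. The sketch below assumes "~" as the segment terminator and "*" as the element separator, which are common for X12 but are actually declared per file in the ISA segment. splitX12 is a hypothetical helper, and the sample string is made up for illustration; it is not taken from the linked payload.

```groovy
import java.util.regex.Pattern

// Hypothetical helper: splits X12 text into segments, each a list of elements.
// The default separators are assumptions; real files declare them in ISA.
List<List<String>> splitX12(String text,
                            String segmentTerminator = '~',
                            String elementSeparator = '*') {
    List<String> segments = text.split(Pattern.quote(segmentTerminator))
                                .collect { it.trim() }
                                .findAll { it }
    // The -1 limit keeps empty elements, which are meaningful in X12
    return segments.collect { String seg ->
        seg.split(Pattern.quote(elementSeparator), -1) as List<String>
    }
}

// Made-up sample with two segments
def sample = 'ST*850*0001~\nBEG*00*SA*PO-0001**20221130~'
splitX12(sample).each { println it }
```

Each inner list starts with the segment ID (for example ST or BEG), followed by its elements in order.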

Sample PO to the relevant text

In the above image, the script is able to extract the text of the PO. However, we cannot reliably convert that text to an EDI file, because the layout of each PO may vary and writing one common script for all layouts is difficult.

So this script is not recommended for heavily formatted PDFs like the one shown in the above image.

You can visit this site to download the sample PO.

Conclusion:

To conclude, this blog post shows how to extract text from a PDF in CPI using a Groovy script.

Check out the link for more helpful information about Cloud Platform Integration (CPI).

If you have any queries, please feel free to ask in the comments. I would also appreciate your feedback, and a like if you found this blog post helpful.

Thanks & Regards,

Samarjit Singha