Samarjit Singha

Groovy to extract text from PDF in CPI

Introduction

This blog shows how to solve a custom requirement: extracting text from a PDF in SAP CPI with the help of a Groovy script.

Note: This Groovy script will not work on PDFs with formatted content (images, bullet points, workflows).

 

Current Scenario: No blogs are available to extract text from pdf in SAP CPI.

 

Why are we doing this?

It gives us the flexibility to work with PDF files. Most of the content that reaches SAP Cloud Platform Integration is XML, JSON, CSV, or EDI. When such content arrives inside a PDF, this Groovy script can extract it as text, and the rest of the transformation can be done as the scenario requires.

PROCEDURE:

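STEP 1: Download the pdfbox and fontbox JAR files and upload them to your iFlow.

  • Download the pdfbox JAR file from the following link.
  • Download the fontbox JAR file from the following link.
  • Upload both JAR files in the Resources tab of your iFlow. The script in this blog imports org.apache.pdfbox.util.PDFTextStripper, which is the PDFBox 1.8.x package name (PDFBox 2.x moved the class to org.apache.pdfbox.text), so use matching versions of both JARs.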
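
A missing or mismatched JAR usually shows up only at runtime as a class-loading error. The following is a minimal sketch, not part of the original iFlow, of a check that reports which required classes cannot be loaded. The helper name findMissingClasses is hypothetical, and the class list assumes the PDFBox 1.8.x package layout used by the script in this blog.

```groovy
// Hypothetical helper: returns the names from classNames that cannot be loaded.
List<String> findMissingClasses(List<String> classNames) {
    ClassLoader loader = this.class.classLoader
    return classNames.findAll { String name ->
        try {
            // initialize = false: only check that the class is visible
            Class.forName(name, false, loader)
            return false
        } catch (ClassNotFoundException | LinkageError ignored) {
            return true
        }
    }
}

// Assumed class list: the two imports of the script plus one fontbox class
def required = [
    'org.apache.pdfbox.pdmodel.PDDocument',
    'org.apache.pdfbox.util.PDFTextStripper',
    'org.apache.fontbox.afm.AFMParser'
]
def missing = findMissingClasses(required)
println(missing ? "Missing classes: ${missing}" : 'All PDF classes are available')
```

Inside an iFlow you might prefer to write the result to a header or to the message body instead of using println.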

 

 

STEP 2: Take a sample payload for PDF conversion.

This is the sample CSV payload used for conversion.

Material_Name,Material_ID,Material_Number
Iron,KAU145,240
Copper,KAU146,800
Zinc,KAU222,180
Cobalt,KAU338,546

 


Pdf screenshot of the above CSV file

 

STEP 3: Use Groovy Script in your iFlow to extract text from PDF.

 

iFlow Explanation:

  • We use an HTTP adapter to trigger the integration flow with the PDF file.
  • Then we use a “Groovy Script” step to extract the content of the PDF.
  • After that, we use the “CSV to XML Converter” to convert the extracted CSV content to XML.

 

Postman Configuration:

Do the Postman Configuration by referring to the image below.
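
The script expects the raw PDF as the request body. A minimal sketch, assuming you want to confirm this before parsing, of a check for the "%PDF-" header that every PDF file starts with. The helper name looksLikePdf is hypothetical and not part of the original script.

```groovy
// Hypothetical helper: true when the bytes start with the PDF header "%PDF-".
boolean looksLikePdf(byte[] data) {
    byte[] magic = '%PDF-'.getBytes('US-ASCII')
    if (data == null || data.length < magic.length) {
        return false
    }
    for (int i = 0; i < magic.length; i++) {
        if (data[i] != magic[i]) {
            return false
        }
    }
    return true
}

// Example with in-memory data: a PDF-like header and a plain CSV line
println looksLikePdf('%PDF-1.4 sample'.getBytes('US-ASCII'))
println looksLikePdf('Material_Name,Material_ID'.getBytes('US-ASCII'))
```

If the check fails for a request, the file was probably not sent as a binary body.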

Groovy Script:

import com.sap.gateway.ip.core.customdev.util.Message
import org.apache.pdfbox.pdmodel.PDDocument
// PDFBox 1.8.x package; in PDFBox 2.x the class is org.apache.pdfbox.text.PDFTextStripper
import org.apache.pdfbox.util.PDFTextStripper

// Extracts the plain text of a PDF supplied as an InputStream.
static String readFromPDF(InputStream input) {
    PDDocument pd = null
    try {
        pd = PDDocument.load(input)
        PDFTextStripper stripper = new PDFTextStripper()
        stripper.setStartPage(1)     // first page to extract (1-based)
        // stripper.setEndPage(1)    // uncomment to stop after a given page
        return stripper.getText(pd)
    } finally {
        // Always release the document, even when extraction fails.
        // Errors are no longer swallowed, so a broken PDF fails the message
        // instead of silently producing an empty body.
        if (pd != null) {
            pd.close()
        }
    }
}

def Message processData(Message message) {
    // Read the body as a stream so the PDF bytes are not decoded as text
    InputStream pdfStream = message.getBody(InputStream)
    String res = readFromPDF(pdfStream)

    message.setBody(res)
    return message
}

 

Groovy Script Explanation:

  • First, we fetch the message body as an InputStream.
  • Then we call the function readFromPDF and pass it that stream.
  • If the PDF has multiple pages, or you only want the content of certain pages, set the range with the stripper.setStartPage(1) and stripper.setEndPage(1) methods; page numbers start at 1.
  • Finally, we use the stripper.getText(pd) method to read the content as a String.
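
Text extracted from a PDF can carry trailing spaces or blank lines that a CSV parser may not expect. A minimal sketch, assuming that is the case for your file, of a cleanup step that could run on the extracted text before it is set as the body. The helper name normalizeExtractedText is hypothetical and not part of the original script.

```groovy
// Hypothetical helper: trims every line and drops the empty ones.
String normalizeExtractedText(String text) {
    if (text == null) {
        return ''
    }
    List<String> trimmed = text.readLines().collect { it.trim() }
    List<String> nonEmpty = trimmed.findAll { it }   // an empty string is falsy in Groovy
    return nonEmpty.join('\n')
}

// Example with made-up extraction output containing stray whitespace
def raw = 'Material_Name,Material_ID,Material_Number \r\n\r\nIron,KAU145,240\r\n  \r\n'
println normalizeExtractedText(raw)
```

In processData, this would wrap the existing call, for example message.setBody(normalizeExtractedText(res)).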

Output:

<?xml version='1.0' encoding='UTF-8'?>
<Record>
	<root>
		<Material_Name>Iron</Material_Name>
		<Material_ID>KAU145</Material_ID>
		<Material_Number>240</Material_Number>
	</root>
	<root>
		<Material_Name>Copper</Material_Name>
		<Material_ID>KAU146</Material_ID>
		<Material_Number>800</Material_Number>
	</root>
	<root>
		<Material_Name>Zinc</Material_Name>
		<Material_ID>KAU222</Material_ID>
		<Material_Number>180</Material_Number>
	</root>
	<root>
		<Material_Name>Cobalt</Material_Name>
		<Material_ID>KAU338</Material_ID>
		<Material_Number>546</Material_Number>
	</root>
</Record>

 

More Sample PDFs and their Output:

 


850 EDI pdf to 850 EDI file

In the above image, I have used an 850 EDI message in PDF format. All of its text is extracted by this Groovy script and can then be used in CPI as your requirement demands.

You can use the same EDI payload from this link.

 


Sample PO to the relevant text

In the above image, we are able to extract the text of the PO, but we cannot convert that text to an EDI file: the layout of each PO may vary, so writing one common script would be difficult.

So this Groovy script is not recommended for formatted PDFs like the one shown in the above image.

You can visit this site to download the sample PO.

Conclusion:

To conclude, this blog shows how to extract the text content of a PDF in CPI using a Groovy script.

Check out the link for more helpful information about Cloud Platform Integration (CPI).

If you have any queries, please feel free to ask in the comments. I would request everyone to share feedback and like this blog post if you find it helpful.

 

Thanks & Regards,

Samarjit Singha


      20 Comments
      Souragopal Sethy

      Thanks for amazing write-up

      Sk Zaman

      Thanks, brother, I needed this in my project.

      Imran Shafiq

      Aoa. Dear, I need to talk to you regarding this article; I have tried it but without success.

      Francis Mullet

      Thanks a lot, Samarjit, for this blog.

       

      If I have a customer's PO in PDF format, is it possible for me to process that PDF PO and create ORDERS IDoc or EDI (850)?

      Samarjit Singha
      Blog Post Author

      Hi Francis

      I have tested it with 850 PO and it's working fine.

      Thanks,

      Samarjit

      Francis Mullet

      Hi Samarjit,

      What I meant is that the customer's PO is in PDF, not necessarily an 850 PO in PDF.

      Thanks,

      Francis

      Francis Mullet

      Hi Samarjit,

      Like this sample image below of a PDF PO.

      Thanks,

      Francis

      Samarjit Singha
      Blog Post Author

      Hi Francis,

      This scenario will not work for your problem statement, because here text can be extracted only from a PDF containing plain text characters, excluding bullet points, images, tables, etc.

      I have one query regarding your problem statement: how are you planning to convert the extracted text to an EDI format? I think there is no capability in CPI to convert an actual Purchase Order (as in the above sample image) to EDI or IDoc.

      If you have an actual EDI or IDOC in PDF format then this scenario will work.

      Thanks,

      Samarjit

      Francis Mullet

      Hi Samarjit,

      Up to PI 7.11, there was the SAP Conversion Agent.

      Thanks,

      Francis

      Lucy Meng

      Hello Samarjit,

      Could you please share the PDF files of the tested 850 PO and the Blog.pdf shown in your screenshot above?

       

      Thanks!

      Lucy

      Samarjit Singha
      Blog Post Author

      Hi Lucy,

      I have updated the blog. Please go through it. If you have some more queries feel free to ask.

      Regards,

      Samarjit

      Lucy Meng

      Hello Samarjit,

       

      Thanks for your update! Could you please enclose the PDF file as an attachment, so I can download it and use it in Postman to simulate it in my iFlow?

       

      I tried to simulate the PDF reading as in your blog, but have not succeeded yet. Please see the iFlow errors below.

      Error Details
      com.sap.it.rt.adapter.http.api.exception.HttpResponseException: An internal server error occured: org/apache/pdfbox/pdmodel/font/PDType0Font : cannot initialize class because prior initialization attempt failed. The MPL ID for the failed message is : AGOP8owhnoOeAa-6zuTQr4H7G111

       

      Error Details
      com.sap.it.rt.adapter.http.api.exception.HttpResponseException: An internal server error occured: org/apache/fontbox/afm/AFMParser. The MPL ID for the failed message is : AGOP8cloc76ZOQ2uAfR4qP2C9121

      Best Regards,

      Lucy

      Samarjit Singha
      Blog Post Author

      Hi Lucy,

      Thanks for pointing out the errors. There was an issue with the JAR files of pdfbox and fontbox, so I have updated the blog with new links. Please download the new JAR files and upload them in the Resources tab of your iFlow.

      You can download the pdf files from this link.

      Thanks,

      Samarjit

      Lucy Meng

      Thanks a lot Samarjit! It's working fine now.

      Raunak Barik

      Hi Samarjit,

      Thank you for this blog. I'm working on a custom scenario and this blog helps a lot.

      Thanks,

      Raunak

      Eurico Borges

      Samarjit Singha, for better understanding, please add to the blog a screenshot of the PDF you have used so that everyone can see how it looks.

      Thanks

      Lucy Meng

      I agree with you. I want to know what the PDF file looks like so I can simulate the scenario and understand how it works.

      Samarjit Singha
      Blog Post Author

      Hi Eurico,

      I have updated the blog with the same.

      Regards,

      Samarjit

      Philippe Addor

      Hi Samarjit

      Thanks for the blog post. Just as a general note: if I got such a requirement, I would first of all challenge the requester or owner of the sender system and ask why in the world they would send a CSV file in PDF format and not simply as a CSV text string... If there's really a hard requirement for this, your script and the library would certainly come in handy. However, I think the more interesting use case would be processing a formatted document, which is much more difficult, as you point out.

      There is a creative solution using SAP RPA here (without Cloud Integration though): https://blogs.sap.com/2021/09/07/translating-pdf-documents-with-sap-intelligent-robotic-process-automation-and-the-document-translation-service/

      Philippe

      Imran Shafiq

      Hi Samarjit,

       

      I am facing the following error. Could you guide me in resolving it?

      com.sap.it.rt.adapter.http.api.exception.HttpResponseException: An internal server error occured: XSD schema is incompatible with CSV payload. The XSD schema provided contains 3 records; CSV payload contains 1 records.. The MPL ID for the failed message is : AGOcCDK94khTaGo3bF9cOXSag9-C

      My .XSD file

      <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:element name="Material">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Items">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element type="xs:string" name="Material_Name"/>
                    <xs:element type="xs:string" name="Material_ID"/>
                    <xs:element type="xs:integer" name="Material_Number"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:schema>